# Prerequisites

To run this Colab correctly, set up the following first:

- Make sure you are using a GPU environment, as a CPU won't be able to handle the processing required by the LLM. Due to the size of the model, a GPU with at least 16 GB of memory is required.

- In the Secrets section of the Colab, you must add a Hugging Face token from an account authorized to use Meta's Llama 3 model:

    1. Accept Meta's agreement on the [Llama3](https://huggingface.co/meta-llama/Meta-Llama-3-8B) model page.
    2. Once you have signed the agreement and Meta has authorized you to use its models, obtain a Hugging Face API token from the [API Tokens page](https://huggingface.co/settings/tokens).
    3. Set this API token as a secret on this Colab named `HF_TOKEN`, then authorize its use in the Colab by ticking the blue check mark.

- This Colab uses Google Drive for storage. Set the `DRIVE_FOLDER` variable, defined in the first code cell, to point to the folder where the model will be saved.
    - This folder must have the following structure:
        - `Llama`: Stores the LLM so that it can be retrieved later.
        - `Inputs`: Contains all the JSON files to be processed. Each file is a list of posts, and each post contains a list of comments.
            - `Inputs/IG`: Folder with the scraped Instagram JSON files.
            - `Inputs/FB`: Folder with the scraped Facebook JSON files.
            - `Inputs/Web`: Folder with the scraped web JSON files.
        - `Chunks`: Folder where the chunks of already processed data are written.
        - `sentiment_output.json`: The final output, with the sentiment attached.
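
The folder structure above can be created up front so that later cells do not fail on a missing folder. The sketch below is a minimal illustration and not part of the original notebook: `ensure_drive_layout` and `REQUIRED_DIRS` are hypothetical names, and the root path is whatever `DRIVE_FOLDER` points to.

```python
from pathlib import Path

# Sub-folders named in the structure above. `sentiment_output.json` is a
# file written later by the notebook, so it is not created here.
REQUIRED_DIRS = ("Llama", "Inputs/IG", "Inputs/FB", "Inputs/Web", "Chunks")


def ensure_drive_layout(root):
    """Create any missing sub-folder under `root` and return the new paths."""
    created = []
    for relative in REQUIRED_DIRS:
        path = Path(root) / relative
        if not path.is_dir():
            path.mkdir(parents=True, exist_ok=True)
            created.append(str(path))
    return created
```

Calling `ensure_drive_layout(DRIVE_FOLDER)` after mounting Drive returns the folders it had to create. An empty list means the layout was already complete.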

# Sentiment Analysis Model

For the model, we use Hugging Face as an interface that gives easy access to a variety of models and lets us fine-tune and test them efficiently.

## Libraries installation and Module Imports

In [None]:
from google.colab import userdata, drive
drive.mount('/content/drive')
DRIVE_FOLDER="drive/MyDrive/Vero Volley project/SentimentModel"

In [None]:
!pip install transformers
!pip install datasets
!pip install trl
!pip install peft
!pip install -q accelerate
!pip install -q -i https://pypi.org/simple/ bitsandbytes


## Models

For the models, we mainly focus on Llama 3 and Gemma (the LLMs from Meta and Google, respectively). The chosen model is trained on a dataset of pre-classified tweets to decide whether a message is "positive", "neutral" or "negative".

Other pre-tuned models are already available, but they are based on BERT (an older Google model).

In [None]:
# Models
model_gemma = "google/gemma-2b"
model_gemma_7b = "google/gemma-7b"
model_llama = "meta-llama/Meta-Llama-3-8B"

Mounted at /content/drive


In [None]:
import numpy as np
import pandas as pd
import os
from tqdm import tqdm
import json

import torch
import torch.nn as nn

import transformers
from transformers import (AutoModelForCausalLM,
                          AutoTokenizer,
                          BitsAndBytesConfig,
                          TrainingArguments,
                          pipeline,
                          logging)
from datasets import Dataset, load_dataset
from peft import LoraConfig, PeftConfig
import bitsandbytes as bnb
from trl import SFTTrainer

from sklearn.metrics import (accuracy_score,
                             classification_report,
                             confusion_matrix)
from sklearn.model_selection import train_test_split


## Model Initialization

The model is initialized with specific parameters to optimize it for sentiment analysis.

Specifically, we load:
- The model, with the 4-bit quantization settings (`BitsAndBytesConfig`).
- The tokenizer, with the maximum sequence length.
- The end-of-sequence token, which defines where text generation stops.

In [None]:
model_name = model_llama
tokenizer_model_name = model_name

# Use Pretrained model
model_name = f"./{DRIVE_FOLDER}/Llama"

max_seq_length = 2048
tokenizer = AutoTokenizer.from_pretrained(tokenizer_model_name, max_seq_length=max_seq_length)
EOS_TOKEN = tokenizer.eos_token

compute_dtype = getattr(torch, "float16")

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=False,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=bnb_config,
)

model.config.use_cache = False
model.config.pretraining_tp = 1

tokenizer.pad_token = tokenizer.eos_token
tokenizer.pad_token_id = tokenizer.eos_token_id
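
As a rough sanity check on why the cell above loads the weights with `load_in_4bit=True`, the arithmetic can be sketched as follows. This is an estimate under stated assumptions, not a measurement: `weight_memory_gb` is a hypothetical helper, the parameter count is a round 8 billion for an "8B" model, and only the weights are counted (activations, the KV cache, and quantization overhead are ignored).

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 8e9  # assumed round figure for an "8B" model

fp16_gb = weight_memory_gb(N_PARAMS, 16)  # half-precision weights
nf4_gb = weight_memory_gb(N_PARAMS, 4)    # 4-bit quantized weights
print(f"fp16: {fp16_gb:.1f} GB, 4-bit: {nf4_gb:.1f} GB")
```

Under these assumptions the half-precision weights alone would take about 16 GB, while the 4-bit weights take about 4 GB, which leaves headroom on a 16 GB GPU.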




## Training & Testing

We start by defining the prompts for training and testing. The prompt describes the task to the LLM so that it classifies each text into one of the three sentiments.

In [None]:
def generate_prompt(data_point):
    return (
        f"""
            Analyze the sentiment of the news headline enclosed in square brackets,
            determine if it is positive, neutral, or negative, and return the answer as
            the corresponding sentiment label "positive" or "neutral" or "negative"

            [{data_point["text"]}] = {data_point["label_text"]}
            """.strip()
        + EOS_TOKEN
    )

def generate_test_prompt(data_point):
    text = data_point
    if not isinstance(data_point, str):
        text = data_point["text"]

    return f"""Analyze the sentiment of the news headline enclosed in square brackets,
            determine if it is positive, neutral, or negative, and return the answer as
            the corresponding sentiment label "positive" or "neutral" or "negative"

            [{text}] =""".strip()
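
The relationship between the two prompts can be illustrated with a simplified restatement. The sketch below is not the notebook's code: `build_prompt` and `INSTRUCTION` are hypothetical names, the whitespace differs from the functions above, and `EOS` is a placeholder for `tokenizer.eos_token`. It shows that a training prompt is the test prompt followed by the label and the end-of-sequence token, so at inference time the model only has to complete the label.

```python
INSTRUCTION = (
    "Analyze the sentiment of the news headline enclosed in square brackets, "
    "determine if it is positive, neutral, or negative, and return the answer as "
    'the corresponding sentiment label "positive" or "neutral" or "negative"'
)
EOS = "<eos>"  # placeholder; the notebook uses tokenizer.eos_token


def build_prompt(text, label=None):
    """Return a test prompt, or a training prompt when `label` is given."""
    prompt = f"{INSTRUCTION}\n\n[{text}] ="
    if label is not None:
        # Training examples carry the answer and the end-of-sequence token
        prompt += f" {label}{EOS}"
    return prompt


test_prompt = build_prompt("What a match tonight!")
train_prompt = build_prompt("What a match tonight!", "positive")
print(train_prompt[len(test_prompt):])  # the part the model learns to produce
```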

### Evaluation

We create the evaluation function to compare the true sentiments with the sentiments predicted by the model. It gives an overview of the model's performance.

In [None]:
def evaluate(y_true, y_pred):
    print()
    labels = ['positive', 'neutral', 'negative']
    mapping = {'positive': 2, 'neutral': 1, 'none':1, 'negative': 0}
    def map_func(x):
        return mapping.get(x, 1)

    y_true = np.vectorize(map_func)(y_true)
    y_pred = np.vectorize(map_func)(y_pred)

    # Calculate accuracy
    accuracy = accuracy_score(y_true=y_true, y_pred=y_pred)
    print(f'Accuracy: {accuracy:.3f}')

    # Generate accuracy report
    unique_labels = set(y_true)  # Get unique labels

    for label in unique_labels:
        label_indices = [i for i in range(len(y_true))
                         if y_true[i] == label]
        label_y_true = [y_true[i] for i in label_indices]
        label_y_pred = [y_pred[i] for i in label_indices]
        accuracy = accuracy_score(label_y_true, label_y_pred)
        print(f'Accuracy for label {label}: {accuracy:.3f}')

    # Generate classification report
    class_report = classification_report(y_true=y_true, y_pred=y_pred)
    print('\nClassification Report:')
    print(class_report)

    # Generate confusion matrix
    conf_matrix = confusion_matrix(y_true=y_true, y_pred=y_pred, labels=[0, 1, 2])
    print('\nConfusion Matrix:')
    print(conf_matrix)
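
One detail of `evaluate` is worth spelling out: `mapping.get(x, 1)` sends every unknown label to the neutral class, so an output such as "non-related" is scored as neutral. The sketch below reproduces the mapping and the per-label accuracy loop in plain Python so the behaviour can be checked without scikit-learn. `to_class_id` and `per_label_accuracy` are hypothetical helper names, and the sample labels are made up for illustration.

```python
MAPPING = {"positive": 2, "neutral": 1, "none": 1, "negative": 0}


def to_class_id(label):
    """Map a label to its class id; unknown labels fall back to neutral (1)."""
    return MAPPING.get(label, 1)


def per_label_accuracy(y_true, y_pred):
    """Accuracy restricted to the samples of each true class."""
    true_ids = [to_class_id(y) for y in y_true]
    pred_ids = [to_class_id(y) for y in y_pred]
    result = {}
    for class_id in sorted(set(true_ids)):
        pairs = [(t, p) for t, p in zip(true_ids, pred_ids) if t == class_id]
        result[class_id] = sum(t == p for t, p in pairs) / len(pairs)
    return result


y_true = ["positive", "neutral", "negative", "neutral"]
y_pred = ["positive", "non-related", "neutral", "negative"]
scores = per_label_accuracy(y_true, y_pred)
print(scores)
```

Because the accuracy is computed only over the samples whose true label is a given class, each value is that class's recall.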

### Parallelization Helper Functions

We define some helper functions to parallelize the prediction of files.

#### Parallelization data conversion

To run a large list of files in parallel across multiple Colab sessions, we created a new data structure that stores "elements" instead of posts with comments.

An element is simply either a post or a comment whose sentiment is to be predicted. It has the following structure:
```python
{
  "hash": "g23hf...",  # Hash representation of the post or comment, used to reconnect it later
  "text": "I dunno, wh...",  # The text used for the prediction
  "sentiment": "negative",  # The sentiment if it was already predicted, otherwise None
}
```
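
For example, a post with two comments becomes three elements. The sketch below is a simplified illustration: `flatten_post` is a hypothetical helper, and the keys it builds are stand-ins rather than the keys the notebook derives from the post and comment fields.

```python
def flatten_post(post, post_key):
    """Turn one post and its comments into a flat list of elements."""
    elements = [{
        "hash": post_key,
        # Same fallback as the notebook: use the title, else the content
        "text": post.get("title") or post.get("content", ""),
        "sentiment": post.get("sentiment"),
    }]
    for position, comment in enumerate(post.get("comments", [])):
        elements.append({
            # Illustrative key only, not the notebook's comment key
            "hash": f"{post_key}#{position}",
            "text": comment["text"],
            "sentiment": comment.get("sentiment"),
        })
    return elements


post = {
    "title": "",
    "content": "Match day!",
    "comments": [{"text": "Great game"}, {"text": "Bad referee"}],
}
elements = flatten_post(post, post_key="post-1")
print([element["text"] for element in elements])
```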

In [None]:
def post_to_hash(post):
  return f"{post.get('taken_at_date')}{post.get('date')}{post.get('title')}{post.get('content')}"

def comment_to_hash(post_hash, comment):
  return f"{post_hash}{comment.get('text')}{comment.get('author')}{comment.get('user')}{comment.get('username')}{comment.get('created_at_utc')}{comment.get('date')}"
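
Note that `post_to_hash` and `comment_to_hash` return the concatenated fields themselves rather than a digest, so the key of a long post is as long as its content. If shorter, fixed-length keys are wanted, one option (not what the notebook does) is to pass that string through `hashlib.sha256`. `stable_digest` is a hypothetical helper, and switching to it would make previously written chunk files unmatchable, since their keys would no longer agree.

```python
import hashlib


def stable_digest(key):
    """Return a fixed-length hex digest for a concatenated key string."""
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


# A made-up key in the same spirit as the concatenation above
long_key = "2024-05-01" + "None" + "Some title" + "A very long post body " * 50
digest = stable_digest(long_key)
print(len(long_key), len(digest))
```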

In [None]:
def posts_to_elements(posts):
  elements = []
  seen_set = set()

  for post in posts:
    post_hash = post_to_hash(post)
    if post_hash not in seen_set:
      elements.append({
          "hash": post_hash,
          "text": post["title"] if post["title"] else post["content"],
          "sentiment": post["sentiment"] if "sentiment" in post else None
      })
      seen_set.add(post_hash)
      for comment in post["comments"]:
        comment_hash = comment_to_hash(post_hash, comment)
        if comment_hash not in seen_set:
          elements.append({
            "hash": comment_hash,
            "text": comment["text"],
            "sentiment": comment["sentiment"] if "sentiment" in comment else None
          })
          seen_set.add(comment_hash)

  return elements


def elements_to_posts(elements):
  # Platforms to check
  platform_to_separator = {
      "FB": "keyword",
      "IG": "keyword",
      "Web": "keyword"
  }
  # Map each element hash to its predicted sentiment
  sentiment_from_element = {}
  seen_set = set()

  for element in elements:
    sentiment_from_element[element["hash"]] = element["sentiment"]


  # Crawl folders to find the files and posts
  posts = []
  platforms = next(os.walk(f'{DRIVE_FOLDER}/Inputs'))[1]
  for platform in platforms:
    if platform not in platform_to_separator:
      continue
    folder_files = next(os.walk(f'{DRIVE_FOLDER}/Inputs/{platform}'))[2]
    folder_files = list(filter(lambda x: (x.split(".")[-1] == "json"), folder_files))

    for filename in folder_files:
      source = filename.split(f"_{platform_to_separator[platform]}")[0]
      with open(f"{DRIVE_FOLDER}/Inputs/{platform}/{filename}", "r") as f:
        data = json.load(f)

      for post in data:
        post["platform"] = platform
        post["source"] = source
        post["title"] = post["title"] if post.get("title") else ""
        post["content"] = post["content"] if post.get("content") else ""
        post["comments"] = post["comments"] if post.get("comments") else []

        post_hash = post_to_hash(post)
        # Skip posts still not analyzed
        if post_hash in sentiment_from_element and post_hash not in seen_set:
          post["sentiment"] = sentiment_from_element[post_hash]
          del sentiment_from_element[post_hash]

          comments = []
          for comment in post["comments"]:
            comment_hash = comment_to_hash(post_hash, comment)

            # Skip comments still not analyzed
            if comment_hash in sentiment_from_element and comment_hash not in seen_set:
              comment["sentiment"] = sentiment_from_element[comment_hash]
              del sentiment_from_element[comment_hash]
              comments.append(comment)
              # seen_set.add(comment_hash)
          post["comments"] = comments

          posts.append(post)
          seen_set.add(post_hash)

  return posts

#### Data Fetching for parallelization

Now we fetch the input data from Drive, cross-check it against the results already produced, and get a list of elements still "to do".

These elements are shuffled to reduce the chance that multiple Colab sessions process the same elements.

In [None]:
import json
import os
import random

def thousands(number):
  return '{:,}'.format(number).replace(',','.')
def count_elements(posts):
  return len(posts) + sum(map(lambda x: len(x["comments"]), posts))
def count_format_elements(posts):
  return thousands(count_elements(posts))

def merge_chunks():
  set_to_skip = set()

  done_elements = []
  chunk_files = next(os.walk(f'{DRIVE_FOLDER}/Chunks'))[2]
  for chunk_file in chunk_files:
    if "done" not in chunk_file:
      continue

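    try:
      with open(f'{DRIVE_FOLDER}/Chunks/{chunk_file}', "r") as f:
        chunk_data = json.load(f)
      for element in chunk_data:
        element_hash = element["hash"]
        if element_hash not in set_to_skip:
          done_elements.append(element)
          set_to_skip.add(element_hash)
    except (OSError, ValueError, KeyError, TypeError):
      # Skip unreadable or malformed chunk files instead of aborting the merge
      pass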

  return done_elements, set_to_skip




def get_pair_of_data(with_print=True):
  # Platforms to check
  platform_to_separator = {
      "FB": "crawl",
      "IG": "post",
      "Web": "output"
  }
  # Open posts already analyzed, if not present initialize to empty array
  data_done, set_to_skip = merge_chunks()

  # Crawl folders to find the files and posts
  elements_to_process = []
  platforms = next(os.walk(f'{DRIVE_FOLDER}/Inputs'))[1]
  for platform in platforms:
    if platform not in platform_to_separator:
      continue
    folder_files = next(os.walk(f'{DRIVE_FOLDER}/Inputs/{platform}'))[2]
    folder_files = list(filter(lambda x: (x.split(".")[-1] == "json"), folder_files))

    for filename in folder_files:
      source = filename.split(f"_{platform_to_separator[platform]}")[0]
      with open(f"{DRIVE_FOLDER}/Inputs/{platform}/{filename}", "r") as f:
        data = json.load(f)

      for post in data:
        post["platform"] = platform
        post["source"] = source
        post["title"] = post["title"] if post.get("title") else ""
        post["content"] = post["content"] if post.get("content") else ""
        post["comments"] = post["comments"] if post.get("comments") else []

        post_hash = post_to_hash(post)

        # Skip posts already analyzed
        if post_hash not in set_to_skip:
          elements_to_process.append({
              "hash": post_hash,
              "text": post["title"] if post["title"] else post["content"],
              "sentiment": None
          })
          set_to_skip.add(post_hash)

        for comment in post["comments"]:
          comment_hash = comment_to_hash(post_hash, comment)
          if comment_hash not in set_to_skip:
            elements_to_process.append({
                "hash": comment_hash,
                "text": comment["text"],
                "sentiment": None
            })
            set_to_skip.add(comment_hash)

  random.shuffle(elements_to_process)

  # Number of elements to process (posts + comments)
  if with_print:
    print("Elements to process:", thousands(len(elements_to_process)))
    print("Elements skipped:", thousands(len(data_done)))
    print()
  return elements_to_process, data_done
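
The effect of `random.shuffle` in `get_pair_of_data` can be checked with a small simulation. The sketch below is an illustration under assumed numbers (10,000 pending elements, a first batch of 500, two sessions starting at the same time); `first_batch_overlap` is a hypothetical helper. Without shuffling, both sessions pick the same first batch. With independent shuffles, the expected overlap is about 500 × 500 / 10,000 = 25 elements.

```python
import random


def first_batch_overlap(n_elements, batch_size, shuffle, seeds=(1, 2)):
    """Count elements that two sessions would both pick as their first batch."""
    batches = []
    for seed in seeds:
        queue = list(range(n_elements))
        if shuffle:
            # Each session shuffles its own copy of the queue
            random.Random(seed).shuffle(queue)
        batches.append(set(queue[:batch_size]))
    return len(batches[0] & batches[1])


overlap_ordered = first_batch_overlap(10_000, 500, shuffle=False)
overlap_shuffled = first_batch_overlap(10_000, 500, shuffle=True)
print(overlap_ordered, overlap_shuffled)
```

Shuffling reduces duplicated work but does not eliminate it, which is why the done chunks are re-read on every iteration.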


### Prediction

We define the prediction function, which uses the model to predict the sentiment of a set of messages.

In [None]:
from google.colab import files
import uuid

"""
  Function for predicting a list of texts with the model
"""
def predict(X_test, model, tokenizer):
    y_pred = []
    for i in tqdm(range(len(X_test))):
        prompt = X_test.iloc[i]["text"]
        input_ids = tokenizer(prompt, return_tensors="pt").to("cuda")
        outputs = model.generate(**input_ids, max_new_tokens=1, temperature=0.001, pad_token_id = tokenizer.eos_token_id)
        result = tokenizer.decode(outputs[0])
        answer = result.split("=")[-1].lower()
        if "positive" in answer:
            y_pred.append("positive")
        elif "negative" in answer:
            y_pred.append("negative")
        elif "neutral" in answer:
            y_pred.append("neutral")
        else:
            y_pred.append("non-related")
    return y_pred
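
The label extraction inside `predict` can be tried without a GPU by isolating it. The sketch below restates that logic in a hypothetical helper, `extract_label`, applied to made-up decoded strings. It keeps only the text after the last `=`, so an `=` inside the bracketed message does not confuse the parsing.

```python
def extract_label(decoded):
    """Pick the sentiment label out of the text that follows the last '='."""
    answer = decoded.split("=")[-1].lower()
    # Same order of checks as in predict()
    for label in ("positive", "negative", "neutral"):
        if label in answer:
            return label
    return "non-related"


print(extract_label("[Great win today] = Positive"))
print(extract_label("[2+2=4, right?] = neutral"))
print(extract_label("[hmm] = the"))
```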

"""
  Adapter function to predict a json file of posts and comments
"""
def predict_json(json_data, model, tokenizer, data_to_append=(), chunk_index=0, debug=False):
    # List to collect all texts
    texts = []

    # Iterate over each post and its comments
    json_data = [*json_data]
    for post in json_data:
        # Append the post caption text
        try:
          texts.append(generate_test_prompt(post["title"] if post["title"] else post["content"]))
        except (KeyError, TypeError):
          # Show the malformed post, then re-raise the original error
          print(post)
          raise
        # Append each comment text
        for comment in post['comments']:
            texts.append(generate_test_prompt(comment["text"]))

    y_pred = []
    if len(texts) > 0:
      # Convert the list of texts into a DataFrame in one go
      df = pd.DataFrame(texts, columns=['text'])
      # Predict emotions
      y_pred = predict(df, model, tokenizer)

    # Assign emotion to posts and comments
    index = 0
    for post in json_data:
        post["sentiment"] = y_pred[index]
        index += 1
        if index < 40 and debug:
            print("-"*30)
            print(f"\n{post['sentiment']}: {post['title'] if post['title'] else post['content']}")
        for comment in post["comments"]:
            comment["sentiment"] = y_pred[index]
            index += 1
            if index < 40 and debug:
                print(f"  {comment['sentiment']}: {comment['text']}")

    checkpoint_filename = f"sentiment_output_[{chunk_index}]_{uuid.uuid4()}.json"
    with open(f"{DRIVE_FOLDER}/Chunks/{checkpoint_filename}", "w") as f:
        json.dump(json_data, f, indent=2)

    # Save output
    for post in data_to_append:
        json_data.append(post)

    with open("sentiment_output.json", "w") as f:
        json.dump(json_data, f, indent=2)
    return json_data

"""
  Function for predicting a list of elements, prebuilt for parallelization
"""
def predict_elements(elements_to_process, model, tokenizer, data_to_append=(), chunk_index=0, debug=False):
    # List to collect all texts
    texts = []

    # Iterate over each element
    elements_to_process = [*elements_to_process]
    for element in elements_to_process:
        texts.append(generate_test_prompt(element["text"]))

    y_pred = []
    if len(texts) > 0:
      # Convert the list of texts into a DataFrame in one go
      df = pd.DataFrame(texts, columns=['text'])
      # Predict emotions
      y_pred = predict(df, model, tokenizer)

    # Assign emotion to posts and comments
    index = 0
    for element in elements_to_process:
        element["sentiment"] = y_pred[index]
        index += 1

    checkpoint_filename = f"done_[{chunk_index}]_{uuid.uuid4()}.json"
    with open(f"{DRIVE_FOLDER}/Chunks/{checkpoint_filename}", "w") as f:
        json.dump(elements_to_process, f, indent=2)

    # Save output
    seen_set = set()
    for element in elements_to_process:
        seen_set.add(element["hash"])
    for element in data_to_append:
        if element["hash"] not in seen_set:
            elements_to_process.append(element)
            seen_set.add(element["hash"])

    return elements_to_process
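
Since several sessions read the `Chunks` folder while others write to it, a reader may in principle open a chunk file that is only partly written. One way to reduce that risk is to write to a temporary file and rename it into place. The sketch below is not part of the notebook: `write_json_atomically` is a hypothetical helper, and whether the rename is atomic on the mounted Drive folder is an assumption that has not been verified here.

```python
import json
import os
import tempfile


def write_json_atomically(data, target_path):
    """Write `data` to a temp file in the same folder, then rename it."""
    folder = os.path.dirname(target_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=folder, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(data, f, indent=2)
        # Atomic on local POSIX filesystems; on the Drive mount this is an assumption
        os.replace(tmp_path, target_path)
    except BaseException:
        # Do not leave a stray temp file behind if anything failed
        os.remove(tmp_path)
        raise
```

In `predict_elements`, the `open(...)` and `json.dump(...)` pair that writes the checkpoint could be swapped for a call such as `write_json_atomically(elements_to_process, f"{DRIVE_FOLDER}/Chunks/{checkpoint_filename}")`.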

#### Predict sentiment from the JSON list of posts and comments

This cell converts the posts and comments read from the Drive folder into elements and processes them 500 elements at a time.

On each iteration, it saves the results as chunks of completed data, which all Colab sessions then use to skip those elements.


In [None]:
import time

elements_to_process, data_done = get_pair_of_data()

def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

elements_to_do = len(elements_to_process)
elements_done = 0
index = 0

STEPS = 500

start_time = time.perf_counter()
# Predict and save output of posts and comments from json
while len(elements_to_process) > 0:
  total_seconds = round(time.perf_counter() - start_time)
  hours = total_seconds // 3600
  minutes = (total_seconds - hours * 3600) // 60
  seconds = (total_seconds - hours * 3600 - minutes * 60)
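  # Progress is measured against the initial total, since the queue shrinks on every refresh
  percentage_done = (elements_done * 1000 // elements_to_do) / 10

  print(f"Iteration {index}: ({percentage_done}%) {hours}h {minutes}m {seconds}s")

  # The queue holds flat elements (not posts), so predict_elements is the matching function
  batch = elements_to_process[:STEPS]
  data_done = predict_elements(batch, model, tokenizer, data_done, index)

  elements_done += len(batch)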

  print(f"  Done Elements: {thousands(elements_done)}")
  print(f"  Missing Elements: {thousands(elements_to_do - elements_done)}")
  print()

  # Get updated data
  elements_to_process, new_data_done = get_pair_of_data(False)
  seen_set = set()
  for data in data_done:
    seen_set.add(data["hash"])
  for element in new_data_done:
    if element["hash"] not in seen_set:
      data_done.append(element)
      seen_set.add(element["hash"])

  index += 1


Posts: 2.021
Elements to process: 105.643
Posts Skept: 9.420
Elements Skept: 39.541

Iteration 0: (0.0%) 0h 0m 0s
  Using 3 Posts; Elements: 613


100%|██████████| 613/613 [03:11<00:00,  3.21it/s]


  Done 3 Posts; Elements: 613
  Missing 2.018 Posts; Elements: 105.030

Iteration 1: (0.5%) 0h 3m 12s
  Using 3 Posts; Elements: 258


100%|██████████| 258/258 [01:18<00:00,  3.30it/s]


  Done 6 Posts; Elements: 871
  Missing 2.015 Posts; Elements: 104.772

Iteration 2: (0.8%) 0h 4m 31s
  Using 3 Posts; Elements: 30


100%|██████████| 30/30 [00:08<00:00,  3.43it/s]


  Done 9 Posts; Elements: 901
  Missing 2.012 Posts; Elements: 104.742

Iteration 3: (0.8%) 0h 4m 40s
  Using 3 Posts; Elements: 44


100%|██████████| 44/44 [00:12<00:00,  3.45it/s]


  Done 12 Posts; Elements: 945
  Missing 2.009 Posts; Elements: 104.698

Iteration 4: (0.8%) 0h 4m 54s
  Using 3 Posts; Elements: 11


100%|██████████| 11/11 [00:03<00:00,  3.21it/s]


  Done 15 Posts; Elements: 956
  Missing 2.006 Posts; Elements: 104.687

Iteration 5: (0.9%) 0h 4m 58s
  Using 3 Posts; Elements: 270


100%|██████████| 270/270 [01:20<00:00,  3.35it/s]


  Done 18 Posts; Elements: 1.226
  Missing 2.003 Posts; Elements: 104.417

Iteration 6: (1.1%) 0h 6m 20s
  Using 3 Posts; Elements: 506


100%|██████████| 506/506 [02:37<00:00,  3.21it/s]


  Done 21 Posts; Elements: 1.732
  Missing 2.000 Posts; Elements: 103.911

Iteration 7: (1.6%) 0h 8m 58s
  Using 3 Posts; Elements: 797


100%|██████████| 797/797 [04:01<00:00,  3.30it/s]


  Done 24 Posts; Elements: 2.529
  Missing 1.997 Posts; Elements: 103.114

Iteration 8: (2.3%) 0h 13m 0s
  Using 3 Posts; Elements: 54


100%|██████████| 54/54 [00:16<00:00,  3.36it/s]


  Done 27 Posts; Elements: 2.583
  Missing 1.994 Posts; Elements: 103.060

Iteration 9: (2.4%) 0h 13m 17s
  Using 3 Posts; Elements: 49


100%|██████████| 49/49 [00:14<00:00,  3.45it/s]


  Done 30 Posts; Elements: 2.632
  Missing 1.991 Posts; Elements: 103.011

Iteration 10: (2.4%) 0h 13m 33s
  Using 3 Posts; Elements: 19


100%|██████████| 19/19 [00:05<00:00,  3.30it/s]


  Done 33 Posts; Elements: 2.651
  Missing 1.988 Posts; Elements: 102.992

Iteration 11: (2.5%) 0h 13m 39s
  Using 3 Posts; Elements: 14


100%|██████████| 14/14 [00:04<00:00,  3.29it/s]


  Done 36 Posts; Elements: 2.665
  Missing 1.985 Posts; Elements: 102.978

Iteration 12: (2.5%) 0h 13m 44s
  Using 3 Posts; Elements: 14


100%|██████████| 14/14 [00:04<00:00,  3.30it/s]


  Done 39 Posts; Elements: 2.679
  Missing 1.982 Posts; Elements: 102.964

Iteration 13: (2.5%) 0h 13m 49s
  Using 3 Posts; Elements: 16


100%|██████████| 16/16 [00:04<00:00,  3.39it/s]


  Done 42 Posts; Elements: 2.695
  Missing 1.979 Posts; Elements: 102.948

Iteration 14: (2.5%) 0h 13m 55s
  Using 3 Posts; Elements: 9


100%|██████████| 9/9 [00:02<00:00,  3.17it/s]


  Done 45 Posts; Elements: 2.704
  Missing 1.976 Posts; Elements: 102.939

Iteration 15: (2.5%) 0h 13m 59s
  Using 3 Posts; Elements: 10


100%|██████████| 10/10 [00:03<00:00,  3.19it/s]


  Done 48 Posts; Elements: 2.714
  Missing 1.973 Posts; Elements: 102.929

Iteration 16: (2.5%) 0h 14m 2s
  Using 3 Posts; Elements: 8


100%|██████████| 8/8 [00:02<00:00,  3.16it/s]


  Done 51 Posts; Elements: 2.722
  Missing 1.970 Posts; Elements: 102.921

Iteration 17: (2.5%) 0h 14m 6s
  Using 3 Posts; Elements: 13


100%|██████████| 13/13 [00:04<00:00,  3.15it/s]


  Done 54 Posts; Elements: 2.735
  Missing 1.967 Posts; Elements: 102.908

Iteration 18: (2.5%) 0h 14m 10s
  Using 3 Posts; Elements: 37


100%|██████████| 37/37 [00:10<00:00,  3.47it/s]


  Done 57 Posts; Elements: 2.772
  Missing 1.964 Posts; Elements: 102.871

Iteration 19: (2.6%) 0h 14m 22s
  Using 3 Posts; Elements: 18


100%|██████████| 18/18 [00:05<00:00,  3.32it/s]


  Done 60 Posts; Elements: 2.790
  Missing 1.961 Posts; Elements: 102.853

Iteration 20: (2.6%) 0h 14m 28s
  Using 3 Posts; Elements: 13


100%|██████████| 13/13 [00:03<00:00,  3.28it/s]


  Done 63 Posts; Elements: 2.803
  Missing 1.958 Posts; Elements: 102.840

Iteration 21: (2.6%) 0h 14m 33s
  Using 3 Posts; Elements: 10


100%|██████████| 10/10 [00:03<00:00,  3.32it/s]


  Done 66 Posts; Elements: 2.813
  Missing 1.955 Posts; Elements: 102.830

Iteration 22: (2.6%) 0h 14m 37s
  Using 3 Posts; Elements: 6


100%|██████████| 6/6 [00:01<00:00,  3.32it/s]


  Done 69 Posts; Elements: 2.819
  Missing 1.952 Posts; Elements: 102.824

Iteration 23: (2.6%) 0h 14m 40s
  Using 3 Posts; Elements: 50


100%|██████████| 50/50 [00:14<00:00,  3.50it/s]


  Done 72 Posts; Elements: 2.869
  Missing 1.949 Posts; Elements: 102.774

Iteration 24: (2.7%) 0h 14m 55s
  Using 3 Posts; Elements: 6


100%|██████████| 6/6 [00:02<00:00,  2.99it/s]


  Done 75 Posts; Elements: 2.875
  Missing 1.946 Posts; Elements: 102.768

Iteration 25: (2.7%) 0h 14m 58s
  Using 3 Posts; Elements: 3


100%|██████████| 3/3 [00:01<00:00,  2.28it/s]


  Done 78 Posts; Elements: 2.878
  Missing 1.943 Posts; Elements: 102.765

Iteration 26: (2.7%) 0h 15m 0s
  Using 3 Posts; Elements: 6


100%|██████████| 6/6 [00:02<00:00,  2.93it/s]


  Done 81 Posts; Elements: 2.884
  Missing 1.940 Posts; Elements: 102.759

Iteration 27: (2.7%) 0h 15m 3s
  Using 3 Posts; Elements: 3


100%|██████████| 3/3 [00:01<00:00,  2.47it/s]


  Done 84 Posts; Elements: 2.887
  Missing 1.937 Posts; Elements: 102.756

Iteration 28: (2.7%) 0h 15m 5s
  Using 3 Posts; Elements: 3


100%|██████████| 3/3 [00:01<00:00,  2.41it/s]


  Done 87 Posts; Elements: 2.890
  Missing 1.934 Posts; Elements: 102.753

Iteration 29: (2.7%) 0h 15m 7s
  Using 3 Posts; Elements: 4


100%|██████████| 4/4 [00:01<00:00,  2.45it/s]


  Done 90 Posts; Elements: 2.894
  Missing 1.931 Posts; Elements: 102.749

Iteration 30: (2.7%) 0h 15m 9s
  Using 3 Posts; Elements: 3


100%|██████████| 3/3 [00:01<00:00,  2.35it/s]


  Done 93 Posts; Elements: 2.897
  Missing 1.928 Posts; Elements: 102.746

Iteration 31: (2.7%) 0h 15m 11s
  Using 3 Posts; Elements: 7


100%|██████████| 7/7 [00:02<00:00,  2.78it/s]


  Done 96 Posts; Elements: 2.904
  Missing 1.925 Posts; Elements: 102.739

Iteration 32: (2.7%) 0h 15m 15s
  Using 3 Posts; Elements: 5


100%|██████████| 5/5 [00:01<00:00,  2.96it/s]


  Done 99 Posts; Elements: 2.909
  Missing 1.922 Posts; Elements: 102.734

Iteration 33: (2.7%) 0h 15m 17s
  Using 3 Posts; Elements: 15


100%|██████████| 15/15 [00:04<00:00,  3.41it/s]


  Done 102 Posts; Elements: 2.924
  Missing 1.919 Posts; Elements: 102.719

Iteration 34: (2.7%) 0h 15m 22s
  Using 3 Posts; Elements: 8


100%|██████████| 8/8 [00:02<00:00,  3.12it/s]


  Done 105 Posts; Elements: 2.932
  Missing 1.916 Posts; Elements: 102.711

Iteration 35: (2.7%) 0h 15m 26s
  Using 3 Posts; Elements: 31


100%|██████████| 31/31 [00:09<00:00,  3.40it/s]


  Done 108 Posts; Elements: 2.963
  Missing 1.913 Posts; Elements: 102.680

Iteration 36: (2.8%) 0h 15m 36s
  Using 3 Posts; Elements: 211


100%|██████████| 211/211 [00:59<00:00,  3.57it/s]


  Done 111 Posts; Elements: 3.174
  Missing 1.910 Posts; Elements: 102.469

Iteration 37: (3.0%) 0h 16m 35s
  Using 3 Posts; Elements: 245


100%|██████████| 245/245 [01:08<00:00,  3.60it/s]


  Done 114 Posts; Elements: 3.419
  Missing 1.907 Posts; Elements: 102.224

Iteration 38: (3.2%) 0h 17m 44s
  Using 3 Posts; Elements: 132


100%|██████████| 132/132 [00:36<00:00,  3.63it/s]


  Done 117 Posts; Elements: 3.551
  Missing 1.904 Posts; Elements: 102.092

Iteration 39: (3.3%) 0h 18m 21s
  Using 3 Posts; Elements: 406


100%|██████████| 406/406 [01:50<00:00,  3.66it/s]


  Done 120 Posts; Elements: 3.957
  Missing 1.901 Posts; Elements: 101.686

Iteration 40: (3.7%) 0h 20m 13s
  Using 3 Posts; Elements: 675


100%|██████████| 675/675 [03:03<00:00,  3.68it/s]


  Done 123 Posts; Elements: 4.632
  Missing 1.898 Posts; Elements: 101.011

Iteration 41: (4.3%) 0h 23m 18s
  Using 3 Posts; Elements: 465


100%|██████████| 465/465 [02:06<00:00,  3.68it/s]


  Done 126 Posts; Elements: 5.097
  Missing 1.895 Posts; Elements: 100.546

Iteration 42: (4.8%) 0h 25m 25s
  Using 3 Posts; Elements: 123


100%|██████████| 123/123 [00:33<00:00,  3.67it/s]


  Done 129 Posts; Elements: 5.220
  Missing 1.892 Posts; Elements: 100.423

Iteration 43: (4.9%) 0h 25m 59s
  Using 3 Posts; Elements: 122


100%|██████████| 122/122 [00:33<00:00,  3.64it/s]


  Done 132 Posts; Elements: 5.342
  Missing 1.889 Posts; Elements: 100.301

Iteration 44: (5.0%) 0h 26m 34s
  Using 3 Posts; Elements: 235


100%|██████████| 235/235 [01:04<00:00,  3.64it/s]


  Done 135 Posts; Elements: 5.577
  Missing 1.886 Posts; Elements: 100.066

Iteration 45: (5.2%) 0h 27m 39s
  Using 3 Posts; Elements: 487


100%|██████████| 487/487 [02:12<00:00,  3.67it/s]


  Done 138 Posts; Elements: 6.064
  Missing 1.883 Posts; Elements: 99.579

Iteration 46: (5.7%) 0h 29m 53s
  Using 3 Posts; Elements: 404


100%|██████████| 404/404 [01:50<00:00,  3.64it/s]


  Done 141 Posts; Elements: 6.468
  Missing 1.880 Posts; Elements: 99.175

Iteration 47: (6.1%) 0h 31m 45s
  Using 3 Posts; Elements: 186


100%|██████████| 186/186 [00:51<00:00,  3.62it/s]


  Done 144 Posts; Elements: 6.654
  Missing 1.877 Posts; Elements: 98.989

Iteration 48: (6.2%) 0h 32m 37s
  Using 3 Posts; Elements: 110


100%|██████████| 110/110 [00:31<00:00,  3.52it/s]


  Done 147 Posts; Elements: 6.764
  Missing 1.874 Posts; Elements: 98.879

Iteration 49: (6.4%) 0h 33m 9s
  Using 3 Posts; Elements: 320


100%|██████████| 320/320 [01:28<00:00,  3.63it/s]


  Done 150 Posts; Elements: 7.084
  Missing 1.871 Posts; Elements: 98.559

Iteration 50: (6.7%) 0h 34m 38s
  Using 3 Posts; Elements: 869


100%|██████████| 869/869 [04:02<00:00,  3.59it/s]


  Done 153 Posts; Elements: 7.953
  Missing 1.868 Posts; Elements: 97.690

Iteration 51: (7.5%) 0h 38m 41s
  Using 3 Posts; Elements: 780


100%|██████████| 780/780 [03:34<00:00,  3.64it/s]


  Done 156 Posts; Elements: 8.733
  Missing 1.865 Posts; Elements: 96.910

Iteration 52: (8.2%) 0h 42m 16s
  Using 3 Posts; Elements: 384


100%|██████████| 384/384 [01:45<00:00,  3.64it/s]


  Done 159 Posts; Elements: 9.117
  Missing 1.862 Posts; Elements: 96.526

Iteration 53: (8.6%) 0h 44m 3s
  Using 3 Posts; Elements: 378


100%|██████████| 378/378 [01:43<00:00,  3.64it/s]


  Done 162 Posts; Elements: 9.495
  Missing 1.859 Posts; Elements: 96.148

Iteration 54: (8.9%) 0h 45m 47s
  Using 3 Posts; Elements: 207


100%|██████████| 207/207 [00:56<00:00,  3.65it/s]


  Done 165 Posts; Elements: 9.702
  Missing 1.856 Posts; Elements: 95.941

Iteration 55: (9.1%) 0h 46m 44s
  Using 3 Posts; Elements: 371


100%|██████████| 371/371 [01:41<00:00,  3.67it/s]


  Done 168 Posts; Elements: 10.073
  Missing 1.853 Posts; Elements: 95.570

Iteration 56: (9.5%) 0h 48m 27s
  Using 3 Posts; Elements: 577


100%|██████████| 577/577 [02:37<00:00,  3.67it/s]


  Done 171 Posts; Elements: 10.650
  Missing 1.850 Posts; Elements: 94.993

Iteration 57: (10.0%) 0h 51m 6s
  Using 3 Posts; Elements: 259


100%|██████████| 259/259 [01:10<00:00,  3.65it/s]


  Done 174 Posts; Elements: 10.909
  Missing 1.847 Posts; Elements: 94.734

Iteration 58: (10.3%) 0h 52m 18s
  Using 3 Posts; Elements: 243


100%|██████████| 243/243 [01:06<00:00,  3.67it/s]


  Done 177 Posts; Elements: 11.152
  Missing 1.844 Posts; Elements: 94.491

Iteration 59: (10.5%) 0h 53m 25s
  Using 3 Posts; Elements: 137


100%|██████████| 137/137 [00:37<00:00,  3.63it/s]


  Done 180 Posts; Elements: 11.289
  Missing 1.841 Posts; Elements: 94.354

Iteration 60: (10.6%) 0h 54m 3s
  Using 3 Posts; Elements: 219


100%|██████████| 219/219 [00:59<00:00,  3.66it/s]


  Done 183 Posts; Elements: 11.508
  Missing 1.838 Posts; Elements: 94.135

Iteration 61: (10.8%) 0h 55m 4s
  Using 3 Posts; Elements: 228


100%|██████████| 228/228 [01:02<00:00,  3.66it/s]


  Done 186 Posts; Elements: 11.736
  Missing 1.835 Posts; Elements: 93.907

Iteration 62: (11.1%) 0h 56m 8s
  Using 3 Posts; Elements: 446


100%|██████████| 446/446 [02:02<00:00,  3.65it/s]


  Done 189 Posts; Elements: 12.182
  Missing 1.832 Posts; Elements: 93.461

Iteration 63: (11.5%) 0h 58m 11s
  Using 3 Posts; Elements: 347


100%|██████████| 347/347 [01:35<00:00,  3.63it/s]


  Done 192 Posts; Elements: 12.529
  Missing 1.829 Posts; Elements: 93.114

Iteration 64: (11.8%) 0h 59m 47s
  Using 3 Posts; Elements: 1.874


100%|██████████| 1874/1874 [08:47<00:00,  3.55it/s]


  Done 195 Posts; Elements: 14.403
  Missing 1.826 Posts; Elements: 91.240

Iteration 65: (13.6%) 1h 8m 36s
  Using 3 Posts; Elements: 302


100%|██████████| 302/302 [01:22<00:00,  3.67it/s]


  Done 198 Posts; Elements: 14.705
  Missing 1.823 Posts; Elements: 90.938

Iteration 66: (13.9%) 1h 9m 59s
  Using 3 Posts; Elements: 164


100%|██████████| 164/164 [00:44<00:00,  3.65it/s]


  Done 201 Posts; Elements: 14.869
  Missing 1.820 Posts; Elements: 90.774

Iteration 67: (14.0%) 1h 10m 45s
  Using 3 Posts; Elements: 146


100%|██████████| 146/146 [00:39<00:00,  3.68it/s]


  Done 204 Posts; Elements: 15.015
  Missing 1.817 Posts; Elements: 90.628

Iteration 68: (14.2%) 1h 11m 26s
  Using 3 Posts; Elements: 221


100%|██████████| 221/221 [01:00<00:00,  3.66it/s]


  Done 207 Posts; Elements: 15.236
  Missing 1.814 Posts; Elements: 90.407

Iteration 69: (14.4%) 1h 12m 27s
  Using 3 Posts; Elements: 103


100%|██████████| 103/103 [00:28<00:00,  3.62it/s]


  Done 210 Posts; Elements: 15.339
  Missing 1.811 Posts; Elements: 90.304

Iteration 70: (14.5%) 1h 12m 56s
  Using 3 Posts; Elements: 280


100%|██████████| 280/280 [01:16<00:00,  3.67it/s]


  Done 213 Posts; Elements: 15.619
  Missing 1.808 Posts; Elements: 90.024

Iteration 71: (14.7%) 1h 14m 14s
  Using 3 Posts; Elements: 341


100%|██████████| 341/341 [01:32<00:00,  3.70it/s]


  Done 216 Posts; Elements: 15.960
  Missing 1.805 Posts; Elements: 89.683

Iteration 72: (15.1%) 1h 15m 47s
  Using 3 Posts; Elements: 117


100%|██████████| 117/117 [00:32<00:00,  3.66it/s]


  Done 219 Posts; Elements: 16.077
  Missing 1.802 Posts; Elements: 89.566

Iteration 73: (15.2%) 1h 16m 20s
  Using 3 Posts; Elements: 105


100%|██████████| 105/105 [00:28<00:00,  3.68it/s]


  Done 222 Posts; Elements: 16.182
  Missing 1.799 Posts; Elements: 89.461

Iteration 74: (15.3%) 1h 16m 50s
  Using 3 Posts; Elements: 173


100%|██████████| 173/173 [00:47<00:00,  3.62it/s]


  Done 225 Posts; Elements: 16.355
  Missing 1.796 Posts; Elements: 89.288

Iteration 75: (15.4%) 1h 17m 39s
  Using 3 Posts; Elements: 289


 75%|███████▍  | 216/289 [00:58<00:19,  3.71it/s]

#### Chunk Merging

Once predictions have been completed across all Colab sessions, this function merges all the chunks into a single file, which saves space and speeds up later loads.

In [None]:
def consolidate_chunks():
  total_posts, _ = merge_chunks()
  with open(f"{DRIVE_FOLDER}/Chunks/done.json", "w") as f:
    json.dump(total_posts, f)

# consolidate_chunks()

After the chunks have been consolidated, the following line removes all the individual chunk files, leaving only the consolidated one.

**IMPORTANT!!**: Only run this line once you are sure the consolidation completed correctly; otherwise the processed data will be lost.
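Before deleting anything, it can help to confirm that the consolidated file is readable. The following is a minimal sketch, not part of the original notebook: `verify_consolidation` is a hypothetical helper, and it assumes the consolidated JSON holds one top-level entry per post. The demonstration uses a temporary file standing in for `done.json`.

```python
import json
import os
import tempfile


def verify_consolidation(path, expected_posts):
    """Return True if `path` is valid JSON with `expected_posts` top-level entries.

    Hypothetical helper: assumes one top-level entry per post.
    """
    try:
        with open(path) as f:
            data = json.load(f)
    except (OSError, json.JSONDecodeError):
        # Missing, unreadable or corrupt file: consolidation cannot be trusted.
        return False
    return len(data) == expected_posts


# Demonstration with a temporary file standing in for the consolidated output.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "done.json")
    posts = [{"id": 1, "comments": []}, {"id": 2, "comments": []}]
    with open(demo_path, "w") as f:
        json.dump(posts, f)

    ok = verify_consolidation(demo_path, expected_posts=len(posts))
    missing = verify_consolidation(os.path.join(tmp, "absent.json"), expected_posts=2)

print(ok, missing)
```

In the notebook, the path would presumably be `f"{DRIVE_FOLDER}/Chunks/done.json"` and the expected count the number of posts reported as done in the prediction log.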

In [None]:
# Clean chunks
# !find  . -name 'sentiment_output_*' -exec rm {} \;

### Training Phase

This section focuses on training the model. It consists of an initial test of the untrained model, then a training phase, and finally a second test to check the results of the training.

#### Dataset

For training and testing, we load a labelled dataset of tweets.
The dataset is sampled in equal parts for each sentiment: 600 messages per sentiment for training and another 600 per sentiment for testing. For validation, we keep 100 messages per sentiment, drawn from the remaining rows.

Note that the full dataset contains more messages, but given the time requirements and the size of the LLM used, the sample had to be restricted to keep training feasible, reducing training time from 15 hours to just 1 hour.


In [None]:
df = load_dataset("SetFit/tweet_sentiment_extraction")
df = pd.DataFrame(df["train"])

X_train = list()
X_test = list()
for sentiment in ["positive", "neutral", "negative"]:
    train, test  = train_test_split(df[df.label_text==sentiment],
                                    train_size=600,
                                    test_size=600,
                                    random_state=42)
    X_train.append(train)
    X_test.append(test)

X_train = pd.concat(X_train).sample(frac=1, random_state=10)
X_test = pd.concat(X_test)

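# Validation rows: everything not already used for training or testing.
# Use the combined indices of all sentiments; the loop variables `train`
# and `test` only hold the split of the last sentiment processed.
used_idx = X_train.index.union(X_test.index)
X_eval = df[~df.index.isin(used_idx)]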
X_eval = (X_eval
          .groupby('label_text', group_keys=False)
          .apply(lambda x: x.sample(n=100, random_state=10, replace=True)))
X_train = X_train.reset_index(drop=True)

X_train = pd.DataFrame(X_train.apply(generate_prompt, axis=1), columns=["text"])
X_eval = pd.DataFrame(X_eval.apply(generate_prompt, axis=1), columns=["text"])

y_true = pd.DataFrame(X_test).label_text
X_test = pd.DataFrame(X_test.apply(generate_test_prompt, axis=1), columns=["text"])


train_data = Dataset.from_pandas(X_train)
eval_data = Dataset.from_pandas(X_eval)

print("Train:")
print(X_train)
print()
print("Test:")
print(X_test)
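Since the validation rows are meant to exclude anything already used for training or testing, a quick overlap check on the row indices can be useful. This is a minimal sketch with toy indices; `overlapping_splits` is a hypothetical helper, and in the notebook the index lists would have to be captured from the DataFrames before `reset_index` is called.

```python
from itertools import combinations


def overlapping_splits(splits):
    """Return {(name_a, name_b): shared_indices} for every pair of splits that overlap."""
    overlaps = {}
    for (name_a, idx_a), (name_b, idx_b) in combinations(splits.items(), 2):
        shared = set(idx_a) & set(idx_b)
        if shared:
            overlaps[(name_a, name_b)] = shared
    return overlaps


# Toy row indices (hypothetical): row 3 leaks from the training split into validation.
demo_splits = {
    "train": [0, 1, 2, 3],
    "test": [4, 5, 6],
    "eval": [3, 7, 8],
}
leaks = overlapping_splits(demo_splits)
print(leaks)  # {('train', 'eval'): {3}}
```

An empty result means the three splits are disjoint.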

#### Initial Prediction with Untrained Model

We use the prediction function to get the predicted emotions for the testing dataset.

In [None]:
y_pred = predict(X_test, model, tokenizer)

100%|██████████| 1800/1800 [23:36<00:00,  1.27it/s]


We then analyse the results of the untrained model.

In [None]:
evaluate(y_true, y_pred)

Accuracy: 0.772
Accuracy for label 0: 0.808
Accuracy for label 1: 0.692
Accuracy for label 2: 0.815

Classification Report:
              precision    recall  f1-score   support

           0       0.78      0.81      0.80       600
           1       0.68      0.69      0.69       600
           2       0.85      0.81      0.83       600

    accuracy                           0.77      1800
   macro avg       0.77      0.77      0.77      1800
weighted avg       0.77      0.77      0.77      1800


Confusion Matrix:
[[485  99  16]
 [116 415  69]
 [ 18  93 489]]
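The figures in the report can be recovered directly from the confusion matrix. The sketch below assumes the usual convention in which rows are true labels and columns are predicted labels, which is consistent with each row summing to the support of 600. The helper names are illustrative and not part of the notebook's `evaluate` function.

```python
# Confusion matrix of the untrained model, copied from the output above.
confusion = [
    [485, 99, 16],
    [116, 415, 69],
    [18, 93, 489],
]


def per_label_recall(matrix):
    """Correct predictions for a label divided by the number of true examples (row sum)."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]


def per_label_precision(matrix):
    """Correct predictions for a label divided by all predictions of that label (column sum)."""
    n = len(matrix)
    return [matrix[i][i] / sum(matrix[r][i] for r in range(n)) for i in range(n)]


def overall_accuracy(matrix):
    """Diagonal total divided by the total number of examples."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


recall = [round(v, 3) for v in per_label_recall(confusion)]
precision = [round(v, 2) for v in per_label_precision(confusion)]
accuracy = round(overall_accuracy(confusion), 3)

print(recall)     # [0.808, 0.692, 0.815]
print(precision)  # [0.78, 0.68, 0.85]
print(accuracy)   # 0.772
```

The per-label recall values match the "Accuracy for label" lines, since with a balanced test set each label's accuracy is its row's diagonal entry divided by 600.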


#### Training

To further increase the accuracy of the model, we fine-tune it on the training dataset. We define the LoRA adapter configuration, the training arguments (epochs, batch size, learning rate, evaluation schedule, etc.) and the trainer.

The training data drives the fine-tuning itself, while the validation data is used to monitor how the model behaves as training progresses and to keep the weights with the best performance.
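To get a feel for what the training arguments in the next cell imply, the schedule can be worked out by hand. This is a rough sketch under stated assumptions: a single GPU, 1,800 training examples (3 sentiments × 600), and the step arithmetic as I understand the Hugging Face `Trainer` to apply it (integer division for steps per epoch, ceiling for warmup steps). The exact numbers may differ between library versions.

```python
import math

# Values taken from the dataset description and the training arguments.
train_examples = 3 * 600
per_device_train_batch_size = 1
gradient_accumulation_steps = 8
num_train_epochs = 3
eval_steps = 112
warmup_ratio = 0.03

# Examples contributing to each optimizer update (single GPU assumed).
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps

# Optimizer updates per epoch and over the whole run.
batches_per_epoch = math.ceil(train_examples / per_device_train_batch_size)
steps_per_epoch = batches_per_epoch // gradient_accumulation_steps
total_steps = steps_per_epoch * num_train_epochs

# Evaluations happen every `eval_steps` optimizer updates.
num_evaluations = total_steps // eval_steps

# Learning-rate warmup covers the first fraction of the run.
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(effective_batch_size, steps_per_epoch, total_steps, num_evaluations, warmup_steps)
# 8 225 675 6 21
```

With `eval_steps=112`, the model is therefore evaluated roughly twice per epoch under these assumptions.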

In [None]:
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj",],
)

training_arguments = TrainingArguments(
    output_dir="logs",
    num_train_epochs=3,
    gradient_checkpointing=True,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    optim="paged_adamw_32bit",
    save_steps=0,
    logging_steps=25,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=True,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,
    group_by_length=False,
    evaluation_strategy='steps',
    load_best_model_at_end=True,
    eval_steps = 112,
    eval_accumulation_steps=1,
    lr_scheduler_type="cosine",
    report_to="tensorboard",
)

trainer = SFTTrainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=eval_data,
    peft_config=peft_config,
    dataset_text_field="text",
    tokenizer=tokenizer,
    max_seq_length=max_seq_length,
    args=training_arguments,
    packing=False,
)

# Train model
trainer.train()

# Save trained model
trainer.model.save_pretrained("trained-model")


#### Final Testing

After training the model, we again predict the sentiments of the testing dataset and analyse the results, so they can be compared against those of the untrained model.

In [None]:
y_pred = predict(X_test, model, tokenizer)

  0%|          | 0/1800 [00:00<?, ?it/s]Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.

In [None]:
evaluate(y_true, y_pred)

Accuracy: 0.700
Accuracy for label 0: 0.758
Accuracy for label 1: 0.592
Accuracy for label 2: 0.750

Classification Report:
              precision    recall  f1-score   support

           0       0.74      0.76      0.75       600
           1       0.60      0.59      0.60       600
           2       0.76      0.75      0.75       600

    accuracy                           0.70      1800
   macro avg       0.70      0.70      0.70      1800
weighted avg       0.70      0.70      0.70      1800


Confusion Matrix:
[[455 116  29]
 [131 355 114]
 [ 32 118 450]]
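To compare the two runs side by side, the change in accuracy can be computed from the two confusion matrices reported in this section. The sketch below is illustrative only; the helper names are not part of the notebook, and rows are again assumed to be true labels.

```python
# Confusion matrices copied from the two evaluation outputs in this section.
before_training = [
    [485, 99, 16],
    [116, 415, 69],
    [18, 93, 489],
]
after_training = [
    [455, 116, 29],
    [131, 355, 114],
    [32, 118, 450],
]


def overall_accuracy(matrix):
    """Diagonal total divided by the total number of examples."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / sum(sum(row) for row in matrix)


def per_label_delta(before, after):
    """Change in per-label accuracy, assuming both runs share the same test set."""
    return [
        (after[i][i] - before[i][i]) / sum(before[i])
        for i in range(len(before))
    ]


overall_delta = round(overall_accuracy(after_training) - overall_accuracy(before_training), 3)
label_deltas = [round(d, 3) for d in per_label_delta(before_training, after_training)]

print(overall_delta)  # -0.072
print(label_deltas)   # [-0.05, -0.1, -0.065]
```

On these recorded outputs, accuracy goes from 0.772 to 0.700, with the largest drop on label 1.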

# Extra space for developing