# **LLama 3.1-8B (base version)**

Code was adapted based on an example from [Link to Source](https://github.com/adidror005/youtube-videos/blob/main/old_videos/LLAMA_3_Fine_Tuning_for_Sequence_Classification_Actual_Video.ipynb)

# **Llama**


### Big Picture Overview of Parameter Efficient Fine Tuning Methods like LoRA and QLoRA Fine Tuning for Sequence Classification

**The Essence of Fine-tuning**
- LLMs are pre-trained on vast amounts of data for broad language understanding.
- Fine-tuning is crucial for specializing in specific domains or tasks, involving adjustments with smaller, relevant datasets.

**Model Fine-tuning with PEFT: Exploring LoRA and QLoRA**
- Traditional fine-tuning is resource-intensive; PEFT (Parameter Efficient Fine-tuning) makes the process faster and less demanding.
- Focus on two PEFT methods: LoRA and QLoRA.

**The Power of PEFT**
- PEFT modifies only a subset of the LLM's parameters, enhancing speed and reducing memory demands, making it suitable for less powerful devices.

**LoRA: Efficiency through Adapters**
- **Low-Rank Adaptation (LoRA):** Injects small trainable adapters into the pre-trained model.
- **Equation:** For a weight matrix $W$, LoRA approximates $W = W_0 + BA$, where $W_0$ is the original weight matrix, and $BA$ represents the low-rank modification through trainable matrices $B$ and $A$.
- Adapters learn task nuances while keeping the majority of the LLM unchanged, minimizing overhead.

**QLoRA: Compression and Speed**
- **Quantized LoRA (QLoRA):** Extends LoRA by quantizing the model’s weights, further reducing size and enhancing speed.
- **Innovations in QLoRA:**
  1. **4-bit Quantization:** Uses a 4-bit data type, NormalFloat (NF4), for optimal weight quantization, drastically reducing memory usage.
  2. **Low-Rank Adapters:** Fine-tuned with 16-bit precision to effectively capture task-specific nuances.
  3. **Double Quantization:** Reduces quantization constants from 32-bit to 8-bit, saving additional memory without accuracy loss.
  4. **Paged Optimizers:** Manages memory efficiently during training, optimizing for large tasks.

**Why PEFT Matters**
- **Rapid Learning:** Speeds up model adaptation.
- **Smaller Footprint:** Eases deployment with reduced model size.
- **Edge-Friendly:** Fits better on devices with limited resources, enhancing accessibility.

**Conclusion**
- PEFT methods like LoRA and QLoRA revolutionize LLM fine-tuning by focusing on efficiency, facilitating faster adaptability, smaller models, and broader device compatibility.

***

### Fine-tuning for Sentiment Analysis Classification:


#### 1. Text Generation with Sentiment Label as part of text
- **Approach**: Train the model to generate text that naturally appends the sentiment label at the end.
- **Input**: "TSLA slashes model Y prices ======"
- **Output**: "TSLA slashes model Y prices ====== Bearish"
- **Use Case**: This method is useful for applications requiring continuous text output that includes embedded sentiment analysis, such as interactive chatbots or automated content creation tools.


#### 2. Sequence Classification Head
- **Approach**: Add a sequence classification head (linear layer) on top of the LLaMa Model transformer. This setup is similar to GPT-2 and focuses on classifying the sentiment based on the last relevant token in the sequence.
    - **Token Positioning**:
        - **With pad_token_id**: The model identifies and ignores padding tokens, using the last non-padding token for classification.
        - **Without pad_token_id**: It defaults to the last token in each sequence.
        - **inputs_embeds**: If embeddings are directly passed (without input_ids), the model cannot identify padding tokens and takes the last embedding in each sequence as the input for classification.
- **Input**: Specific sentences (e.g., "TSLA slashes Model Y prices").
- **Output**: Direct sentiment classification (e.g., "Bearish").
- **Training Objective**: Minimize cross-entropy loss between the predicted and the actual sentiment labels.

https://huggingface.co/docs/transformers/main/en/model_doc/llama

### Peft Configs
* Bits and bytes config for quantization
* Lora config for lora

### Going to use Hugginface Transformers trainer class: Main componenents
* Hugging face dataset (for train + eval)
* Data collater
* Compute Metrics
* Class weights since we use custom trainer and also custom weighted loss..
* trainingArgs: like # epochs, learning rate, weight decay etc..




In [1]:
# install packages
!pip install -U bitsandbytes
!pip install -U transformers
!pip install -U accelerate
!pip install -U peft
!pip install -U trl
!pip install pyarrow==18.1.0
!pip install evaluate



In [2]:
# import packages

import numpy as np
import pandas as pd
import os
import random
import evaluate
import functools # ??
from tqdm import tqdm
import bitsandbytes as bnb

import torch
import torch.nn as nn
import torch.nn.functional as F

from datasets import Dataset, DatasetDict
from peft import LoraConfig, PeftConfig, prepare_model_for_kbit_training, get_peft_model

from trl import SFTTrainer
from trl import setup_chat_format

import transformers
from transformers import (AutoModelForCausalLM,
                          AutoModelForSequenceClassification,
                        AutoTokenizer, 
                          BitsAndBytesConfig, 
                          TrainingArguments, 
                            Trainer,
                            DataCollatorWithPadding,
                          pipeline, 
                          logging)

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, 
                             classification_report, 
                             confusion_matrix,
                            f1_score, balanced_accuracy_score)


In [3]:
import torch
torch.cuda.empty_cache()
torch.cuda.is_available()
#torch.cuda.device_count()

True

## **Authenticate for Hugging Face**

In [4]:
# Hugging face access

from huggingface_hub import login
with open("../../../login/hf_key.txt", 'r') as f: 
    HF_TOKEN = str(f.read())
    
login(token = HF_TOKEN)

## **Data**

In [5]:
# loading the data
import pandas as pd
data = pd.read_csv("../../../data/labeled_data.csv")
data = data[["label", "body_parent", "body_child", "msg_id_parent", "msg_id_child", "subreddit", "datetime", "exact_time"]].sort_values(by = "exact_time").reset_index(drop = True)

# keep integer labels
data['target'] = data['label']

# for readability, recode labels
int_to_label = {2: "agree", 1 : "neutral", 0 : "disagree"}
data.replace({"label": int_to_label}, inplace = True)

data

Unnamed: 0,label,body_parent,body_child,msg_id_parent,msg_id_child,subreddit,datetime,exact_time,target
0,neutral,"I live in rural Saskatchewan, Canada. We have ...",I'm in NE USA we've had 3 in two years...all e...,cnddov1,cndj2gv,climate,03/01/2015 23:18,1420327135,1
1,neutral,"I live in rural Saskatchewan, Canada. We have ...",One hundred year flood just means a one in one...,cnddov1,cndkpy7,climate,04/01/2015 00:10,1420330231,1
2,neutral,Convince her of what? That it's happening or t...,That anthropocentric climate change is actuall...,cndnlrd,cndnsxt,climate,04/01/2015 01:45,1420335952,1
3,disagree,I think this prediction is about as valid as s...,It's January. Literally no one said it would b...,cndl5x4,cndybsy,climate,04/01/2015 08:01,1420358465,0
4,disagree,"Mann hasn't *been* honest in decades, so I'm c...",There have been a dozen re-constructions of Ma...,cne462t,cne89ej,climate,04/01/2015 17:45,1420393544,0
...,...,...,...,...,...,...,...,...,...
42889,neutral,Not trying to spark an argument but a legitima...,Keeping in mind that the Palestinians killed m...,gyo197v,gyotff1,Republican,19/05/2021 12:36,1621427788,1
42890,agree,Y'all saw Guilianis hail Mary right? Get his s...,"Same I want these assholes in jail, full stop....",gynfsu4,gyp3u39,democrats,19/05/2021 13:56,1621432578,2
42891,agree,>Why don't I see ads holding Republicans accou...,"Yeah, I agree with the goal of this post but n...",gyn6nzm,gyp5vzw,democrats,19/05/2021 14:11,1621433471,2
42892,agree,How about ... no? This is strange. Community o...,"I know, it feels strange too. We wouldn't hold...",gyp71o7,gyp7en6,BlackLivesMatter,19/05/2021 14:21,1621434116,2


In [6]:
# make text

def create_training_data(data):

    result = []

    for idx, row in data.iterrows():
        system_prompt = """You are a classification Chatbot. Given a comment and a reply, you classify whether the reply agrees, disagrees or is neutral towards the comment.
        You only reply with either "agree", "disagree" or "neutral" and nothing else."""
        comment = row["body_parent"]
        reply = row["body_child"]
        label = row["label"]
        target = row["target"]
        result.append({'system_prompt' : system_prompt, 'comment' : comment, 'reply': reply, 'label' : label, 'target' : target})
    
    return result

# save data
df = pd.DataFrame(create_training_data(data))
df

Unnamed: 0,system_prompt,comment,reply,label,target
0,You are a classification Chatbot. Given a comm...,"I live in rural Saskatchewan, Canada. We have ...",I'm in NE USA we've had 3 in two years...all e...,neutral,1
1,You are a classification Chatbot. Given a comm...,"I live in rural Saskatchewan, Canada. We have ...",One hundred year flood just means a one in one...,neutral,1
2,You are a classification Chatbot. Given a comm...,Convince her of what? That it's happening or t...,That anthropocentric climate change is actuall...,neutral,1
3,You are a classification Chatbot. Given a comm...,I think this prediction is about as valid as s...,It's January. Literally no one said it would b...,disagree,0
4,You are a classification Chatbot. Given a comm...,"Mann hasn't *been* honest in decades, so I'm c...",There have been a dozen re-constructions of Ma...,disagree,0
...,...,...,...,...,...
42889,You are a classification Chatbot. Given a comm...,Not trying to spark an argument but a legitima...,Keeping in mind that the Palestinians killed m...,neutral,1
42890,You are a classification Chatbot. Given a comm...,Y'all saw Guilianis hail Mary right? Get his s...,"Same I want these assholes in jail, full stop....",agree,2
42891,You are a classification Chatbot. Given a comm...,>Why don't I see ads holding Republicans accou...,"Yeah, I agree with the goal of this post but n...",agree,2
42892,You are a classification Chatbot. Given a comm...,How about ... no? This is strange. Community o...,"I know, it feels strange too. We wouldn't hold...",agree,2


In [7]:
df['prompt'] = None

def make_prompt(row):

    prompt = "System Prompt: " + row["system_prompt"] + "; Comment: " + row["comment"] + "; Reply: " + row["reply"]

    return prompt



df['prompt'] = df.apply(lambda row: make_prompt(row), axis = 1)
df

Unnamed: 0,system_prompt,comment,reply,label,target,prompt
0,You are a classification Chatbot. Given a comm...,"I live in rural Saskatchewan, Canada. We have ...",I'm in NE USA we've had 3 in two years...all e...,neutral,1,System Prompt: You are a classification Chatbo...
1,You are a classification Chatbot. Given a comm...,"I live in rural Saskatchewan, Canada. We have ...",One hundred year flood just means a one in one...,neutral,1,System Prompt: You are a classification Chatbo...
2,You are a classification Chatbot. Given a comm...,Convince her of what? That it's happening or t...,That anthropocentric climate change is actuall...,neutral,1,System Prompt: You are a classification Chatbo...
3,You are a classification Chatbot. Given a comm...,I think this prediction is about as valid as s...,It's January. Literally no one said it would b...,disagree,0,System Prompt: You are a classification Chatbo...
4,You are a classification Chatbot. Given a comm...,"Mann hasn't *been* honest in decades, so I'm c...",There have been a dozen re-constructions of Ma...,disagree,0,System Prompt: You are a classification Chatbo...
...,...,...,...,...,...,...
42889,You are a classification Chatbot. Given a comm...,Not trying to spark an argument but a legitima...,Keeping in mind that the Palestinians killed m...,neutral,1,System Prompt: You are a classification Chatbo...
42890,You are a classification Chatbot. Given a comm...,Y'all saw Guilianis hail Mary right? Get his s...,"Same I want these assholes in jail, full stop....",agree,2,System Prompt: You are a classification Chatbo...
42891,You are a classification Chatbot. Given a comm...,>Why don't I see ads holding Republicans accou...,"Yeah, I agree with the goal of this post but n...",agree,2,System Prompt: You are a classification Chatbo...
42892,You are a classification Chatbot. Given a comm...,How about ... no? This is strange. Community o...,"I know, it feels strange too. We wouldn't hold...",agree,2,System Prompt: You are a classification Chatbo...


### Train/Test Split

Make train/val/test split by time order!

In [8]:
# Split the DataFrame
train_size = 0.8
eval_size = 0.1

# Determine splitting indexes (ordered by time)
train_end = int(train_size * len(df))
eval_end = train_end + int(eval_size * len(df))

# Split the data
X_train = df[:train_end]
X_eval = df[train_end:eval_end]
X_test = df[eval_end:]

### Convert from Pandas DataFrame to Hugging Face Dataset
* Also let's shuffle the training set.
* We put the components train,val,test into a DatasetDict so we can access them later with HF trainer.
* Later we will add a tokenized dataset

In [9]:
X_train_dataset = Dataset.from_pandas(X_train.drop('label', axis = 1))
X_eval_dataset = Dataset.from_pandas(X_eval.drop('label', axis = 1))
X_test_dataset = Dataset.from_pandas(X_test.drop('label', axis = 1))

X_test_dataset

Dataset({
    features: ['system_prompt', 'comment', 'reply', 'target', 'prompt'],
    num_rows: 4290
})

Shuffle training data --> apparently this helps with performance

In [10]:
X_train_dataset_shuffle = X_train_dataset.shuffle(seed = 42)

In [11]:
dataset = DatasetDict({
    'train' : X_train_dataset_shuffle,
    'val' : X_eval_dataset,
    'test' : X_test_dataset
})
dataset

DatasetDict({
    train: Dataset({
        features: ['system_prompt', 'comment', 'reply', 'target', 'prompt'],
        num_rows: 34315
    })
    val: Dataset({
        features: ['system_prompt', 'comment', 'reply', 'target', 'prompt'],
        num_rows: 4289
    })
    test: Dataset({
        features: ['system_prompt', 'comment', 'reply', 'target', 'prompt'],
        num_rows: 4290
    })
})

Check distributions

In [12]:
X_train.target.value_counts(normalize = True)

target
0    0.400670
2    0.339239
1    0.260090
Name: proportion, dtype: float64

### Class Weights

* Since our classes are not balanced let's calculate class weights based on inverse value counts
* Convert to pytorch tensor since we will need it

In [13]:
# invert the weights
class_weights = (1/X_train.target.value_counts(normalize = True).sort_index()).to_list()

# make a tensor
class_weights = torch.tensor(class_weights)

# make them sum to one
class_weights = class_weights/class_weights.sum()
class_weights

tensor([0.2687, 0.4139, 0.3174])


## **Load the Model**

Apparently, meta recommends the base version of the model for finetuning [source](https://www.youtube.com/watch?v=YJNbgusTSF0)

* load model with 4bit quantization (as specified in bits and bytes) and prepare model for peft training

In [14]:
model_name = "meta-llama/Llama-3.1-8B" 

### Quantization for QLoRA

In [15]:
quantization_config = BitsAndBytesConfig(
    load_in_4bit = True, # enable 4 bit quantization
    bnb_4bit_quant_type = 'nf4', # information theoretically optimal dtype for normally distributed weights
    bnb_4bit_use_double_quant = True, # quantize quantized weights
    bnb_4bit_compute_dtype = torch.bfloat16 # optimized fp format for ML
)

### LoRA Config

In [16]:
lora_config = LoraConfig(
    r = 16, # dimensions of low-rank matrices
    lora_alpha = 8, # scaling factor (trade-off) for LoRA activations vs. pretrained weight activations
    target_modules = ['q_proj', 'k_proj', 'v_proj', 'o_proj'], # where to apply LoRA to
    lora_dropout = 0.05, # drop out probability of LoRA layers, to prevent overfitting
    bias = 'none', # wether to train bias weights, set to 'none' for attention layers
    task_type = 'SEQ_CLS'
)

### Load the Model

* AutoModelForSequenceClassification --> used for classifications
* Num of labels is # of classes

In [18]:
torch.cuda.empty_cache()

In [17]:
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    quantization_config = quantization_config,
    num_labels = 3
)

model

`low_cpu_mem_usage` was None, now default to True since model is quantized.


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Some weights of LlamaForSequenceClassification were not initialized from the model checkpoint at meta-llama/Llama-3.1-8B and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


LlamaForSequenceClassification(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
      )
    )
    (norm): Lla

### Function to preprocess quantized model for training

In [18]:
model = prepare_model_for_kbit_training(model)
model

LlamaForSequenceClassification(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
      )
    )
    (norm): Lla

### Prepare for PEFT training

`get_peft_model()` to prepare the model for training with PEFT method such as LoRA by wrapping the base model and PEFT configuration with `get_peft_model()`

In [19]:
model = get_peft_model(model, lora_config)
model

PeftModelForSequenceClassification(
  (base_model): LoraModel(
    (model): LlamaForSequenceClassification(
      (model): LlamaModel(
        (embed_tokens): Embedding(128256, 4096)
        (layers): ModuleList(
          (0-31): 32 x LlamaDecoderLayer(
            (self_attn): LlamaAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )

## **Tokenizer**

### Since LLAMA3 pre-training doesn't have EOS token
* Set the pad_token_id to eos_token_id
* Set pad token ot eos_token

In [20]:
tokenizer = AutoTokenizer.from_pretrained(model_name, add_prefix_space = True)

tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.pad_token = tokenizer.eos_token

tokenizer_config.json:   0%|          | 0.00/50.5k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/73.0 [00:00<?, ?B/s]

### Update Model Configurations

In [21]:
model.config.pad_token_id = tokenizer.pad_token_id
model.config.use_cache = False
model.config.pretraining_pt = 1

### Apply model to get performance prior to training

* Use batch size 32 to vectorize and avoid memory errors

In [22]:
X_test.iloc[100]['prompt'] 

'System Prompt: You are a classification Chatbot. Given a comment and a reply, you classify whether the reply agrees, disagrees or is neutral towards the comment.\n        You only reply with either "agree", "disagree" or "neutral" and nothing else.; Comment: luckily they won\'t retake the house. Has this piece of shit ever been on any other news channel other than FoxNews, OAN, and NewsMax? This man only speaks and care about his base, he doesn\'t give a shit about the rest of America.; Reply: I hope you\'re right but given that the GOP made gains in 2020, let\'s not be too overconfident 2022 won\'t have a red wave.'

In [23]:
torch.cuda.empty_cache()

# Convert summaries to a list
sentences = X_test.prompt.tolist()

# Define the batch size
batch_size = 8  # You can adjust this based on your system's memory capacity

# Initialize an empty list to store the model outputs
all_outputs = []

# Process the sentences in batches
for i in tqdm(range(0, len(sentences), batch_size)):
    # Get the batch of sentences
    batch_sentences = sentences[i:i + batch_size]

    # Tokenize the batch
    inputs = tokenizer(batch_sentences, return_tensors="pt", padding=True, truncation=True, max_length=512)

    # Move tensors to the device where the model is (e.g., GPU or CPU)
    inputs = {k: v.to('cuda' if torch.cuda.is_available() else 'cpu') for k, v in inputs.items()}

    # Perform inference and store the logits
    with torch.no_grad():
        outputs = model(**inputs)
        all_outputs.append(outputs['logits'])



100%|██████████| 537/537 [04:55<00:00,  1.82it/s]


Concatenate Outputs into a single tensor

In [24]:
final_outputs = torch.cat(all_outputs, dim=0)
final_outputs

tensor([[ 2.9467,  0.7544,  1.6139],
        [ 4.4751, -2.0388,  0.5225],
        [ 2.6135,  1.0128, -1.0465],
        ...,
        [ 3.2750, -6.1625, -0.0842],
        [ 2.0263, -1.7200, -1.3849],
        [ 5.0210, -2.1296,  0.7276]], device='cuda:0')

Get arg_max

In [25]:
final_outputs.argmax(axis=1)

tensor([0, 0, 0,  ..., 0, 0, 0], device='cuda:0')

Extract predictions and turn to label

In [26]:
X_test['predictions_initial']=final_outputs.argmax(axis=1).cpu().numpy()
X_test['predictions_initial']

X_test['predictions_initial']=X_test['predictions_initial'].apply(lambda l:int_to_label[l])
X_test['predictions_initial']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_test['predictions_initial']=final_outputs.argmax(axis=1).cpu().numpy()
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_test['predictions_initial']=X_test['predictions_initial'].apply(lambda l:int_to_label[l])


38604    disagree
38605    disagree
38606    disagree
38607    disagree
38608    disagree
           ...   
42889    disagree
42890       agree
42891    disagree
42892    disagree
42893    disagree
Name: predictions_initial, Length: 4290, dtype: object

In [27]:
X_test.to_csv("output/Llama_3.1_8B_initial_X_test.csv", index = False)

In [28]:
X_test = pd.read_csv("output/Llama_3.1_8B_initial_X_test.csv")

## **Performance**

In [46]:
def get_performance_metrics(df_test, pred_col):
  y_test = df_test.label
  y_pred = df_test[pred_col]

  print("Confusion Matrix:")
  print(confusion_matrix(y_test, y_pred))

  print("\nClassification Report:")
  print(classification_report(y_test, y_pred))

  print("Balanced Accuracy Score:", balanced_accuracy_score(y_test, y_pred))
  print("Accuracy Score:", accuracy_score(y_test, y_pred))

In [27]:
get_performance_metrics(X_test, 'predictions_initial')

Confusion Matrix:
[[ 232 1325   29]
 [ 249 1352   23]
 [ 186  867   27]]

Classification Report:
              precision    recall  f1-score   support

       agree       0.35      0.15      0.21      1586
    disagree       0.38      0.83      0.52      1624
     neutral       0.34      0.03      0.05      1080

    accuracy                           0.38      4290
   macro avg       0.36      0.33      0.26      4290
weighted avg       0.36      0.38      0.29      4290

Balanced Accuracy Score: 0.334597421609858
Accuracy Score: 0.37552447552447554


## **Trainer**

* model
* tokenizer
* training arguments
* train dataset
* eval dataset
* Data Collater
* Compute Metrics
* class_weights: In our case since we are using a custom trainer so we can use a weighted loss we will subclass trainer and define the custom loss.

### Create LLAMA tokenized dataset which will house our train/val parts during the training process but after applying tokenization

In [30]:
MAX_LEN = 512
col_to_delete = ['system_prompt', 'comment', 'reply', 'prompt']

def llama_preprocessing_function(examples):
    return tokenizer(examples['prompt'], truncation=True, max_length=MAX_LEN)

tokenized_datasets = dataset.map(llama_preprocessing_function, batched=True, remove_columns=col_to_delete)
tokenized_datasets = tokenized_datasets.rename_column("target", "label")
tokenized_datasets.set_format("torch")

Map:   0%|          | 0/34315 [00:00<?, ? examples/s]

Map:   0%|          | 0/4289 [00:00<?, ? examples/s]

Map:   0%|          | 0/4290 [00:00<?, ? examples/s]

## Data Collator
A **data collator** prepares batches of data for training or inference in machine learning, ensuring uniform formatting and adherence to model input requirements. This is especially crucial for variable-sized inputs like text sequences.

### Functions of Data Collator

1. **Padding:** Uniformly pads sequences to the length of the longest sequence using a special token, allowing simultaneous batch processing.
2. **Batching:** Groups individual data points into batches for efficient processing.
3. **Handling Special Tokens:** Adds necessary special tokens to sequences.
4. **Converting to Tensor:** Transforms data into tensors, the required format for machine learning frameworks.

### `DataCollatorWithPadding`

The `DataCollatorWithPadding` specifically manages padding, using a tokenizer to ensure that all sequences are padded to the same length for consistent model input.

- **Syntax:** `collate_fn = DataCollatorWithPadding(tokenizer=tokenizer)`
- **Purpose:** Automatically pads text data to the longest sequence in a batch, crucial for models like BERT or GPT.
- **Tokenizer:** Uses the provided `tokenizer` for sequence processing, respecting model-specific vocabulary and formatting rules.

This collator is commonly used with libraries like Hugging Face's Transformers, facilitating data preprocessing for various NLP models.


In [31]:
collate_fn = DataCollatorWithPadding(tokenizer=tokenizer)


### Metrics for Evaluation

In [32]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {'balanced_accuracy' : balanced_accuracy_score(predictions, labels),'accuracy':accuracy_score(predictions,labels)}


### Define Custom Trainer
* We will have a custom loss function that deals with the class weights and have class weights as additional argument in constructor

In [33]:
class CustomTrainer(Trainer):
    def __init__(self, *args, class_weights=None, **kwargs):
        super().__init__(*args, **kwargs)
        # Ensure label_weights is a tensor
        if class_weights is not None:
            self.class_weights = torch.tensor(class_weights, dtype=torch.float32).to(self.args.device)
        else:
            self.class_weights = None

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # Extract labels and convert them to long type for cross_entropy
        labels = inputs.pop("labels").long()

        # Forward pass
        outputs = model(**inputs)

        # Extract logits assuming they are directly outputted by the model
        logits = outputs.get('logits')

        # Compute custom loss with class weights for imbalanced data handling
        if self.class_weights is not None:
            loss = F.cross_entropy(logits, labels, weight=self.class_weights)
        else:
            loss = F.cross_entropy(logits, labels)

        return (loss, outputs) if return_outputs else loss


### Training Arguments

In [34]:
training_args = TrainingArguments(
    output_dir = 'Llama_3.1_8B_training',
    learning_rate = 1e-4,
    per_device_train_batch_size = 8,
    per_device_eval_batch_size = 8,
    num_train_epochs = 2,
    weight_decay = 0.01,
    evaluation_strategy = 'epoch',
    save_strategy = 'epoch',
    load_best_model_at_end = True
)



In [35]:
trainer = CustomTrainer(
    model = model,
    args = training_args,
    train_dataset = tokenized_datasets['train'],
    eval_dataset = tokenized_datasets['val'],
    tokenizer = tokenizer,
    data_collator = collate_fn,
    compute_metrics = compute_metrics,
    class_weights=class_weights,
)

  super().__init__(*args, **kwargs)
  self.class_weights = torch.tensor(class_weights, dtype=torch.float32).to(self.args.device)


## **Run the Trainer**

In [36]:
train_result = trainer.train()

[34m[1mwandb[0m: Currently logged in as: [33melena-solar[0m ([33melena-solar-university-of-konstanz[0m). Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


  return fn(*args, **kwargs)


Epoch,Training Loss,Validation Loss,Balanced Accuracy,Accuracy
1,0.7229,0.727595,0.712044,0.673817
2,0.601,0.697533,0.715352,0.724644


  return fn(*args, **kwargs)


## **Save the model**

In [37]:
metrics = train_result.metrics
max_train_samples = len(X_train)
metrics["train_samples"] = min(max_train_samples, len(X_train))
trainer.log_metrics("train", metrics)
trainer.save_metrics("train", metrics)
trainer.save_state()

***** train metrics *****
  epoch                    =         2.0
  total_flos               = 551298265GF
  train_loss               =      0.7208
  train_runtime            =  4:21:44.35
  train_samples            =       34315
  train_samples_per_second =        4.37
  train_steps_per_second   =       0.546


In [38]:
trainer.save_model("Llama_3.1_8B_saved_model")

In [4]:
# reimport the model

tokenizer = AutoTokenizer.from_pretrained("Llama_3.1_8B_saved_model")
model = AutoModelForSequenceClassification.from_pretrained("Llama_3.1_8B_saved_model")

Loading checkpoint shards:   0%|          | 0/30 [00:00<?, ?it/s]

KeyboardInterrupt: 

## **Evaluation**

In [40]:
X_test

Unnamed: 0,system_prompt,comment,reply,label,target,prompt,predictions_initial
0,You are a classification Chatbot. Given a comm...,It's so nice having a FLOTUS who's facial expr...,Melanoma's squinty cat face always looked to m...,agree,2,System Prompt: You are a classification Chatbo...,disagree
1,You are a classification Chatbot. Given a comm...,Because Mitch McConnell indicated he's voting ...,I think it's worth it because the more we air ...,disagree,0,System Prompt: You are a classification Chatbo...,disagree
2,You are a classification Chatbot. Given a comm...,How about some stimulus checks and a decent st...,"You get this was an executive action, not legi...",disagree,0,System Prompt: You are a classification Chatbo...,disagree
3,You are a classification Chatbot. Given a comm...,Satire feels appropriate. I'd like one dose of...,Are you saying they didn't know or understand ...,disagree,0,System Prompt: You are a classification Chatbo...,disagree
4,You are a classification Chatbot. Given a comm...,I actually didn't want to upload these particu...,To be fair they are just reporting what Brexit...,neutral,1,System Prompt: You are a classification Chatbo...,disagree
...,...,...,...,...,...,...,...
4285,You are a classification Chatbot. Given a comm...,Not trying to spark an argument but a legitima...,Keeping in mind that the Palestinians killed m...,neutral,1,System Prompt: You are a classification Chatbo...,disagree
4286,You are a classification Chatbot. Given a comm...,Y'all saw Guilianis hail Mary right? Get his s...,"Same I want these assholes in jail, full stop....",agree,2,System Prompt: You are a classification Chatbo...,agree
4287,You are a classification Chatbot. Given a comm...,>Why don't I see ads holding Republicans accou...,"Yeah, I agree with the goal of this post but n...",agree,2,System Prompt: You are a classification Chatbo...,disagree
4288,You are a classification Chatbot. Given a comm...,How about ... no? This is strange. Community o...,"I know, it feels strange too. We wouldn't hold...",agree,2,System Prompt: You are a classification Chatbo...,disagree


In [44]:
def make_predictions(model,df_test):


  # Convert summaries to a list
  sentences = df_test.prompt.tolist()

  # Define the batch size
  batch_size = 32  # You can adjust this based on your system's memory capacity

  # Initialize an empty list to store the model outputs
  all_outputs = []

  # Process the sentences in batches
  for i in tqdm(range(0, len(sentences), batch_size)):
      # Get the batch of sentences
      batch_sentences = sentences[i:i + batch_size]

      # Tokenize the batch
      inputs = tokenizer(batch_sentences, return_tensors="pt", padding=True, truncation=True, max_length=512)

      # Move tensors to the device where the model is (e.g., GPU or CPU)
      inputs = {k: v.to('cuda' if torch.cuda.is_available() else 'cpu') for k, v in inputs.items()}

      # Perform inference and store the logits
      with torch.no_grad():
          outputs = model(**inputs)
          all_outputs.append(outputs['logits'])
  final_outputs = torch.cat(all_outputs, dim=0)
  df_test['predictions_ft']=final_outputs.argmax(axis=1).cpu().numpy()
  df_test['predictions_ft']=df_test['predictions_ft'].apply(lambda l:int_to_label[l])


make_predictions(model, X_test)

100%|██████████| 135/135 [05:01<00:00,  2.23s/it]


In [48]:
X_test.to_csv("output/Llama_3.1_8B_ft_X_test.csv", index = False)

In [None]:
X_test = pd.read_csv("output/Llama_3.1_8B_ft_X_test.csv")

In [47]:
get_performance_metrics(X_test, 'predictions_ft')

Confusion Matrix:
[[1213  163  210]
 [ 156 1277  191]
 [ 216  263  601]]

Classification Report:
              precision    recall  f1-score   support

       agree       0.77      0.76      0.77      1586
    disagree       0.75      0.79      0.77      1624
     neutral       0.60      0.56      0.58      1080

    accuracy                           0.72      4290
   macro avg       0.70      0.70      0.70      4290
weighted avg       0.72      0.72      0.72      4290

Balanced Accuracy Score: 0.7025428936018723
Accuracy Score: 0.7205128205128205
