Code adapted from https://www.kaggle.com/code/lucamassaron/fine-tune-llama-3-for-sentiment-analysis#Fine-tuning

# Description and contents
In this notebook, we
1. Process our data for finetuning our sentiment analyzer<br>
&nbsp; - We save the processed data to 2_data_dict.pkl<br><br>
2. Evaluate the performance of the standard Llama 3.2 LLM as a sentiment analyzer for our reviews<br>
&nbsp; - We save the evaluation results to 2_base_eval.json<br><br>
3. Experiment in writing code to finetune a sentiment analyzer for our 'co' rating category<br><br>
4. Experiment in writing code to finetune sentiment analyzers for all rating categories<br>
&nbsp; - The actual finetuning is performed with ./finetune/3_finetune.py and 3_run_finetune.bat<br>
&nbsp; - The batch file loops through each rating category to perform the finetuning; create a shell script if running in a Unix-based environment<br><br>
5. Experiment in writing code to evaluate the performance of our finetuned sentiment analyzers<br>
&nbsp; - The actual evaluation is performed with ./finetune/3_test_eval.py and 3_run_test_eval.bat<br>
&nbsp; - The batch file loops through each rating category to perform the evaluations; create a shell script if running in a Unix-based environment<br><br>
6. Process our firm summaries for sentiment prediction with our finetuned models<br>
&nbsp; - We save the processed data to 2_summ_text.csv

Contents:
1. [Imports, installs, and setup params](#imports)
2. [Load data](#load)
3. [Process data](#process)
4. [Evaluate base Llama 3.2 performance](#std_eval)
5. [Experimental code to finetune for 'co' category](#co_ft)
6. [Experimental code to finetune for all categories](#all_ft)
7. [Experimental code to evaluate finetuned model performance](#all_eval)
8. [Create firm summary dataset for prediction](#createdata)

---
# Imports, installs, and setup params <a name="imports"></a>

In [1]:
import warnings
warnings.filterwarnings("ignore")

In [2]:
import pickle
import numpy as np
import pandas as pd
import os
import json
from tqdm import tqdm
import bitsandbytes as bnb
import torch
import torch.nn as nn
import transformers
from datasets import Dataset
from peft import LoraConfig, PeftConfig, PeftModel
from trl import SFTTrainer
from trl import setup_chat_format
from transformers import (AutoModelForCausalLM, 
                            AutoTokenizer, 
                            BitsAndBytesConfig, 
                            TrainingArguments, 
                            pipeline, 
                            logging,
                            EarlyStoppingCallback, 
                            IntervalStrategy)
from sklearn.metrics import (accuracy_score, 
                            classification_report, 
                            confusion_matrix,
                            recall_score, 
                            precision_score, 
                            f1_score)
from sklearn.model_selection import train_test_split

In [3]:
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [4]:
print(f"pytorch version {torch.__version__}")
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(f"working on {device}")

pytorch version 2.5.1+cu124
working on cuda:0


In [5]:
torch.backends.cuda.enable_mem_efficient_sdp(False)
torch.backends.cuda.enable_flash_sdp(False)

---
# Load data <a name="load"></a>

---
# Process data <a name="process"></a>

In [15]:
def generate_prompt(data_point):
    return f"""
On the topic of {data_point["category_full"]}, analyze the sentiment of the company review enclosed in square brackets,
determine if it is excellent, good, neutral, bad, or terrible, and return the answer as 
the corresponding sentiment label "excellent" or "good" or "neutral" or "bad" or "terrible".

[{data_point["text"]}] = {data_point["sentiment"]}
""".strip()

def generate_test_prompt(data_point):
    return f"""
On the topic of {data_point["category_full"]}, analyze the sentiment of the company review enclosed in square brackets,
determine if it is excellent, good, neutral, bad, or terrible, and return the answer as 
the corresponding sentiment label "excellent" or "good" or "neutral" or "bad" or "terrible".

[{data_point["text"]}] = """.strip()
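To make the prompt format concrete, the sketch below renders the inference prompt for a single made-up review. The function name `build_test_prompt`, the category string, and the review text are assumptions for illustration only. The wording of the instructions is copied from `generate_test_prompt`. The training prompt differs only in that the sentiment label follows the trailing `=`.

```python
def build_test_prompt(category_full: str, text: str) -> str:
    """Render the inference prompt: instructions, bracketed review, trailing '='."""
    return f"""
On the topic of {category_full}, analyze the sentiment of the company review enclosed in square brackets,
determine if it is excellent, good, neutral, bad, or terrible, and return the answer as 
the corresponding sentiment label "excellent" or "good" or "neutral" or "bad" or "terrible".

[{text}] = """.strip()


# Hypothetical data point, not taken from the dataset
example_prompt = build_test_prompt("culture and values",
                                   "Supportive team, but long hours.")
print(example_prompt)
```

Because of the final `.strip()`, the rendered prompt ends with `=` and no trailing space, so the model is expected to continue with the label itself.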

---

In [21]:
with open('2_data_dict.pkl','rb') as f:
    data_dict = pickle.load(f)

# Evaluate base Llama 3.2 performance <a name="std_eval"></a>

In [13]:
model_name = "meta-llama/Llama-3.2-3B"

compute_dtype = getattr(torch, "float16")

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=False,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map=device,
    torch_dtype=compute_dtype,
    quantization_config=bnb_config, 
)

model.config.use_cache = False
model.config.pretraining_tp = 1

max_seq_length = 512 #2048
tokenizer = AutoTokenizer.from_pretrained(model_name, max_seq_length=max_seq_length)
tokenizer.pad_token_id = tokenizer.eos_token_id

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [10]:
def predict(test, model, tokenizer):
    # Build the generation pipeline once instead of once per prompt
    pipe = pipeline(task="text-generation", 
                    model=model, 
                    tokenizer=tokenizer, 
                    max_new_tokens = 3,
                    #temperature = 0.7,
                    do_sample = False, # greedy decoding, i.e. a deterministic process
                   )
    # Labels are checked in this order; the first one found in the answer wins
    labels = ["terrible", "bad", "neutral", "good", "excellent"]
    y_pred = []
    replies = []
    for i in tqdm(range(len(test))):
        prompt = test.iloc[i]["prompt"]
        result = pipe(prompt)
        generated = result[0]['generated_text']
        replies.append(generated)
        # The prompt ends with "=", so the text after the last "=" is treated as the answer
        answer = generated.split("=")[-1]
        y_pred.append(next((label for label in labels if label in answer), "none"))
    return y_pred, replies
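The label extraction inside `predict` can be tried without loading a model. The sketch below isolates that step in a helper named `parse_label` (a hypothetical name, not part of the notebook) and runs it on made-up generated strings.

```python
# Labels in the same order in which predict() checks them
LABELS = ["terrible", "bad", "neutral", "good", "excellent"]


def parse_label(generated_text: str) -> str:
    """Return the first known label found after the last '=', else 'none'."""
    answer = generated_text.split("=")[-1]
    for label in LABELS:
        if label in answer:
            return label
    return "none"


# Made-up generations, for illustration only
print(parse_label("[Great pay] = excellent"))      # a clean answer
print(parse_label("[a = b] = not bad at all"))     # substring match on "bad"
print(parse_label("[Great pay] = The review is"))  # no label in the answer
```

Two properties are worth noting: matching is by substring, so "not bad" is read as "bad", and when several labels appear in the answer the one checked first takes precedence.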

In [11]:
def evaluate(y_true, y_pred, verbose=False, print_reports=True):
    # 'none' (no label found in the reply) is scored as neutral
    mapping = {'terrible':0, 'bad':1, 'neutral':2, 'none':2, 'good':3, 'excellent':4}
    def map_func(x):
        return mapping.get(x, 2)  # any unexpected value is also scored as neutral
    
    y_true = np.vectorize(map_func)(y_true)
    y_pred = np.vectorize(map_func)(y_pred)
    if verbose:
        print(f'y_true: {y_true}')
        print(f'y_pred: {y_pred}')
    
    # Calculate overall accuracy
    accuracy = accuracy_score(y_true=y_true, y_pred=y_pred)
    
    # Calculate accuracy per true label (kept separate from the overall accuracy)
    label_accuracy = {}
    for label in sorted(set(y_true)):
        mask = y_true == label
        label_accuracy[label] = accuracy_score(y_true[mask], y_pred[mask])
        
    # Generate classification report
    class_report = classification_report(y_true=y_true, y_pred=y_pred)
    
    # Generate confusion matrix
    conf_matrix = confusion_matrix(y_true=y_true, y_pred=y_pred, labels=[0, 1, 2, 3, 4])
    
    if print_reports:
        print(f'Accuracy: {accuracy:.3f}')
        for label, acc in label_accuracy.items():
            print(f'Accuracy for label {label}: {acc:.3f}')
        print('\nClassification Report:')
        print(class_report)
        print('\nConfusion Matrix:')
        print(conf_matrix)
    
    return class_report, conf_matrix

In [271]:
to_predict = data_dict['test']['co'].iloc[:]
y_pred, results = predict(to_predict, model, tokenizer)

100%|██████████████████████████████████████████████████████████████████████████████| 1500/1500 [04:28<00:00,  5.58it/s]


In [272]:
y_true = data_dict['y_true']['co'][:]
evaluate(y_true, y_pred, verbose=False)

Accuracy: 0.277
Accuracy for label 0: 0.077
Accuracy for label 1: 0.513
Accuracy for label 2: 0.007
Accuracy for label 3: 0.353
Accuracy for label 4: 0.433

Classification Report:
              precision    recall  f1-score   support

           0       0.64      0.08      0.14       300
           1       0.28      0.51      0.36       300
           2       0.50      0.01      0.01       300
           3       0.29      0.35      0.32       300
           4       0.24      0.43      0.31       300

    accuracy                           0.28      1500
   macro avg       0.39      0.28      0.23      1500
weighted avg       0.39      0.28      0.23      1500


Confusion Matrix:
[[ 23 182   1  25  69]
 [  6 154   1  41  98]
 [  3 100   2  80 115]
 [  2  61   0 106 131]
 [  2  50   0 118 130]]
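The accuracy figures above can be recomputed from the confusion matrix alone, which is a quick consistency check on the report. The sketch below uses plain Python and the matrix printed above (rows are true labels, columns are predicted labels, in the order terrible, bad, neutral, good, excellent).

```python
# Confusion matrix of the base model on the 'co' test set, copied from the output above
conf_matrix = [
    [23, 182, 1, 25, 69],
    [6, 154, 1, 41, 98],
    [3, 100, 2, 80, 115],
    [2, 61, 0, 106, 131],
    [2, 50, 0, 118, 130],
]

n_labels = len(conf_matrix)
total = sum(sum(row) for row in conf_matrix)
correct = sum(conf_matrix[i][i] for i in range(n_labels))

# Overall accuracy: share of predictions on the diagonal
overall_accuracy = correct / total

# Per-label accuracy: diagonal entry divided by its row sum (equal to recall)
per_label_accuracy = [row[i] / sum(row) for i, row in enumerate(conf_matrix)]

# Column sums: how often each label was predicted
predicted_counts = [sum(row[j] for row in conf_matrix) for j in range(n_labels)]

print(f"{overall_accuracy:.3f}")
print([f"{a:.3f}" for a in per_label_accuracy])
print(predicted_counts)
```

The column sums show that the base model predicts "bad" and "excellent" for most reviews and almost never predicts "terrible" or "neutral", which explains the very low per-label accuracy for labels 0 and 2.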


In [41]:
base_eval = {}
for key in data_dict['train'].keys():
    to_predict = data_dict['test'][key].iloc[:]
    y_pred, _ = predict(to_predict, model, tokenizer)
    y_true = data_dict['y_true'][key][:]
    class_report, conf_matrix = evaluate(y_true, y_pred, verbose=False, print_reports=False)
    base_eval[key]={'y_true':y_true.to_list(),
                     'y_pred':y_pred,
                     'cls_report':class_report,
                     'conf_matrix':conf_matrix.tolist()}

100%|██████████████████████████████████████████████████████████████████████████████| 1500/1500 [05:07<00:00,  4.88it/s]


Accuracy: 0.277


100%|██████████████████████████████████████████████████████████████████████████████| 1500/1500 [04:58<00:00,  5.02it/s]


Accuracy: 0.251


100%|██████████████████████████████████████████████████████████████████████████████| 1500/1500 [05:03<00:00,  4.94it/s]


Accuracy: 0.275


100%|██████████████████████████████████████████████████████████████████████████████| 1500/1500 [05:02<00:00,  4.95it/s]


Accuracy: 0.281


100%|██████████████████████████████████████████████████████████████████████████████| 1500/1500 [05:05<00:00,  4.91it/s]


Accuracy: 0.259


100%|██████████████████████████████████████████████████████████████████████████████| 1500/1500 [05:05<00:00,  4.90it/s]

Accuracy: 0.253





In [7]:
with open('2_base_eval.json','r') as f:
    base_eval = json.load(f)
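The cell that writes 2_base_eval.json is not shown in this notebook. A minimal sketch of the save and reload round trip follows, assuming the dictionary only holds lists and strings, as built in the loop above. The miniature dictionary and the temporary directory are illustrative assumptions.

```python
import json
import os
import tempfile

# Hypothetical miniature of base_eval: one category, two reviews
base_eval_example = {
    "co": {
        "y_true": ["good", "bad"],
        "y_pred": ["good", "none"],
        "cls_report": "(text of the classification report)",
        "conf_matrix": [[1, 0], [0, 1]],
    }
}

# Write to a temporary directory so that no real file is overwritten
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "2_base_eval.json")
    with open(path, "w") as f:
        json.dump(base_eval_example, f)
    with open(path, "r") as f:
        reloaded = json.load(f)

print(reloaded == base_eval_example)
```

NumPy arrays and pandas Series are not JSON serializable, which is why the evaluation loop converts `y_true` and `conf_matrix` to plain lists before storing them.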

---
# Fine-tuning for 'co' category (experiment) <a name="co_ft"></a>

In [283]:
output_dir="./trained_weights/co"

In [45]:
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj",],
)

training_arguments = TrainingArguments(
    output_dir=output_dir,                    # directory for checkpoints and logs
    num_train_epochs=5,                       # number of training epochs
    per_device_train_batch_size=1,            # batch size per device during training
    gradient_accumulation_steps=8,            # number of batches to accumulate gradients over before an optimizer update
    gradient_checkpointing=True,              # use gradient checkpointing to save memory
    optim="paged_adamw_32bit",
    save_steps=0,
    logging_steps=25,                         # log every 25 steps
    learning_rate=2e-4,                       # learning rate, based on QLoRA paper
    weight_decay=0.001,
    fp16=True,
    bf16=False,
    max_grad_norm=0.3,                        # max gradient norm based on QLoRA paper
    max_steps=-1,
    warmup_ratio=0.03,                        # warmup ratio based on QLoRA paper
    group_by_length=False,
    lr_scheduler_type="cosine",               # use cosine learning rate scheduler
    report_to="tensorboard",                  # report metrics to tensorboard
    #evaluation_strategy="steps",              # evaluate every `eval_steps` steps
    #load_best_model_at_end = True,
    #eval_steps = 25,
    #metric_for_best_model = 'accuracy',
)
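A few derived quantities follow directly from the settings above. The sketch below works them out in plain Python. The training-set size `n_train_examples` is a hypothetical value, since the actual size is not shown here, and the step counts are approximations: the trainer's own rounding may differ between library versions.

```python
import math

# Values copied from the configuration above
per_device_train_batch_size = 1
gradient_accumulation_steps = 8
num_train_epochs = 5
warmup_ratio = 0.03
lora_alpha = 16
lora_r = 64

# Hypothetical training-set size, for illustration only
n_train_examples = 4000

# Examples contributing to each optimizer update on a single GPU
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps

# Approximate number of optimizer updates
updates_per_epoch = math.ceil(n_train_examples / effective_batch_size)
total_updates = updates_per_epoch * num_train_epochs
warmup_updates = math.ceil(total_updates * warmup_ratio)

# Standard LoRA scaling factor applied to the low-rank update
lora_scaling = lora_alpha / lora_r

print(effective_batch_size, updates_per_epoch, total_updates, warmup_updates)
print(lora_scaling)
```

With `lora_alpha=16` and `r=64`, the adapter output is scaled by 0.25, so the relatively high rank is paired with a fairly small scaling factor.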

In [291]:
to_predict = data_dict['test']['co'].iloc[:]
y_pred, results = predict(to_predict, model, tokenizer)

  0%|                                                                                         | 0/1500 [00:00<?, ?it/s]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
100%|██████████████████████████████████████████████████████████████████████████████| 1500/1500 [07:22<00:00,  3.39it/s]


In [292]:
y_true = data_dict['y_true']['co'][:]
evaluate(y_true, y_pred, verbose=False)

Accuracy: 0.468
Accuracy for label 0: 0.597
Accuracy for label 1: 0.507
Accuracy for label 2: 0.310
Accuracy for label 3: 0.397
Accuracy for label 4: 0.530

Classification Report:
              precision    recall  f1-score   support

           0       0.63      0.60      0.61       300
           1       0.38      0.51      0.44       300
           2       0.35      0.31      0.33       300
           3       0.40      0.40      0.40       300
           4       0.63      0.53      0.57       300

    accuracy                           0.47      1500
   macro avg       0.48      0.47      0.47      1500
weighted avg       0.48      0.47      0.47      1500


Confusion Matrix:
[[179  89  24   7   1]
 [ 66 152  57  23   2]
 [ 32  91  93  63  21]
 [  2  54  54 119  71]
 [  4   9  40  88 159]]


In [300]:
evaluation = pd.DataFrame({'prompt': data_dict['test']['co']['prompt'], 
                           'y_true':y_true, 
                           'y_pred': y_pred},
                         )
evaluation.to_csv("./finetune/co_test_predictions.csv", index=False)

---
# Fine-tuning for other rating categories (experiment) <a name="all_ft"></a>

In [47]:
cats = list(data_dict['train'].keys())
cats.remove('co')
cats

['cb', 'sm', 'cv', 'di', 'wlb']

Do not use the loop in this notebook for fine-tuning.<br>
Running the fine-tuning in a notebook loop likely prevents VRAM from being released between categories, which makes training time untenable.<br>
Use the batch file that calls the Python script once per category instead (or write your own shell script on Unix-based systems).
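As a platform-independent alternative to the batch file, the same loop can be sketched in Python with one subprocess per category. Each subprocess exits when its run finishes, so its GPU memory is released before the next category starts. The `--category` flag is a hypothetical argument name: the real command-line interface of ./finetune/3_finetune.py is not shown in this notebook and should be checked before use.

```python
import subprocess
import sys

# Rating categories used in this notebook
CATEGORIES = ["co", "cb", "sm", "cv", "di", "wlb"]


def build_commands(script_path, categories):
    """Build one command per category; '--category' is a hypothetical flag."""
    return [[sys.executable, script_path, "--category", cat] for cat in categories]


def run_all(commands, dry_run=True):
    """Run the commands one after another, each in a fresh process."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))             # only show what would be run
        else:
            subprocess.run(cmd, check=True)  # stop at the first failing run


commands = build_commands("./finetune/3_finetune.py", CATEGORIES)
run_all(commands, dry_run=True)
```

Setting `dry_run=False` would launch the runs sequentially. `check=True` raises an error if a run fails, so a broken category does not go unnoticed.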

---
# Load saved models and evaluate (experiment) <a name="all_eval"></a>

In [None]:
cats = list(data_dict['train'].keys())

for cat in cats:
    model_path = f'./trained_weights/{cat}'
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        device_map=device,
        torch_dtype=compute_dtype,
        quantization_config=bnb_config, 
    )
    model.config.use_cache = False
    model.config.pretraining_tp = 1

    max_seq_length = 512 #2048    
    tokenizer = AutoTokenizer.from_pretrained(model_path, max_seq_length=max_seq_length)
    tokenizer.pad_token_id = tokenizer.eos_token_id

    to_predict = data_dict['test'][cat].iloc[:]
    y_pred, _ = predict(to_predict, model, tokenizer)
    
    y_true = data_dict['y_true'][cat][:]
    evaluation = pd.DataFrame({'prompt': data_dict['test'][cat]['prompt'], 
                               'y_true':y_true, 
                               'y_pred': y_pred,
                               'base_pred':base_eval[cat]['y_pred']},
                             )
    evaluation.to_csv(f"./finetune/{cat}_test_predictions.csv", index=False)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

100%|██████████████████████████████████████████████████████████████████████████████| 1500/1500 [05:48<00:00,  4.31it/s]


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

  3%|██▍                                                                           | 48/1500 [02:37<1:55:02,  4.75s/it]

---
# Create firm summary dataset for prediction <a name="createdata"></a>

In [34]:
summ = pd.read_csv('1_summary_reviews.csv',index_col=0)

In [37]:
summ.head(1)

| Unnamed: 0 | firm | pros: career opportunities | pros: compensation and benefits | pros: senior management | pros: work life balance | pros: culture and values | pros: diversity and inclusion | cons: career opportunities | cons: compensation and benefits | cons: senior management | ... | cons: culture and values | cons: diversity and inclusion | index | opportunities | compensation | management | worklife_balance | culture | diversity | kmeans_labels |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AMR | AMR has multiple locations throughout the glob... | Good pay with ability to make a lot more with ... | | Only work 3 days a week. Every other 3 day wee... | Great company culture, fast moving industry, o... | | You will get called in a lot. Mandatory overti... | Have to work a lot of O/T because always short... | Management has no idea what they’re doing or h... | ... | Very bad upper management, Lack of culture Shi... | -Two Separate Division with Different Contract... | 877 | 3.02 | 2.82 | 2.61 | 2.64 | 2.69 | 3.3 | 3 |


In [55]:
pros=summ.iloc[:,1:7].astype(str).agg('\n'.join,axis=1)
cons=summ.iloc[:,7:13].astype(str).agg('\n'.join,axis=1)

In [61]:
type(pros)

pandas.core.series.Series

In [64]:
review = 'pros: '+pros+'\ncons: '+cons

In [75]:
pd.DataFrame(review,columns=['text']).to_csv('2_summ_text.csv',index=False)
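One detail of the concatenation above: if the empty summary cells are read as NaN, `astype(str)` turns them into the literal string "nan", which then ends up in the review text. The plain-Python sketch below mimics the join for one made-up row and shows a possible variant that skips missing values. The helper name `join_fields` and the sample values are assumptions for illustration.

```python
import math


def join_fields(fields):
    """Join text fields with newlines, skipping missing values (None or NaN)."""
    kept = [
        str(field)
        for field in fields
        if field is not None
        and not (isinstance(field, float) and math.isnan(field))
    ]
    return "\n".join(kept)


# Hypothetical row: the second field is missing
pros_fields = ["Good pay", float("nan"), "Great company culture"]

naive = "\n".join(str(field) for field in pros_fields)  # mirrors astype(str) + join
cleaned = join_fields(pros_fields)

print(repr(naive))
print(repr(cleaned))
```

Whether the "nan" placeholders matter for the sentiment predictions is not established here. The variant is only one way to avoid them.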