# Fine-tuning Llama 3.1 for sentiment analysis: Part 2

**Introduction**: Multiclass classification is a common use case for traditional ML models. However, now that open-source foundation LLMs are so easily accessible, it's worth exploring how easily they can be applied to this domain as well.

This notebook is part 2 of a 2-part series exploring how easily Llama 3.1-8B-Instruct can be fine-tuned for three-class sentiment analysis (positive,
negative, neutral) on the [Dynasent](https://paperswithcode.com/dataset/dynasent) dataset.

**Findings**: The foundation model achieves 71% accuracy with zero-shot prompting. A fine-tuned version using quantized LoRA (QLoRA) adapters achieves 84% accuracy with the 37,500-example dataset and 81% accuracy with the smaller 3,750-example dataset.

**Environment**: Nvidia A100 on Google Colab with 40GB of GPU RAM and 80GB of CPU RAM. Training completed in less than 45 minutes, demonstrating the feasibility of this approach. Google Drive was used to cache the model and datasets; access tokens were read from Colab's `userdata` store.

**Summary of steps**:

Notebook 1:
  - Download the Dynasent dataset from Hugging Face.
  - Explore, clean, and prepare two smaller datasets:
    - A small dataset involving 3,750 examples.
    - A medium dataset involving 37,500 examples.

Notebook 2 (this notebook):
  - Load dataset, model and tokenizer
  - Execute test with foundation model
  - Fine-tune QLoRA adapters
  - Merge with foundation model
  - Execute test with fine-tuned model
  - Save model and upload to Hugging Face

# Setup: libraries, dependencies, configurations, helper functions

In [None]:
%%capture
%pip install -U transformers
%pip install -U datasets
%pip install -U bitsandbytes
%pip install -U accelerate
%pip install -U peft
%pip install -U trl

In [None]:
import numpy as np
import pandas as pd
import os
import bitsandbytes as bnb
import torch
import torch.nn as nn
import transformers
import wandb
import pprint

from tqdm import tqdm
from peft import LoraConfig, PeftConfig, PeftModel
from trl import SFTTrainer, SFTConfig
from trl import setup_chat_format

from transformers import (AutoModelForCausalLM,
                          AutoTokenizer,
                          BitsAndBytesConfig,
                          EarlyStoppingCallback,
                          pipeline,
                          logging)

from datasets import (Dataset,
                      DatasetDict,
                      concatenate_datasets,
                      load_dataset,
                      load_from_disk)

from sklearn.metrics import (accuracy_score,
                             classification_report,
                             confusion_matrix)

from sklearn.model_selection import train_test_split

from google.colab import userdata
from huggingface_hub import login as hf_login

In [None]:
# mount google drive
from google.colab import drive
drive.mount('/content/drive/')

# connect to huggingface
hf_auth_token = userdata.get('HF_TOKEN')
hf_login(hf_auth_token)

# connect to weights and biases
wb_auth_token = userdata.get('WB_TOKEN')
wandb.login(key=wb_auth_token)

# initialize wandb.ai with this project
run = wandb.init(
  project='Fine-tune for sentiment analysis',
  job_type="training",
  anonymous="allow"
)

# base model config
base_model_namespace = "meta-llama"
base_model = "Llama-3.1-8B-Instruct"
base_model_name = f"{base_model_namespace}/{base_model}"

# base model cache config
base_model_cache_base_directory = "/content/drive/MyDrive/.model_cache"
base_model_provider = "huggingface"
base_model_cache_directory = f"{base_model_cache_base_directory}/{base_model_provider}/{base_model_name}"

# base dataset cache config
base_dataset_cache_base_directory = "/content/drive/MyDrive/.dataset_cache"
base_dataset_provider = "huggingface"
base_dataset_namespace = "dynabench"
base_dataset = "dynasent"
base_dataset_name_part = "dynabench.dynasent.r1.all"
base_dataset_name = f"{base_dataset_namespace}/{base_dataset}"
base_dataset_cache_directory = f"{base_dataset_cache_base_directory}/{base_dataset_provider}/{base_dataset_name}/{base_dataset_name_part}"

# project config
custom_model_name = "llama-3.1-8B-sentiment-analysis"
dataset_directory = "/content/drive/MyDrive/.data/sentiment_analysis_37500"   # comment to use the smaller dataset
#dataset_directory = "/content/drive/MyDrive/.data/sentiment_analysis_3750"  # uncomment to use the smaller dataset
model_directory = f"{base_model_cache_base_directory}/model-llama-3.1-8b-instruct"
labels = ['positive','negative','neutral']

# mixed precision datatypes
torch_dtype = torch.float16
use_fp16 = True
use_bf16 = False
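
The precision flags above are fixed to float16. A minimal sketch of choosing them from the hardware instead, assuming `torch.cuda.is_bf16_supported()` is an acceptable capability check; `choose_precision` is a hypothetical helper and not part of the notebook:

```python
def choose_precision(bf16_supported):
    """Return (dtype_name, use_fp16, use_bf16) for mixed-precision training."""
    if bf16_supported:
        return "bfloat16", False, True
    return "float16", True, False

try:
    import torch
    bf16_ok = torch.cuda.is_available() and torch.cuda.is_bf16_supported()
except ImportError:  # lets the sketch run where torch is not installed
    torch, bf16_ok = None, False

dtype_name, use_fp16, use_bf16 = choose_precision(bf16_ok)
if torch is not None:
    torch_dtype = getattr(torch, dtype_name)  # torch.bfloat16 or torch.float16

print(dtype_name, use_fp16, use_bf16)
```

Ampere-class GPUs such as the A100 support bfloat16, so this sketch would be expected to select it there. The results reported in this notebook were obtained with the float16 settings above.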

In [None]:
# helper functions

def verify_distribution(df, stratify_col, split_name):
  result = df.groupby(stratify_col).size().reset_index(name='count')
  result['%'] = result['count'] / result['count'].sum() * 100
  result = result.sort_values('%', ascending=False)
  print(f"{split_name} dataset, {len(df)} examples\n {result}\n")

def generate_prompt(example, includeLabel=False):

  label = example['gold_label'] if includeLabel else ""

  return f"""
  Analyze the sentiment of the following text. Classify as positive, negative, or neutral.

  text: {example['sentence']}
  label: {label}""".strip()

def generate_prompt_with_label(example):
  return generate_prompt(example, True)

def generate_prompt_without_label(example):
  return generate_prompt(example, False)

def load_base_model(cache_directory, model_name, quantization_config, torch_dtype):

  model = None

  # if not cached, download and save
  if not os.path.exists(cache_directory):

    model = AutoModelForCausalLM.from_pretrained(
      model_name,
      quantization_config=quantization_config,
      device_map="auto",
      torch_dtype=torch_dtype)

    # cache locally
    model.save_pretrained(cache_directory)
    print(f"Downloaded model, saved in model cache: {cache_directory}")

  # if cached, load from cache
  else:

    model = AutoModelForCausalLM.from_pretrained(
      cache_directory,
      quantization_config=quantization_config,
      device_map="auto",
      torch_dtype=torch_dtype)

    print(f"Loaded from model cache: {cache_directory}")

  return model

def predict(test, model, tokenizer):
  y_pred = []

  # build the pipeline once, rather than on every iteration
  pipe = pipeline(task="text-generation",
                  model=model,
                  tokenizer=tokenizer,
                  max_new_tokens=2, # handles sentiment labels
                  temperature=0.1)

  for i in tqdm(range(len(test))):
    prompt = test.iloc[i]["prompt"]
    result = pipe(prompt)

    answer = result[0]['generated_text'].split("label:")[-1].strip()

    # Determine the predicted category
    for label in labels:
      if label in answer.lower():
        y_pred.append(label)
        break
    else:
      y_pred.append("none")

  return y_pred

def evaluate(y_true, y_pred, labels):

  mapping = {label: idx for idx, label in enumerate(labels)}

  def map_func(x):
    return mapping.get(x, -1)  # Unparseable predictions ("none") map to -1 and count as errors

  y_true_mapped = np.vectorize(map_func)(y_true)
  y_pred_mapped = np.vectorize(map_func)(y_pred)

  # Calculate accuracy
  accuracy = accuracy_score(y_true=y_true_mapped, y_pred=y_pred_mapped)
  print(f"\nAccuracy: {accuracy:.3f}")

  # Generate accuracy report
  unique_label_names = set(y_true_mapped)  # Get unique labels

  for label in unique_label_names:
    label_indices = [i for i in range(len(y_true_mapped)) if y_true_mapped[i] == label]
    label_y_true = [y_true_mapped[i] for i in label_indices]
    label_y_pred = [y_pred_mapped[i] for i in label_indices]
    label_accuracy = accuracy_score(label_y_true, label_y_pred)
    print(f"Accuracy for label {labels[label]}: {label_accuracy:.3f}")

  # Generate classification report
  class_report = classification_report(y_true=y_true_mapped, y_pred=y_pred_mapped, target_names=labels, labels=list(range(len(labels))))
  print("\nClassification Report:")
  print(class_report)

  # Generate confusion matrix
  conf_matrix = confusion_matrix(y_true=y_true_mapped, y_pred=y_pred_mapped, labels=list(range(len(labels))))
  print("\nConfusion Matrix:")
  print(conf_matrix)

def find_all_module_names(model):

  cls = bnb.nn.Linear4bit
  lora_module_names = set()

  for name, module in model.named_modules():
    if isinstance(module, cls):
      names = name.split('.')
      lora_module_names.add(names[0] if len(names) == 1 else names[-1])

  if 'lm_head' in lora_module_names:  # needed for 16 bit
    lora_module_names.remove('lm_head')

  return list(lora_module_names)

# Load and process dataset

In [None]:
# load dataset
ds = load_from_disk(dataset_directory)
print(ds)

In [None]:
# verify balance
verify_distribution(ds["train"].to_pandas(), "gold_label", "train")
verify_distribution(ds["validation"].to_pandas(), "gold_label", "validation")
verify_distribution(ds["test"].to_pandas(), "gold_label", "test")

In [None]:
# convert to dataframes
X_train = ds["train"].to_pandas()
X_val = ds["validation"].to_pandas()
X_test = ds["test"].to_pandas()

# generate prompts for training and evaluation data
X_train.loc[:,'prompt'] = X_train.apply(generate_prompt_with_label, axis=1)
X_val.loc[:,'prompt'] = X_val.apply(generate_prompt_with_label, axis=1)

# generate test prompts and extract true labels
y_true = X_test.loc[:,'gold_label']
X_test = pd.DataFrame(X_test.apply(generate_prompt_without_label, axis=1), columns=['prompt'])

# convert to datasets
train_dataset = Dataset.from_pandas(X_train)
val_dataset = Dataset.from_pandas(X_val)
test_dataset = Dataset.from_pandas(X_test)

In [None]:
pprint.pprint(X_train.iloc[0]['prompt'])

In [None]:
pprint.pprint(X_test.iloc[0]['prompt'])
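
To make the difference between the two prompt variants concrete, here is a standalone sketch that repeats the template from `generate_prompt` on a made-up row. The sentence and label are hypothetical and not taken from the dataset:

```python
def generate_prompt(example, include_label=False):
    # Same template as the notebook helper; only the parameter name differs.
    label = example['gold_label'] if include_label else ""

    return f"""
  Analyze the sentiment of the following text. Classify as positive, negative, or neutral.

  text: {example['sentence']}
  label: {label}""".strip()

# Hypothetical example row
example = {"sentence": "The food was great.", "gold_label": "positive"}

train_prompt = generate_prompt(example, include_label=True)
test_prompt = generate_prompt(example, include_label=False)

print(train_prompt)  # ends with "label: positive"
print()
print(test_prompt)   # ends with "label:" so the model has to complete the label
```

Training and validation prompts carry the gold label, so the model learns to continue the text after `label:`. Test prompts stop at `label:` and leave the completion to the model.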

# Load model and tokenizer; test prior to fine-tuning

Load the model and tokenizer. We load a memory-optimized version of the base model in which the weights are stored in 4-bit quantized format, while computations run in bfloat16 or float16, depending on the hardware. These optimizations reduce GPU memory use and help speed up training.

In [None]:
# quantization config
bnb_config = BitsAndBytesConfig(
  load_in_4bit=True, # 4-bit quantized version
  bnb_4bit_quant_type="nf4",
  bnb_4bit_compute_dtype=torch_dtype,
  bnb_4bit_use_double_quant=False,
)

# load model
model = load_base_model(base_model_cache_directory, base_model_name, bnb_config, torch_dtype)
model.config.pretraining_tp = 1
model.config.use_cache = False

# load tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
tokenizer.pad_token_id = tokenizer.eos_token_id
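
As a rough back-of-the-envelope check on why 4-bit loading matters, the weights-only footprint can be estimated from the parameter count. The sketch below assumes roughly 8 billion parameters and ignores quantization constants, activations, optimizer state, and any layers kept in higher precision, so the numbers are only indicative:

```python
def weight_memory_gib(n_params, bits_per_param):
    """Approximate weight storage in GiB, ignoring all overheads."""
    return n_params * bits_per_param / 8 / 2**30

N_PARAMS = 8e9  # assumption: approximate parameter count of an 8B model

for bits in (16, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gib(N_PARAMS, bits):.1f} GiB")
```

Under these assumptions the weights shrink from roughly 15 GiB to under 4 GiB, which leaves much more of the 40GB GPU for activations and optimizer state.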

We use the custom `predict` function defined above, which runs the text-generation pipeline on the "prompt" column to predict labels.

In [None]:
y_pred = predict(X_test, model, tokenizer)
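
The label-parsing step inside `predict` can be exercised on its own, without loading the model. This sketch repeats that logic in a hypothetical `extract_label` helper and applies it to made-up generations:

```python
labels = ['positive', 'negative', 'neutral']

def extract_label(generated_text, labels):
    """Return the first known label found after the final 'label:' marker."""
    answer = generated_text.split("label:")[-1].strip().lower()
    for label in labels:
        if label in answer:
            return label
    return "none"  # nothing recognizable was generated

# Hypothetical pipeline outputs (prompt followed by the completion)
print(extract_label("text: Great service!\n  label: Positive", labels))
print(extract_label("text: It opens at nine.\n  label: neutral.", labels))
print(extract_label("text: Meh.\n  label: I think", labels))
```

Matching is substring-based and follows the order of `labels`, so a completion that mentions two labels resolves to whichever appears first in that list.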

We also use the custom `evaluate` function defined above, which compares the predicted labels with the ground-truth labels to calculate performance metrics and a confusion matrix.

In [None]:
evaluate(y_true, y_pred, labels)

# Fine-tune model

Find the trainable modules in the model. We'll use a custom function to extract the modules; then we'll apply them to the LoRA config.

In [None]:
modules = find_all_module_names(model)
print(modules)
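
The name handling in `find_all_module_names` can be illustrated without a model. The module paths below are hypothetical examples written in the style of Llama's layer naming; in the notebook they come from `model.named_modules()`, filtered to `bnb.nn.Linear4bit` layers:

```python
def leaf_names(module_paths, exclude=("lm_head",)):
    """Collect the last path segment of each module, dropping excluded names."""
    names = {path.split(".")[-1] for path in module_paths}
    return sorted(names - set(exclude))

# Hypothetical dotted module paths
paths = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.0.self_attn.o_proj",
    "model.layers.0.mlp.gate_proj",
    "model.layers.0.mlp.up_proj",
    "model.layers.0.mlp.down_proj",
    "model.layers.1.self_attn.q_proj",
    "lm_head",
]

print(leaf_names(paths))
```

Because only the leaf name is kept, each projection type appears once no matter how many layers contain it, and that short list is what gets passed as `target_modules`.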

In [None]:
# note: obtained best performance with these values (after originally trying r=16, lora_alpha=32, lora_dropout=0.1)
peft_config = LoraConfig(
  r=64,
  lora_alpha=16,
  lora_dropout=0,
  bias="none",
  task_type="CAUSAL_LM",
  target_modules=modules,
)

early_stopping_cb = EarlyStoppingCallback(
  early_stopping_patience=3,                # stop if no improvement after 3 evals
  early_stopping_threshold=0.0              # minimum improvement threshold
)

training_arguments = SFTConfig(
  output_dir=f"{model_directory}",          # directory to save and repository id

  per_device_train_batch_size=16,           # batch size per device during training
  per_device_eval_batch_size=16,
  gradient_accumulation_steps=2,            # number of batches to accumulate gradients over before each optimizer update
  gradient_checkpointing=True,

  # training length
  num_train_epochs=1,                       # number of training epochs

  # learning rate
  learning_rate=2e-4,                       # learning rate, based on QLoRA paper

  # regularization
  weight_decay=0.01,
  warmup_ratio=0.03,                        # warmup ratio based on QLoRA paper

  # early-stopping
  load_best_model_at_end=True,
  metric_for_best_model="eval_loss",
  greater_is_better=False,                  # we want the lowest validation loss

  # logging and evaluation
  eval_strategy="steps",
  eval_steps=100,
  logging_strategy="steps",
  logging_steps=100,

  # additional
  optim="paged_adamw_32bit",
  fp16=use_fp16,                            # set in the configuration cell
  bf16=use_bf16,                            # set in the configuration cell
  max_grad_norm=0.3,                        # max gradient norm based on QLoRA paper
  max_steps=-1,
  group_by_length=False,
  lr_scheduler_type="cosine",               # use cosine learning rate scheduler
  report_to="wandb",                        # report metrics to w&b
  max_seq_length=512,                       # maximum tokens per training example (longer prompts are truncated)
  dataset_text_field="prompt",
  packing=False,
  dataset_kwargs={
    "add_special_tokens": False,
    "append_concat_token": False,
  },
  gradient_checkpointing_kwargs={
    'use_reentrant': False
  }
)

trainer = SFTTrainer(
  model=model,
  train_dataset=train_dataset,
  eval_dataset=val_dataset,
  peft_config=peft_config,
  processing_class=tokenizer,
  args=training_arguments,
  callbacks=[early_stopping_cb],
)
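
For a sense of scale, the number of trainable LoRA parameters follows from the rank and the layer shapes: a rank-`r` adapter on a linear layer with `d_in` inputs and `d_out` outputs adds `r * (d_in + d_out)` weights. The sketch below assumes the layer shapes published for Llama 3.1 8B (hidden size 4096, MLP size 14336, key/value projection width 1024, 32 decoder blocks) and that all seven linear projections in each block are targeted. Treat the result as an estimate, not a measured value:

```python
R, ALPHA = 64, 16   # values from the LoRA config above
NUM_LAYERS = 32     # assumption: number of decoder blocks in Llama 3.1 8B

# assumption: (d_in, d_out) of each targeted projection in one decoder block
SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}

def lora_params(r, shapes, num_layers):
    """Each adapter holds two matrices: A (r x d_in) and B (d_out x r)."""
    per_block = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_block * num_layers

print(f"trainable LoRA parameters: {lora_params(R, SHAPES, NUM_LAYERS):,}")
print(f"LoRA scaling (lora_alpha / r): {ALPHA / R}")
```

Under these assumptions the adapters hold about 168 million trainable parameters, a small fraction of the base model, and the adapter output is scaled by 0.25 before being added to the base layer's output.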

In [None]:
trainer.train()
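
The batch and evaluation settings translate into a training schedule roughly as follows. The training-set size here is a hypothetical placeholder (the actual split sizes are printed when the dataset is loaded), a single GPU is assumed, and the exact rounding of step counts can depend on the library version:

```python
import math

per_device_batch = 16
grad_accum_steps = 2
num_devices = 1      # assumption: a single GPU
n_train = 30_000     # hypothetical training-set size

effective_batch = per_device_batch * grad_accum_steps * num_devices
steps_per_epoch = math.ceil(n_train / effective_batch)  # approximate
warmup_steps = math.ceil(0.03 * steps_per_epoch)         # warmup_ratio=0.03, one epoch

eval_steps, patience = 100, 3
# Early stopping triggers after `patience` consecutive evaluations without improvement
steps_without_improvement = eval_steps * patience

print(effective_batch, steps_per_epoch, warmup_steps, steps_without_improvement)
```

With these settings each optimizer update sees 32 examples, and training stops early only if the validation loss fails to improve for 300 consecutive steps.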

In [None]:
wandb.finish()
model.config.use_cache = True

In [None]:
# Save trained model and tokenizer
trainer.save_model(model_directory)
tokenizer.save_pretrained(model_directory)

# push to huggingface
trainer.model.push_to_hub(
  custom_model_name,
  repo_type="model",
  use_temp_dir=True,
  token=hf_auth_token)

In [None]:
# evaluate the tuned model - is it better than the foundation model?
y_pred = predict(X_test, model, tokenizer)
evaluate(y_true, y_pred, labels)

With the small dataset (3,750 examples), the fine-tuned model achieved an overall accuracy of 81% and a macro F1 score of 81%.

```
Accuracy: 0.813
Accuracy for label positive: 0.840
Accuracy for label negative: 0.904
Accuracy for label neutral: 0.696

Classification Report:
              precision    recall  f1-score   support

    positive       0.84      0.84      0.84       125
    negative       0.76      0.90      0.83       125
     neutral       0.85      0.70      0.77       125

    accuracy                           0.81       375
   macro avg       0.82      0.81      0.81       375
weighted avg       0.82      0.81      0.81       375


Confusion Matrix:
[[105  11   9]
 [  6 113   6]
 [ 14  24  87]]
```

With the medium dataset (37,500 examples), the fine-tuned model achieved a higher overall accuracy of 84% and a macro F1 score of 84%.

```
Accuracy: 0.842
Accuracy for label positive: 0.847
Accuracy for label negative: 0.885
Accuracy for label neutral: 0.794

Classification Report:
              precision    recall  f1-score   support

    positive       0.86      0.85      0.85      1250
    negative       0.85      0.88      0.87      1250
     neutral       0.81      0.79      0.80      1250

    accuracy                           0.84      3750
   macro avg       0.84      0.84      0.84      3750
weighted avg       0.84      0.84      0.84      3750


Confusion Matrix:
[[1059   58  133]
 [  45 1106   99]
 [ 127  130  993]]
```
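
The per-label accuracies and the overall accuracy reported above can be recomputed directly from the confusion matrix, where rows are true labels and columns are predictions, in the order positive, negative, neutral. A short check using the medium-dataset matrix:

```python
labels = ['positive', 'negative', 'neutral']

# Confusion matrix from the medium-dataset run above
conf_matrix = [
    [1059,   58,  133],
    [  45, 1106,   99],
    [ 127,  130,  993],
]

def per_label_accuracy(matrix):
    """Row-wise recall: correct predictions divided by true examples per label."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]

def overall_accuracy(matrix):
    """Diagonal sum divided by the total number of examples."""
    correct = sum(row[i] for i, row in enumerate(matrix))
    return correct / sum(sum(row) for row in matrix)

for label, acc in zip(labels, per_label_accuracy(conf_matrix)):
    print(f"{label}: {acc:.3f}")
print(f"overall: {overall_accuracy(conf_matrix):.3f}")
```

This reproduces the 0.847, 0.885, and 0.794 per-label figures and the 0.842 overall accuracy. It also shows that neutral remains the hardest class, with its errors split almost evenly between positive and negative.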

# Merge and export fine-tuned model

First, reload a fresh, unquantized copy of the base model in 16-bit precision, since we do not want to merge into the 4-bit quantized version loaded earlier. Then merge the adapter into the base model, save the result, and upload it to Hugging Face.

In [None]:
# reload base model
reloaded_model = AutoModelForCausalLM.from_pretrained(
  base_model_name,
  low_cpu_mem_usage=True,
  return_dict=True,
  torch_dtype=torch_dtype,
  device_map='auto',
  force_download=True # we want to force download so as not to use quantized version
)

# merge model
merged_model = PeftModel.from_pretrained(reloaded_model, model_directory)
merged_model = merged_model.merge_and_unload()

# save merged model
merged_model.save_pretrained(model_directory)

# push merged model to huggingface hub
merged_model.push_to_hub(custom_model_name, use_temp_dir=True, token=hf_auth_token)
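
Conceptually, merging folds the low-rank update into the base weights so that no adapter is needed at inference time: for each adapted layer, `W_merged = W + (lora_alpha / r) * B @ A`. A tiny pure-Python illustration with made-up 2x2 numbers (not the library's implementation) shows that the merged layer gives the same output as the base layer plus the adapter path:

```python
def matmul(X, Y):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def add_scaled(X, Y, scale):
    """Return X + scale * Y element-wise."""
    return [[x + scale * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

# Made-up values: a 2x2 base weight and a rank-1 adapter
W = [[1.0, 2.0], [3.0, 4.0]]
B = [[0.5], [1.0]]    # d_out x r
A = [[2.0, -1.0]]     # r x d_in
r, lora_alpha = 1, 2
scale = lora_alpha / r

W_merged = add_scaled(W, matmul(B, A), scale)

x = [[1.0], [1.0]]    # input column vector
base_plus_adapter = add_scaled(matmul(W, x), matmul(B, matmul(A, x)), scale)
merged_output = matmul(W_merged, x)

print(W_merged)
print(merged_output == base_plus_adapter)
```

Because the update is added to the weights themselves, the merge in this notebook is done against a 16-bit copy of the base model, not the 4-bit one used during training.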