# Fine-tuning Llama 3 for Alzheimer's classification


### Requirements:
* A GPU with enough memory!

### Installs
* The latest version of `transformers` is recommended for Llama 3.
* Restart the runtime after installing: the `accelerate` package used by the Hugging Face `Trainer` requires it.

In [None]:
# Install Pytorch
#%pip install "torch==2.2.2" tensorboard

# Install Hugging Face libraries
#%pip install  --upgrade "transformers==4.40.0" "datasets==2.18.0" "accelerate==0.29.3" "evaluate==0.4.1" "bitsandbytes==0.43.1" "huggingface_hub==0.22.2" "trl==0.8.6" "peft==0.10.0"
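
After the restart, it can help to confirm that the runtime actually sees the pinned versions. This is a minimal sketch using only the standard library. The `PINNED` dictionary copies the pins from the install cell above, and `report_versions` is a hypothetical helper name, not part of any library.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the install cell above.
PINNED = {
    "transformers": "4.40.0",
    "datasets": "2.18.0",
    "accelerate": "0.29.3",
    "evaluate": "0.4.1",
    "bitsandbytes": "0.43.1",
    "huggingface_hub": "0.22.2",
    "trl": "0.8.6",
    "peft": "0.10.0",
}

def report_versions(pinned):
    """Return {package: (installed_or_None, expected, matches)} for each pin."""
    report = {}
    for name, expected in pinned.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[name] = (installed, expected, installed == expected)
    return report

for name, (installed, expected, ok) in report_versions(PINNED).items():
    print(f"{name:16s} installed={installed} expected={expected} match={ok}")
```

Any row with `match=False` after a restart suggests the install cell did not take effect in the current runtime.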


### Log in to the Hugging Face Hub with your access token so we can download the Llama 3 8B pre-trained model

In [None]:
!huggingface-cli login



    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    A token is already saved on your machine. Run `huggingface-cli whoami` to get more information or `huggingface-cli logout` if you want to log out.
    Setting a new token will erase the existing one.
    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) n
Token is valid (permission: fineGr

In [None]:
#!pip install --upgrade torch functorch

In [None]:
#pip install torch==2.0 torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113


###### Imports

In [None]:
import os
import random
import functools
import csv
import pandas as pd
import numpy as np
import torch
import torch.nn.functional as F
import evaluate

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, confusion_matrix, classification_report, balanced_accuracy_score, accuracy_score

from datasets import Dataset, DatasetDict
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model

from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding
)


In [None]:
from google.colab import drive
drive.mount('/content/drive')
# Load the data
data_path = "/content/drive/MyDrive/trans_data/transcript_org.csv"

df = pd.read_csv(data_path)
df

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Unnamed: 0,output,transcript
0,0,the scene is in the in the kitchen . the moth...
1,0,oh I see the sink is running over . I see the...
2,0,a boy and a girl are in the kitchen with thei...
3,0,it was summertime and mother and the childre...
4,0,wait until I put my glasses on . oh ? there's...
...,...,...
493,1,well this one is in the cookie jar . and thi...
494,1,the little boy is on the stool which is tipp...
495,1,oh you want me to tell you . the mother and h...
496,1,oh that kid's gonna get a good spill off of t...


In [None]:
import torch
print(torch.__version__)

2.2.2+cu121


#### Dataset: transcripts with binary labels
* `transcript` holds the text to classify and `output` holds the label (0 or 1).


### Split into train/val/test for later comparison.
* For simplicity we split by row position.
  - First 60% train
  - Next 20% val
  - Last 20% test
* This can be a bit problematic: if the rows follow any ordering, class balance will differ between the splits. A shuffled or stratified split avoids that, so it is worth comparing the class proportions of each split before trusting the metrics.


In [None]:
train_end_point = int(df.shape[0]*0.6)
val_end_point = int(df.shape[0]*0.8)
df_train = df.iloc[:train_end_point,:]
df_val = df.iloc[train_end_point:val_end_point,:]
df_test = df.iloc[val_end_point:,:]
print(df_train.shape, df_val.shape, df_test.shape)



(298, 2) (100, 2) (100, 2)
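
Because a positional split does not control class balance, it is worth checking the label proportions of each split. Below is a minimal sketch in plain Python. The `labels` list is made-up toy data standing in for `df.output`, and `split_by_position` and `class_proportions` are hypothetical helper names.

```python
from collections import Counter

def split_by_position(items, train_end_frac=0.6, val_end_frac=0.8):
    """Split a sequence into train/val/test by position, mirroring the cell above."""
    train_end = int(len(items) * train_end_frac)
    val_end = int(len(items) * val_end_frac)
    return items[:train_end], items[train_end:val_end], items[val_end:]

def class_proportions(labels):
    """Fraction of each label value in a list of labels."""
    counts = Counter(labels)
    total = len(labels)
    return {label: counts[label] / total for label in sorted(counts)}

# Toy labels sorted by class: the worst case for a positional split.
labels = [0] * 5 + [1] * 5
train, val, test = split_by_position(labels)
for name, part in [("train", train), ("val", val), ("test", test)]:
    print(name, len(part), class_proportions(part))
```

With sorted toy labels the validation and test parts contain only class 1, which is exactly the failure mode to look for in the real data.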


### Convert from Pandas DataFrame to Hugging Face Dataset
* Also let's shuffle the training set.
* We put the components train,val,test into a DatasetDict so we can access them later with HF trainer.
* Later we will add a tokenized dataset


In [None]:
# Convert the pandas DataFrames into Hugging Face Dataset objects.
# Note: the label column `output` is dropped here, so only `transcript` remains.
dataset_train = Dataset.from_pandas(df_train.drop('output',axis=1))
dataset_val = Dataset.from_pandas(df_val.drop('output',axis=1))
dataset_test = Dataset.from_pandas(df_test.drop('output',axis=1))


In [None]:
# Shuffle the training dataset
dataset_train_shuffled = dataset_train.shuffle(seed=42)  # Using a seed for reproducibility


In [None]:
# Combine them into a single DatasetDict
dataset = DatasetDict({
    'train': dataset_train_shuffled,
    'val': dataset_val,
    'test': dataset_test
})
dataset

DatasetDict({
    train: Dataset({
        features: ['transcript'],
        num_rows: 298
    })
    val: Dataset({
        features: ['transcript'],
        num_rows: 100
    })
    test: Dataset({
        features: ['transcript'],
        num_rows: 100
    })
})

In [None]:
dataset['train']

Dataset({
    features: ['transcript'],
    num_rows: 298
})

* Since our classes are not perfectly balanced, let's calculate class weights from the inverse class frequencies, normalized to sum to 1.
* Convert to a PyTorch tensor since we will need it later.

In [None]:
df_train.output.value_counts(normalize=True)

Unnamed: 0_level_0,proportion
output,Unnamed: 1_level_1
1,0.513423
0,0.486577


In [None]:
class_weights=(1/df_train.output.value_counts(normalize=True).sort_index()).tolist()
class_weights=torch.tensor(class_weights)
class_weights=class_weights/class_weights.sum()
class_weights


tensor([0.5134, 0.4866])
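
The weights above can be reproduced by hand. The sketch below uses only plain Python. The counts 145 and 153 are inferred from the 298 training rows and the proportions printed above rather than read from the data, and `inverse_frequency_weights` is a hypothetical helper name.

```python
def inverse_frequency_weights(counts):
    """Inverse class frequencies, normalized to sum to 1 (classes in index order)."""
    total = sum(counts)
    inverse = [total / c for c in counts]  # 1 / proportion for each class
    norm = sum(inverse)
    return [w / norm for w in inverse]

# Assumed class counts for the training split: class 0, then class 1.
weights = inverse_frequency_weights([145, 153])
print([round(w, 4) for w in weights])
```

With only two classes, this normalization simply swaps the two class proportions, so the rarer class ends up with the larger weight.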

## Load the Llama model with 4-bit quantization via bitsandbytes and prepare it for PEFT training

### Model Name

In [None]:
model_name = "meta-llama/Meta-Llama-3-8B"

#### Quantization Config (for QLoRA)

In [None]:
quantization_config = BitsAndBytesConfig(
    load_in_4bit = True, # enable 4-bit quantization
    bnb_4bit_quant_type = 'nf4', # information theoretically optimal dtype for normally distributed weights
    bnb_4bit_use_double_quant = True, # also quantize the quantization constants //insert xzibit meme
    bnb_4bit_compute_dtype = torch.bfloat16 # dtype used for computation; weights stay stored in 4-bit
)
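
To see why 4-bit loading matters for GPU memory, here is a back-of-the-envelope estimate in plain Python. It assumes a nominal 8e9 parameters and treats every weight as quantized, which is not quite true (the model printout in this notebook shows the embedding as a regular `Embedding`, not `Linear4bit`). It also ignores quantization constants, activations, and optimizer state. `weight_memory_gib` is a hypothetical helper.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate storage for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30

NOMINAL_PARAMS = 8e9  # assumption: "8B" taken at face value

for label, bits in [("float16/bfloat16", 16), ("4-bit", 4)]:
    print(f"{label:18s} ~{weight_memory_gib(NOMINAL_PARAMS, bits):.1f} GiB")
```

The 4x gap in weight storage is the point of the estimate; the absolute numbers are only rough.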


#### Lora Config

In [None]:
lora_config = LoraConfig(
    r = 16, # rank of the low-rank update matrices
    lora_alpha = 8, # scaling factor: the LoRA update is scaled by lora_alpha / r
    target_modules = ['q_proj', 'k_proj', 'v_proj', 'o_proj'],
    lora_dropout = 0.05, # dropout probability of the LoRA layers
    bias = 'none', # whether to train bias weights; 'none' leaves all biases frozen
    task_type = 'SEQ_CLS'
)
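
To get a feel for how small the adapters are, the number of LoRA parameters can be counted by hand. This sketch uses the projection shapes reported in this notebook's model printout and counts only the `lora_A` and `lora_B` matrices, not the classification head. `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(in_features, out_features, r):
    """Parameters in one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * in_features + out_features * r

R, ALPHA, NUM_LAYERS = 16, 8, 32

# (in_features, out_features) of the targeted projections, per the model printout.
TARGETS = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
}

per_layer = sum(lora_param_count(i, o, R) for i, o in TARGETS.values())
total = per_layer * NUM_LAYERS
scaling = ALPHA / R  # PEFT's default LoRA scaling is lora_alpha / r

print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total:,}")
print(f"LoRA scaling factor:       {scaling}")
```

With `r = 16` and `lora_alpha = 8` the adapter output is scaled by 0.5, so raising `lora_alpha` is one way to give the adapters more influence without changing their size.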

#### Load model
* `AutoModelForSequenceClassification` adds a classification head (`score`) on top of the base model.
* `num_labels` is the number of classes (2 here).


In [None]:
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    num_labels=2
)

model

`low_cpu_mem_usage` was None, now set to True since model is quantized.


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Some weights of LlamaForSequenceClassification were not initialized from the model checkpoint at meta-llama/Meta-Llama-3-8B and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


LlamaForSequenceClassification(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )


* `prepare_model_for_kbit_training()` preprocesses the quantized model for training (for example, it freezes the base weights and enables gradient checkpointing).

In [None]:
model = prepare_model_for_kbit_training(model)
model

LlamaForSequenceClassification(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )


* `get_peft_model()` wraps the base model together with the PEFT configuration (here LoRA) and returns a `PeftModel` that is ready for training.

In [None]:
model = get_peft_model(model, lora_config)
model

PeftModelForSequenceClassification(
  (base_model): LoraModel(
    (model): LlamaForSequenceClassification(
      (model): LlamaModel(
        (embed_tokens): Embedding(128256, 4096)
        (layers): ModuleList(
          (0-31): 32 x LlamaDecoderLayer(
            (self_attn): LlamaSdpaAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (k_proj): lora.Linear4bit(
        

### Load the tokenizer

#### The Llama 3 tokenizer does not define a pad token
* Set `pad_token_id` to `eos_token_id`
* Set `pad_token` to `eos_token`

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name, add_prefix_space=True)

tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.pad_token = tokenizer.eos_token

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
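
To make the effect of reusing EOS as the pad token concrete, here is a plain-Python sketch of what right-padding a batch does. The token ids and `PAD_ID` are made up for illustration (they are not real Llama 3 ids), right-padding is assumed, and `pad_batch` is a hypothetical helper, not the tokenizer's implementation.

```python
PAD_ID = 0  # placeholder; in the notebook this would be tokenizer.eos_token_id

def pad_batch(sequences, pad_id):
    """Right-pad token id lists to equal length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 0 marks padding
    return input_ids, attention_mask

ids, mask = pad_batch([[11, 12, 13], [21]], PAD_ID)
print(ids)
print(mask)
```

The attention mask is what tells the model to ignore the padded positions, which is why padding with the EOS id is safe here.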


#### Update some model configs
* Must set `use_cache = False` as below or it crashes, in my experience

In [None]:
model.config.pad_token_id = tokenizer.pad_token_id
model.config.use_cache = False
model.config.pretraining_tp = 1
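
Setting `model.config.pad_token_id` matters for classification because, as far as I understand the `transformers` implementation around this version, the sequence-classification head reads the hidden state of the last non-padding token and locates it via `pad_token_id`. Below is a plain-Python sketch of that lookup for right-padded input. `last_token_index` is a hypothetical helper and the ids are made up.

```python
def last_token_index(input_ids, pad_id):
    """Index of the last token before the first pad id (right-padded input)."""
    if pad_id in input_ids:
        first_pad = input_ids.index(pad_id)
        # Wrap around so a row that starts with the pad id maps to the last position.
        return (first_pad - 1) % len(input_ids)
    return len(input_ids) - 1  # no padding: the last position is the last token

PAD_ID = 0  # placeholder pad id for illustration
print(last_token_index([11, 12, 13], PAD_ID))
print(last_token_index([21, 0, 0], PAD_ID))
```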

### Loop through the dataset to measure performance before training/fitting the model
* Use a batch size of 32 to process several transcripts per forward pass while avoiding out-of-memory errors.

In [None]:
sentences = df_test.transcript.tolist()
sentences[0:2]

[" the boy's the girl's making fun of the boy . she made fun of him so much while he was stealing a cookie out of the cookie jar . and it made him trip . and he's going to fall . the mother's listening to the kids . she's drying dishes . she got the window open against a mirror . she's isn't paying attention to the sink . the sink's spilling water onto the floor . it's getting her feet wet . the curtains are open to allow fresh air to come in the house . there's two cups on the a dish on the sink . and the faucets are open and the cupboard door is open . looks like it might be spring or summer out because things look like they're growing outside . she has shoes on . she has shoes on . he has shoes on . she has shoes on . she has a dress, dress, apron, shirt, shorts, socks, socks . the lid is off the cookie jar . there's a fancy sink cabinet over here and a fancy wall cabinet over there . that's about it . ",
 " okay it looks like the mother is washing dishes . the sink is overflowing .

In [None]:
# Convert transcripts to a list
sentences = df_test.transcript.tolist()

# Define the batch size
batch_size = 32  # You can adjust this based on your system's memory capacity

# Initialize an empty list to store the model outputs
all_outputs = []

# Process the sentences in batches
for i in range(0, len(sentences), batch_size):
    # Get the batch of sentences
    batch_sentences = sentences[i:i + batch_size]

    # Tokenize the batch
    inputs = tokenizer(batch_sentences, return_tensors="pt", padding=True, truncation=True, max_length=512)

    # Move tensors to the device where the model is (e.g., GPU or CPU)
    inputs = {k: v.to('cuda' if torch.cuda.is_available() else 'cpu') for k, v in inputs.items()}

    # Perform inference and store the logits
    with torch.no_grad():
        outputs = model(**inputs)
        all_outputs.append(outputs['logits'])



* Concatenate all outputs into a single tensor

In [None]:
final_outputs = torch.cat(all_outputs, dim=0)
final_outputs

tensor([[-2.3473e+00,  2.0970e-01],
        [-1.1951e+00, -2.1205e+00],
        [-8.7277e-01, -1.0254e+00],
        [-3.4778e-01,  1.5786e+00],
        [-1.6519e+00, -1.2634e+00],
        [-3.9754e-01,  3.1927e-01],
        [-1.0384e+00, -3.2251e-01],
        [ 2.0148e-01, -9.1854e-01],
        [-4.0202e-01, -2.3372e+00],
        [ 1.2803e+00, -9.9200e-01],
        [ 7.7555e-01,  7.0134e-01],
        [-8.4925e-01,  1.2945e+00],
        [-1.4989e+00,  1.1771e+00],
        [-1.0110e+00,  7.8009e-01],
        [-2.3031e+00,  5.5721e-01],
        [-6.6423e-01, -1.6560e+00],
        [ 1.0059e+00, -1.8183e+00],
        [-4.1092e-01, -1.3388e+00],
        [ 8.1853e-01, -1.7432e+00],
        [ 1.2786e+00, -8.4298e-01],
        [ 8.6556e-01, -1.0261e-01],
        [-2.1984e+00,  5.4832e-01],
        [ 9.1586e-01, -7.2871e-01],
        [-2.8044e+00, -8.8166e-03],
        [-1.6191e+00,  1.3207e+00],
        [-1.2830e+00, -1.1173e+00],
        [-1.0863e+00, -1.7783e-02],
        [ 1.2756e+00,  1.889

* argmax to get class prediction

In [None]:
final_outputs.argmax(axis=1)

tensor([1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1,
        1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0,
        0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1,
        1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1,
        1, 0, 1, 1], device='cuda:0')
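
The logits can also be turned into class probabilities with a softmax, which shows how confident each prediction is. Here is a minimal sketch in plain Python using the first two logit rows copied from the output above; `softmax` here is a hand-written helper, not the `torch` function.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# First two logit rows copied from the output above.
rows = [[-2.3473, 0.2097], [-1.1951, -2.1205]]
for row in rows:
    probs = softmax(row)
    pred = max(range(len(probs)), key=probs.__getitem__)  # same as argmax
    print([round(p, 3) for p in probs], "-> class", pred)
```

Softmax is monotonic, so the argmax over probabilities always agrees with the argmax over raw logits.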

* Move to CPU so we can use NumPy and set the `predictions` column to it

In [None]:
# Work on an explicit copy so pandas does not warn about writing to a slice of df
df_test = df_test.copy()
df_test['predictions'] = final_outputs.argmax(axis=1).cpu().numpy()
df_test['predictions']


Unnamed: 0,predictions
398,1
399,0
400,0
401,1
402,1
403,1
404,1
405,0
406,0
407,0


In [None]:
df_test['predictions'].value_counts()

Unnamed: 0_level_0,count
predictions,Unnamed: 1_level_1
1,64
0,36


#### Category map (left disabled so predictions stay as integer labels)

In [None]:
# Mapping back to category names is commented out: `predictions` must stay
# integer-valued to be compared against `output` in the metrics below.
#category_map = {0: 'NDMD', 1: 'DMD'}

df_test['predictions']


Unnamed: 0,predictions
398,1
399,0
400,0
401,1
402,1
403,1
404,1
405,0
406,0
407,0


### Analyze performance as in intro notebook

In [None]:
def get_performance_metrics(df_test):
  y_test = df_test.output
  y_pred = df_test.predictions

  print("Confusion Matrix:")
  print(confusion_matrix(y_test, y_pred))

  print("\nClassification Report:")
  print(classification_report(y_test, y_pred))

  print("Balanced Accuracy Score:", balanced_accuracy_score(y_test, y_pred))
  print("Accuracy Score:", accuracy_score(y_test, y_pred))

In [None]:
get_performance_metrics(df_test)

Confusion Matrix:
[[22 26]
 [14 38]]

Classification Report:
              precision    recall  f1-score   support

           0       0.61      0.46      0.52        48
           1       0.59      0.73      0.66        52

    accuracy                           0.60       100
   macro avg       0.60      0.59      0.59       100
weighted avg       0.60      0.60      0.59       100

Balanced Accuracy Score: 0.594551282051282
Accuracy Score: 0.6
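
The summary scores can be re-derived from the confusion matrix alone, which makes it clear what balanced accuracy measures. Below is a minimal sketch in plain Python using the matrix printed above; it does not call scikit-learn.

```python
# Rows are true classes, columns are predicted classes (scikit-learn convention).
cm = [[22, 26],
      [14, 38]]

# Recall of class i = correct predictions for i / all true examples of i.
recalls = [cm[i][i] / sum(cm[i]) for i in range(len(cm))]
balanced_accuracy = sum(recalls) / len(recalls)  # mean of per-class recalls
accuracy = sum(cm[i][i] for i in range(len(cm))) / sum(map(sum, cm))

print("Per-class recall:", [round(r, 2) for r in recalls])
print("Balanced accuracy:", balanced_accuracy)
print("Accuracy:", accuracy)
```

Before any fine-tuning the model is close to chance on both measures, which is the baseline the trained model should beat.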
