# Finetune LLM (Llama 3-8B-Instruct)

## Installing and importing requirements

This notebook was run on Kaggle, so we recommend running it there. Beyond the libraries that are already installed in the Kaggle environment, it requires the following packages.
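
Before installing anything, it can help to see which of the modules imported in this notebook are already available in the current environment. The following is a minimal sketch: the list `REQUIRED` is assembled from this notebook's import cells, and `missing_packages` is a hypothetical helper name, not part of any library.

```python
import importlib.util

# Top-level module names imported in this notebook (autotrain-advanced imports as "autotrain").
REQUIRED = ["autotrain", "peft", "transformers", "huggingface_hub",
            "torch", "numpy", "pandas", "tqdm", "sklearn"]

def missing_packages(names):
    """Return the module names from `names` that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]

print("Missing packages:", missing_packages(REQUIRED))
```

An empty list means every module can be found; anything listed still needs a `pip install`.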

In [2]:
! pip install -U autotrain-advanced > install_logs.txt 2>&1
! pip install peft

In [24]:
import os
from huggingface_hub import notebook_login
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    pipeline,
)
from peft import LoraConfig, PeftModel
import numpy as np
import pandas as pd
from tqdm import tqdm, trange
from sklearn.metrics import classification_report

Since we are using the Llama 3 model, you will need to have access to this model in order to run this notebook. When you have been granted access to the model, you need to enter your HF access token by running the following cell.

In [4]:
notebook_login()


## Fine-tune LLM

Here, we finetune the LLM on our dataset. We are using the AutoTrain framework to finetune the model. Please pay attention to the following notes:

- The model is finetuned on the `train.csv` file ([link](https://github.com/arshandalili/semantic-plausibility/blob/main/models/Fine-tuned%20LLM/train.csv)), which MUST be in the `data` directory in the current working directory.
- Please enter your HF access token and HF username in the `hf_token` and `hf_username` variables.
- The model is finetuned for 8 epochs. You can change this by changing the `num_epochs` variable.
- Also, feel free to change other hyperparameters as you see fit.
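
Since training fails late if the data file is missing or malformed, a quick check up front can save time. The sketch below assumes the layout described in the notes above (`data/train.csv`) and the `text` column named in the config's `column_mapping`; `check_train_csv` is a hypothetical helper written for illustration.

```python
import csv
import os

def check_train_csv(path="data/train.csv", text_column="text"):
    """Return a list of problems found with the training file (empty if none)."""
    if not os.path.isfile(path):
        return [f"{path} does not exist"]
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        header = reader.fieldnames or []
        if text_column not in header:
            return [f"column '{text_column}' not found in header {header}"]
        # Count rows whose text cell is missing or blank.
        empty = sum(1 for row in reader if not (row[text_column] or "").strip())
    return [f"{empty} rows have an empty '{text_column}' value"] if empty else []

print(check_train_csv())
```

An empty list means the file exists, has the expected column, and contains no blank training examples.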

In [None]:
! autotrain setup --colab > setup_logs.txt
from autotrain import __version__
print(f'AutoTrain version: {__version__}')

In [5]:
#@markdown ---
#@markdown #### Project Config
#@markdown Note: if you are using a restricted/private model, you need to enter your Hugging Face token in the next step.
project_name = 'llama3-8b-instruct-shroom' # @param {type:"string"}
model_name = 'meta-llama/Meta-Llama-3-8B-Instruct' # @param {type:"string"}

#@markdown ---
#@markdown #### Push to Hub?
#@markdown Use these only if you want to push your trained model to a private repo in your Hugging Face Account
#@markdown If you don't use these, the model will be saved in Google Colab and you are required to download it manually.
#@markdown Please enter your Hugging Face write token. The trained model will be saved to your Hugging Face account.
#@markdown You can find your token here: https://huggingface.co/settings/tokens
push_to_hub = True # @param ["False", "True"] {type:"raw"}
hf_token = "HF_TOKEN" #@param {type:"string"} ########### IMPORTANT ############
hf_username = "HF_USERNAME" #@param {type:"string"} ########### IMPORTANT ############

#@markdown ---
#@markdown #### Hyperparameters
unsloth = False # @param ["False", "True"] {type:"raw"}
learning_rate = 2e-4 # @param {type:"number"}
num_epochs = 8 #@param {type:"number"}
batch_size = 1 # @param {type:"slider", min:1, max:32, step:1}
block_size = 1024 # @param {type:"number"}
trainer = "sft" # @param ["generic", "sft"] {type:"string"}
warmup_ratio = 0.1 # @param {type:"number"}
weight_decay = 0.01 # @param {type:"number"}
gradient_accumulation = 4 # @param {type:"number"}
mixed_precision = "fp16" # @param ["fp16", "bf16", "none"] {type:"string"}
peft = True # @param ["False", "True"] {type:"raw"}
quantization = "int4" # @param ["int4", "int8", "none"] {type:"string"}
lora_r = 16 #@param {type:"number"}
lora_alpha = 32 #@param {type:"number"}
lora_dropout = 0.05 #@param {type:"number"}

os.environ["HF_TOKEN"] = hf_token
os.environ["HF_USERNAME"] = hf_username

conf = f"""
task: llm-{trainer}
base_model: {model_name}
project_name: {project_name}
log: tensorboard
backend: local

data:
  path: data/
  train_split: train
  valid_split: null
  chat_template: null
  column_mapping:
    text_column: text

params:
  block_size: {block_size}
  lr: {learning_rate}
  warmup_ratio: {warmup_ratio}
  weight_decay: {weight_decay}
  epochs: {num_epochs}
  batch_size: {batch_size}
  gradient_accumulation: {gradient_accumulation}
  mixed_precision: {mixed_precision}
  peft: {peft}
  quantization: {quantization}
  lora_r: {lora_r}
  lora_alpha: {lora_alpha}
  lora_dropout: {lora_dropout}
  unsloth: {unsloth}

hub:
  username: ${{HF_USERNAME}}
  token: ${{HF_TOKEN}}
  push_to_hub: {push_to_hub}
"""

with open("conf.yaml", "w") as f:
    f.write(conf)
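
Two derived quantities are worth checking when changing the hyperparameters above: the effective batch size per optimizer step (batch size times gradient accumulation steps, assuming a single device) and the LoRA scaling factor (`lora_alpha / lora_r` in the standard LoRA formulation). A small sketch with the values from the cell above; `derived_hparams` is a hypothetical helper.

```python
def derived_hparams(batch_size, gradient_accumulation, lora_r, lora_alpha):
    """Compute quantities implied by the training hyperparameters (single-device assumption)."""
    return {
        # Number of examples contributing to each optimizer step.
        "effective_batch_size": batch_size * gradient_accumulation,
        # Factor by which the low-rank update is scaled in standard LoRA.
        "lora_scaling": lora_alpha / lora_r,
    }

print(derived_hparams(batch_size=1, gradient_accumulation=4, lora_r=16, lora_alpha=32))
# {'effective_batch_size': 4, 'lora_scaling': 2.0}
```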

The following command finetunes the model and pushes it to your Hub repository. It takes roughly 30 minutes to 1 hour. My repository is [here](https://huggingface.co/arshandalili/llama3-8b-instruct-shroom).

In [18]:
! autotrain --config conf.yaml

INFO     | 2024-07-04 18:23:18 | autotrain.cli.autotrain:main:58 - Using AutoTrain configuration: conf.yaml
INFO     | 2024-07-04 18:23:18 | autotrain.parser:__post_init__:133 - Running task: lm_training
INFO     | 2024-07-04 18:23:18 | autotrain.parser:__post_init__:134 - Using backend: local
INFO     | 2024-07-04 18:23:18 | autotrain.parser:run:194 - {'model': 'meta-llama/Meta-Llama-3-8B-Instruct', 'project_name': 'llama3-8b-instruct-shroom', 'data_path': 'data/', 'train_split': 'train', 'valid_split': None, 'add_eos_token': True, 'block_size': 1024, 'model_max_length': 2048, 'padding': 'right', 'trainer': 'sft', 'use_flash_attention_2': False, 'log': 'tensorboard', 'disable_gradient_checkpointing': False, 'logging_steps': -1, 'eval_strategy': 'epoch', 'save_total_limit': 1, 'auto_fi

## Inference

### Loading the models

We now load the finetuned model and the tokenizer. Make sure to be authenticated to Hugging Face to load the model.

In [5]:
model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
new_model = "arshandalili/llama3-8b-instruct-shroom"


base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map='auto',
)
model = PeftModel.from_pretrained(base_model, new_model)
model = model.merge_and_unload()

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
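
`merge_and_unload()` folds the LoRA adapter into the base weights so that inference needs no separate adapter layers. Conceptually, for each adapted weight matrix W, standard LoRA merging computes W + (alpha / r) · B·A. The toy sketch below illustrates that arithmetic with plain Python lists; the matrices and helper names are made up for illustration, and this is not how `peft` implements the merge internally.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def merge_lora(w, lora_a, lora_b, alpha, r):
    """Return W + (alpha / r) * (B @ A) for a single weight matrix."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

w = [[1.0, 0.0], [0.0, 1.0]]   # toy 2x2 base weight
lora_a = [[1.0, 2.0]]          # A: r x in_features, with r = 1
lora_b = [[0.5], [0.0]]        # B: out_features x r
print(merge_lora(w, lora_a, lora_b, alpha=2, r=1))
# [[2.0, 2.0], [0.0, 1.0]]
```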



### Creating a pipeline for the model

In [29]:
# Use the merged finetuned model (`model`) rather than the `base_model` variable.
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=300, temperature=0.1)

## Results

In [25]:
## Load the test dataset

test_df = pd.read_csv('/kaggle/input/shroom/test_df_llm.csv')
test_df.head()

| Unnamed: 0.1 | Unnamed: 0 | id | src | tgt | hyp | task | labels | label | p(Hallucination) | text |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | Ты удивишься, если я скажу, что на самом деле ... | Would you be surprised if I told you my name i... | You're gonna be surprised if I say my real nam... | MT | ['Not Hallucination', 'Not Hallucination', 'No... | Not Hallucination | 0.0 | ### user: For the task MT Given a source sente... |
| 1 | 1 | 2 | Еды будет полно. | There will be plenty of food. | The food will be full. | MT | ['Hallucination', 'Not Hallucination', 'Halluc... | Hallucination | 0.8 | ### user: For the task MT Given a source sente... |
| 2 | 2 | 3 | Думаете, Том будет меня ждать? | Do you think that Tom will wait for me? | You think Tom's gonna wait for me? | MT | ['Not Hallucination', 'Not Hallucination', 'No... | Not Hallucination | 0.2 | ### user: For the task MT Given a source sente... |
| 3 | 3 | 6 | Два брата довольно разные. | The two brothers are pretty different. | There's a lot of friends. | MT | ['Hallucination', 'Hallucination', 'Hallucinat... | Hallucination | 1.0 | ### user: For the task MT Given a source sente... |
| 4 | 4 | 7 | `<define> Infradiaphragmatic </define> intra- a...` | (medicine) Below the diaphragm. | (anatomy) Relating to the diaphragm. | DM | ['Hallucination', 'Hallucination', 'Hallucinat... | Hallucination | 0.8 | ### user: For the task DM Given a source sente... |


In [30]:
## Run the model on dataset (it will take a couple of minutes ~5-10 mins)

result = pipe([f"<s>[INST] {prompt} [/INST]" for prompt in test_df['text'].tolist()])

In [43]:
## Check that every generation ends with a YES/NO answer

def check_output(data):
    # Each pipeline result is a list holding one dict with the key 'generated_text'.
    for i in tqdm(data):
        if not i[0]['generated_text'].endswith(("YES", "NO")):
            return False
    return True

check_output(result)

100%|██████████| 1500/1500 [00:00<00:00, 479897.48it/s]


True

In [45]:
## Extract the predictions
predictions = []
for i in tqdm(result):
    # Map a final "YES" to 1 (hallucination) and a final "NO" to 0.
    if i[0]['generated_text'].endswith("YES"):
        predictions.append(1)
    elif i[0]['generated_text'].endswith("NO"):
        predictions.append(0)

100%|██████████| 1500/1500 [00:00<00:00, 471199.52it/s]
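
The `endswith` checks above are strict: a trailing space, newline, or period after the answer would make a generation fail them. A more tolerant parser is sketched below. `parse_answer` is a hypothetical helper, the case-insensitive matching is an added assumption, and the mapping of YES to 1 and NO to 0 follows the extraction cell above.

```python
import string

def parse_answer(generated_text):
    """Map a generation ending in YES/NO to 1/0, ignoring trailing whitespace and punctuation.

    Returns None when the text does not end with a YES/NO answer.
    """
    cleaned = generated_text.rstrip(string.whitespace + string.punctuation).upper()
    if cleaned.endswith("YES"):
        return 1
    if cleaned.endswith("NO"):
        return 0
    return None

print([parse_answer(t) for t in ["... YES", "... no.\n", "... maybe"]])
# [1, 0, None]
```

Returning `None` instead of silently skipping a row makes it easy to count unparseable generations before assigning the predictions to the dataframe.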


In [47]:
test_df['prediction'] = predictions

In [48]:
test_df.head()

| Unnamed: 0.1 | Unnamed: 0 | id | src | tgt | hyp | task | labels | label | p(Hallucination) | text | prediction |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | Ты удивишься, если я скажу, что на самом деле ... | Would you be surprised if I told you my name i... | You're gonna be surprised if I say my real nam... | MT | ['Not Hallucination', 'Not Hallucination', 'No... | Not Hallucination | 0.0 | ### user: For the task MT Given a source sente... | 0 |
| 1 | 1 | 2 | Еды будет полно. | There will be plenty of food. | The food will be full. | MT | ['Hallucination', 'Not Hallucination', 'Halluc... | Hallucination | 0.8 | ### user: For the task MT Given a source sente... | 1 |
| 2 | 2 | 3 | Думаете, Том будет меня ждать? | Do you think that Tom will wait for me? | You think Tom's gonna wait for me? | MT | ['Not Hallucination', 'Not Hallucination', 'No... | Not Hallucination | 0.2 | ### user: For the task MT Given a source sente... | 1 |
| 3 | 3 | 6 | Два брата довольно разные. | The two brothers are pretty different. | There's a lot of friends. | MT | ['Hallucination', 'Hallucination', 'Hallucinat... | Hallucination | 1.0 | ### user: For the task MT Given a source sente... | 1 |
| 4 | 4 | 7 | `<define> Infradiaphragmatic </define> intra- a...` | (medicine) Below the diaphragm. | (anatomy) Relating to the diaphragm. | DM | ['Hallucination', 'Hallucination', 'Hallucinat... | Hallucination | 0.8 | ### user: For the task DM Given a source sente... | 1 |


In [None]:
# index=False avoids writing the row index as an extra "Unnamed: 0" column.
test_df.to_csv('test_df_LLM.csv', index=False)

In [49]:
## Evaluate the model
test_labels = [1 if x > 0.5 else 0 for x in test_df['p(Hallucination)'].tolist()]

print(classification_report(test_labels, predictions))

              precision    recall  f1-score   support

           0       0.85      0.61      0.71       889
           1       0.60      0.84      0.70       611

    accuracy                           0.70      1500
   macro avg       0.72      0.73      0.70      1500
weighted avg       0.75      0.70      0.71      1500
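
To make the report easier to interpret: gold labels come from thresholding `p(Hallucination)` at 0.5 (strictly greater), and precision and recall are then computed per class. The sketch below does this by hand for the five rows shown in the `test_df.head()` output above. The helper names are hypothetical, and five rows are far too few to say anything about the model; the point is only to show the arithmetic.

```python
def binarize(probabilities, threshold=0.5):
    """Label a row 1 (hallucination) only if its probability is strictly above the threshold."""
    return [1 if p > threshold else 0 for p in probabilities]

def precision_recall(labels, predictions, positive=1):
    """Compute precision and recall for the class `positive`."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == positive and p == positive)
    predicted_pos = sum(1 for p in predictions if p == positive)
    actual_pos = sum(1 for y in labels if y == positive)
    precision = tp / predicted_pos if predicted_pos else 0.0
    recall = tp / actual_pos if actual_pos else 0.0
    return precision, recall

head_probs = [0.0, 0.8, 0.2, 1.0, 0.8]   # p(Hallucination) of the first five rows
head_preds = [0, 1, 1, 1, 1]             # model predictions for the same rows
head_labels = binarize(head_probs)
print(head_labels)                                # [0, 1, 0, 1, 1]
print(precision_recall(head_labels, head_preds))  # (0.75, 1.0)
```

Note that a row with `p(Hallucination)` exactly equal to 0.5 is labeled 0 under this rule.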



## Results for each task

### DM

In [51]:
## Get the rows where task is DM
test_df_dm = test_df[test_df['task'] == 'DM']
y_test_dm = np.array(test_df_dm['p(Hallucination)'].tolist())
y_pred_dm = np.array(test_df_dm['prediction'].tolist())
test_labels_dm = [1 if x > 0.5 else 0 for x in y_test_dm]
print("DM Task Results: ")
print(classification_report(test_labels_dm, y_pred_dm))

DM Task Results: 
              precision    recall  f1-score   support

           0       0.89      0.39      0.54       275
           1       0.62      0.95      0.75       288

    accuracy                           0.68       563
   macro avg       0.76      0.67      0.65       563
weighted avg       0.75      0.68      0.65       563



### MT

In [52]:
## Get the rows where task is MT
test_df_mt = test_df[test_df['task'] == 'MT']
y_test_mt = np.array(test_df_mt['p(Hallucination)'].tolist())
y_pred_mt = np.array(test_df_mt['prediction'].tolist())
test_labels_mt = [1 if x > 0.5 else 0 for x in y_test_mt]
print("MT Task Results: ")
print(classification_report(test_labels_mt, y_pred_mt))

MT Task Results: 
              precision    recall  f1-score   support

           0       0.81      0.74      0.77       336
           1       0.66      0.75      0.70       226

    accuracy                           0.74       562
   macro avg       0.74      0.74      0.74       562
weighted avg       0.75      0.74      0.74       562



### PG

In [53]:
## Get the rows where task is PG
test_df_pg = test_df[test_df['task'] == 'PG']
y_test_pg = np.array(test_df_pg['p(Hallucination)'].tolist())
y_pred_pg = np.array(test_df_pg['prediction'].tolist())
test_labels_pg = [1 if x > 0.5 else 0 for x in y_test_pg]
print("PG Task Results: ")
print(classification_report(test_labels_pg, y_pred_pg))

PG Task Results: 
              precision    recall  f1-score   support

           0       0.87      0.68      0.76       278
           1       0.44      0.70      0.54        97

    accuracy                           0.69       375
   macro avg       0.65      0.69      0.65       375
weighted avg       0.76      0.69      0.71       375
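
The three per-task cells above repeat the same steps with a different task name. As a sketch of how the grouping could be done in one pass, the snippet below groups rows by task and computes accuracy per group using only the standard library; on the real `test_df` one would more likely loop over the task names and call `classification_report` on each subset. The rows are the five from the `test_df.head()` output, and `accuracy_by_task` is a hypothetical helper.

```python
from collections import defaultdict

def accuracy_by_task(tasks, labels, predictions):
    """Return a dict mapping each task name to the fraction of correct predictions."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, y, p in zip(tasks, labels, predictions):
        total[task] += 1
        correct[task] += int(y == p)
    return {task: correct[task] / total[task] for task in total}

head_tasks = ["MT", "MT", "MT", "MT", "DM"]
head_gold = [0, 1, 0, 1, 1]        # p(Hallucination) > 0.5 for the first five rows
head_predicted = [0, 1, 1, 1, 1]   # model predictions for the same rows
print(accuracy_by_task(head_tasks, head_gold, head_predicted))
# {'MT': 0.75, 'DM': 1.0}
```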

