# Finetuning Meta Llama 3 with Unsloth


<center>
    <img src="https://www.pc-tablet.co.in/wp-content/uploads/2024/04/Llama-3.webp" height="50px">
</center>




## This is a solution to CheckThat! Lab at CLEF 2024 Task 2: Subjectivity

### Definition
Systems are challenged to distinguish whether a sentence from a news article expresses the subjective view of the author behind it or presents an objective view on the covered topic instead.

This is a binary classification task in which systems have to identify whether a text sequence (a sentence or a paragraph) is subjective or objective.

In [1]:
%%capture
!pip install -U "xformers<0.0.26" --index-url https://download.pytorch.org/whl/cu121
!pip install "unsloth[kaggle-new] @ git+https://github.com/unslothai/unsloth.git"

# Temporary fix for https://github.com/huggingface/datasets/issues/6753
!pip install datasets==2.16.0 fsspec==2023.10.0 gcsfs==2023.10.0

import os
os.environ["WANDB_DISABLED"] = "true"

In [2]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.


model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit", # Choose ANY! eg teknium/OpenHermes-2.5-Mistral-7B
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.



==((====))==  Unsloth: Fast Llama patching release 2024.6
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.2.2+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. Xformers = 0.0.25.post1. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth



Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


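We now add LoRA adapters so that only a small fraction of the parameters needs to be updated. With the settings below, Unsloth reports 41,943,040 trainable parameters, which is about 0.5% of the 8B base model.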

In [3]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    use_gradient_checkpointing = "unsloth", # 4x longer contexts auto supported!
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2024.6 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
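
The trainable-parameter count that Unsloth prints in the training banner (41,943,040) can be reproduced by hand. The sketch below assumes the published Llama 3 8B layer shapes (hidden size 4096, MLP size 14336, key/value projection width 1024, 32 decoder layers); these dimensions are not stated anywhere in this notebook, so treat them as assumptions. The helper name `lora_param_count` is also made up for illustration. Each adapted weight of shape `out x in` receives two low-rank factors with `r * (in + out)` parameters in total.

```python
# LoRA adds two low-rank factors per adapted weight: A (r x in) and B (out x r).
# The layer shapes below are assumed Llama 3 8B dimensions, not read from the model.
R = 16
NUM_LAYERS = 32
HIDDEN, MLP, KV = 4096, 14336, 1024

# (in_features, out_features) for each target module in one decoder layer
target_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV),
    "v_proj": (HIDDEN, KV),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

def lora_param_count(shapes, r, num_layers):
    """Sum r * (in + out) over all adapted weights, then over all layers."""
    per_layer = sum(r * (n_in + n_out) for n_in, n_out in shapes.values())
    return per_layer * num_layers

trainable = lora_param_count(target_shapes, R, NUM_LAYERS)
print(f"{trainable:,}")  # 41,943,040
```

Because the count is linear in `r`, halving the rank to 8 would halve the number of trainable parameters under the same assumptions.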


# Load the dataset

We will use the CheckThat! Task 2 (Subjectivity) dataset to train this model. The English train, dev, and dev-test splits are concatenated into a single training set.


In [4]:
import pandas as pd
df=pd.read_csv("/kaggle/input/checkthat-dataset/clef2024-checkthat-lab/task2/data/subtask-2-english/train_en.tsv", encoding="utf-8", sep='\t')

In [5]:

df1=pd.read_csv("/kaggle/input/checkthat-dataset/clef2024-checkthat-lab/task2/data/subtask-2-english/dev_en.tsv", encoding="utf-8", sep='\t')

In [6]:

df2=pd.read_csv("/kaggle/input/checkthat-dataset/clef2024-checkthat-lab/task2/data/subtask-2-english/dev_test_en.tsv", encoding="utf-8", sep='\t')

In [7]:
df=pd.concat([df,df1,df2])

In [8]:
label_map = {'SUBJ': 0, 'OBJ': 1}

# Map the labels to integer values
df['label'] = df['label'].map(label_map)

In [9]:
prompt = """Below is a sentence from a news article. The task is to determine whether the sentence is objective (1) or subjective (0). 

- A sentence is **subjective** if it reflects personal feelings, tastes, opinions, or beliefs. Subjective sentences often include:
  - Opinions or interpretations
  - Emotive language
  - Personal judgments
  - Words that indicate uncertainty (e.g., "seems," "appears")
  - Comparisons based on personal views (e.g., "better," "worse")

- A sentence is **objective** if it presents factual information that is not influenced by personal feelings or opinions. Objective sentences typically include:
  - Verifiable facts
  - Statistical data
  - Observable phenomena
  - Neutral language
  - Statements that can be proven true or false

Determine whether the sentence expresses a subjective view (the author's opinion or feeling) or an objective view (a factual statement). Output 1 for objective and 0 for subjective. Provide only the number and nothing else.

### Sentence:
{}

### Output:
{}
"""
EOS_TOKEN = tokenizer.eos_token  # Appended to every training example

def formatting_prompts_func(examples):
    # Called with batched=True, so each value is a list covering the whole batch
    sentences = examples["sentence"]
    labels    = examples["label"]
    texts = []
    for sentence, label in zip(sentences, labels):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = prompt.format(sentence, label) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }

from datasets import Dataset
dataset = Dataset.from_pandas(df)
dataset = dataset.map(formatting_prompts_func, batched = True)

Map:   0%|          | 0/1292 [00:00<?, ? examples/s]
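
To make the result of the formatting step concrete, here is a minimal sketch of what one training string looks like. It uses a shortened stand-in template (`demo_prompt`), a placeholder end-of-sequence string (`DEMO_EOS`), and two made-up sentences; none of these come from the dataset, and the real notebook takes the EOS token from `tokenizer.eos_token`.

```python
# Shortened stand-in for the notebook's prompt; only the two slots matter here.
demo_prompt = """### Sentence:
{}

### Output:
{}
"""
DEMO_EOS = "<|end_of_text|>"  # placeholder; the notebook uses tokenizer.eos_token

def format_batch(examples, template=demo_prompt, eos=DEMO_EOS):
    """Mimic a batched Dataset.map call: dict of columns in, dict of columns out."""
    texts = [
        template.format(sentence, label) + eos
        for sentence, label in zip(examples["sentence"], examples["label"])
    ]
    return {"text": texts}

# Hypothetical examples: one subjective (0) and one objective (1) sentence
batch = {
    "sentence": ["The plan is a disaster.", "The bill passed 52 to 48."],
    "label": [0, 1],
}
formatted = format_batch(batch)
print(formatted["text"][0])
```

Note that the template places a newline between the label and the EOS token, so the model is trained to emit the digit, a line break, and then stop.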

In [10]:
tokenizer.padding_side = 'right'

<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We train for 12 epochs (`num_train_epochs = 12`); for a quick trial run, uncomment `max_steps` and set it to a small value, which takes precedence over the epoch count. Unsloth also supports TRL's `DPOTrainer`.

In [11]:
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # max_steps = 10,
        num_train_epochs = 12,
        learning_rate = 5e-5,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 10,
        optim = "adamw_8bit",
        weight_decay = 0.001,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none",
    ),
)


Map (num_proc=2):   0%|          | 0/1292 [00:00<?, ? examples/s]

In [12]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 1,292 | Num Epochs = 12
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 1,932
 "-____-"     Number of trainable parameters = 41,943,040


Step,Training Loss
10,1.6454
20,0.8093
30,0.4413
40,0.3966
50,0.4067
60,0.381
70,0.3898
80,0.3257
90,0.3688
100,0.3249
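
The "Total steps = 1,932" figure in the banner above follows from the batch settings. The sketch below is one way to reproduce it, assuming that a trailing group of batches smaller than `gradient_accumulation_steps` is not counted as an optimizer step; that rounding matches this log, but other versions of the training library may round differently. The function name is hypothetical.

```python
import math

def total_optimizer_steps(num_examples, per_device_batch, grad_accum, epochs, num_gpus=1):
    # Batches produced per epoch (the last, smaller batch is kept)
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_gpus))
    # One optimizer step per `grad_accum` batches; a leftover partial group is not counted
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return updates_per_epoch * epochs

steps = total_optimizer_steps(1292, per_device_batch=2, grad_accum=4, epochs=12)
print(steps)  # 1932
```

With 1,292 examples this gives 646 batches and 161 optimizer steps per epoch, so the ten logged loss values above cover less than the first epoch.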


<a name="Inference"></a>
### Inference
Let's run the model on the test set. Each sentence is inserted into the prompt and the output slot is left blank for the model to fill in.

In [13]:
import pandas as pd
df=pd.read_csv("/kaggle/input/checkthat-dataset/clef2024-checkthat-lab/task2/data/subtask-2-english/test_en.tsv", encoding="utf-8", sep='\t')

In [14]:
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

def generate_response(input_text):
    # Fill in the sentence and leave the output slot empty for the model to complete
    prompt_1 = prompt.format(input_text, "")
    inputs = tokenizer([prompt_1], return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=64, use_cache=True)
    response = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    # Take the first line generated after the "### Output:" marker (expected to be "0" or "1")
    response_text = response[0].split("### Output:\n")[1].strip().split('\n')[0]
    # Fall back to "0" (subjective) when the model returns more than a single character
    if len(response_text) > 1:
        response_text = "0"
    return response_text
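
The parsing step can be isolated and tested without a GPU. The following is a minimal sketch of a stricter parser, not the notebook's exact logic: the name `parse_label` and its `fallback` parameter are hypothetical, it reads the text after the last occurrence of the marker, and it returns the fallback whenever the first generated line is anything other than `"0"` or `"1"`, including when the marker is missing or the output is empty.

```python
def parse_label(decoded, marker="### Output:\n", fallback="0"):
    """Return "0" or "1" from decoded model text, or `fallback` if neither is found."""
    # Keep only what follows the last occurrence of the marker
    _, found, tail = decoded.rpartition(marker)
    if not found:
        return fallback
    stripped = tail.strip()
    first_line = stripped.splitlines()[0].strip() if stripped else ""
    return first_line if first_line in {"0", "1"} else fallback

# Hypothetical decoded strings
print(parse_label("### Sentence:\nLobster!\n\n### Output:\n\n1\n"))
print(parse_label("### Output:\nobjective"))
```

Defaulting to `"0"` mirrors the notebook's choice of treating unparseable output as subjective; a different fallback may suit other class balances.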

In [15]:
# Apply inference to each row in the DataFrame
df['label'] = df['sentence'].apply(generate_response)

# Print the DataFrame to see the results
print(df.head())

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.

                            sentence_id  \
0  44f33601-157a-42ce-aa9f-0f7d305501f2   
1  6f9e0f53-f76c-432f-bbea-b78400d600b8   
2  61f93bdc-4c3e-4963-926c-0bbf139b44b9   
3  902148ec-dda3-4736-b318-0f20c63a1cf3   
4  065b1996-4b40-4c74-9f62-afb44f69834e   

                                            sentence label  
0  Blanco established himself earlier in his care...     1  
1  RULE 13: ARTIFICIAL INTELLIGENCE  Not only thi...     1  
2  The valuation is required by law and the figur...     1  
3  A sip can really hit the spot after a long bik...     0  
4                                          Lobster!"     1  


In [16]:
model.save_pretrained("lora_model") # Local saving
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving

In [17]:
df1=pd.read_csv("/kaggle/input/checkthat-dataset/clef2024-checkthat-lab/task2/data/subtask-2-english/test_en_gold.tsv", encoding="utf-8", sep='\t')

In [18]:
label_map = {'SUBJ': '0', 'OBJ': '1'}

# Map the gold labels to the same string values the model outputs
df1['label'] = df1['label'].map(label_map)

In [19]:

df.to_csv("submission.csv",index=False)

In [20]:
import pandas as pd
from sklearn.metrics import f1_score

# Ensure labels are of the same type
df['label'] = df['label'].astype(str)
df1['label'] = df1['label'].astype(str)

# Calculate the macro-averaged F1 score (gold labels first, then predictions)
f1 = f1_score(df1['label'], df['label'], average='macro')
print(f"F1 Score: {f1:.4f}")

F1 Score: 0.7060


# Calculate accuracy, precision, recall, and F1

In [21]:
import pandas as pd
from sklearn.metrics import accuracy_score, precision_recall_fscore_support



# Extract prediction and gold labels (both files are assumed to list sentences in the same order)
pred_values = df['label']
gold_values = df1['label']

# Calculate metrics: overall accuracy, macro averages, and scores for the SUBJ class (label "0")
acc = accuracy_score(gold_values, pred_values)
m_prec, m_rec, m_f1, _ = precision_recall_fscore_support(gold_values, pred_values, average="macro", zero_division=0)
p_prec, p_rec, p_f1, _ = precision_recall_fscore_support(gold_values, pred_values, labels=["0"], zero_division=0)

# Print the results
print(f"macro_F1: {m_f1:.2f}\tmacro_P: {m_prec:.2f}\tmacro_R: {m_rec:.2f}")
print(f"SUBJ_F1: {p_f1[0]:.2f}\tSUBJ_P: {p_prec[0]:.2f}\tSUBJ_R: {p_rec[0]:.2f}")
print(f"accuracy: {acc:.2f}")


macro_F1: 0.71	macro_P: 0.75	macro_R: 0.69
SUBJ_F1: 0.54	SUBJ_P: 0.66	SUBJ_R: 0.45
accuracy: 0.80
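
Macro F1 averages the per-class F1 scores with equal weight, which is why the weaker SUBJ score (0.54) pulls the macro figure below the accuracy. The sketch below computes the same quantities by hand on a made-up set of six labels; the data and function names are hypothetical and only illustrate the definitions. It also shows that swapping gold and predicted labels exchanges precision and recall but leaves F1 unchanged.

```python
def per_class_f1(gold, pred, label):
    """Precision, recall, and F1 for a single class."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def macro_f1(gold, pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(gold) | set(pred))
    return sum(per_class_f1(gold, pred, lab)[2] for lab in labels) / len(labels)

# Hypothetical toy labels: "0" = SUBJ, "1" = OBJ
gold = ["0", "0", "1", "1", "1", "1"]
pred = ["0", "1", "1", "1", "1", "0"]

print(per_class_f1(gold, pred, "0"))  # SUBJ precision, recall, F1
print(per_class_f1(gold, pred, "1"))  # OBJ precision, recall, F1
print(macro_f1(gold, pred), macro_f1(pred, gold))
```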


# Save model to Hugging Face

In [22]:
# Merge to 16bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")

## This code is modified from https://www.kaggle.com/code/danielhanchen/kaggle-mistral-7b-unsloth-notebook