# LLM Fine-tuning
In this project, we fine-tune a small, GPU-friendly large language model (LLM), <b>Qwen2.5-0.5B-Instruct</b>. To this end, we import the required model and libraries from <a href="https://huggingface.co/" target="_blank"><strong>Hugging Face</strong></a>.

## Import Libraries

In [None]:
from datasets import Dataset                              # creates Hugging Face dataset from Python lists/dicts
from transformers import AutoTokenizer                    # loads tokenizer for the chosen model
from transformers import AutoModelForCausalLM             # loads a causal language model
from transformers import TrainingArguments                # defines all hyperparameters for training
from transformers import DataCollatorForLanguageModeling  # prepares batches for training
from transformers import Trainer                          # is the Hugging Face API for fine-tuning

import warnings
warnings.filterwarnings("ignore")

## Define Checkpoint
The checkpoint defines the foundation model we are going to use.

In [None]:
checkpoint = "Qwen/Qwen2.5-0.5B-Instruct"

## Load Tokenizer
Load the tokenizer corresponding to the foundation model.

In [None]:
# fast tokenizer is faster and backed by Rust (systems programming language)
tokenizer = AutoTokenizer.from_pretrained(checkpoint, use_fast=True)

# some tokenizers do not define a dedicated padding token; in that case, we fall back
# to the end-of-sequence (eos) token so that padding does not raise an error
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

## Create a Synthetic Dataset

In [None]:
rows = [
    {"prompt": "Explain overfitting to a 10-year-old.",
     "response": "Overfitting is like memorizing your homework answers."},
    {"prompt": "Give tips for writing a research paper.",
     "response": "Brainstorm about the idea, read the literature, and define motivation, contribution, and solution"}
]

## Format Data for the Chat Model

In [None]:
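def format_row(r):
    # <|user|> and <|assistant|> are plain-text role markers chosen for this demo; they are
    # not the model's built-in chat template (see tokenizer.apply_chat_template for that)
    return f"<|user|>\n{r['prompt']}\n<|assistant|>\n{r['response']}"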

data = Dataset.from_list([
    {'text': format_row(r)} for r in rows
])
data

In [None]:
print(data[0])

## Tokenize the Dataset
The tokenizer converts the input text into <b>token IDs</b>. We use the following settings:
<ul>
    <li>
        <b>truncation:</b> ensures that sequences do not exceed the maximum number of tokens (i.e., <em>max_length</em>).
    </li>
    <li>
        <b>padding:</b> pads all sequences to <em>max_length</em>; padding can be applied either here or by the data collator.
    </li>
    <li>
        <b>labeling:</b> since this is causal LM training (predicting the next token), we set <b>labels = input_ids</b>; the model shifts the labels by one position internally.
    </li>
</ul>
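
The three settings above can be illustrated on toy token IDs without loading a tokenizer. The following is a minimal sketch, not the tokenizer's actual implementation: the helper `encode_fixed_length`, the toy IDs, `pad_id=0`, and right-side padding are all assumptions made for illustration.

```python
def encode_fixed_length(ids, max_length, pad_id=0):
    """Truncate or right-pad a list of token IDs to exactly max_length (toy sketch)."""
    ids = ids[:max_length]                           # truncation: drop tokens beyond max_length
    n_pad = max_length - len(ids)
    input_ids = ids + [pad_id] * n_pad               # padding: fill up to max_length
    attention_mask = [1] * len(ids) + [0] * n_pad    # 1 = real token, 0 = padding
    labels = list(input_ids)                         # labeling: labels = input_ids
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


# hypothetical token IDs: one sequence shorter and one longer than max_length
short = encode_fixed_length([11, 12, 13], max_length=5)
long = encode_fixed_length([21, 22, 23, 24, 25, 26, 27], max_length=5)
print(short)
print(long)
```

The short sequence is padded and its attention mask marks the padded positions with 0, while the long sequence is cut to the first five IDs.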

In [None]:
def tokenization_fn(batch):
    out = tokenizer(batch['text'],
                    truncation=True,
                    padding="max_length",
                    max_length=1024,
                    return_tensors=None)
    out["labels"] = out["input_ids"].copy()

    return out

token_data = data.map(tokenization_fn, batched=True, remove_columns=["text"])
token_data

## Load Model
We load the foundation model. To this end:
<ul>
    <li>
        We set <em>device_map="auto"</em>, which lets the library place the model on the available hardware automatically (a GPU if one is available, otherwise the CPU).
    </li>
    <li>
        We set <em>torch_dtype="auto"</em>, which loads the weights in the dtype stored in the checkpoint instead of a fixed default.
    </li>
</ul>

In [None]:
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, device_map="auto", torch_dtype="auto"
)

## Define Training Hyperparameters
We have a list of parameters and hyperparameters to set:
<ul>
    <li>
        <b>output_dir:</b> the directory where checkpoints are saved.
    </li>
    <li>
        <b>per_device_train_batch_size:</b> number of samples per device in each forward/backward pass; a small value keeps video RAM (VRAM) usage low.
    </li>
    <li>
        <b>gradient_accumulation_steps:</b> number of batches whose gradients are accumulated before each optimizer update; simulates a larger effective batch size without increasing VRAM usage.
    </li>
    <li>
        <b>num_train_epochs:</b> number of passes over the dataset for fine-tuning.
    </li>
    <li>
        <b>learning_rate:</b> initial learning rate of the optimizer.
    </li>
    <li>
        <b>logging_steps:</b> the training loss is logged once every this many update steps.
    </li>
    <li>
        <b>save_steps:</b> a checkpoint is saved once every this many update steps.
    </li>
</ul>
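
To make the interaction of these arguments concrete, the arithmetic below estimates the effective batch size and the number of optimizer updates for the values used in this notebook. It is a rough sketch: the helper `training_schedule`, `num_devices=1`, and the ceiling-based step count are assumptions, and the exact count reported by the Trainer can differ.

```python
import math


def training_schedule(num_samples, per_device_batch_size, grad_accum_steps,
                      num_epochs, num_devices=1):
    """Estimate effective batch size and optimizer updates (approximation)."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    batches_per_epoch = math.ceil(num_samples / (per_device_batch_size * num_devices))
    # assumption: at least one optimizer update happens per epoch, even when
    # fewer batches than grad_accum_steps are available
    updates_per_epoch = max(1, math.ceil(batches_per_epoch / grad_accum_steps))
    return effective_batch, updates_per_epoch * num_epochs


effective_batch, total_updates = training_schedule(
    num_samples=2, per_device_batch_size=2, grad_accum_steps=8, num_epochs=3
)
print("effective batch size:", effective_batch)      # 2 * 8 * 1
print("approx. optimizer updates:", total_updates)
```

With only two samples, this estimate stays far below <em>logging_steps</em>=10 and <em>save_steps</em>=1000, so under these assumptions no intermediate loss is logged and no intermediate checkpoint is saved on the synthetic dataset.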

In [None]:
args = TrainingArguments(
    output_dir='output_dir',
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    num_train_epochs=3,
    learning_rate=1e-4,
    logging_steps=10,
    save_steps=1000
)

## Create Data Collator
The data collator handles padding and batching. It takes a list of individual data samples and organizes them into a single, consistent batch by padding the sequences, creating attention masks, handling special tokens, etc.
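
As a rough illustration of what such a collator does, the sketch below pads a list of variable-length samples to the longest one in the batch, builds the attention masks, and marks padded label positions with -100, the index that PyTorch's cross-entropy loss ignores by default. The helper `toy_collate` and `pad_id=0` are hypothetical; this is not the library's implementation.

```python
IGNORE_INDEX = -100  # label value ignored by PyTorch's cross-entropy loss by default


def toy_collate(samples, pad_id=0):
    """Pad variable-length token-ID lists into one rectangular batch (toy sketch)."""
    longest = max(len(s) for s in samples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for ids in samples:
        n_pad = longest - len(ids)
        batch["input_ids"].append(ids + [pad_id] * n_pad)
        batch["attention_mask"].append([1] * len(ids) + [0] * n_pad)
        # padded positions should not contribute to the loss
        batch["labels"].append(ids + [IGNORE_INDEX] * n_pad)
    return batch


batch = toy_collate([[5, 6, 7, 8], [9, 10]])  # made-up token IDs
print(batch)
```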

In [None]:
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)  # mlm=False: causal LM, not a masked language model (MLM) task
# the collator also pads each batch to its longest sequence, so padding="max_length"
# in the tokenizer is optional

## Initialize Trainer

In [None]:
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=token_data,
    data_collator=collator
)
trainer.train()

## Save Model and Tokenizer

In [None]:
model.save_pretrained("models/fine_tuning/fine_tuned")
tokenizer.save_pretrained("models/fine_tuning/fine_tuned")

## Evaluations

In [None]:
# note: the notebook has no held-out split, so the metrics below are computed on the training data
metrics = trainer.evaluate(token_data)
for k, v in metrics.items():
    print(f"{k}: {v}")

### Perplexity
Perplexity (PPL) measures how well the model predicts the next token; it is the exponential of the average cross-entropy loss:

<center>
    $\text{Perplexity (PPL)} = \exp(\text{loss})$
</center>

<ul>
    <li>
        <b>Low PPL (&lt;20)</b> → better language modeling.
    </li>
    <li>
        <b>High PPL (&gt;100)</b> → model struggles with the dataset.
    </li>
</ul>
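
As a small worked example of the formula, the sketch below computes the loss as the mean negative log-probability of the correct tokens and exponentiates it. The probabilities are made up for illustration: a model that assigns probability 0.25 to every correct token is as uncertain as a uniform choice among four tokens, so its perplexity is 4.

```python
import math


def perplexity(token_probs):
    """PPL = exp(mean negative log-probability of the correct tokens)."""
    loss = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(loss)


uniform_ppl = perplexity([0.25, 0.25, 0.25, 0.25])   # equals 4 up to rounding
confident_ppl = perplexity([0.9, 0.8, 0.95])         # close to 1: confident predictions
print(uniform_ppl, confident_ppl)
```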

In [None]:
import math
loss = metrics['eval_loss']
ppl = math.exp(loss)
print('Perplexity:', ppl)

### BLEU and ROUGE
<ul>
    <li>
        <b>BLEU:</b> evaluates the quality of generated text (originally machine translations) by measuring its n-gram overlap with human-written references.
    </li>
    <li>
        <b>ROUGE:</b> calculates precision, recall, and F1 scores that quantify the overlap in words, phrases, and sequences between the generated text and the reference.
    </li>
</ul>
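
To give an intuition for the n-gram overlap behind both metrics, here is a simplified sketch using whitespace tokenization. It computes a clipped n-gram precision (the building block of BLEU) together with recall and F1 (in the spirit of ROUGE-N). The helper names are hypothetical, and the full metrics add further details, such as BLEU's brevity penalty.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def overlap_scores(prediction, reference, n=1):
    """Clipped n-gram precision, recall, and F1 between two strings (toy sketch)."""
    pred = ngrams(prediction.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((pred & ref).values())          # clipped counts of shared n-grams
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1


p, r, f1 = overlap_scores("the cat sat on the mat", "the cat is on the mat")
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

In this example, five of the six unigrams match, whereas only three of the five bigrams do, which shows why higher-order n-gram scores are usually lower.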

We need to define a function to generate the outputs, using the following parameters:
<ul>
    <li>
        <b>max_new_tokens</b>: limits the number of tokens the model can generate. It counts only new tokens, not the input. For example, with a 20-token input and <em>max_new_tokens</em>=256, the output sequence can be up to 276 tokens (20 input + 256 generated).
    </li>
    <li>
        <b>do_sample</b>: without sampling, generation uses greedy decoding (always picking the highest-probability next token). Setting this to <em>True</em> samples from the model's probability distribution instead, which makes the output less deterministic and more creative/varied.
    </li>
    <li>
        <b>temperature</b>: controls the randomness of sampling. Values below 1.0 sharpen the distribution, making predictions more confident/conservative; values above 1.0 flatten it, making predictions more random/creative.
    </li>
    <li>
        <b>top_p</b>: implements nucleus sampling, i.e., only considers the smallest set of tokens whose cumulative probability is greater than or equal to <em>top_p</em>.
    </li>
</ul>
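
The sampling parameters above can be sketched on a toy distribution with the standard library alone. This is an illustrative sketch, not the code path used by `model.generate`: the helper names and the logits are made up, and the real implementation works on tensors over the full vocabulary.

```python
import math
import random


def apply_temperature(logits, temperature):
    """Softmax over logits / temperature; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                                   # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept


def sample_next_token(logits, temperature=0.7, top_p=0.9, rng=random):
    """Sample one token index using temperature scaling and nucleus (top-p) filtering."""
    probs = apply_temperature(logits, temperature)
    kept = top_p_filter(probs, top_p)
    weights = [probs[i] for i in kept]                # random.choices normalizes the weights
    return rng.choices(kept, weights=weights, k=1)[0]


toy_logits = [2.0, 1.0, 0.5, -1.0]                    # made-up scores for four tokens
nucleus = top_p_filter([0.5, 0.3, 0.15, 0.05], top_p=0.9)
token = sample_next_token(toy_logits, rng=random.Random(0))
print(nucleus, token)
```

For the probabilities 0.5, 0.3, 0.15, and 0.05 with <em>top_p</em>=0.9, the first three tokens are kept because their cumulative probability (0.95) is the first to reach 0.9; the least likely token is never sampled.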

In [None]:
def chat(prompt):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.7,
        top_p=0.9
    )
    
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

In [None]:
# pip install evaluate absl-py nltk datasets rouge-score

In [None]:
import numpy as np
import evaluate
import nltk
# nltk.download('punkt')   # ensures ROUGE can tokenize correctly

rouge = evaluate.load('rouge')
bleu = evaluate.load('bleu')

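predictions, references = [], []
for row in data:
    # split each sample into the prompt part and the reference answer
    prompt_part, ref = row["text"].split("<|assistant|>", 1)
    prompt = prompt_part + "<|assistant|>\n"
    predictions.append(chat(prompt))  # decoded text contains the prompt and the generated answer
    references.append(ref.strip())

# score only the generated answers, not the prompts echoed back in the decoded text
answers = [p.split("<|assistant|>", 1)[-1].strip() for p in predictions]
rouge_score = rouge.compute(predictions=answers, references=references)
bleu_score = bleu.compute(predictions=answers, references=references)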

In [None]:
for k, v in rouge_score.items():
    print(f"{k}: {v: .4f}")

In [None]:
grams = ['unigrams', 'bigrams', 'trigrams', 'quadgrams']
for k, v in bleu_score.items():
    if k == 'bleu':
        print(f"{k}: {v: .4f}")
    elif k == 'precisions':
        for i in range(4):
            print(f"{grams[i]}: {v[i]: .4f}")

In [None]:
for i in range(len(references)):
    print(f"Reference: {references[i]}")
    print(f"\nPrediction: {predictions[i].split('<|assistant|>', 1)[1]}")
    print("******************************")