# LLMs Fine-tuning
In this project, we fine-tune the <b>GPT-2</b> model on a subset of the WikiText-2 dataset. To this end, we import the required models and libraries from <a href="https://huggingface.co/" target="_blank"><strong>Hugging Face</strong></a>.

## Import Libraries

In [1]:
import time
from datasets import load_dataset
from datasets import DatasetDict
from transformers import AutoTokenizer                    # loads tokenizer for the chosen model
from transformers import AutoModelForCausalLM             # loads a causal language model
from transformers import TrainingArguments                # defines all hyperparameters for training
from transformers import DataCollatorForLanguageModeling  # prepares batches for training
from transformers import Trainer                          # is the Hugging Face API for fine-tuning

import warnings
warnings.filterwarnings("ignore")

## Load Dataset

In [2]:
orig_dataset = load_dataset("wikitext", "wikitext-2-raw-v1")
orig_dataset

DatasetDict({
    test: Dataset({
        features: ['text'],
        num_rows: 4358
    })
    train: Dataset({
        features: ['text'],
        num_rows: 36718
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 3760
    })
})

In [3]:
N = 1000

dataset = DatasetDict({
    "train": orig_dataset["train"].select(range(N)),
    "validation": orig_dataset["validation"].select(range(N)),
    "test": orig_dataset["test"].select(range(N))
})

dataset

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 1000
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['text'],
        num_rows: 1000
    })
})

## Define Checkpoint
The checkpoint defines the foundation model we are going to use.

In [4]:
model_name = "gpt2"

## Load Tokenizer
Load the tokenizer corresponding to the foundation model.

In [5]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

# GPT-2 has no dedicated padding token, so we reuse the
# end-of-sequence (EOS) token for padding to avoid errors
tokenizer.pad_token = tokenizer.eos_token

## Tokenize the Dataset
The tokenizer is responsible for tokenizing the inputs into <b>token IDs</b>. In this regard:
<ul>
    <li>
        <b>truncation:</b> ensures that sequences do not exceed the maximum number of tokens (i.e., <em>max_length</em>).
    </li>
    <li>
        <b>padding:</b> pads every sequence to <em>max_length</em>; padding can be applied either here or by the data collator.
    </li>
    <li>
        <b>labeling:</b> since this is causal LM training (predicting the next token), we set <b>labels = input_ids</b>; the model shifts the labels by one position internally when it computes the loss.
    </li>
</ul>

In [6]:
def tokenization_fn(batch):
    out = tokenizer(batch['text'],
                    truncation=True,
                    padding="max_length",
                    max_length=128,
                    return_tensors=None)
    out["labels"] = out["input_ids"].copy()
    
    return out

tokenized_dataset = dataset.map(tokenization_fn, batched=True, remove_columns=["text"])
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 1000
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 1000
    })
})
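
To make the effect of truncation and padding concrete, here is a minimal pure-Python sketch. It is not the Hugging Face implementation: the helper `pad_or_truncate`, the value `PAD_ID`, and the token IDs are made up for illustration, and right-padding is assumed.

```python
# Toy illustration of truncation + padding to a fixed length.
# PAD_ID and the token IDs are made-up values, not GPT-2's real vocabulary.
PAD_ID = 0

def pad_or_truncate(ids, max_length, pad_id=PAD_ID):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = ids[:max_length]                         # truncation
    n_pad = max_length - len(ids)
    input_ids = ids + [pad_id] * n_pad             # right-padding (assumed)
    attention_mask = [1] * len(ids) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask

short_ids, short_mask = pad_or_truncate([11, 12, 13], max_length=5)
long_ids, long_mask = pad_or_truncate([21, 22, 23, 24, 25, 26, 27], max_length=5)

print(short_ids, short_mask)  # [11, 12, 13, 0, 0] [1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [21, 22, 23, 24, 25] [1, 1, 1, 1, 1]
```

The attention mask is what lets the model ignore the padded positions.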

## Load Model
We load the foundation model. To this end:
<ul>
    <li>
        We set <em>device_map="auto"</em>, so that the model is placed automatically on the available hardware (e.g., cuda, mps, or cpu).
    </li>
    <li>
        We set <em>torch_dtype="auto"</em>, which loads the weights in the dtype stored in the checkpoint.
    </li>
</ul>

In [7]:
model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="auto", torch_dtype="auto"
)

## Define Training Hyperparameters
We have a list of parameters and hyperparameters to set:
<ul>
    <li>
        <b>output_dir:</b> the directory where checkpoints are saved.
    </li>
    <li>
        <b>per_device_train/eval_batch_size:</b> number of samples per device in each training/evaluation batch; small values keep video RAM (VRAM) usage low.
    </li>
    <li>
        <b>gradient_accumulation_steps:</b> simulates larger effective batch size without increasing VRAM.
    </li>
    <li>
        <b>num_train_epochs:</b> number of passes over the dataset for fine-tuning.
    </li>
    <li>
        <b>learning_rate:</b> initial step size used by the optimizer.
    </li>
    <li>
        <b>logging_steps:</b> number of update steps between two loss logs.
    </li>
    <li>
        <b>save_steps:</b> number of update steps between two checkpoint saves.
    </li>
</ul>

In [8]:
args = TrainingArguments(
    output_dir='output_dir/finetune',
    eval_strategy="epoch",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=8,
    num_train_epochs=3,
    learning_rate=2e-4,
    logging_steps=10,
    save_steps=1000,
    label_names=["labels"]
)
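
As a rough sanity check on these settings, the sketch below works out the effective batch size and shows on a toy problem why gradient accumulation can stand in for a larger batch. The single-device setting (`num_devices = 1`), the helper `mse_grad`, and the toy data are assumptions for illustration; how the Trainer handles a final, incomplete accumulation group is not modeled here.

```python
import math

# Values copied from the TrainingArguments above; num_devices is an assumption.
per_device_train_batch_size = 8
gradient_accumulation_steps = 8
num_devices = 1            # assumption: a single GPU or CPU
num_train_samples = 1000   # size of the training subset

effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
micro_batches_per_epoch = math.ceil(
    num_train_samples / (per_device_train_batch_size * num_devices)
)
print(effective_batch_size)     # 64
print(micro_batches_per_epoch)  # 125

# Toy check: averaging the gradients of 8 micro-batches of size 8
# gives the same gradient as one batch of size 64.
def mse_grad(w, xs, ys):
    """d/dw of mean((w * x - y)^2) over the given samples."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

xs = [float(i) for i in range(1, 65)]   # made-up inputs
ys = [3.0 * x + 1.0 for x in xs]        # made-up targets
w = 0.5

full_grad = mse_grad(w, xs, ys)
chunks = [(xs[i:i + 8], ys[i:i + 8]) for i in range(0, 64, 8)]
accumulated_grad = sum(mse_grad(w, cx, cy) for cx, cy in chunks) / len(chunks)

print(math.isclose(full_grad, accumulated_grad))  # True
```

Only one micro-batch of 8 samples has to fit in memory at a time, which is why accumulation does not raise VRAM usage.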

## Create Data Collator
The data collator handles batching and padding. It takes a list of individual samples and assembles them into a single, consistent batch by padding them, building attention masks, handling special tokens, etc.

In [9]:
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)  # mlm=False: causal LM, not a masked language modeling (MLM) task
# if padding is not done in the tokenizer, the collator can pad each batch to its longest sequence instead
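
For a causal LM, the collator also builds the labels from the input IDs and replaces padding positions with -100 so that they do not contribute to the loss. The sketch below imitates that step in plain Python. It is a simplification, not the library code: `collate_causal_lm` is a hypothetical helper, the examples are assumed to be padded to equal length already, and the token IDs other than GPT-2's EOS ID (50256) are made up.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips

def collate_causal_lm(examples, pad_id):
    """Stack already-padded examples and mask padding positions in the labels."""
    batch = {
        "input_ids": [list(ex["input_ids"]) for ex in examples],
        "attention_mask": [list(ex["attention_mask"]) for ex in examples],
    }
    batch["labels"] = [
        [IGNORE_INDEX if tok == pad_id else tok for tok in ids]
        for ids in batch["input_ids"]
    ]
    return batch

PAD_ID = 50256  # GPT-2's EOS token ID, reused as the padding token above

examples = [
    {"input_ids": [10, 11, PAD_ID, PAD_ID], "attention_mask": [1, 1, 0, 0]},
    {"input_ids": [20, 21, 22, PAD_ID], "attention_mask": [1, 1, 1, 0]},
]
batch = collate_causal_lm(examples, pad_id=PAD_ID)
print(batch["labels"])  # [[10, 11, -100, -100], [20, 21, 22, -100]]
```

One side effect of reusing EOS as the padding token is that genuine EOS tokens in the text are masked in the labels as well.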

## Initialize Trainer

In [10]:
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=collator
)
start_time = time.time()
trainer.train()
train_time = time.time() - start_time
print(f"Training time: {train_time: .4f}")

`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 3.8817        | 3.529723        |
| 2     | 3.0101        | 3.599098        |
| 3     | 2.7751        | 3.627959        |


Training time:  92.3040
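
In the log above, the training loss falls every epoch while the validation loss rises slightly, a pattern that often points to overfitting on the small training subset. The sketch below makes that gap explicit using only the numbers from the table; the variable names are illustrative.

```python
# (epoch, training loss, validation loss) copied from the table above
history = [
    (1, 3.8817, 3.529723),
    (2, 3.0101, 3.599098),
    (3, 2.7751, 3.627959),
]

# Gap between validation and training loss per epoch
gaps = [val_loss - train_loss for _, train_loss, val_loss in history]
for (epoch, _, _), gap in zip(history, gaps):
    print(f"epoch {epoch}: validation - training loss = {gap:+.4f}")

# Epoch with the lowest validation loss
best_epoch = min(history, key=lambda row: row[2])[0]
print(f"lowest validation loss at epoch {best_epoch}")  # epoch 1
```

If the goal is the best validation loss, stopping after the first epoch or lowering the learning rate would be natural things to try.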


## Save Model and Tokenizer

In [11]:
# model.save_pretrained("models/fine_tuning/fine_tuned")
# tokenizer.save_pretrained("models/fine_tuning/fine_tuned")

## Evaluations

In [12]:
metrics = trainer.evaluate(tokenized_dataset["test"])
for k, v in metrics.items():
    print(f"{k}: {v}")

eval_loss: 3.914668560028076
eval_runtime: 7.299
eval_samples_per_second: 137.005
eval_steps_per_second: 17.126
epoch: 3.0


### Perplexity
Perplexity (PPL) measures how well the model predicts the next token. It is the exponential of the mean per-token cross-entropy loss:

<center>
    $\text{Perplexity (PPL)} = \exp(\text{loss})$
</center>

<ul>
    <li>
        <b>Low PPL (roughly <20)</b> → the model predicts the dataset well.
    </li>
    <li>
        <b>High PPL (roughly >100)</b> → the model struggles with the dataset.
    </li>
</ul>

In [13]:
import math
loss = metrics['eval_loss']
ppl = math.exp(loss)
print(f"Perplexity: {ppl: .4f}")

Perplexity:  50.1325
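
One way to read this number: a perplexity of about 50 means the model is, on average, as uncertain as if it chose uniformly among roughly 50 tokens at each step. The sketch below checks this interpretation on made-up probabilities; `perplexity` is an illustrative helper, not a library function.

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-probability assigned to the correct tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# A model that always gives the correct token probability 1/50
uniform_50 = perplexity([1 / 50] * 10)
# A model that is fairly confident about each correct token (made-up values)
confident = perplexity([0.9, 0.8, 0.95, 0.7])

print(f"{uniform_50:.2f}")     # 50.00
print(confident < uniform_50)  # True
```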


### BLEU and ROUGE
<ul>
    <li>
        <b>BLEU:</b> evaluates generated text (originally machine translations) by its n-gram overlap with human-written reference texts; it is a precision-oriented metric.
    </li>
    <li>
        <b>ROUGE:</b> computes precision, recall, and F1 scores that quantify the overlap of words, phrases, and sequences between the generated text and the reference.
    </li>
</ul>
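
Both metrics are built on counting shared n-grams. The sketch below computes clipped n-gram precision (the building block of BLEU) and n-gram recall and F1 (as in ROUGE-N) for a single pair of sentences. It is a simplified illustration, not the `evaluate` implementation: it assumes whitespace tokenization and omits BLEU's brevity penalty and geometric mean over n = 1..4 as well as ROUGE-L.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def overlap_scores(prediction, reference, n=1):
    """Return (precision, recall, f1) of the clipped n-gram overlap."""
    pred = ngrams(prediction.split(), n)
    ref = ngrams(reference.split(), n)
    overlap = sum((pred & ref).values())  # each n-gram counts at most as often as in the reference
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

prediction = "the cat sat on the mat"
reference = "the cat is on the mat"

p1, r1, f1_1 = overlap_scores(prediction, reference, n=1)
p2, r2, f1_2 = overlap_scores(prediction, reference, n=2)
print(f"unigrams: precision={p1:.4f}, recall={r1:.4f}, f1={f1_1:.4f}")
print(f"bigrams:  precision={p2:.4f}, recall={r2:.4f}, f1={f1_2:.4f}")
```

Here 5 of the 6 unigrams and 3 of the 5 bigrams of the prediction also occur in the reference.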

We define a function to generate the outputs, using the following generation parameters:
<ul>
    <li>
        <b>max_new_tokens</b>: limits the number of tokens the model can generate. It counts only new tokens, not the input. For example, with a 20-token input and <em>max_new_tokens</em>=256, the output sequence can contain up to 276 tokens (20 input + 256 generated).
    </li>
    <li>
        <b>do_sample</b>: generation typically defaults to greedy decoding (always picking the highest-probability next token). Setting this flag enables sampling from the model's probability distribution, which makes the output less deterministic and more varied.
    </li>
    <li>
        <b>temperature</b>: controls the randomness of sampling. Values below 1.0 sharpen the distribution and make predictions more conservative; values above 1.0 flatten it and make predictions more random.
    </li>
    <li>
        <b>top_p</b>: implements nucleus sampling, i.e., samples only from the smallest set of most probable tokens whose cumulative probability is greater than or equal to <em>top_p</em>.
    </li>
</ul>
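
The interplay of <em>temperature</em> and <em>top_p</em> can be sketched on a toy vocabulary. This is a simplified stand-in for what sampling does at each generation step, not the library code: the function names and the logits are made up for illustration.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                          # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of most probable tokens with total mass >= top_p, renormalized."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

def sample_token(logits, temperature=0.7, top_p=0.9, rng=random):
    """Sample one token index using temperature scaling followed by nucleus filtering."""
    nucleus = top_p_filter(softmax_with_temperature(logits, temperature), top_p)
    ids = list(nucleus)
    return rng.choices(ids, weights=[nucleus[i] for i in ids], k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary
sharp = softmax_with_temperature(logits, temperature=0.7)
flat = softmax_with_temperature(logits, temperature=1.5)
print(max(sharp) > max(flat))   # True: lower temperature favors the top token more

nucleus = top_p_filter([0.5, 0.3, 0.15, 0.05], top_p=0.9)
print(sorted(nucleus))          # [0, 1, 2]: the 0.05 tail token is dropped

print(sample_token(logits))     # random index from the nucleus
```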

In [14]:
def get_output(prompt, max_new_tokens=50):
    if not prompt.strip():
        prompt = tokenizer.eos_token
        
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512).to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        pad_token_id=model.config.eos_token_id
    )
    
    # outputs[0] holds the prompt tokens followed by the generated continuation
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

In [15]:
# pip install evaluate absl-py nltk datasets rouge-score

In [16]:
import numpy as np
import evaluate
import nltk
# nltk.download('punkt')   # ensures ROUGE can tokenize correctly

rouge = evaluate.load('rouge')
bleu = evaluate.load('bleu')

predictions, references = [], []
for example in dataset["test"]:
    ref = example["text"].strip()
    if not ref:
        continue
        
    pred = get_output(ref)
    predictions.append(pred)
    references.append(ref)

rouge_score = rouge.compute(predictions=predictions, references=references)
bleu_score = bleu.compute(predictions=predictions, references=references)
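
Note that in the loop above each reference also serves as the prompt, and `get_output` decodes the full sequence (prompt plus continuation). Each prediction therefore starts with the reference text itself, so the overlap scores largely reflect the copied prompt. A hedged alternative is to prompt with only the first part of each reference and score just the newly generated tokens against the held-out remainder. The helpers below are hypothetical and use whitespace splitting as a stand-in for real tokenization; with the actual model, the new tokens could be obtained as `outputs[0][inputs["input_ids"].shape[1]:]`.

```python
def split_reference(text, prompt_fraction=0.5):
    """Split a reference into a prompt and a held-out target on whitespace."""
    words = text.split()
    cut = max(1, int(len(words) * prompt_fraction))
    return " ".join(words[:cut]), " ".join(words[cut:])

def new_token_ids(output_ids, prompt_length):
    """Drop the prompt tokens that generation echoes at the start of its output."""
    return output_ids[prompt_length:]

prompt, target = split_reference("the quick brown fox jumps over the lazy dog")
print(prompt)  # the quick brown fox
print(target)  # jumps over the lazy dog

# Made-up IDs: 3 prompt tokens followed by 2 generated tokens
continuation = new_token_ids([5, 6, 7, 101, 102], prompt_length=3)
print(continuation)  # [101, 102]
```

Scores computed this way would measure how well the continuation matches the held-out text, and would likely be much lower than the values reported below.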

In [17]:
for k, v in rouge_score.items():
    print(f"{k}: {v: .4f}")

rouge1:  0.6751
rouge2:  0.6361
rougeL:  0.6742
rougeLsum:  0.6746


In [18]:
grams = ['unigrams', 'bigrams', 'trigrams', 'quadgrams']
for k, v in bleu_score.items():
    if k == 'bleu':
        print(f"{k}: {v: .4f}")
    elif k == 'precisions':
        for i in range(4):
            print(f"{grams[i]}: {v[i]: .4f}")

bleu:  0.7069
unigrams:  0.7103
bigrams:  0.7079
trigrams:  0.7056
quadgrams:  0.7036
