# Intro to Fine-tuning LLMs

[An Introductory Guide to Fine-Tuning LLMs by Josep Ferrer](https://www.datacamp.com/tutorial/fine-tuning-large-language-models)

# Different Types of Fine-tuning

1. Supervised fine-tuning
    - model is further trained on a labeled dataset
2. Few-shot learning
    - provide a few labeled examples, to have better context, then used unlabeled data
3. Transfer learning
    - the goal is for a model to perform well on a task that it was not intially trained on
4. Domain specific fine-tuning
    - to finetune a model to understand domain specfic tasks


# Outline
1. Choose a pre-trained model and dataset
2. Load the data to use
3. Prepare a tokenizer
4. Initialize the base model
5. Evaluate the method
6. Fine-tune using the Trainer Method

# 1. Choose a pre-trained model and dataset

[Model: gpt2](https://huggingface.co/openai-community/gpt2)

# 2. Load the data to use

In [3]:
from datasets import load_dataset
import pandas as pd
import tqdm

dataset = load_dataset("mteb/tweet_sentiment_extraction")
df = pd.DataFrame(dataset['train'])
df.head()

Unnamed: 0,id,text,label,label_text
0,cb774db0d1,"I`d have responded, if I were going",1,neutral
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,0,negative
2,088c60f138,my boss is bullying me...,0,negative
3,9642c003ef,what interview! leave me alone,0,negative
4,358bd9e861,"Sons of ****, why couldn`t they put them on t...",0,negative


# 3. Prepare a tokenizer

In [4]:
from transformers import GPT2Tokenizer

# Loading the dataset to train our model
datasets = load_dataset("mteb/tweet_sentiment_extraction")

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token
def tokenize(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize, batched=True)
tokenized_datasets

Map: 100%|██████████| 27481/27481 [00:04<00:00, 5597.12 examples/s]
Map: 100%|██████████| 3534/3534 [00:00<00:00, 5294.80 examples/s]


DatasetDict({
    train: Dataset({
        features: ['id', 'text', 'label', 'label_text', 'input_ids', 'attention_mask'],
        num_rows: 27481
    })
    test: Dataset({
        features: ['id', 'text', 'label', 'label_text', 'input_ids', 'attention_mask'],
        num_rows: 3534
    })
})

## 3a. Create a smaller subset of the full dataset

In [5]:
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

In [6]:
small_train_dataset

Dataset({
    features: ['id', 'text', 'label', 'label_text', 'input_ids', 'attention_mask'],
    num_rows: 1000
})

In [7]:
small_eval_dataset

Dataset({
    features: ['id', 'text', 'label', 'label_text', 'input_ids', 'attention_mask'],
    num_rows: 1000
})

# 4. Initialize the base model

In [8]:
from transformers import GPT2ForSequenceClassification
model = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=3)

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


# 5. Evaluate the method

In [9]:
import evaluate
import numpy as np

metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=1)
    return metric.compute(predictions=predictions, refernces=labels)

Downloading builder script: 100%|██████████| 4.20k/4.20k [00:00<00:00, 24.2MB/s]


# 6. Fine-tune using the Trainer Method

In [10]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
   output_dir="test_trainer",
   #evaluation_strategy="epoch",
   per_device_train_batch_size=1,  # Reduce batch size here
   per_device_eval_batch_size=1,    # Optionally, reduce for evaluation as well
   gradient_accumulation_steps=4
   )


trainer = Trainer(
   model=model,
   args=training_args,
   train_dataset=small_train_dataset,
   eval_dataset=small_eval_dataset,
   compute_metrics=compute_metrics,

)

trainer.train()

Step,Training Loss
500,1.1324


TrainOutput(global_step=750, training_loss=1.039894510904948, metrics={'train_runtime': 423.7883, 'train_samples_per_second': 7.079, 'train_steps_per_second': 1.77, 'total_flos': 1567794659328000.0, 'train_loss': 1.039894510904948, 'epoch': 3.0})

# Fine-tuning Best Practices

1. Data Quantity and Quality
    - ensure data is clean, relevant, and suffciently large
2. Hyperparameter Tuning
    - i.e. learning rates, batch sizes, num training epochs
3. Regular Evaluation
    - continiously evaluate the model to prevent overfitting on the training data

# Fine-tuning Pitfalls

1. Overfitting
    - this usually occurs when the dataset used for training is small, and it is trained on for many epochs
2. Underfitting
    - usually caused by insufficent training or low learning rate
3. Catastrophic forgetting
    - when fine-tuning there is a chance that the model forgets the broad knowledge it intially aquired
4. Data leakage
    - ensure that train and val sets are seperate, which can give misleadingly high performace metrics