<a href="https://colab.research.google.com/github/ankit-kothari/Data-Science-Journey/blob/master/Opensource_LLMs_Week_2_Finetuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 0. Introduction

In this project, we will look at three approaches to fine-tuning a model:

1. Without alignment (or any reinforcement learning technique), using a plain `Seq2SeqTrainer` on a sequence-to-sequence (encoder-decoder) model
2. Supervised Fine-Tuning (SFT) using Hugging Face's Transformer Reinforcement Learning (TRL) package with quantization
3. Direct Preference Optimization (DPO) using Hugging Face's TRL package

# 1. Naive Fine Tuning

In this section, we will fine-tune a T5 model. T5-Small is the smallest of the original T5 checkpoints (roughly 60 million parameters). Like every T5 variant, it casts each natural language task as text in, text out: translation, summarization, question answering, and classification are all expressed as an input string (usually with a task prefix) and a target string. Architecturally it is a standard encoder-decoder Transformer with a few modifications that are visible in the model printout below: relative position biases in the attention layers (`relative_attention_bias`), a simplified layer normalization (`T5LayerNorm`), and linear layers without bias terms. Its small size makes it quick to fine-tune on a single Colab GPU, at the cost of lower output quality than the larger variants.

In this part of the project, we will fine tune T5 on the `timdettmers/openassistant-guanaco` dataset and use it for translation tasks.
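As a small illustration of the text-in, text-out framing, the sketch below builds model inputs by putting a task prefix in front of the text. The helper name `to_text_to_text` and the dictionary are hypothetical; the two prefixes are ones used with the original T5 checkpoints, and a model fine-tuned on other data may respond to different ones.

```python
# Hypothetical helper: T5 selects the task through a plain-text prefix on the input string.
TASK_PREFIXES = {
    "translate_en_fr": "translate English to French: ",
    "summarize": "summarize: ",
}

def to_text_to_text(task, text):
    """Build the input string for a task by prepending its prefix."""
    return TASK_PREFIXES[task] + text

print(to_text_to_text("translate_en_fr", "Hello, how are you?"))
print(to_text_to_text("summarize", "T5 casts every task as text in, text out."))
```

The resulting strings are what would be passed to the tokenizer; the target side is likewise just a string.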

In [1]:
!pip install datasets trl peft bitsandbytes accelerate -qqqq

In [2]:
# Import all the necessary libraries
import json

import torch
from datasets import Dataset, DatasetDict, load_dataset
from peft import LoraConfig
from transformers import (
    AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, Seq2SeqTrainer,
    Seq2SeqTrainingArguments, T5ForConditionalGeneration, TrainingArguments,
)
from trl import SFTTrainer, DPOTrainer




In [34]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
# Import necessary libraries
from transformers import AutoTokenizer, T5ForConditionalGeneration, Seq2SeqTrainer, Seq2SeqTrainingArguments
from datasets import load_dataset

# --------------- Documentation ---------------
"""
Task: Fine-tuning T5 for conditional generation on the 'timdettmers/openassistant-guanaco' dataset.
Objective: Train a T5 model to generate text conditioned on some input (a sequence-to-sequence task).

Dataset: 'timdettmers/openassistant-guanaco'
Each record holds one dialogue between a human and an assistant in a single 'text' field,
with the turns marked by '### Human:' and '### Assistant:'.

Example (before tokenization):
{
    'text': '### Human: How do I bake a cake?### Assistant: First, preheat your oven to 350°F. Then mix the ingredients...'
}

Note: the tokenize function below uses the full 'text' as both the input and the label,
so this naive setup trains the model to reproduce the dialogue rather than to map the
human turn to the assistant reply.

We use both the train and test splits (the latter as validation data), tokenize them, and train a T5-small model on them.
"""

# --------------- Step 1: Load the tokenizer and the model ---------------
print("Loading tokenizer and model...")
tokenizer = AutoTokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small')

print(f"Model loaded: {model}")
print(f"Model architecture: {model.config.architectures}")
print(f"Model layers details:\n{model}")




Loading tokenizer and model...


Model loaded: T5ForConditionalGeneration(
  (shared): Embedding(32128, 512)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 512)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=512, out_features=512, bias=False)
              (k): Linear(in_features=512, out_features=512, bias=False)
              (v): Linear(in_features=512, out_features=512, bias=False)
              (o): Linear(in_features=512, out_features=512, bias=False)
              (relative_attention_bias): Embedding(32, 8)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseActDense(
              (wi): Linear(in_features=512, out_features=2048, bias=False)
              (wo): Linear(in_features=2048, out_features=512, bias=False)
              (

In [4]:
# --------------- Step 2: Tokenization function ---------------
def tokenize(batch):
    # With batched=True, batch['text'] is a list of strings and each result is a list of token-id lists
    tokenized_input = tokenizer(batch['text'], truncation=True, padding='max_length', max_length=512)
    # Labels are tokenized from the same text as the inputs, so input_ids and labels are identical
    tokenized_label = tokenizer(batch['text'], truncation=True, padding='max_length', max_length=512)
    print(f"Tokenized input sample: {tokenized_input['input_ids'][:10]}")  # Print the first 10 examples of the batch
    return {'input_ids': tokenized_input['input_ids'], 'labels': tokenized_label['input_ids']}

# --------------- Step 3: Load datasets ---------------
print("Loading datasets...")
train_dataset = load_dataset("timdettmers/openassistant-guanaco", split='train')
validation_dataset = load_dataset("timdettmers/openassistant-guanaco", split='test')

print(f"Train dataset type: {type(train_dataset)}")
print(f"Validation dataset type: {type(validation_dataset)}")
print(f"First example from train dataset:\n{train_dataset[0]}")

# --------------- Step 4: Tokenize the datasets ---------------
print("Tokenizing the train and validation datasets...")
train_dataset = train_dataset.map(tokenize, batched=True)
validation_dataset = validation_dataset.map(tokenize, batched=True)

print(f"Tokenized train dataset sample:\n{train_dataset[0]}")
print(f"Tokenized validation dataset sample:\n{validation_dataset[0]}")



Loading datasets...


Repo card metadata block was not found. Setting CardData to empty.

Generating train split:   0%|          | 0/9846 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/518 [00:00<?, ? examples/s]


Train dataset type: <class 'datasets.arrow_dataset.Dataset'>
Validation dataset type: <class 'datasets.arrow_dataset.Dataset'>
First example from train dataset:
{'text': '### Human: Can you write a short introduction about the relevance of the term "monopsony" in economics? Please use examples related to potential monopsonies in the labour market and cite relevant research.### Assistant: "Monopsony" refers to a market structure where there is only one buyer for a particular good or service. In economics, this term is particularly relevant in the labor market, where a monopsony employer has significant power over the wages and working conditions of their employees. The presence of a monopsony can result in lower wages and reduced employment opportunities for workers, as the employer has little incentive to increase wages or provide better working conditions.\n\nRecent research has identified potential monopsonies in industries such as retail and fast food, where a few large companies co

Map:   0%|          | 0/9846 [00:00<?, ? examples/s]

Tokenized input sample: [[1713, 30345, 3892, 10, 1072, 25, 1431, 3, 9, 710, 5302, 81, 8, 20208, 13, 8, 1657, 96, 2157, 9280, 106, 63, 121, 16, 1456, 7, 58, 863, 169, 4062, 1341, 12, 1055, 7414, 102, 739, 725, 16, 8, 12568, 512, 11, 3, 8464, 2193, 585, 5, 4663, 30345, 9255, 10, 96, 9168, 9280, 106, 63, 121, 2401, 7, 12, 3, 9, 512, 1809, 213, 132, 19, 163, 80, 8001, 21, 3, 9, 1090, 207, 42, 313, 5, 86, 1456, 7, 6, 48, 1657, 19, 1989, 2193, 16, 8, 5347, 512, 6, 213, 3, 9, 7414, 102, 739, 63, 6152, 65, 1516, 579, 147, 8, 15488, 11, 464, 1124, 13, 70, 1652, 5, 37, 3053, 13, 3, 9, 7414, 102, 739, 63, 54, 741, 16, 1364, 15488, 11, 3915, 4311, 1645, 21, 2765, 6, 38, 8, 6152, 65, 385, 17821, 12, 993, 15488, 42, 370, 394, 464, 1124, 5, 17716, 585, 65, 4313, 1055, 7414, 102, 739, 725, 16, 5238, 224, 38, 3549, 11, 1006, 542, 6, 213, 3, 9, 360, 508, 688, 610, 3, 9, 1516, 4149, 13, 8, 512, 41, 279, 757, 29, 7, 3, 184, 8306, 88, 40, 6, 2038, 137, 86, 175, 5238, 6, 2765, 557, 522, 731, 15488, 6, 1643,

Map:   0%|          | 0/518 [00:00<?, ? examples/s]

Tokenized input sample: [[1713, 30345, 3892, 10, 3, 2, 2533, 2, 2795, 2, 2795, 3, 2, 3700, 7184, 6652, 2, 2795, 2, 3, 8194, 3, 2, 6652, 1757, 18642, 6, 3, 12377, 9592, 7948, 2, 12681, 18632, 18352, 23912, 15042, 3, 20447, 5345, 26798, 6609, 3, 2, 1757, 6588, 2, 3, 2, 26672, 1757, 6588, 6, 3, 2533, 3, 2, 22581, 25083, 8724, 2, 6609, 17238, 22682, 3, 1757, 2, 2044, 3, 8194, 3, 2, 6652, 7948, 7184, 4663, 30345, 9255, 10, 3, 2, 17059, 3, 2, 3700, 7184, 6652, 2, 2795, 2, 6, 3, 12377, 9592, 7948, 2, 12681, 18632, 18352, 23912, 15042, 3, 20447, 5345, 26798, 6609, 3, 2, 1757, 6588, 2, 3, 2, 26672, 1757, 6588, 3, 2795, 8724, 2, 6609, 17238, 22682, 3, 1757, 2, 2044, 3, 8194, 3, 2, 6652, 7948, 7184, 10, 3, 2, 7, 210, 99, 17, 3, 25322, 1843, 7175, 28592, 30652, 599, 834, 5590, 10, 784, 1570, 17, 908, 61, 3, 2, 13751, 3, 2, 2044, 2, 20000, 25083, 3, 12377, 2, 2795, 2, 3, 20447, 5345, 26798, 16624, 6, 3, 2, 9592, 2, 3, 14142, 3, 2795, 2, 28232, 2, 6725, 2, 3, 26672, 2, 17238, 7184, 2, 3, 4331, 3, 14
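The tokenization above feeds the same text to the model as input and as label. If the goal is for the model to read the human turn and produce the assistant reply, the text has to be split first. The sketch below shows one way to do that in plain Python, assuming the `### Human:` / `### Assistant:` markers seen in the dataset sample; `split_first_turn` is a hypothetical helper and is not used elsewhere in this notebook.

```python
HUMAN_TAG = "### Human:"
ASSISTANT_TAG = "### Assistant:"

def split_first_turn(text):
    """Return (prompt, response) for the first Human/Assistant exchange in `text`."""
    head, sep, tail = text.partition(ASSISTANT_TAG)
    if not sep:
        # No assistant marker found: treat the whole text as the prompt
        return text.strip(), ""
    prompt = head.replace(HUMAN_TAG, "", 1).strip()
    # Multi-turn dialogues continue with another "### Human:" tag; keep only the first reply
    response = tail.split(HUMAN_TAG, 1)[0].strip()
    return prompt, response

example = "### Human: Hi there.### Assistant: Hello! How can I help?### Human: Thanks."
prompt, response = split_first_turn(example)
print(prompt)
print(response)
```

With such a split, the prompt would be tokenized into `input_ids` and the response into `labels`. Doing so changes what the model is trained on, so the outputs would no longer match the ones recorded in this notebook.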

In [10]:
# --------------- Step 5: Define training arguments ---------------
print("Defining training arguments...")
training_args = Seq2SeqTrainingArguments(
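    output_dir='/content/drive/MyDrive/llm_fine_tuning/t5_ft/',  # Absolute path into the mounted Drive; a leading "./" would resolve relative to the working directory instead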
    num_train_epochs=1,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    per_device_eval_batch_size=16,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
)

# --------------- Step 6: Initialize the Trainer ---------------
print("Initializing the Trainer...")
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
)

# --------------- Step 7: Inspect model layers ---------------
# Print the first 5 rows of the input embedding matrix so they can be compared with the values after training
print(f"Model weights of the first layer before training:\n{model.get_input_embeddings().weight[:5]}")  # First 5 embedding rows



In [7]:
# --------------- Step 8: Train the model ---------------
print("Starting model training...")
trainer.train()

# --------------- Step 9: Check model layer weights after training ---------------
print(f"Model weights of the first layer after training:\n{model.get_input_embeddings().weight[:5]}")

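# --------------- Illustrative output format for documentation ---------------
# The token ids and weights in this string are placeholders showing the shape of the output, not values from this run.
"""
Example Output: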

1. Tokenized input and label (before training):
Original text: "User: How do I bake a cake?\nAssistant: First, preheat your oven to 350°F..."
Tokenized input_ids: [8794, 10, 276, 19, 27, 24859, 5, 2961, 46, ...]

2. Train dataset example (tokenized):
{
    'input_ids': [8794, 10, 276, 19, 27, 24859, 5, 2961, 46, ...],
    'labels': [8794, 10, 276, 19, 27, 24859, 5, 2961, 46, ...]
}

3. Weights from the first embedding layer before and after training:
Before training: tensor([[ 0.0115,  0.0083,  0.0132,  0.0034,  0.0098],...])
After training:  tensor([[ 0.0156,  0.0091,  0.0128,  0.0042,  0.0089],...])
"""

Starting model training...


Step,Training Loss
500,1.4738


Model weights of the first layer after training:
tensor([[ -2.0152,   0.2258,  -7.0871,  ...,  -0.3548,   2.6376,  -2.8862],
        [ 12.6220,   8.1901, -11.6218,  ...,   7.9378,  -7.3155,   0.9422],
        [ -8.7459,   7.1827,  27.8689,  ..., -26.7419,   0.8577,  -1.5132],
        [ 11.2490,   8.0594,  14.1854,  ...,   9.8101,  -7.8717,  -3.6078],
        [  2.7947,   3.5322,  11.4385,  ..., -24.6205,  -9.5603,  -4.4994]],
       device='cuda:0', grad_fn=<SliceBackward0>)


In [8]:
# Inspect fine-tuning dataset structure after tokenization
print("Inspecting fine-tuning dataset structure...")
for i in range(3):  # Print first 3 samples for a better understanding
    sample = train_dataset[i]  # Index the row first; train_dataset['text'] would load the whole column
    print(f"Sample {i+1}:")
    print(f"Original text: {sample['text']}")
    print(f"Tokenized input_ids: {sample['input_ids'][:10]}")  # Print first 10 tokens for brevity
    print(f"Tokenized labels: {sample['labels'][:10]}")        # Print first 10 tokens of labels
    print("="*50)


Inspecting fine-tuning dataset structure...
Sample 1:
Original text: ### Human: Can you write a short introduction about the relevance of the term "monopsony" in economics? Please use examples related to potential monopsonies in the labour market and cite relevant research.### Assistant: "Monopsony" refers to a market structure where there is only one buyer for a particular good or service. In economics, this term is particularly relevant in the labor market, where a monopsony employer has significant power over the wages and working conditions of their employees. The presence of a monopsony can result in lower wages and reduced employment opportunities for workers, as the employer has little incentive to increase wages or provide better working conditions.

Recent research has identified potential monopsonies in industries such as retail and fast food, where a few large companies control a significant portion of the market (Bivens & Mishel, 2013). In these industries, workers often fa

In [9]:
from rich import print

# Ensure model is on the same device as inputs
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Generate output with the current `model`.
# For a true "before" baseline, run this cell before trainer.train(); after training, `model` already holds the updated weights.
def generate_output_before_finetuning(text):
    inputs = tokenizer(text, return_tensors='pt', truncation=True, padding='max_length', max_length=512)
    inputs = {key: value.to(device) for key, value in inputs.items()}  # Move inputs to the same device as the model
    # Pass the attention mask as well so that padding positions are explicitly ignored
    output_ids = model.generate(inputs['input_ids'], attention_mask=inputs['attention_mask'], max_length=50)
    output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return output_text

# Test on a few examples
print("Generating outputs before fine-tuning...")
for i in range(3):
    original_text = train_dataset[i]['text']
    output_text_before = generate_output_before_finetuning(original_text)
    print(f"Original Text: {original_text}")
    print(f"Generated Output (Before Fine-Tuning): {output_text_before}")
    print("="*50)


In [None]:


# Load the fine-tuned model from the checkpoint
checkpoint_path = "/content/results/checkpoint-1845"
model_finetuned = T5ForConditionalGeneration.from_pretrained(checkpoint_path)

# Ensure the model is on the correct device (GPU or CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_finetuned.to(device)

# Generate model output after fine-tuning
def generate_output_after_finetuning(text):
    inputs = tokenizer(text, return_tensors='pt', truncation=True, padding='max_length', max_length=512)
    inputs = {key: value.to(device) for key, value in inputs.items()}  # Move inputs to the same device as the model
    # Generate with the checkpoint loaded above, not with the in-memory `model`
    output_ids = model_finetuned.generate(inputs['input_ids'], attention_mask=inputs['attention_mask'], max_length=50)
    output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return output_text

# Test on the same examples post fine-tuning
print("Generating outputs after fine-tuning...")
for i in range(3):
    original_text = train_dataset[i]['text']
    output_text_after = generate_output_after_finetuning(original_text)
    print(f"Original Text: {original_text}")
    print(f"Generated Output (After Fine-Tuning): {output_text_after}")
    print("="*50)



In [66]:
from datasets import load_dataset

# Load the dataset
dataset = load_dataset('json', data_files='/content/drive/MyDrive/llm_fine_tuning/t5_ft/summary_podcast_dataset.json')

# Inspect the dataset structure
print(f"Dataset loaded: {dataset}")
print(f"Sample example from dataset:\n{dataset['train'][0]}")


Generating train split: 0 examples [00:00, ? examples/s]

In [67]:
# Tokenization function for T5 model
def tokenize(batch):
    # Tokenize the input prompt
    tokenized_input = tokenizer(batch['source'], truncation=True, padding='max_length', max_length=512)
    # Tokenize the label (podcast script) as well
    tokenized_label = tokenizer(batch['target'], truncation=True, padding='max_length', max_length=512)

    # Return input_ids for input and labels for target
    return {
        'input_ids': tokenized_input['input_ids'],
        'attention_mask': tokenized_input['attention_mask'],
        'labels': tokenized_label['input_ids']  # Tokenized labels
    }

# Apply the tokenization to your dataset
tokenized_dataset = dataset['train'].map(tokenize, batched=True)

# Inspect tokenized data to verify correctness
print(f"Tokenized dataset sample:\n{tokenized_dataset[0]}")


Map:   0%|          | 0/252 [00:00<?, ? examples/s]
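One refinement often applied to padded targets, and which the tokenization function above does not do, is to replace the padding positions in `labels` with `-100` so that the loss skips them. A minimal sketch in plain Python, assuming the T5 pad token id of 0; `mask_padding` is a hypothetical helper, not part of any library.

```python
IGNORE_INDEX = -100  # Label value that the cross-entropy loss ignores

def mask_padding(label_ids, pad_token_id=0):
    """Replace pad tokens in each label sequence with IGNORE_INDEX."""
    return [
        [tok if tok != pad_token_id else IGNORE_INDEX for tok in seq]
        for seq in label_ids
    ]

# Two toy label sequences, padded to length 5 with the pad id 0
labels = [[8794, 10, 1, 0, 0], [27, 1, 0, 0, 0]]
print(mask_padding(labels))
```

Inside `tokenize`, this would be applied to `tokenized_label['input_ids']` before returning. It changes the reported loss, so the numbers would not match the run recorded in this notebook.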

In [68]:
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

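# Define the training arguments for T5-small
# Note: this run takes only 160 optimizer steps in total (see the TrainOutput below), so with
# warmup_steps=500 the learning rate is still in its warmup phase when training ends.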
training_args = Seq2SeqTrainingArguments(
    output_dir='./results',
    num_train_epochs=10,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    per_device_eval_batch_size=16,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
)

# Initialize the Trainer with the tokenized dataset
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
)


In [69]:
# Start model training
print("Starting model training...")
trainer.train()

Step,Training Loss


TrainOutput(global_step=160, training_loss=6.381528091430664, metrics={'train_runtime': 207.6424, 'train_samples_per_second': 12.136, 'train_steps_per_second': 0.771, 'total_flos': 341061339709440.0, 'train_loss': 6.381528091430664, 'epoch': 10.0})
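The `global_step=160` above can be reproduced by hand from the 252 examples, the batch size of 8, the gradient accumulation of 2, and the 10 epochs. The sketch below does that arithmetic and also estimates the learning rate reached by the last step under a linear warmup. It assumes a single GPU, a step count of "batches per epoch, integer-divided by the accumulation steps" (which matches this run but may differ between library versions), and a peak learning rate of 5e-5, the `TrainingArguments` default, since no `learning_rate` is set in the cell. Both function names are hypothetical.

```python
import math

def optimizer_steps(num_examples, batch_size, grad_accum, epochs):
    """Count optimizer updates: batches per epoch, grouped by gradient accumulation."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return updates_per_epoch * epochs

def linear_warmup_lr(step, warmup_steps, peak_lr):
    """Learning rate during a linear warmup from 0 to peak_lr."""
    return peak_lr * min(1.0, step / warmup_steps)

total_steps = optimizer_steps(num_examples=252, batch_size=8, grad_accum=2, epochs=10)
final_lr = linear_warmup_lr(total_steps, warmup_steps=500, peak_lr=5e-5)
print(total_steps)           # 160, matching global_step in the TrainOutput
print(f"{final_lr:.2e}")
```

Under these assumptions the run ends at about a third of the peak learning rate, which is one plausible reason the training loss stays high; a smaller `warmup_steps` would be worth trying.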

In [71]:

# Load the fine-tuned model from the checkpoint
checkpoint_path = "/content/results/checkpoint-160"
model_finetuned = T5ForConditionalGeneration.from_pretrained(checkpoint_path)

#Ensure the model is on the correct device (GPU or CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_finetuned.to(device)

# Generate model output after fine-tuning
def generate_output_after_finetuning(text):
    input_text = text
    inputs = tokenizer(input_text, return_tensors='pt', truncation=True, padding='max_length', max_length=512)
    inputs = {key: value.to(device) for key, value in inputs.items()}  # Move inputs to the same device as the model
    output_ids = model_finetuned.generate(inputs['input_ids'],max_length=130, num_beams=4, early_stopping=True)
    output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return output_text

# New test data point
paragraph = "classify: Is this message happy or sad? Message: You've been selected for an exclusive offer! Claim now."


# Generate the response
generated_response = generate_output_after_finetuning(paragraph)

# Print the result
from rich import print
print(f"Input:\n {paragraph} \n")
print(f"\n Generated Response:\n {generated_response}")


In a nutshell, this is what the tasks above perform:

* Import Libraries: The first step is to import all the necessary libraries. This includes the Hugging Face Transformers library, which provides the models and training utilities, and the Datasets library, which provides a convenient way to load and preprocess datasets.

* Load the Model and Tokenizer: The next step is to load the pre-trained T5 model and its corresponding tokenizer. The T5 model is a transformer model that is pre-trained on a large corpus of text and can be fine-tuned for various tasks. The tokenizer is used to convert text into a format that the model can understand.

* Define the Tokenization Function: This function is used to tokenize the datasets. It takes a batch of text as input and returns the tokenized inputs and labels. The inputs are the text that the model will be trained on, and the labels are the expected outputs.

* Load the Datasets: The datasets are loaded using the load_dataset function from the Datasets library. The split argument specifies which split of the dataset to load (e.g., ‘train’ or ‘test’).

* Print Dataset Information: This step prints information about the loaded datasets, such as their type and the first sample. This is useful for understanding the structure of the datasets.

* Tokenize the Datasets: The datasets are tokenized using the tokenization function defined earlier. The map function applies the tokenization function to each example in the dataset.

* Define the Training Arguments: The training arguments specify various settings for the training process, such as the number of epochs, the batch size, and the learning rate. These arguments are passed to the trainer in the next step.

* Initialize the Trainer: The trainer is initialized with the model, the training arguments, and the tokenized datasets. The trainer handles the training and evaluation process.

* Train the Model: Finally, the model is trained using the train method of the trainer. This step may take a while, depending on the size of the datasets and the number of epochs.



Now you can run an inference task on your fine-tuned model using the code below:

In [None]:
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Specify the checkpoint directory to load (checkpoints are saved as subdirectories of the output_dir in your training arguments)
model_dir = '/content/results/checkpoint-1500'

# Load the model and the tokenizer. You'll be using the same tokenizer you used to train the model
# tokenizer = AutoTokenizer.from_pretrained('t5-small')   # This line is commented out because the tokenizer from your fine-tuning session should still be active
model = T5ForConditionalGeneration.from_pretrained(model_dir)

# Make an inference
input_text = "translate English to French: Hello, how are you?"
inputs = tokenizer.encode(input_text, return_tensors='pt')

# Generate output
outputs = model.generate(inputs, max_length=150, num_beams=5, early_stopping=True)
decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)

print("Decoded output:", decoded_output)


Decoded output: Bonjour, comment êtes-vous?


**Note: Before moving on to the next task, disconnect and delete your current Colab runtime. This clears all variables and files and frees the GPU memory.**

# 2. Supervised Finetuning (SFT)

Supervised fine-tuning (SFT) is an approach to fine-tuning that trains a model on a labeled dataset that directly maps inputs to desired outputs. It includes instruction tuning, which teaches a model to respond in the way that human-written examples define.

In this section you will complete an SFT implementation using the Transformer Reinforcement Learning (TRL) package. TRL is a library built on top of the Hugging Face Transformers library that provides a simple interface and training loop for fine-tuning models with supervised and reinforcement learning methods. TRL is designed to be easy to use and flexible, allowing you to quickly experiment with different approaches to fine-tuning. That said, its abstracted nature means that it is not always the best tool for students. So below we will also provide a more detailed example of how to implement PPO using the Transformers library directly.


## 2.1. Setup

In [1]:
!pip install datasets trl peft bitsandbytes accelerate -qqqq

In [2]:
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments, AutoTokenizer, DataCollatorForLanguageModeling
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM

We will need to reduce the size of the language model so that it fits in memory. We will use a process referred to as quantization, which shrinks a model by storing its weights at lower precision. For example, converting 32-bit floating point weights to 16-bit floating point halves the size of the model, and the 4-bit format configured below reduces weight storage to roughly one eighth of the 32-bit size. The downside of quantization is that it can reduce the accuracy of the model; in practice, however, the loss is often small.

Quantization is beyond the scope of this course, but if you are interested in learning more, you can read the following article: [Introduction to Weight Quantization](https://towardsdatascience.com/introduction-to-weight-quantization-2494701b9c0c).

In [4]:
bits_and_bytes_config = BitsAndBytesConfig(
#TODO: Add BitsAndBytesConfig parameters that ensure the objectives of your model fine-tuning are attained
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype='bfloat16'
)
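To make the size reduction concrete, the sketch below estimates the memory needed for the weights alone at several precisions. It assumes a nominal 350 million parameters (taken from the model name, not measured) and ignores the small overhead that quantization constants add; `weight_memory_mib` is a hypothetical helper.

```python
def weight_memory_mib(num_params, bits_per_weight):
    """Approximate memory for the weights alone, in MiB."""
    return num_params * bits_per_weight / 8 / 2**20

num_params = 350_000_000  # Assumption: nominal size implied by the name "opt-350m"

for name, bits in [("float32", 32), ("bfloat16", 16), ("int8", 8), ("nf4 (4-bit)", 4)]:
    print(f"{name:>12}: {weight_memory_mib(num_params, bits):8.1f} MiB")
```

Activations, gradients, and optimizer state come on top of these figures, so the real training footprint is larger.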

## 2.2. Load the model and tokenizer

Now we will load the model and tokenizer. We will use the `facebook/opt-350m` model, a smaller transformer model that suffices for our purposes. For the tokenizer we use `AutoTokenizer`, which for OPT checkpoints loads a GPT-2 style tokenizer (`GPT2TokenizerFast`). OPT was introduced in [Open Pre-trained Transformer Language Models](https://arxiv.org/abs/2205.01068) and first released in the metaseq repository.

## 2.3. Load the dataset

Next we will load the dataset. We will use the `timdettmers/openassistant-guanaco` dataset, which is a dataset of questions and answers. This dataset is a subset of the Open Assistant dataset, which you can find here: https://huggingface.co/datasets/OpenAssistant/oasst1/tree/main

**Example of a message record from the Open Assistant (oasst1) dataset**

The guanaco subset keeps only a single `text` field per record, as the printed samples in this notebook show.


```json
{
    "message_id": "218440fd-5317-4355-91dc-d001416df62b",
    "parent_id": "13592dfb-a6f9-4748-a92c-32b34e239bb4",
    "user_id": "8e95461f-5e94-4d8b-a2fb-d4717ce973e4",
    "text": "It was the winter of 2035, and artificial intelligence (..)",
    "role": "assistant",
    "lang": "en",
    "review_count": 3,
    "review_result": true,
    "deleted": false,
    "rank": 0,
    "synthetic": true,
    "model_name": "oasst-sft-0_3000,max_new_tokens=400 (..)",
    "labels": {
        "spam": { "value": 0.0, "count": 3 },
        "lang_mismatch": { "value": 0.0, "count": 3 },
        "pii": { "value": 0.0, "count": 3 },
        "not_appropriate": { "value": 0.0, "count": 3 },
        "hate_speech": { "value": 0.0, "count": 3 },
        "sexual_content": { "value": 0.0, "count": 3 },
        "quality": { "value": 0.416, "count": 3 },
        "toxicity": { "value": 0.16, "count": 3 },
        "humor": { "value": 0.0, "count": 3 },
        "creativity": { "value": 0.33, "count": 3 },
        "violence": { "value": 0.16, "count": 3 }
    }
}
```

## 2.4. Train the model

In this step you will train the model using the `SFTTrainer` class from the `trl` library, which extends the `Trainer` class from the `transformers` library. The trainer provides a simple interface for training a model. It takes care of the details of training, such as batching, shuffling, logging metrics, and saving checkpoints.

## EXERCISE: Implement the `train` function

Configure the trainer class using the parameters below.

| Variable | Value |
| --- | --- |
| Output Directory | "output_dir" |
| Batch Size | 16 |
| Gradient Accumulation | 16 |
| Learning Rate | 1.41e-5 |
| Logging Frequency  | 1 |
| Epochs | 3 |
| Maximum Steps | -1 |
| Reporting Destination | None |
| Checkpoint Save Steps | 100 |
| Total Checkpoint Limit | 10 |
| Push Model | False |
| Model Id | None |
| Enable Gradient Checkpointing | False |
| LoRA Rank (`r`) | 64 |
| LoRA Alpha Value | 16 |
| Bias Type | "none" |
| Task Type | "CAUSAL_LM" |

Review the documentation for parameter names: [SFTTrainer Documentation](https://huggingface.co/docs/trl/sft_trainer)
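As a starting point for the exercise, the table can be written down as plain dictionaries keyed by the corresponding `TrainingArguments` and `LoraConfig` parameter names. This is a sketch rather than a verified solution: the mapping from table rows to parameter names is an assumption, and the arguments accepted by `SFTTrainer` itself differ between TRL versions, so check the linked documentation before using it.

```python
# Table values expressed with TrainingArguments-style keyword names (assumed mapping)
training_kwargs = {
    "output_dir": "output_dir",
    "per_device_train_batch_size": 16,
    "gradient_accumulation_steps": 16,
    "learning_rate": 1.41e-5,
    "logging_steps": 1,
    "num_train_epochs": 3,
    "max_steps": -1,            # -1 means "use num_train_epochs instead"
    "report_to": None,          # Some transformers versions need the string "none" to disable reporting
    "save_steps": 100,
    "save_total_limit": 10,
    "push_to_hub": False,
    "hub_model_id": None,
    "gradient_checkpointing": False,
}

# Table values expressed with LoraConfig-style keyword names (assumed mapping)
lora_kwargs = {"r": 64, "lora_alpha": 16, "bias": "none", "task_type": "CAUSAL_LM"}

effective_batch = (
    training_kwargs["per_device_train_batch_size"]
    * training_kwargs["gradient_accumulation_steps"]
)
print(effective_batch)  # Sequences per optimizer update on a single device
```

The dictionaries could then be unpacked as `TrainingArguments(**training_kwargs)` and `LoraConfig(**lora_kwargs)`. Note that the effective batch is the product of the two batch settings, which is why a large accumulation value can stand in for a large per-device batch when memory is tight.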



In [5]:
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m") # You can add your quantization parameter here if you want a quantized model
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")

lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=[
        'q_proj',
        'v_proj',
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
#TODO: What would happen if you add 'k_proj' to the target_modules=['q_proj','v_proj',] parameters above
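A rough way to reason about the TODO above: each targeted linear layer gains two low-rank matrices, so LoRA adds `r * (d_in + d_out)` trainable parameters per layer. The sketch below counts them for the two choices of `target_modules`. It assumes 24 decoder layers with 1024-wide attention projections; these values are recalled for `facebook/opt-350m` and should be checked against `model.config`. The function names are hypothetical.

```python
def lora_params_per_layer(r, d_in, d_out):
    """Trainable parameters LoRA adds to one linear layer: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

def total_lora_params(target_modules, r, hidden_size, num_layers):
    """Total added parameters when every target module is a hidden_size x hidden_size projection."""
    per_layer = lora_params_per_layer(r, hidden_size, hidden_size)
    return len(target_modules) * num_layers * per_layer

# Assumed architecture values; verify with model.config.hidden_size and model.config.num_hidden_layers
HIDDEN_SIZE, NUM_LAYERS, RANK = 1024, 24, 8

qv = total_lora_params(["q_proj", "v_proj"], RANK, HIDDEN_SIZE, NUM_LAYERS)
qkv = total_lora_params(["q_proj", "k_proj", "v_proj"], RANK, HIDDEN_SIZE, NUM_LAYERS)
print(qv, qkv)
```

Under these assumptions, adding `k_proj` increases the number of trainable parameters by half. Whether that extra capacity improves results is an empirical question that this arithmetic does not answer.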



In [6]:
# Add a special pad token
tokenizer.add_special_tokens({'pad_token': '[PAD]'})

# Update your collator
class TokenizingCollator(DataCollatorForLanguageModeling):
    def __init__(self, tokenizer, max_length=512, truncation=True, padding='max_length'):
        super().__init__(tokenizer, mlm=False)
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.truncation = truncation
        self.padding = padding

    def __call__(self, examples):
        # Ensure the examples are in the correct format for training using SFTTrainer
        examples = [e['text'] for e in examples]
        # Tokenize the examples with padding and truncation
        examples = [self.tokenizer(e, truncation=self.truncation, padding=self.padding, max_length=self.max_length) for e in examples]
        return super().__call__(examples)


# Instantiate your collator with the updated tokenizer
collator = TokenizingCollator(tokenizer)


In [7]:
train_dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")
validation_dataset = load_dataset("timdettmers/openassistant-guanaco", split="test")

Repo card metadata block was not found. Setting CardData to empty.

Generating train split:   0%|          | 0/9846 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/518 [00:00<?, ? examples/s]


In [8]:
print(train_dataset[0])
print(validation_dataset[0])

{'text': '### Human: Can you write a short introduction about the relevance of the term "monopsony" in economics? Please use examples related to potential monopsonies in the labour market and cite relevant research.### Assistant: "Monopsony" refers to a market structure where there is only one buyer for a particular good or service. In economics, this term is particularly relevant in the labor market, where a monopsony employer has significant power over the wages and working conditions of their employees. The presence of a monopsony can result in lower wages and reduced employment opportunities for workers, as the employer has little incentive to increase wages or provide better working conditions.\n\nRecent research has identified potential monopsonies in industries such as retail and fast food, where a few large companies control a significant portion of the market (Bivens & Mishel, 2013). In these industries, workers often face low wages, limited benefits, and reduced bargaining po

In [None]:
# Define the training arguments
training_args = TrainingArguments(
    output_dir="./results",  # output directory for model predictions and checkpoints
    num_train_epochs=1,  # total number of training epochs
    per_device_train_batch_size=4,  # batch size per device during training
    per_device_eval_batch_size=4,  # batch size for evaluation
    warmup_steps=500,  # number of warmup steps for learning rate scheduler
    weight_decay=0.01,  # strength of weight decay
    logging_dir="./logs",  # directory for storing logs
    logging_steps=50,  # when to print log
    remove_unused_columns=False,
    report_to=None,
)

# Initialize the trainer
trainer = SFTTrainer(
    model=model,  # the instantiated Hugging Face Transformers model to be trained
    args=training_args,  # training arguments, defined above
    train_dataset=train_dataset,  # training dataset
    eval_dataset=validation_dataset,  # evaluation dataset
    dataset_text_field="text", # The field name in your dataset that contains the data
    data_collator=collator,  # your custom data collator
    peft_config=lora_config  # your LoraConfig
)

# Train the model
trainer.train()



Map:   0%|          | 0/9846 [00:00<?, ? examples/s]

Map:   0%|          | 0/518 [00:00<?, ? examples/s]

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
50,2.5289
100,2.3595
150,2.4468
200,2.3296
250,2.3544
300,2.4129
350,2.3111
400,2.3201
450,2.3572
500,2.2025


TrainOutput(global_step=2462, training_loss=2.2728974484119058, metrics={'train_runtime': 2540.0679, 'train_samples_per_second': 3.876, 'train_steps_per_second': 0.969, 'total_flos': 9199428418142208.0, 'train_loss': 2.2728974484119058, 'epoch': 1.0})
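
The step count reported in `TrainOutput` can be checked by hand. The following is a minimal sketch of that arithmetic, assuming a single device and one optimizer update per group of `gradient_accumulation_steps` batches. The helper name `optimizer_steps` is hypothetical and not part of `transformers`, and the exact rounding may differ between library versions.

```python
import math

def optimizer_steps(num_examples: int, batch_size: int, grad_accum: int = 1, epochs: int = 1) -> int:
    """Approximate number of optimizer updates for a run on a single device."""
    batches_per_epoch = math.ceil(num_examples / batch_size)     # the last partial batch is kept
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)  # one update per accumulated group
    return math.ceil(epochs * updates_per_epoch)

# Settings used in the run above: 9,846 training examples, batch size 4, 1 epoch
print(optimizer_steps(9846, batch_size=4))

# Settings from the exercise table: batch size 16, gradient accumulation 16, 3 epochs
print(optimizer_steps(9846, batch_size=16, grad_accum=16, epochs=3))
```

The first call gives 2462, which matches `global_step=2462` in the output above. With the exercise settings, each update sees 16 × 16 = 256 examples, so the same data yields far fewer (but larger) updates.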

## 2.5. Let's run some inference using the model

In this section, we'll run some predictions using our fine-tuned `facebook/opt-350m` model. Because we fine-tuned with LoRA, the saved checkpoint holds the adapter weights rather than a full copy of the model, so inference has to apply the LoRA adapters on top of the base model.
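
To make the role of the adapters concrete, here is a small pure-Python sketch of what a LoRA-adapted layer effectively computes: the frozen weight `W` plus a scaled low-rank update `B @ A`. The matrices and the helper names `matmul` and `effective_weight` are illustrative assumptions, not PEFT's API, and real adapters use much larger shapes.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def effective_weight(w, lora_a, lora_b, lora_alpha, r):
    """Return W + (lora_alpha / r) * B @ A, the weight a LoRA-adapted layer effectively applies."""
    scale = lora_alpha / r
    delta = matmul(lora_b, lora_a)  # (out x r) @ (r x in) -> (out x in)
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]

w = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight (2 x 2)
lora_a = [[0.0, 2.0]]          # trained A matrix, r x in = 1 x 2
lora_b = [[1.0], [0.0]]        # trained B matrix, out x r = 2 x 1
merged = effective_weight(w, lora_a, lora_b, lora_alpha=4, r=1)
print(merged)
```

Only `lora_a` and `lora_b` are stored in the checkpoint, which is why the base model has to be loaded as well.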

In [None]:
# Import necessary libraries and classes
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

# Load the tokenizer and the base model
model_name = "facebook/opt-350m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
base_model = AutoModelForCausalLM.from_pretrained(model_name)

# Load the fine-tuned LoRA adapter on top of the base model.
# The rank, alpha, and target modules are read from the adapter_config.json
# saved in the checkpoint, so no LoraConfig needs to be defined here.
model_path = "/content/results/checkpoint-2000" # This is the name of the folder where your latest model checkpoint was saved in your Google Colab File system
model = PeftModel.from_pretrained(base_model, model_path)
model.eval()

# Perform Text Generation
input_text = "Once upon a time in a town far away, there lived a"
input_ids = tokenizer([input_text], return_tensors="pt").input_ids

# Pass input_ids as a keyword argument, which works across PEFT versions
output = model.generate(
    input_ids=input_ids,
    max_length=100,
    temperature=0.7,
    do_sample=True,
    num_return_sequences=1
)
decoded_output = tokenizer.batch_decode(output, skip_special_tokens=True)[0]
print(decoded_output)

Once upon a time in a town far away, there lived a man who loved to ride his horse. He would often ride in the back of the horse and ride on the side of the road. The horse would sometimes stop and give him a smile, and he would then ride on again. The man would then keep riding, and he would often see the horse in the middle of the road, and he would look up and see the man riding in the middle of the road. The man


**Note: Before moving on to the next task, disconnect and delete your current Colab runtime to remove all variables and files and to clear the GPU memory.**

# 3. Direct Preference Optimization (DPO)

DPO has emerged as a more efficient and streamlined method of fine-tuning large language models (LLMs), offering a simpler alternative to the complex RLHF approach. It treats the task of aligning a language model's output with human preferences as a binary classification problem, thereby simplifying the process and making it more stable and computationally lightweight.

## 3.1 Data Processing

The first step in DPO is to prepare the data. Each training example needs three parts: a prompt, a positive (chosen) response, and a negative (rejected) response. The model's tokenizer then converts each piece of text into a sequence of tokens, and each token into an integer ID.

```python
dpo_dataset_dict = {
 "prompt": ["hello", "how are you", …],
 "chosen": ["hi, nice to meet you", "I am fine", …],
 "rejected": ["leave me alone", "I am not fine", …],
 }
 ```

## EXERCISE: Implement the `prepare_data` function

Implement the `prepare_data` function. The function should take a list of prompts and a list of chosen and rejected examples, and return a dictionary with the keys `prompt`, `chosen`, and `rejected`. Each key maps to the corresponding examples: the prompts, the chosen responses, and the rejected responses. The implementation below works on one sample at a time and returns plain strings, because `DPOTrainer` tokenizes the dataset itself.

In [1]:
!pip install datasets trl peft bitsandbytes accelerate -qqqq


In [2]:
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments, AutoTokenizer, PreTrainedTokenizerFast
from trl import DPOTrainer
from typing import Dict

In [3]:
model_id = "gpt2"
# model_id = 'EleutherAI/gpt-neo-125M' # Alternative model option

# Padding and truncation are options of the tokenizer call, not of `from_pretrained`, so only `model_max_length` is set here
tokenizer = AutoTokenizer.from_pretrained(model_id, model_max_length=1024)
tokenizer.pad_token_id = tokenizer.eos_token_id

# Example usage:
print("Tokenizer vocabulary size:", len(tokenizer))
print("Padding token:", tokenizer.pad_token)
print("EOS token:", tokenizer.eos_token)


Tokenizer vocabulary size: 50257
Padding token: <|endoftext|>
EOS token: <|endoftext|>




In [5]:
from rich import print
def extract_anthropic_prompt(prompt_and_response):
    """Extract the anthropic prompt from a prompt and response pair."""
    search_term = "\n\nAssistant:"
    search_term_idx = prompt_and_response.rfind(search_term)
    assert search_term_idx != -1, f"Prompt and response does not contain '{search_term}'"
    return prompt_and_response[: search_term_idx + len(search_term)]

def prepare_data(sample) -> Dict[str, str]:
    prompt = extract_anthropic_prompt(sample["chosen"])
    return {
        "prompt": prompt,
        "chosen": sample["chosen"][len(prompt) :],
        "rejected": sample["rejected"][len(prompt) :],
    }

# Example usage:
sample = {
    "chosen": "Human: What's the capital of France?\n\nAssistant: The capital of France is Paris.",
    "rejected": "Human: What's the capital of France?\n\nAssistant: The capital of France is London."
}
prepared_sample = prepare_data(sample)
print("Prepared sample:", prepared_sample)


In [7]:
train_dataset = load_dataset("Anthropic/hh-rlhf", split="train").map(prepare_data)
eval_dataset  = load_dataset("Anthropic/hh-rlhf", split="test").map(prepare_data)

# Example usage:
print("Number of training samples:", len(train_dataset))
print("Number of evaluation samples:", len(eval_dataset))
print("Sample from train dataset:", train_dataset[0])


Generating train split:   0%|          | 0/160800 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/8552 [00:00<?, ? examples/s]

Map:   0%|          | 0/160800 [00:00<?, ? examples/s]

Map:   0%|          | 0/8552 [00:00<?, ? examples/s]


In [9]:
print(train_dataset[0])
print(eval_dataset[0])

## 3.2. Model Training

This section sets up and trains a language model using Direct Preference Optimization (DPO). Let's break down the key components:

`AutoModelForCausalLM` and `AutoTokenizer` are classes from the Hugging Face `transformers` library. They are used to automatically load a pre-trained model and its corresponding tokenizer. `torch` is the PyTorch library, a popular framework for deep learning.

The code initializes a GPT-2 model (`model_id = "gpt2"`) for causal language modeling (predicting the next token in a sequence). The model is loaded with 4-bit quantization through `BitsAndBytesConfig`, uses 16-bit floating point precision (`torch.float16`) for memory efficiency, and specifies `device_map="auto"` for automatic device placement (e.g., on a GPU). The GPT-2 tokenizer was loaded earlier, and its padding token is set to be the same as the end-of-sequence token, which is a typical setup for models that generate text.
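
To see why lower precision helps, the sketch below estimates how much memory the weights alone need at different bit widths. It is a rough back-of-the-envelope calculation: the parameter count is an approximate figure assumed for the smallest GPT-2 checkpoint, the helper `weight_memory_mb` is hypothetical, and the estimate ignores layers that stay unquantized, quantization constants, activations, and optimizer state.

```python
def weight_memory_mb(num_params: int, bits_per_param: int) -> float:
    """Memory needed to store the weights alone, in megabytes (10^6 bytes)."""
    return num_params * bits_per_param / 8 / 1e6

approx_gpt2_params = 124_000_000  # approximate parameter count (assumption)

for bits in (32, 16, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_mb(approx_gpt2_params, bits):.0f} MB")
```

Going from 16-bit to 4-bit storage cuts the weight memory by roughly a factor of four, which is the main motivation for `load_in_4bit=True` on small GPUs.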

In [12]:
bnb_config = BitsAndBytesConfig(
#TODO: Add BitsAndBytesConfig parameters that ensure the objectives of your model fine-tuning are attained
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype='float16',
    bnb_4bit_use_double_quant=False,
)

#model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, quantization_config=bnb_config, device_map="auto")

In [11]:
print("BitsAndBytes config:", bnb_config)

In [13]:
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, quantization_config=bnb_config, device_map="auto")

# Example usage:
print("Model architecture:", type(model).__name__)
print("Model parameters:", sum(p.numel() for p in model.parameters()))

`DPOConfig` configures various parameters for training, such as batch size (`per_device_train_batch_size=2`), learning rate (`learning_rate=1e-5`), and the optimizer to use (`optim="paged_adamw_32bit"`). These settings dictate how the model will be trained, including how data is batched and how the model's weights are updated during training.

In [16]:
from trl import DPOConfig

output_dir = "./results"
training_arguments = DPOConfig(
    output_dir=output_dir,  # The directory where the trained model, checkpoints, and logs will be saved.

    # Number of training examples processed at once per device.
    # A smaller batch size may be necessary if working with limited GPU memory.
    per_device_train_batch_size=2,

    # Accumulates gradients over multiple steps before updating the model's weights.
    # This helps to simulate a larger batch size and is useful for working with limited memory.
    gradient_accumulation_steps=4,

    # Optimization algorithm used during training.
    # 'paged_adamw_32bit' is a memory-efficient optimizer that reduces memory overhead while training large models.
    optim="paged_adamw_32bit",

    # The learning rate controls how much to adjust the model in response to the error each time its weights are updated.
    # A smaller learning rate is often better for fine-tuning as it makes smaller updates to the model's weights.
    learning_rate=1e-5,

    # Type of learning rate scheduler.
    # 'cosine' gradually reduces the learning rate following a cosine curve, which can help with training stability.
    lr_scheduler_type="cosine",

    # Total number of optimization steps (iterations) the model will go through during training.
    # More steps will result in more learning but will also take longer.
    max_steps=200,

    # Defines how often (in steps) the training logs (like loss and metrics) are printed.
    # Here, every 10 steps, information about training progress is logged.
    logging_steps=10,

    # Whether to use 16-bit floating-point precision (mixed precision) during training to save memory.
    # It allows training larger models by reducing memory requirements while preserving model accuracy.
    fp16=True,

    # Enables gradient checkpointing, which reduces memory usage by not storing all intermediate activations during the forward pass.
    # The missing activations are recomputed during the backward pass, so training needs less memory but runs slower.
    gradient_checkpointing=True,

    # Fraction of the total number of steps to spend on a warm-up period, where the learning rate increases linearly from 0 to the set value.
    # It helps stabilize training by gradually ramping up the learning rate.
    warmup_ratio=0.1
)

# Example usage:
print("Training arguments:", training_arguments)
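
The interaction between `warmup_ratio`, `max_steps`, and the `cosine` scheduler is easier to see with numbers. The sketch below reimplements the usual linear-warmup plus cosine-decay shape in plain Python, using the values from the configuration above. The function `cosine_lr` is a hypothetical helper written for illustration, not the scheduler object the trainer actually builds, so treat the exact values as approximate.

```python
import math

def cosine_lr(step: int, base_lr: float = 1e-5, max_steps: int = 200, warmup_ratio: float = 0.1) -> float:
    """Learning rate at a given step: linear warmup, then cosine decay to zero."""
    warmup_steps = math.ceil(max_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup schedule that has elapsed, from 0.0 to 1.0
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))

for step in (0, 10, 20, 110, 200):
    print(step, f"{cosine_lr(step):.2e}")
```

With these settings the first 20 steps are warmup, the peak learning rate of `1e-5` is reached at step 20, and the rate falls to about half of that midway through the remaining steps before reaching zero at step 200.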


`DPOTrainer` is the training class that TRL provides for DPO. It takes the model, training arguments, datasets, and tokenizer as inputs. Important parameters here include `beta=0.1`, the DPO temperature that controls how strongly the trained model is kept close to the reference model, and `max_length` and `max_prompt_length`, which cap the token length of the full sequence (prompt plus response) and of the prompt alone. The `train()` method on `dpo_trainer` starts the training process: it iterates over `train_dataset`, computes the DPO loss on each batch of chosen and rejected responses, and updates the trainable weights accordingly.

PEFT, or Parameter-Efficient Fine-Tuning, is a method that allows you to fine-tune a model by modifying only a small subset of its parameters, rather than updating all the model's parameters. This makes the process faster, more efficient, and memory-friendly, especially for large language models.

One common implementation of PEFT is LoRA (Low-Rank Adaptation of Large Language Models), which uses low-rank matrices to adjust certain parts of the model's weights, significantly reducing the number of parameters to fine-tune.

Explanation of Parameters:

- `lora_alpha`:
  - Explanation: This is a scaling factor for the low-rank matrices introduced by LoRA. It controls how much influence these additional parameters have on the model's output.
  - Analogy: Imagine you are adding seasoning to a dish. `lora_alpha` is how much seasoning you add to ensure the taste is balanced without overpowering the original flavor.
- `r` (Rank):
  - Explanation: The rank of the low-rank adaptation matrices. Higher values introduce more parameters, increasing the model's expressiveness but also its complexity.
  - Analogy: Think of `r` as how many additional tools you are using to adjust the model. More tools (higher rank) allow more precise adjustments but also require more resources.
- `bias`:
  - Explanation: Determines whether the bias terms in the model (which adjust the output independently of the input) should be modified. In this case, "none" means no changes are made to the bias terms.
  - Analogy: Bias terms are like automatic settings on a machine. Sometimes, you don't need to adjust them for fine-tuning.
- `task_type`:
  - Explanation: Defines the type of task for which you are fine-tuning the model. In this case, "CAUSAL_LM" refers to Causal Language Modeling, used for text generation tasks (e.g., predicting the next word based on the previous ones).
  - Analogy: It's like setting the mode on a machine. Here, you're telling it to focus on language generation.
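
The effect of `r` and `lora_alpha` can be put into numbers. The sketch below assumes the standard LoRA formulation, where a weight matrix of shape `d_out × d_in` gets two small matrices `A` (`r × d_in`) and `B` (`d_out × r`), and the update is scaled by `lora_alpha / r`. The layer size and the helper names are illustrative assumptions, not values taken from the model used here.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one weight matrix: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)

def lora_scaling(lora_alpha: float, r: int) -> float:
    """Factor applied to the low-rank update B @ A before it is added to the frozen weight."""
    return lora_alpha / r

d_in, d_out = 1024, 1024                   # hypothetical projection layer size
full = d_in * d_out                        # parameters in the frozen weight matrix
added = lora_param_count(d_in, d_out, r=16)
print(f"full matrix: {full:,} params, LoRA with r=16: {added:,} params ({added / full:.2%})")

print("scaling for lora_alpha=16, r=16:", lora_scaling(16, 16))
print("scaling for lora_alpha=32, r=8:", lora_scaling(32, 8))
```

Doubling `r` doubles the number of trainable parameters, while `lora_alpha` only changes how strongly the learned update is weighted relative to the frozen weights.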

In [19]:
peft_config = LoraConfig(
    lora_alpha=16,  # LoRA scaling factor that adjusts the influence of the low-rank matrices on the final output.
    r=16,           # Rank of the LoRA matrices, controlling how many additional parameters you are introducing.
    bias="none",    # Specifies how to handle bias terms. "none" means the bias terms are not modified.
    task_type="CAUSAL_LM"  # The type of task. In this case, it is set for causal language modeling, meaning it is used for tasks like text generation where future tokens are predicted based on past ones.
)


In [20]:
from trl import DPOTrainer

# Create DPO trainer
dpo_trainer = DPOTrainer(
    model,  # Pre-trained model to fine-tune
    args=training_arguments,  # Training arguments defined in the DPOConfig, including batch size, learning rate, etc.
    tokenizer=tokenizer,  # The tokenizer used to process text into tokens that the model can understand.
    train_dataset=train_dataset,  # The dataset for training the model, containing input-output pairs the model will learn from.
    eval_dataset=eval_dataset,  # The dataset used for evaluation during training, to monitor how well the model is performing.
    peft_config=peft_config,  # The PEFT configuration that defines how LoRA will be applied during training.
    beta=0.1,  # DPO temperature: higher values keep the fine-tuned model closer to the reference model, lower values let it deviate more.
    max_prompt_length=512,  # The maximum length of the prompt in tokens; longer prompts are truncated.
    max_length=512,  # The maximum combined length of prompt and response in tokens; longer sequences are truncated.
)

# Fine-tune model with DPO
dpo_trainer.train()



Deprecated positional argument(s) used in DPOTrainer, please use the DPOConfig to set these arguments instead.


Tokenizing train dataset:   0%|          | 0/160800 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (1027 > 1024). Running this sequence through the model will result in indexing errors


Tokenizing eval dataset:   0%|          | 0/8552 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs
Could not estimate the number of tokens of the input, floating-point operations will not be computed


Step,Training Loss
10,0.6912
20,0.6941
30,0.6913
40,0.6926
50,0.6932
60,0.6916
70,0.6922
80,0.6942
90,0.6949
100,0.6915


TrainOutput(global_step=200, training_loss=0.6927534103393554, metrics={'train_runtime': 205.1285, 'train_samples_per_second': 7.8, 'train_steps_per_second': 0.975, 'total_flos': 0.0, 'train_loss': 0.6927534103393554, 'epoch': 0.009950248756218905})
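
The `epoch` value in this output shows how little of the dataset a 200-step run covers. A quick check of the arithmetic, assuming a single device (the helper `examples_seen` is hypothetical):

```python
def examples_seen(steps: int, per_device_batch: int, grad_accum: int, num_devices: int = 1) -> int:
    """Number of training examples consumed after a given number of optimizer steps."""
    return steps * per_device_batch * grad_accum * num_devices

train_size = 160_800  # size of the Anthropic/hh-rlhf train split reported above
seen = examples_seen(steps=200, per_device_batch=2, grad_accum=4)
print(f"examples seen: {seen}, fraction of one epoch: {seen / train_size:.4%}")
```

The run processes 1,600 preference pairs, just under 1% of one epoch, which agrees with the reported `epoch` of about 0.00995.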

In [21]:
print(model)

DPO is a method for aligning the outputs of language models with human preferences. It simplifies the process by treating the alignment task as a binary classification problem, where the model learns to differentiate between preferred and non-preferred responses. Unlike traditional approaches that require a separate reward model, DPO directly integrates preference learning into the training process, making it more efficient and straightforward.
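
The binary-classification view can be written down directly. The sketch below computes the sigmoid form of the DPO loss for a single preference pair from summed log-probabilities. It is a simplified illustration with made-up numbers and a hypothetical function name, not TRL's implementation, which works on batches of token-level log-probabilities.

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float, beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair, given sequence log-probabilities."""
    # Implicit rewards: how much more likely each response is under the policy than under the reference
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_reward - rejected_reward)
    # -log(sigmoid(x)) written as log(1 + exp(-x))
    return math.log1p(math.exp(-logits))

# Policy identical to the reference model: the loss equals ln(2)
print(dpo_loss(-10.0, -12.0, -10.0, -12.0), math.log(2))

# Policy has shifted probability toward the chosen response: the loss drops below ln(2)
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

When the policy assigns the same log-probabilities as the reference model, the loss is ln(2) ≈ 0.693. The training losses logged above stay close to that value, which is consistent with the policy having moved very little from the reference model during this short run.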

## EXERCISE: Implement the `infer` function

In [25]:
from rich import print
def infer(instruction: str, context: str):
    # Build the prompt piece by piece so no stray indentation ends up in the model input
    template = (
        "### Instruction: {instruction}\n\n"
        "### Context: {context}\n\n"
        "### Response: {response}"
    )

    inputs = template.format(
        instruction=instruction,
        context=context,
        response="capital of France is"
    ).strip()
    # Move the inputs to the same device as the model
    encoding = tokenizer([inputs], return_tensors="pt").to(model.device)
    outputs = model.generate(**encoding, max_new_tokens=200)
    output_text = tokenizer.decode(outputs[0])
    return output_text

print(infer(
    instruction="What is the capital of France?",
    context="It has the Eiffel Tower",
))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [27]:
# Try running inference without the 'Instruction-Context-Response' template above and see how the model responds
# Note: this cell loads the base GPT-2 checkpoint
from transformers import GPT2Tokenizer, GPT2LMHeadModel
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2').to('cuda')
text = "The capital of France is"
# Calling the tokenizer directly returns both input_ids and attention_mask
encoded_input = tokenizer(text, return_tensors='pt').to('cuda')

# Generate a sequence of tokens, passing the attention mask and an explicit pad token id
output = model.generate(**encoded_input, max_length=200, temperature=.1, do_sample=True, pad_token_id=tokenizer.eos_token_id)

# Decode the output tokens to text
output_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(output_text)



# 4. Collecting Human Feedback

In this section we will go back to the data itself. We will use the Argilla library to collect human feedback on the data. In practice, we will just push some data to Argilla and explore its quality.

From experience and from the literature, we know that the quality of the data is the most important factor in the quality of the model. So we will use Argilla to collect human feedback on the data, and then use that feedback to improve the data. It is important to become familiar with inspecting the data and judging its quality.

## 4.1. Setup

First, we need to install the Argilla library. Argilla is a library for collecting human feedback on data. It is designed to be easy to use, with a simple interface for collecting feedback, and to be flexible, allowing you to collect feedback on any type of data, including text, images, and audio.

In [None]:
# If you run into a Colab 'NotImplementedError: A UTF-8 locale is required. Got ANSI_X3.4-1968', this code block resolves the problem
import locale
locale.getpreferredencoding = lambda: "UTF-8"

In [None]:
%pip install "transformers~=4.34.0" "datasets~=2.14.5" "peft~=0.5.0" "trl~=0.7.1" "wandb~=0.15.12" -qqq

In [None]:
%pip install "argilla[server, listeners]==1.16.0" -qqq

**Running Argilla Quickstart**

For small-scale projects and quick experimentation, Argilla recommends two ways of running the server. This tutorial uses one of them, Argilla on Hugging Face Spaces.

**👩🏽‍🚀 Argilla on Hugging Face Spaces**

If you have a Hugging Face account and want to run Argilla workflows from Colab or remote notebooks, you can deploy Argilla on Spaces:

[deploy on spaces](https://huggingface.co/new-space?template=argilla/argilla-template-space)

In [None]:
!huggingface-cli login

# or use from google.colab import userdata to fetch the HF_TOKEN

In [None]:
import argilla as rg
from google.colab import userdata

rg.init(
    api_url=userdata.get('ARGILLA_API_URL'),
    api_key=userdata.get('ARGILLA_API_KEY'),
    workspace="admin"
)

This may lead to potential compatibility issues during your experience.
To ensure a seamless and optimized connection, we highly recommend aligning your client version with the server version.


In [None]:
rg.server_info() # This shows you the details of the server you are working with

ServerInfo(url='https://uonyeka-argilla-data.hf.space', version='1.22.0', elasticsearch_version='8.8.2')

![Argilla on Spaces](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/spaces-argilla-duplicate-space.png)

Hugging Face Spaces now have persistent storage, which is supported from Argilla 1.11.0 onwards, but you will need to activate it manually in the Hugging Face Spaces settings. Otherwise, unless you're on a paid Space upgrade, the Space will be shut off after 48 hours of inactivity and you will lose all the data. To avoid losing data, we highly recommend using the persistent storage layer offered by Hugging Face. If everything goes well, you'll see your online Argilla UI login page. You can log in with username `admin` and password `12345678`. You can find the direct URL by clicking on the Embed space button. You'll use this URL for sending data to your Argilla instance.

## 4.2 Defining the Feedback Dataset

Argilla's feedback datasets allow you to collect detailed information from annotators that your LLM can learn from.

In [None]:
dataset = rg.FeedbackDataset(
    fields = [
        rg.TextField(name="background"),
        rg.TextField(name="prompt"),
        rg.TextField(name="response", title="Final Response"),
    ],
    questions = [
        rg.LabelQuestion(name="quality", title="Is it a Good or Bad response?", labels=["Good", "Bad"])
    ]
)
dataset.push_to_argilla(name="oig-30k", workspace="admin")

Pushing records to Argilla...: 0it [00:00, ?it/s]


<FeedbackDataset id=7570fa49-c3fe-46c3-898d-a69cb5f763ce name=oig-30k workspace=Workspace(id=e4e926af-84fe-4a13-aae3-4c0b97820f81, name=admin, inserted_at=2024-01-19 11:23:26.195486, updated_at=2024-01-19 11:23:26.195486) url=https://uonyeka-argilla-data.hf.space/dataset/7570fa49-c3fe-46c3-898d-a69cb5f763ce/annotation-mode fields=[TextField(id=UUID('cb5208b7-5dcb-4ca4-a532-01ae620e3b2c'), name='background', title='Background', required=True, type='text', settings={'type': 'text', 'use_markdown': False}, use_markdown=False), TextField(id=UUID('9fcf9607-07df-4ae5-b2ed-b9364cbd0114'), name='prompt', title='Prompt', required=True, type='text', settings={'type': 'text', 'use_markdown': False}, use_markdown=False), TextField(id=UUID('0d0175c3-d8fc-4d5a-b38b-b41f418ad63b'), name='response', title='Final Response', required=True, type='text', settings={'type': 'text', 'use_markdown': False}, use_markdown=False)] questions=[LabelQuestion(id=UUID('0969fa40-bc18-44f6-92dd-cd4adb3bbccd'), name='qu

In [None]:
# my_records = [
#     {"question": "What is the capital city of France?", "answer": "Paris"},
#     {"question": "Which planet is closest to the sun?", "answer": "Mercury"}
#     ]


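# Field names must match the fields defined on the dataset: background, prompt, response
my_records = rg.FeedbackRecord(
    fields={
        "background": "General knowledge question.",  # illustrative placeholder value; replace with your own background text
        "prompt": "Why can camels survive long without water?",
        "response": "Camels use the fat in their humps to keep them filled with energy and hydration for long periods."
    })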

dataset.add_records(my_records)
record = dataset[0]
print(record)

For the sake of this tutorial, we will use a simple dataset of prompts and responses: a sample of 30,000 records streamed from `laion/OIG`, as loaded in the next cell. The Open Assistant dataset is a related resource, which you can find here: https://huggingface.co/datasets/OpenAssistant/oasst1/tree/main

In [None]:
from datasets import load_dataset

data = load_dataset("laion/OIG", split="train", streaming=True)
data = data.shuffle(buffer_size=1_000_000).take(30000)

Resolving data files:   0%|          | 0/35 [00:00<?, ?it/s]

In [None]:
# A streaming dataset cannot be indexed, so pull the first record from its iterator.
next(iter(data))["text"]

In [None]:
from typing import Dict, Any

def extract_background_prompt_response(text: str) -> Dict[str, Any]:
    '''Split a record's text into its background, prompt, and response parts.'''
    start_prompt = text.find("<human>:")
    end_prompt = text.rfind("<bot>:")
    # Background is anything before the first <human>:
    background = text[:start_prompt].strip()
    # Prompt is anything between the first <human>: (inclusive) and the last <bot>: (exclusive)
    prompt = text[start_prompt: end_prompt].strip()
    # Response is everything after the last <bot>: (inclusive)
    response = text[end_prompt:].strip()
    return {"background": background, "prompt": prompt, "response": response}


data = data.map(extract_background_prompt_response, input_columns="text")
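
To make the slicing above concrete, here is a small standalone sketch that applies the same `find`/`rfind` logic to a single string. The sample text and the `split_record` name are hypothetical, written only for illustration; they are not taken from `laion/OIG`.

```python
def split_record(text: str) -> dict:
    """Same find/rfind slicing as the cell above, repeated so this runs standalone."""
    start_prompt = text.find("<human>:")
    end_prompt = text.rfind("<bot>:")
    return {
        "background": text[:start_prompt].strip(),
        "prompt": text[start_prompt:end_prompt].strip(),
        "response": text[end_prompt:].strip(),
    }

# Hypothetical multi-turn record, written for illustration only.
sample = (
    "Background: Camels live in deserts. "
    "<human>: Do camels store water in their humps? "
    "<bot>: No, the humps store fat. "
    "<human>: How long can they go without drinking? "
    "<bot>: Often a week or more."
)

parts = split_record(sample)
print(parts["background"])  # text before the first <human>:
print(parts["prompt"])      # first <human>: up to (not including) the last <bot>:
print(parts["response"])    # the last <bot>: turn, marker included

# find() and rfind() return -1 when a marker is missing, which silently
# yields wrong slices, so it may be worth checking for both markers first.
has_markers = "<human>:" in sample and "<bot>:" in sample
```

Note that earlier `<bot>:` turns stay inside the prompt; only the final turn is treated as the response.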

In [None]:
ds = [rg.FeedbackRecord(fields={"background": d["background"], "prompt": d["prompt"], "response": d["response"]}) for d in data]

In [None]:
dataset.add_records(ds)

In [None]:
dataset.push_to_argilla("oig-30k", workspace="admin")

# 4. [Optional] Training a Model with Human Feedback

In [None]:
feedback_dataset = rg.FeedbackDataset.from_argilla("oig-30k")

In [None]:
dataset_ds = feedback_dataset.format_as("datasets")

In [None]:
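from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM

model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")

def formatting_prompts_func(example):
    # Called on a batch: each column is a list, so build one string per record.
    output_texts = []
    for i in range(len(example['prompt'])):
        # TODO: Implement sample instruction text.
        # Placeholder format (an assumption, not a prescribed template): the
        # prompt, a newline, then the response, which already starts with "<bot>:".
        text = f"{example['prompt'][i]}\n{example['response'][i]}"
        output_texts.append(text)
    return output_texts

# Must match the marker that precedes the response in the formatted text,
# so that loss is computed only on the response tokens.
response_template = "<bot>:"
collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)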
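
The idea behind a completion-only collator can be sketched in plain Python: tokens up to and including the response template get the label `-100`, so the loss only covers the response. This is a conceptual illustration on toy token ids, not TRL's implementation; `mask_before_response` is a hypothetical helper, and it assumes the template occurs once in the sequence.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def mask_before_response(input_ids, template_ids):
    """Return labels in which every token up to and including the template is ignored."""
    n = len(template_ids)
    for start in range(len(input_ids) - n + 1):
        if input_ids[start:start + n] == template_ids:
            end = start + n
            return [IGNORE_INDEX] * end + input_ids[end:]
    # Template not found: no token can be treated as part of a completion.
    return [IGNORE_INDEX] * len(input_ids)

# Toy token ids (hypothetical; a real tokenizer would produce these).
prompt_ids = [11, 12, 13]
template_ids = [7, 8]
response_ids = [21, 22]

labels = mask_before_response(prompt_ids + template_ids + response_ids, template_ids)
print(labels)  # [-100, -100, -100, -100, -100, 21, 22]
```

If the template never appears in a formatted example, every label is ignored and that example contributes nothing to training, which is why the template string has to match the formatting function's output.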



In [None]:
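from transformers import TrainingArguments

# evaluation_strategy='epoch' needs an eval set, so hold out part of the data.
# The 10% test size is an arbitrary choice for this example.
split = dataset_ds.train_test_split(test_size=0.1)

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    save_steps=1000,
    save_total_limit=2,
    logging_dir='./logs',
    logging_steps=100,
    evaluation_strategy='epoch'
    )

trainer = SFTTrainer(
    model=model,
    args=training_args,
    data_collator=collator,
    train_dataset=split["train"],
    eval_dataset=split["test"],
    formatting_func=formatting_prompts_func,
    )

trainer.train()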

# The End 💐 🎆

Well done! You have finished the week 2 project.