# Preference Alignment with Direct Preference Optimization (DPO)

This notebook walks through fine-tuning a language model with Direct Preference Optimization (DPO). We use the SmolLM2-135M-Instruct model, which has already been through supervised fine-tuning (SFT), so it is compatible with DPO. You can also use the model you trained in [1_instruction_tuning](../../1_instruction_tuning/notebooks/sft_finetuning_example.ipynb).

<div style='background-color: lightblue; padding: 10px; border-radius: 5px; margin-bottom: 20px; color:black'>
     <h2 style='margin: 0;color:blue'>Exercise: Aligning SmolLM2 with DPOTrainer</h2>
     <p>Take a dataset from the Hugging Face hub and align a model on it. </p>
     <p><b>Difficulty Levels</b></p>
     <p>🐢 Use the <code>trl-lib/ultrafeedback_binarized</code> dataset</p>
     <p>🐕 Try out the <code>argilla/ultrafeedback-binarized-preferences</code> dataset</p>
     <p>🦁 Select a dataset that relates to a real-world use case you’re interested in, or use the model you trained in
        <a href="../../1_instruction_tuning/notebooks/sft_finetuning_example.ipynb">1_instruction_tuning</a></p>
</div>

In [1]:
# Install the requirements in Google Colab
!pip install transformers datasets trl huggingface_hub

# Authenticate to Hugging Face

from huggingface_hub import login

login()

# for convenience you can create an environment variable containing your hub token as HF_TOKEN

Collecting datasets
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting trl
  Downloading trl-0.12.2-py3-none-any.whl.metadata (11 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.2.0-py3-none-any.whl (480 kB)
Downloading trl-0.12.2-py3-none-any.whl (365 kB)


## Import libraries


In [2]:
import torch
import os
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from trl import DPOTrainer, DPOConfig
# Direct Preference Optimization

## Format dataset

In [6]:
# Load dataset

# Each example has top-level "chosen" and "rejected" keys, each holding a conversation as a list of role/content messages.
# This is the conversational preference format that TRL's DPOTrainer accepts.
# TODO: 🦁🐕 change the dataset to one of your choosing
dataset = load_dataset(path="trl-lib/ultrafeedback_binarized", split="train")

README.md:   0%|          | 0.00/643 [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/131M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/2.14M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/62135 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1000 [00:00<?, ? examples/s]

In [None]:
# TODO: 🐕 If your dataset is not represented as conversation lists, you can use the `process_dataset` function to convert it.
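
The `process_dataset` helper mentioned above is not defined in this notebook. The following is a minimal sketch of what one could look like, assuming a hypothetical source dataset with flat string columns named `prompt`, `chosen`, and `rejected`. Both the column names and the helper body are assumptions for illustration; they are not part of TRL or of any particular dataset.

```python
def process_dataset(example):
    """Convert one flat preference record into conversation lists.

    Assumes hypothetical string fields "prompt", "chosen" and "rejected";
    rename them to match the columns of your own dataset.
    """
    return {
        "chosen": [
            {"role": "user", "content": example["prompt"]},
            {"role": "assistant", "content": example["chosen"]},
        ],
        "rejected": [
            {"role": "user", "content": example["prompt"]},
            {"role": "assistant", "content": example["rejected"]},
        ],
    }


# Toy record, invented for illustration only
sample = {
    "prompt": "Name a primary color.",
    "chosen": "Red is a primary color.",
    "rejected": "I don't know.",
}
converted = process_dataset(sample)
print(converted["chosen"])
```

With a `datasets.Dataset`, the same function could be applied through `dataset.map(process_dataset, remove_columns=["prompt"])`, so that only the two conversation columns remain.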

## Select the model

We will use the SmolLM2-135M-Instruct model, which has already been through supervised fine-tuning (SFT), so it is compatible with DPO. You can also use the model you trained in [1_instruction_tuning](../../1_instruction_tuning/notebooks/sft_finetuning_example.ipynb).


<div style='background-color: lightblue; padding: 10px; border-radius: 5px; margin-bottom: 20px; width:80%; color:black'>
     <p>🦁 change the model to the path or repo id of the model you trained in <a href="../../1_instruction_tuning/notebooks/sft_finetuning_example.ipynb">1_instruction_tuning</a></p>
</div>


In [25]:
# TODO: 🦁 change the model to the path or repo id of the model you trained in [1_instruction_tuning](../../1_instruction_tuning/notebooks/sft_finetuning_example.ipynb)

# Terminology

# Pretraining: the base model is trained on a very large, general text corpus from the internet

# Supervised fine-tuning (SFT): the base model is further trained on instruction/response examples
# formatted with a chat template; the result is the "-Instruct" model

# Base (pretrained, not instruction-tuned) model = "HuggingFaceTB/SmolLM2-135M"

model_name = "HuggingFaceTB/SmolLM2-135M-Instruct"

# model_name = "aelydens/SmolLM2-FT-MyDataset"

device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps"
    if torch.backends.mps.is_available()
    else "cpu"
)

# Model to fine-tune
model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=model_name,
    torch_dtype=torch.float32,
).to(device)
model.config.use_cache = False
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Name under which the fine-tuned model is saved and/or uploaded
finetune_name = "SmolLM2-FT-DPO-orig"
finetune_tags = ["smol-course", "module_1"]

## Train model with DPO
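
As a rough sketch of what the trainer optimizes: for each preference pair, the sigmoid DPO loss compares how much more the policy prefers the chosen response over the rejected one than a frozen reference model does. The function below illustrates this in plain Python with made-up log-probabilities. The function and argument names are hypothetical, and `DPOTrainer` computes the corresponding quantity on batched tensors rather than on single numbers.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Sigmoid DPO loss for one preference pair (illustrative sketch)."""
    # How much the policy has moved relative to the reference on each response
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))


# Policy identical to the reference: zero margin, so the loss equals ln(2)
print(round(dpo_loss(-12.0, -15.0, -12.0, -15.0), 4))  # 0.6931

# Policy has shifted probability toward the chosen response: the loss drops
print(dpo_loss(-10.0, -16.0, -12.0, -15.0) < math.log(2))  # True
```

A training loss near 0.6931 therefore indicates that the policy has not yet moved away from the reference model.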

In [26]:
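# Alignment
# A human grades pairs of responses from the LLM as chosen vs. rejected
# Then we train the model on that preference feedback

# What is DPO?
# An efficient method for training a language model directly on preference pairs:
# it uses a classification-style loss against a frozen reference model, with no separate reward model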


# Training arguments
training_args = DPOConfig(
    # Training batch size per GPU
    per_device_train_batch_size=4,
    # Number of micro-batches whose gradients are accumulated before each optimizer update
    # Effective batch size = per_device_train_batch_size * gradient_accumulation_steps
    gradient_accumulation_steps=4,
    # Saves memory by not storing activations during forward pass
    # Instead recomputes them during backward pass
    gradient_checkpointing=True,
    # Base learning rate for training
    learning_rate=5e-5,
    # Learning rate schedule - 'cosine' gradually decreases LR following cosine curve
    lr_scheduler_type="cosine",
    # Total number of training steps
    max_steps=200,
    # Disables model checkpointing during training
    save_strategy="no",
    # How often to log training metrics
    logging_steps=1,
    # Directory to save model outputs
    output_dir="smol_dpo_output",
    # Number of steps for learning rate warmup
    warmup_steps=100,
    # Use bfloat16 precision for faster training
    bf16=True,
    # Disable wandb/tensorboard logging (the string "none" turns off all integrations)
    report_to="none",
    # Keep all columns in dataset even if not used
    remove_unused_columns=False,
    # Enable MPS (Metal Performance Shaders) for Mac devices
    use_mps_device=device == "mps",
    # Model ID for HuggingFace Hub uploads
    hub_model_id=finetune_name,
)
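
To see what `warmup_steps=100` and `lr_scheduler_type="cosine"` imply over `max_steps=200`, here is a small sketch of a linear-warmup, half-cosine-decay schedule in plain Python. It follows the usual formula for such schedules and is not a copy of the library's scheduler code; the function name `lr_at_step` is hypothetical. It also prints the effective batch size described in the comments above, assuming a single device.

```python
import math


def lr_at_step(step, base_lr=5e-5, warmup_steps=100, max_steps=200):
    """Learning rate after `step` optimizer updates: linear warmup, then half-cosine decay."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Effective batch size per optimizer step on a single device:
# per_device_train_batch_size * gradient_accumulation_steps
print(4 * 4)  # 16

# Learning rate at a few points of the 200-step run
for step in (0, 50, 100, 150, 200):
    print(step, f"{lr_at_step(step):.2e}")
```

With these settings, half of the 200 steps are spent warming up, so the peak learning rate of 5e-5 is reached only at step 100 and decays for the remaining 100 steps.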

In [27]:
trainer = DPOTrainer(
    # The model to be trained
    model=model,
    # Training configuration from above
    args=training_args,
    # Dataset containing preferred/rejected response pairs
    train_dataset=dataset,
    # Tokenizer for processing inputs
    processing_class=tokenizer,
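    # DPO-specific parameter that controls how far the policy may deviate from the reference model
    # Higher values keep the model closer to the reference; lower values (like 0.1) allow more deviation
    # Note: TRL warns that beta, max_prompt_length and max_length belong in DPOConfig (see the warning below)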
    beta=0.1,
    # Maximum length of the input prompt in tokens
    max_prompt_length=1024,
    # Maximum combined length of prompt + response in tokens
    max_length=1536,
)


Deprecated positional argument(s) used in DPOTrainer, please use the DPOConfig to set these arguments instead.


Applying chat template to train dataset:   0%|          | 0/62135 [00:00<?, ? examples/s]

Tokenizing train dataset:   0%|          | 0/62135 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (2379 > 2048). Running this sequence through the model will result in indexing errors
max_steps is given, it will override any value given in num_train_epochs


In [30]:
# Inspect the rejected conversation of the first example
dataset[0]["rejected"]

[{'content': 'Use the pygame library to write a version of the classic game Snake, with a unique twist',
  'role': 'user'},
 {'content': 'Sure, here\'s an example of how to write a version of Snake game with a unique twist using the Pygame library:\n```python\nimport pygame\n\nclass SnakeGame:\n    def __init__(self, game_width, game_height):\n        pygame.init()\n        screen = pygame.display.set_mode((game_width, game_height))\n        pygame.display.set_caption("Snake Game")\n        self.speed = 5  # Speed of the snake\n        self.food_speed = 1  # Speed of the food\n        self.direction = 0  # Initial direction of the snake\n        self.snakelen = 0  # Length of the snake\n        self.food = pygame.image.load("snake_food.png")\n        self.head = pygame.image.load("snake_head.png")\n        self.tail = pygame.image.load("snake_tail.png")\n        self.game Quint()\n    def Quint(self):\n        for i in range(50):\n            pygame.draw.line(screen, (180, 100, 220), (

In [31]:
# Train the model
trainer.train()

# Save the model
trainer.save_model(f"./{finetune_name}")

# Push to the Hugging Face Hub if the HF_TOKEN environment variable is set
if os.getenv("HF_TOKEN"):
    trainer.push_to_hub(tags=finetune_tags)

Could not estimate the number of tokens of the input, floating-point operations will not be computed


Step,Training Loss
1,0.6931
2,0.6931
3,0.6836
4,0.6852
5,0.698
6,0.695
7,0.6939
8,0.7074
9,0.6809
10,0.6961


## 💐 You're done!

This notebook provided a step-by-step guide to fine-tuning the `HuggingFaceTB/SmolLM2-135M-Instruct` model using the `DPOTrainer`. By following these steps, you can align the model with the preferences expressed in a dataset of chosen and rejected responses. If you want to carry on working on this course, here are some steps you could try:

- Try this notebook on a harder difficulty
- Review a colleague's PR
- Improve the course material via an Issue or PR.

In [32]:
prompt = "What happened in World War 2?"

# Load the DPO fine-tuned model that was saved above
dpo_model_name = "SmolLM2-FT-DPO-orig"
dpo_model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=dpo_model_name
).to(device)
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=dpo_model_name)

# Format with the tokenizer's chat template
# add_generation_prompt=True appends the assistant header so the model starts its reply
messages = [{"role": "user", "content": prompt}]
formatted_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# Tokenize the prompt and move it to the same device as the model
inputs = tokenizer(formatted_prompt, return_tensors="pt").to(device)

# Generate a response with the fine-tuned model (dpo_model, not the in-memory `model`)
outputs = dpo_model.generate(**inputs, max_new_tokens=500)

print(tokenizer.decode(outputs[0]))




<|im_start|>system
You are a helpful AI assistant named SmolLM, trained by Hugging Face<|im_end|>
<|im_start|>user
What happened in World War 2?<|im_end|>
<|im_start|>assistant
World War II was a global conflict that lasted from 1939 to 1945. It was a period of intense conflict, marked by intense battles, massive losses, and significant changes in the world. The war was fought between the Axis powers (led by Germany, Italy, and Japan) and the Allied powers (led by the United Kingdom, the United States, and the Soviet Union).

The war began in 1939 when Germany invaded Poland, marking the beginning of World War II. The invasion of Poland led to the invasion of the Soviet Union, which led to the Soviet Union invading Poland, and the invasion of the United Kingdom, which led to the invasion of Poland.

The war was marked by intense battles, including the Battle of Britain, the Battle of the Bulge, and the Battle of Stalingrad. The war also saw the rise of fascist and nationalist movements
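
The decoded output above includes a system message even though `messages` contained only a user turn. That suggests the tokenizer's chat template inserts a default system prompt when none is supplied. The function below is a hand-written imitation of that layout, reconstructed from the printed output. It is an assumption for illustration, and `render_chatml` is a hypothetical name, not the tokenizer's actual template.

```python
DEFAULT_SYSTEM = "You are a helpful AI assistant named SmolLM, trained by Hugging Face"


def render_chatml(messages, add_generation_prompt=True):
    """Render messages in the <|im_start|>/<|im_end|> layout seen in the output above."""
    if not messages or messages[0]["role"] != "system":
        # Assumed behaviour: prepend a default system message when none is given
        messages = [{"role": "system", "content": DEFAULT_SYSTEM}] + list(messages)
    text = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    if add_generation_prompt:
        # Open the assistant turn so the model continues with its reply
        text += "<|im_start|>assistant\n"
    return text


rendered = render_chatml([{"role": "user", "content": "What happened in World War 2?"}])
print(rendered)
```

To check the real template, inspect `tokenizer.chat_template` on the loaded tokenizer.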

In [None]:
# What did we learn?

'''
- DPO aligns a model directly on preference feedback. It avoids training the separate reward model that RLHF requires.


Things we are still uncertain about:
- The output of the instruct-tuned SmolLM2-135M contains a system prompt that our messages did not include. Why?

- How are alignment and fine-tuning related?

- Does the DPO algorithm expect a specific shape for the dataset?
A: Yes: each example needs a "chosen" and a "rejected" response.
'''