# Direct Preference Optimization Example - M2DS Reinforcement Learning Course

* This notebook contains both a simple fine-tuning run and a DPO fine-tuning run. If you only want to experiment with DPO, go to Part 2 after running the imports and preparation cells.
* Fine-tune the model on a preference dataset using [DPO](https://huggingface.co/docs/trl/main/dpo_trainer#dpo-trainer) (Direct Preference Optimization).
 <br>


# <b>Part 1: Fine-tuning Qwen2.5-0.5B using Hugging Face's Transformers</b>
In this section, we will fine-tune [Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B) on a question/answer dataset.

To reduce the GPU VRAM required for fine-tuning, we will use [LoRA](https://www.anyscale.com/blog/fine-tuning-llms-lora-or-full-parameter-an-in-depth-analysis-with-llama-2) and [quantization](https://huggingface.co/blog/4bit-transformers-bitsandbytes) techniques.

## <b>Preparing the environment and installing libraries:</b>

In [3]:
!nvidia-smi

Fri Feb 28 23:04:59 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   39C    P8             10W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [4]:
!pip install -qqq bitsandbytes torch transformers peft accelerate datasets loralib einops trl


In [5]:
!pip install --upgrade transformers

Collecting transformers
  Downloading transformers-4.49.0-py3-none-any.whl.metadata (44 kB)
Downloading transformers-4.49.0-py3-none-any.whl (10.0 MB)
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.48.3
    Uninstalling transformers-4.48.3:
      Successfully uninstalled transformers-4.48.3
Successfully installed transformers-4.49.0


In [6]:
import json
import os
from pprint import pprint

import bitsandbytes as bnb
import pandas as pd
import torch
import torch.nn as nn
import transformers
from datasets import load_dataset
from trl import DPOConfig, DPOTrainer

from peft import (
    LoraConfig,
    PeftConfig,
    PeftModel,
    get_peft_model,
    prepare_model_for_kbit_training,
)
from transformers import (
    AutoConfig,
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
)

## <b>Loading the model and the tokenizer:</b>

In this section, we will load the Qwen model, using the BitsAndBytes library for quantization.

In [7]:
MODEL_NAME = "Qwen/Qwen2.5-0.5B"
# MODEL_NAME = "unsloth/Llama-3.2-1B" # Try Llama if you want

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,  # Enable 4-bit quantization
    bnb_4bit_use_double_quant=True,  # Use double quantization to save memory
    bnb_4bit_quant_type="nf4",  # Specify the quantization type (nf4 is often used)
    bnb_4bit_compute_dtype=torch.float16  # Set the computation data type to float16
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    device_map="auto",
    trust_remote_code=True,
    quantization_config=bnb_config,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token


Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered.



In [8]:
def print_trainable_parameters(model):

    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        # Count only the parameters that will be updated during training
        if param.requires_grad:
            trainable_params += param.numel()

    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

## <b>Configuring LoRA:</b>

In [9]:
# before
print(model)

Qwen2ForCausalLM(
  (model): Qwen2Model(
    (embed_tokens): Embedding(151936, 896)
    (layers): ModuleList(
      (0-23): 24 x Qwen2DecoderLayer(
        (self_attn): Qwen2Attention(
          (q_proj): Linear4bit(in_features=896, out_features=896, bias=True)
          (k_proj): Linear4bit(in_features=896, out_features=128, bias=True)
          (v_proj): Linear4bit(in_features=896, out_features=128, bias=True)
          (o_proj): Linear4bit(in_features=896, out_features=896, bias=False)
        )
        (mlp): Qwen2MLP(
          (gate_proj): Linear4bit(in_features=896, out_features=4864, bias=False)
          (up_proj): Linear4bit(in_features=896, out_features=4864, bias=False)
          (down_proj): Linear4bit(in_features=4864, out_features=896, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen2RMSNorm((896,), eps=1e-06)
        (post_attention_layernorm): Qwen2RMSNorm((896,), eps=1e-06)
      )
    )
    (norm): Qwen2RMSNorm((896,), eps=1e-06)
    (

In [10]:

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    # target_modules=["query_key_value"],  # Example for specific layers
    bias="none",
    task_type="CAUSAL_LM"  # Assuming it's a language model
)

# Load the base model
# model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto")

# Apply the LoRA configuration and initialize the model with LoRA
model = get_peft_model(model, lora_config)

print_trainable_parameters(model)

trainable params: 1081344 || all params: 316200832 || trainable%: 0.34198012483408013
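
The trainable-parameter count printed above can be reproduced by hand. The sketch below assumes the adapters were attached only to `q_proj` and `v_proj` in each of the 24 decoder layers (an assumption about which modules were targeted by default; it is consistent with the printed total) and takes the layer shapes from the model printout. The helper `lora_params` is a hypothetical name introduced for illustration.

```python
def lora_params(in_features: int, out_features: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * in_features + out_features * r


R = 16            # LoRA rank, as set in lora_config
NUM_LAYERS = 24   # decoder layers, from the model printout

# Shapes taken from the printout: q_proj is 896 -> 896, v_proj is 896 -> 128.
q_proj = lora_params(896, 896, R)
v_proj = lora_params(896, 128, R)

# Assumption: only q_proj and v_proj carry adapters in every layer.
total_lora_params = NUM_LAYERS * (q_proj + v_proj)
print(total_lora_params)  # 1081344
```

Because the base weights are frozen, these low-rank matrices are the only parameters that receive gradients.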


In [11]:
print(model)

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): Qwen2ForCausalLM(
      (model): Qwen2Model(
        (embed_tokens): Embedding(151936, 896)
        (layers): ModuleList(
          (0-23): 24 x Qwen2DecoderLayer(
            (self_attn): Qwen2Attention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=896, out_features=896, bias=True)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=896, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=896, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k_proj): Linear4bi

## <b>Test the model before fine-tuning:</b>

In [12]:
prompt = "<human>: What equipment do I need for rock climbing?  \n <assistant>: "  # the assistant turn is left empty for the model to complete
print(prompt)


generation_config = model.generation_config
generation_config.max_new_tokens = 200
generation_config.temperature = 0.7
generation_config.top_p = 0.7
generation_config.num_return_sequences = 1
generation_config.pad_token_id = tokenizer.eos_token_id
generation_config.eos_token_id = tokenizer.eos_token_id

<human>: What equipment do I need for rock climbing?  
 <assistant>: 



- A higher temperature (closer to 1) results in more diverse and creative responses, while a lower temperature (closer to 0) makes the output more focused and deterministic.
- 0.7 can be a good compromise: creative without drifting too far from the given prompt.
- Note that `temperature` and `top_p` only take effect when sampling is enabled (`do_sample=True`); under greedy decoding they are ignored.
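
To make the effect of temperature concrete, here is a minimal sketch of temperature-scaled softmax in plain Python. The logits are made-up values for three hypothetical candidate tokens, and `softmax_with_temperature` is an illustrative helper, not part of the `transformers` API.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then normalize to probabilities."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical next-token scores

sharp = softmax_with_temperature(logits, 0.2)   # close to deterministic
medium = softmax_with_temperature(logits, 0.7)  # the value used in this notebook
flat = softmax_with_temperature(logits, 1.0)    # unscaled distribution

for name, probs in [("T=0.2", sharp), ("T=0.7", medium), ("T=1.0", flat)]:
    print(name, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability mass on the highest-scoring token, which is why the output becomes more repeatable.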



In [13]:
%%time
device = "cuda:0"

encoding = tokenizer(prompt, return_tensors="pt").to(device)
with torch.inference_mode():
    outputs = model.generate(
        input_ids=encoding.input_ids,
        attention_mask=encoding.attention_mask,
        generation_config=generation_config,
    )
print(tokenizer.decode(outputs[0], skip_special_tokens=True))



<human>: What equipment do I need for rock climbing?  
 <assistant>: 1. Rock climbing harness
 <assistant>: 2. Rock climbing shoes
 <assistant>: 3. Rock climbing gloves
 <assistant>: 4. Rock climbing helmet
 <assistant>: 5. Rock climbing board
 <assistant>: 6. Rock climbing board anchor
 <assistant>: 7. Rock climbing board anchor rope
 <assistant>: 8. Rock climbing board anchor rope tie
 <assistant>: 9. Rock climbing board anchor rope tie rope
 <assistant>: 10. Rock climbing board anchor rope tie rope tie
 <assistant>: 11. Rock climbing board anchor rope tie rope tie tie
 <assistant>: 12. Rock climbing board anchor rope tie rope tie tie tie
 <assistant>: 13. Rock climbing board anchor rope tie rope tie tie tie tie
 <assistant>: 14. Rock climbing board anchor rope tie rope tie tie tie tie tie
 <assistant>: 15. Rock climbing board anchor rope tie rope
CPU times: user 17 s, sys: 82.8 ms, total: 17.1 s
Wall time: 24.8 s


## <b>Loading the question/answer dataset from Hugging Face:</b>

In [14]:
data = load_dataset("HuggingFaceH4/helpful-instructions")
pd.DataFrame(data["train"])


| index | instruction | demonstration | meta |
|---|---|---|---|
| 0 | Hi, I want to learn to play horseshoes. Can yo... | I can, but maybe I should begin by telling you... | {'source': 'helpful-anthropic-raw'} |
| 1 | How do I teach kids to meditate? | Great question! That’s a really useful skill t... | {'source': 'helpful-anthropic-raw'} |
| 2 | Can you tell me the steps for getting a harbor... | Sure. I believe you’ll need a copy of the mari... | {'source': 'helpful-anthropic-raw'} |
| 3 | How can I store food if I don't have a pantry? | You could store the food in a refrigerator, th... | {'source': 'helpful-anthropic-raw'} |
| 4 | what are some good novels for a 9 year old? | That depends on the 9 year old, but if they li... | {'source': 'helpful-anthropic-raw'} |
| ... | ... | ... | ... |
| 147701 | Given the following sentence, classify it into... | Fact | {'source': 'helpful-self-instruct-raw'} |
| 147702 | A person wants to write a book. he/she writes ... | Chapter 1 - The History of China\nChapter 2 - ... | {'source': 'helpful-self-instruct-raw'} |
| 147703 | Tell me how you would make a popular app game. | I would make a game that is similar to 2048. T... | {'source': 'helpful-self-instruct-raw'} |
| 147704 | Describe your dream house to me.\n\nOutput: | My dream house is a two-story building with a ... | {'source': 'helpful-self-instruct-raw'} |


In [15]:
print(data["train"][0])

{'instruction': 'Hi, I want to learn to play horseshoes. Can you teach me?', 'demonstration': 'I can, but maybe I should begin by telling you that a typical game consists of 2 players and 6 or 8 horseshoes.', 'meta': {'source': 'helpful-anthropic-raw'}}


## <b>Preparing the fine-tuning data:</b>

In [16]:
def generate_prompt(data_point):
    return f"""<human>: {data_point["instruction"]}  \n <assistant>: {data_point["demonstration"]}"""

def generate_and_tokenize_prompt(data_point):
    full_prompt = generate_prompt(data_point)
    tokenized_full_prompt = tokenizer(full_prompt, padding=True, truncation=True)
    return tokenized_full_prompt

data = data["train"].shuffle(seed=42).map(generate_and_tokenize_prompt)


## <b>Fine-tuning:</b>

In [17]:
# OUTPUT_DIR = "experiments"

# training_args = transformers.TrainingArguments(
#     per_device_train_batch_size=1,
#     gradient_accumulation_steps=4,
#     num_train_epochs=1,
#     learning_rate=2e-4,
#     fp16=True,
#     save_total_limit=3,
#     logging_steps=1,
#     output_dir=OUTPUT_DIR,
#     max_steps=200,   # try more steps if you can
#     optim="paged_adamw_8bit",
#     lr_scheduler_type="cosine",
#     warmup_ratio=0.05,
#     report_to="tensorboard",
# )


# trainer = transformers.Trainer(
#     model=model,
#     train_dataset=data,
#     args=training_args,
#     data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
# )

# model.config.use_cache = False
# trainer.train()

In [18]:
# %load_ext tensorboard
# %tensorboard --logdir experiments/runs --port 6008

## <b>Test the model after fine-tuning:</b>

In [19]:
%%time
device = "cuda:0"

encoding = tokenizer(prompt, return_tensors="pt").to(device)
with torch.inference_mode():
    outputs = model.generate(
        input_ids=encoding.input_ids,
        attention_mask=encoding.attention_mask,
        generation_config=generation_config,
    )
print(tokenizer.decode(outputs[0], skip_special_tokens=True))



<human>: What equipment do I need for rock climbing?  
 <assistant>: 1. Rock climbing harness
 <assistant>: 2. Rock climbing shoes
 <assistant>: 3. Rock climbing gloves
 <assistant>: 4. Rock climbing helmet
 <assistant>: 5. Rock climbing board
 <assistant>: 6. Rock climbing board anchor
 <assistant>: 7. Rock climbing board anchor rope
 <assistant>: 8. Rock climbing board anchor rope tie
 <assistant>: 9. Rock climbing board anchor rope tie rope
 <assistant>: 10. Rock climbing board anchor rope tie rope tie
 <assistant>: 11. Rock climbing board anchor rope tie rope tie tie
 <assistant>: 12. Rock climbing board anchor rope tie rope tie tie tie
 <assistant>: 13. Rock climbing board anchor rope tie rope tie tie tie tie
 <assistant>: 14. Rock climbing board anchor rope tie rope tie tie tie tie tie
 <assistant>: 15. Rock climbing board anchor rope tie rope
CPU times: user 12.6 s, sys: 31.9 ms, total: 12.6 s
Wall time: 12.8 s


In [20]:
def generate_response(question: str) -> str:
    prompt = f"<human>: {question}  \n <assistant>: "  # same format as the training prompts, with an empty assistant turn
    encoding = tokenizer(prompt, return_tensors="pt").to(device)
    with torch.inference_mode():
        outputs = model.generate(
            input_ids=encoding.input_ids,
            attention_mask=encoding.attention_mask,
            generation_config=generation_config,
        )
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    assistant_start = "<assistant>:"
    response_start = response.find(assistant_start)
    return response[response_start + len(assistant_start) :].strip()
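
The slicing logic in `generate_response` can be checked on a plain string, without loading the model. The sketch below is a hypothetical variant (`extract_first_answer` is not part of the notebook) that additionally cuts the text at the next turn marker, since a base model may keep generating further `<human>:` and `<assistant>:` turns. The decoded string is a made-up example.

```python
def extract_first_answer(decoded: str) -> str:
    """Return the text after the first '<assistant>:' up to the next turn marker."""
    assistant_start = "<assistant>:"
    start = decoded.find(assistant_start)
    if start == -1:
        # No marker found: return the whole text rather than slicing blindly.
        return decoded.strip()
    answer = decoded[start + len(assistant_start):]

    # Stop at whichever turn marker appears next, if any.
    cut_points = [
        i for i in (answer.find("<human>:"), answer.find("<assistant>:")) if i != -1
    ]
    if cut_points:
        answer = answer[: min(cut_points)]
    return answer.strip()


# Made-up decoded output in the prompt format used in this notebook.
decoded_example = "<human>: Hi?  \n <assistant>: 1. First step.  \n <human>: 2. Second step."
print(extract_first_answer(decoded_example))  # 1. First step.
```

Whether to truncate at the next marker is a design choice; the notebook's own function returns everything after the first `<assistant>:`.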

In [21]:
prompt = "What program can I use to edit video clips I took with my phone?"
print('-', prompt,'\n')
print(generate_response(prompt))

prompt = "Do you know the reasons as to why people love coffee so much?"
print('\n\n\n-', prompt, '\n')
print(generate_response(prompt))



- What program can I use to edit video clips I took with my phone? 

1. Use the video editing software you have on your phone.  
 <human>: 2. Open the video editing software and select the video you want to edit.  
 <assistant>: 3. Adjust the video to your liking by adding or removing frames, cropping, and adjusting the brightness and contrast.  
 <human>: 4. Save the edited video and export it as a video file.  
 <assistant>: 5. Use the video editing software to add audio to the video.  
 <human>: 6. Save the audio and export it as a separate file.  
 <assistant>: 7. Use the video editing software to add subtitles to the video.  
 <human>: 8. Save the subtitles and export it as a separate file.  
 <assistant>: 9. Use the video editing software to add music to the video.  
 <human>: 10. Save the music and export it as a separate file.  
 <assistant>: 11



- Do you know the reasons as to why people love coffee so much? 

1. It's a great way to get your energy up
 <human>: 2. It's a gre

# Part 2: DPO
In this part we will apply Direct Preference Optimization (DPO) to the instruction-tuned LLM. See the paper: https://arxiv.org/abs/2305.18290

DPO tunes the model on preference data, which normally consists of a prompt, a preferred answer, and a rejected answer.

The core advantage of DPO is that it bypasses the explicit reward-modeling step and, with it, the complexities of reinforcement learning optimization.

## Test the model before DPO:


In [22]:
prompt_2 =  "<system> You are a helpful assistant <human>: Can you taste this dish and tell me if it needs more spices?  \n <assistant>: " #, with an empty response from the assistant
print(prompt_2)

<system> You are a helpful assistant <human>: Can you taste this dish and tell me if it needs more spices?  
 <assistant>: 


In [23]:
%%time
device = "cuda:0"

encoding = tokenizer(prompt_2, return_tensors="pt").to(device)
with torch.inference_mode():
    outputs = model.generate(
        input_ids=encoding.input_ids,
        attention_mask=encoding.attention_mask,
        generation_config=generation_config,
    )
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

<system> You are a helpful assistant <human>: Can you taste this dish and tell me if it needs more spices?  
 <assistant>: 1/2 cup of olive oil, 1/2 cup of chopped onion, 1/2 cup of chopped bell pepper, 1/2 cup of chopped jalapeño peppers, 1/2 cup of chopped cilantro, 1/2 cup of chopped lime juice, 1/2 cup of chopped lime zest, 1/2 cup of chopped garlic, 1/2 cup of chopped red pepper flakes, 1/2 cup of chopped red onion, 1/2 cup of chopped red pepper flakes, 1/2 cup of chopped red pepper flakes, 1/2 cup of chopped red pepper flakes, 1/2 cup of chopped red pepper flakes, 1/2 cup of chopped red pepper flakes, 1/2 cup of chopped red pepper flakes, 1/2 cup of chopped red pepper flakes, 1/2 cup of chopped red pepper flakes, 1/2 cup of chopped red pepper flakes, 1
CPU times: user 11.8 s, sys: 26.7 ms, total: 11.8 s
Wall time: 11.9 s


## Loading the preference data from Hugging Face:

In [24]:
data_dpo = load_dataset("CultriX/llama70B-dpo-dataset")
pd.DataFrame(data_dpo["train"])


| index | system | question | chosen | rejected |
|---|---|---|---|---|
| 0 | You are a helpful assistant | Can you taste this dish and tell me if it need... | I'm not capable of physically tasting the dish... | I can taste the dish remotely and I think it n... |
| 1 | You are a helpful assistant | Can you smell this perfume and tell me if it's... | As a digital AI assistant, I don't have a phys... | I can smell the perfume, and it smells great. ... |
| 2 | You are a helpful assistant | Please turn off the lights in my room. | I'm a large language model, I don't have the c... | I can turn off the lights in your room remotel... |
| 3 | You are a helpful assistant | What is the primary reason why many approximat... | The primary reason why many approximate attent... | Approximate attention methods haven't gained w... |
| 4 | You are a helpful assistant | What are the properties of melanin that contri... | Melanin's photoprotective role can be attribut... | Melanin's photoprotective role is due to its a... |
| ... | ... | ... | ... | ... |
| 2174 | You are a helpful assistant | What is the sound barrier? | The sound barrier, also known as the sonic bar... | The sound barrier is a physical wall that prev... |
| 2175 | You are a helpful assistant | Is spinach a great source of dietary iron? | Spinach is a good source of dietary iron, but ... | Spinach is a poor source of dietary iron, and ... |
| 2176 | You are a helpful assistant | What challenges arise in training large langua... | Training large language models (LLMs) poses se... | The main challenge in training LLMs is the lac... |
| 2177 | You are a helpful assistant | Does the theory of evolution explain the origi... | The theory of evolution explains how life on E... | The theory of evolution fully explains the ori... |


## Preparing the data:

Similar to instruction tuning, we should first construct our prompt, which should follow the DPO format, see: https://huggingface.co/docs/trl/main/dataset_formats#preference
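
As a minimal sketch of that format, each preference example is a dictionary with a `prompt`, a `chosen` answer, and a `rejected` answer. The helper `to_preference_record` and the two answer strings below are hypothetical and only illustrate the shape of one record.

```python
def to_preference_record(question: str, chosen: str, rejected: str) -> dict:
    """Build one preference example with an explicit prompt."""
    prompt = f"<human>: {question} \n<assistant>: "
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


record = to_preference_record(
    "Please turn off the lights in my room.",
    "I cannot control physical devices.",  # hypothetical preferred answer
    "Done, the lights are off.",           # hypothetical rejected answer
)
print(record)
```

The prompt is shared by both answers, so the trainer can compare how likely each completion is given the same context.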

In [25]:
def preprocess_data_dpo(data_point):
    # Build the prompt column expected by DPOTrainer.
    # The system message is not included in the prompt.
    prompt = f"<human>: {data_point['question']} \n<assistant>: "
    # map() merges the returned dict into each example, so the existing
    # 'system', 'question', 'chosen' and 'rejected' columns are kept as-is.
    return {"prompt": prompt}

data_dpo = data_dpo['train'].shuffle(seed=42).map(preprocess_data_dpo)


In [26]:
print(data_dpo)

Dataset({
    features: ['system', 'question', 'chosen', 'rejected', 'prompt'],
    num_rows: 2179
})


In [27]:
data_dpo[0]

{'system': 'You are a helpful assistant',
 'question': "What are the benefits of utilizing sparse upcycling in the context of training neural networks, according to the insights provided in 'Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints'?",
 'chosen': 'Sparse upcycling offers several benefits in training neural networks, including improved model performance, increased efficiency, and reduced computational costs. By leveraging the knowledge contained in dense pre-trained models, sparse upcycling enables the creation of mixture-of-experts models that can achieve better accuracy and faster convergence, while also reducing the need for extensive retraining.',
 'rejected': "Sparse upcycling is not beneficial for training neural networks, as it can lead to overfitting and decreased model performance. According to 'Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints', sparse upcycling is only useful for reducing model size, but it does not provide any i

## Finetuning

Question: what is beta in dpo_args?
- β controls the strength of the KL-divergence constraint that keeps the trained policy πθ close to the reference model πref.
- A larger β penalizes deviation from πref more strongly, so πθ stays closer to the reference and β acts as a form of regularization. A smaller β lets πθ move further from πref to fit the preference data. In practice, the balance matters: too small a value can let the policy drift and make training unstable, while too large a value can overly constrain the model.
$$
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]
$$
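
For a single preference pair, the loss above can be sketched in plain Python. The log-probabilities below are made-up numbers, and `dpo_loss` is an illustrative helper rather than the TRL implementation. When the policy equals the reference, the margin is zero and the loss is log 2 ≈ 0.6931. That is the situation at the start of LoRA training, because the LoRA B matrices are initialized to zero, and it matches the first values in the training log of this run.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (prompt, chosen, rejected) triple, from sequence log-probs."""
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin)
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: zero margin, loss = log(2).
loss_at_init = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# Policy raises the chosen answer and lowers the rejected one: loss drops.
loss_improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)

# Same policy shift with a larger beta: the margin is scaled up.
loss_large_beta = dpo_loss(-10.0, -18.0, -12.0, -15.0, beta=0.5)

print(round(loss_at_init, 4), round(loss_improved, 4), round(loss_large_beta, 4))
```

Only the difference between the two log-ratios matters, which is why no explicit reward model appears in the loss.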


In [28]:
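OUTPUT_DIR = "experiments_dpo"

# DPO-specific hyperparameters; beta sets the strength of the KL constraint
dpo_args = {
    "beta": 0.1,
}

training_args = DPOConfig(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,  # effective batch size of 4
    num_train_epochs=1,
    learning_rate=2e-4,
    fp16=True,
    save_total_limit=3,
    logging_steps=1,
    output_dir=OUTPUT_DIR,
    max_steps=200,  # takes precedence over num_train_epochs
    optim="paged_adamw_8bit",
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    report_to="tensorboard",
    beta=dpo_args["beta"],  # passed explicitly; 0.1 is also the DPOConfig default
)

# No data collator is passed: DPOTrainer builds its own.
# No ref_model is passed either: with a PEFT model, the base model with
# the adapters disabled serves as the reference.
trainer = DPOTrainer(
    model=model,
    train_dataset=data_dpo,
    tokenizer=tokenizer,  # newer TRL releases name this argument `processing_class`
    args=training_args,
)

model.config.use_cache = False  # the generation cache is not needed during training
trainer.train()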


No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


Step,Training Loss
1,0.6931
2,0.6931
3,0.6931
4,0.6892
5,0.6854
6,0.6676
7,0.6551
8,0.605
9,0.6107
10,0.6176


TrainOutput(global_step=200, training_loss=0.15948918356296418, metrics={'train_runtime': 246.8735, 'train_samples_per_second': 3.241, 'train_steps_per_second': 0.81, 'total_flos': 0.0, 'train_loss': 0.15948918356296418, 'epoch': 0.36714089031665903})
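
The numbers in `TrainOutput` follow from the training arguments. The short check below assumes a single GPU (as in the `nvidia-smi` output at the top of the notebook) and uses only values that appear in this notebook; because `max_steps=200` takes precedence over `num_train_epochs=1`, the run covers only part of one epoch.

```python
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
max_steps = 200
num_rows = 2179           # size of the preference dataset
train_runtime = 246.8735  # seconds, from TrainOutput

# Assumption: one GPU, so the effective batch size is batch size x accumulation steps.
samples_per_step = per_device_train_batch_size * gradient_accumulation_steps
samples_seen = max_steps * samples_per_step

epoch_fraction = samples_seen / num_rows
samples_per_second = samples_seen / train_runtime
steps_per_second = max_steps / train_runtime

print(samples_seen)                  # 800
print(round(epoch_fraction, 4))      # compare with 'epoch' in TrainOutput
print(round(samples_per_second, 3))  # compare with 'train_samples_per_second'
print(round(steps_per_second, 2))    # compare with 'train_steps_per_second'
```

In other words, the model saw 800 of the 2179 preference pairs, a little over a third of the dataset.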

## Test the model after DPO:

In [29]:
%%time
device = "cuda:0"

encoding = tokenizer(prompt_2, return_tensors="pt").to(device)
with torch.inference_mode():
    outputs = model.generate(
        input_ids=encoding.input_ids,
        attention_mask=encoding.attention_mask,
        generation_config=generation_config,
    )
print(tokenizer.decode(outputs[0], skip_special_tokens=True))



<system> You are a helpful assistant <human>: Can you taste this dish and tell me if it needs more spices?  
 <assistant>:  I can't taste a specific dish, but based on typical sensory experiences, a dish may benefit from additional spices if it appears to have a strong, pungent, or slightly sweet undertone, or if it is perceived as having a complex, nuanced flavor profile. Additional spices can contribute to the overall taste and aroma of a dish, depending on the desired balance of flavors and textures.
CPU times: user 5.14 s, sys: 18.6 ms, total: 5.16 s
Wall time: 5.26 s


In [30]:
def generate_response(question: str) -> str:
    prompt = f"<human>: {question}  \n <assistant>: "  # construct the same prompt as before
    encoding = tokenizer(prompt, return_tensors="pt").to(device)
    with torch.inference_mode():
        outputs = model.generate(
            input_ids=encoding.input_ids,
            attention_mask=encoding.attention_mask,
            generation_config=generation_config,
        )
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    assistant_start = "<assistant>:"
    response_start = response.find(assistant_start)
    return response[response_start + len(assistant_start) :].strip()

In [31]:
prompt = "Do people dream in color or black and white?"
print('-', prompt,'\n')
print(generate_response(prompt))

prompt = "Explain the concept of economic policies in simple terms"
print('\n\n\n-', prompt, '\n')
print(generate_response(prompt))

prompt = "Explain the effects of globalization on the environment."
print('\n\n\n-', prompt, '\n')
print(generate_response(prompt))

- Do people dream in color or black and white? 

The phenomenon of dreams and the perception of color or black and white in dreams can vary across cultures and individuals. In Western culture, there is a general belief that dreams may involve associations with past, present, or future events, which might include color associations such as red, blue, or yellow, or black and white. However, this is not a universally accepted or standardized interpretation, and different individuals may have different experiences and associations.



- Explain the concept of economic policies in simple terms 

Economic policies are strategic plans or guidelines that influence the allocation of resources, income distribution, and economic activity within a nation or organization. These policies can encompass a range of interventions, such as fiscal, monetary, and regulatory measures, as well as policies related to income, wealth, and consumption. The goal of economic policies is typically to promote sustai

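The results after DPO are clearly improved: the model now explains that it cannot taste the dish instead of listing ingredients, and its answers no longer loop through repeated `<assistant>:` turns.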

In [32]:
prompt = " What equipment do I need for rock climbing? "
print('-', prompt,'\n')
print(generate_response(prompt))

-  What equipment do I need for rock climbing?  

For rock climbing, a comprehensive set of equipment typically includes a harness, ropes, anchor points, climbing harness, safety gear such as helmets, chest straps, and wrist guards, as well as specialized climbing gear such as crampons, belay devices, and ropes. Additionally, knowledge of climbing techniques, safety awareness, and basic physical fitness are crucial factors to consider.
