**This notebook demonstrates a bug in DPOTrainer (?), where 'chosen' and 'rejected' columns need to be *reversed* in order for training loss to decrease, not increase.**

It takes ~6 min to run this notebook on a Google Colab A100.

The code uses Hugging Face Transformers and Unsloth for LoRA DPO tuning, in the style of Hugging Face's [Alignment Handbook](https://github.com/huggingface/alignment-handbook) for [Zephyr](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta).

The DPO dataset ([gpt4o-arena-brevity-dpo](https://huggingface.co/datasets/ZSvedic/gpt4o-arena-brevity-dpo)) trains a model to produce shorter responses. The 'rejected' column has answers of normal length, while 'chosen' has short answers.
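A quick sanity check of that claim is to compare the mean character length of the two columns. The sketch below works on any iterable of rows with 'chosen' and 'rejected' string fields. The helper name `mean_lengths` and the toy rows are illustrative assumptions, not part of the dataset or of any library.

```python
from statistics import mean

def mean_lengths(rows):
    """Return (mean chosen length, mean rejected length) in characters."""
    rows = list(rows)
    chosen = mean(len(r["chosen"]) for r in rows)
    rejected = mean(len(r["rejected"]) for r in rows)
    return chosen, rejected

# Toy rows standing in for the real dataset (hypothetical data).
toy_rows = [
    {"prompt": "2+3?", "chosen": "5", "rejected": "The sum of 2 and 3 is 5."},
    {"prompt": "Capital of France?", "chosen": "Paris",
     "rejected": "The capital of France is Paris."},
]

chosen_len, rejected_len = mean_lengths(toy_rows)
print(f"chosen: {chosen_len:.1f} chars, rejected: {rejected_len:.1f} chars")
```

Once the dataset is loaded in the cells below, calling `mean_lengths(hf_datasets["train"])` on the raw string columns should show whether 'chosen' is in fact the shorter side.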

In [None]:
# model_name = "unsloth/zephyr-sft-bnb-4bit"
model_name = "Qwen/Qwen2-0.5B-Instruct"
dataset_name = "ZSvedic/gpt4o-arena-brevity-dpo"
train_size = 400

### Installs and data preparation



In [None]:
%%capture
!pip install unsloth
# Also get the latest nightly Unsloth!
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git

In [None]:
from datasets import DatasetDict, concatenate_datasets, load_dataset, load_from_disk
from datasets.builder import DatasetGenerationError

hf_datasets = load_dataset(dataset_name)

# Sorting is problematic, don't use it:
# To improve DPO and make it faster, we will use shorter examples first.
# hf_datasets["train"] = hf_datasets["train"].map(
#     lambda ex: dict(ex, len_rejected=len(ex["rejected"]))
# ).sort("len_rejected").select(range(train_size)) #.sort("len_rejected", reverse=True).remove_columns("len_rejected")
hf_datasets["train"] = hf_datasets["train"].select(range(train_size))
hf_datasets["test"] = hf_datasets["test"].select(range(50))

print(hf_datasets["train"][0])
print(hf_datasets["train"][train_size-1])

Generating train split:   0%|          | 0/22941 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/2549 [00:00<?, ? examples/s]

{'question-id': '1dd6137eb3c3470989e18ab729ccc0b3', 'prompt': 'write short telugu poem', 'chosen': 'Telugu poem: ఆకాశం నీలమై పూగుతోంది, సూర్యుడు కొత్త కిరణాలు తెచ్చుకొన్నాడు.', 'rejected': 'పల్లవి:\nఆకాశం నీలమై పూగుతోంది సితారలు,  \nప్రభాతమై పూసి ప్రభా విరబూయెను.\n\nచరణం 1:\nకొత్తగా కిరణాలుతెచ్చుకునిన సూర్యుడు,  \nపడి పచ్చని మాటిదా చెరగడు.\n\nచరణం 2:\nనాటి ఊసుల కబుర్లు చెప్తుంది సితాం,  \nచెల్లెడిక లేచె గతకాలపు కలయా.\n\nచరణం 3:\nపలకరింత కలిగే చల్లని గాలులు,  \nఎంత చిన్న అడుగుల భవిష్యత్తుకై చే.\n\nముగింపు:\nసందేహాలు వీడె మన గుండెలను,  \nఅర్థం చేసుకునె జ్ఞానవంతుల ఒక స్వరం.'}
{'question-id': '1a9efcf00007470aae9e068beae2b32b', 'prompt': 'Improve this text about an overview of a course:\nThis course has been designed to meet the needs of professionals who provide emergency response services within the mining sectors. RTO A will develop a bespoke Certificate IV to support Company B in upskilling their ESO’s and support them in achieving a nationally recognised qualification in Health Care.\

### Tokenization

In [None]:
from transformers import AutoTokenizer
from typing import Literal

tokenizer = AutoTokenizer.from_pretrained(model_name)

def make_conv(example):
    return {
        "prompt": [{"role": "user", "content": example["prompt"]}],
        "chosen": [{"role": "assistant", "content": example["chosen"]}],
        "rejected": [{"role": "assistant", "content": example["rejected"]}],
    }

raw_datasets = hf_datasets.map(make_conv)

# def zels_apply_chat_template(example, tokenizer, assistant_prefix="<|assistant|>\n"):
#   prompt_messages = [{"role": "system", "content": ""}] # Empty system message.
#   prompt_messages.append({"role": "user", "content": example["prompt"]})
#   example["text_prompt"] = tokenizer.apply_chat_template(
#     prompt_messages, tokenize=False, add_generation_prompt=True)
#   example["text_chosen"] = example["chosen"] + "</s>\n"
#   example["text_rejected"] = example["rejected"] + "</s>\n"
#   return example

# column_names = list(hf_datasets["train"].features)

# raw_datasets = hf_datasets.map(
#     zels_apply_chat_template,
#     fn_kwargs = {"tokenizer": tokenizer},
#     num_proc = 12,
#     remove_columns = column_names,
#     desc = "Formatting comparisons with prompt template",
# )

### ISSUE: In order for DPO to work, 'chosen' and 'rejected' need to be flipped!?

In [None]:
# for split in ["train", "test"]:
#     raw_datasets[split] = raw_datasets[split].rename_columns(
#         {"text_prompt": "prompt", "text_chosen": "rejected", "text_rejected": "chosen"} # WORKS!
#         # {"text_prompt": "prompt", "text_chosen": "chosen", "text_rejected": "rejected"} # DOESN'T WORK!
#     )

print(raw_datasets)
row = raw_datasets["train"][0]
print(f"\nPROMPT:\n{row['prompt']}\nCHOSEN:\n{row['chosen']}\nREJECTED:\n{row['rejected']}")

DatasetDict({
    train: Dataset({
        features: ['question-id', 'prompt', 'chosen', 'rejected'],
        num_rows: 400
    })
    test: Dataset({
        features: ['question-id', 'prompt', 'chosen', 'rejected'],
        num_rows: 50
    })
})

PROMPT:
[{'content': 'write short telugu poem', 'role': 'user'}]
CHOSEN:
[{'content': 'Telugu poem: ఆకాశం నీలమై పూగుతోంది, సూర్యుడు కొత్త కిరణాలు తెచ్చుకొన్నాడు.', 'role': 'assistant'}]
REJECTED:
[{'content': 'పల్లవి:\nఆకాశం నీలమై పూగుతోంది సితారలు,  \nప్రభాతమై పూసి ప్రభా విరబూయెను.\n\nచరణం 1:\nకొత్తగా కిరణాలుతెచ్చుకునిన సూర్యుడు,  \nపడి పచ్చని మాటిదా చెరగడు.\n\nచరణం 2:\nనాటి ఊసుల కబుర్లు చెప్తుంది సితాం,  \nచెల్లెడిక లేచె గతకాలపు కలయా.\n\nచరణం 3:\nపలకరింత కలిగే చల్లని గాలులు,  \nఎంత చిన్న అడుగుల భవిష్యత్తుకై చే.\n\nముగింపు:\nసందేహాలు వీడె మన గుండెలను,  \nఅర్థం చేసుకునె జ్ఞానవంతుల ఒక స్వరం.', 'role': 'assistant'}]
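
The swap in the commented-out cell above can also be written as a plain per-row function, which makes it easy to apply in one direction and to check that applying it twice restores the original row. This is a minimal sketch: `swap_preference` is a hypothetical helper name and the row is made-up data in the same conversational format as `make_conv` produces.

```python
def swap_preference(example):
    """Return a copy of the row with 'chosen' and 'rejected' exchanged."""
    swapped = dict(example)
    swapped["chosen"], swapped["rejected"] = example["rejected"], example["chosen"]
    return swapped

# Hypothetical row: the short answer is 'chosen', the long one is 'rejected'.
row = {
    "prompt": [{"role": "user", "content": "What is the capital of France?"}],
    "chosen": [{"role": "assistant", "content": "Paris."}],
    "rejected": [{"role": "assistant", "content": "The capital of France is Paris."}],
}

flipped = swap_preference(row)
print(flipped["chosen"][0]["content"])  # The capital of France is Paris.
```

Because the function returns a full row dict, it should be usable as `raw_datasets.map(swap_preference)` to test the reversed orientation without renaming columns by hand.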


### GPU Part: Unsloth, PEFT, LoRA

In [None]:
from unsloth import PatchDPOTrainer, FastLanguageModel
import torch

PatchDPOTrainer() # One must patch the DPO Trainer first!

max_seq_length = 4096 # Choose the maximum sequence length of the model, based on the context length of your data
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # QLoRA: needed to save memory

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_name, ## Base Model name here
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

# We now add LoRA adapters so we only need to update 1 to 10% of all parameters.
model = FastLanguageModel.get_peft_model(
    model,
    r = 64, # LoRA Rank. Suggested: 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 64,
    lora_dropout = 0, # Must be = 0
    bias = "none",    # Must be = "none"
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.1.6: Fast Qwen2 patching. Transformers: 4.47.1.
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.564 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 8.0. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Unsloth 2025.1.6 patched 24 layers with 24 QKV layers, 24 O layers and 24 MLP layers.


### In order to track training progress, we use helper methods to check average length of answers to 20 test questions. With DPO, both loss and average length should decrease during training.

In [None]:
def calc_model_avg_len(tokenizer, model):
  ''' Tests verbosity of a model '''

  questions = [
      "How much is 2+3?",
      "What is the color of the sky?",
      "What is the capital of France?",
      "What is the boiling point of water?",
      "Who wrote 'To Kill a Mockingbird'?",
      "What is the largest planet in our solar system?",
      "What is the chemical symbol for gold?",
      "How many continents are there?",
      "What is the speed of light?",
      "Who painted the Mona Lisa?",
      "What is the smallest prime number?",
      "What is the main ingredient in guacamole?",
      "What is the square root of 64?",
      "What is the currency of Japan?",
      "Who discovered penicillin?",
      "What is the tallest mountain in the world?",
      "What is the primary language spoken in Brazil?",
      "What is the freezing point of water?",
      "What is the largest mammal?",
      "What is the capital of Japan?"
  ]

  messages = [ [{"role": "user", "content": q}] for q in questions]

  prompts = [
      tokenizer.apply_chat_template(m, add_generation_prompt=True, tokenize=False)
      for m in messages ]

  inputs = tokenizer(prompts, return_tensors='pt', padding=True).to('cuda')
  inputs_tok_len = inputs["input_ids"].shape[1]
  results = model.generate(**inputs, max_new_tokens = 200, use_cache = True)
  sequences = tokenizer.batch_decode(results[:, inputs_tok_len:], skip_special_tokens=True)

  averages = []
  for answer in sequences:
      # print(f"---------------\n{answer}")
      averages.append(len(answer))

  total_average = sum(averages)/len(averages)
  print(f"Average answer length: {total_average:.2f} characters")

  return total_average

def len_metrics(pred=None):
  ''' Used to test verbosity in DPOTrainer '''
  FastLanguageModel.for_inference(model)
  avg_len = calc_model_avg_len(tokenizer, model)
  FastLanguageModel.for_training(model)
  return {"avg_len": avg_len}

# Test the model verbosity before fine-tuning.
avg_len_before = len_metrics()["avg_len"]

Average answer length: 300.90 characters


### Create DPOTrainer and start training

In [None]:
from transformers import TrainingArguments
from trl import DPOTrainer, DPOConfig
from unsloth import is_bfloat16_supported

dpo_trainer = DPOTrainer(
    model = model,
    ref_model = None,
    args = DPOConfig(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_ratio = 0.1,
        num_train_epochs = 1, # ZEL: Changed from 3
        learning_rate = 2.5e-5, # ZEL: Changed from 5e-5
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.0,
        lr_scheduler_type = "linear",
        seed = 42,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
        eval_strategy = "steps",
        eval_steps = 5,
        per_device_eval_batch_size = 8,
    ),
    beta = 0.1,
    train_dataset = raw_datasets["train"],
    eval_dataset = raw_datasets["test"],
    tokenizer = tokenizer,
    max_length = 1024,
    max_prompt_length = 512,
    compute_metrics=len_metrics,
)

# Min at step 50: loss of 0.000200, average character length of 120.70
dpo_trainer.train()

Extracting prompt from train dataset:   0%|          | 0/400 [00:00<?, ? examples/s]

Applying chat template to train dataset:   0%|          | 0/400 [00:00<?, ? examples/s]

Extracting prompt from eval dataset:   0%|          | 0/50 [00:00<?, ? examples/s]

Applying chat template to eval dataset:   0%|          | 0/50 [00:00<?, ? examples/s]

Tokenizing train dataset:   0%|          | 0/400 [00:00<?, ? examples/s]

Tokenizing eval dataset:   0%|          | 0/50 [00:00<?, ? examples/s]

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 400 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 50
 "-____-"     Number of trainable parameters = 35,192,832


| Step | Training Loss | Validation Loss | rewards / chosen | rewards / rejected | rewards / accuracies | rewards / margins | logps / rejected | logps / chosen | logits / rejected | logits / chosen |
|---|---|---|---|---|---|---|---|---|---|---|
| 5 | 0.6421 | 0.5896 | 0.099999 | -0.11675 | 0.982143 | 0.216749 | -222.713028 | -65.556145 | -2.750503 | -2.695106 |
| 10 | 0.2035 | 0.265386 | 0.766594 | -0.50386 | 1.0 | 1.270454 | -226.584137 | -58.890198 | -2.789788 | -2.700785 |
| 15 | 0.1022 | 0.11646 | 1.277148 | -1.273983 | 1.0 | 2.55113 | -234.28537 | -53.78466 | -2.824593 | -2.714394 |
| 20 | 0.0222 | 0.057535 | 1.428949 | -2.216115 | 1.0 | 3.645064 | -243.706696 | -52.266651 | -2.817619 | -2.701544 |
| 25 | 0.043 | 0.035186 | 1.315942 | -3.108961 | 1.0 | 4.424903 | -252.635147 | -53.396721 | -2.818129 | -2.704707 |
| 30 | 0.0075 | 0.025655 | 1.117967 | -3.875879 | 1.0 | 4.993846 | -260.304321 | -55.376465 | -2.812904 | -2.709896 |
| 35 | 0.052 | 0.023869 | 1.62584 | -4.248742 | 1.0 | 5.874582 | -264.032928 | -50.297737 | -2.802437 | -2.710295 |
| 40 | 0.001 | 0.022126 | 1.707511 | -4.563019 | 1.0 | 6.270531 | -267.175751 | -49.481022 | -2.791622 | -2.710409 |
| 45 | 0.005 | 0.02129 | 1.713621 | -4.756432 | 1.0 | 6.470052 | -269.109863 | -49.41993 | -2.783861 | -2.710967 |
| 50 | 0.003 | 0.02085 | 1.695883 | -4.838077 | 1.0 | 6.533959 | -269.9263 | -49.597313 | -2.783815 | -2.712142 |


Average answer length: 213.75 characters
Average answer length: 104.45 characters
Average answer length: 59.55 characters
Average answer length: 55.55 characters
Average answer length: 21.60 characters
Average answer length: 14.70 characters
Average answer length: 14.05 characters
Average answer length: 14.70 characters
Average answer length: 11.55 characters
Average answer length: 13.80 characters


TrainOutput(global_step=50, training_loss=0.15690995603828925, metrics={'train_runtime': 141.574, 'train_samples_per_second': 2.825, 'train_steps_per_second': 0.353, 'total_flos': 0.0, 'train_loss': 0.15690995603828925, 'epoch': 1.0})
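
To read the metrics above, it helps to recall how the standard sigmoid DPO objective relates the reward, margin and loss columns. The sketch below is a pure-Python illustration of that formula, not TRL's or Unsloth's implementation. The function names and the example log-probabilities are assumptions chosen for illustration, with `beta = 0.1` as in the trainer call.

```python
import math

def dpo_reward(policy_logp, ref_logp, beta=0.1):
    """Implicit DPO reward: beta * (log pi(y|x) - log pi_ref(y|x))."""
    return beta * (policy_logp - ref_logp)

def dpo_loss(reward_chosen, reward_rejected):
    """Per-example sigmoid DPO loss: -log(sigmoid(margin))."""
    margin = reward_chosen - reward_rejected
    # log1p(exp(-m)) equals -log(sigmoid(m)); adequate for moderate margins.
    return math.log1p(math.exp(-margin))

# Before any update the policy equals the reference model, so both rewards
# are 0 and the loss is ln(2).
initial_loss = dpo_loss(0.0, 0.0)
print(f"initial loss: {initial_loss:.4f}")

# Hypothetical log-probs: the policy raised the chosen answer's likelihood
# and lowered the rejected answer's likelihood relative to the reference.
r_chosen = dpo_reward(policy_logp=-50.0, ref_logp=-65.0)
r_rejected = dpo_reward(policy_logp=-270.0, ref_logp=-225.0)
good_loss = dpo_loss(r_chosen, r_rejected)
print(f"margin: {r_chosen - r_rejected:.2f}, loss: {good_loss:.4f}")

# Swapping the two columns negates the margin, so the same policy is now
# penalised instead of rewarded.
swapped_loss = dpo_loss(r_rejected, r_chosen)
print(f"loss with columns swapped: {swapped_loss:.4f}")
```

The logged columns are consistent with this reading: at step 5, `rewards / margins` is 0.099999 - (-0.11675) = 0.216749, and the first training loss of 0.6421 sits just below ln(2), about 0.693. The logged loss is an average of per-example losses, so it need not equal the loss computed from the average margin.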

### Save the model and check average answer length

In [None]:
loc_peft_model = "peft_model"
dpo_trainer.save_model(loc_peft_model) # Saving to local folder

FastLanguageModel.for_inference(model)
avg_len_after = calc_model_avg_len(tokenizer, model) # Final check

print(f"AVG length before was {avg_len_before}, after fine-tuning it is {avg_len_after}.")

Average answer length: 11.85 characters
AVG length before was 300.9, after fine-tuning it is 11.85.


### Optional: upload the model to HuggingFace (need to sign in to HF first)

In [None]:
# import time
# import huggingface_hub

# # Change repo_id to your repo:
# repo_id = f"ZSvedic/reversed-zephyr-lora-dpo-len{int(avg_len_after)}-{time.strftime('%Y-%m-%d-%H-%M')}"

# huggingface_hub.create_repo(repo_id, exist_ok=True)
# huggingface_hub.upload_folder(
#     folder_path=loc_peft_model,
#     path_in_repo=".",
#     repo_id=repo_id,
#     commit_message="Add fine-tuned model"
# )

In [None]:
import platform, torch, transformers, trl

print(f"Python version: {platform.python_version()}")
print(f"PyTorch version: {torch.__version__}")
print(f"Transformers version: {transformers.__version__}")
print(f"TRL version: {trl.__version__}")
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Available device: {device}")
if device == 'cuda':
    print(f"GPU type: {torch.cuda.get_device_name(0)}")

Python version: 3.11.11
PyTorch version: 2.5.1+cu121
Transformers version: 4.47.1
TRL version: 0.13.0
Available device: cuda
GPU type: NVIDIA A100-SXM4-40GB
