# Preference Fine-Tuning with DPO

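This notebook walks through preference fine-tuning of a pretrained LLM with DPO (Direct Preference Optimization).  
DPO fine-tuning consists of the following two steps.    
1. Prepare a **Preference Dataset** (each record pairs a **Prompt** with a **Positive** and a **Negative** generation)  
2. **DPO optimization**: train the model to minimize the DPO loss, i.e. to maximize the log-likelihood that the positive generation is preferred over the negative one  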
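
For reference, the objective optimized in step 2 can be written as below, following Rafailov et al. (2023). The symbols are the paper's notation rather than names used in this notebook's code: $x$ is the prompt, $y_w$ the positive (chosen) generation, $y_l$ the negative (rejected) generation, $\pi_\theta$ the model being trained, $\pi_{\mathrm{ref}}$ the frozen reference model, $\sigma$ the logistic function, and $\beta$ the coefficient that controls how strongly the trained model is tied to the reference.

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
$$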

## 0. Setup

In [1]:
import os
os.environ['HF_HOME'] = 'D:/HF/cache'
os.environ['HF_DATASETS_CACHE'] = 'D:/HF/datasets'  # the datasets library reads HF_DATASETS_CACHE, not HF_DATASETS
os.environ['HF_HUB_DISABLE_SYMLINKS_WARNING'] = "1"

In [2]:
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["FSDP_CPU_RAM_EFFICIENT_LOADING"] = "false"

In [None]:
# !pip install -q transformers==4.51.1
# !pip install -q datasets==3.2.0
# !pip install -q peft==0.15.1
# !pip install -q trl==0.16.1
# !pip install -q accelerate==1.3.0

In [None]:
# !pip install -q bitsandbytes

In [3]:
from IPython.display import display
from tqdm.notebook import tqdm as notebook_tqdm

In [4]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, pipeline
from peft import LoraConfig, PeftModel
from trl import DPOTrainer, DPOConfig

In [None]:
"""
from huggingface_hub import notebook_login
notebook_login()
"""

In [None]:
"""
from huggingface_hub import login
from dotenv import load_dotenv
from pathlib import Path
import os

dotenv_path = Path('Z:/Misc/access_token.env.txt')
load_dotenv(dotenv_path=dotenv_path)
access_key = os.getenv('HF_TOKEN')

# Log in with the token instead of printing the secret.
login(token=access_key)
"""

In [5]:
!nvidia-smi

Fri Aug  1 13:38:17 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 553.62                 Driver Version: 553.62         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A10-24Q               WDDM  |   00000002:00:00.0 Off |                    0 |
| N/A    0C    P8             N/A /  N/A  |    1363MiB /  24512MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

## 1. Foundation Model: gemma-2b

Let's first check the output of the base Gemma-2B model, which has had neither instruction nor preference fine-tuning.

In [6]:
# BASE_MODEL = "google/gemma-2b"
BASE_MODEL = "//swschoolavdazfiles002.file.core.windows.net/aias-language/Model/gemma-2b"

# Load the pretrained 'google/gemma-2b' model and its tokenizer.
model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.padding_side = 'right'

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [7]:
prompt = "What's the nearest national park to you?"

In [8]:
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=512)

outputs = pipe(
    prompt,
    do_sample=True,
    temperature=0.2,
    top_k=50,
    top_p=0.95,
    repetition_penalty=1.2,
)

print(outputs[0]["generated_text"])

Device set to use cuda:0


What's the nearest national park to you?

If it isn't in your state, then what about a nearby one that is open year round and has lots of hiking trails.  I have been lucky enough to visit many parks around the country but I am always looking for more! 

Here are some ideas:

* Acadia National Park
* Great Smoky Mountains National Park
* Grand Teton National Park
* Glacier National Park
* Olympic National Park
* Yellowstone National Park

You can also check out this list from National Geographic.


<span style="color:red">!! [Caution] Restart the kernel before running preference tuning on the Gemma-1.1-2B-it model !!</span>

After re-running the cells in 0. Setup, continue from step 2.

## 2. Dataset: Preference Data

We use the open dataset "jondurbin/truthy-dpo-v0.1" as the preference dataset for fine-tuning.
It contains 1,016 examples written to improve LLM truthfulness,
and each example provides a **'prompt', 'chosen', 'rejected'** triple.
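
As a minimal sketch of what such a triple looks like and how it might be sanity-checked before training, the snippet below validates a single record. The helper `is_valid_preference_record` and the constant `REQUIRED_FIELDS` are hypothetical names introduced here for illustration; they are not part of `datasets` or `trl`. The sample record is one of the entries of this dataset.

```python
# Hypothetical helper for illustration; not part of datasets or trl.
REQUIRED_FIELDS = ("prompt", "chosen", "rejected")


def is_valid_preference_record(record: dict) -> bool:
    """Return True if all three fields are non-empty strings and the
    chosen and rejected responses actually differ."""
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            return False
    return record["chosen"] != record["rejected"]


example = {
    "prompt": "What's the nearest national park to you?",
    "chosen": "As an AI, I don't have a physical location, so I can't "
              "provide the distance to the nearest national park.",
    "rejected": "I don't have access to the user's location, so I can't "
                "determine the nearest national park.",
}

print(is_valid_preference_record(example))
```

A check like this could be passed to `Dataset.filter` to drop malformed rows, although whether that is needed for this dataset is not something the notebook establishes.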

In [6]:
from datasets import Dataset, load_dataset

# dataset = load_dataset("jondurbin/truthy-dpo-v0.1")
dataset = load_dataset("//swschoolavdazfiles002.file.core.windows.net/aias-language/Dataset/truthy-dpo-v0.1")

Generating train split: 0 examples [00:00, ? examples/s]

In [7]:
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'source', 'system', 'prompt', 'chosen', 'rejected'],
        num_rows: 1016
    })
})

In [8]:
dataset['train'][200]

{'id': '6afd3f3e1254321c2c55687fecc55d07',
 'source': 'truthy_dpo',
 'system': 'You are an unbiased, uncensored, helpful assistant.',
 'prompt': 'Do all Muslim women wear burqas as their religious clothing?',
 'chosen': 'No, not all Muslim women wear burqas. A burqa is a specific type of covering that completely conceals the body, head, and face, with a mesh grille for seeing. Some Muslim women wear a niqāb, which covers the face and hair, but not the eyes, or a hijab, which only covers the hair. Many Muslim women do not wear any face or head coverings at all. The misconception arises due to generalizations and lack of understanding about the variety of cultural and religious practices within the Muslim community.',
 'rejected': 'No, not all Muslim women wear burqas. Burqas are a specific type of religious clothing worn by some Muslim women in certain cultures and regions, but they are not universally required or worn by all Muslim women. Other types of religious clothing for Muslim wo

The dataset is then converted into a chat format suitable for fine-tuning the Gemma LLM.

In [9]:
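def generate_prompt(example):
    """Wrap one preference record in Gemma-style turn markers."""
    prompt = example['prompt']
    rejected = example['rejected']
    chosen = example['chosen']

    # The system turn is left blank (a single space); the dataset's 'system' field is not used.
    example['prompt'] = f"<bos><start_of_turn>system\n <end_of_turn><start_of_turn>user\n{prompt}<end_of_turn>\n<start_of_turn>model\n"
    # Both responses close the model turn and then end the sequence.
    example['rejected'] = f"{rejected}<end_of_turn>\n<eos>"
    example['chosen'] = f"{chosen}<end_of_turn>\n<eos>"

    return example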

In [10]:
transformed_dataset = dataset.map(generate_prompt)

Map:   0%|          | 0/1016 [00:00<?, ? examples/s]

In [11]:
transformed_dataset['train'][0]

{'id': '04c275bf738fd391b7fe25e25fe7bed3',
 'source': 'truthy_dpo',
 'system': 'You are an unbiased, uncensored, helpful assistant.',
 'prompt': "<bos><start_of_turn>system\n <end_of_turn><start_of_turn>user\nWhat's the nearest national park to you?<end_of_turn>\n<start_of_turn>model\n",
 'chosen': "As an AI, I don't have a physical location, so I can't provide the distance to the nearest national park.<end_of_turn>\n<eos>",
 'rejected': "I don't have access to the user's location, so I can't determine the nearest national park.<end_of_turn>\n<eos>"}

In [12]:
dataset = transformed_dataset['train'].train_test_split(test_size=0.05)

In [13]:
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'source', 'system', 'prompt', 'chosen', 'rejected'],
        num_rows: 965
    })
    test: Dataset({
        features: ['id', 'source', 'system', 'prompt', 'chosen', 'rejected'],
        num_rows: 51
    })
})
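
The 965 / 51 split shown above can be reproduced by hand. The sketch below assumes the test count is the requested fraction rounded up and the remainder goes to training, which matches the numbers in the output; `split_sizes` is a hypothetical helper written for this note, not a `datasets` function.

```python
import math


def split_sizes(n_rows: int, test_size: float) -> tuple[int, int]:
    """Return (n_train, n_test), assuming the test share is rounded up."""
    n_test = math.ceil(test_size * n_rows)
    n_train = n_rows - n_test
    return n_train, n_test


n_train, n_test = split_sizes(1016, 0.05)
print(n_train, n_test)  # 965 51
```

Because `train_test_split` is called without a `seed` here, the rows that land in each split differ from run to run even though the sizes stay the same.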

## 3. Align LLM with TRL and the DPOTrainer

The foundation model used for fine-tuning is Google's gemma-1.1-2b-it.  

(https://huggingface.co/google/gemma-1.1-2b-it)

In [None]:
"""
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
"""

In [14]:
# model_ckpt = "google/gemma-1.1-2b-it"
model_ckpt = "//swschoolavdazfiles002.file.core.windows.net/aias-language/Model/gemma-1.1-2b-it"

# [Exercise] Complete the following code!
# Load the pretrained 'gemma-1.1-2b-it' model and its tokenizer.
model = AutoModelForCausalLM.from_pretrained(
    model_ckpt,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    # quantization_config=bnb_config,
)

tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = 'right'

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

For efficient DPO training, we define the PEFT LoRA config, the training arguments, and the DPO trainer in turn.

In [15]:
# [Exercise] Complete the following code!
# Set up the config for PEFT LoRA training. (r, lora_alpha, lora_dropout, bias, target_modules, task_type)
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    target_modules=["q_proj", "o_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

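Next, define the training arguments (`DPOConfig`) and the DPO trainer.
The key DPO hyperparameter is "**beta**", which is typically set between 0.1 and 0.5.  
A smaller beta penalizes deviation from the reference model less, so the trained model can drift further from it.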
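
To make the role of beta concrete, here is a minimal pure-Python sketch of the per-example DPO loss from Rafailov et al. (2023). It is an illustration only, not the code path `DPOTrainer` runs; the function name `dpo_loss` and the toy log-probabilities are assumptions made up for this example.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log(sigmoid(beta * (chosen_shift - rejected_shift)))."""
    # How far the policy has moved away from the reference on each response.
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_shift - rejected_shift)
    # Numerically stable softplus(-margin), which equals -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: margin is 0, so the loss is log(2).
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))

# Same shift away from the reference (+5 on chosen, -5 on rejected), two betas.
print(dpo_loss(-5.0, -15.0, -10.0, -10.0, beta=0.1))
print(dpo_loss(-5.0, -15.0, -10.0, -10.0, beta=0.5))
```

With the same shift in log-probabilities, the loss is already close to zero at beta=0.5 but still sizeable at beta=0.1. A small beta therefore keeps rewarding further movement away from the reference, which is one way to read the statement that smaller values allow larger deviation.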

In [16]:
training_args = DPOConfig(
    output_dir="D:/Trainer/DPO-Trainer",
    eval_strategy="steps",
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    label_names=["labels"],
    max_steps=500,
    eval_steps=50,
    logging_steps=50,
    # num_train_epochs=1,
    learning_rate=2e-4,
    dataloader_num_workers=2,
    dataloader_prefetch_factor=1,
    beta=0.1,  # key DPO hyperparameter
    padding_value=tokenizer.eos_token_id,
)

In [17]:
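# [Exercise] Complete the following code!
# Set up the Trainer for DPO training. (model, ref_model, args, train_dataset, eval_dataset, processing_class, peft_config, etc.)
dpo_trainer = DPOTrainer(
    model,
    ref_model=None,  # with a PEFT config, the base model with the adapter disabled serves as the reference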
    args=training_args,
    train_dataset=dataset['train'],
    eval_dataset=dataset['test'],
    peft_config=lora_config,
    processing_class=tokenizer,
)

Extracting prompt in train dataset:   0%|          | 0/965 [00:00<?, ? examples/s]

Applying chat template to train dataset:   0%|          | 0/965 [00:00<?, ? examples/s]

Tokenizing train dataset:   0%|          | 0/965 [00:00<?, ? examples/s]

Extracting prompt in eval dataset:   0%|          | 0/51 [00:00<?, ? examples/s]

Applying chat template to eval dataset:   0%|          | 0/51 [00:00<?, ? examples/s]

Tokenizing eval dataset:   0%|          | 0/51 [00:00<?, ? examples/s]

In [18]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

Only about 0.39% of all parameters are trained during DPO fine-tuning.

In [19]:
print_trainable_parameters(model)

trainable params: 9805824 || all params: 2515978240 || trainable%: 0.3897420034920493
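
The trainable-parameter count can be checked by hand: a LoRA adapter of rank r on a linear layer with shape (in, out) adds r × (in + out) parameters. The sketch below assumes the layer shapes from the public gemma-2b configuration (hidden size 2048, MLP size 16384, 18 layers, 8 query heads, 1 key/value head, head dimension 256). These values are not printed anywhere in this notebook, so treat them as assumptions.

```python
# Assumed gemma-2b shapes (taken from the public model config, not from this notebook).
HIDDEN = 2048
INTERMEDIATE = 16384
NUM_LAYERS = 18
NUM_HEADS = 8
NUM_KV_HEADS = 1
HEAD_DIM = 256
R = 8  # LoRA rank used in lora_config

# (in_features, out_features) of each targeted linear layer in one decoder block.
target_shapes = {
    "q_proj": (HIDDEN, NUM_HEADS * HEAD_DIM),
    "k_proj": (HIDDEN, NUM_KV_HEADS * HEAD_DIM),
    "v_proj": (HIDDEN, NUM_KV_HEADS * HEAD_DIM),
    "o_proj": (NUM_HEADS * HEAD_DIM, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, r: int) -> int:
    """LoRA adds A (r x in) and B (out x r) to a linear layer."""
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o, R) for i, o in target_shapes.values())
total = per_layer * NUM_LAYERS
print(total)  # 9805824
print(round(100 * total / 2515978240, 2))  # 0.39
```

Under these assumptions the total matches the `trainable params` value printed above.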


Running 500 steps takes roughly 5 to 7 minutes.

In [20]:
dpo_trainer.train()

| Step | Training Loss | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/chosen | Logps/rejected | Logits/chosen | Logits/rejected |
|---|---|---|---|---|---|---|---|---|---|---|
| 50 | 0.5385 | 0.205201 | 2.431861 | -7.226572 | 0.901961 | 9.657475 | -237.651962 | -317.745087 | -18.302696 | -15.927696 |
| 100 | 0.1956 | 0.102482 | -0.627336 | -17.700357 | 0.941176 | 17.071001 | -268.245087 | -422.5 | -21.921568 | -19.066177 |
| 150 | 0.0208 | 0.056529 | 1.633722 | -22.227634 | 0.960784 | 23.852942 | -245.627457 | -467.754913 | -21.375 | -17.957108 |
| 200 | 0.0966 | 0.023065 | -0.212757 | -23.651348 | 0.980392 | 23.436274 | -264.107849 | -481.931366 | -19.577206 | -17.294117 |
| 250 | 0.0048 | 0.00053 | 2.748411 | -27.446384 | 1.0 | 30.199142 | -234.5 | -519.941162 | -21.25 | -18.694853 |
| 300 | 0.1992 | 0.03892 | 4.919539 | -13.97505 | 0.980392 | 18.901501 | -212.803925 | -385.323517 | -22.553921 | -20.825981 |
| 350 | 0.001 | 0.026944 | 3.256252 | -22.920496 | 0.980392 | 26.183823 | -229.392151 | -474.676483 | -22.431372 | -20.090687 |
| 400 | 0.0163 | 0.000193 | -0.058153 | -30.891697 | 1.0 | 30.809437 | -262.558838 | -554.313721 | -20.330883 | -16.539215 |
| 450 | 0.003 | 0.008375 | -3.143354 | -42.966911 | 1.0 | 39.840687 | -293.446075 | -675.019592 | -19.48897 | -15.465686 |
| 500 | 0.0006 | 0.007056 | -3.672181 | -45.974266 | 1.0 | 42.294117 | -298.661774 | -705.215698 | -19.627451 | -15.577206 |


TrainOutput(global_step=500, training_loss=0.10764903584122658, metrics={'train_runtime': 378.9929, 'train_samples_per_second': 1.319, 'train_steps_per_second': 1.319, 'total_flos': 0.0, 'train_loss': 0.10764903584122658, 'epoch': 0.5181347150259067})
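
The `epoch` and `train_steps_per_second` values in the output above follow from the training configuration. The short calculation below assumes a single GPU and no gradient accumulation, so one step consumes exactly one training pair; the variable names are chosen for this note.

```python
max_steps = 500
per_device_train_batch_size = 1
train_rows = 965            # size of the train split
train_runtime_s = 378.9929  # 'train_runtime' reported by the trainer

# One GPU and no gradient accumulation assumed: one step = one preference pair.
samples_seen = max_steps * per_device_train_batch_size
epochs = samples_seen / train_rows
steps_per_second = max_steps / train_runtime_s

print(round(epochs, 4))                # 0.5181
print(round(steps_per_second, 3))      # 1.319
print(round(train_runtime_s / 60, 1))  # 6.3 (minutes)
```

In other words, this run sees only about half of the training split once, and the adapter has not completed a full epoch.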

In [None]:
"""
# save DPO adapter model
ADAPTER_MODEL = "D:/Trainer/DPO-Trainer/dpo_adapter"

dpo_trainer.model.save_pretrained(ADAPTER_MODEL)
"""

## 4. Fine-tuned LLM Inference

Now let's test the DPO fine-tuned model.  
Because Gemma-1.1-2b-it has already been through extensive instruction and preference training,
the change introduced by this DPO run may be hard to notice.

In [None]:
"""
BASE_MODEL = "google/gemma-1.1-2b-it"

model = AutoModelForCausalLM.from_pretrained(BASE_MODEL, device_map='auto', torch_dtype=torch.float32)
# model = PeftModel.from_pretrained(model, ADAPTER_MODEL, device_map='auto', torch_dtype=torch.float32)
model.load_adapter(ADAPTER_MODEL)
"""

In [21]:
question = "What's the nearest national park to you?"

prompt = f"""<bos><start_of_turn>system
You are a helpful AI assistant.<end_of_turn>
<start_of_turn>user
{question}<end_of_turn>
<start_of_turn>model
"""

In [22]:
pipe_finetuned = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=256)

outputs = pipe_finetuned(
    prompt,
    do_sample=True,
    temperature=0.1,
    top_k=50,
    top_p=0.95,
    repetition_penalty=1.2,
    # add_special_tokens=True
)

print(outputs[0]["generated_text"][len(prompt):])

Device set to use cuda:0


As an artificial intelligence, I don't have a physical location or a personal connection to the natural world, so I can't provide any specific information about national parks. The concept of "nearest" is more metaphorical and relates to the metaphorical representation of the interconnectedness of our global community. postData


- Ref. Rafailov et al., "Direct Preference Optimization: Your Language Model is Secretly a Reward Model", 2023, Stanford University