In [1]:
from dotenv import load_dotenv
import pandas as pd
import os
import wandb
import random
from tqdm import tqdm

# Training, data, and evaluation libraries
import torch
from trl import SFTTrainer
from datasets import load_dataset
from transformers import TrainingArguments, TextStreamer
from unsloth import FastLanguageModel, is_bfloat16_supported
from evaluate import load

# DPO stuff
from trl import DPOConfig, DPOTrainer
from unsloth import PatchDPOTrainer
PatchDPOTrainer()


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
[2024-12-10 03:25:22,490] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)


In [2]:
"""Load environment variables and configure device."""
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
if torch.cuda.is_available():
    print(f"Using GPU: {torch.cuda.get_device_name()}")
else:
    print("CUDA not found")



load_dotenv("all_keys.txt")

# Hugging Face token: put your own key in all_keys.txt as HF_TOKEN
hf_token = os.getenv("HF_TOKEN")

# Initialize wandb with gradient info as well.
wandb_api_key = os.getenv("WANDB_API_KEY")
wandb.login(key=wandb_api_key)
wandb.init(project="dpo_Honaz")

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


Using GPU: NVIDIA A100-SXM4-40GB


[34m[1mwandb[0m: Currently logged in as: [33merkara[0m. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /home/ubuntu/.netrc


# Aligning LLMs with Direct Preference Optimization (DPO)

This project builds on a previous effort to fine-tune a language model on a niche dataset about my hometown, [Honaz](https://en.wikipedia.org/wiki/Honaz), a small town in Turkey. The dataset was created from three detailed Turkish articles sourced from the [DergiPark](https://dergipark.org.tr/) repository. We created our own instruction dataset and fine-tuned `Llama-3.2-1B-Instruct` on it.

Now we take the fine-tuning further by applying Direct Preference Optimization (DPO). The objective is to align the model's responses so that they are not just accurate but also conversational and informal, a tone that resonates with the target audience. The target audience here is my dad, who does not like to read serious stuff and gets bored quickly (I can translate for him into Turkish). Just like before, we created our own alignment data, which I will outline in a separate entry in this repo. The focus is on generating and refining pairwise comparison data in which the preferred responses follow this style, so that the model is not only knowledgeable but also user-friendly.

## Preference Data

Let's load the data and see what it looks like. Each record has a `prompt`, a `rejected` answer, and a `chosen` answer. As you can see, the chosen answers have a pretty informal tone.

In [3]:
dataset = load_dataset("erdi28/alignment-dataset-honaz",split='train')

In [4]:
print(dataset[0]['prompt'])
print("=====================================================")
print(dataset[0]['rejected'])
print("=====================================================")
print(dataset[0]['chosen'])

Answer a question based on the following content.

3. VEGETATION OF HONAZ MOUNTAIN AND ITS SURROUNDINGS
The vegetation of Honaz Mountain and its surroundings generally consists of dry forests dominated by red pines at lower elevations and black pines at higher elevations. The northern slopes of the Honaz massif are influenced by the Mediterranean climate that penetrates along the Büyük Menderes valley, while the interior areas and southern slopes are under the influence of a continental climate. As a result, the vegetation on the northern and southern slopes of the massif differs. On the more humid northern slopes, a richer and more diverse maquis formation has developed, whereas on the southern slopes, a garigue formation consisting of only the most drought-resistant maquis species is prevalent.
The vegetation of Honaz Mountain and its surroundings is characterized by dry forests, with red pines at lower elevations and black pines at higher elevations. The northern slopes experience a

# Alignment with DPO

First, load the fine-tuned model and inspect how it generates its answer. We can observe that the answer is formal and has a pretty textbook voice.

In [5]:
max_seq_length = 2048
ref_model, tokenizer = FastLanguageModel.from_pretrained(model_name = "erdi28/finetune_llama_honaz",
                                                     max_seq_length = max_seq_length,
                                                     dtype = None,                         
                                                     load_in_4bit = True)     

==((====))==  Unsloth 2024.12.4: Fast Llama patching. Transformers:4.46.3.
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.381 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu124. CUDA: 8.0. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Unsloth 2024.12.4 patched 16 layers with 16 QKV layers, 16 O layers and 16 MLP layers.


In [6]:
# Define the Alpaca prompt template (we don't have an "Input" field)
alpaca_prompt = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{}

### Response:
{}"""

In [7]:
def generate_streaming_text(model, tokenizer, prompt, max_new_tokens=256, prompt_template=alpaca_prompt):
    """
    Generates text from a model with streaming output.
    """
    # Format the input and set up the streamer
    message = prompt_template.format(prompt, "")
    inputs = tokenizer([message], return_tensors="pt").to(device)
    text_streamer = TextStreamer(tokenizer)
    
    # Generate text with streaming
    _ = model.generate(
        **inputs, 
        streamer=text_streamer, 
        max_new_tokens=max_new_tokens, 
        use_cache=True
    )

# test
ref_model = FastLanguageModel.for_inference(ref_model)
prompt = "What are the climatic influences on Honaz Mountain’s vegetation?"
generate_streaming_text(ref_model, tokenizer, prompt, max_new_tokens=150)

<|begin_of_text|>Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
What are the climatic influences on Honaz Mountain’s vegetation?

### Response:
Honaz Mountain, located in the Aegean region of Turkey, experiences a Mediterranean climate characterized by mild winters and warm summers. This climate influences the vegetation types present in the area. 

1. **Temperature and Precipitation Patterns**: The mountain experiences a significant variation in temperature between summer and winter. Summer temperatures can reach up to 32°C (90°F) during the peak summer months, while winters can drop to around 2°C (36°F). Precipitation on Honaz Mountain is generally well-distributed, with most of the annual rainfall falling during the winter months.

2. **Rainfall Distribution**: The annual rainfall on Honaz Mountain is substantial, with the majority of it occurring between October and April. This rainfall is crucial


Now let's go ahead and configure the LoRA parameters. *At the time of this notebook, there was an ongoing issue: even though we saved the full model to the Hub, Unsloth keeps tracking the LoRA parameters, which forces us to use exactly the same LoRA configuration we used to fine-tune the original model. It is a major problem, but we can live with it for now.*

In [8]:
model = FastLanguageModel.get_peft_model(
    ref_model,
    r = 32,            
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 32,
    lora_dropout = 0,         # dropout after adapter, "0" is optimized
    bias = "none",            # biases in the model remain frozen (not updated), "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    use_rslora = False,     # rank stabilized LoRA
    loftq_config = None,    # And LoftQ
    random_state = 1234,
)

Unsloth: Already have LoRA adapters! We shall skip this step.
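
As a sanity check on this configuration, the number of trainable LoRA parameters can be worked out by hand. The sketch below is only an illustration: the layer shapes are assumed values for Llama-3.2-1B (hidden size 2048, MLP size 8192, 8 key/value heads of dimension 64) and are not read from the model, and `lora_param_count` is a hypothetical helper, not an Unsloth or PEFT function.

```python
# Assumed Llama-3.2-1B shapes (not stated in this notebook): hidden size 2048,
# MLP intermediate size 8192, 8 key/value heads of dimension 64 (k/v project to 512).
HIDDEN, INTERMEDIATE, KV_DIM = 2048, 8192, 8 * 64
RANK, NUM_LAYERS = 32, 16  # r from the config above; 16 layers per the Unsloth patch log

# (in_features, out_features) of each targeted projection
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_param_count(shapes, rank, num_layers):
    """LoRA adds A (rank x in) and B (out x rank) for every adapted matrix."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in shapes.values())
    return per_layer * num_layers

print(lora_param_count(TARGET_SHAPES, RANK, NUM_LAYERS))  # 22544384
```

Under these assumptions the count is 22,544,384, which agrees with the `Number of trainable parameters` line that the trainer prints when training starts.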


In [9]:
# Define the Alpaca prompt template
alpaca_prompt = """Below is an instruction that describes a task. 
Write a response that appropriately completes the request.

### Instruction:
{}

### Response:
"""

# Ensure the EOS token is defined
EOS_TOKEN = tokenizer.eos_token

# Mapping function to format the dataset
def format_samples(example):
    example["prompt"] = alpaca_prompt.format(example["prompt"])
    example["chosen"] = example["chosen"] + EOS_TOKEN
    example["rejected"] = example["rejected"] + EOS_TOKEN

    return {
        "prompt": example["prompt"],
        "chosen": example["chosen"],
        "rejected": example["rejected"],
    }

# Apply the mapping function to the dataset
dataset = dataset.map(format_samples)
dataset = dataset.train_test_split(test_size=0.10)


In [10]:
print(dataset["train"].num_rows)
print(dataset["test"].num_rows)

909
101


Here is our driver code. I cannot give a full lecture on how DPO works, but here are a few key points:
- We use a smaller learning rate, 2e-5, as opposed to 3e-4 in fine-tuning.
- The `beta` parameter controls the balance between staying close to the reference model's distribution and fitting the preference data. It scales the implicit KL penalty: a high `beta` keeps the policy close to the reference model, while a low `beta` lets it deviate more in order to match the preference signal.
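
To make the role of `beta` a bit more concrete, here is a minimal sketch of the per-pair DPO loss in its standard sigmoid form, written in plain Python. It is not TRL's implementation (which works on batched tensors and supports other loss variants); `dpo_loss` and its argument names are made up for illustration, and the inputs are assumed to be sequence log-probabilities under the trained policy and the frozen reference model.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta):
    """Per-pair sigmoid DPO loss from sequence log-probabilities."""
    # Implicit rewards: beta times the log-ratio of policy to reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Policy identical to the reference: margin is 0, loss is log(2)
print(dpo_loss(-10.0, -12.0, -10.0, -12.0, beta=0.8))
# Policy has moved toward the chosen answer: the loss drops
print(dpo_loss(-8.0, -14.0, -10.0, -12.0, beta=0.8))
```

Because the loss only sees log-ratios against the reference model, `beta` sets how strongly a given deviation from the reference is converted into reward.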

In [11]:
trainer = DPOTrainer(
    model= model,
    ref_model= None,
    tokenizer = tokenizer,
    beta = .8,
    train_dataset = dataset["train"],
    eval_dataset = dataset["test"],
    max_length = max_seq_length//2,
    max_prompt_length = max_seq_length//2,
    dataset_num_proc = 2,                 # number of parallel processes for data preprocessing
    args = DPOConfig(
         # Training hyperparameters
        num_train_epochs=1,               # Train for one epoch
        per_device_train_batch_size=2,    # Batch size per device during training
        per_device_eval_batch_size=2,     # Batch size per device during evaluation
        gradient_accumulation_steps=8,    # Accumulate gradients for larger effective batch size
        gradient_checkpointing=True,      # Save memory by recomputing activations in backprop
        
        # Optimization settings
        learning_rate = 2e-5,
        optim = "adamw_8bit",
        weight_decay = 0.01, 
        lr_scheduler_type = "linear",      
        warmup_steps=10,
        
        # Precision settings
        fp16 = not is_bfloat16_supported(),     # Fall back to FP16 only when BF16 is unavailable
        bf16 = is_bfloat16_supported(),         # Use BF16 when the GPU supports it (e.g. A100)
        
       # Logging and checkpoints
        save_steps=100,                   # Save checkpoint every 100 steps
        save_total_limit=1,               # Keep only the most recent checkpoint
        logging_steps=10,                  # Log training progress every 10 steps
        eval_strategy="steps",            # Run evaluation at regular step intervals, don't wait for the end of an epoch
        eval_steps=10,                     # Evaluate every 10 steps
        output_dir= "output_dpo",
        run_name="llama_dpo",
        report_to="wandb",                # Report metrics to Weights and Biases
        ),
)

In [12]:
trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 909 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 8
\        /    Total batch size = 16 | Total steps = 56
 "-____-"     Number of trainable parameters = 22,544,384
Could not estimate the number of tokens of the input, floating-point operations will not be computed


| Step | Training Loss | Validation Loss | rewards / chosen | rewards / rejected | rewards / accuracies | rewards / margins | logps / rejected | logps / chosen | logits / rejected | logits / chosen |
|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 9.2649 | 2.473232 | 30.31946 | 23.740694 | 0.627451 | 6.578768 | -115.043007 | -238.793442 | 4.147682 | 4.467786 |
| 20 | 1.9916 | 0.446189 | 38.661587 | 17.798258 | 0.931373 | 20.863325 | -122.471039 | -228.365829 | 4.438614 | 4.441101 |
| 30 | 0.2321 | 0.221486 | 40.632534 | 12.861235 | 0.980392 | 27.771303 | -128.642334 | -225.90213 | 4.517467 | 4.417289 |
| 40 | 0.1854 | 0.144026 | 41.18573 | 10.820862 | 0.980392 | 30.364866 | -131.19281 | -225.210648 | 4.548862 | 4.415908 |
| 50 | 0.1404 | 0.134067 | 41.379635 | 10.15504 | 0.980392 | 31.224592 | -132.02507 | -224.968277 | 4.552331 | 4.411156 |


TrainOutput(global_step=56, training_loss=2.137345226747649, metrics={'train_runtime': 146.6748, 'train_samples_per_second': 6.197, 'train_steps_per_second': 0.382, 'total_flos': 0.0, 'train_loss': 2.137345226747649, 'epoch': 0.9846153846153847})
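
The step count and the final `epoch` value in the output above follow from simple arithmetic. Here is a minimal sketch of that calculation, assuming the trainer drops a trailing partial gradient-accumulation group (which is what the logged numbers suggest). The variable names are mine, not the trainer's.

```python
import math

num_examples = 909            # training rows after the 90/10 split
per_device_batch_size = 2
grad_accum_steps = 8
num_gpus = 1

effective_batch = per_device_batch_size * grad_accum_steps * num_gpus
batches_per_epoch = math.ceil(num_examples / per_device_batch_size)
# Assumption: an incomplete accumulation group at the end is not counted
optimizer_steps = batches_per_epoch // grad_accum_steps
epoch_fraction = optimizer_steps * grad_accum_steps / batches_per_epoch

print(effective_batch, batches_per_epoch, optimizer_steps, round(epoch_fraction, 4))
# 16 455 56 0.9846
```

So 56 optimizer steps cover 448 of the 455 mini-batches, which is why the run reports `epoch` as roughly 0.985 rather than 1.0.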

`What are those numbers above?`

1. `Training/Validation Loss`: Average DPO loss on the training/validation dataset.

2. `rewards / chosen`: The average implicit reward for the preferred (chosen) responses, i.e. `beta` times the log-probability ratio between the trained model and the reference model. DPO has no separate reward model; higher values mean the model has raised the likelihood of the chosen responses relative to the reference.

3. `rewards / rejected`: The average implicit reward for the rejected responses. Ideally, this should be lower than `rewards / chosen`.

4. `rewards / accuracies`: The fraction of examples where the preferred response has a higher reward than the rejected one. High values (>0.9) indicate strong preference alignment. We have to watch this one; ideally we like it above 0.90, without overfitting of course.

5. `rewards / margins`: The difference between the rewards of chosen and rejected responses (`rewards / chosen - rewards / rejected`). Larger margins indicate confident preference alignment.

6. `logps / chosen`: Log probability the model assigns to the chosen responses (`logps / rejected` is the same quantity for the rejected responses). Higher (less negative) values mean the model considers those responses more likely.


While training with DPO, keep an eye on a few key metrics to make sure the model is learning to align with your preference dataset. First, check that `rewards / chosen` is higher than `rewards / rejected`, meaning the preferred responses are actually being rewarded more. If `rewards / accuracies` is above 0.9, that's a good sign the model is picking the preferred responses most of the time. Also, watch `rewards / margins`: a positive and growing margin shows the model is confidently separating the chosen and rejected responses. For log probabilities, look at the trend rather than the raw values, since sequence log probabilities depend on response length: in the run above `logps / chosen` is more negative than `logps / rejected`, but it rises (-238.8 to -225.0) while `logps / rejected` falls (-115.0 to -132.0), which is the direction we want. Of course, at the end of the day, we may go back and add more data to our alignment dataset, as we currently have only about 1K pairs. Since this is a personal project and no one is paying us yet (:, we can call it a success.
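
For readers who want to see how the reward columns relate to log probabilities, here is a small sketch that recomputes them from per-example values. It is an illustration under my own assumptions, not TRL code: `dpo_reward_metrics` is a hypothetical helper, and the numbers in the example are made up.

```python
def dpo_reward_metrics(policy_logps, ref_logps, beta):
    """Both inputs are lists of (chosen_logp, rejected_logp) pairs, one per example."""
    chosen = [beta * (p[0] - r[0]) for p, r in zip(policy_logps, ref_logps)]
    rejected = [beta * (p[1] - r[1]) for p, r in zip(policy_logps, ref_logps)]
    n = len(chosen)
    return {
        "rewards/chosen": sum(chosen) / n,
        "rewards/rejected": sum(rejected) / n,
        # fraction of pairs where the chosen answer gets the higher reward
        "rewards/accuracies": sum(c > r for c, r in zip(chosen, rejected)) / n,
        "rewards/margins": sum(c - r for c, r in zip(chosen, rejected)) / n,
    }

# Made-up log probabilities for two preference pairs
policy = [(-200.0, -120.0), (-150.0, -90.0)]
reference = [(-210.0, -110.0), (-148.0, -95.0)]
print(dpo_reward_metrics(policy, reference, beta=0.5))
# {'rewards/chosen': 2.0, 'rewards/rejected': -1.25,
#  'rewards/accuracies': 0.5, 'rewards/margins': 3.25}
```

In the first made-up pair the chosen response has the lower raw log probability (-200 versus -120) yet the higher reward, because rewards are measured relative to the reference model.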

In [16]:
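# Let's send the model to the Hub.
# Save the merged 16-bit model locally, then push to Hugging Face for further use.
model.save_pretrained_merged("dpo_llama_honaz", tokenizer, save_method="merged_16bit")
# Note: the upload log below lists adapter_model.safetensors, so this call pushed the LoRA adapter;
# Unsloth's push_to_hub_merged is the call for uploading the merged weights instead.
model.push_to_hub("erdi28/dpo_llama_honaz", tokenizer, save_method="merged_16bit")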

Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 152.47 out of 216.26 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 16/16 [00:00<00:00, 96.01it/s]

Unsloth: Saving tokenizer...




 Done.
Done.


README.md:   0%|          | 0.00/595 [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/90.2M [00:00<?, ?B/s]

Saved model to https://huggingface.co/erdi28/dpo_llama_honaz


## Inference

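Let's look at some responses before and after alignment. The first output below comes from the DPO-aligned model and the second from the fine-tuned model before alignment. One can definitely see a shift toward a more informal response, with more room for improvement of course.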

In [13]:
model = FastLanguageModel.for_inference(model)
prompt = "What are major tourist destination places in Honaz?"
generate_streaming_text(model, tokenizer, prompt, max_new_tokens=150)

<|begin_of_text|>Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
What are major tourist destination places in Honaz?

### Response:
Honaz is a charming town in Turkey, known for its beautiful beaches, lush green valleys, and rich history. Here are some major tourist destination places in Honaz:

1. **Beaches**: Honaz is famous for its stunning beaches, particularly the Kirenişli Beach, which is known for its white sand and crystal-clear waters. The town also boasts other beautiful beaches, such as Güzelpınar Beach and Güzelpınar Reef.

2. **Döker Tepe**: This hill is a popular spot for panoramic views of the town and the surrounding landscape. Visitors can reach the summit by hiking up the steep trails or by taking the nearby Döker Train.

3. **Höyük Beach**: This


In [14]:
# Load the fine-tuned (pre-DPO) model again for comparison
max_seq_length = 2048
ref_model, tokenizer = FastLanguageModel.from_pretrained(model_name = "erdi28/finetune_llama_honaz",
                                                     max_seq_length = max_seq_length,
                                                     dtype = None,                         
                                                     load_in_4bit = True)   

==((====))==  Unsloth 2024.12.4: Fast Llama patching. Transformers:4.46.3.
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.381 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu124. CUDA: 8.0. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


In [15]:
ref_model = FastLanguageModel.for_inference(ref_model)
prompt = "What are major tourist destination places in Honaz?"
generate_streaming_text(ref_model, tokenizer, prompt, max_new_tokens=150)

<|begin_of_text|>Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
What are major tourist destination places in Honaz?

### Response:
Honaz is a popular tourist destination in Turkey, known for its stunning natural beauty, rich history, and unique cultural heritage. Some of the major tourist destination places in Honaz include:

1. Honaz Beach: Located on the southern coast of Turkey, the beach is known for its white sand and crystal-clear waters, making it a perfect spot for swimming, sunbathing, and relaxation.

2. Kastro Village: This historic village is one of the oldest in Western Anatolia and features traditional Ottoman architecture, beautiful gardens, and a fascinating museum showcasing the region's history and culture.

3. Tursunlu Cliff: A breathtaking natural formation, the Tursunlu Cliff is a stunning sight to behold, especially during sunrise when the sun


**FINAL NOTE**: We `monitor` our training progress in the wandb dashboard, and we should actually look at it, not just let it sit there. For example, in addition to the usual stuff like validation loss, pay attention to the gradient norms there to make sure the training process is actually stable. If it is not, tighten `gradient clipping` (the `max_grad_norm` argument) in the trainer config above. For now, that's all; it is 10:30pm, little Alfie is crying, and I need to take care of him.