# CPSC 477/577 Spring 2025, HW3
## Part 1 - Direct Preference Optimization

Yale University  
Spring 2025  
Instructor: Arman Cohan

# Goal

#### The goal of this homework assignment is to help students understand how to fine-tune language models with preference feedback using the Direct Preference Optimization (DPO) framework. You will also learn how to implement a training pipeline for language models.

**Acknowledgement:** _The assignment was designed by TA Yilun Zhao and former TA Kejian Shi, with help and guidance from Arman Cohan and Yixin Liu._



### Submission Instructions

Submit the notebook as a .ipynb file through GradeScope.

Make sure that the notebook is running without any errors before submission. Remove any unnecessary outputs or additional `print` or debugging statements that you put in the code before submission.

### Write your name and NetID below.

**Name:**    Yuan Chang

**NetID:**   yc2238


### Install and import packages

In [None]:
%pip install datasets gdown rouge-score


In [None]:
import json
import random
import numpy as np
import nltk
nltk.download('punkt')
from tqdm import tqdm
from functools import partial
from rouge_score import rouge_scorer
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
from typing import Dict, Union, List, Tuple, Optional
from transformers.models.deberta_v2.modeling_deberta_v2 import (
    DebertaV2PreTrainedModel,
    DebertaV2Model,
    SequenceClassifierOutput
)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


### GPU requirement
This notebook requires at least 15 GB of GPU memory (e.g. a T4 GPU on Google Colab).
You can run the notebook on Colab for free, but you are also encouraged to use the Grace HPC cluster if you have issues with T4 GPUs. If you are using a MacBook Pro with an M1 chip or later, you can also try running on your own local machine using the `mps` device backend. We have provided the code for this below.

If you want to use `mps` please follow installation instructions below:

```
# Only run this if you are using an Apple Silicon machine (M1 chip or later) with 18 GB of RAM
pip3 install --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/cpu
```

More information about installation: https://developer.apple.com/metal/pytorch/

### Using GPU in Colab
PyTorch and other deep learning libraries are much faster using GPU acceleration. Similar to the previous assignment, we will use GPU runtime to train and evaluate models.

1. Go to the Runtime menu at the top left.
2. Click "Change runtime type".
3. Select "GPU" as the hardware accelerator.
4. Click the SAVE button.

However, Colab limits the amount of time that you can use a free GPU, so you may wish to implement much of the assignment without the GPU. Note that you will have to run all cells again once you change the runtime type.

Let's first verify which device you are using.

In [None]:
# set the device
if torch.cuda.is_available():
    DEVICE = torch.device("cuda")
elif torch.backends.mps.is_available():
    DEVICE = torch.device("mps")
else:
    if not torch.backends.mps.is_built():
        print("MPS not available because the current PyTorch install was not "
              "built with MPS enabled.")
    else:
        print("MPS not available because the current MacOS version is not 12.3+ "
              "and/or you do not have an MPS-enabled device on this machine.")
    DEVICE = torch.device("cpu")
    print("Warning: You are using CPU. For better performance, use GPU.")
print("Pytorch version is: ", torch.__version__)
print("You are using: ", DEVICE)

MPS not available because the current PyTorch install was not built with MPS enabled.
Pytorch version is:  2.6.0+cu124
You are using:  cpu


### Setting up

In [None]:
SEED = 42

random.seed(SEED)
np.random.seed(SEED)

# Seed PyTorch unconditionally so that CPU-only runs are reproducible too
torch.manual_seed(SEED)

if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)
    torch.backends.cudnn.enabled = False
    torch.backends.cudnn.deterministic = True
elif torch.backends.mps.is_available():
    torch.mps.manual_seed(SEED)


Setting up data directories

In [None]:
!mkdir -p cache
!mkdir -p data
!mkdir -p data/dpo_origin
!mkdir -p data/dpo
!mkdir -p data/sft

In [None]:
!gdown "https://drive.google.com/uc?id=15BpDzxPRGW8cpIbVheJbbKPBDbp5LQdV" -O dpo_data.zip
!unzip -o dpo_data.zip -d data

Downloading...
From: https://drive.google.com/uc?id=15BpDzxPRGW8cpIbVheJbbKPBDbp5LQdV
To: /content/dpo_data.zip
  0% 0.00/2.58M [00:00<?, ?B/s]100% 2.58M/2.58M [00:00<00:00, 177MB/s]
Archive:  dpo_data.zip
  inflating: data/dpo_origin/train.jsonl  
  inflating: data/dpo_origin/.DS_Store  
  inflating: data/dpo_origin/val.jsonl  
  inflating: data/sft/test.jsonl     


## Introduction

In this assignment, you will explore and implement the Direct Preference Optimization (DPO) algorithm for fine-tuning language models. DPO is a technique that uses preference data to guide the model towards generating more desirable outputs. Given pairs of responses in which one is preferred over the other (as judged by humans or by a strong LLM), DPO trains the model to favor the preferred response, aligning its behavior with user preferences.

We will provide you with an SFT model (reference model), which is a [GPT-2-small](https://huggingface.co/openai-community/gpt2) (124M parameters) fine-tuned on the famous [tulu-v2-sft-mixture](https://huggingface.co/datasets/allenai/tulu-v2-sft-mixture) dataset.


We have further trained this reference model using DPO for a certain number of iterations, resulting in an instructor's version of the DPO model. The DPO data chosen is the AI2 [UltraFeedback](https://huggingface.co/datasets/allenai/ultrafeedback_binarized_cleaned) dataset, which will be provided to you via `gdown`. The dataset is annotated by GPT-4, a powerful model that has undergone heavy alignment tuning. So we expect the data to be of high quality.

Your task is to build upon the provided DPO model and implement the training loop and loss function to further fine-tune the model using an additional set of training examples from the same dataset (UltraFeedback). You will be provided with the necessary code templates and guidelines to complete the implementation.

After successfully fine-tuning the model, you will evaluate and compare the performance of three models:

1. The instructor's gpt2-SFT model
2. The instructor's DPO model
3. Your further fine-tuned DPO model

We hope you will gain insights into the effectiveness of DPO in improving the model's performance and aligning its outputs with human preferences. Note, however, that realizing the full potential of DPO requires a large amount of preference data, large models, and extensive computational resources, which is beyond the scope of this assignment.

We encourage you to actively engage with the provided materials, ask questions, and explore the concepts thoroughly.

You may find the following material useful, among many others:

* [The DPO paper](https://arxiv.org/abs/2305.18290)
* [A HuggingFace blog post](https://huggingface.co/blog/pref-tuning)



## DPO Recap


Recall that DPO bypasses the separate reward-modeling and reinforcement learning (RL) phase of conventional RLHF pipelines (e.g., those using the [PPO algorithm](https://arxiv.org/abs/1707.06347)) by directly optimizing the policy language model on pairwise preference data.

Note: Following the notation in the original [paper](https://arxiv.org/pdf/2305.18290.pdf), we use $y_w$ and $y_l$ to denote the `chosen` or `preferred` and the `rejected` or `dispreferred` completion amongst a pair of responses ($y_1$, $y_2$), respectively.


Within the framework of RLHF (Reinforcement Learning From Human Feedback), DPO's pipeline involves:



1. Sample responses $y_1, y_2 \sim \pi_{\text {ref }}(\cdot \mid x)$ from the reference model for given input prompts $x$. Then, we collect human preferences to label $y_w$ and $y_l$, where $y_w \succ y_l \mid x$ (that is, $y_w$ is preferred over $y_l$ given $x$). This process results in an offline dataset of preferences, $\mathcal{D}=\left\{x^{(i)}, y_w^{(i)}, y_l^{(i)}\right\}$.

2. Optimize the policy (model) $\pi_\theta$ by minimizing the DPO loss $\mathcal{L}_{\mathrm{DPO}}$ for a given reference model $\pi_{\text {ref }}$, $\mathcal{D}$, and $\beta$, which is a hyperparameter that controls the deviation from $\pi_{\text {ref }}$.
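
To make the notation concrete, here is a minimal sketch of what one element of the offline dataset $\mathcal{D}$ might look like in code. The record contents are made up for illustration; only the field names (`instruction`, `chosen`, `rejected`) are taken from the dataset classes used in this notebook, and the helper `to_triple` is hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical toy record. The field names mirror those read by the dataset
# classes in this notebook; the text itself is invented for illustration.
preference_data: List[Dict[str, str]] = [
    {
        "instruction": "Give me some options for healthy foods.",
        "chosen": "Some options are leafy greens, beans, berries, and whole grains.",
        "rejected": "I like to eat healthy.",
    },
]


def to_triple(record: Dict[str, str]) -> Tuple[str, str, str]:
    """Map one record to the (x, y_w, y_l) triple used in the DPO notation."""
    return record["instruction"], record["chosen"], record["rejected"]


x, y_w, y_l = to_triple(preference_data[0])
print(f"x   = {x}\ny_w = {y_w}\ny_l = {y_l}")
```

Each line of a preference file can then be read as one such triple: the prompt $x$, the preferred completion $y_w$, and the dispreferred completion $y_l$.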

## Prompting the base and SFT models

#### Load the raw `gpt2` model (the smallest version, 124M parameters) and prompt it.

In [None]:
raw_model_name = "openai-community/gpt2"
tokenizer = AutoTokenizer.from_pretrained('gpt2')  # load the tokenizer
raw_model = AutoModelForCausalLM.from_pretrained(raw_model_name).to(DEVICE)

prompt = "Q: Give me some options for healthy foods. \nA: "


In [None]:
input_ids = tokenizer.encode(prompt, return_tensors="pt").to(DEVICE)
output = raw_model.generate(input_ids,
                            max_length=64,
                            pad_token_id=tokenizer.eos_token_id)

# TODO: decode the output
output = tokenizer.decode(output[0], skip_special_tokens=True)
print(output)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Q: Give me some options for healthy foods. 
A:  I'm not sure if I can give you a list of all the options I have. 
Q:  What are some of the things you like about eating healthy? 
A:  I like to eat healthy.


#### Load our SFT model and make it say something

In [None]:
model_sft_name = "kejian/gpt2-instruct-tulu2"
tokenizer = AutoTokenizer.from_pretrained('gpt2')  # load the tokenizer
model_sft = AutoModelForCausalLM.from_pretrained(model_sft_name).to(DEVICE)  # load the reference model


Recall that SFT models are generally trained to respond to prompts (or instructions), and they do this better than base models.
During SFT, each training example is rewritten into an instruction template such as the following:


```
Instruction: {input.strip()}
Response: {output}
```
We should accordingly format our prompt in the same fashion.



In [None]:
input = "Give me some options for healthy foods.  "
formatted_input = f"Instruction: {input.strip()}\nResponse: "

input_ids = tokenizer.encode(formatted_input, return_tensors="pt").to(DEVICE)
output = model_sft.generate(input_ids,
                            max_length=64,
                            pad_token_id=tokenizer.eos_token_id)

# TODO: decode the output
output = tokenizer.decode(output[0], skip_special_tokens=True)
print(output)

Instruction: Give me some options for healthy foods.
Response: 
1. Choose a healthy food source: Eating a healthy food source can help you feel more healthy and feel more connected to your body. Here are some options:
1. Choose a healthy food source: Eating a healthy food source can help


#### Question 1 - How does the output of the SFT model differ from the base model (1 ~ 2 sentences)?

In [None]:
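response = "The base GPT-2 model simply continues the prompt as a question-and-answer dialogue without really answering, while the SFT model follows the instruction and starts listing options step by step (although it repeats itself)." # TODO: Write your answer here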

assert response != "", "Please write your answer"

Chances are you prefer the instruction-tuned model over the base model.

## Define data classes and collators

To train models in PyTorch, we first need to define and implement the data pipeline. The data pipeline consists of a dataset class (a subclass of the PyTorch `Dataset` class), which is responsible for loading the data and returning individual samples. The samples are then passed to a collator, which is responsible for grouping them and converting them into a batch of tensors.

Concretely, the dataset class should implement the following methods:  
`__len__`: returns the number of samples in the dataset  
`__getitem__`: returns a sample from the dataset  

Optionally, we can implement a `collator` function that applies additional preprocessing, batching, padding, etc. to the samples before they are passed to the model. The collator function is used by the PyTorch `DataLoader` class to create batches of data samples.
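
As a rough illustration of how these pieces fit together, the sketch below wires a toy dataset and collator into a `DataLoader`. The names `ToyDataset` and `toy_collate` are hypothetical, and the collator uses simple right padding via `pad_sequence`, which is a simplification and not the padding scheme of the classes provided in this notebook.

```python
from functools import partial

import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset


class ToyDataset(Dataset):
    """Hypothetical dataset holding variable-length lists of token ids."""

    def __init__(self, sequences):
        self.sequences = sequences

    def __len__(self):
        # Number of samples in the dataset
        return len(self.sequences)

    def __getitem__(self, idx):
        # One sample: a dict with a 1-D tensor of token ids
        return {"input_ids": torch.tensor(self.sequences[idx], dtype=torch.long)}


def toy_collate(batch, pad_token_id):
    """Right-pad every sample to the longest sequence in the batch."""
    seqs = [x["input_ids"] for x in batch]
    input_ids = pad_sequence(seqs, batch_first=True, padding_value=pad_token_id)
    # 1 over real tokens, 0 over padding
    attention_mask = pad_sequence(
        [torch.ones_like(s) for s in seqs], batch_first=True, padding_value=0
    )
    return {"input_ids": input_ids, "attention_mask": attention_mask}


loader = DataLoader(
    ToyDataset([[5, 6, 7], [8, 9], [10]]),
    batch_size=2,
    shuffle=False,
    collate_fn=partial(toy_collate, pad_token_id=0),
)
for batch in loader:
    print(batch["input_ids"].tolist(), batch["attention_mask"].tolist())
```

Binding `pad_token_id` with `functools.partial` is the same pattern used later in this notebook to pass extra arguments to a collator.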

We provide the classes and functions to you. Please read the code carefully and then run the following cell. You should familiarize yourself with how to write Dataset classes for a model.

#### In a later section of the notebook, you will implement a `DPOGPTDataset` class and a collator function for training your model.

If you haven't trained a model before, we encourage you to carefully examine the code in the following cell and understand how the data pipeline is implemented.

In [None]:
class SFTGPTDataset(Dataset):
    def __init__(self, data, model_type, max_len=1024, is_test=False):
        """ data format: an indexable collection of dicts, each with "instruction" and "response" keys """
        self.data = data
        self.tok = AutoTokenizer.from_pretrained(model_type, verbose=False)
        self.tok.pad_token = self.tok.eos_token
        self.tok.pad_token_id = self.tok.eos_token_id
        self.max_len = max_len
        self.is_test = is_test
        self.num = len(self.data)
        self.SRC_MARKER = "Instruction: "
        self.TGT_MARKER = "\nResponse: "

    def __len__(self):
        return self.num

    def __getitem__(self, idx):
        data = self.data[idx]
        src_txt = self.SRC_MARKER + data["instruction"].strip() + self.TGT_MARKER
        data["instruction"] = src_txt
        src = self.tok([src_txt], return_tensors="pt", padding="longest", truncation=False)
        src_input_ids = src["input_ids"]
        src_input_ids = src_input_ids.squeeze(0)
        text = src_txt + data["response"].strip()
        encoded = self.tok([text], return_tensors="pt", padding="longest", truncation=False)
        input_ids = encoded["input_ids"]
        input_ids = input_ids.squeeze(0)
        assert input_ids.size(0) <= self.max_len - 1
        # add eos token
        input_ids = torch.cat([input_ids, input_ids.new_ones(1) * self.tok.eos_token_id])
        # we only need to train on the target part
        masks = torch.zeros_like(input_ids)
        masks[src_input_ids.size(0):] = 1
        if self.is_test:
            result = {
                "input_ids": src_input_ids,
                "masks": masks,
            }
        else:
            result = {
                "input_ids": input_ids,
                "masks": masks,
                }
        if self.is_test:
            result["data"] = data
        return result


def collate_base_gpt(batch, pad_token_id, is_test=False):
    def pad(X, padding, max_len=-1):
        if max_len < 0:
            max_len = max(x.size(0) for x in X)
        result = torch.ones(len(X), max_len, dtype=X[0].dtype) * padding
        attention_mask = torch.zeros(len(X), max_len, dtype=X[0].dtype)
        for (i, x) in enumerate(X):
            result[i, -x.size(0):] = x
            attention_mask[i, -x.size(0):] = 1
        return result, attention_mask

    input_ids, attention_mask = pad([x["input_ids"] for x in batch], pad_token_id)
    masks, _ = pad([x["masks"] for x in batch], 0)
    if is_test:
        data = [x["data"] for x in batch]
    result = {
        "input_ids": input_ids,
        "masks": masks,
        "attention_mask": attention_mask,
        }
    if is_test:
        result["data"] = data
    return result


class BaseDPOGPTDataset(Dataset):
    def __init__(self, data, model_type, max_len=1024, is_test=False):
        """ data format: a JSONL file path or an indexable collection of dicts, each with "instruction", "chosen", and "rejected" keys """
        if isinstance(data, str):
            with open(data) as f:
                self.data = [json.loads(x) for x in f]
        else:
            self.data = data
        self.tok = AutoTokenizer.from_pretrained(model_type, verbose=False)
        self.tok.pad_token = self.tok.eos_token
        self.tok.pad_token_id = self.tok.eos_token_id
        self.max_len = max_len
        self.is_test = is_test
        self.num = len(self.data)
        self.SRC_MARKER = "Instruction: "
        self.TGT_MARKER = "\nResponse: "

    def __len__(self):
        return self.num

    def encode(self, src_len, text):
        encoded = self.tok([text], return_tensors="pt", padding="longest", truncation=False)
        input_ids = encoded["input_ids"]
        input_ids = input_ids.squeeze(0)
        assert input_ids.size(0) <= self.max_len - 1
        # add eos token
        # if not self.is_test:
        input_ids = torch.cat([input_ids, input_ids.new_ones(1) * self.tok.eos_token_id])
        # we only need to keep the target part
        masks = torch.zeros_like(input_ids)
        masks[src_len:] = 1
        return input_ids, masks


    def __getitem__(self, idx):
        data = self.data[idx]
        src_txt = self.SRC_MARKER + data["instruction"].strip() + self.TGT_MARKER
        data["raw_instruction"] = data["instruction"]
        data["instruction"] = src_txt
        src = self.tok([src_txt], return_tensors="pt", padding="longest", truncation=False)
        src_input_ids = src["input_ids"]
        src_input_ids = src_input_ids.squeeze(0)
        chosen_input_ids, chosen_masks = self.encode(src_input_ids.size(0), src_txt + data["chosen"].strip())
        rejected_input_ids, rejected_masks = self.encode(src_input_ids.size(0), src_txt + data["rejected"].strip())
        result = {
            "chosen_input_ids": chosen_input_ids,
            "chosen_masks": chosen_masks,
            "rejected_input_ids": rejected_input_ids,
            "rejected_masks": rejected_masks,
            }
        if self.is_test:
            result["data"] = data
            result["src_input_ids"] = src_input_ids
        return result


def collate_base_dpo_gpt(batch, pad_token_id, is_test=False):
    def pad(X, padding, max_len=-1, pad_side="left"):
        assert pad_side in ["left", "right"]
        if max_len < 0:
            max_len = max(x.size(0) for x in X)
        result = torch.ones(len(X), max_len, dtype=X[0].dtype) * padding
        attention_mask = torch.zeros(len(X), max_len, dtype=X[0].dtype)
        for i, x in enumerate(X):
            if pad_side == "left":
                result[i, -x.size(0) :] = x
                attention_mask[i, -x.size(0) :] = 1
            else:
                result[i, : x.size(0)] = x
                attention_mask[i, : x.size(0)] = 1
        return result, attention_mask

    # pad chosen
    chosen_input_ids, chosen_attention_mask = pad(
        [x["chosen_input_ids"] for x in batch], pad_token_id, pad_side="left"
    )
    chosen_masks, _ = pad([x["chosen_masks"] for x in batch], 0, pad_side="left")

    # pad rejected
    rejected_input_ids, rejected_attention_mask = pad(
        [x["rejected_input_ids"] for x in batch], pad_token_id, pad_side="left"
    )
    rejected_masks, _ = pad([x["rejected_masks"] for x in batch], 0, pad_side="left")

    # concatenate
    input_ids = torch.unbind(chosen_input_ids) + torch.unbind(rejected_input_ids)
    attention_mask = torch.unbind(chosen_attention_mask) + torch.unbind(rejected_attention_mask)
    masks = torch.unbind(chosen_masks) + torch.unbind(rejected_masks)

    # right pad now
    input_ids, _attention_mask = pad(input_ids, pad_token_id, pad_side="right")
    attention_mask, _ = pad(attention_mask, 0, pad_side="right")
    attention_mask = attention_mask * _attention_mask
    masks, _ = pad(masks, 0, pad_side="right")

    result = {
        "input_ids": input_ids,
        "masks": masks,
        "attention_mask": attention_mask,
    }

    if is_test:
        result["data"] = [x["data"] for x in batch]
        src_input_ids, src_input_mask = pad([x["src_input_ids"] for x in batch], pad_token_id, pad_side="left")
        result["src_input_ids"] = src_input_ids
        result["src_input_mask"] = src_input_mask
    return result

## Question 2 - Given the provided code, answer the following questions:

Q2.1 The `collate_base_gpt` and `collate_base_dpo_gpt` functions implement custom padding for batches of data. Describe how padding is applied to the input IDs and masks. What does the pad_side argument control in the collate_base_dpo_gpt function, and why might this be important?

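**Answer:** Both collators pad each tensor to the length of the longest sequence in the batch, filling `input_ids` with the tokenizer's `pad_token_id` and the binary `masks` with 0; the attention mask is 1 over real tokens and 0 over padding. `collate_base_gpt` always pads on the left. In `collate_base_dpo_gpt`, `pad_side` selects whether the padding goes before (`"left"`) or after (`"right"`) the real tokens: the chosen and rejected groups are each left-padded, and the two groups are then right-padded to a common length so they can be stacked into one batch. The side matters because a decoder-only model generates from the last position, so left padding keeps the end of the prompt adjacent to the generated tokens.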

Q2.2 In the context of this code, what is the purpose of the attention mask (attention_mask) and how is it different from the masks (masks) used in the dataset classes?

**Answer:** The attention mask tells the model which positions hold real tokens that the attention module should use and which positions are padding to be ignored. The `masks` tensor instead marks where the loss (or log-likelihood) is computed: it is 0 over the prompt and padding and 1 over the response, so only response tokens are scored.

Q2.3 How does the output of the `__getitem__` method in the BaseDPOGPTDataset class change when `is_test` is set to `True`? What additional information is included in the result, and why might this be useful for testing?

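**Answer:** When `is_test` is set to `True`, the result also includes `data`, the original example (with the raw instruction and the formatted prompt text), and `src_input_ids`, the token IDs of the prompt alone. This makes it possible to generate from the prompt and to keep each result next to its original example when testing the model.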

Q2.4 How do the dataset classes handle inputs of varying lengths, especially in terms of padding and truncation? Discuss the importance of dynamic input handling in the context of NLP models.

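**Answer:** Truncation is disabled (`truncation=False`); instead, each dataset asserts that the tokenized sequence fits within `max_len - 1` tokens. Padding is dynamic: the collators pad every sequence only up to the longest one in the current batch. `collate_base_gpt` pads on the left, while `collate_base_dpo_gpt` left-pads the chosen and rejected groups and then right-pads them to a common length. Padding per batch, rather than to a fixed global length, avoids wasted computation on padding tokens while still giving the model rectangular tensors, which makes training both accurate and efficient.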
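
The difference between the attention mask and the loss `masks` can be seen on a toy batch. The sketch below is a simplified, hypothetical re-implementation of a padding helper (`pad_batch`) with made-up token ids; it is meant only to illustrate the idea and is not the notebook's collator.

```python
import torch

PAD = 0  # hypothetical pad token id


def pad_batch(seqs, padding, pad_side="left"):
    """Pad 1-D tensors to a common length; return (padded, attention_mask)."""
    max_len = max(s.size(0) for s in seqs)
    out = torch.full((len(seqs), max_len), padding, dtype=seqs[0].dtype)
    attn = torch.zeros(len(seqs), max_len, dtype=torch.long)
    for i, s in enumerate(seqs):
        n = s.size(0)
        if pad_side == "left":
            out[i, max_len - n:] = s
            attn[i, max_len - n:] = 1
        else:
            out[i, :n] = s
            attn[i, :n] = 1
    return out, attn


# Two toy examples: prompt tokens followed by response tokens.
ids = [torch.tensor([11, 12, 21, 22, 23]), torch.tensor([13, 31])]
# Loss masks: 0 over the prompt, 1 over the response.
loss_masks = [torch.tensor([0, 0, 1, 1, 1]), torch.tensor([0, 1])]

input_ids, attention_mask = pad_batch(ids, PAD, pad_side="left")
masks, _ = pad_batch(loss_masks, 0, pad_side="left")

print(input_ids.tolist())       # padded token ids
print(attention_mask.tolist())  # 1 = real token, 0 = padding
print(masks.tolist())           # 1 = response token that is scored
```

In the second row, the prompt token `13` has attention mask 1 (the model should attend to it) but loss mask 0 (it is not scored), which is exactly the distinction asked about in Q2.2.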

## Generate logits using reference (SFT) model:

Computing log-likelihoods under a reference model is an important part of the DPO algorithm. We use the SFT model as the reference policy and compute the log probabilities of the chosen and rejected responses (the code stores them in a variable named `logits`).

The `compute_likelihood` function below will handle this step. It takes the SFT model, computes the log probabilities of the chosen and rejected responses, and saves the results along with other relevant information.

Finish the `#TODO`s and proceed to the following cells.
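
As a warm-up, the sketch below shows the shift, gather, and mask pattern on random numbers instead of real model outputs. The tensor values, sizes, and variable names are illustrative assumptions; only the overall pattern matches what `compute_likelihood` needs.

```python
import torch

torch.manual_seed(0)
vocab_size, seq_len = 6, 5

# Hypothetical stand-ins for a model's raw output and one tokenized sequence.
raw_logits = torch.randn(1, seq_len, vocab_size)  # (batch, seq, vocab)
input_ids = torch.tensor([[1, 4, 2, 5, 3]])
masks = torch.tensor([[0, 0, 1, 1, 1]])           # first two tokens are the prompt

# Position t predicts token t + 1, so drop the last prediction and the first label.
log_probs = torch.log_softmax(raw_logits[:, :-1], dim=-1)
labels = input_ids[:, 1:]

# Pick the log probability assigned to each actual next token.
token_logps = log_probs.gather(2, labels.unsqueeze(2)).squeeze(2)  # (batch, seq - 1)

# Zero out prompt positions, then sum over the response tokens.
shifted_mask = masks[:, 1:]
sequence_logp = (token_logps * shifted_mask).sum(dim=1)
num_tokens = shifted_mask.sum(dim=1)

print(sequence_logp, num_tokens)
```

The mask is shifted by one position together with the labels, so that each mask entry lines up with the token being predicted.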




In [None]:
def compute_likelihood(model, tokenizer, device, src_dir, tgt_dir, batch_size=8, num_workers=2, max_len=1024):
    """
    Compute the model-predicted likelihood for the chosen and rejected responses.

    Args:
        model (AutoModelForCausalLM): The reference model.
        tokenizer (AutoTokenizer): The tokenizer associated with the model.
        device (torch.device): The device to run the model on.
        src_dir (str): Path to the source JSONL file.
        tgt_dir (str): Path of the JSONL file to write the processed data to.
        batch_size (int): The batch size for data loading (default: 8).
        num_workers (int): The number of worker processes for data loading (default: 2).
        max_len (int): The maximum sequence length (default: 1024).

    Steps:
        1. Put the model in evaluation mode and set the tokenizer's pad token.
        2. Load the dataset and create a DataLoader.
        3. Iterate over the batches and compute the log probabilities of the chosen and rejected responses.
        4. Save the computed likelihoods along with other relevant information.

    Returns:
        None
    """
    model.eval()
    tokenizer.pad_token_id = tokenizer.eos_token_id
    tokenizer.pad_token = tokenizer.eos_token

    dataset = load_dataset("json", data_files=src_dir)["train"]

    dataset = BaseDPOGPTDataset(
        dataset,
        model_type=model.config.model_type,
        max_len=max_len,
        is_test=True,
    )
    collate_fn = partial(
        collate_base_dpo_gpt, pad_token_id=tokenizer.pad_token_id, is_test=True
    )
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=False,
        num_workers=num_workers,
        collate_fn=collate_fn,
    )

    with torch.no_grad():
        with open(tgt_dir, "w") as f:
            for batch in tqdm(dataloader, desc="Computing likelihoods"):
                input_ids = batch["input_ids"].to(device)
                attention_mask = batch["attention_mask"].to(device)
                # Compute the model output
                output = model(
                    input_ids=input_ids,
                    attention_mask=attention_mask,
                    output_hidden_states=False
                )

                output = output[0]
                output = output[:, :-1]  # truncate last logit
                labels = input_ids[:, 1:]  # shift labels

                ### TODO: Compute the logits: 1 line
                # Despite the name, `logits` holds log probabilities from here on.
                logits = torch.log_softmax(output, dim=-1)

                # Pick the log probability of each actual next token.
                logits = logits.gather(2, labels.unsqueeze(2)).squeeze(2)

                ### TODO: Mask the prompt tokens and compute
                ### the log probabilities only for the completion tokens ~3 lines
                mask = batch["masks"].to(device)[:, 1:]  # shift to align with labels
                logits = logits * mask
                num_tokens = mask.sum(dim=1)  # completion tokens per sequence

                batch_size = logits.size(0) // 2
                pos_logits, neg_logits = logits[:batch_size], logits[batch_size:]
                pos_num_tokens, neg_num_tokens = num_tokens[:batch_size], num_tokens[batch_size:]

                # Save the computed likelihoods along with other relevant information
                for i in range(batch_size):
                    data = batch["data"][i]
                    data["instruction"] = data["raw_instruction"]
                    del data["raw_instruction"]
                    # TODO: Assigning values to proper keys - 4 lines
                    # i.e. data['key'] = value
                    data["chosen_log_likelihood"]   = pos_logits[i].sum().item()
                    data["rejected_log_likelihood"] = neg_logits[i].sum().item()
                    data["chosen_num_tokens"]       = pos_num_tokens[i].item()
                    data["rejected_num_tokens"]     = neg_num_tokens[i].item()

                    print(json.dumps(data), file=f, flush=True) # keep this line as it is

#### Question 3: Why do we want to mask out the prompt part? What would change in the DPO loss if we do not mask out the prompt part?

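**Answer:** We mask the prompt so that each stored log-likelihood is $\log \pi(y \mid x)$, the probability of the response alone, which is the quantity that appears in the DPO loss. Without masking, we would be summing over the prompt tokens as well. Because the chosen and rejected responses share the same prompt, the prompt term appears in both log-ratios and in principle cancels in their difference, so the loss itself would change little; however, the per-response log-likelihoods, rewards, and token counts would be contaminated by prompt tokens, and compute would be spent on terms that carry no preference signal.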
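
One way to reason about this question is with a small numeric check. The sketch below uses made-up log-likelihood values and a hypothetical helper `dpo_margin` (the quantity inside the log-sigmoid of the DPO loss); it assumes the prompt contributes exactly the same log-likelihood to the chosen and rejected sequences under a given model, which holds in principle but may be only approximate in practice.

```python
import math


def dpo_margin(pol_w, pol_l, ref_w, ref_l, beta=0.1):
    """beta * [(pol_w - ref_w) - (pol_l - ref_l)], the argument of log-sigmoid."""
    return beta * ((pol_w - ref_w) - (pol_l - ref_l))


# Hypothetical response-only log-likelihoods (prompt masked out).
pol_w, pol_l, ref_w, ref_l = -12.0, -15.0, -13.0, -14.0
# Hypothetical log-likelihoods of the shared prompt under each model.
pol_prompt, ref_prompt = -20.0, -22.0

masked = dpo_margin(pol_w, pol_l, ref_w, ref_l)
unmasked = dpo_margin(
    pol_w + pol_prompt, pol_l + pol_prompt,
    ref_w + ref_prompt, ref_l + ref_prompt,
)

# The margin is unchanged, but each individual log-ratio shifts by
# (pol_prompt - ref_prompt) when the prompt is not masked.
print(masked, unmasked, math.isclose(masked, unmasked))
print("chosen log-ratio, masked vs unmasked:",
      pol_w - ref_w, (pol_w + pol_prompt) - (ref_w + ref_prompt))
```

Under this assumption the margin is the same with and without the prompt terms, while the individual log-ratios (and therefore any logged rewards) are shifted by a constant.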



In [None]:
for split in ["train","val"]:
    src_dir = f"data/dpo_origin/{split}.jsonl"
    tgt_dir = f"data/dpo/{split}.jsonl"
    compute_likelihood(model_sft, tokenizer, DEVICE, src_dir, tgt_dir, batch_size=1, num_workers=2, max_len=1024)

Generating train split: 0 examples [00:00, ? examples/s]

Computing likelihoods:  38%|███▊      | 770/2000 [1:11:19<1:53:56,  5.56s/it]


KeyboardInterrupt: 

### Implementing DPO loss

Recall the DPO loss from the [paper](https://arxiv.org/pdf/2305.18290.pdf):

$$
\mathcal{L}_{\mathrm{DPO}}\left(\pi_\theta ; \pi_{\mathrm{ref}}\right)=-\mathbb{E}_{\left(x, y_w, y_l\right) \sim \mathcal{D}}\left[\log \sigma\left(\beta \log \frac{\pi_\theta\left(y_w \mid x\right)}{\pi_{\mathrm{ref}}\left(y_w \mid x\right)}-\beta \log \frac{\pi_\theta\left(y_l \mid x\right)}{\pi_{\mathrm{ref}}\left(y_l \mid x\right)}\right)\right]
$$
Here, $\sigma$ denotes the logistic function, $\mathbb{E}$ is the expectation, $\log \frac{\pi_\theta\left(y_w \mid x\right)}{\pi_{\text {ref }}\left(y_w \mid x\right)}$ is the logarithm of the ratio of the probability of the chosen/preferred response $y_w$ according to the policy parameterized by $\theta$ compared to the reference policy, and finally, $\log \frac{\pi_\theta\left(y_l \mid x\right)}{\pi_{\text {ref }}\left(y_l \mid x\right)}$ denotes the logarithm of the same probability ratio but for the dispreferred outcome $y_l$. The objective is to increase the likelihood of the chosen/preferred responses $y_w$ and decrease that of the rejected/dispreferred response $y_l$.
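
To get a feel for the formula, the sketch below evaluates the loss for a single preference pair using plain floats. The function name `dpo_loss_single` and the input values are hypothetical; it is a scalar illustration, not the batched implementation requested below.

```python
import math


def dpo_loss_single(policy_chosen_logp, policy_rejected_logp,
                    reference_chosen_logp, reference_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, written with plain floats."""
    chosen_logratio = policy_chosen_logp - reference_chosen_logp
    rejected_logratio = policy_rejected_logp - reference_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) is the same as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))


# Policy identical to the reference: the margin is 0, so the loss is log(2).
same = dpo_loss_single(-10.0, -12.0, -10.0, -12.0)
# Policy raises the chosen response and lowers the rejected one: the loss drops.
better = dpo_loss_single(-8.0, -14.0, -10.0, -12.0)
# Policy moves in the wrong direction: the loss rises above log(2).
worse = dpo_loss_single(-12.0, -10.0, -10.0, -12.0)

print(same, better, worse)
```

The loss equals $\log 2$ when the policy matches the reference and decreases as the policy shifts probability from $y_l$ towards $y_w$ relative to the reference; $\beta$ scales how strongly a given shift changes the loss.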

Complete the code for `compute_dpo_loss()`. Note in the function signature, “chosen” means "preferred" and “rejected” means "dispreferred" completions.


In [None]:
def compute_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                     reference_chosen_logps, reference_rejected_logps,
                     beta = 0.1):

    """Compute the DPO loss for a batch of policy and reference model log probabilities.

    Args:
        policy_chosen_logps: Log probabilities of the policy model for the chosen responses. Shape: (batch_size,)
        policy_rejected_logps: Log probabilities of the policy model for the rejected responses. Shape: (batch_size,)
        reference_chosen_logps: Log probabilities of the reference model for the chosen responses. Shape: (batch_size,)
        reference_rejected_logps: Log probabilities of the reference model for the rejected responses. Shape: (batch_size,)
        beta: Temperature parameter for the DPO loss

    Returns:
        A tuple of three tensors: (losses, chosen_rewards, rejected_rewards).
        The losses tensor contains the DPO loss for each example in the batch.
        The chosen_rewards and rejected_rewards tensors contain the rewards for the chosen and rejected responses, respectively.
    """
    # Hint: Use F.logsigmoid()

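    # TODO: Implement the DPO loss computation
    # Log-ratios of the policy to the reference for each response
    chosen_logratios = policy_chosen_logps - reference_chosen_logps
    rejected_logratios = policy_rejected_logps - reference_rejected_logps
    losses = -F.logsigmoid(beta * (chosen_logratios - rejected_logratios))
    # Implicit rewards as defined in the DPO paper: beta * log(pi_theta / pi_ref)
    chosen_rewards = beta * chosen_logratios
    rejected_rewards = beta * rejected_logratios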

    return losses, chosen_rewards, rejected_rewards
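
The hint to use `F.logsigmoid()` is likely about numerical stability: composing `log` and `sigmoid` by hand can overflow or produce `-inf` for large negative margins. The sketch below illustrates the idea in plain Python; `naive_logsigmoid` and `stable_logsigmoid` are hypothetical helper names, and the stable form shown is one common formulation rather than a claim about PyTorch's internals.

```python
import math

def naive_logsigmoid(z):
    # Direct composition: exp(-z) overflows for large negative z
    return math.log(1.0 / (1.0 + math.exp(-z)))

def stable_logsigmoid(z):
    # log sigma(z) = min(z, 0) - log(1 + exp(-|z|)); the exponent is never positive
    return min(z, 0.0) - math.log1p(math.exp(-abs(z)))

for z in (0.0, 5.0, -5.0):
    # Both versions agree for moderate inputs
    print(z, naive_logsigmoid(z), stable_logsigmoid(z))

try:
    naive_logsigmoid(-800.0)
except OverflowError as err:
    print("naive version failed:", err)

print(stable_logsigmoid(-800.0))
```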


#### Now, let's load the initial DPO model:

Training a DPO model from scratch is computationally expensive, so we provide an initial DPO model: an SFT model that has already been trained for a number of iterations on the UltraFeedback dataset using the DPO loss.

You will further fine-tune this model on another 2k examples from the same dataset using DPO.




In [None]:
dpo_model_name = "kejian/gpt2-tulu2-DPO"
dpo_model = AutoModelForCausalLM.from_pretrained(dpo_model_name).to(DEVICE)  # load the initial DPO model

In [None]:
tokenizer.pad_token = tokenizer.eos_token
tokenizer.pad_token_id = tokenizer.eos_token_id

Define the testing loop:

In [None]:
def test(dataloader, model, beta, device):
    model.eval()
    all_loss = 0
    batch_cnt = 0

    with torch.no_grad():
        for batch in tqdm(dataloader, desc="Evaluating"):
            batch = {k: v.to(device) for k, v in batch.items()}
            input_ids = batch["input_ids"]
            attention_mask = batch["attention_mask"]

            output = model(
                input_ids=input_ids,
                attention_mask=attention_mask,
                output_hidden_states=False
            )
            output = output[0]
            output = output[:, :-1]  # truncate last logit
            labels = input_ids[:, 1:]  # shift labels
            output = output.to(torch.float32)

            # TODO: Compute the logits and mask the prompt parts; 4 ~ 6 lines
            # This is the same as what you did for compute_likelihood(); You can copy over what you had
            logits = torch.log_softmax(output, dim=-1)
            logits = logits.gather(2, labels.unsqueeze(2)).squeeze(2)
            mask = batch["masks"].to(device)[:, 1:]
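            num_tokens = mask.sum(dim=1)
            # Sum over response tokens so each sequence yields a single log-probability,
            # matching the (batch_size,) shape that compute_dpo_loss() expects.
            # If compute_likelihood() length-normalized the reference values,
            # divide by num_tokens here as well.
            logits = (logits * mask).sum(dim=1)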
            batch_size = logits.size(0) // 2
            pos_logits, neg_logits = logits[:batch_size], logits[batch_size:]
            pos_ref_logits = batch["chosen_logprob"]
            neg_ref_logits = batch["rejected_logprob"]
            loss, _, _ = compute_dpo_loss(pos_logits, neg_logits, pos_ref_logits, neg_ref_logits, beta)
            loss = loss.mean()
            all_loss += loss.item()
            batch_cnt += 1

    loss = all_loss / batch_cnt
    model.train()
    return loss
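
The shift, gather, and mask steps used in `test()` can be illustrated on a toy sequence without PyTorch. This is a sketch under the assumption that the mask is 1 on response tokens and 0 on prompt tokens; `response_logprob` and the toy inputs are made up for illustration.

```python
import math

def log_softmax(row):
    # Subtract the max before exponentiating for numerical stability
    m = max(row)
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_z for v in row]

def response_logprob(logits, input_ids, mask):
    """Sum log p(token_{t+1} | tokens_{<=t}) over positions where the mask is 1."""
    total = 0.0
    # The logits at position t predict token t+1, so drop the last logit row
    # and shift the labels and the mask one step to the left.
    for row, label, keep in zip(logits[:-1], input_ids[1:], mask[1:]):
        if keep:
            total += log_softmax(row)[label]
    return total

# Vocabulary of size 3, sequence of 4 tokens; the first two tokens are the prompt.
logits = [[0.0, 0.0, 0.0] for _ in range(4)]  # uniform predictions
input_ids = [0, 2, 1, 1]
mask = [0, 0, 1, 1]

lp = response_logprob(logits, input_ids, mask)
print(lp)  # two response tokens, each with probability 1/3
```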

### Creating the `DPOGPTDataset` class and the `collate_fn` of the dataloaders for the DPO training loop

In [None]:
class DPOGPTDataset(BaseDPOGPTDataset):
    """
    Dataset class for DPO (Direct Preference Optimization) training.

    This class inherits from the BaseDPOGPTDataset class and adds functionality specific to DPO training.

    TODO: Implement the __getitem__ method to retrieve the chosen and rejected log probabilities for each data point.
    Hint: Use the superclass's __getitem__ method to get the base result and add something.
    Hint: Look at run_dpo() below to see what is needed from a batch.
    """
    def __getitem__(self, idx):
        # TODO: Implement this function
        item = super().__getitem__(idx)
        data = self.data[idx]
        item["chosen_logprob"] = torch.tensor(data["chosen_log_likelihood"])
        item["rejected_logprob"] = torch.tensor(data["rejected_log_likelihood"])
        return item




def collate_dpo_gpt(batch, pad_token_id, is_test=False):
    """
    This function collates a batch of data points and prepares them for DPO training.

    TODO: Implement the collate_dpo_gpt function to collate the chosen and rejected log probabilities for each batch.
    Hint: Use the collate_base_dpo_gpt function from the base class to get the base results and add something.
    You should need no more than 5 lines of code.
    """
    # TODO: Implement this function
    result = collate_base_dpo_gpt(batch, pad_token_id, is_test)
    result["chosen_logprob"] = torch.stack([x["chosen_logprob"] for x in batch])
    result["rejected_logprob"] = torch.stack([x["rejected_logprob"] for x in batch])
    return result


### The main training loop:

The code below defines the main training loop for the DPO model.
Read the following code carefully, run the cell, and then answer the following questions.


In [None]:
def run_dpo(model, dataloader, val_dataloader, optimizer, device, epochs=1, accumulate_step=4,
            report_freq=10, eval_interval=-1, max_lr=2e-3, warmup_steps=1000, beta=0.1, grad_norm=0):
    """
    Run Direct Preference Optimization (DPO) training.

    Args:
        model (AutoModelForCausalLM): The initial DPO model to be fine-tuned.
        dataloader (DataLoader): The data loader for the training dataset.
        val_dataloader (DataLoader): The data loader for the validation dataset.
        optimizer (torch.optim.Optimizer): The optimizer for training.
        device (torch.device): The device to run the model on.
        epochs (int): The number of epochs to train (default: 1).
        accumulate_step (int): The number of steps to accumulate gradients (default: 4).
        report_freq (int): The frequency of reporting training progress (default: 10).
        eval_interval (int): The interval for performing evaluation (default: -1, no evaluation).
        max_lr (float): The maximum learning rate (default: 2e-3).
        warmup_steps (int): The number of warmup steps (default: 1000).
        beta (float): The beta value for DPO loss (default: 0.1).
        grad_norm (float): The maximum gradient norm for clipping (default: 0, no clipping).

    Returns:
        AutoModelForCausalLM: The dpo-tuned model.
    """
    model.train()
    all_step_cnt = 0
    minimum_loss = 1e5

    for epoch in range(epochs):
        optimizer.zero_grad()
        step_cnt = 0
        epoch_step = 0
        avg_loss = 0

        for i, batch in enumerate(tqdm(dataloader, desc=f"Epoch {epoch+1}/{epochs}")):
                batch = {k: v.to(device) for k, v in batch.items()}
                step_cnt += 1

                input_ids = batch["input_ids"]
                attention_mask = batch["attention_mask"]

                # TODO: Compute the model output; The model should return the logits (not the hidden states)

                output = model(
                    input_ids=input_ids,
                    attention_mask=attention_mask,
                    output_hidden_states=False
                )


                output = output[0]
                output = output[:, :-1]  # truncate last logit
                labels = input_ids[:, 1:]  # shift labels

                # TODO: Compute the logits and mask the prompt parts; 4 ~ 6 lines
                # This is the same as what you did for compute_likelihood() and test(); You can copy over what you had
                logits = torch.log_softmax(output, dim=-1)
                logits = logits.gather(2, labels.unsqueeze(2)).squeeze(2)
                mask = batch["masks"].to(device)[:, 1:]
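                num_tokens = mask.sum(dim=1)
                # Sum over response tokens so each sequence yields a single log-probability,
                # matching the (batch_size,) shape that compute_dpo_loss() expects.
                # If compute_likelihood() length-normalized the reference values,
                # divide by num_tokens here as well.
                logits = (logits * mask).sum(dim=1)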

                batch_size = logits.size(0) // 2
                pos_logits, neg_logits = logits[:batch_size], logits[batch_size:]
                pos_ref_logits = batch["chosen_logprob"]
                neg_ref_logits = batch["rejected_logprob"]

                # Compute DPO loss
                loss, _, _ = compute_dpo_loss(pos_logits, neg_logits, pos_ref_logits, neg_ref_logits, beta)
                loss = loss.mean()
                loss = loss / accumulate_step
                avg_loss += loss.item()

                loss.backward()
                if step_cnt == accumulate_step:
                    # updating
                    if grad_norm > 0:
                        nn.utils.clip_grad_norm_(model.parameters(), grad_norm)
                    step_cnt = 0
                    epoch_step += 1
                    all_step_cnt += 1
                    # adjust learning rate
                    lr = max_lr * min(all_step_cnt ** (-0.5), all_step_cnt * (warmup_steps ** (-1.5)))
                    for param_group in optimizer.param_groups:
                        param_group['lr'] = lr
                    # TODO: Perform gradient update
                    optimizer.step()
                    optimizer.zero_grad()


                if epoch_step % report_freq == 0 and step_cnt == 0:
                    # Report training progress
                    print(f"Epoch {epoch+1}/{epochs} | Batch {epoch_step} | Avg Loss: {avg_loss / report_freq:.6f}")
                    avg_loss = 0

                del loss, output
                # Validation
                if (eval_interval > 0 and all_step_cnt % eval_interval == 0 and all_step_cnt > 0 and step_cnt == 0) or (i == len(dataloader) - 1):
                    val_loss = test(val_dataloader, model, beta, device)
                    print(f"Validation Loss: {val_loss:.6f}")
                    if val_loss < minimum_loss:
                        minimum_loss = val_loss
                        print("best val loss - epoch: %d"%(epoch))

    # Return only after all epochs have finished
    return model
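
The learning-rate rule inside `run_dpo()` is a linear warmup followed by inverse-square-root decay. The sketch below reproduces that formula in isolation; `inverse_sqrt_lr` is a hypothetical helper name, and the values `max_lr=1e-3` and `warmup_steps=70` are taken from the training cell later in this notebook.

```python
def inverse_sqrt_lr(step, max_lr, warmup_steps):
    """Same rule as in run_dpo(): linear warmup, then step ** -0.5 decay."""
    return max_lr * min(step ** -0.5, step * warmup_steps ** -1.5)

max_lr, warmup_steps = 1e-3, 70

for step in (1, 35, 70, 140, 250):
    print(step, f"{inverse_sqrt_lr(step, max_lr, warmup_steps):.3e}")

# The two branches meet at step == warmup_steps, where the schedule peaks.
peak = inverse_sqrt_lr(warmup_steps, max_lr, warmup_steps)
```

Under this formula the peak value is `max_lr * warmup_steps ** -0.5`, so `max_lr` acts as a scale factor rather than the learning rate actually reached.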

### Question 4: Given the provided code, answer the following questions:

Q4.1 Explain the purpose and the process of gradient accumulation in the run_dpo function. Why does the function divide the loss by accumulate_step before calling loss.backward(), and how does this approach affect the learning rate adjustment within the loop?

`TODO: Your answer here`

Q4.2 Describe the purpose of the optimizer.zero_grad() function in the run_dpo function.

`TODO: Your answer here`

Before we kick off training, we might need to free up some GPU memory, especially if you are on a T4 GPU on Colab.

In [None]:
import gc

raw_model.to('cpu')
model_sft.to('cpu')
torch.cuda.empty_cache()

gc.collect()

Now you will define the PyTorch dataloaders that will be used to load the training and testing data into the model. A PyTorch DataLoader is an iterator that provides a way to iterate over the dataset in batches. It takes as input a PyTorch Dataset (which you defined above) and, optionally, a collator function that is used to preprocess and batch the samples before they are passed to the model.
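
As a rough illustration of what a collator does, the pure-Python sketch below pads variable-length examples to a common length and builds an attention mask. It is a stand-in for the real `DataLoader` machinery, not the assignment's `collate_base_dpo_gpt`; the names `collate_pad` and `iterate_batches`, the right-padding choice, and the pad id `0` are assumptions made for this example.

```python
def collate_pad(batch, pad_token_id):
    """Right-pad every example to the longest length in the batch."""
    max_len = max(len(ex["input_ids"]) for ex in batch)
    input_ids, attention_mask = [], []
    for ex in batch:
        n_real = len(ex["input_ids"])
        n_pad = max_len - n_real
        input_ids.append(ex["input_ids"] + [pad_token_id] * n_pad)
        attention_mask.append([1] * n_real + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

def iterate_batches(dataset, batch_size, collate_fn):
    """Yield collated batches in order, similar in spirit to a DataLoader with shuffle=False."""
    for start in range(0, len(dataset), batch_size):
        yield collate_fn(dataset[start:start + batch_size])

toy_dataset = [
    {"input_ids": [5, 6, 7]},
    {"input_ids": [8]},
    {"input_ids": [9, 10]},
]

batches = list(iterate_batches(
    toy_dataset, batch_size=2,
    collate_fn=lambda b: collate_pad(b, pad_token_id=0),
))
for b in batches:
    print(b)
```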

In [None]:
collate_fn = partial(collate_dpo_gpt, pad_token_id=tokenizer.pad_token_id, is_test=False)

train_data = load_dataset("json", data_files="data/dpo/train.jsonl")["train"]
val_data = load_dataset("json", data_files="data/dpo/val.jsonl")["train"]

batch_size = 1
max_len = 1024
is_test=False

# Replace None with your code
# Hint: Pass variables to a DPOGPTDataset constructor
train_set = DPOGPTDataset(train_data, dpo_model.config.model_type, max_len, is_test)
val_set = DPOGPTDataset(val_data, dpo_model.config.model_type, max_len, is_test)

# Replace None with your code
dataloader = DataLoader(train_set, batch_size=batch_size, shuffle=True, collate_fn=collate_fn, num_workers=4)
val_dataloader = DataLoader(val_set, batch_size=batch_size, shuffle=False, collate_fn=collate_fn, num_workers=4)


Note that the variance of the following DPO training is high. You might see surprising results in the Evaluation section below.
We suggest running a few restarts and exploring the resulting model quality.

#### Kicking off training:

In [None]:
optimizer = torch.optim.Adam(dpo_model.parameters())

# Define the number of epochs and other hyperparameters
# Recall that dataloaders have batch_size=1

epochs = 1
accumulate_step = 8 # Total effective batch size is 1 * 8 = 8
report_freq = 100
eval_interval = 500
max_lr = 1e-3
warmup_steps = 70
beta = 0.1

# Call the run_dpo function
dpo_model_student = run_dpo(
    model=dpo_model,
    dataloader=dataloader,
    val_dataloader=val_dataloader,
    optimizer=optimizer,
    device=DEVICE,
    epochs=epochs,
    accumulate_step=accumulate_step,
    report_freq=report_freq,
    eval_interval=eval_interval,
    max_lr=max_lr,
    warmup_steps=warmup_steps,
    beta=beta
)

### Evaluation

Next, we will evaluate the performance of three models: the pre-trained SFT model, the initial DPO model, and your fine-tuned DPO model. The evaluation process aims to assess the quality of the generated responses and compare them with expert references.

#### Evaluation Dataset
We will use `sft/test.jsonl`, a held-out split of the TULUv2 SFT dataset, for evaluation. It contains a set of instructions and corresponding human-written responses. Feel free to explore this data by yourself.

#### Evaluation Metrics
We will use the following metrics to evaluate the models:

1. **ROUGE Scores**: We've seen ROUGE in previous assignments.
2. **Pairwise Comparison**: We will use the PairRM (Pairwise Reward Model) to compare the generated responses with the human-written references. PairRM is a model that takes an instruction and a pair of output candidates as input and outputs a score for each candidate to measure their relative quality.

#### Evaluation Process
The evaluation process consists of the following steps:

1. **Response Generation**: For each model (SFT, instructor's DPO, and student's DPO), we will generate responses for the instructions in the `sft/test.jsonl` dataset.

2. **ROUGE Score Calculation**: We will calculate the ROUGE scores (ROUGE-1, ROUGE-2, ROUGE-L) between the generated responses and the corresponding human-written references.

3. **Pairwise Comparison**: Using the PairRM model, we will compare each generated response with its corresponding human-written reference. PairRM will output a score indicating which response is preferred. We will calculate the **win rate**, which represents the percentage of generated responses that are preferred over the human-written references.

4. **Result Analysis**: We will analyze the evaluation results to assess the performance of each model. The ROUGE scores and win rates will provide insights into the quality of the generated responses and how well they align with human preferences.
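
The win-rate computation in step 3 can be sketched as follows, assuming each comparison yields one score that is positive when the generated response is preferred over the reference. The function name `win_rate` and the toy scores are illustrative only.

```python
def win_rate(pair_scores):
    """Fraction of comparisons in which the generated response is preferred (score > 0)."""
    wins = sum(1 for score in pair_scores if score > 0)
    return wins / len(pair_scores)

# Four hypothetical comparisons: two wins, one loss, one tie (a tie is not counted as a win).
toy_scores = [1.2, -0.3, 0.0, 2.5]
rate = win_rate(toy_scores)
print(f"Win rate: {rate:.2f}")
```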

#### PairRM (Pairwise Reward Model)
We will use a reward model that is based on the **DeBERTa-v3-large** architecture and has been trained on a diverse collection of human-preference datasets. The reward model takes an instruction and a pair of output candidates as input and outputs a score for each candidate, indicating their relative quality.



In [None]:
# Adapted from https://github.com/yuchenlin/LLM-Blender/blob/main/llm_blender/pair_ranker/pairrm.py
class DebertaV2PairRM(DebertaV2PreTrainedModel):
    def __init__(self, config):
        super().__init__(config)

        self.n_tasks = config.n_tasks
        self.drop_out = config.drop_out

        # LM
        self.pretrained_model = DebertaV2Model(config)
        self.hidden_size = config.hidden_size

        self.sep_token_id = config.sep_token_id # to add
        self.source_prefix_id = config.source_prefix_id # to add
        self.cand_prefix_id = config.cand_prefix_id
        self.cand1_prefix_id = config.cand1_prefix_id
        self.cand2_prefix_id = config.cand2_prefix_id

        self.head_layer = nn.Sequential(
            nn.Dropout(self.drop_out),
            nn.Linear(2*self.hidden_size, 1*self.hidden_size),
            nn.Tanh(),
            nn.Dropout(self.drop_out),
            nn.Linear(1 * self.hidden_size, self.n_tasks),
        )
        self.sigmoid = nn.Sigmoid()

        # Initialize weights and apply final processing
        self.post_init()

    def forward(
        self,
        input_ids: Optional[torch.Tensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        token_type_ids: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.Tensor] = None,
        inputs_embeds: Optional[torch.Tensor] = None,
        labels: Optional[torch.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, SequenceClassifierOutput]:
        r"""
        labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
            Labels for computing the token classification loss. Indices should be in `[0, ..., config.num_labels - 1]`.
        """
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        #  <source_prefix_id>...<sep><cand1_prefix_id>...<sep><cand2_prefix_id> ... <sep>
        assert all([self.source_prefix_id in input_ids[i] for i in range(input_ids.shape[0])]), "<source> id not in input_ids"
        assert all([self.cand1_prefix_id in input_ids[i] for i in range(input_ids.shape[0])]), "<candidate1> id not in input_ids"
        assert all([self.cand2_prefix_id in input_ids[i] for i in range(input_ids.shape[0])]), "<candidate2> id not in input_ids"

        keep_column_mask = attention_mask.ne(0).any(dim=0)
        input_ids = input_ids[:, keep_column_mask]
        attention_mask = attention_mask[:, keep_column_mask]
        outputs = self.pretrained_model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            output_hidden_states=True,
            return_dict=return_dict,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            inputs_embeds=inputs_embeds,
            output_attentions=output_attentions,
        )
        encs = outputs.hidden_states[-1]
        source_idxs = torch.where(input_ids == self.source_prefix_id)
        source_encs = encs[source_idxs[0], source_idxs[1], :]
        cand1_idxs = torch.where(input_ids == self.cand1_prefix_id)
        cand1_encs = encs[cand1_idxs[0], cand1_idxs[1], :]
        cand2_idxs = torch.where(input_ids == self.cand2_prefix_id)
        cand2_encs = encs[cand2_idxs[0], cand2_idxs[1], :]

        # reduce
        source_cand1_encs = torch.cat([source_encs, cand1_encs], dim=-1)
        source_cand2_encs = torch.cat([source_encs, cand2_encs], dim=-1)
        left_pred_scores = self.head_layer(source_cand1_encs)
        right_pred_scores = self.head_layer(source_cand2_encs)

        loss = None
        if labels is not None:
            loss = self.compute_loss(left_pred_scores, right_pred_scores, labels)


        preds = (left_pred_scores - right_pred_scores).mean(dim=-1)
        return SequenceClassifierOutput(
            loss=loss, logits=preds,
            hidden_states=outputs.hidden_states if output_hidden_states else None,
            attentions=outputs.attentions
        )

    def compute_loss(self, left_pred_scores, right_pred_scores, labels):
        """
        Args:
            left_pred_scores: [n_candidates, n_task]
            right_pred_scores: [n_candidates, n_task]
            labels: [n_candidates, n_task], 1/0/-1 for left/right/both is better
        """

        device = left_pred_scores.device
        loss = torch.tensor(0.0).to(left_pred_scores.device)

        dif_scores = labels
        left_pred_scores = left_pred_scores * dif_scores.sign()
        right_pred_scores = - right_pred_scores * dif_scores.sign()
        cls_loss = torch.tensor(0.0, device=device)
        cls_loss += - torch.log(torch.sigmoid(left_pred_scores+right_pred_scores)).mean()
        loss += cls_loss
        return loss

### Generating Model Outputs

In this step, we will use the inference() function to generate responses from each model (SFT, instructor's DPO model, and your DPO model) for the instructions in the test set.

In [None]:
def inference(model, tokenizer, data_path, output_path, device, batch_size=4, gen_max_len=512, gen_min_len=10, max_len=1024, num_beams=1, length_penalty=1.0):
    model.eval()
    data = load_dataset("json", data_files=data_path)["train"]
    dataset = SFTGPTDataset(data, model.config.model_type, max_len, is_test=True)
    collate_fn = partial(collate_base_gpt, pad_token_id=tokenizer.pad_token_id, is_test=True)
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=False, collate_fn=collate_fn, num_workers=4)

    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeLsum"], use_stemmer=True, split_summaries=True)
    rouge1, rouge2, rougeLsum = 0, 0, 0
    length = 0

    with open(output_path, "w") as f:
        with torch.no_grad():
            for batch in tqdm(dataloader, desc="Generating outputs"):
                text_id = batch["input_ids"].to(device)
                attention_mask = batch["attention_mask"].to(device)
                src_len = text_id.shape[1]
                summaries = model.generate(
                    input_ids=text_id,
                    attention_mask=attention_mask,
                    max_length=min(gen_max_len + src_len, max_len),
                    min_length=min(gen_min_len + src_len, max_len),
                    num_beams=num_beams,
                    length_penalty=length_penalty,
                    early_stopping=True,
                )
                dec = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summaries]
                for hypothesis, d in zip(dec, batch["data"]):
                    hypothesis = hypothesis[len(d["instruction"]):].strip()
                    scores = scorer.score(d["response"], hypothesis)
                    rouge1 += scores["rouge1"].fmeasure
                    rouge2 += scores["rouge2"].fmeasure
                    rougeLsum += scores["rougeLsum"].fmeasure
                    length += len(tokenizer.encode(hypothesis))
                    result = {
                        "instruction": d["instruction"],
                        "reference": d["response"],
                        "hypothesis": hypothesis,
                        "rouge1": scores["rouge1"].fmeasure,
                        "rouge2": scores["rouge2"].fmeasure,
                        "rougeLsum": scores["rougeLsum"].fmeasure,
                        "length": len(tokenizer.encode(hypothesis)),
                    }
                    f.write(json.dumps(result) + "\n")

    rouge1 /= len(dataset)
    rouge2 /= len(dataset)
    rougeLsum /= len(dataset)
    rouge1 *= 100
    rouge2 *= 100
    rougeLsum *= 100
    length /= len(dataset)
    print(f"Rouge1: {rouge1:.2f}, Rouge2: {rouge2:.2f}, RougeLsum: {rougeLsum:.2f}, Length: {length:.2f}")
    return rouge1, rouge2, rougeLsum



### Pairwise Comparison using PairRM
In this step, we will use the pairwise_compare() function to compare the generated responses from each model with the human-written references.

The function returns the win rate, which serves as an evaluation metric for the model's performance in generating responses that align with human preferences. Read the code below and run the cell, then answer the following questions.

In [None]:
def pairwise_compare(data_path, output_path, device, batch_size=1):
    pairrm = DebertaV2PairRM.from_pretrained("llm-blender/PairRM-hf").to(device).eval()
    tokenizer = AutoTokenizer.from_pretrained('llm-blender/PairRM-hf')
    source_prefix = "<|source|>"
    cand1_prefix = "<|candidate1|>"
    cand2_prefix = "<|candidate2|>"

    with open(data_path) as f:
        data = [json.loads(line) for line in f]

    def tokenize_pair(sources:List[str], candidate1s:List[str], candidate2s:List[str], source_max_length=960, candidate_max_length=460):
        ids = []
        assert len(sources) == len(candidate1s) == len(candidate2s)
        max_length = source_max_length + 2 * candidate_max_length
        for i in range(len(sources)):
            source_ids = tokenizer.encode(source_prefix + sources[i], max_length=source_max_length, truncation=True)
            candidate_max_length = (max_length - len(source_ids)) // 2
            candidate1_ids = tokenizer.encode(cand1_prefix + candidate1s[i], max_length=candidate_max_length, truncation=True)
            candidate2_ids = tokenizer.encode(cand2_prefix + candidate2s[i], max_length=candidate_max_length, truncation=True)
            ids.append(source_ids + candidate1_ids + candidate2_ids)
        encodings = tokenizer.pad({"input_ids": ids}, return_tensors="pt", padding="max_length", max_length=max_length)
        return encodings

    win = 0
    with open(output_path, "w") as f:
        for batch_start in tqdm(range(0, len(data), batch_size), desc="Comparing outputs"):
            batch_end = batch_start + batch_size
            batch_data = data[batch_start:batch_end]

            inputs = [d["instruction"] for d in batch_data]
            candidates_A = [d["hypothesis"] for d in batch_data]
            candidates_B = [d["reference"] for d in batch_data]

            encodings = tokenize_pair(inputs, candidates_A, candidates_B)
            encodings = {k:v.to(device) for k,v in encodings.items()}

            outputs = pairrm(**encodings)
            comparison_results = outputs.logits > 0
            comparison_results = comparison_results.cpu().numpy().tolist()

            for i, d in enumerate(batch_data):
                d["comparison_result"] = comparison_results[i]
                f.write(json.dumps(d) + "\n")
                if comparison_results[i]:
                    win += 1

    win_rate = win / len(data)
    print(f"Win rate: {win_rate:.2f}")
    return win_rate


### Question 5: Given the provided code, answer the following questions:

Q5.1 Describe how the pairwise_compare function determines the "win" cases between two candidates (referred to as candidates_A and candidates_B in the code) for each piece of input data.

`TODO: Your answer here`

In [None]:
dpo_model = AutoModelForCausalLM.from_pretrained("kejian/gpt2-tulu2-DPO").to('cpu') # load the initial DPO model again but to CPU.
dpo_model_student.to('cpu')

In [None]:
import gc

raw_model.to('cpu')
model_sft.to('cpu')
torch.cuda.empty_cache()

gc.collect()

In [None]:
import nltk
nltk.download('punkt_tab')
def evaluate_models(models, tokenizer, test_data_path, device, batch_size=4, cls_batch_size=1):
    for model_name, model in models.items():
        print(f"Evaluating {model_name}...")
        output_path = f"{model_name}_outputs.jsonl"
        score_path = f"{model_name}_scores.jsonl"

        model.to(device)
        rouge1, rouge2, rougeLsum = inference(model, tokenizer, test_data_path, output_path, device, batch_size)
        print(f"{model_name} Rouge Scores - Rouge1: {rouge1:.2f}, Rouge2: {rouge2:.2f}, RougeLsum: {rougeLsum:.2f}")

        model.to('cpu')
        torch.cuda.empty_cache()
        gc.collect()

        win_rate = pairwise_compare(output_path, score_path, device, cls_batch_size)
        print(f"{model_name} Win Rate: {win_rate:.2f}")


models = {
    # "RAW": raw_model,
    "SFT": model_sft,
    "DPO_Instructor": dpo_model,
    "DPO_Student": dpo_model_student
}

test_data_path = "data/sft/test.jsonl"
batch_size = 4
cls_batch_size = 1 # for the DeBERTa-large 0.4B

evaluate_models(models, tokenizer, test_data_path, DEVICE, batch_size, cls_batch_size)

#### Question 6:

Please provide a description of the results obtained. What were your initial expectations? How did the empirical outcomes compare to these expectations? Attempt to explain the reasons behind your observations.


Prompt your `dpo_model_student` and compare its responses with the SFT model. Don't forget to use the SFT prompt format (No explanations needed).


#### Question 7:  

What’s the purpose of using $\sigma$, the logistic function, in the DPO loss?


The logistic function σ converts the β-scaled difference between the chosen and rejected log-probability ratios (policy versus reference) into a probability in (0, 1) that the "chosen" output is preferred. This smooth mapping stabilizes training by providing well-behaved gradients and a natural probabilistic interpretation for pairwise preference signals.

### References



```
@misc{cui2023ultrafeedback,
      title={UltraFeedback: Boosting Language Models with High-quality Feedback},
      author={Ganqu Cui and Lifan Yuan and Ning Ding and Guanming Yao and Wei Zhu and Yuan Ni and Guotong Xie and Zhiyuan Liu and Maosong Sun},
      year={2023},
      eprint={2310.01377},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@misc{rafailov2023direct,
      title={Direct Preference Optimization: Your Language Model is Secretly a Reward Model},
      author={Rafael Rafailov and Archit Sharma and Eric Mitchell and Stefano Ermon and Christopher D. Manning and Chelsea Finn},
      year={2023},
      eprint={2305.18290},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

@misc{ivison2023camels,
      title={Camels in a Changing Climate: Enhancing LM Adaptation with Tulu 2},
      author={Hamish Ivison and Yizhong Wang and Valentina Pyatkin and Nathan Lambert and Matthew Peters and Pradeep Dasigi and Joel Jang and David Wadden and Noah A. Smith and Iz Beltagy and Hannaneh Hajishirzi},
      year={2023},
      eprint={2311.10702},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

```

