<a href="https://colab.research.google.com/github/Yingjia-Wan/C4AIScholarsChallenge2024/blob/main/Yingjia_Wan_C4AIScholarsChallenge.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Submission Notes from Applicant:**

Inside some subsections of each part, I wrote short reports in [text blocks] to answer some task questions in more detail. :)

# **Background**

Welcome to the C4AI Scholars Program Take-Home Challenge! This exercise is designed to allow you to showcase your engineering and problem solving skills. The Challenge consists of different challenges including:

*   Identifying bugs, and getting the code working. This is designed to test your ability to grapple with real world engineering challenges.
*   Testing your ability to generate code for a specified problem.
*   An opportunity for you to attempt an optional challenge question that extends the original problem set.

These tasks were chosen as a setting to see how you think about problems, even if they are not in your own research field of interest. The tasks and dataset are not meant to be indicative of the research goals of the Scholar Program. We purposefully have selected a simple toy problem so the focus is on how you think, and does not require significant machine learning resources (can be run in this colab).

Good luck!

**How to Use and Submit this Document?**

*   **Make a copy of this document** and rename it **Firstname_Lastname_C4AIScholarsChallenge**
*   Once you have completed all tasks, save and pin your revisions
*   Submit the assignment by responding directly to this email with a link to your final document by Sunday, September 15th, 11 PM PDT.

# **Coding Challenge Part 1: Debugging custom SmolLM code [10 points]**

In this coding challenge, you are required to debug and fix a bare-bones implementation of the following model.

**Model** : SmolLM-135M can be found at [HuggingFace](https://huggingface.co/HuggingFaceTB/SmolLM-135M).

We have 10 bugs in the following implementation.
There is a `check_solution` function for your convenience to verify you have correctly identified all the bugs. If you have found all bugs, the generated outputs will match the reference model exactly.

**Rules**:
1. **Bug Definition:**
  - There are 10 bugs to be fixed.
  - A bug is *defined as **{incorrect, missing, unnecessary}** lines of code*.
  - You earn 1 point for each correctly identified and fixed bug.
2. **Fix Guidelines:**
  - You are encouraged to make the smallest possible fix, wherever possible (e.g. edit a line instead of replacing it entirely).
  - Do not optimize the code; only fix the bugs. The implementation is *intentionally* non-optimized but valid.
3. **Documentation:** Document each fix by adding a comment on the line above the fix: `### BUG FIX ###`.
4. **Sections:** *1. Setup [Helper Functions]* and *3. Test* don't contain bugs and shouldn't be changed.
5. **Submission:** Your final submission should be the exact same file except with your proposed fixes and the respective comments as per Rule #3.

## 1. Setup [Helper Functions]

In [None]:
######################################################################################################################
############################################## DO NOT CHANGE[START] ##################################################
######################################################################################################################


# [Don't use. Rate limit issues.] Use gdown to get weights file(BareBones_SmolLM-135M.pt) at https://drive.google.com/file/d/1tY46FSJEhGYRrfKRQTjJ1Cc7q9psaKUU/view . gdown should be installed by default else use `pip install gdown`
# !gdown 1tY46FSJEhGYRrfKRQTjJ1Cc7q9psaKUU


# [Recommended]Use HF to download the weights
!git lfs install
!git clone https://huggingface.co/dsouzadaniel/C4AI_SMOLLM135
!mv C4AI_SMOLLM135/BareBones_SmolLM-135M.pt ./
!ls

Git LFS initialized.
Cloning into 'C4AI_SMOLLM135'...
remote: Enumerating objects: 6, done.
remote: Counting objects: 100% (3/3), done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 6 (delta 0), reused 0 (delta 0), pack-reused 3 (from 1)
Unpacking objects: 100% (6/6), 2.11 KiB | 2.11 MiB/s, done.
BareBones_SmolLM-135M.pt  C4AI_SMOLLM135  drive  sample_data


In [None]:

# Libraries
import torch
import torch.nn.functional as F
from torch import nn
import math
from transformers import AutoModelForCausalLM, AutoTokenizer

# Model initialization/settings
checkpoint="HuggingFaceTB/SmolLM-135M"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

__reference_model = AutoModelForCausalLM.from_pretrained(checkpoint)
__reference_model.eval()

class smolConfig:
    vocab_size=49152
    hidden_size=576
    intermediate_size=1536
    num_hidden_layers = 30
    num_heads = 9
    kv_heads=3
config = smolConfig

# Helper Functions
def __generate(model, inputs, num_tokens):
    collect = []
    for _ in range(num_tokens):
        output = model(**inputs)
        output_id = torch.argmax(output['logits'][0,-1]).item()
        collect.append(output_id)
        if output_id==tokenizer.eos_token_id:
            break
        inputs['input_ids'] = torch.unsqueeze(torch.cat([inputs['input_ids'][0],torch.tensor([output_id])]),dim=0)
        inputs['attention_mask'] = torch.ones_like(inputs['input_ids'])
    return tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(collect))

def check_solution(prompt, num_tokens, model_A, model_B):
    print()
    print(f"{'>'*20}\n\tPrompt\n{'<'*20}\n{prompt}\n\n")
    model_inputs = tokenizer(prompt, return_tensors='pt')
    print(f"{'>'*30}\n\tModel_A Generation\n{'<'*30}\n{__generate(model_A,  model_inputs, num_tokens)}")
    print("\n\n")
    model_inputs = tokenizer(prompt, return_tensors='pt')
    print(f"{'>'*30}\n\tModel_B Generation\n{'<'*30}\n{__generate(model_B,  model_inputs, num_tokens)}")

######################################################################################################################
############################################### DO NOT CHANGE[END] ###################################################
######################################################################################################################


## 2. Custom SmolLM (for BugFixes)

In [None]:
def rotate_half(x):
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
    cos = cos.unsqueeze(unsqueeze_dim)
    sin = sin.unsqueeze(unsqueeze_dim)
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed

def repeat_kv(hidden_states, n_rep):
    batch, num_key_value_heads, slen, head_dim = hidden_states.shape

    ### BUG FIX ###  [missing]
    if n_rep == 1: # if n_rep=1, no need to reshape hidden_states
        return hidden_states

    hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
    return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)

class RotaryEmbedder(nn.Module):
    def __init__(self, dim, base):
        super().__init__()
        self.freq = 1/(base ** (torch.arange(0, dim, 2, dtype=torch.int64).float()/dim))

    @torch.no_grad()
    def forward(self,x):
        pos = torch.arange(x.shape[-2],dtype=torch.long)
        angles = torch.einsum('f,p->fp', self.freq, pos.float()).unsqueeze(dim=0)

        ### BUG FIX ### [incorrect]
        # emb = torch.cat((angles, angles), dim=-1)
        emb = torch.cat((angles, angles), dim=-2)

        ### BUG FIX ### [missing]
        emb = emb.permute(0, 2, 1)

        return emb.cos(), emb.sin()


class MLP(nn.Module):
    def __init__(self, hidden_size, intermediate_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.intermediate_size = intermediate_size
        self.W_gate = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.W_up = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.W_down = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
        self.act_fn = torch.nn.modules.activation.SiLU()

    def forward(self, x):
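        ### BUG FIX ### [incorrect] SiLU applies to the gate projection only; the result then multiplies the up projection
        down_proj = self.W_down(self.act_fn(self.W_gate(x)) * self.W_up(x))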
        return down_proj

class RMSNorm(nn.Module):
    def __init__(self, hidden_size, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps

    def forward(self, hidden_states):
        variance = hidden_states.pow(2).mean(-1, keepdim=True)
        ### BUG FIX ### [incorrect] normalize by the root mean square, i.e. multiply by rsqrt(variance + eps).
        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
        # hidden_states = hidden_states * torch.sqrt(variance + self.variance_epsilon) [original]

        return self.weight * hidden_states


class RopeAttention(nn.Module):
    def __init__(self,config):
        super().__init__()
        self.hidden_size=config.hidden_size
        self.num_heads = config.num_heads
        self.head_dim = config.hidden_size//self.num_heads
        self.kv_heads = config.kv_heads
        self.rope_theta = 10000.0

        self.W_query = nn.Linear(config.hidden_size, self.num_heads * self.head_dim, bias=False)
        self.W_key = nn.Linear(config.hidden_size, self.kv_heads * self.head_dim, bias=False)
        self.W_value = nn.Linear(config.hidden_size, self.kv_heads * self.head_dim, bias=False)
        self.W_output = nn.Linear(config.hidden_size, config.hidden_size, bias=False)
        self.rotary_emb = RotaryEmbedder(base=self.rope_theta,
                                         dim=config.hidden_size//self.num_heads)

    def forward(
        self,
        hidden_states: torch.Tensor,
        attention_mask= None,
    ):
        b, q, _ = hidden_states.size()

        q_states = self.W_query(hidden_states)
        k_states = self.W_key(hidden_states)
        v_states = self.W_value(hidden_states)

        q_states = q_states.view(b, q, self.num_heads, self.head_dim).transpose(1, 2)
        k_states = k_states.view(b, q, self.kv_heads, self.head_dim).transpose(1, 2)
        v_states = v_states.view(b, q, self.kv_heads, self.head_dim).transpose(1, 2)

        cos, sin = self.rotary_emb(v_states)
        q_states, k_states = apply_rotary_pos_emb(q_states, k_states, cos, sin)

        ### BUG FIX ### [incorrect] __kv_groups must be an int, not a float
        __kv_groups = self.num_heads // self.kv_heads
        # __kv_groups = self.num_heads / self.kv_heads # [original]

        k_states = repeat_kv(k_states, __kv_groups)
        v_states = repeat_kv(v_states, __kv_groups)

        ### BUG FIX ### [incorrect] scale by sqrt(head_dim), not sqrt(hidden_size)
        attn_weights = torch.matmul(q_states, k_states.transpose(2, 3)) / math.sqrt(self.head_dim)


        ### BUG FIX ### [incorrect] add if condition
        if attention_mask is not None:
            attn_weights = attn_weights + attention_mask
        # attn_weights = attn_weights + attention_mask #[original]

        attn_weights = nn.functional.softmax(attn_weights, dim=-1)
        ### BUG FIX ### [incorrect] dropout must be inactive in eval mode (F.dropout defaults to training=True)
        attn_weights = nn.functional.dropout(attn_weights, p=0.0, training=self.training)
        # attn_weights = nn.functional.dropout(attn_weights) # [original]

        attn_output = torch.matmul(attn_weights, v_states)
        attn_output = attn_output.transpose(1, 2).contiguous()
        attn_output = attn_output.reshape(b, q, -1)

        ### BUG FIX ### [missing] apply the output projection
        attn_output = self.W_output(attn_output)

        return attn_output

class LlamaDecoder(nn.Module):
    def __init__(self,config):
        super().__init__()
        self.self_attn = RopeAttention(config)
        self.mlp = MLP(hidden_size=config.hidden_size, intermediate_size=config.intermediate_size)
        self.pre_attn_rmsnorm = RMSNorm(config.hidden_size, eps=1e-05)
        self.pre_mlp_rmsnorm = RMSNorm(config.hidden_size, eps=1e-05)

    def forward(self,hidden_states, attention_mask):
        residual = hidden_states
        hidden_states = self.pre_attn_rmsnorm(hidden_states)
        ### BUG FIX ### [incorrect]
        attention_mask = torch.triu(torch.full((hidden_states.shape[1], hidden_states.shape[1]), fill_value=float('-inf')), diagonal=1)
        # attention_mask = torch.triu(torch.full((attention_mask.shape[-1],attention_mask.shape[-1]), fill_value=float('-inf')),diagonal=1) [original]

        hidden_states = self.self_attn(
            hidden_states=hidden_states,
            attention_mask=attention_mask,
        )
        hidden_states += residual

        ### BUG FIX ### [missing]
        residual = hidden_states # update the residual for the MLP sub-block

        hidden_states = self.pre_mlp_rmsnorm(hidden_states)
        hidden_states = self.mlp(hidden_states)
        hidden_states += residual

        outputs = (hidden_states,)

        return outputs

class smolModel(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.embed_tokens = nn.Embedding(num_embeddings=config.vocab_size,
                                         embedding_dim=config.hidden_size)
        self.layers = nn.ModuleList([LlamaDecoder(config) for _ in range(config.num_hidden_layers)])
        self.norm = RMSNorm(config.hidden_size, eps=1e-05)

    def forward(
        self,
        input_ids= None,
        attention_mask= None,
    ):
        inputs_embeds = self.embed_tokens(input_ids)
        hidden_states = inputs_embeds
        for decoder_layer in self.layers:
            layer_outputs = decoder_layer(
                hidden_states,
                attention_mask=attention_mask,
            )
            hidden_states = layer_outputs[0]
        hidden_states = self.norm(hidden_states)
        return [hidden_states]


class smolLM(nn.Module):
    def __init__(self,config):
        super().__init__()
        self.model = smolModel(config)
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)

    def forward(self,input_ids,attention_mask):
        outputs = self.model(
            input_ids=input_ids,
            attention_mask=attention_mask,
        )
        ### BUG FIX ### [incorrect] remove squeeze()
        hidden_states = outputs[0]
        # hidden_states = outputs[0].squeeze() #[original]

        logits = self.lm_head(hidden_states)
        logits = logits.float()
        return {'logits':logits}
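
The `RMSNorm` fix above can be checked by hand without PyTorch. The snippet below is a minimal pure-Python sketch, not part of the submitted model; `rms_norm` is a hypothetical stand-in for `RMSNorm.forward`. It shows that multiplying by the reciprocal root mean square (what `torch.rsqrt` computes) leaves a vector whose mean of squares is close to 1.

```python
import math

def rms_norm(values, weight=None, eps=1e-5):
    """Scale a vector by the reciprocal of its root mean square (hypothetical helper)."""
    mean_square = sum(v * v for v in values) / len(values)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)  # scalar equivalent of torch.rsqrt
    weight = weight if weight is not None else [1.0] * len(values)
    return [w * v * inv_rms for w, v in zip(weight, values)]

x = [3.0, -4.0, 12.0, 0.5]
normed = rms_norm(x)

# After normalization the mean of squares is close to 1, whatever the input scale.
mean_square_after = sum(v * v for v in normed) / len(normed)
print(mean_square_after)
```

Multiplying by `sqrt` instead, as the original line did, would amplify large activations rather than normalize them.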


In [None]:
__test_model = smolLM(config)
__test_model.load_state_dict(torch.load('BareBones_SmolLM-135M.pt'), strict=False)
__test_model.eval()



smolLM(
  (model): smolModel(
    (embed_tokens): Embedding(49152, 576)
    (layers): ModuleList(
      (0-29): 30 x LlamaDecoder(
        (self_attn): RopeAttention(
          (W_query): Linear(in_features=576, out_features=576, bias=False)
          (W_key): Linear(in_features=576, out_features=192, bias=False)
          (W_value): Linear(in_features=576, out_features=192, bias=False)
          (W_output): Linear(in_features=576, out_features=576, bias=False)
          (rotary_emb): RotaryEmbedder()
        )
        (mlp): MLP(
          (W_gate): Linear(in_features=576, out_features=1536, bias=False)
          (W_up): Linear(in_features=576, out_features=1536, bias=False)
          (W_down): Linear(in_features=1536, out_features=576, bias=False)
          (act_fn): SiLU()
        )
        (pre_attn_rmsnorm): RMSNorm()
        (pre_mlp_rmsnorm): RMSNorm()
      )
    )
    (norm): RMSNorm()
  )
  (lm_head): Linear(in_features=576, out_features=49152, bias=False)
)
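
The layer shapes in this printout follow directly from `smolConfig`. The short sketch below redoes that arithmetic and mimics, with plain labels instead of tensors, the head order that `repeat_kv` produces. `repeat_kv_labels` is a hypothetical helper introduced only for illustration.

```python
hidden_size, num_heads, kv_heads = 576, 9, 3  # values from smolConfig

head_dim = hidden_size // num_heads   # width of one attention head
q_out = num_heads * head_dim          # out_features of W_query
kv_out = kv_heads * head_dim          # out_features of W_key and W_value
n_rep = num_heads // kv_heads         # query heads that share one key/value head

def repeat_kv_labels(kv_labels, n_rep):
    """Mimic expand + reshape in repeat_kv: each KV head is repeated consecutively."""
    return [label for label in kv_labels for _ in range(n_rep)]

print(head_dim, q_out, kv_out, n_rep)
# 64 576 192 3
print(repeat_kv_labels(["kv0", "kv1", "kv2"], n_rep))
# ['kv0', 'kv0', 'kv0', 'kv1', 'kv1', 'kv1', 'kv2', 'kv2', 'kv2']
```

This is also why the attention scores are scaled by `sqrt(head_dim)` (8 here) rather than `sqrt(hidden_size)` (24).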

## 3. Test

In [None]:
######################################################################################################################
############################################## DO NOT CHANGE[START] ##################################################
######################################################################################################################

###### TESTING PROMPTS
# Single-Token Quick Test
check_solution(prompt="Given the following film movie by a critic, rate it out of 10. Respond in a single number.\n\nThe movie started off extremely well, but just got worse after that.\nThe storyline was all over the place and everyone acted terribly.\n 10/10 would not recommend! \n\n ",
               num_tokens=1,
               model_A=__reference_model,
               model_B=__test_model)


We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)



>>>>>>>>>>>>>>>>>>>>
	Prompt
<<<<<<<<<<<<<<<<<<<<
Given the following film movie by a critic, rate it out of 10. Respond in a single number.

The movie started off extremely well, but just got worse after that.
The storyline was all over the place and everyone acted terribly.
 10/10 would not recommend! 

 


>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
	Model_A Generation
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
1



>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
	Model_B Generation
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
 wrongs


In [None]:
# Multi-Token Quick Test
check_solution(prompt="Where is the Nile located?",
               num_tokens=50,
               model_A=__reference_model,
               model_B=__test_model)

######################################################################################################################
############################################### DO NOT CHANGE[END] ###################################################
######################################################################################################################


>>>>>>>>>>>>>>>>>>>>
	Prompt
<<<<<<<<<<<<<<<<<<<<
Where is the Nile located?


>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
	Model_A Generation
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

The Nile River is located in the Nile Delta in the Nile River Basin, which is a region of Africa. It is the longest river in the world, with a length of 4,330 miles (6,900 km



>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
	Model_B Generation
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
pretationBone exhibitionsligBone exhibitionsligBone exhibitionsligBone exhibitionsligBone exhibitionsligBone exhibitionsligBone exhibitionsligBone exhibitionsligBone exhibitionsligBone exhibitionsligBone exhibitionsligBone exhibitionsligBone exhibitionsligBone exhibitionsligBone exhibitionsligBone exhibitionsligBone


# **Coding Challenge Part 2: Teach SmolLM to do grammatical error correction [15 points]**

The goal of this part is to train the SmolLM-135M model to perform grammatical error correction (GEC) using the Grammarly CoEdIT dataset. This [dataset](https://huggingface.co/datasets/grammarly/coedit), derived from the [CoEdIT project](https://arxiv.org/abs/2305.09857), provides a rich collection of text editing instructions and examples. The task involves several key steps that mimic conventional alignment processes:




## **2.1 Supervised Fine-Tuning (SFT) on Training Data [5 points]**

* Fine-tune the [SmolLM-135M model](https://huggingface.co/HuggingFaceTB/SmolLM-135M) using the CoEdIT dataset, which includes input sentences with grammatical errors and their corrected versions.
* Use the training GEC portion of the CoEdIT dataset to teach the model how to correct grammatical errors effectively.
* Calculate the BLEU score on the validation set to evaluate the model's performance in generating grammatically correct sentences. Ensure that this evaluation process is reusable for later comparisons.
* Search for an optimal set of hyperparameters, such as the learning rate. We provide an estimated BLEU score that you should aim to achieve after one epoch. However, you may achieve a better score by finding the most suitable hyperparameters. **Do not train for more than 3 epochs -- we do not expect extensive training time.**
* For Part 2, don't use additional libraries. If an imported library is missing, install it with **pip install**.

In [None]:
!pip install datasets transformers trl torch
from datasets import load_dataset

# Download the GEC data
full_train_ds = load_dataset("grammarly/coedit", split="train")
full_test_ds = load_dataset("grammarly/coedit", split="validation")

In [None]:
# TODO: Filter examples, keeping only GEC task

def filter_gec(example):
    return example['task'] == 'gec'

gec_train_ds = full_train_ds.filter(filter_gec)
gec_val_ds = full_test_ds.filter(filter_gec)

# adding the 'text' items to dataset to be compatible with SFTTrainer
# (alternative is setting a formatter function: https://huggingface.co/docs/trl/en/sft_trainer#customize-your-prompts-using-packed-dataset)
def add_text_field(example):
    example['text'] = f"{example['src']}. \nAnswer: {example['tgt']}\n\n"
    example['prompt'] = f"{example['src']}. \nAnswer: "
    return example

train_ds = gec_train_ds.map(add_text_field)
val_ds = gec_val_ds.map(add_text_field)

# Expected number of samples
print(f"Train dataset size: {len(train_ds)}")
print(f"Validation dataset size: {len(val_ds)}")

Train dataset size: 19823
Validation dataset size: 485


Expected number of train and test samples are 19823 and 485, respectively.

In [None]:
import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "HuggingFaceTB/SmolLM-135M"

# TODO: Load the model and the tokenizer from huggingface

# Check if a GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")


tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side='left')
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(model_name).to(device)



Using device: cuda


In [None]:
# TRL - Transformer Reinforcement Learning -- https://huggingface.co/docs/trl/en/index
from trl import SFTConfig, SFTTrainer

# TODO: Run SFT

# hyperparameters
config = SFTConfig(
    dataset_text_field="text",
    max_seq_length=512,                 # max_seq_length set based on dataset length stats
    per_device_train_batch_size=16,     # Batch size set for efficiency
    learning_rate=1e-5,
    num_train_epochs=3,
    save_total_limit=1,
    output_dir="/tmp/SFT",
)

trainer = SFTTrainer(
    model,
    args=config,
    tokenizer=tokenizer,
    train_dataset=train_ds
)

# Train the model (takes apx. 5-10 min for 1 epoch on T4 GPU)
trainer.train()





Step,Training Loss
500,1.7829
1000,1.6204
1500,1.5654
2000,1.5501
2500,1.5295
3000,1.5195
3500,1.509


TrainOutput(global_step=3717, training_loss=1.5781581694957292, metrics={'train_runtime': 548.3304, 'train_samples_per_second': 108.455, 'train_steps_per_second': 6.779, 'total_flos': 3669652263306240.0, 'train_loss': 1.5781581694957292, 'epoch': 3.0})
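
As a sanity check on the training log, the reported `global_step` can be reproduced from the dataset size and batch size. This is a small arithmetic sketch that assumes a single device and no gradient accumulation, so that each batch corresponds to one optimizer step.

```python
import math

num_examples = 19823   # GEC training examples reported above
batch_size = 16        # per_device_train_batch_size
num_epochs = 3         # num_train_epochs

# Assumption: one device and no gradient accumulation, so one step per batch.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)
# 1239 3717
```

The total matches `global_step=3717` in the `TrainOutput` above.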

### Report on Part 2.1

#### + About hyperparameter search:

I searched for hyperparameters with two criteria in mind: computational efficiency and performance (using the BLEU score as a proxy). The model trained with the current hyperparameters achieves a validation BLEU score of 0.4107 after 1 epoch and 0.4546 after 3 epochs.


Here is a partial list of reference documents I considered:
- SFTConfig args doc: https://huggingface.co/docs/trl/v0.10.1/en/sft_trainer#trl.SFTConfig
- trainer args doc: https://huggingface.co/docs/trl/en/sft_trainer
- Model hyperparameter blog: https://huggingface.co/blog/smollm#hyperparameters-choice


Here is a breakdown of how I chose the hyperparameters below:

- max_seq_length=512:

    I collected length statistics for the 'text' field of train_ds and val_ds. A shorter max_seq_length reduces the amount of padding and speeds up training, while a longer value ensures that most inputs are fully captured rather than truncated.
    As shown below, there is a sizable length discrepancy between train_ds and val_ds. Weighing efficiency, preservation of the dataset content, and evaluation performance, I chose 512.

    | Dataset ('text' field) | Maximum length | Average length | Median length | 90th percentile |
    |---|---|---|---|---|
    | train_ds | 1413 | 248.60 | 229.0 | 389.0 |
    | val_ds | 1944 | 556.90 | 538.0 | 974.0 |
    | train_ds + val_ds | 1944 | 255.97 | 231.0 | 401.0 |

- Others:
    
    The other values (e.g., per_device_train_batch_size=16, learning_rate=1e-5), and the decision to leave the remaining hyperparameters at their defaults, are based on several rounds of 1-epoch trial runs within the limited time, weighing both efficiency and performance.


- (prompt template):

    I found that the prompt template can have a significant impact on training performance, possibly because of how the base model was pretrained and aligned. The current template (in `add_text_field`) is the one that performed best over several rounds of 1-epoch trials.
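
The length statistics reported in this section can be gathered with a few lines of standard-library Python. The sketch below is illustrative only: `length_stats` is a hypothetical helper, it counts characters (the report does not state whether characters or tokens were measured), and it uses a nearest-rank percentile, so its numbers may differ slightly from the ones quoted above.

```python
import statistics

def length_stats(texts):
    """Summarize string lengths (character counts; an assumption, see lead-in)."""
    lengths = sorted(len(t) for t in texts)
    # Nearest-rank 90th percentile: ceil(0.9 * n) computed with integer arithmetic.
    rank = (9 * len(lengths) + 9) // 10
    return {
        "max": lengths[-1],
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "p90": lengths[rank - 1],
    }

# Toy input standing in for train_ds['text'].
toy_texts = ["a" * n for n in range(1, 11)]
stats = length_stats(toy_texts)
print(stats)
```

In the notebook, the same function could be applied to `train_ds['text']` and `val_ds['text']` to decide how much content a given `max_seq_length` would truncate.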




In [None]:
# SAVE MODEL function

from google.colab import drive
import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
# Mount Google Drive
drive.mount('/content/drive')


def save_trainer_checkpoint(trainer, checkpoint_path):
    """
    Save a checkpoint of the trained model for easy loading.
    """
    # Create the directory if it doesn't exist
    if not os.path.exists(checkpoint_path):
        os.makedirs(checkpoint_path)

    # Save the model and the trainer's state
    trainer.save_model(checkpoint_path)  # Saves the model and tokenizer
    # trainer.save_state()  # Saves the trainer's state including optimizer, scheduler, etc.

    print(f"Saved to {checkpoint_path}!")


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Save the SFT_model:

SFT_directory = '/content/drive/MyDrive/Application/C4AI/SFT_512_16_1e-5_epoch3_new'
save_trainer_checkpoint(trainer, SFT_directory)

In [None]:
# Quick test if your model works properly
def format_text(text: str) -> str:
    # Use the same prompt template as in training (see add_text_field)
    return f"{text}. \nAnswer: "

# Example of how to run inference on a single example
text = "Fix grammatically: I likes turtles."
# text = "Make the sentence grammatical: I realized beyond this attitude would destroy me, and at this points my views of happiness shifted in a more realistic way, acknowledging happiness was step through steps process of overcoming challenge."
inputs = tokenizer(format_text(text), return_tensors="pt", padding=True, truncation=True, max_length=128).to(device)

# Greedy decoding; generation stops at the EOS token or after max_new_tokens
outputs = model.generate(**inputs, do_sample=False, eos_token_id=tokenizer.eos_token_id, max_new_tokens=128)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

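def extract_output(generated_text):
    marker = 'Answer: '
    # Start right after the first marker; fall back to the whole text if it is missing
    marker_pos = generated_text.find(marker)
    pred_start = marker_pos + len(marker) if marker_pos != -1 else 0
    # End at the first blank line after the answer; keep everything if there is none
    pred_end = generated_text.find('\n\n', pred_start)
    if pred_end == -1:
        pred_end = len(generated_text)
    output = generated_text[pred_start:pred_end]
    if marker in output:
        output = output[output.find(marker) + len(marker):]
    return output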

generated_text = extract_output(generated_text)
print(generated_text)

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


 I like turtles.


Expected output: I like turtles.

In [None]:
!pip install evaluate
import evaluate

# BLEU Score evaluation
def evaluate_model(model, tokenizer, ds, output_file = "/tmp/validation_output.json", max_length=512, max_new_tokens=512):
    preds = []
    targets = []
    batch_size=16 # evaluate in batches in parallel
    srcs = []

    for i in range(0, len(ds), batch_size):
        batch = ds[i: i + batch_size]
        input_texts = batch['prompt']
        inputs = tokenizer(input_texts, return_tensors="pt", padding=True, truncation=True, max_length=max_length).to(device)

        generated_ids = model.generate(**inputs,
                                       max_new_tokens=max_new_tokens,
                                       eos_token_id=tokenizer.eos_token_id,
                                       repetition_penalty=1.0) # 1.0 applies no repetition penalty; length is capped by max_new_tokens

        for j in range(len(generated_ids)):
            generated_text = tokenizer.decode(generated_ids[j], skip_special_tokens=True)
            # extracted_output = generated_text[len(input_texts[j]):].strip()
            extracted_output = extract_output(generated_text)
            preds.append(extracted_output)
            targets.append([batch['tgt'][j]])
            srcs.append([batch['src'][j]])

    bleu = evaluate.load("bleu")
    results = bleu.compute(predictions=preds, references=targets)

    # Save predictions and references to a JSON file; useful for case study in pt3.
    import json
    validation_output = {
        "src": srcs,
        "pred": preds,
        "tgt": targets,
    }
    with open(output_file, "w") as f:
        json.dump(validation_output, f, indent=4)


    return results["bleu"]



In [None]:
# TODO: Evaluate model, use the function given above

bleu_score = evaluate_model(model, tokenizer, val_ds, output_file = "/tmp/SFT_validation_output.json", max_length=512, max_new_tokens=512)
print(f"BLEU score on validation set: {bleu_score}")

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


BLEU score on validation set: 0.4545910286204956


Expected BLEU score after 1 epoch SFT is ~ 0.48.
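
For intuition about what this score measures, here is a simplified corpus-level BLEU in plain Python. It is a sketch under stated assumptions, not the `evaluate` implementation: `simple_bleu` is a hypothetical helper that splits on whitespace, supports one reference per prediction, and applies no smoothing, so its values will not match the library's output exactly.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(predictions, references, max_order=4):
    """Corpus-level BLEU with whitespace tokens and one reference per prediction."""
    matches = [0] * max_order
    totals = [0] * max_order
    pred_len = ref_len = 0
    for pred, ref in zip(predictions, references):
        pred_tokens, ref_tokens = pred.split(), ref.split()
        pred_len += len(pred_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_order + 1):
            pred_counts = ngrams(pred_tokens, n)
            ref_counts = ngrams(ref_tokens, n)
            # Clipped counts: an n-gram is credited at most as often as it occurs in the reference.
            matches[n - 1] += sum((pred_counts & ref_counts).values())
            totals[n - 1] += sum(pred_counts.values())
    if pred_len == 0 or min(matches) == 0:
        return 0.0  # no smoothing: any order without a match zeroes the score
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_order
    # The brevity penalty punishes predictions shorter than their references.
    brevity_penalty = 1.0 if pred_len > ref_len else math.exp(1 - ref_len / pred_len)
    return brevity_penalty * math.exp(log_precision)

exact = simple_bleu(["the cat sat on the mat"], ["the cat sat on the mat"])
partial = simple_bleu(["the cat sat on the mat today"], ["the cat sat on the mat"])
print(exact, partial)
```

Because the score is a geometric mean of 1- to 4-gram precisions, a single wrong word in a short sentence can remove every matching 4-gram, which is one reason short GEC outputs are scored harshly.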

In [None]:
import json

def load_and_read_samples(output_file="validation_output.json", num_samples=10):
    # Load the validation output file
    with open(output_file, "r") as f:
        validation_output = json.load(f)

    # Extract predictions and references
    predictions = validation_output["pred"]
    references = validation_output["tgt"]
    srcs = validation_output["src"]

    # Read the first num_samples samples
    for i in range(min(num_samples, len(predictions))):
        print(f"Sample {i + 1}:")
        print(f"Source: {srcs[i]}")
        print(f"Prediction: {predictions[i]}")
        print(f"Reference: {references[i]}")
        print("-" * 40)

# Inspect SFT_model output
load_and_read_samples(output_file="/tmp/SFT_validation_output.json", num_samples=10)

Sample 1:
Source: ['Fix grammaticality: First of all, from you read just to found in the poems or novel what well-known critic have already found out, you looses the pleasures of reading something which is expecting to be a new experience to you.']
Prediction:  First of all, from you, I read just to find out what well-known critic has already found out, you have lost the pleasures of reading something which is expecting to be a new experience to you.
Reference: ['First of all, if you read just to find in the poem or novel what well-known critics have already found out, you lose the pleasure of reading something that is expected to be a new experience to you.']
----------------------------------------
Sample 2:
Source: ['Fix grammatical errors: Their research shown that before Hurricane Sandy only " about 50 percent during resident used the emergency departments, " and " only about 35 percents sought inpatient cares there and less than 10 percent used the hospitals when needing surgerie

## **2.2 Create a preference optimization dataset [5 points]**

* *Generate Output Variants* -- for each input sentence in the training set, use the fine-tuned model to generate two different output variants.
 * Consider using different decoding strategies, such as varying the temperature or beam size, to produce diverse outputs. Select an approach based on the desired balance between diversity and quality.

* *Preference Annotation* -- measure the edit distance between each **generated predicted variant** and **ground truth correction**. Label the variant with the lower edit distance as "chosen" and the one with the higher edit distance as "rejected."
 * Beyond using edit distance, what other metrics or methods could you consider to do preference dataset annotation?



### Report on Part 2.2:

- *Generate Output Variants:*

    To ensure diversity and quality, I experimented with four decoding strategies: **[Greedy Search], [Beam Search], [Sampling with temperature], and [Top-P sampling]**, as shown below. Any two or more `decoding_methods` can be used to generate candidate outputs; the candidates with the smallest/largest edit distance to the ground-truth target are then selected as the chosen/rejected pair.

    - To balance diversity and quality across outputs, the final preference dataset is created with **`decoding_methods=['greedy', 'sampling']`**. Greedy search is deterministic and computationally cheaper than the other methods, while sampling with temperature (`temperature=0.7`, with no top-k or top-p truncation) adds the randomness needed for the second variant to differ from the greedy one.

    - To maintain quality (and diversity) within each method's outputs, I chose a reasonable set of hyperparameters for each decoding method.

    - In practice, apart from varying decoding strategies, another effective way to generate diverse, high-quality preference data is to prompt the LLM with few-shot examples drawn at random from a pool, so that each generation sees a different context.

- *Preference Annotation*:

    - Beyond edit distance, BLEU would be another suitable automatic metric for this project, especially since it is the metric used to evaluate the SFT (and later DPO-trained) model.

    - LLM-as-a-judge evaluation is another automatic way to assess how close a model-generated sequence is to the ground-truth sequence. It can be made more fine-grained and domain-specific, e.g., by instructing the LLM evaluator to judge grammaticality specifically.

- *Visualization*:

    - **Diversity**: As the visualization shows, the chosen and rejected outputs in the generated preference dataset do not differ much; however, no chosen/rejected pair is identical. The differences are small but consistent, and they concern grammaticality.
    - **Quality**: Both chosen and rejected outputs are well formatted, generally following the input and performing the grammatical error correction task.
    - **Annotation**: The preference annotations are relatively consistent: in each of the five samples, the chosen output has a minor yet clear edge over the rejected output in being closer to the target output.
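
The annotation rule described above (lowest edit distance is "chosen", highest is "rejected") can be sketched in plain Python. This is only an illustration: `levenshtein` and `annotate_pair` are hypothetical helper names, the distance is a textbook dynamic-programming implementation rather than the `fast_edit_distance` package, and dropping tied pairs is an extra assumption that the notebook's own pipeline does not make.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance via the classic two-row dynamic program."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def annotate_pair(candidates: dict, target: str):
    """Pick chosen/rejected among {method: text}; return None on a tie (assumption)."""
    distances = {method: levenshtein(text.strip(), target.strip())
                 for method, text in candidates.items()}
    chosen = min(distances, key=distances.get)
    rejected = max(distances, key=distances.get)
    if distances[chosen] == distances[rejected]:
        return None  # both variants are equally far from the target: no signal
    return {
        "chosen": candidates[chosen],
        "rejected": candidates[rejected],
        "chosen_method": chosen,
        "rejected_method": rejected,
    }


record = annotate_pair(
    {"greedy": "He go to school.", "sampling": "He goes to school."},
    target="He goes to school.",
)
print(record)
```

Skipping ties is one possible way to lower label noise; whether it helps here would need to be checked empirically.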



In [None]:
!pip install fast_edit_distance
from fast_edit_distance import edit_distance
from tqdm import tqdm
import torch
from datasets import Dataset

# Generate Output Variants and Annotate Preferences
def generate_annotate_preference(model, tokenizer, dataset, seed=42,
                               batch_size=32, decoding_methods=('greedy', 'top-p'), input_max_length=256, max_new_tokens=256,
                               num_beams=2, no_repeat_ngram_size=6,     # Beam Search hp
                               temperature=0.7,                         # Sampling with temperature hp
                               top_p=0.92):                             # Top-p sampling hp
    preference_data = []
    torch.manual_seed(seed)

    for i in tqdm(range(0, len(dataset), batch_size), desc="Generating Preferences"):
        batch = dataset[i : i + batch_size]
        input_texts = batch['prompt']
        inputs = tokenizer(input_texts, return_tensors="pt", padding='longest', truncation=True, max_length=input_max_length).to(device)

        output_variants = {}  # method -> list of generated outputs, one per prompt in the batch

        # --- Decoding: one generate() call per method covers the whole batch ---
        for method in decoding_methods:

            # Generate outputs with different decoding strategies
            if method == 'greedy':  # Greedy Search
                generated_ids = model.generate(
                    **inputs, max_new_tokens=max_new_tokens, eos_token_id=tokenizer.eos_token_id, repetition_penalty=1.0,
                    do_sample=False, num_beams=1
                )
            elif method == 'beam':  # Beam Search
                generated_ids = model.generate(
                    **inputs, max_new_tokens=max_new_tokens, eos_token_id=tokenizer.eos_token_id, repetition_penalty=1.0,
                    num_beams=num_beams, no_repeat_ngram_size=no_repeat_ngram_size, early_stopping=True, num_return_sequences=1
                )
            elif method == 'sampling':  # Sampling with temperature
                generated_ids = model.generate(
                    **inputs, max_new_tokens=max_new_tokens, eos_token_id=tokenizer.eos_token_id, repetition_penalty=1.0,
                    top_k=0, temperature=temperature, do_sample=True
                )
            elif method == 'top-p':  # Top-p sampling
                generated_ids = model.generate(
                    **inputs, max_new_tokens=max_new_tokens, eos_token_id=tokenizer.eos_token_id, repetition_penalty=1.0,
                    top_k=0, top_p=top_p, do_sample=True
                )
            else:
                raise ValueError(f"Unknown decoding method: {method}")

            # Batch decode and extract the generated text
            generated_texts = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
            extracted_outputs = [extract_output(generated_text) for generated_text in generated_texts] # per method, len(extracted_outputs) = batch_size
            output_variants[method] = extracted_outputs

        # --- Calculate Edit Distances and Determine Preferences ---
        for k in range(len(input_texts)):  # Iterate through each sample in the batch exactly once
            edit_distances = {}
            for method, outputs in output_variants.items():
                edit_distances[method] = edit_distance(outputs[k], batch['tgt'][k])

            # Find methods with minimum and maximum edit distances for this sample
            chosen_method = min(edit_distances, key=edit_distances.get)
            rejected_method = max(edit_distances, key=edit_distances.get)

            preference_data.append({
                'prompt': input_texts[k],
                'tgt': batch['tgt'][k],
                'chosen': output_variants[chosen_method][k],
                'rejected': output_variants[rejected_method][k],
                'chosen_method': chosen_method,
                'rejected_method': rejected_method
            })

    # Return only after every batch has been processed
    return Dataset.from_list(preference_data)

In [None]:
# to load the saved SFT_model model:
SFT_model_name = '/content/drive/MyDrive/Application/C4AI/SFT_512_16_1e-5_epoch3_new'  # Your saved model path
SFT_model = AutoModelForCausalLM.from_pretrained(SFT_model_name).to(device)

# Create PO dataset by generating preference data
preference_dataset = generate_annotate_preference(SFT_model, tokenizer, train_ds, seed=42,
                                                  batch_size=160, decoding_methods=['greedy', 'sampling'],  # Choose at least two methods from ['greedy', 'top-p', 'sampling', 'beam']
                                                  input_max_length=256, max_new_tokens=256,            # set max seq based on train_ds length stats
                                                  num_beams=2, no_repeat_ngram_size=6,                 # Beam Search hp
                                                  temperature=0.7,                                     # Sampling with temperature hp
                                                  top_p=0.92)                                          # Top-p sampling hp

Generating Preferences:   0%|          | 0/124 [00:00<?, ?it/s]Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.

Note: This dataset is uploaded to the Hugging Face Hub for easy loading: "alisa-yingjia-wan/gec_SmolLM_DPO".

To load preference_dataset:
```
from datasets import load_dataset

# load_dataset returns a DatasetDict; select the 'train' split so rows can be indexed directly
preference_dataset = load_dataset("alisa-yingjia-wan/gec_SmolLM_DPO", split="train")
```

In [None]:
# TODO: (Load and) Visualize the created dataset -- display at least 5 lines of the dataset.

for i in range(5):  # Display the first 5 examples
    print(f"Example {i+1}:")
    print(f"Input Text: {preference_dataset[i]['prompt']}")
    print(f"Target: {preference_dataset[i]['tgt']}")
    print(f"Chosen: {preference_dataset[i]['chosen']}")
    print(f"Rejected: {preference_dataset[i]['rejected']}")
    print(f"Chosen Method: {preference_dataset[i]['chosen_method']}")
    print(f"Rejected Method: {preference_dataset[i]['rejected_method']}")
    print("-" * 20)


# self-added TODO: Beyond using edit distance, what other metrics or methods could you consider to do preference dataset annotation?


Example 1:
Input Text: Remove all grammatical errors from this text: For example, countries with a lot of deserts can terraform their desert to increase their habitable land and using irrigation to provide clean water to the desert.. 
Answer: 
Target: For example, countries with a lot of deserts can transform their desert to increase their habitable land and use irrigation to provide clean water to the desert.
Chosen:  For example, countries with a lot of deserts can terraform their desert to increase their habitable land and using irrigation to provide clean water to the desert.
Rejected:  For example, countries with a lot of deserts can terraform their deserts to increase their habitable land and using irrigation to provide clean water to the desert.
Chosen Method: sampling
Rejected Method: greedy
--------------------
Example 2:
Input Text: Improve the grammaticality: As the number of people grows, the need of habitable environment is unquestionably essential.. 
Answer: 
Target: As t

## **2.3 Run Direct Preference Optimization (DPO) [5 points]**
* Use the preference optimization dataset to further train the model through DPO, a method that leverages human-like preferences for model training.
* After running DPO, measure the BLEU score on the test set. Compare this performance to the baseline established during the SFT phase.
* Search for an optimal set of hyperparameters, such as the learning rate and number of epochs. We provide an estimated BLEU score that you should aim to achieve after one epoch. However, you may achieve a better score by finding the most suitable hyperparameters.

### Report on Part 2.3:

Through multiple 1-epoch trials, I found that DPO training on my preference dataset works best with a small learning rate (1e-6).

The BLEU gain of the SFT+DPO model (0.4747) over the SFT model (0.4546) shows that the DPO stage improves model performance. The preference signal helps the model align with better outputs for grammatical error correction. It also lends some validity to my preference dataset, which was built by choosing between the outputs of different decoding strategies.

However, it is worth noting that both the preference annotation and the model evaluation rely on automatic metrics (edit_distance and BLEU). Therefore, qualitative analysis and manual inspection of the outputs are crucial (reported in Part 3).
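
For reference, the per-pair sigmoid DPO objective that `loss_type="sigmoid"` refers to can be sketched in plain Python. This is a minimal illustration, not the TRL implementation: the function names are hypothetical, and the inputs are assumed to be summed log-probabilities of each completion under the policy and the frozen reference (SFT) model.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_sigmoid_loss(policy_chosen_logp: float, policy_rejected_logp: float,
                     ref_chosen_logp: float, ref_rejected_logp: float,
                     beta: float = 0.1) -> float:
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio)) for one pair."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)


# At initialization the policy equals the reference model, so the margin is 0.
initial_loss = dpo_sigmoid_loss(-20.0, -22.0, -20.0, -22.0)

# After training, the policy assigns relatively more mass to the chosen output.
improved_loss = dpo_sigmoid_loss(-18.0, -25.0, -20.0, -22.0)
print(initial_loss, improved_loss)
```

With a zero margin the loss equals ln 2 ≈ 0.693, so a training loss that falls below that value indicates that the policy is starting to separate chosen from rejected outputs relative to the reference model.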

In [None]:
# DPOTrainer: https://huggingface.co/docs/trl/en/dpo_trainer#trl.DPOTrainer
# DPOConfig: https://huggingface.co/docs/trl/en/dpo_trainer#trl.DPOConfig
import os
from trl import DPOConfig, DPOTrainer

# TODO: Run Direct Preference Optimization (DPO)

# load dataset
from datasets import load_dataset
preference_dataset = load_dataset("alisa-yingjia-wan/gec_SmolLM_DPO")

# load model, initialized from SFT_model
SFT_directory = '/content/drive/MyDrive/Application/C4AI/SFT_512_16_1e-5_epoch3_new'
DPO_model = AutoModelForCausalLM.from_pretrained(SFT_directory).to(device)

training_args = DPOConfig(
    beta=0.1, # beta is the temperature parameter for the DPO loss, typically between 0.1 to 0.5. We ignore the reference model as beta -> 0.
    loss_type="sigmoid", # the DPO authors propose the sigmoid loss on the normalized likelihood via the logsigmoid to fit a logistic regression
    # max_length=512,
    max_prompt_length=256,
    max_target_length=256,
    per_device_train_batch_size=16,   # Batch size for training
    learning_rate=1e-6,
    num_train_epochs=1,
    output_dir="/tmp/DPO",
    fp16=True
)

dpo_trainer = DPOTrainer(
    DPO_model,
    args=training_args,
    train_dataset=preference_dataset['train'],
    tokenizer=tokenizer,
)

dpo_trainer.train()



Tokenizing train dataset:   0%|          | 0/25600 [00:00<?, ? examples/s]

  self.scaler = torch.cuda.amp.GradScaler(**kwargs)
Could not estimate the number of tokens of the input, floating-point operations will not be computed


| Step | Training Loss |
|------|---------------|
| 500  | 0.626         |
| 1000 | 0.5644        |
| 1500 | 0.5393        |


TrainOutput(global_step=1600, training_loss=0.5743206834793091, metrics={'train_runtime': 376.6671, 'train_samples_per_second': 67.965, 'train_steps_per_second': 4.248, 'total_flos': 0.0, 'train_loss': 0.5743206834793091, 'epoch': 1.0})

In [None]:
# Save DPO_model
DPO_directory = '/content/drive/MyDrive/Application/C4AI/DPO_256_16_5e-6_epoch1_new'
save_trainer_checkpoint(dpo_trainer, DPO_directory)

Saved to /content/drive/MyDrive/Application/C4AI/DPO_256_16_5e-6_epoch1_new!


In [None]:
# TODO: Evaluate model, use evaluate_model function
# toy_val_ds = val_ds.select(range(20))
dpo_bleu_score = evaluate_model(DPO_model, tokenizer, val_ds, output_file="/tmp/DPO_validation_output.json", max_length=512, max_new_tokens=256)
print(f"BLEU score on validation set after DPO training: {dpo_bleu_score}")

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.

BLEU score on validation set after DPO training: 0.4747308567097804


Expected BLEU score after 1 epoch SFT + DPO is ~ 0.50.

# **Coding Challenge Part 3: Explore Alternative DPO Variants for Improved Model Performance [10 points]**

Consider employing a different version or variant of DPO. Your task is to:

* Choose a variant of DPO or another preference-based optimization method that could potentially enhance the model's performance.
* Describe the specific differences in this approach compared to the initial DPO method used.
* Train the model using this alternative DPO method and measure its performance on the test set using the BLEU score.
* Compare these results with the baseline performance achieved during the initial Supervised Fine-Tuning (SFT) and the first DPO implementation.
* Select a few GEC examples after the SFT, DPO, and DPO-variant phases and compare the quality of the corrections. Which one do you prefer as a human?
* You are allowed to make changes in the preference data annotation to improve the score, e.g. apply different metrics or methods beyond edit distance.
* Discuss the role of any changes in achieving these results. Consider potential trade-offs or limitations introduced by the new approach.

## Report on Part 3:

- I chose **[Robust DPO](https://arxiv.org/pdf/2403.00409)** as the alternative DPO method to potentially enhance the model's performance. The code for Robust DPO training is given below.

    (The main rationale for choosing Robust DPO is the potentially high noise level of my preference dataset. The paper reports that the performance of DPO drops significantly when the noise rate is high, which is a major concern for my DPO pipeline.
    
    First, because the dataset is annotated with a simple automatic metric (edit_distance), the chosen/rejected pairs do not always reflect a meaningful preference, which introduces label noise.

    Second, the output variants were generated with different decoding strategies and may lack diversity: the two outputs can both be valid and differ only slightly in phrasing, which leads to ambiguous or arbitrary preferences. This can also introduce inconsistency into the DPO training process.)


- Differences between the two approaches:
    1. The main difference is in the loss function: *DPO* is grounded in the log-sigmoid loss, which optimizes the model on the relative likelihood of preferred vs. non-preferred completions. It learns directly from the preference pairs and treats every label as correct. In comparison, *Robust DPO* assumes that each preference label may have been flipped with some probability (set via `label_smoothing`) and corrects the loss for this noise, which reduces overconfidence on individual pairs and improves robustness to noisy data.

    2. Assumptions: DPO with the sigmoid loss is a general-purpose baseline that assumes clean labels, while Robust DPO modifies the DPO loss on the assumption that preferences are noisy (probabilistic) rather than strictly binary.

 (Results and Discussions are continued in the next text block.)
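
To make the difference concrete, the two per-pair losses can be written side by side in plain Python. This is a sketch under stated assumptions: it follows the noise-corrected loss from the Robust DPO paper as I understand it, `margin` stands for `beta` times the difference of policy/reference log-ratios, the function names are hypothetical, and TRL's `loss_type="robust"` is assumed (not verified here) to compute the same expression.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def standard_dpo_loss(margin: float) -> float:
    """Sigmoid DPO loss for one pair; every label is treated as correct."""
    return -log_sigmoid(margin)


def robust_dpo_loss(margin: float, label_smoothing: float = 0.2) -> float:
    """Noise-corrected DPO loss, assuming labels are flipped with rate label_smoothing."""
    eps = label_smoothing
    if not 0.0 <= eps < 0.5:
        raise ValueError("label_smoothing must lie in [0, 0.5)")
    # Weighted loss on the given label minus weighted loss on the flipped label,
    # rescaled by 1 / (1 - 2 * eps) to undo the bias introduced by label noise.
    return (-(1.0 - eps) * log_sigmoid(margin)
            + eps * log_sigmoid(-margin)) / (1.0 - 2.0 * eps)


for margin in (0.0, 1.0, 10.0):
    print(margin, standard_dpo_loss(margin), robust_dpo_loss(margin, 0.2))
```

Two properties follow from the formula: with `label_smoothing=0` it reduces to the standard loss, and for large positive margins it becomes negative, whereas the standard loss is bounded below by zero. Logged training losses from the two objectives are therefore not directly comparable, which may be one reason a lower robust training loss does not translate into a higher BLEU score.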

In [None]:
# DPO variant reference doc: https://huggingface.co/docs/trl/en/dpo_trainer#loss-functions
# PEFT: https://huggingface.co/docs/trl/en/dpo_trainer#reference-model-considerations-with-peft
# DPOTrainer: https://huggingface.co/docs/trl/en/dpo_trainer#trl.DPOTrainer
# DPOConfig: https://huggingface.co/docs/trl/en/dpo_trainer#trl.DPOConfig

# TODO: Run Robust DPO:

# load robust DPO_model, initialized from SFT_model
SFT_directory = '/content/drive/MyDrive/Application/C4AI/SFT_512_16_1e-5_epoch3_new'
rDPO_model = AutoModelForCausalLM.from_pretrained(SFT_directory).to(device)

training_args = DPOConfig(
    beta=0.1,           # beta is the temperature parameter for the DPO loss, typically between 0.1 to 0.5.
    loss_type="robust",  # Robust DPO
    label_smoothing=0.2, # Robust DPO hp: assumed label-noise rate (0 means standard DPO); values approaching 0.5 indicate high uncertainty or noise in the labels.
    # max_length=512,
    max_prompt_length=256,
    max_target_length=256,
    per_device_train_batch_size=16,   # Batch size for training
    learning_rate=1e-6,
    num_train_epochs=1,
    output_dir="/tmp/rDPO",
    fp16=True,
)

dpo_trainer = DPOTrainer(
    rDPO_model,
    args=training_args,
    train_dataset=preference_dataset['train'],
    tokenizer=tokenizer
)

dpo_trainer.train()



Tokenizing train dataset:   0%|          | 0/25600 [00:00<?, ? examples/s]

  self.scaler = torch.cuda.amp.GradScaler(**kwargs)
Could not estimate the number of tokens of the input, floating-point operations will not be computed


| Step | Training Loss |
|------|---------------|
| 500  | 0.5441        |
| 1000 | 0.3385        |
| 1500 | 0.2218        |


TrainOutput(global_step=1600, training_loss=0.35951624512672425, metrics={'train_runtime': 381.1598, 'train_samples_per_second': 67.163, 'train_steps_per_second': 4.198, 'total_flos': 0.0, 'train_loss': 0.35951624512672425, 'epoch': 1.0})

In [None]:
# SAVE rDPO_model at rDPO_directory

rDPO_directory = '/content/drive/MyDrive/Application/C4AI/rDPO_256_16_1e-6_epoch1_new'
save_trainer_checkpoint(dpo_trainer, rDPO_directory)

Saved to /content/drive/MyDrive/Application/C4AI/rDPO_256_16_1e-6_epoch1_new!


In [None]:
# Evaluate the RobustDPO model on bleu

# toy_val_ds = val_ds.select(range(20))
rdpo_bleu_score = evaluate_model(rDPO_model, tokenizer, val_ds, output_file="/tmp/rDPO_validation_output.json", max_length=512, max_new_tokens=256)
print(f"BLEU score on validation set after robust DPO training: {rdpo_bleu_score}")

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.

BLEU score on validation set after robust DPO training: 0.4538686969060181


In [None]:
# Save _validation_outputs

import json

models = ['SFT', 'DPO', 'rDPO']

for model_name in models:  # model_name avoids overwriting the global `model` object
    tmp_json = f'/tmp/{model_name}_validation_output.json'
    file_path = f'/content/drive/MyDrive/Application/C4AI/{model_name}_validation_output.json'

    # Read data from the temporary JSON file
    with open(tmp_json, 'r') as tmp_file:
        data = json.load(tmp_file)

    # Write data to the final JSON file
    with open(file_path, 'w') as json_file:
        json.dump(data, json_file, indent=4)

In [None]:
# Qualitative case study: See report below for my preference annotation.

import json

def load_and_read_samples(models, base_path, num_samples=5):
    # Initialize a dictionary to store predictions
    predictions_dict = {}

    # Load the validation output files for each model
    for model in models:
        file_path = f"{base_path}/{model}_validation_output.json"
        with open(file_path, "r") as f:
            output = json.load(f)
        predictions_dict[model] = output["pred"]

    sources = output['src']
    references = output["tgt"]

    # Read the first num_samples samples
    for i in range(num_samples):
        print(f"Sample {i + 1}:")
        print(f"Source: {sources[i]}")
        print(f"Reference: {references[i]}")

        for model in models:
            print(f"{model} Prediction: {predictions_dict[model][i]}")
        print("-" * 40)

# Specify the models and base path
models = ['SFT', 'DPO', 'rDPO']
base_path = "/content/drive/MyDrive/Application/C4AI"

# Compare predictions for 5 samples
load_and_read_samples(models, base_path, num_samples=5)


Sample 1:
Source: ['Fix grammaticality: First of all, from you read just to found in the poems or novel what well-known critic have already found out, you looses the pleasures of reading something which is expecting to be a new experience to you.']
Reference: ['First of all, if you read just to find in the poem or novel what well-known critics have already found out, you lose the pleasure of reading something that is expected to be a new experience to you.']
SFT Prediction:  First of all, from you, I read just to find out what well-known critic has already found out, you have lost the pleasures of reading something which is expecting to be a new experience to you.
DPO Prediction:  First of all, from you read just to find out the poems or novel what well-known critic have already found out, you lose the pleasures of reading something which is expecting to be a new experience to you.
rDPO Prediction:  First of all, from you read just to find in the poems or novel what well-known critic h

## Report on Part 3 (Cont.)

- Comparison of Results:

    The BLEU scores on the validation set for the SFT baseline, the SFT-DPO model, and the SFT-Robust-DPO model are listed in the table below. Further training the SFT model with DPO raises BLEU from 0.4546 to 0.4747. Robust DPO, however, did not outperform standard DPO: at 0.4539 it ends marginally below the SFT baseline.

    On the one hand, the weaker result of Robust DPO could mean that the preference labels in my dataset are not drastically noisy. Robust DPO is built on the assumption that the preference dataset suffers heavily from preference noise (e.g., arbitrary preference annotations, indistinguishable chosen/rejected pairs). When that assumption does not hold, the robust loss function does not necessarily help.
    
    On the other hand, the lower performance of Robust DPO may also stem from a suboptimal set of hyperparameters. I used the same learning rate as for DPO and, due to time constraints, did not experiment with other values of `beta` and `label_smoothing`. This could also be a major reason.


| Model               | Epochs | BLEU Score |
|---------------------|--------|------------|
| SFT Model          | SFT: 3 | 0.4546     |
| SFT-DPO Model        | +DPO: 1 | 0.4747     |
| SFT-Robust-DPO Model | +R-DPO: 1 | 0.4539     |
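
Since `beta` and `label_smoothing` were left untuned, a small sweep would be the natural follow-up. The sketch below only enumerates candidate configurations; `build_sweep` is a hypothetical helper and the grid values are illustrative assumptions, not settings that were actually run.

```python
from itertools import product


def build_sweep(betas, label_smoothings, learning_rates):
    """Enumerate Robust DPO settings, skipping label_smoothing values outside [0, 0.5)."""
    configs = []
    for beta, eps, lr in product(betas, label_smoothings, learning_rates):
        if not 0.0 <= eps < 0.5:
            continue  # the robust loss divides by (1 - 2 * eps)
        configs.append({
            "loss_type": "robust",
            "beta": beta,
            "label_smoothing": eps,
            "learning_rate": lr,
            "num_train_epochs": 1,
            "output_dir": f"/tmp/rDPO_beta{beta}_ls{eps}_lr{lr}",
        })
    return configs


# Illustrative grid (assumed values)
sweep = build_sweep(
    betas=[0.1, 0.3, 0.5],
    label_smoothings=[0.0, 0.1, 0.2, 0.5],
    learning_rates=[1e-6, 5e-6],
)
print(len(sweep), "configurations")
for cfg in sweep[:3]:
    print(cfg)
```

Each dictionary uses only keyword names that already appear in the `DPOConfig` calls of this notebook, so it could be expanded into a config together with the remaining fixed arguments, with one 1-epoch run and one BLEU evaluation per entry.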



- Comparative Case Study on Models' Output:
---
| Model        | Prediction                                                                                                                                                                                              |
|--------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Target**   | First of all, if you read just to find in the poem or novel what well-known critics have already found out, you lose the pleasure of reading something that is expected to be a new experience to you. |
| SFT          | First of all, from you, I read just to find out what well-known critic has already found out, you have lost the pleasures of reading something which is expecting to be a new experience to you.     |
| DPO          | First of all, from you read just to find out the poems or novel what well-known critic have already found out, you lose the pleasures of reading something which is expecting to be a new experience to you. |
| rDPO         | First of all, from you read just to find in the poems or novel what well-known critic have already found out, you loosed the pleasures of reading something which is expecting to be a new experience to you. |
| **Sample 1 Preference** |     SFT                                                                                                                                                                                                   |


| Model        | Prediction                                                                                                                                                                                                                                         |
|--------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Target**   | Their research showed that before Hurricane Sandy, only " about 50 percent of residents used the emergency department " and " only about 35 percent sought inpatient care there, and less than 10 percent used the hospital when needing surgery of any kind. " |
| SFT          | 50 percent of the residents used the emergency departments, and only about 35 percent of the residents sought inpatient care there and less than 10 percent used the hospitals when needed.                                                            |
| DPO          | Their research showed that before Hurricane Sandy only " about 50 percent during the resident used the emergency departments, " and " only about 35 percent sought inpatient cares there and less than 10 percent used the hospitals when needed with any kind. ". |
| rDPO         | Answer: \n cloze: Their research showed that before Hurricane Sandy only " about 50 percent during resident used the emergency departments, " and " only about 35 percent sought inpatient cares there and less than 10 percent used the hospitals when needed to surgery with any kind. |
| **Sample 2 Preference** |          DPO                                                                                                                                                                                                                                                 |


| Model        | Prediction                                                                                                                                                                                                                                         |
|--------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Target**   | It is widely believed that every student should be interested in some subjects which might not be interesting to other students so it is difficult to force students to study subjects which they are unwilling to study, otherwise they will fail at them and because of that they will feel too disappointed to do anything and this a significant issue. |
| SFT          | It was widely believed that every student interested in some subject which might not be interested by other students, so it was difficult to force students to study subjects which they are not interested in, otherwise they would fail in it and because of that, they will feel disappointed to do any thing and this a significant issue. |
| DPO          | It been widely believed that every student interested in some subject which might not be interested by other students so it is difficult to force students to study subjects which they unwilling to study it, otherwise they will fail in it and because of that they will feel disappointed to do any thing and this a significant issue. |
| rDPO         | It been widely blelieved that every student interested within some subject which might not be interested by other students so it is difficult to force students to study subjects which they unwilling to study it, otherwise they will fail in it and because of that they will feel disappointed to do any thing and this a significant issue. |
| **Sample 3 Preference** |        SFT                                                                                                                                                                                                                                                   |

| Model        | Prediction                                                                                                                                                                                                                                         |
|--------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Target**   | This is why I totally agree with the following comment: " My upbringing taught me to be calm and easy-going - I really appreciate that now. " First of all, I agree with this person because I think that the way someone has been brought up has a great influence on his life. |
| SFT          | This is why I totally agree like the following comments: " My upbringings teaches me to be calm and easy-going - I really appreciate but now ".                                                                                                        |
| DPO          | This is why I totally agree like the following comments: " My upbringings teaches me to be calm and easy-going - I really appreciate but now ". First of all, I agree with this person including I think that the ways someones have been brought having a great influence on his life. |
| rDPO         | This is why I totally agree like the following comments: " My upbringings teaches me to be calm and easy-going - I really appreciate but now ". First of all, I agree with this person including I think that the ways someones have been brought having a great influence on his life. |
| **Sample 4 Preference** |       DPO = rDPO                                                                                                                                                                                                                                                    |

| Model        | Prediction                                                                                                                                                                                                                                            |
|--------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Target**   | Yesterday I went to the shopping centre with some friends. I really enjoyed it, I like to buy new clothes for me, it's my best hobby, the problem is that I don't have very much money now. I think I'll ask my father for some. I need more clothes. I'm planning to go to the shopping centre again tomorrow, or maybe today in the afternoon. |
| SFT          | esterday, I went after the Center shopping before some friends, I really enjoyed it, I liked to buy new clothes for me, it's my best hobbie, the problem is that I don't have so much money now, I think I'll ask for it despite my father, I need more clothes, I'm planning to go back to the shopping again tomorrow, maybe today beyond the afternoon. |
| DPO          | Yesterday I went after the Center shopping before some friends, I really enjoyed it, I liked to buy new clothes for me, it's my best hobbie, the problem is that I doesn't have so much money now, I think I'll ask for it despite my father, I need more clothes, I'm planning to go again tomorrow, maybe today beyond the afternoon.        |
| rDPO         | Yesterday I went after the Center shopping before some friends, I really enjoyed it, I liked to buy new clothes for me, it's my best hobbie, the problem is that I doesn't have so much money now, I think I'll ask for it despite my father, I need more clothes, I'm planning to go of the shopping again tomorrows, maybe today beyond the afternoon.     |
| **Sample 5 Preference** |      SFT                                                                                                                                                                                                                                                        |


---


- Discussion:

    (Discuss the role of any changes in achieving these results. Consider potential trade-offs or limitations introduced by the new approach.)

    - The increase in BLEU score from DPO to robust DPO, together with the qualitative case analysis, suggests that robust DPO is better suited to preference datasets with a high level of preference noise, which is the case for the current preference dataset. By setting label_smoothing=0.3 so that the loss function accounts for preference noise, SFT + robust DPO obtains the highest BLEU score.
    - As mentioned earlier, the current project has limitations at several stages that call for further improvement:
        - BLEU is used as the sole evaluation metric for the grammatical correction task, although it is not the best metric for capturing grammatical nuances;
        - edit distance is used as the annotation tool for the preference dataset;
        - more decoding strategies and approaches to elicit diverse generations remain to be explored.
    - Most of the trade-offs have already been covered in the earlier discussions in Part 2. In general, computational efficiency and model performance (currently measured by BLEU score) are the key factors under consideration.
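
To make the label-smoothing adjustment mentioned in the discussion concrete, here is a minimal pure-Python sketch of how a noise rate of 0.3 can enter the DPO loss for a single preference pair. It is an illustration under stated assumptions, not the training code used in this notebook: the function names, the `beta=0.1` default, and the scalar `margin` input are all hypothetical. `margin` stands for the chosen-minus-rejected difference of policy-to-reference log-probability ratios. Two common variants are shown because the notebook's trainer configuration determines which one is actually applied: the conservative (smoothed) form and the debiased "robust" form.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(margin: float, beta: float = 0.1) -> float:
    """Standard DPO loss for one pair; beta=0.1 is an assumed value."""
    return -log_sigmoid(beta * margin)


def conservative_dpo_loss(margin: float, beta: float = 0.1,
                          label_smoothing: float = 0.3) -> float:
    """Smoothed loss: mixes the loss for the given label with the flipped one."""
    eps = label_smoothing
    return (1 - eps) * dpo_loss(margin, beta) + eps * dpo_loss(-margin, beta)


def robust_dpo_loss(margin: float, beta: float = 0.1,
                    label_smoothing: float = 0.3) -> float:
    """Debiased loss: subtracts the flipped-label term and rescales,
    so its expectation under label flips with rate eps equals the clean loss."""
    eps = label_smoothing
    if not 0 <= eps < 0.5:
        raise ValueError("label_smoothing must be in [0, 0.5)")
    return ((1 - eps) * dpo_loss(margin, beta)
            - eps * dpo_loss(-margin, beta)) / (1 - 2 * eps)


if __name__ == "__main__":
    for m in (-5.0, 0.0, 5.0):
        print(m, dpo_loss(m), conservative_dpo_loss(m), robust_dpo_loss(m))
```

With `label_smoothing=0` both variants reduce to the standard DPO loss. The rescaling by `1 - 2 * eps` in the robust form is what makes it an unbiased estimate of the clean loss when labels are flipped with probability `eps`, which is why it requires `eps < 0.5`.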