### Goal
We are going to fit pretrained GPT2 studied on IMDB dataset using DPOTrainer to generate more positive texts

In [None]:
!pip install accelerate
!pip install trl

In [1]:
import json
import accelerate
from transformers import GPT2Tokenizer, GPT2LMHeadModel
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
from transformers import TrainingArguments
from trl import DPOTrainer
from datasets import Dataset
from sklearn.model_selection import train_test_split


import torch
from itertools import product

  from .autonotebook import tqdm as notebook_tqdm


In [2]:
if torch.cuda.is_available():
    device = torch.device("cuda")
    print("CUDA available")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
    print("MPS available")
else:
    device = torch.device("cpu")
    print("Using CPU")

MPS available


In [3]:
gpt2_name = "lvwerra/GPT2-IMDB"
gpt2_tokenizer = GPT2Tokenizer.from_pretrained(gpt2_name)
gpt2_model = GPT2LMHeadModel.from_pretrained(gpt2_name)
gpt2_tokenizer.pad_token_id = 50256

### Generating pairs winner-loser

#### Method:
We will generate some textes grouped by prompt with pretrained gpt2 model and split them into 2 categories: bad with reward lower than 0.5 from rank model and good. Than for every prompt we will add сartesian product of bad and good texts into the dataset. 

In [4]:
def get_answers(promt: str, num_responses, model=gpt2_model, tokenizer=gpt2_tokenizer) -> str:
    input_ids = tokenizer.encode(promt, return_tensors="pt")
    output = model.generate(
        input_ids,
        max_length=30,
        do_sample=True,
        num_return_sequences=num_responses,
        top_k=20,
        temperature=.7).to('mps')

    return [tokenizer.decode(output[i], skip_special_tokens=True) for i in range(num_responses)]



In [5]:
prompts = [
    "The film's story begins when",
    "In this scene, we see",
    "The character's journey starts with",
    "Set in a world where",
    "The movie depicts a scenario in which",
    "In the heart of the city, there is",
    "During a typical day, the main character",
    "The narrative unfolds with a scene showing",
    "Exploring the themes of",
    "The plot takes a turn when",
    "Centered around the life of",
    "In the midst of a conflict, the protagonist",
    "The setting of the story is characterized by",
    "As the story progresses, we discover",
    "Caught between different worlds, the character",
    "The tale weaves through the events of",
    "In an unexpected twist, the film",
    "Juxtaposing the past and present, the story",
    "The climax of the movie occurs during",
    "The narrative explores the relationship between",
    "The protagonist's motivation is revealed when",
    "The audience is introduced to a mysterious character who",
    "The opening scene sets the tone for the entire story with",
    "Throughout the film, the main character struggles with",
    "The plot thickens when",
    "Against all odds, the hero",
    "The story unfolds in a world where",
    "The film delves into the complexities of",
    "The protagonist's journey is hindered by"
    "The audience is left wondering about the true intentions of",
    "The narrative highlights the contrast between",
    "The film's climax is reached through",
    "The protagonist's internal conflicts are reflected in"
    "As the story progresses, the audience is left questioning"
    "The setting plays a crucial role in shaping the characters' decisions, especially"
    "The narrative tackles the issue of identity through",
    "The audience is kept on the edge of their seats as",
    "The narrative explores the consequences of",
    "The story takes a surprising turn when"
]

In [6]:
rank_name = "lvwerra/distilbert-imdb"
rank_tokenizer = DistilBertTokenizer.from_pretrained(rank_name)
rank_model = DistilBertForSequenceClassification.from_pretrained(rank_name)

In [7]:
def get_ranks(answers: list[str]) -> list[int]:
    inputs = [rank_tokenizer(answer, return_tensors="pt") for answer in answers]
    outputs = [rank_model(**input) for input in inputs]
    return [torch.softmax(output.logits[0], dim=0)[1].item() for output in outputs]



In [8]:
def gen_marked_answers(prompt, model, tokenizer):
    answers = get_answers(prompt, 40, model, tokenizer)
    ranks = get_ranks(answers)
    positive = []
    negative = []
    for i in range(len(answers)):
        if ranks[i] > 0.6:
            positive.append(answers[i])
        else:
            negative.append(answers[i])
    return positive, negative


In [9]:
def gen_dataset(prompts, model, tokenizer):
    prompt_list = []
    chosen_list = []
    rejected_list = []
    for prompt in prompts:
        positive, negative = gen_marked_answers(prompt,model,tokenizer)
        for pr, ac, re in product([prompt],positive,negative):
            prompt_list.append(pr)
            chosen_list.append(ac)
            rejected_list.append(re)
    dataset = {"prompt": prompt_list, "chosen": chosen_list, "rejected": rejected_list}
    with open('generated_data.json', 'w') as json_file:
        json.dump(dataset, json_file, indent=4)
    print("Data saved to 'generated_data.json'")
    return  dataset

def load_dataset(path):
    with open(path,"r") as json_file:
        return json.load(json_file)

In [16]:
dpo_dataset_dict = Dataset.from_dict(gen_dataset(prompts, gpt2_model, gpt2_tokenizer))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generati

Data saved to 'generated_data.json'


In [13]:
dpo_dataset_dict = Dataset.from_dict(load_dataset("generated_data.json"))

In [14]:
dpo_dataset_dict

Dataset({
    features: ['prompt', 'chosen', 'rejected'],
    num_rows: 5892
})

### DPO with hinge loss 

In [15]:
splitted = dpo_dataset_dict.train_test_split(test_size=0.2,seed=42)
train_data, test_data = splitted['train'], splitted['test']
splitted = test_data.train_test_split(test_size=0.1,seed=42)
test_data,validation_data = splitted['train'], splitted['test']

In [27]:
gpt2_model_hinge = GPT2LMHeadModel.from_pretrained(gpt2_name)
gpt2_tokenizer_hinge = GPT2Tokenizer.from_pretrained(gpt2_name)
gpt2_tokenizer_hinge.pad_token_id = 50256


training_args = TrainingArguments(
    output_dir="./output/hinge",
    per_device_train_batch_size=8,
    num_train_epochs=5,
    evaluation_strategy="epoch",
    save_total_limit=1,
)

dpo_trainer = DPOTrainer(
    gpt2_model_hinge,
    args=training_args,
    beta=0.1,
    train_dataset=train_data,
    loss_type="hinge",
    eval_dataset=validation_data,
    tokenizer=gpt2_tokenizer_hinge,
)

In [28]:
dpo_trainer.train()

Could not estimate the number of tokens of the input, floating-point operations will not be computed


Epoch,Training Loss,Validation Loss,Rewards/chosen,Rewards/rejected,Rewards/accuracies,Rewards/margins,Logps/rejected,Logps/chosen,Logits/rejected,Logits/chosen
1,0.1106,0.00034,-0.838304,-5.609751,1.0,4.771447,-149.120102,-101.162086,-43.919498,-42.777184
2,0.0215,0.0,-1.706716,-6.625601,1.0,4.918884,-159.278625,-109.846199,-45.705898,-44.523148
3,0.0094,0.0,-2.652683,-8.081777,1.0,5.429093,-173.840363,-119.305862,-97.228951,-95.538467
4,0.0045,0.0,-2.632718,-8.074957,1.0,5.442239,-173.772186,-119.106209,-99.003685,-97.384995
5,0.0034,0.0,-2.713865,-8.17884,1.0,5.464974,-174.811005,-119.917694,-103.28968,-101.6838


TrainOutput(global_step=2950, training_loss=0.025768857770046946, metrics={'train_runtime': 5589.4182, 'train_samples_per_second': 4.216, 'train_steps_per_second': 0.528, 'total_flos': 0.0, 'train_loss': 0.025768857770046946, 'epoch': 5.0})

In [17]:
path = './output/hinge/checkpoint-2500'
gpt2_model_hinge = GPT2LMHeadModel.from_pretrained(path)
gpt2_tokenizer_hinge = GPT2Tokenizer.from_pretrained(path)
gpt2_tokenizer_hinge.pad_token_id = 50256

To calculate entropy and mean reward we will generate N = 500 texts with different prompts by both trained and not trained model

In [18]:
test_dict = test_data.to_dict()

In [19]:
test_prompts = test_dict['prompt'][:500]

In [26]:
test_ans = []
for text in test_prompts:
    test_ans.append(get_answers([text],1,gpt2_model,gpt2_tokenizer)[0])

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` t

In [20]:
test_hinge_ans = []
for text in test_prompts:
    test_hinge_ans.append(get_answers([text],1,gpt2_model_hinge,gpt2_tokenizer_hinge)[0])

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` t

In [49]:
def calculate_mean_reward(answers):
    sum_reward = 0
    for text in answers:
        sum_reward += get_ranks([text])[0]
    return sum_reward/len(answers)

In [54]:
from scipy.stats import entropy
from collections import defaultdict
def token_entropy(generations, tokenizer):
    stats = defaultdict(int)
    num_tokens = 0
    for example in generations:
        tokens = tokenizer.encode(example)
        for t in tokens:
            if t == tokenizer.pad_token_id:
                continue
            stats[t] += 1
            num_tokens += 1
        for k in stats.keys():
            stats[k] /= num_tokens
    return entropy(list(stats.values()))

In [64]:
mean_reward_before = calculate_mean_reward(test_ans)

In [51]:
mean_reward_hinge = calculate_mean_reward(test_hinge_ans)

In [66]:
print(f"mean reward before DPOT: {mean_reward_before}")
print(f"mean reward after DPOT with hinge loss: {mean_reward_hinge}")

mean reward before DPOT: 0.6770781021649018
mean reward after DPOT with hinge loss: 0.9352875939905644


In [56]:
entropy_before = token_entropy(test_ans,gpt2_tokenizer)

In [57]:
entropy_hinge = token_entropy(test_hinge_ans,gpt2_tokenizer_hinge)

In [65]:
print(f"entropy before DPOT: {entropy_before}")
print(f"entropy after DPOT with hinge loss: {entropy_hinge}")

entropy before DPOT: 3.3201568209721186
entropy after DPOT with hinge loss: 3.0629535231772755


### DPO with sigmoid loss 

In [59]:
gpt2_model_sigmoid = GPT2LMHeadModel.from_pretrained(gpt2_name)
gpt2_tokenizer_sigmoid = GPT2Tokenizer.from_pretrained(gpt2_name)
gpt2_tokenizer_sigmoid.pad_token_id=50256


training_args = TrainingArguments(
    output_dir="./output/sigmoid",
    per_device_train_batch_size=8,
    num_train_epochs=5,
    evaluation_strategy="epoch",
    save_total_limit=1,
)

dpo_trainer = DPOTrainer(
    gpt2_model_sigmoid,
    args=training_args,
    beta=0.1,
    train_dataset=train_data,
    loss_type="sigmoid",
    eval_dataset=validation_data,
    tokenizer=gpt2_tokenizer_sigmoid,
)

dpo_trainer.train()

Could not estimate the number of tokens of the input, floating-point operations will not be computed


Epoch,Training Loss,Validation Loss,Rewards/chosen,Rewards/rejected,Rewards/accuracies,Rewards/margins,Logps/rejected,Logps/chosen,Logits/rejected,Logits/chosen
1,0.0997,0.007166,-7.82185,-20.862696,1.0,13.040845,-301.649536,-170.997528,-69.046112,-68.55838
2,0.0229,0.000797,-8.526478,-22.324667,1.0,13.798189,-316.269318,-178.043808,-70.932358,-70.789619
3,0.0092,0.000371,-9.599741,-24.020409,1.0,14.420669,-333.226685,-188.776428,-70.775658,-70.32666
4,0.004,0.000236,-9.28284,-23.978273,1.0,14.695434,-332.805328,-185.607452,-71.407074,-70.921066
5,0.0026,0.000204,-9.391363,-24.441488,1.0,15.050123,-337.437469,-186.692688,-68.021515,-67.388374


TrainOutput(global_step=2950, training_loss=0.02364427588753781, metrics={'train_runtime': 3767.019, 'train_samples_per_second': 6.256, 'train_steps_per_second': 0.783, 'total_flos': 0.0, 'train_loss': 0.02364427588753781, 'epoch': 5.0})

In [61]:
path = './output/sigmoid/checkpoint-2500'
gpt2_model_sigmoid = GPT2LMHeadModel.from_pretrained(path)
gpt2_tokenizer_sigmoid = GPT2Tokenizer.from_pretrained(path)
gpt2_tokenizer_sigmoid.pad_token_id = 50256

In [62]:
test_sigmoid_ans = []
for text in test_prompts:
    test_sigmoid_ans.append(get_answers([text],1,gpt2_model_sigmoid,gpt2_tokenizer_sigmoid)[0])

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` t

In [67]:
mean_reward_sigmoid = calculate_mean_reward(test_sigmoid_ans)

In [68]:
entropy_sigmoid = token_entropy(test_sigmoid_ans,gpt2_tokenizer_hinge)

In [70]:
print(f"mean reward before DPOT: {mean_reward_before}")
print(f"mean reward after DPOT with hinge loss: {mean_reward_hinge}")
print(f"mean reward after DPOT with sigmoid loss: {mean_reward_sigmoid}")

mean reward before DPOT: 0.6770781021649018
mean reward after DPOT with hinge loss: 0.9352875939905644
mean reward after DPOT with sigmoid loss: 0.9448682115077972


In [71]:
print(f"entropy before DPOT: {entropy_before}")
print(f"entropy after DPOT with hinge loss: {entropy_hinge}")
print(f"entropy after DPOT with sigmoid loss: {entropy_sigmoid}")

entropy before DPOT: 3.3201568209721186
entropy after DPOT with hinge loss: 3.0629535231772755
entropy after DPOT with sigmoid loss: 2.8124182883551825


### Results
1. There is a significant improvement in reward metrics when fitting the model with DPOTrainer and using both loss_types, increasing from 0.67 to over 0.9. The sigmoid loss function demonstrates superior performance.
2. Furthermore, there is a reduction in entropy. It is hypothesized that the decrease in the number of words is due to the model avoiding words with lower ranks. Additionally, the sigmoid loss function shows lower entropy.
3. There is a significant drop down of loss while training model. However, there is no overfitting, because token entropy is still not too low

### What to Enhance
1. Expand the training and test datasets.
2. Consider using an alternate method for creating the winner-loser dataset, such as human annotated data or a tournement-like model from Deepmind [SLI-HF](https://arxiv.org/pdf/2305.10425.pdf).