## Optimizing the finetuned custom GPT2 using Reinforcement Learning from Human Feedback (RLHF) 

Rather than collecting human feedback, we use an automatic text generation metric, BERTScore, as the reward signal.

##### Prerequisites

In [None]:
!pip install jupyter==1.0.0
!pip install ipywidgets==8.0.4
!pip install transformers==4.26.0
!pip install datasets==2.9.0
!pip install wandb==0.13.9
!pip install -e git+https://github.com/lvwerra/trl.git@v0.2.1#egg=trl

#### Imports 

In [2]:
from trl import AutoModelForCausalLMWithValueHead
from transformers import GPT2Tokenizer
from transformers import set_seed
from datasets import load_dataset
import matplotlib.pyplot as plt
from datasets import Dataset
from random import choices
from trl import PPOTrainer
from trl import PPOConfig
from tqdm import tqdm
import transformers 
import pandas as pd
import numpy as np
import ipywidgets
import datasets
import logging
import jupyter
import random
import torch
import wandb
import time
import trl
import os


##### Setup logging

In [3]:
logger = logging.getLogger('sagemaker')
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler())

##### Log versions of dependencies 

In [4]:
logger.info(f'[Using transformers version: {transformers.__version__}]')
logger.info(f'[Using datasets version: {datasets.__version__}]')
logger.info(f'[Using wandb version: {wandb.__version__}]')
logger.info(f'[Using trl version: {trl.__version__}]')

[Using transformers version: 4.26.0]
[Using datasets version: 2.9.0]
[Using wandb version: 0.13.9]
[Using trl version: 0.2.1]


#### Setup essentials 

In [5]:
np.random.seed(123)
tqdm.pandas()
set_seed(123)

In [6]:
# Read the API key from the WANDB_API_KEY environment variable instead of hard-coding it
!wandb login "$WANDB_API_KEY"

wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


In [7]:
path = os.path.abspath('01-rlhf.ipynb')
os.environ['WANDB_NOTEBOOK_NAME'] = path

##### Set constants 

In [8]:
MODEL_PATH = '.././02-finetune/model/custom-finetuned'
BOS_TOKEN = '<|startoftext|>'
EOS_TOKEN = '<|endoftext|>'
PAD_TOKEN = '<|pad|>'
MAX_LEN = 512

##### Setup configs

In [9]:
config = PPOConfig(model_name=MODEL_PATH, 
                   batch_size=8,
                   forward_batch_size=4,
                   remove_unused_columns=False,
                   log_with='wandb')

#### Load models 

In [10]:
active_model = AutoModelForCausalLMWithValueHead.from_pretrained(MODEL_PATH)

In [11]:
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(MODEL_PATH)

#### Load tokenizer 

In [12]:
tokenizer = GPT2Tokenizer.from_pretrained('../01-tokenize/vocab-custom', 
                                          bos_token=BOS_TOKEN, 
                                          eos_token=EOS_TOKEN, 
                                          pad_token=PAD_TOKEN, 
                                          lower=True,
                                          return_tensors='pt')
tokenizer.padding_side = 'left'
tokenizer.model_max_length = MAX_LEN
logger.info(f'Tokenizer: {tokenizer}')

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Tokenizer: GPT2Tokenizer(name_or_path='../01-tokenize/vocab-custom', vocab_size=50257, model_max_length=512, is_fast=False, padding_side='left', truncation_side='right', special_tokens={'bos_token': AddedToken("<|startoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'eos_token': AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'unk_token': AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'pad_token': AddedToken("<|pad|>", rstrip=False, lstrip=False, single_word=False, normalized=True)})


#### Load dataset

In [13]:
test_df = pd.read_csv('.././03-evaluate/data/eval_results.csv')
test_df.head()

Unnamed: 0,question,answer,custom_gpt2_answer,oob_gpt2_answer,bert_score_custom_gpt2,bert_score_oob_gpt2,reward
0,"i have a few symptoms like the stomachache, co...",stomach troubles aren't a common symptom of th...,yes! while we are still learning about how cov...,yes! you may be able to get covid-19 from eati...,0.830807,0.819558,0.011249
1,what if my time off is not approved and i don’...,you will be treated just as you would if you d...,you can volunteer at a food bank or other comm...,you can contact your employer or the local fir...,0.81722,0.832336,-0.015117
2,where can i find more information about animal...,"for more information, check out the following ...",the centers for disease control (cdc) is const...,the cdc has a list of animal health organizati...,0.798086,0.80666,-0.008574
3,what precautions should i take during travel?,"during travel, everyone should clean hands fre...",if you have been in close contact with someone...,you can follow the guidance from your healthca...,0.823299,0.826669,-0.003371
4,use a contactless payment method if you can.,to avoid spreading germs during a cash or cred...,many stores have apps that allow shoppers to p...,check your bank account or credit card number ...,0.863554,0.825531,0.038023
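
The `reward` column in this preview appears to be the BERTScore of the custom model minus the BERTScore of the out-of-the-box model. The rows shown are consistent with that reading up to rounding. A minimal check using only the values printed above; the helper name `reward_from_scores` and the tolerance are assumptions made for illustration:

```python
# (bert_score_custom_gpt2, bert_score_oob_gpt2, reward) copied from the preview above
rows = [
    (0.830807, 0.819558, 0.011249),
    (0.81722, 0.832336, -0.015117),
    (0.798086, 0.80666, -0.008574),
    (0.823299, 0.826669, -0.003371),
    (0.863554, 0.825531, 0.038023),
]


def reward_from_scores(custom_score: float, oob_score: float) -> float:
    """Reward as the BERTScore margin of the custom model over the baseline."""
    return custom_score - oob_score


# The 2e-6 tolerance is an assumption that absorbs rounding in the displayed values
max_abs_error = max(abs(reward_from_scores(c, o) - r) for c, o, r in rows)
print(max_abs_error < 2e-6)
```

A positive reward therefore means the finetuned model scored closer to the reference answer than the baseline did on that question.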


In [14]:
rewards = test_df['reward'].to_list()  
max_reward = max(rewards)
min_reward = min(rewards)

In [15]:
dataset = load_dataset('csv', 
                       data_files='.././03-evaluate/data/eval_results_sample.csv',  
                       delimiter=',', 
                       split='train[:100%]',
                       download_mode='force_redownload')
dataset

Using custom data configuration default-64af6f26a2db74ac


Downloading and preparing dataset csv/default (download: 18.16 KiB, generated: 17.63 KiB, post-processed: Unknown size, total: 35.79 KiB) to /root/.cache/huggingface/datasets/csv/default-64af6f26a2db74ac/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317...


Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-64af6f26a2db74ac/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317. Subsequent calls will reuse this data.


Dataset({
    features: ['question', 'answer', 'custom_gpt2_answer', 'oob_gpt2_answer', 'bert_score_custom_gpt2', 'bert_score_oob_gpt2', 'reward'],
    num_rows: 16
})

In [16]:
def tokenize(samples: dict):
    # `samples` is a batch: a dict mapping column names to lists of values
    questions = samples['question']
    ground_truth = samples['answer']
    scores = samples['reward']
    
    input_ids = []
    query = []
    rewards = []
    
    for question, score in zip(questions, scores):
        prompted_input = f'question: {question}\nanswer:'
        query.append(prompted_input)
        tokenized_input = tokenizer(prompted_input, 
                                    truncation=True)
        input_ids.append(torch.tensor(tokenized_input['input_ids'], dtype=torch.long))
        # Min-max scale the reward to [-1, 1] using the bounds computed above
        normalized_score = (score - min_reward) / (max_reward - min_reward) * 2 - 1
        rewards.append(torch.tensor(normalized_score, dtype=torch.float))
        
    return {'input_ids': input_ids, 'query': query, 'rewards': rewards, 'ground_truth': ground_truth}
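
The reward expression in `tokenize` is a min-max rescaling into the range [-1, 1]. A small standalone sketch of the same arithmetic; the function name `normalize_reward` and the bounds used here are hypothetical and chosen only to illustrate the mapping:

```python
def normalize_reward(score: float, min_reward: float, max_reward: float) -> float:
    """Min-max scale a score into [-1, 1], mirroring the expression in tokenize()."""
    return (score - min_reward) / (max_reward - min_reward) * 2 - 1


# Hypothetical bounds, for illustration only
lo, hi = -0.02, 0.04

print(normalize_reward(lo, lo, hi))    # smallest reward maps to the lower end
print(normalize_reward(hi, lo, hi))    # largest reward maps to the upper end
print(normalize_reward(0.01, lo, hi))  # midpoint maps close to zero
```

Note that the expression divides by zero when every reward is identical, so it assumes `max_reward` is strictly greater than `min_reward`.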

In [17]:
dataset = dataset.map(tokenize, 
                      batched=True, 
                      #num_proc=num_proc, 
                      load_from_cache_file=False, 
                      remove_columns=['question', 'answer', 'custom_gpt2_answer', 'oob_gpt2_answer', 'bert_score_custom_gpt2', 'bert_score_oob_gpt2', 'reward'])
dataset.set_format('pt', 
                   columns=['input_ids', 'query', 'rewards', 'ground_truth'],
                   output_all_columns=True)
dataset

Dataset({
    features: ['input_ids', 'query', 'rewards', 'ground_truth'],
    num_rows: 16
})

##### Create data collator

In [18]:
def collator(samples):
    # Turn a list of per-example dicts into a dict of per-key lists
    result = {}
    for key in samples[0]:
        result[key] = [sample[key] for sample in samples]
    return result
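
To make the collator's behavior concrete, here is a toy sketch with made-up values. The function `collate` is a hypothetical stand-in that follows the same logic; real samples hold tensors and strings rather than short lists:

```python
def collate(samples):
    """List of per-example dicts -> dict of per-key lists (same logic as the collator)."""
    return {key: [sample[key] for sample in samples] for key in samples[0]}


# Toy examples with made-up values
samples = [
    {'input_ids': [1, 2, 3], 'query': 'question: a\nanswer:'},
    {'input_ids': [4, 5], 'query': 'question: b\nanswer:'},
]

batch = collate(samples)
print(batch['input_ids'])
print(batch['query'])
```

No padding is applied, so each sequence keeps its own length inside the batch.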

In [19]:
collator(dataset)

{'input_ids': [tensor([32616,    26,   659,   571,   468,  1229,   639,   258, 21943,   334,
          18760,  5229,   301,  2857,    72,    13, 37343,    31,   199, 30749,
             26]),
  tensor([32616,    26,  1398,   286,   746,  2407,  5375,   639,   746,  1155,
            377,  2628,    14,   199, 30749,    26]),
  tensor([32616,    26,   716,   639,   468,   494,   355,    84,  4108,  1349,
          10041,    31,   199, 30749,    26]),
  tensor([32616,    26,   659,   494,   468,   835,  1031,  6427,  1637,   357,
          30974,    13, 33806,  7522,  9966, 12968,    31,   199, 30749,    26]),
  tensor([32616,    26,  6010,   806,   503,   468,   435,  6892,   317, 21746,
            335,   435,   278,  1031, 33277,    14,   468,   417, 26960,    14,
            468,  2464, 11773,    12,   417,  6683,  6863,   289,   258,   942,
           6774,  4968,   335,  2891,   286,  1311,    14,   716,   977,   468,
            571,    31,   199, 30749,    26]),
  tensor([32616,  

In [20]:
ppo_trainer = PPOTrainer(config, active_model, ref_model, tokenizer, dataset=dataset, data_collator=collator)

[34m[1mwandb[0m: Currently logged in as: [33mshankar-arunp[0m. Use [1m`wandb login --relogin`[0m to force relogin




In [22]:
for epoch, batch in tqdm(enumerate(ppo_trainer.dataloader)):
    query_tensors = batch['input_ids']
    # Rewards are precomputed in the dataset (normalized BERTScore difference)
    rewards = batch['rewards']
    ground_truth_responses = batch['ground_truth']

    response_tensors = []

    for query, ground_truth_response in zip(query_tensors, ground_truth_responses):
        # Size the generation budget from the reference answer's word count
        gt_len = len(ground_truth_response.split())
        # top_k=1 keeps only the most likely token, so sampling is effectively greedy;
        # length_penalty only applies to beam search and has no effect here
        response = ppo_trainer.generate(query, 
                                        do_sample=True, 
                                        top_k=1, 
                                        min_new_tokens=gt_len * 2,
                                        max_new_tokens=gt_len * 2, 
                                        repetition_penalty=10.0,
                                        length_penalty=-0.1,
                                        top_p=1.0)
        # The returned sequence holds the prompt tokens followed by the new tokens
        response_tensors.append(response.squeeze())
    batch['response'] = [tokenizer.decode(r) for r in response_tensors]
    print('step')
    print(query_tensors)
    print()
    print(response_tensors)
    print()
    print(rewards)
    # Run PPO step
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
    ppo_trainer.log_stats(stats, batch, rewards)

0it [00:00, ?it/s]
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.

step
[tensor([32616,    26,   659,   571,   468,  1229,   639,   258, 21943,   334,
        18760,  5229,   301,  2857,    72,    13, 37343,    31,   199, 30749,
           26]), tensor([32616,    26,  1812,  3709,   322,  1148,  1790,   576,   502,  2072,
         5355,    14,   199, 30749,    26]), tensor([32616,    26,   659,   494,   468,   835,  1031,  6427,  1637,   357,
        30974,    13, 33806,  7522,  9966, 12968,    31,   199, 30749,    26]), tensor([32616,    26,   716,   639,   468,   494,   355,    84,  4108,  1349,
        10041,    31,   199, 30749,    26])]

[tensor([32616,    26,   659,   571,   468,  1229,   639,   258, 21943,   334,
        18760,  5229,   301,  2857,    72,    13, 37343,    31,   199, 30749,
           26,   264,  3506,  1380,   438,  3472,   742, 32974,   317,   512,
         1266,    14,  2797,    12,   342,  8080,   335,  4569,   289, 30198,
          288,   727,  2932,  1450,  3664,  7646,  3954,   286,   914,  2910,
         1054,  2094,   8

1it [00:14, 14.86s/it]

step
[tensor([32616,    26,  6656,   263,   491,   377, 20378,  8105,   289, 10770,
         1364,   732,    14,   716,  3646,   619,  2443,   571,   502,  3472,
           31,   199, 30749,    26]), tensor([32616,    26,   494,  9952, 10826,  2313,  4669,  2598,   384,    80,
         2749,     9,   289,  3954,  5153,   357, 13257,  3711,  2073,   286,
         1638,  2898,   289,   619,  1054,  3017,   278,   914,   288,   512,
           13,   458,  2312,    31,   199, 30749,    26]), tensor([32616,    26,   722,  1839,  1881,   456,    14,  2675,    12,   494,
          504,   288,   682, 22731,    12,   569,   377,   438,  9933,   278,
          473,  1699,   682,    12,  4087,   278,   426,   682,    31,   199,
        30749,    26]), tensor([32616,    26,   716,  8115,   977,   468,  1079,   824,  1407,    31,
          199, 30749,    26])]

[tensor([32616,    26,  6656,   263,   491,   377, 20378,  8105,   289, 10770,
         1364,   732,    14,   716,  3646,   619,  2443,   5

2it [00:30, 15.57s/it]

step
[tensor([32616,    26,   494,   468,  1159,  3050,   379,  1041,    12,   876,
          341, 46144,    31,   199, 30749,    26]), tensor([32616,    26,  6010,   806,   503,   468,   435,  6892,   317, 21746,
          335,   435,   278,  1031, 33277,    14,   468,   417, 26960,    14,
          468,  2464, 11773,    12,   417,  6683,  6863,   289,   258,   942,
         6774,  4968,   335,  2891,   286,  1311,    14,   716,   977,   468,
          571,    31,   199, 30749,    26]), tensor([32616,    26,   659,   494,   468,  8194,   357,   528,   264,  8051,
         4024,   289,   438,  1014,  1625,   286,   755,  3325,   468,   733,
          935,   264,  2793,   377,  1779,  7122,    31,   199, 30749,    26]), tensor([32616,    26,   377,   614,  2469,   317, 10200,   286,  6253,   341,
         7114,   824,   264,   512,    13,   458,   643,    31,   199, 30749,
           26])]

[tensor([32616,    26,   494,   468,  1159,  3050,   379,  1041,    12,   876,
          341, 461

3it [00:44, 14.71s/it]

step
[tensor([32616,    26,   468, 26074,   658,   463,  1074,   585,   512,    13,
          458,    14,   938,   494,   468,  1523,   264,  1516,  3489,   317,
         1074,    31,   199, 30749,    26]), tensor([32616,    26,  1398,   286,   746,  2407,  5375,   639,   746,  1155,
          377,  2628,    14,   199, 30749,    26]), tensor([32616,    26,   977,   570,   569,  7957,   412, 41407, 11144,  2690,
           13, 11491,   639,   511,   873,  7995,  8774,  4968,  2213,    31,
          199, 30749,    26]), tensor([32616,    26,   716,   900,   619,  2726,   334,   614,   317,  3662,
        30875,    31,   199, 30749,    26])]

[tensor([32616,    26,   468, 26074,   658,   463,  1074,   585,   512,    13,
          458,    14,   938,   494,   468,  1523,   264,  1516,  3489,   317,
         1074,    31,   199, 30749,    26,  3641,   449,  2655,   334,   258,
         1453,  2427,   288,  1023,   784,   322,  3026,   502,  1599,   528,
          564,  6183,    12,   881,  11

4it [00:58, 14.68s/it]
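
In the printed output, each response tensor begins with the same token ids as its query, which suggests the generated sequence contains the prompt followed by the continuation. A minimal sketch, assuming that layout, of separating the two by position. The helper `split_prompt` is hypothetical, and plain lists stand in for tensors (1-D tensors support the same slicing). Whether `step` in this trl version expects the prompt to be stripped should be checked against the library's documentation.

```python
def split_prompt(full_sequence, prompt_len):
    """Split a generated sequence into (prompt, continuation) by position."""
    return full_sequence[:prompt_len], full_sequence[prompt_len:]


# Token ids copied from the first query and the start of its response printed above;
# the response is cut to its first few new tokens for brevity
query = [32616, 26, 659, 571, 468, 1229, 639, 258, 21943, 334,
         18760, 5229, 301, 2857, 72, 13, 37343, 31, 199, 30749, 26]
response = query + [264, 3506, 1380, 438, 3472]

prompt, continuation = split_prompt(response, len(query))
print(prompt == query)
print(continuation)
```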
