# Tune GPT2 to generate controlled sentiment reviews

From this repo:  https://github.com/lvwerra/trl

> Optimise GPT2 to produce IMDB movie reviews with controlled sentiment using a BERT sentiment classifier for rewards.

**WARNING:** We often experienced loss spikes in this example, which caused model training to fail or slow down. There is a [GitHub issue](https://github.com/lvwerra/trl/issues/101) tracking the problem.

<div style="text-align: center">

<p style="text-align: center;"> <b>Figure:</b> Experiment setup to tune our LLM. The yellow arrows are outside the scope of this notebook, but the trained models are available through Hugging Face. </p>
</div>


The experiment setup is very similar to the positive sentiment notebook. In this notebook, however, we fine-tune an LLM to generate a `star_rating` for reviews from the Amazon Customer Reviews Dataset. The model receives the target `star_rating` and tokens from a real review, and its task is to produce continuations that match the targeted `star_rating`. The reward for each continuation is computed from the logits of a reward model trained on (review, star_rating, ranking) tuples. That reward is then used for PPO training.

## Setup experiment

In [2]:
import psutil

notebook_memory = psutil.virtual_memory()
print(notebook_memory)

if notebook_memory.total < 32 * 1000 * 1000 * 1000:
    print('*******************************************')    
    print('YOU ARE NOT USING THE CORRECT INSTANCE TYPE')
    print('PLEASE CHANGE INSTANCE TYPE TO  m5.2xlarge ')
    print('*******************************************')
else:
    correct_instance_type=True

svmem(total=802916929536, available=787953823744, percent=1.9, used=10475315200, free=753847668736, active=9926877184, inactive=36188454912, buffers=0, cached=38593945600, shared=106233856, slab=1149804544)


### Import dependencies

In [3]:
%pip install --disable-pip-version-check -q \
    transformers==4.26.1 \
    datasets==2.9.0 \
    accelerate==0.17.0 \
    bitsandbytes==0.37.0 \
    promptsource==0.2.3 \
    trl==0.4.1 \
    evaluate==0.4.0

Note: you may need to restart the kernel to use updated packages.


In [8]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [11]:
%store -r reward_model_path

In [12]:
try:
    reward_model_path
except NameError:
    print("*** PLEASE RUN PREVIOUS NOTEBOOK BEFORE CONTINUING ***")

In [13]:
print(reward_model_path)

./tmp_models/reward_model/


# TODO:  This should be bloomz (not bloom) at this point in the flow

In [17]:
%store -r model_checkpoint

In [18]:
model_checkpoint = 'bigscience/bloomz-560m'

In [20]:
try:
    model_checkpoint
except NameError:
    print("*** PLEASE RUN PREVIOUS NOTEBOOK BEFORE CONTINUING ***")

In [21]:
print(model_checkpoint)

bigscience/bloom-560m


In [132]:
import random
import torch
#import wandb
import time
import os
from tqdm import tqdm
import numpy as np
import pandas as pd
from random import choices
import matplotlib.pyplot as plt
tqdm.pandas()

from datasets import load_dataset

from transformers import AutoTokenizer, pipeline, AutoModelForSequenceClassification

from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead, create_reference_model
from trl.core import LengthSampler

### Configuration

In [54]:
config = PPOConfig(
    ppo_epochs=1,
    batch_size=10,
    #model_name="lvwerra/gpt2-imdb",
    steps=100, # 51200    
    learning_rate=1.41e-5,
    remove_unused_columns=False,
)

reward_pipeline_kwargs = {
    "top_k": None,  
    "function_to_apply": "none"
}

txt_in_len = 5
txt_out_len = 20
seed = 1


In [55]:
np.random.seed(seed)

The original example loads a GPT2 model called `gpt2_imdb` (the commented-out `model_name` in the configuration above); in this notebook the policy is loaded from `model_checkpoint` instead. The GPT2 model was additionally fine-tuned on the IMDB dataset for 1 epoch with the Hugging Face [script](https://github.com/huggingface/transformers/blob/master/examples/run_language_modeling.py) (no special settings). The other parameters are mostly taken from the original paper ["Fine-Tuning Language Models from Human Preferences"](https://arxiv.org/pdf/1909.08593.pdf). This model as well as the BERT model is available in the Hugging Face model zoo [here](https://huggingface.co/models). The following code should automatically download the models.

## Load data and models

### Load supervised-fine-tuned (SFT) language models

We load the LLM with a value head and the tokenizer. We load the model twice; the first model is optimized while the second model serves as a reference to calculate the KL-divergence from the starting point. This serves as an additional reward signal in the PPO training to make sure the optimized model does not deviate too much from the original language model.
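
As a rough illustration of how the KL term can enter the reward, the sketch below combines per-token log-probabilities from the optimized and the reference model with a scalar score from the reward model. The function name `kl_shaped_rewards`, the coefficient `kl_coef=0.2`, and the toy numbers are assumptions made for this example and are not taken from this notebook.

```python
def kl_shaped_rewards(logprobs, ref_logprobs, score, kl_coef=0.2):
    """Per-token rewards: a KL penalty on every token, plus the score on the last token.

    logprobs / ref_logprobs: log-probabilities that the optimized and the
    reference model assign to each generated token (toy values here).
    """
    if len(logprobs) != len(ref_logprobs):
        raise ValueError("log-probability lists must have the same length")
    # The penalty is positive wherever the policy is more confident than the reference.
    rewards = [-kl_coef * (lp - ref_lp) for lp, ref_lp in zip(logprobs, ref_logprobs)]
    # The reward model's score is credited once, at the end of the response.
    rewards[-1] += score
    return rewards


policy_logprobs = [-1.0, -0.5, -2.0]     # hypothetical values
reference_logprobs = [-1.2, -0.5, -1.0]  # hypothetical values
shaped = kl_shaped_rewards(policy_logprobs, reference_logprobs, score=3.0)
print(shaped)
```

The first token is penalized because the policy assigns it a higher probability than the reference does, the second is unchanged, and the last token receives a small bonus on top of the score because the policy has moved towards lower probability there.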

In [89]:
#gpt2_model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
#gpt2_model_ref = create_reference_model(gpt2_model)
#gpt2_tokenizer = AutoTokenizer.from_pretrained(config.model_name)
#gpt2_tokenizer.pad_token = gpt2_tokenizer.eos_token # specific to gpt2 since it doesn't have an official pad token

ppo_model = AutoModelForCausalLMWithValueHead.from_pretrained(model_checkpoint)
ppo_model_ref = create_reference_model(ppo_model)
ppo_tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

### Load IMDB dataset
The IMDB dataset contains 50k movie reviews annotated with "positive"/"negative" feedback indicating the sentiment. We load the IMDB dataset into a DataFrame, keep only comments that are longer than 500 characters, and take the first 1000 characters of each comment. The first filter avoids comments that are shorter than `txt_in_len` tokens; the second avoids tokenizing far more text than we actually need.
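
The two steps can be tried out on plain Python strings before running them on the real dataset. This is a minimal sketch: `filter_and_trim` and the toy reviews are made up for illustration, and the thresholds mirror the 500 and 1000 characters mentioned in the text.

```python
MIN_CHARS = 500   # reviews must be longer than this to be kept
MAX_CHARS = 1000  # kept reviews are cut to at most this many characters


def filter_and_trim(reviews, min_chars=MIN_CHARS, max_chars=MAX_CHARS):
    """Drop short reviews, then truncate the remaining ones."""
    return [review[:max_chars] for review in reviews if len(review) > min_chars]


# Hypothetical reviews of 12, 600 and 1500 characters.
toy_reviews = ["short review", "a" * 600, "b" * 1500]
trimmed = filter_and_trim(toy_reviews)
print([len(review) for review in trimmed])
```

The short review is dropped, the 600-character review is kept unchanged, and the 1500-character review is cut to 1000 characters.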

In [139]:
# # create the dataset 
# # 
# dataset = load_dataset('imdb', split='train')
# dataset = dataset.rename_columns({'text': 'review', 'label': 'sentiment'})
# # make sure the comments are at least 500 characters long and trim them to 1000
# dataset = dataset.filter(lambda x: len(x["review"])>500, batched=False)
# dataset = dataset.map(lambda x:{"review":x['review'][:1000]}, batched=False)

# dataset.select([1, 2])

### Tokenize IMDB reviews

We tokenize all IMDB reviews in advance to avoid tokenizing twice. In the first step we encode the queries and keep only the first `txt_in_len` tokens. In the second step we decode these tokens back to text for later display.
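
The encode, slice, decode pattern can be illustrated without downloading a model. The `ToyTokenizer` below is a hypothetical whitespace tokenizer that only stands in for `ppo_tokenizer`; a real subword tokenizer produces different ids and token boundaries.

```python
TXT_IN_LEN = 5  # mirrors txt_in_len from the configuration cell


class ToyTokenizer:
    """Hypothetical whitespace tokenizer, used here instead of ppo_tokenizer."""

    def __init__(self):
        self.token_to_id = {}
        self.id_to_token = []

    def encode(self, text):
        ids = []
        for token in text.split():
            # Assign the next free id to tokens seen for the first time.
            if token not in self.token_to_id:
                self.token_to_id[token] = len(self.id_to_token)
                self.id_to_token.append(token)
            ids.append(self.token_to_id[token])
        return ids

    def decode(self, ids):
        return " ".join(self.id_to_token[i] for i in ids)


toy_tokenizer = ToyTokenizer()
review = "This film was probably inspired by Godard's Masculin, féminin"
input_ids = toy_tokenizer.encode(review)[:TXT_IN_LEN]  # step 1: encode and slice
query = toy_tokenizer.decode(input_ids)                # step 2: decode for display
print(input_ids, repr(query))
```

Storing both `input_ids` and `query` means the training loop can feed the ids to the model and show the text in logs without tokenizing again.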

In [141]:
# dataset = dataset.map(lambda x:{"input_ids": ppo_tokenizer.encode(' '+x['review'], return_tensors="pt")[0, :txt_in_len]}, batched=False)
# dataset = dataset.map(lambda x:{"query": ppo_tokenizer.decode(x["input_ids"])}, batched=False)
# dataset = dataset[:20480]

# from datasets import Dataset
# dataset = Dataset.from_dict(dataset)
# dataset.set_format("pytorch")

100%|██████████| 22578/22578 [00:18<00:00, 1221.43ex/s]
100%|██████████| 22578/22578 [00:01<00:00, 12430.54ex/s]


In [144]:
# dataset[3]

{'review': "This film was probably inspired by Godard's Masculin, féminin and I urge you to see that film instead.<br /><br />The film has two strong elements and those are, (1) the realistic acting (2) the impressive, undeservedly good, photo. Apart from that, what strikes me most is the endless stream of silliness. Lena Nyman has to be most annoying actress in the world. She acts so stupid and with all the nudity in this film,...it's unattractive. Comparing to Godard's film, intellectuality has been replaced with stupidity. Without going too far on this subject, I would say that follows from the difference in ideals between the French and the Swedish society.<br /><br />A movie of its time, and place. 2/10.",
 'sentiment': tensor(0),
 'input_ids': tensor([  3904,   8874,   1620,  15927, 110479]),
 'query': ' This film was probably inspired'}

In [142]:
# dataset[3]["query"]

' This film was probably inspired'

In [143]:
# dataset[3]["input_ids"]

tensor([  3904,   8874,   1620,  15927, 110479])

In [154]:
# for idx in range(10):
#     print(dataset[idx]["review"].item())

I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far between, eve

In [155]:
def collator(data):
    return dict((key, [d[key] for d in data]) for key in data[0])
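
To make the collator's effect concrete, the sketch below applies the same function to two made-up records. The record contents are assumptions for illustration only.

```python
def collator(data):
    # Turn a list of per-sample dicts into one dict of per-key lists.
    return dict((key, [d[key] for d in data]) for key in data[0])


# Hypothetical samples with token sequences of different lengths.
toy_records = [
    {"input_ids": [1, 2, 3], "labels": 0},
    {"input_ids": [4, 5], "labels": 1},
]
toy_batch = collator(toy_records)
print(toy_batch)
```

Because the values stay in plain lists, sequences of different lengths are neither padded nor stacked, which suits a PPO loop that handles each query individually.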

In [71]:
from datasets import Dataset

# The path contains no placeholder, so no string formatting is needed
dataset_test = Dataset.from_parquet('./data/test/*.parquet')

Using custom data configuration default-552893b01f764c80
Found cached dataset parquet (/root/.cache/huggingface/datasets/parquet/default-552893b01f764c80/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)


In [72]:
dataset_test

Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 25
})

### Load reward classifier
We load the reward classifier that we fine-tuned in a previous step.

In [111]:
reward_model = AutoModelForSequenceClassification.from_pretrained(reward_model_path, num_labels=1)
reward_model_tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

reward_pipeline = pipeline("text-classification", model=reward_model, tokenizer=reward_model_tokenizer) #, device=device)

Because the reward model is loaded with `num_labels=1`, the pipeline returns a single raw logit, labeled `LABEL_0`, rather than separate logits for a negative and a positive class. We use this logit as the reward signal for the language model.

In [114]:
prompt = 'PROMPT: Given the following review:\nIf you are prepping for the end of the world this is one of those things that you should have installed on your-end-of-the-world-proof PC.  Hail to the great Yuri!\npredict the associated rating from the following choices (1 being lowest and 5 being highest)\n- 1\n- 2\n- 3\n- 4\n- 5\nRESPONSE: '
response = '2'
output = reward_pipeline(prompt + response, **reward_pipeline_kwargs)
output

[{'label': 'LABEL_0', 'score': -50.25454330444336}]
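
The pipeline output is a list of label dictionaries for a single text, and a list of such lists when several texts are scored at once. The helper below is a small sketch for pulling out the scalar score in both cases; `first_label_scores` is a hypothetical name, and the second score in `batch_output` is invented for the example.

```python
def first_label_scores(pipeline_outputs):
    """Return the score of the first label for every scored text."""
    # A single text yields a flat list of dicts; wrap it so both shapes are handled.
    if pipeline_outputs and isinstance(pipeline_outputs[0], dict):
        pipeline_outputs = [pipeline_outputs]
    return [labels[0]["score"] for labels in pipeline_outputs]


single_output = [{"label": "LABEL_0", "score": -50.25454330444336}]
batch_output = [
    [{"label": "LABEL_0", "score": -50.25454330444336}],
    [{"label": "LABEL_0", "score": 1.5}],  # made-up second result
]
print(first_label_scores(single_output))
print(first_label_scores(batch_output))
```

With a single-label reward model only index 0 exists in each inner list, so indexing position 1 raises an `IndexError`.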

The resulting reward signal:

In [116]:
# def extract_pipeline_output(outputs):
#     positive_logits = []
#     for out in outputs:
#         for element in out:
#             if element["label"]=="LABEL_0": # =="POSITIVE":
#                 positive_logits.append(torch.tensor(element["score"]))
#     return positive_logits

In [117]:
# extract_pipeline_output


IndexError: list index out of range

### Control token dict
We will prepend a control token to each query to signal to the model what the target sentiment is. In the original example each control sequence consists of three tokens; here the control strings `'0'` and `'1'` each encode to a single token:

In [157]:
ctrl_str = ['0', '1'] # ['[negative]', '[neutral]', '[positive]']
device = torch.device("cuda" if torch.cuda.is_available() else "cpu") # this should be handled by accelerate
ctrl_tokens = dict((s, ppo_tokenizer.encode(s, return_tensors="pt").squeeze().to(device)) for s in ctrl_str)

In [158]:
ctrl_tokens

{'0': tensor(19, device='cuda:0'), '1': tensor(20, device='cuda:0')}
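
Prepending a control token to a query then comes down to a list concatenation. The sketch below uses plain Python lists instead of tensors; the ids 19 and 20 are copied from the output above, the query ids come from the sample shown earlier, and `prepend_control` is a hypothetical helper.

```python
import random

CTRL_TOKEN_IDS = {"0": 19, "1": 20}  # ids copied from the ctrl_tokens output


def prepend_control(query_ids, task):
    """Return a new id list with the control token for `task` in front."""
    if task not in CTRL_TOKEN_IDS:
        raise ValueError(f"task has to be one of {sorted(CTRL_TOKEN_IDS)}")
    return [CTRL_TOKEN_IDS[task]] + list(query_ids)


rng = random.Random(1)                        # seeded for repeatability
query_ids = [3904, 8874, 1620, 15927, 110479]
task = rng.choice(sorted(CTRL_TOKEN_IDS))     # random control, one per query
controlled = prepend_control(query_ids, task)
print(task, controlled)
```

Drawing the control at random for every query exposes the policy to both targets during training.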

### Reward function

In [159]:
def pos_logit_to_reward(logit, task):
    """
    Take the positive sentiment logit and scale it for the task.
        task [negative]: reward = -logit
        task [neutral]: reward = -2*abs(logit)+4
        task [positive]: reward = logit
    """
    
    print(logit)
    print(task)
    for i in range(len(logit)):
        if str(task[i])=='0': # [negative]
            logit[i] = -logit[i]
        # elif task[i]=='[neutral]':
        #     logit[i] = -2*torch.abs(logit[i])+4
        elif str(task[i])=='1': # [positive]:
            pass
        else:
            raise ValueError('task has to be in [0, 1]!')
    return logit

The following examples show the rewards for the cases where the classifier logit is 4, -4 and 0. The docstring describes three targets, `[negative]`, `[neutral]` and `[positive]`, but only `'0'` (negative) and `'1'` (positive) are active in this notebook. The scaling is not perfect as it differs between neutral and the other two classes. This is something to investigate further in the future. Ideally, one would use the logit output for each class individually, but since there is no dedicated class for neutral this is a workaround.
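
For reference, the three formulas from the docstring can be evaluated on scalars as follows. This is a sketch only: `logit_to_reward` is a hypothetical scalar variant that includes the neutral case, which is commented out in this notebook, and unlike the function above it does not modify its input.

```python
def logit_to_reward(logit, task):
    """Scale a positive-sentiment logit for the given target."""
    if task == "negative":
        return -logit
    if task == "neutral":
        # Highest reward (4) at logit 0, falling off linearly on both sides.
        return -2 * abs(logit) + 4
    if task == "positive":
        return logit
    raise ValueError("task has to be negative, neutral or positive")


TASKS = ("negative", "neutral", "positive")
for logit in (4.0, -4.0, 0.0):
    print(logit, {task: logit_to_reward(logit, task) for task in TASKS})
```

The neutral reward spans a different range than the other two (at most 4, with no lower bound), which is the imperfect scaling mentioned in the text.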

In [160]:
print(ctrl_str)

['0', '1']


In [163]:
pos_logit_to_reward(torch.Tensor([4,4]), ctrl_str)

tensor([4., 4.])
['0', '1']


tensor([-4.,  4.])

In [164]:
pos_logit_to_reward(torch.Tensor([-4,-4]), ctrl_str)

tensor([-4., -4.])
['0', '1']


tensor([ 4., -4.])

In [165]:
pos_logit_to_reward(torch.Tensor([0, 0]), ctrl_str)

tensor([0., 0.])
['0', '1']


tensor([-0., 0.])

### Generation settings

In [166]:
generation_kwargs = {
    "min_length":-1,
    "top_k": 0.0,
    "top_p": 1.0,
    "do_sample": True,
#    "pad_token_id": ppo_tokenizer.eos_token_id,
    "max_new_tokens": txt_out_len,
    "eos_token_id": -1
}


## Optimize model

**Steps**

The training loop consists of the following steps:
1. Get a batch of queries and create random controls
2. Get the query responses from the policy
3. Join query and responses, tokenize, and get score from reward model
4. Get reward for query/responses from reward model
5. Optimize policy with PPO using the (query, response, reward) triplet
6. Log all the training statistics
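
The six steps can be traced end to end with stand-ins for the heavy components. Everything in the sketch below is hypothetical: `toy_policy`, `toy_reward_model` and `toy_ppo_step` replace the language model, the reward classifier and the PPO update, and no learning takes place.

```python
import random


def toy_policy(query, rng):
    """Stand-in for the language model: returns one random word."""
    return rng.choice(["great", "awful", "fine"])


def toy_reward_model(text):
    """Stand-in for the reward classifier: a made-up lexicon score."""
    lexicon = {"great": 1.0, "awful": -1.0, "fine": 0.0}
    return sum(lexicon.get(word, 0.0) for word in text.split())


def toy_ppo_step(queries, responses, rewards):
    """Stand-in for the PPO update: reports statistics, updates nothing."""
    return {"mean_reward": sum(rewards) / len(rewards), "batch_size": len(queries)}


rng = random.Random(1)
batches = [["The movie was", "I thought it was"], ["Overall it felt", "The plot seemed"]]
history = []
for queries in batches:
    controls = [rng.choice(["0", "1"]) for _ in queries]           # 1. random controls
    responses = [toy_policy(q, rng) for q in queries]              # 2. policy responses
    texts = [q + " " + r for q, r in zip(queries, responses)]      # 3. join query and response
    scores = [toy_reward_model(t) for t in texts]                  # 4. score with reward model
    rewards = [-s if c == "0" else s for s, c in zip(scores, controls)]
    stats = toy_ppo_step(queries, responses, rewards)              # 5. PPO update
    history.append(stats)                                          # 6. log statistics
print(history)
```

The sign flip for control `'0'` mirrors the reward function defined earlier: a negative target is rewarded when the classifier score is low.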

**Training time**

This step takes **~2h** on a P6000 GPU with the above specified settings.

In [167]:
#ppo_trainer = PPOTrainer(config, gpt2_model, gpt2_model_ref, gpt2_tokenizer, dataset, data_collator=collator)
ppo_trainer = PPOTrainer(config, ppo_model, ppo_model_ref, ppo_tokenizer, dataset_test, data_collator=collator)

if ppo_trainer.accelerator.num_processes == 1:
    device = 0 if torch.cuda.is_available() else "cpu" # to avoid a `pipeline` bug
else:
    device = ppo_trainer.accelerator.device

In [209]:
output_min_length = 4
output_max_length = 16
output_length_sampler = LengthSampler(output_min_length, output_max_length)


generation_kwargs = {
    "min_length":-1,
    "top_k": 0.0,
    "top_p": 1.0,
    "do_sample": True,
    "pad_token_id": ppo_tokenizer.eos_token_id
}


for epoch, batch in tqdm(enumerate(ppo_trainer.dataloader)):
    #### Get response from the policy
    # Keep queries and responses as lists of 1-D tensors: prompts and sampled
    # continuations differ in length, so they cannot be stacked into one tensor.
    query_tensors = [torch.as_tensor(query).to(device) for query in batch['input_ids']]
    response_tensors = []
    for query_tensor in query_tensors:
        gen_len = output_length_sampler()
        generation_kwargs["max_new_tokens"] = gen_len
        response = ppo_trainer.generate(query_tensor, **generation_kwargs)
        response_tensors.append(response.squeeze()[-gen_len:])

    # TODO:  Replace this with dataset['query'] up at the top
    batch['query'] = [ppo_tokenizer.decode(q) for q in query_tensors]
    batch['response'] = [ppo_tokenizer.decode(r.squeeze()) for r in response_tensors]

    #### Compute reward score
    texts = [q + r for q,r in zip(batch['query'], batch['response'])]
    pipeline_outputs = reward_pipeline(texts, **reward_pipeline_kwargs)
    # The reward model has a single label, so its logit sits at index 0
    rewards = [torch.tensor(output[0]["score"]) for output in pipeline_outputs]

    #### Run PPO step
    # step() takes three lists of tensors (one entry per sample); wrapping the
    # lists in torch.as_tensor raised the TypeError seen in an earlier run.
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)

    ppo_trainer.log_stats(stats, batch, rewards)

### Training progress
If you are tracking the training progress with Weights & Biases, you should see a plot similar to the following:

<div style="text-align: center">
<img src='https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/gpt2-ctrl-training-stats.png' width='800'>
<p style="text-align: center;"> <b>Figure:</b> Reward mean and distribution evolution during training. </p>
</div>

One can observe how the model starts to generate more positive outputs after a few optimisation steps.

> Note: Investigating the KL-divergence will probably show that at this point the model has not yet converged to the target KL-divergence. Getting there would require longer training or starting with a higher initial coefficient.
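
One way to see why the initial coefficient matters is a proportional controller of the kind described in "Fine-Tuning Language Models from Human Preferences", which adjusts the KL coefficient by a small factor after every batch. The sketch below is a minimal version under stated assumptions: the class name `AdaptiveKLCoefficient`, the starting value 0.2, the target of 6 and the horizon of 10000 are illustrative choices and are not read from this notebook's configuration.

```python
class AdaptiveKLCoefficient:
    """Proportional controller that nudges the KL coefficient towards a target KL."""

    def __init__(self, init_coef, target_kl, horizon):
        self.value = init_coef
        self.target_kl = target_kl
        self.horizon = horizon

    def update(self, observed_kl, n_samples):
        # Relative error, clipped so that one noisy batch cannot move the coefficient far.
        error = max(-0.2, min(0.2, observed_kl / self.target_kl - 1.0))
        self.value *= 1.0 + error * n_samples / self.horizon
        return self.value


controller = AdaptiveKLCoefficient(init_coef=0.2, target_kl=6.0, horizon=10000)
for observed_kl in (12.0, 12.0, 3.0):  # hypothetical KL measurements
    controller.update(observed_kl, n_samples=10)
    print(round(controller.value, 6))
```

Each update changes the coefficient by a fraction of a percent at most, so over a short run of 100 steps the coefficient stays close to its initial value.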

## Model inspection

### Reward distribution
First, we can have a look at the reward distribution. Both the negative and positive rewards are clearly shifted to high rewards. The neutral rewards, however, are still centered around zero. There are a few possible explanations for this. There could be a bug in the code and in the way the neutral rewards are calculated. Another problem could be that sentences sometimes start with a strong sentiment, and it is hard for the model to shift the sentiment towards neutral.

In [None]:
#### get a batch from the dataset
# `dataset` is only defined in the commented-out IMDB cells, so sample from dataset_test
bs = 16
game_data = dict()
df_batch = dataset_test.to_pandas().sample(bs)
query_tensors = [torch.as_tensor(ids).to(device) for ids in df_batch['input_ids']]
game_data['query'] = [ppo_tokenizer.decode(q) for q in query_tensors]

response_tensors_ref, response_tensors = [], []

#### get responses from the reference policy and the tuned policy
for i in range(bs):
    gen_len = output_length_sampler()
    # max_new_tokens is set in the dict to avoid passing it twice
    generation_kwargs["max_new_tokens"] = gen_len
    output = ppo_model_ref.generate(query_tensors[i].unsqueeze(dim=0),
                                    **generation_kwargs).squeeze()[-gen_len:]
    response_tensors_ref.append(output)
    output = ppo_model.generate(query_tensors[i].unsqueeze(dim=0),
                                **generation_kwargs).squeeze()[-gen_len:]
    response_tensors.append(output)

#### decode responses
game_data['response (before)'] = [ppo_tokenizer.decode(response_tensors_ref[i]) for i in range(bs)]
game_data['response (after)'] = [ppo_tokenizer.decode(response_tensors[i]) for i in range(bs)]

#### reward model scores of query/response pairs before/after
# the reward model has a single label, so the score is at index 0
texts = [q + r for q,r in zip(game_data['query'], game_data['response (before)'])]
game_data['rewards (before)'] = [output[0]["score"] for output in reward_pipeline(texts, **reward_pipeline_kwargs)]

texts = [q + r for q,r in zip(game_data['query'], game_data['response (after)'])]
game_data['rewards (after)'] = [output[0]["score"] for output in reward_pipeline(texts, **reward_pipeline_kwargs)]

# store results in a dataframe
df_results = pd.DataFrame(game_data)
df_results

In [None]:
print('mean:')
display(df_results[["rewards (before)", "rewards (after)"]].mean())
print()
print('median:')
display(df_results[["rewards (before)", "rewards (after)"]].median())

## Save model
Finally, we save the model to disk for later usage.

In [None]:
ppo_model_path = './tmp_models/ppo_model/'
ppo_model.save_pretrained(ppo_model_path)
ppo_tokenizer.save_pretrained(ppo_model_path)

In [None]:
print(type(logs))
print(logs)
print(ctrl_str)

# for ctrl_s in ctrl_str:
#     plt.hist([r for r, t in zip(logs['env/reward_dist'], task_list) if t==ctrl_s],
#              density=True,
#              alpha=0.5,
#              label=ctrl_s)
# plt.legend(loc='best')
# plt.title('reward distribution')
# plt.grid(True)
# plt.show()