<a href="https://colab.research.google.com/github/finardi/tutos/blob/master/PTBR_GPT%2BPPO_IMDB.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<div style="text-align: center">
<img src='https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/gpt2-ctrl-training-setup.png' width='600'>
<p style="text-align: center;"> <b>Figure:</b> Experiment setup to tune GPT2. The yellow arrows are outside the scope of this notebook, but the trained models are available through Hugging Face. </p>
</div>


In this notebook we fine-tune GPT2 (small) to generate positive movie reviews based on the IMDB dataset. The model gets the start of a real review and is tasked with producing a positive continuation. To reward positive continuations we use a BERT classifier to analyse the sentiment of the generated text and use the classifier's outputs as reward signals for PPO training.

### Import dependencies

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
%pip install -q transformers trl wandb

In [None]:
import torch
from tqdm import tqdm
import pandas as pd
tqdm.pandas()

from transformers import pipeline, AutoTokenizer
from datasets import load_dataset

from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
from trl.core import LengthSampler

### Configuration

In [None]:
config = PPOConfig(
    model_name='/content/drive/MyDrive/LLMs/ckpts/GPT_imdb',
    learning_rate=1.41e-5,
    log_with="wandb",
)

sent_kwargs = {
    "return_all_scores": True,
    "function_to_apply": "none",
    "batch_size": 16
}

model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
tokenizer = AutoTokenizer.from_pretrained(config.model_name)
tokenizer.pad_token = tokenizer.eos_token

In [None]:
import wandb
wandb.init()

The configuration above loads a GPT2 checkpoint named `GPT_imdb` from a local path. This model was additionally fine-tuned on the IMDB dataset for 1 epoch with the Hugging Face [script](https://github.com/huggingface/transformers/blob/master/examples/run_language_modeling.py) (no special settings). The other parameters are mostly taken from the original paper ["Fine-Tuning Language Models from Human Preferences"](https://arxiv.org/pdf/1909.08593.pdf). This model and the BERT model are available in the Hugging Face model zoo [here](https://huggingface.co/models). If you pass Hub model identifiers instead of the local paths used in this notebook, the following code downloads the models automatically.

## Load data and models

### Load IMDB dataset
The IMDB dataset contains 50k movie reviews annotated with "positive"/"negative" labels indicating the sentiment. We load the training split of its Portuguese version (`maritaca-ai/imdb_pt`) as a `Dataset` and keep only reviews longer than 200 characters. Then we tokenize each text and truncate it to a random length drawn by the `LengthSampler`.
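
As a rough illustration of the truncation step, here is a minimal standard-library sketch. It assumes the sampler draws a length uniformly from `range(min, max)` (upper bound excluded) and uses whitespace splitting as a stand-in for the real tokenizer. `ToyLengthSampler` and `truncate_review` are hypothetical names for this example, not part of TRL.

```python
import random


class ToyLengthSampler:
    """Hypothetical stand-in for a length sampler: uniform over range(min, max)."""

    def __init__(self, min_value, max_value, seed=None):
        self.values = list(range(min_value, max_value))  # upper bound excluded
        self.rng = random.Random(seed)

    def __call__(self):
        return self.rng.choice(self.values)


def truncate_review(review, sampler, min_chars=200):
    """Return a short query prefix, or None if the review is too short."""
    if len(review) <= min_chars:  # mirrors the `> 200` character filter
        return None
    tokens = review.split()  # whitespace split stands in for a real tokenizer
    return tokens[: sampler()]


sampler = ToyLengthSampler(3, 10, seed=0)
review = "Este filme é um dos melhores que eu já vi. " * 10
query_tokens = truncate_review(review, sampler)
print(query_tokens)
```

Each call to the sampler gives a new length, so different reviews end up with prompts of different sizes, which is the point of sampling the length rather than fixing it.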

In [None]:
ds = load_dataset("maritaca-ai/imdb_pt", split='train')
ds = ds.rename_columns({'text': 'review'})
ds


Dataset({
    features: ['review', 'label'],
    num_rows: 25000
})

In [None]:
def build_dataset(config, tokenizer, ds, input_min_text_length=3, input_max_text_length=10):
    ds = ds.filter(lambda x: len(x["review"])>200, batched=False)

    input_size = LengthSampler(input_min_text_length, input_max_text_length)

    def tokenize(sample):
        sample["input_ids"] = tokenizer.encode(sample["review"])[:input_size()]
        sample["query"] = tokenizer.decode(sample["input_ids"])
        return sample

    ds = ds.map(tokenize, batched=False)
    ds.set_format(type='torch')
    return ds

# - - - -
dataset = build_dataset(config, tokenizer, ds)

def collator(data):
    # Turn a list of samples into a dict of lists, keyed by column name.
    return {key: [d[key] for d in data] for key in data[0]}

Token indices sequence length is longer than the specified maximum sequence length for this model (2165 > 2048). Running this sequence through the model will result in indexing errors


### Initialize PPOTrainer
The `PPOTrainer` takes care of device placement and optimization later on:

In [None]:
ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer, dataset=dataset, data_collator=collator)

### Load BERT classifier
We load a BERT classifier fine-tuned on the IMDB dataset.

In [None]:
device = ppo_trainer.accelerator.device
if ppo_trainer.accelerator.num_processes == 1:
    device = 0 if torch.cuda.is_available() else "cpu"  # to avoid a `pipeline` bug
sentiment_pipe = pipeline("sentiment-analysis", model="/content/drive/MyDrive/LLMs/ckpts/BERT_imdb/", device=device)

The model outputs the logits for the negative and positive classes. We use the logit of the positive class as the reward signal for the language model.

In [None]:
text = 'Esse filme foi ruim!!'
sentiment_pipe(text, **sent_kwargs)



[[{'label': 'LABEL_0', 'score': 3.0385935306549072},
  {'label': 'LABEL_1', 'score': -3.26000714302063}]]

In [None]:
text = 'Esse filme foi bom!!'
sentiment_pipe(text, **sent_kwargs)

[[{'label': 'LABEL_0', 'score': -2.3053135871887207},
  {'label': 'LABEL_1', 'score': 2.1447975635528564}]]
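
To make the two outputs above easier to read, the sketch below selects the positive-class logit by label name and converts the logits into probabilities with a softmax. It assumes `LABEL_1` is the positive class, which is inferred from the two examples rather than stated by the classifier; `positive_logit` and `softmax` are helper names introduced only for this illustration.

```python
import math

# Pipeline-style outputs copied from the two examples above.
negative_example = [
    {"label": "LABEL_0", "score": 3.0385935306549072},
    {"label": "LABEL_1", "score": -3.26000714302063},
]
positive_example = [
    {"label": "LABEL_0", "score": -2.3053135871887207},
    {"label": "LABEL_1", "score": 2.1447975635528564},
]


def positive_logit(scores, positive_label="LABEL_1"):
    """Pick the positive-class logit by label name instead of list position."""
    for item in scores:
        if item["label"] == positive_label:
            return item["score"]
    raise KeyError(positive_label)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


for name, scores in [("ruim", negative_example), ("bom", positive_example)]:
    probs = softmax([item["score"] for item in scores])
    print(name, "reward:", positive_logit(scores), "P(positive):", round(probs[1], 4))
```

The notebook uses the raw logit rather than the probability as the reward, which keeps the signal from saturating near 0 or 1.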

### Generation settings
For response generation we use plain sampling: top-k and nucleus (top-p) filtering are turned off, and no minimum length is enforced (`min_length=-1`).

In [None]:
gen_kwargs = {
    "min_length":-1,
    "top_k": 0.0,
    "top_p": 1.0,
    "do_sample": True,
    "pad_token_id": tokenizer.eos_token_id
}
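
To see why `top_k=0` and `top_p=1.0` amount to plain sampling, here is a minimal sketch of the two filters on a toy four-token distribution. It is a simplified illustration written for this notebook, not the implementation used by `transformers`; `filter_probs` and `sample_token` are hypothetical helpers.

```python
import random


def filter_probs(probs, top_k=0, top_p=1.0):
    """Zero out tokens excluded by top-k / nucleus filtering, then renormalise."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order)
    if top_k > 0:  # top_k == 0 means "no top-k filtering"
        keep &= set(order[:top_k])
    if top_p < 1.0:  # top_p == 1.0 means "no nucleus filtering"
        nucleus, cumulative = set(), 0.0
        for i in order:
            nucleus.add(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        keep &= nucleus
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]


def sample_token(probs, rng):
    """Draw one token index from the (filtered) distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


probs = [0.5, 0.3, 0.15, 0.05]
unfiltered = filter_probs(probs, top_k=0, top_p=1.0)  # same distribution as the input
top2 = filter_probs(probs, top_k=2)  # only the two most likely tokens survive
print(unfiltered, top2, sample_token(unfiltered, random.Random(0)))
```

With both filters disabled every token keeps its original probability, so the policy can still explore low-probability continuations during PPO training.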

## Optimize model

### Training loop

The training loop consists of the following main steps:
1. Get the query responses from the policy network (GPT-2)
2. Get sentiments for query/responses from BERT
3. Optimize policy with PPO using the (query, response, reward) triplet
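
Step 3 relies on PPO's clipped surrogate objective. The following is a minimal per-token sketch in plain Python, assuming a clip range of 0.2 and made-up log-probabilities and advantages. It illustrates the general formula only and is not TRL's implementation, which additionally includes a value loss and a KL penalty against the reference model.

```python
import math


def ppo_clipped_loss(logp_new, logp_old, advantages, clip_range=0.2):
    """Mean clipped policy loss over tokens (lower is better)."""
    losses = []
    for new, old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(new - old)  # probability ratio pi_new / pi_old
        clipped = min(max(ratio, 1.0 - clip_range), 1.0 + clip_range)
        # Pessimistic choice: take the larger of the two candidate losses.
        losses.append(max(-adv * ratio, -adv * clipped))
    return sum(losses) / len(losses)


# Illustrative numbers only, not taken from the training run.
logp_old = [-1.2, -0.7, -2.0]
logp_new = [-0.9, -0.8, -1.0]
advantages = [1.0, -0.5, 2.0]
print(ppo_clipped_loss(logp_new, logp_old, advantages))
```

The clipping removes the incentive to move a token's probability far from the old policy in a single update, which is what keeps the optimisation stable.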

**Training time**

This step takes **~2h** on a V100 GPU with the above specified settings.

In [None]:
output_min_length = 4
output_max_length = 18
output_length_sampler = LengthSampler(output_min_length, output_max_length)


generation_kwargs = {
    "min_length":-1,
    "top_k": 0.0,
    "top_p": 1.0,
    "do_sample": True,
    "pad_token_id": tokenizer.eos_token_id
}


for epoch, batch in tqdm(enumerate(ppo_trainer.dataloader)):
    query_tensors = batch['input_ids']

    #### Get response from gpt2
    response_tensors = []
    for query in query_tensors:
        gen_len = output_length_sampler()
        generation_kwargs["max_new_tokens"] = gen_len
        response = ppo_trainer.generate(query, **generation_kwargs)
        response_tensors.append(response.squeeze()[-gen_len:])
    batch['response'] = [tokenizer.decode(r.squeeze()) for r in response_tensors]

    #### Compute sentiment score
    texts = [q + r for q,r in zip(batch['query'], batch['response'])]
    pipe_outputs = sentiment_pipe(texts, **sent_kwargs)
    rewards = [torch.tensor(output[1]["score"]) for output in pipe_outputs]

    #### Run PPO step 
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
    ppo_trainer.log_stats(stats, batch, rewards)

97it [3:17:06, 121.93s/it]


### Training progress
If you are tracking the training progress with Weights & Biases you should see a plot similar to the one below. Check out the interactive sample report on wandb.ai: [link](https://app.wandb.ai/lvwerra/trl-showcase/runs/1jtvxb1m/).

<div style="text-align: center">
<img src='https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/gpt2_tuning_progress.png' width='800'>
<p style="text-align: center;"> <b>Figure:</b> Reward mean and distribution evolution during training. </p>
</div>

One can observe how the model starts to generate more positive outputs after a few optimisation steps.

> Note: Investigating the KL-divergence will probably show that at this point the model has not yet converged to the target KL-divergence. Getting there would require longer training or starting with a higher initial coefficient.
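
To illustrate what "converging to the target KL-divergence" means, the sketch below shows a proportional controller for the KL penalty coefficient, in the spirit of the adaptive scheme described in the paper cited earlier. The class name `ToyAdaptiveKL` and all numeric values (initial coefficient, target, horizon, observed KL) are assumptions chosen for illustration, not settings read from this run.

```python
class ToyAdaptiveKL:
    """Sketch of a proportional controller for the KL penalty coefficient."""

    def __init__(self, init_coef, target_kl, horizon):
        self.coef = init_coef
        self.target_kl = target_kl
        self.horizon = horizon

    def update(self, observed_kl, n_steps):
        # Relative error, clipped so a single batch cannot swing the coefficient.
        error = min(max(observed_kl / self.target_kl - 1.0, -0.2), 0.2)
        self.coef *= 1.0 + error * n_steps / self.horizon
        return self.coef


# All numbers below are illustrative assumptions, not values from this run.
controller = ToyAdaptiveKL(init_coef=0.2, target_kl=6.0, horizon=10_000)
for observed_kl in [12.0, 11.0, 9.5]:  # KL still above the target
    print(round(controller.update(observed_kl, n_steps=256), 5))
```

Because each update changes the coefficient only slightly, a run of about a hundred steps may end before the penalty has grown enough to pull the KL down to its target, which is consistent with the note above.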

## Model inspection
Let's inspect some examples from the IMDB dataset. We can use `ref_model` to compare the tuned model `model` against the model before optimisation.

In [None]:
#### get a batch from the dataset
bs = 12
game_data = dict()
dataset.set_format("pandas")
df_batch = dataset[:].sample(bs)
game_data['query'] = df_batch['query'].tolist()
query_tensors = df_batch['input_ids'].tolist()

response_tensors_ref, response_tensors = [], []

#### get response from gpt2 and gpt2_ref
for i in range(bs):
    gen_len = output_length_sampler()
    output = ref_model.generate(torch.tensor(query_tensors[i]).unsqueeze(dim=0).to(device),
                                     max_new_tokens=gen_len, **gen_kwargs).squeeze()[-gen_len:]
    response_tensors_ref.append(output)
    output = model.generate(torch.tensor(query_tensors[i]).unsqueeze(dim=0).to(device),
                                 max_new_tokens=gen_len, **gen_kwargs).squeeze()[-gen_len:]
    response_tensors.append(output)

#### decode responses
game_data['response (before)'] = [tokenizer.decode(response_tensors_ref[i]) for i in range(bs)]
game_data['response (after)'] = [tokenizer.decode(response_tensors[i]) for i in range(bs)]

#### sentiment analysis of query/response pairs before/after
texts = [q + r for q,r in zip(game_data['query'], game_data['response (before)'])]
game_data['rewards (before)'] = [output[1]["score"] for output in sentiment_pipe(texts, **sent_kwargs)]

texts = [q + r for q,r in zip(game_data['query'], game_data['response (after)'])]
game_data['rewards (after)'] = [output[1]["score"] for output in sentiment_pipe(texts, **sent_kwargs)]

# store results in a dataframe
df_results = pd.DataFrame(game_data)
df_results

| | query | response (before) | response (after) | rewards (before) | rewards (after) |
|---|---|---|---|---|---|
| 0 | Isso tem que ser | uma arma do exérc | um dos melhores film mais | -0.319532 | 1.414127 |
| 1 | Visualmente deslumbr | ante, muitas vezes atrev | ado, é um dos filmes mais | 1.545089 | 2.833027 |
| 2 | Em um dormit | ório simplificado, um jovem adolescent | ório em frente ao lado da | 0.200822 | 0.175321 |
| 3 | Este filme não | é tão inacrizante quanto pode ser. Se p | recebe dinheiro, mas é um dos melhores que e | -3.367388 | 2.949946 |
| 4 | Em 1942, um filme de Manhattan | envolvendo algumas vezes elétrico e | também conheceu o seu melhor --e | 0.182283 | 1.470433 |
| 5 | Este filme | foi muit | é um dos mel | -1.119987 | 0.214351 |
| 6 | Este filme é um dos mel | hores de Tim Burton, Arthur Nunez e Anita | hores programas sofisticados do ano. | 3.067658 | 2.643729 |
| 7 | \*\*\*\* Inclui spoilers \*\*\*\* | Carla de uma m | Este é um dos mel | -0.315817 | -1.209339 |
| 8 | No início do fil | me (2008), há uma grande paixão na car | me, ele era um filme divertido quando nos perc... | -0.554945 | -0.570873 |
| 9 | Este filme é incrí | vel. Estava | cil, mas que | 2.872612 | 1.851713 |


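Comparing the mean and median reward of the generated sequences before and after optimisation shows a clear increase on this small sample: the mean rises from about 0.32 to 1.05 and the median from about -0.07 to 1.44.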

In [None]:
print('mean:')
display(df_results[["rewards (before)", "rewards (after)"]].mean())
print()
print('median:')
display(df_results[["rewards (before)", "rewards (after)"]].median())

mean:


rewards (before)    0.318937
rewards (after)     1.046508
dtype: float64


median:


rewards (before)   -0.066767
rewards (after)     1.442280
dtype: float64
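
Aggregate statistics can hide per-sample behaviour, so it can help to also count how many individual samples improved. The sketch below does this with the standard library for the ten rows displayed in the table above. The notebook's printed statistics were computed over the whole sampled batch (`bs = 12`), so the means obtained here can differ from them.

```python
from statistics import mean, median

# (before, after) rewards copied from the ten rows displayed in the table above.
rewards = [
    (-0.319532, 1.414127),
    (1.545089, 2.833027),
    (0.200822, 0.175321),
    (-3.367388, 2.949946),
    (0.182283, 1.470433),
    (-1.119987, 0.214351),
    (3.067658, 2.643729),
    (-0.315817, -1.209339),
    (-0.554945, -0.570873),
    (2.872612, 1.851713),
]

before = [b for b, _ in rewards]
after = [a for _, a in rewards]
improved = sum(a > b for b, a in rewards)  # samples whose reward went up

print("mean:  ", round(mean(before), 4), "->", round(mean(after), 4))
print("median:", round(median(before), 4), "->", round(median(after), 4))
print("improved samples:", improved, "of", len(rewards))
```

With so few samples the comparison is only indicative; a larger batch would give a more reliable picture of the effect of PPO training.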

## Save model
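Finally, we save the model and tokenizer for later use. Here they are written to a local directory (`push_to_hub=False`); set `push_to_hub=True` to also upload them to the Hugging Face Hub.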

In [None]:
model.save_pretrained('/content/drive/MyDrive/LLMs/ckpts/gpt2-imdb-pos-v2', push_to_hub=False)
tokenizer.save_pretrained('/content/drive/MyDrive/LLMs/ckpts/gpt2-imdb-pos-v2', push_to_hub=False)

('/content/drive/MyDrive/LLMs/ckpts/gpt2-imdb-pos-v2/tokenizer_config.json',
 '/content/drive/MyDrive/LLMs/ckpts/gpt2-imdb-pos-v2/special_tokens_map.json',
 '/content/drive/MyDrive/LLMs/ckpts/gpt2-imdb-pos-v2/vocab.json',
 '/content/drive/MyDrive/LLMs/ckpts/gpt2-imdb-pos-v2/merges.txt',
 '/content/drive/MyDrive/LLMs/ckpts/gpt2-imdb-pos-v2/added_tokens.json',
 '/content/drive/MyDrive/LLMs/ckpts/gpt2-imdb-pos-v2/tokenizer.json')