<a href="https://colab.research.google.com/github/finardi/tutos/blob/master/GPT2%20-%20RL(BERT)Feedback.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<div style="text-align: center">
<img src='https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/gpt2-ctrl-training-setup.png' width='600'>
<p style="text-align: center;"> <b>Figure:</b> Experiment setup to tune GPT2. The yellow arrows are outside the scope of this notebook, but the trained models are available through Hugging Face. </p>
</div>


In this notebook we fine-tune GPT2 (small) to generate positive movie reviews based on the IMDB dataset. The model gets the start of a real review and is tasked to produce positive continuations. To reward positive continuations we use a BERT classifier to analyse the sentiment of the produced sentences and use the classifier's outputs as rewards signals for PPO training.

### Import dependencies

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
%pip install -q transformers trl wandb

In [3]:
import torch
from tqdm.auto import tqdm

import gc

from transformers import pipeline, AutoTokenizer
from datasets import load_dataset

from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

MANUAL_SEED  = 2711
def deterministic(rep=True):
    torch.manual_seed(MANUAL_SEED)
    if torch.cuda.is_available():
            torch.cuda.manual_seed(MANUAL_SEED)
            torch.cuda.manual_seed_all(MANUAL_SEED)
            torch.backends.cudnn.enabled = False 
            torch.backends.cudnn.benchmark = False
            torch.backends.cudnn.deterministic = True
            print(f'Experimento deterministico, seed: {MANUAL_SEED} -- ', end = '')
            print(f'Existe {torch.cuda.device_count()} GPU {torch.cuda.get_device_name(0)} disponível.')
    else:
        print('Device CPU')
deterministic()        

Experimento deterministico, seed: 2711 -- Existe 1 GPU Tesla T4 disponível.


### Configuration

In [4]:
config = PPOConfig(
    model_name='/content/drive/MyDrive/LLMs/ckpts/GPT_imdb',
    learning_rate=1.41e-5,
    log_with="wandb",
    batch_size=96,

)

# sent_pipeline
sent_kwargs = {
    "return_all_scores": True,
    "function_to_apply": "none",
    "batch_size": 12
}

model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
tokenizer = AutoTokenizer.from_pretrained(config.model_name)
tokenizer.pad_token = tokenizer.eos_token

In [5]:
import wandb
wandb.init()

[34m[1mwandb[0m: Currently logged in as: [33mpfinardi[0m. Use [1m`wandb login --relogin`[0m to force relogin


You can see that we load a GPT2 model called `gpt2_imdb`. This model was additionally fine-tuned on the IMDB dataset for 1 epoch with the huggingface [script](https://github.com/huggingface/transformers/blob/master/examples/run_language_modeling.py) (no special settings). The other parameters are mostly taken from the original paper ["Fine-Tuning Language Models from Human Preferences"](
https://arxiv.org/pdf/1909.08593.pdf). This model as well as the BERT model is available in the Huggingface model zoo [here](https://huggingface.co/models). The following code should automatically download the models.

## Load data and models

### Load IMDB dataset
The IMDB dataset contains 50k movie review annotated with "positive"/"negative" feedback indicating the sentiment.  We load the IMDB dataset into a DataFrame and filter for comments that are at least 200 characters. Then we tokenize each text and cut it to random size with the `LengthSampler`.

In [6]:
ds = load_dataset("maritaca-ai/imdb_pt", split='train')
ds = ds.rename_columns({'text': 'review'})
ds = ds.filter(lambda x: (len(x["review"].split())>30) and (len(x["review"].split())<200))
ds



Dataset({
    features: ['review', 'label'],
    num_rows: 15428
})

In [7]:
def create_query(sample, n_words=4):
    sample["input_ids"] = tokenizer.encode((' ').join(sample["review"].split()[:n_words]))
    sample["query"] = tokenizer.decode(sample["input_ids"])
    return sample

def collator(data):
    # toma as keys em data[0]: review, label, input_ids, query
    # para cada key em todo data cria os objetos key [lista de elel]
    return dict((key, [d[key] for d in data]) for key in data[0])    


data = ds.map(create_query, batched=False)
data = data.filter(lambda x: len(x["input_ids"])<10)
data = data.remove_columns(["review"])

data.set_format(type='torch', output_all_columns=True)
data



Dataset({
    features: ['label', 'input_ids', 'query'],
    num_rows: 12205
})

In [8]:
data[0]

{'label': tensor(0),
 'input_ids': tensor([ 4653,  2471,   268,   292, 31215,   819,  7940]),
 'query': 'Se apenas para evitar'}

In [9]:
ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer, dataset=data, data_collator=collator)

VBox(children=(Label(value='0.001 MB of 0.001 MB uploaded (0.000 MB deduped)\r'), FloatProgress(value=1.0, max…

In [10]:
batch = next(iter(ppo_trainer.dataloader))
print(batch.keys())
print(len(batch['label']))
# batch['input_ids']

dict_keys(['label', 'input_ids', 'query'])
96


### Load BERT classifier
We load a BERT classifier fine-tuned on the IMDB dataset.

In [11]:
if ppo_trainer.accelerator.num_processes == 1:
   device = 0 if torch.cuda.is_available() else "cpu" # to avoid a `pipeline` bug

sentiment_pipe = pipeline("sentiment-analysis", model="/content/drive/MyDrive/LLMs/ckpts/BERT_imdb/", device=device)

The model outputs are the logits for the negative and positive class. We will use the logits for positive class as a reward signal for the language model.

In [12]:
text = 'Esse filme foi ruim!!'
sentiment_pipe(text, **sent_kwargs)



[[{'label': 'LABEL_0', 'score': 3.0385935306549072},
  {'label': 'LABEL_1', 'score': -3.26000714302063}]]

In [13]:
text = 'Esse filme foi bom!!'
sentiment_pipe(text, **sent_kwargs)

[[{'label': 'LABEL_0', 'score': -2.3053135871887207},
  {'label': 'LABEL_1', 'score': 2.1447975635528564}]]

### Generation settings
For the response generation we just use sampling and make sure top-k and nucleus sampling are turned off as well as a minimal length.

In [14]:
gen_kwargs = {
    "min_length":-1,
    "top_k": 0.0,
    "top_p": 1.0,
    "do_sample": True,
    "pad_token_id": tokenizer.eos_token_id
}

## Optimize model

### Training loop

The training loop consists of the following main steps:
1. Get the query responses from the policy network (GPT-2)
2. Get sentiments for query/responses from BERT
3. Optimize policy with PPO using the (query, response, reward) triplet

**Training time**

This step takes **~2h** on a V100 GPU with the above specified settings.

In [15]:
try:
    del model
    gc.collect()
    torch.cuda.empty_cache()
    
except:
    pass

generation_kwargs = {
    "min_length":-1,
    "top_k": 0.0,
    "top_p": 1.0,
    "do_sample": True,
    "pad_token_id": tokenizer.eos_token_id
}


MAX_LEN_OUTPUT = 12
N_SAMPLES = 5
loop = tqdm(ppo_trainer.dataloader, leave=True)

for step, batch in enumerate(loop):
    query_tensors = batch['input_ids']

    #### Get response from gpt2
    response_tensors = []
    for query in query_tensors:
        gen_len = MAX_LEN_OUTPUT
        generation_kwargs["max_new_tokens"] = gen_len
        response = ppo_trainer.generate(query, **generation_kwargs)
        response_tensors.append(response.squeeze()[-gen_len:])
    batch['response'] = [tokenizer.decode(r.squeeze()) for r in response_tensors]

    #### Compute sentiment score
    texts = [q + r for q, r in zip(batch['query'], batch['response'])]
    pipe_outputs = sentiment_pipe(texts, **sent_kwargs)
    rewards = [torch.tensor(output[1]["score"]) for output in pipe_outputs]

    #### Run PPO step 
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
    ppo_trainer.log_stats(stats, batch, rewards)
    
    #### Get N_SAMPLES to check
    sample_N = torch.multinomial(
        torch.arange(config.batch_size, dtype=torch.float), num_samples=N_SAMPLES
        )
    response_tensors = torch.stack(response_tensors)
    
    query_sampled    = [tokenizer.decode(query_tensors[s]) for s in sample_N]
    gen_text_sampled = [tokenizer.decode(e) for e in response_tensors[sample_N]]
    rewards_sampled  = torch.stack(rewards)[sample_N]

    print(f'step {step} -- mean_reward: ({torch.tensor(rewards).mean():.3}) -- decoding {N_SAMPLES} rand samples:')
    for ix, (q,a,r) in enumerate(zip(query_sampled, gen_text_sampled, rewards_sampled)):
        print(f'\t{ix}: {q:<35} -> {a:<50} -- Reward: {r.item():<40.3}')
    print()

    loop.set_description(f'mean rewards: {torch.tensor(rewards).mean():.3}')

  0%|          | 0/127 [00:00<?, ?it/s]

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


step 0 -- mean_reward: (-0.28) -- decoding 5 rand samples:
	0: Este filme é uma                    ->  comédia, talvez o filme mais                      -- Reward: 2.85                                    
	1: O que você espera                   ->  deste filme é um palantar medíocre                -- Reward: -2.33                                   
	2: Receio ter que discordar            ->  de muitas outras classificaçõ                     -- Reward: -0.105                                  
	3: Como sempre, fui assistir           ->  Bloody Petty (filme the Sickest Movie) quando     -- Reward: -1.3                                    
	4: A atuação é uma                     ->  obrigação típica de uma                           -- Reward: -0.12                                   

step 1 -- mean_reward: (-0.295) -- decoding 5 rand samples:
	0: Eu me deparei com                   ->  este filme quando assisti ao trailer deste        -- Reward: 0.854                                   
	1: 



step 8 -- mean_reward: (-0.325) -- decoding 5 rand samples:
	0: Se a sua idéia                      ->  de que todo personagem infantil que não           -- Reward: -0.447                                  
	1: É difícil acreditar que             ->  este filme é um dos mais extensos e               -- Reward: -0.411                                  
	2: Ok, então eu entendi.               ->  É bastante tarde, mas... francamente,             -- Reward: -1.29                                   
	3: Citizen X conta ao                  ->  cidadão geneticamente deficiente D cometendo c    -- Reward: -0.745                                  
	4: Eu vi esse filme                    ->  dizendo que era o filme mais engra                -- Reward: 0.251                                   

step 9 -- mean_reward: (-0.278) -- decoding 5 rand samples:
	0: Eu estava preocupado que            ->  havia sacos caminhando em todo o                  -- Reward: -0.849                                  
	1:

### Training progress
If you are tracking the training progress with Weights&Biases you should see a plot similar to the one below. Check out the interactive sample report on wandb.ai: [link](https://app.wandb.ai/lvwerra/trl-showcase/runs/1jtvxb1m/).

<div style="text-align: center">
<img src='https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/gpt2_tuning_progress.png' width='800'>
<p style="text-align: center;"> <b>Figure:</b> Reward mean and distribution evolution during training. </p>
</div>

One can observe how the model starts to generate more positive outputs after a few optimisation steps.

> Note: Investigating the KL-divergence will probably show that at this point the model has not converged to the target KL-divergence, yet. To get there would require longer training or starting with a higher inital coefficient.

## Model inspection
Let's inspect some examples from the IMDB dataset. We can use `model_ref` to compare the tuned model `model` against the model before optimisation.

In [16]:
#### get a batch from the dataset
bs = 12
game_data = dict()
data.set_format("pandas")
df_batch = data[:].sample(bs)
game_data['query'] = df_batch['query'].tolist()
query_tensors = df_batch['input_ids'].tolist()

response_tensors_ref, response_tensors = [], []

#### get response from gpt2 and gpt2_ref
for i in range(bs):
    gen_len = MAX_LEN_OUTPUT
    output = ref_model.generate(torch.tensor(query_tensors[i]).unsqueeze(dim=0).to(device),
                                     max_new_tokens=gen_len, **gen_kwargs).squeeze()[-gen_len:]
    response_tensors_ref.append(output)
    output = model.generate(torch.tensor(query_tensors[i]).unsqueeze(dim=0).to(device),
                                 max_new_tokens=gen_len, **gen_kwargs).squeeze()[-gen_len:]
    response_tensors.append(output)

#### decode responses
game_data['response (before)'] = [tokenizer.decode(response_tensors_ref[i]) for i in range(bs)]
game_data['response (after)'] = [tokenizer.decode(response_tensors[i]) for i in range(bs)]

#### sentiment analysis of query/response pairs before/after
texts = [q + r for q,r in zip(game_data['query'], game_data['response (before)'])]
game_data['rewards (before)'] = [output[1]["score"] for output in sentiment_pipe(texts, **sent_kwargs)]

texts = [q + r for q,r in zip(game_data['query'], game_data['response (after)'])]
game_data['rewards (after)'] = [output[1]["score"] for output in sentiment_pipe(texts, **sent_kwargs)]

# store results in a dataframe
import pandas as pd
df_results = pd.DataFrame(game_data)
df_results

NameError: ignored

Looking at the reward mean/median of the generated sequences we observe a significant difference.

In [None]:
print('mean:')
display(df_results[["rewards (before)", "rewards (after)"]].mean())
print()
print('median:')
display(df_results[["rewards (before)", "rewards (after)"]].median())

## Save model
Finally, we save the model and push it to the Hugging Face for later usage.

In [None]:
model.save_pretrained(    '/content/drive/MyDrive/LLMs/ckpts/gpt2-PTBR_imdb-pos-v2', push_to_hub=False)
tokenizer.save_pretrained('/content/drive/MyDrive/LLMs/ckpts/gpt2-PTBR_imdb-pos-v2', push_to_hub=False)