# Fine-tunning de uma LLM utilizando o aprendizado por reforço busncando prevenir respostas negativos

Retornos negativos de modelos de LLM podem ocorrer quando este não é treinado de maneira eficiente para prevenir esse tipo de ocorrência. Uma das principais técnicas utilizadas atualmente para previnir retornos que podem conter caráter ofensivo (tanto em aspecto ético, como legal) é a utilização de modelos de aprendizado por reforço, que realizam o ajuste do modelo de LLM até que a sua reposta esteja de acordo com o esperado.

Summário:

- [1 - Configurando o Ambiente](#1)
   - [1.1 - Verifica a Instancia e Instala os Utilitários que serão Necessários](#1.1)
   - [1.2 - Importa as Bibliotecas que serão utilizadas](#1.2) 
- [2 - Carrega o modelo FLAN-T5, Prepara o Modelo de Recompensa e a Avaliação da Toxicidade](#2)
   - [2.1 - Preparando o modelo de recompensa](#2.1)
- [3 - Aplicando o PPO para Retirar a Toxidade do modelo](#3)
   - [3.1 - Inicializa o PPOTrainer](#3.1)
   - [3.2 - Realiza o Fine-tunning do modelo com a tecnica de POO](#3.2)
   - [3.3 - Avalia o modelo qualitativamente](#3.3)

<a name='1' ></a>
# 1 - Configurando o Ambiente

<a name='1.1'></a>
## 1.1 - Verifica a Instancia e Instala os Utilitários que serão Necessários

In [2]:
import os

instance_type_expected = "ml-m5-2xlarge"
instance_type_current = os.environ.get("HOSTNAME")

print(f"Expected instance type: instance-datascience-{instance_type_expected}")
print(f"Current instance type: {instance_type_expected}")

assert  instance_type_expected in instance_type_current, f"ERROR. Expected instance type: {instance_type_expected}"
print(f"Instance type has been choosen correctly")

Expected instance type: instance-datascience-ml-m5-2xlarge
Current instance type: ml-m5-2xlarge
Instance type has been choosen correctly


In [3]:
%pip install -U datasets==2.17.0

%pip install --upgrade pip
%pip install --disable-pip-version-check \
    torch==1.13.1 \
    torchdata==0.5.1 --quiet

%pip install \
    transformers==4.27.2 \
    evaluate==0.4.0 \
    rouge_score==0.1.2 \
    peft==0.3.0 --quiet

# Installing Reinforcement learning direct from library
%pip install git+http://github.com/lvwerra/trl.git@25fa1bd

Collecting datasets==2.17.0
  Downloading datasets-2.17.0-py3-none-any.whl.metadata (20 kB)
Collecting pyarrow-hotfix (from datasets==2.17.0)
  Downloading pyarrow_hotfix-0.6-py3-none-any.whl.metadata (3.6 kB)
Collecting xxhash (from datasets==2.17.0)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting fsspec<=2023.10.0,>=2023.1.0 (from fsspec[http]<=2023.10.0,>=2023.1.0->datasets==2.17.0)
  Downloading fsspec-2023.10.0-py3-none-any.whl.metadata (6.8 kB)
Collecting aiohttp (from datasets==2.17.0)
  Downloading aiohttp-3.9.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.4 kB)
Collecting huggingface-hub>=0.19.4 (from datasets==2.17.0)
  Downloading huggingface_hub-0.21.3-py3-none-any.whl.metadata (13 kB)
Collecting aiosignal>=1.1.2 (from aiohttp->datasets==2.17.0)
  Downloading aiosignal-1.3.1-py3-none-any.whl.metadata (4.0 kB)
Collecting frozenlist>=1.1.1 (from aiohttp->datasets==2.17.0)
  Downloading

In [4]:
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification, \
                            AutoModelForSeq2SeqLM, GenerationConfig

from datasets import load_dataset
from peft import PeftModel, PeftConfig, LoraConfig, TaskType

# tlr: transformer reinforcement learning library
from trl import PPOTrainer, PPOConfig, AutoModelForSeq2SeqLMWithValueHead
from trl import create_reference_model
from trl.core import LengthSampler

import torch
import evaluate

import numpy as np
import pandas as pd

# tqdm library make the loops show a smart process meter
from tqdm import tqdm
tqdm.pandas()

<a name=2></a>
# 2 - Carrega o modelo FLAN-T5, Prepara o Modelo de Recompensa e a Avaliação da Toxicidade

In [5]:
model_name = "google/flan-t5-base"
huggingface_dataset_name = "knkarthick/dialogsum"

base_model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

reward_model = AutoModelForSequenceClassification.from_pretrained("gpt2")

original_dataset = load_dataset(huggingface_dataset_name)

original_dataset

config.json:   0%|          | 0.00/1.40k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/990M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Downloading readme:   0%|          | 0.00/4.65k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/11.3M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/442k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.35M [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 12460
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 500
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 1500
    })
})

In [60]:
def build_dataset(dataset_name, model_name, input_min_length, input_max_length):
    
    # Load the dataset
    dataset = load_dataset(dataset_name, split="train")
    
    # Filter dialogues lens into min and max input length
    dataset = dataset.filter(lambda x: len(x["dialogue"]) > input_min_length and len(x["dialogue"]) <= input_max_length, batched=False)
    
    # Prepare the tokenizer. Set device_map='Auto' allows sweet from CPU to GPU automatically
    tokenizer = AutoTokenizer.from_pretrained(model_name, device_map="auto")
    
    def tokenize(sample):
        
        prompt = f"""
        
        Summaryze the following conversation.
        
        {sample["dialogue"]}
        
        Summary:"""
        
        sample["input_ids"] = tokenizer.encode(prompt)
        
        # This must be called "query". Which is a requirement of PPO library
        sample["query"] = tokenizer.decode(sample["input_ids"])
        return sample
    
    # Tokenize the dataset dialogues
    dataset = dataset.map(tokenize, batched=False)
    dataset.set_format(type="torch")
    
    # Split data into train and test parts
    dataset_splits = dataset.train_test_split(test_size=0.2, shuffle=False, seed=42)
    
    return dataset_splits

dataset = build_dataset(huggingface_dataset_name, model_name, input_min_length=200, input_max_length=1000) 
    
print(dataset)              

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic', 'input_ids', 'query'],
        num_rows: 8017
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic', 'input_ids', 'query'],
        num_rows: 2005
    })
})


Download the fully peft pre-trained model from s3.

In [61]:
!aws s3 cp --recursive s3://dlai-generative-ai/models/peft-dialogue-summary-checkpoint/ ./peft-dialogue-summary-checkpoint-from-s3/ 

download: s3://dlai-generative-ai/models/peft-dialogue-summary-checkpoint/special_tokens_map.json to peft-dialogue-summary-checkpoint-from-s3/special_tokens_map.json
download: s3://dlai-generative-ai/models/peft-dialogue-summary-checkpoint/adapter_config.json to peft-dialogue-summary-checkpoint-from-s3/adapter_config.json
download: s3://dlai-generative-ai/models/peft-dialogue-summary-checkpoint/tokenizer_config.json to peft-dialogue-summary-checkpoint-from-s3/tokenizer_config.json
download: s3://dlai-generative-ai/models/peft-dialogue-summary-checkpoint/tokenizer.json to peft-dialogue-summary-checkpoint-from-s3/tokenizer.json
download: s3://dlai-generative-ai/models/peft-dialogue-summary-checkpoint/adapter_model.bin to peft-dialogue-summary-checkpoint-from-s3/adapter_model.bin


In [62]:
!ls -alh ./peft-dialogue-summary-checkpoint-from-s3/adapter_model.bin

-rw-r--r-- 1 root root 14M May 15  2023 ./peft-dialogue-summary-checkpoint-from-s3/adapter_model.bin


No laboratio experimento anterio foi incluido um adaptador em que todos os pesos foram congelados, somente para inferencia.
Agora, as configurações do Lora são necessários pois o adaptador será treinado.

In [63]:
def print_number_of_model_trainable_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():   # Parametros de toda a rede
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params +=param.numel()
    return f"trainable model parameters: {trainable_model_params}. Model parameters:{all_model_params}. Percentage of trainable model parameters: {100*(trainable_model_params/all_model_params)}"

In [64]:
lora_config = LoraConfig(
    r=32,   # Rank,
    lora_alpha=32,
    target_modules = ["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM   # FLAN-T5
)

model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

peft_model = PeftModel.from_pretrained(model,
                                       "./peft-dialogue-summary-checkpoint-from-s3/",
                                       lora_config=lora_config,
                                       torch_dtype=torch.bfloat16,
                                       device_map="auto",
                                       is_trainable=True
                                      )

print(f"Peft model parameters to be update: {print_number_of_model_trainable_parameters(peft_model)}\n")

Peft model parameters to be update: trainable model parameters: 3538944. Model parameters:251116800. Percentage of trainable model parameters: 1.4092820552029972



Neste laboratorio esta sendo pre-treinado um LLM utilizando-se para isto a tecnica PEFT/Lora. E, para o treinamento utilizando o metodo Lora,
um algoritmo de aprendizado por reforço é utilizado. O aprendizado por reforço utiliza uma tecnica de PPO para otimização da política RL.

In [65]:
ppo_model = AutoModelForSeq2SeqLMWithValueHead.from_pretrained(peft_model,
                                                               torch_dtypte=torch.bfloat16,
                                                               is_trainable=True
                                                              ) 
print(f"PPO model parameters to be update: {print_number_of_model_trainable_parameters(ppo_model)}\n")

Detected kernel version 4.14.336, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


PPO model parameters to be update: trainable model parameters: 3539713. Model parameters:251117569. Percentage of trainable model parameters: 1.4095839706062143



Cria uma copia do modelo sem o ajuste da toxidade.

In [66]:
ref_model = create_reference_model(ppo_model)

<a name='2.1'></a>
## 2.1 - Preparando o modelo de recompensa

In [67]:
toxicity_model_name = "facebook/roberta-hate-speech-dynabench-r4-target"
toxicity_tokenizer = AutoTokenizer.from_pretrained(toxicity_model_name, device_map="auto")
toxicity_model = AutoModelForSequenceClassification.from_pretrained(toxicity_model_name, device_map="auto")
print(toxicity_model.config.id2label)

{0: 'nothate', 1: 'hate'}


Pega alguns exemplo e passa para o modelo.

In [68]:
non_toxic_text = "Person 1# tells tony that he didn't like the movie"

toxicity_input_ids = toxicity_tokenizer(non_toxic_text, return_tensors='pt').input_ids

# Aplica a transormacao utilizando a funcao logistica
logits = toxicity_model(input_ids=toxicity_input_ids).logits
print(f"outputs of logit function: [not hate, hate]: {logits.tolist()[0]}")
      
# Imprime as probabilidade para [not hate, hate]
probabilities = logits.softmax(dim=-1).tolist()[0]
print(f"probabilities for [not hate, hate]: {probabilities}")
      
# Pega a saida da funcao logistica para [not hate]. Essa é a recompensa!
not_hate_index = 0
not_hate_logit = logits.tolist()[0][not_hate_index]
print(f"output of logit function: [not hate]. This is the reward: {not_hate_logit}")

outputs of logit function: [not hate, hate]: [3.1276395320892334, -2.4689579010009766]
probabilities for [not hate, hate]: [0.9963032007217407, 0.003696749685332179]
output of logit function: [not hate]. This is the reward: 3.1276395320892334


Vamos ver agora um exemplo de um comentario negativo. A "recompensa" vai ser menor.

In [69]:
toxic_text = "Person 1# tells tony that this movie is terrible and stupid."

toxicity_input_ids = toxicity_tokenizer(toxic_text, return_tensors='pt').input_ids

# Aplica a transormacao utilizando a funcao logistica
logits = toxicity_model(input_ids=toxicity_input_ids).logits
print(f"outputs of logit function: [not hate, hate]: {logits.tolist()[0]}")
      
# Imprime as probabilidade para [not hate, hate]
probabilities = logits.softmax(dim=-1).tolist()[0]
print(f"probabilities for [not hate, hate]: {probabilities}")
      
# Pega a saida da funcao logistica para [not hate]. Essa é a recompensa!
not_hate_index = 0
not_hate_logit = logits.tolist()[0][not_hate_index]
print(f"output of logit function: [not hate]. This is the reward: {not_hate_logit}")

outputs of logit function: [not hate, hate]: [-0.6807236075401306, 0.3578655421733856]
probabilities for [not hate, hate]: [0.2614223062992096, 0.738577663898468]
output of logit function: [not hate]. This is the reward: -0.6807236075401306


Agora, utilizando a biblioteca hugging face para simplificar o codigo.

In [70]:
device = 0 if torch.cuda.is_available() else "cpu"

sentiment_pipe = pipeline("sentiment-analysis",
                           model=toxicity_model_name,
                           device=device)
reward_logits_kwards = {
    "top_k":None,   # Return all score 
    "function_to_apply":"none",   # Return the output of the logistic function only
    "batch_size":16
}

reward_probabilities_kwards = {
    "top_k":None,   # Return all score 
    "function_to_apply":"softmax",   # Return the output of the logistic after apply the softmax funtion
    "batch_size":16
}


print("Reward model output")
print("For non-toxicity text")
print(sentiment_pipe(non_toxic_text, **reward_logits_kwards))
print(sentiment_pipe(non_toxic_text, **reward_probabilities_kwards))
print("For toxicity text")
print(sentiment_pipe(toxic_text, **reward_logits_kwards))
print(sentiment_pipe(toxic_text, **reward_probabilities_kwards))

Reward model output
For non-toxicity text
[{'label': 'nothate', 'score': 3.1276395320892334}, {'label': 'hate', 'score': -2.4689579010009766}]
[{'label': 'nothate', 'score': 0.9963032007217407}, {'label': 'hate', 'score': 0.003696749685332179}]
For toxicity text
[{'label': 'hate', 'score': 0.3578655421733856}, {'label': 'nothate', 'score': -0.6807236075401306}]
[{'label': 'hate', 'score': 0.738577663898468}, {'label': 'nothate', 'score': 0.2614223062992096}]


<a name='2.3'></a>
2.3- Avaliando a toxidade do modelo

In [71]:
toxicity_evaluator = evaluate.load("toxicity",
                                  toxicity_model_name,
                                  module_type="measurement",
                                  toxic_label="hate"
                                 )

Calcula o score de toxidade. Não é surpresa que este seja igual a saída da função logistica para a classe [hate]

In [72]:
toxicity_score = toxicity_evaluator.compute(predictions=[non_toxic_text])

print(f"Non toxic text toxicity measurement: {toxicity_score}")

toxicity_score = toxicity_evaluator.compute(predictions=[toxic_text])

print(f"Toxic text toxicity measurement: {toxicity_score}")

Non toxic text toxicity measurement: {'toxicity': [0.003696749685332179]}
Toxic text toxicity measurement: {'toxicity': [0.738577663898468]}


Este modelo pode ser utilizado para avaliar o score dos dialogos carregados na seção [2.1](#1).

In [73]:
def evaluate_toxicity(model, toxicity_evaluator, tokenizer, dataset, num_samples):
    
    
    max_new_tokens = 100
    
    toxicities = []
    
    input_texts = []
    
    for i, sample in tqdm(enumerate(dataset)):
        imput_text = sample["query"]
        
        if i > num_samples:
            break
        
        input_ids = tokenizer(imput_text, return_tensors='pt', padding=True).input_ids
        
        generation_config = GenerationConfig(max_new_tokens=max_new_tokens,
                                             top_k=0.0,
                                             top_p=1.0,
                                             do_samples=True)
        
        response_tokens_ids = model.generate(input_ids=input_ids,
                                             generation_config=generation_config)
        
        generation_text = tokenizer.decode(response_tokens_ids[0], skip_special_tokens=True)
        
        toxicity_score = toxicity_evaluator.compute(predictions=[(imput_text + "" + generation_text)])
        
        toxicities.extend(toxicity_score["toxicity"])
        
    # Compute mean and std
    mean_toxicity = np.mean(toxicities)
    std_toxicity = np.std(toxicities)
    
    return mean_toxicity, std_toxicity

Verifica o desempenho do modelo anteriormente ao fine-tunning

In [75]:
tokenizer = AutoTokenizer.from_pretrained(model_name, device_map="auto")

mean_before_fine_tunning, std_before_fine_tunning = evaluate_toxicity(model=ref_model,
                                                                      toxicity_evaluator=toxicity_evaluator,
                                                                      tokenizer=tokenizer,
                                                                      dataset=dataset["test"],
                                                                      num_samples=10)

print(f"toxicity mean and std before fine-tunning, mean: {mean_before_fine_tunning} - std: {std_before_fine_tunning}")

11it [00:22,  2.08s/it]

toxicity mean and std before fine-tunning, mean: 0.01768248436168175 - std: 0.022171593557173273





<a name="3" ></a>
# 3 - Aplicando o PPO para Retirar a Toxidade do modelo

<a name="3.1" ></a>
## 3.1 - Inicializa o PPOTrainer

In [76]:
def collator(data):
    return dict((key, [d[key] for d in data]) for key in data[0])

test_data = [{'key1': 'value1', 'key2': 'value2', 'key3': 'value3'}]

print(f"Test data: {test_data}")
print(f"Collator of the test data: {collator(test_data)}")

Test data: [{'key1': 'value1', 'key2': 'value2', 'key3': 'value3'}]
Collator of the test data: {'key1': ['value1'], 'key2': ['value2'], 'key3': ['value3']}


In [78]:
learning_rate = 1.41e-5
max_ppo_epochs = 1
mini_batch_size = 4
batch_size = 16

config = PPOConfig(
    model_name=model_name,
    learning_rate=learning_rate,
    ppo_epochs=max_ppo_epochs,
    mini_batch_size=mini_batch_size,
    batch_size=batch_size
)

ppo_trainer = PPOTrainer(config=config,
                         model=ppo_model,
                         ref_model=ref_model,
                         tokenizer=tokenizer,
                         dataset=dataset["train"],
                         data_collator=collator)

Detected kernel version 4.14.336, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


<a name='3.2'></a>
## 3.2 - Realiza o Fine-tunning do modelo com a tecnica de POO

In [None]:
output_min_len = 100
output_max_len = 400
output_length_sampler = LengthSampler(output_min_len, output_max_len)

generation_kwards = {
    "min_length": 5,
    "top_k": 0.0,
    "top_p": 1.0,
    "do_sample": True
}

reward_kwards = {
    "top_k": None,
    "function_to_apply": "none",
    "batch_size": 16
}

max_ppo_steps = 10

for step, batch in tqdm(enumerate(ppo_trainer.dataloader)):
    # break when reach the max_steps
    if step > max_ppo_steps:
        break
    
    prompt_tensors = batch["input_ids"]
    
    # Get responses from FLAN-T5/PEFT LLM
    summary_tensors = []
    
    for prompt_tensor in prompt_tensors:
        max_new_tokens = output_length_sampler()
        
        generation_kwards["max_new_tokens"] = max_new_tokens
        
        summary = ppo_trainer.generate(prompt_tensor, **generation_kwards)
        
        summary_tensors.append(summary.squeeze()[-max_new_tokens:])
        
        
    # This need to be called response
    batch["response"] = [tokenizer.decode(r.squeeze()) for r in summary_tensors]
        
    # Compute rewards output
    query_reponse_pairs = [q + r for q, r in zip(batch["query"], batch["response"])]
    rewards = sentiment_pipe(query_reponse_pairs, **reward_kwards)   
    
    print(rewards)
    # Use the [nothate] item because this is the score for the position [nothate] class
    reward_tensors = [torch.tensor(reward[not_hate_index]["score"]) for reward in rewards]
                                  
    # Run ppo step
    stats = ppo_trainer.step(prompt_tensors, summary_tensors, reward_tensors)
    ppo_trainer.log_stats(stats, batch, reward_tensors)
    
    print(f'objective/kl: {stats["objective/kl"]}')
    print(f'ppo/returns/mean: {stats["ppo/returns/mean"]}')
    print(f'ppo/policy/advantages_mean: {stats["ppo/policy/advantages_mean"]}')
    print("-"*100)

0it [00:00, ?it/s]You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


[[{'label': 'nothate', 'score': 2.9155023097991943}, {'label': 'hate', 'score': -2.291459560394287}], [{'label': 'nothate', 'score': 2.6924073696136475}, {'label': 'hate', 'score': -2.123380661010742}], [{'label': 'nothate', 'score': 1.6082422733306885}, {'label': 'hate', 'score': -1.249406337738037}], [{'label': 'nothate', 'score': 2.9333224296569824}, {'label': 'hate', 'score': -2.4180214405059814}], [{'label': 'nothate', 'score': 2.5497634410858154}, {'label': 'hate', 'score': -2.025524139404297}], [{'label': 'nothate', 'score': 2.162712574005127}, {'label': 'hate', 'score': -1.695723056793213}], [{'label': 'nothate', 'score': 1.4694528579711914}, {'label': 'hate', 'score': -1.1428331136703491}], [{'label': 'nothate', 'score': 1.6313555240631104}, {'label': 'hate', 'score': -1.326237678527832}], [{'label': 'nothate', 'score': 1.577005386352539}, {'label': 'hate', 'score': -1.266847848892212}], [{'label': 'nothate', 'score': 3.3155808448791504}, {'label': 'hate', 'score': -2.67425251

1it [01:47, 107.24s/it]

objective/kl: 37.28555679321289
ppo/returns/mean: -1.0773183107376099
ppo/policy/advantages_mean: -7.3118471277666686e-09
----------------------------------------------------------------------------------------------------
[[{'label': 'nothate', 'score': 2.124241590499878}, {'label': 'hate', 'score': -1.6787097454071045}], [{'label': 'nothate', 'score': 1.8615583181381226}, {'label': 'hate', 'score': -1.4675705432891846}], [{'label': 'nothate', 'score': 1.8002383708953857}, {'label': 'hate', 'score': -1.3934768438339233}], [{'label': 'nothate', 'score': 3.0695385932922363}, {'label': 'hate', 'score': -2.4361748695373535}], [{'label': 'nothate', 'score': 2.0829241275787354}, {'label': 'hate', 'score': -1.6129119396209717}], [{'label': 'nothate', 'score': 2.078359603881836}, {'label': 'hate', 'score': -1.5973446369171143}], [{'label': 'nothate', 'score': 4.1084465980529785}, {'label': 'hate', 'score': -3.489579916000366}], [{'label': 'nothate', 'score': 2.1222023963928223}, {'label': 'ha

2it [03:23, 100.95s/it]

objective/kl: 36.44472122192383
ppo/returns/mean: -0.966327428817749
ppo/policy/advantages_mean: -6.0313798400102314e-09
----------------------------------------------------------------------------------------------------
[[{'label': 'hate', 'score': 0.09352150559425354}, {'label': 'nothate', 'score': -0.22506117820739746}], [{'label': 'nothate', 'score': 2.2582011222839355}, {'label': 'hate', 'score': -1.7329121828079224}], [{'label': 'nothate', 'score': 1.4088160991668701}, {'label': 'hate', 'score': -1.1574993133544922}], [{'label': 'nothate', 'score': 1.158969521522522}, {'label': 'hate', 'score': -0.938805341720581}], [{'label': 'nothate', 'score': 3.097087860107422}, {'label': 'hate', 'score': -2.486356496810913}], [{'label': 'nothate', 'score': 0.6831753849983215}, {'label': 'hate', 'score': -0.5990680456161499}], [{'label': 'nothate', 'score': 3.1910274028778076}, {'label': 'hate', 'score': -2.501859664916992}], [{'label': 'nothate', 'score': 2.4501819610595703}, {'label': 'hat

3it [04:47, 93.07s/it] 

objective/kl: 30.097347259521484
ppo/returns/mean: -0.9715116024017334
ppo/policy/advantages_mean: -8.30156832165585e-09
----------------------------------------------------------------------------------------------------
[[{'label': 'nothate', 'score': 3.282665252685547}, {'label': 'hate', 'score': -2.659043312072754}], [{'label': 'nothate', 'score': 1.8448354005813599}, {'label': 'hate', 'score': -1.445022702217102}], [{'label': 'nothate', 'score': 1.9774035215377808}, {'label': 'hate', 'score': -1.5677027702331543}], [{'label': 'nothate', 'score': 3.7356529235839844}, {'label': 'hate', 'score': -3.0938377380371094}], [{'label': 'nothate', 'score': 3.152747392654419}, {'label': 'hate', 'score': -2.568631410598755}], [{'label': 'nothate', 'score': 1.694312334060669}, {'label': 'hate', 'score': -1.3282610177993774}], [{'label': 'nothate', 'score': 1.6597055196762085}, {'label': 'hate', 'score': -1.3082761764526367}], [{'label': 'nothate', 'score': 1.6032626628875732}, {'label': 'hate',

4it [06:08, 88.19s/it]

objective/kl: 30.190887451171875
ppo/returns/mean: -0.7808466553688049
ppo/policy/advantages_mean: -4.3584846842747993e-10
----------------------------------------------------------------------------------------------------
[[{'label': 'nothate', 'score': 2.0312414169311523}, {'label': 'hate', 'score': -1.6007598638534546}], [{'label': 'nothate', 'score': 2.7916512489318848}, {'label': 'hate', 'score': -2.205249786376953}], [{'label': 'nothate', 'score': 2.4140689373016357}, {'label': 'hate', 'score': -1.8804314136505127}], [{'label': 'nothate', 'score': 3.057588815689087}, {'label': 'hate', 'score': -2.369279146194458}], [{'label': 'nothate', 'score': 3.401616096496582}, {'label': 'hate', 'score': -2.7510058879852295}], [{'label': 'nothate', 'score': 1.4757474660873413}, {'label': 'hate', 'score': -1.1663193702697754}], [{'label': 'nothate', 'score': 2.470313549041748}, {'label': 'hate', 'score': -1.9726637601852417}], [{'label': 'nothate', 'score': 0.9484515190124512}, {'label': 'hat

5it [07:30, 85.98s/it]

objective/kl: 27.19145965576172
ppo/returns/mean: -0.5293272733688354
ppo/policy/advantages_mean: -8.731987577448308e-09
----------------------------------------------------------------------------------------------------
[[{'label': 'nothate', 'score': 1.9366639852523804}, {'label': 'hate', 'score': -1.4840123653411865}], [{'label': 'nothate', 'score': 2.6126298904418945}, {'label': 'hate', 'score': -2.046844482421875}], [{'label': 'nothate', 'score': 2.418452501296997}, {'label': 'hate', 'score': -1.8784294128417969}], [{'label': 'nothate', 'score': 1.57919442653656}, {'label': 'hate', 'score': -1.2508587837219238}], [{'label': 'nothate', 'score': 2.963106632232666}, {'label': 'hate', 'score': -2.4120545387268066}], [{'label': 'nothate', 'score': 2.3630409240722656}, {'label': 'hate', 'score': -1.8591077327728271}], [{'label': 'nothate', 'score': 2.8167941570281982}, {'label': 'hate', 'score': -2.1615700721740723}], [{'label': 'nothate', 'score': 2.5310134887695312}, {'label': 'hate'

6it [09:05, 89.26s/it]

objective/kl: 31.77707290649414
ppo/returns/mean: -0.8464198112487793
ppo/policy/advantages_mean: -2.0219784957475895e-08
----------------------------------------------------------------------------------------------------
[[{'label': 'nothate', 'score': 1.41933274269104}, {'label': 'hate', 'score': -1.1345998048782349}], [{'label': 'nothate', 'score': 3.4381227493286133}, {'label': 'hate', 'score': -2.8685507774353027}], [{'label': 'nothate', 'score': 0.5207525491714478}, {'label': 'hate', 'score': -0.4411659836769104}], [{'label': 'nothate', 'score': 2.7762651443481445}, {'label': 'hate', 'score': -2.2506656646728516}], [{'label': 'nothate', 'score': 2.150848150253296}, {'label': 'hate', 'score': -1.6764607429504395}], [{'label': 'nothate', 'score': 3.277405023574829}, {'label': 'hate', 'score': -2.651092529296875}], [{'label': 'nothate', 'score': 2.4621965885162354}, {'label': 'hate', 'score': -1.9132494926452637}], [{'label': 'nothate', 'score': 1.9966607093811035}, {'label': 'hate

7it [10:40, 90.92s/it]

objective/kl: 28.974096298217773
ppo/returns/mean: -0.6000247597694397
ppo/policy/advantages_mean: -2.4664952302799747e-09
----------------------------------------------------------------------------------------------------
[[{'label': 'nothate', 'score': 2.459815502166748}, {'label': 'hate', 'score': -1.9132345914840698}], [{'label': 'nothate', 'score': 1.080428957939148}, {'label': 'hate', 'score': -0.8835965991020203}], [{'label': 'nothate', 'score': 3.2355856895446777}, {'label': 'hate', 'score': -2.6272566318511963}], [{'label': 'nothate', 'score': 3.628333568572998}, {'label': 'hate', 'score': -2.9806172847747803}], [{'label': 'nothate', 'score': 3.2258081436157227}, {'label': 'hate', 'score': -2.571758270263672}], [{'label': 'nothate', 'score': 2.270799398422241}, {'label': 'hate', 'score': -1.7947824001312256}], [{'label': 'nothate', 'score': 2.269404411315918}, {'label': 'hate', 'score': -1.7373530864715576}], [{'label': 'nothate', 'score': 2.244831085205078}, {'label': 'hate'

10it [15:15, 91.15s/it]

objective/kl: 33.04023361206055
ppo/returns/mean: -0.933142364025116
ppo/policy/advantages_mean: 1.2839547203213897e-08
----------------------------------------------------------------------------------------------------
[[{'label': 'nothate', 'score': 2.025411605834961}, {'label': 'hate', 'score': -1.5545072555541992}], [{'label': 'nothate', 'score': 2.707101821899414}, {'label': 'hate', 'score': -2.1036617755889893}], [{'label': 'nothate', 'score': 2.295156955718994}, {'label': 'hate', 'score': -1.7623037099838257}], [{'label': 'nothate', 'score': 2.160024642944336}, {'label': 'hate', 'score': -1.7335741519927979}], [{'label': 'nothate', 'score': 2.1263046264648438}, {'label': 'hate', 'score': -1.6454317569732666}], [{'label': 'nothate', 'score': 1.5389256477355957}, {'label': 'hate', 'score': -1.2253371477127075}], [{'label': 'nothate', 'score': 2.1456289291381836}, {'label': 'hate', 'score': -1.707491159439087}], [{'label': 'nothate', 'score': 2.521397590637207}, {'label': 'hate', 

11it [16:52, 92.09s/it]

objective/kl: 32.764068603515625
ppo/returns/mean: -0.9268460273742676
ppo/policy/advantages_mean: -1.7478838376661088e-09
----------------------------------------------------------------------------------------------------





<a name="3.3"></a>
3.3 - Avalia o modelo quantitativamente

In [80]:
mean_after_fine_tunning, std_after_fine_tunning = evaluate_toxicity(model=ppo_model,
                                                                      toxicity_evaluator=toxicity_evaluator,
                                                                      tokenizer=tokenizer,
                                                                      dataset=dataset["test"],
                                                                      num_samples=10)

print(f"toxicity mean and std after fine-tunning, mean: {mean_after_fine_tunning} - std: {std_after_fine_tunning}")

11it [00:19,  1.74s/it]

toxicity mean and std after fine-tunning, mean: 0.014201556380033831 - std: 0.01997620393343237





In [82]:
mean_improvement = (mean_after_fine_tunning - mean_before_fine_tunning) / mean_after_fine_tunning

std_improvement = (std_after_fine_tunning - std_before_fine_tunning) / std_after_fine_tunning

print(f"Mean improvement of the ppo from the base model: {mean_improvement}")
print(f"Std improvement of the ppo from the base model: {std_improvement}")

Mean improvement of the ppo from the base model: -0.24510890838287297
Std improvement of the ppo from the base model: -0.10990024085940953


<a name="3.3"></a>
## 3.3 - Avalia o modelo qualitativamente

In [89]:
batch_size = 20

compare_results = {}

df_batch = dataset["test"][0: batch_size]

compare_results["query"] = df_batch["query"]
prompt_tensors = df_batch["input_ids"]

summary_tensors_ref = []

summary_tensors = []

generation_kwards = {
    "min_length": 5,
    "top_k": 0.0,
    "top_p": 1.0,
    "do_sample": True
}

# Get response from ppo base model

for i in tqdm(range(batch_size)):
    gen_len = output_length_sampler()
    generation_kwards["max_new_tokens"] = gen_len
    
    summary = ref_model.generate(
        input_ids=torch.as_tensor(prompt_tensors[i]).unsqueeze(dim=0).to(device),
        **generation_kwards,
    ).squeeze()[-gen_len:]
    summary_tensors_ref.append(summary)
        
    summary = ppo_model.generate(
        input_ids=torch.as_tensor(prompt_tensors[i]).unsqueeze(dim=0).to(device),
        **generation_kwards,
    ).squeeze()[-gen_len:]
    summary_tensors.append(summary)

# Decode Responses
compare_results["response_before"] = [tokenizer.decode(summary_tensors_ref[i]) for i in range(batch_size)]
compare_results["response_after"] = [tokenizer.decode(summary_tensors[i]) for i in range(batch_size)]
              
# Sentiment analysis fom query/response pairs before/after
text_before = [d + s for d, s in zip(compare_results["query"], compare_results["response_before"])]
rewords_before = sentiment_pipe(text_before, **reward_kwards)
compare_results["sentiment_before"] = [reward[not_hate_index]["score"] for reward in rewords_before]

text_after = [d + s for d, s in zip(compare_results["query"], compare_results["response_after"])]
rewords_after = sentiment_pipe(text_after, **reward_kwards)
compare_results["sentiment_after"] = [reward[not_hate_index]["score"] for reward in rewords_after]

100%|██████████| 20/20 [01:27<00:00,  4.38s/it]


Guarda e reve os resultados em um dataframe

In [92]:
pd.set_option('display.max_colwidth', 500)
df_compare_results = pd.DataFrame(compare_results)
df_compare_results["rewards_diff"] = df_compare_results["sentiment_after"] - df_compare_results["sentiment_before"]
df_compare_results_sorted = df_compare_results.sort_values(by="rewards_diff", ascending=False).reset_index(drop=True)
df_compare_results_sorted

Unnamed: 0,query,response_before,response_after,sentiment_before,sentiment_after,rewards_diff
0,"Summaryze the following conversation. #Person1#: Judy, what is everybody talking about? #Person2#: Haven't you heard? Richard was fired by our manager. #Person1#: You're kidding. It can't be true. #Person2#: Believe it or not. Everybody is talking about it in the company. #Person1#: Really? I'm surprised. #Person2#: Me too. Summary:</s>",<pad> Judy is surprised that Richard was fired from a team. Judy is also surprised.</s>,<pad> Judy and #Person1# are surprised because Richard was fired by our manager.</s>,1.137102,2.143,1.005898
1,"Summaryze the following conversation. #Person1#: Today more and more families have personal computers. People have wider range of choice to communicate with the outside world. #Person2#: Right. With the establishment of Internet and a lot of web companies, people are getting more and more dependent on the web. #Person1#: One of the common uses of PC is that people can buy goods through it without going out to the physical stores. #Person2#: Can you tell me how it is done? #Person1#: If a cus...",<pad> #Person1# tells #Person2# how to buy goods through the web. #Person1# shows how it is done for a customer without going to the physical stores. #Person1# tells #Person2# how it is done.</s>,<pad> #Person1# highly recommends the use of PCs to help people communicate with the outside world more. Can people buy goods through it without going out to the physical stores?</s>,2.426362,3.033279,0.606918
2,"Summaryze the following conversation. #Person1#: Could you help me, Sir? My flight got in 15 minutes ago. Everyone else has picked up the luggage but mine hasn't come through. #Person2#: I'm sorry, Madam, I'll go and find out if there is any more to come. Summary:</s>","<pad> #Person1#, #Person2# says her flight got in 15 minutes ago and misses hers.</s>",<pad> #Person1#'s flight got in 15 minutes ago. #Person2# will go and find out if there is more.</s>,1.9263,2.511166,0.584867
3,"Summaryze the following conversation. #Person1#: I'm forming a music band. #Person2#: Do you already know how to play an instrument? #Person1#: Uh... Yeah! I'Ve told you a thousand times that I'm learning to play the drums. Now that I know how to play well, I would like to form a rock band. #Person2#: Aside from yourself, who are the other members of the band? #Person1#: We have a guy who plays guitar, and another who plays bass. Although we still haven't found anyone to be our singer. You t...",<pad> #Person1# is forming a music band and talks about the members of the band. #Person2# will audition #Person1#'s country singing talent with the help of #Person1#'s house for audition.</s>,"<pad> #Person1# asks #Person2# to audition for a rock band. #Person1# talks about the members of the band and calls them to audition. #Person1# refuses because #Person2# doesn't have enough room for the amplifiers, microphones or even the drums.</s>",2.406677,2.875031,0.468354
4,"Summaryze the following conversation. #Person1#: Here is the final draft of our contract. I'm glad that we have reached an agreement on almost every term in our trade. #Person2#: Yes, it seems to me we have come quite a long way. However, let me take a close look at the final draft. #Person1#: Do you have some points to bring up? #Person2#: Well, everything we've discussed seems to be here. #Person1#: Yes, including a description of the shirts you want to purchase this time, the total amount...",<pad> #Person2# looks at the flesh of a final draft of the contract. She wants to sign the contract right now. #Person1# advises her to plenty of time in order to check over the draft.</s>,"<pad> #Person1# and #Person2# are happy they reached an agreement on almost every term in their trade. To ask some questions, #Person1# presents the final draft.</s>",2.867878,3.231239,0.363361
5,"Summaryze the following conversation. #Person1#: Oh, my God! What's this? #Person2#: What? #Person1#: Look! This window is open. #Person2#: Did you open it before we left? #Person1#: Are you kidding? It's winter. Why would I open it? #Person2#: I don't know. Wait. Is this yours? #Person1#: No! Oh, my God! Someone has broken into the house. #Person2#: It looks that way. That's probably why the door wasn't locked when we came in. #Person1#: I locked it when I left though. #Person2#: Yes, but t...",<pad> Allen tells #Person2# he is blind for the winter because he broke into the house with the help of the blind help. Inhuman underwear and stereo aren't found at all. He also tells #Person1# that the unlocked window is open.</s>,<pad> Allen and #Person2# tell each other the window was broken and robber wants to find someone. Allen and #Person1# will look upstairs to find someone.</s>,1.591679,1.878596,0.286917
6,"Summaryze the following conversation. #Person1#: What can I do for you, madam? #Person2#: I'd like to buy a toy car for my son. #Person1#: How about this one? #Person2#: It looks nice. How much is it? #Person1#: They're three hundred dollars. #Person2#: Oh, I'm afraid it's too expensive. Can you show me something cheaper? #Person1#: OK, This one is one hundred and twenty. It's the cheapest here. #Person2#: OK, I'll take it. Here's the money. #Person1#: Thank you very much. Summary:</s>",<pad> #Person1# has most of the toy cars but the others are too expensive. They show #Person2# the other one and they both agree. #Person2# accepts and gets in.</s>,<pad> #Person1# helps #Person2# find the cheaper toy car to get for #Person2# and #Person2# likes it.</s>,1.40755,1.635844,0.228295
7,"Summaryze the following conversation. #Person1#: I would like to order some internet today. #Person2#: What kind would you like? #Person1#: What kind of internet is there? #Person2#: You can get DEL or dial-up. #Person1#: Which of those two is best? #Person2#: I would recommend DEL. #Person1#: So that one better? #Person2#: It's better because it doesn't tie up the phone. #Person1#: What do you mean by that? #Person2#: DEL isn't connected through your phone line, but dial-up is. #Person1#: S...",<pad> #Person2# recommends dial-up to #Person1# for a high-speed internet starting at DEL instead of dial-up because it doesn't tie up the phone. But #Person1# can't use @ the Internet if #Person1#'s on the internet.</s>,<pad> #Person1# would like to order Kplinter internet. #Person1# says DEL isn't connected through the phone network but Dial-up.</s>,2.451935,2.564701,0.112766
8,"Summaryze the following conversation. #Person1#: I'd like to have this cashed, please. #Person2#: Please put you name and address here. May I see your passport? #Person1#: Yes. #Person2#: How would you like it? #Person1#: Ten hundreds and ten twenties, and the rest in small change, please. #Person2#: OK. Here you are. Summary:</s>","<pad> #Person2# helps #Person1# crochet a cash. #Person1# wants to cash in 10 hundreds and 10 twenties, and only in small change.</s>",<pad> #Person1# wants to send a cashed cheque to #Person2#. #Person1# tells #Person2# the amount in cash.</s>,1.647519,1.753495,0.105976
9,"Summaryze the following conversation. #Person1#: Could you help me figure out how to look for a job? #Person2#: We have lots of options, what type of job do you need? #Person1#: I want to work in an office. #Person2#: Do you want to work part-time or full-time? #Person1#: I want to work full-time. #Person2#: We have binders with local job listings or you can make use of the computers. OK? #Person1#: I am confused a bit but I am sure that I can figure it out. #Person2#: If you make an appoint...",<pad> #Person1# tells #Person2# #Person1# is confused about a job. #Person2# offers advice on who can come help #Person1#.</s>,<pad> #Person1# thinks it convenient to work in an office. #Person2# will help #Person1# to help #Person1# find a job. #Person1# turns to #Person2#'s assistant to help #Person1# find a job.</s>,1.993768,2.070232,0.076465
