# | NLP | PPO | DialogSum | Less-Toxic Summarize |

## NLP (Natural Language Processing) with PEFT (Parameter Efficient Fine-Tuning) and LoRA (Low-Rank Adaptation) for Less-Toxic Summarization

# <b>1 <span style='color:#78D118'>|</span> Introduction</b>

This project explores the capabilities of Large Language Models (LLMs), particularly emphasizing the utilization of Parameter Efficient Fine-Tuning (PEFT) to create dialogue summaries with reduced toxicity. We achieve this by employing the FLAN-T5 model alongside Meta AI's hate speech reward model.

Our primary objective is to improve the quality of dialogue summaries while minimizing toxicity. To attain this, we apply Proximal Policy Optimization (PPO) for fine-tuning, aiming to mitigate the model's toxic output. Furthermore, we will showcase the advantages of Parameter Efficient Fine-Tuning (PEFT), illustrating that its benefits surpass any potential minor performance trade-offs.

 - NOTE: This is an example, and we are not using the entire dataset.
 
## Objectives:
 - Train an LLM to produce less toxic dialogue summaries.
 
 
## The DialogSum Dataset:
[DialogSum](https://huggingface.co/datasets/knkarthick/dialogsum) is a large-scale dialogue summarization dataset consisting of 13,460 dialogues (plus 100 holdout dialogues for topic generation) with corresponding manually labeled summaries and topics.

## Project Workflow:

- **Setup**: Import necessary libraries and define project parameters.
- **Dataset Exploration**: Discovering DialogSum Dataset.
- **Test Model Zero Shot Inferencing**: Initially, test the FLAN-T5 model for zero-shot inferencing on dialogue summarization tasks to establish a baseline performance.
- **Dataset Preprocess Dialog and Summary**: Preprocess each dialogue and its corresponding summary from the dataset to prepare for training.
-  **Perform Parameter Efficient Fine-Tuning (PEFT)**: Implement Parameter Efficient Fine-Tuning (PEFT), a more efficient fine-tuning approach that can significantly reduce training time while maintaining performance.
-  **Evaluation**:
    - Perform human evaluation to gauge the model's output in terms of readability and coherence. This can involve annotators ranking generated summaries for quality.
    - Utilize ROUGE metrics to assess the quality of the generated summaries. ROUGE measures the overlap between generated summaries and human-written references.
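
To make the ROUGE idea concrete, here is a minimal sketch of a ROUGE-1 F1 score computed from unigram overlap. The function name `rouge1_f1`, the whitespace tokenization, and the two example sentences are assumptions made for illustration; the `rouge_score` and `evaluate` packages apply their own tokenization and options, so their numbers may differ.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between a reference and a candidate summary (illustrative only)."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())

    # Clipped overlap: each word counts at most as often as it appears in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical reference and generated summaries.
reference = "#Person1# tells Tommy that he did not like the movie"
candidate = "#Person1# did not like the movie"
score = rouge1_f1(reference, candidate)
print(f"ROUGE-1 F1: {score:.2f}")
```

Every candidate word appears in the reference (precision 1.0), while the candidate covers 6 of the 10 reference words (recall 0.6), so the F1 score sits between the two.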

# <b>2<span style='color:#78D118'>|</span> Setup</b>
## <b>2.1 <span style='color:#78D118'>|</span> Imports</b>

In [2]:
%pip install --upgrade pip
%pip install --disable-pip-version-check \
    torch==1.13.1 \
    torchdata==0.5.1 --quiet

%pip install \
    transformers==4.27.2 \
    datasets==2.11.0 \
    evaluate==0.4.0 \
    rouge_score==0.1.2 \
    peft==0.3.0 --quiet

# Installing the Reinforcement Learning library directly from github.
%pip install git+https://github.com/lvwerra/trl.git@25fa1bd    


Note: you may need to restart the kernel to use updated packages.

In [3]:
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification, AutoModelForSeq2SeqLM, GenerationConfig
from datasets import load_dataset
from peft import PeftModel, PeftConfig, LoraConfig, TaskType

# trl: Transformer Reinforcement Learning library
from trl import PPOTrainer, PPOConfig, AutoModelForSeq2SeqLMWithValueHead
from trl import create_reference_model
from trl.core import LengthSampler

import torch
import evaluate

import numpy as np
import pandas as pd

# tqdm library makes the loops show a smart progress meter.
from tqdm import tqdm
tqdm.pandas()

In [4]:
model_name="google/flan-t5-base"
huggingface_dataset_name = "knkarthick/dialogsum"

dataset_original = load_dataset(huggingface_dataset_name)

# Display the splits and their sizes.
dataset_original


DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 12460
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 1500
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 500
    })
})

## <b>2.2 <span style='color:#78D118'>|</span> Methods</b>

In [8]:
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"\ntrainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"

# <b>3<span style='color:#78D118'>|</span> Tokenize the Data</b>


The next step involves dataset preprocessing. We'll select a subset of the data, filter dialogues to a specific length to ensure readability while maintaining meaningful content, and then integrate each dialogue with an instruction before tokenizing the prompts. The resulting token IDs will be stored in the `input_ids` field, while the decoded prompts will be saved in the `query` field.

To streamline this process, it's advisable to create a function called `build_dataset`. This function can be defined as follows:

In [5]:
def build_dataset(model_name,
                  dataset_name,
                  input_min_text_length, 
                  input_max_text_length):

    """
    Preprocess the dataset and split it into train and test parts.

    Parameters:
    - model_name (str): Tokenizer model name.
    - dataset_name (str): Name of the dataset to load.
    - input_min_text_length (int): Minimum length of the dialogues.
    - input_max_text_length (int): Maximum length of the dialogues.
        
    Returns:
    - dataset_splits (datasets.dataset_dict.DatasetDict): Preprocessed dataset containing train and test parts.
    """
    
    # load dataset (only "train" part will be enough for this lab).
    dataset = load_dataset(dataset_name, split="train")
    
    # Filter the dialogues of length between input_min_text_length and input_max_text_length characters.
    dataset = dataset.filter(lambda x: len(x["dialogue"]) > input_min_text_length and len(x["dialogue"]) <= input_max_text_length, batched=False)

    # Prepare tokenizer. Setting device_map="auto" allows to switch between GPU and CPU automatically.
    tokenizer = AutoTokenizer.from_pretrained(model_name, device_map="auto")
    
    def tokenize(sample):
        
        # Wrap each dialogue with the instruction.
        prompt = f"""
Summarize the following conversation.

{sample["dialogue"]}

Summary:
"""
        sample["input_ids"] = tokenizer.encode(prompt)
        
        # This must be called "query", which is a requirement of our PPO library.
        sample["query"] = tokenizer.decode(sample["input_ids"])
        return sample

    # Tokenize each dialogue.
    dataset = dataset.map(tokenize, batched=False)
    dataset.set_format(type="torch")
    
    # Split the dataset into train and test parts.
    dataset_splits = dataset.train_test_split(test_size=0.2, shuffle=False, seed=42)

    return dataset_splits

dataset = build_dataset(model_name=model_name,
                        dataset_name=huggingface_dataset_name,
                        input_min_text_length=200, 
                        input_max_text_length=1000)

print(dataset)


DatasetDict({

    train: Dataset({

        features: ['id', 'dialogue', 'summary', 'topic', 'input_ids', 'query'],

        num_rows: 8017

    })

    test: Dataset({

        features: ['id', 'dialogue', 'summary', 'topic', 'input_ids', 'query'],

        num_rows: 2005

    })

})
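
The split sizes above can be checked by hand. The sketch below assumes that the test split size is rounded up, which matches the printed `num_rows`; the variable names are illustrative only.

```python
import math

# Rows that survived the length filter: 8017 train + 2005 test in the output above.
filtered_rows = 8017 + 2005
test_size = 0.2

# Assumption: the test split is rounded up and the train split takes the remainder.
test_rows = math.ceil(test_size * filtered_rows)
train_rows = filtered_rows - test_rows

print(f"train: {train_rows}, test: {test_rows}")
```

Because `shuffle=False` is passed to `train_test_split`, the rows keep their original order, so repeated runs produce the same split.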


# <b>4 <span style='color:#78D118'>|</span>  FLAN-T5 Model Fine-Tuned with Summarization Instruction</b>

## <b>4.1 <span style='color:#78D118'>|</span>  Enhancing FLAN-T5 Model Fine-Tuned with Summarization Adapter</b>

We are enhancing the original FLAN-T5 model by adding a summarization adapter. This adapter is designed to improve the model's performance in summarization tasks.

We begin by configuring the adapter using the following parameters:
- `r`: Rank, set to 32.
- `lora_alpha`: LoRA alpha value, set to 32.
- `target_modules`: The target modules, set to `["q", "v"]`.
- `lora_dropout`: Dropout rate for LoRA, set to 0.05.
- `bias`: Bias configuration, set to `"none"`.
- `task_type`: The task type, set to `SEQ_2_SEQ_LM`, which is suitable for FLAN-T5.

Next, we load the pre-trained FLAN-T5 model with `AutoModelForSeq2SeqLM`, passing the model name and the data type (`torch_dtype`).

We then create a `PeftModel` on top of the loaded model from the adapter checkpoint, passing the LoRA configuration, the torch data type, and the device mapping, and marking the model as trainable.

In [9]:
lora_config = LoraConfig(
    r=32, # Rank
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM # FLAN-T5
)

model = AutoModelForSeq2SeqLM.from_pretrained(model_name, 
                                              torch_dtype=torch.bfloat16)

peft_model = PeftModel.from_pretrained(model, 
                                       './peft-dialogue-summary-checkpoint-from-s3/', 
                                       lora_config=lora_config,
                                       torch_dtype=torch.bfloat16, 
                                       device_map="auto",                                       
                                       is_trainable=True)

print(f'PEFT model parameters to be updated:\n{print_number_of_trainable_model_parameters(peft_model)}\n')



PEFT model parameters to be updated:



trainable model parameters: 3538944

all model parameters: 251116800

percentage of trainable model parameters: 1.41%
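
The trainable parameter count reported above can be reproduced with a short calculation. This is a sketch under stated assumptions: FLAN-T5 base is taken to have 12 encoder layers, 12 decoder layers, and 768-dimensional `q` and `v` projections; these architecture figures are not printed in this notebook, and the variable names are illustrative.

```python
d_model = 768          # assumption: input and output size of each "q"/"v" projection
rank = 32              # `r` in LoraConfig
encoder_layers = 12    # assumption: FLAN-T5 base encoder depth
decoder_layers = 12    # assumption: FLAN-T5 base decoder depth

# LoRA adds two low-rank matrices per adapted projection:
# A with shape (rank, d_model) and B with shape (d_model, rank).
params_per_projection = rank * d_model + d_model * rank

# "q" and "v" appear once per encoder layer (self-attention) and twice per
# decoder layer (self-attention and cross-attention).
adapted_projections = 2 * encoder_layers + 2 * 2 * decoder_layers

lora_params = params_per_projection * adapted_projections
all_params = 251116800  # "all model parameters" from the output above

print(f"LoRA parameters: {lora_params}")
print(f"share of all parameters: {100 * lora_params / all_params:.2f}%")
```

Under these assumptions the result agrees with the printed count of trainable parameters, which suggests that only the LoRA matrices are trainable at this stage.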




## <b>4.2 <span style='color:#78D118'>|</span>  Enhancing LLM Summarization with Reinforcement Learning with PPO</b>

We now prepare to fine-tune the Language Model (LLM) using Reinforcement Learning (RL). RL is explained in more detail in the next section; for now, the focus is on setting up the Proximal Policy Optimization (PPO) model.

This PPO model will receive the instruction-fine-tuned PEFT model as input and will be utilized to optimize the RL policy in accordance with the reward model.

In [10]:
ppo_model = AutoModelForSeq2SeqLMWithValueHead.from_pretrained(peft_model,                                                               
                                                               torch_dtype=torch.bfloat16,
                                                               is_trainable=True)

print(f'PPO model parameters to be updated (ValueHead + 769 params):\n{print_number_of_trainable_model_parameters(ppo_model)}\n')
print(ppo_model.v_head)

PPO model parameters to be updated (ValueHead + 769 params):



trainable model parameters: 3539713

all model parameters: 251117569

percentage of trainable model parameters: 1.41%



ValueHead(

  (dropout): Dropout(p=0.1, inplace=False)

  (summary): Linear(in_features=768, out_features=1, bias=True)

  (flatten): Flatten(start_dim=1, end_dim=-1)

)


During PPO, only a small subset of the parameters is updated: the LoRA adapter weights and the newly added `ValueHead`, which accounts for the 769 additional trainable parameters reported above. You can find more detailed information about this class of models in the [documentation](https://huggingface.co/docs/trl/main/en/models#trl.create_reference_model). The number of trainable parameters in the `ValueHead` is $(n+1) \cdot m$, where $n$ is the number of input units (here $n=768$) and $m$ is the number of output units (here $m=1$). The $+1$ term accounts for the bias.
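
The $(n+1) \cdot m$ formula and the totals printed above can be verified with a few lines of arithmetic. The variable names below are illustrative; the numbers are taken from the cell outputs in this notebook.

```python
n_inputs = 768    # in_features of the ValueHead linear layer
n_outputs = 1     # out_features of the ValueHead linear layer

# One weight per (input, output) pair plus one bias per output.
value_head_params = (n_inputs + 1) * n_outputs

peft_trainable = 3538944   # trainable parameters of the PEFT model
peft_total = 251116800     # all parameters of the PEFT model

print(f"ValueHead parameters: {value_head_params}")
print(f"trainable after adding ValueHead: {peft_trainable + value_head_params}")
print(f"total after adding ValueHead: {peft_total + value_head_params}")
```

Both sums match the counts reported for `ppo_model`.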

Now, let's create a frozen copy of the PPO model, which will serve as a reference model. This reference model will represent the Language Model (LLM) before detoxification. Importantly, none of the parameters of the reference model will be updated during PPO training. This is by design.

In [11]:
ref_model = create_reference_model(ppo_model)

print(f'Reference model parameters to be updated:\n{print_number_of_trainable_model_parameters(ref_model)}\n')

Reference model parameters to be updated:



trainable model parameters: 0

all model parameters: 251117569

percentage of trainable model parameters: 0.00%




# <b>5<span style='color:#78D118'>|</span> Building a Reward Model for Reinforcement Learning</b>

**Reinforcement Learning (RL)** stands as a pivotal branch of machine learning wherein agents make decisions within an environment to maximize their cumulative rewards. The behavior of these agents is governed by a decision-making **policy**, and the fundamental objective of RL is for the agent to acquire an optimal or near-optimal policy that maximizes the **reward function**.

Previously, the original policy was rooted in the instruct PEFT model – essentially, the Language Model (LLM) before undergoing detoxification. While one approach involved soliciting human labelers to provide feedback on the toxicity of the model's outputs, this process can become prohibitively costly when applied throughout the entire fine-tuning phase. A pragmatic solution to circumvent this expense is to implement a reward model that encourages the agent to produce detoxified dialogue summaries.

A sensible approach here is to perform sentiment analysis on the model's outputs, classifying them into two categories: `nothate` and `hate`. Higher rewards are assigned when the likelihood of classifying an output as `nothate` is greater.

In this context, we will employ [Meta AI's RoBERTa-based hate speech model](https://huggingface.co/facebook/roberta-hate-speech-dynabench-r4-target) as our reward model. This model generates **logits** and subsequently predicts probabilities for two classes: `nothate` and `hate`. Positive rewards are derived from the logits associated with the `nothate` class. The model will undergo further fine-tuning using Proximal Policy Optimization (PPO) with these reward values.

## <b>5.1<span style='color:#78D118'>|</span> Load Meta AI's RoBERTa-based hate speech model</b>

In [12]:
toxicity_model_name = "facebook/roberta-hate-speech-dynabench-r4-target"
toxicity_tokenizer = AutoTokenizer.from_pretrained(toxicity_model_name, device_map="auto")
toxicity_model = AutoModelForSequenceClassification.from_pretrained(toxicity_model_name, device_map="auto")
print(toxicity_model.config.id2label)

The model weights are not tied. Please use the `tie_weights` method before using the `infer_auto_device` function.


{0: 'nothate', 1: 'hate'}


Take some non-toxic text, tokenize it, and pass it to the model. Print the output logits, probabilities, and the corresponding reward that will be used for fine-tuning.

In [13]:
non_toxic_text = "#Person 1# tells Tommy that he didn't like the movie."

toxicity_input_ids = toxicity_tokenizer(non_toxic_text, return_tensors="pt").input_ids

logits = toxicity_model(input_ids=toxicity_input_ids).logits
print(f'logits [not hate, hate]: {logits.tolist()[0]}')

# Print the probabilities for [not hate, hate]
probabilities = logits.softmax(dim=-1).tolist()[0]
print(f'probabilities [not hate, hate]: {probabilities}')

# get the logits for "not hate" - this is the reward!
not_hate_index = 0
nothate_reward = (logits[:, not_hate_index]).tolist()
print(f'reward (high): {nothate_reward}')

logits [not hate, hate]: [3.114100694656372, -2.4896175861358643]

probabilities [not hate, hate]: [0.9963293671607971, 0.003670616541057825]

reward (high): [3.114100694656372]
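
To show how the probabilities relate to the logits, the sketch below recomputes the softmax by hand from the logits printed above. The `softmax` helper is written for illustration and is not part of the notebook's pipeline.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Logits [not hate, hate] copied from the output above.
non_toxic_logits = [3.114100694656372, -2.4896175861358643]
probabilities = softmax(non_toxic_logits)

# The reward is the raw "not hate" logit, not the probability.
reward = non_toxic_logits[0]

print(f"probabilities [not hate, hate]: {probabilities}")
print(f"reward: {reward}")
```

The reward is the raw logit, which is unbounded, whereas the `nothate` probability saturates near 1 for clearly non-toxic text.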


Let's show a toxic comment.  This will have a low reward because it is more toxic.

In [14]:
toxic_text = "#Person 1# tells Tommy that the movie was terrible, dumb and stupid."

toxicity_input_ids = toxicity_tokenizer(toxic_text, return_tensors="pt").input_ids

logits = toxicity_model(toxicity_input_ids).logits
print(f'logits [not hate, hate]: {logits.tolist()[0]}')

# Print the probabilities for [not hate, hate]
probabilities = logits.softmax(dim=-1).tolist()[0]
print(f'probabilities [not hate, hate]: {probabilities}')

# Get the logits for "not hate" - this is the reward!
nothate_reward = (logits[:, not_hate_index]).tolist() 
print(f'reward (low): {nothate_reward}')

logits [not hate, hate]: [-0.6921188831329346, 0.3722729980945587]

probabilities [not hate, hate]: [0.25647106766700745, 0.7435289621353149]

reward (low): [-0.6921188831329346]


## <b>5.2<span style='color:#78D118'>|</span> Setup Pipeline toxicity reward model</b>

Set up a Hugging Face inference pipeline to simplify the code for the toxicity reward model:

In [15]:
device = 0 if torch.cuda.is_available() else "cpu"

sentiment_pipe = pipeline("sentiment-analysis", 
                          model=toxicity_model_name, 
                          device=device)
reward_logits_kwargs = {
    "top_k": None, # Return all scores.
    "function_to_apply": "none", # Set to "none" to retrieve raw logits.
    "batch_size": 16
}

reward_probabilities_kwargs = {
    "top_k": None, # Return all scores.
    "function_to_apply": "softmax", # Set to "softmax" to apply softmax and retrieve probabilities.
    "batch_size": 16
}

print("Reward model output:")
print("For non-toxic text")
print(sentiment_pipe(non_toxic_text, **reward_logits_kwargs))
print(sentiment_pipe(non_toxic_text, **reward_probabilities_kwargs))
print("For toxic text")
print(sentiment_pipe(toxic_text, **reward_logits_kwargs))
print(sentiment_pipe(toxic_text, **reward_probabilities_kwargs))

Reward model output:

For non-toxic text

[{'label': 'nothate', 'score': 3.114100694656372}, {'label': 'hate', 'score': -2.4896175861358643}]

[{'label': 'nothate', 'score': 0.9963293671607971}, {'label': 'hate', 'score': 0.003670616541057825}]

For toxic text

[{'label': 'hate', 'score': 0.3722729980945587}, {'label': 'nothate', 'score': -0.6921188831329346}]

[{'label': 'hate', 'score': 0.7435289621353149}, {'label': 'nothate', 'score': 0.25647106766700745}]


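The outputs contain the logits and the probabilities for both the `nothate` (positive) and `hate` (negative) classes. In the outputs above, the entries appear in descending order of score, so `nothate` is not always the first entry. PPO uses only the logit of the `nothate` class as the positive reward signal to help detoxify the LLM outputs.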

In [16]:
print(sentiment_pipe(non_toxic_text, **reward_logits_kwargs))
print(sentiment_pipe(non_toxic_text, **reward_probabilities_kwargs))

[{'label': 'nothate', 'score': 3.114100694656372}, {'label': 'hate', 'score': -2.4896175861358643}]

[{'label': 'nothate', 'score': 0.9963293671607971}, {'label': 'hate', 'score': 0.003670616541057825}]


In [17]:
print(sentiment_pipe(toxic_text, **reward_logits_kwargs))
print(sentiment_pipe(toxic_text, **reward_probabilities_kwargs))

[{'label': 'hate', 'score': 0.3722729980945587}, {'label': 'nothate', 'score': -0.6921188831329346}]

[{'label': 'hate', 'score': 0.7435289621353149}, {'label': 'nothate', 'score': 0.25647106766700745}]
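
Because the order of the entries differs between the two outputs above, selecting the reward by label is safer than selecting it by position. The helper below is a hypothetical sketch (the name `nothate_score` does not exist in any library) applied to the logit outputs copied from this section.

```python
def nothate_score(pipeline_output, label="nothate"):
    """Return the score of the entry whose label matches, regardless of its position."""
    for item in pipeline_output:
        if item["label"] == label:
            return item["score"]
    raise KeyError(f"label {label!r} not found in pipeline output")


# Logit outputs copied from the cells above.
non_toxic_output = [{'label': 'nothate', 'score': 3.114100694656372},
                    {'label': 'hate', 'score': -2.4896175861358643}]
toxic_output = [{'label': 'hate', 'score': 0.3722729980945587},
                {'label': 'nothate', 'score': -0.6921188831329346}]

print(f"reward (high): {nothate_score(non_toxic_output)}")
print(f"reward (low): {nothate_score(toxic_output)}")

# A positional lookup would return the `hate` logit for the toxic text.
print(f"first entry of toxic output: {toxic_output[0]}")
```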


## <b>5.3<span style='color:#78D118'>|</span> Evaluate Toxicity</b>

To assess the model's performance both before and after the fine-tuning and detoxification processes, it is essential to establish the toxicity evaluation metric. The toxicity score is represented as a decimal value ranging from 0 to 1, where 1 signifies the highest degree of toxicity.

In [18]:
toxicity_evaluator = evaluate.load("toxicity", 
                                    toxicity_model_name,
                                    module_type="measurement",
                                    toxic_label="hate")

Try to calculate toxicity for the same sentences as in section 5.1. It's no surprise that the toxicity scores are the probabilities of the `hate` class returned directly from the reward model.

In [19]:
toxicity_score = toxicity_evaluator.compute(predictions=[
    non_toxic_text
])

print("Toxicity score for non-toxic text:")
print(toxicity_score["toxicity"])

toxicity_score = toxicity_evaluator.compute(predictions=[
    toxic_text
])

print("\nToxicity score for toxic text:")
print(toxicity_score["toxicity"])

Toxicity score for non-toxic text:

[0.003670616541057825]



Toxicity score for toxic text:

[0.7435289621353149]


This evaluator can be effectively employed to calculate the toxicity levels of the dialogues. 

To accomplish this, you will need to provide several essential components, including the test dataset (`dataset["test"]`), the tokenizer used in the aforementioned section, the previously frozen PEFT model, and the toxicity evaluator itself. For a streamlined and organized approach, it is recommended to encapsulate these necessary procedures within a dedicated function named `evaluate_toxicity`.

In [20]:
def evaluate_toxicity(model, 
                      toxicity_evaluator, 
                      tokenizer, 
                      dataset, 
                      num_samples):
    
    """
    Generate summaries for dataset samples and compute the mean and standard deviation of their toxicity scores.

    Parameters:
    - model (trl model): Model to be evaluated.
    - toxicity_evaluator (evaluate_modules toxicity metrics): Toxicity evaluator.
    - tokenizer (transformers tokenizer): Tokenizer to be used.
    - dataset (dataset): Input dataset for the evaluation.
    - num_samples (int): Maximum number of samples for the evaluation.
        
    Returns:
    tuple: A tuple containing two numpy.float64 values:
    - mean (numpy.float64): Mean of the samples toxicity.
    - std (numpy.float64): Standard deviation of the samples toxicity.
    """

    max_new_tokens=100

    toxicities = []
    input_texts = []
    for i, sample in tqdm(enumerate(dataset)):
        input_text = sample["query"]

        # Note: with `>` the loop scores samples 0..num_samples, i.e. num_samples + 1 in total.
        if i > num_samples:
            break
            
        input_ids = tokenizer(input_text, return_tensors="pt", padding=True).input_ids
        
        generation_config = GenerationConfig(max_new_tokens=max_new_tokens,
                                             top_k=0.0,
                                             top_p=1.0,
                                             do_sample=True)

        response_token_ids = model.generate(input_ids=input_ids,
                                            generation_config=generation_config)
        
        generated_text = tokenizer.decode(response_token_ids[0], skip_special_tokens=True)
        
        toxicity_score = toxicity_evaluator.compute(predictions=[(input_text + " " + generated_text)])

        toxicities.extend(toxicity_score["toxicity"])

    # Compute mean & std using np.
    mean = np.mean(toxicities)
    std = np.std(toxicities)
        
    return mean, std

And now perform the calculation of the model toxicity before fine-tuning/detoxification:

In [21]:
tokenizer = AutoTokenizer.from_pretrained(model_name, device_map="auto")

mean_before_detoxification, std_before_detoxification = evaluate_toxicity(model=ref_model, 
                                                                          toxicity_evaluator=toxicity_evaluator, 
                                                                          tokenizer=tokenizer, 
                                                                          dataset=dataset["test"], 
                                                                          num_samples=10)

print(f'toxicity [mean, std] before detox: [{mean_before_detoxification}, {std_before_detoxification}]')

11it [00:24,  2.25s/it]

toxicity [mean, std] before detox: [0.03872112058293582, 0.03225256283112844]





# <b>6 <span style='color:#78D118'>|</span> Perform Fine-Tuning to Detoxify the Summaries</b>

Optimize an RL policy against the reward model using Proximal Policy Optimization (PPO).

## <b>6.1 <span style='color:#78D118'>|</span> Initialize `PPOTrainer`</b>

For the `PPOTrainer` initialization, you will need a collator. Here it will be a function transforming the dictionaries in a particular way. You can define and test it:

In [22]:
def collator(data):
    return dict((key, [d[key] for d in data]) for key in data[0])

test_data = [{"key1": "value1", "key2": "value2", "key3": "value3"}]
print(f'Collator input: {test_data}')
print(f'Collator output: {collator(test_data)}')

Collator input: [{'key1': 'value1', 'key2': 'value2', 'key3': 'value3'}]

Collator output: {'key1': ['value1'], 'key2': ['value2'], 'key3': ['value3']}


Configure the essential parameters. Load the `ppo_model` and the corresponding tokenizer. 

Additionally, load a static version of the model, referred to as `ref_model`. 

The purpose of having two models is twofold: the first model, `ppo_model`, undergoes optimization, while the second model, `ref_model`, functions as a reference point to compute the KL-divergence from the initial state. 

This serves as an additional reward signal in the PPO training process, ensuring that the optimized model does not stray too far from the original Language Model (LLM).
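
As a rough illustration of how the KL term can act as a reward signal, the sketch below subtracts a per-token penalty proportional to the log-probability difference between the optimized and reference models, and adds the reward-model score to the last token. The function name `kl_penalized_rewards`, the coefficient `beta=0.2`, and the toy log-probabilities are assumptions for illustration; this is not a copy of the `trl` implementation.

```python
def kl_penalized_rewards(policy_logprobs, ref_logprobs, score, beta=0.2):
    """Per-token rewards: a KL penalty on every token plus the score on the last token."""
    rewards = [
        -beta * (policy_lp - ref_lp)
        for policy_lp, ref_lp in zip(policy_logprobs, ref_logprobs)
    ]
    rewards[-1] += score
    return rewards


# Toy log-probabilities of three generated tokens (hypothetical values).
policy_logprobs = [-1.0, -0.5, -2.0]   # under the model being optimized
ref_logprobs = [-1.5, -0.5, -1.0]      # under the frozen reference model

rewards = kl_penalized_rewards(policy_logprobs, ref_logprobs, score=3.0)
print(rewards)
```

Tokens that the optimized model makes more likely than the reference model are penalized, and tokens it makes less likely receive a small bonus, which keeps the policy close to its starting point while it chases the toxicity reward.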

In [23]:
learning_rate=1.41e-5
max_ppo_epochs=1
mini_batch_size=4
batch_size=16

config = PPOConfig(
    model_name=model_name,    
    learning_rate=learning_rate,
    ppo_epochs=max_ppo_epochs,
    mini_batch_size=mini_batch_size,
    batch_size=batch_size
)

ppo_trainer = PPOTrainer(config=config, 
                         model=ppo_model, 
                         ref_model=ref_model, 
                         tokenizer=tokenizer, 
                         dataset=dataset["train"], 
                         data_collator=collator)

## <b>6.2 <span style='color:#78D118'>|</span> Fine-Tune the Model</b>

The fine-tuning loop comprises the following key steps:

1. Retrieve query responses from the policy Language Model (PEFT model).
2. Determine the sentiments associated with the queries and responses using the hate speech RoBERTa model.
3. Optimize the policy using Proximal Policy Optimization (PPO) with the triplet of inputs, which includes the query, response, and the associated reward.

You can confirm that the operation is successfully running by monitoring the following metrics:

- `objective/kl`: Minimization of the Kullback-Leibler (KL) divergence.
- `ppo/returns/mean`: Maximization of the mean returns.
- `ppo/policy/advantages_mean`: Maximization of the mean advantages.

These metrics serve as indicators of the training process's progress and the achievement of specific objectives within the fine-tuning loop.

In [24]:
output_min_length = 100
output_max_length = 400
output_length_sampler = LengthSampler(output_min_length, output_max_length)

generation_kwargs = {
    "min_length": 5,
    "top_k": 0.0,
    "top_p": 1.0,
    "do_sample": True
}

reward_kwargs = {
    "top_k": None, # Return all scores.
    "function_to_apply": "none", # You want the raw logits without softmax.
    "batch_size": 16
}

max_ppo_steps = 10

for step, batch in tqdm(enumerate(ppo_trainer.dataloader)):
    # Break when you reach max_steps.
    if step >= max_ppo_steps:
        break   

    prompt_tensors = batch["input_ids"]

    # Get response from FLAN-T5/PEFT LLM.
    summary_tensors = []

    for prompt_tensor in prompt_tensors:
        max_new_tokens = output_length_sampler()        
            
        generation_kwargs["max_new_tokens"] = max_new_tokens
        summary = ppo_trainer.generate(prompt_tensor, **generation_kwargs)
        
        summary_tensors.append(summary.squeeze()[-max_new_tokens:])
        
    # This needs to be called "response".
    batch["response"] = [tokenizer.decode(r.squeeze()) for r in summary_tensors]

    # Compute reward outputs.
    query_response_pairs = [q + r for q, r in zip(batch["query"], batch["response"])]    
    rewards = sentiment_pipe(query_response_pairs, **reward_kwargs)

    # Select the `nothate` score by label: the pipeline output is ordered by score,
    # so a positional index would pick `hate` whenever it scores higher.
    reward_tensors = [torch.tensor(next(item["score"] for item in reward if item["label"] == "nothate"))
                      for reward in rewards]

    # Run PPO step.
    stats = ppo_trainer.step(prompt_tensors, summary_tensors, reward_tensors)
    ppo_trainer.log_stats(stats, batch, reward_tensors)
    
    print(f'objective/kl: {stats["objective/kl"]}')
    print(f'ppo/returns/mean: {stats["ppo/returns/mean"]}')
    print(f'ppo/policy/advantages_mean: {stats["ppo/policy/advantages_mean"]}')
    print('-'.join('' for x in range(100)))

0it [00:00, ?it/s]You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.

1it [01:43, 103.30s/it]

objective/kl: 29.314075469970703

ppo/returns/mean: -0.4844372570514679

ppo/policy/advantages_mean: -4.632137340365716e-09

---------------------------------------------------------------------------------------------------


2it [03:21, 100.19s/it]

objective/kl: 35.85022735595703

ppo/returns/mean: -0.8316176533699036

ppo/policy/advantages_mean: -9.384150345681519e-09

---------------------------------------------------------------------------------------------------


3it [04:53, 96.61s/it] 

objective/kl: 31.081266403198242

ppo/returns/mean: -0.6446913480758667

ppo/policy/advantages_mean: -8.428234110624544e-09

---------------------------------------------------------------------------------------------------


4it [06:16, 91.12s/it]

objective/kl: 22.59747886657715

ppo/returns/mean: -0.25419875979423523

ppo/policy/advantages_mean: 2.2152294221200464e-08

---------------------------------------------------------------------------------------------------


5it [07:47, 91.11s/it]

objective/kl: 27.7932186126709

ppo/returns/mean: -0.32479071617126465

ppo/policy/advantages_mean: -2.324981540624549e-09

---------------------------------------------------------------------------------------------------


6it [09:34, 96.66s/it]

objective/kl: 33.241607666015625

ppo/returns/mean: -0.6701866388320923

ppo/policy/advantages_mean: -9.555419566709134e-09

---------------------------------------------------------------------------------------------------


7it [11:06, 95.12s/it]

objective/kl: 27.689035415649414

ppo/returns/mean: -0.46731656789779663

ppo/policy/advantages_mean: -6.645688443995823e-10

---------------------------------------------------------------------------------------------------


8it [12:33, 92.56s/it]

objective/kl: 28.230976104736328

ppo/returns/mean: -0.4346347153186798

ppo/policy/advantages_mean: 6.4787522013887155e-09

---------------------------------------------------------------------------------------------------


9it [14:07, 92.81s/it]

objective/kl: 26.905288696289062

ppo/returns/mean: -0.46542078256607056

ppo/policy/advantages_mean: -7.103521770801535e-09

---------------------------------------------------------------------------------------------------


10it [15:42, 94.29s/it]

objective/kl: 27.23663902282715

ppo/returns/mean: -0.332530677318573

ppo/policy/advantages_mean: 1.497644319670144e-08

---------------------------------------------------------------------------------------------------





## <b>6.3 <span style='color:#78D118'>|</span> Evaluate the Model Quantitatively</b>


Load the PPO/PEFT model from the saved checkpoint and use the test split of the dataset to measure the toxicity score of the RL-fine-tuned model.

In [25]:
mean_after_detoxification, std_after_detoxification = evaluate_toxicity(model=ppo_model, 
                                                                        toxicity_evaluator=toxicity_evaluator, 
                                                                        tokenizer=tokenizer, 
                                                                        dataset=dataset["test"], 
                                                                        num_samples=10)
print(f'toxicity [mean, std] after detox: [{mean_after_detoxification}, {std_after_detoxification}]')

11it [00:21,  1.95s/it]

toxicity [mean, std] after detox: [0.04065660611641678, 0.05703941539389816]





Next, compare the toxicity scores of the reference model (before detoxification) and the fine-tuned model (after detoxification).

In [26]:
mean_improvement = (mean_before_detoxification - mean_after_detoxification) / mean_before_detoxification
std_improvement = (std_before_detoxification - std_after_detoxification) / std_before_detoxification

print(f'Percentage improvement of toxicity score after detoxification:')
print(f'mean: {mean_improvement*100:.2f}%')
print(f'std: {std_improvement*100:.2f}%')

Percentage improvement of toxicity score after detoxification:

mean: -5.00%

std: -76.85%
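The percentages above are relative changes, `(before - after) / before`, so a positive value means the toxicity score dropped and a negative value means it rose. With only 10 PPO steps and 10 evaluation samples, the estimate is likely noisy. The sketch below works through the formula; the function name `relative_improvement` and the input numbers are illustrative assumptions, not values from this run.

```python
def relative_improvement(before: float, after: float) -> float:
    """Relative reduction of a score: positive means `after` is lower."""
    if before == 0:
        raise ValueError("relative improvement is undefined when before == 0")
    return (before - after) / before


# Hypothetical scores: toxicity rises from 0.040 to 0.042.
print(f"{relative_improvement(0.040, 0.042) * 100:.2f}%")  # -5.00%

# Hypothetical scores: toxicity falls from 0.040 to 0.030.
print(f"{relative_improvement(0.040, 0.030) * 100:.2f}%")  # 25.00%
```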


## <b>6.4 <span style='color:#78D118'>|</span> Evaluate the Model Qualitatively</b>

Inspect a few examples from the test dataset to compare the original `ref_model` with the fine-tuned, detoxified `ppo_model`. Each response is scored with `sentiment_pipe`, the same reward pipeline used during PPO training.

In [27]:
batch_size = 20
compare_results = {}

df_batch = dataset["test"][0:batch_size]

compare_results["query"] = df_batch["query"]
prompt_tensors = df_batch["input_ids"]

summary_tensors_ref = []
summary_tensors = []

# Get response from ppo and base model.
for i in tqdm(range(batch_size)):
    gen_len = output_length_sampler()
    generation_kwargs["max_new_tokens"] = gen_len
    
    summary = ref_model.generate(
        input_ids=torch.as_tensor(prompt_tensors[i]).unsqueeze(dim=0).to(device), 
        **generation_kwargs
    ).squeeze()[-gen_len:]
    summary_tensors_ref.append(summary)

    summary = ppo_model.generate(
        input_ids=torch.as_tensor(prompt_tensors[i]).unsqueeze(dim=0).to(device), 
        **generation_kwargs
    ).squeeze()[-gen_len:]
    summary_tensors.append(summary)

# Decode responses.
compare_results["response_before"] = [tokenizer.decode(summary_tensors_ref[i]) for i in range(batch_size)]
compare_results["response_after"] = [tokenizer.decode(summary_tensors[i]) for i in range(batch_size)]

# Sentiment analysis of query/response pairs before/after.
texts_before = [d + s for d, s in zip(compare_results["query"], compare_results["response_before"])]
rewards_before = sentiment_pipe(texts_before, **reward_kwargs)
compare_results["reward_before"] = [reward[not_hate_index]["score"] for reward in rewards_before]

texts_after = [d + s for d, s in zip(compare_results["query"], compare_results["response_after"])]
rewards_after = sentiment_pipe(texts_after, **reward_kwargs)
compare_results["reward_after"] = [reward[not_hate_index]["score"] for reward in rewards_after]

100%|██████████| 20/20 [01:24<00:00,  4.24s/it]


Store the results in a DataFrame and review them, sorted by the change in reward.

In [28]:
pd.set_option('display.max_colwidth', 500)
df_compare_results = pd.DataFrame(compare_results)
df_compare_results["reward_diff"] = df_compare_results['reward_after'] - df_compare_results['reward_before']
df_compare_results_sorted = df_compare_results.sort_values(by=['reward_diff'], ascending=False).reset_index(drop=True)
df_compare_results_sorted

| | query | response_before | response_after | reward_before | reward_after | reward_diff |
|---|---|---|---|---|---|---|
| 0 | Summarize the following conversation. #Person1#: It smells like an ashtray in here! #Person2#: Hi honey! What's wrong? Why do you have that look on your face? #Person1#: What's wrong? I thought we agreed that you were gonna quit smoking. #Person2#: No! I said I was going to cut down which is very different. You can't just expect me to go cold turkey overnight! #Person1#: Look, there are other ways to quit. You can try the nicotine patch, or nicotine chewing gum. We spend a fortune on cigaret... | <pad> Hopeless honey tells 6061# she has bad rape and #Person1# asks her to quit smoking because she doesn't have the willpower to do so. She said she'll keep going, but #Person1# tells her she will need a divorce.</s> | <pad> #Person1# thinks #Person2# smells like an ashtray because she doesn't know how to quit smoking and is too stressed to quit. #Person1# treats the situation embarrassingly.</s> | 0.559593 | 1.392192 | 0.8326 |
| 1 | Summarize the following conversation. #Person1#: Judy, what is everybody talking about? #Person2#: Haven't you heard? Richard was fired by our manager. #Person1#: You're kidding. It can't be true. #Person2#: Believe it or not. Everybody is talking about it in the company. #Person1#: Really? I'm surprised. #Person2#: Me too. Summary: </s> | <pad> Judy's surprised by the news about Richard being fired by her manager. She tells Judy she's surprised.</s> | <pad> Judy wonders what people are talking about.</s> | 1.384994 | 1.858945 | 0.473952 |
| 2 | Summarize the following conversation. #Person1#: Hello? #Person2#: Hello? #Person1#: Can I speak to Li Hong, please? #Person2#: Speaking. #Person1#: Hi, Li Hong. This is Alice. #Person2#: Hi, Alice. How are you? #Person1#: Not bad. Li Hong, I am sorry that I can't go to see Mrs. Brown with you tomorrow morning. My mother is ill. I must take care of her. #Person2#: I'm sorry to hear that. You'd better stay at home. After all, we can visit Mrs. Brown later #Person1#: OK. Bye - bye. #Person2#: ... | <pad> Alice's mother's ill so she can't go to Mrs. Brown with Li Hong. Li Hong tells her to stay at home.</s> | <pad> Diane tells Li Hong that she can't go to see Mrs. Brown with Li Hong because she has a severe illness and is taking care of her mother.</s> | 1.260455 | 1.635229 | 0.374774 |
| 3 | Summarize the following conversation. #Person1#: Could you help me, Sir? My flight got in 15 minutes ago. Everyone else has picked up the luggage but mine hasn't come through. #Person2#: I'm sorry, Madam, I'll go and find out if there is any more to come. Summary: </s> | <pad> #Person1# asks a man for help and the flight is 60 minutes delayed. They share the situation. #Person2# decides to go and see if there is any more to come.</s> | <pad> #Person1#'s flight got in 15 minutes ago, but while everyone else chipped in their luggage, #Person1#'s flight hasn't come through. #Person2#'ll ask her if there is anything else left.</s> | 2.153149 | 2.422932 | 0.269783 |
| 4 | Summarize the following conversation. #Person1#: Amanda, how do you like this peaked cap? #Person2#: Didn't you say you want to buy a top hat? #Person1#: But I think this one fits me Well. Why don't you try on the sombrero in black? #Person2#: I don't like caps at all. Summary: </s> | <pad> Amanda loves a peaked cap. Then, Amanda asks #Person2# for a sombrero in black.</s> | <pad> Amanda showed #Person1# a peaked cap by #1 and she prefers a top hat.</s> | 1.318374 | 1.578068 | 0.259693 |
| 5 | Summarize the following conversation. #Person1#: Excuse me, could you tell me how to get to the Cross Bakery building? #Person2#: The Cross Bakery building? Oh sure. You're actually walking in the opposite direction. #Person1#: Oh, you're kidding! I thought I was heading east. #Person2#: No, east is the other direction. To get to the Bakery, you need to turn around and go three blocks to Broadway. When you get to the intersection of Broadway and Elm, you hang a left. Go straight down that st... | <pad> #Person1# asks #Person2# to show #Person1# the way to the Cross Bakery walking in the opposite direction. #Person2# tells #Person1# it's the opposite direction and shows #Person1# the way to the Bakery outside Bakery.</s> | <pad> #Person1# asks #Person2# how to get to the Cross Bakery building. To get to the Bakery, you have to walk three blocks to Broadway from Elm and turn right. To continue, you'd have to make a left when you hit Broadway.</s> | 2.957855 | 3.076255 | 0.1184 |
| 6 | Summarize the following conversation. #Person1#: Where shall I register, please? #Person2#: Here. Do you have a registration card? #Person1#: Yes. Here you are. #Person2#: Please register your information here and pay for it. And I'll make a medical record for you. #Person1#: OK. How much do I need to pay for the registration? #Person2#: Please pay ten yuan for the registration. #Person1#: Here is my money. #Person2#: This is your registration card. Please don't lose it and bring it whenever... | <pad> #Person1# asks for a registration. #Person2# gives #Person1# a registration code and in wong Yi from the pharmacy and tells #Person1# the way to the counseling room.</s> | <pad> #Person1# goes to the counseling room and is informed by #Person2# how to approach the consultation room.</s> | 1.515073 | 1.623435 | 0.108362 |
| 7 | Summarize the following conversation. #Person1#: Hello. I want to reconfirm our flight to London. #Person2#: Yes, sir. Did you call the airline? #Person1#: Yes, I did. But I couldn't communicate with them in English. They speak only Spanish. So I need your help. #Person2#: Certainly, sir. What is the flight number and when are you leaving? #Person1#: We are taking IB 385 to London tomorrow at 1 p. m. #Person2#: Oh, I see, sir. We have the airline office inside the hotel. They have an English... | <pad> #Person1# says #Person1# will take an IB 385 to London tomorrow, but can't call the airline in English. They awill ask 38 by calling 275 to confirm the flight.</s> | <pad> #Person1# calls a travel agency for a flight to London to make his request. #Person2# tells #Person1# #Person2# has informed the airline and they have an English-speaking staff.</s> | 1.671294 | 1.769185 | 0.097891 |
| 8 | Summarize the following conversation. #Person1#: Oh, my God! What's this? #Person2#: What? #Person1#: Look! This window is open. #Person2#: Did you open it before we left? #Person1#: Are you kidding? It's winter. Why would I open it? #Person2#: I don't know. Wait. Is this yours? #Person1#: No! Oh, my God! Someone has broken into the house. #Person2#: It looks that way. That's probably why the door wasn't locked when we came in. #Person1#: I locked it when I left though. #Person2#: Yes, but t... | <pad> Allen doubts Allen is old and he doesn't know what masked the burglar. They'll look upstairs to see if anyone had robbed the house. Allen confesses to #Person1# and tells #Person1# the TV's at the window and the food on the table. They do not think there will be someone yet.</s> | <pad> Allen thinks somebody has broken the house and doesn't want to let the robber in so he locked it before leaving. Allen thinks he forgot to lock the window the first time so he forgot to lock the door.</s> | 2.079719 | 2.171317 | 0.091598 |
| 9 | Summarize the following conversation. #Person1#: I'm forming a music band. #Person2#: Do you already know how to play an instrument? #Person1#: Uh... Yeah! I'Ve told you a thousand times that I'm learning to play the drums. Now that I know how to play well, I would like to form a rock band. #Person2#: Aside from yourself, who are the other members of the band? #Person1#: We have a guy who plays guitar, and another who plays bass. Although we still haven't found anyone to be our singer. You t... | <pad> #Person1# wants to form a rock band. #Person2# tells #Person1# the main bands are a guy and a singer. All the members of the band have musicians. They make auditions for the rock band and some pictures. #Person2# says #Person1# has some musical talent and invites #Person1#.</s> | <pad> #Person1# wants to form a music band. #Person1# wants to make a rock band but they don't have enough room. #Person2# wants to audition this weekend. Besides fraction, they have acoustics and are not a singer.</s> | 2.842946 | 2.930702 | 0.087757 |
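As a quick sanity check on the ten rows shown above, the sketch below recomputes `reward_diff` from the displayed `reward_before` and `reward_after` values and summarizes them in plain Python. Note that the rows are sorted by `reward_diff` in descending order and only ten of the 20 evaluated examples are shown, so this summary is biased toward the largest gains and should not be read as the average effect of the fine-tuning.

```python
# Rewards copied from the ten displayed rows (rounded as shown).
reward_before = [0.559593, 1.384994, 1.260455, 2.153149, 1.318374,
                 2.957855, 1.515073, 1.671294, 2.079719, 2.842946]
reward_after = [1.392192, 1.858945, 1.635229, 2.422932, 1.578068,
                3.076255, 1.623435, 1.769185, 2.171317, 2.930702]

# Per-row change in the "nothate" reward (after minus before).
diffs = [after - before for before, after in zip(reward_before, reward_after)]

improved = sum(d > 0 for d in diffs)
mean_diff = sum(diffs) / len(diffs)

print(f"{improved}/{len(diffs)} rows improved, mean reward_diff = {mean_diff:.3f}")
# 10/10 rows improved, mean reward_diff = 0.271
```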


## References

The creation of this document was greatly influenced by the following key sources of information:

1. [DialogSum Dataset](https://huggingface.co/datasets/knkarthick/dialogsum) - A large-scale dialogue summarization dataset consisting of 13,460 dialogues (plus 100 holdout dialogues for topic generation) with corresponding manually labeled summaries and topics.
2. [Generative AI with Large Language Models | Coursera](https://www.coursera.org/learn/generative-ai-with-llms) - An informative guide that provides in-depth explanations and examples on various LLMs.