# Dialogue Summarization and Detoxification with FLAN-T5, PPO, and Open-Source Tools

## Notebook Overview
This notebook demonstrates how to fine-tune a FLAN-T5 model for dialogue summarization while reducing toxicity in the generated summaries. It uses Proximal Policy Optimization (PPO) for reinforcement learning, with a reward model that encourages non-toxic outputs.
All steps use only open-source models and libraries, and the notebook is fully compatible with Google Colab (GPU).

## Installing Dependencies

In [10]:
!pip install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2
!pip install numpy==1.26.4 --force-reinstall
!pip install transformers==4.41.0 datasets==2.19.1 peft==0.11.1 trl==0.8.6 evaluate==0.4.2 rouge_score==0.1.2

Collecting numpy==1.26.4
  Using cached numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Using cached numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.3 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.26.4
    Uninstalling numpy-1.26.4:
      Successfully uninstalled numpy-1.26.4
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
thinc 8.3.6 requires numpy<3.0.0,>=2.0.0, but you have numpy 1.26.4 which is incompatible.
Successfully installed numpy-1.26.4




## Importing Libraries

In [1]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, AutoModelForSequenceClassification
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, TaskType
from trl import PPOTrainer, PPOConfig, AutoModelForSeq2SeqLMWithValueHead, create_reference_model
import torch
import numpy as np
from tqdm import tqdm

## Load and Prepare Dataset


*   Loads the open-source DialogSum dataset for dialogue summarization.
*   Keeps only dialogues longer than 200 and at most 1,000 characters, dropping very short and very long ones.
*   Tokenizes each dialogue into a prompt for the model.
*   Splits the data into training and test sets.
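
The filtering and prompt-building steps above can be tried on plain strings without downloading the dataset. The sketch below is a minimal illustration: `keep_dialogue` and `build_prompt` are hypothetical helper names, the two sample dialogues are made up, and the thresholds and template are copied from the notebook's `build_dataset`.

```python
def keep_dialogue(dialogue: str, min_len: int = 200, max_len: int = 1000) -> bool:
    """Mirror the notebook's filter; length is counted in characters, not tokens."""
    return min_len < len(dialogue) <= max_len


def build_prompt(dialogue: str) -> str:
    """Wrap a dialogue in the same instruction template the notebook uses."""
    return f"Summarize the following conversation.\n\n{dialogue}\n\nSummary:\n"


# Made-up dialogues for illustration only.
short = "#Person1#: Hi.\n#Person2#: Hello."
long_enough = "#Person1#: Hi.\n#Person2#: Hello.\n" * 10

kept = [d for d in (short, long_enough) if keep_dialogue(d)]
prompts = [build_prompt(d) for d in kept]
print(f"kept {len(kept)} of 2 dialogues")
```

Because the filter counts characters, it only loosely bounds the token count; the notebook still truncates prompts to 512 tokens when encoding.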










In [2]:
dataset = load_dataset("knkarthick/dialogsum")

def build_dataset(tokenizer, min_len=200, max_len=1000):
    ds = dataset["train"].filter(lambda x: min_len < len(x["dialogue"]) <= max_len)
    def tokenize(sample):
        prompt = f"Summarize the following conversation.\n\n{sample['dialogue']}\n\nSummary:\n"
        sample["input_ids"] = tokenizer.encode(prompt, truncation=True, max_length=512)
        sample["query"] = prompt
        return sample
    ds = ds.map(tokenize)
    ds.set_format(type="torch")
    return ds.train_test_split(test_size=0.2, seed=42)

model_name = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
dataset_splits = build_dataset(tokenizer)




## Load FLAN-T5 Model and Add LoRA Adapter

*   Sets up LoRA (Low-Rank Adaptation) configuration for parameter-efficient fine-tuning.
*   Loads the FLAN-T5 base model.
*   Wraps the model with LoRA for efficient RL fine-tuning.
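
To get a feel for why LoRA is parameter-efficient, the arithmetic below estimates how many weights the adapter adds with `r=16` on the `q` and `v` projections. It is a rough sketch, not a replacement for inspecting the real model: the FLAN-T5 base shapes (hidden size 768, 12 encoder and 12 decoder layers) are assumptions not stated in the notebook, and `lora_trainable_params` is a hypothetical helper.

```python
def lora_trainable_params(r: int, d_in: int, d_out: int, n_modules: int) -> int:
    """Count LoRA weights: each adapted layer adds A (r x d_in) and B (d_out x r)."""
    return n_modules * r * (d_in + d_out)


# Assumed flan-t5-base shapes (not stated in the notebook).
D_MODEL = 768
ENCODER_LAYERS = 12  # one self-attention block each
DECODER_LAYERS = 12  # one self-attention and one cross-attention block each

attention_blocks = ENCODER_LAYERS + 2 * DECODER_LAYERS
target_modules = attention_blocks * 2  # "q" and "v" in every attention block

r, lora_alpha = 16, 16
n_trainable = lora_trainable_params(r, D_MODEL, D_MODEL, target_modules)
scaling = lora_alpha / r  # factor applied to the low-rank update

print(f"{target_modules} adapted layers, {n_trainable:,} LoRA weights, scaling {scaling}")
```

Under these assumptions the adapter adds about 1.8 million weights, a small fraction of the base model. The estimate does not include the value head that is added later for PPO.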






In [3]:
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM
)

base_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.float32)
peft_model = get_peft_model(base_model, lora_config)


## Load Toxicity Reward Model

*   Loads a RoBERTa-based model for detecting hate speech.
*   Defines a function to score outputs: higher reward for less toxic (non-hate) content.
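
The notebook uses the raw "nothate" logit as the reward. For reporting, a probability is sometimes easier to read, and a softmax over the two logits gives one. The sketch below uses made-up logits and assumes the label order `[nothate, hate]` implied by the notebook's comment on index 0; `softmax` is a small hand-written helper, not a library call.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical classifier logits in [nothate, hate] order.
logits = [3.0, -2.5]

nothate_reward = logits[0]        # what the notebook feeds to PPO
nothate_prob = softmax(logits)[0]  # same information on a 0..1 scale

print(f"reward (logit) = {nothate_reward:.2f}, P(nothate) = {nothate_prob:.4f}")
```

The logit is unbounded and keeps rewarding the policy for ever more confident "nothate" predictions, while the probability saturates near 1. Which one works better as a reward is a design choice that this sketch does not settle.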





In [4]:
toxicity_model_name = "facebook/roberta-hate-speech-dynabench-r4-target"
toxicity_tokenizer = AutoTokenizer.from_pretrained(toxicity_model_name)
toxicity_model = AutoModelForSequenceClassification.from_pretrained(toxicity_model_name)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
toxicity_model = toxicity_model.to(device)

def get_nothate_reward(texts):
    inputs = toxicity_tokenizer(texts, return_tensors="pt", padding=True, truncation=True).to(device)
    with torch.no_grad():
        logits = toxicity_model(**inputs).logits
    rewards = logits[:, 0].cpu().tolist()  # logit for "nothate"
    return rewards


## Prepare PPO Model and Reference Model


*   Wraps the LoRA model with a value head for PPO.
*   Creates a frozen reference copy of the model; PPO penalizes divergence from it so the policy does not drift far from the original model's outputs.
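
The role of the reference model can be illustrated with a common formulation of the KL-penalized reward: each generated token is penalized in proportion to how much more likely the policy makes it than the reference does, and the classifier score is added at the final token. This is a simplified sketch of the idea, not TRL's actual code; `kl_shaped_rewards`, the coefficient `beta=0.2`, and the log-probabilities are all illustrative assumptions.

```python
def kl_shaped_rewards(policy_logprobs, ref_logprobs, score, beta=0.2):
    """Per-token rewards: -beta * (log p_policy - log p_ref), plus score on the last token."""
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    rewards = [-beta * (lp - ref_lp) for lp, ref_lp in zip(policy_logprobs, ref_logprobs)]
    rewards[-1] += score  # the sequence-level reward arrives only at the end
    return rewards


# Hypothetical log-probabilities for a three-token response.
policy_lp = [-1.0, -0.5, -2.0]
ref_lp = [-1.2, -0.5, -1.5]

rewards = kl_shaped_rewards(policy_lp, ref_lp, score=3.0)
print([round(r, 3) for r in rewards])
```

Tokens the policy favors more than the reference get a small negative reward, which discourages the policy from gaming the classifier with text the original model would never produce.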





In [5]:
ppo_model = AutoModelForSeq2SeqLMWithValueHead.from_pretrained(peft_model)
ref_model = create_reference_model(ppo_model)


## Set Up PPO Training


*   Configures PPO hyperparameters.
*   Defines a data collator for batching.
*   Initializes the PPO trainer with all components.
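
The collator used in this notebook turns a list of per-sample dictionaries into one dictionary of lists, without padding or stacking, so variable-length `input_ids` stay as separate items. The toy run below shows that behavior on made-up samples and also works out how many mini-batches each PPO batch is split into with the configured sizes.

```python
def collator(data):
    """Same collator as the notebook: list of dicts -> dict of lists, no padding."""
    return {key: [d[key] for d in data] for key in data[0]}


# Made-up samples with different lengths.
samples = [
    {"input_ids": [1, 2, 3], "query": "prompt A"},
    {"input_ids": [4, 5], "query": "prompt B"},
]
batch = collator(samples)
print(batch)

# With the notebook's settings, each batch of 8 is optimized in mini-batches of 4.
batch_size, mini_batch_size = 8, 4
mini_batches_per_batch = batch_size // mini_batch_size
print(f"{mini_batches_per_batch} mini-batches per PPO batch")
```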

In [6]:
ppo_config = PPOConfig(
    model_name=model_name,
    learning_rate=1.41e-5,
    ppo_epochs=1,
    mini_batch_size=4,
    batch_size=8
)

def collator(data):
    return {key: [d[key] for d in data] for key in data[0]}

ppo_trainer = PPOTrainer(
    config=ppo_config,
    model=ppo_model,
    ref_model=ref_model,
    tokenizer=tokenizer,
    dataset=dataset_splits["train"],
    data_collator=collator
)


## PPO Training Loop
*   Runs PPO for 10 batches (for demonstration).
*   Generates summaries for each prompt.
*   Scores each summary for non-toxicity.
*   Updates the model with PPO using the rewards.
*   Logs training statistics.
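
The "update with PPO" step rests on the clipped surrogate objective from the PPO paper, which limits how far a single update can move the policy. The sketch below evaluates that objective for one token; it illustrates the formula only and is not TRL's implementation. `clipped_surrogate`, the clip range `0.2`, and the example numbers are assumptions.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO clipped objective for one action: min(ratio * A, clip(ratio) * A)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# The policy doubled a token's probability and the advantage is positive:
# the objective is capped at (1 + clip_eps) * advantage.
capped = clipped_surrogate(math.log(2.0), 0.0, advantage=1.0)

# The policy halved a token's probability and the advantage is negative:
# the clipped term is the smaller one, so it is the one that counts.
penalized = clipped_surrogate(math.log(0.5), 0.0, advantage=-1.0)

print(round(capped, 3), round(penalized, 3))
```

In practice the objective is averaged over all generated tokens and maximized, usually alongside a value-function loss.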

In [7]:
output_min_length, output_max_length = 30, 100

for step, batch in tqdm(enumerate(ppo_trainer.dataloader), total=10):
    if step >= 10: break  # demo run: stop after 10 batches
    prompt_tensors = batch["input_ids"]
    # Sample one summary per prompt from the current policy
    responses = ppo_trainer.generate(
        list(prompt_tensors),
        max_new_tokens=output_max_length,
        do_sample=True
    )
    batch["response"] = [tokenizer.decode(r, skip_special_tokens=True) for r in responses]
    # Score prompt + summary with the hate-speech classifier ("nothate" logit)
    query_response = [q + r for q, r in zip(batch["query"], batch["response"])]
    rewards = get_nothate_reward(query_response)
    reward_tensors = [torch.tensor(r) for r in rewards]
    # Run one PPO optimization step on this batch and log the statistics
    stats = ppo_trainer.step(prompt_tensors, responses, reward_tensors)
    ppo_trainer.log_stats(stats, batch, reward_tensors)

  0%|          | 0/10 [00:00<?, ?it/s]You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
100%|██████████| 10/10 [00:41<00:00,  4.17s/it]


## Evaluate Toxicity of Model Outputs
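* Generates summaries for the first 10 test-set prompts with the trained model.
* Scores each prompt plus summary with the reward function and computes the mean and standard deviation of the scores.
* Prints the results. The score is the classifier's "nothate" logit, so higher values mean less toxic output, despite the "Toxicity" label in the printout.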
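
A single post-training number is hard to interpret on its own; scoring the same prompts before and after PPO makes the change visible. The sketch below shows only the aggregation step, using the standard library and made-up scores. `summarize_scores` is a hypothetical helper, and `statistics.pstdev` is used because it matches the default behavior of `np.std` (population standard deviation).

```python
import statistics


def summarize_scores(scores):
    """Return (mean, population std) of a list of reward scores."""
    return statistics.fmean(scores), statistics.pstdev(scores)


# Made-up "nothate" logits for the same prompts, before and after PPO.
before_scores = [2.0, 2.5, 3.0]
after_scores = [2.5, 3.0, 3.5]

mean_before, std_before = summarize_scores(before_scores)
mean_after, std_after = summarize_scores(after_scores)
improvement = mean_after - mean_before  # positive means less toxic on average

print(f"before: mean={mean_before:.4f}, std={std_before:.4f}")
print(f"after:  mean={mean_after:.4f}, std={std_after:.4f}")
print(f"mean change: {improvement:+.4f}")
```

With only 10 evaluation samples, as in this notebook, differences of this size may well fall within sampling noise.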



In [8]:
def evaluate_toxicity(model, tokenizer, dataset, n=10):
    toxicities = []
    for i, sample in enumerate(dataset):
        if i >= n: break
        input_ids = tokenizer(sample["query"], return_tensors="pt").input_ids.to(device)
        with torch.no_grad():
            summary_ids = model.generate(input_ids=input_ids, max_new_tokens=output_max_length)
        summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
        # Score with the reward function: this is the "nothate" logit, so higher means less toxic
        tox = get_nothate_reward([sample["query"] + summary])[0]
        toxicities.append(tox)
    return np.mean(toxicities), np.std(toxicities)

mean, std = evaluate_toxicity(ppo_model, tokenizer, dataset_splits["test"])
print(f"Toxicity after PPO: mean={mean:.4f}, std={std:.4f}")


Toxicity after PPO: mean=3.0666, std=0.5770
