<img src=../banner.png>

# (Optional): Mitigate toxicity using Direct Preference Optimization (DPO)
(<a href="#0">Go to top</a>)

When you have access to the underlying model, you can also reduce toxicity by modifying the LLM itself.

These in-processing mitigations rely on additional human-labeled data or on humans in the loop. Examples include fine-tuning, reinforcement learning from human feedback (RLHF), and direct preference optimization (DPO).

The idea behind DPO is to show human annotators several outputs that were generated from the same prompt. The annotators simply indicate which output they prefer and which one they reject. The prompt, the preferred output, and the rejected output can then be used to optimize the model directly on these preferences, without training a separate reward model.
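
To make the objective more concrete, here is a minimal sketch of the DPO loss for a single preference pair, written in plain Python. It assumes you already have the summed log-probability of each response under the model being trained (the policy) and under a frozen reference copy; the function name `dpo_loss` and the numeric values are illustrative and do not come from this notebook.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (prompt, chosen, rejected) triple."""
    # How much more the policy favors each response than the reference does
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Scaled margin between the preferred and the rejected response
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) rewritten as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))


# Policy identical to the reference: the margin is 0, so the loss is ln(2)
loss_at_start = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# Policy has shifted probability mass toward the preferred response
loss_after_shift = dpo_loss(-10.0, -18.0, -12.0, -15.0)

print(loss_at_start, loss_after_shift)
```

The loss falls as the policy raises the likelihood of the preferred response relative to the rejected one, and `beta` controls how strongly deviations from the reference model are scaled.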

To use DPO for a model, three main steps are required:
1. create a dataset with the columns 'prompt, preferred, rejected'
2. fine-tune the model on the dataset with supervised fine-tuning (SFT) so that the data is in-distribution for the model
3. train the fine-tuned model using the DPO algorithm

This section consumes a lot of device memory, so restart the kernel before running it.

In [2]:
import sys
sys.path.insert(0, '..')

import transformers, torch
transformers.logging.set_verbosity_error()

In [3]:
from datasets import load_from_disk, load_dataset

movie_dataset = load_from_disk("../movie_dataset")
summaries_dataset = load_dataset(
    "csv", data_files="../summaries_dataset.csv", split="train"
)

## Create DPO dataset

In [4]:
from functools import partial
from utils.data_utils import _return_prompt_and_responses

BATCH_DATA = 5

# reshape the dataset to the format DPO expects
dpo_ds = summaries_dataset.map(
    partial(_return_prompt_and_responses, batch_multiplier=BATCH_DATA),
    batched=True,
    batch_size=BATCH_DATA,
    remove_columns=summaries_dataset.column_names,
)

# create train/eval split for fine-tuning
ds = summaries_dataset.train_test_split(train_size=150, test_size=50, seed=0)
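
The internals of `_return_prompt_and_responses` are not shown in this notebook, so the following is only a sketch of what a batched mapping function of this kind could look like. With `batched=True`, the function receives a dictionary of column lists and returns a dictionary of new column lists. TRL's `DPOTrainer` expects the columns `prompt`, `chosen`, and `rejected`. The input column names `preferred_summary` and `rejected_summary`, the prompt prefix, and the function name are hypothetical.

```python
def to_preference_batch(batch):
    """Map a batch (dict of column lists) to the columns DPO training expects."""
    return {
        # Hypothetical prompt template; the real helper may differ
        "prompt": ["Summarize: " + dialogue for dialogue in batch["dialogue"]],
        "chosen": list(batch["preferred_summary"]),
        "rejected": list(batch["rejected_summary"]),
    }


# A toy batch in the same dict-of-lists layout that a batched map passes in
batch = {
    "dialogue": ["A: Where were you? B: Stuck in traffic."],
    "preferred_summary": ["B was late because of traffic."],
    "rejected_summary": ["B is a useless idiot who is always late."],
}

pairs = to_preference_batch(batch)
print(pairs)
```

Because the returned columns replace the inputs, and a batched function may return a different number of rows than it received, the original columns have to be dropped, which is what `remove_columns=summaries_dataset.column_names` does in the cell above.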

## Fine-tune model

In [5]:
from transformers import (
    BitsAndBytesConfig,
    T5ForConditionalGeneration,
    TrainingArguments,
    AutoTokenizer,
)
from peft import LoraConfig, TaskType
import torch
from trl import SFTTrainer

# config to load base model in 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# set up base model - Flan-T5 Large, loaded with the quantization config
model_t5_qn = T5ForConditionalGeneration.from_pretrained(
    "google/flan-t5-large",
    quantization_config=bnb_config,
    device_map={"": 0},
)

# turn off the key/value cache during training (it only helps generation and conflicts with gradient checkpointing)
model_t5_qn.config.use_cache = False

# load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "google/flan-t5-large",
    skip_special_tokens=True,
    return_tensors="pt",
    truncation=True,
    use_fast=True,
)

# add LoRA layers on top of the quantized base model
peft_config = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

# specify epochs and learning rate
EPOCHS = 2
LEARNING_RATE = 2e-5

# set up training arguments
training_args = TrainingArguments(
    output_dir="sfft-trainer",
    overwrite_output_dir=True,
    learning_rate=LEARNING_RATE,
    num_train_epochs=EPOCHS,
    optim="adafactor",
    seed=1,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    eval_accumulation_steps=1,
    lr_scheduler_type="cosine",
    weight_decay=0.01,
    remove_unused_columns=False,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
    logging_strategy="epoch",
)

# set up trainer
trainer = SFTTrainer(
    model=model_t5_qn,
    train_dataset=ds["train"],
    eval_dataset=ds["test"],
    peft_config=peft_config,
    dataset_text_field="summary",
    tokenizer=tokenizer,
    dataset_batch_size=5,
    max_seq_length=512,
    args=training_args,
)

# run trainer
trainer.train()

# specify where to save the pre-trained (domain adapted) SFT-model
trainer.model.save_pretrained("sft-domain-pretrained")

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|------|---------------|
| 37   | 0.2407        |
| 74   | 0.2243        |
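
To get a feel for how small the LoRA adapter from the fine-tuning cell is, the sketch below counts its trainable parameters by hand. The architecture numbers for `google/flan-t5-large` (model width 1024, 16 heads of size 64, 24 encoder and 24 decoder layers) are assumptions that are not stated in this notebook; check them against the model config before relying on the result.

```python
# Assumed flan-t5-large dimensions (not taken from this notebook)
D_MODEL = 1024
INNER_DIM = 16 * 64          # num_heads * d_kv
ENCODER_LAYERS = 24
DECODER_LAYERS = 24

# Values from the LoraConfig in the fine-tuning cell
R = 32
LORA_ALPHA = 32
TARGETS_PER_ATTENTION = 2    # the "q" and "v" projections

# Each encoder layer has self-attention; each decoder layer has
# self-attention plus cross-attention
attention_blocks = ENCODER_LAYERS + 2 * DECODER_LAYERS

# LoRA adds two low-rank matrices per target: (d_in x r) and (r x d_out)
params_per_module = R * (D_MODEL + INNER_DIM)

trainable_params = attention_blocks * TARGETS_PER_ATTENTION * params_per_module
scaling = LORA_ALPHA / R     # factor applied to the low-rank update

print(f"{trainable_params:,} trainable LoRA parameters, scaling = {scaling}")
```

Under these assumptions the adapter has roughly 9.4 million trainable parameters, a small fraction of the full model, which is why it can be trained on top of a 4-bit quantized base model.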


## Update the model using DPO

In [6]:
from trl import DPOTrainer, create_reference_model
from peft import PeftModelForCausalLM

# load domain adapted SFT model
base_model = T5ForConditionalGeneration.from_pretrained(
    "sft-domain-pretrained",
    low_cpu_mem_usage=True,
    torch_dtype=torch.float32,
    device_map={"": 0},
)

# instantiate a PEFT model from a pretrained model and loaded PEFT weights.
model = PeftModelForCausalLM.from_pretrained(
    model=base_model, model_id="../adapters", is_trainable=True
)

# create reference model
model_ref = create_reference_model(model)

EPOCHS = 4
LEARNING_RATE = 2e-4

dpo_training_args = TrainingArguments(
    output_dir="dpo-model",
    remove_unused_columns=False,
    overwrite_output_dir=True,
    learning_rate=LEARNING_RATE,
    num_train_epochs=EPOCHS,
    optim="adafactor",
    gradient_accumulation_steps=4,
    per_device_train_batch_size=4,
    logging_strategy="epoch",
)

dpo_trainer = DPOTrainer(
    model,  # base model from SFT pipeline
    model_ref,  # a copy of the SFT trained base model
    beta=0.1,  # temperature hyperparameter of DPO
    train_dataset=dpo_ds,  # dataset prepared above
    tokenizer=tokenizer,  # tokenizer
    args=dpo_training_args,  # training arguments e.g. batch size, lr, etc.
    max_length=150,
    max_prompt_length=300,
    max_target_length=128,
)

# train dpo model
dpo_trainer.train()

# specify where to save the DPO model
dpo_trainer.model.save_pretrained("trained-dpo")

Could not estimate the number of tokens of the input, floating-point operations will not be computed


| Step | Training Loss |
|------|---------------|
| 12   | 0.6435        |
| 25   | 0.3177        |
| 37   | 0.3075        |
| 48   | 0.2841        |
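
The final step numbers in the two loss logs (74 for SFT, 48 for DPO) follow from the dataset size, the batch size, and `gradient_accumulation_steps`. The sketch below reproduces that arithmetic in a way that mirrors how the Hugging Face `Trainer` has typically computed its total number of optimizer steps; the exact rule can differ between library versions. The function name is illustrative, and the DPO dataset size of 200 pairs is an assumption inferred from the logged step count, not a value stated in the notebook.

```python
import math


def total_optimizer_steps(num_examples, per_device_batch_size,
                          grad_accum_steps, epochs, num_devices=1):
    """Estimate the number of optimizer updates for a training run."""
    # Number of mini-batches the dataloader yields per epoch
    batches_per_epoch = math.ceil(
        num_examples / (per_device_batch_size * num_devices)
    )
    # One optimizer update per `grad_accum_steps` mini-batches
    updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return math.ceil(epochs * updates_per_epoch)


# SFT run: 150 training examples, batch size 1, accumulation 4, 2 epochs
sft_steps = total_optimizer_steps(150, 1, 4, 2)

# DPO run: assumed 200 pairs, batch size 4, accumulation 4, 4 epochs
dpo_steps = total_optimizer_steps(200, 4, 4, 4)

print(sft_steps, dpo_steps)
```

With a batch size of 4 and an accumulation of 4, each DPO update averages over 16 preference pairs, compared with 4 examples per update in the SFT run.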


## Create new summaries with the DPO model

In [7]:
# enable inference
dpo_trainer.model = dpo_trainer.model.merge_and_unload()
dpo_trainer.model.config.use_cache = True

In [8]:
from utils.model_utils import _generate_summary


def _add_detoxified_summaries(sample, model, tokenizer):
    """
    Function to add summaries with DPO model.
    """

    # generate a summary of the dialogue with the DPO model
    sample["dpo_summary"] = _generate_summary(sample["dialogue"], model, tokenizer)

    return sample


# use partial to pass the arguments to the map function
summaries_dataset_dpo = movie_dataset.map(
    partial(_add_detoxified_summaries, model=dpo_trainer.model, tokenizer=tokenizer),
    batched=False,
)

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

In [9]:
from utils.eval_utils import _add_toxicty_column

summaries_dataset_dpo = _add_toxicty_column(summaries_dataset_dpo, "dpo_summary")
summaries_dataset = _add_toxicty_column(summaries_dataset, "summary")

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

Map:   0%|          | 0/34 [00:00<?, ? examples/s]


<div class="alert alert-block alert-warning">
<b>Exercise</b>: Compare summaries from the DPO model to the reference model.
</div>

In [10]:
##### complete your code here #####


###################################

## Compare toxicity between models

In [12]:
import numpy as np

# Mean toxicity of the outputs from the original model
print("\nToxicity of original summaries:")
print(
    np.mean(summaries_dataset["toxicity_score"]),
    np.std(summaries_dataset["toxicity_score"]),
)

# Mean toxicity of outputs from the DPO model
print("\nToxicity of retrained summaries:")
print(
    np.mean(summaries_dataset_dpo["toxicity_score"]),
    np.std(summaries_dataset_dpo["toxicity_score"]),
)


Toxicity of original summaries:
0.2025899624430167 0.31525101796310845

Toxicity of retrained summaries:
0.04223141155438498 0.14729100956793917
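
To put these numbers in perspective, the sketch below computes the relative reduction in mean toxicity and a rough z-score for the difference between the two means. It assumes 200 scored summaries per model and treats the two sets as independent samples; both are simplifying assumptions, so read the z-score as a sanity check and not as a formal significance test.

```python
import math

# Means and standard deviations printed by the cell above
orig_mean, orig_std = 0.2025899624430167, 0.31525101796310845
dpo_mean, dpo_std = 0.04223141155438498, 0.14729100956793917

N = 200  # assumed number of summaries per model


def standard_error(std, n):
    """Standard error of a sample mean."""
    return std / math.sqrt(n)


# Share of the original mean toxicity removed by the DPO model
relative_reduction = (orig_mean - dpo_mean) / orig_mean

# Standard error of the difference, assuming independent samples
se_diff = math.sqrt(
    standard_error(orig_std, N) ** 2 + standard_error(dpo_std, N) ** 2
)
z_score = (orig_mean - dpo_mean) / se_diff

print(f"relative reduction: {relative_reduction:.1%}, z = {z_score:.2f}")
```

Under these assumptions the mean toxicity drops by roughly 79 percent, and the gap between the means is several standard errors wide. The standard deviations are still large relative to the means, so individual summaries can remain toxic even after DPO.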


## Thank you!