<img src=banner.png>

# <a name="0">Measuring and Mitigating Toxicity in Large Language Models</a>

Building and operating machine learning applications responsibly requires an active, consistent approach to prevent, assess, and mitigate harm. This workshop guides you through how to identify toxicity in LLM generated summaries and how to mitigate and reduce toxicity.

In this workshop you will:
1. <a href="#1">Load a dataset</a>
2. <a href="#2">Load and use a Large Language Model (LLM)</a>
3. <a href="#3">Evaluate LLM generated summaries for toxicity</a>
4. <a href="#4">Mitigate toxicity using guardrails</a>
5. <a href="#5">Reduce toxicity using a Direct Optimization Policy (DPO)</a>


**Learning Objectives**

In this workshop you will learn to:

- Use a LLM for text summarization
- Apply toxicity metrics to evaluate summaries
- Use various toxicity classifiers and libraries
- Define guardrails to mitigate toxicity
- Tune a model to mitigate toxicity

**Runtime**

This notebook takes about 90 minutes to complete (using some inbuilt shortcuts).

### Let's get started

Start by upgrading [pip](https://pypi.org/project/pip/) (a Python package management system) and install all required libraries from the provided requirements.txt file.

In [2]:
!pip install -q -U pip --root-user-action=ignore
!pip3 install -q -r requirements.txt --root-user-action=ignore
!python3 -m spacy download en_core_web_sm

In [3]:
import warnings

warnings.filterwarnings(
    action="ignore",
    category=UserWarning,
)

import transformers, torch

transformers.logging.set_verbosity_error()

from tqdm.auto import tqdm as notebook_tqdm

# <a name="1">1. Load a dataset</a>
(<a href="#0">Go to top</a>)

In this notebook, you will be working with the "[Cornell Movie-Dialogs Corpus](https://convokit.cornell.edu/documentation/movie.html)", a large metadata-rich collection of fictional conversations extracted from raw movie scripts. The dataset contains 220,579 conversational exchanges between 10,292 pairs of movie characters in 617 movies.


In [2]:
from utils.data_utils import _prepare_data

# load the data
movie_df = _prepare_data()

# show the data
movie_df.head(2)

Downloading movie-corpus to /root/.convokit/downloads/movie-corpus


Unnamed: 0,movie,dialogue,genre
0,"""murderland""","Jesus, my legs are asleep. I'll never be able ...",crime
1,10 things i hate about you,They do not! They do to! I hope so. She okay? ...,comedy


In [3]:
from utils.data_utils import _explore_df

_explore_df(movie_df)   

Crime: ...et where I live? It's four o'clock. Sorry, sorry, sorry. I know, I'm late, I'm a swine. And be careful in the sun. Your gray's in danger of turning a little pink. Sure. Any time. You should come and have lunch with us, before you go -- Dickie? What are you doing in Mongi? How do you do. Sure -- I know, that's too dangerous for you, fair enough, hey! we're brothers, fine, then you do this sordid thing with Marge, fucking her on the boat while we all have to listen, which was excruciating, frankly...

Comedy: ... human victims yet? Should I run him through? El Vampiro. The nightcrawler.  The bloodsucker. The Prince of Darkness. Okay.  Where's Nosferatu? Zero. All together? As a matter of fact, we're almost certain that ghouls and werewolves occupy high positions at City Hall. Santa Carla has become a haven for the undead. We've been aware of some very serious vampire activity in this town for a long time. You're one of us, now -- aren't you? What is this, David? If you ever wan

**LLMs require the data to be stored in a specific format**; use the [HuggingFace 🤗 Datasets](https://huggingface.co/docs/datasets/index) library to convert the dataframe.

In [28]:
from datasets import Dataset

# convert the data
movie_dataset = Dataset.from_pandas(movie_df)

# show the data
movie_dataset

Dataset({
    features: ['movie', 'dialogue', 'genre'],
    num_rows: 617
})

You can see that there are 617 distinct movies, and can continue to explore the data by looking at an example dialogue.

To move through the remainder of the notebook more quickly, select 200 samples.

In [30]:
# select a sample of 200
dataset = dataset.select(range(200))

# save the dataset to disk
dataset.save_to_disk("movie_dataset")

Saving the dataset (0/1 shards):   0%|          | 0/200 [00:00<?, ? examples/s]

Delete all old variables that are no longer needed to free up memory with `del`.

In [31]:
del movie_dataset, movie_df, dataset

Make sure to release the memory after deleting the objects and variables that are no longer in use.

In [32]:
import gc

gc.collect()

94

<div class="alert alert-block alert-success">
<b>Conclusion</b>: In this section, you loaded a movie transcript dataset and converted it into a HuggingFace Dataset.
</div>

# <a name="2">2. Load and use a Large Language Model</a>
(<a href="#0">Go to top</a>)

[T5 (Text-To-Text Transfer Transformer)](https://github.com/google-research/text-to-text-transfer-transformer) is an encoder-decoder model pre-trained on a multi-task mixture of unsupervised and supervised tasks. T5 works well on a variety of tasks out-of-the-box by prepending a different prefix to the input corresponding to each task, including machine translation, **document summarization**, question answering, and classification tasks (e.g., sentiment analysis). 

<div style="text-align: center;">
<img src="https://camo.githubusercontent.com/623b4dea0b653f2ad3f36c71ebfe749a677ac0a1/68747470733a2f2f6d69726f2e6d656469756d2e636f6d2f6d61782f343030362f312a44304a31674e51663876727255704b657944387750412e706e67" width="700"/>
</div>

For more details have a look at the T5 documentation on HuggingFace 🤗 [here](https://huggingface.co/docs/transformers/model_doc/t5).

## 2.1. Loading T5

First, you have to download the T5 model using the `T5ForConditionalGeneration` class provided by the [HuggingFace 🤗 transformers library](https://github.com/huggingface/transformers) as well as the corresponding tokenizer `T5Tokenizer`. You can think of tokens as pieces of words that are required to pass information to LLMs. 

In [6]:
from transformers import T5ForConditionalGeneration

Load the model and use the GPU as preferred device type.

In [7]:
import torch

# load the model
model_t5 = T5ForConditionalGeneration.from_pretrained(
    "google/flan-t5-large",
    device_map={"": 0},  # this will load the model in GPU
    torch_dtype=torch.float32,
    return_dict=True,
)

Together with a tokenizer the model can be used to generate text. 
For English, **1 token is approximately 4 characters or 0.75 words**. This will be important to consider as LLMs are limited by the number of tokens they can pay attention to per prompt. Go ahead and initialize a tokenizer next.

In [8]:
from transformers import AutoTokenizer

# load the tokenizer
tokenizer_t5 = AutoTokenizer.from_pretrained(
    "google/flan-t5-large",
    skip_special_tokens=True,
    return_tensors="pt",
    truncation=True,
    use_fast=True,
)

Load the dataset.

In [9]:
from datasets import load_from_disk
dataset = load_from_disk("movie_dataset")

Let's create a prompt by joining an instruction to summarize text with the actual movie dialogue:

In [10]:
# create a prompt and use an example dialogue
inference_prompt = (
    "Summarize the following conversation from a movie script:  \n\n'''%s'''"
    % dataset[0]["dialogue"]
)

# let's look at the prompt but shorten the output to reduce the amount of text
print(inference_prompt[:235])

Summarize the following conversation from a movie script:  

'''I know.  Just be quick about it, will you? Do it right. Whistler, I -- No, we can treat the wounds -- Listen. You have to -- finish me off. You don't want me coming back. 


Have a look at what this looks like when converted to tokens:

In [11]:
print(tokenizer_t5(inference_prompt[:235]).input_ids)

[12198, 1635, 1737, 8, 826, 3634, 45, 3, 9, 1974, 4943, 10, 3, 31, 31, 31, 196, 214, 5, 1142, 36, 1704, 81, 34, 6, 56, 25, 58, 531, 34, 269, 5, 14883, 7, 14539, 6, 27, 1636, 465, 6, 62, 54, 2665, 8, 9699, 7, 1636, 12941, 5, 148, 43, 12, 1636, 1992, 140, 326, 5, 148, 278, 31, 17, 241, 140, 1107, 223, 5, 3, 1]


The **number of tokens passed to an LLM through the tokenizer should not be greater than the number of tokens used in pre-training**. T5 was pre-trained using 512 input tokens, and with `truncation=True` all text beyond 512 tokens will be truncated.

## 2.2. Using T5 for inference on individual movie examples


To generate a summary with T5 you need an inference pipeline that encodes the input (tokenization), passes the tokens through the model and then decodes everything back to text.

In [12]:
from utils.model_utils import _generate_summary, _format_llm_output

Try the inference pipeline. The pipeline will return a list that you have to access to retrieve the LLM generated output.

In [13]:
# pass the prompt to the pipeline and apply formatting
_format_llm_output(_generate_summary(inference_prompt, model_t5, tokenizer_t5))

<div class="alert alert-block alert-info"><pad> Frost is a half-breed who's been transformed into a human. Whistler wants him to kill Frost, but Frost doesn't want him coming back. Frost and Whistler go to a doctor's office to get more serum. Frost refuses to help Whistler because she's a half-breed. Frost takes Whistler to a hospital for treatment.</s></div>

This summary looks okay but important characters that appear in the dialogue are not mentioned at all. This is due to the limited number of tokens T5 can 'keep track of'. Later, you will see a method that can help fix this issue.



<div class="alert alert-block alert-warning">
<b>Exercise 1</b>: Recreate the example above but for another movie.
</div>

In [14]:
##### complete your code here #####


###################################

# Try to find toxic one (swear word)

In [28]:

_format_llm_output(_generate_summary(dataset[15]["dialogue"], model_t5, tokenizer_t5))


<div class="alert alert-block alert-info"><pad> Jerses I want that bastard dead! You can't get it by killing their hero. I want that bastard dead! You can't do that... listen to the mood of the crowd. Tell Lykas to send a retiarius and a Samnite to help Tiger.</s></div>

Before proceeding, delete the prompts that were used for inference; e.g. <code>del inference_prompt</code> and also clear the GPU cache with <code>torch.cuda.empty_cache()</code>. 

In [19]:
del inference_prompt

In [20]:
import gc

gc.collect()

0

## 2.2. Using T5 for inference on all movie examples

The goal of this section is to summarize all movie dialogues. As previously mentioned, there is one very important caveat though - **Large Language Models are only able to pay attention to a limited number of tokens**. The amount of tokens an LLM can understand is called context window. 

In [21]:
model_t5.config.__dict__["n_positions"]

512

As shown above, the context window for T5 models is 512 tokens. This means the movie transcript needs to be split into chunks of this lenght and summarised one by one. Then, a final summary needs to be created.

<div style="text-align: center;">
<img src="map_chain.png" width="900"/>
</div>

### 2.2.1. Chunking the movie transcripts
Let's start by creating chunks of the movie transcripts. One simple way to create chunks of text is to write a helper function and then apply this helper function to all the movies in the dataset.

In [22]:
def create_chunks(sample, CHUNK_LENGTH):
    """
    Splits a given text into chunks of a specified length and adds metadata to each chunk.
    """
    chunks = []
    # loop over entire text in steps of chunk size
    for c, i in enumerate(range(0, len(sample["dialogue"]), CHUNK_LENGTH)):
        # extract text
        chunk_text = sample["dialogue"][i : i + CHUNK_LENGTH]
        # create dictionary with the chunked text and metadata
        chunks.append(
            # remove uncompleted sentences with string split
            {
                "text": ".".join(chunk_text.split(".")[1:-1]).lstrip(),
                "metadata": {"page": c, "num_words": len(chunk_text)},
            }
        )
    # create new column
    sample["chunks"] = chunks
    return sample

Create the chunks for all the movie transcripts in the dataset with the help of `.map()`; this method efficiently applies the `create_chunks` function to all datapoints. Whenever you have additional parameters to pass to the model, you need to use a helper method, such as `partial`.

In [23]:
from functools import partial

# use partial to pass the arguments to the map function
dataset = dataset.map(partial(create_chunks, CHUNK_LENGTH=1650), batched=False)

<div class="alert alert-block alert-warning">
<b>Exercise 2</b>: Think about how the chunking could be improved. Hint: Look for text splitters in the LangChain documentation.
</div>

In [None]:
###### write down ideas here ######


###################################

### 2.2.2. Prepare prompt templates and pipeline
Now that the transcripts are chunked, let's start by setting up a prompt template for the intermediate (chunk) summaries. A prompt template is special construct that can parse input variables. Prompt templates can be applied to all the items in a dataset and help with consistency and reproducability. 

**Prompt templates**: Prompt templates can be very elaborate as ultimately the prompt is the only input the LLM sees - the better the prompt, the better the result. In the case of T5, the prompts used for pre-training all used _summarize:_, so this is what you should use.

In [24]:
from langchain import PromptTemplate

Define the prompt template for the movie chunks.

In [25]:
# prompt template for movie chunks
chunk_template = """summarize the movie dialogue chunks: ```{text}``` \n\n"""
chunk_prompt = PromptTemplate(
    template=chunk_template, input_variables=["text"]
)

# prompt template for final summary
combine_template = """summarize these snippets of text: ```{text}``` \n\n"""
combine_prompt = PromptTemplate(
    template=combine_template, input_variables=["text"]
)

**Pipelines**: To get a summary from the model, use 🤗 HuggingFace [pipelines](https://huggingface.co/docs/transformers/main_classes/pipelines) together with 🦜️🔗 LangChain's wrapper [HuggingFacePipeline](https://api.python.langchain.com/en/latest/llms/langchain.llms.huggingface_pipeline.HuggingFacePipeline.html). Pipelines are a great and easy way to use models for inference that offer a simple API dedicated to several tasks (e.g. [`summarization`](https://huggingface.co/transformers/v3.0.2/task_summary.html#summarization)).

In [27]:
from langchain.llms import HuggingFacePipeline
from transformers import pipeline

pipe = HuggingFacePipeline(pipeline = pipeline(
            task="summarization",
            model=model_t5,
            tokenizer=tokenizer_t5,
            num_return_sequences=1,
            do_sample=False,
            early_stopping=True,
            num_beams=2,
            min_length=65,
            max_length=350,
            repetition_penalty=2.0,
            )
        )



### 2.2.3. Create summaries of chunks and final summary

At this point now, you could apply the prompt template to all the chunks of movie transcripts to obtain your summaries, combine them back together and create a final summary. This would be a very lengthy and error-prone process, so instead make use of an increasingly popular toolkit: [🦜️🔗 LangChain](https://python.langchain.com/docs/get_started/introduction); in particular the [`load_summarize_chain`]((https://python.langchain.com/docs/use_cases/summarization)). This **chain will take the chunks, summarize them and then pass all the summaries to the LLM to create the final summary**.

In [28]:
from langchain.chains.summarize import load_summarize_chain

map_reduce_chain = load_summarize_chain(
    llm=pipe,
    chain_type="map_reduce",
    map_prompt=chunk_prompt,
    combine_prompt=combine_prompt,
    return_intermediate_steps=False,
)

There is one more small caveat: LangChain expects all text to be passed as `Document` type following the 🦜️🔗 LangChain schema. So you will have to convert the chunks to the expected schema. Then you can test the summarization chain:

In [29]:
from langchain.schema import Document

sample_doc = [
    Document(page_content=split["text"], metadata=split["metadata"])
    for split in dataset[0]["chunks"]
]

# turn on verbosity for chain
map_reduce_chain.llm_chain.verbose = True

# run the summarization chain
map_reduce_example = map_reduce_chain({"input_documents": sample_doc})

# show the result
_format_llm_output(map_reduce_example["output_text"])



[1m> Entering new LLMChain chain...[0m
Prompt after formatting:
[32;1m[1;3msummarize: ```Just be quick about it, will you? Do it right. Whistler, I -- No, we can treat the wounds -- Listen. You have to -- finish me off. You don't want me coming back. Don't try to talk -- China Town. I need more serum.  What's all this? Going somewhere? Don't even start, old man. What took you so long? Wait. Get in. Youre leaving. It's not worth the risk. We can't trust her. Maybe not. I did some checking, she's a hematologist. Knowledge like that might come in handy. Stupidity. Just do it, old man. I had to increase the dose. You're building up a resistance to the serum -- She hasn't turned yet.  You can help her. You should've killed her, then. She's been bitten. Are we bringing home strays now? Whistler! It's because I'm human that I can do this. You're too human, Blade. You don't have a few minutes, Frost. You're wrong -- a few minutes more, and my transition will be complete. Even your sword 

<div class="alert alert-block alert-info">Frost is a half-breed who's about to transition into human form. Blade wants him to kill Frost, but Frost refuses. Whistler and Frost fight over Frost's serum. Frost takes Frost's blood, while Blade uses it to make himself human again. Frost turns Frost into a half-breed.</div>

<div class="alert alert-block alert-warning">
<b>Exercise 3</b>: Recreate the example above but for another movie.
</div>

In [33]:
##### complete your code here #####


###################################

The next step is to generate all summaries. Because the LLM has to generate summaries for every chunk of text, as well as a final summary, the time to create summaries for all movies in the dataset is approximately 6 hours. You can find the code below, but please skip this code cell and simply load the pre-generated summaries.


```
from utils.model_utils import _add_summaries

# create summaries
summaries_dataset = dataset.map(
    partial(
        _add_summaries,
        chain=map_reduce_chain,
    ),
    batched=False,
)
````

If you need to load in the dataset, you can do so with `load_from_disk('summaries_dataset')`. Make sure to import the method first with `from datasets import load_from_disk`.

<div class="alert alert-block alert-success">
<b>Conclusion</b>: At this point, you have summaries for all the movies and it is time to check whether those summaries contain any hate speech, slurs or toxic remarks. Generally, you expect the toxicity values in a summarization task to be low unless the text being summarised itself already contains toxic speech.
</div>

In [102]:
del map_reduce_chain, pipe, dataset
gc.collect()
torch.cuda.empty_cache()

# 3. Evaluate LLM generated summaries for toxicity

To evaluate toxicity you can load the 🤗 [evaluate](https://huggingface.co/docs/evaluate/index) library and initialize a toxicity evaluator object. The model that will be used to evaluate toxicity is the [RoBERTa](https://huggingface.co/facebook/roberta-hate-speech-dynabench-r4-target) model. Start by loading in the data.

In [29]:
from datasets import load_from_disk

summaries_dataset = load_from_disk("summaries_dataset")

To evaluate the movie summary for toxicity, simply pass a list containing the summary text to the toxicity evaluator object. The aggregation parameter options are `None`, `maximum` and `ratio`.

In [30]:
from utils.eval_utils import _evaluate_toxicity

_evaluate_toxicity([summaries_dataset[0]["summary"]], aggregation_method=None)

[0.7981389760971069]

<div class="alert alert-block alert-warning">
<b>Exercise 4</b>: Calculate the max toxicity score for <code>summaries_dataset[1]["summary"]</code>. In addition to <code>aggregation_method = None</code> also try <code>aggregation_method = "maximum"</code>. You can also check out the summary text itself.
</div>

In [33]:
##### complete your code here #####


###################################

Now that you evaluated for a few movies manually, it is time to evaluate all movie summaries and obtain a list of toxicity scores.

In [25]:
from utils.eval_utils import _add_toxicty_column

summaries_dataset = _add_toxicty_column(summaries_dataset, "summary")

Now that you have the toxicity scores per movie, let's have a look at the toxicity across all summaries.

In [26]:
import numpy as np

np.mean(summaries_dataset["toxicity_score"]), np.std(summaries_dataset["toxicity_score"])

(0.2025899624430167, 0.31525101796310845)

<div class="alert alert-block alert-warning">
<b>Exercise 5</b>: Try to calculate mean toxicity for two different movie genres: comedy and crime.
</div>

In [None]:
##### complete your code here #####


###################################

<div class="alert alert-block alert-success">
<b>Conclusion</b>: We have seen that some summaries are toxic and would like to remediate this. The first option to mitigate toxicity would be to use a protective wrapper around the LLM itself. This is called a guardrail and is a very useful technique to employ whenever you don't have access to the model itself or not sufficient time or compute resources to make any modifications to the LLM. If however, you are interested in modifying the LLM, you can use techniques such as fine-tuning or reinforcement learning through human feedback.
</div>

# 4. Mitigate toxicity using Guardrails

In this section you will explore coding examples for so-called guardrails that can filter certain keywords or that leverage metrics to decide if content is harmful. To get started with [Guardrails.ai](https://docs.guardrailsai.com/) you need a `Validator` and a `RAIL spec` (Reliable AI markup Language spec).

A Validator is a class that contains a `validate` method. The validation could be anything you can possibly think of. You could check whether values are in a certain range or check for keywords as the example below shows. With the validation, it is also possible to define a corrective action to take, such as obfuscating the problematic parts or refusing to create an output altogether. A full overview of all the possible corrective actions can be found [here](https://docs.guardrailsai.com/concepts/output/#specifying-corrective-actions).


## 4.1. Guardrails to obfuscate unwanted words
Have a look at the below guardrail which filters for a pre-defined keyword list.

In [1]:
from guardrails.validators import *
from typing import Dict, Any

# provide a name for the validator to use in the RAIL spec later
@register_validator(name="is-keyword-free", data_type="string")
class IsKeywordFree(Validator):
    # the Validator class needs to contain a validate method
    def validate(self, value: Any, metadata: Dict) -> ValidationResult:
        # set up a list of words to filter for
        kw_list = ["vampire"]
        # check for forbidden words
        if any(kw in value for kw in kw_list):
            # replace forbidden words in output with ***
            for kw in kw_list:
                censored_text = value.replace(kw, "***")
            # display error message and return the fix value
            return FailResult(
                error_message=f"Expression '{value}' contains forbidden keyword.",
                fix_value=censored_text,
            )
        # else return pass
        return PassResult()

Once you have a validator, you need to create a RAIL spec. This is basically a file written in XML. In this RAIL spec, you need to specify the validator you want to use and create a placeholder for the prompt to pass through. 

In [2]:
rail_str = """
<rail version="0.1">

<output>
    <string
        name="summarize_statement"
        format="is-keyword-free"
        on-fail-is-keyword-free="fix"
    />
</output>

<prompt>
summarize:
${statement_to_be_summarized}
</prompt>

</rail>
"""

Next, everything needs to be merged together and passed to a Guard object.

In [3]:
import guardrails as gd

# create a Guard object from the above RAIL string
guard = gd.Guard.from_rail_string(rail_str)

Load the movie dataset from disk - in the next code cell you will pass the data to the T5 model.

In [4]:
from datasets import load_from_disk

movie_dataset = load_from_disk("movie_dataset")

Finally, pass the movie dialogue you want summarized and checked with the guardrail to the Guard object.

In [5]:
from utils.model_utils import _my_llm_api

# provide API to Guard and the prompt input
raw_llm_response, validated_response = guard(
    llm_api=_my_llm_api,
    prompt_params={"statement_to_be_summarized": movie_dataset[0]["dialogue"]},
)

# show the output
print(f"Validated Output: {validated_response}")

Token indices sequence length is longer than the specified maximum sequence length for this model (4460 > 512). Running this sequence through the model will result in indexing errors


Validated Output: {'summarize_statement': "Blade is bitten by a ***. He needs serum to heal his wounds. Whistler and Blade are working on a cure for the *** virus, but Blade's body is starting to reject the serum. The Thirst overcomes him, just like the others. It's not something he can control. The problem is, time's running out. His body's starting to reject the serum. And so far, all their efforts to find a cure have ended in failure. There is one other thing. I'd buy yourself a gun if I were you. If you start becoming sensitive to the daylight, if you start becoming thirsty regardless of much you've had"}


<div class="alert alert-block alert-warning">
<b>Exercise 6</b>: Create another validated response using guardrails.
</div>

In [21]:
##### complete your code here #####


###################################

## 4.2. Guardrails to filter profanties
Have a look at the below guardrail which filters profantities using a profanity classifier. Instead of using a pre-defined keyword list, you can also call a model that determines if a word is considered a profanity and filter it out.

In [7]:
from profanity_check import predict

@register_validator(name="is-profanity-free", data_type="string")
class IsProfanityFree(Validator):
    def validate(self, value: Any, metadata: Dict) -> ValidationResult:
        prediction = predict([value])
        if prediction[0] == 1:
            return FailResult(
                error_message=f"The result contains profanity and will be filtered.",
                fix_value="",
            )
        return PassResult()

Define the rail string again (make sure to use the correct validator name).

In [8]:
rail_str = """
<rail version="0.1">

<output>
    <string
        name="summarize_statement"
        format="is-profanity-free"
        on-fail-is-profanity-free="filter"
    />
</output>

<prompt>
summarize:
${statement_to_be_summarized}
</prompt>

</rail>
"""

Instantiate the Guard object.

In [9]:
# create a Guard object from the above RAIL string
guard = gd.Guard.from_rail_string(rail_str)

Pass a prompt to summraize through the Guard object.

In [13]:
# test string of profanities from Reddit
test_string = "I do like being a weirdo and a fucking asshole, so I'm glad your loser self has decided to fucking stop being a level-headed sissy, finally grew a fuckin pair of balls and asked! I can fucking cuss up a damn storm that'll make little pansies cry their faggoty bitch ass out! Seriously though, are you such a retarded idiot that you can't fucking figure this shit out? But to fucking answer your fucking question, profanity is what I'm fucking doing right now."

# provide API to Guard and the prompt input
raw_llm_response, validated_response = guard(
    llm_api=_my_llm_api,
    prompt_params={"statement_to_be_summarized": test_string},
)

# show the output
print(f"Validated Output: {validated_response}")

Validated Output: {}


Guardrails also provides a visual overview of what the prompt, raw LLM output and validated output look like.

In [14]:
guard.state.most_recent_call.tree

In this example, it becomes very obvious that T5 is not actually able to summarize the input text as it clearly lacks the vocabulary and understanding. Thanks to the guardrail, the validated output is also empty.

<div class="alert alert-block alert-success">
<b>Conclusion</b>: You have seen guardrails as very effective and lightweight method to mitigate toxic outputs by adding a validation layer around the call to the LLM. Guardrails should be used whenever you are looking for a solution that does not require retraining the LLM itself.
</div>

# 5. (Optional): Mitigate toxicity using a Direct Optimization Policy (DPO)

The idea behind Direct Policy Optimization (DPO) is to provide human annotators with different outputs that were generated using a certain prompt. The human annotators will be tasked to simply indicate which output they prefer and which one they would like to reject. The preferred output, together with the rejected output and the prompt that was used can be use in a direct optimization approach. 

To use DPO for a model, three main steps are required:
1. create a dataset that includes 'prompt, preferred, rejected'
2. fine-tune the model on the dataset to ensure the vocabulary is in-distribution
3. train the model using the DPO algorithm

This section will consume a lot of device memory so it will be best to restart the kernel and start fresh.

In [16]:
from datasets import load_from_disk

movie_dataset = load_from_disk("movie_dataset")
summaries_dataset = load_from_disk("summaries_dataset")

## 5.1. Create DPO dataset

In [7]:
from functools import partial
from utils.data_utils import _return_prompt_and_responses

BATCH_DATA = 5

# reshape the dataset to format DPO expects
dpo_ds = summaries_dataset.map(
    partial(_return_prompt_and_responses, batch_multiplier=BATCH_DATA),
    batched=True,
    batch_size=BATCH_DATA,
    remove_columns=summaries_dataset.column_names,
)

# create train/eval split for fine-tuning
ds = summaries_dataset.train_test_split(train_size=150, test_size=50, seed=0)

## 5.2. Fine-tune model

In [8]:
from transformers import BitsAndBytesConfig, T5ForConditionalGeneration, TrainingArguments
from peft import LoraConfig, TaskType
import torch
from trl import SFTTrainer

# config to load base model in 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# set up base model - T5 Large but with quantization config
model_t5_qn = T5ForConditionalGeneration.from_pretrained(
    "google/flan-t5-large",
    quantization_config=bnb_config,
    device_map={"": 0},
)

# turn of cache to use updated model params
model_t5_qn.config.use_cache = False

# load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "google/flan-t5-large",
    skip_special_tokens=True,
    return_tensors="pt",
    truncation=True,
    use_fast=True,
)

    
# add LoRA layers on top of the quantized base model
peft_config = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

# specify epochs and learning rate
EPOCHS = 2
LEARNING_RATE = 2e-5

# set up training arguments
training_args = TrainingArguments(
    output_dir="sfft-trainer",
    overwrite_output_dir=True,
    learning_rate=LEARNING_RATE,
    num_train_epochs=EPOCHS,
    optim="adafactor",
    seed=1,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    eval_accumulation_steps=1,
    lr_scheduler_type="cosine",
    weight_decay=0.01,
    remove_unused_columns=False,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
    logging_strategy="epoch", 
)

# set up trainer
trainer = SFTTrainer(
    model=model_t5_qn,
    train_dataset=ds["train"],
    eval_dataset=ds["test"],
    peft_config=peft_config,
    dataset_text_field="summary",
    tokenizer=tokenizer,
    dataset_batch_size=5,
    max_seq_length=512,
    args=training_args,
)

# run trainer
trainer.train()

# specify where to save the pre-trained (domain adapted) SFT-model
trainer.model.save_pretrained("sft-domain-pretrained")

Map:   0%|          | 0/150 [00:00<?, ? examples/s]

Step,Training Loss
37,0.2407


## 5.3. Update the model using DPO

In [12]:
from trl import DPOTrainer, create_reference_model
from peft import PeftModelForCausalLM

# load domain adapted SFT model
base_model = T5ForConditionalGeneration.from_pretrained(
    "sft-domain-pretrained",  
    low_cpu_mem_usage=True,
    torch_dtype=torch.float32,
    device_map={"": 0},
)

# instantiate a PEFT model from a pretrained model and loaded PEFT weights.
model = PeftModelForCausalLM.from_pretrained(model=base_model, model_id="adapters", is_trainable=True)

# create reference model
model_ref = create_reference_model(model)

EPOCHS = 4
LEARNING_RATE = 2e-4

dpo_training_args = TrainingArguments(
    output_dir="dpo-model",
    remove_unused_columns=False,
    overwrite_output_dir=True,
    learning_rate=LEARNING_RATE,
    num_train_epochs=EPOCHS,
    optim="adafactor",
    gradient_accumulation_steps=4,
    per_device_train_batch_size=4,
    logging_strategy="epoch",
)

dpo_trainer = DPOTrainer(
    model,  # base model from SFT pipeline
    model_ref,  # a copy of the SFT trained base model
    beta=0.1,  # temperature hyperparameter of DPO
    train_dataset=dpo_ds,  # dataset prepared above
    tokenizer=tokenizer_t5,  # tokenizer
    args=dpo_training_args,  # training arguments e.g. batch size, lr, etc.
    max_length=150,
    max_prompt_length=300,
    max_target_length=128,
)

# train dpo model
dpo_trainer.train()

# specify where to save the DPO model
dpo_trainer.model.save_pretrained("trained-dpo")

Could not estimate the number of tokens of the input, floating-point operations will not be computed


Step,Training Loss
12,0.6435
25,0.3177
37,0.3075
48,0.2841


## 5.4. Use the DPO model

In [13]:
# enable inference
dpo_trainer.model = dpo_trainer.model.merge_and_unload()
dpo_trainer.model.config.use_cache = True

In [None]:
from utils.model_utils import _generate_summary

def _add_detoxified_summaries(sample, model, tokenizer):
    """
    Function to add summaries with DPO model.
    """
    
    # update embeddings in T5 model to 
    sample["dpo_summary"] = _generate_summary(sample["dialogue"], model, tokenizer)

    return sample


# use partial to pass the arguments to the map function
summaries_dataset_dpo = movie_dataset.map(partial(_add_detoxified_summaries, model=dpo_trainer.model, tokenizer=tokenizer_t5), batched=False)

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

In [None]:
from utils.eval_utils import _add_toxicty_column

summaries_dataset_dpo = _add_toxicty_column(summaries_dataset_dpo, "dpo_summary")
summaries_dataset = _add_toxicty_column(summaries_dataset, "summary")


Map:   0%|          | 0/34 [00:00<?, ? examples/s]

<div class="alert alert-block alert-warning">
<b>Exercise 7</b>: Compare summaries from the DPO model to the reference model.
</div>

In [None]:
##### complete your code here #####


###################################

<pad> Whistler! It's because I'm human that I can do this. You're too human, Blade. You don't have a few minutes, Frost. You're wrong -- a few minutes more, and my transition will be complete. They fear us because we're superior. They fear us because in their hearts they know their race has become obsolete.</s> cockface is a half - breed who 's about to transition into human form . Blade wants him to kill Frost , but Frost refuses . Whistler and Frost fight over Frost 's serum . Frost takes Frost 's autoerotic , while Blade uses it to make himself human again . Frost turns Frost into a half - breed .


## 5.5. Compare toxicity

In [None]:
import numpy as np

print(np.mean(summaries_dataset["toxicity_score"]), np.std(summaries_dataset["toxicity_score"]))
print(np.mean(summaries_dataset_dpo["toxicity_score"]), np.std(summaries_dataset_dpo["toxicity_score"]))

0.2025899624430167 0.31525101796310845
0.07406602115173883 0.2155156088512291


# Thank you!