<img src=banner.png>

# <a name="0">Measuring and Mitigating Toxicity in Large Language Models</a>

Building and operating machine learning applications responsibly requires an active, consistent approach to preventing, assessing, and mitigating harm. This workshop gives you hands-on experience identifying toxicity in LLM output and mitigating it.

In this workshop you will:
1. <a href="#1">Define the problem:</a> load a dataset and look at examples
2. <a href="#2">Explore the starting toxicity:</a> format data and apply a classifier
3. <a href="#3">Use an LLM to generate summaries:</a> load the model, create prompts, explore output
4. <a href="#4">Evaluate summaries for toxicity:</a> apply the toxicity classifier, compare toxicity values
5. <a href="#5">Mitigate toxicity using guardrails:</a> hide unwanted words and filter profanity

**Runtime**

This notebook takes about 90 minutes to complete (using some built-in shortcuts).

**Kernel Selection**

By default, the notebook will open with the correct image and kernel. If prompted to select a kernel, choose the image `PyTorch 2.0.1 Python 3.10`, kernel `Python 3`, and instance `ml.g4dn.2xlarge`.

<div style="text-align: center;"><img src="kernel.png" alt="select the PyTorch 2.0.1 Python3.10 image, Python 3 kernel, and ml.g4dn.2xlarge instance type." border=1 width="400"/>
</div>

**Cells and Sections**

This notebook includes *markdown cells* with instructions for you to read, and *code cells* for you to edit and run. To run a code cell, click on the cell and type `Shift+Enter` on your keyboard. The cell will execute and any output will appear in the notebook, just below the cell.

Some cells take time to run. An asterisk `[*]` next to the cell means it is still running. A number `[1]` next to the cell shows you the order in which the cells were executed.

Each section of the notebook is self-contained. To start in the middle, catch up, or recover from a runtime error, you can start at the beginning of any section, where shortcut functions will load the data and models needed to continue.

**OutOfMemoryError**

This notebook is memory-intensive. Run cleanup cells at the end of each section. If you see an OutOfMemoryError, just restart your kernel and jump back into the notebook at the beginning of your current section.

To check the available GPU memory at any time, run a code cell with this command: `!nvidia-smi`
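
If you would rather read the memory figures from Python than from the full `nvidia-smi` table, the following is a minimal sketch. It assumes `nvidia-smi` is on the PATH (as it usually is on GPU instances) and returns `None` otherwise. The helper name `gpu_memory_mib` is hypothetical and is not part of the workshop utilities.

```python
import shutil
import subprocess


def gpu_memory_mib():
    """Return [(used_mib, total_mib), ...] per GPU, or None if it cannot be read."""
    # Without the nvidia-smi binary there is nothing to query.
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        result = subprocess.run(
            [
                "nvidia-smi",
                "--query-gpu=memory.used,memory.total",
                "--format=csv,noheader,nounits",
            ],
            capture_output=True,
            text=True,
            check=True,
        )
        usage = []
        # Each output line looks like "1234, 15360" (values in MiB).
        for line in result.stdout.strip().splitlines():
            used, total = (int(part.strip()) for part in line.split(","))
            usage.append((used, total))
        return usage
    except (subprocess.CalledProcessError, OSError, ValueError):
        # Driver problems or unexpected output: report "unknown".
        return None


print(gpu_memory_mib())
```

A result of `None` means the figures could not be read, not that memory is free.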

**Content Warning**

This notebook surfaces toxic content in an existing dataset and may generate new toxic or harmful content. Please be mindful of your mental health, emotional health, and safety. 

If you're joining us at a live workshop, support staff are here to help you with technical challenges, answer questions, and help you get the most out of the workshop. 

<div class="alert alert-block alert-warning">
<b>Exercises</b>: Work at your own pace, and pause when you reach an Exercise. We will regroup for discussion at each Exercise before moving to the next section.</div>

# 0. Setup

Select the default image (PyTorch 2.0.1 Python 3.10 GPU Optimized) and kernel (Python 3) in your notebook instance.



Next, upgrade [pip](https://pypi.org/project/pip/) (a Python package management system) and install all required libraries from the provided requirements.txt file.

In [None]:
# Prerequisites
# This takes about 1 minute to run
!pip install -q -U pip --root-user-action=ignore 
!pip3 install -q -r requirements.txt --root-user-action=ignore
!python3 -m spacy download en_core_web_sm -q --root-user-action=ignore 

In [None]:
# Imports
import gc
import warnings

warnings.filterwarnings(
    action="ignore",
    category=UserWarning,
)
import transformers, torch

transformers.logging.set_verbosity_error()
from tqdm.auto import tqdm as notebook_tqdm

# <a name="1">1. Define the Problem</a>
(<a href="#0">Go to top</a>)

Your team is developing a film summarization feature that helps customers to quickly find a film they want to watch. Given a transcript of the film, your system will produce a short summary. You plan to use Generative AI for this problem, but you know that large language models can sometimes produce undesirable output. 

<b>Your task is to use a pre-trained large language model to produce film summaries, measure the toxicity present in the summaries, and mitigate this toxicity using guardrail filters.</b>



## <a name="1.1">1.1. Load a dataset</a>
(<a href="#0">Go to top</a>)

The first step is to understand your data. 

In this notebook, the "[Cornell Movie-Dialogs Corpus](https://convokit.cornell.edu/documentation/movie.html)" will serve as your film database. This corpus is a large metadata-rich collection of fictional conversations extracted from raw movie scripts. The dataset contains 220,579 conversational exchanges between 10,292 pairs of movie characters in 617 movies.


In [None]:
from utils.data_utils import _prepare_data

# Load the data. Takes about 1 minute
movie_df = _prepare_data()

## 1.2. Look at some examples
First, look at the structure. We have movie titles, the full text script ("dialogue"), and a genre for every movie. 

In [None]:
# Look at the first 2 rows of data
movie_df.head(2)

There are 18 genres. We will focus on two of the most common: action and comedy.

In [None]:
# List the available genres
movie_df["genre"].value_counts().plot.bar()

Next, have a look at some text snippets from action and comedy films. The cell below will return different examples every time you run it. 

In [None]:
from utils.data_utils import _explore_genres

# View a random sample from several genres in the dataset. Run this cell multiple times to view different examples.
_explore_genres(movie_df, ["action", "comedy"])

<div class="alert alert-block alert-warning">
<b>Exercise 1</b>: Edit the code block above to explore other film genres. What do you notice in these snippets? Does one genre use more explicit language than another? How do these differences confirm or contradict your expectations?
</div>

*This cell is for you! Double-click to edit. Write your notes or answers to the exercise here.* 

# <a name="2">2. Explore the starting toxicity</a>
(<a href="#0">Go to top</a>)

Now that you have examined your data, you can reformat it and apply a toxicity classifier. 

## <a name="2.1">2.1 Format data for processing</a>

Machine learning models, including LLMs and toxicity classifiers, require the data to be stored in a specific format. Use the [HuggingFace 🤗 Datasets](https://huggingface.co/docs/datasets/index) library to convert the dataframe.

In [None]:
# If you're continuing from Section 1, skip this cell.

# Shortcut: If you're starting from Section 2, load your data now. It takes a minute.
from utils.data_utils import _prepare_data

movie_df = _prepare_data()

In [None]:
from datasets import Dataset

# convert the data
movie_dataset = Dataset.from_pandas(movie_df)

# show the data
movie_dataset

You can see that there are 617 distinct movies. To move through the remainder of the notebook more quickly, select the first 200 samples.

In [None]:
# select a sample of 200
dataset = movie_dataset.select(range(200))

# save the dataset to disk
dataset.save_to_disk("movie_dataset")

## <a name="2.2">2.2 Apply a toxicity classifier</a>

The <a href="https://huggingface.co/facebook/roberta-hate-speech-dynabench-r4-target">LFTW R4 Target model</a> is a hate speech detector based on the <a href="https://arxiv.org/abs/1907.11692">RoBERTa</a> architecture. According to its model card, it comes from the "Learning from the Worst" (LFTW) work on dynamically generated hate speech datasets, where R4 refers to the fourth round of data collection. This classifier is publicly available and easy to use on a HuggingFace 🤗 dataset like the one we just created.

Let's explore the toxicity of our dataset according to this model. 

In [None]:
from utils.eval_utils import _add_toxicty_column

# Calculate toxicity of the dialogue in each film, using LFTW R4. Add toxicity as a column to our dataset.
dataset = _add_toxicty_column(dataset, "dialogue")

In [None]:
import numpy as np

# View the toxicity of the full dataset
print("overall toxicity ", np.mean(dataset["toxicity_score"]))

# View the toxicity of several genres of data
print(
    "action toxicity: ",
    np.mean(
        dataset.filter(lambda example: example["genre"] == "action")["toxicity_score"]
    ),
)
print(
    "comedy toxicity: ",
    np.mean(
        dataset.filter(lambda example: example["genre"] == "comedy")["toxicity_score"]
    ),
)

As you can see, the action genre text shows higher toxicity than the comedy genre text.

<div class="alert alert-block alert-warning">
<b>Exercise 2</b>: Add code that calculates the toxicity for more genres in the cell below. Do you agree with these toxicity judgments? Does LFTW R4 toxicity confirm or contradict your impression of toxicity in Exercise 1? 
</div>

In [None]:
##### complete your code here #####


###################################

## 2.3 Clean up

In [None]:
import gc

# Delete old objects with `del`.
del movie_dataset, movie_df, dataset

# Release memory after deleting objects.
gc.collect()

<div class="alert alert-block alert-success">
<b>Conclusion</b>: In this section, you loaded a movie transcript dataset and converted it into a HuggingFace Dataset object. Then you applied the LFTW R4 toxicity classifier to explore the toxicity of the source data.
</div>

# <a name="3">3. Load and use a Large Language Model</a>
(<a href="#0">Go to top</a>)

[T5 (Text-To-Text Transfer Transformer)](https://github.com/google-research/text-to-text-transfer-transformer) is an encoder-decoder model pre-trained on a multi-task mixture of unsupervised and supervised tasks. T5 works well on a variety of tasks out-of-the-box by prepending a different prefix to the input corresponding to each task. Tasks include machine translation, **document summarization**, question answering, and classification (e.g., sentiment analysis). 

<div style="text-align: center;">
<img src="https://camo.githubusercontent.com/623b4dea0b653f2ad3f36c71ebfe749a677ac0a1/68747470733a2f2f6d69726f2e6d656469756d2e636f6d2f6d61782f343030362f312a44304a31674e51663876727255704b657944387750412e706e67" width="700"/>
</div>

For more details have a look at the T5 documentation on HuggingFace 🤗 [here](https://huggingface.co/docs/transformers/model_doc/t5).

## 3.1. Load the T5 model

First, download the T5 model using the `T5ForConditionalGeneration` class provided by the [HuggingFace 🤗 transformers library](https://github.com/huggingface/transformers).

In [None]:
from transformers import T5ForConditionalGeneration

In [None]:
import torch

# load the model with GPU as the preferred device type
model_t5 = T5ForConditionalGeneration.from_pretrained(
    "google/flan-t5-large",
    device_map={"": 0},  # this will load the model in GPU
    torch_dtype=torch.float32,
    return_dict=True,
)

## 3.2 Create a prompt
Let's create a prompt by joining an instruction to summarize text with the actual movie script.

LLM prompts can be very elaborate, as ultimately the prompt is the only input the LLM sees - the better the prompt, the better the result. 

Each LLM may have its own prompting requirements. In the case of T5, the summarization prompts used for pre-training all included the keyword *summarize*, so you should use that in your prompt.

In [None]:
from datasets import load_from_disk

# Load the dataset you created and saved to disk in Section 2
dataset = load_from_disk("movie_dataset")

In [None]:
# Create a prompt for the item at index [nth] from dataset.
def get_inference_prompt(dataset, nth):
    """Return an LLM summarization prompt for the nth item in dataset."""
    inference_prompt = (
        "Summarize the following conversation from a movie script:  \n\n'''%s'''"
        % dataset[nth]["dialogue"]
    )
    return inference_prompt

In [None]:
# Print out the prompt for one item in the dataset. The full prompt includes the entire script, so just print the first 235 characters.
inference_prompt = get_inference_prompt(dataset, 0)
print(inference_prompt[:235])

## 3.3 Tokenize LLM inputs

This plain-text version of the prompt is easy for humans to read. But before this prompt can be processed by T5, it gets converted into *tokens*.  

Initialize an instance of `AutoTokenizer` to use with your T5 model.


In [None]:
from transformers import AutoTokenizer

# load the tokenizer
tokenizer_t5 = AutoTokenizer.from_pretrained(
    "google/flan-t5-large",
    skip_special_tokens=True,
    return_tensors="pt",
    truncation=True,
    use_fast=True,
)

Here is the tokenized version of the prompt you just created.

In [None]:
print(tokenizer_t5(inference_prompt[:235]).input_ids)

**The number of tokens passed to an LLM through the tokenizer should not be greater than the number of tokens used in pre-training**. T5 was pre-trained using 512 input tokens, and with `truncation=True` in our tokenizer, all text beyond 512 tokens will be truncated.

For English, 1 token is approximately 4 characters or 0.75 words, so this tokenizer will cut off our movie scripts at around 385 words. 
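
The arithmetic behind that estimate can be sketched in a few lines. The constants below are the rules of thumb quoted above, not measurements from the T5 tokenizer, and the helper names are hypothetical. The truncation function is deliberately simplified: a real tokenizer may also reserve a position for an end-of-sequence token.

```python
import math

MAX_INPUT_TOKENS = 512  # input limit described above
WORDS_PER_TOKEN = 0.75  # rough rule of thumb for English
CHARS_PER_TOKEN = 4     # rough rule of thumb for English


def approx_word_budget(max_tokens=MAX_INPUT_TOKENS):
    """Approximate number of English words that fit into max_tokens."""
    return int(max_tokens * WORDS_PER_TOKEN)


def estimate_tokens(text):
    """Estimate the token count of English text from its character length."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def truncate_ids(token_ids, max_tokens=MAX_INPUT_TOKENS):
    """Keep only the first max_tokens ids; everything after them is dropped."""
    return token_ids[:max_tokens]


print(approx_word_budget())                   # 384
print(estimate_tokens("a" * 10_000))          # 2500
print(len(truncate_ids(list(range(2000)))))   # 512
```

In this sketch, a script of 10,000 characters is estimated at about 2,500 tokens, so roughly four fifths of it would never reach the model.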

## 3.4. Use T5 to generate a movie summary


To generate a summary with T5 you need an inference pipeline:

    Encode the input (tokenization) -> Pass the tokens through the model -> Decode model outputs back to text

Now that `tokenizer_t5` and `model_t5` are initialized, you can execute this pipeline and produce film summaries 🥳


Try it out below.

In [None]:
from utils.model_utils import _generate_summary, _format_llm_output


# Define the pipeline
def inference_pipeline(dataset, nth):
    """Run inference pipeline to generate a summary of the nth item in the dataset and return the formatted result."""
    print("Title: ", dataset[nth]["movie"], "\nGenre: ", dataset[nth]["genre"])
    print("Summary:")
    return _format_llm_output(
        _generate_summary(get_inference_prompt(dataset, nth), model_t5, tokenizer_t5)
    )

In [None]:
# Run the pipeline on one film from the database
inference_pipeline(dataset, 199)

## 3.5 Compare summaries for truncated and chunked input

You may notice that important plot points and characters from these movies are missing from their summaries. This is due to the 512 token limit described above. Only the first 512 tokens (~385 words) of the script were used to generate this summary.

We can work around the token limit using *chunking*. This means splitting the movie transcript into smaller chunks and summarizing the chunks one by one. The smaller summaries are then recombined to produce the final output. An important first step before splitting the transcript is teaching the model the vocabulary of the movie scripts by resizing the embedding space; this is where toxicity can easily enter a model.

<div style="text-align: center;">
<img src="map_chain.png" width="900"/>
</div>
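
The chunk-then-recombine idea described above can be sketched in plain Python. This is an illustration only, not the code that produced the workshop's pre-computed summaries. `summarize_stub` is a hypothetical stand-in for an LLM call, and the 350-word chunk size is an assumption chosen to stay under the approximate word budget while leaving room for the instruction text.

```python
def split_into_chunks(text, max_words=350):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [
        " ".join(words[i : i + max_words])
        for i in range(0, len(words), max_words)
    ]


def summarize_stub(text, max_words=20):
    """Hypothetical stand-in for an LLM call: keeps the first max_words words."""
    return " ".join(text.split()[:max_words])


def summarize_long_text(text, summarize=summarize_stub, max_words=350):
    """Map: summarize each chunk. Reduce: summarize the joined chunk summaries."""
    chunks = split_into_chunks(text, max_words)
    partial_summaries = [summarize(chunk) for chunk in chunks]
    return summarize(" ".join(partial_summaries))


# A fake 1,000-word "script" so the example runs without a model.
script = " ".join(f"word{i}" for i in range(1000))
print(len(split_into_chunks(script)))  # 3
print(summarize_long_text(script))
```

To try this with a real model, pass a function that wraps your own inference call as the `summarize` argument.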

In [None]:
from datasets import load_dataset

# Load a dataset with summaries that were produced using chunking.
summaries = load_dataset("csv", data_files="summaries_dataset.csv", split="train")

In [None]:
# Generate a summary using the original inference pipeline
inference_pipeline(dataset, 199)

In [None]:
# Compare to the summary that uses chunking
summary = summaries.filter(lambda example: example["movie"] == ("harold and maude"))
_format_llm_output(summary["summary"][0])

<div class="alert alert-block alert-warning">
<b>Exercise 3</b>: Use the inference pipeline to generate more summaries in the code cells below. You can also explore more summaries that use chunking. How does the language in the summary look different from the language in the scripts from Exercise 1? Do you see any differences in the style, or in the amount of toxic language?    
</div>

In [None]:
##### complete your code here #####

# Generate a summary for a different movie, using the original inference pipeline

# Note: there are 200 items in your dataset. The first item has index 0. The last item has index 199.

###################################

In [None]:
##### complete your code here #####

# Compare to the summary that uses chunking, for the same movie

###################################

## 3.6. Clean up

Before proceeding, delete the prompts that were used for inference; e.g. <code>del inference_prompt</code> and also clear the instance memory with <code>gc.collect()</code>. 
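
As a side note on why `gc.collect()` follows `del`: deleting a name only removes one reference, and objects that refer to each other in a cycle stay in memory until the garbage collector runs. The sketch below shows this with a small, hypothetical `Node` class; it does not touch the workshop objects.

```python
import gc
import weakref


class Node:
    """Small object that can hold a reference to another Node."""

    def __init__(self):
        self.other = None


a, b = Node(), Node()
a.other, b.other = b, a   # reference cycle: a -> b -> a
probe = weakref.ref(a)    # lets us check whether the object still exists

del a, b                  # the names are gone, but the cycle still holds both objects
gc.collect()              # the cycle collector frees them
print(probe() is None)    # True
```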

In [None]:
del inference_prompt, summaries, inference_pipeline, dataset

In [None]:
import gc

gc.collect()

<div class="alert alert-block alert-success">
<b>Conclusion</b>: At this point, you have summaries for all the movies and it is time to check whether those summaries contain any hate speech, slurs, or toxic remarks. You may expect the toxicity values in a summarization task to be low unless the text being summarized already contains toxic speech. However, the model may amplify toxicity that is present in the input data, leading to higher toxicity in the summaries than in the LLM inputs.
</div>

# <a name="4"> 4. Evaluate LLM-generated summaries for toxicity</a>
(<a href="#0">Go to top</a>)

In Section 2, you used the LFTW R4 model to measure toxicity in movie scripts. In Section 3, you used these scripts as input to the T5 LLM to generate summaries.

In this section, you will apply LFTW R4 to the LLM output, to see how toxicity in the inputs may have been amplified by the language model.
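
One way to make the input-versus-output comparison explicit is to compute the change in mean toxicity per genre. The sketch below uses plain Python and made-up scores that are not workshop results. The helper name `mean_by_genre` is hypothetical; the `genre` and `toxicity_score` keys mirror the column names used in this notebook.

```python
from collections import defaultdict
from statistics import mean


def mean_by_genre(rows, score_key="toxicity_score"):
    """Group rows by genre and return {genre: mean score}."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["genre"]].append(row[score_key])
    return {genre: mean(scores) for genre, scores in grouped.items()}


# Made-up scores for illustration only; they are not workshop results.
inputs = [
    {"genre": "action", "toxicity_score": 0.20},
    {"genre": "action", "toxicity_score": 0.40},
    {"genre": "comedy", "toxicity_score": 0.10},
]
outputs = [
    {"genre": "action", "toxicity_score": 0.50},
    {"genre": "action", "toxicity_score": 0.30},
    {"genre": "comedy", "toxicity_score": 0.05},
]

before, after = mean_by_genre(inputs), mean_by_genre(outputs)
for genre in before:
    change = after[genre] - before[genre]
    print(f"{genre}: input={before[genre]:.2f} output={after[genre]:.2f} change={change:+.2f}")
```

A positive change means the summaries score as more toxic than the scripts they were generated from.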

## 4.1. Compare input toxicity to output toxicity

In [None]:
from datasets import load_dataset, load_from_disk
from utils.eval_utils import _add_toxicty_column
import numpy as np

# Calculate toxicity of the dialogue in each film, using LFTW R4. Add toxicity as a column to our dataset
dataset = load_from_disk("movie_dataset")
dataset = _add_toxicty_column(dataset, "dialogue")

In [None]:
# For reference: review the overall toxicity of the input data
print("overall toxicity ", np.mean(dataset["toxicity_score"]))

# For reference: review the toxicity of the action and comedy genres
print(
    "action toxicity: ",
    np.mean(
        dataset.filter(lambda example: example["genre"] == "action")["toxicity_score"]
    ),
)
print(
    "comedy toxicity: ",
    np.mean(
        dataset.filter(lambda example: example["genre"] == "comedy")["toxicity_score"]
    ),
)

In [None]:
# Calculate the toxicity of the summaries. Add toxicity as a column to the summaries dataset
summaries_dataset = load_dataset(
    "csv", data_files="summaries_dataset.csv", split="train"
)
summaries_dataset = _add_toxicty_column(summaries_dataset, "summary")

In [None]:
# Compare to the overall toxicity of the summaries
print("overall toxicity ", np.mean(summaries_dataset["toxicity_score"]))

# Compare to the toxicity of action summaries and comedy summaries
print(
    "mean action toxicity: ",
    np.mean(
        summaries_dataset.filter(lambda example: example["genre"] == "action")[
            "toxicity_score"
        ]
    ),
)
print(
    "mean comedy toxicity: ",
    np.mean(
        summaries_dataset.filter(lambda example: example["genre"] == "comedy")[
            "toxicity_score"
        ]
    ),
)

## 4.2. Compare mean toxicity to max toxicity

The mean toxicity of a dataset, or of a genre, can sometimes mask the true impact of toxicity in that data.
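
A small illustration with made-up numbers (not workshop results) shows how this can happen: two sets of scores with the same mean can differ sharply in their worst case. The 0.95 cut-off is an assumption chosen for the example.

```python
from statistics import mean

# Made-up toxicity scores for illustration only; they are not workshop results.
mostly_benign = [0.01, 0.02, 0.01, 0.03, 0.98]   # one highly toxic summary
uniformly_mild = [0.20, 0.22, 0.19, 0.21, 0.23]  # no extreme summary

for name, scores in [("mostly_benign", mostly_benign), ("uniformly_mild", uniformly_mild)]:
    # Share of summaries above an illustrative cut-off of 0.95.
    share_flagged = sum(score > 0.95 for score in scores) / len(scores)
    print(
        f"{name}: mean={mean(scores):.2f} "
        f"max={max(scores):.2f} share>0.95={share_flagged:.0%}"
    )
```

Both lists average 0.21, yet only the first contains a summary that a customer would be likely to experience as harmful.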

Compare the mean toxicity to the maximum toxicity in our two focus genres: comedy and action.

In [None]:
print(
    "max action toxicity: ",
    np.max(
        summaries_dataset.filter(lambda example: example["genre"] == "action")[
            "toxicity_score"
        ]
    ),
)
print(
    "max comedy toxicity: ",
    np.max(
        summaries_dataset.filter(lambda example: example["genre"] == "comedy")[
            "toxicity_score"
        ]
    ),
)

## 4.3. Compare high-toxicity outputs to low-toxicity outputs

<i class="fas fa-exclamation-triangle" style="color: #ea0606;"></i> **Content Warning** <i class="fas fa-exclamation-triangle" style="color: #ea0606;"></i> These summaries contain explicit profanity.

In [None]:
from utils.model_utils import _format_llm_output

toxic_summaries = summaries_dataset.filter(lambda example: example["toxicity_score"] > 0.95)

# There are 9 of these, starting with 0. 
toxic_id = 1
print("**Toxic summary**")
print("Title:", toxic_summaries["movie"][toxic_id])
print("Genre:", toxic_summaries["genre"][toxic_id])
_format_llm_output(toxic_summaries["summary"][toxic_id])

In [None]:
benign_summaries = summaries_dataset.filter(lambda example: example["toxicity_score"] < 0.05)

# There are 113 of these, starting with 0.
benign_id = 112
print("**Benign summary**")
print("Title:", benign_summaries["movie"][benign_id])
print("Genre:", benign_summaries["genre"][benign_id])
_format_llm_output(benign_summaries["summary"][benign_id])

<div class="alert alert-block alert-warning">
<b>Exercise 4</b>: Edit the toxic_id and benign_id above, to explore more summaries. Why is the max toxicity similar in these two genres, when the mean toxicity is very different? What is the impact of maximum and mean toxicity on customers who use your summarization feature?
</div>

*This cell is for you! Double-click to edit. Write your notes or answers to the exercise here.* 

## 4.4. Clean up

In [None]:
del summaries_dataset, dataset
import gc, torch

gc.collect()
torch.cuda.empty_cache()

<div class="alert alert-block alert-success">
<b>Conclusion</b>: We have seen that some summaries are toxic and would like to remediate this. The first option to mitigate toxicity would be to use a protective wrapper around the LLM itself. This is called a guardrail and is a very useful technique to employ whenever you don't have access to the model itself, or you lack sufficient time or compute resources to make any modifications to the LLM. 
</div>

# <a name="5"> 5. Mitigate toxicity using Guardrails</a>
(<a href="#0">Go to top</a>)

In this section you will explore coding examples for *guardrails*. These are post-processing tools that can filter certain keywords or that leverage metrics to decide if content is harmful. 

**Setup:** First, restart your kernel.



In [None]:
import IPython

IPython.get_ipython().kernel.do_shutdown(restart=True)

## 5.1. Guardrails from a keyword list

The first guardrail you will try is a filter based on a fixed list of keywords.


To get ready, reload your data. The model is loaded later, when you apply the guardrail.


In [None]:
from datasets import load_from_disk

movie_dataset = load_from_disk("movie_dataset")

### 5.1.1. Create a Validator from a word list

Next you will build a guardrail from a keyword list using [Guardrails.ai](https://docs.guardrailsai.com/). First, you need a `Validator` to check for blocked words and define what happens when a blocked word is seen.

In the first `Validator`, you will make a list of words to block, and apply this to all summaries. Start with something simple that you have seen in LLM-generated summaries.

In [None]:
from guardrails.validators import *
from typing import Dict, Any


# provide a name for the validator to use in the RAIL spec later
@register_validator(name="is-keyword-free", data_type="string")
class IsKeywordFree(Validator):
    # the Validator class needs to contain a validate method
    def validate(self, value: Any, metadata: Dict) -> ValidationResult:
        
        # set up a list of forbidden words
        kw_list = ["Bitch"]

        # check for these forbidden words in the current string (case-sensitive)
        if any(kw in value for kw in kw_list):
            # replace every forbidden word in the output with ***
            censored_text = value
            for kw in kw_list:
                censored_text = censored_text.replace(kw, "***")
            # display error message and return the fix value
            return FailResult(
                error_message=f"Expression '{value}' contains forbidden keyword.",
                fix_value=censored_text,
            )
        # else return pass
        return PassResult()

This validator checks for words from our keyword list, and replaces them with the string `***`. 

In the `validate` method, you can check whether values are in a certain range, or check for keywords as our example shows. You can also define a corrective action to take, such as hiding problematic parts or refusing to create an output altogether. 

A full overview of all the possible corrective actions can be found [here](https://docs.guardrailsai.com/concepts/output/#specifying-corrective-actions).
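
Plain substring matching is case-sensitive and also matches inside longer words. As a possible refinement, the sketch below masks whole-word, case-insensitive matches using only the standard library. It is independent of the Guardrails API; the function name `censor_keywords` and the placeholder word list are assumptions made for illustration.

```python
import re


def censor_keywords(text, keywords, mask="***"):
    """Mask whole-word, case-insensitive keyword matches.

    Returns (censored_text, found), where found is True if anything was masked.
    """
    if not keywords:
        return text, False
    # \b anchors the match at word boundaries; re.escape guards special characters.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(kw) for kw in keywords) + r")\b",
        flags=re.IGNORECASE,
    )
    censored, n_hits = pattern.subn(mask, text)
    return censored, n_hits > 0


blocked = ["darn", "heck"]  # placeholder list for illustration
print(censor_keywords("Darn it, what the HECK? A darning needle.", blocked))
```

Here "Darn" and "HECK" are masked regardless of case, while "darning" is left alone because the keyword is only part of a longer word.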

### 5.1.2. Create a guardrail from your Validator

Once you have a validator, you pass it to a guard object using a `RAIL spec` (Reliable AI Markup Language specification). This is an XML document that specifies the validator you want to use and creates a placeholder for the prompt to pass through.

In [None]:
from utils.data_utils import _get_keyword_free_spec

# import rail spec to use
rail_str = _get_keyword_free_spec()

In [None]:
import guardrails as gd

# create a Guard object from the above RAIL string
guard = gd.Guard.from_rail_string(rail_str)

### 5.1.3. Apply your guardrail to LLM output

Finally, pass the movie dialogue you want summarized and checked with the guardrail to the Guard object.


In [None]:
from utils.model_utils import _my_llm_api

# This takes a few seconds to instantiate the model and run inference.

# Generate an LLM response, wrapped in a guardrail filter.
raw_llm_response, validated_response = guard(
    llm_api=_my_llm_api,
    prompt_params={"statement_to_be_summarized": movie_dataset[0]["dialogue"]},
)

In [None]:
# show the output.
print(f"Original Output: {raw_llm_response}\n")
print(f"Validated Output: {validated_response}")

## 5.2. Guardrails from a profanity classifier

It may be difficult to make a list that covers all of the words we want to block. Instead of using a pre-defined keyword list, you can use a classifier to determine whether the output is acceptable. Next, you will apply a pre-trained profanity classifier to the LLM output to decide whether your guardrail should block it.

You can also try a different corrective action. Instead of fixing the output string, the guardrail can block the entire output.
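
The difference from the fixing approach can be sketched without any library: the check is a function that flags the text, and a failed check replaces the whole output rather than parts of it. Both `guard_output` and `toy_classifier` are hypothetical names. The toy classifier is a trivial stand-in for a trained model so that the example runs on its own.

```python
def guard_output(text, is_unacceptable, fallback=""):
    """Return (validated_text, passed); block the whole text if the check fails."""
    if is_unacceptable(text):
        return fallback, False
    return text, True


def toy_classifier(text):
    """Hypothetical stand-in for a trained classifier: flags text containing 'darn'."""
    return "darn" in text.lower()


print(guard_output("A quiet film about gardening.", toy_classifier))
print(guard_output("Well, darn.", toy_classifier))
```

Blocking is the more cautious action, but it discards acceptable content along with the flagged part, so the choice depends on how costly a missing summary is for your application.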


### 5.2.1. Create the Validator and guardrail objects

Start with a new `Validator` object that uses the classifier.

In [None]:
from profanity_check import predict


@register_validator(name="is-profanity-free", data_type="string")
class IsProfanityFree(Validator):
    def validate(self, value: Any, metadata: Dict) -> ValidationResult:
        prediction = predict([value])
        if prediction[0] == 1:
            return FailResult(
                error_message="The result contains profanity and will be filtered.",
                fix_value="",
            )
        return PassResult()

Pass this validator into your guard object using a new rail string. 

In [None]:
from utils.data_utils import _get_metric_spec

# import the rail spec to use
rail_str = _get_metric_spec()

# create a Guard object from the above RAIL string
guard = gd.Guard.from_rail_string(rail_str)

### 5.2.2. Apply the new guardrail to LLM output

To demonstrate the profanity filter, you can try to summarize something with more starting toxicity. 

In this step, you will use the LLM to summarize a Reddit post. 

<i class="fas fa-exclamation-triangle" style="color: #ea0606;"></i> **Content Warning** <i class="fas fa-exclamation-triangle" style="color: #ea0606;"></i> This summary contains explicit profanity.

In [None]:
# Grab a test string with more profanities, to test our classifier and filter
from utils.data_utils import _reddit_test_string

# This takes several seconds, to instantiate the model and perform inference.

# Generate an LLM response, wrapped in a guardrail filter.
raw_llm_response, validated_response = guard(
    llm_api=_my_llm_api,
    prompt_params={"statement_to_be_summarized": _reddit_test_string},
)

In [None]:
# Show the output. Validated output should be empty, if the profanity filter worked.

print(f"Original Output: {raw_llm_response}\n")
print(f"Validated Output: {validated_response}")

### 5.2.3. Explore the input, output, and validated output

The Guardrails library also provides a visual overview of what the prompt, raw LLM output and validated output look like. 

Thanks to the guardrail, the validated output is empty.

<i class="fas fa-exclamation-triangle" style="color: #ea0606;"></i> **Content Warning** <i class="fas fa-exclamation-triangle" style="color: #ea0606;"></i> This summary contains explicit profanity.

In [None]:
guard.state.most_recent_call.tree

<div class="alert alert-block alert-warning">
<b>Exercise 5</b>: Did your guardrails work? Would you use this type of filter on the entire movie dataset? Why or why not? What are the strengths of a guardrail based on keywords? What are the weaknesses? Would these filters work on a different dataset?
</div>

*This cell is for you! Double-click to edit. Write your notes or answers to the exercise here.* 



<div class="alert alert-block alert-success">
<b>Conclusion</b>: You have seen that guardrails are an effective and lightweight method to mitigate toxic outputs by adding a validation layer around the call to the LLM. Guardrails are a good choice whenever you are looking for a solution that does not require retraining the LLM itself.
</div>

# Thank you!

If you're joining us live, please don't forget the <mark>in-app survey</mark>. Thanks for your time and see you at the next workshop!