# Exercise 5 - Guardrails

A guardrail is a generic term for detective controls that aim to steer your application. 

Greater steerability is a common requirement given the inherent randomness of LLMs, and so creating effective guardrails has become one of the most common areas of performance optimization when pushing an LLM from prototype to production.

Note: This notebook tackles guardrails as a generic term for detective controls around an LLM - for the official libraries that provide distributions of pre-built guardrails frameworks, please check out the following:

- [NeMo Guardrails](https://github.com/NVIDIA/NeMo-Guardrails/tree/main)
- [Guardrails AI](https://github.com/guardrails-ai/guardrails)


In [None]:
from llm_in_production.openai_utils import get_openai_client
import dotenv
import os
import asyncio

# This reads the .env file in your project and transforms its content into env variables.
# This way you don't have to hard code your secrets.
dotenv.load_dotenv()
# Here we create the client.
client = get_openai_client()


### Types of Guardrails

Guardrails are incredibly diverse and can be deployed to virtually any context you can imagine something going wrong with LLMs. 

In general, they can be categorised as follows:

1. **Input guardrails** that flag inappropriate content before it gets to your LLM

2. **Output guardrails** that validate what your LLM has produced before it gets to the customer


### Tradeoffs

When designing guardrails it is important to consider the trade-off between **accuracy**, **latency** and **cost**, where you try to achieve maximum accuracy for the least impact to your bottom line and the user's experience.


### Limitations

You should always consider the limitations of guardrails when developing your design. A few of the key ones to be aware of are:

- When using LLMs as a guardrail, be aware that they have the same vulnerabilities as your base LLM call itself. For example, a **prompt injection** attempt could be successful in evading both your guardrail and your actual LLM call.
- As conversations get longer, LLMs are more susceptible to **jailbreaking** as your instructions become diluted by the extra text.
- Guardrails can harm the user experience if you make them overly restrictive to compensate for the issues noted above. This manifests as **over-refusals**, where your guardrails reject innocuous user requests because there are similarities with prompt injection or jailbreaking attempts.

### Mitigations

If you can combine guardrails with rules-based or more traditional machine learning models for detection this can mitigate some of these risks. We've also seen customers have guardrails that only ever consider the latest message, to alleviate the risks of the model being confused by a long conversation.

We would also recommend doing a gradual roll-out with active monitoring of conversations so you can pick up instances of prompt injection or jailbreaking, and either add more guardrails to cover these new types of behaviour, or include them as training examples to your existing guardrails.

## Input Guardrails

Input guardrails aim to prevent inappropriate content getting to the LLM in the first place - some common use cases are:

- **Topical guardrails**: Identify when a user asks an off-topic question and give them advice on what topics the LLM can help them with.
- **Jailbreaking**: Detect when a user is trying to hijack the LLM and override its prompting.
- **Prompt injection**: Pick up instances of prompt injection where users try to hide malicious code that will be executed in any downstream functions the LLM executes.

In all of these they act as a preventative control, running either before or in parallel with the LLM, and triggering your application to behave differently if one of these criteria are met.

### Topic Guardrails

Let's create a **topic guardrail**, which uses gpt-3.5-turbo to detect off-topic questions and prevent the LLM from answering if triggered. This guardrail prioritises correctness over cost/latency.



### `async`
A common design to minimize latency is to send your guardrails asynchronously along with your main LLM call. If your guardrails get triggered you send back their response, otherwise send back the LLM response.

We'll use this approach, creating an execute_chat_with_guardrails function that will run our LLM's get_chat_response and the topical_guardrail guardrail in parallel, and return the LLM response only if the guardrail returns allowed.

We'll use this approach, creating an `execute_chat_with_guardrails` function that will run our LLM's `get_chat_response` and the `topical_guardrail` guardrail in parallel, and return the LLM response only if the guardrail returns `allowed`.

In [None]:
system_prompt = "You are a helpful assistant."

good_request = "What are the best breeds of dog for people that like cats?"
bad_request = "I want to talk about horses"

async def get_chat_response(user_request, system_prompt):
    print("Getting LLM response")
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_request},
    ]
    response = client.chat.completions.create(
        messages=messages, 
        model=os.environ["GPT_35_CHAT_MODEL_NAME"],
        temperature=0.5
    )
    print("Got LLM response")
    return response.choices[0].message.content


In [None]:
topical_system_prompt = "Your role is to assess whether the user question is allowed or not. The allowed topics are cats and dogs. If the topic is allowed, say 'allowed' otherwise say 'not_allowed'"

async def topical_guardrail(user_request, topical_system_prompt):
    print("Checking topical guardrail")
    messages = [
        {
            "role": "system",
            "content": topical_system_prompt,        },
        {"role": "user", "content": user_request},
    ]
    response = client.chat.completions.create(
        messages=messages, 
        model=os.environ["GPT_35_CHAT_MODEL_NAME"],
        temperature=0
    )

    print("Got guardrail response")
    return response.choices[0].message.content

In [None]:
topical_guardrail_message = "I can only talk about cats and dogs."

async def execute_topical_guardrail(user_request, system_prompt, topical_system_prompt, topical_guardrail_message):
    chat_task = asyncio.create_task(get_chat_response(user_request, system_prompt))
    topical_guardrail_task = asyncio.create_task(topical_guardrail(user_request, topical_system_prompt))

    while True:
        done, _ = await asyncio.wait(
            [topical_guardrail_task, chat_task], return_when=asyncio.FIRST_COMPLETED
        )
        if topical_guardrail_task in done:
            guardrail_response = topical_guardrail_task.result()
            if guardrail_response == "not_allowed":
                chat_task.cancel()
                print("Topical guardrail triggered")
                return topical_guardrail_message
            elif chat_task in done:
                chat_response = chat_task.result()
                return chat_response
        else:
            await asyncio.sleep(0.1)  # sleep for a bit before checking the tasks again

Let's call the main function with the good request.

In [None]:
response = await execute_topical_guardrail(good_request, system_prompt, topical_system_prompt, topical_guardrail_message)
print(response)

And now the bad request. 

In [None]:
response = await execute_topical_guardrail(bad_request, system_prompt, topical_system_prompt, topical_guardrail_message)
print(response)

Looks like our guardrail worked - the first question was allowed through, but the second was blocked for being off-topic. Now we'll extend this concept to moderate the response we get from the LLM as well.

## Output Guardrails

Output guardrails govern what the LLM comes back with. These can take many forms, with some of the most common being:

- **Hallucination/fact-checking guardrails**: Using a corpus of ground truth information or a training set of hallucinated responses to block hallucinated responses.
- **Moderation guardrails**: Applying brand and corporate guidelines to moderate the LLM's results, and either blocking or rewriting its response if it breaches them.
- **Syntax checks**: Structured outputs from LLMs can be returned corrupt or unable to be parsed - these guardrails detect those and either retry or fail gracefully, preventing failures in downstream applications. This is a common control to apply with function/tool calling LLMs, as will be discussed later. 

### Moderation Guardrails

Here we implement a **moderation guardrail** that uses a version of the [G-Eval](https://arxiv.org/abs/2303.16634) evaluation method to score the presence of unwanted content in the LLM's response. This method is demonstrated in more detail [here](https://github.com/openai/openai-cookbook/blob/main/examples/evaluation/How_to_eval_abstractive_summarization.ipynb).

To accomplish this we will make an extensible framework for moderating content:

1. We set a domain name, which describes the type of content we're going to moderate.
2. We provide criteria, which outline clearly what the content should and should not contain.
3. Step-by-step instructions are provided for the LLM to grade the content.
4. The LLM returns a discrete score from 1-5.


In [None]:
domain = "animal breed recommendation"

criteria = """
Assess the presence of explicit recommendation of cat or dog breeds in the content.
The content should contain only general advice about cats and dogs, not specific breeds to purchase."""

steps = """
1. Read the content and the criteria carefully.
2. Assess how much explicit recommendation of cat or dog breeds is contained in the content.
3. Assign an animal advice score from 1 to 5, with 1 being no explicit cat or dog breed advice, and 5 being multiple named cat or dog breeds.
"""

In [None]:

moderation_system_prompt = """
You are a moderation assistant. Your role is to detect content about {domain} in the text provided, and mark the severity of that content.

## {domain}

### Criteria

{scoring_criteria}

### Instructions

{scoring_steps}

### Content

{content}

### Evaluation (score only!)
"""

async def moderation_guardrail(chat_response, moderation_system_prompt):
    print("Checking moderation guardrail")
    mod_messages = [
        {"role": "user", "content": moderation_system_prompt.format(
            domain=domain,
            scoring_criteria=criteria,
            scoring_steps=steps,
            content=chat_response
        )},
    ]
    response = client.chat.completions.create(
        messages=mod_messages, 
        model=os.environ["GPT_35_CHAT_MODEL_NAME"],
        temperature=0
    )
    print("Got moderation response")
    return response.choices[0].message.content


In [None]:
moderation_guardrail_message = "Sorry, we're not permitted to give animal breed advice. I can help you with any general queries you might have."

async def execute_moderation_guardrail(user_request, system_prompt, moderation_system_prompt, moderation_guardrail_message):
    chat_task = asyncio.create_task(get_chat_response(user_request, system_prompt))

    while True:
        done, _ = await asyncio.wait(
            [chat_task], return_when=asyncio.FIRST_COMPLETED
        )
        if chat_task in done:
            chat_response = chat_task.result()
            moderation_response = await moderation_guardrail(chat_response, moderation_system_prompt)

            if int(moderation_response) >= 3:
                print(f"Moderation guardrail flagged (with a score of {int(moderation_response)})")
                return moderation_guardrail_message
            else:
                print(f"Passed moderation (with a score of {int(moderation_response)})")
                return chat_response
        else:
            await asyncio.sleep(0.1)  # sleep for a bit before checking the tasks again


In [None]:
# Adding a request that should pass the moderation guardrail
great_request = 'What is some advice you can give to a new dog owner?'

Notice that without the topical guardrail, if the model response does not trigger the moderation guardrail, off-topic questions will still be answered.

In [None]:
tests = [good_request, great_request, bad_request]

for test in tests:
    result = await execute_moderation_guardrail(test, system_prompt, moderation_system_prompt, moderation_guardrail_message)
    print(result)
    print('\n\n')
    



### Setting guardrail thresholds

Our output guardrail assesses the LLM's response and block anything scoring *a 3 or higher*. Setting this threshold is a common area for optimization - we recommend building an evaluation set and grading the results using a confusion matrix to set the right tolerance for your guardrail. The trade-off here is generally:

- More false positives leads to a fractured user experience, where customers get annoyed and the assistant seems less helpful.
- More false negatives can cause lasting harm to your business, as people get the assistant to answer inappropriate questions, or prompt inject/jailbreak it.

For example, for jailbreaking you may want to have a very low threshold, as the risk to your business if your LLM is hijacked and used to produce dangerous content that ends up on social media is very high. However, for our use case we're willing to accept a few false negatives, as the worst that could happen is someone ends up with a Bichon Frise who might have been better suited to a Labrador, which though sad will probably not cause lasting damage to our business (we hope).

## Exercise 5a


Your organization has been tasked by the English Premier League to create a chatbot that will be used by players of their official Fantasy Football game to answer questions about the teams and players.

However, the Premier League is concerned that people might ask the chatbot for advice about betting.

It is your job to implement guardrails for a prototype that:

1. Keep the chatbot focused on the topic of the English Premier League.
2. Ensure responses don't contain any advice regarding betting or gambling.


In [None]:
system_prompt = "You are a helpful assistant that answers question to do with the English Premier League."

Below is a helpful function for cleaning your answer prompts and guardrail messages:

In [None]:
def clean_answer(text):
    # Define the substrings to remove
    start_substring = "\n# YOUR CODE HERE START"
    end_substring = "\n# YOUR CODE HERE END"

    # Remove substrings from text
    return text.replace(start_substring, "").replace(end_substring, "")


### Exercise 5ai

Write a topical system prompt and guardrail message to ensure it only answers on-topic questions.

In [None]:
topical_system_prompt = """
# YOUR CODE HERE START
Your role is to assess whether the user question is allowed or not. The allowed topic is anything to do with the English Premier League.

For example, its history, teams, managers, players and Fantasy Football game. 

If the topic is allowed, say 'allowed' otherwise say 'not_allowed'
# YOUR CODE HERE END
"""

topical_guardrail_message = """
# YOUR CODE HERE START
I can only talk about the English Premier League.
# YOUR CODE HERE END
"""

# Clean answers
topical_system_prompt = clean_answer(topical_system_prompt)
topical_guardrail_message = clean_answer(topical_guardrail_message)

### Exercise 5aii

Write out the domain, criteria and steps to be used in the moderation system prompt, as well as the moderation guardrail message, to ensure responses do not contain any advice on betting/gambling.

In [None]:
domain = """
# YOUR CODE HERE START
sports betting
# YOUR CODE HERE END
"""

criteria = """
# YOUR CODE HERE START
Assess the presence of explicit mention of sports betting in the content.
The content should contain information about the English Premier League and not provide any gambling advice.
# YOUR CODE HERE END
"""

steps = """
# YOUR CODE HERE START
1. Read the content and the criteria carefully.
2. Assess how much explicit discussion of gambling is contained in the content.
3. Assign a sports betting score from 1 to 5, with 1 being no explicit sports betting advice, and 5 being detailed sports betting advice.
# YOUR CODE HERE END
"""


# Clean answers
domain = clean_answer(domain)
criteria = clean_answer(criteria)
steps = clean_answer(steps)

# Check you moderation system prompt
print(moderation_system_prompt.format(
            domain=domain,
            scoring_criteria=criteria,
            scoring_steps=steps,
            content="MODEL RESPONSE TO BE ADDED HERE"
        ))

In [None]:
moderation_guardrail_message = """
# YOUR CODE HERE START
Sorry, we're not permitted to give gambling advice. I can help you with any queries you might have about Teams and Players in the Premier League though.
# YOUR CODE HERE END
"""

moderation_guardrail_message = clean_answer(moderation_guardrail_message)

### Exercise 5ciii

Write an `execute all guardrails` function that implements both topical and moderation guardrails.

In [None]:
async def execute_all_guardrails(user_request, system_prompt, topical_system_prompt, topical_guardrail_message, moderation_system_prompt, moderation_guardrail_message):
    chat_task = asyncio.create_task(get_chat_response(user_request, system_prompt))
    # YOUR CODE HERE START
    topical_guardrail_task = asyncio.create_task(topical_guardrail(user_request, topical_system_prompt))

    while True:
        done, _ = await asyncio.wait(
            [topical_guardrail_task, chat_task], return_when=asyncio.FIRST_COMPLETED
        )
        if topical_guardrail_task in done:
            guardrail_response = topical_guardrail_task.result()
            if guardrail_response == "not_allowed":
                chat_task.cancel()
                print("Topical guardrail triggered")
                return topical_guardrail_message
            elif chat_task in done:
                chat_response = chat_task.result()
                moderation_response = await moderation_guardrail(chat_response, moderation_system_prompt)

                if int(moderation_response) >= 3:
                    print(f"Moderation guardrail flagged (with a score of {int(moderation_response)})")
                    return moderation_guardrail_message
                else:
                    print(f"Passed moderation (with a score of {int(moderation_response)})")
                    return chat_response
        else:
            await asyncio.sleep(0.1)  # sleep for a bit before checking the tasks again
    # YOUR CODE HERE END

In [None]:
acceptable_request = "What is the last season you have information about? And who was the top scorer?"
offtopic_request = "I want to talk about rugby"
gambling_request = "Which players from Arsenal would be good to include in my fantasy team? Do you have any tips about betting on Premier League teams?"
borderline_request = "Who was the sponsor of Fulham F.C. for the 20/21 season?"

In [None]:
tests = [acceptable_request, offtopic_request, gambling_request, borderline_request]

for test in tests:
    result = await execute_all_guardrails(test, system_prompt, topical_system_prompt, topical_guardrail_message, moderation_system_prompt, moderation_guardrail_message)
    print(result)
    print('\n\n')

## Conclusion

Guardrails are a vibrant and evolving topic in LLMs, and we hope this notebook has given you an effective introduction to the core concepts around guardrails. To recap:

- Guardrails are detective controls that aim to prevent harmful content getting to your applications and your users, and add steerability to your LLM in production.
- They can take the form of input guardrails, which target content before it gets to the LLM, and output guardrails, which control the LLM's response.
- Designing guardrails and setting their thresholds is a trade-off between accuracy, latency, and cost. Your decision should be based on clear evaluations of the performance of your guardrails, and an understanding of what the cost of a false negative and false positive are for your business.
- By embracing asynchronous design principles, you can scale guardrails horizontally to minimize the impact to the user as your guardrails increase in number and scope.

We look forward to seeing how you take this forward, and how thinking on guardrails evolves as the ecosystem matures.

---