# **AI Safety Playbook | Essential Steps to Ensure Your AI Stays Safe and Sound**
#### **A comprehensive guide to implementing effective safety measures in AI-powered applications — covering everything from content filtering to access management and response validation**

## **Why AI Safety Matters**
As AI technology continues to grow, so does the need for safety in AI-powered applications. Without proper safeguards, AI systems can produce harmful outcomes, including biased content, misinformation, and breaches of user privacy. These issues can damage user trust, harm brand reputation, and create legal and ethical risks for companies.

AI safety measures help minimize risks by setting clear limits on how AI systems operate. Content filtering, for instance, blocks offensive language and harmful information, protecting users from inappropriate outputs. Access control restricts interactions with sensitive AI systems to authorized users only, while rate limiting and usage monitoring prevent misuse and system overload. Additional tools, such as prompt moderation and output validation, further improve the accuracy and dependability of AI responses, ensuring a secure and reliable experience for all users.

This guide will cover the core safety measures that every business should consider when deploying AI, from content filtering to monitoring, creating a safer and more responsible AI ecosystem.

## **Understanding AI Safety Guardrails**
AI safety guardrails are essential systems designed to keep AI applications secure, ethical, and on track with their intended purpose. Just as road guardrails keep vehicles safely within boundaries, AI guardrails ensure models operate within safe limits. They are key in preventing harmful outcomes, such as biased responses or risky decisions.

These guardrails use tools such as content filters, access controls, and output checks to promote responsible AI behavior and lower the risk of unexpected outcomes. By implementing these safety measures, developers can build AI systems that are more reliable and trustworthy, giving users confidence in their applications.



## **Input guardrails**
When designing guardrails, it’s crucial to find the right balance between accuracy, latency, and cost. The goal is to maximize accuracy while minimizing the impact on both your budget and the user experience. Cost can be kept down by running the guardrail check on a small language model, for example an open model such as Llama.

#### Implementing input guardrails
We will use the OpenAI API and the ‘gpt-4o-mini’ model in our implementation, but other models can also be used.

In [None]:
import os
import openai
from google.colab import userdata

os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')
GPT_MODEL = 'gpt-4o-mini'

In [None]:
system_prompt = "You are a helpful assistant."
bad_request = "What is a penalty shoot out in football"
good_request = "What are the different ways a batsman can get out in cricket game?"

In [None]:
import asyncio


async def get_chat_response(user_request):
    print("Getting LLM response")
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_request},
    ]
    response = openai.chat.completions.create(
        model=GPT_MODEL, messages=messages, temperature=0.5
    )
    print("Got LLM response")

    return response.choices[0].message.content


async def topical_guardrail(user_request):
    print("Checking topical guardrail")
    messages = [
        {
            "role": "system",
            "content": "Your role is to assess whether the user question is allowed or not. The allowed topic is the game of cricket. If the topic is allowed, say 'allowed' otherwise say 'not_allowed'",
        },
        {"role": "user", "content": user_request},
    ]
    response = openai.chat.completions.create(
        model=GPT_MODEL, messages=messages, temperature=0
    )

    print("Got guardrail response")
    return response.choices[0].message.content


async def execute_chat_with_guardrail(user_request):
    topical_guardrail_task = asyncio.create_task(topical_guardrail(user_request))
    chat_task = asyncio.create_task(get_chat_response(user_request))

    while True:
        done, _ = await asyncio.wait(
            [topical_guardrail_task, chat_task], return_when=asyncio.FIRST_COMPLETED
        )
        if topical_guardrail_task in done:
            guardrail_response = topical_guardrail_task.result()
            if guardrail_response == "not_allowed":
                chat_task.cancel()
                print("Topical guardrail triggered")
                return "I can only talk about game of cricket."
            elif chat_task in done:
                chat_response = chat_task.result()
                return chat_response
        else:
            await asyncio.sleep(0.1)  # sleep for a bit before checking the tasks again

In [None]:
# Call the main function with the good request - this should go through
response = await execute_chat_with_guardrail(good_request)
print(response)

Checking topical guardrail
Got guardrail response
Getting LLM response
Got LLM response
In cricket, a batsman can be dismissed (get out) in several ways. Here are the most common methods:

1. **Bowled**: The batsman is out bowled if the ball is delivered by the bowler and hits the stumps, dislodging the bails.

2. **Caught**: A batsman is out caught if a fielder catches the ball on the full (without it bouncing) after the batsman hits it with the bat.

3. **Leg Before Wicket (LBW)**: A batsman can be given out LBW if the ball hits any part of their body (usually the leg) before hitting the bat, provided certain conditions are met, such as the ball pitching in line or outside off stump and the batsman not offering a shot.

4. **Run Out**: A batsman is out run out if they attempt a run but the fielding side successfully breaks the stumps with the ball before the batsman reaches the crease.

5. **Stumped**: A batsman is out stumped if they step out of their crease to play a shot and the w

In [None]:
# Call the main function with the bad request - this should get blocked
response = await execute_chat_with_guardrail(bad_request)
print(response)

Checking topical guardrail
Got guardrail response
Getting LLM response
Got LLM response
Topical guardrail triggered
I can only talk about game of cricket.
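
Note that in the second run the log shows `Got LLM response` before `Topical guardrail triggered`. `openai.chat.completions.create` is a blocking call, so even though both functions are declared `async`, each one holds the event loop until its request returns, and the chat call has already finished by the time it is cancelled. The sketch below illustrates the non-blocking pattern under stated assumptions: it uses stand-in coroutines with made-up delays instead of real API calls, and `fake_guardrail`, `fake_chat`, and `guarded_chat` are hypothetical names. With the OpenAI SDK, the equivalent would be to await `chat.completions.create` on an `openai.AsyncOpenAI` client.

```python
import asyncio


# Stand-ins for the two API calls. The delays are invented for illustration;
# real calls would be awaited requests on an async client.
async def fake_guardrail(user_request: str) -> str:
    await asyncio.sleep(0.05)  # the guardrail check is assumed to be the faster call
    return "allowed" if "cricket" in user_request.lower() else "not_allowed"


async def fake_chat(user_request: str) -> str:
    await asyncio.sleep(0.2)  # the main completion is assumed to be slower
    return f"(answer to: {user_request})"


async def guarded_chat(user_request: str) -> str:
    # Both tasks start immediately and make progress concurrently,
    # because each one yields to the event loop while it waits.
    guardrail_task = asyncio.create_task(fake_guardrail(user_request))
    chat_task = asyncio.create_task(fake_chat(user_request))

    verdict = await guardrail_task
    if verdict.strip().lower() == "not_allowed":
        chat_task.cancel()  # the chat call is still in flight, so this stops it
        return "I can only talk about game of cricket."
    return await chat_task
```

In a notebook this can be tried with `await guarded_chat(bad_request)`; in a plain script, wrap the call in `asyncio.run(...)`.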


## **Output guardrails**
In the previous section, we discussed input checks. In this section, we check the output of the LLM. Even an on-topic response can contain misleading or unwanted content, so output moderation is essential.

We begin by naming the domain, which defines the type of content we want to moderate. Next, we provide clear criteria that specify what the content should and should not include, along with step-by-step instructions that help the LLM evaluate the content. Finally, the LLM assigns a score between 1 and 5, where 1 means the content is a generic description that meets the criteria and 5 means it clearly violates them.

#### Implementing output guardrails
To keep things interesting, let’s look at a different example: “Obesity.” Here, our goal is to extend the input guardrails discussed earlier and add a second layer of moderation. This involves checking the response generated by the LLM before it reaches the end user, ensuring the content is safe, relevant, and aligned with the intended guidelines.

Our objectives:
- **Restrict Query Scope**: If the user query is unrelated to obesity, the response will be blocked.
- **Moderate Responses on Sensitive Topics**: For questions on specific medications or methods to reduce obesity, the response will be carefully moderated. The AI will avoid recommending any medications or procedures.
- **Guardrail Scoring System**: Implement a scoring mechanism (1 to 5) for output guardrails. If the score reaches the defined threshold (3 or higher in the code below), the response is replaced with a moderation message. Otherwise, the response is delivered as is.
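
Taken together, the three objectives reduce to a small decision function. The sketch below is illustrative only: `decide` and `SCORE_THRESHOLD` are hypothetical names, the threshold of 3 mirrors the value used in the implementation that follows, and treating a missing score as "moderated" is an assumption made here to fail safe.

```python
from typing import Optional

SCORE_THRESHOLD = 3  # assumed cut-off: scores at or above this are moderated


def decide(topic_verdict: str, moderation_score: Optional[int]) -> str:
    """Return 'blocked', 'moderated', or 'delivered' for a single request."""
    if topic_verdict != "allowed":
        return "blocked"  # objective 1: the query is off-topic
    if moderation_score is None or moderation_score >= SCORE_THRESHOLD:
        return "moderated"  # objectives 2 and 3: score too high, or no usable score
    return "delivered"  # on-topic and within the threshold
```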

In [None]:
domain = "Obesity"

health_advice_criteria = """
Assess the presence of explicit recommendation on obesity in the content.
The content should contain only general definition, description of the domain and not specific medicines to purchase."""

health_advice_steps = """
1. Read the content and the criteria carefully.
2. Assess how much explicit content on domain in general is contained in the content.
3. Assign an advice score from 1 to 5, with 1 being generic description, definition, and 5 being explicit advice on medications or medicines.
"""

moderation_system_prompt = """
You are a moderation assistant. Your role is to detect content about {domain} in the text provided, and mark the severity of that content.

## {domain}

### Criteria

{scoring_criteria}

### Instructions

{scoring_steps}

### Content

{content}

### Evaluation (score only!)
"""

With our configuration in place, let’s define functions for each of our objectives, a topical guardrail and a moderation guardrail, and then execute both to produce the final result.

In [None]:
async def topical_guardrail(user_request):
    print("Checking topical guardrail")
    messages = [
        {
            "role": "system",
            "content": "Your role is to assess whether the user question is allowed or not. The allowed topic is obesity. If the topic is allowed, say 'allowed' otherwise say 'not_allowed'",
        },
        {"role": "user", "content": user_request},
    ]
    response = openai.chat.completions.create(
        model=GPT_MODEL, messages=messages, temperature=0
    )

    print("Got guardrail response")
    return response.choices[0].message.content

async def moderation_guardrail(chat_response):
    print("Checking moderation guardrail")
    mod_messages = [
        {"role": "user", "content": moderation_system_prompt.format(
            domain=domain,
            scoring_criteria=health_advice_criteria,
            scoring_steps=health_advice_steps,
            content=chat_response
        )},
    ]
    response = openai.chat.completions.create(
        model=GPT_MODEL, messages=mod_messages, temperature=0
    )
    print("Got moderation response")
    return response.choices[0].message.content



async def execute_all_guardrails(user_request):
    topical_guardrail_task = asyncio.create_task(topical_guardrail(user_request))
    chat_task = asyncio.create_task(get_chat_response(user_request))

    while True:
        done, _ = await asyncio.wait(
            [topical_guardrail_task, chat_task], return_when=asyncio.FIRST_COMPLETED
        )
        if topical_guardrail_task in done:
            guardrail_response = topical_guardrail_task.result()
            print(f'****{guardrail_response}******')
            if guardrail_response == "not_allowed":
                chat_task.cancel()
                print("Topical guardrail triggered")
                return f"I can only talk about {domain} and not on other topics"
            elif chat_task in done:
                chat_response = chat_task.result()
                moderation_response = await moderation_guardrail(chat_response)

                if int(moderation_response) >= 3:
                    print(f"Moderation guardrail flagged with a score of {int(moderation_response)}")
                    return "Sorry, we're not permitted to give prescriptions or medication advice. I can help you with any general queries you might have."

                else:
                    print('Passed moderation')
                    print('-' * 30)
                    return chat_response
        else:
            await asyncio.sleep(0.1)  # sleep for a bit before checking the tasks again
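
The `int(moderation_response)` call above assumes the model replies with a bare integer; a reply such as `Score: 4` would raise a `ValueError`. A more defensive parser could look like the following sketch. The names `parse_score` and `is_flagged` are hypothetical, and flagging a reply that contains no usable score is an assumption made here so that the guardrail fails closed.

```python
import re
from typing import Optional


def parse_score(raw: str) -> Optional[int]:
    """Return the first standalone digit from 1 to 5 in the reply, or None."""
    match = re.search(r"\b([1-5])\b", raw)
    return int(match.group(1)) if match else None


def is_flagged(raw: str, threshold: int = 3) -> bool:
    """Flag the response when the score meets the threshold or cannot be read."""
    score = parse_score(raw)
    return score is None or score >= threshold
```

Inside `execute_all_guardrails`, the check `int(moderation_response) >= 3` could then be replaced with `is_flagged(moderation_response)`.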

In [None]:
# request_1 should pass both the topical guardrail and the moderation guardrail,
# request_2 and request_3 should be stopped by moderation, and request_4 by the topical guardrail
request_1 = 'Define obesity?'
request_2 = 'What are the medications that can be consumed to reduce obesity?'
request_3 = 'What advice can you give to someone with obesity?'
request_4 = 'What causes liver cirrhosis?'
request_4 = 'What causes liver cirrhosis?'

## **Expected outcome from the LLM**

- **Request 1**: Generate the response as it meets the criteria of both input and output guardrails
- **Requests 2 & 3**: Block the response from the LLM as it doesn’t meet the output criteria
- **Request 4**: Block the response as it doesn’t meet the input guardrail criteria

In [None]:
tests = [request_1, request_2, request_3, request_4]

for test in tests:
    result = await execute_all_guardrails(test)
    print(result)
    print('\n\n')

Checking topical guardrail
Got guardrail response
Getting LLM response
Got LLM response
****allowed******
Checking moderation guardrail
Got moderation response
Passed moderation
------------------------------
Obesity is a medical condition characterized by an excessive accumulation of body fat that presents a risk to health. It is typically assessed using the Body Mass Index (BMI), which is calculated by dividing a person's weight in kilograms by the square of their height in meters. A BMI of 30 or higher is generally classified as obesity.

Obesity can lead to various health issues, including heart disease, diabetes, hypertension, certain types of cancer, and other chronic conditions. It is often influenced by a combination of genetic, behavioral, environmental, and metabolic factors. Managing obesity typically involves lifestyle changes such as improved diet, increased physical activity, and, in some cases, medical interventions.



Checking topical guardrail
Got guardrail response
G

The output matches our expectations. The first request, for example, clears the topical guardrail, passes moderation, and is returned to the user unchanged.

# **Conclusion**
In this blog, we explored the essential role guardrails play in maintaining AI safety. These safeguards are designed to block harmful or inappropriate content, ensuring AI systems act ethically and reliably. We discussed how input and output guardrails work together to secure AI behavior and emphasized the importance of balancing accuracy, latency, and cost when implementing these controls. We also structured the guardrail and chat calls as asynchronous tasks, a design intended to keep the delay that guardrail checks add to the user experience as small as possible.

AI safety goes beyond basic compliance; it’s about building trustworthy and dependable AI applications. By investing in strong AI safety measures, companies protect against risks and enhance the overall user experience, fostering a safer and more positive AI environment.