# Intro to Guardrails
Guardrails enforce rules on LLM inputs and outputs. They fall into three major categories:
- **Topical Guardrails** – Keep conversations focused by blocking or redirecting off-topic inputs.
- **Safety Guardrails** – Prevent unsafe, harmful, or inappropriate responses.
- **Security Guardrails** – Protect sensitive information and prevent risky outputs.

NeMo Guardrails is a framework for adding programmable guardrails between your application logic and the LLM. It supports full customization of the rules and of the actions taken when interacting with the LLM. For simplicity, this notebook uses pre-defined guardrail NIMs for basic topical and safety guardrails, but you can create and extend your own using the framework.

In this notebook, you will:
- Understand the purpose and core types of NeMo Guardrails.
- Learn how to apply guardrails using NVIDIA NIM models.
- Evaluate model behavior in a museum guide scenario.
- Build and test your own guardrails using documentation examples.

## Prerequisites
Before getting started, you need an NVIDIA API key from the NVIDIA API Catalog to access the models used in this notebook.

Need an API key? It's free:
  1. Navigate to **[NVIDIA API Catalog](https://build.nvidia.com/explore/discover)**.
  2. Select any model, such as `llama-3.3-70b-instruct`.
  3. On the right panel above the sample code snippet, click on "Get API Key". This will prompt you to log in if you have not already.

In [1]:
import os
import getpass
from dotenv import load_dotenv

# Load environment variables from a .env file, if one exists
load_dotenv()

# Prompt for the key only if a valid-looking one is not already set
if not os.environ.get("NVIDIA_API_KEY", "").startswith("nvapi-"):
    nvapi_key = getpass.getpass("Enter your NVIDIA API key: ")
    assert nvapi_key.startswith("nvapi-"), "NVIDIA API keys start with 'nvapi-'"
    os.environ["NVIDIA_API_KEY"] = nvapi_key

In [4]:
# Install required dependencies (for a local setup)
# openai and python-dotenv are imported by other cells in this notebook
!pip install nemoguardrails openai python-dotenv


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.0[0m[39;49m -> [0m[32;49m25.1.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In this notebook, we use NIM guardrails already created for topical, safety, and security use cases. Guardrails are fully programmable, however, so you can define your own rules and logic for whatever behavior you want from your agent. 📚 [More on how it works in the official docs.](https://docs.nvidia.com/nemo/guardrails/latest/user-guides/guardrails-process.html)


## Topical Guardrail NIM

The `llama-3.1-nemoguard-8b-topic-control` model is a dialog moderation model trained by NVIDIA, based on the Llama 3.1 8B Instruct foundation model. It serves as a topical guardrail, helping language applications stay aligned with developer-defined content boundaries.

This model is fine-tuned using the CantTalkAboutThis dataset — a carefully constructed collection of over 10,000 dialog samples. These samples are designed to teach the model how to identify when user input deviates from permitted topics and to respond appropriately, such as by refusing the request, redirecting the conversation, or providing a neutral fallback.

In [10]:
# Set up the client
from openai import OpenAI
import os

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=os.environ.get("NVIDIA_API_KEY")  # Replace with your API key if not using env var
)

Most traditional content moderation tools (think Llama Guard 3) rely on predefined taxonomies of harms, covering general sensitive areas such as violence, hate speech, or NSFW content. While useful, these models often lack the flexibility to adapt to the domain-specific needs of digital human use cases.

In contrast, `llama-3.1-nemoguard-8b-topic-control` allows developers to:
- Define custom allowed/disallowed topics.
- Enforce domain-restricted interactions (think of an AI teacher that only answers math questions).
- Respond with context-aware refusals or redirections.

Let's interact with the model and test its ability to reject off-topic questions.

The cells below prompt the model to serve as a museum guide. Run through them, then experiment with both:
- On-topic questions (e.g., “Who painted this?”)
- Off-topic questions (e.g., “What’s your opinion on the presidential election?”)

**System Instruction / System Prompt**  
The system prompt sets the context and rules for the conversation. It should include:
- Core Rules: Define the boundaries of acceptable topics.
- Persona Assignment: Specify the role the model (your digital human) should adopt, such as a museum guide or banking assistant.
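
The two ingredients above can also be assembled programmatically. The sketch below is one way to do it, assuming a hypothetical helper `build_system_prompt` (not part of NeMo Guardrails or any NVIDIA library) that takes a persona sentence and a list of rules. It numbers the rules and introduces them with the same firm wording ("You must follow these guardrails") that this notebook's prompt uses.

```python
def build_system_prompt(persona: str, rules: list[str]) -> str:
    """Combine a persona sentence and a list of rules into one system prompt.

    Hypothetical helper for illustration; not part of NeMo Guardrails.
    """
    # Number each rule so it is easy to reference and to audit.
    numbered = "\n".join(f"{i}. {rule}" for i, rule in enumerate(rules, start=1))
    return f"{persona} You must follow these guardrails:\n\n{numbered}"


example_prompt = build_system_prompt(
    persona="You are an AI museum guide for the Modern Art & Technology Museum.",
    rules=[
        "Do not speculate about the value or future of artwork.",
        "Maintain a polite, professional, and educational tone at all times.",
    ],
)
print(example_prompt)
```

Keeping the rules in a list makes it easier to add, remove, or reorder them while experimenting with the guardrail.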

In [30]:
# Define the System Prompt - Try changing this!
system_prompt = (
    "You are an AI museum guide for the Modern Art & Technology Museum. Your role is to provide factual, accessible information about exhibits, artists, and museum logistics. "
    "You must follow these guardrails:\n\n"
    "1. Do not speculate about the value or future of artwork.\n"
    "2. Do not make personal or political commentary about the artists or their work.\n"
    "3. Do not provide medical, legal, or travel advice unrelated to museum logistics.\n"
    "4. If asked about topics outside the museum's scope (like global politics, conspiracy theories, or offensive content), politely redirect to museum-relevant topics or suggest asking a staff member.\n"
    "5. Maintain a polite, professional, and educational tone at all times."
)

In [26]:
# Define the user input - Try changing this!
user_question = "What is an exhibit here?"

In [33]:
# Send the system prompt and user question to the topic-control guardrail
completion = client.chat.completions.create(
    model="nvidia/llama-3.1-nemoguard-8b-topic-control",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_question},
    ],
    temperature=0.5,
    top_p=1,
    max_tokens=1024,
)


In [34]:
# Output guardrail: is the user question on topic?
# Strip whitespace so the label compares cleanly against "on-topic" / "off-topic"
topic_result = completion.choices[0].message.content.strip()
print("RESULT:", topic_result)

RESULT: on-topic 


Observe how the model enforces boundaries, and consider how you might define similar guardrails for your own AI agents.  
Try adjusting the system and user prompts to see how the guardrail's verdict changes.

Depending on whether the guardrail classifies the input as on-topic or off-topic, you can return a basic response:

In [38]:
# Normalize the label so stray whitespace or casing does not break the comparison
if topic_result.strip().lower() == "off-topic":
    assistant_response = (
        "I'm here to assist with questions about our exhibits, artists, and museum logistics. "
        "For topics beyond the museum's scope, I recommend speaking with one of our staff members."
    )
else:
    # Simulate a real response; you'd swap in your LLM logic from pipecat for this.
    assistant_response = (
        "Our AI art exhibit explores how emerging technologies like machine learning are reshaping modern artistic expression. "
        "Feel free to explore it in the West Gallery!"
    )

print(assistant_response)

Our AI art exhibit explores how emerging technologies like machine learning are reshaping modern artistic expression. Feel free to explore it in the West Gallery!
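
To reuse this branching across several questions, the check can be wrapped in a function. The following is a minimal sketch, assuming the guardrail replies with `on-topic` or `off-topic` (as in the output above), possibly with extra whitespace or different casing. `route_response`, `REFUSAL`, and `fake_llm` are hypothetical names, and the `generate` argument stands in for whatever LLM call your application uses. Unlike the cell above, this sketch treats any unexpected label as off-topic, which is an assumed fail-closed policy.

```python
from typing import Callable

REFUSAL = (
    "I'm here to assist with questions about our exhibits, artists, and museum logistics. "
    "For topics beyond the museum's scope, I recommend speaking with one of our staff members."
)


def route_response(topic_label: str, question: str, generate: Callable[[str], str]) -> str:
    """Return a canned refusal for off-topic input, otherwise call the main LLM.

    Anything other than a clean "on-topic" label is treated as off-topic.
    """
    if topic_label.strip().lower() != "on-topic":
        return REFUSAL
    return generate(question)


# Stand-in for a real LLM call, used only to demonstrate the routing.
def fake_llm(question: str) -> str:
    return f"(museum answer to: {question})"


print(route_response("on-topic ", "What is an exhibit here?", fake_llm))
print(route_response("off-topic", "Who will win the election?", fake_llm))
```

In the notebook, `topic_label` would be the text returned by the topic-control model and `generate` would call your main LLM.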


## Safety Guardrail NIM
Llama 3.1 NemoGuard 8B ContentSafety NIM

The `llama-3.1-nemoguard-8b-content-safety` model is a content moderation guardrail. It is built on the Llama 3.1 8B Instruct base model and fine-tuned on the Aegis 2.0 dataset, a collection of 30,000 dialogue samples covering a comprehensive taxonomy of unsafe content categories.

In applications like virtual assistants, ensuring that AI interactions remain free from harmful or inappropriate content is critical. The content safety guardrail serves as a protective layer, preventing the AI from engaging in or propagating content that falls into categories like Violence, Hate Speech, Sexual Content, Profanity, Misinformation, and Privacy Violations.

This model can evaluate **both** user inputs and LLM-generated responses, classifying them as “safe” or “unsafe” and identifying specific categories of violations when applicable.

Let's run a safety guardrail on our previous museum exhibit response:

In [39]:
safety_eval = client.chat.completions.create(
    model="nvidia/llama-3.1-nemoguard-8b-content-safety",
    messages=[
        {"role": "user", "content": user_question},
        {"role": "assistant", "content": assistant_response}
    ]
)

print("Content Safety Output:", safety_eval.choices[0].message.content)

Content Safety Output: {"User Safety": "safe", "Response Safety": "safe"} 
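
The guardrail returns its verdict as a JSON string, so it can be parsed before your application acts on it. The sketch below assumes the two keys shown in the output above (`User Safety` and `Response Safety`). `is_safe` is a hypothetical helper, and treating unparseable or incomplete output as unsafe is a design assumption rather than documented model behavior.

```python
import json


def is_safe(raw_output: str) -> bool:
    """Return True only if both the user input and the response are rated safe.

    Assumes the JSON keys seen in this notebook's output. Output that is not
    valid JSON, or that lacks either key, is treated as unsafe (fail closed).
    """
    try:
        verdict = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    if not isinstance(verdict, dict):
        return False
    return all(
        str(verdict.get(key, "")).strip().lower() == "safe"
        for key in ("User Safety", "Response Safety")
    )


sample = '{"User Safety": "safe", "Response Safety": "safe"} '
print(is_safe(sample))
```

In the notebook, you would call `is_safe(safety_eval.choices[0].message.content)`. If you evaluate a user message on its own, the response key may be absent, so adjust the tuple of required keys accordingly.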
