# Guardrails Design
**Colin @ 2025-03**

This notebook walks through the creation of guardrails in a RAG flow for a credit card chatbot:

- Understand the RAG process and what could go wrong (WCGW)
- Identify risks and define corresponding mitigations (guardrails)
- Focus on groundedness/hallucination validation with additional guardrails aligned with best practices and cloud providers (e.g. AWS)
- Outline qualitative and quantitative measures for evaluating guardrail effectiveness
- Discuss fine-tuning strategies for continuous improvement


## README Before You Start

1. Tested on MacBook M1 Pro (32GB RAM, Python 3.10) and Google Colab.
2. Recommended to run in **Google Colab**:
   - Default runtime is sufficient (no Colab Pro or GPU needed).
   - A GPU is optional but speeds up model inference.
3. To run locally (not recommended), ensure:
   - Python 3.10+
   - Access to `pip install`
   - Permission to download LLMs from Hugging Face
   - Jupyter Notebook/Lab installed, or `ipykernel` for running notebooks in VSCode

In [1]:
%env TOKENIZERS_PARALLELISM=true

env: TOKENIZERS_PARALLELISM=true


In [2]:
%%capture
!pip install transformers
!pip install guardrails-ai
!pip install presidio_analyzer
!pip install presidio_anonymizer
!pip install sentence_transformers
!python -m spacy download en_core_web_lg

# Load All Required Packages

In [3]:
%%capture
import re
import json
import logging
from guardrails import validator_base
from guardrails.validator_base import (
    Validator,
    register_validator,
    FailResult,
    PassResult,
    ValidationResult)
from transformers import pipeline
from typing import Any, Dict, List, Optional
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
logging.getLogger("presidio-analyzer").setLevel(logging.ERROR)

# Load All Required LLMs for Guardrails
- Ensure all LLMs are downloaded from Hugging Face without errors.
- This step might take a while to complete.

In [4]:
%%capture
# Hallucination Detection
nli_model_1 = pipeline(
    "text-classification", model='GuardrailsAI/finetuned_nli_provenance')

# Topic Classification
topic_model = pipeline(
    "zero-shot-classification",
    model='facebook/bart-large-mnli',
    hypothesis_template="This sentence above contains discussions of the folllowing topics: {}.",
    multi_label=True)

# English Language Detection
lan_class_model = pipeline(
    "text-classification", model="qanastek/51-languages-classifier")

# Toxic Language Detection
toxicity_model = pipeline(
    "text-classification", model="JungleLee/bert-toxic-comment-classification")

# PII Detection and Anonymisation
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

# Define Helper Functions

In [5]:
def get_result(result, verbose=True):
    """
    Extracts validation result details. Converts the result object into a
    dictionary with relevant fields depending on the outcome.
    - If the outcome is 'pass', it loads and includes the value override.
    - If the outcome is 'fail', it loads the error message and adds the fix
      response.

    Args:
        result: A ValidationResult object with outcome, value_override,
                error_message, and fix_value.
        verbose (bool): Whether to print result details to the console.

    Returns:
        dict: A dictionary containing the outcome and relevant details.
    """
    outcome = result.outcome
    output = {'outcome':outcome}
    if outcome=='pass':
        details = json.loads(result.value_override)
        output.update(details)
        if verbose:
            print(f'Validation Result: {outcome}, Details: {details}')
    elif outcome=='fail':
        details = json.loads(result.error_message)
        output.update(details)
        output['response'] = result.fix_value
        if verbose:
            print(f'Validation Result: {outcome}, Details: {details}, '
                  f'Bot Reponse: {result.fix_value}')
    return output


# What Could Go Wrong (WCGW) and Associated Risks

- Chatbot Output:
    - **WCGW #1**: Irrelevant or incorrect answers (hallucination)
    - **WCGW #2**: Toxic or abusive responses

- User Input:
    - **WCGW #3**: Non-English questions
    - **WCGW #4**: Unsupported topics
    - **WCGW #5**: Competitor mentions
    - **WCGW #6**: Submission of personal or sensitive information

- Associated Risks:
    - Poor user experience if the knowledge base isn't multilingual
    - Reputational damage and customer complaints
    - Regulatory or legal consequences
    - Breach of conduct standards
    - Malicious use, increased API costs
    - Privacy violations or compliance issues


# Guard Chatbot Output



## 1 Guardrail for Hallucination

- **Relevant WCGW**: #1 — Irrelevant or incorrect answers  
- **Mitigation**: Use an NLI model to check if responses are grounded in retrieved context.

- **Approach**:  
  - Treat the response as the *hypothesis* and the retrieved context as the *premise*, then run both through an NLI model.  
  - Output: *entailment* (grounded), *contradiction* (conflicts with the context), or *neutral* (neither).  

- **Limitations**:  
  - Only checks against the retrieved context, so retrieval errors go undetected  
  - False positives possible (grounded answers labelled as contradictions)

- **Assessment**:  
  - *Qualitative*: entailment / neutral / contradiction  
  - *Quantitative*: Confidence score

- **Fine-tuning**:  
  - Try other NLI or GenAI models  
  - Tune threshold with labelled data  
  - Sentence-level check (costly)
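
The sentence-level check listed under Fine-tuning could look like the sketch below. It is a minimal illustration rather than part of the guardrail in this section: `split_sentences`, `sentence_level_check` and the stand-in `fake_nli` are hypothetical names, the regex splitter is naive, and a real run would pass `nli_model_1` in place of the stub. Each sentence costs one NLI call, which is why this option is marked as costly.

```python
import re
from typing import Callable, Dict, List

# Assumed interface: a callable that takes {"text": premise, "text_pair": hypothesis}
# and returns {"label": ..., "score": ...}, mirroring how the NLI pipeline is
# called elsewhere in this notebook.
NliFn = Callable[[Dict[str, str]], Dict[str, object]]


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def sentence_level_check(nli: NliFn, premise: str, response: str,
                         threshold: float = 0.5) -> List[Dict[str, object]]:
    """Return the sentences that the NLI model flags as contradictions."""
    flagged = []
    for sentence in split_sentences(response):
        result = nli({"text": premise, "text_pair": sentence})
        label = str(result["label"]).lower()
        # Same rule as the guardrail: a contradiction counts only when its
        # score is at or above the threshold.
        if label == "contradiction" and result["score"] >= threshold:
            flagged.append({"sentence": sentence, **result})
    return flagged


def fake_nli(pair: Dict[str, str]) -> Dict[str, object]:
    """Stand-in model for illustration only: flags sentences mentioning 'free'."""
    if "free" in pair["text_pair"].lower():
        return {"label": "contradiction", "score": 0.9}
    return {"label": "entailment", "score": 0.9}


flagged = sentence_level_check(
    fake_nli,
    premise="Spend $3,000 per month to earn 25,000 points.",
    response="You earn 25,000 points. The card is free forever!")
print(flagged)
```

Returning the flagged sentences, rather than a single pass or fail, leaves room to drop only the ungrounded sentence instead of replacing the whole answer.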


In [6]:
@register_validator(name="hullucination", data_type="string")
class HallucinationGuardrail(Validator):
    """
    A custom validator that detects hallucinations in model-generated text by
    comparing it with a reference premise using a natural language inference
    (NLI) model. The validator uses the model's output to assess whether the
    generated text is entailed by, neutral to, or contradicts the premise.

    - If the output is 'entailment' or 'neutral', it passes the validation.
    - If it's 'contradiction' but the confidence score is below a threshold,
      it is considered neutral and passes.
    - Otherwise, the validation fails with an error message.

    Attributes:
        model: A callable NLI model that takes a dict with "text"
               and "text_pair".
        threshold (float): Contradictions scoring below this value are
                           treated as neutral.
    """
    def __init__(self, model, threshold: float, **kwargs):
        super().__init__(**kwargs)
        self.model = model
        self.threshold = threshold

    def _validate(self, value: str, premise: str):
        self.premise = premise
        result = self.model({"text": self.premise, "text_pair": value})
        label = result['label'].lower()
        score = result['score']

        if label=="entailment":
            return PassResult(value_override=json.dumps(result))

        # Note: the NLI model used in this demo never returns a neutral label
        elif label=="neutral":
            return PassResult(value_override=json.dumps(result))
        elif label=="contradiction" and score<self.threshold:
            return PassResult(
                value_override=json.dumps({"label":"neutral","score":score}))
        else:
            return FailResult(
                error_message=json.dumps(result),
                fix_value="I'm sorry, I couldn't find enough information "
                          "to answer that accurately.")

### 1.1 Test and Demo

In [7]:
# Retrieved context (example)
premise = """\
Context: Earn 25,000 CommBank Awards points or 17,500 Qantas Points
each month when you spend $3,000 or more per month for the first 4 months
on a new Ultimate Awards credit card.
"""
# Bot response (examples)
hypothesis_1 = """\
You would be eligible to earn 25,000 award points or
17,500 Qantas points each month when you spend $3,000 or more on a new
Ultimate Awards credit card within the first 4 months.
"""
hypothesis_2 = """\
Answer: Yes, you might get those points.
"""

# Instantiation
hallucination_gr = HallucinationGuardrail(model=nli_model_1, threshold=0)

# Demo
demo_1 = hallucination_gr._validate(hypothesis_1, premise)
demo_2 = hallucination_gr._validate(hypothesis_2, premise)

# Print validation
_ = get_result(demo_1)
_ = get_result(demo_2)


Validation Result: pass, Details: {'label': 'entailment', 'score': 0.8769131302833557}
Validation Result: fail, Details: {'label': 'contradiction', 'score': 0.8570698499679565}, Bot Reponse: I'm sorry, I couldn't find enough information to answer that accurately.


### 1.2 Fine-tuning Demo

In [8]:
# Instantiation
hallucination_gr = HallucinationGuardrail(model=nli_model_1, threshold=.86)

# Demo
demo_2 = hallucination_gr._validate(hypothesis_2, premise)

# Print validation
_ = get_result(demo_2)

Validation Result: pass, Details: {'label': 'neutral', 'score': 0.8570698499679565}
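
Rather than picking 0.86 by eye, the threshold could be tuned against labelled data. The sketch below is one simple way to do that, assuming a small set of `(contradiction_score, is_hallucination)` pairs; the pairs shown are made up for illustration and are not measured results. `sweep_thresholds` and `best_threshold` are hypothetical helpers.

```python
from typing import Dict, List, Tuple


def sweep_thresholds(labelled: List[Tuple[float, bool]],
                     candidates: List[float]) -> List[Dict[str, float]]:
    """Count both error types at each candidate threshold.

    `labelled` holds (contradiction_score, is_hallucination) pairs for
    responses the NLI model labelled 'contradiction'. As in the guardrail,
    a response is blocked when its score is at or above the threshold.
    """
    rows = []
    for t in candidates:
        blocked_good = sum(1 for s, bad in labelled if s >= t and not bad)
        missed_bad = sum(1 for s, bad in labelled if s < t and bad)
        rows.append({"threshold": t,
                     "blocked_good": blocked_good,
                     "missed_bad": missed_bad,
                     "errors": blocked_good + missed_bad})
    return rows


def best_threshold(rows: List[Dict[str, float]]) -> float:
    """Fewest total errors; ties go to the lower (stricter) threshold."""
    return min(rows, key=lambda r: (r["errors"], r["threshold"]))["threshold"]


# Made-up labels for illustration only.
labelled = [(0.95, True), (0.91, True), (0.88, True),
            (0.86, False), (0.80, False)]
rows = sweep_thresholds(labelled, [0.0, 0.5, 0.85, 0.87, 0.9])
for row in rows:
    print(row)
print("best:", best_threshold(rows))
```

Counting the two error types separately matters because they carry different costs: a blocked good answer hurts user experience, while a missed hallucination carries conduct risk. A weighted sum could replace the plain total.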


## 2 Guardrail for Toxicity

- **Relevant WCGW**: #2 — Toxic or abusive responses  
- **Mitigation**: Use a fine-tuned BERT-based classifier to detect and block toxic content.

- **Approach**:  
  - Classify the chatbot response using a binary toxicity detection model.  
  - If the response is toxic, suppress it and return a predefined message.

- **Limitations**:  
  - May miss subtle or sarcastic toxicity  
  - False positives possible

- **Assessment**:  
  - *Qualitative*: toxic / not toxic  
  - *Quantitative*: Confidence score

- **Fine-tuning**:  
  - Adjust threshold to balance sensitivity and false positives


In [9]:
@register_validator(name="toxicity", data_type="string")
class ToxicityGuardrail(Validator):
    """
    A custom validator to detect toxic language using a text classification
    model. The validator assesses if the input is toxic based on the model's
    label and score, and applies a configurable threshold to determine the
    outcome.

    - If the input is labeled as 'non-toxic', it passes.
    - If labeled as 'toxic' and no threshold is provided, or the score exceeds
      the threshold, it fails.
    - If the toxicity score is below the threshold, it passes, treating the
      message as 'non-toxic'.

    Attributes:
        model: A callable classification model returning a list with 'label'
               and 'score'.
        threshold (float, optional): A score threshold below which 'toxic'
                                     inputs are treated as non-toxic.
    """
    def __init__(self, model, threshold:float=None, **kwargs):
        super().__init__(**kwargs)
        self.model = model
        self.threshold = threshold

    def _validate(self, value: str):
        result = self.model(value)[0]
        label = result['label']
        score = result['score']
        if label=='non-toxic':
            return PassResult(value_override=json.dumps(result))

        # A 'toxic' label scoring below the threshold is treated as non-toxic
        elif (label=='toxic' and self.threshold is not None
              and score<self.threshold):
            return PassResult(
                value_override=json.dumps({'label':'non-toxic', 'score':score}))

        # Otherwise fail, including when the score equals the threshold
        else:
            return FailResult(
                error_message=json.dumps(result),
                fix_value="I can't continue the conversation if inappropriate "
                          "language is used. Let me know how I can help "
                          "respectfully.")

### 2.1 Test and Demo

In [10]:
# Instantiation
toxicity_gr = ToxicityGuardrail(model=toxicity_model)

# Demo
demo_1 = toxicity_gr._validate("Sorry, I cannot find any context about"
                               "`shitty service`.")
demo_2 = toxicity_gr._validate("Sorry, I cannot give an answer.")

# Print validation
_ = get_result(demo_1)
_ = get_result(demo_2)

Validation Result: fail, Details: {'label': 'toxic', 'score': 0.9992153644561768}, Bot Reponse: I can't continue the conversation if inappropriate language is used. Let me know how I can help respectfully.
Validation Result: pass, Details: {'label': 'non-toxic', 'score': 0.999581515789032}


# Guard User Input (Additional Guardrails)


## 3 Guardrail for Non-English User Input

- **Relevant WCGW**: #3 — User's question is not in English  
- **Mitigation**: Detect and handle non-English queries to ensure consistent chatbot performance.

- **Approach**:  
  - Use an XLM-RoBERTa-based model for multilingual language detection.  
  - If input is not English, block the request and prompt the user to switch to English.

- **Limitations**:  
  - May misclassify short or mixed-language inputs  
  - Limited support for informal expressions

- **Assessment**:  
  - *Qualitative*: Detected language label  
  - *Quantitative*: Confidence score

- **Fine-tuning**:  
  - Adjust confidence threshold
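
A confidence threshold for this guardrail might be combined with a minimum input length, since short inputs are the ones most likely to be misclassified. The sketch below is an assumption about how that could work: `language_decision`, `min_score` and `min_chars` are hypothetical, and the default values are placeholders rather than tuned numbers.

```python
from typing import Dict


def language_decision(result: Dict[str, object], text: str,
                      min_score: float = 0.80, min_chars: int = 12) -> str:
    """Map a language-classifier result to 'pass', 'fail' or 'uncertain'.

    `result` is expected to carry 'label' and 'score', like the classifier
    output used in this notebook.
    """
    # Very short inputs and low-confidence predictions are not trusted.
    if len(text.strip()) < min_chars or result["score"] < min_score:
        return "uncertain"
    return "pass" if result["label"] == "en-US" else "fail"


print(language_decision({"label": "en-US", "score": 0.99},
                        "How many points do I earn?"))
print(language_decision({"label": "zh-CN", "score": 0.99},
                        "我想知道我每花1澳币能赚多少积分?"))
print(language_decision({"label": "fr-FR", "score": 0.55},
                        "ok thanks, merci"))
```

How to treat `uncertain` is a product decision: letting the query through favours user experience, while asking the user to rephrase favours consistency.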


In [11]:
@register_validator(name="english_input", data_type="string")
class EnglishInputGuardrail(Validator):
    """
    A custom validator to check if the input text is in English. It uses a
    language detection model and validates based on the predicted language.

    - If the input is in English ('en-US'), validation passes.
    - If the input is in another language, validation fails.

    Attributes:
        model: A language detection model that returns 'label' and 'score'.
    """
    def __init__(self, model, **kwargs):
        super().__init__(**kwargs)
        self.model = model

    def _validate(self, value: str):
        result = self.model(value)[0]
        label = result['label']
        score = result['score']
        if label=='en-US':
            return PassResult(value_override=json.dumps(result))
        else:
            return FailResult(
                error_message=json.dumps(result),
                fix_value="Oops! I’m still learning new languages. But for now"
                          ", I can only chat confidently in English.")

### 3.1 Test and Demo

In [12]:
# Instantiation
english_gr = EnglishInputGuardrail(model=lan_class_model)

# Demo
demo1 = english_gr._validate("我想知道我每花1澳币能赚多少ultimate awards积分?")
demo2 = english_gr._validate("How much points do I earn from every 1 dollar I "
                             "spend?")

# Print validation
_ = get_result(demo1)
_ = get_result(demo2)

Validation Result: fail, Details: {'label': 'zh-CN', 'score': 0.9999196529388428}, Bot Reponse: Oops! I’m still learning new languages. But for now, I can only chat confidently in English.
Validation Result: pass, Details: {'label': 'en-US', 'score': 0.9999409914016724}


## 4 Guardrail for Topic Filtering

- **Relevant WCGW**: #4 — User's question relates to topics not supported by the bot  
- **Mitigation**: Use a zero-shot classification model to detect whether input relates to allowed or banned topics.

- **Approach**:  
  - Classify user input using zero-shot classification.  
  - If the topic is allowed, pass the input to the chatbot.  
  - If the topic is banned, block the input and return a predefined message.  

- **Limitations**:  
  - Zero-shot model may misclassify
  - Manual curation of topic labels and keywords required

- **Assessment**:  
  - *Qualitative*: Topic labels
  - *Quantitative*: Confidence scores

- **Fine-tuning**:  
  - Experiment with different models or keyword strategies  
  - Optimise thresholds and topic definitions based on real input
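
One way to apply a threshold here is to treat a low top score as an unknown topic instead of trusting the best label. The sketch below assumes the zero-shot result is a dict with `labels` and `scores` sorted from highest to lowest score; `topic_decision` and `min_score` are hypothetical, and the two result dicts are hand-written examples rather than model output.

```python
from typing import Dict, List, Tuple


def topic_decision(result: Dict[str, list], allowed: List[str],
                   banned: List[str],
                   min_score: float = 0.6) -> Tuple[str, str, float]:
    """Classify a zero-shot result as 'allowed', 'banned' or 'unknown'."""
    label, score = result["labels"][0], result["scores"][0]
    # A weak top score means no candidate topic fits well.
    if score < min_score:
        return "unknown", label, score
    if label in banned:
        return "banned", label, score
    if label in allowed:
        return "allowed", label, score
    return "unknown", label, score


allowed = ["awards points", "annual fees"]
banned = ["joke", "weather"]

# Hand-written results for illustration only.
confident = {"labels": ["joke", "awards points"], "scores": [0.94, 0.03]}
vague = {"labels": ["annual fees", "weather"], "scores": [0.31, 0.28]}

print(topic_decision(confident, allowed, banned))
print(topic_decision(vague, allowed, banned))
```

Without a threshold the top label always belongs to one of the two lists, so an explicit `unknown` outcome only becomes reachable once a minimum score is enforced.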

In [13]:
@register_validator(name="topic", data_type="string")
class TopicGuardrail(Validator):
    """
    A custom validator to check whether the input text belongs to an allowed
    or banned topic. It uses a classification model to label the input and
    validates based on topic lists.

    - If the topic is banned, validation fails with a fixed response.
    - If the topic is allowed, validation passes.
    - If the topic is unknown, validation fails.

    Attributes:
        model: A classification model returning 'labels' and 'scores'.
        allowed_topics (list): Topics that are permitted.
        banned_topics (list): Topics that are restricted.
    """
    def __init__(self, model, allowed_topics, banned_topics, **kwargs):
        super().__init__(**kwargs)
        self.allowed_topics = allowed_topics
        self.banned_topics = banned_topics
        self.model = model

    def _validate(self, value: str):
        result = self.model(value, self.allowed_topics+self.banned_topics)
        label = result['labels'][0]
        top_result = {'label': label, 'score':result['scores'][0]}

        if label in self.banned_topics:
            return FailResult(
                error_message=json.dumps(top_result),
                fix_value="Let's keep it focused on credit cards. "
                          "Happy to help with anything in that space!")
        elif label in self.allowed_topics:
            return PassResult(value_override=json.dumps(top_result))
        else:
            return FailResult(
                error_message=json.dumps(top_result),
                fix_value="Let's keep it focused on credit cards. "
                          "Happy to help with anything in that space!")


### 4.1 Test and Demo

In [14]:
# Scope
allowed_topics = [
    "credit card", "ultimate awards", "annual fees", "credit limit",
    "minimum repayment", "interest rates", "interest-free days",
    "monthly fees", "cash advance fees", "late payment fees",
    "awards points", "qantas points", "earning points",
    "points offers", "cashback offers", "redeeming points",
    "international transaction fee", "travel insurance"]

banned_topics = [
    "home loan", "mortgage", "personal loan", "savings account",
    "term deposit", "transaction account", "offset account", "insurance",
    "superannuation", "business account", "investments",
    "financial advice", "investment advice", "complaint", "refund",
    "scam", "fraud", "password", "update personal details", "speak to manager",
    "account locked", "privacy", "data breach", "legal", "lawsuit",
    "joke", "preference", "pizza", "weather", "news", "sports", "religion",
    "politics", "ethics", "opinion"]

# Instantiation
topic_gr = TopicGuardrail(model=topic_model,
                          allowed_topics=allowed_topics,
                          banned_topics=banned_topics)

# Demo
demo_1 = topic_gr._validate("Tell me a joke")
demo_2 = topic_gr._validate("I want to know more about my award points")

# Print validation
_ = get_result(demo_1)
_ = get_result(demo_2)

Validation Result: fail, Details: {'label': 'joke', 'score': 0.9411715269088745}, Bot Reponse: Let's keep it focused on credit cards. Happy to help with anything in that space!
Validation Result: pass, Details: {'label': 'awards points', 'score': 0.9510924816131592}


## 5 Guardrail for Competitor Mentions

- **Relevant WCGW**: #5 — Competitor mentions
- **Mitigation**: Detect competitor mentions and suppress responses to avoid opinion or comparative information.

- **Approach**:  
  - Use keyword or `regex` matching to identify competitor names in user input.  
  - If a competitor is mentioned, block the query and return a standard response.

- **Limitations**:  
  - Requires ongoing maintenance of competitor keyword list  
  - May misclassify unrelated mentions or miss indirect references

- **Assessment**:  
  - *Qualitative*: Detected competitor mentions
  - *Quantitative*: Match score (fixed at 1.0, since regex matches are exact)

- **Fine-tuning**:  
  - Refine the keyword list as competitors or naming patterns evolve  
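
As the keyword list grows, it may help to compile the pattern once and map every alias back to a single competitor key. The sketch below is one possible refinement, not the validator used in this notebook: `build_competitor_matcher` is a hypothetical helper, and sorting aliases longest-first is a design choice so that a full name such as "Macquarie Bank" is matched as a whole.

```python
import re
from typing import Callable, Dict, List


def build_competitor_matcher(
        competitors: Dict[str, str]) -> Callable[[str], List[str]]:
    """Compile one pattern up front and map each alias to its short name."""
    alias_to_key = {}
    for key, full_name in competitors.items():
        alias_to_key[key.lower()] = key
        alias_to_key[full_name.lower()] = key

    # Longest aliases first so full names win over their shorter prefixes.
    aliases = sorted(alias_to_key, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(a) for a in aliases) + r")\b",
        flags=re.IGNORECASE)

    def find(text: str) -> List[str]:
        found = []
        for match in pattern.findall(text):
            key = alias_to_key[match.lower()]
            if key not in found:  # report each competitor once
                found.append(key)
        return found

    return find


find_competitors = build_competitor_matcher({
    "NAB": "National Australia Bank",
    "Macquarie": "Macquarie Bank",
    "Amex": "American Express"})

print(find_competitors("Is american express or Macquarie Bank better than nab?"))
print(find_competitors("I nabbed a great deal on my card"))
```

The word boundaries keep short names such as "NAB" from matching inside unrelated words, though indirect references ("the red bank") would still be missed.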

In [15]:
@register_validator(name="competitor", data_type="string")
class CompetitorGuardrail(Validator):
    """
    A custom validator to detect mentions of competitors in user input.
    It uses regex pattern matching to search for competitor names and
    brands in the text.

    - If a competitor name is found, validation fails.
    - If no match is found, validation passes.

    Attributes:
        competitors (dict): A dictionary mapping competitor names to their
                            associated brand names.
    """
    def __init__(self, competitors:Dict[str, str], **kwargs):
        super().__init__(**kwargs)
        self.competitors=competitors

    def _validate(self, value: str):
        # Match short names and full brand names as whole words
        names = list(self.competitors.keys()) + list(self.competitors.values())
        pattern = re.compile(
            r'\b(' + '|'.join(re.escape(name) for name in names) + r')\b',
            flags=re.IGNORECASE
        )
        matches = pattern.findall(value)
        if matches:
            return FailResult(
                error_message=json.dumps({
                    'label':matches, 'score':[1.00]*len(matches)}),
                fix_value="I'm here to help with questions about our products "
                          "only, so I can't comment on other institutions.")
        else:
            return PassResult(
                value_override=json.dumps({'label':['no-matches'], 'score':[1.00]}))

### 5.1 Test and Demo

In [16]:
# Scope
competitors_dict = {
    "ANZ": "ANZ", "Westpac": "Westpac", "NAB": "National Australia Bank",
    "Macquarie": "Macquarie Bank", "Amex": "American Express",
    "Afterpay": "Afterpay"}

# Instantiation
competitors_gr = CompetitorGuardrail(competitors=competitors_dict)

# Demo
demo_1 = competitors_gr._validate("Does anz, amex or westpac offer better "
                                  "credit card with more benefits?")
demo_2 = competitors_gr._validate("How much do i need to pay for each "
                                  "international transaction?")

# Print validation
_ = get_result(demo_1)
_ = get_result(demo_2)

Validation Result: fail, Details: {'label': ['anz', 'amex', 'westpac'], 'score': [1.0, 1.0, 1.0]}, Bot Reponse: I'm here to help with questions about our products only, so I can't comment on other institutions.
Validation Result: pass, Details: {'label': ['no-matches'], 'score': [1.0]}


## 6 Guardrail for PII Detection

- **Relevant WCGW**: #6 — User submits personal or sensitive information  
- **Mitigation**: Detect PII to protect privacy and ensure compliance.

- **Approach**:  
  - Use Microsoft Presidio to identify PII via pattern matching and NLP.  
  - If detected, return a prompt asking the user to revise their input.

- **Limitations**:  
  - False positives on short or numeric inputs  
  - May need custom recognisers for domain-specific PII

- **Assessment**:  
  - *Qualitative*: Detected entity types
  - *Quantitative*: Confidence scores

- **Fine-tuning**:  
  - Add custom recognisers
  - Optionally generate redacted queries for chatbot use
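
The idea of generating a redacted query can be illustrated without calling Presidio at all. In the sketch below, `Span` is a hypothetical stand-in for an analyzer result (entity type plus character offsets) and `redact` is a hypothetical helper; it assumes the spans do not overlap. In the notebook itself the anonymizer does this job.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Span:
    """Minimal stand-in for one detected entity."""
    entity_type: str
    start: int  # inclusive character offset
    end: int    # exclusive character offset


def redact(text: str, spans: List[Span]) -> str:
    """Replace each detected span with a <ENTITY_TYPE> placeholder."""
    # Work right to left so earlier offsets stay valid after each replacement.
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        text = text[:span.start] + f"<{span.entity_type}>" + text[span.end:]
    return text


text = "My name is Colin and my number is 0405 567 789."
name, phone = "Colin", "0405 567 789"
spans = [
    Span("PERSON", text.index(name), text.index(name) + len(name)),
    Span("PHONE_NUMBER", text.index(phone), text.index(phone) + len(phone)),
]
print(redact(text, spans))
```

Passing the redacted text to the chatbot keeps the conversation going, at the cost of possibly removing details the answer depends on; blocking the query, as the guardrail in this section does, is the more conservative choice.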


In [17]:
@register_validator(name="PII", data_type="string")
class PIIGuardrail(Validator):
    """
    A custom validator to detect and handle personally identifiable
    information (PII) in user input. It uses an analyzer to identify PII
    entities and an anonymizer to optionally mask them.

    - If PII is detected, validation fails and returns a privacy warning.
    - If no PII is found, validation passes.

    Attributes:
        analyzer: A PII analyzer that identifies entity types in text.
        anonymizer: A tool that can anonymize detected PII.
        pii_entities (list, optional): Specific PII entities to scan for.
    """
    def __init__(self, analyzer, anonymizer, pii_entities:List[str]=None,
                 **kwargs):
        super().__init__(**kwargs)
        self.analyzer = analyzer
        self.pii_entities=pii_entities
        self.anonymizer = anonymizer

    def _validate(self, value: str):
        if self.pii_entities is not None:
            analysis = self.analyzer.analyze(
                value, entities=self.pii_entities, language='en')
        else:
            analysis = self.analyzer.analyze(
                value, language='en')

        if len(analysis):
            result = {'label':[a.entity_type for a in analysis],
                      'score':[a.score for a in analysis]}

            new_value = self.anonymizer.anonymize(
                text=value, analyzer_results=analysis)

            return FailResult(
                error_message = json.dumps(result),
                fix_value="Privacy is a big deal! Let's keep personal info "
                          "out of our chat. "
                          "Feel free to rephrase your question. ")
                # Decision point: Do we want to redact user input?
                # if so, then `fix_value=new_value.text`
        else:
            return PassResult(
                value_override=json.dumps({'label':['N/A'], 'score':[0]}))

### 6.1 Test and Demo

In [18]:
# Scope
pii_entities = [
    "PERSON", "FIRST_NAME", "LAST_NAME", "EMAIL_ADDRESS", "PHONE_NUMBER",
    "LOCATION", "CITY", "STATE_OR_PROVINCE", "COUNTRY", "ZIP_CODE",
    "CREDIT_CARD", "IN_PAN"]

# Instantiation
pii_gr = PIIGuardrail(
    analyzer=analyzer, anonymizer=anonymizer, pii_entities=pii_entities)

# Demo
demo1 = pii_gr._validate("My Name is Colin and my number is 0405 567 789, I need to know ....")
demo2 = pii_gr._validate("Hey I just want to know more about the minimum payment amount ...")

# Print validation
_ = get_result(demo1)
_ = get_result(demo2)

Validation Result: fail, Details: {'label': ['PERSON', 'PHONE_NUMBER'], 'score': [0.85, 0.75]}, Bot Reponse: Privacy is a big deal! Let's keep personal info out of our chat. Feel free to rephrase your question. 
Validation Result: pass, Details: {'label': ['N/A'], 'score': [0]}
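
The guardrails above are demonstrated one at a time. A rough sketch of how they might be chained in a RAG flow follows: input checks run before retrieval and generation, output checks run on the draft answer. Everything here is an assumption for illustration. `run_checks` and `guarded_chat` are hypothetical, the checks are trivial stubs rather than the validators in this notebook, and a real hallucination check would also need the retrieved context.

```python
from typing import Callable, List, Optional, Tuple

# Assumed check signature: return None to pass, or a fallback message to block.
Check = Callable[[str], Optional[str]]


def run_checks(text: str,
               checks: List[Tuple[str, Check]]) -> Tuple[bool, str]:
    """Run checks in order and stop at the first one that blocks."""
    for _name, check in checks:
        fallback = check(text)
        if fallback is not None:
            return False, fallback
    return True, text


def guarded_chat(question: str,
                 input_checks: List[Tuple[str, Check]],
                 generate: Callable[[str], str],
                 output_checks: List[Tuple[str, Check]]) -> str:
    """Guard the user input, generate an answer, then guard the answer."""
    ok, message = run_checks(question, input_checks)
    if not ok:
        return message  # blocked before any retrieval or LLM call
    ok, message = run_checks(generate(question), output_checks)
    return message  # the answer if it passed, otherwise the fallback


# Stub components for illustration only.
calls = []

def fake_generate(question: str) -> str:
    calls.append(question)
    return "You earn points on eligible purchases."

input_checks = [
    ("ascii_only", lambda t: None if t.isascii() else "English only, please.")]
output_checks = [
    ("no_swearing",
     lambda t: "Sorry, I can't answer that." if "damn" in t.lower() else None)]

print(guarded_chat("How do I earn points?",
                   input_checks, fake_generate, output_checks))
print(guarded_chat("¿Cómo gano puntos?",
                   input_checks, fake_generate, output_checks))
```

Running the input checks first means a blocked query never reaches the retriever or the LLM, which also limits API costs from malicious or off-topic use.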


# Appendix

In [19]:
# List all registered guardrail validators
validator_base.validators_registry

{'hullucination': __main__.HallucinationGuardrail,
 'toxicity': __main__.ToxicityGuardrail,
 'english_input': __main__.EnglishInputGuardrail,
 'topic': __main__.TopicGuardrail,
 'competitor': __main__.CompetitorGuardrail,
 'PII': __main__.PIIGuardrail}