# Guardrail Implementation  
**Colin Li @ 2025-03**

This notebook continues from `nb_1_guardrail_design.ipynb` and focuses on implementing the designed guardrails in a RAG-based chatbot using the open-source GenAI model `gemma-7b-it`. It demonstrates the **effectiveness** of the guardrails both **qualitatively** and **quantitatively**, and showcases alignment with **best practices** commonly adopted by cloud providers such as AWS.



## README Before You Start

- Colab Runtime Setup
    - **Tested in**: Colab Pro only  
    - **Runtime type**: Python 3  
    - **Hardware accelerator**: T4 GPU  
    - **High-RAM**: Enabled

- Preparation
    - Create a Hugging Face account and access token
    - Log in and accept model terms for:
        - [`sentence-transformers/all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)
        - [`google/gemma-7b-it`](https://huggingface.co/google/gemma-7b-it)
    - Follow the instructions below to:
        - Install the required Python packages (listed below)
        - Enter your Hugging Face token when prompted
        - Download the knowledge base (PDF) from the GitHub repo


- Note for Local Setup
    - This notebook is GPU-intensive. Running locally without an **NVIDIA GPU** is not recommended.  
    - If running locally, ensure at least **15GB of available VRAM**.


In [1]:
%%capture
!pip install tqdm
!pip install torch
!pip install gradio
!pip install PyMuPDF
!pip install faiss-cpu
!pip install accelerate
!pip install bitsandbytes
!pip install transformers
!pip install guardrails-ai
!pip install presidio_analyzer
!pip install presidio_anonymizer
!pip install sentence-transformers
!python -m spacy download en_core_web_lg

In [2]:
# Packages required for Chatbot
import gc
import os
import fitz
import faiss
import torch
import numpy as np
import gradio as gr
from getpass import getpass
from tqdm.notebook import tqdm
from google.colab import output
from huggingface_hub import login
from sentence_transformers import SentenceTransformer
from langchain.text_splitter import RecursiveCharacterTextSplitter
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig)

In [3]:
# Packages required for GRs
import re
import json
import logging
from datetime import datetime
from transformers import pipeline
from guardrails import validator_base
from typing import Any, Dict, List, Optional
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from guardrails.validator_base import (
    Validator,
    register_validator,
    FailResult,
    PassResult,
    ValidationResult)

In [4]:
# Enter your Hugging Face access token here
hf_token = getpass("Enter your Hugging Face access token: ")

# Login to huggingface first
login(token=hf_token)

Enter your Hugging Face access token: ··········


In [5]:
# Download data from the github repo created for this demo
!wget https://github.com/cl2020/gr_demo/raw/main/data/ultimate-awards.pdf -O ultimate-awards.pdf

--2025-03-28 03:13:57--  https://github.com/cl2020/gr_demo/raw/main/data/ultimate-awards.pdf
Resolving github.com (github.com)... 20.27.177.113
Connecting to github.com (github.com)|20.27.177.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://media.githubusercontent.com/media/cl2020/gr_demo/main/data/ultimate-awards.pdf [following]
--2025-03-28 03:13:58--  https://media.githubusercontent.com/media/cl2020/gr_demo/main/data/ultimate-awards.pdf
Resolving media.githubusercontent.com (media.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.110.133, ...
Connecting to media.githubusercontent.com (media.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3170124 (3.0M) [application/octet-stream]
Saving to: ‘ultimate-awards.pdf’


2025-03-28 03:14:00 (9.07 MB/s) - ‘ultimate-awards.pdf’ saved [3170124/3170124]



In [6]:
# Read pdf which is the knowledge base for RAG
fn_pdf = 'ultimate-awards.pdf'
if fn_pdf in os.listdir():
    doc = fitz.open(fn_pdf)
else:
    raise IOError("Failed to download. Please manually upload the file to Colab.")

# 1 Setup Guardrails
- Refer to `nb_1_guardrail_design.ipynb` for a detailed explanation
- Make sure all models in the next cell load without errors

In [7]:
%%capture
# Hallucination Detection
nli_model_1 = pipeline(
    "text-classification", model='GuardrailsAI/finetuned_nli_provenance')

# Topic Classification
topic_model = pipeline(
    "zero-shot-classification",
    model='facebook/bart-large-mnli',
    hypothesis_template="This sentence above contains discussions of the following topics: {}.",
    multi_label=True)

# English Language Detection
lan_class_model = pipeline(
    "text-classification", model="qanastek/51-languages-classifier")

# Toxic Language Detection
toxicity_model = pipeline(
    "text-classification", model="JungleLee/bert-toxic-comment-classification")

# PII Detection and Anonymisation
logging.getLogger("presidio-analyzer").setLevel(logging.ERROR)
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

Device set to use cuda:0
Device set to use cuda:0
Device set to use cuda:0
Device set to use cuda:0


In [8]:
# Configuration
pii_entities = [
    "PERSON", "FIRST_NAME", "LAST_NAME", "EMAIL_ADDRESS", "PHONE_NUMBER",
    "LOCATION", "CITY", "STATE_OR_PROVINCE", "COUNTRY", "ZIP_CODE",
    "CREDIT_CARD", "IN_PAN"]

allowed_topics = [
    "credit card", "ultimate awards", "annual fees", "credit limit",
    "minimum repayment", "interest rates", "interest-free days",
    "monthly fees", "cash advance fees", "late payment fees",
    "awards points", "qantas points", "earning points",
    "points offers", "cashback offers", "redeeming points",
    "international transaction fee", "travel insurance"]

banned_topics = [
    "home loan", "mortgage", "personal loan", "savings account",
    "term deposit", "transaction account", "offset account", "insurance",
    "superannuation", "business account", "investments",
    "financial advice", "investment advice", "complaint", "refund",
    "scam", "fraud", "password", "update personal details", "speak to manager",
    "account locked", "privacy", "data breach", "legal", "lawsuit",
    "joke", "preference", "pizza", "weather", "news", "sports", "religion",
    "politics", "ethics", "opinion"]

competitors_dict = {
    "ANZ": "ANZ", "Westpac": "Westpac", "NAB": "National Australia Bank",
    "Macquarie": "Macquarie Bank", "Amex": "American Express",
    "Afterpay": "Afterpay"}

# Helper function
def get_result(result, verbose=True):
    """
    Extracts validation result details. Converts the result object into a
    dictionary with relevant fields depending on the outcome.
    - If the outcome is 'pass', it loads and includes the value override.
    - If the outcome is 'fail', it loads the error message and adds the fix
      response.

    Args:
        result: A ValidationResult object with outcome, value_override,
                error_message, and fix_value.
        verbose (bool): Whether to print result details to the console.

    Returns:
        dict: A dictionary containing the outcome and relevant details.
    """
    outcome = result.outcome
    output = {'outcome':outcome}
    if outcome=='pass':
        details = json.loads(result.value_override)
        output.update(details)
        if verbose:
            print(f'Validation Result: {outcome}, Details: {details}')
    elif outcome=='fail':
        details = json.loads(result.error_message)
        output.update(details)
        output['response'] = result.fix_value
        if verbose:
            print(f'Validation Result: {outcome}, Details: {details}, '
                  f'Bot Response: {result.fix_value}')
    return output

# Guardrails
@register_validator(name="hallucination", data_type="string")
class HallucinationGuardrail(Validator):
    """
    A custom validator that detects hallucinations in model-generated text by
    comparing it with a reference premise using a natural language inference
    (NLI) model. The validator uses the model's output to assess whether the
    generated text is entailed by, neutral to, or contradicts the premise.

    - If the output is 'entailment' or 'neutral', it passes the validation.
    - If it's 'contradiction' but the confidence score is below a threshold,
      it is considered neutral and passes.
    - Otherwise, the validation fails with an error message.

    Attributes:
        model: A callable NLI model that takes a dict with "text"
               and "text_pair".
        threshold (float): The maximum score at which a contradiction is
                           treated as neutral.
    """
    def __init__(self, model, threshold:float, **kwargs):
        super().__init__(**kwargs)
        self.model = model
        self.threshold=threshold

    def _validate(self, value: str, premise: str):
        self.premise = premise
        result = self.model({"text": self.premise, "text_pair": value})
        label = result['label'].lower()
        score = result['score']

        if label=="entailment":
            return PassResult(value_override=json.dumps(result))

        # Note: The NLI model used in this demo does not return a neutral label
        elif label=="neutral":
            return PassResult(value_override=json.dumps(result))
        elif label=="contradiction" and score<self.threshold:
            return PassResult(
                value_override=json.dumps({"label":"neutral","score":score}))
        else:
            return FailResult(
                error_message=json.dumps(result),
                fix_value="I'm sorry, I couldn't find enough information "
                          "to answer that accurately.")

@register_validator(name="toxicity", data_type="string")
class ToxicityGuardrail(Validator):
    """
    A custom validator to detect toxic language using a text classification
    model. The validator assesses if the input is toxic based on the model's
    label and score, and applies a configurable threshold to determine the
    outcome.

    - If the input is labeled as 'non-toxic', it passes.
    - If labeled as 'toxic' and no threshold is provided, or the score exceeds
      the threshold, it fails.
    - If the toxicity score is below the threshold, it passes, treating the
      message as 'non-toxic'.

    Attributes:
        model: A callable classification model returning a list with 'label'
               and 'score'.
        threshold (float, optional): A score threshold below which 'toxic'
                                     inputs are treated as non-toxic.
    """
    def __init__(self, model, threshold:Optional[float]=None, **kwargs):
        super().__init__(**kwargs)
        self.model = model
        self.threshold = threshold

    def _validate(self, value: str):
        result = self.model(value)[0]
        label = result['label']
        score = result['score']
        if label=='non-toxic':
            return PassResult(value_override=json.dumps(result))

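        # Fail when no threshold is set or the score reaches the threshold
        # (score == threshold previously matched no branch and returned None)
        elif label=='toxic' and (
              self.threshold is None or score >= self.threshold):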
            return FailResult(
                error_message=json.dumps(result),
                fix_value="I can't continue the conversation if inappropriate "
                          "language is used. Let me know how I can help "
                          "respectfully.")

        elif label=='toxic' and score<self.threshold:
            return PassResult(
                value_override=json.dumps({'label':'non-toxic', 'score':score}))

@register_validator(name="english_input", data_type="string")
class EnglishInputGuardrail(Validator):
    """
    A custom validator to check if the input text is in English. It uses a
    language detection model and validates based on the predicted language.

    - If the input is in English ('en-US'), validation passes.
    - If the input is in another language, validation fails.

    Attributes:
        model: A language detection model that returns 'label' and 'score'.
    """
    def __init__(self, model, **kwargs):
        super().__init__(**kwargs)
        self.model = model

    def _validate(self, value: str):
        result = self.model(value)[0]
        label = result['label']
        score = result['score']
        if label=='en-US':
            return PassResult(value_override=json.dumps(result))
        else:
            return FailResult(
                error_message=json.dumps(result),
                fix_value="Oops! I’m still learning new languages. But for now"
                          ", I can only chat confidently in English.")

@register_validator(name="topic", data_type="string")
class TopicGuardrail(Validator):
    """
    A custom validator to check whether the input text belongs to an allowed
    or banned topic. It uses a classification model to label the input and
    validates based on topic lists.

    - If the topic is banned, validation fails with a fixed response.
    - If the topic is allowed, validation passes.
    - If the topic is unknown, validation fails.

    Attributes:
        model: A classification model returning 'labels' and 'scores'.
        allowed_topics (list): Topics that are permitted.
        banned_topics (list): Topics that are restricted.
    """
    def __init__(self, model, allowed_topics, banned_topics, **kwargs):
        super().__init__(**kwargs)
        self.allowed_topics = allowed_topics
        self.banned_topics = banned_topics
        self.model = model

    def _validate(self, value: str):
        result = self.model(value, self.allowed_topics+self.banned_topics)
        label = result['labels'][0]
        top_result = {'label': label, 'score':result['scores'][0]}

        if label in self.banned_topics:
            return FailResult(
                error_message=json.dumps(top_result),
                fix_value="Let's keep it focused on credit cards. "
                          "Happy to help with anything in that space!")
        elif label in self.allowed_topics:
            return PassResult(value_override=json.dumps(top_result))
        else:
            return FailResult(
                error_message=json.dumps(top_result),
                fix_value="Let's keep it focused on credit cards. "
                          "Happy to help with anything in that space!")

@register_validator(name="competitor", data_type="string")
class CompetitorGuardrail(Validator):
    """
    A custom validator to detect mentions of competitors in user input.
    It uses regex pattern matching to search for competitor names and
    brands in the text.

    - If a competitor name is found, validation fails.
    - If no match is found, validation passes.

    Attributes:
        competitors (dict): A dictionary mapping competitor names to their
                            associated brand names.
    """
    def __init__(self, competitors:Dict[str, str], **kwargs):
        super().__init__(**kwargs)
        self.competitors=competitors

    def _validate(self, value: str):
        pattern = re.compile(
            r'\b(' + '|'.join(
                re.escape(name) for name in \
                    list(self.competitors.keys()) + \
                    list(self.competitors.values())) + r')\b',
            flags=re.IGNORECASE
        )
        matches = pattern.findall(value)
        if matches:
            return FailResult(
                error_message=json.dumps({
                    'label':matches, 'score':[1.00]*len(matches)}),
                fix_value="I'm here to help with questions about our products "
                          "only, so I can't comment on other institutions.")
        else:
            return PassResult(
                value_override=json.dumps({'label':['no-matches'], 'score':[1.00]}))

@register_validator(name="PII", data_type="string")
class PIIGuardrail(Validator):
    """
    A custom validator to detect and handle personally identifiable
    information (PII) in user input. It uses an analyzer to identify PII
    entities and an anonymizer to optionally mask them.

    - If PII is detected, validation fails and returns a privacy warning.
    - If no PII is found, validation passes.

    Attributes:
        analyzer: A PII analyzer that identifies entity types in text.
        anonymizer: A tool that can anonymize detected PII.
        pii_entities (list, optional): Specific PII entities to scan for.
    """
    def __init__(self, analyzer, anonymizer,
                 pii_entities:Optional[List[str]]=None, **kwargs):
        super().__init__(**kwargs)
        self.analyzer = analyzer
        self.pii_entities=pii_entities
        self.anonymizer = anonymizer

    def _validate(self, value: str):
        if self.pii_entities is not None:
            analysis = self.analyzer.analyze(
                value, entities=self.pii_entities, language='en')
        else:
            analysis = self.analyzer.analyze(
                value, language='en')

        if len(analysis):
            result = {'label':[a.entity_type for a in analysis],
                      'score':[a.score for a in analysis]}

            new_value = self.anonymizer.anonymize(
                text=value, analyzer_results=analysis)

            return FailResult(
                error_message = json.dumps(result),
                fix_value="Privacy is a big deal! Let's keep personal info "
                          "out of our chat. "
                          "Feel free to rephrase your question. ")
                # Decision point: Do we want to redact user input?
                # if so, then `fix_value=new_value.text`
        else:
            return PassResult(
                value_override=json.dumps({'label':['N/A'], 'score':[0]}))
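
The competitor check above boils down to one case-insensitive, word-bounded regular expression built from the dictionary's keys and values. The following is a minimal standalone sketch of that matching step. `build_pattern`, `find_competitors`, and the two-entry dictionary are hypothetical names used only for illustration, and the pattern is compiled once up front rather than on every validation call.

```python
import re

# Hypothetical, shortened stand-in for the notebook's competitors_dict
competitors = {"NAB": "National Australia Bank", "Amex": "American Express"}

def build_pattern(competitors):
    """Compile one word-bounded, case-insensitive pattern for all names."""
    names = set(competitors.keys()) | set(competitors.values())
    # Longest first so a full brand name wins over a shorter alias
    ordered = sorted(names, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in ordered)
    return re.compile(r"\b(" + alternatives + r")\b", flags=re.IGNORECASE)

def find_competitors(text, pattern):
    """Return every competitor mention found in the text."""
    return pattern.findall(text)

pattern = build_pattern(competitors)
hits = find_competitors("Is this card better than amex or NAB?", pattern)
misses = find_competitors("What is the annual fee?", pattern)
print(hits, misses)
```

The `\b` anchors keep short aliases from matching inside longer words, which matters for three-letter names.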

# 2 Setup RAG-Based Chatbot

- Create a simple RAG chatbot to test all implemented guardrails.
- Key Tech Stack:
    - **Retrieval and Augmentation**:
        - Chunking: `LangChain`
        - Embedding: `all-MiniLM-L6-v2`
        - Vector DB & Semantic Search: `FAISS`
    - **Generation**:
        - **GenAI Model**: `Gemma-7b-it`

In [9]:
# Setup device
vram = round(torch.cuda.get_device_properties(0).total_memory/1024**3,2)
print(f"GPU vRAM: {vram} GB")
device = 'cuda'

GPU vRAM: 14.74 GB


In [10]:
%%capture
# Note: This cell will take a while to run (<=5 min in Colab Pro)
# torch.cuda.empty_cache()

# Setup embedding model - 90MB required
emb_model_id = "sentence-transformers/all-minilm-l6-v2"
emb_model = SentenceTransformer(model_name_or_path=emb_model_id).to(device)

# Setup LLM - 17GB disk space required
llm_id = "google/gemma-7b-it"
quantization_config = BitsAndBytesConfig(load_in_4bit=True)
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=llm_id)
llm = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=llm_id,
    torch_dtype=torch.float32,
    quantization_config=quantization_config,
    low_cpu_mem_usage=True).to(device)

## 2.1 Chunking: PDF to Chunks

In [11]:
texts = "\n".join([page.get_text() for page in doc])
texts = texts.replace('\n', ' ').replace("•", ' ').strip()
splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20
)
chunks = splitter.split_text(texts)
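
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-width sliding window over characters. It is only an approximation for illustration: `RecursiveCharacterTextSplitter` also tries to split on separators such as paragraph breaks and spaces, so its chunk boundaries will generally differ. `sliding_chunks` is a hypothetical helper, not part of LangChain.

```python
def sliding_chunks(text, chunk_size=100, chunk_overlap=20):
    """Split text into fixed-width windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks

# Toy input: 250 digits, so chunk boundaries are easy to inspect
sample = "".join(str(i % 10) for i in range(250))
demo_chunks = sliding_chunks(sample, chunk_size=100, chunk_overlap=20)
print([len(c) for c in demo_chunks])
```

With 100-character chunks, a single chunk often holds only a sentence fragment, which is visible in the retrieved contexts later in the notebook.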

## 2.2 Embedding: Chunks to Vector DB

In [12]:
embeddings = emb_model.encode(chunks, convert_to_numpy=True)
vector_db = faiss.IndexFlatL2(embeddings.shape[1])
vector_db.add(embeddings)
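
`faiss.IndexFlatL2` performs exact (brute-force) search and reports squared Euclidean distances, so smaller scores mean closer matches. The pure-Python sketch below reproduces that idea on toy 2-D vectors so the ranking is easy to follow. `l2_search` and the toy vectors are hypothetical, and this is not how FAISS is implemented internally.

```python
def l2_search(query, vectors, k=3):
    """Return (squared_distances, indices) of the k nearest vectors."""
    scored = []
    for idx, vec in enumerate(vectors):
        # Squared Euclidean distance between the query and this vector
        dist = sum((q - v) ** 2 for q, v in zip(query, vec))
        scored.append((dist, idx))
    scored.sort()  # ascending: nearest first
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]

# Toy "embeddings" standing in for the chunk vectors
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
dists, idxs = l2_search([0.9, 0.1], toy_vectors, k=2)
print(dists, idxs)
```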

## 2.3 Context Retrieval: Semantic/Vector Search

In [13]:
def retrieve_context_from_db(query:str,
                             vector_db,
                             emb_model,
                             num_chunks_to_retrieve:int=3,
                             verbose:bool=True) -> tuple:
    """
    Retrieve context chunks from vector database
    Colin Li @ 2025-03
    """
    # Embed query and find the nearest chunks by L2 distance
    query_embedding = emb_model.encode([query])
    dist, idx = vector_db.search(np.array(query_embedding),
                                 k=num_chunks_to_retrieve)
    # Map the returned indices back to the global `chunks` list
    ret_scores = dist[0].tolist()
    ret_texts = [chunks[i] for i in idx[0]]
    return ret_scores, ret_texts

In [14]:
def build_augmented_prompt(query:str,
                           base_prompt:str,
                           ret_texts:list, verbose=False):
    """
    Build augmented prompt for the RAG pipeline
    Colin Li @ 2025-03
    """
    # Create augmented prompt
    context = "\n".join([f"Context {i+1}: {t}" for i, t in enumerate(ret_texts)])
    augmented_prompt = \
    f"""\
    {base_prompt}:
    {context}
    User query: {query}
    Answer:
    """

    # Dedent
    augmented_prompt = augmented_prompt.replace("    ", "")

    # Apply template
    dialogue_template = [{"role": "user", "content": augmented_prompt}]
    if verbose:
        print(augmented_prompt)

    return dialogue_template, context
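
One caveat with the `replace("    ", "")` dedent step above is that it also removes any run of four spaces inside the retrieved context or the query. A possible alternative, sketched below under the assumption that the prompt layout should stay the same, assembles the prompt from a list of lines so no dedent step is needed. `build_prompt_text` is a hypothetical helper, not part of the notebook.

```python
def build_prompt_text(query, base_prompt, ret_texts):
    """Assemble the augmented prompt line by line, leaving the context untouched."""
    context = "\n".join(
        f"Context {i + 1}: {t}" for i, t in enumerate(ret_texts))
    lines = [f"{base_prompt}:", context, f"User query: {query}", "Answer:"]
    return "\n".join(lines) + "\n", context

prompt, context = build_prompt_text(
    query="do i pay additional cardholder fee?",
    base_prompt="Answer using only the context",
    # The first chunk deliberately contains a run of four spaces
    ret_texts=["Fee    Free", "No additional cardholder fee"])
print(prompt)
```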

## 2.4 Augmented Generation Using Open Source GenAI

In [15]:
def generate_response(query:str,
                      base_prompt:str,
                      vector_db,
                      tokenizer,
                      llm,
                      device:str,
                      max_new_tokens:int=150,
                      temperature:float=0.9,
                      verbose=True):

    ret_scores, ret_texts = \
    retrieve_context_from_db(query, vector_db, emb_model, verbose=False)
    dialogue_template, premise = build_augmented_prompt(
        query, base_prompt, ret_texts, verbose=False)

    # Apply the chat template
    augmented_prompt = tokenizer.apply_chat_template(
        conversation=dialogue_template,
        tokenize=False,
        add_generation_prompt=True)
    if verbose:
        print(f"Query: {query}")
        print(f"Retrieved Text: {ret_texts}")
        print(f"Dialogue Template: {dialogue_template}")
        print(f"Augmented Prompt in Template:\n{augmented_prompt}")

    # Tokenise prompt
    ids_in = tokenizer(augmented_prompt, return_tensors="pt").to(device)

    # Generate output
    ids_out = llm.generate(**ids_in,
                           max_new_tokens=max_new_tokens,
                           temperature=temperature,
                           do_sample=True)
    text_out = tokenizer.decode(ids_out[0])
    text_out = text_out\
        .replace(augmented_prompt, '')\
        .replace("<bos>", "").replace("<eos>", "")\
        .replace("Sure, here is the answer to the user query based on the provided context:\n\n", "")

    return text_out, premise
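
The function above recovers the answer by decoding the full sequence and string-replacing the prompt. A common alternative is to slice off the prompt tokens before decoding, because a causal LM returns the prompt followed by the continuation. The toy sketch below illustrates the slicing with plain lists; the vocabulary and `decode` function are hypothetical stand-ins for a real tokenizer.

```python
# Toy vocabulary standing in for a real tokenizer (hypothetical)
vocab = {0: "<bos>", 1: "<eos>", 2: "What", 3: "fee", 4: "?", 5: "No", 6: "fee."}
special_ids = {0, 1}

def decode(ids, skip_special_tokens=False):
    """Join token strings, optionally dropping special tokens."""
    tokens = [vocab[i] for i in ids
              if not (skip_special_tokens and i in special_ids)]
    return " ".join(tokens)

prompt_ids = [0, 2, 3, 4]
generated_ids = prompt_ids + [5, 6, 1]  # prompt + continuation

# Keep only the newly generated tokens, then decode without special tokens
new_ids = generated_ids[len(prompt_ids):]
answer = decode(new_ids, skip_special_tokens=True)
print(answer)
```

With the Hugging Face objects in this notebook, the equivalent slice would be `ids_out[0][ids_in["input_ids"].shape[1]:]`, decoded with `tokenizer.decode(..., skip_special_tokens=True)`.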

## 2.5 Test RAG

In [16]:
base_prompt = (
    "Use the following retrieved context to answer the user's query concisely "
    "and accurately. Prioritise relevance and avoid adding any information "
    "that isn't in the provided context. Include all necessary details to give "
    "a clear and complete answer. Do not mention or refer to the context "
    "source. Only respond to what's asked in the user query. "
    "Think carefully before answering."
)
query = 'do i pay additional cardholder fee?'
text_out, premise = generate_response(
    query=query,
    base_prompt=base_prompt,
    vector_db=vector_db,
    tokenizer=tokenizer,
    llm=llm,
    device=device,
    max_new_tokens=150,
    temperature=0.5,
    verbose=False)
print(premise)
print(text_out)

Context 1: & fees  Conditions  Additional Cardholder Fee  Free  Pay no additional cardholder fee to share the
Context 2: fee to share the convenience of your card with  someone else.  Cash Advance Rate  21.99% p.a.  Cash
Context 3: worldwide and up-to-date travel information7     No additional cardholder fee to share your card
Based on the provided context, the answer to the user query is:

No additional cardholder fee is required to share your card in both contexts.


# 3 Implementation of Guardrails

## 3.1 Implementation Logic

In [17]:
def implement_input_gr(hypothesis, premise=None, gr_db:Dict=None):
    """
    Logic to implement input guardrails
    """

    # English Input
    gr_1 = EnglishInputGuardrail(model=lan_class_model)
    gr_1_eval = get_result(gr_1._validate(hypothesis), verbose=False)

    # OOS Topics:
    gr_2 = TopicGuardrail(model=topic_model, allowed_topics=allowed_topics,
                          banned_topics=banned_topics)
    gr_2_eval = get_result(gr_2._validate(hypothesis), verbose=False)

    # Competitors (REGEX):
    gr_3 = CompetitorGuardrail(competitors=competitors_dict)
    gr_3_eval = get_result(gr_3._validate(hypothesis), verbose=False)

    # PII:
    gr_4 = PIIGuardrail(analyzer=analyzer, anonymizer=anonymizer,
                        pii_entities=pii_entities)
    gr_4_eval = get_result(gr_4._validate(hypothesis), verbose=False)

    # Log all guardrail results
    if gr_db is not None:
        gr_db[datetime.today()] = gr_1_eval
        gr_db[datetime.today()] = gr_2_eval
        gr_db[datetime.today()] = gr_3_eval
        gr_db[datetime.today()] = gr_4_eval

    # The standardised response from the first failed guardrail result
    results = (gr_1_eval, gr_2_eval, gr_3_eval, gr_4_eval)
    for result in results:
        if result.get("outcome") == "fail":
            return {
                'use_rag':False,
                'response': result["response"],
                'context': premise,
                'gr_english':gr_1_eval,
                'gr_topics': gr_2_eval,
                'gr_competitors': gr_3_eval,
                'gr_pii': gr_4_eval}

    # All input guardrails passed: signal the caller to run the RAG pipeline
    else:
        return {
            'use_rag': True,
            'response': None, # No response when all gr passed
            'context': premise,
            'gr_english': gr_1_eval,
            'gr_topics': gr_2_eval,
            'gr_competitors': gr_3_eval,
            'gr_pii': gr_4_eval}
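
The core of the function above is a "first failure wins" rule: guardrails are evaluated in a fixed order, and the canned response of the first one that fails is returned. A compact sketch of just that rule follows. The `first_failure` helper and the sample results are hypothetical, shaped like the dictionaries returned by `get_result`.

```python
def first_failure(evals):
    """Return (name, result) of the first failed guardrail, or None."""
    return next(
        ((name, res) for name, res in evals.items()
         if res["outcome"] == "fail"),
        None)

# Hypothetical results, in evaluation order
evals = {
    "gr_english": {"outcome": "pass", "label": "en-US", "score": 0.99},
    "gr_topics": {"outcome": "fail", "label": "weather", "score": 0.87,
                  "response": "Let's keep it focused on credit cards."},
    "gr_competitors": {"outcome": "pass", "label": ["no-matches"],
                       "score": [1.0]},
    "gr_pii": {"outcome": "fail", "label": ["PERSON"], "score": [0.85],
               "response": "Privacy is a big deal!"},
}

failed = first_failure(evals)
use_rag = failed is None                      # only call the LLM if all pass
response = None if failed is None else failed[1]["response"]
print(use_rag, response)
```

Because the order decides which message the user sees, a query that fails both the topic and PII checks receives the topic message here.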

In [18]:
def implement_output_gr(hypothesis, premise, gr_db:Dict=None):
    """
    Logic to implement output guardrails
    """

    # Hallucination
    gr_1 = HallucinationGuardrail(model=nli_model_1, threshold=.9718)
    gr_1_eval = get_result(gr_1._validate(hypothesis, premise), verbose=False)
    gr_1_eval['response_0'] = hypothesis

    # Toxicity
    gr_2 = ToxicityGuardrail(model=toxicity_model,threshold=None)
    gr_2_eval = get_result(gr_2._validate(hypothesis), verbose=False)
    gr_2_eval['response_0'] = hypothesis

    # Log all guardrail results
    if gr_db is not None:
        gr_db[datetime.today()] = gr_1_eval
        gr_db[datetime.today()] = gr_2_eval

    # Key Logic: The standardised response from the first failed guardrail
    results = (gr_1_eval, gr_2_eval)
    for result in results:
        if result.get("outcome") == "fail":
            return {
                'use_rag': True,
                'response': result["response"],
                'context': premise,
                'gr_hallucination': gr_1_eval,
                'gr_toxicity': gr_2_eval}

    # All output guardrails passed: return the original LLM response
    # Note: This allows the data to be passed to the UI
    else:
        return {
            'use_rag': True,
            'response': results[-1]["response_0"],
            'context': premise,
            'gr_hallucination': gr_1_eval,
            'gr_toxicity': gr_2_eval}
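
The two output guardrails differ mainly in how their thresholds are applied: the hallucination check passes low-confidence contradictions, while the toxicity check fails every `toxic` label when no threshold is given. The sketch below isolates those decision rules as pure functions so they can be tested without loading any model. Both function names are hypothetical. Treating a score equal to the threshold as a failure, and passing any label other than `toxic`, are assumptions of this sketch.

```python
def hallucination_decision(label, score, threshold):
    """Pass unless the NLI model reports a confident contradiction."""
    label = label.lower()
    if label in ("entailment", "neutral"):
        return "pass"
    if label == "contradiction" and score < threshold:
        return "pass"  # low-confidence contradiction treated as neutral
    return "fail"

def toxicity_decision(label, score, threshold=None):
    """Fail on 'toxic' unless a threshold is set and the score is below it."""
    if label != "toxic":
        return "pass"
    if threshold is None or score >= threshold:
        return "fail"
    return "pass"

# Same thresholds as the cell above: 0.9718 for NLI, None for toxicity
print(hallucination_decision("contradiction", 0.95, 0.9718))
print(hallucination_decision("contradiction", 0.99, 0.9718))
print(toxicity_decision("toxic", 0.60))
```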

In [19]:
def prepare_data_for_gradio(gr_ip, gr_op=None):
    """
    Prepare data for gradio UI
    """

    # Extracting values from gr_ip dictionary
    # English Guardrail
    in_eng_outcome = gr_ip['gr_english']['outcome']
    in_eng_label = gr_ip['gr_english']['label']
    in_eng_score = gr_ip['gr_english']['score']

    # Topics Guardrail
    in_topic_outcome = gr_ip['gr_topics']['outcome']
    in_topic_label = gr_ip['gr_topics']['label']
    in_topic_score = gr_ip['gr_topics']['score']

    # PII Guardrail
    in_pii_outcome = gr_ip['gr_pii']['outcome']
    in_pii_label = ', '.join(map(str, gr_ip['gr_pii']['label']))
    in_pii_score = ', '.join(map(str, gr_ip['gr_pii']['score']))

    # Competitors Guardrail
    in_comp_outcome = gr_ip['gr_competitors']['outcome']
    in_comp_label = ', '.join(map(str, gr_ip['gr_competitors']['label']))
    in_comp_score = ', '.join(map(str, gr_ip['gr_competitors']['score']))

    if gr_ip['use_rag']:
        response_box = gr_op['response']
        context_box = gr_op['context']

        # Hallucination Guardrail
        out_hallu_outcome = gr_op['gr_hallucination']['outcome']
        out_hallu_label = gr_op['gr_hallucination']['label']
        out_hallu_score = gr_op['gr_hallucination']['score']
        out_hallu_response = gr_op['gr_hallucination']['response_0']

        # Toxicity Guardrail
        out_tox_outcome = gr_op['gr_toxicity']['outcome']
        out_tox_label = gr_op['gr_toxicity']['label']
        out_tox_score = gr_op['gr_toxicity']['score']
        out_tox_response = gr_op['gr_toxicity']['response_0']

    else:
        response_box = gr_ip['response']
        context_box = gr_ip['context']

        # Hallucination Guardrail
        out_hallu_outcome = None
        out_hallu_label = None
        out_hallu_score = None
        out_hallu_response = None

        # Toxicity Guardrail
        out_tox_outcome = None
        out_tox_label = None
        out_tox_score = None
        out_tox_response = None

    return (context_box, response_box, in_eng_outcome, in_eng_label, in_eng_score,
            in_topic_outcome, in_topic_label, in_topic_score, in_pii_outcome,
            in_pii_label, in_pii_score, in_comp_outcome, in_comp_label,
            in_comp_score, out_hallu_outcome, out_hallu_label, out_hallu_score,
            out_hallu_response, out_tox_outcome, out_tox_label, out_tox_score,
            out_tox_response)

## 3.2 Demo - Chatbot Response without Guardrails

In [20]:
# Setup
def chatbot_no_gr(query):
    text_out, premise = generate_response(
        query=query,
        base_prompt=base_prompt,
        vector_db=vector_db,
        tokenizer=tokenizer,
        llm=llm,
        device=device,
        max_new_tokens=150,
        temperature=0.5,
        verbose=False)
    return text_out

In [21]:
# Demo
print(chatbot_no_gr("Shitty service!"))

The provided text does not contain any information about the user's query, therefore I cannot provide an answer.


## 3.3 Demo - Chatbot Response with Guardrails

In [22]:
# Setup
gr_db = dict()
def chatbot(query):
    gr_ip = implement_input_gr(hypothesis=query, premise=None, gr_db=None)
    if gr_ip['use_rag']:
        text_out, premise = generate_response(
            query=query,
            base_prompt=base_prompt,
            vector_db=vector_db,
            tokenizer=tokenizer,
            llm=llm,
            device=device,
            max_new_tokens=150,
            temperature=0.5,
            verbose=False)
        gr_op = implement_output_gr(hypothesis=text_out,
                                    premise=premise, gr_db=None)
        return prepare_data_for_gradio(gr_ip=gr_ip, gr_op=gr_op)
    else:
        return prepare_data_for_gradio(gr_ip=gr_ip, gr_op=None)
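
The `gr_db` dictionary created above is not passed to the guardrail functions, since both calls use `gr_db=None`. If logging is wanted, one option is an append-only list of records, which avoids relying on timestamps being unique dictionary keys. A minimal sketch, with `log_result` and the record fields as hypothetical names:

```python
from collections import Counter
from datetime import datetime, timezone

def log_result(log, guardrail, result):
    """Append one guardrail outcome to the log with a UTC timestamp."""
    log.append({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "guardrail": guardrail,
        "outcome": result["outcome"],
    })

gr_log = []
log_result(gr_log, "english", {"outcome": "pass"})
log_result(gr_log, "topics", {"outcome": "fail"})
log_result(gr_log, "pii", {"outcome": "pass"})

# Simple summary of how often guardrails passed or failed
outcome_counts = Counter(rec["outcome"] for rec in gr_log)
print(outcome_counts)
```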

In [23]:
# Demo
print(chatbot('do i pay for international transaction fee?')[1])
print(chatbot('Shitty service!')[1])

Based on the provided context, the answer to the user's query is:

No international transaction fees are payable for purchases made overseas or online.
Let's keep it focused on credit cards. Happy to help with anything in that space!


# 4 See the Guardrails in Action: UI and Demo
- Click the Gradio URL to open the UI in a new browser tab

In [24]:
# UI logic
with gr.Blocks() as demo:

    # I/O
    gr.Markdown("## 💬 Credit Card Chatbot with Guardrail Outputs")
    user_input = gr.Textbox(label="Ask a question",
                            placeholder="e.g. How do I earn bonus points?")
    send_btn = gr.Button("Submit")

    gr.Markdown("**I/O**")
    with gr.Row():
        context_box = gr.Textbox(label="📘 Retrieved Context", lines=3)
        response_box = gr.Textbox(label="🤖 Bot Response (with Guardrail)",
                                  lines=3)

    # Guardrail
    gr.Markdown("### Guardrail Results")
    with gr.Row():

        # LEFT: Input Guardrails
        with gr.Column():
            gr.Markdown("#### 🔍 Input Guardrails")
            with gr.Row():
                in_eng_outcome = gr.Textbox(label="English Check - Outcome")
                in_eng_label = gr.Textbox(
                    label="English Check - Detected Language")
                in_eng_score = gr.Textbox(label="English Check - Confidence")
            with gr.Row():
                in_topic_outcome = gr.Textbox(label="Topic Check - Outcome")
                in_topic_label = gr.Textbox(label="Topic Check - Label")
                in_topic_score = gr.Textbox(label="Topic Check - Score")
            with gr.Row():
                in_comp_outcome = gr.Textbox(label="Competitor Check - Outcome")
                in_comp_label = gr.Textbox(label="Competitor Check - Match")
                in_comp_score = gr.Textbox(label="Competitor Check - Score")
            with gr.Row():
                in_pii_outcome = gr.Textbox(label="PII Check - Outcome")
                in_pii_label = gr.Textbox(label="PII Check - Entity Types")
                in_pii_score = gr.Textbox(label="PII Check - Scores")

        # RIGHT: Output Guardrails
        with gr.Column():
            gr.Markdown("#### 🧠 Output Guardrails")
            with gr.Row():
                out_hallu_outcome = gr.Textbox(
                    label="Hallucination - Outcome", scale=1)
                out_hallu_label = gr.Textbox(
                    label="Hallucination - NLI Label", scale=1)
                out_hallu_score = gr.Textbox(
                    label="Hallucination - NLI Score", scale=1)
            with gr.Row():
                out_hallu_response = gr.Textbox(
                    label="Bot Response (Pre-Hallucination Filter)",
                    scale=3, lines=2)
            with gr.Row():
                out_tox_outcome = gr.Textbox(label="Toxicity - Outcome")
                out_tox_label = gr.Textbox(label="Toxicity - Label")
                out_tox_score = gr.Textbox(label="Toxicity - Score")
            with gr.Row():
                out_tox_response = gr.Textbox(
                    label="Bot Response (Pre-Toxicity Filter)", lines=2)

    # Trigger chatbot function when button is clicked
    send_btn.click(
        fn=chatbot,
        inputs=[user_input],

        # Output from the function will be used to populate data to boxes in UI
        outputs=[
            # Shared
            context_box, response_box,

            # Input GR
            in_eng_outcome, in_eng_label, in_eng_score,
            in_topic_outcome, in_topic_label, in_topic_score,
            in_pii_outcome, in_pii_label, in_pii_score,
            in_comp_outcome, in_comp_label, in_comp_score,
            # Output GR
            out_hallu_outcome, out_hallu_label, out_hallu_score, out_hallu_response,
            out_tox_outcome, out_tox_label, out_tox_score, out_tox_response
        ]
    )
demo.launch(share=True)

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://b2a726333bb237721c.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
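
Gradio fills the components in `outputs` positionally from the values that `fn` returns, so the order of the list above matters more than the on-screen layout. Whether `prepare_data_for_gradio` returns its values in exactly this order is not shown in this section; the sketch below is one way to sanity-check a result outside the UI. The helper `label_outputs` is hypothetical and not part of Gradio.

```python
# Field names mirror the `outputs=[...]` list passed to send_btn.click above.
# `label_outputs` is a hypothetical helper for debugging, not a Gradio API.
OUTPUT_FIELDS = [
    "context_box", "response_box",
    "in_eng_outcome", "in_eng_label", "in_eng_score",
    "in_topic_outcome", "in_topic_label", "in_topic_score",
    "in_pii_outcome", "in_pii_label", "in_pii_score",
    "in_comp_outcome", "in_comp_label", "in_comp_score",
    "out_hallu_outcome", "out_hallu_label", "out_hallu_score",
    "out_hallu_response",
    "out_tox_outcome", "out_tox_label", "out_tox_score", "out_tox_response",
]

def label_outputs(values, fields=OUTPUT_FIELDS):
    """Pair each returned value with the UI field it would populate."""
    if len(values) != len(fields):
        raise ValueError(f"expected {len(fields)} values, got {len(values)}")
    return dict(zip(fields, values))

# Placeholder values standing in for a real chatbot(...) result.
fake_result = tuple(f"value_{i}" for i in range(len(OUTPUT_FIELDS)))
labelled = label_outputs(fake_result)
print(len(OUTPUT_FIELDS))
print(labelled["response_box"])
```

In the notebook, `label_outputs(chatbot("..."))` could be called the same way to see which value lands in which box before debugging the UI itself.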




# Appendix: Generate Groundedness Check for Human Review
- Responses were generated for the questions below.
- The groundedness of the responses was then checked.
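
As a rough illustration of how such a review set might be assembled, the sketch below runs a list of questions through an answer function and writes one row per question to CSV text, leaving a blank column for the reviewer. It is not the notebook's own code: `build_review_rows`, `rows_to_csv`, `echo_bot`, and the column names are all assumptions made for this example.

```python
import csv
import io

FIELDNAMES = ["category", "question", "response", "human_verdict"]

def build_review_rows(questions, answer_fn, category):
    """Run each question through answer_fn and collect one row per question."""
    rows = []
    for question in questions:
        rows.append({
            "category": category,
            "question": question,
            "response": answer_fn(question),
            "human_verdict": "",  # left blank for the reviewer to fill in
        })
    return rows

def rows_to_csv(rows):
    """Serialise the rows to CSV text so they can be opened in a spreadsheet."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Placeholder answer function; in the notebook this role would be played by
# the chatbot (assumption: any callable that maps a question to a string).
def echo_bot(question):
    return f"(answer to: {question})"

sample_rows = build_review_rows(
    ["Do the Awards points expire?"], echo_bot, "on_topic")
print(rows_to_csv(sample_rows))
```

Keeping the on-topic and off-topic questions in the same table, distinguished by the `category` column, makes it easier to compare how the guardrails behave across both groups.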

In [25]:
on_topic_questions = [
    "What is the Ultimate Awards credit card?",
    "How can I earn 100,000 bonus Awards points?",
    "What’s the minimum spend to qualify for the bonus offer?",
    "What purchases are excluded from the bonus point calculation?",
    "How do I opt-in to earn Qantas Points instead of Awards points?",
    "How much does it cost to opt-in for Qantas Points?",
    "What’s the conversion rate from Awards points to Qantas Points?",
    "Are there any international transaction fees?",
    "What’s the interest rate on purchases?",
    "How do I activate international travel insurance?",
    "What is the monthly fee and how can I avoid it?",
    "What happens if I spend more than $10,000 in a statement period?",
    "How do I redeem my Awards points?",
    "Can I use points to book flights or hotels?",
    "Do the Awards points expire?",
    "What is CommBank Yello Cashback Offers?",
    "How do I add an additional cardholder?",
    "Can I use my card overseas?",
    "What are the airport lounge benefits with this card?",
    "Where can I check my Awards points balance?",
]

off_topic_questions = [
    "Who has the best credit card in Australia?",
    "Can you tell me a joke about banks?",
    "Do you think this card is better than ANZ’s?",
    "Should I invest in Bitcoin instead?",
    "What’s your opinion on government spending?",
    "Can you write a rap about Qantas Points?",
    "Are interest rates going to rise this year?",
    "I hate all banks — what should I do?",
    "Can I use this card to buy a tank?",
    "Which politician gives better financial advice?"
]