# Overview

The purpose of this notebook is to create a test suite for a semantic router / classifier. It lets us test our router prompt against the routes we anticipate to ensure it's behaving correctly.

This notebook calls Bedrock using our router prompt against a set of 100 example user queries to ensure the correct route is selected.

## Pre-Requisites

This notebook requires permissions to access Amazon Bedrock.

If running on SageMaker Studio, you should add the following managed policies to your role:
1. AmazonBedrockFullAccess

## Note
Running this notebook will incur charges from calling Bedrock. There are 100 example user queries, which means there will be 100 calls to a Bedrock LLM.

# Setup
Before running the rest of this notebook, run the cells below to ensure the necessary libraries are installed and to connect to Bedrock.

In [None]:
# Install all the required dependencies if you haven't already done so from the requirements.txt

# %pip install -U scikit-learn==1.4.2
# %pip install -U langchain==0.1.13
# %pip install -U pandas==2.2.2
# %pip install -U matplotlib==3.8.4

# Create Eval Dataset

This part is a little tricky and time-consuming. **For the purpose of this notebook, we went ahead and created a dataset**. We did this by prompting an LLM to generate synthetic user questions so we could test our router.

We recommend curating your own examples by interacting with your own chatbot to ensure a robust dataset. We also recommend adding to this eval dataset over time.
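
When you curate your own examples, it can help to check the file before spending Bedrock calls on it. The sketch below is a minimal, hypothetical validator: it assumes the CSV has the `User Request` and `Action` columns this notebook reads later, and that `Action` holds one of the four route names defined further down. The sample rows and the `find_invalid_rows` helper are illustrative and not part of the original dataset.

```python
import csv
import io

# Hypothetical sample rows; the real file is ../data/router_inputs.csv.
SAMPLE_CSV = """User Request,Action
Can you submit time off for me?,ActionAgent
What is the parental leave policy?,RAGAgent
Help,ClarificationFunction
Write me a poem about pirates,FallbackFunction
Book a flight,BookingAgent
"""

# Assumed label set: the route names defined later in this notebook.
VALID_ROUTES = {"ActionAgent", "RAGAgent", "ClarificationFunction", "FallbackFunction"}


def find_invalid_rows(csv_text: str) -> list[tuple[int, str]]:
    """Return (row_number, problem) pairs for rows that would skew the evaluation."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        request = (row.get("User Request") or "").strip()
        label = (row.get("Action") or "").strip()
        if not request:
            problems.append((row_number, "empty user request"))
        if label not in VALID_ROUTES:
            problems.append((row_number, f"unknown route label: {label!r}"))
    return problems


invalid_rows = find_invalid_rows(SAMPLE_CSV)
print(invalid_rows)
```

A row with a label the router can never return would always count as a miss, so catching it early keeps the accuracy figure meaningful.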


Next, let's load our dataset.

In [None]:
import pandas as pd

router_prompts = '../data/router_inputs.csv'

df = pd.read_csv(router_prompts)

## Define our router's paths
Our router paths each have a name and a description (like tools) and should implement an invoke() method. We will define four possible paths to test our router against.

The ActionAgent is an agent that can make API calls to interact with a system. The RAGAgent queries a knowledge base. The ClarificationFunction asks the user for clarification when a request is unclear. The FallbackFunction handles all other requests, such as jailbreaking attempts or behavior that the system does not support.

In [None]:
from abc import ABC, abstractmethod 

class BaseFunction(ABC):

    name: str
    description: str

    @abstractmethod
    def invoke(self, input: str) -> str:
        pass

class ActionAgent(BaseFunction):

    name: str = 'ActionAgent'
    description: str =  'Useful when a user is asking the system to perform an action in the outside world such as "submit a time off request for me"'

    def invoke(self, input: str) -> str:
        # Placeholder. We don't actually need the invoke function for this experiment.
        return ''

class RAGAgent(BaseFunction):

    name: str = 'RAGAgent'
    description: str = 'Useful when a user is asking a question that can be found in a knowledge base.'

    def invoke(self, input: str) -> str:
        # Placeholder. We don't actually need the invoke function for this experiment.
        return ''

class ClarificationFunction(BaseFunction):

    name: str = 'ClarificationFunction'
    description: str = "Useful for when the user asks a question, but it's unclear what their specific ask is. This tool will then ask for clarification before selecting a more appropriate tool."

    def invoke(self, input: str) -> str:
        # Placeholder. We don't actually need the invoke function for this experiment.
        return ''

class FallbackFunction(BaseFunction):

    name: str = 'FallbackFunction'
    description: str = 'Useful as a fall back for when other tool descriptions don\'t seem correct. This is a last resort option. If a user asks something harmful, select this tool as well.'

    def invoke(self, input: str) -> str:
        # Placeholder. We don't actually need the invoke function for this experiment.
        return ''


# Define our Router Prompt
This is a router prompt that asks the model to select the most appropriate tool. Plenty of packages, such as LangChain, support this type of routing. Often they're not flexible enough for our needs, so we'll write it from scratch.
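
The prompt in this section mixes single-brace placeholders such as `{tools}` with a doubled-brace JSON example, `{{"toolName": ...}}`. As a rough illustration of why the doubling is needed, the sketch below renders a shortened, hypothetical template with Python's built-in `str.format`, which uses the same brace conventions as LangChain's default template format. The template text and the two route tuples are stand-ins, not the notebook's actual prompt.

```python
# Shortened, hypothetical template that uses the same placeholder and
# brace-escaping pattern as the router prompt.
TEMPLATE = '''You have access to the following tools:

{tools}

Choose one of: {tool_names}

Respond in json using the format below:
{{"toolName": <tool name>}}'''

# Hypothetical (name, description) pairs standing in for the route classes.
routes = [
    ("ActionAgent", "Performs an action in an external system."),
    ("RAGAgent", "Answers questions from a knowledge base."),
]

# Build the two input variables the same way the router does.
tools = "\n".join(f"{name}: {description}" for name, description in routes)
tool_names = ", ".join(name for name, _ in routes)

# Single braces are substituted; doubled braces become literal braces.
rendered = TEMPLATE.format(tools=tools, tool_names=tool_names)
print(rendered)
```

Without the doubled braces, the formatter would treat `"toolName"` as a placeholder name and fail to render the prompt.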

In [None]:
from langchain.prompts import ChatPromptTemplate

ROUTER_SYS_PROMPT = '''You are a helpful assistant that is given access to tool definitions. Your task is to take in tool names and definitions, and select 
the most appropriate tool to use to answer a user's question. You have access to the following tools:

{tools}

The decision you make should be based on one of these tool names: 
{tool_names}

Select the best tool given the human input below and respond in json using the format below: 
{{"toolName": <tool name>}}

DO NOT return anything other than json.

If the user requests something that isn't related to an automation or internal policy document, use the fallback tool'''

USER_SYS_PROMPT = '''Using the users request below

<user_request>
{input}
</user_request>

Select the most appropriate tool and respond in the json format described above.'''


router_prompt: ChatPromptTemplate = ChatPromptTemplate.from_messages([
    ('system', ROUTER_SYS_PROMPT),
    ('human', USER_SYS_PROMPT)
])



In [None]:
import boto3
from langchain_core.messages import AIMessage, BaseMessage
from langchain_community.chat_models import BedrockChat
import json

HYPER_PARAMS = {
    "temperature": 0.3, 
    "top_k": 100,
}

class SemanticRouter:

    def __init__(self):
        
        # Let's define all the routes that this router has access to.
        self.routes: list[BaseFunction]  = [
            ActionAgent(),
            RAGAgent(),
            ClarificationFunction(),
            FallbackFunction()
        ]

        # Grab the prompt we just created.
        self.router_prompt: ChatPromptTemplate = router_prompt

        self.client: BedrockChat = BedrockChat(
            model_id="anthropic.claude-3-haiku-20240307-v1:0",
            model_kwargs=HYPER_PARAMS
        )


    def _get_route(self, input: str) -> str:
        
        # Gather the input variables for the prompt
        tools: str = '\n'.join([f'{t.name}: {t.description}' for t in self.routes])
        tool_names: str = ', '.join([t.name for t in self.routes])

        # Let's use our router prompt. LangChain formats it into a list of chat messages.
        messages: list[BaseMessage] = self.router_prompt.format_messages(
            input=input,
            tools=tools,
            tool_names=tool_names
        )

        response: AIMessage = self.client.invoke(messages)

        
        try:
            # Parse the response. If it's malformed, it'll show up in the except clause.
            response_json: dict = json.loads(response.content)            
            # We expect the model to return json containing the tool name (see prompt)
            tool_name: str = response_json['toolName']
            # Iterate through the routes to find the match.
            route: BaseFunction | None = next((r for r in self.routes if r.name == tool_name), None)
            # If no match is found, the model returned a tool name that isn't one of our routes.
            if not route:
                return 'Requested route was malformed'
            
            return route.name
        
        except Exception as e:
            return f'Could not find an appropriate route for the user\'s request.\n\nError: {e}'
        

## Test Out The Router
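
Before spending a Bedrock call, the parsing step can be exercised offline. The sketch below is a hypothetical standalone helper, `parse_route`, that mirrors the JSON handling in `_get_route` but returns `None` instead of an error string when the response is unusable. The sample response strings are made up for illustration.

```python
import json
from typing import Optional

# Route names defined earlier in this notebook.
ROUTE_NAMES = ["ActionAgent", "RAGAgent", "ClarificationFunction", "FallbackFunction"]


def parse_route(raw_response: str, route_names: list[str]) -> Optional[str]:
    """Return the selected route name, or None if the response is unusable."""
    try:
        payload = json.loads(raw_response)
    except json.JSONDecodeError:
        # The model returned something other than pure JSON.
        return None
    if not isinstance(payload, dict):
        return None
    tool_name = payload.get("toolName")
    # Only accept names that match a known route.
    return tool_name if tool_name in route_names else None


print(parse_route('{"toolName": "RAGAgent"}', ROUTE_NAMES))
print(parse_route('{"toolName": "SearchAgent"}', ROUTE_NAMES))
print(parse_route('Sure! {"toolName": "RAGAgent"}', ROUTE_NAMES))
```

The third example shows why the prompt insists on JSON only: any text around the JSON object makes `json.loads` fail, and the request is then counted as a routing miss.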

In [None]:
router = SemanticRouter()

router._get_route('Can you submit time off for me?')

# Helper Functions For Bedrock
In the section below, we'll define some helper functions to speed up the evaluation process. We'll call Bedrock from a thread pool.
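
The helper in this section records the submission index of each future so that results can be written back in input order. As an alternative sketch (not the approach this notebook uses), `ThreadPoolExecutor.map` returns results in input order on its own. The `fake_router_call` function below is a hypothetical stand-in for a Bedrock call, so the example runs without AWS access.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_router_call(request: str) -> str:
    """Hypothetical stand-in for a Bedrock call; longer requests take longer."""
    time.sleep(0.01 * len(request))
    return request.upper()


def call_in_order(requests: list[str], max_workers: int = 5) -> list[str]:
    # executor.map yields results in the order of the inputs, even when
    # later requests finish before earlier ones.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(fake_router_call, requests))


results = call_in_order(["cccc", "a", "bb"])
print(results)
```

One trade-off: with `map`, an exception in any call is re-raised while collecting results, whereas the index-tracking version can record `None` for a failed request and keep going.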

In [None]:
# This is a bit funky. We're dumping all the requests into a thread pool
# And storing the index for the order in which they were submitted. 
# Lastly, we're inserting them into the response array at their index to ensure order.
from concurrent.futures import ThreadPoolExecutor, as_completed
import threading

# We only care about the response from the semantic router, so we'll call the _get_route() method
def bedrock_call(input):
    router = SemanticRouter()
    return router._get_route(input)
    

def call_bedrock_threaded(requests, max_workers=5):
    # Dictionary to map futures to their position
    future_to_position = {}
    
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Submit all requests and remember their order
        for i, request in enumerate(requests):
            future = executor.submit(bedrock_call, request)
            future_to_position[future] = i
        
        # Initialize an empty list to hold the responses
        responses = [None] * len(requests)
        
        # As each future completes, assign its result to the correct position
        for future in as_completed(future_to_position):
            position = future_to_position[future]
            try:
                response = future.result()
                responses[position] = response
            except Exception as exc:
                print(f"Request at position {position} generated an exception: {exc}")
                responses[position] = None  # Or handle the exception as appropriate
        
    return responses

# Run Evaluations

Let's get validation results. For this notebook, we want to see how often the model routes correctly. Expect the calls to take ~15 seconds, since the model only outputs the path to take for each request.

In [None]:
from langchain_core.messages.ai import AIMessage


# Convert DataFrame to a list of dictionaries. This is easier to work with in our threaded code.
input_records: list[dict] = df.to_dict('records')

# Create prompts for all of our records.
requests: list[str] = [r['User Request'] for r in input_records]

# Call Bedrock threaded to speed up getting all our responses.
# Each response is the selected route name (or an error string); failed requests are None.
responses: list = call_bedrock_threaded(requests)

In [None]:
# Add the responses back into the records and recreate the evaluation DataFrame
for i,r in enumerate(input_records):
    r['Model Response'] = responses[i]


evaluation_df = pd.DataFrame(input_records)
evaluation_df = evaluation_df.rename(columns={'Action': 'Ground Truth'})

# Eval

Show a cross tab of ground truth versus model response to see what the router is getting wrong.
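
A cross tab simply counts each (ground truth, model response) pair. The sketch below reproduces that counting with the standard library and derives a per-route accuracy from it. The five label pairs are hypothetical and are not results from this notebook.

```python
from collections import Counter

# Hypothetical (ground truth, model response) pairs.
pairs = [
    ("ActionAgent", "ActionAgent"),
    ("ActionAgent", "RAGAgent"),
    ("RAGAgent", "RAGAgent"),
    ("RAGAgent", "RAGAgent"),
    ("FallbackFunction", "ClarificationFunction"),
]

# Each key is one cell of the cross tab; the value is its count.
confusion = Counter(pairs)

# Number of examples per ground-truth route (the row totals).
totals = Counter(actual for actual, _ in pairs)

# Share of each route's examples that were routed correctly.
per_route_accuracy = {
    route: confusion[(route, route)] / count for route, count in totals.items()
}

print(confusion[("ActionAgent", "RAGAgent")])
print(per_route_accuracy)
```

Per-route accuracy is worth reading alongside the overall figure, because a single weak route can hide behind a good average.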

In [None]:
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# 'Ground Truth' (renamed from 'Action') is the actual label and 'Model Response' is the predicted label
actual = evaluation_df['Ground Truth']
predicted = evaluation_df['Model Response']

pd.crosstab(actual, predicted)

In [None]:
# Total accuracy
correct = (actual == predicted).sum()
total = len(evaluation_df)
accuracy = correct / total
print(accuracy)

# Human Eval
You should see an accuracy of roughly 76%. In the section below, we'll display the incorrect responses so you can review a sample of them (around 10) to understand where the router is failing and which descriptions to change to make it work better.
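
The cell in this section displays every mismatch. If that list is long, one option is to draw a fixed-size random sample for review. The sketch below does this with the standard library on hypothetical records that use this notebook's column names. The seed value and the sample size of 10 are assumptions chosen for illustration.

```python
import random

# Hypothetical evaluation records using the notebook's column names.
records = [
    {
        "User Request": f"request {i}",
        "Ground Truth": "RAGAgent",
        "Model Response": "RAGAgent" if i % 3 else "FallbackFunction",
    }
    for i in range(60)
]

# Keep only the rows where the router disagreed with the label.
mismatched = [r for r in records if r["Ground Truth"] != r["Model Response"]]

# Fixed seed so the same examples are reviewed on every run.
rng = random.Random(0)
sample = rng.sample(mismatched, k=min(10, len(mismatched)))

for record in sample:
    print(record["User Request"], "->", record["Model Response"])
```

Capping `k` at the number of mismatches avoids the `ValueError` that `sample` raises when asked for more items than exist.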

In [None]:
# Let's drill down into the incorrect answers to see what happened.

# Identify mismatches
mismatches = evaluation_df[evaluation_df['Ground Truth'] != evaluation_df['Model Response']]

# Convert to HTML
html_table = mismatches.to_html(index=False)

# Optional: Add CSS styling
html_table = f"""
<style>
    table, th, td {{
        border: 1px solid black;
        border-collapse: collapse;
        padding: 8px;
        text-align: left;
    }}
    th {{
        background-color: #f2f2f2;
    }}
</style>
{html_table}
"""

# Display the HTML table in a Jupyter Notebook
from IPython.display import display, HTML
display(HTML(html_table))
