In [1]:
%load_ext autoreload
%autoreload 2

# Week 6 - Systematically Improving Your RAG Application

> **Prerequisites**: Make sure that you've run the previous notebook `1. Evaluate Tools.ipynb` before continuing. A lot of the code in this notebook builds on the evaluation methods that we cover in that notebook.

In this notebook, we'll create a synthetic dataset of user queries to evaluate our model's ability to select the right tools. By generating realistic test cases, we can identify common failure modes and measure how well different approaches handle them.

## Why this matters

When deploying RAG systems with multiple tools, we need confidence that our models will select the right tools for each user request. Manual testing misses edge cases and can't scale. By generating synthetic test queries that deliberately target potential failure modes - like selecting between similar tools or coordinating multiple steps - we can systematically identify and fix weaknesses in our tool selection logic.

## What you'll learn

Through hands-on testing with a Raycast-inspired tool selection system, you'll discover how to:

1. Identify Key Failure Modes
- Understand common tool selection mistakes
- Map out similar tools that cause confusion
- Spot missing dependencies between tools

2. Generate Strategic Test Data
- Create queries that target specific weaknesses
- Test tool combinations systematically
- Validate query realism and diversity

3. Benchmark Performance
- Measure precision and recall on synthetic data
- Track improvements across different approaches
- Identify which tools need more attention

By the end of this notebook, you'll have a set of test queries that can be used to systematically evaluate and improve your tool selection logic.

## Our Initial Commands

We've downloaded a list of commands ahead of time into a `raw_commands.json` file. It contains commands taken from the `raycast` application, as well as some additional commands that we've added.

We'll load these commands into a list of `Command` objects as seen below, keeping the following fields from each entry in `raw_commands.json`:

- `description` : A short description of the command from the extension's documentation (stored as `command_description`)
- `extension_name` : The name of the extension that the command belongs to
- `source_name` : The name of the command as it appears in the `raycast` extension (stored as `command_name`)

To make sure every command is uniquely identifiable, we'll concatenate the `extension_name` and `command_name` fields into a single key. This avoids collisions when multiple extensions expose a command with the same name.

For example, if our Obsidian and Apple Notes extensions both have a `search` command, the bare command name would be ambiguous, which is confusing and hard to test. With the combined key we get `obsidian.search` and `apple-notes.search` instead.
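As a quick sanity check on this idea, here is a minimal sketch that uses plain tuples rather than the `Command` class defined in this notebook. The `(extension, command)` pairs and the helper names `make_key` and `duplicate_keys` are hypothetical and only serve as an illustration; they are not read from `raw_commands.json`.

```python
from collections import Counter

# Hypothetical (extension, command) pairs, for illustration only
raw = [
    ("obsidian", "search"),
    ("apple-notes", "search"),
    ("notion", "search"),
]


def make_key(extension: str, command_name: str) -> str:
    """Join the extension and command name into a single key."""
    return f"{extension}.{command_name}"


def duplicate_keys(pairs: list[tuple[str, str]]) -> list[str]:
    """Return every composite key that occurs more than once."""
    counts = Counter(make_key(ext, name) for ext, name in pairs)
    return sorted(key for key, n in counts.items() if n > 1)


# The bare command name is shared by all three extensions...
name_counts = Counter(name for _, name in raw)
print(name_counts["search"])

# ...while the composite keys do not collide
print(duplicate_keys(raw))
```

Running a check like `duplicate_keys` over the loaded commands would flag any entry that still collides after the extension name is added.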


In [2]:
from pydantic import BaseModel, computed_field
import json


class Command(BaseModel):
    extension_name: str
    command_name: str
    command_description: str

    @computed_field
    def key(self) -> str:
        return f"{self.extension_name}.{self.command_name}"


def load_commands(file_path: str) -> list[Command]:
    with open(file_path, "r") as file:
        return [
            Command(
                extension_name=command["extension_name"],
                command_name=command["source_name"],
                command_description=command["description"],
            )
            for command in json.load(file)
        ]


commands = load_commands("raw_commands.json")
len(commands)

70

### Identifying Failure Modes

When deciding which apps/commands to call, two main pitfalls emerge:

**Lack of Context**

Users operate multiple note-taking apps (Notion, Obsidian, Apple Notes). Without explicit context—e.g., that Apple Notes is only for quick, ephemeral tasks—our model might choose the wrong note-taking command, leading to confusion and incorrect data storage.

**Multi-Step Tasks**

Some user requests require calling multiple commands in sequence or in parallel (e.g., create a new release post, then ping the #engineering channel). Our model might forget one step or mix up the order.

We want our test prompts to systematically surface these weaknesses, allowing us to measure how reliably the model navigates them.


#### Lack Of Context

Let's imagine we have four commands that we want to evaluate:

- `obsidian.search`
- `apple-notes.search`
- `notion.search`
- `confluence-search.people`

We'll represent our tool calls as a list of commands that the model has selected. For now, we'll only be evaluating whether the model has selected the correct tool or not as an initial step. When you implement this for your use case, you'll also want to evaluate the arguments that the model has selected for each specific command.

Without much context about what each extension's command does, or how the user works, we can expect our model to get confused. For instance, it's perfectly valid for a user to keep every single note in Apple Notes, in which case the ideal behaviour is to never call any note-taking command other than the Apple Notes ones.

Let's see this in action below where our model is provided with these commands and asked to call the correct tool given a user request.

In [3]:
from pydantic import BaseModel, field_validator, ValidationInfo


class UserCommandArgument(BaseModel):
    title: str
    value: str


class UserCommand(BaseModel):
    key: str
    arguments: list[UserCommandArgument]


class SelectedCommands(BaseModel):
    selected_commands: list[UserCommand]

    @field_validator("selected_commands")
    def validate_selected_commands(cls, v, info: ValidationInfo):
        commands: list[Command] = info.context["commands"]
        valid_command_keys = [command.key for command in commands]
        invalid_keys = [
            command.key for command in v if command.key not in valid_command_keys
        ]
        if invalid_keys:
            raise ValueError(
                f"Commands {invalid_keys} are not valid commands. Valid commands that can be used are {valid_command_keys}"
            )

        if len(v) > 4:
            raise ValueError("You can only select at most 4 commands")

        return v

In response to a user query like `fetch me my notes on CS325 tagged as important`, we might expect our model to select the `obsidian.search` command. In this case, it would call it with the arguments `title: CS325` and `tag: important`. This would in turn translate to the following command call

```python
SelectedCommands(
    selected_commands=[
        UserCommand(
            key="obsidian.search",
            arguments=[
                UserCommandArgument(title="title", value="CS325"), 
                UserCommandArgument(title="tag", value="important")
            ]
        )
    ]
)
```

This could then be executed as a command in the `raycast` application with its own validation logic and return the results to the user. We're also able to modify the list of valid commands on demand by reading a shared list of commands from the `ValidationInfo` object which we can access in both our validation and prompt formatting logic.

Let's see this in action below where we provide our model with the list of commands we've provided in the `raw_commands.json` file and ask it to select the correct tool given a user request. In this example, our user has the following user behaviour.

- He uses Obsidian for personal notes and reflections on a wide range of topics
- He uses Apple Notes for quick, one-off notes. Examples include recipes, shopping lists, movies to watch, todo lists, reminders and other things to jot down
- He uses Notion for planning trips, tracking expenses and other forms of long term planning. 
- He uses Confluence for company documentation, posts and notes. 

We can represent these in some simple test-cases as seen below.

In [3]:
import instructor
from openai import OpenAI
from rich import print

queries = [
    [
        "Find my cheeseburger recipe",
        ["apple-notes.index"],
    ],
    ["What did I write about LSTMs previously?", ["obsidian.searchNoteCommand"]],
    [
        "Does Sarah sit on the product or engineering team?",
        ["confluence-search.people"],
    ],
    [
        "Where are we staying in Japan on 15-18th January 2025?",
        ["notion.search"],
    ],
]

client = instructor.from_openai(OpenAI())


for query, expected_tool in queries:
    response = client.chat.completions.create(
        messages=[
            {
                "role": "system",
                "content": """
                You are a helpful assistant that can execute commands in response to a user query. You have access to the following commands:
                
                <commands>
                {% for command in commands %}
                - {{ command.key }} : {{ command.command_description }}
                {% endfor %}
                </commands>

                You must select at least one command to be called.
                """,
            },
            {
                "role": "user",
                "content": query,
            },
        ],
        model="gpt-4o-mini",
        response_model=SelectedCommands,
        context={"commands": commands},
    )
    print(
        f"\nQuery: {query}\nSelected commands: {[command.key for command in response.selected_commands]}\nExpected tool: {expected_tool}\n{'-' * 50}"
    )



Without any context, our model struggles to decide which tool to call in response to the user request. In our four examples above, it only gets two of them right. This indicates that we need to provide more context for the model to call the right tool.

#### Multi-Step Tasks

What about user queries that require multiple tool calls, possibly executed in a specific order?

- `Send bobby a message that we're going to be late for spin class later` : If we're using iMessage for this, we might need to call `imessage.findChat` first to find the right conversation and then call `imessage.sendMessage` to send the message.
- `Create a new release page about our latest deployment and ping the #engineering team to get started filling in the details` : We might use `confluence-search.new-blog` here to create a blog post about the deployment and then also call `microsoft-teams.findChat` and `microsoft-teams.sendMessage` to ask the engineering team to start filling in the details.

In this case, some tool calls depend on one another (e.g. `findChat` must come before `sendMessage`), while others are independent and can be executed in parallel (e.g. `new-blog` and `sendMessage`).
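The ordering constraint described above can be checked mechanically. The following is a minimal sketch, assuming the model's selection is available as an ordered list of command keys. The helpers `respects_order` and `dependency_violations`, and the `dependencies` list, are hypothetical names introduced for illustration; they are not part of this notebook or of any library.

```python
def respects_order(selected: list[str], before: str, after: str) -> bool:
    """Check that `before` is selected ahead of `after`.

    If `after` was not selected at all there is nothing to violate,
    so the check passes.
    """
    if after not in selected:
        return True
    if before not in selected:
        return False
    return selected.index(before) < selected.index(after)


def dependency_violations(
    selected: list[str], dependencies: list[tuple[str, str]]
) -> list[tuple[str, str]]:
    """Return every (prerequisite, dependent) pair that the selection breaks."""
    return [
        (before, after)
        for before, after in dependencies
        if not respects_order(selected, before, after)
    ]


# Hypothetical dependency: the chat must be found before a message is sent
dependencies = [("imessage.findChat", "imessage.sendMessage")]

print(dependency_violations(["imessage.findChat", "imessage.sendMessage"], dependencies))
print(dependency_violations(["imessage.sendMessage"], dependencies))
```

Commands that can run in parallel simply have no entry in `dependencies`, so their relative order is never penalised.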

Let's see our ability to call the right tools in response to a user query.

In [4]:
queries = [
    [
        "Let's start scaffolding out a new release post about our latest deployment. Also send a message the #engineering channel to tell them to fill it up",
        [
            "confluence-search.new-blog",
            "jira.active-sprints",
            "microsoft-teams.findChat",
            "microsoft-teams.sendMessage",
        ],
    ],
    [
        "find weather taiwan dec and generate shopping list for it",
        ["google-search.index", "apple-notes.new-note", "apple-notes.add-text"],
    ],
]

client = instructor.from_openai(OpenAI())


for query, expected_tool in queries:
    response = client.chat.completions.create(
        messages=[
            {
                "role": "system",
                "content": """
                You are a helpful assistant that can execute commands in response to a user query. You have access to the following commands:
                
                <commands>
                {% for command in commands %}
                - {{ command.key }} : {{ command.command_description }}
                {% endfor %}
                </commands>

                You must select at least one command to be called.
                """,
            },
            {
                "role": "user",
                "content": query,
            },
        ],
        response_model=SelectedCommands,
        context={"commands": commands},
        model="gpt-4o-mini",
    )
    print(
        f"\nQuery: {query}\nSelected commands: {[command.key for command in response.selected_commands]}\nExpected tool: {expected_tool}\n{'-' * 50}"
    )

We can see that our model's performance is slightly worse here.

In the first case, it wrongly decides to send the message using the `discord.sendMessage` command instead of the `microsoft-teams.sendMessage` command. Additionally, it doesn't call the `jira.active-sprints` command as expected.

In the second case, it's able to call the `google-search.index` command but then struggles to call the `apple-notes.new-note` and `apple-notes.add-text` commands to generate the shopping list and save it into our notes.

## Generating Synthetic Queries

We've identified that our model struggles to call the right tool when a user query requires some implicit context or multiple steps to be executed. We'll now start generating some synthetic queries that specifically test these failure modes.

We'll start by writing out a brief prompt and defining how our user uses these individual extensions. We'll then randomly sample from our list of commands and use them to generate a list of queries that require the use of these commands.

These will then come in handy when we want to seed future generations of queries by allowing us to generate more diverse and unique queries.

In [5]:
import random
from pydantic import BaseModel, field_validator, ValidationInfo
from rich import print
from openai import AsyncOpenAI


class UserQuery(BaseModel):
    chain_of_thought: str
    user_query: str
    commands: list[UserCommand]

    @field_validator("commands")
    def validate_commands(cls, v, info: ValidationInfo):
        commands: list[Command] = info.context["commands"]
        valid_command_keys = [command.key for command in commands]
        invalid_keys = [
            command.key for command in v if command.key not in valid_command_keys
        ]

        desired_command = info.context["command"]
        # The command we asked for must appear among the generated commands
        if desired_command.key not in [command.key for command in v]:
            raise ValueError(
                f"You must use the command {desired_command.key} in your query."
            )
        if invalid_keys:
            raise ValueError(
                f"Commands {invalid_keys} are not valid commands. Valid commands that can be used are {valid_command_keys}"
            )
        return v


async def generate_query(
    client: instructor.AsyncInstructor,
    command: Command,
    commands: list[Command],
    user_behaviour: str,
) -> UserQuery:
    query_length = random.randint(10, 30)

    return await client.chat.completions.create(
        messages=[
            {
                "role": "system",
                "content": """
                Generate a hypothetical user message that is about {{ length }} words long and uses the following command and at most 2 more commands from the list of commands below. Make sure to use the specific command name as the key for the command.

                command_name: {{command.key}}    
                description: {{command.command_description}}

                Here are a list of other commands that you can use in conjunction with the above command 

                <commands>
                {% for command in commands %}
                <command>
                    <command_name>{{ command.key }}</command_name>
                    <command_description>{{ command.command_description }}</command_description>
                </command>
                {% endfor %}
                </commands>

                Here is a rough description of how our user uses the application

                <user_behaviour>
                {{ user_behaviour }}
                </user_behaviour>

                Think carefully about what this specific command is used for, how it differs from other commands available in the same extension and other commands in the application. Lastly, consider how we could use this command in conjunction with other commands based on the user behaviour listed above. 

                Once you've done so, remember to generate a user message that uses the command in a way that is consistent with the user behaviour listed above and is written in the imperative as a demand/request.

                Favour acronyms and short forms of words where possible (Eg. LCMs instead of Latent Consistency Models) and refrain from mentioning the specific application/extension in your query. Remember that you shouldn't mention the specific application/extension in your query.
                """,
            },
        ],
        context={
            "command": command,
            "commands": commands,
            "user_behaviour": user_behaviour,
            "length": query_length,
        },
        model="gpt-4o-mini",
        response_model=UserQuery,
    )


client = instructor.from_openai(AsyncOpenAI())


user_behaviour = """
Currently our user uses the following extensions for the following purposes

- Confluence is used for company documentation, posts and notes. Note that we should use a post when it's a one time event or announcement (eg. Feature Release ) and a page when we'd like to keep it around for a longer period (Eg. Onboarding Document, Team Handbook, Incident Reports that we want to refer to down the line). Use filters for common queries/views that I need to refer to 
- Notion is used for planning trips, tracking expenses and other forms of long term planning. 
- Apple notes are used for quick notes that are more one-off. Examples of these notes includes recipes, shopping lists, notes about a movie we want to watch, things to note down etc, todo lists, reminders etc
- Obsidian is used for personal notes and reflections on a wide range of topics (Eg. Classes we've taken, books we've read, notes about a lecture we went to etc)

- Google Search is used for searching the web for information
- iMessage is used for sending messages to friends and family. These messages are short informal and mostly about the weather, plans for the weekend, coordinating certain events, looking up appointments etc
- Discord is used for gaming - so we'll use it for sending messages that are related to gaming and coordinating these gaming sessions with friends
- Teams is used for sending messages for work stuff - we might use it to send messages to a channel or to a specific person in response to certain work related projects, requests, developments etc

- Github is used for tracking pull requests, collaborating with other developers, running tests and deploying code (Eg. What's the update on the CI, is there a new release of the app, what's the status of the new feature branch, any new security vulnerabilities that we flagged, any new PRs to review etc)
- We use Jira to track outstanding bugs and issues that users have reported and we need to work on. Often times we'll be tracking the issue in jira and then creating PRs in github to fix it

When it comes to tracking things to be done, use apple notes to track these reminders. Jira is for official work projects.
"""

print(await generate_query(client, random.choice(commands), commands, user_behaviour))

In [9]:
from tqdm.asyncio import tqdm_asyncio as asyncio

queries = await asyncio.gather(
    *[
        generate_query(client, random.choice(commands), commands, user_behaviour)
        for _ in range(5)
    ]
)

with open("queries.jsonl", "a") as file:
    for query in queries:
        file.write(
            json.dumps(
                {
                    "query": query.user_query,
                    "labels": [command.key for command in query.commands],
                }
            )
            + "\n"
        )

100%|██████████| 5/5 [00:04<00:00,  1.21it/s]


### Providing Contrastive and Positive Examples

Now that we've generated our initial list of queries, let's use our generated queries to help us generate better queries. 

In order to generate better queries, we'll also provide some examples of queries that use this specific command we've chosen and some examples of queries that don't. This in turn allows us to provide both positive and negative examples to our model so it sees a greater and more diverse set of examples.

Let's start by reading in our list of existing queries and mapping each query to its relevant commands. Then we'll write a function that, given a command, randomly samples queries that use other commands. Finally, we'll update our previous query generation function to use both sets of examples when generating new queries.

Let's see this in action below.


In [4]:
import json


def load_queries(commands: list[Command], query_path: str):
    valid_commands = set(command.key for command in commands)
    with open(query_path, "r") as f:
        queries = [json.loads(line) for line in f]
        for query in queries:
            for label in query["labels"]:
                if label not in valid_commands:
                    raise ValueError(f"Command {label} not found in commands")
    return queries


commands = load_commands("raw_commands.json")
queries = load_queries(commands, "queries.jsonl")

In [38]:
command_to_query = {}
for query in queries:
    for command in query["labels"]:
        if command not in command_to_query:
            command_to_query[command] = []
        command_to_query[command].append(
            {
                "query": query["query"],
                "labels": query["labels"],
            }
        )

print(command_to_query["apple-notes.new"])

In [41]:
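def get_contrastive_examples(command: Command, command_to_query: dict, n_examples: int):
    # Queries that already use the target command; these must not be returned
    command_queries = {
        item["query"] for item in command_to_query.get(command.key, [])
    }
    seen_contrastive_queries = set()
    contrastive_queries = []
    for other_queries in command_to_query.values():
        for query in other_queries:
            text = query["query"]
            # Skip queries that use the target command or were already collected
            # (a multi-command query is listed once under each of its labels)
            if text in command_queries or text in seen_contrastive_queries:
                continue
            contrastive_queries.append(query)
            seen_contrastive_queries.add(text)

    # Never ask for more examples than are available
    return random.sample(
        contrastive_queries, k=min(n_examples, len(contrastive_queries))
    )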


chosen_command = Command(
    command_description="Create a new note",
    extension_name="apple-notes",
    command_name="new",
)
print(get_contrastive_examples(chosen_command, command_to_query, 2))

We can see that this function gives us queries that don't use the `apple-notes.new-note` command. This helps the model gain more granular context about the command and how it differs from other commands that it might be able to call.

For instance, we can clearly see that `microsoft-teams.findChat` and `microsoft-teams.sendMessage` are unlikely to be called together with the `apple-notes.new-note` command, which deals with personal notes rather than work-related content.

In [42]:
async def generate_new_query_with_examples(
    client: instructor.AsyncInstructor,
    command: Command,
    commands: list[Command],
    command_to_query: dict,
    user_behaviour: str,
    n_examples: int,
):
    contrastive_queries = get_contrastive_examples(
        command, command_to_query, n_examples
    )
    if command.key in command_to_query:
        command_queries = (
            command_to_query[command.key]
            if len(command_to_query[command.key]) <= 3
            else random.sample(command_to_query[command.key], k=3)
        )
    else:
        command_queries = []

    return await client.chat.completions.create(
        messages=[
            {
                "role": "system",
                "content": """
                Generate a hypothetical user message that is about {{ length }} words long and uses the following command and at most 2 more commands from the list of commands below. Make sure to use the specific command name as the key for the command.

                command_name: {{command.key}}    
                description: {{command.command_description}}


                {% if positive_examples|length > 0 %}
                Here are some examples of how this command is used
                
                <positive_examples>
                {% for example in positive_examples %}
                    <positive_example>
                        <query>{{ example["query"] }}</query>
                        <labels>{{ example["labels"] }}</labels>
                    </positive_example>
                {% endfor %}
                </positive_examples>
                {% endif %}

                Here are some examples of how other commands that don't use this specific command are used

                <negative_examples>
                {% for example in negative_examples %}
                <negative_example>
                    <query>{{ example["query"] }}</query>
                    <labels>{{ example["labels"] }}</labels>
                </negative_example>
                {% endfor %}
                </negative_examples>

                Here are a list of other commands that you can use in conjunction with the above command 

                <commands>
                {% for command in commands %}
                <command>
                    <command_name>{{ command.key }}</command_name>
                    <command_description>{{ command.command_description }}</command_description>
                </command>
                {% endfor %}
                </commands>

                Here is a rough description of how our user uses the application

                <user_behaviour>
                {{ user_behaviour }}
                </user_behaviour>

                

                Think carefully about what this specific command is used for, how it differs from other commands available in the same extension and other commands in the application. Lastly, consider how we could use this command in conjunction with other commands based on the user behaviour listed above. 

                Once you've done so, remember to generate a user message that uses the command in a way that is consistent with the user behaviour listed above and is written in the imperative as a demand/request. 

                Favour acronyms and short forms of words where possible (Eg. LCMs instead of Latent Consistency Models) and refrain from mentioning the specific application/extension in your query. Commands should be written in the imperative as a demand/request and try to combine multiple commands where possible in a natural way that would require context to understand

                Invent and add specific and realistic details to the query where possible to make it more specific and interesting. 

                Here are some sample details that you should avoid reproducing
                <bad details>
                repo 123
                jira #123
                </bad details>

                Here is an example of a good detail - it's a feasible detail that would be used in a real world scenario
                <good details>
                supabase/go-sdk
                jira 10023
                </good details>

                Do not copy these details, instead generate your own realistic details that would be used in a real world scenario
                """,
            },
        ],
        response_model=UserQuery,
        model="gpt-4o-mini",
        context={
            "command": command,
            "commands": commands,
            "user_behaviour": user_behaviour,
            "length": random.randint(10, 30),
            "positive_examples": command_queries,
            "negative_examples": contrastive_queries,
        },
    )


client = instructor.from_openai(AsyncOpenAI())
print(
    await generate_new_query_with_examples(
        client, random.choice(commands), commands, command_to_query, user_behaviour, 4
    )
)

Let's now generate a few more queries using this sampling that we've implemented above.

In [23]:
coros = [
    generate_new_query_with_examples(
        client, random.choice(commands), commands, command_to_query, user_behaviour, 4
    )
    for _ in range(5)
]
queries = await asyncio.gather(*coros)

client = instructor.from_openai(AsyncOpenAI())
with open("queries.jsonl", "a") as file:
    for query in queries:
        file.write(
            json.dumps(
                {
                    "query": query.user_query,
                    "labels": [command.key for command in query.commands],
                }
            )
            + "\n"
        )

100%|██████████| 5/5 [00:02<00:00,  1.91it/s]


## Benchmarking

Now that we've generated our initial list of queries, let's see how our model performs when we only provide the command name and description in its context.

We'll use the functions we defined in our previous notebook to evaluate the performance of our model and establish our initial precision and recall baselines. We'll use `braintrust` here to store and log the performance of our model.
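As a reminder of what these two metrics measure for tool selection, here is a minimal set-based sketch. It is an assumption about how such metrics can be computed, not the actual implementation of `calculate_precision` and `calculate_recall` in `helpers.py`, which may treat edge cases such as empty lists differently. The function names `precision_sketch` and `recall_sketch` are hypothetical.

```python
def precision_sketch(predicted: list[str], expected: list[str]) -> float:
    """Fraction of the selected commands that were actually expected."""
    predicted_set, expected_set = set(predicted), set(expected)
    if not predicted_set:
        return 0.0  # assumption: no selection counts as zero precision
    return len(predicted_set & expected_set) / len(predicted_set)


def recall_sketch(predicted: list[str], expected: list[str]) -> float:
    """Fraction of the expected commands that were actually selected."""
    predicted_set, expected_set = set(predicted), set(expected)
    if not expected_set:
        return 0.0  # assumption: nothing expected counts as zero recall
    return len(predicted_set & expected_set) / len(expected_set)


# Expected labels from the release-post example earlier in this notebook,
# with a hypothetical model selection
expected = [
    "confluence-search.new-blog",
    "jira.active-sprints",
    "microsoft-teams.findChat",
    "microsoft-teams.sendMessage",
]
predicted = ["confluence-search.new-blog", "discord.sendMessage"]

print(precision_sketch(predicted, expected))  # 1 of 2 selected commands is correct
print(recall_sketch(predicted, expected))  # 1 of 4 expected commands was found
```

Precision penalises extra, unwanted tool calls, while recall penalises missing ones, so a multi-step query can score well on one and poorly on the other.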

In [None]:
async def generate_commands(
    query: str, client: instructor.AsyncInstructor, commands: list[Command]
):
    response = await client.chat.completions.create(
        messages=[
            {
                "role": "system",
                "content": """
                You are a helpful assistant that can execute commands in response to a user query. You have access to the following commands:
                
                <commands>
                {% for command in commands %}
                - {{ command.key }} : {{ command.command_description }}
                {% endfor %}
                </commands>

                You must select at least one command to be called.
                """,
            },
            {
                "role": "user",
                "content": query,
            },
        ],
        model="gpt-4o",
        response_model=SelectedCommands,
        context={"commands": commands},
    )
    return response.selected_commands


client = instructor.from_openai(AsyncOpenAI())

commands = load_commands("raw_commands.json")
resp = await generate_commands(
    "send a message to bobby that I'll be late for spin class later", client, commands
)

print(resp)

[UserCommand(key='imessage.sendMessage', arguments=[UserCommandArgument(title='Recipient', value='Bobby'), UserCommandArgument(title='Message', value="I'll be late for spin class later")])]


In [15]:
from braintrust import Score, Eval
from helpers import calculate_precision, calculate_recall


def evaluate_braintrust(input, output, **kwargs):
    return [
        Score(
            name="precision",
            score=calculate_precision(output, kwargs["expected"]),
        ),
        Score(
            name="recall",
            score=calculate_recall(output, kwargs["expected"]),
        ),
    ]


client = instructor.from_openai(AsyncOpenAI())
commands = load_commands("raw_commands.json")
queries = load_queries(commands, "queries.jsonl")


async def task(query, hooks):
    resp = await generate_commands(query, client, commands)
    return [item.key for item in resp]


results = await Eval(
    "function-calling",
    data=[
        {
            "input": row["query"],
            "expected": row["labels"],
        }
        for row in queries
    ],
    task=task,
    scores=[evaluate_braintrust],
)

Experiment week-6-fixes-1741086106 is running at https://www.braintrust.dev/app/567/p/function-calling/experiments/week-6-fixes-1741086106
function-calling (data): 51it [00:00, 44844.76it/s]


function-calling (tasks):   0%|          | 0/51 [00:00<?, ?it/s]


week-6-fixes-1741086106 compared to week-6-fixes-1741086096:
40.20% (-12.24%) 'recall'    score	(3 improvements, 11 regressions)
45.57% (-10.61%) 'precision' score	(4 improvements, 11 regressions)

1741086106.46s start
1741086109.36s end
2.89s (+36.42%) 'duration'	(23 improvements, 28 regressions)

See results for week-6-fixes-1741086106 at https://www.braintrust.dev/app/567/p/function-calling/experiments/week-6-fixes-1741086106


Let's now analyse the results and see what our model struggles with. Both scores are low: recall is 0.40 and precision is around 0.46.
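
As a reminder of what these two numbers mean for a single query, here is a minimal sketch. It assumes both metrics are set-based (duplicates and call order are ignored). The `precision_recall` function is a hypothetical stand-in for illustration, not the `calculate_precision` and `calculate_recall` helpers imported above, which may differ in detail.

```python
def precision_recall(predicted: list[str], expected: list[str]) -> tuple[float, float]:
    """Set-based precision and recall for one query (illustrative sketch)."""
    predicted_set, expected_set = set(predicted), set(expected)
    hits = len(predicted_set & expected_set)
    # Precision: what fraction of the tools we called were actually wanted?
    precision = hits / len(predicted_set) if predicted_set else 0.0
    # Recall: what fraction of the wanted tools did we call?
    recall = hits / len(expected_set) if expected_set else 0.0
    return precision, recall


# A grocery-list query from this dataset: the model created and pinned the
# note but never called `apple-notes.add-text` to write the list itself.
p, r = precision_recall(
    ["apple-notes.new", "apple-notes.menu-bar"],
    ["apple-notes.new", "apple-notes.add-text", "apple-notes.menu-bar"],
)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=1.00 recall=0.67
```

Every tool the model called was correct, so precision is perfect, but one of the three expected tools is missing, so recall drops to two thirds.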

Are there specific commands that our model struggles with? Let's see if we can find out.

In [19]:
import pandas as pd
from helpers import calculate_precision_recall_for_queries

df = pd.DataFrame(
    [
        {
            "query": row.input,
            "expected": row.expected,
            "actual": row.output,
        }
        for row in results.results
    ]
)

df = calculate_precision_recall_for_queries(df)
df.sort_values(by="recall", ascending=True).head(15)

Unnamed: 0,query,expected,actual,precision,recall,CORRECT
25,can you help me find that sketch I made that e...,[obsidian.searchMedia],[obsidian.searchNoteCommand],0.0,0.0,N
48,check if there are any dependency vulnerabilit...,[github.unread-notifications],"[github.search-issues, jira.search-issues, jir...",0.0,0.0,N
36,Send a message to #project-alpha: Update on th...,"[microsoft-teams.findChat, microsoft-teams.sen...",[discord.sendMessage],0.0,0.0,N
49,did anyone comment on the pr for the performan...,[github.unread-notifications],"[github.my-pull-requests, github.search-pull-r...",0.0,0.0,N
24,can you help me find the diagrams I did to sho...,[obsidian.searchMedia],"[obsidian.searchNoteCommand, apple-notes.index]",0.0,0.0,N
23,Just got approval for the Nike summer campaign...,[confluence-search.add-text],[notion.search-page],0.0,0.0,N
22,any more prs or security alerts to worry about?,[github.unread-notifications],"[github.my-pull-requests, github.notifications...",0.0,0.0,N
21,What did we discuss last week in the meeting w...,[confluence-search.search],"[microsoft-teams.searchMessages, discord.searc...",0.0,0.0,N
20,open release notes for the latest release,[confluence-search.go],[apple-notes.new],0.0,0.0,N
19,pull up my recently accessed pages related to ...,[confluence-search.search],"[confluence-search.recent, notion.search-page,...",0.0,0.0,N


In [24]:
from helpers import calculate_per_tool_recall

calculate_per_tool_recall(df).sort_values(
    by=["recall", "expected"], ascending=[True, False]
).head(15)

Unnamed: 0,tool,actual,expected,recall
35,apple-notes.add-text,0,7,0.0
43,github.unread-notifications,0,5,0.0
16,confluence-search.new-blog,0,2,0.0
3,confluence-search.add-text,0,1,0.0
8,confluence-search.go,0,1,0.0
19,confluence-search.new-page,0,1,0.0
21,jira.open-issues,0,1,0.0
25,notion.search-page,0,1,0.0
41,notion.create-database-page,0,1,0.0
28,microsoft-teams.findChat,1,13,0.08
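

The `calculate_per_tool_recall` helper is not shown in this notebook. A minimal sketch of how such a table could be built is below. It assumes `expected` counts the queries that list a tool in their labels and `actual` counts how many of those queries the model also selected it for, which is consistent with the `microsoft-teams.findChat` row above (1 of 13, recall 0.08). The `per_tool_recall` name and its plain-Python input are hypothetical.

```python
from collections import Counter
from typing import Iterable


def per_tool_recall(pairs: Iterable[tuple[list[str], list[str]]]) -> list[dict]:
    """Aggregate recall per tool from (expected, actual) pairs (illustrative sketch)."""
    expected_counts: Counter = Counter()
    hit_counts: Counter = Counter()
    for expected, actual in pairs:
        actual_set = set(actual)
        for tool in set(expected):
            expected_counts[tool] += 1
            if tool in actual_set:
                hit_counts[tool] += 1
    return [
        {
            "tool": tool,
            "actual": hit_counts[tool],
            "expected": total,
            "recall": round(hit_counts[tool] / total, 2),
        }
        for tool, total in sorted(expected_counts.items())
    ]


# Toy input; with the real data you could pass zip(df["expected"], df["actual"]).
toy_pairs = [
    (["apple-notes.add-text"], ["apple-notes.new"]),
    (["apple-notes.new", "apple-notes.add-text"], ["apple-notes.new"]),
]
for row in per_tool_recall(toy_pairs):
    print(row)
```

Wrapping the returned list in `pd.DataFrame(...)` gives a table with the same columns as the output above.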


We can see that a few commands perform poorly. Let's look at three of them and try to understand what's going on:

1. `microsoft-teams.findChat`
2. `obsidian.searchMedia` 
3. `apple-notes.add-text`

Let's start by defining a function that returns the rows where a given tool was expected but the model's selection didn't exactly match the expected set.


In [29]:
def get_mismatched_examples_for_tool(df, tool_substring, num_examples=3):
    """
    Filter dataframe for rows where a specific tool substring appears in expected tools
    and the actual output doesn't match the expected output.

    Args:
        df: DataFrame containing the data
        tool_substring: String to search for in the expected tools
        num_examples: Number of examples to display (default: 3)

    Returns:
        DataFrame containing filtered examples where actual != expected
    """
    # Filter for rows where the tool substring is in expected tools
    filtered_df = df[
        df["expected"].apply(lambda x: any(tool_substring in item for item in x))
    ]

    # Filter for rows where actual doesn't match expected
    mismatched_df = filtered_df[
        filtered_df.apply(
            lambda row: set(row["expected"]) != set(row["actual"]), axis=1
        )
    ]

    # Return the top examples
    return mismatched_df.head(num_examples)

In [36]:
pd.set_option("display.max_colwidth", None)
get_mismatched_examples_for_tool(df, "microsoft-teams.findChat", 20)[
    ["query", "expected", "actual"]
]

Unnamed: 0,query,expected,actual
1,"Let's create a new release post about our latest deployment, also make sure to link the specific issues that were fixed in the latest sprint and send a message the #engineering channel to let them know about it","[confluence-search.new-blog, jira.active-sprints, microsoft-teams.findChat, microsoft-teams.sendMessage]","[confluence-search.new-page, discord.sendMessage]"
10,check with tim when we're supposed to do the onboarding for the new hires?,"[microsoft-teams.findChat, microsoft-teams.sendMessage]",[imessage.sendMessage]
11,check with james that we've shipped the new feature?,"[microsoft-teams.findChat, microsoft-teams.sendMessage]",[imessage.findChat]
30,Set my status to OOO and message #team-general that I'll be taking PTO next week to handle some family matters,"[microsoft-teams.setStatus, microsoft-teams.findChat, microsoft-teams.sendMessage]","[microsoft-teams.setStatus, microsoft-teams.sendMessage]"
31,"Set status to focusing, check unread messages, and let the team know I'm working on JIRA-1234","[microsoft-teams.setStatus, microsoft-teams.unreadMessages, microsoft-teams.findChat, microsoft-teams.sendMessage]","[microsoft-teams.setStatus, discord.setStatus, microsoft-teams.unreadMessages, discord.unreadMessages]"
32,Check active sprint tasks and update the team channel about today's blockers,"[jira.active-sprints, microsoft-teams.findChat, microsoft-teams.sendMessage]","[jira.active-sprints, microsoft-teams.findChat]"
33,Check workflow runs status and update #devops about the failed builds,"[github.workflow-runs, microsoft-teams.findChat, microsoft-teams.sendMessage]","[github.workflow-runs, discord.findChat]"
34,Pull up the latest release notes and notify #engineering about the deployment,"[confluence-search.search, microsoft-teams.findChat, microsoft-teams.sendMessage]","[confluence-search.search, discord.findChat]"
36,"Send a message to #project-alpha: Update on the recent production incident? Check Jira ticket 1024 for details, Github PRs are linked there.","[microsoft-teams.findChat, microsoft-teams.sendMessage, jira.search-issues]",[discord.sendMessage]
37,"Create a Jira bug for high CPU usage on login screen, affecting iOS and Android and assign it to kenny. ping him the link once done.","[jira.create-issue, microsoft-teams.findChat, microsoft-teams.sendMessage]",[jira.create-issue]


In [33]:
get_mismatched_examples_for_tool(df, "obsidian.searchMedia", 3)

Unnamed: 0,query,expected,actual,precision,recall,CORRECT
24,can you help me find the diagrams I did to sho...,[obsidian.searchMedia],"[obsidian.searchNoteCommand, apple-notes.index]",0.0,0.0,N
25,can you help me find that sketch I made that e...,[obsidian.searchMedia],[obsidian.searchNoteCommand],0.0,0.0,N
26,search for my aws architecture drawings,[obsidian.searchMedia],"[obsidian.searchMedia, notion.search-page, con...",0.33,1.0,N


In [44]:
get_mismatched_examples_for_tool(df, "apple-notes.add-text", 4)

Unnamed: 0,query,expected,actual,precision,recall,CORRECT
0,"create a grocery list note with Milk, eggs, bread, and cheese, and pin it so that i can find it later.","[apple-notes.new, apple-notes.add-text, apple-notes.menu-bar]","[apple-notes.new, apple-notes.menu-bar]",1.0,0.67,N
2,find weather taiwan december and generate a shopping list for it,"[google-search.index, apple-notes.new, apple-notes.add-text]","[google-search.index, obsidian.createNoteCommand]",0.5,0.33,N
8,add a reminder to my todos to buy some groceries for dinner tonight,[apple-notes.add-text],[apple-notes.new],0.0,0.0,N
15,Add a reminder to book a dentist appointment next week,[apple-notes.add-text],[apple-notes.new],0.0,0.0,N


A few trends stand out in the errors for these three commands:

1. `microsoft-teams.findChat`: The model is inconsistent about calling `findChat` and `sendMessage`. In the examples above it never calls them together, only one or the other, and it often picks the Discord or iMessage equivalent instead of Microsoft Teams.
2. `obsidian.searchMedia`: The model doesn't reliably know when to use `searchMedia`. It falls back to `obsidian.searchNoteCommand` even when the query mentions diagrams or sketches.
3. `apple-notes.add-text`: The model doesn't recognise that `add-text` can be called directly on an existing note. `apple-notes.new` is only needed when a note has to be created first, after which `add-text` adds the content to it.
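

Trends like these are easier to spot when each mismatch is split into the tools that were missed and the tools that were called unexpectedly. The sketch below does this with set differences on four rows copied from the outputs above. `diff_tools` is a hypothetical helper, not part of `helpers`.

```python
from collections import Counter


def diff_tools(expected: list[str], actual: list[str]) -> tuple[list[str], list[str]]:
    """Return (missed, unexpected) tools for one query (illustrative sketch)."""
    expected_set, actual_set = set(expected), set(actual)
    missed = sorted(expected_set - actual_set)
    unexpected = sorted(actual_set - expected_set)
    return missed, unexpected


# (expected, actual) pairs taken from rows 30, 32, 25 and 8 above.
examples = [
    (
        ["microsoft-teams.setStatus", "microsoft-teams.findChat", "microsoft-teams.sendMessage"],
        ["microsoft-teams.setStatus", "microsoft-teams.sendMessage"],
    ),
    (
        ["jira.active-sprints", "microsoft-teams.findChat", "microsoft-teams.sendMessage"],
        ["jira.active-sprints", "microsoft-teams.findChat"],
    ),
    (["obsidian.searchMedia"], ["obsidian.searchNoteCommand"]),
    (["apple-notes.add-text"], ["apple-notes.new"]),
]

missed_counts: Counter = Counter()
for expected, actual in examples:
    missed, unexpected = diff_tools(expected, actual)
    missed_counts.update(missed)
    print(f"missed={missed} unexpected={unexpected}")

# Tools that are dropped most often across the sampled rows.
print(missed_counts.most_common())
```

Run over the full dataframe, the same counts would show which tools are dropped outright and which are swapped for a near neighbour.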

We'll target these weak points in the next notebook.

## Conclusion

In Week 6, we've shown how to systematically evaluate tool selection. We started with a YAML taxonomy that's easy for team members to modify, then used it to generate synthetic queries that test specific failure modes like missing dependencies, wrong tool choices, and over-reliance on general commands.

We applied the same precision and recall metrics from Week 1's retrieval evaluation to measure tool selection accuracy. Our synthetic dataset revealed key weaknesses: models skip required setup steps (like `findChat` before `sendMessage`), pick wrong tools (using GitHub instead of Confluence for docs), and default to general commands instead of specialized ones. Finding these patterns lets us target improvements effectively.

In the final notebook, we'll improve our model's performance by adding system prompts that explain user workflows and few-shot examples showing correct tool combinations. With our test set and metrics in place, we can measure exactly how much these changes help. This matches our core goal throughout the course - making systematic, measurable improvements backed by data.