In [10]:
%load_ext autoreload
%autoreload 2

# Week 6 - Systematically Improving Your RAG Application

In this notebook, we'll explore how to evaluate the ability of a model to select the right tools for a given user query by measuring precision and recall.

## Why this matters

As RAG systems grow more complex, they often need to coordinate multiple tools - from searching documentation to querying databases to sending notifications. Without objective metrics to evaluate tool selection, we risk calling unnecessary tools (wasting resources) or missing critical ones (degrading user experience).

While traditional RAG evaluation focuses on retrieval quality, tool selection requires its own framework. A model might find relevant content but fail to take the right actions with it. By measuring precision and recall of tool selection, we can systematically improve how our models coordinate multiple tools.

## What you'll learn

Through hands-on examples using a personal assistant chatbot, you'll discover how to:

1. Measure Tool Selection Quality
- Understand what precision and recall mean for tool calls
- Create clear test cases to evaluate tool selection
- Visualize and analyze tool selection performance

2. Implement Evaluation Methods
- Convert precision and recall into code
- Write unit tests for tool selection accuracy
- Track performance across different queries

3. Compare Tool Selection Approaches
- Evaluate parallel vs sequential tool calling
- Measure tradeoffs in accuracy and speed
- Make data-driven decisions about tool selection methods


By the end of this notebook, you'll have a framework for measuring and improving how accurately your models select tools as well as how to quantitatively measure the impact of specific changes to your tool calling strategy.

## Raycast Natural Language Extensions

Raycast is an application that lets you launch custom shortcuts and integrations on your computer. It integrates with tools such as Jira, Airtable, and Google, among many others, and will be launching an easy way to prompt extensions with natural language.

For instance, given the command `@calendar when's my next meeting?`, Raycast will execute a series of installed commands that fetch all of your meetings and then return the time of the next meeting after the current time. This lets users interact with their system quickly and efficiently.

In this notebook series, we'll look at how we might prototype a similar application. We'll do so over 3 notebooks.

1. **Evaluating Tool Calling ability**: Using a simple set of tools, we'll calculate precision and recall and see how to use these two metrics to evaluate the tools a model has called relative to a set of expected tool calls.

2. **Creating a Dataset** : We'll first examine some failure modes that models experience when calling tools and how we might generate a synthetic dataset to cover them. We'll then write an initial set of queries that mimic these failure cases and use them to generate a larger synthetic dataset of queries. While doing so, we'll use `braintrust` to evaluate our model's tool calling ability and establish an initial baseline.

3. **Improvement** : Once we've done so, we'll explore techniques for improving our model's tool calling ability, such as few-shot examples and system prompts provided by users. We'll then evaluate each technique the same way and compare the results against our original baseline.

## Understanding Model Performance

In this section, we'll look at how to evaluate a model's ability to call the right tool. We'll do so in 3 steps:

1. **Metrics** : We'll first look at precision and recall and why we want to use them to evaluate our model's performance
2. **Tool Calling** : We'll then see how we can evaluate the performance of our model using these metrics by writing simple assertions and unit tests
3. **Parallel Tool Calling**: We'll then see how we can use parallel tool calling to reduce the latency of our application and measure how it affects the performance of our model



### Precision and Recall

We want to evaluate how well our model performs when it comes to calling tools. To do so, we'll use two main metrics:

1. Precision : Precision tells us what fraction of the tools we called were actually useful. A high precision means we avoid wasting resources on calling irrelevant tools.
2. Recall : Recall tells us what fraction of the relevant tools we actually used. A high recall means we're not missing important steps that the user needs.

Balancing these two metrics is critical. If we only focus on recall, the model might call too many tools, most of which are unnecessary. If we only focus on precision, we might miss tools that the user needs.
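Written as formulas, the two definitions look as follows. The symbols are introduced here for illustration only: $C$ is the set of tools the model called and $E$ is the set of tools we expected it to call.

$$
\text{precision} = \frac{|C \cap E|}{|C|}, \qquad
\text{recall} = \frac{|C \cap E|}{|E|}
$$

For example, if the model calls three tools and only one of them is expected, precision is $1/3 \approx 0.33$; if that one tool is the only expected tool, recall is $1/1 = 1$. Both ratios are undefined when the denominator is empty, so any implementation has to pick a convention for an empty $C$ or $E$.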

Let's now see how we can manually calculate these metrics.


In [11]:
# Tools that our model called
model_tool_call = [
    "GET_CALENDAR_EVENTS",
    "CREATE_REMINDER",
    "SEND_EMAIL",
]

# Tools that we expected our model to call
expected_tool_call = [
    "GET_CALENDAR_EVENTS",
]


def calculate_precision(model_tool_call, expected_tool_call):
    if len(model_tool_call) == 0:
        return 0

    relevant_results = sum(1 for tool in model_tool_call if tool in expected_tool_call)
    return round(relevant_results / len(model_tool_call), 2)


def calculate_recall(model_tool_call, expected_tool_call):
    if len(expected_tool_call) == 0:
        return 1

    if len(model_tool_call) == 0:
        return 0

    relevant_results = sum(1 for tool in expected_tool_call if tool in model_tool_call)
    return round(relevant_results / len(expected_tool_call), 2)


precision, recall = (
    calculate_precision(model_tool_call, expected_tool_call),
    calculate_recall(model_tool_call, expected_tool_call),
)

precision, recall

(0.33, 1.0)

We can see that for this specific case, we had two tools that were called that were irrelevant to the user's query - `CREATE_REMINDER` and `SEND_EMAIL`. For a production application, we'd want to avoid this.

We did achieve a perfect recall - but remember here that a perfect recall can also be achieved by calling every single tool in our application. We want to minimise the amount of wasted computation. Let's see another example of how to compute these metrics.

In [12]:
# Tools that our model called
model_tool_call = [
    "GET_CALENDAR_EVENTS",
]

# Tools that we expected our model to call
expected_tool_call = [
    "GET_CALENDAR_EVENTS",
    "CREATE_REMINDER",
]

precision, recall = (
    calculate_precision(model_tool_call, expected_tool_call),
    calculate_recall(model_tool_call, expected_tool_call),
)

precision, recall

(1.0, 0.5)

While we have a lower recall of 0.5 here because we didn't call the `CREATE_REMINDER` tool, we have a higher precision of 1. This is preferable to the previous case where we called two irrelevant tools.

Therefore, what we want to do is to maximise precision while keeping recall high. This means that we ideally want to make sure that **all of our tools called are relevant** while making sure that we **call as many of the relevant tools as possible**. This is quite distinct from RAG, where we want to maximise recall while relying on the model's ability to filter out irrelevant information.
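To make the trade-off concrete, here is a minimal, self-contained sketch that scores three strategies against the same expected tools. The helper `precision_recall` and the `ALL_TOOLS` set are names made up for this illustration, and the conventions for empty sets are an assumption of the sketch.

```python
def precision_recall(called: set[str], expected: set[str]) -> tuple[float, float]:
    """Return (precision, recall) for one query, treating tool calls as sets."""
    hits = len(called & expected)
    # Conventions for empty sets are an assumption of this sketch.
    precision = hits / len(called) if called else 0.0
    recall = hits / len(expected) if expected else 1.0
    return round(precision, 2), round(recall, 2)


ALL_TOOLS = {"GET_CALENDAR_EVENTS", "CREATE_REMINDER", "SEND_EMAIL"}
expected = {"GET_CALENDAR_EVENTS", "CREATE_REMINDER"}

# Calling every tool guarantees perfect recall but lowers precision.
call_everything = precision_recall(ALL_TOOLS, expected)

# Calling only one relevant tool gives perfect precision but misses a step.
call_one = precision_recall({"GET_CALENDAR_EVENTS"}, expected)

# Calling exactly the expected tools maximises both metrics.
call_exact = precision_recall(expected, expected)

print(call_everything, call_one, call_exact)
# (0.67, 1.0) (1.0, 0.5) (1.0, 1.0)
```

Only the last strategy scores 1.0 on both metrics, which is why we track precision and recall together rather than optimising one in isolation.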

### Defining our Tools

We want to have a set of test cases that we can use to evaluate the performance of our model. We want to use them to measure the precision and recall of our model's tool calling in response to a user query. 

We'll demonstrate this in 3 steps:

1. We'll first define some tools that a simple personal assistant chatbot might use
2. We'll then define a set of test cases and corresponding expected tool calls
3. Lastly, we'll evaluate how well our model performs on these test cases using simple precision and recall metrics


In [13]:
from pydantic import BaseModel, Field
from typing import Literal
from datetime import datetime, timedelta
from typing import Union


class SendEmail(BaseModel):
    email: str
    subject: str
    body: str


class GetCalendarEvents(BaseModel):
    calendar: list[Literal["work", "personal"]]
    start_date: datetime = Field(default_factory=datetime.now)
    end_date: datetime = Field(
        default_factory=lambda: datetime.now() + timedelta(days=7)
    )


class CreateReminder(BaseModel):
    title: str
    description: str
    due_date: datetime


class ToolCalls(BaseModel):
    calls: list[
        Union[
            SendEmail,
            GetCalendarEvents,
            CreateReminder,
        ]
    ]

In [24]:
import instructor
from openai import AsyncOpenAI
from asyncio import Semaphore, get_running_loop
import time


client = instructor.from_openai(AsyncOpenAI())


async def generate_tool_calls(query: str, sem: Semaphore):
    async with sem:
        start = get_running_loop().time()
        resp = await client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": f"You are a helpful assistant that can call tools in response to user requests. Today's date is {datetime.now().strftime('%Y-%m-%d')}",
                },
                {"role": "user", "content": query},
            ],
            response_model=ToolCalls,
        )
        end = get_running_loop().time()
        return {
            "response": resp,
            "time": end - start,
        }

In [25]:
import asyncio

tests = [
    # Single tool queries
    ["Send an email to john@example.com about the project update", [SendEmail]],
    ["What meetings do I have scheduled for tomorrow?", [GetCalendarEvents]],
    ["Set a reminder for my dentist appointment next week", [CreateReminder]],
    # Two tool combinations
    [
        "Check my calendar for next week's meetings and set reminders for each one",
        [GetCalendarEvents, CreateReminder],
    ],
    [
        "Look up my team meeting schedule and send the agenda to all participants",
        [GetCalendarEvents, SendEmail],
    ],
    [
        "Set a reminder for the client call and send a confirmation email to the team",
        [CreateReminder, SendEmail],
    ],
]

sem = asyncio.Semaphore(10)
coros = [generate_tool_calls(query, sem) for query, _ in tests]

results = await asyncio.gather(*coros)
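The cell above bounds concurrency with a semaphore and runs every query through `asyncio.gather`. The sketch below isolates that pattern without any API calls: `fake_model_call` is a hypothetical stand-in that sleeps instead of contacting a model, and the limit of 2 is chosen only to make the throttling visible.

```python
import asyncio
import time


async def fake_model_call(query: str, sem: asyncio.Semaphore) -> str:
    """Hypothetical stand-in for an API call; sleeps instead of hitting a model."""
    async with sem:  # only as many coroutines as the semaphore allows run this body at once
        await asyncio.sleep(0.1)
        return query.upper()


async def main() -> tuple[list[str], float]:
    sem = asyncio.Semaphore(2)  # allow two requests in flight at a time
    queries = ["a", "b", "c", "d"]
    start = time.perf_counter()
    # gather returns results in the same order as the awaitables passed in,
    # so results can be zipped back against the queries
    results = await asyncio.gather(*(fake_model_call(q, sem) for q in queries))
    return results, time.perf_counter() - start


results, elapsed = asyncio.run(main())
print(results, round(elapsed, 2))
```

With four queries and two slots, the calls run in two waves, so the total time is roughly twice the time of a single call rather than four times. Inside a notebook cell, where an event loop is already running, you would write `await main()` instead of `asyncio.run(main())`.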

In [26]:
import pandas as pd


def calculate_precision_recall_for_queries(df):
    df = df.copy()
    df["precision"] = df.apply(
        lambda x: calculate_precision(x["actual"], x["expected"]), axis=1
    )
    df["recall"] = df.apply(
        lambda x: calculate_recall(x["actual"], x["expected"]), axis=1
    )
    df["CORRECT"] = df.apply(
        lambda x: "Y" if set(x["expected"]) == set(x["actual"]) else "N", axis=1
    )
    return df


df = pd.DataFrame(
    [
        {
            "query": test_item[0],
            "expected": [tool.__name__ for tool in test_item[1]],
            "actual": list(
                set([type(tool).__name__ for tool in result["response"].calls])
            ),
            "time": round(result["time"], 2),
        }
        for test_item, result in zip(tests, results)
    ]
)
df = calculate_precision_recall_for_queries(df)
df

| | query | expected | actual | time | precision | recall | CORRECT |
|---|---|---|---|---|---|---|---|
| 0 | Send an email to john@example.com about the pr... | [SendEmail] | [SendEmail] | 2.57 | 1.0 | 1.0 | Y |
| 1 | What meetings do I have scheduled for tomorrow? | [GetCalendarEvents] | [GetCalendarEvents] | 1.81 | 1.0 | 1.0 | Y |
| 2 | Set a reminder for my dentist appointment next... | [CreateReminder] | [CreateReminder] | 1.65 | 1.0 | 1.0 | Y |
| 3 | Check my calendar for next week's meetings and... | [GetCalendarEvents, CreateReminder] | [GetCalendarEvents] | 1.78 | 1.0 | 0.5 | N |
| 4 | Look up my team meeting schedule and send the ... | [GetCalendarEvents, SendEmail] | [GetCalendarEvents] | 0.87 | 1.0 | 0.5 | N |
| 5 | Set a reminder for the client call and send a ... | [CreateReminder, SendEmail] | [SendEmail, CreateReminder] | 2.51 | 1.0 | 1.0 | Y |


In [27]:
df[["recall", "precision", "time"]].mean().round(2)

recall       0.83
precision    1.00
time         1.86
dtype: float64

We can see a few things here:

1. Our model has high precision. On this test set it is 1.00, which means that every tool the model decided to call was relevant to the user's query.
2. It has lower recall on queries that combine two tools. For the query `Look up my team meeting schedule and send the agenda to all participants`, it called only `GetCalendarEvents` and missed `SendEmail`. Similarly, for the calendar-and-reminders query it called `GetCalendarEvents` but missed `CreateReminder`.
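Observations like these can be turned into simple assertions. The following is a minimal sketch under stated assumptions: `score` and `failing_cases` are helper names invented for this example, the two cases are copied from the results table above, and the thresholds are arbitrary.

```python
# Two queries taken from the results table: one correct, one with a missed tool.
cases = [
    {
        "query": "What meetings do I have scheduled for tomorrow?",
        "expected": ["GetCalendarEvents"],
        "actual": ["GetCalendarEvents"],
    },
    {
        "query": "Look up my team meeting schedule and send the agenda to all participants",
        "expected": ["GetCalendarEvents", "SendEmail"],
        "actual": ["GetCalendarEvents"],
    },
]


def score(expected: list[str], actual: list[str]) -> tuple[float, float]:
    """Return (precision, recall) for a single query."""
    hits = len(set(expected) & set(actual))
    precision = hits / len(set(actual)) if actual else 0.0
    recall = hits / len(set(expected)) if expected else 1.0
    return precision, recall


def failing_cases(cases, min_precision=1.0, min_recall=1.0) -> list[str]:
    """Return the queries whose scores fall below either threshold."""
    failures = []
    for case in cases:
        precision, recall = score(case["expected"], case["actual"])
        if precision < min_precision or recall < min_recall:
            failures.append(case["query"])
    return failures


def test_tool_selection():
    # A lenient recall threshold that both cases satisfy.
    assert failing_cases(cases, min_recall=0.5) == []


test_tool_selection()
print(failing_cases(cases))  # strict thresholds flag the two-tool query
```

Returning the failing queries instead of a bare boolean means that a failed assertion points directly at the queries that need attention.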

We can go one step further and segment questions by examining the recall per class. We want to do so because it allows us to identify specific classes of tool(s) that our model struggles with.

In [38]:
def calculate_per_tool_recall(df):
    """
    This assumes that we have a dataframe with the columns expected and actual that correspond to the expected and actual tool calls respectively.
    """
    # Get all unique tools
    all_tools = set()
    for tools in df["expected"] + df["actual"]:
        all_tools.update(tools)

    occurences = {tool: 0 for tool in all_tools}
    expected_occurences = {tool: 0 for tool in all_tools}

    # Count occurrences for each individual tool
    for _, row in df.iterrows():
        expected_tools = set(row["expected"])
        actual_tools = set(row["actual"])

        for tool in expected_tools:
            expected_occurences[tool] += 1

        for tool in actual_tools:
            if tool in expected_tools:
                occurences[tool] += 1

    # Calculate per-tool recall
    per_tool_recall = []
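    for tool in all_tools:
        per_tool_recall.append(
            {
                "tool": tool,
                "actual": occurences[tool],
                "expected": expected_occurences[tool],
                # A tool that was called but never expected has no defined recall
                "recall": (
                    occurences[tool] / expected_occurences[tool]
                    if expected_occurences[tool]
                    else float("nan")
                ),
            }
        )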

    return pd.DataFrame(per_tool_recall).round(2)


# Call the function with our dataframe
calculate_per_tool_recall(df)

| | tool | actual | expected | recall |
|---|---|---|---|---|
| 0 | SendEmail | 2 | 3 | 0.67 |
| 1 | CreateReminder | 2 | 3 | 0.67 |
| 2 | GetCalendarEvents | 3 | 3 | 1.0 |
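Per-tool recall has a natural counterpart: per-tool precision, which asks how often a tool was actually expected when the model called it. The sketch below is one possible way to compute it with the standard library; `per_tool_precision` and the rows are made up for illustration and are not results from the runs in this notebook.

```python
from collections import Counter

# Hypothetical rows, made up for illustration only.
rows = [
    {"expected": ["SendEmail"], "actual": ["SendEmail", "CreateReminder"]},
    {"expected": ["GetCalendarEvents", "SendEmail"], "actual": ["GetCalendarEvents"]},
    {"expected": ["CreateReminder"], "actual": ["CreateReminder"]},
]


def per_tool_precision(rows: list[dict]) -> dict[str, float]:
    """For each tool, the fraction of queries calling it where it was expected."""
    called = Counter()   # number of queries in which the tool was called
    correct = Counter()  # number of those queries in which it was also expected
    for row in rows:
        expected = set(row["expected"])
        for tool in set(row["actual"]):
            called[tool] += 1
            if tool in expected:
                correct[tool] += 1
    # Only tools that were called at least once appear, so no division by zero.
    return {tool: round(correct[tool] / called[tool], 2) for tool in called}


precision_by_tool = per_tool_precision(rows)
print(precision_by_tool)
```

In these made-up rows, `CreateReminder` is called twice but expected only once, so its precision is 0.5 while the other two tools score 1.0.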


## Parallel Tool Calling

In our previous example, we had to wait for the model's entire response before we could call any tools, so each tool call had to wait for the prior ones to be generated.

With parallel tool calling, we can sidestep this and have the model generate multiple tool calls in a single request. We can then use our evals to benchmark how this latency improvement affects model performance.

Let's see how we can do so.

In [29]:
import openai
import instructor
from typing import Iterable, Union
from rich import print

client = instructor.from_openai(
    openai.AsyncOpenAI(), mode=instructor.Mode.PARALLEL_TOOLS
)

function_calls = await client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": f"You must always use tools. Today's date is {datetime.now().strftime('%Y-%m-%d')}",
        },
        {
            "role": "user",
            "content": "Can you fetch my calendar events for the next week and send an email to John(john@example.com) about the meeting we have tomorrow?",
        },
    ],
    response_model=Iterable[Union[GetCalendarEvents, SendEmail, CreateReminder]],
)

for fc in function_calls:
    print(fc)

Let's now see how we can adapt our previous unit test to evaluate the performance of our model with parallel tool calling.

In [30]:
from typing import Iterable, Union

client = instructor.from_openai(AsyncOpenAI(), mode=instructor.Mode.PARALLEL_TOOLS)


async def generate_parallel_tool_calls(query: str, sem: Semaphore):
    async with sem:
        start = time.time()
        resp = await client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "You must always use tools"},
                {
                    "role": "user",
                    "content": query,
                },
            ],
            response_model=Iterable[
                Union[GetCalendarEvents, SendEmail, CreateReminder]
            ],
        )
        end = time.time()

        try:
            tools = [tool for tool in resp]
        except Exception:
            tools = []

        return {
            "response": tools,
            "time": end - start,
        }

In [31]:
import asyncio

tests = [
    # Single tool queries
    ["Send an email to john@example.com about the project update", [SendEmail]],
    ["What meetings do I have scheduled for tomorrow?", [GetCalendarEvents]],
    ["Set a reminder for my dentist appointment next week", [CreateReminder]],
    # Two tool combinations
    [
        "Check my calendar for next week's meetings and set reminders for each one",
        [GetCalendarEvents, CreateReminder],
    ],
    [
        "Look up my team meeting schedule and send the agenda to all participants",
        [GetCalendarEvents, SendEmail],
    ],
    [
        "Set a reminder for the client call and send a confirmation email to the team",
        [CreateReminder, SendEmail],
    ],
]

sem = asyncio.Semaphore(10)
coros = [generate_parallel_tool_calls(query, sem) for query, _ in tests]

results = await asyncio.gather(*coros)

In [39]:
df = pd.DataFrame(
    [
        {
            "query": test_item[0],
            "expected": [tool.__name__ for tool in test_item[1]],
            "actual": list(set([type(tool).__name__ for tool in result["response"]])),
            "time": round(result["time"], 2),
        }
        for test_item, result in zip(tests, results)
    ]
)
df = calculate_precision_recall_for_queries(df)
df

| | query | expected | actual | time | precision | recall | CORRECT |
|---|---|---|---|---|---|---|---|
| 0 | Send an email to john@example.com about the pr... | [SendEmail] | [] | 0.89 | 0.0 | 0.0 | N |
| 1 | What meetings do I have scheduled for tomorrow? | [GetCalendarEvents] | [GetCalendarEvents] | 1.9 | 1.0 | 1.0 | Y |
| 2 | Set a reminder for my dentist appointment next... | [CreateReminder] | [] | 1.26 | 0.0 | 0.0 | N |
| 3 | Check my calendar for next week's meetings and... | [GetCalendarEvents, CreateReminder] | [GetCalendarEvents] | 1.64 | 1.0 | 0.5 | N |
| 4 | Look up my team meeting schedule and send the ... | [GetCalendarEvents, SendEmail] | [] | 1.63 | 0.0 | 0.0 | N |
| 5 | Set a reminder for the client call and send a ... | [CreateReminder, SendEmail] | [SendEmail, CreateReminder] | 2.81 | 1.0 | 1.0 | Y |


In [40]:
calculate_per_tool_recall(df)

| | tool | actual | expected | recall |
|---|---|---|---|---|
| 0 | SendEmail | 1 | 3 | 0.33 |
| 1 | CreateReminder | 1 | 3 | 0.33 |
| 2 | GetCalendarEvents | 2 | 3 | 0.67 |


In [41]:
df[["recall", "precision", "time"]].mean().round(2)

recall       0.42
precision    0.50
time         1.69
dtype: float64

## Conclusion

In this notebook, we showed how precision and recall can serve as objective metrics of a model's tool selection abilities. Looking at our experiment results below, we found that while parallel tool calling reduced average execution time by ~9%, it came with significant tradeoffs: we saw a ~50% drop in both precision and recall. Most of that drop comes from the three queries where the parallel run returned no tool calls at all, which score 0 on both metrics.


| Metric        | Tool Calling (Baseline) | Parallel Tool Calling |
|---------------|-------------------------|----------------------|
| **Precision** | 1.00                   | 0.50  ( -50% )       |
| **Recall**    | 0.83                   | 0.42  ( -49% )       |
| **Avg Time**  | 1.86                   | 1.69  ( -9% )        |
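The percentages in the table are relative changes against the baseline. As a quick check, they can be reproduced from the averages reported earlier; `relative_change` is a helper name introduced for this sketch.

```python
# Averages reported for each run earlier in the notebook
baseline = {"precision": 1.00, "recall": 0.83, "time": 1.86}
parallel = {"precision": 0.50, "recall": 0.42, "time": 1.69}


def relative_change(old: float, new: float) -> int:
    """Percentage change from old to new, rounded to the nearest integer."""
    return round((new - old) / old * 100)


changes = {metric: relative_change(baseline[metric], parallel[metric]) for metric in baseline}
print(changes)
# {'precision': -50, 'recall': -49, 'time': -9}
```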

Just as Week 1 established quick, objective metrics for retrieval quality, we've shown how similar principles apply to tool selection. Rather than evaluating complex tool executions, we can rapidly prototype and improve selection logic using simple binary metrics. This lets us identify issues early - like our model's difficulty combining multiple tools - before investing in more complex infrastructure.

In the next notebook, we'll scale this evaluation framework from our small test case of three tools to a real-world application with 70 tools to call. We'll show how to dynamically provide these tool options using a YAML configuration and how to generate synthetic test data to cover different tool combinations and edge cases.

With clear metrics in place, evaluating the impact of techniques like few-shot examples and improved prompts on tool selection accuracy becomes easy.