In [1]:
%load_ext autoreload
%autoreload 2

# Week 6 - Systematically Improving Your RAG Application

> **Prerequisites**: Make sure that you've completed the previous notebooks `1. Evaluate Tools` and `2. Generate Dataset` before continuing with this notebook. We'll be using the results from the previous notebook to evaluate the effectiveness of our new techniques.

In this notebook, we'll explore how to improve our model's ability to select the right tools using system prompts and few-shot examples.

## Why this matters

When deploying RAG systems that coordinate multiple tools, getting tool selection wrong wastes resources and degrades user experience. Simple techniques like system prompts and few-shot examples can significantly boost tool selection accuracy without complex infrastructure changes.

Just as Week 1 showed how synthetic data could improve retrieval, and Week 4 demonstrated how topic modeling helps understand query patterns, strategic prompting can help models better understand when and how to use different tools. By systematically testing these improvements against our evaluation framework, we can quantify exactly how much each change helps.

## What you'll learn

Through hands-on experimentation with a personal assistant chatbot, you'll discover how to:


1. Leverage System Prompts
- Write effective prompts that explain tool usage patterns
- Help models understand user workflows and preferences
- Validate prompt improvements with metrics

2. Design Few-Shot Examples 
- Create examples that demonstrate correct tool combinations
- Target specific failure modes identified in testing
- Balance example diversity and relevance


3. Measure Improvements
- Compare performance before and after changes
- Track precision and recall across different approaches
- Make data-driven decisions about prompting strategies

By the end of this notebook, you'll understand how to systematically improve tool selection accuracy through better prompting, and how to measure the impact of these changes using objective metrics.

## System Prompts

By letting each user describe their workflow and tool preferences in a system prompt, our model can adapt to a wider variety of users and their specific tool usage patterns.

Let's see this in action below, where we add the user-provided system prompt to our prompt template.

In [2]:
import instructor
from helpers import load_commands, load_queries, Command, SelectedCommands

user_system_prompt = """
I work as a software engineer at a company. When it comes to work, we normally track all outstanding tasks in Jira and handle the code review/discussions in github itself. 

refer to jira to get ticket information and updates but github is where code reviews, discussions and any other specific code related updates are tracked. Use the recently updated issues command to get the latest updates over search. 

for todos, i use a single note in apple notes for all my todos unless i say otherwise. Obsidian is where I store diagrams, charts and notes that I've taken down for things that I'm studying up on. Our company uses confluence for documentation, wikis, release reports, meeting notes etc that need to be shared with the rest of the team. Notion I use it for financial planning, tracking expenses and planning for trips. I always use databases in notion.

For messaging apps, I tend to just use discord for chatting with my friends when we game, i use microsoft teams for communicating with colleague about spcifically work related matters and iMessage for personal day to day stuff (Eg. coordinate a party, ask about general things in a personal context)
"""


async def generate_commands_with_system_prompt(
    query: str,
    client: instructor.AsyncInstructor,
    commands: list[Command],
    user_system_prompt: str,
):
    response = await client.chat.completions.create(
        messages=[
            {
                "role": "system",
                "content": """
                You are a helpful assistant that can execute commands in response to a user query. You have access to the following commands:
                
                <commands>
                {% for command in commands %}
                - {{ command.key }} : {{ command.command_description }}
                {% endfor %}
                </commands>

                You must select at least one command to be called.

                Here is some information about how the user uses each extension. Remember to find a chat before sending a message.

                <user_behaviour>
                {{ user_behaviour }}
                </user_behaviour>
                """,
            },
            {
                "role": "user",
                "content": query,
            },
        ],
        model="gpt-4o",
        response_model=SelectedCommands,
        context={"commands": commands, "user_behaviour": user_system_prompt},
    )
    return response.selected_commands
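
To see roughly what the model receives once the `{{ ... }}` placeholders are filled from `context`, here is a dependency-free sketch that assembles the same sections with plain string formatting. It does not use instructor or Jinja, so whitespace will differ from the real rendering. `CommandSketch`, `render_system_prompt_sketch` and the sample command descriptions are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class CommandSketch:
    # Hypothetical stand-in for helpers.Command: only the two fields the template reads.
    key: str
    command_description: str


def render_system_prompt_sketch(commands, user_behaviour):
    # One "- key : description" line per command, mirroring the template's for-loop.
    command_lines = "\n".join(
        f"- {c.key} : {c.command_description}" for c in commands
    )
    return (
        "You are a helpful assistant that can execute commands in response "
        "to a user query. You have access to the following commands:\n\n"
        f"<commands>\n{command_lines}\n</commands>\n\n"
        "You must select at least one command to be called.\n\n"
        "Here is some information about how the user uses each extension. "
        "Remember to find a chat before sending a message.\n\n"
        f"<user_behaviour>\n{user_behaviour.strip()}\n</user_behaviour>"
    )


# Placeholder descriptions, not taken from raw_commands.json.
sample_commands = [
    CommandSketch("microsoft-teams.findChat", "Find a chat or channel"),
    CommandSketch("microsoft-teams.sendMessage", "Send a message to a chat"),
]
rendered = render_system_prompt_sketch(sample_commands, "I use Teams for work.")
print(rendered)
```

Printing the rendered prompt like this is a cheap way to check that the user's description actually ends up inside the `<user_behaviour>` block before spending tokens on an evaluation run.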



Let's first try a single query, then re-run our evaluations to see how the model performs.

In [3]:
import instructor
from openai import AsyncOpenAI

commands = load_commands("raw_commands.json")
queries = load_queries(commands, "queries.jsonl")
client = instructor.from_openai(AsyncOpenAI())
await generate_commands_with_system_prompt(
    "notify #engineering that I'm running late for our meeting and i'll be there in 20 minutes",
    client,
    commands,
    user_system_prompt,
)

[UserCommand(key='microsoft-teams.findChat', arguments=[UserCommandArgument(title='Chat or Channel Name', value='#engineering')])]

In [13]:
from braintrust import Score, EvalAsync
from helpers import calculate_precision, calculate_recall


def evaluate_braintrust(input, output, **kwargs):
    return [
        Score(
            name="precision",
            score=calculate_precision(output, kwargs["expected"]),
        ),
        Score(
            name="recall",
            score=calculate_recall(output, kwargs["expected"]),
        ),
    ]


client = instructor.from_openai(AsyncOpenAI())
commands = load_commands("raw_commands.json")
queries = load_queries(commands, "queries.jsonl")


async def task(query, hooks):
    resp = await generate_commands_with_system_prompt(
        query, client, commands, user_system_prompt
    )
    hooks.meta(
        input=query,
        output=resp,
    )
    return [item.key for item in resp]


results = await EvalAsync(
    "function-calling",
    data=[
        {
            "input": row["query"],
            "expected": row["labels"],
        }
        for row in queries
    ],
    task=task,
    scores=[evaluate_braintrust],
)

Experiment week-6-fixes-1741091518 is running at https://www.braintrust.dev/app/567/p/function-calling/experiments/week-6-fixes-1741091518
function-calling (data): 51it [00:00, 63268.12it/s]


function-calling (tasks):   0%|          | 0/51 [00:00<?, ?it/s]


week-6-fixes-1741091518 compared to week-6-fixes-1741091506:
54.25% (+00.47%) 'recall'    score	(5 improvements, 5 regressions)
63.55% (+03.10%) 'precision' score	(8 improvements, 8 regressions)

1741091518.79s start
1741091521.06s end
2.27s (+14.47%) 'duration'	(24 improvements, 27 regressions)

See results for week-6-fixes-1741091518 at https://www.braintrust.dev/app/567/p/function-calling/experiments/week-6-fixes-1741091518


### Performance Improvement with System Prompt

| Metric | Baseline | System Prompt |
|--------|----------|--------------|
| Precision | 0.45 | 0.64 (+42%) |
| Recall | 0.40 | 0.54 (+35%) |
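
The percentages in parentheses are relative gains over the baseline, not absolute differences. A minimal sketch of that arithmetic, using the rounded scores from the table (`relative_gain_pct` is a hypothetical helper name):

```python
def relative_gain_pct(baseline, improved):
    # Relative change versus the baseline, as a whole-number percentage.
    return round((improved - baseline) / baseline * 100)


precision_gain = relative_gain_pct(0.45, 0.64)
recall_gain = relative_gain_pct(0.40, 0.54)
print(precision_gain)  # → 42
print(recall_gain)     # → 35
```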

By providing a system prompt, precision rose from 0.45 to 0.64 and recall from 0.40 to 0.54. This suggests that the system prompt helps the model understand how the user uses each tool.

Better yet, using system prompts allows our model to be more flexible and handle a greater variety of users who may have different ways of interacting with the tools.

### Comparing System Prompts with Baseline

Now that we've seen an overall improvement, let's look at which specific queries our model still struggles with. We'll do so by computing the same metrics as we did in our previous notebook.

In [14]:
import pandas as pd
from helpers import (
    calculate_per_tool_recall,
    calculate_precision_recall_for_queries,
    get_mismatched_examples_for_tool,
)

df = pd.DataFrame(
    [
        {
            "query": row.input,
            "expected": row.expected,
            "actual": row.output,
        }
        for row in results.results
    ]
)

df = calculate_precision_recall_for_queries(df)
df.head(10)

Unnamed: 0,query,expected,actual,precision,recall,CORRECT
0,"create a grocery list note with Milk, eggs, br...","[apple-notes.new, apple-notes.add-text, apple-...","[apple-notes.new, apple-notes.menu-bar, apple-...",0.67,0.67,N
1,Let's create a new release post about our late...,"[confluence-search.new-blog, jira.active-sprin...","[jira.recently-updated-issues, confluence-sear...",0.5,0.25,N
2,find weather taiwan december and generate a sh...,"[google-search.index, apple-notes.new, apple-n...","[google-search.index, apple-notes.index]",0.5,0.33,N
3,"Set my status to 'Working from home today, cat...",[microsoft-teams.setStatus],"[microsoft-teams.setStatus, microsoft-teams.fi...",0.33,1.0,N
4,any security alerts raised since we upgraded o...,[github.unread-notifications],[github.search-issues],0.0,0.0,N
5,Tell mum i'll be back for dinner around 7pm,"[imessage.findChat, imessage.sendMessage]",[imessage.findChat],1.0,0.5,N
6,just booked the latest accomodations for tokyo...,[notion.create-database-page],[notion.create-database-page],1.0,1.0,Y
7,"search messages Gregory modal, need to find th...",[microsoft-teams.searchMessages],[microsoft-teams.findChat],0.0,0.0,N
8,add a reminder to my todos to buy some groceri...,[apple-notes.add-text],[apple-notes.add-text],1.0,1.0,Y
9,"pull up munich plans, send mike the airbnb lin...","[notion.search-page, imessage.findChat, imessa...","[notion.search-page, imessage.findChat]",1.0,0.67,N
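
The `precision` and `recall` columns above are consistent with a set-based comparison of command keys. The real `calculate_precision` and `calculate_recall` live in the course's `helpers` module and aren't shown here, so the following is only a sketch under that assumption, with hypothetical function names.

```python
def precision_sketch(predicted, expected):
    # Share of predicted commands that were actually expected.
    predicted_set, expected_set = set(predicted), set(expected)
    if not predicted_set:
        return 0.0
    return len(predicted_set & expected_set) / len(predicted_set)


def recall_sketch(predicted, expected):
    # Share of expected commands that the model actually predicted.
    predicted_set, expected_set = set(predicted), set(expected)
    if not expected_set:
        return 0.0
    return len(predicted_set & expected_set) / len(expected_set)


# Row 5 from the table: the model found the chat but never sent the message.
expected = ["imessage.findChat", "imessage.sendMessage"]
predicted = ["imessage.findChat"]
row_precision = precision_sketch(predicted, expected)
row_recall = recall_sketch(predicted, expected)
print(row_precision, row_recall)  # → 1.0 0.5
```

This is why a row can have perfect precision and still be marked incorrect: every command the model chose was right, but it left one out.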


In [17]:
df_per_tool = calculate_per_tool_recall(df)
df_per_tool.sort_values(by="recall", ascending=True).head(20)

Unnamed: 0,tool,actual,expected,recall
21,jira.open-issues,0,1,0.0
2,confluence-search.add-text,0,1,0.0
27,jira.active-sprints,0,2,0.0
30,microsoft-teams.searchMessages,0,1,0.0
16,discord.sendMessage,0,3,0.0
6,github.my-pull-requests,0,2,0.0
15,microsoft-teams.sendMessage,0,13,0.0
8,imessage.sendMessage,0,5,0.0
14,discord.searchMessages,0,1,0.0
24,confluence-search.go,0,1,0.0
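
For intuition, per-tool recall can be thought of as: out of all the queries where a tool was expected, in how many did the model actually call it? The sketch below assumes that definition. `calculate_per_tool_recall` itself comes from `helpers` and may differ, and `per_tool_recall_sketch` is a hypothetical name.

```python
from collections import Counter


def per_tool_recall_sketch(rows):
    # rows: iterable of (expected_keys, actual_keys) pairs, one per query.
    expected_counts = Counter()
    hit_counts = Counter()
    for expected, actual in rows:
        actual_set = set(actual)
        for tool in set(expected):
            expected_counts[tool] += 1
            if tool in actual_set:
                hit_counts[tool] += 1
    return {
        tool: hit_counts[tool] / expected_counts[tool]
        for tool in expected_counts
    }


# Three rows taken from the evaluation output in this notebook.
sample_rows = [
    (["imessage.findChat", "imessage.sendMessage"], ["imessage.findChat"]),
    (["imessage.findChat", "imessage.sendMessage"], ["imessage.findChat"]),
    (["discord.findChat", "discord.sendMessage"], ["discord.findChat"]),
]
tool_recall = per_tool_recall_sketch(sample_rows)
print(tool_recall["imessage.findChat"])     # → 1.0
print(tool_recall["imessage.sendMessage"])  # → 0.0
```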


From the table above, some of the function calls with the largest gap between expected and actual calls are:

1. `microsoft-teams.sendMessage` and `imessage.sendMessage`
2. `github.unread-notifications`
3. `discord.findChat` and `discord.sendMessage`

Below, we'll grab some examples where a specific tool call was expected but the model's selection didn't match.


In [18]:
# Set display width to maximum
pd.set_option("display.max_colwidth", None)
pd.set_option("display.width", None)

# Example usage for unread-notifications
unread_notification_examples = get_mismatched_examples_for_tool(
    df, "unread-notifications", num_examples=5
)
unread_notification_examples

Unnamed: 0,query,expected,actual,precision,recall,CORRECT
4,any security alerts raised since we upgraded our nextjs dependencies over to 14.2.0?,[github.unread-notifications],[github.search-issues],0.0,0.0,N
22,any more prs or security alerts to worry about?,[github.unread-notifications],"[github.my-pull-requests, github.notifications, github.unread-notifications]",0.33,1.0,N
48,check if there are any dependency vulnerabilities raised recently,[github.unread-notifications],"[jira.recently-updated-issues, github.notifications, github.search-issues]",0.0,0.0,N
49,did anyone comment on the pr for the performance fix?,[github.unread-notifications],[github.my-pull-requests],0.0,0.0,N
50,pull up those security alerts and ping the security team,"[github.unread-notifications, microsoft-teams.findChat, microsoft-teams.sendMessage]","[github.notifications, microsoft-teams.findChat]",0.5,0.33,N



In [20]:
send_message_examples = get_mismatched_examples_for_tool(
    df, "imessage.sendMessage", num_examples=5
)
send_message_examples

Unnamed: 0,query,expected,actual,precision,recall,CORRECT
5,Tell mum i'll be back for dinner around 7pm,"[imessage.findChat, imessage.sendMessage]",[imessage.findChat],1.0,0.5,N
9,"pull up munich plans, send mike the airbnb link to the accoms on the 22nd","[notion.search-page, imessage.findChat, imessage.sendMessage]","[notion.search-page, imessage.findChat]",1.0,0.67,N
27,Message David to ask if he's still up for basketball this weekend,"[imessage.findChat, imessage.sendMessage]",[imessage.findChat],1.0,0.5,N
28,Send Kevin a message asking if he still has my charger,"[imessage.findChat, imessage.sendMessage]",[imessage.findChat],1.0,0.5,N
29,Tell Alex I'm running 15 minutes late for brunch,"[imessage.findChat, imessage.sendMessage]",[imessage.findChat],1.0,0.5,N


In [21]:
send_message_examples = get_mismatched_examples_for_tool(
    df, "discord.sendMessage", num_examples=5
)
send_message_examples

Unnamed: 0,query,expected,actual,precision,recall,CORRECT
12,"let's go crack open that new raid, set status to dnd and ping #bois","[discord.setStatus, discord.findChat, discord.sendMessage]","[discord.setStatus, discord.findChat]",1.0,0.67,N
44,tell #team-alpha I'll be late for tonight's dungeon run,"[discord.findChat, discord.sendMessage]",[discord.findChat],1.0,0.5,N
45,drop a message in #guild chat that I'm taking a break and set status to idle for now,"[discord.setStatus, discord.findChat, discord.sendMessage]","[discord.findChat, discord.setStatus]",1.0,0.67,N
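
The filtering used in the cells above can be approximated in a few lines. Judging from the output, `get_mismatched_examples_for_tool` matches the tool name as a substring of the expected keys and keeps rows whose prediction wasn't fully correct. That reading is an assumption, since the helper's source isn't shown, and `mismatched_examples_sketch` is a hypothetical name.

```python
def mismatched_examples_sketch(rows, tool_substring, num_examples=5):
    # Keep rows where the tool was expected but the predicted set
    # of commands doesn't exactly match the expected set.
    matches = [
        row
        for row in rows
        if any(tool_substring in key for key in row["expected"])
        and set(row["expected"]) != set(row["actual"])
    ]
    return matches[:num_examples]


# A few rows copied from the evaluation output in this notebook.
sample_rows = [
    {
        "query": "any security alerts raised since we upgraded our nextjs dependencies over to 14.2.0?",
        "expected": ["github.unread-notifications"],
        "actual": ["github.search-issues"],
    },
    {
        "query": "add a reminder to my todos to buy some groceries",
        "expected": ["apple-notes.add-text"],
        "actual": ["apple-notes.add-text"],
    },
    {
        "query": "any more prs or security alerts to worry about?",
        "expected": ["github.unread-notifications"],
        "actual": [
            "github.my-pull-requests",
            "github.notifications",
            "github.unread-notifications",
        ],
    },
]
mismatches = mismatched_examples_sketch(sample_rows, "unread-notifications")
print(len(mismatches))  # → 2
```

Note that the third row is kept even though the expected tool was called: the extra commands make the overall selection incorrect, which matches how row 22 shows up in the output above.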


### Few-Shot Prompting

Once we've identified potential problem areas, like the model calling `findChat` but never following up with `sendMessage`, few-shot examples can explicitly demonstrate these commands used in context.

For instance, we can show a few examples of how to use the `findChat` command together with a `sendMessage` command. A natural fit here is to create some content on an internal documentation site like Confluence and then send it over to a chat.

```
<query>generate release notes for the tickets closed in our current sprint and send the link over to the #product channel ahead of time so they know what's coming</query>
<commands>
    confluence-search.new-blog,
    confluence-search.add-text,
    microsoft-teams.findChat,
    microsoft-teams.sendMessage
</commands>
```

We could also be inventive and use the `searchMedia` command alongside the regular `searchNote` command to show the model how the two commands differ.

```
<query>Can you grab my notes and sketches which I put together about cross-attention?</query>
<commands>
    obsidian.searchMedia,
    obsidian.searchNote
</commands>
```

Including these concrete examples in the prompt shows the model the correct sequence of steps and reduces the chance that it calls the wrong command or stops partway through.
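
If you'd rather keep few-shot examples as data than hand-write them into the prompt, a small formatter can build the `<examples>` block. This is a sketch only: `format_examples_sketch` is a hypothetical helper, and the notebook itself writes the examples directly into the prompt string.

```python
def format_examples_sketch(examples):
    # examples: list of (query, [command keys]) pairs.
    blocks = []
    for query, command_keys in examples:
        commands = "\n".join(f"            {key}" for key in command_keys)
        blocks.append(
            "    <example>\n"
            f"        <query>{query}</query>\n"
            "        <commands>\n"
            f"{commands}\n"
            "        </commands>\n"
            "    </example>"
        )
    return "<examples>\n" + "\n".join(blocks) + "\n</examples>"


few_shot_examples = [
    (
        "Can you grab my notes and sketches which I put together about cross-attention?",
        ["obsidian.searchMedia", "obsidian.searchNote"],
    ),
    (
        "Can you text Philip a link to the notion document for our trip to Taiwan next week?",
        ["notion.search", "imessage.findChat", "imessage.sendMessage"],
    ),
]
examples_block = format_examples_sketch(few_shot_examples)
print(examples_block)
```

Keeping examples in a list makes it easier to swap in new ones that target whichever failure modes the per-tool recall table surfaces next.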

In [40]:
async def generate_commands_with_prompt_and_examples(
    query: str,
    client: instructor.AsyncInstructor,
    commands: list[Command],
    user_behaviour: str,
):
    response = await client.chat.completions.create(
        messages=[
            {
                "role": "system",
                "content": """
You are a helpful assistant that can execute commands in response to a user query. Only choose from the commands listed below. 

You have access to the following commands:

<commands>
{% for command in commands %}
<command>
    <command key>{{ command.key }}</command key>
    <command description>{{ command.command_description }}</command description>
</command>
{% endfor %}
</commands>

Select between 1-4 commands to be called in response to the user query.

Here is some information about how the user uses each extension. Remember to find a chat before sending a message.

<user_behaviour>
{{ user_behaviour }}
</user_behaviour>

Here are some past examples of queries that the user has asked in the past and the keys of the commands that were expected to be called. These provide valuable context and so look at it carefully and understand why each command was called, taking into account the user query below and the user behaviour provided above.

<examples>
    <example>
        <query>Compile any new outstanding PRs that have been reviewed recently with any vulnerabilities that have been reported and send them in a message to the engineering team in the #engineering channel to fix it before our next release</query>
        <commands>
            github.unread-notifications
            microsoft-teams.findChat
            microsoft-teams.sendMessage
        </commands>
    </example>
    <example>
        <query>Can you text Philip a link to the notion document for our trip to Taiwan next week?</query>
        <commands>
            notion.search
            imessage.findChat
            imessage.sendMessage
        </commands>
    </example>
    <example>
        <query>Can you pull up the visualisation I made to show how our D&D dungeon map layout works and then forward it to the party members in #gaming?</query>
        <commands>
            obsidian.searchMedia
            discord.findChat
            discord.sendMessage
        </commands>
    </example>
</examples>
                """,
            },
            {
                "role": "user",
                "content": query,
            },
        ],
        model="gpt-4o",
        response_model=SelectedCommands,
        context={"commands": commands, "user_behaviour": user_behaviour},
    )
    return response.selected_commands

In [43]:
def evaluate_braintrust(input, output, **kwargs):
    return [
        Score(
            name="precision",
            score=calculate_precision(output, kwargs["expected"]),
        ),
        Score(
            name="recall",
            score=calculate_recall(output, kwargs["expected"]),
        ),
    ]


client = instructor.from_openai(AsyncOpenAI())
commands = load_commands("raw_commands.json")
queries = load_queries(commands, "queries.jsonl")


async def task(query, hooks):
    resp = await generate_commands_with_prompt_and_examples(
        query, client, commands, user_system_prompt
    )
    hooks.meta(
        input=query,
        output=resp,
    )
    return [item.key for item in resp]


results = await EvalAsync(
    "function-calling",
    data=[
        {
            "input": row["query"],
            "expected": row["labels"],
        }
        for row in queries
    ],
    task=task,
    scores=[evaluate_braintrust],
)

Experiment week-6-fixes-1741092345 is running at https://www.braintrust.dev/app/567/p/function-calling/experiments/week-6-fixes-1741092345
function-calling (data): 51it [00:00, 47940.27it/s]


function-calling (tasks):   0%|          | 0/51 [00:00<?, ?it/s]


week-6-fixes-1741092345 compared to week-6-fixes-1741092333:
78.94% (+05.75%) 'recall'    score	(9 improvements, 7 regressions)
83.96% (+08.14%) 'precision' score	(10 improvements, 7 regressions)

1741092344.94s start
1741092347.20s end
2.26s (-03.44%) 'duration'	(28 improvements, 23 regressions)

See results for week-6-fixes-1741092345 at https://www.braintrust.dev/app/567/p/function-calling/experiments/week-6-fixes-1741092345


## Conclusion

In this notebook, we've shown how simple techniques like few-shot prompting and system prompts can significantly boost model performance in tool selection.

| Metric | Baseline | System Prompt | System Prompt + Few Shot |
|--------|----------|---------------|--------------------------|
| Precision | 0.45 | 0.64 (+42%) | 0.84 (+87%) |
| Recall | 0.40 | 0.54 (+35%) | 0.79 (+98%) |



These gains came from two key insights: clear system prompts help models understand tool usage patterns (like using Teams for work vs Discord for gaming), while targeted examples prevent common mistakes like finding a chat but never sending the message.

This reflects the systematic pattern we've followed throughout the course. Each week started by defining clear metrics to optimize, whether that was MRR and recall for retrieval (Week 1), recall and MRR for metadata filtering (Week 5), or precision and recall for tool selection here in Week 6. With these metrics in place, we could use synthetic data to rapidly test changes and measure their effect on the system.

As we deploy these systems to production, this data-driven approach becomes even more crucial. By collecting real user feedback through UI elements like thumbs up/down ratings and establishing clear evaluation metrics early, teams can quantify the impact of each change they make. This systematic strategy - defining metrics, using synthetic data for rapid testing, then validating with real user feedback - provides a reliable framework for improving RAG systems even as they grow more complex with multiple data sources, tools and retrieval methods.