# Homework 4: Recipe Bot Retrieval Evaluation

In this notebook, we will cover:

- Creating a retrieval evaluation dataset with adversarial examples
- Building evals to test our retrieval system against user queries and reformulated queries from a query re-write agent
- Making recipe bot truly agentic by putting it in a while loop and giving it tools to do retrieval.

Now when users talk to our recipe bot, they'll see traces like this (along with some scoring functions I chose to configure as "online scorers" in Braintrust):

<img src="./data/recipe_bot_with_retrieval_and_scoring.png" width="800"/>


## Imports


In [None]:
# Recipe Similarity Analysis using Embeddings
import json
import os
import sys

from datetime import datetime
from pathlib import Path

sys.path.append(os.path.abspath(".."))

from functools import partial
from typing import Any, Dict, List, Tuple

import braintrust as bt
import numpy as np

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

from backend.query_rewrite_agent import QueryRewriteAgent
from backend.retrieval import create_retriever, retrieve_bm25

print("Libraries imported successfully")


## Setup


In [None]:
BT_PROJECT_NAME = "recipe-bot"

os.environ["TOKENIZERS_PARALLELISM"] = "false"

## Part 1: Create Your Retrieval Evaluation Dataset


In [None]:
# 1. Load data/processed_recipes.json
def load_recipes(file_path: str) -> List[Dict[str, Any]]:
    """Load recipe data from JSON file."""
    with open(file_path, "r") as f:
        recipes = json.load(f)
    print(f"Loaded {len(recipes)} recipes")
    return recipes


In [None]:
recipes = load_recipes("data/processed_recipes.json")

# Display sample recipe
print(f"\nSample recipe:")
print(f"ID: {recipes[0]['id']}")
print(f"Name: {recipes[0]['name']}")
print(f"Description: {recipes[0]['description'][:100]}...")


In [None]:
# 2. Create document embeddings (used below on each recipe's "full_text")
def create_embeddings(texts: List[str], model_name: str = "all-MiniLM-L6-v2") -> np.ndarray:
    """Create embeddings for a list of texts using sentence transformers."""
    model = SentenceTransformer(model_name)
    embeddings = model.encode(texts, show_progress_bar=True)
    return embeddings


In [None]:
recipe_full_texts = [recipe["full_text"] for recipe in recipes]
full_text_embeddings = create_embeddings(recipe_full_texts)

print(f"Embeddings shape: {full_text_embeddings.shape}")
print(f"Embedding dimension: {full_text_embeddings.shape[1]}")


In [None]:
# 3. Create tuples of ("id", "name", "full_text", and the embedding of "full_text")
def create_recipe_tuples(recipes: List[Dict], embeddings: np.ndarray) -> List[Tuple[int, str, str, np.ndarray]]:
    """Create tuples of (id, name, full_text, embedding) for each recipe."""
    recipe_tuples = []

    for i, recipe in enumerate(recipes):
        recipe_tuple = (recipe["id"], recipe["name"], recipe["full_text"], embeddings[i])
        recipe_tuples.append(recipe_tuple)

    return recipe_tuples


In [None]:
recipe_tuples = create_recipe_tuples(recipes, full_text_embeddings)

print(f"Created {len(recipe_tuples)} recipe tuples")
print(f"Sample tuple structure:")
print(f"  ID: {recipe_tuples[0][0]}")
print(f"  Name: {recipe_tuples[0][1]}")
print(f"  Embedding shape: {recipe_tuples[0][3].shape}")


As discussed in Section 7.2 of the course reader, we can create more challenging queries by asking the LLM to generate inputs that include wording similar to content in multiple recipes. We'll do that in Braintrust by utilizing three of the most similar recipes to each of the 200 recipes in our final dataset.
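
The nearest-neighbour lookup described above can also be done in a single matrix operation. The following is a minimal sketch, not the notebook's implementation: `top_n_similar` and the toy array are hypothetical names introduced for illustration, and it assumes the embeddings fit in memory as one 2-D NumPy array.

```python
import numpy as np


def top_n_similar(embeddings: np.ndarray, n: int = 3) -> np.ndarray:
    """Return, for each row, the indices of its n most cosine-similar other rows."""
    # L2-normalise the rows so that a dot product equals cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    sims = unit @ unit.T
    # Exclude each row from its own neighbour list.
    np.fill_diagonal(sims, -np.inf)
    # Sort each row by descending similarity and keep the first n columns.
    return np.argsort(-sims, axis=1, kind="stable")[:, :n]


# Toy data (hypothetical): rows 0 and 1 point the same way, row 2 is orthogonal.
toy = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 1.0]])
neighbours = top_n_similar(toy, n=1)
print(neighbours)
```

For a couple of hundred recipes either approach is fast; the matrix form mainly avoids recomputing each pair twice.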


In [None]:
# 4. Function to get the n most similar embeddings to each "id"
def get_most_similar_recipes(target_id: int, recipe_tuples: List[Tuple[int, str, str, np.ndarray]], n: int = 3) -> Dict[str, Any]:
    """
    Get the n most similar recipes to a target recipe ID.

    Args:
        target_id: ID of the target recipe
        recipe_tuples: List of (id, name, full_text, embedding) tuples
        n: Number of most similar recipes to return (default: 3)

    Returns:
        Dictionary with keys "id", "name", "full_text", "most_similar"
        where "most_similar" is a list of dicts with "id", "name", "full_text", "score"
    """
    # Find the target recipe
    target_recipe = None
    target_embedding = None

    for recipe_id, name, full_text, embedding in recipe_tuples:
        if recipe_id == target_id:
            target_recipe = (recipe_id, name, full_text)
            target_embedding = embedding
            break

    if target_recipe is None or target_embedding is None:
        raise ValueError(f"Recipe with ID {target_id} not found")

    # Calculate similarities to all other recipes
    similarities = []
    target_embedding_2d = target_embedding.reshape(1, -1)

    for recipe_id, name, full_text, embedding in recipe_tuples:
        if recipe_id != target_id:  # Don't include the target recipe itself
            embedding_2d = embedding.reshape(1, -1)
            similarity_score = cosine_similarity(target_embedding_2d, embedding_2d)[0][0]
            similarities.append({"id": recipe_id, "name": name, "full_text": full_text, "score": float(similarity_score)})

    # Sort by similarity score (descending) and take top n
    similarities.sort(key=lambda x: x["score"], reverse=True)
    most_similar = similarities[:n]

    return {"id": target_recipe[0], "name": target_recipe[1], "full_text": target_recipe[2], "most_similar": most_similar}


In [None]:
sample_recipe_id = recipes[0]["id"]  # Use the first recipe as an example
print(f"Testing similarity for recipe ID: {sample_recipe_id}")

result = get_most_similar_recipes(sample_recipe_id, recipe_tuples, n=5)

print(f"\nTarget Recipe:")
print(f"  ID: {result['id']}")
print(f"  Name: {result['name']}")
print(f"  Full Text: {result['full_text']}")

print(f"\nTop 5 Most Similar Recipes:")
for i, similar in enumerate(result["most_similar"], 1):
    print(f"  {i}. ID: {similar['id']:6} | Score: {similar['score']:.4f} | Name: {similar['name']}")


In [None]:
sample_recipe_id_2 = recipes[10]["id"]  # Use the 11th recipe
print(f"Testing similarity for recipe ID: {sample_recipe_id_2}")

result_2 = get_most_similar_recipes(sample_recipe_id_2, recipe_tuples, n=3)

print(f"\nTarget Recipe:")
print(f"  ID: {result_2['id']}")
print(f"  Name: {result_2['name']}")
print(f"  Full Text: {result_2['full_text']}")

print(f"\nTop 3 Most Similar Recipes:")
for i, similar in enumerate(result_2["most_similar"], 1):
    print(f"  {i}. ID: {similar['id']:6} | Score: {similar['score']:.4f} | Name: {similar['name']}")


In [None]:
recipe_data = []
for recipe in recipes:
    res = get_most_similar_recipes(recipe["id"], recipe_tuples, n=3)
    recipe_data.append({**res, **recipe})

In [None]:
print(len(recipe_data))
recipe_data[0]

In [None]:
with open("data/processed_recipes_with_similarities.json", "w") as f:
    json.dump(recipe_data, f)

## Part 2: Evaluate the BM25 retriever

As we've stated before, evals require three things: **data** to be run through a function that performs a given **task**, and one or more **scorers** to evaluate how well that task accomplished its objective(s).
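
To make the data / task / scorer split concrete, here is a framework-free sketch of what an eval loop does. It is an illustration only and not how Braintrust is implemented: `run_eval`, `exact_match`, and the toy examples are hypothetical names, and the sketch assumes each scorer returns a single float.

```python
from typing import Any, Callable, Dict, List


def run_eval(
    data: List[Dict[str, Any]],
    task: Callable[[Any], Any],
    scorers: List[Callable[[Any, Any], float]],
) -> Dict[str, float]:
    """Run the task on every example and average each scorer's result."""
    if not data:
        raise ValueError("data must contain at least one example")
    totals = {scorer.__name__: 0.0 for scorer in scorers}
    for example in data:
        output = task(example["input"])
        for scorer in scorers:
            totals[scorer.__name__] += scorer(output, example["expected"])
    return {name: total / len(data) for name, total in totals.items()}


def exact_match(output: Any, expected: Any) -> float:
    """Toy scorer: 1.0 when the output equals the expected value."""
    return 1.0 if output == expected else 0.0


# Toy data (hypothetical): the task upper-cases its input; one example is wrong on purpose.
toy_data = [
    {"input": "a", "expected": "A"},
    {"input": "b", "expected": "B"},
    {"input": "c", "expected": "x"},
    {"input": "d", "expected": "D"},
]
summary = run_eval(toy_data, task=str.upper, scorers=[exact_match])
print(summary)
```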


In [49]:
retriever = create_retriever(Path("./data/processed_recipes.json"), Path("./data/bm25_index.pkl"))


Loading recipes from data/processed_recipes.json
Loaded 200 recipes
Loading BM25 index from data/bm25_index.pkl
BM25 index loaded successfully
Using existing BM25 index


### Data

We use `init_dataset` to pull down our "golden dataset" of ground truth "user query" -> "target recipe" examples to evaluate how well each query is used by our retriever to get that recipe.


In [39]:
all_queries_ds = bt.init_dataset(project=BT_PROJECT_NAME, name="test_recipe_retrieval_data")
rows = list(all_queries_ds)

print(len(rows))
rows[0]

199


{'_pagination_key': 'p07537850107917041763',
 '_xact_id': '1000195610717742217',
 'created': '2025-08-12T23:48:11.204Z',
 'dataset_id': 'ae17cedd-7e40-433a-b669-b02c2d7ddf01',
 'expected': {'fact': 'This recipe makes one large, round loaf of bread.',
  'query': 'bulk prep for a single large round bread loaf for the week'},
 'id': 'a2eb327b-9626-41c7-a9d1-c2bd5ba80a9a',
 'input': {'description': "recipe paraphrased from a recipe in cooks illustrated 1/2008 this recipe comes out best made in an enameled cast-iron dutch oven with a lid that fits tightly. it can also be made in a regular cast-iron dutch oven or a heavy stockpot. use a mild flavored beer like budweiser or a mild flavor non-alcoholic beer. the bread is best the day it's baked. it can be wrapped in foil and stored in a cool dry place for 2-days. this recipe makes one large, round loaf of bread",
  'full_text': "almost no knead bread recipe paraphrased from a recipe in cooks illustrated 1/2008 this recipe comes out best made i

### Task

Here we define two task functions to use in separate evals. The first uses the user query as is, while the second uses a query re-write agent to re-formulate the user query as something it believes more likely to return the right recipes.


In [40]:
def fetch_recipes(query: str, *_, n_results: int = 5) -> List[Dict[str, Any]]:
    """Fetch recipes from the database."""
    recipes = retriever.retrieve_bm25(query, n_results)
    return recipes

In [41]:
test_example = rows[20]
print(test_example["expected"]["query"])
print(test_example["metadata"]["id"])

found_recipes = fetch_recipes(test_example["expected"]["query"], n_results=5)
[recipe["id"] for recipe in found_recipes]

brie cherry tartlets with chocolate walnut topping for party appetizers
518146


[518146, 417845, 129581, 307279, 53922]

In [42]:
qwra = QueryRewriteAgent()


def fetch_recipes_with_query_rewrite(query: str, *_, n_results: int = 5) -> List[Dict[str, Any]]:
    """Re-write the query with the query re-write agent, then fetch recipes with BM25."""
    res_d = qwra.process_query(query)
    recipes = retriever.retrieve_bm25(res_d["processed_query"], n_results)
    return recipes


In [43]:
test_example = rows[20]
print(test_example["expected"]["query"])
print(test_example["metadata"]["id"])

found_recipes = fetch_recipes_with_query_rewrite(test_example["expected"]["query"], n_results=5)
[recipe["id"] for recipe in found_recipes]

brie cherry tartlets with chocolate walnut topping for party appetizers
518146


[518146, 417845, 314774, 307279, 242]

### Scorers

There are several ways to [configure scorers for your Braintrust evals](https://www.braintrust.dev/docs/guides/experiments/write#scorers). Here we demonstrate how to return several related scores/metrics in one function.
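
As a reference for the metrics used in this section, the sketch below computes them for a single query with one known target document. The helper names and the toy IDs are hypothetical. With exactly one relevant document per query, Recall@k reduces to a hit-or-miss check on the top k results, and MRR is the mean over queries of the reciprocal rank shown here.

```python
from typing import List


def recall_at_k(retrieved_ids: List[int], target_id: int, k: int) -> float:
    """1.0 if the target appears in the top k results, else 0.0."""
    return 1.0 if target_id in retrieved_ids[:k] else 0.0


def reciprocal_rank(retrieved_ids: List[int], target_id: int) -> float:
    """1 / rank of the target (1-based), or 0.0 if it was not retrieved."""
    if target_id not in retrieved_ids:
        return 0.0
    return 1.0 / (retrieved_ids.index(target_id) + 1)


# Toy ranking (hypothetical IDs): the target is returned in second place.
ranked = [10, 20, 30]
target = 20
r1 = recall_at_k(ranked, target, k=1)
r3 = recall_at_k(ranked, target, k=3)
rr = reciprocal_rank(ranked, target)
print(r1, r3, rr)
```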


In [44]:
def calc_retrieval_metrics(output: list[dict], metadata: dict) -> list[bt.Score]:
    """Braintrust scorer returning Recall@1, Recall@3, Recall@5, and MRR for the target recipe."""
    retrieved_ids = [recipe["id"] for recipe in output]
    recall_at_1_score = 1.0 if len(retrieved_ids) > 0 and retrieved_ids[0] == metadata["id"] else 0.0
    recall_at_3_score = 1.0 if len(retrieved_ids) > 0 and metadata["id"] in retrieved_ids[:3] else 0.0
    recall_at_5_score = 1.0 if len(retrieved_ids) > 0 and metadata["id"] in retrieved_ids[:5] else 0.0
    mrr_score = 1.0 / (retrieved_ids.index(metadata["id"]) + 1) if metadata["id"] in retrieved_ids else 0.0

    return [
        bt.Score(name="Recall@1", score=recall_at_1_score),
        bt.Score(name="Recall@3", score=recall_at_3_score),
        bt.Score(name="Recall@5", score=recall_at_5_score),
        bt.Score(name="MRR", score=mrr_score),
    ]

In [46]:
calc_retrieval_metrics(found_recipes, test_example["metadata"])  # type: ignore

[Score(name='Recall@1', score=1.0, metadata={}, error=None),
 Score(name='Recall@3', score=1.0, metadata={}, error=None),
 Score(name='Recall@5', score=1.0, metadata={}, error=None),
 Score(name='MRR', score=1.0, metadata={}, error=None)]

### Eval

With our data, task, and scorers in place, we can now run two evals for our retriever: one using the incoming queries as is, and one using the query re-write agent to re-formulate them.
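
Both task functions declare `n_results` as a keyword-only argument after `*_`, and the evals bind it with `functools.partial`. The sketch below shows why that signature is convenient, under the assumption that an eval framework may call the task with an extra positional argument in addition to the input. `toy_task` is a hypothetical stand-in, not part of the recipe bot backend.

```python
from functools import partial
from typing import List


def toy_task(query: str, *_, n_results: int = 5) -> List[str]:
    """Stand-in for a retrieval task; returns n_results placeholder hits."""
    return [f"{query}-{i}" for i in range(n_results)]


# Bind n_results up front so the caller only needs to supply the query.
task = partial(toy_task, n_results=2)

hits_one_arg = task("soup")
# Any extra positional argument is swallowed by *_ and ignored.
hits_two_args = task("soup", object())
print(hits_one_arg)
```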


In [47]:
timestamp = datetime.now().strftime("%Y%m%d_%H%M")
exp_metadata = {"query_rewrite": False, "n_results": 5}

await bt.EvalAsync(
    name="recipe-bot",
    experiment_name=f"recipe_retrieval_metrics_{timestamp}",
    data=[{"input": example["expected"]["query"], "metadata": example["metadata"]} for example in rows],  # type: ignore
    task=partial(fetch_recipes, n_results=5),
    metadata=exp_metadata,
    scores=[calc_retrieval_metrics],  # type: ignore
)

Experiment recipe_retrieval_metrics_20250812_1701 is running at https://www.braintrust.dev/app/aie-course-2025/p/recipe-bot/experiments/recipe_retrieval_metrics_20250812_1701
recipe-bot [experiment_name=recipe_retrieval_metrics_20250812_1701] (data): 199it [00:00, 86431.24it/s]
recipe-bot [experiment_name=recipe_retrieval_metrics_20250812_1701] (tasks): 100%|██████████| 199/199 [00:06<00:00, 32.02it/s]



recipe_retrieval_metrics_20250812_1701 compared to add_queries_it_20250728_2025:
92.23% 'MRR'      score
87.94% 'Recall@1' score
96.48% 'Recall@3' score
97.99% 'Recall@5' score

1755043274.20s start
1755043278.55s end
1.85s duration
0tok prompt_tokens
0tok completion_tokens
0tok total_tokens
0tok prompt_cached_tokens
0tok prompt_cache_creation_tokens

See results for recipe_retrieval_metrics_20250812_1701 at https://www.braintrust.dev/app/aie-course-2025/p/recipe-bot/experiments/recipe_retrieval_metrics_20250812_1701


EvalResultWithSummary(summary="...", results=[...])

In [48]:
timestamp = datetime.now().strftime("%Y%m%d_%H%M")
exp_metadata = {"query_rewrite": True, "n_results": 5}

await bt.EvalAsync(
    name="recipe-bot",
    experiment_name=f"recipe_retrieval_metrics_{timestamp}",
    data=[{"input": example["expected"]["query"], "metadata": example["metadata"]} for example in rows],  # type: ignore
    task=partial(fetch_recipes_with_query_rewrite, n_results=5),
    metadata=exp_metadata,
    scores=[calc_retrieval_metrics],  # type: ignore
)

Experiment recipe_retrieval_metrics_20250812_1701-29989171 is running at https://www.braintrust.dev/app/aie-course-2025/p/recipe-bot/experiments/recipe_retrieval_metrics_20250812_1701-29989171
recipe-bot [experiment_name=recipe_retrieval_metrics_20250812_1701] (data): 199it [00:00, 45197.73it/s]
recipe-bot [experiment_name=recipe_retrieval_metrics_20250812_1701] (tasks): 100%|██████████| 199/199 [00:17<00:00, 11.43it/s]



recipe_retrieval_metrics_20250812_1701-29989171 compared to recipe_retrieval_metrics_20250812_1701:
85.37% (-06.86%) 'MRR'      score	(9 improvements, 32 regressions)
79.40% (-08.54%) 'Recall@1' score	(7 improvements, 24 regressions)
90.45% (-06.03%) 'Recall@3' score	(4 improvements, 16 regressions)
94.97% (-03.02%) 'Recall@5' score	(2 improvements, 8 regressions)

1755043304.64s start
1755043318.32s end
6.64s (+478.46%) 'duration'                    	(0 improvements, 199 regressions)
0.78s llm_duration
139.47tok (+13947.24%) 'prompt_tokens'               	(0 improvements, 199 regressions)
27.37tok (+2736.68%) 'completion_tokens'           	(0 improvements, 199 regressions)
166.84tok (+16683.92%) 'total_tokens'                	(0 improvements, 199 regressions)
0.00$ estimated_cost
0tok (-) 'prompt_cached_tokens'        	(0 improvements, 0 regressions)
0tok (-) 'prompt_cache_creation_tokens'	(0 improvements, 0 regressions)

See results for recipe_retrieval_metrics_20250812_1701-2998917

EvalResultWithSummary(summary="...", results=[...])

We can see that, surprisingly, the query re-write agent performs worse than using the user query as is. There are many possible reasons for this in our particular implementation of recipe bot, and with some initial experiments in place, we can begin exploring why and iterating on the application.
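
One way to start that exploration is to list the queries where the re-written query ranks the target recipe lower than the original query did. The following is a hedged sketch with hypothetical names (`rank_of`, `find_regressions`) and toy lookups standing in for the two retrieval functions. It assumes each retriever is a callable that returns a ranked list of IDs.

```python
from typing import Any, Callable, Dict, List


def rank_of(target_id: int, retrieved_ids: List[int]) -> int:
    """1-based rank of the target, or 0 if it was not retrieved."""
    return retrieved_ids.index(target_id) + 1 if target_id in retrieved_ids else 0


def find_regressions(
    examples: List[Dict[str, Any]],
    baseline: Callable[[str], List[int]],
    candidate: Callable[[str], List[int]],
) -> List[Dict[str, Any]]:
    """Return the examples where the candidate's reciprocal rank is lower than the baseline's."""
    regressions = []
    for ex in examples:
        base_rank = rank_of(ex["target_id"], baseline(ex["query"]))
        cand_rank = rank_of(ex["target_id"], candidate(ex["query"]))
        base_rr = 1.0 / base_rank if base_rank else 0.0
        cand_rr = 1.0 / cand_rank if cand_rank else 0.0
        if cand_rr < base_rr:
            regressions.append(
                {"query": ex["query"], "baseline_rank": base_rank, "candidate_rank": cand_rank}
            )
    return regressions


# Toy retrievers (hypothetical): dictionary lookups in place of real retrieval calls.
examples = [{"query": "q1", "target_id": 1}, {"query": "q2", "target_id": 2}]
baseline_results = {"q1": [1, 9], "q2": [2, 9]}
candidate_results = {"q1": [1, 9], "q2": [9, 2]}

regressions = find_regressions(
    examples, baseline_results.__getitem__, candidate_results.__getitem__
)
print(regressions)
```

Reading the original and re-written query side by side for each regression is usually the quickest way to spot a pattern, such as the re-write dropping a distinctive ingredient term.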

<img src="./data/retrieval_evals.png" width="800"/>
