# Homework 4: Recipe Bot Retrieval Evaluation

This notebook walks through building and evaluating a BM25 retrieval system for recipes.

**What you'll learn:**
- How to build a BM25 retrieval engine
- How to evaluate retrieval with Recall@k and MRR
- Why some queries work better than others for keyword search

Video walkthrough: https://youtu.be/GMShL5iC8aY

**Bonus**: [Using AI Assisted Coding to Tackle Homework Problems](https://link.courses.maven.com/c/eJw80M2upCAQBeCngZ0Gil8XLGbja5gCymkTbAyoyX37id2Tu6rUl7OoOqm-ajuXLQeQk554qlfr9OxST8rxHHQWMhpOQTrrrFIgNacdt7Kkgr2H2CrmhP38r-fPQYHerZZCmdP7Xr5-XVsOR6t5hKSzIXKDB2MHnQwNHiQMWhIJQ-C9Q_4K3mmbY1zRC_LZw-Scw9XHKCe_Kot8CyDACimMdFIpNRpjwGaX_JogSeuZFt9_-rjjTe8x1Z1vfVlb3ZePhBlLJ17C6zyPztQfBjOD-TfNYD6wFXwnGgrGzmCmG8szQYAZFIO5_5SC8Xpsr_kq9El5J4ziLWwdMY1rwfPFtPj7VPE54w7wLwAA__8a93gB)

![AI Assisted Coding Walkthrough Location](../imgs/AIHwWalkthrough.png)

In [1]:
import json
from pathlib import Path
from rank_bm25 import BM25Okapi
from typing import List, Dict, Tuple

## 1. Look at Your Data First

Before writing any code, **look at your data**.

> 💡 **What is retrieval?** When a user asks "quick pasta with tomatoes," the system needs to find the right recipe from thousands of options. Retrieval is the "search" step that narrows down candidates before the LLM generates a response.

We have two files in `reference_files/`:
- `processed_recipes.json` - 200 recipes with ingredients, steps, and tags
- `synthetic_queries.jsonl` - 200 queries, each linked to a source recipe

### Use the HTML Viewer

Open `reference_files/query_viewer.html` in your browser and upload the JSONL file. This lets you:
- Navigate between queries with arrow keys
- See what each query looks like and its source recipe
- Understand the evaluation task before coding

**Pro tip**: You can vibe-code your own viewer. Try this prompt:

> "Make a self-contained HTML file to view JSONL files. It should let me upload a file, navigate between records, and display all fields nicely."

This is a useful skill for quickly exploring any dataset.

In [2]:
BASE_PATH = Path('reference_files')

# Load recipes
with open(BASE_PATH / 'processed_recipes.json') as f:
    recipes = json.load(f)

print(f"Loaded {len(recipes)} recipes")
print(f"Example recipe keys: {list(recipes[0].keys())}")

Loaded 200 recipes
Example recipe keys: ['id', 'name', 'description', 'minutes', 'ingredients', 'n_ingredients', 'steps', 'n_steps', 'tags', 'nutrition', 'submitted', 'contributor_id', 'full_text']


In [3]:
# Look at one recipe
recipe = recipes[0]
print(f"Name: {recipe['name']}")
print(f"Cooking time: {recipe['minutes']} minutes")
print(f"Ingredients: {recipe['ingredients'][:5]}...")  # First 5
print(f"Steps: {len(recipe['steps'])} steps")

Name: 5 cheese crab lasagna with roasted garlic and vegetables
Cooking time: 245 minutes
Ingredients: ['garlic', 'extra virgin olive oil', 'dry white wine', 'fresh asparagus', 'cooking spray']...
Steps: 108 steps


In [4]:
# Load queries (JSONL format - one JSON object per line)
queries = []
with open(BASE_PATH / 'synthetic_queries.jsonl') as f:
    for line in f:
        if line.strip():
            queries.append(json.loads(line))

print(f"Loaded {len(queries)} queries")
print(f"Example query keys: {list(queries[0].keys())}")

Loaded 200 queries
Example query keys: ['query', 'salient_fact', 'source_recipe_id', 'source_recipe_name', 'source_recipe_url', 'ingredients', 'cooking_time', 'tags']


In [5]:
# Look at one query
q = queries[0]
print(f"Query: {q['query']}")
print(f"\nSource recipe: {q['source_recipe_name']}")
print(f"Source recipe ID: {q['source_recipe_id']}")
print(f"\nSalient fact (what makes this query answerable):\n{q['salient_fact'][:300]}...")

Query: What temperature should I set my oven to and how long do I need to bake this sweet, yeast-based bread for it to turn out fluffy and perfectly cooked?

Source recipe: amish friendship bread
Source recipe ID: 246125

Salient fact (what makes this query answerable):
1. **Appliance Settings**: The recipe specifies to "preheat oven to 325°F," which is a precise temperature setting necessary for baking the Amish friendship bread.

2. **Timing Specifics**: The recipe indicates a baking time of "1 hour," which is crucial for ensuring the bread is cooked properly and...


## 2. Build BM25 Retriever

> 💡 **BM25 in plain English**: It's a word-matching algorithm. If your query contains "chicken" and "lemon," it finds recipes that mention those words and ranks highest the ones where those terms appear frequently but are rare across the rest of the collection.

BM25 is a keyword-based ranking function. It scores documents based on:
- Term frequency (how often query terms appear in doc)
- Inverse document frequency (rarer terms matter more)
- Document length normalization
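
To make the three ingredients above concrete, here is a minimal from-scratch sketch of a textbook BM25 score. It is an illustration only: the function name `bm25_scores`, the toy documents, and the `k1`/`b` values are assumptions, and the IDF variant used here may differ in detail from the one inside `rank_bm25`.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score every document against the query with a textbook BM25 formula."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(d) for d in tokenized_docs) / n_docs

    # Document frequency: number of documents that contain each term
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))

    scores = []
    for doc in tokenized_docs:
        term_counts = Counter(doc)
        # Length normalization: longer documents need more matches to score the same
        norm = 1 - b + b * len(doc) / avg_len
        score = 0.0
        for term in query_tokens:
            tf = term_counts[term]
            if tf == 0:
                continue
            # Inverse document frequency: rarer terms get a larger weight
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency with saturation controlled by k1
            score += idf * tf * (k1 + 1) / (tf + k1 * norm)
        scores.append(score)
    return scores

# Made-up documents for illustration, not taken from the dataset
toy_docs = [
    "lemon chicken with garlic".split(),
    "chocolate cake with butter".split(),
    "chicken soup with noodles and chicken stock".split(),
]
toy_scores = bm25_scores("lemon chicken".split(), toy_docs)
```

In this toy corpus the first document matches both query terms and scores highest, the third matches only "chicken" and scores lower, and the second matches nothing and scores zero.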

We need to:
1. Create a text representation of each recipe
2. Tokenize the text (split into words)
3. Build the BM25 index

In [6]:
def recipe_to_text(recipe: Dict) -> str:
    """Combine recipe fields into searchable text."""
    parts = [
        recipe['name'],
        ' '.join(recipe.get('ingredients', [])),
        ' '.join(recipe.get('steps', [])),
        ' '.join(recipe.get('tags', []))
    ]
    return ' '.join(parts).lower()

# Create corpus
corpus_texts = [recipe_to_text(r) for r in recipes]
print(f"Example text (first 300 chars):\n{corpus_texts[0][:300]}...")

Example text (first 300 chars):
5 cheese crab lasagna with roasted garlic and vegetables garlic extra virgin olive oil dry white wine fresh asparagus cooking spray garlic salt salt & freshly ground black pepper red bell peppers fresh basil dry lasagna noodles roma tomatoes dried oregano parmesan-romano cheese mix butter sweet onio...


In [7]:
# Simple tokenization (split on whitespace; punctuation stays attached to words)
tokenized_corpus = [text.split() for text in corpus_texts]

# Build BM25 index
bm25 = BM25Okapi(tokenized_corpus)
print(f"BM25 index built with {len(tokenized_corpus)} documents")

BM25 index built with 200 documents
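
Whitespace splitting leaves punctuation attached to words, so a query token like `chicken?` will not match `chicken` in a recipe. A possible alternative is sketched below; `tokenize` is a hypothetical helper, not part of the notebook, and whether it improves the metrics on this dataset has not been measured here.

```python
import re

def tokenize(text: str) -> list:
    """Lowercase the text and keep only runs of letters and digits (hypothetical helper)."""
    return re.findall(r"[a-z0-9]+", text.lower())

sample = "What temp for crispy chicken?"
whitespace_tokens = sample.lower().split()  # last token keeps its question mark
regex_tokens = tokenize(sample)             # last token is the bare word
```

If you try this, apply the same tokenizer to both the corpus and the query; mixing tokenizers would reintroduce the mismatch.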


In [8]:
def retrieve(query: str, top_k: int = 5) -> List[Tuple[int, float, str]]:
    """Retrieve top-k recipes for a query.
    
    Returns: List of (recipe_id, score, recipe_name)
    """
    tokenized_query = query.lower().split()
    scores = bm25.get_scores(tokenized_query)
    
    # Get top-k indices
    top_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    
    results = []
    for idx in top_indices:
        results.append((recipes[idx]['id'], scores[idx], recipes[idx]['name']))
    return results

In [9]:
# Test it
test_query = "air fryer chicken crispy"
results = retrieve(test_query, top_k=5)

print(f"Query: {test_query}\n")
print("Top 5 results:")
for i, (recipe_id, score, name) in enumerate(results, 1):
    print(f"  {i}. {name} (score: {score:.2f})")

Query: air fryer chicken crispy

Top 5 results:
  1. 7 layer elote dip (score: 6.75)
  2. amazingly juicy grilled lemon chicken (score: 6.19)
  3. algerian chicken preserved lemon bourek (score: 5.78)
  4. a grape picker s lunch sausages and lentils with thyme and wine (score: 5.57)
  5. alton s french onion soup attacked by sandi (score: 5.54)


## 3. Evaluate Retrieval

For each query, we know the "ground truth" recipe it came from. We measure:

- **Recall@k**: Is the correct recipe in the top k results?
- **MRR (Mean Reciprocal Rank)**: Average of 1/rank of the correct recipe across queries, counting 0 when it is not in the top k

> 💡 **What do these metrics tell us?** Recall@5 of 65% means "65% of the time, the right recipe is somewhere in the top 5." MRR accounts for *where* it ranks: finding it at #1 is better than #5.
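
As a small worked example of both metrics, the sketch below scores three made-up queries by hand. The helper names and the toy IDs are assumptions for illustration and are separate from the notebook's own evaluation function.

```python
def recall_at_k(ranked_ids, target_id, k):
    """Return 1.0 if the target appears in the first k results, else 0.0."""
    return 1.0 if target_id in ranked_ids[:k] else 0.0

def reciprocal_rank(ranked_ids, target_id):
    """Return 1/rank of the target (1-based), or 0.0 if it was not retrieved."""
    if target_id in ranked_ids:
        return 1.0 / (ranked_ids.index(target_id) + 1)
    return 0.0

# Three made-up queries: (ranked result IDs, ID of the correct recipe)
toy_runs = [
    (["a", "b", "c"], "a"),  # found at rank 1 -> reciprocal rank 1
    (["d", "e", "f"], "f"),  # found at rank 3 -> reciprocal rank 1/3
    (["g", "h", "i"], "z"),  # not retrieved   -> reciprocal rank 0
]

n = len(toy_runs)
recall_1 = sum(recall_at_k(r, t, 1) for r, t in toy_runs) / n  # 1 of 3 queries
recall_3 = sum(recall_at_k(r, t, 3) for r, t in toy_runs) / n  # 2 of 3 queries
mrr = sum(reciprocal_rank(r, t) for r, t in toy_runs) / n      # (1 + 1/3 + 0) / 3
```

Recall@3 treats the rank-1 and rank-3 hits the same, while MRR rewards the rank-1 hit three times as much, which is why the two metrics are reported together.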

In [10]:
def evaluate_retrieval(queries: List[Dict], top_k: int = 5) -> Dict:
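    """Evaluate retrieval performance.

    Recall@1/3/5 are read from the top_k results, so keep top_k >= 5;
    MRR only credits targets found within the top_k.
    """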
    recall_at_1 = 0
    recall_at_3 = 0
    recall_at_5 = 0
    reciprocal_ranks = []
    
    for q in queries:
        query_text = q['query']
        target_id = q['source_recipe_id']
        
        results = retrieve(query_text, top_k=top_k)
        retrieved_ids = [r[0] for r in results]
        
        # Check recall at different k
        if target_id in retrieved_ids[:1]:
            recall_at_1 += 1
        if target_id in retrieved_ids[:3]:
            recall_at_3 += 1
        if target_id in retrieved_ids[:5]:
            recall_at_5 += 1
        
        # Calculate reciprocal rank
        if target_id in retrieved_ids:
            rank = retrieved_ids.index(target_id) + 1
            reciprocal_ranks.append(1.0 / rank)
        else:
            reciprocal_ranks.append(0.0)
    
    n = len(queries)
    return {
        'recall_at_1': recall_at_1 / n,
        'recall_at_3': recall_at_3 / n,
        'recall_at_5': recall_at_5 / n,
        'mrr': sum(reciprocal_ranks) / n,
        'total_queries': n
    }

In [11]:
# Run evaluation
results = evaluate_retrieval(queries)

print("Retrieval Performance")
print("=" * 30)
print(f"Recall@1: {results['recall_at_1']:.1%}")
print(f"Recall@3: {results['recall_at_3']:.1%}")
print(f"Recall@5: {results['recall_at_5']:.1%}")
print(f"MRR:      {results['mrr']:.3f}")
print(f"\nEvaluated on {results['total_queries']} queries")

Retrieval Performance
Recall@1: 40.0%
Recall@3: 55.0%
Recall@5: 65.0%
MRR:      0.488

Evaluated on 200 queries


## 4. Analyze Results

Numbers alone don't tell the full story. Let's look at specific successes and failures to understand *why* retrieval works or breaks.

> 💡 **This is where you build intuition.** Looking at failures reveals specific patterns that give you objective, measurable things to improve, rather than vague feelings that are hard to act on.

In [12]:
# Find examples where retrieval worked and failed
successes = []
failures = []

for q in queries:
    results = retrieve(q['query'], top_k=5)
    retrieved_ids = [r[0] for r in results]
    
    if q['source_recipe_id'] in retrieved_ids:
        rank = retrieved_ids.index(q['source_recipe_id']) + 1
        successes.append({'query': q, 'rank': rank, 'results': results})
    else:
        failures.append({'query': q, 'results': results})

print(f"Successes: {len(successes)}")
print(f"Failures: {len(failures)}")

Successes: 130
Failures: 70


In [13]:
# Look at a success
if successes:
    s = successes[0]
    print("SUCCESS EXAMPLE")
    print("=" * 50)
    print(f"Query: {s['query']['query']}")
    print(f"\nTarget: {s['query']['source_recipe_name']}")
    print(f"Found at rank: {s['rank']}")
    print(f"\nTop 5 results:")
    for i, (rid, score, name) in enumerate(s['results'], 1):
        marker = "✓" if rid == s['query']['source_recipe_id'] else " "
        print(f"  {marker} {i}. {name}")

SUCCESS EXAMPLE
Query: What temperature should I set my oven to and how long do I need to bake this sweet, yeast-based bread for it to turn out fluffy and perfectly cooked?

Target: amish friendship bread
Found at rank: 2

Top 5 results:
    1. 100 whole wheat bread non dense heavy white bread texture
  ✓ 2. amish friendship bread
    3. amazing stuffing from scratch breadmaker recommended
    4. 10 calorie chocolate miracle noodle cookies
    5. 100 whole grain wheat bread


In [14]:
# Look at a failure
if failures:
    f = failures[0]
    print("FAILURE EXAMPLE")
    print("=" * 50)
    print(f"Query: {f['query']['query']}")
    print(f"\nTarget: {f['query']['source_recipe_name']}")
    print(f"\nTop 5 results (none are correct):")
    for i, (rid, score, name) in enumerate(f['results'], 1):
        print(f"  {i}. {name}")
    print(f"\nWhy it failed: The query terms may not match the recipe text well.")

FAILURE EXAMPLE
Query: What's the best way to poach salmon so that it's perfectly cooked and not dry? I heard the timing and temperature are really important, but I'm not sure how long to simmer it or how hot to get the water first.

Target: 4th of july salmon with egg sauce

Top 5 results (none are correct):
  1. 3 step fall off the bone ribs easy
  2. amish friendship bread
  3. amish cinnamon bread friendship bread
  4. 2 hour turkey really
  5. the gumbo pages traditional red beans and rice

Why it failed: The query terms may not match the recipe text well.
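
One way to turn "the query terms may not match" into something checkable is to list which query tokens actually occur in the target recipe's text. The sketch below assumes the same lowercase, whitespace-split tokenization used by the retriever; `term_overlap` is a hypothetical helper and the example strings are made up, not taken from the dataset.

```python
def term_overlap(query: str, doc_text: str) -> dict:
    """Split query tokens into those present in and absent from a document."""
    doc_tokens = set(doc_text.lower().split())
    query_tokens = query.lower().split()
    matched = [t for t in query_tokens if t in doc_tokens]
    missing = [t for t in query_tokens if t not in doc_tokens]
    coverage = len(matched) / len(query_tokens) if query_tokens else 0.0
    return {"matched": matched, "missing": missing, "coverage": coverage}

# Made-up text for illustration only
report = term_overlap(
    "how long to poach salmon?",
    "salmon fillets simmer gently in water for 10 minutes",
)
```

Here nothing matches: "poach" never appears in the document (it says "simmer"), and "salmon?" fails only because of the trailing question mark. In the notebook you could pass a failed query together with the `recipe_to_text` output of its source recipe to see which of the two problems dominates.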


## 5. Optional: Query Rewriting

One way to improve retrieval is to rewrite queries to better match recipe text.

For example:
- "What temp for crispy chicken?" → "chicken crispy temperature degrees oven"

This can be done with an LLM. Here's the concept (not run by default):

In [15]:
# Conceptual example - would require LLM API
REWRITE_PROMPT = """
You are optimizing a cooking query for recipe search. 
Rewrite the query to include terms that appear in recipe text.

Guidelines:
- Use specific cooking terms
- Include equipment names
- Add ingredient names
- Remove question words (what, how, when)

Original: "{query}"
Optimized:
"""

print("Example rewrite prompt:")
print(REWRITE_PROMPT.format(query="What air fryer settings for frozen chicken?"))

Example rewrite prompt:

You are optimizing a cooking query for recipe search. 
Rewrite the query to include terms that appear in recipe text.

Guidelines:
- Use specific cooking terms
- Include equipment names
- Add ingredient names
- Remove question words (what, how, when)

Original: "What air fryer settings for frozen chicken?"
Optimized:
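
To plug a prompt like this into retrieval, one option is a thin wrapper that takes the LLM call as a parameter. This is a sketch under stated assumptions: `rewrite_query`, `REWRITE_TEMPLATE`, and `fake_complete` are hypothetical names, the template is a shortened stand-in for the prompt above, and no real LLM API is called.

```python
from typing import Callable

# Shortened stand-in for the rewrite prompt (assumption for this sketch)
REWRITE_TEMPLATE = (
    "Rewrite this cooking query as search keywords.\n"
    'Original: "{query}"\n'
    "Optimized:"
)

def rewrite_query(query: str, complete: Callable[[str], str]) -> str:
    """Rewrite a query via `complete`; fall back to the original if the output is empty."""
    rewritten = complete(REWRITE_TEMPLATE.format(query=query)).strip()
    return rewritten if rewritten else query

def fake_complete(prompt: str) -> str:
    """Stand-in for a real LLM call; always returns the same canned rewrite."""
    return "  air fryer frozen chicken temperature minutes  "

rewritten = rewrite_query("What air fryer settings for frozen chicken?", fake_complete)
fallback = rewrite_query("quick pasta", lambda prompt: "")
```

Passing the completion function in keeps the evaluation code independent of any particular provider, and the fallback means an empty response never produces an empty query. To measure the effect, you could run the rewritten query through `retrieve` and compare Recall@k and MRR against the unrewritten baseline.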



## Summary

**What we built:**
- BM25 retrieval engine for recipes
- Evaluation pipeline with Recall@k and MRR

**Key insights:**
- BM25 works well when query terms match recipe text
- Failures often happen when users use different words than recipes
- Query rewriting can help bridge this vocabulary gap

**Next steps:**
- Try different text representations (e.g., just ingredients)
- Implement query rewriting with an LLM
- Compare BM25 with embedding-based retrieval
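
As a starting point for the first next step, an ingredients-only representation could look like the sketch below. `ingredients_only_text` is a hypothetical helper and the toy recipe is made up; whether this representation helps or hurts on this dataset is something to measure rather than assume.

```python
def ingredients_only_text(recipe: dict) -> str:
    """Alternative representation: ingredients only, no name, steps, or tags."""
    return " ".join(recipe.get("ingredients", [])).lower()

# Made-up recipe for illustration only
toy_recipe = {
    "name": "Lemon Chicken",
    "ingredients": ["Chicken thighs", "lemon", "garlic"],
    "steps": ["preheat oven", "bake 40 minutes"],
}
text = ingredients_only_text(toy_recipe)
```

Swapping this in for `recipe_to_text`, rebuilding the index, and rerunning the evaluation gives a direct comparison between the two representations on the same queries.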