# 🧠 Part 2: Evaluation — Code Search Engine on CoSQA Dataset
In this notebook, we will evaluate our search engine on the [CoSQA dataset](https://huggingface.co/datasets/CoIR-Retrieval/cosqa) using the following ranking metrics:
- **Recall@10**
- **MRR@10 (Mean Reciprocal Rank)**
- **nDCG@10 (Normalized Discounted Cumulative Gain)**

## 🔧 Setup & Imports
We’ll import the necessary modules, including the dataset loader, the search engine, and helper libraries.

In [1]:
from datasets import load_dataset
import json
from search_engine import Search
from collections import defaultdict
from typing import List, Set
import numpy as np



## 📏 Define Evaluation Metrics
We implement the three required metrics below.
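
For reference, the per-query quantities computed by the functions below can be written as follows. The notation is introduced here for illustration and is not part of the dataset: $d_1, \dots, d_{10}$ are the returned document IDs in rank order and $R_q$ is the set of correct IDs for query $q$. Note that Recall@10, as implemented in this notebook, is a binary hit indicator rather than the fraction of relevant documents retrieved.

```latex
\begin{align}
\mathrm{Recall@10}(q) &= \mathbb{1}\left[\, \exists\, i \le 10 : d_i \in R_q \,\right] \\
\mathrm{RR@10}(q) &=
  \begin{cases}
    \dfrac{1}{\min\{\, i \le 10 : d_i \in R_q \,\}} & \text{if some } d_i \in R_q \\
    0 & \text{otherwise}
  \end{cases} \\
\mathrm{nDCG@10}(q) &=
  \frac{\sum_{i=1}^{10} \mathbb{1}[d_i \in R_q] \,/\, \log_2(i + 1)}
       {\sum_{i=1}^{\min(|R_q|,\,10)} 1 \,/\, \log_2(i + 1)}
\end{align}
```

The code uses a 0-based position `i`, so its `np.log2(i + 2)` corresponds to $\log_2(i + 1)$ with the 1-based rank used here. The reported scores are the means of these quantities over all evaluated queries.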

In [2]:
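def calculate_recall_at_10(found_ids: List[str], correct_ids: Set[str]) -> int:
    """
    Calculates Recall@10 as a binary hit indicator.
    Returns 1 if any of the correct IDs appear in the top 10 of found_ids, otherwise 0.
    """
    # Truncate defensively so the cutoff holds even if more than 10 results are passed in.
    return 1 if not set(found_ids[:10]).isdisjoint(correct_ids) else 0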

def calculate_mrr_at_10(found_ids: List[str], correct_ids: Set[str]) -> float:
    """
    Calculates the reciprocal rank @ 10 for one query (its mean over all queries is MRR@10).
    Finds the rank of the first correct document in the top 10 and returns 1/rank, or 0.0 if there is none.
    """
    for rank, found_id in enumerate(found_ids[:10], start=1):
        if found_id in correct_ids:
            return 1 / rank
    return 0.0

def calculate_ndcg_at_10(found_ids: List[str], correct_ids: Set[str]) -> float:
    """
    Calculates normalized Discounted Cumulative Gain @ 10 with binary relevance.
    This metric considers the positions of all relevant documents in the top 10.
    """
    # DCG: a relevant document at 0-based position i contributes 1 / log2(i + 2).
    dcg = 0.0
    for i, found_id in enumerate(found_ids[:10]):
        if found_id in correct_ids:
            dcg += 1 / np.log2(i + 2)

    # Ideal DCG: all relevant documents (at most 10) ranked at the top.
    idcg = 0.0
    num_correct = min(len(correct_ids), 10)
    for i in range(num_correct):
        idcg += 1 / np.log2(i + 2)

    return float(dcg / idcg) if idcg > 0 else 0.0
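
To make the definitions concrete, here is a small worked example on a toy ranking. It is a self-contained sketch: the helper names (`hit_at_k`, `reciprocal_rank_at_k`, `ndcg_at_k`) and the document IDs are hypothetical, and the helpers use `math` instead of NumPy, but they are meant to mirror the logic of the three functions above.

```python
import math
from typing import List, Set


def hit_at_k(ranked: List[str], relevant: Set[str], k: int = 10) -> int:
    """1 if any relevant ID appears in the top k, else 0."""
    return int(any(doc_id in relevant for doc_id in ranked[:k]))


def reciprocal_rank_at_k(ranked: List[str], relevant: Set[str], k: int = 10) -> float:
    """1 / rank of the first relevant ID in the top k, or 0.0 if none."""
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            return 1 / rank
    return 0.0


def ndcg_at_k(ranked: List[str], relevant: Set[str], k: int = 10) -> float:
    """Binary-relevance nDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    idcg = sum(1 / math.log2(rank + 1) for rank in range(1, min(len(relevant), k) + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical ranking: relevant documents sit at ranks 2 and 4.
toy_ranking = ["d7", "d3", "d9", "d1"]
toy_relevant = {"d3", "d1"}

hit = hit_at_k(toy_ranking, toy_relevant)
rr = reciprocal_rank_at_k(toy_ranking, toy_relevant)
ndcg = ndcg_at_k(toy_ranking, toy_relevant)

print(hit, rr, round(ndcg, 4))  # 1 0.5 0.6509
```

Here the DCG is 1/log2(3) + 1/log2(5) ≈ 1.0616 and the ideal DCG (both relevant documents at ranks 1 and 2) is 1 + 1/log2(3) ≈ 1.6309, which gives nDCG ≈ 0.6509.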

## 📚 Load the CoSQA Dataset
We will load the corpus (documents), the queries, and the test split of relevance judgments (query-document pairs) using the Hugging Face `datasets` library.

In [3]:
corpus_dataset = load_dataset("CoIR-Retrieval/cosqa", "corpus", split="corpus")
print(f"Corpus loaded with {len(corpus_dataset)} documents.")

queries_dataset = load_dataset("CoIR-Retrieval/cosqa", "queries", split="queries")
print(f"Queries loaded with {len(queries_dataset)} queries.")

eval_dataset = load_dataset("CoIR-Retrieval/cosqa", name="default", split="test")
print(f"Evaluation 'test' split loaded with {len(eval_dataset)} query-document pairs.")

Corpus loaded with 20604 documents.
Queries loaded with 20604 queries.
Evaluation 'test' split loaded with 500 query-document pairs.


## 🧩 Prepare the Data and Search Engine
We extract text and IDs from the corpus, then build the search index using the `Search` class you implemented earlier.

In [4]:
documents = [item['text'] for item in corpus_dataset]
doc_ids = [item['_id'] for item in corpus_dataset]

search_instance = Search()
print("Building the search index... (This may take a while)")
search_instance.index(documents=documents, doc_ids=doc_ids)
print("\nIndexing complete! Your search engine is ready.")

Loading model...
Model loaded.
Building the search index... (This may take a while)
Generating embeddings for 20604 documents...
Embeddings generated.
Index built. Total vectors in index: 20604
Index saved to search_index.usearch
Document map saved to documents.json

Indexing complete! Your search engine is ready.
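
The rest of this notebook relies on only two calls: `index(documents=..., doc_ids=...)` and `search(query=..., top_k=..., return_ids=True)`. For testing the evaluation code without building the real index, a stand-in with the same call shape can be useful. The sketch below is hypothetical: `ToySearch` is not the `Search` class from `search_engine`, it ranks by simple token overlap instead of embeddings, and it assumes `search` returns a list of document IDs.

```python
from typing import List, Set


class ToySearch:
    """Hypothetical stand-in that mirrors the call shape used in this notebook."""

    def __init__(self) -> None:
        self._doc_ids: List[str] = []
        self._doc_tokens: List[Set[str]] = []

    def index(self, documents: List[str], doc_ids: List[str]) -> None:
        if len(documents) != len(doc_ids):
            raise ValueError("documents and doc_ids must have the same length")
        self._doc_ids = list(doc_ids)
        # Lowercased whitespace tokens stand in for real embeddings.
        self._doc_tokens = [set(doc.lower().split()) for doc in documents]

    def search(self, query: str, top_k: int = 10, return_ids: bool = True) -> List[str]:
        # return_ids is accepted only for call compatibility; IDs are always returned.
        query_tokens = set(query.lower().split())
        scored = [
            (len(query_tokens & tokens), position)
            for position, tokens in enumerate(self._doc_tokens)
        ]
        # Highest overlap first; ties keep the original document order.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [self._doc_ids[position] for _, position in scored[:top_k]]


toy_engine = ToySearch()
toy_engine.index(
    documents=["sort a list of items", "read a text file from disk"],
    doc_ids=["d1", "d2"],
)
toy_result = toy_engine.search(query="read file", top_k=1, return_ids=True)
print(toy_result)  # ['d2']
```

Swapping `ToySearch()` in for `Search()` makes it possible to check the evaluation loop end to end in seconds. The scores it produces say nothing about the real engine.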


## 🚀 Evaluation Loop
We will now test each query from the test set, perform a search, and compute the metrics for each.

In [5]:
recall_scores = []
mrr_scores = []
ndcg_scores = []

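# Map each query ID to its text for quick lookup.
queries_map = {item['_id']: item['text'] for item in queries_dataset}

# Group the relevance judgments: query ID -> set of correct document IDs.
ground_truth = defaultdict(set)
for item in eval_dataset:
    ground_truth[item['query-id']].add(item['corpus-id'])

# Sort the query IDs so the evaluation order is deterministic.
test_query_ids = sorted(ground_truth)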
print(f"Found {len(test_query_ids)} unique queries to test.")

for i, query_id in enumerate(test_query_ids):
    query_text = queries_map.get(query_id)
    if not query_text:
        continue

    correct_doc_ids = ground_truth[query_id]
    found_doc_ids = search_instance.search(query=query_text, top_k=10, return_ids=True)

    recall_scores.append(calculate_recall_at_10(found_doc_ids, correct_doc_ids))
    mrr_scores.append(calculate_mrr_at_10(found_doc_ids, correct_doc_ids))
    ndcg_scores.append(calculate_ndcg_at_10(found_doc_ids, correct_doc_ids))

    if (i + 1) % 50 == 0:
        print(f"  ...processed {i+1}/{len(test_query_ids)} queries")

Found 500 unique queries to test.
Searching for: 'sort by a token in string python'
Searching for: 'python check file is readonly'
Searching for: 'declaring empty numpy array in python'
Searching for: 'test for iterable is string in python'
Searching for: 'python print results of query loop'
Searching for: 'how to save header of fits file to export python'
Searching for: 'python calc page align'
Searching for: 'python numpy array as float'
Searching for: 'input string that replaces occurences python'
Searching for: 'python check all items in list are ints'
Searching for: 'how to save variable to text file python'
Searching for: 'how to skip an index in a for loop python'
Searching for: 'how to create a tokenization code in python'
Searching for: 'python raise without parentheses'
Searching for: 'how to seperate list with commas python'
Searching for: 'python asynchronous function call return'
Searching for: 'how to make a seconds to time in python'
Searching for: 'python cast true or f
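
The flow of the loop above (group the relevance rows by query, search, score, average) can be dry-run on toy data without the dataset or the search engine. In this sketch everything prefixed with `toy_` is hypothetical, including the canned rankings that stand in for `search_instance.search`. Only the `query-id` and `corpus-id` field names are taken from the notebook.

```python
from collections import defaultdict

# Hypothetical relevance rows, shaped like the rows of the 'test' split.
toy_qrels = [
    {"query-id": "q1", "corpus-id": "d1"},
    {"query-id": "q2", "corpus-id": "d5"},
    {"query-id": "q2", "corpus-id": "d6"},
]
toy_queries = {"q1": "sort a list", "q2": "read a file"}

# Hypothetical canned rankings standing in for the real search engine.
toy_rankings = {
    "sort a list": ["d3", "d1", "d2"],
    "read a file": ["d9", "d8", "d7"],
}


def toy_search(query: str, top_k: int = 10) -> list:
    return toy_rankings.get(query, [])[:top_k]


# Group relevance rows: query ID -> set of correct document IDs.
toy_ground_truth = defaultdict(set)
for row in toy_qrels:
    toy_ground_truth[row["query-id"]].add(row["corpus-id"])

toy_hits, toy_reciprocal_ranks = [], []
for query_id in sorted(toy_ground_truth):
    found = toy_search(toy_queries[query_id], top_k=10)
    relevant = toy_ground_truth[query_id]
    toy_hits.append(int(any(doc_id in relevant for doc_id in found)))
    toy_reciprocal_ranks.append(
        next((1 / rank for rank, doc_id in enumerate(found, start=1) if doc_id in relevant), 0.0)
    )

toy_mean_hit = sum(toy_hits) / len(toy_hits)
toy_mean_rr = sum(toy_reciprocal_ranks) / len(toy_reciprocal_ranks)
print(toy_mean_hit, toy_mean_rr)  # 0.5 0.25
```

Query `q1` finds its correct document at rank 2 (hit 1, reciprocal rank 0.5) and `q2` finds none (hit 0, reciprocal rank 0.0), so the averages are 0.5 and 0.25.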

## 🧮 Compute Final Metrics
After looping through all queries, we average each metric over the evaluated queries and print the results.

In [6]:
if recall_scores:
    final_recall = np.mean(recall_scores)
    final_mrr = np.mean(mrr_scores)
    final_ndcg = np.mean(ndcg_scores)

    print("\n✅ Evaluation complete.")
    print("\n--- FINAL RESULTS ---")
    print(f"Recall@10: {final_recall:.4f}")
    print(f"MRR@10:    {final_mrr:.4f}")
    print(f"nDCG@10:   {final_ndcg:.4f}")
else:
    print("\nEvaluation could not be completed.")


✅ Evaluation complete.

--- FINAL RESULTS ---
Recall@10: 0.5600
MRR@10:    0.3544
nDCG@10:   0.4028
