## Evaluating Retrieval

We'll evaluate different search approaches using our synthetic ground truth dataset. We'll compare keyword-based search, vector search, and a hybrid approach to see which performs best.

In [9]:
import pandas as pd
import sys

sys.path.insert(0, '..')

In [5]:
df_ground_truth = pd.read_csv('ground_truth_evidently.csv')

In [6]:
ground_truth = df_ground_truth.to_dict(orient='records')

In [13]:
rec = ground_truth[0]
rec

{'question': 'data definition in Evidently',
 'summary_answer': 'The article describes how to create a `DataDefinition` object to map column types and roles essential for data evaluation in Evidently.',
 'difficulty': 'beginner',
 'intent': 'text',
 'filename': 'docs/library/data_definition.mdx'}

In [10]:
import docs

github_data = docs.read_github_data()
parsed_data = docs.parse_data(github_data)
chunks = docs.chunk_documents(parsed_data)

In [11]:
# Index the chunked documents for keyword search
from minsearch import Index
from typing import List, TypedDict

index = Index(
    text_fields=["content", "filename", "title", "description"],
)

index.fit(chunks)

<minsearch.minsearch.Index at 0x722e92cb7050>

In [12]:
# define our baseline search function using keyword-based search
class SearchResult(TypedDict):
    """Represents a single search result entry."""
    start: int
    content: str
    title: str
    description: str
    filename: str


def search(query: str) -> List[SearchResult]:
    """
    Search the index for documents matching the given query.

    Args:
        query (str): The search query string.

    Returns:
        List[SearchResult]: A list of search results. Each result dictionary contains:
            - start (int): The starting position or offset within the source file.
            - content (str): A text excerpt or snippet containing the match.
            - title (str): The title of the matched document.
            - description (str): A short description of the document.
            - filename (str): The path or name of the source file.
    """
    return index.search(
        query=query,
        num_results=5,
    )

## Collecting Search Results for Evaluation
Now let's run our search function against every question in the ground truth dataset and record, for each returned result, whether it comes from the expected file.

In [19]:
# For each question, mark which of the returned results come from the expected file

from tqdm.auto import tqdm

all_search_results = []

for gt_rec in tqdm(ground_truth):
    sr = search(gt_rec['question'])
    filename = gt_rec['filename']
    relevance = [filename == sr_rec['filename'] for sr_rec in sr]
    all_search_results.append(relevance)

  0%|          | 0/478 [00:00<?, ?it/s]

## Evaluation Metrics
We will implement two search evaluation metrics:
- Hit Rate: The percentage of queries where at least one relevant document appears in the top results.
- Mean Reciprocal Rank (MRR): The average of reciprocal ranks of the first relevant document. It rewards finding relevant documents early in the result list.

In [20]:
# hit rate implementation

def hit_rate(relevance_total):
    cnt = 0

    for line in relevance_total:
        if True in line:
            cnt = cnt + 1

    return cnt / len(relevance_total)

For MRR, we also look at the position of the first relevant result:

- position 1 => 1
- position 2 => 1/2
- position 3 => 1/3
- position 4 => 1/4
- no relevant result in the list => 0

In [21]:
# MRR implementation

def mrr(relevance_total):
    total_score = 0.0

    for line in relevance_total:
        # Only the first relevant result in each list contributes to the score
        for rank, is_relevant in enumerate(line, start=1):
            if is_relevant:
                total_score += 1 / rank
                break

    return total_score / len(relevance_total)
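
A quick sanity check on hand-made data helps confirm that both metrics behave as described. The sketch below restates the two functions compactly (same behavior as the cells above) so that it runs on its own; the `toy_relevance` lists are invented for illustration and are not taken from the dataset.

```python
# Compact restatements of the two metrics so this cell is self-contained
def hit_rate(relevance_total):
    return sum(any(line) for line in relevance_total) / len(relevance_total)


def mrr(relevance_total):
    total = 0.0
    for line in relevance_total:
        for rank, is_relevant in enumerate(line, start=1):
            if is_relevant:
                total += 1 / rank
                break
    return total / len(relevance_total)


# Hypothetical relevance lists for three queries (made up for this example)
toy_relevance = [
    [True, False, False],   # hit at position 1 -> reciprocal rank 1
    [False, False, True],   # hit at position 3 -> reciprocal rank 1/3
    [False, False, False],  # no hit            -> reciprocal rank 0
]

toy_hit_rate = hit_rate(toy_relevance)  # 2 of 3 queries have a hit
toy_mrr = mrr(toy_relevance)            # (1 + 1/3 + 0) / 3
print(toy_hit_rate, toy_mrr)
```

The third query shows why MRR can never exceed the hit rate: a query without a hit adds 0 to both, and a hit adds at most 1 to MRR.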

In [22]:
# put them together

def evaluate(
        ground_truth,
        search_function,
        question_column='question',
        id_column='filename'
):
    relevance_total = []

    for q in tqdm(ground_truth):
        doc_id = q[id_column]
        results = search_function(q[question_column])
        relevance = [d[id_column] == doc_id for d in results]
        relevance_total.append(relevance)

    return {
        'hit_rate': hit_rate(relevance_total),
        'mrr': mrr(relevance_total),
    }

## Baseline Performance Evaluation
Let's evaluate our keyword-based search to establish a baseline.

In [24]:
evaluate(ground_truth, search)

  0%|          | 0/478 [00:00<?, ?it/s]

{'hit_rate': 0.4372384937238494, 'mrr': 0.37036262203626213}

## Vector Search Implementation

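Keyword search only matches terms that appear literally in the documents, so it struggles when a question is phrased differently from the documentation. Let's try vector search, which compares embeddings and can match on meaning rather than exact words.

First, we need to set up the embedding model: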

In [25]:
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer('multi-qa-distilbert-cos-v1')

In [26]:
# create embeddings for all document chunks 

import numpy as np
from tqdm.auto import tqdm

embeddings = []

for d in tqdm(chunks):
    text = d.get('title', '') + ' ' + d.get('description', '') + ' ' + d.get('content', '')
    text = text.strip()
    v = embedding_model.encode(text)
    embeddings.append(v)

embeddings = np.array(embeddings)


  0%|          | 0/575 [00:00<?, ?it/s]

In [27]:
# index them with vector search 

from minsearch import VectorSearch

vindex = VectorSearch()
vindex.fit(embeddings, chunks)

<minsearch.vector.VectorSearch at 0x722d777e3fe0>

In [28]:
# define vector search function

def v_search(query: str) -> List[SearchResult]:
    """
    Search the vector index for the chunks most similar to the given query.

    Args:
        query (str): The search query string.

    Returns:
        List[SearchResult]: A list of search results. Each result dictionary contains:
            - start (int): The starting position or offset within the source file.
            - content (str): A text excerpt or snippet containing the match.
            - title (str): The title of the matched document.
            - description (str): A short description of the document.
            - filename (str): The path or name of the source file.
    """

    q = embedding_model.encode(query)

    return vindex.search(
        q,
        num_results=5,
    )
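
Conceptually, vector search embeds the query and returns the documents whose embeddings are closest to it. The sketch below illustrates that idea with cosine similarity on made-up 3-dimensional vectors, using only the standard library. `cosine_similarity`, `cosine_top_k` and the `toy_*` values are hypothetical names introduced here, and this is not a description of how `minsearch.VectorSearch` works internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_top_k(query_vec, doc_vecs, k=5):
    """Return the indices of the k document vectors most similar to the query."""
    scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
    # Sort document indices by descending similarity and keep the first k
    ranked = sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Made-up embeddings: three "documents" in a 3-dimensional space
toy_docs = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
]
toy_query = [0.9, 0.1, 0.0]

top = cosine_top_k(toy_query, toy_docs, k=2)
print(top)
```

The query points mostly along the first axis, so document 0 ranks first and the mixed document 2 ranks second.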

## Vector Search Evaluation

In [29]:
evaluate(ground_truth, v_search)

  0%|          | 0/478 [00:00<?, ?it/s]

{'hit_rate': 0.7280334728033473, 'mrr': 0.5702928870292887}

## Hybrid Search Approach

Can we get even better results by combining both approaches? Let's try a hybrid search that uses both vector and keyword search:

In [33]:
def h_search(query: str) -> List[SearchResult]:
    """
    Run vector search and keyword search for the given query and concatenate
    their results (vector results first).

    Args:
        query (str): The search query string.

    Returns:
        List[SearchResult]: A list of search results. Each result dictionary contains:
            - start (int): The starting position or offset within the source file.
            - content (str): A text excerpt or snippet containing the match.
            - title (str): The title of the matched document.
            - description (str): A short description of the document.
            - filename (str): The path or name of the source file.
    """
    return v_search(query) + search(query)
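
Plain concatenation can return the same chunk twice and yields up to 10 results, compared with 5 for the other two functions, which makes the hit rates harder to compare. One possible refinement is to drop duplicates and cap the list length. The sketch below assumes a chunk is uniquely identified by its `filename` and `start` fields; `merge_results` is a hypothetical helper, not part of `minsearch`, and the toy records are made up.

```python
def merge_results(*result_lists, limit=5):
    """Concatenate result lists in order, dropping duplicate chunks."""
    seen = set()
    merged = []
    for results in result_lists:
        for rec in results:
            # Assumption: (filename, start) uniquely identifies a chunk
            key = (rec['filename'], rec['start'])
            if key not in seen:
                seen.add(key)
                merged.append(rec)
    return merged[:limit]


# Made-up results standing in for v_search(query) and search(query)
toy_vector = [{'filename': 'a.mdx', 'start': 0}, {'filename': 'b.mdx', 'start': 0}]
toy_keyword = [{'filename': 'b.mdx', 'start': 0}, {'filename': 'c.mdx', 'start': 2000}]

merged = merge_results(toy_vector, toy_keyword)
print([(r['filename'], r['start']) for r in merged])
```

In the notebook this could be called as `merge_results(v_search(query), search(query))`. Because the lists are merged in order, vector results keep priority over keyword results; interleaving the two lists is an alternative worth trying.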


## Hybrid Search Evaluation

Now let's evaluate the hybrid approach:

In [32]:
evaluate(ground_truth, h_search)

  0%|          | 0/478 [00:00<?, ?it/s]

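TypeError: 'NoneType' object is not iterable

This error is raised inside `evaluate` when the search function returns `None` instead of a list. The `h_search` defined above returns a list, and the cell counters show that the evaluation cell (`In [32]`) ran before that definition (`In [33]`), so the traceback most likely comes from an earlier version of `h_search` that had no `return` value. Re-run the evaluation cell after the definition to get the hybrid scores.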