# Generate Custom Benchmark

This notebook walks through how to generate a custom benchmark based on your data.

We use OpenAI for both the embedding model and the LLM, but either can easily be swapped out:
- Various embedding functions are provided in `embed.py`
- LLM prompts are provided in `llm.py`

## 1. Setup

### 1.1 Install & Import

Install the necessary packages.

In [3]:
!pip install -r requirements.txt

Defaulting to user installation because normal site-packages is not writeable


Import modules.

In [36]:
%load_ext autoreload
%autoreload 2

import chromadb
import pandas as pd
import numpy as np
import datasets
import json
import os
from datetime import datetime
from dotenv import load_dotenv
from openai import OpenAI as OpenAIClient
from anthropic import Anthropic as AnthropicClient
from functions.llm import *
from functions.embed import *
from functions.chroma import *
from functions.evaluate import *
from functions.visualize import *

load_dotenv()

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


True

### 1.2 Set Variables

We use pre-chunked [Chroma Docs](https://docs.trychroma.com/docs/overview/introduction) as an example. To run this notebook with your own data, uncomment the commented-out lines below and fill them in.

In [5]:
with open('data/chroma_docs.json', 'r') as f:
    corpus = json.load(f)

context = "This is a technical support bot for Chroma, a vector database often used by developers for building AI applications."
example_queries = """
    how to add to a collection
    filter by metadata
    retrieve embeddings when querying
    how to use openai embedding function when adding to collection
    """

COLLECTION_NAME = "chroma-docs"

# Generate a Benchmark with your own data:

# with open('filepath/to/your/data.json', 'r') as f:
#     corpus = json.load(f)

# context = "FILL IN WITH CONTEXT RELEVANT TO YOUR USE CASE"
# example_queries = "FILL IN WITH EXAMPLE QUERIES"

# COLLECTION_NAME = "YOUR COLLECTION NAME"
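
Section 2.1 reads `corpus` as a dictionary that maps each chunk ID to its text. The following is a minimal sketch of producing a JSON file in that shape from your own chunks. The `build_corpus` helper, the sample chunks, and the output path are hypothetical, and UUIDs are only one reasonable choice of ID.

```python
import json
import os
import tempfile
import uuid


def build_corpus(chunks):
    """Map a fresh UUID string to each non-empty text chunk (hypothetical helper)."""
    return {str(uuid.uuid4()): chunk for chunk in chunks if chunk.strip()}


# Illustrative chunks; replace with the output of your own chunking step.
sample_chunks = [
    "First chunk of your documentation.",
    "   ",  # whitespace-only chunks are dropped
    "Second chunk of your documentation.",
]

example_corpus = build_corpus(sample_chunks)

# Write to a temporary location so the sketch does not touch project files.
example_path = os.path.join(tempfile.mkdtemp(), "data.json")
with open(example_path, "w") as f:
    json.dump(example_corpus, f, indent=2)

# Reading the file back gives the same {id: text} mapping.
with open(example_path, "r") as f:
    reloaded_corpus = json.load(f)
```

Point the `open(...)` call in the commented-out block above at a file like this one to use it as `corpus`.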

### 1.3 Load API Keys

To use Chroma Cloud, you can sign up for a Chroma Cloud account [here](https://www.trychroma.com/) and create a new database.

In [6]:
# Embedding Model & LLM
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

# If you want to use Chroma Cloud, uncomment and fill in the following:
# CHROMA_TENANT = "YOUR CHROMA TENANT ID"
# X_CHROMA_TOKEN = "YOUR CHROMA API KEY"
# DATABASE_NAME = "YOUR CHROMA DATABASE NAME"

### 1.4 Set Clients

Initialize the clients.

In [7]:
chroma_client = chromadb.Client()

# If you want to use Chroma Cloud, uncomment the following line:
# chroma_client = chromadb.HttpClient(
#   ssl=True,
#   host='api.trychroma.com',
#   tenant=CHROMA_TENANT,
#   database=DATABASE_NAME,
#   headers={
#     'x-chroma-token': X_CHROMA_TOKEN
#   }
# )

openai_client = OpenAIClient(api_key=OPENAI_API_KEY)

## 2. Create Chroma Collection

If you already have a Chroma Collection for your data, skip to **3. Filter Documents for Quality**.

### 2.1 Load in Your Data

In [8]:
corpus_ids = list(corpus.keys())
corpus_documents = [corpus[key] for key in corpus_ids]

### 2.2 Embed Data & Add to Chroma Collection

Embed your documents using an embedding model of your choice. We use OpenAI's `text-embedding-3-large` here, but other embedding functions are available in `embed.py`. You may also define your own embedding function.

We use batching and multi-threading for efficiency.

In [9]:
corpus_embeddings = openai_embed_in_batches(
    openai_client=openai_client,
    texts=corpus_documents,
    model="text-embedding-3-large",
)

corpus_collection = chroma_client.get_or_create_collection(
    name=COLLECTION_NAME,
    metadata={"hnsw:space": "cosine"}
)

collection_add_in_batches(
    collection=corpus_collection,
    ids=corpus_ids,
    texts=corpus_documents,
    embeddings=corpus_embeddings,
)

Processing OpenAI batches: 100%|██████████| 3/3 [00:03<00:00,  1.29s/it]


Adding 0 to 100
Adding 100 to 200
Adding 200 to 216
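
The batching and multi-threading mentioned above can be sketched roughly as follows. This is not the code behind `openai_embed_in_batches`; `make_batches`, `fake_embed`, and `embed_in_batches` are hypothetical names, `fake_embed` stands in for a real embedding API call, and the batch size of 100 is an assumption inferred from the three batches shown in the output.

```python
from concurrent.futures import ThreadPoolExecutor


def make_batches(items, batch_size):
    """Split a list into consecutive slices of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def fake_embed(batch):
    """Stand-in for a real embedding call: one 2-dimensional vector per text."""
    return [[float(len(text)), float(text.count(" "))] for text in batch]


def embed_in_batches(texts, batch_size=100, max_workers=4):
    """Embed batches concurrently and return vectors in the original order."""
    batches = make_batches(texts, batch_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in submission order, so flattening
        # keeps each vector aligned with its input text.
        results = list(pool.map(fake_embed, batches))
    return [vector for batch_vectors in results for vector in batch_vectors]


sample_texts = [f"document number {i}" for i in range(216)]
sample_vectors = embed_in_batches(sample_texts, batch_size=100)
print(len(make_batches(sample_texts, 100)), len(sample_vectors))  # prints: 3 216
```

Keeping results in input order matters because the IDs, documents, and embeddings are later passed to the collection as parallel lists.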


## 3. Filter Documents for Quality

We begin by filtering our documents before query generation. This step ensures that we do not generate queries from irrelevant or incomplete documents.

### 3.1 Set Criteria

We use the following criteria:
- `relevance` checks whether the document is relevant to the specified context
- `completeness` checks for overall quality of the document

You can modify the criteria as you see fit.

Fill in `context` according to your use case.

In [10]:
relevance = f"The document is relevant to the following context: {context}"
completeness = "The document is complete, meaning that it contains useful information to answer queries and does not only serve as an introduction to the main content that users may be looking for."

criteria = [relevance, completeness]
criteria_labels = ["relevance", "completeness"]
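
To illustrate how per-criterion labels can drive filtering, here is a minimal sketch that assumes a document is kept only when it satisfies every criterion. It does not reproduce `filter_documents`; `keep_passing_ids` and `toy_judge` are hypothetical, and `toy_judge` uses crude keyword and length checks in place of an LLM call.

```python
def keep_passing_ids(documents, ids, criteria_labels, judge):
    """Return the ids of documents for which judge reports True on every label."""
    passed = []
    for doc_id, document in zip(ids, documents):
        verdicts = judge(document)  # e.g. {"relevance": True, "completeness": False}
        if all(verdicts.get(label, False) for label in criteria_labels):
            passed.append(doc_id)
    return passed


def toy_judge(document):
    """Hypothetical stand-in for an LLM judgment on each criterion."""
    return {
        "relevance": "collection" in document.lower(),
        "completeness": len(document.split()) >= 5,
    }


sample_ids = ["a", "b", "c"]
sample_documents = [
    "Records in a collection can be updated with the update method.",
    "Collection overview",
    "This paragraph is long enough but talks about something else entirely.",
]

kept_ids = keep_passing_ids(
    sample_documents, sample_ids, ["relevance", "completeness"], toy_judge
)
print(kept_ids)  # prints: ['a']
```

Document `b` fails the completeness check and document `c` fails the relevance check, so only `a` is kept.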

### 3.2 Get Documents

Get your Chroma collection and filter documents according to criteria.

In [11]:
corpus_collection = chroma_client.get_collection(
    name=COLLECTION_NAME
)

corpus = get_collection_items(
    collection=corpus_collection
)

corpus_ids = list(corpus.keys())
corpus_documents = [corpus[key]['document'] for key in corpus_ids]

Processing batches: 100%|██████████| 3/3 [00:00<00:00, 111.96it/s]


### 3.3 Filter Documents

We filter our documents using gpt-4o-mini. Batching functions are also available in `llm.py`.

In [12]:
filtered_document_ids = filter_documents(
    client=openai_client,
    model="gpt-4o-mini",
    documents=corpus_documents,
    ids=corpus_ids,
    criteria=criteria,
    criteria_labels=criteria_labels
)

Filtering documents: 100%|██████████| 216/216 [05:25<00:00,  1.51s/it]


In [13]:
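passed_documents = [corpus[doc_id]['document'] for doc_id in filtered_document_ids]

# Use a set for fast membership checks when collecting the failed IDs.
passed_id_set = set(filtered_document_ids)
failed_document_ids = [doc_id for doc_id in corpus_ids if doc_id not in passed_id_set]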

### 3.4 View Results

In [17]:
print(f"Number of documents passed: {len(filtered_document_ids)}")
print(f"Number of documents failed: {len(failed_document_ids)}")
print("-"*80)
print("Example of passed document:")
print(corpus[filtered_document_ids[0]]['document'])
print("-"*80)
print("Example of failed document:")
print(corpus[failed_document_ids[0]]['document'])
print("-"*80)

Number of documents passed: 46
Number of documents failed: 170
--------------------------------------------------------------------------------
Example of passed document:

    Any property of records in a collection can be updated with .update

    If an id is not found in the collection, an error will be logged and the update will be ignored. If documents are supplied without corresponding embeddings, the embeddings will be recomputed with the collection's embedding function.

    If the supplied embeddings are not the same dimension as the collection, an exception will be raised.
    
--------------------------------------------------------------------------------
Example of failed document:

    # typescript
    import { OpenAIEmbeddingFunction } from 'chromadb';

    const embeddingFunction = new OpenAIEmbeddingFunction({
        openai_api_key: "apiKey",
        openai_model: "text-embedding-3-small"
    })

    // use directly
    const embeddings = embeddingFunction.generate(["

## 4. Generate Golden Dataset

Using our filtered documents, we can generate a golden dataset of queries.

### 4.1 Create Custom Prompt

We will use the `context` (from the prior section) and `example_queries` for query generation.

Fill in `example_queries` with examples of what users may ask. These examples help indicate what kind of topics users typically focus on, as well as the style of query that should be generated.
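
As a rough illustration of how `context` and `example_queries` could feed into a generation prompt, consider the sketch below. The prompt wording and the `build_query_prompt` helper are hypothetical and are not the prompt used by `create_golden_dataset`. The sketch also shows one way to strip the indentation and blank lines that a triple-quoted `example_queries` string carries.

```python
def build_query_prompt(context, example_queries, document):
    """Assemble a hypothetical query-generation prompt from the three inputs."""
    # Drop indentation and blank lines from the triple-quoted examples.
    examples = [line.strip() for line in example_queries.splitlines() if line.strip()]
    example_block = "\n".join(f"- {example}" for example in examples)
    return (
        f"Context: {context}\n\n"
        f"Example queries:\n{example_block}\n\n"
        f"Document:\n{document.strip()}\n\n"
        "Write one query, in the style of the examples, that this document answers."
    )


sample_context = "This is a technical support bot for Chroma."
sample_example_queries = """
    how to add to a collection
    filter by metadata
    """
sample_document = "Any property of records in a collection can be updated with .update"

sample_prompt = build_query_prompt(sample_context, sample_example_queries, sample_document)
print(sample_prompt)
```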

### 4.2 Generate Queries

Generate queries with gpt-4o. Batching functions are available in `llm.py`.

In [24]:
golden_dataset = create_golden_dataset(
    client=openai_client,
    model="gpt-4o",
    documents=passed_documents,
    ids=filtered_document_ids,
    context=context,
    example_queries=example_queries
)

golden_dataset.head()

Generating queries: 100%|██████████| 46/46 [00:35<00:00,  1.30it/s]


|   | id | query |
|---|---|---|
| 0 | 3e8cc591-aa06-4f76-90ae-ee1e284c29fd | update records with new embeddings |
| 1 | bc68272d-b038-4f5f-9132-74d07dd010e3 | upsert operation with multiple documents |
| 2 | 51dedd38-9496-4e7a-8666-b5acf2bd29bb | querying with multiple embeddings |
| 3 | 9c820c40-7492-481c-9aba-07b08b0c860d | include parameter usage with get |
| 4 | 8e14f0d2-fe83-46f3-a456-044a0b9e4949 | submit a CIP for new feature proposal |


## 5. Evaluate

Now that we have our golden dataset, we can run our evaluation.

### 5.1 Prepare Inputs

In [25]:
queries = golden_dataset['query'].tolist()
ids = golden_dataset['id'].tolist()

Embed generated queries.

In [26]:
query_embeddings = openai_embed_in_batches(
    openai_client=openai_client,
    texts=queries,
    model="text-embedding-3-large"
)

# Map each query ID to its text and embedding.
query_embeddings_lookup = {
    query_id: {
        "text": query,
        "embedding": embedding
    }
    for query_id, query, embedding in zip(ids, queries, query_embeddings)
}

Processing OpenAI batches: 100%|██████████| 1/1 [00:01<00:00,  1.17s/it]


Create our qrels (query relevance labels) dataframe. In this case, each query and its corresponding document share the same id.

In [27]:
qrels = pd.DataFrame(
    {
        "query-id": ids,
        "corpus-id": ids,
        "score": 1
    }
)

### 5.2 Run Benchmark

In [28]:
results = run_benchmark(
    query_embeddings_lookup=query_embeddings_lookup,
    collection=corpus_collection,
    qrels=qrels
)

Processing batches: 100%|██████████| 1/1 [00:00<00:00, 14.67it/s]

NDCG@1: 0.5
NDCG@3: 0.68064
NDCG@5: 0.69936
NDCG@10: 0.7357
MAP@1: 0.5
MAP@3: 0.63768
MAP@5: 0.64855
MAP@10: 0.66421
Recall@1: 0.5
Recall@3: 0.80435
Recall@5: 0.84783
Recall@10: 0.95652
P@1: 0.5
P@3: 0.26812
P@5: 0.16957
P@10: 0.09565
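
The reported P@k values equal Recall@k divided by k (for example, 0.80435 / 3 ≈ 0.26812), which is what you would expect when each query has exactly one relevant document, as in the qrels above. The sketch below computes the four metrics for that single-relevant-document case using the standard binary-relevance definitions. It is not the code behind `run_benchmark`; `metrics_at_k` and the example ranks are hypothetical.

```python
import math


def metrics_at_k(ranks, k):
    """Average metrics over queries, given the 1-based rank of each query's
    single relevant document (None if it was not retrieved)."""
    n = len(ranks)
    hits = [rank is not None and rank <= k for rank in ranks]
    recall = sum(hits) / n
    precision = recall / k  # at most one relevant document per query
    mean_ap = sum(1 / rank for rank, hit in zip(ranks, hits) if hit) / n
    # With one relevant document the ideal DCG is 1, so NDCG equals DCG.
    ndcg = sum(1 / math.log2(rank + 1) for rank, hit in zip(ranks, hits) if hit) / n
    return {"NDCG": ndcg, "MAP": mean_ap, "Recall": recall, "P": precision}


# Illustrative ranks for four queries (not taken from the run above).
example_ranks = [1, 2, None, 4]
for k in (1, 3, 5):
    rounded = {name: round(value, 5) for name, value in metrics_at_k(example_ranks, k).items()}
    print(f"k={k}: {rounded}")
```

At k=1 all four metrics coincide, which matches the identical 0.5 values reported for NDCG@1, MAP@1, Recall@1, and P@1.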





Save results.

This is helpful for comparison (e.g. comparing different embedding models or chunking strategies).

In [39]:
# Avoid spaces and colons so the timestamp is a valid filename on every OS.
timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
results_to_save = {
    "model": "text-embedding-3-large",
    "results": results
}

In [40]:
# Create the output directory if it does not exist yet.
os.makedirs('results', exist_ok=True)
with open(f'results/{timestamp}.json', 'w') as f:
    json.dump(results_to_save, f)
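
Once several runs are saved, they can be compared with a few lines of code. The sketch below assumes each file has the `{"model": ..., "results": ...}` layout written above and, as a further assumption, that `results` is a flat mapping from metric name to score. `load_runs`, `best_run`, and the two demo runs are hypothetical.

```python
import json
import os
import tempfile


def load_runs(results_dir):
    """Read every .json file in results_dir into {filename: saved payload}."""
    runs = {}
    for name in sorted(os.listdir(results_dir)):
        if name.endswith(".json"):
            with open(os.path.join(results_dir, name), "r") as f:
                runs[name] = json.load(f)
    return runs


def best_run(runs, metric):
    """Return (filename, model, score) for the run with the highest metric."""
    name, payload = max(runs.items(), key=lambda item: item[1]["results"][metric])
    return name, payload["model"], payload["results"][metric]


# Two made-up runs written to a temporary directory for illustration only.
demo_dir = tempfile.mkdtemp()
demo_runs = {
    "run_a.json": {"model": "model-a", "results": {"Recall@10": 0.90}},
    "run_b.json": {"model": "model-b", "results": {"Recall@10": 0.95}},
}
for filename, payload in demo_runs.items():
    with open(os.path.join(demo_dir, filename), "w") as f:
        json.dump(payload, f)

print(best_run(load_runs(demo_dir), "Recall@10"))  # prints: ('run_b.json', 'model-b', 0.95)
```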