# Generate Custom Benchmark

This notebook walks through how to generate a custom benchmark based on your data.

We will be using OpenAI for our embedding model and LLM, but this can easily be switched out:

- Various embedding functions are provided in `functions/embed.py`
- LLM prompts are provided in `functions/llm.py`
- NOTE: When switching out embedding models, you will need to make a new collection for your new embeddings. Then, embed the same documents and queries with the embedding model of your choice.

Use the same golden dataset of queries when comparing embedding models on the same data.

Cells that should be modified when switching out embedding models are labeled as **[Modify]**

In [None]:
%pip install -r requirements.txt -q

## 0. (Optional) Use your own data!

This demo is already preloaded with sample data from Chroma docs, but you can also use your own data.

If you want to use your own data, change `COLLECTION_NAME` to the name of the collection where you want your data.

If you want to create a new collection with your own local data, use `load_data.py`

In [2]:
# Sample data collection name: generative-benchmarking-chroma-docs
COLLECTION_NAME = "generative-benchmarking-chroma-docs"

For the sample data, we will build a dataset for a technical support bot.

You can change the subject of the generated dataset by changing the following variables:

In [3]:
context = "This is a technical support bot for Chroma, a vector database company often used by developers for building AI applications."
example_queries = """
    how to add to a collection
    filter by metadata
    retrieve embeddings when querying
    how to use openai embedding function when adding to collection
    """

## 1. Setup

### 1.1 Install & Import

Install the necessary packages.

Import modules.

In [None]:
%load_ext autoreload
%autoreload 2

import chromadb
import pandas as pd
import json
import os
import dotenv
from pathlib import Path
from datetime import datetime
from openai import OpenAI as OpenAIClient
from functions.utils import *
from functions.llm import *
from functions.embed import *
from functions.chroma import *
from functions.evaluate import *
from functions.visualize import *
from functions.types import *

### 1.2 Set Variables

In [8]:
dotenv.load_dotenv();

# Embedding Model & LLM
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
assert OPENAI_API_KEY is not None and len(OPENAI_API_KEY) > 0, 'Please provide an OpenAI API key'

CHROMA_CLOUD_API_KEY = os.getenv("CHROMA_CLOUD_API_KEY")
CHROMA_HOST=os.getenv("CHROMA_HOST")
CHROMA_TENANT=os.getenv("CHROMA_TENANT")
CHROMA_DB_NAME=os.getenv("CHROMA_DB_NAME")

using_chroma_cloud = env_var_provided(CHROMA_CLOUD_API_KEY)

if using_chroma_cloud:
    assert env_var_provided(CHROMA_HOST)
    assert env_var_provided(CHROMA_TENANT)
    assert env_var_provided(CHROMA_DB_NAME)

### 1.3 Connect to Chroma

In [None]:
chroma_client = chromadb.HttpClient(
    ssl=using_chroma_cloud,
    host=CHROMA_HOST,
    tenant=CHROMA_TENANT,
    database=CHROMA_DB_NAME,
    headers={
        'x-chroma-token': CHROMA_CLOUD_API_KEY
    }
)

corpus_collection = chroma_client.get_or_create_collection(
    name=COLLECTION_NAME,
    metadata={"hnsw:space": "cosine"}
)

assert corpus_collection.count() > 0, \
    ('Your collection is empty. If you want to create a new collection with '
    'your own data, use the `load_data.py` script.')
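
The `"hnsw:space": "cosine"` setting means neighbors are ranked by cosine distance, which is 1 minus the cosine similarity of two vectors. As a rough illustration of the quantity involved, here is a minimal pure-Python sketch; the `cosine_distance` helper is hypothetical and is not part of this repository or of Chroma.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length vectors.

    Hypothetical helper for illustration only.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


# Identical direction gives 0, orthogonal gives 1, opposite gives 2
same = cosine_distance([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_distance([1.0, 0.0], [0.0, 3.0])
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])
```

Because the measure ignores vector length, only the direction of an embedding affects its ranking.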

### 1.4 Connect to Embedding Provider

In [11]:
openai_client = OpenAIClient(api_key=OPENAI_API_KEY)

## 2. Get Data

### 2.1 Load in Your Data

If you are using your own data, this will embed your data. We use OpenAI's text-embedding-3-large here, but other functions are available in `embed.py`. You may also define your own embedding function.

We use batching and multi-threading for efficiency.

In [None]:
corpus = get_collection_items(
    collection=corpus_collection
)
corpus_ids = list(corpus.keys())
corpus_documents = [corpus[key]['document'] for key in corpus_ids]
corpus_embeddings = [corpus[key]['embedding'] for key in corpus_ids]
assert len(corpus) > 0
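
The batching and multi-threading mentioned above can be pictured with a small sketch. This is not the code in `embed.py`: `embed_in_batches` and `fake_embed_batch` are hypothetical names, and the stand-in embedder returns a dummy vector so the example runs without an API key.

```python
from concurrent.futures import ThreadPoolExecutor


def embed_in_batches(texts, embed_batch, batch_size=100, max_workers=4):
    """Split texts into batches and embed the batches on a thread pool.

    embed_batch is any callable that takes a list of strings and returns
    one embedding per string, in the same order (assumed interface).
    """
    batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, so embeddings stay aligned
        results = list(pool.map(embed_batch, batches))
    # Flatten the per-batch lists back into a single list
    return [embedding for batch in results for embedding in batch]


def fake_embed_batch(batch):
    """Stand-in embedder: a 1-dimensional 'embedding' equal to the text length."""
    return [[float(len(text))] for text in batch]


embeddings = embed_in_batches(["a", "bb", "ccc"], fake_embed_batch, batch_size=2)
```

Threads suit this workload because each batch spends most of its time waiting on a network call rather than on computation.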

## 3. Filter Documents for Quality

We begin by filtering our documents prior to query generation. This step ensures that we avoid generating queries from irrelevant or incomplete documents.

### 3.1 Set Criteria

We use the following criteria:
- `relevance` checks whether the document is relevant to the specified context
- `completeness` checks for overall quality of the document

You can modify the criteria as you see fit.

In [13]:
relevance = f"The document is relevant to the following context: {context}"
completeness = "The document is complete, meaning that it contains useful information to answer queries and does not only serve as an introduction to the main content that users may be looking for."

criteria = [relevance, completeness]
criteria_labels = ["relevance", "completeness"]

### 3.2 Filter Documents

We filter our documents using gpt-4o-mini. Batching functions are also available in `llm.py`.

In [14]:
filtered_document_ids = filter_documents(
    client=openai_client,
    model="gpt-4o-mini",
    documents=corpus_documents,
    ids=corpus_ids,
    criteria=criteria,
    criteria_labels=criteria_labels
)

Filtering documents: 100%|██████████| 648/648 [20:07<00:00,  1.86s/it]


In [15]:
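passed_documents = [corpus[doc_id]['document'] for doc_id in filtered_document_ids]

# Use a set for fast membership checks and avoid shadowing the built-in `id`
passed_id_set = set(filtered_document_ids)
failed_document_ids = [doc_id for doc_id in corpus_ids if doc_id not in passed_id_set]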

### 3.3 View Results

In [None]:
print(f"Number of documents passed: {len(filtered_document_ids)}")
print(f"Number of documents failed: {len(failed_document_ids)}")
print("-"*80)
print("Example of passed document:")
print(corpus[filtered_document_ids[0]]['document'])
print("-"*80)
print("Example of failed document:")
print(corpus[failed_document_ids[0]]['document'])
print("-"*80)
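
One way to picture the pass/fail split is as a rule over per-criterion judgments. The sketch below assumes a document passes only when every criterion is judged true; the actual rule is implemented in `filter_documents`, which this notebook does not show, and `split_by_criteria` and the sample judgments are hypothetical.

```python
def split_by_criteria(ids, judgments):
    """Partition ids into (passed, failed).

    judgments maps a document id to {criterion_label: bool}. A document
    passes only if it has judgments and all of them are True (assumed rule).
    """
    passed, failed = [], []
    for doc_id in ids:
        labels = judgments.get(doc_id, {})
        if labels and all(labels.values()):
            passed.append(doc_id)
        else:
            failed.append(doc_id)
    return passed, failed


# Made-up judgments for three documents
sample_judgments = {
    "doc-1": {"relevance": True, "completeness": True},
    "doc-2": {"relevance": True, "completeness": False},
    "doc-3": {"relevance": False, "completeness": True},
}
passed_ids, failed_ids = split_by_criteria(["doc-1", "doc-2", "doc-3"], sample_judgments)
```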

## 4. Generate Golden Dataset

Using our filtered documents, we can generate a golden dataset of queries.

### 4.1 Create Custom Prompt

We will use `context` and `example_queries` for query generation.
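
To make the role of these two variables concrete, here is a minimal sketch of how a prompt could be assembled from them. The real prompts live in `functions/llm.py` and may differ substantially; `build_query_prompt` and its wording are assumptions made for illustration.

```python
def build_query_prompt(context, example_queries, document):
    """Assemble a query-generation prompt (hypothetical wording)."""
    # Turn the triple-quoted block of examples into clean bullet lines
    examples = [line.strip() for line in example_queries.splitlines() if line.strip()]
    example_block = "\n".join(f"- {query}" for query in examples)
    return (
        f"Context: {context}\n\n"
        f"Example queries:\n{example_block}\n\n"
        f"Document:\n{document}\n\n"
        "Write one query a user might ask that this document answers."
    )


prompt = build_query_prompt(
    context="This is a technical support bot for Chroma.",
    example_queries="""
    how to add to a collection
    filter by metadata
    """,
    document="Use collection.add(...) to insert documents.",
)
```

The example queries steer the style of the generated queries toward short, informal phrasing, while the context sets the subject.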

### 4.2 Generate Queries

Generate queries with gpt-4o. Batching functions are available in `llm.py`.

In [None]:
golden_dataset = create_golden_dataset(
    client=openai_client,
    model="gpt-4o",
    documents=passed_documents,
    ids=filtered_document_ids,
    context=context,
    example_queries=example_queries
)

golden_collection = chroma_client.get_or_create_collection(
    name=f'{COLLECTION_NAME}-golden',
    metadata={"hnsw:space": "cosine"}
)

golden_ids = golden_dataset['id'].tolist()
golden_queries = golden_dataset['query'].tolist()
golden_embeddings = openai_embed_in_batches(
    openai_client=openai_client,
    texts=golden_queries,
    model="text-embedding-3-large",
)

collection_add_in_batches(
    collection=golden_collection,
    ids=golden_ids,
    texts=golden_queries,
    embeddings=golden_embeddings,
)

golden_dataset.head()

## 5. Evaluate

Now that we have our golden dataset, we can run our evaluation.

### 5.1 Prepare Inputs

Embed generated queries.

**[Modify]** Change the embedding function (`openai_embed_in_batches`) to the one for the embedding model you wish to use

In [None]:
query_embeddings = openai_embed_in_batches(
    openai_client=openai_client,
    texts=golden_queries,
    model="text-embedding-3-large"
)

query_embeddings_lookup_dict = {
    id: QueryItem(
        text=query,
        embedding=embedding
    )
    for id, query, embedding in zip(golden_ids, golden_queries, query_embeddings)
}

query_embeddings_lookup = QueryLookup(lookup=query_embeddings_lookup_dict)

Create our qrels (query relevance labels) dataframe. In this case, each query and its corresponding document share the same id.

In [19]:
qrels = pd.DataFrame(
    {
        "query-id": golden_ids,
        "corpus-id": golden_ids,
        "score": 1
    }
)

### 5.2 Run Benchmark

In [None]:
results = run_benchmark(
    query_embeddings_lookup=query_embeddings_lookup,
    collection=corpus_collection,
    qrels=qrels
)
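
The metrics that `run_benchmark` reports are not listed in this notebook. As one example of a retrieval metric that can be computed from qrels, the sketch below implements Recall@k in plain Python; `recall_at_k` and the sample rankings are hypothetical and are not taken from `functions/evaluate.py`.

```python
def recall_at_k(ranked_ids_by_query, relevant_ids_by_query, k):
    """Mean fraction of relevant documents found in the top k results."""
    scores = []
    for query_id, relevant in relevant_ids_by_query.items():
        if not relevant:
            continue  # skip queries with no relevance labels
        top_k = ranked_ids_by_query.get(query_id, [])[:k]
        hits = len(set(top_k) & set(relevant))
        scores.append(hits / len(relevant))
    return sum(scores) / len(scores) if scores else 0.0


# Each query has exactly one relevant document, mirroring the qrels above
relevant = {"q1": {"d1"}, "q2": {"d2"}}
ranked = {"q1": ["d1", "d2", "d3"], "q2": ["d9", "d2", "d8"]}

recall_1 = recall_at_k(ranked, relevant, k=1)  # only q1 has its document at rank 1
recall_3 = recall_at_k(ranked, relevant, k=3)  # both documents appear in the top 3
```

With one relevant document per query, Recall@k is simply the share of queries whose source document appears in the top k results.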

Save results.

This is helpful for comparison (e.g. comparing different embedding models).

**[Modify]** "model" to the model you are using

In [21]:
timestamp = datetime.now().strftime("%Y-%m-%d--%H-%M-%S")
results_to_save = {
    "model": "text-embedding-3-large",
    "results": results
}

In [22]:
results_dir = Path("results")
# Create the output directory if it does not exist yet
results_dir.mkdir(parents=True, exist_ok=True)

with open(results_dir / f'{timestamp}.json', 'w') as f:
    json.dump(results_to_save, f)
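
To compare runs later, the saved files can be read back in. The sketch below assumes each file has the `{"model": ..., "results": ...}` shape written above; `load_results` is a hypothetical helper and is not the logic used in `compare.ipynb`.

```python
import json
from pathlib import Path


def load_results(results_dir):
    """Load every saved run from results_dir, ordered by file name."""
    runs = []
    # The timestamp format used for file names sorts chronologically
    for path in sorted(Path(results_dir).glob("*.json")):
        with open(path) as f:
            payload = json.load(f)
        runs.append(
            {
                "file": path.name,
                "model": payload.get("model"),
                "results": payload.get("results"),
            }
        )
    return runs
```

For example, `load_results("results")` returns one entry per saved run, which makes it easy to group runs by their `model` field.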

## Finished

Your dataset is ready!

Next steps:
1. Rerun this notebook using different embedding models
2. Use `compare.ipynb` to compare the retrieval performance of different models

In [None]:
# Build the flag first: reusing the same quote type inside an f-string fails before Python 3.12
local_flag = '' if using_chroma_cloud else ' --local'
print(f'View your dataset: `chroma browse {COLLECTION_NAME}{local_flag}`')
print(f'View the generated golden dataset: `chroma browse {COLLECTION_NAME}-golden{local_flag}`')