# Chroma Chunking Evaluation Notebook

In this notebook, we demonstrate how to evaluate various popular chunking methods. Additionally, we show you how to create your own synthetic dataset for domain-specific evaluation.

We recommend you make a copy of this notebook so that you can edit and run the cells yourself.

## 1. Setup

### 1.1. Install the Chunking Evaluation Package

First, we need to install the Chunking Evaluation package from GitHub.

In [None]:
# Install the necessary package
!pip install git+https://github.com/brandonstarxel/chunking_evaluation.git

### 1.2. Import Required Modules
We will import the necessary modules from the Chunking Evaluation package.

In [None]:
from chunking_evaluation.chunking import FixedTokenChunker, RecursiveTokenChunker, ClusterSemanticChunker, LLMSemanticChunker, KamradtModifiedChunker
from chunking_evaluation import GeneralEvaluation, SyntheticEvaluation, BaseChunker
from chunking_evaluation.utils import openai_token_count
from chromadb.utils import embedding_functions
import pandas as pd
from IPython.display import display, clear_output
import http.client
import os

### 1.3. Load OpenAI API Key
We need to load the OpenAI API key, which can be done either by using Colab's secret storage or by manually entering the key.

In [None]:
# Load the API key from Google Colab secrets, or enter it manually below
from google.colab import userdata
OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
# Uncomment the line below if you prefer to manually enter the key
# OPENAI_API_KEY = "YOUR_OPENAI_KEY"

### 1.4. Set Up the Embedding Function
We'll set up the embedding function using the OpenAI API key.

In [None]:
ef = embedding_functions.OpenAIEmbeddingFunction(api_key=OPENAI_API_KEY, model_name="text-embedding-3-large")

# Chunking Evaluation
This section shows how to evaluate several popular chunking algorithms, and how to use our custom chunking methods alongside them.

## 2. Define Chunkers
We define a list of chunking methods to evaluate. Include only the methods you want to compare. The semantic chunking methods take roughly a minute each, while the token chunking methods are faster.

### 2.1 RecursiveCharacterTextSplitter & TokenTextSplitter
These two chunkers are adapted from LangChain's RecursiveCharacterTextSplitter and TokenTextSplitter. We rename them to RecursiveTokenChunker and FixedTokenChunker, respectively, for consistency.

In [None]:
chunkers = [
    RecursiveTokenChunker(chunk_size=800, chunk_overlap=400, length_function=openai_token_count),
    FixedTokenChunker(chunk_size=800, chunk_overlap=400, encoding_name="cl100k_base"),
    RecursiveTokenChunker(chunk_size=400, chunk_overlap=200, length_function=openai_token_count),
    FixedTokenChunker(chunk_size=400, chunk_overlap=200, encoding_name="cl100k_base"),
    RecursiveTokenChunker(chunk_size=400, chunk_overlap=0, length_function=openai_token_count),
    FixedTokenChunker(chunk_size=400, chunk_overlap=0, encoding_name="cl100k_base"),
    RecursiveTokenChunker(chunk_size=200, chunk_overlap=0, length_function=openai_token_count),
    FixedTokenChunker(chunk_size=200, chunk_overlap=0, encoding_name="cl100k_base"),
]
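
To make the chunk_size and chunk_overlap arguments above concrete, here is a minimal sketch of fixed-size windowing with overlap. It is not the package's implementation: the function name `fixed_token_windows` is hypothetical, and whitespace-separated words stand in for the cl100k_base tokens that the real chunkers count.

```python
def fixed_token_windows(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the token list
        if start + chunk_size >= len(tokens):
            break
    return windows


# Assumption: whitespace splitting is only a stand-in for a real tokenizer
tokens = "the quick brown fox jumps over the lazy dog".split()  # 9 tokens
no_overlap = fixed_token_windows(tokens, chunk_size=4, chunk_overlap=0)
half_overlap = fixed_token_windows(tokens, chunk_size=4, chunk_overlap=2)
print(len(no_overlap), len(half_overlap))
```

With half-size overlap each window advances by only half its length, so the same text yields more chunks and every interior token appears in two of them.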

### 2.2 KamradtModifiedChunker
Below is the first semantic chunker. Run the following cell if you would like to evaluate it.

In [None]:
chunkers.append(
    KamradtModifiedChunker(avg_chunk_size = 400, embedding_function = ef)
)

### 2.3 ClusterSemanticChunker
Below are our ClusterSemanticChunkers.

In [None]:
chunkers.extend(
    [
        ClusterSemanticChunker(embedding_function=ef, max_chunk_size=400, length_function=openai_token_count),
        ClusterSemanticChunker(embedding_function=ef, max_chunk_size=200, length_function=openai_token_count)
    ]
)

### 2.4 LLMSemanticChunker
Finally, we can evaluate the LLMSemanticChunker. It can be run with any OpenAI or Anthropic model by setting organisation and model_name.

In [None]:
# organisation="openai" and model_name="gpt-4o" are the defaults; they are set explicitly here for completeness
chunkers.append(
    LLMSemanticChunker(organisation="openai", model_name="gpt-4o", api_key=OPENAI_API_KEY)
)

## 3. Initialize & Run Evaluations
We will evaluate each chunker and display the results dynamically.

In [None]:
# Initialize evaluation
evaluation = GeneralEvaluation()

results = []

# Initialize an empty DataFrame
df = pd.DataFrame()

# Display the DataFrame
display_handle = display(df, display_id=True)

for chunker in chunkers:
    result = evaluation.run(chunker, ef, retrieve=5)
    # Chunkers that do not expose these attributes are labelled with 0
    chunk_size = getattr(chunker, '_chunk_size', 0)
    chunk_overlap = getattr(chunker, '_chunk_overlap', 0)
    result['stats']['chunker'] = f"{chunker.__class__.__name__}_{chunk_size}_{chunk_overlap}"
    results.append(result['stats'])

    # Update the DataFrame
    df = pd.DataFrame(results)
    clear_output(wait=True)
    display_handle.update(df)

| | iou_mean | iou_std | recall_mean | recall_std | precision_omega_mean | precision_omega_std | precision_mean | precision_std | chunker |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.015208 | 0.013013 | 0.864303 | 0.337623 | 0.066825 | 0.052225 | 0.015209 | 0.013012 | RecursiveTokenChunker_800_400 |
| 1 | 0.013655 | 0.011268 | 0.882846 | 0.311666 | 0.046589 | 0.030681 | 0.01366 | 0.011271 | FixedTokenChunker_800_400 |
| 2 | 0.032967 | 0.027472 | 0.876947 | 0.320614 | 0.139428 | 0.104232 | 0.032982 | 0.027483 | RecursiveTokenChunker_400_200 |
| 3 | 0.027268 | 0.021392 | 0.902341 | 0.273911 | 0.084455 | 0.050817 | 0.027303 | 0.021422 | FixedTokenChunker_400_200 |
| 4 | 0.036366 | 0.031817 | 0.907787 | 0.27949 | 0.17743 | 0.140447 | 0.036391 | 0.031832 | RecursiveTokenChunker_400_0 |
| 5 | 0.027071 | 0.021837 | 0.892321 | 0.29195 | 0.125022 | 0.081435 | 0.027101 | 0.021855 | FixedTokenChunker_400_0 |
| 6 | 0.070706 | 0.056176 | 0.895857 | 0.280983 | 0.299416 | 0.184212 | 0.070978 | 0.056285 | RecursiveTokenChunker_200_0 |
| 7 | 0.052563 | 0.040383 | 0.890414 | 0.280266 | 0.209799 | 0.118731 | 0.052784 | 0.040563 | FixedTokenChunker_200_0 |
| 8 | 0.022893 | 0.023497 | 0.869239 | 0.32888 | 0.101903 | 0.108392 | 0.022908 | 0.023518 | KamradtModifiedChunker_0_0 |
| 9 | 0.04506 | 0.033662 | 0.916413 | 0.249236 | 0.207381 | 0.144587 | 0.045203 | 0.033809 | ClusterSemanticChunker_400_0 |
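
In the table above, recall stays between roughly 0.86 and 0.92 for every chunker, while precision and IoU rise as the chunk size shrinks. The sketch below may help explain that pattern. It assumes the columns follow the usual overlap definitions of recall, precision and intersection-over-union between the relevant text and the retrieved text; the package's exact definitions (including precision_omega, which is not modelled here) are not given in this notebook. The function name `overlap_metrics` and the character ranges are made up for illustration.

```python
def overlap_metrics(relevant, retrieved):
    """Compare two sets of character positions: the reference excerpt and the retrieved text."""
    intersection = len(relevant & retrieved)
    union = len(relevant | retrieved)
    # Guard against empty sets so the ratios are always defined
    recall = intersection / len(relevant) if relevant else 0.0
    precision = intersection / len(retrieved) if retrieved else 0.0
    iou = intersection / union if union else 0.0
    return {"recall": recall, "precision": precision, "iou": iou}


# Hypothetical example: a 50-character excerpt inside one retrieved 400-character chunk
relevant = set(range(100, 150))
retrieved = set(range(0, 400))
metrics = overlap_metrics(relevant, retrieved)
print(metrics)
```

Under these assumptions the chunk covers the whole excerpt, so recall is perfect, but most of the retrieved text is irrelevant, so precision and IoU are low. Smaller chunks retrieve less surrounding text for the same excerpt.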


## 4. Display Final Results

In [None]:
df

| | iou_mean | iou_std | recall_mean | recall_std | precision_omega_mean | precision_omega_std | precision_mean | precision_std | chunker |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.015208 | 0.013013 | 0.864303 | 0.337623 | 0.066825 | 0.052225 | 0.015209 | 0.013012 | RecursiveTokenChunker_800_400 |
| 1 | 0.013655 | 0.011268 | 0.882846 | 0.311666 | 0.046589 | 0.030681 | 0.01366 | 0.011271 | FixedTokenChunker_800_400 |
| 2 | 0.032967 | 0.027472 | 0.876947 | 0.320614 | 0.139428 | 0.104232 | 0.032982 | 0.027483 | RecursiveTokenChunker_400_200 |
| 3 | 0.027268 | 0.021392 | 0.902341 | 0.273911 | 0.084455 | 0.050817 | 0.027303 | 0.021422 | FixedTokenChunker_400_200 |
| 4 | 0.036366 | 0.031817 | 0.907787 | 0.27949 | 0.17743 | 0.140447 | 0.036391 | 0.031832 | RecursiveTokenChunker_400_0 |
| 5 | 0.027071 | 0.021837 | 0.892321 | 0.29195 | 0.125022 | 0.081435 | 0.027101 | 0.021855 | FixedTokenChunker_400_0 |
| 6 | 0.070706 | 0.056176 | 0.895857 | 0.280983 | 0.299416 | 0.184212 | 0.070978 | 0.056285 | RecursiveTokenChunker_200_0 |
| 7 | 0.052563 | 0.040383 | 0.890414 | 0.280266 | 0.209799 | 0.118731 | 0.052784 | 0.040563 | FixedTokenChunker_200_0 |
| 8 | 0.022893 | 0.023497 | 0.869239 | 0.32888 | 0.101903 | 0.108392 | 0.022908 | 0.023518 | KamradtModifiedChunker_0_0 |
| 9 | 0.04506 | 0.033662 | 0.916413 | 0.249236 | 0.207381 | 0.144587 | 0.045203 | 0.033809 | ClusterSemanticChunker_400_0 |


# Synthetic Dataset Pipeline for Domain-Specific Evaluation

In this section we demonstrate how you can create a synthetic dataset for domain-specific evaluation. Here we do this over public texts from Project Gutenberg.

First, make sure you have run the cells in sections 1.1 to 1.4.

## 5.1. Prepare Corpora

In [None]:
def download_text(book_id, file_name, directory):
    """Download a Project Gutenberg book as plain text and save it under `directory`."""
    conn = http.client.HTTPSConnection("www.gutenberg.org")
    url = f"/files/{book_id}/{book_id}-0.txt"

    try:
        conn.request("GET", url)
        response = conn.getresponse()

        if response.status == 200:
            text = response.read().decode('utf-8')

            # Create directory if it does not exist
            os.makedirs(directory, exist_ok=True)

            # Save the text to the specified file within the directory
            file_path = os.path.join(directory, file_name)
            with open(file_path, "w", encoding="utf-8") as file:
                file.write(text)
            print(f"Book '{file_name}' downloaded and saved successfully in '{directory}'.")
        else:
            print(f"Failed to download the book. Status code: {response.status}")
    finally:
        # Always release the connection, even if the request fails
        conn.close()

# Define the books to download with their IDs and file names
books = {
    1661: "the_adventures_of_sherlock_holmes.txt",
    1342: "pride_and_prejudice.txt",
    174: "the_picture_of_dorian_gray.txt"
}

# Define the directory to save the books
directory = "corpora"

# Download each book
for book_id, file_name in books.items():
    download_text(book_id, file_name, directory)


Book 'the_adventures_of_sherlock_holmes.txt' downloaded and saved successfully in 'corpora'.
Book 'pride_and_prejudice.txt' downloaded and saved successfully in 'corpora'.
Book 'the_picture_of_dorian_gray.txt' downloaded and saved successfully in 'corpora'.


## 5.2. Initialize the Environment
Specify the corpora paths and output CSV file, and initialize the evaluation.

In [None]:
# Specify the corpora paths and output CSV file
corpora_paths = [
    'corpora/the_adventures_of_sherlock_holmes.txt',
    'corpora/pride_and_prejudice.txt',
    'corpora/the_picture_of_dorian_gray.txt',
]
queries_csv_path = 'generated_queries_and_excerpts.csv'

# Initialize the evaluation
evaluation = SyntheticEvaluation(corpora_paths, queries_csv_path, openai_api_key=OPENAI_API_KEY)


## 5.3. Generate Queries and Excerpts
Generate queries and excerpts, and save them to a CSV file. Note that you may interrupt this cell early if you are happy with the number of queries generated.

In [None]:
# Generate queries and excerpts, and save to CSV
evaluation.generate_queries_and_excerpts(approximate_excerpts=True, num_rounds=1, queries_per_corpus=3)

Trying Query 0
Trying Query 1
Trying Query 2
Trying Query 0
Trying Query 1
Trying Query 2
Error occurred: Each reference must contain 'start_chunk' and 'end_chunk' keys.
Trying Query 2
Trying Query 0
Trying Query 1
Trying Query 2
Error occurred: Each reference must contain 'start_chunk' and 'end_chunk' keys.
Trying Query 2


## 5.4. Apply Filters
Apply filters to remove queries with poor excerpts and duplicates.

In [None]:
# Apply filter to remove queries with poor excerpts
evaluation.filter_poor_excerpts(threshold=0.36)

# Apply filter to remove duplicates
evaluation.filter_duplicates(threshold=0.6)

Corpus: corpora/the_adventures_of_sherlock_holmes.txt - Removed 0 .
Corpus: corpora/pride_and_prejudice.txt - Removed 2 .
Corpus: corpora/the_picture_of_dorian_gray.txt - Removed 1 .
Corpus: corpora/the_adventures_of_sherlock_holmes.txt - Removed 0 .
Corpus: corpora/pride_and_prejudice.txt - Removed 0 .
Corpus: corpora/the_picture_of_dorian_gray.txt - Removed 0 .
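
As a rough illustration of what a duplicate filter with a threshold of 0.6 might do, the sketch below drops any query whose embedding is too similar to one already kept. This is an assumption about the general technique, not a description of how filter_duplicates is implemented: the function names are hypothetical, and the two-dimensional vectors are made-up stand-ins for real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def drop_near_duplicates(items, threshold):
    """Keep an item only if its similarity to every already-kept item is below threshold."""
    kept = []
    for text, vector in items:
        if all(cosine_similarity(vector, kept_vec) < threshold for _, kept_vec in kept):
            kept.append((text, vector))
    return kept


# Hypothetical queries with toy embeddings (not produced by a real model)
toy_queries = [
    ("Who hired Holmes?", [1.0, 0.0]),
    ("Who engaged Holmes?", [0.9, 0.1]),
    ("Where does Darcy live?", [0.0, 1.0]),
]
unique = drop_near_duplicates(toy_queries, threshold=0.6)
print([text for text, _ in unique])
```

In this toy setup, a lower threshold removes more queries because weaker similarity is already enough to count as a duplicate.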


## 5.5. Define a Custom Chunker
Define a custom chunking class for evaluation.

In [None]:
# Define a custom chunking class
class CustomChunker(BaseChunker):
    def split_text(self, text):
        # Split the text into consecutive 1200-character chunks with no overlap
        return [text[i:i+1200] for i in range(0, len(text), 1200)]
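
Before running a full evaluation, it can be worth checking a custom chunker's logic on a small string. The sketch below copies the slicing expression into a standalone function (the name `fixed_char_chunks` is hypothetical) so it runs without the package installed, then confirms that the chunks are contiguous and rejoin to the original text.

```python
def fixed_char_chunks(text, size=1200):
    """Standalone copy of the slicing logic: consecutive, non-overlapping character windows."""
    return [text[i:i + size] for i in range(0, len(text), size)]


sample = "abcdefghij" * 300  # 3000 characters of filler text
chunks = fixed_char_chunks(sample)
lengths = [len(chunk) for chunk in chunks]
print(lengths)

# Every character ends up in exactly one chunk, in order
assert "".join(chunks) == sample
```

Because the split points are fixed character offsets, a chunk can begin or end in the middle of a word or sentence.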

## 5.6. Run the Evaluation
Instantiate the custom chunker and evaluate it over the filtered data.

In [None]:
# Instantiate the custom chunker
chunker = CustomChunker()

# Run the evaluation on the filtered data
results = evaluation.run(chunker, embedding_function=ef)

# Print results via pandas
df_results = pd.DataFrame([results['stats']])
df_results.head()

| | iou_mean | iou_std | recall_mean | recall_std | precision_omega_mean | precision_omega_std | precision_mean | precision_std |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.043485 | 0.02436 | 0.875114 | 0.206952 | 0.173333 | 0.055689 | 0.043667 | 0.024236 |
