# Text-Only Zero-Shot Q/A Chunking Benchmark

See the [main repository](https://github.com/VectorChat/chunking_benchmarks) for the benchmarks. The chunks that we generated are located in the "Resultant Chunked Dataset" folder.

We use RAG (Retrieval-Augmented Generation) for this benchmark because it is a common downstream use of chunked text. Each chunk is embedded and stored so that it can be retrieved from a knowledge base by similarity to a query. The retrieved chunks are then passed to an LLM completion, which generates answers to questions about the text.
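
To make the retrieval step concrete, here is a minimal sketch of top-k retrieval by cosine similarity. The function names, the toy 3-dimensional vectors, and the chunk texts are all made up for illustration. A real run would use vectors from an embedding model and a vector DB's own search, not this in-memory loop.

```python
import math
from typing import List, Tuple

def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_chunks(query_vec: List[float],
                 chunk_vecs: List[Tuple[str, List[float]]],
                 k: int = 3) -> List[str]:
    """Return the texts of the k chunks most similar to the query vector."""
    ranked = sorted(
        chunk_vecs,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]

# Toy "embeddings" (hypothetical); real ones come from an embedding model.
chunks = [
    ("chunk about the harbor", [1.0, 0.0, 0.0]),
    ("chunk about the storm", [0.0, 1.0, 0.0]),
    ("chunk about the lighthouse", [0.9, 0.1, 0.0]),
]
query = [1.0, 0.0, 0.0]

print(top_k_chunks(query, chunks, k=2))
```

The chunks returned here are what would be handed to the LLM as context for answering a question.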

Here's the setup:

1. We chunk the text using various methods:
    - Unstructured.io's [partitioner](https://docs.unstructured.io/open-source/introduction/quick-start), with the semantic chunking strategy
    - AI21's [text segmentation API](https://docs.ai21.com/docs/text-segmentation-api), via [`AI21SemanticTextSplitter`](https://python.langchain.com/v0.2/docs/integrations/document_transformers/ai21_semantic_text_splitter/)
    - Our own chunker, based on our internal research
2. We save the chunks to a vector DB (e.g. [Pinecone](https://pinecone.io), [Astra](https://astra.datastax.com/)) for retrieval
3. We generate questions using a foundation LLM (GPT-4o, Claude 3.5 Sonnet, etc.) from a piece of fiction published after Llama 3 was trained
4. We run RAG question-answering for each chunking method and score the results on answer accuracy

### Here's some sample code to understand the basic idea

In [2]:
import asyncio
from enum import Enum
from typing import List, Dict

# Define our types up here

class ChunkingMethod(Enum):
    UNSTRUCTURED = "unstructured"
    AI21 = "ai21"
    CUSTOM = "custom"

class VectorDB(Enum):
    PINECONE = "pinecone"
    ASTRA = "astra"

# Chunk the text based on some method
async def chunk_text(text: str, method: ChunkingMethod) -> List[str]:    
    return ["chunk1", "chunk2", "chunk3"]

# Save the chunks to a vector database.
# `namespace` is an illustrative parameter that keeps each method's chunks separate.
async def save_chunks_to_vector_db(chunks: List[str], db: VectorDB, namespace: str) -> None:
    print(f"Saving {len(chunks)} chunks to {db.value} (namespace: {namespace})")

# Generate questions based on the input text
async def generate_questions(text: str, num_questions: int) -> List[str]:
    return ["Question 1?", "Question 2?", "Question 3?"]

# Do a top-k search to retrieve relevant chunks from one method's namespace
async def retrieve_relevant_chunks(question: str, db: VectorDB, namespace: str, top_k: int = 3) -> List[str]:
    return ["relevant chunk 1", "relevant chunk 2"]

# Generate an answer based on the question and context chunks
async def generate_answer(question: str, context: List[str]) -> str:
    return "Generated answer based on context"

# Evaluate the generated answer
async def evaluate_answer(generated_answer: str, correct_answer: str) -> float:
    return 0.85

# Run the benchmark
async def run_benchmark(text: str, chunking_methods: List[ChunkingMethod], vector_db: VectorDB) -> Dict[str, float]:
    results: Dict[str, float] = {}

    # Chunk text using different methods and save each method's chunks separately,
    # so that retrieval for one method never returns another method's chunks
    for method in chunking_methods:
        chunks = await chunk_text(text, method)
        await save_chunks_to_vector_db(chunks, vector_db, namespace=method.value)

    # Generate questions
    questions = await generate_questions(text, num_questions=10)

    # Run RAG question-answering for each chunking method
    for method in chunking_methods:
        method_scores = []
        for question in questions:
            relevant_chunks = await retrieve_relevant_chunks(question, vector_db, namespace=method.value)
            answer = await generate_answer(question, relevant_chunks)
            score = await evaluate_answer(answer, "correct_answer")  # Placeholder for correct answer
            method_scores.append(score)

        # Average score across all questions for this method
        results[method.value] = sum(method_scores) / len(method_scores)

    return results

In [None]:
text = "Long piece of fiction text..."
chunking_methods = [ChunkingMethod.UNSTRUCTURED, ChunkingMethod.AI21, ChunkingMethod.CUSTOM]
vector_db = VectorDB.ASTRA

benchmark_results = await run_benchmark(text, chunking_methods, vector_db)

print("Benchmark Results:")
for method, score in benchmark_results.items():
    print(f"{method}: {score:.2f}")

## Here are some results

These are the results we got after multiple runs of the benchmark. The "Percent Correct" column is the percentage of questions that the RAG pipeline answered correctly during the benchmark phase, excluding questions where the LLM gave an invalid response. The "Cost" column is the cost of the chunking method during the chunking phase, based on the number of API calls made and the cost per call.

| Method | Percent Correct | Cost |
|--------|-----------------|------|
| unstructured | 51.05% | $0.01 |
| ai21 | 54.01% | $0.0106 |
| chunking.com| **60.51%** | **$0.00504** |
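
As a rough illustration of how a "Percent Correct" figure with invalid responses excluded could be computed, here is a small sketch. The function name and the sample outcomes are hypothetical and are not taken from the benchmark data.

```python
from typing import List, Optional

def percent_correct(outcomes: List[Optional[bool]]) -> float:
    """Percentage of correct answers among valid responses.

    True = correct, False = incorrect, None = invalid LLM response (excluded).
    """
    valid = [o for o in outcomes if o is not None]
    if not valid:
        raise ValueError("no valid responses to score")
    return 100.0 * sum(valid) / len(valid)

# Hypothetical outcomes for 8 questions, 2 of which got invalid responses.
outcomes = [True, False, None, True, True, None, False, True]
print(f"{percent_correct(outcomes):.2f}%")
```

Dropping invalid responses from the denominator means a method is not penalized when the answering LLM fails to produce a usable answer.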

The Unstructured.io serverless API call processed about 10 pages at a cost of $0.001 per page. The AI21 API took 11 calls (with a limit of 100k characters per call).
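
The arithmetic behind these figures can be sketched as follows. The helper names are made up, and the character count passed to `calls_needed` is an invented value that happens to require 11 calls; it is not the novel's measured length.

```python
import math

def per_page_cost(pages: int, price_per_page: float) -> float:
    """Total cost for a service billed per page."""
    return pages * price_per_page

def calls_needed(total_chars: int, max_chars_per_call: int) -> int:
    """Number of API calls needed when each call accepts a limited number of characters."""
    return math.ceil(total_chars / max_chars_per_call)

# 10 pages at $0.001 per page, as stated above.
print(f"${per_page_cost(10, 0.001):.2f}")

# Hypothetical text length of 1,050,000 characters with a 100k-character limit per call.
print(calls_needed(1_050_000, 100_000))
```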

#### We used the following params:

| Param | Value |
|-------|-------|
| max_chunk_size | 300 |
| chunk_size_chars | 1200 | 
| top_k | 3 |
| embedding_model | text-embedding-3-small |

The novel had ~250k tokens (via Llama 3's tokenizer) and was chunked with a max chunk size of 300 tokens (~1200 characters). The cost is based either on the service's API calls or on the OpenAI embedding API calls.
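
For a back-of-the-envelope feel for these numbers, the sketch below estimates the minimum number of chunks and an embedding cost. The helper names are hypothetical, and the price of $0.02 per million tokens for `text-embedding-3-small` is an assumption that should be checked against current OpenAI pricing. Token counts from Llama 3's tokenizer and OpenAI's tokenizer will also differ somewhat, so this is only an approximation.

```python
import math

def min_chunk_count(total_tokens: int, max_chunk_tokens: int) -> int:
    """Lower bound on the number of chunks if every chunk were completely full."""
    return math.ceil(total_tokens / max_chunk_tokens)

def embedding_cost(total_tokens: int, usd_per_million_tokens: float) -> float:
    """Cost of embedding a given number of tokens at a per-million-token price."""
    return total_tokens * usd_per_million_tokens / 1_000_000

# ~250k tokens split into chunks of at most 300 tokens.
print(min_chunk_count(250_000, 300))

# Assumed price: $0.02 per 1M tokens (verify before relying on it).
print(f"${embedding_cost(250_000, 0.02):.4f}")
```

Real chunkers rarely fill every chunk to the limit, so the actual chunk count would be higher than this lower bound.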