# Evaluating Semantic-based Chunking with Ragas

This notebook compares the performance of a baseline RAG application using `RecursiveCharacterTextSplitter` against a semantic-based chunking approach, referred to here as TOCChunker and implemented with LlamaIndex's `HierarchicalNodeParser`.

## Notebook Structure
1. Dependencies and Setup
2. Baseline RAG Evaluation (RecursiveCharacterTextSplitter)
3. Evaluating the TOCChunker
4. Performance Comparison

## Dependencies and Setup

Install required packages:

In [1]:
!pip install ragas langchain langchain-openai langchain-community langchain-qdrant langgraph qdrant-client pymupdf openai pillow rapidfuzz


Set up API keys:

In [2]:
import os
from getpass import getpass
os.environ["OPENAI_API_KEY"] = getpass("Please enter your OpenAI API key!")

Utility functions:

In [3]:
import json


def printJSON(j):
    """Pretty-print a JSON-serializable object (debugging helper)."""
    print(json.dumps(j, indent=2))

## Baseline RAG Evaluation

### Data Preparation

Load the loan data documents:

In [4]:
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import PyMuPDFLoader

path = "data/"
loader = DirectoryLoader(path, glob="*.pdf", loader_cls=PyMuPDFLoader)
docs = loader.load()

### Synthetic Test Data Generation

Generate synthetic evaluation data using the Ragas knowledge-graph approach:

In [5]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings

generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())


In [6]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(docs[:20], testset_size=10)

In [7]:
dataset.to_pandas()

|   | user_input | reference_contexts | reference | synthesizer_name |
|---|---|---|---|---|
| 0 | How does the use of BBAY 3 affect Direct Loan ... | [non-term (includes clock-hour calendars), or ... | If substantially equal nonstandard terms in a ... | single_hop_specifc_query_synthesizer |
| 1 | As a Financial Aid Administrator, how does the... | [Inclusion of Clinical Work in a Standard Term... | If required osteopathic clinical work meets al... | single_hop_specifc_query_synthesizer |
| 2 | What are the Non-Term Characteristics that det... | [Non-Term Characteristics A program that measu... | A program is considered to have Non-Term Chara... | single_hop_specifc_query_synthesizer |
| 3 | When must the annual loan limit for a Direct L... | [both the credit or clock hours and the weeks ... | If a student enrolled in a program that is gre... | single_hop_specifc_query_synthesizer |
| 4 | What are the disbursement requirements for fed... | [<1-hop>\n\nboth the credit or clock hours and... | In clock-hour or non-term credit-hour programs... | multi_hop_abstract_query_synthesizer |
| 5 | what is the disbursement requirements for fede... | [<1-hop>\n\nboth the credit or clock hours and... | the disbursement requirements for federal stud... | multi_hop_abstract_query_synthesizer |
| 6 | What is the distinction between standard and n... | [<1-hop>\n\nInclusion of Clinical Work in a St... | The distinction between standard and nonstanda... | multi_hop_abstract_query_synthesizer |
| 7 | How do the disbursement requirements for feder... | [<1-hop>\n\nboth the credit or clock hours and... | In clock-hour or non-term credit-hour programs... | multi_hop_abstract_query_synthesizer |
| 8 | How do the characteristics of non-term and sub... | [<1-hop>\n\nnon-term (includes clock-hour cale... | The characteristics of non-term and subscripti... | multi_hop_specific_query_synthesizer |
| 9 | According to Volume 8, Chapter 3, how do the d... | [<1-hop>\n\nDisbursement Timing in Subscriptio... | Volume 8, Chapter 3 explains that in subscript... | multi_hop_specific_query_synthesizer |


### Baseline RAG Pipeline

Build the baseline RAG using RecursiveCharacterTextSplitter:

In [8]:
# Load documents again for RAG pipeline
path = "data/"
loader = DirectoryLoader(path, glob="*.pdf", loader_cls=PyMuPDFLoader)
docs = loader.load()

In [9]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
split_documents = text_splitter.split_documents(docs)
len(split_documents)

1102
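
To make the `chunk_size=1000, chunk_overlap=200` settings concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplification and not LangChain's implementation: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraph and line boundaries, and `sliding_window_chunks` is a hypothetical helper name used only for illustration.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Simplified illustration only: no separator-aware splitting.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
print(chunks)
```

Each chunk repeats the last `chunk_overlap` characters of the previous one, which is why a sentence that straddles a boundary still appears whole in at least one chunk when it is shorter than the overlap.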

In [10]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

In [11]:
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams

client = QdrantClient(":memory:")

client.create_collection(
    collection_name="loan_data",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

vector_store = QdrantVectorStore(
    client=client,
    collection_name="loan_data",
    embedding=embeddings,
)

In [12]:
_ = vector_store.add_documents(documents=split_documents)

In [13]:
retriever = vector_store.as_retriever(search_kwargs={"k": 5})
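
The retriever returns the `k=5` chunks whose embeddings are closest to the query embedding under the cosine distance configured for the collection. The sketch below illustrates that ranking on toy 3-dimensional vectors. It is a brute-force illustration, not how Qdrant searches internally, and `cosine_similarity`, `top_k`, and the toy vectors are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int) -> list[str]:
    """Return the ids of the k vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Toy stand-ins for 1536-dimensional embeddings (assumed values)
toy_vectors = {
    "chunk_a": [1.0, 0.0, 0.0],
    "chunk_b": [0.9, 0.1, 0.0],
    "chunk_c": [0.0, 1.0, 0.0],
}
top_two = top_k([2.0, 0.0, 0.0], toy_vectors, k=2)
print(top_two)
```

Cosine similarity ignores vector length, so the query `[2.0, 0.0, 0.0]` still matches `chunk_a` perfectly.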

In [14]:
def retrieve(state):
    retrieved_docs = retriever.invoke(state["question"])
    return {"context": retrieved_docs}

### RAG Prompt and Generation

In [15]:
from langchain.prompts import ChatPromptTemplate

RAG_PROMPT = """\
You are a helpful assistant who answers questions based on provided context. You must only use the provided context, and cannot use your own knowledge.

### Question
{question}

### Context
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

In [16]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4.1-nano")

In [17]:
def generate(state):
    docs_content = "\n\n".join(doc.page_content for doc in state["context"])
    messages = rag_prompt.format_messages(question=state["question"], context=docs_content)
    response = llm.invoke(messages)
    return {"response": response.content}

### LangGraph RAG Pipeline

In [18]:
from langgraph.graph import START, StateGraph
from typing_extensions import List, TypedDict
from langchain_core.documents import Document

class State(TypedDict):
    question: str
    context: List[Document]
    response: str

In [19]:
graph_builder = StateGraph(State).add_sequence([retrieve, generate])
graph_builder.add_edge(START, "retrieve")
graph = graph_builder.compile()
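
The graph runs `retrieve` and then `generate`, with each node returning a partial update that ends up in the shared state. A plain-Python sketch of that flow is shown here as a mental model only. It does not use LangGraph and is not how LangGraph is implemented, and `run_sequence`, `fake_retrieve`, and `fake_generate` are hypothetical stand-ins.

```python
def run_sequence(nodes, initial_state: dict) -> dict:
    """Call each node in order and merge its partial update into the state."""
    state = dict(initial_state)
    for node in nodes:
        update = node(state)   # a node sees the state accumulated so far
        state.update(update)   # and contributes only the keys it produces
    return state


def fake_retrieve(state: dict) -> dict:
    # Stand-in for the retriever: returns one made-up passage
    return {"context": [f"passage about {state['question']}"]}


def fake_generate(state: dict) -> dict:
    # Stand-in for the LLM call: echoes the joined context
    joined = "\n\n".join(state["context"])
    return {"response": f"Answer based on: {joined}"}


final_state = run_sequence([fake_retrieve, fake_generate], {"question": "loans"})
print(final_state["response"])
```

This is why `retrieve` only needs to return `{"context": ...}` and `generate` only `{"response": ...}`: the question stays in the state throughout.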

Test the baseline pipeline:

In [20]:
response = graph.invoke({"question": "What are the different kinds of loans?"})
response["response"]

'Based on the provided context, the different kinds of loans mentioned are:\n\n1. **Direct Loan**  \n   - This includes loans that are associated with academic programs, where the type of academic year and program structure can influence the monitoring and eligibility. The Direct Loan can be either subsidized or unsubsidized, with the latter allowing interest to accrue during in-school periods if the borrower chooses to pay it.\n\n2. **Direct Unsubsidized Loan**  \n   - A specific type of Direct Loan where interest accrues during periods when the borrower is in school, and there is an option to pay the interest while in school.\n\nThe context primarily discusses the structure and management of these loans rather than explicitly listing other types such as Stafford or PLUS loans, but it emphasizes the concept of direct loans, including unsubsidized varieties.'

### Baseline Evaluation with Ragas

Run the synthetic queries through the baseline pipeline:

In [21]:
for test_row in dataset:
    response = graph.invoke({"question": test_row.eval_sample.user_input})
    test_row.eval_sample.response = response["response"]
    test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]

In [22]:
from ragas import EvaluationDataset

evaluation_dataset = EvaluationDataset.from_pandas(dataset.to_pandas())

In [23]:
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-mini"))

In [24]:
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, ResponseRelevancy, ContextEntityRecall, NoiseSensitivity
from ragas import evaluate, RunConfig

custom_run_config = RunConfig(timeout=360)

baseline_result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
baseline_result

Evaluating:  99%|█████████▊| 71/72 [06:01<00:30, 30.39s/it]Exception raised in Job[53]: TimeoutError()
Evaluating: 100%|██████████| 72/72 [07:04<00:00,  5.90s/it]


{'context_recall': 0.7954, 'faithfulness': 0.9132, 'factual_correctness(mode=f1)': 0.6258, 'answer_relevancy': 0.9673, 'context_entity_recall': 0.3975, 'noise_sensitivity(mode=relevant)': 0.2987}

In [25]:
baseline_df = baseline_result.to_pandas()
baseline_df

|   | user_input | retrieved_contexts | reference_contexts | response | reference | context_recall | faithfulness | factual_correctness(mode=f1) | answer_relevancy | context_entity_recall | noise_sensitivity(mode=relevant) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | How does the use of BBAY 3 affect Direct Loan ... | [BBAY 3 for purposes of monitoring Direct Loan... | [non-term (includes clock-hour calendars), or ... | The use of BBAY 3 affects Direct Loan annual l... | If substantially equal nonstandard terms in a ... | 0.0 | 0.857143 | 0.33 | 0.999998 | 1.0 | 0.0 |
| 1 | As a Financial Aid Administrator, how does the... | [Credit hours associated with the practicum or... | [Inclusion of Clinical Work in a Standard Term... | As a Financial Aid Administrator, when a requi... | If required osteopathic clinical work meets al... | 1.0 | 0.941176 | 0.7 | 0.954285 | 0.266667 | 0.157895 |
| 2 | What are the Non-Term Characteristics that det... | [Non-Term Characteristics\nA program that meas... | [Non-Term Characteristics A program that measu... | The Non-Term Characteristics that determine if... | A program is considered to have Non-Term Chara... | 1.0 | 1.0 | 0.83 | 1.0 | 0.444444 | 0.888889 |
| 3 | When must the annual loan limit for a Direct L... | [information on Direct Loan annual loan limit ... | [both the credit or clock hours and the weeks ... | The annual loan limit for a Direct Loan must b... | If a student enrolled in a program that is gre... | 1.0 | 1.0 | 0.57 | 0.999999 | 0.777778 | 0.0 |
| 4 | What are the disbursement requirements for fed... | [Disbursement Timing in Subscription-Based Pro... | [<1-hop>\n\nboth the credit or clock hours and... | In clock-hour or non-term credit-hour programs... | In clock-hour or non-term credit-hour programs... | 0.833333 | 1.0 | 0.84 | 0.960329 | 0.222222 | 0.0 |
| 5 | what is the disbursement requirements for fede... | [Disbursement Timing in Subscription-Based Pro... | [<1-hop>\n\nboth the credit or clock hours and... | The disbursement requirements for federal stud... | the disbursement requirements for federal stud... | 1.0 | 0.933333 | 0.7 | 0.951837 | 0.357143 | 0.333333 |
| 6 | What is the distinction between standard and n... | [be offered in nonstandard terms. Also, like s... | [<1-hop>\n\nInclusion of Clinical Work in a St... | The distinction between standard and nonstanda... | The distinction between standard and nonstanda... | 1.0 | 1.0 | 0.63 | 0.92091 | 0.444444 | 0.375 |
| 7 | How do the disbursement requirements for feder... | [section below.\nExcept as noted above for the... | [<1-hop>\n\nboth the credit or clock hours and... | The disbursement requirements for federal stud... | In clock-hour or non-term credit-hour programs... | 0.666667 | 0.909091 | 0.46 | 0.987057 | 0.4375 | 0.363636 |
| 8 | How do the characteristics of non-term and sub... | [use of a Scheduled Academic Year (SAY), BBAY ... | [<1-hop>\n\nnon-term (includes clock-hour cale... | The characteristics of non-term and subscripti... | The characteristics of non-term and subscripti... | 0.777778 | 0.692308 | 0.61 | 0.966016 | 0.578947 | NaN |
| 9 | According to Volume 8, Chapter 3, how do the d... | [just one annual loan limit for the entire 110... | [<1-hop>\n\nDisbursement Timing in Subscriptio... | According to Volume 8, Chapter 3, the disburse... | Volume 8, Chapter 3 explains that in subscript... | 0.666667 | 0.875 | 0.5 | 0.951857 | 0.0 | 0.5 |
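
Row 8 has no `noise_sensitivity(mode=relevant)` value, which lines up with the `TimeoutError` logged for `Job[53]` during evaluation (10 rows × 6 metrics, so job 53 would be row 8, metric 5 if jobs are numbered row by row, which is an assumption). When such a column is averaged with pandas, `mean()` skips missing values by default, so that metric is averaged over 9 rows rather than 10. A minimal stdlib sketch of the same behaviour, with `nan_mean` and the toy scores as hypothetical names and values:

```python
import math


def nan_mean(values: list[float]) -> float:
    """Average the values, ignoring NaN entries (NaN if nothing is left)."""
    kept = [v for v in values if not math.isnan(v)]
    return sum(kept) / len(kept) if kept else float("nan")


# Toy scores: the middle entry stands in for a metric that timed out
scores = [0.2, float("nan"), 0.4]
print(nan_mean(scores))
```

With only 10 questions, one dropped row changes the denominator by 10%, so it is worth checking for missing values before comparing averages across runs.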


## Evaluating the TOCChunker

Now we'll implement the same RAG pipeline but using semantic-based chunking with TOCChunker instead of RecursiveCharacterTextSplitter.

### Install Additional Dependencies for TOCChunker

In [26]:
!pip install llama-index-core llama-index-readers-file

### Document Processing for TOCChunker

Convert the LangChain documents to a format compatible with TOCChunker (LlamaIndex `Document` objects):

In [27]:
from llama_index.core import Document as LlamaDocument
from llama_index.core.node_parser import HierarchicalNodeParser, get_leaf_nodes

# Convert LangChain documents to LlamaIndex format
llama_docs = []
for doc in docs:
    llama_doc = LlamaDocument(
        text=doc.page_content,
        metadata=doc.metadata
    )
    llama_docs.append(llama_doc)

print(f"Converted {len(llama_docs)} documents to LlamaIndex format")

Converted 269 documents to LlamaIndex format


### Implement TOCChunker Approach

Create hierarchical chunks using LlamaIndex's HierarchicalNodeParser:

In [29]:
# Create hierarchical node parser (similar to TOCChunker approach)
node_parser = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[2048, 1024, 512],  # Different levels of chunking
    chunk_overlap=20,
)

# Parse documents into hierarchical nodes
nodes = node_parser.get_nodes_from_documents(llama_docs)

# Get leaf nodes (finest level chunks)
leaf_nodes = get_leaf_nodes(nodes)
print(f"Created {len(leaf_nodes)} semantic chunks using hierarchical parsing")

Created 698 semantic chunks using hierarchical parsing
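
The idea behind hierarchical parsing is to split each document at a coarse size, re-split every resulting chunk at the next finer size, and keep only the finest level (the leaves) for indexing. The sketch below shows that idea on raw characters. It is a simplification under stated assumptions: LlamaIndex's parser works on tokens and sentence boundaries, applies overlap, and records parent/child links, none of which is modelled here, and `split_fixed` and `hierarchical_leaves` are hypothetical helpers.

```python
def split_fixed(text: str, size: int) -> list[str]:
    """Cut text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def hierarchical_leaves(text: str, chunk_sizes: list[int]) -> list[str]:
    """Split coarse-to-fine and return only the finest-level chunks."""
    level = [text]
    for size in chunk_sizes:  # e.g. [2048, 1024, 512] in the notebook
        level = [piece for chunk in level for piece in split_fixed(chunk, size)]
    return level


toy_text = "abcdefghijklmnopqrst"  # 20 characters
leaves = hierarchical_leaves(toy_text, chunk_sizes=[8, 4, 2])
print(len(leaves), leaves[:3])
```

Because a leaf never crosses the boundary of its parent chunk, the leaves differ from what a single flat split at the smallest size would produce whenever the sizes do not divide evenly.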


### Convert Back to LangChain Format

Convert the semantic chunks back to LangChain Document format:

In [30]:
from langchain_core.documents import Document as LangChainDocument

# Convert LlamaIndex nodes back to LangChain documents
semantic_documents = []
for node in leaf_nodes:
    doc = LangChainDocument(
        page_content=node.text,
        metadata=node.metadata
    )
    semantic_documents.append(doc)

print(f"Converted {len(semantic_documents)} semantic chunks to LangChain format")

Converted 698 semantic chunks to LangChain format


### Create New Vector Store for TOCChunker

Build a new Qdrant vector store with the semantic chunks:

In [31]:
# Create a new Qdrant client and collection for the semantic chunks
semantic_client = QdrantClient(":memory:")

semantic_client.create_collection(
    collection_name="loan_data_semantic",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

semantic_vector_store = QdrantVectorStore(
    client=semantic_client,
    collection_name="loan_data_semantic",
    embedding=embeddings,
)

In [32]:
# Add semantic documents to the vector store
_ = semantic_vector_store.add_documents(documents=semantic_documents)

In [33]:
# Create retriever for semantic chunks
semantic_retriever = semantic_vector_store.as_retriever(search_kwargs={"k": 5})

### Build Semantic RAG Pipeline

Create the same pipeline structure but with semantic retrieval:

In [34]:
def retrieve_semantic(state):
    retrieved_docs = semantic_retriever.invoke(state["question"])
    return {"context": retrieved_docs}

In [35]:
class SemanticState(TypedDict):
    question: str
    context: List[Document]
    response: str

semantic_graph_builder = StateGraph(SemanticState).add_sequence([retrieve_semantic, generate])
semantic_graph_builder.add_edge(START, "retrieve_semantic")
semantic_graph = semantic_graph_builder.compile()

Test the semantic pipeline:

In [36]:
semantic_response = semantic_graph.invoke({"question": "What are the different kinds of loans?"})
semantic_response["response"]

'The different kinds of loans mentioned in the provided context include:\n\n1. **Direct Loans**\n   - Subsidized Loans\n   - Unsubsidized Loans\n   - PLUS Loans (including Direct PLUS Loans)\n\n2. **Federal and Non-Federal Loans**\n   - Federal Direct Loans (subcategories above)\n   - Private Loans\n   - State-sponsored Loans\n   - Institutional Loans\n\n3. **Loans used to replace certain education-related benefits**\n   - Education savings accounts such as TEACH Grants and AmeriCorps education awards\n   - Foster care benefits received under Title IV, Part E, of the Social Security Act, including education and training vouchers and room and board benefits\n   - Emergency financial assistance (such as emergency grants or short-term loans)\n\n4. **Private Education Loans**\n   - Including income share agreements (ISAs) used to finance postsecondary education expenses, which are considered private education loans.\n\nThese cover various federal and private financing options available to 

### Evaluate TOCChunker Approach

Run the same evaluation using the semantic chunking approach:

In [37]:
import copy
import time

# Create a copy of the dataset for semantic evaluation
semantic_dataset = copy.deepcopy(dataset)

# Run queries through the semantic pipeline
for test_row in semantic_dataset:
    response = semantic_graph.invoke({"question": test_row.eval_sample.user_input})
    test_row.eval_sample.response = response["response"]
    test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]
    time.sleep(1)  # Rate limiting

In [38]:
semantic_evaluation_dataset = EvaluationDataset.from_pandas(semantic_dataset.to_pandas())

In [39]:
semantic_result = evaluate(
    dataset=semantic_evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
semantic_result

Evaluating:  99%|█████████▊| 71/72 [05:27<00:19, 19.56s/it]Exception raised in Job[53]: TimeoutError()
Evaluating: 100%|██████████| 72/72 [07:02<00:00,  5.87s/it]


{'context_recall': 0.8972, 'faithfulness': 0.9341, 'factual_correctness(mode=f1)': 0.5933, 'answer_relevancy': 0.9653, 'context_entity_recall': 0.3332, 'noise_sensitivity(mode=relevant)': 0.2934}

In [40]:
semantic_df = semantic_result.to_pandas()
semantic_df

|   | user_input | retrieved_contexts | reference_contexts | response | reference | context_recall | faithfulness | factual_correctness(mode=f1) | answer_relevancy | context_entity_recall | noise_sensitivity(mode=relevant) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | How does the use of BBAY 3 affect Direct Loan ... | [regains eligibility for a new annual loan lim... | [non-term (includes clock-hour calendars), or ... | The use of BBAY 3 affects Direct Loan annual l... | If substantially equal nonstandard terms in a ... | 1.0 | 0.857143 | 0.22 | 0.999998 | 0.5 | 0.0 |
| 1 | As a Financial Aid Administrator, how does the... | [Inclusion of Clinical Work in a Standard Term... | [Inclusion of Clinical Work in a Standard Term... | As a Financial Aid Administrator, the inclusio... | If required osteopathic clinical work meets al... | 1.0 | 1.0 | 0.6 | 0.959953 | 0.266667 | 0.428571 |
| 2 | What are the Non-Term Characteristics that det... | [Nonstandard Terms\nGenerally, nonstandard ter... | [Non-Term Characteristics A program that measu... | The Non-Term Characteristics that determine if... | A program is considered to have Non-Term Chara... | 1.0 | 0.857143 | 0.29 | 1.0 | 0.333333 | 0.285714 |
| 3 | When must the annual loan limit for a Direct L... | [Specifically, if a student enrolled in a prog... | [both the credit or clock hours and the weeks ... | The annual loan limit for a Direct Loan must b... | If a student enrolled in a program that is gre... | 1.0 | 1.0 | 0.8 | 1.0 | 0.444444 | 0.25 |
| 4 | What are the disbursement requirements for fed... | [Except as noted above for the Direct Loan Pro... | [<1-hop>\n\nboth the credit or clock hours and... | In clock-hour or non-term credit-hour programs... | In clock-hour or non-term credit-hour programs... | 0.833333 | 1.0 | 0.71 | 0.958516 | 0.375 | 0.294118 |
| 5 | what is the disbursement requirements for fede... | [Disbursement Timing in Subscription-Based Pro... | [<1-hop>\n\nboth the credit or clock hours and... | The disbursement requirements for federal stud... | the disbursement requirements for federal stud... | 1.0 | 0.933333 | 0.7 | 0.952477 | 0.428571 | 0.375 |
| 6 | What is the distinction between standard and n... | [Nonstandard Terms\nGenerally, nonstandard ter... | [<1-hop>\n\nInclusion of Clinical Work in a St... | The distinction between standard and nonstanda... | The distinction between standard and nonstanda... | 1.0 | 1.0 | 0.64 | 0.926679 | 0.55 | 0.285714 |
| 7 | How do the disbursement requirements for feder... | [Except as noted above for the Direct Loan Pro... | [<1-hop>\n\nboth the credit or clock hours and... | The disbursement requirements differ between c... | In clock-hour or non-term credit-hour programs... | 0.666667 | 1.0 | 0.67 | 0.967309 | 0.35 | 0.227273 |
| 8 | How do the characteristics of non-term and sub... | [Substantially\nequal nonstandard terms may be... | [<1-hop>\n\nnon-term (includes clock-hour cale... | The characteristics of non-term and subscripti... | The characteristics of non-term and subscripti... | 1.0 | 0.911765 | 0.7 | 0.966189 | 0.5 | NaN |
| 9 | According to Volume 8, Chapter 3, how do the d... | [For example, a school might offer an\n1100 cl... | [<1-hop>\n\nDisbursement Timing in Subscriptio... | According to Volume 8, Chapter 3, the disburse... | Volume 8, Chapter 3 explains that in subscript... | 0.666667 | 0.727273 | 0.57 | 0.953025 | 0.0 | 0.6 |


## Performance Comparison

Compare the results of the baseline `RecursiveCharacterTextSplitter` approach with those of the semantic TOCChunker approach:

In [56]:
import pandas as pd

# Extract metric averages
baseline_metrics = {
    'approach': 'Baseline (RecursiveCharacterTextSplitter)',
    'context_recall': baseline_df['context_recall'].mean(),
    'faithfulness': baseline_df['faithfulness'].mean(),
    'factual_correctness': baseline_df['factual_correctness(mode=f1)'].mean(),
    'answer_relevancy': baseline_df['answer_relevancy'].mean(),
    'context_entity_recall': baseline_df['context_entity_recall'].mean(),
    'noise_sensitivity_relevant': baseline_df['noise_sensitivity(mode=relevant)'].mean()
}

semantic_metrics = {
    'approach': 'Semantic (TOCChunker)',
    'context_recall': semantic_df['context_recall'].mean(),
    'faithfulness': semantic_df['faithfulness'].mean(),
    'factual_correctness': semantic_df['factual_correctness(mode=f1)'].mean(),
    'answer_relevancy': semantic_df['answer_relevancy'].mean(),
    'context_entity_recall': semantic_df['context_entity_recall'].mean(),
    'noise_sensitivity_relevant': semantic_df['noise_sensitivity(mode=relevant)'].mean()
}

# Create comparison DataFrame
comparison_df = pd.DataFrame([baseline_metrics, semantic_metrics])
comparison_df

|   | approach | context_recall | faithfulness | factual_correctness | answer_relevancy | context_entity_recall | noise_sensitivity_relevant |
|---|---|---|---|---|---|---|---|
| 0 | Baseline (RecursiveCharacterTextSplitter) | 0.79537 | 0.913171 | 0.625833 | 0.967331 | 0.397505 | 0.298675 |
| 1 | Semantic (TOCChunker) | 0.897222 | 0.934144 | 0.593333 | 0.965309 | 0.333168 | 0.293378 |
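
To read the table more easily, the absolute and relative change per metric can be computed directly from the two rows. The sketch below hard-codes the averages reported above; the dictionary names and the `deltas` structure are introduced here only for illustration.

```python
# Averages copied from the comparison table above
baseline = {
    "context_recall": 0.79537,
    "faithfulness": 0.913171,
    "factual_correctness": 0.625833,
    "answer_relevancy": 0.967331,
    "context_entity_recall": 0.397505,
    "noise_sensitivity_relevant": 0.298675,
}
semantic = {
    "context_recall": 0.897222,
    "faithfulness": 0.934144,
    "factual_correctness": 0.593333,
    "answer_relevancy": 0.965309,
    "context_entity_recall": 0.333168,
    "noise_sensitivity_relevant": 0.293378,
}

# metric name -> (absolute change, relative change in percent)
deltas = {}
for name, base in baseline.items():
    absolute = semantic[name] - base
    deltas[name] = (absolute, 100 * absolute / base)
    print(f"{name:28s} {absolute:+.4f} ({deltas[name][1]:+.1f}%)")
```

Note that for noise sensitivity a lower score is the better one, so a negative change there is an improvement, unlike for the other five metrics.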


### Analysis

The comparison shows the performance differences between:

1. **Baseline Approach**: Uses RecursiveCharacterTextSplitter with fixed chunk sizes (1000 characters, 200 overlap)
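2. **Semantic Approach**: Uses hierarchical parsing (`HierarchicalNodeParser` with chunk sizes 2048/1024/512 and overlap 20, keeping only the leaf nodes) to create semantically meaningful chunks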

Key metrics to focus on:
- **Answer Relevancy**: How relevant the generated answers are to the questions
- **Faithfulness**: How well the answers stick to the provided context
- **Context Recall**: How well the retrieval captures relevant information
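
As a rough illustration of what a score like context recall measures: the reference answer is broken into claims, each claim is judged as supported or not by the retrieved contexts, and the score is the supported fraction. In Ragas an LLM judge produces those verdicts; in the sketch below they are hypothetical hard-coded booleans, and `context_recall` is an illustrative helper rather than the library's function.

```python
def context_recall(claim_supported: list[bool]) -> float:
    """Fraction of reference claims that the retrieved contexts support."""
    if not claim_supported:
        raise ValueError("at least one reference claim is required")
    return sum(claim_supported) / len(claim_supported)


# Hypothetical verdicts: five of six reference claims found in the context
score = context_recall([True, True, True, True, True, False])
print(round(score, 6))
```

Per-row values such as 0.833333 or 0.666667 in the result tables are consistent with ratios like 5/6 and 2/3, which is a reminder that each row's score rests on only a handful of claims.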

The semantic chunking approach aims to preserve document structure and meaning, potentially leading to better context coherence and improved retrieval performance.

### Conclusion
Overall, `context_recall` improves noticeably (0.795 to 0.897). Among the remaining metrics, `faithfulness` improves slightly (0.913 to 0.934) and `noise_sensitivity` is marginally lower (0.299 to 0.293; lower is better for this metric), while `factual_correctness`, `answer_relevancy`, and `context_entity_recall` deteriorate slightly. It could be argued that none of these smaller changes is significant: each is under 0.07 in absolute terms, and the test set has only 10 questions. So although this approach to chunking (or at least how it was implemented here) seemingly gets us better retrieval metrics, it did not move the needle on the more important metric of `answer_relevancy`, as opposed to reranking, where we use a model (Cohere's) trained on Q&A data to help us rank the chunks. My intuition is that the key reason is that while we are pulling more chunks that are **semantically similar**, they may not be as **relevant** to the query.