# Semantic Chunking RAG Evaluation with RAGAS

This notebook implements and evaluates a RAG application using semantic chunking strategies compared to naive retrieval.

## Objectives:
- Baseline LangGraph RAG Application using NAIVE RETRIEVAL
- Baseline Evaluation using RAGAS METRICS
- Implement SEMANTIC CHUNKING STRATEGY
- Create LangGraph RAG Application using SEMANTIC CHUNKING with NAIVE RETRIEVAL
- Compare and contrast results

## RAGAS Metrics:
- Faithfulness
- Answer Relevancy
- Context Precision
- Context Recall
- Factual Correctness
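
As a rough illustration of what a ranking-aware metric such as Context Precision rewards, the sketch below averages precision@k over the ranks that hold a relevant chunk. It assumes binary relevance labels supplied by hand; RAGAS derives these verdicts with an LLM judge, so this is only a toy model of the scoring idea, and the function name `context_precision` is hypothetical.

```python
from typing import List

def context_precision(relevance: List[int]) -> float:
    """Mean of precision@k taken at each rank k that holds a relevant chunk.

    `relevance` is a hand-labelled list (1 = relevant, 0 = not) in retrieval order.
    """
    hits = 0
    weighted_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            weighted_sum += hits / rank  # precision@rank at a relevant position
    return weighted_sum / hits if hits else 0.0

# Same two relevant chunks, ranked early versus late.
early = context_precision([1, 1, 0])
late = context_precision([0, 1, 1])
print(f"relevant chunks first: {early:.4f}")
print(f"relevant chunks last:  {late:.4f}")
```

Both rankings contain the same relevant chunks, but the one that places them first scores higher, which is the behaviour the metric is meant to capture.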

## Our Semantic Chunking Strategy

We implement a semantic chunking strategy that groups semantically similar sentences together while strictly enforcing size constraints.



## Setup and Dependencies


In [1]:
import os
import pandas as pd
import numpy as np
from typing import List, Dict, Any
import re
import nltk
from nltk.tokenize import sent_tokenize
from sklearn.metrics.pairwise import cosine_similarity

# LangChain imports
from langchain_community.document_loaders import DirectoryLoader, PyMuPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Qdrant
from langchain_core.documents import Document
from langchain.prompts import ChatPromptTemplate

# LangGraph imports
from langgraph.graph import START, StateGraph
from typing_extensions import TypedDict

# RAGAS imports
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.testset import TestsetGenerator
from ragas.metrics import (
    Faithfulness,
    AnswerRelevancy,
    ContextPrecision,
    FactualCorrectness, 
    LLMContextRecall
)
from ragas import evaluate, RunConfig
from ragas.dataset_schema import EvaluationDataset

# Download required NLTK data
try:
    nltk.data.find('tokenizers/punkt_tab')
except LookupError:
    nltk.download('punkt_tab')

print("Dependencies loaded successfully!")

Dependencies loaded successfully!


## Environment Setup


In [3]:
from getpass import getpass

# Set up API keys
os.environ["OPENAI_API_KEY"] = getpass("Please enter your OpenAI API key!")

# Initialize LLM and embeddings
llm = ChatOpenAI(model="gpt-4.1-nano")
embeddings = OpenAIEmbeddings()


## Data Loading and Preparation


In [4]:
# Load documents
path = "data/"
loader = DirectoryLoader(path, glob="*.pdf", loader_cls=PyMuPDFLoader)
docs = loader.load()

print(f"Loaded {len(docs)} documents")
print(f"Total characters: {sum(len(doc.page_content) for doc in docs)}")

# Display document info
for i, doc in enumerate(docs):
    print(f"Document {i+1}: {len(doc.page_content)} characters")


Loaded 64 documents
Total characters: 112856
Document 1: 1672 characters
Document 2: 2051 characters
Document 3: 3823 characters
Document 4: 4164 characters
Document 5: 3965 characters
Document 6: 3635 characters
Document 7: 2858 characters
Document 8: 2895 characters
Document 9: 2323 characters
Document 10: 2940 characters
Document 11: 2303 characters
Document 12: 2195 characters
Document 13: 1669 characters
Document 14: 1523 characters
Document 15: 1440 characters
Document 16: 2425 characters
Document 17: 863 characters
Document 18: 2086 characters
Document 19: 3097 characters
Document 20: 1205 characters
Document 21: 2087 characters
Document 22: 2144 characters
Document 23: 453 characters
Document 24: 608 characters
Document 25: 1888 characters
Document 26: 474 characters
Document 27: 3051 characters
Document 28: 982 characters
Document 29: 2180 characters
Document 30: 2192 characters
Document 31: 117 characters
Document 32: 1228 characters
Document 33: 3898 characters
Document 34: 

## Baseline RAG Application (Naive Retrieval)


In [15]:
# Create baseline chunks using RecursiveCharacterTextSplitter
baseline_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500, chunk_overlap=0,
)

baseline_chunks = baseline_splitter.split_documents(docs)
print(f"Baseline chunks created: {len(baseline_chunks)}")

baseline_chunk_sizes = [len(chunk.page_content) for chunk in baseline_chunks]
print(f"Baseline chunk size stats:")
print(f"  Min: {min(baseline_chunk_sizes)}")
print(f"  Max: {max(baseline_chunk_sizes)}")
print(f"  Mean: {np.mean(baseline_chunk_sizes):.2f}")
print(f"  Median: {np.median(baseline_chunk_sizes):.2f}")

# Show distribution
print(f"\n📈 Character Distribution:")
for i, chunk in enumerate(baseline_chunks[:5]):  # Show first 5 chunks
    print(f"Chunk {i+1}: {len(chunk.page_content):,} characters")
if len(baseline_chunks) > 5:
    print(f"... and {len(baseline_chunks)-5} more chunks")




Baseline chunks created: 275
Baseline chunk size stats:
  Min: 2
  Max: 499
  Mean: 409.60
  Median: 448.00

📈 Character Distribution:
Chunk 1: 474 characters
Chunk 2: 485 characters
Chunk 3: 486 characters
Chunk 4: 222 characters
Chunk 5: 473 characters
... and 270 more chunks


In [16]:
# Create vector store for baseline
baseline_vector_store = Qdrant.from_documents(
    documents=baseline_chunks,
    embedding=embeddings,
    location=":memory:",
    collection_name="baseline_rag"
)

baseline_retriever = baseline_vector_store.as_retriever(search_kwargs={"k": 3})
print("Baseline vector store and retriever created!")

Baseline vector store and retriever created!
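
Naive retrieval here means ranking chunks by vector similarity to the question and keeping the top `k`. The sketch below shows that idea with hand-made 2-D vectors and cosine similarity; it is an assumption-laden illustration, not how Qdrant is implemented internally, and the names `cosine` and `top_k` are hypothetical.

```python
import math
from typing import List, Tuple

def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k(query_vec: List[float], chunk_vecs: List[List[float]], k: int) -> List[Tuple[int, float]]:
    """Return (chunk index, similarity) pairs for the k most similar chunks."""
    scored = [(idx, cosine(query_vec, vec)) for idx, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy "embeddings" standing in for real chunk vectors.
toy_chunks = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0], [-1.0, 0.0]]
hits = top_k([1.0, 0.1], toy_chunks, k=3)
top_indices = [idx for idx, _ in hits]
print(top_indices)
```

With `k=3`, the chunk pointing away from the query (index 3) is the one left out.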


In [17]:
class BaselineState(TypedDict):
    question: str
    context: List[Document]
    response: str

# Define retrieval function
def baseline_retrieve(state: BaselineState) -> Dict[str, List[Document]]:
    """Retrieve relevant documents for the question"""
    retrieved_docs = baseline_retriever.invoke(state["question"])
    return {"context": retrieved_docs}

# Define RAG prompt
RAG_PROMPT = """
You are a helpful assistant who answers questions based on provided context. 
You must only use the provided context, and cannot use your own knowledge.

### Question
{question}

### Context
{context}

Please provide a comprehensive answer based on the context above.
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

# Define generation function
def baseline_generate(state: BaselineState) -> Dict[str, Any]:
    """Generate response based on retrieved context"""
    docs_content = "\n\n".join(doc.page_content for doc in state["context"])
    messages = rag_prompt.format_messages(question=state["question"], context=docs_content)
    response = llm.invoke(messages)
    return {"response": response.content}

# Build baseline RAG graph
baseline_graph_builder = StateGraph(BaselineState).add_sequence([baseline_retrieve, baseline_generate])
baseline_graph_builder.add_edge(START, "baseline_retrieve")
baseline_graph = baseline_graph_builder.compile()

print("Baseline RAG application created successfully!")


Baseline RAG application created successfully!


## Custom Semantic Chunking Implementation

We implement a single function for semantic chunking that groups similar sentences and enforces a maximum chunk size:


### **Core Algorithm:**
1. **Sentence Embedding**: Generate embeddings for all sentences using OpenAI embeddings
2. **Semantic Grouping**: Group sentences greedily: each ungrouped sentence seeds a group and pulls in later sentences whose cosine similarity to it meets `similarity_threshold`
3. **Size Enforcement**: Apply strict size limits with multi-level fallback strategy
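
The grouping step can be sketched in isolation as follows. This is a minimal standalone version that uses hand-made 2-D vectors in place of OpenAI embeddings and a pure-Python cosine similarity in place of scikit-learn; `greedy_groups` and the toy data are hypothetical names for illustration.

```python
import math
from typing import List

def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def greedy_groups(vectors: List[List[float]], threshold: float) -> List[List[int]]:
    """Each unused sentence seeds a group and pulls in later unused sentences
    whose similarity to the seed is at least `threshold`."""
    groups: List[List[int]] = []
    used = set()
    for i in range(len(vectors)):
        if i in used:
            continue
        group = [i]
        used.add(i)
        for j in range(i + 1, len(vectors)):
            if j not in used and cosine(vectors[i], vectors[j]) >= threshold:
                group.append(j)
                used.add(j)
        groups.append(group)
    return groups

# Sentences 0 and 2 point one way, sentences 1 and 3 another.
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
toy_groups = greedy_groups(toy_vectors, threshold=0.8)
print(toy_groups)
```

Because similarity is measured against the seed sentence only, a group can contain sentences that are not adjacent in the source text, as the toy result shows.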

### **Multi-Level Fallback Strategy:**
- **Level 1**: Semantic grouping based on sentence similarity
- **Level 2**: Paragraph splitting when groups exceed `max_chunk_size`
- **Level 3**: Sentence splitting when paragraphs exceed `max_chunk_size`
- **Level 4**: Word splitting when sentences exceed `max_chunk_size`
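
The last fallback level can be sketched as a small word-packing routine. This is a simplified standalone version, not the notebook's exact code, and `split_by_words` is a hypothetical name; it packs whitespace-separated words into pieces no longer than the limit.

```python
from typing import List

def split_by_words(text: str, max_chunk_size: int) -> List[str]:
    """Pack whitespace-separated words into pieces of at most max_chunk_size characters.

    A single word longer than the limit is emitted on its own and still exceeds it.
    """
    pieces: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) > max_chunk_size and current:
            # Adding the word would overflow: close the current piece first.
            pieces.append(current)
            current = word
        else:
            current = candidate
    if current:
        pieces.append(current)
    return pieces

pieces = split_by_words("alpha beta gamma delta epsilon", max_chunk_size=11)
print(pieces)
```

Splitting on words keeps every piece readable, at the cost of not being able to shrink a single word that is itself longer than the limit.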

### **Key Features:**
- **Semantic Coherence**: Groups related sentences together for better context
- **Size Compliance**: Enforces `max_chunk_size` on every chunk; only a single word longer than the limit can exceed it
- **Flexible Thresholds**: Configurable similarity threshold and size limits
- **Metadata Tracking**: Comprehensive metadata for evaluation and debugging
- **Edge Cases**: Handles single-sentence documents and overlong sentences explicitly

### **Expected Benefits Over Naive Chunking:**
- Better semantic coherence in retrieved contexts
- Reduced chunk fragmentation
- Improved answer quality through better context grouping
- Maintains semantic relationships while respecting size constraints 




In [None]:
def semantic_chunk_documents(
    documents: List[Document], 
    embeddings_model: OpenAIEmbeddings,
    similarity_threshold: float = 0.7,
    max_chunk_size: int = 1000,
    min_chunk_size: int = 50
) -> List[Document]:
    """
    Semantic chunking in three steps:
    - Group semantically similar sentences (cosine similarity >= similarity_threshold)
    - Join each group into one chunk, splitting greedily by paragraph, sentence,
      then word when the joined text exceeds max_chunk_size
    - Drop chunks shorter than min_chunk_size characters (multi-sentence documents only)
    """
    
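    def preprocess_text(text: str) -> str:
        """Collapse every whitespace run (including newlines) into a single space"""
        # Note: paragraph breaks ("\n\n") are removed here, so the paragraph-level
        # split in create_chunks_from_groups sees a single paragraph and falls
        # through to sentence-level splitting.
        return re.sub(r'\s+', ' ', text.strip())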
    
    def get_sentence_embeddings(sentences: List[str]) -> np.ndarray:
        """Get embeddings for sentences"""
        embeddings = embeddings_model.embed_documents(sentences)
        return np.array(embeddings)
    
    def group_similar_sentences(sentences: List[str], embeddings: np.ndarray, similarity_threshold: float) -> List[List[int]]:
        """Group semantically similar sentences greedily"""
        similarity_matrix = cosine_similarity(embeddings)
        groups = []
        used_indices = set()
        
        for i in range(len(sentences)):
            if i in used_indices:
                continue
                
            group = [i]
            used_indices.add(i)
            
            # Greedily find similar sentences
            for j in range(i + 1, len(sentences)):
                if j in used_indices:
                    continue
                    
                if similarity_matrix[i][j] >= similarity_threshold:
                    group.append(j)
                    used_indices.add(j)
            
            groups.append(group)
        
        return groups
    
    def handle_single_sentence(text: str, doc: Document, max_chunk_size: int) -> List[Document]:
        """Handle documents with single sentences, enforcing max_chunk_size"""
        chunks = []
        
        if len(text) > max_chunk_size:
            # Split the single sentence by words if it's too long
            words = text.split()
            current_chunk = ""
            chunk_id = 0
            
            for word in words:
                if len(current_chunk + word) > max_chunk_size:
                    if current_chunk.strip():
                        chunk_doc = Document(page_content=current_chunk.strip(), metadata=doc.metadata.copy())
                        chunk_doc.metadata['chunking_method'] = 'semantic'
                        chunk_doc.metadata['chunk_id'] = chunk_id
                        chunk_doc.metadata['chunk_size'] = len(current_chunk.strip())
                        chunk_doc.metadata['config'] = {
                            'similarity_threshold': similarity_threshold,
                            'max_chunk_size': max_chunk_size,
                            'min_chunk_size': min_chunk_size
                        }
                        chunks.append(chunk_doc)
                        chunk_id += 1
                    current_chunk = word + " "
                else:
                    current_chunk += word + " "
            
            if current_chunk.strip():
                chunk_doc = Document(page_content=current_chunk.strip(), metadata=doc.metadata.copy())
                chunk_doc.metadata['chunking_method'] = 'semantic'
                chunk_doc.metadata['chunk_id'] = chunk_id
                chunk_doc.metadata['chunk_size'] = len(current_chunk.strip())
                chunk_doc.metadata['config'] = {
                    'similarity_threshold': similarity_threshold,
                    'max_chunk_size': max_chunk_size,
                    'min_chunk_size': min_chunk_size
                }
                chunks.append(chunk_doc)
        else:
            # Single sentence within limits
            chunk_doc = Document(page_content=text, metadata=doc.metadata.copy())
            chunk_doc.metadata['chunking_method'] = 'semantic'
            chunk_doc.metadata['chunk_id'] = 0
            chunk_doc.metadata['chunk_size'] = len(text)
            chunk_doc.metadata['config'] = {
                'similarity_threshold': similarity_threshold,
                'max_chunk_size': max_chunk_size,
                'min_chunk_size': min_chunk_size
            }
            chunks.append(chunk_doc)
        
        return chunks
    
    def create_chunks_from_groups(sentences: List[str], groups: List[List[int]]) -> List[str]:
        """Create chunks from grouped sentences, strictly enforcing max chunk size"""
        chunks = []
        
        for group in groups:
            # Sort group indices to maintain order
            group.sort()
            group_sentences = [sentences[i] for i in group]
            chunk_text = ' '.join(group_sentences)
            
            # Strictly enforce max_chunk_size
            if len(chunk_text) > max_chunk_size:
                # Split by paragraphs first
                paragraphs = chunk_text.split('\n\n')
                current_chunk = ""
                
                for paragraph in paragraphs:
                    # If single paragraph exceeds limit, split it by sentences
                    if len(paragraph) > max_chunk_size:
                        # Save current chunk if it exists
                        if current_chunk.strip():
                            chunks.append(current_chunk.strip())
                            current_chunk = ""
                        
                        # Split paragraph into sentences and add them individually
                        para_sentences = sent_tokenize(paragraph)
                        for sentence in para_sentences:
                            # If single sentence exceeds limit, split by words
                            if len(sentence) > max_chunk_size:
                                if current_chunk.strip():
                                    chunks.append(current_chunk.strip())
                                    current_chunk = ""
                                
                                # Split sentence by words
                                words = sentence.split()
                                temp_chunk = ""
                                for word in words:
                                    if len(temp_chunk + word) > max_chunk_size:
                                        if temp_chunk.strip():
                                            chunks.append(temp_chunk.strip())
                                        temp_chunk = word + " "
                                    else:
                                        temp_chunk += word + " "
                                
                                if temp_chunk.strip():
                                    current_chunk = temp_chunk.strip()
                            else:
                                # Check if adding this sentence would exceed limit
                                if len(current_chunk + " " + sentence) > max_chunk_size:
                                    if current_chunk.strip():
                                        chunks.append(current_chunk.strip())
                                    current_chunk = sentence
                                else:
                                    current_chunk += (" " + sentence) if current_chunk else sentence
                    else:
                        # Check if adding this paragraph would exceed limit
                        if len(current_chunk + "\n\n" + paragraph) > max_chunk_size:
                            if current_chunk.strip():
                                chunks.append(current_chunk.strip())
                            current_chunk = paragraph
                        else:
                            current_chunk += ("\n\n" + paragraph) if current_chunk else paragraph
                
                if current_chunk.strip():
                    chunks.append(current_chunk.strip())
            else:
                chunks.append(chunk_text)
        
        return chunks
    
    # Main processing
    all_chunks = []
    
    for doc in documents:
        text = preprocess_text(doc.page_content)
        
        # Split into sentences
        sentences = sent_tokenize(text)
        
        if len(sentences) <= 1:
            single_sentence_chunks = handle_single_sentence(text, doc, max_chunk_size)
            all_chunks.extend(single_sentence_chunks)
            continue
        
        # Get embeddings for sentences
        sentence_embeddings = get_sentence_embeddings(sentences)
        
        # Group similar sentences
        groups = group_similar_sentences(sentences, sentence_embeddings, similarity_threshold)
        
        # Create chunks from groups while strictly enforcing max_chunk_size
        chunks = create_chunks_from_groups(sentences, groups)
        
        # Drop chunks shorter than min_chunk_size characters
        filtered_chunks = [chunk for chunk in chunks if len(chunk) >= min_chunk_size]
        
        # Create Document objects
        for i, chunk in enumerate(filtered_chunks):
            metadata = doc.metadata.copy()
            metadata['chunk_id'] = i
            metadata['chunking_method'] = 'semantic'
            metadata['chunk_size'] = len(chunk)
            metadata['config'] = {
                'similarity_threshold': similarity_threshold,
                'max_chunk_size': max_chunk_size,
                'min_chunk_size': min_chunk_size
            }
            all_chunks.append(Document(page_content=chunk, metadata=metadata))
    
    return all_chunks


In [35]:
# Create semantic chunks 
semantic_chunks = semantic_chunk_documents(
    documents=docs,
    embeddings_model=embeddings,
    similarity_threshold=0.8,
    max_chunk_size=700,
    min_chunk_size=50
)
print(f"Semantic chunks created: {len(semantic_chunks)}")

# Display chunk statistics
chunk_sizes = [len(chunk.page_content) for chunk in semantic_chunks]
print(f"Semantic chunk size stats:")
print(f"  Min: {min(chunk_sizes)}")
print(f"  Max: {max(chunk_sizes)}")
print(f"  Mean: {np.mean(chunk_sizes):.2f}")
print(f"  Median: {np.median(chunk_sizes):.2f}")


# Show distribution
print(f"\n📈 Character Distribution:")
for i, chunk in enumerate(semantic_chunks[:5]):  # Show first 5 chunks
    print(f"Chunk {i+1}: {len(chunk.page_content):,} characters")
if len(semantic_chunks) > 5:
    print(f"... and {len(semantic_chunks)-5} more chunks")


Semantic chunks created: 355
Semantic chunk size stats:
  Min: 51
  Max: 699
  Mean: 311.97
  Median: 253.00

📈 Character Distribution:
Chunk 1: 693 characters
Chunk 2: 273 characters
Chunk 3: 107 characters
Chunk 4: 156 characters
Chunk 5: 99 characters
... and 350 more chunks


In [39]:
baseline_chunks[2].page_content

'At least one co-author has disclosed additional relationships of potential relevance for this research. \nFurther information is available online at http://www.nber.org/papers/w34255\nNBER working papers are circulated for discussion and comment purposes. They have not been \npeer-reviewed or been subject to the review by the NBER Board of Directors that accompanies \nofficial NBER publications.\n© 2025 by Aaron Chatterji, Thomas Cunningham, David J. Deming, Zoe Hitzig, Christopher Ong,'

In [38]:
semantic_chunks[2].page_content

'We especially thank Tyna Eloundou and Pamela Mishkin who in several ways laid the foundation for this work.'

In [40]:
# Create vector store for semantic chunks
semantic_vector_store = Qdrant.from_documents(
    documents=semantic_chunks,
    embedding=embeddings,
    location=":memory:",
    collection_name="semantic_rag"
)

semantic_retriever = semantic_vector_store.as_retriever(search_kwargs={"k": 5})
print("Semantic vector store and retriever created!")

Semantic vector store and retriever created!


In [41]:
# Define LangGraph state for semantic RAG
class SemanticState(TypedDict):
    question: str
    context: List[Document]
    response: str

# Define retrieval function for semantic RAG
def semantic_retrieve(state: SemanticState) -> Dict[str, List[Document]]:
    """Retrieve relevant documents for the question using semantic chunks"""
    retrieved_docs = semantic_retriever.invoke(state["question"])
    return {"context": retrieved_docs}

# Define generation function for semantic RAG
def semantic_generate(state: SemanticState) -> Dict[str, Any]:
    """Generate response based on retrieved semantic context"""
    docs_content = "\n\n".join(doc.page_content for doc in state["context"])
    messages = rag_prompt.format_messages(question=state["question"], context=docs_content)
    response = llm.invoke(messages)
    return {"response": response.content}

# Build semantic RAG graph
semantic_graph_builder = StateGraph(SemanticState).add_sequence([semantic_retrieve, semantic_generate])
semantic_graph_builder.add_edge(START, "semantic_retrieve")
semantic_graph = semantic_graph_builder.compile()

print("Semantic RAG application created successfully!")


Semantic RAG application created successfully!


## Create Synthetic Data for Evaluation

In [53]:
# Evaluator LLM and embeddings that RAGAS uses for test-set generation and scoring
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-mini"))
evaluator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

In [44]:
generator = TestsetGenerator(llm=evaluator_llm, embedding_model=evaluator_embeddings)
dataset = generator.generate_with_langchain_docs(docs, testset_size=10)



# # Load existing test data
# import ast 

# test_data = pd.read_csv("test_data_RAG_nb1.csv")
# print(f"Loaded {len(test_data)} test samples from CSV")

# # Fix the reference_contexts column - convert string representations to actual lists
# test_data['reference_contexts'] = test_data['reference_contexts'].apply(
#     lambda x: ast.literal_eval(x) if isinstance(x, str) else x
# )

# print("✅ Fixed reference_contexts format")

# # Convert to RAGAS dataset format
# dataset = EvaluationDataset.from_pandas(test_data)
# print("✅ Using existing test data instead of generating new one")

Applying HeadlinesExtractor:   0%|          | 0/21 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/64 [00:00<?, ?it/s]

unable to apply transformation: 'headlines' property not found in this node

Applying SummaryExtractor:   0%|          | 0/38 [00:00<?, ?it/s]

Property 'summary' already exists in node 'aac21c'. Skipping!

Applying CustomNodeFilter:   0%|          | 0/8 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/48 [00:00<?, ?it/s]

Property 'summary_embedding' already exists in node '8bd3e0'. Skipping!

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/12 [00:00<?, ?it/s]

In [46]:
dataset_df = dataset.to_pandas()

# save evaluation dataset df
dataset_df.to_csv("evaluation_dataset_AB.csv", index=False)

dataset_df


| | user_input | reference_contexts | reference | synthesizer_name |
|---|---|---|---|---|
| 0 | What does Acemoglu (2024) contribute to the di... | [Introduction ChatGPT launched in November 202... | The context notes that the sudden growth in la... | single_hop_specifc_query_synthesizer |
| 1 | Could you provide a detailed breakdown of the ... | [Month Non-Work (M) (%) Work (M) (%) Total Mes... | In Jun 2024, the total number of messages was ... | single_hop_specifc_query_synthesizer |
| 2 | What were the key trends in ChatGPT message us... | [Table 1: ChatGPT daily message counts (millio... | Leading up to the 26th of June 2024, ChatGPT e... | single_hop_specifc_query_synthesizer |
| 3 | chatgpt business is what? | [Variation by Occupation Figure 23 presents va... | ChatGPT Business (formerly known as Teams) is ... | single_hop_specifc_query_synthesizer |
| 4 | How does the role of ChatGPT as an advisor and... | [<1-hop>\n\nConclusion This paper studies the ... | The role of ChatGPT as an advisor and decision... | multi_hop_abstract_query_synthesizer |
| 5 | how chatgpt get so many users so fast and what... | [<1-hop>\n\nIntroduction ChatGPT launched in N... | chatgpt launched in november 2022 and by july ... | multi_hop_abstract_query_synthesizer |
| 6 | How does the rapid adoption and diffusion of C... | [<1-hop>\n\nIntroduction ChatGPT launched in N... | The rapid adoption and diffusion of ChatGPT, a... | multi_hop_abstract_query_synthesizer |
| 7 | Based on the provided message volume statistic... | [<1-hop>\n\nMonth Non-Work (M) (%) Work (M) (%... | Between June 2024 and June 2025, the total dai... | multi_hop_abstract_query_synthesizer |
| 8 | By July 2025, how did the rapid growth of Chat... | [<1-hop>\n\nIntroduction ChatGPT launched in N... | By July 2025, ChatGPT had achieved unprecedent... | multi_hop_specific_query_synthesizer |
| 9 | What was the total number of messages sent on ... | [<1-hop>\n\nIntroduction ChatGPT launched in N... | In June 2025, a total of 2,627 million (2.627 ... | multi_hop_specific_query_synthesizer |
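
Saving this dataset to CSV turns list-valued columns such as `reference_contexts` into strings, which is why the commented-out loader above applies `ast.literal_eval` before rebuilding the dataset. The sketch below reproduces the effect with the standard-library `csv` module as a stand-in for pandas; the sample row is made up for illustration.

```python
import ast
import csv
import io

# A made-up row with a list-valued column, mirroring reference_contexts.
rows = [{"user_input": "q1", "reference_contexts": ["ctx a", "ctx b"]}]

# Write to an in-memory CSV; the list is stored as its string representation.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["user_input", "reference_contexts"])
writer.writeheader()
writer.writerows(rows)

# Read it back: the column is now a plain string, not a list.
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
raw_value = loaded[0]["reference_contexts"]
print(type(raw_value).__name__, raw_value)

# literal_eval safely parses the string back into a Python list.
restored = ast.literal_eval(raw_value)
print(type(restored).__name__, restored)
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for restoring these columns.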



## Evaluation Comparison between RAG systems

In [None]:
metrics = [LLMContextRecall(), ContextPrecision(), Faithfulness(), FactualCorrectness(), AnswerRelevancy()]

In [None]:
# Step 1a: Generate responses from Baseline RAG system
import copy

print("🔄 Running Baseline RAG system on test dataset...")
for test_row in dataset:
    response = baseline_graph.invoke({"question": test_row.eval_sample.user_input})
    test_row.eval_sample.response = response["response"]
    test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]

print("✅ Baseline RAG responses generated!")


🔄 Running Baseline RAG system on test dataset...
✅ Baseline RAG responses generated!


In [50]:
# Step 1b: Evaluate Baseline RAG system
print("🔄 Evaluating Baseline RAG system...")

baseline_evaluation_dataset = EvaluationDataset.from_pandas(dataset.to_pandas())

baseline_result = evaluate(
    dataset=baseline_evaluation_dataset,
    metrics=metrics,
    llm=evaluator_llm,
    run_config=RunConfig(timeout=360)
)

print("✅ Baseline evaluation completed!")
baseline_result


🔄 Evaluating Baseline RAG system...


Evaluating:   0%|          | 0/60 [00:00<?, ?it/s]

Exception raised in Job[19]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4.1 in organization org-GqxpYrKHFaUcOMjEhQsI0XlV on tokens per min (TPM): Limit 30000, Used 29490, Requested 1079. Please try again in 1.138s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}})
Exception raised in Job[22]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4.1 in organization org-GqxpYrKHFaUcOMjEhQsI0XlV on tokens per min (TPM): Limit 30000, Used 30000, Requested 1812. Please try again in 3.624s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}})
Exception raised in Job[26]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4.1 in organization org-GqxpYrKHFaUcOMjEhQsI0XlV on tokens per min (TPM): Limit 30000, Used 30000, Reque

✅ Baseline evaluation completed!


{'context_recall': 0.4722, 'faithfulness': 0.7648, 'factual_correctness': 0.5755, 'answer_relevancy': 0.7900, 'context_precision': 0.5909}

In [51]:
# Step 2a: Generate responses from Semantic RAG system
semantic_dataset = copy.deepcopy(dataset)

print("🔄 Running Semantic RAG system on test dataset...")
for test_row in semantic_dataset:
    response = semantic_graph.invoke({"question": test_row.eval_sample.user_input})
    test_row.eval_sample.response = response["response"]
    test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]

print("✅ Semantic RAG responses generated!")


🔄 Running Semantic RAG system on test dataset...
✅ Semantic RAG responses generated!


In [54]:
# Step 2b: Evaluate Semantic RAG system
print("🔄 Evaluating Semantic RAG system...")

semantic_evaluation_dataset = EvaluationDataset.from_pandas(semantic_dataset.to_pandas())

semantic_result = evaluate(
    dataset=semantic_evaluation_dataset,
    metrics=metrics,
    llm=evaluator_llm,
    run_config=RunConfig(timeout=600, max_retries=3, max_workers=4, max_wait=60)
)

print("✅ Semantic evaluation completed!")
semantic_result


🔄 Evaluating Semantic RAG system...


Evaluating:   0%|          | 0/60 [00:00<?, ?it/s]

✅ Semantic evaluation completed!


{'context_recall': 0.7639, 'faithfulness': 0.8815, 'factual_correctness': 0.5933, 'answer_relevancy': 0.9555, 'context_precision': 0.6906}

## 📊 Comprehensive Evaluation Comparison

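This section compares the **Baseline RAG** (using RecursiveCharacterTextSplitter) and the **Semantic RAG** (using semantic chunking) systems across five RAGAS metrics. The two pipelines differ in more than the chunking method: the baseline uses 500-character chunks and retrieves `k=3` chunks, while the semantic pipeline allows chunks of up to 700 characters and retrieves `k=5`. The baseline evaluation also logged rate-limit errors, so some baseline scores may be averaged over fewer samples. Treat the differences below as indicative rather than attributable to chunking alone.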


In [55]:

# Extract metric scores from results
baseline_scores = {
    "context_recall": baseline_result['context_recall'],
    "context_precision": baseline_result['context_precision'],
    "faithfulness": baseline_result['faithfulness'],
    "factual_correctness": baseline_result['factual_correctness'],
    "answer_relevancy": baseline_result['answer_relevancy']
}

semantic_scores = {
    "context_recall": semantic_result['context_recall'],
    "context_precision": semantic_result['context_precision'],
    "faithfulness": semantic_result['faithfulness'],
    "factual_correctness": semantic_result['factual_correctness'],
    "answer_relevancy": semantic_result['answer_relevancy']
}

print("📊 Baseline RAG Scores:")
for metric, scores in baseline_scores.items():
    print(f"  {metric}: {np.nanmean(scores):.4f}")

print("\n📊 Semantic RAG Scores:")
for metric, scores in semantic_scores.items():
    print(f"  {metric}: {np.nanmean(scores):.4f}")


📊 Baseline RAG Scores:
  context_recall: 0.4722
  context_precision: 0.5909
  faithfulness: 0.7648
  factual_correctness: 0.5755
  answer_relevancy: 0.7900

📊 Semantic RAG Scores:
  context_recall: 0.7639
  context_precision: 0.6906
  faithfulness: 0.8815
  factual_correctness: 0.5933
  answer_relevancy: 0.9555


In [58]:
# Create comprehensive comparison table with metadata
comparison_data = {
    'Metric': [
        'Context Precision',
        'Context Recall', 
        'Answer Relevancy',
        'Faithfulness',
        'Factual Correctness'
    ],
    'Comparison': [
        'Question → Context',
        'Reference Answer → Context',
        'Question ↔ Response',
        'Response → Context',
        'Response ↔ Reference Answer'
    ],
    'Stage': [
        'Retrieval',
        'Retrieval',
        'Generation',
        'Generation',
        'End-to-End'
    ],
    'What it Measures': [
        'Relevant chunks ranked higher in retrieval',
        'All needed information was retrieved',
        'Answer directly addresses the question',
        'Answer claims supported by retrieved context',
        'Answer statements match ground truth' ,
    ],
    'Better': [
        'Higher ✅',
        'Higher ✅',
        'Higher ✅',
        'Higher ✅',
        'Higher ✅'
    ],
    'Baseline Score': [
        f"{np.nanmean(baseline_scores['context_precision']):.4f}",
        f"{np.nanmean(baseline_scores['context_recall']):.4f}",
        f"{np.nanmean(baseline_scores['answer_relevancy']):.4f}",
        f"{np.nanmean(baseline_scores['faithfulness']):.4f}",
        f"{np.nanmean(baseline_scores['factual_correctness']):.4f}"
    ],
    'Semantic Score': [
        f"{np.nanmean(semantic_scores['context_precision']):.4f}",
        f"{np.nanmean(semantic_scores['context_recall']):.4f}",
        f"{np.nanmean(semantic_scores['answer_relevancy']):.4f}",
        f"{np.nanmean(semantic_scores['faithfulness']):.4f}",
        f"{np.nanmean(semantic_scores['factual_correctness']):.4f}"
    ]
}

comparison_df = pd.DataFrame(comparison_data)

# Calculate improvement
comparison_df['Improvement'] = comparison_df.apply(
    lambda row: f"{((float(row['Semantic Score']) - float(row['Baseline Score'])) / float(row['Baseline Score']) * 100):.2f}%"
    if float(row['Baseline Score']) > 0 else "N/A",
    axis=1
)

comparison_df


| | Metric | Comparison | Stage | What it Measures | Better | Baseline Score | Semantic Score | Improvement |
|---|---|---|---|---|---|---|---|---|
| 0 | Context Precision | Question → Context | Retrieval | Relevant chunks ranked higher in retrieval | Higher ✅ | 0.5909 | 0.6906 | 16.87% |
| 1 | Context Recall | Reference Answer → Context | Retrieval | All needed information was retrieved | Higher ✅ | 0.4722 | 0.7639 | 61.77% |
| 2 | Answer Relevancy | Question ↔ Response | Generation | Answer directly addresses the question | Higher ✅ | 0.7900 | 0.9555 | 20.95% |
| 3 | Faithfulness | Response → Context | Generation | Answer claims supported by retrieved context | Higher ✅ | 0.7648 | 0.8815 | 15.26% |
| 4 | Factual Correctness | Response ↔ Reference Answer | End-to-End | Answer statements match ground truth | Higher ✅ | 0.5755 | 0.5933 | 3.09% |
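
The Improvement column is a relative change, `(semantic - baseline) / baseline * 100`. The sketch below recomputes it from the scores reported above and prints the absolute difference alongside, since a large relative gain on a low baseline can correspond to a modest absolute one. The helper name `relative_improvement` is hypothetical.

```python
# Scores copied from the results reported in this notebook.
baseline = {
    "context_precision": 0.5909,
    "context_recall": 0.4722,
    "answer_relevancy": 0.7900,
    "faithfulness": 0.7648,
    "factual_correctness": 0.5755,
}
semantic = {
    "context_precision": 0.6906,
    "context_recall": 0.7639,
    "answer_relevancy": 0.9555,
    "faithfulness": 0.8815,
    "factual_correctness": 0.5933,
}

def relative_improvement(before: float, after: float) -> float:
    """Percentage change of `after` relative to `before`."""
    return (after - before) / before * 100

for metric, before in baseline.items():
    after = semantic[metric]
    absolute = after - before
    relative = relative_improvement(before, after)
    print(f"{metric:<20} absolute: {absolute:+.4f}  relative: {relative:+.2f}%")
```

For context recall, for example, the absolute gain is about 0.29 points on a 0 to 1 scale, which the relative figure expresses as roughly 62%.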
