In [1]:
import os
from getpass import getpass

os.environ["OPENAI_API_KEY"] = getpass("OpenAI API Key:")

In [2]:
from langchain_community.document_loaders import DirectoryLoader, PyMuPDFLoader

# Define the path to the document directory
path = "data/"

# Load all .pdf documents using PyMuPDFLoader
loader = DirectoryLoader(path, glob="*.pdf", loader_cls=PyMuPDFLoader)
docs = loader.load()

# Optional: Check number of documents and preview content
print(f"Loaded {len(docs)} documents")
print(docs[0].page_content[:500])  # Preview first 500 characters of first doc


Loaded 269 documents
Application and Verification Guide
Introduction
This guide is intended for college financial aid administrators and counselors who help students with the financial aid
process4completing the Free Application for Federal Student Aid (FAFSA®) form, verifying information, and making
corrections and other changes to the information reported on the FAFSA form.
Throughout the Federal Student Aid Handbook, we use <college,= <school,= and <institution= interchangeably unless a
more specific use is given


In [3]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.testset import TestsetGenerator
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Wrap OpenAI LLM and embedding model for use with RAGAS
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings(model="text-embedding-3-small"))

# Initialize test set generator
generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)

# Generate a synthetic dataset from your docs (limit to first 20 docs for speed)
dataset = generator.generate_with_langchain_docs(docs[:20], testset_size=10)

# Optional: preview first generated question
dataset.samples[0].eval_sample.user_input


  from .autonotebook import tqdm as notebook_tqdm
Applying HeadlineSplitter:   0%|          | 0/20 [00:00<?, ?it/s]           unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
Applying SummaryExtractor:  63%|██████▎   | 17/27 [00:24<00:32,  3.22s/it]Property 'summary' already exists in node '19bfc8'. Skipping!
Applying SummaryExtractor:  67%|██████▋   | 18/27 [00:28<00:31,  3.45s/it]Property 'summary' already exists in node 'eb9bf4'. Skipping!
Applying SummaryExtractor:  70%|███████   | 19/27 [00:29<00:22,  2.81s/it]Property 'summary' already exists in node 'ee8901'. Skipping!
Applying SummaryExtractor:  74%|███████▍  | 20/27 [00:29<00:15,  2.21s/it]Property 'summary' already exists in node '2893f7'. Skipping!
Applying SummaryExtractor:  78%|███████▊  | 21/27 [00:29<00:10,  1.71s/it]Property 'summary' already exists in node '606283'. Skipping!
Applying SummaryExtractor:  81%|████████▏ 

'What is the main impact of the Fostering Undergraduate Talent by Unlocking Resources for Education Act on the FAFSA process?'

In [4]:
import copy

dataset_naive = copy.deepcopy(dataset)
dataset_semantic_k5 = copy.deepcopy(dataset) 
dataset_semantic_k20 = copy.deepcopy(dataset)
dataset_rerank = copy.deepcopy(dataset)

dataset.to_pandas()

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,What is the main impact of the Fostering Under...,[Application and Verification Guide Introducti...,The Fostering Undergraduate Talent by Unlockin...,single_hop_specifc_query_synthesizer
1,Wut happend to the retuning FAFSA filers secton?,[Chapter 1: The Application Process We removed...,The <Returning FAFSA Filers= section was remov...,single_hop_specifc_query_synthesizer
2,Why FPS look at the FAFSA and say there mistak...,[The FPS also checks the application for possi...,The FPS checks the application for inconsisten...,single_hop_specifc_query_synthesizer
3,how i get isir for student with fafsa partner ...,[Output Documents After processing is complete...,"if your school not listed on student fafsa, yo...",single_hop_specifc_query_synthesizer
4,how do the fps output documents like isir and ...,[<1-hop>\n\nThe FPS also checks the applicatio...,"the fps output documents, which are the isir a...",multi_hop_abstract_query_synthesizer
5,How has the FAFSA application process changed ...,[<1-hop>\n\nApplication and Verification Guide...,The FAFSA application process has changed in t...,multi_hop_abstract_query_synthesizer
6,If a student submits a FAFSA application with ...,[<1-hop>\n\nThe FPS also checks the applicatio...,When a student submits a FAFSA application wit...,multi_hop_abstract_query_synthesizer
7,Howw does the disclosure and use of Federal Ta...,[<1-hop>\n\n2. The disclosure of their FTI by ...,The disclosure and use of Federal Tax Informat...,multi_hop_abstract_query_synthesizer
8,"According to Chapter 2, how does the determina...","[<1-hop>\n\nAVG, Chapter 2, Example 6: A stude...","In Chapter 2, the determination of legal depen...",multi_hop_specific_query_synthesizer
9,Wich departmant is responsibel for obtaning co...,[<1-hop>\n\nFederal Tax Information The follow...,Only the Department has the authority to obtai...,multi_hop_specific_query_synthesizer


In [5]:
# Naive Chunking
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Naive chunking: split docs into 1000-character chunks with 200-character overlap
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
split_documents = text_splitter.split_documents(docs)

# Optional: Confirm result
print(f"Total chunks created: {len(split_documents)}")
print(split_documents[0].page_content[:300])  # Preview first chunk

Total chunks created: 1102
Application and Verification Guide
Introduction
This guide is intended for college financial aid administrators and counselors who help students with the financial aid
process4completing the Free Application for Federal Student Aid (FAFSA®) form, verifying information, and making
corrections and oth


In [6]:
from langchain_openai import OpenAIEmbeddings
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams

# Define the embedding model
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Set up Qdrant in-memory client
client = QdrantClient(":memory:")

# Create a Qdrant collection (1536 = embedding size of text-embedding-3-small)
client.create_collection(
    collection_name="loan_data",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

# Store chunks in Qdrant
vector_store = QdrantVectorStore(
    client=client,
    collection_name="loan_data",
    embedding=embeddings,
)

_ = vector_store.add_documents(documents=split_documents)

# Expose retriever
retriever = vector_store.as_retriever(search_kwargs={"k": 5})

In [7]:
# LangGraph Pipeline
from langchain.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# Define the RAG prompt template
RAG_PROMPT = """\
You are a helpful assistant who answers questions based on provided context.
You must only use the provided context, and cannot use your own knowledge.

### Question
{question}

### Context
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

# Define the LLM for generation
llm = ChatOpenAI(model="gpt-4.1-nano")

# Retrieval function
def retrieve(state):
    retrieved_docs = retriever.invoke(state["question"])
    return {"context": retrieved_docs}

# Generation function
def generate(state):
    docs_content = "\n\n".join(doc.page_content for doc in state["context"])
    messages = rag_prompt.format_messages(
        question=state["question"],
        context=docs_content
    )
    response = llm.invoke(messages)
    return {"response": response.content}


In [8]:
from langgraph.graph import START, StateGraph
from typing_extensions import TypedDict, List
from langchain_core.documents import Document

# Define the state schema
class State(TypedDict):
    question: str
    context: List[Document]
    response: str

# Build the LangGraph pipeline
graph_builder = StateGraph(State).add_sequence([retrieve, generate])
graph_builder.add_edge(START, "retrieve")
graph = graph_builder.compile()


In [9]:
# Test Run
response = graph.invoke({"question": "What are the different kinds of loans?"})
print(response["response"])


The provided context mentions different aspects of student loans, such as Direct Loans and various loan terms, but it does not explicitly list or define the different kinds of loans. Therefore, based solely on the provided information, there is no specific answer regarding the types of loans.


In [10]:
for test_row in dataset_naive:
    # Run LangGraph with the question
    response = graph.invoke({"question": test_row.eval_sample.user_input})

    # Save the generated response and retrieved context
    test_row.eval_sample.response = response["response"]
    test_row.eval_sample.retrieved_contexts = [
        doc.page_content for doc in response["context"]
    ]


In [11]:
print("✅ Sample generated answer:")
print("Question:", dataset_naive.samples[0].eval_sample.user_input)
print("Answer:", dataset_naive.samples[0].eval_sample.response)
print("Context snippet:", dataset_naive.samples[0].eval_sample.retrieved_contexts[0][:200])


✅ Sample generated answer:
Question: What is the main impact of the Fostering Undergraduate Talent by Unlocking Resources for Education Act on the FAFSA process?
Answer: The main impact of the Fostering Undergraduate Talent by Unlocking Resources for Education (FUTURE) Act on the FAFSA process is the implementation of the FUTURE Act Direct Data Exchange (FA-DDX), which allows the IRS to directly disclose certain tax information to the Department of Education. This eliminates the need for most applicants to self-report their income and tax information when completing the FAFSA form.
Context snippet: analysis, and many policies and procedures for schools that participate in the Title IV programs. FSA implemented the
FAFSA Simplification Act alongside the FAFSA portion of the Fostering Undergraduat


In [12]:
from ragas.evaluation import EvaluationDataset
from ragas import evaluate, RunConfig
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import (
    LLMContextRecall,
    Faithfulness,
    FactualCorrectness,
    ResponseRelevancy,
    ContextEntityRecall,
    NoiseSensitivity
)

# Wrap the evaluator LLM (same or smaller model is fine)
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-mini"))

# Convert dataset to RAGAS EvaluationDataset
evaluation_dataset = EvaluationDataset.from_pandas(dataset_naive.to_pandas())

# Optional: set timeout to avoid stalling on any row
custom_run_config = RunConfig(timeout=360)

# Run evaluation
result = evaluate(
    dataset=evaluation_dataset,
    metrics=[
        LLMContextRecall(),
        Faithfulness(),
        FactualCorrectness(),
        ResponseRelevancy(),
        ContextEntityRecall(),
        NoiseSensitivity()
    ],
    llm=evaluator_llm,
    run_config=custom_run_config
)

# Display results
result


Evaluating: 100%|██████████| 72/72 [05:10<00:00,  4.31s/it]


{'context_recall': 0.8554, 'faithfulness': 0.9335, 'factual_correctness(mode=f1)': 0.6183, 'answer_relevancy': 0.7784, 'context_entity_recall': 0.5048, 'noise_sensitivity(mode=relevant)': 0.1974}

In [13]:
import nltk
nltk.download("punkt")
nltk.download("punkt_tab")

[nltk_data] Downloading package punkt to /home/DJ/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /home/DJ/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [14]:
from typing import List
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import nltk

embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")

def semantic_chunk_document(
    doc: Document,
    similarity_threshold: float = 0.85,
    max_chunk_size: int = 1000
) -> List[Document]:
    sentences = nltk.sent_tokenize(doc.page_content)
    if not sentences:
        return []

    sentence_embeddings = embedding_model.embed_documents(sentences)
    chunks = []
    current_chunk = [sentences[0]]
    current_length = len(sentences[0])
    current_vector = sentence_embeddings[0]

    for i in range(1, len(sentences)):
        sim = cosine_similarity(
            [current_vector],
            [sentence_embeddings[i]]
        )[0][0]
        sentence_len = len(sentences[i])

        if sim >= similarity_threshold and current_length + sentence_len <= max_chunk_size:
            current_chunk.append(sentences[i])
            current_length += sentence_len
            # Re-embed the merged chunk so the next sentence is compared
            # against the whole chunk (costs one API call per merge)
            current_vector = embedding_model.embed_query(" ".join(current_chunk))
        else:
            chunk_text = " ".join(current_chunk)
            chunks.append(Document(page_content=chunk_text, metadata=doc.metadata))
            current_chunk = [sentences[i]]
            current_length = sentence_len
            current_vector = sentence_embeddings[i]

    # Add final chunk
    if current_chunk:
        chunk_text = " ".join(current_chunk)
        chunks.append(Document(page_content=chunk_text, metadata=doc.metadata))

    return chunks
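
To see the greedy merge rule in isolation, without any API calls, here is a minimal sketch. `toy_embed` and `VOCAB` are hypothetical stand-ins for the embedding model (plain word counts over a tiny vocabulary), and this version compares each sentence against the mean of the sentence vectors already in the chunk instead of re-embedding the joined text. It is a simplification for illustration, not the function above.

```python
import math
from typing import List

VOCAB = ["loan", "interest", "rate", "fafsa", "form", "student"]  # hypothetical toy vocabulary

def toy_embed(sentence: str) -> List[float]:
    """Hypothetical stand-in for an embedding model: word counts over VOCAB."""
    words = sentence.lower().replace(".", "").split()
    return [float(words.count(term)) for term in VOCAB]

def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def greedy_merge(sentences: List[str], threshold: float = 0.5, max_chars: int = 1000) -> List[str]:
    """Merge each sentence into the current chunk while it stays similar and short enough."""
    if not sentences:
        return []
    vectors = [toy_embed(s) for s in sentences]
    chunks, current, current_vecs = [], [sentences[0]], [vectors[0]]
    for sentence, vec in zip(sentences[1:], vectors[1:]):
        # Chunk vector = mean of the sentence vectors already in the chunk
        centroid = [sum(col) / len(current_vecs) for col in zip(*current_vecs)]
        fits = sum(len(s) for s in current) + len(sentence) <= max_chars
        if cosine(centroid, vec) >= threshold and fits:
            current.append(sentence)
            current_vecs.append(vec)
        else:
            # Similarity dropped or the chunk is full: close it and start a new one
            chunks.append(" ".join(current))
            current, current_vecs = [sentence], [vec]
    chunks.append(" ".join(current))
    return chunks

sentences = [
    "The loan has an interest rate.",
    "The interest rate on the loan is fixed.",
    "The student completes the FAFSA form.",
]
merged = greedy_merge(sentences)
print(len(merged))
```

With these toy vectors the two loan sentences end up in one chunk and the FAFSA sentence starts a new one. Real embeddings are dense, so the threshold needs tuning against the actual model.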


In [15]:
semantic_chunks = []
for doc in docs:
    semantic_chunks.extend(semantic_chunk_document(doc))


In [16]:
print(f"Generated {len(semantic_chunks)} semantic chunks.")
print(semantic_chunks[0].page_content)

Generated 4630 semantic chunks.
Application and Verification Guide
Introduction
This guide is intended for college financial aid administrators and counselors who help students with the financial aid
process4completing the Free Application for Federal Student Aid (FAFSA®) form, verifying information, and making
corrections and other changes to the information reported on the FAFSA form.


In [17]:
from qdrant_client.http.models import Distance, VectorParams
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from langchain_openai import OpenAIEmbeddings

# Reuse or redefine the embedding model
embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")

# Create a new in-memory Qdrant client
client = QdrantClient(":memory:")

# New collection name for semantic chunks
client.create_collection(
    collection_name="loan_data_semantic",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

# Add semantic chunks to the collection
semantic_vectorstore = QdrantVectorStore(
    client=client,
    collection_name="loan_data_semantic",
    embedding=embedding_model,
)

_ = semantic_vectorstore.add_documents(documents=semantic_chunks)

# Expose a retriever
semantic_retriever = semantic_vectorstore.as_retriever(search_kwargs={"k": 5})


In [18]:
#  Define a New LangGraph for Semantic Chunking
from langgraph.graph import START, StateGraph
from langchain.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from typing_extensions import TypedDict, List
from langchain_core.documents import Document

# Prompt stays the same
semantic_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

# Semantic retrieval function
def semantic_retrieve(state):
    retrieved_docs = semantic_retriever.invoke(state["question"])
    return {"context": retrieved_docs}

# Generation function stays the same
def generate(state):
    docs_content = "\n\n".join(doc.page_content for doc in state["context"])
    messages = semantic_prompt.format_messages(
        question=state["question"], context=docs_content
    )
    response = llm.invoke(messages)
    return {"response": response.content}

# Shared state schema
class State(TypedDict):
    question: str
    context: List[Document]
    response: str

# Build semantic graph
semantic_graph_builder = StateGraph(State).add_sequence([semantic_retrieve, generate])
semantic_graph_builder.add_edge(START, "semantic_retrieve")
semantic_graph = semantic_graph_builder.compile()


In [19]:
for test_row in dataset_semantic_k5:
    response = semantic_graph.invoke({"question": test_row.eval_sample.user_input})
    test_row.eval_sample.response = response["response"]
    test_row.eval_sample.retrieved_contexts = [
        doc.page_content for doc in response["context"]
    ]


In [20]:
from ragas.evaluation import EvaluationDataset
from ragas import evaluate, RunConfig
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import (
    LLMContextRecall,
    Faithfulness,
    FactualCorrectness,
    ResponseRelevancy,
    ContextEntityRecall,
    NoiseSensitivity
)

# Wrap evaluator
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-mini"))

# Convert dataset to EvaluationDataset
evaluation_dataset_semantic = EvaluationDataset.from_pandas(dataset_semantic_k5.to_pandas())

# Run evaluation
semantic_result = evaluate(
    dataset=evaluation_dataset_semantic,
    metrics=[
        LLMContextRecall(),
        Faithfulness(),
        FactualCorrectness(),
        ResponseRelevancy(),
        ContextEntityRecall(),
        NoiseSensitivity()
    ],
    llm=evaluator_llm,
    run_config=RunConfig(timeout=360)
)

semantic_result


Evaluating: 100%|██████████| 72/72 [03:56<00:00,  3.28s/it]


{'context_recall': 0.5478, 'faithfulness': 0.7615, 'factual_correctness(mode=f1)': 0.6700, 'answer_relevancy': 0.8553, 'context_entity_recall': 0.4344, 'noise_sensitivity(mode=relevant)': 0.1552}

In [21]:
semantic_retriever = semantic_vectorstore.as_retriever(search_kwargs={"k": 20})


In [22]:
from langgraph.graph import START, StateGraph
from typing_extensions import TypedDict, List
from langchain_core.documents import Document
from langchain.prompts import ChatPromptTemplate

# Prompt (reuse previous)
semantic_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

# New retriever logic
def semantic_retrieve(state):
    retrieved_docs = semantic_retriever.invoke(state["question"])
    return {"context": retrieved_docs}

# Generation logic
def generate(state):
    docs_content = "\n\n".join(doc.page_content for doc in state["context"])
    messages = semantic_prompt.format_messages(
        question=state["question"], context=docs_content
    )
    response = llm.invoke(messages)
    return {"response": response.content}

class State(TypedDict):
    question: str
    context: List[Document]
    response: str

# Build updated LangGraph
semantic_graph_k20 = StateGraph(State).add_sequence([semantic_retrieve, generate])
semantic_graph_k20.add_edge(START, "semantic_retrieve")
semantic_graph_k20 = semantic_graph_k20.compile()


In [23]:
for test_row in dataset_semantic_k20:
    response = semantic_graph_k20.invoke({"question": test_row.eval_sample.user_input})
    test_row.eval_sample.response = response["response"]
    test_row.eval_sample.retrieved_contexts = [
        doc.page_content for doc in response["context"]
    ]


In [24]:
from ragas.evaluation import EvaluationDataset
from ragas import evaluate, RunConfig
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import (
    LLMContextRecall,
    Faithfulness,
    FactualCorrectness,
    ResponseRelevancy,
    ContextEntityRecall,
    NoiseSensitivity
)

evaluation_dataset_semantic_k20 = EvaluationDataset.from_pandas(dataset_semantic_k20.to_pandas())
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-mini"))

semantic_result_k20 = evaluate(
    dataset=evaluation_dataset_semantic_k20,
    metrics=[
        LLMContextRecall(),
        Faithfulness(),
        FactualCorrectness(),
        ResponseRelevancy(),
        ContextEntityRecall(),
        NoiseSensitivity()
    ],
    llm=evaluator_llm,
    run_config=RunConfig(timeout=360)
)

semantic_result_k20


Evaluating:  90%|█████████ | 65/72 [05:19<05:34, 47.81s/it]Exception raised in Job[17]: TimeoutError()
Evaluating:  92%|█████████▏| 66/72 [05:42<04:01, 40.31s/it]Exception raised in Job[29]: TimeoutError()
Evaluating:  93%|█████████▎| 67/72 [05:49<02:31, 30.33s/it]Exception raised in Job[41]: TimeoutError()
Evaluating:  94%|█████████▍| 68/72 [06:02<01:40, 25.17s/it]Exception raised in Job[47]: TimeoutError()
Evaluating:  96%|█████████▌| 69/72 [06:07<00:57, 19.14s/it]Exception raised in Job[53]: TimeoutError()
Evaluating:  97%|█████████▋| 70/72 [06:21<00:35, 17.66s/it]Exception raised in Job[65]: TimeoutError()
Evaluating:  99%|█████████▊| 71/72 [06:52<00:21, 21.59s/it]Exception raised in Job[71]: TimeoutError()
Evaluating: 100%|██████████| 72/72 [07:05<00:00,  5.90s/it]


{'context_recall': 0.7794, 'faithfulness': 0.8369, 'factual_correctness(mode=f1)': 0.7017, 'answer_relevancy': 0.9258, 'context_entity_recall': 0.4364, 'noise_sensitivity(mode=relevant)': 0.1533}

In [25]:
# Cohere rerank pipeline: setup
import os
from getpass import getpass
os.environ["COHERE_API_KEY"] = getpass("Cohere API Key:")

import cohere
from langchain_openai import ChatOpenAI
from langchain_core.runnables import RunnableLambda
from langchain_core.prompts import PromptTemplate
from langgraph.graph import StateGraph
from langchain_core.documents import Document
from typing import TypedDict, List

# Create the Cohere client from the key entered above
co = cohere.Client(os.environ["COHERE_API_KEY"])

In [26]:
class State(TypedDict):
    question: str
    retrieved_context: List[Document]
    answer: str


In [27]:
def cohere_rerank(query: str, docs: list, top_n: int = 8):
    response = co.rerank(
        model="rerank-v3.5",  # or rerank-english-v3.0 if needed
        query=query,
        documents=[doc.page_content for doc in docs],
        top_n=top_n
    )
    return [docs[r.index] for r in response.results]

def retrieve_with_cohere_rerank(state: State):
    initial_docs = semantic_vectorstore.similarity_search(state["question"], k=25)
    reranked_docs = cohere_rerank(state["question"], initial_docs, top_n=8)
    return {
        "question": state["question"],
        "retrieved_context": reranked_docs
    }

retriever_node = RunnableLambda(retrieve_with_cohere_rerank)


In [28]:
# Note: this rebinds the global `llm`; the earlier pipelines generate with gpt-4.1-nano
llm = ChatOpenAI(model="gpt-4o")

prompt = PromptTemplate.from_template("""\
You must answer the question using only the provided context.
If you do not know, say "I don't know."

Context:
{context}

Question: {question}
Answer:""")

def generate_answer(state: State):
    context_str = "\n\n".join([doc.page_content for doc in state["retrieved_context"]])
    response = llm.invoke(prompt.format(context=context_str, question=state["question"]))
    return {
        "question": state["question"],
        "retrieved_context": state["retrieved_context"],
        "answer": response.content
    }

llm_node = RunnableLambda(generate_answer)


In [29]:
graph = StateGraph(State)

graph.add_node("retriever", retriever_node)
graph.add_node("llm", llm_node)

graph.set_entry_point("retriever")
graph.add_edge("retriever", "llm")
graph.set_finish_point("llm")

rerank_chain = graph.compile()


In [30]:
for test_row in dataset_rerank:
    response = rerank_chain.invoke({"question": test_row.eval_sample.user_input})
    test_row.eval_sample.response = response["answer"]  # Note: rerank uses "answer" while others use "response"
    test_row.eval_sample.retrieved_contexts = [
        doc.page_content for doc in response["retrieved_context"]  # Note: rerank uses "retrieved_context" while others use "context"
    ]


In [31]:
from ragas.evaluation import EvaluationDataset
from ragas import evaluate, RunConfig
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import (
    LLMContextRecall,
    Faithfulness,
    FactualCorrectness,
    ResponseRelevancy,
    ContextEntityRecall,
    NoiseSensitivity
)

evaluation_dataset_rerank = EvaluationDataset.from_pandas(dataset_rerank.to_pandas())
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-mini"))

rerank_result = evaluate(
    dataset=evaluation_dataset_rerank,
    metrics=[
        LLMContextRecall(),
        Faithfulness(),
        FactualCorrectness(),
        ResponseRelevancy(),
        ContextEntityRecall(),
        NoiseSensitivity()
    ],
    llm=evaluator_llm,
    run_config=RunConfig(timeout=360)
)

rerank_result


Evaluating: 100%|██████████| 72/72 [04:42<00:00,  3.93s/it]


{'context_recall': 0.5978, 'faithfulness': 0.5556, 'factual_correctness(mode=f1)': 0.4950, 'answer_relevancy': 0.6250, 'context_entity_recall': 0.5267, 'noise_sensitivity(mode=relevant)': 0.0492}

# Interpretation of Results

## 📊 Chunk Count Comparison: Naive vs. Semantic

| Chunking Strategy | Total Chunks | Description |
|-------------------|--------------|-------------|
| **Naive Chunking** | 1,102        | Fixed-size chunks (1,000 characters with 200-character overlap). Chunks are larger, may split or mix unrelated sentences. |
| **Semantic Chunking** | 4,630        | Sentence-level greedy merging based on semantic similarity. Results in smaller, more coherent chunks. |

Semantic chunking produced 4,630 chunks against 1,102 for naive chunking, roughly **4.2x** as many. The two strategies behave differently:

- **Naive chunking** uses fixed-length blocks (1,000 characters with 200-character overlap), which yields larger, more uniform chunks that can combine unrelated sentences or split a thought midway.
- **Semantic chunking** splits on sentence boundaries and merges neighbouring sentences only when their embeddings are similar, which yields shorter, more focused chunks.

This trade-off has implications for:

- **Retrieval granularity:** semantic chunks allow finer filtering.
- **Context relevance:** each retrieved passage is more focused, but at a fixed `k` the retrieved set covers less text.
- **Embedding volume:** more chunks means more embeddings and more token usage upfront.
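
To check the chunk-size claim with numbers rather than counts alone, the lengths of each chunk set can be summarised. The sketch below uses a hypothetical helper, `chunk_length_stats`, and placeholder strings; in the notebook you would pass `[d.page_content for d in split_documents]` and `[d.page_content for d in semantic_chunks]` instead.

```python
from statistics import mean, median
from typing import Dict, List

def chunk_length_stats(texts: List[str]) -> Dict[str, float]:
    """Summarise character lengths for a non-empty list of chunk texts."""
    lengths = [len(t) for t in texts]
    return {
        "count": len(lengths),
        "mean": mean(lengths),
        "median": median(lengths),
        "min": min(lengths),
        "max": max(lengths),
    }

# Placeholder data standing in for real chunk texts
example = ["a" * 900, "b" * 1000, "c" * 200]
stats = chunk_length_stats(example)
print(stats)
```

Comparing the two summaries side by side shows how much text a fixed `k` actually retrieves under each strategy.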

## 📊 RAG Evaluation Results Comparison

| **Metric**                       | **Naive** | **Semantic (k=5)** | **Semantic (k=20)** | **Cohere Rerank** |
|----------------------------------|-----------|---------------------|----------------------|--------------------|
| **Context Recall**               | 0.8554    | 0.5478              | 0.7794               | 0.5978             |
| **Faithfulness**                 | 0.9335    | 0.7615              | 0.8369               | 0.5556             |
| **Factual Correctness (F1)**     | 0.6183    | 0.6700              | 0.7017               | 0.4950             |
| **Answer Relevancy**            | 0.7784    | 0.8553              | 0.9258               | 0.6250             |
| **Context Entity Recall**        | 0.5048    | 0.4344              | 0.4364               | 0.5267             |
| **Noise Sensitivity (Relevant)** | 0.1974    | 0.1552              | 0.1533               | 0.0492             |
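
The table can also be read programmatically. The sketch below copies the scores from the table into a plain dictionary and picks the best pipeline per metric; the names `results`, `LOWER_IS_BETTER`, and `best_pipeline` are illustrative, and it assumes noise sensitivity is the only metric here where a lower score is better.

```python
# Scores copied from the comparison table
results = {
    "Naive": {
        "context_recall": 0.8554, "faithfulness": 0.9335, "factual_correctness": 0.6183,
        "answer_relevancy": 0.7784, "context_entity_recall": 0.5048, "noise_sensitivity": 0.1974,
    },
    "Semantic (k=5)": {
        "context_recall": 0.5478, "faithfulness": 0.7615, "factual_correctness": 0.6700,
        "answer_relevancy": 0.8553, "context_entity_recall": 0.4344, "noise_sensitivity": 0.1552,
    },
    "Semantic (k=20)": {
        "context_recall": 0.7794, "faithfulness": 0.8369, "factual_correctness": 0.7017,
        "answer_relevancy": 0.9258, "context_entity_recall": 0.4364, "noise_sensitivity": 0.1533,
    },
    "Cohere Rerank": {
        "context_recall": 0.5978, "faithfulness": 0.5556, "factual_correctness": 0.4950,
        "answer_relevancy": 0.6250, "context_entity_recall": 0.5267, "noise_sensitivity": 0.0492,
    },
}

# Assumption: lower is better only for noise sensitivity
LOWER_IS_BETTER = {"noise_sensitivity"}

def best_pipeline(metric: str) -> str:
    """Return the pipeline with the best score for the given metric."""
    pick = min if metric in LOWER_IS_BETTER else max
    return pick(results, key=lambda name: results[name][metric])

for metric in results["Naive"]:
    print(f"{metric}: {best_pipeline(metric)}")
```

No single pipeline wins every metric, which is why the interpretation has to weigh recall and faithfulness against answer quality.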

---

## Interpretation

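- **Naive Chunking** scores highest on **context recall** (0.8554) and **faithfulness** (0.9335), likely because each 1,000-character chunk carries more surrounding text, but it also has the highest **noise sensitivity** (0.1974; lower is better).
- **Semantic (k=5)** improves **factual correctness** and **answer relevancy** over naive chunking, but **context recall** drops to 0.5478 because five short chunks cover much less text.
- **Semantic (k=20)** is the strongest on answer quality: it has the best **factual correctness** (0.7017) and **answer relevancy** (0.9258), with **context recall** and **faithfulness** second only to naive chunking. Seven evaluation jobs in this run raised `TimeoutError`, so its scores may rest on fewer samples than the other runs.
- **Cohere Rerank** underperforms on **faithfulness** and **factual correctness**, even though it has the lowest **noise sensitivity** (0.0492) and the highest **context entity recall** (0.5267).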

---

## ❗ Why Did Cohere Underperform?

The reranker pipeline used:
- `k = 25` documents from the vectorstore (via similarity search)
- `top_n = 8` after reranking with Cohere (`rerank-v3.5`)

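While this configuration **reduced irrelevant content**, it may also have **over-pruned the retrieved set**, excluding essential supporting context. The comparison is not fully controlled, though: this pipeline also switches the generator to `gpt-4o` and uses a different prompt with an "I don't know" fallback, either of which could affect the scores on its own. If over-pruning is the cause, the expected effects are:
- The model couldn't always **justify its answers** using the provided context → lower **faithfulness**
- It sometimes **missed key facts** → lower **factual correctness**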
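
One way to probe the over-pruning hypothesis is to measure how much of a reference context survives the rerank cut. The sketch below is a rough diagnostic under stated assumptions: `token_recall` is a hypothetical helper based on whitespace tokens (it is not a RAGAS metric), and the strings are placeholders for the 25 candidates and the top 8 kept after reranking.

```python
from typing import List

def token_recall(reference: str, retrieved: List[str]) -> float:
    """Fraction of the reference's distinct tokens found anywhere in the retrieved texts."""
    ref_tokens = set(reference.lower().split())
    if not ref_tokens:
        return 0.0
    retrieved_tokens = set(" ".join(retrieved).lower().split())
    return len(ref_tokens & retrieved_tokens) / len(ref_tokens)

reference = "the fps checks the application for inconsistencies"
candidates = [
    "the fps checks the application",
    "for inconsistencies and possible errors",
    "unrelated text about loans",
]

before = token_recall(reference, candidates)      # full candidate set
after = token_recall(reference, candidates[:1])   # after a hypothetical cut to one document
print(before, after)
```

A large gap between `before` and `after` across the test set would support the over-pruning explanation; a small gap would point toward the generator or prompt instead.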

---

## ✅ Recommendation

To improve performance while retaining the benefits of reranking:

- Increase retrieval depth: `k = 50`
- Expand rerank cutoff: `top_n = 15`

This may restore more relevant context while keeping the noise reduction from reranking; the evaluation needs to be rerun to confirm it.

```python
initial_docs = semantic_vectorstore.similarity_search(query, k=50)
reranked_docs = cohere_rerank(query, initial_docs, top_n=15)  # helper defined in In [27]
```
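
Rather than testing a single new setting, the two parameters can be swept together. This is a minimal sketch, assuming a hypothetical `run_and_score(k, top_n)` callable that rebuilds the rerank pipeline, runs the test set, and returns one metric; the `fake_run_and_score` placeholder below only exists so the example runs without API calls and its numbers mean nothing.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

def sweep_rerank_settings(
    run_and_score: Callable[[int, int], float],
    k_values: List[int],
    top_n_values: List[int],
) -> Dict[Tuple[int, int], float]:
    """Score every (k, top_n) pair where top_n does not exceed k."""
    scores = {}
    for k, top_n in product(k_values, top_n_values):
        if top_n > k:
            continue  # cannot keep more documents than were retrieved
        scores[(k, top_n)] = run_and_score(k, top_n)
    return scores

def fake_run_and_score(k: int, top_n: int) -> float:
    # Placeholder only: a real version would run the pipeline and a RAGAS metric
    return round(top_n / k, 3)

scores = sweep_rerank_settings(fake_run_and_score, k_values=[25, 50], top_n_values=[8, 15])
best = max(scores, key=scores.get)
print(scores, best)
```

Each real run costs a full generation and evaluation pass, so a small grid such as the one shown is a reasonable starting point.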
