# Retrieval Pipeline Evaluation Report

**Title:** *Evaluating Document Retrieval Quality using Fixed Token Chunking and MiniLM Embeddings on the State of the Union Corpus*  
**Author:** Kyrylo Goroshenko  

---

## 1. Introduction

This project builds and evaluates a document retrieval pipeline using open-source tools. The system is tested on a real-world dataset, the **State of the Union** addresses, by chunking the corpus, embedding both queries and chunks, and measuring how well the retrieved text covers the relevant passages.

---

## 2. Dataset Preparation

- **Corpus**: `state_of_the_union.md`
- **Queries & References**: Extracted into `questions_state.csv` using a basic filtering script.
- Each query in the CSV includes a `question` field and `references`, which are character-level (start, end) tuples marking relevant spans in the corpus.
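
As a minimal illustration of how character-level references can be resolved against the corpus, the sketch below slices a text by `(start, end)` pairs. The sample string, the offsets, and the helper name `extract_spans` are invented for illustration. The end index is assumed to be exclusive, as in Python slicing, and the actual serialization used in `questions_state.csv` may differ.

```python
def extract_spans(corpus: str, references: list[tuple[int, int]]) -> list[str]:
    """Return the corpus substrings addressed by (start, end) character offsets."""
    # Assumption: `end` is exclusive, matching Python slice semantics.
    return [corpus[start:end] for start, end in references]


# Hypothetical toy data, not taken from the real corpus or CSV.
corpus = "The economy grew last year. Jobs were added in every state."
references = [(4, 16), (28, 43)]
spans = extract_spans(corpus, references)
```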

---

## 3. Chunking Strategy

- **Algorithm**: `FixedTokenChunker`
- **Source**: [GitHub - fixed_token_chunker.py](https://github.com/brandonstarxel/chunking_evaluation/blob/main/chunking_evaluation/chunking/fixed_token_chunker.py)
- **Chunking Parameters**:
  - Chunk sizes: 100, 200, 400, 600, 800
  - Chunk overlaps: 0 to 400 (varied based on size)

Varying both parameters made it possible to analyze how chunk granularity and overlap affect precision and recall.
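
The general idea of fixed-size chunking with overlap can be sketched as a sliding window over a token list. This is a simplified illustration and not the code of the referenced `FixedTokenChunker`: whitespace splitting stands in for a real tokenizer, and the function name `fixed_token_chunks` is hypothetical.

```python
def fixed_token_chunks(tokens: list[str], chunk_size: int, chunk_overlap: int) -> list[list[str]]:
    """Split a token list into windows of `chunk_size` that share `chunk_overlap` tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


# Whitespace splitting stands in for a real tokenizer here (an assumption for brevity).
tokens = "a b c d e f g h i j".split()
chunks = fixed_token_chunks(tokens, chunk_size=4, chunk_overlap=2)
```

With a chunk size of 4 and an overlap of 2, each window advances by 2 tokens, so consecutive chunks share their boundary tokens.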

---

## 4. Embedding Model

- **Model**: `all-MiniLM-L6-v2`
- **Source**: Hugging Face
- **Embedding Function**: Implemented in `embedding_function.py` with batch support.
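
Batch support can be sketched as a thin wrapper that feeds an encoder fixed-size slices of the input. This is not the content of `embedding_function.py`: the names `embed_in_batches` and `toy_encode` are hypothetical, and a toy encoder replaces the real model so the example runs without downloading anything. In a real run, `encode` would wrap the `all-MiniLM-L6-v2` model.

```python
from typing import Callable, Sequence


def embed_in_batches(
    texts: Sequence[str],
    encode: Callable[[Sequence[str]], list[list[float]]],
    batch_size: int = 32,
) -> list[list[float]]:
    """Embed `texts` by calling `encode` on consecutive slices of at most `batch_size` items."""
    embeddings: list[list[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        embeddings.extend(encode(batch))
    return embeddings


# Hypothetical stand-in encoder so the sketch runs without a model:
# it maps each text to a 2-dimensional vector of (character count, word count).
def toy_encode(batch: Sequence[str]) -> list[list[float]]:
    return [[float(len(text)), float(len(text.split()))] for text in batch]


vectors = embed_in_batches(["one two", "three", "four five six"], toy_encode, batch_size=2)
```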

---

## 5. Evaluation Metrics

- **Precision**: Proportion of retrieved chunk text overlapping with ground truth.
- **Recall**: Proportion of ground truth captured in retrieved chunks.

Implementation details:
- Calculated based on range overlaps (using character-level spans)
- Handles union of intersecting ranges
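
One way to compute these metrics on character ranges is sketched below. It assumes `(start, end)` offsets with an exclusive end and merges overlapping ranges before measuring, so text covered by several overlapping chunks is counted once. The function names are illustrative and the pipeline's actual implementation may differ in its details.

```python
Range = tuple[int, int]  # (start, end) character offsets, end exclusive (assumed)


def union_ranges(ranges: list[Range]) -> list[Range]:
    """Merge overlapping or touching ranges into a sorted, disjoint list."""
    merged: list[Range] = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def total_length(ranges: list[Range]) -> int:
    """Total number of characters covered by a disjoint range list."""
    return sum(end - start for start, end in ranges)


def overlap_length(a: list[Range], b: list[Range]) -> int:
    """Number of characters covered by both disjoint range lists."""
    return sum(
        max(0, min(a_end, b_end) - max(a_start, b_start))
        for a_start, a_end in a
        for b_start, b_end in b
    )


def precision_recall(retrieved: list[Range], references: list[Range]) -> tuple[float, float]:
    retrieved_u = union_ranges(retrieved)
    references_u = union_ranges(references)
    hit = overlap_length(retrieved_u, references_u)
    retrieved_len = total_length(retrieved_u)
    reference_len = total_length(references_u)
    precision = hit / retrieved_len if retrieved_len else 0.0
    recall = hit / reference_len if reference_len else 0.0
    return precision, recall


# Toy offsets for illustration only.
precision, recall = precision_recall(
    retrieved=[(0, 100), (80, 200)],    # overlapping chunks cover 200 characters once merged
    references=[(50, 90), (190, 230)],  # 80 reference characters in total
)
```

In this toy case, 50 of the 200 retrieved characters fall inside the references, and 50 of the 80 reference characters are covered.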

---

## 6. Retrieval Pipeline Overview

The pipeline operates as follows:

Corpus → Chunking → Chunk Embedding → Query Embedding → Similarity Search → Top-N Retrieval → Scoring

- **Chunker**: `FixedTokenChunker`
- **Retriever**: Cosine similarity over embeddings
- **Evaluation**: Run for each query and averaged

Implemented in `pipeline.ipynb`.
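
The similarity search and top-N retrieval steps can be illustrated with a small pure-Python sketch. The helper names are hypothetical and the vectors are toy values. A real run would typically use a vectorized library and the 384-dimensional embeddings produced by `all-MiniLM-L6-v2`.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_n(query: list[float], chunk_vectors: list[list[float]], n: int) -> list[int]:
    """Return indices of the `n` chunks most similar to the query, best first."""
    scores = [cosine_similarity(query, vec) for vec in chunk_vectors]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:n]


# Toy 2-dimensional vectors standing in for real embeddings.
chunk_vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
best = top_n([1.0, 0.2], chunk_vectors, n=2)
```

The returned indices identify the retrieved chunks, whose character ranges are then scored against the reference spans for each query.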

---

## 7. Experiments & Results

- **Tested Retrieval Sizes**: 1, 2, 3, 5, 10
- **Tested Chunk Sizes/Overlaps**: Several (an excerpt of the results is shown below; the full table is in `experiment_comparison_table.csv`)
- All combinations were evaluated and logged.

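```python
import pandas as pd

# Load the logged results; the column name is spelled `chunks_retieved` in the CSV.
df = pd.read_csv("experiment_comparison_table.csv")
df.sort_values(by=["chunk_size", "chunk_overlap", "chunks_retieved"])
```

| Unnamed: 0 | chunk_size | chunk_overlap | chunks_retieved | precision | recall |
|---|---|---|---|---|---|
| 0 | 100 | 0 | 1 | 0.188042 | 0.486034 |
| 1 | 100 | 0 | 2 | 0.120373 | 0.614665 |
| 2 | 100 | 0 | 3 | 0.093550 | 0.702523 |
| 3 | 100 | 0 | 5 | 0.073808 | 0.897149 |
| 4 | 100 | 0 | 10 | 0.039737 | 0.976826 |
| ... | ... | ... | ... | ... | ... |
| 75 | 800 | 400 | 1 | 0.031116 | 0.608412 |
| 76 | 800 | 400 | 2 | 0.021086 | 0.778466 |
| 77 | 800 | 400 | 3 | 0.014702 | 0.809186 |
| 78 | 800 | 400 | 5 | 0.008961 | 0.822344 |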


Plots of these results can be generated by running `plots.py`.

---

## 8. Key Insights

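- **Smaller chunks** offer higher precision but can hurt recall when only a few chunks are retrieved, because a small chunk may miss part of the relevant span. With one chunk retrieved, 100-token chunks without overlap reach precision 0.188 and recall 0.486, while 800-token chunks with a 400-token overlap reach precision 0.031 and recall 0.608.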
- **Overlap** improves recall by covering context that spans boundaries between chunks.
- **Best results** regarding *recall* were observed with:
  - **Chunk size**: 200–600 tokens
  - **Overlap**: half of the chunk size
  - **Number of chunks retrieved**: 10
- **Increasing** the number of chunks retrieved always lowered precision but improved recall, which aligns with intuition.

---

## 9. Conclusion

The implemented retrieval system successfully retrieves relevant document excerpts using fixed-size chunking and dense vector representations. Through multiple experiments, the project highlights how chunk size and overlap can significantly affect retrieval quality. The pipeline demonstrates the potential for using simple chunking methods and embeddings for document retrieval tasks.

---

## 10. Some Thoughts/Observations Regarding the Task

Unfortunately, the link provided in the task did not lead to an explanation of the **FixedTokenChunker** algorithm, so I had to deduce the chunking process from the code itself.\
While reading the associated paper, I had some concerns about how the dataset was created. Using an LLM to obtain relevant context for the questions does not guarantee that the extracted excerpts are always relevant, nor does it ensure that all relevant excerpts are captured. The authors attempted to mitigate the first issue by checking cosine similarity between queries and references, but the result still felt somewhat imprecise at times. The second issue, ensuring that all relevant content is captured, was not resolved in the paper; the authors acknowledged this limitation, which I believe is hard to overcome given the synthetic nature of the dataset.\
Nonetheless, the task was highly interesting, and I gained valuable insights from it.

---

## 11. Deliverables

- `data/questions_state.csv` and `state_of_the_union.md`
- `embedding_function.py`
- `fixed_token_chunker.py`
- `pipeline.ipynb`
- `experiment_comparison_table.csv` 
- `plots.py` (simple script to generate plots from experiment results)
- `report.ipynb` (this file)

---