## 🧩 Document Chunking Techniques for RAG

This notebook implements various **document chunking techniques** to create retrievable passages for a **Retrieval-Augmented Generation (RAG)** pipeline.

---

### 🔹 Fixed Window Chunking
Splits text into fixed-size windows, optionally overlapping (e.g., 500-token windows with a 100-token overlap). The cells in this notebook measure window size in characters.

### 🔹 Recursive Character Splitter Chunking
Recursively splits text by paragraphs, sentences, or characters to maintain semantic coherence.

### 🔹 Semantic Chunking
Uses embeddings or sentence similarity to group semantically related sentences together.

### 🔹 Embedding-Based Chunking
Leverages vector similarity to merge or refine chunks based on embedding distances.

### 🔹 Agentic Chunking
Dynamically decides how to chunk text based on context using an LLM (e.g., chunk-by-intent or topic).

---

> 💡 *Goal:* To evaluate and compare chunking strategies that best balance retrieval precision and generation quality.


### Fixed Window Chunking

In [None]:
from typing import List
def fixed_window_splitter(text: str, chunk_size: int = 1000) -> List[str]:
    """Splits text at given chunk_size"""
    splits = []
    for i in range(0, len(text), chunk_size):
        splits.append(text[i:i + chunk_size])
    return splits


# Read the entire file as a single string
with open("sample.txt", "r", encoding="utf-8") as f:
    text = f.read()

splitted_document = fixed_window_splitter(text, chunk_size=50)
print(splitted_document)


['What Are Transformers?\nTransformers are a type of ', 'neural network architecture that has revolutionize', 'd the field of natural language processing (NLP). ', 'Introduced in a 2017 paper titled "Attention is Al', 'l You Need" by Vaswani et al., Transformers are de', 'signed to handle sequential data, such as text, by', ' using a mechanism called self-attention.\n\nKey Cha', 'racteristics of Transformers:\nSelf-Attention Mecha', 'nism: The self-attention mechanism allows Transfor', 'mers to weigh the importance of different words in', ' a sentence when making predictions. This is cruci', 'al for understanding context, especially in long s', 'entences.\nParallelization: Unlike traditional RNNs', ' (Recurrent Neural Networks), which process data s', 'equentially, Transformers can process multiple wor', 'ds at once, making them faster and more efficient.', '\nVersatility: Transformers are not limited to lang', 'uage tasks; they can be applied to any problem inv', 'olving sequential da
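The output above shows the main drawback of a purely positional split: words are cut across chunk boundaries (for example `'revolutionize'` followed by `'d the field'`). One possible mitigation, shown here as a minimal sketch and not as part of the original notebook, is to back each window off to the last whitespace character. The name `word_boundary_splitter` is hypothetical.

```python
from typing import List


def word_boundary_splitter(text: str, chunk_size: int = 1000) -> List[str]:
    """Hypothetical variant: fixed windows that back off to the last whitespace."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Walk back from the window edge to just after the last whitespace character
            cut = end
            while cut > start and not text[cut - 1].isspace():
                cut -= 1
            if cut > start:  # keep the hard cut when the window contains no whitespace
                end = cut
        chunks.append(text[start:end])
        start = end
    return chunks


print(word_boundary_splitter("Transformers are a type of neural network", chunk_size=16))
```

Chunks produced this way are at most `chunk_size` characters long and concatenate back to the original text, but their lengths vary.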

### Fixed Window Chunking with Overlap

In [None]:
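def fixed_window_with_overlap_splitter(text: str, chunk_size: int = 1000, chunk_overlap: int = 10) -> List[str]:
    """Splits text into chunk_size windows; each window starts chunk_overlap characters before the previous one ends."""
    # An overlap >= chunk_size would never advance `start` and would loop forever
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")

    chunks = []
    start = 0

    # Use < rather than <= so no empty chunk is appended when start == len(text)
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start = end - chunk_overlap

    return chunks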


# Read the entire file as a single string
with open("sample.txt", "r", encoding="utf-8") as f:
    text = f.read()

splitted_document = fixed_window_with_overlap_splitter(text, chunk_size=50, chunk_overlap=20)
print(splitted_document)


['What Are Transformers?\nTransformers are a type of ', 'rmers are a type of neural network architecture th', 'work architecture that has revolutionized the fiel', 'olutionized the field of natural language processi', 'al language processing (NLP). Introduced in a 2017', 'Introduced in a 2017 paper titled "Attention is Al', 'led "Attention is All You Need" by Vaswani et al.,', '" by Vaswani et al., Transformers are designed to ', 'ers are designed to handle sequential data, such a', 'uential data, such as text, by using a mechanism c', ' using a mechanism called self-attention.\n\nKey Cha', '-attention.\n\nKey Characteristics of Transformers:\n', 'cs of Transformers:\nSelf-Attention Mechanism: The ', 'tion Mechanism: The self-attention mechanism allow', 'tion mechanism allows Transformers to weigh the im', 'mers to weigh the importance of different words in', 'f different words in a sentence when making predic', 'e when making predictions. This is crucial for und', 's is crucial for un
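The overview gives window sizes in tokens, while the two cells above count characters. The same sliding-window logic can be applied to a token list. The sketch below is a minimal illustration that assumes whitespace-separated words as a stand-in for model tokens; a real pipeline would use the tokenizer of its embedding model. The name `token_window_splitter` is hypothetical.

```python
from typing import List


def token_window_splitter(text: str, chunk_size: int = 500, chunk_overlap: int = 100) -> List[str]:
    """Hypothetical helper: overlapping windows over whitespace-separated words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")

    words = text.split()              # assumption: one word ~ one token
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


print(token_window_splitter("a b c d e f g", chunk_size=4, chunk_overlap=2))
```

Stopping once a window reaches the end avoids a final chunk that only repeats the tail of the previous one.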

### Recursive Character Splitter

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=250, chunk_overlap=50)

# Read the entire file as a single string
with open("documents/sample.txt", "r", encoding="utf-8") as f:
    text = f.read()

splitted_document = text_splitter.split_text(text)
print(splitted_document)

['What Are Transformers?', 'Transformers are a type of neural network architecture that has revolutionized the field of natural language processing (NLP). Introduced in a 2017 paper titled "Attention is All You Need" by Vaswani et al., Transformers are designed to handle', 'et al., Transformers are designed to handle sequential data, such as text, by using a mechanism called self-attention.', 'Key Characteristics of Transformers:', 'Self-Attention Mechanism: The self-attention mechanism allows Transformers to weigh the importance of different words in a sentence when making predictions. This is crucial for understanding context, especially in long sentences.', 'Parallelization: Unlike traditional RNNs (Recurrent Neural Networks), which process data sequentially, Transformers can process multiple words at once, making them faster and more efficient.', 'Versatility: Transformers are not limited to language tasks; they can be applied to any problem involving sequential data, including tas
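To make the recursive idea concrete, the following is a simplified pure-Python sketch: split on the coarsest separator first, pack the pieces greedily up to `chunk_size`, and recurse with finer separators only on pieces that are still too long. It is an illustration under stated assumptions, not LangChain's implementation: it ignores `chunk_overlap`, drops the separator at each split point, and the name `recursive_split` is hypothetical.

```python
from typing import List, Sequence


def recursive_split(text: str, chunk_size: int = 250,
                    separators: Sequence[str] = ("\n\n", "\n", " ")) -> List[str]:
    """Simplified recursive splitter (no overlap); not LangChain's actual code."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to a hard character cut
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    pieces = text.split(sep)
    if len(pieces) == 1:
        return recursive_split(text, chunk_size, finer)

    chunks: List[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate          # piece still fits in the current chunk
            continue
        if current.strip():
            chunks.append(current)       # close the chunk built so far
        if len(piece) > chunk_size:
            # The piece alone is too long: retry it with finer separators
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
        else:
            current = piece
    if current.strip():
        chunks.append(current)
    return chunks


print(recursive_split("aaa bbb\n\nccc ddd eee", chunk_size=8))
```

Paragraph breaks are tried before line breaks and spaces, which is why short headings such as `'What Are Transformers?'` tend to end up as chunks of their own in the output above.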

### Semantic Chunking

In [None]:
from openai import OpenAI
from typing import List
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import os

# -----------------------------
# Step 0: Setup API key
# -----------------------------
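# Set OPENAI_API_KEY in your environment before running this cell.
# Do not assign "" here: that would overwrite a valid key and always trigger the error below.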
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
    raise ValueError("❌ OPENAI_API_KEY not found. Please set it as an environment variable.")

client = OpenAI(api_key=api_key)

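# -----------------------------
# Step 1: Sentence splitter
# -----------------------------
import re

def split_sentences(text: str) -> List[str]:
    """Splits text into sentences with a simple regex (no abbreviation handling)."""
    # Split on period, exclamation, question mark followed by space or line break
    sentences = re.split(r'(?<=[.!?])\s+', text)
    return [s.strip() for s in sentences if s.strip()]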

# -----------------------------
# Step 2: Semantic chunking function
# -----------------------------
def semantic_chunking_openai(text: str, chunk_size: int = 5, similarity_threshold: float = 0.7) -> List[str]:
    """
    Chunk text based on semantic similarity using OpenAI embeddings.
    Consecutive sentences are combined while they remain semantically close.
    chunk_size = maximum number of sentences per chunk
    similarity_threshold = minimum cosine similarity between adjacent sentences to keep them in the same chunk
    """
    # Split text into sentences
    # sentences = nltk.sent_tokenize(text)
    sentences = split_sentences(text)
    if not sentences:
        return []  # nothing to embed; also avoids an IndexError on sentences[0] below

    # Get embeddings for each sentence
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=sentences
    )
    embeddings = [data.embedding for data in response.data]

    # Group sentences into semantically similar chunks
    chunks = []
    current_chunk = [sentences[0]]

    for i in range(1, len(sentences)):
        sim = cosine_similarity([embeddings[i-1]], [embeddings[i]])[0][0]
        if sim >= similarity_threshold and len(current_chunk) < chunk_size:
            current_chunk.append(sentences[i])
        else:
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentences[i]]

    if current_chunk:
        chunks.append(" ".join(current_chunk))

    return chunks

# -----------------------------
# Step 3: Read file and chunk
# -----------------------------
with open("sample.txt", "r", encoding="utf-8") as f:
    text = f.read()

splitted_document = semantic_chunking_openai(
    text=text,
    chunk_size=5,              # maximum number of sentences per chunk
    similarity_threshold=0.6   # semantic similarity threshold
)

# -----------------------------
# Step 4: Print chunks
# -----------------------------
for i, chunk in enumerate(splitted_document, 1):
    print(f"--- Chunk {i} ---")
    print(chunk)
    print()



--- Chunk 1 ---
What Are Transformers? Transformers are a type of neural network architecture that has revolutionized the field of natural language processing (NLP). Introduced in a 2017 paper titled "Attention is All You Need" by Vaswani et al., Transformers are designed to handle sequential data, such as text, by using a mechanism called self-attention. Key Characteristics of Transformers:
Self-Attention Mechanism: The self-attention mechanism allows Transformers to weigh the importance of different words in a sentence when making predictions.

--- Chunk 2 ---
This is crucial for understanding context, especially in long sentences.

--- Chunk 3 ---
Parallelization: Unlike traditional RNNs (Recurrent Neural Networks), which process data sequentially, Transformers can process multiple words at once, making them faster and more efficient. Versatility: Transformers are not limited to language tasks; they can be applied to any problem involving sequential data, including tasks like image 
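
### Embedding-Based Chunking

The overview describes embedding-based chunking as merging or refining chunks according to embedding distance. The sketch below is one minimal way to do that, under stated assumptions: it takes chunks from any earlier splitter and merges neighbours whose embeddings are similar, as long as the merged chunk stays under a character limit. The names `merge_by_embedding`, `cosine`, and `toy_embed` are hypothetical, and `toy_embed` is a keyword counter standing in for a real embedding model such as the `text-embedding-3-small` call used above.

```python
import math
from typing import Callable, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def merge_by_embedding(chunks: List[str],
                       embed: Callable[[str], Sequence[float]],
                       threshold: float = 0.5,
                       max_chars: int = 1000) -> List[str]:
    """Merge adjacent chunks whose embeddings have cosine similarity >= threshold."""
    if not chunks:
        return []
    merged = [chunks[0]]
    for chunk in chunks[1:]:
        candidate = merged[-1] + " " + chunk
        similar = cosine(embed(merged[-1]), embed(chunk)) >= threshold
        if similar and len(candidate) <= max_chars:
            merged[-1] = candidate   # extend the previous chunk
        else:
            merged.append(chunk)     # start a new chunk
    return merged


def toy_embed(text: str) -> List[float]:
    """Toy stand-in for an embedding model: counts of a few keywords."""
    vocab = ("attention", "transformer", "rnn", "image")
    lowered = text.lower()
    return [float(lowered.count(word)) for word in vocab]


pieces = [
    "Self-attention weighs words.",
    "Attention captures context.",
    "RNNs process data sequentially.",
]
for merged_chunk in merge_by_embedding(pieces, toy_embed):
    print(merged_chunk)
```

This version re-embeds the growing chunk on every step, which keeps the code short; with a paid embedding API it would be cheaper to embed all inputs in one batch and combine the vectors locally.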

### Agentic Chunking

In [None]:
from openai import OpenAI
import os
from typing import List

# -----------------------------
# Step 0: Setup API key
# -----------------------------
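# Set OPENAI_API_KEY in your environment before running this cell.
# Do not assign "" here: that would overwrite a valid key and always trigger the error below.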
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
    raise ValueError("❌ OPENAI_API_KEY not found. Please set it as an environment variable.")

client = OpenAI(api_key=api_key)

# -----------------------------
# Step 1: Agentic chunking function
# -----------------------------
def agentic_chunking_openai(text_data: str, max_chunk_chars: int = 1000) -> List[str]:
    """
    Split a document into semantically coherent chunks using an OpenAI LLM.
    The model is asked to keep each chunk ≤ max_chunk_chars and to preserve meaning;
    neither is enforced in code, so the returned chunks should be checked.
    """
    prompt = f"""
    I am providing a document below. 
    Please split the document into chunks that maintain semantic coherence and ensure that each chunk represents a complete and meaningful unit of information. 
    Each chunk should stand alone, preserving the context and meaning without splitting key ideas across chunks. 
    Ensure that no chunk exceeds {max_chunk_chars} characters in length, and prioritize keeping related concepts or sections together.

    Do not modify the document, just split into chunks and return them as an array of strings, 
    where each string is one chunk of the document. Return the entire text; do not stop midway.

    Document:
    {text_data}
    """

    # Use the ChatCompletion API
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # or gpt-4o if you have access
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7
    )

    # Extract the text
    output_text = response.choices[0].message.content

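    # Try to parse the reply as a Python-style list of strings
    import ast
    try:
        chunks = ast.literal_eval(output_text)
        if isinstance(chunks, list):
            return chunks
    except (ValueError, SyntaxError, TypeError):
        # The reply was not a valid Python literal; use the fallback below
        pass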

    # Fallback: split by double newlines
    chunks = [c.strip() for c in output_text.split("\n\n") if c.strip()]
    return chunks

# -----------------------------
# Step 2: Usage
# -----------------------------
with open("sample.txt", "r", encoding="utf-8") as f:
    text = f.read()

chunks = agentic_chunking_openai(text, max_chunk_chars=1000)
for i, chunk in enumerate(chunks, 1):
    print(f"--- Chunk {i} ---")
    print(chunk)
    print()
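Because the prompt only asks the model not to modify the document and to respect the length limit, it can be worth checking the reply before indexing it. The sketch below is one possible check, not part of the original notebook: it verifies that every chunk fits the limit and that the chunks appear verbatim and in order, covering the whole document. Whitespace is normalized before comparison, and the name `validate_chunks` is hypothetical.

```python
from typing import List


def _normalize(s: str) -> str:
    """Collapse all whitespace runs to single spaces."""
    return " ".join(s.split())


def validate_chunks(original: str, chunks: List[str], max_chunk_chars: int = 1000) -> List[str]:
    """Return a list of human-readable problems; an empty list means the chunks look fine."""
    problems = []
    source = _normalize(original)
    cursor = 0  # position in `source` up to which text has been accounted for
    for i, chunk in enumerate(chunks, 1):
        if len(chunk) > max_chunk_chars:
            problems.append(f"chunk {i} has {len(chunk)} characters (limit {max_chunk_chars})")
        needle = _normalize(chunk)
        found = source.find(needle, cursor)
        if found == -1:
            problems.append(f"chunk {i} is not a verbatim, in-order span of the document")
            continue
        if source[cursor:found].strip():
            problems.append(f"document text was skipped before chunk {i}")
        cursor = found + len(needle)
    if source[cursor:].strip():
        problems.append("document text after the last chunk is missing")
    return problems


doc = "Transformers use self-attention. RNNs process data sequentially."
good = ["Transformers use self-attention.", "RNNs process data sequentially."]
print(validate_chunks(doc, good))
```

A non-empty result could be used to retry the request or to fall back to one of the deterministic splitters earlier in the notebook.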
