# **Chunking Strategies:**

## **Different Types of Chunking Methods:**

### **1. Fixed Size Chunking:**

Fixed-size chunking divides a document into equal-sized chunks based on a predefined metric, such as word or token count, with optional overlap between adjacent chunks. This is useful when you want to process large documents in smaller, manageable pieces. <br>

* **How:** The text is split into segments of a specified size, regardless of the content's semantic boundaries.

* **Example:** A 1,000-word document split into 100-word chunks results in 10 chunks.

* **Pros:**
    * Simple and quick to implement.
    * Ensures uniform chunk sizes, facilitating consistent processing.

* **Cons:**
    * May split sentences or paragraphs, disrupting context.
    * Not ideal for content requiring semantic coherence.
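
The splitting described above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming words as the size metric; the function name `fixed_size_chunks` and its parameters are hypothetical and not taken from any library.

```python
def fixed_size_chunks(text: str, chunk_size: int = 100, overlap: int = 0) -> list[str]:
    """Split text into chunks of `chunk_size` words, sharing `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


# A 1,000-word stand-in document, matching the example above.
document = " ".join(f"word{i}" for i in range(1000))
chunks = fixed_size_chunks(document, chunk_size=100)
print(len(chunks))  # 10
```

Note that the window ignores sentence boundaries entirely, which is exactly the weakness listed under the cons.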

### **2. Document-Based Chunking:**

Document-based chunking splits a document along its `inherent structural elements`, such as **`paragraphs`**, **`sections`**, and **`headings`**, rather than at fixed character counts. This is useful when you want to process large documents while preserving semantic meaning and context. <br>

* **How:** Utilizes the document's layout to determine chunk boundaries, preserving logical groupings.

* **Example:** A research paper divided into chunks corresponding to its Abstract, Introduction, Methods, Results, and Conclusion sections.

* **Pros:**
    * Maintains the document's logical flow and context.
    * Enhances retrieval relevance by aligning with natural divisions.

* **Cons:**
    * Requires documents to have clear structural markers.
    * Less effective for unstructured or free-form text.
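
As a rough sketch of structure-based splitting, the snippet below treats Markdown headings as the structural markers. This is an assumption made for illustration: other formats (HTML, PDF outlines) would need their own boundary detection, and `split_by_headings` is a hypothetical helper, not a library function.

```python
import re

def split_by_headings(markdown_text: str) -> list[dict]:
    """Return one chunk per Markdown heading, using the heading text as the title."""
    chunks = []
    title, lines = "Preamble", []  # text before the first heading, if any
    for line in markdown_text.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match:
            # A new heading closes the previous section (empty sections are dropped).
            if any(l.strip() for l in lines):
                chunks.append({"title": title, "text": "\n".join(lines).strip()})
            title, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    if any(l.strip() for l in lines):
        chunks.append({"title": title, "text": "\n".join(lines).strip()})
    return chunks


paper = """# Abstract
We study chunking.

# Introduction
Chunking matters for retrieval.

# Methods
We compare five strategies.
"""
sections = split_by_headings(paper)
print([s["title"] for s in sections])  # ['Abstract', 'Introduction', 'Methods']
```

Text without any heading ends up in a single "Preamble" chunk, which illustrates why this approach is less effective for unstructured documents.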

### **3. Semantic Chunking:**

**Semantic chunking splits a document by analyzing the `semantic similarity` between text segments using `embeddings`**, so that each chunk contains a coherent idea. Some implementations use the chonkie library to identify natural **breakpoints where the semantic meaning changes significantly**, based on a configurable **similarity threshold**. Compared with fixed-size chunking, this keeps semantically related content together in the same chunk and places splits at meaningful topic transitions. <br>


* **How:** Employs natural language processing techniques to identify semantically related sentences or paragraphs.

* **Example:** A news article segmented into chunks where each discusses a distinct aspect of the story.

* **Pros:**
    * Preserves meaning and context within chunks.
    * Improves retrieval accuracy for semantically rich queries.

* **Cons:**
    * More computationally intensive than fixed-size methods.
    * Complex to implement due to reliance on semantic analysis.
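
The threshold idea can be illustrated without any external dependency. The sketch below is not the chonkie API: it assumes a toy bag-of-words vector as a stand-in for a real embedding model, and it only compares each sentence with the one before it. The names `embed`, `cosine`, `semantic_chunks`, and `STOP_WORDS` are hypothetical.

```python
import math
import re
from collections import Counter

# Assumption: a tiny stop-word list so common words do not inflate similarity.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in"}

def embed(sentence: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return Counter(w for w in words if w not in STOP_WORDS)

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def semantic_chunks(sentences: list[str], threshold: float = 0.2) -> list[list[str]]:
    """Open a new chunk whenever similarity to the previous sentence drops below threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, curr in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(curr)) < threshold:
            chunks.append([curr])     # topic shift: start a new chunk
        else:
            chunks[-1].append(curr)   # same topic: extend the current chunk
    return chunks


sentences = [
    "The storm flooded the city streets.",
    "Flooded streets left the city without power.",
    "The football team won the final match.",
    "Fans celebrated the team after the match.",
]
chunks = semantic_chunks(sentences)
print(len(chunks))  # 2
```

With a real embedding model, `embed` would return dense vectors and the threshold would need tuning for that model; the grouping loop itself would stay the same.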

### **4. Agentic Chunking:**


**Agentic chunking splits a document by using an `LLM to determine natural breakpoints` in the text**. Rather than splitting at fixed character counts, an AI agent analyzes the content to find semantically meaningful boundaries, such as paragraph breaks and topic transitions, and can adapt those boundaries dynamically to task-specific requirements. <br>

* **How:** An AI agent analyzes the document, identifying and segmenting content into chunks optimized for specific tasks or queries.

* **Example:** For a customer support chatbot, the agent segments a product manual into chunks corresponding to common user issues.

* **Pros:**
    * Highly adaptable to various tasks and contexts.
    * Enhances relevance by aligning chunks with user intents.

* **Cons:**
    * Requires sophisticated AI models and training.
    * Potentially resource-intensive and complex to maintain.
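
The control flow can be sketched with the model call abstracted away. In the snippet below, `choose_breakpoints` is a hypothetical callable that, in a real system, would wrap an LLM prompt and return the paragraph indices at which new chunks should start. Here it is replaced by a rule-based stand-in (`fake_llm_breakpoints`) so the example runs offline; no real LLM API is shown.

```python
from typing import Callable

def agentic_chunks(
    paragraphs: list[str],
    choose_breakpoints: Callable[[list[str]], list[int]],
) -> list[str]:
    """Group paragraphs into chunks at the indices chosen by `choose_breakpoints`.

    Each returned index i means "start a new chunk at paragraph i".
    """
    n = len(paragraphs)
    # Validate the decider's output: keep in-range indices, deduplicated and sorted.
    starts = sorted({i for i in choose_breakpoints(paragraphs) if 0 < i < n})
    bounds = [0, *starts, n]
    return ["\n\n".join(paragraphs[a:b]) for a, b in zip(bounds, bounds[1:]) if a < b]


def fake_llm_breakpoints(paragraphs: list[str]) -> list[int]:
    """Hypothetical stand-in for an LLM call: break where a new user issue begins."""
    return [i for i, p in enumerate(paragraphs) if p.lower().startswith("problem:")]


manual = [
    "Problem: the device will not turn on.",
    "Hold the power button for ten seconds.",
    "Problem: the battery drains quickly.",
    "Lower the screen brightness.",
]
chunks = agentic_chunks(manual, fake_llm_breakpoints)
print(len(chunks))  # 2
```

Validating the returned indices matters in practice, because a model may return out-of-range or duplicate positions.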

### **5. Recursive Chunking:**

Recursive chunking splits a document by trying a **`hierarchy of separators`** in a recursive manner, moving to a finer separator only when a chunk is still too large. The goal is to preserve **`semantic meaning`** and **`structure`** while ensuring chunks stay within a specific size limit (measured in characters or tokens). <br>

* **How It Works:** It recursively splits the document using a predefined list of separators, ordered from coarsest to finest:
    ```python
        ["\n\n", "\n", ".", " ", ""]
    ```
    1. Start with the coarsest separator (e.g., paragraph breaks: `\n\n`).
    2. If a chunk is still too large, try a finer separator (e.g., line breaks: `\n`).
    3. If it is still too large, split at sentence level (e.g., `"."`).
    4. Continue down to word level (`" "`) and finally character level (`""`) if needed.

* **Why Use Recursive Chunking?**
    * Maintains semantic boundaries (e.g., full paragraphs or sentences)
    * Avoids breaking up ideas mid-sentence
    * More natural for QA systems, chatbots, and search-based tasks

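* **Implementation:**
    ```python
        # Newer LangChain releases expose this class from the
        # `langchain_text_splitters` package instead.
        from langchain.text_splitter import RecursiveCharacterTextSplitter

        splitter = RecursiveCharacterTextSplitter(
            chunk_size=500,         # max characters per chunk (length is measured with len() by default)
            chunk_overlap=50,       # characters shared by adjacent chunks for context continuity
            separators=["\n\n", "\n", ".", " ", ""]
        )

        text = "Your document text goes here."  # placeholder input
        chunks = splitter.split_text(text)      # returns a list of strings
    ```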
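
To make the recursion itself visible, here is a minimal pure-Python sketch of the separator hierarchy. It is not LangChain's implementation: it measures size in characters, has no overlap, and the function name `recursive_split` is hypothetical.

```python
def recursive_split(text: str, max_len: int, separators: list[str]) -> list[str]:
    """Split text into pieces of at most max_len characters, preferring coarse separators."""
    if len(text) <= max_len:
        return [text] if text else []
    if not separators or separators[0] == "":
        # No separators left: fall back to a hard cut at character level.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]

    sep, finer = separators[0], separators[1:]
    pieces = text.split(sep)
    # Re-attach the separator to every piece but the last so no text is lost.
    parts = [p + sep for p in pieces[:-1]] + [pieces[-1]]

    chunks, current = [], ""
    for part in parts:
        if len(current) + len(part) <= max_len:
            current += part  # merge small neighbours into one chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= max_len:
            current = part
        else:
            # Still too large: recurse with the next, finer separator.
            chunks.extend(recursive_split(part, max_len, finer))
    if current:
        chunks.append(current)
    return chunks


text = "First paragraph. It has two sentences.\n\nSecond paragraph is short."
raw = recursive_split(text, max_len=30, separators=["\n\n", "\n", ".", " ", ""])
# Drop whitespace-only pieces and trim the rest for display.
chunks = [c.strip() for c in raw if c.strip()]
print(chunks)
```

The first paragraph is longer than 30 characters, so it falls through to the sentence-level separator, while the second paragraph fits and is kept whole.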

## **Ideal Overlap Size When Chunking a Document:**


* **Overlap size** is the number of tokens (or characters) shared between adjacent chunks to maintain continuity of context. <br>

* **General Recommendation:**
    * Chunk size: `500–1,000` tokens
    * Overlap size: `10–20%` of chunk size
        * 👉 Typical value: `50–200` tokens

* **Why Overlap is Important:**
    * Ensures no important sentence or semantic content is cut off between chunks.
    * Preserves context for models like GPT-4 to generate meaningful answers.
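
The arithmetic behind these recommendations is simple enough to check directly. The helper names below (`overlap_tokens`, `estimate_chunk_count`) are hypothetical, and the 5,000-token document is an assumed example rather than a figure from this guide.

```python
import math

def overlap_tokens(chunk_size: int, overlap_ratio: float) -> int:
    """Overlap size in tokens for a given chunk size and ratio (e.g. 0.10 to 0.20)."""
    return round(chunk_size * overlap_ratio)

def estimate_chunk_count(total_tokens: int, chunk_size: int, overlap: int) -> int:
    """Number of sliding-window chunks needed to cover total_tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    if total_tokens <= chunk_size:
        return 1 if total_tokens > 0 else 0
    step = chunk_size - overlap  # new tokens contributed by each additional chunk
    return 1 + math.ceil((total_tokens - chunk_size) / step)


print(overlap_tokens(500, 0.10))            # 50
print(overlap_tokens(1000, 0.20))           # 200
print(estimate_chunk_count(5000, 500, 50))  # 11
```

A larger overlap shrinks the step between chunks, so it increases the number of chunks to embed and store. That is the cost paid for the extra context continuity.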

### **Calculate Total Token and Character Counts of a PDF Document:**

```python
# %pip install pymupdf tiktoken
```

```python
import fitz  # PyMuPDF
import tiktoken

# Load PDF and extract all text
def extract_text_from_pdf(pdf_path):
    doc = fitz.open(pdf_path)
    text = ""
    for page in doc:
        text += page.get_text()
    return text

# Count characters and tokens
def get_token_character_counts(text, model="gpt-4"):
    encoding = tiktoken.encoding_for_model(model)
    tokens = encoding.encode(text)
    return {
        "character_count": len(text),
        "token_count": len(tokens)
    }


# Example usage
pdf_path = "mypdf.pdf"
text = extract_text_from_pdf(pdf_path)
stats = get_token_character_counts(text)

print(f"Total Characters: {stats['character_count']}")
print(f"Total Tokens: {stats['token_count']}")
```

Output:

```
Total Characters: 2111
Total Tokens: 490
```


## **Recommended Chunk Sizes by Use Case:**


| **Use Case**                    | **Chunk Size**     | **Overlap**      | **Why**                                |
|-----------------------------|----------------|--------------|------------------------------------|
| FAQ-style Q&A               | 100 tokens     | 10 tokens    | Small content, fast retrieval     |
| Chatbot-style RAG           | 150–200 tokens | 20–30 tokens | Balances granularity & context    |
| Deep Document QA / Legal    | 300 tokens     | 50 tokens    | Keeps larger context together     |
| Highly Semantic Search      | 100–150 tokens | 15–25 tokens | Allows precise embedding vectors |


## **References:**

https://medium.com/@anuragmishra_27746/five-levels-of-chunking-strategies-in-rag-notes-from-gregs-video-7b735895694d

https://docs.phidata.com/chunking/document-chunking

https://www.ibm.com/think/tutorials/chunking-strategies-for-rag-with-langchain-watsonx-ai

https://www.analyticsvidhya.com/blog/2024/10/chunking-techniques-to-build-exceptional-rag-systems/