# 📚 Chunking in NLP & RAG

**Chunking** is the process of dividing large text data into smaller, manageable pieces (*chunks*) for processing in NLP tasks.  

In a **RAG pipeline**, chunking matters for retrieval, generation, and efficiency.

### 🔹 Why Chunking Matters
- **Retrieval**:  
  Large documents (e.g., support tickets, manuals) often exceed the input limits of embedding models (e.g., BERT’s 512-token limit).  
  ➝ *Chunking breaks them into smaller units for indexing and similarity search.*

- **Generation**:  
  Generative models have context window limits.  
  ➝ *Chunks ensure that only relevant portions are fed to the model.*

- **Efficiency**:  
  Smaller chunks reduce computational overhead.  
  ➝ *This improves retrieval speed and system scalability.*
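
To make the retrieval limit concrete, the sketch below estimates whether a document fits a model's input window. It assumes whitespace-separated words as a rough stand-in for tokens (real tokenizers usually produce more tokens than words), and the helper names and the 512 default are illustrative, not part of any library.

```python
def approx_token_count(text: str) -> int:
    """Rough token estimate: counts whitespace-separated words."""
    return len(text.split())


def exceeds_limit(text: str, max_tokens: int = 512) -> bool:
    """Return True when the rough estimate is above the model's input limit."""
    return approx_token_count(text) > max_tokens


manual = "restart the device " * 400   # 1200 words, stands in for a long manual
print(approx_token_count(manual))      # 1200
print(exceeds_limit(manual))           # True
```

A document that fails this check has to be chunked before it can be embedded without truncation.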

---

### ⚖️ Impact of Chunking Strategy
Choosing the right chunking strategy affects:
- **Retrieval accuracy** → Well-structured chunks increase the chances of finding relevant matches.  
- **Generation coherence** → Better chunks provide complete, context-rich input to the LLM.  
- **System performance** → Balanced chunk sizes optimize memory and compute usage.  

⚠️ Poorly chosen chunks may **split important context**, leading to incomplete or irrelevant retrieval results.


# 🔑 Why is Chunking Important for RAG?

In **Retrieval-Augmented Generation (RAG)**, the goal is to retrieve relevant documents (or document chunks) based on a query and use them to generate accurate responses.  

Chunking plays a critical role because it directly impacts both **retrieval quality** and **generation effectiveness**.

---

### 📌 Key Reasons
- **Embedding Quality**  
  Chunks must be semantically coherent to produce meaningful embeddings for retrieval.  

- **Context Preservation**  
  Chunks should retain enough context to be useful for both retrieval and generation.  

- **Scalability**  
  Efficient chunking reduces memory usage and processing requirements for large datasets.  

- **Domain Relevance**  
  For domain-specific use cases (e.g., customer support tickets), chunks should preserve critical information like product codes, timestamps, or issue descriptions.  
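
One possible way to keep such identifiers from getting lost is to extract them and store them as metadata next to each chunk. The sketch below assumes identifier formats such as `Product123`, `ticket12345`, and `xyz-789`; the regular expressions and the `extract_identifiers` helper are hypothetical and would need adapting to real data.

```python
import re

# Hypothetical patterns for illustrative identifier formats; adjust to real data.
PATTERNS = {
    "product": re.compile(r"\bproduct\d+\b", re.IGNORECASE),
    "ticket": re.compile(r"\bticket\d+\b", re.IGNORECASE),
    "warranty": re.compile(r"\b[a-z]{3}-\d{3}\b", re.IGNORECASE),
}


def extract_identifiers(chunk: str) -> dict:
    """Return every identifier found in a chunk, grouped by type."""
    return {name: pattern.findall(chunk) for name, pattern in PATTERNS.items()}


chunk = "Device Product123 is under warranty (code xyz-789), see ticket12345."
print(extract_identifiers(chunk))
# {'product': ['Product123'], 'ticket': ['ticket12345'], 'warranty': ['xyz-789']}
```

Storing the result alongside the chunk lets a retriever filter on exact identifiers even when the embedding does not capture them well.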

---



# 🧩 Chunking Strategies in RAG

Different **chunking strategies** can be applied depending on the dataset, task, and model constraints. Choosing the right one is critical for balancing **retrieval accuracy, context preservation, and efficiency**.

---

### 🔹 1. Fixed-size Chunking
- Divides text into chunks of a fixed length (e.g., 100 words or 512 tokens).  
- **Pros:** Simple, predictable, easy to implement.  
- **Cons:** May cut off sentences or split related context unnaturally.  
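
As a minimal, dependency-free sketch of the idea, the function below counts words rather than characters or model tokens. `fixed_size_word_chunks` is an illustrative name, and the tiny chunk size is chosen only to make the mid-sentence cuts visible.

```python
def fixed_size_word_chunks(text: str, chunk_size: int = 100) -> list[str]:
    """Split text into consecutive chunks of at most `chunk_size` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    words = text.split()
    return [
        " ".join(words[start:start + chunk_size])
        for start in range(0, len(words), chunk_size)
    ]


text = "The screen stopped responding. Restarting did not help. Please advise."
for chunk in fixed_size_word_chunks(text, chunk_size=3):
    print(chunk)
# The screen stopped
# responding. Restarting did
# not help. Please
# advise.
```

The second and third chunks each straddle a sentence boundary, which is the drawback listed above.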

---

### 🔹 2. Sentence-based Chunking
- Splits text into chunks based on sentence boundaries.  
- **Pros:** Preserves sentence integrity, avoids mid-sentence splits.  
- **Cons:** May create chunks of uneven size, which can affect embedding consistency.  
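
The sketch below shows the basic mechanism with a naive regular expression. It is an assumption-laden stand-in for a real sentence segmenter: it splits after `.`, `!`, or `?` followed by whitespace, so abbreviations such as "e.g. " would be split incorrectly.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


text = "The screen froze after update 2.3. I restarted it twice! Nothing changed."
for sentence in split_sentences(text):
    print(sentence)
# The screen froze after update 2.3.
# I restarted it twice!
# Nothing changed.
```

The version number `2.3` survives because its inner period is not followed by whitespace, and the three sentences differ in length, which illustrates the uneven-size drawback.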

---

### 🔹 3. Semantic Chunking
- Groups text based on **semantic similarity or coherence** (e.g., clustering embeddings).  
- **Pros:** Keeps related concepts together, improves retrieval relevance.  
- **Cons:** More complex, requires additional computation.  
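
The following sketch illustrates one common variant, breakpoint-based grouping, where a new chunk starts whenever two adjacent sentences are dissimilar. To stay dependency-free it swaps real sentence embeddings for bag-of-words counts, so it only captures word overlap; the function names and the 0.2 threshold are assumptions for illustration.

```python
import math
import re
from collections import Counter


def bow_vector(sentence: str) -> Counter:
    """Bag-of-words counts; a crude stand-in for a sentence embedding."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_breakpoint_chunks(sentences: list[str], threshold: float = 0.2) -> list[str]:
    """Start a new chunk whenever adjacent sentences fall below the threshold."""
    if not sentences:
        return []
    chunks, current = [], [sentences[0]]
    for prev, nxt in zip(sentences, sentences[1:]):
        if cosine(bow_vector(prev), bow_vector(nxt)) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(nxt)
    chunks.append(" ".join(current))
    return chunks


sentences = [
    "The screen stopped responding.",
    "The screen fails after the update.",
    "My warranty code is xyz-789.",
    "The warranty code covers repairs.",
]
for chunk in semantic_breakpoint_chunks(sentences):
    print(chunk)
# The screen stopped responding. The screen fails after the update.
# My warranty code is xyz-789. The warranty code covers repairs.
```

Replacing `bow_vector` with a real embedding model would let the same loop group sentences that share meaning but not vocabulary.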

---

### 🔹 4. Overlapping Windows
- Creates chunks with **overlapping segments** (e.g., 512 tokens with a 50-token overlap).  
- **Pros:** Preserves context across boundaries, reduces risk of missing key info.  
- **Cons:** Increases storage and indexing requirements.  
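
The mechanism can be sketched as a sliding window whose stride is `size - overlap`. The example works on a plain list of tokens; `sliding_window_chunks` is an illustrative helper, not a library function.

```python
def sliding_window_chunks(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Return windows of `size` tokens where consecutive windows share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    stride = size - overlap
    windows = []
    for start in range(0, len(tokens), stride):
        windows.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the text
    return windows


tokens = [f"t{i}" for i in range(10)]
for window in sliding_window_chunks(tokens, size=4, overlap=2):
    print(window)
# ['t0', 't1', 't2', 't3']
# ['t2', 't3', 't4', 't5']
# ['t4', 't5', 't6', 't7']
# ['t6', 't7', 't8', 't9']
```

With a size of 4 and an overlap of 2, the ten tokens are stored as four windows (16 token slots), compared with three windows when the overlap is 0. That extra storage is the cost noted above.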

---

### 🔹 5. Document Structure-aware Chunking
- Uses **document structure** (e.g., paragraphs, headings, sections) to create chunks.  
- **Pros:** Naturally preserves logical flow and context.  
- **Cons:** Not all documents have clean or consistent structure.  
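
For documents that do carry structure, a heading-based split might look like the sketch below. It assumes Markdown-style `#` headings; `split_by_headings` and the sample document are hypothetical.

```python
import re


def split_by_headings(markdown_text: str) -> list[dict]:
    """Group non-empty lines under the most recent Markdown heading ('#' to '######')."""
    sections = []
    current = {"heading": None, "body": []}
    for line in markdown_text.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match:
            # Close the previous section before starting a new one.
            if current["heading"] is not None or current["body"]:
                sections.append(current)
            current = {"heading": match.group(2).strip(), "body": []}
        elif line.strip():
            current["body"].append(line.strip())
    if current["heading"] is not None or current["body"]:
        sections.append(current)
    return [{"heading": s["heading"], "text": " ".join(s["body"])} for s in sections]


doc = """# Troubleshooting
Restart the device.
Check the power cable.

## Warranty
Coverage lasts two years.
"""
for section in split_by_headings(doc):
    print(section)
# {'heading': 'Troubleshooting', 'text': 'Restart the device. Check the power cable.'}
# {'heading': 'Warranty', 'text': 'Coverage lasts two years.'}
```

Keeping the heading with each chunk gives the retriever a short label for what the chunk is about.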

---

### 🔹 6. Chunk Size Optimization
- Dynamically adjusts chunk size based on **task requirements or evaluation metrics**.  
- **Pros:** Adaptive, balances retrieval performance and efficiency.  
- **Cons:** Requires experimentation and tuning.  
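
In its simplest form the tuning loop is a grid search: chunk the text at each candidate size, score the result, and keep the best size. The sketch below uses a toy metric, the fraction of key phrases that stay intact inside a single chunk, as a hypothetical stand-in for a real retrieval metric; all names are illustrative.

```python
def word_chunks(text: str, size: int) -> list[str]:
    """Consecutive chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def intact_phrase_rate(chunks: list[str], phrases: list[str]) -> float:
    """Fraction of phrases that appear whole inside at least one chunk."""
    hits = sum(any(phrase in chunk for chunk in chunks) for phrase in phrases)
    return hits / len(phrases)


def pick_chunk_size(text: str, sizes: list[int], phrases: list[str]) -> tuple[int, dict]:
    """Score every candidate size; prefer the highest score, then the smaller size."""
    scores = {size: intact_phrase_rate(word_chunks(text, size), phrases) for size in sizes}
    best = max(scores, key=lambda size: (scores[size], -size))
    return best, scores


text = "screen unresponsive after update to version 2.3 warranty code xyz-789 is active"
phrases = ["version 2.3", "warranty code xyz-789"]
best, scores = pick_chunk_size(text, [2, 3, 5, 6], phrases)
print(best, scores)
# 5 {2: 0.0, 3: 0.0, 5: 1.0, 6: 0.5}
```

The score is not monotonic in chunk size: size 6 does worse than size 5 because its boundary falls between "version" and "2.3". This is why sizes need to be evaluated rather than guessed.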


In [3]:
# install libraries
!pip install langchain langchain-text-splitters spacy sentence-transformers scikit-learn pandas numpy nltk


In [8]:
# Sample cleaned customer support ticket
sample_ticket = """
Hello, I am experiencing a critical issue with my device (Product123). The screen suddenly stopped 
responding, and I can no longer interact with it properly. Initially, I thought it was a temporary 
glitch, but after multiple restarts, the issue persists. This started immediately after I updated 
the device to version 2.3, which makes me think the update caused the malfunction. 

Before the screen issue occurred, I had noticed that the device was running slower than usual, 
frequently lagging when switching between applications. I assumed this was normal after the update, 
but now with the screen completely failing, I suspect the two issues are connected. 

I already tried basic troubleshooting: restarting the device, disconnecting it from power, leaving 
it off for several minutes, and reconnecting. None of these worked. I also reset the device to 
factory settings, but the issue remains. The problem is making it impossible for me to use the device 
for my daily work, and it’s becoming a serious inconvenience. 

I reached out to customer support last week and logged this under ticket12345, but unfortunately I 
have not received any response. My account is linked to user123, and I’ve been a premium customer for 
over two years. I was expecting faster assistance, especially given the urgency of the issue. 

Please note, this product is still under warranty (code xyz-789). I also purchased extended coverage, 
so I would like to request either a replacement device or a repair service as soon as possible. I 
depend on this product for my work, and the downtime is costing me productivity every single day. 

In addition, I want to mention that I tried connecting the device to an external monitor, and while 
the output displays fine externally, the touch screen on the actual device does not respond at all. 
This seems to confirm that the issue is specifically with the screen hardware or the drivers related 
to it. If this is a known bug introduced in version 2.3, I would like to be informed about any 
upcoming patches or fixes. 

To summarize:
1. Device: Product123  
2. Issue: Screen unresponsive after updating to version 2.3  
3. Previous symptoms: Slow performance, frequent lag  
4. Troubleshooting steps: Restart, factory reset, external monitor test (failed to fix)  
5. Support ticket already logged: ticket12345 (no response)  
6. Warranty: Active (code xyz-789), extended coverage purchased  
7. Urgency: High – device required for daily work  

I kindly request immediate assistance. If a replacement or repair cannot be processed quickly, please 
provide me with a temporary workaround so that I can continue working without further disruptions. 
This issue is impacting my business, and I hope to receive a resolution as soon as possible. Thank you. 
"""


### Fixed-size Chunking

**Description: Splits text into chunks of a fixed length (e.g., by characters or tokens). Simple but may split sentences.**

In [10]:
# Import from the standalone package installed above (langchain-text-splitters)
from langchain_text_splitters import CharacterTextSplitter



def fixed_size_chunk(text: str, chunk_size: int = 100, overlap: int = 0) -> list:
    """
    Split text into fixed-size chunks using LangChain CharacterTextSplitter.
    
    Args:
        text (str): Input text.
        chunk_size (int): Number of characters per chunk.
        overlap (int): Number of overlapping characters.
    
    Returns:
        list: List of chunks.
    """
    try:
        splitter = CharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=overlap,
            separator=' ',
            strip_whitespace=True
        )
        chunks = splitter.split_text(text)
        return chunks
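    except Exception as e:
        # Report the failure instead of silently returning an empty list
        print(f"Fixed-size chunking error: {str(e)}")
        return []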

# Apply fixed-size chunking
fixed_chunks = fixed_size_chunk(sample_ticket, chunk_size=200, overlap=0)
print("\nFixed-size Chunks (200 characters):")
for i, chunk in enumerate(fixed_chunks):
    print(f"Chunk {i+1}: {chunk}")


Fixed-size Chunks (200 characters):
Chunk 1: Hello, I am experiencing a critical issue with my device (Product123). The screen suddenly stopped 
responding, and I can no longer interact with it properly. Initially, I thought it was a temporary
Chunk 2: glitch, but after multiple restarts, the issue persists. This started immediately after I updated 
the device to version 2.3, which makes me think the update caused the malfunction. 

Before the
Chunk 3: screen issue occurred, I had noticed that the device was running slower than usual, 
frequently lagging when switching between applications. I assumed this was normal after the update, 
but now with
Chunk 4: the screen completely failing, I suspect the two issues are connected. 

I already tried basic troubleshooting: restarting the device, disconnecting it from power, leaving 
it off for several minutes,
Chunk 5: and reconnecting. None of these worked. I also reset the device to 
factory settings, but the issue remains. The problem is 

### Sentence-based Chunking

**Description: Splits text into chunks based on sentence boundaries, preserving syntactic units**.





In [13]:
# Import from the standalone package installed above (langchain-text-splitters)
from langchain_text_splitters import SpacyTextSplitter

def sentence_chunk(text: str) -> list:
    """
    Split text into sentence-based chunks using LangChain SpacyTextSplitter.
    
    Args:
        text (str): Input text.
    
    Returns:
        list: List of sentence chunks.
    """
    try:
        # chunk_size caps chunk length in characters; whole sentences are merged until the cap is reached
        splitter = SpacyTextSplitter(chunk_size=1000)
        chunks = splitter.split_text(text)
        print(f"Sentence chunking: {len(chunks)} chunks created")
        return chunks
    except Exception as e:
        print(f"Sentence chunking error: {str(e)}")
        return []

# Apply sentence-based chunking
sentence_chunks = sentence_chunk(sample_ticket)
print("\nSentence-based Chunks:")
for i, chunk in enumerate(sentence_chunks):
    print(f"Chunk {i+1}: {chunk}")

Sentence chunking: 4 chunks created

Sentence-based Chunks:
Chunk 1: Hello, I am experiencing a critical issue with my device (Product123).

The screen suddenly stopped 
responding, and I can no longer interact with it properly.

Initially, I thought it was a temporary 
glitch, but after multiple restarts, the issue persists.

This started immediately after I updated 
the device to version 2.3, which makes me think the update caused the malfunction. 



Before the screen issue occurred, I had noticed that the device was running slower than usual, 
frequently lagging when switching between applications.

I assumed this was normal after the update, 
but now with the screen completely failing, I suspect the two issues are connected. 



I already tried basic troubleshooting: restarting the device, disconnecting it from power, leaving 
it off for several minutes, and reconnecting.

None of these worked.

I also reset the device to 
factory settings, but the issue remains.
Chunk 2: None of 

### Semantic Chunking
**Description: Groups text into chunks based on semantic similarity, using embeddings and clustering.**



In [15]:
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
import spacy

def semantic_chunk(text: str, max_chunks: int = 3) -> list:
    """
    Split text into semantically coherent chunks using embeddings and clustering.
    
    Args:
        text (str): Input text.
        max_chunks (int): Maximum number of chunks.
    
    Returns:
        list: List of semantic chunks.
    """
    try:
        # Split into sentences using SpaCy
        nlp = spacy.load('en_core_web_sm')
        doc = nlp(text)
        sentences = [sent.text.strip() for sent in doc.sents]
        
        # Generate embeddings
        model = SentenceTransformer('all-MiniLM-L6-v2')
        embeddings = model.encode(sentences, show_progress_bar=False)
        
        # Cluster sentences
        kmeans = KMeans(n_clusters=min(max_chunks, len(sentences)), random_state=42)
        labels = kmeans.fit_predict(embeddings)
        
        # Group sentences by cluster; sentences keep document order within a chunk
        # but need not be adjacent in the original text
        chunks = [[] for _ in range(max_chunks)]
        for sent, label in zip(sentences, labels):
            chunks[label].append(sent)
        
        chunks = [' '.join(chunk) for chunk in chunks if chunk]
        print(f"Semantic chunking: {len(chunks)} chunks created")
        return chunks
    except Exception as e:
        print(f"Semantic chunking error: {str(e)}")
        return []

# Apply semantic chunking
semantic_chunks = semantic_chunk(sample_ticket, max_chunks=3)
print("\nSemantic Chunks:")
for i, chunk in enumerate(semantic_chunks):
    print(f"Chunk {i+1}: {chunk}")



Semantic chunking: 3 chunks created

Semantic Chunks:
Chunk 1: Hello, I am experiencing a critical issue with my device (Product123). The screen suddenly stopped 
responding, and I can no longer interact with it properly. Initially, I thought it was a temporary 
glitch, but after multiple restarts, the issue persists. This started immediately after I updated 
the device to version 2.3, which makes me think the update caused the malfunction. Before the screen issue occurred, I had noticed that the device was running slower than usual, 
frequently lagging when switching between applications. I assumed this was normal after the update, 
but now with the screen completely failing, I suspect the two issues are connected. I already tried basic troubleshooting: restarting the device, disconnecting it from power, leaving 
it off for several minutes, and reconnecting. I also reset the device to 
factory settings, but the issue remains. In addition, I want to mention that I tried connecting the 



### Overlapping Windows
**Description: Creates fixed-size chunks with overlapping segments to preserve context**.



In [16]:
# Import from the standalone package installed above (langchain-text-splitters)
from langchain_text_splitters import RecursiveCharacterTextSplitter

def overlapping_chunk(text: str, chunk_size: int = 100, overlap: int = 20) -> list:
    """
    Split text into chunks with overlap using LangChain RecursiveCharacterTextSplitter.
    
    Args:
        text (str): Input text.
        chunk_size (int): Number of characters per chunk.
        overlap (int): Number of overlapping characters.
    
    Returns:
        list: List of chunks.
    """
    try:
        splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=overlap,
            separators=[' ', '\n'],
            strip_whitespace=True
        )
        chunks = splitter.split_text(text)
        print(f"Overlapping chunking: {len(chunks)} chunks created")
        return chunks
    except Exception as e:
        print(f"Overlapping chunking error: {str(e)}")
        return []

# Apply overlapping chunking
overlap_chunks = overlapping_chunk(sample_ticket, chunk_size=100, overlap=20)
print("\nOverlapping Chunks (100 characters, 20-character overlap):")
for i, chunk in enumerate(overlap_chunks):
    print(f"Chunk {i+1}: {chunk}")

Overlapping chunking: 35 chunks created

Overlapping Chunks (100 characters, 20-character overlap):
Chunk 1: Hello, I am experiencing a critical issue with my device (Product123). The screen suddenly stopped
Chunk 2: suddenly stopped 
responding, and I can no longer interact with it properly. Initially, I thought
Chunk 3: I thought it was a temporary 
glitch, but after multiple restarts, the issue persists. This started
Chunk 4: This started immediately after I updated 
the device to version 2.3, which makes me think the
Chunk 5: makes me think the update caused the malfunction. 

Before the screen issue occurred, I had noticed
Chunk 6: I had noticed that the device was running slower than usual, 
frequently lagging when switching
Chunk 7: when switching between applications. I assumed this was normal after the update, 
but now with the
Chunk 8: but now with the screen completely failing, I suspect the two issues are connected. 

I already
Chunk 9: I already tried basic troubleshooting
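
To check how much text two neighbouring chunks actually share, a small helper can compare the end of one chunk with the start of the next. This is an illustrative sketch (`shared_overlap` is not part of LangChain), using shortened versions of the first two chunks shown above.

```python
def shared_overlap(left: str, right: str) -> str:
    """Longest suffix of `left` that is also a prefix of `right`."""
    for length in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:length]):
            return right[:length]
    return ""


chunk_1 = "The screen suddenly stopped"
chunk_2 = "suddenly stopped responding, and I can no longer interact"
print(repr(shared_overlap(chunk_1, chunk_2)))  # 'suddenly stopped'
```

The shared text here is 16 characters, below the configured 20. A likely explanation is that the splitter builds the overlap from whole pieces between separators, so `chunk_overlap` seems to act as an upper bound rather than an exact length.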

### Document Structure-aware Chunking
**Description: Uses document structure (e.g., paragraphs, headings) to create chunks.**



In [17]:
def structure_aware_chunk(text: str) -> list:
    """
    Split text into chunks on single newlines, treating each non-empty line as a paragraph.
    
    Args:
        text (str): Input text.
    
    Returns:
        list: List of paragraph chunks.
    """
    try:
        chunks = [para.strip() for para in text.split('\n') if para.strip()]
        print(f"Structure-aware chunking: {len(chunks)} chunks created")
        return chunks
    except Exception as e:
        print(f"Structure-aware chunking error: {str(e)}")
        return []

# Simulate paragraphs with newlines
structured_ticket = """
hello i have an issue with product123 the screen is broken please help.
i tried restarting the device but it did not work.
the issue started after updating to version 2.3. i contacted support at ticket12345 but no response.
my account is linked to user123. this is urgent please assist asap.
the product is under warranty code xyz-789. i also noticed slow performance before the screen issue.
"""
structure_chunks = structure_aware_chunk(structured_ticket)
print("\nStructure-aware Chunks (Paragraphs):")
for i, chunk in enumerate(structure_chunks):
    print(f"Chunk {i+1}: {chunk}")

Structure-aware chunking: 5 chunks created

Structure-aware Chunks (Paragraphs):
Chunk 1: hello i have an issue with product123 the screen is broken please help.
Chunk 2: i tried restarting the device but it did not work.
Chunk 3: the issue started after updating to version 2.3. i contacted support at ticket12345 but no response.
Chunk 4: my account is linked to user123. this is urgent please assist asap.
Chunk 5: the product is under warranty code xyz-789. i also noticed slow performance before the screen issue.
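
Because `structure_aware_chunk` splits on every newline, hard-wrapped text such as `sample_ticket` would produce one chunk per line rather than per paragraph. A possible alternative, sketched below under the assumption that paragraphs are separated by blank lines, splits on blank lines and rejoins the wrapped lines; `paragraph_chunks` is an illustrative name.

```python
import re


def paragraph_chunks(text: str) -> list[str]:
    """Split on blank lines and collapse the hard line wraps inside each paragraph."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    return [" ".join(p.split()) for p in paragraphs if p.strip()]


wrapped = """
The screen stopped
responding after the update.

I already tried
a factory reset.
"""
print(paragraph_chunks(wrapped))
# ['The screen stopped responding after the update.', 'I already tried a factory reset.']
```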


### Chunk Size Optimization
**Description: Dynamically adjusts chunk size based on task requirements or evaluation metrics (e.g., embedding similarity).**



In [19]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def optimize_chunk_size(text: str, chunk_sizes: tuple = (50, 100, 150), strategy: str = 'fixed') -> dict:
    """
    Test different chunk sizes and evaluate using embedding similarity.
    
    Args:
        text (str): Input text.
        chunk_sizes (tuple): Chunk sizes to test (any iterable of ints works).
        strategy (str): Chunking strategy ('fixed' or 'sentence').
    
    Returns:
        dict: Mapping of chunk size to average similarity score.
    """
    try:
        model = SentenceTransformer('all-MiniLM-L6-v2')
        original_embedding = model.encode([text], show_progress_bar=False)[0]
        results = {}
        
        for size in chunk_sizes:
            if strategy == 'fixed':
                chunks = fixed_size_chunk(text, chunk_size=size, overlap=0)
            elif strategy == 'sentence':
                # Keeps only the first `size` chunks; this limits the count, not the chunk length
                chunks = sentence_chunk(text)[:size]
            else:
                raise ValueError("Unsupported strategy")
            
            chunk_embeddings = model.encode(chunks, show_progress_bar=False)
            similarities = cosine_similarity([original_embedding], chunk_embeddings)[0]
            avg_similarity = np.mean(similarities)
            results[size] = avg_similarity
            print(f"Chunk Size {size} ({strategy}): Avg Similarity = {avg_similarity:.4f}")
        
        print(f"Chunk size optimization: {results}")
        return results
    except Exception as e:
        print(f"Chunk size optimization error: {str(e)}")
        return {}

# Test chunk size optimization
optimization_results = optimize_chunk_size(sample_ticket, chunk_sizes=[50, 100, 150], strategy='fixed')
print("\nChunk Size Optimization Results:", optimization_results)



Chunk Size 50 (fixed): Avg Similarity = 0.2865
Chunk Size 100 (fixed): Avg Similarity = 0.3751
Chunk Size 150 (fixed): Avg Similarity = 0.4289
Chunk size optimization: {50: 0.28648812, 100: 0.37513182, 150: 0.42890257}

Chunk Size Optimization Results: {50: 0.28648812, 100: 0.37513182, 150: 0.42890257}
