### **To run this notebook efficiently, make sure that:**
1. Your machine has a GPU.
2. A TPU runtime is enabled if you are running on Colab.
3. [A Hugging Face token is set up in the notebook or environment](https://huggingface.co/settings/tokens).
4. The HF token is enabled for notebook scope.


## **What is Chunking?**

**1. Definition of Text Chunking:**

In classical Natural Language Processing (NLP), **text chunking** (also known as shallow parsing or partial parsing) means grouping the words of a sentence into syntactically related phrases, such as noun phrases or verb phrases, without building a complete syntactic tree. In data preparation for LLM and retrieval pipelines, which is the sense used throughout this notebook, text chunking is the process of breaking down a long, continuous piece of text into smaller, meaningful segments or 'chunks', such as sentences, paragraphs, or fixed-length spans.

**2. Fundamental Concept and Necessity:**

The fundamental concept behind chunking is to manage and process large volumes of text more efficiently and effectively. Large texts often contain a wealth of information, but directly processing them in their entirety can be computationally intensive, lead to loss of specific details, or overwhelm models that have token limits. By breaking down larger texts into smaller chunks, we achieve several benefits:

*   **Manageability:** Smaller units are easier to handle, store, and retrieve.
*   **Focus:** It allows NLP models to focus on relevant portions of the text, reducing noise and improving understanding of specific contexts.
*   **Overcoming Limitations:** Many NLP models and APIs have input token limits. Chunking ensures that text fits within these constraints, making it feasible to process extensive documents.
*   **Preserving Context:** While breaking text down, the goal is to maintain sufficient context within each chunk, preventing important relationships or meanings from being lost across arbitrary splits.

**3. Primary Purpose in NLP Applications:**

The primary purpose of text chunking is to facilitate various downstream NLP applications by providing manageable and contextually rich segments of text. Some key purposes include:

*   **Information Extraction:** Identifying and extracting specific entities (e.g., people, organizations, locations) or relationships from text.
*   **Question Answering (QA) Systems:** When a question is posed, chunking allows QA systems to search and retrieve relevant text snippets rather than processing entire documents, leading to faster and more accurate answers.
*   **Summarization:** Breaking down documents into key chunks can be an initial step in identifying salient information for summarization.
*   **Semantic Search and Retrieval:** Improved relevance in search results by matching queries to specific, meaningful chunks rather than entire documents.
*   **Contextual Understanding for Large Language Models (LLMs):** Preparing data for LLMs by ensuring that input texts are within the model's context window while retaining as much relevant information as possible.
*   **Data Preprocessing:** Standardizing text input for various machine learning tasks.

## **Importance of Chunking**

Chunking is a critical preprocessing step in many Natural Language Processing (NLP) pipelines, especially when dealing with large volumes of text. It involves breaking down extensive documents into smaller, manageable segments or 'chunks' based on various strategies (e.g., fixed size, semantic boundaries, or specific delimiters).

### **Benefits of Chunking:**

1.  **Memory Efficiency:**
    *   **Handling Large Documents:** Modern NLP models, especially large language models (LLMs), often have strict input token limits (context windows). Directly feeding a very long document can exceed these limits, leading to truncation or outright rejection. Chunking allows processing of documents larger than the model's context window by feeding them in parts.
    *   **Reducing Computational Load:** Processing smaller chunks requires less computational memory and processing power per inference, making the overall pipeline more efficient. This is particularly beneficial for resource-intensive operations or when running on hardware with limited resources (e.g., GPUs with less VRAM).
    *   **Faster Processing:** Smaller inputs generally lead to faster inference times, which can significantly speed up applications that need to process many documents or respond quickly.

2.  **Context Preservation:**
    *   **Maintaining Semantic Meaning:** Effective chunking strategies aim to keep semantically related sentences or paragraphs together within a single chunk. This ensures that the model receives a coherent piece of information, preventing the loss of critical context that might occur if a sentence is split haphazardly across chunks or if important surrounding information is truncated.
    *   **Avoiding Truncation of Important Information:** Without chunking, models might arbitrarily truncate parts of a document to fit the context window, potentially cutting off vital details. Chunking helps to manage this by allowing each significant part of the document to be processed, often with some overlap between chunks to ensure continuity.

3.  **Model Performance:**
    *   **Improving Accuracy and Relevance:** By providing models with well-defined, contextually rich chunks, the model can better understand and process the information. For tasks like Question Answering (QA), summarization, or Retrieval-Augmented Generation (RAG), better context leads to more accurate answers, more relevant summaries, and more informed generations.
    *   **Enhanced Retrieval-Augmented Generation (RAG):** In RAG systems, chunking is fundamental. When a query is made, relevant chunks are retrieved from a vector database. If the chunks are well-formed and semantically coherent, the retrieved information is more likely to be relevant to the query, leading to higher quality augmented generations.
    *   **Specialized Processing:** Different NLP tasks might require different granularities of text. Chunking allows for tailoring the input to the specific needs of a task, optimizing the model's ability to perform that task effectively.

In essence, chunking acts as a bridge, enabling powerful but context-limited NLP models to process and understand the vastness of human language more effectively and efficiently, leading to improved outcomes across a wide range of applications.

## **High-Level Chunking Strategies**

Different chunking strategies suit different tasks and kinds of text, especially in retrieval-augmented generation (RAG) systems. The main high-level strategies are:

1.  **Fixed-Size Chunking:**
    This strategy involves splitting text into segments of a predetermined, fixed length. For example, a document might be divided into chunks of 500 characters each. Often, an optional overlap is included, meaning a portion of the end of one chunk is repeated at the beginning of the next chunk. This helps maintain context across chunk boundaries and reduces the likelihood of important information being split across two chunks without any connecting context.

2.  **Content-Aware (or Semantic/Structure-Aware) Chunking:**
    Unlike fixed-size chunking, content-aware chunking considers the meaning, semantic coherence, or structural elements of the text to make more intelligent splits. This strategy aims to keep related information together within a single chunk. Examples include splitting text by:
    *   **Paragraphs:** Each paragraph forms a chunk.
    *   **Sentences:** Each sentence forms a chunk (though this can often lead to very small chunks).
    *   **Document Sections:** Splitting based on headings, subheadings, or other structural markers in the document.
    *   **Semantic Coherence:** Using NLP techniques to identify natural breaks in meaning, ensuring that a chunk represents a complete idea or topic.

3.  **Overlap-Based Chunking:**
    Overlap is usually combined with fixed-size or content-aware chunking rather than used on its own, but it is worth treating as a separate concept. It refers to repeating a portion of the previous or subsequent chunk within the current chunk. The purpose is to maintain context across chunk boundaries, especially when a query involves information that spans two adjacent chunks. For instance, if a fixed-size split cuts a sentence in half, the overlap keeps the text around the cut available in the next chunk, and the whole sentence appears intact in the next chunk if the portion before the cut is no longer than the overlap.
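
As a minimal sketch of the two simplest content-aware splits listed above, paragraphs and sentences can be separated with regular expressions. The helper names `split_paragraphs` and `split_sentences` and the sample text are illustrative assumptions, not part of the notebook. The sentence pattern is deliberately naive and will mis-split abbreviations such as "e.g.", so a dedicated sentence tokenizer is usually the better choice for real text.

```python
import re

def split_paragraphs(text):
    """One chunk per non-empty paragraph (paragraphs separated by blank lines)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def split_sentences(text):
    """Naive sentence split: break at whitespace that follows '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

sample = "Chunking splits text. It helps retrieval.\n\nOverlap keeps context."

print(split_paragraphs(sample))
# ['Chunking splits text. It helps retrieval.', 'Overlap keeps context.']

print(split_sentences(sample))
# ['Chunking splits text.', 'It helps retrieval.', 'Overlap keeps context.']
```

Paragraph chunks keep more context per chunk, while sentence chunks are finer-grained and usually need to be merged again before use.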

## **Data Acquisition from Wikipedia**
Fetch the text content from the Wikipedia page 'https://en.wikipedia.org/wiki/India' to use as the base for all chunking examples.


In [1]:
import requests
from bs4 import BeautifulSoup

print("Libraries imported successfully.")

Libraries imported successfully.


In [5]:
wikipedia_url = 'https://en.wikipedia.org/wiki/India'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}
# A timeout keeps the cell from hanging indefinitely on a stalled connection
response = requests.get(wikipedia_url, headers=headers, timeout=30)

wiki_text = ""
if response.status_code != 200:
    print(f"Error fetching page: Status code {response.status_code}")
else:
    soup = BeautifulSoup(response.content, 'html.parser')

    # Prefer the article body ('mw-parser-output' inside 'mw-content-text'),
    # then fall back to progressively broader containers
    main_content_div = soup.find('div', id='mw-content-text')
    parser_output_div = (
        main_content_div.find('div', class_='mw-parser-output') if main_content_div else None
    )

    if parser_output_div:
        container = parser_output_div
    elif main_content_div:
        container = main_content_div
        print("Warning: 'mw-parser-output' div not found within 'mw-content-text'. Using all <p> tags from 'mw-content-text'.")
    else:
        container = soup
        print("Warning: 'mw-content-text' div not found. Using all <p> tags from the entire page.")

    # Extract paragraph text, strip whitespace, and drop empty paragraphs
    stripped = (p.get_text().strip() for p in container.find_all('p'))
    paragraph_texts = [text for text in stripped if text]
    wiki_text = '\n'.join(paragraph_texts)

    # Report success only when the page was actually fetched and parsed
    print(f"Successfully fetched and processed Wikipedia page for '{wikipedia_url}'.")
    print(f"Length of the extracted text: {len(wiki_text)} characters.")

Successfully fetched and processed Wikipedia page for 'https://en.wikipedia.org/wiki/India'.
Length of the extracted text: 65980 characters.


## **Method 1: Fixed-Size Chunking**


### Description of Fixed-Size Chunking

Fixed-size chunking is a straightforward text splitting strategy that divides a document into segments of a predetermined, constant length. This method operates without considering the semantic content or structural elements of the text.

#### Working Principle

1.  **Defining Chunk Size**: The primary parameter is `chunk_size`, which specifies the maximum character count (or token count, depending on the implementation) for each segment.
2.  **Iterative Splitting**: The text is processed sequentially. Chunks are extracted by taking `chunk_size` characters from the beginning of the text, then the next `chunk_size` characters, and so on, until the entire document is covered.
3.  **Optional Overlap**: To prevent loss of context across chunk boundaries, an `overlap_size` can be introduced, so that the last `overlap_size` characters of one chunk are repeated at the beginning of the next. For instance, if a chunk ends at character `X`, the next chunk starts at character `X - overlap_size`, and the two chunks share that segment of text.

#### When it is Appropriate to Use

*   **Simplicity and Speed**: It is the easiest and fastest chunking method to implement, requiring minimal computational overhead.
*   **Initial Exploration/Baseline**: Useful for getting a quick understanding of how a model performs with chunked data before investing in more complex strategies.
*   **When Content Structure is Not Critical**: If the precise semantic boundaries or document structure are not paramount for the downstream task, or if the text is largely unstructured.
*   **Limited Context Window**: When strict adherence to a maximum input length is required for models with very small context windows.

#### When it is Not Appropriate to Use

*   **Semantic Coherence is Crucial**: Fixed-size chunking can arbitrarily cut sentences or paragraphs in half, destroying semantic integrity and making individual chunks less meaningful. This is particularly detrimental for tasks like Question Answering or summarization where context is key.
*   **Loss of Context/Meaning**: Important information or relationships might be split across chunks without sufficient overlap, leading to a fragmented understanding for the model.
*   **Suboptimal for Structured Documents**: For documents with clear headings, sections, or hierarchical structures, fixed-size chunking ignores this valuable information, which could otherwise be used to create more intelligent and coherent chunks.
*   **Redundant Overlap**: While overlap helps, excessive overlap can lead to redundant information being processed, increasing computational cost without proportional benefits.

#### Role of Overlap

Overlap is crucial in fixed-size chunking to mitigate the issue of lost context when a significant idea or entity spans across two chunk boundaries. By repeating a small section of text from the end of the previous chunk at the start of the next, it ensures that models have some continuity and can potentially infer relationships that would otherwise be severed. However, the `overlap_size` needs to be carefully chosen; too little might still break context, and too much can introduce unnecessary redundancy.
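
The interaction between `chunk_size` and `overlap_size` can be checked with a little arithmetic. Consecutive chunks start `chunk_size - overlap_size` characters apart, so if every start offset below the text length `n` opens a chunk, the text yields `ceil(n / (chunk_size - overlap_size))` chunks. The helper `chunk_spans` below is a hypothetical illustration of that reasoning, not part of the notebook's own code; it returns character offsets only and never touches the text.

```python
import math

def chunk_spans(text_length, chunk_size, overlap_size):
    """Return the (start, end) character offsets of each fixed-size chunk."""
    if chunk_size <= 0 or not 0 <= overlap_size < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap_size < chunk_size")
    step = chunk_size - overlap_size  # distance between consecutive chunk starts
    return [
        (start, min(start + chunk_size, text_length))
        for start in range(0, text_length, step)
    ]

spans = chunk_spans(1200, chunk_size=500, overlap_size=100)
print(spans)  # [(0, 500), (400, 900), (800, 1200)]

# Adjacent chunks share overlap_size characters
(_, prev_end), (next_start, _) = spans[0], spans[1]
print(prev_end - next_start)  # 100

# A 65,980-character text with the same settings
print(len(chunk_spans(65980, 500, 100)), math.ceil(65980 / 400))  # 165 165
```

A larger overlap shrinks the step, so the number of chunks (and the amount of repeated text) grows quickly as `overlap_size` approaches `chunk_size`.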

In [6]:
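def fixed_size_chunking(text, chunk_size, overlap_size):
    # The window must advance on every iteration, so the overlap has to be
    # strictly smaller than the chunk size (otherwise the loop never ends).
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_size < chunk_size:
        raise ValueError("overlap_size must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap_size  # distance between consecutive chunk starts
    chunks = []
    for start in range(0, len(text), step):
        # Slicing clips at the end of the text, so the last chunk may be shorter
        chunks.append(text[start:start + chunk_size])
    return chunks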

# Define chunk size and overlap size
chunk_size = 500
overlap_size = 100

# Apply the function to wiki_text
fixed_chunks = fixed_size_chunking(wiki_text, chunk_size, overlap_size)

print(f"Total number of fixed-size chunks: {len(fixed_chunks)}")
print("\nFirst 3 fixed-size chunks:")
for i, chunk in enumerate(fixed_chunks[:3]):
    print(f"--- Chunk {i+1} (length: {len(chunk)}) ---")
    print(chunk)


Total number of fixed-size chunks: 165

First 3 fixed-size chunks:
--- Chunk 1 (length: 500) ---
India, officially the Republic of India,[j][20] is a country in South Asia.  It is the seventh-largest country by area; the most populous country since 2023;[21] and, since its independence in 1947, the world's most populous democracy.[22][23][24] Bounded by the Indian Ocean on the south, the Arabian Sea on the southwest, and the Bay of Bengal on the southeast, it shares land borders with Pakistan to the west;[k] China, Nepal, and Bhutan to the north; and Bangladesh and Myanmar to the east. In t
--- Chunk 2 (length: 500) ---
 to the west;[k] China, Nepal, and Bhutan to the north; and Bangladesh and Myanmar to the east. In the Indian Ocean, India is near Sri Lanka and the Maldives; its Andaman and Nicobar Islands share a maritime border with Myanmar, Thailand, and Indonesia.
Modern humans arrived on the Indian subcontinent from Africa no later than 55,000 years ago.[26][27][28] Their long oc

## **Method 2: Recursive Character Text Splitter**


### Description of Recursive Character Text Splitter

The Recursive Character Text Splitter is a structure-aware chunking strategy that tries to keep chunks coherent by recursively splitting text on a prioritized list of separators. Unlike fixed-size chunking, which can cut through meaningful units arbitrarily, this method prefers natural boundaries such as paragraphs, sentences, and words, in that order.

#### Working Principle

1.  **Prioritized Separators**: The core idea is to use a list of separators (e.g., `["\n\n", "\n", " ", ""]`) and attempt to split the text using the first separator in the list. If the resulting chunks are still too large (exceeding a `chunk_size`), the process is recursively applied to those oversized chunks using the *next* separator in the list.
2.  **Recursive Splitting**: The algorithm tries to split the text with the most 'coarse' separator first (e.g., `"\n\n"` for paragraphs). If splitting by `"\n\n"` yields chunks smaller than or equal to the `chunk_size`, these chunks are kept. If any chunk is still too large, it is then split by the next separator (e.g., `"\n"` for newlines/lines) and so on.
3.  **Fallback to Characters**: If all defined separators have been tried and chunks still exceed the `chunk_size`, the final fallback is often to split by individual characters (using `""` as a separator), which guarantees that no chunk will exceed the maximum length, though it might break words or sentences.
4.  **Optional Overlap**: Similar to fixed-size chunking, an `overlap_size` can be applied. This ensures that a portion of text from the end of one chunk is included at the beginning of the next, helping to maintain context across chunk boundaries, especially when a split occurs in the middle of a semantically important passage.
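
The separator cascade in steps 1 to 3 can be traced on a tiny input. The function below is a simplified sketch written for illustration (the name `split_recursively` is an assumption): it only performs the recursive splitting, drops the separators, and does not merge small pieces back together or add overlap, which a full splitter would do afterwards.

```python
def split_recursively(text, separators, chunk_size):
    """Split text into pieces of at most chunk_size characters,
    trying each separator in order (coarsest first)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Final fallback: hard split by character count
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    head, rest = separators[0], separators[1:]
    pieces = []
    for part in text.split(head):
        # Parts that already fit are returned as-is; oversized parts
        # are split again with the next, finer separator
        pieces.extend(split_recursively(part, rest, chunk_size))
    return pieces

text = "First paragraph.\n\nSecond paragraph is a bit longer."
print(split_recursively(text, ["\n\n", " ", ""], chunk_size=20))
# ['First paragraph.', 'Second', 'paragraph', 'is', 'a', 'bit', 'longer.']

print(split_recursively("abcdefghij", ["\n\n", " ", ""], chunk_size=4))
# ['abcd', 'efgh', 'ij']
```

The first paragraph fits within 20 characters and is kept whole, while the second is too long and falls through to the space separator. The second call has no usable separator at all, so it ends in the character-level fallback.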

#### When it is Appropriate to Use

*   **Preserving Semantic Units**: This method excels when it's crucial to maintain the integrity of semantic units like paragraphs, sentences, or even complete thoughts within chunks. It minimizes the likelihood of breaking up important information arbitrarily.
*   **Handling Varied Document Structures**: Documents often have hierarchical structures (e.g., sections, subsections, paragraphs). The recursive nature allows the splitter to adapt to these structures by trying to split at higher-level boundaries first.
*   **Complex Text Data**: Ideal for processing long, complex documents (e.g., academic papers, legal documents, books) where context and structural integrity are vital for downstream NLP tasks like RAG, summarization, or detailed question-answering.
*   **Optimized for RAG Systems**: In Retrieval-Augmented Generation (RAG) systems, this method helps create more meaningful and retrievable chunks, leading to higher quality retrieval results and more accurate generations from LLMs.

#### When it is Not Appropriate to Use

*   **Simple Use Cases/Very Short Texts**: For very short texts or use cases where semantic coherence is not a major concern, the complexity of a recursive splitter might be overkill. Fixed-size chunking could be sufficient and more efficient.
*   **Performance-Critical Applications (with extreme chunking)**: While generally efficient, if the list of separators is very long and `chunk_size` is extremely small, the recursive nature could lead to slightly higher processing times compared to a simple fixed-size split.
*   **Text without Clear Delimiters**: If the input text is entirely unstructured or lacks clear delimiters (e.g., a continuous stream of characters without paragraphs or newlines), the advantage of recursive splitting is diminished, as it will quickly fall back to character-level splitting.

In essence, the Recursive Character Text Splitter offers a more intelligent and context-aware approach to text chunking, making it a go-to choice for applications demanding high-quality, semantically rich text segments.

In [7]:
def recursive_character_chunking(text, separators, chunk_size, overlap_size):
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_size < chunk_size:
        raise ValueError("overlap_size must be >= 0 and smaller than chunk_size")

    # Step 1: recursively split the text into small pieces, trying the
    # coarsest separator first and finer ones only for oversized parts.
    pieces = []

    def _split_text(text_to_split, current_separators):
        # Base case: the text already fits, or there is no separator left to try
        if not current_separators or len(text_to_split) <= chunk_size:
            if text_to_split:
                pieces.append(text_to_split)
            return

        separator = current_separators[0]
        # str.split("") raises ValueError, so treat "" as a character-level split
        parts = text_to_split.split(separator) if separator else list(text_to_split)
        for i, part in enumerate(parts):
            if part:
                if len(part) > chunk_size:
                    # Still too large: recurse with the next, finer separator
                    _split_text(part, current_separators[1:])
                else:
                    pieces.append(part)
            # Keep the separator as its own piece so no characters are lost
            if separator and i < len(parts) - 1:
                pieces.append(separator)

    _split_text(text, separators)

    # Step 2: merge the pieces back into chunks, carrying a character-level
    # overlap from the end of each finished chunk into the next one.
    merged_chunks = []
    current_chunk_parts = []
    current_chunk_length = 0

    for part in pieces:
        # Room for overlap_size is reserved on top of the current length, so
        # a chunk is flushed slightly early rather than slightly late.
        reserve = overlap_size if current_chunk_parts else 0
        if current_chunk_length + len(part) + reserve > chunk_size and current_chunk_parts:
            finished = "".join(current_chunk_parts)
            merged_chunks.append(finished)
            # Seed the next chunk with the tail of the finished one
            overlap_text = finished[max(0, len(finished) - overlap_size):]
            current_chunk_parts = [overlap_text]
            current_chunk_length = len(overlap_text)

        current_chunk_parts.append(part)
        current_chunk_length += len(part)

    if current_chunk_parts:
        merged_chunks.append("".join(current_chunk_parts))

    # Step 3: safety net. The overlap plus one large piece can still exceed
    # chunk_size, so hard-split any chunk that is too long (no overlap here).
    result_chunks = []
    for chunk in merged_chunks:
        if len(chunk) > chunk_size:
            for i in range(0, len(chunk), chunk_size):
                result_chunks.append(chunk[i:i + chunk_size])
        else:
            result_chunks.append(chunk)

    return result_chunks




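# Define chunk size, overlap size, and separators (coarsest first).
# Note: wiki_text joins paragraphs with a single "\n", so "\n\n" is unlikely
# to match here and the splitter effectively starts at "\n".
chunk_size = 500
overlap_size = 100
separators = ["\n\n", "\n", ". ", " ", ""]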

# Apply the function to wiki_text
recursive_chunks = recursive_character_chunking(wiki_text, separators, chunk_size, overlap_size)

print(f"Total number of recursive character chunks: {len(recursive_chunks)}")
print("\nFirst 3 recursive character chunks:")
for i, chunk in enumerate(recursive_chunks[:3]):
    print(f"--- Chunk {i+1} (length: {len(chunk)}) ---")
    print(chunk)


Total number of recursive character chunks: 270

First 3 recursive character chunks:
--- Chunk 1 (length: 76) ---
India, officially the Republic of India,[j][20] is a country in South Asia. 
--- Chunk 2 (length: 494) ---
India, officially the Republic of India,[j][20] is a country in South Asia.  It is the seventh-largest country by area; the most populous country since 2023;[21] and, since its independence in 1947, the world's most populous democracy.[22][23][24] Bounded by the Indian Ocean on the south, the Arabian Sea on the southwest, and the Bay of Bengal on the southeast, it shares land borders with Pakistan to the west;[k] China, Nepal, and Bhutan to the north; and Bangladesh and Myanmar to the east
--- Chunk 3 (length: 388) ---
kistan to the west;[k] China, Nepal, and Bhutan to the north; and Bangladesh and Myanmar to the east. In the Indian Ocean, India is near Sri Lanka and the Maldives; its Andaman and Nicobar Islands share a maritime border with Myanmar, Thailand, and Indon

## **Method 3: Semantic Chunking**


### Description of Semantic Chunking

Semantic chunking is an advanced text splitting strategy that aims to group sentences or text segments together based on their contextual and thematic relatedness. Unlike fixed-size or recursive character chunking, which primarily rely on length or structural delimiters, semantic chunking uses natural language processing (NLP) techniques, often leveraging sentence embeddings, to identify meaningful boundaries and ensure that each chunk represents a coherent idea or topic.

#### Working Principle

1.  **Sentence Segmentation**: The raw text is first broken down into individual sentences. This serves as the smallest unit for semantic analysis.
2.  **Sentence Embedding**: Each sentence is then converted into a numerical vector (an embedding) using a pre-trained language model (e.g., Sentence Transformers). These embeddings capture the semantic meaning of the sentences, with semantically similar sentences having embeddings that are closer in vector space.
3.  **Similarity Measurement**: The core of semantic chunking involves calculating the similarity between adjacent sentence embeddings. Cosine similarity is a common metric used for this purpose. A high similarity score indicates that two adjacent sentences are semantically related, while a low score suggests a topic shift or a natural break.
4.  **Boundary Detection**: Potential chunk boundaries are identified where the semantic similarity between consecutive sentences drops significantly. This can be done by setting a predefined `threshold` (sentences below this similarity are considered boundaries) or by finding local minima in the similarity scores, indicating a shift in topic.
5.  **Chunk Aggregation**: Sentences are then grouped together to form chunks, respecting these identified boundaries. The process also considers practical constraints such as a maximum `chunk_size` (e.g., in tokens or characters) to fit within model context windows and an `overlap_size` to maintain continuity between chunks, similar to other chunking methods. Overlap in semantic chunking might involve repeating sentences from the end of one chunk at the beginning of the next.
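
Steps 3 to 5 can be sketched without any model by using hand-made vectors in place of real sentence embeddings. Everything in the block below is an illustrative assumption: the two-dimensional `toy_embeddings`, the sample sentences, the helper names `cosine` and `group_by_similarity`, and the threshold of 0.5. A real pipeline would obtain the embeddings from a sentence-embedding model and would also enforce `chunk_size` and overlap, which this sketch omits.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def group_by_similarity(sentences, embeddings, threshold):
    """Start a new chunk wherever adjacent sentences fall below the threshold."""
    if not sentences:
        return []
    groups = [[sentences[0]]]
    for i in range(1, len(sentences)):
        if cosine(embeddings[i - 1], embeddings[i]) < threshold:
            groups.append([sentences[i]])    # similarity dropped: boundary
        else:
            groups[-1].append(sentences[i])  # same topic: extend current chunk
    return [" ".join(group) for group in groups]

sentences = ["Cats purr.", "Cats nap often.", "Rockets burn fuel.", "Rockets reach orbit."]
# Toy vectors: the first two point one way, the last two another
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]

print(group_by_similarity(sentences, toy_embeddings, threshold=0.5))
# ['Cats purr. Cats nap often.', 'Rockets burn fuel. Rockets reach orbit.']
```

The only similarity below the threshold is the one between the second and third sentences, so that is where the single boundary falls.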

#### When it is Appropriate to Use

*   **High-Quality Retrieval-Augmented Generation (RAG)**: Semantic chunking is highly beneficial for RAG systems. By ensuring that chunks are semantically coherent, the retrieval process is more likely to fetch relevant and complete pieces of information in response to a query, leading to more accurate and contextually rich generations from Large Language Models (LLMs).
*   **Complex Information Extraction**: For tasks requiring precise extraction of information, where understanding the full context of a statement is critical. Semantic chunks are more likely to contain complete thoughts or concepts.
*   **Summarization and Question Answering**: When summarizing documents or answering questions, preserving semantic integrity within chunks helps models grasp the main points and provide more accurate responses.
*   **Documents with Unclear Structural Boundaries**: In texts that lack clear paragraph breaks, headings, or other structural cues, semantic chunking can intelligently infer logical divisions based on content.
*   **Maximizing Context within Constraints**: It balances the need to break down large texts with the imperative to maintain as much meaningful context as possible within each segment.

#### When it is Not Appropriate to Use

*   **Very Short Texts**: For texts that are already very short or where the semantic content is extremely simple, the overhead of generating embeddings and calculating similarities might be unnecessary. Fixed-size or recursive methods could be more efficient.
*   **Real-time Performance Critical Applications**: Generating embeddings and calculating similarities is more computationally intensive and slower than simpler chunking methods, which matters most for very large datasets or real-time applications where latency is a major concern.
*   **Strict Character/Token Limits**: While semantic chunking tries to respect `chunk_size`, the process of respecting semantic boundaries might sometimes lead to chunks that are slightly over or under the desired `chunk_size`, making it less precise than fixed-size chunking for very strict length requirements.
*   **When Content is Highly Disjointed**: If the text is inherently a collection of highly disjointed, unrelated sentences, semantic chunking might not find clear semantic boundaries, or the resulting chunks might still appear somewhat arbitrary.

In [9]:
import numpy as np
import nltk
from nltk.tokenize import sent_tokenize
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Download the NLTK sentence tokenizer data used by sent_tokenize.
# 'punkt_tab' is downloaded explicitly because sent_tokenize raised a LookupError without it.
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)

print("Libraries imported and NLTK 'punkt' downloaded successfully.")

def semantic_chunking(text, model, chunk_size, overlap_size, threshold):
    sentences = sent_tokenize(text)
    if not sentences:
        return []

    # Generate embeddings for each sentence
    sentence_embeddings = model.encode(sentences)

    # Calculate cosine similarity between adjacent sentence embeddings
    similarities = []
    for i in range(len(sentence_embeddings) - 1):
        sim = cosine_similarity(sentence_embeddings[i].reshape(1, -1), sentence_embeddings[i+1].reshape(1, -1))[0][0]
        similarities.append(sim)

    # Identify chunk boundaries: a new chunk starts after any adjacent pair whose
    # similarity falls below the threshold (local-minimum detection is not implemented here)
    chunk_boundaries = [0] # Start of the first chunk is always a boundary
    for i in range(len(similarities)):
        # Simple thresholding for boundary detection
        if similarities[i] < threshold:
            chunk_boundaries.append(i + 1)

    # Ensure the end of the text is a boundary if not already
    if chunk_boundaries[-1] != len(sentences):
        chunk_boundaries.append(len(sentences))

    # Aggregate sentences into chunks, considering chunk_size and overlap_size
    chunks = []
    current_chunk_sentences = []
    current_chunk_length = 0
    last_added_chunk_end_index = 0

    for i in range(len(sentences)):
        sentence = sentences[i]
        sentence_len = len(sentence) # Using character length as proxy for token length

        # Close the current chunk when adding this sentence would exceed chunk_size,
        # or when this sentence sits at an identified semantic boundary.
        # In both cases, only finalize if the current chunk already has content.
        is_boundary_point = (i in chunk_boundaries)
        if (current_chunk_length + sentence_len > chunk_size and current_chunk_sentences) or (is_boundary_point and current_chunk_sentences and i != last_added_chunk_end_index):
            # Finalize the current chunk
            new_chunk = ' '.join(current_chunk_sentences)
            chunks.append(new_chunk)
            last_added_chunk_end_index = i # Record where the new chunk ended

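            # Prepare the next chunk with overlap.
            # NOTE: overlap_size is a sentence count here but a character count in the
            # fallback split below. With overlap_size=100, up to 100 trailing sentences
            # are carried over, so chunks accumulate instead of advancing.
            overlap_start_index = max(0, len(current_chunk_sentences) - overlap_size)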
            current_chunk_sentences = current_chunk_sentences[overlap_start_index:]
            current_chunk_length = sum(len(s) for s in current_chunk_sentences)

        current_chunk_sentences.append(sentence)
        current_chunk_length += sentence_len

    # Add the last accumulated chunk if any
    if current_chunk_sentences:
        chunks.append(' '.join(current_chunk_sentences))

    # Post-processing: a chunk can still exceed chunk_size when a single sentence is
    # longer than chunk_size or when carried-over overlap pushes it past the limit.
    # Such chunks fall back to a fixed-size character split.
    final_chunks = []
    for chunk_text in chunks:
        if len(chunk_text) > chunk_size:
            # If a semantically derived chunk is still too large, fall back to fixed-size split within it
            start = 0
            while start < len(chunk_text):
                end = start + chunk_size
                sub_chunk = chunk_text[start:end]
                final_chunks.append(sub_chunk)
                # The fallback split ignores sentence and semantic boundaries.
                # Advance by at least one character so the loop always terminates,
                # even if overlap_size >= chunk_size.
                start += max(1, chunk_size - overlap_size)  # character overlap
        else:
            final_chunks.append(chunk_text)

    return final_chunks

# Load a pre-trained sentence embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')
print("SentenceTransformer model loaded successfully.")

# Define parameters for semantic chunking
chunk_size = 500 # Maximum chunk length in characters
overlap_size = 100 # Sentence count during aggregation; character count in the fallback split
threshold = 0.5 # Cosine similarity threshold for boundary detection

# Apply semantic chunking to wiki_text
semantic_chunks = semantic_chunking(wiki_text, model, chunk_size, overlap_size, threshold)

print(f"\nTotal number of semantic chunks: {len(semantic_chunks)}")
print("\nFirst 3 semantic chunks:")
for i, chunk in enumerate(semantic_chunks[:3]):
    print(f"--- Chunk {i+1} (length: {len(chunk)}) ---")
    print(chunk)


Libraries imported and NLTK 'punkt' downloaded successfully.
SentenceTransformer model loaded successfully.

Total number of semantic chunks: 15263

First 3 semantic chunks:
--- Chunk 1 (length: 75) ---
India, officially the Republic of India,[j][20] is a country in South Asia.
--- Chunk 2 (length: 234) ---
India, officially the Republic of India,[j][20] is a country in South Asia. It is the seventh-largest country by area; the most populous country since 2023;[21] and, since its independence in 1947, the world's most populous democracy.
--- Chunk 3 (length: 495) ---
India, officially the Republic of India,[j][20] is a country in South Asia. It is the seventh-largest country by area; the most populous country since 2023;[21] and, since its independence in 1947, the world's most populous democracy. [22][23][24] Bounded by the Indian Ocean on the south, the Arabian Sea on the southwest, and the Bay of Bengal on the southeast, it shares land borders with Pakistan to the west;[k] China, Ne
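
The first three chunks above are nested: each repeats the previous one and appends another sentence. That pattern is consistent with the overlap being counted in sentences while `overlap_size` is given as 100, so almost every earlier sentence is carried forward. A minimal sketch of one alternative follows, in which the carried-over tail is limited by a character budget. The names `group_sentences` and `overlap_chars` are hypothetical and not part of the notebook, and the sketch assumes boundaries have already been computed as sentence indices.

```python
def joined_length(parts):
    """Length of ' '.join(parts) without building the string."""
    return sum(len(p) for p in parts) + max(0, len(parts) - 1)

def group_sentences(sentences, boundaries, chunk_size, overlap_chars):
    """Group sentences into chunks.

    A chunk is closed at each boundary index, or when adding the next sentence
    would exceed chunk_size characters. The next chunk starts with the trailing
    sentences of the previous one that fit within overlap_chars characters.
    A single sentence longer than chunk_size is emitted as its own oversized chunk.
    """
    boundary_set = set(boundaries)
    chunks, current = [], []

    for i, sentence in enumerate(sentences):
        too_long = bool(current) and joined_length(current + [sentence]) > chunk_size
        if current and (i in boundary_set or too_long):
            chunks.append(" ".join(current))

            # Collect trailing sentences while they fit in the overlap budget
            tail = []
            for previous in reversed(current):
                if joined_length([previous] + tail) > overlap_chars:
                    break
                tail.insert(0, previous)

            # Drop the overlap if it would push the new chunk past chunk_size
            if joined_length(tail + [sentence]) > chunk_size:
                tail = []
            current = tail

        current.append(sentence)

    if current:
        chunks.append(" ".join(current))
    return chunks

# Five 30-character "sentences" with a semantic boundary before the fourth
demo_sentences = [letter * 30 for letter in "abcde"]
demo_chunks = group_sentences(demo_sentences, boundaries=[3], chunk_size=100, overlap_chars=35)
for chunk in demo_chunks:
    print(len(chunk), chunk[:10] + "...")
```

Because the overlap is bounded in characters, every chunk ends on a different sentence and the number of chunks can never exceed the number of sentences. Carrying overlap across a semantic boundary is a design choice; resetting `tail` to an empty list at boundaries would keep topics fully separate.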

## Summary:

### Data Analysis Key Findings

The task involved generating a comprehensive guide on text chunking, including its definition, importance, and various strategies, using text from the Wikipedia page on "India" for demonstrations.

1.  **Introduction to Chunking**: Text chunking was defined as breaking down continuous text into smaller, meaningful segments for efficient processing in NLP. Its fundamental concept revolves around managing large texts, improving focus for NLP models, overcoming token limits, and preserving context. Its primary purposes include information extraction, question answering, summarization, semantic search, contextual understanding for LLMs, and data preprocessing.
2.  **Importance of Chunking**: Chunking was identified as a critical preprocessing step, offering benefits in:
    *   **Memory Efficiency**: Handling large documents within model context windows, reducing computational load, and enabling faster processing.
    *   **Context Preservation**: Maintaining semantic meaning and preventing arbitrary truncation of vital information.
    *   **Model Performance**: Improving accuracy and relevance for tasks like Question Answering, summarization, and especially Retrieval-Augmented Generation (RAG) by providing coherent, well-formed chunks.
3.  **Types of Chunking Strategies**: Three high-level strategies were outlined:
    *   **Fixed-Size Chunking**: Splitting text into predetermined lengths, often with optional overlap.
    *   **Content-Aware (Semantic/Structure-Aware) Chunking**: Splitting based on meaning, semantic coherence, or structural elements (e.g., paragraphs, sections).
    *   **Overlap-Based Chunking**: Including portions of adjacent chunks to maintain context across boundaries, typically used in conjunction with other methods.
4.  **Data Acquisition from Wikipedia**:
    *   Initial attempts to extract text from "https://en.wikipedia.org/wiki/India" were unsuccessful, returning an empty string.
    *   A subsequent attempt to fetch the page resulted in a 403 Forbidden error, indicating server-side blocking.
    *   The issue was resolved by including a `User-Agent` header in the HTTP request to mimic a web browser.
    *   With that header in place, the main content of the page was successfully extracted, resulting in a `wiki_text` variable containing 65,980 characters.
5.  **Method 1: Fixed-Size Chunking**:
    *   This method divides text into segments of a constant character length, with an optional overlap to preserve context.
    *   Using a `chunk_size` of 500 characters and an `overlap_size` of 100 characters, the `wiki_text` was split into 165 chunks.
    *   It is appropriate for simplicity and speed, initial exploration, or when content structure is not critical, but can break semantic coherence.
6.  **Method 2: Recursive Character Text Splitter**:
    *   This method iteratively splits text using a prioritized list of separators (e.g., `["\n\n", "\n", ". ", " ", ""]`), aiming to preserve semantic units.
    *   With a `chunk_size` of 500, `overlap_size` of 100, and the specified separators, the `wiki_text` was divided into 270 chunks.
    *   It is suitable for preserving semantic units in complex text and is a common choice for RAG pipelines, but can be overkill for simple cases.
7.  **Method 3: Semantic Chunking**:
    *   This advanced strategy groups sentences based on their contextual and thematic relatedness, leveraging sentence embeddings and similarity measurements to detect natural topic shifts.
    *   An initial `LookupError` for the `nltk` 'punkt_tab' resource was encountered and resolved by explicitly downloading it.
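    *   Using the 'all-MiniLM-L6-v2' SentenceTransformer model, a `chunk_size` of 500, `overlap_size` of 100, and a similarity `threshold` of 0.5, the `wiki_text` produced 15,263 semantic chunks. This count is far higher than for the other methods because the aggregation step applies `overlap_size` as a number of sentences rather than characters: each new chunk carries over nearly all preceding sentences, as the nested first three chunks show, and the oversized results are then re-split by the fixed-size fallback.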
    *   It is highly beneficial for RAG, complex information extraction, and documents with unclear structural boundaries, but is more computationally intensive and potentially slower than simpler methods.
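
The fixed-size count reported in item 5 can be checked with simple arithmetic. The sketch below assumes the fixed-size splitter advances by `chunk_size - overlap_size` characters per chunk; that splitter's code is not shown in this section, so `fixed_size_chunks` is a hypothetical stand-in, and a placeholder string of the reported length is used instead of the real `wiki_text`.

```python
import math

def fixed_size_chunks(text, chunk_size, overlap_size):
    """Split text into chunk_size-character windows that overlap by overlap_size."""
    step = chunk_size - overlap_size
    if step <= 0:
        raise ValueError("overlap_size must be smaller than chunk_size")
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

text_length = 65_980  # reported length of wiki_text
expected_count = math.ceil(text_length / (500 - 100))

placeholder_text = "x" * text_length  # stand-in with the same length as wiki_text
chunks = fixed_size_chunks(placeholder_text, chunk_size=500, overlap_size=100)

print(expected_count, len(chunks))  # 165 165
```

Under that assumption the window advances 400 characters at a time, and 65,980 / 400 rounds up to 165, matching the reported figure. The recursive and semantic counts cannot be predicted this way because their split points depend on the text itself.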

### Insights or Next Steps

*   **Trade-offs in Chunking**: The choice of chunking method involves a trade-off between computational efficiency, preservation of semantic context, and strict adherence to length constraints. Simple fixed-size chunking is fast but can break meaning, while semantic chunking follows topic shifts but requires embedding every sentence and is therefore more resource-intensive.
*   **Contextual Selection**: The optimal chunking strategy is highly dependent on the specific downstream NLP task and the nature of the text data. For applications like RAG where semantic coherence is paramount, more advanced methods like recursive character or semantic chunking are preferred, potentially with character-based fallback for very long sections.
