## Key Features:
1. recursive_chunk_text() - Main method for chunking raw text strings
2. recursive_chunk_documents() - Method for chunking existing Document objects
3. chunk_with_metadata() - Method that preserves metadata across chunks

### How Recursive Chunking Works:
The RecursiveCharacterTextSplitter works through a hierarchy of separators, moving on to the next one only for pieces that are still larger than the chunk size:
1. First tries to split on "\n\n" (paragraph breaks)
2. Then tries "\n" (line breaks)
3. Then tries " " (spaces)
4. Finally falls back to "" (the empty string), which splits between individual characters

This approach helps preserve semantic boundaries while keeping chunks within the specified size.
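
The separator hierarchy described above can be illustrated with a small pure-Python sketch. This is not LangChain's implementation: `simple_recursive_split` is a hypothetical helper written for illustration, and it assumes no overlap, drops nothing but empty pieces, and measures length with `len()`.

```python
from typing import List


def simple_recursive_split(text: str, chunk_size: int, separators: List[str]) -> List[str]:
    """Toy separator-hierarchy splitter (illustrative only, no chunk overlap)."""
    if len(text) <= chunk_size:
        return [text] if text else []

    if not separators or separators[0] == "":
        # Last resort: cut between characters at a fixed width.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    separator, remaining = separators[0], separators[1:]
    if separator not in text:
        # This separator does not occur, so fall through to the next one.
        return simple_recursive_split(text, chunk_size, remaining)

    chunks: List[str] = []
    current = ""
    for piece in text.split(separator):
        if not piece:
            continue
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            # Merge small neighbouring pieces back together.
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece is still too large: recurse with the finer separators.
            chunks.extend(simple_recursive_split(piece, chunk_size, remaining))
    if current:
        chunks.append(current)
    return chunks


demo = "aaaa bbbb\n\ncccc dddd eeee\n\nffff"
print(simple_recursive_split(demo, 10, ["\n\n", "\n", " ", ""]))
```

In this sketch the paragraph break is tried first, and only the middle paragraph, which is longer than 10 characters, is re-split on spaces. The real splitter additionally handles overlap, regex separators, and separator retention.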

In [11]:
"""
Recursive Chunking using LangChain

This module provides methods for recursively splitting text into smaller chunks
while preserving semantic boundaries using LangChain's text splitters.
"""
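# Imports live at module level: names bound in a class body are not visible
# inside the bodies of the functions defined in that class.
from typing import Any, Callable, List, Optional

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document


class file_chunking:

    # Defined without `self`: from outside the class body, call it as
    # file_chunking.recursive_chunk_text(...)
    def recursive_chunk_text(
        text: str,
        chunk_size: int = 1000,
        chunk_overlap: int = 200,
        separators: Optional[List[str]] = None,
        length_function: Optional[Callable[[str], int]] = None,
        is_separator_regex: bool = False,
        keep_separator: bool = False,
        add_start_index: bool = False,
        strip_whitespace: bool = True,
        **kwargs: Any
    ) -> List[Document]: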
        """
        Recursively chunk text using LangChain's RecursiveCharacterTextSplitter.
        
        This method splits text into smaller chunks while trying to preserve
        semantic boundaries by using a hierarchy of separators.
        
        Args:
            text (str): The input text to be chunked
            chunk_size (int): The target size of each chunk in characters. Default: 1000
            chunk_overlap (int): The number of characters to overlap between chunks. Default: 200
            separators (List[str], optional): List of separators to use for splitting.
                If None, uses default separators: ["\n\n", "\n", " ", ""]
            length_function (callable, optional): Function to calculate length of text.
                If None, uses len() function
            is_separator_regex (bool): Whether separators are regex patterns. Default: False
            keep_separator (bool): Whether to keep separators in the chunks. Default: False
            add_start_index (bool): Whether to add start index to metadata. Default: False
            strip_whitespace (bool): Whether to strip whitespace from chunks. Default: True
            **kwargs: Additional arguments passed to RecursiveCharacterTextSplitter
            
        Returns:
            List[Document]: List of Document objects containing the chunked text
            
        Example:
            >>> text = "This is a long text that needs to be chunked. " * 100
            >>> chunks = file_chunking.recursive_chunk_text(text, chunk_size=500, chunk_overlap=50)
            >>> print(f"Number of chunks: {len(chunks)}")
            >>> print(f"First chunk: {chunks[0].page_content[:100]}...")
        """
        
        # Set default separators if not provided
        if separators is None:
            separators = ["\n\n", "\n", " ", ""]
        
        # Set default length function if not provided
        if length_function is None:
            length_function = len
        
        # Create the recursive text splitter
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
            separators=separators,
            length_function=length_function,
            is_separator_regex=is_separator_regex,
            keep_separator=keep_separator,
            add_start_index=add_start_index,
            strip_whitespace=strip_whitespace,
            **kwargs
        )
        
        # Split the text and wrap each chunk in a Document.
        # create_documents() is used instead of split_text() so that
        # add_start_index=True actually records "start_index" in the metadata.
        documents = text_splitter.create_documents([text])

        return documents



    # Example usage
    if __name__ == "__main__":
        # Example text for testing
        sample_text = """
        This is a sample document that demonstrates recursive chunking.
        
        The first paragraph contains some initial information about the topic.
        It explains the basic concepts and provides context for the reader.
        
        The second paragraph goes into more detail about the implementation.
        It discusses the technical aspects and provides examples of how to use the method.
        
        The third paragraph concludes the document with final thoughts and recommendations.
        It summarizes the key points and suggests next steps for the reader.
        
        This is additional content to make the text longer and demonstrate chunking behavior.
        The recursive chunking algorithm will split this text into smaller pieces while
        trying to preserve semantic boundaries and maintain readability.
        """
        
        # Test basic recursive chunking
        print("=== Basic Recursive Chunking ===")
        chunks = recursive_chunk_text(sample_text, chunk_size=200, chunk_overlap=50)
        print(f"Number of chunks: {len(chunks)}")
        for i, chunk in enumerate(chunks):
            print(f"\nChunk {i+1}:")
            print(f"Length: {len(chunk.page_content)}")
            print(f"Content: {chunk.page_content[:100]}...")

=== Basic Recursive Chunking ===
Number of chunks: 6

Chunk 1:
Length: 63
Content: This is a sample document that demonstrates recursive chunking....

Chunk 2:
Length: 146
Content: The first paragraph contains some initial information about the topic.
        It explains the basic...

Chunk 3:
Length: 159
Content: The second paragraph goes into more detail about the implementation.
        It discusses the techni...

Chunk 4:
Length: 160
Content: The third paragraph concludes the document with final thoughts and recommendations.
        It summa...

Chunk 5:
Length: 173
Content: This is additional content to make the text longer and demonstrate chunking behavior.
        The re...

Chunk 6:
Length: 64
Content: trying to preserve semantic boundaries and maintain readability....
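
The previews above show that the second line of each chunk keeps the eight-space indentation of the triple-quoted string, and that indentation counts toward each chunk's length. One possible way to avoid this, sketched here with the standard library's `textwrap.dedent`, is to normalize the text before chunking. The variable names and the sample string are illustrative assumptions, not part of the original cell.

```python
import textwrap

# Stand-in for a triple-quoted string written inside an indented block.
indented_text = """
        First paragraph, line one.
        First paragraph, line two.

        Second paragraph.
        """

# dedent() removes the whitespace common to every non-blank line;
# strip() drops the blank first and last lines left by the triple quotes.
clean_text = textwrap.dedent(indented_text).strip()

print(repr(clean_text))
```

The cleaned string can then be passed to the chunking function in place of the raw one. Chunk counts and lengths would differ from the output shown above, since the indentation no longer contributes to chunk size.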
