## Manual chunking
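A basic overlap chunker: each chunk repeats the last `overlap_size` units of the previous chunk, so context carries across chunk boundaries. It supports:
* Character-based or word-based chunking
* Configurable chunk size and overlap size
* Input validation (`chunk_size` must be greater than `overlap_size`)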
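
The two strategies differ only in the unit being counted. As a minimal sketch, assuming a hypothetical helper `sliding_windows` (an illustration, not part of this notebook's code) that windows over any list of units:

```python
def sliding_windows(units, size, overlap):
    """Hypothetical helper: overlapping windows over a list of units."""
    step = size - overlap  # assumes size > overlap >= 0
    windows = []
    start = 0
    while start < len(units):
        windows.append(units[start:start + size])
        # Stop once the window has reached the last unit
        if start + size >= len(units):
            break
        start += step
    return windows

# Word-based: units are whitespace-separated words
by_word = [" ".join(w) for w in sliding_windows("a b c d e".split(), 3, 1)]

# Character-based: units are individual characters
by_char = ["".join(w) for w in sliding_windows(list("abcdefgh"), 4, 1)]

print(by_word)  # ['a b c', 'c d e']
print(by_char)  # ['abcd', 'defg', 'gh']
```

In both cases the last unit of one window reappears as the first unit of the next, which is the overlap.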

In [18]:
class file_chunking:
    @staticmethod
    def manual_overlap_chunking(text, chunk_size, overlap_size, chunking_strategy="word"):
        """
        Manually chunk text with overlapping segments to preserve context.
        
        Args:
            text (str): The input text to be chunked
            chunk_size (int): The size of each chunk
            overlap_size (int): The number of characters/words to overlap between chunks
            chunking_strategy (str): Strategy for chunking - "character" or "word"
        
        Returns:
            list: List of overlapping text chunks
        
        Raises:
            ValueError: If chunk_size <= overlap_size, if overlap_size is negative,
                or if chunking_strategy is not "character" or "word"
        """
        if overlap_size < 0:
            raise ValueError("overlap_size must be non-negative")

        if chunk_size <= overlap_size:
            raise ValueError("chunk_size must be greater than overlap_size")

        if chunking_strategy not in ("character", "word"):
            raise ValueError("chunking_strategy must be 'character' or 'word'")

        if not text.strip():
            return []

        # Split into units: whitespace-separated words or individual characters
        if chunking_strategy == "word":
            units, separator = text.split(), " "
        else:
            units, separator = list(text), ""

        # Slide a window of chunk_size units, advancing by (chunk_size - overlap_size)
        step = chunk_size - overlap_size
        chunks = []
        start = 0
        while start < len(units):
            end = min(start + chunk_size, len(units))
            chunks.append(separator.join(units[start:end]))

            # Stop once the window has reached the last unit
            if end == len(units):
                break
            start += step

        return chunks
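
Because the window advances by `chunk_size - overlap_size` units each time, the chunk boundaries can be predicted without running the chunker. The sketch below assumes a hypothetical helper, `chunk_starts`, that is not part of the class above; it computes where each chunk begins for an input of `n_units` words or characters.

```python
import math

def chunk_starts(n_units, chunk_size, overlap_size):
    """Hypothetical helper: start index of every chunk for n_units units."""
    step = chunk_size - overlap_size  # assumes chunk_size > overlap_size >= 0
    if n_units == 0:
        return []
    # The final chunk must cover the last unit, so count the steps needed for that
    n_chunks = max(1, math.ceil((n_units - overlap_size) / step))
    return [k * step for k in range(n_chunks)]

# 40 units, chunk_size=15, overlap_size=3 -> the window advances 12 units at a time
print(chunk_starts(40, 15, 3))  # [0, 12, 24, 36]
```

A larger overlap shrinks the step, so the same input produces more chunks and more repeated text.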

In [19]:
long_text = """
    Hugging Face provides tools for building, training, and deploying transformer-based models.
    Chunking is useful for processing long documents that exceed the model’s max sequence length.
    By splitting text into overlapping segments, we preserve context across chunks.
    This technique is often used in NLP pipelines for retrieval-augmented generation (RAG),
    question answering, and summarization.
    """

In [20]:
print("\n--- Word-based ---")
word_chunks = file_chunking.manual_overlap_chunking(long_text, 15, 3, chunking_strategy="word")
for i, c in enumerate(word_chunks, 1):
    print(f"Chunk {i}: {c}")


--- Word-based ---
Chunk 1: Hugging Face provides tools for building, training, and deploying transformer-based models. Chunking is useful for
Chunk 2: is useful for processing long documents that exceed the model’s max sequence length. By splitting
Chunk 3: length. By splitting text into overlapping segments, we preserve context across chunks. This technique is
Chunk 4: This technique is often used in NLP pipelines for retrieval-augmented generation (RAG), question answering, and
Chunk 5: question answering, and summarization.
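
As a sanity check on the output above, each chunk should begin with the last three words of the previous one, and dropping those repeated words should recover the original word sequence. The sketch below copies the five printed chunks as literals; `overlaps_match` and `merge_chunks` are hypothetical helpers introduced only for this check.

```python
# Chunks copied from the printed output (chunk_size=15, overlap_size=3)
chunks = [
    "Hugging Face provides tools for building, training, and deploying "
    "transformer-based models. Chunking is useful for",
    "is useful for processing long documents that exceed the model’s "
    "max sequence length. By splitting",
    "length. By splitting text into overlapping segments, we preserve "
    "context across chunks. This technique is",
    "This technique is often used in NLP pipelines for "
    "retrieval-augmented generation (RAG), question answering, and",
    "question answering, and summarization.",
]
overlap_size = 3

def overlaps_match(chunks, overlap_size):
    """Hypothetical check: each chunk starts with the previous chunk's last words."""
    for prev, nxt in zip(chunks, chunks[1:]):
        prev_words, next_words = prev.split(), nxt.split()
        tail = prev_words[len(prev_words) - overlap_size:]
        if tail != next_words[:overlap_size]:
            return False
    return True

def merge_chunks(chunks, overlap_size):
    """Hypothetical inverse: drop the repeated words and rejoin the rest."""
    words = chunks[0].split()
    for chunk in chunks[1:]:
        words.extend(chunk.split()[overlap_size:])
    return " ".join(words)

print(overlaps_match(chunks, overlap_size))             # True
print(len(merge_chunks(chunks, overlap_size).split()))  # 52
```

The merged text has 52 words, matching the sample paragraph, so no word is lost at a chunk boundary.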
