# Chunking Methods
Chunking methods are strategies for breaking large pieces of text into manageable, meaningful segments. They are essential for natural language processing applications such as summarization, semantic search, and text generation.

- Chunk Size - The number of characters you would like in each chunk: 50, 100, 100,000, etc.
- Chunk Overlap - The number of characters that sequential chunks share. Overlap helps avoid cutting a single piece of context in two, at the cost of duplicating data across chunks.
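
To make the two parameters concrete, here is a minimal sketch in plain Python. The function name `chunk_with_overlap` and the sample string are illustrative assumptions, not part of any library; the idea is simply that the window advances by `chunk_size - chunk_overlap` characters each step.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character chunks that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

sample = "abcdefghijklmnopqrstuvwxyz"
overlap_chunks = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(overlap_chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one, which is the duplicated data mentioned above.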


[ChunkViz](https://www.chunkviz.com/) lets you visualize how these settings affect the splits.

<img src="../figures/chunking.png" >

In [None]:
!pip install langchain
!pip install langchain-groq
!pip install langchain-openai
!pip install langchain-core
!pip install langchain-community
!pip install sentence-transformers  # needed by HuggingFaceBgeEmbeddings below

In [None]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_groq import ChatGroq
from langchain_community.embeddings import HuggingFaceBgeEmbeddings  # moved out of langchain.embeddings

llm = ChatGroq(model="meta-llama/llama-4-scout-17b-16e-instruct")

embedding_model = HuggingFaceBgeEmbeddings(
    model_name = "BAAI/bge-small-en-v1.5",
    model_kwargs = {'device':'cpu'},
    encode_kwargs = {'normalize_embeddings':True}
)

In [1]:
text = """
One of the most important things I didn't understand about the world when I was a child is the degree to which the returns for performance are superlinear.

Teachers and coaches implicitly told us the returns were linear. "You get out," I heard a thousand times, "what you put in." They meant well, but this is rarely true. If your product is only half as good as your competitor's, you don't get half as many customers. You get no customers, and you go out of business.

It's obviously true that the returns for performance are superlinear in business. Some think this is a flaw of capitalism, and that if we changed the rules it would stop being true. But superlinear returns for performance are a feature of the world, not an artifact of rules we've invented. We see the same pattern in fame, power, military victories, knowledge, and even benefit to humanity. In all of these, the rich get richer.

You can't understand the world without understanding the concept of superlinear returns. And if you're ambitious you definitely should, because this will be the wave you surf on.

It may seem as if there are a lot of different situations with superlinear returns, but as far as I can tell they reduce to two fundamental causes: exponential growth and thresholds.

The most obvious case of superlinear returns is when you're working on something that grows exponentially. For example, growing bacterial cultures. When they grow at all, they grow exponentially. But they're tricky to grow. Which means the difference in outcome between someone who's adept at it and someone who's not is very great.

Startups can also grow exponentially, and we see the same pattern there. Some manage to achieve high growth rates. Most don't. And as a result you get qualitatively different outcomes: the companies with high growth rates tend to become immensely valuable, while the ones with lower growth rates may not even survive.

Y Combinator encourages founders to focus on growth rate rather than absolute numbers. It prevents them from being discouraged early on, when the absolute numbers are still low. It also helps them decide what to focus on: you can use growth rate as a compass to tell you how to evolve the company. But the main advantage is that by focusing on growth rate you tend to get something that grows exponentially.

YC doesn't explicitly tell founders that with growth rate "you get out what you put in," but it's not far from the truth. And if growth rate were proportional to performance, then the reward for performance p over time t would be proportional to p^t.

Even after decades of thinking about this, I find that sentence startling.
"""

### 1. **Fixed Size (Character) Text Splitting**  
This method involves dividing text into chunks based on character count. It’s straightforward and commonly used in text preprocessing.  

* **Pros:** Easy and simple
* **Cons:** Very rigid; it doesn't take the structure of your text into account
* **Best Use Cases:** Quick preprocessing or exploratory analysis; word frequency or simple keyword matching.

In [4]:
chunk_size = 200 # Characters

# Step through the text in strides of chunk_size and slice out each chunk
def fixed_size_chunking(text, chunk_size):
    return [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)]

fixed_size_chunks = fixed_size_chunking(text, chunk_size)
len(fixed_size_chunks)

14

### 2. **Recursive Character Text Splitting**  
Recursive splitting tries a prioritized list of separators: it first splits on the coarsest one, and any piece still larger than `chunk_size` is split again with the next, finer separator. Chunks therefore tend to follow paragraph and line boundaries instead of cutting mid-word.

LangChain's default separators are listed below, in the order they are tried.

- "\n\n" - Double new line, or most commonly paragraph breaks
- "\n" - New lines
- " " - Spaces
- "" - Characters

**Configurable Parameters**:  
  - `chunk_size`: Size of the final chunks.  
  - `chunk_overlap`: Overlap between chunks to maintain context.  

In [6]:
from langchain_text_splitters import RecursiveCharacterTextSplitter  # installed alongside langchain

splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=0)
recursive_chunks = splitter.split_text(text)

In [7]:
len(recursive_chunks)

18
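
To make the recursion concrete, here is a minimal sketch in plain Python. It is not LangChain's implementation (which also handles overlap and separator retention); the function name `recursive_split` and the demo string are assumptions for illustration. It splits on the coarsest separator, merges small neighbours back together, and recurses with finer separators only on pieces that are still too long.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split on the coarsest separator first, recursing only on oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Last resort: hard cut by character count.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # merge small neighbours back together
            continue
        if current.strip():
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # This piece alone is too long: retry it with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current.strip():
        chunks.append(current)
    return chunks

demo = (
    "First paragraph is short.\n\n"
    "Second paragraph is quite a bit longer than the limit allows.\n\n"
    "Third."
)
demo_chunks = recursive_split(demo, chunk_size=40)
print(demo_chunks)
```

The short paragraphs survive intact, and only the long middle paragraph falls through to the space separator, where it is cut between words.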

### 3. **Document Specific Splitting**

Let's start to handle document types other than normal prose in a .txt file. What if you have pictures, a PDF, or code snippets?

The first two methods wouldn't work well for these, so we'll need a different tactic.

This method is all about making your chunking strategy fit your data formats. Let's run through an example of this in action.

The Markdown, Python, and JS splitters work much like the recursive character splitter, but with different separators.

See all of LangChain's document splitters [here](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/code_splitter) and LlamaIndex's node parsers ([HTML](https://docs.llamaindex.ai/en/stable/module_guides/loading/node_parsers/modules.html#htmlnodeparser), [JSON](https://docs.llamaindex.ai/en/stable/module_guides/loading/node_parsers/modules.html#jsonnodeparser), [Markdown](https://docs.llamaindex.ai/en/stable/module_guides/loading/node_parsers/modules.html#markdownnodeparser)).

#### Markdown

You can see the separators [here](https://github.com/langchain-ai/langchain/blob/9ef2feb6747f5a69d186bd623b569ad722829a5e/libs/langchain/langchain/text_splitter.py#L1175).

Separators:
* `\n#{1,6}` - Split by new lines followed by a header (H1 through H6)
* ```` ```\n ```` - Code blocks
* `\n\\*\\*\\*+\n` - Horizontal Lines
* `\n---+\n` - Horizontal Lines
* `\n___+\n` - Horizontal Lines
* `\n\n` - Double new lines
* `\n` - New line
* `" "` - Spaces
* `""` - Characters

In [17]:
markdown_text = """
# Fun in California

## Driving

Try driving on the 1 down to San Diego

### Food

Make sure to eat a burrito while you're there

## Hiking

Go to Yosemite
"""

In [18]:
from langchain_text_splitters import MarkdownTextSplitter  # installed alongside langchain
splitter = MarkdownTextSplitter(chunk_size = 40, chunk_overlap=0)
splitter.split_text(markdown_text)

['# Fun in California\n\n## Driving',
 'Try driving on the 1 down to San Diego',
 '### Food',
 "Make sure to eat a burrito while you're",
 'there',
 '## Hiking\n\nGo to Yosemite']
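
To see what the header separator does on its own, here is a minimal sketch using only Python's `re` module. The pattern is a simplified assumption (a newline followed by one to six `#` characters and a space), not the exact separator list LangChain uses, and `md_sample` is made up for the demo.

```python
import re

md_sample = "# Title\n\nIntro text\n\n## Section A\n\nBody A\n\n## Section B\n\nBody B\n"

# Split at every newline that is immediately followed by a header.
# The lookahead keeps each header attached to the section it introduces.
sections = [s.strip() for s in re.split(r"\n(?=#{1,6} )", md_sample) if s.strip()]

for section in sections:
    print(repr(section))
```

Because the split happens before each header rather than after it, a header always travels with the text it introduces, which is usually what you want for retrieval.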

### 4. **Semantic Chunking**  
Semantic chunking uses embeddings to split text based on meaning rather than structure. This ensures that semantically similar content is grouped together.  

Embeddings represent the semantic meaning of a string. They don't do much on their own, but when compared to embeddings of other texts you can start to infer the relationship between chunks. I want to lean into this property and explore using embeddings to find clusters of semantically similar texts.

The hypothesis is that semantically similar chunks should be held together.

I tried a few methods:
1) **Hierarchical clustering with positional reward** - I wanted to see how hierarchical clustering of sentence embeddings would do. But because I chose to split on sentences, there was an issue with short sentences after a long one. You know? (like this last sentence). They could change the meaning of a chunk, so I added a positional reward, and clusters were more likely to form if they were sentences next to each other. This ended up being OK, but tuning the parameters was slow and suboptimal.
2) **Find break points between sequential sentences** - Next up I tried a walk method. I started at the first sentence, got the embedding, then compared it to sentence #2, then compared #2 and #3 and so on. I was looking for "break points" where embedding distance was large. If it was above a threshold, then I considered it the start of a new semantic section. I originally tried taking embeddings of every sentence, but this turned out to be too noisy. So I ended up taking groups of 3 sentences (a window), then got an embedding, then dropped the first sentence, and added the next one. This worked out a bit better.
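
The break-point idea in method 2 can be sketched in plain Python. This is only an illustration under stated assumptions: `toy_embed` is a bag-of-words stand-in for a real embedding model, the window is built from each sentence plus one neighbour on each side, and the fixed `threshold` of 0.25 was picked for this toy data rather than taken from the experiments described above.

```python
import math
import re
from collections import Counter

def toy_embed(piece):
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z']+", piece.lower()))

def cosine_distance(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0

def semantic_chunks(sentences, buffer=1, threshold=0.25, embed=toy_embed):
    """Start a new chunk wherever neighbouring sentence windows diverge."""
    if not sentences:
        return []
    # Embed each sentence together with `buffer` neighbours on each side.
    windows = [" ".join(sentences[max(0, i - buffer): i + buffer + 1])
               for i in range(len(sentences))]
    vectors = [embed(w) for w in windows]
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine_distance(vectors[i - 1], vectors[i]) > threshold:
            chunks.append(" ".join(current))  # break point: close the section
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks

demo_sentences = [
    "Cats purr softly.", "Cats nap often.", "Cats chase mice.",
    "Rockets burn fuel.", "Rockets reach orbit.", "Rockets carry satellites.",
]
demo_semantic = semantic_chunks(demo_sentences)
print(demo_semantic)
```

Swapping `toy_embed` for a real model changes the distances, so the threshold would need retuning; a percentile of the observed distances is one common alternative to a fixed cut-off.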


In [1]:
# !pip install langchain-experimental langchain-community

In [None]:
from langchain_experimental.text_splitter import SemanticChunker

# Use the embedding model defined earlier. FakeEmbeddings returns random vectors,
# so break points computed from it would carry no semantic meaning.
splitter = SemanticChunker(embedding_model)
chunks = splitter.split_text(text)
print(chunks)

### 5. **Agentic Chunking**  
This advanced method uses AI to extract propositions or logical segments from text. It’s ideal for complex documents requiring logical segmentation.  

#### Tools:  
- `AgenticChunker` or similar frameworks.  

#### Example:  

**Input Text**:  
"Climate change is a global issue. Reducing emissions is crucial. Governments should enforce stricter laws."  

**Chunks**:  
1. "Climate change is a global issue."  
2. "Reducing emissions is crucial."  
3. "Governments should enforce stricter laws."  

**Use Case**: Extracting logical arguments for debate or analysis.
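
The control flow behind this approach can be sketched without any LLM. Everything named here is hypothetical: `agentic_chunk` is an illustrative loop, and `keyword_judge` is a crude stand-in for the model call that would normally decide where a proposition belongs. The fourth proposition is added purely for illustration.

```python
def agentic_chunk(propositions, choose_chunk):
    """Route each proposition to an existing chunk or open a new one.

    choose_chunk(proposition, chunks) stands in for the LLM call: it returns the
    index of the chunk the proposition belongs to, or None to start a new chunk.
    """
    chunks = []
    for proposition in propositions:
        index = choose_chunk(proposition, chunks)
        if index is None:
            chunks.append([proposition])
        else:
            chunks[index].append(proposition)
    return chunks

def long_words(sentence):
    """Lowercased words longer than five letters, punctuation stripped."""
    return {w.strip(".,").lower() for w in sentence.split() if len(w.strip(".,")) > 5}

def keyword_judge(proposition, chunks):
    """Hypothetical stand-in for an LLM: reuse the first chunk sharing a long word."""
    for i, chunk in enumerate(chunks):
        chunk_words = set().union(*(long_words(p) for p in chunk))
        if long_words(proposition) & chunk_words:
            return i
    return None

propositions = [
    "Climate change is a global issue.",
    "Reducing emissions is crucial.",
    "Governments should enforce stricter laws.",
    "Emissions drive climate change.",  # added for illustration
]
agentic_demo = agentic_chunk(propositions, keyword_judge)
print(agentic_demo)
```

In a real setup, `choose_chunk` would prompt the model with the proposition and a summary of each existing chunk, which is where the cost and the quality of this method come from.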

In [6]:
# !pip install agentic-chunker

In [None]:
from agentic_chunker import AgenticChunker
ac = AgenticChunker(llm_client=llm)

In [None]:
ac.pretty_print_chunks()

In [None]:
chunks = ac.get_chunks(get_type='list_of_strings')