# Semantic Chunking - Simple Approaches
* Notebook by Adam Lang
* Date: 9/16/2024

# Overview
* In this notebook we experiment with several semantic chunkers that are commonly used in LLM and RAG pipelines, and in NLP in general.
* The semantic chunkers we are going to experiment with are found here: https://github.com/aurelio-labs/semantic-chunkers

## Install dependencies

In [7]:
## install
!pip install -qU \
  semantic-chunkers==0.0.3 \
  datasets==2.19.1 ## huggingface datasets just for experiments

# Semantic Chunkers
* These can be used on multi-modal data (e.g. audio, video, text) for tasks such as RAG and document splitting.
* The examples here focus on RAG (Retrieval Augmented Generation).
* There are 3 main types of semantic chunkers to try here:
1. `StatisticalChunker`
   * The most robust of the three chunking methods.
   * Uses a **varying similarity threshold** to identify more dynamic and local similarity splits.
   * Gives a good balance between accuracy and efficiency.
   * However, it can ONLY be used for text documents (unlike the multi-modal `ConsecutiveChunker`).
   * Pros of this chunker:
     * Can automatically identify a threshold value to use while chunking text.
     * Requires less customization than the other chunkers.
2. `ConsecutiveChunker`
   * Simplest method of chunking.
3. `CumulativeChunker`
   * More compute-intensive process.
   * Can often provide more stable results as it is more noise resistant.
   * Very expensive in both time and money (when the encoder is a paid API).



## Load Datasets for testing
* We will use huggingface datasets.

In [8]:
## load datasets
from datasets import load_dataset

## data
data = load_dataset('jamescalam/ai-arxiv2', split='train')
# view data
data

Dataset({
    features: ['id', 'title', 'summary', 'source', 'authors', 'categories', 'comment', 'journal_ref', 'primary_category', 'published', 'updated', 'content', 'references'],
    num_rows: 2673
})

In [9]:
## view 1 of the arxiv papers from dataset
content = data[3]["content"]
print(content[:1000])

# Mamba: Linear-Time Sequence Modeling with Selective State Spaces
# Albert Gu*1 and Tri Dao*2
1Machine Learning Department, Carnegie Mellon University 2Department of Computer Science, Princeton University agu@cs.cmu.edu, tri@tridao.me
# Abstract
Foundation models, now powering most of the exciting applications in deep learning, are almost universally based on the Transformer architecture and its core attention module. Many subquadratic-time architectures such as linear attention, gated convolution and recurrent models, and structured state space models (SSMs) have been developed to address Transformersâ computational ineï¬ciency on long sequences, but they have not performed as well as attention on important modalities such as language. We identify that a key weakness of such models is their inability to perform content-based reasoning, and make several improvements. First, simply letting the SSM parameters be functions of the input addresses their weakness with discrete modalities

In [10]:
## keep only the first 20,000 characters of the paper
content = content[:20_000]

## Encoders
* Every chunker requires an `encoder`.
* We can use open source encoders such as `HuggingFaceEncoder` or `FastEmbedEncoder`.
* We can also use closed source proprietary encoders such as `OpenAIEncoder` or `CohereEncoder`.
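
Whichever encoder is chosen, a chunker mainly needs something that turns a list of texts into one vector per text. As a rough illustration of that contract, here is a minimal sketch with a toy bag-of-words encoder. `ToyEncoder` and its vocabulary are hypothetical stand-ins invented for this example, not part of `semantic_router`, and real encoders produce far richer embeddings.

```python
import math
from typing import List


class ToyEncoder:
    """Hypothetical stand-in encoder: one normalized bag-of-words vector per text."""

    def __init__(self, vocab: List[str]):
        self.vocab = vocab

    def __call__(self, docs: List[str]) -> List[List[float]]:
        vectors = []
        for doc in docs:
            words = doc.lower().split()
            counts = [float(words.count(term)) for term in self.vocab]
            # Normalize to unit length; fall back to 1.0 for all-zero vectors.
            norm = math.sqrt(sum(c * c for c in counts)) or 1.0
            vectors.append([c / norm for c in counts])
        return vectors


toy_encoder = ToyEncoder(vocab=["state", "space", "attention", "transformer"])
toy_vectors = toy_encoder(["state space models", "attention in the transformer"])
print(len(toy_vectors), len(toy_vectors[0]))
```

Each input text yields one vector whose length equals the vocabulary size, which is all the later similarity comparisons rely on.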

In [11]:
## if you were using openai
## import os
## from getpass import getpass
## from semantic_router.encoders import OpenAIEncoder

In [12]:
## install sentence transformers
!pip install sentence-transformers



### Setup Encoder
* The default embedding model for `HuggingFaceEncoder` is: `sentence-transformers/all-MiniLM-L6-v2`.



In [13]:
## using a HuggingFaceEncoder
from semantic_router.encoders import HuggingFaceEncoder


## instantiate encoder model
encoder = HuggingFaceEncoder() ## use any open source model you wish



## 1. Statistical Chunking
* This is the best out-of-the-box (OOTB) solution, as it determines the chunking parameters for you.
* Cost effective
* FAST
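
To build some intuition for what "determining the parameters for you" can mean, the sketch below derives a split threshold from the similarity scores themselves instead of hard-coding one. This is a simplified illustration and not the algorithm that `StatisticalChunker` implements. The function name `percentile_threshold`, the 25th-percentile choice, and the example scores are all assumptions made for this example.

```python
import math
from typing import List


def percentile_threshold(similarities: List[float], percentile: float = 25.0) -> float:
    """Nearest-rank percentile of the similarity scores between neighboring splits."""
    if not similarities:
        raise ValueError("need at least one similarity score")
    ordered = sorted(similarities)
    rank = max(1, math.ceil(percentile / 100 * len(ordered)))
    return ordered[rank - 1]


# Made-up similarity scores between consecutive sentences of a document.
example_similarities = [0.82, 0.79, 0.31, 0.85, 0.77, 0.28, 0.80, 0.83]

threshold = percentile_threshold(example_similarities)

# Split wherever the similarity is at or below the data-derived threshold.
split_positions = [i for i, score in enumerate(example_similarities) if score <= threshold]
print(threshold, split_positions)
```

The point of the sketch is only that the cut-off adapts to the document: a text with uniformly high similarities gets a higher threshold than one with uniformly low similarities.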

In [14]:
## statistical chunker
from semantic_chunkers import StatisticalChunker

##setup chunker
stat_chunker = StatisticalChunker(encoder=encoder)

In [15]:
## create statistical chunks
stat_chunks = stat_chunker(docs=[content])

2024-09-16 18:25:43 INFO semantic_chunkers.utils.logger Single document exceeds the maximum token limit of 300. Splitting to sentences before semantically merging.


In [16]:
## test out chunks
stat_chunker.print(stat_chunks[0])

Split 1, tokens 300, triggered by: token limit
# Mamba: Linear-Time Sequence Modeling with Selective State Spaces # Albert Gu*1 and Tri Dao*2 1Machine Learning Department, Carnegie Mellon University 2Department of Computer Science, Princeton University agu@cs.cmu.edu, tri@tridao.me # Abstract Foundation models, now powering most of the exciting applications in deep learning, are almost universally based on the Transformer architecture and its core attention module. Many subquadratic-time architectures such as linear attention, gated convolution and recurrent models, and structured state space models (SSMs) have been developed to address Transformersâ computational ineï¬ ciency on long sequences, but they have not performed as well as attention on important modalities such as language. We identify that a key weakness of such models is their inability to perform content-based reasoning, and make several improvements. First, simply letting the SSM parameters be functions of the input

In [18]:
## stat_chunks
stat_chunks[0:2]

[[Chunk(splits=['# Mamba:', 'Linear-Time Sequence Modeling with Selective State Spaces', '# Albert Gu*1 and Tri Dao*2', '1Machine Learning Department, Carnegie Mellon University 2Department of Computer Science, Princeton University agu@cs.cmu.edu, tri@tridao.me', '# Abstract', 'Foundation models, now powering most of the exciting applications in deep learning, are almost universally based on the Transformer architecture and its core attention module.', 'Many subquadratic-time architectures such as linear attention, gated convolution and recurrent models, and structured state space models (SSMs) have been developed to address Transformersâ', 'computational ineï¬', 'ciency on long sequences, but they have not performed as well as attention on important modalities such as language.', 'We identify that a key weakness of such models is their inability to perform content-based reasoning, and make several improvements.', 'First, simply letting the SSM parameters be functions of the input addr

## 2. Consecutive Chunking
* Simplest version of semantic chunking.
* Most encoders require different score thresholds.
* As an example, OpenAI's `text-embedding-ada-002` typically works with a similarity threshold of 0.7 to 0.8.
* Newer text embedding models such as `text-embedding-3-small` use a much lower similarity threshold of around 0.3.

### How Consecutive Chunking Works
* Compares the embeddings of neighboring splits, looks for a drop in similarity score, and starts a new chunk at that point.
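
The idea can be sketched in a few lines of plain Python. This is a minimal illustration under the assumption that a split is triggered when the cosine similarity between two neighboring splits falls below the threshold. The helper names and the hand-made 2D "embeddings" are hypothetical and are not taken from the `semantic_chunkers` source.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either vector is all zeros)."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def consecutive_chunks(
    splits: List[str], embeddings: List[Sequence[float]], score_threshold: float
) -> List[List[str]]:
    """Group splits, starting a new chunk when neighbor similarity drops below the threshold."""
    if not splits:
        return []
    chunks, current = [], [splits[0]]
    for idx in range(1, len(splits)):
        if cosine(embeddings[idx - 1], embeddings[idx]) < score_threshold:
            chunks.append(current)
            current = []
        current.append(splits[idx])
    chunks.append(current)
    return chunks


# Hand-made example: two sentences about SSMs, then two about attention.
demo_splits = [
    "SSMs scale linearly.",
    "SSMs keep a recurrent state.",
    "Attention is quadratic.",
    "Attention compares all token pairs.",
]
demo_embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]

demo_chunks = consecutive_chunks(demo_splits, demo_embeddings, score_threshold=0.5)
print(demo_chunks)
```

Only neighboring pairs are compared, which is why this method is cheap but sensitive to a single noisy sentence.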

In [19]:
from semantic_chunkers import ConsecutiveChunker

## setup chunker
cons_chunker = ConsecutiveChunker(encoder=encoder, score_threshold=0.3)

In [20]:
## create consecutive chunks
cons_chunks = cons_chunker(docs=[content])

In [22]:
## print chunks
cons_chunker.print(cons_chunks[0])

Split 1, tokens None, triggered by: 0.06
[31m# Mamba:[0m
----------------------------------------------------------------------------------------


Split 2, tokens None, triggered by: 0.08
[32mLinear-Time Sequence Modeling with Selective State Spaces[0m
----------------------------------------------------------------------------------------


Split 3, tokens None, triggered by: 0.08
[34m# Albert Gu*1 and Tri Dao*2[0m
----------------------------------------------------------------------------------------


Split 4, tokens None, triggered by: 0.05
[35m1Machine Learning Department, Carnegie Mellon University 2Department of Computer Science, Princeton University agu@cs.cmu.edu, tri@tridao.me[0m
----------------------------------------------------------------------------------------


Split 5, tokens None, triggered by: 0.15
[31m# Abstract[0m
----------------------------------------------------------------------------------------


Split 6, tokens None, triggered by: 0.12
[32mFo

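Summary:
* Depending upon the similarity threshold you set, the chunks can be too small or too big.
* The chunks above are too small: every similarity score shown (0.05 to 0.15) falls below the 0.3 threshold, so nearly every sentence becomes its own chunk. We may want to lower the threshold for this encoder.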
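
To see how the threshold affects chunk size, one can count how many of the printed scores would trigger a split at different thresholds. The sketch below assumes a split is triggered whenever the similarity between neighboring splits falls below `score_threshold`, which is consistent with the "triggered by" values printed above. It uses only the six scores visible in that (truncated) output, and `count_splits` is a hypothetical helper.

```python
from typing import List

# The "triggered by" scores visible in the ConsecutiveChunker output above.
triggered_scores = [0.06, 0.08, 0.08, 0.05, 0.15, 0.12]


def count_splits(scores: List[float], score_threshold: float) -> int:
    """Number of neighbor pairs whose similarity falls below the threshold."""
    return sum(1 for score in scores if score < score_threshold)


for candidate in (0.05, 0.1, 0.3):
    print(candidate, count_splits(triggered_scores, candidate))
```

Fewer triggered splits means larger chunks, so a sweep like this is a quick way to pick a starting threshold for a given encoder.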

## 3. Cumulative Chunker
* Cumulatively adds splits of text --> creates an embedding of the accumulated text --> tests its cosine similarity against the next split --> starts a new chunk when the similarity drops.
* Creates more embeddings than the other chunkers.
* More expensive if using an API encoder.
* Needs more compute power.
* MORE NOISE resistant.
* Results can be worse than the statistical chunker.
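
The cumulative idea can be sketched as follows. This is a minimal illustration, assuming that the accumulated text of the current chunk is re-embedded at every step and compared with the next split. `toy_encode`, `TOY_VOCAB`, and `cumulative_chunks` are hypothetical names for this example, with a word-count encoder standing in for a real embedding model.

```python
import math
from typing import Callable, List

TOY_VOCAB = ["ssm", "state", "attention", "token"]  # hypothetical vocabulary


def toy_encode(text: str) -> List[float]:
    """Hypothetical encoder: counts how often each vocabulary word occurs."""
    words = text.lower().split()
    return [float(words.count(term)) for term in TOY_VOCAB]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either vector is all zeros)."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def cumulative_chunks(
    splits: List[str], encode: Callable[[str], List[float]], score_threshold: float
) -> List[List[str]]:
    """Compare the accumulated chunk text with the next split; cut when similarity drops."""
    if not splits:
        return []
    chunks, start = [], 0
    for idx in range(len(splits) - 1):
        accumulated = " ".join(splits[start : idx + 1])
        # Two encoder calls per step: this is where the extra cost comes from.
        score = cosine(encode(accumulated), encode(splits[idx + 1]))
        if score < score_threshold:
            chunks.append(splits[start : idx + 1])
            start = idx + 1
    chunks.append(splits[start:])
    return chunks


cum_demo_splits = [
    "ssm state update",
    "ssm state scan",
    "attention token scores",
    "attention token mask",
]
cum_demo_chunks = cumulative_chunks(cum_demo_splits, toy_encode, score_threshold=0.5)
print(cum_demo_chunks)
```

Because the accumulated text is embedded again at every step, the number of encoder calls grows with the number of splits, which matches the cost concerns listed above.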

In [26]:
## import
from semantic_chunkers import CumulativeChunker

## setup chunker
cum_chunker = CumulativeChunker(encoder=encoder, score_threshold=0.3) ## change threshold

In [27]:
## chunks
cum_chunks = cum_chunker(docs=[content])

In [28]:
## print cumulative chunks
cum_chunker.print(cum_chunks[0])

Split 1, tokens None, triggered by: 0.06
[31m# Mamba:[0m
----------------------------------------------------------------------------------------


Split 2, tokens None, triggered by: 0.08
[32mLinear-Time Sequence Modeling with Selective State Spaces[0m
----------------------------------------------------------------------------------------


Split 3, tokens None, triggered by: 0.08
[34m# Albert Gu*1 and Tri Dao*2[0m
----------------------------------------------------------------------------------------


Split 4, tokens None, triggered by: 0.05
[35m1Machine Learning Department, Carnegie Mellon University 2Department of Computer Science, Princeton University agu@cs.cmu.edu, tri@tridao.me[0m
----------------------------------------------------------------------------------------


Split 5, tokens None, triggered by: 0.15
[31m# Abstract[0m
----------------------------------------------------------------------------------------


Split 6, tokens None, triggered by: 0.09
[32mFo

# LangChain Semantic Text Splitter
* A future semantic chunker to try is the LangChain semantic chunker.
* Documentation here: https://python.langchain.com/v0.2/docs/how_to/semantic-chunker/