# Semantic Chunkers for RAG
* Notebook by Adam Lang
* Date: 7/12/2024

# Overview
* In this notebook we will go over Semantic Chunkers, a popular library for chunking text in RAG-LLM applications.

# What is Semantic Chunking?
* `Semantic Chunkers` is a multi-modal Python library for intelligent chunking of text, video, and audio. It aims to make AI and data processing more efficient and accurate.
   * The specific repo for this library: https://github.com/aurelio-labs/semantic-chunkers?tab=readme-ov-file

* Semantic chunkers allow us to build LLM applications that are more "context aware". Instead of splitting at fixed sizes or separators, as recursive splitting does, they split where the meaning of the text shifts, so a chunk is less likely to mix unrelated topics or cut a topic in half.

In [4]:
## install
!pip install -qU \
semantic-chunkers==0.0.3 \
datasets==2.19.1  # huggingface datasets

## Load a Dataset to Experiment
* From huggingface: https://huggingface.co/datasets/jamescalam/ai-arxiv2
* This is a dataset of AI papers from arXiv.

In [5]:
# load huggingface dataset
from datasets import load_dataset

data = load_dataset("jamescalam/ai-arxiv2", split="train")
data

Dataset({
    features: ['id', 'title', 'summary', 'source', 'authors', 'categories', 'comment', 'journal_ref', 'primary_category', 'published', 'updated', 'content', 'references'],
    num_rows: 2673
})

In [6]:
## let's look at one of the AI arXiv papers
content = data[3]["content"]
print(content[:1000])

# Mamba: Linear-Time Sequence Modeling with Selective State Spaces
# Albert Gu*1 and Tri Dao*2
1Machine Learning Department, Carnegie Mellon University 2Department of Computer Science, Princeton University agu@cs.cmu.edu, tri@tridao.me
# Abstract
Foundation models, now powering most of the exciting applications in deep learning, are almost universally based on the Transformer architecture and its core attention module. Many subquadratic-time architectures such as linear attention, gated convolution and recurrent models, and structured state space models (SSMs) have been developed to address Transformers' computational inefficiency on long sequences, but they have not performed as well as attention on important modalities such as language. We identify that a key weakness of such models is their inability to perform content-based reasoning, and make several improvements. First, simply letting the SSM parameters be functions of the input addresses their weakness with discrete modalities

We keep only the first 20,000 characters of the paper so the experiments run faster and cost less.

In [7]:
content = content[:20_000]

# Semantic Chunking Experiments
* We will try various techniques for semantic chunking.
* Each semantic chunker requires an **encoder** to embed the text. We can use:
    * open source encoders: `HuggingFaceEncoder` or `FastembedEncoder`
    * proprietary encoders: `OpenAIEncoder` or `CohereEncoder`

* For this example let's use the `OpenAIEncoder` with the `text-embedding-3-small` embedding model.

In [8]:
## basic imports for openai
import os
from getpass import getpass
from semantic_router.encoders import OpenAIEncoder

## openai key
OPENAI_KEY = getpass('Enter your OpenAI key: ')

Enter your OpenAI key: ··········


In [9]:
## set the OpenAI key as an environment variable so the encoder can read it
os.environ['OPENAI_API_KEY'] = OPENAI_KEY

In [10]:
## instantiate encoder
encoder = OpenAIEncoder(name="text-embedding-3-small")

## Statistical Chunking
* This is the most "robust" of the chunking methods covered here.
* It uses a varying similarity threshold to find splits that adapt to local changes in similarity within the document.
* It offers a good balance between **accuracy and efficiency.**
* However, statistical chunking can ONLY be used for text documents (unlike the multi-modal `ConsecutiveChunker`).

### `Statistical Chunker` Overview
* Automatically identifies an appropriate threshold value during chunking, so it requires less customization than the other chunkers.
* Works **out of the box**: there is no need to pick parameters and experiment by hand.
* Cost effective and fast.

In [11]:
## instantiate statistical chunker
from semantic_chunkers import StatisticalChunker

## chunker
chunker = StatisticalChunker(encoder=encoder)

The chunker identifies a suitable similarity threshold for you based on the document being chunked.
* The threshold may differ from one document to the next, which is the point of using this chunker.
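
To make the idea of a data-driven threshold concrete, here is a minimal sketch in plain Python. It is not the library's algorithm: the similarity scores are made-up values, and the rule "one standard deviation below the mean" is an assumed heuristic chosen only for illustration.

```python
import statistics

# Hypothetical similarity scores between consecutive sentences in one document
# (made-up values, not real encoder output).
similarities = [0.82, 0.79, 0.31, 0.85, 0.80, 0.28, 0.77]

# Derive the threshold from the document itself instead of hard-coding it.
# Assumed heuristic: one standard deviation below the mean score.
threshold = statistics.mean(similarities) - statistics.pstdev(similarities)

# A split goes after sentence i when its similarity to sentence i + 1
# falls below the derived threshold.
split_after = [i for i, score in enumerate(similarities) if score < threshold]

print(f"threshold={threshold:.2f}, split after sentences {split_after}")
```

A different document produces a different score distribution and therefore a different threshold, which is the behaviour described above.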

In [12]:
## create chunks
chunks = chunker(docs=[content])

2024-07-13 18:56:40 INFO semantic_chunkers.utils.logger Single document exceeds the maximum token limit of 300. Splitting to sentences before semantically merging.


In [13]:
## print result
chunker.print(chunks[0])

Split 1, tokens 300, triggered by: token limit
# Mamba: Linear-Time Sequence Modeling with Selective State Spaces # Albert Gu*1 and Tri Dao*2 1Machine Learning Department, Carnegie Mellon University 2Department of Computer Science, Princeton University agu@cs.cmu.edu, tri@tridao.me # Abstract Foundation models, now powering most of the exciting applications in deep learning, are almost universally based on the Transformer architecture and its core attention module. Many subquadratic-time architectures such as linear attention, gated convolution and recurrent models, and structured state space models (SSMs) have been developed to address Transformers' computational inefficiency on long sequences, but they have not performed as well as attention on important modalities such as language. We identify that a key weakness of such models is their inability to perform content-based reasoning, and make several improvements. First, simply letting the SSM parameters be functions of the input

## Consecutive Chunking
* The simplest semantic chunking tool.
* How it works:
   * The chunker walks through the text and compares each sentence with the next one. Where the similarity falls below the `score_threshold`, it starts a new chunk.
* Like the statistical chunker, it is cost effective.
* However, it requires more hyperparameter tuning than the statistical chunker.
* A suitable `score_threshold` depends on the encoder: around 0.75 for older OpenAI embedding models, and lower for newer ones such as `text-embedding-3-small` (0.3 is used below).
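
The splitting rule described above can be sketched in a few lines of plain Python. This is a simplified illustration, not the library's implementation: the `cosine` and `consecutive_chunks` helpers are hypothetical, and the 2-D "embeddings" are hand-picked toy values rather than real encoder output.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def consecutive_chunks(sentences, embeddings, score_threshold):
    """Open a new chunk whenever similarity to the previous sentence drops below the threshold."""
    chunks = [[sentences[0]]]
    for i in range(1, len(sentences)):
        if cosine(embeddings[i - 1], embeddings[i]) < score_threshold:
            chunks.append([])  # similarity dropped: start a new chunk
        chunks[-1].append(sentences[i])
    return chunks


# Toy 2-D "embeddings" (hypothetical values, not real encoder output).
sentences = ["Cats purr.", "Cats nap a lot.", "GPUs run matrix math.", "GPUs need cooling."]
embeddings = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.2, 0.9)]

toy_chunks = consecutive_chunks(sentences, embeddings, score_threshold=0.5)
print(toy_chunks)
```

With these toy vectors the only low-similarity boundary is between the second and third sentences, so the text is split into a "cats" chunk and a "GPUs" chunk. Raising or lowering `score_threshold` changes how many boundaries qualify, which is why this chunker needs tuning per encoder.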

In [14]:
# import chunker
from semantic_chunkers import ConsecutiveChunker

# instantiate chunker
chunker = ConsecutiveChunker(encoder=encoder, score_threshold=0.3)

# create consecutive chunks
chunks = chunker(docs=[content])

In [15]:
# print results
chunker.print(chunks[0])

Split 1, tokens None, triggered by: 0.09
# Mamba:
----------------------------------------------------------------------------------------


Split 2, tokens None, triggered by: 0.10
Linear-Time Sequence Modeling with Selective State Spaces
----------------------------------------------------------------------------------------


Split 3, tokens None, triggered by: 0.25
# Albert Gu*1 and Tri Dao*2 1Machine Learning Department, Carnegie Mellon University 2Department of Computer Science, Princeton University agu@cs.cmu.edu, tri@tridao.me
----------------------------------------------------------------------------------------


Split 4, tokens None, triggered by: 0.22
# Abstract
----------------------------------------------------------------------------------------


Split 5, tokens None, triggered by: 0.30
Foundation models, now powering most of the exciting applications in deep learning, are almost universally based on the Transformer architectur

## Cumulative Chunker
* Compares the embedding of Sentences 1 and 2 combined to the embedding of Sentence 3.
* It keeps growing the cumulative chunk and checking its similarity to the next sentence, and starts a new chunk when the similarity drops below the threshold.
* Pros: more stable results.
* Cons: more compute intensive, so less cost effective.
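
The cumulative comparison can be sketched as follows. This is an illustration under stated assumptions rather than the library's code: `toy_encode` is a hypothetical stand-in for a real encoder (it just counts words from a tiny made-up vocabulary), and `cumulative_chunks` is a hypothetical helper.

```python
import math

VOCAB = ["cats", "purr", "nap", "gpus", "matrix", "cooling"]  # hypothetical toy vocabulary


def toy_encode(text):
    """Stand-in for a real encoder: counts how often each vocabulary word appears."""
    words = text.lower().replace(".", "").split()
    return [words.count(term) for term in VOCAB]


def cosine(a, b):
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cumulative_chunks(sentences, score_threshold):
    """Compare the whole chunk built so far with the next sentence."""
    chunks = [[sentences[0]]]
    for sentence in sentences[1:]:
        # Re-encode the entire running chunk on every step: this repeated
        # encoding is what makes the cumulative approach more expensive.
        running = toy_encode(" ".join(chunks[-1]))
        if cosine(running, toy_encode(sentence)) < score_threshold:
            chunks.append([sentence])
        else:
            chunks[-1].append(sentence)
    return chunks


sentences = ["Cats purr.", "Cats nap.", "GPUs run matrix math.", "GPUs need cooling."]
cumulative = cumulative_chunks(sentences, score_threshold=0.2)
print(cumulative)
```

Because each comparison uses the whole running chunk instead of a single neighbouring sentence, one odd sentence has less influence on the decision, at the price of an extra encoder call per step.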

In [16]:
from semantic_chunkers import CumulativeChunker

# create chunker
chunker = CumulativeChunker(encoder=encoder, score_threshold=0.2)

# create chunks
chunks = chunker(docs=[content])

In [17]:
## print chunks
chunker.print(chunks[0])

Split 1, tokens None, triggered by: 0.09
# Mamba:
----------------------------------------------------------------------------------------


Split 2, tokens None, triggered by: 0.10
Linear-Time Sequence Modeling with Selective State Spaces
----------------------------------------------------------------------------------------


Split 3, tokens None, triggered by: 0.19
# Albert Gu*1 and Tri Dao*2 1Machine Learning Department, Carnegie Mellon University 2Department of Computer Science, Princeton University agu@cs.cmu.edu, tri@tridao.me # Abstract Foundation models, now powering most of the exciting applications in deep learning, are almost universally based on the Transformer architecture and its core attention module. Many subquadratic-time architectures such as linear attention, gated convolution and recurrent models, and structured state space models (SSMs) have been developed to address Transformers' computational inefficiency on long sequences, but they have