# SemanticChunker

## Overview

This tutorial dives into a Text Splitter that uses semantic similarity to split text.

LangChain's `SemanticChunker` splits documents by meaning rather than by length. Unlike traditional methods that split text at fixed intervals, the `SemanticChunker` analyzes the meaning of the content to create more logical divisions.

This approach relies on an **embedding model** (this tutorial uses Hugging Face embeddings), which converts pieces of text into numerical vectors so that their similarity can be measured. You can choose among three ways of setting the split threshold: percentile, standard deviation, or interquartile range.

Because the `SemanticChunker` splits at natural breaks in the content, each chunk preserves the flow and context of the original document. This tends to make the chunks more useful when working with large language models.

See [Greg Kamradt's notebook](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb) for background on this approach.


The method first breaks the text into individual sentences. Then it groups each sentence with its neighbors (e.g., 3 sentences per group), embeds the groups, and finally merges adjacent groups that are similar in the embedding space.
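The first two steps can be sketched in plain Python. This is a minimal illustration rather than the library's own code: the helper names `split_sentences` and `sentence_windows`, the regular expression, and the `buffer_size` of 1 are assumptions, chosen so that each group holds up to three sentences.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Split on whitespace that follows '.', '?' or '!' (a simple heuristic)."""
    return [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]


def sentence_windows(sentences: list[str], buffer_size: int = 1) -> list[str]:
    """Join each sentence with up to `buffer_size` neighbors on each side."""
    windows = []
    for i in range(len(sentences)):
        start = max(0, i - buffer_size)
        end = min(len(sentences), i + buffer_size + 1)
        windows.append(" ".join(sentences[start:end]))
    return windows


sentences = split_sentences("Cats purr. Dogs bark. Stocks fell today. Bonds rose.")
windows = sentence_windows(sentences)
for window in windows:
    print(window)
```

Each window would then be passed to the embedding model, so that a sentence is compared together with its immediate context rather than in isolation.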

### Table of Contents

- [Overview](#overview)
- [Environment Setup](#environment-setup)
- [Creating a Semantic Chunker](#creating-a-semanticchunker)
- [Text Splitting](#text-splitting)
- [Breakpoints](#breakpoints)

### References

- [Greg Kamradt's notebook](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb)
- [Greg Kamradt's video](https://youtu.be/8OJC21T2SL4?si=PzUtNGYJ_KULq3-w&t=2580)

----

Load the sample text and output its content.

In [7]:
# Open the appendix-keywords.txt file and bind the file object to f.
with open("/content/appendix-keywords.txt", encoding="utf-8") as f:
    file = f.read()  # Read the contents of the file and save it in the file variable.

# Print part of the content read from the file.
print(file[:350])

Semantic Search

Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.
Example: Vectors of word embeddings can be stored in a database for quick access.
Related keywords: embedding, database, vectorization, vectorization

Embedding

Definition: Embed


In [2]:
!pip install langchain==0.3.14
!pip install langchain-community==0.3.14
!pip install langchain-huggingface==0.1.2


## Creating a `SemanticChunker`

The `SemanticChunker` is an experimental LangChain feature that splits text into semantically similar chunks, which allows for more effective processing and analysis of text data.

Use the `SemanticChunker` to divide the text into semantically related chunks.

In [3]:
!pip install langchain_experimental



In [5]:
!pip install langchain transformers langchain_huggingface



In [4]:
from langchain_experimental.text_splitter import SemanticChunker
from langchain_huggingface import HuggingFaceEmbeddings


In [5]:
# Initialize a semantic chunk splitter using Hugging Face embeddings.
text_splitter = SemanticChunker(HuggingFaceEmbeddings())


In [1]:
!pip install --upgrade --force-reinstall numpy



In [14]:
!pip install --upgrade --force-reinstall numpy transformers langchain langchain_experimental langchain_huggingface


## Text Splitting

Use the `text_splitter` with your loaded file (`file`) to split the text into smaller, more manageable unit documents. This process is often referred to as chunking.

In [8]:
chunks = text_splitter.split_text(file)

After splitting, you can examine the resulting chunks to see how the text has been divided.

In [9]:
# Print the first chunk among the divided chunks.
print(chunks[0])

Semantic Search

Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks. Example: Vectors of word embeddings can be stored in a database for quick access. Related keywords: embedding, database, vectorization, vectorization

Embedding

Definition: Embedding is the process of converting textual data, such as words or sentences, into a low-dimensional, continuous vector. This allows computers to understand and process the text. Example: Represent the word “apple” as a vector such as [0.65, -0.23, 0.17]. Related keywords: natural language processing, vectorization, deep learning

Token

Definition: A token is a breakup of text into smaller units. These can typically be words, sentences, or phrases. Example: Split the sentence “I am going to school” into “I am”, “to school”, and “going”. Associated keywords: tokenization, natural language processing, parsing

Tokenizer

Definition: A tokenizer 

The `create_documents()` method splits each text in the input list (`[file]`) and returns the resulting chunks as document objects (`docs`).


In [10]:
# Split using text_splitter.
docs = text_splitter.create_documents([file])
# Print the content of the first document among the divided documents.
print(docs[0].page_content)

Semantic Search

Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks. Example: Vectors of word embeddings can be stored in a database for quick access. Related keywords: embedding, database, vectorization, vectorization

Embedding

Definition: Embedding is the process of converting textual data, such as words or sentences, into a low-dimensional, continuous vector. This allows computers to understand and process the text. Example: Represent the word “apple” as a vector such as [0.65, -0.23, 0.17]. Related keywords: natural language processing, vectorization, deep learning

Token

Definition: A token is a breakup of text into smaller units. These can typically be words, sentences, or phrases. Example: Split the sentence “I am going to school” into “I am”, “to school”, and “going”. Associated keywords: tokenization, natural language processing, parsing

Tokenizer

Definition: A tokenizer 

## Breakpoints

This chunking process works by identifying natural breaks between sentences.

Here's how it decides where to split the text:
1. It embeds each sentence and calculates the difference between the embeddings of each pair of adjacent sentences.
2. When the difference between two sentences exceeds a certain threshold (breakpoint), the `text_splitter` identifies this as a natural break and splits the text at that point.
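
The two steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the 2-dimensional vectors are made-up stand-ins for real embeddings, cosine distance is assumed as the difference measure, and the helper names and the threshold of 0.5 are hypothetical.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def split_at_breakpoints(sentences, embeddings, threshold):
    """Start a new chunk whenever adjacent embeddings differ by more than threshold."""
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine_distance(embeddings[i - 1], embeddings[i]) > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks


sentences = ["Cats purr.", "Dogs bark.", "Stocks fell today.", "Bonds rose."]
# Made-up embeddings: the first two point one way, the last two another.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0]]

chunks = split_at_breakpoints(sentences, embeddings, threshold=0.5)
print(chunks)
```

Only the jump between the second and third sentence exceeds the threshold, so the text is divided into two chunks at that point.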

Check out [Greg Kamradt's video](https://youtu.be/8OJC21T2SL4?si=PzUtNGYJ_KULq3-w&t=2580) for more details.



### Percentile-Based Splitting

This method sorts all embedding differences between sentences. Then, it splits the text wherever the difference exceeds the value at a specific percentile (e.g., the 70th percentile).
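
As a rough sketch of how a percentile turns into a threshold, the example below uses a made-up list of distances. The `percentile` helper is a hypothetical pure-Python stand-in that interpolates linearly between ranks; the list values and names are assumptions for illustration only.

```python
def percentile(values, q):
    """q-th percentile with linear interpolation between the closest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


# Made-up distances between adjacent sentences.
distances = [0.10, 0.12, 0.80, 0.15, 0.11, 0.75, 0.09]

threshold = percentile(distances, 70)
# Positions whose distance exceeds the threshold become breakpoints.
breakpoints = [i for i, d in enumerate(distances) if d > threshold]
print(threshold, breakpoints)
```

Raising the percentile raises the threshold, so fewer positions qualify as breakpoints and the chunks become larger.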

In [11]:
text_splitter = SemanticChunker(
    # Initialize the semantic chunker using a Hugging Face embedding model.
    HuggingFaceEmbeddings(),
    # Set the split breakpoint type to percentile
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=70,
)


Examine the resulting document list (`docs`).


In [12]:
docs = text_splitter.create_documents([file])
for i, doc in enumerate(docs[:5]):
    print(f"[Chunk {i}]", end="\n\n")
    print(doc.page_content)  # Print the content of each of the first five documents.
    print("===" * 20)

[Chunk 0]

Semantic Search

Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.
[Chunk 1]

Example: Vectors of word embeddings can be stored in a database for quick access. Related keywords: embedding, database, vectorization, vectorization

Embedding

Definition: Embedding is the process of converting textual data, such as words or sentences, into a low-dimensional, continuous vector. This allows computers to understand and process the text.
[Chunk 2]

Example: Represent the word “apple” as a vector such as [0.65, -0.23, 0.17]. Related keywords: natural language processing, vectorization, deep learning

Token

Definition: A token is a breakup of text into smaller units. These can typically be words, sentences, or phrases.
[Chunk 3]

Example: Split the sentence “I am going to school” into “I am”, “to school”, and “going”. Associated keywords: tokenization, natural language processing, 

Use `len(docs)` to get the number of chunks created.

In [13]:
print(len(docs))  # Print the length of docs.

27


### Standard Deviation Splitting

This method sets the threshold a specified number of standard deviations (`breakpoint_threshold_amount`) above the mean of the embedding differences.

To use standard deviation for your breakpoints, set the `breakpoint_threshold_type` parameter to `"standard_deviation"` when initializing the `text_splitter`.
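
The effect of `breakpoint_threshold_amount` can be seen in a small sketch. It assumes the threshold is the mean of the distances plus the given number of population standard deviations; the distances are made up and the helper name `std_threshold` is hypothetical.

```python
import statistics


def std_threshold(distances, amount):
    """Mean of the distances plus `amount` population standard deviations."""
    return statistics.fmean(distances) + amount * statistics.pstdev(distances)


# Made-up distances between adjacent sentences.
distances = [0.10, 0.12, 0.80, 0.15, 0.11, 0.75, 0.09]

for amount in (1.25, 1.6):
    threshold = std_threshold(distances, amount)
    breakpoints = [i for i, d in enumerate(distances) if d > threshold]
    print(amount, round(threshold, 3), breakpoints)
```

With these numbers, an amount of 1.25 keeps both large jumps as breakpoints, while 1.6 keeps only the largest one.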

In [14]:
text_splitter = SemanticChunker(
    # Initialize the semantic chunker using a Hugging Face embedding model.
    HuggingFaceEmbeddings(),
    # Use standard deviation as the splitting criterion.
    breakpoint_threshold_type="standard_deviation",
    breakpoint_threshold_amount=1.25,
)


After splitting, check the `docs` list and print its length (`len(docs)`) to see how many chunks were created.

In [15]:
# Split using text_splitter.
docs = text_splitter.create_documents([file])

In [16]:
docs = text_splitter.create_documents([file])
for i, doc in enumerate(docs[:5]):
    print(f"[Chunk {i}]", end="\n\n")
    print(doc.page_content)  # Print the content of each of the first five documents.
    print("===" * 20)

[Chunk 0]

Semantic Search

Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks. Example: Vectors of word embeddings can be stored in a database for quick access. Related keywords: embedding, database, vectorization, vectorization

Embedding

Definition: Embedding is the process of converting textual data, such as words or sentences, into a low-dimensional, continuous vector. This allows computers to understand and process the text.
[Chunk 1]

Example: Represent the word “apple” as a vector such as [0.65, -0.23, 0.17]. Related keywords: natural language processing, vectorization, deep learning

Token

Definition: A token is a breakup of text into smaller units. These can typically be words, sentences, or phrases. Example: Split the sentence “I am going to school” into “I am”, “to school”, and “going”. Associated keywords: tokenization, natural language processing, parsing

Tokenizer

De

In [18]:
print(len(docs))  # Print the length of docs.

16


### Interquartile Range Splitting

This method uses the interquartile range (IQR) of the embedding differences, scaled by `breakpoint_threshold_amount`, to set the threshold at which the text is split.

Set the `breakpoint_threshold_type` parameter to `"interquartile"` when initializing the `text_splitter` to use the IQR for splitting.
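
A small sketch of an IQR-based threshold follows. It assumes the threshold is the mean of the distances plus `breakpoint_threshold_amount` times the IQR; the distances are made up, and the `percentile` helper is a hypothetical pure-Python stand-in using linear interpolation.

```python
import statistics


def percentile(values, q):
    """q-th percentile with linear interpolation between the closest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


# Made-up distances between adjacent sentences.
distances = [0.10, 0.12, 0.80, 0.15, 0.11, 0.75, 0.09]

# IQR is the spread between the 25th and 75th percentiles.
iqr = percentile(distances, 75) - percentile(distances, 25)
threshold = statistics.fmean(distances) + 0.5 * iqr
breakpoints = [i for i, d in enumerate(distances) if d > threshold]
print(round(iqr, 3), round(threshold, 3), breakpoints)
```

Because the IQR ignores the extremes of the distribution, a few very large distances influence this threshold less than they influence one based on the standard deviation.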

In [19]:
text_splitter = SemanticChunker(
    # Initialize the semantic chunk splitter using a Hugging Face embedding model.
    HuggingFaceEmbeddings(),
    # Set the breakpoint threshold type to interquartile range.
    breakpoint_threshold_type="interquartile",
    breakpoint_threshold_amount=0.5,
)


In [20]:
# Split using text_splitter.
docs = text_splitter.create_documents([file])

# Print the results.
for i, doc in enumerate(docs[:5]):
    print(f"[Chunk {i}]", end="\n\n")
    print(doc.page_content)  # Print the content of each of the first five documents.
    print("===" * 20)

[Chunk 0]

Semantic Search

Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.
[Chunk 1]

Example: Vectors of word embeddings can be stored in a database for quick access. Related keywords: embedding, database, vectorization, vectorization

Embedding

Definition: Embedding is the process of converting textual data, such as words or sentences, into a low-dimensional, continuous vector. This allows computers to understand and process the text.
[Chunk 2]

Example: Represent the word “apple” as a vector such as [0.65, -0.23, 0.17]. Related keywords: natural language processing, vectorization, deep learning

Token

Definition: A token is a breakup of text into smaller units. These can typically be words, sentences, or phrases. Example: Split the sentence “I am going to school” into “I am”, “to school”, and “going”. Associated keywords: tokenization, natural language processing, parsing

To

Finally, print the length of the `docs` list (`len(docs)`) to view the number of chunks created.


In [21]:
print(len(docs)) # Print the length of docs.

22
