# TokenTextSplitter

- Author: [Ilgyun Jeong](https://github.com/johnny9210)
- Peer Review: [Teddy Lee](https://github.com/teddylee777)
- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain-academy/blob/main/module-4/sub-graph.ipynb) [![Open in LangChain Academy](https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/66e9eba12c7b7688aa3dbb5e_LCA-badge-green.svg)](https://academy.langchain.com/courses/take/intro-to-langgraph/lessons/58239937-lesson-2-sub-graphs)

## Overview

Language models can only process a limited number of tokens (meaningful units of text) at once. This tutorial explores several methods for splitting text into manageable chunks based on tokenization, ensuring compatibility with these limits.

### Table of Contents

- [Overview](#overview)
- [Environment Setup](#environment-setup)
- [Using Tiktoken for Text Splitting](#using-tiktoken-for-text-splitting)
- [Example Usage of TokenTextSplitter](#example-usage-of-tokentextsplitter)
- [Example Usage of SentenceTransformers](#example-usage-of-sentencetransformers)
- [Splitting Text with NLTK](#splitting-text-with-nltk)
- [Splitting Text with spaCy](#splitting-text-with-spacy)
- [Using KoNLPy for Korean NLP](#using-konlpy-for-korean-nlp)
- [Basic Usage of Hugging Face tokenizers](#basic-usage-of-hugging-face-tokenizers)

### References

- [LangChain: How to split text by tokens](https://python.langchain.com/docs/how_to/split_by_token/)
- [LangChain TokenTextSplitter](https://python.langchain.com/api_reference/text_splitters/base/langchain_text_splitters.base.TokenTextSplitter.html)
----

## Environment Setup

Setting up your environment is the first step. See the [Environment Setup](https://wikidocs.net/257836) guide for more details.

**[Note]**
- `langchain-opentutorial` is a package that bundles easy-to-use environment setup helpers, functions, and utilities for the tutorials.
- See the [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) repository for more details.

In [1]:
%%capture --no-stderr
%pip install langchain-opentutorial

In [6]:
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langsmith",
        "langchain",
        "langchain_text_splitters",
        "tiktoken",
        "spacy",
        "sentence-transformers",
        "nltk",
        "konlpy",
    ],
    verbose=False,
)


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m24.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [7]:
# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "OPENAI_API_KEY": "",
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "TokenTextSplitter",
    }
)

Environment variables have been set successfully.


Alternatively, you can set and load `OPENAI_API_KEY` from a `.env` file. 

**[Note]** This is only necessary if you haven't already set `OPENAI_API_KEY` in previous steps.

In [9]:
from dotenv import load_dotenv

load_dotenv(override=True)

True

## Using Tiktoken for Text Splitting

`tiktoken` is a fast BPE (Byte Pair Encoding) tokenizer developed by OpenAI. Here's an example demonstrating its use with a text splitter:

1. Open the text file `appendix-keywords.txt` and read its contents. Store this text in a variable named `file`.

In [10]:
# Open the file data/appendix-keywords.txt and create a file object named f.
with open("./data/appendix-keywords.txt") as f:
    file = (
        f.read()
    )  # Read the contents of the file and store them in the file variable.

2. Display some of the content read from the `file`.

In [11]:
# Print a portion of the content read from the file.
print(file[:500])

Semantic Search

Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.
Example: Vectors of word embeddings can be stored in a database for quick access.
Related keywords: embedding, database, vectorization, vectorization

Embedding

Definition: Embedding is the process of converting textual data, such as words or sentences, into a low-dimensional, continuous vector. This allows computers to unders


Using `CharacterTextSplitter` with `tiktoken`:

1. Initialize a text splitter using the `from_tiktoken_encoder` method. This method leverages the `tiktoken` encoder for measurement and merging.

In [12]:
from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    # Set the chunk size to 300.
    chunk_size=300,
    # Ensure there is no overlap between chunks.
    chunk_overlap=0,
)
# Split the file text into chunks.
texts = text_splitter.split_text(file)

2. Print the number of resulting text chunks after splitting.

In [13]:
print(len(texts))  # Output the number of divided chunks.

10


3. Print the first element of the `texts` list, which holds the split chunks.

In [14]:
# Print the first element of the texts list.
print(texts[0])

Semantic Search

Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.
Example: Vectors of word embeddings can be stored in a database for quick access.
Related keywords: embedding, database, vectorization, vectorization

Embedding

Definition: Embedding is the process of converting textual data, such as words or sentences, into a low-dimensional, continuous vector. This allows computers to understand and process the text.
Example: Represent the word “apple” as a vector such as [0.65, -0.23, 0.17].
Related keywords: natural language processing, vectorization, deep learning

Token

Definition: A token is a breakup of text into smaller units. These can typically be words, sentences, or phrases.
Example: Split the sentence “I am going to school” into “I am”, “to school”, and “going”.
Associated keywords: tokenization, natural language processing, parsing

Tokenizer


**Note**
- `CharacterTextSplitter.from_tiktoken_encoder` splits the text with `CharacterTextSplitter` and uses the `tiktoken` tokenizer only to measure and merge the resulting pieces. A piece that is already larger than `chunk_size` is kept whole, so some chunks can exceed the token size intended for the language model.
- For stricter control, consider `RecursiveCharacterTextSplitter.from_tiktoken_encoder`, which recursively divides any split that exceeds the size, or load `TokenTextSplitter`, which works with `tiktoken` tokens directly.
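
The split-then-merge behavior described in the note can be illustrated with a minimal, standard-library sketch. It is not LangChain's implementation: `count_tokens` is a hypothetical whitespace-based stand-in for the `tiktoken` length function, and `split_then_merge` is an illustrative helper name. The sketch shows how a single piece that is larger than `chunk_size` passes through unchanged.

```python
def count_tokens(text: str) -> int:
    """Hypothetical stand-in for a tokenizer-based length function."""
    return len(text.split())


def split_then_merge(text: str, chunk_size: int, separator: str = "\n\n") -> list[str]:
    """Split on the separator, then greedily merge pieces up to chunk_size tokens."""
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    for piece in pieces:
        candidate = separator.join(current + [piece])
        # Flush the current chunk when adding the next piece would overflow it.
        if current and count_tokens(candidate) > chunk_size:
            chunks.append(separator.join(current))
            current = []
        # A piece is never cut, even if it is larger than chunk_size on its own.
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


sample = "one two\n\nthree four\n\nfive six seven eight nine ten\n\neleven"
chunks = split_then_merge(sample, chunk_size=4)
sizes = [count_tokens(c) for c in chunks]
print(sizes)  # The middle chunk holds 6 tokens although chunk_size is 4.
```

Because the only cut points are the separators, the limit is a merge target rather than a hard cap in this sketch.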

## Example Usage of TokenTextSplitter

This section will cover using the `TokenTextSplitter` class to split text into chunks based on tokens.

In [15]:
from langchain_text_splitters import TokenTextSplitter

text_splitter = TokenTextSplitter(
    chunk_size=200,  # Set the chunk size to 200 tokens.
    chunk_overlap=0,  # Set the overlap between chunks to 0.
)

# Split the file text into chunks.
texts = text_splitter.split_text(file)
print(texts[0])  # Print the first chunk of the divided text.

Semantic Search

Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.
Example: Vectors of word embeddings can be stored in a database for quick access.
Related keywords: embedding, database, vectorization, vectorization

Embedding

Definition: Embedding is the process of converting textual data, such as words or sentences, into a low-dimensional, continuous vector. This allows computers to understand and process the text.
Example: Represent the word “apple” as a vector such as [0.65, -0.23, 0.17].
Related keywords: natural language processing, vectorization, deep learning

Token

Definition: A token is a breakup of text into smaller units. These can typically be words, sentences, or phrases.
Example: Split the sentence “I am going to school
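
Note that the chunk above ends in the middle of a sentence. A token-based splitter cuts the token sequence itself, so no chunk exceeds `chunk_size`, but cut points ignore sentence boundaries. The following is a minimal sketch of that windowing idea, assuming whitespace-separated words as toy tokens in place of `tiktoken`'s BPE token IDs; `split_by_tokens` is a hypothetical helper, not the library's code.

```python
def split_by_tokens(
    tokens: list[str], chunk_size: int, chunk_overlap: int
) -> list[list[str]]:
    """Slice a token sequence into windows of at most chunk_size tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # How far each new window advances.
    windows: list[list[str]] = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start : start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # The last window already reached the end of the sequence.
    return windows


tokens = "a b c d e f g h i j".split()
windows = split_by_tokens(tokens, chunk_size=4, chunk_overlap=1)
for window in windows:
    print(window)
```

With `chunk_overlap=1`, each window starts with the last token of the previous one, which is how overlap preserves a little context across chunk boundaries.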


## Example Usage of SentenceTransformers

`SentenceTransformersTokenTextSplitter` is a specialized splitter designed for `sentence-transformer` models. It automatically splits text into chunks that fit within the token window of the sentence-transformer model being used.

Steps:
1. Create a `SentenceTransformersTokenTextSplitter` with no overlap between chunks.

In [21]:
from langchain_text_splitters import SentenceTransformersTokenTextSplitter

# Create a sentence splitter and set the overlap between chunks to 0.
splitter = SentenceTransformersTokenTextSplitter(chunk_size=200, chunk_overlap=0)

2. Inspect the sample text.

In [22]:
# Open the data/appendix-keywords.txt file and create a file object named f.
with open("./data/appendix-keywords.txt") as f:
    file = f.read()  # Read the file content and store it in the variable file.

# Print a portion of the content read from the file.
print(file[:350])

Semantic Search

Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.
Example: Vectors of word embeddings can be stored in a database for quick access.
Related keywords: embedding, database, vectorization, vectorization

Embedding

Definition: Embed


3. Calculate the number of tokens (excluding start and stop tokens) in the `file` variable, and print the result.

In [23]:
count_start_and_stop_tokens = 2  # Set the number of start and stop tokens to 2.

# Subtract the count of start and stop tokens from the total number of tokens in the text.
text_token_count = splitter.count_tokens(text=file) - count_start_and_stop_tokens
print(text_token_count)  # Print the calculated number of tokens in the text.

2231
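
The subtraction of two tokens reflects that many sentence-transformer tokenizers wrap each input in a start marker and a stop marker, which count toward the total but carry no text. The toy sketch below illustrates the arithmetic only; the `[CLS]` and `[SEP]` markers and the whitespace tokenizer are assumptions for illustration, and `encode_with_special_tokens` is a hypothetical helper.

```python
CLS, SEP = "[CLS]", "[SEP]"  # Illustrative start and stop markers.


def encode_with_special_tokens(text: str) -> list[str]:
    """Toy encoder: whitespace tokens wrapped in a start and a stop marker."""
    return [CLS, *text.split(), SEP]


encoded = encode_with_special_tokens("vectors store embeddings for search")
special_token_count = 2  # One start marker plus one stop marker.
content_token_count = len(encoded) - special_token_count
print(len(encoded), content_token_count)
```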


4. Use the `splitter.split_text()` method to split the text stored in `file` into chunks.

In [24]:
text_chunks = splitter.split_text(text=file)  # Split the text into chunks.

5. Print the second chunk using `print(text_chunks[1])`.

In [25]:
# Print the chunk at index 1.
print(text_chunks[1])  # Print the second chunk from the divided text chunks.

##ete, and more data. example : select * from users where age > 18 ; looks up information about users who are 18 years old or older. associated keywords : database, query, data management, data management csv definition : csv ( comma - separated values ) is a file format for storing data, where each data value is separated by a comma. it is used for simple storage and exchange of tabular data. example : a csv file with the headers name, age, and occupation might contain data such as hong gil - dong, 30, developer. related keywords : data format, file processing, data exchange json definition : json ( javascript object notation ) is a lightweight data interchange format that represents data objects using text that is readable to both humans and machines. example : { “ name ” : “ honggildong ”, ‘ age ’ : 30, “ occupation ” : “ developer " } is data in json format. related keywords : data exchange, web development, apis transformer definition : transformers are a type of deep learning mod

## Splitting Text with NLTK

The Natural Language Toolkit (NLTK) is a Python library for natural language processing (NLP) tasks. It supports various NLP tasks like text preprocessing, tokenization, morphological analysis, and part-of-speech tagging.

Here's how to use NLTK tokenizers for text splitting, offering an alternative to splitting by newlines (`\n\n`).
- Splitting method: NLTK tokenizer
- The chunk size is determined by the number of characters.

Before using NLTK, you need to download the necessary data files.

In [26]:
import nltk

nltk.download("punkt_tab")

[nltk_data] Downloading package punkt_tab to /Users/teddy/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

Downloading `punkt_tab` enables NLTK to tokenize text into words or sentences for multiple languages, including English.

Steps:
1. Repeat the process of opening `appendix-keywords.txt`, reading its contents, and storing the text in the `file` variable.

In [27]:
# Open the data/appendix-keywords.txt file and create a file object named f.
with open("./data/appendix-keywords.txt") as f:
    file = (
        f.read()
    )  # Read the contents of the file and store them in the file variable.

# Print a portion of the content read from the file.
print(file[:350])

Semantic Search

Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.
Example: Vectors of word embeddings can be stored in a database for quick access.
Related keywords: embedding, database, vectorization, vectorization

Embedding

Definition: Embed


2. Create a text splitter using the `NLTKTextSplitter` class.
3. Set the `chunk_size` parameter to 200 (or any desired value) to control the maximum chunk size in characters.

In [28]:
from langchain_text_splitters import NLTKTextSplitter

text_splitter = NLTKTextSplitter(
    chunk_size=200,  # Set the chunk size to 200.
    chunk_overlap=0,  # Set the overlap between chunks to 0.
)

4. Utilize the `split_text` method of the `text_splitter` object to split the text stored in `file`.

In [29]:
# Split the file text using the text_splitter.
texts = text_splitter.split_text(file)
print(texts[0])  # Print the first element of the split text.

Created a chunk of size 215, which is longer than the specified 200
Created a chunk of size 240, which is longer than the specified 200
Created a chunk of size 225, which is longer than the specified 200
Created a chunk of size 211, which is longer than the specified 200
Created a chunk of size 231, which is longer than the specified 200
Created a chunk of size 222, which is longer than the specified 200
Created a chunk of size 203, which is longer than the specified 200
Created a chunk of size 280, which is longer than the specified 200
Created a chunk of size 230, which is longer than the specified 200
Created a chunk of size 213, which is longer than the specified 200
Created a chunk of size 219, which is longer than the specified 200
Created a chunk of size 213, which is longer than the specified 200
Created a chunk of size 214, which is longer than the specified 200
Created a chunk of size 203, which is longer than the specified 200
Created a chunk of size 211, which is longer tha

Semantic Search

Definition: A vector store is a system that stores data converted to vector format.

It is used for search, classification, and other data analysis tasks.
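
The "Created a chunk of size ..." lines in the output are warnings that a chunk exceeds `chunk_size`. A sentence-based splitter does not cut inside a sentence, so a sentence longer than the limit most likely ends up as an oversized chunk. The sketch below shows how to spot such sentences. It uses a naive regular expression as a stand-in for NLTK's sentence tokenizer, and `split_sentences` is a hypothetical helper.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


text = (
    "A vector store keeps data in vector form. "
    "It is used for search, classification, and other analysis tasks. "
    "Short one."
)
chunk_size = 50  # Maximum chunk size in characters for this sketch.

sentences = split_sentences(text)
# Sentences that cannot fit into any chunk without being cut.
oversized = [s for s in sentences if len(s) > chunk_size]
for sentence in oversized:
    print(len(sentence), sentence)
```

If such warnings matter for your use case, raising `chunk_size` or switching to a token-based splitter are two options to consider.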


## Splitting Text with spaCy

spaCy is an open-source library for advanced NLP, written in Python and Cython.

Like NLTK, spaCy also provides an alternative to basic newline splitting (`\n\n`).
- Splitting method: spaCy's tokenizer
- The chunk size is measured by the number of characters.

To split text with spaCy, you need to download the `en_core_web_sm` spaCy model for English.

In [16]:
!python -m spacy download en_core_web_sm --quiet


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m24.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


Steps:
1. Repeat the process of opening `appendix-keywords.txt`, reading its contents, and storing the text in the `file` variable.

In [None]:
# Open the file data/appendix-keywords.txt and create a file object named f.
with open("./data/appendix-keywords.txt") as f:
    file = (
        f.read()
    )  # Read the contents of the file and store them in the file variable.

# Print a portion of the content read from the file.
print(file[:350])

2. Create a text splitter using the `SpacyTextSplitter` class.
3. Set the `chunk_size` parameter to 200 (or any desired value) to control the maximum chunk size in characters.

In [19]:
import warnings
from langchain_text_splitters import SpacyTextSplitter

# Ignore  warning messages.
warnings.filterwarnings("ignore")

# Create the SpacyTextSplitter.
text_splitter = SpacyTextSplitter(
    chunk_size=200,  # Set the chunk size to 200.
    chunk_overlap=50,  # Set the overlap between chunks to 50.
)

4. Use the `split_text` method of the `text_splitter` object to split the `file` text.

In [20]:
# Split the file text using the text_splitter.
texts = text_splitter.split_text(file)
print(texts[0])  # Print the first element of the split text.

Created a chunk of size 215, which is longer than the specified 200
Created a chunk of size 241, which is longer than the specified 200
Created a chunk of size 225, which is longer than the specified 200
Created a chunk of size 211, which is longer than the specified 200
Created a chunk of size 231, which is longer than the specified 200
Created a chunk of size 230, which is longer than the specified 200
Created a chunk of size 219, which is longer than the specified 200
Created a chunk of size 214, which is longer than the specified 200
Created a chunk of size 215, which is longer than the specified 200
Created a chunk of size 203, which is longer than the specified 200
Created a chunk of size 211, which is longer than the specified 200
Created a chunk of size 218, which is longer than the specified 200
Created a chunk of size 230, which is longer than the specified 200


Semantic Search

Definition: A vector store is a system that stores data converted to vector format.

It is used for search, classification, and other data analysis tasks.
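
The splitter above sets `chunk_overlap=50`, which lets consecutive chunks share some trailing text. The following is a minimal sketch of sentence merging with overlap, measured in characters. It is an illustration under simplifying assumptions (sentences are already split, and pieces are joined with a single space), not the library's implementation; `merge_with_overlap` is a hypothetical helper.

```python
def merge_with_overlap(
    sentences: list[str], chunk_size: int, chunk_overlap: int, sep: str = " "
) -> list[str]:
    """Greedily merge sentences, carrying trailing sentences into the next chunk."""
    chunks: list[str] = []
    current: list[str] = []
    for sentence in sentences:
        if current and len(sep.join(current + [sentence])) > chunk_size:
            chunks.append(sep.join(current))
            # Drop leading sentences until the carry-over fits the overlap
            # budget and leaves room for the incoming sentence.
            while current and (
                len(sep.join(current)) > chunk_overlap
                or len(sep.join(current + [sentence])) > chunk_size
            ):
                current.pop(0)
        current.append(sentence)
    if current:
        chunks.append(sep.join(current))
    return chunks


sentences = ["aaaa aaaa.", "bbbb bbbb.", "cccc cccc.", "dddd dddd."]
chunks = merge_with_overlap(sentences, chunk_size=25, chunk_overlap=12)
for chunk in chunks:
    print(chunk)
```

Each chunk after the first begins with the last sentence of the previous chunk, because one 10-character sentence fits within the 12-character overlap budget.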


## Using KoNLPy for Korean NLP

As mentioned in [LangChain's how-to guide](https://python.langchain.com/docs/how_to/split_by_token/#konlpy), KoNLPy offers a dedicated text splitter for Korean text processing with useful features for morphological analysis, part-of-speech tagging, and syntactic parsing.

Steps:
1. Because this example processes Korean, load a Korean text file to split.

In [None]:
# Open the data/appendix-keywords-korean.txt file and create a file object named f.
with open("./data/appendix-keywords-korean.txt") as f:
    file = f.read()  # Read the file content and store it in the variable file.

# Print a portion of the content read from the file.
print(file[:350])

2. Create a text splitter using the `KonlpyTextSplitter` class.

In [31]:
from langchain_text_splitters import KonlpyTextSplitter

# Create a text splitter object using KonlpyTextSplitter.
text_splitter = KonlpyTextSplitter()

3. Use the `text_splitter` to split the `file` content into sentences.

In [None]:
texts = text_splitter.split_text(file)  # Split the file content into sentences.
print(texts[0])  # Print the first sentence from the divided text.

## Basic Usage of Hugging Face tokenizers

Hugging Face provides various tokenizers.

This tutorial demonstrates calculating the token length of a text using one of Hugging Face's tokenizers, `GPT2TokenizerFast`.
- Splitting method: `CharacterTextSplitter`, which splits on `\n\n` by default
- Length measurement: Hugging Face's `GPT2TokenizerFast`

**[Note]**
- The chunk size is based on the number of tokens calculated by the Hugging Face tokenizer.
- A `tokenizer` object is created using the `GPT2TokenizerFast` class.
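
Measuring length in tokens rather than characters changes what a given `chunk_size` means. The sketch below compares the two measurements on one sentence. The regex-based `toy_token_count` is a hypothetical stand-in for a real tokenizer such as `GPT2TokenizerFast`, whose actual counts will differ.

```python
import re


def toy_token_count(text: str) -> int:
    """Count word and punctuation tokens; a rough stand-in for a real tokenizer."""
    return len(re.findall(r"\w+|[^\w\s]", text))


text = "Definition: A tokenizer is a tool that splits text data into tokens."
char_length = len(text)  # Length as a character-based splitter would see it.
token_length = toy_token_count(text)  # Length as a token-based measure sees it.
print(char_length, token_length)
```

Because a token usually spans several characters, a limit of 300 tokens admits much longer chunks than a limit of 300 characters.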

Call the `from_pretrained` method to load the pre-trained `gpt2` tokenizer.

In [33]:
from transformers import GPT2TokenizerFast

# Load the GPT-2 tokenizer.
hf_tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

Steps:
1. Repeat the process of opening `appendix-keywords.txt`, reading its contents, and storing the text in the `file` variable.

In [34]:
# Open the data/appendix-keywords.txt file and create a file object named f.
with open("./data/appendix-keywords.txt") as f:
    file = f.read()  # Read the file content and store it in the variable file.

# Print a portion of the content read from the file.
print(file[:350])

Semantic Search

Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.
Example: Vectors of word embeddings can be stored in a database for quick access.
Related keywords: embedding, database, vectorization, vectorization

Embedding

Definition: Embed


2. Create a text splitter using `from_huggingface_tokenizer` method.

In [35]:
text_splitter = CharacterTextSplitter.from_huggingface_tokenizer(
    # Use the Hugging Face tokenizer to create a CharacterTextSplitter object.
    hf_tokenizer,
    chunk_size=300,
    chunk_overlap=50,
)
# Split the file text into chunks.
texts = text_splitter.split_text(file)

3. Check the split result of the second element.

In [36]:
print(texts[1])  # Print the second element of the texts list.

Tokenizer

Definition: A tokenizer is a tool that splits text data into tokens. It is used to preprocess data in natural language processing.
Example: Split the sentence “I love programming.” into [“I”, “love”, “programming”, “.”].
Associated keywords: tokenization, natural language processing, parsing

VectorStore

Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.
Example: Vectors of word embeddings can be stored in a database for quick access.
Related keywords: embedding, database, vectorization, vectorization

SQL

Definition: SQL(Structured Query Language) is a programming language for managing data in a database. You can query, modify, insert, delete, and more data.
Example: SELECT * FROM users WHERE age > 18; looks up information about users who are 18 years old or older.
Associated keywords: database, query, data management, data management

CSV
