# Character Text Splitter

- Author: [hellohotkey](https://github.com/hellohotkey)
- Design:
- Peer Review : [fastjw](https://github.com/fastjw), [heewung song](https://github.com/kofsitho87)
- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/07-TextSplitter/01-CharacterTextSplitter.ipynb) [![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/07-TextSplitter/01-CharacterTextSplitter.ipynb)

## Overview

Text splitting is a crucial step in document processing with LangChain.

The `CharacterTextSplitter` splits text on a single separator and merges the resulting pieces into chunks of bounded size. Chunking text this way provides several key benefits:

- **Token Limits:** Overcomes LLM context window size restrictions
- **Search Optimization:** Enables more precise chunk-level retrieval
- **Memory Efficiency:** Processes large documents effectively
- **Context Preservation:** Maintains textual coherence through `chunk_overlap`

This tutorial walks through a practical implementation of text splitting with the core methods `split_text()` and `create_documents()`, including metadata handling.

### Table of Contents

- [Overview](#overview)
- [Environment Setup](#environment-setup)
- [CharacterTextSplitter Example](#charactertextsplitter-example)


### References

- [LangChain TextSplitter](https://python.langchain.com/api_reference/text_splitters/base/langchain_text_splitters.base.TextSplitter.html)
- [LangChain CharacterTextSplitter](https://python.langchain.com/api_reference/text_splitters/character/langchain_text_splitters.character.CharacterTextSplitter.html)
----

## Environment Setup

Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.

**[Note]**
- `langchain-opentutorial` is a package that provides easy-to-use environment setup helpers, functions, and utilities for the tutorials.
- You can check out [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details.

In [1]:
%%capture --no-stderr
%pip install langchain-opentutorial

In [2]:
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langchain_text_splitters",
    ],
    verbose=False,
    upgrade=False,
)

## CharacterTextSplitter Example

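Read and store the contents of the keywords file:
* Open the `appendix-keywords.txt` file and read its contents. The cell below uses the Colab path `/content/appendix-keywords.txt`; when running elsewhere, adjust the path (for example, to `./data/appendix-keywords.txt`).
* Store the contents in the `file` variable.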

In [3]:
with open("/content/appendix-keywords.txt", encoding="utf-8") as f:
    file = f.read()

Print the first 500 characters of the file contents.

In [4]:
print(file[:500])

Semantic Search

Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.
Example: Vectors of word embeddings can be stored in a database for quick access.
Related keywords: embedding, database, vectorization, vectorization

Embedding

Definition: Embedding is the process of converting textual data, such as words or sentences, into a low-dimensional, continuous vector. This allows computers to unders


Create `CharacterTextSplitter` with parameters:

**Parameters**

* `separator`: String to split text on (e.g., newlines, spaces, custom delimiters)
* `chunk_size`: Maximum size of chunks to return
* `chunk_overlap`: Overlap in characters between chunks
* `length_function`: Function that measures the length of given chunks
* `is_separator_regex`: Boolean indicating whether separator should be treated as a regex pattern

In [5]:
from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator=" ",             # Split wherever a space occurs in the text
    chunk_size=250,            # Each chunk holds at most 250 characters
    chunk_overlap=50,          # Consecutive chunks share up to 50 characters
    length_function=len,       # Measure chunk length in characters
    is_separator_regex=False,  # Treat the separator as a literal string, not a regex
)
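
To make the interaction between `separator`, `chunk_size`, and `chunk_overlap` more concrete, here is a simplified pure-Python sketch of the split-then-merge idea. It is an illustration only, not the library's implementation: the function name `split_and_merge` and the sample sentence are made up for this example, and details such as whitespace stripping are left out.

```python
def split_and_merge(
    text: str, separator: str, chunk_size: int, chunk_overlap: int
) -> list[str]:
    """Split `text` on `separator`, then greedily re-join pieces into chunks."""
    pieces = [p for p in text.split(separator) if p]
    chunks: list[str] = []
    current: list[str] = []  # pieces of the chunk currently being built

    def joined_len(parts: list[str]) -> int:
        return len(separator.join(parts))

    for piece in pieces:
        if current and joined_len(current + [piece]) > chunk_size:
            # The next piece does not fit: emit the current chunk.
            chunks.append(separator.join(current))
            # Carry trailing pieces over as overlap: drop pieces from the
            # front until what is left fits within chunk_overlap and still
            # leaves room for the new piece.
            while current and (
                joined_len(current) > chunk_overlap
                or joined_len(current + [piece]) > chunk_size
            ):
                current.pop(0)
        current.append(piece)

    if current:
        chunks.append(separator.join(current))
    return chunks


sample = "one two three four five six seven eight nine ten"
for chunk in split_and_merge(sample, separator=" ", chunk_size=20, chunk_overlap=8):
    print(len(chunk), repr(chunk))
```

In this sketch the chunks are `'one two three four'`, `'four five six seven'`, and `'seven eight nine ten'`. Consecutive chunks share only `four` and `seven`: the overlap is made of whole pieces, so it can be shorter than the 8-character budget.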

Split the text with the `split_text()` method, which returns a list of strings.
* `text_splitter.split_text(file)[0]` is the first chunk; the loop below prints every chunk together with its length.

In [16]:
res = text_splitter.split_text(file)
for i, chunk in enumerate(res):
    print(f"{i}:{chunk}")
    print(len(chunk))
    print("*" * 10)

0:Semantic Search

Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.
Example: Vectors of word embeddings can be stored in a database for quick
244
**********
1:embeddings can be stored in a database for quick access.
Related keywords: embedding, database, vectorization, vectorization

Embedding

Definition: Embedding is the process of converting textual data, such as words or sentences, into a
236
**********
2:textual data, such as words or sentences, into a low-dimensional, continuous vector. This allows computers to understand and process the text.
Example: Represent the word “apple” as a vector such as [0.65, -0.23, 0.17].
Related keywords: natural
245
**********
3:as [0.65, -0.23, 0.17].
Related keywords: natural language processing, vectorization, deep learning

Token

Definition: A token is a breakup of text into smaller units. These can typically be words, sentences, or phrase

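Chunks 0 and 1 share the text `embeddings can be stored in a database for quick`. Measuring it shows that the overlap is built from whole space-separated pieces, so it can be slightly shorter than `chunk_overlap=50`.

In [18]: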
s = "embeddings can be stored in a database for quick"
print(len(s))

48
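
Rather than copying the shared text by hand, the overlap between two consecutive chunks can be computed. The sketch below uses a hypothetical helper, `shared_overlap`, together with the end of chunk 0 and the start of chunk 1 copied from the output above.

```python
def shared_overlap(left: str, right: str) -> str:
    """Return the longest suffix of `left` that is also a prefix of `right`."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return right[:size]
    return ""


# End of chunk 0 and start of chunk 1, copied from the output above
chunk_0_tail = "Example: Vectors of word embeddings can be stored in a database for quick"
chunk_1_head = "embeddings can be stored in a database for quick access."

overlap = shared_overlap(chunk_0_tail, chunk_1_head)
print(len(overlap), repr(overlap))
# 48 'embeddings can be stored in a database for quick'
```

The same helper could be applied to each pair `res[i]`, `res[i + 1]` to confirm that no overlap exceeds `chunk_overlap`.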


Create `Document` objects from the chunks with `create_documents()` and display the first one.

In [19]:
chunks = text_splitter.create_documents([file])
print(chunks[0])

page_content='Semantic Search

Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.
Example: Vectors of word embeddings can be stored in a database for quick'


Demonstrate metadata handling during document creation:

* `create_documents` accepts a list of texts and an optional list of metadata dictionaries, matched to the texts by position
* Each chunk inherits the metadata of its source text

In [24]:
# Define metadata for each document
metadatas = [
    {"document": 1},
    {"document": 2},
]

# Create documents with metadata
documents = text_splitter.create_documents(
    [file],  # One text only, so the second metadata entry goes unused
    metadatas=metadatas,  # Metadata matched to the texts by position
)

print(documents[0])  # Display first document with metadata


page_content='Semantic Search

Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.
Example: Vectors of word embeddings can be stored in a database for quick' metadata={'document': 1}


In [25]:
documents

[Document(metadata={'document': 1}, page_content='Semantic Search\n\nDefinition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.\nExample: Vectors of word embeddings can be stored in a database for quick'),
 Document(metadata={'document': 1}, page_content='embeddings can be stored in a database for quick access.\nRelated keywords: embedding, database, vectorization, vectorization\n\nEmbedding\n\nDefinition: Embedding is the process of converting textual data, such as words or sentences, into a'),
 Document(metadata={'document': 1}, page_content='textual data, such as words or sentences, into a low-dimensional, continuous vector. This allows computers to understand and process the text.\nExample: Represent the word “apple” as a vector such as [0.65, -0.23, 0.17].\nRelated keywords: natural'),
 Document(metadata={'document': 1}, page_content='as [0.65, -0.23, 0.17].\nRelated keywords: natural la

Now pass two texts so that each metadata entry is paired with its own text:

In [26]:
# Define metadata for each document
metadatas = [
    {"document": 1},
    {"document": 2},
]

# Create documents with metadata
documents = text_splitter.create_documents(
    [file, file],  # Two texts, one per metadata entry
    metadatas=metadatas,  # Metadata matched to the texts by position
)

print(documents[0])  # Display first document with metadata


page_content='Semantic Search

Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.
Example: Vectors of word embeddings can be stored in a database for quick' metadata={'document': 1}


In [27]:
documents

[Document(metadata={'document': 1}, page_content='Semantic Search\n\nDefinition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.\nExample: Vectors of word embeddings can be stored in a database for quick'),
 Document(metadata={'document': 1}, page_content='embeddings can be stored in a database for quick access.\nRelated keywords: embedding, database, vectorization, vectorization\n\nEmbedding\n\nDefinition: Embedding is the process of converting textual data, such as words or sentences, into a'),
 Document(metadata={'document': 1}, page_content='textual data, such as words or sentences, into a low-dimensional, continuous vector. This allows computers to understand and process the text.\nExample: Represent the word “apple” as a vector such as [0.65, -0.23, 0.17].\nRelated keywords: natural'),
 Document(metadata={'document': 1}, page_content='as [0.65, -0.23, 0.17].\nRelated keywords: natural la