# Character Text Splitter


Text splitting is an important step when processing documents with LangChain.

`CharacterTextSplitter` splits text into smaller chunks, which brings several key benefits:

-   **Token Limits:** Works around the limited context window size of LLMs.
-   **Search Optimization:** Enables more precise retrieval at the chunk level.
-   **Memory Efficiency:** Lets large documents be processed in manageable pieces.
-   **Context Preservation:** Maintains coherence across chunk boundaries through `chunk_overlap`.

This guide walks through a practical implementation of text splitting using the core methods `split_text()` and `create_documents()`, including advanced features such as metadata handling.

```bash
pip install langchain-text-splitters
```


## CharacterTextSplitter Example

Read the contents of the keywords file:
* Open `./data/appendix-keywords.txt` and read its contents.
* Store the contents in the `file` variable.

```python
with open("./data/appendix-keywords.txt", encoding="utf-8") as f:
    file = f.read()
```

```python
# Preview the first 500 characters of the file
print(file[:500])
```

```text
Semantic Search

Definition: Semantic search is a search method that goes beyond simple keyword matching by understanding the meaning of the user’s query to return relevant results.
Example: If a user searches for “planets in the solar system,” the system might return information about related planets such as “Jupiter” or “Mars.”
Related Keywords: Natural Language Processing, Search Algorithms, Data Mining

Embedding

Definition: Embedding is the process of converting textual data, such as words
```


Create `CharacterTextSplitter` with parameters:

**Parameters**

* `separator`: String to split text on (e.g., newlines, spaces, custom delimiters)
* `chunk_size`: Maximum size of each chunk, as measured by `length_function`
* `chunk_overlap`: Maximum overlap between consecutive chunks, in the same units as `chunk_size`
* `length_function`: Function that measures the length of given chunks
* `is_separator_regex`: Boolean indicating whether separator should be treated as a regex pattern
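
The difference that `is_separator_regex` makes can be illustrated with the standard `re` module alone. The sketch below is not the library's code: `split_on` is a hypothetical helper, and escaping the separator with `re.escape` is an assumption about how a literal match is typically implemented.

```python
import re


def split_on(text: str, separator: str, is_separator_regex: bool = False) -> list[str]:
    """Hypothetical helper: split `text` on `separator`, literal or regex."""
    # A literal separator is escaped so characters such as "." lose their regex meaning.
    pattern = separator if is_separator_regex else re.escape(separator)
    # Drop empty pieces produced by adjacent separators.
    return [piece for piece in re.split(pattern, text) if piece]


sample = "alpha.beta  gamma"

# Literal ".": the text is split only at the dot.
literal_pieces = split_on(sample, ".")

# Regex r"\s+": the text is split on any run of whitespace.
regex_pieces = split_on(sample, r"\s+", is_separator_regex=True)

print(literal_pieces)
print(regex_pieces)
```

Without escaping, a separator of `"."` would be read as "any character" and would split the text at every position, which is why the literal and regex modes need to be kept apart.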

```python
from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator=" ",             # Split wherever a space occurs in the text
    chunk_size=250,            # Each chunk holds at most 250 characters
    chunk_overlap=50,          # Consecutive chunks share up to 50 characters
    length_function=len,       # Measure chunk length in characters
    is_separator_regex=False,  # Treat the separator as a literal string, not a regex
)
```
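
To build intuition for how `separator`, `chunk_size`, `chunk_overlap`, and `length_function` interact, here is a simplified pure-Python sketch. It is an approximation written for illustration, not the library's implementation: `split_with_overlap` is a hypothetical function that splits on the separator and then greedily merges the pieces back into chunks.

```python
def split_with_overlap(text, separator=" ", chunk_size=250, chunk_overlap=50,
                       length_function=len):
    """Simplified sketch: split on `separator`, then greedily merge pieces."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    pieces = [p for p in text.split(separator) if p]
    chunks = []
    current = []  # pieces that make up the chunk being built

    for piece in pieces:
        candidate = separator.join(current + [piece])
        if current and length_function(candidate) > chunk_size:
            # The chunk is full: emit it, then keep only a short tail as overlap.
            chunks.append(separator.join(current))
            while current and (
                length_function(separator.join(current)) > chunk_overlap
                or length_function(separator.join(current + [piece])) > chunk_size
            ):
                current.pop(0)
        # A single piece longer than chunk_size is kept whole in this sketch.
        current.append(piece)

    if current:
        chunks.append(separator.join(current))
    return chunks


demo_text = "one two three four five six seven eight nine ten"
demo_chunks = split_with_overlap(demo_text, chunk_size=15, chunk_overlap=5)
for chunk in demo_chunks:
    print(len(chunk), repr(chunk))
```

Because the overlap is assembled from whole pieces, consecutive chunks share at most `chunk_overlap` characters rather than exactly that many.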

```python
# Create document objects from chunks and display the first one
chunks = text_splitter.create_documents([file])
print(chunks[0])
```

```text
page_content='Semantic Search

Definition: Semantic search is a search method that goes beyond simple keyword matching by understanding the meaning of the user’s query to return relevant results.
Example: If a user searches for “planets in the solar system,” the'
```


Demonstrate metadata handling during document creation:

* `create_documents` accepts both text data and metadata lists
* Each chunk inherits metadata from its source document

```python
# Define metadata for each document
metadatas = [
    {"document": 1},
    {"document": 2},
]

# Create documents with metadata
documents = text_splitter.create_documents(
    [file, file],  # List of texts to split
    metadatas=metadatas,  # Corresponding metadata
)

print(documents[0])  # Display first document with metadata
```

```text
page_content='Semantic Search

Definition: Semantic search is a search method that goes beyond simple keyword matching by understanding the meaning of the user’s query to return relevant results.
Example: If a user searches for “planets in the solar system,” the' metadata={'document': 1}
```
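
The inheritance of metadata shown above can be pictured with a small standalone sketch. `Doc` and `attach_metadata` are hypothetical names used only for illustration, and copying the metadata for each chunk is an assumption about sensible behaviour rather than a statement about the library's internals.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a document object: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def attach_metadata(chunks_per_text, metadatas):
    """Pair every chunk with a copy of the metadata of the text it came from."""
    docs = []
    for chunks, metadata in zip(chunks_per_text, metadatas):
        for chunk in chunks:
            # Copy so that editing one chunk's metadata does not affect its siblings.
            docs.append(Doc(page_content=chunk, metadata=copy.deepcopy(metadata)))
    return docs


docs = attach_metadata(
    [["first chunk", "second chunk"], ["third chunk"]],  # chunks of text 1 and text 2
    [{"document": 1}, {"document": 2}],                  # one metadata dict per text
)
for doc in docs:
    print(doc.metadata, doc.page_content)
```

The texts and the metadata list are matched by position, so the two lists should have the same length.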


Split text using the `split_text()` method.
* `text_splitter.split_text(file)[0]` returns the first chunk of the split text

```python
# Split the file text and return the first chunk
text_splitter.split_text(file)[0]
```

```text
'Semantic Search\n\nDefinition: Semantic search is a search method that goes beyond simple keyword matching by understanding the meaning of the user’s query to return relevant results.\nExample: If a user searches for “planets in the solar system,” the'
```
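
Since `split_text()` yields plain strings, the chunks are easy to inspect with ordinary Python. The sketch below measures chunk lengths and the text shared by consecutive chunks. `shared_overlap` is a hypothetical helper, and `sample_chunks` is hand-written stand-in data, not real splitter output; the result of `text_splitter.split_text(file)` could be passed in its place.

```python
def shared_overlap(left: str, right: str) -> str:
    """Return the longest suffix of `left` that is also a prefix of `right`."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return right[:size]
    return ""


# Hand-written stand-in for a list of chunks (assumption, for illustration only).
sample_chunks = ["one two three", "three four five", "five six seven"]

# Length of every chunk, to compare against chunk_size.
lengths = [len(chunk) for chunk in sample_chunks]

# Text shared by each pair of consecutive chunks, to compare against chunk_overlap.
overlaps = [
    shared_overlap(left, right)
    for left, right in zip(sample_chunks, sample_chunks[1:])
]

print(lengths)
print(overlaps)
```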