 # Text Splitting Methods in NLP


Text splitting is an important preprocessing step in Natural Language Processing (NLP). This tutorial covers several text splitting methods and tools, and looks at their strengths, weaknesses, and suitable use cases.

The main approaches to text splitting:

1.  **Token-based Splitting**
    -   Tiktoken: OpenAI's high-performance BPE tokenizer
    -   Hugging Face tokenizers: Tokenizers for a wide range of pretrained models

2.  **Sentence-based Splitting**
    -   SentenceTransformers: Splits text while preserving semantic coherence
    -   NLTK: Sentence and word splitting based on natural language processing
    -   spaCy: Text splitting that uses advanced language processing capabilities

3.  **Language-specific Tools**
    -   KoNLPy: A specialized splitting tool for Korean text processing

Each tool has its own characteristics and strengths:

-   Tiktoken offers fast processing and compatibility with OpenAI models.
-   SentenceTransformers offers meaning-based sentence splitting.
-   NLTK and spaCy implement splitting based on linguistic rules.
-   KoNLPy specializes in morphological analysis and splitting of Korean text.

By the end of this tutorial, you will understand the characteristics of each tool and be able to choose the text splitting method best suited to your project.

```bash
pip install langchain_text_splitters tiktoken spacy sentence-transformers nltk konlpy
```

## Basic Usage of tiktoken

`tiktoken` is a fast BPE tokenizer created by OpenAI.

- Open the file `./data/appendix-keywords.txt` and read its contents.
- Store the contents in the `file` variable.

In [1]:
# Open the file data/appendix-keywords.txt and create a file object named f.
with open("./data/appendix-keywords.txt") as f:
    file = (
        f.read()
    )  # Read the contents of the file and store them in the file variable.

In [2]:
# Print a portion of the content read from the file.
print(file[:500])

Semantic Search

Definition: Semantic search is a search method that goes beyond simple keyword matching by understanding the meaning of the user’s query to return relevant results.
Example: If a user searches for “planets in the solar system,” the system might return information about related planets such as “Jupiter” or “Mars.”
Related Keywords: Natural Language Processing, Search Algorithms, Data Mining

Embedding

Definition: Embedding is the process of converting textual data, such as words


Use the `CharacterTextSplitter` to split the text.

- Initialize the text splitter using the `from_tiktoken_encoder` method, which is based on the Tiktoken encoder.

In [3]:
from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    # Set the chunk size to 300.
    chunk_size=300,
    # Set the overlap between adjacent chunks to 50 tokens.
    chunk_overlap=50,
)
# Split the file text into chunks.
texts = text_splitter.split_text(file)

In [4]:
print(len(texts))  # Output the number of divided chunks.

9


In [5]:
# Print the first element of the texts list.
print(texts[0])

Semantic Search

Definition: Semantic search is a search method that goes beyond simple keyword matching by understanding the meaning of the user’s query to return relevant results.
Example: If a user searches for “planets in the solar system,” the system might return information about related planets such as “Jupiter” or “Mars.”
Related Keywords: Natural Language Processing, Search Algorithms, Data Mining

Embedding

Definition: Embedding is the process of converting textual data, such as words or sentences, into low-dimensional continuous vectors. This allows computers to better understand and process the text.
Example: The word “apple” might be represented as a vector like [0.65, -0.23, 0.17].
Related Keywords: Natural Language Processing, Vectorization, Deep Learning

Token

Definition: A token refers to a smaller unit of text obtained by splitting a larger text. It can be a word, sentence, or phrase.
Example: The sentence “I go to school” can be split into tokens: “I”, “go”, “to”,

**Note**

-   When using `CharacterTextSplitter.from_tiktoken_encoder`, the text is split only by `CharacterTextSplitter`; the `tiktoken` tokenizer is used only to measure and merge the resulting pieces. (This means a chunk can exceed the chunk size as measured by the `tiktoken` tokenizer.)
-   When using `RecursiveCharacterTextSplitter.from_tiktoken_encoder`, chunks are guaranteed not to exceed the chunk size allowed by the language model. Any piece that is too large is split recursively. Alternatively, you can load a `tiktoken` splitter directly, which ensures that each split is smaller than the chunk size.
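
The difference described in the note can be illustrated without any library. The following is a minimal sketch, not LangChain's implementation: `count_tokens` is a toy stand-in for a real tokenizer (one token per whitespace-separated word), the function names are hypothetical, and the merging of small pieces back into larger chunks is left out for brevity.

```python
from typing import List


def count_tokens(text: str) -> int:
    """Toy stand-in for a real tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def split_single_separator(text: str, separator: str = "\n\n") -> List[str]:
    """Split on one separator only; oversized pieces are never broken up further."""
    return [piece for piece in text.split(separator) if piece.strip()]


def split_recursive(text: str, separators: List[str], max_tokens: int) -> List[str]:
    """Split on the first separator, then retry oversized pieces with finer ones."""
    if count_tokens(text) <= max_tokens:
        return [text]
    if not separators:
        # No separator left: fall back to a hard cut on word boundaries.
        words = text.split()
        return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]
    head, *rest = separators
    chunks: List[str] = []
    for piece in text.split(head):
        if piece.strip():
            chunks.extend(split_recursive(piece, rest, max_tokens))
    return chunks


sample = "short paragraph\n\n" + "one very long paragraph. " * 5
single = split_single_separator(sample)
recursive = split_recursive(sample, ["\n\n", ". "], max_tokens=6)

print([count_tokens(c) for c in single])     # one piece is far above the 6-token budget
print([count_tokens(c) for c in recursive])  # every piece fits the budget
```

With a single separator, the long paragraph stays in one piece no matter how many tokens it contains. The recursive variant keeps falling back to finer separators until each piece fits.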


## Basic Usage of TokenTextSplitter

Use the `TokenTextSplitter` class to split the text into token-based chunks.

In [6]:
from langchain_text_splitters import TokenTextSplitter

text_splitter = TokenTextSplitter(
    chunk_size=200,  # Set the chunk size to 200 tokens.
    chunk_overlap=50,  # Set the overlap between chunks to 50 tokens.
)

# Split the file text into chunks.
texts = text_splitter.split_text(file)
print(texts[0])  # Print the first chunk of the divided text.

Semantic Search

Definition: Semantic search is a search method that goes beyond simple keyword matching by understanding the meaning of the user’s query to return relevant results.
Example: If a user searches for “planets in the solar system,” the system might return information about related planets such as “Jupiter” or “Mars.”
Related Keywords: Natural Language Processing, Search Algorithms, Data Mining

Embedding

Definition: Embedding is the process of converting textual data, such as words or sentences, into low-dimensional continuous vectors. This allows computers to better understand and process the text.
Example: The word “apple” might be represented as a vector like [0.65, -0.23, 0.17].
Related Keywords: Natural Language Processing, Vectorization, Deep Learning

Token

Definition: A token refers to a smaller unit of text obtained
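
The windowing behind token-based chunking can be sketched in a few lines. This is an illustration under stated assumptions rather than the library's code: tokens are plain whitespace-separated words instead of tokenizer ids, and `window_tokens` is a hypothetical helper name.

```python
from typing import List


def window_tokens(tokens: List[str], chunk_size: int, chunk_overlap: int) -> List[List[str]]:
    """Slide a fixed-size window over the tokens, stepping by chunk_size - chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks: List[List[str]] = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # The window already reached the end of the token list.
        start += step
    return chunks


tokens = "a b c d e f g h i j".split()
windows = window_tokens(tokens, chunk_size=4, chunk_overlap=1)
for window in windows:
    print(" ".join(window))
```

Each window repeats the last `chunk_overlap` tokens of the previous one, which is why a token-based chunk can end in the middle of a sentence, as in the output above.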


## Basic Usage of spaCy

spaCy is an open-source software library for advanced natural language processing, written in Python and Cython.

Another alternative to NLTK is to use the spaCy tokenizer.

1.  How the text is split: the text is split using the spaCy tokenizer.
2.  How the chunk size is measured: by number of characters.

In [7]:
!python -m spacy download en_core_web_sm --quiet

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [8]:
# Open the file data/appendix-keywords.txt and create a file object named f.
with open("./data/appendix-keywords.txt") as f:
    file = (
        f.read()
    )  # Read the contents of the file and store them in the file variable.

In [9]:
# Print a portion of the content read from the file.
print(file[:350])

Semantic Search

Definition: Semantic search is a search method that goes beyond simple keyword matching by understanding the meaning of the user’s query to return relevant results.
Example: If a user searches for “planets in the solar system,” the system might return information about related planets such as “Jupiter” or “Mars.”
Related Keywords: 


Create a text splitter using the `SpacyTextSplitter` class.


In [10]:
import warnings
from langchain_text_splitters import SpacyTextSplitter

# Ignore warning messages.
warnings.filterwarnings("ignore")

# Create the SpacyTextSplitter.
text_splitter = SpacyTextSplitter(
    chunk_size=200,  # Set the chunk size to 200.
    chunk_overlap=50,  # Set the overlap between chunks to 50.
)

Use the `split_text` method of the `text_splitter` object to split the `file` text.

In [11]:
# Split the file text using the text_splitter.
texts = text_splitter.split_text(file)
print(texts[0])  # Print the first element of the split text.

Created a chunk of size 373, which is longer than the specified 200
Created a chunk of size 315, which is longer than the specified 200
Created a chunk of size 226, which is longer than the specified 200
Created a chunk of size 263, which is longer than the specified 200
Created a chunk of size 289, which is longer than the specified 200
Created a chunk of size 212, which is longer than the specified 200
Created a chunk of size 206, which is longer than the specified 200
Created a chunk of size 258, which is longer than the specified 200
Created a chunk of size 207, which is longer than the specified 200
Created a chunk of size 210, which is longer than the specified 200
Created a chunk of size 221, which is longer than the specified 200
Created a chunk of size 209, which is longer than the specified 200
Created a chunk of size 258, which is longer than the specified 200


Semantic Search

Definition: Semantic search is a search method that goes beyond simple keyword matching by understanding the meaning of the user’s query to return relevant results.
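
The "longer than the specified 200" messages above appear when a single sentence exceeds `chunk_size` on its own. The sketch below shows why, assuming a naive regex sentence splitter as a stand-in for spaCy's segmentation; the function names are hypothetical and chunk overlap is omitted.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naive regex sentence splitter; a stand-in for a real sentence segmenter."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def merge_by_characters(sentences: List[str], chunk_size: int, joiner: str = " ") -> List[str]:
    """Greedily pack whole sentences into chunks of at most chunk_size characters."""
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = sentence if not current else current + joiner + sentence
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # A sentence longer than chunk_size is kept whole, so its chunk is oversized.
        current = sentence
    if current:
        chunks.append(current)
    return chunks


text = (
    "Short one. Another short one. "
    "This sentence is deliberately much longer than the configured chunk size. "
    "Tail."
)
chunks = merge_by_characters(split_sentences(text), chunk_size=40)
for chunk in chunks:
    flag = " (oversized)" if len(chunk) > 40 else ""
    print(f"{len(chunk)}{flag}: {chunk}")
```

Because sentences are never cut, `chunk_size` acts as a target rather than a hard limit when the size is measured in characters.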


## Basic Usage of SentenceTransformers

`SentenceTransformersTokenTextSplitter` is a text splitter specialized for `sentence-transformer` models.

By default, it splits text into chunks that fit the token window of the sentence-transformer model in use.


In [12]:
from langchain_text_splitters import SentenceTransformersTokenTextSplitter

# Create a sentence splitter and set the overlap between chunks to 50.
splitter = SentenceTransformersTokenTextSplitter(chunk_size=200, chunk_overlap=50)


In [13]:
# Open the data/appendix-keywords.txt file and create a file object named f.
with open("./data/appendix-keywords.txt") as f:
    file = f.read()  # Read the file content and store it in the variable file.

# Print a portion of the content read from the file.
print(file[:350])

Semantic Search

Definition: Semantic search is a search method that goes beyond simple keyword matching by understanding the meaning of the user’s query to return relevant results.
Example: If a user searches for “planets in the solar system,” the system might return information about related planets such as “Jupiter” or “Mars.”
Related Keywords: 


The following code counts the number of tokens in the text stored in the `file` variable, excluding the start and stop tokens, and prints the result.

In [14]:
count_start_and_stop_tokens = 2  # Set the number of start and stop tokens to 2.

# Subtract the count of start and stop tokens from the total number of tokens in the text.
text_token_count = splitter.count_tokens(text=file) - count_start_and_stop_tokens
print(text_token_count)  # Print the calculated number of tokens in the text.

2121
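
The subtraction of 2 reflects that many transformer tokenizers wrap each encoded sequence in one start token and one stop token. Here is a minimal sketch with a toy encoder; the `<start>` and `<stop>` markers and the whitespace tokenization are illustrative assumptions, not the splitter's actual tokenizer.

```python
from typing import List

START_TOKEN = "<start>"  # Hypothetical marker for the beginning of a sequence.
STOP_TOKEN = "<stop>"    # Hypothetical marker for the end of a sequence.


def encode(text: str) -> List[str]:
    """Toy encoder: lowercase words wrapped in one start and one stop token."""
    return [START_TOKEN, *text.lower().split(), STOP_TOKEN]


def count_tokens(text: str) -> int:
    """Count every token the encoder produces, including the two special tokens."""
    return len(encode(text))


def count_content_tokens(text: str) -> int:
    """Count only the tokens that come from the text itself."""
    return count_tokens(text) - 2


sample = "Semantic search returns relevant results"
print(count_tokens(sample))          # includes the start and stop tokens
print(count_content_tokens(sample))  # text tokens only
```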


Use the `splitter.split_text()` method to split the text stored in the `file` variable into chunks.

In [15]:
text_chunks = splitter.split_text(text=file)  # Split the text into chunks.

Print the second chunk to inspect the result.

In [16]:
# Print the second chunk (index 1) of the split text.
print(text_chunks[1])

used for tasks like retrieval, classification, and other data analysis. example : word embedding vectors can be stored in a database for quick access. related keywords : embedding, database, vectorization sql definition : sql ( structured query language ) is a programming language for managing data in databases. it supports operations like querying, modifying, inserting, and deleting data. example : select * from users where age > 18 ; retrieves information about users older than 18. related keywords : database, query, data management csv definition : csv ( comma - separated values ) is a file format for storing data where each value is separated by a comma. it is often used for simple data storage and exchange in tabular form. example : a csv file with headers “ name, age, job ” might contain data like “ john doe, 30, developer ”. related keywords : file format, data handling, data exchange json definition : json ( javascript object notation ) is a lightweight data exchange format tha

## Basic Usage of NLTK

The Natural Language Toolkit (NLTK) is a suite of libraries and programs for English natural language processing (NLP), written in Python. It supports a range of NLP tasks such as text preprocessing, tokenization, morphological analysis, and part-of-speech tagging.

Instead of simply splitting on "\n\n", you can use NLTK to split text with the NLTK tokenizer.

1.  How the text is split: the text is split using the NLTK tokenizer.
2.  How the chunk size is measured: by number of characters.


In [17]:
import nltk

nltk.download("punkt_tab")

[nltk_data] Downloading package punkt_tab to /home/dino/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [19]:
# Open the data/appendix-keywords.txt file and create a file object named f.
with open("./data/appendix-keywords.txt") as f:
    file = (
        f.read()
    )  # Read the contents of the file and store them in the file variable.

# Print a portion of the content read from the file.
print(file[:350])

Semantic Search

Definition: Semantic search is a search method that goes beyond simple keyword matching by understanding the meaning of the user’s query to return relevant results.
Example: If a user searches for “planets in the solar system,” the system might return information about related planets such as “Jupiter” or “Mars.”
Related Keywords: 


- Create a text splitter using the `NLTKTextSplitter` class.
- Set the `chunk_size` parameter to 200 to target chunks of up to 200 characters, with a `chunk_overlap` of 50 characters.

In [20]:
from langchain_text_splitters import NLTKTextSplitter

text_splitter = NLTKTextSplitter(
    chunk_size=200,  # Set the chunk size to 200.
    chunk_overlap=50,  # Set the overlap between chunks to 50.
)

In [21]:
# Split the file text using the text_splitter.
texts = text_splitter.split_text(file)
print(texts[0])  # Print the first element of the split text.

Created a chunk of size 373, which is longer than the specified 200
Created a chunk of size 225, which is longer than the specified 200
Created a chunk of size 226, which is longer than the specified 200
Created a chunk of size 262, which is longer than the specified 200
Created a chunk of size 288, which is longer than the specified 200
Created a chunk of size 245, which is longer than the specified 200
Created a chunk of size 269, which is longer than the specified 200
Created a chunk of size 279, which is longer than the specified 200
Created a chunk of size 212, which is longer than the specified 200
Created a chunk of size 205, which is longer than the specified 200
Created a chunk of size 281, which is longer than the specified 200
Created a chunk of size 257, which is longer than the specified 200
Created a chunk of size 206, which is longer than the specified 200
Created a chunk of size 210, which is longer than the specified 200
Created a chunk of size 254, which is longer tha

Semantic Search

Definition: Semantic search is a search method that goes beyond simple keyword matching by understanding the meaning of the user’s query to return relevant results.


## Basic Usage of Hugging Face tokenizer

Hugging Face provides many different tokenizers.

This example shows how to measure the token length of a text with one of the Hugging Face tokenizers, `GPT2TokenizerFast`.

The text is split as follows:

-   The text is split at the character level.

The chunk size is measured as follows:

-   It is based on the number of tokens counted by the Hugging Face tokenizer.
-   A `tokenizer` object is created with the `GPT2TokenizerFast` class.
-   The `from_pretrained` method is called to load the pretrained "gpt2" tokenizer.
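
The combination of splitting at the character level and measuring size in tokens can be sketched as a greedy merge with a pluggable length function. This is an illustration under assumptions, not LangChain's implementation: `word_count` is a toy stand-in for a tokenizer-based length function, and `merge_pieces` is a hypothetical helper.

```python
from typing import Callable, List


def word_count(text: str) -> int:
    """Toy length function: whitespace-separated words stand in for tokens."""
    return len(text.split())


def merge_pieces(
    pieces: List[str],
    chunk_size: int,
    chunk_overlap: int,
    length_function: Callable[[str], int] = word_count,
    separator: str = "\n\n",
) -> List[str]:
    """Merge separator-split pieces into chunks whose measured length fits chunk_size."""
    chunks: List[str] = []
    window: List[str] = []
    for piece in pieces:
        if window and length_function(separator.join(window + [piece])) > chunk_size:
            chunks.append(separator.join(window))
            # Drop leading pieces until the carried-over tail fits the overlap budget
            # and leaves room for the incoming piece.
            while window and (
                length_function(separator.join(window)) > chunk_overlap
                or length_function(separator.join(window + [piece])) > chunk_size
            ):
                window.pop(0)
        window.append(piece)
    if window:
        chunks.append(separator.join(window))
    return chunks


text = "alpha beta\n\ngamma delta\n\nepsilon zeta\n\neta theta"
chunks = merge_pieces(text.split("\n\n"), chunk_size=4, chunk_overlap=2)
for chunk in chunks:
    print(repr(chunk))
```

Swapping `word_count` for a function that counts the tokens produced by a real tokenizer changes only how length is measured; the split points are still decided by the separator.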


In [22]:
from transformers import GPT2TokenizerFast

# Load the GPT-2 tokenizer.
hf_tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

In [23]:
# Open the data/appendix-keywords.txt file and create a file object named f.
with open("./data/appendix-keywords.txt") as f:
    file = f.read()  # Read the file content and store it in the variable file.

# Print a portion of the content read from the file.
print(file[:350])

Semantic Search

Definition: Semantic search is a search method that goes beyond simple keyword matching by understanding the meaning of the user’s query to return relevant results.
Example: If a user searches for “planets in the solar system,” the system might return information about related planets such as “Jupiter” or “Mars.”
Related Keywords: 


In [24]:
text_splitter = CharacterTextSplitter.from_huggingface_tokenizer(
    # Use the Hugging Face tokenizer to create a CharacterTextSplitter object.
    hf_tokenizer,
    chunk_size=300,
    chunk_overlap=50,
)
# Split the file text into chunks.
texts = text_splitter.split_text(file)

In [25]:
print(texts[1])  # Print the second element of the texts list.

Tokenizer

Definition: A tokenizer is a tool that splits text data into tokens. It is commonly used in natural language processing for data preprocessing.
Example: The sentence “I love programming.” can be tokenized into [“I”, “love”, “programming”, “.”].
Related Keywords: Tokenization, Natural Language Processing, Parsing

VectorStore

Definition: A vector store is a system for storing data in vector form. It is used for tasks like retrieval, classification, and other data analysis.
Example: Word embedding vectors can be stored in a database for quick access.
Related Keywords: Embedding, Database, Vectorization

SQL

Definition: SQL (Structured Query Language) is a programming language for managing data in databases. It supports operations like querying, modifying, inserting, and deleting data.
Example: SELECT * FROM users WHERE age > 18; retrieves information about users older than 18.
Related Keywords: Database, Query, Data Management

CSV
