# SemanticChunker


This tutorial takes a close look at a text splitter that uses semantic similarity to split text.

LangChain's `SemanticChunker` is a text splitter that goes a step beyond conventional document chunking. Unlike traditional methods that split text at fixed intervals, `SemanticChunker` analyzes the meaning of the content to create more logical divisions.

This approach relies on **OpenAI's embedding model**, which converts pieces of text into numerical representations so that the similarity between them can be computed. The tool offers several splitting options to suit your needs: you can choose among methods based on percentiles, standard deviation, or the interquartile range.

What sets `SemanticChunker` apart is its ability to preserve context by identifying natural breakpoints, which can lead to better performance when working with large language models.

Because `SemanticChunker` takes the actual content into account, it produces more useful chunks and preserves the flow and context of the original document.

See [Greg Kamradt's notebook](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb).

```bash
pip install langchain_experimental langchain_openai
``` 

In [1]:
# Open data/appendix-keywords.txt and read its entire contents into `file`.
with open("./data/appendix-keywords.txt", encoding="utf-8") as f:
    file = f.read()

# Print part of the content read from the file.
print(file[:350])

Semantic Search

Definition: Semantic search is a search method that goes beyond simple keyword matching by understanding the meaning of the user’s query to return relevant results.
Example: If a user searches for “planets in the solar system,” the system might return information about related planets such as “Jupiter” or “Mars.”
Related Keywords: 


## Creating a `SemanticChunker`

The `SemanticChunker` is an experimental LangChain feature that splits text into semantically similar chunks, which allows for more effective processing and analysis of text data.

The cells below load the API key from a `.env` file and create a `SemanticChunker` backed by OpenAI embeddings.

In [4]:
from dotenv import load_dotenv 
load_dotenv(override=True, dotenv_path="../.env")

True

In [5]:
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

# Initialize a semantic chunk splitter using OpenAI embeddings.
text_splitter = SemanticChunker(OpenAIEmbeddings())

## Text Splitting

Use the `text_splitter` on the loaded text (`file`) to split it into smaller, more manageable units. This process is often referred to as chunking.

In [6]:
chunks = text_splitter.split_text(file)

After splitting, you can examine the resulting chunks to see how the text has been divided.

In [7]:
# Print the first chunk among the divided chunks.
print(chunks[0])

Semantic Search

Definition: Semantic search is a search method that goes beyond simple keyword matching by understanding the meaning of the user’s query to return relevant results. Example: If a user searches for “planets in the solar system,” the system might return information about related planets such as “Jupiter” or “Mars.”
Related Keywords: Natural Language Processing, Search Algorithms, Data Mining

Embedding

Definition: Embedding is the process of converting textual data, such as words or sentences, into low-dimensional continuous vectors. This allows computers to better understand and process the text.


The `create_documents()` method takes a list of texts (here `[file]`), splits each one, and returns the resulting chunks as document objects (`docs`).

In [8]:
# Split using text_splitter
docs = text_splitter.create_documents([file])
print(
    docs[0].page_content
)  # Print the content of the first document among the divided documents.

Semantic Search

Definition: Semantic search is a search method that goes beyond simple keyword matching by understanding the meaning of the user’s query to return relevant results. Example: If a user searches for “planets in the solar system,” the system might return information about related planets such as “Jupiter” or “Mars.”
Related Keywords: Natural Language Processing, Search Algorithms, Data Mining

Embedding

Definition: Embedding is the process of converting textual data, such as words or sentences, into low-dimensional continuous vectors. This allows computers to better understand and process the text.


## Breakpoints

This chunking process works by identifying natural breakpoints between sentences.

Here is how it decides where to split the text:

1.  It embeds the sentences and computes the difference between the embeddings of each pair of adjacent sentences.
2.  When the difference between two sentences exceeds a certain threshold (the breakpoint), `text_splitter` treats this as a natural breakpoint and splits the text there.

See [Greg Kamradt's video](https://youtu.be/8OJC21T2SL4?si=PzUtNGYJ_KULq3-w&t=2580) for more details.
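The breakpoint logic described in this section can be illustrated with a minimal, library-free sketch. It assumes the "difference" between two sentences is the cosine distance between their embeddings; the helper names (`cosine_distance`, `find_breakpoints`) and the 2-D toy vectors are hypothetical and stand in for real embeddings, so this is not LangChain's actual implementation.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def find_breakpoints(embeddings, threshold):
    """Return every index i where the text would be cut after sentence i."""
    distances = [
        cosine_distance(embeddings[i], embeddings[i + 1])
        for i in range(len(embeddings) - 1)
    ]
    return [i for i, d in enumerate(distances) if d > threshold]


# Hypothetical 2-D "embeddings" for four sentences: the first two point in
# nearly the same direction, the last two in a different one.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]

breakpoints = find_breakpoints(toy_embeddings, threshold=0.5)
print(breakpoints)  # [1] -> cut between sentence 1 and sentence 2
```

The fixed `threshold=0.5` is only for illustration. The strategies in the following subsections differ in how they derive that threshold from the observed distances.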


### Percentile-Based Splitting

This method sorts all of the embedding differences between sentences and splits the text at a specific percentile (for example, the 70th percentile): any difference larger than the value at that percentile becomes a breakpoint.
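As a rough sketch of the percentile rule, the snippet below works on a hand-written list of distances instead of real embeddings. The `percentile` helper and the `distances` values are hypothetical, and the linear-interpolation percentile is an assumption about how the cutoff is computed.

```python
def percentile(values, q):
    """Return the q-th percentile (0-100), interpolating linearly between ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


# Hypothetical distances between consecutive sentences (not real model output).
distances = [0.10, 0.12, 0.80, 0.15, 0.11, 0.75, 0.09, 0.14, 0.13, 0.70, 0.16]

# Everything above the 70th percentile of the distances is a breakpoint.
threshold = percentile(distances, 70)
breakpoints = [i for i, d in enumerate(distances) if d > threshold]
print(breakpoints)  # [2, 5, 9]
```

Raising the percentile leaves fewer distances above the cutoff and therefore produces fewer, larger chunks; lowering it does the opposite.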


In [9]:
text_splitter = SemanticChunker(
    # Initialize the semantic chunker using OpenAI's embedding model
    OpenAIEmbeddings(),
    # Set the split breakpoint type to percentile
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=70,
)

In [10]:
docs = text_splitter.create_documents([file])
for i, doc in enumerate(docs[:5]):
    print(f"[Chunk {i}]", end="\n\n")
    print(doc.page_content)  # Print the content of each of the first five documents.
    print("===" * 20)

[Chunk 0]

Semantic Search

Definition: Semantic search is a search method that goes beyond simple keyword matching by understanding the meaning of the user’s query to return relevant results. Example: If a user searches for “planets in the solar system,” the system might return information about related planets such as “Jupiter” or “Mars.”
Related Keywords: Natural Language Processing, Search Algorithms, Data Mining

Embedding

Definition: Embedding is the process of converting textual data, such as words or sentences, into low-dimensional continuous vectors. This allows computers to better understand and process the text.
[Chunk 1]

Example: The word “apple” might be represented as a vector like [0.65, -0.23, 0.17]. Related Keywords: Natural Language Processing, Vectorization, Deep Learning

Token

Definition: A token refers to a smaller unit of text obtained by splitting a larger text. It can be a word, sentence, or phrase.
[Chunk 2]

Example: The sentence “I go to school” can be sp

Use `len(docs)` to get the number of chunks created.

In [11]:
print(len(docs))  # Print the length of docs.

24


### Standard Deviation Splitting

This method sets the threshold at a specified number of standard deviations (`breakpoint_threshold_amount`).

To use standard deviation for your breakpoints, set the `breakpoint_threshold_type` parameter to `"standard_deviation"` when initializing `text_splitter`.
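The following sketch shows one way such a threshold could be computed. It assumes the cutoff is the mean distance plus `amount` population standard deviations; the `distances` list is made up for illustration, and the exact formula used by the library may differ.

```python
import statistics

# Hypothetical distances between consecutive sentences (not real model output).
distances = [0.1, 0.1, 0.1, 0.1, 0.9]
amount = 1.25  # plays the role of breakpoint_threshold_amount

# Assumed rule: threshold = mean + amount * (population) standard deviation.
threshold = statistics.fmean(distances) + amount * statistics.pstdev(distances)
breakpoints = [i for i, d in enumerate(distances) if d > threshold]
print(breakpoints)  # [4]
```

Because the threshold scales with the spread of the distances, only gaps that stand out clearly from the typical sentence-to-sentence difference trigger a split.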


In [12]:
text_splitter = SemanticChunker(
    # Initialize the semantic chunker using OpenAI's embedding model.
    OpenAIEmbeddings(),
    # Use standard deviation as the splitting criterion.
    breakpoint_threshold_type="standard_deviation",
    breakpoint_threshold_amount=1.25,
)

After splitting, check the `docs` list and print its length (`len(docs)`) to see how many chunks were created.

In [13]:
# Split using text_splitter.
docs = text_splitter.create_documents([file])

In [14]:
for i, doc in enumerate(docs[:5]):
    print(f"[Chunk {i}]", end="\n\n")
    print(doc.page_content)  # Print the content of each of the first five documents.
    print("===" * 20)

[Chunk 0]

Semantic Search

Definition: Semantic search is a search method that goes beyond simple keyword matching by understanding the meaning of the user’s query to return relevant results. Example: If a user searches for “planets in the solar system,” the system might return information about related planets such as “Jupiter” or “Mars.”
Related Keywords: Natural Language Processing, Search Algorithms, Data Mining

Embedding

Definition: Embedding is the process of converting textual data, such as words or sentences, into low-dimensional continuous vectors. This allows computers to better understand and process the text.
[Chunk 1]

Example: The word “apple” might be represented as a vector like [0.65, -0.23, 0.17]. Related Keywords: Natural Language Processing, Vectorization, Deep Learning

Token

Definition: A token refers to a smaller unit of text obtained by splitting a larger text. It can be a word, sentence, or phrase. Example: The sentence “I go to school” can be split into to

In [15]:
print(len(docs))  # Print the length of docs.

14


### Interquartile Range Splitting

This method uses the interquartile range (IQR) of the embedding differences to decide where the breakpoints fall, and splits the text at those points.

Set the `breakpoint_threshold_type` parameter to `"interquartile"` when initializing `text_splitter` to split based on the IQR.
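A minimal sketch of an IQR-based cutoff follows. It assumes the threshold is the mean distance plus `amount` times the IQR, with quartiles obtained by linear interpolation; the `percentile` helper and the `distances` values are hypothetical, and the library's exact formula may differ.

```python
import statistics


def percentile(values, q):
    """Return the q-th percentile (0-100), interpolating linearly between ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


# Hypothetical distances between consecutive sentences (not real model output).
distances = [0.10, 0.12, 0.80, 0.15, 0.11, 0.75, 0.09, 0.14, 0.13, 0.70, 0.16]
amount = 0.5  # plays the role of breakpoint_threshold_amount

# Assumed rule: threshold = mean + amount * (Q3 - Q1).
iqr = percentile(distances, 75) - percentile(distances, 25)
threshold = statistics.fmean(distances) + amount * iqr
breakpoints = [i for i, d in enumerate(distances) if d > threshold]
print(breakpoints)  # [2, 5, 9]
```

Compared with the standard deviation, the IQR is less affected by a few extreme distances, so the cutoff tends to be more stable on texts with occasional abrupt topic changes.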


In [16]:
text_splitter = SemanticChunker(
    # Initialize the semantic chunk splitter using OpenAI's embedding model.
    OpenAIEmbeddings(),
    # Set the breakpoint threshold type to interquartile range.
    breakpoint_threshold_type="interquartile",
    breakpoint_threshold_amount=0.5,
)

In [17]:
# Split using text_splitter.
docs = text_splitter.create_documents([file])

# Print the results.
for i, doc in enumerate(docs[:5]):
    print(f"[Chunk {i}]", end="\n\n")
    print(doc.page_content)  # Print the content of each of the first five documents.
    print("===" * 20)

[Chunk 0]

Semantic Search

Definition: Semantic search is a search method that goes beyond simple keyword matching by understanding the meaning of the user’s query to return relevant results. Example: If a user searches for “planets in the solar system,” the system might return information about related planets such as “Jupiter” or “Mars.”
Related Keywords: Natural Language Processing, Search Algorithms, Data Mining

Embedding

Definition: Embedding is the process of converting textual data, such as words or sentences, into low-dimensional continuous vectors. This allows computers to better understand and process the text.
[Chunk 1]

Example: The word “apple” might be represented as a vector like [0.65, -0.23, 0.17]. Related Keywords: Natural Language Processing, Vectorization, Deep Learning

Token

Definition: A token refers to a smaller unit of text obtained by splitting a larger text. It can be a word, sentence, or phrase. Example: The sentence “I go to school” can be split into to

In [18]:
print(len(docs))  # Print the length of docs.

19
