# Text Splitters
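Once you have loaded documents, you will often want to transform them to better suit your application. The simplest example is splitting a long document into smaller chunks that fit into your model's context window. LangChain has a number of built-in document transformers that make it easy to split, combine, filter, and otherwise manipulate documents.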

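When you want to work with long pieces of text, you need to split that text into chunks. As simple as this sounds, there is a lot of potential complexity here. Ideally, you want to keep semantically related pieces of text together, and what "semantically related" means can depend on the type of text. This notebook showcases several ways to do that.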

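At a high level, text splitters work as follows:
1. Split the text into small, semantically meaningful pieces (often sentences).
2. Start combining these small pieces into a larger chunk until it reaches a certain size (as measured by some function).
3. Once that size is reached, make that chunk its own piece of text, then start creating a new chunk with some overlap (to keep context between chunks).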

This means there are two different axes along which you can customize a text splitter:
1. How the text is split
2. How the chunk size is measured
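
The three steps above can be sketched in plain Python. This is a minimal illustration rather than LangChain's implementation: the helper name `merge_pieces`, the sentence regex, and the character-based default `length_function` are all assumptions made for the example.

```python
import re
from typing import Callable, List


def merge_pieces(
    pieces: List[str],
    chunk_size: int,
    chunk_overlap: int,
    length_function: Callable[[str], int] = len,
    joiner: str = " ",
) -> List[str]:
    """Greedily pack small pieces into chunks of at most chunk_size,
    carrying trailing pieces over as overlap (hypothetical helper)."""
    chunks: List[str] = []
    current: List[str] = []
    for piece in pieces:
        candidate = joiner.join(current + [piece])
        if current and length_function(candidate) > chunk_size:
            # Step 3: the chunk is full, so emit it.
            chunks.append(joiner.join(current))
            # Drop leading pieces until the remainder fits the overlap budget
            # and still leaves room for the incoming piece.
            while current and (
                length_function(joiner.join(current)) > chunk_overlap
                or length_function(joiner.join(current + [piece])) > chunk_size
            ):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(joiner.join(current))
    return chunks


text = "One. Two is here. Three follows. Four ends it."
# Step 1: split into small, semantically meaningful pieces (here: sentences).
sentences = re.split(r"(?<=[.!?])\s+", text)
# Step 2: combine the pieces until the size limit is reached.
chunks = merge_pieces(sentences, chunk_size=30, chunk_overlap=15)
for chunk in chunks:
    print(repr(chunk))
```

Passing a different `length_function` (for example a token counter) changes the second axis, how chunk size is measured, without touching the splitting logic.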






In [32]:
from dotenv import load_dotenv, find_dotenv
from langchain.globals import set_debug
import os
load_dotenv(find_dotenv())
set_debug(False)

### Types of Text Splitters
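LangChain offers many different types of text splitters. These all live in the `langchain-text-splitters` package. The table below lists all of them, along with a few characteristics:

- Name: the name of the text splitter.
- Splits On: how this text splitter splits text.
- Adds Metadata: whether this text splitter adds metadata about where each chunk came from.
- Description: a description of the splitter, including a recommendation on when to use it.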

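| Name | Splits On | Adds Metadata | Description |
| --- | --- | --- | --- |
| Recursive | A list of user-defined characters |  | Recursively splits text. Splitting text recursively serves the purpose of trying to keep related pieces of text next to each other. This is the recommended way to start splitting text. |
| HTML | HTML-specific characters | Yes | Splits text based on HTML-specific characters. Notably, this adds relevant information about where the chunk came from (based on the HTML). |
| Markdown | Markdown-specific characters | Yes | Splits text based on Markdown-specific characters. Notably, this adds relevant information about where the chunk came from (based on the Markdown). |
| Code | Code (Python, JS) specific characters |  | Splits text based on characters specific to coding languages. The `Language` enum listed under "Split code" shows the supported languages (23 in the installed version shown there). |
| Token | Tokens |  | Splits text on tokens. There are a few different ways to measure tokens. |
| Character | A user-defined character |  | Splits text based on a user-defined character. One of the simpler methods. |
| [Experimental] Semantic Chunker | Sentences |  | First splits on sentences. Then combines adjacent ones if they are semantically similar enough. Taken from Greg Kamradt. |
| AI21 Semantic Text Splitter | Semantics | Yes | Identifies distinct topics that form coherent pieces of text and splits along those. |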


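### Evaluate text splitters
You can evaluate text splitters with the Chunkviz utility created by Greg Kamradt. Chunkviz is a great tool for visualizing how your text splitter is working. It shows you how your text is being split up and helps you tune the splitting parameters.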

### Other Document Transforms
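Text splitting is only one example of a transformation you may want to apply to documents before passing them to an LLM. Head to Integrations for documentation on built-in document transformers that integrate with third-party tools.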

## Split by HTML header
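Similar in concept to the `MarkdownHeaderTextSplitter`, the `HTMLHeaderTextSplitter` is a "structure-aware" chunker that splits text at the element level and adds metadata for each header "relevant" to any given chunk. It can return chunks element by element or combine elements with the same metadata, with the objectives of (a) keeping related text grouped (more or less) semantically and (b) preserving context-rich information encoded in document structures. It can be used with other text splitters as part of a chunking pipeline.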
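
To make the idea of "headers relevant to a chunk" concrete, here is a rough sketch built only on the standard library's `html.parser`. It is not the algorithm `HTMLHeaderTextSplitter` uses: the class name `HeaderTracker` is hypothetical, only `<p>` text is collected, and DOM nesting is ignored, so text that follows a nested section keeps inheriting that section's header.

```python
from html.parser import HTMLParser
from typing import Dict, List, Tuple


class HeaderTracker(HTMLParser):
    """Collect <p> text together with the headers currently in scope
    (hypothetical helper, not the library's algorithm)."""

    def __init__(self, headers: Dict[str, str]) -> None:
        super().__init__()
        self.headers = headers              # e.g. {"h1": "Header 1"}
        self.active: Dict[str, str] = {}    # header tag -> header text
        self.current_tag = ""
        self.chunks: List[Tuple[str, Dict[str, str]]] = []

    def handle_starttag(self, tag, attrs):
        self.current_tag = tag

    def handle_endtag(self, tag):
        self.current_tag = ""

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.current_tag in self.headers:
            # A new header replaces headers at the same or a deeper level.
            level = int(self.current_tag[1])
            self.active = {
                tag: value
                for tag, value in self.active.items()
                if int(tag[1]) < level
            }
            self.active[self.current_tag] = text
        elif self.current_tag == "p":
            metadata = {self.headers[tag]: value for tag, value in self.active.items()}
            self.chunks.append((text, metadata))


html = (
    "<h1>Foo</h1><p>Intro.</p>"
    "<h2>Bar</h2><p>About Bar.</p>"
    "<h2>Baz</h2><p>About Baz.</p>"
)
tracker = HeaderTracker({"h1": "Header 1", "h2": "Header 2"})
tracker.feed(html)
for text, metadata in tracker.chunks:
    print(text, metadata)
```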

### Usage examples
%pip install -qU langchain-text-splitters


In [1]:
from langchain_text_splitters import HTMLHeaderTextSplitter

html_string = """
<!DOCTYPE html>
<html>
<body>
    <div>
        <h1>Foo</h1>
        <p>Some intro text about Foo.</p>
        <div>
            <h2>Bar main section</h2>
            <p>Some intro text about Bar.</p>
            <h3>Bar subsection 1</h3>
            <p>Some text about the first subtopic of Bar.</p>
            <h3>Bar subsection 2</h3>
            <p>Some text about the second subtopic of Bar.</p>
        </div>
        <div>
            <h2>Baz</h2>
            <p>Some text about Baz</p>
        </div>
        <br>
        <p>Some concluding text about Foo</p>
    </div>
</body>
</html>
"""

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
html_header_splits = html_splitter.split_text(html_string)
html_header_splits

[Document(page_content='Foo'),
 Document(page_content='Some intro text about Foo.  \nBar main section Bar subsection 1 Bar subsection 2', metadata={'Header 1': 'Foo'}),
 Document(page_content='Some intro text about Bar.', metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section'}),
 Document(page_content='Some text about the first subtopic of Bar.', metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section', 'Header 3': 'Bar subsection 1'}),
 Document(page_content='Some text about the second subtopic of Bar.', metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section', 'Header 3': 'Bar subsection 2'}),
 Document(page_content='Baz', metadata={'Header 1': 'Foo'}),
 Document(page_content='Some text about Baz', metadata={'Header 1': 'Foo', 'Header 2': 'Baz'}),
 Document(page_content='Some concluding text about Foo', metadata={'Header 1': 'Foo'})]

In [2]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

url = "https://plato.stanford.edu/entries/goedel/"

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
    ("h4", "Header 4"),
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

# for local file use html_splitter.split_text_from_file(<path_to_file>)
html_header_splits = html_splitter.split_text_from_url(url)

chunk_size = 500
chunk_overlap = 30
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size, chunk_overlap=chunk_overlap
)

# Split
splits = text_splitter.split_documents(html_header_splits)
splits[80:85]

[Document(page_content='We see that Gödel first tried to reduce the consistency problem for analysis to that of arithmetic. This seemed to require a truth definition for arithmetic, which in turn led to paradoxes, such as the Liar paradox (“This sentence is false”) and Berry’s paradox (“The least number not defined by an expression consisting of just fourteen English words”). Gödel then noticed that such paradoxes would not necessarily arise if truth were replaced by provability. But this means that arithmetic truth', metadata={'Header 1': 'Kurt Gödel', 'Header 2': '2. Gödel’s Mathematical Work', 'Header 3': '2.2 The Incompleteness Theorems', 'Header 4': '2.2.1 The First Incompleteness Theorem'}),
 Document(page_content='means that arithmetic truth and arithmetic provability are not co-extensive — whence the First Incompleteness Theorem.', metadata={'Header 1': 'Kurt Gödel', 'Header 2': '2. Gödel’s Mathematical Work', 'Header 3': '2.2 The Incompleteness Theorems', 'Header 4': '2.2.1 Th

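### Limitations
There can be quite a bit of structural variation from one HTML document to another, and while `HTMLHeaderTextSplitter` will attempt to attach all "relevant" headers to any given chunk, it can sometimes miss certain headers. For example, the algorithm assumes an informational hierarchy in which headers are always at nodes "above" the associated text, i.e. prior siblings, ancestors, and combinations thereof. In the following news article (as of the writing of this document), the document is structured such that the text of the top-level headline, while tagged "h1", is in a distinct subtree from the text elements that we would expect it to be "above". As a result, the "h1" element and its associated text do not show up in the chunk metadata (but, where applicable, we do see "h2" and its associated text):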

In [3]:
url = "https://www.cnn.com/2023/09/25/weather/el-nino-winter-us-climate/index.html"

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
html_header_splits = html_splitter.split_text_from_url(url)
print(html_header_splits[1].page_content[:500])

No two El Niño winters are the same, but many have temperature and precipitation trends in common.  
Average conditions during an El Niño winter across the continental US.  
One of the major reasons is the position of the jet stream, which often shifts south during an El Niño winter. This shift typically brings wetter and cooler weather to the South while the North becomes drier and warmer, according to NOAA.  
Because the jet stream is essentially a river of air that storms flow through, they c


### Split by HTML section
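Similar in concept to the `HTMLHeaderTextSplitter`, the `HTMLSectionSplitter` is a "structure-aware" chunker that splits text at the element level and adds metadata for each header "relevant" to any given chunk. It can return chunks element by element or combine elements with the same metadata, with the objectives of (a) keeping related text grouped (more or less) semantically and (b) preserving context-rich information encoded in document structures. It can be used with other text splitters as part of a chunking pipeline.

Internally, it uses the `RecursiveCharacterTextSplitter` when the section size is larger than the chunk size. It also considers the font size of the text to decide whether it is a section, based on a determined font size threshold. Use `xslt_path` to provide an absolute path to an XSLT file that transforms the HTML so that sections can be detected based on the provided tags. The default is the `converting_to_header.xslt` file in the `data_connection/document_transformers` directory. This converts the HTML to a format/layout in which sections are easier to detect. For example, `span` elements can be converted to header tags based on their font size so that they are detected as sections.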


In [4]:
from langchain_text_splitters import HTMLSectionSplitter

html_string = """
    <!DOCTYPE html>
    <html>
    <body>
        <div>
            <h1>Foo</h1>
            <p>Some intro text about Foo.</p>
            <div>
                <h2>Bar main section</h2>
                <p>Some intro text about Bar.</p>
                <h3>Bar subsection 1</h3>
                <p>Some text about the first subtopic of Bar.</p>
                <h3>Bar subsection 2</h3>
                <p>Some text about the second subtopic of Bar.</p>
            </div>
            <div>
                <h2>Baz</h2>
                <p>Some text about Baz</p>
            </div>
            <br>
            <p>Some concluding text about Foo</p>
        </div>
    </body>
    </html>
"""

headers_to_split_on = [("h1", "Header 1"), ("h2", "Header 2")]

html_splitter = HTMLSectionSplitter(headers_to_split_on=headers_to_split_on)
html_header_splits = html_splitter.split_text(html_string)
html_header_splits

ImportError: cannot import name 'HTMLSectionSplitter' from 'langchain_text_splitters' (/Users/dyz/opt/anaconda3/envs/langchain/lib/python3.10/site-packages/langchain_text_splitters/__init__.py)

Pipelined to another splitter, with the HTML loaded from string content:

In [6]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

html_string = """
    <!DOCTYPE html>
    <html>
    <body>
        <div>
            <h1>Foo</h1>
            <p>Some intro text about Foo.</p>
            <div>
                <h2>Bar main section</h2>
                <p>Some intro text about Bar.</p>
                <h3>Bar subsection 1</h3>
                <p>Some text about the first subtopic of Bar.</p>
                <h3>Bar subsection 2</h3>
                <p>Some text about the second subtopic of Bar.</p>
            </div>
            <div>
                <h2>Baz</h2>
                <p>Some text about Baz</p>
            </div>
            <br>
            <p>Some concluding text about Foo</p>
        </div>
    </body>
    </html>
"""

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
    ("h4", "Header 4"),
]

html_splitter = HTMLSectionSplitter(headers_to_split_on=headers_to_split_on)

html_header_splits = html_splitter.split_text(html_string)

chunk_size = 500
chunk_overlap = 30
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size, chunk_overlap=chunk_overlap
)

# Split
splits = text_splitter.split_documents(html_header_splits)
splits

NameError: name 'HTMLSectionSplitter' is not defined

### Split by character
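This is the simplest method. It splits on a character sequence (by default `"\n\n"`) and measures chunk length by number of characters.

1. How the text is split: by a single character separator.
2. How the chunk size is measured: by number of characters.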

In [7]:
# This is a long document we can split up.
with open("data/whatsapp_chat.txt") as f:
    state_of_the_union = f.read()
state_of_the_union

'1/22/23, 6:30 PM - User 1: Hi! Im interested in your bag. Im offering $50. Let me know if you are interested. Thanks!\n1/22/23, 8:24 PM - User 2: Goodmorning! $50 is too low.\n1/23/23, 2:59 AM - User 1: How much do you want?\n1/23/23, 3:00 AM - User 2: Online is at least $100\n1/23/23, 3:01 AM - User 2: Here is $129\n1/23/23, 3:01 AM - User 2: <Media omitted>\n1/23/23, 3:01 AM - User 1: Im not interested in this bag. Im interested in the blue one!\n1/23/23, 3:02 AM - User 1: I thought you were selling the blue one!\n1/23/23, 3:18 AM - User 2: No Im sorry it was my mistake, the blue one is not for sale\n1/23/23, 3:19 AM - User 1: Oh no worries! Bye\n1/23/23, 3:19 AM - User 2: Bye!\n1/23/23, 3:22_AM - User 1: And let me know if anything changes'

In [8]:
from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    is_separator_regex=False,
)

text_splitter

<langchain_text_splitters.character.CharacterTextSplitter at 0x7fa468e01660>

In [9]:
texts = text_splitter.create_documents([state_of_the_union])
print(texts[0])

page_content='1/22/23, 6:30 PM - User 1: Hi! Im interested in your bag. Im offering $50. Let me know if you are interested. Thanks!\n1/22/23, 8:24 PM - User 2: Goodmorning! $50 is too low.\n1/23/23, 2:59 AM - User 1: How much do you want?\n1/23/23, 3:00 AM - User 2: Online is at least $100\n1/23/23, 3:01 AM - User 2: Here is $129\n1/23/23, 3:01 AM - User 2: <Media omitted>\n1/23/23, 3:01 AM - User 1: Im not interested in this bag. Im interested in the blue one!\n1/23/23, 3:02 AM - User 1: I thought you were selling the blue one!\n1/23/23, 3:18 AM - User 2: No Im sorry it was my mistake, the blue one is not for sale\n1/23/23, 3:19 AM - User 1: Oh no worries! Bye\n1/23/23, 3:19 AM - User 2: Bye!\n1/23/23, 3:22_AM - User 1: And let me know if anything changes'


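Here is an example of passing metadata along with the documents. Note that the metadata is carried along with the documents when they are split.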

In [10]:
metadatas = [{"document": 1}, {"document": 2}]
documents = text_splitter.create_documents(
    [state_of_the_union, state_of_the_union], metadatas=metadatas
)
print(documents[0])

page_content='1/22/23, 6:30 PM - User 1: Hi! Im interested in your bag. Im offering $50. Let me know if you are interested. Thanks!\n1/22/23, 8:24 PM - User 2: Goodmorning! $50 is too low.\n1/23/23, 2:59 AM - User 1: How much do you want?\n1/23/23, 3:00 AM - User 2: Online is at least $100\n1/23/23, 3:01 AM - User 2: Here is $129\n1/23/23, 3:01 AM - User 2: <Media omitted>\n1/23/23, 3:01 AM - User 1: Im not interested in this bag. Im interested in the blue one!\n1/23/23, 3:02 AM - User 1: I thought you were selling the blue one!\n1/23/23, 3:18 AM - User 2: No Im sorry it was my mistake, the blue one is not for sale\n1/23/23, 3:19 AM - User 1: Oh no worries! Bye\n1/23/23, 3:19 AM - User 2: Bye!\n1/23/23, 3:22_AM - User 1: And let me know if anything changes' metadata={'document': 1}


In [11]:
text_splitter.split_text(state_of_the_union)[0]

'1/22/23, 6:30 PM - User 1: Hi! Im interested in your bag. Im offering $50. Let me know if you are interested. Thanks!\n1/22/23, 8:24 PM - User 2: Goodmorning! $50 is too low.\n1/23/23, 2:59 AM - User 1: How much do you want?\n1/23/23, 3:00 AM - User 2: Online is at least $100\n1/23/23, 3:01 AM - User 2: Here is $129\n1/23/23, 3:01 AM - User 2: <Media omitted>\n1/23/23, 3:01 AM - User 1: Im not interested in this bag. Im interested in the blue one!\n1/23/23, 3:02 AM - User 1: I thought you were selling the blue one!\n1/23/23, 3:18 AM - User 2: No Im sorry it was my mistake, the blue one is not for sale\n1/23/23, 3:19 AM - User 1: Oh no worries! Bye\n1/23/23, 3:19 AM - User 2: Bye!\n1/23/23, 3:22_AM - User 1: And let me know if anything changes'
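
In the outputs above, the whole chat comes back as a single chunk. Two things appear to contribute, assuming the file contains only single newlines as the printed string suggests: the `"\n\n"` separator never occurs in the text, and the text looks shorter than `chunk_size=1000` anyway. The plain-Python sketch below, using a shortened and hypothetical `sample` string, shows the effect of the separator choice on its own:

```python
sample = (
    "1/22/23, 6:30 PM - User 1: Hi!\n"
    "1/22/23, 8:24 PM - User 2: Goodmorning!\n"
    "1/23/23, 2:59 AM - User 1: How much do you want?"
)

# No blank line in the text, so splitting on "\n\n" yields one piece.
by_paragraph = sample.split("\n\n")
# Splitting on a single newline yields one piece per message.
by_line = sample.split("\n")

print(len(by_paragraph), len(by_line))
```

With a splitter that merges pieces back up to the chunk size, switching to `separator="\n"` would still need a smaller `chunk_size` before several chunks appear.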

### Split code
CodeTextSplitter allows you to split code, with multiple languages supported. Import the enum `Language` and specify the language.

In [13]:
from langchain_text_splitters import (
    Language,
    RecursiveCharacterTextSplitter,
)
# Full list of supported languages
[e.value for e in Language]

['cpp',
 'go',
 'java',
 'kotlin',
 'js',
 'ts',
 'php',
 'proto',
 'python',
 'rst',
 'ruby',
 'rust',
 'scala',
 'swift',
 'markdown',
 'latex',
 'html',
 'sol',
 'csharp',
 'cobol',
 'c',
 'lua',
 'perl']

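You can also inspect the separators used for a given language with `RecursiveCharacterTextSplitter.get_separators_for_language`.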

#### Python

In [14]:
PYTHON_CODE = """
def hello_world():
    print("Hello, World!")

# Call the function
hello_world()
"""
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=50, chunk_overlap=0
)
python_docs = python_splitter.create_documents([PYTHON_CODE])
python_docs

[Document(page_content='def hello_world():\n    print("Hello, World!")'),
 Document(page_content='# Call the function\nhello_world()')]

### MarkdownHeaderTextSplitter
**Motivation**

Many chat or Q&A applications involve chunking input documents prior to embedding and vector storage.

Notes from Pinecone provide some useful tips on this.




In [15]:
from langchain_text_splitters import MarkdownHeaderTextSplitter
markdown_document = "# Foo\n\n    ## Bar\n\nHi this is Jim\n\nHi this is Joe\n\n ### Boo \n\n Hi this is Lance \n\n ## Baz\n\n Hi this is Molly"

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
md_header_splits = markdown_splitter.split_text(markdown_document)
md_header_splits
type(md_header_splits[0])

langchain_core.documents.base.Document

By default, `MarkdownHeaderTextSplitter` strips the headers being split on from the content of the output chunks. This can be disabled by setting `strip_headers=False`.

In [16]:
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on, strip_headers=False
)
md_header_splits = markdown_splitter.split_text(markdown_document)
md_header_splits

[Document(page_content='# Foo  \n## Bar  \nHi this is Jim  \nHi this is Joe', metadata={'Header 1': 'Foo', 'Header 2': 'Bar'}),
 Document(page_content='### Boo  \nHi this is Lance', metadata={'Header 1': 'Foo', 'Header 2': 'Bar', 'Header 3': 'Boo'}),
 Document(page_content='## Baz  \nHi this is Molly', metadata={'Header 1': 'Foo', 'Header 2': 'Baz'})]

Within each markdown group we can then apply any text splitter we want.

In [17]:
markdown_document = "# Intro \n\n    ## History \n\n Markdown[9] is a lightweight markup language for creating formatted text using a plain-text editor. John Gruber created Markdown in 2004 as a markup language that is appealing to human readers in its source code form.[9] \n\n Markdown is widely used in blogging, instant messaging, online forums, collaborative software, documentation pages, and readme files. \n\n ## Rise and divergence \n\n As Markdown popularity grew rapidly, many Markdown implementations appeared, driven mostly by the need for \n\n additional features such as tables, footnotes, definition lists,[note 1] and Markdown inside HTML blocks. \n\n #### Standardization \n\n From 2012, a group of people, including Jeff Atwood and John MacFarlane, launched what Atwood characterised as a standardisation effort. \n\n ## Implementations \n\n Implementations of Markdown are available for over a dozen programming languages."

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
]

# MD splits
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on, strip_headers=False
)
md_header_splits = markdown_splitter.split_text(markdown_document)

# Char-level splits
from langchain_text_splitters import RecursiveCharacterTextSplitter

chunk_size = 250
chunk_overlap = 30
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size, chunk_overlap=chunk_overlap
)

# Split
splits = text_splitter.split_documents(md_header_splits)
splits

[Document(page_content='# Intro  \n## History  \nMarkdown[9] is a lightweight markup language for creating formatted text using a plain-text editor. John Gruber created Markdown in 2004 as a markup language that is appealing to human readers in its source code form.[9]', metadata={'Header 1': 'Intro', 'Header 2': 'History'}),
 Document(page_content='Markdown is widely used in blogging, instant messaging, online forums, collaborative software, documentation pages, and readme files.', metadata={'Header 1': 'Intro', 'Header 2': 'History'}),
 Document(page_content='## Rise and divergence  \nAs Markdown popularity grew rapidly, many Markdown implementations appeared, driven mostly by the need for  \nadditional features such as tables, footnotes, definition lists,[note 1] and Markdown inside HTML blocks.', metadata={'Header 1': 'Intro', 'Header 2': 'Rise and divergence'}),
 Document(page_content='#### Standardization  \nFrom 2012, a group of people, including Jeff Atwood and John MacFarlane,

### Recursively split JSON
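This JSON splitter traverses the JSON data depth first and builds smaller JSON chunks. It attempts to keep nested JSON objects whole but will split them if needed to keep chunks between a minimum chunk size and the maximum chunk size. If a value is not a nested JSON object but a very large string, the string will not be split. If you need a hard cap on the chunk size, consider following this with a recursive text splitter on those chunks. There is an optional pre-processing step for lists that first converts them to JSON (dict) and then splits them as such.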
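
The depth-first idea can be sketched in a few lines of standard-library Python. This is an illustrative approximation, not the code behind `RecursiveJsonSplitter`: the function `split_json` and its rule of descending only into objects that are too large on their own are assumptions, and there is no minimum chunk size here.

```python
import copy
import json
from typing import Any, Dict, List


def split_json(data: Dict[str, Any], max_chunk_size: int) -> List[Dict[str, Any]]:
    """Depth-first split of a nested dict into smaller dicts whose JSON
    text stays under max_chunk_size where possible (hypothetical helper)."""
    chunks: List[Dict[str, Any]] = [{}]

    def size(obj: Any) -> int:
        return len(json.dumps(obj))

    def set_nested(target: Dict[str, Any], path: List[str], value: Any) -> None:
        # Recreate the chain of parent keys so each chunk keeps its context.
        for key in path[:-1]:
            target = target.setdefault(key, {})
        target[path[-1]] = value

    def walk(node: Dict[str, Any], path: List[str]) -> None:
        for key, value in node.items():
            new_path = path + [key]
            if isinstance(value, dict) and value and size(value) > max_chunk_size:
                # Too big to keep whole: descend and place its children instead.
                walk(value, new_path)
                continue
            trial = copy.deepcopy(chunks[-1])
            set_nested(trial, new_path, value)
            if chunks[-1] and size(trial) > max_chunk_size:
                # Current chunk is full: start a new one with this value.
                chunks.append({})
                set_nested(chunks[-1], new_path, value)
            else:
                chunks[-1] = trial

    walk(data, [])
    return chunks


data = {
    "info": {"title": "Demo", "version": "0.1.0"},
    "paths": {
        "/a": {"get": {"summary": "Read A"}},
        "/b": {"get": {"summary": "Read B"}},
    },
}
chunks = split_json(data, max_chunk_size=60)
for chunk in chunks:
    print(len(json.dumps(chunk)), json.dumps(chunk))
```

As in the description above, strings and lists are never split in this sketch, so a single oversized value still produces an oversized chunk.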



In [18]:
import json

import requests

In [19]:
# This is a large nested json object and will be loaded as a python dict
json_data = requests.get("https://api.smith.langchain.com/openapi.json").json()

In [21]:
from langchain_text_splitters import RecursiveJsonSplitter
splitter = RecursiveJsonSplitter(max_chunk_size=300)
# Recursively split json data - If you need to access/manipulate the smaller json chunks
json_chunks = splitter.split_json(json_data=json_data)
# The splitter can also output documents
docs = splitter.create_documents(texts=[json_data])

# or a list of strings
texts = splitter.split_text(json_data=json_data)

print(texts[0])
print(texts[1])

{"openapi": "3.1.0", "info": {"title": "LangSmith", "version": "0.1.0"}, "servers": [{"url": "https://api.smith.langchain.com", "description": "LangSmith API endpoint."}]}
{"paths": {"/api/v1/sessions/{session_id}": {"get": {"tags": ["tracer-sessions"], "summary": "Read Tracer Session", "description": "Get a specific session.", "operationId": "read_tracer_session_api_v1_sessions__session_id__get"}}}}


In [22]:
# Let's look at the size of the chunks
print([len(text) for text in texts][:10])

# Reviewing one of these chunks that was bigger we see there is a list object there
print(texts[1])

[171, 231, 126, 469, 210, 213, 237, 271, 191, 232]
{"paths": {"/api/v1/sessions/{session_id}": {"get": {"tags": ["tracer-sessions"], "summary": "Read Tracer Session", "description": "Get a specific session.", "operationId": "read_tracer_session_api_v1_sessions__session_id__get"}}}}


In [23]:
# The json splitter by default does not split lists
# the following will preprocess the json and convert list to dict with index:item as key:val pairs
texts = splitter.split_text(json_data=json_data, convert_lists=True)
# Let's look at the size of the chunks. Now they are all under the max
print([len(text) for text in texts][:10])

[171, 231, 126, 469, 210, 213, 237, 271, 191, 232]


In [24]:
# We can also look at the documents
docs[1]

Document(page_content='{"paths": {"/api/v1/sessions/{session_id}": {"get": {"tags": ["tracer-sessions"], "summary": "Read Tracer Session", "description": "Get a specific session.", "operationId": "read_tracer_session_api_v1_sessions__session_id__get"}}}}')

### Recursively split by character
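This text splitter is the recommended one for generic text. It is parameterized by a list of characters and tries to split on them in order until the chunks are small enough. The default list is `["\n\n", "\n", " ", ""]`. This has the effect of keeping all paragraphs (and then sentences, and then words) together as long as possible, as those would generically seem to be the strongest semantically related pieces of text.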
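
The "try each separator in order" behaviour can be illustrated with a short recursive function. This is a simplified sketch under stated assumptions, not the library's code: `recursive_split` is a hypothetical name, separators are dropped at chunk boundaries, and chunk overlap is left out entirely.

```python
from typing import List


def recursive_split(text: str, separators: List[str], chunk_size: int) -> List[str]:
    """Split text with the first separator, recursing with finer separators
    only for pieces that are still too long (hypothetical helper)."""
    if len(text) <= chunk_size or not separators:
        return [text]
    separator, rest = separators[0], separators[1:]
    # An empty separator means "split between every character".
    pieces = list(text) if separator == "" else text.split(separator)

    chunks: List[str] = []
    current = ""
    for piece in pieces:
        if len(piece) > chunk_size:
            # Flush what we have, then break the long piece down further.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(piece, rest, chunk_size))
            continue
        candidate = piece if not current else current + separator + piece
        if len(candidate) > chunk_size:
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = (
    "First paragraph here.\n\n"
    "Second paragraph is a bit longer than the first.\n"
    "Last line."
)
chunks = recursive_split(text, ["\n\n", "\n", " ", ""], chunk_size=30)
for chunk in chunks:
    print(repr(chunk))
```

The short first paragraph survives intact, while the longer second one is only broken at word boundaries after the paragraph and line separators fail to make it small enough.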



In [25]:
# This is a long document we can split up.
with open("data/whatsapp_chat.txt") as f:
    state_of_the_union = f.read()

In [26]:
state_of_the_union

'1/22/23, 6:30 PM - User 1: Hi! Im interested in your bag. Im offering $50. Let me know if you are interested. Thanks!\n1/22/23, 8:24 PM - User 2: Goodmorning! $50 is too low.\n1/23/23, 2:59 AM - User 1: How much do you want?\n1/23/23, 3:00 AM - User 2: Online is at least $100\n1/23/23, 3:01 AM - User 2: Here is $129\n1/23/23, 3:01 AM - User 2: <Media omitted>\n1/23/23, 3:01 AM - User 1: Im not interested in this bag. Im interested in the blue one!\n1/23/23, 3:02 AM - User 1: I thought you were selling the blue one!\n1/23/23, 3:18 AM - User 2: No Im sorry it was my mistake, the blue one is not for sale\n1/23/23, 3:19 AM - User 1: Oh no worries! Bye\n1/23/23, 3:19 AM - User 2: Bye!\n1/23/23, 3:22_AM - User 1: And let me know if anything changes'

In [27]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size=100,
    chunk_overlap=20,
    length_function=len,
    is_separator_regex=False,
)
texts = text_splitter.create_documents([state_of_the_union])
print(texts[0])
print(texts[1])

page_content='1/22/23, 6:30 PM - User 1: Hi! Im interested in your bag. Im offering $50. Let me know if you are'
page_content='me know if you are interested. Thanks!'


In [28]:
text_splitter.split_text(state_of_the_union)[:2]

['1/22/23, 6:30 PM - User 1: Hi! Im interested in your bag. Im offering $50. Let me know if you are',
 'me know if you are interested. Thanks!']

### Splitting text from languages without word boundaries
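Some writing systems do not have word boundaries, for example Chinese, Japanese, and Thai. Splitting text with the default separator list of `["\n\n", "\n", " ", ""]` can cause words to be split between chunks. To keep words together, you can override the list of separators to include additional punctuation:

- Add the ASCII full stop ".", the Unicode fullwidth full stop "．" (used in Chinese text), and the ideographic full stop "。" (used in Japanese and Chinese)
- Add the zero-width space used in Thai, Myanmar, Khmer, and Japanese.
- Add the ASCII comma ",", the Unicode fullwidth comma "，", and the Unicode ideographic comma "、"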
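
As a small illustration of why these extra separators help, the sketch below splits a Chinese sentence pair on full stops using only the standard `re` module. The sample text and the choice to keep the punctuation attached to the preceding sentence are assumptions made for the example.

```python
import re

text = "今天天气很好。我们去公园散步，然后回家。"

# Split after each full stop (ASCII, fullwidth U+FF0E, or ideographic U+3002),
# keeping the punctuation attached to the sentence it ends.
sentences = [s for s in re.split(r"(?<=[.\uff0e\u3002])", text) if s]
print(sentences)

# The text contains no spaces, so a whitespace separator cannot split it.
print(text.split(" "))
```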




In [29]:
text_splitter = RecursiveCharacterTextSplitter(
    separators=[
        "\n\n",
        "\n",
        " ",
        ".",
        ",",
        "\u200b",  # Zero-width space
        "\uff0c",  # Fullwidth comma
        "\u3001",  # Ideographic comma
        "\uff0e",  # Fullwidth full stop
        "\u3002",  # Ideographic full stop
        "",
    ],
    # Existing args
)

### Semantic Chunking
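Splits the text based on semantic similarity.

Taken from Greg Kamradt's notebook "5 Levels Of Text Splitting". All credit to him.

At a high level, this splits the text into sentences, then groups them into groups of 3 sentences, and then merges the ones that are similar in the embedding space.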

In [33]:
# This is a long document we can split up.
with open("data/whatsapp_chat.txt") as f:
    state_of_the_union = f.read()


from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings
text_splitter = SemanticChunker(OpenAIEmbeddings())

docs = text_splitter.create_documents([state_of_the_union])
print(docs[0].page_content)

1/22/23, 6:30 PM - User 1: Hi! Im interested in your bag. Im offering $50. Let me know if you are interested.


### Breakpoints
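This chunker works by determining when to "break apart" sentences. It does so by looking at the differences in embeddings between any two sentences. When that difference exceeds some threshold, the text is split there.

The default way to split is based on percentile. In this method, all differences between sentences are calculated, and any difference greater than the X percentile is split.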
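
A percentile threshold can be sketched without any embedding model. The distances below are made up, the 80th percentile is an arbitrary choice for "X", and the `percentile` helper (linear interpolation) is an assumption of this example rather than the library's code.

```python
from typing import List


def percentile(values: List[float], pct: float) -> float:
    """Linear-interpolation percentile (hypothetical helper)."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    low = int(rank)
    high = min(low + 1, len(ordered) - 1)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)


# Made-up distances between consecutive sentence embeddings.
distances = [0.10, 0.12, 0.55, 0.09, 0.11, 0.60, 0.08]
threshold = percentile(distances, 80)
# Split after sentence i when its distance to sentence i + 1 exceeds the threshold.
breakpoints = [i for i, d in enumerate(distances) if d > threshold]
print(threshold, breakpoints)
```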



In [34]:
text_splitter = SemanticChunker(
    OpenAIEmbeddings(), breakpoint_threshold_type="percentile"
)

docs = text_splitter.create_documents([state_of_the_union])
print(docs[0].page_content)

print(len(docs))

1/22/23, 6:30 PM - User 1: Hi! Im interested in your bag. Im offering $50. Let me know if you are interested.
2


#### Standard Deviation
In this method, any difference greater than X standard deviations is split.
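
One plausible reading of this rule, sketched with the standard `statistics` module: the threshold is the mean distance plus X standard deviations. The made-up distances, the value `num_std = 1.0`, and the use of the population standard deviation are all assumptions of the example.

```python
import statistics

# Made-up distances between consecutive sentence embeddings.
distances = [0.10, 0.12, 0.55, 0.09, 0.11, 0.60, 0.08]
num_std = 1.0  # the "X" in the text; value chosen only for this demo

threshold = statistics.mean(distances) + num_std * statistics.pstdev(distances)
breakpoints = [i for i, d in enumerate(distances) if d > threshold]
print(round(threshold, 3), breakpoints)
```

A larger `num_std` raises the threshold and produces fewer, longer chunks.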

In [36]:
text_splitter = SemanticChunker(
    OpenAIEmbeddings(), breakpoint_threshold_type="standard_deviation"
)
docs = text_splitter.create_documents([state_of_the_union])
print(docs[0].page_content)
print(len(docs))

1/22/23, 6:30 PM - User 1: Hi! Im interested in your bag. Im offering $50. Let me know if you are interested. Thanks! 1/22/23, 8:24 PM - User 2: Goodmorning! $50 is too low. 1/23/23, 2:59 AM - User 1: How much do you want? 1/23/23, 3:00 AM - User 2: Online is at least $100
1/23/23, 3:01 AM - User 2: Here is $129
1/23/23, 3:01 AM - User 2: <Media omitted>
1/23/23, 3:01 AM - User 1: Im not interested in this bag. Im interested in the blue one! 1/23/23, 3:02 AM - User 1: I thought you were selling the blue one! 1/23/23, 3:18 AM - User 2: No Im sorry it was my mistake, the blue one is not for sale
1/23/23, 3:19 AM - User 1: Oh no worries! Bye
1/23/23, 3:19 AM - User 2: Bye! 1/23/23, 3:22_AM - User 1: And let me know if anything changes
1


#### Interquartile
In this method, the interquartile distance is used to split chunks.
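
A rough sketch of an interquartile rule, again with made-up distances. The specific threshold used here (mean plus 1.5 times the interquartile range) and the inclusive quantile method are assumptions for illustration; the text above does not spell out the exact formula.

```python
import statistics

# Made-up distances between consecutive sentence embeddings.
distances = [0.10, 0.12, 0.55, 0.09, 0.11, 0.60, 0.08]

# quantiles(..., n=4) returns the three quartile cut points.
q1, _median, q3 = statistics.quantiles(distances, n=4, method="inclusive")
iqr = q3 - q1
scale = 1.5  # assumed multiplier for this demo
threshold = statistics.mean(distances) + scale * iqr
breakpoints = [i for i, d in enumerate(distances) if d > threshold]
print(round(iqr, 3), round(threshold, 3), breakpoints)
```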

In [37]:
text_splitter = SemanticChunker(
    OpenAIEmbeddings(), breakpoint_threshold_type="interquartile"
)
docs = text_splitter.create_documents([state_of_the_union])
print(docs[0].page_content)
print(len(docs))

1/22/23, 6:30 PM - User 1: Hi! Im interested in your bag. Im offering $50. Let me know if you are interested. Thanks! 1/22/23, 8:24 PM - User 2: Goodmorning! $50 is too low. 1/23/23, 2:59 AM - User 1: How much do you want? 1/23/23, 3:00 AM - User 2: Online is at least $100
1/23/23, 3:01 AM - User 2: Here is $129
1/23/23, 3:01 AM - User 2: <Media omitted>
1/23/23, 3:01 AM - User 1: Im not interested in this bag. Im interested in the blue one! 1/23/23, 3:02 AM - User 1: I thought you were selling the blue one! 1/23/23, 3:18 AM - User 2: No Im sorry it was my mistake, the blue one is not for sale
1/23/23, 3:19 AM - User 1: Oh no worries! Bye
1/23/23, 3:19 AM - User 2: Bye! 1/23/23, 3:22_AM - User 1: And let me know if anything changes
1


## Split by tokens
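Language models have a token limit, which you should not exceed. When you split your text into chunks, it is therefore a good idea to count the number of tokens. There are many tokenizers. When you count tokens in your text, you should use the same tokenizer as the one used by the language model.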

### tiktoken
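> tiktoken is a fast BPE tokenizer created by OpenAI.

We can use it to estimate the number of tokens used. It will probably be more accurate for OpenAI models.
- How the text is split: by the character passed in.
- How the chunk size is measured: by the tiktoken tokenizer.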



In [2]:
# This is a long document we can split up.
with open("data/whatsapp_chat.txt") as f:
    state_of_the_union = f.read()
from langchain_text_splitters import CharacterTextSplitter

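The `.from_tiktoken_encoder()` method takes either `encoding_name` as an argument (e.g. `cl100k_base`) or the `model_name` (e.g. `gpt-4`). All additional arguments such as `chunk_size`, `chunk_overlap`, and `separator` are used to instantiate `CharacterTextSplitter`: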

In [4]:
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base", chunk_size=100, chunk_overlap=0
)
texts = text_splitter.split_text(state_of_the_union)
print(texts[0])

1/22/23, 6:30 PM - User 1: Hi! Im interested in your bag. Im offering $50. Let me know if you are interested. Thanks!
1/22/23, 8:24 PM - User 2: Goodmorning! $50 is too low.
1/23/23, 2:59 AM - User 1: How much do you want?
1/23/23, 3:00 AM - User 2: Online is at least $100
1/23/23, 3:01 AM - User 2: Here is $129
1/23/23, 3:01 AM - User 2: <Media omitted>
1/23/23, 3:01 AM - User 1: Im not interested in this bag. Im interested in the blue one!
1/23/23, 3:02 AM - User 1: I thought you were selling the blue one!
1/23/23, 3:18 AM - User 2: No Im sorry it was my mistake, the blue one is not for sale
1/23/23, 3:19 AM - User 1: Oh no worries! Bye
1/23/23, 3:19 AM - User 2: Bye!
1/23/23, 3:22_AM - User 1: And let me know if anything changes


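Note that if we use `CharacterTextSplitter.from_tiktoken_encoder`, the text is only split by `CharacterTextSplitter`, and the tiktoken tokenizer is used to merge the splits. This means a split can be larger than the chunk size as measured by the tiktoken tokenizer. We can use `RecursiveCharacterTextSplitter.from_tiktoken_encoder` to make sure splits are not larger than the token chunk size allowed by the language model; each split is recursively split again if it is too large: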

In [6]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    model_name="gpt-4",
    chunk_size=100,
    chunk_overlap=0,
)

We can also load a tiktoken splitter directly, which ensures that each split is smaller than the chunk size.

In [7]:
from langchain_text_splitters import TokenTextSplitter

text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)

texts = text_splitter.split_text(state_of_the_union)
print(texts[0])

1/22/23, 6:30 PM


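Some written languages (e.g. Chinese and Japanese) have characters that encode to 2 or more tokens. Using `TokenTextSplitter` directly can split the tokens for a character between two chunks, causing malformed Unicode characters. Use `RecursiveCharacterTextSplitter.from_tiktoken_encoder` or `CharacterTextSplitter.from_tiktoken_encoder` to ensure chunks contain valid Unicode strings.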
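
The failure mode can be reproduced without a tokenizer. Byte-level BPE tokenizers work on UTF-8 bytes, so the sketch below uses a raw byte cut as a stand-in for a token boundary that falls inside a character; it is an analogy under that assumption, not a demonstration of `TokenTextSplitter` itself.

```python
text = "你好"
raw = text.encode("utf-8")   # each of these characters takes 3 bytes
first_part = raw[:4]         # cut in the middle of the second character
decoded = first_part.decode("utf-8", errors="replace")

# The incomplete trailing bytes decode to U+FFFD, the replacement character.
print(len(raw), repr(decoded))
```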

### spaCy
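> spaCy is an open-source software library for advanced natural language processing, written in the programming languages Python and Cython.

Another alternative to NLTK is to use the spaCy tokenizer.
- How the text is split: by the spaCy tokenizer.
- How the chunk size is measured: by number of characters.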

In [8]:
# This is a long document we can split up.
with open("data/whatsapp_chat.txt") as f:
    state_of_the_union = f.read()
    
from langchain_text_splitters import SpacyTextSplitter
text_splitter = SpacyTextSplitter(chunk_size=1000)
texts = text_splitter.split_text(state_of_the_union)
print(texts[0])

OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a Python package or a valid path to a data directory.

### SentenceTransformers
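`SentenceTransformersTokenTextSplitter` is a text splitter intended for use with sentence-transformer models. The default behaviour is to split the text into chunks that fit the token window of the sentence-transformer model you would like to use.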

In [10]:
from langchain_text_splitters import SentenceTransformersTokenTextSplitter
splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=0)
text = "Lorem "
count_start_and_stop_tokens = 2
text_token_count = splitter.count_tokens(text=text) - count_start_and_stop_tokens
print(text_token_count)

2


In [11]:
token_multiplier = splitter.maximum_tokens_per_chunk // text_token_count + 1

# `text_to_split` does not fit in a single chunk
text_to_split = text * token_multiplier

print(f"tokens in text to split: {splitter.count_tokens(text=text_to_split)}")

text_chunks = splitter.split_text(text=text_to_split)

print(text_chunks[1])

tokens in text to split: 388
lorem


### NLTK
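> The Natural Language Toolkit, or more commonly NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for English, written in the Python programming language.

Rather than just splitting on a fixed separator such as "\n\n", we can use NLTK to split based on NLTK tokenizers.
- How the text is split: by the NLTK tokenizer.
- How the chunk size is measured: by number of characters.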

In [12]:
# This is a long document we can split up.
with open("data/whatsapp_chat.txt") as f:
    state_of_the_union = f.read()
    
from langchain_text_splitters import NLTKTextSplitter

text_splitter = NLTKTextSplitter(chunk_size=1000)
texts = text_splitter.split_text(state_of_the_union)
print(texts[0])

1/22/23, 6:30 PM - User 1: Hi!

Im interested in your bag.

Im offering $50.

Let me know if you are interested.

Thanks!

1/22/23, 8:24 PM - User 2: Goodmorning!

$50 is too low.

1/23/23, 2:59 AM - User 1: How much do you want?

1/23/23, 3:00 AM - User 2: Online is at least $100
1/23/23, 3:01 AM - User 2: Here is $129
1/23/23, 3:01 AM - User 2: <Media omitted>
1/23/23, 3:01 AM - User 1: Im not interested in this bag.

Im interested in the blue one!

1/23/23, 3:02 AM - User 1: I thought you were selling the blue one!

1/23/23, 3:18 AM - User 2: No Im sorry it was my mistake, the blue one is not for sale
1/23/23, 3:19 AM - User 1: Oh no worries!

Bye
1/23/23, 3:19 AM - User 2: Bye!

1/23/23, 3:22_AM - User 1: And let me know if anything changes


### KoNLPY
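> KoNLPy: Korean NLP in Python is a Python package for natural language processing (NLP) of the Korean language.

Token splitting involves segmenting text into smaller, more manageable units called tokens. These tokens are often words, phrases, symbols, or other meaningful elements that are crucial for further processing and analysis. In languages like English, token splitting typically involves separating words by spaces and punctuation marks. The effectiveness of token splitting largely depends on the tokenizer's understanding of the language structure, which ensures that meaningful tokens are produced. Since tokenizers designed for English are not equipped to understand the unique semantic structures of other languages, such as Korean, they cannot be used effectively for Korean language processing.

#### Token splitting for Korean with KoNLPy's Kkma Analyzer
For Korean text, KoNLPy includes a morphological analyzer called Kkma (Korean Knowledge Morpheme Analyzer). Kkma provides detailed morphological analysis of Korean text. It breaks sentences down into words and words into their respective morphemes, identifying the part of speech for each token. It can also segment a block of text into individual sentences, which is particularly useful for processing long texts.

#### Usage Considerations
While Kkma is renowned for its detailed analysis, it is important to note that this precision may impact processing speed. Thus, Kkma is best suited for applications where analytical depth is prioritized over rapid text processing.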

In [13]:
with open("data/whatsapp_chat.txt") as f:
    korean_document = f.read()
    
from langchain_text_splitters import KonlpyTextSplitter

text_splitter = KonlpyTextSplitter()
texts = text_splitter.split_text(korean_document)
# The sentences are split with "\n\n" characters.
print(texts[0])

1/22 /23, 6:30 PM - User 1: Hi! Im interested in your bag. Im offering $50. Let me know if you are interested. Thanks! 1/22 /23, 8:24 PM - User 2: Goodmorning! $50 is too low. 1/23 /23, 2:59 AM - User 1: How much do you want? 1/23 /23, 3:00 AM - User 2: Online is at least $100 1/23 /23, 3:01 AM - User 2: Here is $129 1/23 /23, 3:01 AM - User 2: <Media omitted> 1/23 /23, 3:01 AM - User 1: Im not interested in this bag. Im interested in the blue one! 1/23 /23, 3:02 AM - User 1: I thought you were selling the blue one! 1/23 /23, 3:18 AM - User 2: No Im sorry it was my mistake, the blue one is not for sale 1/23 /23, 3:19 AM - User 1: Oh no worries! Bye 1/23 /23, 3:19 AM - User 2: Bye! 1/23 /23, 3:22 _AM - User 1: And let me know if anything changes


### Hugging Face tokenizer
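Hugging Face has many tokenizers.

We use the Hugging Face tokenizer `GPT2TokenizerFast` to count the text length in tokens.
- How the text is split: by the character passed in.
- How the chunk size is measured: by the number of tokens calculated by the Hugging Face tokenizer.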



In [14]:
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# This is a long document we can split up.
with open("data/whatsapp_chat.txt") as f:
    state_of_the_union = f.read()
from langchain_text_splitters import CharacterTextSplitter
text_splitter = CharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer, chunk_size=100, chunk_overlap=0
)
texts = text_splitter.split_text(state_of_the_union)
print(texts[0])

1/22/23, 6:30 PM - User 1: Hi! Im interested in your bag. Im offering $50. Let me know if you are interested. Thanks!
1/22/23, 8:24 PM - User 2: Goodmorning! $50 is too low.
1/23/23, 2:59 AM - User 1: How much do you want?
1/23/23, 3:00 AM - User 2: Online is at least $100
1/23/23, 3:01 AM - User 2: Here is $129
1/23/23, 3:01 AM - User 2: <Media omitted>
1/23/23, 3:01 AM - User 1: Im not interested in this bag. Im interested in the blue one!
1/23/23, 3:02 AM - User 1: I thought you were selling the blue one!
1/23/23, 3:18 AM - User 2: No Im sorry it was my mistake, the blue one is not for sale
1/23/23, 3:19 AM - User 1: Oh no worries! Bye
1/23/23, 3:19 AM - User 2: Bye!
1/23/23, 3:22_AM - User 1: And let me know if anything changes
