LangChain offers three main text-splitting strategies.

# 1 Text structure-based

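Splits along paragraphs, sentences, and words using the `RecursiveCharacterTextSplitter` class, following these principles:
- Keep larger units, such as paragraphs, intact whenever possible.
- If a unit exceeds the size limit, fall back to the next smaller level, for example from splitting by paragraph to splitting by sentence.
- If necessary, fall back further to word-level units.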

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

document = """Your long document text goes here..."""

text_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=0)
texts = text_splitter.split_text(document)
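
The fallback principle listed above can be illustrated in plain Python. This is a minimal sketch, not LangChain's actual implementation: `recursive_split` is a hypothetical helper, it measures length in characters, drops the separators it splits on, and ignores overlap.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into pieces of at most chunk_size characters,
    trying coarse separators first and finer ones only when needed."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut by character count.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        if not part:
            continue  # skip empty pieces produced by repeated separators
        candidate = current + sep + part if current else part
        if len(candidate) <= chunk_size:
            current = candidate  # keep growing the current chunk
            continue
        if current:
            chunks.append(current)
        if len(part) <= chunk_size:
            current = part
        else:
            # This unit alone is too large: retry it with finer separators.
            chunks.extend(recursive_split(part, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = "Para one is short.\n\nPara two is a bit longer than the limit allows."
chunks = recursive_split(sample, chunk_size=30)
print(chunks)
```

The first paragraph fits within the limit and stays whole; the second one does not, so only that paragraph is broken down further at word boundaries.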

# 2 Length-based

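Splits purely by text length, measured in either characters or tokens. (The code below cannot be run as-is; for instance, the second cell expects a local `state_of_the_union.txt` file.)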

In [None]:
from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base", chunk_size=100, chunk_overlap=0
)
texts = text_splitter.split_text(document)

In [None]:
from langchain_text_splitters import TokenTextSplitter

with open("state_of_the_union.txt") as f:
    state_of_the_union = f.read()

text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)

texts = text_splitter.split_text(state_of_the_union)
print(texts[0])
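
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a small sketch of length-based windowing in plain Python. `split_by_length` is a hypothetical helper, and whitespace-separated words stand in for tokens purely as an assumption; a real tokenizer such as tiktoken produces different units.

```python
def split_by_length(units, chunk_size, chunk_overlap=0):
    """Cut a sequence (a string of characters or a list of tokens)
    into windows of chunk_size units, sharing chunk_overlap units."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks, start = [], 0
    while start < len(units):
        chunks.append(units[start:start + chunk_size])
        if start + chunk_size >= len(units):
            break  # this window already reaches the end
        start += step
    return chunks


# Length measured in characters, with one character of overlap.
char_chunks = split_by_length("abcdefghij", chunk_size=4, chunk_overlap=1)
print(char_chunks)  # ['abcd', 'defg', 'ghij']

# Length measured in "tokens" (here: whitespace-separated words).
words = "the quick brown fox jumps".split()
word_chunks = [" ".join(c) for c in split_by_length(words, chunk_size=2)]
print(word_chunks)  # ['the quick', 'brown fox', 'jumps']
```

The same windowing logic applies to both units; only the way the text is turned into a sequence changes.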

# 3 Document structure-based

Targets documents that have an inherent structure, such as HTML, Markdown, and JSON.

- Markdown: Split based on headers (e.g., #, ##, ###)
- HTML: Split using tags
- JSON: Split by object or array elements
- Code: Split by functions, classes, or logical blocks
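
As a rough illustration of the JSON and code cases in the list above, the following sketch uses only the Python standard library rather than LangChain's splitters. The sample data and the helper `split_code_by_definition` are hypothetical.

```python
import ast
import json

# JSON: treat each element of a top-level array as its own chunk.
raw_json = '[{"id": 1, "text": "a"}, {"id": 2, "text": "b"}]'
json_chunks = [json.dumps(item) for item in json.loads(raw_json)]

# Code: treat each top-level function or class as its own chunk.
source = '''
import os

def load(path):
    return open(path).read()

class Parser:
    def parse(self, text):
        return text.split()
'''


def split_code_by_definition(source):
    """Return one chunk per top-level function or class definition."""
    chunks = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "content": ast.get_source_segment(source, node),
            })
    return chunks


code_chunks = split_code_by_definition(source)
print(json_chunks)
print([c["name"] for c in code_chunks])
```

Splitting on structural boundaries like these keeps each chunk self-contained, which is the point of the document structure-based strategy.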

### 3.1 Markdown

Split on headers such as `#` and `##`. For example, given:

md = '# Foo\n\n ## Bar\n\nHi this is Jim  \nHi this is Joe\n\n ## Baz\n\n Hi this is Molly'

we can specify the headers to split on:

[("#", "Header 1"),("##", "Header 2")]

The resulting splits:

{'content': 'Hi this is Jim  \nHi this is Joe', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Bar'}}
{'content': 'Hi this is Molly', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Baz'}}
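
The header tracking behind this result can be sketched in plain Python. This is a simplified illustration, not the library's code: `split_markdown_by_headers` is a hypothetical helper, and it normalizes whitespace, so the content differs slightly in spacing from the output shown above.

```python
def split_markdown_by_headers(text, headers_to_split_on):
    """Group body lines under the headers that are active above them."""
    # Check longer markers first so "##" is never mistaken for "#".
    markers = sorted(headers_to_split_on, key=lambda h: len(h[0]), reverse=True)
    chunks, lines, active = [], [], {}  # active maps level -> (name, title)

    def flush():
        content = "\n".join(lines).strip()
        if content:
            metadata = {name: title for _, (name, title) in sorted(active.items())}
            chunks.append({"content": content, "metadata": metadata})
        lines.clear()

    for raw in text.split("\n"):
        line = raw.strip()
        for marker, name in markers:
            if line.startswith(marker + " "):
                flush()  # a new header closes the chunk collected so far
                level = len(marker)
                # A header replaces any header at the same or a deeper level.
                for stale in [lv for lv in active if lv >= level]:
                    del active[stale]
                active[level] = (name, line[len(marker):].strip())
                break
        else:
            if line:
                lines.append(line)
    flush()
    return chunks


md = '# Foo\n\n ## Bar\n\nHi this is Jim  \nHi this is Joe\n\n ## Baz\n\n Hi this is Molly'
md_chunks = split_markdown_by_headers(md, [("#", "Header 1"), ("##", "Header 2")])
for chunk in md_chunks:
    print(chunk)
```

Each chunk carries the titles of the headers above it as metadata, which is what lets downstream steps filter or group chunks by section.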

In [None]:
from langchain_text_splitters import MarkdownHeaderTextSplitter

markdown_document = "# Foo\n\n    ## Bar\n\nHi this is Jim\n\nHi this is Joe\n\n ### Boo \n\n Hi this is Lance \n\n ## Baz\n\n Hi this is Molly"

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),  # first element: the marker, matched with a trailing space appended
]

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on)
md_header_splits = markdown_splitter.split_text(markdown_document)
md_header_splits # type(md_header_splits[0]) ---> langchain_core.documents.base.Document

[Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar'}, page_content='Hi this is Jim  \nHi this is Joe'),
 Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar', 'Header 3': 'Boo'}, page_content='Hi this is Lance'),
 Document(metadata={'Header 1': 'Foo', 'Header 2': 'Baz'}, page_content='Hi this is Molly')]

In [20]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

markdown_document = "# Foo\n\n    ## Bar\n\nHi this is Jim\n\nHi this is Joe\n\n ### Boo \n\n Hi this is Lance \n\n ## Baz\n\n Hi this is Molly"

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20,
    separators=["## ", "\n\n", "\n", " ", ""],
)
texts = text_splitter.split_text(markdown_document)

print(len(texts), len(texts[0]), texts)

2 79 ['# Foo\n\n    ## Bar\n\nHi this is Jim\n\nHi this is Joe\n\n ### Boo \n\n Hi this is Lance', '## Baz\n\n Hi this is Molly']
