# CharacterTextSplitter

Raw data -> Loader -> Document (one large block of text) -> TextSplitter -> Chunks (many smaller blocks of text) -> LLM Tokenizer -> Tokens

In [8]:

from langchain_text_splitters import CharacterTextSplitter

# separator="" splits on every character, so chunks are cut purely by length
text_splitter = CharacterTextSplitter(
    chunk_size=35,
    chunk_overlap=4,
    separator="",
)

text = "This is the text I would like to chunk up. It is the example text for this exercise"
text_splitter.create_documents([text])

[Document(metadata={}, page_content='This is the text I would like to ch'),
 Document(metadata={}, page_content='o chunk up. It is the example text'),
 Document(metadata={}, page_content='ext for this exercise')]
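
The output above suggests how `chunk_size` and `chunk_overlap` interact when the separator is empty: each chunk starts `chunk_size - chunk_overlap` characters after the previous one. The following is a minimal sketch of that idea in plain Python. The function name `sliding_window_chunks` is hypothetical, and this is a simplified model rather than the library's actual implementation.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size character windows that overlap (simplified model)."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and < chunk_size")
    step = chunk_size - chunk_overlap  # distance between window starts
    chunks = []
    for start in range(0, len(text), step):
        # Strip surrounding whitespace, as the outputs above appear to do.
        window = text[start:start + chunk_size].strip()
        if window:
            chunks.append(window)
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


text = "This is the text I would like to chunk up. It is the example text for this exercise"
for chunk in sliding_window_chunks(text, chunk_size=35, chunk_overlap=4):
    print(repr(chunk))
```

For this input the sketch yields the same three chunks as the cell above; other inputs or separators may behave differently in the real splitter.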

In [9]:
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=35,
    chunk_overlap=4,
)

text = "This is\n the text I would\n like to chunk up.It is the example text for this exercise"
text_splitter.create_documents([text])

[Document(metadata={}, page_content='This is\n the text I would'),
 Document(metadata={}, page_content='like to chunk up.It is the example text for this exercise')]
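
Note that the second chunk above is longer than `chunk_size`: with a single separator, a piece that contains no `"\n"` cannot be cut any further. A rough sketch of the split-then-merge idea follows. The function name `split_on_separator` is hypothetical, overlap handling is deliberately left out, and this is not the library's code.

```python
def split_on_separator(text, separator, chunk_size):
    """Split on one separator, then greedily merge pieces up to chunk_size.

    Simplified: no overlap, and an oversized piece is kept whole.
    """
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], []
    for piece in pieces:
        candidate = separator.join(current + [piece])
        if current and len(candidate) > chunk_size:
            # Adding this piece would overflow, so close the current chunk.
            chunks.append(separator.join(current).strip())
            current = [piece]
        else:
            current.append(piece)
    if current:
        chunks.append(separator.join(current).strip())
    return chunks


text = "This is\n the text I would\n like to chunk up.It is the example text for this exercise"
chunks = split_on_separator(text, "\n", chunk_size=35)
print([len(c) for c in chunks])
```

The printed lengths make the oversized chunk easy to spot, which is one motivation for the recursive splitter in the next section.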

# Recursively split by character

In [10]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],  # tried in order, coarsest first
    chunk_size=35,
    chunk_overlap=4,
)

text = "This is\n\n the\n\n text\n I would like\n to chunk up. It is the example text\n for\n\n this exercise"
text_splitter.create_documents([text])

[Document(metadata={}, page_content='This is\n\n the'),
 Document(metadata={}, page_content='text\n I would like'),
 Document(metadata={}, page_content='to chunk up. It is the example'),
 Document(metadata={}, page_content='text'),
 Document(metadata={}, page_content='for'),
 Document(metadata={}, page_content='this exercise')]
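
The recursive strategy can be sketched as: split on the coarsest separator, merge small pieces, and re-split any piece that is still too long with the next separator. The code below is a simplified illustration under those assumptions. The names `recursive_split` and `_split` are hypothetical, overlap is ignored, and the exact chunk boundaries may differ from what the library produces.

```python
def _split(text, separators, chunk_size):
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        if not piece:
            continue
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Still too long: retry with the finer separators.
            chunks.extend(_split(piece, rest, chunk_size))
    if current:
        chunks.append(current)
    return chunks


def recursive_split(text, separators, chunk_size):
    """Simplified recursive splitter (no overlap), returning stripped chunks."""
    return [c.strip() for c in _split(text, separators, chunk_size) if c.strip()]


text = "This is\n\n the\n\n text\n I would like\n to chunk up. It is the example text\n for\n\n this exercise"
for chunk in recursive_split(text, ["\n\n", "\n", " ", ""], chunk_size=35):
    print(repr(chunk))
```

Unlike the single-separator case, every chunk here stays within `chunk_size`, because oversized pieces fall through to finer separators.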

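# Markdown splitting
Each chunk records its heading hierarchy in `metadata`, while its `page_content` generally holds only the text under the current heading.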
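
To make the idea concrete, here is a small plain-Python sketch of header-based splitting: it tracks the active headings and attaches them as metadata to the text that follows. The function name `split_by_headers` and the dict-based result are assumptions for illustration; this is not how `MarkdownHeaderTextSplitter` is implemented.

```python
def split_by_headers(markdown, headers_to_split_on):
    """Group text under its headings; headings not listed stay in the content."""
    # Check longer markers first so "##" is never mistaken for "#".
    markers = sorted(headers_to_split_on, key=lambda h: len(h[0]), reverse=True)
    active = {}  # heading level -> (metadata key, heading title)
    sections, body = [], []

    def flush():
        content = "\n".join(body).strip()
        if content:
            metadata = {name: title for _, (name, title) in sorted(active.items())}
            sections.append({"metadata": metadata, "content": content})
        body.clear()

    for raw in markdown.split("\n"):
        line = raw.strip()
        for marker, name in markers:
            if line.startswith(marker + " "):
                flush()
                level = len(marker)
                # A new heading closes headings at the same or a deeper level.
                for old in [lvl for lvl in active if lvl >= level]:
                    del active[old]
                active[level] = (name, line[len(marker):].strip())
                break
        else:
            if line:
                body.append(line)
    flush()
    return sections


doc = "# Intro\n\n## History\n\nFirst para.\n\n## Usage\n\nSecond para.\n\n#### Deep\n\nThird."
for section in split_by_headers(doc, [("#", "Header 1"), ("##", "Header 2")]):
    print(section)
```

In this sketch the `#### Deep` line remains inside the content of the `Usage` section, because only the listed markers trigger a split.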


In [11]:
from langchain_text_splitters import MarkdownHeaderTextSplitter

markdown_document = "# Intro \n\n    ## History \n\n Markdown[9] is a lightweight markup language for creating formatted text using a plain-text editor. John Gruber created Markdown in 2004 as a markup language that is appealing to human readers in its source code form.[9] \n\n Markdown is widely used in blogging, instant messaging, online forums, collaborative software, documentation pages, and readme files. \n\n ## Rise and divergence \n\n As Markdown popularity grew rapidly, many Markdown implementations appeared, driven mostly by the need for \n\n additional features such as tables, footnotes, definition lists,[note 1] and Markdown inside HTML blocks. \n\n #### Standardization \n\n From 2012, a group of people, including Jeff Atwood and John MacFarlane, launched what Atwood characterised as a standardisation effort. \n\n ## Implementations \n\n Implementations of Markdown are available for over a dozen programming languages."

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
md_header_splits = markdown_splitter.split_text(markdown_document)
md_header_splits

[Document(metadata={'Header 1': 'Intro', 'Header 2': 'History'}, page_content='Markdown[9] is a lightweight markup language for creating formatted text using a plain-text editor. John Gruber created Markdown in 2004 as a markup language that is appealing to human readers in its source code form.[9]  \nMarkdown is widely used in blogging, instant messaging, online forums, collaborative software, documentation pages, and readme files.'),
 Document(metadata={'Header 1': 'Intro', 'Header 2': 'Rise and divergence'}, page_content='As Markdown popularity grew rapidly, many Markdown implementations appeared, driven mostly by the need for  \nadditional features such as tables, footnotes, definition lists,[note 1] and Markdown inside HTML blocks.  \n#### Standardization  \nFrom 2012, a group of people, including Jeff Atwood and John MacFarlane, launched what Atwood characterised as a standardisation effort.'),
 Document(metadata={'Header 1': 'Intro', 'Header 2': 'Implementations'}, page_content=

In [12]:
%pip install --quiet langchain_experimental

Note: you may need to restart the kernel to use updated packages.


# Semantic Chunking
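Semantic chunking first converts the text into embedding vectors, then splits wherever the difference between neighboring pieces is large.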
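
The idea can be illustrated without any model. In the sketch below, `embed` is a toy bag-of-words stand-in for a real embedding model, and `semantic_chunks` and the `threshold` value are hypothetical names and settings chosen for illustration. A real setup would use an actual embedding model, for example through the `SemanticChunker` class in `langchain_experimental`.

```python
import math
import re
from collections import Counter


def embed(sentence):
    """Toy embedding (assumption): word counts instead of a learned vector."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine_distance(a, b):
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0


def semantic_chunks(text, threshold=0.8):
    """Start a new chunk when adjacent sentences differ by more than threshold."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine_distance(embed(prev), embed(cur)) > threshold:
            chunks.append([cur])    # large difference: start a new chunk
        else:
            chunks[-1].append(cur)  # similar enough: extend the current chunk
    return [" ".join(chunk) for chunk in chunks]


text = ("Cats are small pets. Cats are playful pets. "
        "Stock markets fell sharply. Stock markets fell again today.")
for chunk in semantic_chunks(text):
    print(repr(chunk))
```

With this toy embedding, the two sentences about cats share most of their words and stay together, while the jump to stock markets shares none and triggers a split.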