## Document Splitting
***
- Split by length
- Split by text structure (sentences, paragraphs)
- Split by document format
- Split by semantics

#### Length-Based Splitting
****

In [None]:
! pip install langchain-text-splitters langchain-community pypdf tiktoken

In [None]:
file_path='deepseek.pdf'

In [None]:
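from langchain_community.document_loaders import PyPDFLoader

# Load the PDF lazily, one Document per page.
# Top-level `async for` works in a notebook cell; in a plain script,
# iterate over `loader.lazy_load()` with a regular for loop instead.
loader = PyPDFLoader(file_path)
pages = []
async for page in loader.alazy_load():
    pages.append(page)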

In [5]:
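from langchain_text_splitters import CharacterTextSplitter

# Measure chunk length in tokens (tiktoken's cl100k_base encoding), not characters.
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base", chunk_size=50, chunk_overlap=10
)
# CharacterTextSplitter splits only on its separator ("\n\n" by default), so a page
# without blank lines comes back as a single chunk longer than chunk_size.
texts = text_splitter.split_text(pages[1].page_content)
print(texts)
# create_documents wraps each chunk in a Document object.
docs = text_splitter.create_documents([pages[2].page_content, pages[3].page_content])
print(docs)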

['2. 算⼒需求分析\n模型 参数规\n模\n计算精\n度\n最低显存需\n求 最低算⼒需求\nDeepSeek-R1 (671B)671B FP8 ≥890GB 2*XE9680（16*H20\nGPU）\nDeepSeek-R1-Distill-\n70B 70B BF16 ≥180GB 4*L20 或 2*H20 GPU\n三、国产芯⽚与硬件适配⽅案\n1. 国内⽣态合作伙伴动态\n企业 适配内容 性能对标（vs\nNVIDIA）\n华为昇\n腾\n昇腾910B原⽣⽀持R1全系列，提供端到端推理优化\n⽅案 等效A100（FP16）\n沐曦\nGPU\nMXN系列⽀持70B模型BF16推理，显存利⽤率提升\n30% 等效RTX 3090\n海光\nDCU 适配V3/R1模型，性能对标NVIDIA A100 等效A100（BF16）\n2. 国产硬件推荐配置\n模型参数 推荐⽅案 适⽤场景\n1.5B 太初T100加速卡 个⼈开发者原型验证\n14B 昆仑芯K200集群 企业级复杂任务推理\n32B 壁彻算⼒平台+昇腾910B集群 科研计算与多模态处理\n四、云端部署替代⽅案\n1. 国内云服务商推荐\n平台 核⼼优势 适⽤场景']
[Document(metadata={}, page_content='硅基流动 官⽅推荐API，低延迟，⽀持多模态模型 企业级⾼并发推理\n腾讯云 ⼀键部署+限时免费体验，⽀持VPC私有化 中⼩规模模型快速上线\nPPIO派欧云 价格仅为OpenAI 1/20，注册赠5000万tokens 低成本尝鲜与测试\n2. 国际接⼊渠道（需魔法或外企上⽹环境\n!\n）\n英伟达NIM：企业级GPU集群部署（链接）\nGroq：超低延迟推理（链接）\n五、完整671B MoE模型部署（Ollama+Unsloth）\n1. 量化⽅案与模型选择\n量化版本 ⽂件体\n积\n最低内存+显存需\n求 适⽤场景\nDeepSeek-R1-UD-\nIQ1_M 158 GB ≥200 GB 消费级硬件（如Mac\nStudio）\nDeepSeek-R1-Q4_K_M 404 GB ≥500 GB ⾼性能服务器/云GPU\n下载地址：\nHuggingFace模型库\nUnsloth AI官⽅说明\n2. 硬件配置建议\n硬件类型 推荐配置 性
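
The idea behind length-based splitting can be sketched without any library: slide a fixed-size window over the text and advance by `chunk_size - chunk_overlap` each step. The helper name `split_by_length` below is hypothetical, and it counts characters rather than tiktoken tokens, so treat it as a simplified illustration and not as LangChain's implementation.

```python
def split_by_length(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Cut text into windows of at most chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


print(split_by_length("abcdefghij", chunk_size=4, chunk_overlap=1))
# → ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, which is what the overlap is for: text cut at a boundary still appears with some context in the next chunk.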

#### Text-Structure-Based Splitting
****

In [6]:
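from langchain_text_splitters import RecursiveCharacterTextSplitter

# chunk_size counts characters here. The splitter tries the separators
# "\n\n", "\n", " ", "" in turn until each piece fits.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=50, chunk_overlap=0)
texts = text_splitter.split_text(pages[1].page_content)
print(texts)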

['2. 算⼒需求分析\n模型 参数规\n模\n计算精\n度\n最低显存需\n求 最低算⼒需求', 'DeepSeek-R1 (671B)671B FP8 ≥890GB 2*XE9680（16*H20', 'GPU）\nDeepSeek-R1-Distill-', '70B 70B BF16 ≥180GB 4*L20 或 2*H20 GPU', '三、国产芯⽚与硬件适配⽅案\n1. 国内⽣态合作伙伴动态\n企业 适配内容 性能对标（vs', 'NVIDIA）\n华为昇\n腾\n昇腾910B原⽣⽀持R1全系列，提供端到端推理优化', '⽅案 等效A100（FP16）\n沐曦\nGPU\nMXN系列⽀持70B模型BF16推理，显存利⽤率提升', '30% 等效RTX 3090\n海光', 'DCU 适配V3/R1模型，性能对标NVIDIA A100 等效A100（BF16）', '2. 国产硬件推荐配置\n模型参数 推荐⽅案 适⽤场景', '1.5B 太初T100加速卡 个⼈开发者原型验证\n14B 昆仑芯K200集群 企业级复杂任务推理', '32B 壁彻算⼒平台+昇腾910B集群 科研计算与多模态处理\n四、云端部署替代⽅案', '1. 国内云服务商推荐\n平台 核⼼优势 适⽤场景']
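
To make the recursive strategy concrete, here is a minimal sketch, assuming a separator hierarchy of paragraph, line, word, and finally a hard character cut. The function `recursive_split` is a hypothetical name; it omits overlap and other details of the real splitter, so it only illustrates the principle.

```python
def recursive_split(
    text: str,
    chunk_size: int,
    separators: tuple[str, ...] = ("\n\n", "\n", " ", ""),
) -> list[str]:
    """Split on the coarsest separator first; recurse with finer ones for oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: no separator left, cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces into one chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = "alpha beta\ngamma delta epsilon\n\nzeta"
print(recursive_split(sample, chunk_size=12))
# → ['alpha beta', 'gamma delta', 'epsilon', 'zeta']
```

Because coarse separators are tried first, paragraphs and lines stay intact whenever they fit, and only oversized pieces are broken at word or character level.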


#### Document-Format-Based Splitting
****
- Markdown: split on headings (for example, #, ##, ###)
- JSON: split by object or array element

In [None]:
! pip install -qU langchain-text-splitters

Splitting by Markdown format

In [7]:
from langchain_text_splitters import MarkdownHeaderTextSplitter
markdown_document = "# Foo\n\n    ## Bar\n\nHi this is Jim\n\nHi this is Joe\n\n ### Boo \n\n Hi this is Lance \n\n ## Baz\n\n Hi this is Molly"

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on)
md_header_splits = markdown_splitter.split_text(markdown_document)
md_header_splits

[Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar'}, page_content='Hi this is Jim  \nHi this is Joe'),
 Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar', 'Header 3': 'Boo'}, page_content='Hi this is Lance'),
 Document(metadata={'Header 1': 'Foo', 'Header 2': 'Baz'}, page_content='Hi this is Molly')]
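
The grouping shown in the output above can be approximated in plain Python: track the most recent heading at each level and attach that stack as metadata to the body text that follows. The function `split_markdown_by_headers` is a hypothetical name, and this sketch strips indentation and ignores fenced code blocks, so it is a simplification and not the library's actual logic.

```python
import re


def split_markdown_by_headers(markdown: str, max_level: int = 3) -> list[dict]:
    """Group body lines under the active headings; return content plus heading metadata."""
    header_re = re.compile(r"^(#{1,6})\s+(.*)$")
    sections: list[dict] = []
    current_headers: dict[int, str] = {}  # heading level -> heading title
    buffer: list[str] = []

    def flush() -> None:
        body = "\n".join(buffer).strip()
        if body:
            metadata = {f"Header {lvl}": title for lvl, title in sorted(current_headers.items())}
            sections.append({"metadata": metadata, "content": body})
        buffer.clear()

    for raw in markdown.splitlines():
        line = raw.strip()
        match = header_re.match(line)
        if match and len(match.group(1)) <= max_level:
            flush()  # close the section that ran up to this heading
            level = len(match.group(1))
            # A new heading replaces headings at the same or a deeper level.
            for lvl in [l for l in current_headers if l >= level]:
                del current_headers[lvl]
            current_headers[level] = match.group(2).strip()
        elif line:
            buffer.append(line)
    flush()
    return sections


doc = "# Foo\n\n## Bar\n\nHi this is Jim\n\n### Boo\n\nHi this is Lance\n\n## Baz\n\nHi this is Molly"
for section in split_markdown_by_headers(doc):
    print(section)
```

Keeping the heading path as metadata lets a retriever show where a chunk came from even though the heading text is no longer part of the chunk body.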

Splitting by JSON format

In [8]:
import requests

# Fetch a large nested JSON document (the LangSmith OpenAPI schema) to split.
json_data = requests.get("https://api.smith.langchain.com/openapi.json").json()

In [9]:
from langchain_text_splitters import RecursiveJsonSplitter

splitter = RecursiveJsonSplitter(max_chunk_size=300)
json_chunks = splitter.split_json(json_data=json_data)

for chunk in json_chunks[:3]:
    print(chunk)

{'openapi': '3.1.0', 'info': {'title': 'LangSmith', 'version': '0.1.0'}, 'paths': {'/api/v1/sessions/{session_id}': {'get': {'tags': ['tracer-sessions'], 'summary': 'Read Tracer Session', 'description': 'Get a specific session.'}}}}
{'paths': {'/api/v1/sessions/{session_id}': {'get': {'operationId': 'read_tracer_session_api_v1_sessions__session_id__get', 'security': [{'API Key': []}, {'Tenant ID': []}, {'Bearer Auth': []}]}}}}
{'paths': {'/api/v1/sessions/{session_id}': {'get': {'parameters': [{'name': 'session_id', 'in': 'path', 'required': True, 'schema': {'type': 'string', 'format': 'uuid', 'title': 'Session Id'}}, {'name': 'include_stats', 'in': 'query', 'required': False, 'schema': {'type': 'boolean', 'default': False, 'title': 'Include Stats'}}, {'name': 'accept', 'in': 'header', 'required': False, 'schema': {'anyOf': [{'type': 'string'}, {'type': 'null'}], 'title': 'Accept'}}]}}}}


In [10]:
# Generate LangChain Document objects
docs = splitter.create_documents(texts=[json_data])

for doc in docs[:3]:
    print(doc)

page_content='{"openapi": "3.1.0", "info": {"title": "LangSmith", "version": "0.1.0"}, "paths": {"/api/v1/sessions/{session_id}": {"get": {"tags": ["tracer-sessions"], "summary": "Read Tracer Session", "description": "Get a specific session."}}}}'
page_content='{"paths": {"/api/v1/sessions/{session_id}": {"get": {"operationId": "read_tracer_session_api_v1_sessions__session_id__get", "security": [{"API Key": []}, {"Tenant ID": []}, {"Bearer Auth": []}]}}}}'
page_content='{"paths": {"/api/v1/sessions/{session_id}": {"get": {"parameters": [{"name": "session_id", "in": "path", "required": true, "schema": {"type": "string", "format": "uuid", "title": "Session Id"}}, {"name": "include_stats", "in": "query", "required": false, "schema": {"type": "boolean", "default": false, "title": "Include Stats"}}, {"name": "accept", "in": "header", "required": false, "schema": {"anyOf": [{"type": "string"}, {"type": "null"}], "title": "Accept"}}]}}}}'
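
The chunks above each repeat the key path down to the data they carry. A minimal sketch of that behavior, assuming a greedy depth-first packing strategy: add each key to the current chunk if the serialized result still fits, otherwise descend into the value or start a new chunk. The names `set_nested` and `split_json` are hypothetical, and the real splitter differs in details such as minimum chunk size and list handling.

```python
import copy
import json


def set_nested(target: dict, path: list, value) -> None:
    """Write value into target at the nested key path, creating dicts as needed."""
    for key in path[:-1]:
        target = target.setdefault(key, {})
    target[path[-1]] = value


def split_json(data: dict, max_chunk_size: int) -> list:
    """Greedily pack nested keys into chunks whose JSON text fits max_chunk_size."""
    if not data:
        return []
    chunks = [{}]

    def walk(node: dict, path: list) -> None:
        for key, value in node.items():
            new_path = path + [key]
            trial = copy.deepcopy(chunks[-1])
            set_nested(trial, new_path, value)
            if len(json.dumps(trial)) <= max_chunk_size:
                chunks[-1] = trial  # the whole value fits in the current chunk
            elif isinstance(value, dict) and value:
                walk(value, new_path)  # too big: descend and place its children
            else:
                # Indivisible value (scalar, list, or empty dict): start a new chunk.
                # A single value larger than max_chunk_size is kept whole.
                if chunks[-1]:
                    chunks.append({})
                set_nested(chunks[-1], new_path, value)

    walk(data, [])
    return chunks


data = {
    "info": {"title": "Demo", "version": "1.0"},
    "paths": {
        "/a": {"get": {"summary": "Read A"}},
        "/b": {"get": {"summary": "Read B"}},
    },
}
for chunk in split_json(data, max_chunk_size=60):
    print(chunk)
```

Every chunk stays a valid JSON object, and merging the chunks key by key reproduces the original document.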


#### Semantic Splitting
*****

In [None]:
! pip install --quiet langchain_experimental langchain_openai

In [11]:
# The file contains emoji, so read it explicitly as UTF-8.
with open("meow.txt", encoding="utf-8") as f:
    meow = f.read()

In [12]:
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings
# Embed the text with OpenAIEmbeddings (reads the OPENAI_API_KEY environment variable)
text_splitter = SemanticChunker(OpenAIEmbeddings())

In [13]:
docs = text_splitter.create_documents([meow])
print(docs[0].page_content)

meow meow🐱 
 meow meow🐱 
 meow😻😻
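
The principle behind semantic splitting is to embed neighboring sentences and start a new chunk wherever the distance between them jumps. The sketch below illustrates this with a toy bag-of-words vector standing in for a real embedding model and a fixed distance threshold. The names `embed`, `cosine_distance`, and `semantic_split`, and the threshold value, are assumptions for illustration and do not reproduce how `SemanticChunker` picks its breakpoints.

```python
import math
import re
from collections import Counter


def embed(sentence: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"\w+", sentence.lower()))


def cosine_distance(a: Counter, b: Counter) -> float:
    """Return 1 - cosine similarity of two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - (dot / norm if norm else 0.0)


def semantic_split(text: str, threshold: float = 0.8) -> list[str]:
    """Start a new chunk when adjacent sentences are farther apart than threshold."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks = [[sentences[0]]]
    for prev, cur, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine_distance(prev, cur) > threshold:
            chunks.append([sentence])  # topic shift: open a new chunk
        else:
            chunks[-1].append(sentence)
    return [" ".join(chunk) for chunk in chunks]


text = (
    "Cats purr when happy. Cats also purr when nervous. "
    "GPUs need lots of memory. More memory lets GPUs run bigger models."
)
for chunk in semantic_split(text):
    print(chunk)
```

With a real embedding model the comparison works on meaning instead of shared words, but the breakpoint logic follows the same pattern.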
