# Recursively split JSON

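This JSON splitter traverses the JSON data depth first and builds smaller JSON chunks. It tries to keep nested JSON objects whole, but splits them when needed to keep chunk sizes between min_chunk_size and max_chunk_size. If a value is not nested JSON but a very large string, the string is not split, so a chunk can exceed max_chunk_size. If you need a hard cap on chunk size, run a recursive text splitter over those chunks afterwards. Lists are not split by default; an optional preprocessing step first converts them to JSON objects (dicts) and then splits them like any other dict.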

1. How the text is split: by JSON value.
2. How the chunk size is measured: by number of characters.
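
The two points above can be illustrated with a minimal, standard-library sketch of a depth-first splitter. It is not the library's implementation: the names `json_size`, `set_nested`, and `split_sketch` are hypothetical, and the size check deliberately ignores the parent keys that are repeated on each chunk. It only shows the idea that an entry is kept whole when it fits, that nested dicts are descended into when it does not, and that size is the character count of the serialized JSON.

```python
import json
from typing import Any


def json_size(data: Any) -> int:
    """Chunk size is the number of characters in the serialized JSON."""
    return len(json.dumps(data))


def set_nested(target: dict, path: list, value: Any) -> None:
    """Store value under the given key path, creating parent dicts as needed."""
    for key in path[:-1]:
        target = target.setdefault(key, {})
    target[path[-1]] = value


def split_sketch(data: dict, max_chunk_size: int, min_chunk_size: int) -> list:
    """Hypothetical depth-first split that repeats the key path on every chunk."""
    chunks = [{}]

    def walk(node: dict, path: list) -> None:
        for key, value in node.items():
            current = json_size(chunks[-1])
            # Size of this entry on its own (parent keys are not counted here).
            if json_size({key: value}) < max_chunk_size - current:
                # The entry fits into the current chunk, so keep it whole.
                set_nested(chunks[-1], path + [key], value)
                continue
            if current >= min_chunk_size:
                # The current chunk is large enough, so start a new one.
                chunks.append({})
            if isinstance(value, dict) and value:
                # Too large but nested: descend and split its children.
                walk(value, path + [key])
            else:
                # Strings, numbers, and lists are stored whole, never cut.
                set_nested(chunks[-1], path + [key], value)

    walk(data, [])
    return [chunk for chunk in chunks if chunk]


data = {"a": {"x": "1" * 30, "y": "2" * 30}, "b": "3" * 30}
for chunk in split_sketch(data, max_chunk_size=50, min_chunk_size=40):
    print(json_size(chunk), chunk)
```

Because each chunk carries its full key path, any chunk can be read on its own without the others.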

In [2]:
import json

import requests

In [3]:
from pprint import pprint

# This is a large nested json object and will be loaded as a python dict
json_data = requests.get("https://api.smith.langchain.com/openapi.json").json()
pprint(json_data)

{'components': {'schemas': {'APIFeedbackSource': {'description': 'API feedback '
                                                                 'source.',
                                                  'properties': {'metadata': {'anyOf': [{'type': 'object'},
                                                                                        {'type': 'null'}],
                                                                              'title': 'Metadata'},
                                                                 'type': {'default': 'api',
                                                                          'title': 'Type',
                                                                          'type': 'string'}},
                                                  'title': 'APIFeedbackSource',
                                                  'type': 'object'},
                            'APIKeyCreateRequest': {'description': 'API key '
                        

In [4]:
from langchain_text_splitters import RecursiveJsonSplitter

In [5]:
splitter = RecursiveJsonSplitter(max_chunk_size=300)

In [8]:
# Recursively split the json data. Use split_json if you need to access or manipulate the smaller json chunks as dicts
json_chunks = splitter.split_json(json_data=json_data)
pprint(json_chunks)

[{'info': {'title': 'LangSmith', 'version': '0.1.0'},
  'openapi': '3.1.0',
  'servers': [{'description': 'LangSmith API endpoint.',
               'url': 'https://api.smith.langchain.com'}]},
 {'paths': {'/api/v1/sessions/{session_id}': {'get': {'description': 'Get a '
                                                                     'specific '
                                                                     'session.',
                                                      'operationId': 'read_tracer_session_api_v1_sessions__session_id__get',
                                                      'summary': 'Read Tracer '
                                                                 'Session',
                                                      'tags': ['tracer-sessions']}}}},
 {'paths': {'/api/v1/sessions/{session_id}': {'get': {'security': [{'API Key': []},
                                                                   {'Tenant ID': []},
                             

In [9]:
# The splitter can also output documents
docs = splitter.create_documents(texts=[json_data])

# or a list of strings
texts = splitter.split_text(json_data=json_data)

print(texts[0])
print(texts[1])

{"openapi": "3.1.0", "info": {"title": "LangSmith", "version": "0.1.0"}, "servers": [{"url": "https://api.smith.langchain.com", "description": "LangSmith API endpoint."}]}
{"paths": {"/api/v1/sessions/{session_id}": {"get": {"tags": ["tracer-sessions"], "summary": "Read Tracer Session", "description": "Get a specific session.", "operationId": "read_tracer_session_api_v1_sessions__session_id__get"}}}}


In [10]:
# Let's look at the size of the chunks
print([len(text) for text in texts][:10])

# Looking at one of the larger chunks, we see that it contains a list
print(texts[3])

[171, 231, 126, 469, 210, 213, 237, 271, 191, 232]
{"paths": {"/api/v1/sessions/{session_id}": {"get": {"parameters": [{"name": "session_id", "in": "path", "required": true, "schema": {"type": "string", "format": "uuid", "title": "Session Id"}}, {"name": "include_stats", "in": "query", "required": false, "schema": {"type": "boolean", "default": false, "title": "Include Stats"}}, {"name": "accept", "in": "header", "required": false, "schema": {"anyOf": [{"type": "string"}, {"type": "null"}], "title": "Accept"}}]}}}}
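
The 469-character chunk above exceeds `max_chunk_size=300`. When a hard cap matters, oversized strings can be split again as plain text. The sketch below is a deliberately naive stand-in for a recursive text splitter: the helper `hard_cap` is hypothetical, it cuts at fixed character offsets, and the resulting pieces of an oversized chunk are no longer valid JSON on their own.

```python
def hard_cap(texts: list, max_chars: int) -> list:
    """Return the strings unchanged when they fit, otherwise cut them into fixed-size pieces."""
    capped = []
    for text in texts:
        if len(text) <= max_chars:
            capped.append(text)
        else:
            # Slice the oversized string into consecutive windows of max_chars.
            capped.extend(
                text[start:start + max_chars]
                for start in range(0, len(text), max_chars)
            )
    return capped


# Example strings standing in for the output of split_text.
example_texts = ["a" * 10, "b" * 25]
print([len(piece) for piece in hard_cap(example_texts, max_chars=10)])
```

A real text splitter would prefer natural boundaries such as separators over fixed offsets; the fixed slicing here only demonstrates the cap itself.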


In [11]:
# By default, the json splitter does not split lists.
# Passing convert_lists=True preprocesses the json, converting each list to a dict with index:item as key:val pairs
texts = splitter.split_text(json_data=json_data, convert_lists=True)
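
The list-to-dict preprocessing described in the comment above can be sketched in a few lines of plain Python. The function name `list_to_dict_sketch` is hypothetical, and using the stringified index as the key is an assumption made for illustration (JSON object keys are strings); this is not presented as the library's own code.

```python
from typing import Any


def list_to_dict_sketch(data: Any) -> Any:
    """Recursively replace every list with a dict keyed by the item's index."""
    if isinstance(data, dict):
        return {key: list_to_dict_sketch(value) for key, value in data.items()}
    if isinstance(data, list):
        return {str(index): list_to_dict_sketch(item) for index, item in enumerate(data)}
    # Scalars (strings, numbers, booleans, None) are returned unchanged.
    return data


params = {"parameters": [{"name": "session_id"}, {"name": "include_stats"}]}
print(list_to_dict_sketch(params))
# {'parameters': {'0': {'name': 'session_id'}, '1': {'name': 'include_stats'}}}
```

Once a list is a dict, its items are ordinary key/value entries, so a dict-based splitter can place them in separate chunks while each chunk still carries the index of the item it holds.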

In [12]:
# Let's look at the size of the chunks. Now they are all under the max
print([len(text) for text in texts][:10])

[171, 231, 126, 469, 210, 213, 237, 271, 191, 232]


In [13]:
# The list has been converted to a dict, but retains all the needed contextual information even if split into many chunks
print(texts[3])

{"paths": {"/api/v1/sessions/{session_id}": {"get": {"parameters": [{"name": "session_id", "in": "path", "required": true, "schema": {"type": "string", "format": "uuid", "title": "Session Id"}}, {"name": "include_stats", "in": "query", "required": false, "schema": {"type": "boolean", "default": false, "title": "Include Stats"}}, {"name": "accept", "in": "header", "required": false, "schema": {"anyOf": [{"type": "string"}, {"type": "null"}], "title": "Accept"}}]}}}}


In [14]:
# We can also look at the documents
docs[1]

Document(page_content='{"paths": {"/api/v1/sessions/{session_id}": {"get": {"tags": ["tracer-sessions"], "summary": "Read Tracer Session", "description": "Get a specific session.", "operationId": "read_tracer_session_api_v1_sessions__session_id__get"}}}}')