## How to split JSON data

This JSON splitter splits JSON data while allowing control over chunk sizes. It traverses the data depth-first and builds smaller JSON chunks. It attempts to keep nested JSON objects whole, but will split them if needed to keep the chunk size between `min_chunk_size` and `max_chunk_size`.

If a value is not a nested JSON object but a very large string, the string will not be split. If you need a hard cap on the chunk size, consider following this splitter with a recursive text splitter (`RecursiveCharacterTextSplitter`) on those chunks. There is also an optional pre-processing step for splitting lists: they are first converted to JSON objects (dicts) and then split as such.
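To make the list pre-processing step concrete, here is a minimal sketch of one way to turn lists into dicts so they can be split like any other nested object. The helper name `lists_to_dicts` and the choice of string indices as keys are assumptions for illustration; the library's own conversion may differ in detail.

```python
def lists_to_dicts(data):
    """Recursively replace each list with a dict keyed by the item's index."""
    if isinstance(data, dict):
        return {key: lists_to_dicts(value) for key, value in data.items()}
    if isinstance(data, list):
        # Index keys are strings so the result is still valid JSON.
        return {str(i): lists_to_dicts(item) for i, item in enumerate(data)}
    return data


converted = lists_to_dicts({"tags": ["tracer-sessions", "runs"], "count": 2})
print(converted)
```

After this conversion a long list is no longer an indivisible leaf, so its items can be distributed across several chunks.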

- How the text is split: by JSON value.
- How the chunk size is measured: by number of characters.
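Because size is counted in characters, a quick way to inspect chunks is to measure their serialized length. The sketch below assumes the size of a chunk is the length of its `json.dumps` output; the helper names `chunk_sizes` and `oversized` are hypothetical.

```python
import json


def chunk_sizes(chunks):
    """Character count of each chunk once serialized to a JSON string."""
    return [len(json.dumps(chunk)) for chunk in chunks]


def oversized(chunks, max_chunk_size):
    """Chunks whose serialized form exceeds the limit, e.g. one long string."""
    return [chunk for chunk in chunks if len(json.dumps(chunk)) > max_chunk_size]


example_chunks = [{"a": "x"}, {"b": "y" * 20}]
print(chunk_sizes(example_chunks))
print(oversized(example_chunks, max_chunk_size=15))
```

Any chunk returned by `oversized` is a candidate for a second pass with a text splitter if a hard cap is required.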

In [2]:
import json
import requests

json_data_url = 'https://api.smith.langchain.com/openapi.json'
json_data = requests.get(json_data_url).json()

In [3]:
from langchain.text_splitter import RecursiveJsonSplitter

json_splitter = RecursiveJsonSplitter(max_chunk_size=300)
json_chunks = json_splitter.split_json(json_data)

In [5]:
for ch in json_chunks[:3]:
    print(f'Chunk: {ch}')

Chunk: {'openapi': '3.1.0', 'info': {'title': 'LangSmith', 'version': '0.1.0'}, 'paths': {'/api/v1/sessions/{session_id}/dashboard': {'post': {'tags': ['tracer-sessions'], 'summary': 'Get Tracing Project Prebuilt Dashboard', 'description': 'Get a prebuilt dashboard for a tracing project.'}}}}
Chunk: {'paths': {'/api/v1/sessions/{session_id}/dashboard': {'post': {'operationId': 'get_tracing_project_prebuilt_dashboard_api_v1_sessions__session_id__dashboard_post', 'security': [{'API Key': []}, {'Tenant ID': []}, {'Bearer Auth': []}]}}}}
Chunk: {'paths': {'/api/v1/sessions/{session_id}/dashboard': {'post': {'parameters': [{'name': 'session_id', 'in': 'path', 'required': True, 'schema': {'type': 'string', 'format': 'uuid', 'title': 'Session Id'}}, {'name': 'accept', 'in': 'header', 'required': False, 'schema': {'anyOf': [{'type': 'string'}, {'type': 'null'}], 'title': 'Accept'}}]}}}}


### The splitter can also output documents

In [6]:
docs = json_splitter.create_documents(texts=[json_data])
for d in docs[:3]:
    print(f'Document: {d}')

Document: page_content='{"openapi": "3.1.0", "info": {"title": "LangSmith", "version": "0.1.0"}, "paths": {"/api/v1/sessions/{session_id}/dashboard": {"post": {"tags": ["tracer-sessions"], "summary": "Get Tracing Project Prebuilt Dashboard", "description": "Get a prebuilt dashboard for a tracing project."}}}}'
Document: page_content='{"paths": {"/api/v1/sessions/{session_id}/dashboard": {"post": {"operationId": "get_tracing_project_prebuilt_dashboard_api_v1_sessions__session_id__dashboard_post", "security": [{"API Key": []}, {"Tenant ID": []}, {"Bearer Auth": []}]}}}}'
Document: page_content='{"paths": {"/api/v1/sessions/{session_id}/dashboard": {"post": {"parameters": [{"name": "session_id", "in": "path", "required": true, "schema": {"type": "string", "format": "uuid", "title": "Session Id"}}, {"name": "accept", "in": "header", "required": false, "schema": {"anyOf": [{"type": "string"}, {"type": "null"}], "title": "Accept"}}]}}}}'


In [7]:
# split_text returns the chunks as a list of JSON-formatted strings
texts = json_splitter.split_text(json_data=json_data)
for t in texts[:2]:
    print(t)

{"openapi": "3.1.0", "info": {"title": "LangSmith", "version": "0.1.0"}, "paths": {"/api/v1/sessions/{session_id}/dashboard": {"post": {"tags": ["tracer-sessions"], "summary": "Get Tracing Project Prebuilt Dashboard", "description": "Get a prebuilt dashboard for a tracing project."}}}}
{"paths": {"/api/v1/sessions/{session_id}/dashboard": {"post": {"operationId": "get_tracing_project_prebuilt_dashboard_api_v1_sessions__session_id__dashboard_post", "security": [{"API Key": []}, {"Tenant ID": []}, {"Bearer Auth": []}]}}}}


In [None]:
from lan