# Splitting Large JSON Files: Explanation and Demonstration

Large JSON files can be difficult to process due to memory constraints, slow loading times, or the need to parallelize data processing. Splitting these files into smaller parts makes them easier to handle, share, and process in data pipelines or machine learning workflows.

This notebook explains and demonstrates the logic behind the `file_splitter.py` script, which splits a large JSON file into two smaller files.

## 1. Import Required Libraries

We use the `json` module to read and write JSON files, and the `os` module to build output file paths.

In [16]:
import json
import os

## 2. Load the Input JSON File

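We start by reading the input JSON file into a Python variable. `json.load()` returns a `list` when the top-level JSON value is an array and a `dict` when it is an object. Note that `json.load()` reads the whole file into memory, so the input must still fit in RAM for this approach to work.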

In [41]:
# Example input file path (update as needed)
input_file = '../data/processed/applicants_for_processing_part6.json'

# Load the JSON data
with open(input_file, 'r', encoding='utf-8') as f:
    data = json.load(f)

# Show the type of data loaded
print(f"Loaded data type: {type(data)}")

Loaded data type: <class 'dict'>


## 3. Determine JSON Structure and Split Data

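Depending on whether the data is a list or a dictionary, we split it into two parts. Lists are sliced at the midpoint. Dictionaries are split by key: the first half of the keys goes into one new dictionary and the remaining keys into another. When the number of items is odd, the second part receives the extra item.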

In [42]:
# Split the data into two parts
if isinstance(data, list):
    half = len(data) // 2
    part1 = data[:half]
    part2 = data[half:]
    print(f"List split: {len(part1)} items in part1, {len(part2)} items in part2")
elif isinstance(data, dict):
    keys = list(data.keys())
    half = len(keys) // 2
    keys1 = keys[:half]
    keys2 = keys[half:]
    part1 = {k: data[k] for k in keys1}
    part2 = {k: data[k] for k in keys2}
    print(f"Dict split: {len(part1)} keys in part1, {len(part2)} keys in part2")
else:
    raise ValueError("Unsupported JSON structure")

Dict split: 2933 keys in part1, 2933 keys in part2
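
The same logic can be wrapped in a small function and checked on toy inputs. This is a minimal sketch: `split_in_two` is a hypothetical helper name, not something taken from `file_splitter.py`, and the sample data is invented for illustration.

```python
def split_in_two(data):
    """Split a list or dict into two halves.

    When the length is odd, the second half gets the extra item.
    """
    if isinstance(data, list):
        half = len(data) // 2
        return data[:half], data[half:]
    if isinstance(data, dict):
        items = list(data.items())
        half = len(items) // 2
        return dict(items[:half]), dict(items[half:])
    raise ValueError("Unsupported JSON structure")


# Toy inputs with an odd number of items
list_a, list_b = split_in_two([1, 2, 3, 4, 5])
dict_a, dict_b = split_in_two({"a": 1, "b": 2, "c": 3})

print(list_a, list_b)  # [1, 2] [3, 4, 5]
print(dict_a, dict_b)  # {'a': 1} {'b': 2, 'c': 3}
```

Because Python dictionaries preserve insertion order, the key order in each half matches the order in the original file.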


## 4. Write Split Data to Output Files

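We save each part to its own JSON file using `json.dump()`. Output names take the form `<base>_partN.json`, where `N` is the next unused part number, so existing part files are not overwritten.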

In [43]:
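import itertools

# Output names are derived from this base path, which differs from the part file loaded in section 2
input_file = '../data/processed/applicants_for_processing.json'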

# Get the directory and base name of the input file
input_dir = os.path.dirname(input_file)
input_base = os.path.splitext(os.path.basename(input_file))[0]

# Find next available part numbers to avoid overwriting
def get_next_available_filename(base, part, ext, directory):
    for i in itertools.count(part):
        candidate = os.path.join(directory, f'{base}_part{i}{ext}')
        if not os.path.exists(candidate):
            return candidate, i

output_file1, part_num1 = get_next_available_filename(input_base, 1, '.json', input_dir)
output_file2, part_num2 = get_next_available_filename(input_base, part_num1+1, '.json', input_dir)

with open(output_file1, 'w', encoding='utf-8') as f1:
    json.dump(part1, f1, ensure_ascii=False, indent=2)

with open(output_file2, 'w', encoding='utf-8') as f2:
    json.dump(part2, f2, ensure_ascii=False, indent=2)

print(f"Split into {output_file1} and {output_file2}")

Split into ../data/processed\applicants_for_processing_part11.json and ../data/processed\applicants_for_processing_part12.json
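
To see how the part numbering behaves when some parts already exist, the sketch below repeats the same search in a temporary directory. `next_part_path` is a hypothetical stand-in for the helper in the cell above, and the `demo` file names are invented for illustration.

```python
import itertools
import os
import tempfile


def next_part_path(base, start, ext, directory):
    """Return (path, number) for the first '<base>_part<N><ext>' that does not exist, with N >= start."""
    for i in itertools.count(start):
        candidate = os.path.join(directory, f"{base}_part{i}{ext}")
        if not os.path.exists(candidate):
            return candidate, i


with tempfile.TemporaryDirectory() as tmp:
    # Pretend parts 1 and 2 were written by an earlier run.
    for n in (1, 2):
        open(os.path.join(tmp, f"demo_part{n}.json"), "w").close()

    path1, num1 = next_part_path("demo", 1, ".json", tmp)
    # Start the second search just past the first result, since path1 has not been written yet.
    path2, num2 = next_part_path("demo", num1 + 1, ".json", tmp)
    print(num1, num2)  # 3 4
```

Starting the second search at `num1 + 1` matters: both names are chosen before either file is written, so searching from the same starting number twice would return the same path for both parts.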


## 5. Verify Output Files

Finally, we read back the output files and check the number of items in each to confirm the split was successful.

In [None]:
with open(output_file1, 'r', encoding='utf-8') as f1:
    part1_loaded = json.load(f1)

with open(output_file2, 'r', encoding='utf-8') as f2:
    part2_loaded = json.load(f2)

if isinstance(part1_loaded, list):
    print(f"Part 1: {len(part1_loaded)} items")
    print(f"Part 2: {len(part2_loaded)} items")
elif isinstance(part1_loaded, dict):
    print(f"Part 1: {len(part1_loaded)} keys")
    print(f"Part 2: {len(part2_loaded)} keys")

Part 1: 2933 keys
Part 2: 2933 keys
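
Matching counts alone do not prove that no entry was lost or duplicated. A stricter check, sketched below on toy data, recombines the two parts and compares the result with the original. `parts_match_original` is a hypothetical helper, not part of `file_splitter.py`.

```python
def parts_match_original(original, part1, part2):
    """Return True if part1 and part2 together reproduce the original list or dict."""
    if isinstance(original, list):
        # List order matters: part1 followed by part2 must equal the original.
        return part1 + part2 == original
    if isinstance(original, dict):
        # No key may appear in both parts, and the merged dict must equal the original.
        no_overlap = not (part1.keys() & part2.keys())
        return no_overlap and {**part1, **part2} == original
    raise ValueError("Unsupported JSON structure")


original = {"a": 1, "b": 2, "c": 3}
print(parts_match_original(original, {"a": 1}, {"b": 2, "c": 3}))  # True
print(parts_match_original(original, {"a": 1}, {"c": 3}))          # False
```

In this notebook, the equivalent call would be `parts_match_original(data, part1_loaded, part2_loaded)`.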

