Yelp data analysis.

## Code Explanation

This script splits a large line-delimited JSON file (one JSON object per line) into several smaller files. Splitting helps when a single file is too large to handle comfortably in memory, or when the work should be distributed across multiple processes or machines. Below is a step-by-step breakdown of the code:

1. **Import and Variable Setup**  
   - `json` is imported, although the script copies lines as raw text and never parses them.  
   - `input_file`: The name of the large JSON file to split.  
   - `output_prefix`: The prefix used for naming the output files.  
   - `num_files`: The number of smaller files to create.

2. **Counting Lines**  
   - The code opens the input file in read mode and counts how many lines (each containing a JSON object) are present.
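   - This count determines how many lines each split file will contain: `lines_per_file` is `total_lines // num_files` (floor division). For the run shown, 6,990,280 lines across 10 files gives 699,028 lines per file.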

3. **Splitting into Smaller Files**  
   - For each split file, the script opens a new file in write mode.  
   - It then reads `lines_per_file` lines from the input file and writes them into the new output file.  
   - This process repeats for the specified number of files (`num_files`).

4. **Stopping Condition**  
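   - The `break` is a safeguard that stops reading if the input ends before `lines_per_file` lines have been written. Because `lines_per_file` comes from floor division, a line count that is not evenly divisible by `num_files` leaves up to `num_files - 1` lines after the last full chunk; these must be written explicitly (for example, appended to the last file) or they are lost. For the file used here, 6,990,280 lines divide evenly into 10 files, so nothing is left over.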

5. **Final Output**  
   - After the loop completes, the script prints a success message indicating that the large file has been split successfully into the designated smaller files.
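
The steps above can also be sketched as a reusable function. This is a minimal sketch, not the notebook's own code: the function name `split_lines`, the `part_` prefix, and the 23-line demo file are hypothetical, and the remainder is spread over the first few parts, which is one possible policy among several.

```python
import os
import tempfile


def split_lines(input_path, output_prefix, num_files):
    """Split a line-delimited file into num_files parts; return lines per part."""
    # First pass: count lines without loading the file into memory
    with open(input_path, "r", encoding="utf8") as f:
        total_lines = sum(1 for _ in f)

    base, extra = divmod(total_lines, num_files)
    counts = []

    # Second pass: the first `extra` parts each receive one additional line
    with open(input_path, "r", encoding="utf8") as f:
        for i in range(num_files):
            quota = base + (1 if i < extra else 0)
            with open(f"{output_prefix}{i + 1}.json", "w", encoding="utf8") as out:
                for _ in range(quota):
                    out.write(f.readline())
            counts.append(quota)
    return counts


# Small demonstration on a temporary 23-line file (hypothetical data)
demo_dir = tempfile.mkdtemp()
demo_input = os.path.join(demo_dir, "demo.json")
with open(demo_input, "w", encoding="utf8") as f:
    for n in range(23):
        f.write('{"id": %d}\n' % n)

demo_counts = split_lines(demo_input, os.path.join(demo_dir, "part_"), 5)
print(demo_counts)  # [5, 5, 5, 4, 4]
```

With `divmod`, part sizes differ by at most one line and every input line lands in exactly one output file.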

**Use Case**:  
You can use this script when you need to handle extremely large datasets that may exceed system memory limits or require parallel processing. This step helps in efficiently processing, transferring, or analyzing data without having to manage a single massive file.
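
Once the data is split, each part can be read one record at a time. The following is a minimal sketch under stated assumptions: the helper `iter_records` is hypothetical, the two tiny part files are stand-in data, and the `stars` field is assumed to be present in each record.

```python
import json
import os
import tempfile


def iter_records(path):
    """Yield one parsed JSON object per non-empty line of the file."""
    with open(path, "r", encoding="utf8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


# Demonstration with two tiny stand-in part files (hypothetical data)
demo_dir = tempfile.mkdtemp()
part_paths = []
for i, rows in enumerate([[{"stars": 5}, {"stars": 3}], [{"stars": 4}]], start=1):
    path = os.path.join(demo_dir, f"split_file_{i}.json")
    with open(path, "w", encoding="utf8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    part_paths.append(path)

# Each part is processed independently; only one record is held in memory at a time
star_totals = [sum(r["stars"] for r in iter_records(p)) for p in part_paths]
print(star_totals)  # [8, 4]
```

Because the parts do not depend on each other, the per-file work could be handed to separate processes or machines and the partial results combined afterwards.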


In [1]:
import json

input_file = "yelp_academic_dataset_review.json"  # 5GB JSON file
output_prefix = "split_file_"  # Prefix for output files
num_files = 10  # Number of files to split into

# Count total lines (objects) in the file
with open(input_file, "r", encoding="utf8") as f:
    total_lines = sum(1 for _ in f)

lines_per_file = total_lines // num_files  # Lines per split file

print(f"Total lines: {total_lines}, Lines per file: {lines_per_file}")

# Now split into multiple smaller files
with open(input_file, "r", encoding="utf8") as f:
    for i in range(num_files):
        output_filename = f"{output_prefix}{i+1}.json"

        with open(output_filename, "w", encoding="utf8") as out_file:
            for _ in range(lines_per_file):
                line = f.readline()
                if not line:
                    break  # Stop if file ends early
                out_file.write(line)
            if i == num_files - 1:
                # The last file also takes any leftover lines when the
                # line count is not evenly divisible by num_files
                out_file.writelines(f)

print("✅ JSON file successfully split into smaller parts!")

Total lines: 6990280, Lines per file: 699028
✅ JSON file successfully split into smaller parts!
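
A quick sanity check after a split is to confirm that the line counts of the parts add up to the line count of the input. The sketch below is illustrative only: `count_lines` and `verify_split` are hypothetical helpers, and the demo runs on small stand-in files rather than the real dataset.

```python
import glob
import os
import tempfile


def count_lines(path):
    """Count lines by streaming the file, so memory use stays small."""
    with open(path, "r", encoding="utf8") as f:
        return sum(1 for _ in f)


def verify_split(input_path, part_pattern):
    """Return (lines in the input, lines summed over all matching parts)."""
    part_total = sum(count_lines(p) for p in glob.glob(part_pattern))
    return count_lines(input_path), part_total


# Demonstration on small stand-in files in a temporary directory
demo_dir = tempfile.mkdtemp()
demo_input = os.path.join(demo_dir, "reviews_demo.json")
with open(demo_input, "w", encoding="utf8") as f:
    f.writelines('{"id": %d}\n' % n for n in range(6))
for i, ids in enumerate([(0, 1, 2), (3, 4, 5)], start=1):
    with open(os.path.join(demo_dir, f"split_file_{i}.json"), "w", encoding="utf8") as f:
        f.writelines('{"id": %d}\n' % n for n in ids)

expected, found = verify_split(demo_input, os.path.join(demo_dir, "split_file_*.json"))
print(expected, found)  # 6 6
```

For the run above, the equivalent call would be `verify_split("yelp_academic_dataset_review.json", "split_file_*.json")`, and the two numbers should match. The glob pattern must not match the input file itself.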
