# JSON Flattening Toolkit - Comprehensive Guide for Data Engineers & Scientists

> **A hands-on guide to JSON flattening techniques, patterns, and real-world applications**

This notebook is organized into **10 self-contained milestones**, each focusing on specific aspects of JSON flattening. You can work through them sequentially or jump to specific topics of interest.

## 📚 Table of Contents

### Foundations
- **[Milestone 1: Foundations & Core Concepts](#milestone-1)** - Basic flattening, list policies, separators
- **[Milestone 2: Array Handling Strategies](#milestone-2)** - Index vs join, explosion, cartesian products

### Advanced Techniques  
- **[Milestone 3: Complex Structures](#milestone-3)** - Deep nesting, mixed types, null handling

### Real-World Use Cases
- **[Milestone 4: E-commerce Data](#milestone-4)** - Orders, products, customers, transactions
- **[Milestone 5: API & Event Data](#milestone-5)** - API responses, webhooks, event logs

### Data Pipelines
- **[Milestone 6: CSV Operations & Pipelines](#milestone-6)** - Read/write, transformations, batch processing

### Database Integration
- **[Milestone 7: MongoDB Integration](#milestone-7)** - Ingestion, querying, type inference
- **[Milestone 8: Snowflake Integration](#milestone-8)** - Schema generation, ingestion, queries

### Production Patterns
- **[Milestone 9: Advanced Patterns & Best Practices](#milestone-9)** - Performance, memory, error handling
- **[Milestone 10: End-to-End Workflows](#milestone-10)** - Complete pipelines, production examples

---

## 🎯 Learning Objectives

By the end of this notebook, you will be able to:
- ✅ Flatten complex nested JSON structures efficiently
- ✅ Choose appropriate array handling strategies for your use case
- ✅ Build data pipelines from JSON to CSV to databases
- ✅ Handle edge cases (nulls, empty arrays, mixed types)
- ✅ Integrate with MongoDB and Snowflake
- ✅ Apply best practices for production systems

## 🚀 Quick Start

Let's set up our environment and import the necessary modules.

In [None]:
# ============================================================================
# IMPORTS - All imports at the top for clarity
# ============================================================================

import json
import sys
import os
import time
from pathlib import Path
from datetime import datetime
from typing import Any, Dict, List, Optional
from collections import Counter

# Core flattening functions
from json_flatten import flatten_json, flatten_records, write_csv, read_csv
from json_flatten.scenarios import get_scenarios

# Optional: MongoDB and Snowflake (may not be available)
try:
    from json_flatten.mongodb_io import ingest_csv_to_mongodb, query_mongodb, infer_type
    MONGO_AVAILABLE = True
except ImportError:
    MONGO_AVAILABLE = False
    print("⚠ MongoDB integration not available (pymongo not installed)")

try:
    from json_flatten.snowflake_io import create_table_schema, ingest_csv_to_snowflake, query_snowflake
    SNOWFLAKE_AVAILABLE = True
except ImportError:
    SNOWFLAKE_AVAILABLE = False
    print("⚠ Snowflake integration not available (snowflake-connector-python not installed)")

# PySpark imports (for large-scale processing)
try:
    import findspark
    findspark.init()  # may raise ValueError if no Spark installation can be located
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, explode, from_json, schema_of_json
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, BooleanType, ArrayType
    PYSPARK_AVAILABLE = True
except (ImportError, ValueError):
    PYSPARK_AVAILABLE = False
    print("⚠ PySpark not available (pyspark/findspark not installed, or Spark not found)")

# Setup output directory
OUTPUT_DIR = Path("notebook_output")
OUTPUT_DIR.mkdir(exist_ok=True)

# Helper functions for pretty printing and analysis
def print_section(title: str, char: str = "="):
    """Print a formatted section header."""
    print(f"\n{char * 60}")
    print(f"  {title}")
    print(f"{char * 60}\n")

def compare_before_after(before: Any, after: Dict[str, Any], title: str = "Transformation"):
    """Compare original and flattened data side by side."""
    print_section(title)
    print("BEFORE (Original JSON):")
    print(json.dumps(before, indent=2))
    print("\nAFTER (Flattened):")
    print(json.dumps(after, indent=2))
    print(f"\n📊 Flattened to {len(after)} fields")

def measure_time(func):
    """Decorator to measure execution time."""
    def wrapper(*args, **kwargs):
        # perf_counter is a monotonic, high-resolution clock suited to timing
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"⏱️  Execution time: {elapsed:.4f} seconds")
        return result
    return wrapper

# Initialize PySpark if available
if PYSPARK_AVAILABLE:
    spark = SparkSession.builder \
        .appName("JSONFlattening") \
        .config("spark.sql.adaptive.enabled", "true") \
        .config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
        .getOrCreate()
    spark.sparkContext.setLogLevel("WARN")  # Reduce verbosity
    print("✅ PySpark session initialized")

print("✅ Environment setup complete!")
print(f"📁 Output directory: {OUTPUT_DIR.absolute()}")
print(f"🔧 MongoDB available: {MONGO_AVAILABLE}")
print(f"❄️  Snowflake available: {SNOWFLAKE_AVAILABLE}")
print(f"⚡ PySpark available: {PYSPARK_AVAILABLE}")

---

# Milestone 1: Foundations & Core Concepts

## Learning Objectives
- Understand the fundamental concept of JSON flattening
- Learn how nested structures are converted to flat dictionaries
- Explore different list handling policies
- Master custom separator usage

## Why Flatten JSON?

Nested JSON rarely fits the tools that data engineers and data scientists use downstream:
- **Tabular formats** (CSV, databases) require flat structures
- **Analytics tools** work better with normalized data
- **Schema inference** is easier with flat structures
- **Database ingestion** requires consistent column structures

Let's start with the basics!

### 1.1 Understanding Nested Structures

**What is nesting?**  
Nesting occurs when JSON objects contain other objects or arrays inside them. Think of it like Russian dolls - objects within objects.

**Why is this a problem?**  
Rows and columns have no place for an object inside a cell. CSV files, database tables, and most analytics tools expect one scalar value per column, and inferring a schema from nested data is harder than from flat data.

**How does flattening work?**  
The `flatten_json()` function recursively traverses nested structures and creates dot-delimited keys. For example:
- `user.profile.name` represents the `name` field inside `profile` inside `user`
- The dot (`.`) is the default separator, but you can customize it
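
The traversal described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the toolkit's implementation: `flatten_sketch` is a hypothetical helper that handles only nested dictionaries and stores lists as-is.

```python
from typing import Any, Dict


def flatten_sketch(obj: Dict[str, Any], parent_key: str = "", sep: str = ".") -> Dict[str, Any]:
    """Hypothetical helper: flatten nested dicts into sep-delimited keys."""
    flat: Dict[str, Any] = {}
    for key, value in obj.items():
        # Build the full path to this field, e.g. "user" + "." + "id"
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the path so far
            flat.update(flatten_sketch(value, full_key, sep))
        else:
            # Leaf value (or list): store it under the accumulated path
            flat[full_key] = value
    return flat


nested = {"user": {"id": 42, "profile": {"name": "Alice"}}, "score": 9.5}
result = flatten_sketch(nested)
print(result)
# {'user.id': 42, 'user.profile.name': 'Alice', 'score': 9.5}
```

Each recursive call extends the key path by one segment, so the depth of nesting determines the number of segments in the final key.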

Let's see this in action:

In [None]:
# Example 1: Simple nested structure
data1 = {
    "user": {
        "id": 42,
        "profile": {
            "name": "Alice",
            "active": True
        }
    },
    "score": 9.5
}

flattened1 = flatten_json(data1)
compare_before_after(data1, flattened1, "Example 1: Simple Nested Structure")

### 1.2 Custom Separators

**Why use custom separators?**  
Sometimes the default dot (`.`) separator can conflict with your data:
- Field names might contain dots
- You might prefer underscores (`_`) or double underscores (`__`)
- Some systems have naming conventions

**Example use cases:**
- MongoDB uses dots for nested queries, so you might want `_` instead
- Some databases prefer `__` for clarity
- Your organization might have specific naming standards

Let's explore different separators:
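
The toolkit's separator option is not shown in this section, so the sketch below uses a standalone, hypothetical helper (`flatten_with_sep`) rather than `flatten_json` itself. It illustrates the first bullet above: when a field name already contains a dot, a dot separator can make two different paths collide.

```python
from typing import Any, Dict


def flatten_with_sep(obj: Dict[str, Any], sep: str = ".", prefix: str = "") -> Dict[str, Any]:
    """Hypothetical stand-in for a flattener with a configurable separator."""
    flat: Dict[str, Any] = {}
    for key, value in obj.items():
        path = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten_with_sep(value, sep, path))
        else:
            flat[path] = value
    return flat


# "app.version" is a single field name that happens to contain a dot
config = {"app.version": "1.2", "app": {"version": "2.0"}}

dotted = flatten_with_sep(config, sep=".")
# Both fields map to the key "app.version", so one silently overwrites the other
print(dotted)       # {'app.version': '2.0'}

underscored = flatten_with_sep(config, sep="__")
# With "__" the two paths stay distinct
print(underscored)  # {'app.version': '1.2', 'app__version': '2.0'}
```

A separator that cannot appear in your field names also makes the flattened keys reversible, since each key can be split back into its path unambiguously.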

## Array Handling

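Arrays can be handled in three ways:
- **Index policy**: creates one key per element (e.g., `tags.0`, `tags.1`)
- **Join policy**: joins an array of primitives into a single comma-separated value
- **Explosion**: creates one record per array element (`flatten_records` with `explode_paths`)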
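
The index and join policies can be compared side by side with a small standalone sketch. `flatten_list` and its `policy` argument are hypothetical names chosen for illustration; the toolkit's own option names may differ.

```python
from typing import Any, Dict, List


def flatten_list(key: str, values: List[Any], policy: str = "index") -> Dict[str, Any]:
    """Hypothetical helper: flatten a list of primitives under one key."""
    if policy == "index":
        # One key per element: tags.0, tags.1, ...
        return {f"{key}.{i}": v for i, v in enumerate(values)}
    if policy == "join":
        # One key holding all elements as a comma-separated string
        return {key: ",".join(str(v) for v in values)}
    raise ValueError(f"unknown policy: {policy!r}")


tags = ["red", "green", "blue"]
indexed = flatten_list("tags", tags, policy="index")
joined = flatten_list("tags", tags, policy="join")
print(indexed)  # {'tags.0': 'red', 'tags.1': 'green', 'tags.2': 'blue'}
print(joined)   # {'tags': 'red,green,blue'}
```

Indexing preserves element types and positions, but the number of columns then depends on the longest array. Joining keeps a single column at the cost of converting every element to a string, and it becomes ambiguous if the values themselves contain commas.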

In [None]:
# Example: Array explosion - creating multiple records
data5 = {
    "order_id": 1001,
    "customer": "Alice",
    "items": [
        {"sku": "A1", "qty": 2, "price": 10.50},
        {"sku": "B2", "qty": 1, "price": 5.25},
        {"sku": "C3", "qty": 3, "price": 8.00}
    ]
}

records = flatten_records(data5, explode_paths=["items"])
print(f"Created {len(records)} records from array explosion:")
for i, record in enumerate(records, 1):
    print(f"\nRecord {i}:")
    print(json.dumps(record, indent=2))

## CSV Operations

Converting flattened JSON to CSV format for database ingestion or analysis.
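
Flattened records do not always share the same keys, so a CSV writer needs the union of all keys as its header. The sketch below shows one way to do that with the standard library's `csv.DictWriter`. `records_to_csv_text` is a hypothetical helper and the key names in `rows` are illustrative; the toolkit's `write_csv` may behave differently.

```python
import csv
import io
from typing import Any, Dict, List


def records_to_csv_text(records: List[Dict[str, Any]]) -> str:
    """Hypothetical helper: serialize flat records to CSV text."""
    # Build the header as the union of keys, preserving first-seen order
    fieldnames: List[str] = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)

    buffer = io.StringIO()
    # restval fills in columns that a given record does not have
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


rows = [
    {"order_id": 1001, "items.sku": "A1", "items.qty": 2},
    {"order_id": 1001, "items.sku": "B2", "coupon": "SPRING"},
]
csv_text = records_to_csv_text(rows)
print(csv_text)
```

Collecting the header up front means every row is written with the same columns, which is what most database loaders expect.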

In [None]:
# Reuse the output directory created in the setup cell
output_dir = OUTPUT_DIR

# Flatten and write to CSV
sample_data = {
    "order_id": 1001,
    "customer": {"name": "Alice", "email": "alice@example.com"},
    "items": [
        {"sku": "A1", "qty": 2},
        {"sku": "B2", "qty": 1}
    ]
}

records = flatten_records(sample_data, explode_paths=["items"])
csv_path = output_dir / "orders.csv"
write_csv(records, csv_path)

print(f"✓ Written {len(records)} records to {csv_path}")
print("\nCSV content:")
print(csv_path.read_text())

## Summary

This toolkit provides comprehensive solutions for:
1. **Flattening complex JSON structures** with configurable policies
2. **Handling arrays** through indexing or explosion
3. **Creating cartesian products** from multiple array paths
4. **CSV conversion** for tabular data formats
5. **Database ingestion** into MongoDB and Snowflake

See README.md for complete documentation.