# **JSON and Pandas for Data Analysis**

### **What's covered in this notebook?**

1. Converting Flat JSON to DataFrame
    - Using pd.DataFrame()
    - Using pd.read_json()
2. Handling Deeply Nested JSON Structures
    - Normalizing Nested JSON Structures
    - Normalizing Multi-Level JSON
3. Few Examples
    - Example 1: Parse Students Data to Identify the Top Skill
    - Example 2: Parse Customer Transactions from JSON and Generate Insights
    - Example 3: Parse a Sample E-Commerce Order Data for Analysis

## **Converting Flat JSON to DataFrame**

A flat JSON structure consists of a list of dictionaries, where each dictionary represents a row.

### **Using pd.DataFrame()**

In [1]:
import json

# Reading JSON file
with open("data/flat_user_data.json", "r") as file:
    data = json.load(file)

print(json.dumps(data, indent=4))

[
    {
        "id": 1,
        "name": "Alice",
        "age": 25,
        "city": "New York"
    },
    {
        "id": 2,
        "name": "Bob",
        "age": 30,
        "city": "San Francisco"
    },
    {
        "id": 3,
        "name": "Charlie",
        "age": 28,
        "city": "Los Angeles"
    }
]


In [2]:
import pandas as pd

# Convert to Pandas DataFrame
df = pd.DataFrame(data)

df

   id     name  age           city
0   1    Alice   25       New York
1   2      Bob   30  San Francisco
2   3  Charlie   28    Los Angeles
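
Real-world records do not always share the same keys. As a minimal sketch (the records below are made up for illustration and are not part of data/flat_user_data.json), pd.DataFrame() uses the union of all keys as columns and fills the gaps with NaN:

```python
import pandas as pd

# Hypothetical records with inconsistent keys
records = [
    {"id": 1, "name": "Alice", "city": "New York"},
    {"id": 2, "name": "Bob"},            # no "city"
    {"id": 3, "city": "Los Angeles"},    # no "name"
]

df_partial = pd.DataFrame(records)

# Missing keys become NaN rather than raising an error
print(df_partial)

# Count the missing values per column
print(df_partial.isna().sum())
```

Checking isna().sum() right after loading is a cheap way to spot fields that are only present in some records.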


### **Using pd.read_json()**

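**pd.read_json(file_path)** works well only for simple JSON structures.

If the data is a **flat JSON structure** (i.e., a list of dictionaries where each dictionary represents a row), pd.read_json() loads it directly into a DataFrame.

Note that this approach does not flatten nested JSON: nested objects are kept as dictionaries inside a single column, and some irregular structures can make the call raise an error.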

In [3]:
df = pd.read_json("data/flat_user_data.json")

df

   id     name  age           city
0   1    Alice   25       New York
1   2      Bob   30  San Francisco
2   3  Charlie   28    Los Angeles
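
To see the limitation with nested data, the sketch below feeds a small nested payload to pd.read_json(). The payload is a made-up example, and io.StringIO is used only so the snippet runs without a file on disk:

```python
from io import StringIO

import pandas as pd

# Hypothetical nested payload, used only to illustrate the limitation
nested_text = '[{"id": 1, "name": "Alice", "address": {"city": "New York", "zip": "10001"}}]'

# StringIO wraps the string so read_json treats it like a file
df_raw = pd.read_json(StringIO(nested_text))

# The nested object stays a dict in one column instead of being flattened
print(df_raw.columns.tolist())
print(type(df_raw.loc[0, "address"]))
```

The "address" column still holds whole dictionaries, which is why nested data is better handled with pd.json_normalize().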


## **Handling Deeply Nested JSON Structures**

Complex nested JSON structures require preprocessing with **json.load() or json.loads()**, followed by **pd.json_normalize()**.

pd.json_normalize() helps unpack deeply nested JSON into a structured format.

- **pd.json_normalize()** flattens nested JSON.
- Use sep="_" to join nested keys with an underscore in the column names (the default separator is ".").

### **Normalizing Nested JSON Structures**

In [4]:
nested_json = '''[
    {"id": 1, "name": "Alice", "address": {"city": "New York", "zip": "10001"}},
    {"id": 2, "name": "Bob", "address": {"city": "San Francisco", "zip": "94105"}},
    {"id": 3, "name": "Charlie", "address": {"city": "Los Angeles", "zip": "90001"}}
]'''

# Parse the JSON string into Python objects
data = json.loads(nested_json)

# Flatten nested fields
df = pd.json_normalize(data, sep="_")

df

   id     name   address_city address_zip
0   1    Alice       New York       10001
1   2      Bob  San Francisco       94105
2   3  Charlie    Los Angeles       90001
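
For comparison, the sketch below shows the default "." separator and the max_level parameter, which limits how many levels are flattened. The record is a made-up example with two levels of nesting:

```python
import pandas as pd

# Hypothetical record with two levels of nesting
records = [
    {"id": 1, "profile": {"name": "Alice", "address": {"city": "New York", "zip": "10001"}}}
]

# Default behaviour: "." as separator and every level flattened
df_full = pd.json_normalize(records)
print(df_full.columns.tolist())

# max_level=1 flattens only the first level; deeper objects stay as dicts
df_one_level = pd.json_normalize(records, max_level=1)
print(df_one_level.columns.tolist())
```

Limiting max_level can be useful when a deep sub-object is better kept whole and processed later.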


### **Normalizing Multi-Level JSON**

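**pd.json_normalize(data, record_path, meta)** expands nested lists into rows: record_path names the key that holds the list of records, and meta lists the parent fields to repeat on each row.

**Note:** The value found at "record_path" must be a list or null. The argument itself can be a string or a list of keys.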
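
The sketch below illustrates that note with made-up records, assuming a reasonably recent pandas version: a null value under the record path contributes no rows, while a non-list value is rejected with a TypeError.

```python
import pandas as pd

# Hypothetical users: one with a list of orders, one with null orders
users = [
    {"id": 1, "orders": [{"order_id": 101}, {"order_id": 102}]},
    {"id": 2, "orders": None},
]

# A null value under record_path simply yields no rows for that user
df_ok = pd.json_normalize(users, record_path="orders", meta=["id"])
print(df_ok)

# A non-list value (here a single dict) is rejected
bad_users = [{"id": 3, "orders": {"order_id": 103}}]
try:
    pd.json_normalize(bad_users, record_path="orders", meta=["id"])
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

If an API sometimes returns a single object instead of a list, wrapping it in a list before normalizing avoids the error.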

In [5]:
# Sample JSON
json_data = '''[
    {"id": 1, "name": "Alice", "orders": [
        {"order_id": 101, "amount": 250},
        {"order_id": 102, "amount": 150}
    ]},
    {"id": 2, "name": "Bob", "orders": [
        {"order_id": 103, "amount": 300}
    ]}
]'''

In [6]:
import pandas as pd
import json

# Parse the JSON string; in a plain DataFrame the nested orders stay as lists in one column
data = json.loads(json_data)

df_users = pd.DataFrame(data)

df_users

   id   name                                             orders
0   1  Alice  [{'order_id': 101, 'amount': 250}, {'order_id'...
1   2    Bob                 [{'order_id': 103, 'amount': 300}]


In [7]:
# Normalize orders into a separate DataFrame
df_orders = pd.json_normalize(data, record_path="orders", meta=["id", "name"])

df_orders

   order_id  amount id   name
0       101     250  1  Alice
1       102     150  1  Alice
2       103     300  2    Bob
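
Two further json_normalize() parameters are often useful with record_path: meta_prefix, which prefixes the parent columns, and errors="ignore", which fills missing meta keys with NaN instead of raising. A minimal sketch, using made-up users (the email address is a placeholder):

```python
import pandas as pd

# Hypothetical users; the second one has no "email" key
users = [
    {"id": 1, "email": "alice@example.com", "orders": [{"order_id": 101, "amount": 250}]},
    {"id": 2, "orders": [{"order_id": 103, "amount": 300}]},
]

df_orders_meta = pd.json_normalize(
    users,
    record_path="orders",
    meta=["id", "email"],
    meta_prefix="user_",  # prefix for the columns that come from meta
    errors="ignore",      # missing meta keys become NaN instead of raising KeyError
)

print(df_orders_meta)
```

A prefix also prevents clashes when a record field and a parent field share the same name.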


## **Few Examples**

### **Example 1: Parse Students Data to Identify the Top Skill**

In [8]:
complex_json = '''[
    {"id": 1, "name": "Alice", "skills": ["Python", "SQL", "Machine Learning"]},
    {"id": 2, "name": "Bob", "skills": ["Java", "Python"]},
    {"id": 3, "name": "Charlie", "skills": ["Python", "JavaScript", "Machine Learning"]}
]'''

# Read JSON
data = json.loads(complex_json)

# Convert to DataFrame
df = pd.DataFrame(data)

df

   id     name                                  skills
0   1    Alice         [Python, SQL, Machine Learning]
1   2      Bob                          [Java, Python]
2   3  Charlie  [Python, JavaScript, Machine Learning]


In [9]:
df_new = df.explode("skills").reset_index(drop=True)

df_new

   id     name            skills
0   1    Alice            Python
1   1    Alice               SQL
2   1    Alice  Machine Learning
3   2      Bob              Java
4   2      Bob            Python
5   3  Charlie            Python
6   3  Charlie        JavaScript
7   3  Charlie  Machine Learning


In [10]:
df_new["skills"].value_counts()

skills
Python              3
Machine Learning    2
SQL                 1
Java                1
JavaScript          1
Name: count, dtype: int64
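
The counts above can be turned into a single answer. The sketch below rebuilds the exploded frame so it runs on its own (the name df_skills is introduced here for illustration), then picks the top skill and the students who list it:

```python
import pandas as pd

# Same rows as the exploded frame above, rebuilt so the snippet is self-contained
df_skills = pd.DataFrame(
    {
        "name": ["Alice", "Alice", "Alice", "Bob", "Bob", "Charlie", "Charlie", "Charlie"],
        "skills": ["Python", "SQL", "Machine Learning", "Java", "Python",
                   "Python", "JavaScript", "Machine Learning"],
    }
)

skill_counts = df_skills["skills"].value_counts()

# Label and size of the highest count
top_skill = skill_counts.idxmax()
top_count = skill_counts.max()
print(top_skill, top_count)

# Students who list the top skill
students_with_top_skill = df_skills.loc[df_skills["skills"] == top_skill, "name"].tolist()
print(students_with_top_skill)
```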

### **Example 2: Parse Customer Transactions from JSON and Generate Insights**

In [11]:
import pandas as pd
import json

# Sample JSON
json_data = '''[
    {"customer": "Alice", "amount": 200, "date": "2024-03-01"},
    {"customer": "Bob", "amount": 150, "date": "2024-03-02"},
    {"customer": "Alice", "amount": 300, "date": "2024-03-05"},
    {"customer": "Charlie", "amount": 500, "date": "2024-03-06"},
    {"customer": "Bob", "amount": 100, "date": "2024-03-07"}
]'''

# Convert JSON to DataFrame
data = json.loads(json_data)
df = pd.DataFrame(data)

# Convert date column to datetime
df["date"] = pd.to_datetime(df["date"])

df

  customer  amount       date
0    Alice     200 2024-03-01
1      Bob     150 2024-03-02
2    Alice     300 2024-03-05
3  Charlie     500 2024-03-06
4      Bob     100 2024-03-07


In [12]:
# Calculate total spending per customer
total_spending = df.groupby("customer")["amount"].sum().reset_index()

# Calculate average transaction amount per customer
average_spending = df.groupby("customer")["amount"].mean().reset_index()

# Find the most frequent customer
# Note: Alice and Bob both have 2 transactions; idxmax() returns only one label on a tie
frequent_customer = df["customer"].value_counts().idxmax()

print("Total Spending:\n", total_spending)
print("\nAverage Spending:\n", average_spending)
print("\nMost Frequent Customer:", frequent_customer)

Total Spending:
   customer  amount
0    Alice     500
1      Bob     250
2  Charlie     500

Average Spending:
   customer  amount
0    Alice   250.0
1      Bob   125.0
2  Charlie   500.0

Most Frequent Customer: Alice
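
Alice and Bob each have two transactions, so a single "most frequent" label hides a tie. The sketch below is one possible way to compute all three metrics in one groupby and keep every tied customer. The frame is rebuilt so the snippet runs on its own, and the names df_tx and summary are introduced here for illustration:

```python
import pandas as pd

# Same transactions as above, rebuilt so the snippet is self-contained
df_tx = pd.DataFrame(
    {
        "customer": ["Alice", "Bob", "Alice", "Charlie", "Bob"],
        "amount": [200, 150, 300, 500, 100],
    }
)

# Named aggregation: total, average and count in a single groupby
summary = df_tx.groupby("customer")["amount"].agg(
    total="sum", average="mean", transactions="count"
)
print(summary)

# Keep every customer that reaches the maximum count, not just the first one
max_count = summary["transactions"].max()
most_frequent = summary.index[summary["transactions"] == max_count].tolist()
print(most_frequent)
```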


### **Example 3: Parse a Sample E-Commerce Order Data for Analysis**

Assume that you received **multi-level, deeply nested JSON data** from an API. Your task is to **extract specific fields efficiently** and **flatten them for analysis**.

In [13]:
simple_json_data = '''{
    "order_id": "ORD12345",
    "customer": {
        "id": 98765,
        "name": "John Doe",
        "email": "johndoe@email.com",
        "address": {
            "street": "123 Main St",
            "city": "New York",
            "country": "USA"
        }
    },
    "items": [
        {
            "product_id": 111,
            "name": "Laptop",
            "price": 1500.00,
            "quantity": 1
        },
        {
            "product_id": 222,
            "name": "Mouse",
            "price": 50.00,
            "quantity": 2
        }
    ]
}
'''

In [14]:
data = json.loads(simple_json_data)

df_customer = pd.json_normalize(data, 
                                sep="_", 
                                record_path=["items"], 
                                meta=["order_id", 
                                      ["customer", "id"],
                                      ["customer", "name"],
                                      ["customer", "email"]
                                     ]
                               )

df_customer

   product_id    name   price  quantity  order_id customer_id customer_name     customer_email
0         111  Laptop  1500.0         1  ORD12345       98765      John Doe  johndoe@email.com
1         222   Mouse    50.0         2  ORD12345       98765      John Doe  johndoe@email.com
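
With the items flattened, order-level metrics are a short step away. The sketch below rebuilds the relevant columns so it runs on its own; the line_total column is an illustrative addition, not part of the original data:

```python
import pandas as pd

# Item rows as produced by the normalization above, rebuilt so the snippet is self-contained
df_items = pd.DataFrame(
    {
        "product_id": [111, 222],
        "name": ["Laptop", "Mouse"],
        "price": [1500.0, 50.0],
        "quantity": [1, 2],
        "order_id": ["ORD12345", "ORD12345"],
    }
)

# Value of each line item
df_items["line_total"] = df_items["price"] * df_items["quantity"]

# Total value of each order
order_totals = df_items.groupby("order_id")["line_total"].sum()
print(order_totals)
```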
