# **JSON and Pandas for Data Analysis**

### **What's covered in this notebook?**

1. Converting Flat JSON to DataFrame
    - Using pd.DataFrame()
    - Using pd.read_json()
2. Handling Deeply Nested JSON Structures
    - Normalizing Nested JSON Structures
    - Normalizing Multi-Level JSON
3. A Few Examples
    - Example 1: Parse Students Data to Identify the Top Skill
    - Example 2: Parse Customer Transactions from JSON and Generate Insights
    - Example 3: Parse a Sample E-Commerce Order Data for Analysis

## **Converting Flat JSON to DataFrame**

A flat JSON structure consists of a list of dictionaries, where each dictionary represents a row.

### **Using pd.DataFrame()**

In [1]:
import json

# Reading JSON file
with open("data/flat_user_data.json", "r") as file:
    data = json.load(file)

print(json.dumps(data, indent=4))

[
    {
        "id": 1,
        "name": "Alice",
        "age": 25,
        "city": "New York"
    },
    {
        "id": 2,
        "name": "Bob",
        "age": 30,
        "city": "San Francisco"
    },
    {
        "id": 3,
        "name": "Charlie",
        "age": 28,
        "city": "Los Angeles"
    }
]


In [2]:
import pandas as pd

# Convert to Pandas DataFrame
df = pd.DataFrame(data)

df

Unnamed: 0,id,name,age,city
0,1,Alice,25,New York
1,2,Bob,30,San Francisco
2,3,Charlie,28,Los Angeles
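The same conversion can be tried without the data file. The sketch below is illustrative only: `raw` is a made-up inline stand-in for a flat JSON file, and Bob's record deliberately omits "city" to show that a key missing from one record ends up as NaN in that row.

```python
import json
import pandas as pd

# Hypothetical inline stand-in for the contents of a flat JSON file
raw = """
[
    {"id": 1, "name": "Alice", "age": 25, "city": "New York"},
    {"id": 2, "name": "Bob", "age": 30},
    {"id": 3, "name": "Charlie", "age": 28, "city": "Los Angeles"}
]
"""

records = json.loads(raw)         # list of dictionaries, one per row
df_flat = pd.DataFrame(records)   # dictionary keys become columns

# Bob has no "city" key, so that cell is filled with NaN
missing_city = df_flat["city"].isna().tolist()

print(df_flat)
print(missing_city)
```

Checking for missing values right after loading is a cheap way to spot records that do not share the same keys.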


### **Using pd.read_json()**

**pd.read_json(file_path)** works well only for simple JSON structures.

If the data is a **flat JSON structure** (i.e., a list of dictionaries where each dictionary represents a row), pd.read_json() maps it directly to rows and columns.

Note that this approach does not flatten deeply nested JSON: nested objects are typically left as dictionaries inside a single column rather than expanded.

In [3]:
df = pd.read_json("data/flat_user_data.json")

df

Unnamed: 0,id,name,age,city
0,1,Alice,25,New York
1,2,Bob,30,San Francisco
2,3,Charlie,28,Los Angeles
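To see the limitation with nested data, here is a minimal sketch that feeds two small inline JSON strings to pd.read_json() through io.StringIO. The strings `flat_json` and `nested_json` are assumptions made for illustration, not the notebook's data files.

```python
import io
import pandas as pd

# Hypothetical inline samples (not the notebook's data files)
flat_json = '[{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]'
nested_json = '[{"id": 1, "address": {"city": "New York", "zip": "10001"}}]'

# Wrapping the text in StringIO lets read_json treat it like a file
df_flat = pd.read_json(io.StringIO(flat_json))
df_nested = pd.read_json(io.StringIO(nested_json))

# The flat input yields one column per key
print(df_flat.columns.tolist())

# The nested object is not expanded: the whole dict sits in one cell
address_cell = df_nested.loc[0, "address"]
print(type(address_cell))
```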


## **Handling Deeply Nested JSON Structures**

Complex nested JSON structures require preprocessing: load the file with **json.load()** (or a string with **json.loads()**), then pass the result to **pd.json_normalize()**.

- **pd.json_normalize()** flattens nested dictionaries into one column per leaf field.
- Use sep="_" to join nested key names with an underscore instead of the default "." (for example, address_city).
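As a quick illustration of the separator, the sketch below normalizes one made-up record twice and compares the resulting column names. The `records` list is an assumption for the example.

```python
import pandas as pd

# Hypothetical single record with one nested object
records = [
    {"id": 1, "name": "Alice", "address": {"city": "New York", "zip": "10001"}}
]

# Default separator is "." between parent and child keys
default_cols = pd.json_normalize(records).columns.tolist()

# sep="_" produces names that are valid Python identifiers
underscore_cols = pd.json_normalize(records, sep="_").columns.tolist()

print(default_cols)
print(underscore_cols)
```

Underscore-separated names are convenient because they can be used with attribute access and in `DataFrame.query()` expressions without quoting.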

### **Normalizing Nested JSON Structures**

In [4]:
import json

# Reading JSON file
with open("data/nested_user_data.json", "r") as file:
    data = json.load(file)

print(json.dumps(data, indent=4))

[
    {
        "id": 1,
        "name": "Alice",
        "address": {
            "city": "New York",
            "zip": "10001"
        }
    },
    {
        "id": 2,
        "name": "Bob",
        "address": {
            "city": "San Francisco",
            "zip": "94105"
        }
    },
    {
        "id": 3,
        "name": "Charlie",
        "address": {
            "city": "Los Angeles",
            "zip": "90001"
        }
    }
]


In [5]:
# Convert to Pandas DataFrame
df = pd.DataFrame(data)

df

Unnamed: 0,id,name,address
0,1,Alice,"{'city': 'New York', 'zip': '10001'}"
1,2,Bob,"{'city': 'San Francisco', 'zip': '94105'}"
2,3,Charlie,"{'city': 'Los Angeles', 'zip': '90001'}"


In [6]:
import pandas as pd

# Flatten nested fields
df = pd.json_normalize(data, sep="_")  

df

Unnamed: 0,id,name,address_city,address_zip
0,1,Alice,New York,10001
1,2,Bob,San Francisco,94105
2,3,Charlie,Los Angeles,90001


### **Normalizing Multi-Level JSON**

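**pd.json_normalize(data, record_path, meta)** expands nested lists into rows: record_path names the key (a string or a list of keys) that holds the list, and meta lists the parent fields to repeat on every expanded row.

**Note:** The value found at "record_path" in each record must be a list or null.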

In [7]:
import json

# Reading JSON file
with open("data/customer_multi_level.json", "r") as file:
    data = json.load(file)

print(json.dumps(data, indent=4))

[
    {
        "id": 1,
        "name": "Alice",
        "orders": [
            {
                "order_id": 101,
                "amount": 250
            },
            {
                "order_id": 102,
                "amount": 150
            }
        ]
    },
    {
        "id": 2,
        "name": "Bob",
        "orders": [
            {
                "order_id": 103,
                "amount": 300
            }
        ]
    }
]


In [8]:
# Convert JSON to DataFrame
df_users = pd.DataFrame(data)

df_users

Unnamed: 0,id,name,orders
0,1,Alice,"[{'order_id': 101, 'amount': 250}, {'order_id'..."
1,2,Bob,"[{'order_id': 103, 'amount': 300}]"


In [9]:
# Normalize orders into a separate DataFrame
df_orders = pd.json_normalize(data, record_path="orders", meta=["id", "name"])

df_orders

Unnamed: 0,order_id,amount,id,name
0,101,250,1,Alice
1,102,150,1,Alice
2,103,300,2,Bob
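When the parent fields have generic names such as "id" and "name", the `meta_prefix` argument of pd.json_normalize() can make the expanded table easier to read. The following is a sketch under the assumption that the data looks like the customer records printed above; the `customers` list is typed inline rather than read from the file.

```python
import pandas as pd

# Inline copy of the customer/orders structure shown above
customers = [
    {"id": 1, "name": "Alice",
     "orders": [{"order_id": 101, "amount": 250},
                {"order_id": 102, "amount": 150}]},
    {"id": 2, "name": "Bob",
     "orders": [{"order_id": 103, "amount": 300}]},
]

# One row per order; parent fields are repeated and prefixed
df_orders = pd.json_normalize(
    customers,
    record_path="orders",
    meta=["id", "name"],
    meta_prefix="customer_",
)

# Total order amount per customer
totals = df_orders.groupby("customer_name")["amount"].sum()

print(df_orders.columns.tolist())
print(totals)
```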


## **A Few Examples**

### **Example 1: Parse Students Data to Identify the Top Skill**

In [10]:
import json

# Reading JSON file
with open("data/student_skills.json", "r") as file:
    data = json.load(file)

print(json.dumps(data, indent=4))

[
    {
        "id": 1,
        "name": "Alice",
        "skills": [
            "Python",
            "SQL",
            "Machine Learning"
        ]
    },
    {
        "id": 2,
        "name": "Bob",
        "skills": [
            "Java",
            "Python"
        ]
    },
    {
        "id": 3,
        "name": "Charlie",
        "skills": [
            "Python",
            "JavaScript",
            "Machine Learning"
        ]
    }
]


In [11]:
# Convert to DataFrame
df = pd.DataFrame(data)

df

Unnamed: 0,id,name,skills
0,1,Alice,"[Python, SQL, Machine Learning]"
1,2,Bob,"[Java, Python]"
2,3,Charlie,"[Python, JavaScript, Machine Learning]"


In [12]:
# Expand each element of the skills list into its own row
df_new = df.explode("skills").reset_index(drop=True)

df_new

Unnamed: 0,id,name,skills
0,1,Alice,Python
1,1,Alice,SQL
2,1,Alice,Machine Learning
3,2,Bob,Java
4,2,Bob,Python
5,3,Charlie,Python
6,3,Charlie,JavaScript
7,3,Charlie,Machine Learning


In [13]:
# Count how many students list each skill
df_new["skills"].value_counts()

skills
Python              3
Machine Learning    2
SQL                 1
Java                1
JavaScript          1
Name: count, dtype: int64
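To pull the top skill out of these counts programmatically, one option is `idxmax()` on the counts. This is a minimal sketch that rebuilds the student records inline (an assumption, since the file is not available here).

```python
import pandas as pd

# Inline copy of the student records shown above
students = [
    {"id": 1, "name": "Alice", "skills": ["Python", "SQL", "Machine Learning"]},
    {"id": 2, "name": "Bob", "skills": ["Java", "Python"]},
    {"id": 3, "name": "Charlie", "skills": ["Python", "JavaScript", "Machine Learning"]},
]

df_students = pd.DataFrame(students)

# One row per (student, skill), then count occurrences of each skill
skill_counts = df_students.explode("skills")["skills"].value_counts()

# Label with the highest count
top_skill = skill_counts.idxmax()

print("Top skill:", top_skill)
```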

### **Example 2: Parse Customer Transactions from JSON and Generate Insights**

In [14]:
import json

# Reading JSON file
with open("data/ecom_transaction_1.json", "r") as file:
    data = json.load(file)

print(json.dumps(data, indent=4))

[
    {
        "customer": "Alice",
        "amount": 200,
        "date": "2024-03-01"
    },
    {
        "customer": "Bob",
        "amount": 150,
        "date": "2024-03-02"
    },
    {
        "customer": "Alice",
        "amount": 300,
        "date": "2024-03-05"
    },
    {
        "customer": "Charlie",
        "amount": 500,
        "date": "2024-03-06"
    },
    {
        "customer": "Bob",
        "amount": 100,
        "date": "2024-03-07"
    }
]


In [15]:
df = pd.DataFrame(data)

# Convert date column to datetime
df["date"] = pd.to_datetime(df["date"])

df

Unnamed: 0,customer,amount,date
0,Alice,200,2024-03-01
1,Bob,150,2024-03-02
2,Alice,300,2024-03-05
3,Charlie,500,2024-03-06
4,Bob,100,2024-03-07


In [16]:
# Calculate total spending per customer
total_spending = df.groupby("customer")["amount"].sum().reset_index()

# Calculate average transaction amount per customer
average_spending = df.groupby("customer")["amount"].mean().reset_index()

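# Find the most frequent customer
# (idxmax returns a single label, so a tie, such as Alice and Bob
# with 2 transactions each, yields only one name)
frequent_customer = df["customer"].value_counts().idxmax()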

print("Total Spending:\n", total_spending)
print("\nAverage Spending:\n", average_spending)
print("\nMost Frequent Customer:", frequent_customer)

Total Spending:
   customer  amount
0    Alice     500
1      Bob     250
2  Charlie     500

Average Spending:
   customer  amount
0    Alice   250.0
1      Bob   125.0
2  Charlie   500.0

Most Frequent Customer: Alice
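In this data Alice and Bob both have two transactions, so a single "most frequent customer" hides a tie. The sketch below is one possible alternative, not the notebook's method: it builds all three metrics with one named aggregation and then keeps every customer whose transaction count equals the maximum. The `transactions` list is an inline copy of the records above.

```python
import pandas as pd

# Inline copy of the transaction records shown above
transactions = [
    {"customer": "Alice", "amount": 200, "date": "2024-03-01"},
    {"customer": "Bob", "amount": 150, "date": "2024-03-02"},
    {"customer": "Alice", "amount": 300, "date": "2024-03-05"},
    {"customer": "Charlie", "amount": 500, "date": "2024-03-06"},
    {"customer": "Bob", "amount": 100, "date": "2024-03-07"},
]

df_tx = pd.DataFrame(transactions)

# Total, average, and number of transactions per customer in one pass
summary = df_tx.groupby("customer")["amount"].agg(
    total="sum", average="mean", transactions="count"
)

# Keep every customer tied for the highest transaction count
max_count = summary["transactions"].max()
most_frequent = summary.index[summary["transactions"] == max_count].tolist()

print(summary)
print("Most frequent customer(s):", most_frequent)
```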


### **Example 3: Parse a Sample E-Commerce Order Data for Analysis**

Assume that you have received **multi-level, deeply nested JSON data** from an API. Your task is to **extract specific fields efficiently** and **flatten the data for analysis**.

In [17]:
import json

# Reading JSON file
with open("data/ecom_transaction_2.json", "r") as file:
    data = json.load(file)

print(json.dumps(data, indent=4))

[
    {
        "order_id": "ORD1",
        "customer": {
            "id": 1639,
            "name": "Angela Griffin",
            "email": "gibbsedward@example.org",
            "address": {
                "street": "24026 Darlene Ranch",
                "city": "Angelashire",
                "country": "Saint Helena"
            }
        },
        "items": [
            {
                "product_id": 25,
                "name": "Mountain Bike",
                "category": "Sports",
                "price": 599.99,
                "quantity": 1
            },
            {
                "product_id": 23,
                "name": "Tennis Racket",
                "category": "Sports",
                "price": 89.99,
                "quantity": 2
            }
        ],
        "payment": {
            "method": "Credit Card",
            "transaction_id": "D005B042-A",
            "discount_applied": 2.48
        }
    },
    {
        "order_id": "ORD2",
        "customer": {
  

In [18]:
# Load the data into a DataFrame (nested fields stay as dicts and lists)
df = pd.DataFrame(data)

df

Unnamed: 0,order_id,customer,items,payment
0,ORD1,"{'id': 1639, 'name': 'Angela Griffin', 'email'...","[{'product_id': 25, 'name': 'Mountain Bike', '...","{'method': 'Credit Card', 'transaction_id': 'D..."
1,ORD2,"{'id': 163, 'name': 'Sarah Moore', 'email': 'z...","[{'product_id': 24, 'name': 'Hydration Backpac...","{'method': 'Credit Card', 'transaction_id': 'C..."
2,ORD3,"{'id': 1814, 'name': 'Dwayne Hartman', 'email'...","[{'product_id': 28, 'name': 'Noise-Canceling O...","{'method': 'Credit Card', 'transaction_id': '9..."


In [19]:
# Flatten nested dictionaries; the items lists are left unexpanded
df = pd.json_normalize(data, sep="_")

df

Unnamed: 0,order_id,items,customer_id,customer_name,customer_email,customer_address_street,customer_address_city,customer_address_country,payment_method,payment_transaction_id,payment_discount_applied
0,ORD1,"[{'product_id': 25, 'name': 'Mountain Bike', '...",1639,Angela Griffin,gibbsedward@example.org,24026 Darlene Ranch,Angelashire,Saint Helena,Credit Card,D005B042-A,2.48
1,ORD2,"[{'product_id': 24, 'name': 'Hydration Backpac...",163,Sarah Moore,zshelton@example.org,33466 Kristin Meadow Suite 060,Lake Brittany,Nigeria,Credit Card,C1A2BCA7-1,6.12
2,ORD3,"[{'product_id': 28, 'name': 'Noise-Canceling O...",1814,Dwayne Hartman,daviddyer@example.org,32036 Rodney Creek,New Brandy,Sierra Leone,Credit Card,98F4114D-B,5.42
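If only the first level of nesting should be flattened, pd.json_normalize() accepts a `max_level` argument. The sketch below uses a trimmed, inline version of the first order (an assumption for illustration) and compares full flattening with `max_level=1`.

```python
import pandas as pd

# Trimmed inline version of the first order shown above
order = [
    {
        "order_id": "ORD1",
        "customer": {
            "id": 1639,
            "address": {"city": "Angelashire", "country": "Saint Helena"},
        },
    }
]

# Flatten every level of nested dictionaries
full = pd.json_normalize(order, sep="_")

# Flatten only one level: customer is expanded, address stays a dict
partial = pd.json_normalize(order, sep="_", max_level=1)

print(sorted(full.columns))
print(sorted(partial.columns))
```

Limiting the depth can keep wide API payloads manageable when only the top-level fields are needed.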


In [20]:
# Select the payment columns into a separate table
df_payments = df[["order_id", "payment_method", "payment_transaction_id",
                  "payment_discount_applied"]].copy()

df_payments.head()

Unnamed: 0,order_id,payment_method,payment_transaction_id,payment_discount_applied
0,ORD1,Credit Card,D005B042-A,2.48
1,ORD2,Credit Card,C1A2BCA7-1,6.12
2,ORD3,Credit Card,98F4114D-B,5.42


In [21]:
# Select the customer columns into a separate table
df_customers = df[["customer_id", "customer_name", "customer_email",
                   "customer_address_street", "customer_address_city",
                   "customer_address_country"]].copy()

df_customers.head()

Unnamed: 0,customer_id,customer_name,customer_email,customer_address_street,customer_address_city,customer_address_country
0,1639,Angela Griffin,gibbsedward@example.org,24026 Darlene Ranch,Angelashire,Saint Helena
1,163,Sarah Moore,zshelton@example.org,33466 Kristin Meadow Suite 060,Lake Brittany,Nigeria
2,1814,Dwayne Hartman,daviddyer@example.org,32036 Rodney Creek,New Brandy,Sierra Leone


In [22]:
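# Expand each order's items list into one row per item,
# keeping order_id so items can be joined back to their order
df_order_items = pd.json_normalize(data,
                                   sep="_",
                                   record_path=["items"],
                                   meta=["order_id"]
                                   )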

df_order_items.head()

Unnamed: 0,product_id,name,category,price,quantity,order_id
0,25,Mountain Bike,Sports,599.99,1,ORD1
1,23,Tennis Racket,Sports,89.99,2,ORD1
2,24,Hydration Backpack for Runners,Sports,59.99,1,ORD2
3,21,Yoga Mat with Non-Slip Surface,Sports,39.99,2,ORD2
4,28,Noise-Canceling Office Headset,Accessories,89.99,1,ORD3
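With one row per item, order-level figures follow from a multiplication and a groupby. The sketch below is an illustration under assumptions: it uses an inline, trimmed copy of order ORD1, pulls a nested meta field via a key path (`["customer", "id"]`), and introduces a hypothetical `line_total` column that ignores the discount.

```python
import pandas as pd

# Trimmed inline copy of order ORD1 shown above
orders = [
    {
        "order_id": "ORD1",
        "customer": {"id": 1639, "name": "Angela Griffin"},
        "items": [
            {"product_id": 25, "name": "Mountain Bike", "price": 599.99, "quantity": 1},
            {"product_id": 23, "name": "Tennis Racket", "price": 89.99, "quantity": 2},
        ],
    }
]

# One row per item; a nested meta field is given as a list of keys
items = pd.json_normalize(
    orders,
    record_path=["items"],
    meta=["order_id", ["customer", "id"]],
    sep="_",
)

# Hypothetical derived column: price times quantity, before any discount
items["line_total"] = items["price"] * items["quantity"]

# Sum of line totals per order
order_totals = items.groupby("order_id")["line_total"].sum()

print(items.columns.tolist())
print(order_totals)
```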
