# Session 8: Exploring Semi-Structured Data (JSON)

**Unit 1: Introduction to Data Science**
**Hour: 8**
**Mode: Practical Lab**

---

### 1. Objective

In this lab, we'll learn to work with semi-structured data, specifically JSON. JSON is the most widely used format for exchanging data on the web, and you'll encounter it frequently when working with APIs.

**What is Semi-Structured Data?** It is data that doesn't fit neatly into a fixed table of rows and columns but still carries organizational markers (like keys and values) that give it structure.

Our goal is to load JSON data and transform it into a clean, structured DataFrame.

### 2. Setup

We'll need the `pandas` library, as always, and Python's built-in `json` library.

In [None]:
import pandas as pd
import json

### 3. The JSON Data

Imagine we received the following data from a user database API. It's a list of two users. Notice that the `address` is a "nested" object inside the main user object.

In [None]:
json_data = '''
[
    {
        "id": 101,
        "name": "Alice",
        "email": "alice@example.com",
        "address": {
            "street": "123 Maple St",
            "city": "Wonderland"
        }
    },
    {
        "id": 102,
        "name": "Bob",
        "email": "bob@example.com",
        "address": {
            "street": "456 Oak Ave",
            "city": "Builderton"
        }
    }
]
'''

### 4. Handling JSON

#### 4.1. Loading JSON as a Python Object

First, let's use the `json` library to load the string into a familiar Python object (in this case, a list of dictionaries).

In [None]:
data = json.loads(json_data)
print(data)
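
As a rough illustration of what "a familiar Python object" means, the sketch below parses a small, made-up record (the `sample` string is hypothetical and not part of the lab data) and prints the Python type that `json.loads` produces for each kind of JSON value.

```python
import json

# Hypothetical record (not part of the lab data) covering each basic JSON type.
sample = '{"id": 7, "score": 9.5, "active": true, "nickname": null, "tags": ["a", "b"], "profile": {"lang": "en"}}'

record = json.loads(sample)

# Print the Python type each JSON value was converted to.
for key, value in record.items():
    print(f"{key}: {type(value).__name__}")
```

JSON objects become dictionaries, arrays become lists, `true`/`false` become `True`/`False`, and `null` becomes `None`.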

We can access elements just like we would with any Python list or dictionary.

In [None]:
# Get the first user's name
print(data[0]['name'])

# Get the second user's city
print(data[1]['address']['city'])
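
Because semi-structured records are not guaranteed to share the same keys, direct indexing can raise a `KeyError`. The sketch below assumes a hypothetical second user (`Carol`) with no `address` key and uses `dict.get()` with a default to read the city safely.

```python
import json

# Hypothetical records: the second user has no "address" key at all.
users = json.loads('''
[
    {"id": 101, "name": "Alice", "address": {"city": "Wonderland"}},
    {"id": 103, "name": "Carol"}
]
''')

cities = []
for user in users:
    # .get() returns a default instead of raising KeyError when a key is missing.
    city = user.get("address", {}).get("city", "unknown")
    cities.append(city)
    print(user["name"], "->", city)
```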

#### 4.2. Loading JSON directly into Pandas

Pandas can often read JSON directly, but let's see what happens with our nested data.

In [None]:
from io import StringIO  # recent pandas versions deprecate passing a raw JSON string

df_nested = pd.read_json(StringIO(json_data))
df_nested.head()

**Problem:** Look at the `address` column. Each cell holds an entire dictionary, which isn't very useful for analysis. We can't easily filter by city, for example, because `city` is buried inside that dictionary rather than being a column of its own. We need to flatten it.
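
To make the problem concrete, here is a minimal sketch that rebuilds a trimmed copy of the lab data and inspects the `address` column. It builds the frame with `pd.DataFrame` on the parsed records instead of `pd.read_json`, on the assumption that both leave nested objects as dictionaries; the names `records`, `df_demo`, and `in_wonderland` are illustrative only.

```python
import json
import pandas as pd

# Trimmed-down copy of the lab data, rebuilt here so the cell runs on its own.
records = json.loads('''
[
    {"id": 101, "name": "Alice", "address": {"street": "123 Maple St", "city": "Wonderland"}},
    {"id": 102, "name": "Bob", "address": {"street": "456 Oak Ave", "city": "Builderton"}}
]
''')
df_demo = pd.DataFrame(records)  # nested objects stay as dicts, one per cell

# Each cell in the column is a plain Python dict, not a set of columns.
print(type(df_demo.loc[0, "address"]))

# Filtering by city means reaching into every dict by hand.
in_wonderland = df_demo[df_demo["address"].apply(lambda a: a["city"] == "Wonderland")]
print(in_wonderland["name"].tolist())
```

It works, but every query has to unpack the dictionaries row by row, which is slow and easy to get wrong.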

#### 4.3. Flattening Nested JSON with `json_normalize`

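Pandas provides a function, `json_normalize`, specifically for this problem. It unpacks nested dictionaries into their own columns, naming each one with a dot-separated path such as `address.city`. Nested lists are not expanded by default; they stay in a single column unless you point `json_normalize` at them with the `record_path` argument.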

In [None]:
# json_normalize expects a Python dict or list of dicts, so parse the JSON string first
data_for_normalize = json.loads(json_data)

df_flat = pd.json_normalize(data_for_normalize)
df_flat.head()

**Success!** Notice how `address.street` and `address.city` are now their own proper columns. The data is now fully **structured** and ready for analysis.
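
As a quick sketch of what "ready for analysis" buys us, the example below filters on the flattened city column directly. It rebuilds a trimmed copy of the lab data so it runs on its own; `flat`, `builderton`, and `flat_underscore` are illustrative names. It also shows the optional `sep` argument, in case dotted column names are inconvenient.

```python
import pandas as pd

# Trimmed-down copy of the lab data so the cell runs on its own.
records = [
    {"id": 101, "name": "Alice", "address": {"street": "123 Maple St", "city": "Wonderland"}},
    {"id": 102, "name": "Bob", "address": {"street": "456 Oak Ave", "city": "Builderton"}},
]

flat = pd.json_normalize(records)

# A dotted column name cannot be used as an attribute (flat.address.city),
# so select it with bracket notation.
builderton = flat[flat["address.city"] == "Builderton"]
print(builderton["name"].tolist())

# Optional: pick a different separator for the flattened column names.
flat_underscore = pd.json_normalize(records, sep="_")
print(list(flat_underscore.columns))
```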

Let's call `.info()` to confirm the column names and data types of the flattened DataFrame.

In [None]:
df_flat.info()
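
Our lab data only nests dictionaries. Real API responses often nest lists as well, and those need a little more help. The sketch below assumes a hypothetical `orders` list on each user (not part of the lab data) and uses the `record_path` and `meta` arguments of `pd.json_normalize` to produce one row per order.

```python
import pandas as pd

# Hypothetical data: each user carries a list of orders (not in the lab data).
users = [
    {"id": 101, "name": "Alice", "orders": [{"item": "teapot", "qty": 1}, {"item": "cake", "qty": 2}]},
    {"id": 102, "name": "Bob", "orders": [{"item": "hammer", "qty": 1}]},
]

# Without record_path, the whole list stays in a single "orders" cell.
print(pd.json_normalize(users).columns.tolist())

# record_path makes one row per order; meta copies user-level fields onto each row.
orders = pd.json_normalize(users, record_path="orders", meta=["id", "name"])
print(orders)
```

Here `record_path` names the list to expand, while `meta` lists the parent fields to repeat on every resulting row.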

### 5. Conclusion

In this session, you learned how to:
1.  Load JSON data into Python.
2.  Recognize the problem of nested data when loading into Pandas.
3.  Use the powerful `pd.json_normalize()` function to flatten semi-structured JSON into a clean, structured DataFrame.

This is a critical skill for any data scientist who needs to work with data from web APIs. Next, we'll tackle our final data type: unstructured text.