# Polars Nested Data Structures - Comprehensive Guide

This notebook covers Polars' powerful nested data types: **Struct**, **List**, and **Array**.

## What You'll Learn:
- Creating and manipulating Struct columns (nested key-value pairs)
- Advanced List operations (dynamic-length sequences)
- Array operations (fixed-length sequences)
- Unnesting and exploding nested structures
- Real-world JSON normalization patterns
- When to use each nested type

## Polars Nested Types Overview:

| Type | Description | Length | Use Case |
|------|-------------|--------|----------|
| **Struct** | Nested named fields (like a mini-DataFrame) | Fixed fields | JSON objects, nested records |
| **List** | Dynamic-length sequences | Variable per row | Arrays of items, tags, transactions |
| **Array** | Fixed-length sequences | Same for all rows | Embeddings, coordinates, fixed features |

In [None]:
import polars as pl
import json
from datetime import datetime, date

# Set display options
pl.Config.set_tbl_rows(10)
pl.Config.set_fmt_str_lengths(100)

print(f"Polars version: {pl.__version__}")

---
# Part 1: Struct Data Type

Structs are like having a mini-DataFrame within each cell. They store named fields with potentially different types.

## 1.1 Creating Struct Columns

In [None]:
# Method 1: Pass a list of dicts for a column; Polars infers a Struct dtype
df_struct = pl.DataFrame({
    'id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie'],
    'address': [
        {'street': '123 Main St', 'city': 'NYC', 'zip': '10001'},
        {'street': '456 Oak Ave', 'city': 'LA', 'zip': '90001'},
        {'street': '789 Pine Rd', 'city': 'Chicago', 'zip': '60601'}
    ]
})

print("DataFrame with Struct column:")
print(df_struct)
print("\nSchema:")
print(df_struct.schema)

In [None]:
# Method 2: Create Struct from existing columns using pl.struct()
df = pl.DataFrame({
    'user_id': [1, 2, 3],
    'first_name': ['Alice', 'Bob', 'Charlie'],
    'last_name': ['Smith', 'Jones', 'Brown'],
    'age': [25, 30, 35],
    'city': ['NYC', 'LA', 'Chicago']
})

# Combine columns into a struct
df_with_struct = df.select([
    pl.col('user_id'),
    pl.struct(['first_name', 'last_name', 'age']).alias('user_info'),
    pl.col('city')
])

print("Created Struct from columns:")
print(df_with_struct)

In [None]:
# Method 3: Create Struct using expressions
result = df.select([
    pl.col('user_id'),
    pl.struct([
        pl.col('first_name'),
        pl.col('last_name'),
        (pl.col('age') + 1).alias('age_next_year')
    ]).alias('person')
])

print("Struct with computed fields:")
print(result)

## 1.2 Accessing Struct Fields

In [None]:
# Access struct fields using .struct.field()
result = df_struct.select([
    pl.col('name'),
    pl.col('address').struct.field('city').alias('city'),
    pl.col('address').struct.field('zip').alias('zip_code')
])

print("Extract specific fields from Struct:")
print(result)

In [None]:
# Extract all struct fields at once using unnest()
result = df_struct.unnest('address')

print("Unnest struct (flatten all fields):")
print(result)

In [None]:
# Extract multiple fields in one operation
result = df_struct.with_columns([
    pl.col('address').struct.field('city').alias('city'),
    pl.col('address').struct.field('street').alias('street')
])

print("Keep original struct and extract fields:")
print(result)

## 1.3 Renaming Struct Fields

In [None]:
# Rename struct fields using .struct.rename_fields()
result = df_struct.select([
    pl.col('name'),
    pl.col('address').struct.rename_fields(['street_address', 'city_name', 'postal_code'])
])

print("Renamed struct fields:")
print(result)
print("\nSchema:")
print(result.schema)

## 1.4 Complex Nested Structs (Structs within Structs)

In [None]:
# Create deeply nested structure
df_nested = pl.DataFrame({
    'user_id': [1, 2, 3],
    'profile': [
        {
            'name': {'first': 'Alice', 'last': 'Smith'},
            'contact': {'email': 'alice@example.com', 'phone': '555-0001'},
            'age': 25
        },
        {
            'name': {'first': 'Bob', 'last': 'Jones'},
            'contact': {'email': 'bob@example.com', 'phone': '555-0002'},
            'age': 30
        },
        {
            'name': {'first': 'Charlie', 'last': 'Brown'},
            'contact': {'email': 'charlie@example.com', 'phone': '555-0003'},
            'age': 35
        }
    ]
})

print("Deeply nested structs:")
print(df_nested)
print("\nSchema:")
print(df_nested.schema)

In [None]:
# Access nested struct fields (chaining)
result = df_nested.select([
    pl.col('user_id'),
    pl.col('profile').struct.field('name').struct.field('first').alias('first_name'),
    pl.col('profile').struct.field('name').struct.field('last').alias('last_name'),
    pl.col('profile').struct.field('contact').struct.field('email').alias('email'),
    pl.col('profile').struct.field('age').alias('age')
])

print("Extract deeply nested fields:")
print(result)

In [None]:
# Unnest nested structs level by level
result = (
    df_nested
    .unnest('profile')  # First level
    .unnest(['name', 'contact'])  # Second level (multiple structs)
)

print("Fully flattened nested structs:")
print(result)

## 1.5 Filtering and Operations on Structs

In [None]:
# Filter based on struct field values
result = df_struct.filter(
    pl.col('address').struct.field('city') == 'NYC'
)

print("Filter by struct field:")
print(result)

In [None]:
# Update struct field values
result = df_struct.with_columns(
    pl.struct([
        pl.col('address').struct.field('street').alias('street'),
        pl.col('address').struct.field('city').str.to_uppercase().alias('city'),
        pl.col('address').struct.field('zip').alias('zip')
    ]).alias('address')
)

print("Modified struct field (city to uppercase):")
print(result)

---
# Part 2: List Data Type

Lists store variable-length sequences. Each row can have a different number of elements.

## 2.1 Creating List Columns

In [None]:
# Create DataFrame with lists
df_lists = pl.DataFrame({
    'customer_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'purchase_amounts': [[100, 200, 150], [50, 75], [300, 250, 400, 100], [500]],
    'product_ids': [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]],
    'tags': [['vip', 'frequent'], ['new'], ['vip', 'enterprise', 'gold'], ['standard']]
})

print("DataFrame with lists:")
print(df_lists)
print("\nSchema:")
print(df_lists.schema)

## 2.2 Basic List Operations

In [None]:
# List aggregations
result = df_lists.select([
    pl.col('name'),
    pl.col('purchase_amounts').list.len().alias('num_purchases'),
    pl.col('purchase_amounts').list.sum().alias('total_spent'),
    pl.col('purchase_amounts').list.mean().alias('avg_purchase'),
    pl.col('purchase_amounts').list.max().alias('max_purchase'),
    pl.col('purchase_amounts').list.min().alias('min_purchase')
])

print("List aggregations:")
print(result)

In [None]:
# Access list elements by index
result = df_lists.select([
    pl.col('name'),
    pl.col('purchase_amounts').list.first().alias('first_purchase'),
    pl.col('purchase_amounts').list.last().alias('last_purchase'),
    # 0-indexed; Diana has a single purchase, so ask for null instead of an out-of-bounds error
    pl.col('purchase_amounts').list.get(1, null_on_oob=True).alias('second_purchase'),
    pl.col('purchase_amounts').list.get(-1).alias('last_purchase_negative_idx')
])

print("Access list elements:")
print(result)

In [None]:
# List slicing
result = df_lists.select([
    pl.col('name'),
    pl.col('purchase_amounts'),
    pl.col('purchase_amounts').list.head(2).alias('first_2_purchases'),
    pl.col('purchase_amounts').list.tail(2).alias('last_2_purchases'),
    pl.col('purchase_amounts').list.slice(1, 2).alias('middle_slice')  # offset=1, length=2
])

print("List slicing:")
print(result)

## 2.3 Advanced List Operations

In [None]:
# List sorting and reversing
result = df_lists.select([
    pl.col('name'),
    pl.col('purchase_amounts'),
    pl.col('purchase_amounts').list.sort().alias('sorted_purchases'),
    pl.col('purchase_amounts').list.sort(descending=True).alias('sorted_desc'),
    pl.col('purchase_amounts').list.reverse().alias('reversed')
])

print("List sorting and reversing:")
print(result)

In [None]:
# Check list membership
result = df_lists.select([
    pl.col('name'),
    pl.col('tags'),
    pl.col('tags').list.contains('vip').alias('is_vip'),
    pl.col('tags').list.contains('enterprise').alias('is_enterprise')
])

print("Check list membership:")
print(result)

In [None]:
# List unique and counts
df_with_duplicates = pl.DataFrame({
    'id': [1, 2, 3],
    'values': [[1, 2, 2, 3, 3, 3], [5, 5, 6], [7, 8, 8, 9, 9, 9, 9]]
})

result = df_with_duplicates.select([
    pl.col('values'),
    pl.col('values').list.unique().alias('unique_values'),
    pl.col('values').list.n_unique().alias('n_unique'),
    pl.col('values').list.len().alias('original_length')
])

print("List unique values:")
print(result)

## 2.4 List Transformations with list.eval()

In [None]:
# list.eval() allows applying expressions to each element in a list
result = df_lists.select([
    pl.col('name'),
    pl.col('purchase_amounts'),
    # Apply discount: multiply each element by 0.9
    pl.col('purchase_amounts').list.eval(pl.element() * 0.9).alias('discounted_10pct'),
    # Filter elements: keep only purchases > 100
    pl.col('purchase_amounts').list.eval(
        pl.element().filter(pl.element() > 100)
    ).alias('large_purchases_only')
])

print("List transformations with eval():")
print(result)

In [None]:
# More complex list.eval() examples
result = df_lists.select([
    pl.col('name'),
    pl.col('purchase_amounts'),
    # Categorize each purchase
    pl.col('purchase_amounts').list.eval(
        pl.when(pl.element() >= 200)
          .then(pl.lit('high'))
          .when(pl.element() >= 100)
          .then(pl.lit('medium'))
          .otherwise(pl.lit('low'))
    ).alias('purchase_categories'),
    # Round to nearest 50
    pl.col('purchase_amounts').list.eval(
        (pl.element() / 50).round(0) * 50
    ).alias('rounded_to_50')
])

print("Complex list transformations:")
print(result)

## 2.5 List Concatenation and Manipulation

In [None]:
# Concatenate lists from multiple columns
df_multi_lists = pl.DataFrame({
    'id': [1, 2, 3],
    'list_a': [[1, 2], [3, 4], [5, 6]],
    'list_b': [[7, 8], [9], [10, 11, 12]]
})

result = df_multi_lists.select([
    pl.col('id'),
    pl.col('list_a'),
    pl.col('list_b'),
    pl.concat_list(['list_a', 'list_b']).alias('combined')
])

print("Concatenate lists:")
print(result)

In [None]:
# Create list from scalar values
df_scalars = pl.DataFrame({
    'id': [1, 2, 3],
    'value_a': [10, 20, 30],
    'value_b': [40, 50, 60],
    'value_c': [70, 80, 90]
})

result = df_scalars.select([
    pl.col('id'),
    pl.concat_list(['value_a', 'value_b', 'value_c']).alias('values_list')
])

print("Create list from scalar columns:")
print(result)

In [None]:
# Flatten nested lists (lists of lists)
df_nested_lists = pl.DataFrame({
    'id': [1, 2],
    'nested': [[[1, 2], [3, 4]], [[5, 6], [7, 8, 9]]]
})

result = df_nested_lists.select([
    pl.col('id'),
    pl.col('nested'),
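    # The list namespace has no flatten(); exploding the inner lists inside
    # list.eval() removes one level of nesting per row
    pl.col('nested').list.eval(pl.element().explode()).alias('flattened')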
])

print("Flatten nested lists:")
print(result)

---
# Part 3: Array Data Type

Arrays are fixed-length sequences where all rows have the same number of elements. Useful for embeddings, coordinates, and fixed-size features.

## 3.1 Creating Arrays

In [None]:
# Create array by casting from list
df_arrays = pl.DataFrame({
    'id': [1, 2, 3, 4],
    'coordinates': [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0], [10.0, 11.0, 12.0]]
})

# Cast to Array with fixed width
df_arrays = df_arrays.with_columns(
    pl.col('coordinates').cast(pl.Array(pl.Float64, 3)).alias('coordinates_array')
)

print("DataFrame with Array column:")
print(df_arrays)
print("\nSchema:")
print(df_arrays.schema)

In [None]:
# Arrays vs Lists - Type enforcement
try:
    # This will fail - arrays must have same length for all rows
    df_invalid = pl.DataFrame({
        'values': [[1, 2, 3], [4, 5], [6, 7, 8]]  # Different lengths!
    }).with_columns(
        pl.col('values').cast(pl.Array(pl.Int64, 3))
    )
except Exception as e:
    print(f"Error (expected): {type(e).__name__}")
    print("Arrays require all rows to have the same length!")

## 3.2 When to Use Array vs List

| Use Case | Array | List |
|----------|-------|------|
| Fixed-size vectors (embeddings, RGB colors) | ✅ | ❌ |
| Variable-length sequences (tags, items) | ❌ | ✅ |
| Coordinates (x, y, z) | ✅ | ❌ |
| Time series windows (fixed size) | ✅ | ❌ |
| Transaction history (variable) | ❌ | ✅ |
| Machine learning features (fixed) | ✅ | ❌ |

In [None]:
# Practical example: RGB color values
df_colors = pl.DataFrame({
    'color_name': ['red', 'green', 'blue', 'yellow'],
    'rgb': [[255, 0, 0], [0, 255, 0], [0, 0, 255], [255, 255, 0]]
}).with_columns(
    pl.col('rgb').cast(pl.Array(pl.UInt8, 3)).alias('rgb_array')
)

print("RGB colors as Arrays:")
print(df_colors)

In [None]:
# Practical example: Geographic coordinates (lat, lon)
df_locations = pl.DataFrame({
    'city': ['New York', 'Los Angeles', 'Chicago'],
    'coords': [[40.7128, -74.0060], [34.0522, -118.2437], [41.8781, -87.6298]]
}).with_columns(
    pl.col('coords').cast(pl.Array(pl.Float64, 2)).alias('coordinates')
)

# Extract lat and lon
result = df_locations.with_columns([
    pl.col('coordinates').arr.get(0).alias('latitude'),
    pl.col('coordinates').arr.get(1).alias('longitude')
])

print("Geographic coordinates:")
print(result)

## 3.3 Array Operations

In [None]:
# Array operations are similar to List operations
result = df_arrays.select([
    pl.col('id'),
    pl.col('coordinates_array'),
    pl.col('coordinates_array').arr.sum().alias('sum'),
    pl.col('coordinates_array').arr.mean().alias('mean'),
    pl.col('coordinates_array').arr.get(0).alias('x'),
    pl.col('coordinates_array').arr.get(1).alias('y'),
    pl.col('coordinates_array').arr.get(2).alias('z')
])

print("Array operations:")
print(result)

---
# Part 4: Unnesting and Exploding

Converting nested structures to flat tables.

## 4.1 Explode vs Unnest

In [None]:
# Sample data with both structs and lists
df_mixed = pl.DataFrame({
    'customer_id': [1, 2],
    'name': ['Alice', 'Bob'],
    'purchases': [[100, 200, 150], [50, 75]],
    'address': [
        {'city': 'NYC', 'state': 'NY'},
        {'city': 'LA', 'state': 'CA'}
    ]
})

print("Original DataFrame:")
print(df_mixed)

In [None]:
# EXPLODE: Expands lists into multiple rows (one row per list element)
exploded = df_mixed.explode('purchases')

print("After explode('purchases'):")
print(exploded)
print(f"\nRows: {len(df_mixed)} -> {len(exploded)}")

In [None]:
# UNNEST: Expands structs into separate columns (flattens struct fields)
unnested = df_mixed.unnest('address')

print("After unnest('address'):")
print(unnested)
print(f"\nColumns: {df_mixed.columns} -> {unnested.columns}")

In [None]:
# Combine explode and unnest
result = (
    df_mixed
    .explode('purchases')  # Expand list to rows
    .unnest('address')     # Expand struct to columns
)

print("After both explode and unnest:")
print(result)

## 4.2 Exploding Multiple Lists

In [None]:
# Explode multiple list columns simultaneously
df_multi = pl.DataFrame({
    'id': [1, 2],
    'products': [['A', 'B', 'C'], ['D', 'E']],
    'prices': [[10, 20, 30], [40, 50]],
    'quantities': [[1, 2, 3], [4, 5]]
})

print("Original:")
print(df_multi)

# Explode multiple columns (must have same length per row)
exploded = df_multi.explode(['products', 'prices', 'quantities'])

print("\nExploded multiple lists:")
print(exploded)

## 4.3 Working with Lists of Structs

In [None]:
# Common pattern: List of structs (like JSON arrays of objects)
df_list_struct = pl.DataFrame({
    'order_id': [1, 2],
    'items': [
        [
            {'product': 'Laptop', 'price': 1000, 'qty': 1},
            {'product': 'Mouse', 'price': 25, 'qty': 2}
        ],
        [
            {'product': 'Keyboard', 'price': 75, 'qty': 1},
            {'product': 'Monitor', 'price': 300, 'qty': 2}
        ]
    ]
})

print("List of structs:")
print(df_list_struct)
print("\nSchema:")
print(df_list_struct.schema)

In [None]:
# Flatten list of structs: explode then unnest
flattened = (
    df_list_struct
    .explode('items')      # One row per item
    .unnest('items')       # Struct fields become columns
)

print("Flattened list of structs:")
print(flattened)

In [None]:
# Add computed column
result = flattened.with_columns(
    (pl.col('price') * pl.col('qty')).alias('total')
)

print("With computed total:")
print(result)

---
# Part 5: Real-World JSON Normalization

Working with real-world nested JSON data.

## 5.1 Step-by-Step Normalization

In [None]:
# Example: API response with nested user data
json_data = [
    {
        "user_id": 1,
        "username": "alice_smith",
        "profile": {
            "age": 25,
            "location": {"city": "NYC", "country": "USA"},
            "verified": True
        },
        "posts": [
            {"post_id": 101, "likes": 50, "comments": 5},
            {"post_id": 102, "likes": 75, "comments": 10}
        ]
    },
    {
        "user_id": 2,
        "username": "bob_jones",
        "profile": {
            "age": 30,
            "location": {"city": "LA", "country": "USA"},
            "verified": False
        },
        "posts": [
            {"post_id": 201, "likes": 100, "comments": 20}
        ]
    }
]

# Load the already-parsed JSON records (a list of dicts) into Polars
df_json = pl.DataFrame(json_data)

print("Raw JSON data:")
print(df_json)
print("\nSchema:")
print(df_json.schema)

In [None]:
# Step 1: Unnest the profile struct
df_step1 = df_json.unnest('profile')

print("Step 1 - Unnest profile:")
print(df_step1)

In [None]:
# Step 2: Unnest the nested location struct
df_step2 = df_step1.unnest('location')

print("Step 2 - Unnest location:")
print(df_step2)

In [None]:
# Step 3: Explode the posts list
df_step3 = df_step2.explode('posts')

print("Step 3 - Explode posts:")
print(df_step3)

In [None]:
# Step 4: Unnest the posts struct
df_normalized = df_step3.unnest('posts')

print("Fully normalized (flattened) data:")
print(df_normalized)

In [None]:
# All steps in one chain
df_normalized_chain = (
    pl.DataFrame(json_data)
    .unnest('profile')
    .unnest('location')
    .explode('posts')
    .unnest('posts')
)

print("Normalized in one chain:")
print(df_normalized_chain)

## 5.2 Advanced JSON Normalization Pattern

In [None]:
# Complex e-commerce order data
orders_json = [
    {
        "order_id": "ORD-001",
        "date": "2024-01-15",
        "customer": {
            "id": 1,
            "name": "Alice Smith",
            "tier": "gold"
        },
        "items": [
            {
                "sku": "LAPTOP-001",
                "name": "Dell Laptop",
                "price": 1200,
                "quantity": 1,
                "discount": 0.1
            },
            {
                "sku": "MOUSE-001",
                "name": "Wireless Mouse",
                "price": 25,
                "quantity": 2,
                "discount": 0.0
            }
        ],
        "shipping": {
            "address": {
                "street": "123 Main St",
                "city": "NYC",
                "zip": "10001"
            },
            "method": "express",
            "cost": 15.99
        }
    },
    {
        "order_id": "ORD-002",
        "date": "2024-01-16",
        "customer": {
            "id": 2,
            "name": "Bob Jones",
            "tier": "silver"
        },
        "items": [
            {
                "sku": "KEYBOARD-001",
                "name": "Mechanical Keyboard",
                "price": 150,
                "quantity": 1,
                "discount": 0.05
            }
        ],
        "shipping": {
            "address": {
                "street": "456 Oak Ave",
                "city": "LA",
                "zip": "90001"
            },
            "method": "standard",
            "cost": 5.99
        }
    }
]

# Normalize the complex structure
df_orders = (
    pl.DataFrame(orders_json)
    .unnest('customer')                    # Flatten customer
    .rename({'id': 'customer_id', 'name': 'customer_name'})  # Avoid column conflicts
    .unnest('shipping')                    # Flatten shipping
    .unnest('address')                     # Flatten nested address
    .rename({'cost': 'shipping_cost'})     # Avoid conflicts
    .explode('items')                      # Expand items list
    .unnest('items')                       # Flatten items struct
    # Add computed columns
    .with_columns([
        (pl.col('price') * pl.col('quantity')).alias('subtotal'),
        (pl.col('price') * pl.col('quantity') * (1 - pl.col('discount'))).alias('total_after_discount')
    ])
)

print("Normalized e-commerce orders:")
print(df_orders)

In [None]:
# Aggregation on normalized data
order_summary = (
    df_orders
    .group_by('order_id', maintain_order=True)  # group_by output is otherwise unordered
    .agg([
        pl.col('customer_name').first(),
        pl.col('date').first(),
        pl.col('total_after_discount').sum().alias('order_total'),
        pl.col('sku').count().alias('num_items'),
        pl.col('shipping_cost').first(),
    ])
    .with_columns(
        (pl.col('order_total') + pl.col('shipping_cost')).alias('grand_total')
    )
)

print("Order summary:")
print(order_summary)

---
# Part 6: Performance Considerations

In [None]:
# Arrays are more memory-efficient than Lists (when appropriate)
import time

# Create large dataset
n_rows = 100_000
data_list = [[1.0, 2.0, 3.0] for _ in range(n_rows)]

# As List
df_list = pl.DataFrame({'values': data_list})
list_size = df_list.estimated_size('mb')

# As Array
df_array = df_list.with_columns(
    pl.col('values').cast(pl.Array(pl.Float64, 3)).alias('values_array')
)
array_size = df_array.select('values_array').estimated_size('mb')

print(f"List column size: {list_size:.4f} MB")
print(f"Array column size: {array_size:.4f} MB")
print(f"\nArray is {list_size/array_size:.2f}x more memory efficient")

In [None]:
# Performance tip: Use list.eval() instead of explode when possible
df_perf = pl.DataFrame({
    'id': range(1000),
    'values': [[i, i+1, i+2, i+3, i+4] for i in range(1000)]
})

# Method 1: explode (creates one row per list element)
start = time.perf_counter()
result1 = df_perf.explode('values').with_columns(
    (pl.col('values') * 2).alias('doubled')
)
time1 = time.perf_counter() - start

# Method 2: list.eval (keeps one row per id)
start = time.perf_counter()
result2 = df_perf.with_columns(
    pl.col('values').list.eval(pl.element() * 2).alias('doubled')
)
time2 = time.perf_counter() - start

print(f"Explode method: {time1:.4f}s, {len(result1)} rows")
print(f"list.eval method: {time2:.4f}s, {len(result2)} rows")
# A single un-warmed timing is noisy; rerun the cell (or use %timeit) before drawing conclusions
print(f"\nexplode / list.eval time ratio: {time1/time2:.2f}x (values above 1 favor list.eval)")

---
# Summary & Best Practices

## When to Use Each Type:

### **Struct**
- ✅ Nested objects with named fields (JSON objects)
- ✅ Grouping related columns together
- ✅ Hierarchical data (address, contact info)
- ✅ Keeping related data together for better organization

### **List**
- ✅ Variable-length sequences (tags, transaction history)
- ✅ Arrays of unknown/changing size
- ✅ Collecting values in group_by aggregations
- ✅ JSON arrays with different lengths per row

### **Array**
- ✅ Fixed-size vectors (embeddings, coordinates)
- ✅ RGB/RGBA colors
- ✅ Time series windows (always N points)
- ✅ Machine learning features with fixed dimensions
- ✅ Better memory efficiency for fixed-size data

## Key Operations:

| Operation | Struct | List | Array |
|-----------|--------|------|-------|
| Access elements | `.struct.field()` | `.list.get()` | `.arr.get()` |
| Flatten to columns | `unnest()` | N/A | N/A |
| Expand to rows | N/A | `explode()` | `explode()` |
| Transform elements | `.struct.with_fields()` | `.list.eval()` | `.arr.to_list().list.eval()` (or `.arr.eval()` where your version provides it) |
| Aggregations | N/A | `.list.sum()`, etc. | `.arr.sum()`, etc. |

## Performance Tips:
1. Use **Array** instead of **List** when all rows have the same length (more memory-efficient)
2. Use **list.eval()** instead of explode + operation when the result should stay one row per list
3. Unnest structs early in the query pipeline so later steps work on plain columns
4. Avoid deeply nested structures where possible (they are harder to query)

## Common Patterns:
```python
# JSON normalization
df.unnest('struct_col').explode('list_col').unnest('nested_struct')

# Creating list from columns
pl.concat_list(['col1', 'col2', 'col3'])

# Creating struct from columns
pl.struct(['col1', 'col2', 'col3'])

# Transforming list elements
pl.col('list_col').list.eval(pl.element() * 2)
```

---
# Practice Exercises

Try these on your own!

In [None]:
# Exercise data
exercise_data = [
    {
        "movie_id": 1,
        "title": "The Matrix",
        "year": 1999,
        "genres": ["Action", "Sci-Fi"],
        "ratings": {"imdb": 8.7, "rotten_tomatoes": 88, "metacritic": 73},
        "cast": [
            {"actor": "Keanu Reeves", "role": "Neo"},
            {"actor": "Laurence Fishburne", "role": "Morpheus"}
        ]
    },
    {
        "movie_id": 2,
        "title": "Inception",
        "year": 2010,
        "genres": ["Action", "Sci-Fi", "Thriller"],
        "ratings": {"imdb": 8.8, "rotten_tomatoes": 87, "metacritic": 74},
        "cast": [
            {"actor": "Leonardo DiCaprio", "role": "Cobb"},
            {"actor": "Tom Hardy", "role": "Eames"}
        ]
    }
]

df_movies = pl.DataFrame(exercise_data)
print("Exercise data:")
print(df_movies)

In [None]:
# Exercise 1: Extract the IMDB rating from the ratings struct
# Your code here:


In [None]:
# Exercise 2: Find the average of all three rating scores for each movie
# Your code here:


In [None]:
# Exercise 3: Create a table with one row per actor (flatten the cast list)
# Your code here:


In [None]:
# Exercise 4: Count how many genres each movie has
# Your code here:


In [None]:
# Exercise 5: Find movies that have "Sci-Fi" in their genres list
# Your code here:
