# Advanced Schema Evolution Lab - Apache Iceberg

## 🎯 Advanced Lab Objectives

In this advanced lab, we will explore the sophisticated schema evolution capabilities of Apache Iceberg:

1. **Complex Schema Evolution**: Add/remove/modify multiple columns simultaneously
2. **Nested Schema Evolution**: Struct, Array, Map evolution strategies
3. **Field Evolution**: Rename, type changes, nullable modifications
4. **Backward/Forward Compatibility**: Ensure compatibility across versions
5. **Performance Impact**: Measure performance implications of schema changes
6. **Real-world Scenarios**: E-commerce, IoT, Financial data examples

## 🏗️ Schema Evolution Architecture

### Iceberg Schema Evolution Principles:
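- **Additive Changes**: Adding new optional columns (always safe; existing rows read the new column as null)
- **Non-breaking Changes**: Making a required column nullable, widening types (`int` to `long`, `float` to `double`, higher decimal precision)
- **Breaking Changes**: Removing columns, narrowing types (requires caution; narrowing is not among Iceberg's supported type promotions)
- **Nested Evolution**: Struct fields can evolve independently, because each nested field has its own field ID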
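
These principles follow from Iceberg tracking columns by field ID rather than by name. The sketch below models that idea in plain Python; it is not the PyIceberg API. The `read_row` helper, the `title` and `sku` names, and field ID 27 are hypothetical and exist only for illustration.

```python
from typing import Any, Dict

# Hypothetical stand-in for a row in an old data file:
# values are keyed by field ID, not by column name.
old_data_file_row: Dict[int, Any] = {1: "PROD_0001", 2: "Product 1", 6: 902}


def read_row(row_by_id: Dict[int, Any], schema: Dict[int, str]) -> Dict[str, Any]:
    """Project a stored row through a schema given as {field_id: column_name}."""
    # Fields missing from the file (added later) come back as None.
    # Fields missing from the schema (dropped) are simply not projected.
    return {name: row_by_id.get(field_id) for field_id, name in schema.items()}


schema_v1 = {1: "product_id", 2: "name", 6: "stock_quantity"}
# v2 renames field 2, drops field 6, and adds a new optional field 27.
schema_v2 = {1: "product_id", 2: "title", 27: "sku"}

row_v1 = read_row(old_data_file_row, schema_v1)
row_v2 = read_row(old_data_file_row, schema_v2)
print(row_v1)
print(row_v2)
```

The old file is never rewritten: the rename only changes the name attached to ID 2, the dropped column is hidden, and the added column reads as null.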

### Compatibility Matrix:
| Change Type | Backward Compatible | Forward Compatible | Safe |
|-------------|-------------------|-------------------|------|
| Add column | ✅ | ✅ | ✅ |
| Make nullable | ✅ | ✅ | ✅ |
| Widen type | ✅ | ✅ | ✅ |
| Rename field | ❌ | ❌ | ⚠️ |
| Delete field | ❌ | ✅ | ⚠️ |
| Narrow type | ✅ | ❌ | ⚠️ |
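
The "Widen type" and "Narrow type" rows can be made concrete with a small checker. This is a minimal sketch, assuming the promotions listed in the Iceberg spec for format versions 1 and 2 (`int` to `long`, `float` to `double`, and `decimal(P, S)` to a higher precision with the same scale). The function name and the string type notation are illustrative, not part of PyIceberg.

```python
import re

# Primitive promotions assumed from the Iceberg spec (format versions 1 and 2).
PRIMITIVE_PROMOTIONS = {("int", "long"), ("float", "double")}
_DECIMAL = re.compile(r"decimal\((\d+),\s*(\d+)\)")


def is_allowed_type_change(old: str, new: str) -> bool:
    """Return True if changing a column from `old` to `new` is a widening promotion."""
    if old == new:
        return True  # unchanged type, nothing to promote
    if (old, new) in PRIMITIVE_PROMOTIONS:
        return True
    old_dec, new_dec = _DECIMAL.fullmatch(old), _DECIMAL.fullmatch(new)
    if old_dec and new_dec:
        p_old, s_old = (int(g) for g in old_dec.groups())
        p_new, s_new = (int(g) for g in new_dec.groups())
        # Precision may grow, but the scale must stay the same.
        return s_old == s_new and p_new > p_old
    return False  # everything else, including narrowing, is rejected


for old, new in [("int", "long"), ("long", "int"), ("decimal(10,2)", "decimal(12,2)")]:
    print(f"{old} -> {new}: {is_allowed_type_change(old, new)}")
```

A check like this could run before a schema update is attempted, so that a narrowing change is caught early and handled as a data migration.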

## 📊 Dataset: E-commerce Platform

We will work with e-commerce platform data including:
- **Products**: Product information with nested attributes
- **Orders**: Orders with customer info and payment details
- **Reviews**: Product reviews with sentiment analysis
- **Analytics**: User behavior and engagement metrics


## 1. Setup and Import Libraries

First, we need to import the necessary libraries and set up the environment:


In [1]:
# Import necessary libraries
import os
import time
import json
from datetime import datetime, timedelta
from typing import Dict, List, Any

# PyIceberg imports
from pyiceberg.catalog import load_catalog
from pyiceberg.schema import Schema
from pyiceberg.types import (
    StructType, StringType, IntegerType, LongType, DoubleType, BooleanType,
    TimestampType, DateType, ListType, MapType, NestedField
)

# Data processing
import pyarrow as pa
import pyarrow.compute as pc
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

print("✅ Successfully imported all libraries!")
import pyiceberg  # imported here only to report the package version
print(f"📦 PyIceberg version: {pyiceberg.__version__}")
print(f"📦 PyArrow version: {pa.__version__}")
print(f"📦 Pandas version: {pd.__version__}")


✅ Successfully imported all libraries!
📦 PyArrow version: 21.0.0
📦 Pandas version: 2.3.2


In [2]:
# Set up warehouse and catalog
warehouse_path = "/tmp/advanced_iceberg_warehouse"
os.makedirs(warehouse_path, exist_ok=True)

# Configure catalog
catalog = load_catalog(
    "advanced",
    **{
        "type": "sql",
        "uri": f"sqlite:///{warehouse_path}/advanced_catalog.db",
        "warehouse": f"file://{warehouse_path}",
    },
)

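# Create namespace (only an "already exists" error is treated as benign)
from pyiceberg.exceptions import NamespaceAlreadyExistsError

try:
    catalog.create_namespace("ecommerce")
    print("✅ Created namespace 'ecommerce'")
except NamespaceAlreadyExistsError:
    print("ℹ️  Namespace 'ecommerce' already exists")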

print(f"📁 Warehouse path: {warehouse_path}")
print("🎯 Ready for Advanced Schema Evolution Lab!")


✅ Created namespace 'ecommerce'
📁 Warehouse path: /tmp/advanced_iceberg_warehouse
🎯 Ready for Advanced Schema Evolution Lab!
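
The cell above hard-codes `/tmp`, so a rerun may find a catalog left over from an earlier session. As a hedged alternative, the sketch below builds the same three catalog properties from a fresh temporary directory. The `build_catalog_properties` helper is hypothetical; only the property keys and the database file name come from the cell above.

```python
import tempfile
from pathlib import Path


def build_catalog_properties(warehouse_dir: Path) -> dict:
    """Return SQL-catalog properties pointing at `warehouse_dir` (hypothetical helper)."""
    warehouse_dir.mkdir(parents=True, exist_ok=True)
    db_file = warehouse_dir / "advanced_catalog.db"
    return {
        "type": "sql",
        "uri": f"sqlite:///{db_file}",         # SQLite file that stores catalog metadata
        "warehouse": warehouse_dir.as_uri(),   # file:// URI for table data and metadata
    }


# A fresh directory per run avoids picking up namespaces from earlier runs.
props = build_catalog_properties(Path(tempfile.mkdtemp()) / "advanced_iceberg_warehouse")
for key, value in props.items():
    print(f"{key}: {value}")
```

The resulting dictionary could be passed as `load_catalog("advanced", **props)` in place of the inline properties.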


## 2. Create Complex E-commerce Dataset

We will create an e-commerce dataset with nested structures to demonstrate advanced schema evolution features:


In [5]:
# Create product data with nested attributes
def generate_product_data(n_products=100):
    """Generate product data with nested structures"""
    
    import random  # stdlib random.choice can return a whole sub-list of tag_options
    
    products = []
    categories = ["Electronics", "Clothing", "Books", "Home", "Sports"]
    brands = ["Apple", "Samsung", "Nike", "Adidas", "Sony", "LG", "Canon", "Dell"]
    
    # Define tag options as separate lists
    tag_options = [
        ["featured", "new", "sale"],
        ["premium", "eco-friendly"],
        ["limited", "exclusive"],
        ["bestseller", "trending"]
    ]
    
    for i in range(n_products):
        product = {
            "product_id": f"PROD_{i+1:04d}",
            "name": f"Product {i+1}",
            "category": np.random.choice(categories),
            "brand": np.random.choice(brands),
            "price": round(np.random.uniform(10, 1000), 2),
            "stock_quantity": np.random.randint(0, 1000),
            "is_active": np.random.choice([True, False]),
            "created_at": datetime.now() - timedelta(days=np.random.randint(1, 365)),
            
            # Nested attributes
            "attributes": {
                "color": np.random.choice(["Red", "Blue", "Green", "Black", "White"]),
                "size": np.random.choice(["S", "M", "L", "XL", "XXL"]),
                "weight": round(np.random.uniform(0.1, 10), 2),
                "material": np.random.choice(["Cotton", "Polyester", "Metal", "Plastic", "Wood"])
            },
            
            # Array of tags: random.choice returns one whole sub-list (np.random.choice expects a 1-D array)
            "tags": random.choice(tag_options),
            
            # Array of images
            "images": [
                f"https://example.com/images/{i+1}_1.jpg",
                f"https://example.com/images/{i+1}_2.jpg"
            ],
            
            # Ratings summary
            "ratings": {
                "average": round(np.random.uniform(1, 5), 1),
                "count": np.random.randint(0, 1000),
                "distribution": {
                    "5_star": np.random.randint(0, 100),
                    "4_star": np.random.randint(0, 100),
                    "3_star": np.random.randint(0, 100),
                    "2_star": np.random.randint(0, 100),
                    "1_star": np.random.randint(0, 100)
                }
            }
        }
        products.append(product)
    
    return products

# Generate data
print("🔄 Generating product data...")
product_data = generate_product_data(100)
print(f"✅ Generated {len(product_data)} products")

# Display sample data
print("\n📋 Sample product data:")
sample_product = product_data[0]
for key, value in sample_product.items():
    if isinstance(value, dict):
        print(f"  {key}: {json.dumps(value, indent=4, default=str)}")
    else:
        print(f"  {key}: {value}")


🔄 Generating product data...
✅ Generated 100 products

📋 Sample product data:
  product_id: PROD_0001
  name: Product 1
  category: Electronics
  brand: Adidas
  price: 151.65
  stock_quantity: 902
  is_active: False
  created_at: 2025-02-01 11:13:40.196123
  attributes: {
    "color": "White",
    "size": "L",
    "weight": 3.44,
    "material": "Plastic"
}
  tags: ['featured', 'new', 'sale']
  images: ['https://example.com/images/1_1.jpg', 'https://example.com/images/1_2.jpg']
  ratings: {
    "average": 2.8,
    "count": 922,
    "distribution": {
        "5_star": 5,
        "4_star": 59,
        "3_star": 15,
        "2_star": 14,
        "1_star": 16
    }
}
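
Before writing the Iceberg schema by hand, it can help to list every leaf field of a sample record with its Python type. The sketch below does this for a trimmed-down record shaped like the sample output above. The `describe_fields` helper and the dotted-path notation are illustrative assumptions.

```python
from typing import Any, Dict


def describe_fields(record: Dict[str, Any], prefix: str = "") -> Dict[str, str]:
    """Flatten a nested record into {dotted.path: python_type_name}."""
    described: Dict[str, str] = {}
    for key, value in record.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested dicts (these become struct fields).
            described.update(describe_fields(value, prefix=f"{path}."))
        elif isinstance(value, list):
            # Summarize a list by the type names of its elements.
            element_types = "|".join(sorted({type(v).__name__ for v in value}))
            described[path] = f"list<{element_types or 'unknown'}>"
        else:
            described[path] = type(value).__name__
    return described


# Trimmed-down record with the same nesting as the generated products.
sample = {
    "product_id": "PROD_0001",
    "price": 151.65,
    "attributes": {"color": "White", "weight": 3.44},
    "tags": ["featured", "new", "sale"],
    "ratings": {"count": 922, "distribution": {"5_star": 5}},
}
field_types = describe_fields(sample)
for path, type_name in field_types.items():
    print(f"{path}: {type_name}")
```

Each dotted path corresponds to one field that needs an ID and a type in the schema. On the real generated data, values drawn with `np.random.choice` may show up as NumPy scalar types rather than built-in ones.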


## 3. Create Iceberg Table with Complex Schema

Now we will create an Iceberg table with a complex schema including nested structures. The cell below defines that schema:


In [6]:
# Define complex schema with nested structures
def create_complex_schema():
    """Create a complex Iceberg schema with nested types"""
    
    schema = Schema(
        # Basic fields
        NestedField(1, "product_id", StringType(), required=True),
        NestedField(2, "name", StringType(), required=True),
        NestedField(3, "category", StringType(), required=True),
        NestedField(4, "brand", StringType(), required=True),
        NestedField(5, "price", DoubleType(), required=True),
        NestedField(6, "stock_quantity", IntegerType(), required=True),
        NestedField(7, "is_active", BooleanType(), required=True),
        NestedField(8, "created_at", TimestampType(), required=True),
        
        # Nested struct for attributes
        NestedField(9, "attributes", StructType(
            NestedField(10, "color", StringType(), required=True),
            NestedField(11, "size", StringType(), required=True),
            NestedField(12, "weight", DoubleType(), required=True),
            NestedField(13, "material", StringType(), required=True)
        ), required=True),
        
        # Array of strings for tags
        NestedField(14, "tags", ListType(15, StringType()), required=True),
        
        # Array of strings for images
        NestedField(16, "images", ListType(17, StringType()), required=True),
        
        # Nested struct for ratings
        NestedField(18, "ratings", StructType(
            NestedField(19, "average", DoubleType(), required=True),
            NestedField(20, "count", IntegerType(), required=True),
            NestedField(21, "distribution", StructType(
                NestedField(22, "5_star", IntegerType(), required=True),
                NestedField(23, "4_star", IntegerType(), required=True),
                NestedField(24, "3_star", IntegerType(), required=True),
                NestedField(25, "2_star", IntegerType(), required=True),
                NestedField(26, "1_star", IntegerType(), required=True)
            ), required=True)
        ), required=True)
    )
    
    return schema

# Create the schema
print("🏗️ Creating complex schema...")
complex_schema = create_complex_schema()
print("✅ Schema created successfully!")

# Display schema structure
print("\n📋 Schema structure:")
print(complex_schema)


🏗️ Creating complex schema...
✅ Schema created successfully!

📋 Schema structure:
table {
  1: product_id: required string
  2: name: required string
  3: category: required string
  4: brand: required string
  5: price: required double
  6: stock_quantity: required int
  7: is_active: required boolean
  8: created_at: required timestamp
  9: attributes: required struct<10: color: required string, 11: size: required string, 12: weight: required double, 13: material: required string>
  14: tags: required list<string>
  16: images: required list<string>
  18: ratings: required struct<19: average: required double, 20: count: required int, 21: distribution: required struct<22: 5_star: required int, 23: 4_star: required int, 24: 3_star: required int, 25: 2_star: required int, 26: 1_star: required int>>
}
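
The printed schema appears to skip IDs 15 and 17. Those IDs belong to the list elements of `tags` and `images`, which the printout does not show. The sketch below walks a simplified model of the schema to collect every field ID, check for duplicates, and find the next free ID. The tuple-based model and the `collect_field_ids` helper are assumptions for illustration, not PyIceberg structures.

```python
from collections import Counter


def collect_field_ids(fields):
    """Collect all field IDs, including nested struct fields and list elements."""
    ids = []
    for field_id, _name, field_type in fields:
        ids.append(field_id)
        if isinstance(field_type, tuple) and field_type[0] == "struct":
            ids.extend(collect_field_ids(field_type[1]))
        elif isinstance(field_type, tuple) and field_type[0] == "list":
            ids.append(field_type[1])  # the list element carries its own ID
    return ids


# Simplified model of part of the schema: (id, name, type).
schema_model = [
    (1, "product_id", "string"),
    (9, "attributes", ("struct", [(10, "color", "string"), (11, "size", "string")])),
    (14, "tags", ("list", 15, "string")),
    (16, "images", ("list", 17, "string")),
]

ids = collect_field_ids(schema_model)
duplicates = [field_id for field_id, n in Counter(ids).items() if n > 1]
next_free_id = max(ids) + 1
print(f"IDs in use: {ids}")
print(f"Duplicates: {duplicates}")
print(f"Next free ID: {next_free_id}")
```

Keeping IDs unique and never reusing a dropped ID is what lets later schema changes resolve old data files correctly.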
