# Time Travel & Versioning Lab - Apache Iceberg

## 🎯 Lab Objectives

In this lab, we will explore Apache Iceberg's powerful time travel and versioning capabilities:

1. **Snapshot Management**: Create and manage data snapshots
2. **Time Travel Queries**: Query data at specific points in time
3. **Rollback Operations**: Rollback to previous versions
4. **Version Comparison**: Compare data between different versions
5. **Historical Analysis**: Analyze data changes over time
6. **Production Scenarios**: Apply time travel to real-world use cases

## 🏗️ Time Travel Architecture

### Iceberg Time Travel Features:
- **Snapshot-based Versioning**: Each committed write creates a new, immutable snapshot
- **Time Travel Queries**: Query data as it existed at any snapshot that has not been expired
- **Rollback Capabilities**: Restore data to previous states
- **Version Comparison**: Compare data between snapshots
- **Historical Metadata**: Track all changes and operations
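
To make the snapshot model above concrete, here is a minimal in-memory sketch. `ToyTable` and `Snapshot` are hypothetical names invented for illustration; this is not PyIceberg and not how Iceberg stores data on disk. It only shows the idea that every write commits a new immutable snapshot, reads can target any retained snapshot, and a rollback moves the "current" pointer without erasing history.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    parent_id: Optional[int]
    operation: str
    rows: tuple  # immutable copy of the table contents at commit time


@dataclass
class ToyTable:
    """Hypothetical stand-in for a versioned table (illustration only)."""
    snapshots: Dict[int, Snapshot] = field(default_factory=dict)
    current_id: Optional[int] = None
    _next_id: int = 1

    def _commit(self, operation: str, rows: List[str]) -> int:
        # Every write records a new snapshot whose parent is the current one.
        snap = Snapshot(self._next_id, self.current_id, operation, tuple(rows))
        self.snapshots[snap.snapshot_id] = snap
        self.current_id = snap.snapshot_id
        self._next_id += 1
        return snap.snapshot_id

    def append(self, new_rows: List[str]) -> int:
        return self._commit("append", self.scan() + list(new_rows))

    def overwrite(self, rows: List[str]) -> int:
        return self._commit("overwrite", list(rows))

    def scan(self, snapshot_id: Optional[int] = None) -> List[str]:
        # Time travel: read any retained snapshot, defaulting to the current one.
        sid = self.current_id if snapshot_id is None else snapshot_id
        return [] if sid is None else list(self.snapshots[sid].rows)

    def rollback_to(self, snapshot_id: int) -> None:
        # Rollback only moves the pointer; older and newer snapshots stay in the log.
        if snapshot_id not in self.snapshots:
            raise KeyError(f"unknown snapshot {snapshot_id}")
        self.current_id = snapshot_id


table = ToyTable()
s1 = table.append(["alice", "bob"])
s2 = table.append(["carol"])
s3 = table.overwrite(["alice"])
current_after_overwrite = table.scan()
table.rollback_to(s2)
print(current_after_overwrite, table.scan(), len(table.snapshots))
```

After the rollback the table reads as it did at `s2`, yet all three snapshots remain available for later queries.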

### Use Cases:
- **Data Recovery**: Restore accidentally deleted or corrupted data
- **Audit Trails**: Track all changes for compliance and auditing
- **A/B Testing**: Compare different versions of datasets
- **Debugging**: Investigate data issues by examining historical states
- **Compliance**: Meet regulatory requirements for data retention
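
Several of these use cases come down to comparing two versions of the same table. The sketch below shows one way to do that on plain Python rows keyed by `customer_id`. The helper name `diff_versions` and the sample rows are made up for illustration, and it assumes the key is unique within each version (duplicate keys would collapse to the last row seen).

```python
from typing import Any, Dict, List


def diff_versions(old_rows: List[Dict[str, Any]],
                  new_rows: List[Dict[str, Any]],
                  key: str = "customer_id") -> Dict[str, Any]:
    """Report keys that were added, removed, or changed between two versions."""
    old = {row[key]: row for row in old_rows}
    new = {row[key]: row for row in new_rows}

    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())

    changed = {}
    for k in sorted(old.keys() & new.keys()):
        # Collect the columns whose values differ for this key.
        cols = sorted(
            c for c in old[k].keys() | new[k].keys()
            if old[k].get(c) != new[k].get(c)
        )
        if cols:
            changed[k] = cols

    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical rows standing in for two snapshots of the customer table.
before = [
    {"customer_id": "CUST_00001", "customer_segment": "Budget", "account_status": "Inactive"},
    {"customer_id": "CUST_00002", "customer_segment": "Standard", "account_status": "Active"},
]
after = [
    {"customer_id": "CUST_00002", "customer_segment": "Premium", "account_status": "Active"},
    {"customer_id": "CUST_00003", "customer_segment": "VIP", "account_status": "Pending"},
]

result = diff_versions(before, after)
print(result)
```

In this lab the same idea would apply to the rows returned by two snapshot scans. Note that `generate_customer_data` restarts its IDs at `CUST_00001` on every call, so a key-based comparison is only reliable if the IDs are made unique first.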

## 📊 Dataset: E-commerce Customer Data

We will work with evolving customer data including:
- **Customer Profiles**: Personal and demographic information
- **Purchase History**: Transaction records over time
- **Account Status**: Active/inactive status changes
- **Data Evolution**: Schema and data changes over time


## 1. Setup and Import Libraries


In [1]:
# Import necessary libraries
import os
import time
import json
import random
from datetime import datetime, timedelta
from typing import Dict, List, Any

# PyIceberg imports
from pyiceberg.catalog import load_catalog
from pyiceberg.schema import Schema
from pyiceberg.types import (
    StructType, StringType, IntegerType, LongType, DoubleType, BooleanType,
    TimestampType, DateType, NestedField
)

# Data processing
import pyarrow as pa
import pyarrow.compute as pc
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

print("✅ Successfully imported all libraries!")
print(f"📦 PyArrow version: {pa.__version__}")
print(f"📦 Pandas version: {pd.__version__}")
print(f"📦 NumPy version: {np.__version__}")


✅ Successfully imported all libraries!
📦 PyArrow version: 21.0.0
📦 Pandas version: 2.3.2
📦 NumPy version: 2.2.6


## 2. Warehouse and Catalog Setup


In [2]:
# Setup warehouse and catalog
warehouse_path = "/tmp/timetravel_iceberg_warehouse"
os.makedirs(warehouse_path, exist_ok=True)

# Configure catalog
catalog = load_catalog(
    "timetravel",
    **{
        'type': 'sql',
        "uri": f"sqlite:///{warehouse_path}/timetravel_catalog.db",
        "warehouse": f"file://{warehouse_path}",
    },
)

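# Create namespace (only treat "already exists" as expected; let other errors surface)
from pyiceberg.exceptions import NamespaceAlreadyExistsError

try:
    catalog.create_namespace("ecommerce")
    print("✅ Created namespace 'ecommerce'")
except NamespaceAlreadyExistsError as e:
    print(f"ℹ️  Namespace 'ecommerce' already exists: {e}")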

print(f"📁 Warehouse path: {warehouse_path}")
print("🎯 Ready for Time Travel Lab!")


ℹ️  Namespace 'ecommerce' already exists: Namespace ecommerce already exists
📁 Warehouse path: /tmp/timetravel_iceberg_warehouse
🎯 Ready for Time Travel Lab!


## 3. Generate Evolving Customer Dataset


In [3]:
def generate_customer_data(n_customers=1000):
    """Generate realistic customer data that will evolve over time"""
    
    customers = []
    countries = ["USA", "Canada", "UK", "Germany", "France", "Japan", "Australia", "Brazil"]
    cities = ["New York", "Los Angeles", "Chicago", "Houston", "Phoenix", "Philadelphia", "San Antonio", "San Diego"]
    segments = ["Premium", "Standard", "Budget", "VIP"]
    statuses = ["Active", "Inactive", "Suspended", "Pending"]
    
    for i in range(n_customers):
        # Generate customer with realistic data
        customer = {
            "customer_id": f"CUST_{i+1:05d}",
            "first_name": f"Customer{i+1}",
            "last_name": f"LastName{i+1}",
            "email": f"customer{i+1}@example.com",
            "phone": f"+1-555-{random.randint(100, 999)}-{random.randint(1000, 9999)}",
            "country": random.choice(countries),
            "city": random.choice(cities),
            "age": random.randint(18, 80),
            "customer_segment": random.choice(segments),
            "account_status": random.choice(statuses),
            "registration_date": datetime(2023, 1, 1) + timedelta(days=random.randint(0, 180)),
            "last_login": datetime(2023, 6, 1) + timedelta(days=random.randint(0, 30)),
            "total_purchases": random.randint(0, 50),
            "total_spent": round(random.uniform(0, 10000), 2),
            "loyalty_points": random.randint(0, 5000),
            "is_premium": random.choice([True, False]),
            "newsletter_subscribed": random.choice([True, False]),
            "created_at": datetime.now() - timedelta(days=random.randint(1, 365)),
            "updated_at": datetime.now()
        }
        
        customers.append(customer)
    
    return customers

# Generate initial customer data
print("🔄 Generating initial customer data...")
customer_data = generate_customer_data(1000)
print(f"✅ Generated {len(customer_data):,} customers")

# Display sample data
print("\n📋 Sample customer data:")
sample_customer = customer_data[0]
for key, value in sample_customer.items():
    print(f"  {key}: {value}")

# Convert to DataFrame for easier manipulation
df_customers = pd.DataFrame(customer_data)
print(f"\n📊 Dataset overview:")
print(f"  Total customers: {len(df_customers):,}")
print(f"  Countries: {df_customers['country'].nunique()}")
print(f"  Segments: {df_customers['customer_segment'].nunique()}")
print(f"  Statuses: {df_customers['account_status'].nunique()}")
print(f"  Date range: {df_customers['registration_date'].min()} to {df_customers['registration_date'].max()}")


🔄 Generating initial customer data...
✅ Generated 1,000 customers

📋 Sample customer data:
  customer_id: CUST_00001
  first_name: Customer1
  last_name: LastName1
  email: customer1@example.com
  phone: +1-555-255-7057
  country: UK
  city: New York
  age: 23
  customer_segment: Budget
  account_status: Pending
  registration_date: 2023-02-01 00:00:00
  last_login: 2023-06-24 00:00:00
  total_purchases: 31
  total_spent: 7774.29
  loyalty_points: 2478
  is_premium: True
  newsletter_subscribed: True
  created_at: 2025-02-06 13:52:03.064959
  updated_at: 2025-09-17 13:52:03.064964

📊 Dataset overview:
  Total customers: 1,000
  Countries: 8
  Segments: 4
  Statuses: 4
  Date range: 2023-01-01 00:00:00 to 2023-06-30 00:00:00


## 4. Create Customer Schema and Initial Table


In [4]:
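# Rollback operations with schema-aware handling.
# NOTE: this cell uses `customers_table` and `snapshots_df`, which are created in the
# cells further down (Sections 4 and 5). Run those first to avoid the NameError shown in the output below.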
print("🔧 FIXED Rollback Operations with Smart Schema Management")

# Helper function to create PyArrow table with dynamic schema
def create_pyarrow_table_with_dynamic_schema(df, include_credit_score=False):
    """Create PyArrow table with dynamic schema based on data availability"""
    
    # Base schema without credit_score
    base_schema = pa.schema([
        pa.field('customer_id', pa.string(), nullable=False),
        pa.field('first_name', pa.string(), nullable=False),
        pa.field('last_name', pa.string(), nullable=False),
        pa.field('email', pa.string(), nullable=False),
        pa.field('phone', pa.string(), nullable=False),
        pa.field('country', pa.string(), nullable=False),
        pa.field('city', pa.string(), nullable=False),
        pa.field('age', pa.int32(), nullable=False),
        pa.field('customer_segment', pa.string(), nullable=False),
        pa.field('account_status', pa.string(), nullable=False),
        pa.field('registration_date', pa.timestamp('us'), nullable=False),
        pa.field('last_login', pa.timestamp('us'), nullable=False),
        pa.field('total_purchases', pa.int32(), nullable=False),
        pa.field('total_spent', pa.float64(), nullable=False),
        pa.field('loyalty_points', pa.int32(), nullable=False),
        pa.field('is_premium', pa.bool_(), nullable=False),
        pa.field('newsletter_subscribed', pa.bool_(), nullable=False),
        pa.field('created_at', pa.timestamp('us'), nullable=False),
        pa.field('updated_at', pa.timestamp('us'), nullable=False)
    ])
    
    # Add credit_score if needed
    if include_credit_score:
        base_schema = base_schema.append(pa.field('credit_score', pa.int32(), nullable=False))
    
    # Convert timestamps
    df_clean = df.copy()
    timestamp_cols = ['registration_date', 'last_login', 'created_at', 'updated_at']
    for col in timestamp_cols:
        if col in df_clean.columns:
            df_clean[col] = pd.to_datetime(df_clean[col]).dt.floor('us')
    
    # Create table data dictionary
    table_data = {}
    for field in base_schema:
        field_name = field.name
        if field_name in df_clean.columns:
            table_data[field_name] = df_clean[field_name]
        elif field_name == 'credit_score' and include_credit_score:
            # Generate random credit scores if not present
            table_data[field_name] = np.random.randint(300, 850, len(df_clean))
    
    return pa.table(table_data, schema=base_schema)

# Rollback Operation 1: Rollback to Snapshot 3 (before deletions)
print("\n🔄 Rollback Operation 1: Rolling back to Snapshot 3 (before deletions)")

# Get current state before rollback
current_before_rollback = customers_table.scan().to_arrow().to_pandas()
print(f"  Current records before rollback: {len(current_before_rollback):,}")

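# Read Snapshot 3 and write its rows back with overwrite(). This commits a new snapshot
# holding the old data; it does not move the table's current-snapshot pointer.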
snapshot_3_id = snapshots_df.iloc[2]['snapshot_id']
snapshot_3_data = customers_table.scan(snapshot_id=snapshot_3_id).to_arrow().to_pandas()

# Convert back to PyArrow table for overwrite
snapshot_3_table = create_pyarrow_table_with_dynamic_schema(snapshot_3_data, include_credit_score='credit_score' in snapshot_3_data.columns)
customers_table.overwrite(snapshot_3_table)

print("✅ Rollback completed!")

# Verify rollback
current_after_rollback = customers_table.scan().to_arrow().to_pandas()
print(f"  Records after rollback: {len(current_after_rollback):,}")
print(f"  Records restored: {len(current_after_rollback) - len(current_before_rollback):,}")

# Check if deleted customers are back
inactive_customers_restored = len(current_after_rollback[current_after_rollback['account_status'] == 'Inactive'])
print(f"  Inactive customers restored: {inactive_customers_restored:,}")

# Rollback Operation 2: Rollback to Snapshot 1 (original state)
print(f"\n🔄 Rollback Operation 2: Rolling back to Snapshot 1 (original state)")

# Get current state
current_before_rollback2 = customers_table.scan().to_arrow().to_pandas()
print(f"  Current records before rollback: {len(current_before_rollback2):,}")

# IMPORTANT: First remove credit_score column from schema
print("  Step 1: Removing credit_score column from schema...")
try:
    # PyIceberg's UpdateSchema exposes delete_column (there is no remove_column)
    customers_table.update_schema().delete_column("credit_score").commit()
    print("  ✅ Schema updated: credit_score column removed")
except Exception as e:
    print(f"  ℹ️ Schema update: {e}")

# Get data from Snapshot 1 and overwrite current table
snapshot_1_id = snapshots_df.iloc[0]['snapshot_id']
snapshot_1_data = customers_table.scan(snapshot_id=snapshot_1_id).to_arrow().to_pandas()

# Convert back to PyArrow table for overwrite (without credit_score)
snapshot_1_table = create_pyarrow_table_with_dynamic_schema(snapshot_1_data, include_credit_score=False)
customers_table.overwrite(snapshot_1_table)

print("✅ Rollback to original state completed!")

# Verify rollback
current_after_rollback2 = customers_table.scan().to_arrow().to_pandas()
print(f"  Records after rollback: {len(current_after_rollback2):,}")
print(f"  Records restored: {len(current_after_rollback2) - len(current_before_rollback2):,}")

# Check if schema is back to original
print(f"  Columns after rollback: {list(current_after_rollback2.columns)}")
print(f"  Has credit_score column: {'credit_score' in current_after_rollback2.columns}")

# Check premium customers (should be back to original count)
premium_customers_restored = len(current_after_rollback2[current_after_rollback2['customer_segment'] == 'Premium'])
print(f"  Premium customers restored: {premium_customers_restored:,}")

# Rollback Operation 3: Rollback to Snapshot 5 (latest with all features)
print(f"\n🔄 Rollback Operation 3: Rolling back to Snapshot 5 (latest with all features)")

# IMPORTANT: First restore the schema to include credit_score column
print("  Step 1: Restoring schema with credit_score column...")
customers_table.update_schema().add_column("credit_score", IntegerType()).commit()
print("  ✅ Schema updated: credit_score column added")

# Get data from Snapshot 5 and overwrite current table
snapshot_5_id = snapshots_df.iloc[4]['snapshot_id']
snapshot_5_data = customers_table.scan(snapshot_id=snapshot_5_id).to_arrow().to_pandas()

# Convert back to PyArrow table for overwrite (with credit_score column)
snapshot_5_table = create_pyarrow_table_with_dynamic_schema(snapshot_5_data, include_credit_score=True)
customers_table.overwrite(snapshot_5_table)

print("✅ Rollback to latest state completed!")

# Verify rollback
final_state = customers_table.scan().to_arrow().to_pandas()
print(f"  Final records: {len(final_state):,}")
print(f"  Final columns: {list(final_state.columns)}")
print(f"  Has credit_score column: {'credit_score' in final_state.columns}")

# Display rollback summary
print(f"\n📊 Rollback Operations Summary:")
print("-" * 50)
print(f"Rollback 1 (to Snapshot 3):")
print(f"  Restored {len(current_after_rollback) - len(current_before_rollback):,} deleted customers")
print(f"  Restored {inactive_customers_restored:,} inactive customers")

print(f"\nRollback 2 (to Snapshot 1):")
print(f"  Restored {len(current_after_rollback2) - len(current_before_rollback2):,} customers")
print(f"  Removed credit_score column from schema")
print(f"  Reset premium customers to {premium_customers_restored:,}")

print(f"\nRollback 3 (to Snapshot 5):")
print(f"  Restored schema with credit_score column")
print(f"  Restored to latest state with all features")
print(f"  Final records: {len(final_state):,}")
print(f"  Final columns: {len(final_state.columns)}")

print("-" * 50)


🔧 FIXED Rollback Operations with Smart Schema Management

🔄 Rollback Operation 1: Rolling back to Snapshot 3 (before deletions)


NameError: name 'customers_table' is not defined

In [None]:
# Create Iceberg schema for customer data
def create_customer_schema():
    """Create Iceberg schema for customer data"""
    
    schema = Schema(
        NestedField(1, "customer_id", StringType(), required=True),
        NestedField(2, "first_name", StringType(), required=True),
        NestedField(3, "last_name", StringType(), required=True),
        NestedField(4, "email", StringType(), required=True),
        NestedField(5, "phone", StringType(), required=True),
        NestedField(6, "country", StringType(), required=True),
        NestedField(7, "city", StringType(), required=True),
        NestedField(8, "age", IntegerType(), required=True),
        NestedField(9, "customer_segment", StringType(), required=True),
        NestedField(10, "account_status", StringType(), required=True),
        NestedField(11, "registration_date", TimestampType(), required=True),
        NestedField(12, "last_login", TimestampType(), required=True),
        NestedField(13, "total_purchases", IntegerType(), required=True),
        NestedField(14, "total_spent", DoubleType(), required=True),
        NestedField(15, "loyalty_points", IntegerType(), required=True),
        NestedField(16, "is_premium", BooleanType(), required=True),
        NestedField(17, "newsletter_subscribed", BooleanType(), required=True),
        NestedField(18, "created_at", TimestampType(), required=True),
        NestedField(19, "updated_at", TimestampType(), required=True)
    )
    
    return schema

# Create the schema
print("🏗️ Creating customer schema...")
customer_schema = create_customer_schema()
print("✅ Schema created successfully!")

# Helper function to create PyArrow table
def create_customer_pyarrow_table(df):
    """Create PyArrow table with correct timestamp precision"""
    
    # Create PyArrow schema
    pyarrow_schema = pa.schema([
        pa.field('customer_id', pa.string(), nullable=False),
        pa.field('first_name', pa.string(), nullable=False),
        pa.field('last_name', pa.string(), nullable=False),
        pa.field('email', pa.string(), nullable=False),
        pa.field('phone', pa.string(), nullable=False),
        pa.field('country', pa.string(), nullable=False),
        pa.field('city', pa.string(), nullable=False),
        pa.field('age', pa.int32(), nullable=False),
        pa.field('customer_segment', pa.string(), nullable=False),
        pa.field('account_status', pa.string(), nullable=False),
        pa.field('registration_date', pa.timestamp('us'), nullable=False),
        pa.field('last_login', pa.timestamp('us'), nullable=False),
        pa.field('total_purchases', pa.int32(), nullable=False),
        pa.field('total_spent', pa.float64(), nullable=False),
        pa.field('loyalty_points', pa.int32(), nullable=False),
        pa.field('is_premium', pa.bool_(), nullable=False),
        pa.field('newsletter_subscribed', pa.bool_(), nullable=False),
        pa.field('created_at', pa.timestamp('us'), nullable=False),
        pa.field('updated_at', pa.timestamp('us'), nullable=False)
    ])
    
    # Convert timestamps to microseconds
    df_clean = df.copy()
    timestamp_cols = ['registration_date', 'last_login', 'created_at', 'updated_at']
    for col in timestamp_cols:
        df_clean[col] = pd.to_datetime(df_clean[col]).dt.floor('us')
    
    # Create table with explicit schema
    return pa.table({
        'customer_id': df_clean['customer_id'],
        'first_name': df_clean['first_name'],
        'last_name': df_clean['last_name'],
        'email': df_clean['email'],
        'phone': df_clean['phone'],
        'country': df_clean['country'],
        'city': df_clean['city'],
        'age': df_clean['age'],
        'customer_segment': df_clean['customer_segment'],
        'account_status': df_clean['account_status'],
        'registration_date': df_clean['registration_date'],
        'last_login': df_clean['last_login'],
        'total_purchases': df_clean['total_purchases'],
        'total_spent': df_clean['total_spent'],
        'loyalty_points': df_clean['loyalty_points'],
        'is_premium': df_clean['is_premium'],
        'newsletter_subscribed': df_clean['newsletter_subscribed'],
        'created_at': df_clean['created_at'],
        'updated_at': df_clean['updated_at']
    }, schema=pyarrow_schema)

# Helper function to safely get snapshot info
def get_snapshot_info(snapshots_df, snapshot_row):
    """Safely extract snapshot information from DataFrame row"""
    info = {
        'snapshot_id': snapshot_row['snapshot_id'],
        'timestamp': None,
        'summary': None
    }
    
    # Try different timestamp column names
    timestamp_cols = ['timestamp_ms', 'committed_at', 'timestamp', 'created_at']
    for col in timestamp_cols:
        if col in snapshots_df.columns:
            info['timestamp'] = snapshot_row[col]
            break
    
    # Try different summary column names
    summary_cols = ['summary', 'description', 'operation']
    for col in summary_cols:
        if col in snapshots_df.columns:
            info['summary'] = snapshot_row[col]
            break
    
    return info

# Create initial table (Snapshot 1)
print("\n🔄 Creating initial customer table (Snapshot 1)...")

# Drop existing table if it exists (only a missing table is expected here)
from pyiceberg.exceptions import NoSuchTableError

try:
    catalog.drop_table("ecommerce.customers")
    print("🗑️ Dropped existing table")
except NoSuchTableError as e:
    print(f"ℹ️ No existing table to drop: {e}")

# Create table
customers_table = catalog.create_table(
    "ecommerce.customers",
    schema=customer_schema
)

# Convert data to PyArrow table
customer_table = create_customer_pyarrow_table(df_customers)

# Insert initial data
customers_table.append(customer_table)

print("✅ Initial table created with Snapshot 1!")

# Get snapshot information - Safe access
snapshots_df = customers_table.inspect.snapshots().to_pandas()
print(f"\n📸 Snapshot Information:")
print(f"  Total snapshots: {len(snapshots_df)}")
print(f"  Available columns: {list(snapshots_df.columns)}")

if len(snapshots_df) > 0:
    latest_snapshot_info = get_snapshot_info(snapshots_df, snapshots_df.iloc[-1])
    print(f"  Latest snapshot ID: {latest_snapshot_info['snapshot_id']}")
    
    if latest_snapshot_info['timestamp'] is not None:
        if isinstance(latest_snapshot_info['timestamp'], (int, float)):
            timestamp_str = datetime.fromtimestamp(latest_snapshot_info['timestamp'] / 1000).strftime("%Y-%m-%d %H:%M:%S")
        else:
            timestamp_str = str(latest_snapshot_info['timestamp'])
        print(f"  Latest snapshot timestamp: {timestamp_str}")
    else:
        print(f"  Latest snapshot timestamp: Not available")
    
    if latest_snapshot_info['summary'] is not None:
        print(f"  Latest snapshot summary: {latest_snapshot_info['summary']}")
    else:
        print(f"  Latest snapshot summary: Not available")

# Get current record count
current_count = len(customers_table.scan().to_arrow())
print(f"  Current records: {current_count:,}")


🏗️ Creating customer schema...
✅ Schema created successfully!

🔄 Creating initial customer table (Snapshot 1)...
🗑️ Dropped existing table
✅ Initial table created with Snapshot 1!

📸 Snapshot Information:
  Total snapshots: 1
  Available columns: ['committed_at', 'snapshot_id', 'parent_id', 'operation', 'manifest_list', 'summary']
  Latest snapshot ID: 1783715834426628930
  Latest snapshot timestamp: 2025-09-17 06:44:58.504000
  Latest snapshot summary: [('added-files-size', '43582'), ('added-data-files', '1'), ('added-records', '1000'), ('total-data-files', '1'), ('total-delete-files', '0'), ('total-records', '1000'), ('total-files-size', '43582'), ('total-position-deletes', '0'), ('total-equality-deletes', '0')]
  Current records: 1,000


## 5. Snapshot Management - Creating Multiple Versions

Now let's create multiple snapshots by making changes to our data over time. Each write operation creates a new snapshot.

### 📸 **Snapshot Operations:**
- **Add New Customers**: Insert new customer records
- **Update Existing Customers**: Modify customer information
- **Delete Customers**: Remove customer records
- **Schema Evolution**: Add new columns to the schema (a metadata-only change; the next data write creates the snapshot)
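
One caveat when numbering snapshots by position: depending on the PyIceberg version, `overwrite()` may be committed as a delete followed by an append, so a single logical step can contribute two rows to `inspect.snapshots()`. In that case positional lookups such as `snapshots_df.iloc[2]` no longer line up with "Snapshot 3". The sketch below is a hedged illustration on a made-up snapshot log (the IDs and the helper name `logical_versions` are hypothetical). It collapses a delete that is immediately followed by its child append into one logical version, using the `snapshot_id`, `parent_id`, and `operation` columns that `inspect.snapshots()` reports.

```python
from typing import Dict, List, Optional


def logical_versions(log: List[Dict[str, Optional[object]]]) -> List[int]:
    """Return one snapshot ID per logical write step, in commit order.

    Heuristic: a 'delete' immediately followed by an 'append' whose parent is
    that delete is treated as a single overwrite, and the append's ID is kept.
    A genuine delete followed by an unrelated append would also be merged, so
    this is only a starting point.
    """
    versions = []
    i = 0
    while i < len(log):
        snap = log[i]
        nxt = log[i + 1] if i + 1 < len(log) else None
        is_overwrite_pair = (
            snap["operation"] == "delete"
            and nxt is not None
            and nxt["operation"] == "append"
            and nxt["parent_id"] == snap["snapshot_id"]
        )
        if is_overwrite_pair:
            versions.append(nxt["snapshot_id"])
            i += 2
        else:
            versions.append(snap["snapshot_id"])
            i += 1
    return versions


# Made-up log: two appends, one overwrite recorded as delete + append, one plain overwrite.
sample_log = [
    {"snapshot_id": 101, "parent_id": None, "operation": "append"},
    {"snapshot_id": 102, "parent_id": 101, "operation": "append"},
    {"snapshot_id": 103, "parent_id": 102, "operation": "delete"},
    {"snapshot_id": 104, "parent_id": 103, "operation": "append"},
    {"snapshot_id": 105, "parent_id": 104, "operation": "overwrite"},
]

version_ids = logical_versions(sample_log)
print(version_ids)
```

Checking `len(snapshots_df)` and the `operation` column after each write is a quick way to see whether this applies to the installed version.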


In [None]:
# Snapshot 2: Add new customers
print("🔄 Creating Snapshot 2: Adding new customers...")

# Generate additional customers
new_customers = generate_customer_data(200)  # Add 200 new customers
df_new_customers = pd.DataFrame(new_customers)

# Convert to PyArrow table
new_customer_table = create_customer_pyarrow_table(df_new_customers)

# Append new customers (creates Snapshot 2)
customers_table.append(new_customer_table)

print("✅ Snapshot 2 created: Added 200 new customers")

# Snapshot 3: Update existing customers
print("\n🔄 Creating Snapshot 3: Updating existing customers...")

# Get current data
current_data = customers_table.scan().to_arrow().to_pandas()

# Update some customers (change their segment and status)
update_indices = random.sample(range(len(current_data)), 100)  # Update 100 random customers
current_data.loc[update_indices, 'customer_segment'] = 'Premium'
current_data.loc[update_indices, 'account_status'] = 'Active'
current_data.loc[update_indices, 'is_premium'] = True
current_data.loc[update_indices, 'updated_at'] = datetime.now()

# Convert back to PyArrow table
updated_customer_table = create_customer_pyarrow_table(current_data)

# Overwrite table with updated data (creates Snapshot 3)
customers_table.overwrite(updated_customer_table)

print("✅ Snapshot 3 created: Updated 100 customers to Premium status")

# Snapshot 4: Delete some customers
print("\n🔄 Creating Snapshot 4: Deleting inactive customers...")

# Get current data
current_data = customers_table.scan().to_arrow().to_pandas()

# Delete customers with 'Inactive' status
inactive_customers = current_data[current_data['account_status'] == 'Inactive']
print(f"  Found {len(inactive_customers)} inactive customers to delete")

# Keep only active customers
active_data = current_data[current_data['account_status'] != 'Inactive']

# Convert back to PyArrow table
active_customer_table = create_customer_pyarrow_table(active_data)

# Overwrite table with active customers only (creates Snapshot 4)
customers_table.overwrite(active_customer_table)

print("✅ Snapshot 4 created: Deleted inactive customers")

# Snapshot 5: Schema evolution - Add new column
print("\n🔄 Creating Snapshot 5: Adding new column 'credit_score'...")

# Update schema to add credit_score column
customers_table.update_schema().add_column("credit_score", IntegerType()).commit()

# Get current data and add credit_score
current_data = customers_table.scan().to_arrow().to_pandas()
current_data['credit_score'] = np.random.randint(300, 850, len(current_data))  # Random credit scores
current_data['updated_at'] = datetime.now()

# Extend the base PyArrow schema (taken from the table built earlier in this cell) with the new column
pyarrow_schema_with_credit = new_customer_table.schema.append(
    pa.field('credit_score', pa.int32(), nullable=False)  # New column
)

# Convert timestamps
timestamp_cols = ['registration_date', 'last_login', 'created_at', 'updated_at']
for col in timestamp_cols:
    current_data[col] = pd.to_datetime(current_data[col]).dt.floor('us')

# Create table with new schema, taking columns in schema order
updated_customer_table_with_credit = pa.table(
    {name: current_data[name] for name in pyarrow_schema_with_credit.names},
    schema=pyarrow_schema_with_credit
)

# Overwrite table with new schema (creates Snapshot 5)
customers_table.overwrite(updated_customer_table_with_credit)

print("✅ Snapshot 5 created: Added credit_score column")

# Display all snapshots - Safe access using helper function
print(f"\n📸 All Snapshots Created:")
snapshots_df = customers_table.inspect.snapshots().to_pandas()
for i, (_, snapshot) in enumerate(snapshots_df.iterrows(), 1):
    snapshot_info = get_snapshot_info(snapshots_df, snapshot)
    print(f"  Snapshot {i}:")
    print(f"    ID: {snapshot_info['snapshot_id']}")
    
    if snapshot_info['timestamp'] is not None:
        if isinstance(snapshot_info['timestamp'], (int, float)):
            timestamp_str = datetime.fromtimestamp(snapshot_info['timestamp'] / 1000).strftime("%Y-%m-%d %H:%M:%S")
        else:
            timestamp_str = str(snapshot_info['timestamp'])
        print(f"    Timestamp: {timestamp_str}")
    else:
        print(f"    Timestamp: Not available")
    
    if snapshot_info['summary'] is not None:
        print(f"    Summary: {snapshot_info['summary']}")
    else:
        print(f"    Summary: Not available")
    
    # Try to get parent snapshot ID
    parent_cols = ['parent_snapshot_id', 'parent_id', 'parent']
    parent_id = None
    for col in parent_cols:
        if col in snapshots_df.columns:
            parent_id = snapshot[col]
            break
    
    if parent_id is not None:
        print(f"    Parent ID: {parent_id}")
    else:
        print(f"    Parent ID: Not available")
    print()


🔄 Creating Snapshot 2: Adding new customers...
✅ Snapshot 2 created: Added 200 new customers

🔄 Creating Snapshot 3: Updating existing customers...
✅ Snapshot 3 created: Updated 100 customers to Premium status

🔄 Creating Snapshot 4: Deleting inactive customers...
  Found 282 inactive customers to delete
✅ Snapshot 4 created: Deleted inactive customers

🔄 Creating Snapshot 5: Adding new column 'credit_score'...
✅ Snapshot 5 created: Added credit_score column

📸 All Snapshots Created:
  Snapshot 1:
    ID: 1783715834426628930
    Timestamp: 2025-09-17 06:44:58.504000
    Summary: [('added-files-size', '43582'), ('added-data-files', '1'), ('added-records', '1000'), ('total-data-files', '1'), ('total-delete-files', '0'), ('total-records', '1000'), ('total-files-size', '43582'), ('total-position-deletes', '0'), ('total-equality-deletes', '0')]
    Parent ID: nan

  Snapshot 2:
    ID: 332105968043556599
    Timestamp: 2025-09-17 06:44:58.537000
    Summary: [('added-files-size', '15792'), 

## 6. Time Travel Queries - Querying Historical Data

Now let's explore the power of time travel queries! We can query data as it existed at any point in time using snapshot IDs or timestamps.

### ⏰ **Time Travel Methods:**
- **Snapshot ID**: Query data at a specific snapshot
- **Timestamp**: Query data at a specific point in time
- **AS OF**: SQL-style time travel syntax, available when querying through SQL engines such as Spark or Trino
- **Historical Analysis**: Compare data across different time periods
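
Timestamp-based time travel reduces to a lookup: pick the most recent snapshot committed at or before the requested time, then read that snapshot by ID. The sketch below shows that resolution step with the standard library only. The helper name `snapshot_as_of` is hypothetical, and the two history entries reuse the snapshot IDs and commit times printed earlier in this lab.

```python
from bisect import bisect_right
from datetime import datetime
from typing import List, Optional, Tuple


def snapshot_as_of(history: List[Tuple[datetime, int]],
                   as_of: datetime) -> Optional[int]:
    """Return the ID of the latest snapshot committed at or before `as_of`.

    `history` must be sorted by commit time. Returns None when `as_of`
    is earlier than the first commit.
    """
    commit_times = [committed_at for committed_at, _ in history]
    idx = bisect_right(commit_times, as_of)
    if idx == 0:
        return None
    return history[idx - 1][1]


# (committed_at, snapshot_id) pairs taken from the snapshot listing above.
history = [
    (datetime(2025, 9, 17, 6, 44, 58, 504000), 1783715834426628930),
    (datetime(2025, 9, 17, 6, 44, 58, 537000), 332105968043556599),
]

between = snapshot_as_of(history, datetime(2025, 9, 17, 6, 44, 58, 520000))
exact = snapshot_as_of(history, datetime(2025, 9, 17, 6, 44, 58, 537000))
too_early = snapshot_as_of(history, datetime(2025, 9, 17, 6, 0, 0))
print(between, exact, too_early)
```

The resolved ID can then be passed to `customers_table.scan(snapshot_id=...)`, the same call the queries below use.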


In [None]:
# Get snapshot information for time travel - Safe access
snapshots_df = customers_table.inspect.snapshots().to_pandas()
print("📸 Available Snapshots for Time Travel:")
for i, (_, snapshot) in enumerate(snapshots_df.iterrows(), 1):
    snapshot_info = get_snapshot_info(snapshots_df, snapshot)
    if snapshot_info['timestamp'] is not None:
        if isinstance(snapshot_info['timestamp'], (int, float)):
            timestamp_str = datetime.fromtimestamp(snapshot_info['timestamp'] / 1000).strftime("%Y-%m-%d %H:%M:%S")
        else:
            timestamp_str = str(snapshot_info['timestamp'])
    else:
        timestamp_str = "Not available"
    print(f"  Snapshot {i}: ID={snapshot_info['snapshot_id']}, Time={timestamp_str}")

# Time Travel Query 1: Query data at Snapshot 1 (original data)
print(f"\n⏰ Time Travel Query 1: Data at Snapshot 1 (Original)")
snapshot_1_id = snapshots_df.iloc[0]['snapshot_id']
snapshot_1_data = customers_table.scan(snapshot_id=snapshot_1_id).to_arrow().to_pandas()

print(f"  Records at Snapshot 1: {len(snapshot_1_data):,}")
print(f"  Columns at Snapshot 1: {list(snapshot_1_data.columns)}")
print(f"  Premium customers at Snapshot 1: {len(snapshot_1_data[snapshot_1_data['customer_segment'] == 'Premium']):,}")

# Time Travel Query 2: Query data at Snapshot 2 (after adding customers)
print(f"\n⏰ Time Travel Query 2: Data at Snapshot 2 (After Adding Customers)")
snapshot_2_id = snapshots_df.iloc[1]['snapshot_id']
snapshot_2_data = customers_table.scan(snapshot_id=snapshot_2_id).to_arrow().to_pandas()

print(f"  Records at Snapshot 2: {len(snapshot_2_data):,}")
print(f"  New customers added: {len(snapshot_2_data) - len(snapshot_1_data):,}")

# Time Travel Query 3: Query data at Snapshot 3 (after updates)
print(f"\n⏰ Time Travel Query 3: Data at Snapshot 3 (After Updates)")
snapshot_3_id = snapshots_df.iloc[2]['snapshot_id']
snapshot_3_data = customers_table.scan(snapshot_id=snapshot_3_id).to_arrow().to_pandas()

print(f"  Records at Snapshot 3: {len(snapshot_3_data):,}")
print(f"  Premium customers at Snapshot 3: {len(snapshot_3_data[snapshot_3_data['customer_segment'] == 'Premium']):,}")

# Time Travel Query 4: Query data at Snapshot 4 (after deletions)
print(f"\n⏰ Time Travel Query 4: Data at Snapshot 4 (After Deletions)")
snapshot_4_id = snapshots_df.iloc[3]['snapshot_id']
snapshot_4_data = customers_table.scan(snapshot_id=snapshot_4_id).to_arrow().to_pandas()

print(f"  Records at Snapshot 4: {len(snapshot_4_data):,}")
print(f"  Customers deleted: {len(snapshot_3_data) - len(snapshot_4_data):,}")
print(f"  Inactive customers at Snapshot 4: {len(snapshot_4_data[snapshot_4_data['account_status'] == 'Inactive']):,}")

# Time Travel Query 5: Query data at Snapshot 5 (with new schema)
print(f"\n⏰ Time Travel Query 5: Data at Snapshot 5 (With New Schema)")
snapshot_5_id = snapshots_df.iloc[4]['snapshot_id']
snapshot_5_data = customers_table.scan(snapshot_id=snapshot_5_id).to_arrow().to_pandas()

print(f"  Records at Snapshot 5: {len(snapshot_5_data):,}")
print(f"  Columns at Snapshot 5: {list(snapshot_5_data.columns)}")
print(f"  Has credit_score column: {'credit_score' in snapshot_5_data.columns}")
if 'credit_score' in snapshot_5_data.columns:
    print(f"  Average credit score: {snapshot_5_data['credit_score'].mean():.1f}")

# Time Travel Query 6: Query current data (latest snapshot)
print(f"\n⏰ Time Travel Query 6: Current Data (Latest Snapshot)")
current_data = customers_table.scan().to_arrow().to_pandas()

print(f"  Current records: {len(current_data):,}")
print(f"  Current columns: {list(current_data.columns)}")

# Compare data across snapshots
print(f"\n📊 Data Evolution Summary:")
print("-" * 60)
print(f"{'Snapshot':<12} {'Records':<10} {'Premium':<10} {'Inactive':<10} {'Columns':<10}")
print("-" * 60)

snapshot_data_list = [snapshot_1_data, snapshot_2_data, snapshot_3_data, snapshot_4_data, snapshot_5_data]
for i, data in enumerate(snapshot_data_list, 1):
    premium_count = len(data[data['customer_segment'] == 'Premium'])
    inactive_count = len(data[data['account_status'] == 'Inactive'])
    column_count = len(data.columns)
    
    print(f"Snapshot {i:<8} {len(data):<10,} {premium_count:<10,} {inactive_count:<10,} {column_count:<10}")

print("-" * 60)


📸 Available Snapshots for Time Travel:
  Snapshot 1: ID=1783715834426628930, Time=2025-09-17 06:44:58.504000
  Snapshot 2: ID=332105968043556599, Time=2025-09-17 06:44:58.537000
  Snapshot 3: ID=6503228716597911792, Time=2025-09-17 06:44:58.575000
  Snapshot 4: ID=5814724229348825479, Time=2025-09-17 06:44:58.587000
  Snapshot 5: ID=335080899782564617, Time=2025-09-17 06:44:58.609000
  Snapshot 6: ID=5134235652662559516, Time=2025-09-17 06:44:58.621000
  Snapshot 7: ID=5300982805287717163, Time=2025-09-17 06:44:58.648000
  Snapshot 8: ID=5175176634825406144, Time=2025-09-17 06:44:58.660000

⏰ Time Travel Query 1: Data at Snapshot 1 (Original)
  Records at Snapshot 1: 1,000
  Columns at Snapshot 1: ['customer_id', 'first_name', 'last_name', 'email', 'phone', 'country', 'city', 'age', 'customer_segment', 'account_status', 'registration_date', 'last_login', 'total_purchases', 'total_spent', 'loyalty_points', 'is_premium', 'newsletter_subscribed', 'created_at', 'updated_at']
  Premium cust
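
The evolution summary above compares counts only. To see which customers actually differ between two versions, the rows can be compared by key. The following is a small sketch under stated assumptions: `diff_snapshots` is a hypothetical helper, the rows are invented for illustration, and real input could come from something like `snapshot_1_data.to_dict("records")`.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]


def diff_snapshots(
    old_rows: List[Record], new_rows: List[Record], key: str = "customer_id"
) -> Dict[str, List[str]]:
    """Classify keys as added, removed, or changed between two versions."""
    old_by_key = {row[key]: row for row in old_rows}
    new_by_key = {row[key]: row for row in new_rows}
    added = sorted(new_by_key.keys() - old_by_key.keys())
    removed = sorted(old_by_key.keys() - new_by_key.keys())
    # A key counts as changed when it exists in both versions with different values.
    changed = sorted(
        k for k in old_by_key.keys() & new_by_key.keys() if old_by_key[k] != new_by_key[k]
    )
    return {"added": added, "removed": removed, "changed": changed}


# Illustrative rows only (not taken from the lab dataset).
before = [
    {"customer_id": "C001", "customer_segment": "Standard", "account_status": "Active"},
    {"customer_id": "C002", "customer_segment": "Premium", "account_status": "Inactive"},
]
after = [
    {"customer_id": "C001", "customer_segment": "Premium", "account_status": "Active"},
    {"customer_id": "C003", "customer_segment": "Standard", "account_status": "Active"},
]

changes = diff_snapshots(before, after)
print(changes)
```

Note that when the two versions have different schemas (for example, one has `credit_score` and the other does not), every shared row is reported as changed, so it may be worth restricting both sides to their common columns first.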

## 7. Rollback Operations - Restoring Previous Versions

One of the most powerful features of Iceberg is the ability to roll back to previous snapshots. This is essential for data recovery and debugging.

### **How Rollback Works in PyIceberg:**

Unlike Spark and Trino, which expose a rollback procedure, the PyIceberg `Table` API used in this lab has no direct `rollback_to_snapshot()` method. Instead, rollback is emulated in three steps:

1. **Query Historical Data**: Use `scan(snapshot_id=snapshot_id)` to read the data from a specific snapshot
2. **Overwrite Current Table**: Use `overwrite()` to replace the current data with the historical data
3. **Create New Snapshot**: The overwrite commits a new snapshot that contains the restored data

### **Rollback Strategies:**

- **Data Rollback**: Restore data to a previous state
- **Schema Rollback**: Restore the schema to a previous version (requires a separate schema evolution step)
- **Complete Rollback**: Restore both data and schema to a previous snapshot

### **Important Notes:**

- Rollback creates new snapshots (it doesn't delete existing ones)
- Historical snapshots remain accessible for time travel
- Overwriting data does not revert the table schema: in this lab, `credit_score` is still a table column after restoring Snapshot 1
- Schema rollback requires careful handling of column compatibility
- Rollback operations are atomic and consistent
- Snapshot positions are fragile: one logical operation may commit more than one snapshot (this lab lists 8 snapshots), so check a snapshot's contents before restoring it

### 🔄 **Rollback Scenarios:**

- **Data Recovery**: Restore accidentally deleted or corrupted data
- **Bug Fixes**: Roll back to a known good state
- **Compliance**: Meet regulatory requirements for data restoration
- **Debugging**: Investigate issues by examining historical states
- **Schema Rollback**: Undo schema changes
- **Testing**: Rollback for testing different scenarios
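
The rollback-by-overwrite approach can be wrapped in a small helper that refuses to restore an empty snapshot. This is a sketch, not part of PyIceberg: `rollback_by_overwrite` is a hypothetical function that only assumes the table offers `scan(snapshot_id=...).to_arrow()` and `overwrite()`, as used in this lab. `FakeTable` and its helpers are in-memory stand-ins so the example runs without a catalog.

```python
class _Rows:
    """Tiny stand-in for an Arrow table (only what the sketch needs)."""

    def __init__(self, rows):
        self.rows = list(rows)

    @property
    def num_rows(self):
        return len(self.rows)


class _Scan:
    """Stand-in for the object returned by scan()."""

    def __init__(self, rows):
        self._rows = rows

    def to_arrow(self):
        return self._rows


class FakeTable:
    """Hypothetical in-memory table that mimics snapshot-per-write behavior."""

    def __init__(self, snapshots, current_id):
        self.snapshots = dict(snapshots)
        self.current_id = current_id

    def scan(self, snapshot_id=None):
        chosen = self.current_id if snapshot_id is None else snapshot_id
        return _Scan(self.snapshots[chosen])

    def overwrite(self, rows):
        # Every write adds a new snapshot; older ones are kept.
        new_id = max(self.snapshots) + 1
        self.snapshots[new_id] = rows
        self.current_id = new_id


def rollback_by_overwrite(table, snapshot_id: int) -> int:
    """Restore `table` to the rows of `snapshot_id` and return the row count."""
    historical = table.scan(snapshot_id=snapshot_id).to_arrow()
    if historical.num_rows == 0:
        # Guard against restoring an empty intermediate snapshot by mistake.
        raise ValueError(f"Snapshot {snapshot_id} has no rows; refusing to overwrite")
    table.overwrite(historical)
    return historical.num_rows


demo = FakeTable({1: _Rows(["a", "b", "c"]), 2: _Rows([]), 3: _Rows(["a"])}, current_id=3)
restored = rollback_by_overwrite(demo, 1)
print(restored, demo.current_id, sorted(demo.snapshots))
```

Passing the Arrow result straight to `overwrite()` avoids rebuilding the schema by hand, on the assumption that the snapshot's schema is compatible with the table's current schema. The empty-snapshot guard is a design choice for this sketch; a real deletion-heavy workflow might legitimately need to restore an empty table.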


In [None]:
# Rollback Operation 1: Rollback to Snapshot 3 (before deletions)
print("🔄 Rollback Operation 1: Rolling back to Snapshot 3 (before deletions)")

# Get current state before rollback
current_before_rollback = customers_table.scan().to_arrow().to_pandas()
print(f"  Current records before rollback: {len(current_before_rollback):,}")

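# Get data from Snapshot 3 and overwrite current table.
# NOTE: iloc[2] is a position, not a logical version. If an earlier operation
# committed more than one snapshot, this can select an empty intermediate snapshot.
snapshot_3_id = snapshots_df.iloc[2]['snapshot_id']
snapshot_3_data = customers_table.scan(snapshot_id=snapshot_3_id).to_arrow().to_pandas()
if snapshot_3_data.empty:
    raise ValueError("Snapshot 3 contains no rows; pick a different snapshot before overwriting")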

# Convert back to PyArrow table for overwrite
snapshot_3_table = create_customer_pyarrow_table(snapshot_3_data)
customers_table.overwrite(snapshot_3_table)

print("✅ Rollback completed!")

# Verify rollback
current_after_rollback = customers_table.scan().to_arrow().to_pandas()
print(f"  Records after rollback: {len(current_after_rollback):,}")
print(f"  Records restored: {len(current_after_rollback) - len(current_before_rollback):,}")

# Check if deleted customers are back
inactive_customers_restored = len(current_after_rollback[current_after_rollback['account_status'] == 'Inactive'])
print(f"  Inactive customers restored: {inactive_customers_restored:,}")

# Rollback Operation 2: Rollback to Snapshot 1 (original state)
print(f"\n🔄 Rollback Operation 2: Rolling back to Snapshot 1 (original state)")

# Get current state
current_before_rollback2 = customers_table.scan().to_arrow().to_pandas()
print(f"  Current records before rollback: {len(current_before_rollback2):,}")

# Get data from Snapshot 1 and overwrite current table
snapshot_1_id = snapshots_df.iloc[0]['snapshot_id']
snapshot_1_data = customers_table.scan(snapshot_id=snapshot_1_id).to_arrow().to_pandas()

# Convert back to PyArrow table for overwrite
snapshot_1_table = create_customer_pyarrow_table(snapshot_1_data)
customers_table.overwrite(snapshot_1_table)

print("✅ Rollback to original state completed!")

# Verify rollback
current_after_rollback2 = customers_table.scan().to_arrow().to_pandas()
print(f"  Records after rollback: {len(current_after_rollback2):,}")
print(f"  Records restored: {len(current_after_rollback2) - len(current_before_rollback2):,}")

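# Check the schema: overwrite() replaces data only, so columns added later
# (such as credit_score) are expected to remain in the table schema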
print(f"  Columns after rollback: {list(current_after_rollback2.columns)}")
print(f"  Has credit_score column: {'credit_score' in current_after_rollback2.columns}")

# Check premium customers (should be back to original count)
premium_customers_restored = len(current_after_rollback2[current_after_rollback2['customer_segment'] == 'Premium'])
print(f"  Premium customers restored: {premium_customers_restored:,}")

# Rollback Operation 3: Rollback to Snapshot 5 (latest with all features)
print(f"\n🔄 Rollback Operation 3: Rolling back to Snapshot 5 (latest with all features)")

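# Get data from the latest non-empty snapshot that contains credit_score.
# A fixed position such as iloc[4] may point at a snapshot written before the
# column was added, which raises KeyError: 'credit_score' further down.
snapshot_5_id, snapshot_5_data = None, None
for candidate_id in reversed(snapshots_df['snapshot_id'].tolist()):
    candidate = customers_table.scan(snapshot_id=int(candidate_id)).to_arrow().to_pandas()
    if 'credit_score' in candidate.columns and len(candidate) > 0:
        snapshot_5_id, snapshot_5_data = int(candidate_id), candidate
        break
if snapshot_5_data is None:
    raise ValueError("No non-empty snapshot with a credit_score column was found")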

# Convert back to PyArrow table for overwrite (with credit_score column)
pyarrow_schema_with_credit = pa.schema([
    pa.field('customer_id', pa.string(), nullable=False),
    pa.field('first_name', pa.string(), nullable=False),
    pa.field('last_name', pa.string(), nullable=False),
    pa.field('email', pa.string(), nullable=False),
    pa.field('phone', pa.string(), nullable=False),
    pa.field('country', pa.string(), nullable=False),
    pa.field('city', pa.string(), nullable=False),
    pa.field('age', pa.int32(), nullable=False),
    pa.field('customer_segment', pa.string(), nullable=False),
    pa.field('account_status', pa.string(), nullable=False),
    pa.field('registration_date', pa.timestamp('us'), nullable=False),
    pa.field('last_login', pa.timestamp('us'), nullable=False),
    pa.field('total_purchases', pa.int32(), nullable=False),
    pa.field('total_spent', pa.float64(), nullable=False),
    pa.field('loyalty_points', pa.int32(), nullable=False),
    pa.field('is_premium', pa.bool_(), nullable=False),
    pa.field('newsletter_subscribed', pa.bool_(), nullable=False),
    pa.field('created_at', pa.timestamp('us'), nullable=False),
    pa.field('updated_at', pa.timestamp('us'), nullable=False),
    pa.field('credit_score', pa.int32(), nullable=False)
])

# Convert timestamps
timestamp_cols = ['registration_date', 'last_login', 'created_at', 'updated_at']
for col in timestamp_cols:
    snapshot_5_data[col] = pd.to_datetime(snapshot_5_data[col]).dt.floor('us')

# Create table with credit_score schema
snapshot_5_table = pa.table({
    'customer_id': snapshot_5_data['customer_id'],
    'first_name': snapshot_5_data['first_name'],
    'last_name': snapshot_5_data['last_name'],
    'email': snapshot_5_data['email'],
    'phone': snapshot_5_data['phone'],
    'country': snapshot_5_data['country'],
    'city': snapshot_5_data['city'],
    'age': snapshot_5_data['age'],
    'customer_segment': snapshot_5_data['customer_segment'],
    'account_status': snapshot_5_data['account_status'],
    'registration_date': snapshot_5_data['registration_date'],
    'last_login': snapshot_5_data['last_login'],
    'total_purchases': snapshot_5_data['total_purchases'],
    'total_spent': snapshot_5_data['total_spent'],
    'loyalty_points': snapshot_5_data['loyalty_points'],
    'is_premium': snapshot_5_data['is_premium'],
    'newsletter_subscribed': snapshot_5_data['newsletter_subscribed'],
    'created_at': snapshot_5_data['created_at'],
    'updated_at': snapshot_5_data['updated_at'],
    'credit_score': snapshot_5_data['credit_score']
}, schema=pyarrow_schema_with_credit)

customers_table.overwrite(snapshot_5_table)

print("✅ Rollback to latest state completed!")

# Verify rollback
final_state = customers_table.scan().to_arrow().to_pandas()
print(f"  Final records: {len(final_state):,}")
print(f"  Final columns: {list(final_state.columns)}")
print(f"  Has credit_score column: {'credit_score' in final_state.columns}")

# Display rollback summary
print(f"\n📊 Rollback Operations Summary:")
print("-" * 50)
print(f"Rollback 1 (to Snapshot 3):")
print(f"  Restored {len(current_after_rollback) - len(current_before_rollback):,} deleted customers")
print(f"  Restored {inactive_customers_restored:,} inactive customers")

print(f"\nRollback 2 (to Snapshot 1):")
print(f"  Restored {len(current_after_rollback2) - len(current_before_rollback2):,} customers")
print(f"  credit_score column still present: {'credit_score' in current_after_rollback2.columns}")
print(f"  Reset premium customers to {premium_customers_restored:,}")

print(f"\nRollback 3 (to Snapshot 5):")
print(f"  Restored to latest state with all features")
print(f"  Final records: {len(final_state):,}")
print(f"  Final columns: {len(final_state.columns)}")

print("-" * 50)


🔄 Rollback Operation 1: Rolling back to Snapshot 3 (before deletions)
  Current records before rollback: 1,000
✅ Rollback completed!
  Records after rollback: 0
  Records restored: -1,000
  Inactive customers restored: 0

🔄 Rollback Operation 2: Rolling back to Snapshot 1 (original state)
  Current records before rollback: 0
✅ Rollback to original state completed!
  Records after rollback: 1,000
  Records restored: 1,000
  Columns after rollback: ['customer_id', 'first_name', 'last_name', 'email', 'phone', 'country', 'city', 'age', 'customer_segment', 'account_status', 'registration_date', 'last_login', 'total_purchases', 'total_spent', 'loyalty_points', 'is_premium', 'newsletter_subscribed', 'created_at', 'updated_at', 'credit_score']
  Has credit_score column: True
  Premium customers restored: 259

🔄 Rollback Operation 3: Rolling back to Snapshot 5 (latest with all features)




KeyError: 'credit_score'

## 8. Lab Summary & Best Practices

### 🎉 **Congratulations!**

You've completed the **Time Travel & Versioning Lab** and learned about:

✅ **Snapshot Management**: Creating and managing data snapshots  
✅ **Time Travel Queries**: Querying data at specific points in time  
✅ **Rollback Operations**: Restoring data to previous versions  
✅ **Version Comparison**: Comparing data between different versions  
✅ **Historical Analysis**: Analyzing data changes over time  
✅ **Production Scenarios**: Applying time travel to real-world use cases  

### 🚀 **Key Takeaways:**

#### **1. Time Travel is Powerful:**
- **Data Recovery**: Restore accidentally deleted or corrupted data
- **Audit Trails**: Track all changes for compliance and auditing
- **A/B Testing**: Compare different versions of datasets
- **Debugging**: Investigate data issues by examining historical states
- **Compliance**: Meet regulatory requirements for data retention

#### **2. Snapshot Management:**
- **Every Write Creates a Snapshot**: Each committed write creates at least one new snapshot (version)
- **Snapshot IDs**: Unique identifiers for each version
- **Timestamps**: Track when each snapshot was created
- **Parent-Child Relationships**: Track snapshot lineage
- **Metadata**: Rich metadata about each operation
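
Snapshot lineage can be reconstructed by following parent IDs back to the first snapshot. Below is a minimal sketch; `ancestors_of` is a hypothetical helper, and the parent links are an assumed linear history over the first three snapshot IDs listed in this lab (real values would come from the parent ID shown in the snapshot listing, with `nan` mapped to `None`).

```python
from typing import Dict, List, Optional

# Assumed parent links for illustration only: a linear history.
PARENT_OF: Dict[int, Optional[int]] = {
    1783715834426628930: None,
    332105968043556599: 1783715834426628930,
    6503228716597911792: 332105968043556599,
}


def ancestors_of(snapshot_id: int, parent_of: Dict[int, Optional[int]]) -> List[int]:
    """Walk parent links from `snapshot_id` back to the root, newest first."""
    lineage: List[int] = []
    seen = set()
    current: Optional[int] = snapshot_id
    while current is not None:
        if current in seen:
            raise ValueError(f"Cycle detected at snapshot {current}")
        seen.add(current)
        lineage.append(current)
        # A missing entry (for example an expired parent) ends the walk.
        current = parent_of.get(current)
    return lineage


lineage = ancestors_of(6503228716597911792, PARENT_OF)
print(lineage)
```

A lineage list like this makes it easy to check that a rollback target is really an ancestor of the current snapshot before restoring it.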

#### **3. Rollback Capabilities:**
- **Rollback**: Restore data to any previous snapshot that is still retained
- **Schema Rollback**: Schema changes are not undone by overwriting data; they need a separate schema update
- **Data Recovery**: Restore deleted or modified data
- **Testing**: Rollback for testing different scenarios
- **Production Safety**: Safe rollback in production environments

### 📚 **Production Implementation:**

#### **1. Apache Spark:**
```sql
-- Time travel queries with Spark
SELECT * FROM catalog.database.table VERSION AS OF 1234567890;
SELECT * FROM catalog.database.table TIMESTAMP AS OF '2023-01-01 12:00:00';

-- Rollback with Spark (the procedure lives in the catalog's system namespace)
CALL catalog.system.rollback_to_snapshot('database.table', 1234567890);
```

#### **2. Trino:**
```sql
-- Time travel queries with Trino
SELECT * FROM catalog.database.table FOR VERSION AS OF 1234567890;
SELECT * FROM catalog.database.table FOR TIMESTAMP AS OF TIMESTAMP '2023-01-01 12:00:00';

-- Rollback with Trino (schema and table are separate arguments)
CALL catalog.system.rollback_to_snapshot('database', 'table', 1234567890);
```

#### **3. PyIceberg:**
```python
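from datetime import datetime

# Time travel queries with PyIceberg
table.scan(snapshot_id=1234567890).to_arrow()

# Time travel by timestamp: resolve the timestamp to a snapshot first
# (snapshot_as_of_timestamp returns None if no snapshot existed at that time)
as_of_ms = int(datetime(2023, 1, 1).timestamp() * 1000)
snapshot = table.snapshot_as_of_timestamp(as_of_ms)
table.scan(snapshot_id=snapshot.snapshot_id).to_arrow()

# Rollback with PyIceberg, as done in this lab: overwrite with historical data
table.overwrite(table.scan(snapshot_id=1234567890).to_arrow())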
```

### 🎯 **Best Practices:**

#### **1. Snapshot Management:**
- **Regular Cleanup**: Remove old snapshots to save storage
- **Snapshot Retention**: Keep snapshots for required retention period
- **Metadata Tracking**: Track snapshot purposes and descriptions
- **Access Control**: Control who can create and rollback snapshots

#### **2. Time Travel Queries:**
- **Performance**: Time travel queries may be slower than current queries
- **Storage**: Historical data consumes storage space
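- **Pruning**: Iceberg relies on partition pruning and file-level column statistics rather than traditional indexes, so design partitioning with historical queries in mind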
- **Caching**: Cache frequently accessed historical data

#### **3. Rollback Operations:**
- **Testing**: Always test rollback operations in non-production
- **Backup**: Keep backups of critical snapshots
- **Documentation**: Document rollback procedures and scenarios
- **Monitoring**: Monitor rollback operations and their impact

#### **4. Production Scenarios:**
- **Data Recovery**: Use for accidental data deletion recovery
- **Bug Fixes**: Rollback to known good states after bugs
- **Schema Evolution**: Rollback schema changes if needed
- **Compliance**: Meet regulatory requirements for data retention

### 🔍 **Advanced Topics:**

#### **1. Snapshot Expiration:**
- **Automatic Cleanup**: Configure automatic snapshot expiration
- **Retention Policies**: Set retention policies for different snapshot types
- **Storage Optimization**: Optimize storage by removing old snapshots
- **Cost Management**: Manage costs by controlling snapshot retention
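
A retention policy usually combines a maximum snapshot age with a minimum number of snapshots to keep, which is the idea behind Iceberg's `history.expire.max-snapshot-age-ms` and `history.expire.min-snapshots-to-keep` table properties. The sketch below only illustrates the selection logic; `select_expired` is a hypothetical helper, the IDs and dates are invented, and actual expiration should be performed by the engine or library in use.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def select_expired(
    log: List[Tuple[int, datetime]],
    now: datetime,
    max_age: timedelta,
    min_to_keep: int,
) -> List[int]:
    """Return snapshot IDs older than `max_age`, newest first,
    while always keeping the newest `min_to_keep` snapshots."""
    newest_first = sorted(log, key=lambda entry: entry[1], reverse=True)
    protected = {snapshot_id for snapshot_id, _ in newest_first[:min_to_keep]}
    cutoff = now - max_age
    return [
        snapshot_id
        for snapshot_id, committed_at in newest_first
        if committed_at < cutoff and snapshot_id not in protected
    ]


# Invented snapshot log for illustration.
log = [
    (101, datetime(2025, 9, 1)),
    (102, datetime(2025, 9, 10)),
    (103, datetime(2025, 9, 15)),
    (104, datetime(2025, 9, 16)),
]
now = datetime(2025, 9, 17)

print(select_expired(log, now, max_age=timedelta(days=5), min_to_keep=1))
print(select_expired(log, now, max_age=timedelta(days=5), min_to_keep=3))
```

Keep in mind that an expired snapshot is no longer available for time travel or rollback, so the retention window should cover the longest recovery period you need.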

#### **2. Branching and Tagging:**
- **Snapshot Branches**: Create branches for different data versions
- **Snapshot Tags**: Tag important snapshots for easy reference
- **Merge Operations**: Merge branches back to main branch
- **Conflict Resolution**: Handle conflicts during merge operations

#### **3. Performance Optimization:**
- **Snapshot Pruning**: Remove unnecessary snapshots
- **Compaction**: Compact historical data for better performance
- **Partitioning**: Use partitioning for better historical query performance
- **Caching**: Cache frequently accessed historical data

### 🎯 **Next Steps:**

1. **Implement in Production**: Apply time travel to your data lakes
2. **Set Up Monitoring**: Monitor snapshot creation and rollback operations
3. **Create Policies**: Establish snapshot retention and cleanup policies
4. **Train Teams**: Train teams on time travel capabilities and best practices
5. **Test Scenarios**: Test various rollback and recovery scenarios

### 📖 **Additional Resources:**

- [Apache Iceberg Time Travel Documentation](https://iceberg.apache.org/docs/latest/spark-queries/#time-travel)
- [Spark SQL Time Travel](https://spark.apache.org/docs/latest/sql-data-sources-iceberg.html#time-travel)
- [Trino Iceberg Time Travel](https://trino.io/docs/current/connector/iceberg.html#time-travel)
- [PyIceberg Time Travel](https://py.iceberg.apache.org/operations/time-travel/)

**Happy time traveling! ⏰🚀**
