# Data Compaction & File Management Lab - Apache Iceberg

## 🎯 Lab Objectives

In this lab, we will explore Apache Iceberg's powerful data compaction and file management capabilities:

1. **Small File Problem**: Understand and demonstrate the small file problem
2. **Compaction Strategies**: Implement different compaction approaches
3. **File Size Optimization**: Analyze and optimize file sizes
4. **Storage Efficiency**: Improve storage costs and query performance
5. **Compaction Policies**: Configure automatic compaction rules
6. **Real-world Scenarios**: Apply compaction to realistic datasets

## 🏗️ Compaction Architecture

### Iceberg Compaction Types:
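- **Rewrite Compaction**: Rewrite data files so that many small files are merged into fewer, larger ones
- **Bin Packing**: Group files toward a target size without reordering rows (the default rewrite strategy in Iceberg)
- **Sort Compaction**: Rewrite files with rows sorted on chosen columns so min/max statistics can prune more files
- **Z-Order Compaction**: Sort by interleaved column values to help queries that filter on several columns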
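
To make the bin-packing idea concrete, here is a minimal sketch in plain Python. It is not Iceberg's actual planner: the function name `plan_bins`, the first-fit-decreasing heuristic, and the 128 KB target are assumptions chosen for illustration.

```python
from typing import List


def plan_bins(file_sizes: List[int], target_bytes: int) -> List[List[int]]:
    """Group file sizes into bins of at most target_bytes (first-fit decreasing).

    Illustrative only: a real planner also considers partitions, delete files, etc.
    """
    bins: List[List[int]] = []
    totals: List[int] = []
    for size in sorted(file_sizes, reverse=True):
        for i, total in enumerate(totals):
            if total + size <= target_bytes:
                bins[i].append(size)
                totals[i] += size
                break
        else:
            # No existing bin has room, so open a new one
            bins.append([size])
            totals.append(size)
    return bins


# Ten files of 40 KB each, packed toward a 128 KB target
small_files = [40 * 1024] * 10
bins = plan_bins(small_files, target_bytes=128 * 1024)
print(f"{len(small_files)} files -> {len(bins)} output files")
for b in bins:
    print(f"  {len(b)} inputs, {sum(b) / 1024:.0f} KB")
```

Each bin corresponds to one rewritten output file. Rows are only regrouped, never reordered, which is why bin packing is the cheapest of the strategies listed above.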

### Performance Benefits:
- **Reduced File Count**: Fewer files to scan
- **Better Query Performance**: Fewer file opens and larger sequential reads per query
- **Storage Efficiency**: Reduced metadata overhead
- **Cost Reduction**: Lower storage and compute costs

## 📊 Dataset: E-commerce Transaction Data

We will work with comprehensive e-commerce data including:
- **Sales Transactions**: Time-series sales data
- **Product Catalog**: Hierarchical product information
- **Customer Data**: Geographic and demographic data
- **File Management**: Small file creation and compaction


## 1. Setup and Import Libraries


In [1]:
# Import necessary libraries
import os
import time
import json
import random
from datetime import datetime, timedelta
from typing import Dict, List, Any

# PyIceberg imports
from pyiceberg.catalog import load_catalog
from pyiceberg.schema import Schema
from pyiceberg.types import (
    StructType, StringType, IntegerType, LongType, DoubleType, BooleanType,
    TimestampType, DateType, NestedField
)
from pyiceberg.partitioning import PartitionSpec, PartitionField
from pyiceberg.transforms import IdentityTransform

# Data processing
import pyarrow as pa
import pyarrow.compute as pc
import pandas as pd
import numpy as np

# Performance monitoring
import psutil
import gc

print("✅ Successfully imported all libraries!")
print(f"📦 PyArrow version: {pa.__version__}")
print(f"📦 Pandas version: {pd.__version__}")
print(f"📦 NumPy version: {np.__version__}")


✅ Successfully imported all libraries!
📦 PyArrow version: 21.0.0
📦 Pandas version: 2.3.2
📦 NumPy version: 2.2.6


## 2. Warehouse and Catalog Setup


In [3]:
# Setup warehouse and catalog
warehouse_path = "/tmp/compaction_iceberg_warehouse"
os.makedirs(warehouse_path, exist_ok=True)

# Configure catalog
catalog = load_catalog(
    "compaction",
    **{
        'type': 'sql',
        "uri": f"sqlite:///{warehouse_path}/compaction_catalog.db",
        "warehouse": f"file://{warehouse_path}",
    },
)

# Create namespace
try:
    catalog.create_namespace("ecommerce")
    print("✅ Created namespace 'ecommerce'")
except Exception as e:
    print(f"ℹ️  Namespace 'ecommerce' already exists: {e}")

print(f"📁 Warehouse path: {warehouse_path}")
print("🎯 Ready for Data Compaction Lab!")


✅ Created namespace 'ecommerce'
📁 Warehouse path: /tmp/compaction_iceberg_warehouse
🎯 Ready for Data Compaction Lab!


## 3. Generate E-commerce Dataset for Compaction Testing


In [4]:
def generate_sales_data(n_transactions=100000):
    """Generate realistic e-commerce sales data for compaction experiments"""
    
    transactions = []
    categories = ["Electronics", "Clothing", "Books", "Home", "Sports"]
    brands = ["Apple", "Samsung", "Nike", "Adidas", "Sony", "LG", "Canon", "Dell"]
    regions = ["North America", "Europe", "Asia Pacific", "Latin America"]
    countries = ["USA", "Canada", "UK", "Germany", "Japan", "Australia", "Brazil", "Mexico"]
    payment_methods = ["Credit Card", "Debit Card", "PayPal", "Apple Pay", "Google Pay"]
    shipping_methods = ["Standard", "Express", "Overnight", "Same Day"]
    customer_segments = ["Premium", "Standard", "Budget", "VIP"]
    
    # Generate data over 6 months
    start_date = datetime(2023, 1, 1)
    end_date = datetime(2023, 6, 30)
    
    for i in range(n_transactions):
        # Generate random date within range
        days_diff = (end_date - start_date).days
        random_days = random.randint(0, days_diff)
        sale_date = start_date + timedelta(days=random_days)
        
        # sale_date falls on midnight; datetime already carries the microsecond
        # precision that Iceberg's timestamp type expects
        
        transaction = {
            "transaction_id": f"TXN_{i+1:06d}",
            "sale_date": sale_date.strftime("%Y-%m-%d"),
            "sale_timestamp": sale_date,
            "customer_id": f"CUST_{random.randint(1, 10000):05d}",
            "product_id": f"PROD_{random.randint(1, 5000):05d}",
            "product_name": f"Product {random.randint(1, 5000)}",
            "category": random.choice(categories),
            "brand": random.choice(brands),
            "region": random.choice(regions),
            "country": random.choice(countries),
            "quantity": random.randint(1, 10),
            "unit_price": round(random.uniform(10, 1000), 2),
            "total_amount": 0,  # Will calculate below
            "discount_percent": round(random.uniform(0, 30), 2),
            "payment_method": random.choice(payment_methods),
            "shipping_method": random.choice(shipping_methods),
            "customer_segment": random.choice(customer_segments),
            "is_returned": random.choice([True, False]),
            "return_date": None,
            "sales_rep_id": f"REP_{random.randint(1, 100):03d}"
        }
        
        # Calculate total amount with discount
        subtotal = transaction["quantity"] * transaction["unit_price"]
        discount_amount = subtotal * (transaction["discount_percent"] / 100)
        transaction["total_amount"] = round(subtotal - discount_amount, 2)
        
        # Add return date if item was returned
        if transaction["is_returned"]:
            return_days = random.randint(1, 30)
            return_date = sale_date + timedelta(days=return_days)
            transaction["return_date"] = return_date.strftime("%Y-%m-%d")
        
        transactions.append(transaction)
    
    return transactions

# Generate the dataset
print("🔄 Generating sales data for compaction testing...")
sales_data = generate_sales_data(100000)  # 100K records
print(f"✅ Generated {len(sales_data):,} sales transactions")

# Display sample data
print("\n📋 Sample transaction data:")
sample_transaction = sales_data[0]
for key, value in sample_transaction.items():
    print(f"  {key}: {value}")


🔄 Generating sales data for compaction testing...
✅ Generated 100,000 sales transactions

📋 Sample transaction data:
  transaction_id: TXN_000001
  sale_date: 2023-04-12
  sale_timestamp: 2023-04-12 00:00:00
  customer_id: CUST_00151
  product_id: PROD_00042
  product_name: Product 1652
  category: Sports
  brand: Canon
  region: North America
  country: Germany
  quantity: 5
  unit_price: 986.89
  total_amount: 4161.22
  discount_percent: 15.67
  payment_method: Credit Card
  shipping_method: Express
  customer_segment: Budget
  is_returned: False
  return_date: None
  sales_rep_id: REP_006


## 4. Create Schema and Helper Functions


In [5]:
# Create Iceberg schema for sales data
def create_sales_schema():
    """Create Iceberg schema for sales transactions"""
    
    schema = Schema(
        NestedField(1, "transaction_id", StringType(), required=True),
        NestedField(2, "sale_date", DateType(), required=True),
        NestedField(3, "sale_timestamp", TimestampType(), required=True),
        NestedField(4, "customer_id", StringType(), required=True),
        NestedField(5, "product_id", StringType(), required=True),
        NestedField(6, "product_name", StringType(), required=True),
        NestedField(7, "category", StringType(), required=True),
        NestedField(8, "brand", StringType(), required=True),
        NestedField(9, "region", StringType(), required=True),
        NestedField(10, "country", StringType(), required=True),
        NestedField(11, "quantity", IntegerType(), required=True),
        NestedField(12, "unit_price", DoubleType(), required=True),
        NestedField(13, "total_amount", DoubleType(), required=True),
        NestedField(14, "discount_percent", DoubleType(), required=True),
        NestedField(15, "payment_method", StringType(), required=True),
        NestedField(16, "shipping_method", StringType(), required=True),
        NestedField(17, "customer_segment", StringType(), required=True),
        NestedField(18, "is_returned", BooleanType(), required=True),
        NestedField(19, "return_date", StringType(), required=False),  # Can be null
        NestedField(20, "sales_rep_id", StringType(), required=True)
    )
    
    return schema

# Create the schema
print("🏗️ Creating sales schema...")
sales_schema = create_sales_schema()
print("✅ Schema created successfully!")

# Helper function to create PyArrow table with correct timestamp precision
def create_pyarrow_table_with_timestamps(df):
    """Create PyArrow table with microsecond timestamp precision and correct nullable settings"""
    
    # Create PyArrow schema that matches Iceberg schema
    pyarrow_schema = pa.schema([
        pa.field('transaction_id', pa.string(), nullable=False),
        pa.field('sale_date', pa.date32(), nullable=False),
        pa.field('sale_timestamp', pa.timestamp('us'), nullable=False),
        pa.field('customer_id', pa.string(), nullable=False),
        pa.field('product_id', pa.string(), nullable=False),
        pa.field('product_name', pa.string(), nullable=False),
        pa.field('category', pa.string(), nullable=False),
        pa.field('brand', pa.string(), nullable=False),
        pa.field('region', pa.string(), nullable=False),
        pa.field('country', pa.string(), nullable=False),
        pa.field('quantity', pa.int32(), nullable=False),
        pa.field('unit_price', pa.float64(), nullable=False),
        pa.field('total_amount', pa.float64(), nullable=False),
        pa.field('discount_percent', pa.float64(), nullable=False),
        pa.field('payment_method', pa.string(), nullable=False),
        pa.field('shipping_method', pa.string(), nullable=False),
        pa.field('customer_segment', pa.string(), nullable=False),
        pa.field('is_returned', pa.bool_(), nullable=False),
        pa.field('return_date', pa.string(), nullable=True),  # This one can be null
        pa.field('sales_rep_id', pa.string(), nullable=False)
    ])
    
    # Convert timestamp to microseconds (Iceberg requirement)
    df_clean = df.copy()
    df_clean['sale_timestamp'] = pd.to_datetime(df_clean['sale_timestamp']).dt.floor('us')
    df_clean['sale_date'] = pd.to_datetime(df_clean['sale_date']).dt.date
    
    # Create table with explicit schema
    return pa.table({
        'transaction_id': df_clean['transaction_id'],
        'sale_date': df_clean['sale_date'],
        'sale_timestamp': df_clean['sale_timestamp'],
        'customer_id': df_clean['customer_id'],
        'product_id': df_clean['product_id'],
        'product_name': df_clean['product_name'],
        'category': df_clean['category'],
        'brand': df_clean['brand'],
        'region': df_clean['region'],
        'country': df_clean['country'],
        'quantity': df_clean['quantity'],
        'unit_price': df_clean['unit_price'],
        'total_amount': df_clean['total_amount'],
        'discount_percent': df_clean['discount_percent'],
        'payment_method': df_clean['payment_method'],
        'shipping_method': df_clean['shipping_method'],
        'customer_segment': df_clean['customer_segment'],
        'is_returned': df_clean['is_returned'],
        'return_date': df_clean['return_date'],
        'sales_rep_id': df_clean['sales_rep_id']
    }, schema=pyarrow_schema)

# Helper function to analyze file statistics
def analyze_file_statistics(table, table_name):
    """Analyze file statistics for a table"""
    
    print(f"\n📊 Analyzing file statistics for {table_name}:")
    print("-" * 50)
    
    files_info = table.inspect.files()
    if len(files_info) > 0:
        files_df = files_info.to_pandas()
        
        # Basic statistics
        total_files = len(files_df)
        total_size = files_df['file_size_in_bytes'].sum()
        avg_file_size = files_df['file_size_in_bytes'].mean()
        min_file_size = files_df['file_size_in_bytes'].min()
        max_file_size = files_df['file_size_in_bytes'].max()
        
        print(f"📁 Total files: {total_files:,}")
        print(f"💾 Total size: {total_size:,} bytes ({total_size/1024/1024:.2f} MB)")
        print(f"📏 Average file size: {avg_file_size:,.0f} bytes ({avg_file_size/1024:.2f} KB)")
        print(f"📏 Min file size: {min_file_size:,} bytes ({min_file_size/1024:.2f} KB)")
        print(f"📏 Max file size: {max_file_size:,} bytes ({max_file_size/1024:.2f} KB)")
        
        # File size distribution
        small_files = len(files_df[files_df['file_size_in_bytes'] < 64*1024])  # < 64KB
        medium_files = len(files_df[(files_df['file_size_in_bytes'] >= 64*1024) & 
                                   (files_df['file_size_in_bytes'] < 1024*1024)])  # 64KB - 1MB
        large_files = len(files_df[files_df['file_size_in_bytes'] >= 1024*1024])  # >= 1MB
        
        print(f"\n📊 File size distribution:")
        print(f"  Small files (< 64KB): {small_files:,} ({small_files/total_files*100:.1f}%)")
        print(f"  Medium files (64KB-1MB): {medium_files:,} ({medium_files/total_files*100:.1f}%)")
        print(f"  Large files (>= 1MB): {large_files:,} ({large_files/total_files*100:.1f}%)")
        
        return {
            'total_files': total_files,
            'total_size': total_size,
            'avg_file_size': avg_file_size,
            'min_file_size': min_file_size,
            'max_file_size': max_file_size,
            'small_files': small_files,
            'medium_files': medium_files,
            'large_files': large_files
        }
    else:
        print("❌ No files found in table")
        return None

print("✅ Helper functions created!")


🏗️ Creating sales schema...
✅ Schema created successfully!
✅ Helper functions created!


## 5. Small File Problem Demonstration

The **Small File Problem** is a common issue in data lakes where many small files are created instead of fewer, larger files. This leads to:

### 🚨 **Problems with Small Files:**
- **Poor Query Performance**: More files to scan = slower queries
- **High Metadata Overhead**: Each file has metadata overhead
- **Storage Inefficiency**: Per-file overhead (footers, repeated dictionaries) and weaker compression inflate total size
- **High Costs**: More API calls and compute resources needed

### 📊 **Let's Create Small Files Intentionally:**
We'll create a table with many small files by inserting data in small batches.


In [6]:
# Create a table with many small files
print("🔄 Creating table with small files...")

# Drop existing table if it exists
try:
    catalog.drop_table("ecommerce.sales_small_files")
    print("🗑️ Dropped existing table")
except Exception as e:
    print(f"ℹ️ No existing table to drop: {e}")

# Create unpartitioned table
small_files_table = catalog.create_table(
    "ecommerce.sales_small_files",
    schema=sales_schema
)

# Convert data to DataFrame
df = pd.DataFrame(sales_data)

# Create small files by inserting data in small batches (1000 records each)
batch_size = 1000
total_batches = (len(df) + batch_size - 1) // batch_size  # ceiling division keeps any partial final batch

print(f"📦 Inserting data in {total_batches} batches of {batch_size} records each...")

for i in range(total_batches):
    start_idx = i * batch_size
    end_idx = min((i + 1) * batch_size, len(df))
    batch_df = df.iloc[start_idx:end_idx]
    
    # Convert batch to PyArrow table
    batch_table = create_pyarrow_table_with_timestamps(batch_df)
    
    # Insert batch
    small_files_table.append(batch_table)
    
    if (i + 1) % 10 == 0:
        print(f"  ✅ Inserted batch {i + 1}/{total_batches}")

print(f"✅ Created table with small files!")

# Analyze the small files problem
small_files_stats = analyze_file_statistics(small_files_table, "Small Files Table")

# Show the problem
if small_files_stats:
    print(f"\n🚨 Small File Problem Analysis:")
    print(f"  Total files: {small_files_stats['total_files']:,}")
    print(f"  Average file size: {small_files_stats['avg_file_size']:,.0f} bytes ({small_files_stats['avg_file_size']/1024:.2f} KB)")
    print(f"  Small files (< 64KB): {small_files_stats['small_files']:,} ({small_files_stats['small_files']/small_files_stats['total_files']*100:.1f}%)")
    
    if small_files_stats['small_files'] > small_files_stats['total_files'] * 0.8:
        print(f"  ⚠️  PROBLEM: {small_files_stats['small_files']/small_files_stats['total_files']*100:.1f}% of files are small!")
    else:
        print(f"  ✅ File sizes are reasonable")


🔄 Creating table with small files...
ℹ️ No existing table to drop: Table does not exist: ecommerce.sales_small_files
📦 Inserting data in 100 batches of 1000 records each...
  ✅ Inserted batch 10/100
  ✅ Inserted batch 20/100
  ✅ Inserted batch 30/100
  ✅ Inserted batch 40/100
  ✅ Inserted batch 50/100
  ✅ Inserted batch 60/100
  ✅ Inserted batch 70/100
  ✅ Inserted batch 80/100
  ✅ Inserted batch 90/100
  ✅ Inserted batch 100/100
✅ Created table with small files!

📊 Analyzing file statistics for Small Files Table:
--------------------------------------------------
📁 Total files: 100
💾 Total size: 4,229,095 bytes (4.03 MB)
📏 Average file size: 42,291 bytes (41.30 KB)
📏 Min file size: 42,145 bytes (41.16 KB)
📏 Max file size: 42,443 bytes (41.45 KB)

📊 File size distribution:
  Small files (< 64KB): 100 (100.0%)
  Medium files (64KB-1MB): 0 (0.0%)
  Large files (>= 1MB): 0 (0.0%)

🚨 Small File Problem Analysis:
  Total files: 100
  Average file size: 42,291 bytes (41.30 KB)
  Small files (< 64KB): 100 (100.0%)

## 6. Compaction Strategies

Now let's implement different compaction strategies to solve the small file problem:

### 🔧 **Compaction Strategies:**

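1. **Rewrite Compaction**: Rewrite data files so that many small files are merged into fewer, larger ones
2. **Bin Packing**: Group files toward a target size without reordering rows (the default rewrite strategy in Iceberg)
3. **Sort Compaction**: Rewrite files with rows sorted on chosen columns so min/max statistics can prune more files
4. **Z-Order Compaction**: Sort by interleaved column values to help queries that filter on several columns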
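
Why sorting helps can be shown without Iceberg at all. The sketch below is a simplified model under stated assumptions: each fixed-size chunk stands in for a data file, its `(min, max)` pair stands in for the column statistics kept in metadata, and the helper names `chunk_ranges` and `files_to_read` are hypothetical.

```python
import random
from typing import List, Tuple


def chunk_ranges(values: List[int], chunk_size: int) -> List[Tuple[int, int]]:
    """Split values into fixed-size chunks ("files") and return each chunk's (min, max)."""
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    return [(min(c), max(c)) for c in chunks]


def files_to_read(ranges: List[Tuple[int, int]], lo: int, hi: int) -> int:
    """Count chunks whose min/max range overlaps the query range [lo, hi]."""
    return sum(1 for cmin, cmax in ranges if cmax >= lo and cmin <= hi)


rng = random.Random(42)
days = [rng.randint(0, 180) for _ in range(10_000)]  # day offsets, like sale_date

unsorted_ranges = chunk_ranges(days, chunk_size=1_000)
sorted_ranges = chunk_ranges(sorted(days), chunk_size=1_000)

# Query: roughly one month of data (day offsets 59..89)
print("unsorted files to read:", files_to_read(unsorted_ranges, 59, 89))
print("sorted files to read:  ", files_to_read(sorted_ranges, 59, 89))
```

With random row order, every chunk spans nearly the whole date range, so no chunk can be skipped. After sorting, only the few chunks covering the requested days overlap the query.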

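### 📊 **Let's Implement Rewrite Compaction:**
This is the most common compaction strategy. The cell below simulates its result by writing the same data to a new table in a single append; a real rewrite would replace the small files in place within the original table.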


In [7]:
# Implement Rewrite Compaction
print("🔄 Implementing Rewrite Compaction...")

# Create a new table for compacted data
try:
    catalog.drop_table("ecommerce.sales_compacted")
    print("🗑️ Dropped existing compacted table")
except Exception as e:
    print(f"ℹ️ No existing compacted table to drop: {e}")

# Create compacted table
compacted_table = catalog.create_table(
    "ecommerce.sales_compacted",
    schema=sales_schema
)

# Insert all data at once to create larger files
print("📦 Inserting all data at once to create larger files...")

# Convert entire dataset to PyArrow table
full_table = create_pyarrow_table_with_timestamps(df)

# Insert all data in one operation
compacted_table.append(full_table)

print("✅ Created compacted table!")

# Analyze the compacted table
compacted_stats = analyze_file_statistics(compacted_table, "Compacted Table")

# Compare before and after compaction
if small_files_stats and compacted_stats:
    print(f"\n📊 Compaction Results Comparison:")
    print("-" * 60)
    print(f"{'Metric':<25} {'Small Files':<15} {'Compacted':<15} {'Improvement':<15}")
    print("-" * 60)
    
    # File count comparison
    file_reduction = small_files_stats['total_files'] / compacted_stats['total_files']
    print(f"{'Total Files':<25} {small_files_stats['total_files']:<15,} {compacted_stats['total_files']:<15,} {file_reduction:.1f}x fewer")
    
    # File size comparison
    size_increase = compacted_stats['avg_file_size'] / small_files_stats['avg_file_size']
    print(f"{'Avg File Size':<25} {small_files_stats['avg_file_size']/1024:<14.1f}KB {compacted_stats['avg_file_size']/1024:<14.1f}KB {size_increase:.1f}x larger")
    
    # Small files comparison
    small_files_reduction = small_files_stats['small_files'] / compacted_stats['small_files'] if compacted_stats['small_files'] > 0 else float('inf')
    print(f"{'Small Files (<64KB)':<25} {small_files_stats['small_files']:<15,} {compacted_stats['small_files']:<15,} {small_files_reduction:.1f}x fewer")
    
    # Storage efficiency
    storage_efficiency = (small_files_stats['total_size'] - compacted_stats['total_size']) / small_files_stats['total_size'] * 100
    print(f"{'Storage Efficiency':<25} {'N/A':<15} {'N/A':<15} {storage_efficiency:.1f}% saved")
    
    print("-" * 60)
    
    # Summary
    print(f"\n🎯 Compaction Summary:")
    print(f"  ✅ Reduced files from {small_files_stats['total_files']:,} to {compacted_stats['total_files']:,}")
    print(f"  ✅ Increased average file size by {size_increase:.1f}x")
    print(f"  ✅ Reduced small files by {small_files_reduction:.1f}x")
    print(f"  ✅ Improved storage efficiency by {storage_efficiency:.1f}%")


🔄 Implementing Rewrite Compaction...
ℹ️ No existing compacted table to drop: Table does not exist: ecommerce.sales_compacted
📦 Inserting all data at once to create larger files...
✅ Created compacted table!

📊 Analyzing file statistics for Compacted Table:
--------------------------------------------------
📁 Total files: 1
💾 Total size: 2,653,530 bytes (2.53 MB)
📏 Average file size: 2,653,530 bytes (2591.34 KB)
📏 Min file size: 2,653,530 bytes (2591.34 KB)
📏 Max file size: 2,653,530 bytes (2591.34 KB)

📊 File size distribution:
  Small files (< 64KB): 0 (0.0%)
  Medium files (64KB-1MB): 0 (0.0%)
  Large files (>= 1MB): 1 (100.0%)

📊 Compaction Results Comparison:
------------------------------------------------------------
Metric                    Small Files     Compacted       Improvement    
------------------------------------------------------------
Total Files               100             1               100.0x fewer
Avg File Size             41.3          KB 2591.3        KB 62.7x larger
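
The totals reported above also allow a rough estimate of how much each extra file costs. The arithmetic below assumes the whole size difference comes from per-file costs (footers, repeated dictionaries, less effective compression), which is a simplification.

```python
# Totals reported by analyze_file_statistics in this lab
small_total_bytes = 4_229_095      # 100 small files
compacted_total_bytes = 2_653_530  # 1 compacted file
small_file_count = 100
compacted_file_count = 1

saved_bytes = small_total_bytes - compacted_total_bytes
saved_percent = saved_bytes / small_total_bytes * 100
# Spread the saving over the files that were eliminated
overhead_per_file = saved_bytes / (small_file_count - compacted_file_count)

print(f"Bytes saved: {saved_bytes:,} ({saved_percent:.1f}%)")
print(f"Approx. overhead per extra file: {overhead_per_file:,.0f} bytes")
```

For this dataset the saving is 1,575,565 bytes (37.3%), or roughly 16 KB for each of the 99 files that compaction removed.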

## 7. Performance Testing - Before vs After Compaction

Let's test query performance to see the impact of compaction on query speed:

### ⏱️ **Performance Metrics:**
- **Query Execution Time**: How fast queries run
- **Files Scanned**: Number of files accessed during queries
- **Memory Usage**: Memory consumption during queries
- **I/O Operations**: Disk read operations
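
A single timed run is sensitive to caching and background noise. As a hedged sketch, the helper below (the name `time_repeated` and its defaults are assumptions, not part of the lab code) repeats a callable and reports the median, which is usually a steadier figure for execution time.

```python
import statistics
import time
from typing import Callable, Dict


def time_repeated(fn: Callable[[], object], repeats: int = 5, warmup: int = 1) -> Dict[str, float]:
    """Run fn several times and summarize wall-clock time in seconds."""
    for _ in range(warmup):
        fn()  # warm caches so the first timed run is not an outlier
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return {
        "median": statistics.median(samples),
        "min": min(samples),
        "max": max(samples),
    }


# Stand-in workload; in the lab this would wrap a table scan,
# for example: lambda: table.scan().to_arrow()
stats = time_repeated(lambda: sum(range(100_000)))
print(f"median={stats['median']:.6f}s min={stats['min']:.6f}s max={stats['max']:.6f}s")
```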


In [8]:
# Performance testing function
def measure_query_performance(table, query_name, row_filter=None):
    """Measure query performance for a table"""
    
    start_time = time.perf_counter()
    start_memory = psutil.Process().memory_info().rss / 1024 / 1024  # MB
    
    try:
        # Build the scan; with no filter this is a full table scan
        scan = table.scan(row_filter=row_filter) if row_filter else table.scan()
        result = scan.to_arrow()
        
        execution_time = time.perf_counter() - start_time
        end_memory = psutil.Process().memory_info().rss / 1024 / 1024  # MB
        memory_used = end_memory - start_memory
        
        # Count the data files the scan planned to read after metadata pruning
        files_scanned = len(list(scan.plan_files()))
        
        return {
            'query_name': query_name,
            'execution_time': execution_time,
            'records_returned': len(result),
            'files_scanned': files_scanned,
            'memory_used': memory_used
        }
        
    except Exception as e:
        print(f"❌ Error executing query {query_name}: {e}")
        return None

# Test queries
test_queries = [
    ("Full Table Scan", None),
    ("Date Filter", "sale_date >= '2023-03-01' AND sale_date < '2023-04-01'"),
    ("Category Filter", "category = 'Electronics'"),
    ("Amount Filter", "total_amount > 500"),
    ("Complex Filter", "category = 'Electronics' AND total_amount > 500 AND region = 'North America'")
]

print("⏱️ Testing query performance...")

# Test small files table
print("\n🔍 Testing Small Files Table:")
small_files_results = []

for query_name, row_filter in test_queries:
    result = measure_query_performance(small_files_table, f"{query_name} (Small Files)", row_filter)
    if result:
        small_files_results.append(result)
        print(f"  {query_name}: {result['execution_time']:.3f}s ({result['files_scanned']} files)")

# Test compacted table
print("\n🔍 Testing Compacted Table:")
compacted_results = []

for query_name, row_filter in test_queries:
    result = measure_query_performance(compacted_table, f"{query_name} (Compacted)", row_filter)
    if result:
        compacted_results.append(result)
        print(f"  {query_name}: {result['execution_time']:.3f}s ({result['files_scanned']} files)")

# Compare results
if small_files_results and compacted_results:
    print(f"\n📊 Performance Comparison:")
    print("-" * 80)
    print(f"{'Query':<20} {'Small Files (s)':<15} {'Compacted (s)':<15} {'Speedup':<15} {'Files Reduction':<15}")
    print("-" * 80)
    
    for i, (small_result, compacted_result) in enumerate(zip(small_files_results, compacted_results)):
        speedup = small_result['execution_time'] / compacted_result['execution_time']
        file_reduction = small_result['files_scanned'] / compacted_result['files_scanned'] if compacted_result['files_scanned'] > 0 else float('inf')
        
        print(f"{test_queries[i][0]:<20} {small_result['execution_time']:<14.3f} {compacted_result['execution_time']:<14.3f} {speedup:<14.1f}x {file_reduction:<14.1f}x")
    
    print("-" * 80)
    
    # Calculate average improvements
    avg_speedup = sum(small_result['execution_time'] / compacted_result['execution_time'] 
                     for small_result, compacted_result in zip(small_files_results, compacted_results)) / len(small_files_results)
    avg_file_reduction = sum(small_result['files_scanned'] / compacted_result['files_scanned'] 
                            for small_result, compacted_result in zip(small_files_results, compacted_results) 
                            if compacted_result['files_scanned'] > 0) / len(small_files_results)
    
    print(f"\n🎯 Average Improvements:")
    print(f"  ⚡ Query Speed: {avg_speedup:.1f}x faster")
    print(f"  📁 File Reduction: {avg_file_reduction:.1f}x fewer files")
    print(f"  💾 Storage Efficiency: Better file organization")
    print(f"  🚀 Overall Performance: Significantly improved")


⏱️ Testing query performance...

🔍 Testing Small Files Table:
  Full Table Scan: 0.182s (100 files)
  Date Filter: 0.216s (100 files)
  Category Filter: 0.222s (100 files)
  Amount Filter: 0.213s (100 files)
  Complex Filter: 0.220s (100 files)

🔍 Testing Compacted Table:
  Full Table Scan: 0.008s (1 files)
  Date Filter: 0.016s (1 files)
  Category Filter: 0.020s (1 files)
  Amount Filter: 0.039s (1 files)
  Complex Filter: 0.025s (1 files)

📊 Performance Comparison:
--------------------------------------------------------------------------------
Query                Small Files (s) Compacted (s)   Speedup         Files Reduction
--------------------------------------------------------------------------------
Full Table Scan      0.182          0.008          22.3          x 100.0         x
Date Filter          0.216          0.016          13.4          x 100.0         x
Category Filter      0.222          0.020          11.2          x 100.0         x
Amount Filter        0.213     

## 8. Lab Summary & Best Practices

### 🎉 **Congratulations!**

You've completed the **Data Compaction & File Management Lab** and learned about:

✅ **Small File Problem**: Understanding the impact of small files  
✅ **Compaction Strategies**: Implementing rewrite compaction  
✅ **File Size Optimization**: Analyzing and improving file sizes  
✅ **Performance Testing**: Measuring query performance improvements  
✅ **Storage Efficiency**: Reducing costs and improving performance  

### 🚀 **Key Takeaways:**

#### **1. Small Files Are Expensive:**
- **Poor Query Performance**: More files = slower queries
- **High Metadata Overhead**: Each file has overhead
- **Storage Inefficiency**: Wasted space and resources
- **High Costs**: More API calls and compute needed

#### **2. Compaction Solves These Problems:**
- **Fewer Files**: Reduced file count for faster scanning
- **Larger Files**: Better I/O performance
- **Storage Efficiency**: Reduced metadata overhead
- **Cost Savings**: Lower storage and compute costs

#### **3. Best Practices for Production:**

##### **File Size Guidelines:**
- **Target Size**: 128MB - 1GB per file (Iceberg's default `write.target-file-size-bytes` is 512MB)
- **Minimum Size**: Avoid files < 64KB
- **Maximum Size**: Keep files < 2GB for optimal performance
- **Compression**: Use appropriate compression (Snappy, Gzip, Zstd)

##### **Compaction Strategies:**
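- **Rewrite Compaction**: Rewrite data files so that many small files are merged into fewer, larger ones
- **Bin Packing**: Group files toward a target size without reordering rows (the default rewrite strategy in Iceberg)
- **Sort Compaction**: Rewrite files with rows sorted on chosen columns so min/max statistics can prune more files
- **Z-Order Compaction**: Sort by interleaved column values to help queries that filter on several columns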
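
The core of Z-ordering is bit interleaving. The sketch below is a toy version under stated assumptions: it handles only small non-negative integers, and the function name `interleave_bits` is hypothetical. Real engines first map column values to comparable bit strings.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Return the Z-order (Morton) value of two non-negative integers."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # x fills the even bit positions
        z |= ((y >> i) & 1) << (2 * i + 1)   # y fills the odd bit positions
    return z


# Sort points on a small grid by their Z-value
points = [(x, y) for x in range(4) for y in range(4)]
z_sorted = sorted(points, key=lambda p: interleave_bits(p[0], p[1]))
print(z_sorted[:4])  # [(0, 0), (1, 0), (0, 1), (1, 1)]
```

Points that are close in both dimensions end up close in the sort order, so a file written from a contiguous slice covers a compact region of both columns.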

##### **Automation:**
- **Scheduled Compaction**: Run compaction jobs regularly
- **Threshold-based**: Trigger compaction when file count exceeds threshold
- **Size-based**: Trigger when average file size drops below threshold
- **Time-based**: Run compaction during low-traffic periods
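
The threshold-based and size-based triggers can be combined in a small policy function. This is a sketch, not a recommendation: the name `should_compact` and both default thresholds are hypothetical, and the input is assumed to use the keys returned by `analyze_file_statistics` earlier in this lab.

```python
from typing import Dict, Optional


def should_compact(
    stats: Optional[Dict[str, float]],
    max_files: int = 50,
    min_avg_bytes: int = 1024 * 1024,
) -> bool:
    """Decide whether a table needs compaction based on file statistics.

    The thresholds are hypothetical defaults, not recommendations.
    """
    if not stats or stats["total_files"] <= 1:
        return False  # nothing to merge
    too_many_files = stats["total_files"] > max_files
    files_too_small = stats["avg_file_size"] < min_avg_bytes
    return too_many_files or files_too_small


# Values taken from the file statistics reported earlier in this lab
print(should_compact({"total_files": 100, "avg_file_size": 42_291}))    # True
print(should_compact({"total_files": 1, "avg_file_size": 2_653_530}))   # False
```

A scheduler could call a check like this before each run and skip tables that do not need work.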

### 📚 **Production Implementation:**

#### **1. Apache Spark:**
```python
# Rewrite compaction with Spark (replace `catalog` with your Iceberg catalog name)
spark.sql("""
    CALL catalog.system.rewrite_data_files(
        table => 'database.table',
        options => map('target-file-size-bytes', '134217728')
    )
""")
```

#### **2. Trino:**
```sql
-- Compaction with Trino: merge files smaller than the threshold
ALTER TABLE catalog.database.table
EXECUTE optimize(file_size_threshold => '128MB');
```

#### **3. PyIceberg:**
```python
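# PyIceberg had no built-in rewrite_data_files action at the time of writing.
# For a table that fits in memory, one option is to read it and overwrite it
# in a single commit, which replaces the small files with freshly written ones.
table.overwrite(table.scan().to_arrow())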
```

### 🎯 **Next Steps:**

1. **Implement in Production**: Apply compaction to your data lakes
2. **Monitor Performance**: Track query performance improvements
3. **Automate Compaction**: Set up scheduled compaction jobs
4. **Optimize Further**: Experiment with different compaction strategies
5. **Scale Up**: Apply to larger datasets and more complex schemas

### 📖 **Additional Resources:**

- [Apache Iceberg Compaction Documentation](https://iceberg.apache.org/docs/latest/maintenance/)
- [Iceberg Spark Procedures (`rewrite_data_files`)](https://iceberg.apache.org/docs/latest/spark-procedures/)
- [Trino Iceberg Connector](https://trino.io/docs/current/connector/iceberg.html)
- [PyIceberg Python API](https://py.iceberg.apache.org/api/)

**Happy compacting! 🚀**
