# Notebook 19: Performance Optimization and Debugging

## Overview

This notebook covers performance optimization and debugging techniques in Polars. You'll learn how to profile queries, identify bottlenecks, optimize memory usage, and leverage parallel processing for maximum performance.

## Topics Covered

1. **Query Profiling and Analysis**
   - Using explain() for query plans
   - Understanding execution strategies
   - Profiling query execution

2. **Memory Management**
   - Memory profiling techniques
   - Streaming for large datasets
   - Memory-efficient operations

3. **Lazy Evaluation Optimization**
   - Query optimization strategies
   - Predicate pushdown
   - Projection pushdown

4. **Parallel Processing**
   - Thread pool configuration
   - Parallel operations
   - Scaling strategies

5. **Performance Pitfalls**
   - Common anti-patterns
   - Expensive operations to avoid
   - Best practices

6. **Benchmarking**
   - Timing comparisons
   - Performance testing
   - Regression detection

7. **Real-World Optimization**
   - Complete optimization workflow
   - Before/after comparisons
   - Production tips

In [None]:
import polars as pl
import time
import numpy as np
from datetime import datetime, timedelta
import psutil  # For memory profiling
import os

# Configure Polars display options (these affect printing, not performance)
pl.Config.set_fmt_str_lengths(50)
pl.Config.set_tbl_rows(10)

print(f"Polars version: {pl.__version__}")
print(f"Available CPU cores: {os.cpu_count()}")

## Part 1: Query Profiling and Analysis

### 1.1 Understanding Query Plans with explain()

The `explain()` method shows how Polars will execute a lazy query. By default it prints the optimized plan; pass `optimized=False` to see the plan as written.

In [None]:
# Create sample dataset
df = pl.DataFrame({
    'customer_id': range(1, 10001),
    'region': np.random.choice(['North', 'South', 'East', 'West'], 10000),
    'product': np.random.choice(['A', 'B', 'C', 'D'], 10000),
    'sales': np.random.randint(100, 10000, 10000),
    'quantity': np.random.randint(1, 100, 10000),
    'date': [datetime(2024, 1, 1) + timedelta(days=i) for i in range(10000)]
})

print("Sample data:")
print(df.head())
print(f"\nDataFrame shape: {df.shape}")

In [None]:
# Build a lazy query
lazy_query = (
    df.lazy()
    .filter(pl.col('region') == 'North')
    .filter(pl.col('sales') > 5000)
    .select(['customer_id', 'product', 'sales'])
    .group_by('product')
    .agg([
        pl.col('sales').sum().alias('total_sales'),
        pl.col('customer_id').n_unique().alias('unique_customers')
    ])
)

# Show the unoptimized query plan (the query as written)
print("Unoptimized Query Plan:")
print(lazy_query.explain(optimized=False))

In [None]:
# Show optimized query plan
print("Optimized Query Plan:")
print(lazy_query.explain(optimized=True))

# Notice optimizations like:
# - Filter pushdown (filters applied early)
# - Projection pushdown (only needed columns selected)
# - Predicate combination (multiple filters combined)

### 1.2 Profiling Query Execution

Use `profile()` to get detailed timing information about query execution. It runs the query and returns a tuple: the result DataFrame and a DataFrame of per-node timings.

In [None]:
# Profile the query execution
result, profile = lazy_query.profile()

print("Query Result:")
print(result)

print("\nProfile Information:")
print(profile)

In [None]:
# Profile a more complex query
complex_query = (
    df.lazy()
    .with_columns([
        (pl.col('sales') / pl.col('quantity')).alias('price_per_unit'),
        pl.col('date').dt.month().alias('month')
    ])
    .filter(pl.col('price_per_unit') > 50)
    .group_by(['region', 'month'])
    .agg([
        pl.col('sales').sum().alias('total_sales'),
        pl.col('sales').mean().alias('avg_sales'),
        pl.col('customer_id').n_unique().alias('customers'),
        pl.col('quantity').sum().alias('total_quantity')
    ])
    .sort(['region', 'month'])
)

result, profile = complex_query.profile()

print("Profile for complex query:")
print(profile)
print(f"\nResult shape: {result.shape}")

## Part 2: Memory Management

### 2.1 Memory Profiling

Understanding memory usage is crucial for handling large datasets.

In [None]:
def get_memory_usage():
    """Get current process memory usage in MB"""
    process = psutil.Process()
    return process.memory_info().rss / 1024 / 1024

def profile_memory(func, *args, **kwargs):
    """Profile memory usage of a function"""
    mem_before = get_memory_usage()
    result = func(*args, **kwargs)
    mem_after = get_memory_usage()
    mem_used = mem_after - mem_before
    
    print(f"Memory before: {mem_before:.2f} MB")
    print(f"Memory after: {mem_after:.2f} MB")
    print(f"Memory used: {mem_used:.2f} MB")
    
    return result, mem_used

# Check DataFrame memory usage
print(f"DataFrame estimated size: {df.estimated_size('mb'):.2f} MB")
print(f"Current process memory: {get_memory_usage():.2f} MB")

In [None]:
# Compare memory usage: eager vs lazy
# (RSS deltas are noisy on small data; treat the numbers as rough indicators)

# Eager execution
print("Eager execution:")
def eager_approach():
    return (
        df
        .filter(pl.col('region') == 'North')
        .filter(pl.col('sales') > 5000)
        .select(['product', 'sales'])
        .group_by('product')
        .agg(pl.col('sales').sum())
    )

result_eager, mem_eager = profile_memory(eager_approach)

print("\nLazy execution:")
def lazy_approach():
    return (
        df.lazy()
        .filter(pl.col('region') == 'North')
        .filter(pl.col('sales') > 5000)
        .select(['product', 'sales'])
        .group_by('product')
        .agg(pl.col('sales').sum())
        .collect()
    )

result_lazy, mem_lazy = profile_memory(lazy_approach)

print(f"\nMemory difference: {abs(mem_eager - mem_lazy):.2f} MB")

### 2.2 Streaming for Large Datasets

Use streaming to process datasets larger than available memory.

In [None]:
# Create a larger dataset for streaming demo
large_df = pl.DataFrame({
    'id': range(1, 100001),
    'category': np.random.choice(['A', 'B', 'C', 'D', 'E'], 100000),
    'value': np.random.randn(100000) * 100,
    'timestamp': [datetime(2024, 1, 1) + timedelta(minutes=i) for i in range(100000)]
})

# Save to disk for streaming
large_df.write_parquet('temp_large_data.parquet')

print(f"Large dataset size: {large_df.estimated_size('mb'):.2f} MB")

In [None]:
# Streaming query - processes data in chunks
print("Streaming execution:")
mem_before = get_memory_usage()

result = (
    pl.scan_parquet('temp_large_data.parquet')
    .group_by('category')
    .agg([
        pl.col('value').mean().alias('avg_value'),
        pl.col('value').std().alias('std_value'),
        pl.col('id').count().alias('count')
    ])
    .collect(streaming=True)  # Enable streaming (recent Polars releases deprecate this flag in favor of engine="streaming")
)

mem_after = get_memory_usage()
print(f"Memory used with streaming: {mem_after - mem_before:.2f} MB")
print("\nResult:")
print(result)

In [None]:
# Compare with non-streaming
print("Non-streaming execution:")
mem_before = get_memory_usage()

result = (
    pl.scan_parquet('temp_large_data.parquet')
    .group_by('category')
    .agg([
        pl.col('value').mean().alias('avg_value'),
        pl.col('value').std().alias('std_value'),
        pl.col('id').count().alias('count')
    ])
    .collect(streaming=False)
)

mem_after = get_memory_usage()
print(f"Memory used without streaming: {mem_after - mem_before:.2f} MB")

# Clean up (os was imported in the setup cell)
os.remove('temp_large_data.parquet')

### 2.3 Memory-Efficient Operations

In [None]:
# Tip 1: Use appropriate data types
print("Impact of data types on memory:")

# Inefficient: using default types
df_inefficient = pl.DataFrame({
    'small_int': range(10000),  # Will use Int64 by default
    'category': ['A'] * 10000,  # Will use String
})
print(f"Inefficient size: {df_inefficient.estimated_size('mb'):.2f} MB")

# Efficient: using optimal types
df_efficient = pl.DataFrame({
    'small_int': pl.Series(range(10000), dtype=pl.UInt16),  # Smaller integer type
    'category': pl.Series(['A'] * 10000, dtype=pl.Categorical),  # Categorical for repeating strings
})
print(f"Efficient size: {df_efficient.estimated_size('mb'):.2f} MB")
print(f"Memory saved: {(df_inefficient.estimated_size('mb') - df_efficient.estimated_size('mb')):.2f} MB")

In [None]:
# Tip 2: Drop columns you don't need early
print("\nMemory optimization with column selection:")

# Bad: keeping all columns
def approach_bad():
    result = df.filter(pl.col('region') == 'North')
    result = result.group_by('product').agg(pl.col('sales').sum())
    return result

# Good: select only needed columns early
def approach_good():
    result = df.select(['region', 'product', 'sales'])
    result = result.filter(pl.col('region') == 'North')
    result = result.group_by('product').agg(pl.col('sales').sum())
    return result

print("Bad approach:")
_, mem_bad = profile_memory(approach_bad)

print("\nGood approach:")
_, mem_good = profile_memory(approach_good)

print(f"\nMemory saved: {mem_bad - mem_good:.2f} MB")

## Part 3: Lazy Evaluation Optimization

### 3.1 Understanding Query Optimization

Polars automatically optimizes lazy queries through various techniques.

In [None]:
# Predicate Pushdown - filters are applied as early as possible
print("Predicate Pushdown Example:")

query = (
    df.lazy()
    .select(['customer_id', 'region', 'product', 'sales'])
    .filter(pl.col('region') == 'North')  # This filter will be pushed down
    .filter(pl.col('sales') > 5000)       # This too
)

print("Unoptimized plan:")
print(query.explain(optimized=False))

print("\nOptimized plan:")
print(query.explain(optimized=True))
print("\nNotice how filters are combined and pushed early in execution")

In [None]:
# Projection Pushdown - only necessary columns are read
print("Projection Pushdown Example:")

# Create a temporary parquet file
df.write_parquet('temp_data.parquet')

query = (
    pl.scan_parquet('temp_data.parquet')
    .select(['product', 'sales'])  # Only these columns will be read from disk
    .filter(pl.col('sales') > 5000)
    .group_by('product')
    .agg(pl.col('sales').sum())
)

print("Optimized plan shows projection pushdown:")
print(query.explain(optimized=True))

result = query.collect()
print("\nResult:")
print(result)

# Clean up
os.remove('temp_data.parquet')

In [None]:
# Optimization tips for writing queries

# Tip 1: Put filters early
# Good
good_query = (
    df.lazy()
    .filter(pl.col('region') == 'North')  # Filter early
    .with_columns([
        (pl.col('sales') * 1.1).alias('sales_with_tax')
    ])
    .select(['product', 'sales_with_tax'])
)

# Tip 2: Combine filters when possible
# In lazy mode the optimizer merges consecutive filter() calls on its own;
# a single combined condition mainly helps readability and eager execution
combined_filter = (
    df.lazy()
    .filter(
        (pl.col('region') == 'North') & 
        (pl.col('sales') > 5000) &
        (pl.col('quantity') > 10)
    )
)

print("Optimization tip: Combine filters")
print(combined_filter.explain(optimized=True))

## Part 4: Parallel Processing

### 4.1 Thread Pool Configuration

Polars runs most operations on an internal thread pool, which by default is sized to the number of logical CPU cores.

In [None]:
# Check and configure thread pool
print(f"Available CPU cores: {os.cpu_count()}")
print(f"Polars thread pool size: {pl.thread_pool_size()}")

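# The thread count is fixed when Polars is first imported. To change it, set the
# POLARS_MAX_THREADS environment variable before importing polars, e.g.:
# os.environ["POLARS_MAX_THREADS"] = "4"  # must run before `import polars`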

# For CPU-bound operations, use all cores
# For I/O-bound operations, you might want fewer threads

### 4.2 Parallel Operations

In [None]:
# Polars automatically parallelizes many operations

# Create a larger dataset
large_df = pl.DataFrame({
    'group': np.random.choice(['A', 'B', 'C', 'D'], 100000),
    'value1': np.random.randn(100000),
    'value2': np.random.randn(100000),
    'value3': np.random.randn(100000),
})

# Time a parallel group_by operation
start = time.time()
result = (
    large_df.lazy()
    .group_by('group')
    .agg([
        pl.col('value1').mean(),
        pl.col('value1').std(),
        pl.col('value2').sum(),
        pl.col('value3').max(),
        pl.col('value3').min(),
    ])
    .collect()
)
elapsed = time.time() - start

print(f"Parallel aggregation took: {elapsed:.4f} seconds")
print("\nResult:")
print(result)

In [None]:
# Multiple independent operations are automatically parallelized
result = large_df.select([
    pl.col('value1').mean().alias('mean1'),
    pl.col('value2').mean().alias('mean2'),
    pl.col('value3').mean().alias('mean3'),
    pl.col('value1').std().alias('std1'),
    pl.col('value2').std().alias('std2'),
    pl.col('value3').std().alias('std3'),
])

print("Multiple aggregations (parallelized):")
print(result)

## Part 5: Performance Pitfalls

### 5.1 Common Anti-Patterns

In [None]:
# Pitfall 1: Using map_elements() (formerly apply()) when native expressions exist

print("Pitfall 1: Avoid map_elements() when possible\n")

# Bad: Using map_elements with a Python function (slow)
def custom_calc(x):
    return x * 2 + 10

start = time.time()
result_bad = df.with_columns([
    pl.col('sales').map_elements(custom_calc, return_dtype=pl.Int64).alias('calculated')  # integer in, integer out
])
time_bad = time.time() - start

# Good: Using native expressions (fast)
start = time.time()
result_good = df.with_columns([
    (pl.col('sales') * 2 + 10).alias('calculated')
])
time_good = time.time() - start

print(f"Using map_elements: {time_bad:.4f} seconds")
print(f"Using native expressions: {time_good:.4f} seconds")
print(f"Speedup: {time_bad / time_good:.1f}x")

In [None]:
# Pitfall 2: Collecting too early in lazy queries

print("Pitfall 2: Don't collect() too early\n")

# Bad: Multiple collect() calls
start = time.time()
temp1 = df.lazy().filter(pl.col('region') == 'North').collect()
temp2 = temp1.lazy().filter(pl.col('sales') > 5000).collect()
result_bad = temp2.lazy().group_by('product').agg(pl.col('sales').sum()).collect()
time_bad = time.time() - start

# Good: Single collect() at the end
start = time.time()
result_good = (
    df.lazy()
    .filter(pl.col('region') == 'North')
    .filter(pl.col('sales') > 5000)
    .group_by('product')
    .agg(pl.col('sales').sum())
    .collect()
)
time_good = time.time() - start

print(f"Multiple collect(): {time_bad:.4f} seconds")
print(f"Single collect(): {time_good:.4f} seconds")
print(f"Speedup: {time_bad / time_good:.1f}x")

In [None]:
# Pitfall 3: Row-wise operations instead of columnar

print("Pitfall 3: Use columnar operations\n")

# Bad: Row-by-row iteration (very slow)
start = time.time()
result_list = []
for row in df.iter_rows(named=True):
    if row['region'] == 'North' and row['sales'] > 5000:
        result_list.append(row)
result_bad = pl.DataFrame(result_list)
time_bad = time.time() - start

# Good: Columnar operations (fast)
start = time.time()
result_good = df.filter(
    (pl.col('region') == 'North') & 
    (pl.col('sales') > 5000)
)
time_good = time.time() - start

print(f"Row-by-row: {time_bad:.4f} seconds")
print(f"Columnar: {time_good:.4f} seconds")
print(f"Speedup: {time_bad / time_good:.1f}x")

In [None]:
# Pitfall 4: Creating too many intermediate DataFrames

print("Pitfall 4: Chain operations efficiently\n")

# Bad: Many intermediate variables
start = time.time()
df1 = df.filter(pl.col('region') == 'North')
df2 = df1.with_columns([(pl.col('sales') * 1.1).alias('sales_tax')])
df3 = df2.select(['product', 'sales_tax'])
result_bad = df3.group_by('product').agg(pl.col('sales_tax').sum())
time_bad = time.time() - start

# Good: Chain operations (in eager mode this mainly avoids keeping df1..df3 alive; timings are usually similar)
start = time.time()
result_good = (
    df.filter(pl.col('region') == 'North')
    .with_columns([(pl.col('sales') * 1.1).alias('sales_tax')])
    .select(['product', 'sales_tax'])
    .group_by('product')
    .agg(pl.col('sales_tax').sum())
)
time_good = time.time() - start

print(f"Intermediate variables: {time_bad:.4f} seconds")
print(f"Chained operations: {time_good:.4f} seconds")

### 5.2 Expensive Operations to Watch

In [None]:
# Operations that can be expensive:

print("Expensive operations and alternatives:\n")

# 1. join() can be expensive with large DataFrames
# Tip: Filter data before joining
print("1. Joins - filter before joining to reduce data")

# 2. sort() on large DataFrames
# Tip: Only sort when necessary, use top_k() for top N values
print("2. Sorts - use top_k() instead of sort().head()")

# Example: Getting top 5 sales
# Instead of:
# df.sort('sales', descending=True).head(5)
# Use:
top_sales = df.select(pl.col('sales')).top_k(5, by='sales')
print("\nTop 5 sales (using top_k):")
print(top_sales)

# 3. unique() on large text columns
# Tip: Use categorical type for columns with repeated values
print("\n3. Unique - use Categorical type for repeated string columns")

# 4. Multiple aggregations computed in separate passes
# Tip: Combine aggregations in a single select() call (or agg() after group_by)
print("\n4. Aggregations - combine in a single select()/agg() call")

## Part 6: Benchmarking

### 6.1 Timing Comparisons

In [None]:
import time
from typing import Callable, List, Dict

def benchmark_function(func: Callable, name: str, iterations: int = 5) -> Dict:
    """Benchmark a function over multiple iterations"""
    times = []
    
    for _ in range(iterations):
        # perf_counter() is a monotonic, high-resolution clock suited to timing
        start = time.perf_counter()
        func()
        elapsed = time.perf_counter() - start
        times.append(elapsed)
    
    return {
        'name': name,
        'mean': np.mean(times),
        'std': np.std(times),
        'min': np.min(times),
        'max': np.max(times),
        'iterations': iterations
    }

def compare_approaches(approaches: List[tuple]) -> pl.DataFrame:
    """Compare multiple approaches and return results"""
    results = []
    
    for func, name in approaches:
        result = benchmark_function(func, name)
        results.append(result)
    
    return pl.DataFrame(results).sort('mean')

print("Benchmark utilities defined")

In [None]:
# Benchmark different approaches to the same task

# Task: Calculate total sales by region for North and South only

def approach1_eager():
    return (
        df
        .filter(pl.col('region').is_in(['North', 'South']))
        .group_by('region')
        .agg(pl.col('sales').sum())
    )

def approach2_lazy():
    return (
        df.lazy()
        .filter(pl.col('region').is_in(['North', 'South']))
        .group_by('region')
        .agg(pl.col('sales').sum())
        .collect()
    )

def approach3_lazy_streaming():
    return (
        df.lazy()
        .filter(pl.col('region').is_in(['North', 'South']))
        .group_by('region')
        .agg(pl.col('sales').sum())
        .collect(streaming=True)  # recent Polars releases prefer engine="streaming"
    )

approaches = [
    (approach1_eager, "Eager"),
    (approach2_lazy, "Lazy"),
    (approach3_lazy_streaming, "Lazy + Streaming")
]

results = compare_approaches(approaches)
print("Benchmark results (sorted by mean time):")
print(results)
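
Benchmark results like these can also serve as a regression check if the numbers from an earlier run are kept around. The sketch below assumes a hypothetical helper `find_regressions` and hard-coded timings; how the baseline is stored (file, CI artifact, database) is left open.

```python
from typing import Dict, List

def find_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.20,
) -> List[str]:
    """Names whose current time exceeds the baseline by more than `tolerance` (a fraction)."""
    return [
        name
        for name, seconds in current.items()
        if name in baseline and seconds > baseline[name] * (1 + tolerance)
    ]

# Hypothetical numbers: mean seconds per approach from a stored run and from today's run
baseline = {"Eager": 0.010, "Lazy": 0.008}
current = {"Eager": 0.011, "Lazy": 0.015}

print(find_regressions(baseline, current))
```

The 20% tolerance is an arbitrary starting point; it should be wider than the run-to-run variance observed on the machine doing the measuring.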

### 6.2 Best Practices for Benchmarking

In [None]:
# Best practices for accurate benchmarking:

print("Benchmarking Best Practices:\n")

print("1. Run multiple iterations to account for variance")
print("2. Use realistic data sizes for your use case")
print("3. Warm up the cache before benchmarking")
print("4. Measure both execution time and memory usage")
print("5. Test with both small and large datasets")
print("6. Consider I/O time separately from computation time")
print("7. Profile in an environment similar to production")

# Example: Warming up cache
def warm_up_cache():
    # Run the operation once to warm up
    _ = df.select(['region', 'sales']).group_by('region').agg(pl.col('sales').sum())

warm_up_cache()
print("\nCache warmed up for accurate benchmarking")
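
Practices 1 and 3 can be folded into the timing helper itself. The following is a standard-library sketch with a hypothetical `benchmark_with_warmup` function; it discards a configurable number of untimed runs and reports the median, which is less sensitive to outliers than the mean.

```python
import statistics
import time
from typing import Callable, Dict

def benchmark_with_warmup(
    func: Callable[[], object],
    warmup: int = 1,
    iterations: int = 5,
) -> Dict[str, float]:
    """Time func() with perf_counter after discarding `warmup` untimed runs."""
    for _ in range(warmup):
        func()  # untimed: lets caches and lazy initialisation settle

    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)

    return {
        "median": statistics.median(samples),
        "min": min(samples),
        "max": max(samples),
    }

# Any zero-argument callable works, e.g. a lambda wrapping a Polars query
stats = benchmark_with_warmup(lambda: sorted(range(100_000), reverse=True))
print(stats)
```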

## Part 7: Real-World Optimization Workflow

### 7.1 Complete Optimization Example

In [None]:
# Scenario: Analyzing sales data with performance optimization

# Create a realistic dataset
np.random.seed(42)
sales_data = pl.DataFrame({
    'order_id': range(1, 50001),
    'customer_id': np.random.randint(1, 5000, 50000),
    'product_category': np.random.choice(['Electronics', 'Clothing', 'Food', 'Books', 'Toys'], 50000),
    'product_name': np.random.choice([f'Product_{i}' for i in range(100)], 50000),
    'quantity': np.random.randint(1, 20, 50000),
    'unit_price': np.random.uniform(10, 1000, 50000),
    'discount_pct': np.random.choice([0, 5, 10, 15, 20], 50000),
    'order_date': [datetime(2024, 1, 1) + timedelta(days=np.random.randint(0, 365)) for _ in range(50000)],
    'region': np.random.choice(['North', 'South', 'East', 'West'], 50000),
    'payment_method': np.random.choice(['Credit', 'Debit', 'Cash', 'PayPal'], 50000)
})

print(f"Dataset size: {sales_data.estimated_size('mb'):.2f} MB")
print(f"Rows: {len(sales_data):,}")
print("\nSample data:")
print(sales_data.head())

In [None]:
# BEFORE OPTIMIZATION: Inefficient approach

print("BEFORE OPTIMIZATION:\n")

def unoptimized_analysis():
    # Calculate total revenue by category and region
    # Using inefficient patterns
    
    # Step 1: Add calculated columns
    df1 = sales_data.with_columns([
        (pl.col('quantity') * pl.col('unit_price')).alias('subtotal')
    ])
    
    # Step 2: Apply discount (a second pass that creates another intermediate DataFrame)
    df2 = df1.with_columns([
        (pl.col('subtotal') * (1 - pl.col('discount_pct') / 100)).alias('total')
    ])
    
    # Step 3: Filter for specific region
    df3 = df2.filter(pl.col('region').is_in(['North', 'South']))
    
    # Step 4: Filter for date range
    df4 = df3.filter(
        pl.col('order_date') >= datetime(2024, 6, 1)
    )
    
    # Step 5: Group and aggregate
    result = (
        df4
        .group_by(['product_category', 'region'])
        .agg([
            pl.col('total').sum().alias('total_revenue'),
            pl.col('order_id').count().alias('order_count'),
            pl.col('customer_id').n_unique().alias('unique_customers')
        ])
        .sort('total_revenue', descending=True)
    )
    
    return result

# Time and profile
start = time.time()
mem_before = get_memory_usage()

result_before = unoptimized_analysis()

time_before = time.time() - start
mem_used_before = get_memory_usage() - mem_before

print(f"Execution time: {time_before:.4f} seconds")
print(f"Memory used: {mem_used_before:.2f} MB")
print("\nResult:")
print(result_before)

In [None]:
# AFTER OPTIMIZATION: Efficient approach

print("AFTER OPTIMIZATION:\n")

def optimized_analysis():
    # Same calculation but with optimizations:
    # 1. Use lazy evaluation
    # 2. Apply filters early
    # 3. Select only needed columns
    # 4. Combine operations
    # 5. Single collect at the end
    
    result = (
        sales_data.lazy()
        # Apply filters FIRST (predicate pushdown)
        .filter(
            pl.col('region').is_in(['North', 'South']) &
            (pl.col('order_date') >= datetime(2024, 6, 1))
        )
        # Select only needed columns (projection pushdown)
        .select([
            'product_category',
            'region',
            'order_id',
            'customer_id',
            'quantity',
            'unit_price',
            'discount_pct'
        ])
        # Calculate in a single step
        .with_columns([
            (
                pl.col('quantity') * 
                pl.col('unit_price') * 
                (1 - pl.col('discount_pct') / 100)
            ).alias('total')
        ])
        # Aggregate
        .group_by(['product_category', 'region'])
        .agg([
            pl.col('total').sum().alias('total_revenue'),
            pl.col('order_id').count().alias('order_count'),
            pl.col('customer_id').n_unique().alias('unique_customers')
        ])
        .sort('total_revenue', descending=True)
        # Single collect at the end
        .collect()
    )
    
    return result

# Time and profile
start = time.time()
mem_before = get_memory_usage()

result_after = optimized_analysis()

time_after = time.time() - start
mem_used_after = get_memory_usage() - mem_before

print(f"Execution time: {time_after:.4f} seconds")
print(f"Memory used: {mem_used_after:.2f} MB")
print("\nResult:")
print(result_after)

print("\n" + "="*50)
print("OPTIMIZATION IMPACT:")
print("="*50)
print(f"Time improvement: {time_before / time_after:.2f}x faster")
print(f"Memory reduction: {mem_used_before - mem_used_after:.2f} MB saved")

### 7.2 Production Optimization Checklist

In [None]:
print("PRODUCTION OPTIMIZATION CHECKLIST:\n")
print("☐ 1. Use lazy evaluation with .lazy() and single .collect()")
print("☐ 2. Apply filters as early as possible in the query")
print("☐ 3. Select only necessary columns")
print("☐ 4. Use appropriate data types (e.g., Categorical for repeated strings)")
print("☐ 5. Avoid map_elements() - use native expressions instead")
print("☐ 6. Use streaming for very large datasets")
print("☐ 7. Profile queries with .explain() and .profile()")
print("☐ 8. Combine multiple filters into a single condition")
print("☐ 9. Use top_k() instead of sort().head()")
print("☐ 10. Monitor memory usage with .estimated_size()")
print("☐ 11. Configure thread pool size based on workload")
print("☐ 12. Use scan_parquet() for file-based operations")
print("☐ 13. Avoid row-wise iterations - think columnar")
print("☐ 14. Chain operations instead of creating intermediate DataFrames")
print("☐ 15. Benchmark different approaches before deploying")

## Summary

In this notebook, you learned:

### Query Profiling
- Using `explain()` to understand query execution plans
- Using `profile()` to identify performance bottlenecks
- Understanding optimization strategies (predicate/projection pushdown)

### Memory Management
- Profiling memory usage with `estimated_size()`
- Using streaming for datasets larger than memory
- Memory-efficient operations and data types

### Lazy Evaluation
- How Polars optimizes lazy queries automatically
- Predicate and projection pushdown
- Writing queries for optimal performance

### Parallel Processing
- How Polars leverages multiple CPU cores
- Thread pool configuration
- Parallel operations are automatic

### Performance Pitfalls
- Avoid `map_elements()` when native expressions exist
- Don't call `collect()` too early in lazy queries
- Use columnar operations instead of row-wise iteration
- Chain operations efficiently

### Benchmarking
- Timing different approaches
- Best practices for accurate measurements
- Comparing eager, lazy, and streaming execution

### Real-World Optimization
- Complete before/after optimization workflow
- Production checklist for high performance
- Measuring time and memory improvements

### Key Takeaways
1. **Use lazy evaluation** - Build your query with `.lazy()` and collect once at the end
2. **Filter early** - Apply filters before other operations to reduce data volume
3. **Think columnar** - Avoid row-wise operations, use native expressions
4. **Profile first** - Use `explain()` and `profile()` to understand what's slow
5. **Choose the right data types** - Categorical, smaller integer types save memory
6. **Use streaming** - For datasets larger than available memory
7. **Benchmark alternatives** - Test different approaches to find the fastest

Remember: "Premature optimization is the root of all evil" - Profile first, then optimize where it matters!

## Practice Exercises

### Exercise 1: Query Optimization
You have this inefficient query. Optimize it and measure the improvement:

```python
result = (
    sales_data
    .with_columns([(pl.col('quantity') * pl.col('unit_price')).alias('total')])
    .group_by('product_category')
    .agg(pl.col('total').sum())
    .filter(pl.col('total') > 100000)
)
```

Hints:
- Use lazy evaluation
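- Apply filters as early as the logic allows (here the filter depends on the aggregated `total`, so it cannot run before the `group_by`)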
- Use `.explain()` to see the difference

### Exercise 2: Memory Optimization
Create a DataFrame with 1 million rows and optimize its memory usage:
- Use appropriate integer types (UInt8, UInt16, etc.)
- Use Categorical for repeated string values
- Measure memory before and after optimization

### Exercise 3: Performance Pitfall Identification
Identify and fix the performance issues in this code:

```python
def slow_analysis():
    result = []
    for row in sales_data.iter_rows(named=True):
        if row['region'] == 'North':
            total = row['quantity'] * row['unit_price']
            result.append({'product': row['product_name'], 'total': total})
    return pl.DataFrame(result)
```

Rewrite it using columnar operations.

### Exercise 4: Streaming Implementation
Create a large dataset (10+ million rows), save it as parquet, and:
- Query it without streaming
- Query it with streaming
- Compare memory usage and execution time

### Exercise 5: Comprehensive Optimization
Given this complex analysis task:
- Calculate monthly revenue by category
- Filter for orders > $100
- Calculate average discount per category
- Find top 10 customers by total spend

Implement it in two ways:
1. Inefficient (eager, multiple intermediates)
2. Optimized (lazy, single collect, efficient filters)

Benchmark both approaches and compare the results.

### Exercise 6: Production Pipeline
Create a production-ready data processing pipeline that:
- Reads from multiple parquet files
- Applies transformations efficiently
- Includes error handling
- Profiles execution time
- Logs memory usage
- Uses streaming for large files

Include all optimizations from this notebook.

### Bonus Challenge: Query Plan Analysis
Write a function that:
1. Takes a lazy query as input
2. Prints the unoptimized and optimized query plans
3. Profiles the query execution
4. Reports execution time and memory usage
5. Provides optimization suggestions

Test it on various queries and see how different patterns affect performance.