# Polars Lazy Evaluation & Query Optimization

This notebook covers LazyFrames, query optimization, and when to use lazy vs eager evaluation.

## Key Concepts:
- **Eager**: Operations execute immediately (like pandas)
- **Lazy**: Operations are planned first, optimized, then executed
- **Query Plan**: A printable representation of the planned operations (see `.explain()`)
- **Optimizations**: Automatic improvements Polars makes to your queries

In [None]:
import polars as pl
import numpy as np
import time

## Part 1: Eager vs Lazy Execution

In [None]:
# Create sample data
df = pl.DataFrame({
    'customer_id': range(1, 11),
    'name': [f'Customer_{i}' for i in range(1, 11)],
    'age': [25, 34, 28, 42, 31, 55, 23, 38, 45, 29],
    'city': ['NYC', 'LA', 'Chicago', 'NYC', 'Boston', 'LA', 'Chicago', 'NYC', 'Boston', 'LA'],
    'purchases': [5, 12, 3, 18, 7, 22, 4, 15, 9, 11],
    'total_spent': [250.5, 680.2, 120.0, 950.8, 340.5, 1100.0, 180.3, 720.5, 450.0, 520.8]
})

print("Sample DataFrame:")
print(df)

### Eager Execution (default DataFrame)

In [None]:
# Each operation executes immediately
print("EAGER EXECUTION:")
print("Step 1: Filter...")
filtered = df.filter(pl.col('age') > 30)  # Executes NOW
print(f"Rows after filter: {len(filtered)}")

print("\nStep 2: Select columns...")
selected = filtered.select(['name', 'city', 'total_spent'])  # Executes NOW
print(f"Columns: {selected.columns}")

print("\nStep 3: Sort...")
result = selected.sort('total_spent', descending=True)  # Executes NOW
print(result)

### Lazy Execution (LazyFrame)

In [None]:
# Convert to LazyFrame
lazy_df = df.lazy()  # or pl.scan_csv(), pl.scan_parquet(), etc.

print("LAZY EXECUTION:")
print("Building query plan (nothing executes yet)...\n")

# Chain operations - nothing executes yet!
lazy_query = (
    lazy_df
    .filter(pl.col('age') > 30)
    .select(['name', 'city', 'total_spent'])
    .sort('total_spent', descending=True)
)

print(f"Type: {type(lazy_query)}")
print("Query has been planned but NOT executed yet!")
print("\nTo execute, call .collect():")

# Execute the optimized query
result = lazy_query.collect()
print(result)

## Part 2: Query Plans - Understanding What Polars Does

In [None]:
# View the naive (unoptimized) query plan
print("NAIVE QUERY PLAN (before optimization):")
print(lazy_query.explain(optimized=False))

In [None]:
# View the optimized query plan
print("OPTIMIZED QUERY PLAN:")
print(lazy_query.explain(optimized=True))

## Part 3: Automatic Optimizations

Polars automatically applies several optimizations:
1. **Projection Pushdown**: Select only needed columns as early as possible
2. **Predicate Pushdown**: Apply filters as early as possible
3. **Slice Pushdown**: Apply limits/slices early
4. **Common Subexpression Elimination**: Avoid redundant calculations
5. **Simplification**: Simplify complex expressions

### Optimization 1: Projection Pushdown

In [None]:
# Create larger dataset to see optimization impact
large_df = pl.DataFrame({
    'id': range(100000),
    'col1': np.random.randn(100000),
    'col2': np.random.randn(100000),
    'col3': np.random.randn(100000),
    'col4': np.random.randn(100000),
    'col5': np.random.randn(100000),
    'col6': np.random.randn(100000),
    'col7': np.random.randn(100000),
    'col8': np.random.randn(100000),
    'col9': np.random.randn(100000),
    'col10': np.random.randn(100000),
})

# Lazy query that only needs 2 columns
query = (
    large_df.lazy()
    .filter(pl.col('col1') > 0)
    .select(['id', 'col1'])  # Only need 2 columns
)

print("Query Plan - Notice the projection is pushed down to the data source (look for PROJECT ... COLUMNS):")
print(query.explain())
print("\nProjection pushdown: Polars reads only 'id' and 'col1', ignoring the other 9 columns!")

### Optimization 2: Predicate Pushdown

In [None]:
# Filter is applied as early as possible
query = (
    large_df.lazy()
    .select(['id', 'col1', 'col2'])
    .with_columns((pl.col('col1') * 2).alias('col1_doubled'))
    .filter(pl.col('id') < 1000)  # Filter at the end
)

print("Query Plan - Notice the filter moves down to the data source (plans read bottom-up):")
print(query.explain())
print("\nPredicate pushdown: Filter applied early, reducing data processed in later steps!")

### Optimization 3: Slice Pushdown

In [None]:
# With .head() or .limit(), Polars pushes the slice toward the source.
# After a sort it still has to examine every row, but it only needs to keep the top 10.
query = (
    large_df.lazy()
    .select(['id', 'col1'])
    .sort('col1')
    .head(10)  # Only need 10 rows
)

print("Query Plan with head(10):")
print(query.explain())
print("\nSlice pushdown: the slice is combined with the sort, so Polars can avoid materializing the full sorted result!")

### Optimization 4: Common Subexpression Elimination

In [None]:
# If you use the same expression multiple times, Polars calculates it once
query = (
    df.lazy()
    .with_columns([
        (pl.col('total_spent') / pl.col('purchases')).alias('avg_per_purchase'),
        ((pl.col('total_spent') / pl.col('purchases')) * 1.1).alias('avg_with_tax'),
    ])
)

print("Query with repeated expression (total_spent / purchases):")
print(query.explain())
print("\nPolars calculates 'total_spent / purchases' once and reuses it!")

## Part 4: Performance Comparison

In [None]:
# Create large dataset
n_rows = 1_000_000
large_data = pl.DataFrame({
    'id': range(n_rows),
    'category': np.random.choice(['A', 'B', 'C', 'D'], n_rows),
    'value1': np.random.randn(n_rows),
    'value2': np.random.randn(n_rows),
    'value3': np.random.randn(n_rows),
    'value4': np.random.randn(n_rows),
    'value5': np.random.randn(n_rows),
})

print(f"Dataset: {n_rows:,} rows, {len(large_data.columns)} columns")

In [None]:
# Complex query - Eager execution
start = time.time()
eager_result = (
    large_data
    .filter(pl.col('category').is_in(['A', 'B']))
    .filter(pl.col('value1') > 0)
    .select(['id', 'category', 'value1', 'value2'])
    .with_columns((pl.col('value1') * pl.col('value2')).alias('product'))
    .group_by('category')
    .agg([
        pl.col('product').mean().alias('avg_product'),
        pl.col('product').std().alias('std_product'),
    ])
)
eager_time = time.time() - start

print(f"Eager execution: {eager_time:.4f} seconds")
print(eager_result)

In [None]:
# Same query - Lazy execution
start = time.time()
lazy_result = (
    large_data.lazy()
    .filter(pl.col('category').is_in(['A', 'B']))
    .filter(pl.col('value1') > 0)
    .select(['id', 'category', 'value1', 'value2'])
    .with_columns((pl.col('value1') * pl.col('value2')).alias('product'))
    .group_by('category')
    .agg([
        pl.col('product').mean().alias('avg_product'),
        pl.col('product').std().alias('std_product'),
    ])
    .collect()
)
lazy_time = time.time() - start

print(f"Lazy execution: {lazy_time:.4f} seconds")
print(lazy_result)
print(f"\nEager/lazy time ratio: {eager_time/lazy_time:.2f}x (above 1 means lazy was faster)")

## Part 5: Streaming Mode (for datasets larger than RAM)

In [None]:
# Streaming processes data in chunks - can handle datasets larger than RAM
query = (
    large_data.lazy()
    .filter(pl.col('category') == 'A')
    .select(['value1', 'value2'])
    .group_by(pl.col('value1').cast(pl.Int32))
    .agg(pl.col('value2').mean())
)

print("Regular collect():")
start = time.time()
result1 = query.collect()
time1 = time.time() - start
print(f"Time: {time1:.4f}s")

print("\nStreaming collect():")
start = time.time()
# Note: recent Polars releases deprecate streaming=True in favor of collect(engine="streaming")
result2 = query.collect(streaming=True)
time2 = time.time() - start
print(f"Time: {time2:.4f}s")

print("\nStreaming uses less memory (processes in chunks)")

## Part 6: LazyFrame Methods

### collect() - Execute and return DataFrame

In [None]:
lazy_query = df.lazy().filter(pl.col('age') > 30)

# Execute and get DataFrame
result = lazy_query.collect()
print(f"Type after collect(): {type(result)}")
print(result)

### fetch() - Execute on first N rows (for testing)

In [None]:
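# Test query on a small sample before running on the full dataset.
# fetch(5) limits the rows read at the source, so the result may have fewer than 5 rows.
# Newer Polars releases deprecate fetch(); lazy_query.head(5).collect() is the alternative.
sample_result = lazy_query.fetch(5)
print("fetch(5) - Quick test on first 5 rows:")
print(sample_result)
print("\nUse fetch() to debug/test queries on large datasets!")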

### explain() - Inspect naive and optimized plans

In [None]:
# describe_plan() and describe_optimized_plan() are no longer available in current Polars;
# explain(optimized=False) and explain() replace them.
query = (
    df.lazy()
    .filter(pl.col('age') > 30)
    .select(['name', 'total_spent'])
    .sort('total_spent', descending=True)
)

print("Naive plan:")
print(query.explain(optimized=False))

print("\n" + "="*50 + "\n")

print("Optimized plan:")
print(query.explain())

### sink_parquet() and sink_csv() - Write without loading into memory

In [None]:
# Process and write directly to file without loading full result in memory
query = (
    large_data.lazy()
    .filter(pl.col('category') == 'A')
    .select(['id', 'value1', 'value2'])
)

# Write directly to parquet (efficient for large results)
query.sink_parquet('/tmp/output.parquet')
print("Written to /tmp/output.parquet without loading in memory!")

# Verify
result = pl.read_parquet('/tmp/output.parquet')
print(f"\nRows written: {len(result)}")
print(result.head())

## Part 7: Scan Functions (Lazy from Start)

In [None]:
# Save some data first
df.write_csv('/tmp/customers.csv')
df.write_parquet('/tmp/customers.parquet')

print("Files created for scanning demo")

### scan_csv() vs read_csv()

In [None]:
# read_csv - Eager (loads entire file)
eager_df = pl.read_csv('/tmp/customers.csv')
print(f"read_csv type: {type(eager_df)}")

# scan_csv - Lazy (doesn't load until .collect())
lazy_df = pl.scan_csv('/tmp/customers.csv')
print(f"scan_csv type: {type(lazy_df)}")

# Advantage: Can filter/select before loading
result = (
    pl.scan_csv('/tmp/customers.csv')
    .select(['name', 'city'])  # Only read these columns!
    .filter(pl.col('city') == 'NYC')  # Only load filtered rows!
    .collect()
)
print("\nFiltered result (only read needed columns/rows):")
print(result)

### scan_parquet() - Even Better Performance

In [None]:
# Parquet is columnar - can skip columns at file level
result = (
    pl.scan_parquet('/tmp/customers.parquet')
    .select(['name', 'total_spent'])
    .filter(pl.col('total_spent') > 500)
    .collect()
)

print("Parquet scan with projection pushdown:")
print(result)
print("\nParquet format allows reading only needed columns from disk!")

## Part 8: Advanced Lazy Patterns

### Pattern 1: Lazy Join Optimization

In [None]:
customers_large = pl.DataFrame({
    'customer_id': range(10000),
    'name': [f'Customer_{i}' for i in range(10000)],
    'segment': np.random.choice(['A', 'B', 'C'], 10000)
})

orders_large = pl.DataFrame({
    'order_id': range(50000),
    'customer_id': np.random.randint(0, 10000, 50000),
    'amount': np.random.uniform(10, 1000, 50000)
})

# Lazy join with filters - optimized automatically
query = (
    customers_large.lazy()
    .join(orders_large.lazy(), on='customer_id')
    .filter(pl.col('segment') == 'A')  # Filter pushed before join!
    .select(['customer_id', 'name', 'amount'])
    .group_by('customer_id')
    .agg(pl.col('amount').sum().alias('total_spent'))
)

print("Optimized join query plan:")
print(query.explain())
print("\nNotice: Filter on 'segment' happens BEFORE join (reduces join size)")

### Pattern 2: Lazy Union/Concat

In [None]:
# Combine multiple datasets lazily
df1 = pl.DataFrame({'id': [1, 2], 'value': [10, 20]})
df2 = pl.DataFrame({'id': [3, 4], 'value': [30, 40]})
df3 = pl.DataFrame({'id': [5, 6], 'value': [50, 60]})

# Lazy union
result = (
    pl.concat([df1.lazy(), df2.lazy(), df3.lazy()])
    .filter(pl.col('value') > 25)
    .collect()
)

print("Lazy union with filter:")
print(result)
print("\nFilter applied to each DataFrame before combining!")

### Pattern 3: Multiple Queries from Same LazyFrame

In [None]:
# Create base lazy query
base_query = (
    large_data.lazy()
    .filter(pl.col('category').is_in(['A', 'B']))
    .select(['category', 'value1', 'value2'])
)

# Branch 1: Summary stats
summary = (
    base_query
    .group_by('category')
    .agg([
        pl.col('value1').mean().alias('avg_value1'),
        pl.col('value2').mean().alias('avg_value2'),
    ])
    .collect()
)

# Branch 2: Top records
top_records = (
    base_query
    .sort('value1', descending=True)
    .head(10)
    .collect()
)

print("Summary:")
print(summary)
print("\nTop 10:")
print(top_records)
print("\nEach collect() optimizes and runs its own plan, so the shared base_query is executed once per branch!")

## Part 9: When to Use Eager vs Lazy

### Use EAGER when:
- Exploring data interactively
- Working with small datasets (roughly < 100MB, as a rule of thumb)
- Debugging and need immediate feedback
- Simple, one-off operations

### Use LAZY when:
- Complex queries with multiple steps
- Large datasets (roughly > 100MB, as a rule of thumb)
- Production pipelines
- Reading from files (use scan_*)
- Need maximum performance
- Working with data larger than RAM (use streaming)

## Part 10: Common Gotchas and Tips

### Gotcha 1: Lazy operations don't execute

In [None]:
# This does NOTHING (forgot .collect())
result = df.lazy().filter(pl.col('age') > 30)  # No execution!
print(f"Type: {type(result)}")
print("This is still a LazyFrame, not executed!")

# Remember to collect()
result = df.lazy().filter(pl.col('age') > 30).collect()
print(f"\nAfter collect() - Type: {type(result)}")
print(result)

### Tip 1: Use fetch() for query development

In [None]:
# Develop query on small sample first
query = (
    large_data.lazy()
    .filter(pl.col('category') == 'A')
    .select(['value1', 'value2'])
    .with_columns((pl.col('value1') * pl.col('value2')).alias('product'))
)

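# Test on 100 rows (fetch() limits rows read at the source, so fewer may come back;
# newer Polars releases deprecate it in favor of query.head(100).collect())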
print("Testing query with fetch(100):")
test_result = query.fetch(100)
print(test_result.head())

# Once confirmed working, run on full dataset
# full_result = query.collect()

### Tip 2: Profile your queries

In [None]:
# Use explain() to understand optimizations
query = (
    df.lazy()
    .filter(pl.col('purchases') > 10)
    .select(['name', 'city', 'total_spent'])
)

# Check if query is optimized as expected
print(query.explain())
print("\nLook for: projection pushdown, predicate pushdown, etc.")

### Tip 3: Chain operations for better optimization

In [None]:
# GOOD: Chain everything
good_query = (
    df.lazy()
    .filter(pl.col('age') > 30)
    .select(['name', 'total_spent'])
    .sort('total_spent')
    .collect()
)

# LESS GOOD: Multiple collects
temp1 = df.lazy().filter(pl.col('age') > 30).collect()  # Collect #1
temp2 = temp1.select(['name', 'total_spent'])  # Now eager
bad_query = temp2.sort('total_spent')  # Each step executes separately

print("Chain operations for maximum optimization!")

## Summary

### Key Takeaways:
1. **Lazy evaluation** delays execution until `.collect()` is called
2. **Optimizations** happen automatically (projection/predicate/slice pushdown)
3. **Query plans** show how Polars will execute your query
4. **Streaming mode** processes data in chunks for huge datasets
5. **scan_* functions** enable lazy reading from files
6. Use **fetch()** (or `.head(n).collect()` on newer Polars releases) to test queries on large datasets
7. Use **explain()** to understand optimizations

### Best Practices:
- Start with `.lazy()` for complex pipelines
- Use `scan_csv/parquet` instead of `read_csv/parquet` when possible
- Chain operations instead of multiple `.collect()` calls
- Use `.explain()` to verify optimizations
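- Use `.fetch()` (or `.head(n).collect()` on newer Polars releases) for testing on large datasets
- Use streaming (`streaming=True`, or `engine="streaming"` on newer Polars releases) for data larger than RAM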