# Homework 3: Lazy Evaluation and Query Optimization

In this homework, you'll practice working with LazyFrames, query optimization, and performance benchmarking.

## Dataset: Financial Transactions

You'll work with `financial_transactions.csv`, a dataset of 100,000 financial transactions.

## Instructions

- Use lazy evaluation (`pl.scan_csv()`) throughout unless an exercise specifies otherwise
- Call `explain()` on each LazyFrame to inspect its query plan and verify the optimizations
- Points are indicated for each exercise

In [None]:
import polars as pl
import time

## Exercise 1: Lazy Loading (10 points)

1. Create a LazyFrame by scanning `financial_transactions.csv`
2. Display the schema without collecting the data
3. Show the optimized query plan for selecting all columns

In [None]:
# 1.1 Create LazyFrame
lf = ...

In [None]:
# 1.2 Display schema


In [None]:
# 1.3 Show query plan


## Exercise 2: Predicate Pushdown (15 points)

Build a lazy query that:
1. Filters for transactions in the "Shopping" category
2. Filters for amounts > $500
3. Selects only `transaction_id`, `amount`, and `merchant_city`

Then:
- Show the optimized plan and verify the filters are pushed down
- Collect the results and show the count

In [None]:
# Build the query
shopping_query = ...

In [None]:
# Show optimized plan


In [None]:
# Collect and show results
shopping_result = ...
print(f"Matching transactions: {shopping_result.height}")

## Exercise 3: Projection Pushdown (15 points)

Write a query that only reads the columns needed to answer: "What is the total amount spent per payment method?"

1. Show the query plan to verify only required columns are scanned
2. Execute and display results sorted by total descending

In [None]:
# Build query
payment_query = ...

In [None]:
# Show plan


In [None]:
# Execute and display
payment_result = ...

## Exercise 4: Performance Benchmarking (20 points)

Compare the performance of eager vs lazy execution for the following analysis:
- Filter for non-recurring transactions
- Group by category and calculate: count, sum, and average amount
- Sort by sum descending

1. Implement the eager version
2. Implement the lazy version
3. Time both versions (3 runs each) and compare the results
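One possible shape for the timing step is a small helper that calls a function several times and records each duration. This is only a sketch: `time_runs` and `placeholder_workload` are hypothetical names, and the placeholder stands in for your own `eager_analysis` and `lazy_analysis` functions.

```python
import statistics
import time


def time_runs(fn, runs=3):
    """Call fn() `runs` times and return the elapsed seconds of each call."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def placeholder_workload():
    # Stand-in for eager_analysis / lazy_analysis (hypothetical).
    return sum(i * i for i in range(100_000))


timings = time_runs(placeholder_workload, runs=3)
print(f"best: {min(timings):.4f}s, mean: {statistics.mean(timings):.4f}s")
```

Reporting both the best and the mean time helps separate one-off effects, such as a cold file cache on the first run, from the typical cost.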

In [None]:
def eager_analysis():
    """Eager implementation."""
    # Your code here
    pass

def lazy_analysis():
    """Lazy implementation."""
    # Your code here
    pass

In [None]:
# Run benchmarks
# Your timing code here

## Exercise 5: Query Optimization Challenge (20 points)

The following query is intentionally inefficient. Rewrite it using lazy evaluation and best practices to make it faster:

```python
# Slow query - DO NOT RUN (just rewrite)
df = pl.read_csv("financial_transactions.csv")
df = df.with_columns(pl.col("amount").abs().alias("abs_amount"))
df = df.filter(pl.col("category") == "Entertainment")
df = df.with_columns(pl.col("amount").mean().alias("overall_avg"))
df = df.filter(pl.col("abs_amount") > 100)
result = df.select("transaction_id", "amount", "merchant_city")
result = result.group_by("merchant_city").agg(pl.len())
```

1. Rewrite the query using lazy evaluation
2. Show the optimized query plan
3. Explain which optimizations Polars applies and why they make the query faster

In [None]:
# Optimized query
optimized_query = ...

In [None]:
# Show plan


**Explain the optimizations:**

(Write your explanation here)

## Exercise 6: Window Functions (10 points)

Using lazy evaluation, create a query that:
1. Calculates the average amount per category (using `over()`)
2. Calculates each transaction's deviation from its category average
3. Ranks transactions within each category by amount
4. Collects and shows the top 10 transactions by absolute deviation

In [None]:
# Window functions query
window_query = ...

## Exercise 7: Streaming (10 points)

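1. Write a query that calculates monthly statistics (assuming the dataset has a timestamp column)
2. Execute it with streaming enabled (depending on your Polars version, `collect(engine="streaming")` in recent releases or `collect(streaming=True)` in older ones)
3. Verify the results match non-streaming execution (sort both results the same way first, since group order is not guaranteed)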

In [None]:
# Build monthly statistics query
monthly_query = ...

In [None]:
# Execute with streaming
streaming_result = ...

In [None]:
# Compare with non-streaming
regular_result = ...
print(f"Results match: {streaming_result.equals(regular_result)}")

## Bonus: Complex Analysis Pipeline (15 bonus points)

Build a comprehensive analysis pipeline that answers:

"For each city, what is the most popular spending category, and how does the average transaction size compare to the overall average?"

Requirements:
- Use lazy evaluation throughout
- Use window functions for the overall average comparison
- Show the optimized query plan
- Results should include: city, top category, transaction count for that category, average transaction amount, and difference from the overall average

In [None]:
# Bonus: Complex analysis pipeline
city_analysis = ...