# Polars Window Functions - Comprehensive Guide

Window functions perform calculations across rows related to the current row.

## Topics Covered:
- What are window functions?
- over() clause syntax
- Ranking functions (rank, dense_rank, row_number)
- Lag and Lead
- Cumulative operations
- Rolling windows
- Partition by multiple columns
- Practical examples

In [None]:
import polars as pl
import numpy as np

## Part 1: What are Window Functions?

Window functions compute values **over** a "window" of rows and attach the result to every row, instead of collapsing each group to a single row the way `group_by` does.

**Key difference from group_by:**
- `group_by`: Returns 1 row per group
- `over`: Returns same number of rows as input

In [None]:
# Sample sales data
df = pl.DataFrame({
    'date': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05', '2023-01-06'],
    'product': ['A', 'A', 'A', 'B', 'B', 'B'],
    'sales': [100, 150, 120, 200, 180, 220],
    'region': ['North', 'North', 'South', 'North', 'South', 'North']
})

print("Sample Data:")
print(df)

In [None]:
# group_by: Returns 2 rows (one per product)
grouped = df.group_by('product').agg(pl.col('sales').mean().alias('avg_sales'))
print("group_by (2 rows):")
print(grouped)

# over: Returns 6 rows (same as input)
windowed = df.with_columns(
    pl.col('sales').mean().over('product').alias('avg_sales')
)
print("\nover (6 rows):")
print(windowed)

## Part 2: Basic Window Functions with over()

### Statistical aggregations over partitions

In [None]:
# Calculate stats per product
result = df.with_columns([
    pl.col('sales').mean().over('product').alias('product_avg'),
    pl.col('sales').sum().over('product').alias('product_total'),
    pl.col('sales').min().over('product').alias('product_min'),
    pl.col('sales').max().over('product').alias('product_max'),
    pl.len().over('product').alias('product_count')
])

print("Stats per product (repeated for each row):")
print(result)

### Difference from group mean

In [None]:
# Calculate how much each sale deviates from product average
result = df.with_columns([
    pl.col('sales').mean().over('product').alias('product_avg'),
    (pl.col('sales') - pl.col('sales').mean().over('product')).alias('deviation_from_avg'),
    ((pl.col('sales') - pl.col('sales').mean().over('product')) / pl.col('sales').mean().over('product') * 100)
      .alias('pct_deviation')
])

print("Deviation from product average:")
print(result)

### Percentage of group total

In [None]:
# What % of product's total sales does each transaction represent?
result = df.with_columns([
    pl.col('sales').sum().over('product').alias('product_total'),
    (pl.col('sales') / pl.col('sales').sum().over('product') * 100).alias('pct_of_product_total')
])

print("Percentage of product total:")
print(result)

## Part 3: Ranking Functions

In [None]:
# Sample data with ties
rank_df = pl.DataFrame({
    'student': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank'],
    'class': ['A', 'A', 'A', 'B', 'B', 'B'],
    'score': [95, 87, 87, 92, 88, 92]  # Note: ties at 87 and 92
})

print("Student scores with ties:")
print(rank_df)

### rank('min') - Standard ranking (1, 2, 2, 4)

In [None]:
# rank('min'): Ties share the lowest rank, and the following rank is skipped
# Note: Polars' default method is 'average', which would give the tied 87s rank 2.5
result = rank_df.with_columns([
    pl.col('score').rank('min', descending=True).over('class').alias('rank')
]).sort(['class', 'rank'])

print("rank('min') - Standard ranking:")
print(result)
print("\nNote: In class B, both 92s get rank 1, so 88 gets rank 3 (rank 2 is skipped)")

### rank('dense') - Dense ranking (1, 2, 2, 3)

In [None]:
# rank('dense'): Ties get same rank, next rank is NOT skipped
result = rank_df.with_columns([
    pl.col('score').rank('dense', descending=True).over('class').alias('dense_rank')
]).sort(['class', 'dense_rank'])

print("rank('dense') - Dense ranking:")
print(result)
print("\nNote: In class B, both 92s get rank 1 and 88 gets rank 2 (no gap)")

### rank('ordinal') - Sequential numbers (1, 2, 3, 4)

In [None]:
# rank('ordinal'): Every row gets a distinct number, even when values tie
result = rank_df.with_columns([
    pl.col('score').rank('ordinal', descending=True).over('class').alias('row_number')
]).sort(['class', 'row_number'])

print("rank('ordinal') - Sequential:")
print(result)
print("\nNote: Tied values get different numbers")

### Comparison of all three

In [None]:
# All three together ('min' is passed explicitly because the default is 'average')
result = rank_df.with_columns([
    pl.col('score').rank('min', descending=True).over('class').alias('rank'),
    pl.col('score').rank('dense', descending=True).over('class').alias('dense_rank'),
    pl.col('score').rank('ordinal', descending=True).over('class').alias('row_number')
]).sort(['class', 'score'], descending=[False, True])

print("Comparison of ranking methods:")
print(result)

## Part 4: Lag and Lead (Compare with Previous/Next Row)

In [None]:
# Time series data
ts_df = pl.DataFrame({
    'date': pl.date_range(pl.date(2023, 1, 1), pl.date(2023, 1, 10), '1d', eager=True),
    'product': ['A'] * 5 + ['B'] * 5,
    'sales': [100, 110, 105, 115, 120, 200, 210, 205, 220, 215]
})

print("Time series data:")
print(ts_df)

### shift() - Lag (previous value)

In [None]:
# Get previous day's sales
result = ts_df.with_columns([
    pl.col('sales').shift(1).over('product').alias('prev_day_sales'),
    (pl.col('sales') - pl.col('sales').shift(1).over('product')).alias('daily_change')
])

print("Lag - Compare with previous day:")
print(result)

### shift(-n) - Lead (next value)

In [None]:
# Get next day's sales
next_day = pl.col('sales').shift(-1).over('product')
result = ts_df.with_columns([
    next_day.alias('next_day_sales'),
    # The last row of each product has no next day; label it explicitly
    # so it does not fall through to 'Stable'
    pl.when(next_day.is_null())
      .then(pl.lit('No next day'))
      .when(next_day > pl.col('sales'))
      .then(pl.lit('Increasing'))
      .when(next_day < pl.col('sales'))
      .then(pl.lit('Decreasing'))
      .otherwise(pl.lit('Stable'))
      .alias('trend')
])

print("Lead - Compare with next day:")
print(result)

### Multiple lags

In [None]:
# Compare with multiple previous days
result = ts_df.with_columns([
    pl.col('sales').shift(1).over('product').alias('lag_1'),
    pl.col('sales').shift(2).over('product').alias('lag_2'),
    pl.col('sales').shift(3).over('product').alias('lag_3')
])

print("Multiple lags:")
print(result)

## Part 5: Cumulative Operations

### Cumulative sum

In [None]:
# Running total per product
result = ts_df.with_columns([
    pl.col('sales').cum_sum().over('product').alias('cumulative_sales'),
    (pl.col('sales').cum_sum().over('product') / pl.col('sales').sum().over('product') * 100)
      .alias('pct_of_total')
])

print("Cumulative sum:")
print(result)

### Other cumulative operations

In [None]:
# Multiple cumulative operations
result = ts_df.with_columns([
    pl.col('sales').cum_sum().over('product').alias('cum_sum'),
    pl.col('sales').cum_min().over('product').alias('cum_min'),
    pl.col('sales').cum_max().over('product').alias('cum_max'),
    pl.col('sales').cum_count().over('product').alias('cum_count')
])

print("Multiple cumulative operations:")
print(result)

## Part 6: Rolling Windows

### Rolling mean (moving average)

In [None]:
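# 3-day and 5-day moving averages
# The first (window_size - 1) rows of each product are null because the window is not yet full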
result = ts_df.with_columns([
    pl.col('sales').rolling_mean(window_size=3).over('product').alias('ma_3day'),
    pl.col('sales').rolling_mean(window_size=5).over('product').alias('ma_5day')
])

print("Rolling mean (moving average):")
print(result)

### Other rolling operations

In [None]:
# Multiple rolling operations with 3-day window
result = ts_df.with_columns([
    pl.col('sales').rolling_mean(window_size=3).over('product').alias('rolling_mean'),
    pl.col('sales').rolling_sum(window_size=3).over('product').alias('rolling_sum'),
    pl.col('sales').rolling_min(window_size=3).over('product').alias('rolling_min'),
    pl.col('sales').rolling_max(window_size=3).over('product').alias('rolling_max'),
    pl.col('sales').rolling_std(window_size=3).over('product').alias('rolling_std')
])

print("Multiple rolling operations (3-day window):")
print(result)

## Part 7: Partitioning by Multiple Columns

In [None]:
# Create data with multiple grouping columns
multi_df = pl.DataFrame({
    'region': ['North', 'North', 'North', 'South', 'South', 'South'],
    'product': ['A', 'A', 'B', 'A', 'A', 'B'],
    'month': [1, 2, 1, 1, 2, 1],
    'sales': [100, 110, 150, 120, 130, 140]
})

print("Multi-level data:")
print(multi_df)

In [None]:
# Partition by multiple columns
result = multi_df.with_columns([
    pl.col('sales').mean().over(['region', 'product']).alias('region_product_avg'),
    pl.col('sales').mean().over('region').alias('region_avg'),
    pl.col('sales').mean().over('product').alias('product_avg'),
    pl.col('sales').mean().alias('overall_avg')
])

print("Multiple partition levels:")
print(result)

## Part 8: Practical Real-World Examples

In [None]:
# Create realistic e-commerce data
np.random.seed(42)
dates = pl.date_range(pl.date(2023, 1, 1), pl.date(2023, 3, 31), '1d', eager=True)

sales_data = pl.DataFrame({
    'date': dates,
    'product': np.random.choice(['Laptop', 'Mouse', 'Keyboard'], len(dates)),
    'region': np.random.choice(['North', 'South', 'East', 'West'], len(dates)),
    'revenue': np.random.uniform(1000, 5000, len(dates))
}).sort('date')

print(f"E-commerce data: {len(sales_data)} days")
print(sales_data.head(10))

### Example 1: Sales trend analysis

In [None]:
# Analyze trends with moving averages and growth rates
trend_analysis = sales_data.with_columns([
    # 7-day moving average
    pl.col('revenue').rolling_mean(window_size=7).over('product').alias('ma_7day'),
    
    # Day-over-day change
    (pl.col('revenue') - pl.col('revenue').shift(1).over('product')).alias('daily_change'),
    
    # Day-over-day % change
    ((pl.col('revenue') - pl.col('revenue').shift(1).over('product')) / 
     pl.col('revenue').shift(1).over('product') * 100).alias('daily_pct_change'),
    
    # Running total
    pl.col('revenue').cum_sum().over('product').alias('ytd_revenue')
])

print("Sales trend analysis:")
print(trend_analysis.filter(pl.col('product') == 'Laptop').head(15))

### Example 2: Product performance ranking

In [None]:
# Rank products by revenue within each region
product_ranking = (
    sales_data
    .group_by(['region', 'product'])
    .agg(pl.col('revenue').sum().alias('total_revenue'))
    .with_columns([
        # 'min' gives integer ranks (the default 'average' returns floats)
        pl.col('total_revenue').rank('min', descending=True).over('region').alias('rank_in_region'),
        (pl.col('total_revenue') / pl.col('total_revenue').sum().over('region') * 100)
          .alias('pct_of_region')
    ])
    .sort(['region', 'rank_in_region'])
)

print("Product ranking by region:")
print(product_ranking)

### Example 3: Quartile analysis

In [None]:
# Classify each day's performance into quartiles
quartile_analysis = sales_data.with_columns([
    pl.col('revenue').quantile(0.25).over('product').alias('q1'),
    pl.col('revenue').quantile(0.50).over('product').alias('q2_median'),
    pl.col('revenue').quantile(0.75).over('product').alias('q3'),
]).with_columns([
    pl.when(pl.col('revenue') <= pl.col('q1'))
      .then(pl.lit('Q1 (Bottom 25%)'))
      .when(pl.col('revenue') <= pl.col('q2_median'))
      .then(pl.lit('Q2'))
      .when(pl.col('revenue') <= pl.col('q3'))
      .then(pl.lit('Q3'))
      .otherwise(pl.lit('Q4 (Top 25%)'))
      .alias('quartile')
])

print("Quartile analysis:")
print(quartile_analysis.head(20))

### Example 4: Top N within each group

In [None]:
# Find top 3 revenue days for each product
top_days = (
    sales_data
    .with_columns([
        # 'ordinal' breaks ties, so each product yields at most 3 rows
        pl.col('revenue').rank('ordinal', descending=True).over('product').alias('rank')
    ])
    .filter(pl.col('rank') <= 3)
    .sort(['product', 'rank'])
)

print("Top 3 revenue days per product:")
print(top_days)

## Part 9: Complex Window Patterns

### Pattern 1: Z-score (standardization within group)

In [None]:
# Calculate z-score per product
zscore = sales_data.with_columns([
    ((pl.col('revenue') - pl.col('revenue').mean().over('product')) / 
     pl.col('revenue').std().over('product')).alias('z_score')
]).with_columns([
    pl.when(pl.col('z_score').abs() > 2)
      .then(pl.lit('Outlier'))
      .otherwise(pl.lit('Normal'))
      .alias('outlier_status')
])

print("Z-score analysis (outlier detection):")
print(zscore.filter(pl.col('outlier_status') == 'Outlier').head(10))

### Pattern 2: First and last comparison

In [None]:
# Compare current value with the first and last in each group
# (first/last follow row order; sales_data is sorted by date)
comparison = sales_data.with_columns([
    pl.col('revenue').first().over('product').alias('first_revenue'),
    pl.col('revenue').last().over('product').alias('last_revenue'),
]).with_columns([
    (pl.col('revenue') - pl.col('first_revenue')).alias('change_from_first'),
    ((pl.col('revenue') - pl.col('first_revenue')) / pl.col('first_revenue') * 100)
      .alias('pct_change_from_first'),
    (pl.col('revenue') - pl.col('last_revenue')).alias('change_from_last')
])

print("Comparison with first and last values:")
print(comparison.filter(pl.col('product') == 'Laptop').head(15))

## Part 10: Performance Tips

In [None]:
import time

# Create large dataset
large_df = pl.DataFrame({
    'group': np.random.choice(['A', 'B', 'C', 'D', 'E'], 100000),
    'value': np.random.randn(100000)
})

print(f"Large dataset: {len(large_df):,} rows")

In [None]:
# Combine window functions when possible
start = time.perf_counter()  # perf_counter is better suited to timing than time.time
result = large_df.with_columns([
    pl.col('value').mean().over('group').alias('mean'),
    pl.col('value').std().over('group').alias('std'),
    pl.col('value').min().over('group').alias('min')
])
time1 = time.perf_counter() - start

print(f"Combined window functions: {time1:.4f}s")
print("Tip: Put window operations in a single with_columns so Polars can run them in parallel!")

## Summary

### Key Concepts:
1. **over()** applies operations across a window without collapsing rows
2. **Ranking**: `rank()` with method 'min', 'dense', or 'ordinal' (the equivalents of SQL's rank, dense_rank, row_number) for ordering within groups
3. **shift()**: Access previous (lag) or next (lead) values
4. **Cumulative**: cum_sum, cum_min, cum_max, cum_count for running aggregates
5. **Rolling**: rolling_mean, rolling_sum for moving windows
6. **Partition**: Use multiple columns to define windows

### When to Use:
- **Ranking**: Leaderboards, top N per group
- **Lag/Lead**: Time series comparisons, trends
- **Cumulative**: Running totals, YTD calculations
- **Rolling**: Moving averages, smoothing
- **Stats over window**: Deviations, z-scores, percentages

### vs group_by:
- Use **group_by** when you want aggregated results (fewer rows)
- Use **over** when you want to keep all rows and add aggregated values