# Polars GroupBy & Aggregations - Comprehensive Guide

Master grouping and aggregation operations in Polars.

## Topics Covered:
- Basic group_by and aggregations
- Multiple aggregations per group
- Multiple grouping columns
- Advanced aggregation functions
- Conditional aggregations
- Rolling and dynamic group_by
- Performance optimization

In [None]:
import polars as pl
import numpy as np
from datetime import datetime, timedelta

## Part 1: Basic GroupBy

In [None]:
# Sample sales data
df = pl.DataFrame({
    'date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02', '2023-01-03', '2023-01-03'],
    'product': ['A', 'B', 'A', 'B', 'A', 'C'],
    'region': ['North', 'North', 'South', 'South', 'North', 'West'],
    'sales': [100, 150, 200, 175, 120, 90],
    'quantity': [5, 8, 10, 9, 6, 4],
    'cost': [60, 90, 120, 105, 72, 54]
})

print("Sample Data:")
print(df)

### Simple group_by with single aggregation

In [None]:
# Group by product, sum sales
result = df.group_by('product').agg(
    pl.col('sales').sum()
)

print("Total sales by product:")
print(result)

### Multiple aggregations

In [None]:
# Multiple aggregations per group
result = df.group_by('product').agg([
    pl.col('sales').sum().alias('total_sales'),
    pl.col('sales').mean().alias('avg_sales'),
    pl.col('quantity').sum().alias('total_quantity'),
    pl.len().alias('num_transactions')
])

print("Multiple aggregations:")
print(result)

### Group by multiple columns

In [None]:
# Group by product AND region
result = df.group_by(['product', 'region']).agg([
    pl.col('sales').sum().alias('total_sales'),
    pl.col('quantity').sum().alias('total_quantity')
]).sort(['product', 'region'])

print("Group by product and region:")
print(result)

## Part 2: Common Aggregation Functions

In [None]:
# All common aggregations
result = df.group_by('product').agg([
    pl.col('sales').sum().alias('sum'),
    pl.col('sales').mean().alias('mean'),
    pl.col('sales').median().alias('median'),
    pl.col('sales').min().alias('min'),
    pl.col('sales').max().alias('max'),
    pl.col('sales').std().alias('std'),
    pl.col('sales').var().alias('variance'),
    pl.len().alias('count')
])

print("Common aggregations:")
print(result)

### Unique values and counts

In [None]:
# Unique values and counts
result = df.group_by('region').agg([
    pl.col('product').n_unique().alias('unique_products'),
    pl.col('product').unique().alias('product_list'),
    pl.len().alias('num_transactions')
])

print("Unique values:")
print(result)

### First and last values

In [None]:
# Get first and last values per group (in the row order of the input frame)
result = df.group_by('product').agg([
    pl.col('sales').first().alias('first_sale'),
    pl.col('sales').last().alias('last_sale'),
    pl.col('date').first().alias('first_date'),
    pl.col('date').last().alias('last_date')
]).sort('product')

print("First and last values:")
print(result)

## Part 3: Advanced Aggregations

### Conditional aggregations

In [None]:
# Aggregate with conditions
result = df.group_by('region').agg([
    pl.col('sales').sum().alias('total_sales'),
    # Count high-value sales (> 150)
    pl.col('sales').filter(pl.col('sales') > 150).len().alias('high_value_count'),
    # Sum only high-value sales
    pl.col('sales').filter(pl.col('sales') > 150).sum().alias('high_value_sum'),
    # Average of low-value sales (<= 150)
    pl.col('sales').filter(pl.col('sales') <= 150).mean().alias('low_value_avg')
])

print("Conditional aggregations:")
print(result)

### Aggregating expressions (computed columns)

In [None]:
# Aggregate computed values
result = df.group_by('product').agg([
    # Profit = sales - cost
    (pl.col('sales') - pl.col('cost')).sum().alias('total_profit'),
    # Average price per unit = sales / quantity
    (pl.col('sales') / pl.col('quantity')).mean().alias('avg_price_per_unit'),
    # Profit margin = (sales - cost) / sales
    ((pl.col('sales') - pl.col('cost')) / pl.col('sales') * 100).mean().alias('avg_profit_margin_pct')
])

print("Aggregating expressions:")
print(result)

### Quantiles and percentiles

In [None]:
# Calculate percentiles
result = df.group_by('region').agg([
    pl.col('sales').quantile(0.25).alias('p25'),
    pl.col('sales').quantile(0.50).alias('p50_median'),
    pl.col('sales').quantile(0.75).alias('p75'),
    pl.col('sales').quantile(0.90).alias('p90')
])

print("Percentiles:")
print(result)

### List aggregation (collect values into lists)

In [None]:
# Collect values into lists
result = df.group_by('region').agg([
    pl.col('product').alias('all_products'),  # Creates list of all products
    pl.col('sales').alias('all_sales'),       # Creates list of all sales
    pl.col('sales').sum().alias('total_sales')
])

print("List aggregation:")
print(result)

## Part 4: Maintaining Row Order with maintain_order

In [None]:
# Without maintain_order (group order is not guaranteed and can differ between runs)
result1 = df.group_by('product').agg(pl.col('sales').sum())
print("Without maintain_order:")
print(result1)

# With maintain_order (preserves first occurrence order)
result2 = df.group_by('product', maintain_order=True).agg(pl.col('sales').sum())
print("\nWith maintain_order:")
print(result2)

## Part 5: Multiple Aggregations on Same Column

In [None]:
# Get full statistics for sales column
result = df.group_by('product').agg([
    pl.col('sales').min().alias('min_sales'),
    pl.col('sales').quantile(0.25).alias('q1_sales'),
    pl.col('sales').median().alias('median_sales'),
    pl.col('sales').quantile(0.75).alias('q3_sales'),
    pl.col('sales').max().alias('max_sales'),
    pl.col('sales').mean().alias('mean_sales'),
    pl.col('sales').std().alias('std_sales'),
    pl.len().alias('count')
])

print("Full statistics:")
print(result)

## Part 6: Complex Real-World Examples

In [None]:
# Create more realistic sales data
np.random.seed(42)
dates = pl.date_range(pl.date(2023, 1, 1), pl.date(2023, 3, 31), '1d', eager=True)

sales_data = pl.DataFrame({
    'date': np.repeat(dates, 3),
    'product': np.tile(['Laptop', 'Mouse', 'Keyboard'], len(dates)),
    'region': np.random.choice(['North', 'South', 'East', 'West'], len(dates) * 3),
    'sales_amount': np.random.uniform(100, 2000, len(dates) * 3),
    'quantity': np.random.randint(1, 20, len(dates) * 3),
}).with_columns([
    pl.col('date').dt.month().alias('month'),
    pl.col('date').dt.weekday().alias('weekday')
])

print(f"Sales data: {len(sales_data)} rows")
print(sales_data.head(10))

### Example 1: Monthly product performance

In [None]:
monthly_performance = sales_data.group_by(['month', 'product']).agg([
    pl.col('sales_amount').sum().alias('total_revenue'),
    pl.col('quantity').sum().alias('units_sold'),
    (pl.col('sales_amount').sum() / pl.col('quantity').sum()).alias('avg_price_per_unit'),
    pl.col('sales_amount').mean().alias('avg_transaction'),
    pl.len().alias('num_transactions'),
    pl.col('region').n_unique().alias('regions_covered')
]).sort(['month', 'product'])

print("Monthly product performance:")
print(monthly_performance.head(10))

### Example 2: Regional analysis with rankings

In [None]:
regional_analysis = (
    sales_data
    .group_by('region')
    .agg([
        pl.col('sales_amount').sum().alias('total_revenue'),
        pl.col('sales_amount').mean().alias('avg_transaction'),
        pl.len().alias('num_transactions'),
        pl.col('product').n_unique().alias('unique_products')
    ])
    .with_columns([
        pl.col('total_revenue').rank(descending=True).alias('revenue_rank'),
        (pl.col('total_revenue') / pl.col('total_revenue').sum() * 100).alias('revenue_share_pct')
    ])
    .sort('revenue_rank')
)

print("Regional analysis with rankings:")
print(regional_analysis)

### Example 3: Weekday vs Weekend analysis

In [None]:
# Add weekend flag (dt.weekday() uses ISO numbering: Monday = 1 ... Sunday = 7)
weekday_analysis = (
    sales_data
    .with_columns([
        pl.when(pl.col('weekday').is_in([6, 7]))
          .then(pl.lit('Weekend'))
          .otherwise(pl.lit('Weekday'))
          .alias('day_type')
    ])
    .group_by(['product', 'day_type'])
    .agg([
        pl.col('sales_amount').sum().alias('total_sales'),
        pl.col('sales_amount').mean().alias('avg_sale'),
        pl.len().alias('num_transactions')
    ])
    .sort(['product', 'day_type'])
)

print("Weekday vs Weekend:")
print(weekday_analysis)

## Part 7: Rolling Group By (Time-based Windows)

In [None]:
# Aggregate daily sales for one product
laptop_sales = (
    sales_data
    .filter(pl.col('product') == 'Laptop')
    .group_by('date')
    .agg(pl.col('sales_amount').sum().alias('daily_sales'))
    .sort('date')
)

print("Daily laptop sales (first 10 days):")
print(laptop_sales.head(10))

In [None]:
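# Rolling 7-row window; laptop_sales has exactly one row per day, so this is a 7-day window.
# The first 6 rows are null because the window is not yet full.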
rolling_analysis = laptop_sales.with_columns([
    pl.col('daily_sales').rolling_mean(window_size=7).alias('7day_avg'),
    pl.col('daily_sales').rolling_sum(window_size=7).alias('7day_sum'),
    pl.col('daily_sales').rolling_max(window_size=7).alias('7day_max')
])

print("\nRolling 7-day analysis:")
print(rolling_analysis.tail(10))

## Part 8: Dynamic Group By

In [None]:
# Group by time windows (e.g., weekly aggregation)
weekly_sales = (
    sales_data
    .sort('date')
    .group_by_dynamic('date', every='1w', group_by='product')  # `by=` was renamed to `group_by=`
    .agg([
        pl.col('sales_amount').sum().alias('weekly_sales'),
        pl.col('quantity').sum().alias('weekly_quantity')
    ])
)

print("Weekly sales by product:")
print(weekly_sales.head(15))

## Part 9: Advanced Patterns

### Pattern 1: Top N per group

In [None]:
# Get top 3 sales per region
top_sales_per_region = (
    sales_data
    .sort('sales_amount', descending=True)
    .group_by('region', maintain_order=True)
    .agg([
        pl.col('date').head(3).alias('top_dates'),
        pl.col('product').head(3).alias('top_products'),
        pl.col('sales_amount').head(3).alias('top_amounts')
    ])
)

print("Top 3 sales per region:")
print(top_sales_per_region)

### Pattern 2: Ratio to group total

In [None]:
# Calculate each transaction's % of regional total
with_pct = (
    sales_data
    .with_columns([
        (pl.col('sales_amount') / pl.col('sales_amount').sum().over('region') * 100)
        .alias('pct_of_region_total')
    ])
    .select(['date', 'region', 'product', 'sales_amount', 'pct_of_region_total'])
    .sort('pct_of_region_total', descending=True)
)

print("Top transactions by % of regional total:")
print(with_pct.head(10))

### Pattern 3: Multiple groupings with different aggregations

In [None]:
# Create multiple summary views
by_product = sales_data.group_by('product').agg([
    pl.col('sales_amount').sum().alias('total')
])

by_region = sales_data.group_by('region').agg([
    pl.col('sales_amount').sum().alias('total')
])

by_month = sales_data.group_by('month').agg([
    pl.col('sales_amount').sum().alias('total')
])

print("By Product:")
print(by_product)
print("\nBy Region:")
print(by_region)
print("\nBy Month:")
print(by_month)

### Pattern 4: Aggregating multiple columns with same function

In [None]:
# Sum multiple numeric columns at once
result = sales_data.group_by('product').agg([
    pl.col('sales_amount', 'quantity').sum().name.suffix('_sum')
])

print("Sum multiple columns:")
print(result)

## Part 10: Performance Tips

In [None]:
import time

# Create large dataset
large_df = pl.DataFrame({
    'group': np.random.choice(['A', 'B', 'C', 'D', 'E'], 1_000_000),
    'value1': np.random.randn(1_000_000),
    'value2': np.random.randn(1_000_000),
    'value3': np.random.randn(1_000_000)
})

print(f"Large dataset: {len(large_df):,} rows")

In [None]:
# Combine aggregations in single pass
start = time.time()
result = large_df.group_by('group').agg([
    pl.col('value1').mean(),
    pl.col('value2').sum(),
    pl.col('value3').std()
])
time1 = time.time() - start
print(f"Single group_by: {time1:.4f}s")

# Multiple separate group_bys (SLOWER)
start = time.time()
r1 = large_df.group_by('group').agg(pl.col('value1').mean())
r2 = large_df.group_by('group').agg(pl.col('value2').sum())
r3 = large_df.group_by('group').agg(pl.col('value3').std())
time2 = time.time() - start
print(f"Multiple group_bys: {time2:.4f}s")
print(f"\nSeparate group_bys took {time2/time1:.2f}x as long as the single group_by")

### Lazy evaluation with group_by

In [None]:
# Use lazy for complex pipelines
start = time.time()
lazy_result = (
    large_df.lazy()
    .filter(pl.col('value1') > 0)
    .group_by('group')
    .agg([
        pl.col('value1').mean(),
        pl.col('value2').sum()
    ])
    .sort('group')
    .collect()
)
lazy_time = time.time() - start

print(f"Lazy execution: {lazy_time:.4f}s")
print("Lazy mode lets the optimizer plan the whole query (e.g. predicate and projection pushdown).")

## Summary

### Key Concepts:
1. **group_by()** splits data into groups for aggregation
2. **agg()** applies aggregation expressions to each group
3. Multiple aggregations can be performed in a single pass
4. **Conditional aggregations** using filter() within agg()
5. **Rolling** and **dynamic** group_by for time-series data
6. Use **maintain_order=True** to keep groups in order of first appearance

### Common Aggregations:
- **Statistical**: sum, mean, median, std, var, min, max
- **Counting**: len (rows per group), count (non-null values), n_unique (distinct values)
- **Positional**: first, last, head, tail
- **Quantiles**: quantile()
- **Lists**: a bare column expression in agg() collects the group's values into a list

### Best Practices:
- Combine multiple aggregations in single group_by
- Use lazy evaluation for complex pipelines
- Use conditional aggregations instead of multiple group_bys
- Consider rolling/dynamic group_by for time-series
- Use over() for window functions (see Window Functions notebook)