# Week 9: Advanced Aggregations for Business Metrics

## Learning Objectives
By the end of this notebook, you will be able to:
1. Create custom aggregation functions for business metrics
2. Use named aggregations with `.agg()` for clean output
3. Combine multiple operations on the same column
4. Calculate complex business metrics (revenue growth, customer retention)
5. Apply statistical aggregations (percentiles, standard deviation, variance)
6. Apply conditional aggregations by filtering groups with `.filter()` before aggregating

## Business Context
We'll build comprehensive customer analytics including:
- Customer lifetime value calculations
- Purchase frequency and recency analysis
- Revenue contribution metrics
- Customer segment performance

**Duration:** 45 minutes

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 2)
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

In [None]:
# Load and prepare data
customers = pd.read_csv('../datasets/customers.csv')
orders = pd.read_csv('../datasets/orders.csv')
order_items = pd.read_csv('../datasets/order_items.csv')
products = pd.read_csv('../datasets/products.csv')

# Merge datasets
data = (order_items
        .merge(orders, on='order_id')
        .merge(customers, on='customer_id')
        .merge(products, on='product_id'))

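# Row-level revenue: item price plus its freight charge
data['total_price'] = data['price'] + data['freight_value']
# Parse timestamps so date arithmetic (recency, days active) works later
data['order_purchase_timestamp'] = pd.to_datetime(data['order_purchase_timestamp'])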

print("Data ready!")

## Part 1: Custom Aggregation Functions

In [None]:
# Define custom functions
def revenue_range(x):
    """Calculate range between max and min"""
    return x.max() - x.min()

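def coefficient_of_variation(x):
    """Calculate CV as a percentage (std / mean * 100) - measures relative variability.

    Returns NaN for groups with a single row, where the sample std is undefined.
    """
    mean = x.mean()
    if mean == 0:
        return 0
    return (x.std() / mean) * 100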

def percentile_90(x):
    """Calculate 90th percentile"""
    return x.quantile(0.90)

# Apply custom aggregations
customer_metrics = data.groupby('customer_id').agg(
    num_orders=('order_id', 'nunique'),
    total_revenue=('total_price', 'sum'),
    avg_revenue=('total_price', 'mean'),
    revenue_range=('total_price', revenue_range),
    revenue_cv=('total_price', coefficient_of_variation),
    revenue_90th=('total_price', percentile_90)
).round(2)

print("=== Customer Metrics with Custom Aggregations ===")
print(customer_metrics.head(10))
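
The pattern above can be checked on a tiny hand-made table before running it on the full dataset. The sketch below is illustrative only: the `toy` frame and its values are made up, and `price_range` and `cv_percent` are hypothetical stand-ins for the functions defined in this cell. It also shows that a customer with a single row gets `NaN` for the CV, because the sample standard deviation needs at least two values.

```python
import math

import pandas as pd

# Hypothetical toy data: customer "a" has three item rows, customer "b" has one
toy = pd.DataFrame({
    "customer_id": ["a", "a", "a", "b"],
    "total_price": [10.0, 20.0, 30.0, 50.0],
})

def price_range(x):
    """Spread between the largest and smallest value in the group."""
    return x.max() - x.min()

def cv_percent(x):
    """Coefficient of variation as a percentage (sample std / mean * 100)."""
    mean = x.mean()
    if mean == 0:
        return 0.0
    return x.std() / mean * 100

# Named aggregation: output column name = (input column, function)
toy_metrics = toy.groupby("customer_id").agg(
    price_range=("total_price", price_range),
    price_cv=("total_price", cv_percent),
)

print(toy_metrics)
# Single-row groups have no sample std, so their CV is NaN
print(math.isnan(toy_metrics.loc["b", "price_cv"]))
```

If `NaN` is not wanted downstream, one option is to follow the aggregation with `.fillna(0)`, but that choice hides the difference between "no variability" and "not enough data".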

## Part 2: Named Aggregations for Clean Code

In [None]:
# Comprehensive customer analysis with named aggregations
customer_analysis = data.groupby('customer_id').agg(
    # Order metrics
    total_orders=('order_id', 'nunique'),
    total_items=('order_item_id', 'count'),
    
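    # Revenue metrics (each row of `data` is one order item, so 'mean', 'min'
    # and 'max' of total_price describe items, not whole orders)
    lifetime_value=('total_price', 'sum'),
    avg_item_value=('total_price', 'mean'),
    min_item_value=('total_price', 'min'),
    max_item_value=('total_price', 'max'),
    
    # Statistical metrics
    revenue_std=('total_price', 'std'),
    revenue_median=('total_price', 'median'),
    
    # Category diversity
    unique_categories=('product_category_name', 'nunique'),
    
    # Time metrics
    first_purchase=('order_purchase_timestamp', 'min'),
    last_purchase=('order_purchase_timestamp', 'max')
).round(2)

# Calculate derived metrics
customer_analysis['items_per_order'] = (customer_analysis['total_items'] / 
                                         customer_analysis['total_orders']).round(2)
# Average order value = total spend divided by number of distinct orders
customer_analysis['avg_order_value'] = (customer_analysis['lifetime_value'] / 
                                         customer_analysis['total_orders']).round(2)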

print("=== Comprehensive Customer Analysis ===")
print(customer_analysis.head(10))
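
Because the merged table has one row per order item, a plain `'mean'` of `total_price` is an average per item, not per order. The sketch below uses made-up data to show the difference. The column names `avg_item_value` and `avg_order_value` are illustrative choices, and dividing total spend by the number of distinct orders is one reasonable definition of average order value, not the only one.

```python
import pandas as pd

# Hypothetical toy data: order 1 has two items, order 2 has one
toy = pd.DataFrame({
    "customer_id": ["a", "a", "a"],
    "order_id": [1, 1, 2],
    "total_price": [10.0, 20.0, 30.0],
})

summary = toy.groupby("customer_id").agg(
    total_orders=("order_id", "nunique"),
    lifetime_value=("total_price", "sum"),
    avg_item_value=("total_price", "mean"),  # mean over item rows
)

# Order-level average: total spend divided by distinct orders
summary["avg_order_value"] = summary["lifetime_value"] / summary["total_orders"]

print(summary)
```

Here the item-level mean is 20.0, while the order-level average is 30.0, since the customer spent 60.0 across two orders.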

## Part 3: Recency, Frequency, Monetary (RFM) Analysis

In [None]:
# Define analysis date (use max date in dataset + 1 day)
analysis_date = data['order_purchase_timestamp'].max() + pd.Timedelta(days=1)

# Calculate RFM metrics
rfm = data.groupby('customer_id').agg(
    recency=('order_purchase_timestamp', lambda x: (analysis_date - x.max()).days),
    frequency=('order_id', 'nunique'),
    monetary=('total_price', 'sum')
).round(2)

print("=== RFM Analysis ===")
print(rfm.head(10))
print("\n=== RFM Statistics ===")
print(rfm.describe())

In [None]:
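# Create RFM scores using quartiles (4 = best, 1 = worst)
rfm['r_score'] = pd.qcut(rfm['recency'], 4, labels=[4, 3, 2, 1])  # Lower recency is better
# Frequency often has many tied values, which would make quartile edges collide;
# ranking with method='first' keeps the edges unique, splitting ties by row order
rfm['f_score'] = pd.qcut(rfm['frequency'].rank(method='first'), 4, labels=[1, 2, 3, 4])
rfm['m_score'] = pd.qcut(rfm['monetary'], 4, labels=[1, 2, 3, 4])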

# Create RFM segment
rfm['rfm_score'] = rfm['r_score'].astype(str) + rfm['f_score'].astype(str) + rfm['m_score'].astype(str)

print("=== RFM Scores ===")
print(rfm.head(10))
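
The reason `frequency` is ranked before binning can be seen on a small made-up series. When most values are identical, several quartile edges coincide and `pd.qcut` raises a `ValueError`. Ranking with `method='first'` gives every row a distinct rank, so the bins are well defined. The trade-off is that tied customers land in different score bins purely by row order. The `freq` values below are hypothetical.

```python
import pandas as pd

# Hypothetical frequencies: six one-time buyers and two repeat buyers
freq = pd.Series([1, 1, 1, 1, 1, 1, 2, 3], name="frequency")

# Direct quartile binning fails because several bin edges are equal to 1
try:
    pd.qcut(freq, 4, labels=[1, 2, 3, 4])
    raised = False
except ValueError:
    raised = True
print("qcut on raw values raised ValueError:", raised)

# Ranking first makes every value distinct, so four equal-sized bins exist
f_score = pd.qcut(freq.rank(method="first"), 4, labels=[1, 2, 3, 4])
print(f_score.astype(int).tolist())
```

Note that the six customers with exactly one order receive scores 1, 2, and 3, so an `f_score` built this way should be read as a rough ordering, not a precise measurement.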

In [None]:
# Segment customers
def segment_customer(row):
    r, f, m = int(row['r_score']), int(row['f_score']), int(row['m_score'])
    
    if r >= 4 and f >= 4 and m >= 4:
        return 'Champions'
    elif r >= 3 and f >= 3:
        return 'Loyal Customers'
    elif r >= 4:
        return 'Recent Customers'
    elif f >= 4:
        return 'Frequent Buyers'
    elif m >= 4:
        return 'Big Spenders'
    elif r <= 2:
        return 'At Risk'
    else:
        return 'Need Attention'

rfm['segment'] = rfm.apply(segment_customer, axis=1)

print("=== Customer Segmentation ===")
print(rfm['segment'].value_counts())

# Visualize
rfm['segment'].value_counts().plot(kind='barh', color='steelblue')
plt.title('Customer Segments Distribution', fontsize=14, fontweight='bold')
plt.xlabel('Number of Customers')
plt.tight_layout()
plt.show()
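
Row-wise `apply` is easy to read but can be slow on large tables. As an alternative sketch (not a replacement for the function above), the same rules can be expressed with `np.select`, which evaluates the conditions column-wise and picks the first match per row, just like the `if`/`elif` chain. The `scores` frame below is made-up data with plain integer scores.

```python
import numpy as np
import pandas as pd

# Hypothetical scores, one row per customer
scores = pd.DataFrame({
    "r_score": [4, 3, 4, 1, 2, 1, 3],
    "f_score": [4, 3, 1, 4, 1, 2, 1],
    "m_score": [4, 1, 1, 1, 4, 1, 1],
})
r, f, m = scores["r_score"], scores["f_score"], scores["m_score"]

# Conditions are checked in order; the first True one wins for each row
conditions = [
    (r >= 4) & (f >= 4) & (m >= 4),
    (r >= 3) & (f >= 3),
    r >= 4,
    f >= 4,
    m >= 4,
    r <= 2,
]
labels = [
    "Champions",
    "Loyal Customers",
    "Recent Customers",
    "Frequent Buyers",
    "Big Spenders",
    "At Risk",
]
scores["segment"] = np.select(conditions, labels, default="Need Attention")

print(scores)
```

With the categorical scores produced by `pd.qcut`, the columns would need `.astype(int)` first so the comparisons work on numbers.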

## Part 4: Statistical Aggregations

In [None]:
# Statistical analysis by product category
category_stats = data.groupby('product_category_name').agg(
    count=('total_price', 'count'),
    mean=('total_price', 'mean'),
    median=('total_price', 'median'),
    std=('total_price', 'std'),
    min=('total_price', 'min'),
    q25=('total_price', lambda x: x.quantile(0.25)),
    q75=('total_price', lambda x: x.quantile(0.75)),
    max=('total_price', 'max')
).round(2)

# Calculate IQR
category_stats['iqr'] = category_stats['q75'] - category_stats['q25']

print("=== Category Price Statistics ===")
print(category_stats.sort_values('mean', ascending=False).head(10))
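
One common use of the quartiles and IQR is flagging unusually high or low prices within each category. The sketch below applies the widely used 1.5 x IQR fence rule to made-up data. The 1.5 multiplier is a convention assumed here, not something this notebook prescribes, and the `toy` frame is hypothetical.

```python
import pandas as pd

# Hypothetical toy data: category "x" contains one extreme price
toy = pd.DataFrame({
    "category": ["x"] * 6 + ["y"] * 4,
    "total_price": [10.0, 12.0, 11.0, 13.0, 12.0, 100.0, 50.0, 52.0, 51.0, 49.0],
})

grouped = toy.groupby("category")["total_price"]

# transform broadcasts each group's quartile back onto its rows
q25 = grouped.transform(lambda s: s.quantile(0.25))
q75 = grouped.transform(lambda s: s.quantile(0.75))
iqr = q75 - q25

# Flag rows outside [q25 - 1.5 * IQR, q75 + 1.5 * IQR] for their category
toy["is_outlier"] = (
    (toy["total_price"] < q25 - 1.5 * iqr)
    | (toy["total_price"] > q75 + 1.5 * iqr)
)

print(toy[toy["is_outlier"]])
```

Computing the fences per category matters: a price of 50 is ordinary in category "y" but would be far outside the fences of category "x".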

In [None]:
# Analyze purchase patterns by state
state_analysis = data.groupby('customer_state').agg(
    total_orders=('order_id', 'nunique'),
    total_customers=('customer_id', 'nunique'),
    total_revenue=('total_price', 'sum'),
    # Rows are order items, so these two describe item-level values
    avg_item_value=('total_price', 'mean'),
    revenue_std=('total_price', 'std'),
    median_item_value=('total_price', 'median')
).round(2)

# Calculate orders per customer
state_analysis['orders_per_customer'] = (state_analysis['total_orders'] / 
                                          state_analysis['total_customers']).round(2)

print("=== State Performance Analysis ===")
print(state_analysis.sort_values('total_revenue', ascending=False).head(10))

## Part 5: Conditional Aggregations with Filter

In [None]:
# Filter customers with more than 1 order
repeat_customers = data.groupby('customer_id').filter(lambda x: x['order_id'].nunique() > 1)

print(f"Total rows in data: {len(data)}")
print(f"Rows for repeat customers: {len(repeat_customers)}")
print(f"Unique repeat customers: {repeat_customers['customer_id'].nunique()}")

In [None]:
# Analyze repeat customer behavior
repeat_analysis = repeat_customers.groupby('customer_id').agg(
    num_orders=('order_id', 'nunique'),
    total_spent=('total_price', 'sum'),
    avg_item_value=('total_price', 'mean'),  # mean over item rows, not orders
    days_active=('order_purchase_timestamp', lambda x: (x.max() - x.min()).days)
).round(2)

print("=== Repeat Customer Analysis ===")
print(repeat_analysis.sort_values('total_spent', ascending=False).head(10))
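
`groupby(...).filter()` with a lambda calls Python once per group, which can be slow when there are many customers. A possible alternative, sketched below on made-up data, is to broadcast the per-customer order count with `transform('nunique')` and use it as a boolean mask. Both approaches keep the same rows in the same order.

```python
import pandas as pd

# Hypothetical toy data: "a" and "c" have two distinct orders, "b" has one
toy = pd.DataFrame({
    "customer_id": ["a", "a", "b", "c", "c", "c"],
    "order_id": [1, 2, 3, 4, 4, 5],
})

# Approach used in this notebook: keep whole groups that pass a test
via_filter = toy.groupby("customer_id").filter(
    lambda g: g["order_id"].nunique() > 1
)

# Alternative: attach each customer's distinct-order count to every row
orders_per_customer = toy.groupby("customer_id")["order_id"].transform("nunique")
via_mask = toy[orders_per_customer > 1]

print(via_filter.equals(via_mask))
```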

## Part 6: Complex Business Metrics

In [None]:
# Calculate customer purchase velocity (orders per month)
customer_velocity = data.groupby('customer_id').agg(
    num_orders=('order_id', 'nunique'),
    first_order=('order_purchase_timestamp', 'min'),
    last_order=('order_purchase_timestamp', 'max'),
    total_revenue=('total_price', 'sum')
)

# Calculate days active
customer_velocity['days_active'] = (customer_velocity['last_order'] - 
                                     customer_velocity['first_order']).dt.days

# Calculate monthly purchase rate; treat anyone active for less than a month as
# one month, which avoids division by zero and inflated rates for short spans
customer_velocity['months_active'] = (customer_velocity['days_active'] / 30).clip(lower=1)
customer_velocity['orders_per_month'] = (customer_velocity['num_orders'] / 
                                          customer_velocity['months_active']).round(2)

print("=== Customer Purchase Velocity ===")
print(customer_velocity.sort_values('orders_per_month', ascending=False).head(10))
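
How short activity spans are handled has a large effect on `orders_per_month`. The sketch below compares two options on made-up data: replacing only an exact zero with 1, and flooring every span at one month with `.clip(lower=1)`. A customer with two orders three days apart gets a rate of 20 orders per month under the first option and 2 under the second. Which floor is appropriate is a modeling choice, and the `toy` values are hypothetical.

```python
import pandas as pd

# Hypothetical customers: single order, two orders 3 days apart, four orders over 60 days
toy = pd.DataFrame(
    {"num_orders": [1, 2, 4], "days_active": [0, 3, 60]},
    index=["a", "b", "c"],
)

months = toy["days_active"] / 30

# Option 1: only an exact zero becomes one month
toy["rate_replace"] = (toy["num_orders"] / months.replace(0, 1)).round(2)

# Option 2: any span shorter than a month counts as one month
toy["rate_clip"] = (toy["num_orders"] / months.clip(lower=1)).round(2)

print(toy)
```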

In [None]:
# Revenue concentration analysis
total_revenue = data['total_price'].sum()

customer_revenue = data.groupby('customer_id').agg(
    revenue=('total_price', 'sum')
).sort_values('revenue', ascending=False)

# Calculate cumulative percentage
customer_revenue['cumulative_revenue'] = customer_revenue['revenue'].cumsum()
customer_revenue['cumulative_pct'] = (customer_revenue['cumulative_revenue'] / total_revenue * 100).round(2)

# Find top 20% customers
top_20_pct = int(len(customer_revenue) * 0.2)
top_20_revenue = customer_revenue.iloc[:top_20_pct]['revenue'].sum()
top_20_contribution = (top_20_revenue / total_revenue * 100).round(2)

print(f"Top 20% of customers contribute {top_20_contribution}% of total revenue")
print("\n=== Top 10 Revenue Contributors ===")
print(customer_revenue.head(10))
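
The concentration calculation is easy to verify by hand on a small example. The sketch below uses ten made-up customers whose revenue sums to 1000, computes the share held by the top 20%, and then uses the cumulative percentage to ask the reverse question: how many customers are needed to reach 80% of revenue. The data and variable names are illustrative.

```python
import pandas as pd

# Hypothetical revenue per customer, sorted from largest to smallest
revenue = pd.Series(
    [500, 200, 120, 50, 40, 30, 25, 20, 10, 5],
    index=list("abcdefghij"),
    name="revenue",
).sort_values(ascending=False)

total = revenue.sum()
cumulative_pct = revenue.cumsum() / total * 100

# Share of revenue from the top 20% of customers (2 of 10 here)
n_top = int(len(revenue) * 0.2)
top_share = revenue.iloc[:n_top].sum() / total * 100

# Smallest number of customers whose cumulative share reaches 80%
customers_for_80 = int((cumulative_pct >= 80).to_numpy().argmax()) + 1

print(f"Top {n_top} customers hold {round(top_share, 2)}% of revenue")
print(f"{customers_for_80} customers are needed to reach 80% of revenue")
```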

## Key Takeaways

### Custom Aggregations
1. Define functions for business-specific metrics
2. Use lambda functions for simple calculations
3. Combine multiple aggregations with named syntax

### RFM Analysis
1. **Recency:** Days since last purchase
2. **Frequency:** Number of distinct orders
3. **Monetary:** Total spending
4. Score and segment customers for targeted marketing

### Statistical Methods
1. Use percentiles to understand distribution
2. Calculate variation metrics (std, CV, IQR)
3. Identify outliers and patterns

### Business Applications
1. Customer segmentation for marketing
2. Revenue concentration analysis
3. Purchase velocity tracking
4. Churn risk identification

## Next Steps
1. Complete `exercises/exercise-03-clv-calculations.ipynb`
2. Review `resources/aggregation-functions.md`
3. Continue to Notebook 04: Complete CLV Analysis

---
**PORA Academy Cohort 5 - Week 9 Wednesday Python**  
*Customer Lifetime Value Analysis*