# Week 9: GroupBy Mastery for Customer Analysis

## Learning Objectives
By the end of this notebook, you will be able to:
1. Perform single and multiple column grouping operations
2. Apply built-in aggregation functions (sum, mean, count, nunique)
3. Use custom aggregation functions with `.agg()`
4. Understand and apply `.transform()` for broadcasting calculations
5. Analyze customer purchase behavior using groupby operations

## Business Context
We're analyzing customer data from a Lagos-based e-commerce platform to understand purchasing patterns, calculate customer lifetime value, and identify high-value customer segments.

**Duration:** 30 minutes

## Setup and Data Loading

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set display options for better readability
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 2)

# Set visualization style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

In [None]:
# Load datasets
customers = pd.read_csv('../datasets/customers.csv')
orders = pd.read_csv('../datasets/orders.csv')
order_items = pd.read_csv('../datasets/order_items.csv')
products = pd.read_csv('../datasets/products.csv')

print("Data loaded successfully!")
print(f"Customers: {len(customers)} rows")
print(f"Orders: {len(orders)} rows")
print(f"Order Items: {len(order_items)} rows")
print(f"Products: {len(products)} rows")

In [None]:
# Quick data exploration
print("=== Customers Sample ===")
print(customers.head())
print("\n=== Orders Sample ===")
print(orders.head())
print("\n=== Order Items Sample ===")
print(order_items.head())

## Part 1: Single Column Grouping

### Basic GroupBy Syntax
```python
df.groupby('column_name').aggregation_function()
```
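
As a minimal sketch of that pattern, the toy DataFrame below (hypothetical data, not taken from the course datasets) is split by city, summed within each group, and combined into one result row per city.

```python
import pandas as pd

# Hypothetical toy data: four orders placed from two cities
toy = pd.DataFrame({
    'city': ['Lagos', 'Abuja', 'Lagos', 'Abuja'],
    'amount': [100, 50, 200, 150],
})

# Split rows by city, sum 'amount' within each group,
# then combine the results into a Series indexed by city
city_totals = toy.groupby('city')['amount'].sum()
print(city_totals)
```

The group keys become the index of the result and are sorted by default.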

Let's start by analyzing customer distribution across cities.

In [None]:
# Count customers by city
customers_by_city = customers.groupby('customer_city').size()
print("=== Customer Count by City ===")
print(customers_by_city.sort_values(ascending=False))

# Visualize
customers_by_city.sort_values(ascending=False).head(10).plot(kind='barh', color='steelblue')
plt.title('Top 10 Cities by Customer Count', fontsize=14, fontweight='bold')
plt.xlabel('Number of Customers')
plt.ylabel('City')
plt.tight_layout()
plt.show()

### Common Aggregation Functions
- `.size()` - Count of rows (includes NaN)
- `.count()` - Count of non-NaN values
- `.sum()` - Sum of values
- `.mean()` - Average
- `.median()` - Median value
- `.nunique()` - Count of unique values
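
The difference between `.size()`, `.count()`, and `.nunique()` only shows up when a group contains missing or repeated values. The sketch below uses a small hypothetical DataFrame with one `NaN` price to make that visible.

```python
import pandas as pd
import numpy as np

# Hypothetical toy data: Lagos has a missing price and a repeated price
toy = pd.DataFrame({
    'city': ['Lagos', 'Lagos', 'Lagos', 'Abuja'],
    'price': [100.0, np.nan, 100.0, 50.0],
})

grouped = toy.groupby('city')['price']
comparison = pd.DataFrame({
    'size': grouped.size(),        # all rows in the group, NaN included
    'count': grouped.count(),      # non-NaN values only
    'nunique': grouped.nunique(),  # distinct non-NaN values
})
print(comparison)
```

For Lagos the three columns differ: three rows, two non-missing prices, one distinct price.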

In [None]:
# Calculate total revenue per order
# order_items already carries price and freight_value, so no merge is needed here
order_revenue = order_items.groupby('order_id').agg({
    'price': 'sum',
    'freight_value': 'sum'
})

# Calculate total (price + freight)
order_revenue['total_amount'] = order_revenue['price'] + order_revenue['freight_value']

print("=== Order Revenue Summary ===")
print(order_revenue.head(10))
print("\n=== Revenue Statistics ===")
print(order_revenue['total_amount'].describe())

In [None]:
# Calculate number of items per order
items_per_order = order_items.groupby('order_id').size()

print("=== Items per Order Distribution ===")
print(items_per_order.value_counts().sort_index())

# Visualize
items_per_order.value_counts().sort_index().plot(kind='bar', color='coral')
plt.title('Distribution of Items per Order', fontsize=14, fontweight='bold')
plt.xlabel('Number of Items')
plt.ylabel('Number of Orders')
plt.tight_layout()
plt.show()

## Part 2: Multiple Column Grouping

When grouping by multiple columns, pandas creates one group for each unique combination of values and returns a result indexed by a hierarchical `MultiIndex`.

**Syntax:**
```python
df.groupby(['col1', 'col2']).aggregation_function()
```
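
A result with a two-level index can be awkward to read or merge. The sketch below, on hypothetical toy data, shows two common ways to reshape it: `reset_index()` for a flat table and `unstack()` for a wide one. The column names are illustrative only.

```python
import pandas as pd

# Hypothetical toy data: four orders across two states and two months
toy = pd.DataFrame({
    'state': ['LA', 'LA', 'LA', 'AB'],
    'month': ['2024-01', '2024-01', '2024-02', '2024-01'],
    'order_id': [1, 2, 3, 4],
})

# Grouping by two columns gives a Series with a two-level MultiIndex
counts = toy.groupby(['state', 'month']).size()
print(counts.index.nlevels)

# Option 1: turn the index levels back into ordinary columns
flat = counts.reset_index(name='orders')
print(flat)

# Option 2: pivot the inner level into columns, filling missing combinations with 0
wide = counts.unstack(fill_value=0)
print(wide)
```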

In [None]:
# Merge datasets to get city information with orders
orders_with_customers = orders.merge(customers, on='customer_id', how='left')

# Convert order_purchase_timestamp to datetime
orders_with_customers['order_purchase_timestamp'] = pd.to_datetime(orders_with_customers['order_purchase_timestamp'])
orders_with_customers['order_month'] = orders_with_customers['order_purchase_timestamp'].dt.to_period('M')

# Group by state and month
orders_by_state_month = orders_with_customers.groupby(['customer_state', 'order_month']).size()

print("=== Orders by State and Month ===")
print(orders_by_state_month.head(15))

In [None]:
# Focus on Nigerian states
nigerian_states = ['LA', 'AB', 'PH', 'KA']
nigerian_orders = orders_with_customers[orders_with_customers['customer_state'].isin(nigerian_states)]

# Group by state and city
orders_by_location = nigerian_orders.groupby(['customer_state', 'customer_city']).agg({
    'order_id': 'count',
    'customer_id': 'nunique'
}).rename(columns={
    'order_id': 'total_orders',
    'customer_id': 'unique_customers'
})

# Calculate orders per customer
orders_by_location['orders_per_customer'] = (orders_by_location['total_orders'] / 
                                               orders_by_location['unique_customers']).round(2)

print("=== Nigerian Location Analysis ===")
print(orders_by_location.sort_values('total_orders', ascending=False))

## Part 3: Custom Aggregation with .agg()

The `.agg()` method allows you to:
- Apply different functions to different columns
- Apply multiple functions to a single column
- Use custom functions
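
One side effect worth knowing: when a dictionary passed to `.agg()` maps a column to a list of functions, the result has two-level column labels. The sketch below uses hypothetical toy data and shows one possible way to flatten those labels; joining with an underscore is just an illustrative convention.

```python
import pandas as pd

# Hypothetical toy data: order A has two items, order B has one
toy = pd.DataFrame({
    'order_id': ['A', 'A', 'B'],
    'price': [10.0, 30.0, 5.0],
    'freight_value': [2.0, 3.0, 1.0],
})

# A list of functions for 'price' produces MultiIndex column labels
summary = toy.groupby('order_id').agg({
    'price': ['sum', 'mean'],
    'freight_value': 'sum',
})
print(summary.columns.tolist())

# Flatten each (column, function) pair into a single string label
summary.columns = ['_'.join(col) for col in summary.columns]
print(summary)
```

Named aggregations avoid the flattening step altogether, because each output column is named up front.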

In [None]:
# Multiple aggregations on order items
order_summary = order_items.groupby('order_id').agg({
    'order_item_id': 'count',      # Number of items
    'price': ['sum', 'mean', 'min', 'max'],
    'freight_value': 'sum'
})

print("=== Order Summary Statistics ===")
print(order_summary.head(10))

In [None]:
# Named aggregations (cleaner output)
order_summary_clean = order_items.groupby('order_id').agg(
    total_items=('order_item_id', 'count'),
    total_revenue=('price', 'sum'),
    avg_item_price=('price', 'mean'),
    min_price=('price', 'min'),
    max_price=('price', 'max'),
    total_freight=('freight_value', 'sum')
).round(2)

print("=== Clean Order Summary ===")
print(order_summary_clean.head(10))

print("\n=== Summary Statistics ===")
print(order_summary_clean.describe())

In [None]:
# Custom aggregation function
def price_range(x):
    """Calculate the range between max and min prices"""
    return x.max() - x.min()

# Apply custom function
order_price_analysis = order_items.groupby('order_id').agg(
    total_items=('order_item_id', 'count'),
    total_revenue=('price', 'sum'),
    price_range=('price', price_range),
    price_std=('price', 'std')
).round(2)

print("=== Order Price Analysis ===")
print(order_price_analysis.head(10))

# Filter orders with high price variation
high_variation_orders = order_price_analysis[order_price_analysis['price_range'] > 100]
print(f"\nOrders with price range > ₦100: {len(high_variation_orders)}")

## Part 4: Transform Operations

`.transform()` computes a statistic per group and returns a result with the same index and length as the input, so every row receives its own group's value.
This is useful for:
- Calculating percentage of total
- Standardizing values within groups
- Comparing individual values to group statistics
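
To make the contrast with aggregation concrete, the sketch below runs both on the same hypothetical toy data: aggregating returns one row per group, while `.transform()` returns one value per original row, which can be assigned straight back as a new column.

```python
import pandas as pd

# Hypothetical toy data: customer c1 has two rows, c2 has one
toy = pd.DataFrame({
    'customer_id': ['c1', 'c1', 'c2'],
    'total_price': [40.0, 60.0, 25.0],
})

# Aggregation collapses each group to a single row (2 rows here)
per_customer = toy.groupby('customer_id')['total_price'].sum()
print(per_customer)

# transform keeps the original length (3 rows), repeating the group total on each row
toy['customer_total'] = toy.groupby('customer_id')['total_price'].transform('sum')

# With the total on every row, a percentage-of-total is a simple column operation
toy['pct_of_customer_total'] = (toy['total_price'] / toy['customer_total'] * 100).round(2)
print(toy)
```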

In [None]:
# Calculate each customer's total spending
# First, create a comprehensive dataset; after the merge there is one row per order item
full_order_data = orders.merge(order_items, on='order_id')
full_order_data['total_price'] = full_order_data['price'] + full_order_data['freight_value']

# Calculate customer total using transform (the total is repeated on every row of that customer)
full_order_data['customer_total'] = full_order_data.groupby('customer_id')['total_price'].transform('sum')

# Calculate each item row's share of the customer's total
full_order_data['pct_of_customer_total'] = (
    full_order_data['total_price'] / full_order_data['customer_total'] * 100
).round(2)

print("=== Item Contribution to Customer Total ===")
print(full_order_data[['customer_id', 'order_id', 'total_price', 'customer_total', 'pct_of_customer_total']].head(15))

In [None]:
# Compare each item row's value to the customer's average item value
# (full_order_data has one row per order item, so the average is per item, not per order)
full_order_data['customer_avg'] = full_order_data.groupby('customer_id')['total_price'].transform('mean')
full_order_data['above_avg'] = full_order_data['total_price'] > full_order_data['customer_avg']

print("=== Items Compared to Customer Average ===")
print(full_order_data[['customer_id', 'order_id', 'total_price', 'customer_avg', 'above_avg']].head(15))

# Summary (rows equal to the average count as "at or below")
print(f"\nItems above customer average: {full_order_data['above_avg'].sum()}")
print(f"Items at or below customer average: {(~full_order_data['above_avg']).sum()}")

## Part 5: Customer Purchase Behavior Analysis

Let's apply everything we've learned to analyze customer behavior.

In [None]:
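# Comprehensive customer analysis
# full_order_data has one row per order item, so roll it up to one row per order first;
# otherwise the "order value" statistics would describe individual items
order_totals = full_order_data.groupby(['customer_id', 'order_id']).agg(
    order_value=('total_price', 'sum'),
    items=('order_item_id', 'count')
).reset_index()

# Then aggregate the order-level rows to one row per customer
customer_analysis = order_totals.groupby('customer_id').agg(
    total_orders=('order_id', 'nunique'),
    total_items=('items', 'sum'),
    total_revenue=('order_value', 'sum'),
    avg_order_value=('order_value', 'mean'),
    min_order_value=('order_value', 'min'),
    max_order_value=('order_value', 'max'),
    std_order_value=('order_value', 'std')
).round(2)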

# Calculate items per order
customer_analysis['avg_items_per_order'] = (customer_analysis['total_items'] / 
                                             customer_analysis['total_orders']).round(2)

print("=== Customer Purchase Behavior ===")
print(customer_analysis.sort_values('total_revenue', ascending=False).head(10))

In [None]:
# Segment customers by spending level
customer_analysis['spending_tier'] = pd.cut(
    customer_analysis['total_revenue'],
    bins=[0, 100, 300, 600, float('inf')],
    labels=['Low', 'Medium', 'High', 'VIP']
)

# Analyze by tier (observed=False keeps empty tiers and states the grouping behavior explicitly)
tier_analysis = customer_analysis.groupby('spending_tier', observed=False).agg({
    'total_revenue': ['count', 'sum', 'mean'],
    'total_orders': 'mean',
    'avg_order_value': 'mean'
}).round(2)

print("=== Customer Spending Tiers ===")
print(tier_analysis)

# Visualize customer distribution (sort_index keeps the bars in Low-to-VIP order)
customer_analysis['spending_tier'].value_counts().sort_index().plot(kind='bar', color='skyblue')
plt.title('Customer Distribution by Spending Tier', fontsize=14, fontweight='bold')
plt.xlabel('Spending Tier')
plt.ylabel('Number of Customers')
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()

In [None]:
# Identify repeat customers
customer_analysis['is_repeat_customer'] = customer_analysis['total_orders'] > 1

print("=== Repeat vs One-Time Customers ===")
repeat_summary = customer_analysis.groupby('is_repeat_customer').agg({
    'total_revenue': ['count', 'sum', 'mean'],
    'total_orders': 'mean'
}).round(2)
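# Map the boolean index to labels; this also works if only one of the two groups exists
repeat_summary.index = repeat_summary.index.map({False: 'One-Time', True: 'Repeat'})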
print(repeat_summary)

# Calculate percentage
repeat_pct = (customer_analysis['is_repeat_customer'].sum() / len(customer_analysis) * 100).round(2)
print(f"\nRepeat customer rate: {repeat_pct}%")

## Key Takeaways

### GroupBy Operations
1. **Single Column:** `df.groupby('col')` - Group by one dimension
2. **Multiple Columns:** `df.groupby(['col1', 'col2'])` - Create hierarchical groups
3. **Basic Aggregations:** `.sum()`, `.mean()`, `.count()`, `.nunique()`

### Advanced Techniques
1. **Custom Aggregations:** `.agg()` with dictionary or named aggregations
2. **Transform:** Broadcast group statistics back to original shape
3. **Multiple Functions:** Apply different functions to different columns

### Business Applications
1. **Customer Segmentation:** Group customers by spending behavior
2. **Performance Analysis:** Compare individual values to group averages
3. **Trend Analysis:** Track metrics over time and location

### SQL to Pandas
- SQL `GROUP BY` = pandas `.groupby()`
- SQL `COUNT(*)` = pandas `.size()`; SQL `COUNT(col)` = pandas `.count()` (skips NULL/NaN)
- SQL aggregate with `OVER (PARTITION BY ...)` = pandas `.groupby(...).transform()`

## Next Steps
1. Complete the exercises in `exercises/exercise-01-groupby-operations.ipynb`
2. Review the `resources/pandas-groupby-reference.md` for quick reference
3. Continue to Notebook 02: Multi-Index Operations

---
**PORA Academy Cohort 5 - Week 9 Wednesday Python**  
*Customer Lifetime Value Analysis*