# Exercise 01: GroupBy Operations

## Instructions
Complete the exercises below using the e-commerce dataset. Each exercise has a TODO comment indicating what you need to accomplish.

**Tips:**
- Review the lecture notebook `01-groupby-mastery.ipynb` if you need help
- Test your code after each exercise
- Check the solutions only after attempting each exercise on your own

**Expected Time:** 30-45 minutes

In [None]:
# Setup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load datasets
customers = pd.read_csv('../lecture-materials/datasets/customers.csv')
orders = pd.read_csv('../lecture-materials/datasets/orders.csv')
order_items = pd.read_csv('../lecture-materials/datasets/order_items.csv')
products = pd.read_csv('../lecture-materials/datasets/products.csv')

# Create merged dataset
data = (order_items
        .merge(orders, on='order_id')
        .merge(customers, on='customer_id')
        .merge(products, on='product_id'))

data['total_price'] = data['price'] + data['freight_value']
data['order_purchase_timestamp'] = pd.to_datetime(data['order_purchase_timestamp'])

print("Data loaded successfully!")

## Exercise 1: Basic GroupBy (10 points)

Count the unique customers in each of these Nigerian states: LA, AB, PH, KA.
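
Because `data` is built from `order_items`, a customer can appear on many rows, so counting rows and counting distinct customers give different answers. The sketch below illustrates the filter-then-count pattern on a made-up frame; `toy`, `region`, and `shopper_id` are hypothetical names, not columns of the exercise dataset.

```python
import pandas as pd

# Toy data (hypothetical): one row per purchase, so a shopper can repeat.
toy = pd.DataFrame({
    "region": ["N", "N", "S", "S", "S", "W"],
    "shopper_id": ["a", "a", "b", "c", "c", "d"],
})

# Keep only the regions of interest with .isin(), then count distinct shoppers.
subset = toy[toy["region"].isin(["N", "S"])]
shoppers_per_region = subset.groupby("region")["shopper_id"].nunique()

# For comparison: .size() counts rows, which overstates repeat shoppers.
rows_per_region = subset.groupby("region").size()

print(shoppers_per_region)
print(rows_per_region)
```

Region `W` drops out at the filter step, and region `N` has two rows but only one distinct shopper.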

In [None]:
# TODO: Filter for Nigerian states only
nigerian_states = ['LA', 'AB', 'PH', 'KA']
nigerian_customers = # YOUR CODE HERE

# TODO: Group by customer_state and count unique customers
customers_by_state = # YOUR CODE HERE

print("Customers by State:")
print(customers_by_state)

# HINT: Use .groupby('customer_state')['customer_id'].nunique()

## Exercise 2: Multiple Aggregations (15 points)

For each product category, calculate:
- Total revenue
- Average price
- Number of orders
- Number of unique customers
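
As a rough illustration of the dictionary form of `.agg()`, the sketch below maps each column to one function or to a list of functions. The frame and its column names (`team`, `ticket_id`, `hours`) are invented for the example. Note that passing a list for any column produces two-level column labels, so sorting needs a tuple key.

```python
import pandas as pd

# Toy data (hypothetical columns, unrelated to the exercise dataset).
toy = pd.DataFrame({
    "team": ["x", "x", "y", "y", "y"],
    "ticket_id": [1, 2, 3, 3, 4],
    "hours": [2.0, 4.0, 1.0, 3.0, 5.0],
})

summary = toy.groupby("team").agg({
    "hours": ["sum", "mean"],   # two statistics from one column
    "ticket_id": "nunique",     # distinct tickets per team
})

# Columns are now tuples such as ("hours", "sum"); sort with that tuple.
ranked = summary.sort_values(("hours", "sum"), ascending=False)
print(ranked)
```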

In [None]:
# TODO: Create aggregations using .agg()
category_summary = data.groupby('product_category_name').agg({
    # YOUR CODE HERE - fill in the aggregations
}).round(2)

# TODO: Sort by total revenue in descending order
category_summary = # YOUR CODE HERE

print("Top 10 Categories:")
print(category_summary.head(10))

# HINT: Use 'sum', 'mean', 'nunique' as aggregation functions

## Exercise 3: Named Aggregations (15 points)

Calculate customer-level metrics using named aggregations:
- total_spent: Sum of all purchases
- num_orders: Count of unique orders
- avg_order_value: Mean purchase amount
- max_purchase: Maximum single purchase
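
Named aggregations give each output column an explicit name and keep the result's columns flat. A minimal sketch on invented data (`player`, `match_id`, and `score` are hypothetical) might look like this:

```python
import pandas as pd

# Toy data (hypothetical columns, unrelated to the exercise dataset).
toy = pd.DataFrame({
    "player": ["p1", "p1", "p2", "p2", "p2"],
    "match_id": [10, 11, 10, 12, 12],
    "score": [5, 7, 3, 9, 6],
})

# Each keyword is an output column: name=(input_column, function).
player_stats = toy.groupby("player").agg(
    total_score=("score", "sum"),
    matches=("match_id", "nunique"),
    best_score=("score", "max"),
)
print(player_stats)
```

The keyword names become the column labels directly, so no renaming step is needed afterwards.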

In [None]:
# TODO: Use named aggregations
customer_metrics = data.groupby('customer_id').agg(
    # YOUR CODE HERE
).round(2)

print("Customer Metrics:")
print(customer_metrics.head(10))

# HINT: Syntax is metric_name=('column_name', 'aggregation_function')

## Exercise 4: Transform Operation (20 points)

For each purchase row in `data`, calculate what percentage of the customer's total spending (`total_price`) it represents.
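
Unlike `.agg()`, which returns one row per group, `.transform()` returns a value for every original row, so the result can be assigned straight back as a new column. The sketch below shows the idea on a made-up frame; `dept` and `cost` are hypothetical names.

```python
import pandas as pd

# Toy data (hypothetical columns, unrelated to the exercise dataset).
toy = pd.DataFrame({
    "dept": ["a", "a", "b", "b", "b"],
    "cost": [30.0, 70.0, 10.0, 20.0, 20.0],
})

# transform('sum') broadcasts each group's total back onto its rows.
toy["dept_total"] = toy.groupby("dept")["cost"].transform("sum")

# Each row's share of its own group's total, as a percentage.
toy["share_pct"] = toy["cost"] / toy["dept_total"] * 100
print(toy)
```

A quick sanity check for this kind of calculation is that the shares within each group add up to 100.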

In [None]:
# TODO: Calculate customer total using transform
data['customer_total'] = # YOUR CODE HERE

# TODO: Calculate percentage contribution
data['pct_of_total'] = # YOUR CODE HERE

# Display results
print("Order Contribution Analysis:")
print(data[['customer_id', 'order_id', 'total_price', 'customer_total', 'pct_of_total']].head(15))

# HINT: Use .transform('sum') and then divide

## Exercise 5: Custom Aggregation Function (20 points)

Create a custom function to calculate the coefficient of variation (CV) for customer order values.

CV = (standard deviation / mean) * 100
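
To make the formula concrete, here is the arithmetic for three made-up values, followed by a sketch of how a plain Python function can be passed to `.agg()`. The `value_range` helper and the `grp`/`val` columns are hypothetical and deliberately compute a different statistic from the one in the exercise. One detail worth knowing: pandas `.std()` uses the sample standard deviation (`ddof=1`) by default, so a group with a single value yields `NaN`.

```python
import pandas as pd

# Worked arithmetic for the CV formula on three made-up values.
values = pd.Series([10.0, 20.0, 30.0])
mean = values.mean()        # (10 + 20 + 30) / 3 = 20
std = values.std()          # sample std (ddof=1): sqrt((100 + 0 + 100) / 2) = 10
cv = std / mean * 100       # 10 / 20 * 100 = 50
print(f"mean={mean}, std={std}, CV={cv}")

# With a single observation the sample std is undefined, so pandas returns NaN.
single_std = pd.Series([42.0]).std()

# Any function that reduces a Series to one number can be used in .agg().
def value_range(x):
    """Hypothetical helper: spread between largest and smallest value."""
    return x.max() - x.min()

toy = pd.DataFrame({"grp": ["a", "a", "a", "b"],
                    "val": [10.0, 20.0, 30.0, 5.0]})
spread = toy.groupby("grp").agg(spread=("val", value_range))
print(spread)
```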

In [None]:
# TODO: Define custom function
def coeff_variation(x):
    """Calculate coefficient of variation"""
    # YOUR CODE HERE
    pass

# TODO: Apply custom function
customer_variability = data.groupby('customer_id').agg(
    num_orders=('order_id', 'nunique'),
    avg_order=('total_price', 'mean'),
    std_order=('total_price', 'std'),
    cv=('total_price', coeff_variation)
).round(2)

# Filter customers with more than 1 order
customer_variability = customer_variability[customer_variability['num_orders'] > 1]

print("Customer Order Variability:")
print(customer_variability.sort_values('cv', ascending=False).head(10))

# HINT: Check for zero mean to avoid division by zero

## Exercise 6: Multi-Column GroupBy (20 points)

Analyze orders by state and month. Calculate total revenue and number of orders for each state-month combination.
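
Passing a list of columns to `groupby` produces one group per combination, and the result carries a two-level index. The sketch below uses invented data (`city`, `ts`, `amount`, `invoice_id` are hypothetical) to show monthly periods, a two-key groupby, and selecting one outer key afterwards.

```python
import pandas as pd

# Toy data (hypothetical columns, unrelated to the exercise dataset).
toy = pd.DataFrame({
    "city": ["r", "r", "r", "s"],
    "ts": pd.to_datetime(["2024-01-05", "2024-01-20",
                          "2024-02-03", "2024-01-09"]),
    "amount": [10.0, 15.0, 40.0, 7.0],
    "invoice_id": [1, 2, 3, 4],
})

# Collapse each timestamp to its calendar month (a Period such as 2024-01).
toy["month"] = toy["ts"].dt.to_period("M")

# One row per (city, month) combination.
by_city_month = toy.groupby(["city", "month"]).agg(
    revenue=("amount", "sum"),
    invoices=("invoice_id", "nunique"),
)

# Selecting the outer level leaves a frame indexed by month only.
r_months = by_city_month.loc["r"]
best_month = r_months["revenue"].idxmax()
print(by_city_month)
print("Best month for city r:", best_month)
```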

In [None]:
# TODO: Extract month from timestamp
data['order_month'] = # YOUR CODE HERE

# TODO: Group by state and month
state_month_analysis = # YOUR CODE HERE

print("State-Month Analysis:")
print(state_month_analysis.head(20))

# TODO: Find the month with the highest revenue for Lagos (LA)
lagos_data = # YOUR CODE HERE
print("\nLagos Monthly Performance:")
print(lagos_data)

# HINT: Use .dt.to_period('M') for monthly periods

## Bonus Challenge: Identify VIP Customers (10 points extra credit)

Identify VIP customers who meet ALL these criteria:
- Total spending > ₦500
- More than 2 orders
- Purchased from at least 2 different categories
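
Combining several conditions in a boolean filter requires `&` (not `and`) and parentheses around each comparison, because `&` binds more tightly than `>`. A small sketch on an invented profile table (`spend` and `visits` are hypothetical columns, and the thresholds are arbitrary):

```python
import pandas as pd

# Toy per-user profile (hypothetical columns, unrelated to the exercise dataset).
profile = pd.DataFrame(
    {"spend": [120.0, 800.0, 650.0], "visits": [5, 1, 4]},
    index=["u1", "u2", "u3"],
)

# Each comparison yields a boolean Series; & keeps rows where both are True.
selected = profile[(profile["spend"] > 500) & (profile["visits"] > 2)]
print(selected)
```

Only `u3` passes both tests: `u1` fails on spend and `u2` fails on visits.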

In [None]:
# TODO: Calculate customer criteria
customer_profile = data.groupby('customer_id').agg(
    # YOUR CODE HERE
)

# TODO: Filter for VIP customers
vip_customers = # YOUR CODE HERE

print(f"Total VIP Customers: {len(vip_customers)}")
print("\nVIP Customer Details:")
print(vip_customers.head(10))

# HINT: Use boolean indexing with multiple conditions

---
## Submission

Once complete:
1. Ensure all cells execute without errors
2. Review your outputs for correctness
3. Compare with the solution notebook (after attempting all exercises)
4. Share insights or challenges in class discussion

**Points Distribution:**
- Exercise 1: 10 points
- Exercise 2: 15 points
- Exercise 3: 15 points
- Exercise 4: 20 points
- Exercise 5: 20 points
- Exercise 6: 20 points
- Bonus: 10 points extra credit

**Total: 100 points (110 with bonus)**