# Week 9: Multi-Index Operations and Data Pivoting

## Learning Objectives
By the end of this notebook, you will be able to:
1. Create and work with hierarchical (multi-level) indexes
2. Navigate and slice multi-indexed DataFrames
3. Use `.unstack()` and `.stack()` to reshape data
4. Create pivot tables with `.pivot()` and `.pivot_table()`
5. Generate cross-tabulations with `pd.crosstab()`
6. Build product-customer purchase matrices for analysis

## Business Context
We'll analyze product-customer purchase patterns to understand:
- Which products are frequently bought together
- Customer preferences across product categories
- Purchase patterns by location and time
- Cross-selling opportunities

**Duration:** 45 minutes

## Setup and Data Loading

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 2)
pd.set_option('display.max_rows', 20)

# Set visualization style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (14, 7)

In [None]:
# Load datasets
customers = pd.read_csv('../datasets/customers.csv')
orders = pd.read_csv('../datasets/orders.csv')
order_items = pd.read_csv('../datasets/order_items.csv')
products = pd.read_csv('../datasets/products.csv')

print("Data loaded successfully!")

In [None]:
# Create comprehensive dataset
# Merge all tables
data = (order_items
        .merge(orders, on='order_id')
        .merge(customers, on='customer_id')
        .merge(products, on='product_id'))

# Calculate total price
data['total_price'] = data['price'] + data['freight_value']

# Convert timestamp to datetime
data['order_purchase_timestamp'] = pd.to_datetime(data['order_purchase_timestamp'])
data['order_month'] = data['order_purchase_timestamp'].dt.to_period('M')

print(f"Combined dataset: {len(data)} rows")
print("\nSample data:")
print(data.head())

## Part 1: Creating Multi-Index DataFrames

A multi-index (hierarchical index) allows you to have multiple levels of row or column labels.

**Benefits:**
- Organize data in multiple dimensions
- Compact representation of sparse data (only the combinations that actually occur get a row)
- Natural representation of grouped data
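
A multi-index can also be built directly, without going through `groupby()`. The sketch below uses made-up labels and revenue figures (they are assumptions for illustration, not values from the course datasets) to show what the levels look like and how selection by full or partial key behaves:

```python
import pandas as pd

# Hypothetical (category, state) pairs; not taken from the course datasets
index = pd.MultiIndex.from_tuples(
    [("books", "AB"), ("books", "LA"), ("toys", "LA")],
    names=["category", "state"],
)
toy_sales = pd.Series([80.0, 120.0, 45.0], index=index, name="revenue")

print(toy_sales)
print(toy_sales.index.nlevels)          # number of index levels
print(toy_sales.loc[("books", "LA")])   # full key: one (category, state) pair
print(toy_sales.loc["books"])           # partial key: all states for one category
```

Note that the pair `("toys", "AB")` simply has no row, which is how a multi-index keeps sparse combinations compact.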

In [None]:
# Create multi-index from groupby
# Use nunique so num_orders counts distinct orders, not order-item rows
sales_by_category_state = data.groupby(['product_category_name', 'customer_state']).agg({
    'order_id': 'nunique',
    'total_price': 'sum'
}).rename(columns={'order_id': 'num_orders', 'total_price': 'revenue'})

print("=== Multi-Index Example ===")
print(sales_by_category_state.head(15))
print(f"\nIndex levels: {sales_by_category_state.index.names}")

In [None]:
# Accessing data in multi-index
# Method 1: Using .loc with a tuple (the trailing ":" makes the column axis explicit)
print("=== Electronics in Lagos (LA) ===")
print(sales_by_category_state.loc[('eletronicos', 'LA'), :])

# Method 2: Using .loc with slice
print("\n=== All categories in Lagos ===")
print(sales_by_category_state.loc[(slice(None), 'LA'), :])

In [None]:
# Create a three-level multi-index by grouping on three columns
category_city_month = data.groupby(['product_category_name', 'customer_city', 'order_month']).agg({
    'total_price': 'sum',
    'order_id': 'nunique'
}).rename(columns={'order_id': 'orders'})

print("=== Three-Level Multi-Index ===")
print(category_city_month.head(20))
print(f"\nIndex levels: {category_city_month.index.names}")

## Part 2: Unstacking and Stacking

- **`.unstack()`**: Moves a row index level (the innermost by default) into the columns, producing a wide table (pivot operation)
- **`.stack()`**: Moves a column level into the row index, producing a long table (unpivot operation)
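
Before applying these to the real data, a tiny round trip may help. This is a minimal sketch on made-up numbers (the labels and values are assumptions, not course data); it also shows where `NaN` comes from after unstacking:

```python
import pandas as pd

# Hypothetical long-format revenue, indexed by (category, state)
long_rev = pd.Series(
    [80.0, 120.0, 45.0],
    index=pd.MultiIndex.from_tuples(
        [("books", "AB"), ("books", "LA"), ("toys", "LA")],
        names=["category", "state"],
    ),
)

# Innermost level (state) becomes the columns
wide = long_rev.unstack()
print(wide)  # the (toys, AB) cell is NaN because that pair never occurred

# Missing pairs can be filled during the unstack itself
wide_filled = long_rev.unstack(fill_value=0)

# Back to long format; dropna() removes the never-observed pair
# whether or not the installed pandas version drops it in stack()
round_trip = wide.stack().dropna()
print(round_trip)
```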

In [None]:
# Unstack the last level (state)
category_state_pivot = sales_by_category_state['revenue'].unstack()

print("=== Revenue by Category and State (Pivoted) ===")
print(category_state_pivot)

# Fill NaN values with 0
category_state_pivot_filled = category_state_pivot.fillna(0)
print("\n=== After Filling NaN ===")
print(category_state_pivot_filled.head(10))

In [None]:
# Visualize category performance across states
# Focus on Nigerian states
nigerian_cols = ['LA', 'AB', 'PH', 'KA']
nigerian_data = category_state_pivot_filled[nigerian_cols]

# Top 5 categories by total revenue
top_categories = nigerian_data.sum(axis=1).nlargest(5).index
nigerian_data.loc[top_categories].plot(kind='bar', stacked=False)
plt.title('Top 5 Product Categories by State Revenue', fontsize=14, fontweight='bold')
plt.xlabel('Product Category')
plt.ylabel('Revenue (₦)')
plt.legend(title='State', bbox_to_anchor=(1.05, 1))
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

In [None]:
# Stack back to long format
# Note: the zero-filled pairs are kept, so this can have more rows than the original groupby result
stacked_back = category_state_pivot_filled.stack()

print("=== After Stacking Back ===")
print(stacked_back.head(15))
print(f"\nType: {type(stacked_back)}")
print(f"Index levels: {stacked_back.index.names}")

## Part 3: Pivot Tables

Pivot tables aggregate data and reshape it simultaneously.

**Syntax:**
```python
df.pivot_table(
    values='column_to_aggregate',
    index='row_labels',
    columns='column_labels',
    aggfunc='aggregation_function'
)
```
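
The difference between `.pivot()` and `.pivot_table()` is easiest to see on data with duplicate row/column pairs. The following is a sketch on a made-up frame (the column names and values are assumptions for illustration): `.pivot()` only reshapes and fails on duplicates, while `.pivot_table()` aggregates them.

```python
import pandas as pd

# Hypothetical order lines; the ("books", "LA") pair appears twice on purpose
toy = pd.DataFrame({
    "category": ["books", "books", "books", "toys"],
    "state":    ["LA",    "LA",    "AB",    "LA"],
    "revenue":  [100.0,   20.0,    80.0,    45.0],
})

# pivot_table aggregates duplicate (index, columns) pairs
table = toy.pivot_table(values="revenue", index="category",
                        columns="state", aggfunc="sum", fill_value=0)
print(table)

# pivot only reshapes, so duplicate pairs raise a ValueError
pivot_failed = False
try:
    toy.pivot(index="category", columns="state", values="revenue")
except ValueError as err:
    pivot_failed = True
    print(f"pivot failed: {err}")

# After aggregating to one row per pair, pivot works
unique_pairs = toy.groupby(["category", "state"], as_index=False)["revenue"].sum()
reshaped = unique_pairs.pivot(index="category", columns="state", values="revenue")
print(reshaped)
```

As a rule of thumb, reach for `.pivot()` when each row/column pair is already unique and for `.pivot_table()` when values need to be combined.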

In [None]:
# Simple pivot table: revenue by category and state
revenue_pivot = data.pivot_table(
    values='total_price',
    index='product_category_name',
    columns='customer_state',
    aggfunc='sum',
    fill_value=0
)

print("=== Revenue Pivot Table ===")
print(revenue_pivot.head(10))

In [None]:
# Multiple aggregations
multi_agg_pivot = data.pivot_table(
    values='total_price',
    index='product_category_name',
    columns='customer_state',
    aggfunc=['sum', 'mean', 'count'],
    fill_value=0
)

print("=== Multiple Aggregations Pivot ===")
print(multi_agg_pivot.head())

In [None]:
# Pivot with margins (totals)
category_month_pivot = data.pivot_table(
    values='total_price',
    index='product_category_name',
    columns='order_month',
    aggfunc='sum',
    fill_value=0,
    margins=True,
    margins_name='Total'
)

print("=== Category Revenue by Month (with Totals) ===")
print(category_month_pivot.head(10))

## Part 4: Cross-Tabulation with pd.crosstab()

`pd.crosstab()` is specialized for counting frequencies and calculating proportions.
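
The `normalize` argument controls which direction the proportions run. A small sketch on made-up rows (an assumption for illustration, not course data) makes the two directions explicit:

```python
import pandas as pd

# Hypothetical order lines: one row per purchase
toy = pd.DataFrame({
    "category": ["books", "books", "toys", "toys", "toys"],
    "state":    ["LA",    "AB",    "LA",   "LA",   "AB"],
})

# Raw frequencies
counts = pd.crosstab(toy["category"], toy["state"])
print(counts)

# normalize="index": each row sums to 1 (state mix within a category)
row_share = pd.crosstab(toy["category"], toy["state"], normalize="index")

# normalize="columns": each column sums to 1 (category mix within a state)
col_share = pd.crosstab(toy["category"], toy["state"], normalize="columns")

print(row_share.round(2))
print(col_share.round(2))
```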

In [None]:
# Basic crosstab: row counts by category and state
# (data has one row per order item, so these are order-item counts)
category_state_crosstab = pd.crosstab(
    data['product_category_name'],
    data['customer_state']
)

print("=== Order Frequency by Category and State ===")
print(category_state_crosstab.head(10))

In [None]:
# Crosstab with percentages
category_state_pct = pd.crosstab(
    data['product_category_name'],
    data['customer_state'],
    normalize='columns'  # Percentage within each state
) * 100

print("=== Category Percentage by State ===")
print(category_state_pct.round(2).head(10))

In [None]:
# Crosstab with values (like pivot_table)
category_state_revenue = pd.crosstab(
    data['product_category_name'],
    data['customer_state'],
    values=data['total_price'],
    aggfunc='sum'
).fillna(0)

print("=== Revenue Crosstab ===")
print(category_state_revenue.head(10))

# Heatmap visualization
plt.figure(figsize=(12, 8))
sns.heatmap(category_state_revenue, annot=False, fmt='.0f', cmap='YlOrRd', cbar_kws={'label': 'Revenue (₦)'})
plt.title('Revenue Heatmap: Category vs State', fontsize=14, fontweight='bold')
plt.xlabel('State')
plt.ylabel('Product Category')
plt.tight_layout()
plt.show()

## Part 5: Product-Customer Purchase Matrix

Create a matrix showing which customers bought which product categories.
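
One way to read such a matrix, sketched here on made-up purchases (the customer IDs and categories are assumptions for illustration), is to turn it into co-purchase counts: multiplying the transposed binary matrix by itself gives, for every pair of categories, the number of customers who bought both. This is an alternative view to correlation, not the notebook's own method.

```python
import pandas as pd

# Hypothetical purchases: one row per (customer, category) order line
toy = pd.DataFrame({
    "customer_id": ["c1", "c1", "c2", "c2", "c3", "c3", "c3"],
    "category":    ["books", "toys", "books", "toys", "books", "books", "garden"],
})

counts = pd.crosstab(toy["customer_id"], toy["category"])  # purchase counts
binary = (counts > 0).astype(int)                          # 1 = bought at least once

# Rows and columns are categories; entry (a, b) is the number of customers
# who bought both a and b. The diagonal holds the number of buyers per category.
co_purchase = binary.T @ binary
print(co_purchase)
```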

In [None]:
# Customer-category purchase matrix
customer_category_matrix = pd.crosstab(
    data['customer_id'],
    data['product_category_name']
)

print("=== Customer-Category Purchase Matrix ===")
print(customer_category_matrix.head(10))
print(f"\nMatrix shape: {customer_category_matrix.shape}")

In [None]:
# Convert to binary (1 if purchased, 0 if not)
customer_category_binary = (customer_category_matrix > 0).astype(int)

print("=== Binary Purchase Matrix ===")
print(customer_category_binary.head(10))

In [None]:
# Analyze category co-purchases
# .corr() correlates columns, and the columns of the binary matrix are
# categories, so no transpose is needed for a category-category matrix
category_correlation = customer_category_binary.corr()

top_10_categories = customer_category_matrix.sum().nlargest(10).index
print("=== Top 10 Categories for Analysis ===")
print(top_10_categories.tolist())
category_corr_subset = customer_category_binary[top_10_categories].corr()

# Visualize correlation
plt.figure(figsize=(10, 8))
sns.heatmap(category_corr_subset, annot=True, fmt='.2f', cmap='coolwarm', center=0,
            square=True, linewidths=0.5, cbar_kws={'label': 'Correlation'})
plt.title('Product Category Co-Purchase Correlation', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

In [None]:
# Calculate category diversity per customer
customer_diversity = customer_category_binary.sum(axis=1)

print("=== Customer Category Diversity ===")
print(customer_diversity.value_counts().sort_index())

# Visualize
customer_diversity.value_counts().sort_index().plot(kind='bar', color='teal')
plt.title('Distribution of Categories Purchased per Customer', fontsize=14, fontweight='bold')
plt.xlabel('Number of Different Categories')
plt.ylabel('Number of Customers')
plt.tight_layout()
plt.show()

print(f"\nAverage categories per customer: {customer_diversity.mean():.2f}")
print(f"Max categories purchased by one customer: {customer_diversity.max()}")

## Part 6: Advanced Multi-Index Operations

Working with complex hierarchical data structures.
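
Two selection tools are worth knowing for deeper hierarchies: `.xs()` for taking a cross-section on a named level, and `pd.IndexSlice` for slice syntax across levels. The sketch below uses a made-up three-level index (state, city, and month labels are assumptions for illustration):

```python
import pandas as pd

# Hypothetical revenue indexed by (state, city, month)
idx = pd.MultiIndex.from_product(
    [["AB", "LA"], ["city_a", "city_b"], ["2024-01", "2024-02"]],
    names=["state", "city", "month"],
)
rev = pd.DataFrame({"revenue": [float(v) for v in range(1, 9)]}, index=idx)

# .xs() selects one label from a named level and drops that level
jan = rev.xs("2024-01", level="month")

# pd.IndexSlice gives slice syntax across levels (the index should be sorted)
ix = pd.IndexSlice
la_feb = rev.loc[ix["LA", :, "2024-02"], :]

# Aggregate over one level by grouping on it
by_state = rev.groupby(level="state")["revenue"].sum()

print(jan, la_feb, by_state, sep="\n\n")
```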

In [None]:
# Create complex multi-index pivot
complex_pivot = data.pivot_table(
    values='total_price',
    index=['customer_state', 'customer_city'],
    columns=['order_month'],
    aggfunc='sum',
    fill_value=0
)

print("=== Complex Multi-Index Pivot ===")
print(complex_pivot.head(15))

In [None]:
# Reset index to flatten
complex_pivot_flat = complex_pivot.reset_index()

print("=== Flattened Pivot ===")
print(complex_pivot_flat.head(10))

In [None]:
# Swap index levels
sales_city_category = data.groupby(['customer_city', 'product_category_name'])['total_price'].sum()

print("=== Original Multi-Index ===")
print(sales_city_category.head(10))

# Swap levels
sales_category_city = sales_city_category.swaplevel()

print("\n=== After Swapping Levels ===")
print(sales_category_city.head(10))

# Sort by new index
sales_category_city_sorted = sales_category_city.sort_index()
print("\n=== Sorted by New Index ===")
print(sales_category_city_sorted.head(10))

## Part 7: Business Insights from Multi-Index Data

In [None]:
# Identify top-performing state-city combinations
location_performance = data.groupby(['customer_state', 'customer_city']).agg({
    'total_price': 'sum',
    'order_id': 'nunique',
    'customer_id': 'nunique'
}).rename(columns={
    'total_price': 'revenue',
    'order_id': 'orders',
    'customer_id': 'customers'
})

# Calculate average order value
location_performance['avg_order_value'] = (location_performance['revenue'] / 
                                            location_performance['orders']).round(2)

print("=== Top 10 Locations by Revenue ===")
print(location_performance.sort_values('revenue', ascending=False).head(10))

In [None]:
# Focus on Nigerian locations
nigerian_states = ['LA', 'AB', 'PH', 'KA']
nigerian_performance = location_performance.loc[nigerian_states]

print("=== Nigerian Location Performance ===")
print(nigerian_performance.sort_values('revenue', ascending=False))

# Unstack for visualization
nigerian_revenue = nigerian_performance['revenue'].unstack(level=0)
nigerian_revenue.plot(kind='bar', stacked=True, figsize=(12, 6))
plt.title('Revenue Distribution Across Nigerian Cities', fontsize=14, fontweight='bold')
plt.xlabel('City')
plt.ylabel('Revenue (₦)')
plt.legend(title='State', bbox_to_anchor=(1.05, 1))
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

## Key Takeaways

### Multi-Index Concepts
1. **Creation:** Automatically from `groupby()` with multiple columns
2. **Navigation:** Use `.loc[]` with tuples or slices
3. **Manipulation:** `swaplevel()`, `reset_index()`, `set_index()`
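
As a quick reminder of how `set_index()` and `reset_index()` mirror each other, here is a minimal sketch on a made-up frame (the column names and values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical flat table with one row per (state, city)
flat = pd.DataFrame({
    "state":   ["LA", "LA", "AB"],
    "city":    ["city_a", "city_b", "city_c"],
    "revenue": [120.0, 45.0, 80.0],
})

# Columns -> index levels; sorting keeps label-based slicing efficient
indexed = flat.set_index(["state", "city"]).sort_index()

# Index levels -> ordinary columns again
restored = indexed.reset_index()

print(indexed.index.names)
print(restored.columns.tolist())
```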

### Reshaping Operations
1. **Unstack:** Convert row index to columns (wide format)
2. **Stack:** Convert columns to row index (long format)
3. **Pivot Table:** Aggregate and reshape simultaneously
4. **Crosstab:** Count frequencies and calculate proportions

### Business Applications
1. **Purchase Matrices:** Identify customer-product relationships
2. **Correlation Analysis:** Find product categories frequently bought together
3. **Location Analytics:** Compare performance across dimensions
4. **Time Series Analysis:** Track metrics over multiple dimensions

### Best Practices
1. Use `.fillna(0)` after unstacking (or pass `fill_value=0` to `.unstack()`) when a missing combination genuinely means zero
2. Reset index to flatten for easier manipulation
3. Use `margins=True` in pivot tables for row/column totals
4. Visualize with heatmaps for pattern recognition

## Next Steps
1. Complete `exercises/exercise-02-pivoting-data.ipynb`
2. Review `resources/multi-index-guide.md` for quick reference
3. Continue to Notebook 03: Advanced Aggregations

---
**PORA Academy Cohort 5 - Week 9 Wednesday Python**  
*Customer Lifetime Value Analysis*