# Data Reshaping - Part 2: Melt and Pivot Operations

## Week 4, Day 1 (Wednesday) - April 30th, 2025

### Overview
This is the second part of our Data Reshaping session, focusing on transforming data between wide and long formats. Understanding how to reshape data is crucial for data analysis, visualization, and reporting - especially when working with time series data and creating pivot tables.

### Learning Objectives
- Understand the difference between wide and long format data
- Master `pd.melt()` for converting wide to long format
- Learn `pd.pivot()` and `pd.pivot_table()` for converting long to wide format
- Apply reshaping techniques to e-commerce scenarios
- Create summary tables and reports using pivot operations
- Handle common reshaping challenges

### Prerequisites
- Completed Part 1: Merge, Join, and Concatenate
- Understanding of Pandas DataFrames
- Basic knowledge of SQL GROUP BY operations

## 1. Introduction to Data Reshaping

### Wide vs Long Format Data

**Wide Format**: The values of one variable are spread across several columns (for example, one column per month)
- Easy to read by humans
- Common in spreadsheets
- Good for summary tables

**Long Format**: Each observation is a row
- Better for analysis and visualization
- Required by many statistical functions
- Closer to a normalized database table

Think of it as the difference between a crosstab and a detailed transaction log.
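
As a minimal sketch of that contrast, the same four numbers can be stored either way. The product names and figures below are illustrative only and are not part of the lesson's datasets.

```python
import pandas as pd

# Wide: one row per product, one column per month (illustrative numbers)
wide = pd.DataFrame({
    'product': ['Laptop', 'Tablet'],
    'Jan': [15, 12],
    'Feb': [18, 15],
})

# Long: one row per (product, month) observation, same values
long = pd.DataFrame({
    'product': ['Laptop', 'Laptop', 'Tablet', 'Tablet'],
    'month': ['Jan', 'Feb', 'Jan', 'Feb'],
    'units': [15, 18, 12, 15],
})

print(wide)
print(long)
```

Both frames carry identical information; only the layout differs.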

In [None]:
# Import libraries
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

# Set display options for better readability
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

print("Libraries imported successfully!")

## 2. Creating Sample E-commerce Data

Let's create realistic e-commerce datasets to demonstrate reshaping operations:

In [None]:
# Monthly sales data - WIDE FORMAT
# Each month is a separate column
monthly_sales_wide = pd.DataFrame({
    'product_id': ['PROD001', 'PROD002', 'PROD003', 'PROD004', 'PROD005'],
    'product_name': ['Laptop', 'Smartphone', 'Tablet', 'Headphones', 'Monitor'],
    'Jan_2025': [15, 25, 12, 8, 18],
    'Feb_2025': [18, 30, 15, 12, 22],
    'Mar_2025': [22, 28, 18, 15, 25],
    'Apr_2025': [20, 32, 20, 18, 28]
})

print("Monthly Sales Data (WIDE FORMAT):")
print(monthly_sales_wide)
print(f"\nShape: {monthly_sales_wide.shape}")
print("Notice: Each month is a separate column")

In [None]:
# Daily sales data - LONG FORMAT
# Each observation (product-date combination) is a separate row
np.random.seed(42)  # For reproducible results
dates = pd.date_range('2025-01-01', '2025-01-10', freq='D')
products = ['PROD001', 'PROD002', 'PROD003']

daily_sales_long = pd.DataFrame([
    {'date': date, 'product_id': product, 'sales_qty': np.random.randint(1, 10)}
    for date in dates
    for product in products
])

print("Daily Sales Data (LONG FORMAT):")
print(daily_sales_long.head(15))
print(f"\nShape: {daily_sales_long.shape}")
print("Notice: Each product-date combination is a separate row")

## 3. Melting Data (Wide to Long Format)

### What is Melting?
`pd.melt()` transforms wide format data to long format by:
- Taking column names and making them values in a new column
- Moving the corresponding values to another new column
- Creating one row for each combination of identifier row and melted column

### Basic Melting Example

In [None]:
# Melt the monthly sales data from wide to long format
monthly_sales_melted = pd.melt(monthly_sales_wide, 
                              id_vars=['product_id', 'product_name'],  # Columns to keep as identifiers
                              value_vars=['Jan_2025', 'Feb_2025', 'Mar_2025', 'Apr_2025'],  # Columns to melt
                              var_name='month',  # Name for the new column containing old column names
                              value_name='sales_qty')  # Name for the new column containing values

print("Melted Monthly Sales Data (LONG FORMAT):")
print(monthly_sales_melted.head(12))
print(f"\nOriginal shape: {monthly_sales_wide.shape}")
print(f"Melted shape: {monthly_sales_melted.shape}")
print("\nNotice: 5 products × 4 months = 20 rows!")

### Simplified Melting
You can also melt without specifying all parameters:

In [None]:
# Simplified melt - with value_vars omitted, every column not listed in id_vars is melted
simple_melt = pd.melt(monthly_sales_wide, 
                     id_vars=['product_id', 'product_name'])

print("Simplified Melt (default column names):")
print(simple_melt.head(8))
print("\nDefault column names: 'variable' and 'value'")
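
The opposite case is also worth seeing: when `value_vars` is given, any column that appears in neither `id_vars` nor `value_vars` is dropped from the result. The small frame below is a cut-down, illustrative copy of the monthly data, not the lesson's own variable.

```python
import pandas as pd

sales = pd.DataFrame({
    'product_id': ['PROD001', 'PROD002'],
    'product_name': ['Laptop', 'Smartphone'],
    'Jan_2025': [15, 25],
    'Feb_2025': [18, 30],
    'Mar_2025': [22, 28],
})

# Only Jan and Feb are melted. product_name and Mar_2025 are listed
# nowhere, so they do not appear in the output.
subset_melt = pd.melt(sales,
                      id_vars=['product_id'],
                      value_vars=['Jan_2025', 'Feb_2025'])

print(subset_melt)
```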

### Advanced Melting with Data Cleaning

In [None]:
# Let's clean up the melted data for better analysis
monthly_sales_clean = monthly_sales_melted.copy()

# Convert month column to datetime
monthly_sales_clean['month'] = pd.to_datetime(monthly_sales_clean['month'], format='%b_%Y')

# Sort by product and month
monthly_sales_clean = monthly_sales_clean.sort_values(['product_id', 'month']).reset_index(drop=True)

print("Cleaned Melted Data:")
print(monthly_sales_clean.head(12))
print(f"\nMonth column data type: {monthly_sales_clean['month'].dtype}")
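
One payoff of a sorted long table with a real datetime column is that per-product calculations become one-liners. The sketch below computes a month-over-month change; the column name `mom_change` is a made-up label for illustration, and the data is a two-product excerpt of the figures above.

```python
import pandas as pd

long_sales = pd.DataFrame({
    'product_id': ['PROD001'] * 3 + ['PROD002'] * 3,
    'month': pd.to_datetime(['Jan_2025', 'Feb_2025', 'Mar_2025'] * 2, format='%b_%Y'),
    'sales_qty': [15, 18, 22, 25, 30, 28],
})

# Sort so that diff() compares consecutive months within each product
long_sales = long_sales.sort_values(['product_id', 'month']).reset_index(drop=True)

# First month of each product has no previous month, so it is NaN
long_sales['mom_change'] = long_sales.groupby('product_id')['sales_qty'].diff()

print(long_sales)
```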

## 4. Pivoting Data (Long to Wide Format)

### What is Pivoting?
`pd.pivot()` transforms long format data to wide format by:
- Using unique values from one column as new column headers
- Using values from another column to populate the new columns
- Reshaping only, with no aggregation, so each index-column pair must be unique

### Basic Pivot Example

In [None]:
# Pivot the daily sales data from long to wide format
daily_sales_pivot = daily_sales_long.pivot(index='date', 
                                          columns='product_id', 
                                          values='sales_qty')

print("Daily Sales Data Pivoted (WIDE FORMAT):")
print(daily_sales_pivot)
print(f"\nOriginal shape: {daily_sales_long.shape}")
print(f"Pivoted shape: {daily_sales_pivot.shape}")
print("\nNotice: Dates are now rows, products are columns!")
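
A pivoted frame keeps `date` in the index and carries the label `product_id` on the column axis. If a plain table is needed (for example before exporting to CSV), one possible cleanup is sketched below on a small illustrative dataset.

```python
import pandas as pd

long_df = pd.DataFrame({
    'date': pd.to_datetime(['2025-01-01', '2025-01-01', '2025-01-02', '2025-01-02']),
    'product_id': ['PROD001', 'PROD002', 'PROD001', 'PROD002'],
    'sales_qty': [5, 3, 7, 2],
})

wide_df = long_df.pivot(index='date', columns='product_id', values='sales_qty')

# Move 'date' back to a regular column and remove the 'product_id'
# label that pivot() leaves on the column axis
flat_df = wide_df.reset_index().rename_axis(columns=None)

print(flat_df)
```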

### Handling Missing Values in Pivots

In [None]:
# Create data with missing combinations
sparse_sales = pd.DataFrame({
    'date': ['2025-01-01', '2025-01-01', '2025-01-02', '2025-01-03'],
    'product_id': ['PROD001', 'PROD002', 'PROD001', 'PROD003'],
    'sales_qty': [5, 3, 7, 2]
})

print("Sparse Sales Data (missing combinations):")
print(sparse_sales)

# Pivot with missing values
sparse_pivot = sparse_sales.pivot(index='date', columns='product_id', values='sales_qty')
print("\nPivoted with NaN for missing combinations:")
print(sparse_pivot)

# Fill missing values
sparse_pivot_filled = sparse_pivot.fillna(0)
print("\nPivoted with 0 for missing combinations:")
print(sparse_pivot_filled)
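
Because `NaN` is a floating-point value, the pivoted columns become `float64` even though the inputs were integers, and `fillna(0)` does not change that. A short sketch of restoring integer columns follows; it assumes that a missing combination really means zero units sold, which is a business decision rather than something pandas can know.

```python
import pandas as pd

sparse = pd.DataFrame({
    'date': ['2025-01-01', '2025-01-01', '2025-01-02', '2025-01-03'],
    'product_id': ['PROD001', 'PROD002', 'PROD001', 'PROD003'],
    'sales_qty': [5, 3, 7, 2],
})

pivoted = sparse.pivot(index='date', columns='product_id', values='sales_qty')
print(pivoted.dtypes)   # float columns, because NaN forces a float dtype

# Assumption: missing means "no sales", so 0 is a valid replacement
filled = pivoted.fillna(0).astype(int)
print(filled.dtypes)
print(filled)
```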

## 5. Pivot Tables with Aggregation

### When to Use pivot_table() vs pivot()
- Use `pivot()` when each index-column combination has exactly one value
- Use `pivot_table()` when you need aggregation (like SQL GROUP BY)
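
One way to decide between the two is to test the key columns for duplicates first. The sketch below uses `DataFrame.duplicated()` on a small illustrative frame; choosing `'sum'` as the aggregation is an assumption that fits order quantities but not every metric.

```python
import pandas as pd

orders = pd.DataFrame({
    'date': ['2025-01-01', '2025-01-01', '2025-01-02'],
    'product_id': ['PROD001', 'PROD001', 'PROD002'],
    'sales_qty': [2, 3, 1],
})

keys = ['date', 'product_id']
has_duplicates = orders.duplicated(subset=keys).any()

if has_duplicates:
    # More than one row per (date, product_id): an aggregation is required
    result = orders.pivot_table(index='date', columns='product_id',
                                values='sales_qty', aggfunc='sum')
else:
    # Exactly one row per pair: a plain reshape is enough
    result = orders.pivot(index='date', columns='product_id', values='sales_qty')

print(f"Duplicates found: {has_duplicates}")
print(result)
```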

### Creating Comprehensive Sales Data

In [None]:
# Create sales data with multiple entries per product-date combination
# (Like multiple orders for the same product on the same day)
detailed_sales = pd.DataFrame({
    'date': ['2025-01-01', '2025-01-01', '2025-01-01', '2025-01-01', '2025-01-02', '2025-01-02'],
    'product_id': ['PROD001', 'PROD001', 'PROD002', 'PROD003', 'PROD001', 'PROD002'],
    'customer_id': ['CUST001', 'CUST002', 'CUST001', 'CUST003', 'CUST001', 'CUST002'],
    'sales_qty': [2, 3, 1, 4, 2, 1],
    'sales_amount': [100, 150, 50, 200, 100, 50]
})

print("Detailed Sales Data (multiple entries per product-date):")
print(detailed_sales)

# This would fail with regular pivot() due to duplicate index-column combinations
# detailed_pivot = detailed_sales.pivot(index='date', columns='product_id', values='sales_qty')
# ValueError: Index contains duplicate entries, cannot reshape

### Using pivot_table() for Aggregation

In [None]:
# Use pivot_table with aggregation function
sales_pivot_table = pd.pivot_table(detailed_sales,
                                  index='date',
                                  columns='product_id',
                                  values='sales_qty',
                                  aggfunc='sum',  # Aggregate function
                                  fill_value=0)   # Fill missing values

print("Sales Pivot Table (aggregated):")
print(sales_pivot_table)
print("\nThis is like: SELECT date, product_id, SUM(sales_qty) FROM sales GROUP BY date, product_id")
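
`pivot_table()` can also append row and column totals through its `margins` parameter. The sketch below rebuilds the detailed sales rows so it runs on its own; the label `'Total'` is an arbitrary choice passed via `margins_name`.

```python
import pandas as pd

detailed = pd.DataFrame({
    'date': ['2025-01-01', '2025-01-01', '2025-01-01', '2025-01-01', '2025-01-02', '2025-01-02'],
    'product_id': ['PROD001', 'PROD001', 'PROD002', 'PROD003', 'PROD001', 'PROD002'],
    'sales_qty': [2, 3, 1, 4, 2, 1],
})

with_totals = pd.pivot_table(detailed,
                             index='date',
                             columns='product_id',
                             values='sales_qty',
                             aggfunc='sum',
                             fill_value=0,
                             margins=True,          # add a totals row and column
                             margins_name='Total')

print(with_totals)
```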

### Multiple Aggregation Functions

In [None]:
# Multiple aggregation functions
multi_agg_pivot = pd.pivot_table(detailed_sales,
                                index='date',
                                columns='product_id',
                                values='sales_qty',
                                aggfunc=['sum', 'mean', 'count'],
                                fill_value=0)

print("Multi-Aggregation Pivot Table:")
print(multi_agg_pivot)
print("\nNotice the hierarchical column structure!")
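
Hierarchical columns can be awkward to export or to select with plain strings. One common workaround, sketched here, is to join the levels into single names. The `agg_product` naming pattern is just one possible convention.

```python
import pandas as pd

detailed = pd.DataFrame({
    'date': ['2025-01-01', '2025-01-01', '2025-01-01', '2025-01-02'],
    'product_id': ['PROD001', 'PROD001', 'PROD002', 'PROD001'],
    'sales_qty': [2, 3, 1, 2],
})

multi = pd.pivot_table(detailed, index='date', columns='product_id',
                       values='sales_qty', aggfunc=['sum', 'mean'], fill_value=0)

# Each column label is a tuple such as ('sum', 'PROD001'); join it with '_'
flat = multi.copy()
flat.columns = ['_'.join(map(str, col)) for col in flat.columns]

print(flat)
```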

### Multiple Values in Pivot Table

In [None]:
# Pivot table with multiple value columns
multi_value_pivot = pd.pivot_table(detailed_sales,
                                  index='date',
                                  columns='product_id',
                                  values=['sales_qty', 'sales_amount'],
                                  aggfunc='sum',
                                  fill_value=0)

print("Multi-Value Pivot Table:")
print(multi_value_pivot)
print("\nBoth sales_qty and sales_amount are pivoted!")
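
When the value columns need different aggregations, `aggfunc` also accepts a dictionary that maps each column to its function. The sketch below sums quantities but averages amounts; that pairing is only an example.

```python
import pandas as pd

detailed = pd.DataFrame({
    'date': ['2025-01-01', '2025-01-01', '2025-01-01', '2025-01-02'],
    'product_id': ['PROD001', 'PROD001', 'PROD002', 'PROD001'],
    'sales_qty': [2, 3, 1, 2],
    'sales_amount': [100, 150, 50, 100],
})

per_column = pd.pivot_table(detailed,
                            index='date',
                            columns='product_id',
                            values=['sales_qty', 'sales_amount'],
                            aggfunc={'sales_qty': 'sum',       # total units
                                     'sales_amount': 'mean'},  # average order amount
                            fill_value=0)

print(per_column)
```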

## 6. Real-World E-commerce Examples

### Example 1: Customer Purchase Patterns

In [None]:
# Create customer purchase data
customer_purchases = pd.DataFrame({
    'customer_id': ['CUST001', 'CUST001', 'CUST002', 'CUST002', 'CUST003', 'CUST003'],
    'category': ['Electronics', 'Books', 'Electronics', 'Clothing', 'Books', 'Electronics'],
    'purchase_amount': [299.99, 19.99, 499.99, 79.99, 29.99, 199.99],
    'purchase_count': [1, 2, 1, 3, 1, 1]
})

print("Customer Purchase Data:")
print(customer_purchases)

# Create customer-category spending matrix
customer_category_spending = pd.pivot_table(customer_purchases,
                                           index='customer_id',
                                           columns='category',
                                           values='purchase_amount',
                                           aggfunc='sum',
                                           fill_value=0)

print("\nCustomer-Category Spending Matrix:")
print(customer_category_spending)
print("\nThis shows how much each customer spent in each category")
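
A spending matrix like this is often easier to compare as shares than as absolute amounts. The sketch below divides each row by its total, reusing the same purchase records so it can run independently.

```python
import pandas as pd

purchases = pd.DataFrame({
    'customer_id': ['CUST001', 'CUST001', 'CUST002', 'CUST002', 'CUST003', 'CUST003'],
    'category': ['Electronics', 'Books', 'Electronics', 'Clothing', 'Books', 'Electronics'],
    'purchase_amount': [299.99, 19.99, 499.99, 79.99, 29.99, 199.99],
})

spending = pd.pivot_table(purchases, index='customer_id', columns='category',
                          values='purchase_amount', aggfunc='sum', fill_value=0)

# Divide every row by that customer's total so each row sums to 1
share = spending.div(spending.sum(axis=1), axis=0)

print((share * 100).round(1))
```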

### Example 2: Product Performance Dashboard

In [None]:
# Create product performance data
product_metrics = pd.DataFrame({
    'product_id': np.repeat(['PROD001', 'PROD002', 'PROD003'], 4),  # 4 consecutive metric rows per product
    'metric': ['revenue', 'orders', 'returns', 'reviews'] * 3,
    'value': [15000, 50, 2, 45, 12000, 40, 1, 38, 8000, 25, 3, 22]
})

print("Product Metrics Data (LONG FORMAT):")
print(product_metrics)

# Pivot to create a product dashboard
product_dashboard = pd.pivot_table(product_metrics,
                                  index='product_id',
                                  columns='metric',
                                  values='value',
                                  aggfunc='first')  # Use 'first' since each combination has one value

print("\nProduct Performance Dashboard:")
print(product_dashboard)

# Add calculated metrics
product_dashboard['avg_order_value'] = product_dashboard['revenue'] / product_dashboard['orders']
product_dashboard['return_rate'] = product_dashboard['returns'] / product_dashboard['orders'] * 100

print("\nDashboard with Calculated Metrics:")
print(product_dashboard.round(2))

## 7. Advanced Reshaping Techniques

### Melting Multiple Value Columns

In [None]:
# Create data with multiple value columns
quarterly_data = pd.DataFrame({
    'product_id': ['PROD001', 'PROD002', 'PROD003'],
    'Q1_revenue': [10000, 8000, 6000],
    'Q1_orders': [50, 40, 30],
    'Q2_revenue': [12000, 9000, 7000],
    'Q2_orders': [60, 45, 35]
})

print("Quarterly Data (WIDE FORMAT):")
print(quarterly_data)

# Melt to long format
quarterly_melted = pd.melt(quarterly_data,
                          id_vars=['product_id'],
                          var_name='quarter_metric',
                          value_name='value')

print("\nQuarterly Data Melted:")
print(quarterly_melted)

# Split the quarter_metric column to separate quarter and metric
quarterly_melted[['quarter', 'metric']] = quarterly_melted['quarter_metric'].str.split('_', expand=True)
quarterly_melted = quarterly_melted.drop('quarter_metric', axis=1)

print("\nCleaned Quarterly Data:")
print(quarterly_melted)
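
After the split, `revenue` and `orders` still share one `value` column. A possible final step, sketched below, pivots the `metric` column back out so each product-quarter row has both measures side by side. Passing a list to `index` assumes pandas 1.1 or newer.

```python
import pandas as pd

quarterly = pd.DataFrame({
    'product_id': ['PROD001', 'PROD002'],
    'Q1_revenue': [10000, 8000],
    'Q1_orders': [50, 40],
    'Q2_revenue': [12000, 9000],
    'Q2_orders': [60, 45],
})

long_q = quarterly.melt(id_vars='product_id', var_name='quarter_metric')
long_q[['quarter', 'metric']] = long_q['quarter_metric'].str.split('_', expand=True)

# One row per (product, quarter), one column per metric
tidy = (long_q
        .pivot(index=['product_id', 'quarter'], columns='metric', values='value')
        .reset_index()
        .rename_axis(columns=None))

print(tidy)
```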

### Stack and Unstack Operations
Alternative methods for reshaping data:

In [None]:
# Using stack() and unstack() as alternatives to melt() and pivot()
# Start with our monthly sales data
monthly_indexed = monthly_sales_wide.set_index(['product_id', 'product_name'])

print("Monthly Sales with MultiIndex:")
print(monthly_indexed)

# Stack to convert columns to rows (like melt)
stacked = monthly_indexed.stack()
print("\nStacked (columns to rows):")
print(stacked.head(8))
print(f"Type: {type(stacked)}")

# Unstack to convert rows to columns (like pivot)
unstacked = stacked.unstack()
print("\nUnstacked (back to original):")
print(unstacked)
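
To make the "like melt" comparison concrete, the sketch below turns a stacked Series into the same table that `melt()` produces. The two results differ only in row order (stack goes product by product, melt goes column by column), so they are sorted before comparing.

```python
import pandas as pd

wide = pd.DataFrame({
    'product_id': ['PROD001', 'PROD002'],
    'Jan_2025': [15, 25],
    'Feb_2025': [18, 30],
})

# stack() returns a Series with a (product_id, column label) MultiIndex
stacked = wide.set_index('product_id').stack()
via_stack = stacked.rename_axis(['product_id', 'month']).reset_index(name='sales_qty')

via_melt = wide.melt(id_vars='product_id', var_name='month', value_name='sales_qty')

sort_cols = ['product_id', 'month']
same = (via_stack.sort_values(sort_cols).values.tolist()
        == via_melt.sort_values(sort_cols).values.tolist())
print(f"Same rows after sorting: {same}")
```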

## 8. Handling Complex Reshaping Scenarios

### Reshaping with Hierarchical Columns

In [None]:
# Create data with hierarchical column structure
hierarchical_data = pd.DataFrame({
    ('Sales', 'Q1'): [100, 120, 90],
    ('Sales', 'Q2'): [110, 130, 95],
    ('Revenue', 'Q1'): [10000, 12000, 9000],
    ('Revenue', 'Q2'): [11000, 13000, 9500]
}, index=['PROD001', 'PROD002', 'PROD003'])

# Set proper column names
hierarchical_data.columns = pd.MultiIndex.from_tuples(hierarchical_data.columns, names=['Metric', 'Quarter'])
hierarchical_data.index.name = 'Product'

print("Hierarchical Column Data:")
print(hierarchical_data)

# Stack the innermost column level (Quarter) into the row index
hierarchical_stacked = hierarchical_data.stack()
print("\nStacked Hierarchical Data:")
print(hierarchical_stacked)

# Reset index to get a flat structure
flat_hierarchical = hierarchical_stacked.reset_index()
print("\nFlat Structure:")
print(flat_hierarchical)
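
`stack()` also accepts a `level` argument, so both column levels can be moved into rows at once to get a fully long table. This is a sketch on a two-product version of the data; recent pandas releases may print a `FutureWarning` about the stack implementation, which does not affect the result here.

```python
import pandas as pd

data = pd.DataFrame(
    [[100, 110, 10000, 11000],
     [120, 130, 12000, 13000]],
    index=pd.Index(['PROD001', 'PROD002'], name='Product'),
    columns=pd.MultiIndex.from_product([['Sales', 'Revenue'], ['Q1', 'Q2']],
                                       names=['Metric', 'Quarter']),
)

# Stack both column levels: one row per (Product, Metric, Quarter)
fully_long = data.stack(level=['Metric', 'Quarter']).reset_index(name='value')

print(fully_long)
```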

## 9. SQL Equivalents and Comparisons

### Pivot Table vs SQL GROUP BY

In [None]:
# Create sales data for SQL comparison
sales_data = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South', 'East', 'East'],
    'product': ['Laptop', 'Phone', 'Laptop', 'Phone', 'Laptop', 'Phone'],
    'sales': [100, 150, 120, 180, 90, 160]
})

print("Sales Data:")
print(sales_data)

# SQL equivalent: SELECT region, product, SUM(sales) FROM sales GROUP BY region, product
sql_equivalent = sales_data.groupby(['region', 'product'])['sales'].sum().reset_index()
print("\nSQL GROUP BY equivalent:")
print(sql_equivalent)

# Pivot table for cross-tabulation
# SQL equivalent: conditional aggregation, e.g. SUM(CASE WHEN product = 'Laptop' THEN sales END) per product
pivot_crosstab = pd.pivot_table(sales_data,
                               index='region',
                               columns='product',
                               values='sales',
                               aggfunc='sum',
                               fill_value=0)

print("\nPivot Table (Cross-tabulation):")
print(pivot_crosstab)
print("\nIn SQL this takes one SUM(CASE WHEN ...) expression per product column, or a dialect-specific PIVOT clause")
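
For this kind of cross-tabulation pandas also offers `pd.crosstab()`, which takes the columns themselves rather than a DataFrame plus column names. The sketch below shows that it gives the same numbers as `pivot_table()` on the region data.

```python
import pandas as pd

sales = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South', 'East', 'East'],
    'product': ['Laptop', 'Phone', 'Laptop', 'Phone', 'Laptop', 'Phone'],
    'sales': [100, 150, 120, 180, 90, 160],
})

via_crosstab = pd.crosstab(index=sales['region'],
                           columns=sales['product'],
                           values=sales['sales'],
                           aggfunc='sum')

via_pivot = sales.pivot_table(index='region', columns='product',
                              values='sales', aggfunc='sum')

print(via_crosstab)
print(f"Same values: {(via_crosstab.values == via_pivot.values).all()}")
```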

## 10. Performance Tips and Best Practices

### When to Use Each Method

In [None]:
# Performance comparison for different reshaping methods
import time

# Create larger dataset for timing
large_data = pd.DataFrame({
    'id': np.repeat(np.arange(1000), 4),      # repeat each id 4 times so it pairs with every category
    'category': ['A', 'B', 'C', 'D'] * 1000,
    'value': np.random.randn(4000)
})

print(f"Large dataset shape: {large_data.shape}")

# Time pivot_table (perf_counter is a high-resolution clock meant for timing)
start_time = time.perf_counter()
pivot_result = pd.pivot_table(large_data, index='id', columns='category', values='value', aggfunc='mean')
pivot_time = time.perf_counter() - start_time

# Time groupby + unstack
start_time = time.perf_counter()
groupby_result = large_data.groupby(['id', 'category'])['value'].mean().unstack()
groupby_time = time.perf_counter() - start_time

print(f"\nPivot table time: {pivot_time:.4f} seconds")
print(f"GroupBy + unstack time: {groupby_time:.4f} seconds")

# Verify results are the same
print(f"\nResults are equal: {pivot_result.equals(groupby_result)}")

print("\nPerformance Tips:")
print("1. groupby + unstack is often faster on large datasets; measure on your own data")
print("2. Use pivot_table for readability and multiple aggregations")
print("3. Consider memory usage when reshaping large datasets")
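
A single timed run is sensitive to whatever else the machine is doing. As a rough sketch of a steadier comparison, the standard-library `timeit` module can repeat each approach and keep the best time. The function names and the repeat counts below are arbitrary choices, and results will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
data = pd.DataFrame({
    'id': np.repeat(np.arange(1000), 4),
    'category': np.tile(['A', 'B', 'C', 'D'], 1000),
    'value': rng.standard_normal(4000),
})

def with_pivot_table():
    return data.pivot_table(index='id', columns='category', values='value', aggfunc='mean')

def with_groupby():
    return data.groupby(['id', 'category'])['value'].mean().unstack()

# Run each function 5 times per trial, 3 trials, and keep the fastest trial
pivot_best = min(timeit.repeat(with_pivot_table, number=5, repeat=3))
groupby_best = min(timeit.repeat(with_groupby, number=5, repeat=3))

print(f"pivot_table:       {pivot_best:.4f} s for 5 runs")
print(f"groupby + unstack: {groupby_best:.4f} s for 5 runs")
```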

## 11. Common Issues and Solutions

### Issue 1: Duplicate Values in Pivot

In [None]:
# Create data with duplicates
duplicate_data = pd.DataFrame({
    'date': ['2025-01-01', '2025-01-01', '2025-01-02'],
    'product': ['A', 'A', 'B'],  # Duplicate A on same date
    'sales': [10, 15, 20]
})

print("Data with duplicates:")
print(duplicate_data)

# This will fail with regular pivot()
try:
    bad_pivot = duplicate_data.pivot(index='date', columns='product', values='sales')
except ValueError as e:
    print(f"\nError with pivot(): {e}")

# Solution: Use pivot_table with aggregation
good_pivot = pd.pivot_table(duplicate_data,
                           index='date',
                           columns='product',
                           values='sales',
                           aggfunc='sum')  # Aggregate duplicates

print("\nSolution with pivot_table:")
print(good_pivot)
print("Note: The duplicate values for product A on 2025-01-01 were summed (10 + 15 = 25)")


### Issue 2: Memory Usage with Large Datasets

In [None]:
# Monitor memory usage during reshaping
def check_memory_usage(df, name):
    memory_mb = df.memory_usage(deep=True).sum() / 1024**2
    print(f"{name}: {memory_mb:.2f} MB")
    return memory_mb

# Original data
original_memory = check_memory_usage(large_data, "Original data")

# After pivot
pivot_memory = check_memory_usage(pivot_result, "After pivot")

print(f"\nMemory change: {((pivot_memory - original_memory) / original_memory * 100):+.1f}%")

print("\nMemory optimization tips:")
print("1. Use categorical data types for repeated strings")
print("2. Consider sparse matrices for data with many zeros")
print("3. Process data in chunks for very large datasets")
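
The first tip can be checked directly. The sketch below compares the memory footprint of a repeated-string column before and after converting it to the `category` dtype; the category names are illustrative.

```python
import numpy as np
import pandas as pd

labels = pd.Series(np.tile(['Electronics', 'Books', 'Clothing', 'Toys'], 1000),
                   name='category')

# deep=True counts the memory used by the string objects themselves
as_strings = labels.memory_usage(deep=True)
as_category = labels.astype('category').memory_usage(deep=True)

print(f"String column:      {as_strings / 1024:.1f} KB")
print(f"Categorical column: {as_category / 1024:.1f} KB")
```

The saving comes from storing each distinct string once and keeping a small integer code per row, so it shrinks as the number of distinct values approaches the number of rows.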

## 12. Practice Exercises

### Exercise 1: Sales Report Reshaping
You have quarterly sales data in wide format. Convert it to long format for analysis.

In [None]:
# Exercise data
quarterly_sales = pd.DataFrame({
    'salesperson': ['Alice', 'Bob', 'Carol'],
    'region': ['North', 'South', 'East'],
    'Q1_2025': [50000, 45000, 48000],
    'Q2_2025': [55000, 48000, 52000],
    'Q3_2025': [52000, 50000, 49000],
    'Q4_2025': [58000, 53000, 55000]
})

print("Quarterly Sales Data:")
print(quarterly_sales)

# Your task: Convert to long format with columns: salesperson, region, quarter, sales
# Your code here:


### Exercise 2: Customer Behavior Analysis
Create a pivot table showing customer purchase patterns by category.

In [None]:
# Exercise data
customer_transactions = pd.DataFrame({
    'customer_id': ['CUST001', 'CUST001', 'CUST002', 'CUST002', 'CUST003', 'CUST003', 'CUST001'],
    'category': ['Electronics', 'Books', 'Electronics', 'Clothing', 'Books', 'Electronics', 'Clothing'],
    'transaction_amount': [299.99, 19.99, 499.99, 79.99, 29.99, 199.99, 59.99],
    'transaction_count': [1, 1, 1, 1, 1, 1, 1]
})

print("Customer Transactions:")
print(customer_transactions)

# Your task: Create a pivot table showing:
# - Rows: customer_id
# - Columns: category
# - Values: total transaction_amount per customer per category
# Your code here:


### Exercise 3: Product Performance Dashboard
Transform the following metrics data into a dashboard format.

In [None]:
# Exercise data
product_metrics_long = pd.DataFrame({
    'product_id': ['PROD001', 'PROD001', 'PROD001', 'PROD002', 'PROD002', 'PROD002'],
    'metric_type': ['sales', 'returns', 'rating', 'sales', 'returns', 'rating'],
    'metric_value': [1000, 50, 4.5, 800, 30, 4.2]
})

print("Product Metrics (Long Format):")
print(product_metrics_long)

# Your task: Create a dashboard where:
# - Rows: product_id
# - Columns: metric_type
# - Values: metric_value
# Then calculate return_rate = returns / sales * 100
# Your code here:


### Exercise 4: Advanced Reshaping Challenge
You have monthly data for multiple metrics. Reshape it for time series analysis.

In [None]:
# Exercise data
monthly_metrics = pd.DataFrame({
    'product': ['Laptop', 'Phone', 'Tablet'],
    'Jan_sales': [100, 150, 80],
    'Jan_revenue': [50000, 45000, 24000],
    'Feb_sales': [120, 180, 90],
    'Feb_revenue': [60000, 54000, 27000],
    'Mar_sales': [110, 160, 85],
    'Mar_revenue': [55000, 48000, 25500]
})

print("Monthly Metrics (Wide Format):")
print(monthly_metrics)

# Your task: 
# 1. Melt to long format
# 2. Split the month_metric column into separate month and metric columns
# 3. Pivot to have months as rows and metrics as columns (grouped by product)
# Your code here:


## Next Steps

Excellent work! You've mastered the fundamentals of data reshaping with melt and pivot operations. These skills are essential for:

- **Data Analysis**: Converting data to the right format for analysis
- **Visualization**: Many plotting libraries require specific data formats
- **Reporting**: Creating summary tables and dashboards
- **Time Series Analysis**: Preparing data for temporal analysis

In the next part of today's session, we'll continue with:
- **Part 3: Time series manipulation basics**

### Key Takeaways
1. **Melt** (`pd.melt()`) converts wide to long format - use for analysis and visualization
2. **Pivot** (`pd.pivot()`) converts long to wide format - use for summary tables
3. **Pivot Table** (`pd.pivot_table()`) adds aggregation - use when you have duplicate combinations
4. **Choose the right tool**:
   - `melt()` for analysis-ready long format
   - `pivot()` for unique combinations
   - `pivot_table()` for aggregation
   - `stack()`/`unstack()` for hierarchical data
5. **Always handle missing values** and consider memory usage with large datasets