# Assignment 8: Data Aggregation and Group Operations

## Overview
This assignment covers data aggregation and group operations through a health data lens: think of EHR-like tables for departments, staff, regions, and activity. The provided `department`, `employee`, and `sales` data stand in for clinical departments, clinicians, and encounters, so the dataset and tests stay unchanged.

## Setup

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

# Set random seed for reproducibility
np.random.seed(42)

# Create output directory
os.makedirs('output', exist_ok=True)

## Question 1: Basic GroupBy Operations

### Part 1.1: Load and Explore Data

In [None]:
# Load the datasets
employee_df = pd.read_csv('data/employee_data.csv')
department_df = pd.read_csv('data/department_data.csv')
sales_df = pd.read_csv('data/sales_data.csv')

print("Employee data shape:", employee_df.shape)
print("Department data shape:", department_df.shape)
print("Sales data shape:", sales_df.shape)

# Merge data for analysis
merged_df = sales_df.merge(employee_df, on='employee_id').merge(department_df, on='department_id')

print("\nMerged data shape:", merged_df.shape)
print("\nColumns:", merged_df.columns.tolist())
print("\nFirst few rows:")
print(merged_df.head())
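
The chained `merge` above uses pandas' default inner join, so any sales row without a matching employee (or department) silently disappears from `merged_df`. The sketch below illustrates this on toy frames; the column values and the `sale_id` column are made up for illustration and are not taken from the assignment data.

```python
import pandas as pd

# Toy stand-ins for the three CSVs (hypothetical values, not the assignment data)
sales = pd.DataFrame({"sale_id": [1, 2, 3],
                      "employee_id": [10, 11, 99],
                      "sales": [500, 300, 200]})
employees = pd.DataFrame({"employee_id": [10, 11],
                          "department_id": [1, 2],
                          "salary": [70000, 55000]})
departments = pd.DataFrame({"department_id": [1, 2],
                            "department": ["Cardiology", "Oncology"]})

# Default merge is an inner join: the sale for employee 99 has no match and is dropped
inner = sales.merge(employees, on="employee_id").merge(departments, on="department_id")

# A left join with indicator=True flags the rows that failed to match
check = sales.merge(employees, on="employee_id", how="left", indicator=True)
unmatched = check[check["_merge"] == "left_only"]

print("sales rows:", len(sales), "| after inner merge:", len(inner),
      "| unmatched:", len(unmatched))
```

Comparing `sales_df.shape[0]` with `merged_df.shape[0]` is a quick way to run the same check on the real data.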

### Part 1.2: Basic Aggregation (health context)

**TODO: Perform basic groupby operations**

In [None]:
# TODO: Group by department (treat as clinical department) and calculate basic stats
# TODO: Calculate mean, sum, count for salary and experience by department
# TODO: Calculate total sales by department (treat sales as encounter totals)
# TODO: Find the top-performing department by total sales (encounter volume)

# TODO: Save results as 'output/q1_groupby_analysis.csv'
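
As a reference for the pattern (not a solution to the TODOs), here is a minimal sketch of named aggregation on a toy frame. The values are invented, and the column names are assumed to mirror the ones mentioned in the TODO comments.

```python
import pandas as pd

# Toy frame with hypothetical values; one row per staff member
toy = pd.DataFrame({
    "department": ["Cardiology", "Cardiology", "Oncology", "Oncology", "Oncology"],
    "salary": [70000, 80000, 60000, 65000, 55000],
    "experience": [5, 10, 3, 7, 2],
    "sales": [1200, 800, 500, 700, 300],
})

# Named aggregation: output_column=(input_column, aggregation)
summary = toy.groupby("department").agg(
    salary_mean=("salary", "mean"),
    salary_sum=("salary", "sum"),
    headcount=("salary", "count"),
    experience_mean=("experience", "mean"),
    total_sales=("sales", "sum"),
)

# idxmax returns the index label (here: the department) with the largest value
top_department = summary["total_sales"].idxmax()

print(summary)
print("Top department by total sales:", top_department)
```

If the merged data has one row per sale rather than one row per employee, salary values repeat across rows, so a per-row mean is weighted by how many sales each person has. Whether that matters depends on how the assignment data is laid out.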

### Part 1.3: Transform Operations (within-department norms)

**TODO: Use transform operations to add group statistics**

In [None]:
# TODO: Add department (clinical unit) mean salary as new column
# TODO: Add department standard deviation of salary
# TODO: Create normalized salary (z-score within department)
# TODO: Add department total sales (encounters) as new column

# TODO: Display the enhanced dataframe
# TODO: Save results as 'output/q1_aggregation_report.txt'
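
Unlike `agg`, `transform` returns a result aligned to the original rows, which is what makes within-group z-scores a one-liner. The following is a sketch on toy data with invented values. It uses the sample standard deviation (pandas' default, `ddof=1`), which is an assumption about what the assignment expects.

```python
import pandas as pd

# Toy frame with hypothetical values
toy = pd.DataFrame({
    "department": ["A", "A", "A", "B", "B"],
    "salary": [50000, 60000, 70000, 80000, 100000],
    "sales": [100, 200, 300, 400, 600],
})

salary_by_dept = toy.groupby("department")["salary"]
sales_by_dept = toy.groupby("department")["sales"]

# transform broadcasts each group's statistic back onto that group's rows
toy["dept_mean_salary"] = salary_by_dept.transform("mean")
toy["dept_std_salary"] = salary_by_dept.transform("std")  # sample std (ddof=1)
toy["salary_z"] = (toy["salary"] - toy["dept_mean_salary"]) / toy["dept_std_salary"]
toy["dept_total_sales"] = sales_by_dept.transform("sum")

print(toy)
```

A department with a single row has an undefined sample standard deviation, so its z-score comes out as `NaN`; decide explicitly how to handle that case.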

## Question 2: Advanced GroupBy Operations

### Part 2.1: Filter Operations (quality/scale gates)

**TODO: Use filter operations to remove groups**

In [None]:
# TODO: Filter departments with more than 5 employees (sufficient staffing)
# TODO: Filter departments with average salary > 60000 (seniority proxy)
# TODO: Filter departments with total sales > 100000 (high encounter volume)

# TODO: Create a summary of filtered results
# TODO: Save results as 'output/q2_hierarchical_analysis.csv'
# NOTE: Part 2.3 saves to this same path, so its output will overwrite this file
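
`GroupBy.filter` keeps or drops whole groups: the function receives each group's sub-frame and must return a single boolean. A sketch on toy data follows. The thresholds are scaled down to fit the invented values, and `employee_id` is assumed to identify staff.

```python
import pandas as pd

# Toy frame with hypothetical values; thresholds below are scaled down to match
toy = pd.DataFrame({
    "department": ["A", "A", "A", "B", "B", "C"],
    "employee_id": [1, 2, 3, 4, 5, 6],
    "salary": [50000, 55000, 60000, 90000, 95000, 70000],
    "sales": [100, 200, 300, 50, 60, 500],
})

by_dept = toy.groupby("department")

# Each lambda sees one department's rows and returns True to keep that group
well_staffed = by_dept.filter(lambda g: g["employee_id"].nunique() > 2)
high_salary = by_dept.filter(lambda g: g["salary"].mean() > 60000)
high_volume = by_dept.filter(lambda g: g["sales"].sum() > 400)

# Small summary of which departments survive each gate
summary = pd.Series({
    "well_staffed": sorted(well_staffed["department"].unique()),
    "high_salary": sorted(high_salary["department"].unique()),
    "high_volume": sorted(high_volume["department"].unique()),
})
print(summary)
```

Counting with `nunique()` rather than `len(g)` guards against the case where the data has several rows per employee, in which row count and headcount differ.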

### Part 2.2: Apply Operations

**TODO: Use apply operations with custom functions**

In [None]:
# TODO: Create custom function to calculate salary statistics
def salary_stats(group):
    # TODO: Return mean, std, min, max, range for salary
    pass

# TODO: Apply custom function to each department
# TODO: Create function to find top earners in each department
# TODO: Apply function to get top 2 earners per department

# TODO: Save results as 'output/q2_performance_report.txt'
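
For orientation, here is a sketch of the two `apply` patterns on toy data: a function that returns a `pd.Series` (one summary row per group) and a function that returns a sub-frame (several rows per group). The helper names `toy_salary_stats` and `top_n` are hypothetical and the values are invented.

```python
import pandas as pd

# Toy frame with hypothetical values
toy = pd.DataFrame({
    "department": ["A", "A", "A", "B", "B"],
    "employee_id": [1, 2, 3, 4, 5],
    "salary": [50000, 60000, 70000, 80000, 100000],
})

def toy_salary_stats(group):
    """Return one row of salary statistics for a single department."""
    s = group["salary"]
    return pd.Series({
        "mean": s.mean(),
        "std": s.std(),
        "min": s.min(),
        "max": s.max(),
        "range": s.max() - s.min(),
    })

def top_n(group, n=2):
    """Return the n highest-paid rows of a single department."""
    return group.nlargest(n, "salary")

# Selecting columns before apply keeps the grouping column out of each group
stats = toy.groupby("department")[["salary"]].apply(toy_salary_stats)

top2 = (
    toy.groupby("department", group_keys=True)[["employee_id", "salary"]]
    .apply(top_n, n=2)
    .reset_index(level=0)  # move 'department' from the index back to a column
)

print(stats)
print(top2)
```

When a built-in aggregation exists, `agg` is usually faster than `apply`; `apply` earns its keep when the function needs several columns at once or returns more than one row.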

### Part 2.3: Hierarchical Grouping (dept × region)

**TODO: Perform multi-level grouping**

In [None]:
# TODO: Group by department (clinical unit) and region
# TODO: Calculate statistics for each department-region combination
# TODO: Use unstack to convert to wide format
# TODO: Use stack to convert back to long format

# TODO: Analyze the hierarchical structure
# TODO: Save results as 'output/q2_hierarchical_analysis.csv'
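
Grouping by two keys produces a `MultiIndex`; `unstack` pivots one index level into columns and `stack` reverses it. Below is a sketch on toy data with invented values. Because the default handling of missing cells in `stack` differs between pandas versions, the sketch calls `dropna()` explicitly.

```python
import pandas as pd

# Toy frame with hypothetical values; note there is no (B, South) combination
toy = pd.DataFrame({
    "department": ["A", "A", "A", "B", "B"],
    "region": ["North", "South", "North", "North", "North"],
    "sales": [100, 200, 300, 400, 600],
})

# Long format: one row per (department, region) pair that occurs in the data
long = toy.groupby(["department", "region"])["sales"].agg(["sum", "mean"])

# Wide format: regions become columns; the absent (B, South) cell is NaN
wide = long["sum"].unstack("region")

# fill_value replaces absent combinations instead of leaving NaN
wide_filled = long["sum"].unstack("region", fill_value=0)

# Back to long format; dropna() makes the result the same across pandas versions
back = wide.stack().dropna()

print(wide)
print(back)
```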

## Question 3: Pivot Tables and Cross-Tabulations

### Part 3.1: Basic Pivot Tables (health service mix)

**TODO: Create pivot tables for multi-dimensional analysis**

In [None]:
# TODO: Create pivot table: sales (encounters) by product (service) and region
# TODO: Create pivot table with multiple aggregations (sum, mean, count)
# TODO: Add totals (margins) to pivot table
# TODO: Handle missing values with fill_value

# TODO: Save results as 'output/q3_pivot_analysis.csv'
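
The options named in the TODOs map directly onto `pd.pivot_table` parameters. Here is a sketch on toy data with invented values, assuming `product` and `region` columns as mentioned above.

```python
import pandas as pd

# Toy frame with hypothetical values; there is no (Y, South) combination
toy = pd.DataFrame({
    "product": ["X", "X", "Y", "Y", "X"],
    "region": ["North", "South", "North", "North", "North"],
    "sales": [100, 200, 300, 400, 600],
})

# Single aggregation with totals; fill_value fills the empty (Y, South) cell
pivot = pd.pivot_table(
    toy,
    values="sales",
    index="product",
    columns="region",
    aggfunc="sum",
    fill_value=0,
    margins=True,
    margins_name="Total",
)

# Several aggregations at once: columns become a (aggfunc, region) MultiIndex
multi = pd.pivot_table(
    toy,
    values="sales",
    index="product",
    columns="region",
    aggfunc=["sum", "mean", "count"],
    fill_value=0,
)

print(pivot)
print(multi)
```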

### Part 3.2: Cross-Tabulations (caseload distribution)

**TODO: Create cross-tabulations for categorical analysis**

In [None]:
# TODO: Create crosstab of department vs region
# TODO: Create crosstab with margins
# TODO: Create multi-dimensional crosstab

# TODO: Analyze the cross-tabulation results
# TODO: Save results as 'output/q3_crosstab_analysis.csv'
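
`pd.crosstab` counts co-occurrences of categorical values, which fits questions like "how many rows fall in each department and region?". This is a sketch on toy data with invented values; the `normalize="index"` variant is an extra shown for illustration and is not required by the TODOs.

```python
import pandas as pd

# Toy frame with hypothetical values
toy = pd.DataFrame({
    "department": ["A", "A", "A", "B", "B"],
    "region": ["North", "South", "North", "North", "North"],
    "product": ["X", "Y", "X", "X", "Y"],
})

# Plain counts of rows per (department, region)
counts = pd.crosstab(toy["department"], toy["region"])

# Same table with row and column totals
with_totals = pd.crosstab(toy["department"], toy["region"],
                          margins=True, margins_name="Total")

# Row-wise shares: each department's rows sum to 1
row_share = pd.crosstab(toy["department"], toy["region"], normalize="index")

# Multi-dimensional: passing a list nests region and product in the columns
multi = pd.crosstab(toy["department"], [toy["region"], toy["product"]])

print(with_totals)
print(multi)
```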

### Part 3.3: Pivot Table Visualization

**TODO: Create visualizations from pivot tables**

In [None]:
# TODO: Create heatmap from pivot table
# TODO: Create bar chart from pivot table
# TODO: Customize colors and styling
# TODO: Add appropriate titles and labels

# TODO: Save the plot as 'output/q3_pivot_visualization.png'
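
A sketch of a heatmap and a grouped bar chart drawn side by side from a small pivot table. It uses plain matplotlib so that it runs without extra dependencies; since this notebook imports seaborn, `sns.heatmap(pivot, annot=True)` is a common alternative for the left panel. The pivot values are invented, and the figure is written to a temporary directory rather than to `output/`.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Toy pivot table with hypothetical values (products as rows, regions as columns)
pivot = pd.DataFrame(
    {"North": [700, 700], "South": [200, 0]},
    index=pd.Index(["X", "Y"], name="product"),
)
pivot.columns.name = "region"

fig, (ax_heat, ax_bar) = plt.subplots(1, 2, figsize=(10, 4))

# Heatmap: one colored cell per (product, region)
im = ax_heat.imshow(pivot.to_numpy(), cmap="viridis", aspect="auto")
ax_heat.set_xticks(range(pivot.shape[1]))
ax_heat.set_xticklabels(pivot.columns)
ax_heat.set_yticks(range(pivot.shape[0]))
ax_heat.set_yticklabels(pivot.index)
ax_heat.set_xlabel("Region")
ax_heat.set_ylabel("Product")
ax_heat.set_title("Total sales by product and region")
fig.colorbar(im, ax=ax_heat, label="Total sales")

# Grouped bar chart: one cluster per product, one bar per region
pivot.plot(kind="bar", ax=ax_bar, rot=0)
ax_bar.set_ylabel("Total sales")
ax_bar.set_title("Total sales by product")

fig.tight_layout()
out_path = os.path.join(tempfile.mkdtemp(), "toy_pivot_visualization.png")
fig.savefig(out_path, dpi=150)
plt.close(fig)
print("Saved to", out_path)
```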

## Submission Checklist

Before submitting, verify you've created:

- [ ] `output/q1_groupby_analysis.csv` - Basic groupby analysis
- [ ] `output/q1_aggregation_report.txt` - Aggregation report
- [ ] `output/q2_hierarchical_analysis.csv` - Hierarchical analysis
- [ ] `output/q2_performance_report.txt` - Performance report
- [ ] `output/q3_pivot_analysis.csv` - Pivot table analysis
- [ ] `output/q3_crosstab_analysis.csv` - Cross-tabulation analysis
- [ ] `output/q3_pivot_visualization.png` - Pivot visualization

## Key Learning Objectives

- Master the split-apply-combine paradigm
- Apply aggregation functions and transformations
- Create pivot tables and cross-tabulations for multi-dimensional analysis
- Apply advanced groupby techniques (filter, apply, and hierarchical grouping)