# Level 8: Data Aggregation & GroupBy

Data aggregation is the process of combining and summarizing data. The `groupby` operation in Pandas is one of its most powerful features, allowing you to efficiently perform the 'Split-Apply-Combine' pattern on your data.

In [1]:
import pandas as pd
import numpy as np

data = {
    'Department': ['HR', 'Engineering', 'Sales', 'Engineering', 'HR', 'Sales', 'Sales'],
    'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank', 'Grace'],
    'Salary': [70000, 80000, 120000, 95000, 75000, 110000, 130000],
    'YearsExperience': [5, 7, 10, 8, 6, 9, 12]
}
df = pd.DataFrame(data)
df

| | Department | Employee | Salary | YearsExperience |
|---|---|---|---|---|
| 0 | HR | Alice | 70000 | 5 |
| 1 | Engineering | Bob | 80000 | 7 |
| 2 | Sales | Charlie | 120000 | 10 |
| 3 | Engineering | David | 95000 | 8 |
| 4 | HR | Eva | 75000 | 6 |
| 5 | Sales | Frank | 110000 | 9 |
| 6 | Sales | Grace | 130000 | 12 |


## 8.1 GroupBy Basics

### The Split-Apply-Combine Paradigm
1.  **Split:** The data is split into groups based on some criteria (e.g., by 'Department').
2.  **Apply:** A function is applied to each group independently (e.g., calculate the mean salary).
3.  **Combine:** The results of the function applications are combined into a new data structure.
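The three steps can be written out by hand to see what `groupby` automates. This is a minimal sketch that rebuilds only the `Department` and `Salary` columns of the sample data; the names `pieces`, `means`, `manual_result`, and `groupby_result` are illustrative and not part of pandas.

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['HR', 'Engineering', 'Sales', 'Engineering', 'HR', 'Sales', 'Sales'],
    'Salary': [70000, 80000, 120000, 95000, 75000, 110000, 130000],
})

# Split: one sub-DataFrame per distinct department
pieces = {dept: df[df['Department'] == dept] for dept in df['Department'].unique()}

# Apply: compute the mean salary of each piece independently
means = {dept: piece['Salary'].mean() for dept, piece in pieces.items()}

# Combine: assemble the per-group results into a single Series
manual_result = pd.Series(means, name='Salary').sort_index()
manual_result.index.name = 'Department'

# The groupby equivalent does all three steps in one expression
groupby_result = df.groupby('Department')['Salary'].mean()

print(manual_result)
print(groupby_result)
```

Both results should hold the same mean per department; the `groupby` version avoids the explicit Python loops.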

In [2]:
# Group the DataFrame by the 'Department' column
grouped = df.groupby('Department')
grouped

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002BCA0C29510>

A `groupby` operation creates a `DataFrameGroupBy` object. This object only describes how the rows map to groups; nothing is aggregated until you call a method such as `.mean()` or `.agg()` on it.
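Although the `DataFrameGroupBy` object prints as an opaque reference, its groups can be inspected directly. The sketch below rebuilds part of the sample data and uses `ngroups`, iteration, and `get_group`; the variable names `n_groups`, `group_sizes`, and `hr` are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['HR', 'Engineering', 'Sales', 'Engineering', 'HR', 'Sales', 'Sales'],
    'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank', 'Grace'],
    'Salary': [70000, 80000, 120000, 95000, 75000, 110000, 130000],
})
grouped = df.groupby('Department')

# Number of distinct groups
n_groups = grouped.ngroups

# Iterating yields (key, sub-DataFrame) pairs
group_sizes = {name: len(group) for name, group in grouped}

# Retrieve the rows of a single group by its key
hr = grouped.get_group('HR')

print(n_groups)
print(group_sizes)
print(hr)
```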

## 8.2 Aggregation Functions

### Built-in Functions

In [3]:
# Calculate the mean salary for each department
grouped['Salary'].mean()

Department
Engineering     87500.0
HR              72500.0
Sales          120000.0
Name: Salary, dtype: float64

In [4]:
# Get the size of each group
grouped.size()

Department
Engineering    2
HR             2
Sales          3
dtype: int64

### The `.agg()` Method
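The `.agg()` method accepts a function name, a list of functions, a dictionary mapping columns to functions, or a custom callable, so a single call can compute several summaries at once.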

In [5]:
# Multiple aggregations on the same column
grouped['Salary'].agg(['mean', 'std', 'count'])

| Department | mean | std | count |
|---|---|---|---|
| Engineering | 87500.0 | 10606.601718 | 2 |
| HR | 72500.0 | 3535.533906 | 2 |
| Sales | 120000.0 | 10000.0 | 3 |


In [6]:
# Different aggregations for different columns
agg_dict = {
    'Salary': 'mean',
    'YearsExperience': 'max'
}
grouped.agg(agg_dict)

| Department | Salary | YearsExperience |
|---|---|---|
| Engineering | 87500.0 | 8 |
| HR | 72500.0 | 6 |
| Sales | 120000.0 | 12 |


In [7]:
# Custom aggregation functions
def salary_range(series):
    return series.max() - series.min()

grouped['Salary'].agg(salary_range)

Department
Engineering    15000
HR              5000
Sales          20000
Name: Salary, dtype: int64
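The dictionary and custom-function forms can also be combined through pandas' named aggregation syntax, where each keyword argument names an output column and maps to a `(column, function)` tuple. The following is a sketch on the same sample data; the output column names `avg_salary`, `salary_range`, and `max_experience` are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['HR', 'Engineering', 'Sales', 'Engineering', 'HR', 'Sales', 'Sales'],
    'Salary': [70000, 80000, 120000, 95000, 75000, 110000, 130000],
    'YearsExperience': [5, 7, 10, 8, 6, 9, 12],
})

# Each keyword becomes an output column: name=(source column, aggregation)
summary = df.groupby('Department').agg(
    avg_salary=('Salary', 'mean'),
    salary_range=('Salary', lambda s: s.max() - s.min()),
    max_experience=('YearsExperience', 'max'),
)
print(summary)
```

Compared with the dictionary form, this yields flat, descriptive column names without a separate renaming step.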

## 8.3 Transform & Filter

### `.transform()`
Applies a function to each group and returns a result aligned to the original rows (same index and length as the input) rather than one row per group. This makes it useful for creating new columns based on group-level calculations.

In [8]:
# Standardize salaries within each department (z-score)
df['Salary_ZScore'] = grouped['Salary'].transform(lambda x: (x - x.mean()) / x.std())
df

| | Department | Employee | Salary | YearsExperience | Salary_ZScore |
|---|---|---|---|---|---|
| 0 | HR | Alice | 70000 | 5 | -0.707107 |
| 1 | Engineering | Bob | 80000 | 7 | -0.707107 |
| 2 | Sales | Charlie | 120000 | 10 | 0.0 |
| 3 | Engineering | David | 95000 | 8 | 0.707107 |
| 4 | HR | Eva | 75000 | 6 | 0.707107 |
| 5 | Sales | Frank | 110000 | 9 | -1.0 |
| 6 | Sales | Grace | 130000 | 12 | 1.0 |
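To make the difference from an aggregation concrete, the sketch below passes the built-in name `'mean'` to `.transform()`. The group mean is broadcast back to every row, so it can be subtracted from the original column directly. The column name `DiffFromDeptMean` is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['HR', 'Engineering', 'Sales', 'Engineering', 'HR', 'Sales', 'Sales'],
    'Salary': [70000, 80000, 120000, 95000, 75000, 110000, 130000],
})

# One value per original row: each row gets its own department's mean
dept_mean = df.groupby('Department')['Salary'].transform('mean')

# Because the result is aligned with df, plain column arithmetic works
df['DiffFromDeptMean'] = df['Salary'] - dept_mean
print(df)
```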


### `.filter()`
Returns a subset of the original DataFrame by keeping only the groups that satisfy a certain condition.

In [9]:
# Keep only departments where the average salary is greater than 80,000
df.groupby('Department').filter(lambda x: x['Salary'].mean() > 80000)

| | Department | Employee | Salary | YearsExperience | Salary_ZScore |
|---|---|---|---|---|---|
| 1 | Engineering | Bob | 80000 | 7 | -0.707107 |
| 2 | Sales | Charlie | 120000 | 10 | 0.0 |
| 3 | Engineering | David | 95000 | 8 | 0.707107 |
| 5 | Sales | Frank | 110000 | 9 | -1.0 |
| 6 | Sales | Grace | 130000 | 12 | 1.0 |
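The same selection can be expressed with `.transform()` and a boolean mask, which avoids calling a Python lambda once per group. This is one possible alternative shown on the sample data, not a replacement the original notebook prescribes; `mask`, `via_mask`, and `via_filter` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['HR', 'Engineering', 'Sales', 'Engineering', 'HR', 'Sales', 'Sales'],
    'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank', 'Grace'],
    'Salary': [70000, 80000, 120000, 95000, 75000, 110000, 130000],
})

# Row-aligned group means turned into a True/False mask
mask = df.groupby('Department')['Salary'].transform('mean') > 80000
via_mask = df[mask]

# The .filter() form, for comparison
via_filter = df.groupby('Department').filter(lambda g: g['Salary'].mean() > 80000)

print(via_mask)
```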


## 8.4 Pivot Tables (`pd.pivot_table()`)

A pivot table summarizes a DataFrame by turning the unique values of one column into row labels and those of another into column labels, then aggregating a values column within each cell.

In [10]:
# Let's create a more suitable dataset
pivot_data = {
    'Date': pd.to_datetime(['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02']),
    'City': ['NY', 'LA', 'NY', 'LA'],
    'Product': ['A', 'A', 'B', 'B'],
    'Sales': [100, 150, 200, 250]
}
df_pivot = pd.DataFrame(pivot_data)
df_pivot

| | Date | City | Product | Sales |
|---|---|---|---|---|
| 0 | 2023-01-01 | NY | A | 100 |
| 1 | 2023-01-01 | LA | A | 150 |
| 2 | 2023-01-02 | NY | B | 200 |
| 3 | 2023-01-02 | LA | B | 250 |


In [11]:
# Create a pivot table
# Index = rows, Columns = columns, Values = what to aggregate
pd.pivot_table(df_pivot, values='Sales', index='Date', columns='City', aggfunc='sum')

| City | LA | NY |
|---|---|---|
| Date | | |
| 2023-01-01 | 150 | 100 |
| 2023-01-02 | 250 | 200 |
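A pivot table with a single aggregation is closely related to a `groupby` on two keys followed by `unstack`. The sketch below rebuilds the small sales dataset and computes the table both ways; the names `pivoted` and `unstacked` are chosen for illustration.

```python
import pandas as pd

df_pivot = pd.DataFrame({
    'Date': pd.to_datetime(['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02']),
    'City': ['NY', 'LA', 'NY', 'LA'],
    'Product': ['A', 'A', 'B', 'B'],
    'Sales': [100, 150, 200, 250],
})

# Pivot table: dates as rows, cities as columns, summed sales in the cells
pivoted = pd.pivot_table(df_pivot, values='Sales', index='Date',
                         columns='City', aggfunc='sum')

# Same result: group by both keys, sum, then move 'City' into the columns
unstacked = df_pivot.groupby(['Date', 'City'])['Sales'].sum().unstack('City')

print(pivoted)
print(unstacked)
```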


## 8.5 Cross-tabulation (`pd.crosstab()`)

A cross-tabulation (or crosstab) is a table that counts how often each combination of values of two or more variables occurs.

In [12]:
# Crosstab of Department vs. a binned salary
df['SalaryBin'] = pd.cut(df['Salary'], bins=[60000, 90000, 150000], labels=['Low', 'High'])
pd.crosstab(df['Department'], df['SalaryBin'])

| SalaryBin | Low | High |
|---|---|---|
| Department | | |
| Engineering | 1 | 1 |
| HR | 2 | 0 |
| Sales | 0 | 3 |
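Raw counts are sometimes easier to compare as proportions. As a sketch, `pd.crosstab` takes a `normalize` argument; with `normalize='index'` each row is divided by its row total, giving the share of each salary bin within a department. The variable name `shares` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['HR', 'Engineering', 'Sales', 'Engineering', 'HR', 'Sales', 'Sales'],
    'Salary': [70000, 80000, 120000, 95000, 75000, 110000, 130000],
})

# Same binning as above: (60000, 90000] is 'Low', (90000, 150000] is 'High'
df['SalaryBin'] = pd.cut(df['Salary'], bins=[60000, 90000, 150000],
                         labels=['Low', 'High'])

# Row-wise proportions instead of counts
shares = pd.crosstab(df['Department'], df['SalaryBin'], normalize='index')
print(shares)
```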
