# Pandas GroupBy
**`06-groupby.ipynb`**

The **`groupby`** method in Pandas **splits data into groups based on some criteria**, applies a function to each group (typically an **aggregation**), and combines the per-group results into a single object.  
It is most useful for **summarizing and analyzing data** by category.
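
To make the split, apply, and combine steps concrete, here is a minimal sketch that performs them by hand and then with `groupby`. The `toy` DataFrame and its `key` and `value` columns are hypothetical and used only for this illustration.

```python
import pandas as pd

# Hypothetical toy data, not part of the tutorial's dataset
toy = pd.DataFrame({
    "key": ["a", "b", "a", "b"],
    "value": [1, 2, 3, 4],
})

# Split: one sub-DataFrame per distinct key
pieces = {k: toy[toy["key"] == k] for k in sorted(toy["key"].unique())}

# Apply: reduce each piece to a single number
sums = {k: int(piece["value"].sum()) for k, piece in pieces.items()}

# Combine: collect the per-group results into one Series
manual = pd.Series(sums, name="value")

# groupby performs the same three steps in one call
auto = toy.groupby("key")["value"].sum()

print(manual.tolist())
print(auto.tolist())
```

Both printed lists should contain the same per-key totals; the `groupby` version simply hides the bookkeeping.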

---

## Step 1: Import Libraries

In [1]:
import pandas as pd
import numpy as np


---

## Step 2: Create Sample DataFrame

In [2]:
data = {
    "Employee": ["Alice", "Bob", "Charlie", "David", "Eva", "Frank"],
    "Department": ["HR", "IT", "Finance", "IT", "HR", "Finance"],
    "Salary": [50000, 60000, 55000, 65000, 52000, 58000],
    "Age": [25, 30, 28, 35, 27, 32]
}

df = pd.DataFrame(data)
print(df)

  Employee Department  Salary  Age
0    Alice         HR   50000   25
1      Bob         IT   60000   30
2  Charlie    Finance   55000   28
3    David         IT   65000   35
4      Eva         HR   52000   27
5    Frank    Finance   58000   32



---



## Step 3: Basic GroupBy

### Group by a Single Column

In [3]:
# groupby() only records how to split the rows; nothing is computed yet.
grouped = df.groupby('Department')
print(grouped)
# Printing shows just the GroupBy object. To see results, we need to aggregate.

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001F2BFBC7520>


### Aggregation

In [4]:
# Average salary by department
print(grouped['Salary'].mean())

# Maximum age in each department
print(grouped['Age'].max())

Department
Finance    56500.0
HR         51000.0
IT         62500.0
Name: Salary, dtype: float64
Department
Finance    32
HR         27
IT         35
Name: Age, dtype: int64
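
The results above carry `Department` as the index. If a flat table is easier to work with, one option is sketched below. It rebuilds a trimmed copy of the sample data so that it runs on its own, and the names `flat` and `flat_again` are illustrative only.

```python
import pandas as pd

# Trimmed copy of the sample data so the sketch is self-contained
df = pd.DataFrame({
    "Department": ["HR", "IT", "Finance", "IT", "HR", "Finance"],
    "Salary": [50000, 60000, 55000, 65000, 52000, 58000],
})

# as_index=False keeps "Department" as a regular column
flat = df.groupby("Department", as_index=False)["Salary"].mean()

# Alternative: aggregate first, then move the index back into a column
flat_again = df.groupby("Department")["Salary"].mean().reset_index()

print(flat)
print(flat_again)
```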



---

## Step 4: Multiple Aggregations

In [5]:
# Multiple aggregation functions
agg_result = grouped['Salary'].agg(['mean', 'sum', 'max', 'min'])
print(agg_result)

               mean     sum    max    min
Department                               
Finance     56500.0  113000  58000  55000
HR          51000.0  102000  52000  50000
IT          62500.0  125000  65000  60000
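
When the default column names (`mean`, `sum`, ...) are too generic, keyword arguments to `.agg()` can name the output columns. The sketch below assumes a pandas version that supports named aggregation (0.25 or later); the output names `avg_salary`, `total_salary`, and `top_salary` are made up for the example.

```python
import pandas as pd

# Trimmed copy of the sample data so the sketch is self-contained
df = pd.DataFrame({
    "Department": ["HR", "IT", "Finance", "IT", "HR", "Finance"],
    "Salary": [50000, 60000, 55000, 65000, 52000, 58000],
})

# Each keyword becomes an output column; the value is the aggregation to run
salary_stats = df.groupby("Department")["Salary"].agg(
    avg_salary="mean",
    total_salary="sum",
    top_salary="max",
)
print(salary_stats)
```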


---

## Step 5: Grouping by Multiple Columns

In [6]:
# Create another column
df['Team'] = ['A','B','A','B','A','B']

grouped_multi = df.groupby(['Department', 'Team'])['Salary'].mean()
print(grouped_multi)


Department  Team
Finance     A       55000.0
            B       58000.0
HR          A       51000.0
IT          B       62500.0
Name: Salary, dtype: float64
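
Grouping by two columns produces a Series with a two-level index. A possible follow-up, sketched here on a self-contained copy of the data, is to pivot the inner level into columns with `unstack`. Combinations that do not occur in the data (for example HR with team B) appear as `NaN`.

```python
import pandas as pd

# Trimmed copy of the sample data, including the Team column
df = pd.DataFrame({
    "Department": ["HR", "IT", "Finance", "IT", "HR", "Finance"],
    "Team": ["A", "B", "A", "B", "A", "B"],
    "Salary": [50000, 60000, 55000, 65000, 52000, 58000],
})

by_dept_team = df.groupby(["Department", "Team"])["Salary"].mean()

# Select one combination with a tuple key
print(by_dept_team.loc[("IT", "B")])

# Move the "Team" level into columns: one row per department
wide = by_dept_team.unstack("Team")
print(wide)
```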


---

## Step 6: Iterating Over Groups

In [7]:
for name, group in grouped:
    print(f"Department: {name}")
    print(group)
    print("-"*30)

Department: Finance
  Employee Department  Salary  Age Team
2  Charlie    Finance   55000   28    A
5    Frank    Finance   58000   32    B
------------------------------
Department: HR
  Employee Department  Salary  Age Team
0    Alice         HR   50000   25    A
4      Eva         HR   52000   27    A
------------------------------
Department: IT
  Employee Department  Salary  Age Team
1      Bob         IT   60000   30    B
3    David         IT   65000   35    B
------------------------------
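
Looping prints every row. When only the shape of the grouping matters, the GroupBy object can be inspected directly, as in this small sketch on a self-contained copy of the data.

```python
import pandas as pd

# Trimmed copy of the sample data so the sketch is self-contained
df = pd.DataFrame({
    "Employee": ["Alice", "Bob", "Charlie", "David", "Eva", "Frank"],
    "Department": ["HR", "IT", "Finance", "IT", "HR", "Finance"],
})

grouped = df.groupby("Department")

# Number of distinct groups
print(grouped.ngroups)

# Row count per group
print(grouped.size())

# Mapping from group name to the row labels that belong to it
print(grouped.groups)
```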



---


## Step 7: Accessing a Single Group

In [8]:
it_group = grouped.get_group('IT')
print(it_group)
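
For comparison, the sketch below selects the IT rows twice on a self-contained copy of the data: once with `get_group` and once with a plain boolean filter. The variable `it_filtered` is illustrative only.

```python
import pandas as pd

# Trimmed copy of the sample data so the sketch is self-contained
df = pd.DataFrame({
    "Employee": ["Alice", "Bob", "Charlie", "David", "Eva", "Frank"],
    "Department": ["HR", "IT", "Finance", "IT", "HR", "Finance"],
    "Salary": [50000, 60000, 55000, 65000, 52000, 58000],
})

# Rows of one group, looked up by group name
it_group = df.groupby("Department").get_group("IT")

# The same rows selected with a boolean mask
it_filtered = df[df["Department"] == "IT"]

print(it_group["Employee"].tolist())
print(it_filtered["Employee"].tolist())
```

`get_group` is convenient when a GroupBy object already exists; for a one-off selection the boolean filter avoids building the groups at all.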


---

## Step 8: Transforming Data

In [9]:
# Center each salary on its department mean (deviation from the group average)
df['Salary_Normalized'] = grouped['Salary'].transform(lambda x: x - x.mean())
print(df)


  Employee Department  Salary  Age Team  Salary_Normalized
0    Alice         HR   50000   25    A            -1000.0
1      Bob         IT   60000   30    B            -2500.0
2  Charlie    Finance   55000   28    A            -1500.0
3    David         IT   65000   35    B             2500.0
4      Eva         HR   52000   27    A             1000.0
5    Frank    Finance   58000   32    B             1500.0
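
The same result can be reached without a lambda, as sketched below: passing the string `"mean"` to `transform` broadcasts each department's mean back onto its rows, and ordinary column arithmetic does the rest. The column names `Dept_Mean` and `Centered` are made up for this example.

```python
import pandas as pd

# Trimmed copy of the sample data so the sketch is self-contained
df = pd.DataFrame({
    "Employee": ["Alice", "Bob", "Charlie", "David", "Eva", "Frank"],
    "Department": ["HR", "IT", "Finance", "IT", "HR", "Finance"],
    "Salary": [50000, 60000, 55000, 65000, 52000, 58000],
})

# One value per original row: the mean salary of that row's department
df["Dept_Mean"] = df.groupby("Department")["Salary"].transform("mean")

# Deviation from the department mean, computed with plain column arithmetic
df["Centered"] = df["Salary"] - df["Dept_Mean"]

print(df)
```

Because the transformed Series has the same index as `df`, it can be assigned straight back as a new column.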


---



## Step 9: Applying Custom Functions

In [10]:
# Function to calculate salary range
def salary_range(x):
    return x.max() - x.min()

range_by_department = grouped['Salary'].apply(salary_range)
print(range_by_department)


Department
Finance    3000
HR         2000
IT         5000
Name: Salary, dtype: int64
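
A function that reduces each group to a single value, like `salary_range`, can also be handed to `.agg()`, which allows mixing it with built-in aggregations. The following is a sketch on a self-contained copy of the data.

```python
import pandas as pd

# Trimmed copy of the sample data so the sketch is self-contained
df = pd.DataFrame({
    "Department": ["HR", "IT", "Finance", "IT", "HR", "Finance"],
    "Salary": [50000, 60000, 55000, 65000, 52000, 58000],
})

def salary_range(x):
    """Difference between the highest and lowest value in a group."""
    return x.max() - x.min()

# Custom function on its own
spread = df.groupby("Department")["Salary"].agg(salary_range)

# Custom function alongside built-ins; the column takes the function's name
table = df.groupby("Department")["Salary"].agg(["min", "max", salary_range])

print(spread)
print(table)
```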


---


## Step 10: Different Aggregations per Column


In [11]:
# Average age and max salary per department
agg_df = grouped.agg({'Age':'mean', 'Salary':'max'})
print(agg_df)


             Age  Salary
Department              
Finance     30.0   58000
HR          26.0   52000
IT          32.5   65000
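
With the dictionary form, the output columns keep the input names, so it is not obvious from the result that `Age` holds a mean and `Salary` a maximum. One alternative, assuming pandas 0.25 or later, is named aggregation with `(column, function)` tuples. The output names `avg_age`, `max_salary`, and `headcount` are invented for this sketch.

```python
import pandas as pd

# Copy of the sample data so the sketch is self-contained
df = pd.DataFrame({
    "Employee": ["Alice", "Bob", "Charlie", "David", "Eva", "Frank"],
    "Department": ["HR", "IT", "Finance", "IT", "HR", "Finance"],
    "Salary": [50000, 60000, 55000, 65000, 52000, 58000],
    "Age": [25, 30, 28, 35, 27, 32],
})

# keyword = (input column, aggregation)
summary = df.groupby("Department").agg(
    avg_age=("Age", "mean"),
    max_salary=("Salary", "max"),
    headcount=("Employee", "count"),
)
print(summary)
```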


---



## Step 11: Real-World Example

In [12]:
# Sales Data
sales = pd.DataFrame({
    "Salesperson": ["Alice", "Bob", "Charlie", "Alice", "Bob", "Charlie"],
    "Region": ["North", "North", "South", "South", "North", "South"],
    "Revenue": [1000, 1500, 2000, 1200, 1600, 2100]
})

# Total revenue per salesperson
total_revenue = sales.groupby('Salesperson')['Revenue'].sum()
print(total_revenue)

# Average revenue per region
avg_region = sales.groupby('Region')['Revenue'].mean()
print(avg_region)

Salesperson
Alice      2200
Bob        3100
Charlie    4100
Name: Revenue, dtype: int64
Region
North    1366.666667
South    1766.666667
Name: Revenue, dtype: float64
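
Two natural follow-up questions on the same sales data are who sold the most and how revenue splits by region and salesperson. The sketch below rebuilds the `sales` DataFrame so that it runs on its own; the names `ranked`, `top_seller`, and `by_region_person` are illustrative.

```python
import pandas as pd

# Same sales data as above, repeated so the sketch is self-contained
sales = pd.DataFrame({
    "Salesperson": ["Alice", "Bob", "Charlie", "Alice", "Bob", "Charlie"],
    "Region": ["North", "North", "South", "South", "North", "South"],
    "Revenue": [1000, 1500, 2000, 1200, 1600, 2100],
})

# Total revenue per salesperson, highest first
ranked = (
    sales.groupby("Salesperson")["Revenue"]
    .sum()
    .sort_values(ascending=False)
)
top_seller = ranked.idxmax()

# Total revenue for each (region, salesperson) pair that occurs in the data
by_region_person = sales.groupby(["Region", "Salesperson"])["Revenue"].sum()

print(ranked)
print(top_seller)
print(by_region_person)
```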



---


## ✅ Summary

* **`groupby`** splits data into groups based on values.
* Aggregation methods: `.mean()`, `.sum()`, `.max()`, `.min()`; `.agg()` runs several at once or a different one per column.
* Can **group by single or multiple columns**.
* `.transform()` returns a result **with the same index as the original data** (a Series or DataFrame), so it can be assigned back as a new column.
* `.apply()` allows applying **custom functions** to groups.
* Essential for **summarizing, analyzing, and comparing categorical data**.

---