# 🐼 Pandas - Class 6: GroupBy & Aggregations
Welcome to **Class 6** of our Pandas series. Today we'll learn how to group data and perform aggregations.

## 1. Concept of Split–Apply–Combine
- **Split** the data into groups based on some criteria.
- **Apply** a function (e.g., sum, mean) to each group.
- **Combine** the results into a new DataFrame or Series.

This is the core idea behind `groupby`.
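
To make the three steps concrete, here is a minimal sketch that performs split, apply, and combine by hand and then compares the result with the one-line `groupby` form. The `toy` DataFrame and the variable names are made up for this illustration only.

```python
import pandas as pd

# Hypothetical toy data, used only for this illustration
toy = pd.DataFrame({
    "Department": ["HR", "IT", "IT", "HR"],
    "Salary": [50000, 70000, 72000, 52000],
})

# Split: iterating over a GroupBy yields (group key, sub-DataFrame) pairs
pieces = {key: group for key, group in toy.groupby("Department")}

# Apply: compute one value per group
means = {key: group["Salary"].mean() for key, group in pieces.items()}

# Combine: assemble the per-group results into a single Series
manual = pd.Series(means, name="Salary")
manual.index.name = "Department"

# The one-line groupby form does all three steps at once
shortcut = toy.groupby("Department")["Salary"].mean()

print(manual)
print(shortcut)
```

Both printed Series should hold the same per-department means. In practice you would use the one-line form, and the loop is only there to show what happens inside it.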

In [None]:
import pandas as pd

# New dataset: Employees, their Department, and Salary
data = {
    "Employee": ["John", "Sarah", "Mike", "Anna",  "Tom", "Laura", "Steve", "Kate"],
    "Department": ["HR", "IT", "IT", "Finance", "Finance", "HR", "IT", "Finance"],
    "Salary": [50000, 70000, 72000, 65000, 60000, 52000, 75000, 58000]
}

df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Split–Apply–Combine: average salary by department
avg_salary = df.groupby("Department")["Salary"].mean()
print("\nAverage salary by Department:")
print(avg_salary)

# Another example: count employees per department
emp_count = df.groupby("Department")["Employee"].count()
print("\nNumber of employees by Department:")
print(emp_count)

# Group and aggregate multiple stats at once
multi_stats = df.groupby("Department")["Salary"].agg(["mean", "max", "min"])
print("\nSalary stats (mean, max, min) by Department:")
print(multi_stats)


Original DataFrame:
  Employee Department  Salary
0     John         HR   50000
1    Sarah         IT   70000
2     Mike         IT   72000
3     Anna    Finance   65000
4      Tom    Finance   60000
5    Laura         HR   52000
6    Steve         IT   75000
7     Kate    Finance   58000

Average salary by Department:
Department
Finance    61000.000000
HR         51000.000000
IT         72333.333333
Name: Salary, dtype: float64

Number of employees by Department:
Department
Finance    3
HR         2
IT         3
Name: Employee, dtype: int64

Salary stats (mean, max, min) by Department:
                    mean    max    min
Department                            
Finance     61000.000000  65000  58000
HR          51000.000000  52000  50000
IT          72333.333333  75000  70000


## 2. Using `groupby()` with Aggregation Functions
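- Use `groupby('col')` followed by an aggregation such as `mean()`, `sum()`, `count()`, `min()`, or `max()`. Selecting the column to aggregate first (e.g. `df.groupby('Department')['Salary'].mean()`) keeps text columns out of numeric functions.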
- You can also use `agg()` to pass multiple functions at once.

In [None]:

# 1. Mean salary per department
print("\nMean salary by Department:")
print(df.groupby("Department")["Salary"].mean())

# 2. Sum of salaries per department
print("\nTotal salary by Department:")
print(df.groupby("Department")["Salary"].sum())

# 3. Count of employees per department
print("\nNumber of employees by Department:")
print(df.groupby("Department")["Employee"].count())

# 4. Multiple aggregations on Salary using agg()
print("\nMultiple aggregations on Salary:")
print(df.groupby("Department")["Salary"].agg(["mean", "max", "min", "count"]))



Mean salary by Department:
Department
Finance    61000.000000
HR         51000.000000
IT         72333.333333
Name: Salary, dtype: float64

Total salary by Department:
Department
Finance    183000
HR         102000
IT         217000
Name: Salary, dtype: int64

Number of employees by Department:
Department
Finance    3
HR         2
IT         3
Name: Employee, dtype: int64

Multiple aggregations on Salary:
                    mean    max    min  count
Department                                   
Finance     61000.000000  65000  58000      3
HR          51000.000000  52000  50000      2
IT          72333.333333  75000  70000      3
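
One detail about `count()` is worth a small sketch: it counts non-missing values in the selected column, while `size()` counts rows. The dataset above has no missing values, so the two agree there. The `staff` DataFrame below is a made-up example with one missing name to show where they differ.

```python
import pandas as pd

# Hypothetical data with one missing name, used only to show the difference
staff = pd.DataFrame({
    "Department": ["HR", "HR", "IT"],
    "Employee": ["John", None, "Sarah"],
})

grouped = staff.groupby("Department")

# count() skips missing values in the selected column
non_null = grouped["Employee"].count()

# size() counts every row in the group, missing values included
rows = grouped.size()

print(non_null)
print(rows)
```

If the goal is "number of rows per department", `size()` is usually the safer choice. `count()` fits better when you want to know how many values are actually filled in.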


## 3. Multiple Aggregations on Different Columns
- With `agg()`, you can specify different functions for each column.
- Example: `df.groupby('Department').agg({'Salary': 'mean', 'Age': 'max'})`.

In [None]:
import pandas as pd

# Same dataset, now with an Age column
data = {
    "Employee": ["John", "Sarah", "Mike", "Anna", "Tom", "Laura", "Steve", "Kate"],
    "Department": ["HR", "IT", "IT", "Finance", "Finance", "HR", "IT", "Finance"],
    "Salary": [50000, 70000, 72000, 65000, 60000, 52000, 75000, 58000],
    "Age": [28, 32, 30, 27, 35, 29, 31, 26]
}

df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Multiple aggregations on different columns
result = df.groupby("Department").agg({
    "Salary": ["mean", "max", "min"],
    "Age": ["mean", "max"]
})

print("\nMultiple aggregations on Salary and Age by Department:")
print(result)


Original DataFrame:
  Employee Department  Salary  Age
0     John         HR   50000   28
1    Sarah         IT   70000   32
2     Mike         IT   72000   30
3     Anna    Finance   65000   27
4      Tom    Finance   60000   35
5    Laura         HR   52000   29
6    Steve         IT   75000   31
7     Kate    Finance   58000   26

Multiple aggregations on Salary and Age by Department:
                  Salary                      Age    
                    mean    max    min       mean max
Department                                           
Finance     61000.000000  65000  58000  29.333333  35
HR          51000.000000  52000  50000  28.500000  29
IT          72333.333333  75000  70000  31.000000  32
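
The dict-of-lists form above produces two-level column labels such as `("Salary", "mean")`. If flat column names are easier to work with, named aggregation is one alternative. The sketch below compares the two on a small made-up `people` DataFrame; the output column names (`salary_mean` and so on) are arbitrary choices for this example.

```python
import pandas as pd

# Hypothetical toy data, used only for this illustration
people = pd.DataFrame({
    "Department": ["HR", "IT", "IT", "HR"],
    "Salary": [50000, 70000, 72000, 52000],
    "Age": [28, 32, 30, 29],
})

# Dict of lists: the result's columns form a two-level MultiIndex
nested = people.groupby("Department").agg({"Salary": ["mean", "max"], "Age": ["max"]})
print(nested.columns.tolist())

# Named aggregation: new_column=(source_column, function) gives flat names
flat = people.groupby("Department").agg(
    salary_mean=("Salary", "mean"),
    salary_max=("Salary", "max"),
    age_max=("Age", "max"),
)
print(flat.columns.tolist())
```

With the nested result you select a value with a pair, e.g. `nested[("Salary", "mean")]`. With the flat result a plain `flat["salary_mean"]` is enough.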


## 4. Pivot Tables & Crosstab
- `pivot_table()` summarizes data like Excel pivot tables.
- `crosstab()` shows frequency counts of combinations of factors.
- Both are powerful for summarizing and comparing groups.

In [None]:

# 1. Pivot table: average Salary by Department
pivot_salary = df.pivot_table(values="Salary", index="Department", aggfunc="mean")
print("\nPivot table: Average Salary by Department")
print(pivot_salary)

# 2. Pivot table with multiple values (Salary & Age)
pivot_multi = df.pivot_table(values=["Salary", "Age"], index="Department", aggfunc="mean")
print("\nPivot table: Average Salary & Age by Department")
print(pivot_multi)

# 3. Crosstab: number of employees by Department and Age group
# First create an AgeGroup column; pd.cut bins are right-inclusive: (25, 30], (30, 35], (35, 40]
df["AgeGroup"] = pd.cut(df["Age"], bins=[25, 30, 35, 40], labels=["25-30", "31-35", "36-40"])
crosstab_result = pd.crosstab(df["Department"], df["AgeGroup"])
print("\nCrosstab: Employees by Department and Age Group")
print(crosstab_result)


Pivot table: Average Salary by Department
                  Salary
Department              
Finance     61000.000000
HR          51000.000000
IT          72333.333333

Pivot table: Average Salary & Age by Department
                  Age        Salary
Department                         
Finance     29.333333  61000.000000
HR          28.500000  51000.000000
IT          31.000000  72333.333333

Crosstab: Employees by Department and Age Group
AgeGroup    25-30  31-35
Department              
Finance         2      1
HR              2      0
IT              1      2
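
Both functions are closely related to `groupby`. The sketch below, on a made-up `staff_by_city` DataFrame, shows a simple `pivot_table` giving the same numbers as a `groupby` mean, and a `crosstab` giving the same counts as `groupby` + `size()` + `unstack()`. It is meant as an illustration of the relationship, not a statement about how pandas implements them internally.

```python
import pandas as pd

# Hypothetical toy data, used only for this illustration
staff_by_city = pd.DataFrame({
    "Department": ["HR", "IT", "IT", "HR", "IT"],
    "City": ["Delhi", "Delhi", "Mumbai", "Mumbai", "Mumbai"],
    "Salary": [50000, 70000, 72000, 52000, 74000],
})

# pivot_table with one index column and aggfunc="mean" ...
via_pivot = staff_by_city.pivot_table(values="Salary", index="Department", aggfunc="mean")
# ... gives the same numbers as a groupby mean on that column
via_groupby = staff_by_city.groupby("Department")[["Salary"]].mean()

# crosstab counts rows for each (Department, City) pair ...
via_crosstab = pd.crosstab(staff_by_city["Department"], staff_by_city["City"])
# ... much like groupby + size, with City moved into the columns
via_size = staff_by_city.groupby(["Department", "City"]).size().unstack(fill_value=0)

print(via_pivot)
print(via_groupby)
print(via_crosstab)
print(via_size)
```

A reasonable rule of thumb is to reach for `pivot_table` or `crosstab` when you want one factor across the columns, and for plain `groupby` when a long, one-row-per-group result is enough.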


## Mini Practice
1. Create a DataFrame with columns: Department, Employee, Salary, Age, City.
2. Group by Department to get average Salary and max Age.
3. Apply multiple aggregations using `agg()`.
4. Create a `pivot_table` to see mean Salary by Department and City.
5. Build a crosstab for Department vs City.

In [None]:
import pandas as pd

# 1. Create the DataFrame
data = {
    "Department": ["HR", "IT", "Finance", "IT", "HR", "Finance", "IT", "HR"],
    "Employee": ["Alice", "Bob", "Charlie", "David", "Emma", "Frank", "George", "Hannah"],
    "Salary": [50000, 65000, 60000, 70000, 52000, 58000, 72000, 51000],
    "Age": [28, 35, 30, 40, 26, 32, 38, 27],
    "City": ["Delhi", "Mumbai", "Delhi", "Chennai", "Mumbai", "Delhi", "Mumbai", "Delhi"]
}

df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# 2. Group by Department to get average Salary and max Age
grouped = df.groupby("Department").agg({"Salary": "mean", "Age": "max"})
print("\nAverage Salary and Max Age by Department:")
print(grouped)

# 3. Apply multiple aggregations using agg() (named aggregation: new_name=(column, function))
multi_agg = df.groupby("Department").agg(
    Salary_mean=("Salary", "mean"),
    Salary_min=("Salary", "min"),
    Age_max=("Age", "max"),
    Age_min=("Age", "min"),
    Count=("Employee", "count")
)
print("\nMultiple aggregations by Department:")
print(multi_agg)

# 4. Create a pivot_table to see mean Salary by Department and City
pivot = df.pivot_table(values="Salary", index="Department", columns="City", aggfunc="mean", fill_value=0)
print("\nPivot table: Mean Salary by Department and City")
print(pivot)

# 5. Build a crosstab for Department vs City
cross = pd.crosstab(df["Department"], df["City"])
print("\nCrosstab: Department vs City")
print(cross)


Original DataFrame:
  Department Employee  Salary  Age     City
0         HR    Alice   50000   28    Delhi
1         IT      Bob   65000   35   Mumbai
2    Finance  Charlie   60000   30    Delhi
3         IT    David   70000   40  Chennai
4         HR     Emma   52000   26   Mumbai
5    Finance    Frank   58000   32    Delhi
6         IT   George   72000   38   Mumbai
7         HR   Hannah   51000   27    Delhi

Average Salary and Max Age by Department:
             Salary  Age
Department              
Finance     59000.0   32
HR          51000.0   28
IT          69000.0   40

Multiple aggregations by Department:
            Salary_mean  Salary_min  Age_max  Age_min  Count
Department                                                  
Finance         59000.0       58000       32       30      2
HR              51000.0       50000       28       26      3
IT              69000.0       65000       40       35      3

Pivot table: Mean Salary by Department and City
City        Chennai    D

---
## Summary
- Learned the split–apply–combine concept.
- Used `groupby()` with aggregation functions.
- Applied multiple aggregations to different columns.
- Explored `pivot_table` and `crosstab` for summarizing data.