## groupby and aggregation
The following are commonly used aggregation functions:

| Function Name | Description                   |
| ------------- | ----------------------------- |
| `mean`        | Average value                 |
| `max`         | Maximum value                 |
| `min`         | Minimum value                 |
| `sum`         | Sum of values                 |
| `count`       | Number of non-NA/null entries |
| `median`      | Median (middle value)         |
| `std`         | Standard deviation            |
| `var`         | Variance                      |
| `first`       | First value                   |
| `last`        | Last value                    |
| `nunique`     | Number of unique values       |
| `prod`        | Product of values             |
| `size`        | Size of group (including NA)  |
| `sem`         | Standard error of the mean    |
| `mad`         | Mean absolute deviation (removed in pandas 2.0) |


- We explore a few of the above functions below.

In [3]:
import pandas as pd

import warnings

# Suppress all FutureWarnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [4]:
# groupby and aggregation

df = pd.read_csv('data_groupby_aggregation.csv')
print("Original Data:\n", df)

Original Data:
   Department     Employee   Salary  Experience Region
0         HR        Rahim  60000.0           2   East
1         IT      Vanessa  75000.0           5   West
2    Finance         Cory  82000.0           7   East
3         HR        David  58000.0           3    NaN
4         IT         Neha  79000.0           4   East
5    Finance        Frank      NaN           9   West
6         HR        Grace  62000.0           2   East
7         IT        Helen  73000.0           6   East
8    Finance          Ian  88000.0           8   West
9    Finance  Shamlodhiya  56000.0           2   West


In [5]:
# Group by Department: how many employees are in each department?
print("Original Data:\n", df)
print("##################")
grouped = df.groupby('Department').size()  # size() counts all rows in each group, including rows that contain NaN values
print(grouped)

Original Data:
   Department     Employee   Salary  Experience Region
0         HR        Rahim  60000.0           2   East
1         IT      Vanessa  75000.0           5   West
2    Finance         Cory  82000.0           7   East
3         HR        David  58000.0           3    NaN
4         IT         Neha  79000.0           4   East
5    Finance        Frank      NaN           9   West
6         HR        Grace  62000.0           2   East
7         IT        Helen  73000.0           6   East
8    Finance          Ian  88000.0           8   West
9    Finance  Shamlodhiya  56000.0           2   West
##################
Department
Finance    4
HR         3
IT         3
dtype: int64


In [6]:
# Now count the non-null values in each column, per group
print("Original Data:\n", df)
print("##################")
grouped = df.groupby('Department').count()
print(grouped)

Original Data:
   Department     Employee   Salary  Experience Region
0         HR        Rahim  60000.0           2   East
1         IT      Vanessa  75000.0           5   West
2    Finance         Cory  82000.0           7   East
3         HR        David  58000.0           3    NaN
4         IT         Neha  79000.0           4   East
5    Finance        Frank      NaN           9   West
6         HR        Grace  62000.0           2   East
7         IT        Helen  73000.0           6   East
8    Finance          Ian  88000.0           8   West
9    Finance  Shamlodhiya  56000.0           2   West
##################
            Employee  Salary  Experience  Region
Department                                      
Finance            4       3           4       4
HR                 3       3           3       2
IT                 3       3           3       3
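
The difference between `size()` and `count()` is easiest to see side by side. The sketch below rebuilds only the `Department` and `Salary` columns by hand (an assumption, since the CSV file itself is not reproduced here) and compares the two results; the column names `rows`, `salaries_present` and `salaries_missing` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Assumption: hand-built copy of two columns from the sample data above.
df = pd.DataFrame({
    "Department": ["HR", "IT", "Finance", "HR", "IT",
                   "Finance", "HR", "IT", "Finance", "Finance"],
    "Salary": [60000, 75000, 82000, 58000, 79000,
               np.nan, 62000, 73000, 88000, 56000],
})

by_dept = df.groupby("Department")

# size() counts rows per group; count() counts non-null values per group.
comparison = pd.DataFrame({
    "rows": by_dept.size(),
    "salaries_present": by_dept["Salary"].count(),
})

# The gap between the two is the number of missing salaries in each group.
comparison["salaries_missing"] = comparison["rows"] - comparison["salaries_present"]
print(comparison)
```

Only Finance shows a gap between the two counts, because Frank's salary is missing.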


In [8]:
# Group by Region and get the size of each group
# Note: rows whose Region is NaN are left out of the grouping by default
print("Original Data:\n", df)
print("##################")
grouped = df.groupby('Region').size()
print(grouped)

Original Data:
   Department     Employee   Salary  Experience Region
0         HR        Rahim  60000.0           2   East
1         IT      Vanessa  75000.0           5   West
2    Finance         Cory  82000.0           7   East
3         HR        David  58000.0           3    NaN
4         IT         Neha  79000.0           4   East
5    Finance        Frank      NaN           9   West
6         HR        Grace  62000.0           2   East
7         IT        Helen  73000.0           6   East
8    Finance          Ian  88000.0           8   West
9    Finance  Shamlodhiya  56000.0           2   West
##################
Region
East    5
West    4
dtype: int64
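
The sizes above add up to 9 rather than 10 because David's `Region` is NaN, and `groupby` drops rows with a missing group key by default. If those rows should be kept as their own group, `dropna=False` can be passed to `groupby`. A minimal sketch, assuming a hand-built copy of the `Employee` and `Region` columns:

```python
import numpy as np
import pandas as pd

# Assumption: hand-built copy of two columns from the sample data above.
df = pd.DataFrame({
    "Employee": ["Rahim", "Vanessa", "Cory", "David", "Neha",
                 "Frank", "Grace", "Helen", "Ian", "Shamlodhiya"],
    "Region": ["East", "West", "East", np.nan, "East",
               "West", "East", "East", "West", "West"],
})

# Default behaviour: the row with a missing Region is excluded.
default_sizes = df.groupby("Region").size()

# dropna=False keeps missing keys as a separate NaN group.
with_missing = df.groupby("Region", dropna=False).size()

print(default_sizes)
print(with_missing)
```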


In [9]:
# Group by Department and Region and get the size of each group
print("Original Data:\n", df)
print("##################")
grouped = df.groupby(['Department', 'Region']).size()
print(grouped)

Original Data:
   Department     Employee   Salary  Experience Region
0         HR        Rahim  60000.0           2   East
1         IT      Vanessa  75000.0           5   West
2    Finance         Cory  82000.0           7   East
3         HR        David  58000.0           3    NaN
4         IT         Neha  79000.0           4   East
5    Finance        Frank      NaN           9   West
6         HR        Grace  62000.0           2   East
7         IT        Helen  73000.0           6   East
8    Finance          Ian  88000.0           8   West
9    Finance  Shamlodhiya  56000.0           2   West
##################
Department  Region
Finance     East      1
            West      3
HR          East      2
IT          East      2
            West      1
dtype: int64
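
Grouping by two columns returns a Series with a two-level index, and combinations that never occur (such as HR in the West) are simply absent. One way to view the result as a table is `unstack`, which moves the inner index level into columns. The sketch below assumes a hand-built copy of the two key columns; `fill_value=0` is a choice made here so that absent combinations show as 0.

```python
import numpy as np
import pandas as pd

# Assumption: hand-built copy of two columns from the sample data above.
df = pd.DataFrame({
    "Department": ["HR", "IT", "Finance", "HR", "IT",
                   "Finance", "HR", "IT", "Finance", "Finance"],
    "Region": ["East", "West", "East", np.nan, "East",
               "West", "East", "East", "West", "West"],
})

sizes = df.groupby(["Department", "Region"]).size()

# Move the Region level into columns; absent combinations become 0.
wide = sizes.unstack(fill_value=0)
print(wide)
```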


In [10]:
# Group by Department and calculate the total salary paid in each department
print("Original Data:\n", df)
print("##################")
grouped = df.groupby(['Department'])['Salary'].sum()
print(grouped)

Original Data:
   Department     Employee   Salary  Experience Region
0         HR        Rahim  60000.0           2   East
1         IT      Vanessa  75000.0           5   West
2    Finance         Cory  82000.0           7   East
3         HR        David  58000.0           3    NaN
4         IT         Neha  79000.0           4   East
5    Finance        Frank      NaN           9   West
6         HR        Grace  62000.0           2   East
7         IT        Helen  73000.0           6   East
8    Finance          Ian  88000.0           8   West
9    Finance  Shamlodhiya  56000.0           2   West
##################
Department
Finance    226000.0
HR         180000.0
IT         227000.0
Name: Salary, dtype: float64


In [11]:
# Group by Department and calculate the mean salary for each department
print("Original Data:\n", df)
print("##################")
grouped = df.groupby('Department')['Salary'].mean()
print(grouped)

Original Data:
   Department     Employee   Salary  Experience Region
0         HR        Rahim  60000.0           2   East
1         IT      Vanessa  75000.0           5   West
2    Finance         Cory  82000.0           7   East
3         HR        David  58000.0           3    NaN
4         IT         Neha  79000.0           4   East
5    Finance        Frank      NaN           9   West
6         HR        Grace  62000.0           2   East
7         IT        Helen  73000.0           6   East
8    Finance          Ian  88000.0           8   West
9    Finance  Shamlodhiya  56000.0           2   West
##################
Department
Finance    75333.333333
HR         60000.000000
IT         75666.666667
Name: Salary, dtype: float64
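
Note that the Finance mean is 226000 / 3, not 226000 / 4: `sum` and `mean` skip NaN values, so Frank's missing salary does not count towards the denominator. The sketch below checks this on a hand-built copy of the data (an assumption, since the CSV is not reproduced here); the helper column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Assumption: hand-built copy of two columns from the sample data above.
df = pd.DataFrame({
    "Department": ["HR", "IT", "Finance", "HR", "IT",
                   "Finance", "HR", "IT", "Finance", "Finance"],
    "Salary": [60000, 75000, 82000, 58000, 79000,
               np.nan, 62000, 73000, 88000, 56000],
})

salary = df.groupby("Department")["Salary"]

check = pd.DataFrame({
    "sum": salary.sum(),      # NaN values are skipped
    "count": salary.count(),  # non-null values only
    "size": salary.size(),    # all rows
    "mean": salary.mean(),
})

# mean matches sum / count; sum / size would understate the Finance average.
check["sum / count"] = check["sum"] / check["count"]
check["sum / size"] = check["sum"] / check["size"]
print(check)
```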


In [12]:
# Group by multiple columns and get the mean salary
print("Original Data:\n", df)
print("##################")
# Group by Department and Region and calculate the mean salary for each combination
grouped = df.groupby(['Department', 'Region'])['Salary'].mean()
print(grouped)

Original Data:
   Department     Employee   Salary  Experience Region
0         HR        Rahim  60000.0           2   East
1         IT      Vanessa  75000.0           5   West
2    Finance         Cory  82000.0           7   East
3         HR        David  58000.0           3    NaN
4         IT         Neha  79000.0           4   East
5    Finance        Frank      NaN           9   West
6         HR        Grace  62000.0           2   East
7         IT        Helen  73000.0           6   East
8    Finance          Ian  88000.0           8   West
9    Finance  Shamlodhiya  56000.0           2   West
##################
Department  Region
Finance     East      82000.0
            West      72000.0
HR          East      61000.0
IT          East      76000.0
            West      75000.0
Name: Salary, dtype: float64
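
The result above is a Series indexed by (Department, Region). For further filtering or merging it is often more convenient to have the keys back as ordinary columns, which `reset_index()` does. A minimal sketch on a hand-built copy of the data; the 75000 threshold is an arbitrary value picked for illustration.

```python
import numpy as np
import pandas as pd

# Assumption: hand-built copy of three columns from the sample data above.
df = pd.DataFrame({
    "Department": ["HR", "IT", "Finance", "HR", "IT",
                   "Finance", "HR", "IT", "Finance", "Finance"],
    "Salary": [60000, 75000, 82000, 58000, 79000,
               np.nan, 62000, 73000, 88000, 56000],
    "Region": ["East", "West", "East", np.nan, "East",
               "West", "East", "East", "West", "West"],
})

# Turn the two index levels back into regular columns.
flat = df.groupby(["Department", "Region"])["Salary"].mean().reset_index()
print(flat)

# The flat frame can be filtered like any other DataFrame.
high = flat[flat["Salary"] > 75000]
print(high)
```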


In [14]:
# Aggregation: group by Department and get multiple stats (mean, max, min, etc.) in one call
print("Original Data:\n", df)
print("##################")
multi_stats = df.groupby('Department').agg({
    'Salary': ['mean', 'max', 'min', 'sum'],
    'Experience': ['mean', 'count']
})
print(multi_stats)

Original Data:
   Department     Employee   Salary  Experience Region
0         HR        Rahim  60000.0           2   East
1         IT      Vanessa  75000.0           5   West
2    Finance         Cory  82000.0           7   East
3         HR        David  58000.0           3    NaN
4         IT         Neha  79000.0           4   East
5    Finance        Frank      NaN           9   West
6         HR        Grace  62000.0           2   East
7         IT        Helen  73000.0           6   East
8    Finance          Ian  88000.0           8   West
9    Finance  Shamlodhiya  56000.0           2   West
##################
                  Salary                             Experience      
                    mean      max      min       sum       mean count
Department                                                           
Finance     75333.333333  88000.0  56000.0  226000.0   6.500000     4
HR          60000.000000  62000.0  58000.0  180000.0   2.333333     3
IT          75666.666667  79000.0  73000.0  227000.0   5.000000     3
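
Passing a dict of lists to `agg` produces two-level column labels such as `('Salary', 'mean')`. If flat, descriptive column names are preferred, pandas also supports named aggregation, where each keyword is the output column and each value is a `(column, function)` pair. The output names below (`avg_salary`, `headcount`, and so on) are made up for illustration, and the data is a hand-built copy of the sample.

```python
import numpy as np
import pandas as pd

# Assumption: hand-built copy of three columns from the sample data above.
df = pd.DataFrame({
    "Department": ["HR", "IT", "Finance", "HR", "IT",
                   "Finance", "HR", "IT", "Finance", "Finance"],
    "Salary": [60000, 75000, 82000, 58000, 79000,
               np.nan, 62000, 73000, 88000, 56000],
    "Experience": [2, 5, 7, 3, 4, 9, 2, 6, 8, 2],
})

# Named aggregation: output_name=(input_column, function).
summary = df.groupby("Department").agg(
    avg_salary=("Salary", "mean"),
    max_salary=("Salary", "max"),
    avg_experience=("Experience", "mean"),
    headcount=("Experience", "count"),
)
print(summary)
```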

In [15]:
# ADVANCED (OPTIONAL): You can also iterate over groups
grouped = df.groupby('Department')
for name, group in grouped:
    print(f"\nDepartment: {name}")
    print(group)


Department: Finance
  Department     Employee   Salary  Experience Region
2    Finance         Cory  82000.0           7   East
5    Finance        Frank      NaN           9   West
8    Finance          Ian  88000.0           8   West
9    Finance  Shamlodhiya  56000.0           2   West

Department: HR
  Department Employee   Salary  Experience Region
0         HR    Rahim  60000.0           2   East
3         HR    David  58000.0           3    NaN
6         HR    Grace  62000.0           2   East

Department: IT
  Department Employee   Salary  Experience Region
1         IT  Vanessa  75000.0           5   West
4         IT     Neha  79000.0           4   East
7         IT    Helen  73000.0           6   East
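
Iterating is not the only way to reach into a grouped object. As a small sketch on a hand-built copy of the data, `ngroups` reports how many groups exist, `get_group` pulls out a single group by its key, and the same loop can be used to build a per-group result (here a dict named `top_paid`, a name chosen for illustration).

```python
import numpy as np
import pandas as pd

# Assumption: hand-built copy of three columns from the sample data above.
df = pd.DataFrame({
    "Department": ["HR", "IT", "Finance", "HR", "IT",
                   "Finance", "HR", "IT", "Finance", "Finance"],
    "Employee": ["Rahim", "Vanessa", "Cory", "David", "Neha",
                 "Frank", "Grace", "Helen", "Ian", "Shamlodhiya"],
    "Salary": [60000, 75000, 82000, 58000, 79000,
               np.nan, 62000, 73000, 88000, 56000],
})

grouped = df.groupby("Department")

# Number of distinct groups.
print(grouped.ngroups)

# Fetch one group directly instead of looping over all of them.
it_rows = grouped.get_group("IT")
print(it_rows)

# Each iteration yields (group key, sub-DataFrame); max() skips NaN.
top_paid = {name: group["Salary"].max() for name, group in grouped}
print(top_paid)
```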
