# Grouping and Aggregating Data

## Outline
* Splitting data into groups
* Operations on groupby objects


In [2]:
import pandas as pd
import numpy as np
from pathlib import Path

df = pd.read_csv(Path('data/employee_attrition.csv'))

# Select only people in this age range
df = df[df.Age >= 21]
df = df[df.Age <= 28]


## Splitting data into groups
Every grouping operation begins by calling the `groupby()` method of a `DataFrame` object.

Split the data into groups by the values of a single column, or of several columns.

In [3]:

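# Group by a single column
grouped = df.groupby('Age')
# Group by several columns; this reassignment is the grouping used in the next cell
grouped = df.groupby(['Age', 'Attrition'])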


Iterating over a groupby object yields `(groupName, group)` pairs. `groupName` is the value of the grouping column when grouping by a single column, or a tuple of values when grouping by multiple columns. `group` is a DataFrame containing the rows of that group.

In [4]:
for groupName, group in grouped:
    print(f'{groupName}: {len(group)} rows')


(21, 'No'): 7 rows
(21, 'Yes'): 6 rows
(22, 'No'): 11 rows
(22, 'Yes'): 5 rows
(23, 'No'): 10 rows
(23, 'Yes'): 4 rows
(24, 'No'): 19 rows
(24, 'Yes'): 7 rows
(25, 'No'): 20 rows
(25, 'Yes'): 6 rows
(26, 'No'): 27 rows
(26, 'Yes'): 12 rows
(27, 'No'): 45 rows
(27, 'Yes'): 3 rows
(28, 'No'): 34 rows
(28, 'Yes'): 14 rows
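
To make the two key shapes concrete, here is a minimal sketch on a small toy DataFrame. The `toy` data and the names `single_keys` and `multi_sizes` are invented for illustration and are not part of the attrition dataset.

```python
import pandas as pd

# Toy data invented for illustration; not taken from employee_attrition.csv
toy = pd.DataFrame({
    'Age': [21, 21, 22, 22, 22],
    'Attrition': ['No', 'Yes', 'No', 'No', 'Yes'],
})

# One grouping column passed as a string: each key is a single value
single_keys = [key for key, _ in toy.groupby('Age')]

# Two grouping columns: each key is a tuple of (Age, Attrition)
multi_sizes = {key: len(group) for key, group in toy.groupby(['Age', 'Attrition'])}

print(single_keys)
print(multi_sizes)
```

When only the row counts are needed, calling `size()` on the groupby object should give the same numbers without an explicit loop.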


Select a single group with `get_group()`, passing the group's key: a single value when grouping by one column, or a tuple of values when grouping by several.

In [5]:
grouped = df.groupby(['Age', 'Attrition', 'Gender'])

grouped.get_group((24, 'No', 'Female'))

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,DistanceFromHome,EnvironmentSatisfaction,Gender,HourlyRate,JobInvolvement,JobLevel,...,PerformanceRating,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
20,24,No,Non-Travel,673,11,1,Female,96,4,2,...,3,4,1,5,5,2,4,2,1,3
96,24,No,Travel_Rarely,1353,3,1,Female,33,3,2,...,4,1,1,4,2,2,3,2,0,2
380,24,No,Travel_Rarely,1371,10,4,Female,77,3,2,...,3,4,1,5,2,4,5,2,0,3
724,24,No,Travel_Rarely,1206,17,4,Female,41,2,2,...,3,2,2,5,6,3,4,2,3,2
1025,24,No,Travel_Rarely,1476,4,4,Female,42,3,2,...,3,3,2,5,3,3,5,4,0,3
1061,24,No,Non-Travel,830,13,4,Female,78,3,1,...,3,3,1,1,2,3,1,0,0,0
1168,24,No,Travel_Frequently,567,2,1,Female,32,3,1,...,3,3,0,6,2,3,6,3,1,3
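
As a small sketch of how group keys and `get_group()` fit together, the example below lists the available keys through the `groups` attribute and then selects one group. The `toy` DataFrame and its values are made up for illustration.

```python
import pandas as pd

# Toy data invented for illustration
toy = pd.DataFrame({
    'Age': [24, 24, 25, 24],
    'Gender': ['Female', 'Male', 'Female', 'Female'],
    'DailyRate': [673, 1353, 830, 567],
})

grouped_toy = toy.groupby(['Age', 'Gender'])

# The keys of `groups` are the values that get_group() accepts
keys = list(grouped_toy.groups)

# The selected rows keep their original index labels
subset = grouped_toy.get_group((24, 'Female'))

print(keys)
print(subset)
```

Asking for a key that does not exist raises a `KeyError`, so checking the key against `groups` first can be useful when keys come from user input.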


## Operations on groupby objects

After splitting the data into groups, we usually want to perform some operation on each group.

### Aggregation

Aggregation can be performed either by calling one of the built-in aggregating methods on the `groupby` object, or by using the `aggregate()` method to apply arbitrary function logic.

In [6]:
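grouped = df.groupby('Age')

dfh = grouped['DistanceFromHome']  # pandas SeriesGroupBy object

dfh.mean()  # computed but not shown: a cell displays only its last expression
dfh.std()   # shown below: standard deviation of DistanceFromHome per age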



Age
21    6.435080
22    7.201562
23    6.780710
24    8.496244
25    8.353719
26    9.683033
27    6.045924
28    8.539333
Name: DistanceFromHome, dtype: float64

In [12]:
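# Several aggregations at once; string names avoid the FutureWarning that
# recent pandas versions emit when np.mean / np.std are passed
dfh.aggregate(['mean', 'std'])

# Arbitrary logic: a hand-written mean, equivalent to dfh.mean() (shown below)
dfh.aggregate(lambda x: sum(x) / len(x))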

Age
21     9.076923
22     8.437500
23     9.142857
24    11.884615
25     8.769231
26    10.230769
27     7.500000
28     8.875000
Name: DistanceFromHome, dtype: float64
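
When several aggregations are combined, named aggregation is one way to control the output column names. The sketch below uses a made-up `toy` DataFrame; the output names (`mean_distance`, `max_distance`, `value_range`) are arbitrary choices for illustration.

```python
import pandas as pd

# Toy data invented for illustration
toy = pd.DataFrame({
    'Age': [21, 21, 22, 22],
    'DistanceFromHome': [2, 4, 10, 20],
})

# Each keyword becomes an output column; the value is a built-in
# aggregation name or any callable that reduces a Series to one value
summary = toy.groupby('Age')['DistanceFromHome'].agg(
    mean_distance='mean',
    max_distance='max',
    value_range=lambda s: s.max() - s.min(),
)

print(summary)
```

The result has one row per age and one column per keyword, which tends to be easier to work with downstream than the default names produced by passing a list.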

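### Transformation

You can also transform the data of each group. Unlike aggregation, `transform()` returns a result with the same index as the original data, so every row receives the value computed for its group.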

In [138]:
grouped = df.groupby('Age')

# Replace every DailyRate value with the mean DailyRate of its age group
standardized_rates = grouped['DailyRate'].transform('mean')

# assign() returns a new DataFrame; df itself is left unchanged
df.assign(DailyRate=standardized_rates)

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,DistanceFromHome,EnvironmentSatisfaction,Gender,HourlyRate,JobInvolvement,JobLevel,...,PerformanceRating,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
4,27,No,Travel_Rarely,907.333333,2,1,Male,40,3,1,...,3,4,1,6,3,3,2,2,2,2
14,28,Yes,Travel_Rarely,927.500000,24,3,Male,50,2,1,...,3,2,0,6,4,3,4,2,0,3
17,22,No,Non-Travel,806.937500,16,4,Male,96,4,1,...,3,2,2,1,2,2,1,0,0,0
20,24,No,Non-Travel,863.115385,11,1,Female,96,4,2,...,3,4,1,5,5,2,4,2,1,3
23,21,No,Travel_Rarely,762.846154,15,3,Male,96,3,1,...,3,4,0,0,6,3,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1433,25,No,Travel_Rarely,768.076923,8,1,Female,85,3,2,...,4,2,1,6,3,2,5,3,0,4
1436,21,No,Travel_Rarely,762.846154,5,3,Male,58,3,1,...,3,4,0,2,6,3,2,2,1,2
1438,23,Yes,Travel_Frequently,707.928571,9,4,Male,33,3,1,...,3,1,1,1,3,2,1,0,1,0
1464,26,No,Travel_Rarely,844.769231,5,4,Female,30,2,1,...,3,4,0,5,2,3,4,2,0,0
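
Because the transformed result lines up row for row with the original data, it can be combined directly with the original column. As a hedged sketch of one common use, the example below computes a within-group z-score on made-up data; the column names `GroupMeanRate` and `RateZScore` are hypothetical.

```python
import pandas as pd

# Toy data invented for illustration
toy = pd.DataFrame({
    'Age': [21, 21, 22, 22, 22],
    'DailyRate': [100, 300, 400, 500, 600],
})

rates = toy.groupby('Age')['DailyRate']

# transform() broadcasts each group's statistic back onto that group's rows
group_mean = rates.transform('mean')
group_std = rates.transform('std')  # sample standard deviation (ddof=1)

# Row-wise arithmetic works because all three Series share the same index
toy = toy.assign(
    GroupMeanRate=group_mean,
    RateZScore=(toy['DailyRate'] - group_mean) / group_std,
)

print(toy)
```

Groups with a single row have an undefined sample standard deviation, so their z-score comes out as `NaN` and may need separate handling.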
