# 1. Alternate GroupBy Syntax

This notebook covers alternative syntax for the **`groupby`** method. Its purpose is to show other syntaxes that you might see in the wild that accomplish exactly the same task. This material can easily confuse beginning pandas users, since these syntaxes do not give you any extra power to do data analysis; they just express the same aggregation in a different way.

In [2]:
import pandas as pd
import numpy as np

# Use City of Houston Employee Data
Read in employee data and add a column for years of experience.

In [3]:
emp = pd.read_csv('../../data/employee.csv', parse_dates=['hire_date'])
emp['experience'] = 2016 - emp['hire_date'].dt.year
emp.head()

,title,dept,salary,race,gender,hire_date,experience
0,POLICE OFFICER,Houston Police Department-HPD,45279.0,White,Male,2015-02-03,1
1,ENGINEER/OPERATOR,Houston Fire Department (HFD),63166.0,White,Male,1982-02-08,34
2,SENIOR POLICE OFFICER,Houston Police Department-HPD,66614.0,Black,Male,1984-11-26,32
3,ENGINEER,Public Works & Engineering-PWE,71680.0,Asian,Male,2012-03-26,4
4,CARPENTER,Houston Airport System (HAS),42390.0,White,Male,2013-11-04,3


# Grouping a single column, aggregating a single column, applying a single function
Originally taught:

In [4]:
emp.groupby('race').agg({'salary': 'mean'})

race,salary
Asian,60143.218391
Black,50366.588803
Hispanic,52533.456693
Native American,64562.142857
White,63834.575646


# Alternative
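You can select the aggregating column with brackets after the **`groupby`** call and pass the aggregating function as a string to the **`agg`** method. The result is a Series rather than a one-column DataFrame. Note that the example below aggregates with `'sum'` rather than `'mean'`.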

In [5]:
emp.groupby('race')['salary'].agg('sum')

race
Asian               5232460.0
Black              26089893.0
Hispanic           20015247.0
Native American      451935.0
White              34598340.0
Name: salary, dtype: float64

You can even bypass the **`agg`** method and call the **`sum`** method directly.

In [6]:
emp.groupby('race')['salary'].sum()

race
Asian               5232460.0
Black              26089893.0
Hispanic           20015247.0
Native American      451935.0
White              34598340.0
Name: salary, dtype: float64
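
As a quick check that the three syntaxes above agree, here is a minimal sketch on a small made-up DataFrame. The `toy` data and the variable names are hypothetical stand-ins, not part of the employee dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for the employee table
toy = pd.DataFrame({
    'race': ['Asian', 'Asian', 'Black', 'Black', 'White'],
    'salary': [50000.0, 70000.0, 40000.0, 60000.0, 80000.0],
})

# Dictionary syntax: returns a one-column DataFrame
via_dict = toy.groupby('race').agg({'salary': 'sum'})

# Bracket selection plus agg with a string: returns a Series
via_agg = toy.groupby('race')['salary'].agg('sum')

# Calling the method directly: also returns a Series
via_method = toy.groupby('race')['salary'].sum()

# Show the return types and whether the values agree
print(type(via_dict).__name__, type(via_agg).__name__, type(via_method).__name__)
print(via_agg.equals(via_method))
print(via_dict['salary'].equals(via_method))
```

The only practical difference is the return type: the dictionary syntax keeps a DataFrame, while selecting a single column with brackets yields a Series.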

# Multiple aggregation functions
Original:

In [7]:
emp.groupby('race').agg({'salary': ['mean', 'sum']})

,salary,salary
race,mean,sum
Asian,60143.218391,5232460.0
Black,50366.588803,26089893.0
Hispanic,52533.456693,20015247.0
Native American,64562.142857,451935.0
White,63834.575646,34598340.0


# Alternative
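Notice that there is no multi-level column index here. Because the aggregating column was selected with brackets, the column labels are just the function names.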

In [8]:
emp.groupby('race')['salary'].agg(['mean', 'sum'])

race,mean,sum
Asian,60143.218391,5232460.0
Black,50366.588803,26089893.0
Hispanic,52533.456693,20015247.0
Native American,64562.142857,451935.0
White,63834.575646,34598340.0
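
The difference in column structure can be inspected directly. The following is a small sketch on hypothetical toy data (the `toy` frame and variable names are made up for illustration) that compares the number of column levels produced by each syntax.

```python
import pandas as pd

# Hypothetical toy data standing in for the employee table
toy = pd.DataFrame({
    'race': ['Asian', 'Asian', 'Black', 'Black', 'White'],
    'salary': [50000.0, 70000.0, 40000.0, 60000.0, 80000.0],
})

# Dictionary syntax: columns are a two-level MultiIndex (column, function)
nested = toy.groupby('race').agg({'salary': ['mean', 'sum']})

# Bracket syntax: columns are a flat Index of function names
flat = toy.groupby('race')['salary'].agg(['mean', 'sum'])

print(nested.columns.nlevels)
print(flat.columns.nlevels)
print(list(flat.columns))

# Selecting the outer 'salary' level of the nested result
# leaves the same flat layout
inner = nested['salary']
print(list(inner.columns))
```

A flat column index is often easier to work with afterwards, since each column can be selected with a single label.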


# Multiple Grouping, Aggregating, and Applying same Functions
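The original dictionary syntax is shown first, followed by the alternative that selects the aggregating columns with brackets. The alternative only works if you are applying the same functions to each aggregating column.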

In [9]:
emp.groupby(['race', 'gender']).agg({'salary': ['min', 'max', 'mean', 'median'],
                                     'experience': ['min', 'max', 'mean', 'median']})

,,salary,salary,salary,salary,experience,experience,experience,experience
race,gender,min,max,mean,median,min,max,mean,median
Asian,Female,26125.0,95950.0,58304.222222,51514.5,1,35,16.277778,13.0
Asian,Male,27914.0,163228.0,60622.956522,55461.0,0,39,13.5,12.5
Black,Female,24960.0,150416.0,48133.381643,40581.0,0,37,13.555556,11.0
Black,Male,26125.0,186192.0,51853.0,49150.0,0,48,13.276074,11.0
Hispanic,Female,26125.0,96157.0,44216.96,42837.5,0,37,10.861386,8.0
Hispanic,Male,26104.0,165216.0,55493.064057,55437.0,0,41,12.843537,11.0
Native American,Female,49379.0,68299.0,58844.333333,58855.0,12,21,15.75,15.0
Native American,Male,55461.0,81239.0,68850.5,69351.0,8,25,15.75,15.0
White,Female,30888.0,178331.0,66415.527778,62783.0,0,41,14.179487,11.5
White,Male,26125.0,210588.0,63439.195745,62540.0,0,58,17.867816,16.0


In [10]:
# Select multiple aggregating columns with a list (note the double brackets)
emp.groupby(['race', 'gender'])[['salary', 'experience']].agg(['min', 'max', 'mean', 'median'])

,,salary,salary,salary,salary,experience,experience,experience,experience
race,gender,min,max,mean,median,min,max,mean,median
Asian,Female,26125.0,95950.0,58304.222222,51514.5,1,35,16.277778,13.0
Asian,Male,27914.0,163228.0,60622.956522,55461.0,0,39,13.5,12.5
Black,Female,24960.0,150416.0,48133.381643,40581.0,0,37,13.555556,11.0
Black,Male,26125.0,186192.0,51853.0,49150.0,0,48,13.276074,11.0
Hispanic,Female,26125.0,96157.0,44216.96,42837.5,0,37,10.861386,8.0
Hispanic,Male,26104.0,165216.0,55493.064057,55437.0,0,41,12.843537,11.0
Native American,Female,49379.0,68299.0,58844.333333,58855.0,12,21,15.75,15.0
Native American,Male,55461.0,81239.0,68850.5,69351.0,8,25,15.75,15.0
White,Female,30888.0,178331.0,66415.527778,62783.0,0,41,14.179487,11.5
White,Male,26125.0,210588.0,63439.195745,62540.0,0,58,17.867816,16.0
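
To illustrate that the dictionary syntax and the column-list syntax give the same table when every column gets the same functions, here is a minimal sketch. The `toy` data and the names `funcs`, `via_dict`, and `via_list` are hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for the employee table
toy = pd.DataFrame({
    'race': ['Asian', 'Asian', 'Black', 'Black', 'Black'],
    'gender': ['Female', 'Male', 'Female', 'Male', 'Male'],
    'salary': [50000.0, 70000.0, 40000.0, 60000.0, 65000.0],
    'experience': [3, 10, 5, 12, 7],
})

funcs = ['min', 'max']

# Dictionary syntax: repeat the same list of functions for every column
via_dict = toy.groupby(['race', 'gender']).agg({'salary': funcs,
                                                'experience': funcs})

# Column-list syntax: select the columns once, then pass the functions once
via_list = toy.groupby(['race', 'gender'])[['salary', 'experience']].agg(funcs)

# Both results have the same column labels and the same values
print(via_dict.columns.tolist() == via_list.columns.tolist())
print((via_dict.values == via_list.values).all())
```

If the columns need different functions, the dictionary syntax is still required, since the list syntax applies every function to every selected column.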


# Alternative - No Aggregating Columns
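You do not actually need to specify the aggregating columns when grouping. In the pandas version used for this notebook, columns that don't work for a particular aggregation method are silently dropped. For instance, only numeric columns have a mean, so all other columns are dropped; the only numeric columns here are salary and experience. Be aware that pandas 2.0 and later no longer drop such columns silently and raise an error instead, so there you must select the numeric columns first or pass `numeric_only=True` to methods that support it.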

In [11]:
emp.groupby(['race', 'gender']).agg(['min', 'max', 'mean', 'median'])

,,salary,salary,salary,salary,experience,experience,experience,experience
race,gender,min,max,mean,median,min,max,mean,median
Asian,Female,26125.0,95950.0,58304.222222,51514.5,1,35,16.277778,13.0
Asian,Male,27914.0,163228.0,60622.956522,55461.0,0,39,13.5,12.5
Black,Female,24960.0,150416.0,48133.381643,40581.0,0,37,13.555556,11.0
Black,Male,26125.0,186192.0,51853.0,49150.0,0,48,13.276074,11.0
Hispanic,Female,26125.0,96157.0,44216.96,42837.5,0,37,10.861386,8.0
Hispanic,Male,26104.0,165216.0,55493.064057,55437.0,0,41,12.843537,11.0
Native American,Female,49379.0,68299.0,58844.333333,58855.0,12,21,15.75,15.0
Native American,Male,55461.0,81239.0,68850.5,69351.0,8,25,15.75,15.0
White,Female,30888.0,178331.0,66415.527778,62783.0,0,41,14.179487,11.5
White,Male,26125.0,210588.0,63439.195745,62540.0,0,58,17.867816,16.0


You can even call a method directly after grouping to apply it to all columns.

In [12]:
emp.groupby(['race', 'gender']).mean()

race,gender,salary,experience
Asian,Female,58304.222222,16.277778
Asian,Male,60622.956522,13.5
Black,Female,48133.381643,13.555556
Black,Male,51853.0,13.276074
Hispanic,Female,44216.96,10.861386
Hispanic,Male,55493.064057,12.843537
Native American,Female,58844.333333,15.75
Native American,Male,68850.5,15.75
White,Female,66415.527778,14.179487
White,Male,63439.195745,17.867816
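
Because the handling of non-numeric columns depends on the pandas version, one way to get the output above regardless of version is to ask for numeric columns explicitly. The sketch below uses hypothetical toy data (the `toy` frame and the `g_toy` name are made up) and should behave the same whether or not non-numeric columns are dropped automatically.

```python
import pandas as pd

# Hypothetical toy data with one non-numeric column besides the grouping keys
toy = pd.DataFrame({
    'race': ['Asian', 'Asian', 'Black', 'Black'],
    'gender': ['Female', 'Male', 'Female', 'Male'],
    'title': ['ENGINEER', 'CARPENTER', 'POLICE OFFICER', 'ENGINEER'],
    'salary': [50000.0, 70000.0, 40000.0, 60000.0],
    'experience': [3, 10, 5, 12],
})

g_toy = toy.groupby(['race', 'gender'])

# Option 1: tell mean to consider numeric columns only
means = g_toy.mean(numeric_only=True)

# Option 2: select the numeric columns explicitly before aggregating
explicit = g_toy[['salary', 'experience']].mean()

print(list(means.columns))
print(means.equals(explicit))
```

Selecting the columns explicitly is more verbose but makes it obvious to a reader which columns are being aggregated.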


The **`count`** method works for all columns, not just numeric ones. It returns the number of non-missing values in each column for each group.

In [13]:
emp.groupby(['race', 'gender']).count()

race,gender,title,dept,salary,hire_date,experience
Asian,Female,18,18,18,18,18
Asian,Male,70,70,69,70,70
Black,Female,216,216,207,216,216
Black,Male,326,326,311,326,326
Hispanic,Female,101,101,100,101,101
Hispanic,Male,294,294,281,294,294
Native American,Female,4,4,3,4,4
Native American,Male,4,4,4,4,4
White,Female,78,78,72,78,78
White,Male,522,522,470,522,522


# A trick to discovering all the groupby methods
Tab completion will not work directly after a **`groupby`** call in the same expression. The completer generally cannot tell what kind of object the call will return before it has been executed, so it does not know what to display. However, you can assign the result of a groupby to a variable and then use your normal tab completion to reveal all the groupby methods.

Placing a dot after the groupby below and pressing tab will not work:

In [14]:
emp.groupby(['race', 'gender'])

<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x110f57b70>

Instead, assign it to a variable and then get access to the groupby methods:

In [15]:
g = emp.groupby(['race', 'gender'])
# execute this cell then uncomment the next line and press tab
# g.
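
If tab completion is not available, for example when running a plain script, the built-in `dir` function gives a rough equivalent. This is a small sketch on hypothetical toy data; the names `toy`, `g_toy`, and `public` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for the employee table
toy = pd.DataFrame({
    'race': ['Asian', 'Asian', 'Black', 'Black'],
    'salary': [50000.0, 70000.0, 40000.0, 60000.0],
})

g_toy = toy.groupby('race')

# Collect the public attribute names of the groupby object,
# skipping names that start with an underscore
public = [name for name in dir(g_toy) if not name.startswith('_')]

# Confirm that a few familiar aggregation methods are among them
wanted = ['agg', 'count', 'mean', 'size', 'sum']
print([name for name in wanted if name in public])
print(len(public))
```

The list also contains non-aggregating attributes, so it is a starting point for exploration rather than a list of aggregation functions.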

## `size` vs `count`
**`size`** returns the number of rows in each group, missing values included. Since that number is the same for every column, pandas returns just a single Series. **`count`** returns the number of non-missing values, which may differ from column to column, so it is applied to every column of the DataFrame.

In [16]:
g.size()

race             gender
Asian            Female     18
                 Male       70
Black            Female    216
                 Male      326
Hispanic         Female    101
                 Male      294
Native American  Female      4
                 Male        4
White            Female     78
                 Male      522
dtype: int64

In [17]:
g.count()

race,gender,title,dept,salary,hire_date,experience
Asian,Female,18,18,18,18,18
Asian,Male,70,70,69,70,70
Black,Female,216,216,207,216,216
Black,Male,326,326,311,326,326
Hispanic,Female,101,101,100,101,101
Hispanic,Male,294,294,281,294,294
Native American,Female,4,4,3,4,4
Native American,Male,4,4,4,4,4
White,Female,78,78,72,78,78
White,Male,522,522,470,522,522
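
The gap between `size` and `count` is exactly the number of missing values per group, which is why some salary counts above are lower than the group sizes. Here is a minimal sketch on hypothetical toy data with a few missing salaries; the names `toy`, `g_toy`, and `missing_salary` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing salaries
toy = pd.DataFrame({
    'race': ['Asian', 'Asian', 'Black', 'Black', 'Black'],
    'title': ['A', 'B', 'C', 'D', 'E'],
    'salary': [50000.0, np.nan, 40000.0, 60000.0, np.nan],
})

g_toy = toy.groupby('race')

# size: one number per group (rows in the group), returned as a Series
sizes = g_toy.size()

# count: non-missing values per group for every column, returned as a DataFrame
counts = g_toy.count()

# Rows minus non-missing values gives the missing salaries per group
missing_salary = sizes - counts['salary']

print(sizes)
print(counts)
print(missing_salary)
```

Use `size` when you want the number of rows per group and `count` when you care about how many values are actually present in each column.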


### Practice calling methods directly from `g`

# Exercises
Go through the previous groupby notebooks and use the alternate syntax to answer the problems.