# 11. Split-Apply-Combine Aggregation Basics

### Introduction
In previous notebooks, when we called a method such as **`sum`** on a DataFrame, it operated on all of the data at once. In this notebook, we will apply methods to distinct groups within our data rather than to the whole. Split-apply-combine is a popular term for this strategy. You can also simply refer to it as **grouping** data.

#### Examples of questions we can answer
The split-apply-combine strategy can be used to answer questions such as:
* What is the maximum salary for every department at a company?
* What is the average temperature and precipitation for every month in different cities?
* What are the top 5 best-selling shirts at each store?

#### Definitions
* **Split** - data is split into distinct and independent groups, with each member assigned to a group by a certain criterion
* **Apply** - a function is applied to each group independently
* **Combine** - the results are combined into a single dataset

![](images/split-apply-combine.png)
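The three steps can be sketched by hand on a tiny DataFrame. The data below is made up for illustration, and the names `toy`, `pieces`, `applied`, `manual_result`, and `builtin_result` are our own. This is only a sketch of the idea, not how pandas implements it internally.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'dept': ['A', 'B', 'A', 'B', 'C'],
    'salary': [10, 20, 30, 40, 50],
})

# Split: one sub-DataFrame per distinct value of 'dept'
pieces = {name: group for name, group in toy.groupby('dept')}

# Apply: reduce each piece independently to a single value
applied = {name: group['salary'].max() for name, group in pieces.items()}

# Combine: gather the per-group results into one object
manual_result = pd.Series(applied, name='salary')
manual_result.index.name = 'dept'

# groupby performs all three steps in a single expression
builtin_result = toy.groupby('dept')['salary'].max()

print(manual_result)
print(builtin_result)
```

Both results hold one maximum salary per department, which is why we rarely need to write the loop ourselves.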

In [3]:
import pandas as pd
import numpy as np
pd.options.display.max_columns = 100

## NYC Leading Causes of Death Data
To get started with split-apply-combine, we will use a small dataset containing the leading causes of death in NYC from 2007-2014. [This dataset][1] may be found at the [NYC Open Data][2] site.

[1]: https://data.cityofnewyork.us/Health/New-York-City-Leading-Causes-of-Death/jb7j-dtam
[2]: https://opendata.cityofnewyork.us/

In [13]:
table_of_nyc = pd.read_csv('data/nyc_deaths.csv')
table_of_nyc.head()

| | year | cause | sex | race | deaths |
|---|---|---|---|---|---|
| 0 | 2007 | Accidents | F | Asian | 32 |
| 1 | 2007 | Accidents | F | Black | 87 |
| 2 | 2007 | Accidents | F | Hispanic | 71 |
| 3 | 2007 | Accidents | F | White | 162 |
| 4 | 2007 | Accidents | M | Asian | 53 |


## Grouping with the **`groupby`** method
Every split-apply-combine task in this notebook starts with the **`groupby`** method.

### Aggregation
By far, the most common type of function to apply to each group is an aggregation function. As we have previously learned, to aggregate means to take all the values in the group and summarize them with a single value. 


## Strange Syntax for using the `groupby` method

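### Must use method chaining with `groupby`
Calling **`groupby`** on its own does not compute a result. It returns a GroupBy object that waits for another method, such as **`agg`**, to be chained to it.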

### Performing an Aggregation with `agg`
To perform an aggregation, chain the **`agg`** method after the call to **`groupby`**, following this pattern:


### `df.groupby('<grouping column>').agg({'<aggregating column>':'<aggregating function>'})`

Let's see an example of this by finding the total number of deaths per year.

In [14]:
table_of_nyc.groupby('year').agg({'deaths':'sum'})

| year | deaths |
|---|---|
| 2007 | 53996 |
| 2008 | 54138 |
| 2009 | 52820 |
| 2010 | 52505 |
| 2011 | 52726 |
| 2012 | 52420 |
| 2013 | 53387 |
| 2014 | 53006 |


## Explanation
Every **`groupby`** aggregation has three separate pieces:
* **grouping column** - Every distinct value in this column forms its own group
* **aggregating column** - This is the column we apply a function to so that it is aggregated (reduced to a single value per group). This column is usually numeric.
* **aggregating function** - This is the function that is applied to the aggregating column.

## Always identify each piece
When facing a problem that requires grouping and aggregating, it is important to identify each of the pieces. This will help you insert them in the right place in the syntax above. In the above example, we have:

* **grouping column** - **`year`**
* **aggregating column** - **`deaths`**
* **aggregating function** - **`sum`**
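One way to make the three pieces explicit is to wrap the syntax in a small function. The helper `group_aggregate` below is hypothetical (it is not part of pandas), and the rows in `toy` are made up; only the column names are borrowed from the NYC data.

```python
import pandas as pd

def group_aggregate(df, grouping_col, aggregating_col, aggregating_func):
    """Hypothetical helper: slots the three pieces into the groupby/agg syntax."""
    return df.groupby(grouping_col).agg({aggregating_col: aggregating_func})

# Made-up rows that reuse the column names of the NYC deaths data
toy = pd.DataFrame({
    'year': [2007, 2007, 2008, 2008],
    'cause': ['Accidents', 'Cancer', 'Accidents', 'Cancer'],
    'deaths': [10, 40, 20, 35],
})

# grouping column = 'year', aggregating column = 'deaths', function = 'sum'
per_year = group_aggregate(toy, 'year', 'deaths', 'sum')

# grouping column = 'cause', aggregating column = 'deaths', function = 'max'
per_cause = group_aggregate(toy, 'cause', 'deaths', 'max')

print(per_year)
print(per_cause)
```

Changing any one of the three arguments changes the question being answered, while the syntax stays the same.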

### Use string names for aggregation functions
pandas accepts many aggregation functions by their string names. Below are most of the available names you can use.
+ **`sum`**
+ **`min`**
+ **`max`**
+ **`mean`**
+ **`median`**
+ **`std`**
+ **`var`**
+ **`count`** - count of non-missing values
+ **`size`** - count of all elements
+ **`first`** - first value in group
+ **`last`** - last value in group
+ **`idxmax`** - index of maximum value in group
+ **`idxmin`** - index of minimum value in group
+ **`any`** - checks for at least one True value - returns boolean
+ **`all`** - checks that every value is True - returns boolean
+ **`nunique`** - number of unique values in group
+ **`sem`** - standard error of the mean
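A few of these names are easy to confuse, so here is a small sketch on made-up data (the `toy` frame and its `team`, `score`, and `passed` columns are assumptions for illustration). It contrasts **`count`** with **`size`** when a value is missing, and **`any`** with **`all`** on a boolean column.

```python
import numpy as np
import pandas as pd

# Made-up data: team 'x' has one missing score and one False value
toy = pd.DataFrame({
    'team': ['x', 'x', 'x', 'y', 'y'],
    'score': [1.0, np.nan, 3.0, 4.0, 4.0],
    'passed': [True, False, True, True, True],
})

grouped = toy.groupby('team')

# 'count' skips missing values; 'size' counts every row in the group
counts = grouped.agg({'score': 'count'})
sizes = grouped.agg({'score': 'size'})

# 'nunique' counts the distinct non-missing values
distinct = grouped.agg({'score': 'nunique'})

# 'any' is True if at least one value is True; 'all' only if every value is True
any_passed = grouped.agg({'passed': 'any'})
all_passed = grouped.agg({'passed': 'all'})

print(counts, sizes, distinct, any_passed, all_passed, sep='\n\n')
```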

## Find the maximum deaths for each leading cause
Identify each component of the aggregation:

* **grouping column** - **`cause`**
* **aggregating column** - **`deaths`**
* **aggregating function** - **`max`**

Then place each component in the proper place for the syntax:

In [15]:
table_of_nyc.groupby('cause').agg({'deaths':'max'})

| cause | deaths |
|---|---|
| Accidents | 297 |
| Alzheimer's | 276 |
| Cancer | 3518 |
| Congenital Malformations | 14 |
| Diabetes | 410 |
| Flu and Pneumonia | 707 |
| HIV | 377 |
| Heart Disease | 7050 |
| Hepatitis | 15 |
| Homicide | 299 |


## Understanding the index
If you were paying close attention, you would have noticed that the grouping column gets placed in the index. In our first example above, **`year`** is now the index. It is not a column.

In [16]:
func_groupby_year_deaths = table_of_nyc.groupby('year').agg({'deaths':'sum'})
func_groupby_year_deaths

| year | deaths |
|---|---|
| 2007 | 53996 |
| 2008 | 54138 |
| 2009 | 52820 |
| 2010 | 52505 |
| 2011 | 52726 |
| 2012 | 52420 |
| 2013 | 53387 |
| 2014 | 53006 |
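To see what it means for the grouping column to live in the index, here is a minimal sketch on made-up numbers (the `toy` frame and the variable names are our own). It checks the index name, selects one group's result by label with **`.loc`**, and uses **`idxmax`** to return an index label.

```python
import pandas as pd

# Made-up rows that reuse the column names of the NYC deaths data
toy = pd.DataFrame({
    'year': [2007, 2007, 2008, 2008],
    'deaths': [10, 40, 20, 35],
})

by_year = toy.groupby('year').agg({'deaths': 'sum'})

print(by_year.index.name)     # the index takes its name from the grouping column
print(list(by_year.columns))  # only the aggregating column remains as a column

# Because 'year' is the index, a group's result is selected by label
deaths_2008 = by_year.loc[2008, 'deaths']

# idxmax returns the index label (here, a year) of the largest value
top_year = by_year['deaths'].idxmax()

print(deaths_2008, top_year)
```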


### Use `reset_index`
All DataFrames come equipped with a **`reset_index`** method, which turns the index into a column of data and replaces it with the default integer index. You can chain it after the call to **`agg`**.

In [17]:
table_of_nyc.groupby('year').agg({'deaths':'sum'}).reset_index()

| | year | deaths |
|---|---|---|
| 0 | 2007 | 53996 |
| 1 | 2008 | 54138 |
| 2 | 2009 | 52820 |
| 3 | 2010 | 52505 |
| 4 | 2011 | 52726 |
| 5 | 2012 | 52420 |
| 6 | 2013 | 53387 |
| 7 | 2014 | 53006 |
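An alternative to chaining **`reset_index`** is to pass `as_index=False` to **`groupby`**, which keeps the grouping column as a regular column from the start. The sketch below compares the two on made-up numbers; the `toy` frame and the names `chained` and `direct` are assumptions for illustration.

```python
import pandas as pd

# Made-up rows that reuse the column names of the NYC deaths data
toy = pd.DataFrame({
    'year': [2007, 2007, 2008, 2008],
    'deaths': [10, 40, 20, 35],
})

# Option 1: aggregate, then move the index back into a column
chained = toy.groupby('year').agg({'deaths': 'sum'}).reset_index()

# Option 2: ask groupby not to put the grouping column in the index
direct = toy.groupby('year', as_index=False).agg({'deaths': 'sum'})

print(chained)
print(direct)
```

In both results, `year` is an ordinary column and the index is the default integer index.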


# Exercises

### Problem 1
<span  style="color:green; font-size:16px">What year had the most deaths?</span>

In [25]:
import pandas as pd
table_of_nyc = pd.read_csv('data/nyc_deaths.csv')
table_of_nyc.head()

| | year | cause | sex | race | deaths |
|---|---|---|---|---|---|
| 0 | 2007 | Accidents | F | Asian | 32 |
| 1 | 2007 | Accidents | F | Black | 87 |
| 2 | 2007 | Accidents | F | Hispanic | 71 |
| 3 | 2007 | Accidents | F | White | 162 |
| 4 | 2007 | Accidents | M | Asian | 53 |


In [26]:
func_groupby_year_deaths = table_of_nyc.groupby('year').agg({'deaths':'sum'})
func_groupby_year_deaths.idxmax()

deaths    2008
dtype: int64

### Problem 2
<span  style="color:green; font-size:16px">Find the total number of deaths by race and sort by most to least.</span>
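One possible approach is to aggregate and then chain **`sort_values`**. The sketch below runs on a small made-up frame, not the real NYC data, so the numbers are illustrative only; the names `toy` and `by_race` are our own.

```python
import pandas as pd

# Made-up rows that reuse the 'race' and 'deaths' column names
toy = pd.DataFrame({
    'race': ['Asian', 'Black', 'Asian', 'White', 'Black', 'White'],
    'deaths': [5, 20, 7, 30, 10, 1],
})

# Total deaths per race, sorted from most to least
by_race = (
    toy.groupby('race')
       .agg({'deaths': 'sum'})
       .sort_values('deaths', ascending=False)
)

print(by_race)
```

The same chain should work on `table_of_nyc`, since it has the same two column names.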

### Use the employee dataset for the remaining problems

In [28]:
emp = pd.read_csv('./data/employee.csv')
emp.head()

| | title | dept | salary | race | gender | experience |
|---|---|---|---|---|---|---|
| 0 | POLICE OFFICER | Houston Police Department-HPD | 45279.0 | White | Male | 1 |
| 1 | ENGINEER/OPERATOR | Houston Fire Department (HFD) | 63166.0 | White | Male | 34 |
| 2 | SENIOR POLICE OFFICER | Houston Police Department-HPD | 66614.0 | Black | Male | 32 |
| 3 | ENGINEER | Public Works & Engineering-PWE | 71680.0 | Asian | Male | 4 |
| 4 | CARPENTER | Houston Airport System (HAS) | 42390.0 | White | Male | 3 |


### Problem 3
<span  style="color:green; font-size:16px">Find the maximum salary for each gender.</span>

In [30]:
emp.groupby('gender').agg({'salary':'max'})

| gender | salary |
|---|---|
| Female | 178331.0 |
| Male | 210588.0 |


### Problem 4
<span  style="color:green; font-size:16px">Find the median salary for each department.</span>

In [32]:
emp.groupby('dept').agg({'salary':'median'}).head()

| dept | salary |
|---|---|
| Health & Human Services | 46384.0 |
| Houston Airport System (HAS) | 41808.0 |
| Houston Fire Department (HFD) | 61921.0 |
| Houston Police Department-HPD | 61643.0 |
| Parks & Recreation | 35027.0 |


### Problem 5
<span  style="color:green; font-size:16px">Find the average salary for each race. Return a DataFrame with the race as a column.</span>

In [33]:
emp.groupby('race', as_index=False).agg({'salary':'mean'})

| | race | salary |
|---|---|---|
| 0 | Asian | 60143.218391 |
| 1 | Black | 50366.588803 |
| 2 | Hispanic | 52533.456693 |
| 3 | Native American | 64562.142857 |
| 4 | White | 63834.575646 |
