# 1. Groupby Aggregation Basics

### Objectives

+ Define split, apply, combine and why it is useful for data analysis
+ Know the definition of an aggregation
+ Group by a single column
+ Aggregate a single column
+ Apply a single function
+ Use this syntax: **`df.groupby('<grouping column>').agg({'<aggregating column>':'<aggregating function>'})`**
+ For every group by aggregation, identify **grouping column**, **aggregating column**, and the **aggregating function**
+ Remove the grouping column from the index with the **`reset_index`** method
+ Know that the `groupby` method returns a **GroupBy** object

### Resources
+ Read the pandas [split apply combine documentation](http://pandas.pydata.org/pandas-docs/stable/groupby.html) stopping at 'transformation'.

### Introduction
In previous notebooks, when we called a method such as **`sum`** on a DataFrame, it operated on the data as a whole. In this notebook, we will perform actions on distinct groups within our data rather than on the whole. Split-apply-combine is a popular term for this idea. You can also refer to it as **grouping** data.

#### Examples of questions we can answer
The split-apply-combine strategy can be used to answer questions such as:
* What is the maximum salary for every department at a company?
* What is the average temperature and precipitation for every month for different cities?
* What are the top 5 best-selling shirts at each store?

#### Definitions
* **Split** - The data is split into distinct and independent groups, with each member assigned to a group by a certain criterion
* **Apply** - Apply a function to each group independently
* **Combine** - Combine the results of the function applied to each group back together to form a single dataset again

![](images/split-apply-combine.png)
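
To make the three steps concrete, here is a minimal sketch that performs them by hand and then compares the outcome with a single `groupby` call (the syntax is explained later in this notebook). The `toy` DataFrame and its `dept` and `salary` columns are made up for illustration and are not part of the datasets used here.

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration
toy = pd.DataFrame({
    'dept': ['A', 'B', 'A', 'B', 'A'],
    'salary': [10, 20, 30, 40, 50],
})

# Split: build one sub-DataFrame per distinct value of 'dept'
pieces = {dept: toy[toy['dept'] == dept] for dept in toy['dept'].unique()}

# Apply: compute the maximum salary of each piece independently
maxima = {dept: piece['salary'].max() for dept, piece in pieces.items()}

# Combine: gather the per-group results into a single Series
manual_result = pd.Series(maxima, name='salary').sort_index()

# groupby performs the same three steps in one line
groupby_result = toy.groupby('dept').agg({'salary': 'max'})
```

Both `manual_result` and the `salary` column of `groupby_result` hold one maximum per department.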

In [None]:
import pandas as pd
import numpy as np
pd.options.display.max_columns = 100

## NYC Leading Causes of Death Data
To get started with split-apply-combine, we will use a small dataset containing the leading causes of death in NYC from 2007-2014. [This dataset][1] may be found at the [NYC Open Data][2] site.

[1]: https://data.cityofnewyork.us/Health/New-York-City-Leading-Causes-of-Death/jb7j-dtam
[2]: https://opendata.cityofnewyork.us/

In [None]:
nyc = pd.read_csv('../data/nyc_deaths.csv')
nyc.head()

## Grouping with the **`groupby`** method
All of the tasks involving grouping use the **`groupby`** method. This one method is responsible for splitting the data into independent groups, applying the desired function or functions to each group, and combining the results back together, and usually does so in a single line of code.

### Aggregation
By far, the most common type of function to apply to each group is an aggregation function. As we have previously learned, to aggregate means to take all the values in a group and summarize them with a single value. An aggregation always returns a single value for each group. Taking the sum, mean, min, max, standard deviation, count, etc. are all examples of aggregations. [See here for more.](https://en.wikipedia.org/wiki/Aggregate_function)


## Syntax for using the `groupby` method
The **`groupby`** method is not as straightforward to use as most other methods. It will take more effort to learn how it works. Unfortunately, there are several different valid types of syntax that do the same thing.

### Must use method chaining with `groupby`
Nearly all calls to the **`groupby`** method must have another method chained to them to return a result.

### Performing an Aggregation with `agg`
To perform an aggregation, you must chain the **`agg`** method to your call to **`groupby`**. The basic syntax takes the following form:

```
df.groupby('<grouping column>').agg({'<aggregating column>':'<aggregating function>'})
```

Let's see an example of this by finding the total number of deaths per year.

In [None]:
nyc.groupby('year').agg({'deaths':'sum'})

## Explanation
Every **`groupby`** aggregation has three separate pieces:
* **Grouping column** - Every distinct value in this column forms its own group
* **Aggregating column** - This is the column we apply a function to such that it aggregates (returns a single value). This column is usually numeric.
* **Aggregating function** - This is the function that is applied to the aggregating column.

## Always identify each piece
When facing a problem where you will be grouping and aggregating, it is important to identify each of the pieces. This will help you insert them in the right place of the syntax above. In the above example, we have:

* **Grouping column** - **`year`**
* **Aggregating column** - **`deaths`**
* **Aggregating function** - **`sum`**
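
As a further illustration of identifying the pieces, the sketch below answers "what is the average temperature for each city?" by naming each piece first and then placing it into the syntax. The `weather` DataFrame, its columns, and the three variable names are hypothetical and exist only for this example.

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration
weather = pd.DataFrame({
    'city': ['Houston', 'Houston', 'Denver', 'Denver'],
    'temp': [90, 80, 60, 70],
})

# Identify each piece before writing the groupby call
grouping_column = 'city'        # every distinct city forms its own group
aggregating_column = 'temp'     # the column being summarized
aggregating_function = 'mean'   # how each group is summarized

# Insert the pieces into the standard syntax
result = weather.groupby(grouping_column).agg(
    {aggregating_column: aggregating_function}
)
```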

### Use string names for aggregation functions
Pandas understands many string aggregation functions. Below are most of the available string names you can use. 
+ **`sum`**
+ **`min`**
+ **`max`**
+ **`mean`**
+ **`median`**
+ **`std`**
+ **`var`**
+ **`count`** - count of non-missing values
+ **`size`** - count of all elements
+ **`first`** - first value in group
+ **`last`** - last value in group
+ **`idxmax`** - index of maximum value in group
+ **`idxmin`** - index of minimum value in group
+ **`any`** - checks for at least one True value - returns boolean
+ **`all`** - checks that every value is True - returns boolean
+ **`nunique`** - number of unique values in group
+ **`sem`** - standard error of the mean

Later on we will see where these names came from.
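
Some of these names are easy to confuse. The sketch below, using a made-up `toy` DataFrame with one missing value, shows how **`count`** differs from **`size`** and how **`any`** differs from **`all`**. The column names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: group 'a' has one missing value and one False flag
toy = pd.DataFrame({
    'group': ['a', 'a', 'a', 'b', 'b'],
    'value': [1.0, np.nan, 3.0, 4.0, 5.0],
    'flag': [True, False, True, True, True],
})
g = toy.groupby('group')

# count skips missing values; size counts every row in the group
counts = g.agg({'value': 'count'})
sizes = g.agg({'value': 'size'})

# any needs at least one True; all needs every value to be True
any_true = g.agg({'flag': 'any'})
all_true = g.agg({'flag': 'all'})
```

For group `a`, `count` reports 2 while `size` reports 3, and `any` is True while `all` is False.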

## Deeper explanation on method chaining with `groupby`
The `groupby` syntax is a bit strange in that it requires method chaining to deliver results. Let's examine the results of making a call just to the **`groupby`** method.

In [None]:
nyc.groupby('year')

### What is that?
Calling **`groupby`** by itself does not do much. You are simply alerting pandas that you would like to create distinct groups with a particular column. It has formally returned a **`DataFrameGroupBy`** object. Just like all Pandas objects, you can see a list of all its [attributes and methods in the API][1]. This type of object is not crucial to dive into at this point.

### Assign the `groupby` object to a variable
Let's assign the result of the call to **`groupby`** as a variable and verify its type.

[1]: http://pandas.pydata.org/pandas-docs/stable/api.html#groupby

In [None]:
g = nyc.groupby('year')

In [None]:
type(g)

## `GroupBy` objects
The documentation refers to the object returned from a call to the **`groupby`** method as a **GroupBy** object. Technically there are two specific objects - **`DataFrameGroupBy`** (as we saw above) and **`SeriesGroupBy`**. It's not necessary to think much about these objects. Just be aware that a call to **`groupby`** returns some other object that is not a DataFrame or a Series. It is a **GroupBy** object with its own attributes and methods.
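
A small sketch of where each of the two objects comes from, using a made-up `toy` DataFrame. Selecting a single column from the GroupBy object with square brackets is one way to obtain a **`SeriesGroupBy`**; that syntax is not covered in this notebook and is shown only to make the distinction visible.

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration
toy = pd.DataFrame({'year': [2007, 2007, 2008], 'deaths': [5, 7, 2]})

g = toy.groupby('year')

# Grouping a DataFrame gives a DataFrameGroupBy
frame_type = type(g).__name__

# Selecting one column from it gives a SeriesGroupBy
series_type = type(g['deaths']).__name__

# Neither is a DataFrame or a Series
is_frame = isinstance(g, pd.DataFrame)
is_series = isinstance(g['deaths'], pd.Series)
```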

### The `groups` attribute
The GroupBy object has an interesting **`groups`** attribute. It is a dictionary that maps each distinct group value (the key) to the index labels of the rows belonging to that group.

In [None]:
g.groups

There is also an `ngroups` attribute that returns an integer of the number of distinct groups.

In [None]:
g.ngroups
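
Because the NYC data has many rows, the mapping is easier to read on a tiny example. The sketch below uses a made-up five-row `toy` DataFrame and converts each group's index labels to a plain list for readability.

```python
import pandas as pd

# Hypothetical toy data with the default integer index 0..4
toy = pd.DataFrame({
    'year': [2007, 2008, 2007, 2008, 2009],
    'deaths': [5, 7, 2, 9, 4],
})
g = toy.groupby('year')

# Each key is a distinct year; each value holds the index labels of its rows
group_labels = {key: list(labels) for key, labels in g.groups.items()}

# Number of distinct groups
n_groups = g.ngroups
```

Here 2007 maps to labels 0 and 2, 2008 maps to labels 1 and 3, and 2009 maps to label 4, so there are three groups.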

### Calling the `agg` method from the GroupBy object
We can call the **`agg`** method from this assigned variable (the GroupBy object) to get the same result as above.

In [None]:
g.agg({'deaths':'sum'})

## Understanding the index
If you were paying close attention, you would have noticed that the grouping column gets placed in the index. In our above example, **`year`** is now the index. It is not a column.

In [None]:
year_deaths = nyc.groupby('year').agg({'deaths':'sum'})
year_deaths

### Use the `reset_index` method to turn the index into a column
All DataFrames come equipped with a **`reset_index`** method, which turns the index into a column of data and replaces it with a default integer index. You can chain it after the call to **`agg`**.

In [None]:
nyc.groupby('year').agg({'deaths':'sum'}).reset_index()
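
The effect on the index and columns can be checked directly. This sketch uses a made-up `toy` DataFrame rather than the NYC data and inspects the result before and after **`reset_index`**.

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration
toy = pd.DataFrame({
    'year': [2007, 2007, 2008, 2008],
    'deaths': [5, 7, 2, 9],
})

# After aggregating, 'year' is the index and 'deaths' is the only column
by_year = toy.groupby('year').agg({'deaths': 'sum'})
index_name = by_year.index.name
columns_before = list(by_year.columns)

# reset_index moves 'year' back into the columns
flat = by_year.reset_index()
columns_after = list(flat.columns)
```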

# Exercises

### Problem 1
<span  style="color:green; font-size:16px">What year had the most deaths?</span>

### Problem 2
<span  style="color:green; font-size:16px">Find the total number of deaths by race and sort by most to least.</span>

### Use the employee dataset for the remaining problems

In [None]:
emp = pd.read_csv('../data/employee.csv')
emp.head()

### Problem 3
<span  style="color:green; font-size:16px">Find the maximum salary for each gender.</span>

### Problem 4
<span  style="color:green; font-size:16px">Find the median salary for each department.</span>

### Problem 5
<span  style="color:green; font-size:16px">Find the average salary for each race. Return a DataFrame with the race as a column.</span>