# Grouping and Aggregation

In [1]:
%matplotlib inline
from matplotlib import pyplot as plt
import pandas as pd
from pandas import DataFrame, Series
import numpy as np

**Learning Objectives:** Learn to apply the split-apply-combine approach to group and aggregate data.

This notebook is based on Chapter 9 of Wes McKinney's *Python for Data Analysis* (P4DA).

## Split-apply-combine

The idea of *split-apply-combine* is this:

1. Split the data frame into groups of rows.
2. Apply some transformation, method or function to each column of each group of rows.
3. Combine the output of those transformations into a final `Series` or `DataFrame`.

This simple sequence of steps can be used to accomplish a wide range of data transformations.
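To make the three steps concrete, here is a minimal sketch that performs them by hand and then compares the result with `groupby`. The `toy` frame and its column names are hypothetical and are not part of this notebook's data.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
toy = pd.DataFrame({'key': ['a', 'b', 'a', 'b'],
                    'value': [1.0, 2.0, 3.0, 4.0]})

# 1. Split: collect the 'value' rows that belong to each key.
pieces = {key: toy.loc[toy['key'] == key, 'value']
          for key in toy['key'].unique()}

# 2. Apply: compute the mean of each piece.
means = {key: piece.mean() for key, piece in pieces.items()}

# 3. Combine: assemble the per-group results into a single Series.
manual = pd.Series(means, name='value')

# The same computation expressed with groupby.
with_groupby = toy.groupby('key')['value'].mean()

print(manual)
print(with_groupby)
```

Both results hold one mean per key; `groupby` simply packages the split, apply and combine steps into one expression.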

## Split using `groupby`

Splitting a data frame by its rows is done using the `groupby` method of a `Series` or `DataFrame`.

To illustrate this, here is a `DataFrame` with two numerical and two categorical columns:

In [2]:
df = DataFrame({'key1': ['a','a','b','b','a'],
                'key2': ['one','two','one','two','one'],
                'data1': np.random.randn(5),
                'data2': np.random.randn(5)})

In [3]:
df

Unnamed: 0,data1,data2,key1,key2
0,-0.788691,-0.303939,a,one
1,-0.013036,-0.92826,a,two
2,-0.130572,-0.745983,b,one
3,-1.16014,0.448574,b,two
4,2.345117,0.625342,a,one


There are two things you have to specify when calling `groupby` to perform a split:

1. What columns you want to **look at** or analyze.
2. What columns you want to **group by.**

As you think about these choices, here are two *guidelines* for picking these columns:

* Look at numerical columns.
* Group by categorical columns.

These guidelines can be broken, but they are a good starting point.

Look at the `data1` column and group by the `key1` column's values.

In [4]:
g1 = df['data1'].groupby(df['key1'])

In [5]:
g1

<pandas.core.groupby.SeriesGroupBy object at 0x7f252c08c9e8>

You can iterate through the groups as follows:

In [6]:
for name, group in g1:
    print(name)
    print(group)
    print('')

a
0   -0.788691
1   -0.013036
4    2.345117
Name: data1, dtype: float64

b
2   -0.130572
3   -1.160140
Name: data1, dtype: float64



The `groups` attribute is a dictionary that shows which rows belong to which groups:

In [7]:
g1.groups

{'a': [0, 1, 4], 'b': [2, 3]}

It can also be useful to see the size of the groups:

In [8]:
g1.size()

key1
a    3
b    2
dtype: int64
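Beyond iterating, a single group can be pulled out by its key. The sketch below uses a hypothetical `toy` frame (fixed values rather than the random data above) and relies on the `get_group` method and `ngroups` attribute of the grouped object.

```python
import pandas as pd

# Hypothetical toy data with fixed values, for illustration only.
toy = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                    'data1': [1.0, 2.0, 3.0, 4.0, 5.0]})
g = toy['data1'].groupby(toy['key1'])

# Row labels that fall in each group, as plain lists.
labels = {name: list(idx) for name, idx in g.groups.items()}

# Pull out the rows of one group by its key.
group_a = g.get_group('a')

# Number of distinct groups.
n_groups = g.ngroups

print(labels)
print(group_a)
print(n_groups)
```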

### Interacting with `groupby`

Let's use the `interact` function from `ipywidgets` to better understand how `groupby` works.

In [9]:
def show_groups(column, by):
    groups = df[column].groupby(df[by])
    for name, group in groups:
        print(name)
        print(group)
        print('')

In [10]:
from ipywidgets import interact, fixed

In [11]:
interact(show_groups, column=['data1','data2'], by=['key1','key2']);

one
0   -0.788691
2   -0.130572
4    2.345117
Name: data1, dtype: float64

two
1   -0.013036
3   -1.160140
Name: data1, dtype: float64



You can pick columns to look at either before you call `groupby` or after:

In [12]:
df['data1'].groupby(df['key1']).mean()

key1
a    0.514463
b   -0.645356
Name: data1, dtype: float64

In [13]:
df.groupby(df['key1'])['data1'].mean()

key1
a    0.514463
b   -0.645356
Name: data1, dtype: float64

If you are using `groupby` on the entire `DataFrame`, you can pass `groupby` the name of the column, rather than the actual column of values:

In [14]:
df.groupby('key1')['data1'].mean()

key1
a    0.514463
b   -0.645356
Name: data1, dtype: float64
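As a quick check that these three spellings are interchangeable, the sketch below compares them on a hypothetical `toy` frame with fixed values.

```python
import pandas as pd

# Hypothetical toy data with fixed values, for illustration only.
toy = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                    'data1': [1.0, 2.0, 3.0, 4.0, 5.0]})

# Select the column first, then group by a column of values.
before = toy['data1'].groupby(toy['key1']).mean()

# Group the whole frame by a column of values, then select.
after = toy.groupby(toy['key1'])['data1'].mean()

# Group the whole frame by column name, then select.
by_name = toy.groupby('key1')['data1'].mean()

# Series.equals compares both the values and the index.
same = before.equals(after) and after.equals(by_name)
print(same)
```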

You can group by multiple columns. The resulting `Series` or `DataFrame` will have a hierarchical index:

In [15]:
df.groupby(['key1','key2'])['data1'].mean()

key1  key2
a     one     0.778213
      two    -0.013036
b     one    -0.130572
      two    -1.160140
Name: data1, dtype: float64
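A hierarchical index can be selected level by level or pivoted into columns. The following is a minimal sketch on a hypothetical `toy` frame with fixed values; `unstack` moves the innermost index level (`key2` here) into the columns.

```python
import pandas as pd

# Hypothetical toy data with fixed values, for illustration only.
toy = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                    'key2': ['one', 'two', 'one', 'two', 'one'],
                    'data1': [1.0, 2.0, 3.0, 4.0, 5.0]})

means = toy.groupby(['key1', 'key2'])['data1'].mean()

# Select a single entry with a tuple of both keys.
a_one = means.loc[('a', 'one')]

# Select everything under the outer key 'a'; the result is indexed by key2.
a_only = means.loc['a']

# Pivot the inner level into columns: rows are key1, columns are key2.
wide = means.unstack()

print(a_one)
print(a_only)
print(wide)
```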

If you are looking at a single column of numerical data, the final result will be a `Series`. You can also look at multiple columns, which will result in a `DataFrame`:

In [16]:
# numeric_only=True skips the non-numeric key2 column (needed in pandas 2.0+)
df.groupby('key1').mean(numeric_only=True)

Unnamed: 0_level_0,data1,data2
key1,Unnamed: 1_level_1,Unnamed: 2_level_1
a,0.514463,-0.202286
b,-0.645356,-0.148704


Here is a more complicated example where we are looking at and grouping by multiple columns:

In [17]:
df.groupby(['key1','key2'])[['data1','data2']].mean()

Unnamed: 0_level_0,Unnamed: 1_level_0,data1,data2
key1,key2,Unnamed: 2_level_1,Unnamed: 3_level_1
a,one,0.778213,0.160702
a,two,-0.013036,-0.92826
b,one,-0.130572,-0.745983
b,two,-1.16014,0.448574


It is possible to use any sequence for the `groupby` values. If you pass a sequence that isn't in the `DataFrame`, that sequence will be treated like another column in the `DataFrame` for the splitting step:

In [18]:
states = ['OH','CA','CA','OH','OH']
years = [2005,2005,2006,2005,2006]
df.groupby([states, years]).mean(numeric_only=True)

Unnamed: 0,Unnamed: 1,data1,data2
CA,2005,-0.013036,-0.92826
CA,2006,-0.130572,-0.745983
OH,2005,-0.974415,0.072318
OH,2006,2.345117,0.625342


That is like doing a groupby on the following `DataFrame`:

In [19]:
df2 = df.copy()
df2['states'] = states
df2['years'] = years
df2

Unnamed: 0,data1,data2,key1,key2,states,years
0,-0.788691,-0.303939,a,one,OH,2005
1,-0.013036,-0.92826,a,two,CA,2005
2,-0.130572,-0.745983,b,one,CA,2006
3,-1.16014,0.448574,b,two,OH,2005
4,2.345117,0.625342,a,one,OH,2006


In [20]:
df2.groupby(['states','years']).mean(numeric_only=True)

Unnamed: 0_level_0,Unnamed: 1_level_0,data1,data2
states,years,Unnamed: 2_level_1,Unnamed: 3_level_1
CA,2005,-0.013036,-0.92826
CA,2006,-0.130572,-0.745983
OH,2005,-0.974415,0.072318
OH,2006,2.345117,0.625342


There are other, more sophisticated ways of doing grouping:

* `Series`
* `dict`
* Functions

See P4DA Chapter 9 or the pandas [Group By documentation](http://pandas.pydata.org/pandas-docs/dev/groupby.html) for more details.
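As a rough sketch of these three options, the example below groups a hypothetical `toy` frame by a `dict`, a function and a `Series`. A `dict` or function is applied to the index labels, while a `Series` is aligned with the frame on its index. All names and values here are illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy data with string index labels, for illustration only.
toy = pd.DataFrame({'value': [1, 2, 3, 4]},
                   index=['apple', 'avocado', 'banana', 'blueberry'])

# dict: maps each index label to a group name.
mapping = {'apple': 'x', 'avocado': 'y', 'banana': 'x', 'blueberry': 'y'}
by_dict = toy.groupby(mapping)['value'].sum()

# Function: called once per index label; its return value is the group key.
by_func = toy.groupby(lambda label: label[0])['value'].sum()

# Series: values are the group keys, matched to rows by index label.
keys = pd.Series(['u', 'v', 'u', 'v'], index=toy.index)
by_series = toy.groupby(keys)['value'].sum()

print(by_dict)
print(by_func)
print(by_series)
```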

## Aggregation

### Single function on all columns

To apply a single aggregation function to all columns in the grouped data, simply call the method on the `groupby` result or pass an aggregation function to the `agg` method.

In [21]:
df

Unnamed: 0,data1,data2,key1,key2
0,-0.788691,-0.303939,a,one
1,-0.013036,-0.92826,a,two
2,-0.130572,-0.745983,b,one
3,-1.16014,0.448574,b,two
4,2.345117,0.625342,a,one


In [22]:
g2 = df.groupby('key1')

In [23]:
g2.mean(numeric_only=True)

Unnamed: 0_level_0,data1,data2
key1,Unnamed: 1_level_1,Unnamed: 2_level_1
a,0.514463,-0.202286
b,-0.645356,-0.148704


In [24]:
g2.count()

Unnamed: 0_level_0,data1,data2,key2
key1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
a,3,3,3
b,2,2,2


Here are some of the aggregation methods that are built in:

* `count`
* `sum`
* `mean`
* `median`
* `std`/`var`
* `min`/`max`
* `prod`
* `first`/`last`
* `describe`
* `size`

When you call these methods, **the same operation is applied to all columns of all groups.**
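The sketch below shows one built-in method being applied to every column of every group, using a hypothetical `toy` frame. It also illustrates a difference worth knowing: `count` reports non-missing values per column, while `size` reports the number of rows per group.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value, for illustration only.
toy = pd.DataFrame({'key': ['a', 'a', 'b'],
                    'x': [1.0, np.nan, 3.0],
                    'y': [10.0, 20.0, 30.0]})
g = toy.groupby('key')

# sum is computed for both x and y within each group (NaN is skipped).
sums = g.sum()

# count gives the number of non-missing values per column and group.
counts = g.count()

# size gives the number of rows per group, regardless of missing values.
sizes = g.size()

print(sums)
print(counts)
print(sizes)
```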

It is possible to write your own aggregation function and have it called on the columns of each group using `agg`.

In [25]:
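def peak_to_peak(arr):
    # Despite its name, this returns the group mean minus the group max,
    # which is what the outputs in this notebook show.
    return arr.mean()-arr.max()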

In [26]:
# Select the numeric columns so the function is not applied to key2
g2[['data1','data2']].agg(peak_to_peak)

Unnamed: 0_level_0,data1,data2
key1,Unnamed: 1_level_1,Unnamed: 2_level_1
a,-1.830654,-0.827627
b,-0.514784,-0.597278


When you pass a single function to `agg`, that same function is applied to all columns of each group.
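Any function that reduces a column of values to a single number can be used this way. As a minimal sketch, the hypothetical helper `value_range` below returns the maximum minus the minimum of each column within each group; the `toy` frame is illustrative data, not this notebook's.

```python
import pandas as pd

def value_range(arr):
    # Hypothetical helper: spread of the values (max minus min).
    return arr.max() - arr.min()

# Hypothetical toy data with fixed values, for illustration only.
toy = pd.DataFrame({'key': ['a', 'a', 'b', 'b'],
                    'x': [1.0, 4.0, 2.0, 7.0],
                    'y': [10.0, 15.0, 30.0, 20.0]})

# value_range is called once per column per group.
result = toy.groupby('key')[['x', 'y']].agg(value_range)
print(result)
```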

You can also pass the names of built-in functions to `agg` as strings:

In [27]:
g2[['data1','data2']].agg('mean')

Unnamed: 0_level_0,data1,data2
key1,Unnamed: 1_level_1,Unnamed: 2_level_1
a,0.514463,-0.202286
b,-0.645356,-0.148704


### Multiple functions on each column

Sometimes you want to call multiple aggregation functions on each column of data. In this case, the same set of functions is still called on every column. To call different functions on different columns, see the next section.

We will use the tips data set to illustrate these features:

In [28]:
import seaborn

In [29]:
tips = seaborn.load_dataset('tips')
tips['tip_pct'] = tips.tip/tips.total_bill

In [30]:
tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,tip_pct
0,16.99,1.01,Female,No,Sun,Dinner,2,0.059447
1,10.34,1.66,Male,No,Sun,Dinner,3,0.160542
2,21.01,3.5,Male,No,Sun,Dinner,3,0.166587
3,23.68,3.31,Male,No,Sun,Dinner,2,0.13978
4,24.59,3.61,Female,No,Sun,Dinner,4,0.146808


In [31]:
grouped = tips.groupby(['sex','smoker'])

The first way of calling multiple aggregation functions is to simply pass a list of functions or function names to `agg`:

In [32]:
grouped['tip_pct'].agg(['mean','std',peak_to_peak])

Unnamed: 0_level_0,Unnamed: 1_level_0,mean,std,peak_to_peak
sex,smoker,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Male,Yes,0.152771,0.090588,-0.557574
Male,No,0.160669,0.041849,-0.131321
Female,Yes,0.18215,0.071595,-0.234516
Female,No,0.156921,0.036421,-0.095752


Note that the new column names match the names of the functions. If you want to customize the column names, you can pass a list of tuples:

In [33]:
grouped['tip_pct'].agg([('the_mean','mean'),('the_std','std'),('p2p',peak_to_peak)])

Unnamed: 0_level_0,Unnamed: 1_level_0,the_mean,the_std,p2p
sex,smoker,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Male,Yes,0.152771,0.090588,-0.557574
Male,No,0.160669,0.041849,-0.131321
Female,Yes,0.18215,0.071595,-0.234516
Female,No,0.156921,0.036421,-0.095752
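Recent versions of pandas (0.25 and later) also accept keyword arguments for the same purpose, where each keyword becomes an output column name. The sketch below uses a hypothetical `toy` frame rather than the tips data, so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical toy data, for illustration only.
toy = pd.DataFrame({'group': ['a', 'a', 'b', 'b'],
                    'tip_pct': [0.10, 0.20, 0.15, 0.25]})

# Each keyword names an output column; its value is the function to apply.
summary = toy.groupby('group')['tip_pct'].agg(the_mean='mean', the_max='max')
print(summary)
```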


### Different functions on different columns

The last case is when you want to call different sets of functions on different columns. In this case, you can pass a `dict`, where the keys are the column names and the values are the functions you want to apply.

Here is a simple example of applying different functions to the `tip` and `tip_pct` columns:

In [None]:
grouped.agg({'tip':'max','tip_pct':['mean','std']})
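When any value in the `dict` is a list, the result has two levels of column labels: the original column name and the function name. The sketch below shows this on a hypothetical `toy` frame (not the tips data) and flattens the labels into single strings, which is one possible way to make the columns easier to reference.

```python
import pandas as pd

# Hypothetical toy data, for illustration only.
toy = pd.DataFrame({'group': ['a', 'a', 'b', 'b'],
                    'tip': [1.0, 3.0, 2.0, 5.0],
                    'tip_pct': [0.10, 0.20, 0.15, 0.25]})

result = toy.groupby('group').agg({'tip': 'max', 'tip_pct': ['mean', 'std']})

# Column labels are (column name, function name) pairs.
print(list(result.columns))

# Join each pair into a single string such as 'tip_pct_mean'.
flat = result.copy()
flat.columns = ['_'.join(col) for col in flat.columns]
print(flat)
```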