# Chapter 10 Groupby: Split-Apply-Combine

Pandas for Everyone. See the author's [GitHub page](https://github.com/chendaniely/pandas_for_everyone).

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('data/gapminder.txt', sep='\t')
df

,country,continent,year,lifeExp,pop,gdpPercap
0,Afghanistan,Asia,1952,28.801,8425333,779.445314
1,Afghanistan,Asia,1957,30.332,9240934,820.853030
2,Afghanistan,Asia,1962,31.997,10267083,853.100710
3,Afghanistan,Asia,1967,34.020,11537966,836.197138
4,Afghanistan,Asia,1972,36.088,13079460,739.981106
...,...,...,...,...,...,...
1699,Zimbabwe,Africa,1987,62.351,9216418,706.157306
1700,Zimbabwe,Africa,1992,60.377,10704340,693.420786
1701,Zimbabwe,Africa,1997,46.809,11404948,792.449960
1702,Zimbabwe,Africa,2002,39.989,11926563,672.038623


Let's say we want to calculate the average lifeExp for each year across all countries. We can do:

In [3]:
df.groupby('year')['lifeExp'].mean()

year
1952    49.057620
1957    51.507401
1962    53.609249
1967    55.678290
1972    57.647386
1977    59.570157
1982    61.533197
1987    63.212613
1992    64.160338
1997    65.014676
2002    65.694923
2007    67.007423
Name: lifeExp, dtype: float64

In [4]:
df.groupby('year')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0A778D10>

In [5]:
df.groupby('year')['lifeExp']

<pandas.core.groupby.generic.SeriesGroupBy object at 0x03D42850>

In [6]:
df.groupby('year')['lifeExp'].mean()[1952] # get the life expectancy of year 1952

49.05761971830987

In [7]:
df.groupby('continent')['lifeExp'].describe()

continent,count,mean,std,min,25%,50%,75%,max
Africa,624.0,48.86533,9.15021,23.599,42.3725,47.792,54.4115,76.442
Americas,300.0,64.658737,9.345088,37.579,58.41,67.048,71.6995,80.653
Asia,396.0,60.064903,11.864532,28.801,51.42625,61.7915,69.50525,82.603
Europe,360.0,71.903686,5.433178,43.585,69.57,72.241,75.4505,81.757
Oceania,24.0,74.326208,3.795611,69.12,71.205,73.665,77.5525,81.235


We can think of the process of group by as:

1. Group the rows of the dataframe by a key or a combination of keys;
2. Select a column (optional);
3. Call an aggregation function.
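
The three steps above can be spelled out by hand. The following is a minimal sketch on a small made-up DataFrame (`toy` and its values are assumptions for illustration, not the gapminder data); it only aims to show that the manual loop and the groupby call agree.

```python
import pandas as pd

# Hypothetical toy data, not the gapminder file
toy = pd.DataFrame({
    "year": [1952, 1952, 1957, 1957],
    "lifeExp": [30.0, 50.0, 40.0, 60.0],
})

manual = {}
for year in toy["year"].unique():
    rows = toy[toy["year"] == year]            # 1. group the rows by key
    column = rows["lifeExp"]                   # 2. select a column
    manual[int(year)] = float(column.mean())   # 3. call an aggregation function

# Combine the per-group results into one Series
manual_result = pd.Series(manual, name="lifeExp")

# The same computation expressed with groupby
grouped_result = toy.groupby("year")["lifeExp"].mean()

print(manual_result)
print(grouped_result)
```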

## Aggregate Function

A function that is called on a groupby object and reduces each group to a single value is an aggregate function. Aggregate functions include:

1. Pandas built-in functions (mean, max, etc.);
2. Numpy built-in functions (np.mean, etc.);
3. User defined functions.
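
As a rough illustration of the three kinds listed above, the sketch below applies one of each to a small made-up DataFrame (`toy` and `value_range` are hypothetical names introduced here). Recent pandas versions may emit a FutureWarning when `np.mean` is passed to agg() and suggest the string `'mean'` instead.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the gapminder file
toy = pd.DataFrame({
    "continent": ["Asia", "Asia", "Europe", "Europe"],
    "lifeExp": [30.0, 50.0, 70.0, 80.0],
})
by_continent = toy.groupby("continent")["lifeExp"]


def value_range(s):
    """User-defined aggregate: takes a Series, returns one value."""
    return s.max() - s.min()


builtin_mean = by_continent.mean()            # 1. pandas built-in
numpy_mean = by_continent.agg(np.mean)        # 2. NumPy function
custom_range = by_continent.agg(value_range)  # 3. user-defined function

print(builtin_mean)
print(numpy_mean)
print(custom_range)
```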

### User Defined Aggregate Function
We invoke a user-defined aggregate function through an agg() call. Let's say we want to define our own function to calculate the mean.

In [18]:
def my_mean(s):
    """
    User-defined function to calculate the mean. It must implement the interface: [Series] s => a single value
    """
    print(type(s))  # show what agg() passes to the function
    return s.sum() / len(s)

In [19]:
df.groupby('continent')['lifeExp'].agg(my_mean)

<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>


continent
Africa      48.865330
Americas    64.658737
Asia        60.064903
Europe      71.903686
Oceania     74.326208
Name: lifeExp, dtype: float64

We can see that the argument passed to the aggregate function is a Series object. Therefore we can use any method of the Series class inside the function.

### Aggregate Functions With Multiple Parameters

Just like apply(), agg() accepts an aggregate function that has more than one parameter. To call such a function, we can:

1. Use a partial function;
2. Supply the extra parameter as a keyword argument to agg().

Let's illustrate (2) here.

In [21]:
from functools import reduce
from operator import add

def my_mean_diff(s, diff_value):
    """
    Calculate the difference between mean of s and diff_value.
    """
    return reduce(add, s, 0)/len(s) - diff_value

In [25]:
global_mean = df['lifeExp'].mean() # average of life expectancy over the years
df.groupby('year')['lifeExp'].agg(my_mean_diff, diff_value=global_mean)

year
1952   -10.416820
1957    -7.967038
1962    -5.865190
1967    -3.796150
1972    -1.827053
1977     0.095718
1982     2.058758
1987     3.738173
1992     4.685899
1997     5.540237
2002     6.220483
2007     7.532983
Name: lifeExp, dtype: float64
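
Option (1), the partial function, might look like the sketch below. It uses a small made-up DataFrame (`toy` is an assumption for illustration) and a simplified `my_mean_diff` that calls `s.mean()`; `functools.partial` fixes `diff_value` in advance so that agg() only has to supply the Series.

```python
from functools import partial

import pandas as pd

# Hypothetical toy data, not the gapminder file
toy = pd.DataFrame({
    "year": [1952, 1952, 1957, 1957],
    "lifeExp": [30.0, 50.0, 40.0, 60.0],
})


def my_mean_diff(s, diff_value):
    """Difference between the mean of s and diff_value."""
    return s.mean() - diff_value


overall_mean = toy["lifeExp"].mean()  # mean over all rows

# Option 1: bind the extra parameter with a partial function
diff_from_overall = partial(my_mean_diff, diff_value=overall_mean)
via_partial = toy.groupby("year")["lifeExp"].agg(diff_from_overall)

# Option 2: pass the extra parameter through agg() as a keyword argument
via_kwarg = toy.groupby("year")["lifeExp"].agg(my_mean_diff, diff_value=overall_mean)

print(via_partial)
print(via_kwarg)
```

Both calls should give the same Series; which one reads better is largely a matter of taste.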

## Call Multiple Aggregate Functions

Sometimes we need to:

1. Call multiple aggregate functions on the same column;
2. Call multiple aggregate functions, each for a different column.

Here is how.

In [27]:
import numpy as np
df.groupby('year')['lifeExp'].agg([np.count_nonzero, np.mean, np.std])

year,count_nonzero,mean,std
1952,142.0,49.05762,12.225956
1957,142.0,51.507401,12.231286
1962,142.0,53.609249,12.097245
1967,142.0,55.67829,11.718858
1972,142.0,57.647386,11.381953
1977,142.0,59.570157,11.227229
1982,142.0,61.533197,10.770618
1987,142.0,63.212613,10.556285
1992,142.0,64.160338,11.22738
1997,142.0,65.014676,11.559439


### Multiple Functions for Different Columns

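To apply a different function to each column, pass agg() a dict that maps each column name to its function. Here each function is given by the name of a built-in pandas groupby aggregation.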

In [31]:
df.groupby('year').agg({ 'lifeExp': 'mean', 'pop': 'median', 'gdpPercap': 'median'})

year,lifeExp,pop,gdpPercap
1952,49.05762,3943953.0,1968.528344
1957,51.507401,4282942.0,2173.220291
1962,53.609249,4686039.5,2335.439533
1967,55.67829,5170175.5,2678.334741
1972,57.647386,5877996.5,3339.129407
1977,59.570157,6404036.5,3798.609244
1982,61.533197,7007320.0,4216.228428
1987,63.212613,7774861.5,4280.300366
1992,64.160338,8688686.5,4386.085502
1997,65.014676,9735063.5,4781.825478
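
In current pandas the dict values can also be callables, and keyword-style "named aggregation" lets us choose the output column names. The sketch below shows both on a small made-up DataFrame (`toy`, `value_range`, and the output names `mean_lifeExp` and `pop_range` are hypothetical, chosen for illustration).

```python
import pandas as pd

# Hypothetical toy data, not the gapminder file
toy = pd.DataFrame({
    "year": [1952, 1952, 1957, 1957],
    "lifeExp": [30.0, 50.0, 40.0, 60.0],
    "pop": [100, 300, 200, 600],
})


def value_range(s):
    """User-defined aggregate: takes a Series, returns one value."""
    return s.max() - s.min()


# Dict form: a built-in name for one column, a user-defined function for another
mixed = toy.groupby("year").agg({"lifeExp": "mean", "pop": value_range})

# Named aggregation: output_name=(input_column, function)
named = toy.groupby("year").agg(
    mean_lifeExp=("lifeExp", "mean"),
    pop_range=("pop", value_range),
)

print(mixed)
print(named)
```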


## Transform

Aggregate maps a Series to a single value, whereas transform is an element-wise mapping within each group. The transform operation is therefore a Series -> Series function whose output has the same length as its input.

Here let's define a function that operates on a Series.

In [36]:
zscore = lambda v: (v - v.mean())/v.std() # for a vector (Series), compute each value's distance to the mean in units of standard deviation

In [37]:
zscore(pd.Series(range(5)))

0   -1.264911
1   -0.632456
2    0.000000
3    0.632456
4    1.264911
dtype: float64

In [40]:
df.groupby('year')['lifeExp'].transform(zscore).head()

0   -1.656854
1   -1.731249
2   -1.786543
3   -1.848157
4   -1.894173
Name: lifeExp, dtype: float64

In [42]:
df.groupby('year')['lifeExp'].transform(zscore).tail() # All points are transformed

1699   -0.081621
1700   -0.336974
1701   -1.574962
1702   -2.093346
1703   -1.948180
Name: lifeExp, dtype: float64
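
To make the difference between aggregate and transform concrete, here is a minimal sketch on a small made-up DataFrame (`toy` is an assumption for illustration). The aggregate result has one row per group, while the transform result has one row per original row and keeps the original index.

```python
import pandas as pd

# Hypothetical toy data with the years interleaved
toy = pd.DataFrame({
    "year": [1952, 1957, 1952, 1957],
    "lifeExp": [30.0, 40.0, 50.0, 60.0],
})
by_year = toy.groupby("year")["lifeExp"]

# Aggregate: Series -> single value, so one row per group
aggregated = by_year.agg("mean")

# Transform: Series -> Series, so one row per original row
centered = by_year.transform(lambda v: v - v.mean())

print(aggregated)
print(centered)
```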

## Filter

We already have the capability to filter rows by using subsetting or boolean expressions. Here we introduce how to use filter() to keep only the groups of interest.

In [54]:
df.groupby('continent')['lifeExp'].count()

continent
Africa      624
Americas    300
Asia        396
Europe      360
Oceania      24
Name: lifeExp, dtype: int64

Now let's drop the groups that have 30 or fewer rows, keeping only those with more than 30.

In [73]:
df_filtered = df.groupby('continent').filter(lambda group: group['lifeExp'].count() > 30)
type(df_filtered)

pandas.core.frame.DataFrame

Now we can see that the 'Oceania' group is no longer there.

In [76]:
df_filtered.groupby('continent')['lifeExp'].mean()

continent
Africa      48.865330
Americas    64.658737
Asia        60.064903
Europe      71.903686
Name: lifeExp, dtype: float64

### What's Under The Cover?

We can call filter(filter_func) in either of two ways:

Usage | filter_func type | Output type of expression
------|------------------|------------------------------
df.groupby('*xxx*').filter() | DataFrame -> bool (keep the group or not) | DataFrame
df.groupby('*xxx*')\['*yyy*'\].filter() | Series -> bool (keep the group or not) | Series

Which one to choose depends on the task.
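
The two usages can be compared side by side on a small made-up DataFrame (`toy` and the threshold of 2 rows are assumptions for illustration). In both cases the function returns a single boolean per group; only the type of its argument, and therefore of the result, differs.

```python
import pandas as pd

# Hypothetical toy data: three Asia rows, one Oceania row
toy = pd.DataFrame({
    "continent": ["Asia", "Asia", "Asia", "Oceania"],
    "lifeExp": [30.0, 50.0, 40.0, 70.0],
})

# The function receives each group as a DataFrame; the result is a DataFrame
frame_result = toy.groupby("continent").filter(lambda g: len(g) >= 2)

# The function receives each group as a Series; the result is a Series
series_result = toy.groupby("continent")["lifeExp"].filter(lambda s: s.count() >= 2)

print(frame_result)
print(series_result)
```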

In [93]:
def dummy_filter(group):
    print(type(group))
    return group['lifeExp'].count() > 30

df2 = df.groupby('continent').filter(dummy_filter)
df2

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>


,country,continent,year,lifeExp,pop,gdpPercap
0,Afghanistan,Asia,1952,28.801,8425333,779.445314
1,Afghanistan,Asia,1957,30.332,9240934,820.853030
2,Afghanistan,Asia,1962,31.997,10267083,853.100710
3,Afghanistan,Asia,1967,34.020,11537966,836.197138
4,Afghanistan,Asia,1972,36.088,13079460,739.981106
...,...,...,...,...,...,...
1699,Zimbabwe,Africa,1987,62.351,9216418,706.157306
1700,Zimbabwe,Africa,1992,60.377,10704340,693.420786
1701,Zimbabwe,Africa,1997,46.809,11404948,792.449960
1702,Zimbabwe,Africa,2002,39.989,11926563,672.038623


In [95]:
df2.groupby('continent')['lifeExp'].mean() # Now the Oceania group is gone

continent
Africa      48.865330
Americas    64.658737
Asia        60.064903
Europe      71.903686
Name: lifeExp, dtype: float64

In [89]:
def dummy_filter2(group):
    print(type(group))
    return group.count() > 30

df.groupby('continent')['lifeExp'].filter(dummy_filter2)

<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>


0       28.801
1       30.332
2       31.997
3       34.020
4       36.088
         ...  
1699    62.351
1700    60.377
1701    46.809
1702    39.989
1703    43.487
Name: lifeExp, Length: 1680, dtype: float64

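We can no longer use groupby('continent') on the above because the output is a Series and the continent information is gone.

Note that the plain DataFrame.filter method, shown below, is a different method: it selects rows or columns by label rather than by group.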

In [98]:
df.filter

<bound method NDFrame.filter of           country continent  year  lifeExp       pop   gdpPercap
0     Afghanistan      Asia  1952   28.801   8425333  779.445314
1     Afghanistan      Asia  1957   30.332   9240934  820.853030
2     Afghanistan      Asia  1962   31.997  10267083  853.100710
3     Afghanistan      Asia  1967   34.020  11537966  836.197138
4     Afghanistan      Asia  1972   36.088  13079460  739.981106
...           ...       ...   ...      ...       ...         ...
1699     Zimbabwe    Africa  1987   62.351   9216418  706.157306
1700     Zimbabwe    Africa  1992   60.377  10704340  693.420786
1701     Zimbabwe    Africa  1997   46.809  11404948  792.449960
1702     Zimbabwe    Africa  2002   39.989  11926563  672.038623
1703     Zimbabwe    Africa  2007   43.487  12311143  469.709298

[1704 rows x 6 columns]>

## Looking Deeper

We can save the groupby object for later re-use. Let's look into the groupby object to see what's under it.

In [101]:
grouped = df.groupby('year')
grouped

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x09AA7AD0>

In [103]:
for group in grouped:
    print(type(group))
    print(group)
    break

<class 'tuple'>
(1952,                  country continent  year  lifeExp       pop    gdpPercap
0            Afghanistan      Asia  1952   28.801   8425333   779.445314
12               Albania    Europe  1952   55.230   1282697  1601.056136
24               Algeria    Africa  1952   43.077   9279525  2449.008185
36                Angola    Africa  1952   30.015   4232095  3520.610273
48             Argentina  Americas  1952   62.485  17876956  5911.315053
...                  ...       ...   ...      ...       ...          ...
1644             Vietnam      Asia  1952   40.412  26246839   605.066492
1656  West Bank and Gaza      Asia  1952   43.160   1030585  1515.592329
1668         Yemen, Rep.      Asia  1952   32.548   4963829   781.717576
1680              Zambia    Africa  1952   42.038   2672000  1147.388831
1692            Zimbabwe    Africa  1952   48.451   3080907   406.884115

[142 rows x 6 columns])


### Group Is a Tuple

From the above, we can see that iterating over a groupby object yields tuples of the form (*group name*, *dataframe*). For example, if we just want to know the average life expectancy of year 1952, we can do:

In [105]:
def get1952(grouped):
    # each item is a (group name, dataframe) tuple
    for name, group_df in grouped:
        if name == 1952:
            return group_df


get1952(grouped)['lifeExp'].mean()

49.057619718309866
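
pandas also provides `get_group()` on groupby objects, which fetches a single group without writing the loop. A minimal sketch on a small made-up DataFrame (`toy` is an assumption for illustration):

```python
import pandas as pd

# Hypothetical toy data, not the gapminder file
toy = pd.DataFrame({
    "year": [1952, 1952, 1957, 1957],
    "lifeExp": [30.0, 50.0, 40.0, 60.0],
})
grouped = toy.groupby("year")

# Iterating yields (group name, dataframe) tuples
group_names = [int(name) for name, frame in grouped]

# Fetch one group directly by its key
frame_1952 = grouped.get_group(1952)
mean_1952 = frame_1952["lifeExp"].mean()

print(group_names)
print(mean_1952)
```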

## To be Continued

Multiple groups, MultiIndex