# Split-apply-combine

Reference: the [pandas user guide on group by](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html)

Let's start with a dataset:

In [1]:
import numpy as np
import pandas as pd

df = pd.read_csv("data/country_sex_age.csv")
df.sample(5)

| | country | seasonality | sex | age_group | month | unemployment | unemployment_rate |
|---|---|---|---|---|---|---|---|
| 1004 | at | sa | f | y25-74 | 2004.09 | 71000 | 4.7 |
| 75760 | si | trend | m | y25-74 | 2006.05 | 20000 | 4.1 |
| 76689 | sk | nsa | m | y_lt25 | 2003.1 | 65000 | 32.5 |
| 28820 | fr | sa | f | y25-74 | 1995.09 | 1113000 | 10.8 |
| 4843 | be | sa | m | y25-74 | 2002.08 | 124000 | 5.5 |


Say we want the average unemployment rate by `sex`. In SQL, we would write:

```
SELECT sex,
   AVG(unemployment_rate)
FROM df
GROUP BY sex
```

In pandas, the syntax is similar:

In [2]:
df.groupby('sex')['unemployment_rate'].mean()

sex
f    12.982629
m    11.671026
Name: unemployment_rate, dtype: float64

Note that the index of the resulting Series is the groupby key (here, `sex`).
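To make the point about the index concrete, here is a minimal sketch on a toy DataFrame. The values are made up for illustration and are not taken from the real dataset; `toy` and `rate_by_sex` are hypothetical names.

```python
import pandas as pd

# Toy data with made-up values (not the real dataset)
toy = pd.DataFrame({
    "sex": ["f", "f", "m", "m"],
    "unemployment_rate": [10.0, 14.0, 9.0, 13.0],
})

rate_by_sex = toy.groupby("sex")["unemployment_rate"].mean()

# The group labels become the index, so a group can be looked up by label
print(rate_by_sex.index.tolist())  # ['f', 'm']
print(rate_by_sex.loc["f"])        # 12.0
```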

If we wanted to group by two categories at once, we'd pass a list of column names to `groupby`:

In [3]:
grp = df.groupby(['sex', 'age_group'])

grp.unemployment_rate.mean()

sex  age_group
f    y25-74        7.566771
     y_lt25       18.457435
m    y25-74        6.244016
     y_lt25       17.098036
Name: unemployment_rate, dtype: float64
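When grouping by two keys, each row of the result is labelled by a pair of values. The sketch below, on a toy DataFrame with made-up values, shows one way to look up a single pair and one way to pivot the result into a wide table; `toy`, `rates`, and `wide` are hypothetical names.

```python
import pandas as pd

# Toy data with made-up values (not the real dataset)
toy = pd.DataFrame({
    "sex": ["f", "f", "m", "m"],
    "age_group": ["y25-74", "y_lt25", "y25-74", "y_lt25"],
    "unemployment_rate": [7.0, 18.0, 6.0, 17.0],
})

rates = toy.groupby(["sex", "age_group"])["unemployment_rate"].mean()

# Each row is addressed by a (sex, age_group) tuple
print(rates.loc[("f", "y_lt25")])  # 18.0

# unstack() moves the inner index level (age_group) into columns
wide = rates.unstack()
print(wide)
```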

The `grp` object is a lazy, "in waiting" object: it records how the data should be grouped, but produces no result until you tell it what to compute.

In [4]:
grp

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x11caef250>
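Although the repr shows little, the GroupBy object can be inspected before any aggregation. This is a small sketch on a toy DataFrame with made-up values; `toy`, `toy_grp`, and `young_women` are hypothetical names.

```python
import pandas as pd

# Toy data with made-up values (not the real dataset)
toy = pd.DataFrame({
    "sex": ["f", "f", "m", "m"],
    "age_group": ["y25-74", "y_lt25", "y25-74", "y_lt25"],
    "unemployment_rate": [7.0, 18.0, 6.0, 17.0],
})

toy_grp = toy.groupby(["sex", "age_group"])

# Number of distinct (sex, age_group) combinations
print(toy_grp.ngroups)  # 4

# Iterating yields (key, sub-DataFrame) pairs, one per group
for key, sub_df in toy_grp:
    print(key, len(sub_df))

# Pull out the rows belonging to one specific group
young_women = toy_grp.get_group(("f", "y_lt25"))
print(young_women)
```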

This is where the **split-apply-combine** framework comes in. 

We **split** the data into the groups we want (the keys passed to `groupby`).

We **apply** a function (here, the mean) to the column of interest within each group.

We **combine** (or **aggregate**) the results for each group into a single Series or DataFrame.
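The three steps can be spelled out by hand. The sketch below uses a toy DataFrame with made-up values and is meant only to illustrate the idea, not to show how pandas implements `groupby` internally; `toy`, `means`, `manual`, and `shortcut` are hypothetical names.

```python
import pandas as pd

# Toy data with made-up values (not the real dataset)
toy = pd.DataFrame({
    "sex": ["f", "f", "m", "m"],
    "unemployment_rate": [10.0, 14.0, 9.0, 13.0],
})

means = {}
for key in sorted(toy["sex"].unique()):
    # Split: keep only the rows belonging to this group
    piece = toy[toy["sex"] == key]
    # Apply: reduce the column of interest to a single number
    means[key] = piece["unemployment_rate"].mean()

# Combine: gather the per-group results into one Series
manual = pd.Series(means, name="unemployment_rate")

# The one-line groupby equivalent
shortcut = toy.groupby("sex")["unemployment_rate"].mean()

print(manual)
print(shortcut)
```

Both Series should hold the same value for each group; `groupby` simply does the bookkeeping for us.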

We can aggregate several columns at once by passing a dictionary that maps column names to aggregation functions:

In [5]:
agg_dict = {
    # Strings name built-in aggregation methods
    "unemployment_rate": 'mean',
    # NumPy reduction functions also work, though recent pandas
    # versions recommend passing the string 'sum' instead
    "unemployment": np.sum
}

# Aggregate each column with the function given for it in the dict
grp.agg(agg_dict)

| sex | age_group | unemployment_rate | unemployment |
|---|---|---|---|
| f | y25-74 | 7.566771 | 5476574000 |
| f | y_lt25 | 18.457435 | 2346186000 |
| m | y25-74 | 6.244016 | 6016451000 |
| m | y_lt25 | 17.098036 | 2578235000 |


Note that this DataFrame has a two-level index (a `pd.MultiIndex`) made up of `sex` and `age_group`, which can be awkward to work with.

This is why we often reset the index after a groupby, which turns the group keys back into ordinary columns:

In [6]:
grp.agg(agg_dict).reset_index()

| | sex | age_group | unemployment_rate | unemployment |
|---|---|---|---|---|
| 0 | f | y25-74 | 7.566771 | 5476574000 |
| 1 | f | y_lt25 | 18.457435 | 2346186000 |
| 2 | m | y25-74 | 6.244016 | 6016451000 |
| 3 | m | y_lt25 | 17.098036 | 2578235000 |
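A possible alternative to calling `reset_index()` afterwards is to pass `as_index=False` to `groupby`, so the group keys stay as regular columns from the start. The sketch below combines this with named aggregation, which lets you choose the output column names. It runs on a toy DataFrame with made-up values; `toy`, `flat`, `mean_rate`, and `total_unemployed` are hypothetical names chosen for illustration.

```python
import pandas as pd

# Toy data with made-up values (not the real dataset)
toy = pd.DataFrame({
    "sex": ["f", "f", "m", "m"],
    "age_group": ["y25-74", "y25-74", "y_lt25", "y_lt25"],
    "unemployment": [100, 200, 50, 70],
    "unemployment_rate": [7.0, 9.0, 16.0, 18.0],
})

# as_index=False keeps `sex` and `age_group` as ordinary columns.
# Named aggregation: output_name=(input_column, aggregation)
flat = toy.groupby(["sex", "age_group"], as_index=False).agg(
    mean_rate=("unemployment_rate", "mean"),
    total_unemployed=("unemployment", "sum"),
)

print(flat)
print(flat.columns.tolist())
# ['sex', 'age_group', 'mean_rate', 'total_unemployed']
```

Whether to use `reset_index()` or `as_index=False` is largely a matter of taste; both should give a flat table with a default integer index.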
