# Split-apply-combine

Reference: [pandas user guide, "Group by: split-apply-combine"](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html)

Let's start with a dataset:

In [1]:
import numpy as np
import pandas as pd

df = pd.read_csv("data/country_sex_age.csv")
df.sample(5)  # pick 5 rows at random from the dataframe

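|       | country | seasonality | sex | age_group | month   | unemployment | unemployment_rate |
|-------|---------|-------------|-----|-----------|---------|--------------|-------------------|
| 67234 | pt      | trend       | f   | y_lt25    | 2004.11 | 45000        | 17.7              |
| 39796 | ie      | trend       | f   | y_lt25    | 1984.05 | 31000        | 19.4              |
| 23439 | es      | trend       | m   | y25-74    | 1991.04 | 614000       | 7.1               |
| 27116 | fi      | trend       | m   | y_lt25    | 1993.09 | 55000        | 37.4              |
| 75438 | si      | trend       | f   | y_lt25    | 1995.07 | 11000        | NaN               |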


Say we want the average unemployment rate by `sex`. If we were doing this with SQL we would do:

```
SELECT sex,
   AVG(unemployment_rate)
FROM df
GROUP BY sex
```

In pandas, the syntax for it is similar:

In [2]:
df.groupby('sex')['unemployment_rate'].mean()

sex
f    12.982629
m    11.671026
Name: unemployment_rate, dtype: float64

Note the index of the resulting series is the "groupby key".
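To check the behaviour on something small enough to verify by hand, here is a minimal sketch on a made-up four-row frame (the `toy` data is hypothetical, not taken from the CSV above). It also shows that `mean` skips missing values by default, which matters here because some rows have no `unemployment_rate`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, small enough to verify by hand
toy = pd.DataFrame({
    "sex": ["f", "f", "m", "m"],
    "unemployment_rate": [10.0, 20.0, 5.0, np.nan],
})

by_sex = toy.groupby("sex")["unemployment_rate"].mean()

# The group labels become the index of the result
print(by_sex.index.tolist())  # ['f', 'm']
print(by_sex["f"])            # 15.0: mean of 10.0 and 20.0
print(by_sex["m"])            # 5.0: the NaN is skipped, not counted as 0
```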

If we want to group by two columns, we pass a list of column names to `groupby`:

In [4]:
grp = df.groupby(['sex', 'age_group'])

grp.unemployment_rate.mean()

sex  age_group
f    y25-74        7.566771
     y_lt25       18.457435
m    y25-74        6.244016
     y_lt25       17.098036
Name: unemployment_rate, dtype: float64
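A result grouped by two keys is indexed by both of them. The sketch below, on hypothetical toy data (the `toy` frame and its values are made up for illustration), shows a few common ways to read values out of such a Series.

```python
import pandas as pd

# Hypothetical toy data: one row per (sex, age_group) pair
toy = pd.DataFrame({
    "sex": ["f", "f", "m", "m"],
    "age_group": ["y25-74", "y_lt25", "y25-74", "y_lt25"],
    "unemployment_rate": [7.0, 18.0, 6.0, 17.0],
})

rates = toy.groupby(["sex", "age_group"])["unemployment_rate"].mean()

# Select a single group with a tuple of keys
rate_f_young = rates.loc[("f", "y_lt25")]

# Select every age group for one sex by giving only the first level
rates_f = rates.loc["f"]

# Move the inner level into columns to get a compact sex-by-age table
wide = rates.unstack("age_group")
```

`unstack` is often the most readable option when there are only two grouping keys, since each key gets its own axis.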

The `grp` object is lazy: it records how the rows are grouped but computes nothing until you tell it which aggregation to apply.

In [5]:
grp

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f96926004f0>
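Although nothing has been aggregated yet, a groupby object can still be inspected. The following is a small sketch on hypothetical toy data (`toy` and `toy_grp` are illustrative names, not part of the dataset above) showing how to count the groups, iterate over them, and pull one out.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "sex": ["f", "m", "f", "m"],
    "age_group": ["y25-74", "y25-74", "y_lt25", "y_lt25"],
    "unemployment_rate": [7.0, 6.0, 18.0, 17.0],
})

toy_grp = toy.groupby(["sex", "age_group"])

# No aggregation has run, but the groups are already defined
n_groups = toy_grp.ngroups
group_sizes = toy_grp.size()  # number of rows in each group

# Iterating yields (key, sub-DataFrame) pairs, one per group
keys = [key for key, _ in toy_grp]

# Pull out the rows of a single group by its key
young_women = toy_grp.get_group(("f", "y_lt25"))
```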

This is where the **split-apply-combine** framework comes in.

We **split** the rows into groups, one per unique combination of the groupby keys.

We **apply** a function (here, the mean) to the column of interest within each group.

We **combine** (or **aggregate**) the per-group results into a single object.
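The three steps can be spelled out by hand. This is only a sketch on hypothetical toy data to make the steps visible; it is not how pandas implements `groupby` internally.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "sex": ["f", "f", "m", "m"],
    "unemployment_rate": [10.0, 20.0, 5.0, 7.0],
})

# Split: one sub-DataFrame per unique value of "sex"
pieces = {key: toy[toy["sex"] == key] for key in sorted(toy["sex"].unique())}

# Apply: compute the statistic on each piece independently
means = {key: piece["unemployment_rate"].mean() for key, piece in pieces.items()}

# Combine: gather the per-group results into one Series
manual = pd.Series(means, name="unemployment_rate")

# The groupby one-liner produces the same numbers
shortcut = toy.groupby("sex")["unemployment_rate"].mean()
```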

We can aggregate several columns at once, each with its own function, by passing a dictionary to `agg`:

In [7]:
agg_dict = {
    # Strings name built-in aggregation methods
    "unemployment_rate": 'mean',
    # NumPy reduction functions also work
    # (recent pandas versions may warn and suggest the string 'sum' instead)
    "unemployment": np.sum
}

# Aggregate each column with the function given for it in the dict
grp.agg(agg_dict)


| sex | age_group | unemployment_rate | unemployment |
|-----|-----------|-------------------|--------------|
| f   | y25-74    | 7.566771          | 5476574000   |
|     | y_lt25    | 18.457435         | 2346186000   |
| m   | y25-74    | 6.244016          | 6016451000   |
|     | y_lt25    | 17.098036         | 2578235000   |


Note that a groupby over two keys produces a result indexed by both keys: a two-level `pd.MultiIndex`, which can be awkward to work with.

This is why we often reset the index after a groupby, which turns the group keys back into ordinary columns that are easy to access:

In [6]:
grp.agg(agg_dict).reset_index()

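|   | sex | age_group | unemployment_rate | unemployment |
|---|-----|-----------|-------------------|--------------|
| 0 | f   | y25-74    | 7.566771          | 5476574000   |
| 1 | f   | y_lt25    | 18.457435         | 2346186000   |
| 2 | m   | y25-74    | 6.244016          | 6016451000   |
| 3 | m   | y_lt25    | 17.098036         | 2578235000   |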
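Two alternatives to calling `reset_index()` afterwards are worth knowing. The sketch below uses hypothetical toy data, and the output column names `mean_rate` and `total_unemployed` are made up for illustration. It shows `as_index=False`, which keeps the group keys as columns from the start, and named aggregation, which lets you choose the output column names.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "sex": ["f", "f", "m", "m"],
    "age_group": ["y25-74", "y_lt25", "y25-74", "y_lt25"],
    "unemployment": [100, 50, 120, 60],
    "unemployment_rate": [7.0, 18.0, 6.0, 17.0],
})

# as_index=False keeps the group keys as ordinary columns
flat = toy.groupby(["sex", "age_group"], as_index=False).agg(
    {"unemployment_rate": "mean", "unemployment": "sum"}
)

# Named aggregation: output_name=(input_column, function)
named = toy.groupby("sex", as_index=False).agg(
    mean_rate=("unemployment_rate", "mean"),
    total_unemployed=("unemployment", "sum"),
)
```

Named aggregation is handy when the same input column is aggregated in more than one way, since a plain dictionary allows only one entry per column name.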
