<img src="https://datasciencecampus.ons.gov.uk/wp-content/uploads/sites/10/2017/03/data-science-campus-logo-new.svg"
             alt="ONS Data Science Campus Logo"
             width = "240"
             style="margin: 0px 60px"
             />

In [8]:
# import the helper functions from the parent directory,
# these help with things like graph plotting and notebook layout
import sys
sys.path.append('../..')
from helper_functions import *

# set things like fonts etc - comes from helper_functions
set_notebook_preferences()

# add a show/hide code button - also from helper_functions
toggle_code(title = "import functions")

In [2]:
#dependencies
import pandas as pd
titanic = pd.read_csv('../../Data/titanic.csv')

#columns from NewVar script
titanic['child'] = (titanic['age'] < 18).astype(int)
titanic['embarked_city'] = titanic['embarked'].map({'S':'Southampton','C':'Cherbourg','Q':'Queenstown'})
titanic['surname'] = titanic['name'].str.split(',',expand=True)[0]

#columns from descstat script
titanic['z_score'] = (titanic['fare'] - titanic['fare'].mean())/titanic['fare'].std()


toggle_code(title='dependencies')

# 6. Aggregation

Aggregation means grouping data together by a particular grouping variable and producing a summary of one or more columns for that grouping variable.

## 6.1 Aggregation using the .groupby() method

We'll use the `.groupby()` method of a dataframe.

This method is especially useful when your data are disaggregated, i.e. each row describes an individual unit such as a person or thing.

`.groupby()` allows you to aggregate by a categorical variable and summarise numerical data into a new dataframe.

`.groupby()` works on a principle known as 'split-apply-combine':
* Split - a dataframe is divided into a set of smaller dataframes based on the grouping variable.
* Apply - an aggregation is applied to each of the smaller dataframes, reducing each group to a single row.
* Combine - bring together the aggregated dataframe rows into a final new dataframe.
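
To make the three steps concrete, here is a minimal sketch that performs them by hand and then compares the result with a single `.groupby()` call. The `toy` dataframe and its values are made up for illustration; they are not taken from the titanic data.

```python
import pandas as pd

# Hypothetical toy data standing in for a real dataframe
toy = pd.DataFrame({
    'pclass': [1, 1, 2, 2, 3, 3],
    'fare': [80.0, 100.0, 20.0, 30.0, 10.0, 14.0],
})

# Split: one smaller dataframe per value of the grouping variable
pieces = {key: toy[toy['pclass'] == key] for key in sorted(toy['pclass'].unique())}

# Apply: reduce each piece to a single summary value
means = {key: piece['fare'].mean() for key, piece in pieces.items()}

# Combine: gather the per-group results into one object indexed by group
by_hand = pd.Series(means, name='fare')
by_hand.index.name = 'pclass'

# .groupby() carries out all three steps in one call
with_groupby = toy.groupby('pclass')['fare'].mean()
```

Both `by_hand` and `with_groupby` hold one mean fare per class. In practice you would only ever write the last line.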

Let's walk through what that might look like for the `titanic` dataframe:
* Firstly, we decide to **split** the data by 'pclass'. This effectively divides the `titanic` dataframe into three separate dataframes: one for first, one for second and one for third class.
* Secondly, we **apply** an aggregation to each of those dataframes. You can either produce an aggregate statistic for every column, or you can select specific columns to aggregate. If we **apply** a `.mean()` aggregation to 'fare', then for each 'pclass' group we get the average fare.
* Finally, pandas returns a **combined** object that contains the new aggregate statistics, indexed by the grouping variable. Because we selected a single column here, that object is a Series rather than a dataframe.

Let's look at that in code:

In [3]:
titanic.groupby('pclass')['fare'].mean()

pclass
1    87.508992
2    21.179196
3    13.302889
Name: fare, dtype: float64

In [4]:
# Similarly a count of children by class might look like this:
titanic.groupby('pclass')['child'].sum()

pclass
1     15
2     33
3    106
Name: child, dtype: int64
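
Summing works as a count here because `child` is a 0/1 indicator. As a sketch of the same idea, taking the mean of an indicator gives the proportion of children in each group instead. The `toy` dataframe below is made up for illustration. Note also that a missing age compares as `False` in `age < 18`, so rows with no recorded age would be counted as non-children by an indicator built this way.

```python
import pandas as pd

# Hypothetical toy data, not the titanic file
toy = pd.DataFrame({
    'pclass': [1, 1, 2, 2, 2, 3],
    'age': [40, 10, 5, 30, 17, 50],
})
toy['child'] = (toy['age'] < 18).astype(int)

grouped = toy.groupby('pclass')['child']

child_count = grouped.sum()    # number of children in each class
child_share = grouped.mean()   # proportion of children in each class
group_size = toy.groupby('pclass').size()  # number of rows in each class
```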

Hopefully this all sounds fairly straightforward! `.groupby()` is a powerful tool, particularly when you are working with any kind of hierarchical data where you might want to know something about aggregated groups within the data, for instance:
* individuals nested in households.
* employees nested in firms.
* patients nested in primary or secondary care trusts.
* small area geographies (e.g. wards, output areas, postcodes etc.) nested in larger geographies (e.g. districts, counties etc.)
* countries nested in supra-national entities.

or by demographic, cultural and socio-economic classes:
* individuals by age, sex, ethnicity, religion etc.
* employees by grade or occupational social class.
* households by neighbourhood deprivation rank or decile.
* experimental subjects in intervention and control arms of a trial.

We can also aggregate according to more complicated groupings:

In [5]:
# Groupby passenger class, then city of embarkation.
titanic.groupby(['pclass','embarked_city'])['fare'].mean()

pclass  embarked_city
1       Cherbourg        106.845330
        Queenstown        90.000000
        Southampton       72.148094
2       Cherbourg         23.300593
        Queenstown        11.735114
        Southampton       21.206921
3       Cherbourg         11.021624
        Queenstown        10.390820
        Southampton       14.435422
Name: fare, dtype: float64

In [6]:
# NB: the order of the grouping columns sets the order of the index levels in the output.
titanic.groupby(['embarked_city','pclass'])['fare'].mean()

embarked_city  pclass
Cherbourg      1         106.845330
               2          23.300593
               3          11.021624
Queenstown     1          90.000000
               2          11.735114
               3          10.390820
Southampton    1          72.148094
               2          21.206921
               3          14.435422
Name: fare, dtype: float64

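The order of the grouping columns matters because it determines the layout of the result: the first column listed becomes the outer index level and the second becomes the inner level. The aggregated values themselves are the same either way, as the two outputs above show.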
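
One way to check that only the layout differs is to swap the index levels of one result and compare it with the other. This is a small sketch on made-up data; the `toy` dataframe and its values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data, not the titanic file
toy = pd.DataFrame({
    'pclass': [1, 1, 2, 2],
    'embarked_city': ['Cherbourg', 'Southampton', 'Cherbourg', 'Southampton'],
    'fare': [100.0, 70.0, 25.0, 20.0],
})

class_first = toy.groupby(['pclass', 'embarked_city'])['fare'].mean()
city_first = toy.groupby(['embarked_city', 'pclass'])['fare'].mean()

# Swap the two index levels of the city-first result and re-sort it,
# so that it has the same layout as the class-first result
realigned = city_first.swaplevel().sort_index()

# True when the index and the values both match
same_values = realigned.equals(class_first)
```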

If you assign the `groupby()` output to a variable, you can also pull out dataframes for particular groups, just as if you had written a filter condition!

In [7]:
classes = titanic.groupby('pclass')
classes.get_group(3).head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,child,embarked_city,surname,z_score
600,3,0,"Abbing, Mr. Anthony",male,42.0,0,0,C.A. 5547,7.55,,S,0,Southampton,Abbing,-0.497414
601,3,0,"Abbott, Master. Eugene Joseph",male,13.0,0,2,C.A. 2673,20.25,,S,1,Southampton,Abbott,-0.252044
602,3,0,"Abbott, Mr. Rossmore Edward",male,16.0,1,1,C.A. 2673,20.25,,S,1,Southampton,Abbott,-0.252044
603,3,1,"Abbott, Mrs. Stanton (Rosa Hunt)",female,35.0,1,1,C.A. 2673,20.25,,S,0,Southampton,Abbott,-0.252044
604,3,1,"Abelseth, Miss. Karen Marie",female,16.0,0,0,348125,7.65,,S,1,Southampton,Abelseth,-0.495482
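
The comparison with a filter condition can be sketched directly. On the made-up `toy` dataframe below (an assumption for illustration), `.get_group(3)` and a boolean filter on `pclass == 3` select the same rows, with the original row index preserved.

```python
import pandas as pd

# Hypothetical toy data, not the titanic file
toy = pd.DataFrame({
    'pclass': [1, 3, 2, 3],
    'name': ['A', 'B', 'C', 'D'],
    'fare': [80.0, 7.5, 20.0, 8.0],
})

classes = toy.groupby('pclass')

# Pull out the rows belonging to one group...
via_get_group = classes.get_group(3)

# ...which selects the same rows as an ordinary filter condition
via_filter = toy[toy['pclass'] == 3]

# The grouped object also reports how many groups it found
n_groups = classes.ngroups
```

The grouped object is handy when you want to inspect several groups in turn without rewriting the filter each time.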
