Cover:
- reshaping data
    - set - reset index
    - melt
    - pivot?
    - stack - unstack
- groupby
    - simple operations
    - looping
    - aggregate 
    - transform 
    - apply
- merge and join   

In [1]:
import pandas as pd
import numpy as np

# Groupby

As I pointed out in the first part of this lesson, tidy data is only useful if we have tools that work with it in a consistent and reproducible manner. One such tool is the `groupby` method of `DataFrame`, which provides a powerful interface for applying any operation based on grouping variables. We will look at it in detail in this section.

Very often we need to perform some operation separately for each level of a grouping variable. A common example is calculating the mean of each group (e.g. the performance of each subject, or the performance on each type of stimulus). This can be thought of as 3 separate actions:
- Splitting the data based on one or more grouping variables
- Applying a function to each group separately
- Combining the resulting values back together

Based on these 3 actions, this approach is called *Split-Apply-Combine* (SAC) [1].

[1] Wickham, Hadley. "The split-apply-combine strategy for data analysis." Journal of Statistical Software 40.1 (2011): 1-29.

<img src="http://nbviewer.jupyter.org/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/figures/03.08-split-apply-combine.png"></img>
From the ["Aggregation and Grouping" chapter](http://nbviewer.jupyter.org/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.08-Aggregation-and-Grouping.ipynb) of ["Python Data Science Handbook"](http://nbviewer.jupyter.org/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/Index.ipynb) by Jake VanderPlas
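
The three steps can also be written out by hand, which may help to see what each of them does. The following is only a minimal sketch on a made-up table (the name `toy` and its values are not part of the lesson's data):

```python
import pandas as pd

# Hypothetical toy data, not the lesson's dataset
toy = pd.DataFrame({'group': ['A', 'B', 'A', 'B'],
                    'data': [1, 2, 3, 4]})

# 1. Split: one sub-table per value of the grouping variable
pieces = {g: toy[toy['group'] == g] for g in toy['group'].unique()}

# 2. Apply: compute the mean of 'data' within each piece
means = {g: piece['data'].mean() for g, piece in pieces.items()}

# 3. Combine: collect the per-group values into a single Series
combined = pd.Series(means, name='data')
print(combined)
```

Here `combined` holds one value per group, indexed by the group label.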

Many operations on data can be thought of as SAC operations. These include calculating sums, means, standard deviations and other parameters of the groups' distributions; transformations of data, such as normalization or detrending; plotting by group, e.g. boxplots; and many others. (Some operations cannot be thought of as purely SAC, most prominently those in which the same observation contributes to several overlapping groups, e.g. rolling window means.)
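
As an illustration of a transformation-type SAC operation, the sketch below centers each value on the mean of its own group using the `transform` method of a `pandas` group object. The table `toy` and the column name `centered` are made up for this example:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'group': ['A', 'B', 'C', 'A', 'B', 'C'],
                    'data': range(6)})

# Subtract each group's own mean from its values; the result has one
# value per original row and is aligned with the original index
toy['centered'] = toy.groupby('group')['data'].transform(lambda x: x - x.mean())
print(toy)
```

Unlike a sum or a mean, which return one value per group, a transformation returns one value per original row.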

The traditional way of doing these operations is with loops, where on each iteration a subset of the data is selected and processed. Loops, however, are slow and usually require a lot of code, which makes them difficult to read, and they are not easily extended from one to several grouping variables.
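
To make the last point concrete, here is a rough sketch of what a loop over two grouping variables might look like. The table `toy` and its column names are invented for the illustration:

```python
import pandas as pd

# Hypothetical toy data with two grouping variables
toy = pd.DataFrame({'group1': ['A', 'B', 'A', 'B'],
                    'group2': ['x', 'x', 'y', 'x'],
                    'data': [1, 2, 3, 4]})

sums = {}
for g1 in toy['group1'].unique():
    for g2 in toy['group2'].unique():
        # select the rows that belong to this combination of groups
        mask = (toy['group1'] == g1) & (toy['group2'] == g2)
        if mask.any():  # skip combinations that never occur in the data
            sums[(g1, g2)] = toy.loc[mask, 'data'].sum()
print(sums)
```

Every additional grouping variable adds another level of nesting and another condition to the mask.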

`groupby` is a method of `DataFrame` that makes any SAC operation easy to write and to read.

>**Note**: Tidy data is the most convenient form for SAC operations: because each grouping variable is stored in its own column, you always have access to any combination of them.

Let's see a toy example of using a `groupby` operation instead of a loop.

In [64]:
df = pd.DataFrame({'group': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)})
df

,data,group
0,0,A
1,1,B
2,2,C
3,3,A
4,4,B
5,5,C


Let's say I want to calculate the sum of the `data` column for each value of the `group` variable and save it in a `Series`. I can do it with a loop:

In [65]:
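# an explicit dtype avoids relying on the default dtype of an empty Series,
# which differs between pandas versions
result = pd.Series(dtype='int64')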

groups = df['group'].unique()
for g in groups:
    data = df.loc[df['group']==g, 'data']
    result[g] = np.sum(data)

result

A    3
B    5
C    7
dtype: int64

This code does the job, but it is quite long. If I try to shorten it, it becomes much harder to read:

In [66]:
result = pd.Series(dtype='int64')
for g in df['group'].unique():
    result[g] = np.sum(df.loc[df['group']==g, 'data'])

result

A    3
B    5
C    7
dtype: int64

Now let's try to do the same thing with `groupby`:

In [67]:
df.groupby('group')['data'].sum()

group
A    3
B    5
C    7
Name: data, dtype: int32

Notice how short and readable this is. Moreover, let's say I have a more complicated example with several grouping variables:

In [71]:
df = pd.DataFrame({'group1': ['A', 'B', 'C']*3,
                   'group2': ['A']*4 + ['B']*1 + ['C']*4,
                   'data': range(9)})
df

,data,group1,group2
0,0,A,A
1,1,B,A
2,2,C,A
3,3,A,A
4,4,B,B
5,5,C,C
6,6,A,C
7,7,B,C
8,8,C,C


Calculating a sum based on several grouping variables requires significantly more code with loops. With `groupby` it is as easy as passing a list of grouping variables:

In [73]:
result = df.groupby(['group1','group2'])['data'].sum()
result

group1  group2
A       A          3
        C          6
B       A          1
        B          4
        C          7
C       A          2
        C         13
Name: data, dtype: int32

>**Pro-tip**: You may notice that the index of the resulting `Series` has 2 levels: `group1` and `group2`. This is referred to as a *hierarchical index* or `MultiIndex`, and is a way to stack several dimensions of data. We won't go into the details of `MultiIndex` (if you wish to learn more, you may refer to [this section](http://nbviewer.jupyter.org/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.05-Hierarchical-Indexing.ipynb) of [Python Data Science Handbook](http://nbviewer.jupyter.org/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/Index.ipynb) and to the [MultiIndex](http://pandas.pydata.org/pandas-docs/stable/advanced.html) section of the `pandas` documentation). For our purposes we just need to know 2 things: how to index a `MultiIndex` and how to *unstack* dimensions to turn it into a table:

In [76]:
# get an element with group1 = A and group2 = C
result[('A','C')]

6

In [79]:
# unstack the multiindex (turn its innermost level into columns)
result.unstack()

group2,A,B,C
group1,,,
A,3.0,,6.0
B,1.0,4.0,7.0
C,2.0,,13.0
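
A few more operations on the same result may be handy. The sketch below rebuilds the toy table from above and shows partial indexing with `.loc`, unstacking with a fill value, and `reset_index`, which turns the index levels back into ordinary columns. The names `only_b`, `wide` and `tidy` are chosen just for this example:

```python
import pandas as pd

df = pd.DataFrame({'group1': ['A', 'B', 'C'] * 3,
                   'group2': ['A'] * 4 + ['B'] * 1 + ['C'] * 4,
                   'data': range(9)})
result = df.groupby(['group1', 'group2'])['data'].sum()

# Partial indexing: all group2 entries for group1 == 'B'
only_b = result.loc['B']
print(only_b)

# Fill combinations that never occur with 0 instead of NaN
wide = result.unstack(fill_value=0)
print(wide)

# Turn both index levels back into ordinary columns (tidy form)
tidy = result.reset_index()
print(tidy)
```

With `fill_value=0` the table keeps its integer type, since no missing values have to be represented.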


Overall, `groupby` is an extremely useful tool that makes group-based operations quicker to write and more readable. Let's see some concrete examples of how you can use it. We will work with data from the food preferences task provided by Paolo Garlasco. Let's load it first and do some cleanup:

In [81]:
df = pd.read_csv('data/Paolo.csv')
# drop old index column
df.drop('Unnamed: 0', axis='columns', inplace=True)
print(df.shape)
df.head()

(12460, 10)


,item,subj_num,session,pref_b,freq,cal,cond,congr,response,rt
0,ciliegie,12,1,8,7,38,2,1,1,559.000015
1,anguria-02,12,1,4,3,16,4,0,1,496.999979
2,caramelle,12,1,9,2,394,1,1,0,496.999979
3,melone-01,12,1,7,5,33,2,1,1,575.000048
4,ananas,12,1,4,3,40,2,1,0,512.000084


The data contains 4 subjects:

In [83]:
df['subj_num'].unique()

array([12,  3,  6,  8], dtype=int64)

Let's calculate the mean reaction time for each subject:

In [86]:
df.groupby('subj_num')['rt'].mean()

subj_num
3     759.796249
6     782.063034
8     908.831453
12    562.563121
Name: rt, dtype: float64

Subjects also seem to have more than 1 session, so we might want to compute the mean for each session separately:

In [88]:
rt_subject_session = df.groupby(['subj_num','session'])['rt'].mean()
rt_subject_session

subj_num  session
3         0           709.919997
          1           809.672502
6         0           750.098065
          1           811.027536
8         0          1022.428570
          1           808.440000
12        0           622.272497
          1           502.853746
Name: rt, dtype: float64
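
The same pattern extends to several summary statistics at once and to a subject-by-session table. Since the data file is not bundled with this text, the sketch below uses made-up reaction times that only mimic the column names `subj_num`, `session` and `rt`; the numbers are not real results:

```python
import pandas as pd

# Made-up values that only mimic the column names used above
toy = pd.DataFrame({'subj_num': [3, 3, 3, 3, 6, 6, 6, 6],
                    'session':  [0, 0, 1, 1, 0, 0, 1, 1],
                    'rt':       [700., 720., 800., 820., 740., 760., 810., 830.]})

# Several summary statistics per subject in a single call
summary = toy.groupby('subj_num')['rt'].agg(['mean', 'std', 'count'])
print(summary)

# Subject-by-session table of mean reaction times
table = toy.groupby(['subj_num', 'session'])['rt'].mean().unstack()
print(table)
```

In `table`, rows are subjects and columns are sessions, which is often easier to scan than the two-level index.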

# <font color='DarkSeaGreen '>Exercise</font>
In the cell below, calculate the mean response for each food item.




In [58]:
df.groupby(['item','cond'])['response'].count().unstack()

cond,1,2,3,4
item,,,,
albicocca-01,,145.0,8.0,151.0
ananas,,144.0,16.0,148.0
anguria-02,,147.0,23.0,137.0
aragosta,,147.0,32.0,132.0
bavarese ai mirtilli,,146.0,120.0,48.0
brie,149.0,,125.0,36.0
broccolo,,151.0,39.0,122.0
budino al cioccolato,,146.0,125.0,40.0
cannoli,151.0,,119.0,44.0
capocollo,150.0,,108.0,56.0
