# Split apply and combine

In [1]:
import addutils.toc ; addutils.toc.js(ipy_notebook=True)

In [2]:
from utilities.generators import generate_all
generate_all()


This tutorial covers advanced data management with `pandas` data structures.

In [3]:
import numpy as np
import pandas as pd
from pandas import DataFrame, read_csv
from IPython.display import (display, HTML)
from addutils import side_by_side2
from addutils import css_notebook
css_notebook()

Categorizing a data set and applying a function to each group is often a critical component of a data analysis workflow. After loading, merging, and preparing a data set, a familiar task is to compute group statistics or possibly pivot tables for reporting or visualization purposes. *pandas* provides a flexible and high-performance groupby facility.

By *'group by'* we refer to a process involving one or more of the following steps:

* **Splitting** the data into groups based on some criteria
* **Applying** a function to each group independently
* **Combining** the results into a data structure
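
The three steps listed above can be sketched by hand and compared with a single `groupby` call. This is a minimal illustration only: the small DataFrame and its values are hypothetical and are not taken from the tutorial's data files.

```python
import pandas as pd

# Hypothetical data, not taken from the tutorial's log file
df = pd.DataFrame({
    'State': ['NY', 'NY', 'CA', 'CA'],
    'Views': [10, 20, 5, 15],
})

# Split: collect the rows belonging to each State
pieces = {state: df[df['State'] == state] for state in df['State'].unique()}

# Apply: compute one statistic per piece
totals = {state: piece['Views'].sum() for state, piece in pieces.items()}

# Combine: assemble the per-group results into a single Series
manual = pd.Series(totals, name='Views').sort_index()

# groupby performs the same three steps in one call
grouped = df.groupby('State')['Views'].sum()
```

Both `manual` and `grouped` hold one total per state, which is the sense in which `groupby` bundles the split, apply and combine steps.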

Suppose we are managing a website and we have a log file with the number of *views* and *likes* coming from different cities:

In [4]:
d1 = read_csv('temp/p07_d1.txt', index_col=0)
d1 = d1.reindex(columns=['State','City','Views','Likes'])
display(d1)


## 1 Groupby

`groupby` groups a `DataFrame` or `Series` by a key along a given axis:

In [5]:
g1 = d1.groupby('State')
print(g1.groups)


The `groups` attribute of a `GroupBy` object is a dictionary that maps each group name to the index labels of the rows in that group.
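
As a small self-contained illustration of that dictionary, the sketch below builds a hypothetical frame with string index labels (an assumption made only so the labels are easy to read) and inspects `groups`. `get_group` is shown as one way to pull out the rows of a single group.

```python
import pandas as pd

# Hypothetical data with explicit index labels
df = pd.DataFrame(
    {'State': ['NY', 'CA', 'NY', 'CA'], 'Views': [10, 5, 20, 15]},
    index=['a', 'b', 'c', 'd'],
)
g = df.groupby('State')

# Convert the index objects to plain lists for readability
members = {name: list(labels) for name, labels in g.groups.items()}

# get_group returns the rows of a single group
ny = g.get_group('NY')
```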

In [6]:
for name, group in g1:
    print(name)
    print(group)
    print('Total Views: %d - Total Likes: %d\n\n' % (group['Views'].sum(),
                                                     group['Likes'].sum()))


It is also possible to apply a `groupby` over a `DataFrame` with a hierarchical index:

In [7]:
d2 = d1.set_index(['State','City'])
display(d2)


## 2 Aggregate

Once the `GroupBy` object has been created, several methods are available to perform a computation on the grouped data. Here we use `aggregate`. The result of the aggregation has the group names as the new index along the grouped axis. With multiple keys, the result has a MultiIndex by default, though this can be changed with the `as_index` option:

In [8]:
g2 = d2.groupby(level=[0])
print(g2.groups)
g2.aggregate(np.sum)


In [9]:
g3 = d2.groupby(level=[0,1])
g4 = d2.groupby(level=[0,1], as_index=False)
HTML(side_by_side2(g3.aggregate(np.sum), g4.aggregate(np.sum)))

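
Because the data file is not available here, the following sketch reproduces the idea on a hypothetical frame with the same column names. It groups a hierarchical index by its outer level, then contrasts the default MultiIndex result with `as_index=False`. For the `as_index` comparison the keys are passed as column names rather than levels, which is a simplification of the cell above.

```python
import pandas as pd

# Hypothetical data using the tutorial's column names
df = pd.DataFrame({
    'State': ['NY', 'NY', 'CA', 'CA'],
    'City': ['Albany', 'Albany', 'Fresno', 'Napa'],
    'Views': [10, 20, 5, 15],
    'Likes': [1, 2, 3, 4],
})

# Grouping a hierarchical index by its outer level (State)
by_state = df.set_index(['State', 'City']).groupby(level=0).sum()

# Two keys: the result is indexed by a (State, City) MultiIndex
multi = df.groupby(['State', 'City']).sum()

# as_index=False keeps the keys as ordinary columns instead
flat = df.groupby(['State', 'City'], as_index=False).sum()
```

`multi` and `flat` contain the same totals; only the placement of the keys differs (index versus columns).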

`aggregate` accepts any function that reduces a vector to a scalar value, and it can also take a list of functions:

In [10]:
d1[['State', 'Views']].groupby('State').aggregate([np.sum, np.mean, np.std])

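
A self-contained sketch of the same idea follows, mixing built-in reductions (given by name) with a user-defined one. The data and the helper `value_range` are hypothetical and serve only to show that any vector-to-scalar function can be used.

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({
    'State': ['NY', 'NY', 'CA', 'CA'],
    'Views': [10, 20, 5, 15],
})

def value_range(values):
    """Reduce a vector to a scalar: max minus min."""
    return values.max() - values.min()

# One output column per function, named after the function
stats = df.groupby('State')['Views'].aggregate(['sum', 'mean', value_range])
```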

## 3 Apply

`apply` extends the previous concepts to any Python function:

In [11]:
pd.set_option('display.float_format', lambda x: '{:.1f}'.format(x))

def add_field(group):
    group['Tot.Views'] = group['Views'].sum()
    group['Likes[%]'] = 100.0*group['Likes']/group['Likes'].sum()
    return group

HTML(side_by_side2(d1, d1.groupby('State').apply(add_field)))

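
The same two columns can also be computed without a custom function. The sketch below uses `transform`, which the original cell does not use; it is shown only as an alternative, on hypothetical data, for cases where the result must stay aligned with the original rows.

```python
import pandas as pd

# Hypothetical data using the tutorial's column names
df = pd.DataFrame({
    'State': ['NY', 'NY', 'CA', 'CA'],
    'Views': [10, 20, 5, 15],
    'Likes': [1, 3, 2, 2],
})

# transform returns one value per original row, so it can be assigned directly
df['Tot.Views'] = df.groupby('State')['Views'].transform('sum')
df['Likes[%]'] = 100.0 * df['Likes'] / df.groupby('State')['Likes'].transform('sum')
```

Each row now carries its state's total views and its own share of that state's likes.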

## 4 A practical example: Normalize by year

In [12]:
idx = pd.date_range('1999/5/28', periods=1500, freq='1B')
s1 = pd.Series(np.random.normal(5.5, 2, 1500), idx)
s1 = s1.rolling(window=10, min_periods=10).mean().dropna()

Here we define a grouping key for months and one for years:

In [13]:
def my_groupby_key_year(timestamp):
    return timestamp.year

def my_groupby_key_month(timestamp):
    return timestamp.month

def my_normalization(group):
    return (group-group.mean())/group.std()

Here we normalize the data on a monthly basis and check the mean and standard deviation on a yearly basis:

In [14]:
t1 = s1.groupby(my_groupby_key_month).apply(my_normalization)

HTML(side_by_side2(s1.head(8),
                   t1.head(8),
                   t1.groupby(my_groupby_key_year).aggregate([np.mean, np.std])))

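| Date | `s1` |
|---|---|
| 1999-06-10 | 4.6 |
| 1999-06-11 | 4.4 |
| 1999-06-14 | 4.5 |
| 1999-06-15 | 4.3 |
| 1999-06-16 | 4.2 |
| 1999-06-17 | 4.2 |
| 1999-06-18 | 4.3 |
| 1999-06-21 | 4.6 |

| Date | `t1` |
|---|---|
| 1999-06-10 | -1.3 |
| 1999-06-11 | -1.6 |
| 1999-06-14 | -1.4 |
| 1999-06-15 | -1.8 |
| 1999-06-16 | -1.9 |
| 1999-06-17 | -1.9 |
| 1999-06-18 | -1.8 |
| 1999-06-21 | -1.3 |

| Year | mean | std |
|---|---|---|
| 1999 | 0.1 | 1.1 |
| 2000 | 0.1 | 0.9 |
| 2001 | 0.2 | 1.1 |
| 2002 | -0.1 | 0.9 |
| 2003 | 0.0 | 0.9 |
| 2004 | -0.1 | 0.9 |
| 2005 | -0.2 | 1.3 |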
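
As a sanity check on the normalization step, the sketch below repeats it on freshly generated data and verifies the result within the month groups themselves. It assumes a fixed random seed (chosen only for reproducibility), skips the rolling mean, and uses `transform` with the index's `month` attribute instead of the key functions defined earlier.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
idx = pd.date_range('1999-05-28', periods=1500, freq='B')
s = pd.Series(rng.normal(5.5, 2, 1500), index=idx)

# Group by calendar month (all Januaries together, and so on), then z-score
normalized = s.groupby(s.index.month).transform(lambda g: (g - g.mean()) / g.std())

# Within each month group the mean should be ~0 and the standard deviation ~1
check = normalized.groupby(normalized.index.month).agg(['mean', 'std'])
```

Regrouping the result by year, as the cell above does, typically gives means and standard deviations that are only close to 0 and 1, because each year mixes values from twelve separately normalized month groups.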


## 5 A practical example: Group and standardize by dimension

In [15]:
d3 = pd.read_csv('example_data/company.csv', index_col=0)
display(d3.head())


Since the column "Value" contains strings with a space separator, we need a simple intermediate step to convert the values from strings to floats:

In [16]:
d3['Value'] = d3['Value'].apply(lambda x: float(x.replace(' ', '')))
d3.head()

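
The conversion can also be written with vectorized string methods. This is a sketch on hypothetical values, since the contents of `company.csv` are not shown; it assumes the space acts as a thousands separator, as the text suggests.

```python
import pandas as pd

# Hypothetical values; the real company.csv is not shown in the tutorial
values = pd.Series(['1 200', '35', '10 000.5'])

# Remove the spaces literally (no regex), then cast the strings to float
as_float = values.str.replace(' ', '', regex=False).astype(float)
```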

In [17]:
d3.groupby('Dimension').mean()

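
The section title mentions standardizing, while the cell above only computes group means. One possible way to standardize each value within its group is sketched below. The frame is a hypothetical stand-in for `company.csv` with just the two columns named in the text, and the `Standardized` column name is an assumption.

```python
import pandas as pd

# Hypothetical stand-in for company.csv with the two columns used in the text
df = pd.DataFrame({
    'Dimension': ['small', 'small', 'small', 'large', 'large', 'large'],
    'Value': [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

grouped = df.groupby('Dimension')['Value']

# Z-score of each value relative to the mean and std of its own group
df['Standardized'] = (df['Value'] - grouped.transform('mean')) / grouped.transform('std')
```

After this step the values of different dimensions are on a comparable scale, even though the raw magnitudes differ by a factor of ten.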

---

Visit [www.add-for.com](<http://www.add-for.com/IT>) for more tutorials and updates.
This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.