# Motivation 

Split-apply-combine is a common and powerful pattern for manipulating structured data: split the data into groups, apply a function to each group, and combine the results into a form that is easier to interpret.

This is especially common in `R`. 

# Getting Started 

If you are new to split-apply-combine, I recommend reading Hadley Wickham's excellent paper, [The Split-Apply-Combine Strategy for Data Analysis](http://www.jstatsoft.org/v40/i01/paper).

# Loading the Data

In [9]:
from blaze import Data, by
import pandas as pd
from odo import into

In [12]:
d = Data('../blaze-tutorial/data/iris.csv')

In [13]:
df = into(pd.DataFrame, d)
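
If `blaze` and `odo` are not available, a plain `pandas` load is a reasonable stand-in. The sketch below is not the approach used in this notebook. It assumes the CSV has a header row with the column names that `df.head()` displays, and it inlines the first five rows shown in this notebook so that it runs without the file on disk.

```python
import io

import pandas as pd

# The five rows displayed by df.head() in this notebook, inlined so the
# sketch does not depend on ../blaze-tutorial/data/iris.csv being present.
csv_text = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
"""

# With the real file, pass its path to read_csv instead of the StringIO buffer.
iris = pd.read_csv(io.StringIO(csv_text))
print(iris.dtypes)
```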

In [8]:
!ls ../blaze-tutorial

00-Motivation.ipynb         06-Blaze-with-into.ipynb
01-into-Introduction.ipynb  README.md
02-into-Datatypes.ipynb     Will_Examples.ipynb
03-into-Design.ipynb        accounts.csv
04-Blaze-Introduction.ipynb data
05-Blaze-with-SQL.ipynb     images


# Split-Apply-Combine in Pandas

A comprehensive walkthrough of split-apply-combine in `Pandas` can be found [here](http://pandas.pydata.org/pandas-docs/version/0.15.2/groupby.html).
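
To make the three steps concrete, here is a minimal sketch that performs them by hand and then with a single `groupby` call. The frame `toy` and its column names are hypothetical and are not taken from the iris data.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
toy = pd.DataFrame({'key': ['a', 'b', 'a', 'b'],
                    'value': [1, 2, 3, 4]})

# Split: one sub-frame per distinct key.
pieces = {k: g for k, g in toy.groupby('key')}

# Apply: reduce each piece to a single number.
sums = {k: g['value'].sum() for k, g in pieces.items()}

# Combine: gather the per-group results into one Series.
by_hand = pd.Series(sums, name='value').sort_index()

# The same three steps expressed as one groupby call.
with_groupby = toy.groupby('key')['value'].sum()

print(by_hand.tolist())
print(with_groupby.tolist())
```

Both approaches should give the same per-key totals; the `groupby` form simply hides the bookkeeping.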

In [15]:
df.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


In [19]:
from numpy.random import randn
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'C': randn(8), 'D': randn(8)})


In [26]:
df

Unnamed: 0,A,B,C,D
0,foo,one,0.636712,0.223871
1,bar,one,-0.59095,-1.145235
2,foo,two,0.042861,-0.458442
3,bar,three,0.281579,1.118033
4,foo,two,-1.263378,-1.194287
5,bar,two,-0.609225,0.052197
6,foo,one,-0.731935,-1.689991
7,foo,three,1.120949,0.363922


In [33]:
df.groupby('A').first()
df.groupby('A').nth(2)
df.groupby('A').last()

A,B,C,D
bar,one,-0.59095,-1.145235
foo,one,0.636712,0.223871
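
A notebook cell displays only the value of its last expression, so a cell containing several `groupby` selections shows a single table. The sketch below prints `first`, `nth(2)`, and `last` separately. It uses a hypothetical frame `demo` with fixed integers in `C` in place of `randn`, purely so the result is reproducible.

```python
import pandas as pd

# Fixed values instead of randn so the results are reproducible (an assumption).
demo = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                     'C': [1, 2, 3, 4, 5, 6, 7, 8]})

grouped = demo.groupby('A')

first_rows = grouped.first()   # first non-null value per column, per group
third_rows = grouped.nth(2)    # row at 0-based position 2 within each group
last_rows = grouped.last()     # last non-null value per column, per group

# Print each result explicitly instead of relying on cell display.
for name, result in [('first', first_rows),
                     ('nth(2)', third_rows),
                     ('last', last_rows)]:
    print(name)
    print(result)
```

The index and columns of the `nth` result differ between pandas versions, so it is safer to compare its values than to rely on its exact layout.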


In [24]:
df.groupby(['A','B']).first()

A,B,C,D
bar,one,-0.59095,-1.145235
bar,three,0.281579,1.118033
bar,two,-0.609225,0.052197
foo,one,0.636712,0.223871
foo,three,1.120949,0.363922
foo,two,0.042861,-0.458442


In [31]:
df.groupby('B').nth(2)

B,A,C,D
one,foo,-0.731935,-1.689991
two,bar,-0.609225,0.052197


In [35]:
df

Unnamed: 0,A,B,C,D
0,foo,one,0.636712,0.223871
1,bar,one,-0.59095,-1.145235
2,foo,two,0.042861,-0.458442
3,bar,three,0.281579,1.118033
4,foo,two,-1.263378,-1.194287
5,bar,two,-0.609225,0.052197
6,foo,one,-0.731935,-1.689991
7,foo,three,1.120949,0.363922


In [37]:
df.groupby('B')

<pandas.core.groupby.DataFrameGroupBy object at 0x115c2ee10>
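
The `DataFrameGroupBy` object above holds only the grouping and computes nothing until an operation is requested. A minimal sketch of ways to inspect it, using a hypothetical frame `demo` with the same `B` labels and fixed integers in `C`:

```python
import pandas as pd

# Same B labels as the frame above; C holds fixed integers (an assumption).
demo = pd.DataFrame({'B': ['one', 'one', 'two', 'three',
                           'two', 'two', 'one', 'three'],
                     'C': [1, 2, 3, 4, 5, 6, 7, 8]})
grouped = demo.groupby('B')

# Number of rows in each group.
sizes = grouped.size()
print(sizes)

# Iterating yields (key, sub-frame) pairs, in sorted key order by default.
for key, frame in grouped:
    print(key, frame['C'].tolist())

# Pull out a single group by its key.
two = grouped.get_group('two')
print(two)
```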

In [43]:
import numpy as np
index = pd.date_range('10/1/1999', periods=1100)
ts = pd.Series(np.random.normal(0.5, 2, 1100), index)
ts.head()

1999-10-01    1.449875
1999-10-02    2.753562
1999-10-03   -2.007073
1999-10-04    2.309854
1999-10-05    3.255048
Freq: D, dtype: float64

In [44]:
ts = pd.rolling_mean(ts, 100, 100).dropna()
ts.head()

2000-01-08    0.580615
2000-01-09    0.569145
2000-01-10    0.499904
2000-01-11    0.536755
2000-01-12    0.506705
Freq: D, dtype: float64
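
`pd.rolling_mean` was deprecated in pandas 0.18 and is absent from later releases. On a newer pandas, the same 100-day moving average can be sketched with the `.rolling` method. The name `smoothed` is introduced here only for illustration.

```python
import numpy as np
import pandas as pd

index = pd.date_range('10/1/1999', periods=1100)
ts = pd.Series(np.random.normal(0.5, 2, 1100), index)

# 100-day moving average. min_periods=100 leaves the first 99 positions as
# NaN, and dropna() then removes them.
smoothed = ts.rolling(window=100, min_periods=100).mean().dropna()
print(smoothed.head())
```

The first surviving timestamp is the 100th day of the series, which matches the `2000-01-08` start shown in the output above.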

In [50]:
dff = pd.DataFrame({'A': np.arange(8), 'B': list('aabbbbcc')})
dff.groupby('B').filter(lambda x: len(x) >2)

Unnamed: 0,A,B
2,2,b
3,3,b
4,4,b
5,5,b


In [49]:
dff

Unnamed: 0,A,B
0,0,a
1,1,a
2,2,b
3,3,b
4,4,b
5,5,b
6,6,c
7,7,c
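
`filter` calls the function once per group and keeps every row of a group for which it returns `True`. As a sketch of what that means, the same selection can be written as a boolean mask built from each group's size. The names `kept`, `group_size`, and `masked` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

dff = pd.DataFrame({'A': np.arange(8), 'B': list('aabbbbcc')})

# Keep all rows of any group that has more than two members.
kept = dff.groupby('B').filter(lambda g: len(g) > 2)

# Equivalent mask: broadcast each group's size back onto its rows,
# then select the rows whose group is large enough.
group_size = dff.groupby('B')['A'].transform('size')
masked = dff[group_size > 2]

print(kept)
print(kept.equals(masked))
```

Only group `b` has more than two rows, so groups `a` and `c` are dropped entirely.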


In [3]:
type({1:3})

dict