# Data Aggregation and Group Operations
pandas supports a range of grouped operations that work with any function accepting a pandas object or NumPy array. These operations make it possible to:
* Split a pandas object into pieces using one or more keys
* Compute group summary statistics
* Apply a varying set of functions to each column of a DataFrame
* Apply within-group transformations or other manipulations
* Compute pivot tables and cross-tabulations
* Perform quantile analysis and other data-derived group analysis

## GroupBy Mechanics
The term _split-apply-combine_ can be decomposed as follows:
* First, the data contained in a pandas object is __split__ into groups based on one or more _keys_.
* Then, a function is __applied__ to each group, producing a new value.
* Finally, the results of all those function applications are __combined__ into a result object.
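
The three steps can be spelled out by hand on a tiny example. The following is a minimal sketch, assuming a small hypothetical DataFrame `toy` that is not part of the data used in this chapter. It only illustrates the idea and does not reflect how pandas implements `groupby` internally.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'a'],
                    'value': [1.0, 2.0, 3.0, 4.0, 8.0]})

# Split: one sub-Series of values per distinct key
pieces = {k: toy.loc[toy['key'] == k, 'value'] for k in toy['key'].unique()}

# Apply: reduce each piece to a single value
means = {k: piece.mean() for k, piece in pieces.items()}

# Combine: assemble the per-group results into one object
manual = pd.Series(means, name='value').sort_index()
manual.index.name = 'key'

# The same computation expressed with groupby
auto = toy.groupby('key')['value'].mean()
```

Both `manual` and `auto` hold one mean per key; `groupby` simply performs the split, apply, and combine steps in a single call.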

In [1]:
import pandas as pd
import numpy as np
df = pd.DataFrame({'key1' : ['a'] * 2 + ['b'] * 2 + ['a'],
                   'key2' : ['one', 'two', 'one', 'two', 'one'],
                   'data1' : np.random.randn(5), 
                   'data2' : np.random.randn(5)})
df

      data1     data2 key1 key2
0  0.760266  0.751812    a  one
1  0.494145 -1.313654    a  two
2  0.504178  1.561412    b  one
3 -1.661199 -0.079162    b  two
4  0.436205  0.512862    a  one


In [2]:
grouped = df['data1'].groupby(df['key1'])
grouped

<pandas.core.groupby.SeriesGroupBy object at 0x7fce847e2240>

In [3]:
grouped.mean()

key1
a    0.563539
b   -0.578510
Name: data1, dtype: float64

In [4]:
df['data1'].groupby([df['key1'], df['key2']]).mean()

key1  key2
a     one     0.598236
      two     0.494145
b     one     0.504178
      two    -1.661199
Name: data1, dtype: float64

In [5]:
# numeric_only=True skips the string column key2, which cannot be averaged
df.groupby('key1').mean(numeric_only=True)

         data1     data2
key1
a     0.563539 -0.016327
b    -0.578510  0.741125


In [6]:
df.groupby(['key1', 'key2']).size()

key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64

In [7]:
for (k1, k2), group in df.groupby(['key1', 'key2']):
    print(k1, k2)
    print(group)

a one
      data1     data2 key1 key2
0  0.760266  0.751812    a  one
4  0.436205  0.512862    a  one
a two
      data1     data2 key1 key2
1  0.494145 -1.313654    a  two
b one
      data1     data2 key1 key2
2  0.504178  1.561412    b  one
b two
      data1     data2 key1 key2
3 -1.661199 -0.079162    b  two
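
Iterating over a GroupBy object yields pairs of group name and data, so the pieces can also be collected or fetched individually. The sketch below assumes a small hypothetical frame `toy` rather than the random data above, so that the results are reproducible.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                    'data1': [1, 2, 3, 4, 5]})

# Materialise the groups as a dict: group name -> sub-DataFrame
pieces = dict(list(toy.groupby('key1')))
b_rows = pieces['b']

# get_group fetches a single group without building the whole dict
a_rows = toy.groupby('key1').get_group('a')
```

Each piece keeps the original row labels, which makes it easy to trace group members back to the source frame.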


In [8]:
df.dtypes

data1    float64
data2    float64
key1      object
key2      object
dtype: object

In [9]:
grouped = df.groupby(df.dtypes, axis= 1)
dict(list(grouped))

{dtype('float64'):       data1     data2
 0  0.760266  0.751812
 1  0.494145 -1.313654
 2  0.504178  1.561412
 3 -1.661199 -0.079162
 4  0.436205  0.512862, dtype('O'):   key1 key2
 0    a  one
 1    a  two
 2    b  one
 3    b  two
 4    a  one}

Indexing a GroupBy object created from a DataFrame with a column name or an array of column names has the effect of _selecting those columns_ for aggregation. A single name yields a grouped Series, while a list of names yields a grouped DataFrame.

In [10]:
df.groupby(['key1', 'key2'])['data2'].mean()

key1  key2
a     one     0.632337
      two    -1.313654
b     one     1.561412
      two    -0.079162
Name: data2, dtype: float64

In [11]:
df.groupby(['key1', 'key2'])[['data2']].mean()

               data2
key1 key2
a    one    0.632337
     two   -1.313654
b    one    1.561412
     two   -0.079162
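
Column selection on a GroupBy object is shorthand for grouping the selected column directly. The following sketch, which assumes a hypothetical frame `toy` with fixed values, shows that both spellings give the same result.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                    'data1': [1.0, 2.0, 3.0, 4.0, 6.0],
                    'data2': [10.0, 20.0, 30.0, 40.0, 60.0]})

# Column selection on the GroupBy object ...
short_form = toy.groupby('key1')['data1'].mean()
# ... is shorthand for grouping the selected column by the key column
long_form = toy['data1'].groupby(toy['key1']).mean()

# A list of names yields a DataFrame instead of a Series
as_frame = toy.groupby('key1')[['data1']].mean()
```

Selecting columns before aggregating also avoids computing results for columns that are not needed, which can matter for large data sets.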


Grouping information can also be supplied as a dictionary or a Series that maps labels to group names.

In [12]:
ppl = pd.DataFrame(np.random.randn(5, 5),
                   columns=['a1', 'a2', 'a3', 'a4', 'a5'],
                   index = ['Peter', 'Eric', 'Bob', 'Paul', 'Andy'])
# .ix has been removed from pandas; select rows 2 and 3 (Bob, Paul) by label instead
ppl.loc[ppl.index[2:4], ['a1', 'a5']] = np.nan
ppl

             a1        a2        a3        a4        a5
Peter -0.484088  0.264213  1.495071  1.355230  1.012155
Eric   0.508173  0.441205 -1.254790  1.066204  1.283697
Bob         NaN  0.586082  0.109566 -1.281299       NaN
Paul        NaN  0.539115 -0.667810  1.088757       NaN
Andy  -0.623018 -1.334974 -1.640993  0.472824  0.146745


In [13]:
mapping = {'a1': 'salary', 'a2': 'deduction', 'a3': 'salary', 'a4': 'salary', 'a5': 'deduction'}

In [14]:
by_column = ppl.groupby(mapping, axis = 1)
by_column.sum()

      deduction    salary
Peter  1.276368  2.366214
Eric   1.724903  0.319587
Bob    0.586082 -1.171733
Paul   0.539115  0.420947
Andy  -1.188229 -1.791186


In [15]:
map_series = pd.Series(mapping)
map_series

a1       salary
a2    deduction
a3       salary
a4       salary
a5    deduction
dtype: object

In [16]:
ppl.groupby(map_series, axis=1).count()

       deduction  salary
Peter          2       3
Eric           2       3
Bob            1       2
Paul           1       2
Andy           2       3
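
Recent pandas releases deprecate the `axis=1` argument of `groupby`. One possible equivalent, sketched here on a small hypothetical frame `toy`, is to transpose the frame, group its rows with the same mapping, and transpose the result back.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
                   columns=['a1', 'a2', 'a3'], index=['Peter', 'Eric'])
mapping = {'a1': 'salary', 'a2': 'deduction', 'a3': 'salary'}

# Transpose so that columns become rows, group by the mapping, transpose back
by_column = toy.T.groupby(mapping).sum().T
```

The result has one column per group name, just like the `axis=1` version, so the rest of the analysis does not need to change.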


More interestingly, any function passed as a group key is called once per index value, and its return values are used as the group names. In the example below, `len` groups the rows by the length of each person's name.

In [17]:
ppl.groupby(len).sum()

         a1        a2        a3        a4        a5
3       NaN  0.586082  0.109566 -1.281299       NaN
4 -0.114844 -0.354653 -3.563593  2.627785  1.430442
5 -0.484088  0.264213  1.495071  1.355230  1.012155


In [18]:
sort_list = ['t1', 't2', 't3', 't2', 't3']
ppl.groupby([len, sort_list]).min()

            a1        a2        a3        a4        a5
3 t3       NaN  0.586082  0.109566 -1.281299       NaN
4 t2  0.508173  0.441205 -1.254790  1.066204  1.283697
  t3 -0.623018 -1.334974 -1.640993  0.472824  0.146745
5 t1 -0.484088  0.264213  1.495071  1.355230  1.012155
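
Any callable that maps an index label to a group name works as a key, not only built-ins such as `len`. As a minimal sketch, assuming a hypothetical frame `toy` indexed by name, a lambda can group people by the first letter of their name.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({'score': [1, 2, 3, 4, 5]},
                   index=['Peter', 'Eric', 'Bob', 'Paul', 'Andy'])

# The function receives each index label; its return value is the group name
by_initial = toy.groupby(lambda name: name[0]).sum()
```

Here Peter and Paul fall into the same group `'P'`, while every other name forms a group of its own.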


It is also possible to aggregate using one of the levels of an axis index in hierarchically indexed data sets. Pass the level number or name through the `level` keyword.

In [19]:
columns = pd.MultiIndex.from_arrays([['CN', 'CN', 'CN', 'US', 'US'],
                                    [6.7, 6.7, 7.5, 1.0, 2.9]], 
                                    names = ['nation', 'gdp'])

hier_df = pd.DataFrame(np.random.randn(4, 5), columns=columns)
hier_df

nation        CN                            US
gdp          6.7       6.7       7.5       1.0       2.9
0      -0.459703  0.455700 -0.737612 -1.143507  0.016092
1       0.911317  1.277103 -0.633896 -1.001520 -1.922470
2      -0.299690 -0.375309  0.517058 -0.184771  0.517530
3       0.528736  1.119535 -0.281180 -0.281854 -1.045826


In [20]:
hier_df.groupby(level = 'nation', axis = 1).count()

nation  CN  US
0        3   2
1        3   2
2        3   2
3        3   2
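
The `level` keyword works the same way when the hierarchical index sits on the rows. The following sketch assumes a hypothetical frame `toy` whose row index has the levels `nation` and `year`; the names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical two-level row index, used only for illustration
index = pd.MultiIndex.from_arrays([['CN', 'CN', 'US', 'US'],
                                   [2015, 2016, 2015, 2016]],
                                  names=['nation', 'year'])
toy = pd.DataFrame({'value': [1.0, 2.0, 3.0, 4.0]}, index=index)

# Aggregate over one level at a time by naming it
by_nation = toy.groupby(level='nation').sum()
by_year = toy.groupby(level='year').mean()
```

Grouping by a level name avoids having to reset the index into ordinary columns first.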


## Data Aggregation
Any data transformation that produces a scalar value from an array can be referred to as __data aggregation__. Besides built-in methods such as `mean` and `count`, you can define your own aggregation function and pass it to `agg`.

In [21]:
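# Keep only the numeric columns so that every aggregation below applies to them
grouped = df.groupby('key1')[['data1', 'data2']]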

In [22]:
def peak_to_peak(arr):
    return arr.max() - arr.min()

grouped.agg(peak_to_peak)

         data1     data2
key1
a     0.324060  2.065466
b     2.165377  1.640574


In [23]:
grouped.describe()

               data1     data2
key1
a    count  3.000000  3.000000
     mean   0.563539 -0.016327
     std    0.172816  1.129853
     min    0.436205 -1.313654
     25%    0.465175 -0.400396
     50%    0.494145  0.512862
     75%    0.627205  0.632337
     max    0.760266  0.751812
b    count  2.000000  2.000000
     mean  -0.578510  0.741125


In [24]:
grouped.agg(['mean', 'std', peak_to_peak])

         data1                            data2
          mean       std peak_to_peak      mean       std peak_to_peak
key1
a     0.563539  0.172816     0.324060 -0.016327  1.129853     2.065466
b    -0.578510  1.531153     2.165377  0.741125  1.160061     1.640574


In [25]:
grouped.agg([('func1', 'mean'),('foo', peak_to_peak)])

         data1               data2
         func1       foo     func1       foo
key1
a     0.563539  0.324060 -0.016327  2.065466
b    -0.578510  2.165377  0.741125  1.640574


In [26]:
# The string name 'max' selects the built-in maximum, equivalent to np.max here
grouped.agg({'data1': 'max', 'data2': peak_to_peak})

         data1     data2
key1
a     0.760266  2.065466
b     0.504178  1.640574
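
When different functions are applied to different columns, the result columns can also be given explicit names. The sketch below uses the keyword form of `agg` (named aggregation, available in pandas 0.25 and later) on a hypothetical frame `toy`; the output names `data1_max` and `data2_range` are chosen only for illustration.

```python
import pandas as pd

def peak_to_peak(arr):
    """Range of the values in one group."""
    return arr.max() - arr.min()

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                    'data1': [1.0, 2.0, 3.0, 5.0, 4.0],
                    'data2': [10.0, 20.0, 30.0, 50.0, 40.0]})

# Each keyword is an output column: (input column, aggregation function)
summary = toy.groupby('key1').agg(
    data1_max=('data1', 'max'),
    data2_range=('data2', peak_to_peak),
)
```

This form yields flat, descriptive column names instead of a two-level column index, which is often easier to work with downstream.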
