# Data Aggregation and Group Operations
pandas can perform grouped operations with any function that accepts a pandas object or NumPy array. This makes it possible to:
* Split a pandas object into pieces using one or more keys
* Compute group summary statistics
* Apply a varying set of functions to each column of a DataFrame
* Apply within-group transformations or other manipulations
* Compute pivot tables and cross-tabulations
* Perform quantile analysis and other data-derived group analysis

## GroupBy Mechanics
The term _split-apply-combine_ breaks down as follows:
* First, the data contained in a pandas object is __split__ into groups based on one or more _keys_.
* Then, a function is __applied__ to each group, producing a new value.
* Finally, the results of all those function applications are __combined__ into a result object.
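
The three steps listed above can be written out by hand and compared with the built-in `groupby`. This is only a sketch: the frame `toy` and its values are made up for illustration and are not part of the notebook's data.

```python
import pandas as pd

# Hypothetical toy data, chosen so the group means are easy to check by hand.
toy = pd.DataFrame({'key': ['a', 'a', 'b', 'b', 'a'],
                    'value': [1.0, 2.0, 3.0, 4.0, 6.0]})

# Split: collect the values that share each key.
pieces = {k: toy.loc[toy['key'] == k, 'value'] for k in toy['key'].unique()}

# Apply: reduce each piece to a single number.
means = {k: piece.mean() for k, piece in pieces.items()}

# Combine: assemble the per-group results into one Series.
manual = pd.Series(means, name='value').sort_index()
manual.index.name = 'key'

# groupby performs the same three steps in a single call.
builtin = toy.groupby('key')['value'].mean()

print(manual)
print(builtin)
```

Both Series hold the same per-key means; `groupby` simply hides the bookkeeping.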

In [1]:
import pandas as pd
import numpy as np
df = pd.DataFrame({'key1' : ['a'] * 2 + ['b'] * 2 + ['a'],
                   'key2' : ['one', 'two', 'one', 'two', 'one'],
                   'data1' : np.random.randn(5), 
                   'data2' : np.random.randn(5)})
df

      data1     data2 key1 key2
0 -1.522204  0.101514    a  one
1 -1.136664  0.678247    a  two
2  0.226972 -0.250847    b  one
3  1.905098 -0.829269    b  two
4  2.461735  1.784615    a  one


In [2]:
grouped = df['data1'].groupby(df['key1'])
grouped

<pandas.core.groupby.SeriesGroupBy object at 0x7f99b07bc8d0>

In [3]:
grouped.mean()

key1
a   -0.065711
b    1.066035
Name: data1, dtype: float64

In [4]:
df['data1'].groupby([df['key1'], df['key2']]).mean()

key1  key2
a     one     0.469766
      two    -1.136664
b     one     0.226972
      two     1.905098
Name: data1, dtype: float64

In [5]:
df.groupby('key1').mean()

         data1     data2
key1
a    -0.065711  0.854792
b     1.066035 -0.540058


In [6]:
df.groupby(['key1', 'key2']).size()

key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64

In [7]:
for (k1, k2), group in df.groupby(['key1', 'key2']):
    print(k1, k2)
    print(group)

a one
      data1     data2 key1 key2
0 -1.522204  0.101514    a  one
4  2.461735  1.784615    a  one
a two
      data1     data2 key1 key2
1 -1.136664  0.678247    a  two
b one
      data1     data2 key1 key2
2  0.226972 -0.250847    b  one
b two
      data1     data2 key1 key2
3  1.905098 -0.829269    b  two


In [8]:
df.dtypes

data1    float64
data2    float64
key1      object
key2      object
dtype: object

In [9]:
grouped = df.groupby(df.dtypes, axis= 1)
dict(list(grouped))

{dtype('float64'):       data1     data2
 0 -1.522204  0.101514
 1 -1.136664  0.678247
 2  0.226972 -0.250847
 3  1.905098 -0.829269
 4  2.461735  1.784615, dtype('O'):   key1 key2
 0    a  one
 1    a  two
 2    b  one
 3    b  two
 4    a  one}

Indexing a GroupBy object created from a DataFrame with a column name or an array of column names has the effect of _selecting those columns_ for aggregation. A single name returns a grouped Series; a list of names returns a grouped DataFrame.

In [10]:
df.groupby(['key1', 'key2'])['data2'].mean()

key1  key2
a     one     0.943065
      two     0.678247
b     one    -0.250847
      two    -0.829269
Name: data2, dtype: float64

In [11]:
df.groupby(['key1', 'key2'])[['data2']].mean()

              data2
key1 key2
a    one   0.943065
     two   0.678247
b    one  -0.250847
     two  -0.829269
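
The column-selection shorthand can be checked against the more explicit form used earlier, where a Series is grouped by another Series. The sketch below uses a small made-up frame (`toy` is hypothetical) so the sums and means are easy to verify.

```python
import pandas as pd

# Hypothetical toy data with one key column and two numeric columns.
toy = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                    'data1': [1.0, 2.0, 3.0, 4.0, 6.0],
                    'data2': [10.0, 20.0, 30.0, 40.0, 60.0]})

# Indexing the GroupBy object selects the column to aggregate ...
shorthand = toy.groupby('key1')['data1'].mean()
# ... which matches grouping the column itself by the key column.
explicit = toy['data1'].groupby(toy['key1']).mean()

# A single name yields a Series; a list of names yields a DataFrame.
as_series = toy.groupby('key1')['data2'].sum()
as_frame = toy.groupby('key1')[['data2']].sum()

print(type(as_series).__name__, type(as_frame).__name__)
```

Selecting columns before aggregating also means that only the selected columns are aggregated, which matters on wide frames.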


Grouping information can also be supplied as a dictionary or a Series that maps labels to group names.

In [12]:
ppl = pd.DataFrame(np.random.randn(5, 5),
                   columns=['a1', 'a2', 'a3', 'a4', 'a5'],
                   index = ['Peter', 'Eric', 'Bob', 'Paul', 'Andy'])
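# .ix has been removed from pandas; select rows 2-3 by position via the index, columns by label
ppl.loc[ppl.index[2:4], ['a1', 'a5']] = np.nan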
ppl

             a1        a2        a3        a4        a5
Peter -0.981540  1.000529 -0.241441 -0.418692 -0.165746
Eric  -1.183826  0.147128 -1.428774  1.153826  0.686745
Bob         NaN  0.639883 -0.604917 -2.146066       NaN
Paul        NaN -0.485980 -1.503564  0.158231       NaN
Andy   1.376205  0.908416 -0.691231  0.077869  0.794499


In [13]:
mapping = {'a1': 'salary', 'a2': 'deduction', 'a3': 'salary', 'a4': 'salary', 'a5': 'deduction'}

In [14]:
by_column = ppl.groupby(mapping, axis = 1)
by_column.sum()

       deduction    salary
Peter   0.834783 -1.641673
Eric    0.833873 -1.458774
Bob     0.639883 -2.750983
Paul   -0.485980 -1.345334
Andy    1.702915  0.762843


In [15]:
map_series = pd.Series(mapping)
map_series

a1       salary
a2    deduction
a3       salary
a4       salary
a5    deduction
dtype: object

In [16]:
ppl.groupby(map_series, axis=1).count()

       deduction  salary
Peter          2       3
Eric           2       3
Bob            1       2
Paul           1       2
Andy           2       3
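
The cells above group columns with `axis=1`. In recent pandas releases (2.1 and later, as far as I know) that argument is deprecated, and the suggested replacement is to transpose, group the rows, and transpose back. A minimal sketch on made-up data (`toy` and its `mapping` are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two rows, three columns, one missing value.
toy = pd.DataFrame([[1.0, 2.0, 3.0],
                    [4.0, np.nan, 6.0]],
                   index=['r1', 'r2'], columns=['a1', 'a2', 'a3'])
mapping = {'a1': 'salary', 'a2': 'deduction', 'a3': 'salary'}

# Transpose so the column labels become row labels, group them through
# the mapping, aggregate, then transpose back to the original orientation.
by_column = toy.T.groupby(mapping).sum().T

print(by_column)
```

As in the notebook output, `sum` skips missing values within a group that still has other entries.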


More interestingly, any function passed as a group key is called once per index value, and the return values are used as the group names.

In [17]:
ppl.groupby(len).sum()

         a1        a2        a3        a4        a5
3       NaN  0.639883 -0.604917 -2.146066       NaN
4  0.192379  0.569564 -3.623570  1.389926  1.481244
5 -0.981540  1.000529 -0.241441 -0.418692 -0.165746


In [18]:
sort_list = ['t1', 't2', 't3', 't2', 't3']
ppl.groupby([len, sort_list]).min()

            a1        a2        a3        a4        a5
3 t3       NaN  0.639883 -0.604917 -2.146066       NaN
4 t2 -1.183826 -0.485980 -1.503564  0.158231  0.686745
  t3  1.376205  0.908416 -0.691231  0.077869  0.794499
5 t1 -0.981540  1.000529 -0.241441 -0.418692 -0.165746
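
Passing a function as the key is equivalent to computing the group labels from the index first and grouping by those. The sketch below illustrates this with `len` on a made-up frame (`toy` is hypothetical).

```python
import pandas as pd

# Hypothetical toy data indexed by names of different lengths.
toy = pd.DataFrame({'x': [1, 2, 3, 4]},
                   index=['Bob', 'Eric', 'Paul', 'Peter'])

# The function is called once per index label: 'Bob' -> 3, 'Eric' -> 4, ...
by_func = toy.groupby(len).sum()

# Equivalent: compute the labels explicitly, then group by them.
labels = toy.index.map(len)
by_labels = toy.groupby(labels).sum()

print(by_func)
```

The explicit form is handy for inspecting the labels when a grouping function does not behave as expected.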


It is also possible to aggregate using one of the levels of an axis index in hierarchically indexed data sets.

In [19]:
columns = pd.MultiIndex.from_arrays([['CN', 'CN', 'CN', 'US', 'US'],
                                    [6.7, 6.7, 7.5, 1.0, 2.9]], 
                                    names = ['nation', 'gdp'])

hier_df =pd.DataFrame(np.random.randn(4, 5), columns = columns)
hier_df

nation        CN                            US
gdp          6.7       6.7       7.5       1.0       2.9
0       0.778640 -1.312251  1.061284 -1.156332  0.886928
1      -0.296621  1.121803  1.453851 -0.687433  0.034735
2       0.550564 -0.518859  0.574172 -0.351274 -0.363114
3       0.214490  0.522048  0.178576  0.570299  0.013277


In [20]:
hier_df.groupby(level = 'nation', axis = 1).count()

nation,CN,US
0,3,2
1,3,2
2,3,2
3,3,2
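
Grouping by level works the same way when the hierarchy is on the row index, and a level can be given by name or by position. A small sketch, assuming a made-up frame with `nation` and `year` levels (both names and all values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data with a two-level row index.
index = pd.MultiIndex.from_arrays([['CN', 'CN', 'US', 'US'],
                                   [2015, 2016, 2015, 2016]],
                                  names=['nation', 'year'])
toy = pd.DataFrame({'value': [1.0, 2.0, 3.0, 5.0]}, index=index)

# Refer to a level by name ...
by_name = toy.groupby(level='nation').sum()
# ... or by position (0 is the outermost level).
by_position = toy.groupby(level=0).sum()

# Any level can serve as the key, not only the outer one.
by_year = toy.groupby(level='year').mean()

print(by_name)
print(by_year)
```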


## Data Aggregation
Any data transformation that produces a scalar value from an array can be referred to as __data aggregation__. You can also define your own aggregation function.

In [21]:
grouped = df.groupby('key1')

In [22]:
def peak_to_peak(arr):
    return arr.max() - arr.min()

grouped.agg(peak_to_peak)

         data1     data2
key1
a     3.983939  1.683100
b     1.678126  0.578422


In [23]:
grouped.describe()

               data1     data2
key1
a    count  3.000000  3.000000
     mean  -0.065711  0.854792
     std    2.197305  0.855326
     min   -1.522204  0.101514
     25%   -1.329434  0.389880
     50%   -1.136664  0.678247
     75%    0.662536  1.231431
     max    2.461735  1.784615
b    count  2.000000  2.000000
     mean   1.066035 -0.540058


In [24]:
grouped.agg(['mean', 'std', peak_to_peak])

         data1                            data2
          mean       std peak_to_peak      mean       std peak_to_peak
key1
a    -0.065711  2.197305     3.983939  0.854792  0.855326     1.683100
b     1.066035  1.186615     1.678126 -0.540058  0.409006     0.578422


In [25]:
grouped.agg([('func1', 'mean'),('foo', peak_to_peak)])

         data1               data2
         func1       foo     func1       foo
key1
a    -0.065711  3.983939  0.854792  1.683100
b     1.066035  1.678126 -0.540058  0.578422


In [26]:
grouped.agg({'data1': np.max, 'data2': peak_to_peak})

         data1     data2
key1
a     2.461735  1.683100
b     1.905098  0.578422
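
The tuple and dictionary forms above can be combined with keyword-style "named aggregation", which fixes both the source column and the output column name in one place. This assumes pandas 0.25 or later, where that form is available. The frame `toy` and the output names `data1_max` and `data2_range` are made up for the sketch.

```python
import pandas as pd

def peak_to_peak(arr):
    """Range of the values in a group: largest minus smallest."""
    return arr.max() - arr.min()

# Hypothetical toy data.
toy = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                    'data1': [1.0, 2.0, 3.0, 4.0, 6.0],
                    'data2': [10.0, 20.0, 30.0, 40.0, 60.0]})

# Each keyword is an output column; each value is (source column, function).
summary = toy.groupby('key1').agg(
    data1_max=('data1', 'max'),
    data2_range=('data2', peak_to_peak),
)

print(summary)
```

The result has flat column names, which can be easier to work with than the two-level columns produced by passing a list of functions.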


## Group-wise operations and transformations
Aggregation is only one kind of group-wise operation; __transform__ and __apply__ support others.

In [27]:
key = ['one', 'two', 'one', 'two', 'two']
ppl.groupby(key).mean()

           a1        a2        a3        a4        a5
one -0.981540  0.820206 -0.423179 -1.282379 -0.165746
two  0.096189  0.189855 -1.207857  0.463309  0.740622


In [28]:
ppl.groupby(key).transform(np.mean)

             a1        a2        a3        a4        a5
Peter -0.981540  0.820206 -0.423179 -1.282379 -0.165746
Eric   0.096189  0.189855 -1.207857  0.463309  0.740622
Bob   -0.981540  0.820206 -0.423179 -1.282379 -0.165746
Paul   0.096189  0.189855 -1.207857  0.463309  0.740622
Andy   0.096189  0.189855 -1.207857  0.463309  0.740622


__transform__ applies a function to each group and places the results in the positions of the original rows. If a group produces a scalar value, that value is broadcast to every row of the group. The passed function must therefore return either a scalar to be broadcast or a transformed array of the same size as the group.

In [29]:
def demean(arr):
    return arr - arr.mean()

In [30]:
demeaned =  ppl.groupby(key).transform(demean)
demeaned

             a1        a2        a3        a4        a5
Peter  0.000000  0.180323  0.181738  0.863687  0.000000
Eric  -1.280016 -0.042727 -0.220917  0.690518 -0.053877
Bob         NaN -0.180323 -0.181738 -0.863687       NaN
Paul        NaN -0.675834 -0.295708 -0.305078       NaN
Andy   1.280016  0.718561  0.516625 -0.385440  0.053877


In [31]:
demeaned.groupby(key).mean()

      a1            a2            a3            a4            a5
one  0.0  5.551115e-17  2.775558e-17 -5.551115e-17  0.000000e+00
two  0.0 -3.700743e-17 -1.480297e-16  0.000000e+00 -5.551115e-17
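
The two return shapes that `transform` accepts, a scalar per group or an array of the group's size, can be seen side by side in a small sketch. The data is made up (`toy` is hypothetical), and filling missing values with the group mean is used here only as one example of a same-size transformation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in group 'two'.
toy = pd.DataFrame({'key': ['one', 'two', 'one', 'two'],
                    'x': [1.0, np.nan, 3.0, 6.0]})
grouped = toy.groupby('key')['x']

# Scalar per group: the mean is broadcast to every row of its group.
group_mean = grouped.transform('mean')

# Same-size result per group: missing values replaced by the group mean.
filled = grouped.transform(lambda s: s.fillna(s.mean()))

print(pd.DataFrame({'x': toy['x'], 'group_mean': group_mean, 'filled': filled}))
```

In both cases the result is indexed like the original rows, so it can be assigned straight back as a new column.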


__apply__, the most general-purpose GroupBy method, splits the object being manipulated into pieces, invokes the passed function on each piece, then attempts to concatenate the pieces together.

In [32]:
f = lambda x: x.describe()
grouped.apply(f)

               data1     data2
key1
a    count  3.000000  3.000000
     mean  -0.065711  0.854792
     std    2.197305  0.855326
     min   -1.522204  0.101514
     25%   -1.329434  0.389880
     50%   -1.136664  0.678247
     75%    0.662536  1.231431
     max    2.461735  1.784615
b    count  2.000000  2.000000
     mean   1.066035 -0.540058
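
Because `apply` only requires that the per-group results can be concatenated, the function may return a different number of rows for each group. A common use is selecting the top rows of each group. The sketch below uses made-up data, and `top` is a hypothetical helper, not part of pandas.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({'key': ['a', 'a', 'a', 'b', 'b'],
                    'value': [5.0, 1.0, 3.0, 2.0, 4.0]})

def top(group, n=2):
    """Return the n rows of a group with the largest 'value'."""
    return group.nlargest(n, 'value')

# Selecting [['value']] keeps the key column out of the pieces passed to top;
# extra keyword arguments to apply are forwarded to the function.
result = toy.groupby('key')[['value']].apply(top, n=2)

print(result)
```

The pieces are glued together with the group key as an outer index level, and the original row labels are kept as the inner level.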


In [33]:
frame = pd.DataFrame({'data1' : np.random.randn(1000),
                      'data2' : np.random.randn(1000)})
factor = pd.cut(frame.data1, 4)
factor[:10]

0    (-1.971, -0.406]
1      (-0.406, 1.16]
2    (-1.971, -0.406]
3      (-0.406, 1.16]
4       (1.16, 2.725]
5    (-1.971, -0.406]
6       (1.16, 2.725]
7      (-0.406, 1.16]
8      (-0.406, 1.16]
9    (-1.971, -0.406]
Name: data1, dtype: category
Categories (4, object): [(-3.543, -1.971] < (-1.971, -0.406] < (-0.406, 1.16] < (1.16, 2.725]]

In [34]:
def get_stats(group):
    return {'min' : group.min(), 'max': group.max(),
            'count': group.count(), 'mean': group.mean()}
grouped = frame.data2.groupby(factor)

grouped.apply(get_stats).unstack()

                  count       max      mean       min
data1
(-3.543, -1.971]   32.0  1.615994 -0.132134 -2.245312
(-1.971, -0.406]  297.0  2.821692 -0.060767 -3.030239
(-0.406, 1.16]    547.0  2.866991 -0.053595 -2.620152
(1.16, 2.725]     124.0  2.339714  0.123519 -2.404605


In [35]:
grouping = pd.qcut(frame.data1, 10, labels=False)

grouped = frame.data2.groupby(grouping)

grouped.apply(get_stats).unstack()

       count       max      mean       min
data1
0      100.0  2.380039  0.006999 -2.530172
1      100.0  2.787603 -0.027541 -3.030239
2      100.0  2.821692 -0.099315 -2.669539
3      100.0  2.542282 -0.149061 -2.203485
4      100.0  2.122418 -0.006729 -2.620152
5      100.0  2.769319 -0.145556 -2.616169
6      100.0  1.745859 -0.105678 -2.268189
7      100.0  2.633678  0.060648 -1.926117
8      100.0  2.866991 -0.023449 -2.008490
9      100.0  2.339714  0.126918 -2.404605


## Examples

In [36]:
df = pd.DataFrame({'level': ['c', 'c', 'c', 'b', 'c', 'b', 'b', 'c'],
                   'value': np.random.randn(8),
                   'w': np.random.randn(8)})
df

  level     value         w
0     c  0.661200  1.002428
1     c -0.424406  0.203779
2     c -0.397619  0.823286
3     b  0.611042 -0.468760
4     c -0.177115 -0.638576
5     b  1.250478  0.481262
6     b -0.557449  0.201949
7     c -0.089000  0.154958


In [37]:
weighted = lambda g: np.average(g['value'], weights = g['w'])
grouped = df.groupby('level')
grouped.apply(weighted)

level
b    0.945664
c    0.225294
dtype: float64