### Data Aggregation and Group Operations

Categorizing a dataset and applying a function to each group, whether an aggregation or transformation, is often a critical component of a data analysis workflow. After loading, merging, and preparing a dataset, you may need to compute group statistics or possibly pivot tables for reporting or visualization purposes.

In [2]:
import pandas as pd
import numpy as np

In [3]:
df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
                  'key2' : ['one', 'two', 'one', 'two', 'one'],
                  'data1' : np.random.randn(5),
                  'data2' : np.random.randn(5)})


In [4]:
df

      data1     data2 key1 key2
0  2.704800 -1.633017    a  one
1  0.860210  0.251227    a  two
2 -2.514772 -0.512034    b  one
3 -0.650600 -0.279363    b  two
4 -0.759430  1.144311    a  one


In [5]:
# Suppose you want to get the mean of data1 using labels from key1:

grouped = df['data1'].groupby(df['key1'])
grouped

<pandas.core.groupby.SeriesGroupBy object at 0x11246f320>

In [6]:
# 'grouped' is now a GroupBy object; nothing has been computed yet.
# Calculate the mean of data1 for each key1 label:

grouped.mean()




key1
a    0.935194
b   -1.582686
Name: data1, dtype: float64
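
To make the mechanics concrete, here is a minimal sketch of what grouped.mean() computes, done once by hand and once with groupby. The fixed data1 values are hypothetical stand-ins for the random numbers above, chosen so the means can be checked by eye.

```python
import pandas as pd

# Hypothetical fixed values standing in for the random data above
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'data1': [1.0, 2.0, 3.0, 5.0, 6.0]})

# Split: a boolean mask selects the rows for each label.
# Apply: take the mean of each piece.
manual = {label: df.loc[df['key1'] == label, 'data1'].mean()
          for label in df['key1'].unique()}

# groupby performs the split, apply, and combine steps in one call
grouped_mean = df['data1'].groupby(df['key1']).mean()

print(manual)
print(grouped_mean)
```

Both approaches give 3.0 for group a and 4.0 for group b; groupby additionally combines the results into a Series indexed by the key1 labels.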

In [7]:
# Group by two keys (key1 and key2). Because .mean() is called right away,
# 'grouping' holds the resulting Series with a hierarchical index, not a GroupBy object.

grouping = df['data1'].groupby([df['key1'], df['key2']]).mean()


In [8]:
grouping

key1  key2
a     one     0.972685
      two     0.860210
b     one    -2.514772
      two    -0.650600
Name: data1, dtype: float64

In [9]:
# Mean of the four group means (not the overall mean of data1)
grouping.mean()


-0.33311910821129176

In [10]:
# Pivot the inner index level (key2) into columns
grouping.unstack()


key2       one       two
key1
a     0.972685  0.860210
b    -2.514772 -0.650600


In [11]:
states = np.array(['Ohio', 'California', 'California', 'Ohio', 'Ohio'])

years = np.array([2005, 2005, 2006, 2005, 2006])


In [12]:
# The group keys can be any arrays of the right length, not only columns of df
df['data1'].groupby([states, years]).mean()


California  2005    0.860210
            2006   -2.514772
Ohio        2005    1.027100
            2006   -0.759430
Name: data1, dtype: float64

In [13]:
# When the keys are columns of the same DataFrame, pass the column names
df.groupby(['key1', 'key2']).mean()


              data1     data2
key1 key2
a    one   0.972685 -0.244353
     two   0.860210  0.251227
b    one  -2.514772 -0.512034
     two  -0.650600 -0.279363
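
In the call above every non-numeric column is a group key. If you group by key1 alone, key2 is left over as a non-numeric column that cannot be averaged; as far as I know, pandas 2.0 and later raise a TypeError in that case rather than dropping the column silently. A minimal sketch, assuming fixed illustrative values in place of the random data, that restricts the aggregation with numeric_only=True:

```python
import pandas as pd

# Illustrative fixed values in place of the random data
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': [1.0, 2.0, 3.0, 5.0, 6.0],
                   'data2': [10.0, 20.0, 30.0, 50.0, 60.0]})

# key2 is not a grouping key here, so it is a non-numeric leftover column.
# numeric_only=True restricts the aggregation to data1 and data2.
means = df.groupby('key1').mean(numeric_only=True)
print(means)
```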


#### Iterating over Groups

The GroupBy object supports iteration, generating a sequence of 2-tuples containing the group name along with the chunk of data. Consider the following:

In [14]:
for name, group in df.groupby('key1'):
    print(name)
    print(group)
    

a
     data1     data2 key1 key2
0  2.70480 -1.633017    a  one
1  0.86021  0.251227    a  two
4 -0.75943  1.144311    a  one
b
      data1     data2 key1 key2
2 -2.514772 -0.512034    b  one
3 -0.650600 -0.279363    b  two


In [19]:
for (k1, k2), group in df.groupby(['key1', 'key2']):
    print((k1,k2))
    print(group)
    

('a', 'one')
     data1     data2 key1 key2
0  2.70480 -1.633017    a  one
4 -0.75943  1.144311    a  one
('a', 'two')
     data1     data2 key1 key2
1  0.86021  0.251227    a  two
('b', 'one')
      data1     data2 key1 key2
2 -2.514772 -0.512034    b  one
('b', 'two')
    data1     data2 key1 key2
3 -0.6506 -0.279363    b  two
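
A recipe you may find useful is collecting the pieces into a dict keyed by group name. A small sketch, using fixed illustrative values rather than the random data above (the name pieces is just a placeholder):

```python
import pandas as pd

# Illustrative fixed values in place of the random data
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': [1.0, 2.0, 3.0, 5.0, 6.0]})

# Each iteration yields (group name, sub-DataFrame); store them by name
pieces = {name: group for name, group in df.groupby('key1')}

# Look up the rows belonging to group 'b'
print(pieces['b'])
```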


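By default groupby groups on axis=0 (the rows), but you can also group on axis=1 (the columns). For example, we could group the columns of our example df by dtype like so. Note that the axis=1 argument to groupby is deprecated as of pandas 2.1, so the calls below that use it may warn or fail on newer versions.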

In [15]:
df.dtypes

data1    float64
data2    float64
key1      object
key2      object
dtype: object

In [16]:
grouped = df.groupby(df.dtypes, axis=1)

In [17]:
for dtype, group in grouped:
    print(dtype)
    print(group)

float64
      data1     data2
0  2.704800 -1.633017
1  0.860210  0.251227
2 -2.514772 -0.512034
3 -0.650600 -0.279363
4 -0.759430  1.144311
object
  key1 key2
0    a  one
1    a  two
2    b  one
3    b  two
4    a  one
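
One possible way to get the same split without passing axis=1 is to group the column labels by the name of their dtype, then select each set of columns. This is a sketch on illustrative fixed values, not the only way to do it; pieces and dtype_names are placeholder names.

```python
import pandas as pd

# Illustrative fixed values in place of the random data
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': [1.0, 2.0, 3.0, 5.0, 6.0],
                   'data2': [10.0, 20.0, 30.0, 50.0, 60.0]})

# Series mapping each column label to the string name of its dtype
dtype_names = df.dtypes.astype(str)

# Group the column labels by dtype name, then select those columns from df
pieces = {name: df[list(cols.index)]
          for name, cols in dtype_names.groupby(dtype_names)}

for name, piece in pieces.items():
    print(name)
    print(piece)
```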


#### Selecting a Column or Subset of Columns

In [20]:
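# Indexing a GroupBy object with a column name selects that column for aggregation.
# Only the last expression in the cell is displayed.
df.groupby('key1')['data1']
df.groupby('key2')['data2']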

<pandas.core.groupby.SeriesGroupBy object at 0x112494a90>
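
Indexing a GroupBy object created from a DataFrame with a column name, or a list of column names, selects those columns for aggregation. The sketch below, on illustrative fixed values, checks that this is shorthand for grouping the column directly, and that a list of names yields a DataFrame while a single name yields a Series:

```python
import pandas as pd

# Illustrative fixed values in place of the random data
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'data1': [1.0, 2.0, 3.0, 5.0, 6.0],
                   'data2': [10.0, 20.0, 30.0, 50.0, 60.0]})

# Shorthand: select the column on the GroupBy object
sugar = df.groupby('key1')['data1'].mean()

# Explicit form: group the column by the key Series
explicit = df['data1'].groupby(df['key1']).mean()

# A list of names keeps the result two-dimensional
as_frame = df.groupby('key1')[['data1']].mean()

print(sugar.equals(explicit))
print(type(sugar).__name__, type(as_frame).__name__)
```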

In [21]:
df.groupby(['key1','key2'])[['data2']].mean()


              data2
key1 key2
a    one  -0.244353
     two   0.251227
b    one  -0.512034
     two  -0.279363


#### Grouping with Dicts and Series

Grouping information may exist in a form other than an array. Let’s consider another example DataFrame:

In [27]:
people = pd.DataFrame(np.random.randn(5,5),
                     columns=['a', 'b', 'c', 'd', 'e'],
                     index=['Joe', 'Steve', 'Wes', 'Tom', 'Travis'])


In [28]:
people.iloc[2:3, [1,2]] = np.nan # Add a few NaN values
people

               a         b         c         d         e
Joe     0.407359 -0.324796 -0.380217 -0.491808  1.053329
Steve  -2.520957  1.256412  0.731514  0.183016 -1.283439
Wes     0.868742       NaN       NaN -0.072452 -0.818321
Tom    -2.031554 -0.992746  1.237881  0.814391 -2.149084
Travis -1.200744 -0.455864 -0.958750 -0.491312  0.041602


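Now, suppose I have a group correspondence for the columns and want to sum together the columns by group. The mapping includes an extra key 'f' that matches no column, to show that unused grouping keys are harmless: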

In [40]:
mapping = {'a': 'red', 'b' : 'red', 'c' : 'blue', 'd' : 'blue',
           'e' : 'red', 'f' : 'purple'}


In [41]:
by_column = people.groupby(mapping, axis=1)

by_column.sum()


            blue       red
Joe    -0.872025  1.135893
Steve   0.914530 -2.547984
Wes    -0.072452  0.050421
Tom     2.052272 -5.173384
Travis -1.450062 -1.615007
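
If your pandas version rejects axis=1, one workaround (a sketch, on a small hand-written frame rather than the random people data) is to transpose, group the row labels through the mapping, and transpose back:

```python
import numpy as np
import pandas as pd

# Small hand-written frame; values are illustrative, not the random data above
people = pd.DataFrame([[1.0, 2.0, 3.0, 4.0, 5.0],
                       [1.0, np.nan, np.nan, 4.0, 5.0]],
                      columns=['a', 'b', 'c', 'd', 'e'],
                      index=['Joe', 'Wes'])

mapping = {'a': 'red', 'b': 'red', 'c': 'blue', 'd': 'blue',
           'e': 'red', 'f': 'purple'}

# After transposing, the column labels a..e form the row index,
# so the dict maps each row label to its group. NaN values are skipped by sum().
by_color = people.T.groupby(mapping).sum().T
print(by_color)
```

The unused key 'f' produces no 'purple' column, and the missing values for Wes are skipped in the sums.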


The same functionality holds for Series, which can be viewed as a fixed-size mapping:

In [42]:
map_series = pd.Series(mapping)
map_series

a       red
b       red
c      blue
d      blue
e       red
f    purple
dtype: object

In [43]:
people.groupby(map_series, axis=1).count()


        blue  red
Joe        2    3
Steve      2    3
Wes        1    2
Tom        2    3
Travis     2    3


#### Grouping by Index Levels

A final convenience for hierarchically indexed datasets is the ability to aggregate using one of the levels of an axis index. Let’s look at an example:

In [44]:
columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],
                                    [1,3,5,1,3]],
                                   names = ['cty', 'tenor'])

hier_df = pd.DataFrame(np.random.randn(4,5), columns=columns)

In [45]:
hier_df

cty          US                            JP
tenor         1         3         5         1         3
0      0.100166  0.547257  0.744520 -0.727224  0.438700
1     -0.884902 -0.129561  0.360317  0.007564  0.031321
2      0.723342  0.106050  0.509529 -0.075508 -0.689487
3     -0.624642 -0.969405 -0.250108  0.180381  0.486756


To group by level, pass the level number or name using the level keyword:

In [46]:
hier_df.groupby(level='cty', axis=1).count()


cty  JP  US
0     2   3
1     2   3
2     2   3
3     2   3
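
The level keyword accepts either form. The following sketch checks this on a deterministic version of hier_df, transposed so that the hierarchical index sits on the rows; the values from np.arange are placeholders for the random data:

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],
                                     [1, 3, 5, 1, 3]],
                                    names=['cty', 'tenor'])

# Placeholder values 0.0 .. 19.0 instead of random numbers
hier_df = pd.DataFrame(np.arange(20.0).reshape(4, 5), columns=columns)

# Transpose so that (cty, tenor) becomes the row index
by_rows = hier_df.T

# Group by the outer level, once by name and once by position
by_name = by_rows.groupby(level='cty').count()
by_number = by_rows.groupby(level=0).count()

print(by_name)
print(by_name.equals(by_number))
```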


### Data Aggregation 

Aggregations refer to any data transformation that produces scalar values from arrays. The preceding examples have used several of them, including mean, count, and sum. You may wonder what is going on when you invoke mean() on a GroupBy object. Many common aggregations, such as those found in Table 10-1, have optimized implementations. However, you are not limited to only this set of methods.
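
To use your own aggregation, you can pass any function that reduces an array to a scalar to the agg method. A minimal sketch, where peak_to_peak is a hypothetical helper and the data values are illustrative:

```python
import pandas as pd

# Illustrative fixed values in place of the random data
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'data1': [1.0, 2.0, 3.0, 5.0, 6.0]})

def peak_to_peak(arr):
    """Return the range of the values in one group (max minus min)."""
    return arr.max() - arr.min()

# agg calls peak_to_peak once per group and collects the scalars
spread = df.groupby('key1')['data1'].agg(peak_to_peak)
print(spread)
```

Custom functions like this are generally slower than the optimized built-in methods, because a Python function has to be called for each group.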