# Chapter 10. Data Aggregation and Group Operations

### Necessary imports

In [2]:
import pandas as pd
import numpy as np

Categorizing a dataset and applying a function to each group, whether an aggregation or a transformation, is often a critical component of a data analysis workflow. You may need to compute group statistics or possibly pivot tables.

One reason for the popularity of relational databases and SQL is the ease with which data can be joined, filtered, transformed and aggregated. However, query languages like SQL are somewhat constrained in the kinds of group operations that can be performed. In this chapter we will learn to:

* Split a pandas object into pieces using one or more keys (in the form of functions, arrays or DataFrame column names)
* Calculate group summary statistics, like count, mean, or standard deviation, or a user-defined function
* Apply within-group transformations or other manipulations, like normalization, linear regression, rank or subset selection
* Compute pivot tables and cross-tabulations
* Perform quantile analysis and other statistical group analyses

## GroupBy Mechanics

[Hadley Wickham](https://en.wikipedia.org/wiki/Hadley_Wickham) coined the term *split-apply-combine* for describing group operations. In the first stage, data contained in a pandas object, whether a Series, DataFrame, or otherwise, is *split* into groups based on one or more *keys* that you provide. The splitting is performed on a particular axis of an object. For example, a DataFrame can be grouped on its rows (axis=0) or its columns (axis=1). Once this is done, a function is *applied* to each group, producing a new value. Finally, the results of all those function applications are *combined* into a result object. The form of the resulting object will usually depend on what's being done to the data.
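
To make the three stages concrete, here is a minimal sketch that carries out split-apply-combine by hand and compares the result with groupby. The data and the names `values` and `keys` are made up for illustration and are not part of the chapter's datasets.

```python
import pandas as pd

# Illustrative data (assumed for this sketch)
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
keys = pd.Series(['a', 'b', 'a', 'b', 'a'])

# Split: one sub-Series per distinct key
pieces = {key: values[keys == key] for key in sorted(keys.unique())}

# Apply: reduce each piece to a single value
applied = {key: piece.sum() for key, piece in pieces.items()}

# Combine: assemble the per-group results into one object
manual = pd.Series(applied)

# groupby performs the same three stages in one call
with_groupby = values.groupby(keys).sum()
print(manual)
print(with_groupby)
```

Both objects hold one sum per key; the groupby version simply hides the intermediate pieces.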

In [3]:
np.random.seed(2)
df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
                   'key2' : ['one', 'two', 'one', 'two', 'one'],
                   'data1' : np.random.randn(5),
                   'data2' : np.random.randn(5)})
df

  key1 key2     data1     data2
0    a  one -0.416758 -0.841747
1    a  two -0.056267  0.502881
2    b  one -2.136196 -1.245288
3    b  two  1.640271 -1.057952
4    a  one -1.793436 -0.909008


Suppose you wanted to compute the mean of the data1 column using the labels from key1. There are a number of ways to do this. One is to access data1 and call groupby with the column (a Series) at key1:

In [4]:
grouped = df['data1'].groupby(df['key1'])
grouped

<pandas.core.groupby.generic.SeriesGroupBy object at 0x0000025213C8F250>

In [5]:
grouped.mean()

key1
a   -0.755487
b   -0.247963
Name: data1, dtype: float64

In [6]:
means = df['data1'].groupby([df['key1'], df['key2']]).mean()
means

key1  key2
a     one    -1.105097
      two    -0.056267
b     one    -2.136196
      two     1.640271
Name: data1, dtype: float64

Here we grouped the data using two keys, and the resulting Series has a hierarchical index consisting of the unique pairs of keys observed:

In [7]:
means.unstack()

key2       one       two
key1
a    -1.105097 -0.056267
b    -2.136196  1.640271


The group keys so far have all been Series, but they can be any arrays of the right length:

In [8]:
states = np.array(['Ohio', 'California', 'California', 'Ohio', 'Ohio'])
years = np.array([2005, 2005, 2006, 2005, 2006])
df['data1'].groupby([states, years]).mean()

California  2005   -0.056267
            2006   -2.136196
Ohio        2005    0.611756
            2006   -1.793436
Name: data1, dtype: float64

Frequently, the grouping information is found in the same DataFrame as the data you want to work on. In that case, you can pass column names (whether those are strings, numbers or other Python objects) as the group keys:

In [9]:
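# numeric_only=True skips the non-numeric key2 column, which newer pandas no longer drops silently
df.groupby('key1').mean(numeric_only=True)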

         data1     data2
key1
a    -0.755487 -0.415958
b    -0.247963 -1.151620


In [10]:
df.groupby(['key1', 'key2']).mean()

              data1     data2
key1 key2
a    one  -1.105097 -0.875377
     two  -0.056267  0.502881
b    one  -2.136196 -1.245288
     two   1.640271 -1.057952


### Iterating over groups

The GroupBy object supports iteration, generating a sequence of 2-tuples containing the group name along with the chunk of data. Consider the following:

In [11]:
for name, group in df.groupby('key1'):
    print(name)
    print(group)

a
  key1 key2     data1     data2
0    a  one -0.416758 -0.841747
1    a  two -0.056267  0.502881
4    a  one -1.793436 -0.909008
b
  key1 key2     data1     data2
2    b  one -2.136196 -1.245288
3    b  two  1.640271 -1.057952


In the case of multiple keys, the first element in the tuple will be a tuple of key values:

In [12]:
for (k1, k2), group in df.groupby(['key1', 'key2']):
    print((k1, k2))
    print(group)

('a', 'one')
  key1 key2     data1     data2
0    a  one -0.416758 -0.841747
4    a  one -1.793436 -0.909008
('a', 'two')
  key1 key2     data1     data2
1    a  two -0.056267  0.502881
('b', 'one')
  key1 key2     data1     data2
2    b  one -2.136196 -1.245288
('b', 'two')
  key1 key2     data1     data2
3    b  two  1.640271 -1.057952


Of course, you can choose to do whatever you want with the pieces of data. A recipe you may find useful is computing a dict of the data pieces as a one-liner:

In [13]:
pieces = dict(list(df.groupby('key1')))
pieces['b']

  key1 key2     data1     data2
2    b  one -2.136196 -1.245288
3    b  two  1.640271 -1.057952


By default, groupby groups on axis=0, but you can also group on the other axis. For example, we could group the columns of our example df by *dtype* like so:

In [14]:
df.dtypes

key1      object
key2      object
data1    float64
data2    float64
dtype: object

In [15]:
grouped = df.groupby(df.dtypes, axis = 1)
for dtype, group in grouped:
    print(dtype)
    print(group)

float64
      data1     data2
0 -0.416758 -0.841747
1 -0.056267  0.502881
2 -2.136196 -1.245288
3  1.640271 -1.057952
4 -1.793436 -0.909008
object
  key1 key2
0    a  one
1    a  two
2    b  one
3    b  two
4    a  one
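
Recent pandas releases deprecate the axis=1 argument to groupby. One possible alternative, sketched here with a small made-up frame (`demo` and `columns_by_dtype` are assumed names), is to group the column labels by dtype and then select each block of columns:

```python
import pandas as pd

demo = pd.DataFrame({'key': ['a', 'b', 'a'],
                     'x': [1.0, 2.0, 3.0],
                     'y': [4.0, 5.0, 6.0]})

# Series mapping each column label to the name of its dtype
dtype_names = demo.dtypes.astype(str)

# Group the column labels by dtype name, then select each block of columns
columns_by_dtype = {}
for dtype_name, members in dtype_names.groupby(dtype_names):
    columns_by_dtype[dtype_name] = demo[list(members.index)]

print(columns_by_dtype['float64'])
```

This works on the small dtypes Series rather than the data itself, so nothing is copied until a block of columns is selected.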


### Selecting a Column or Subset of Columns

Indexing a GroupBy object created from a DataFrame with a column name or array of column names has the effect of column subsetting for aggregation. This means that:

In [16]:
df.groupby('key1')['data1']
df.groupby('key1')[['data2']]

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000025215D7E3A0>

are syntactic sugar for:

In [17]:
df['data1'].groupby(df['key1'])
df[['data2']].groupby(df['key1'])

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000025215D7E910>

Especially for large datasets, it may be desirable to aggregate only a few columns. For example, in the preceding dataset, to compute means for just the data2 column and get the result as a DataFrame, we could write:

In [18]:
df.groupby(['key1', 'key2'])[['data2']].mean()

              data2
key1 key2
a    one  -0.875377
     two   0.502881
b    one  -1.245288
     two  -1.057952


The object returned by this indexing operation is a grouped DataFrame if a list or array is passed, or a grouped Series if only a single column name is passed as a scalar:

In [19]:
s_grouped = df.groupby(['key1', 'key2'])['data2']
s_grouped.mean()

key1  key2
a     one    -0.875377
      two     0.502881
b     one    -1.245288
      two    -1.057952
Name: data2, dtype: float64
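
As a quick check of the two points in this section, the sketch below uses a small made-up frame (`demo` is an assumed name). It confirms that selecting a column on the GroupBy object gives the same means as selecting the column first, and that a scalar name yields a Series while a list yields a DataFrame.

```python
import pandas as pd

demo = pd.DataFrame({'key1': ['a', 'a', 'b', 'b'],
                     'data1': [1.0, 3.0, 5.0, 9.0],
                     'data2': [2.0, 4.0, 6.0, 8.0]})

# Select the column on the GroupBy object ...
sugar = demo.groupby('key1')['data1'].mean()
# ... or select it first and group by the key column
explicit = demo['data1'].groupby(demo['key1']).mean()
print(sugar.equals(explicit))

# A scalar column name yields a Series; a list yields a DataFrame
as_series = demo.groupby('key1')['data2'].mean()
as_frame = demo.groupby('key1')[['data2']].mean()
print(type(as_series).__name__, type(as_frame).__name__)
```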

### Grouping with Dicts and Series

Grouping information may exist in a form other than an array. Let's consider another example DataFrame:

In [20]:
people = pd.DataFrame(np.random.randn(5, 5),
                      columns=['a', 'b', 'c', 'd', 'e'],
                      index = ['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
people.iloc[2:3, [1, 2]] = np.nan
people

               a         b         c         d         e
Joe     0.551454  2.292208  0.041539 -1.117925  0.539058
Steve  -0.596160 -0.019130  1.175001 -0.747871  0.009025
Wes    -0.878108       NaN       NaN -0.988779 -0.338822
Jim    -0.236184 -0.637655 -1.187612 -1.421217 -0.153495
Travis -0.269057  2.231367 -2.434768  0.112727  0.370445


Now suppose I have a group correspondence for the columns and want to sum the columns by group:

In [21]:
mapping = {'a' : 'red', 'b' : 'red', 'c' : 'blue',
           'd' : 'blue', 'e' : 'red', 'f' : 'orange'}

Now, you could construct an array from this dict to pass to groupby, but instead we can just pass the dict (I included the key 'f' to highlight that unused grouping keys are OK):

In [22]:
by_column = people.groupby(mapping, axis=1)
by_column.sum()

            blue       red
Joe    -1.076386  3.382720
Steve   0.427130 -0.606265
Wes    -0.988779 -1.216930
Jim    -2.608830 -1.027334
Travis -2.322041  2.332754
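
The section title also mentions Series. A Series can act as the mapping in the same way as a dict, with its index holding the labels and its values the group names. The sketch below uses a small made-up frame (`scores` and `color_of` are assumed names) and transposes before grouping, which avoids the axis=1 argument that recent pandas releases deprecate.

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame([[1.0, 2.0, 3.0, 4.0],
                       [5.0, np.nan, 7.0, 8.0]],
                      index=['Joe', 'Wes'],
                      columns=['a', 'b', 'c', 'd'])

# Series mapping column labels to group names; 'z' matches no column
color_of = pd.Series({'a': 'red', 'b': 'red', 'c': 'blue',
                      'd': 'blue', 'z': 'orange'})

# Transpose so the columns become rows, group, then transpose back
by_color = scores.T.groupby(color_of).sum().T
print(by_color)
```

As with the dict, the unused label 'z' is ignored, and the missing value is skipped by sum.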


### Grouping with functions

Using Python functions is a more generic way of defining a group mapping compared with a dict or Series. Any function passed as a group key will be called once per index value, with the return values being used as the group names. For example, to group the people DataFrame by the length of the names in its index, pass the len function:

In [23]:
people.groupby(len).sum()

          a         b         c         d         e
3 -0.562838  1.654553 -1.146073 -3.527922  0.046741
5 -0.596160 -0.019130  1.175001 -0.747871  0.009025
6 -0.269057  2.231367 -2.434768  0.112727  0.370445
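
Any callable that accepts a single index label can serve as the key, not only built-ins such as len. As a small sketch with made-up data (`sales` is an assumed name), the lambda below groups the rows by the first letter of each index label:

```python
import pandas as pd

sales = pd.Series([10, 20, 30, 40],
                  index=['apple', 'avocado', 'banana', 'blueberry'])

# The function receives each index label and returns its group name
by_initial = sales.groupby(lambda label: label[0]).sum()
print(by_initial)
```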


Mixing functions with arrays, dicts, or Series is not a problem as everything gets converted to arrays internally:

In [24]:
key_list = ['one', 'one', 'one', 'two', 'two']
people.groupby([len, key_list]).min()

              a         b         c         d         e
3 one -0.878108  2.292208  0.041539 -1.117925 -0.338822
  two -0.236184 -0.637655 -1.187612 -1.421217 -0.153495
5 one -0.596160 -0.019130  1.175001 -0.747871  0.009025
6 two -0.269057  2.231367 -2.434768  0.112727  0.370445


### Grouping by Index Levels

A final convenience for hierarchically indexed datasets is the ability to aggregate using one of the levels of an axis index. Let's look at an example:

In [25]:
columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],
                                     [1, 3, 5, 1, 3]],
                                     names=['cty', 'tenor'])
hier_df = pd.DataFrame(np.random.randn(4, 5), columns=columns)
hier_df

cty          US                            JP
tenor         1         3         5         1         3
0      1.359634  0.501857 -0.844214  0.000010  0.542353
1     -0.313508  0.771012 -1.868091  1.731185  1.467678
2     -0.335677  0.611341  0.047971 -0.829135  0.087710
3      1.000366 -0.381093 -0.375669 -0.074471  0.433496
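
To group by an index level, pass the level number or name through groupby's level keyword. The sketch below uses a small made-up frame with the same two column levels (`demo` is an assumed name) and counts the columns per country. It transposes first so that the column levels become row levels, which sidesteps the axis=1 argument that recent pandas releases deprecate.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],
                                     [1, 3, 5, 1, 3]],
                                    names=['cty', 'tenor'])
demo = pd.DataFrame(np.arange(10.0).reshape(2, 5), columns=columns)

# After transposing, 'cty' is a row-index level that groupby can use
counts = demo.T.groupby(level='cty').count().T
print(counts)
```

The result has one column per value of the cty level, with each entry giving the number of non-missing columns in that group.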


## Data Aggregation

Aggregations refer to any data transformation that produces scalar values from arrays. The preceding examples have used several of them, including mean, count, min and sum. You may wonder what is going on when you invoke *mean* on a GroupBy object. Many common aggregations have optimized implementations. However, you are not limited to only this set of methods.

You can use aggregations of your own devising and additionally call any method that is also defined on the grouped object. For example, you might recall that quantile computes sample quantiles of a Series or a DataFrame's columns.

While quantile is not explicitly implemented for GroupBy, it is a Series method and thus available for use.

In [26]:
df

  key1 key2     data1     data2
0    a  one -0.416758 -0.841747
1    a  two -0.056267  0.502881
2    b  one -2.136196 -1.245288
3    b  two  1.640271 -1.057952
4    a  one -1.793436 -0.909008


In [27]:
grouped = df.groupby('key1')
grouped['data1'].quantile(0.9)

key1
a   -0.128365
b    1.262624
Name: data1, dtype: float64

To use your own aggregation functions, pass any function that aggregates an array to the aggregate or agg method:

In [28]:
def peak_to_peak(arr):
    return arr.max() - arr.min()

# Select the numeric columns; subtraction is undefined for the strings in key2
grouped[['data1', 'data2']].agg(peak_to_peak)

         data1     data2
key1
a     1.737169  1.411889
b     3.776467  0.187336
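
A custom function passed to agg is called on each group's data in turn, so it is generally slower than the optimized built-in methods. When a custom aggregation can be expressed with built-ins, the two routes should agree, as this sketch on made-up data suggests (`demo` and `grouped_value` are assumed names):

```python
import pandas as pd

demo = pd.DataFrame({'key': ['a', 'a', 'b', 'b'],
                     'value': [1.0, 4.0, 10.0, 12.0]})
grouped_value = demo.groupby('key')['value']

def peak_to_peak(arr):
    return arr.max() - arr.min()

# Custom function applied group by group
custom = grouped_value.agg(peak_to_peak)
# Same result built from the optimized max and min methods
builtin = grouped_value.max() - grouped_value.min()
print(custom.equals(builtin))
```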


### Column-wise and multiple function application

Let's return to the tipping dataset used in the last chapter. After loading it with *read_csv*, we add a tipping percentage column *tip_pct*:

In [31]:
tips = pd.read_csv('examples/tips.csv')
# Add tip percentage of total bill
tips['tip_pct'] = tips['tip'] / tips['total_bill']
tips[:6]

   total_bill   tip smoker  day    time  size   tip_pct
0       16.99  1.01     No  Sun  Dinner     2  0.059447
1       10.34  1.66     No  Sun  Dinner     3  0.160542
2       21.01  3.50     No  Sun  Dinner     3  0.166587
3       23.68  3.31     No  Sun  Dinner     2  0.139780
4       24.59  3.61     No  Sun  Dinner     4  0.146808
5       25.29  4.71     No  Sun  Dinner     4  0.186240
