# Chapter 10. Data Aggregation and Group Operations

### Necessary imports

In [None]:
import pandas as pd
import numpy as np

Categorizing a dataset and applying a function to each group, whether an aggregation or a transformation, is often a critical component of a data analysis workflow. After preparing a dataset, you may need to compute group statistics or possibly pivot tables.

One reason for the popularity of relational databases and SQL is the ease with which data can be joined, filtered, transformed and aggregated. However, query languages like SQL are somewhat constrained in the kinds of group operations that can be performed. In this chapter we will learn to:

* Split a pandas object into pieces using one or more keys (in the form of functions, arrays or DataFrame column names)
* Calculate group summary statistics, like count, mean, or standard deviation, or a user-defined function
* Apply within-group transformations or other manipulations, like normalization, linear regression, rank or subset selection
* Compute pivot tables and cross-tabulations
* Perform quantile analysis and other statistical group analyses

## GroupBy Mechanics

[Hadley Wickham](https://en.wikipedia.org/wiki/Hadley_Wickham) coined the term *split-apply-combine* for describing group operations. In the first stage, data contained in a pandas object, whether a Series, DataFrame, or otherwise, is *split* into groups based on one or more *keys* that you provide. The splitting is performed on a particular axis of an object. For example, a DataFrame can be grouped on its rows (axis=0) or its columns (axis=1). Once this is done, a function is *applied* to each group, producing a new value. Finally, the results of all those function applications are *combined* into a result object. The form of the resulting object will usually depend on what's being done to the data.
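To make the three stages concrete, here is a minimal sketch that performs split, apply and combine by hand and then compares the result with groupby. The `toy` frame and its values are hypothetical illustration data, not the chapter's `df`, and the manual loop is only meant to show the idea, not how pandas implements it.

```python
import pandas as pd

# Hypothetical illustration data (not the chapter's df)
toy = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'a'],
                    'value': [1.0, 2.0, 3.0, 4.0, 5.0]})

# Split: collect the values that belong to each key
pieces = {}
for key, value in zip(toy['key'], toy['value']):
    pieces.setdefault(key, []).append(value)

# Apply: reduce each piece to a single number (here, the mean)
applied = {key: sum(values) / len(values) for key, values in pieces.items()}

# Combine: assemble the per-group results into one Series
manual = pd.Series(applied, name='value').sort_index()

# The same three stages, performed by groupby in one expression
builtin = toy.groupby('key')['value'].mean()

print(manual)
print(builtin)
```

Both Series should hold the same per-group means; groupby additionally labels the index with the name of the key column.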

In [None]:
np.random.seed(2)
df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
                   'key2' : ['one', 'two', 'one', 'two', 'one'],
                   'data1' : np.random.randn(5),
                   'data2' : np.random.randn(5)})
df

Suppose you wanted to compute the mean of the data1 column using the labels from key1. There are a number of ways to do this. One is to access data1 and call groupby with the column (a Series) at key1:

In [None]:
grouped = df['data1'].groupby(df['key1'])
grouped

In [None]:
grouped.mean()

In [None]:
means = df['data1'].groupby([df['key1'], df['key2']]).mean()
means

Here we grouped the data using two keys, and the resulting Series has a hierarchical index.

In [None]:
means.unstack()

In [None]:
states = np.array(['Ohio', 'California', 'California', 'Ohio', 'Ohio'])
years = np.array([2005, 2005, 2006, 2005, 2006])
df['data1'].groupby([states, years]).mean()

Frequently, the grouping information is found in the same DataFrame as the data you want to work on. In that case, you can pass column names (whether those are strings, numbers or other Python objects) as the group keys:

In [None]:
# key2 holds strings, so restrict the mean to the numeric columns
df.groupby('key1').mean(numeric_only=True)

In [None]:
df.groupby(['key1', 'key2']).mean()

### Iterating over groups

The GroupBy object supports iteration, generating a sequence of 2-tuples containing the group name along with the chunk of data. Consider the following:

In [None]:
for name, group in df.groupby('key1'):
    print(name)
    print(group)

In the case of multiple keys, the first element in the tuple will be a tuple of key values:

In [None]:
for (k1, k2), group in df.groupby(['key1', 'key2']):
    print((k1, k2))
    print(group)

Of course, you can choose to do whatever you want with the pieces of data. A recipe you may find useful is computing a dict of the data pieces as a one-liner:

In [None]:
pieces = dict(list(df.groupby('key1')))
pieces['b']

By default, groupby groups on axis=0, but you can group on any of the other axes. For example, we could group the columns of our example df here by *dtype* like so: 

In [None]:
df.dtypes

In [None]:
# Note: the axis=1 argument to groupby is deprecated in recent pandas releases
grouped = df.groupby(df.dtypes, axis=1)
for dtype, group in grouped:
    print(dtype)
    print(group)
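Because grouping along axis=1 may not be available in every pandas version, the following is a minimal sketch of the same idea that avoids the axis argument: it builds the dtype-to-columns mapping by hand and selects each block of columns. The `demo` frame and the `columns_by_dtype` name are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical illustration data with one string column and two float columns
demo = pd.DataFrame({'key1': ['a', 'a', 'b'],
                     'data1': np.arange(3.0),
                     'data2': np.arange(3.0) * 2})

# Map each dtype (as a string) to the column labels that carry it
columns_by_dtype = {}
for column, dtype in demo.dtypes.items():
    columns_by_dtype.setdefault(str(dtype), []).append(column)

# Select each group of columns from the original frame
for dtype, cols in columns_by_dtype.items():
    print(dtype)
    print(demo[cols])
```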

### Selecting a Column or Subset of Columns

Indexing a GroupBy object created from a DataFrame with a column name or array of column names has the effect of column subsetting for aggregation. This means that:

In [None]:
df.groupby('key1')['data1']
df.groupby('key1')[['data2']]

are syntactic sugar for:

In [None]:
df['data1'].groupby(df['key1'])
df[['data2']].groupby(df['key1'])
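One way to check this equivalence is to aggregate both forms and compare the results. The sketch below does so on a small hypothetical `demo` frame (not the chapter's `df`), using the mean as an arbitrary choice of aggregation.

```python
import pandas as pd

# Hypothetical illustration data
demo = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                     'data1': [1.0, 2.0, 3.0, 4.0, 5.0],
                     'data2': [10.0, 20.0, 30.0, 40.0, 50.0]})

# Indexing the GroupBy object with a scalar versus grouping the column directly
sugar_series = demo.groupby('key1')['data1'].mean()
plain_series = demo['data1'].groupby(demo['key1']).mean()

# Indexing with a list versus grouping the one-column DataFrame directly
sugar_frame = demo.groupby('key1')[['data2']].mean()
plain_frame = demo[['data2']].groupby(demo['key1']).mean()

print(sugar_series.equals(plain_series))
print(sugar_frame.equals(plain_frame))
```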

Especially for large datasets, it may be desirable to aggregate only a few columns. For example, in the preceding dataset, to compute means for just the data2 column and get the result as a DataFrame, we could write:

In [None]:
df.groupby(['key1', 'key2'])[['data2']].mean()

The object returned by this indexing operation is a grouped DataFrame if a list or array is passed, or a grouped Series if only a single column name is passed as a scalar:

In [None]:
s_grouped = df.groupby(['key1', 'key2'])['data2']
s_grouped.mean()
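The difference between the two forms shows up in the type of the aggregated result. The following sketch, on a hypothetical `demo` frame, is one way to see it.

```python
import pandas as pd

# Hypothetical illustration data
demo = pd.DataFrame({'key1': ['a', 'a', 'b'],
                     'data2': [1.0, 3.0, 5.0]})

# A list of column names yields a grouped DataFrame, so the mean is a DataFrame
frame_result = demo.groupby('key1')[['data2']].mean()

# A scalar column name yields a grouped Series, so the mean is a Series
series_result = demo.groupby('key1')['data2'].mean()

print(type(frame_result).__name__)
print(type(series_result).__name__)
```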

### Grouping with Dicts and Series

Grouping information may exist in a form other than an array. Let's consider another example DataFrame:

In [None]:
people = pd.DataFrame(np.random.randn(5, 5),
                      columns=['a', 'b', 'c', 'd', 'e'],
                      index = ['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
people.iloc[2:3, [1, 2]] = np.nan
people

Now suppose I have a group correspondence for the columns and I want to sum together the columns by group:

In [None]:
mapping = {'a' : 'red', 'b' : 'red', 'c' : 'blue',
           'd' : 'blue', 'e' : 'red', 'f' : 'orange'}

Now, you could construct an array from this dict to pass to groupby, but instead we can just pass the dict (I included the key 'f' to highlight that unused grouping keys are OK):

In [None]:
by_column = people.groupby(mapping, axis=1)
by_column.sum()
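The same functionality holds for a Series, which can be viewed as a fixed-size mapping. The sketch below is one possible illustration on a hypothetical `demo` frame; it transposes the frame so that the column labels become the row index, which avoids relying on the axis argument, and then transposes the result back.

```python
import numpy as np
import pandas as pd

# Hypothetical illustration data with easy-to-check values
demo = pd.DataFrame(np.arange(10.0).reshape(2, 5),
                    columns=['a', 'b', 'c', 'd', 'e'],
                    index=['Joe', 'Steve'])

mapping = {'a': 'red', 'b': 'red', 'c': 'blue',
           'd': 'blue', 'e': 'red', 'f': 'orange'}
map_series = pd.Series(mapping)

# Transpose so the column labels form the index, group, then transpose back
by_dict = demo.T.groupby(mapping).sum().T
by_series = demo.T.groupby(map_series).sum().T

print(by_dict)
print(by_series)
```

In both cases the unused key 'f' is simply ignored, since no label in the grouped index maps to it.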

### Grouping with functions

Using Python functions is a more generic way of defining a group mapping compared with a dict or Series. Any function passed as a group key will be called once per index value, with the return values being used as the group names.

In [None]:
people.groupby(len).sum()
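To see what passing `len` does, the following sketch compares it with grouping on an explicitly computed array of name lengths. The `demo` frame and its column `x` are hypothetical illustration data.

```python
import pandas as pd

# Hypothetical illustration data indexed by first names
demo = pd.DataFrame({'x': [1, 2, 3, 4, 5]},
                    index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])

# The function is called once per index value; its return value is the group name
by_function = demo.groupby(len).sum()

# Equivalent grouping with the lengths computed up front
name_lengths = [len(name) for name in demo.index]
by_array = demo.groupby(name_lengths).sum()

print(by_function)
print(by_array)
```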

Mixing functions with arrays, dicts, or Series is not a problem as everything gets converted to arrays internally:

In [None]:
key_list = ['one', 'one', 'one', 'two', 'two']
people.groupby([len, key_list]).min()

### Grouping by Index Levels

A final convenience for hierarchically indexed datasets is the ability to aggregate using one of the levels of an axis index. Let's look at an example:

In [None]:
columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],
                                     [1, 3, 5, 1, 3]],
                                     names=['cty', 'tenor'])
hier_df = pd.DataFrame(np.random.randn(4, 5), columns=columns)
hier_df
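To group by level, pass the level number or name using the level keyword. The sketch below is one way to do this for a level of the column index; it uses a hypothetical `demo` frame filled with ones and transposes it so that the levels sit on the row index, which avoids relying on the axis argument.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],
                                     [1, 3, 5, 1, 3]],
                                    names=['cty', 'tenor'])

# Hypothetical illustration data: every entry is 1.0
demo = pd.DataFrame(np.ones((4, 5)), columns=columns)

# Transpose so the column levels become row levels, group by the 'cty' level,
# count the entries per group, and transpose back to the original orientation
counts = demo.T.groupby(level='cty').count().T

print(counts)
```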

## Data Aggregation

Aggregations refer to any data transformation that produces scalar values from arrays. The preceding examples have used several of them, including mean, count, min and sum. You may wonder what is going on when you invoke *mean* on a GroupBy object. Many common aggregations have optimized implementations. However, you are not limited to only this set of methods.

You can use aggregations of your own devising and additionally call any method that is also defined on the grouped object. For example, you might recall that quantile computes sample quantiles of a Series or a DataFrame's columns.

While quantile is not explicitly implemented for GroupBy, it is a Series method and thus available for use.

In [None]:
df

In [None]:
grouped = df.groupby('key1')
grouped['data1'].quantile(0.9)

To use your own aggregation functions, pass any function that aggregates an array to the aggregate or agg method:

In [None]:
def peak_to_peak(arr):
    return arr.max() - arr.min()

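# Select the numeric columns first: subtraction is not defined for the strings in key2
grouped[['data1', 'data2']].agg(peak_to_peak)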
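As a sanity check on a custom aggregation, one can compare it with the same quantity assembled from built-in methods. The sketch below does this for `peak_to_peak` on a hypothetical `demo` frame with hand-checkable values.

```python
import pandas as pd

# Hypothetical illustration data
demo = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                     'data1': [1.0, 4.0, 2.0, 7.0, 3.0]})
grouped_demo = demo.groupby('key1')['data1']


def peak_to_peak(arr):
    """Return the range (maximum minus minimum) of the values in arr."""
    return arr.max() - arr.min()


# Custom function passed to agg: called once per group
custom = grouped_demo.agg(peak_to_peak)

# The same quantity from the built-in max and min aggregations
builtin = grouped_demo.max() - grouped_demo.min()

print(custom)
print(builtin)
```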

### Column-wise and multiple function application

Let's return to the tipping dataset used in the last chapter. After loading it with *read_csv*, we add a tipping percentage column *tip_pct*:

In [None]:
tips = pd.read_csv('examples/tips.csv')
# Add tip percentage of total bill
tips['tip_pct'] = tips['tip'] / tips['total_bill']
tips[:6]