# UNDERSTANDING GROUPBY IN PANDAS

In [None]:
import pandas as pd
import dateutil.parser  # import the parser submodule explicitly

In [None]:
# Load data from the CSV file (DataFrame.from_csv was removed in pandas 1.0)
data = pd.read_csv('../data/phone_data.csv', index_col=0)
# Convert the date column from strings to datetimes
data['date'] = data['date'].apply(dateutil.parser.parse, dayfirst=True)

In [None]:
# How many rows are in the dataset? (count() skips null values)
data['item'].count()

In [None]:
# What was the longest phone call / data entry?
data['duration'].max()

In [None]:
# How many seconds of phone calls are recorded in total?
data['duration'][data['item'] == 'call'].sum()

In [None]:
# How many entries are there for each month?
data['month'].value_counts()

In [None]:
# Number of non-null unique network entries
data['network'].nunique()

Custom functions are rarely needed unless you have very specific requirements. The basic statistics built into the base Pandas package are:

- count: Number of non-null observations
- sum: Sum of values
- mean: Mean of values
- mad: Mean absolute deviation (removed in Pandas 2.0)
- median: Arithmetic median of values
- min: Minimum
- max: Maximum
- mode: Mode
- abs: Absolute Value
- prod: Product of values
- std: Unbiased standard deviation
- var: Unbiased variance
- sem: Unbiased standard error of the mean
- skew: Unbiased skewness (3rd moment)
- kurt: Unbiased kurtosis (4th moment)
- quantile: Sample quantile (value at %)
- cumsum: Cumulative sum
- cumprod: Cumulative product
- cummax: Cumulative maximum
- cummin: Cumulative minimum
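
As a quick illustration of a few of the statistics listed above, here is a minimal sketch on a small, made-up Series. The values are hypothetical and are not taken from phone_data.csv.

```python
import pandas as pd

# Hypothetical call durations in seconds, with one missing value
durations = pd.Series([30.0, 45.0, 60.0, 120.0, None])

n_obs = durations.count()            # counts non-null observations only
total = durations.sum()              # missing values are skipped
middle = durations.median()
running = durations.cumsum().tolist()

print(n_obs, total, middle)
print(running)
```

Note that count() ignores the missing value, which is why it can differ from len() on the same Series.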

## PART 1: SUMMARIZING GROUPS IN THE DATAFRAME

Mastering the Pandas “groupby()” functionality gives you considerably more power. Groupby splits the data into groups according to a variable of your choice. For example, the expression data.groupby('month') will split our current DataFrame by month.

The groupby() function returns a GroupBy object, which describes how the rows of the original data set have been split. The GroupBy object's .groups attribute is a dictionary whose keys are the unique groups and whose values are the axis labels belonging to each group. For example:

In [None]:
data.groupby(['month']).groups.keys()

In [None]:
len(data.groupby(['month']).groups['2014-11'])
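
The same idea can be tried without the CSV file. The sketch below uses a small hypothetical DataFrame (the name toy and its values are invented for illustration) to show what .groups contains and how to pull out a single group.

```python
import pandas as pd

# Hypothetical stand-in for the phone data
toy = pd.DataFrame({
    'month': ['2014-11', '2014-11', '2014-12', '2015-01', '2015-01'],
    'item': ['call', 'sms', 'call', 'data', 'call'],
    'duration': [120.0, 1.0, 300.0, 34.4, 60.0],
})

by_month = toy.groupby('month')

# .groups maps each group key to the row labels in that group
group_labels = {key: list(labels) for key, labels in by_month.groups.items()}
print(group_labels)

# Iterating over the GroupBy object yields (key, sub-DataFrame) pairs
for month, frame in by_month:
    print(month, len(frame))

# get_group() returns the rows belonging to one key
november = by_month.get_group('2014-11')
```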

In [None]:
# Get the first entry for each month
data.groupby('month').first()

In [None]:
# Get the sum of the durations per month
data.groupby('month')['duration'].sum()

In [None]:
# Get the number of dates / entries in each month
data.groupby('month')['date'].count()

In [None]:
# What is the sum of durations, for calls only, to each network?
data[data['item'] == 'call'].groupby('network')['duration'].sum()

#### You can also group by more than one variable, allowing more complex queries.

In [None]:
# How many calls, sms, and data entries are in each month?
data.groupby(['month', 'item'])['date'].count()

In [None]:
# How many calls, texts, and data are sent per month, split by network_type?
data.groupby(['month', 'network_type'])['date'].count()
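
Grouping by two variables produces a result indexed by both keys. A minimal sketch on hypothetical data (the toy frame below is invented for illustration) shows how to inspect that index and look up one combination:

```python
import pandas as pd

# Hypothetical entries: three in November, one in December
toy = pd.DataFrame({
    'month': ['2014-11', '2014-11', '2014-11', '2014-12'],
    'item': ['call', 'call', 'sms', 'call'],
    'date': pd.to_datetime(['2014-11-01', '2014-11-02',
                            '2014-11-03', '2014-12-01']),
})

counts = toy.groupby(['month', 'item'])['date'].count()

# The result carries one index level per grouping variable
level_names = list(counts.index.names)
print(level_names)

# A tuple of keys selects a single (month, item) combination
november_calls = counts.loc[('2014-11', 'call')]
print(november_calls)
```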

## PART 2: GROUPBY OUTPUT FORMAT - SERIES OR DATAFRAME?

The output from a groupby and aggregation operation can be either a Pandas Series or a Pandas DataFrame, which can be confusing for new users. As a rule of thumb, if you calculate more than one column of results, your result will be a DataFrame. For a single column of results, the aggregation will by default produce a Series.

In [None]:
data.groupby('month')['duration'].sum() # produces Pandas Series

In [None]:
data.groupby('month')[['duration']].sum() # Produces Pandas DataFrame
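
To check the difference directly, the sketch below runs both forms on a small hypothetical DataFrame (toy is an invented stand-in) and inspects the resulting types.

```python
import pandas as pd

# Hypothetical stand-in for the phone data
toy = pd.DataFrame({
    'month': ['2014-11', '2014-11', '2014-12'],
    'duration': [120.0, 30.0, 300.0],
})

# Single brackets select one column, so the result is a Series
as_series = toy.groupby('month')['duration'].sum()
# Double brackets select a list of columns, so the result is a DataFrame
as_frame = toy.groupby('month')[['duration']].sum()

print(type(as_series).__name__)
print(type(as_frame).__name__)

# A Series can be promoted to a one-column DataFrame when needed
promoted = as_series.to_frame()
```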

#### The groupby output will have an index or multi-index on rows corresponding to your chosen grouping variables. To avoid setting this index, pass “as_index=False” to the groupby operation.

In [None]:
data.groupby('month', as_index=False).agg({"duration": "sum"})

With as_index=False, the grouping variable stays in the result as a regular column instead of becoming the row index.
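
A minimal sketch on hypothetical data (the toy frame is invented for illustration) compares the two outputs. It also shows that calling reset_index() on the indexed result is one way to reach the same layout afterwards.

```python
import pandas as pd

# Hypothetical stand-in for the phone data
toy = pd.DataFrame({
    'month': ['2014-11', '2014-11', '2014-12'],
    'duration': [120.0, 30.0, 300.0],
})

# Grouping key kept as an ordinary column
flat = toy.groupby('month', as_index=False).agg({'duration': 'sum'})
# Default behaviour: grouping key becomes the row index
indexed = toy.groupby('month').agg({'duration': 'sum'})

print(list(flat.columns))
print(indexed.index.name)

# Moving the index back into a column gives the same columns as flat
reset = indexed.reset_index()
```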

## PART 3: MULTIPLE STATISTICS PER GROUP

The final piece of syntax that we’ll examine is the “agg()” function for Pandas. The aggregation functionality provided by the agg() function allows multiple statistics to be calculated per group in one calculation. The syntax is simple, and is similar to that of MongoDB’s [aggregation framework](http://docs.mongodb.org/manual/applications/aggregation/).

<img src='./images/pandas_aggregation-1024x409.png' />

Figure: aggregation of variables in a Pandas DataFrame using the agg() function. Note that from Pandas 0.20.1 onwards, the renaming of results needs to be done separately.
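
As one possible way to handle the renaming mentioned above, Pandas 0.25 and later support "named aggregation", where each keyword argument to agg() names an output column and supplies a (column, function) pair. The sketch below assumes such a version is installed and uses hypothetical data and output names.

```python
import pandas as pd

# Hypothetical stand-in for the phone data
toy = pd.DataFrame({
    'month': ['2014-11', '2014-11', '2014-12'],
    'item': ['call', 'sms', 'call'],
    'duration': [120.0, 1.0, 300.0],
})

# Keyword = output column name, value = (input column, aggregation)
summary = toy.groupby('month').agg(
    total_duration=('duration', 'sum'),
    longest=('duration', 'max'),
    entries=('item', 'count'),
)
print(summary)
```

The result has flat, already-renamed columns, so no separate renaming step is needed in this case.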

## PART 4: APPLYING A SINGLE FUNCTION TO COLUMNS IN GROUPS

Instructions for aggregation are provided in the form of a Python dictionary or list. The dictionary keys specify the columns on which you’d like to perform operations, and the dictionary values specify the function to run.

In [None]:
# Group the data frame by month and item and extract a number of stats from each group
data.groupby(['month', 'item']).agg({'duration': 'sum',       # find the sum of the durations for each group
                                     'network_type': 'count', # find the number of network type entries
                                     'date': 'first'})        # get the first date per group

The aggregation dictionary syntax is flexible and can be defined before the operation. You can also define functions inline using “lambda” functions to extract statistics that are not provided by the built-in options.

In [None]:
# Define the aggregation procedure outside of the groupby operation
aggregations = {
    'duration':'sum',
    'date': lambda x: max(x) - pd.Timedelta(seconds=60)
}
data.groupby('month').agg(aggregations)
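
Here is a self-contained sketch of the same pattern on hypothetical data: a lambda computes the spread between the longest and shortest duration in each month, a statistic with no built-in name. The toy frame and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the phone data
toy = pd.DataFrame({
    'month': ['2014-11', '2014-11', '2014-12'],
    'duration': [120.0, 30.0, 300.0],
})

# The lambda receives each group's 'duration' values as a Series
aggregations = {
    'duration': lambda x: x.max() - x.min(),
}
spread = toy.groupby('month').agg(aggregations)
print(spread)
```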

## PART 5: APPLYING MULTIPLE FUNCTIONS TO COLUMNS IN GROUPS

To apply multiple functions to a single column in your grouped data, expand the syntax above to pass in a list of functions as the value in your aggregation dictionary.

In [None]:
# Group the data frame by month and item and extract a number of stats from each group
data.groupby(['month', 'item']).agg({'duration': ['min', 'max', 'sum'],      # find the min, max, and sum of the duration column
                                     'network_type': 'count',                # find the number of network type entries
                                     'date': ['min', 'first', 'nunique']})   # get the min, first, and number of unique dates
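
The shape of the output is easier to see on a small example. This sketch uses hypothetical data (toy is an invented stand-in) and shows that the result's columns are (column, function) pairs, which can be selected with a tuple.

```python
import pandas as pd

# Hypothetical stand-in for the phone data
toy = pd.DataFrame({
    'month': ['2014-11', '2014-11', '2014-12'],
    'item': ['call', 'sms', 'call'],
    'duration': [120.0, 1.0, 300.0],
})

stats = toy.groupby('month').agg({
    'duration': ['min', 'max', 'sum'],  # three statistics for one column
    'item': 'count',                    # a single statistic for another
})

# Columns form a two-level index of (column, function) pairs
print(stats.columns.tolist())

# A tuple selects one statistic of one column
longest = stats[('duration', 'max')]
print(longest)
```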

## PART 6: RENAMING GROUPED STATISTICS FROM GROUPBY OPERATIONS

When multiple statistics are calculated on columns, the resulting dataframe will have a multi-index set on the column axis. This can be difficult to work with, and I typically have to rename columns after a groupby operation.

One option is to drop the top level (using .droplevel) of the newly created multi-index on columns using:

In [None]:
import numpy as np

grouped = data.groupby('month').agg({"duration": ['min', 'max', 'mean']})
grouped.columns = grouped.columns.droplevel(level=0)
# rename() returns a new DataFrame, so assign the result back
grouped = grouped.rename(columns={"min": "min_duration", "max": "max_duration", "mean": "mean_duration"})
grouped.head()

However, this approach loses the original column names, leaving only the function names as column headers. A neater approach is to use the ravel() method on the grouped columns. ravel() turns a Pandas multi-index into a simple array of tuples, which we can join into sensible column names:

In [None]:
grouped = data.groupby('month').agg({"duration": ['min', 'max', 'mean']})
# Using ravel, and a string join, we can create better names for the columns:
grouped.columns = ["_".join(x) for x in grouped.columns.ravel()]

In [None]:
grouped.head()

#### Quick renaming of grouped columns from the groupby() multi-index can be achieved using the ravel() function.
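
As an alternative sketch, MultiIndex.to_flat_index() (available from Pandas 0.24) yields the same (column, function) tuples and can be joined in the same way. The data below is hypothetical and only serves to make the example runnable on its own.

```python
import pandas as pd

# Hypothetical stand-in for the phone data
toy = pd.DataFrame({
    'month': ['2014-11', '2014-11', '2014-12'],
    'duration': [120.0, 30.0, 300.0],
})

grouped = toy.groupby('month').agg({'duration': ['min', 'max', 'mean']})

# Each element is a tuple such as ('duration', 'min'); join it with "_"
grouped.columns = ['_'.join(col) for col in grouped.columns.to_flat_index()]
print(list(grouped.columns))
```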

## PART 7: MAP, REDUCE, AND FILTER IN AGGREGATE METHODS

In [None]:
medium_list = [['Item A', 59],
          ['Item B', 95],
          ['Item B', 82],
          ['Item C', 40],
          ['Item A', 11]]

medium = pd.DataFrame(medium_list)
medium.columns = ['item', 'value']
medium

In the DataFrame above, each row is an item of type A, B or C and its value. A common task is to find the total value for each type of item. To do this, you group by item and sum the value.

In [None]:
medium.groupby('item').value.sum()

This case is pretty straightforward: Pandas provides a sum function that does the work for us. If there wasn’t such a function, we could write a custom sum function and pass it to the aggregate function to achieve the same result.

In [None]:
from functools import reduce 

def test_sum(series):
    return reduce(lambda x, y: x + y, series.tolist())

medium.groupby('item').agg({'value': ['sum', test_sum]})

The aggregation function we created receives the value Series for each group and then sums all the items in the series, giving the same result as the sum function from Pandas.

Of course this is a dull example, as it’s not useful at all given the existence of the sum function. In a real-world use case, such as verifying that every sales analyst is tied to a manager, we can write the following aggregation function to return the set of analysts for a given manager.

~~~
def agg_analyst_per_manager(series):
  # str replaces Python 2's unicode type
  analyst_list = series.astype(str).tolist()
  # drop empty names before building the set
  analyst_list = filter(lambda analyst: analyst != '', analyst_list)
  return set(analyst_list)
~~~
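
A possible way to use such a function is sketched below. The sales DataFrame, its column names (manager, analyst) and its values are all hypothetical, and the function is restated in a Python 3 form so the example runs on its own.

```python
import pandas as pd

def agg_analyst_per_manager(series):
    # Convert to strings, then keep only the non-empty analyst names
    analysts = series.astype(str).tolist()
    return {analyst for analyst in analysts if analyst != ''}

# Hypothetical sales team data; one row has no analyst name
sales = pd.DataFrame({
    'manager': ['Ana', 'Ana', 'Bo', 'Bo'],
    'analyst': ['Cy', 'Di', '', 'Ed'],
})

analysts_by_manager = sales.groupby('manager')['analyst'].agg(agg_analyst_per_manager)
print(analysts_by_manager)
```

Each entry of the result is a Python set, so membership checks such as 'Cy' in analysts_by_manager['Ana'] work directly.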

## SUMMARY

The groupby functionality in Pandas is well documented in the [official docs](http://pandas.pydata.org/pandas-docs/stable/groupby.html) and performs at speeds on a par (unless you have massive data and are picky with your milliseconds) with R’s data.table and dplyr libraries.

There are plenty of resources online on this functionality, and we’d recommend really mastering this syntax if you’re using Pandas in earnest at any point.

1. [DataQuest Tutorial on Data Analysis](https://www.dataquest.io/blog/pandas-tutorial-python-2/)
2. [Chris Albon notes on Groups](https://chrisalbon.com/python/pandas_apply_operations_to_groups.html)
3. [Greg Reda Pandas Tutorial](http://www.gregreda.com/2013/10/26/working-with-pandas-dataframes/)