# 2. Custom Aggregation

### Objectives

+ Write your own custom aggregation functions
+ Know what object is passed to the custom aggregation function

# Introduction
Pandas GroupBy objects come with many built-in aggregate functions. These are all available as strings within the **`agg`** method. There are, of course, many other possible aggregations that are not directly available. It is possible to define your own customized aggregate function. These customized functions must return a single value.

## Writing your own custom aggregation function
Let's suppose you would like to know the difference between the max and min value of a column for each group. Pandas does not have an aggregate function built to do this. You will have to define this one yourself. 

Each customized aggregate function is defined like a regular Python function with the **`def`** keyword. For each group, the function is **implicitly** passed that group's values of the aggregating column. These values are passed as a **`Series`**. This means that all Series methods will work on the passed argument.

The **`min_max`** function below takes one argument, **`s`**, which is a Series object. It returns the difference between the max and min values of that Series.

In [None]:
import pandas as pd
import numpy as np

college = pd.read_csv('../../data/college.csv')

def min_max(s):
    return s.max() - s.min()
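To see what the function does on its own, here is a minimal sketch that calls it on a small Series. The numbers are hypothetical and are not taken from `college.csv`.

```python
import pandas as pd

def min_max(s):
    # s is a pandas Series; return the spread between its extremes
    return s.max() - s.min()

# Hypothetical enrollment numbers, not taken from college.csv
toy = pd.Series([120, 4500, 980])
spread = min_max(toy)
print(spread)  # 4380
```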

## Using your customized aggregation function
Customized aggregation functions are used similarly to the built-in aggregation functions. When using them within the **`agg`** method, use the actual function object and not the string name. 

The following finds the difference between the maximum and minimum student populations for schools with and without religious affiliation.

In [None]:
college.groupby('relaffil').agg({'ugds': min_max})

### Implicit passing of aggregation Series
The above **`agg`** method passed the **`ugds`** column as a Series to our customized aggregation function, **`min_max`**, once for each group. The parameter **`s`** takes on this Series. We say this is implicit because we never write the call to **`min_max`** ourselves; **`agg`** makes it for us.

An **explicit** call to **`min_max`** would look like this:

In [None]:
min_max(college['ugds'])
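One way to check what is passed implicitly is to record it from inside the function. The sketch below assumes a small hypothetical DataFrame in place of the college data, and `min_max_logged` is an illustrative name, not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for the college dataset
toy = pd.DataFrame({
    'relaffil': [0, 0, 1, 1, 1],
    'ugds': [100, 400, 50, 75, 300],
})

seen = []  # records what agg hands to the function

def min_max_logged(s):
    # Store the type and values of the object received for this group
    seen.append((type(s).__name__, s.tolist()))
    return s.max() - s.min()

result = toy.groupby('relaffil').agg({'ugds': min_max_logged})

for type_name, values in seen:
    print(type_name, values)
print(result)
```

Each recorded object should be a Series holding only the `ugds` values of one group.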

###  Custom aggregation function must return a single value
If your custom aggregation function does not return a single value, an exception will be raised. Let's create a custom aggregation that adds 5 to each value. This will return a Series the same size as the group and not a single number.

In [None]:
def add5(s):
    return s + 5

Attempting this produces an error:

In [None]:
college.groupby('relaffil').agg({'ugds': add5})
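If you want to see the failure without stopping the notebook, you can wrap the call in `try`/`except`. This is a sketch on hypothetical toy data; the exact exception type and message depend on the pandas version, so the code catches the general `Exception` as an assumption.

```python
import pandas as pd

# Hypothetical toy data; each group has more than one row
toy = pd.DataFrame({
    'relaffil': [0, 0, 1, 1],
    'ugds': [100, 400, 50, 300],
})

def add5(s):
    return s + 5  # returns a Series, not a single value

raised = False
try:
    toy.groupby('relaffil').agg({'ugds': add5})
except Exception as err:  # exact type varies across pandas versions
    raised = True
    print(type(err).__name__, err)
```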

## Combine custom aggregation function with built-ins
The custom aggregation function can be used in conjunction with any number of the built-in aggregation functions that we have previously seen. You will have to rename the columns to remove the MultiIndex as usual.

In [None]:
college.groupby(['stabbr', 'relaffil']) \
       .agg({'ugds': ['size', 'min', 'max', min_max]}).head(12)
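One possible way to remove the column MultiIndex is to join its two levels into single strings. The sketch below uses hypothetical toy data; the custom function shows up in the columns under its own name, `min_max`.

```python
import pandas as pd

def min_max(s):
    return s.max() - s.min()

# Hypothetical toy data standing in for the college dataset
toy = pd.DataFrame({
    'stabbr': ['AL', 'AL', 'AK', 'AK'],
    'relaffil': [0, 0, 0, 1],
    'ugds': [100, 400, 50, 300],
})

flat = toy.groupby(['stabbr', 'relaffil']) \
          .agg({'ugds': ['size', 'min', 'max', min_max]})

# Each column label is a tuple such as ('ugds', 'min'); join it with '_'
flat.columns = ['_'.join(col) for col in flat.columns]
print(flat)
```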

## Finding the percentage of all undergraduates represented in the top 5 most populous colleges
A slightly more involved example would be to find the percentage of undergraduates that attend the top 5 most populous colleges for each state.

To accomplish this, our custom function sorts the values within each group from greatest to least. We then select the first 5 values with **`.iloc`** and sum them. We divide this sum by the total.

In [None]:
def top5_perc(s):
    s = s.sort_values(ascending=False)
    top5_total = s.iloc[:5].sum()
    total = s.sum()
    return top5_total / total

In [None]:
college.groupby('stabbr').agg({'ugds': top5_perc}).head(10)
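To follow the arithmetic by hand, here is a sketch that calls the function directly on one hypothetical group of seven schools. The five largest values sum to 90 and all seven sum to 100.

```python
import pandas as pd

def top5_perc(s):
    s = s.sort_values(ascending=False)
    top5_total = s.iloc[:5].sum()
    total = s.sum()
    return top5_total / total

# Hypothetical populations for a single state
one_state = pd.Series([10, 50, 20, 5, 5, 5, 5])

# Top five are 50, 20, 10, 5, 5, which sum to 90; the total is 100
share = top5_perc(one_state)
print(share)  # 0.9
```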

## Optimizing a Custom Aggregation function
Defining your own custom aggregation function is tricky and can cause large performance hits. Pandas optimizes its own built-in functions but can't ensure that your custom function is executed optimally.

## Run operations that are independent of the group outside of the custom function
In general, it is best to minimize the amount of code inside the custom function. The only commands that should go inside the custom function are those that depend on the grouping.

In the above example, there is no need to sort the values inside the group. We can instead sort the entire DataFrame before the grouping. Pandas preserves the order of the values in each group, so you can be sure that the top 5 values are the same for both methods.

We redefine the custom aggregation function below:

In [None]:
def top5_perc_simple(s):
    top5_total = s.iloc[:5].sum()
    total = s.sum()
    return top5_total / total

We then sort the entire DataFrame first before grouping.

In [None]:
college.sort_values('ugds', ascending=False) \
       .groupby('stabbr').agg({'ugds': top5_perc_simple}).head(10)
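A quick way to convince yourself that both approaches agree is to run them on the same data and compare. This sketch assumes a hypothetical toy DataFrame with two states.

```python
import pandas as pd

def top5_perc(s):
    s = s.sort_values(ascending=False)
    return s.iloc[:5].sum() / s.sum()

def top5_perc_simple(s):
    # Assumes the values arrive already sorted from greatest to least
    return s.iloc[:5].sum() / s.sum()

# Hypothetical toy data: one state with 7 schools, one with 3
toy = pd.DataFrame({
    'stabbr': ['TX'] * 7 + ['VT'] * 3,
    'ugds': [5, 50, 10, 5, 20, 5, 5, 30, 10, 60],
})

sort_inside = toy.groupby('stabbr').agg({'ugds': top5_perc})
sort_first = toy.sort_values('ugds', ascending=False) \
                .groupby('stabbr').agg({'ugds': top5_perc_simple})

max_diff = (sort_inside['ugds'] - sort_first['ugds']).abs().max()
print(sort_inside)
print(max_diff)
```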

### Comparing performance
The fewer operations that occur within the custom GroupBy function, the better performance will be.

In the timings below, sorting before grouping gives about a 50% performance improvement.

In [None]:
%timeit -n 5 college.groupby('stabbr').agg({'ugds': top5_perc}).head(10)

In [None]:
%%timeit -n 5 
college.sort_values('ugds', ascending=False) \
       .groupby('stabbr').agg({'ugds': top5_perc_simple}).head(10)
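The `%timeit` magic only works inside IPython or Jupyter. If you want to run a similar comparison in a plain Python script, the standard-library `timeit` module is one option. The sketch below uses randomly generated, hypothetical data, so the timings it prints will not match the ones for the college dataset.

```python
import timeit

import numpy as np
import pandas as pd

def top5_perc(s):
    s = s.sort_values(ascending=False)
    return s.iloc[:5].sum() / s.sum()

def top5_perc_simple(s):
    return s.iloc[:5].sum() / s.sum()

# Hypothetical synthetic data: 5,000 schools spread over 10 made-up states
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'stabbr': rng.choice(list('ABCDEFGHIJ'), size=5_000),
    'ugds': rng.integers(1, 50_000, size=5_000),
})

def sort_inside():
    return toy.groupby('stabbr').agg({'ugds': top5_perc})

def sort_first():
    return toy.sort_values('ugds', ascending=False) \
              .groupby('stabbr').agg({'ugds': top5_perc_simple})

# Run each version 5 times, repeat that 3 times, and keep the best total
t_inside = min(timeit.repeat(sort_inside, number=5, repeat=3))
t_first = min(timeit.repeat(sort_first, number=5, repeat=3))
print(f'sort inside: {t_inside:.4f}s, sort first: {t_first:.4f}s')
```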

# Pandas Power User Optimization
Performance is generally better when custom functions are avoided. This is because Pandas only optimizes for a few select functions: the ones that we can use as strings, such as `sum`, `max`, `min`, etc.

We do the same calculation again below using only built-in Pandas GroupBy methods.

### Get top 5 rows with `head` GroupBy method
You can get the first 5 rows of **each** group by calling the `head` method directly after grouping. By default, `head` returns 5 rows.

In [None]:
college_top5 = college.sort_values('ugds', ascending=False) \
                      .groupby('stabbr').head()

In [None]:
college_top5.head()

We can verify this by counting the number of rows for each state in the resulting DataFrame. Each count should be at most 5.

In [None]:
college_top5['stabbr'].value_counts().head(10)
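The same check can be sketched on hypothetical toy data where one state has more than 5 schools and another has fewer. The first should be cut to 5 rows and the second should keep all of its rows.

```python
import pandas as pd

# Hypothetical toy data: one state with 7 schools, one with 3
toy = pd.DataFrame({
    'stabbr': ['TX'] * 7 + ['VT'] * 3,
    'ugds': [5, 50, 10, 5, 20, 5, 5, 30, 10, 60],
})

# Sort first, then keep the first 5 rows of each group
toy_top5 = toy.sort_values('ugds', ascending=False) \
              .groupby('stabbr').head()

counts = toy_top5['stabbr'].value_counts()
print(toy_top5)
print(counts)
```

Note that the rows come back in the sorted order of the DataFrame, not regrouped by state.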

### Sum the school populations from this DataFrame
We can now total the populations for each state by using another call to **`groupby`**.

In [None]:
top5_total = college_top5.groupby('stabbr').agg({'ugds': 'sum'})
top5_total.head()

#### Faster to use alternative groupby syntax
We can use an alternative groupby syntax to get another performance improvement.

In [None]:
top5_total = college_top5.groupby('stabbr')['ugds'].sum()
top5_total.head()

Check performance for the two operations here:

In [None]:
%timeit -n 5 college_top5.groupby('stabbr').agg({'ugds': 'sum'})

In [None]:
%timeit -n 5 college_top5.groupby('stabbr')['ugds'].sum()
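Besides speed, the two syntaxes differ in what they return. As a sketch on hypothetical toy data, the dictionary form gives back a DataFrame, while selecting the column first gives back a Series with the same numbers.

```python
import pandas as pd

# Hypothetical stand-in for college_top5
toy_top5 = pd.DataFrame({
    'stabbr': ['TX', 'TX', 'TX', 'VT', 'VT'],
    'ugds': [50, 20, 10, 60, 30],
})

as_frame = toy_top5.groupby('stabbr').agg({'ugds': 'sum'})
as_series = toy_top5.groupby('stabbr')['ugds'].sum()

print(type(as_frame).__name__, type(as_series).__name__)
print(as_series)
```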

### Sum all the schools for each state
Use the original DataFrame to find the total for each state with yet another call to **`groupby`**.

In [None]:
total = college.groupby('stabbr')['ugds'].sum()
total.head()

### Divide the last two Series
We get our desired result by dividing the top 5 total by the grand total. This is the same result as the other two methods.

In [None]:
(top5_total / total).head()
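The division works because both Series are indexed by state, so pandas aligns them label by label. Here is a sketch on hypothetical toy data that also compares the result with the custom aggregation from earlier.

```python
import pandas as pd

def top5_perc(s):
    s = s.sort_values(ascending=False)
    return s.iloc[:5].sum() / s.sum()

# Hypothetical toy data: one state with 7 schools, one with 3
toy = pd.DataFrame({
    'stabbr': ['TX'] * 7 + ['VT'] * 3,
    'ugds': [5, 50, 10, 5, 20, 5, 5, 30, 10, 60],
})

# Built-in methods only
toy_top5 = toy.sort_values('ugds', ascending=False) \
              .groupby('stabbr').head()
top5_total = toy_top5.groupby('stabbr')['ugds'].sum()
total = toy.groupby('stabbr')['ugds'].sum()
builtin_only = top5_total / total  # aligned on the stabbr index

# Custom aggregation for comparison
custom = toy.groupby('stabbr')['ugds'].agg(top5_perc)

max_diff = (builtin_only - custom).abs().max()
print(builtin_only)
print(max_diff)
```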

## New Performance Test
Let's run all these new commands together in a single cell and test performance. We were able to reduce the time to complete the task by 80% from the original custom aggregation. There is actually another optimization here. We assign the result of our first `groupby` to the variable `grouped` as we use this result twice.

In [None]:
%%timeit -n 5

college_sorted = college.sort_values('ugds', ascending=False)
grouped = college_sorted.groupby('stabbr')

college_top5 = grouped.head()
top5_total = college_top5.groupby('stabbr')['ugds'].sum()

total = grouped['ugds'].sum()
top5_total / total

# Complexity vs Performance
This is usually a topic of debate when deciding on which Pandas methods to use. I typically like to avoid custom aggregation functions at all costs, as they can drastically reduce performance for larger datasets.

Readability (low complexity) is very valuable when sharing your code or looking back at it at a later date. The custom aggregation may provide slightly more readability, but not by much, so I would recommend using the faster solution here.

# Exercises
Solutions are below.

Use the flights data for these problems.

In [None]:
import pandas as pd
pd.options.display.max_columns = 40
flights = pd.read_csv('../../data/flights.csv')
flights.head()

## Problem 1
<span  style="color:green; font-size:16px">What are the 3 least common airlines?</span>

## Problem 2
<span  style="color:green; font-size:16px">For each airline, find out what percentage of its flights leave on the 4th day of the week. Use a custom aggregation function.</span>

## Problem 3
<span  style="color:green; font-size:16px">Redo problem 2 without using a custom aggregation function. What is the performance difference?</span>

## Problem 4
<span  style="color:green; font-size:16px">The range of undergrad populations per state was calculated using the `min_max` custom function from the top of this notebook. Use this same function to calculate the range of distance for each airline. Then calculate this range again without a custom function.</span>

## Problem 5
<span  style="color:green; font-size:16px">For each airline, return the first and last row of each group. Use one of the direct [GroupBy methods][1].</span>

[1]: http://pandas.pydata.org/pandas-docs/stable/api.html#groupby