# Split-Apply-Combine operations with `.groupby()`

Our data often consists of individual observations or events. To make sense of the patterns in the data, it can be helpful to aggregate the data within categorical groups. This general approach is known as the Split-Apply-Combine strategy for data analysis, described in [this classic paper by Hadley Wickham](https://www.jstatsoft.org/article/view/v040i01/v40i01.pdf).

- In SQL this is done with `GROUP BY`
- In Tableau this is done with all visualizations + Level of Detail (LOD) calculations
- In R this is done with the `dplyr` package using `group_by()` and `summarize()`
- **In Python with Pandas this is done with `.groupby()`**

## Groupby

From [the Pandas `.groupby()` documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html):

**By “group by” we are referring to a process involving one or more of the following steps:**

- **Splitting** the data into groups based on some criteria
- **Applying** a function to each group independently
- **Combining** the results into a data structure

After splitting, in **the apply step**, we do something to the groups, like:

- **Aggregation**: compute a summary statistic (or statistics) for each group, like group sums or means, or group sizes / counts. 
    - ***Reduces the number of rows to one per group***
- **Transformation**: perform some group-specific computations and return a like-indexed object, such as standardizing data (z-score) within a group, or filling NAs within groups with a value derived from each group. 
    - ***Keeps the number of rows the same (e.g. a new column with the group average in each row for easy comparisons to the individuals)***
- **Filtration**: discard some groups, according to a group-wise computation that evaluates True or False, such as discarding data that belongs to groups with only a few members, or filtering out data based on the group sum or mean
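To make the three steps concrete, here is a minimal sketch that performs split, apply, and combine by hand and then compares the result with the one-line `.groupby()` equivalent. The `toy` DataFrame and its column names are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical toy data, used only for this illustration
toy = pd.DataFrame({'key': ['A', 'B', 'A', 'B'], 'value': [1, 2, 3, 4]})

# Split: one sub-DataFrame per unique key
pieces = {k: toy.loc[toy['key'] == k] for k in toy['key'].unique()}

# Apply: reduce each piece to a single number
sums = {k: piece['value'].sum() for k, piece in pieces.items()}

# Combine: collect the per-group results into one Series
manual = pd.Series(sums, name='value')
manual.index.name = 'key'

# The same three steps, done by pandas in one expression
grouped = toy.groupby('key')['value'].sum()

print(manual)
print(grouped)
```

The manual version is only meant to show what happens conceptually; in practice `.groupby()` is shorter and faster.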


### Graphical Example 

As Jake VanderPlas shows in the 
[Aggregation and Grouping](http://nbviewer.jupyter.org/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.08-Aggregation-and-Grouping.ipynb) 
section of his excellent 
[Python Data Science Handbook](https://github.com/jakevdp/PythonDataScienceHandbook), 
an archetypal example of a `groupby()` operation with a sum aggregation is:

<img src='images/split-apply-combine.svg' width=600>

---

*To preserve the mystery, select from the notebook menus*

`Edit -> Clear All Outputs`

---

## Pandas groupby syntax

Let's try just a slightly more complex example to familiarize ourselves with the Pandas syntax for the aggregation and transformation operations. **To see some very common behaviors, we need multiple numerical columns.** *(We'll save multiple categorical columns for later...)*

In [1]:
import pandas as pd

### Create the example DataFrame

In [2]:
key_list = ['A','B','C','A','B','C']
data1_list = [2, 4, 6, 8, 10, 12]
data2_list = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]

df = pd.DataFrame({'key':key_list, 'data1':data1_list, 'data2':data2_list})
df

| | key | data1 | data2 |
|---|---|---|---|
| 0 | A | 2 | 0.1 |
| 1 | B | 4 | 0.2 |
| 2 | C | 6 | 0.3 |
| 3 | A | 8 | 0.4 |
| 4 | B | 10 | 0.5 |
| 5 | C | 12 | 0.6 |


### The `.groupby()` operation returns a groupby object

In [3]:
group = df.groupby('key')
group

`<pandas.core.groupby.generic.DataFrameGroupBy object at 0x117789bb0>`

### The individual rows in each group are still separate!

What you see returned are the keys and the indexes (names) of the rows in each group.

In [4]:
group.groups

{'A': [0, 3], 'B': [1, 4], 'C': [2, 5]}

### We can access the groups

In [5]:
group.get_group('A')

| | key | data1 | data2 |
|---|---|---|---|
| 0 | A | 2 | 0.1 |
| 3 | A | 8 | 0.4 |
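Besides `.get_group()`, a groupby object can also be looped over directly. The sketch below is not part of the original notebook; it rebuilds the same example DataFrame so it can run on its own, and the `group_sizes` dictionary is a hypothetical name used for illustration.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': [2, 4, 6, 8, 10, 12],
                   'data2': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]})

# Iterating over a groupby object yields (group name, sub-DataFrame) pairs
group_sizes = {}
for name, sub_df in df.groupby('key'):
    group_sizes[name] = len(sub_df)
    # Each sub-DataFrame keeps the original row labels
    print(name, sub_df.index.tolist())
```

This is mostly useful for inspection and debugging; the aggregate, transform, and filter methods are usually preferable to an explicit loop.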


---

## Aggregate


There are a few different variations of the aggregate syntax. The excellent blog post 
[Minimally Sufficient Pandas](https://medium.com/dunder-data/minimally-sufficient-pandas-a8e67f2a2428) 
by Ted Petrou talks about the trade-offs. Later I will show you a shortcut method, but in general I try to follow his suggestions for preferred syntax.

- `.agg()` is just short for `.aggregate()` and is fine to use
- **The aggregate method returns a DataFrame with a single combined row for each group**
- You can specify multiple statistical functions at once with a list

Some [descriptive statistics are built into Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#descriptive-statistics) and it's easy to import
[NumPy](https://numpy.org/) (Numerical Python) and use functions from there, too.

## Preferred aggregate syntax

Because in many cases we have multiple data columns, **the preferred aggregation syntax specifies both the name of the data column and the function(s) to be applied in a dictionary format**

`df.groupby('grouping column').agg({'aggregating column':'aggregating function'})`

In [6]:
df.groupby('key').agg({'data1':'sum'})

| key | data1 |
|---|---|
| A | 10 |
| B | 14 |
| C | 18 |


### Easy to apply different functions to different columns

Notice that with only one function per column, the output keeps no record of which function was applied.

In [7]:
df.groupby('key').agg({'data1':'sum', 'data2':'mean'})

| key | data1 | data2 |
|---|---|---|
| A | 10 | 0.25 |
| B | 14 | 0.35 |
| C | 18 | 0.45 |
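The dictionary values do not have to be built-in function names. As a hedged sketch, a custom function (here built on NumPy) can be passed in the same place. The helper `value_range` is hypothetical and not part of Pandas or of the original notebook; it receives one group's column as a Series.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': [2, 4, 6, 8, 10, 12],
                   'data2': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]})

def value_range(values):
    """Hypothetical helper: largest minus smallest value in one group's column."""
    arr = values.to_numpy()
    return np.max(arr) - np.min(arr)

# Mix a custom callable and a built-in name in the same dictionary
ranges = df.groupby('key').agg({'data1': value_range, 'data2': 'mean'})
print(ranges)
```

Built-in names such as `'sum'` or `'mean'` are generally faster than custom callables, so a custom function is best kept for statistics Pandas does not provide.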


### Multiple aggregation functions list results in multi-index columns

- You can supply a list of functions
- The multi-index combines the original column name with the aggregation function that was applied

In [8]:
sum_mean = df.groupby('key').agg({'data1':['sum','mean']})
sum_mean

| | data1 | data1 |
|---|---|---|
| **key** | **sum** | **mean** |
| A | 10 | 5.0 |
| B | 14 | 7.0 |
| C | 18 | 9.0 |
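If the multi-index columns are not wanted, one alternative that this notebook does not otherwise cover is Pandas "named aggregation", where each output column gets an explicit name. The sketch below assumes a Pandas version that supports it (0.25 or later); the output names `data1_sum` and `data1_mean` are arbitrary choices for illustration.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': [2, 4, 6, 8, 10, 12],
                   'data2': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]})

# Each keyword is an output column name; each value is (input column, function)
named = df.groupby('key').agg(
    data1_sum=('data1', 'sum'),
    data1_mean=('data1', 'mean'),
)
print(named)
```

The result has ordinary single-level columns, at the cost of spelling out every output name.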


### Multi-index is selected by a tuple

We'll cover this further in Groupy_NCexploration. It's the same situation if a groupby() returns a multi-index in the rows.

In [9]:
sum_mean.loc[:,('data1','sum')]

key
A    10
B    14
C    18
Name: (data1, sum), dtype: int64

### Just the highest level gets you all underneath

*You can't just use the lower level, or you'll get an error*

In [10]:
sum_mean.loc[:,('data1')]

| key | sum | mean |
|---|---|---|
| A | 10 | 5.0 |
| B | 14 | 7.0 |
| C | 18 | 9.0 |
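As a sketch of what happens with the lower level, the block below first shows the `KeyError` raised by using `'sum'` alone, then one possible workaround using the `.xs()` cross-section method. The workaround is an addition for illustration, not something the original notebook relies on.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': [2, 4, 6, 8, 10, 12],
                   'data2': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]})

sum_mean = df.groupby('key').agg({'data1': ['sum', 'mean']})

# Using only the lower-level label is looked up in the top level and fails
try:
    sum_mean.loc[:, 'sum']
    lower_level_alone_works = True
except KeyError:
    lower_level_alone_works = False

# .xs() can select on a specific level of the column multi-index;
# level=1 is the function-name level, and that level is dropped from the result
sums_only = sum_mean.xs('sum', axis=1, level=1)
print(sums_only)
```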


---

#### This "function name only" shortcut syntax is not recommended

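But you may see it around, and it does work fine for a quick result. The aggregation function is applied to every column that is not a grouping key (here, both numeric columns). Older Pandas versions silently dropped non-numeric columns; since Pandas 2.0 they are aggregated as well, which can raise an error for functions like `'mean'`.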

In [11]:
df.groupby('key').agg('sum')

| key | data1 | data2 |
|---|---|---|
| A | 10 | 0.5 |
| B | 14 | 0.7 |
| C | 18 | 0.9 |


In [12]:
df.groupby('key').agg(['sum','mean'])

| | data1 | data1 | data2 | data2 |
|---|---|---|---|---|
| **key** | **sum** | **mean** | **sum** | **mean** |
| A | 10 | 5.0 | 0.5 | 0.25 |
| B | 14 | 7.0 | 0.7 | 0.35 |
| C | 18 | 9.0 | 0.9 | 0.45 |


---

**To try the exercise below, select this cell and from the Jupyter menus choose**

`Run -> Run All Above Selected Cell`

## EXERCISE

**Use the preferred syntax to find the minimum and maximum simultaneously on only the "data2" column of `df` within the "key" groups**

*Note: Type instead of using copy/paste for better retention*

*Expected output*

<img src='images/data2_max_min.png'>

---

## Transform

**You'll probably use aggregation more often, but it's important to know how to transform.** 

- Applies a function to each of the groups
- Only a single function is allowed – it will be applied to all the columns
- **Doesn't change the number of rows from the original!**
- *Notice that we lost the 'key' column, but we retain the original Index!*

In [13]:
df.groupby('key').transform('mean')

| | data1 | data2 |
|---|---|---|
| 0 | 5.0 | 0.25 |
| 1 | 7.0 | 0.35 |
| 2 | 9.0 | 0.45 |
| 3 | 5.0 | 0.25 |
| 4 | 7.0 | 0.35 |
| 5 | 9.0 | 0.45 |


### Transform a single column

More commonly you'll want to operate on a single column so you can compare individuals in that column to the group result.

**If you want a single column (Series) out, you have to either**

- select a single column from the groupby object to pass to the transform function
- select a single column from the transform output

**using the standard "name of column in square brackets" notation**

In [14]:
df.groupby('key')['data1'].transform('mean')

0    5.0
1    7.0
2    9.0
3    5.0
4    7.0
5    9.0
Name: data1, dtype: float64

In [15]:
df.groupby('key').transform('mean')['data1']

0    5.0
1    7.0
2    9.0
3    5.0
4    7.0
5    9.0
Name: data1, dtype: float64
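Transform also accepts a custom function, which is how the within-group standardization (z-score) mentioned earlier could be done. This is a minimal sketch rather than the notebook's own method; it assumes the sample standard deviation (the Pandas default, `ddof=1`), so a group with a single member would produce `NaN`.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': [2, 4, 6, 8, 10, 12],
                   'data2': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]})

# The lambda receives one group's column as a Series and must return
# a result of the same length, so the original index is preserved
zscores = df.groupby('key')['data1'].transform(
    lambda s: (s - s.mean()) / s.std()
)
print(zscores)
```

Because each group here has only two members, every value ends up the same distance above or below its group mean.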

### Store the transform result so we can use it in a comparison

- Here we're just storing the result in a new variable
- We could alternatively store the transform results as a new column in the original DataFrame if we wanted to, say, color points in a plot

In [16]:
data1_mean = df.groupby('key')['data1'].transform('mean')
data1_mean

0    5.0
1    7.0
2    9.0
3    5.0
4    7.0
5    9.0
Name: data1, dtype: float64
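For the "new column" alternative, one possible sketch uses `.assign()`, which returns a copy and leaves the original DataFrame untouched. The names `with_mean`, `data1_mean`, and `diff_from_mean` are hypothetical choices for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': [2, 4, 6, 8, 10, 12],
                   'data2': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]})

# The transform output shares df's index, so it lines up row by row
with_mean = df.assign(
    data1_mean=df.groupby('key')['data1'].transform('mean')
)

# With the group mean in every row, row-level comparisons are simple arithmetic
with_mean['diff_from_mean'] = with_mean['data1'] - with_mean['data1_mean']
print(with_mean)
```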

### Return the rows that are above the group mean

Just so you can picture what's going on, let's look at the comparison first by itself, which returns a boolean Series

In [17]:
df['data1'] > data1_mean

0    False
1    False
2    False
3     True
4     True
5     True
Name: data1, dtype: bool

#### Now let's use the result to only return the rows that evaluate to True

In [18]:
df.loc[df['data1'] > data1_mean, :]

| | key | data1 | data2 |
|---|---|---|---|
| 3 | A | 8 | 0.4 |
| 4 | B | 10 | 0.5 |
| 5 | C | 12 | 0.6 |


#### Or we could have just done all of the operations at once

In [19]:
df.loc[df['data1'] > df.groupby('key')['data1'].transform('mean'), :]

| | key | data1 | data2 |
|---|---|---|---|
| 3 | A | 8 | 0.4 |
| 4 | B | 10 | 0.5 |
| 5 | C | 12 | 0.6 |


---

**To try the exercise below, select this cell and from the Jupyter menus choose**

`Run -> Run All Above Selected Cell`

## EXERCISE

**Transform the "key" group "data2" minimums and store in a new `data2_min` variable**

*Note: Type instead of using copy/paste for better retention*

*Expected output*

<img src='images/data2_min_series.png'>

## EXERCISE

**Return the rows where data2 value == data2_min**

*Expected output*

<img src='images/data2_equals_min_rows.png'>

---

## Filter – remove groups based on conditional function results

Just a quick example of filtering after groupby. Here's a reminder of the "key" group means.

In [20]:
df.groupby('key').agg({'data2':'mean'})

| key | data2 |
|---|---|
| A | 0.25 |
| B | 0.35 |
| C | 0.45 |


### Need to supply a function that operates on each group's DataFrame

- The groupby will happen, and then each group will have this function applied to its DataFrame
- **You are defining what should pass through the filter!**
- You get back the original DataFrame only including the rows that made it through the filter

#### Using a lambda function in the filter

I always find lambda functions a little awkward, but it is handy to be able to define a function in-place. `x` will represent a DataFrame after the groupby operation in this case.

In [21]:
df.groupby('key').filter(lambda x: x['data2'].mean() > 0.3)

| | key | data1 | data2 |
|---|---|---|---|
| 1 | B | 4 | 0.2 |
| 2 | C | 6 | 0.3 |
| 4 | B | 10 | 0.5 |
| 5 | C | 12 | 0.6 |
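The same lambda pattern covers the other filtration case mentioned at the start: discarding groups with only a few members. Since every group in `df` has exactly two rows, the sketch below uses a hypothetical `uneven` DataFrame made up for this illustration.

```python
import pandas as pd

# Hypothetical data with uneven group sizes, used only for this illustration
uneven = pd.DataFrame({'key': ['A', 'A', 'A', 'B', 'C', 'C'],
                       'value': [1, 2, 3, 4, 5, 6]})

# len(g) is the number of rows in one group's DataFrame;
# groups where the lambda returns False are dropped entirely
kept = uneven.groupby('key').filter(lambda g: len(g) >= 2)
print(kept)
```

Group B has a single row, so it is removed, while the rows of A and C pass through with their original index labels.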


#### Defining the filter function separately

The alternative to using a lambda function would have been to first define a function that takes a DataFrame and returns a boolean True/False value, and then pass it to the `.filter()` method

In [22]:
def pass_high_mean(x):
    return x['data2'].mean() > 0.3

df.groupby('key').filter(pass_high_mean)

| | key | data1 | data2 |
|---|---|---|---|
| 1 | B | 4 | 0.2 |
| 2 | C | 6 | 0.3 |
| 4 | B | 10 | 0.5 |
| 5 | C | 12 | 0.6 |
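Filtering and transforming are closely related. As a hedged sketch, the same rows can be selected by broadcasting the group mean back to every row with `.transform()` and using the result as a boolean mask. This is offered as an equivalent formulation for this example, not as a replacement the original notebook recommends.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': [2, 4, 6, 8, 10, 12],
                   'data2': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]})

# Approach from the section above: a function evaluated once per group
filtered = df.groupby('key').filter(lambda g: g['data2'].mean() > 0.3)

# Alternative: one group mean per row, compared against the threshold
mask = df.groupby('key')['data2'].transform('mean') > 0.3
masked = df.loc[mask, :]

# Both approaches keep the same rows in the same order
print(filtered.equals(masked))
```

The mask version avoids calling a Python function for every group, which may matter when there are many groups.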
