# Groupby Operations: Split-Apply-Combine

Grouped operations let us aggregate, transform, and filter data. 

These types of operations generally follow the same pattern:
1. data is split into separate parts based on key(s)
2. a function is applied to each part of the data
3. the results from each part are combined to create a new dataset

All the techniques in this notebook can be done without using Pandas' `groupby()` method; however, 
this method offers flexibility, ease of use, and faster code, and the same pattern lets you work with larger datasets
on distributed or parallel systems. 

---

## Aggregate

Aggregation is the process of taking multiple values and returning a single value. 

### Basic One-Variable Grouped Aggregation

Recall using `groupby()` to calculate the average life expectancy for each year in the `gapminder` dataset.

In [1]:
import pandas as pd
import warnings

warnings.filterwarnings('ignore')

df = pd.read_csv('../../data/gapminder.tsv', sep='\t')
df.columns

Index(['country', 'continent', 'year', 'lifeExp', 'pop', 'gdpPercap'], dtype='object')

In [2]:
# calculate average life expectancy for each year 
# split the data up by the year values then compute the 
# average lifeExp for each of those years
avg_life_exp_by_year = df.groupby('year')['lifeExp'].mean()
avg_life_exp_by_year

year
1952    49.057620
1957    51.507401
1962    53.609249
1967    55.678290
1972    57.647386
1977    59.570157
1982    61.533197
1987    63.212613
1992    64.160338
1997    65.014676
2002    65.694923
2007    67.007423
Name: lifeExp, dtype: float64

A groupby statement can be thought of as creating a subset of the data for each unique value of a column.
Whatever column we call `groupby()` on, the resulting grouped object contains one group for each 
unique value in that column. 

In our example, we grouped on the `year` column, so the grouped object contains one group for each distinct 
value of `year`.

We can get those same unique values with the `unique()` method.

In [3]:
years = df.year.unique()
years

array([1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, 2002,
       2007])

Next, we can go through each of the years and subset the data for a given year.

In [4]:
# subset the data for 1952
y1952 = df.loc[df.year == 1952, :]
y1952

,country,continent,year,lifeExp,pop,gdpPercap
0,Afghanistan,Asia,1952,28.801,8425333,779.445314
12,Albania,Europe,1952,55.230,1282697,1601.056136
24,Algeria,Africa,1952,43.077,9279525,2449.008185
36,Angola,Africa,1952,30.015,4232095,3520.610273
48,Argentina,Americas,1952,62.485,17876956,5911.315053
...,...,...,...,...,...,...
1644,Vietnam,Asia,1952,40.412,26246839,605.066492
1656,West Bank and Gaza,Asia,1952,43.160,1030585,1515.592329
1668,"Yemen, Rep.",Asia,1952,32.548,4963829,781.717576
1680,Zambia,Africa,1952,42.038,2672000,1147.388831


Finally, we can perform a function on the subset of the data. 

In [5]:
y1952_mean = y1952['lifeExp'].mean()
y1952_mean

49.057619718309866

These steps were spelled out explicitly to show how the `groupby()` method works. The method can be used with almost any type of aggregation function.
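The manual steps above can be repeated for every group and compared with `groupby()`. This is a minimal sketch that assumes a small made-up DataFrame in place of `gapminder`; the values are illustrative only.

```python
import pandas as pd

# small illustrative dataset (not the real gapminder data)
df = pd.DataFrame({
    'year': [1952, 1952, 1957, 1957],
    'lifeExp': [30.0, 50.0, 40.0, 60.0],
})

# split: loop over each unique key; apply: compute the mean of each subset
manual = {}
for year in df['year'].unique():
    subset = df.loc[df['year'] == year, 'lifeExp']
    manual[year] = subset.mean()

# combine: collect the per-group results into a single Series
manual = pd.Series(manual, name='lifeExp')

# the same computation expressed with groupby()
grouped = df.groupby('year')['lifeExp'].mean()

print(manual)
print(grouped)
```

Both objects should hold one mean per year; `groupby()` simply performs the split, apply, and combine steps in one call.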

### Built-in Pandas Aggregation Methods 

As an example of using one of the built-in aggregation functions, we can compute summary stats on one of the 
variables grouped by a specific value.

In [6]:
# for each continent, compute the summary stats for life expectancy
df.groupby('continent')['lifeExp'].describe()

continent,count,mean,std,min,25%,50%,75%,max
Africa,624.0,48.86533,9.15021,23.599,42.3725,47.792,54.4115,76.442
Americas,300.0,64.658737,9.345088,37.579,58.41,67.048,71.6995,80.653
Asia,396.0,60.064903,11.864532,28.801,51.42625,61.7915,69.50525,82.603
Europe,360.0,71.903686,5.433178,43.585,69.57,72.241,75.4505,81.757
Oceania,24.0,74.326208,3.795611,69.12,71.205,73.665,77.5525,81.235


I find it helps to read the previous line of code as 'for each continent, compute the summary stats for life expectancy'.

### Aggregation Functions

The above example used a Pandas method as the aggregation function; however, we can also use the 
`.agg()` or `.aggregate()` method and pass in a `NumPy` or `SciPy` function instead.

#### Functions from Other Libraries

We can pass the function object into the `.agg()` method. Let's look at using `NumPy`'s mean function instead 
of the Pandas version. 

In [7]:
import numpy as np

df.groupby('continent')['lifeExp'].agg(np.mean)

continent
Africa      48.865330
Americas    64.658737
Asia        60.064903
Europe      71.903686
Oceania     74.326208
Name: lifeExp, dtype: float64
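As a hedged aside, `.agg()` also accepts the name of a built-in aggregation as a string. Some recent pandas releases emit a `FutureWarning` when `np.mean` is passed directly, so the string form may be the safer choice there. The sketch below assumes a small made-up DataFrame rather than `gapminder`.

```python
import pandas as pd

# small illustrative dataset (not the real gapminder data)
df = pd.DataFrame({
    'continent': ['Asia', 'Asia', 'Europe', 'Europe'],
    'lifeExp': [40.0, 50.0, 70.0, 80.0],
})

# pass the aggregation by name instead of as a function object
by_name = df.groupby('continent')['lifeExp'].agg('mean')

# the equivalent direct method call
by_method = df.groupby('continent')['lifeExp'].mean()

print(by_name)
print(by_method)
```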

#### Custom User Functions

Of course, we can always use our own functions instead of using one from another library. 

It works the same as passing another library function. We simply pass the function object.

In [8]:
def my_mean(values):
    """
    Compute the mean of an array-like object
    (e.g. a Pandas Series)
    """
    n = len(values)

    # accumulate into `total` so the built-in sum() is not shadowed
    total = 0
    for value in values:
        total += value

    return total / n

In [9]:
# pass our custom function 
df.groupby('year')['lifeExp'].agg(my_mean)

year
1952    49.057620
1957    51.507401
1962    53.609249
1967    55.678290
1972    57.647386
1977    59.570157
1982    61.533197
1987    63.212613
1992    64.160338
1997    65.014676
2002    65.694923
2007    67.007423
Name: lifeExp, dtype: float64

If we use a custom function with multiple parameters, the first parameter receives the series of values for each group and 
the remaining parameters are passed as keyword arguments to `.agg()`.

To demonstrate this, we'll calculate the global average life expectancy and subtract it from the grouped value.

In [10]:
def my_mean_diff(values, diff_value):
    """
    Difference between the mean of values and diff_value
    (e.g. the global average life expectancy)
    """
    n = len(values)
    # accumulate into `total` so the built-in sum() is not shadowed
    total = 0
    for value in values:
        total += value
    mean = total / n 
    return mean - diff_value

In [11]:
# calculate global average life expectancy mean 
global_mean = df['lifeExp'].mean()
global_mean

59.474439366197174

In [12]:
# use custom aggregation function with multiple parameters
df.groupby('year')['lifeExp'].agg(my_mean_diff, diff_value=global_mean)

year
1952   -10.416820
1957    -7.967038
1962    -5.865190
1967    -3.796150
1972    -1.827053
1977     0.095718
1982     2.058758
1987     3.738173
1992     4.685899
1997     5.540237
2002     6.220483
2007     7.532983
Name: lifeExp, dtype: float64
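To make the keyword-argument forwarding concrete, here is a minimal sketch on a small made-up DataFrame. The helper name `mean_diff` and the data are assumptions for illustration. It also compares the result against simply subtracting the global mean from the grouped means.

```python
import pandas as pd

def mean_diff(values, diff_value):
    """Mean of the group's values minus a reference value."""
    return values.mean() - diff_value

# small illustrative dataset (not the real gapminder data)
df = pd.DataFrame({
    'year': [1952, 1952, 1957, 1957],
    'lifeExp': [30.0, 50.0, 40.0, 60.0],
})

overall = df['lifeExp'].mean()

# diff_value is forwarded by .agg() to mean_diff for every group
via_agg = df.groupby('year')['lifeExp'].agg(mean_diff, diff_value=overall)

# the same result without a custom function
via_subtraction = df.groupby('year')['lifeExp'].mean() - overall

print(via_agg)
print(via_subtraction)
```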

### Multiple Functions Simultaneously

You can pass multiple functions to `.agg()` as a Python list to compute several aggregations at once.

In [13]:
# calculate the count, mean, and std of lifeExp by year
df.groupby('year')['lifeExp'].agg([np.count_nonzero, np.mean, np.std])

year,count_nonzero,mean,std
1952,142,49.05762,12.225956
1957,142,51.507401,12.231286
1962,142,53.609249,12.097245
1967,142,55.67829,11.718858
1972,142,57.647386,11.381953
1977,142,59.570157,11.227229
1982,142,61.533197,10.770618
1987,142,63.212613,10.556285
1992,142,64.160338,11.22738
1997,142,65.014676,11.559439


### Use a dict in `.agg()`

We can pass a Python dictionary to this method, and the result depends on whether we're calling it on a DataFrame or a Series object. 

#### On a DataFrame

When using a dictionary on a grouped DataFrame, the keys are the columns we want to compute on and the values are the aggregation functions.

In [14]:
# for each year, compute the average life expectancy, the 
# median population, and the median GDP per capita
df.groupby('year').agg(
    {
        'lifeExp': 'mean', 
        'pop': 'median',
        'gdpPercap': 'median'
    }
)

year,lifeExp,pop,gdpPercap
1952,49.05762,3943953.0,1968.528344
1957,51.507401,4282942.0,2173.220291
1962,53.609249,4686039.5,2335.439533
1967,55.67829,5170175.5,2678.33474
1972,57.647386,5877996.5,3339.129407
1977,59.570157,6404036.5,3798.609244
1982,61.533197,7007320.0,4216.228428
1987,63.212613,7774861.5,4280.300366
1992,64.160338,8688686.5,4386.085502
1997,65.014676,9735063.5,4781.825478


#### On a Series

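Passing a dictionary to `.agg()` on a grouped Series was historically a way to rename the output columns. That behavior was deprecated and later removed from Pandas, so it is not covered here.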

---

## Transform

The `transform()` method takes multiple values and returns a one-to-one transformation of the values, so the result has the same number of rows as the input.
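The contrast with aggregation can be sketched on a small made-up DataFrame: an aggregation returns one value per group, whereas a transformation returns one value per original row.

```python
import pandas as pd

# small illustrative dataset (not the real gapminder data)
df = pd.DataFrame({
    'year': [1952, 1952, 1957],
    'lifeExp': [30.0, 50.0, 40.0],
})

grouped = df.groupby('year')['lifeExp']

# aggregation: one value per group
agg_result = grouped.mean()

# transformation: the group mean is broadcast back to every original row
transform_result = grouped.transform('mean')

print(len(agg_result), len(transform_result))
```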

### Z-Score Example

As an example of a transformation, we'll calculate the z-score of life expectancy by year.

Recall that the z-score is the number of standard deviations a data point lies from the mean of our data. 
It standardizes our data by centering it around 0 with a standard deviation of 1. 
This allows us to compare different variables. 

$$z = \frac{x-\mu}{\sigma}$$

where, 
* $x$ is a data point 
* $\mu$ is the average of our dataset
* $\sigma$ is the standard deviation given by 
$$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}{(x_i-\mu)^2}}$$

So what we'll do is write a function that calculates a z-score and then use this function to transform our data by group.

In [17]:
def my_zscore(x):
    """
    Calculates the z-score of provided data where 'x' is 
    a vector or series of values
    """
    return ((x - x.mean()) / x.std())

In [18]:
# now, let's transform our data by group
transform_z = df.groupby('year')['lifeExp'].transform(my_zscore)
transform_z

0      -1.656854
1      -1.731249
2      -1.786543
3      -1.848157
4      -1.894173
          ...   
1699   -0.081621
1700   -0.336974
1701   -1.574962
1702   -2.093346
1703   -1.948180
Name: lifeExp, Length: 1704, dtype: float64

Luckily for us `scipy` has many standard statistical computing functions baked into it. 
We'll use their implementation of a z-score function to transform with a group by and without grouping. 

In [20]:
from scipy.stats import zscore

# calculate a grouped zscore
sp_z_grouped = df.groupby('year')['lifeExp'].transform(zscore)

# calculate without grouping, directly on the lifeExp column
sp_z_nogroup = zscore(df["lifeExp"])

In [21]:
sp_z_grouped.head()

0   -1.662719
1   -1.737377
2   -1.792867
3   -1.854699
4   -1.900878
Name: lifeExp, dtype: float64

In [22]:
sp_z_nogroup.head()

0   -2.375334
1   -2.256774
2   -2.127837
3   -1.971178
4   -1.811033
Name: lifeExp, dtype: float64

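Observe the large difference in z-scores for the same observations. 
When we use a transformation function such as `zscore` without grouping, it operates on the entire dataset.

So use caution and be sure to group first when a transformation should be computed within each group. 

Note also that the grouped `scipy` values differ slightly from our `my_zscore` results: the Pandas `std()` method divides by $n-1$ by default, while `scipy.stats.zscore` divides by $n$, as in the formula for $\sigma$ given earlier.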
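The choice of divisor in the standard deviation also changes the z-scores. As a sketch on a small made-up DataFrame, the hypothetical helper `zscore_ddof` below exposes the `ddof` argument of the Pandas `std()` method: `ddof=1` divides by $n-1$ (the Pandas default) and `ddof=0` divides by $n$ (the default of `scipy.stats.zscore`).

```python
import pandas as pd

def zscore_ddof(x, ddof):
    """z-score using a standard deviation with the given delta degrees of freedom."""
    return (x - x.mean()) / x.std(ddof=ddof)

# small illustrative dataset (not the real gapminder data)
df = pd.DataFrame({
    'year': [1952] * 3 + [1957] * 3,
    'lifeExp': [30.0, 40.0, 50.0, 35.0, 45.0, 55.0],
})

grouped = df.groupby('year')['lifeExp']

# sample standard deviation (divide by n - 1)
z_sample = grouped.transform(lambda x: zscore_ddof(x, ddof=1))

# population standard deviation (divide by n)
z_population = grouped.transform(lambda x: zscore_ddof(x, ddof=0))

print(pd.DataFrame({'ddof=1': z_sample, 'ddof=0': z_population}))
```

The two columns differ by a constant factor within each group, and the gap shrinks as the groups get larger.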

### Missing Value Example

The next section of the text covers how to deal with missing values in depth, so we will leave this example as a placeholder for now. 

--- 

## Filter

`filter()` allows us to split data by keys and then perform boolean subsetting on the groups. 
We can also do this without `groupby()` by using regular subsetting. 

In [25]:
import seaborn as sns

tips = sns.load_dataset('tips')

# observe the number of rows in the original data
tips.shape

(244, 7)

In [26]:
tips.head()

,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [27]:
# frequency counts for the table size variable
tips['size'].value_counts()

size
2    156
3     38
4     37
5      5
1      4
6      4
Name: count, dtype: int64

For the sake of this example, suppose our analysis should keep only those table sizes with frequency counts
of at least 30.
We can then use `filter()` to drop the data points that don't meet this requirement. 

In [30]:
# filter the data with the proper number of observations
tips_filtered = (
    tips
    .groupby('size')  
    .filter(lambda x: x['size'].count() >= 30)
)

tips_filtered.shape

(231, 7)
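The same rows can be selected with regular boolean subsetting. This is a sketch on a small made-up DataFrame standing in for `tips`, with an assumed threshold `min_count`: a grouped `transform('count')` gives every row the size of its group, and that Series is then used as a mask.

```python
import pandas as pd

# small illustrative stand-in for the tips data
tips_small = pd.DataFrame({
    'size': [2, 2, 2, 3, 3, 4],
    'tip': [1.0, 2.0, 1.5, 3.0, 2.5, 4.0],
})
min_count = 2  # assumed threshold for this toy example

# approach 1: groupby + filter keeps whole groups that pass the test
via_filter = tips_small.groupby('size').filter(lambda g: len(g) >= min_count)

# approach 2: attach each row's group count, then subset with a boolean mask
group_counts = tips_small.groupby('size')['tip'].transform('count')
via_mask = tips_small[group_counts >= min_count]

print(via_filter)
print(via_mask)
```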

In [46]:
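# filter() expects a function that returns True or False for each group;
# passing a boolean Series instead raises a TypeError
tips.groupby('total_bill').filter(tips['total_bill'] > 10)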

TypeError: 'Series' object is not callable
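The error arises because `filter()` calls its argument once per group. Two ways around it are sketched here on a small made-up DataFrame standing in for `tips`: plain boolean subsetting when the condition applies to individual rows, or a function returning a single boolean when the condition applies to whole groups.

```python
import pandas as pd

# small illustrative stand-in for the tips data
tips_small = pd.DataFrame({
    'day': ['Sun', 'Sun', 'Sat', 'Sat'],
    'total_bill': [8.0, 25.0, 12.0, 30.0],
})

# row-level condition: no groupby needed, just a boolean mask
rows_over_10 = tips_small[tips_small['total_bill'] > 10]

# group-level condition: keep only days where every bill is over 10
days_all_over_10 = tips_small.groupby('day').filter(
    lambda g: bool((g['total_bill'] > 10).all())
)

print(rows_over_10)
print(days_all_over_10)
```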

---

## The pandas.core.groupby.DataFrameGroupBy object 
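Calling `groupby()` on its own does not compute anything yet; it returns a grouped object that can be inspected. The following is a minimal sketch on a small made-up DataFrame (the values are illustrative, not taken from `gapminder`).

```python
import pandas as pd

# small illustrative dataset (not the real gapminder data)
df = pd.DataFrame({
    'continent': ['Asia', 'Asia', 'Europe'],
    'lifeExp': [40.0, 50.0, 70.0],
})

grouped = df.groupby('continent')

# the grouped object is a DataFrameGroupBy, not a DataFrame
print(type(grouped).__name__)

# number of groups and the row labels that belong to each group
print(grouped.ngroups)
print(grouped.groups)

# pull out a single group as a regular DataFrame
asia = grouped.get_group('Asia')
print(asia)

# iterate over (key, sub-DataFrame) pairs
for name, part in grouped:
    print(name, len(part))
```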

