# The groupby operation (split-apply-combine)
 
The `group by` concept: we want to **apply the same function to subsets of a dataframe, where a key determines how the dataframe is split into those subsets**.

This operation is also referred to as the "split-apply-combine" operation, involving the following steps:
 
* **Splitting** the data into groups based on some criteria
* **Applying** a function to each group independently
* **Combining** the results into a data structure
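
The three steps above can be sketched without any dataframe library. The following is a minimal illustration that uses the same key/data values as the example dataframe in this notebook; the helper name `split_apply_combine` is made up for this sketch and is not part of pandas or Dask.

```python
from collections import defaultdict

def split_apply_combine(keys, values, func):
    """Group values by key, apply func to each group, and collect the results."""
    # Split: gather the values that belong to each key
    groups = defaultdict(list)
    for key, value in zip(keys, values):
        groups[key].append(value)
    # Apply and combine: reduce each group and store the result under its key
    return {key: func(group) for key, group in groups.items()}

keys = ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C']
data = [0, 5, 10, 5, 10, 15, 10, 15, 20]

sums = split_apply_combine(keys, data, sum)
print(sums)
```

Passing a different reducer (for example `max` or `len`) changes only the "apply" step; the split and combine steps stay the same.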

<img src="../images/splitApplyCombine.png">

## Start Dask Client for Dashboard

Starting the Dask Client is optional.  It will provide a dashboard which 
is useful to gain insight on the computation.  

The link to the dashboard will become visible when you create the client below.  We recommend keeping it open on one side of your screen while using your notebook on the other side.  Arranging your windows this way can take some effort, but seeing both at the same time is very useful when learning.

In [None]:
from dask.distributed import Client, progress
client = Client(n_workers=2, threads_per_worker=2, memory_limit='1GB')
client

In [None]:
import dask.dataframe as dd 
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt

## Create Dask Dataframe

To illustrate the groupby operation in the image above, let's see how this can be accomplished with Dask in the following steps:

- Create a pandas dataframe 

In [None]:
pdf = pd.DataFrame({'key':['A','B','C','A','B','C','A','B','C'],
                   'data': [0, 5, 10, 5, 10, 15, 10, 15, 20]})
pdf

- Create a dask dataframe from the created pandas dataframe 


In [None]:
ddf = dd.from_pandas(pdf, npartitions=5)
ddf

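You can see that the dask dataframe is a lazily-evaluated dataframe object: displaying it shows only its structure (column names, dtypes, and partitions), not the actual values.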

## GroupBy Operation

Dask provides a `groupby` method for us to do the **split-apply-combine** operation. 

Using `groupby`, we will group elements by column **`key`** and compute some aggregations such as `sum` and `mean`.

In [None]:
ddf.groupby(by='key')

## Aggregation 

Once the GroupBy object has been created, several methods are available to perform a computation on the grouped data. 

An obvious one is aggregation via the [`aggregate` or `agg()`](https://dask.pydata.org/en/latest/dataframe-groupby.html#aggregate) method:

In [None]:
ddf.groupby(by='key').aggregate('sum')

Once again, one notices that Dask returns a lazily-evaluated object until we ask it to compute the actual result. 

In [None]:
ddf.groupby(by='key').aggregate('sum').compute()

As you can see, the result of the aggregation will have the group names as the new index along the grouped axis.

The above can also be accomplished without using the `aggregate` method in the following way:

In [None]:
ddf.groupby(by='key').sum().compute()

In [None]:
ddf.groupby(by='key').mean().compute()

Aggregating functions are those that reduce each group to a single value, so the result has one row per group. Some common aggregating functions are tabulated below:

| **Function**  | **Description**               |
|---------------|-------------------------------|
| **`mean()`**  | Compute mean of groups        |
| **`sum()`**   | Compute sum of groups         |
| **`size()`**  | Compute group sizes           |
| **`count()`** | Compute count of group        |
| **`std()`**   | Standard deviation of groups  |
| **`var()`**   | Compute variance of groups    |
| **`first()`** | Compute first of group values |
| **`last()`**  | Compute last of group values  |
| **`min()`**   | Compute min of group values   |
| **`max()`**   | Compute max of group values   |
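
Two entries in the table are easy to confuse: `size()` and `count()`. The sketch below uses plain pandas and a made-up toy frame to show the difference when values are missing; it assumes that the Dask methods follow the same pandas semantics.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value in group 'A'
toy = pd.DataFrame({'key': ['A', 'A', 'B', 'B'],
                    'data': [1.0, np.nan, 3.0, 4.0]})
grouped = toy.groupby('key')['data']

sizes = grouped.size()    # rows per group, missing values included
counts = grouped.count()  # non-missing values per group
print(sizes)
print(counts)
```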

## Applying multiple functions at once 

With a grouped `dask.dataframe` you can also pass a list or dict of functions to `agg()`; the result is a DataFrame with one column per function:

In [None]:
grouped = ddf.groupby(by='key')
grouped

In [None]:
grouped.agg(['sum', 'mean', 'std', 'max', 'min']).compute()

The resulting columns are named after the functions themselves. If you need to rename them, you can chain a `rename` call:

In [None]:
grouped.agg(['sum', 'mean', 'std', 'max', 'min'])\
       .rename(columns={'sum': 'foo_sum',
                        'mean': 'bar_mean'})\
       .compute()
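
Aggregating with a list of functions produces two-level column labels such as `('data', 'sum')`. Another option, sketched here with plain pandas on the example dataframe, is to flatten those labels into single strings once the result has been computed. The `column_function` naming scheme is just one possible convention.

```python
import pandas as pd

pdf = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
                    'data': [0, 5, 10, 5, 10, 15, 10, 15, 20]})

result = pdf.groupby('key').agg(['sum', 'mean'])

# Join the (column, function) pairs into flat names like 'data_sum'
result.columns = ['_'.join(col) for col in result.columns]
print(result)
```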

## Transformation 

Sometimes you don't want to aggregate the groups, but transform the values in each group. In Pandas this can be achieved with the `transform()` method for groupby objects. This method returns an object that is indexed the same (same size) as the one being grouped.

At the time of writing, this method was not implemented in Dask (https://github.com/dask/dask/issues/2536). However, we can achieve the same functionality by combining `.apply()` with a custom function.

For example, suppose we wished to normalize the data within each group:

In [None]:
index = pd.date_range('10/1/1999', periods=1100)
ts = pd.DataFrame(data=np.random.normal(0.5, 2, 1100), index=index, columns=['data'])
ts = ts.rolling(window=100,min_periods=100).mean().dropna()
ts['key'] = ts.index.map(lambda x: x.year)
dts = dd.from_pandas(ts, npartitions=1)
#dts = dts.rolling(window=100,min_periods=100).mean().dropna()
dts.head()

In [None]:
grouped = dts.groupby(by='key')

In [None]:
grouped.agg('mean').head()

In [None]:
def normalize(group):
    return (group - group.mean()) / group.std()

# Transformed Data
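# meta describes the output dtypes: both columns come back as float64
# (the constant 'key' column turns into NaN after normalization and is dropped)
transformed = grouped.apply(normalize, meta={'data': 'float64', 'key': 'float64'})\
                     .drop('key', axis=1)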
transformed.head()

In [None]:
compare = dts.merge(transformed, left_index=True, right_index=True, 
          suffixes=('_original', '_transformed')).compute()
compare.head()

We would expect the result to now have mean 0 and standard deviation 1 within each group, which we can easily check by visually comparing the original and transformed data sets.

In [None]:
compare.drop(columns=['key']).plot()
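
The visual check can be complemented with a numeric one. The sketch below uses plain pandas on randomly generated toy data (not the time series above) and applies the same `normalize` function through `transform()`, then confirms that each group has mean 0 and standard deviation 1 up to floating-point error.

```python
import numpy as np
import pandas as pd

# Toy data: three groups of 50 random values each
rng = np.random.default_rng(0)
toy = pd.DataFrame({'key': np.repeat([1999, 2000, 2001], 50),
                    'data': rng.normal(0.5, 2, 150)})

def normalize(group):
    return (group - group.mean()) / group.std()

# transform() returns a result aligned with the original rows
toy['normalized'] = toy.groupby('key')['data'].transform(normalize)

# Per-group mean and standard deviation of the normalized values
stats = toy.groupby('key')['normalized'].agg(['mean', 'std'])
print(stats)
```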

## Applying group by on some real data

For this section, we will use the Titanic dataset available in the `data` directory. The original dataset can be found at https://www.kaggle.com/c/titanic/data. 


In [None]:
# Read titanic csv file into a Dask dataframe 
ddf = dd.read_csv("../data/titanic.csv")
ddf.head()

- Let's use `groupby()` to calculate the average age for each gender/sex. 

In [None]:
ddf.groupby(by='Sex')['Age'].mean().compute()

- Calculate the overall survival ratio for all passengers on the Titanic. 

In [None]:
(ddf['Survived'].sum() / len(ddf['Survived'])).compute()

- Calculate the survival ratio for passengers aged 25 or younger. 

In [None]:
# Use filtering/boolean indexing to select our group of interest
ddf25 = ddf[ddf['Age'] <= 25]

(ddf25['Survived'].sum() / len(ddf25['Survived'])).compute()
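
Since the ratio above is a sum divided by a row count, it equals the mean of the `Survived` column whenever that column holds only 0 and 1 with no missing entries. The sketch below checks this with plain pandas on a handful of invented rows (not taken from the real dataset).

```python
import pandas as pd

# Invented rows for illustration only
toy = pd.DataFrame({'Age': [22, 38, 26, 4, 25],
                    'Survived': [0, 1, 1, 1, 0]})

# Boolean indexing keeps passengers aged 25 or younger (25 is included)
young = toy[toy['Age'] <= 25]

ratio_from_sum = young['Survived'].sum() / len(young)
ratio_from_mean = young['Survived'].mean()
print(ratio_from_sum, ratio_from_mean)
```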

- Is there a difference in this survival ratio between the sexes? To answer this question, we will need to group by `Sex` and aggregate the mean for each group.


In [None]:
a = ddf.groupby('Sex')['Survived'].agg('mean').compute()
a

In [None]:
a.plot(kind='bar')

- Let's make a bar plot of the survival ratio for the different classes ('Pclass' column).

In [None]:
p = ddf.groupby('Pclass')['Survived'].agg('mean').compute()
p.plot(kind='bar')

In [None]:
p

- Applying multiple functions to columns in groups

To apply multiple functions to columns of grouped data, we pass `agg()` a dictionary that maps each column name to the list of functions to apply to it. 

In [None]:
ddf.groupby(by=['Pclass', 'Sex'])\
   .agg({'Survived': ['count', 'sum', 'mean'],
       'Fare': ['mean', 'min', 'max']})\
   .compute()
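
The result of a dictionary aggregation has two-level column labels of the form `(column, function)`. The following pandas sketch, built on a few invented rows that only borrow the column names of the Titanic dataset, shows how individual values can be looked up in such a result.

```python
import pandas as pd

# Invented rows for illustration only
toy = pd.DataFrame({'Pclass': [1, 1, 3, 3],
                    'Sex': ['female', 'male', 'female', 'male'],
                    'Survived': [1, 0, 1, 0],
                    'Fare': [80.0, 50.0, 8.0, 7.0]})

summary = toy.groupby('Pclass').agg({'Survived': ['count', 'mean'],
                                     'Fare': ['min', 'max']})

# Select a single cell with a row label and a (column, function) pair
class1_rate = summary.loc[1, ('Survived', 'mean')]
print(summary)
print(class1_rate)
```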

- We can apply a transformation such as **data normalization** by defining a custom function `normalize()`


In [None]:
def normalize(group):
    return (group - group.mean()) / group.std()

In [None]:
# Normalize the fare column
# meta gives the name and dtype of the resulting series
ddf.groupby(['Pclass','Sex'])['Fare'].apply(normalize, meta=('Fare', 'float64')).head(10)

## Summary 

This notebook covered the `groupby()` functionality in Dask through the following steps:

- Start Dask Client for Dashboard
- Create Dask dataframe 
- GroupBy Operation
- Aggregation 
- Applying multiple functions at once 
- Transformation 
- Applying group by on some real data

If you are interested in other examples, Dask's [official documentation](http://dask.pydata.org/en/latest/dataframe-groupby.html) is a great source. 