# Pandas 07 Grouping and Aggregating

by Nova@Douban

A video recording of this session is available here: https://zoom.us/recording/share/s_pFpdvC9VD2Ve0jgzN9McNTmRi6NcxP2yrDk-SVb2SwIumekTziMw

---

## 7.1 Split-Apply-Combine

### 7.1.1 Introduction to SAC

To process large datasets, we often use the 'split-apply-combine' (SAC) strategy to improve computing efficiency. SAC consists of three steps:

1. split the dataset into subsets;
2. apply a computation to each subset;
3. combine the per-subset results into a final result.
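
The three steps above can be sketched in plain Python, without pandas. This is only an illustration of the idea: the `records` list and its values are made up for the example and are not taken from the S&P data.

```python
from collections import defaultdict

# Toy (category, price) pairs; hypothetical values for illustration only.
records = [("High", 3.0), ("Low", 1.0), ("High", 5.0), ("Low", 2.0)]

# 1. split: bucket the values by their category key
buckets = defaultdict(list)
for category, value in records:
    buckets[category].append(value)

# 2. apply: compute the mean of each bucket independently
means = {category: sum(values) / len(values)
         for category, values in buckets.items()}

# 3. combine: gather the per-bucket results into one sorted list of pairs
combined = sorted(means.items())
print(combined)
```

Because each bucket is processed independently in step 2, the work could in principle be spread over several workers, which is what makes the strategy attractive for large datasets.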

<img src="../image/SAC.png">

About 10 years ago, this strategy inspired a famous computing model -- MapReduce, which then led to Hadoop. In pandas, we can also apply this strategy to process datasets.

### 7.1.2 SAC in pandas

Suppose we need to calculate the mean price of the S&P historical data. We can calculate the means one by one by applying `mean` to each Series, or calculate them all at once by applying `mean` to the entire DataFrame.

In [16]:
import pandas as pd
import numpy as np
%load_ext memory_profiler

sp0 = pd.read_csv('../data/gspc.csv', index_col='Date')
price = sp0[['High', 'Low', 'Open', 'Close']].reset_index()

display(price.head())
display(price['High'].mean(), price['Low'].mean(), price['Open'].mean(), price['Close'].mean())



Unnamed: 0,Date,High,Low,Open,Close
0,2018-12-28,2520.27002,2472.889893,2498.77002,2485.73999
1,2018-12-31,2509.23999,2482.820068,2498.939941,2506.850098
2,2019-01-02,2519.48999,2467.469971,2476.959961,2510.030029
3,2019-01-03,2493.139893,2443.959961,2491.919922,2447.889893
4,2019-01-04,2538.070068,2474.330078,2474.330078,2531.939941


2516.0419922

2468.2939942000003

2488.1839844

2496.4899902

In [2]:
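# numeric_only=True skips the string 'Date' column (required on recent pandas versions)
display(price.mean(numeric_only=True))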

High     2516.041992
Low      2468.293994
Open     2488.183984
Close    2496.489990
dtype: float64

---

However, we can also apply SAC to this problem by using `DataFrame.groupby()`. This can improve computing efficiency, because in pandas all of the data is held in memory as a single system.

In [3]:
melted = price.melt(id_vars=['Date'], var_name='Category', value_name='Price')
display(melted.head(10))
grouped = melted.groupby('Category')  # split
display(grouped.mean(numeric_only=True))  # apply and combine (numeric columns only)

Unnamed: 0,Date,Category,Price
0,2018-12-28,High,2520.27002
1,2018-12-31,High,2509.23999
2,2019-01-02,High,2519.48999
3,2019-01-03,High,2493.139893
4,2019-01-04,High,2538.070068
5,2018-12-28,Low,2472.889893
6,2018-12-31,Low,2482.820068
7,2019-01-02,Low,2467.469971
8,2019-01-03,Low,2443.959961
9,2019-01-04,Low,2474.330078


Unnamed: 0_level_0,Price
Category,Unnamed: 1_level_1
Close,2496.48999
High,2516.041992
Low,2468.293994
Open,2488.183984


---

In a nutshell, SAC in pandas is accomplished by:

1. __split__ with `DataFrame.groupby`;
2. __apply__ with the methods of the resulting GroupBy object;
3. __combine__, which pandas performs itself.
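
As a compact check of this summary, the sketch below rebuilds a small part of the price table by hand (the first three `High` and `Low` rows shown earlier) and confirms that the grouped mean matches the column-by-column mean. The variable names `by_group` and `direct` are chosen for the example.

```python
import pandas as pd

# First three rows of the High and Low columns from the table above.
price = pd.DataFrame({
    "Date": ["2018-12-28", "2018-12-31", "2019-01-02"],
    "High": [2520.27002, 2509.23999, 2519.48999],
    "Low": [2472.889893, 2482.820068, 2467.469971],
})

# Reshape to long format: one row per (Date, Category) pair.
melted = price.melt(id_vars=["Date"], var_name="Category", value_name="Price")

# split by Category, apply mean to Price; pandas combines the results.
by_group = melted.groupby("Category")["Price"].mean()

# The same numbers computed column by column, without groupby.
direct = price[["High", "Low"]].mean()

print(by_group)
print(direct)
```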
    
---

## 7.2 Split

`DataFrame.groupby` is the primary method in pandas for splitting a dataset by category. It returns a special `DataFrameGroupBy` object (shown below as `pandas.core.groupby.groupby.DataFrameGroupBy`; the module path differs between pandas versions).

The GroupBy object is a very flexible abstraction -- we can treat it as a collection of DataFrames.

In [4]:
grouped = melted.groupby('Category')
display(grouped, type(grouped))

<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x10bd4bef0>

pandas.core.groupby.groupby.DataFrameGroupBy

---

We can inspect this object through its attributes and methods, for example `ngroups`, `groups`, `size()`, `count()`, etc.

In [5]:
display(grouped.ngroups, grouped.groups, grouped.size(), grouped.count())

4

{'Close': Int64Index([15, 16, 17, 18, 19], dtype='int64'),
 'High': Int64Index([0, 1, 2, 3, 4], dtype='int64'),
 'Low': Int64Index([5, 6, 7, 8, 9], dtype='int64'),
 'Open': Int64Index([10, 11, 12, 13, 14], dtype='int64')}

Category
Close    5
High     5
Low      5
Open     5
dtype: int64

Unnamed: 0_level_0,Date,Price
Category,Unnamed: 1_level_1,Unnamed: 2_level_1
Close,5,5
High,5,5
Low,5,5
Open,5,5
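
In the output above `size()` and `count()` agree because the data has no missing values. The following sketch, which uses a made-up frame with one missing price, shows where they differ: `size()` counts rows, while `count()` counts non-null values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing price.
df = pd.DataFrame({
    "Category": ["High", "High", "Low", "Low"],
    "Price": [2520.27, np.nan, 2472.89, 2482.82],
})
g = df.groupby("Category")

print(g.ngroups)   # number of distinct groups
print(g.size())    # rows per group, missing values included
print(g.count())   # non-null values per column and group
```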


---

We can also use `head`, `tail`, `get_group`, `nth`, etc. to access the data inside this object.

In [18]:
display(grouped.get_group('Open'), grouped.head(1), grouped.tail(1), grouped.nth(0))

Unnamed: 0,Date,Price
10,2018-12-28,2498.77002
11,2018-12-31,2498.939941
12,2019-01-02,2476.959961
13,2019-01-03,2491.919922
14,2019-01-04,2474.330078


Unnamed: 0,Date,Category,Price
0,2018-12-28,High,2520.27002
5,2018-12-28,Low,2472.889893
10,2018-12-28,Open,2498.77002
15,2018-12-28,Close,2485.73999


Unnamed: 0,Date,Category,Price
4,2019-01-04,High,2538.070068
9,2019-01-04,Low,2474.330078
14,2019-01-04,Open,2474.330078
19,2019-01-04,Close,2531.939941


Unnamed: 0_level_0,Date,Price
Category,Unnamed: 1_level_1,Unnamed: 2_level_1
Close,2018-12-28,2485.73999
High,2018-12-28,2520.27002
Low,2018-12-28,2472.889893
Open,2018-12-28,2498.77002


---

`DataFrame.groupby` can split data on multiple keys at once; pass the column names as a list.

In [7]:
group = melted.groupby(['Date', 'Category'])
display(group.describe())

Unnamed: 0_level_0,Unnamed: 1_level_0,Price,Price,Price,Price,Price,Price,Price,Price
Unnamed: 0_level_1,Unnamed: 1_level_1,count,mean,std,min,25%,50%,75%,max
Date,Category,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2
2018-12-28,Close,1.0,2485.73999,,2485.73999,2485.73999,2485.73999,2485.73999,2485.73999
2018-12-28,High,1.0,2520.27002,,2520.27002,2520.27002,2520.27002,2520.27002,2520.27002
2018-12-28,Low,1.0,2472.889893,,2472.889893,2472.889893,2472.889893,2472.889893,2472.889893
2018-12-28,Open,1.0,2498.77002,,2498.77002,2498.77002,2498.77002,2498.77002,2498.77002
2018-12-31,Close,1.0,2506.850098,,2506.850098,2506.850098,2506.850098,2506.850098,2506.850098
2018-12-31,High,1.0,2509.23999,,2509.23999,2509.23999,2509.23999,2509.23999,2509.23999
2018-12-31,Low,1.0,2482.820068,,2482.820068,2482.820068,2482.820068,2482.820068,2482.820068
2018-12-31,Open,1.0,2498.939941,,2498.939941,2498.939941,2498.939941,2498.939941,2498.939941
2019-01-02,Close,1.0,2510.030029,,2510.030029,2510.030029,2510.030029,2510.030029,2510.030029
2019-01-02,High,1.0,2519.48999,,2519.48999,2519.48999,2519.48999,2519.48999,2519.48999


--- 

## 7.3 Apply


Once the data is split into groups, the following operations can be applied:

- Aggregation: to calculate a summary statistic for each group.
- Transformation: to perform group- or item-specific calculations.
- Filtration: to remove entire groups of data.
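
The three kinds of operation differ mainly in the shape of what they return. The sketch below illustrates this on a made-up four-row frame (the values are hypothetical): aggregation returns one value per group, transformation one value per original row, and filtration a subset of the original rows.

```python
import pandas as pd

# Hypothetical toy data: two groups of two rows each.
df = pd.DataFrame({
    "Category": ["High", "High", "Low", "Low"],
    "Price": [10.0, 14.0, 1.0, 2.0],
})
g = df.groupby("Category")["Price"]

aggregated = g.mean()                               # one value per group
transformed = g.transform(lambda s: s - s.mean())   # one value per original row
# keep only the groups whose mean price exceeds 5
filtered = df.groupby("Category").filter(lambda d: d["Price"].mean() > 5)

print(len(aggregated), len(transformed), len(filtered))
```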

### 7.3.1 Aggregation

A GroupBy object provides some aggregations directly as methods (such as `mean`), and accepts external aggregation functions (such as `np.mean`) through the `agg` method.

In [8]:
display(grouped.mean(), grouped.agg(np.mean))

Unnamed: 0_level_0,Price
Category,Unnamed: 1_level_1
Close,2496.48999
High,2516.041992
Low,2468.293994
Open,2488.183984


Unnamed: 0_level_0,Price
Category,Unnamed: 1_level_1
Close,2496.48999
High,2516.041992
Low,2468.293994
Open,2488.183984


---

`agg` can also take several aggregations at once as a list, but this may reduce performance significantly, as the timings below show. Choose between computing performance and code simplicity according to your needs.

In [9]:
%timeit grouped.agg([np.sum, np.std])

%timeit grouped.sum()
%timeit grouped.std()

3.97 ms ± 131 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
397 µs ± 4.97 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
573 µs ± 10.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


Another benefit of `agg` is that it can apply different aggregations to different columns at once by taking a dict as input: column names as keys, and the aggregation to apply to each column as values.

In [10]:
grouped.agg([np.sum, np.std])

group.agg({'Date': np.sum,
            'Price': np.mean})

Unnamed: 0_level_0,Unnamed: 1_level_0,Date,Price
Date,Category,Unnamed: 2_level_1,Unnamed: 3_level_1
2018-12-28,Close,2018-12-28,2485.73999
2018-12-28,High,2018-12-28,2520.27002
2018-12-28,Low,2018-12-28,2472.889893
2018-12-28,Open,2018-12-28,2498.77002
2018-12-31,Close,2018-12-31,2506.850098
2018-12-31,High,2018-12-31,2509.23999
2018-12-31,Low,2018-12-31,2482.820068
2018-12-31,Open,2018-12-31,2498.939941
2019-01-02,Close,2019-01-02,2510.030029
2019-01-02,High,2019-01-02,2519.48999
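
A smaller sketch of the dict form, assuming a frame with a second numeric column. The `Volume` column and all values here are hypothetical and only serve to show two different aggregations side by side. String names such as `"mean"` are used instead of NumPy functions; `agg` accepts both.

```python
import pandas as pd

# Hypothetical toy data; 'Volume' is not part of the original price table.
df = pd.DataFrame({
    "Category": ["High", "High", "Low", "Low"],
    "Price": [10.0, 14.0, 1.0, 2.0],
    "Volume": [100, 200, 300, 400],
})

# Keys are column names, values are the aggregation for that column.
result = df.groupby("Category").agg({"Price": "mean", "Volume": "sum"})
print(result)
```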


A full list of the aggregation functions available with `DataFrame.groupby`:

<img src="../image/groupby.png">

---

### 7.3.2 Transformation

The `.transform()` method applies a function to each group and returns either a Series or a DataFrame with the following properties:

1. It is indexed identically to the concatenation of the indexes of all the groups.
2. Its number of rows equals the sum of the number of rows in all the groups.
3. It consists of the non-grouping columns to which pandas was able to apply the given function.

The `.transform()` method does not change the original data or the data in the groups it is applied to.
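
These properties can be checked directly. The sketch below uses a made-up frame whose groups are interleaved, so it is visible that the result follows the original row order rather than the group order; the name `demeaned` is chosen for the example.

```python
import pandas as pd

# Hypothetical toy data with interleaved groups.
df = pd.DataFrame({
    "Category": ["High", "Low", "High", "Low"],
    "Price": [10.0, 1.0, 14.0, 2.0],
})
before = df.copy()

# Subtract each group's mean from its own members.
demeaned = df.groupby("Category")["Price"].transform(lambda s: s - s.mean())

print(demeaned.index.equals(df.index))  # same index as the input
print(df.equals(before))                # the input is left unchanged
print(demeaned.tolist())
```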

In [19]:
display(grouped.transform(lambda x: x - x.mean()).head(3), grouped.head(3))

Unnamed: 0,Price
0,4.228028
1,-6.802002
2,3.447998


Unnamed: 0,Date,Category,Price
0,2018-12-28,High,2520.27002
1,2018-12-31,High,2509.23999
2,2019-01-02,High,2519.48999
5,2018-12-28,Low,2472.889893
6,2018-12-31,Low,2482.820068
7,2019-01-02,Low,2467.469971
10,2018-12-28,Open,2498.77002
11,2018-12-31,Open,2498.939941
12,2019-01-02,Open,2476.959961
15,2018-12-28,Close,2485.73999


---

### 7.3.3 Filtering

`groupby.filter()` returns a copy of the DataFrame that excludes all rows from groups that do not satisfy the boolean criterion specified by `func`. In the example below, only groups whose `Price` standard deviation exceeds 20 are kept.

In [21]:
display(grouped.head(), grouped.filter(lambda x: x['Price'].std() > 20))

Unnamed: 0,Date,Category,Price
0,2018-12-28,High,2520.27002
1,2018-12-31,High,2509.23999
2,2019-01-02,High,2519.48999
3,2019-01-03,High,2493.139893
4,2019-01-04,High,2538.070068
5,2018-12-28,Low,2472.889893
6,2018-12-31,Low,2482.820068
7,2019-01-02,Low,2467.469971
8,2019-01-03,Low,2443.959961
9,2019-01-04,Low,2474.330078


Unnamed: 0,Date,Category,Price
15,2018-12-28,Close,2485.73999
16,2018-12-31,Close,2506.850098
17,2019-01-02,Close,2510.030029
18,2019-01-03,Close,2447.889893
19,2019-01-04,Close,2531.939941
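
The function passed to `filter()` receives each group as a sub-DataFrame and must return a single boolean for the whole group. A minimal sketch with made-up data, keeping only groups that have at least two rows:

```python
import pandas as pd

# Hypothetical toy data: 'High' has two rows, the other groups one each.
df = pd.DataFrame({
    "Category": ["High", "Low", "High", "Open"],
    "Price": [10.0, 1.0, 14.0, 5.0],
})

# len(d) is the number of rows in the group; one True/False per group.
kept = df.groupby("Category").filter(lambda d: len(d) >= 2)
print(kept)
```

The surviving rows keep their original index labels and order, so the result can be aligned with the input frame.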


## 7.4 More about groupby object

### 7.4.1 Iterating over groups

We can iterate over a GroupBy object much like the items of a dict: each iteration yields the group name and the corresponding sub-DataFrame.

In [13]:
for name, group in grouped: 
    print(name)
    display(group)

Close


Unnamed: 0,Date,Category,Price
15,2018-12-28,Close,2485.73999
16,2018-12-31,Close,2506.850098
17,2019-01-02,Close,2510.030029
18,2019-01-03,Close,2447.889893
19,2019-01-04,Close,2531.939941


High


Unnamed: 0,Date,Category,Price
0,2018-12-28,High,2520.27002
1,2018-12-31,High,2509.23999
2,2019-01-02,High,2519.48999
3,2019-01-03,High,2493.139893
4,2019-01-04,High,2538.070068


Low


Unnamed: 0,Date,Category,Price
5,2018-12-28,Low,2472.889893
6,2018-12-31,Low,2482.820068
7,2019-01-02,Low,2467.469971
8,2019-01-03,Low,2443.959961
9,2019-01-04,Low,2474.330078


Open


Unnamed: 0,Date,Category,Price
10,2018-12-28,Open,2498.77002
11,2018-12-31,Open,2498.939941
12,2019-01-02,Open,2476.959961
13,2019-01-03,Open,2491.919922
14,2019-01-04,Open,2474.330078
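
Since iteration yields (name, DataFrame) pairs, the groups can be collected into an ordinary dict when that is more convenient. A minimal sketch with made-up data; the name `groups` is chosen for the example:

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({
    "Category": ["High", "Low", "High", "Low"],
    "Price": [10.0, 1.0, 14.0, 2.0],
})

# Build {group name: sub-DataFrame} from the iteration pairs.
groups = {name: frame for name, frame in df.groupby("Category")}

print(sorted(groups))
print(groups["Low"])
```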


### 7.4.2 Grouping along a different axis with a dict or a Series

In [22]:
mapping ={'Open': 'open_close', 'Close': 'open_close',
         'High': 'high_low', 'Low': 'high_low'}

groups = price.groupby(mapping, axis=1)
display(price, groups.sum())

Unnamed: 0,Date,High,Low,Open,Close
0,2018-12-28,2520.27002,2472.889893,2498.77002,2485.73999
1,2018-12-31,2509.23999,2482.820068,2498.939941,2506.850098
2,2019-01-02,2519.48999,2467.469971,2476.959961,2510.030029
3,2019-01-03,2493.139893,2443.959961,2491.919922,2447.889893
4,2019-01-04,2538.070068,2474.330078,2474.330078,2531.939941


Unnamed: 0,high_low,open_close
0,4993.159913,4984.51001
1,4992.060058,5005.790039
2,4986.959961,4986.98999
3,4937.099854,4939.809815
4,5012.400146,5006.270019


In [15]:
map_series = pd.Series(mapping)
groups = price.groupby(map_series, axis=1)  # keep the result so the Series-based grouping is what gets displayed
display(map_series, groups, groups.sum())

Open     open_close
Close    open_close
High       high_low
Low        high_low
dtype: object

<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x10bdfa208>

Unnamed: 0,high_low,open_close
0,4993.159913,4984.51001
1,4992.060058,5005.790039
2,4986.959961,4986.98999
3,4937.099854,4939.809815
4,5012.400146,5006.270019
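
The `axis=1` argument of `groupby` is deprecated in recent pandas releases (2.1 and later). One alternative, sketched here on the first two rows of the price table, is to transpose the frame so that the column labels become the row index, group the rows with the same mapping, and transpose back. The names `numeric` and `sums` are chosen for the example.

```python
import pandas as pd

# First two rows of the price table shown above.
price = pd.DataFrame({
    "Date": ["2018-12-28", "2018-12-31"],
    "High": [2520.27002, 2509.23999],
    "Low": [2472.889893, 2482.820068],
    "Open": [2498.77002, 2498.939941],
    "Close": [2485.73999, 2506.850098],
})
mapping = {"Open": "open_close", "Close": "open_close",
           "High": "high_low", "Low": "high_low"}

# Move Date into the index so only numeric columns remain.
numeric = price.set_index("Date")

# Transpose, group the former column labels via the mapping, transpose back.
sums = numeric.T.groupby(mapping).sum().T
print(sums)
```

Unlike the original call, this version indexes the result by `Date` rather than by row number.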


---

## 7.5 Exercise

1. Can `DataFrame.groupby` take a function as the grouping key?
2. If you want to understand SAC in more depth, please read [The split-apply-combine strategy for data analysis](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.182.5667&rep=rep1&type=pdf) by Hadley Wickham.
3. Check the other functions in the full table of aggregation functions in 7.3.1.

---