# Pandas notes: groupby

12 November 2021

---

**Objective**
To understand what groupby is doing, and how to think about it.

**Description**
Despite using it regularly, the groupby operation in Pandas is a constant source of mystery for me, and I often find myself resorting to Stack Overflow to solve some seemingly basic problem. This is my attempt to finally clarify what is actually happening when I perform a groupby on a Pandas dataframe, and how operations are then performed on the resulting DataFrameGroupBy object.

I'll do this by going through a few problems that can be tackled effectively using groupby:
- Finding all occurrences of events leading up to a specific event for a group of patients
- Extracting the first n rows from each group
- Calculating the running sum of a variable
- Applying aggregation functions to groups

## Load up libraries

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()
%matplotlib inline

## What is the groupby operation?

Given a dataframe, a groupby splits its rows into distinct groups based on the values in the columns whose names are provided to it: all rows that share the same value (or combination of values) end up in the same group.

One question I often ask myself is: *what is returned after performing a groupby, and how do I interact with it?*

Let's start with an example using some dummy data:

In [40]:
# create a dataframe
patientid_values = [1, 1, 1, 2123, 2123, 2123, 2123, 2123, 1043, 1043, 1043, 1043, 1043, 1043]

date_values = ['01/01/2010', '04/09/1987', '25/03/1990',
              '17/03/2013', '31/01/2015', '04/07/2016', '19/07/2007', '02/02/2012',
              '19/03/2011', '18/08/2004', '31/10/2004', '04/01/2009', '31/03/2010', '07/07/2021']

status_values = [0, 0, 1,
                0, 1, 1, 0, 0,
                1, 0, 0, 1, 0, 1]

bpsys_values = [118, 110, 112,
               111, 118, 119, 118, 130,
               109, 110, 110, 112, 135, 140]

df = pd.DataFrame(data={'patientid': patientid_values, 'date': date_values, 'status': status_values,
                       'bp_systolic': bpsys_values})

df['date'] = pd.to_datetime(df['date'], format='%d/%m/%Y')

# sort by patient and date, then renumber the rows 0..n-1 (drop=True discards the old index)
df = df.sort_values(by=['patientid', 'date']).reset_index(drop=True)

In [41]:
# let's group by the patient id
df.groupby('patientid')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x1290a4ee0>

This returns a DataFrameGroupBy object. We can inspect the groups, apply functions to the groups and check their size as follows:

In [42]:
# for each of the groups, with patientid being the label, this gives a list of the indices 
# in the original dataframe that correspond to them
df.groupby('patientid').groups

{1: [0, 1, 2], 1043: [3, 4, 5, 6, 7, 8], 2123: [9, 10, 11, 12, 13]}

In [43]:
# we can then return the data for a specific group
df.groupby('patientid').get_group(2123)

    patientid       date  status  bp_systolic
9        2123 2007-07-19       0          118
10       2123 2012-02-02       0          130
11       2123 2013-03-17       0          111
12       2123 2015-01-31       1          118
13       2123 2016-07-04       1          119


In [44]:
# if we want to count the number of elements in each of the groups
df.groupby('patientid').size()

patientid
1       3
1043    6
2123    5
dtype: int64

In [45]:
# apply a function to each group
df.groupby('patientid').apply(lambda g: type(g))

patientid
1       <class 'pandas.core.frame.DataFrame'>
1043    <class 'pandas.core.frame.DataFrame'>
2123    <class 'pandas.core.frame.DataFrame'>
dtype: object
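The output above shows that each group reaches the function as an ordinary DataFrame. To convince myself of what happens around that call, here is a rough by-hand sketch of the split, apply and combine steps. The `toy` frame and its values are made up for illustration and are not the data used in this notebook.

```python
import pandas as pd

# Small illustrative frame (hypothetical values, not the notebook's data)
toy = pd.DataFrame({
    "patientid": [1, 1, 2, 2, 2],
    "bp_systolic": [110, 120, 130, 140, 150],
})

# Split: iterating over a GroupBy yields (key, sub-DataFrame) pairs
pieces = {key: group for key, group in toy.groupby("patientid")}

# Apply: run an ordinary Series operation on each piece
means = {key: group["bp_systolic"].mean() for key, group in pieces.items()}

# Combine: assemble the per-group results into a single Series
manual = pd.Series(means, name="bp_systolic").rename_axis("patientid")

# The built-in path should give the same numbers
builtin = toy.groupby("patientid")["bp_systolic"].mean()
print(manual)
print(builtin)
```

This is only a mental model: pandas does not literally build a dictionary of dataframes, and the built-in methods are usually much faster than looping over groups.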

So what is this DataFrameGroupBy object? 

We see that the groupby operation splits the dataframe into separate dataframes, one per patientid in this case, so any operation that works on a dataframe can be applied to each group via the *apply* operation. Before turning to specific problems, here are two more ways to interact with the object: selecting a single column and iterating over the groups.

In [46]:
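# select one column from the grouped object and sum it within each group;
# since status is 0/1, this counts the status == 1 rows per patient
df.groupby('patientid')['status'].sum()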

patientid
1       1
1043    3
2123    2
Name: status, dtype: int64

In [47]:
# a GroupBy object is iterable: each step yields a (group label, group dataframe) tuple
gb1 = iter(df.groupby('patientid'))

In [48]:
next(gb1)

(1,
    patientid       date  status  bp_systolic
 0          1 1987-09-04       0          110
 1          1 1990-03-25       1          112
 2          1 2010-01-01       0          118)

We can also get a SeriesGroupBy object if we index the DataFrameGroupBy object by one of the original column names, as in the `['status']` selection used earlier.

As expected, this also has a groups attribute and a get_group method, which behave like the ones on the DataFrameGroupBy object:

In [49]:
df.groupby('patientid')['status'].groups, df.groupby('patientid')['status'].get_group(1043)

({1: [0, 1, 2], 1043: [3, 4, 5, 6, 7, 8], 2123: [9, 10, 11, 12, 13]},
 3    0
 4    0
 5    1
 6    0
 7    1
 8    1
 Name: status, dtype: int64)
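One detail that regularly trips me up is the difference between selecting with a single column name and selecting with a list of names. A minimal sketch, using a made-up `toy` frame rather than the notebook's data:

```python
import pandas as pd

# Hypothetical data for illustration only
toy = pd.DataFrame({
    "patientid": [1, 1, 2, 2],
    "status": [0, 1, 1, 1],
})
by_patient = toy.groupby("patientid")

# A single column name gives a SeriesGroupBy; reductions return a Series
status_groups = by_patient["status"]
status_sum = status_groups.sum()

# A list of column names gives a DataFrameGroupBy; reductions return a DataFrame
status_frame_groups = by_patient[["status"]]
status_frame_sum = status_frame_groups.sum()

print(type(status_groups).__name__, type(status_sum).__name__)
print(type(status_frame_groups).__name__, type(status_frame_sum).__name__)
```

The numbers are the same either way; only the shape of the result differs, which matters when the result is assigned back to a column or merged with something else.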

### 1. Finding all occurrences of events leading up to a specific event for a group of patients

In [59]:
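# for each patientid, find all events up to and including the first status==1
# note: this assumes every patient has at least one status==1 row; for a group with
# none, np.where(...)[0] is empty and the [0] lookup raises an IndexError
# (the status_cumsum column in the output is there because this cell was run after
# the running sum in section 3 had been added to df)
df.groupby('patientid').apply(lambda g: g.iloc[:np.where(g['status'] == 1)[0][0]+1])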

              patientid       date  status  bp_systolic  status_cumsum
patientid
1         0           1 1987-09-04       0          110              0
          1           1 1990-03-25       1          112              1
1043      3        1043 2004-08-18       0          110              0
          4        1043 2004-10-31       0          110              0
          5        1043 2009-01-04       1          112              1
2123      9        2123 2007-07-19       0          118              0
          10       2123 2012-02-02       0          130              0
          11       2123 2013-03-17       0          111              0
          12       2123 2015-01-31       1          118              1
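The `np.where(...)[0][0]` lookup in the cell above only works if every patient has at least one status == 1 row. A possible alternative, sketched here on made-up data, counts how many events occurred strictly before each row and keeps the rows where that count is zero. It assumes the rows are already sorted by date within each patient and that status only takes the values 0 and 1. Note the design choice: a patient with no event at all (patient 2 in this toy data) is kept in full, which may or may not be what is wanted.

```python
import pandas as pd

# Hypothetical data: patient 2 never has status == 1
toy = pd.DataFrame({
    "patientid": [1, 1, 1, 2, 2, 3, 3],
    "status":    [0, 1, 0, 0, 0, 1, 1],
})

# Running count of events per patient, minus the current row's own status,
# gives the number of status == 1 rows strictly before each row
events_before = toy.groupby("patientid")["status"].cumsum() - toy["status"]

# Keep every row that has no earlier event within its patient
up_to_first_event = toy[events_before == 0]
print(up_to_first_event)
```

Because the grouped cumsum comes back aligned to the original index, the result keeps the original flat index rather than the two-level index that apply produces.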


In [68]:
# for each patientid, find all events with a systolic bp value of 120 or above
df.groupby('patientid').apply(lambda g: g[g['bp_systolic'].ge(120)])

              patientid       date  status  bp_systolic  status_cumsum
patientid
1043      6        1043 2010-03-31       0          135              1
          8        1043 2021-07-07       1          140              3
2123      10       2123 2012-02-02       0          130              0
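The cell above keeps individual rows with a reading of 120 or more, and a row-level condition like that does not strictly need a groupby. Where grouping does help is in keeping or dropping whole groups based on a property of the group. A small sketch on made-up data, contrasting a plain boolean mask with the GroupBy `filter` method:

```python
import pandas as pd

# Hypothetical readings for two patients
toy = pd.DataFrame({
    "patientid": [1, 1, 2, 2],
    "bp_systolic": [110, 118, 119, 130],
})

# Row-level condition: a boolean mask is enough, no groupby required
high_rows = toy[toy["bp_systolic"] >= 120]

# Group-level condition: keep all rows of any patient who ever reached 120
high_patients = toy.groupby("patientid").filter(
    lambda g: g["bp_systolic"].max() >= 120
)
print(high_rows)
print(high_patients)
```

The mask returns only the row with the 130 reading, while `filter` returns both rows for patient 2, including the 119 reading.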


### 2. Extracting the first n rows from each group

In [52]:
df.groupby('patientid').head(3)

    patientid       date  status  bp_systolic
0           1 1987-09-04       0          110
1           1 1990-03-25       1          112
2           1 2010-01-01       0          118
3        1043 2004-08-18       0          110
4        1043 2004-10-31       0          110
5        1043 2009-01-04       1          112
9        2123 2007-07-19       0          118
10       2123 2012-02-02       0          130
11       2123 2013-03-17       0          111


In [53]:
# get the last entry of each group, indexed by patientid
# (strictly, last() returns the last non-null value in each column)
df.groupby('patientid').last()

                date  status  bp_systolic
patientid
1         2010-01-01       0          118
1043      2021-07-07       1          140
2123      2016-07-04       1          119
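With no missing values, `last()` and the last row of each group coincide, but they are not the same operation. The sketch below, on made-up data with one missing reading, compares `tail(1)`, which returns whole rows with their original index, against `last()`, which takes the last non-null value column by column.

```python
import numpy as np
import pandas as pd

# Hypothetical data: patient 1's final reading is missing
toy = pd.DataFrame({
    "patientid": [1, 1, 2, 2],
    "bp_systolic": [110.0, np.nan, 130.0, 140.0],
})
grouped = toy.groupby("patientid")

# tail(1): the last row of each group, original index kept, NaN included
last_rows = grouped.tail(1)

# last(): the last non-null value per column, indexed by patientid
last_values = grouped.last()
print(last_rows)
print(last_values)
```

For patient 1, `tail(1)` returns the row with the missing reading, while `last()` skips it and reports 110.0. The same distinction holds between `head(1)` and `first()`.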


### 3. Calculating the running sum of a variable

In [54]:
df['status_cumsum'] = df.groupby('patientid')['status'].cumsum()

In [55]:
df

   patientid       date  status  bp_systolic  status_cumsum
0          1 1987-09-04       0          110              0
1          1 1990-03-25       1          112              1
2          1 2010-01-01       0          118              1
3       1043 2004-08-18       0          110              0
4       1043 2004-10-31       0          110              0
5       1043 2009-01-04       1          112              1
6       1043 2010-03-31       0          135              1
7       1043 2011-03-19       1          109              2
8       1043 2021-07-07       1          140              3
9       2123 2007-07-19       0          118              0
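The reason the assignment above works without any merging is that a grouped `cumsum` returns a Series with the same index as the original dataframe, with the running total restarting at each new group. A minimal sketch on made-up data, next to an ungrouped running total for comparison:

```python
import pandas as pd

# Hypothetical data for illustration only
toy = pd.DataFrame({
    "patientid": [1, 1, 2, 2, 2],
    "status":    [1, 1, 0, 1, 1],
})

# Ungrouped: one running total across the whole column
toy["running_total"] = toy["status"].cumsum()

# Grouped: the running total restarts for every patient
toy["running_per_patient"] = toy.groupby("patientid")["status"].cumsum()
print(toy)
```

This relies on the rows already being in the desired order within each group, which is why the dataframe was sorted by patientid and date at the start.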


### 4. Applying aggregation functions to groups

In [58]:
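# string names select the built-in groupby mean and (sample) variance;
# recent pandas versions warn when np.mean / np.var are passed here instead
df.groupby('patientid')['bp_systolic'].agg(['mean', 'var'])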

                 mean         var
patientid
1          113.333333   17.333333
1043       119.333333  201.466667
2123       119.200000   46.700000
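When several statistics are needed, named aggregation lets each output column get an explicit name. A small sketch on made-up data; the output names `bp_mean`, `bp_var` and `n_readings` are my own choices, not anything pandas prescribes.

```python
import pandas as pd

# Hypothetical readings for two patients
toy = pd.DataFrame({
    "patientid": [1, 1, 1, 2, 2],
    "bp_systolic": [110, 112, 118, 120, 130],
})

# Each keyword is an output column: name=(input column, aggregation)
summary = toy.groupby("patientid").agg(
    bp_mean=("bp_systolic", "mean"),
    bp_var=("bp_systolic", "var"),      # sample variance (ddof=1)
    n_readings=("bp_systolic", "count"),
)
print(summary)
```

The result is a flat dataframe indexed by patientid, which avoids the nested column labels that appear when a list of functions is applied to several columns at once.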
