# Grouping in Pandas

- toc: true 
- badges: true
- comments: true
- categories: [python, pandas]

In [1]:
import pandas as pd
import numpy as np

## Using apply with groupby

From the [cookbook](https://pandas.pydata.org/pandas-docs/stable/user_guide/cookbook.html#grouping):

In [13]:
df = pd.DataFrame({'animal': 'cat dog cat fish dog cat cat'.split(),
                   'size': list('SSMMMLL'),
                   'weight': [8, 10, 11, 1, 20, 12, 12],
                   'adult': [False] * 5 + [True] * 2})
df

,animal,size,weight,adult
0,cat,S,8,False
1,dog,S,10,False
2,cat,M,11,False
3,fish,M,1,False
4,dog,M,20,False
5,cat,L,12,True
6,cat,L,12,True


In [20]:
# Return the size of the heaviest animal in each group

df.groupby('animal').apply(lambda g: g.loc[g.weight.idxmax(), 'size'])

animal
cat     L
dog     M
fish    M
dtype: object
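
The same answer can be reached without `apply`, by asking each group for the index label of its heaviest row and then looking up those rows. This is a minimal sketch that rebuilds the example frame so that it runs on its own; the names `heaviest_idx` and `result` are my own and not from the cookbook.

```python
import pandas as pd

df = pd.DataFrame({'animal': 'cat dog cat fish dog cat cat'.split(),
                   'size': list('SSMMMLL'),
                   'weight': [8, 10, 11, 1, 20, 12, 12],
                   'adult': [False] * 5 + [True] * 2})

# Index label of the heaviest row within each animal group
# (idxmax returns the first label when there is a tie)
heaviest_idx = df.groupby('animal')['weight'].idxmax()

# Select those rows and read off the size, keyed by animal
result = df.loc[heaviest_idx].set_index('animal')['size']
print(result)
```

Note the use of `['size']` rather than attribute access: `df.size` is a DataFrame attribute (the number of elements), so it would not return the column.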

## Expanding apply

Assume you want to calculate the cumulative return from a series of one-period returns in an expanding fashion: in each period, you want the cumulative return up to and including that period (expressed here as a growth factor, so 1.01 means a gain of 1%).

In [92]:
s = pd.Series([i / 100.0 for i in range(1, 4)])
s

0    0.01
1    0.02
2    0.03
dtype: float64

The solution is given [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/cookbook.html#grouping).

In [93]:
import functools

def cum_return(x, y):
    return x * (1 + y) 

def red(x):
    res = functools.reduce(cum_return, x, 1)
    return res
    
s.expanding().apply(red, raw=True)

0    1.010000
1    1.030200
2    1.061106
dtype: float64

I found that somewhere between bewildering and magical. To see what's going on, it helps to add a few print statements:

In [94]:
import functools

def cum_return(x, y):
    print('x:', x)
    print('y:', y)
    return x * (1 + y) 

def red(x):
    print('Series:', x)
    res = functools.reduce(cum_return, x, 1)
    print('Result:', res)
    print()
    return res
    
s.expanding().apply(red, raw=True)

Series: [0.01]
x: 1
y: 0.01
Result: 1.01

Series: [0.01 0.02]
x: 1
y: 0.01
x: 1.01
y: 0.02
Result: 1.0302

Series: [0.01 0.02 0.03]
x: 1
y: 0.01
x: 1.01
y: 0.02
x: 1.0302
y: 0.03
Result: 1.061106



0    1.010000
1    1.030200
2    1.061106
dtype: float64

This makes transparent how `reduce` works: it takes the starting value (1 here) as the initial `x` and the first element of the series as `y`, and calls `cum_return`. It then uses that result as `x` and the second element of the series as `y`, and calls `cum_return` again. This repeats until it has run through the entire series.
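
The behaviour described above can be written out as a plain loop. This is only a sketch for intuition: `reduce_by_hand` is a hypothetical helper I made up, not how `functools.reduce` is actually implemented, and the list `returns` stands in for one window of the series.

```python
import functools

def cum_return(x, y):
    return x * (1 + y)

def reduce_by_hand(func, values, initial):
    # Hypothetical helper: carry an accumulator through the values,
    # feeding it back in as the first argument on every step
    acc = initial
    for v in values:
        acc = func(acc, v)
    return acc

returns = [0.01, 0.02, 0.03]
by_hand = reduce_by_hand(cum_return, returns, 1)
built_in = functools.reduce(cum_return, returns, 1)
print(by_hand, built_in)
```

Both calls perform the same multiplications in the same order, so they produce the same number.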

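What surprised me is that the calculation always starts from the beginning rather than re-using the last calculated result. This is not down to `reduce` itself: `expanding().apply` calls `red` once per window and passes it the whole window each time, and because it treats `red` as a black box it cannot know that the previous result could simply be extended. A window of length n therefore costs n calls to `cum_return`, so the total work grows quadratically with the length of the series.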
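
For this particular calculation the repeated work can be avoided, because a cumulative product carries the running result forward by itself. The sketch below compares the two approaches on the same series; `via_apply` and `via_cumprod` are names I chose, and the lambda is just a compact form of `cum_return`.

```python
import functools

import numpy as np
import pandas as pd

s = pd.Series([i / 100.0 for i in range(1, 4)])

def red(x):
    # Same reduction as above, written with a lambda
    return functools.reduce(lambda acc, r: acc * (1 + r), x, 1)

# Recomputes every window from scratch
via_apply = s.expanding().apply(red, raw=True)

# Single pass: each element multiplies the running product
via_cumprod = (1 + s).cumprod()

print(np.allclose(via_apply, via_cumprod))
```

The `expanding().apply` pattern is still the one to reach for when the function has no such incremental form.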

## Sort by sum of group values

In [3]:
df = pd.DataFrame({'code': ['foo', 'bar', 'baz'] * 2,
                   'data': [0.16, -0.21, 0.33, 0.45, -0.59, 0.62],
                   'flag': [False, True] * 3})

df

,code,data,flag
0,foo,0.16,False
1,bar,-0.21,True
2,baz,0.33,False
3,foo,0.45,True
4,bar,-0.59,False
5,baz,0.62,True


In [111]:
g = df.groupby('code')
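# Passing 'sum' as a string avoids the FutureWarning that recent pandas
# versions raise when given the builtin sum
sort_order = g['data'].transform('sum').sort_values().index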
df.loc[sort_order]

,code,data,flag
1,bar,-0.21,True
4,bar,-0.59,False
0,foo,0.16,False
3,foo,0.45,True
2,baz,0.33,False
5,baz,0.62,True
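
To see why this works, it can help to look at the intermediate result: `transform` broadcasts each group's sum back onto every row of that group, so sorting those values orders the rows by group total. A small sketch, rebuilding the frame so that it runs standalone; `group_total` is a name I made up, and the stable `mergesort` is my own addition so that rows within a group keep their original order.

```python
import pandas as pd

df = pd.DataFrame({'code': ['foo', 'bar', 'baz'] * 2,
                   'data': [0.16, -0.21, 0.33, 0.45, -0.59, 0.62],
                   'flag': [False, True] * 3})

# One value per original row: the sum of that row's group
group_total = df.groupby('code')['data'].transform('sum')
print(group_total)

# Sort the row labels by group total, keeping ties in original order
sort_order = group_total.sort_values(kind='mergesort').index
print(df.loc[sort_order])
```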


In [20]:
# Get the observation with the largest data entry in each group
g = df.groupby('code')
g.apply(lambda grp: grp.loc[grp.data.idxmax()])

code (index),code,data,flag
bar,bar,-0.21,True
baz,baz,0.62,True
foo,foo,0.45,True


## Expanding group operations

Based on [this](https://stackoverflow.com/a/15489701/13666841) answer.

In [54]:
df = pd.DataFrame({'code': ['foo', 'bar', 'baz'] * 4,
                   'data': [0.16, -0.21, 0.33, 0.45, -0.59, 0.62] * 2,
                   'flag': [False, True] * 6})
df

,code,data,flag
0,foo,0.16,False
1,bar,-0.21,True
2,baz,0.33,False
3,foo,0.45,True
4,bar,-0.59,False
5,baz,0.62,True
6,foo,0.16,False
7,bar,-0.21,True
8,baz,0.33,False
9,foo,0.45,True
10,bar,-0.59,False
11,baz,0.62,True


In [59]:
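# group_keys=False keeps the original row index in the result
g = df.groupby('code', group_keys=False)

def helper(g):
    g = g.copy()  # work on a copy so the group itself is not modified
    s = g.data.expanding()
    g['exp_mean'] = s.mean()
    g['exp_sum'] = s.sum()
    g['exp_count'] = s.count()
    return g

g.apply(helper).sort_values('code')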

,code,data,flag,exp_mean,exp_sum,exp_count
1,bar,-0.21,True,-0.21,-0.21,1.0
4,bar,-0.59,False,-0.4,-0.8,2.0
7,bar,-0.21,True,-0.336667,-1.01,3.0
10,bar,-0.59,False,-0.4,-1.6,4.0
2,baz,0.33,False,0.33,0.33,1.0
5,baz,0.62,True,0.475,0.95,2.0
8,baz,0.33,False,0.426667,1.28,3.0
11,baz,0.62,True,0.475,1.9,4.0
0,foo,0.16,False,0.16,0.16,1.0
3,foo,0.45,True,0.305,0.61,2.0
6,foo,0.16,False,0.256667,0.77,3.0
9,foo,0.45,True,0.305,1.22,4.0


## Pivoting

From [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/cookbook.html#pivot)

In [9]:
df = pd.DataFrame(data={'province': ['ON', 'QC', 'BC', 'AL', 'AL', 'MN', 'ON'],
                        'city': ['Toronto', 'Montreal', 'Vancouver',
                                 'Calgary', 'Edmonton', 'Winnipeg',
                                 'Windsor'],
                        'sales': [13, 6, 16, 8, 4, 3, 1]})
df

,province,city,sales
0,ON,Toronto,13
1,QC,Montreal,6
2,BC,Vancouver,16
3,AL,Calgary,8
4,AL,Edmonton,4
5,MN,Winnipeg,3
6,ON,Windsor,1


Suppose you want to list sales by city within each province, together with a subtotal for each province.

In [39]:
table = (
    df
    .pivot_table(values='sales', index='province',
                 columns='city', aggfunc='sum',
                 margins=True)
    .stack()
    .drop('All')
)
table

province  city     
AL        Calgary       8.0
          Edmonton      4.0
          All          12.0
BC        Vancouver    16.0
          All          16.0
MN        Winnipeg      3.0
          All           3.0
ON        Toronto      13.0
          Windsor       1.0
          All          14.0
QC        Montreal      6.0
          All           6.0
dtype: float64
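
The `All` rows in this output are the per-province subtotals that `margins=True` adds. As a quick cross-check, the same subtotals can be computed directly with a groupby; this is a sketch of a sanity check, with `subtotals` being a name of my own.

```python
import pandas as pd

df = pd.DataFrame(data={'province': ['ON', 'QC', 'BC', 'AL', 'AL', 'MN', 'ON'],
                        'city': ['Toronto', 'Montreal', 'Vancouver',
                                 'Calgary', 'Edmonton', 'Winnipeg',
                                 'Windsor'],
                        'sales': [13, 6, 16, 8, 4, 3, 1]})

# Total sales per province, which should match the 'All' rows above
subtotals = df.groupby('province')['sales'].sum()
print(subtotals)
```

Unlike the stacked pivot table, this keeps the integer dtype, because no missing province/city combinations are introduced along the way.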

## Aggregating

From [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/cookbook.html#pivot)

In [93]:
df = pd.DataFrame({
    'StudentID': ['x1', 'x10', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9'],
    'StudentGender': ['F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'M', 'M'],
    'ExamYear': ['2007', '2007', '2007', '2008', '2008', '2008', '2008', '2009', '2009', '2009'],
    'Exam': ['algebra', 'stats', 'bio', 'algebra', 'algebra', 'stats', 'stats', 'algebra', 'bio', 'bio'],
    'Participated': ['no', 'yes', 'yes', 'yes', 'no', 'yes', 'yes', 'yes', 'yes', 'yes'],
    'Passed': ['no', 'yes', 'yes', 'yes', 'no', 'yes', 'yes', 'yes', 'no', 'yes'],
})

# Lower-case the column names
df.columns = df.columns.str.lower()
df

,studentid,studentgender,examyear,exam,participated,passed
0,x1,F,2007,algebra,no,no
1,x10,M,2007,stats,yes,yes
2,x2,F,2007,bio,yes,yes
3,x3,M,2008,algebra,yes,yes
4,x4,F,2008,algebra,no,no
5,x5,M,2008,stats,yes,yes
6,x6,F,2008,stats,yes,yes
7,x7,M,2009,algebra,yes,yes
8,x8,M,2009,bio,yes,no
9,x9,M,2009,bio,yes,yes


In [61]:
numyes = lambda x: sum(x == 'yes')
df.groupby('examyear').agg({'participated': numyes,
                            'passed': numyes})

examyear,participated,passed
2007,2,2
2008,3,3
2009,3,2
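
An alternative to the lambda is to turn the yes/no columns into booleans first and then sum them per year, since `True` counts as 1. This is a sketch under the assumption that only the three relevant columns are needed, so the frame below is a reduced copy of the one above; `is_yes` and `counts` are names I chose.

```python
import pandas as pd

df = pd.DataFrame({
    'examyear': ['2007', '2007', '2007', '2008', '2008',
                 '2008', '2008', '2009', '2009', '2009'],
    'participated': ['no', 'yes', 'yes', 'yes', 'no',
                     'yes', 'yes', 'yes', 'yes', 'yes'],
    'passed': ['no', 'yes', 'yes', 'yes', 'no',
               'yes', 'yes', 'yes', 'no', 'yes'],
})

# True where the answer is 'yes', False otherwise
is_yes = df[['participated', 'passed']].eq('yes')

# Group the boolean frame by exam year and count the True values
counts = is_yes.groupby(df['examyear']).sum()
print(counts)
```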


## Sources

- [Python for Data Analysis](https://www.oreilly.com/library/view/python-for-data/9781491957653/)