**Preface**

Author: Richard (Dien Giau Bui), at `buidiengiau@gmail.com`. I am a PhD student in Finance at National Taiwan University.

Date: 2018/9/6

This note shows how to use the pipe operator (`>>`) with `plydata` in Python. It also shows how my package, `broom`, fits into this "tidyverse"-style workflow by summarizing regression output with four functions: `lm`, `tidy`, `glance`, and `augment`.

In [1]:
import pandas as pd
import numpy as np
from plydata import group_by, define, summarize, do, query
from broom import lm, tidy, glance, augment

# 1. Introduction



In [2]:
np.random.seed(123)
x = np.random.randn(1000)
y = np.random.randn(1000)
d = np.array([1,2]*500)
df = pd.DataFrame({
    'x' : x,
    'y' : y ,
    'd' : d
}
)
df.head()

Unnamed: 0,x,y,d
0,-1.085631,-0.748827,1
1,0.997345,0.567595,2
2,0.282978,0.718151,1
3,-1.506295,-0.999381,2
4,-0.5786,0.474898,1


## Group_by and Summarize

In [3]:
(df >> 
 group_by('d') 
 >> summarize('x.mean()')
)

Unnamed: 0,d,x.mean()
0,1,-0.042167
1,2,-0.036961


In [4]:
df[df.d == 1].describe()  # mean of x matches the summarize result for d == 1

Unnamed: 0,x,y,d
count,500.0,500.0,500.0
mean,-0.042167,-0.005625,1.0
std,1.011905,0.9709,0.0
min,-3.167055,-3.801378,1.0
25%,-0.692185,-0.647071,1.0
50%,-0.043672,0.038892,1.0
75%,0.683227,0.643466,1.0
max,2.766603,2.371388,1.0


In [5]:
df[df.d == 2].describe()  # mean of x matches the summarize result for d == 2

Unnamed: 0,x,y,d
count,500.0,500.0,500.0
mean,-0.036961,0.022403,2.0
std,0.991565,0.946679,0.0
min,-3.231055,-2.637922,2.0
25%,-0.682595,-0.636253,2.0
50%,-0.040681,0.0548,2.0
75%,0.642703,0.672326,2.0
max,3.571579,2.850708,2.0
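
As a cross-check that does not rely on `plydata`, the same per-group mean can be computed with plain pandas. This is only a minimal sketch: it rebuilds the example data with the same seed, and `group_means` is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Rebuild the example data with the same seed as in the note
np.random.seed(123)
x = np.random.randn(1000)
y = np.random.randn(1000)
d = np.array([1, 2] * 500)
df = pd.DataFrame({'x': x, 'y': y, 'd': d})

# Plain-pandas counterpart of: df >> group_by('d') >> summarize('x.mean()')
group_means = df.groupby('d')['x'].mean()
print(group_means)
```

The result is a Series indexed by `d`, so it can be compared directly with the `mean` rows of the two `describe()` tables.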


## Mutate (`define`)

In [6]:
(df >> 
 group_by('d') 
 >> define(xy = 'x + y')
).head()

0,1
groups,d

Unnamed: 0,x,y,d,xy
0,-1.085631,-0.748827,1,-1.834458
1,0.997345,0.567595,2,1.56494
2,0.282978,0.718151,1,1.001129
3,-1.506295,-0.999381,2,-2.505675
4,-0.5786,0.474898,1,-0.103702
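
For comparison, a plain-pandas version of the same column definition might look like the sketch below. Because `x + y` is computed row by row, the grouping should not change the values in this particular case; `out` is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Rebuild the example data with the same seed as in the note
np.random.seed(123)
x = np.random.randn(1000)
y = np.random.randn(1000)
d = np.array([1, 2] * 500)
df = pd.DataFrame({'x': x, 'y': y, 'd': d})

# Plain-pandas counterpart of define(xy='x + y'): add a new column,
# leaving the original DataFrame unchanged
out = df.assign(xy=df['x'] + df['y'])
print(out.head())
```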


## Plotting

# 2. Group_by + Regression + `broom`

In [7]:
# regression with `broom::lm`
lm(data=df, formula='y ~ x + d')

Intercept   -0.035073
x           -0.029983
d            0.028184
dtype: float64


<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x1c3cf4e0>

More interestingly, the pipe can be combined with the `broom` package to run and summarize a separate regression for each group in a DataFrame.

In [8]:
(df >> 
group_by('d') >>
do(lambda x: tidy(lm('y ~ x', data=x)))
)

Intercept   -0.003746
x            0.044565
dtype: float64
Intercept    0.018425
x           -0.107621
dtype: float64


0,1
groups,d

Unnamed: 0,d,coef,t,p_value
0,1,-0.003746,-0.086195,0.931346
1,1,0.044565,1.037639,0.299942
2,2,0.018425,0.437255,0.662116
3,2,-0.107621,-2.531667,0.011659
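
To make the grouped `do` call more concrete, the sketch below fits the same `y ~ x` line within each group using only pandas and NumPy. It uses `np.polyfit` as a stand-in for `broom`'s `lm`, so it recovers only the coefficients, not the t-statistics or p-values. The names `rows` and `coefs` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Rebuild the example data with the same seed as in the note
np.random.seed(123)
x = np.random.randn(1000)
y = np.random.randn(1000)
d = np.array([1, 2] * 500)
df = pd.DataFrame({'x': x, 'y': y, 'd': d})

rows = []
for key, g in df.groupby('d'):
    # Degree-1 least-squares fit; polyfit returns [slope, intercept]
    slope, intercept = np.polyfit(g['x'].values, g['y'].values, deg=1)
    rows.append({'d': key, 'Intercept': intercept, 'x': slope})

# One row per group, with the intercept and the slope on x
coefs = pd.DataFrame(rows)
print(coefs)
```

The `Intercept` and `x` columns should line up with the `coef` column of the `tidy` output, which stacks the two coefficients of each group in consecutive rows.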


In [9]:
(df >> 
group_by('d') >>
do(lambda x: glance(lm('y ~ x', data=x)))
)

Intercept   -0.003746
x            0.044565
dtype: float64
Intercept    0.018425
x           -0.107621
dtype: float64


0,1
groups,d

Unnamed: 0,d,obs,r2,ar2,f,f_pvalue,df,df_resid,aic,bic
0,1,500.0,0.002157,0.000154,1.076694,0.299942,1.0,498.0,1391.32628,1399.755496
1,2,500.0,0.012707,0.010724,6.409337,0.011659,1.0,498.0,1360.748773,1369.177989
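
In a regression with a single regressor and an intercept, R² equals the squared Pearson correlation between `x` and `y`. That gives a quick way to sanity-check the `r2` column without `broom`. The sketch below assumes the same simulated data; `r2` here is just a local dictionary, not part of any package.

```python
import numpy as np
import pandas as pd

# Rebuild the example data with the same seed as in the note
np.random.seed(123)
x = np.random.randn(1000)
y = np.random.randn(1000)
d = np.array([1, 2] * 500)
df = pd.DataFrame({'x': x, 'y': y, 'd': d})

r2 = {}
for key, g in df.groupby('d'):
    # Pearson correlation between x and y within the group
    r = np.corrcoef(g['x'].values, g['y'].values)[0, 1]
    r2[int(key)] = r ** 2

print(r2)
```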


In [10]:
(df >> 
group_by('d') >>
do(lambda x: augment(lm('y ~ x', data=x)))
).head()

Intercept   -0.003746
x            0.044565
dtype: float64
Intercept    0.018425
x           -0.107621
dtype: float64


0,1
groups,d

Unnamed: 0,d,.fitted,.resid
0,1,-0.052127,-0.6967
1,2,-0.08891,0.656505
2,1,0.008865,0.709285
3,2,0.180534,-1.179915
4,1,-0.029531,0.504429
