# Linear Regression Using `statsmodels`

Manually creating the model matrix with `numpy`:


In [1]:
import numpy as np

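def lm(X, y, intercept=True):
    # Build the model matrix, prepending a column of ones for the intercept
    if intercept:
        # np.ones takes the shape as a single tuple
        model_mat = np.column_stack((np.ones((X.shape[0], 1)), X))
    else:
        model_mat = X
    # Returns (coefficients, residuals, rank, singular values)
    return np.linalg.lstsq(model_mat, y, rcond=None)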
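
As a quick check of the helper above, here is a minimal usage sketch on synthetic data. The data, the seed, and the "true" coefficients are made up for illustration; the function is repeated so the block runs on its own.

```python
import numpy as np

def lm(X, y, intercept=True):
    """Least-squares fit; prepends a column of ones when intercept=True."""
    if intercept:
        model_mat = np.column_stack((np.ones((X.shape[0], 1)), X))
    else:
        model_mat = X
    return np.linalg.lstsq(model_mat, y, rcond=None)

# Synthetic data (hypothetical): y = 1.5 + 2.0*x1 - 0.5*x2 + small noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.5 + X @ np.array([2.0, -0.5]) + rng.normal(scale=0.1, size=200)

# lstsq returns (coefficients, residual sum of squares, rank, singular values)
beta, rss, rank, sv = lm(X, y)
print(beta)  # intercept first, then one slope per column of X
```

Note that `lstsq` hands back only the point estimates and a few diagnostics; nothing here gives standard errors or test statistics.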


## Not too bad. What if we need dummies?

We can still do this by hand with `pd.get_dummies`:

In [2]:
import pandas as pd
sectors = pd.Series(['HiTec', 'Hlth', 'HiTec', 'Utils'])
pd.get_dummies(sectors)

| | HiTec | Hlth | Utils |
|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 |
| 1 | 0.0 | 1.0 | 0.0 |
| 2 | 1.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 1.0 |


Seems straightforward, though it could get complicated with interaction terms or different contrast codings.
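
To give a feel for the bookkeeping involved, here is a rough sketch of building treatment-coded dummies plus interaction columns by hand. The numeric predictor `smi` and its values are hypothetical, and the `smi:` column prefix is just a naming choice for this example.

```python
import pandas as pd

sectors = pd.Series(['HiTec', 'Hlth', 'HiTec', 'Utils'])
smi = pd.Series([10.0, 20.0, 30.0, 40.0])  # hypothetical numeric predictor

# Treatment coding: drop the first level so the dummies are not
# collinear with the intercept column
dummies = pd.get_dummies(sectors, drop_first=True).astype(float)

# Interaction terms: multiply each dummy column by the numeric predictor
interactions = dummies.mul(smi, axis=0).add_prefix('smi:')

# Assemble the full model matrix column by column
intercept = pd.Series(1.0, index=sectors.index, name='Intercept')
model_mat = pd.concat(
    [intercept, dummies, smi.rename('smi'), interactions],
    axis=1,
)
print(model_mat)
```

Every extra factor or contrast scheme adds more of this manual column wrangling.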

# How about regression results?

R-squared, coefficient estimates, standard errors (SEs), confidence intervals

Time to start writing code!
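
For a sense of what "writing code" means here, the sketch below computes those quantities by hand with `numpy` on synthetic data. The data are made up, and the intervals use a normal quantile as a large-sample approximation; exact OLS intervals use the t distribution with the residual degrees of freedom.

```python
import numpy as np
from statistics import NormalDist

# Synthetic data (hypothetical): y = 0.5 + 2.0*x + unit-variance noise
rng = np.random.default_rng(1)
n = 500
X = np.column_stack((np.ones(n), rng.normal(size=n)))  # intercept + one predictor
y = X @ np.array([0.5, 2.0]) + rng.normal(size=n)

beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
df_resid = n - X.shape[1]

# R-squared: 1 - RSS / TSS
r_squared = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

# Standard errors from sigma^2 * (X'X)^{-1}
sigma2 = (resid @ resid) / df_resid
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

# Approximate 95% intervals (normal quantile instead of the t quantile)
z = NormalDist().inv_cdf(0.975)
ci = np.column_stack((beta - z * se, beta + z * se))

print(r_squared, se, ci, sep='\n')
```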

# `statsmodels` makes it easy

In [3]:
import statsmodels.formula.api as smf

dat = pd.read_csv('starmine_small.csv', low_memory=False)
m1 = smf.ols('ret_0_1_m ~ sector', data=dat).fit()
m1.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | ret_0_1_m | R-squared: | 0.014 |
| Model: | OLS | Adj. R-squared: | 0.009 |
| Method: | Least Squares | F-statistic: | 2.76 |
| Date: | Tue, 17 May 2016 | Prob (F-statistic): | 0.0022 |
| Time: | 19:46:51 | Log-Likelihood: | 1954.6 |
| No. Observations: | 1974 | AIC: | -3887.0 |
| Df Residuals: | 1963 | BIC: | -3826.0 |
| Df Model: | 10 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| Intercept | 0.0285 | 0.011 | 2.683 | 0.007 | 0.008 0.049 |
| sector[T.Enrgy] | 0.0212 | 0.015 | 1.420 | 0.156 | -0.008 0.050 |
| sector[T.HiTec] | 0.0217 | 0.012 | 1.785 | 0.074 | -0.002 0.046 |
| sector[T.Hlth] | -0.0094 | 0.014 | -0.696 | 0.486 | -0.036 0.017 |
| sector[T.Manuf] | 0.0166 | 0.012 | 1.408 | 0.159 | -0.007 0.040 |
| sector[T.Money] | 0.0056 | 0.011 | 0.491 | 0.623 | -0.017 0.028 |
| sector[T.NoDur] | -0.0061 | 0.013 | -0.462 | 0.644 | -0.032 0.020 |
| sector[T.Other] | 0.0086 | 0.012 | 0.700 | 0.484 | -0.016 0.033 |
| sector[T.Shops] | -0.0079 | 0.013 | -0.632 | 0.528 | -0.033 0.017 |

| | | | |
|---|---|---|---|
| Omnibus: | 153.726 | Durbin-Watson: | 1.957 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 784.476 |
| Skew: | 0.136 | Prob(JB): | 4.5e-171 |
| Kurtosis: | 6.076 | Cond. No. | 18.9 |
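
The numbers in the summary are also available programmatically on the fitted results object. The sketch below uses a small synthetic data frame as a stand-in, since `starmine_small.csv` is not reproduced here; the column names `ret` and `sector` and all values are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the real data (hypothetical values)
rng = np.random.default_rng(2)
toy = pd.DataFrame({
    'sector': rng.choice(['HiTec', 'Hlth', 'Utils'], size=300),
    'ret': rng.normal(scale=0.1, size=300),
})

fit = smf.ols('ret ~ sector', data=toy).fit()

print(fit.params)      # coefficient estimates (pandas Series)
print(fit.bse)         # standard errors
print(fit.conf_int())  # 95% confidence intervals (one row per coefficient)
print(fit.rsquared)    # R-squared
```

The string column is dummy-coded automatically, with the first level in sorted order (`HiTec` here) absorbed into the intercept.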


# How about interaction terms?

In [4]:
m2 = smf.ols('ret_0_1_m ~ smi * sector', data=dat).fit()
m2.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | ret_0_1_m | R-squared: | 0.031 |
| Model: | OLS | Adj. R-squared: | 0.016 |
| Method: | Least Squares | F-statistic: | 2.059 |
| Date: | Tue, 17 May 2016 | Prob (F-statistic): | 0.00326 |
| Time: | 19:46:51 | Log-Likelihood: | 1368.4 |
| No. Observations: | 1384 | AIC: | -2693.0 |
| Df Residuals: | 1362 | BIC: | -2578.0 |
| Df Model: | 21 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| Intercept | 0.0195 | 0.024 | 0.801 | 0.423 | -0.028 0.067 |
| sector[T.Enrgy] | 0.0589 | 0.032 | 1.812 | 0.070 | -0.005 0.123 |
| sector[T.HiTec] | 0.0050 | 0.029 | 0.174 | 0.862 | -0.052 0.062 |
| sector[T.Hlth] | -0.0059 | 0.035 | -0.171 | 0.864 | -0.074 0.062 |
| sector[T.Manuf] | 0.0052 | 0.028 | 0.188 | 0.851 | -0.049 0.059 |
| sector[T.Money] | 0.0195 | 0.026 | 0.741 | 0.459 | -0.032 0.071 |
| sector[T.NoDur] | -0.0007 | 0.031 | -0.022 | 0.982 | -0.062 0.061 |
| sector[T.Other] | 0.0238 | 0.028 | 0.842 | 0.400 | -0.032 0.079 |
| sector[T.Shops] | -0.0091 | 0.029 | -0.318 | 0.750 | -0.066 0.047 |

| | | | |
|---|---|---|---|
| Omnibus: | 108.687 | Durbin-Watson: | 2.036 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 566.538 |
| Skew: | 0.09 | Prob(JB): | 9.5e-124 |
| Kurtosis: | 6.129 | Cond. No. | 2140.0 |
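
In the formula syntax, `a * b` is shorthand for `a + b + a:b`, that is, both main effects plus their interaction. The sketch below checks this on synthetic data; the data frame and its values are hypothetical stand-ins for the real file.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (hypothetical values)
rng = np.random.default_rng(3)
toy = pd.DataFrame({
    'sector': rng.choice(['HiTec', 'Hlth', 'Utils'], size=300),
    'smi': rng.uniform(0, 100, size=300),
})
toy['ret'] = 0.001 * toy['smi'] + rng.normal(scale=0.1, size=300)

# 'smi * sector' should fit the same model as the spelled-out version
short = smf.ols('ret ~ smi * sector', data=toy).fit()
long = smf.ols('ret ~ smi + sector + smi:sector', data=toy).fit()

print(short.params)  # intercept, sector dummies, smi slope, per-sector slope shifts
print(np.allclose(short.params.sort_index(), long.params.sort_index()))
```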


## Easy to switch out for different estimators

Try an M-estimator

In [5]:
m3 = smf.rlm('ret_0_1_m ~ sector', data=dat).fit()
m3.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | ret_0_1_m | No. Observations: | 1974.0 |
| Model: | RLM | Df Residuals: | 1963.0 |
| Method: | IRLS | Df Model: | 10.0 |
| Norm: | HuberT | | |
| Scale Est.: | mad | | |
| Cov Type: | H1 | | |
| Date: | Tue, 17 May 2016 | | |
| Time: | 19:46:51 | | |
| No. Iterations: | 19 | | |

| | coef | std err | z | P>\|z\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| Intercept | 0.0309 | 0.009 | 3.400 | 0.001 | 0.013 0.049 |
| sector[T.Enrgy] | 0.0127 | 0.013 | 0.990 | 0.322 | -0.012 0.038 |
| sector[T.HiTec] | 0.0181 | 0.010 | 1.736 | 0.083 | -0.002 0.039 |
| sector[T.Hlth] | -0.0061 | 0.012 | -0.525 | 0.599 | -0.029 0.017 |
| sector[T.Manuf] | 0.0108 | 0.010 | 1.070 | 0.285 | -0.009 0.031 |
| sector[T.Money] | 0.0041 | 0.010 | 0.417 | 0.676 | -0.015 0.023 |
| sector[T.NoDur] | -0.0060 | 0.011 | -0.527 | 0.598 | -0.028 0.016 |
| sector[T.Other] | 0.0103 | 0.011 | 0.974 | 0.330 | -0.010 0.031 |
| sector[T.Shops] | -0.0106 | 0.011 | -0.982 | 0.326 | -0.032 0.011 |


# Questions?