## Workshop - OLS Python

In this workshop, we are going to:

1. perform backward selection on the class data set
   1. fit the full model with $\%\Delta rGDP$ as the label
   2. remove the feature with the highest p-value
   3. refit the model
   4. repeat steps 2 and 3 until all features have p-values below 0.05
2. evaluate the model performance

*Do not use interactions or polynomial terms in this workshop.*

# Preliminaries

- Load any necessary packages and/or functions
    * For backward selection, I recommend using `statsmodels.api` instead of `statsmodels.formula.api`. Your choice.
- Load in the class data
- Define `x` and `y`
- Create a train-test split with
    * training size of two-thirds
    * random state of 490

In [73]:
import pandas as pd
import numpy as np

import statsmodels.api as sm
import statsmodels.formula.api as smf

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

In [74]:
df = pd.read_csv('class_data.csv')
df.set_index(['fips', 'year', 'GeoName'], inplace = True)
df['year'] = df.index.get_level_values('year')

In [75]:
y = df['pct_d_rgdp']
x = df.drop(columns = 'pct_d_rgdp')

# Creating dummies
x = x.join([pd.get_dummies(x['year'], prefix = 'year', drop_first = True),
          pd.get_dummies(x['urate_bin'], prefix = 'urate', drop_first = True)]).drop(columns = ['year', 'urate_bin'])
x = sm.add_constant(x)
x.sort_index(axis = 'columns', inplace = True)
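The dummy columns shown in the output below are 0/1 integers. Newer pandas releases (2.0 and later, to my knowledge) return boolean dummies by default, and a frame that mixes boolean and float columns may be rejected by `sm.OLS`. A minimal sketch of a version-independent call, assuming a small made-up frame `demo` that is not the class data:

```python
import pandas as pd

# Hypothetical toy frame standing in for the class data
demo = pd.DataFrame({"year": [2002, 2003, 2004, 2003],
                     "lfpr": [77.4, 87.6, 72.4, 96.8]})

# dtype=float forces 0.0/1.0 columns regardless of the pandas default
dummies = pd.get_dummies(demo["year"], prefix="year", drop_first=True, dtype=float)

# Attach the dummies and drop the original categorical column
demo_x = demo.join(dummies).drop(columns="year")

print(demo_x.dtypes)
```

Passing `dtype = float` to both `get_dummies` calls in the cell above should have the same effect on the class data.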

In [76]:
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size = 2/3, random_state = 490)

In [77]:
x_train

fips,year,GeoName,const,density,emp_estabs,estabs_entry_rate,estabs_exit_rate,lfpr,pop,pop_pct_black,pop_pct_hisp,pos_net_jobs,...,year_2009,year_2010,year_2011,year_2012,year_2013,year_2014,year_2015,year_2016,year_2017,year_2018
29073,2016,"Gasconade, MO",1.0,28.478012,11.358442,6.744,7.004,87.686185,14746.0,0.942629,1.349519,0,...,0,0,0,0,0,0,0,1,0,0
48189,2002,"Hale, TX",1.0,35.432072,17.140449,10.585,12.256,77.452148,35598.0,5.963818,49.514018,1,...,0,0,0,0,0,0,0,0,0,0
13281,2010,"Towns, GA",1.0,63.249455,9.767857,10.122,14.660,72.434507,10535.0,0.588514,2.050308,0,...,0,1,0,0,0,0,0,0,0,0
29103,2007,"Knox, MO",1.0,8.237920,6.711538,14.563,12.621,96.899545,4152.0,0.602119,0.818882,0,...,0,0,0,0,0,0,0,0,0,0
48441,2002,"Taylor, TX",1.0,137.534575,16.216269,9.781,11.731,77.851508,125920.0,7.411849,18.904066,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
38047,2015,"Logan, ND",1.0,1.940941,7.508475,10.345,6.897,91.049086,1927.0,0.518941,1.245459,1,...,0,0,0,0,0,0,1,0,0,0
48355,2005,"Nueces, TX",1.0,388.219425,16.864460,9.849,9.229,74.193686,325515.0,4.346958,58.515276,1,...,0,0,0,0,0,0,0,0,0,0
48221,2006,"Hood, TX",1.0,113.997338,9.332004,16.980,13.015,78.443747,47952.0,0.613113,9.079913,1,...,0,0,0,0,0,0,0,0,0,0
47171,2004,"Unicoi, TN",1.0,96.344641,16.513725,10.317,7.937,71.839986,17936.0,0.206289,2.899197,0,...,0,0,0,0,0,0,0,0,0,0


*****
# Backward Selection 

In [78]:
#1st fit
fit_sm = sm.OLS(y_train, x_train).fit()
print(fit_sm.summary())

                            OLS Regression Results                            
Dep. Variable:             pct_d_rgdp   R-squared:                       0.041
Model:                            OLS   Adj. R-squared:                  0.041
Method:                 Least Squares   F-statistic:                     54.18
Date:                Thu, 25 Feb 2021   Prob (F-statistic):          2.24e-285
Time:                        15:09:33   Log-Likelihood:            -1.2346e+05
No. Observations:               33889   AIC:                         2.470e+05
Df Residuals:                   33861   BIC:                         2.472e+05
Df Model:                          27                                         
Covariance Type:            nonrobust                                         
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
const                -2.1296      0.58

In [79]:
# 2nd fit: drop the "density" column
x_train = x_train.drop(columns = 'density')
fit_sm = sm.OLS(y_train, x_train).fit()
print(fit_sm.summary())

                            OLS Regression Results                            
Dep. Variable:             pct_d_rgdp   R-squared:                       0.041
Model:                            OLS   Adj. R-squared:                  0.041
Method:                 Least Squares   F-statistic:                     56.27
Date:                Thu, 25 Feb 2021   Prob (F-statistic):          2.99e-286
Time:                        15:09:45   Log-Likelihood:            -1.2346e+05
No. Observations:               33889   AIC:                         2.470e+05
Df Residuals:                   33862   BIC:                         2.472e+05
Df Model:                          26                                         
Covariance Type:            nonrobust                                         
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
const                -2.1252      0.58

In [80]:
# 3rd fit: drop the "year_2005" column
x_train = x_train.drop(columns = 'year_2005')
fit_sm = sm.OLS(y_train, x_train).fit()
print(fit_sm.summary())

                            OLS Regression Results                            
Dep. Variable:             pct_d_rgdp   R-squared:                       0.041
Model:                            OLS   Adj. R-squared:                  0.041
Method:                 Least Squares   F-statistic:                     58.52
Date:                Thu, 25 Feb 2021   Prob (F-statistic):          4.00e-287
Time:                        15:10:06   Log-Likelihood:            -1.2346e+05
No. Observations:               33889   AIC:                         2.470e+05
Df Residuals:                   33863   BIC:                         2.472e+05
Df Model:                          25                                         
Covariance Type:            nonrobust                                         
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
const                -2.1619      0.56

In [81]:
# 4th fit: drop the "pop_pct_black" column
x_train = x_train.drop(columns = 'pop_pct_black')
fit_sm = sm.OLS(y_train, x_train).fit()
print(fit_sm.summary())

                            OLS Regression Results                            
Dep. Variable:             pct_d_rgdp   R-squared:                       0.041
Model:                            OLS   Adj. R-squared:                  0.041
Method:                 Least Squares   F-statistic:                     60.95
Date:                Thu, 25 Feb 2021   Prob (F-statistic):          5.50e-288
Time:                        15:10:45   Log-Likelihood:            -1.2346e+05
No. Observations:               33889   AIC:                         2.470e+05
Df Residuals:                   33864   BIC:                         2.472e+05
Df Model:                          24                                         
Covariance Type:            nonrobust                                         
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
const                -2.0901      0.54

In [82]:
# 5th fit: drop the "year_2013" column
x_train = x_train.drop(columns = 'year_2013')
fit_sm = sm.OLS(y_train, x_train).fit()
print(fit_sm.summary())

                            OLS Regression Results                            
Dep. Variable:             pct_d_rgdp   R-squared:                       0.041
Model:                            OLS   Adj. R-squared:                  0.041
Method:                 Least Squares   F-statistic:                     63.57
Date:                Thu, 25 Feb 2021   Prob (F-statistic):          9.80e-289
Time:                        15:10:46   Log-Likelihood:            -1.2346e+05
No. Observations:               33889   AIC:                         2.470e+05
Df Residuals:                   33865   BIC:                         2.472e+05
Df Model:                          23                                         
Covariance Type:            nonrobust                                         
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
const                -2.2308      0.51

In [83]:
# 6th fit: drop the "year_2003" column
x_train = x_train.drop(columns = 'year_2003')
fit_sm = sm.OLS(y_train, x_train).fit()
print(fit_sm.summary())

                            OLS Regression Results                            
Dep. Variable:             pct_d_rgdp   R-squared:                       0.041
Model:                            OLS   Adj. R-squared:                  0.041
Method:                 Least Squares   F-statistic:                     66.40
Date:                Thu, 25 Feb 2021   Prob (F-statistic):          2.23e-289
Time:                        15:10:51   Log-Likelihood:            -1.2346e+05
No. Observations:               33889   AIC:                         2.470e+05
Df Residuals:                   33866   BIC:                         2.472e+05
Df Model:                          22                                         
Covariance Type:            nonrobust                                         
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
const                -2.3009      0.51

In [84]:
# 7th fit: drop the "pop" column
# in this model, every remaining p-value is below 0.05, so selection stops here
x_train = x_train.drop(columns = 'pop')
fit_sm = sm.OLS(y_train, x_train).fit()
print(fit_sm.summary())

                            OLS Regression Results                            
Dep. Variable:             pct_d_rgdp   R-squared:                       0.041
Model:                            OLS   Adj. R-squared:                  0.041
Method:                 Least Squares   F-statistic:                     69.38
Date:                Thu, 25 Feb 2021   Prob (F-statistic):          1.65e-289
Time:                        15:11:04   Log-Likelihood:            -1.2346e+05
No. Observations:               33889   AIC:                         2.470e+05
Df Residuals:                   33867   BIC:                         2.472e+05
Df Model:                          21                                         
Covariance Type:            nonrobust                                         
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
const                -2.1414      0.50

**********
# Testing

Evaluate two RMSEs:

1. null model
2. backward-selected model

Then, determine the percent improvement of the backward-selected model over the null model.
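
As a hedged illustration of the two quantities, the sketch below computes both RMSEs and the percent improvement on toy numbers chosen so the arithmetic can be checked by hand. The helper names `rmse` and `pct_improvement` and the `toy_*` arrays are assumptions for this example only. The null model predicts the training mean for every test observation.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


def pct_improvement(rmse_null, rmse_model):
    """Percent reduction in RMSE relative to the null model."""
    return (rmse_null - rmse_model) / rmse_null * 100


toy_y_train = np.array([1.0, 2.0, 3.0])   # training mean is 2.0
toy_y_test = np.array([0.0, 4.0])
toy_y_hat = np.array([1.0, 3.0])          # made-up model predictions

# Null model: predict the training mean everywhere (errors of -2 and 2)
toy_rmse_null = rmse(toy_y_test, toy_y_train.mean())
# Fitted model: errors of -1 and 1
toy_rmse_model = rmse(toy_y_test, toy_y_hat)

print(toy_rmse_null, toy_rmse_model, pct_improvement(toy_rmse_null, toy_rmse_model))
```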

In [85]:
x_test = x_test.drop(columns = ['density','year_2005','pop_pct_black','year_2013','year_2003','pop'])
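
Listing the dropped columns by hand works, but it has to be kept in sync with the selection steps. One alternative, sketched here on hypothetical `demo_train` and `demo_test` frames, is to index the test features by whatever columns survive in the training features:

```python
import pandas as pd

# Hypothetical frames: the training features have already been pruned
demo_train = pd.DataFrame({"const": [1.0, 1.0], "lfpr": [77.4, 87.6]})
demo_test = pd.DataFrame({"const": [1.0], "density": [28.4],
                          "lfpr": [72.4], "pop": [14746.0]})

# Keep only the columns that remain in the training set, in the same order
demo_test = demo_test[demo_train.columns]

print(list(demo_test.columns))
```

On the class data this would read `x_test = x_test[x_train.columns]`, assuming `x_train` has been pruned in place as in the cells above.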

In [86]:
y_hat_sm = fit_sm.predict(x_test)
y_hat_sm.head()

fips   year  GeoName         
6013   2005  Contra Costa, CA    3.437510
29015  2006  Benton, MO          5.709151
40069  2007  Johnston, OK        2.082689
48235  2015  Irion, TX           4.832244
29075  2013  Gentry, MO          4.200842
dtype: float64

In [87]:
rmse_sm = np.sqrt(np.mean((y_test - y_hat_sm)**2))
rmse_sm

9.216942512937903

In [88]:
# null model
rmse_null = np.sqrt(  np.mean((y_test - np.mean(y_train))**2)  )
rmse_null

9.403229309446852

In [89]:
print(round((rmse_null - rmse_sm)/rmse_null*100, 3), '%', sep = '')

1.981%


The backward-selected model improves on the null model's test RMSE by only 1.981%, which is not much.