## Workshop - OLS Python

In this workshop, we are going to:

1. perform backward selection on the class data set
   1. fit the full model with $\%\Delta rGDP$ as the label
   2. remove the feature with the highest p-value
   3. refit the model
   4. repeat steps 2 and 3 until all features have p-values below 0.05
2. evaluate the model performance

*Do not use interactions or polynomial terms in this workshop.*

# Preliminaries

- Load any necessary packages and/or functions
    * For backward selection, I recommend using `statsmodels.api` instead of `statsmodels.formula.api`. Your choice.
- Load in the class data
- Define `x` and `y`
- Create a train-test split with
    * training size of two-thirds
    * random state of 490

In [1]:
import pandas as pd
import numpy as np

import statsmodels.api as sm
import statsmodels.formula.api as smf

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

In [2]:
df = pd.read_pickle('C:/Users/T-2070/Desktop/ECON490AML/Data/class_data.pkl')
df.columns

Index(['GeoName', 'pct_d_rgdp', 'urate_bin', 'pos_net_jobs', 'emp_estabs',
       'estabs_entry_rate', 'estabs_exit_rate', 'pop', 'pop_pct_black',
       'pop_pct_hisp', 'lfpr', 'density', 'year'],
      dtype='object')

In [3]:
y = df['pct_d_rgdp']
x = df.drop(columns = 'pct_d_rgdp')

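# Creating dummies (dtype = float keeps them numeric; pandas >= 2.0 returns booleans by default)
x = x.join([pd.get_dummies(x['year'], prefix = 'year', drop_first = True, dtype = float),
          pd.get_dummies(x['urate_bin'], prefix = 'urate', drop_first = True, dtype = float)]).drop(columns = ['year', 'urate_bin'])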
x = sm.add_constant(x)

# Creating interactions
x['lfpr:urate_lower'] = x.lfpr * x.urate_lower
x['lfpr:urate_similar'] = x.lfpr * x.urate_similar
x['emp_estabs_sq'] = x.emp_estabs**2

# Dropping features we do not want to use
x.drop(columns = ['GeoName', 'pos_net_jobs', 'estabs_entry_rate', 'estabs_exit_rate',
                  'pop', 'pop_pct_black', 'pop_pct_hisp', 'density'], inplace = True)

# Sorting the columns for output
x.sort_index(axis = 'columns', inplace = True)

# Checking the remaining columns
x.columns

Index(['const', 'emp_estabs', 'emp_estabs_sq', 'lfpr', 'lfpr:urate_lower',
       'lfpr:urate_similar', 'urate_lower', 'urate_similar', 'year_2003',
       'year_2004', 'year_2005', 'year_2006', 'year_2007', 'year_2008',
       'year_2009', 'year_2010', 'year_2011', 'year_2012', 'year_2013',
       'year_2014', 'year_2015', 'year_2016', 'year_2017', 'year_2018'],
      dtype='object')

In [4]:
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size = 2/3, random_state = 490)
print(x_train.shape)
print(y_train.shape, '\n')

print(x_test.shape)
print(y_test.shape)

(33418, 24)
(33418,) 

(16709, 24)
(16709,)


*****
# Backward Selection 

In [24]:
# Refit on the current x_train; this cell is re-run after each drop below
fit_sm = sm.OLS(y_train, x_train).fit()
print(fit_sm.summary())

                            OLS Regression Results                            
Dep. Variable:             pct_d_rgdp   R-squared:                       0.030
Model:                            OLS   Adj. R-squared:                  0.029
Method:                 Least Squares   F-statistic:                     59.86
Date:                Tue, 23 Feb 2021   Prob (F-statistic):          2.51e-202
Time:                        14:38:02   Log-Likelihood:            -1.2100e+05
No. Observations:               33418   AIC:                         2.420e+05
Df Residuals:                   33400   BIC:                         2.422e+05
Df Model:                          17                                         
Covariance Type:            nonrobust                                         
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
const                 -0.9802      0

In [11]:
x_train = x_train.drop(columns = ['year_2003'])

In [13]:
x_train = x_train.drop(columns = ['year_2010'])

In [15]:
x_train = x_train.drop(columns = ['year_2004'])

In [17]:
x_train = x_train.drop(columns = ['lfpr:urate_lower'])

In [20]:
x_train = x_train.drop(columns = ['emp_estabs_sq'])

In [22]:
x_train = x_train.drop(columns = ['year_2005'])

**********
# Testing

Evaluate two RMSEs:

1. null model
2. backward-selected model

Then, determine the percent improvement of the backward-selected model over the null model.
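
For reference, the two quantities can be written out as follows. This is a sketch of the standard definitions; the symbols $n_{test}$, $\hat{y}_i$, and $\bar{y}_{train}$ are notation introduced here for illustration, not taken from the class materials.

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n_{test}} \sum_{i=1}^{n_{test}} \left(y_i - \hat{y}_i\right)^2}
$$

$$
\%\ \text{improvement} = \frac{\mathrm{RMSE}_{null} - \mathrm{RMSE}_{model}}{\mathrm{RMSE}_{null}} \times 100
$$

For the null model, $\hat{y}_i = \bar{y}_{train}$ for every test observation, so the null RMSE measures how far the test labels fall from the training mean.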

In [27]:
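fit_sm.params                    # coefficient estimates
fit_sm.pvalues                   # p-values
fit_sm.resid                     # training residuals
fit_sm.conf_int(alpha = 0.01)    # 99% confidence intervals
fit_sm.rsquared                  # R-squared (only this last line is displayed)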

0.029565183747597867

In [28]:
x_test = x_test.drop(columns = ['year_2003', 'year_2010', 'year_2004', 'lfpr:urate_lower', 'emp_estabs_sq', 'year_2005'])
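
An alternative to retyping the dropped names is to select the test columns using the training columns, so the two sets cannot drift apart. A minimal sketch with toy frames (the `demo_` names and values are made up for illustration):

```python
import pandas as pd

# Toy stand-ins for x_train (after backward selection) and the untouched x_test
demo_train = pd.DataFrame({'const': [1.0, 1.0], 'lfpr': [60.1, 58.3]})
demo_test = pd.DataFrame({'const': [1.0], 'lfpr': [61.7], 'year_2003': [0.0]})

# Keep only the columns that survived selection, in the same order
demo_test = demo_test[demo_train.columns]
print(list(demo_test.columns))
```

With the notebook's objects, the equivalent line would be `x_test = x_test[x_train.columns]`.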

In [29]:
y_hat_sm = fit_sm.predict(x_test)
y_hat_sm.head()
rmse_sm = np.sqrt(np.mean((y_test - y_hat_sm)**2))
rmse_sm

9.175397656227684
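
The hand-computed RMSE can be cross-checked against `mean_squared_error`, which is imported in the first cell. A minimal sketch on toy values (the `demo_` arrays are made up for illustration and are not the class data):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Toy labels and predictions for illustration only
demo_y_true = np.array([1.0, 2.0, 3.0])
demo_y_pred = np.array([1.0, 2.0, 5.0])

# RMSE computed by hand, as in the cell above
demo_rmse_manual = np.sqrt(np.mean((demo_y_true - demo_y_pred)**2))

# RMSE as the square root of scikit-learn's mean squared error
demo_rmse_sklearn = np.sqrt(mean_squared_error(demo_y_true, demo_y_pred))

print(demo_rmse_manual, demo_rmse_sklearn)
```

With the notebook's objects, the equivalent call would be `np.sqrt(mean_squared_error(y_test, y_hat_sm))`.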

In [31]:
rmse_null = np.sqrt(  np.mean((y_test - np.mean(y_train))**2)  )
rmse_null

9.295172646932564

In [32]:
print(round((rmse_null - rmse_sm)/rmse_null*100, 2), '%', sep = '')

1.29%
