Li Ju: Cantonese dim sum; Watch TV Series
Chauguk Lee: Sushi; Play Tennis

## Workshop - Regression-Based Classification

Does the `statsmodels` marginal effect evaluate the derivative at the average of the covariates, or does it average the derivative over the predicted values?
- Use the class data.
- Show your work.

Load the necessary packages and data:

In [1]:
import numpy as np
import pandas as pd
from tqdm import tqdm # progress bar

import statsmodels.api as sm
from sklearn import linear_model as lm

from sklearn.model_selection import GridSearchCV, train_test_split, KFold
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay # plot_confusion_matrix() was removed in scikit-learn 1.2

import matplotlib.pyplot as plt
import seaborn as sns
sns.set(rc = {'axes.titlesize': 24,
             'axes.labelsize': 20,
             'xtick.labelsize': 12,
             'ytick.labelsize': 12,
             'figure.figsize': (8, 4.5)})
sns.set_style("white") # for ConfusionMatrixDisplay plots

In [2]:
df = pd.read_pickle('class_data.pkl')

In [3]:
df_prepped = df.drop(columns = ['urate_bin', 'year']).join([
    pd.get_dummies(df['urate_bin'], drop_first = True),
    pd.get_dummies(df.year, drop_first = True)    
])
y = df_prepped['pos_net_jobs'].astype(float)
x = df_prepped.drop(columns = 'pos_net_jobs')

x_train, x_test, y_train, y_test = train_test_split(x, y, train_size = 2/3, random_state = 490)

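# Standardize both splits with the training-set mean and std so the test set is on the same scale
train_mean = x_train.mean()
train_sd   = x_train.std(ddof = 0) # ddof = 0 matches np.std
x_train_std = (x_train - train_mean)/train_sd
x_test_std  = (x_test - train_mean)/train_sd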

x_train_std = sm.add_constant(x_train_std)
x_test_std  = sm.add_constant(x_test_std)
x_train     = sm.add_constant(x_train)
x_test      = sm.add_constant(x_test)

Fit a logistic regression using either `sm.Logit()` or `smf.logit()`.

In [15]:
fit_logit = sm.Logit(y_train, x_train[['const','pct_d_rgdp', 'estabs_exit_rate']]).fit()

Optimization terminated successfully.
         Current function value: 0.670597
         Iterations 5


Get the marginal effects (`.get_margeff()`). Print the summary (`.summary()`).

In [16]:
fit_logit.get_margeff().summary()

Dep. Variable:    pos_net_jobs
Method:           dydx
At:               overall

                      dy/dx    std err          z    P>|z|    [0.025    0.975]
pct_d_rgdp           0.0057      0.000     17.597    0.000     0.005     0.006
estabs_exit_rate    -0.0280      0.001    -26.347    0.000    -0.030    -0.026


In [18]:
fit_logit.params

const               1.238731
pct_d_rgdp          0.023849
estabs_exit_rate   -0.117092
dtype: float64

***
# Covariate Averages
$$
\frac{\partial p(x_i)}{\partial x_1} \approx \frac{e^{\hat{\beta}_0 + \bar{x}_1\hat{\beta}_1 + \bar{x}_2\hat{\beta}_2}}{(1 + e^{\hat{\beta}_0 + \bar{x}_1\hat{\beta}_1 + \bar{x}_2\hat{\beta}_2})^2}\hat{\beta}_1
$$
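
For reference, the logistic derivative behind this kind of expression can be sketched as follows. The symbol $\eta$ for the linear predictor is notation introduced here, not taken from the workshop.

$$
p(x) = \frac{e^{\eta}}{1 + e^{\eta}}, \qquad \eta = \hat{\beta}_0 + x_1\hat{\beta}_1 + x_2\hat{\beta}_2
$$

$$
\frac{\partial p(x)}{\partial x_1} = \frac{e^{\eta}}{(1 + e^{\eta})^2}\hat{\beta}_1 = p(x)\,\bigl(1 - p(x)\bigr)\hat{\beta}_1
$$

The factor $p(1 - p)$ depends on where $\eta$ is evaluated, which is why evaluating at the covariate means and averaging over observations can give different numbers.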

In [28]:
params_values = fit_logit.params.values
# Linear predictor at the covariate means: each slope is paired with its own feature's mean
exp_ca = np.exp(params_values[0] + 
                params_values[1] * np.mean(x_train['pct_d_rgdp']) 
                + params_values[2] * np.mean(x_train['estabs_exit_rate']))
# Both marginal effects share the same logistic density term
marginal_ca_1 = exp_ca*params_values[1]/(1+exp_ca)**2
marginal_ca_2 = exp_ca*params_values[2]/(1+exp_ca)**2

In [29]:
print(marginal_ca_1)
print(marginal_ca_2)


***
# Predicted Value Averages
$$
\frac{\partial p(x_i)}{\partial x_1} \approx \frac{1}{n} \sum_{i=1}^n \frac{e^{\hat{y}_i}}{(1 + e^{\hat{y}_i})^2}\hat{\beta}_1, \qquad \hat{y}_i = \hat{\beta}_0 + x_{i1}\hat{\beta}_1 + x_{i2}\hat{\beta}_2
$$

In [33]:
# .predict() returns probabilities, not the linear predictor; use the estimation sample, as statsmodels does
p_i = fit_logit.predict(x_train[['const','pct_d_rgdp', 'estabs_exit_rate']])
# Average the logistic density p(1 - p) = e^eta/(1 + e^eta)^2 over observations
avg_density = np.mean(p_i*(1 - p_i))
marginal_pva_1 = avg_density*params_values[1]
marginal_pva_2 = avg_density*params_values[2]

In [34]:
print(marginal_pva_1)
print(marginal_pva_2)


*** 
# Interpretation

Interpret the marginal effect on one feature.

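A one-percentage-point increase in real GDP growth (`pct_d_rgdp`) is associated with an increase of about 0.0057 (roughly 0.6 percentage points) in the probability of a positive net job creation rate, averaged over the sample and holding `estabs_exit_rate` fixed.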