In [2]:
# Import package for getting dataset example
import wooldridge as woo

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import scipy.stats as stats
import statsmodels.api as sm
import statsmodels.formula.api as smf

import math



# Multiple Regression in Practice

The estimated multiple regression model (fitted part plus residual) has the general form:

$$
y = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \hat{\beta}_3 x_3 + \cdots + \hat{\beta}_k x_k + \hat{\text{e}}
$$


Model assumptions (Gauss-Markov assumptions):

1. The true model follows:
> $$
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \cdots + \beta_k x_k + \text{e}
$$


2. The data are a random sample of $n$ observations from the population: {$(x_{i,1}, x_{i,2}, ..., x_{i,k}, y_i) : i = 1, 2, 3, ..., n$}

3. No perfect collinearity
> The regressors may be correlated with each other; they just cannot be perfectly linearly related.

4. Zero conditional mean
> The error term $\text{e}$ has an expected value of zero given any values of the regressors:
> $$ E (e | x_1, x_2, ..., x_k) = 0 $$

5. Homoskedasticity (**a violation does not bias the estimator**, but it makes the estimator less precise and the usual hypothesis tests less reliable)
> $$ Var(e | x_1, x_2, ..., x_k) = \sigma^2 $$
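
Assumption 3 can be checked numerically: the design matrix has full column rank exactly when no regressor is a perfect linear combination of the others. The following is a minimal sketch on synthetic data; the arrays `x1`, `x2`, `x3` are illustrative assumptions and do not come from the datasets used in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)   # correlated with x1, but not perfectly
x3 = 2.0 * x1 - 3.0 * x2             # exact linear combination of x1 and x2

const = np.ones(n)
X_ok = np.column_stack([const, x1, x2])        # satisfies assumption 3
X_bad = np.column_stack([const, x1, x2, x3])   # violates assumption 3

rank_ok = np.linalg.matrix_rank(X_ok)
rank_bad = np.linalg.matrix_rank(X_bad)

print(f"correlated regressors: rank {rank_ok} of {X_ok.shape[1]} columns")
print(f"perfectly collinear:   rank {rank_bad} of {X_bad.shape[1]} columns")
```

A rank below the number of columns means $\mathbf{X}'\mathbf{X}$ is singular, so the OLS coefficients are not uniquely determined.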

**Residual and fitted value properties**:
1. The sample average of the residuals is zero, so $ \bar{y} = \bar{\hat{y}} $.
2. The sample covariance between each regressor and the OLS residuals is zero, $ Cov( \hat{e}, x_j ) = 0 $.
3. The point ($\bar{x}_1, \bar{x}_2, \bar{x}_3, ..., \bar{x}_k, \bar{y}$) is always on the OLS regression line, $\bar{y} = \hat{\beta}_0 + \hat{\beta}_1 \bar{x}_1 + \hat{\beta}_2 \bar{x}_2 + \hat{\beta}_3 \bar{x}_3 + \cdots + \hat{\beta}_k \bar{x}_k $
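
These three properties hold by construction whenever the regression includes an intercept, and they can be confirmed up to floating-point error. Here is a small sketch on synthetic data; the data-generating values are arbitrary assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Synthetic regressors and outcome (illustrative values only)
x1 = rng.normal(loc=3.0, scale=1.0, size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
y = 1.0 + 0.4 * x1 - 0.7 * x2 + rng.normal(scale=0.5, size=n)

X = np.column_stack([np.ones(n), x1, x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat = X @ b
e_hat = y - y_hat

# Property 1: residuals average to zero, so mean(y) equals mean(y_hat)
print("mean residual:", e_hat.mean())
print("mean(y) - mean(y_hat):", y.mean() - y_hat.mean())

# Property 2: sample covariance between each regressor and the residuals
cov_x1_e = np.cov(x1, e_hat)[0, 1]
cov_x2_e = np.cov(x2, e_hat)[0, 1]
print("cov(x1, e_hat):", cov_x1_e)
print("cov(x2, e_hat):", cov_x2_e)

# Property 3: the point of sample means lies on the fitted plane
y_at_means = b[0] + b[1] * x1.mean() + b[2] * x2.mean()
print("y_bar - fitted value at means:", y.mean() - y_at_means)
```

All printed values should be zero up to rounding error.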

![image](images/table-term-multiple-regression.png)

![image](images/Example_3-1.png)

In [3]:
# Import data
gpa1 = woo.dataWoo('gpa1')

# Modeling
model = smf.ols(formula='colGPA ~ hsGPA + ACT', data=gpa1).fit()
print("Summary:\n")
print(model.summary())

Summary:

                            OLS Regression Results                            
Dep. Variable:                 colGPA   R-squared:                       0.176
Model:                            OLS   Adj. R-squared:                  0.164
Method:                 Least Squares   F-statistic:                     14.78
Date:                Sat, 31 Aug 2024   Prob (F-statistic):           1.53e-06
Time:                        13:39:33   Log-Likelihood:                -46.573
No. Observations:                 141   AIC:                             99.15
Df Residuals:                     138   BIC:                             108.0
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      1.2863      0.341      3.77

![image](images/Example_3-2.png)

In [4]:
# Case  Wage vs education, experience, and tenure
# Extract data

wage1 = woo.dataWoo('wage1')

# Modeling
model = smf.ols(formula='np.log(wage) ~ educ + exper + tenure',
                data=wage1).fit()

print("Summary:")
print(model.summary())

Summary:
                            OLS Regression Results                            
Dep. Variable:           np.log(wage)   R-squared:                       0.316
Model:                            OLS   Adj. R-squared:                  0.312
Method:                 Least Squares   F-statistic:                     80.39
Date:                Sat, 31 Aug 2024   Prob (F-statistic):           9.13e-43
Time:                        13:39:33   Log-Likelihood:                -313.55
No. Observations:                 526   AIC:                             635.1
Df Residuals:                     522   BIC:                             652.2
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.2844      0.104      2.729

In [5]:
# Case prate vs mrate and age

# Extract data
k401k = woo.dataWoo('401k')

# Model
model = smf.ols(formula='prate ~ mrate + age', data=k401k).fit()
print("Summary:")
print(model.summary())

Summary:
                            OLS Regression Results                            
Dep. Variable:                  prate   R-squared:                       0.092
Model:                            OLS   Adj. R-squared:                  0.091
Method:                 Least Squares   F-statistic:                     77.79
Date:                Sat, 31 Aug 2024   Prob (F-statistic):           6.67e-33
Time:                        13:39:34   Log-Likelihood:                -6422.3
No. Observations:                1534   AIC:                         1.285e+04
Df Residuals:                    1531   BIC:                         1.287e+04
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     80.1190      0.779    102.846

In [6]:
# Case narr86 vs pcnv, ptime86, and qemp86

# Extract data
crime1 = woo.dataWoo('crime1')

# Modeling
model = smf.ols(formula='narr86 ~ pcnv + ptime86 + qemp86',
                data=crime1).fit()

print("Summary:")
print(model.summary())

Summary:
                            OLS Regression Results                            
Dep. Variable:                 narr86   R-squared:                       0.041
Model:                            OLS   Adj. R-squared:                  0.040
Method:                 Least Squares   F-statistic:                     39.10
Date:                Sat, 31 Aug 2024   Prob (F-statistic):           9.91e-25
Time:                        13:39:34   Log-Likelihood:                -3394.7
No. Observations:                2725   AIC:                             6797.
Df Residuals:                    2721   BIC:                             6821.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.7118      0.033     21.565

In [7]:
# Case narr86 vs pcnv, avgsen, ptime86, and qemp86

# Extract data
crime1 = woo.dataWoo('crime1')

# Modeling
model = smf.ols(formula='narr86 ~ pcnv + avgsen + ptime86 + qemp86',
                data=crime1).fit()

print("Summary:")
print(model.summary())

Summary:
                            OLS Regression Results                            
Dep. Variable:                 narr86   R-squared:                       0.042
Model:                            OLS   Adj. R-squared:                  0.041
Method:                 Least Squares   F-statistic:                     29.96
Date:                Sat, 31 Aug 2024   Prob (F-statistic):           2.01e-24
Time:                        13:39:34   Log-Likelihood:                -3393.5
No. Observations:                2725   AIC:                             6797.
Df Residuals:                    2720   BIC:                             6826.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.7068      0.033     21.319

In [8]:
# Case  Wage vs education
# Extract data

wage1 = woo.dataWoo('wage1')

# Modeling
model = smf.ols(formula='np.log(wage) ~ educ',
                data=wage1).fit()

print("Summary:")
print(model.summary())

Summary:
                            OLS Regression Results                            
Dep. Variable:           np.log(wage)   R-squared:                       0.186
Model:                            OLS   Adj. R-squared:                  0.184
Method:                 Least Squares   F-statistic:                     119.6
Date:                Sat, 31 Aug 2024   Prob (F-statistic):           3.27e-25
Time:                        13:39:34   Log-Likelihood:                -359.38
No. Observations:                 526   AIC:                             722.8
Df Residuals:                     524   BIC:                             731.3
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.5838      0.097      5.998

# OLS in Matrix Form

- Beta parameters:

$$
\hat{\boldsymbol{\beta}} = \left( \mathbf{X}'\mathbf{X} \right)^{-1} \mathbf{X}'\mathbf{y}
$$


- Residuals:

$$
\hat{\mathbf{u}} = \mathbf{y} - \mathbf{X} \hat{\boldsymbol{\beta}}
$$

- Estimated variance of the error term (computed from the residuals):

$$
\mathbf{\hat{\sigma}}^2 = \frac{1}{n - k - 1} \mathbf{\hat{u}}' \mathbf{\hat{u}}
$$


- Estimated variance-covariance matrix of the OLS estimator:

$$
\widehat{\text{Var}}(\hat{\boldsymbol{\beta}}) = \hat{\sigma}^2 (\mathbf{X}'\mathbf{X})^{-1}
$$



NOTE:
- The standard errors of the parameter estimates are the square roots of the main diagonal of $\widehat{\text{Var}}(\hat{\boldsymbol{\beta}})$.
- In NumPy, matrix multiplication uses the `@` operator, for example `X.T @ X`.

In [9]:
gpa1 = woo.dataWoo('gpa1')

# Determine sample size & no. of regressors
n = len(gpa1)
k = 2

# Extract y
y = gpa1['colGPA']

# Extract X 
X = pd.DataFrame({'const': 1, 'hsGPA': gpa1['hsGPA'], 'ACT': gpa1['ACT']})

# Parameters estimates
X = np.array(X)
y = np.array(y).reshape(n, 1)
b = np.linalg.inv(X.T@X) @ X.T @ y
print(f"beta: \n{b}")

# Residuals, estimated variance of residuals and SER
residuals = y - X @ b
var_residuals = (residuals.T @ residuals) / (n - k - 1)
std_err_residuals = np.sqrt(var_residuals)
print(f"SER: {std_err_residuals}")

# Estimated variance of the parameter estimators and SE
var_beta = var_residuals * np.linalg.inv(X.T @ X)
std_error_est = np.sqrt(np.diagonal(var_beta))
print(f"Std error beta = {std_error_est}")

beta: 
[[1.28632777]
 [0.45345589]
 [0.00942601]]
SER: [[0.34031576]]
Std error beta = [0.34082212 0.09581292 0.01077719]
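
Inverting $\mathbf{X}'\mathbf{X}$ explicitly mirrors the textbook formula, but solving the normal equations or calling a least-squares routine is usually the numerically safer route. The sketch below compares the three approaches on synthetic data. The data and coefficient values are assumptions for illustration, not the `gpa1` sample. Keeping `y` one-dimensional also makes the SER a scalar rather than a 1x1 array.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 141, 2

# Synthetic design matrix with a constant and k regressors
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.3, 0.45, 0.01]) + rng.normal(scale=0.3, size=n)

# Three equivalent ways to get the OLS coefficients
b_inv = np.linalg.inv(X.T @ X) @ X.T @ y          # textbook formula
b_solve = np.linalg.solve(X.T @ X, X.T @ y)       # solve the normal equations
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares routine

# SER and coefficient standard errors from the matrix formulas
resid = y - X @ b_solve
sigma2_hat = resid @ resid / (n - k - 1)
se = np.sqrt(sigma2_hat * np.diag(np.linalg.inv(X.T @ X)))

print("max |b_inv - b_lstsq|:", np.max(np.abs(b_inv - b_lstsq)))
print("SER:", np.sqrt(sigma2_hat))
print("SE:", se)
```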


# Ceteris Paribus Interpretation and Omitted Variable Bias

- Ceteris Paribus Interpretation
> Analyze the effect of a single variable on an outcome, while holding all other relevant factors constant.

- Omitted Variable Bias
An estimator $\hat{\beta}_j$ remains unbiased when a variable is left out of the regression if either of the following holds:
1. The omitted variable does not appear in the true model (its coefficient is zero).
2. There is no correlation between the regressor $x_j$ and the omitted variable, either direct or indirect.
> Example:
	> - Direct correlation: $x_g$ is the omitted regressor, and $x_g$ and $x_j$ are correlated.
	> - Indirect correlation: $x_g$ is the omitted regressor and $x_s$ is another included regressor. There is no correlation between $x_j$ and $x_g$, but $x_j$ is correlated with $x_s$, and $x_s$ is in turn correlated with $x_g$.
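
The indirect case is easy to overlook, so here is a hedged simulation sketch. The data-generating process below is an assumption made up for illustration: `x_j` and `x_g` are drawn independently, while `x_s` depends on both. Omitting `x_g` still shifts the coefficient on `x_j`, because `x_j` is correlated with `x_s` and `x_s` is correlated with `x_g`. In this particular design the population coefficient on `x_j` in the short regression works out to $2 + 3 \times (-0.5) = 0.5$ instead of the true value 2.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000

x_j = rng.normal(size=n)
x_g = rng.normal(size=n)                 # drawn independently of x_j
x_s = x_j + x_g + rng.normal(size=n)     # correlated with both x_j and x_g

# True model includes all three regressors (coefficients are assumptions)
y = 1.0 + 2.0 * x_j + 0.5 * x_s + 3.0 * x_g + rng.normal(size=n)

X_full = np.column_stack([np.ones(n), x_j, x_s, x_g])
X_short = np.column_stack([np.ones(n), x_j, x_s])   # x_g omitted

b_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)
b_short, *_ = np.linalg.lstsq(X_short, y, rcond=None)

corr_jg = np.corrcoef(x_j, x_g)[0, 1]

print(f"corr(x_j, x_g):               {corr_jg:.4f}")
print(f"coef on x_j, full model:      {b_full[1]:.4f}")
print(f"coef on x_j, x_g omitted:     {b_short[1]:.4f}")
```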


- Mathematical formulation

We define $\mathbf{x} = \begin{bmatrix} 1 \\ x_1 \\ x_2 \\ \vdots \\ x_k \end{bmatrix}$ as the vector of regressors and $\hat{\boldsymbol{\beta}} = \begin{bmatrix} \hat{\beta}_0, & \hat{\beta}_1, & \hat{\beta}_2, & \dots, & \hat{\beta}_k \end{bmatrix}$ as the vector of OLS estimates from the full model.

We define $\mathbf{x}_{om}$ as the vector of omitted regressors and $\hat{\boldsymbol{\beta}}_{om}$ as the vector of their estimated coefficients; each contains $m$ elements, with $\mathbf{x}_{om} \subseteq \mathbf{x}$ and $\hat{\boldsymbol{\beta}}_{om} \subseteq \hat{\boldsymbol{\beta}}$. Let $\tilde{\beta}_j$ denote the coefficient on $x_j$ in the regression that leaves out $\mathbf{x}_{om}$. Then:


$$
\tilde{\beta}_j = \hat{\beta}_j + \sum_{g=1}^{m} \hat{\beta}_g \tilde{\delta}_{j,g}
$$

Where,
- $ \tilde{\delta}_{j,g} = \frac{\sum_{i=1}^{n} x_{i,g} \tilde{r}_{i,j}}{\sum_{i=1}^{n} \left(\tilde{r}_{i,j}\right)^2} $
- $ x_g \in \mathbf{x}_{om}$ and $\hat{\beta}_g \in \hat{\boldsymbol{\beta}}_{om}.$
- $ \tilde{r}_{i,j} $ = residual at observation $i$ from regressing $x_j$ on the other un-omitted regressors.
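
The decomposition above is an exact sample identity, which can be verified with more than one omitted regressor. The sketch below uses synthetic data with two included regressors (`x1`, `x2`) and two omitted ones (`z1`, `z2`). All names and coefficient values are assumptions for illustration, and `ols` is a small hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 300

# Included regressors x1, x2 and omitted regressors z1, z2 (all synthetic)
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)
z1 = 0.5 * x1 - 0.3 * x2 + rng.normal(size=n)
z2 = 0.2 * x1 + rng.normal(size=n)
y = 1.0 + 0.8 * x1 - 0.4 * x2 + 1.5 * z1 + 0.7 * z2 + rng.normal(size=n)

def ols(X, y):
    """Return the OLS coefficients of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

const = np.ones(n)
X_inc = np.column_stack([const, x1, x2])            # un-omitted regressors
X_full = np.column_stack([const, x1, x2, z1, z2])   # full model

b_full = ols(X_full, y)    # [b0, b1, b2, b_z1, b_z2]
b_short = ols(X_inc, y)    # coefficients with z1 and z2 omitted

# r_tilde: residual from regressing x1 on the other un-omitted regressors
others = np.column_stack([const, x2])
r_tilde = x1 - others @ ols(others, x1)

# delta_tilde for each omitted regressor
delta = np.array([np.sum(z * r_tilde) / np.sum(r_tilde ** 2) for z in (z1, z2)])

# Rebuild the short-regression coefficient on x1 from the full-model estimates
b1_reconstructed = b_full[1] + np.sum(b_full[3:] * delta)

print("coef on x1, omitted model:", b_short[1])
print("reconstructed from formula:", b1_reconstructed)
```

The two printed numbers should agree up to floating-point error.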

**Special case**:

- True model: $ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + e $
- Model with $x_2$ omitted: $ \tilde{y} = \tilde{\beta}_0 + \tilde{\beta}_1 x_1 $

In this case, we get:

- $ \tilde{r}_{i,1} = x_{i,1} - \bar{x}_1 $
- $ \tilde{\delta}_{1,2} = \frac{\sum_{i=1}^{n} x_{i,2} \left( x_{i,1} - \bar{x}_1 \right)}{\sum_{i=1}^{n} \left( x_{i,1} - \bar{x}_1 \right)^2} = \frac{ \widehat{\text{cov}}(x_2, x_1) }{ \widehat{\text{var}}(x_1) } = \text{slope from regressing } x_2 \text{ on } x_1 $
- $ \tilde{\beta}_1 = \hat{\beta}_1 + \hat{\beta}_2 \tilde{\delta}_{1,2} $
- $ \text{Bias}(\tilde{\beta}_1) = E(\tilde{\beta}_1) - \beta_1 = \beta_2 \tilde{\delta}_{1,2} $



### Case omitted hsGPA
- True model: colGPA = ACT + hsGPA
- Omitted model: colGPA = ACT

In [10]:
gpa1 = woo.dataWoo('gpa1')

# Parameter estimates for full and simple model
model = smf.ols(formula='colGPA ~ ACT + hsGPA', data=gpa1).fit()
params = model.params
print(f"Parameter model (full): \n", params)

print()

# Relation between regressors
# (slope from regressing hsGPA on ACT)
model_reg = smf.ols(formula='hsGPA ~ ACT', data=gpa1).fit()
delta_tilde_1 = model_reg.params['ACT']
b1_tilde = params['ACT'] + params['hsGPA'] * delta_tilde_1
print(f"Delta tilde (omitted) alternative 1: {delta_tilde_1}")
print(f"Estimator ACT (omitted) alternative 1: {b1_tilde}")

print()

# Same delta via the residual formula: regress the un-omitted regressor
#  on a constant, then divide sum(residual * omitted regressor)
#    by sum(residual ** 2)
model_reg_unomitted = smf.ols(formula='ACT ~ 1', data=gpa1).fit()
residu_unomitted = model_reg_unomitted.resid
omitted = gpa1['hsGPA']
delta_tilde_2 = np.sum(residu_unomitted * omitted) / np.sum(residu_unomitted ** 2)
b2_tilde = params['ACT'] + params['hsGPA'] * delta_tilde_2
print(f"Delta tilde (omitted) alternative 2: {delta_tilde_2}")
print(f"Estimator ACT (omitted) alternative 2: {b2_tilde}")

print()

# Actual regression with hsGPA omitted:
model_om = smf.ols(formula='colGPA ~ ACT', data=gpa1).fit()
params_om = model_om.params
print(f"Parameter model (omitted): \n", params_om)

Parameter model (full): 
 Intercept    1.286328
ACT          0.009426
hsGPA        0.453456
dtype: float64

Delta tilde (omitted) alternative 1: 0.03889675325123464
Estimator ACT (omitted) alternative 1: 0.027063973943178613

Delta tilde (omitted) alternative 2: 0.03889675325123099
Estimator ACT (omitted) alternative 2: 0.027063973943176955

Parameter model (omitted): 
 Intercept    2.402979
ACT          0.027064
dtype: float64


### Case omitted ACT
- True model: colGPA = ACT + hsGPA
- Omitted model: colGPA = hsGPA

In [11]:
gpa1 = woo.dataWoo('gpa1')

# Parameter estimates for full and simple model
model = smf.ols(formula='colGPA ~ ACT + hsGPA', data=gpa1).fit()
params = model.params
print(f"Parameter model (full): \n", params)

print()

# Relation between regressors
# (slope from regressing ACT on hsGPA)
model_reg = smf.ols(formula='ACT ~ hsGPA', data=gpa1).fit()
delta_tilde_1 = model_reg.params['hsGPA']
b1_tilde = params['hsGPA'] + params['ACT'] * delta_tilde_1
print(f"Delta tilde (omitted) alternative 1: {delta_tilde_1}")
print(f"Estimator hsGPA (omitted) alternative 1: {b1_tilde}")

print()

# Same delta via the residual formula: regress the un-omitted regressor
#  on a constant, then divide sum(residual * omitted regressor)
#    by sum(residual ** 2)
model_reg_unomitted = smf.ols(formula='hsGPA ~ 1', data=gpa1).fit()
residu_unomitted = model_reg_unomitted.resid
omitted = gpa1['ACT']
delta_tilde_2 = np.sum(residu_unomitted * omitted) / np.sum(residu_unomitted ** 2)
b2_tilde = params['hsGPA'] + params['ACT'] * delta_tilde_2
print(f"Delta tilde (omitted) alternative 2: {delta_tilde_2}")
print(f"Estimator hsGPA (omitted) alternative 2: {b2_tilde}")

print()

# Actual regression with ACT omitted:
model_om = smf.ols(formula='colGPA ~ hsGPA', data=gpa1).fit()
params_om = model_om.params
print(f"Parameter model (omitted): \n", params_om)

Parameter model (full): 
 Intercept    1.286328
ACT          0.009426
hsGPA        0.453456
dtype: float64

Delta tilde (omitted) alternative 1: 3.07433061472438
Estimator hsGPA (omitted) alternative 1: 0.48243456341525165

Delta tilde (omitted) alternative 2: 3.074330614724289
Estimator hsGPA (omitted) alternative 2: 0.4824345634152508

Parameter model (omitted): 
 Intercept    1.415434
hsGPA        0.482435
dtype: float64


# Standard Errors, Multicollinearity, and VIF

**Under Gauss-Markov Assumptions:**

$$
\text{Var}(\hat{\beta}_j) = \frac{\sigma^2}{SST_j(1-R_j^2)} = \frac{\sigma^2}{SST_j} VIF_j
$$

Where,

- $ j = 1, 2, 3, \dots, k $
- $ SST_j = \sum_{i=1}^{n} (x_{i,j} - \bar{x}_j)^2 $
- $ R_j^2 $ = R-squared from regressing $ x_j $ on all other regressors (including an intercept).
- $ \sigma^2 = E(e^2) $ = variance of the model's error term
- $ VIF_j = \frac{1}{(1-R_j^2)} = \text{variance inflation factor} $
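
For more than two regressors, each $R_j^2$ needs its own auxiliary regression. The sketch below computes $VIF_j$ for every column of a regressor matrix with plain NumPy. The function name `vif` and the synthetic data are assumptions for illustration.

```python
import numpy as np

def vif(X):
    """VIF for each column of X (regressors only, without a constant column)."""
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        xj = X[:, j]
        # Auxiliary regression: x_j on an intercept and all other regressors
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        fitted = others @ np.linalg.lstsq(others, xj, rcond=None)[0]
        ssr = np.sum((xj - fitted) ** 2)
        sst = np.sum((xj - xj.mean()) ** 2)
        r2 = 1.0 - ssr / sst
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(5)
n = 400
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.3 * rng.normal(size=n)   # strongly related to x1
x3 = rng.normal(size=n)                    # unrelated to the others

vifs = vif(np.column_stack([x1, x2, x3]))
print("VIF:", vifs)
```

Regressors that are nearly linear functions of the others (`x1` and `x2` here) get a large VIF, while the unrelated `x3` stays close to 1.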

The $\sigma^2$ is estimated from the residuals:
$$ \hat{\sigma}^2 = \frac{1}{n - k - 1} \sum_{i=1}^{n} \hat{e}_i^2 $$

Where,
- $\hat{e}_i$ = residual for observation $i$
- $n$ = number of observations
- $k$ = number of regressors

NOTE: $\sqrt{\hat\sigma^2}$ is the standard error of the regression (SER).

The standard error of $\hat{\beta}_j$ is then:

$$
\text{se}(\hat{\beta}_j) = \frac{\hat{\sigma}}{\sqrt{n} \sqrt{1 - R_j^2} \, sd(x_j)}
$$

Where,
- $sd(x_j) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_{i, j} - \bar{x}_j)^2}$
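
As a consistency check, the slope standard errors from this formula should match the square roots of the diagonal of $\hat{\sigma}^2 (\mathbf{X}'\mathbf{X})^{-1}$. The sketch below does this for both slopes on synthetic data; the data-generating values and the helper `se_formula` are assumptions for illustration. With two regressors, $R_1^2 = R_2^2$ equals the squared sample correlation between them.

```python
import numpy as np

rng = np.random.default_rng(6)
n, k = 250, 2

x1 = rng.normal(size=n)
x2 = 0.7 * x1 + rng.normal(size=n)
y = 2.0 + 0.5 * x1 - 1.0 * x2 + rng.normal(size=n)

# Matrix route
X = np.column_stack([np.ones(n), x1, x2])
XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y

resid = y - X @ b
sigma_hat = np.sqrt(resid @ resid / (n - k - 1))     # SER
se_matrix = sigma_hat * np.sqrt(np.diag(XtX_inv))    # [se(b0), se(b1), se(b2)]

# Formula route: R_j^2 from the auxiliary regression of one regressor on the other
r2_aux = np.corrcoef(x1, x2)[0, 1] ** 2

def se_formula(xj):
    sd_xj = np.std(xj)   # ddof=0, i.e. divides by n as in sd(x_j)
    return sigma_hat / (np.sqrt(n) * np.sqrt(1.0 - r2_aux) * sd_xj)

se_x1 = se_formula(x1)
se_x2 = se_formula(x2)

print("matrix SEs (slopes): ", se_matrix[1:])
print("formula SEs (slopes):", np.array([se_x1, se_x2]))
```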

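The standard error of the intercept:
$$
\text{se}(\hat{\beta}_0) = \hat{\sigma} \sqrt{ \frac{1}{n} + \bar{\mathbf{x}}' \left( \tilde{\mathbf{X}}'\tilde{\mathbf{X}} \right)^{-1} \bar{\mathbf{x}} }
$$

Where,
- $\bar{\mathbf{x}} = (\bar{x}_1, \bar{x}_2, \dots, \bar{x}_k)'$ = vector of regressor sample means
- $\tilde{\mathbf{X}}$ = $n \times k$ matrix of demeaned regressors, with entries $x_{i,j} - \bar{x}_j$

With a single regressor this reduces to $\hat{\sigma} \sqrt{ \frac{1}{n} + \frac{\bar{x}_1^2}{\sum_{i=1}^{n} (x_{i,1} - \bar{x}_1)^2} }$.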



### Case estimating VIF and Standard Error Estimator hsGPA

- True model: colGPA = ACT + hsGPA
- Auxiliary regression (for $R_j^2$ and VIF): hsGPA = ACT

In [13]:
# full estimation results including automatic SE
model = smf.ols(formula='colGPA ~ hsGPA + ACT', data=gpa1).fit()

# Extract SER
SER = np.sqrt(model.mse_resid)

# Regressing hsGPA on ACT for calculation of R2 & VIF
model_hsGPA = smf.ols(formula='hsGPA ~ ACT', data=gpa1).fit()
R2_hsGPA = model_hsGPA.rsquared
VIF_hsGPA = 1 / (1 - R2_hsGPA)
print(f"VIF_hsGPA: {VIF_hsGPA}\n")

# Manual calculation of SE of hsGPA coefficient
n = model.nobs
sdx = np.std(gpa1['hsGPA'], ddof=1) * np.sqrt((n - 1) / n)
SE_hsGPA = 1 / np.sqrt(n) * SER / sdx * np.sqrt(VIF_hsGPA)
print(f"SE_hsGPA: {SE_hsGPA}\n")

VIF_hsGPA: 1.1358234481972784

SE_hsGPA: 0.09581291608057595

