### Linear regression - model specification
This notebook demonstrates:

0. Correctly specified model (baseline)
1. Under-specified model
    - Uncorrelated predictors
    - Correlated predictors
2. Over-specified model
    - Perfect multicollinearity
    - Varying degrees of multicollinearity
    - Including unrelated variables

In [1]:
import numpy as np
import statsmodels.api as sm

### 0. Correctly specified model (baseline)

- We randomly generate 100 uncorrelated X1 and X2 values from a standard normal distribution.

- We specify a true regression line: $y$ = 3 + 2*$X_1$ + 4*$X_2$ + e

- We correctly include X1 and X2 in our regression model: y ~ X1 + X2 (with intercept)

- We fit 1000 such regression models and check the bias and precision of the estimates.

In [2]:
estimates = []
for i in range(1000):
    X1 = np.random.randn(100)
    X2 = np.random.randn(100)
    
    e = np.random.randn(100)
    y = 3 + 2*X1 + 4*X2 + e
    X = sm.add_constant(np.column_stack([X1,X2]))
    model = sm.OLS(y, X).fit()
    estimates.append(model.params)

In [3]:
#The averages of estimates across the 1000 replications are:
print("Average of estimates:", np.mean(np.array(estimates),axis=0))
print("SE of estimates:", np.std(np.array(estimates),axis=0))

Average of estimates: [2.99693107 1.9989116  3.99801653]
SE of estimates: [0.09907705 0.10185332 0.10522559]


### 1. Under-specified model

#### Uncorrelated predictors (#1)
- We randomly generate 100 uncorrelated X1 and X2 values from a normal distribution (means 4 and 2, unit variances).

- We specify a true regression line: $y$ = 3 + 2*$X_1$ + 4*$X_2$ + e

- We only include X1 in our regression model: y ~ X1 (with intercept)

- We fit 1000 such regression models and check the estimates for bias.

In [4]:
estimates = []
for i in range(1000):
    means = [4,2]
    cov = [[1,0], [0,1]]
    X = np.random.multivariate_normal(means,cov,100)

    e = np.random.randn(100)
    y = 3 + 2*X[:,0] + 4*X[:,1] + e
    X = sm.add_constant(X[:,0])
    model = sm.OLS(y, X).fit()
    estimates.append(model.params)

In [5]:
#The averages of estimates across the 1000 replications are:
print("Average of estimates:", np.mean(np.array(estimates),axis=0))

print("SE of estimates:", np.std(np.array(estimates),axis=0))

Average of estimates: [11.04517657  1.99440733]
SE of estimates: [1.65451365 0.40479627]


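We can see that the intercept estimate is **biased** (about 11 instead of 3): it absorbs the average contribution of the omitted X2, which is 4 × 2 = 8. The coefficient of X1 is **unbiased**. The SEs of the estimates are much higher than in the correctly specified model, largely because the omitted 4*X2 term becomes part of the error and raises its variance.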

#### Correlated predictors (#2)
- We randomly generate 100 correlated X1 and X2 values from a bivariate normal distribution (means 4 and 2, unit variances) with a correlation coefficient of 0.4.

- We specify a true regression line: $y$ = 3 + 2*$X_1$ + 4*$X_2$ + e

- We only include X1 in our regression model: y ~ X1 (with intercept)

- We fit 1000 such regression models and check for biasedness.

In [6]:
estimates = []
for i in range(1000):
    means = [4,2]
    cov = [[1,0.4], [0.4,1]]
    X = np.random.multivariate_normal(means,cov,100)

    e = np.random.randn(100)
    y = 3 + 2*X[:,0] + 4*X[:,1] + e
    X = sm.add_constant(X[:,0])
    model = sm.OLS(y, X).fit()
    estimates.append(model.params)

In [7]:
#The averages of estimates across the 1000 replications are:
np.mean(np.array(estimates),axis=0)

array([4.61173289, 3.59313288])

In [8]:
#The standard deviations (empirical SEs) of estimates across the 1000 replications are:
np.std(np.array(estimates),axis=0)

array([1.56633713, 0.38443653])

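We can see that the intercept estimate is **biased**, and the coefficient of X1 is also **biased**: it averages about 3.59 instead of its true value of 2, because it picks up part of the effect of the omitted X2 (roughly 4 × 0.4 = 1.6).

Conclusion: when predictors are correlated, omitting one biases the estimates of the others.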
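Both under-specified cases are consistent with the standard omitted-variable formula for a regression of y on X1 alone. The sketch below evaluates that formula using the population values from the simulations above (means 4 and 2, unit variances, covariance 0 or 0.4). The function name `omitted_variable_fit` and its argument names are illustrative and not part of the notebook.

```python
def omitted_variable_fit(b0, b1, b2, mean_x1, mean_x2, var_x1, cov_x1_x2):
    """Population intercept and slope of y ~ X1 when X2 is omitted.

    True model: y = b0 + b1*X1 + b2*X2 + e, with e independent of X1 and X2.
    """
    # Slope of the auxiliary regression of X2 on X1.
    delta = cov_x1_x2 / var_x1
    # The X1 coefficient absorbs the part of X2 that moves with X1.
    slope = b1 + b2 * delta
    # The intercept absorbs the remaining mean contribution of X2.
    intercept = b0 + b2 * (mean_x2 - delta * mean_x1)
    return intercept, slope


# Uncorrelated predictors: only the intercept shifts.
print(omitted_variable_fit(3, 2, 4, 4, 2, 1, 0.0))
# Correlated predictors (covariance 0.4): intercept and slope both shift.
print(omitted_variable_fit(3, 2, 4, 4, 2, 1, 0.4))
```

The two results are close to the simulated averages reported above: about 11 and 2 for the uncorrelated case, and about 4.6 and 3.6 for the correlated case.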

### 2. Over-specified model

#### Perfect multicollinearity (#1)
- We randomly generate 100 data points for X1 from a standard normal distribution.

- We set X2 = 1 - X1, so X2 is an exact linear function of X1 and the constant.

- We specify a true regression line: $y$ = 3 + 2*$X_1$ + 4*$X_2$ + e

- We fit the regression model once as: y ~ X1 + X2 (with intercept)


In [9]:
X1 = np.random.randn(100)
X2 = 1 - X1
e = np.random.randn(100)

y = 3 + 2*X1 + 4*X2 + e

X = sm.add_constant(np.column_stack([X1,X2]))

model = sm.OLS(y,X).fit()

In [10]:
model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.821 |
| Model: | OLS | Adj. R-squared: | 0.82 |
| Method: | Least Squares | F-statistic: | 450.8 |
| Date: | Mon, 12 Feb 2024 | Prob (F-statistic): | 1.93e-38 |
| Time: | 14:53:29 | Log-Likelihood: | -139.36 |
| No. Observations: | 100 | AIC: | 282.7 |
| Df Residuals: | 98 | BIC: | 287.9 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 3.9815 | 0.074 | 54.019 | 0.000 | 3.835 | 4.128 |
| x1 | 0.9864 | 0.072 | 13.706 | 0.000 | 0.844 | 1.129 |
| x2 | 2.9951 | 0.045 | 66.782 | 0.000 | 2.906 | 3.084 |

| | | | |
|---|---|---|---|
| Omnibus: | 1.927 | Durbin-Watson: | 2.079 |
| Prob(Omnibus): | 0.381 | Jarque-Bera (JB): | 1.886 |
| Skew: | 0.262 | Prob(JB): | 0.389 |
| Kurtosis: | 2.577 | Cond. No. | 4.08e+15 |
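The very large condition number and `Df Model: 1` in the summary both point to a rank-deficient design matrix. Because X2 = 1 - X1, the true line reduces to y = 7 - 2*X1 + e, and only those two combinations are identified: the reported coefficients give 3.9815 + 2.9951 ≈ 6.98 and 0.9864 - 2.9951 ≈ -2.01, but the individual values are just one of infinitely many equally good solutions (statsmodels' default pseudo-inverse fit picks one of them). A minimal sketch of how the rank deficiency could be checked directly with NumPy follows; the seed is arbitrary and this check is not part of the original notebook.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
X1 = rng.standard_normal(100)
X2 = 1 - X1                     # exact linear function of the constant and X1

# Design matrix with an explicit column of ones for the intercept.
X = np.column_stack([np.ones(100), X1, X2])

rank = np.linalg.matrix_rank(X)  # number of linearly independent columns
cond = np.linalg.cond(X)         # ratio of largest to smallest singular value

print("columns:", X.shape[1], "rank:", rank)
print("condition number:", cond)
```

A rank below the number of columns means the three coefficients cannot be estimated separately, whatever the sample size.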


#### Multicollinearity (#2)
- We randomly generate 100 correlated X1 and X2 values from a bivariate normal distribution (means 4 and 2, unit variances) and let X1 and X2 have different degrees of correlation (0, 0.3, 0.6, 0.9, 0.95, and 0.99).

- We specify a true regression line: $y$ = 2*$X_1$ + 4*$X_2$ + e, where e has a standard deviation of 2

- We include X1 and X2 in our regression model: y ~ X1 + X2 (no intercept, for simplicity)

- For each pre-set correlation value, we repeat the fit 1000 times to generate the sampling distribution of the regression estimates.

- We summarize each sampling distribution by its mean and standard error.

In [11]:
#Write a function to return the sampling distribution of regression coefficients 
# based on different degree of correlation between X1 and X2
def simulation_multi_colinearity(cor):
    params = []
    for i in range(1000):
        means = [4,2]
        cov = [[1,cor], [cor,1]]
        X = np.random.multivariate_normal(means,cov,100)
        e = np.random.randn(100)*2
        y = 2*X[:,0] + 4*X[:,1] + e
        model = sm.OLS(y,X).fit()
        params.append(model.params)
    return params

In [12]:
sampling_dist_0 = simulation_multi_colinearity(0)
sampling_dist_0_3 = simulation_multi_colinearity(0.3)
sampling_dist_0_6 = simulation_multi_colinearity(0.6)
sampling_dist_0_9 = simulation_multi_colinearity(0.9)
sampling_dist_0_95 = simulation_multi_colinearity(0.95)
sampling_dist_0_99 = simulation_multi_colinearity(0.99)

In [13]:
print("SE of estimates when X1 and X2 have a cor of 0:", np.std(sampling_dist_0,axis=0))
print("SE of estimates when X1 and X2 have a cor of 0.3:", np.std(sampling_dist_0_3,axis=0))
print("SE of estimates when X1 and X2 have a cor of 0.6:", np.std(sampling_dist_0_6,axis=0))
print("SE of estimates when X1 and X2 have a cor of 0.9:", np.std(sampling_dist_0_9,axis=0))
print("SE of estimates when X1 and X2 have a cor of 0.95:", np.std(sampling_dist_0_95,axis=0))
print("SE of estimates when X1 and X2 have a cor of 0.99:", np.std(sampling_dist_0_99,axis=0))

SE of estimates when X1 and X2 have a cor of 0: [0.09774962 0.17829866]
SE of estimates when X1 and X2 have a cor of 0.3: [0.11897213 0.21516405]
SE of estimates when X1 and X2 have a cor of 0.6: [0.13283715 0.24456713]
SE of estimates when X1 and X2 have a cor of 0.9: [0.18710183 0.34713535]
SE of estimates when X1 and X2 have a cor of 0.95: [0.19926059 0.36769283]
SE of estimates when X1 and X2 have a cor of 0.99: [0.23572081 0.43517237]


In [14]:
print("Mean of estimates when X1 and X2 have a cor of 0:", np.mean(sampling_dist_0,axis=0))
print("Mean of estimates when X1 and X2 have a cor of 0.3:", np.mean(sampling_dist_0_3,axis=0))
print("Mean of estimates when X1 and X2 have a cor of 0.6:", np.mean(sampling_dist_0_6,axis=0))
print("Mean of estimates when X1 and X2 have a cor of 0.9:", np.mean(sampling_dist_0_9,axis=0))
print("Mean of estimates when X1 and X2 have a cor of 0.95:", np.mean(sampling_dist_0_95,axis=0))
print("Mean of estimates when X1 and X2 have a cor of 0.99:", np.mean(sampling_dist_0_99,axis=0))

Mean of estimates when X1 and X2 have a cor of 0: [1.99815907 4.00067393]
Mean of estimates when X1 and X2 have a cor of 0.3: [2.00283493 3.99499812]
Mean of estimates when X1 and X2 have a cor of 0.6: [2.00121313 4.00266588]
Mean of estimates when X1 and X2 have a cor of 0.9: [2.01198014 3.97377536]
Mean of estimates when X1 and X2 have a cor of 0.95: [2.00067449 3.99208516]
Mean of estimates when X1 and X2 have a cor of 0.99: [2.00075003 3.99593505]


We can observe that, across varying degrees of correlation between the two predictors, the regression coefficients remain **unbiased**, but the standard errors of the estimates are **inflated** as the correlation grows: the SE of the X1 coefficient rises from about 0.098 at a correlation of 0 to about 0.236 at 0.99.
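The growth of the standard errors can be approximated analytically. Since this model has no intercept and the predictors have non-zero means, the textbook variance inflation factor 1/(1 - r²) does not describe these numbers directly. A large-sample approximation is Var(b) ≈ σ²/n · (E[x xᵀ])⁻¹, where E[x xᵀ] is the covariance matrix plus the outer product of the means. The sketch below evaluates it for the simulation settings (n = 100, σ = 2, means 4 and 2); the function name `approx_se` is illustrative, and the result is an approximation rather than an exact finite-sample value.

```python
import numpy as np


def approx_se(cor, n=100, sigma=2.0, means=(4.0, 2.0)):
    """Large-sample SEs of the no-intercept OLS coefficients (approximation)."""
    mu = np.asarray(means, dtype=float)
    cov = np.array([[1.0, cor], [cor, 1.0]])
    # Population second-moment matrix E[x x^T] = Cov(x) + mu mu^T.
    second_moment = cov + np.outer(mu, mu)
    # Approximate sampling covariance of the coefficient estimates.
    var_b = sigma**2 / n * np.linalg.inv(second_moment)
    return np.sqrt(np.diag(var_b))


for cor in (0, 0.3, 0.6, 0.9, 0.95, 0.99):
    print(cor, approx_se(cor))
```

At a correlation of 0 this gives roughly 0.098 and 0.180, close to the simulated values. The agreement is looser at the highest correlations, where the approximation somewhat understates the simulated SEs.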

#### Adding unrelated predictors (#3)
- We randomly generate 100 data points each for X1, X2, and X3 from a standard normal distribution.

- We specify a true regression line: $y$ = 3 + 2*$X_1$ + 4*$X_2$ + e

- We include X1, X2, and X3 in our regression model: y ~ X1 + X2 + X3 (with intercept). Here X3 does not belong in the model but is included anyway.

- We fit 1000 such regression models and check the estimates for bias.

In [15]:
estimates = []
for i in range(1000):
    X1 = np.random.randn(100)
    X2 = np.random.randn(100)
    X3 = np.random.randn(100)
    e = np.random.randn(100)

    y = 3 + 2*X1 + 4*X2 + e

    X = sm.add_constant(np.column_stack([X1,X2,X3]))
    model = sm.OLS(y, X).fit()
    estimates.append(model.params)

In [16]:
#The averages of estimates across the 1000 replications are:
print("Average of estimates:", np.mean(np.array(estimates),axis=0))
print("SE of estimates:", np.std(np.array(estimates),axis=0))

Average of estimates: [ 3.00154667e+00  1.99827735e+00  3.99297478e+00 -3.24848139e-03]
SE of estimates: [0.10063326 0.10192471 0.10255253 0.10575763]


We find that the estimates for X1 and X2 are **unbiased**, and their standard errors remain about as small as in the baseline model. The average estimate for X3 is nearly **zero**, and if we inspect one of the 1000 models (the last one fitted), its X3 estimate is not statistically significant. In this sense, including an unrelated random predictor in the model does little harm.

In [17]:
model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.945 |
| Model: | OLS | Adj. R-squared: | 0.944 |
| Method: | Least Squares | F-statistic: | 552.2 |
| Date: | Mon, 12 Feb 2024 | Prob (F-statistic): | 2.17e-60 |
| Time: | 14:53:30 | Log-Likelihood: | -138.6 |
| No. Observations: | 100 | AIC: | 285.2 |
| Df Residuals: | 96 | BIC: | 295.6 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 2.9756 | 0.102 | 29.294 | 0.000 | 2.774 | 3.177 |
| x1 | 2.0007 | 0.119 | 16.877 | 0.000 | 1.765 | 2.236 |
| x2 | 3.9751 | 0.099 | 40.036 | 0.000 | 3.778 | 4.172 |
| x3 | -0.1522 | 0.100 | -1.520 | 0.132 | -0.351 | 0.047 |

| | | | |
|---|---|---|---|
| Omnibus: | 3.052 | Durbin-Watson: | 1.835 |
| Prob(Omnibus): | 0.217 | Jarque-Bera (JB): | 1.86 |
| Skew: | -0.041 | Prob(JB): | 0.395 |
| Kurtosis: | 2.337 | Cond. No. | 1.5 |
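A single summary shows only one draw, so it cannot tell us how often X3 would look significant by chance. One way to check this, sketched below, is to count how often the t-statistic for X3 exceeds the two-sided 5% critical value across replications. This is not part of the original notebook: it uses plain NumPy least squares instead of statsmodels, an arbitrary seed, and a hard-coded critical value of about 1.985 for 96 residual degrees of freedom, which is an approximation.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
n, reps = 100, 1000
t_crit = 1.985                  # approx. two-sided 5% critical value, 96 df (assumed)

rejections = 0
for _ in range(reps):
    X1, X2, X3 = rng.standard_normal((3, n))
    e = rng.standard_normal(n)
    y = 3 + 2*X1 + 4*X2 + e     # X3 plays no role in the true model

    X = np.column_stack([np.ones(n), X1, X2, X3])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)

    # Classical OLS standard errors from the residual variance.
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - X.shape[1])
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))

    # Count the replication if X3 appears significant at the 5% level.
    rejections += abs(beta[3] / se[3]) > t_crit

rejection_rate = rejections / reps
print("Share of replications where X3 looks significant:", rejection_rate)
```

If the test behaves as intended, the share should be close to the nominal 5%, meaning the unrelated predictor is flagged as significant only about as often as chance allows.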
