### Linear regression - model specification
This notebook demonstrates:

0. Correctly specified model (baseline)
1. Under-specified model
    - Uncorrelated predictors
    - Correlated predictors
2. Over-specified model
    - Perfect multicollinearity
    - Varying degrees of multicollinearity
    - Including unrelated variables

In [1]:
import numpy as np
import statsmodels.api as sm

### 0. Correctly specified model (baseline)

- We randomly generate 100 uncorrelated observations of X1 and X2 from a normal distribution (mean 2, standard deviation 1).

- We specify a true regression line: $y$ = 3 + 2*$X_1$ + 4*$X_2$ + e

- We correctly include X1 and X2 in our regression model: y ~ X1 + X2 (with intercept)

- We fit 1000 such regression models and check the bias and precision of the estimates.

In [2]:
estimates = []
for i in range(1000):
    X1 = np.random.randn(100) + 2
    X2 = np.random.randn(100) + 2
    
    e = np.random.randn(100)
    y = 3 + 2*X1 + 4*X2 + e
    X = sm.add_constant(np.column_stack([X1,X2]))
    model = sm.OLS(y, X).fit()
    estimates.append(model.params)

In [3]:
#The averages of estimates across the 1000 replications are:
print("Average of estimates:", np.mean(np.array(estimates),axis=0))
print("SE of estimates:", np.std(np.array(estimates),axis=0))

Average of estimates: [3.00195035 2.00115695 3.99953601]
SE of estimates: [0.29882325 0.10161407 0.10566198]
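
The numbers above change slightly on every run because the random generator is not seeded. A minimal sketch of a reproducible version of the same baseline simulation follows. It assumes a hypothetical helper `simulate_baseline` and an arbitrary seed of 0, and it uses NumPy's least-squares solver in place of `sm.OLS` so that it has no dependency beyond NumPy; neither choice comes from the original cells.

```python
import numpy as np

def simulate_baseline(n_reps=1000, n_obs=100, seed=0):
    """Refit y ~ X1 + X2 on fresh data n_reps times (hypothetical helper)."""
    rng = np.random.default_rng(seed)  # seeded generator makes reruns identical
    estimates = np.empty((n_reps, 3))
    for r in range(n_reps):
        X1 = rng.normal(loc=2, scale=1, size=n_obs)
        X2 = rng.normal(loc=2, scale=1, size=n_obs)
        e = rng.normal(size=n_obs)
        y = 3 + 2 * X1 + 4 * X2 + e
        # Design matrix with an explicit column of ones for the intercept
        X = np.column_stack([np.ones(n_obs), X1, X2])
        beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
        estimates[r] = beta_hat
    return estimates

estimates_seeded = simulate_baseline()
print("Average of estimates:", estimates_seeded.mean(axis=0))
print("SE of estimates:", estimates_seeded.std(axis=0, ddof=1))
```

Calling `simulate_baseline()` twice with the same seed returns identical arrays, which makes the bias and SE figures easier to compare across the sections of this notebook.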


### 1. Under-specified model

#### Uncorrelated predictors (#1)
- We randomly generate 100 uncorrelated observations of X1 and X2 from a bivariate normal distribution (means of 2, zero covariance).

- We specify a true regression line: $y$ = 3 + 2*$X_1$ + 4*$X_2$ + e

- We only include X1 in our regression model: y ~ X1 (with intercept)

- We fit 1000 such regression models and check for bias.

In [4]:
estimates = []
for i in range(1000):
    means = [2 , 2]
    cov = [[1,0], [0,1]]
    
    X = np.random.multivariate_normal(means,cov,100)

    e = np.random.randn(100)
    
    y = 3 + 2*X[:,0] + 4*X[:,1] + e
    
    X = sm.add_constant(X[:,0])
    model = sm.OLS(y, X).fit()
    
    estimates.append(model.params)

In [5]:
#The averages of estimates across the 1000 replications are:
print("Average of estimates:", np.mean(np.array(estimates),axis=0))

print("SE of estimates:", np.std(np.array(estimates),axis=0))

Average of estimates: [11.00273446  1.99328831]
SE of estimates: [0.94906234 0.42030678]


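We can see that the intercept estimate is **biased**: it averages about 11 rather than 3, because it absorbs the mean contribution of the omitted variable, 4 × 2 = 8 (it would be unbiased only if the omitted variable had a mean of zero). The coefficient for X1 is **unbiased**. The SEs of the estimates are much higher than in the correctly specified model, because the omitted term 4*$X_2$ is absorbed into the error, raising its variance from 1 to 16 + 1 = 17.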

#### Correlated predictors (#2)
- We randomly generate 100 correlated observations of X1 and X2 from a bivariate normal distribution with a correlation coefficient of 0.4.

- We specify a true regression line: $y$ = 3 + 2*$X_1$ + 4*$X_2$ + e

- We only include X1 in our regression model: y ~ X1 (with intercept)

- We fit 1000 such regression models and check for bias.

In [6]:
estimates = []
for i in range(1000):
    means = [2, 2]
    cov = [[1,0.4], [0.4,1]]
    X = np.random.multivariate_normal(means,cov,100)

    e = np.random.randn(100)
    y = 3 + 2*X[:,0] + 4*X[:,1] + e
    X = sm.add_constant(X[:,0])
    model = sm.OLS(y, X).fit()
    estimates.append(model.params)

In [7]:
#The averages of estimates across the 1000 replications are:
np.mean(np.array(estimates),axis=0)

array([7.78186818, 3.60331787])

In [8]:
#The standard deviations (SEs) of estimates across the 1000 replications are:
np.std(np.array(estimates),axis=0)

array([0.86955328, 0.39256739])

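We can see that the intercept estimate is **biased** (about 7.78 instead of 3), and the coefficient for X1 is also **biased**: it averages about 3.60 instead of its true value of 2. With these settings the intercept would be unbiased only if the mean of X2 equalled 0.4 times the mean of X1, for example if both variables had a mean of zero.

Conclusion: when the omitted predictor is correlated with an included predictor, the coefficient of the included predictor is biased as well.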
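
The size of both biases follows from the standard omitted-variable-bias algebra. The sketch below uses notation that is not in the original cells: $\beta_0, \beta_1, \beta_2$ are the true coefficients (3, 2, 4), $\mu_1, \mu_2$ are the means of $X_1$ and $X_2$ (both 2), and $\delta$ is the slope from regressing $X_2$ on $X_1$, which equals 0.4 here because both variances are 1.

$$
\delta = \frac{\mathrm{Cov}(X_1, X_2)}{\mathrm{Var}(X_1)} = \frac{0.4}{1} = 0.4
$$

$$
E[\hat{b}_1] = \beta_1 + \beta_2 \, \delta = 2 + 4 \times 0.4 = 3.6
$$

$$
E[\hat{b}_0] = \beta_0 + \beta_2 \, (\mu_2 - \delta \mu_1) = 3 + 4 \times (2 - 0.4 \times 2) = 7.8
$$

Both values agree with the simulated averages of roughly 3.60 and 7.78. Setting $\delta = 0$ recovers the uncorrelated case, where the slope is unbiased and the intercept becomes $3 + 4 \times 2 = 11$.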

### 2. Over-specified model

#### Perfect multicollinearity (#1)
- We randomly generate 100 data points for X1 following a normal distribution.
- We set X2 = 1 - X1

- We specify a true regression line: $y$ = 3 + 2*$X_1$ + 4*$X_2$ + e

- We fit the regression model as: y ~ X1 + X2 (with intercept)


In [9]:
X1 = np.random.randn(100)
X2 = 1 - X1
e = np.random.randn(100)

y = 3 + 2*X1 + 4*X2 + e

X = sm.add_constant(np.column_stack([X1,X2]))

model = sm.OLS(y,X).fit()

In [10]:
model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.847 |
| Model: | OLS | Adj. R-squared: | 0.846 |
| Method: | Least Squares | F-statistic: | 543.1 |
| Date: | Wed, 29 Jan 2025 | Prob (F-statistic): | 9.39e-42 |
| Time: | 21:42:42 | Log-Likelihood: | -137.42 |
| No. Observations: | 100 | AIC: | 278.8 |
| Df Residuals: | 98 | BIC: | 284.1 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 3.9652 | 0.069 | 57.135 | 0.000 | 3.827 | 4.103 |
| x1 | 0.9980 | 0.064 | 15.620 | 0.000 | 0.871 | 1.125 |
| x2 | 2.9672 | 0.044 | 68.140 | 0.000 | 2.881 | 3.054 |

| | | | |
|---|---|---|---|
| Omnibus: | 0.767 | Durbin-Watson: | 2.198 |
| Prob(Omnibus): | 0.682 | Jarque-Bera (JB): | 0.392 |
| Skew: | 0.124 | Prob(JB): | 0.822 |
| Kurtosis: | 3.18 | Cond. No. | 1.14e+16 |
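
The summary reports `Df Model: 1` and a condition number of about 1e16, both signs that the design matrix has lost a column of information: because X2 = 1 - X1, the model collapses to y = (b0 + b2) + (b1 - b2)*X1, so only those two combinations can be estimated. statsmodels fits OLS with a pseudoinverse by default, so instead of failing it returns the minimum-norm solution, which for the true values (b0 + b2 = 7, b1 - b2 = -2) is (4, 1, 3), close to the reported coefficients. The sketch below illustrates this with NumPy alone; the seed and the use of `np.linalg.lstsq` in place of `sm.OLS` are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, not from the original notebook
X1 = rng.normal(size=100)
X2 = 1 - X1                     # perfectly collinear with the constant and X1
y = 3 + 2 * X1 + 4 * X2 + rng.normal(size=100)
X = np.column_stack([np.ones(100), X1, X2])

# Three columns, but only two linearly independent directions
rank = np.linalg.matrix_rank(X)
print("columns:", X.shape[1], "rank:", rank)

# lstsq returns the minimum-norm least-squares solution for a rank-deficient matrix
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print("minimum-norm coefficients:", beta_hat)

# Only these combinations are identified by the data
print("const + b2 (true value 7):", beta_hat[0] + beta_hat[2])
print("b1 - b2 (true value -2):", beta_hat[1] - beta_hat[2])
```

The individual coefficients should therefore not be interpreted: any triple with the same two combinations fits the data equally well.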


#### Multicollinearity (#2)
- We randomly generate 100 correlated observations of X1 and X2 from a bivariate normal distribution and let X1 and X2 have different degrees of correlation (0, 0.3, 0.6, 0.9, 0.95, and 0.99).

- We specify a true regression line: $y$ = 2*$X_1$ + 4*$X_2$ + e (here e has a standard deviation of 2)

- We include X1 and X2 in our regression model: y ~ X1 + X2 (no intercept, for simplicity)

- For each preset correlation value, we repeat the simulation 1000 times to generate the sampling distribution of the regression estimates.
- We summarize the sampling distributions by their means and standard deviations.

In [11]:
#Write a function to return the sampling distribution of regression coefficients 
# based on different degree of correlation between X1 and X2
def simulation_multi_colinearity(cor):
    params = []
    for i in range(1000):
        means = [2,2]
        cov = [[1,cor], [cor,1]]
        X = np.random.multivariate_normal(means,cov,100)
        e = np.random.randn(100)*2
        y = 2*X[:,0] + 4*X[:,1] + e
        model = sm.OLS(y,X).fit()
        params.append(model.params)
    return params

In [12]:
sampling_dist_0 = simulation_multi_colinearity(0)
sampling_dist_0_3 = simulation_multi_colinearity(0.3)
sampling_dist_0_6 = simulation_multi_colinearity(0.6)
sampling_dist_0_9 = simulation_multi_colinearity(0.9)
sampling_dist_0_95 = simulation_multi_colinearity(0.95)
sampling_dist_0_99 = simulation_multi_colinearity(0.99)

In [13]:
print("SE of estimates when X1 and X2 have a cor of 0:", np.std(sampling_dist_0,axis=0))
print("SE of estimates when X1 and X2 have a cor of 0.3:", np.std(sampling_dist_0_3,axis=0))
print("SE of estimates when X1 and X2 have a cor of 0.6:", np.std(sampling_dist_0_6,axis=0))
print("SE of estimates when X1 and X2 have a cor of 0.9:", np.std(sampling_dist_0_9,axis=0))
print("SE of estimates when X1 and X2 have a cor of 0.95:", np.std(sampling_dist_0_95,axis=0))
print("SE of estimates when X1 and X2 have a cor of 0.99:", np.std(sampling_dist_0_99,axis=0))

SE of estimates when X1 and X2 have a cor of 0: [0.14975577 0.14674037]
SE of estimates when X1 and X2 have a cor of 0.3: [0.17521922 0.17300788]
SE of estimates when X1 and X2 have a cor of 0.6: [0.23954783 0.23633053]
SE of estimates when X1 and X2 have a cor of 0.9: [0.44319264 0.44149093]
SE of estimates when X1 and X2 have a cor of 0.95: [0.64226773 0.64220472]
SE of estimates when X1 and X2 have a cor of 0.99: [1.47670598 1.47910517]


In [14]:
print("Mean of estimates when X1 and X2 have a cor of 0:", np.mean(sampling_dist_0,axis=0))
print("Mean of estimates when X1 and X2 have a cor of 0.3:", np.mean(sampling_dist_0_3,axis=0))
print("Mean of estimates when X1 and X2 have a cor of 0.6:", np.mean(sampling_dist_0_6,axis=0))
print("Mean of estimates when X1 and X2 have a cor of 0.9:", np.mean(sampling_dist_0_9,axis=0))
print("Mean of estimates when X1 and X2 have a cor of 0.95:", np.mean(sampling_dist_0_95,axis=0))
print("Mean of estimates when X1 and X2 have a cor of 0.99:", np.mean(sampling_dist_0_99,axis=0))

Mean of estimates when X1 and X2 have a cor of 0: [2.0016277  4.00016232]
Mean of estimates when X1 and X2 have a cor of 0.3: [2.00211592 4.00058463]
Mean of estimates when X1 and X2 have a cor of 0.6: [2.00172266 4.00223066]
Mean of estimates when X1 and X2 have a cor of 0.9: [1.99749542 4.00501842]
Mean of estimates when X1 and X2 have a cor of 0.95: [1.99934875 3.99953973]
Mean of estimates when X1 and X2 have a cor of 0.99: [1.95444333 4.04792674]


We can observe that, with varying degrees of correlation between the two predictors in the model, the regression coefficients are still **unbiased**, but the standard errors of the estimates are **inflated**, and increasingly so as the correlation grows: from about 0.15 at a correlation of 0 to about 1.48 at a correlation of 0.99.
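
The inflation can be cross-checked against theory. For OLS the covariance of the estimates is sigma^2 * (X'X)^(-1); replacing X'X by its expectation gives a large-sample approximation. The sketch below assumes a hypothetical helper `theoretical_se` and plugs in the simulation settings (n = 100, error SD of 2, predictor means of 2, unit variances, no intercept). Because it is an approximation, it should land close to, not exactly on, the simulated SEs.

```python
import numpy as np

def theoretical_se(cor, n_obs=100, sigma=2.0, mean=2.0):
    """Approximate SEs of the two slopes in the no-intercept model (hypothetical helper)."""
    # E[X_i * X_j] = Cov(X_i, X_j) + mean_i * mean_j for each pair of predictors
    second_moment = np.array([[1.0 + mean**2, cor + mean**2],
                              [cor + mean**2, 1.0 + mean**2]])
    # Large-sample approximation: Var(beta_hat) ~ sigma^2 / n * E[x x']^(-1)
    cov_beta = sigma**2 / n_obs * np.linalg.inv(second_moment)
    return np.sqrt(np.diag(cov_beta))

for cor in [0, 0.3, 0.6, 0.9, 0.95, 0.99]:
    print("Approximate SE when X1 and X2 have a cor of", cor, ":", theoretical_se(cor))
```

At a correlation of 0 the approximation gives about 0.149 for both coefficients, in line with the simulated values, and it grows as the correlation approaches 1 because the second-moment matrix becomes nearly singular.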

#### Adding unrelated predictors (#3)
- We randomly generate 100 data points each for X1, X2, and X3 following a normal distribution.

- We specify a true regression line: $y$ = 3 + 2*$X_1$ + 4*$X_2$ + e

- We include X1, X2, and X3 in our regression model: y ~ X1 + X2 + X3 (with intercept). Here X3 does not belong in the model but is included anyway.

- We fit 1000 such regression models and check for bias.

In [15]:
estimates = []
for i in range(1000):
    X1 = np.random.randn(100)
    X2 = np.random.randn(100)
    X3 = np.random.randn(100)
    e = np.random.randn(100)

    y = 3 + 2*X1 + 4*X2 + e

    X = sm.add_constant(np.column_stack([X1,X2,X3]))
    model = sm.OLS(y, X).fit()
    estimates.append(model.params)

In [16]:
#The averages of estimates across the 1000 replications are:
print("Average of estimates:", np.mean(np.array(estimates),axis=0))
print("SE of estimates:", np.std(np.array(estimates),axis=0))

Average of estimates: [ 3.00559236e+00  2.00282613e+00  4.00539610e+00 -3.54773022e-03]
SE of estimates: [0.1031899  0.10409916 0.10071232 0.10225802]


We find the estimates of X1 and X2 are **unbiased**, and the standard errors are also quite small. The estimate for X3 is nearly **zero** on average, and if we check one model out of the 1000 (the last one fitted), we find the estimate is not statistically significant (p = 0.747). In this sense, including unrelated random data in the model does not do much harm here; the main cost is one residual degree of freedom per extra predictor.

In [17]:
model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.956 |
| Model: | OLS | Adj. R-squared: | 0.954 |
| Method: | Least Squares | F-statistic: | 692.0 |
| Date: | Wed, 29 Jan 2025 | Prob (F-statistic): | 7.37e-65 |
| Time: | 21:42:43 | Log-Likelihood: | -133.51 |
| No. Observations: | 100 | AIC: | 275.0 |
| Df Residuals: | 96 | BIC: | 285.4 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 2.9365 | 0.095 | 30.995 | 0.000 | 2.748 | 3.125 |
| x1 | 1.9564 | 0.104 | 18.897 | 0.000 | 1.751 | 2.162 |
| x2 | 4.1883 | 0.096 | 43.420 | 0.000 | 3.997 | 4.380 |
| x3 | 0.0342 | 0.106 | 0.323 | 0.747 | -0.176 | 0.244 |

| | | | |
|---|---|---|---|
| Omnibus: | 0.758 | Durbin-Watson: | 1.889 |
| Prob(Omnibus): | 0.685 | Jarque-Bera (JB): | 0.803 |
| Skew: | -0.01 | Prob(JB): | 0.669 |
| Kurtosis: | 2.562 | Cond. No. | 1.33 |
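
The summary above covers a single replication. One way to check the significance claim more systematically is to count how often X3 would be declared significant across all replications; with a correct test this should happen in roughly 5% of them. The sketch below makes several assumptions that are not in the original cells: a hypothetical helper `x3_rejection_rate`, an arbitrary seed, a hard-coded critical value of 1.985 (approximately the two-sided 5% cutoff of a t distribution with 96 degrees of freedom), and NumPy's least-squares solver in place of `sm.OLS`.

```python
import numpy as np

# Approximate two-sided 5% critical value for t with 96 df (assumption; valid for n_obs=100 only)
T_CRIT = 1.985

def x3_rejection_rate(n_reps=1000, n_obs=100, seed=0):
    """Share of replications in which the irrelevant X3 looks significant (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_reps):
        X1, X2, X3 = rng.normal(size=(3, n_obs))
        y = 3 + 2 * X1 + 4 * X2 + rng.normal(size=n_obs)  # X3 plays no role in y
        X = np.column_stack([np.ones(n_obs), X1, X2, X3])
        beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta_hat
        # Unbiased residual variance: divide by n minus the number of fitted coefficients
        sigma2_hat = resid @ resid / (n_obs - X.shape[1])
        cov_beta = sigma2_hat * np.linalg.inv(X.T @ X)
        t_x3 = beta_hat[3] / np.sqrt(cov_beta[3, 3])
        rejections += int(abs(t_x3) > T_CRIT)
    return rejections / n_reps

rate = x3_rejection_rate()
print("Share of replications where X3 looks significant:", rate)
```

A rate near 0.05 is what the nominal test level implies, so an occasional "significant" X3 in a single fit is expected and is not evidence that X3 belongs in the model.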
