What is regression: does collinearity matter?

In [1]:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
%matplotlib inline 

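class Model(object):
    """Linear model y = X.w + noise with zero-mean Gaussian features."""
    def __init__(self,cov,w):
        self.w = w      # true coefficients
        self.cov = cov  # covariance matrix of the two features
    def sample(self,samples):
        # Draw correlated features, then add unit-variance Gaussian noise to y.
        X = np.random.multivariate_normal([0,0],self.cov,size = samples)
        y = np.dot(X,self.w)+np.random.normal(size=samples)
        return X,y

    def fit_predict(self,ntrain,ntest):
        # Fit on a fresh training sample and score on a fresh test sample.
        m = LinearRegression()
        X_train,y_train = self.sample(ntrain)
        m.fit(X_train,y_train)
        X_test,y_test = self.sample(ntest)
        y_pred = m.predict(X_test)
        error = mean_squared_error(y_test,y_pred)
        # Return [test MSE, coef_0, coef_1].
        return np.concatenate(([error],m.coef_))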

In [2]:
orthogonal = Model([[1,0],[0,1]],[.2,.4])
colinear = Model([[1,.9],[.9,1]],[.2,.4])
orthogonal.fit_predict(200,100)


array([ 0.84401341,  0.14198801,  0.39801145])

In [13]:
# Compare what happens to prediction vs. estimating the coefficients.
# 1) try to learn the coefficients
# 2) try to predict y
orthogonal = Model([[1,0],[0,1]],[.2,.4])
colinear = Model([[1,.9],[.9,1]],[.2,.4])

ntrain = 200
ntest = 100
nsamples = 10000
results_o = np.zeros((nsamples,3))
results_c = np.zeros((nsamples,3))
for i in range(nsamples):
    results_o[i,:] = orthogonal.fit_predict(ntrain,ntest)
    results_c[i,:] = colinear.fit_predict(ntrain,ntest)

print("Means")
print(results_o.mean(axis=0))
print(results_c.mean(axis=0))
print("Standard deviations")
print(results_o.std(axis=0))
print(results_c.std(axis=0))


Means
[ 1.01569656  0.19905881  0.39988909]
[ 1.01340503  0.20006369  0.40067206]
Standard deviations
[ 0.14401949  0.0715076   0.07127321]
[ 0.14518729  0.16492828  0.1633644 ]


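The variance of the coefficient estimates (but not of the prediction error) is much higher in the presence of collinearity. The standard deviation of each coefficient estimate rises from about 0.07 to about 0.16, while the mean test error stays near 1.01 with a standard deviation close to 0.145 in both cases.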
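
As a rough cross-check on the simulated standard deviations, the sketch below uses the large-sample approximation Var(w_hat) ≈ sigma^2 * inv(cov) / ntrain for least squares with zero-mean Gaussian features. The helper name `coef_std` and the `noise_var` argument are illustrative assumptions, not part of the notebook, and the approximation ignores the fitted intercept and finite-sample corrections. Under those assumptions it gives roughly 0.071 for the orthogonal case and 0.162 for the correlated case, close to the simulated values.

```python
import numpy as np

def coef_std(cov, ntrain, noise_var=1.0):
    """Approximate std of least-squares coefficients (hypothetical helper).

    Assumes zero-mean Gaussian features with covariance `cov` and uses
    Var(w_hat) ~= noise_var * inv(cov) / ntrain, a large-sample approximation.
    """
    cov = np.asarray(cov, dtype=float)
    return np.sqrt(noise_var * np.diag(np.linalg.inv(cov)) / ntrain)

std_o = coef_std([[1, 0], [0, 1]], 200)    # orthogonal features
std_c = coef_std([[1, .9], [.9, 1]], 200)  # features with correlation 0.9

print(std_o)
print(std_c)
# Ratio equals sqrt(1 / (1 - 0.9**2)), the square root of the variance inflation factor.
print(std_c / std_o)
```

The ratio depends only on the correlation between the features, which is why the inflation shows up in the coefficients regardless of the true values of w.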

The issue is instability in the parameter estimates, which is only a problem if we care about the parameter estimates themselves. And why would we care about them if we are not giving them some causal interpretation?

Restate the problem so that the goal (and the loss) is expressed in terms of a specific parameter.
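
One possible way to make this concrete is sketched below: the loss is the squared error of the estimate of a single coefficient, averaged over repeated fits. The function `param_loss`, its arguments, the seed, and the number of repetitions are all illustrative assumptions, and `np.linalg.lstsq` with an explicit intercept column stands in for `LinearRegression` to keep the sketch dependent on NumPy only.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility

def param_loss(cov, w, ntrain=200, nreps=500, index=0):
    """Mean squared error of the estimate of w[index] over repeated fits.

    Hypothetical helper: the loss is defined on one parameter, not on y.
    """
    w = np.asarray(w, dtype=float)
    sq_errors = np.empty(nreps)
    for i in range(nreps):
        # Same data-generating process as Model.sample in the notebook.
        X = rng.multivariate_normal([0, 0], cov, size=ntrain)
        y = X @ w + rng.normal(size=ntrain)
        # Least squares with an intercept column (stand-in for LinearRegression).
        A = np.column_stack([np.ones(ntrain), X])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        sq_errors[i] = (coef[1 + index] - w[index]) ** 2
    return sq_errors.mean()

loss_o = param_loss([[1, 0], [0, 1]], [.2, .4])    # orthogonal features
loss_c = param_loss([[1, .9], [.9, 1]], [.2, .4])  # correlated features
print(loss_o, loss_c)
```

With the loss stated this way, the correlated design should score clearly worse than the orthogonal one, even though the two have nearly identical prediction error.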