In [2]:
import numpy as np
import matplotlib.pyplot as plt
import sklearn.linear_model

%matplotlib inline

You have a matrix $X$ of 100 samples, each with 100 input features, and a vector $Y$ of 100 outputs. Using this training data, you want to fit a linear model, $\hat y = a^T x$, that makes good predictions of $y$ for new inputs $x$.

In [3]:
# data generation code -- run this cell, but you don't have to read it
pts = 100
dims = 100
a0 = np.zeros(dims)
a0[[11, 71]] = 1, -1

def make_data(pts, dims, a0):
    X = np.random.normal(0, 10, (pts, dims))
    y0 = X @ a0
    noise = np.random.normal(0, 1, pts)
    Y = y0 + noise
    return X, Y

X_train, Y_train = make_data(pts, dims, a0)

**Question 1:** If you fit $a$ using all of $X$ and $Y$ and no regularization, how good will the fit to the **training** data set be? Why?

**Answer:** Very good. There are as many features as data points, so $X$ is square and (almost surely) invertible. Solving $a = X^{-1} Y$ therefore fits the training data exactly, noise included.
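As a quick check of this claim, the sketch below solves the square system directly. It regenerates data with the same recipe as `make_data`, but from a seeded generator (`rng` is an assumption added here for reproducibility), so the exact numbers will differ from the notebook's own `X_train` and `Y_train`.

```python
import numpy as np

# Assumption: a seeded generator for reproducibility; the notebook itself uses unseeded np.random.
rng = np.random.default_rng(0)
pts, dims = 100, 100
a0 = np.zeros(dims)
a0[[11, 71]] = 1, -1

X = rng.normal(0, 10, (pts, dims))
Y = X @ a0 + rng.normal(0, 1, pts)

# X is square and (almost surely) invertible, so X a = Y has an exact solution.
a_exact = np.linalg.solve(X, Y)
train_mse = np.mean((X @ a_exact - Y) ** 2)
print(f"training MSE: {train_mse:.2e}")
```

The training error should come out at the level of floating-point round-off, even though `Y` contains noise.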

**Question 2:** How well do you expect your model will perform on new, unseen data drawn from the same distribution? Why?

**Answer:** Very badly. With as many free parameters as data points, the model fits the noise in the training set exactly, so it overfits heavily and its error on new data will be far above the noise level.
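To see the gap between training and test error, the following sketch fits an unregularized model and evaluates it on fresh samples. The helper `make_data_seeded`, the seeded `rng`, and the test-set size of 1000 are assumptions introduced for this illustration; they are not part of the notebook.

```python
import numpy as np

# Assumption: seeded generator and a 1000-sample test set, chosen only for illustration.
rng = np.random.default_rng(0)
pts, dims = 100, 100
a0 = np.zeros(dims)
a0[[11, 71]] = 1, -1

def make_data_seeded(n, rng):
    """Same recipe as make_data, but drawing from a seeded generator."""
    X = rng.normal(0, 10, (n, dims))
    return X, X @ a0 + rng.normal(0, 1, n)

X_train, Y_train = make_data_seeded(pts, rng)
X_test, Y_test = make_data_seeded(1000, rng)

# Unregularized least squares on the training set.
a_unreg = np.linalg.lstsq(X_train, Y_train, rcond=None)[0]

train_mse = np.mean((X_train @ a_unreg - Y_train) ** 2)
test_mse = np.mean((X_test @ a_unreg - Y_test) ** 2)
print(f"train MSE: {train_mse:.2e}, test MSE: {test_mse:.2f}")
```

Since the noise has variance 1, a test MSE near 1 would be ideal; the unregularized fit should land well above that.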

**Question 3:** The function below fits a linear model using least squares and L1 regularization (lasso regression). How can you use this to produce a model that will generalize well? Why does regularization improve generalization?

**Answer:** Fit to the training data set with an appropriately chosen regularization strength. Regularization penalizes large coefficients, and the L1 penalty in particular drives many of them to exactly zero. The resulting simpler model cannot fit the noise in the training data, so it generalizes better, even at the expense of a higher training error.
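The sparsity effect can be seen by counting nonzero coefficients. This is a minimal sketch on freshly generated data with a seeded generator (an assumption for reproducibility); `alpha=2.0` is an illustrative regularization strength, not a tuned value.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Assumption: seeded generator; the notebook's data is drawn from unseeded np.random.
rng = np.random.default_rng(0)
pts, dims = 100, 100
a0 = np.zeros(dims)
a0[[11, 71]] = 1, -1
X = rng.normal(0, 10, (pts, dims))
Y = X @ a0 + rng.normal(0, 1, pts)

# Unregularized solution: fits the noise, so essentially every coefficient is nonzero.
a_unreg = np.linalg.solve(X, Y)

# Lasso solution: alpha=2.0 is an illustrative choice of regularization strength.
a_lasso = Lasso(alpha=2.0, fit_intercept=False).fit(X, Y).coef_

print("nonzero coefficients, no regularization:", np.count_nonzero(a_unreg))
print("nonzero coefficients, lasso:", np.count_nonzero(a_lasso))
print("lasso weights at the true features 11 and 71:", a_lasso[[11, 71]])
```

The lasso solution should keep the two true features, with weights shrunk slightly toward zero, and discard most of the others.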

In [4]:
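def linreg_lasso(X, Y, c=1e-6):
    """Fit a lasso model without intercept; return its coefficient vector.

    c is the L1 regularization strength (sklearn's `alpha`).
    """
    model = sklearn.linear_model.Lasso(alpha=c, fit_intercept=False)
    return model.fit(X, Y).coef_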

**Question 4:** Using whatever technique you think is appropriate, write code to find the amount of regularization that will give the best generalization. Use it to find the best regularization parameter and fit a model with that parameter.

**Answer:** One way is with $k$-fold cross-validation:

In [8]:
def compute_mse(a, X, Y):
    """Mean squared error of the linear prediction X @ a against Y."""
    return np.mean((X @ a - Y) ** 2)

def k_fold_cross_validation(X, Y, k=3, c=1e-6):
    """Mean validation MSE of lasso with strength c over k contiguous folds."""
    fold_size = len(X) // k  # any remainder samples are never used for validation
    errors = []
    for i in range(k):
        start, stop = i * fold_size, (i + 1) * fold_size
        # hold out fold i for validation, train on everything else
        val_X, val_Y = X[start:stop], Y[start:stop]
        train_X = np.concatenate([X[:start], X[stop:]])
        train_Y = np.concatenate([Y[:start], Y[stop:]])
        a = linreg_lasso(train_X, train_Y, c)
        errors.append(compute_mse(a, val_X, val_Y))
    return np.mean(errors)

In [13]:
cc = np.logspace(-1, 2, 10, base=10)
error_cv = [
    k_fold_cross_validation(X_train, Y_train, 3, c)
    for c in cc
]

In [15]:
cc, error_cv

(array([  0.1       ,   0.21544347,   0.46415888,   1.        ,
          2.15443469,   4.64158883,  10.        ,  21.5443469 ,
         46.41588834, 100.        ]),
 [3.019965383929488,
  2.4280967710414196,
  1.812653424362381,
  1.4175558422978771,
  1.3017527898034045,
  1.5628763081766586,
  3.137640480887894,
  10.341882182124822,
  43.554432361273605,
  169.5827501135839])

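The cross-validation error is lowest (about 1.30) at $c \approx 2.15$, so $c = 2$ or so will do the job. For comparison, the noise in the data has variance 1, so this error is not far above the best any model could achieve.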
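Question 4 also asks for a final model fit with the chosen parameter. One possible sketch uses scikit-learn's `LassoCV`, which is a swapped-in shortcut and not the hand-written cross-validation loop from this notebook: it runs the cross-validation over a grid of strengths and then refits on all the training data with the best one. The seeded generator, the helper `make_data_seeded`, and the 1000-sample test set are assumptions added for the illustration.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Assumption: seeded generator and fresh data, so results differ from the notebook's run.
rng = np.random.default_rng(0)
pts, dims = 100, 100
a0 = np.zeros(dims)
a0[[11, 71]] = 1, -1

def make_data_seeded(n, rng):
    """Same recipe as make_data, but drawing from a seeded generator."""
    X = rng.normal(0, 10, (n, dims))
    return X, X @ a0 + rng.normal(0, 1, n)

X_train, Y_train = make_data_seeded(pts, rng)
X_test, Y_test = make_data_seeded(1000, rng)

# Same grid of regularization strengths as in the notebook.
cc = np.logspace(-1, 2, 10)

# 3-fold cross-validation over the grid, then a refit on the full training set.
model = LassoCV(alphas=cc, cv=3, fit_intercept=False).fit(X_train, Y_train)
best_c = model.alpha_
a_best = model.coef_

test_mse = np.mean((X_test @ a_best - Y_test) ** 2)
print(f"best c: {best_c:.3g}, test MSE: {test_mse:.2f}")
print("weights at the true features 11 and 71:", a_best[[11, 71]])
```

Evaluating on a held-out test set that played no part in choosing `c` gives an honest estimate of how the final model generalizes.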