Pair Problem

Practice the Lasso regularization technique in four steps on the given data set. Note that there are functions (such as GridSearchCV) that will do a lot of the heavy lifting for you, but for the purpose of this pair problem don't use them: we want you to implement cross-validation manually!

1. Use the KFold function from sklearn to divide the data into 5 training/test sets.

2. Tune the alpha parameter in the lasso model by looping over a grid of possible $\alpha$s (sklearn: lasso)

   - For each candidate $\alpha$, loop over the 5 training/test sets.
   - On each training/test set run the lasso model on the training set and then compute and record the prediction error in the test set.
   - Finally total the prediction error for the 5 training/test sets.

3. Set $\alpha$ to be the value that minimizes prediction error.

4. Run the lasso model again with the optimal $\alpha$ determined in step 3. Which variables would you consider excluding on the basis of these results?
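
For reference, the quantity being tuned can be written out explicitly. The first line is the objective that scikit-learn's `Lasso` minimizes (intercept omitted for brevity); the second is the cross-validated error for a candidate $\alpha$. The symbols $n$ (number of training rows), $w$ (coefficient vector), $K$ (number of folds, here 5), $\mathcal{T}_k$ (test indices of fold $k$), and $\hat{w}^{(-k)}$ (coefficients fit without fold $k$) are notation introduced here, not part of the exercise text.

$$
\hat{w}(\alpha) = \arg\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
$$

$$
\mathrm{CV}(\alpha) = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{|\mathcal{T}_k|} \sum_{i \in \mathcal{T}_k} \left( y_i - x_i^\top \hat{w}^{(-k)}(\alpha) \right)^2,
\qquad
\alpha^{*} = \arg\min_{\alpha} \mathrm{CV}(\alpha)
$$

Totalling the per-fold errors instead of averaging them only multiplies $\mathrm{CV}(\alpha)$ by $K$, so both choices select the same $\alpha^{*}$.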

In [1]:
import numpy as np
import pandas as pd
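# Legacy import: sklearn.cross_validation was removed in scikit-learn 0.20;
# newer releases provide KFold in sklearn.model_selection with a different API.
from sklearn.cross_validation import KFold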
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn import preprocessing



In [2]:
df = pd.read_csv('Lasso_practice_data.csv')


In [12]:
y = df['y']
X = df.drop('y', axis=1)

In [17]:
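# Legacy API: the sample count goes to the constructor and kf is iterated directly.
# No random_state is set, so the shuffled folds differ from run to run.
kf = KFold(n=len(X), n_folds=5, shuffle=True)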
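
The `KFold` signature above belongs to the old `sklearn.cross_validation` module. A minimal sketch of the same 5-fold split with `sklearn.model_selection.KFold` is shown here; the array `X_demo` and the choice `random_state=0` are assumptions made for illustration, not part of the exercise.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical stand-in for the feature matrix: 20 rows, 3 columns.
X_demo = np.arange(60).reshape(20, 3)

# n_splits replaces n_folds, and the data is passed to split(), not the constructor.
# Fixing random_state makes the shuffled folds reproducible.
kf = KFold(n_splits=5, shuffle=True, random_state=0)

folds = list(kf.split(X_demo))
for train_idx, test_idx in folds:
    # Each fold holds out 4 of the 20 rows and trains on the other 16.
    print(len(train_idx), len(test_idx))
```

With this API the inner loop would iterate over `kf.split(X)` rather than over `kf` itself.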

In [29]:
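# Candidate alphas: powers of ten from 1e-12 to 1e4
alphas = [10**x for x in range(-12, 5)]
# One mean MSE per alpha; the (n, 1) shape is why the mse column prints in brackets
scores = np.zeros((len(alphas),1))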

In [64]:
for j in range(len(alphas)):
    kf_score = []
    for train, test in kf:
        X_train = X.iloc[train]
        y_train = y.iloc[train]
        X_test = X.iloc[test]
        y_test = y.iloc[test]
        # Fit the scaler on the training fold only, then apply it to the test fold
        std_scaler = preprocessing.StandardScaler()
        X_train_norm = std_scaler.fit_transform(X_train)
        X_test_norm = std_scaler.transform(X_test)

        model = Lasso(alpha=alphas[j])
        model.fit(X_train_norm, y_train)
        y_test_predict = model.predict(X_test_norm)
        # mean_squared_error expects (y_true, y_pred)
        score = mean_squared_error(y_test, y_test_predict)
        kf_score.append(score)
    # Average the held-out MSE over the 5 folds for this alpha
    scores[j] = np.mean(kf_score)
print('mse for all alpha values \n')
print(pd.DataFrame(list(zip(alphas, scores)), columns=['alpha', 'mse']))

mse for all alpha values 

           alpha              mse
0   1.000000e-12  [1.01336924417]
1   1.000000e-11  [1.01336924417]
2   1.000000e-10  [1.01336924412]
3   1.000000e-09  [1.01336924371]
4   1.000000e-08  [1.01336923957]
5   1.000000e-07  [1.01336919854]
6   1.000000e-06  [1.01336878545]
7   1.000000e-05   [1.0133646847]
8   1.000000e-04  [1.01332383867]
9   1.000000e-03  [1.01293200546]
10  1.000000e-02  [1.01078287584]
11  1.000000e-01  [1.08560564798]
12  1.000000e+00  [5.20589356445]
13  1.000000e+01  [14.1439768168]
14  1.000000e+02  [14.1439768168]
15  1.000000e+03  [14.1439768168]
16  1.000000e+04  [14.1439768168]
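
The manual cross-validation loop can also be sketched against the current `sklearn.model_selection` API. The version below is an illustration under stated assumptions: `manual_lasso_cv` is a hypothetical helper name, the data is synthetic (not `Lasso_practice_data.csv`), and the alpha grid is shortened to keep the run quick.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler


def manual_lasso_cv(X, y, alphas, n_splits=5, seed=0):
    """Return the mean held-out MSE for each alpha (hypothetical helper)."""
    # A fixed seed makes every alpha see the same folds.
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    mean_mse = []
    for alpha in alphas:
        fold_mse = []
        for train_idx, test_idx in kf.split(X):
            # Scale using training-fold statistics only to avoid leakage.
            scaler = StandardScaler().fit(X[train_idx])
            model = Lasso(alpha=alpha)
            model.fit(scaler.transform(X[train_idx]), y[train_idx])
            y_pred = model.predict(scaler.transform(X[test_idx]))
            fold_mse.append(mean_squared_error(y[test_idx], y_pred))
        mean_mse.append(np.mean(fold_mse))
    return np.array(mean_mse)


# Synthetic data (an assumption): only 3 of the 10 features affect y.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))
true_coef = np.array([2.0, -1.5, 1.0] + [0.0] * 7)
y_demo = X_demo @ true_coef + rng.normal(size=200)

demo_alphas = [10.0 ** k for k in range(-4, 3)]
demo_scores = manual_lasso_cv(X_demo, y_demo, demo_alphas)
demo_best_alpha = demo_alphas[int(demo_scores.argmin())]
print("best alpha:", demo_best_alpha)
```

Returning a one-dimensional array of scores keeps `argmin` and any later table of results free of the single-element brackets seen in the output above.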


In [65]:
best_alpha = alphas[scores.argmin()]
print ('best alpha is:', best_alpha)

best alpha is: 0.01


In [66]:
# Refit on the full data set with the selected alpha
std_scaler = preprocessing.StandardScaler()
X_norm = std_scaler.fit_transform(X)

final_model = Lasso(alpha=best_alpha)
final_model.fit(X_norm, y)
y_predict = final_model.predict(X_norm)
# In-sample MSE: computed on the same data the model was fit to
final_model_score = mean_squared_error(y, y_predict)
print('MSE of final model is:', final_model_score)

MSE of final model is: 0.991396697544


In [71]:
df_coef = pd.DataFrame(list(zip(X.columns, final_model.coef_)), columns = ['variable', 'coefficient'])
print ('Variables we would like to retain are')
print (df_coef[df_coef['coefficient']!=0])

Variables we would like to retain are
   variable  coefficient
1        x2    -1.792169
2        x3    -0.130112
4        x5    -0.009247
5        x6     1.844775
7        x8     0.002409
8        x9    -0.195843
9       x10     0.173621
11      x12    -0.007097
13      x14    -2.224042
14      x15     0.013011
15      x16     1.019130
16      x17     0.044092
17      x18    -0.017809
19      x20    -0.347023


In [72]:
print ('Variables we would like to drop are')
print (df_coef[df_coef['coefficient']==0])

Variables we would like to drop are
   variable  coefficient
0        x1         -0.0
3        x4          0.0
6        x7          0.0
10      x11         -0.0
12      x13         -0.0
18      x19          0.0
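
Coefficients that are exactly zero are the clearest candidates for exclusion, but the retained list above also contains coefficients very close to zero (for example x8 at 0.002409). One possible way to separate the two groups is sketched below. The data is synthetic, and the cut-off of 0.05 is an arbitrary, hypothetical value chosen only for illustration; comparing magnitudes like this is meaningful only because the features are standardized before fitting.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption): y depends on x1 and x4 only.
rng = np.random.default_rng(1)
X_demo = pd.DataFrame(rng.normal(size=(200, 6)),
                      columns=[f"x{i}" for i in range(1, 7)])
y_demo = 2.0 * X_demo["x1"] - 1.0 * X_demo["x4"] + rng.normal(size=200)

# Standardize, fit, and label each coefficient with its column name.
model = Lasso(alpha=0.1).fit(StandardScaler().fit_transform(X_demo), y_demo)
coef = pd.Series(model.coef_, index=X_demo.columns)

threshold = 0.05  # hypothetical cut-off for "negligible" coefficients
exact_zero = coef[coef == 0].index.tolist()
near_zero = coef[(coef != 0) & (coef.abs() < threshold)].index.tolist()
retained = coef[coef.abs() >= threshold].index.tolist()

print("exactly zero:", exact_zero)
print("non-zero but small:", near_zero)
print("retained:", retained)
```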
