# Regression on House Pricing Dataset: Variable Selection & Regularization
We consider a reduced version of a dataset containing house sale prices for King County, which includes Seattle. It includes homes sold between May 2014 and May 2015.

[https://www.kaggle.com/harlfoxem/housesalesprediction]

For each house we know 18 features (e.g., number of bedrooms, number of bathrooms, etc.) plus its price, which is the quantity we would like to predict.

## TO DO 1: insert your ID number ("numero di matricola") below

In [None]:
#put here your ``numero di matricola''
numero_di_matricola = # COMPLETE

In [None]:
#import all packages needed
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Load the data, remove data samples/points with missing values (NaN) and take a look at them.

In [None]:
#load the data
df = pd.read_csv('kc_house_data.csv', sep = ',')

#remove the data samples with missing values (NaN)
df = df.dropna() 

df.describe()

Extract input and output data. We want to predict the price (column 2 of the data) using the columns after it as input features; the columns that precede the price, such as id, are not used.

In [None]:
Data = df.values
# m = number of input samples
m = 3164
Y = Data[:m,2]
X = Data[:m,3:]

## Data Pre-Processing

Split the data into a training set of $m_{train}=50$ samples and a test set of $m_{test}:=m-m_{train}$ samples.

In [None]:
# Split data into train (50 samples) and test data (the rest)
m_train = 50

m_test = m - m_train 
from sklearn.model_selection import train_test_split

Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size=m_test/m, random_state=numero_di_matricola)


Standardize the data.

In [None]:
# Data pre-processing
from sklearn import preprocessing
scaler = preprocessing.StandardScaler().fit(Xtrain)
Xtrain_scaled = scaler.transform(Xtrain)
Xtest_scaled = scaler.transform(Xtest)
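
As a quick sanity check of the standardization step, the sketch below uses synthetic data (the arrays `X_tr` and `X_te` are made up for illustration and are not the house data). It shows that, because the scaler is fitted on the training set only, the training features end up with exactly zero mean and unit variance, while the test features are only approximately standardized.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for Xtrain / Xtest (illustrative only)
rng = np.random.default_rng(0)
X_tr = rng.normal(loc=5.0, scale=3.0, size=(50, 4))
X_te = rng.normal(loc=5.0, scale=3.0, size=(200, 4))

# Fit on the training data only, then apply the same transform to both sets
sc = StandardScaler().fit(X_tr)
X_tr_s = sc.transform(X_tr)
X_te_s = sc.transform(X_te)

print("train mean:", X_tr_s.mean(axis=0))  # numerically zero
print("train std: ", X_tr_s.std(axis=0))   # numerically one
print("test mean: ", X_te_s.mean(axis=0))  # close to zero, but not exactly
```

Fitting the scaler on the training data alone keeps any information about the test set out of the preprocessing.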

## Linear Regression with Squared Loss Solution

Now compute the solution for linear regression with squared loss (i.e., the Least-Squares estimate) using LinearRegression() in Scikit-learn, and print the corresponding average loss in training and test data.

Since the average loss can be quite high, we also compute the coefficient of determination $R^2$ and look at $1 - R^{2}$ to have an idea of what the average loss amounts to. To compute the coefficient of determination you can use the "score(...)" function.

In [None]:
# Least-Squares
from sklearn import linear_model 
#LR the linear regression model
LR = linear_model.LinearRegression()

#fit the model on training data
LR.fit(Xtrain_scaled, Ytrain)

#obtain predictions on training data
Ytrain_predicted = LR.predict(Xtrain_scaled)

#obtain predictions on test data
Ytest_predicted = LR.predict(Xtest_scaled)

#coefficients from the model
w_LR = np.hstack((LR.intercept_, LR.coef_))

#average error in training data
loss_train = np.linalg.norm(Ytrain - Ytrain_predicted)**2/m_train

#average error in test data
loss_test =np.linalg.norm(Ytest - Ytest_predicted)**2/m_test

#print average loss in training data and in test data
print("Average loss in training data:"+str(loss_train))
print("Average loss in test data:"+str(loss_test))

#print 1 - coefficient of determination in training data and in test data
print("1 - coefficient of determination on training data:"+str(1 - LR.score(Xtrain_scaled,Ytrain)))
print("1 - coefficient of determination on test data:"+str(1 - LR.score(Xtest_scaled,Ytest)))
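
The link between the average loss and $1 - R^2$ can be made explicit: on a given dataset, $1 - R^2$ equals the average squared loss divided by the variance of the targets. A minimal sketch on synthetic data (all names below are illustrative and not part of the assignment):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic linear data with a little noise (illustrative only)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=100)

model = LinearRegression().fit(X_demo, y_demo)
y_hat = model.predict(X_demo)

avg_loss = np.mean((y_demo - y_hat) ** 2)        # average squared loss
one_minus_r2 = 1 - model.score(X_demo, y_demo)   # 1 - R^2 from scikit-learn
ratio = avg_loss / np.var(y_demo)                # loss normalized by the target variance

print(one_minus_r2, ratio)  # the two numbers coincide up to rounding
```

This is why $1 - R^2$ is easier to read than the raw loss: it does not depend on the scale of the prices.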

### Confidence Intervals

We now compute the confidence interval for each coefficient.
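
For reference, under the usual Gaussian-noise assumption the $(1-\alpha)$ confidence interval for the $j$-th coefficient has the standard form below. Here $\tilde X$ denotes the training matrix with a leading column of ones, $d$ the number of features, and $\hat\sigma^2$ the unbiased estimate of the noise variance; this notation is introduced only for illustration.

$$
    \hat w_j \pm t_{1-\alpha/2,\; m_{train}-d-1}\; \hat\sigma \sqrt{\left[(\tilde X^\top \tilde X)^{-1}\right]_{jj}},
    \qquad
    \hat\sigma^2 = \frac{||y - \tilde X \hat w||^2_2}{m_{train}-d-1}
$$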

In [None]:
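# Least-Squares: Confidence Intervals
from scipy.stats import t

#design matrix with a leading column of ones for the intercept
Xtrain_intercept = np.hstack((np.ones((Xtrain_scaled.shape[0],1)), Xtrain_scaled))

#alpha for confidence intervals
alpha = 0.05

#d = number of features (the intercept is accounted for by the extra -1 below)
d = Xtrain_scaled.shape[1]

#quantile from Student's t distribution with m_train-d-1 degrees of freedom
tperc = t.ppf(1-alpha/2, m_train-d-1, loc=0, scale=1)
#unbiased estimate of the noise variance
sigma2 = np.linalg.norm(Ytrain-Ytrain_predicted)**2/(m_train-d-1)

R = np.dot(Xtrain_intercept.transpose(),Xtrain_intercept)
Ur, Sr, Vr = np.linalg.svd(R, full_matrices=True, compute_uv=True)

#pseudo-inverse of R: invert the singular values, zeroing out the numerically null ones
Sri = 1/Sr
Sri = Sri*(Sri<1e10)

Ri2 = np.dot(Ur,np.dot(np.diag(Sri),np.transpose(Ur)))

#half-width of the confidence interval for each coefficient (intercept first)
v = np.sqrt(np.diag(Ri2))
Delta = np.sqrt(sigma2)*v*tperc
CI = np.transpose(np.vstack((w_LR,w_LR))) + np.transpose(np.vstack((-Delta,+Delta) ))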

Plot the LS coefficients and their confidence intervals.

In [None]:
# Plot confidence
plt.figure(1)
plt.plot(w_LR[1:], 'r', marker='o', ms=7.0)
plt.plot(CI[1:,0], 'b--')
plt.plot(CI[1:,1], 'b--')
plt.plot(np.zeros(w_LR.shape[0]-1,), 'k', linewidth=2.0)
plt.xlabel('Coefficient Index')
plt.ylabel('LR Coefficient')
plt.title('Coefficients and Confidence Sets')
plt.show()

### Question: based on the results above, if you had to choose at most 4 features for a linear regression model, which ones would you choose? Why?

### TO DO 2
Answer the question above (max 5 lines)

## Best-Subset Selection

Split the (previous) training data (i.e., the 50 samples chosen above) into a training set and a validation set to perform best-subset selection. For splitting, put 50% of the data into the validation set.

For $k$ going from 1 to $n_{sub}=4$:
1. Compute the best model for all the possible subsets of $k$ features
2. Compute the prediction error on the validation dataset

Finally we choose the subset of $k^*$ features giving the lowest validation error.
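
To get a feeling for the cost of this exhaustive search, the sketch below counts how many models are fitted, assuming the 18 input features mentioned in the introduction (the variable names are illustrative):

```python
import math

n_features = 18   # number of input features, as stated in the introduction
n_sub = 4         # largest subset size considered

# number of subsets of size k, for k = 1, ..., n_sub
subsets_per_k = {k: math.comb(n_features, k) for k in range(1, n_sub + 1)}
total_models = sum(subsets_per_k.values())

print(subsets_per_k)   # {1: 18, 2: 153, 3: 816, 4: 3060}
print(total_models)    # 4047
```

The count grows quickly with $k$, which is why best-subset selection is only practical for small subset sizes.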


In [None]:
import itertools
import math 

m_trainBSS=int(math.ceil(m_train/2))
m_valBSS=m_train-m_trainBSS


Xtrain_BSS = Xtrain_scaled[:m_trainBSS,:]
Ytrain_BSS = Ytrain[:m_trainBSS]
Xval_BSS = Xtrain_scaled[m_trainBSS:,:]
Yval_BSS = Ytrain[m_trainBSS:,]

nsub = 4
features_idx_dict = {}
validation_err_dict = {}
validation_err_min = np.zeros(nsub,)
validation_err_min_idx = np.zeros(nsub, dtype=np.int64)
for k in range(1,nsub+1):
    features_idx = list(itertools.combinations(range(Xtrain_BSS.shape[1]),k))
    validation_error = np.zeros(len(features_idx),)
    for j in range(len(features_idx)):
        LR_subset = linear_model.LinearRegression()
        LR_subset.fit(Xtrain_BSS[:,features_idx[j]], Ytrain_BSS)
        validation_error[j] = np.linalg.norm(Yval_BSS - LR_subset.predict(Xval_BSS[:,features_idx[j]]))**2/m_valBSS 
    validation_err_min[k-1] = np.min(validation_error)    
    validation_err_min_idx[k-1] = np.argmin(validation_error)
    features_idx_dict.update({k: features_idx})
    validation_err_dict.update({k: validation_error})
    
print("Validation error as a function of k (starting at k=1): "+str(validation_err_min))

Plot the validation error as a function of the number of retained features.

In [None]:
# Plot
plt.figure(2)
for k in range(1,nsub+1):
    plt.scatter(k*np.ones(validation_err_dict[k].shape), validation_err_dict[k], color='k', alpha=0.5)
    #plt.scatter(k, validation_err_min[k-1], color='r', alpha=0.8)
    if k > 1:
        plt.plot([k-1, k], [validation_err_min[k-2], validation_err_min[k-1]], color='r',marker='o', 
            markeredgecolor='k', markerfacecolor = 'r', markersize = 10)
plt.xlabel('Number of retained features')
plt.ylabel('Avg. validation error')
plt.title('Best-Subset Selection')
plt.show()

Compute the model using the selected subset of features.

### TO DO 3: pick the number of features for the best subset according to figure above, learn the model on the entire training data (i.e., the 50 samples chosen at the beginning), and compute score on training and on test data

In [None]:
LR_best_subset = linear_model.LinearRegression()

# now pick the number of features according to best subset
opt_num_features = # COMPLETE

#opt_features_idx contains the indices of the features from best subset
opt_features_idx = # COMPLETE

#let's print the indices of the features from best subset
print(opt_features_idx)

#fit the best subset on the entire training set
LR_best_subset.fit(Xtrain_scaled[:,opt_features_idx], Ytrain)

#obtain predictions on training data
Ytrain_predicted_best_subset = # COMPLETE

#obtain predictions on test data
Ytest_predicted_best_subset = # COMPLETE

#average loss in training data
loss_train_best_subset = # COMPLETE

#average loss in test data
loss_test_best_subset = # COMPLETE

#print average loss in training data and in test data
print("Average loss in training data:"+str(loss_train_best_subset))
print("Average loss in test data:"+str(loss_test_best_subset))

#now print 1 - the coefficient of determination on training and on test data to get an idea of what
#the average loss corresponds to
print("1 - coefficient of determination of best subset on training data: "+str(1 - LR_best_subset.score(Xtrain_scaled[:,opt_features_idx],Ytrain)))
print("1 - coefficient of determination of best subset on test data: "+str(1 - LR_best_subset.score(Xtest_scaled[:,opt_features_idx],Ytest)))

### TO DO 4: do the features from best subset selection correspond to the ones you would have chosen based on confidence intervals for the linear regression coefficients? Comment (max 5 lines)

## Lasso

### TO DO 5
Use the routine *lasso_path* from *sklearn.linear_model* to compute the "lasso path" for different values of the regularization parameter $\lambda$. You should first fix a grid of possible values of lambda (the variable "lasso_lams"). For each entry of the vector "lasso_lams" you should compute the corresponding model (the i-th column of the matrix "lasso_coefs" should contain the coefficients of the linear model computed using lasso_lams[i] as regularization parameter).

Be careful that the grid should be chosen appropriately.

Note that the parameter $\lambda$ is called $\alpha$ in the Lasso model from sklearn
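
To illustrate how the shape of the output and the sparsity depend on the grid, here is a minimal sketch on synthetic data. The data, the grid range, and all variable names are assumptions made for illustration; an appropriate range for the house data depends on the scale of the prices and will differ.

```python
import numpy as np
from sklearn.linear_model import lasso_path

# Synthetic regression problem: only the first 2 of 8 features matter
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(60, 8))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=60)
y_demo = y_demo - y_demo.mean()  # lasso_path itself does not fit an intercept

# Log-spaced grid, from strong to weak regularization (illustrative range)
grid = np.logspace(2, -3, 30)

# coefs has shape (n_features, n_alphas); column i corresponds to alphas_out[i]
alphas_out, coefs, _ = lasso_path(X_demo, y_demo, alphas=grid)

# number of non-zero coefficients for each value of the regularization parameter
n_nonzero = np.count_nonzero(coefs, axis=0)
print(coefs.shape)
print(n_nonzero)
```

With a very large regularization parameter every coefficient is shrunk to zero, and more coefficients become non-zero as the parameter decreases, so a useful grid should cover both regimes.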


In [None]:
from sklearn.linear_model import lasso_path

# select a grid of possible regularization parameters 
# (be careful how this is chosen: you may have to refine the choice after having seen the results)

#Note: lasso_lams is supposed to be a numpy array
lasso_lams = # COMPLETE

# Use the function lasso_path to compute the "lasso path", passing in input the lambda values
# you have specified in lasso_lams
lasso_lams, lasso_coefs, _ = # COMPLETE

Evaluate the sparsity in the estimated coefficients as a function of the regularization parameter $\lambda$: to this purpose, compute the number of non-zero entries in the estimated coefficient vector.

In [None]:
l0_coef_norm = np.zeros(len(lasso_lams),)

for i in range(len(lasso_lams)):
    l0_coef_norm[i] = sum(lasso_coefs[:,i]!=0)


plt.figure(6)
plt.plot(lasso_lams, l0_coef_norm, marker='o', markersize=5)
plt.xlabel('Lambda')
plt.ylabel('Number of non-zero coefficients')
plt.title('Sparsity Degree')
plt.show()

### TO DO 6: explain the results you observe in the figure above (max 5 lines)

### TO DO 7: Use k-fold Cross-Validation to fix the regularization parameter

Use the scikit-learn built-in routine *Lasso* (from the *sklearn.linear_model* module) to compute the lasso coefficients.

Use *KFold* from *sklearn.model_selection* to split the data (i.e. Xtrain_scaled and Ytrain) into the desired number of folds.

Then set *lasso_lam_opt* to the chosen value of the regularization parameter.
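
As a reminder of how *KFold* partitions the data, the sketch below splits 50 placeholder samples into 5 folds (the array `X_demo` and the other names are illustrative). Each model is trained on 40 samples and validated on the remaining 10, and every sample is used for validation exactly once.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 50                      # same size as the training set above
X_demo = np.arange(n_samples).reshape(-1, 1)

kf_demo = KFold(n_splits=5)         # no shuffling: folds are contiguous blocks
val_sizes = []
seen = []
for train_idx, val_idx in kf_demo.split(X_demo):
    val_sizes.append(len(val_idx))
    seen.extend(val_idx.tolist())
    print("train:", len(train_idx), "validation:", val_idx[0], "...", val_idx[-1])

# every sample is used for validation exactly once
print(sorted(seen) == list(range(n_samples)))
```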

In [None]:
from sklearn.model_selection import KFold
num_folds = 5

kf = KFold(n_splits = num_folds)

#loss_lasso_kfold will contain the value of the loss
loss_lasso_kfold = np.zeros(len(lasso_lams),)

for i in range(len(lasso_lams)):
    
    #define a lasso model using Lasso() for the i-th value of lasso_lams
    lasso_kfold = # COMPLETE
    for train_index, validation_index in kf.split(Xtrain_scaled):
        Xtrain_kfold, Xval_kfold = Xtrain_scaled[train_index], Xtrain_scaled[validation_index]
        Ytrain_kfold, Yval_kfold = Ytrain[train_index], Ytrain[validation_index]
    
        #learn the model using the training data from the k-fold
        
        # ADD CODE
        
        #compute the loss using the validation data from the k-fold

        # ADD CODE
    
# loss_lasso_kfold should be the average loss observed in the folds
loss_lasso_kfold /= num_folds


#choose the regularization parameter that minimizes the loss
lasso_lam_opt = # COMPLETE
print("Best value of the regularization parameter:", lasso_lam_opt)

Plot the Cross-Validation estimate of the prediction error as a function of the regularization parameter

In [None]:
plt.figure(4)
plt.xscale('log')
plt.plot(lasso_lams, loss_lasso_kfold, color='b')
plt.scatter(lasso_lams[np.argmin(loss_lasso_kfold)], loss_lasso_kfold[np.argmin(loss_lasso_kfold)], color='b', marker='o', linewidths=5)
plt.xlabel('Lambda')
plt.ylabel('Validation Error')
plt.title('Lasso: choice of regularization parameter')
plt.show()
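#note: lasso_kfold is the last model fitted in the loop above (last value in lasso_lams, last fold),
#not necessarily the model corresponding to lasso_lam_opt
print("Total number of coefficients:"+str(len(lasso_kfold.coef_)))
print("Number of non-zero coefficients:"+str(sum(lasso_kfold.coef_ != 0)))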
print("Best value of regularization parameter:"+str(lasso_lam_opt))


### TO DO 8 now estimate the lasso coefficients using all the training data and the optimal regularization parameter (chosen at previous step)

In [None]:
# Estimate Lasso coefficients with all the training data (train + validation) for the optimal value lasso_lam_opt of the regularization parameter

#define the model
lasso_reg = # COMPLETE

#fit using the training data

# ADD CODE

#average loss on training data
loss_train_lasso = # COMPLETE
#average loss on test data
loss_test_lasso = # COMPLETE

#print average loss in training data and in test data
print("Average loss in training data:"+str(loss_train_lasso))
print("Average loss in test data:"+str(loss_test_lasso))

#now print 1 - the coefficient of determination on training and on test data to get an idea of what
#the average loss corresponds to
print("1 - coefficient of determination on training data:"+str(1 - lasso_reg.score(Xtrain_scaled,Ytrain)))
print("1 - coefficient of determination on test data:"+str(1 - lasso_reg.score(Xtest_scaled,Ytest)))

Compare the LR and the Lasso coefficients.

In [None]:
# Compare LR and lasso coefficients
ind = np.arange(1,len(LR.coef_)+1)  # the x locations for the groups
width = 0.35       # the width of the bars
fig, ax = plt.subplots()
rects1 = ax.bar(ind, LR.coef_, width, color='r')
rects2 = ax.bar(ind + width, lasso_reg.coef_, width, color='y')
ax.legend((rects1[0], rects2[0]), ('LR', 'Lasso'))
plt.xlabel('Coefficient Idx')
plt.ylabel('Coefficient Value')
plt.title('LR and Lasso Coefficient')
plt.show()

## Ridge Regression

## TO DO 9
### Use Ridge regression with cross-validation

We perform Ridge regression (i.e., linear regression with squared loss and L2 regularization) for different values of the regularization parameter $\alpha$ (called $\lambda$ in class), and use the Scikit-learn function to perform cross-validation (CV).

In Ridge regression for scikit learn, the objective function is:

$$
    ||y - Xw||^2_2 + \alpha * ||w||^2_2
$$

In the code below:
- use RidgeCV() to select the best value of $\alpha$ with a 5-fold CV with L2 penalty;
- use Ridge() to learn the best model for the best $\alpha$ for ridge regression using the entire training set (i.e., the 50 samples chosen at the beginning)

Note that RidgeCV() picks some default values of $\alpha$ to try, but we decide to pass the same values used for the Lasso.
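
As a sanity check on the objective written above, the sketch below compares its closed-form minimizer with the coefficients returned by *Ridge* on synthetic data. The data and variable names are illustrative, and `fit_intercept=False` is an assumption made so that the model matches the formula exactly (the formula has no intercept term).

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data (illustrative only)
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(50, 5))
y_demo = X_demo @ np.array([1.0, 0.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=50)

alpha = 10.0

# Closed-form minimizer of ||y - Xw||^2 + alpha * ||w||^2 (no intercept)
w_closed = np.linalg.solve(X_demo.T @ X_demo + alpha * np.eye(5), X_demo.T @ y_demo)

# Same problem solved by scikit-learn
w_sklearn = Ridge(alpha=alpha, fit_intercept=False).fit(X_demo, y_demo).coef_
print(np.allclose(w_closed, w_sklearn, atol=1e-6))

# Ridge shrinks the coefficient vector relative to ordinary least squares
w_ols = np.linalg.lstsq(X_demo, y_demo, rcond=None)[0]
print(np.linalg.norm(w_sklearn) < np.linalg.norm(w_ols))
```

Unlike the Lasso, this shrinkage does not set coefficients exactly to zero, which is worth keeping in mind when comparing the coefficient plots.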




In [None]:
#let's define the values of alpha to use
ridge_alphas = # COMPLETE

#define the model using RidgeCV passing the vector of alpha values and the cv value (= number of folds)
ridge = # COMPLETE

#fit the model on training data

# ADD CODE

# the attribute 'alpha_' contains the best value of alpha as identified by cross-validation;
# let's print it

print("Best value of parameter alpha according to 5-fold Cross-Validation: "+str(ridge.alpha_))

#define the model using the best alpha; note that various solvers are available, choose
# an appropriate one
ridge_final = # COMPLETE

#fit the model using the best alpha on the entire training set

# ADD CODE

#average loss on training data
loss_train_ridge = # COMPLETE

#average loss on test data
loss_test_ridge = # COMPLETE

#print average loss in training data and in test data
print("Average loss in training data:"+str(loss_train_ridge))
print("Average loss in test data:"+str(loss_test_ridge))

#now print 1 - the coefficient of determination on training and on test data to get an idea of what
#the average loss corresponds to
print("1 - coefficient of determination on training data:"+str(1 - ridge_final.score(Xtrain_scaled,Ytrain)))
print("1 - coefficient of determination on test data:"+str(1 - ridge_final.score(Xtest_scaled,Ytest)))

Compare LR, Lasso, and Ridge regression coefficients

In [None]:
# Compare LR, lasso, and ridge coefficients
ind = np.arange(1,len(LR.coef_)+1)  # the x locations for the groups
width = 0.25       # the width of the bars
fig, ax = plt.subplots()
rects1 = ax.bar(ind, LR.coef_, width, color='r')
rects2 = ax.bar(ind + width, lasso_reg.coef_, width, color='y')
rects3 = ax.bar(ind + 2*width, ridge_final.coef_, width, color='b')
ax.legend((rects1[0], rects2[0], rects3[0]), ('LR', 'Lasso', 'Ridge'))
plt.xlabel('Coefficient Idx')
plt.ylabel('Coefficient Value')
plt.title('LR, Lasso, and Ridge Coefficient')
plt.show()

## TO DO 10: comment on the coefficients obtained by the different methods and their comparison (max 5 lines)



## Comparison of models: evaluation of the performance on the test set



In [None]:
print("Average loss of LR on test data:"+str(loss_test))
print("Average loss of LR with subset selection on test data:"+str(loss_test_best_subset))
print("Average loss of LASSO on test data:"+str(loss_test_lasso))
print("Average loss of Ridge regression on test data:"+str(loss_test_ridge))

print("1 - coefficient of determination of LR on test data:"+str(1 - LR.score(Xtest_scaled,Ytest)))
print("1 - coefficient of determination of LR with best subset on test data: "+str(1 - LR_best_subset.score(Xtest_scaled[:,opt_features_idx],Ytest)))
print("1 - coefficient of determination of LASSO on test data:"+str(1 - lasso_reg.score(Xtest_scaled,Ytest)))
print("1 - coefficient of determination of Ridge regression on test data:"+str(1 - ridge_final.score(Xtest_scaled,Ytest)))

## TO DO 11: comment and compare the results obtained by the different methods (max 5 lines)

## TO DO 12: using your final model of choice (write which one you choose), which features seem most relevant for the prices of houses? Does this match your intuition?

### SUGGESTION (not compulsory): repeat the entire analysis above using a different data size, and try to understand the differences that you observe

