# Regression with Ridge and Lasso

In this part of the assignment, you need to predict the price of the house (i.e., column 4 of the CSV file) from the features 'len', 'width', and 'rooms' (i.e., the first 3 columns of the CSV file). You can use ``LinearRegression``, ``Ridge``, and ``Lasso`` from ``sklearn.linear_model``. If you feel the need to expand the features to polynomials (say degree 2), you can either transform the CSV file manually or use ``PolynomialFeatures`` from ``sklearn.preprocessing``. Adding polynomial features can improve the results, but you have to be careful about overfitting.


In [1]:
# Standard includes
%matplotlib inline
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
# Routines for linear regression
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, mean_absolute_error
# Set label size for plots
matplotlib.rc('xtick', labelsize=14) 
matplotlib.rc('ytick', labelsize=14)

In [2]:
data = np.genfromtxt('LandPriceTrain.csv', delimiter=',')
features = ['len', 'width', 'rooms']
x_train = data[:,0:3] # predictors
y_train = data[:,3] # response variable

In [3]:
data = np.genfromtxt('LandPriceTest.csv', delimiter=',')
x_test = data[:,0:3] # predictors
y_test = data[:,3] # response variable

### 1. What is the best we can achieve if we have no predictors and only response (house price) values in the training data? What will be the mean error?

In [4]:
### START CODE HERE ###
from sklearn.linear_model import LinearRegression

y_pred = np.mean(y_train)
print ("Prediction: ", y_pred)
print ("Mean squared error: ", np.var(y_train))
print ("Mean error: ", np.mean(np.abs(y_pred-y_train)))

Prediction:  84779.45
Mean squared error:  3210718511.6474996
Mean error:  44200.84
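
The same baseline can be expressed with scikit-learn's ``DummyRegressor``, which always predicts the training mean. The following is a minimal sketch on made-up prices (``y_demo`` and ``x_demo`` are hypothetical stand-ins, not the assignment data). It also checks that the mean squared error of the mean predictor equals the variance of the response, which is why ``np.var(y_train)`` is used above.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Hypothetical stand-in for y_train; the real values come from LandPriceTrain.csv
y_demo = np.array([50000.0, 70000.0, 90000.0, 130000.0])
x_demo = np.zeros((len(y_demo), 1))  # the dummy model ignores the features

# Always predict the mean of the training response
baseline = DummyRegressor(strategy="mean").fit(x_demo, y_demo)
y_hat = baseline.predict(x_demo)

mse = mean_squared_error(y_demo, y_hat)
mae = mean_absolute_error(y_demo, y_hat)
print("Prediction: ", y_hat[0])
print("Mean squared error: ", mse)   # equals np.var(y_demo)
print("Mean absolute error: ", mae)
```

Any model that uses the features should beat these numbers; otherwise the features add nothing over predicting the mean.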


### 2. Let's now use the features and see what we can observe  

In [5]:

def feature_subset_regression(x,y,flist):
    if len(flist) < 1:
        print ("Need at least one feature")
        return
    for f in flist:
        if (f < 0) or (f > 2):
            print ("Feature index is out of bounds")
            return
    ### COMPLETE CODE BELOW by creating an instance of LinearRegression
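    # Note: the `normalize` parameter was removed in scikit-learn 1.2,
    # so this call assumes an older scikit-learn version.
    regr = LinearRegression(fit_intercept=True, normalize=True, copy_X=True)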
    regr.fit(x[:,flist], y)
    return regr

In [11]:
flist = [0,1,2]
regr = feature_subset_regression(x_train,y_train,flist)
print ("w = ", regr.coef_)
print ("b = ", regr.intercept_)
print ("Mean squared error (train): ", mean_squared_error(y_train, regr.predict(x_train[:,flist])))
print ("Mean error (train): ", mean_absolute_error(y_train, regr.predict(x_train[:,flist])))
print ("Mean squared error (test): ", mean_squared_error(y_test, regr.predict(x_test[:,flist])))
print ("Mean error (test): ", mean_absolute_error(y_test, regr.predict(x_test[:,flist])))

w =  [ 3010.83212779  2914.47821951 -2420.72225879]
b =  -77044.07528278609
Mean squared error (train):  217514383.15439206
Mean error (train):  12629.08643427138
Mean squared error (test):  158706743.78641912
Mean error (test):  9999.972138004105


### 3. It seems we are underfitting, as both the train and test errors are high. Let's try to use polynomial features.

Try incorporating polynomial features (say of degree 2) and see how you perform on the train and the test set. You can either transform the CSV file manually or use the ``PolynomialFeatures`` from ``sklearn.preprocessing``.

In [12]:
### START CODE HERE ###
# Expand the features, fit the linear regression, and report the results
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(2)
x_train_poly = poly.fit_transform(x_train)
# Reuse the transformer fitted on the training data instead of refitting on the test set
x_test_poly = poly.transform(x_test)
print(np.shape(x_train_poly))

(20, 10)


In [14]:
### UPDATE THE CODE BELOW ###
# Caution: flist = [0,1,2] selects the bias, 'len' and 'width' columns of x_train_poly,
# while the test predictions below are made on x_test (columns 'len', 'width', 'rooms'),
# so the columns used for training and testing do not match.
flist = [0,1,2]
regr = feature_subset_regression(x_train_poly,y_train,flist)
print ("w = ", regr.coef_)
print ("b = ", regr.intercept_)
print ("Mean squared error (train): ", mean_squared_error(y_train, regr.predict(x_train_poly[:,flist])))
print ("Mean error (train): ", mean_absolute_error(y_train, regr.predict(x_train_poly[:,flist])))
print ("Mean squared error (test): ", mean_squared_error(y_test, regr.predict(x_test[:,flist])))
print ("Mean error (test): ", mean_absolute_error(y_test, regr.predict(x_test[:,flist])))

w =  [   0.         3087.2179578  2942.24531245]
b =  -88410.28873001739
Mean squared error (train):  224311218.08118543
Mean error (train):  12908.28793341165
Mean squared error (test):  8446371439.880243
Mean error (test):  82436.86160080953


### 4. It seems we are overfitting as the train error is significantly lower than the test error. Let's try some regularization techniques.

### Ridge Regression

In [15]:
from sklearn.linear_model import Ridge

In [16]:
def feature_subset_ridge(x,y,flist, alp):
    if len(flist) < 1:
        print ("Need at least one feature")
        return
    for f in flist:
        if (f < 0) or (f > 8):
            print ("Feature index is out of bounds")
            return
    ### COMPLETE CODE BELOW by creating an instance of Ridge, be careful of the parameters
    regr = Ridge(alp, fit_intercept=True, normalize=True, copy_X=True)
    regr.fit(x[:,flist], y)
    return regr
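
The ``normalize=True`` argument used in these helper functions exists only in older scikit-learn releases (it was removed in version 1.2). On newer versions, one possible substitute is a pipeline that scales the features explicitly before the regularized model. The sketch below is an assumption, not the notebook's method: ``make_poly_ridge`` is a hypothetical helper, the data is synthetic, and ``StandardScaler`` scales by the standard deviation rather than the l2-norm used by ``normalize=True``, so the coefficients and a suitable ``alpha`` will differ from the values reported in this notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

def make_poly_ridge(alpha, degree=2):
    """Hypothetical helper: polynomial expansion, standardization, then Ridge."""
    return make_pipeline(
        PolynomialFeatures(degree, include_bias=False),  # Ridge fits its own intercept
        StandardScaler(),                                # put all terms on a common scale
        Ridge(alpha=alpha, fit_intercept=True),
    )

# Synthetic stand-in for (x_train, y_train): 20 rows, 3 features
rng = np.random.default_rng(0)
x_demo = rng.uniform(1.0, 10.0, size=(20, 3))
y_demo = 1000.0 * x_demo[:, 0] * x_demo[:, 1] + rng.normal(0.0, 100.0, size=20)

model = make_poly_ridge(alpha=0.05).fit(x_demo, y_demo)
print(model.predict(x_demo[:3]))
```

Because the scaler is fitted inside the pipeline, calling ``model.predict`` on test data applies the training-set scaling automatically. The same pattern works with ``Lasso`` in place of ``Ridge``.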

In [17]:
flist = [0,1,2,3,4,5,6,7,8]
regr = feature_subset_ridge(x_train_poly,y_train,flist, 0.05)
print ("w = ", regr.coef_)
print ("b = ", regr.intercept_)
print ("Mean squared error (train): ", mean_squared_error(y_train, regr.predict(x_train_poly[:,flist])))
print ("Mean error (train): ", mean_absolute_error(y_train, regr.predict(x_train_poly[:,flist])))
print ("Mean squared error (test): ", mean_squared_error(y_test, regr.predict(x_test_poly[:,flist])))
print ("Mean error (test): ", mean_absolute_error(y_test, regr.predict(x_test_poly[:,flist])))

w =  [   0.          671.22857011  873.0746094  2763.47289999    7.59290655
   55.65823143   13.74225506   18.5524101  -161.24098778]
b =  -27692.892418302057
Mean squared error (train):  46947651.7286587
Mean error (train):  5668.600791740504
Mean squared error (test):  69716998.73251502
Mean error (test):  6606.276700825309


### Lasso Regression

In [18]:
from sklearn.linear_model import Lasso

In [19]:
def feature_subset_lasso(x,y,flist, alp):
    if len(flist) < 1:
        print ("Need at least one feature")
        return
    for f in flist:
        if (f < 0) or (f > 8):
            print ("Feature index is out of bounds")
            return
    ### COMPLETE CODE BELOW by creating an instance of Lasso, be careful of the parameters
    regr = Lasso(alp, fit_intercept=True, normalize=True, copy_X=True)
    regr.fit(x[:,flist], y)
    return regr

In [20]:
flist = [0,1,2,3,4,5,6,7,8]
regr = feature_subset_lasso(x_train_poly,y_train,flist, 1150)
print ("w = ", regr.coef_)
print ("b = ", regr.intercept_)
print ("Mean squared error (train): ", mean_squared_error(y_train, regr.predict(x_train_poly[:,flist])))
print ("Mean error (train): ", mean_absolute_error(y_train, regr.predict(x_train_poly[:,flist])))
print ("Mean squared error (test): ", mean_squared_error(y_test, regr.predict(x_test_poly[:,flist])))
print ("Mean error (test): ", mean_absolute_error(y_test, regr.predict(x_test_poly[:,flist])))

w =  [ 0.          0.          0.         -0.          0.         91.30207043
  0.          1.1332545   0.        ]
b =  5854.543815015029
Mean squared error (train):  65329533.3564923
Mean error (train):  6799.294735785111
Mean squared error (test):  55470019.35749112
Mean error (test):  5565.061009110918


## Document your observation and understanding

Add your observations and understanding:


With plain linear regression on the three original features, both the training and test errors were high, which suggests underfitting. We therefore increased the model's complexity by expanding the features to degree-2 polynomials.

After the degree-2 polynomial transformation, we found that the test error was significantly higher than the training error.
Given this disparity, the model appears to be "memorizing" the training data because of its larger number of features. Regularization deals with overfitting by controlling the complexity of the learned function: it adds a penalty to the cost function that grows with the magnitude of the feature weights.
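
For reference, the penalized cost functions minimized by scikit-learn's ``Ridge`` and ``Lasso`` can be written as follows, where $X$ is the feature matrix, $y$ the response, $w$ the weights, $b$ the intercept, $n$ the number of training samples, and $\alpha$ the regularization strength. The symbols are chosen here for illustration; the notebook itself does not define them.

```latex
\text{Ridge:}\quad \min_{w,\,b}\ \lVert y - Xw - b\mathbf{1} \rVert_2^2 + \alpha \lVert w \rVert_2^2
\qquad
\text{Lasso:}\quad \min_{w,\,b}\ \frac{1}{2n} \lVert y - Xw - b\mathbf{1} \rVert_2^2 + \alpha \lVert w \rVert_1
```

The two objectives scale the data-fit term differently, and the features are rescaled first when ``normalize=True`` is set, so the ``alpha`` values used for Ridge and Lasso in this notebook are not directly comparable.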

After Ridge regularization, we still overfit slightly: the test error is still higher than the training error, but by a much smaller margin.

Next we used Lasso regression, which performs feature selection by driving the weights of unnecessary features to zero, hoping to reduce the complexity further.
This gave us the desired outcome:
    a) reduced error compared to linear regression
    b) a test error that is lower than the training error.
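
The feature-selection effect of Lasso can be read directly from the fitted coefficients: the selected features are the ones with non-zero weight. A minimal sketch on synthetic data follows (``x_demo`` and ``y_demo`` are made up, and only the first of five features actually drives the response):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: the response depends only on feature 0
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(200, 5))
y_demo = 3.0 * x_demo[:, 0] + rng.normal(0.0, 0.1, size=200)

lasso = Lasso(alpha=1.0, fit_intercept=True).fit(x_demo, y_demo)

# Indices of the features that kept a non-zero weight
selected = np.flatnonzero(lasso.coef_)
print("coefficients: ", lasso.coef_)
print("selected feature indices: ", selected)
```

The weight of the surviving feature is also shrunk below its true value of 3, which is the price paid for the l1 penalty. Applied to the Lasso fit above, the same ``np.flatnonzero(regr.coef_)`` call would list the two polynomial columns that kept non-zero weights.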
    

# Additional Questions (Optional)

1. Implement the closed-form solution for Ridge
2. Implement the iterative solution (Gradient Descent) for Ridge
3. Implement the iterative solution for Lasso
4. Use the sklearn linear_model.ElasticNet and try on the above problem.

Compare your implemented solutions with the built-in solutions on the above problem.
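
As a starting point for the first two questions, here is one possible sketch of the closed-form and gradient-descent solutions for Ridge, checked against ``sklearn.linear_model.Ridge`` on synthetic data. The function names, the synthetic data, and the step-size choice are assumptions made for illustration. The comparison is against ``Ridge`` without ``normalize=True``, so the numbers will not match the notebook's Ridge results unless the features are rescaled in the same way.

```python
import numpy as np
from sklearn.linear_model import Ridge

def ridge_closed_form(x, y, alpha):
    """Minimize ||y - Xw - b||^2 + alpha*||w||^2 with an unpenalized intercept."""
    x_mean, y_mean = x.mean(axis=0), y.mean()
    xc, yc = x - x_mean, y - y_mean            # centering removes the intercept
    gram = xc.T @ xc + alpha * np.eye(x.shape[1])
    w = np.linalg.solve(gram, xc.T @ yc)       # w = (Xc^T Xc + alpha*I)^-1 Xc^T yc
    b = y_mean - x_mean @ w
    return w, b

def ridge_gradient_descent(x, y, alpha, n_iter=500):
    """Same objective, minimized by gradient descent with a fixed step size."""
    x_mean, y_mean = x.mean(axis=0), y.mean()
    xc, yc = x - x_mean, y - y_mean
    # Step size 1/L, where L is the Lipschitz constant of the gradient
    lipschitz = 2.0 * (np.linalg.eigvalsh(xc.T @ xc).max() + alpha)
    w = np.zeros(x.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * (xc.T @ (xc @ w - yc) + alpha * w)
        w -= grad / lipschitz
    return w, y_mean - x_mean @ w

# Synthetic, well-conditioned data (not the assignment data)
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(100, 3))
y_demo = x_demo @ np.array([2.0, -1.0, 0.5]) + 4.0 + rng.normal(0.0, 0.1, size=100)

alpha = 1.0
w_cf, b_cf = ridge_closed_form(x_demo, y_demo, alpha)
w_gd, b_gd = ridge_gradient_descent(x_demo, y_demo, alpha)
ref = Ridge(alpha=alpha, fit_intercept=True).fit(x_demo, y_demo)

print("closed form:      ", w_cf, b_cf)
print("gradient descent: ", w_gd, b_gd)
print("sklearn Ridge:    ", ref.coef_, ref.intercept_)
```

On badly scaled inputs such as raw polynomial features, gradient descent with a fixed step can converge very slowly, so standardizing the columns first is usually worthwhile.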