**Exercise 1: (5 points) You are given a dataset having more variables than observations. Assuming that
there seems to be a linear relationship between the target variable and the input variables in the
dataset, why ordinary least squares (OLS) is a bad option to estimate the model parameters?
Which technique would be best to use? Why?**

Since the number of input variables is larger than the number of observations in the dataset, there is no longer a unique least squares coefficient and the variance is infinite. So the least squares method cannot be used at all. Also, least squares is very unlikely to yield any coefficients of that are exactly 0, which doesn't allow us to use it as a variable selection procedure. I would use lasso in this situation in the hopes of eliminating unnecessary variables that add complexity and aren't significant when creating the actual models.

**Exercise 2: (5 points) For Ridge regression, if the regularization parameter, λ, is equal to 0, what are the
implications?**

(d) All of the above. Because when λ = 0 the penalty term has no effect and ridge regression will produce the least squares estimates. So  regularization isn't technically used at all. Therefore, since it doesn't change the model, it doesn't really account for overfitting either.

**Exercise 3: (5 points) For Lasso Regression, if the regularization parameter, λ, is very high, which options are
true? Select all that apply.**


(f) (a) and (b). It can be used to select important features of a dataset and shrinks the coefficients of less important features to exactly 0.

**Exercise 4:
An important theoretical result of statistics and Machine Learning is the fact that model’s generalization error can be expressed as the sum of two very different errors:**
- **Bias: This part of the generalization error is due to wrong assumptions, such as assuming that the data is linear when it is actually quadratic. A high-bias model is most likely to under-fit the training data.**
- **Variance: This part is due to the model’s excessive sensitivity to small variations in the training data. A model with many degrees of freedom (such as a high-degree polynomial model) is likely to have high variance and thus overfit the training data.**

**(5 points) Suppose you are using Ridge Regression and you notice that the training error and
the validation error are almost equal and fairly high. Would you say that the model suffers from
high bias or high variance? Should you increase the regularization parameter, λ, or reduce it?**

When the training error and validation error are high and close to eachother that means your model is underfitting (high bias). To fix it, you should reduce the regularization parameter λ.

**Exercise 5: Consider the CarPrice Assignment.csv data file. This data is public available on the Kaggle
website, and has information on cars (characteristics related to car dimensions, engine and more).
The goal is to use car information to predict the price of the car. In Python, answer the following:**

(a) (5 points) Load the data file to you S3 bucket. Using the pandas library, read the csv data file and create a data-frame called car price.

In [1]:
import boto3
import pandas as pd; pd.set_option('display.max_column', 100)
import numpy as np

from sklearn.linear_model import LinearRegression, Ridge, RidgeCV, Lasso, LassoCV
from sklearn.model_selection import train_test_split


## Defining the s3 bucket
s3 = boto3.resource('s3')
bucket_name = 'craig-shaffer-data-445-bucket'
bucket = s3.Bucket(bucket_name)

## Defining the file to be read from s3 bucket
file_key = 'CarPrice_Assignment.csv'

bucket_object = bucket.Object(file_key)
file_object = bucket_object.get()
file_content_stream = file_object.get('Body')

# reading the datafile
car_price = pd.read_csv(file_content_stream)
car_price.head()

Unnamed: 0,car_ID,symboling,CarName,fueltype,aspiration,doornumber,carbody,drivewheel,enginelocation,wheelbase,carlength,carwidth,carheight,curbweight,enginetype,cylindernumber,enginesize,fuelsystem,boreratio,stroke,compressionratio,horsepower,peakrpm,citympg,highwaympg,price
0,1,3,alfa-romero giulia,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495.0
1,2,3,alfa-romero stelvio,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500.0
2,3,1,alfa-romero Quadrifoglio,gas,std,two,hatchback,rwd,front,94.5,171.2,65.5,52.4,2823,ohcv,six,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500.0
3,4,2,audi 100 ls,gas,std,four,sedan,fwd,front,99.8,176.6,66.2,54.3,2337,ohc,four,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950.0
4,5,2,audi 100ls,gas,std,four,sedan,4wd,front,99.4,176.6,66.4,54.3,2824,ohc,five,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450.0


(b) (15 points) Using the wheelbase, enginesize, compressionratio, horsepower, peakrpm, citympg, and highwaympg as the predictor variables, and price is the target variable. Do the following:

- Split the data into train (80%) and test (20%)
- Using the train dataset: 
 - Estimate the optimal lambda using default values for lambda in scikit-learn and 5-folds. Make sure to normalize the data (normalize = True).
 - Perform LASSO as a variable selector (using the optimal lambda from previous step (i)). Make sure to normalize the data (normalize = True).

Repeat steps (1) and (2) 1000 times. Store the estimated model coefficients of each iteration
in a data-frame. Remove the variables, whose estimated coefficients is 0 more than 500
times, from the training and testing datasets.


In [5]:
#stopping warnings for 
import warnings
warnings.simplefilter(action= 'ignore', category=FutureWarning)

In [7]:
# Defining input and target variables
x= car_price[['wheelbase','enginesize','compressionratio','horsepower','peakrpm','citympg','highwaympg']]
y= car_price['price']

#list to store coefficients
coef = list()

#for loop to estimate optimal lambda
for i in range(0,1000):
    #split into train and test
    x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2)
    
    #extracting best lambda with lasso cross-validation
    lasso_cv = LassoCV(normalize = True, cv = 5).fit(x_train, y_train)
    
    #building lasso
    lasso_md = Lasso(alpha = lasso_cv.alpha_, normalize = True).fit(x_train,y_train)
    
    #storing estimated coefficients
    coef.append(lasso_md.coef_)

#turning the list to dataframe
coef_data = pd.DataFrame(coef)  
coef_data

Unnamed: 0,0,1,2,3,4,5,6
0,223.165194,103.614078,304.414921,47.010222,2.004323,-220.500569,72.397113
1,201.389639,113.825106,323.300015,33.561209,1.951058,-170.015359,-0.000000
2,159.111488,121.836182,345.312057,54.329357,2.834021,-77.549650,-0.000000
3,192.430545,114.456982,264.725505,45.118464,1.653040,-93.741814,-0.000000
4,175.665494,111.820618,321.911830,54.230756,1.964926,-77.485174,-0.000000
...,...,...,...,...,...,...,...
995,186.004969,103.929205,370.892198,48.426278,2.195520,-127.772405,-0.000000
996,146.917755,111.574212,297.133158,49.730411,1.880853,-95.656910,-1.383831
997,231.706283,94.191458,276.908226,55.459976,1.681358,-109.400307,-0.000000
998,153.982471,111.699259,299.222390,42.932017,1.880002,-92.961329,-14.183885


In [10]:
zeros = (coef_data ==0).sum()
zeros

0      0
1      0
2      1
3      0
4      1
5     11
6    730
dtype: int64

We should drop highwaympg (6) because it has 730 coefficients that equal zero

(c) (5 points) Split the data into train (80%) and test (20%). Then, normalize the inputs variables of the train and test datasets using the L2 normalization. That is, for each input variable subtract the mean of that variable, then divide by the L2-norm of that variable.


In [11]:
#Dropping highwaympg
x_train = x_train.drop(columns = ['highwaympg'], axis=1)
x_test = x_test.drop(columns = ['highwaympg'], axis=1)

#split into train and test
x_train,x_test,y_train,y_test = train_test_split(x, y, test_size= 0.2)

#defining l2 normalization and applying it to test and train
def l2_normalization(x):
    x_mean = np.mean(x)
    l2 = np.sqrt(sum(x**2))
    return (x - x_mean) / l2

x_train = x_train.apply(l2_normalization, axis=1)
x_test = x_test.apply(l2_normalization, axis=1)

(d) (5 points) Using the train dataset, build a linear regression model. After that, use this model to predict on the test dataset. Report the MSE of this model.


In [12]:
#linear regression
lm_md = LinearRegression().fit(x_train,y_train)

#predicting on test
lm_pred = lm_md.predict(x_test)

#computing mse of the lm model
mse1 = np.mean(np.power(y_test-lm_pred,2))
print('the mse of the model is',mse1)

the mse of the model is 8667662.164883392


(e) (10 points) Using the train dataset, build a Ridge regression model as follows:
- Using the train dataset, estimate the optimal lambda from the following set [0.001, 0.01, 0.1, 1, 10, 100] and using 5-folds.
- Repeat (i) 100, store the optimal lambda of each iteration.

Using the most common lambda of the 100 optimal lambdas and the train dataset, build a Ridge regression model. After that, use this model to predict on the test dataset. Report the MSE of this model.

In [15]:
ridge = list()
for k in range(0,100):
    #ridge regression
    ridge_cv = RidgeCV(alphas = [0.001,0.01,0.1,1,10], cv = 5).fit(x_train,y_train)

    #extract lambda
    cv_lambda = ridge_cv.alpha_
    
    #storing lambdas
    ridge.append(cv_lambda)
#find most common/optimal lambda
import statistics as st
op_lambda=st.mode(ridge)
print('the most common lambda is',op_lambda)


the most common lambda is 0.001


In [21]:
#build the ridge regression model
ridge_md = Ridge(alpha = op_lambda).fit(x_train,y_train)

#predicting on test
ridge_pred = ridge_md.predict(x_test)

#computing the mse of the ridge regregression model
mse2 = np.mean(np.power(y_test-ridge_pred,2))
print('the mse of the ridge model is ',mse2)

the mse of the ridge model is  8860780.161636842


(f) (5 points) Using the results from parts (d) and (e), what model would you use to predict car prices? Explain.

In [22]:
print('the MSE of the linear regression model is:', mse1)
print('the MSE of the ridge regression model is:', mse2)

the MSE of the linear regression model is: 8667662.164883392
the MSE of the ridge regression model is: 8860780.161636842


Based on the results from part (d) and (e), I would use the linear regression model because the mse for it was lower than the mse for ridge regression model.