In [4]:
## Q1 You are given a dataset having more variables than observations. Assuming that
## there seems to be a linear relationship between the target variable and the input variables in the
## dataset, why ordinary least squares (OLS) is a bad option to estimate the model parameters?
## Which technique would be best to use? Why?

## OLS is a poor option here because, with more variables than observations, the OLS solution is not
## unique: the model can fit the training data perfectly, and small changes in the data set cause
## large changes in the coefficients, so the predicted values have high variance.
## Ridge regression is a good alternative for these situations, as it shrinks the coefficient values towards zero.
## Fitting the model over a range of lambda values and choosing lambda by cross-validation accepts a little
## bias in exchange for a large reduction in the variance of the predictions.
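
## A minimal sketch of the point above, using synthetic data (the sizes, the seed and
## lam = 1.0 are assumptions for illustration only). With more variables than
## observations the matrix X'X that OLS must invert is rank deficient, while adding
## lam * I (the ridge penalty) makes it full rank, so the ridge solution is unique.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_vars = 5, 10                      # more variables than observations
X = rng.normal(size=(n_obs, n_vars))
y = rng.normal(size=n_obs)

# OLS needs to invert X'X, which can have rank at most n_obs here.
gram = X.T @ X
gram_rank = np.linalg.matrix_rank(gram)

# Ridge adds lam to every eigenvalue, so the matrix becomes invertible.
lam = 1.0                                  # hypothetical penalty value
ridge_gram = gram + lam * np.eye(n_vars)
ridge_rank = np.linalg.matrix_rank(ridge_gram)
ridge_coef = np.linalg.solve(ridge_gram, X.T @ y)

print('rank of X\'X:', gram_rank, 'of', n_vars)
print('rank with ridge penalty:', ridge_rank, 'of', n_vars)
```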

In [5]:
## Q2 For Ridge regression, if the regularization parameter, λ, is equal to 0, what are the implications?

## (d) all of the above

## (a) Large coefficients in the linear model are not penalized & 
## (b) Overfitting problems are not accounted for &
## (c) The objective function is the same as ordinary least squares objective function.
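
## A small check of (c) on synthetic data (an illustration, not part of the assignment;
## ridge_closed_form is a hypothetical helper without an intercept). With lambda = 0 the
## closed-form ridge coefficients match the least squares solution, and a large lambda
## shrinks the coefficient vector.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Ridge coefficients from the normal equations (no intercept)."""
    n_vars = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_vars), X.T @ y)

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

# Ordinary least squares via NumPy's least squares solver.
ols_coef, *_ = np.linalg.lstsq(X, y, rcond=None)

ridge_zero = ridge_closed_form(X, y, lam=0.0)    # no penalty
ridge_big = ridge_closed_form(X, y, lam=100.0)   # heavy penalty

print('OLS          :', ols_coef)
print('ridge, lam=0 :', ridge_zero)
print('ridge, lam=100:', ridge_big)
```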

In [6]:
## Q3 For Lasso Regression, if the regularization parameter, λ, is very high, which options are
## true? Select all that apply

## (f) (a) and (b)

## (a) Can be used to select important features of a dataset &
## (b) Shrinks the coefficients of less important features to exactly 0

## although I also think that if the regularization parameter gets too high, the model will
## underfit, as the coefficients of important variables will also be shrunk to zero
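
## A sketch of why a very high lambda removes every variable. It assumes orthonormal
## inputs and the objective (1/2)*||y - Xb||^2 + lambda*||b||_1, in which case each LASSO
## coefficient is the soft-thresholded OLS coefficient. The coefficient values below are
## made up for illustration.

```python
import numpy as np

def soft_threshold(z, lam):
    """Shrink z towards zero by lam and set anything smaller than lam to zero."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

# Hypothetical OLS coefficients: two large (important), two small.
ols_coef = np.array([5.0, -3.0, 0.4, -0.1])

moderate = soft_threshold(ols_coef, 1.0)    # small coefficients become exactly 0
extreme = soft_threshold(ols_coef, 10.0)    # every coefficient becomes 0

print('moderate lambda:', moderate)
print('very high lambda:', extreme)
```

## With the moderate lambda only the two large coefficients survive (feature selection);
## with the very high lambda nothing survives, which is the underfitting case.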

In [7]:
## Q4 Suppose you are using Ridge Regression and you notice that the training error and
## the validation error are almost equal and fairly high. Would you say that the model suffers from
## high bias or high variance? Should you increase the regularization parameter, λ, or reduce it?

## I would say the model suffers from high bias: the model is not flexible enough to accurately
## fit the training data or predict the validation data.
## To let the model fit the data more closely, the regularization parameter, λ, should be decreased.
## This shrinks the coefficients less, so the model can follow the data better. If λ is too high
## the model underfits and shows high bias but low variance.
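
## A sketch of this diagnosis on synthetic data (the data, the lambda grid and the helper
## names are assumptions for illustration). As lambda grows, the training error can only
## go up, and at very large lambda the training and validation errors are both high,
## which is the high-bias pattern described in the question.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge coefficients (no intercept, for brevity)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def mse(X, y, coef):
    return float(np.mean((X @ coef - y) ** 2))

rng = np.random.default_rng(2)
X = rng.normal(size=(80, 5))
y = X @ np.array([3.0, -2.0, 1.0, 0.0, 0.5]) + rng.normal(scale=0.5, size=80)
X_tr, X_val, y_tr, y_val = X[:60], X[60:], y[:60], y[60:]

lambdas = [0.0, 1.0, 10.0, 100.0, 1000.0]
train_mse, val_mse = [], []
for lam in lambdas:
    coef = ridge_fit(X_tr, y_tr, lam)
    train_mse.append(mse(X_tr, y_tr, coef))
    val_mse.append(mse(X_val, y_val, coef))

for lam, tr, va in zip(lambdas, train_mse, val_mse):
    print(f'lambda = {lam:7.1f}  train MSE = {tr:8.3f}  validation MSE = {va:8.3f}')
```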

In [8]:
## Q5a Load the data file to your S3 bucket. Using the pandas library, read the csv data
## file and create a data-frame called car_price.

import pandas as pd
import boto3
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LassoCV, Lasso, Ridge, RidgeCV

## Defining the S3 bucket
s3 = boto3.resource('s3')
bucket_name = 'bonnieh-data-445-bucket'
bucket = s3.Bucket(bucket_name)

## Defining the file to be read from s3 bucket
file_key = 'CarPrice_Assignment.csv'

bucket_object = bucket.Object(file_key)
file_object = bucket_object.get()
file_content_stream = file_object.get('Body')

## Reading the csv file
car_price = pd.read_csv(file_content_stream)
car_price.head()

Unnamed: 0,car_ID,symboling,CarName,fueltype,aspiration,doornumber,carbody,drivewheel,enginelocation,wheelbase,...,enginesize,fuelsystem,boreratio,stroke,compressionratio,horsepower,peakrpm,citympg,highwaympg,price
0,1,3,alfa-romero giulia,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495.0
1,2,3,alfa-romero stelvio,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500.0
2,3,1,alfa-romero Quadrifoglio,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500.0
3,4,2,audi 100 ls,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950.0
4,5,2,audi 100ls,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450.0


In [9]:
## Q5b Using the wheelbase, enginesize, compressionratio, horsepower, peakrpm, citympg, and highwaympg as the 
## predictor variables, and price is the target variable. Do the following:

## Defining the variables
X = car_price[['wheelbase', 'enginesize', 'compressionratio', 'horsepower', 'peakrpm', 'citympg', 'highwaympg']]
Y = car_price['price']

## defining a list to store the coefficients from each run
## (appending to a list and converting it to a dataframe once at the end is more
## efficient than growing a dataframe inside the loop)
var_coef = []

for i in range (0, 1000):

    ## Split the data into train (80%) and test (20%)
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2)

    ## Using the train dataset:
    ## Estimate the optimal lambda using default values for lambda in scikit-learn and 5-folds. 
    ## Make sure to normalize the data (normalize = True).
    ## NOTE: the normalize argument was removed in scikit-learn 1.2, so this needs an older version.
    ## The lambda is estimated on the train data only, never on the test data.
    lasso_cv = LassoCV(normalize = True, cv = 5).fit(X_train, Y_train)

    ## extracting the optimal alpha (lambda)
    cv_alpha = lasso_cv.alpha_
    
    ## Perform LASSO as a variable selector (using the optimal lambda from the previous step).
    ## Make sure to normalize the data (normalize = True).
    lasso_md = Lasso(alpha = cv_alpha, normalize = True).fit(X_train, Y_train)
    
    ## append the coefficients to the list
    var_coef.append(lasso_md.coef_)
    
var_coef_df= pd.DataFrame(var_coef, columns = ['wb', 'es','cr','hp', 'prpm', 'cmpg', 'hmpg'])
var_coef_df.head()

Unnamed: 0,wb,es,cr,hp,prpm,cmpg,hmpg
0,158.28779,103.786591,288.692944,66.937028,1.642054,-89.367206,-0.0
1,199.657603,106.141763,277.849249,54.046312,1.994572,-68.220976,0.0
2,209.292338,105.485367,279.722051,50.531343,1.966438,-0.0,-107.923572
3,207.653831,109.332169,273.034923,47.710945,1.904311,-95.080879,-0.0
4,221.683069,103.859431,246.666863,56.217672,1.959814,-54.427753,-0.0


In [10]:
## determining which variables have a coefficient of zero at least 500 of the 1000 runs

## counting the number of times there is a non-zero number for each column
var_coef_df.astype(bool).sum(axis=0)

## highwaympg is the only variable with fewer than 500 non-zero coefficients (it is non-zero
## in only 247 runs, i.e. equal to zero 753 times)

wb      1000
es      1000
cr      1000
hp      1000
prpm     999
cmpg     980
hmpg     247
dtype: int64
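
## The selection rule can also be applied in code rather than by reading the counts.
## A minimal sketch, with made-up coefficients and variable names (not results from the
## car price data): keep a variable only if its coefficient is non-zero in at least half
## of the runs.

```python
import numpy as np

# Hypothetical coefficients from four LASSO runs over three variables.
names = ['var_a', 'var_b', 'var_c']
coefs = np.array([
    [1.5, 0.0, -0.2],
    [1.7, 0.0, -0.1],
    [1.4, 0.3,  0.0],
    [1.6, 0.0, -0.3],
])

# Share of runs in which each coefficient is non-zero.
nonzero_share = (coefs != 0).mean(axis=0)

# Keep a variable only if it is non-zero in at least half of the runs.
keep = [name for name, share in zip(names, nonzero_share) if share >= 0.5]
drop = [name for name in names if name not in keep]

print('keep:', keep)
print('drop:', drop)
```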

In [11]:
## dropping the variables with fewer than 500 non-zero coefficients
## (only the inputs are affected; the target Y has no highwaympg column)
X_train = X_train.drop(columns = ['highwaympg'])
X_test = X_test.drop(columns = ['highwaympg'])

## since the train/test split is redone below, this step is not strictly needed here;
## highwaympg is dropped again from the new train and test inputs

In [12]:
## Q5d   Split the data into train (80%) and test (20%). Then, normalize the inputs
## variables of the train and test datasets using the L2 normalization. That is, for each input
## variable subtract the mean of that variable, then divide by the L2-norm of that variable.

## splitting data into train and test

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2)

## dropping highway mpg from X_train and X_test 

X_train = X_train.drop(columns = ['highwaympg'])
X_test = X_test.drop(columns = ['highwaympg'])

X_train.head()


Unnamed: 0,wheelbase,enginesize,compressionratio,horsepower,peakrpm,citympg
124,95.9,156,7.0,145,5000,19
55,95.3,70,9.4,101,6000,17
77,93.7,92,9.4,68,5500,31
158,95.7,110,22.5,56,4500,34
139,93.7,108,8.7,73,4400,26


In [13]:
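## l2 normalization of the input variables (never the target!!)

def l2_normalization(x):
    ## x is one input variable (one column of the dataframe)
    x_mean = np.mean(x)
    l2 = np.sqrt(np.sum(x**2))
    
    return (x - x_mean) / l2

## axis = 0 applies the function to each column (variable); axis = 1 would normalize each row instead.
## NOTE: the outputs recorded below came from a run with axis = 1 and need to be regenerated.
X_train = X_train.apply(l2_normalization, axis = 0)
X_test = X_test.apply(l2_normalization, axis = 0)
X_train.head()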

Unnamed: 0,wheelbase,enginesize,compressionratio,horsepower,peakrpm,citympg
124,-0.161406,-0.149399,-0.179166,-0.151597,0.818338,-0.176769
55,-0.15886,-0.163075,-0.173171,-0.15791,0.824921,-0.171905
77,-0.158482,-0.158791,-0.173804,-0.163153,0.824109,-0.169878
158,-0.157084,-0.153908,-0.17334,-0.165901,0.82102,-0.170786
139,-0.156983,-0.153736,-0.176288,-0.161685,0.821051,-0.172359
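
## A sketch of the per-variable normalization on plain arrays (illustrative values; the
## helper names are hypothetical). It also reuses the train mean and L2-norm for the test
## data, which is a common practice and an assumption here, not what the cell above does.

```python
import numpy as np

def fit_l2_stats(X_train):
    """Column means and column L2-norms, computed on the train data only."""
    mean = X_train.mean(axis=0)
    l2 = np.sqrt((X_train ** 2).sum(axis=0))
    return mean, l2

def apply_l2(X, mean, l2):
    """Subtract the column mean, then divide by the column L2-norm."""
    return (X - mean) / l2

# Two input variables (columns), three train rows and one test row.
X_train = np.array([[95.9, 156.0], [95.3, 70.0], [93.7, 92.0]])
X_test = np.array([[95.7, 110.0]])

mean, l2 = fit_l2_stats(X_train)
X_train_norm = apply_l2(X_train, mean, l2)
X_test_norm = apply_l2(X_test, mean, l2)

print(X_train_norm)
print(X_test_norm)
```

## After normalizing by column, each train variable has mean zero, which is a quick
## way to check that the function was applied along the right axis.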


In [17]:
## Q5e Using the train dataset, build a linear regression model. After that, use this model
## to predict on the test dataset. Report the MSE of this model.

## building the model

lm_md = LinearRegression().fit(X_train, Y_train)

## predicting on the test data

preds = lm_md.predict(X_test)

## calculating the MSE for the predictions from the linear model

mse = np.mean((preds-Y_test)**2)

print ('The MSE for the linear model is', mse)


The MSE for the linear model is 8238405.247646711


In [19]:
## Q5f Using the train dataset, build a Ridge regression model as follows:
## (i) Using the train dataset, estimate the optimal lambda from the following set [0.001,
## 0.01, 0.1, 1, 10, 100] and using 5-folds.
## (ii) Repeat (i) 100 times, storing the optimal lambda of each iteration.

## Using the most common lambda of the 100 optimal lambdas and the train dataset, build a
## Ridge regression model. After that, use this model to predict on the test dataset. Report
## the MSE of this model.

## defining a list to store the ridge values
ridge_alpha_iter = []

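## setting up the Ridge cross-validation with 100 rounds of determining the best lambda.
## The grid includes 0.001, as the question asks; the run recorded below used a grid without it.
## NOTE: with cv = 5 the folds are not shuffled, so every round sees the same folds and
## returns the same lambda.

for i in range(0, 100):
    ridge_cv = RidgeCV(alphas = [0.001, 0.01, 0.1, 1, 10, 100], cv = 5).fit(X_train, Y_train)
    ridge_alpha = ridge_cv.alpha_
    
    ridge_alpha_iter.append(ridge_alpha)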

from collections import Counter

Counter(ridge_alpha_iter)

## the lambda (alpha) value selected the most was 0.01 (selected all 100 times)


Counter({0.01: 100})
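
## Every round returned the same lambda because an integer cv gives the same unshuffled
## folds each time. A sketch of one way to make the repetitions differ, written in plain
## NumPy on synthetic data so it is self-contained: reshuffle the folds on every round.
## The helpers ridge_fit and best_lambda are hypothetical, and this is not the RidgeCV
## implementation.

```python
import numpy as np
from collections import Counter

def ridge_fit(X, y, lam):
    """Closed-form ridge coefficients (no intercept, for brevity)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def best_lambda(X, y, lambdas, n_folds, rng):
    """Return the lambda with the lowest mean validation MSE over shuffled folds."""
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    mean_mse = []
    for lam in lambdas:
        fold_mse = []
        for val_idx in folds:
            train_mask = np.ones(len(y), dtype=bool)
            train_mask[val_idx] = False
            coef = ridge_fit(X[train_mask], y[train_mask], lam)
            fold_mse.append(np.mean((X[val_idx] @ coef - y[val_idx]) ** 2))
        mean_mse.append(np.mean(fold_mse))
    return lambdas[int(np.argmin(mean_mse))]

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))
y = X @ np.array([2.0, 0.0, -1.0, 0.5]) + rng.normal(scale=1.0, size=100)

lambdas = [0.001, 0.01, 0.1, 1, 10, 100]
chosen = [best_lambda(X, y, lambdas, n_folds=5, rng=rng) for _ in range(100)]

counts = Counter(chosen)
most_common_lambda = counts.most_common(1)[0][0]
print(counts)
print('most common lambda:', most_common_lambda)
```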

In [20]:
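## building the Ridge model with the most common lambda from the previous cell

best_alpha = Counter(ridge_alpha_iter).most_common(1)[0][0]
ridge_md = Ridge(alpha = best_alpha).fit(X_train, Y_train)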

## predicting on the test data

preds_ridge = ridge_md.predict(X_test)

## calculating the MSE

mse_ridge = np.mean((preds_ridge - Y_test)**2)

print ('The MSE for the ridge model is ', mse_ridge)

The MSE for the ridge model is  19312693.670808848


In [None]:
## Q5g Using the results from parts (e) and (f), what model would you use to predict car
## prices? Explain.

## The MSE for the linear model is 8,238,405 while the MSE for the ridge model is 19,312,693.
## Based on these values, the linear model has a substantially smaller MSE for this split of the data,
## so I would use the linear model to predict car prices.
## I would prefer to run each model multiple times and calculate an average MSE, to be sure the
## model selected is best across a range of train/test splits of the data. When I re-ran my code
## several times, I noticed the MSE varied quite a bit between train/test splits.
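
## A sketch of the repeated-split comparison described above, on synthetic data in plain
## NumPy (the data, the 50 repetitions, the 80/20 split and the ridge lambda of 0.01 are
## assumptions for illustration). Each model is refit on every split and the test MSEs
## are averaged, so the comparison does not depend on one lucky or unlucky split.

```python
import numpy as np

def fit(X, y, lam):
    """Closed-form coefficients: lam = 0 gives OLS, lam > 0 gives ridge (no intercept)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 6))
y = X @ rng.normal(size=6) + rng.normal(scale=1.0, size=200)

n_repeats, n_test = 50, 40                 # 40 of 200 rows = 20% test
mse_ols, mse_ridge = [], []
for _ in range(n_repeats):
    # New random train/test split on every repetition.
    idx = rng.permutation(len(y))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    for lam, store in [(0.0, mse_ols), (0.01, mse_ridge)]:
        coef = fit(X[train_idx], y[train_idx], lam)
        store.append(np.mean((X[test_idx] @ coef - y[test_idx]) ** 2))

print('OLS   mean MSE:', np.mean(mse_ols), ' sd:', np.std(mse_ols))
print('Ridge mean MSE:', np.mean(mse_ridge), ' sd:', np.std(mse_ridge))
```

## Reporting the standard deviation next to the mean shows how much of the difference
## between the two models is just split-to-split noise.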