1. You are given a dataset having more variables than observations. Assuming that
there seems to be a linear relationship between the target variable and the input variables in the
dataset, why ordinary least squares (OLS) is a bad option to estimate the model parameters?
Which technique would be best to use? Why?

If given a dataset with more vairables than observations, there is no longer a unique least squares estimate, making the variance infinite which makes LS less effective. Lasso would be the best technique because it allow the removal of variables if the coefficient is equal to 0, lowering the number of features.

2. For Ridge regression, if the regularization parameter, λ, is equal to 0, what are the
implications?

(f): (a) and (c)

3. For Lasso Regression, if the regularization parameter, λ, is very high, which options are
true? Select all that apply.

(f): (a) and (b)

4. Suppose you are using Ridge Regression and you notice that the training error and
the validation error are almost equal and fairly high. Would you say that the model suffers from
high bias or high variance? Should you increase the regularization parameter, λ, or reduce it?

When the training and validationg error are almost equal and fairly high, the model is suffering from high bias and underfitting. To fix this problem, you should reduce λ.

In [1]:
import boto3
import pandas as pd; pd.set_option('display.max_column', 100)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, RidgeCV, Lasso, LassoCV

## define bucket in which you are trying to reach
s3 = boto3.resource('s3')
bucket_name = 'daltondencklau-data445-bucket'
bucket = s3.Bucket(bucket_name)

## define csv file to read in the bucket
file_key= 'CarPrice_Assignment.csv'

## syntax to allow us to read the file
bucket_object = bucket.Object(file_key)
file_object = bucket_object.get()
file_content_stream = file_object.get('Body')

## reading the data file
car_price = pd.read_csv(file_content_stream)
car_price.head()

Unnamed: 0,car_ID,symboling,CarName,fueltype,aspiration,doornumber,carbody,drivewheel,enginelocation,wheelbase,carlength,carwidth,carheight,curbweight,enginetype,cylindernumber,enginesize,fuelsystem,boreratio,stroke,compressionratio,horsepower,peakrpm,citympg,highwaympg,price
0,1,3,alfa-romero giulia,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495.0
1,2,3,alfa-romero stelvio,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500.0
2,3,1,alfa-romero Quadrifoglio,gas,std,two,hatchback,rwd,front,94.5,171.2,65.5,52.4,2823,ohcv,six,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500.0
3,4,2,audi 100 ls,gas,std,four,sedan,fwd,front,99.8,176.6,66.2,54.3,2337,ohc,four,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950.0
4,5,2,audi 100ls,gas,std,four,sedan,4wd,front,99.4,176.6,66.4,54.3,2824,ohc,five,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450.0


In [2]:
## disabling the 'FutureWarning' because i am doing 1000 iterations
import warnings
warnings.simplefilter(action = 'ignore', category = FutureWarning)

In [3]:
## creating a list to store the results
coeffs= []

## defining input and target variables
X = car_price[['wheelbase', 'enginesize', 'compressionratio', 'horsepower', 'peakrpm', 'citympg', 'highwaympg']]
Y = car_price['price']
    
## for loop to estimate optimal lambda and to use optimal lambda to estimate coefficients
for i in range(0,1000):
    
    # print(i)

    ## splitting data into training and testing datasets
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2)
    
    ## estimating lambda for lasso by CV with 5 folds
    lasso_cv = LassoCV(normalize = True, cv = 5).fit(X_train, Y_train)

    ## extracting the best lambda value via cross validation
    cv_lambda = lasso_cv.alpha_
    
    ## building the lasso model and capturing coefficients
    lasso_md = Lasso(alpha = cv_lambda, normalize = True).fit(X_train, Y_train)
    coeffs.append(lasso_md.coef_)

## creating a dataframe from array/list to store results
df_coeffs = pd.DataFrame(coeffs, columns = [['wheelbase', 'enginesize', 'compressionratio', 'horsepower', 'peakrpm', 'citympg', 'highwaympg']])
df_coeffs
                               

Unnamed: 0,wheelbase,enginesize,compressionratio,horsepower,peakrpm,citympg,highwaympg
0,189.075330,103.820178,296.806156,52.823060,2.555695,-174.041224,97.837789
1,167.741971,97.101028,280.132969,49.507967,1.629300,-156.889803,-0.000000
2,201.003883,86.884562,274.495175,78.409891,1.857745,-29.840206,-0.000000
3,228.839537,95.127565,192.522294,52.776780,1.348007,-79.175090,-0.000000
4,99.214265,119.299426,356.503960,27.202152,1.833573,-211.818431,-0.000000
...,...,...,...,...,...,...,...
995,137.963432,112.610099,319.753915,46.995271,2.327203,-137.316876,-0.000000
996,229.125662,101.924436,299.135076,57.956384,1.958099,-0.813038,-77.757122
997,117.624911,108.720770,415.879896,49.223274,1.901520,-138.157603,-0.000000
998,203.329397,109.034381,261.885719,47.512437,2.082388,-94.467200,-0.000000


In [4]:
## counting all 0s in each column
count_0 = (df_coeffs ==0).sum()
count_0

wheelbase             0
enginesize            0
compressionratio      0
horsepower            0
peakrpm               0
citympg              16
highwaympg          749
dtype: int64

Will be removing the 'highwaympg' feature because it has 500+ values(coefficients) that equal 0 

In [5]:
## removing 'highwaympg' because of the lambda value and defining input and target variables
X_train = X.drop(columns = ['highwaympg'], axis = 1)
X_test = X.drop(columns = ['highwaympg'], axis = 1)

## splitting the data into 80% training and 20% testing
X_train, X_test, Y_train, Y_test = train_test_split(X, Y)

In [6]:
## creating function (defined by def), applied to each of the columns (axis = 1, columns)
## normalizing the data so that there is no bias
def l2_normalization(X):
    x_mean = np.mean(X)
    l2 = np.sqrt(sum(X**2))
    return (X - x_mean) / l2

## making sure variables are all on the same scale
X_train = X_train.apply(l2_normalization, axis = 1)
X_test = X_test.apply(l2_normalization, axis = 1)

## Building both Linear and Ridge Regression Models

In [7]:
## Building  both linear and ridge regression models

## defining a list to store 100 optimal lambda values
ridge_lambda = list()

for i in range (0,100):
    
    ## estimating the best lambda (Ridge)
    ridge_cv = RidgeCV(alphas = [0.001, 0.01, 0.1, 1, 10, 100], cv = 5).fit(X_train, Y_train)
    
    ## estracting the optimal lambda
    cv_lambda = ridge_cv.alpha_
    
    ## appending the results to a list
    ridge_lambda.append(cv_lambda)
    
## finding out what is the most common lambda value
import statistics 
from statistics import mode
print('The most common lambda is', (mode(ridge_lambda)))

The most common lambda is 0.001


In [8]:
## extracting the best lambda value (Ridge)
CV_lambda = ridge_cv.alpha_
print('The best lambda of the ridge model is', CV_lambda)

## building the linear(1st) and ridge (2nd) regression model
lm_md = LinearRegression().fit(X_train, Y_train)
ridge_md = Ridge(alpha = cv_lambda).fit(X_train, Y_train)

## Predicting on testing dataset
lm_pred = lm_md.predict(X_test)
ridge_pred = ridge_md.predict(X_test)

## Computing the MSE of the Models
mse1 = np.mean(np.power(Y_test - lm_pred, 2))
mse2 = np.mean(np.power(Y_test - ridge_pred, 2))

    
print('The MSE of the Linear Regression Model is', mse1)
print('The MSE of the Ridge Regression Model is', mse2)

The best lambda of the ridge model is 0.001
The MSE of the Linear Regression Model is 14403871.56213078
The MSE of the Ridge Regression Model is 16547976.292737931


In [9]:
mse1<mse2

True

I would chose the Linear Regression Model, model 1, because of the lower MSE value.