# Assignments for "Overfitting and Regularization"
In this assignment, you'll continue working with the house prices data. To complete this assignment, submit a link to a Jupyter notebook containing your solutions to the following tasks:

In [1]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

house_prices_df = pd.read_csv("https://djl-lms-assets.s3.eu-central-1.amazonaws.com/datasets/house_prices.csv", sep = ";")
house_prices_df.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


* Reimplement your model from the previous lesson.

In [2]:
Y = house_prices_df['SalePrice']

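# Use the int64 columns as features. The exclusion list is case-sensitive:
# the column is named "Id", so the "id" entry below does not exclude it.
numerical_cols = [col_name for col_name in house_prices_df.dtypes[house_prices_df.dtypes.values == 'int64'].index
                    if col_name not in ["id", "SalePrice"]]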

X = house_prices_df[numerical_cols]
# Stack powers 1 through 20 of every feature column side by side
X = pd.concat([X**i for i in range(1,21)], axis=1)

In [3]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2, random_state = 465)
print("The number of observations in training set is {}".format(X_train.shape[0]))
print("The number of observations in test set is {}".format(X_test.shape[0]))

The number of observations in training set is 1168
The number of observations in test set is 292


* Try OLS, Lasso, Ridge and ElasticNet regressions using the same model specification. This time, you need to do k-fold cross-validation to choose the best hyperparameter values for your models. Which model is the best? Why?

<h3> OLS </h3>

In [4]:
lrm = LinearRegression()
lrm.fit(X_train, y_train)

LinearRegression()

In [5]:
y_preds_train = lrm.predict(X_train)
y_preds_test = lrm.predict(X_test)

print("R-squared of the model in training set is: {:.5f}".format(lrm.score(X_train, y_train)))
print("R-squared of the model in test set is: {:.5f}".format(lrm.score(X_test, y_test)))

R-squared of the model in training set is: 0.94679
R-squared of the model in test set is: -20270209797650137088.00000


The training R-squared is about 0.95, but the test R-squared is hugely negative (about -2.0e19), which means the model predicts the test set far worse than simply predicting the mean sale price would. A gap this large shows that the model severely overfits the training set.
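
To see how an R-squared can fall below zero, it helps to compute it by hand as 1 - SS_res / SS_tot. The sketch below uses made-up numbers (not the house prices data) and a hypothetical helper `r_squared`: predicting the mean gives exactly 0, and a single exploded prediction is enough to push the score far into negative territory.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # spread around the mean
    return 1.0 - ss_res / ss_tot

# Made-up targets, for illustration only
y_true = np.array([100.0, 200.0, 300.0])

# Baseline that always predicts the mean of the targets
mean_baseline = np.full(y_true.shape, y_true.mean())

# Two perfect predictions and one that has blown up
wild_preds = np.array([100.0, 200.0, 3000.0])

print(r_squared(y_true, mean_baseline))  # 0.0
print(r_squared(y_true, wild_preds))     # -363.5
```

High-degree polynomial terms can produce exactly this kind of blow-up on unseen rows, which is consistent with the test score reported above.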

<h3>Ridge Regression</h3>

In [6]:
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_squared_error

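# Candidate penalty strengths. With the default cv=None, RidgeCV chooses among
# them using efficient leave-one-out cross-validation.
alphas = [1e+37,1e+38,1e+39,1e+40]
ridge_cv = RidgeCV(alphas=alphas)
ridge_mod = ridge_cv.fit(X_train,y_train)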

print("Best alpha value:",ridge_mod.alpha_)

y_pred = ridge_mod.predict(X_test)
score = ridge_mod.score(X_test,y_test)
print("R-squared of the model in training set is: {:.5f}".format(ridge_mod.score(X_train, y_train)))
print("R-squared of the model in test set is: {:.5f}".format(ridge_mod.score(X_test, y_test)))
mse = mean_squared_error(y_test,y_pred)
print("MSE:{:.2f}, RMSE:{:.2f}"
   .format(mse, np.sqrt(mse)))

Best alpha value: 1e+38
R-squared of the model in training set is: 0.63496
R-squared of the model in test set is: 0.30263
MSE:5052701949.51, RMSE:71082.36
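
Since the task asks for k-fold cross-validation specifically, one possible way to set it up explicitly is sketched below. This is an illustration under stated assumptions, not the notebook's method: the data is synthetic, and the names `X_demo`, `y_demo`, `kfold`, and `search` are hypothetical. It also standardizes the features inside the pipeline, which is one common way to keep the useful alpha range moderate when features sit on very different scales.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with features on very different scales (illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5)) * np.array([1, 10, 100, 1000, 10000])
y_demo = X_demo @ np.array([3.0, 0.2, 0.01, 0.0, 0.0]) + rng.normal(size=200)

# Explicit 5-fold splitter so the number of folds is visible and reproducible
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

# Scaling lives inside the pipeline, so it is refit on each training fold
pipe = make_pipeline(StandardScaler(), Ridge())
alphas = np.logspace(-3, 3, 13)

search = GridSearchCV(pipe, {"ridge__alpha": alphas}, cv=kfold, scoring="r2")
search.fit(X_demo, y_demo)

print("Best alpha:", search.best_params_["ridge__alpha"])
print("Mean cross-validated R-squared: {:.3f}".format(search.best_score_))
```

Passing `cv=5` (or a `KFold` object) directly to `RidgeCV` is another option; the grid search form is used here only because it keeps the scaler inside each fold.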


<h3>Lasso Regression </h3>

In [7]:
from sklearn.linear_model import LassoCV

alphas = [1e+20,1e+21,1e+22,1e+23]
lasso_cv = LassoCV(alphas=alphas, max_iter=2000)
lasso_mod = lasso_cv.fit(X_train,y_train)

print("Best alpha value:",lasso_mod.alpha_)

y_pred = lasso_mod.predict(X_test)
score = lasso_mod.score(X_test,y_test)
print("R-squared of the model in training set is: {:.5f}".format(lasso_mod.score(X_train, y_train)))
print("R-squared of the model in test set is: {:.5f}".format(lasso_mod.score(X_test, y_test)))
mse = mean_squared_error(y_test,y_pred)
print("MSE:{:.2f}, RMSE:{:.2f}"
   .format(mse, np.sqrt(mse)))

Best alpha value: 1e+22
R-squared of the model in training set is: 0.40703
R-squared of the model in test set is: 0.19182
MSE:5855564486.87, RMSE:76521.66


<h3>ElasticNet Regression</h3>

In [8]:
from sklearn.linear_model import ElasticNetCV

alphas = [1e+20,1e+21,1e+22,1e+23]
elasticnet_cv = ElasticNetCV(alphas=alphas)
elasticnet_mod = elasticnet_cv.fit(X_train,y_train)

print("Best alpha value:",elasticnet_mod.alpha_)

y_pred = elasticnet_mod.predict(X_test)
score = elasticnet_mod.score(X_test,y_test)
print("R-squared of the model in training set is: {:.5f}".format(elasticnet_mod.score(X_train, y_train)))
print("R-squared of the model in test set is: {:.5f}".format(elasticnet_mod.score(X_test, y_test)))
mse = mean_squared_error(y_test,y_pred)
print("MSE:{:.2f}, RMSE:{:.2f}"
   .format(mse, np.sqrt(mse)))

Best alpha value: 1e+21
R-squared of the model in training set is: 0.69059
R-squared of the model in test set is: 0.40863
MSE:4284700747.25, RMSE:65457.63
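
ElasticNet has a second hyperparameter besides alpha: `l1_ratio`, the mix between the L1 and L2 penalties. The cell above leaves it at the default of 0.5. The sketch below shows how `ElasticNetCV` can search over both with 5-fold cross-validation. It uses synthetic data and hypothetical names (`X_demo`, `y_demo`, `enet_pipe`), and as a simplification the scaler is fit once on all rows before the internal folds are split.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data where only three of eight features matter (illustration only)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 8))
true_coef = np.array([4.0, -3.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=300)

# Candidate L1/L2 mixes; 1.0 corresponds to a pure Lasso penalty
l1_ratios = [0.1, 0.5, 0.9, 1.0]

enet_pipe = make_pipeline(
    StandardScaler(),
    # The alpha grid is generated automatically for each l1_ratio
    ElasticNetCV(l1_ratio=l1_ratios, cv=5, max_iter=10000),
)
enet_pipe.fit(X_demo, y_demo)

enet = enet_pipe.named_steps["elasticnetcv"]
print("Best alpha:", enet.alpha_)
print("Best l1_ratio:", enet.l1_ratio_)
```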


> The ElasticNet model reaches an R-squared of 0.69 on the training set and 0.41 on the test set. Its test R-squared is the highest of the four models, and its test-set MSE and RMSE are the lowest of the three regularized models, so ElasticNet performs best here. Its train/test gap (0.28) is smaller than Ridge's (0.33) but larger than Lasso's (0.22); Lasso, however, fits both sets much worse.
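
The comparison can be checked by tabulating the scores printed in the cells above instead of reading them off by eye. The snippet below is a small sketch that copies those reported values into a hypothetical dictionary named `reported`. OLS is left out because its test R-squared is hugely negative and no MSE was printed for it.

```python
# Scores copied from the printed outputs of the Ridge, Lasso, and ElasticNet cells
reported = {
    "Ridge":      {"train_r2": 0.63496, "test_r2": 0.30263, "rmse": 71082.36},
    "Lasso":      {"train_r2": 0.40703, "test_r2": 0.19182, "rmse": 76521.66},
    "ElasticNet": {"train_r2": 0.69059, "test_r2": 0.40863, "rmse": 65457.63},
}

def gap(name):
    """Difference between train and test R-squared for one model."""
    return reported[name]["train_r2"] - reported[name]["test_r2"]

for name, scores in reported.items():
    print("{:<11} test R2={:.3f}  gap={:.3f}  RMSE={:,.2f}".format(
        name, scores["test_r2"], gap(name), scores["rmse"]))

best_test_r2 = max(reported, key=lambda m: reported[m]["test_r2"])
lowest_rmse = min(reported, key=lambda m: reported[m]["rmse"])
smallest_gap = min(reported, key=gap)

print("Highest test R-squared:", best_test_r2)
print("Lowest test RMSE:", lowest_rmse)
print("Smallest train/test gap:", smallest_gap)
```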