# <font color=blue>Assignments for "Overfitting and Regularization"</font>

In this assignment, you'll continue working with the house prices data. To complete this assignment, submit a link to a Jupyter notebook containing your solutions to the following tasks:

- Load the **houseprices** data from Kaggle.
- Reimplement your model from the previous lesson.
- Try OLS, Lasso, Ridge and ElasticNet regressions using the same model specification. Which model is the best? Why?

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pandas.api.types import is_numeric_dtype
from sklearn import metrics
import statsmodels.api as sm
import math
from sklearn.linear_model import LinearRegression
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from statsmodels.tools.eval_measures import mse, rmse

In [2]:
data = pd.read_csv("C:/Users/Elif/data/house_prices.csv")

In [3]:
data=data.drop(['PoolQC','MiscFeature','Fence','Alley'], axis=1)


In [4]:
Y = data['SalePrice']

numerical_cols = [col_name for col_name in data.dtypes[data.dtypes.values == 'int64'].index 
                    if col_name not in ["id", "SalePrice"] ]

X = data[numerical_cols]

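# Expand the feature set: every numeric column raised to the powers 1 through 20.
# Note: the columns are int64, so the higher powers can overflow silently;
# casting with X.astype(float) first would avoid that but change the results below.
X = pd.concat([X**i for i in range(1,21)], axis=1)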

# X is the feature set

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2, random_state = 465)
print("The number of observations in training set is {}".format(X_train.shape[0]))
print("The number of observations in test set is {}".format(X_test.shape[0]))

The number of observations in training set is 1168
The number of observations in test set is 292


In [5]:
lrm = LinearRegression()
lrm.fit(X_train, y_train)


LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [6]:
y_preds_train = lrm.predict(X_train)
y_preds_test = lrm.predict(X_test)

In [7]:
print("R-squared of the model in training set is: {}".format(lrm.score(X_train, y_train)))
print("-----Test set statistics-----")
print("R-squared of the model in test set is: {}".format(lrm.score(X_test, y_test)))
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, y_preds_test)))
print("Mean squared error of the prediction is: {}".format(mse(y_test, y_preds_test)))
print("Root mean squared error of the prediction is: {}".format(rmse(y_test, y_preds_test)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - y_preds_test) / y_test)) * 100))

R-squared of the model in training set is: 0.9377002797057834
-----Test set statistics-----
R-squared of the model in test set is: -1.3160874361787074e+21
Mean absolute error of the prediction is: 328492441329036.25
Mean squared error of the prediction is: 9.535536210816502e+30
Root mean squared error of the prediction is: 3087966355195034.0
Mean absolute percentage error of the prediction is: 101249567191.06326


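As we see, the R-squared of the model is about 0.94 in the training set, whereas in the test set it is hugely negative (about -1.3e21) and the test set errors are astronomically large. A negative R-squared means the model predicts the test set worse than a constant prediction of the mean sale price would. A gap this large between training and test performance shows that the model severely overfits the training set.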
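The same train/test comparison is repeated for every model in this notebook, so it can help to wrap it in a helper. The following is a minimal sketch, not part of the original solution: the function name `evaluate` and the synthetic `X_demo`/`y_demo` data are hypothetical, and the metrics are computed with scikit-learn and NumPy rather than the statsmodels helpers imported above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split


def evaluate(model, X_train, X_test, y_train, y_test):
    """Fit `model`, then return train/test R-squared and test set error metrics."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_true = np.asarray(y_test, dtype=float)
    mse_value = mean_squared_error(y_true, y_pred)
    return {
        "r2_train": model.score(X_train, y_train),
        "r2_test": model.score(X_test, y_test),
        "mae": mean_absolute_error(y_true, y_pred),
        "mse": mse_value,
        "rmse": np.sqrt(mse_value),
        # Percentage error is only meaningful when the target is never zero.
        "mape": np.mean(np.abs((y_true - y_pred) / y_true)) * 100,
    }


# Synthetic, noise-free data (illustrative only, not the house prices data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 1000 + X_demo @ np.array([2.0, -1.0, 0.5])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=465
)
demo_results = evaluate(LinearRegression(), X_tr, X_te, y_tr, y_te)
for name, value in demo_results.items():
    print("{}: {:.6f}".format(name, value))
```

Because the demo data is exactly linear, OLS should fit it almost perfectly. On the house prices features, the useful comparison is between `r2_train` and `r2_test`: a large gap signals overfitting.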

**Ridge**

In [13]:
from sklearn.linear_model import Ridge

# Fitting a ridge regression model. Alpha is the regularization
# parameter (usually called lambda). As alpha gets larger, parameter
# shrinkage grows more pronounced.
ridgeregr = Ridge(alpha=10**37) 
ridgeregr.fit(X_train, y_train)

# We are making predictions here
y_preds_train = ridgeregr.predict(X_train)
y_preds_test = ridgeregr.predict(X_test)

print("R-squared of the model in training set is: {}".format(ridgeregr.score(X_train, y_train)))
print("-----Test set statistics-----")
print("R-squared of the model in test set is: {}".format(ridgeregr.score(X_test, y_test)))
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, y_preds_test)))
print("Mean squared error of the prediction is: {}".format(mse(y_test, y_preds_test)))
print("Root mean squared error of the prediction is: {}".format(rmse(y_test, y_preds_test)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - y_preds_test) / y_test)) * 100))

R-squared of the model in training set is: 0.6770174559040603
-----Test set statistics-----
R-squared of the model in test set is: 0.297818203127579
Mean absolute error of the prediction is: 47894.63427972848
Mean squared error of the prediction is: 5087564675.865489
Root mean squared error of the prediction is: 71327.16646457708
Mean absolute percentage error of the prediction is: 26.962967217426126


The R-squared in the training set dropped from 0.94 to 0.68. Although this looks like a deterioration, the R-squared in the test set is now 0.30, a large jump from the hugely negative value under OLS. All of the other test set statistics also improved substantially. Together, these results mean that Ridge regression greatly reduced the overfitting, although the remaining gap of about 0.38 between the training and test R-squared shows that it did not eliminate it.
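The comment in the Ridge cell says that shrinkage grows more pronounced as alpha gets larger. A minimal sketch of that effect follows, using synthetic data rather than the house prices features; the names `X_demo`, `y_demo`, `true_coef`, and the alpha grid are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data (illustrative only): a known linear signal plus noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))
true_coef = np.array([4.0, -2.0, 1.0, 0.0, 0.0])
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=100)

alphas = [0.01, 1.0, 100.0, 10000.0]
coef_norms = []
for alpha in alphas:
    ridge = Ridge(alpha=alpha).fit(X_demo, y_demo)
    # The L2 norm of the coefficient vector summarizes how much it was shrunk.
    coef_norms.append(float(np.linalg.norm(ridge.coef_)))
    print("alpha={:<10.4g} ||coef|| = {:.4f}".format(alpha, coef_norms[-1]))
```

The coefficient norm falls as alpha rises. The demo features are on a unit scale, so moderate alphas are enough; the features in this notebook are raised to powers up to 20, which is a plausible reason such an enormous alpha was needed above.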

**Lasso**

In [15]:
from sklearn.linear_model import Lasso

lassoregr = Lasso(alpha=10**20.5) 
lassoregr.fit(X_train, y_train)

# We are making predictions here
y_preds_train = lassoregr.predict(X_train)
y_preds_test = lassoregr.predict(X_test)

print("R-squared of the model in training set is: {}".format(lassoregr.score(X_train, y_train)))
print("-----Test set statistics-----")
print("R-squared of the model in test set is: {}".format(lassoregr.score(X_test, y_test)))
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, y_preds_test)))
print("Mean squared error of the prediction is: {}".format(mse(y_test, y_preds_test)))
print("Root mean squared error of the prediction is: {}".format(rmse(y_test, y_preds_test)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - y_preds_test) / y_test)) * 100))

R-squared of the model in training set is: 0.7475021396629613
-----Test set statistics-----
R-squared of the model in test set is: 0.5034197264408707
Mean absolute error of the prediction is: 38788.18915031811
Mean squared error of the prediction is: 3597906225.629578
Root mean squared error of the prediction is: 59982.54934253443
Mean absolute percentage error of the prediction is: 21.46659513286364


The R-squared is 0.75 in the training set and 0.50 in the test set. The test set R-squared is the highest among the models tried so far, and the gap between the training and test R-squared values (about 0.24) is the smallest.
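One reason Lasso can generalize better on a bloated feature set is that its L1 penalty sets some coefficients exactly to zero. The sketch below illustrates this on synthetic data, not the house prices features. It assumes scikit-learn's documented Lasso objective, `(1 / (2 * n_samples)) * ||y - Xw||^2 + alpha * ||w||_1`; the names `X_demo`, `y_demo`, and `alpha_max` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (illustrative only): 10 features, only the first 2 matter.
rng = np.random.default_rng(0)
n_samples = 200
X_demo = rng.normal(size=(n_samples, 10))
true_coef = np.zeros(10)
true_coef[:2] = [5.0, -3.0]
y_demo = X_demo @ true_coef + rng.normal(scale=0.1, size=n_samples)

# With an intercept, the fit works on centered X and y. Under the objective
# assumed above, any alpha at or beyond alpha_max zeroes every coefficient.
X_c = X_demo - X_demo.mean(axis=0)
y_c = y_demo - y_demo.mean()
alpha_max = float(np.max(np.abs(X_c.T @ y_c)) / n_samples)

alphas = [0.01, 0.5, 2 * alpha_max]
nonzero_counts = []
for alpha in alphas:
    lasso = Lasso(alpha=alpha).fit(X_demo, y_demo)
    nonzero_counts.append(int(np.count_nonzero(lasso.coef_)))
    print("alpha={:<10.4g} non-zero coefficients: {}".format(alpha, nonzero_counts[-1]))
```

Counting the non-zero entries of `lassoregr.coef_` in the notebook would show, in the same way, how many of the polynomial features the fitted Lasso model actually uses.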

**ElasticNet**

In [16]:
from sklearn.linear_model import ElasticNet

elasticregr = ElasticNet(alpha=10**21, l1_ratio=0.5) 
elasticregr.fit(X_train, y_train)

# We are making predictions here
y_preds_train = elasticregr.predict(X_train)
y_preds_test = elasticregr.predict(X_test)

print("R-squared of the model in training set is: {}".format(elasticregr.score(X_train, y_train)))
print("-----Test set statistics-----")
print("R-squared of the model in test set is: {}".format(elasticregr.score(X_test, y_test)))
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, y_preds_test)))
print("Mean squared error of the prediction is: {}".format(mse(y_test, y_preds_test)))
print("Root mean squared error of the prediction is: {}".format(rmse(y_test, y_preds_test)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - y_preds_test) / y_test)) * 100))

R-squared of the model in training set is: 0.6905850254071653
-----Test set statistics-----
R-squared of the model in test set is: 0.4086288703040336
Mean absolute error of the prediction is: 42888.214215121654
Mean squared error of the prediction is: 4284700747.2546387
Root mean squared error of the prediction is: 65457.625585218404
Mean absolute percentage error of the prediction is: 23.39053232065967


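According to the results, ElasticNet's training and test set performance falls between that of Ridge and Lasso: its test set R-squared is 0.41, compared with 0.30 for Ridge and 0.50 for Lasso. With the same model specification, Lasso is therefore the best of the four models: it has the highest test set R-squared, the lowest test set errors, and the smallest gap between training and test performance.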
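ElasticNet landing between the other two models is consistent with its penalty, which mixes the L1 and L2 terms and weights them with `l1_ratio`. The following sketch, on synthetic data with hypothetical names, assumes scikit-learn's documented parameterization, in which `l1_ratio=1.0` reduces ElasticNet to Lasso and smaller values put more weight on the ridge-style L2 term.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

# Synthetic data (illustrative only): 10 features, only the first 2 matter.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))
true_coef = np.zeros(10)
true_coef[:2] = [5.0, -3.0]
y_demo = X_demo @ true_coef + rng.normal(scale=0.1, size=200)

alpha = 0.5
lasso = Lasso(alpha=alpha).fit(X_demo, y_demo)

enet_models = {}
for l1_ratio in [1.0, 0.5, 0.1]:
    enet = ElasticNet(alpha=alpha, l1_ratio=l1_ratio).fit(X_demo, y_demo)
    enet_models[l1_ratio] = enet
    print("l1_ratio={:.1f}: {} non-zero coefficients, ||coef|| = {:.4f}".format(
        l1_ratio, np.count_nonzero(enet.coef_), np.linalg.norm(enet.coef_)))

# With l1_ratio=1.0 the penalty is pure L1, so the fit should match Lasso.
same_as_lasso = np.allclose(enet_models[1.0].coef_, lasso.coef_)
print("ElasticNet(l1_ratio=1.0) matches Lasso:", same_as_lasso)
```

In the notebook, `l1_ratio=0.5` weights the two penalties equally, so trying a few other values of `l1_ratio` alongside alpha would be a natural next experiment.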