<img src="../images/blackfin_logo_black.png" style="float: left; margin: 20px; height: 55px">

# Project 2 - Ames Housing Data and Kaggle Challenge
# Notebook 3: Modeling

In [1]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import seaborn as sns

from sklearn.dummy import DummyRegressor
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression, Lasso, LassoCV, Ridge, RidgeCV, ElasticNet, ElasticNetCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

%matplotlib inline

In [2]:
# import datasets
train = pd.read_csv('../datasets/train_cleaned.csv')
test = pd.read_csv('../datasets/test_cleaned.csv')

In [3]:
# set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [4]:
# helper functions
def cv_score(model, m_str, X_train, y_train):
    """Print the mean 5-fold cross-validated r^2 and RMSE of `model`."""
    r2 = cross_val_score(model, X_train, y_train, cv = 5).mean()
    mse = (-cross_val_score(model, X_train, y_train, cv = 5, scoring = 'neg_mean_squared_error')).mean()
    rmse = np.sqrt(mse)  # root of the mean fold MSE
    print(f'Cross validation score for {m_str}: {r2}')
    print(f'RMSE score for {m_str}: {rmse}')


def model_scores(model, m_str, X_train, X_test, y_train, y_test):
    """Print cross-validated r^2, train/test r^2 and train/test RMSE of a fitted `model`."""
    cv_r2 = cross_val_score(model, X_train, y_train, cv = 5).mean()
    # for regressors, .score() returns r^2; the printed labels call it "accuracy"
    accuracy_train = model.score(X_train, y_train)
    accuracy_test = model.score(X_test, y_test)

    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)
    rmse_train = np.sqrt(mean_squared_error(y_train, y_pred_train))
    rmse_test = np.sqrt(mean_squared_error(y_test, y_pred_test))

    print(f'Model Scores for {m_str}:\n')
    print(f'r^2 score: {cv_r2}')
    print(f'Train accuracy score: {accuracy_train}')
    print(f'Test accuracy score: {accuracy_test}')
    print(f'Training RMSE : {rmse_train}')
    print(f'Testing RMSE : {rmse_test}\n')

## Baseline Model

In [5]:
# train-test split
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(train.drop(columns = ['saleprice']), train['saleprice'], test_size = 0.2, random_state = 42)

In [6]:
dr = DummyRegressor() # instantiate DummyRegressor that makes predictions using simple rules (mean)

In [7]:
dr.fit(X_train_b, y_train_b) # fit the model

In [8]:
cv_score(dr, 'Dummy Regressor', X_train_b, y_train_b)

Cross validation score for Dummy Regressor: -0.006348095145533783
RMSE score for Dummy Regressor: 79853.0336685394


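With an r<sup>2</sup> score just below zero, the DummyRegressor explains none of the observed variation in sale price: it always predicts the mean of the training data, so on held-out folds it does marginally worse than predicting each fold's own mean. <br>
Hence this model is not useful for predicting sale prices on its own, but its RMSE of about 79853 gives a baseline that the models below must beat.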
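
To see why the score lands slightly below zero rather than exactly on it, here is a small sketch on synthetic data (the numbers are made up for illustration and are not drawn from the Ames dataset). A mean-only predictor scores 0 on the data it was fitted on, and at or below 0 on held-out data, because the held-out mean generally differs from the training mean.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import train_test_split

# Synthetic features and target; values are illustrative only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))
y_demo = rng.normal(loc=180_000, scale=80_000, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# DummyRegressor ignores the features and always predicts the training mean
baseline = DummyRegressor(strategy='mean').fit(X_tr, y_tr)

r2_train = baseline.score(X_tr, y_tr)  # 0 up to floating-point error
r2_test = baseline.score(X_te, y_te)   # cannot exceed 0 for a constant prediction
print(f'train r^2: {r2_train:.6f}, held-out r^2: {r2_test:.6f}')
```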

## Model Prep

In [9]:
# train-test split
X_train, X_test, y_train, y_test = train_test_split(train.drop(columns = ['saleprice']), train['saleprice'], test_size = 0.2, random_state = 42)

In [10]:
# instantiate and apply scaling 
sc = StandardScaler()
X_train_sc = sc.fit_transform(X_train)
X_test_sc = sc.transform(X_test)
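
One caveat with scaling the whole training set up front: when `X_train_sc` is later passed to `cross_val_score`, every validation fold has already contributed to the scaler's mean and variance. A minimal sketch of one alternative, assuming a scikit-learn pipeline is acceptable here, refits the scaler inside each fold. The synthetic data and the name `ridge_pipe` are illustrative and not part of the original notebook.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem standing in for the housing features
X_demo, y_demo = make_regression(
    n_samples=300, n_features=10, noise=15.0, random_state=42
)

# The scaler is fitted on the training part of each fold only
ridge_pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
fold_r2 = cross_val_score(ridge_pipe, X_demo, y_demo, cv=5)
print(f'mean r^2 across folds: {fold_r2.mean():.4f}')
```

With only five folds and standardised inputs the difference is usually small, so this is a refinement rather than a correction of the results reported here.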

In [11]:
# instantiate models
lr = LinearRegression()

In [12]:
lasso_cv = LassoCV(n_alphas = 200)

In [13]:
ridge_cv = RidgeCV(alphas = np.linspace(1, 200,100))

In [14]:
enet_cv = ElasticNetCV()

### Cross validation

In [15]:
cv_score(lr, 'Linear Regression', X_train_sc, y_train)

Cross validation score for Linear Regression: -6.3280951229768745e+19
RMSE score for Linear Regression: 575114680007441.1


In [16]:
cv_score(lasso_cv, 'Lasso Regression', X_train_sc, y_train)

Cross validation score for Lasso Regression: 0.9033582892553115
RMSE score for Lasso Regression: 24615.23713615206


In [17]:
cv_score(ridge_cv, 'Ridge Regression', X_train_sc, y_train)

Cross validation score for Ridge Regression: 0.9019160694907061
RMSE score for Ridge Regression: 24769.973344529626


In [18]:
cv_score(enet_cv, 'ElasticNet Regression', X_train_sc, y_train)

Cross validation score for ElasticNet Regression: 0.27036672304449827
RMSE score for ElasticNet Regression: 68030.85724559907


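From the cross-validation results above, `LassoCV` performs best, with an R<sup>2</sup> score of 0.9034 and an RMSE of 24615, narrowly ahead of `RidgeCV` (0.9019 and 24770). `ElasticNetCV` with default settings lags well behind (R<sup>2</sup> of 0.2704), and plain linear regression fails outright.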
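
The `cv_score` helper runs cross-validation twice, once per metric. As a sketch of a possible alternative, `cross_validate` accepts several scorers and fits each fold only once. The data below is synthetic and the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_validate

# Synthetic regression problem standing in for the scaled training data
X_demo, y_demo = make_regression(
    n_samples=300, n_features=10, noise=15.0, random_state=42
)

# One pass over the folds, two metrics collected per fold
results = cross_validate(
    Lasso(alpha=1.0), X_demo, y_demo, cv=5,
    scoring={'r2': 'r2', 'neg_mse': 'neg_mean_squared_error'},
)

mean_r2 = results['test_r2'].mean()
rmse = np.sqrt(-results['test_neg_mse'].mean())  # same definition as in cv_score
print(f'mean r^2: {mean_r2:.4f}, RMSE: {rmse:.2f}')
```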

### Model Fitting & Evaluation

#### Linear Regression

In [19]:
lr.fit(X_train_sc, y_train)

In [20]:
model_scores(lr, 'Linear regression', X_train_sc, X_test_sc, y_train, y_test)

Model Scores for Linear regression:

r^2 score: -6.3280951229768745e+19
Train accuracy score: 0.9290424763393723
Test accuracy score: -1.4665986599357684e+21
Training RMSE : 21248.283126723432
Testing RMSE : 2957743506330492.0



Despite the high train score (0.9290), the hugely negative cross-validated and test r<sup>2</sup> scores and the enormous testing RMSE show that the model is severely over-fitted and does not generalise to unseen data.
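
Scores of this magnitude usually mean a few coefficients have blown up, which can happen when columns are linearly dependent, for example a full set of one-hot dummies. That cause is an assumption here, not something this notebook verifies. A quick diagnostic, sketched on a toy matrix, compares the matrix rank with the number of columns and looks at the condition number. The helper `collinearity_report` is hypothetical.

```python
import numpy as np

def collinearity_report(X):
    """Return column count, numerical rank and condition number of X."""
    X = np.asarray(X, dtype=float)
    return {
        'n_features': X.shape[1],
        'rank': int(np.linalg.matrix_rank(X)),
        'condition_number': float(np.linalg.cond(X)),
    }

# Toy design matrix whose third column is the sum of the first two
rng = np.random.default_rng(0)
a = rng.normal(size=100)
b = rng.normal(size=100)
X_toy = np.column_stack([a, b, a + b])

report = collinearity_report(X_toy)
print(report)  # rank below n_features signals linearly dependent columns
```

On the notebook's data the equivalent call would be `collinearity_report(X_train_sc)`.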

#### Lasso Regression 

In [21]:
lasso_cv.fit(X_train_sc, y_train)

In [22]:
# getting the optimal value of alpha from lasso cv 
lasso_a = lasso_cv.alpha_
lasso_a

328.3057478754679

In [23]:
lasso = Lasso(alpha = lasso_a)
lasso.fit(X_train_sc, y_train)

In [24]:
model_scores(lasso, 'Lasso regression', X_train_sc, X_test_sc, y_train, y_test)

Model Scores for Lasso regression:

r^2 score: 0.9041296963596919
Train accuracy score: 0.9247432934381704
Test accuracy score: 0.9144722880883185
Training RMSE : 21882.51442693321
Testing RMSE : 22586.986613515193



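Compared with the `LassoCV` cross-validation run above, the r<sup>2</sup> has increased slightly (0.9034 to 0.9041) and the RMSE has decreased (24615 cross-validated versus 22587 on the test split). The lasso model is slightly overfitted, since the train score (0.9247) is a little higher than the test score (0.9145), but it clearly has better results than the `LassoCV` cross-validation run and the linear regression model above.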
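
Part of the reason lasso copes where plain linear regression does not is that the L1 penalty sets some coefficients exactly to zero. A small sketch on synthetic data (not the Ames features) counts how many survive; in the notebook the same count could be taken from `lasso.coef_`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# 20 features, of which only 5 actually drive the target
X_demo, y_demo = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0
)
X_demo_sc = StandardScaler().fit_transform(X_demo)

lasso_demo = Lasso(alpha=5.0).fit(X_demo_sc, y_demo)

# Coefficients shrunk all the way to zero are effectively dropped features
n_kept = int(np.sum(lasso_demo.coef_ != 0))
print(f'{n_kept} of {X_demo_sc.shape[1]} coefficients are non-zero')
```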

#### Ridge Regression

In [25]:
ridge_cv.fit(X_train_sc, y_train)

In [26]:
# getting the optimal value of alpha from ridge cv 
ridge_a = ridge_cv.alpha_
ridge_a

163.8181818181818

In [27]:
ridge = Ridge(alpha = ridge_a)
ridge.fit(X_train_sc, y_train)

In [28]:
model_scores(ridge, 'Ridge Regression', X_train_sc, X_test_sc, y_train, y_test)

Model Scores for Ridge Regression:

r^2 score: 0.9019475619672189
Train accuracy score: 0.9256428722847146
Test accuracy score: 0.914583943451902
Training RMSE : 21751.335237821797
Testing RMSE : 22572.238283528517



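Compared with the lasso model, the cross-validated r<sup>2</sup> has decreased slightly (0.9041 to 0.9019) and the test RMSE has decreased marginally too (22587 to 22572). The ridge model is also slightly overfitted, with a train score of 0.9256 against a test score of 0.9146.

Comparing the lasso and ridge models, the lasso model has the better cross-validated r<sup>2</sup> score, while the ridge model has the better RMSE on both the train and test splits.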

#### ElasticNet Regression

In [29]:
enet_cv = ElasticNetCV(alphas = np.linspace(0.5, 1, 100), cv = 5)
enet_cv.fit(X_train_sc, y_train)

In [30]:
# getting the optimal value of alpha from elasticnet cv 
enet_a = enet_cv.alpha_
enet_a

0.5

In [31]:
enet = ElasticNet(alpha = enet_a)
enet.fit(X_train_sc, y_train)

In [32]:
model_scores(enet, 'ElasticNet Regression', X_train_sc, X_test_sc, y_train, y_test)

Model Scores for ElasticNet Regression:

r^2 score: 0.9013047000459462
Train accuracy score: 0.9210051294645254
Test accuracy score: 0.9105682346637366
Training RMSE : 22419.40410189203
Testing RMSE : 23096.744461893126



Compared with the ridge model, the cross-validated r<sup>2</sup> has decreased slightly (0.9019 to 0.9013), while the RMSE has increased on both the train split (22419) and the test split (23097). The ElasticNet model is also slightly overfitted.

Comparing the Lasso, Ridge and ElasticNet models, ElasticNet and Ridge have similar cross-validated r<sup>2</sup> scores, while the RMSE of ElasticNet is worse than that of both Lasso and Ridge.
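
The selected `alpha` of 0.5 is the smallest value in the grid passed to `ElasticNetCV`, which hints that the best value may lie below the searched range. A small hypothetical helper (`on_grid_edge` is not a scikit-learn function) flags this situation so the grid can be widened and the search rerun:

```python
import numpy as np

def on_grid_edge(best_alpha, alphas):
    """Return True when the chosen alpha sits at either end of the search grid."""
    alphas = np.asarray(alphas, dtype=float)
    return bool(
        np.isclose(best_alpha, alphas.min()) or np.isclose(best_alpha, alphas.max())
    )

# Grids and selected values as reported in this notebook
enet_edge = on_grid_edge(0.5, np.linspace(0.5, 1, 100))
ridge_edge = on_grid_edge(163.8181818181818, np.linspace(1, 200, 100))
print(f'ElasticNet on edge: {enet_edge}, Ridge on edge: {ridge_edge}')
```

Whether a wider grid would actually improve the ElasticNet scores is untested here.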

## Model Selection Summary

|Model|r<sup>2</sup> score (CV)|Train score|Test score|Train RMSE|Test RMSE|
|---|---|---|---|---|---|
|Linear Regression|-6.3281e+19|0.9290|-1.4666e+21|21248|2.9577e+15|
|Lasso Regression|0.9041|0.9247|0.9145|21883|22587|
|Ridge Regression|0.9019|0.9256|0.9146|21751|22572|
|ElasticNet Regression|0.9013|0.9210|0.9106|22419|23097|

From the summary table above:
1. The closer the r<sup>2</sup> value is to 1, the more of the variation in sale price the model explains. Lasso Regression has the best cross-validated r<sup>2</sup> score, followed by Ridge Regression and ElasticNet Regression.
2. The train and test scores for Lasso, Ridge and ElasticNet Regression are very close to one another, with the test score about 0.01 below the train score in each case.
3. Looking at the test root mean squared error values, Ridge Regression has the lowest RMSE (22572), followed closely by Lasso Regression (22587) and then ElasticNet Regression (23097).
<br><br>
Since the values for Lasso and Ridge regression are so close, we will submit predictions from both models to Kaggle to see which produces the better RMSE score.
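
A minimal sketch of building a submission file follows. It assumes the test data carries an `id` column and the same feature columns as the training data, and that Kaggle expects columns named `Id` and `SalePrice`; none of this is confirmed by the notebook. The helper `make_submission`, the feature names and the toy frames are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

def make_submission(model, scaler, test_df, feature_cols, id_col='id'):
    """Scale the test features with the fitted scaler and return Id/SalePrice rows."""
    X_new = scaler.transform(test_df[feature_cols])
    return pd.DataFrame({
        'Id': test_df[id_col].to_numpy(),
        'SalePrice': model.predict(X_new),
    })

# Toy frames standing in for train_cleaned.csv and test_cleaned.csv
rng = np.random.default_rng(0)
features = ['gr_liv_area', 'overall_qual']  # hypothetical column names
train_demo = pd.DataFrame(rng.normal(size=(50, 2)), columns=features)
train_demo['saleprice'] = (
    100_000 + 20_000 * train_demo['gr_liv_area'] + rng.normal(scale=1_000, size=50)
)
test_demo = pd.DataFrame(rng.normal(size=(10, 2)), columns=features)
test_demo['id'] = np.arange(1, 11)

# Fit the scaler and model on training data only, then reuse both on the test set
scaler_demo = StandardScaler().fit(train_demo[features])
model_demo = Ridge(alpha=1.0).fit(
    scaler_demo.transform(train_demo[features]), train_demo['saleprice']
)

submission = make_submission(model_demo, scaler_demo, test_demo, features)
# submission.to_csv('submission.csv', index=False)  # hypothetical output path
print(submission.head())
```

Running the helper once with the fitted lasso model and once with the ridge model would give the two files to compare.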