<div>
<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px" width=100>
</div>

# Project 2: Ames Housing Data and Kaggle Challenge

## Background
The Ames Housing Dataset is an exceptionally detailed and robust dataset with over 70 columns of different features relating to houses.  
We use this dataset to predict the sale prices of houses in Ames, Iowa.  
The data is taken from: https://www.kaggle.com/c/dsi-us-11-project-2-regression-challenge/data

## Problem Statement
As a consultant to a homeowner in Ames, Iowa, I am tasked with identifying the features that affect the sale price of a house. With these findings, I can then recommend to the homeowner how to increase the value of the house.

## Progress thus far
In Part 1, we cleaned the provided datasets and selected the following features, out of the original 80, for use in our modelling process.

|Feature|Description|
|:--:|:----------:|
|**id**| The Id of the property |
|**saleprice**| The sale price of the property (the target variable) |

## Part 2: Modelling

### 1. Importing the libraries (All libraries used will be added here)

In [29]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LinearRegression, Ridge, RidgeCV, Lasso, LassoCV
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score


### 2. Importing the training dataset and the test dataset

In [2]:
df_train_clean = pd.read_csv('../datasets/train_clean.csv')

df_test_clean = pd.read_csv('../datasets/test_clean.csv')

### 3. Train/Test Split
The `SalePrice` for the test dataset is not provided, so we cannot check our models' predictions against it directly. To evaluate our models, we therefore split the training dataset into train data and test data.

In [3]:
# Determining the features for the modelling
features = [feat for feat in df_train_clean.columns if feat != 'saleprice' and feat != 'id']

In [4]:
# Creating the X and y variables from the training dataset
X = df_train_clean[features]
y = df_train_clean['saleprice']

# Creating the X variable from the test dataset [named X_kaggle to distinguish it from the X_test created later]
X_kaggle = df_test_clean[features]

# Displaying the shapes of the X and y variables
display(X.shape)
display(y.shape)
display(X_kaggle.shape)

(2051, 25)

(2051,)

(878, 25)

In [5]:
# Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 42)
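
As a quick sanity check on the split sizes, the sketch below applies the same `test_size` and `random_state` to placeholder arrays with the same shape as `X` and `y` above. The names `X_demo` and `y_demo` are stand-ins introduced for illustration, not the actual housing data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as X and y above (2051 rows, 25 features)
X_demo = np.zeros((2051, 25))
y_demo = np.zeros(2051)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

# Number of rows kept for fitting and held out for evaluation
split_sizes = (X_tr.shape[0], X_te.shape[0])
print(split_sizes)
```

Roughly three quarters of the rows are used for fitting and the remaining quarter is held out for evaluation.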

### 4. Baseline Model
For the Baseline Model, we predict `SalePrice` for the test data using the mean value of `SalePrice` from the train data above.

In [6]:
# The mean value of SalePrice
np.mean(y_train)

181061.9934980494

In [7]:
# RMSE for the train data
y_pred_base = np.ones_like(y_train) * np.mean(y_train)
np.sqrt(mean_squared_error(y_train, y_pred_base))

79526.85223710592

In [8]:
# RMSE for the test data
y_pred_base = np.ones_like(y_test) * np.mean(y_train)
np.sqrt(mean_squared_error(y_test, y_pred_base))

78375.26238032707

For the Baseline Model, the RMSE for the train set and the test set are 79526.8522 and 78375.2624 respectively. These are in the same range as the RMSE of 83945.31 for the sample provided by Kaggle, which is what we would expect from a Baseline Model.

From here on, the goal is to improve on this score by fitting the data to appropriate models.
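
The same mean baseline can also be expressed with scikit-learn's `DummyRegressor`. The sketch below is one possible alternative rather than the approach used in this notebook, and it runs on made-up target values (`y_train_demo`, `y_test_demo`) instead of the housing data.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Toy targets standing in for sale prices (hypothetical numbers)
y_train_demo = np.array([100_000.0, 150_000.0, 200_000.0, 250_000.0])
y_test_demo = np.array([120_000.0, 180_000.0])

# The dummy model ignores the features, so placeholder columns are enough
X_train_demo = np.zeros((len(y_train_demo), 1))
X_test_demo = np.zeros((len(y_test_demo), 1))

baseline = DummyRegressor(strategy="mean")
baseline.fit(X_train_demo, y_train_demo)

# Manual baseline: predict the training mean for every test row
manual_pred = np.full(len(y_test_demo), y_train_demo.mean())
dummy_pred = baseline.predict(X_test_demo)

rmse_manual = np.sqrt(mean_squared_error(y_test_demo, manual_pred))
rmse_dummy = np.sqrt(mean_squared_error(y_test_demo, dummy_pred))
print(rmse_manual, rmse_dummy)
```

Both routes give the same RMSE. The dummy model has the advantage of exposing `predict`, so it can be passed to helpers that expect a fitted estimator.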

### 5. Model Fitting and Evaluation
We will now fit the data to a multi-variable linear regression model. Since the linear relationships were already checked in Part 1, we will not repeat those checks here.

Also, to evaluate the models, we will be looking at the three metrics: _Root Mean Squared Error (RMSE), Mean Absolute Error (MAE) and Coefficient of Determination, $R^2$_.
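
For reference, the standard definitions of these metrics are given below, writing $y_i$ for the actual values, $\hat{y}_i$ for the predictions, $\bar{y}$ for the mean of the actual values and $n$ for the number of observations (symbols introduced here for illustration):

$$
\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}, \qquad
\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|, \qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
$$

RMSE and MAE are in the same units as the sale price, and lower is better. RMSE penalizes large errors more heavily than MAE. For $R^2$, values closer to 1 are better, and a model that always predicts $\bar{y}$ scores 0.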

In [9]:
# Function for metrics (Since we will be using this often)
def model_metrics(model, X_mod, y_mod):
    # Predicting the y values based on the given X_mod
    y_mod_pred = model.predict(X_mod)       
    
    # Printing the metrics data
    print(f'RMSE: {round(np.sqrt(mean_squared_error(y_mod, y_mod_pred)), 4)}')
    print(f'MAE: {round(mean_absolute_error(y_mod, y_mod_pred), 4)}')
    print(f'R2: {round(r2_score(y_mod, y_mod_pred), 4)}')

### 5.1 Scaling of data
Ridge and Lasso penalize the size of the coefficients, so the features need to be on a comparable scale before modelling. We therefore scale the features in our dataset.

In [10]:
# The columns that need to be scaled, taken from Part 1, dummy columns do not need to be scaled
scale_cols = ['lot_area',
              'overall_qual', 
              'year_built', 
              'exter_qual', 
              'bsmt_qual', 
              'bsmt_exposure', 
              'total_bsmt_sf', 
              'heating_qc', 
              'gr_liv_area', 
              'full_bath', 
              'kitchen_qual', 
              'totrms_abvgrd', 
              'fireplaces', 
              'garage_area']

In [11]:
# Initializing the scaled dataframes in order to avoid the transformation of the original dataframes
Z_train = X_train.copy()
Z_test = X_test.copy()
Z_kaggle = X_kaggle.copy()

In [12]:
# Initializing the StandardScaler
ss = StandardScaler()

# Fit and transform the training data
Z_train[scale_cols] = ss.fit_transform(Z_train[scale_cols])

# Transform the testing data and the kaggle test dataset
Z_test[scale_cols] = ss.transform(Z_test[scale_cols])
Z_kaggle[scale_cols] = ss.transform(Z_kaggle[scale_cols])
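
To illustrate why the scaler is fitted on the training data only, the sketch below uses a single made-up feature (`train_demo`, `test_demo`). It checks that `transform` reuses the training mean and standard deviation instead of recomputing them from the new rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature data (hypothetical values)
train_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
test_demo = np.array([[5.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_demo)  # learns mean and std from the training rows
test_scaled = scaler.transform(test_demo)        # reuses the training mean and std

# Manual equivalent: z = (x - mean_train) / std_train, using the population std (ddof=0)
manual_test_scaled = (test_demo - train_demo.mean(axis=0)) / train_demo.std(axis=0)
print(test_scaled, manual_test_scaled)
```

Fitting the scaler on the test rows as well would leak information about the held-out data into the model.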

### 5.2 Linear Regression
Our first model will be Ordinary Least Squares (OLS) Linear Regression.

Since we have both the scaled and the unscaled data to model from, we will be using both here as well.

In [13]:
## Unscaled data
# Initializing and fitting the model 
lr = LinearRegression()
lr.fit(X_train, y_train)

LinearRegression()

In [14]:
# Metrics for Linear Regression with Unscaled data for the training data
model_metrics(lr, X_train, y_train)

RMSE: 30832.4528
MAE: 19940.5007
R2: 0.8497


In [15]:
# Metrics for Linear Regression with Unscaled data for the test data
model_metrics(lr, X_test, y_test)

RMSE: 26923.1815
MAE: 19641.124
R2: 0.8819


In [30]:
# Cross validation score for Linear Regression with Unscaled data
display(cross_val_score(lr, X_train, y_train, cv=10).mean())
# display(cross_val_score(lr, X_train, y_train, scoring='neg_mean_squared_error', cv=10).mean())

0.8325085355893315

In [17]:
# Coefficients of Linear Regression with Unscaled data
lr.coef_

array([ 1.24765778e+00,  1.10308594e+04,  1.32165341e+02,  1.08464410e+04,
        5.26016505e+03,  6.46223403e+03, -2.94178731e+00,  3.47357664e+03,
        5.23488580e+01, -1.46166120e+03,  9.81290396e+03,  6.71414136e+02,
        8.65868952e+03,  2.70092921e+01,  6.02151979e+04,  4.44652129e+04,
        3.79653563e+04,  8.28727051e+03,  2.74152354e+03,  1.84104942e+03,
       -6.13998667e+02,  1.18962525e+04, -4.21136040e+03,  6.03822142e+03,
       -4.44055455e+01])

The R2 scores for the training and test data are 0.8497 and 0.8819 respectively, so the model explains roughly 85% to 88% of the variance in sale price. However, the mean 10-fold cross validation score of 0.8325 is lower than both, which suggests that performance on unseen data depends on the split and may be somewhat weaker than the single test score indicates.

The RMSE for the train and test data are 30832.4528 and 26923.1815 respectively, a clear improvement over the baseline RMSE of 79526.8522 and 78375.2624.
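
By default, `cross_val_score` scores a regressor with R2, which is why the cross validation score is compared with R2 and not with RMSE. The sketch below shows one way to obtain a cross-validated RMSE instead. It uses synthetic data from `make_regression` in place of the housing data, and it assumes a scikit-learn version that provides the `neg_root_mean_squared_error` scorer.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data standing in for the housing features (assumption)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
model = LinearRegression()

# Default scoring for a regressor is its .score method, i.e. R2
r2_scores = cross_val_score(model, X_demo, y_demo, cv=10)

# Error-based scorers are negated so that "higher is better" still holds
neg_mse = cross_val_score(model, X_demo, y_demo, cv=10, scoring="neg_mean_squared_error")
neg_rmse = cross_val_score(model, X_demo, y_demo, cv=10, scoring="neg_root_mean_squared_error")

# Flip the sign to report the RMSE of each fold
cv_rmse_per_fold = -neg_rmse
print(r2_scores.mean(), cv_rmse_per_fold.mean())
```

Looking at the per-fold values, and not only their mean, gives a sense of how much the score varies from split to split.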

In [18]:
## Scaled data
# Initializing and fitting the model 
lr_ss = LinearRegression()
lr_ss.fit(Z_train, y_train)

LinearRegression()

In [19]:
# Metrics for Linear Regression with Scaled data for the training data
model_metrics(lr_ss, Z_train, y_train)

RMSE: 30832.4528
MAE: 19940.5007
R2: 0.8497


In [20]:
# Metrics for Linear Regression with Scaled data for the test data
model_metrics(lr_ss, Z_test, y_test)

RMSE: 26923.1815
MAE: 19641.124
R2: 0.8819


In [21]:
# Cross validation score for Linear Regression with Scaled data
cross_val_score(lr_ss, Z_train, y_train, cv=10).mean()

0.8325085355893307

In [22]:
# Coefficients of Linear Regression with Scaled data
lr_ss.coef_

array([ 6.11316263e+03,  1.57524810e+04,  3.98663953e+03,  6.32788041e+03,
        4.70107621e+03,  6.90360309e+03, -1.33918917e+03,  3.38460825e+03,
        2.59577443e+04, -8.08431568e+02,  6.54494878e+03,  1.04884543e+03,
        5.51075653e+03,  5.89329917e+03,  6.02151979e+04,  4.44652129e+04,
        3.79653563e+04,  8.28727051e+03,  2.74152354e+03,  1.84104942e+03,
       -6.13998667e+02,  1.18962525e+04, -4.21136040e+03,  6.03822142e+03,
       -4.44055455e+01])

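We have the exact same scores for both unscaled and scaled data. This should not be a surprise: standardizing is a linear transformation of each feature, and an OLS fit compensates for it by rescaling the coefficients and shifting the intercept, so the predictions are unchanged. The only difference is in the coefficients of the scaled columns. The coefficients of the unscaled dummy columns are identical in both models.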
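
The invariance of OLS to feature scaling can be checked on a small synthetic example. In the sketch below, `X_demo` and `y_demo` are made-up data with two features on very different scales. Each coefficient fitted on the standardized features should equal the raw coefficient multiplied by that feature's standard deviation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic features on very different scales (hypothetical data)
X_demo = rng.normal(size=(100, 2)) * np.array([1.0, 1000.0]) + np.array([5.0, 2000.0])
y_demo = 3.0 * X_demo[:, 0] + 0.01 * X_demo[:, 1] + rng.normal(scale=0.5, size=100)

scaler = StandardScaler()
Z_demo = scaler.fit_transform(X_demo)

ols_raw = LinearRegression().fit(X_demo, y_demo)
ols_scaled = LinearRegression().fit(Z_demo, y_demo)

# Predictions agree, and scaled coefficient = raw coefficient * feature std
same_predictions = np.allclose(ols_raw.predict(X_demo), ols_scaled.predict(Z_demo))
same_coefs_after_rescaling = np.allclose(ols_scaled.coef_, ols_raw.coef_ * scaler.scale_)
print(same_predictions, same_coefs_after_rescaling)
```

This invariance holds for plain OLS only. Once a penalty on the coefficients is added, as in Ridge and Lasso, the scale of the features does affect the fit.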

### 5.3 Ridge Regression
The next step is regularization. First off, we will start with Ridge Regression.
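
Ridge adds a penalty of `alpha` times the sum of squared coefficients to the least-squares objective, so larger values of `alpha` shrink the coefficients towards zero. The sketch below illustrates this on synthetic data. `X_demo`, `y_demo` and the grid `alphas_demo` are assumptions made for the example and are unrelated to the housing data.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Synthetic standardized features and a linear target (hypothetical data)
X_demo = rng.normal(size=(100, 5))
true_coefs = np.array([4.0, -3.0, 2.0, 0.0, 0.0])
y_demo = X_demo @ true_coefs + rng.normal(scale=1.0, size=100)

alphas_demo = [0.01, 1.0, 100.0, 10_000.0]
coef_norms = []
for alpha in alphas_demo:
    model = Ridge(alpha=alpha).fit(X_demo, y_demo)
    # Euclidean norm of the coefficient vector for this penalty strength
    coef_norms.append(np.linalg.norm(model.coef_))

print(coef_norms)
```

The norm of the coefficient vector falls as `alpha` grows, but the individual coefficients are not set exactly to zero.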

In [31]:
# Initializing and fitting the model 
r_alphas = np.logspace(0,5,100)
ridge_cv = RidgeCV(alphas=r_alphas, scoring='r2', cv=10)
ridge_cv.fit(Z_train,y_train)

RidgeCV(alphas=array([1.00000000e+00, 1.12332403e+00, 1.26185688e+00, 1.41747416e+00,
       1.59228279e+00, 1.78864953e+00, 2.00923300e+00, 2.25701972e+00,
       2.53536449e+00, 2.84803587e+00, 3.19926714e+00, 3.59381366e+00,
       4.03701726e+00, 4.53487851e+00, 5.09413801e+00, 5.72236766e+00,
       6.42807312e+00, 7.22080902e+00, 8.11130831e+00, 9.11162756e+00,
       1.02353102e+01, 1.14975700e+0...
       6.89261210e+03, 7.74263683e+03, 8.69749003e+03, 9.77009957e+03,
       1.09749877e+04, 1.23284674e+04, 1.38488637e+04, 1.55567614e+04,
       1.74752840e+04, 1.96304065e+04, 2.20513074e+04, 2.47707636e+04,
       2.78255940e+04, 3.12571585e+04, 3.51119173e+04, 3.94420606e+04,
       4.43062146e+04, 4.97702356e+04, 5.59081018e+04, 6.28029144e+04,
       7.05480231e+04, 7.92482898e+04, 8.90215085e+04, 1.00000000e+05]),
        cv=10, scoring='r2')

In [33]:
# Optimal value of alpha 
ridge_cv.alpha_

2.8480358684358014

In [39]:
# Initializing and fitting Ridge with the optimal value of alpha
ridge = Ridge(alpha=ridge_cv.alpha_)
ridge.fit(Z_train, y_train)

Ridge(alpha=2.8480358684358014)

In [40]:
# Metrics for Ridge for the training data
model_metrics(ridge, Z_train, y_train)

RMSE: 30850.6322
MAE: 19923.4464
R2: 0.8495


In [41]:
# Metrics for Ridge for the test data
model_metrics(ridge, Z_test, y_test)

RMSE: 26738.2836
MAE: 19623.7797
R2: 0.8836


In [43]:
# Cross validation score for Ridge with Scaled data
cross_val_score(ridge, Z_train, y_train, cv=10).mean()

0.8327419700176986

### 5.4 Lasso Regression
After Ridge Regression, we will try Lasso Regression.
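
Unlike Ridge, Lasso penalizes the sum of the absolute values of the coefficients, which can set some coefficients exactly to zero and so acts as a form of feature selection. The sketch below shows this on synthetic data in which only the first two of six features influence the target. All names and values are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Synthetic data: only the first two features drive the target (hypothetical data)
X_demo = rng.normal(size=(200, 6))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

alphas_demo = [0.01, 0.5, 5.0]
nonzero_counts = []
for alpha in alphas_demo:
    model = Lasso(alpha=alpha).fit(X_demo, y_demo)
    # Count how many coefficients survive at this penalty strength
    nonzero_counts.append(int(np.count_nonzero(model.coef_)))

print(nonzero_counts)
```

As `alpha` increases, fewer coefficients remain non-zero, and a sufficiently large `alpha` removes every feature from the model.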

In [35]:
# Initializing and fitting the model 
lasso_cv = LassoCV(n_alphas=500, cv=10)
lasso_cv.fit(Z_train,y_train)

LassoCV(cv=10, n_alphas=500)

In [36]:
# Optimal value of alpha 
lasso_cv.alpha_

122.99946170715148

In [44]:
# Initializing and fitting Lasso with the optimal value of alpha
lasso = Lasso(alpha=lasso_cv.alpha_)
lasso.fit(Z_train, y_train)

Lasso(alpha=122.99946170715148)

In [48]:
# Metrics for Lasso for the training data
model_metrics(lasso, Z_train, y_train)

RMSE: 30890.0557
MAE: 19841.8641
R2: 0.8491


In [47]:
# Metrics for Lasso for the test data
model_metrics(lasso, Z_test, y_test)

RMSE: 26590.7275
MAE: 19485.1481
R2: 0.8848


In [49]:
# Cross validation score for Lasso with Scaled data
cross_val_score(lasso, Z_train, y_train, cv=10).mean()

0.8333717846630158

### 6. Final Model
`To be Added`