<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 2: Regression Challenge
Notebook III

--- 
#### Linear Regression Modeling


Notebooks II to IV showcase the modeling process for this project. I started by selecting the most relevant features, then established a null model as a baseline, followed by a simple OLS model. More features and feature engineering were added along the way, along with more complex tools such as regularization and log transformation, to fine-tune the model for the best result. The three notebooks contain 7 models in total.


This notebook contains the following content:

- [Standardized Scaling](#Standardized-Scaling)
- [LASSO Model](#Model-4---LASSO-Model)
- [Ridge Model](#Model-5---Ridge-Model)
---

In [32]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
import pickle

In [2]:
pd.set_option('display.max_columns', None) 
pd.set_option('display.max_rows', None)

train = pd.read_csv('../datasets/train_clean.csv')
test = pd.read_csv('../datasets/test_clean.csv')

### Standardized Scaling

In this notebook I will improve my model using regularization tools. As prep work, I will scale all the numerical features; the categorical features are dummified and left unscaled.
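
Scaling matters here because LASSO and Ridge penalize coefficient size, so features measured in larger units would otherwise be penalized unevenly. As a minimal sketch on made-up numbers (`toy_train` and `toy_test` are hypothetical arrays, not the housing data), the scaler learns its mean and standard deviation from the training rows only and reuses them on the test rows:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# hypothetical toy data: one column standing in for a feature such as square footage
toy_train = np.array([[1000.0], [2000.0], [3000.0]])
toy_test = np.array([[2000.0], [4000.0]])

toy_sc = StandardScaler()

# fit on the training rows only, then transform them
toy_train_z = toy_sc.fit_transform(toy_train)

# reuse the training mean and standard deviation on the test rows
toy_test_z = toy_sc.transform(toy_test)

print(toy_sc.mean_)          # mean learned from the training rows
print(toy_train_z.ravel())   # training column now has mean 0 and unit variance
print(toy_test_z.ravel())    # test rows expressed in training standard deviations
```

A test value equal to the training mean maps to 0, no matter what the test set's own mean is.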

In [3]:
train['total_sqft * overall_qual'] = train['total_sqft'] * train['overall_qual']
test['total_sqft * overall_qual'] = test['total_sqft'] * test['overall_qual']

In [4]:
df= train.copy()
df_test = test.copy()

In [5]:
# define numerical features
num_feature = ['total_sqft',
            'total_bath',
            'total_bedroom',
            'overall_qual',
            'exter_qual',
            'kitchen_qual',
            'bsmt_qual',
            'fireplace_qu',
            'garage_finish',
            'mas_vnr_area',
            'garage_cars',
            'age_sold',
            'age_since_remod',
            'total_sqft * overall_qual']

X_num_train = df[num_feature]
X_num_test = df_test[num_feature]

In [6]:
# instantiate the standardized scaler
sc = StandardScaler()

# fit the train data and transform it
Z_num_train = sc.fit_transform(X_num_train)

# transform the test data with the statistics learned from the train data, so the test data does not influence the scaler
Z_num_test = sc.transform(X_num_test)

In [7]:
# define categorical features
cat_feature = ['neighborhood',
              'yr_sold',
              'ms_subclass']

X_cat_train = df[cat_feature]
X_cat_test = df_test[cat_feature]

In [8]:
# dummify all the categorical features
X_cat_train = pd.get_dummies(data = X_cat_train, columns = cat_feature, drop_first = True)
X_cat_test  = pd.get_dummies(data = X_cat_test, columns = cat_feature, drop_first = True)

# match testing features to training features 
X_train_only = set(X_cat_train.columns) - set(X_cat_test.columns)

for missing_col in X_train_only:
    X_cat_test[missing_col] = 0

X_cat_test = X_cat_test[X_cat_train.columns]
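
The column matching above can also be done in a single `reindex` call. The sketch below is an alternative, not the approach used in this notebook, and runs on hypothetical toy frames (`toy_train`, `toy_test`) where the test set lacks one neighborhood and contains one the training set never saw:

```python
import pandas as pd

# hypothetical toy frames: the test set is missing 'C' and has an unseen 'D'
toy_train = pd.DataFrame({'neighborhood': ['A', 'B', 'C']})
toy_test = pd.DataFrame({'neighborhood': ['A', 'B', 'D']})

toy_train_d = pd.get_dummies(toy_train, columns=['neighborhood'], drop_first=True)
toy_test_d = pd.get_dummies(toy_test, columns=['neighborhood'], drop_first=True)

# add train-only columns as zeros, drop test-only columns, and match the column order
toy_test_d = toy_test_d.reindex(columns=toy_train_d.columns, fill_value=0)

print(list(toy_test_d.columns))
```

Either way, categories that appear only in the test data are dropped, so those rows fall back to the reference category.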

In [10]:
# convert the scaled data into a dataframe 
Z_train_df = pd.DataFrame(data = Z_num_train, index = X_num_train.index, columns = num_feature)
Z_test_df = pd.DataFrame(data = Z_num_test, index = X_num_test.index, columns = num_feature)

# combine the numerical and categorical data on their shared row index
X_scaled_train = Z_train_df.join(X_cat_train)
X_scaled_test = Z_test_df.join(X_cat_test)

# define the target variable
y = df['saleprice']

# train-test-split
X_train, X_test, y_train, y_test = train_test_split(X_scaled_train, y, random_state = 42)

In [35]:
X_scaled_train.to_csv('../assets/scaled.csv', index = False)

### Model 4 - LASSO Model

In [12]:
# import LASSO libraries
from sklearn.linear_model import Lasso, LassoCV

In [22]:
# inspired by notes from the regularization lecture
l_alpha = np.logspace(-3, 10, 100)

# find the best alpha value
lasso_cv = LassoCV(alphas = l_alpha, cv = 5, max_iter = 50000 )

# fit the lasso.cv model
lasso_cv.fit(X_train, y_train)

# evaluate with R2 score
print(f'Train R2 score is: {lasso_cv.score(X_train, y_train)}')
print(f'Test R2 score is: {lasso_cv.score(X_test, y_test)}')

# evaluate with rmse
y_hat = lasso_cv.predict(X_test)
rmse = (metrics.mean_squared_error(y_test, y_hat)) ** 0.5
print(f'RMSE is: {rmse}')

Train R2 score is: 0.9078259117174003
Test R2 score is: 0.9215496162363584
RMSE is: 22270.4358639252


Compared to Model 3, the LASSO model actually increased my RMSE, while the R<sup>2</sup> barely changed. This is probably because the model has relatively few features and was not overfitting much to begin with (the test R<sup>2</sup> is even slightly higher than the train R<sup>2</sup>), so LASSO's feature-reduction effect is not very visible here.
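
To illustrate what "feature reduction" means, the sketch below counts how many coefficients LASSO sets exactly to zero under a weak and a very strong penalty. It uses hypothetical synthetic data from `make_regression`, not the housing data, and the two alpha values are picked only for illustration:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# hypothetical synthetic data: 20 features, only 5 of which drive the target
X_toy, y_toy = make_regression(n_samples=200, n_features=20, n_informative=5,
                               noise=10.0, random_state=42)
X_toy = StandardScaler().fit_transform(X_toy)

# count coefficients driven exactly to zero for a weak and a very strong penalty
zero_counts = {}
for alpha in [0.01, 1e6]:
    toy_lasso = Lasso(alpha=alpha, max_iter=50000).fit(X_toy, y_toy)
    zero_counts[alpha] = int(np.sum(toy_lasso.coef_ == 0))

print(zero_counts)
```

The same check on the fitted model in this notebook would be `(lasso_cv.coef_ == 0).sum()`; it was not run here, so the number of dropped features is not reported.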

In [23]:
# predict on the actual testing data
pred = lasso_cv.predict(X_scaled_test)

submission = pd.DataFrame()
submission['id'] = test['id']
submission['SalePrice'] = pred
submission.to_csv('../result/submission_3.csv', index = False)

In [24]:
coef = pd.DataFrame(data = lasso_cv.coef_, index = X_scaled_train.columns, columns = ['coef'])

This submission has a Kaggle RMSE score of 24844.61970.

### Model 5 - Ridge Model

In [26]:
# import Ridge libraries
from sklearn.linear_model import Ridge, RidgeCV

In [27]:
# inspired by notes from the regularization lecture
r_alphas = np.logspace(0, 5, 100)

# find the best alpha value
ridge_cv = RidgeCV(alphas = r_alphas, scoring = 'r2', cv = 5)

# fit the ridge.cv model
ridge_cv = ridge_cv.fit(X_train, y_train)

# evaluate with R2 score
print(f'Train R2 score is: {ridge_cv.score(X_train, y_train)}')
print(f'Test R2 score is: {ridge_cv.score(X_test, y_test)}')

# evaluate with rmse
y_hat = ridge_cv.predict(X_test)
rmse = (metrics.mean_squared_error(y_test, y_hat)) ** 0.5
print(f'RMSE is: {rmse}')

Train R2 score is: 0.9090943758307257
Test R2 score is: 0.9215212038783754
RMSE is: 22274.468338241983


The Ridge results barely differ from LASSO: the RMSE is almost the same, and so are the R<sup>2</sup> scores.
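
One difference between the two penalties is that Ridge shrinks coefficients toward zero without removing any feature outright. A minimal sketch on hypothetical synthetic data (again `make_regression`, not the housing data; the alpha values are arbitrary) compares the overall coefficient size under a weak and a strong penalty:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# hypothetical synthetic data: 20 features, only 5 of which drive the target
X_toy, y_toy = make_regression(n_samples=200, n_features=20, n_informative=5,
                               noise=10.0, random_state=42)

coef_norms = {}
zero_coefs = {}
for alpha in [1.0, 1000.0]:
    toy_ridge = Ridge(alpha=alpha).fit(X_toy, y_toy)
    # overall size of the coefficient vector
    coef_norms[alpha] = float(np.linalg.norm(toy_ridge.coef_))
    # number of coefficients that are exactly zero
    zero_coefs[alpha] = int(np.sum(toy_ridge.coef_ == 0))

print(coef_norms)
print(zero_coefs)
```

When LASSO ends up zeroing few or no features, the two models are expected to behave similarly, which is consistent with the nearly identical scores above.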

In [28]:
# predict on the actual test data
pred = ridge_cv.predict(X_scaled_test)

submission = pd.DataFrame()
submission['id'] = test['id']
submission['SalePrice'] = pred
submission.to_csv('../result/submission_4.csv', index = False)

This submission has a Kaggle RMSE score of 25077.55084, which is slightly higher (worse) than the LASSO score of 24844.61970.

In [33]:
# save the models to pickle for a later stage

lasso = '../assets/lasso.pkl'
ridge = '../assets/ridge.pkl'

# use context managers so the files are closed after writing
with open(lasso, 'wb') as f:
    pickle.dump(lasso_cv, f)
with open(ridge, 'wb') as f:
    pickle.dump(ridge_cv, f)
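
A later notebook can read the saved models back with `pickle.load`. The round trip is sketched below with a hypothetical toy model and a temporary file, so it does not touch `../assets`; the names `toy_model` and `toy_path` are illustrative only:

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import Ridge

# hypothetical toy model standing in for lasso_cv / ridge_cv
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([1.0, 3.0, 5.0, 7.0])
toy_model = Ridge(alpha=1.0).fit(X_toy, y_toy)

# write to a temporary directory and read the model back
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, 'toy_ridge.pkl')
    with open(toy_path, 'wb') as f:
        pickle.dump(toy_model, f)
    with open(toy_path, 'rb') as f:
        restored = pickle.load(f)

# the restored model should predict exactly like the original
same_predictions = np.allclose(toy_model.predict(X_toy), restored.predict(X_toy))
print(same_predictions)
```

Pickled scikit-learn models are best loaded with the same library version that saved them, and only from files you trust.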