<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 2: Regression Challenge
Notebook IV

--- 
#### Linear Regression Modeling


Notebooks II to IV showcase the modeling process for this project. I started by selecting the most relevant features, then established a null model as a baseline, followed by a simple OLS model. Along the way I added more features and feature engineering, as well as more complex tools such as regularization and log transformation, to fine-tune the model. There are 7 models across the three notebooks.


This notebook contains the following content:

- [Log Transformation](#Model-6---Log-Transformation)
- [Log Transformation Improved](#Model-7---Log-Transformation-Improved)
- [Save to Pickle](#Save-to-Pickle)
---

In [21]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn import metrics

In [22]:
pd.set_option('display.max_columns', None) 
pd.set_option('display.max_rows', None)

train = pd.read_csv('../datasets/train_clean.csv')
test = pd.read_csv('../datasets/test_clean.csv')

### Model 6 - Log Transformation

Since LASSO and Ridge didn't improve my model, I'm applying a log transformation to the target variable. Sale price is right-skewed, and taking the log brings its distribution closer to normal, which may help the linear model.
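To illustrate why the log helps with a right-skewed target, here is a minimal sketch on synthetic data. The lognormal "prices" below are an assumption made for illustration and are not the project's `saleprice` column; the idea is only to compare skewness before and after the transformation.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed "prices" (lognormal); NOT the project's data
rng = np.random.default_rng(42)
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=2000), name='saleprice')

# Same transformation as applied to the target in this notebook
log_prices = np.log(prices)

# Skewness near 0 indicates a roughly symmetric distribution
skew_raw = prices.skew()
skew_log = log_prices.skew()
print(f'Skewness before log: {skew_raw:.3f}')
print(f'Skewness after log:  {skew_log:.3f}')
```

On the real data, the same check would be `train['saleprice'].skew()` against `np.log(train['saleprice']).skew()`.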

In [23]:
# feature engineering
train['total_sqft * overall_qual'] = train['total_sqft'] * train['overall_qual']
test['total_sqft * overall_qual'] = test['total_sqft'] * test['overall_qual']

df = train.copy()
df_test = test.copy()

In [24]:
# define features
features = ['total_sqft',
            'total_bath',
            'total_bedroom',
            'neighborhood',
            'overall_qual',
            'exter_qual',
            'kitchen_qual',
            'bsmt_qual',
            'fireplace_qu',
            'garage_finish',
            'fireplaces',
            'mas_vnr_area',
            'garage_cars',
            'age_sold',
            'age_since_remod',
            'yr_sold',
            'total_sqft * overall_qual',
            'overall_cond',
            'ms_subclass']

X = df[features]
X = pd.get_dummies(data = X, columns = ['neighborhood', 'yr_sold', 'ms_subclass'], drop_first = True)
y = train['saleprice']
features_train = X.columns

In [25]:
# train-test-split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

# log transform the target variables
y_train_log = y_train.map(np.log)
y_test_log = y_test.map(np.log)

# instantiate the linear regression model
lr = LinearRegression()

# fit the model with logged training data
lr.fit(X_train, y_train_log)

# evaluate with R2
print(f'Train score is: {lr.score(X_train, y_train_log)}')

# predict on the hold-out set and exponentiate back to sale price
pred = lr.predict(X_test)
pred_test = np.exp(pred)
print(f'Test score is: {metrics.r2_score(y_test, pred_test)}')
print(f'RMSE is: {(metrics.mean_squared_error(y_test, pred_test)) ** 0.5}')

Train score is: 0.9089377332630848
Test score is: 0.9332237320865686
RMSE is: 20546.7081489154


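This is the best model so far. The log transformation produced a significant drop in RMSE, and both train and test R squared scores are above 0.90. Note that the train score is computed on the log scale, while the test score and RMSE are computed on the original sale price scale.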
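The fit-on-log, exponentiate-on-predict pattern used above can also be wrapped in scikit-learn's `TransformedTargetRegressor`, which applies the transformation and its inverse internally. This is an alternative way to write the same idea, not what this notebook actually runs, and the data below is synthetic (all `*_demo` names are hypothetical).

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

# Synthetic data standing in for the housing features; NOT the project's data
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(500, 3))
y_demo = np.exp(12 + X_demo @ np.array([0.3, 0.2, -0.1])
                + rng.normal(scale=0.1, size=500))

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# Manual approach: fit on log(y), then exponentiate the predictions
lr_manual = LinearRegression().fit(X_tr, np.log(y_tr))
pred_manual = np.exp(lr_manual.predict(X_te))

# Wrapper approach: the log / exp round trip happens inside the estimator
ttr = TransformedTargetRegressor(regressor=LinearRegression(),
                                 func=np.log, inverse_func=np.exp)
ttr.fit(X_tr, y_tr)
pred_ttr = ttr.predict(X_te)

# Both predictions are already on the original scale
rmse = metrics.mean_squared_error(y_te, pred_ttr) ** 0.5
print(f'RMSE on the original scale: {rmse:.2f}')
```

One practical benefit of the wrapper is that `score` and `predict` both work on the original scale, so train and test scores are reported in the same units.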

In [26]:
# define and preprocess features for the actual testing data
X_1 = df_test[features]
X_1 = pd.get_dummies(data = X_1, columns = ['neighborhood', 'yr_sold', 'ms_subclass'], drop_first = True)
features_test = X_1.columns

# match testing features to training features

# add dummy columns that exist only in the training data, filled with 0
train_only = set(features_train) - set(features_test)
for missing_col in train_only:
    X_1[missing_col] = 0
# keep only the training columns, in training order (drops test-only dummies)
X_1 = X_1[X.columns]
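The column matching above can also be expressed with `DataFrame.reindex`, which adds missing columns, drops extra ones, and fixes the order in one step. The sketch below uses two tiny made-up frames (the neighborhood labels `A` to `D` are hypothetical) rather than the project's data.

```python
import pandas as pd

# Tiny hypothetical frames; the neighborhood values are made up for illustration
train_demo = pd.DataFrame({'total_sqft': [1500, 2000, 1200],
                           'neighborhood': ['A', 'B', 'C']})
test_demo = pd.DataFrame({'total_sqft': [1800, 1600],
                          'neighborhood': ['A', 'D']})

X_train_demo = pd.get_dummies(train_demo, columns=['neighborhood'], drop_first=True)
X_test_demo = pd.get_dummies(test_demo, columns=['neighborhood'], drop_first=True)

# Add train-only columns as 0, drop test-only columns, and match column order
X_test_aligned = X_test_demo.reindex(columns=X_train_demo.columns, fill_value=0)
print(X_test_aligned)
```

One thing to watch: because `drop_first = True` is applied to each frame separately, the dropped baseline category is only the same in both frames if they share the same first category.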

In [27]:
# predict on the actual test data
pred_log = lr.predict(X_1)

# exponentiate the predicted value back to sale price
pred_saleprice = np.exp(pred_log)

# submission
submission = pd.DataFrame()
submission['id'] = test['id']
submission['SalePrice'] = pred_saleprice
submission.to_csv('../result/submission_5.csv', index = False)

This submission has a Kaggle RMSE of 21856.70871.

### Model 7 - Log Transformation Improved

This model is a minor upgrade of the previous one. Since the log transformation improved my model significantly, I fit the model to the entire training set instead of holding part of it out with train-test-split, and use cross-validation to estimate performance.

In [28]:
# define and preprocess the training data
X = df[features]
X = pd.get_dummies(data = X, columns = ['neighborhood', 'yr_sold', 'ms_subclass'], drop_first = True)
y = train['saleprice']

In [29]:
# instantiate the linear regression model
lr = LinearRegression()

# log transform the sale price of the entire training data
y_log = y.map(np.log)

# check the cross val score
print(f'Cross Val Score is: {cross_val_score(lr, X, y_log).mean()}')

# fit the model
lr.fit(X, y_log)

# predict the value and exponentiate
pred = lr.predict(X)
pred_unlog = np.exp(pred)

# evaluate with R2 (in-sample: predictions are on the data the model was fit on)
print(f'Train Score: {metrics.r2_score(y, pred_unlog)}')

# evaluate with RMSE (also in-sample)
print(f'RMSE is: {(metrics.mean_squared_error(y, pred_unlog)) ** 0.5}')

Cross Val Score is: 0.8975296507191131
Train Score: 0.9277140786962689
RMSE is: 21314.328873098806


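The R squared and RMSE values above are computed on the same data the model was fit on, so they are not directly comparable to the hold-out scores of the previous model; the cross-validation score of about 0.898 is the more realistic estimate. Would like to see how the Kaggle score performs.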
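Metrics computed by predicting on the rows a model was fit on tend to be optimistic. One way to get an out-of-sample RMSE in sale price units while still using all of the training data is `cross_val_predict`, sketched here on synthetic data (the `*_demo` names and the 5-fold shuffled split are assumptions for illustration, not the notebook's settings).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict
from sklearn import metrics

# Synthetic data; NOT the project's data
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 4))
y_demo = np.exp(12 + X_demo @ np.array([0.25, 0.15, -0.1, 0.05])
                + rng.normal(scale=0.1, size=400))

cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Each prediction comes from a model that did not see that row during fitting
pred_log_cv = cross_val_predict(LinearRegression(), X_demo, np.log(y_demo), cv=cv)
pred_cv = np.exp(pred_log_cv)

# Out-of-sample metrics on the original scale
cv_r2 = metrics.r2_score(y_demo, pred_cv)
cv_rmse = metrics.mean_squared_error(y_demo, pred_cv) ** 0.5

# In-sample RMSE on the original scale, for contrast
lr_full = LinearRegression().fit(X_demo, np.log(y_demo))
in_rmse = metrics.mean_squared_error(y_demo, np.exp(lr_full.predict(X_demo))) ** 0.5

print(f'Cross-validated R2:   {cv_r2:.4f}')
print(f'Cross-validated RMSE: {cv_rmse:.2f}')
print(f'In-sample RMSE:       {in_rmse:.2f}')
```

The cross-validated RMSE is in the same units as the Kaggle metric, which makes it easier to compare against the leaderboard score than an R squared on the log scale.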

In [30]:
# define and preprocess the actual testing data
X_1 = df_test[features]
X_1 = pd.get_dummies(data = X_1, columns = ['neighborhood', 'yr_sold', 'ms_subclass'], drop_first = True)
features_test = X_1.columns

# match the features
# add dummy columns that exist only in the training data, filled with 0
train_only = set(features_train) - set(features_test)
for missing_col in train_only:
    X_1[missing_col] = 0
# keep only the training columns, in training order (drops test-only dummies)
X_1 = X_1[X.columns]

In [31]:
pred_log = lr.predict(X_1)

pred_saleprice = np.exp(pred_log)

submission = pd.DataFrame()
submission['id'] = test['id']
submission['SalePrice'] = pred_saleprice
submission.to_csv('../result/submission_6.csv', index = False)

This submission has a Kaggle RMSE value of 21204.55452, which is the best Kaggle score I have.

### Save to Pickle

In [37]:
import pickle
file_name = '../assets/log.pkl'
with open(file_name, 'wb') as f:
    pickle.dump(lr, f)
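For completeness, here is a minimal sketch of loading a pickled model back and using it. It fits a small stand-in model on synthetic data and writes to a temporary directory (both are assumptions for illustration, so the real `../assets/log.pkl` is not touched). The key point is that the saved model predicts log sale price, so predictions still need `np.exp`.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the fitted log-target model; NOT the project's model
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(50, 2))
y_log_demo = 12 + X_demo @ np.array([0.3, -0.2])
model = LinearRegression().fit(X_demo, y_log_demo)

# Hypothetical temporary path, used here instead of '../assets/log.pkl'
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'log_demo.pkl')
    with open(path, 'wb') as f:
        pickle.dump(model, f)
    with open(path, 'rb') as f:
        loaded = pickle.load(f)

# The model predicts log(price), so exponentiate to get back to sale price
pred_price = np.exp(loaded.predict(X_demo))
print(pred_price[:3])
```

Pickle files should only be loaded from trusted sources, and a pickled scikit-learn model is best loaded with the same scikit-learn version that saved it.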