# Modeling

Every model below is scored with the Root Mean Squared Logarithmic Error (RMSLE), where lower is better.
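
The `rmsle` helper imported from `utilities` is not shown in this notebook. As a rough guide to what it computes, here is a minimal sketch of the usual RMSLE definition: the square root of the mean squared difference between `log(1 + predicted)` and `log(1 + actual)`. The function name `rmsle_sketch` and the argument order are assumptions made for illustration, and the real helper may differ.

```python
import numpy as np

def rmsle_sketch(predicted, actual):
    """Root Mean Squared Logarithmic Error between two array-likes (illustrative only)."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # log1p(x) = log(1 + x), which keeps the metric defined when a value is 0
    log_diff = np.log1p(predicted) - np.log1p(actual)
    return float(np.sqrt(np.mean(log_diff ** 2)))

# Predicting every price exactly gives zero error
print(rmsle_sketch([100000, 200000], [100000, 200000]))  # 0.0
```

Because the error is measured on logarithms, it penalizes relative differences, so missing a cheap house by 10% costs about as much as missing an expensive house by 10%.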

In [1]:
import pandas as pd
import numpy as np
from utilities import rmsle
from utilities import gen_sub

# the input files are train2 and test2 because they have been preprocessed
trainFile = 'data/train2.csv'
testFile = 'data/test2.csv'

train = pd.read_csv(trainFile)
test = pd.read_csv(testFile)

### Naive Model

We begin with a naive baseline: we predict the mean training sale price for every house.

In [2]:
naiveTrain_y = pd.read_csv(trainFile)
naiveTrain_y['NaivePrediction'] = naiveTrain_y['SalePrice'].mean()

naiveScore_train = rmsle(naiveTrain_y['NaivePrediction'], naiveTrain_y['SalePrice'])

print("Naive Model Score: ", naiveScore_train)

Naive Model Score:  0.3644525219748948


This Naive Model score is about .364, which leaves a lot of room for improvement.

### A More Complex Model: Linear Regression

We will now use a linear regression model and measure the loss.

In [3]:
# first generate a train_test split for evaluating the performance of the model
from sklearn.model_selection import train_test_split

y_linear = train['SalePrice']

X_linear = train.drop(columns=['SalePrice'])

# we are using a train-test split of 70/30 to ensure enough data is used for training.
# the actual test set is the same size as the training set
X_linear_train, X_linear_test, y_linear_train, y_linear_test = train_test_split(X_linear, y_linear, 
                                                                                test_size=0.3, random_state=0)

In [4]:
# now we train the model on the train split of the training data
from sklearn.linear_model import LinearRegression

lm = LinearRegression()

lm.fit(X_linear_train, y_linear_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [5]:
# now we generate predictions on the validation split of the training data and score the predictions
# (uncomment the next three lines to also score the split the model was fitted on)
#lm_train_pred = lm.predict(X_linear_train)
#lm_train_rmsle = rmsle(lm_train_pred, y_linear_train)
#print('Linear Model RMSLE for Training Set: ', lm_train_rmsle)

lm_validate_pred = lm.predict(X_linear_test)
lm_validate_rmsle = rmsle(lm_validate_pred, y_linear_test)
print('Linear Model RMSLE for Validation Set: ', lm_validate_rmsle)

Linear Model RMSLE for Validation Set:  0.13807988910042823


The linear model has a loss of about .138, roughly 2.6 times lower than the naive model's loss of about .364.

This .138 is very good for a simple linear model.

### Linear Regression with Hand-Picked Features

We will now try a linear regression using only a small subset of features, the ones that I believe to be the most important:
 - TotalSF
 - TotalBaths
 - LotArea
 - Neighborhood
 - OverallCond
 - OverallQual
 - SaleCondition
   -  To catch foreclosures, short sales, and sales between family members, which would result in significantly lower prices than normal

In [6]:
# linear model with hand picked features
most_important_features = ['TotalSF', 'TotalBaths', 'LotArea', 
                           'OverallCond', 'OverallQual']

most_important_features.extend([i for i in list(train) if 'Neighborhood_' in i])
most_important_features.extend([i for i in list(train) if 'SaleCondition_' in i])

y_linear_hp = train['SalePrice']

X_linear_hp = train[most_important_features]

X_linear_hp_train, X_linear_hp_test, y_linear_hp_train, y_linear_hp_test = train_test_split(X_linear_hp, y_linear_hp,
                                                                                           test_size=.3, random_state=0)

In [7]:
lm_hp = LinearRegression()

lm_hp.fit(X_linear_hp_train, y_linear_hp_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [8]:
lm_hp_validate_pred = lm_hp.predict(X_linear_hp_test)
lm_hp_validate_rmsle = rmsle(lm_hp_validate_pred, y_linear_hp_test)
print('Linear Model RMSLE for Hand-Picked attributes: ', lm_hp_validate_rmsle)

Linear Model RMSLE for Hand-Picked attributes:  0.15685113282342042


The linear model with hand-picked features scored about .157, worse than the .138 of the linear model with all features, meaning I may not have selected the best features for the linear model.

I expect a Lasso model and a boosted model to outperform both linear models.

### Lasso Regression

Lasso regression adds an L1 penalty that shrinks the coefficients of the least useful features to exactly zero, so the model should keep the most important features and ignore the least important ones.
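
To illustrate how Lasso discards features, here is a small sketch on synthetic data (not the housing data). Only the first two columns influence the target, so a fitted Lasso model is expected to leave most of the other coefficients at zero. The data, the `alpha=0.1` value, and the variable names are all assumptions chosen for the example.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# only the first two columns actually influence the target
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

model = Lasso(alpha=0.1).fit(X, y)

# a coefficient of exactly zero means the feature was dropped from the model
n_kept = int(np.sum(model.coef_ != 0))
print(f"{n_kept} of {X.shape[1]} coefficients are non-zero")
```

The same check, `np.sum(lasso.coef_ != 0)`, could be run on the fitted housing model to see how many of the preprocessed features it actually uses.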

In [9]:
from sklearn.linear_model import LassoCV

y_lasso = train['SalePrice']

X_lasso = train.drop(columns=['SalePrice'])

X_lasso_train, X_lasso_test, y_lasso_train, y_lasso_test = train_test_split(X_lasso, y_lasso,
                                                                           test_size=.3, random_state=0)

In [10]:
# train the lasso model
lasso = LassoCV(alphas=[50, 30, 17.5, 15, 10, 1, 0.1, 0.01, 0.001]).fit(X_lasso_train, y_lasso_train)

In [11]:
# get predictions and check accuracy
lasso_validate_pred = lasso.predict(X_lasso_test)
lasso_validate_rmsle = rmsle(lasso_validate_pred, y_lasso_test)
print('Lasso Model RMSLE: ', lasso_validate_rmsle)

Lasso Model RMSLE:  0.11467111358094641


I was able to get the Lasso RMSLE down to .1147 by tuning the candidate alpha values passed to LassoCV. This is a very good score: the lowest values on Kaggle are in the .109 range, so my result is not too far off.
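
One way to run this kind of alpha search is to hand `LassoCV` a log-spaced grid and then inspect which value cross-validation selected. The sketch below uses synthetic data, and the grid bounds and fold count are assumptions for illustration, not the values tuned above.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 8))
y = 4.0 * X[:, 0] + rng.normal(scale=0.5, size=150)

# 20 candidate penalties spread evenly on a log scale from 0.001 to 100
alpha_grid = np.logspace(-3, 2, 20)
search = LassoCV(alphas=alpha_grid, cv=5).fit(X, y)

# alpha_ is the candidate with the lowest mean cross-validated error
print("selected alpha:", search.alpha_)
```

If the selected value sits at either end of the grid, that is usually a hint to widen the grid in that direction and search again.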

In [12]:
# let's submit the results of the lasso model to Kaggle.
lasso_test_pred = lasso.predict(test)
lasso_sub = pd.DataFrame(test['Id'])
lasso_sub['SalePrice'] = lasso_test_pred

lasso_sub = gen_sub(lasso_sub, 'Id', 'SalePrice', filename='lasso_submission.csv')

### Boosted Model

In [13]:
import xgboost as xgb

y_xgb = train['SalePrice']

X_xgb = train.drop(columns=['SalePrice'])

X_xgb_train, X_xgb_test, y_xgb_train, y_xgb_test = train_test_split(X_xgb, y_xgb,
                                                                   test_size=0.3, random_state=0)

In [14]:
xgb_model = xgb.XGBRegressor()

xgb_model.fit(X_xgb_train, y_xgb_train)

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)

In [15]:
# get predictions and find error
xgb_validation_pred = xgb_model.predict(X_xgb_test)
xgb_validation_rmsle = rmsle(xgb_validation_pred, y_xgb_test)
print('XGBoost Model RMSLE: ', xgb_validation_rmsle)

XGBoost Model RMSLE:  0.11692079007432554


The boosted model was the second best of all the models tested, with an RMSLE of .1169. Only the Lasso model outperformed it, and even then only slightly.
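
For a side-by-side view, the scores printed above can be collected and ranked. The values below are copied from the notebook output and rounded to four decimals. Note that the naive score was computed on the full training set, while the others come from the 30% validation split.

```python
# RMSLE values reported earlier in this notebook, rounded to four decimals
scores = {
    "Naive (mean price)": 0.3645,
    "Linear regression (all features)": 0.1381,
    "Linear regression (hand-picked)": 0.1569,
    "Lasso": 0.1147,
    "XGBoost": 0.1169,
}

# lower RMSLE is better, so sort ascending
ranked = sorted(scores.items(), key=lambda item: item[1])
for name, score in ranked:
    print(f"{name:35} {score:.4f}")
```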

In [16]:
# get 10 most important features in the dataset.
importance = list(xgb_model.feature_importances_)
labels = list(train.drop(columns='SalePrice'))

mapping = {}
for i, j in enumerate(labels):
    mapping[j] = importance[i]
    
sorted_features = sorted([(j,i) for i,j in mapping.items()], reverse=True)
print('The 10 most important features XGBoosted Model:')
for i in range(10):
    print(f'{sorted_features[i][1]:15}: {sorted_features[i][0]:.4f}')

The 10 most important features XGBoosted Model:
LotArea        : 0.0840
OverallQual    : 0.0611
OverallCond    : 0.0565
TotalSF        : 0.0534
Id             : 0.0534
GrLivArea      : 0.0473
GarageArea     : 0.0473
TotalBsmtSF    : 0.0412
YearBuilt      : 0.0397
GarageYrBlt    : 0.0305


In [17]:
# Generate Kaggle submission for the xgboosted model
xgb_test_pred = xgb_model.predict(test)
xgb_sub = pd.DataFrame(test['Id'])
xgb_sub['SalePrice'] = xgb_test_pred

xgb_sub = gen_sub(xgb_sub, 'Id', 'SalePrice', filename='xgb_submission.csv')