# Housing Price Prediction

Competition: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/overview  

This is the notebook for my entry to the housing prices prediction competition on Kaggle.
As of right now, my best submission has a score (root mean squared logarithmic error) of 0.15797, which puts me
at 3301 on the leaderboard.

My next steps are to try sklearn's built-in feature_selection functions and PCA to see whether they improve the model's accuracy compared to the features I picked by intuition.

2/9/20 UPDATE: After learning more about scikit-learn and some of the tools that come with it, I built a pipeline to transform and fit the data and saw a marked improvement in my model! As of this update, I am now 3184 on the leaderboard with a score of 0.15202. I decided to use KFold cross validation because I felt there weren't enough examples to simply split the data into training and test sets. 

My next steps from here are to do some hyperparameter tuning and to see whether more feature engineering/feature selection techniques improve performance. For example, I may want to use tools from the feature_selection package such as SelectKBest. Note that chi2 is designed for classification targets, so a regression scorer like f_regression is the one to pair with it here. 
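
As a rough sketch of what that next step could look like, the snippet below tunes the number of selected features and the tree depth together with GridSearchCV. It runs on synthetic data from `make_regression` rather than the housing data, and the grid values, `X_demo`/`y_demo`, and the step names `select` and `model` are all assumptions made for illustration, not settings I have actually tried.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the housing data (assumption): 20 features, 5 informative.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0
)

# Feature selection lives inside the pipeline so it is refit on every training fold.
pipe = Pipeline([
    ("select", SelectKBest(score_func=f_regression)),
    ("model", RandomForestRegressor(n_estimators=20, random_state=0)),
])

# Hypothetical grid: how many features to keep and how deep the trees may grow.
param_grid = {
    "select__k": [5, 10, 20],
    "model__max_depth": [None, 5],
}

search = GridSearchCV(
    pipe,
    param_grid,
    cv=KFold(n_splits=3, shuffle=True, random_state=0),
    scoring="neg_mean_squared_error",
)
search.fit(X_demo, y_demo)

print(search.best_params_)
```

Keeping the selector inside the pipeline means the validation fold never influences which features are kept, which is the main reason to tune both pieces in one search.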

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re

%matplotlib inline

In [3]:
train_data = pd.read_csv("./train.csv")
test_data = pd.read_csv("./test.csv")

In [4]:
train_data.columns

Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive

In [None]:
train_data.head(5)

In [5]:
# use the Id column as the index so it is not treated as a feature
train_data = train_data.set_index("Id")
test_data = test_data.set_index("Id")

In [121]:
x_train = train_data.drop("SalePrice", axis=1)
y_train = train_data["SalePrice"]

# KFold yields 0-based positions but Id is 1-based, so switch to a 0-based index
x_train = x_train.reset_index(drop=True)
x_train.index.name = "Id"

# flatten the target to a 1-D NumPy array so it can be indexed by fold positions
y_train = y_train.to_numpy()

# columns with too few non-missing values to be worth keeping
sparse_cols = ["Alley", "FireplaceQu", "PoolQC", "Fence", "MiscFeature"]
x_train = x_train.drop(sparse_cols, axis=1)
x_test = test_data.drop(sparse_cols, axis=1)
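
One way columns like these could be identified is by looking at the share of missing values per column. The sketch below uses a tiny made-up frame, and the 0.5 cutoff and the names `demo`, `missing_share`, and `sparse_demo_cols` are assumptions for illustration. It is not necessarily how the list above was chosen.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame (assumption) that mimics a few housing columns.
demo = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550],
    "Alley": [np.nan, np.nan, "Grvl", np.nan],
    "PoolQC": [np.nan, np.nan, np.nan, np.nan],
    "MSZoning": ["RL", "RL", np.nan, "RM"],
})

# Fraction of missing values in each column, largest first.
missing_share = demo.isna().mean().sort_values(ascending=False)

# Hypothetical cutoff: treat a column as sparse if more than half its rows are missing.
threshold = 0.5
sparse_demo_cols = missing_share[missing_share > threshold].index.tolist()

print(missing_share)
print(sparse_demo_cols)
```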

In [65]:
num_cols = x_train.select_dtypes(include=[np.number]).columns
cat_cols = x_train.select_dtypes(exclude=np.number).columns

In [66]:
num_cols, cat_cols

Index(['MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual', 'OverallCond',
       'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2',
       'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF',
       'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath',
       'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces',
       'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal',
       'MoSold', 'YrSold'],
      dtype='object')

## Model Selection and Training

This is where I create a model and try to identify more features to see whether the accuracy improves. In the code below, I keep track of whether the score has improved from one run to the next. For categorical feature testing, I used the model trained on the numerical features as a baseline to see whether adding a given categorical feature would improve the score. I measured accuracy with mean squared log error (the Kaggle competition ranks on its square root, RMSLE) and tested Support Vector Regression, Linear Regression, K Neighbors Regression, Decision Tree Regressor and Random Forest Regressor. 

From this experiment, I found that the model with the best training accuracy is the Decision Tree Regressor. However, the Random Forest Regressor tended to do better on the validation set, so I went with the RF Regressor in my Kaggle submissions. 
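
The gap between training and validation scores can be seen in a small sketch like the one below. It uses synthetic data rather than the housing data, only three of the five models, and a hypothetical helper `clipped_msle` that clamps negative predictions to 0 before scoring. All names and settings here are assumptions for illustration, not the exact experiment I ran.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_squared_log_error
from sklearn.model_selection import KFold, cross_validate
from sklearn.tree import DecisionTreeRegressor


def clipped_msle(y_true, y_pred):
    """MSLE with negative predictions clamped to 0 (MSLE is undefined for negatives)."""
    return mean_squared_log_error(y_true, np.clip(y_pred, 0, None))


# greater_is_better=False makes scikit-learn negate the value; abs() below undoes that.
msle_scorer = make_scorer(clipped_msle, greater_is_better=False)

# Synthetic stand-in for the housing data (assumption), shifted so targets are positive.
X_demo, y_raw = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)
y_demo = y_raw - y_raw.min() + 1.0

candidates = {
    "linear": LinearRegression(),
    "decision_tree": DecisionTreeRegressor(random_state=0),
    "random_forest": RandomForestRegressor(n_estimators=20, random_state=0),
}

cv = KFold(n_splits=5, shuffle=True, random_state=0)
results = {}
for name, model in candidates.items():
    cv_out = cross_validate(
        model, X_demo, y_demo, cv=cv, scoring=msle_scorer, return_train_score=True
    )
    results[name] = {
        "train": abs(cv_out["train_score"].mean()),
        "validation": abs(cv_out["test_score"].mean()),
    }

for name, scores in results.items():
    print(f"{name:>14}  train={scores['train']:.4f}  validation={scores['validation']:.4f}")
```

An unpruned decision tree can memorize its training folds, so its training error is close to zero while its validation error is not. That is why validation scores, not training scores, should drive the choice of model.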

In [94]:
from sklearn.preprocessing import Normalizer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer

# from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer

from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_log_error

In [150]:
# 5-fold cross validation (without shuffle=True the folds are contiguous blocks of rows)
kfold = KFold(n_splits=5)

# pipeline for numerical features
num_trans = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler(), PCA(n_components=24))

# pipeline for categorical features
cat_trans = make_pipeline(SimpleImputer(strategy="most_frequent"), OneHotEncoder(handle_unknown="ignore"))
preprocessor = ColumnTransformer(transformers=[("nums", num_trans, num_cols), ("cat", cat_trans, cat_cols)])

# keep the fitted pipeline from the fold with the lowest validation MSLE
best_model = None
min_score = float("inf")
for train, validation in kfold.split(x_train):
    # KFold yields positional indices, so select rows with .iloc
    reg = make_pipeline(preprocessor, RandomForestRegressor())
    reg.fit(x_train.iloc[train], y_train[train])
    pred = reg.predict(x_train.iloc[validation])
    # MSLE is undefined for negative values, so clamp predictions at 0
    pred[pred < 0] = 0
    score = mean_squared_log_error(y_train[validation], pred)
    if score < min_score:
        min_score = score
        best_model = reg

In [151]:
min_score

0.01556653747993075
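
The value above is a mean squared log error, while the leaderboard reports its square root, so the two are not directly comparable. A minimal sketch of the conversion follows. The helper `rmsle` and the toy arrays are hypothetical names added for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error, clamping negative predictions to 0."""
    y_pred = np.clip(np.asarray(y_pred, dtype=float), 0, None)
    return float(np.sqrt(mean_squared_log_error(y_true, y_pred)))


# Square root of the best fold's MSLE printed above.
best_fold_msle = 0.01556653747993075
best_fold_rmsle = float(np.sqrt(best_fold_msle))
print(best_fold_rmsle)

# Toy check (made-up prices) that the helper matches the formula written out by hand.
y_true_demo = np.array([100000.0, 200000.0, 300000.0])
y_pred_demo = np.array([110000.0, 190000.0, 310000.0])
manual = float(np.sqrt(np.mean((np.log1p(y_pred_demo) - np.log1p(y_true_demo)) ** 2)))
print(rmsle(y_true_demo, y_pred_demo), manual)
```

On the RMSLE scale the best fold comes out at roughly 0.125. Since that is the best single fold rather than an average over folds, it is likely an optimistic estimate next to the leaderboard score.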

In [152]:
# predict with the best fold's pipeline rather than whichever fold happened to run last
y_pred = best_model.predict(x_test)

In [153]:
# write our predictions to file
ret = pd.DataFrame(columns=["Id", "SalePrice"])
ret["Id"] = x_test.index
ret["SalePrice"] = y_pred
ret.to_csv(path_or_buf="./submission.csv", index=False)