# Housing Price Prediction

Competition: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/overview  

This is the notebook for my entry to the housing prices prediction competition on Kaggle.
As of right now, my best submission has a score (root mean squared logarithmic error) of 0.15797, and 
I am 3301 on the leaderboard.

My next steps are to try sklearn's built-in feature_selection functions and PCA to see if they can improve the model's accuracy compared to picking features by intuition.

2/9/20 UPDATE: After learning more about Scikit-Learn and some of the tools that come with it, I made a pipeline to transform and fit the data and found marked improvements to my model! As of this update, I am now 3184 on the leaderboard with a score of 0.15202. I decided to do cross validation with KFold because I felt that there weren't enough examples to simply split them into training and test sets.
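
The pipeline plus KFold idea described above can be sketched on a small toy frame. This is only an illustration under stated assumptions: the data is randomly generated (it is not the Kaggle set), the three column names are borrowed from the real data for readability, and `cross_val_score` is swapped in for the hand-written fold loop used later in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the housing data (hypothetical values, not the real set).
rng = np.random.default_rng(0)
n_rows = 100
toy = pd.DataFrame({
    "LotArea": rng.uniform(5000, 15000, n_rows),
    "OverallQual": rng.integers(1, 11, n_rows).astype(float),
    "Street": rng.choice(["Pave", "Grvl"], n_rows),
})
toy.loc[::10, "LotArea"] = np.nan  # a few missing values for the imputer
price = 50000 + 20000 * toy["OverallQual"] + 5 * toy["LotArea"].fillna(10000)

# Numeric columns: impute then scale. Categorical columns: impute then one-hot.
preprocessor = ColumnTransformer([
    ("num", make_pipeline(SimpleImputer(strategy="mean"), StandardScaler()),
     ["LotArea", "OverallQual"]),
    ("cat", make_pipeline(SimpleImputer(strategy="most_frequent"),
                          OneHotEncoder(handle_unknown="ignore")),
     ["Street"]),
])
model = make_pipeline(preprocessor,
                      RandomForestRegressor(n_estimators=50, random_state=0))

# Every row is used for validation exactly once across the 5 folds.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, toy, price, cv=cv,
                         scoring="neg_mean_squared_log_error")
mean_msle = -scores.mean()
print("mean MSLE over 5 folds:", mean_msle)
```

Because the preprocessing lives inside the pipeline, the imputer and scaler are refit on the training part of each fold, so nothing from the validation rows leaks into the transforms.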

The next steps from this point are to do some hyperparameter tuning and to see whether more feature engineering/feature selection techniques will improve performance. For example, I may want to use tools from the feature_selection package such as SelectKBest, with a scoring function suited to regression like f_regression (chi2 is meant for classification targets). There will be some feature engineering involved, as some of the numerical features could be better expressed as categorical values. I will need to look at each feature to see if there is anything better that can be done.
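
As a rough sketch of those two ideas, the snippet below runs SelectKBest on synthetic data and then recasts a numeric code column as a category. The feature names `informative`, `noise_a` and `noise_b` are made up for illustration, `f_regression` is my choice of scoring function for a continuous target, and treating `MSSubClass` as a categorical code is an assumption about how that column should be read.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data: one column drives the target, two are pure noise.
rng = np.random.default_rng(0)
n_rows = 200
features = pd.DataFrame({
    "informative": rng.normal(size=n_rows),
    "noise_a": rng.normal(size=n_rows),
    "noise_b": rng.normal(size=n_rows),
})
target = 3.0 * features["informative"] + rng.normal(scale=0.1, size=n_rows)

# Keep the single feature with the highest univariate F-statistic.
selector = SelectKBest(score_func=f_regression, k=1)
selector.fit(features, target)
kept = list(features.columns[selector.get_support()])
print("selected:", kept)

# A numeric column that is really a code can be turned into a categorical
# one by casting to string before one-hot encoding.
codes = pd.Series([60, 20, 60, 70], name="MSSubClass")
dummies = pd.get_dummies(codes.astype(str), prefix="MSSubClass")
print(dummies.columns.tolist())
```

In the real notebook the selector would sit inside the numeric branch of the pipeline so that it is refit on each training fold.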


In [12]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [0]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
import tqdm
%matplotlib inline

In [0]:
# no trailing slash: the file names below are appended with a leading "/"
pwd = "/content/drive/My Drive/Colab Notebooks/housing_prices"

In [0]:
train_data = pd.read_csv(pwd + "/train.csv")
test_data = pd.read_csv(pwd + "/test.csv")

In [7]:
train_data.columns

Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive

In [8]:
train_data.head(5)

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,...,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,...,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,...,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,...,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,...,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,...,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,,,,0,12,2008,WD,Normal,250000


In [0]:
# replace current index with index from data frame
train_data.index = train_data["Id"]
train_data = train_data.drop("Id", axis=1)

test_data.index = test_data["Id"]
test_data = test_data.drop("Id", axis=1)

In [0]:
x_train = train_data.drop("SalePrice", axis=1)
y_train = train_data["SalePrice"]

# reset the Id column because KFold uses 0-indexing and the data uses 1-indexing
x_train = x_train.reset_index().drop(["Id"], axis=1)
x_train.index.name = "Id"

y_train = y_train.reset_index().drop(["Id"], axis=1)
y_train.index.name = "Id"
y_train = np.ravel(y_train)

# cols identified as not having enough values to be considered
sparse_cols = ["Alley", "FireplaceQu", "PoolQC", "Fence", "MiscFeature"]
x_train = x_train.drop(sparse_cols, axis=1)
x_test = test_data.drop(sparse_cols, axis=1)

In [0]:
num_cols = x_train.select_dtypes(include=[np.number]).columns
cat_cols = x_train.select_dtypes(exclude=np.number).columns

In [0]:
num_cols, cat_cols

## Model Selection and Training

This is where I create a model and try to identify more features to see if the accuracy will improve. In the code below, you will see I implemented a check for whether the score has improved from one run to the next. For categorical feature testing, I used the training data from the numerical features as a baseline to see if adding a given categorical feature would improve the accuracy of the model. I measured the error with Mean Squared Log Error (the Kaggle competition reports its square root) and tested Support Vector Regression, Linear Regression, K Neighbors Regression, Decision Tree Regressor and Random Forest Regressor.
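
To relate the scores printed in this notebook to the leaderboard numbers, it helps to see that the root mean squared log error is just the square root of what `mean_squared_log_error` returns. A minimal sketch: the three true prices are the first `SalePrice` values shown in the table above, and the predictions are made-up numbers for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

y_true = np.array([208500.0, 181500.0, 223500.0])  # first three SalePrice values
y_pred = np.array([200000.0, 190000.0, 230000.0])  # hypothetical predictions

# scikit-learn computes the mean of (log(1 + pred) - log(1 + true))^2.
msle = mean_squared_log_error(y_true, y_pred)
rmsle = np.sqrt(msle)

# Same quantity written out by hand.
manual = np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

print("MSLE:", msle)
print("RMSLE:", rmsle)
```

By the same arithmetic, a validation MSLE of about 0.0146 corresponds to an RMSLE of roughly 0.12, which is the scale the leaderboard scores are on.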

From this experiment, I found that the model with the best training accuracy is the Decision Tree Regressor. However, the Random Forest Regressor tended to do better on the validation set, so I went with the RF Regressor in my Kaggle submissions.
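
The gap between training and validation performance can be reproduced on synthetic data. This is a sketch, not the experiment I actually ran: the data is generated with a made-up formula, only two of the five regressors are compared, and plain mean squared error is used instead of the log error because the toy target is not a price.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_validate
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem with noise (hypothetical, for illustration only).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 5))
y = (50000 + 10000 * X[:, 0] + 5000 * np.sin(X[:, 1])
     + rng.normal(scale=5000, size=300))

cv = KFold(n_splits=5, shuffle=True, random_state=0)
models = {
    "tree": DecisionTreeRegressor(random_state=0),
    "forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    out = cross_validate(model, X, y, cv=cv,
                         scoring="neg_mean_squared_error",
                         return_train_score=True)
    # Scores are negated MSE, so flip the sign back.
    results[name] = {
        "train_mse": -out["train_score"].mean(),
        "val_mse": -out["test_score"].mean(),
    }
    print(name, results[name])
```

A fully grown decision tree can memorise its training rows, so its training error says little about how it will do on unseen houses. The validation column is the one to compare.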

In [0]:
from sklearn.preprocessing import Normalizer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

# from sklearn.model_selection import train_test_split
# from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import KFold
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer

# from sklearn.svm import SVR
# from sklearn.linear_model import LinearRegression
# from sklearn.neighbors import KNeighborsRegressor
# from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_log_error

In [20]:
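# keep track of which fitted pipeline gives the best (lowest) single-fold score
best_model = None
min_score = float("inf")

# NOTE: n_splits is fixed at 5 below, so k does not change the number of folds;
# each k just repeats the same search with a freshly seeded random forest
for k in range(1, 10):
  # 5-fold cross validation (no shuffling, so the folds are identical for every k)
  kfold = KFold(n_splits=5)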

  # train, x_val, ytrain, y_val = train_test_split(x_train, y_train, test_size=.33)
  # hyperparam tuning for n_components on PCA for numerical features
  for n in range(3, len(num_cols)):
    print("k: {} n: {}".format(k, n))
    # pipeline for numerical features
    num_trans = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler(), PCA(n_components=n))
    # pipeline for categorical features
    cat_trans = make_pipeline(SimpleImputer(strategy="most_frequent"), OneHotEncoder(handle_unknown="ignore"))
    preprocessor = ColumnTransformer(transformers=[("nums", num_trans, num_cols), ("cat", cat_trans, cat_cols)])

    for train, validation in kfold.split(x_train): 
      reg = make_pipeline(preprocessor, RandomForestRegressor())
      reg.fit(x_train.loc[train], y_train[train])
      pred = reg.predict(x_train.loc[validation])
      score = mean_squared_log_error(y_train[validation], pred)
      if score < min_score:
        min_score = score
        best_model = reg
        print("New Min Score Found!: {0} {1} {2}".format(k, n, min_score))

k: 1 n: 3
New Min Score Found!: 1	3	0.015997474853247415
k: 1 n: 4
k: 1 n: 5
New Min Score Found!: 1	5	0.015305370807277452
k: 1 n: 6
New Min Score Found!: 1	6	0.01494957690635384
k: 1 n: 7
k: 1 n: 8
k: 1 n: 9
k: 1 n: 10
k: 1 n: 11
k: 1 n: 12
k: 1 n: 13
k: 1 n: 14
k: 1 n: 15
k: 1 n: 16
k: 1 n: 17
k: 1 n: 18
k: 1 n: 19
k: 1 n: 20
k: 1 n: 21
k: 1 n: 22
k: 1 n: 23
k: 1 n: 24
k: 1 n: 25
k: 1 n: 26
k: 1 n: 27
k: 1 n: 28
k: 1 n: 29
k: 1 n: 30
k: 1 n: 31
k: 1 n: 32
k: 1 n: 33
k: 1 n: 34
k: 1 n: 35
k: 2 n: 3
k: 2 n: 4
k: 2 n: 5
k: 2 n: 6
New Min Score Found!: 2	6	0.014777960215137078
k: 2 n: 7
k: 2 n: 8
k: 2 n: 9
k: 2 n: 10
k: 2 n: 11
k: 2 n: 12
k: 2 n: 13
k: 2 n: 14
k: 2 n: 15
k: 2 n: 16
k: 2 n: 17
k: 2 n: 18
k: 2 n: 19
k: 2 n: 20
k: 2 n: 21
k: 2 n: 22
k: 2 n: 23
k: 2 n: 24
k: 2 n: 25
k: 2 n: 26
k: 2 n: 27
k: 2 n: 28
k: 2 n: 29
k: 2 n: 30
k: 2 n: 31
k: 2 n: 32
k: 2 n: 33
k: 2 n: 34
k: 2 n: 35
k: 3 n: 3
k: 3 n: 4
k: 3 n: 5
k: 3 n: 6
k: 3 n: 7
k: 3 n: 8
k: 3 n: 9
New Min Score Found!: 3	9	0.014

In [21]:
min_score

0.01455576357494706

In [0]:
# predict with the best-scoring pipeline; `reg` is only the last one fitted in the loop
y_pred = best_model.predict(x_test)

In [0]:
# write our predictions to file
ret = pd.DataFrame(columns=["Id", "SalePrice"])
ret["Id"] = x_test.index
ret["SalePrice"] = y_pred
ret.to_csv(path_or_buf=pwd+"/submission.csv", index=False)