## General Overview

-------------

The main goal of this research is to build and compare a few models that predict housing prices in Ames, Iowa (USA). The data is sourced from Kaggle: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data. It contains 2919 housing records in total, of which 1460 will be used for training and 1459 for testing our models. We are going to use three files:

- train.csv -> training data in CSV format
- test.csv -> testing data in CSV format
- data_description.txt -> attributes description

Let's start off by importing the necessary modules and reading the files.

In [1]:
# Basic modules for dataframe manipulation
import numpy as np
import pandas as pd
from pandas.api.types import is_string_dtype, is_numeric_dtype, is_categorical_dtype

# Plots
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Machine learning
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
import xgboost as xgb
from xgboost.sklearn import XGBClassifier

# Data Standardization
from sklearn.preprocessing import StandardScaler

# Cross-validation
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

# Metrics
from sklearn.metrics import mean_absolute_error

# Don't display warnings 
import warnings
warnings.filterwarnings("ignore")

In [2]:
# Read files into a dataframe
df_train = pd.read_csv('train.csv', low_memory = False)
df_test = pd.read_csv('test.csv', low_memory = False)

# Merge training and testing datasets
df_raw = pd.concat([df_train.drop('SalePrice', axis = 1), df_test])
print("Number of records: {}\nNumber of variables: {}".format(df_raw.shape[0], df_raw.shape[1]))

Number of records: 2919
Number of variables: 80
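
By default, `pd.concat` keeps each frame's own row labels, so the merged frame contains repeated index values. The following is a minimal sketch, on made-up toy frames rather than the real CSV files, of how the two sets could be merged with a unique index and split back by position later. The `toy_*` names and values are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-ins for train.csv and test.csv (hypothetical values)
toy_train = pd.DataFrame({
    "Id": [1, 2, 3],
    "LotArea": [8450, 9600, 11250],
    "SalePrice": [208500, 181500, 223500],
})
toy_test = pd.DataFrame({"Id": [4, 5], "LotArea": [9550, 14260]})

# Remember how many rows belong to the training set
n_train = len(toy_train)

# ignore_index=True builds a fresh 0..n-1 index instead of repeating labels
toy_all = pd.concat(
    [toy_train.drop("SalePrice", axis=1), toy_test],
    ignore_index=True,
)

# After preprocessing, split back by position
toy_train_part = toy_all.iloc[:n_train]
toy_test_part = toy_all.iloc[n_train:]

print(toy_all.index.is_unique, toy_train_part.shape, toy_test_part.shape)
```

Splitting by position only works as long as no rows are dropped or reordered between the merge and the split.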


It is important to look at the data first in order to understand its format, structure, value types, the number (percentage) of missing values, etc.

In [3]:
# Change the default number of columns displayed by DataFrame's head method
pd.set_option('display.max_columns', 85)

# Display first 5 rows
df_raw.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706.0,Unf,0.0,150.0,856.0,GasA,Ex,Y,SBrkr,856,854,0,1710,1.0,0.0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003.0,RFn,2.0,548.0,TA,TA,Y,0,61,0,0,0,0,,,,0,2,2008,WD,Normal
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978.0,Unf,0.0,284.0,1262.0,GasA,Ex,Y,SBrkr,1262,0,0,1262,0.0,1.0,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2.0,460.0,TA,TA,Y,298,0,0,0,0,0,,,,0,5,2007,WD,Normal
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486.0,Unf,0.0,434.0,920.0,GasA,Ex,Y,SBrkr,920,866,0,1786,1.0,0.0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2.0,608.0,TA,TA,Y,0,42,0,0,0,0,,,,0,9,2008,WD,Normal
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216.0,Unf,0.0,540.0,756.0,GasA,Gd,Y,SBrkr,961,756,0,1717,1.0,0.0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3.0,642.0,TA,TA,Y,0,35,272,0,0,0,,,,0,2,2006,WD,Abnorml
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655.0,Unf,0.0,490.0,1145.0,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1.0,0.0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3.0,836.0,TA,TA,Y,192,84,0,0,0,0,,,,0,12,2008,WD,Normal


As we can see, our dataset consists of various data types (integers, floats, strings), so let's check what their exact types are.

In [4]:
df_raw.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2919 entries, 0 to 1458
Data columns (total 80 columns):
Id               2919 non-null int64
MSSubClass       2919 non-null int64
MSZoning         2915 non-null object
LotFrontage      2433 non-null float64
LotArea          2919 non-null int64
Street           2919 non-null object
Alley            198 non-null object
LotShape         2919 non-null object
LandContour      2919 non-null object
Utilities        2917 non-null object
LotConfig        2919 non-null object
LandSlope        2919 non-null object
Neighborhood     2919 non-null object
Condition1       2919 non-null object
Condition2       2919 non-null object
BldgType         2919 non-null object
HouseStyle       2919 non-null object
OverallQual      2919 non-null int64
OverallCond      2919 non-null int64
YearBuilt        2919 non-null int64
YearRemodAdd     2919 non-null int64
RoofStyle        2919 non-null object
RoofMatl         2919 non-null object
Exterior1st      2918 non-

According to the above result, strings representing categorical variables are stored as objects, which is inefficient in terms of memory and processing time, so we will convert their data type into "category".
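
To illustrate the memory difference, here is a small sketch that compares the footprint of the same strings stored as `object` and as `category`. The `zoning` series is a made-up example, not the real column, and the exact byte counts depend on the pandas version.

```python
import pandas as pd

# Hypothetical column: a few distinct strings repeated many times
zoning = pd.Series(["RL", "RM", "FV", "RH"] * 1000, dtype="object")
as_category = zoning.astype("category")

# deep=True also counts the memory held by the Python string objects
object_bytes = zoning.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print("object:  ", object_bytes, "bytes")
print("category:", category_bytes, "bytes")
```

The saving comes from storing each distinct string once and keeping small integer codes per row, so it is largest for columns with few unique values.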

## Data preprocessing


-------------------------------------

Data preprocessing is a critical step that converts the raw data into a clean data set, which machine learning algorithms require. The common steps are:

- Cleaning: removing or fixing missing data
- Formatting: adjusting the type of each column to make it suitable for machine learning algorithms


### Cleaning

We have seen above that some variables have missing data, which makes them unusable with many machine learning algorithms. To fix this problem, we will drop the variables that have more than 75% of their data missing. For the remaining columns, we will apply the following imputation methods: median for continuous variables and mode for categorical ones. The median is usually preferable to the mean because it is far less sensitive to outliers.
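
The effect of an outlier on the two statistics can be seen in a short sketch. The toy frame below uses invented values under real column names. It is only meant to show median and mode imputation side by side.

```python
import pandas as pd

# Toy data (hypothetical): one extreme LotFrontage value and two gaps
toy = pd.DataFrame({
    "LotFrontage": [60.0, 65.0, 70.0, None, 313.0],
    "MSZoning": ["RL", "RL", None, "RM", "RL"],
})

mean_value = toy["LotFrontage"].mean()      # pulled up by the outlier 313.0
median_value = toy["LotFrontage"].median()  # stays close to the typical values
mode_value = toy["MSZoning"].mode()[0]      # most frequent category

filled = toy.copy()
filled["LotFrontage"] = filled["LotFrontage"].fillna(median_value)
filled["MSZoning"] = filled["MSZoning"].fillna(mode_value)

print(mean_value, median_value, mode_value)
```

Here the mean is 127.0 while the median is 67.5, so filling the gap with the median gives a value much closer to the typical lot.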

In [5]:
# Select and print missing values ratio in descending order
missing = df_raw.isnull().sum().sort_values(ascending=False)/len(df_raw)
print(missing)

PoolQC           0.996574
MiscFeature      0.964029
Alley            0.932169
Fence            0.804385
FireplaceQu      0.486468
LotFrontage      0.166495
GarageCond       0.054471
GarageQual       0.054471
GarageYrBlt      0.054471
GarageFinish     0.054471
GarageType       0.053786
BsmtCond         0.028092
BsmtExposure     0.028092
BsmtQual         0.027749
BsmtFinType2     0.027407
BsmtFinType1     0.027064
MasVnrType       0.008222
MasVnrArea       0.007879
MSZoning         0.001370
BsmtHalfBath     0.000685
Utilities        0.000685
Functional       0.000685
BsmtFullBath     0.000685
BsmtFinSF1       0.000343
Exterior1st      0.000343
Exterior2nd      0.000343
BsmtFinSF2       0.000343
BsmtUnfSF        0.000343
TotalBsmtSF      0.000343
SaleType         0.000343
                   ...   
YearBuilt        0.000000
OverallCond      0.000000
SaleCondition    0.000000
Heating          0.000000
ExterQual        0.000000
ExterCond        0.000000
YrSold           0.000000
MoSold      

In [6]:
# Copy all columns with less than 75% missing values into a new variable: 'df'
df = df_raw.loc[:, missing < 0.75]
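
As a quick sanity check of the thresholding idea, the sketch below applies the same kind of filter to a toy frame with invented values. It uses `isnull().mean()` as an equivalent way to get the missing ratio and adds `.copy()` so later assignments act on an independent frame. Both are choices made for this illustration, not part of the original cell.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with different shares of missing data
toy = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, np.nan, "Gd"],  # 80% missing
    "LotFrontage": [65.0, np.nan, 68.0, 60.0, 84.0],   # 20% missing
    "LotArea": [8450, 9600, 11250, 9550, 14260],       # complete
})

# The mean of a boolean mask is the fraction of missing values per column
missing_ratio = toy.isnull().mean()

# Keep only the columns below the 75% threshold
kept = toy.loc[:, missing_ratio < 0.75].copy()

print(list(kept.columns))
```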

### Formatting

In this section, we are going to convert object data types into category, impute missing values and take a closer look at all variables. Instead of handling each variable individually, we will loop over the columns of a given data type to simplify and speed up the whole process. This is handled by functions stored in the "helper.py" module, since converting data types and imputing missing values are common to nearly every data science problem. These actions will result in a clean dataframe object, which can then be used for modelling.
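
The contents of "helper.py" are not shown in this notebook. The sketch below is one possible way such helpers could be written, assuming they loop over columns by data type and return a new dataframe. The `*_sketch` names are hypothetical and chosen so they do not shadow the real helper functions.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype, is_object_dtype, is_string_dtype


def obj_to_cat_sketch(df):
    """Convert string/object columns to the 'category' dtype."""
    out = df.copy()
    for name in out.columns:
        if is_object_dtype(out[name]) or is_string_dtype(out[name]):
            out[name] = out[name].astype("category")
    return out


def fill_missing_nums_sketch(df):
    """Fill gaps in numeric columns with the column median."""
    out = df.copy()
    for name in out.columns:
        if is_numeric_dtype(out[name]) and out[name].isnull().any():
            out[name] = out[name].fillna(out[name].median())
    return out


def fill_missing_cats_sketch(df):
    """Fill gaps in categorical columns with the most frequent category."""
    out = df.copy()
    for name in out.select_dtypes(include="category").columns:
        if out[name].isnull().any():
            out[name] = out[name].fillna(out[name].mode()[0])
    return out


# Toy frame with invented values, one gap per column
toy = pd.DataFrame({
    "MSZoning": ["RL", None, "RL", "RM"],
    "LotFrontage": [65.0, None, 68.0, 60.0],
})
clean = fill_missing_cats_sketch(fill_missing_nums_sketch(obj_to_cat_sketch(toy)))
print(clean.dtypes)
```

Each function copies its input, so the original frame is left untouched and the steps can be chained in any order that keeps the category conversion before the mode imputation.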

In [7]:
# Import helper functions which are used to speed up the preprocessing
from helper import obj_to_cat, fill_missing_nums, fill_missing_cats

In [8]:
# Convert objects(strings) into category data type
df = obj_to_cat(df)

In [9]:
# Fill missing numerical data with median
df = fill_missing_nums(df)

In [10]:
# Fill missing categorical data with mode
df = fill_missing_cats(df)

# Check if the functions worked as intended
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2919 entries, 0 to 1458
Data columns (total 76 columns):
Id               2919 non-null int64
MSSubClass       2919 non-null int64
MSZoning         2919 non-null category
LotFrontage      2919 non-null float64
LotArea          2919 non-null int64
Street           2919 non-null category
LotShape         2919 non-null category
LandContour      2919 non-null category
Utilities        2919 non-null category
LotConfig        2919 non-null category
LandSlope        2919 non-null category
Neighborhood     2919 non-null category
Condition1       2919 non-null category
Condition2       2919 non-null category
BldgType         2919 non-null category
HouseStyle       2919 non-null category
OverallQual      2919 non-null int64
OverallCond      2919 non-null int64
YearBuilt        2919 non-null int64
YearRemodAdd     2919 non-null int64
RoofStyle        2919 non-null category
RoofMatl         2919 non-null category
Exterior1st      2919 non-null cate