## Data preparation

In [91]:
# Load the "autoreload" extension
%load_ext autoreload

# always reload modules marked with "%aimport"
%autoreload 1

import os
import sys

# add the 'src' directory as one where we can import modules
src_dir = os.path.join(os.getcwd(), os.pardir, 'src')
sys.path.append(src_dir)

# import my method from the source code
%aimport features.build_features
from features.build_features import read_raw_data
from features.build_features import transform_ordinals
import numpy as np
import pandas as pd

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [92]:
df = read_raw_data("../data/raw/train.csv")
the_ordinals = ['ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'HeatingQC', 
            'KitchenQual', 'FireplaceQu', 'GarageQual', 'GarageCond', 'PoolQC']
print(np.unique(df[the_ordinals].fillna('').values))


['' 'Av' 'Ex' 'Fa' 'Gd' 'Mn' 'No' 'Po' 'TA']


Let's replace these rankings with numeric values:
NA (missing), No = 0
Po (Poor), Mn (Minimum) = 1
Fa (Fair) = 2
Av (Average), TA (Typical/Average) = 3
Gd (Good) = 4
Ex (Excellent) = 5

In [93]:
df = transform_ordinals(df)
print(np.unique(df[the_ordinals].fillna('').values))


[0 1 2 3 4 5]
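
The body of `transform_ordinals` lives in the project's `src` directory and is not shown in this notebook. A minimal sketch of how such a mapping could be applied with pandas follows; the names `ORDINAL_MAP` and `transform_ordinals_sketch` are hypothetical, and the choice to send unknown codes to 0 is an assumption.

```python
import pandas as pd

# Hypothetical mapping that follows the ranking listed above
ORDINAL_MAP = {
    "No": 0,
    "Po": 1, "Mn": 1,
    "Fa": 2,
    "Av": 3, "TA": 3,
    "Gd": 4,
    "Ex": 5,
}

def transform_ordinals_sketch(df, columns):
    """Return a copy of df with the quality codes in `columns` replaced by integers."""
    out = df.copy()
    for col in columns:
        # Codes missing from the mapping (including NA) become NaN, then 0
        out[col] = out[col].map(ORDINAL_MAP).fillna(0).astype(int)
    return out

# Small made-up frame, only to illustrate the call
demo = pd.DataFrame({
    "ExterQual": ["Gd", "TA", "Ex"],
    "BsmtExposure": ["No", "Mn", None],
})
encoded = transform_ordinals_sketch(demo, ["ExterQual", "BsmtExposure"])
print(encoded)
```

Note that in this sketch a misspelled code would silently become 0, so a stricter version might raise an error for codes that are absent from the mapping.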


Next, transform all nominal (categorical) features into boolean dummy variables using one-hot encoding.

In [94]:
the_nominals = ['MSSubClass', 'MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities',
            'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
            'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
            'Foundation', 'BsmtFinType1', 'BsmtFinType2', 'Heating', 'CentralAir', 'Electrical',
            'Functional', 'GarageType', 'GarageFinish', 'PavedDrive', 'Fence', 'MiscFeature', 'SaleType',
            'SaleCondition']

In [95]:
print([df[c].dtype for c in the_nominals])

[dtype('int64'), dtype('O'), dtype('O'), dtype('O'), dtype('O'), dtype('O'), dtype('O'), dtype('O'), dtype('O'), dtype('O'), dtype('O'), dtype('O'), dtype('O'), dtype('O'), dtype('O'), dtype('O'), dtype('O'), dtype('O'), dtype('O'), dtype('O'), dtype('O'), dtype('O'), dtype('O'), dtype('O'), dtype('O'), dtype('O'), dtype('O'), dtype('O'), dtype('O'), dtype('O'), dtype('O'), dtype('O'), dtype('O')]


By default, the pandas get_dummies function only encodes columns of object ('O') or category dtype, so MSSubClass, which is stored as int64, must first be converted to string.

In [96]:
from features.build_features import transform_categoricals
print(df.shape)
df = transform_categoricals(df)
print(df.shape)

(1460, 81)
(1460, 267)
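
`transform_categoricals` is likewise defined outside this notebook. The effect of the string cast on `pd.get_dummies` can be illustrated on a small made-up frame; the data below is invented for the example and this is only a sketch of the idea, not the project's implementation.

```python
import pandas as pd

demo = pd.DataFrame({
    "MSSubClass": [20, 60, 20],          # numeric codes that are really categories
    "Street": ["Pave", "Grvl", "Pave"],  # ordinary string-valued nominal
    "LotArea": [8450, 9600, 11250],      # genuinely numeric, should stay untouched
})

# With default arguments the integer column is passed through unchanged
default_cols = pd.get_dummies(demo).columns.tolist()
print(default_cols)

# Cast to string first so that MSSubClass is treated as categorical
demo["MSSubClass"] = demo["MSSubClass"].astype(str)
dummy_cols = pd.get_dummies(demo).columns.tolist()
print(dummy_cols)
```

An alternative that avoids the cast is to name the columns explicitly through the `columns` argument of `pd.get_dummies`, which encodes the listed columns regardless of their dtype.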


For records where the year in which the garage was built ('GarageYrBlt') is earlier than the year in which the house was built ('YearBuilt'), set 'GarageYrBlt' to the value of 'YearBuilt'.

In [97]:
from features.build_features import correct_garageyrblt
df = correct_garageyrblt(df)
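
The correction can be expressed as a boolean mask followed by an assignment. The sketch below is one way to do it, assuming the columns are named 'GarageYrBlt' and 'YearBuilt'; the function name `correct_garage_year_sketch` and the sample values are hypothetical, and the real `correct_garageyrblt` may differ.

```python
import numpy as np
import pandas as pd

def correct_garage_year_sketch(df):
    """Return a copy of df where a garage year earlier than the house year is reset."""
    out = df.copy()
    # Comparisons involving NaN evaluate to False, so missing garage years are left alone
    too_early = out["GarageYrBlt"] < out["YearBuilt"]
    out.loc[too_early, "GarageYrBlt"] = out.loc[too_early, "YearBuilt"]
    return out

# Made-up rows: consistent, garage too early, garage year missing
demo = pd.DataFrame({
    "YearBuilt": [2003, 1976, 1950],
    "GarageYrBlt": [2003.0, 1970.0, np.nan],
})
fixed = correct_garage_year_sketch(demo)
print(fixed)
```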

In [98]:
# replace all remaining missing values with 0
df = df.fillna(0)

Finally, we have a dataset of 1460 records with 266 features and one target.
To avoid overfitting later on, let's try to reduce the number of features.