### Predicting house prices

Dataset available at https://www.kaggle.com/c/house-prices-advanced-regression-techniques

In [79]:
train = !ls data/house_prices/

In [26]:
train

['test.csv', 'train.csv']

In [33]:
import pandas as pd
import numpy as np

In [28]:
PATH = "data/house_prices/"

In [159]:
train = pd.read_csv(f'{PATH}train.csv')
test = pd.read_csv(f'{PATH}test.csv')
dataset = pd.concat([train, test], axis=0)

In [160]:
dataset.head()

Unnamed: 0,1stFlrSF,2ndFlrSF,3SsnPorch,Alley,BedroomAbvGr,BldgType,BsmtCond,BsmtExposure,BsmtFinSF1,BsmtFinSF2,...,SaleType,ScreenPorch,Street,TotRmsAbvGrd,TotalBsmtSF,Utilities,WoodDeckSF,YearBuilt,YearRemodAdd,YrSold
0,856,854,0,,3,1Fam,TA,No,706.0,0.0,...,WD,0,Pave,8,856.0,AllPub,0,2003,2003,2008
1,1262,0,0,,3,1Fam,TA,Gd,978.0,0.0,...,WD,0,Pave,6,1262.0,AllPub,298,1976,1976,2007
2,920,866,0,,3,1Fam,TA,Mn,486.0,0.0,...,WD,0,Pave,6,920.0,AllPub,0,2001,2002,2008
3,961,756,0,,3,1Fam,Gd,No,216.0,0.0,...,WD,0,Pave,7,756.0,AllPub,0,1915,1970,2006
4,1145,1053,0,,4,1Fam,TA,Av,655.0,0.0,...,WD,0,Pave,9,1145.0,AllPub,192,2000,2000,2008


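The competition evaluates submissions on the root-mean-squared error (RMSE) between the logarithm of the predicted price and the logarithm of the observed price. We therefore take the log of the price up front, so that a plain RMSE on our targets matches the competition metric.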

In [161]:
dataset['SalePrice'] = np.log(dataset['SalePrice'])
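
A minimal sketch of why the log transform helps: an RMSE computed on log prices depends only on the ratio between predicted and actual price, so an error on a cheap house counts as much as the same relative error on an expensive one. The prices below are made up for illustration.

```python
import numpy as np

# Hypothetical actual and predicted prices, for illustration only
actual = np.array([208500.0, 181500.0, 223500.0])
predicted = np.array([200000.0, 190000.0, 220000.0])

def rmse(x, y):
    """Root-mean-squared error between two arrays."""
    return np.sqrt(((x - y) ** 2).mean())

# RMSE computed on the log prices
rmse_on_logs = rmse(np.log(predicted), np.log(actual))

# The same quantity written in terms of the ratio predicted / actual
rmse_from_ratio = np.sqrt((np.log(predicted / actual) ** 2).mean())

print(rmse_on_logs, rmse_from_ratio)
```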

# Initial data processing

The idea now is to replace the string-valued categorical columns in the dataset with numerical values.
NaN values will be encoded as well.

## Feature engineering

The fast.ai package offers convenient helpers for manipulating data. It would help to know the exact day each house was sold, but the dataset only records the month and year of sale. Too bad.

Therefore, the only processing I will do is convert the categorical values into numerical ones, both for efficiency and because random forests need numerical inputs anyway.

In [162]:
from fastai.imports import *
from fastai.structured import *

In [163]:
??train_cats

In [164]:
# train_cats modifies the DataFrame in place
train_cats(dataset)
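
As a rough sketch of what this conversion amounts to in plain pandas (an assumption about the general idea, not the fastai source), each non-numeric column becomes an ordered categorical whose values carry integer codes. The toy frame and its values are made up.

```python
import pandas as pd

# Toy frame with one string column and one numeric column (made-up values)
df = pd.DataFrame({"Street": ["Pave", "Grvl", None, "Pave"],
                   "LotArea": [8450, 9600, 11250, 9550]})

# Convert every non-numeric column to an ordered pandas categorical
for name in df.columns:
    if not pd.api.types.is_numeric_dtype(df[name]):
        df[name] = df[name].astype("category").cat.as_ordered()

# Categories are sorted alphabetically; each value maps to an integer code,
# and missing values get the code -1
categories = df["Street"].cat.categories.tolist()
codes = df["Street"].cat.codes.tolist()
print(categories, codes)
```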

In [243]:
# We can specify the order of the categories if we like
#with pd.option_context("display.max_columns", 1000): 
#    display(dataset)

We cannot pass NaN values to the random forest, so let's check what fraction of each column is missing.

In [166]:
dataset.isnull().sum().sort_values()/len(dataset)

1stFlrSF         0.000000
YearRemodAdd     0.000000
HeatingQC        0.000000
HouseStyle       0.000000
Id               0.000000
KitchenAbvGr     0.000000
LandContour      0.000000
LandSlope        0.000000
LotArea          0.000000
LotConfig        0.000000
LotShape         0.000000
LowQualFinSF     0.000000
MSSubClass       0.000000
MiscVal          0.000000
HalfBath         0.000000
MoSold           0.000000
OpenPorchSF      0.000000
OverallCond      0.000000
OverallQual      0.000000
PavedDrive       0.000000
PoolArea         0.000000
RoofMatl         0.000000
RoofStyle        0.000000
SaleCondition    0.000000
ScreenPorch      0.000000
Street           0.000000
TotRmsAbvGrd     0.000000
WoodDeckSF       0.000000
YearBuilt        0.000000
Neighborhood     0.000000
                   ...   
TotalBsmtSF      0.000343
GarageCars       0.000343
Exterior1st      0.000343
KitchenQual      0.000343
Exterior2nd      0.000343
Electrical       0.000343
Functional       0.000685
Utilities   

The NaN analysis above shows that the columns Fence, Alley, MiscFeature and PoolQC are more than 80% NaN.
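
A small sketch of how such columns could be picked out programmatically rather than by eye. The toy frame stands in for `dataset`, and the 0.8 threshold is just the figure quoted above.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `dataset` (values are made up)
toy = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, np.nan, np.nan, "Gd"],
    "LotFrontage": [65.0, np.nan, 68.0, 60.0, 84.0, 85.0],
    "LotArea": [8450, 9600, 11250, 9550, 14260, 14115],
})

# Fraction of missing values per column, largest first
na_frac = toy.isnull().mean().sort_values(ascending=False)

# Columns whose missing fraction exceeds the chosen threshold
threshold = 0.8
mostly_missing = na_frac[na_frac > threshold].index.tolist()
print(mostly_missing)
```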

In [69]:
??proc_df

In [198]:
def split_vals(a,n): return a[:n].copy(), a[n:].copy()

In [250]:
# On the first call there is no na_dict yet; proc_df returns one as `nas`
df, y, nas = proc_df(dataset, 'SalePrice')
X_train, X_test = split_vals(df, len(train))
y_train, y_test = split_vals(y, len(train))

In [251]:
print(df.shape, X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(2919, 91) (1460, 91) (1459, 91) (1460,) (1459,)
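
The `_na` columns that appear after this step suggest how missing numeric values are handled. A minimal sketch of that idea, assuming a median fill plus a boolean indicator column (an illustration in plain pandas, not the fastai implementation):

```python
import numpy as np
import pandas as pd

# Toy frame with one numeric column containing a missing value (made-up values)
toy = pd.DataFrame({"LotFrontage": [65.0, np.nan, 68.0, 60.0],
                    "LotArea": [8450, 9600, 11250, 9550]})

na_dict = {}  # column name -> fill value, reusable on new data
for name in list(toy.columns):
    col = toy[name]
    if pd.api.types.is_numeric_dtype(col) and col.isnull().any():
        toy[name + "_na"] = col.isnull()   # remember which rows were missing
        na_dict[name] = col.median()       # median skips NaN by default
        toy[name] = col.fillna(na_dict[name])

print(toy)
print(na_dict)
```

Keeping the fill values in a dictionary means the same medians can be applied to data processed later, instead of recomputing them.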


In [215]:
X_test.columns

Index(['1stFlrSF', '2ndFlrSF', '3SsnPorch', 'Alley', 'BedroomAbvGr',
       'BldgType', 'BsmtCond', 'BsmtExposure', 'BsmtFinSF1', 'BsmtFinSF2',
       'BsmtFinType1', 'BsmtFinType2', 'BsmtFullBath', 'BsmtHalfBath',
       'BsmtQual', 'BsmtUnfSF', 'CentralAir', 'Condition1', 'Condition2',
       'Electrical', 'EnclosedPorch', 'ExterCond', 'ExterQual', 'Exterior1st',
       'Exterior2nd', 'Fence', 'FireplaceQu', 'Fireplaces', 'Foundation',
       'FullBath', 'Functional', 'GarageArea', 'GarageCars', 'GarageCond',
       'GarageFinish', 'GarageQual', 'GarageType', 'GarageYrBlt', 'GrLivArea',
       'HalfBath', 'Heating', 'HeatingQC', 'HouseStyle', 'Id', 'KitchenAbvGr',
       'KitchenQual', 'LandContour', 'LandSlope', 'LotArea', 'LotConfig',
       'LotFrontage', 'LotShape', 'LowQualFinSF', 'MSSubClass', 'MSZoning',
       'MasVnrArea', 'MasVnrType', 'MiscFeature', 'MiscVal', 'MoSold',
       'Neighborhood', 'OpenPorchSF', 'OverallCond', 'OverallQual',
       'PavedDrive', 'PoolArea', 'Po

In [203]:
from sklearn.ensemble import RandomForestRegressor

In [252]:
# Let's train the Random Forest
m = RandomForestRegressor(n_jobs=-1)
m.fit(X_train, y_train)
m.score(X_train, y_train)

0.9743594297457777

In [253]:
# Let's define some helper functions for error calculation
def rmse(x,y): return math.sqrt(((x-y)**2).mean())

def print_score(m):
    res = [rmse(m.predict(X_train), y_train), rmse(m.predict(X_test), y_test),
                m.score(X_train, y_train), m.score(X_test, y_test)]
    if hasattr(m, 'oob_score_'): res.append(m.oob_score_)
    print(res)
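
Kaggle's `test.csv` has no `SalePrice` column, so errors can only be measured on rows with known prices. One common option is to hold out part of the training rows as a validation set. The sketch below does that on synthetic data; the split size, `n_estimators=40` and `oob_score=True` are illustrative choices, not settings taken from this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic features and target, standing in for the processed training rows
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

def split_vals(a, n): return a[:n].copy(), a[n:].copy()
def rmse(x, y): return np.sqrt(((x - y) ** 2).mean())

# Keep the last 50 labelled rows aside for validation
n_trn = 150
X_trn, X_val = split_vals(X, n_trn)
y_trn, y_val = split_vals(y, n_trn)

# oob_score=True makes scikit-learn compute the out-of-bag R^2 during fitting
m = RandomForestRegressor(n_estimators=40, oob_score=True, random_state=0)
m.fit(X_trn, y_trn)

rmse_trn = rmse(m.predict(X_trn), y_trn)
rmse_val = rmse(m.predict(X_val), y_val)
print(rmse_trn, rmse_val, m.oob_score_)
```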

In [254]:
m = RandomForestRegressor(n_jobs=-1)
%time m.fit(X_train, y_train)

CPU times: user 290 ms, sys: 5.29 ms, total: 295 ms
Wall time: 225 ms


RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=-1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

In [256]:
# We see from the above that the RMSE for the test set is around 0.15.
# Let's predict on the test dataset and submit to Kaggle to see how well it does.

In [258]:
# Predict log sale prices for the test set
X_test['SalePrice'] = m.predict(X_test)

In [178]:
# Inspect the columns and the predictions before exporting
print(X_test.columns)
X_test.head()

Index(['1stFlrSF', '2ndFlrSF', '3SsnPorch', 'Alley', 'BedroomAbvGr',
       'BldgType', 'BsmtCond', 'BsmtExposure', 'BsmtFinSF1', 'BsmtFinSF2',
       'BsmtFinType1', 'BsmtFinType2', 'BsmtFullBath', 'BsmtHalfBath',
       'BsmtQual', 'BsmtUnfSF', 'CentralAir', 'Condition1', 'Condition2',
       'Electrical', 'EnclosedPorch', 'ExterCond', 'ExterQual', 'Exterior1st',
       'Exterior2nd', 'Fence', 'FireplaceQu', 'Fireplaces', 'Foundation',
       'FullBath', 'Functional', 'GarageArea', 'GarageCars', 'GarageCond',
       'GarageFinish', 'GarageQual', 'GarageType', 'GarageYrBlt', 'GrLivArea',
       'HalfBath', 'Heating', 'HeatingQC', 'HouseStyle', 'Id', 'KitchenAbvGr',
       'KitchenQual', 'LandContour', 'LandSlope', 'LotArea', 'LotConfig',
       'LotFrontage', 'LotShape', 'LowQualFinSF', 'MSSubClass', 'MSZoning',
       'MasVnrArea', 'MasVnrType', 'MiscFeature', 'MiscVal', 'MoSold',
       'Neighborhood', 'OpenPorchSF', 'OverallCond', 'OverallQual',
       'PavedDrive', 'PoolArea', 'Po

Unnamed: 0,1stFlrSF,2ndFlrSF,3SsnPorch,Alley,BedroomAbvGr,BldgType,BsmtCond,BsmtExposure,BsmtFinSF1,BsmtFinSF2,...,BsmtFullBath_na,BsmtHalfBath_na,BsmtUnfSF_na,GarageArea_na,GarageCars_na,GarageYrBlt_na,LotFrontage_na,MasVnrArea_na,TotalBsmtSF_na,SalePrice
1279,1008,0,0,0,2,1,4,4,0.0,0.0,...,False,False,False,False,False,False,False,False,False,11.1378
852,768,768,0,0,3,1,4,4,0.0,0.0,...,False,False,False,False,False,False,False,False,False,12.079326
725,864,0,0,0,3,1,4,4,375.0,239.0,...,False,False,False,False,False,False,False,False,False,11.274245
1318,1084,867,0,1,4,2,4,4,0.0,0.0,...,False,False,False,False,False,False,False,False,False,11.23188
158,991,956,0,0,3,1,4,4,222.0,0.0,...,False,False,False,False,False,False,False,False,False,10.956364


In [264]:
# The model predicts log prices, so convert back with exp before exporting
X_test['SalePrice'] = np.exp(X_test['SalePrice'])

In [265]:
X_test[['Id','SalePrice']].sort_values('Id').to_csv(f'{PATH}submission.csv', index=False, header=True)
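
A quick way to sanity-check the export is to write it to an in-memory buffer first and look at the result. The ids and prices below are hypothetical.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical ids and log-price predictions
preds = pd.DataFrame({"Id": [1463, 1461, 1462],
                      "SalePrice": np.log([185000.0, 120000.0, 160000.0])})

# Undo the log transform so the file contains prices, not log prices
preds["SalePrice"] = np.exp(preds["SalePrice"])

# Write the two submission columns, sorted by Id, without the DataFrame index
buf = io.StringIO()
submission = preds[["Id", "SalePrice"]].sort_values("Id")
submission.to_csv(buf, index=False, header=True)
lines = buf.getvalue().splitlines()
print(lines[0])
```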

In [269]:
np.exp(y_train)

array([208500., 181500., 223500., ..., 266500., 142125., 147500.])