# XGBoost parameter tuning: Predicting House Prices

Nowadays XGBoost is considered to be one of the most advanced algorithms in machine learning which gives highly accurate predictions when tuned properly. Hence, in this project I will concentrate on implementing and parameter tuning this up-to-date algorithm to predict the house prices for [Kaggle competition](https://www.kaggle.com/c/house-prices-advanced-regression-techniques).

###### Importing the libraries and preparing the dataset.

In [312]:
import pandas as pd
import xgboost as xgb

from sklearn import cross_validation as cv
from sklearn.metrics import mean_absolute_error

In [313]:
X_train = pd.read_csv('train.csv', delimiter=',', header=0)
X_test = pd.read_csv('test.csv', delimiter=',', header=0)

test_Id = X_test.Id
y_train = X_train.loc[:,'SalePrice']
X_train.drop(['Id','SalePrice'], axis=1, inplace=True)
X_test.drop('Id', axis=1, inplace=True)

# I will unite train and test for data cleansing purposes
X_full = pd.concat([X_train, X_test]) 

X_full.reset_index(inplace=True, drop=True)

### Data cleaning and feature engineering

Let's have a look at the number of non-null values in each column.

In [314]:
X_full.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2919 entries, 0 to 2918
Data columns (total 79 columns):
MSSubClass       2919 non-null int64
MSZoning         2915 non-null object
LotFrontage      2433 non-null float64
LotArea          2919 non-null int64
Street           2919 non-null object
Alley            198 non-null object
LotShape         2919 non-null object
LandContour      2919 non-null object
Utilities        2917 non-null object
LotConfig        2919 non-null object
LandSlope        2919 non-null object
Neighborhood     2919 non-null object
Condition1       2919 non-null object
Condition2       2919 non-null object
BldgType         2919 non-null object
HouseStyle       2919 non-null object
OverallQual      2919 non-null int64
OverallCond      2919 non-null int64
YearBuilt        2919 non-null int64
YearRemodAdd     2919 non-null int64
RoofStyle        2919 non-null object
RoofMatl         2919 non-null object
Exterior1st      2918 non-null object
Exterior2nd      2918 non

At first glance the dataset seems to have a lot of missing values. However, according to the data description some categorical variables use NA as an actual category. For instance, NaN in *Alley* means "No Alley". 

Furthermore, some groups of variables with the same NA meaning, e.g. all *Bsmt* columns with "No Basement" category, have a different number of non-null observations. Hence, there exist some true missing values, which should be handled differently.

In [315]:
l = []

X_full.TotalBsmtSF.fillna(0, inplace=True)

for i in X_full.columns:
    if i.startswith('Bsmt'):
        print(i)
        print(X_full[(X_full[i].isna()) & (X_full.TotalBsmtSF==0)].index)

for inx in X_full[(X_full['BsmtQual'].isna()) & (X_full.TotalBsmtSF==0)].index:
    l.append(inx)

for row in l:
    for column in ['BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2']:
        X_full.loc[row, column] = 'No Basement'
        
X_full.BsmtFullBath.fillna(0, inplace=True)
X_full.BsmtHalfBath.fillna(0, inplace=True)
X_full.BsmtFinSF1.fillna(0, inplace=True)
X_full.BsmtFinSF2.fillna(0, inplace=True)

BsmtQual
Int64Index([  17,   39,   90,  102,  156,  182,  259,  342,  362,  371,  392,
             520,  532,  533,  553,  646,  705,  736,  749,  778,  868,  894,
             897,  984, 1000, 1011, 1035, 1045, 1048, 1049, 1090, 1179, 1216,
            1218, 1232, 1321, 1412, 1585, 1593, 1729, 1778, 1814, 1847, 1848,
            1856, 1857, 1858, 1860, 1915, 2050, 2066, 2068, 2120, 2122, 2188,
            2189, 2190, 2193, 2216, 2224, 2387, 2435, 2452, 2453, 2490, 2498,
            2547, 2552, 2564, 2578, 2599, 2702, 2763, 2766, 2803, 2804, 2824,
            2891, 2904],
           dtype='int64')
BsmtCond
Int64Index([  17,   39,   90,  102,  156,  182,  259,  342,  362,  371,  392,
             520,  532,  533,  553,  646,  705,  736,  749,  778,  868,  894,
             897,  984, 1000, 1011, 1035, 1045, 1048, 1049, 1090, 1179, 1216,
            1218, 1232, 1321, 1412, 1585, 1593, 1729, 1778, 1814, 1847, 1848,
            1856, 1857, 1858, 1860, 1915, 2050, 2066, 2068, 2120, 2122, 2

In [316]:
l_garage = []

for i in X_full.columns:
    if i.startswith('Garage'):
        print(i)
        print(X_full[(X_full[i].isna()) & (X_full.GarageArea==0)].index)

for inx in X_full[(X_full['GarageType'].isna()) & (X_full.TotalBsmtSF==0)].index:
    l_garage.append(inx)

for row in l_garage:
    for column in ['GarageType', 'GarageFinish', 'GarageQual', 'GarageCond']:
        X_full.loc[row, column] = 'No Garage'

GarageType
Int64Index([  39,   48,   78,   88,   89,   99,  108,  125,  127,  140,
            ...
            2862, 2870, 2888, 2891, 2892, 2893, 2909, 2913, 2914, 2917],
           dtype='int64', length=157)
GarageYrBlt
Int64Index([  39,   48,   78,   88,   89,   99,  108,  125,  127,  140,
            ...
            2862, 2870, 2888, 2891, 2892, 2893, 2909, 2913, 2914, 2917],
           dtype='int64', length=157)
GarageFinish
Int64Index([  39,   48,   78,   88,   89,   99,  108,  125,  127,  140,
            ...
            2862, 2870, 2888, 2891, 2892, 2893, 2909, 2913, 2914, 2917],
           dtype='int64', length=157)
GarageCars
Int64Index([], dtype='int64')
GarageArea
Int64Index([], dtype='int64')
GarageQual
Int64Index([  39,   48,   78,   88,   89,   99,  108,  125,  127,  140,
            ...
            2862, 2870, 2888, 2891, 2892, 2893, 2909, 2913, 2914, 2917],
           dtype='int64', length=157)
GarageCond
Int64Index([  39,   48,   78,   88,   89,   99,  108,  125,  127

In [317]:
X_full.Alley.fillna('No Alley', inplace=True)
X_full.FireplaceQu.fillna('No Fireplace', inplace=True)
X_full.PoolQC.fillna('No Pool', inplace=True)
X_full.Fence.fillna('No Fence', inplace=True)
X_full.MiscFeature.fillna('None', inplace=True)

I will fill in the remaining NaNs of categorical variables with the most frequent values.

In [318]:
categorical_v = ['MSZoning','Utilities','Exterior2nd','MasVnrType', 'BsmtQual', 'BsmtCond',
                 'BsmtExposure', 'BsmtFinType2','Electrical','KitchenQual','Functional',
                 'GarageType','GarageFinish','GarageQual', 'GarageCond','SaleType']

for i in categorical_v:
    X_full[i].fillna(X_full[i].value_counts().index[0], inplace=True)

Next, the houses in one neighbourhood most likely will have the same lot frontage.

In [319]:
X_full["LotFrontage"] = X_full.groupby("Neighborhood")["LotFrontage"].transform(
    lambda x: x.fillna(x.median()))

Let's transform the ordinal variables.

In [320]:
quality_order = ['Po', 'Fa', 'TA', 'Gd','Ex']
quality_order_2 = ['NoBsmt', 'Po', 'Fa', 'TA', 'Gd', 'Ex']
quality_order_3 = ['NoBsmt', 'No', 'Mn', 'Av', 'Gd']
finished_area_quality = ['NoBsmt', 'Unf', 'LwQ', 'Rec', 'BLQ', 'ALQ', 'GLQ']
quality_order_4 = ['NoFireplace', 'Po', 'Fa', 'TA', 'Gd', 'Ex']

X_full.ExterQual = X_full.ExterQual.astype('category', ordered=True,
                                          categories=quality_order).cat.codes
X_full.BsmtQual = X_full.BsmtQual.astype('category', ordered=True,
                                        categories = quality_order_2).cat.codes
X_full.BsmtExposure = X_full.BsmtExposure.astype('category', ordered=True,
                                        categories = quality_order_3).cat.codes
X_full.BsmtFinType1 = X_full.BsmtFinType1.astype('category', ordered=True,
                                                categories=finished_area_quality).cat.codes
X_full.BsmtFinType2 = X_full.BsmtFinType2.astype('category', ordered=True,
                                                categories=finished_area_quality).cat.codes
X_full.HeatingQC = X_full.HeatingQC.astype('category', ordered=True,
                                          categories=quality_order).cat.codes
X_full.KitchenQual = X_full.KitchenQual.astype('category', ordered=True,
                                          categories=quality_order).cat.codes
X_full.FireplaceQu = X_full.FireplaceQu.astype('category', ordered=True,
                                              categories=quality_order_4).cat.codes


specifying 'categories' or 'ordered' in .astype() is deprecated; pass a CategoricalDtype instead


specifying 'categories' or 'ordered' in .astype() is deprecated; pass a CategoricalDtype instead


specifying 'categories' or 'ordered' in .astype() is deprecated; pass a CategoricalDtype instead


specifying 'categories' or 'ordered' in .astype() is deprecated; pass a CategoricalDtype instead


specifying 'categories' or 'ordered' in .astype() is deprecated; pass a CategoricalDtype instead


specifying 'categories' or 'ordered' in .astype() is deprecated; pass a CategoricalDtype instead


specifying 'categories' or 'ordered' in .astype() is deprecated; pass a CategoricalDtype instead


specifying 'categories' or 'ordered' in .astype() is deprecated; pass a CategoricalDtype instead



In addition, I want to add some new columns to our dataset.

In [321]:
X_full['Bsmt'] = X_full.BsmtQual.map({'Ex': 1, 'Gd':1, 'TA':1,
                                     'Fa':1, 'Po':1, 'NoBsmt':0})
X_full['Fireplace'] = X_full.FireplaceQu.map({0: 0, 1: 1, 2: 1, 3:1, 4:1, 5:1})
X_full['TotalLivingAreaSF'] = X_full.GrLivArea + X_full.TotalBsmtSF

Furthermore, let's transform the columns with string values into dummy variables.

In [322]:
X_full = pd.get_dummies(X_full)

Finally, let's fill in the remaining missing values of integer or float variables.

In [323]:
X_full.MasVnrArea.fillna(0, inplace=True)
X_full.GarageArea.fillna(X_full.GarageArea.mean(), inplace=True)
X_full.GarageCars.fillna(round(X_full.GarageCars.median()), inplace=True)
X_full.GarageYrBlt.fillna(round(X_full.GarageYrBlt.median()), inplace=True)

### XGBoost implementation and parameter tuning

Now we are ready to build the XGBoost regression model. For hyperparameter tuning I will use Kevin Lemagnen's code from [Cambridge Spark website](https://cambridgespark.com/content/tutorials/hyperparameter-tuning-in-xgboost/index.html).

The code itself represents a comprehensive search over specific values of xgb parameters. Although it is not as fast as sklearn's GridSearchCV, it is actually easier to control parameters by using this code since it helps to see every change in Mean Absolute Error(MAE) caused by change in parameters.

First, we need to split our dataset back into train and test.

In [324]:
X_train = X_full.iloc[:1460,:]
X_test = X_full.iloc[1460:, :]

In [325]:
dtrain = xgb.DMatrix(X_train, label=y_train)

In [354]:
params = {
    # Parameters that we are going to tune.
    'max_depth':6,
    'min_child_weight': 1,
    'eta':.3,
    'subsample': 1,
    'colsample_bytree': 1,
    'gamma':0,
    # Other parameters
    'objective':'reg:linear',
}

In [327]:
cv_results = xgb.cv(
    params,
    dtrain,
    num_boost_round=999,
    seed=42,
    nfold=5,
    metrics={'mae'},
    early_stopping_rounds=10
)

cv_results

Unnamed: 0,test-mae-mean,test-mae-std,train-mae-mean,train-mae-std
0,127614.420312,2200.085644,127476.235937,436.253819
1,90353.85,1775.839586,89979.482812,265.011569
2,64484.528906,1236.494356,63645.503125,184.061838
3,46910.822656,824.028248,45182.195312,167.263613
4,35345.33125,942.254642,32287.435547,139.321398
5,28025.551172,1008.925931,23435.428906,111.165799
6,23960.754297,1303.009211,17552.895312,125.045333
7,21584.228906,1470.466189,13739.057812,191.382594
8,20224.731641,1610.430816,11359.008789,260.29965
9,19468.752344,1619.124319,9861.133594,324.012067


According to *cv_results*, the smallest MAE we are able to get with default parameters is equal to 17881.680078.

Now, let's start parameter tuning with *max_depth* and *min_child_weight*. I found that the optimal values were in the range from 5 to 7 for *max_depth* and from 3 to 5 for *min_child_weight*.

In [340]:
gridsearch_params = [
    (max_depth, min_child_weight)
    for max_depth in range(5,8)
    for min_child_weight in range(3,6)
]

min_mae = float("Inf")
best_params = None
for max_depth, min_child_weight in gridsearch_params:
    print("CV with max_depth={}, min_child_weight={}".format(
                             max_depth,
                             min_child_weight))

    # Update our parameters
    params['max_depth'] = max_depth
    params['min_child_weight'] = min_child_weight

    # Run CV
    cv_results = xgb.cv(
        params,
        dtrain,
        num_boost_round=999,
        seed=42,
        nfold=5,
        metrics={'mae'}
    )

    # Update best MAE
    mean_mae = cv_results['test-mae-mean'].min()
    boost_rounds = cv_results['test-mae-mean'].idxmin()
    print("\tMAE {} for {} rounds".format(mean_mae, boost_rounds))
    if mean_mae < min_mae:
        min_mae = mean_mae
        best_params = (max_depth,min_child_weight)

print("Best params: {}, {}, MAE: {}".format(best_params[0], best_params[1], min_mae))

CV with max_depth=6, min_child_weight=4
	MAE 16825.166211 for 44 rounds
CV with max_depth=6, min_child_weight=12
	MAE 17331.1449218 for 53 rounds
Best params: 6, 4, MAE: 16825.166211


In accordance with the output above, we have the lowest MAE equal to 16825.166211 with *max_depth*=6 and *min_child_weight*=4

In [350]:
params['max_depth'] = 6
params['min_child_weight'] = 4

Next parameters to tune are *subsample* and *colsample_bytree*.

In [342]:
gridsearch_params = [
    (subsample, colsample)
    for subsample in [i/10. for i in range(5,11)]
    for colsample in [i/10. for i in range(5,11)]
]

min_mae = float("Inf")
best_params = None

# We start by the largest values and go down to the smallest
for subsample, colsample in reversed(gridsearch_params):
    print("CV with subsample={}, colsample={}".format(
                             subsample,
                             colsample))

    # We update our parameters
    params['subsample'] = subsample
    params['colsample_bytree'] = colsample

    # Run CV
    cv_results = xgb.cv(
        params,
        dtrain,
        num_boost_round=999,
        seed=42,
        nfold=5,
        metrics={'mae'},
        early_stopping_rounds=10
    )

    # Update best score
    mean_mae = cv_results['test-mae-mean'].min()
    boost_rounds = cv_results['test-mae-mean'].idxmin()
    print("\tMAE {} for {} rounds".format(mean_mae, boost_rounds))
    if mean_mae < min_mae:
        min_mae = mean_mae
        best_params = (subsample,colsample)

print("Best params: {}, {}, MAE: {}".format(best_params[0], best_params[1], min_mae))

CV with subsample=1.0, colsample=1.0
	MAE 16825.166211 for 44 rounds
CV with subsample=1.0, colsample=0.9
	MAE 17168.7484374 for 42 rounds
CV with subsample=1.0, colsample=0.8
	MAE 17314.7248048 for 34 rounds
CV with subsample=1.0, colsample=0.7
	MAE 17411.571875 for 27 rounds
CV with subsample=1.0, colsample=0.6
	MAE 17640.1347656 for 34 rounds
CV with subsample=1.0, colsample=0.5
	MAE 17170.4107422 for 29 rounds
CV with subsample=0.9, colsample=1.0
	MAE 17002.554492 for 44 rounds
CV with subsample=0.9, colsample=0.9
	MAE 17053.9716796 for 50 rounds
CV with subsample=0.9, colsample=0.8
	MAE 17099.0121096 for 53 rounds
CV with subsample=0.9, colsample=0.7
	MAE 17406.0292972 for 35 rounds
CV with subsample=0.9, colsample=0.6
	MAE 17638.7781246 for 40 rounds
CV with subsample=0.9, colsample=0.5
	MAE 17362.7523438 for 35 rounds
CV with subsample=0.8, colsample=1.0
	MAE 17352.835547 for 45 rounds
CV with subsample=0.8, colsample=0.9
	MAE 16970.4091796 for 31 rounds
CV with subsample=0.8, c

According to the results above, the default values of *subsample* and *colsample* provide the smallest MAE value.

In [343]:
params['subsample'] = 1
params['colsample_bytree'] = 1

In [334]:
%time

min_mae = float("Inf")
best_params = None

for eta in [.3, .2, .1, .05, .01, .005]:
    print("CV with eta={}".format(eta))
    
    params['eta'] = eta

    # Run and time CV
    %time cv_results = xgb.cv( params, dtrain, num_boost_round=999, seed=42, nfold=5, metrics=['mae'], early_stopping_rounds=10)


    # Update best score
    mean_mae = cv_results['test-mae-mean'].min()
    boost_rounds = cv_results['test-mae-mean'].idxmin()
    print("\tMAE {} for {} rounds\n".format(mean_mae, boost_rounds))
    if mean_mae < min_mae:
        min_mae = mean_mae
        best_params = eta

print("Best params: {}, MAE: {}".format(best_params, min_mae))

CPU times: user 4 µs, sys: 1e+03 ns, total: 5 µs
Wall time: 9.06 µs
CV with eta=0.3
CPU times: user 5.59 s, sys: 43.4 ms, total: 5.63 s
Wall time: 5.65 s
	MAE 16825.166211 for 44 rounds

CV with eta=0.2
CPU times: user 7.89 s, sys: 60 ms, total: 7.95 s
Wall time: 8.03 s
	MAE 16678.1511718 for 66 rounds

CV with eta=0.1
CPU times: user 12.1 s, sys: 67.9 ms, total: 12.2 s
Wall time: 12.2 s
	MAE 16322.5140626 for 109 rounds

CV with eta=0.05
CPU times: user 25.3 s, sys: 125 ms, total: 25.4 s
Wall time: 25.6 s
	MAE 15882.1800782 for 239 rounds

CV with eta=0.01
CPU times: user 1min 40s, sys: 445 ms, total: 1min 40s
Wall time: 1min 41s
	MAE 16003.8255858 for 998 rounds

CV with eta=0.005
CPU times: user 1min 41s, sys: 503 ms, total: 1min 42s
Wall time: 1min 43s
	MAE 16320.513672 for 998 rounds

Best params: 0.05, MAE: 15882.1800782


As it was expected, by decreasing eta value we were able to decrease MAE to 15882.1800782.

In [344]:
params['eta'] = 0.05

Lastly, with the final set of parameters we are able to make predictions and then submit the results on Kaggle to see our score.

In [336]:
model = xgb.train(params, dtrain, num_boost_round=999)
dtest = xgb.DMatrix(X_test)
predictions = model.predict(dtest)

In [337]:
submission = pd.DataFrame(predictions, index=test_Id, columns= ['SalePrice'])
submission.to_csv('submission.csv')

The score is 0.13031, and with this result we are able to get into top 30% of leaderboard. Undoubtedly, with more detailed data cleaning process and feature engineering, it would have been possible to receive the higher score. However, in this project I only wanted to illustrate the importance of XGBoost parameter tuning.