## Setup

In the cell below we load the data in and split it into training and validating data. We also load the testing data.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Read the data
X_full = pd.read_csv('./input/train.csv', index_col='Id')
X_test_full = pd.read_csv('./input/test.csv', index_col='Id')

# Remove rows with missing Price info, separate target from predictors
X_full.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = X_full.SalePrice
X_full.drop(['SalePrice'], axis=1, inplace=True)

# Split the data into training and validating
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X_full, y, 
                                                                train_size=0.8, test_size=0.2,
                                                                random_state=0)

## Preprocessing

Below we perform feature engineering (i.e deal with categorical columns and the missing data). We have two choices for dealing with categorical data:
- Ordinal Encoding, where we assign an integer to each of the unique variables that appear in the column. For example if the variables appearing in the column are `['bad', 'neutral','good']`, then we can assign integers to each of them like this: `'bad'=0, 'neutral'=1, 'good'=2`.
- One Hot Encoding, when there is no clear logical pattern as to which variable we should assign the highest integer. Instead, One Hot Encoding creates an extra dummy column for each unique variable appearing in a column and then fills that columns with `1` or `0` depending on whether that variable was present in the original column for that datapoint.

After going through the description of each column in the dataset, we determine which categorical column is better suited for Ordinal Encoding and which is better suited for One Hot Encoding.  In contrast to the heavy work needed for the categorical columns, we only need to take care of the missing data for the numerical columns. 

In [2]:
# Select categorical columns to perform Ordinal Encoding
# Add Utilities to ord somehow ['ELO', 'NoSeWa','NoSewr', 'AllPub']
ord_categorical_cols = ['Street', 'Alley', 'ExterQual', 'ExterCond',
                       'BsmtQual', 'BsmtCond', 'BsmtExposure',
                       'BsmtFinType1', 'BsmtFinType2', 'HeatingQC',
                       'CentralAir', 'KitchenQual', 'Functional',
                       'FireplaceQu', 'GarageFinish', 'GarageQual',
                       'GarageCond', 'PavedDrive', 'PoolQC', 'Fence']

# Select categorical columns to perform OneHotEncoding
ohe_categorical_cols = [cname for cname in X_train_full.columns if 
                    X_train_full[cname].dtype == "object" and
                    cname not in ord_categorical_cols]

specified_ordering = [[None, 'Grvl', 'Pave'],
                      ['Grvl', 'Pave'],
                      ['Po', 'Fa', 'TA', 'Gd', 'Ex'],
                      ['Po', 'Fa', 'TA', 'Gd', 'Ex'],
                      [None, 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
                      [None, 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
                      [None, 'No', 'Mn', 'Av', 'Gd'],
                      [None, 'Unf', 'Lwq', 'Rec', 'BLQ', 'ALQ', 'GLQ'],
                      [None, 'Unf', 'Lwq', 'Rec', 'BLQ', 'ALQ', 'GLQ'],
                      ['Po', 'Fa', 'TA', 'Gd', 'Ex'],
                      ['N', 'Y'],
                      ['Po', 'Fa', 'TA', 'Gd', 'Ex'],
                      ['Sal','Sev','Maj2','Maj1','Mod','Min2','Min1','Typ'],
                      [None, 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
                      [None, 'Unf', 'RFn', 'Fin'],
                      [None, 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
                      [None, 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
                      ['N', 'P', 'Y'],
                      [None, 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
                      [None, 'MnWw', 'GdWo', 'MnPrv', 'GdPrv']]

#specified_ordering = [X_full[col].unique() for col in 
#                      ord_categorical_cols]


# Select numerical columns
numerical_cols = [cname for cname in X_train_full.columns if 
                X_train_full[cname].dtype in ['int64', 'float64']]

# Keep selected columns only
my_cols = ord_categorical_cols + ohe_categorical_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()
X_test = X_test_full[my_cols].copy()

X_deploy = X_full[my_cols].copy()
y_deploy = y.copy()

We can run the cell below to get a quick idea about what the data looks like.

In [3]:
X_train.head()

Unnamed: 0_level_0,Street,Alley,ExterQual,ExterCond,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinType2,HeatingQC,...,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold
Id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
619,Pave,,Ex,TA,Ex,TA,Av,GLQ,Unf,Ex,...,774,0,108,0,0,260,0,0,7,2007
871,Pave,,TA,TA,TA,TA,No,Unf,Unf,Gd,...,308,0,0,0,0,0,0,0,8,2009
93,Pave,Grvl,TA,Gd,Gd,TA,No,ALQ,Unf,Ex,...,432,0,0,44,0,0,0,0,8,2009
818,Pave,,Gd,TA,Gd,TA,No,GLQ,Unf,Ex,...,857,150,59,0,0,0,0,0,7,2008
303,Pave,,Gd,TA,Gd,TA,No,Unf,Unf,Ex,...,843,468,81,0,0,0,0,0,1,2006


### Creating the pipeline

First we define the preproessing steps which describe how we handle the missing data and how we want to transform the categorical data into numerical data. In this step, we pick the model (in our case it is `XGBRegressor()`).Lastly, we bundle everything together into a ML Pipeline which we call `model_pipeline`.

In [12]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from xgboost import XGBRegressor
from xgboost import XGBClassifier

# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='constant')

# Preprocessing for categorical data
ord_categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant')),
    ('ordinal', OrdinalEncoder(handle_unknown="use_encoded_value",
                               unknown_value=np.nan,
                               categories=specified_ordering))
])

ohe_categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat_ord', ord_categorical_transformer, ord_categorical_cols),
        ('cat_ohe', ohe_categorical_transformer, ohe_categorical_cols)
    ])


# Define model
#model = RandomForestRegressor(n_estimators=100, random_state=0)
#model = XGBClassifier(eval_metric='logloss', seed=7) 
model = XGBRegressor()

# Bundle preprocessing and modeling code in a pipeline
model_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                      ('xgbrg', model)])

In [13]:
from sklearn.model_selection import cross_val_score
# Scoring the model without any tuning
scores = cross_val_score(model_pipeline, X_train, y_train, cv=2,
                        scoring='neg_mean_absolute_error')
print("neg_MAE = %f (%f)" % (scores.mean(), scores.std()))

# Model Parameters to tune
print(model)

neg_MAE = -20111.437661 (66.154518)
XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, gamma=None,
             gpu_id=None, grow_policy=None, importance_type=None,
             interaction_constraints=None, learning_rate=None, max_bin=None,
             max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
             max_leaves=None, min_child_weight=None, missing=nan,
             monotone_constraints=None, n_estimators=100, n_jobs=None,
             num_parallel_tree=None, predictor=None, random_state=None,
             reg_alpha=None, reg_lambda=None, ...)


## Parameter tuning

In the cells below we supply `model_pipeline` to `GridSearchCV` in order to optimise over a range of parameters specified in `param_grid`. Since optimising all parameters at the same time is very computationally heavy, we optimise only a few parameters at a time. We save the best choice of parameters after each optimisation step so that at the end, we can train the model on our training data and then predict the `Price` for houses in the validation data-set. We use the Mean Absolute Error (MAE) as a measure to how well our model is doing and to determine which parameters are the most optimal (the lower the MAE, the better the model is). 

### Parameter `booster`
The first optimisation step tells us to use `dart` for this parameter, but since the change in `MAE` is very tiny,
we will continue using the default option which is `gbtree'.

In [19]:
from sklearn.model_selection import GridSearchCV

# Parameter grid to tune learning_rate and n_estimators
param_grid = {
    "xgbrg__booster": ['gbtree', 'gblinear', 'dart']
}

searchCV_1 = GridSearchCV(model_pipeline,
                          cv=2,
                          n_jobs=-2,
                          param_grid = param_grid,
                          scoring='neg_mean_absolute_error')

# Fit model 
searchCV_1.fit(X_train, y_train)

# Get the best model parameters
# sklearn reports negative MAE, hence the minus sign below
print("Best model: %f using %s" % (-searchCV_1.best_score_,
                                   searchCV_1.best_params_))

Best model: 20111.437634 using {'xgbrg__booster': 'dart'}


### Parameters `n_estimators` and `learning_rate`

In [23]:
# Parameter grid to tune learning_rate and n_estimators
param_grid = {
    "xgbrg__booster": ['gbtree'],
    "xgbrg__n_estimators": range(400,600,50),
    "xgbrg__learning_rate": [0.03+r*0.005 for r in range(6)]
}

searchCV_2 = GridSearchCV(model_pipeline,
                          cv=2,
                          n_jobs=-2,
                          param_grid = param_grid,
                          scoring='neg_mean_absolute_error')

# Fit model 
searchCV_2.fit(X_train, y_train)

# Get the best model parameters
print("Best model: %f using %s" % (searchCV_2.best_score_,
                                   searchCV_2.best_params_))

Best model: -17832.349636 using {'xgbrg__booster': 'gbtree', 'xgbrg__learning_rate': 0.05, 'xgbrg__n_estimators': 500}


### Parameters `max_depth` and `min_child_weight`

In [24]:
# Parameter grid to tune learning_rate and n_estimators
param_grid = {
    "xgbrg__booster": ['gbtree'],
    "xgbrg__n_estimators": [500],
    "xgbrg__learning_rate": [0.05],
    'xgbrg__max_depth': range(3,10,2),
    'xgbrg__min_child_weight': range(1,6,2)
}

searchCV_3 = GridSearchCV(model_pipeline,
                          cv=2,
                          n_jobs=-2,
                          param_grid = param_grid,
                          scoring='neg_mean_absolute_error')

# Fit model 
searchCV_3.fit(X_train, y_train)

# Get the best model parameters
print("Best model: %f using %s" % (searchCV_3.best_score_,
                                   searchCV_3.best_params_))

Best model: -16786.680851 using {'xgbrg__booster': 'gbtree', 'xgbrg__learning_rate': 0.05, 'xgbrg__max_depth': 3, 'xgbrg__min_child_weight': 1, 'xgbrg__n_estimators': 500}


### Parameters `gamma` and `lambda`

In [25]:
# Parameter grid to tune gamma and lambda
param_grid = {
    "xgbrg__booster": ['gbtree'],
    "xgbrg__n_estimators": [500],
    "xgbrg__learning_rate": [0.05],
    'xgbrg__max_depth': [3],
    'xgbrg__min_child_weight': [1],
    "xgbrg__gamma": [i/10.0 for i in range(0,6)],
    "xgbrg__reg_lambda": [0, 0.5, 1, 1.5, 2, 3, 4.5]
}

searchCV_4 = GridSearchCV(model_pipeline,
                          cv=2,
                          n_jobs=-2,
                          param_grid = param_grid,
                          scoring='neg_mean_absolute_error')

# Fit model 
searchCV_4.fit(X_train, y_train)

# Get the best model parameters
print("Best model: %f using %s" % (searchCV_4.best_score_,
                                   searchCV_4.best_params_))

Best model: -16771.782966 using {'xgbrg__booster': 'gbtree', 'xgbrg__gamma': 0.0, 'xgbrg__learning_rate': 0.05, 'xgbrg__max_depth': 3, 'xgbrg__min_child_weight': 1, 'xgbrg__n_estimators': 500, 'xgbrg__reg_lambda': 4.5}


## Validating the model

In [26]:
# Parameter grid to tune gamma and lambda
param_grid = {
    "xgbrg__booster": ['gbtree'],
    "xgbrg__n_estimators": [500],
    "xgbrg__learning_rate": [0.05],
    'xgbrg__max_depth': [3],
    'xgbrg__min_child_weight': [1],
    "xgbrg__gamma": [0.0],
    "xgbrg__reg_lambda": [4.5]
}

searchCV = GridSearchCV(model_pipeline,
                          n_jobs=-2,
                          param_grid = param_grid)

# Fit model 
searchCV.fit(X_train, y_train)

# Validation data, get predictions
preds = searchCV.predict(X_valid)


# Evaluate the model
score = mean_absolute_error(y_valid, preds)
print('MAE:', score)

MAE: 15803.071088398972


## Prediction

To conclude, we predict the `Price` for the houses in the testing dataset using all the training data available and write the predictions to `submission.csv` in order to submit to a Kaggle Competition.

In [28]:
# Fit model for deployment
searchCV.fit(X_deploy, y_deploy)

# Preprocessing of test data, fit model
preds_test = searchCV.predict(X_test)

In [29]:
# Save test predictions to file
output = pd.DataFrame({'Id': X_test.index,
                       'SalePrice': preds_test})
output.to_csv('submission.csv', index=False)