## Setup

In the cell below we load the data in and split it into training and validating data. We also load the testing data.

Furthermore, we inspect the data file and specify which columns correspond to numerical data and which ones correspond to categorical data. 

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Read the data
X_full = pd.read_csv('./input/train.csv', index_col='Id')
X_test_full = pd.read_csv('./input/test.csv', index_col='Id')

# Remove rows with missing Price info, separate target from predictors
X_full.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = X_full.SalePrice
X_full.drop(['SalePrice'], axis=1, inplace=True)

## Feature Engineering

We explore the dataset in order to determine the categorical variables and then determine which categorical columns are suitable for Ordinal Encoding and which ones are suitable for One Hot Encoding. Then, we encode the ordinal categorical variables according to an ordering that we judge to be the most appropriate.\
\
In this step, we also determine how to use the `SimpleImputer` in order to handle the missing data. After trying different ways of dealing with missing data, the following works best:
- Replace any missing numerical value with the `mean` value in a particular column.
- Replace any missing ordinal value with the `constant` value in a particular column.


In [2]:
from sklearn.preprocessing import OrdinalEncoder
from sklearn.impute import SimpleImputer

# Select categorical columns to perform Ordinal Encoding
ord_categorical_cols = ['Street', 'Alley', 'ExterQual', 'ExterCond',
                       'BsmtQual', 'BsmtCond', 'BsmtExposure',
                       'BsmtFinType1', 'BsmtFinType2', 'HeatingQC',
                       'CentralAir', 'KitchenQual', 'Functional',
                       'FireplaceQu', 'GarageFinish', 'GarageQual',
                       'GarageCond', 'PavedDrive', 'PoolQC', 'Fence']

# Select categorical columns to perform OneHotEncoding
ohe_categorical_cols = [cname for cname in X_full.columns if 
                    X_full[cname].dtype == "object" and
                    cname not in ord_categorical_cols]

# Use SimpleImputer here just like you do encoding
# Tensors have no columns by name apparently so one cannot use
# ColumnTransformer in a Pipeline and work differently on different
# columns

# Select numerical columns
numerical_cols = [cname for cname in X_full.columns if 
                X_full[cname].dtype in ['int64', 'float64']]

# Imputing missing vals in numerical cols for training data
num_imp = SimpleImputer(missing_values=np.nan, strategy='mean')
num_imp.fit(X_full[numerical_cols])
X_full[numerical_cols] = num_imp.transform(X_full[numerical_cols])

# Imputing missing vals in numerical cols for testing data
num_imp_test = SimpleImputer(missing_values=np.nan, strategy='mean')
num_imp_test.fit(X_test_full[numerical_cols])
X_test_full[numerical_cols] = num_imp_test.transform(X_test_full[numerical_cols])

# Ordinal encoding
specified_ordering = [[None, 'Grvl', 'Pave'],
                      ['Grvl', 'Pave'],
                      ['Po', 'Fa', 'TA', 'Gd', 'Ex'],
                      ['Po', 'Fa', 'TA', 'Gd', 'Ex'],
                      [None, 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
                      [None, 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
                      [None, 'No', 'Mn', 'Av', 'Gd'],
                      [None, 'Unf', 'Lwq', 'Rec', 'BLQ', 'ALQ', 'GLQ'],
                      [None, 'Unf', 'Lwq', 'Rec', 'BLQ', 'ALQ', 'GLQ'],
                      ['Po', 'Fa', 'TA', 'Gd', 'Ex'],
                      ['N', 'Y'],
                      ['Po', 'Fa', 'TA', 'Gd', 'Ex'],
                      ['Sal','Sev','Maj2','Maj1','Mod','Min2','Min1','Typ'],
                      [None, 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
                      [None, 'Unf', 'RFn', 'Fin'],
                      [None, 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
                      [None, 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
                      ['N', 'P', 'Y'],
                      [None, 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
                      [None, 'MnWw', 'GdWo', 'MnPrv', 'GdPrv']]

# Ordinal encoding of the training data
ord_encod = OrdinalEncoder(handle_unknown="use_encoded_value",
                              unknown_value=np.nan,
                              categories=specified_ordering)
ord_encod.fit(X_full[ord_categorical_cols])
X_full[ord_categorical_cols]= ord_encod.transform(X_full[ord_categorical_cols])

# Ordinal encoding of the testing data for the competition
ord_encod_test = OrdinalEncoder(handle_unknown="use_encoded_value",
                              unknown_value=np.nan,
                              categories=specified_ordering)
ord_encod_test.fit(X_test_full[ord_categorical_cols])
X_test_full[ord_categorical_cols]= ord_encod_test.transform(X_test_full[ord_categorical_cols])

# Imputing missing vals in ordinal cols for training data
ord_imp = SimpleImputer(missing_values=np.nan, strategy='constant')
ord_imp.fit(X_full[ord_categorical_cols])
X_full[ord_categorical_cols] = ord_imp.transform(X_full[ord_categorical_cols])

# Imputing missing vals in ordinal cols for testing data
ord_imp_test = SimpleImputer(missing_values=np.nan, strategy='constant')
ord_imp_test.fit(X_test_full[ord_categorical_cols])
X_test_full[ord_categorical_cols] = ord_imp_test.transform(X_test_full[ord_categorical_cols])

# For now we drop ohe columns ...
X_full.drop(ohe_categorical_cols, axis=1, inplace=True)

# Get the number of features
input_shape = len(X_full.columns)

In [3]:
# Split the data into training and validating
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X_full, y, 
                                                                train_size=0.8, test_size=0.2,
                                                                random_state=0)

## Preprocessing

Transforming the data to `tensor` type so that we can use it in the `PyTorch` model.

In [4]:
# Keep selected columns only
my_cols = ord_categorical_cols + numerical_cols

X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()
X_test = X_test_full[my_cols].copy()

X_deploy = X_full[my_cols].copy()
y_deploy = y.copy()


from torch import tensor

X_train = tensor(X_train.to_numpy()).float()
X_valid = tensor(X_valid.to_numpy()).float()
y_train = tensor(y_train.values.reshape(-1,1)).float()
y_valid = tensor(y_valid.values.reshape(-1,1)).float()

X_test = tensor(X_test.to_numpy()).float()

X_deploy = tensor(X_deploy.to_numpy()).float()
y_deploy = tensor(y_deploy.values.reshape(-1,1)).float()

### Creating the pipeline

In this step, we create the Regressor Neural Network model which we wrap in a `NeuralNetRegressor` object. Lastly, we bundle everything together into a ML Pipeline which we call `model_pipeline`.

In [5]:
## Model Definition

from torch import nn
import torch.nn.functional as F

class Regressor(nn.Module):
    def __init__(self):
        super(Regressor, self).__init__()

        self.first_layer = nn.Linear(input_shape, 26)
        self.second_layer = nn.Linear(26,52)
        self.final_layer = nn.Linear(52,1)

    def forward(self, x_batch):
        # For some weird reason .float() is needed here...
        # Although the data has already been casted to float...
        X = self.first_layer(x_batch.float())
        X = F.relu(X)

        X = self.second_layer(X)
        X = F.relu(X)

        return self.final_layer(X)

## Declare Model

from skorch import NeuralNetRegressor
from torch import optim

skorch_regressor = NeuralNetRegressor(module=Regressor, optimizer=optim.Adam, max_epochs=500, verbose=0)

skorch_regressor

<class 'skorch.regressor.NeuralNetRegressor'>[uninitialized](
  module=<class '__main__.Regressor'>,
)

## Tuning the model

We are ready to tune the hyper-parameters of the model now in order to obtain the best possible version of our Deep Learning model. To do this, we make use of `GridSearch` by providing it with a list of possible parameters.

In [6]:
from sklearn.preprocessing import RobustScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# Bundle preprocessing and modeling code in a pipeline
model_pipeline = Pipeline(steps=[#('preprocessor', preprocessor),
                                 ('normalize', RobustScaler()),
                                 ('model', skorch_regressor)])
#model_pipeline.fit(X_train,y_train)

params = {
    "model__lr": [0.02, 0.03],
    "model__max_epochs": [500, 600],
    "model__optimizer__weight_decay": [0.2, 0.25]

}

grid = GridSearchCV(model_pipeline, params)

grid.fit(X_train, y_train)

GridSearchCV(estimator=Pipeline(steps=[('normalize', RobustScaler()),
                                       ('model',
                                        <class 'skorch.regressor.NeuralNetRegressor'>[uninitialized](
  module=<class '__main__.Regressor'>,
))]),
             param_grid={'model__lr': [0.02, 0.03],
                         'model__max_epochs': [500, 600],
                         'model__optimizer__weight_decay': [0.2, 0.25]})

## Choosing the best model

Below we print the parameters that give us the model with the best predictive ability as well as print some metrics that evaluate that ability.

In [7]:
### Evaluate Model
from sklearn.metrics import mean_absolute_error

print("Best Score  : {}".format(grid.best_score_))
print("Best Params : {}".format(grid.best_params_))

print("\nTrain MAE : {}".format(mean_absolute_error(y_train, grid.predict(X_train).reshape(-1))))
print("Test  MAE : {}".format(mean_absolute_error(y_valid, grid.predict(X_valid).reshape(-1))))

print("\nTrain R^2 : {}".format(grid.score(X_train, y_train)))
print("Test  R^2 : {}".format(grid.score(X_valid, y_valid)))

Best Score  : 0.7846246222141062
Best Params : {'model__lr': 0.02, 'model__max_epochs': 500, 'model__optimizer__weight_decay': 0.2}

Train MAE : 14711.517578125
Test  MAE : 22862.02734375

Train R^2 : 0.8995725348836303
Test  R^2 : 0.7018347580194947
