## Setup

In the cell below we load the training data and split it into training and validation sets. We also load the test data.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Read the data
X_full = pd.read_csv('./input/train.csv', index_col='Id')
X_test_full = pd.read_csv('./input/test.csv', index_col='Id')

# Remove rows with a missing target (SalePrice), then separate the target from the predictors
X_full.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = X_full.SalePrice
X_full.drop(['SalePrice'], axis=1, inplace=True)

# Split the data into training and validation sets
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X_full, y, 
                                                                train_size=0.8, test_size=0.2,
                                                                random_state=0)
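
As a quick sanity check of the steps above, the sketch below repeats them on a small made-up frame. The toy values and the names `toy`, `X_toy` and `y_toy` are assumptions for illustration only, and it uses the non-inplace forms of `dropna` and `drop`, which behave the same way here.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for train.csv: 11 houses, one with a missing target
toy = pd.DataFrame(
    {
        "LotArea": np.arange(11) * 1000,
        "SalePrice": [100, 110, np.nan, 130, 140, 150, 160, 170, 180, 190, 200],
    },
    index=pd.Index(range(1, 12), name="Id"),
)

# Drop the row without a target, then separate target from predictors
toy = toy.dropna(axis=0, subset=["SalePrice"])
y_toy = toy["SalePrice"]
X_toy = toy.drop(columns=["SalePrice"])

# Same 80/20 split and seed as in the cell above
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, train_size=0.8, test_size=0.2, random_state=0
)

print(X_tr.shape, X_va.shape)
```

The predictors and the target keep the same `Id` index after the split, so rows stay aligned.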


## Preprocessing

Below we select which columns to use for training the model. For now, the only transformation we apply to the categorical data is One Hot Encoding. Because this adds one dummy column per category, we discard the categorical columns with high cardinality (10 or more unique values). We do, however, keep every column that holds numerical data.

In [2]:
# Cardinality denotes the number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
categorical_cols = [cname for cname in X_train_full.columns if
                    X_train_full[cname].nunique() < 10 and 
                    X_train_full[cname].dtype == "object"]

# Select numerical columns
numerical_cols = [cname for cname in X_train_full.columns if 
                X_train_full[cname].dtype in ['int64', 'float64']]

# Keep selected columns only
my_cols = categorical_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()
X_test = X_test_full[my_cols].copy()
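
To see how the cardinality threshold decides which columns survive, here is a minimal sketch on a made-up frame. The toy values and the threshold of 3 are assumptions chosen so the effect is visible on four rows; the notebook itself uses 10.

```python
import pandas as pd

# Hypothetical stand-in for X_train_full
toy = pd.DataFrame(
    {
        "Street": ["Pave", "Pave", "Grvl", "Pave"],
        "Neighborhood": ["A", "B", "C", "D"],
        "LotArea": [8450, 9600, 11250, 9550],
    }
)

max_cardinality = 3  # assumed threshold for this toy example only

# Number of unique values per non-numeric column
text_cols = toy.select_dtypes(exclude="number").columns
cardinality = toy[text_cols].nunique().sort_values()
print(cardinality)

# Keep only the categorical columns below the threshold
low_card_cols = cardinality[cardinality < max_cardinality].index.tolist()
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
print(low_card_cols + numeric_cols)
```

Printing the cardinalities before picking a threshold makes it easier to judge how many dummy columns One Hot Encoding would add.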

We can run the cell below to get a quick idea about what the data looks like.

In [4]:
X_train.head()

| Id | MSZoning | Street | Alley | LotShape | LandContour | Utilities | LotConfig | LandSlope | Condition1 | Condition2 | ... | GarageArea | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 619 | RL | Pave | NaN | Reg | Lvl | AllPub | Inside | Gtl | Norm | Norm | ... | 774 | 0 | 108 | 0 | 0 | 260 | 0 | 0 | 7 | 2007 |
| 871 | RL | Pave | NaN | Reg | Lvl | AllPub | Inside | Gtl | PosN | Norm | ... | 308 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8 | 2009 |
| 93 | RL | Pave | Grvl | IR1 | HLS | AllPub | Inside | Gtl | Norm | Norm | ... | 432 | 0 | 0 | 44 | 0 | 0 | 0 | 0 | 8 | 2009 |
| 818 | RL | Pave | NaN | IR1 | Lvl | AllPub | CulDSac | Gtl | Norm | Norm | ... | 857 | 150 | 59 | 0 | 0 | 0 | 0 | 0 | 7 | 2008 |
| 303 | RL | Pave | NaN | IR1 | Lvl | AllPub | Corner | Gtl | Norm | Norm | ... | 843 | 468 | 81 | 0 | 0 | 0 | 0 | 0 | 1 | 2006 |


### Creating the pipeline

First we define the preprocessing steps, which describe how we handle missing data and how we transform the categorical data into numerical data. Next, we pick the model (in our case `XGBRegressor()`). Lastly, we bundle everything together into an ML pipeline which we call `model_pipeline`.

In [5]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from xgboost import XGBRegressor

# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='constant')

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

# Define model
#model = RandomForestRegressor(n_estimators=100, random_state=0)
model = XGBRegressor()

# Bundle preprocessing and modeling code in a pipeline
model_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                                 ('xgbrg', model)])
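
To make the preprocessing concrete, the sketch below applies the same kind of transformers to a tiny made-up frame. The toy data, the name `toy_preprocessor` and the unseen category `"Dirt"` are assumptions for illustration; `sparse_threshold=0` is added only so the result is always a dense array that is easy to print.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training rows: one missing number, two street types
train = pd.DataFrame({"LotArea": [8450.0, np.nan, 11250.0],
                      "Street": ["Pave", "Grvl", "Pave"]})
# Hypothetical new row with a category never seen during fit
new = pd.DataFrame({"LotArea": [9000.0], "Street": ["Dirt"]})

toy_preprocessor = ColumnTransformer(
    transformers=[
        # strategy='constant' fills missing numbers with 0 by default
        ("num", SimpleImputer(strategy="constant"), ["LotArea"]),
        ("cat", Pipeline(steps=[
            ("imputer", SimpleImputer(strategy="constant")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), ["Street"]),
    ],
    sparse_threshold=0,  # assumption for readability: always return a dense array
)

train_encoded = toy_preprocessor.fit_transform(train)
new_encoded = toy_preprocessor.transform(new)

print(train_encoded)  # columns: LotArea, Street=Grvl, Street=Pave
print(new_encoded)    # the unseen category becomes an all-zero one-hot block
```

With `handle_unknown='ignore'`, a category that appears only in the validation or test data does not raise an error; its dummy columns are simply all zero.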



## Parameter tuning

In the cell below we supply `model_pipeline` to `GridSearchCV` in order to optimise over the range of parameters specified in `param_grid`. Each key has the form `<step name>__<parameter>`, so `xgbrg__n_estimators` sets `n_estimators` on the `xgbrg` step of the pipeline. Once the search is done, `GridSearchCV` refits the pipeline with the best parameters on the full training set, and we use it to predict the `SalePrice` for the houses in the validation set. Lastly, we report the Mean Absolute Error (MAE) as a measure of how well our model is doing (the lower the MAE, the better the model).

In [7]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    "xgbrg__n_estimators": [215, 230, 250],
    "xgbrg__learning_rate": [0.01, 0.05, 0.1],
}

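# TODO: incorporate early stopping. fit_params is defined here but is not yet
# passed to searchCV.fit(), so it currently has no effect. Note that eval_set
# goes straight to the model and skips the preprocessor, so X_valid would need
# to be transformed before it could be used this way.
fit_params = {"xgbrg__eval_set": [(X_valid, y_valid)],
              "xgbrg__early_stopping_rounds": 5,
              "xgbrg__verbose": False
}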

searchCV = GridSearchCV(model_pipeline, cv=2, param_grid=param_grid)

# Preprocess the training data and fit the model for each parameter combination
searchCV.fit(X_train, y_train)


# Report the best parameters found by the search
print("Optimal parameters", searchCV.best_params_)

# Preprocess the validation data and get predictions
preds = searchCV.predict(X_valid)


# Evaluate the model
score = mean_absolute_error(y_valid, preds)
print('MAE:', score)

Optimal parameters {'xgbrg__learning_rate': 0.05, 'xgbrg__n_estimators': 250}
MAE: 17193.06560359589
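
The search above ranks candidates with the estimator's default score, while the final number is reported as MAE. The sketch below shows one way to inspect every candidate and to rank them by MAE directly, using the `scoring` argument. To stay small and self-contained it swaps in a `Ridge` model on synthetic data; the data, the step names and the `alpha` grid are assumptions and not part of this notebook.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for the housing data
X_toy, y_toy = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Stand-in pipeline: the same step__parameter naming applies to any step
toy_pipeline = Pipeline(steps=[("scale", StandardScaler()), ("ridge", Ridge())])
toy_grid = {"ridge__alpha": [0.1, 1.0, 10.0]}

# scoring='neg_mean_absolute_error' ranks candidates by MAE (negated, so higher is better)
toy_search = GridSearchCV(toy_pipeline, param_grid=toy_grid, cv=2,
                          scoring="neg_mean_absolute_error")
toy_search.fit(X_toy, y_toy)

# One row per parameter combination, best first
results = pd.DataFrame(toy_search.cv_results_)[
    ["param_ridge__alpha", "mean_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(results)
print(toy_search.best_params_)
```

Looking at the whole `cv_results_` table, rather than only `best_params_`, shows whether the best value sits at the edge of the grid, which would suggest widening the range.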


## Prediction

To conclude, we predict the `SalePrice` for the houses in the test dataset and write the predictions to `submission.csv` so that they can be submitted to a Kaggle competition.

In [8]:
# Preprocess the test data and get predictions (the model is not refitted here)
preds_test = searchCV.predict(X_test)



In [9]:
# Save test predictions to file
output = pd.DataFrame({'Id': X_test.index,
                       'SalePrice': preds_test})
output.to_csv('submission.csv', index=False)