## Gradient Boosted Decision Trees

Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion like other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function.[1]

Boosting trains predictors sequentially, so that each new predictor learns from the mistakes of the ones before it. In resampling-based variants, the observations have an unequal probability of appearing in subsequent models, and those with the highest error appear most often; gradient boosting instead fits each new predictor to the negative gradient of the loss (the residuals, for squared error). The predictors can be chosen from a range of models, such as decision trees, regressors, or classifiers. Because each new predictor corrects the errors of its predecessors, fewer iterations are needed to get close to the actual values. The stopping criterion must be chosen carefully, however, or the model may overfit the training data. Gradient boosting is one example of a boosting algorithm.
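To make the "learning from previous mistakes" idea concrete, here is a minimal sketch of gradient boosting for squared-error loss, written in plain Python. It is an illustration under stated assumptions, not how XGBoost is implemented: the weak learner is a hypothetical one-split "stump" on a single feature, the toy data is invented, and `n_rounds` and `learning_rate` are arbitrary choices.

```python
# Minimal gradient boosting sketch for squared-error loss (illustrative only).
# For this loss, the negative gradient is simply the residual y - prediction.

def fit_stump(xs, residuals):
    """Find the single split on x that best fits the residuals (least squares)."""
    best = None
    candidates = sorted(set(xs))
    for i in range(len(candidates) - 1):
        threshold = (candidates[i] + candidates[i + 1]) / 2
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_value, right_value)

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value

def mse(ys, predictions):
    return sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)

def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.3):
    base = sum(ys) / len(ys)             # stage 0: predict the mean
    predictions = [base] * len(ys)
    stumps = []
    history = [mse(ys, predictions)]     # training error before any boosting round
    for _ in range(n_rounds):
        # Each new learner is fitted to the mistakes of the current ensemble.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for x, p in zip(xs, predictions)]
        history.append(mse(ys, predictions))
    return base, stumps, history

# Invented toy data: y = x^2 on ten points.
xs = [float(x) for x in range(1, 11)]
ys = [x ** 2 for x in xs]
base, stumps, history = gradient_boost(xs, ys)
print("training MSE before boosting:", history[0])
print("training MSE after boosting: ", history[-1])
```

The training error can only go down from round to round here, which is also why the number of rounds has to be limited: past some point, the extra stumps fit noise rather than signal.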


In this notebook we use the home prices dataset from Iowa, USA.

(Inspired by a tutorial published at www.kaggle.com)

[1] https://en.wikipedia.org/wiki/Gradient_boosting   
[2] https://medium.com/mlreview/gradient-boosting-from-scratch-1e317ae4587d

In [1]:
%reset

Once deleted, variables cannot be recovered. Proceed (y/[n])? y


In [3]:
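import numpy as np
import pandas as pd

# SimpleImputer replaces Imputer, which was removed from sklearn.preprocessing in scikit-learn 0.22
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor  # used by get_mae below
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_val_score

from xgboost import XGBRegressor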

ImportError: No module named 'xgboost'

### 1. Loading dataset

In [2]:
df_train = pd.read_csv('data/iowa_train.csv')
df_test = pd.read_csv('data/iowa_test.csv')

# Drop houses where the target value for prediction is not defined (!)
df_train.dropna(axis=0, subset=['SalePrice'], inplace=True)

print("Train size={} Test size={}".format(df_train.shape[0], df_test.shape[0]))

NameError: name 'pd' is not defined

In [2]:
# Drop columns that contain missing values in the training data.
# Also drop Id (a row identifier, not a predictor) and SalePrice (the target, stored separately as y_train).
cols_with_missing = [col for col in df_train.columns
                     if df_train[col].isnull().any()]

# Loading predictors and target (X/y) 
X_train = df_train.drop(['Id', 'SalePrice'] + cols_with_missing, axis=1)
y_train = df_train.SalePrice
X_test  = df_test.drop(['Id'] + cols_with_missing, axis=1)
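As a small illustration of the step above, the following sketch applies the same logic to two tiny, invented DataFrames. The column names are borrowed from the housing data, but the values and the `toy_` variables are hypothetical. It also shows one consequence worth keeping in mind: the list of columns is computed from the training data only, so a column that has missing values only in the test data is not dropped.

```python
import numpy as np
import pandas as pd

# Invented toy data; only the column-dropping logic mirrors the notebook.
toy_train = pd.DataFrame({
    "Id": [1, 2, 3],
    "LotArea": [8450, 9600, 11250],
    "LotFrontage": [65.0, np.nan, 68.0],   # has a missing value in train
    "GarageCars": [2, 2, 3],
    "SalePrice": [208500, 181500, 223500],
})
toy_test = pd.DataFrame({
    "Id": [4, 5],
    "LotArea": [9550, 14260],
    "LotFrontage": [60.0, 84.0],
    "GarageCars": [np.nan, 3],             # missing only in test
})

# Columns flagged using the training data alone.
toy_cols_with_missing = [col for col in toy_train.columns
                         if toy_train[col].isnull().any()]

X_toy_train = toy_train.drop(["Id", "SalePrice"] + toy_cols_with_missing, axis=1)
y_toy_train = toy_train.SalePrice
X_toy_test = toy_test.drop(["Id"] + toy_cols_with_missing, axis=1)

print(toy_cols_with_missing)
print(X_toy_test.isnull().sum())
```

If the test data may contain gaps of its own, checking `X_test.isnull().sum()` after this step is a cheap way to find out.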

In [3]:
# Keep only numeric columns and categorical (object) columns with low cardinality.
# "Cardinality" means the number of unique values in a column.
# A cutoff of fewer than 10 unique values is our only criterion for keeping a categorical
# column. This is convenient, though a little arbitrary.
low_cardinality_cols = [cname for cname in X_train.columns if 
                                X_train[cname].nunique() < 10 and
                                X_train[cname].dtype == "object"]
numeric_cols = [cname for cname in X_train.columns if 
                                X_train[cname].dtype in ['int64', 'float64']]
cols = low_cardinality_cols + numeric_cols

X_train = X_train[cols]
X_test  = X_test[cols]
print("Number of columns used as predictors = {}".format(len(cols)))
X_train.dtypes.sample(10)

Number of columns used as predictors = 57


RoofStyle        object
HeatingQC        object
MSSubClass        int64
Fireplaces        int64
HouseStyle       object
ExterQual        object
MiscVal           int64
LandSlope        object
EnclosedPorch     int64
BsmtUnfSF         int64
dtype: object
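The selection rule can be seen on a small invented frame. This is a sketch only: the values are made up, the cutoff is lowered to 3 so that the effect shows up with four rows, and dtypes are set explicitly so the comparison with `"object"` behaves the same across pandas versions.

```python
import pandas as pd

# Invented toy data with explicit dtypes.
toy = pd.DataFrame({
    "RoofStyle": pd.Series(["Gable", "Hip", "Gable", "Gable"], dtype="object"),  # 2 unique values
    "Neighborhood": pd.Series(["A", "B", "C", "D"], dtype="object"),             # 4 unique values
    "Fireplaces": pd.Series([0, 1, 2, 1], dtype="int64"),
    "LotFrontage": pd.Series([65.0, 80.0, 68.0, 60.0], dtype="float64"),
})

max_unique = 3  # toy cutoff (the notebook uses 10)

toy_low_cardinality_cols = [cname for cname in toy.columns
                            if toy[cname].nunique() < max_unique
                            and toy[cname].dtype == "object"]
toy_numeric_cols = [cname for cname in toy.columns
                    if toy[cname].dtype in ["int64", "float64"]]
toy_cols = toy_low_cardinality_cols + toy_numeric_cols

print(toy[toy_cols].dtypes)
```

Note that a categorical column above the cutoff (`Neighborhood` here) is dropped entirely rather than encoded, which is the "a little arbitrary" part of the rule.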

### Function to return MAE using cross_val_score (k-fold cross-validation)

In [4]:
def get_mae(X, y):
    # Multiply by -1 to get a positive MAE, since sklearn's convention is to return the negated value
    return -1 * cross_val_score(RandomForestRegressor(n_estimators=50),
                                X, y,
                                scoring='neg_mean_absolute_error').mean()
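Conceptually, the score returned by `get_mae` is an average of per-fold errors. The sketch below spells that out in plain Python under simplifying assumptions: the folds are contiguous and unshuffled, and a hypothetical stand-in model that always predicts the training mean takes the place of the random forest. The function names are invented for this illustration.

```python
def mean_abs_error(y_true, y_pred):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def kfold_mae(y, k=5):
    """Average held-out MAE over k contiguous folds, using a mean predictor."""
    n = len(y)
    # Spread the remainder over the first folds so all folds differ by at most one item.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    scores = []
    start = 0
    for size in fold_sizes:
        stop = start + size
        held_out = y[start:stop]            # evaluated, never trained on
        train = y[:start] + y[stop:]        # the other k-1 folds
        prediction = sum(train) / len(train)  # stand-in model: training mean
        scores.append(mean_abs_error(held_out, [prediction] * len(held_out)))
        start = stop
    return sum(scores) / k

# Invented toy targets.
toy_prices = [100.0, 120.0, 130.0, 150.0, 170.0, 180.0, 200.0, 210.0, 230.0, 250.0]
print(kfold_mae(toy_prices, k=5))
```

The value computed this way is a positive error, where lower is better. scikit-learn scorers follow a "higher is better" convention, so `neg_mean_absolute_error` reports the negated MAE, which is why `get_mae` flips the sign back.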

### Trying some different configurations

In [5]:
# Model 1 - Not considering categorical columns
X_train_new = X_train.select_dtypes(exclude=['object'])

print('MAE of SalePrice when Dropping Categoricals = {:,.0f}'.format(get_mae(X_train_new, y_train)))

MAE of SalePrice when Dropping Categoricals = 18,465


In [6]:
# Model 2 - Using One-Hot Encoding (get_dummies)
X_train_new = pd.get_dummies(X_train)

print('MAE of SalePrice when Using One-Hot Encoding = {:,.0f}'.format(get_mae(X_train_new, y_train)))

MAE of SalePrice when Using One-Hot Encoding = 18,036
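For readers unfamiliar with one-hot encoding, this is what `pd.get_dummies` does to a small invented frame (the values are hypothetical; the dtype is set explicitly to `object` to match the columns selected earlier). Each categorical column is replaced by one indicator column per distinct value, while numeric columns pass through unchanged.

```python
import pandas as pd

# Invented toy data: one categorical column, one numeric column.
toy = pd.DataFrame({
    "RoofStyle": pd.Series(["Gable", "Hip", "Gable"], dtype="object"),
    "Fireplaces": [0, 1, 2],
})

toy_encoded = pd.get_dummies(toy)

# Numeric columns come first, followed by one indicator column per category.
print(list(toy_encoded.columns))
print(toy_encoded)
```

Because one column is created per category, this is the reason the notebook restricts itself to low-cardinality categorical columns: a column with hundreds of distinct values would add hundreds of sparse predictors.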


In [7]:
# Model 3 - One-Hot Encoding, with the test dataset's columns aligned to the training columns

X_train_new = pd.get_dummies(X_train)
X_test_new = pd.get_dummies(X_test)
a,b = X_train_new.align(X_test_new, join='left', axis=1)
