# House Price Prediction


## Boosting

This notebook discusses the algorithms behind several boosting methods and applies them to house price prediction, where they give better predictive performance than a single model.

## Overview
- AdaBoost
    - Bagging vs AdaBoosting
    - Algorithm (Regression)
- Gradient Boosting Machine (GBM)
    - AdaBoost vs GBM
    - General Form
    - Algorithm
- XGBoost
    - GBM vs XGBoost
    - Algorithm

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [2]:
### Read in the data
df = pd.read_csv('../project-house-price-prediction/data/train.csv')
df_test = pd.read_csv('../project-house-price-prediction/data/test.csv')

## Data
The data preparation below follows the data exploration done in the notebook data-exploration-and-preprocessing.

In [3]:
from sklearn.base import BaseEstimator, TransformerMixin

In [4]:
class CleanHouseAttributes(BaseEstimator, TransformerMixin):
    """Apply rules during data exploration to clean house price dataset"""
    
    def fit(self, x, y=None):
        return self
    
    def transform(self, df_house_price, cols_to_drop, target_col=[]):
        # Age of building/remodel from YearBuilt and YearRemodAdd
        df_house_price['AgeBuilding'] = 2012 - df_house_price['YearBuilt']
        df_house_price['AgeRemodel'] = 2012 - df_house_price['YearRemodAdd']
        
        # Create boolean variables for Alley, PoolQC, and Fence
        df_house_price['HasAlley'] = ~df_house_price[['Alley']].isnull()
        df_house_price['HasPool'] = ~df_house_price[['PoolQC']].isnull()
        df_house_price['Fence'] = ~df_house_price[['Fence']].isnull()

        # Remove rows whose categories do not exist in test.csv
        df_filtered = df_house_price[(df_house_price['HouseStyle'] != '2.5Fin') &
                                     (~df_house_price['RoofMatl'].isin(['Membran', 'Roll', 'ClyTile', 'Metal'])) &
                                     (~df_house_price['Exterior1st'].isin(['Stone', 'ImStucc'])) &
                                     (df_house_price['Exterior2nd'] != 'Other') &
                                     (df_house_price['Electrical'] != 'Mix')]
        
        # Drop columns
        df_dropped = df_filtered.drop(cols_to_drop + target_col + ['HasAlley', 'HasPool', 'Fence'], axis=1)
               
        # Fill NA for numeric columns (with axis=1 the fill value is the row mean, not the column mean)
        df_numeric = df_dropped.select_dtypes(include=['int64', 'float64']).apply(lambda x: x.fillna(x.mean()), axis=1)
        
        # Fill NA for non-numeric columns with the most frequent category
        df_nonNumeric = df_dropped.select_dtypes(include=['object']).apply(lambda x: x.fillna(x.mode()[0]))
                
        X = pd.concat([df_numeric, pd.get_dummies(df_nonNumeric)], axis=1)
        y = df_filtered[target_col]

        return X, y

In [5]:
drop_cols = ['Id', 'YearBuilt', 'YearRemodAdd', 'GarageYrBlt', 'TotalBsmtSF', 
             'TotRmsAbvGrd', 'MoSold', 'YrSold', 'GarageQual', 'Street', 'Alley', 
             'Utilities', 'LandSlope', 'Condition2', 'Heating', 'Functional', 
             'FireplaceQu', 'PoolQC', 'Fence', 'MiscFeature']
target_col = ['SalePrice']

In [6]:
X, y = CleanHouseAttributes().transform(df, drop_cols, target_col)

In [7]:
from sklearn.model_selection import train_test_split

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)

### Decision Tree

In [9]:
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

In [10]:
param_grid = {'max_depth': [3, 5, 8]}
dt = DecisionTreeRegressor()
gs_dt = GridSearchCV(dt, param_grid, cv=5)
gs_dt.fit(X_train, y_train)
print(gs_dt.best_params_, gs_dt.best_score_)

y_pred = gs_dt.predict(X_test)
print(gs_dt.score(X_test, y_test))
print(np.sqrt(mean_squared_error(y_test, y_pred)))

{'max_depth': 5} 0.690666524719
0.751774480166
40851.1893427


### AdaBoosting $_{[1]}$

#### Bagging vs. AdaBoosting
In bagging, each training example is equally likely to be picked. In boosting, the probability that a particular example is in the training set of a particular machine depends on the performance of the prior machines on that example. In each round, each training sample is assigned a new weight based on the prediction performance in the previous round. Samples with worse predictions receive more weight, so that each subsequent machine focuses on the "difficult" samples.

#### Algorithm (Regression)
Given $(x_1, y_1),...,(x_n, y_n)$, assign a weight $w_i=1$ for $i = 1,...,n$.

For $t = 1,..., T$
1. The probability that training sample $i$ is in the training set is $p_i = \frac{w_i}{\sum w_i}$ where the summation is over all members of the training set. Pick $n$ samples with replacement to form the training set.
2. Construct a regression machine $t$.
3. Make prediction $y_i^{(p)}(x_i)$ for $i=1,...,n$ with machine $t$. **Note: $y_i^{(p)}(x_i)$ is the prediction of this machine but not the final prediction of the ensemble model.**
4. Calculate a loss $L_i$ from $y_i^{(p)}(x_i)$ and $y_i$. The loss function may take any functional form as long as $L_i \in [0,1]$. If we let $$ D = \sup|y_i^{(p)}(x_i)-y_i|, \ \ i = 1,...,n,$$ so that $D$ is the largest absolute error, then we have three candidate loss functions:
$$L_i=\frac{|y_i^{(p)}(x_i)-y_i|}{D} \textit{(linear)}$$
$$L_i=\frac{|y_i^{(p)}(x_i)-y_i|^2}{D^2} \textit{(square law)}$$
$$L_i=1-\exp\left[\frac{-|y_i^{(p)}(x_i)-y_i|}{D}\right] \textit{(exponential)}$$
5. Calculate the average loss $\overline{L}=\sum_{i=1}^{n}L_ip_i$.
6. Form $\beta=\frac{\overline{L}}{1-\overline{L}}$. $\beta$ is a measure of confidence in the predictor: a low $\beta$ means high confidence in the prediction.
7. Update the weights: $w_i \rightarrow w_i\beta^{[1-L_i]}$. Since $\beta < 1$ whenever $\overline{L} < 0.5$, the smaller the loss, the more the weight is reduced, making the sample less likely to be picked in the next round.
8. For a particular input $x_i$, each of the $T$ machines makes a prediction $h_t, t=1,...,T$. The cumulative prediction $h_f$ of the $T$ predictors is:
$$h_f = \inf\bigg\{y \in Y: \sum_{t:h_t\leq y}\log\Big(\frac{1}{\beta_t}\Big)\geq\frac{1}{2}\sum_{t}\log\Big(\frac{1}{\beta_t}\Big)\bigg\}$$
In effect, the $T$ predictions for $x_i$ are relabeled in ascending order, $y_i^{(1)}<y_i^{(2)}<...<y_i^{(T)}$, with each $\beta_t$ kept attached to its prediction. The $\log(1/\beta_t)$ are then summed in that order until we reach the smallest $t$ for which the running sum is equal to or greater than $\frac{1}{2}\sum_{t}\log(\frac{1}{\beta_t})$; the prediction $y_i^{(t)}$ at that point is the ensemble prediction. This is the **weighted median**. If the $\beta_t$ were all equal, this would be the median.
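
Steps 4 to 8 can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the scikit-learn implementation: the function names `adaboost_r2_round` and `weighted_median` are hypothetical, only the linear loss is shown, and the toy numbers are made up. It also assumes at least one prediction is wrong (so $D > 0$) and that every $\beta_t < 1$.

```python
import math


def adaboost_r2_round(y_true, y_pred, weights):
    """One boosting round (steps 4-7) using the linear loss."""
    total = sum(weights)
    probs = [w / total for w in weights]                   # p_i = w_i / sum(w)
    errors = [abs(p - y) for p, y in zip(y_pred, y_true)]
    d = max(errors)                                        # D: largest absolute error
    if d == 0:
        raise ValueError("all predictions are exact; no loss to normalise")
    losses = [e / d for e in errors]                       # step 4: L_i in [0, 1]
    avg_loss = sum(l * p for l, p in zip(losses, probs))   # step 5
    beta = avg_loss / (1.0 - avg_loss)                     # step 6
    new_weights = [w * beta ** (1.0 - l)                   # step 7
                   for w, l in zip(weights, losses)]
    return beta, new_weights


def weighted_median(predictions, betas):
    """Step 8: combine the T machine predictions for a single input."""
    log_inv = [math.log(1.0 / b) for b in betas]
    half = 0.5 * sum(log_inv)
    running = 0.0
    # Walk through the predictions in ascending order, accumulating log(1/beta_t).
    for pred, weight in sorted(zip(predictions, log_inv)):
        running += weight
        if running >= half:
            return pred
    return max(predictions)  # fallback for degenerate inputs (e.g. all betas equal to 1)


# Toy data (made up): the last sample has the largest error.
beta, new_weights = adaboost_r2_round(
    y_true=[1.0, 2.0, 3.0, 4.0],
    y_pred=[1.0, 2.5, 3.0, 3.0],
    weights=[1.0, 1.0, 1.0, 1.0],
)
print(beta, new_weights)

# Three machines predict 10, 20 and 30 for the same input.
equal_conf = weighted_median([10.0, 20.0, 30.0], [0.5, 0.5, 0.5])
first_trusted = weighted_median([10.0, 20.0, 30.0], [0.01, 0.9, 0.9])
print(equal_conf, first_trusted)
```

In the toy round, the sample with the largest error keeps its full weight while the exactly predicted samples are scaled down by $\beta$. In the second part, equal $\beta_t$ give the ordinary median, and a very small $\beta$ for the first machine pulls the ensemble prediction to that machine's output.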


In [11]:
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

In [12]:
%%time
param_grid = {'n_estimators': [250],
              'loss' : ['linear', 'square', 'exponential']}
abr_3 = AdaBoostRegressor(DecisionTreeRegressor(max_depth=3))
abr_5 = AdaBoostRegressor(DecisionTreeRegressor(max_depth=5))
abr_8 = AdaBoostRegressor(DecisionTreeRegressor(max_depth=8))
for k, v in zip(['abr_3', 'abr_5', 'abr_8'], [abr_3, abr_5, abr_8]):
    gs_cv = GridSearchCV(v, param_grid, cv = 5)
    gs_cv.fit(X_train, np.ravel(y_train))
    print(k, gs_cv.best_params_, gs_cv.best_score_)

abr_3 {'loss': 'exponential', 'n_estimators': 250} 0.78948529636
abr_5 {'loss': 'linear', 'n_estimators': 250} 0.812257334853
abr_8 {'loss': 'square', 'n_estimators': 250} 0.827129375149
Wall time: 6min 54s


In [13]:
# gs_cv holds the last search fitted in the loop above (abr_8, max_depth=8)
y_pred = gs_cv.predict(X_test)
print('R-squared:', gs_cv.score(X_test, y_test), 'RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))

R-squared: 0.898281009521 RMSE: 26150.6513963


### Gradient Boosting Machine (GBM) $_{[2]}$

#### AdaBoost vs GBM
In contrast to AdaBoost, which takes $\textit{the weighted median}$ of the predictions from the ensembled regressors, GBM is an $\textit{additive model}$ that adds up the predicted values from the ensembled regressors.

#### General Form
The general form of the gradient tree-boosting algorithm for regression can be expressed as
$$f(x) = \sum_{m=1}^{M}\beta_mb(x;\gamma_m),$$
where $\beta_m, m = 1,2,...,M$ are the expansion coefficients, and $b(x;\gamma)\in\mathbb{R}$ are usually simple functions of the multivariate argument $x$, characterized by a set of parameters $\gamma$. Specific algorithms are obtained by inserting different loss criteria $L(y, f(x))$.

#### Algorithm
1. Initialize $f_0(x) = \operatorname*{argmin}_\gamma\sum_{i=1}^{N} L(y_i,\gamma)$.
2. For $m = 1$ to $M$:
    
    (a) For $i = 1,2,...,N$ compute
    $$r_{im} = -\bigg[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\bigg]_{f=f_{m-1}}.$$
    
    $r_{im}$ is referred to as the generalized or $\textit{pseudo}$ residual. If we set the loss function to $\frac{1}{2}[y_i - f(x_i)]^2$, the negative gradient is $y_i - f_{m-1}(x_i)$, the ordinary residual. This step is also where the model gets its name.
    
    (b) Fit a regression tree to the targets $r_{im}$, giving terminal regions $R_{jm}, j = 1,2,...,J_m$. 
    
    Here we fit $r$, not $y$. $J_m$ is the number of leaves; each iteration $m$ may have a different number of leaves.
    
    (c) For $j=1,2,...,J_m$ compute
    $$\gamma_{jm} = \operatorname*{argmin}_\gamma\sum_{x_i\in R_{jm}}L(y_i, f_{m-1}(x_i) + \gamma).$$
    
    (d) Update $f_m(x) = f_{m-1}(x) + \sum_{j=1}^{J_m}\gamma_{jm}I(x\in R_{jm})$.

3. Output $\hat{f}(x) = f_M(x)$.
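
The algorithm above can be sketched for the squared-error loss with a one-dimensional feature and two-leaf trees (stumps). This is a minimal illustration, not what `GradientBoostingRegressor` does internally: the helper names, the toy data, and the `shrinkage` factor (a learning rate that the algorithm as written does not include) are all assumptions made for the example. With squared error, the best $\gamma_{jm}$ for a leaf is simply the mean residual in that leaf, so steps (b) and (c) collapse into one least-squares stump fit.

```python
def fit_stump(x, r):
    """Fit a two-leaf regression tree to targets r by least squares."""
    best = None
    for threshold in sorted(set(x))[:-1]:              # candidate split points
        left = [ri for xi, ri in zip(x, r) if xi <= threshold]
        right = [ri for xi, ri in zip(x, r) if xi > threshold]
        mean_left = sum(left) / len(left)              # gamma for the left region
        mean_right = sum(right) / len(right)           # gamma for the right region
        sse = (sum((ri - mean_left) ** 2 for ri in left)
               + sum((ri - mean_right) ** 2 for ri in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, mean_left, mean_right)
    return best[1:]


def gradient_boost(x, y, n_rounds=50, shrinkage=0.1):
    f0 = sum(y) / len(y)                               # step 1: best constant for squared loss
    preds = [f0] * len(y)
    stumps = []
    for _ in range(n_rounds):                          # step 2
        residuals = [yi - fi for yi, fi in zip(y, preds)]             # (a) pseudo-residuals
        threshold, gamma_left, gamma_right = fit_stump(x, residuals)  # (b) and (c)
        stumps.append((threshold, gamma_left, gamma_right))
        preds = [fi + shrinkage * (gamma_left if xi <= threshold else gamma_right)
                 for xi, fi in zip(x, preds)]                         # (d) update f_m
    return f0, shrinkage, stumps                       # step 3: the final additive model


def predict(model, xi):
    f0, shrinkage, stumps = model
    return f0 + sum(shrinkage * (gl if xi <= thr else gr) for thr, gl, gr in stumps)


def mse(y, preds):
    return sum((yi - pi) ** 2 for yi, pi in zip(y, preds)) / len(y)


# Toy data (made up): a step from about 1 to about 3.
x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]

model = gradient_boost(x, y)
mse_constant = mse(y, [sum(y) / len(y)] * len(y))
mse_boosted = mse(y, [predict(model, xi) for xi in x])
print(mse_constant, mse_boosted)
```

Each round fits the residuals left by the previous rounds rather than $y$ itself, which is why the training error of the additive model falls below that of the initial constant.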

In [14]:
from sklearn.ensemble import GradientBoostingRegressor

In [15]:
%%time
param_grid = {'n_estimators': [500],
              'learning_rate': np.linspace(0.01, 0.15, 4),
              'max_depth': [3, 5, 8]}
gbr = GradientBoostingRegressor()
gs_cv = GridSearchCV(gbr, param_grid, cv=5)
gs_cv.fit(X_train, np.ravel(y_train))
print(gs_cv.best_params_, gs_cv.best_score_)

{'learning_rate': 0.14999999999999999, 'max_depth': 3, 'n_estimators': 500} 0.847728415134
Wall time: 19min 21s


In [16]:
y_pred = gs_cv.predict(X_test)
print(gs_cv.score(X_test, y_test), np.sqrt(mean_squared_error(y_test, y_pred)))

0.913572430868 24105.0279711


### XGBoost $_{[3]}$

#### GBM vs XGBoost
Gradient boosting machine, or gradient tree boosting, has been shown to give state-of-the-art results on many standard classification benchmarks. XGBoost is a scalable machine learning system for tree boosting whose impact has been widely recognized in a number of machine learning and data mining challenges. A regularization term added to the loss function in XGBoost helps prevent overfitting, which improves generalization from the training data and prediction performance for regression, ranking, and classification problems.

Moreover, according to [3], the XGBoost system runs more than ten times faster than existing popular solutions on a single machine and scales to billions of examples in distributed or memory-limited settings, thanks to several important systems and algorithmic optimizations, including:
- A highly scalable end-to-end tree boosting system.
- A theoretically justified weighted quantile sketch for efficient proposal calculation.
- A novel sparsity-aware algorithm for parallel tree learning.
- An effective cache-aware block structure for out-of-core tree learning.


#### Algorithm
**Note: The process below is simplified for a brief discussion. For more details, please refer to reference [3].**

The system implements gradient boosting, which performs additive optimization in functional space, and incorporates a regularized model to prevent overfitting.

**Gradient Boosting:**

$$\hat{y}_i = \phi(x_i) = \sum_{k=1}^{K}f_k(x_i), \ \ f_k\in\mathcal{F},$$

where $\mathcal{F}=\{f(x)=w_{q(x)}\}\ (q:\mathbb{R}^m \rightarrow T, w \in \mathbb{R}^T)$ is the space of regression trees (also known as CART). Here $q$ represents the structure of each tree, which maps an example to the corresponding leaf index. $T$ is the number of leaves in the tree. Each $f_k$ corresponds to an independent tree structure $q$ and leaf weights $w$. We use $w_i$ to represent the score on the $i$-th leaf.
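
To make the notation concrete, the sketch below writes two small trees as a structure function $q$ (input to leaf index) plus a list of leaf weights $w$, and sums them to get $\phi(x)$. Everything here is hypothetical and chosen only for illustration: the split thresholds, the leaf weights, and the two-feature input are not taken from the house price data.

```python
def q1(x):
    """Structure of tree 1: maps a feature tuple to a leaf index (T = 3)."""
    if x[0] < 1500:
        return 0
    return 1 if x[1] < 7 else 2


def q2(x):
    """Structure of tree 2: a single split (T = 2)."""
    return 0 if x[1] < 5 else 1


w1 = [120.0, 180.0, 260.0]   # leaf weights of tree 1
w2 = [-10.0, 15.0]           # leaf weights of tree 2

trees = [(q1, w1), (q2, w2)]


def phi(x):
    """Ensemble prediction: sum over trees of f_k(x) = w[q(x)]."""
    return sum(w[q(x)] for q, w in trees)


print(phi((1800, 8)))   # leaf 2 of tree 1 plus leaf 1 of tree 2
print(phi((1000, 4)))   # leaf 0 of tree 1 plus leaf 0 of tree 2
```

Separating $q$ from $w$ is what lets the derivation treat the tree structure as fixed while solving for the leaf weights.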

**XGBoost then minimizes the following $\textit{regularized}$ objective:** 

$$\textit{L}(\phi) = \sum_{i}l(\hat{y}_i, y_i)+\sum_{k}\Omega(f_k),$$
$$\text{where} \ \ \Omega(f) = \gamma T + \frac{1}{2}\lambda\|w\|^2$$

The model is trained in an additive manner. Let $\hat{y}_i^{(t)}$ be the prediction of the $i$-th instance at the $t$-th iteration; we then need to add the $f_t$ that minimizes:

$$\textit{L}^{(t)} = \sum_{i=1}^{n}l(y_i, \hat{y}_i^{(t-1)}+f_t(x_i))+\Omega(f_t).$$

A second-order approximation makes this objective quick to optimize, and the constant term $l(y_i, \hat{y}_i^{(t-1)})$ can then be removed:
$$ \textit{L}^{(t)} \simeq \sum_{i=1}^{n}[l(y_i, \hat{y}_i^{(t-1)}) + g_if_t(x_i) + \frac{1}{2}h_if_t^2(x_i)] + \Omega(f_t) $$
$$ \tilde{L}^{(t)} = \sum_{i=1}^{n}[g_if_t(x_i) + \frac{1}{2}h_if_t^2(x_i)] + \Omega(f_t), $$
where $g_i = \partial_{\hat{y}_i^{(t-1)}}l(y_i,\hat{y}_i^{(t-1)})$ and $h_i = \partial_{\hat{y}_i^{(t-1)}}^2l(y_i,\hat{y}_i^{(t-1)})$.
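
As a worked case, assume the squared-error loss $l(y_i, \hat{y}_i) = \frac{1}{2}(y_i - \hat{y}_i)^2$. This choice is an assumption made here for illustration; the derivation itself holds for any twice-differentiable loss. The two derivatives then reduce to

$$ g_i = \hat{y}_i^{(t-1)} - y_i, \qquad h_i = 1, $$

so the objective for the new tree becomes

$$ \sum_{i=1}^{n}\Big[(\hat{y}_i^{(t-1)} - y_i)f_t(x_i) + \frac{1}{2}f_t^2(x_i)\Big] + \Omega(f_t). $$

In this case $-g_i$ is the ordinary residual, the same quantity GBM fits as its pseudo-residual under this loss, and the second-order expansion is exact rather than approximate.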

Define $I_j = \{i|q(x_i)=j\}$ as the instance set of leaf $j$ and expand $\Omega$.
$$ \tilde{L}^{(t)} = \sum_{i=1}^{n}[g_if_t(x_i) + \frac{1}{2}h_if_t^2(x_i)]+ \gamma T + \frac{1}{2}\lambda\sum_{j=1}^{T}w_j^2$$
$$ = \sum_{j=1}^{T}[(\sum_{i \in I_j}g_i)w_j + \frac{1}{2}(\sum_{i\in I_j}h_i+\lambda)w_j^2]+\gamma T$$

For a fixed structure $q(x)$, we can compute the optimal weight $w_j^*$ of leaf $j$ by
$$ w_j^* = -\frac{\sum_{i\in I_j}g_i}{\sum_{i \in I_j}h_i + \lambda} $$

and calculate the corresponding optimal value by
$$ \tilde{L}^{(t)}(q) = -\frac{1}{2}\sum_{j=1}^{T}\frac{(\sum_{i \in I_j} g_i)^2}{\sum_{i \in I_j} h_i+\lambda}+\gamma T $$
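
The two formulas above are straightforward to evaluate once the gradients in each leaf are known. The sketch below is a minimal illustration with hypothetical function names and made-up numbers; it assumes the squared-error loss $\frac{1}{2}(y_i-\hat{y}_i)^2$, for which $g_i = \hat{y}_i - y_i$ and $h_i = 1$.

```python
def leaf_weight(g, h, lam):
    """Optimal weight w_j* for one leaf, given its gradients g and hessians h."""
    return -sum(g) / (sum(h) + lam)


def structure_score(leaves, lam, gamma):
    """Optimal objective value for a fixed tree structure.

    leaves is a list of (g, h) pairs, one pair of lists per leaf.
    """
    quality = sum(sum(g) ** 2 / (sum(h) + lam) for g, h in leaves)
    return -0.5 * quality + gamma * len(leaves)


# One leaf holding three instances whose current predictions are too low
# by 2, 1 and 3 (so g_i = y_hat_i - y_i is negative) and h_i = 1.
g = [-2.0, -1.0, -3.0]
h = [1.0, 1.0, 1.0]

w_star = leaf_weight(g, h, lam=1.0)
score = structure_score([(g, h)], lam=1.0, gamma=0.5)
print(w_star, score)
```

Without regularization the leaf weight would be the mean residual, 2.0. With $\lambda = 1$ it shrinks to 1.5, which shows how $\lambda$ pulls leaf weights toward zero.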

Assume that $I_L$ and $I_R$ are the instance sets of the left and right nodes after a split. Letting $I=I_L \cup I_R$, the loss reduction after the split is given by
$$ L_{split} = \frac{1}{2} \bigg[ \frac{(\sum_{i \in I_L} g_i)^2}{\sum_{i \in I_L} h_i+\lambda} +
                                  \frac{(\sum_{i \in I_R} g_i)^2}{\sum_{i \in I_R} h_i+\lambda} -
                                  \frac{(\sum_{i \in I} g_i)^2}{\sum_{i \in I} h_i+\lambda} \bigg] - \gamma $$

This formula is usually used in practice for evaluating the split candidates.
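
A small sketch of the split evaluation follows. The function name `split_gain` and the numbers are hypothetical, and $h_i = 1$ is assumed (as for the squared-error loss). A split is only worth making when the gain is positive, so $\gamma$ acts as a minimum loss reduction.

```python
def split_gain(g_left, h_left, g_right, h_right, lam, gamma):
    """Loss reduction from splitting a node into a left and a right child."""
    def score(g, h):
        return sum(g) ** 2 / (sum(h) + lam)

    parent = score(g_left + g_right, h_left + h_right)   # I = I_L union I_R
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right) - parent) - gamma


ones = [1.0, 1.0]

# A split that separates negative gradients from positive ones.
good = split_gain([-2.0, -3.0], ones, [2.0, 3.0], ones, lam=1.0, gamma=0.5)

# A split that leaves each child with gradients that cancel out.
bad = split_gain([-2.0, 2.0], ones, [-3.0, 3.0], ones, lam=1.0, gamma=0.5)

print(good, bad)
```

The first candidate groups instances that need corrections in the same direction and has a clearly positive gain. The second achieves nothing, so its gain is just $-\gamma$ and the split would be rejected.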

In [17]:
# Same cleaning process but without filling NA, since XGBoost can handle missing values
class CleanHouseDataForXGBoost(BaseEstimator, TransformerMixin):
    """Apply rules during data exploration to clean house price dataset"""
    
    def fit(self, x, y=None):
        return self
    
    def transform(self, df_house_price, cols_to_drop, target_col=[]):
        # Age of building/remodel from YearBuilt and YearRemodAdd
        df_house_price['AgeBuilding'] = 2012 - df_house_price['YearBuilt']
        df_house_price['AgeRemodel'] = 2012 - df_house_price['YearRemodAdd']
        
        # Create boolean variables for Alley, PoolQC, and Fence
        df_house_price['HasAlley'] = ~df_house_price[['Alley']].isnull()
        df_house_price['HasPool'] = ~df_house_price[['PoolQC']].isnull()
        df_house_price['Fence'] = ~df_house_price[['Fence']].isnull()

        # Remove rows whose categories do not exist in test.csv
        df_filtered = df_house_price[(df_house_price['HouseStyle'] != '2.5Fin') &
                                     (~df_house_price['RoofMatl'].isin(['Membran', 'Roll', 'ClyTile', 'Metal'])) &
                                     (~df_house_price['Exterior1st'].isin(['Stone', 'ImStucc'])) &
                                     (df_house_price['Exterior2nd'] != 'Other') &
                                     (df_house_price['Electrical'] != 'Mix')]
        
        # Drop columns
        df_dropped = df_filtered.drop(cols_to_drop + target_col + ['HasAlley', 'HasPool', 'Fence'], axis=1)
               
        # Get numeric columns
        df_numeric = df_dropped.select_dtypes(include=['int64', 'float64'])
        
        # Get non-numeric columns
        df_nonNumeric = df_dropped.select_dtypes(include=['object'])
                
        X = pd.concat([df_numeric, pd.get_dummies(df_nonNumeric)], axis=1)
        y = df_filtered[target_col]

        return X, y

In [18]:
X, y = CleanHouseDataForXGBoost().transform(df, drop_cols, target_col)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)

In [19]:
import xgboost as xgb

In [20]:
%%time
# Use the same parameter grid as GBM
#data_dmatrix = xgb.DMatrix(data = X_train, label = y_train)
param_grid = {'max_depth': [3, 5, 8],
              'learning_rate': np.linspace(0.01, 0.15, 4),
              'n_estimators': [500]}
gs_cv = GridSearchCV(xgb.XGBRegressor(), param_grid, cv=5)
gs_cv.fit(X_train, y_train)
print(gs_cv.best_params_, gs_cv.best_score_)

{'learning_rate': 0.056666666666666664, 'max_depth': 3, 'n_estimators': 500} 0.827793626636
Wall time: 13min 51s


In [21]:
y_pred = gs_cv.predict(X_test)
print(gs_cv.score(X_test, y_test), np.sqrt(mean_squared_error(y_test, y_pred)))

0.899172022077 26035.8655096


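**With the same parameter grid, the XGBoost search finishes in 13min 51s versus 19min 21s for GBM, roughly 28% less wall time (GBM takes about 40% longer). On this hold-out split, however, GBM scores slightly higher (R-squared 0.914 vs. 0.899, RMSE about 24,105 vs. 26,036), so the speed gain does not come with better test-set accuracy here.**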
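
Because "faster" can be read in two ways, the short calculation below works out both ratios from the two wall times printed above. It uses only those reported numbers; the variable names are chosen for illustration.

```python
gbm_seconds = 19 * 60 + 21    # GBM grid search: 19min 21s
xgb_seconds = 13 * 60 + 51    # XGBoost grid search: 13min 51s

time_saved = 1 - xgb_seconds / gbm_seconds   # fraction of wall time saved by XGBoost
speedup = gbm_seconds / xgb_seconds          # how many times longer GBM takes

print(f"time saved: {time_saved:.1%}, speedup: {speedup:.2f}x")
```

The search takes about 28% less time with XGBoost, or equivalently GBM takes about 1.4 times as long.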

#### Randomized Grid Search

In [22]:
from sklearn.model_selection import RandomizedSearchCV

In [23]:
%%time
params = {'max_depth': [3, 5, 8],
          'learning_rate': np.linspace(0.01, 0.15, 4),
          'n_estimators': [500]}

random_search = RandomizedSearchCV(xgb.XGBRegressor(), params, n_iter=5, scoring='neg_mean_squared_error')
random_search.fit(X_train, y_train)
print(random_search.best_params_, random_search.best_score_)

{'learning_rate': 0.14999999999999999, 'max_depth': 3, 'n_estimators': 500} -1051860217.83
Wall time: 3min 38s


In [24]:
y_pred = random_search.predict(X_test)
print(random_search.score(X_test, y_test), np.sqrt(mean_squared_error(y_test, y_pred)))

-692245392.488 26310.5566739
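
Note that the first number printed above is not an R-squared. With `scoring='neg_mean_squared_error'`, the search object's `score` method returns the negative mean squared error on the data it is given. The check below uses only the two printed values to confirm that they describe the same quantity.

```python
import math

neg_mse = -692245392.488          # first value printed above
rmse = math.sqrt(-neg_mse)        # flip the sign, then take the square root

print(rmse)                       # agrees with the second printed value, 26310.5566739
```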


Although random search is much faster, the result is not quite as good as that of grid search.
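
The difference in run time follows from how many candidates each search evaluates. The sketch below enumerates the grid with the standard library and draws five candidates from it. It is an illustration only: `random.sample` stands in for scikit-learn's own sampler, and the seed is arbitrary.

```python
import itertools
import random

params = {
    'max_depth': [3, 5, 8],
    # Same four values as np.linspace(0.01, 0.15, 4), up to rounding
    'learning_rate': [0.01 + i * (0.15 - 0.01) / 3 for i in range(4)],
    'n_estimators': [500],
}

# Grid search: every combination of the listed values.
grid = [dict(zip(params, combo)) for combo in itertools.product(*params.values())]

# Random search with n_iter=5: only five of those combinations are tried.
rng = random.Random(0)
sampled = rng.sample(grid, 5)

print(len(grid), len(sampled))    # prints: 12 5
```

With only 5 of the 12 candidates evaluated, the random search can miss the best combination, which is consistent with the slightly higher RMSE reported above.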

#### More Parameter Tuning

In [25]:
%%time
# Randomized search over a larger parameter space
params = {'max_depth': [3, 5, 7],
          'learning_rate': np.linspace(0.001, 0.020, 11),
          'n_estimators': [250, 500, 800],
          'min_child_weight': [5, 8, 10],
          'gamma': [0.5, 1, 1.5],
          'subsample': [0.6, 0.8, 1.0],
          'colsample_bytree': [0.6, 0.8, 1.0]
          }

random_search = RandomizedSearchCV(xgb.XGBRegressor(), params, n_iter=5, scoring='neg_mean_squared_error')
random_search.fit(X_train, y_train)
print(random_search.best_params_, random_search.best_score_)

{'min_child_weight': 10, 'max_depth': 3, 'n_estimators': 800, 'subsample': 0.6, 'learning_rate': 0.0086, 'gamma': 1, 'colsample_bytree': 0.8} -778456612.689
Wall time: 2min 8s


In [26]:
y_pred = random_search.predict(X_test)
print(random_search.score(X_test, y_test), np.sqrt(mean_squared_error(y_test, y_pred)))

-694890265.687 26360.7713409


## Performance on Kaggle

In [27]:
%%time
# GBM: Kaggle score 0.12984
X, y = CleanHouseAttributes().transform(df, drop_cols, target_col)
X_test_kaggle, y_test_kaggle = CleanHouseAttributes().transform(df_test, drop_cols)

param_grid = {'n_estimators': [500],
              'learning_rate': np.linspace(0.01, 0.15, 4),
              'max_depth': [3, 5]}
gbr = GradientBoostingRegressor()
gs_cv = GridSearchCV(gbr, param_grid, cv=5)
gs_cv.fit(X, np.ravel(y))

y_pred_kaggle = gs_cv.predict(X_test_kaggle)
pd.DataFrame({'Id': df_test.Id.values,
              'SalePrice': np.squeeze(y_pred_kaggle)}).to_csv('../project-house-price-prediction/data/pred.csv', index=False)

Wall time: 11min 49s


In [28]:
%%time
# XGBoost: Kaggle score 0.12845
X, y = CleanHouseDataForXGBoost().transform(df, drop_cols, target_col)
X_test_kaggle, y_test_kaggle = CleanHouseDataForXGBoost().transform(df_test, drop_cols)

params_grid = {'max_depth': [3, 5],
               'learning_rate': np.linspace(0.001, 0.015, 4),
               'n_estimators': [500],
               #'min_child_weight': [5, 8, 10],
               #'gamma': [0.5, 1, 1.5],
               'subsample': [0.6, 0.8],
               'colsample_bytree': [0.6, 0.8]
              }

gs_cv = GridSearchCV(xgb.XGBRegressor(), params_grid, cv=5)
gs_cv.fit(X, np.ravel(y))
y_pred_kaggle = gs_cv.predict(X_test_kaggle)
pd.DataFrame({'Id': df_test.Id.values,
              'SalePrice': np.squeeze(y_pred_kaggle)}).to_csv('../project-house-price-prediction/data/pred.csv', index=False)

Wall time: 37min


In [29]:
%%time
# XGBoost: Kaggle score 0.12628 (top 33.6%)
X, y = CleanHouseDataForXGBoost().transform(df, drop_cols, target_col)
X_test_kaggle, y_test_kaggle = CleanHouseDataForXGBoost().transform(df_test, drop_cols)

params = {'max_depth': [3, 5, 7],
          'learning_rate': np.linspace(0.001, 0.020, 11),
          'n_estimators': [250, 500, 800],
          'min_child_weight': [5, 8, 10],
          'gamma': [0.5, 1, 1.5],
          'subsample': [0.6, 0.8, 1.0],
          'colsample_bytree': [0.6, 0.8, 1.0]
          }

random_search = RandomizedSearchCV(xgb.XGBRegressor(), params, n_iter=5, scoring='neg_mean_squared_error')
random_search.fit(X, np.ravel(y))
y_pred_kaggle = random_search.predict(X_test_kaggle)
pd.DataFrame({'Id': df_test.Id.values,
              'SalePrice': np.squeeze(y_pred_kaggle)}).to_csv('../project-house-price-prediction/data/pred.csv', index=False)

Wall time: 5min 20s


In [30]:
print(random_search.best_params_)

{'min_child_weight': 5, 'max_depth': 7, 'n_estimators': 800, 'subsample': 0.8, 'learning_rate': 0.0067000000000000002, 'gamma': 1, 'colsample_bytree': 0.8}


## References
1. H. Drucker, "Improving Regressors using Boosting Techniques", 1997.
2. T. Hastie, R. Tibshirani and J. Friedman, "The Elements of Statistical Learning", 2nd ed., Springer, 2009.
3. T. Chen and C. Guestrin, "XGBoost: A Scalable Tree Boosting System", 2016.