# House Price Prediction


## Boosting

This notebook discusses the algorithm and application of boosting methods for better predicting performance than single models.

## Overview    
- Data
- Models
    - Decision Tree
    - AdaBoosting
        - Bagging vs. AdaBoosting
        - Algorithm (Regression)
    - Gradient Boosting Machine (GBM)
        - AdaBoost vs GBM
        - General Form
        - Algorithm
    - XGBoost
        - GBM vs XGBoost
        - Algorithm
- Performance on Kaggle
    - GradientBoostingRegressor from sklearn
    - XGBoost from xgboost
    - Executive Summary
- References


In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [2]:
### Read in the data
df = pd.read_csv('../project-house-price-prediction/data/train.csv')
df_test = pd.read_csv('../project-house-price-prediction/data/test.csv')

## Data
***The data preparation below follows the process done in the seperate notebook data-exploration-and-preprocessing under the same repository.***

In [3]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

In [4]:
class HouseDataFeatureEngineering(BaseEstimator, TransformerMixin):
    """Apply rules during data exploration to clean house price dataset"""
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        # Numerical Features:
            # Age of building/remodle from YearBuilt and YearRemodAdd
        if np.all((X.dtypes == 'int64') | (X.dtypes == 'float64')):
            X = X.assign(AgeBuilding = 2012 - X['YearBuilt'],
                         AgeRemodel = 2012 - X['YearRemodAdd'])

            X.drop(['YearBuilt', 'YearRemodAdd'], axis=1, inplace=True)
        
        # Non-numerical Features:
            # Create boolean variables for Alley, PoolQC, and Fence
        if np.all(X.dtypes == 'object'):
            X = X.assign(HasAlley = [False if x is None else True for x in X.Alley],
                            HasPool = [False if x is None else True for x in X.PoolQC],
                            Fence = [False if x is None else True for x in X.Fence])
            X.drop(['Alley', 'PoolQC', 'Fence'], axis=1, inplace=True)

        return X

In [5]:
class HouseDataDropColumns(BaseEstimator, TransformerMixin):
    """Apply rules during data exploration to clean house price dataset"""
    
    def __init__(self, drop_columns=['Id', 'GarageYrBlt', 'TotalBsmtSF', 'TotRmsAbvGrd', 
                                     'MoSold', 'YrSold', 'GarageQual', 'Street', 'Utilities', 
                                     'LandSlope', 'Condition2', 'Heating', 'Functional', 
                                     'FireplaceQu', 'MiscFeature']):
        self.drop_columns = drop_columns        
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        # To apply the transformer on numeric and non-numeric data frames respectively
        drop_it = [x for x in self.drop_columns if x in X.columns]
        X.drop(drop_it, axis=1, inplace=True)

        return X

In [6]:
# Build preprocessing pipeline by numeric and non-numeric columns
numeric_features = df.drop(['SalePrice'], axis=1
                          ).select_dtypes(include=['int64', 'float64']).columns
numeric_transformer = Pipeline(steps=[
    ('drop', HouseDataDropColumns()),
    ('feature', HouseDataFeatureEngineering()),
    ('imputer', SimpleImputer(strategy='median'))])

categorical_features = df.select_dtypes(include=['object']).columns
categorical_transformer = Pipeline(steps=[
    ('drop', HouseDataDropColumns()),
    ('feature', HouseDataFeatureEngineering()),
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

In [7]:
X = df.drop(['SalePrice'], axis = 1)
y = df[['SalePrice']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 45)

## Models
### Decision Tree
From the notebook models-and-algorithm under the same repository, it's identified that decision tree has the lowest performance among other single regression models. Therefore, decision tree is selected here to observe the performace improvement after boosting methods and to compare with gradient boosting tree models that will be introduced later in this notebook.

In [8]:
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

In [9]:
param_grid = {'dt__max_depth': [3, 5, 8]}
dt = Pipeline(steps=[('preprocessor', preprocessor),
                     ('dt', DecisionTreeRegressor())])
gs_dt = GridSearchCV(dt, param_grid, cv=5)
gs_dt.fit(X_train, y_train)
print(gs_dt.best_params_, gs_dt.best_score_)

y_pred = gs_dt.predict(X_test)
print(gs_dt.score(X_test, y_test))
print(np.sqrt(mean_squared_error(y_test, y_pred)))

{'dt__max_depth': 8} 0.5894056896304389
0.6885879692663159
44150.27139297183


### AdaBoosting $_{[1]}$

#### Bagging vs. AdaBoosting
In bagging, each training example is equally likely to be picked. In boosting, the probability of a particular example being in the training set of a particular machine depends on the performance of the prior machines on that sample. In each round, each training samples is assigned a new weight depending on the prediction performance from the previous round. More weights are assigned to samples with worse prediction so that each subsequent machine would focus on training the "difficult" samples.

#### Algorithm (Regression)
Given $(x_1, y_1),...,(x_n, y_n)$, assign a weight $w_i=1$ for $i = 1,...,n$.

For $t = 1,..., T$
1. The probability that training sample $i$ is in the training set is $p_i = \frac{w_i}{\sum w_i}$ where the summation is over all members of the training set. Pick $n$ samples with replacement to form the training set.
2. Construct a regression machine $t$.
3. Make prediction $y_i^{(p)}(x_i)$ for $i=1,...,n$ with machine $t$. **Note: $y_i^{(p)}(x_i)$ is the prediction of this machine but not the final prediction of the ensemble model.**
4. Calculate a loss for $y_i^{(p)}(x_i)$ and $y_i$. The loss function may be of any functional form as long as $L \in$ [0,1]. If we let $$ D = sup|y_i^{(p)}(x_i)-y_i|, \ \ i = 1,...,n,$$ which means D is the largest error, then we have three candidate loss functions:
$$L_i=\frac{|y_i^{(p)}(x_i)-y_i|}{D} \ \ \ \textit{(linear)}$$
$$L_i=\frac{|y_i^{(p)}(x_i)-y_i|^2}{D^2} \ \ \ \textit{(square law)}$$
$$L_i=1-\exp\left[\frac{-|y_i^{(p)}(x_i)-y_i|}{D}\right] \ \ \ \textit{(exponential)}$$
5. Calculate aveage loss $\overline{L}=\sum_{i=1}^{n}L_ip_i$
6. Form $\beta=\frac{\overline{L}}{1-\overline{L}}$. $\beta$ is a measure of confidence in the predictor. Low $\beta$ means high confidence in the prediction.
7. Update the weights: $w_i \rightarrow w_i\beta^{[1-L_i]}$. The smaller the loss, the more weight is reduced , making the sample less likely to be picked in the next round.
8. For a particular input $x_i$, each of the $T$ machines makes a prediction $h_t, t=1,...,T$. $h_f$ is cumulative prediction using the $T$ predictors:
$$h_f = inf\bigg\{y \in Y: \sum_{t:h_t\leq y}log(\frac{1}{\beta_t})\geq\frac{1}{2}\sum_{t}log(\frac{1}{\beta_t})\bigg\}$$
This equation is essentially relabel $y_i$ such that $y_i^{(1)}<y_i^{(2)}<,...,y_i^{(T)}$. Then sum the $log(1/\beta_t)$ until we reach the smallest t that is equal or grater than $\frac{1}{2}\sum_{t}log(\frac{1}{\beta_t})$. This is the **weighted median**. If the $\beta_t$ were all equal, this would be the median.


In [10]:
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

In [11]:
%%time
param_grid = {'abr__n_estimators': [250],
              'abr__loss' : ['linear', 'square', 'exponential']}
abr_3 = Pipeline(steps=[('preprocessor', preprocessor),
                        ('abr', AdaBoostRegressor(DecisionTreeRegressor(max_depth=3)))])
abr_5 = Pipeline(steps=[('preprocessor', preprocessor),
                        ('abr', AdaBoostRegressor(DecisionTreeRegressor(max_depth=5)))])
abr_8 = Pipeline(steps=[('preprocessor', preprocessor),
                        ('abr', AdaBoostRegressor(DecisionTreeRegressor(max_depth=8)))])
for k, v in zip(['abr_3', 'abr_5', 'abr_8'], [abr_3, abr_5, abr_8]):
    gs_cv = GridSearchCV(v, param_grid, cv = 5)
    gs_cv.fit(X_train, np.ravel(y_train))
    print(k, gs_cv.best_params_, gs_cv.best_score_)

abr_3 {'abr__loss': 'exponential', 'abr__n_estimators': 250} 0.713203389204278
abr_5 {'abr__loss': 'linear', 'abr__n_estimators': 250} 0.765606456356722
abr_8 {'abr__loss': 'square', 'abr__n_estimators': 250} 0.7441814079195108
Wall time: 10min 34s


In [12]:
y_pred = gs_cv.predict(X_test)
print('R-squared:', gs_cv.score(X_test, y_test), 
      'RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))

R-squared: 0.8906824177427002 RMSE: 26158.37089278115


The prediction on test set has improved quite much after applying AdaBoosting.

### Gradient Boosting Machine (GBM) $_{[2]}$

#### AdaBoost vs GBM
As oppose to AdaBoost which takes $\textit{the weighted median}$ of predictions from ensembled regressions, GBM boosting is an $\textit{additive model}$ which adds up predicted values from ensembled regressions. 

#### General Form
The general form of gredient tree-boosting algorithm for regression could be expressed as
$$f(x) = \sum_{m=1}^{M}\beta_mb(x;\gamma_m),$$
where $\beta_m, m = 1,2,...,M$ are the expansion coefficients, and $b(x;\gamma)\in\mathbb{R}$ are usually simple functions of the multivariate argument $x$, characterized by a set of parameters $\gamma$. Specific algorithms are objtained by inserting different loss criteria $L(y, f(x))$

#### Algorithm
1. Initialize $f_0(x) = \operatorname*{argmin}_\gamma\sum_{i=1}^{N} L(y_i,\gamma)$.
2. For $m = 1$ to $M$:
    
    (a) For $i = 1,2,...,N$ compute
    $$r_{im} = -\bigg[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\bigg]_{f=f_{m-1}}.$$
    
    $r$ is referred as generalized or $\textit{pseudo}$ residuals. If we set the loss function as $\frac{1}{2}[y_i - f(x_i)]^2$, the gradient will be $y_i - f(x_i)$. This step is also where the name of the model is from.
    
    (b) Fit a regression tree to target $r_{im}$ giving terminal regions $R_{jm}, j = 1,2,...,J_m$. 
    
    Here we fit $r$, not $y$. $J$ is the number of leafs. Each iteration $m$ might have different number of leaves $J_m$.
    
    (c) For $j=1,2,...,J_m$ compute
    $$\gamma_{jm} = \operatorname*{argmin}_\gamma\sum_{x_i\in R_{jm}}L(y_i, f_{m-1}X_i + \gamma).$$
    
    (d) Update $f_m(x) = f_{m-1}(x) + \sum_{j=1}^{J_m}\gamma_{jm}I(x\in R{jm})$.

3. Output $\hat{f}(x) = f_M(x)$.

In [13]:
from sklearn.ensemble import GradientBoostingRegressor

In [14]:
%%time
param_grid = {'gbr__n_estimators': [500],
              'gbr__learning_rate': np.linspace(0.01, 0.15, 4),
              'gbr__max_depth': [3, 5, 8]}
gbr = Pipeline(steps=[('preprocessor', preprocessor),
                      ('gbr', GradientBoostingRegressor())])
gs_cv = GridSearchCV(gbr, param_grid, cv=5)
gs_cv.fit(X_train, np.ravel(y_train))
print(gs_cv.best_params_, gs_cv.best_score_)

{'gbr__max_depth': 3, 'gbr__n_estimators': 500, 'gbr__learning_rate': 0.056666666666666664} 0.7438394689713702
Wall time: 21min 25s


In [15]:
y_pred = gs_cv.predict(X_test)
print(gs_cv.score(X_test, y_test), np.sqrt(mean_squared_error(y_test, y_pred)))

0.9005000689587325 24956.12051050492


Gradient boosting regression generates better results than AdaBoosting. However, speed is the main issue to run more parameter tuning.

### XGBoost $_{[3]}$

#### GBM vs XGBoost
Gradient boosting machine, or gradient tree boosting, has been shown to give state-of-the-art results on many standard classification benchmarks, and XGBoost is a scalable machine learning system for tree boosting. The impact of XGBoost has been widely recognized in a number of machine learning and data mining challenges. The regularization term added after the loss function in XGBoost prevents over fitting. This approach improves generalization from training data and increases prediction performance for regression, ranking, and classification problems.

Moreover, XGBoost systerm runs more than ten times faster than existing popular solutions on a single machine and scales to billions of examples in distributed or memory-limited setting due to several important systems and algorithmic optimizations, including:
- A highly scalable end-to-end tree boosting system.
- A theoretically justified weighted quantile sketch for efficient proposal calculation.
- A novel sparsity-aware algorithm for parallel tree learning.
- An effective cache-aware block structure for out-of-core tree learning.


#### Algorithm
**Note: The process below is simplified for a brief discussion. For more details please refer to the document [3] listed in references.**

The system implements gradient boosting, which performs additive optimization in functional space, and incorporates a regularized model to prevent over fitting.

**Gradient Boosting:**

$$\hat{y}_i = \phi(x_i) = \sum_{k=1}^{K}f_k(x_i), f_k\in\textit{F},$$

where $\textit{F}={f(x)=w_{q(x)}}(q:\mathbb{R}^m \rightarrow T, w \in \mathbb{R}^T)$ is the space of regression trees (also known as CART). Here $q$ represents the structure of each tree that maps an example to the corresponding leaf index. $T$ is the number of leaves in the tree. Each $f_k$ corresponds to an independent tree structure $q$ and leaf weight $w$. We use $w_i$ to represent score on $i$-th leaf.

**XGBoost then minimize the following loss function with $\textit{regularized}$ objective:** 

$$\textit{L}(\phi) = \sum_{i}l(\hat{y}_i, y_i)+\sum_{k}\Omega(f_k),$$
$$where \ \ \Omega(f) = \gamma T + \frac{1}{2}\lambda\|w\|^2$$

In addictive manner, let $\hat{y}_i^t$ be the prediction of the $i$-th instance at the $t$-th iteration, we will need to add $f_t$ to minimize:

$$\textit{L}^{(t)} = \sum_{i=1}^{n}l(y_i, \hat{y}_i^{t-1}+f_t(x_i))+\sum_{k}\Omega(f_k),$$

Apply second-order approximation to quickly optimize the objective and then remove the constant.
$$ L^{(t)} \simeq \sum_{i=1}^{n}[l(y_i, \hat{y}_i^{t-1}) + g_if_t(x_i) + \frac{1}{2}h_if_t^2(x_i)] + \Omega(f_t) $$
$$ = \sum_{i=1}^{n}[g_if_t(x_i) + \frac{1}{2}h_if_t^2(x_i)] + \Omega(f_t), $$
where $g_i = \partial_{\hat{y}^{(t-1)}}l(y_i,\hat{y}^{(t-1)})$ and $h_i = \partial_{\hat{y}^{(t-1)}}^2l(y_i,\hat{y}^{(t-1)})$

Define $I_j = \{i|q(x_i)=j\}$ as the instance set of leaf $j$ and expand $\Omega$.
$$ \tilde{L}^{(t)} = \sum_{i=1}^{n}[g_if_t(x_i) + \frac{1}{2}h_if_t^2(x_i)]+ \gamma T + \frac{1}{2}\lambda\sum_{j=1}^{T}w_j^2$$
$$ = \sum_{j=1}^{T}[(\sum_{i \in I_j}g_i)w_j + \frac{1}{2}(\sum_{i\in I_j}h_i+\lambda)w_j^2]+\gamma T$$

For a fixed structure $q(x)$, we can compute the optimal weight $w_j^*$ of leaf $j$ by
$$ w_j^* = -\frac{\sum_{i\in I_j}g_i}{\sum_{i \in I_j}h_i + \lambda} $$

and calculate the corresponding optimal value by
$$ \tilde{L}^{(t)}(q) = -\frac{1}{2}\sum_{j=1}^{T}\frac{(\sum_{i \in I_j} g_i)^2}{\sum_{i \in I_j} h_i+\lambda}+\gamma T $$

Assume that $I_L$ and $I_R$ are the instance sets of left and right nodes after the split. Letting $I=I_L \cup I_R$, then the loss reduction after the split is given by
$$ L_{split} = \frac{1}{2} \bigg[ \frac{(\sum_{i \in I_L} g_i)^2}{\sum_{i \in I_L} h_i+\lambda} +
                                  \frac{(\sum_{i \in I_R} g_i)^2}{\sum_{i \in I_R} h_i+\lambda} -
                                  \frac{(\sum_{i \in I} g_i)^2}{\sum_{i \in I} h_i+\lambda} \bigg] - \gamma $$

This formula is usually used in practice for evaluating the split candicates.

In [16]:
import xgboost as xgb

In [17]:
%%time
# Use the same parameter as GBM
#data_dmatrix = xgb.DMatrix(data = X_train, label = y_train)
param_grid = {'xgbr__max_depth': [3, 5, 8],
              'xgbr__learning_rate': np.linspace(0.01, 0.15, 4),
              'xgbr__n_estimators': [500]}
xgbr = Pipeline(steps=[('preprocessor', preprocessor),
                       ('xgbr', xgb.XGBRegressor())])
gs_cv = GridSearchCV(xgbr, param_grid, cv=5)
gs_cv.fit(X_train, y_train)
print(gs_cv.best_params_, gs_cv.best_score_)

{'xgbr__max_depth': 3, 'xgbr__n_estimators': 500, 'xgbr__learning_rate': 0.056666666666666664} 0.7461741945197997
Wall time: 7min 8s


In [18]:
y_pred = gs_cv.predict(X_test)
print(gs_cv.score(X_test, y_test), np.sqrt(mean_squared_error(y_test, y_pred)))

0.9008914810201432 24906.986044489637


With the same settings in grid search of parameters, XGBoost runs about **65%** faster than GBM. In addition, both R2 and RMSE are both **better from XGBoost model** even though the parameters are the same. This is possibly due to different algorithms.

#### Randomized Grid Search

In [19]:
from sklearn.model_selection import RandomizedSearchCV

In [20]:
%%time
params = {'xgbr__max_depth': [3, 5, 8],
          'xgbr__learning_rate': np.linspace(0.01, 0.15, 4),
          'xgbr__n_estimators': [500]}
xgbr = Pipeline(steps=[('preprocessor', preprocessor),
                       ('xgbr', xgb.XGBRegressor())])
random_search = RandomizedSearchCV(xgbr, params, n_iter=5, 
                                   scoring='neg_mean_squared_error')
random_search.fit(X_train, y_train)
print(random_search.best_params_, random_search.best_score_)

{'xgbr__max_depth': 3, 'xgbr__n_estimators': 500, 'xgbr__learning_rate': 0.10333333333333332} -1406074824.6038938
Wall time: 3min 10s


In [21]:
y_pred = random_search.predict(X_test)
print(random_search.score(X_test, y_test), np.sqrt(mean_squared_error(y_test, y_pred)))

-618244281.5745714 24864.518526900363


Using random search would further save more than half of the tuning time with even better RMSE in final output.

#### More Parameter Tuning

In [22]:
%%time
# Grid Search
params = {'xgbr__max_depth': [3, 5, 7],
          'xgbr__learning_rate': np.linspace(0.001, 0.020, 11),
          'xgbr__n_estimators': [250, 500, 800],
          'xgbr__min_child_weight': [5, 8, 10],
          'xgbr__gamma': [0.5, 1, 1.5],
          'xgbr__subsample': [0.6, 0.8, 1.0],
          'xgbr__colsample_bytree': [0.6, 0.8, 1.0]
          }
xgbr = Pipeline(steps=[('preprocessor', preprocessor),
                       ('xgbr', xgb.XGBRegressor())])
random_search = RandomizedSearchCV(xgbr, params, n_iter=5, 
                                   scoring='neg_mean_squared_error')
random_search.fit(X_train, y_train)
print(random_search.best_params_, random_search.best_score_)

{'xgbr__subsample': 0.6, 'xgbr__max_depth': 3, 'xgbr__n_estimators': 500, 'xgbr__learning_rate': 0.0181, 'xgbr__gamma': 0.5, 'xgbr__colsample_bytree': 0.8, 'xgbr__min_child_weight': 10} -1108259264.8463147
Wall time: 3min 11s


In [23]:
y_pred = random_search.predict(X_test)
print(random_search.score(X_test, y_test), np.sqrt(mean_squared_error(y_test, y_pred)))

-634106579.2699922 25181.472936863567


More parameter tuing is an option for XGBoost with random search as speed doesn't seem to be an issue.

## Performance on Kaggle
### GradientBoostingRegressor from `sklearn`

In [24]:
%%time
# GBM: 0.13090 1994/5041
param_grid = {'gbr__n_estimators': [500],
              'gbr__learning_rate': np.linspace(0.01, 0.15, 4),
              'gbr__max_depth': [3, 5]}
gbr = Pipeline(steps=[('preprocessor', preprocessor),
                      ('gbr', GradientBoostingRegressor())])
gs_cv = GridSearchCV(gbr, param_grid, cv=5)
gs_cv.fit(X, np.ravel(y))

y_pred_kaggle = gs_cv.predict(df_test)
pd.DataFrame({'Id': df_test.Id.values,
              'SalePrice': np.squeeze(y_pred_kaggle)}
            ).to_csv('../project-house-price-prediction/data/pred.csv', index=False)

Wall time: 12min 6s


In [25]:
print(gs_cv.best_params_)

{'gbr__max_depth': 3, 'gbr__n_estimators': 500, 'gbr__learning_rate': 0.056666666666666664}


Compared to the submissions in notebook models-and-algorithms, the submission of gradient boosting regression on Kaggle is much better.

### XGBoost from `xgboost`

In [26]:
%%time
# XGBoost: 0.12391 1495/5047
params = {'xgbr__max_depth': [6, 7, 8],
          'xgbr__learning_rate': np.linspace(0.001, 0.020, 11),
          'xgbr__n_estimators': [500, 800, 1200],
          'xgbr__min_child_weight': [10, 12, 14],
          'xgbr__gamma': [0, 0.5, 1],
          'xgbr__subsample': [0.6, 0.7, 0.8],
          'xgbr__colsample_bytree': [0.4, 0.5, 0.6]
          }
xgbr = Pipeline(steps=[('preprocessor', preprocessor),
                       ('xgbr', xgb.XGBRegressor())])
random_search = RandomizedSearchCV(xgbr, params, n_iter=5, 
                                   scoring='neg_mean_squared_error')
random_search.fit(X, np.ravel(y))
y_pred_kaggle = random_search.predict(df_test)
pd.DataFrame({'Id': df_test.Id.values,
              'SalePrice': np.squeeze(y_pred_kaggle)}
            ).to_csv('../project-house-price-prediction/data/pred.csv', index=False)

Wall time: 5min 12s


Each time the random search might result in different best parameters. Within all trials, the best submission truned out to come from model with the following parameters, with score 0.12391 and ranks top **30%** at the time of submission.

- {'xgbr__n_estimators': 1200, 'xgbr__gamma': 0, 'xgbr__learning_rate': 0.0162, 'xgbr__min_child_weight': 12, 'xgbr__subsample': 0.6, 'xgbr__max_depth': 6, 'xgbr__colsample_bytree': 0.5}

In [29]:
# print(random_search.best_params_)

### Executive Summary
- Compared to single models, both GradientBoostingRegressor and XGBoost have better generalization and suffer less over-fitting, resulting in better submission scores on Kaggle.
- The biggest advantage of using XGBoost is the speed, which allows more trails on parameter tuning to improve model. 
- In addition, with the same parameters, XGBoost also seems to generate slightly better R-2 and RMSE than GradientBoostingRegressor, possibly thanks to the enhanced algorithm under the hood.

## References
1. H. Drucker, “Improving Regressors using Boosting Techniques”, 1997.
2. T. Hastie, R. Tibshirani and J. Friedman, Elements of Statistical Learning Ed. 2, Springer, 2009.
3. Tianqi Chen and Carlos Guestrin, "XGBoost: A Scalable Tree Boosting System", 2016.