# [Learning to Use XGBoost](https://www.kaggle.com/dansbecker/learning-to-use-xgboost)

**XGBoost** is the leading model for working with standard tabular data (the type of data you store in pandas DataFrames, as opposed to more exotic types of data like images and videos).  
XGBoost models do well in many Kaggle competitions.  
To reach peak accuracy, XGBoost models require more knowledge and model tuning than techniques like Random Forest. After this tutorial, you'll be able to:  
* Follow the full modeling workflow with XGBoost, and    
* Fine-tune XGBoost models for optimal performance

XGBoost is an implementation of the Gradient Boosted Decision Trees algorithm. (scikit-learn has another implementation of this algorithm, but XGBoost has some technical advantages.)  
What are Gradient Boosted Decision Trees?

![XGBoost](img/xgboost.png)

We go through cycles that repeatedly build new models and combine them into an **ensemble** model.  
We start the cycle by calculating the errors for each observation in the dataset.  
We then build a new model to predict those errors.  
We add predictions from this error-predicting model to the ensemble of models.  
To make a prediction, we add together the predictions from all of the models already in the ensemble.  
We can use these predictions to calculate new errors, build the next model, and add it to the ensemble.  
There's one piece outside that cycle.  
We need some base prediction to start the cycle.  
In practice, the initial predictions can be pretty naive.  
Even if the predictions are wildly inaccurate, subsequent additions to the ensemble will address those errors.  
This process may sound complicated, but the code to use it is straightforward.  
We'll fill in some additional explanatory details in the model tuning section below.  
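
To make the cycle concrete, here is a minimal sketch of it in plain Python. It is only an illustration under simplifying assumptions: the weak models are one-split "stumps" on a single feature, the toy data and all function names are hypothetical, and it leaves out refinements that XGBoost itself applies (such as the learning rate and regularization).

```python
# Toy gradient boosting for squared error on 1-D data.
# Each weak model is a "stump": one threshold, one constant prediction per side.
# All names below are hypothetical and exist only for this illustration.

def fit_stump(xs, targets):
    """Return the one-split model that best fits targets in the least-squares sense."""
    best = None
    for threshold in sorted(set(xs))[:-1]:      # drop the max so both sides are non-empty
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def mean_squared_error(ys, predictions):
    return sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [x ** 2 for x in xs]                       # toy target

base = sum(ys) / len(ys)                        # naive base prediction: the mean
ensemble = []                                   # the error-predicting models
predictions = [base] * len(ys)
history = [mean_squared_error(ys, predictions)]

for _ in range(20):
    errors = [y - p for y, p in zip(ys, predictions)]   # errors of the current ensemble
    stump = fit_stump(xs, errors)                       # new model that predicts those errors
    ensemble.append(stump)                              # add it to the ensemble
    # Predict with the base value plus every model built so far.
    predictions = [base + sum(model(x) for model in ensemble) for x in xs]
    history.append(mean_squared_error(ys, predictions))

print("training MSE before boosting:", history[0])
print("training MSE after 20 cycles:", history[-1])
```

Even though the base prediction (the mean) is crude, each cycle fits the remaining errors, so the training error never increases from one cycle to the next.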

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer  # replaces the removed sklearn.preprocessing.Imputer

data = pd.read_csv('input/train.csv')
# Drop rows with no target value, then keep only the numeric predictors.
data.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = data.SalePrice
X = data.drop(['SalePrice'], axis=1).select_dtypes(exclude=['object'])
# to_numpy() replaces the removed as_matrix() method.
train_X, test_X, train_y, test_y = train_test_split(X.to_numpy(), y.to_numpy(), test_size=0.25)
# Fit the imputer on the training split only, then reuse it on the test split.
my_imputer = SimpleImputer()
train_X = my_imputer.fit_transform(train_X)
print("First entry of train_X :\n", train_X[:1])
print()
test_X = my_imputer.transform(test_X)
print("First entry of test_X :\n", test_X[:1])

First entry of train_X :
 [[9.4900e+02 6.0000e+01 6.5000e+01 1.4006e+04 7.0000e+00 5.0000e+00
  2.0020e+03 2.0020e+03 1.4400e+02 0.0000e+00 0.0000e+00 9.3600e+02
  9.3600e+02 9.3600e+02 8.4000e+02 0.0000e+00 1.7760e+03 0.0000e+00
  0.0000e+00 2.0000e+00 1.0000e+00 3.0000e+00 1.0000e+00 7.0000e+00
  1.0000e+00 2.0020e+03 2.0000e+00 4.7400e+02 1.4400e+02 9.6000e+01
  0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 2.0000e+00
  2.0060e+03]]

First entry of test_X :
 [[1.1060e+03 6.0000e+01 9.8000e+01 1.2256e+04 8.0000e+00 5.0000e+00
  1.9940e+03 1.9950e+03 3.6200e+02 1.0320e+03 0.0000e+00 4.3100e+02
  1.4630e+03 1.5000e+03 1.1220e+03 0.0000e+00 2.6220e+03 1.0000e+00
  0.0000e+00 2.0000e+00 1.0000e+00 3.0000e+00 1.0000e+00 9.0000e+00
  2.0000e+00 1.9940e+03 2.0000e+00 7.1200e+02 1.8600e+02 3.2000e+01
  0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 4.0000e+00
  2.0100e+03]]


Now we can build and fit a model just as we would in `sklearn`:

In [2]:
from xgboost import XGBRegressor

my_model = XGBRegressor()
# verbose=False in fit() below suppresses the updates printed with each cycle.
# Don't forget to examine the parameters displayed when the model is built.
# Tuning those parameters properly may improve the model's performance.
my_model.fit(train_X, train_y, verbose=False)

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)

And now on to evaluating the model and making predictions, also like in scikit-learn.

In [3]:
predictions = my_model.predict(test_X)
predictions[:5]

array([293203.94 ,  97959.83 , 129877.266, 148022.8  , 167368.7  ],
      dtype=float32)

In [4]:
from sklearn.metrics import mean_absolute_error

# scikit-learn's convention is (y_true, y_pred); MAE is symmetric, so the value is unchanged.
print("Mean Absolute Error:\n", mean_absolute_error(test_y, predictions))

Mean Absolute Error:
 17306.776466181505


## Model Tuning

XGBoost has a number of parameters that can dramatically affect your model's accuracy and speed.  
Some significant parameters are:  

**`n_estimators` and `early_stopping_rounds`:**  
`n_estimators` specifies how many times the modeling cycle is repeated.  
In the underfitting vs overfitting graph below, increasing `n_estimators` moves you further to the right.  
Too low a value causes underfitting, which will result in inaccurate predictions on both training data and new data. Too large a value causes overfitting, which means accurate predictions on training data, but inaccurate predictions on new data (which is what we care about).  
You can experiment with your dataset to find the ideal.  
Typical values range from 100 to 1000, though this depends a lot on the learning rate discussed below.

![Underfitting vs Overfitting](img/underfit_vs_overfit.png)

**`early_stopping_rounds`** offers a way to automatically find the ideal value.  
Early stopping tells the program to stop iterating when the validation score stops improving.  
One effective technique is to set a relatively high value for `n_estimators` and then use `early_stopping_rounds` to figure out when to stop.  
Since random chance sometimes causes a single round where validation scores don't improve, you need to specify how many consecutive rounds without improvement to allow before stopping.  
`early_stopping_rounds = 5` is a reasonable value to experiment with.  
With that setting, training stops after 5 consecutive rounds in which the validation score fails to improve.  
Here is the code to fit with early stopping:

In [10]:
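# Ask for many cycles; early stopping ends training once the validation error stops improving.
# Note: recent XGBoost releases expect early_stopping_rounds in the XGBRegressor(...)
# constructor rather than in fit().
my_model = XGBRegressor(n_estimators=1000)
my_model.fit(train_X, train_y, early_stopping_rounds=5,
             eval_set=[(test_X, test_y)], verbose=False)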

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=1000,
       n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)
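
The stopping rule itself is simple enough to sketch in a few lines of plain Python. This is a simplified illustration of the idea, not XGBoost's actual implementation; the function name `stopping_round` and the list of validation errors are hypothetical.

```python
def stopping_round(validation_errors, patience):
    """Return (best_round, stop_round), both zero-based.

    Training halts at the first round that is `patience` rounds past the
    best validation error seen so far. If that never happens, it runs to the end.
    """
    best_round = 0
    for current_round, error in enumerate(validation_errors):
        if error < validation_errors[best_round]:
            best_round = current_round          # new best: reset the countdown
        elif current_round - best_round >= patience:
            return best_round, current_round    # patience exhausted: stop here
    return best_round, len(validation_errors) - 1


# Hypothetical validation errors, one per boosting round (lower is better).
errors = [30.0, 24.0, 21.0, 20.0, 20.5, 20.2, 20.4, 20.9, 21.3, 22.0]
best_round, stop_round = stopping_round(errors, patience=5)
print("best round:", best_round, "stopped at round:", stop_round)
```

In this made-up sequence the error is lowest at round 3, and training stops at round 8, five rounds later. A smaller `patience` stops sooner but is more easily fooled by a few unlucky rounds.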

When using `early_stopping_rounds`, you need to set aside some of your data for checking the number of rounds to use.  
If you later want to fit a model with all of your data, set `n_estimators` to whatever value you found to be optimal when run with early stopping (after fitting, the model records the best round as `my_model.best_iteration`, counted from zero).  

**learning_rate**  
