# [Learning to Use XGBoost](https://www.kaggle.com/dansbecker/learning-to-use-xgboost)

**XGBoost** is the leading model for working with standard tabular data (the type of data you store in pandas DataFrames, as opposed to more exotic types of data like images and videos).  
XGBoost models do well in many Kaggle competitions.  
To reach peak accuracy, XGBoost models require more knowledge and model tuning than techniques like Random Forest. After this tutorial, you'll be able to:  
* Follow the full modeling workflow with XGBoost, and    
* Fine-tune XGBoost models for optimal performance

XGBoost is an implementation of the Gradient Boosted Decision Trees algorithm. (scikit-learn has another implementation of this algorithm, but XGBoost has some technical advantages.)  
What are Gradient Boosted Decision Trees?

![XGBoost](img/xgboost.png)

We go through cycles that repeatedly build new models and combine them into an **ensemble** model.  
We start the cycle by calculating the errors for each observation in the dataset.  
We then build a new model to predict those errors.  
We add predictions from this error-predicting model to the ensemble of models.  
To make a prediction, we add up the predictions from all the models built so far.  
We can use these predictions to calculate new errors, build the next model, and add it to the ensemble.  
There's one piece outside that cycle.  
We need some base prediction to start the cycle.  
In practice, the initial predictions can be pretty naive.  
Even if the predictions are wildly inaccurate, subsequent additions to the ensemble will address those errors.  
This process may sound complicated, but the code to use it is straightforward.  
We'll fill in some additional explanatory details in the model tuning section below.  
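
To make the cycle concrete, here is a minimal sketch in plain Python. It is a toy illustration, not XGBoost's actual implementation: the helper names (`fit_stump`, `fit_ensemble`, `predict`, `mean_abs_error`), the synthetic data, and the choice of one-split trees ("stumps") fit by squared error are all assumptions made for this example.

```python
import math
import random


def fit_stump(xs, targets):
    """Fit a one-split regression tree (a "stump") to a single feature."""
    best = None
    # Try each distinct value (except the smallest) as a split point,
    # so that both sides of the split are never empty.
    for threshold in sorted(set(xs))[1:]:
        left = [t for x, t in zip(xs, targets) if x < threshold]
        right = [t for x, t in zip(xs, targets) if x >= threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x < threshold else right_mean


def fit_ensemble(xs, ys, n_cycles=20):
    """Run the cycle: compute errors, fit a model to them, add it to the ensemble."""
    base_prediction = sum(ys) / len(ys)  # naive base prediction: the mean target
    current = [base_prediction] * len(ys)
    models = []
    for _ in range(n_cycles):
        errors = [y - p for y, p in zip(ys, current)]  # error for each observation
        model = fit_stump(xs, errors)                  # new model predicts those errors
        models.append(model)                           # add it to the ensemble
        current = [p + model(x) for p, x in zip(current, xs)]
    return base_prediction, models


def predict(base_prediction, models, x):
    """Add the base prediction and the correction from every model."""
    return base_prediction + sum(model(x) for model in models)


def mean_abs_error(ys, preds):
    return sum(abs(y - p) for y, p in zip(ys, preds)) / len(ys)


# Synthetic one-feature dataset (hypothetical, for illustration only).
random.seed(0)
demo_x = [random.uniform(-3, 3) for _ in range(100)]
demo_y = [math.sin(x) + random.gauss(0, 0.1) for x in demo_x]

base, models = fit_ensemble(demo_x, demo_y)
base_mae = mean_abs_error(demo_y, [base] * len(demo_y))
ensemble_mae = mean_abs_error(demo_y, [predict(base, models, x) for x in demo_x])
print("MAE of the base prediction alone:", base_mae)
print("MAE of the 20-model ensemble:", ensemble_mae)
```

The base prediction here is just the mean of the target, which is about as naive as it gets. The training error still drops as models are added, because each new stump is fit to whatever error the ensemble has left.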

In [13]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

data = pd.read_csv('input/train.csv')
# Drop rows with a missing target, then keep only the numeric predictors.
data.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = data.SalePrice
X = data.drop(['SalePrice'], axis=1).select_dtypes(exclude=['object'])
train_X, test_X, train_y, test_y = train_test_split(X.to_numpy(), y.to_numpy(), test_size=0.25)
# Fill missing values with column means learned from the training split.
my_imputer = SimpleImputer()
train_X = my_imputer.fit_transform(train_X)
print("First entry of train_X :\n", train_X[:1])
print()
test_X = my_imputer.transform(test_X)
print("First entry of test_X :\n", test_X[:1])

First entry of train_X :
 [[8.780e+02 6.000e+01 7.400e+01 8.834e+03 9.000e+00 5.000e+00 2.004e+03
  2.005e+03 2.160e+02 1.170e+03 0.000e+00 2.920e+02 1.462e+03 1.462e+03
  7.620e+02 0.000e+00 2.224e+03 1.000e+00 0.000e+00 2.000e+00 1.000e+00
  4.000e+00 1.000e+00 1.000e+01 1.000e+00 2.004e+03 3.000e+00 7.380e+02
  1.840e+02 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
  6.000e+00 2.009e+03]]

First entry of test_X :
 [[8.200e+02 1.200e+02 4.400e+01 6.371e+03 7.000e+00 5.000e+00 2.009e+03
  2.010e+03 1.280e+02 7.330e+02 0.000e+00 6.250e+02 1.358e+03 1.358e+03
  0.000e+00 0.000e+00 1.358e+03 1.000e+00 0.000e+00 2.000e+00 0.000e+00
  2.000e+00 1.000e+00 6.000e+00 1.000e+00 2.010e+03 2.000e+00 4.840e+02
  1.920e+02 3.500e+01 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
  6.000e+00 2.010e+03]]


Now we can build and fit a model just as we would in `sklearn`:

In [14]:
from xgboost import XGBRegressor

my_model = XGBRegressor()
# verbose=False keeps fit() from printing updates with each cycle.
# Don't forget to examine the parameters displayed when the model is built;
# tuning those parameters properly may improve the model's performance.
my_model.fit(train_X, train_y, verbose=False)

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)

Next we make predictions and evaluate the model, again just as we would in scikit-learn.

In [15]:
predictions = my_model.predict(test_X)
predictions[:5]

array([212101.77, 181076.14, 116821.5 , 188081.03, 183677.23],
      dtype=float32)

In [16]:
from sklearn.metrics import mean_absolute_error

print("Mean Absolute Error:\n", mean_absolute_error(test_y, predictions))

Mean Absolute Error:
 16368.89558005137


## Model Tuning