The XGBoost algorithm has recently gained momentum and popularity in data-science
competitions such as Kaggle (https://www.kaggle.com/) and the KDD Cup
2015. As the authors (Tianqi Chen, Tong He, and Carlos Guestrin) report in their papers on
the algorithm, among 29 challenges held on Kaggle during 2015, 17 winning solutions used XGBoost
as a standalone solution or as part of an ensemble of multiple different models.
In their paper, XGBoost: A Scalable Tree Boosting System (which can be found at
http://learningsys.org/papers/LearningSys_2015_paper_32.pdf), the authors report that XGBoost
was also used by every team that finished in the top 10 of the KDD Cup 2015.
Besides performing well in both accuracy and computational efficiency, XGBoost also
scales in several respects. It represents a new generation of GBM
algorithms thanks to important tweaks to the original tree-boosting GBM algorithm:

A sparse-aware algorithm; it can leverage sparse matrices, saving both memory (no
need for dense matrices) and computation time (zero values are handled in a special
way).

Approximate tree learning (weighted quantile sketch), which yields similar results in
much less time than the classical exhaustive exploration of possible branch cuts.

Parallel computing on a single machine (using multi-threading in the phase of the search
for the best split) and similarly distributed computations on multiple ones.

Out-of-core computations on a single machine leveraging a data storage solution called
Column Block. This arranges data on disk by columns, thus saving time by pulling data
from the disk in the form that the optimization algorithm (which works on column vectors) expects.
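
The approximate tree learning item in the list above can be illustrated with a small sketch. The code below is not XGBoost's weighted quantile sketch: it is a simplified, unweighted stand-in that uses plain quantiles of one feature as candidate thresholds and compares the result with an exhaustive search. The function names (`sse_after_split`, `best_split`) and the synthetic data are assumptions made for illustration only.

```python
import numpy as np

def sse_after_split(x, y, threshold):
    """Sum of squared errors when each side of the split predicts its own mean."""
    sse = 0.0
    for side in (y[x <= threshold], y[x > threshold]):
        if side.size:
            sse += float(((side - side.mean()) ** 2).sum())
    return sse

def best_split(x, y, candidates):
    """Return (threshold, sse) for the best threshold among the candidates."""
    scores = [sse_after_split(x, y, t) for t in candidates]
    i = int(np.argmin(scores))
    return float(candidates[i]), scores[i]

# Synthetic data (assumption): a step in the target at x = 6, plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=2000)
y = np.where(x > 6.0, 3.0, 1.0) + rng.normal(0, 0.1, size=x.size)

# Exact search: every distinct feature value is a candidate cut.
exact_candidates = np.unique(x)
exact_thr, exact_sse = best_split(x, y, exact_candidates)

# Approximate search: only 32 interior quantiles of the feature are candidates.
approx_candidates = np.quantile(x, np.linspace(0, 1, 34)[1:-1])
approx_thr, approx_sse = best_split(x, y, approx_candidates)

print(len(exact_candidates), "exact candidates, best cut at", exact_thr)
print(len(approx_candidates), "approximate candidates, best cut at", approx_thr)
```

The approximate search evaluates far fewer thresholds and still lands near the true step, at the cost of a slightly higher error than the exhaustive search.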

XGBoost can also deal with missing data in an effective way. Other tree ensembles
based on standard decision trees require missing data to be imputed first using an off-scale
value, such as a negative number, so that the tree can develop an appropriate branch
for the missing values.

XGBoost, instead, first fits all the non-missing values. After having created the branching for the
variable, it decides which branch is better for the missing values to take in order to minimize the
prediction error. Such an approach leads both to more compact trees and to an effective
imputation strategy, which yields more predictive power.
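
The default-direction idea described above can be sketched in a few lines. This is a simplified illustration for a single, already chosen split and a squared-error criterion, not XGBoost's actual implementation (which scores splits from gradient statistics). The helper name `choose_default_direction` and the toy data are assumptions.

```python
import numpy as np

def choose_default_direction(x, y, threshold):
    """Pick the branch ('left' or 'right') that rows with a missing x should follow.

    The split x <= threshold is taken as given. Rows where x is NaN are tried
    on each side in turn, and the side with the lower squared error wins.
    """
    missing = np.isnan(x)
    goes_left = np.zeros(x.shape, dtype=bool)
    goes_left[~missing] = x[~missing] <= threshold

    errors = {}
    for direction in ("left", "right"):
        left_mask = (goes_left | missing) if direction == "left" else goes_left
        sse = 0.0
        for mask in (left_mask, ~left_mask):
            if mask.any():
                sse += float(((y[mask] - y[mask].mean()) ** 2).sum())
        errors[direction] = sse
    return min(errors, key=errors.get), errors

# Toy data (assumption): the two rows with a missing feature have targets
# that resemble the rows on the right-hand side of the split.
x = np.array([1.0, 2.0, 3.0, 7.0, 8.0, 9.0, np.nan, np.nan])
y = np.array([1.0, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9])

direction, errors = choose_default_direction(x, y, threshold=5.0)
print(direction, errors)
```

No off-scale placeholder value is needed here: the missing rows are simply routed to whichever branch fits them better.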

XGBoost works much like AdaBoost, with one key difference: the means by which the model is improved.
At each iteration, XGBoost seeks to improve the performance of the existing ensemble by reducing its
residuals (the differences between the targets and the ensemble's predictions). At every iteration, the model
added is the one best able to reduce the existing ensemble's residuals. This is
analogous to gradient descent (where a function is iteratively minimized by moving against its gradient);
hence the name Gradient Boosting.
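
The residual-fitting loop described above can be sketched as follows. This is a bare-bones gradient boosting regressor with squared error and one-split stumps on a single feature, written for illustration; it leaves out the regularization, second-order gradient information, and engineering that XGBoost adds. All names (`fit_stump`, `predict`, `boost`) and the synthetic data are hypothetical.

```python
import numpy as np

def fit_stump(x, residuals):
    """Fit a one-split regression stump to the residuals (1-D feature)."""
    best = None
    # Skip the largest value so the right-hand side is never empty.
    for threshold in np.unique(x)[:-1]:
        left = x <= threshold
        left_value = residuals[left].mean()
        right_value = residuals[~left].mean()
        pred = np.where(left, left_value, right_value)
        sse = float(((residuals - pred) ** 2).sum())
        if best is None or sse < best[0]:
            best = (sse, threshold, left_value, right_value)
    return best[1:]

def predict(base, stumps, learning_rate, x):
    """Sum the base value and the shrunken contribution of every stump."""
    out = np.full(x.shape, base, dtype=float)
    for threshold, left_value, right_value in stumps:
        out += learning_rate * np.where(x <= threshold, left_value, right_value)
    return out

def boost(x, y, n_rounds=50, learning_rate=0.1):
    base = float(y.mean())
    prediction = np.full(y.shape, base)
    stumps, history = [], []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is the residual itself.
        residuals = y - prediction
        threshold, left_value, right_value = fit_stump(x, residuals)
        stumps.append((threshold, left_value, right_value))
        prediction = prediction + learning_rate * np.where(
            x <= threshold, left_value, right_value)
        history.append(float(((y - prediction) ** 2).mean()))
    return base, stumps, history

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 6, size=200))
y = np.sin(x) + rng.normal(0, 0.1, size=x.size)

base, stumps, history = boost(x, y)
print("training MSE after 1 round:", history[0])
print("training MSE after 50 rounds:", history[-1])
```

Each round adds the stump that best explains what the current ensemble still gets wrong, and the learning rate shrinks its contribution so that the training error falls gradually.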
Gradient Boosting has proven to be highly successful in recent Kaggle contests, where it has supported the
winners of the CrowdFlower Competition and Microsoft Malware Classification Challenge, along with many
other structured data competitions in the second half of 2015.

In [None]:
from sklearn.datasets import load_svmlight_file
# sklearn.cross_validation was removed from scikit-learn; use model_selection instead
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
import pickle
# Load the pickled Covertype dataset and close the file afterwards
with open("covertype_dataset.pickle", "rb") as f:
    covertype_dataset = pickle.load(f)
covertype_dataset.target = covertype_dataset.target.astype(int)
# Shift the labels from 1..7 to 0..6 and split into train / validation / test
covertype_X = covertype_dataset.data[:15000,:]
covertype_y = covertype_dataset.target[:15000] -1
covertype_val_X = covertype_dataset.data[15000:20000,:]
covertype_val_y = covertype_dataset.target[15000:20000] -1
covertype_test_X = covertype_dataset.data[20000:25000,:]
covertype_test_y = covertype_dataset.target[20000:25000] -1

In [None]:
import xgboost as xgb
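# Recent XGBoost releases take eval_metric and early_stopping_rounds in the
# constructor; older releases accepted them as arguments to fit() instead.
hypothesis = xgb.XGBClassifier(objective="multi:softprob", max_depth=24, gamma=0.1,
                               subsample=0.90, learning_rate=0.01, n_estimators=500,
                               n_jobs=-1, eval_metric='merror', early_stopping_rounds=25)
# Stop adding trees once the validation error has not improved for 25 rounds
hypothesis.fit(covertype_X, covertype_y, eval_set=[(covertype_val_X, covertype_val_y)], verbose=False)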

In [None]:
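# Native-API equivalent of the data above (train_X / train_Y were undefined here;
# the Covertype arrays from the earlier cell are assumed to be the intended inputs)
xg_train = xgb.DMatrix(covertype_X, label=covertype_y)
xg_test = xgb.DMatrix(covertype_test_X, label=covertype_test_y)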

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix
# Predict once and reuse the result for both metrics
test_predictions = hypothesis.predict(covertype_test_X)
print('test accuracy:', accuracy_score(covertype_test_y, test_predictions))
print(confusion_matrix(covertype_test_y, test_predictions))

In [None]:
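import numpy as np
import scipy.sparse
import xgboost as xgb
import os
# sklearn.cross_validation was removed from scikit-learn; use model_selection instead
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import mean_squared_error
# Note: here pd holds the California housing dataset, not the pandas module
# (the original pandas import was dropped because this name shadowed it)
pd = fetch_california_housing()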

In [None]:
from urllib.request import urlretrieve
from sklearn.datasets import dump_svmlight_file
from sklearn.datasets import load_svmlight_file
# Download the compressed training set and re-save it uncompressed in svmlight format
urlretrieve("http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/poker.bz2", "pokertrain.bz2")
X, y = load_svmlight_file('pokertrain.bz2')
dump_svmlight_file(X, y, 'pokertrain', zero_based=True, query_id=None, multilabel=False)
# Do the same for the test set
urlretrieve("http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/poker.t.bz2", "pokertest.bz2")
X, y = load_svmlight_file('pokertest.bz2')
dump_svmlight_file(X, y, 'pokertest', zero_based=True, query_id=None, multilabel=False)
del X, y
from sklearn.metrics import classification_report
import numpy as np
import xgboost as xgb
# The '#name.cache' suffix asks XGBoost to build an on-disk cache (out-of-core mode).
# Recent XGBoost releases may also need a format hint before the '#', such as '?format=libsvm'.
dtrain = xgb.DMatrix('/yourpath/pokertrain#dtrain.cache')
dtest = xgb.DMatrix('/yourpath/pokertest#dtestin.cache')