# XGBoost.jl - quickest ways to win data science competitions

Here is an example of how a single non-ensembled model can achieve high ranking scores using XGBoost, which is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable.

## Highlights of XGBoost

* ### Distributed on Cloud
  Supports distributed training on multiple machines, including AWS, GCE, Azure, and Yarn clusters. Can be integrated with Flink, Spark and other cloud dataflow systems.
* ### Battle-tested
  Wins many data science and machine learning challenges. Used in production by multiple companies.
* ### Flexible
  Supports regression, classification, ranking and user defined objectives.

### Distinguishing poisonous vs edible mushrooms

Based on 8124 instances and 22 attributes such as odor, habitat, and color, we can easily and accurately classify mushrooms as edible or poisonous. The few mushrooms of unknown edibility are not recommended for eating and are counted as poisonous.

The Agaricus genus contains the most widely consumed and best-known mushrooms today, but it includes poisonous species as well. The dataset consists of 8124 observations from the Agaricus and Lepiota family. It is a multivariate dataset with 22 characteristic attributes, classified into 2 classes: edible and poisonous.

![Agaricus californicus](ACP.jpg) ![Agaricus campestris](ACE2.jpg)

In [26]:
using XGBoost
include("$(Pkg.dir())/MLDemos/src/xgboost/mushroom.jl")
path = "$(Pkg.dir())/MLDemos/";



#### We use an auxiliary function to read the LIBSVM format into a Julia Matrix.

Each line represents a single instance. In the first example line below, the leading '1' is the instance label, the numbers before each colon ('2', '9', '10', ...) are feature indices, and the number after each colon ('1') is the feature value.

Ex. :

1 2:1 9:1 10:1 20:1 29:1 33:1 35:1 39:1 40:1 52:1 57:1 64:1 68:1 76:1 85:1 87:1 91:1 94:1 101:1 104:1 116:1 123:1

0 2:1 9:1 19:1 20:1 22:1 33:1 35:1 38:1 40:1 52:1 55:1 64:1 68:1 76:1 85:1 87:1 91:1 94:1 101:1 105:1 115:1 119:1
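
To make the format concrete, here is a minimal sketch of parsing one such line into a label and a dense feature row. `parse_libsvm_line` is a hypothetical helper written for illustration, not the `readlibsvm` function from `mushroom.jl`, and it assumes that feature indices in the file are 0-based.

```julia
# Hypothetical helper (not the readlibsvm from mushroom.jl): parse one
# LIBSVM-formatted line into a label and a dense feature row.
function parse_libsvm_line(line::AbstractString, num_features::Integer)
    tokens = split(line)                   # whitespace-separated fields
    label = parse(Float32, tokens[1])      # the first field is the label
    row = zeros(Float32, num_features)     # features not listed stay 0
    for token in tokens[2:end]
        index_str, value_str = split(token, ':')
        index = parse(Int, index_str)
        # Assumption: indices in the file are 0-based, so shift by one
        # to address Julia's 1-based arrays.
        row[index + 1] = parse(Float32, value_str)
    end
    return label, row
end

label, row = parse_libsvm_line("1 2:1 9:1 10:1 20:1", 126)
println(label)                   # 1.0
println(findall(!iszero, row))   # [3, 10, 11, 21]
```

Because every feature that is not listed is implicitly zero, the format stays compact for one-hot encoded data like this.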

In [27]:
train_X, train_Y = readlibsvm("$(path)data/mushroom/agaricus.txt.train", (6513, 126));
test_X, test_Y = readlibsvm("$(path)data/mushroom/agaricus.txt.test", (1611, 126));

#### Basic training using XGBoost:
You can pass Julia's matrix directly as data.

In [3]:
num_round = 2;

print("training xgboost with dense matrix\n");
@time bst1 = xgboost(train_X, num_round, label = train_Y, eta=1, max_depth=2, objective="binary:logistic");


training xgboost with dense matrix


[1]	train-error:0.046522
[2]	train-error:0.022263


  0.921140 seconds (1.21 M allocations: 62.616 MB, 1.04% gc time)
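
With `objective="binary:logistic"`, the model is trained on the log loss and predictions come back as probabilities. The arithmetic behind that objective can be sketched in plain Julia. The names `sigmoid` and `logistic_grad_hess` are hypothetical and are not part of XGBoost.jl; the library computes these quantities internally.

```julia
# Hypothetical helpers for illustration only.

# Map a raw margin (the sum of tree outputs) to a probability in (0, 1).
sigmoid(margin) = 1 / (1 + exp(-margin))

# First and second derivatives of the log loss with respect to the raw
# margin, for a label y in {0, 1}.
function logistic_grad_hess(margin, y)
    p = sigmoid(margin)
    return p - y, p * (1 - p)
end

grad, hess = logistic_grad_hess(0.0, 1.0)
println((grad, hess))   # (-0.5, 0.25)
```

A negative gradient for a positive label means the next round should push the margin up for that instance.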


#### Alternatively, you can pass a sparse matrix as data and pass the parameters as a map

In [8]:
print("training xgboost with sparse matrix\n");
sptrain = sparse(train_X);
param = Dict("max_depth"=>2, "eta"=>1, "objective"=>"binary:logistic")
@time bst = xgboost(sptrain, num_round, label = train_Y, param=param)

training xgboost with sparse matrix
  0.005247 seconds (131 allocations: 1.648 MB)


[1]	train-error:0.046522
[2]	train-error:0.022263


XGBoost.Booster(Ptr{Void} @0x00000000045db0c0)
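
A sparse matrix pays off here because the one-hot encoded mushroom features are mostly zeros: the loader output further down reports 143286 stored entries for the 6513x126 training matrix, which is 22 nonzeros per row. The sketch below shows the dense-to-sparse conversion on a toy matrix with made-up values. It assumes Julia 0.7 or later, where `sparse` lives in the `SparseArrays` standard library; on the older version used in this notebook it is available without the import.

```julia
using SparseArrays   # needed on Julia 0.7 and later

# A toy 0/1 matrix standing in for one-hot encoded features (made-up values).
dense = Float32[0 1 0 0 0 1;
                1 0 0 0 1 0;
                0 1 0 1 0 0]

sp = sparse(dense)             # stores only the nonzero entries
println(size(sp))              # (3, 6)
println(nnz(sp))               # 6
println(Matrix(sp) == dense)   # true
```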

#### You can also pass XGBoost's DMatrix object, which stores the label, data and other metadata needed for advanced features

In [34]:
print("training xgboost with DMatrix\n")
dtrain = DMatrix(train_X, label = train_Y)
println(num_round)
@time bst = xgboost(dtrain, num_round, eta = 1, objective = "binary:logistic")

training xgboost with DMatrix
2
  0.023433 seconds (114 allocations: 5.219 KB)


[1]	train-error:0.000614
[2]	train-error:0.000000


XGBoost.Booster(Ptr{Void} @0x000000000878f6e0)

### Basic prediction using XGBoost

#### You can pass a Matrix, SparseMatrix or DMatrix

In [23]:
preds1 = predict(bst1, test_X)
print("test-error=", sum((preds1 .> 0.5) .!= test_Y) / float(size(preds1)[1]), "\n")

test-error=0.021725636250775917
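
The test-error expression above thresholds each predicted probability at 0.5 and counts the disagreements with the labels. The same computation can be wrapped in a small function. `error_rate` is a hypothetical name for illustration and is not part of XGBoost.jl.

```julia
# Hypothetical helper: fraction of predictions that disagree with the labels
# after thresholding the predicted probabilities.
function error_rate(preds::AbstractVector, labels::AbstractVector; threshold = 0.5)
    length(preds) == length(labels) ||
        throw(DimensionMismatch("preds and labels differ in length"))
    predicted = preds .> threshold      # Bool vector: true means class 1
    return count(predicted .!= labels) / length(preds)
end

# Made-up probabilities and labels: only the third prediction is wrong.
println(error_rate(Float32[0.9, 0.2, 0.7, 0.4], Float32[1, 0, 0, 0]))   # 0.25
```

With the notebook's variables this would be called as `error_rate(preds1, test_Y)`.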


In [22]:
preds = predict(bst, test_X)
print("test-error=", sum((preds .> 0.5) .!= test_Y) / float(size(preds)[1]), "\n")

1611-element Array{Float32,1}:
 0.0
 1.0
 0.0
 0.0
 0.0
 0.0
 1.0
 0.0
 1.0
 0.0
 1.0
 0.0
 0.0
 ⋮  
 1.0
 0.0
 0.0
 0.0
 1.0
 0.0
 0.0
 0.0
 0.0
 1.0
 0.0
 1.0

### XGBoost vs other solvers:

Solving the same problem, now using linear models instead of trees, and comparing with GLM.jl.

In [4]:
using XGBoost
dtrain = DMatrix("../data/mushroom/agaricus.txt.train")
dtest = DMatrix("../data/mushroom/agaricus.txt.test")


6513x126 matrix with 143286 entries is loaded from ../data/mushroom/agaricus.txt.train
1611x126 matrix with 35442 entries is loaded from ../data/mushroom/agaricus.txt.test


XGBoost.DMatrix(Ptr{Void} @0x00000000036d1890,_setinfo)

In [25]:
num_round = 2;
@time bst = xgboost(dtrain, num_round, eta = 1, objective = "binary:logistic")

In [24]:
preds = predict(bst, test_X)
print("test-error=", sum((preds .> 0.5) .!= test_Y) / float(size(preds)[1]), "\n")



