# XGBoost.jl - quickest ways to win data science competitions

This example shows how a single, non-ensembled model can reach a high score using XGBoost, an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable.

### Distinguishing poisonous vs edible mushrooms

Based on 8124 instances and 22 attributes such as odor, habitat and color, we can easily and accurately classify mushrooms as poisonous or edible. The few cases of unknown edibility are not recommended for eating and are grouped with the poisonous class.

The Agaricus genus contains the most widely consumed and best-known mushroom today, but there are poisonous ones among them as well. The dataset consists of 8124 observations from the Agaricus and Lepiota family. It is a multivariate dataset with 22 characteristic attributes, and each observation belongs to one of 2 classes: edible or poisonous.

![Agaricus californicus](ACP.jpg) ![Agaricus campestris](ACE2.jpg)

In [1]:
using XGBoost
include("$(Pkg.dir())/MLDemos/src/xgboost/mushroom.jl")
path = "$(Pkg.dir())/MLDemos/";

#### We use an auxiliary function to read the LIBSVM format into a Julia Matrix.


In [3]:
train_X, train_Y = readlibsvm("$(path)data/mushroom/agaricus.txt.train", (6513, 126));
test_X, test_Y = readlibsvm("$(path)data/mushroom/agaricus.txt.test", (1611, 126));
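
To make the file format concrete, here is a minimal sketch of a LIBSVM reader in plain Julia. It is not the demo's own `readlibsvm`: the name `parse_libsvm` and the `index_base` keyword are hypothetical, and the sketch assumes that feature indices in the file start at 0, which appears to be the case for the agaricus files. Each line holds a label followed by `index:value` pairs, and any feature not listed is taken to be zero.

```julia
# Hypothetical helper for illustration only; not the demo's `readlibsvm`.
# Each LIBSVM line looks like: <label> <index>:<value> <index>:<value> ...
function parse_libsvm(io::IO, shape::Tuple{Int,Int}; index_base::Int = 0)
    X = zeros(Float32, shape)        # dense matrix; unlisted features stay 0
    y = zeros(Float32, shape[1])     # one label per row
    row = 0
    for line in eachline(io)
        fields = split(line)         # split on whitespace
        isempty(fields) && continue  # skip blank lines
        row += 1
        y[row] = parse(Float32, fields[1])
        for item in fields[2:end]
            idx, val = split(item, ':')
            # shift the file's indices to Julia's 1-based columns
            X[row, parse(Int, idx) - index_base + 1] = parse(Float32, val)
        end
    end
    return X, y
end

# Two made-up rows with three features each
sample = IOBuffer("1 0:1 2:1\n0 1:1\n")
X, y = parse_libsvm(sample, (2, 3))
```

The shape is passed in up front, as in the calls above, so the matrix can be allocated once before the file is read.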

#### Basic training using XGBoost
You can pass a Julia Matrix or a sparse matrix directly as data:

In [5]:
num_round = 2;

print("training xgboost with dense matrix\n");
bst = xgboost(train_X, num_round, label = train_Y, eta=1, max_depth=2, objective="binary:logistic");

print("training xgboost with sparse matrix\n");
sptrain = sparse(train_X);  # converted here; the model is trained on it in the next cell


training xgboost with dense matrix
training xgboost with sparse matrix


[1]	train-error:0.046522
[2]	train-error:0.022263
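
As a small aside on the `sparse` conversion, the sketch below uses a made-up matrix to show that a sparse matrix keeps only the nonzero entries while representing the same values. It assumes a recent Julia, where `sparse` is provided by the `SparseArrays` standard library; in the older Julia version this notebook was run on, it was available without the import.

```julia
using SparseArrays  # standard library in Julia 1.x (assumption: a recent Julia)

# Made-up 2x3 matrix with mostly zero entries
dense = Float32[1 0 0; 0 0 1]
sp = sparse(dense)

stored = nnz(sp)               # number of entries actually stored
same = Matrix(sp) == dense     # converting back recovers the dense matrix
```

Binary indicator features like the ones in this dataset tend to be mostly zeros, which is where a sparse representation can save memory.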


#### Alternatively, you can pass parameters in as a map

In [6]:
param = Dict("max_depth"=>2, "eta"=>1, "objective"=>"binary:logistic")
@time bst = xgboost(sptrain, num_round, label = train_Y, param=param)


[1]	train-error:0.046522
[2]	train-error:0.022263


  0.675264 seconds (88.31 k allocations: 5.714 MB)


XGBoost.Booster(Ptr{Void} @0x00000000035fef20)

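#### You can also pass in XGBoost's DMatrix object

A DMatrix stores the data, the labels and other metadata needed for advanced features. Note that the call below omits `max_depth=2`, so the trees grow to the library's default depth of 6, which likely explains why the training error is lower here than in the earlier cells.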

In [7]:
print("training xgboost with DMatrix\n")
dtrain = DMatrix(train_X, label = train_Y)
bst = xgboost(dtrain, num_round, eta = 1, objective = "binary:logistic")

training xgboost with DMatrix


[1]	train-error:0.000614
[2]	train-error:0.000000


XGBoost.Booster(Ptr{Void} @0x000000000358cc50)

### Basic prediction using XGBoost

#### You can pass in a Matrix, a sparse matrix or a DMatrix

In [8]:
preds = predict(bst, test_X)
print("test-error=", sum((preds .> 0.5) .!= test_Y) / float(size(preds)[1]), "\n")

test-error=0.0
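
The error computation in the cell above can be unpacked step by step. The sketch below uses made-up probabilities and labels rather than the model's real output: with the `binary:logistic` objective the predictions are probabilities, so they are thresholded at 0.5 and compared element-wise with the labels.

```julia
# Made-up probabilities and labels, purely for illustration
probs  = [0.9, 0.2, 0.6, 0.4]
labels = [1.0, 0.0, 0.0, 0.0]

predicted = probs .> 0.5                  # threshold into true/false
errors = predicted .!= labels             # true where the prediction is wrong
error_rate = sum(errors) / length(probs)  # fraction of misclassified rows
```

Here only the third prediction disagrees with its label, so one of four rows is counted as an error.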
