# XGBoost.jl - quickest ways to win data science competitions

Here is an example of how a single, non-ensembled model can score highly using XGBoost, an optimized distributed gradient boosting library designed to be efficient, flexible, and portable.

### Distinguishing poisonous vs edible mushrooms

Based on 8124 instances and 22 attributes such as odor, habitat, and color, we can accurately classify mushrooms as poisonous or edible; the few cases of unknown edibility are treated as not recommended.

The Agaricus genus contains the most widely consumed and best-known mushrooms today, but it also includes poisonous species. The dataset consists of 8124 observations from the Agaricus and Lepiota family (6513 training rows and 1611 test rows in the files used below). It is a multivariate dataset with 22 attributes, classified into 2 classes: edible and poisonous.

![Agaricus californicus](../data/mushroom/ACP.jpg) ![Agaricus campestris](../data/mushroom/ACE2.jpg)

In [12]:
using XGBoost
include("$(Pkg.dir())/MLDemos/src/xgboost/mushroom.jl")
path = "$(Pkg.dir())/MLDemos/";

In [11]:
;ls /home/abhijithc/.julia/v0.4/MLDemos/data/mushroom/

ACE2.jpg
ACE.jpg
ACP.jpg
Agaricus_californicus(P).jpg
Agaricus campestris 1 Michael Beug.jpg
Agaricus_campestris(E).jpg
agaricus.txt.test
agaricus.txt.train
Agaricus_xanthoderma(P).jpg


#### We use an auxiliary function to read the LIBSVM format into a Julia Matrix.


In [13]:
train_X, train_Y = readlibsvm("$(path)data/mushroom/agaricus.txt.train", (6513, 126))
test_X, test_Y = readlibsvm("$(path)data/mushroom/agaricus.txt.test", (1611, 126))



(
1611x126 Array{Float32,2}:
 1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  1.0  0.0  0.0  0.0  0.0
 0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  1.0  0.0  0.0
 1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  1.0  0.0  0.0  0.0  0.0
 0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  1.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  1.0  0.0  0.0  1.0  0.0     0.0  0.0  0.0  0.0  1.0  0.0  0.0
 0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  1.0
 0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  1.0  0.0  0.0
 1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     1.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0     1.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  1.0  0.0  0.0  0.0
 0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  1.0  0.0  0.0
 1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  1.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  1.0  0.0  0.0  1.0 
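
The `readlibsvm` helper lives in `mushroom.jl` and is not shown in this notebook. As a rough illustration of what such a reader does, here is a minimal sketch written for Julia 1.x (not the 0.4 syntax used in the cells above). The name `read_libsvm_sketch` is hypothetical, and the sketch assumes that feature indices in the file are 0-based and that the matrix shape is known in advance, as in the call above.

```julia
# Hypothetical helper, not the one shipped in mushroom.jl.
# Each LIBSVM line looks like: "<label> <index>:<value> <index>:<value> ..."
function read_libsvm_sketch(io::IO, shape::Tuple{Int,Int})
    X = zeros(Float32, shape)   # dense feature matrix, zero-filled
    y = Float32[]               # one label per non-empty line
    for line in eachline(io)
        fields = split(line)
        isempty(fields) && continue          # skip blank lines
        push!(y, parse(Float32, fields[1]))  # first field is the label
        row = length(y)
        for item in fields[2:end]
            idx, val = split(item, ':')
            # Assumption: indices are 0-based, so shift by one for Julia.
            X[row, parse(Int, idx) + 1] = parse(Float32, val)
        end
    end
    return X, y
end

# Usage on a tiny in-memory sample instead of the agaricus files.
sample = IOBuffer("1 0:1 3:1\n0 1:1 2:1\n")
X, y = read_libsvm_sketch(sample, (2, 4))
```

Passing the shape up front keeps the sketch simple; a reader that infers the number of rows and columns would need a first pass over the file.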

#### Basic training using XGBoost
You can pass Julia's dense matrix or sparse matrix directly as data.

In [None]:
num_round = 2

print("training xgboost with dense matrix\n")
bst = xgboost(train_X, num_round, label = train_Y, eta=1, max_depth=2, objective="binary:logistic")

print("training xgboost with sparse matrix\n")
sptrain = sparse(train_X)
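
The mushroom features are mostly zeros, which is why a sparse representation is a natural fit. The small sketch below, using a made-up matrix rather than the real data, shows that `sparse` keeps only the nonzero entries and that the conversion is lossless. It assumes Julia 1.x, where `sparse` comes from the `SparseArrays` standard library rather than from Base as on 0.4.

```julia
using SparseArrays  # standard library on Julia 1.x

# Toy one-hot style matrix (illustrative values, not the mushroom data).
dense = Float32[1 0 0 0;
                0 0 1 0;
                0 0 0 0]

sp = sparse(dense)          # compressed sparse column storage
stored = nnz(sp)            # number of entries actually stored
roundtrip = Matrix(sp)      # convert back to a dense matrix
```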


#### Alternatively, you can pass parameters in as a map

In [7]:
# Build the parameter map with Dict(...); the ["key"=>value, ...] literal is deprecated
param = Dict("max_depth"=>2, "eta"=>1, "objective"=>"binary:logistic")
@time bst = xgboost(sptrain, num_round, label = train_Y, param=param)



#### You can also pass in XGBoost's DMatrix object. DMatrix stores the label, data, and other metadata needed for advanced features.

In [None]:
print("training xgboost with DMatrix\n")
dtrain = DMatrix(train_X, label = train_Y)
bst = xgboost(dtrain, num_round, eta = 1, objective = "binary:logistic")

### Basic prediction using XGBoost

#### You can pass in a dense matrix, a sparse matrix, or a DMatrix