<h1>XGBoost</h1>

In [1]:
library(mlr)
source('../utils.r')

set.seed(42)

folder_name = '../../raw_data' 
file_name   = 'german.csv'

Loading required package: ParamHelpers
“replacing previous import ‘BBmisc::isFALSE’ by ‘backports::isFALSE’ when loading ‘mlr’”

<h2>1. Dataprep</h2>

In [2]:
data = read.csv(file=sprintf('%s/%s',folder_name,file_name))

In [3]:
cat(sprintf('NRow: %d\nNCol: %d',nrow(data), ncol(data)))
head(data)

NRow: 1000
NCol: 22

X,V1,V2,V3,V4,V5,V6,V7,V8,V9,⋯,V12,V13,V14,V15,V16,V17,V18,V19,V20,V21
1,A11,6,A34,A43,1169,A65,A75,4,A93,⋯,A121,67,A143,A152,2,A173,good,A192,A201,1
2,A12,48,A32,A43,5951,A61,A73,2,A92,⋯,A121,22,A143,A152,1,A173,good,A191,A201,2
3,A14,12,A34,A46,2096,A61,A74,2,A93,⋯,A121,49,A143,A152,1,A172,bad,A191,A201,1
4,A11,42,A32,A42,7882,A61,A74,2,A93,⋯,A122,45,A143,A153,1,A173,bad,A191,A201,1
5,A11,24,A33,A40,4870,A61,A73,3,A93,⋯,A124,53,A143,A153,2,A173,bad,A191,A201,2
6,A14,36,A32,A46,9055,A65,A73,2,A93,⋯,A124,35,A143,A153,1,A172,bad,A192,A201,1


<p style='color: red'> ATTENTION: </p>In R, the target of a classification problem must be a factor.

In [4]:
data$V21 = as.factor(data$V21)
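As a small illustration of what this conversion does, the sketch below uses a hypothetical vector `target` (not the notebook's data) coded as integers, the way `V21` is in the raw file.

```R
# Hypothetical target vector coded as integers, similar to column V21.
target <- c(1, 2, 1, 1, 2)

# Before conversion the column is numeric, not categorical.
is_numeric_before <- is.numeric(target)

# as.factor() turns the distinct values into class labels (levels).
target <- as.factor(target)
target_levels <- levels(target)

print(is_numeric_before)
print(class(target))
print(target_levels)
```

The levels are stored as character labels, which is why the positive class is later given as the string `'2'`.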

mlr expects the data to contain only the features and the target, so any other column must be dropped.

In [5]:
drops = c('X')
data  = data[ , !(names(data) %in% drops)]
cat(sprintf('NRow: %d\nNCol: %d',nrow(data), ncol(data)))

NRow: 1000
NCol: 21

Function
```R
createDummyFeatures(obj, target = character(0L), method = "1-of-n",  cols = NULL)
```

Doc.: https://www.rdocumentation.org/packages/mlr/versions/2.10/topics/createDummyFeatures

In [6]:
data = createDummyFeatures(obj = data, target = 'V21')
cat(sprintf('NRow: %d\nNCol: %d',nrow(data), ncol(data)))

NRow: 1000
NCol: 63
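The jump from 21 to 63 columns comes from expanding each categorical feature into one indicator column per level. The sketch below reproduces that idea with base R's `model.matrix()` on a hypothetical two-column data frame `df`. It is not mlr's implementation, and the generated column names differ from the ones `createDummyFeatures` produces.

```R
# Hypothetical data frame with one categorical and one numeric feature.
df <- data.frame(
  V1 = factor(c("A11", "A12", "A14", "A11")),
  V2 = c(6, 48, 12, 42)
)

# One indicator column per level of V1. The "- 1" drops the intercept so
# that no level is left out as a reference category.
dummies <- model.matrix(~ V1 - 1, data = df)

# Combine the indicator columns with the untouched numeric feature.
encoded <- cbind(as.data.frame(dummies), V2 = df$V2)
print(encoded)
```

Each row has exactly one indicator set to 1 for `V1`, and the numeric column passes through unchanged.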

<h2>2. Modeling</h2>

Function
```R
makeLearner(cl, id = cl, predict.type = "response", predict.threshold = NULL, 
            fix.factors.prediction = FALSE, ..., par.vals = list(), config = list())
```
Param.:

* cl: [character(1)] Class of learner. By convention, all classification learners start with “classif.”. A list of all integrated learners is available on the learners help page < https://mlr-org.github.io/mlr-tutorial/release/html/integrated_learners/ >.
* predict.type: [character(1)] “response” (= labels) or “prob” (= probabilities and labels by selecting the ones with maximal probability). Default is “response”.
* par.vals: [list] Optional list of named (hyper)parameters. The arguments in ... take precedence over values in this list. We strongly encourage you to use one or the other to pass (hyper)parameters to the learner but not both.

Doc.: https://www.rdocumentation.org/packages/mlr/versions/2.10/topics/makeLearner <br>
XGBoost Parameters: https://xgboost.readthedocs.io/en/latest/parameter.html

In [7]:
cl = "classif.xgboost"

List all hyperparameters accepted by this classifier.
Their values are passed through the par.vals argument of makeLearner.

In [8]:
getParamSet(cl)

                          Type len             Def               Constr Req
booster               discrete   -          gbtree gbtree,gblinear,dart   -
silent                 integer   -               0          -Inf to Inf   -
eta                    numeric   -             0.3               0 to 1   -
gamma                  numeric   -               0             0 to Inf   -
max_depth              integer   -               6             1 to Inf   -
min_child_weight       numeric   -               1             0 to Inf   -
subsample              numeric   -               1               0 to 1   -
colsample_bytree       numeric   -               1               0 to 1   -
colsample_bylevel      numeric   -               1               0 to 1   -
num_parallel_tree      integer   -               1             1 to Inf   -
lambda                 numeric   -               0             0 to Inf   -
lambda_bias            numeric   -               0             0 to Inf   -
alpha       

In [9]:
learner = makeLearner(cl = cl
                     , predict.type = "prob"
                     , par.vals = list()
                     )
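The cell above leaves `par.vals` empty, so the learner runs with its defaults. A minimal sketch of passing hyperparameters instead is shown below. The values are illustrative rather than tuned, the name `learner_tuned` is hypothetical, and `nrounds` is assumed to be exposed by the installed mlr version, since the parameter listing above is cut off before it would appear.

```R
library(mlr)

# Illustrative values only; they are not tuned for this data set.
# eta and max_depth appear in the parameter listing above; nrounds is an
# assumption about the installed mlr version.
params <- list(nrounds = 50, eta = 0.1, max_depth = 4)

learner_tuned <- makeLearner(cl = "classif.xgboost",
                             predict.type = "prob",
                             par.vals = params)

# Inspect the hyperparameters that are set on the learner.
print(getHyperPars(learner_tuned))
```

Any name in `par.vals` must match an entry of `getParamSet(cl)`, and its value must respect the constraints listed there.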

Function
```R
makeClassifTask(id = deparse(substitute(data)), data, target, weights = NULL, blocking = NULL, 
                positive = NA_character_, fixup.data = "warn", check.data = TRUE)
```
Param.:

* data: [data.frame] A data frame containing the features and target variable(s).
* target: [character(1)] Name of the target variable.
* positive: [character(1)] Positive class for binary classification (otherwise ignored and set to NA). Default is the first factor level of the target attribute.
* fixup.data: [character(1)] Should some basic cleaning up of data be performed? Currently this means removing empty factor levels for the columns. Possible choices are: “no” = Don't do it. “warn” = Do it but warn about it. “quiet” = Do it but keep silent. Default is “warn”.

Doc.: https://www.rdocumentation.org/packages/mlr/versions/2.10/topics/makeLearner

In [10]:
task = makeClassifTask( data = data
                      , target = 'V21'
                      , positive = '2'
                      , fixup.data = 'no'
                      )

Function:
```R
makeResampleDesc(method, predict = "test", ..., stratify = FALSE, stratify.cols = NULL)
```
Param.:

* method: [character(1)] “CV” for cross-validation, “LOO” for leave-one-out, “RepCV” for repeated cross-validation, “Bootstrap” for out-of-bag bootstrap, “Subsample” for subsampling, “Holdout” for holdout.
* predict: What to predict during resampling: “train”, “test” or “both” sets. Default is “test”.
* ... : [any] Further parameters for strategies.
    * iters [integer(1)] Number of iterations, for “CV”, “Subsample” and “Bootstrap”.
    * split [numeric(1)] Proportion of training cases for “Holdout” and “Subsample” between 0 and 1. Default is 2/3.
    * reps [integer(1)] Repeats for “RepCV”. Here iters = folds * reps. Default is 10.
    * folds [integer(1)] Folds in the repeated CV for RepCV. Here iters = folds * reps. Default is 10.

Doc.: https://www.rdocumentation.org/packages/mlr/versions/2.10/topics/makeResampleDesc

In [11]:
resample = makeResampleDesc( method = "CV"
                           , iters = 10
                           , predict = 'both'
                           , stratify = FALSE
                           )
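To make the "CV" method with `iters = 10` concrete, the following base-R sketch splits 1000 row indices into 10 folds. It illustrates the idea only and is not mlr's internal code; the names `fold_id`, `train_idx` and `test_idx` are hypothetical.

```R
set.seed(42)

n     <- 1000  # number of rows, as in the data set above
iters <- 10    # number of folds

# Shuffle the fold labels 1..iters so each row lands in exactly one fold.
fold_id <- sample(rep(seq_len(iters), length.out = n))

# In iteration k, the rows of fold k form the test set and all remaining
# rows form the training set.
k         <- 1
test_idx  <- which(fold_id == k)
train_idx <- which(fold_id != k)

print(table(fold_id))
print(c(train = length(train_idx), test = length(test_idx)))
```

With these sizes every iteration trains on 900 rows and tests on 100. Because `predict = 'both'` is set, measures are computed on both parts.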

List of performance measures:

Doc.: http://mlr-org.github.io/mlr-tutorial/release/html/measures/

In [12]:
measures = list(mmce #mean misclassification error
               ,acc  #accuracy
               ,f1   #F1 score
               ,ppv  #precision
               ,tpr  #recall
               ,auc  #AUC
               ,gini #Gini
               #,timetrain #training time
               )
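As a rough guide to what these measures compute, the sketch below derives them by hand in base R from hypothetical labels and probabilities, with class 2 as the positive class. It assumes a 0.5 threshold and uses the rank-sum formulation of AUC; it is not mlr's own code.

```R
# Hypothetical labels and predicted probabilities for the positive class "2".
truth    <- c(2, 2, 2, 1, 1, 1, 1, 1)
prob_2   <- c(0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.2, 0.1)
response <- ifelse(prob_2 >= 0.5, 2, 1)

tp <- sum(response == 2 & truth == 2)  # true positives
fp <- sum(response == 2 & truth == 1)  # false positives
fn <- sum(response == 1 & truth == 2)  # false negatives

mmce_val <- mean(response != truth)                       # misclassification rate
acc_val  <- 1 - mmce_val                                  # accuracy
ppv_val  <- tp / (tp + fp)                                # precision
tpr_val  <- tp / (tp + fn)                                # recall
f1_val   <- 2 * ppv_val * tpr_val / (ppv_val + tpr_val)   # harmonic mean

# AUC via the rank-sum (Mann-Whitney) formulation; ties get average ranks.
ranks   <- rank(prob_2)
n_pos   <- sum(truth == 2)
n_neg   <- sum(truth == 1)
auc_val <- (sum(ranks[truth == 2]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Gini is a linear rescaling of AUC.
gini_val <- 2 * auc_val - 1

print(c(mmce = mmce_val, acc = acc_val, ppv = ppv_val, tpr = tpr_val,
        f1 = f1_val, auc = auc_val, gini = gini_val))
```

The relation `gini = 2 * auc - 1` is consistent with the auc and gini columns reported by the resampling output in this notebook.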

Function:
```R
resample(learner, task, resampling, measures, weights = NULL, models = FALSE, extract, 
         keep.pred = TRUE, ..., show.info = getMlrOption("show.info"))
```
Param.:

* learner: [Learner] The learner.
* task: [Task] The task.
* resampling: [ResampleDesc | ResampleInstance] Resampling strategy.
* measures: [Measure | list of Measure] Performance measure(s) to evaluate. Default is mean misclassification error (mmce).
* weights: [numeric] Optional, non-negative case weight vector to be used during fitting. If given, must be of same length as observations in task and in corresponding order. Overwrites weights specified in the task. By default NULL which means no weights are used unless specified in the task.
* models: [logical(1)] Should all fitted models be returned? Default is FALSE.
* keep.pred: [logical(1)] Keep the prediction data in the pred slot of the result object. If you do many experiments (on larger data sets) these objects might unnecessarily increase object size / mem usage, if you do not really need them. In this case you can set this argument to FALSE. Default is TRUE.
* show.info: [logical(1)] Print verbose output on console? Default is set via configureMlr.

Doc.: https://www.rdocumentation.org/packages/mlr/versions/2.10/topics/resample

In [13]:
r = resample(learner = learner
            ,task = task 
            ,resampling = resample 
            ,measures = measures
            #---------------------#
            ,models = TRUE
            ,keep.pred = FALSE
            ,show.info = TRUE
            )

[Resample] cross-validation iter 1: mmce.test.mean=0.29,acc.test.mean=0.71,f1.test.mean=0.408,ppv.test.mean=0.455,tpr.test.mean=0.37,auc.test.mean=0.649,gini.test.mean=0.297
[Resample] cross-validation iter 2: mmce.test.mean=0.28,acc.test.mean=0.72,f1.test.mean=0.391,ppv.test.mean=0.692,tpr.test.mean=0.273,auc.test.mean=0.71,gini.test.mean=0.42
[Resample] cross-validation iter 3: mmce.test.mean=0.31,acc.test.mean=0.69,f1.test.mean=0.492,ppv.test.mean=0.536,tpr.test.mean=0.455,auc.test.mean=0.745,gini.test.mean=0.49
[Resample] cross-validation iter 4: mmce.test.mean=0.27,acc.test.mean=0.73,f1.test.mean= 0.4,ppv.test.mean=0.692,tpr.test.mean=0.281,auc.test.mean=0.694,gini.test.mean=0.389
[Resample] cross-validation iter 5: mmce.test.mean= 0.3,acc.test.mean= 0.7,f1.test.mean=0.516,ppv.test.mean= 0.5,tpr.test.mean=0.533,auc.test.mean=0.752,gini.test.mean=0.504
[Resample] cross-validation iter 6: mmce.test.mean= 0.3,acc.test.mean= 0.7,f1.test.mean=0.444,ppv.test.mean=0.522,tpr.test.mean=0.3

<h2>3. Result Analysis</h2>

Train Measures

In [14]:
r$measures.train

iter,mmce,acc,f1,ppv,tpr,auc,gini
1,0.1977778,0.8022222,0.6352459,0.7209302,0.5677656,0.8564447,0.7128894
2,0.1866667,0.8133333,0.6018957,0.8193548,0.4756554,0.8666951,0.7333901
3,0.19,0.81,0.6430063,0.7264151,0.576779,0.8596511,0.7193023
4,0.18,0.82,0.6462882,0.7789474,0.5522388,0.8610783,0.7221566
5,0.1966667,0.8033333,0.5931034,0.7818182,0.4777778,0.8403616,0.6807231
6,0.1922222,0.8077778,0.6587771,0.7016807,0.6208178,0.8505058,0.7010116
7,0.1811111,0.8188889,0.6337079,0.7921348,0.5280899,0.855323,0.7106461
8,0.1944444,0.8055556,0.6492986,0.7168142,0.5934066,0.8489756,0.6979512
9,0.2066667,0.7933333,0.6108787,0.7192118,0.5309091,0.8453207,0.6906415
10,0.1911111,0.8088889,0.6653696,0.7037037,0.6309963,0.8495709,0.6991417


Test Measures

In [15]:
r$measures.test

iter,mmce,acc,f1,ppv,tpr,auc,gini
1,0.29,0.71,0.4081633,0.4545455,0.3703704,0.6486555,0.297311
2,0.28,0.72,0.3913043,0.6923077,0.2727273,0.7098598,0.4197196
3,0.31,0.69,0.4918033,0.5357143,0.4545455,0.7451379,0.4902759
4,0.27,0.73,0.4,0.6923077,0.28125,0.6943934,0.3887868
5,0.3,0.7,0.516129,0.5,0.5333333,0.7521429,0.5042857
6,0.3,0.7,0.4444444,0.5217391,0.3870968,0.7800374,0.5600748
7,0.32,0.68,0.4482759,0.52,0.3939394,0.7406151,0.4812302
8,0.28,0.72,0.5333333,0.4848485,0.5925926,0.746068,0.492136
9,0.24,0.76,0.4545455,0.5263158,0.4,0.7706667,0.5413333
10,0.28,0.72,0.5483871,0.5151515,0.5862069,0.7544925,0.5089849


Train Aggregated Result

In [16]:
apply(r$measures.train,2,mean)

Test Aggregated Result

In [17]:
apply(r$measures.test,2,mean)
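As a worked example of the aggregation, the sketch below averages the accuracy column by hand, using the per-fold values copied from the two tables above. The variable names are hypothetical.

```R
# Test accuracy per fold, copied from the r$measures.test table above.
acc_test  <- c(0.71, 0.72, 0.69, 0.73, 0.70, 0.70, 0.68, 0.72, 0.76, 0.72)
# Training accuracy per fold, copied from the r$measures.train table above.
acc_train <- c(0.8022222, 0.8133333, 0.81, 0.82, 0.8033333,
               0.8077778, 0.8188889, 0.8055556, 0.7933333, 0.8088889)

mean_test <- mean(acc_test)              # cross-validated accuracy
sd_test   <- sd(acc_test)                # spread across folds
gap       <- mean(acc_train) - mean_test # how optimistic the training estimate is

print(round(c(mean_test = mean_test, sd_test = sd_test, gap = gap), 4))
```

Note that `apply(..., 2, mean)` also averages the `iter` column, which has no meaning as a measure; selecting the measure columns first avoids that.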

Run Time in seconds

In [18]:
r$runtime

<h2>4. Prediction for new data</h2>

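Read the data to score. For illustration, this notebook reloads the same file that was used for training and applies the same dummy encoding.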

In [19]:
new.data = read.csv(file=sprintf('%s/%s', folder_name, file_name))
new.data = createDummyFeatures(obj = new.data, target = 'V21')

Select the cross-validation model with the highest test accuracy and use it to score the incoming data.

In [20]:
best.model = which.max(r$measures.test$acc)
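To see what this selection returns, the sketch below applies `which.max()` to the test accuracies copied from the `r$measures.test` table above. The names `acc_test`, `best_idx` and `tied` are hypothetical.

```R
# Test accuracy per fold, copied from the r$measures.test table above.
acc_test <- c(0.71, 0.72, 0.69, 0.73, 0.70, 0.70, 0.68, 0.72, 0.76, 0.72)

# which.max() returns the position of the first maximum only.
best_idx <- which.max(acc_test)

# All positions sharing the maximum, in case of ties.
tied <- which(acc_test == max(acc_test))

print(best_idx)
print(tied)
```

Here a single fold holds the maximum, so both calls agree. With ties, `which.max()` silently keeps the earliest fold.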

In [21]:
pred = predict(r$models[[best.model]], newdata = new.data)

Prediction Result

In [22]:
pred

Prediction: 1000 observations
predict.type: prob
threshold: 1=0.50,2=0.50
time: 0.00
  truth    prob.1    prob.2 response
1     1 0.5299640 0.4700360        1
2     2 0.5299640 0.4700360        1
3     1 0.6248065 0.3751935        1
4     1 0.5299640 0.4700360        1
5     2 0.5299640 0.4700360        1
6     1 0.6248065 0.3751935        1
... (1000 rows, 4 cols)


Convert the result to a data.frame to access the predictions.

In [23]:
head(as.data.frame(pred))

truth,prob.1,prob.2,response
1,0.529964,0.470036,1
2,0.529964,0.470036,1
1,0.6248065,0.3751935,1
1,0.529964,0.470036,1
2,0.529964,0.470036,1
1,0.6248065,0.3751935,1
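Once the predictions are in a data frame, the probability columns can be used directly, for example to apply a cut-off other than the default 0.5. The sketch below rebuilds the six rows shown above and uses a hypothetical threshold of 0.45 on `prob.2`. mlr also provides `setThreshold()` for Prediction objects, which is not shown here.

```R
# First rows of as.data.frame(pred), copied from the output above.
pred_df <- data.frame(
  truth  = factor(c(1, 2, 1, 1, 2, 1)),
  prob.1 = c(0.5299640, 0.5299640, 0.6248065, 0.5299640, 0.5299640, 0.6248065),
  prob.2 = c(0.4700360, 0.4700360, 0.3751935, 0.4700360, 0.4700360, 0.3751935)
)

# Hypothetical cut-off: predict class "2" whenever prob.2 reaches 0.45.
threshold <- 0.45
pred_df$response_custom <- factor(ifelse(pred_df$prob.2 >= threshold, 2, 1),
                                  levels = c(1, 2))

# Cross-tabulate the true labels against the re-thresholded response.
print(table(truth = pred_df$truth, response = pred_df$response_custom))
```

Lowering the threshold trades precision for recall on the positive class, so the value should be chosen against the measures that matter for the application.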
