<h1>Ranger: A Fast Implementation of Random Forests</h1>

In [47]:
library(mlr)
library(parallel)
source('../utils.r')

set.seed(42)

data_folder_name  = '../../raw_data'
data_file_name    = 'airfoil_self_noise.dat'
model_folder_name = '../model'
model_file_name   = 'model_regr_ranger.RData'
ml_algorithm      = "regr.ranger"
gs_max_iteration  = 10
gs_max_time       = 5  # GenSA time limit in seconds; increase (e.g. 60) for a longer search

<h2>Load Data</h2><br>

In [2]:
data = read_data(data_folder_name %+/% data_file_name)
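
`read_data` and `%+/%` come from `../utils.r`, which is not shown in this notebook. The block below is only a guess at what such helpers might look like, written in base R so that it runs on its own. It assumes that `%+/%` joins path components and that `read_data` reads a headerless, whitespace-delimited file, which would explain the default column names `V1` to `V6` in the preview. The real helper evidently returns a `data.table` (see the conversion warning in the Make Task section), whereas this stand-in returns a plain `data.frame`.

```R
# Hypothetical stand-ins for the helpers in ../utils.r (not the author's code).

# Join two path components with the platform's separator.
`%+/%` = function(a, b) file.path(a, b)

# Read a headerless, whitespace-delimited numeric file into a data.frame.
# read.table() names the columns V1, V2, ... when header = FALSE.
read_data = function(path) read.table(path, header = FALSE)

# Usage on a small temporary file with three columns of sample values.
tmp_dir  = tempdir()
tmp_file = 'toy.dat'
writeLines(c("800\t0\t126.201", "1000\t0\t125.201"), tmp_dir %+/% tmp_file)
toy = read_data(tmp_dir %+/% tmp_file)
print(names(toy))
```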

In [3]:
cat(sprintf('NRow: %d\nNCol: %d',nrow(data), ncol(data)))
head(data)

NRow: 1503
NCol: 6

V1,V2,V3,V4,V5,V6
800,0,0.3048,71.3,0.00266337,126.201
1000,0,0.3048,71.3,0.00266337,125.201
1250,0,0.3048,71.3,0.00266337,125.951
1600,0,0.3048,71.3,0.00266337,127.591
2000,0,0.3048,71.3,0.00266337,127.461
2500,0,0.3048,71.3,0.00266337,125.571


Assuming that the target is the last column of the data.
If that is not the case, one must declare the name of the column that represents the target.

In [4]:
target = names(data)[ncol(data)]

In [5]:
drops = c()
data  = data[,c(!(names(data) %in% drops)), with=FALSE]
cat(sprintf('NRow: %d\nNCol: %d',nrow(data), ncol(data)))

NRow: 1503
NCol: 6

<h2>Make Task</h2><br>

The task encapsulates the data and specifies - through its subclasses - the type of the task. It also contains a description object detailing further aspects of the data. Useful operators are: getTaskFormula, getTaskFeatureNames, getTaskData, getTaskTargets, and subsetTask.

Function
```R
makeRegrTask(id = deparse(substitute(data)), data, target, weights = NULL, blocking = NULL, 
             fixup.data = "warn", check.data = TRUE)
```
Param.:

* data: [data.frame] A data frame containing the features and target variable(s).
* target: [character(1)] Name of the target variable.
* weights: [numeric] Optional, non-negative case weight vector to be used during fitting. Cannot be set for cost-sensitive learning. Default is NULL which means no (= equal) weights.
* fixup.data: [character(1)] Should some basic cleaning up of data be performed? Currently this means removing empty factor levels for the columns. Possible choices are: “no” = Don't do it. “warn” = Do it but warn about it. “quiet” = Do it but keep silent. Default is “warn”.

Doc.: https://www.rdocumentation.org/packages/mlr/versions/2.10/topics/makeRegrTask
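
To make `fixup.data` concrete: an empty factor level is one that is declared but no longer occurs in the column, typically after subsetting. The base R sketch below uses toy data (not the airfoil set) and `droplevels()` to show the kind of cleanup the option refers to. Whether mlr calls this exact function internally is not stated in the documentation quoted above.

```R
# Toy data frame with a factor column; values are invented for illustration.
toy = data.frame(
  x = c(1.0, 2.0, 3.0, 4.0),
  g = factor(c("a", "b", "a", "c"))
)

# Subsetting removes every row with g == "c", but the level stays declared.
sub = toy[toy$g != "c", ]
print(levels(sub$g))        # "c" is still listed although unused

# Dropping unused levels is the kind of cleanup fixup.data refers to.
cleaned = droplevels(sub)
print(levels(cleaned$g))
```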

In [6]:
task = makeRegrTask(data=data, target = target, fixup.data = 'no')

“Provided data is not a pure data.frame but from class data.table, hence it will be converted.”

<h2>Make Learner</h2><br>
Function
```R
makeLearner(cl, id = cl, predict.type = "response", predict.threshold = NULL, 
            fix.factors.prediction = FALSE, ..., par.vals = list(), config = list())
```
Param.:

* cl: [character(1)] Class of learner. By convention, all regression learners start with “regr.”. A list of all integrated learners is available on the learners help page < https://mlr-org.github.io/mlr-tutorial/release/html/integrated_learners/ >.
* predict.type: [character(1)] “response” (= mean response) or “se” (= standard errors and mean response). Default is “response”.
* par.vals: [list] Optional list of named (hyper)parameters. The arguments in ... take precedence over values in this list. We strongly encourage you to use one or the other to pass (hyper)parameters to the learner but not both.

Doc.: https://www.rdocumentation.org/packages/mlr/versions/2.10/topics/makeLearner

In [33]:
learner = makeLearner(cl = ml_algorithm, par.vals=list(verbose=FALSE,num.threads=1,seed=42, importance='impurity'))

<h2>Specifying the search space</h2><br>

In order to define a search space, we create a ParamSet object, which describes the parameter space we wish to search. This is done via the function makeParamSet. For each parameter type a special constructor function is available, see: https://www.rdocumentation.org/packages/ParamHelpers/versions/1.10/topics/Param

List all parameters:

In [8]:
getParamSet(ml_algorithm)

                                      Type  len      Def
num.trees                          integer    -      500
mtry                               integer    -        -
min.node.size                      integer    -        5
replace                            logical    -     TRUE
sample.fraction                    numeric    -        -
split.select.weights         numericvector <NA>        -
always.split.variables             untyped    -        -
respect.unordered.factors          logical    -    FALSE
importance                        discrete    -     none
write.forest                       logical    -     TRUE
scale.permutation.importance       logical    -    FALSE
num.threads                        integer    -        -
save.memory                        logical    -    FALSE
verbose                            logical    -     TRUE
seed                               integer    -        -
splitrule                         discrete    - variance
alpha                          

In [9]:
param_set = makeParamSet(
  makeIntegerParam("num.trees", lower = 100, upper = 3000),
  makeIntegerParam("mtry",      lower = 2,   upper = ceiling(sqrt(length(getTaskFeatureNames(task))))),  
  makeIntegerParam('min.node.size', lower = 5,   upper = 50)
)
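
As a quick check of the `mtry` range: the task has five features (six columns minus the target), so the upper bound `ceiling(sqrt(5))` evaluates to 3 and the search is limited to `mtry` values 2 and 3. The base R lines below reproduce that arithmetic with the feature count hard-coded, an assumption that holds only while no columns are dropped.

```R
# Number of features: 6 columns in the data minus 1 target column.
n_features = 6 - 1

# Same expression as the upper bound used in the search space.
mtry_upper = ceiling(sqrt(n_features))
mtry_range = 2:mtry_upper

print(mtry_upper)
print(mtry_range)
```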

<h2>Specifying the optimization algorithm</h2><br>

Once we have specified the search space, we need to choose an optimization algorithm for our parameters. Optimization algorithms are represented as TuneControl objects in mlr. The following tuners are available: https://www.rdocumentation.org/packages/mlr/versions/2.10/topics/TuneControl

In [48]:
optimization_algorithm = makeTuneControlGenSA(max.call=gs_max_iteration, max.time=gs_max_time)

<h2>Resampling strategy</h2><br>

Function:
```R
makeResampleDesc(method, predict = "test", ..., stratify = FALSE, stratify.cols = NULL)
```
Param.:

* method: [character(1)] “CV” for cross-validation, “LOO” for leave-one-out, “RepCV” for repeated cross-validation, “Bootstrap” for out-of-bag bootstrap, “Subsample” for subsampling, “Holdout” for holdout.
* predict: What to predict during resampling: “train”, “test” or “both” sets. Default is “test”.
* ... : [any] Further parameters for strategies.
    * iters [integer(1)] Number of iterations, for “CV”, “Subsample” and “Bootstrap”.
    * split [numeric(1)] Proportion of training cases for “Holdout” and “Subsample” between 0 and 1. Default is 2/3.
    * reps [integer(1)] Repeats for “RepCV”. Here iters = folds * reps. Default is 10.
    * folds [integer(1)] Folds in the repeated CV for RepCV. Here iters = folds * reps. Default is 10.

Doc.: https://www.rdocumentation.org/packages/mlr/versions/2.10/topics/makeResampleDesc
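
To illustrate what `method = "CV"` with `iters = 3` implies for this data set, here is a base R sketch of a 3-fold split over 1503 row indices. It is not mlr's implementation; assigning folds via `sample(rep(...))` is just one common way to do it. With 1503 rows, each fold holds 501 test rows and every row is used for testing exactly once.

```R
set.seed(42)

n_rows  = 1503   # number of observations reported above
n_folds = 3      # corresponds to iters = 3 for method = "CV"

# Assign every row to one fold, in random order.
fold_id = sample(rep(seq_len(n_folds), length.out = n_rows))

# For each iteration, one fold is the test set and the rest is the training set.
splits = lapply(seq_len(n_folds), function(k) {
  list(train = which(fold_id != k), test = which(fold_id == k))
})

print(sapply(splits, function(s) length(s$test)))
print(sapply(splits, function(s) length(s$train)))

# For "RepCV" the iteration count is folds * reps, e.g. 10 * 10 with the defaults.
print(10 * 10)
```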

In [16]:
resample = makeResampleDesc(method = "CV", iters = 3, predict = 'both')

<h2>Measures</h2><br>

List of performance measures:

Doc.: http://mlr-org.github.io/mlr-tutorial/release/html/measures/

In [17]:
measures = list(mae, rmse, expvar, timetrain)
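
For orientation, the three error measures can be written in a few lines of base R. MAE and RMSE use their textbook definitions. The `expvar` formula (regression sum of squares over total sum of squares) is an assumption based on the mlr measures page and should be verified there. `timetrain` is simply the time spent fitting and is not reproduced. The vectors are made up.

```R
# Made-up truth and prediction vectors, for illustration only.
truth    = c(1, 2, 3, 4)
response = c(1.5, 2, 2.5, 4)

# Mean absolute error.
mae_value = mean(abs(truth - response))

# Root mean squared error.
rmse_value = sqrt(mean((truth - response)^2))

# Explained variance, assumed here to be regression SS / total SS.
expvar_value = sum((response - mean(truth))^2) / sum((truth - mean(truth))^2)

print(c(mae = mae_value, rmse = rmse_value, expvar = expvar_value))
```

Because `mae` is listed first in `measures`, it is the measure that `tuneParams` optimizes; the others are only reported.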

<h2>Performing the tuning</h2><br>

Optimizes the hyperparameters of a learner. Allows for different optimization methods, such as grid search, evolutionary strategies, iterated F-race, etc. You can select such an algorithm (and its settings) by passing a corresponding control object. For a complete list of implemented algorithms look at TuneControl. Multi-criteria tuning can be done with tuneParamsMultiCrit.

Function:
```R
tuneParams(learner, task, resampling, measures, par.set, control, show.info = getMlrOption("show.info"))
```
Param.:

* learner: [Learner | character(1)] The learner. If you pass a string the learner will be created via makeLearner.
* task: [Task] The task.
* resampling: [ResampleInstance | ResampleDesc] Resampling strategy to evaluate points in hyperparameter space.
* measures: [list of Measure | Measure] Performance measures to evaluate. The first measure, aggregated by the first aggregation function is optimized, others are simply evaluated. 
* par.set: [ParamSet] Collection of parameters and their constraints for optimization. Dependent parameters with a requires field must use quote and not expression to define it.
* control: [TuneControl] Control object for search method. Also selects the optimization algorithm for tuning.
* show.info: [logical(1)] Print verbose output on console? Default is set via configureMlr.

Doc.: https://www.rdocumentation.org/packages/mlr/versions/2.10/topics/tuneParams

In [None]:
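# Leave one core free, but never request fewer than one worker
# (detectCores() can return NA or 1 on some machines).
cpus = max(1, detectCores(all.tests = FALSE, logical = TRUE) - 1, na.rm = TRUE)
# Multicore mode relies on forking and is not available on Windows.
parallelMap::parallelStartMulticore(cpus, show.info=FALSE)

tune_result = tuneParams(learner, task, resample, measures, param_set, optimization_algorithm, show.info=FALSE)

parallelMap::parallelStop()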

In [50]:
tune_result$y

In [51]:
unlist(tune_result$x)

<h2>Train A Learning Algorithm with best Hyperparameters</h2><br>

Given a Task, creates a model for the learning machine which can be used for predictions on new data.

Function:
```R
train(learner, task, subset, weights = NULL)
```
Param.:

* learner: [Learner | character(1)] The learner. If you pass a string the learner will be created via makeLearner.
* task: [Task] The task.
* subset: [integer | logical] Selected cases. Either a logical or an index vector. By default all observations are used.
* weights: [numeric] Optional, non-negative case weight vector to be used during fitting. If given, must be of same length as subset and in corresponding order. By default NULL which means no weights are used unless specified in the task (Task). Weights from the task will be overwritten.

Doc.: https://www.rdocumentation.org/packages/mlr/versions/2.10/topics/train

In [52]:
optimal_learner = setHyperPars(tune_result$learner, par.vals = tune_result$x)

Train on all the data to produce the final model.

In [53]:
model = train(optimal_learner, task)

<h4>Calculates Feature Importance Values For Trained Models</h4>

In [54]:
getFeatureImportance(model)

FeatureImportance:
Task: data

Learner: regr.ranger
Measure: NA
Contrast: NA
Aggregation: function (x)  x
Replace: NA
Number of Monte-Carlo iterations: NA
Local: FALSE
       V1       V2       V3      V4       V5
1 29081.2 5862.249 8224.299 2885.39 21566.24
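
Raw impurity importances are easier to compare as shares of their total. The sketch below copies the five values from the output above into a named vector by hand (rather than extracting them from the mlr result object) and ranks them; `V1` and `V5` dominate.

```R
# Values copied by hand from the getFeatureImportance() output above.
importance = c(V1 = 29081.2, V2 = 5862.249, V3 = 8224.299,
               V4 = 2885.39, V5 = 21566.24)

# Convert to shares of the total and sort from most to least important.
shares = sort(importance / sum(importance), decreasing = TRUE)

print(round(shares, 3))
```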

In [55]:
object.size(model)

59630992 bytes

In [56]:
print(model)

Model for learner.id=regr.ranger; learner.class=regr.ranger
Trained on: task.id = data; obs = 1503; features = 5
Hyperparameters: num.threads=1,verbose=FALSE,respect.unordered.factors=TRUE,seed=42,importance=impurity,num.trees=2811,mtry=3,min.node.size=7


<h2>Save final model</h2><br>

In [57]:
save(model, file = model_folder_name %+/% model_file_name)
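
To reuse the model later, `load()` restores the object under the name it was saved with (here `model`). The round trip is sketched below in base R, with a small `lm` fit and a temporary file standing in for the ranger model and the `../model` folder, so that it runs without mlr.

```R
# Stand-in model: a small linear fit on a built-in data set.
model = lm(dist ~ speed, data = cars)
expected = predict(model, newdata = data.frame(speed = 10))

# Save to a temporary file (stand-in for the model folder and file name).
model_path = tempfile(fileext = ".RData")
save(model, file = model_path)

# Remove the object, then restore it; load() returns the restored names.
rm(model)
restored_names = load(model_path)
print(restored_names)

# The restored object predicts exactly as before.
restored = predict(model, newdata = data.frame(speed = 10))
print(all.equal(expected, restored))
```

For the mlr model saved above, the mlr package should presumably be attached before calling `predict()` on the restored object.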