5.5 Random forest

Random forest models are accurate, non-linear and robust to over-fitting, which makes them quite popular. They do, however, require hyperparameters to be tuned manually, like the value k in the example above.

Building a random forest starts by generating a large number of individual decision trees. A single decision tree isn’t very accurate, but many different trees built using different inputs (with bootstrapped inputs, features and observations) make it possible to explore a broad search space and, once combined, produce accurate models, a technique called bootstrap aggregation or bagging.
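The bagging idea described above can be sketched by hand. This is a minimal illustration, not how a random forest package implements it: it assumes the built-in iris data as a stand-in dataset, uses rpart for the individual trees, and only bootstraps the observations (a real random forest also samples the features at each split). The names n_trees, trees, votes and bagged are chosen for this example.

```r
library("rpart")

set.seed(1)
data(iris)          ## stand-in dataset (assumption, not the chapter's data)
n <- nrow(iris)
n_trees <- 25       ## illustrative number of trees

## Fit one tree per bootstrap sample (rows drawn with replacement)
trees <- lapply(seq_len(n_trees), function(i) {
  idx <- sample(n, replace = TRUE)
  rpart(Species ~ ., data = iris[idx, ], method = "class")
})

## Each tree votes for a class; rows are observations, columns are trees
votes <- sapply(trees, function(tr)
  as.character(predict(tr, iris, type = "class")))

## The bagged prediction is the majority vote across trees
bagged <- apply(votes, 1, function(v) names(which.max(table(v))))

## Accuracy on the training data (optimistic, shown only as a sanity check)
mean(bagged == iris$Species)
```

Each tree sees a slightly different version of the data, so the trees disagree on the harder observations, and the vote averages out their individual errors.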



5.5.1 Decision trees

A great advantage of decision trees is that they make a complex decision simpler by breaking it down into smaller, simpler decisions using a divide-and-conquer strategy. They basically identify a set of if-else conditions that split the data according to the values of the features.

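library("mlbench")    ## provides the Sonar data, if not already loaded
data(Sonar)
library("rpart")      ## recursive partitioning
m <- rpart(Class ~ ., data = Sonar,
           method = "class")
library("rpart.plot")
rpart.plot(m)

## predict on the training data and tabulate against the true classes
p <- predict(m, Sonar, type = "class")
table(p, Sonar$Class)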

Decision trees choose the splits that produce the most homogeneous partitions, which leads to smaller and more homogeneous partitions over successive iterations.
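One common way to quantify how homogeneous a partition is, is the Gini impurity, which is 0 when all observations in a partition belong to the same class. The sketch below is an illustration under assumptions: it uses the built-in iris data as a stand-in, and gini() and split_impurity() are helper functions written for this example, not part of rpart.

```r
## Gini impurity of a vector of class labels: 1 - sum(p_k^2)
gini <- function(y) {
  if (length(y) == 0) return(0)   ## an empty partition contributes nothing
  p <- table(y) / length(y)
  1 - sum(p^2)
}

## Weighted impurity of the two partitions produced by x < threshold
split_impurity <- function(x, y, threshold) {
  left <- y[x < threshold]
  right <- y[x >= threshold]
  (length(left) * gini(left) + length(right) * gini(right)) / length(y)
}

data(iris)   ## stand-in dataset (assumption)

## Impurity before splitting: three equally sized classes
gini(iris$Species)

## Two candidate splits; the lower value is the more homogeneous one
split_impurity(iris$Petal.Length, iris$Species, 2.5)
split_impurity(iris$Sepal.Width, iris$Species, 3.0)
```

A tree-growing algorithm evaluates many such candidate splits and keeps the one with the lowest impurity, then repeats the process within each resulting partition.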

An issue with single decision trees is that they can grow large and complex, with many branches, which corresponds to over-fitting. Over-fitting models noise rather than general patterns in the data, focusing on subtle patterns (outliers) that won’t generalise.

To avoid over-fitting, individual decision trees are pruned. Pruning can happen as a pre-condition when growing the tree, or afterwards, by pruning a large tree.

Pre-pruning: stop the growing process, i.e. stop the divide-and-conquer after a certain number of iterations (growing the tree to a certain predefined level), or require a minimum number of observations in each node to allow splitting.

Post-pruning: grow a large and complex tree, and reduce its size; nodes and branches that have a negligible effect on the classification accuracy are removed.
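Both pruning strategies can be tried with rpart. The following is a minimal sketch, assuming the built-in iris data as a stand-in; the values chosen for maxdepth, minsplit and cp are arbitrary illustrations, and selecting the complexity parameter by lowest cross-validated error is one possible rule among several.

```r
library("rpart")
data(iris)     ## stand-in dataset (assumption)
set.seed(1)    ## the cross-validated errors in cptable are random

## Pre-pruning: limit the depth of the tree and require at least
## 20 observations in a node before attempting a split
pre <- rpart(Species ~ ., data = iris, method = "class",
             control = rpart.control(maxdepth = 2, minsplit = 20))

## Post-pruning: first grow a deliberately large tree ...
big <- rpart(Species ~ ., data = iris, method = "class",
             control = rpart.control(cp = 0, minsplit = 2))

## ... then cut it back at the complexity parameter that has the
## lowest cross-validated error
best_cp <- big$cptable[which.min(big$cptable[, "xerror"]), "CP"]
post <- prune(big, cp = best_cp)

## Number of nodes before and after pruning
nrow(big$frame)
nrow(post$frame)
```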

5.5.2 Training a random forest

Let’s return to random forests and train a model using the train infrastructure from caret:

library("caret")   ## provides train(); method = "ranger" needs the ranger package
set.seed(12)
model <- train(Class ~ .,
               data = Sonar,
               method = "ranger")
print(model)

plot(model)




The main hyperparameter is mtry, i.e. the number of randomly selected variables used at each split. With 2 variables, the models are more random, while with hundreds of variables they tend to be less random but risk over-fitting. caret automates the tuning of the hyperparameter using a grid search, which can be parametrised by setting tuneLength (the number of hyperparameter values to test) or by directly defining the tuneGrid (the hyperparameter values), which requires knowledge of the model.

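## let caret choose 5 candidate values to test
model <- train(Class ~ .,
               data = Sonar,
               method = "ranger",
               tuneLength = 5)

## or define the grid of hyperparameter values explicitly
set.seed(42)
myGrid <- expand.grid(mtry = c(5, 10, 20, 40, 60),
                      splitrule = c("gini", "extratrees"),
                      ## recent caret versions also expect this column for ranger
                      min.node.size = 1)
model <- train(Class ~ .,
               data = Sonar,
               method = "ranger",
               tuneGrid = myGrid,
               trControl = trainControl(method = "cv",
                                        number = 5,
                                        verboseIter = FALSE))
print(model)
plot(model)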
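To make the grid search less of a black box, the procedure that caret automates above can be sketched by hand. This is an illustration under assumptions rather than a description of caret's internals: it uses the built-in iris data and a single rpart tree instead of a ranger forest, tunes the rpart parameters maxdepth and minsplit instead of mtry, and the names grid, folds and cv_accuracy are made up for this example.

```r
library("rpart")
data(iris)     ## stand-in dataset (assumption)
set.seed(42)

## Every combination of candidate hyperparameter values
grid <- expand.grid(maxdepth = c(1, 2, 4),
                    minsplit = c(5, 20))

## Randomly assign each observation to one of 5 folds
folds <- sample(rep(1:5, length.out = nrow(iris)))

## Mean held-out accuracy over the 5 folds for one combination
cv_accuracy <- function(maxdepth, minsplit) {
  acc <- sapply(1:5, function(k) {
    fit <- rpart(Species ~ ., data = iris[folds != k, ], method = "class",
                 control = rpart.control(maxdepth = maxdepth,
                                         minsplit = minsplit))
    pred <- predict(fit, iris[folds == k, ], type = "class")
    mean(pred == iris$Species[folds == k])
  })
  mean(acc)
}

## Evaluate the whole grid and show the best combination
grid$accuracy <- mapply(cv_accuracy, grid$maxdepth, grid$minsplit)
grid[which.max(grid$accuracy), ]
```

The same folds are reused for every row of the grid, so differences in accuracy reflect the hyperparameters rather than the way the data happened to be split.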