In [1]:
from IPython.display import HTML
css_file = './custom.css'
HTML(open(css_file, "r").read())

# Random Forests

© 2018 Daniel Voigt Godoy

In [2]:
from intuitiveml.supervised.classification.RandomForest import *
from intuitiveml.utils import gen_button

## 1. Definition

From the Scikit-Learn [website](https://scikit-learn.org/stable/modules/ensemble.html):

    The goal of ensemble methods is to combine the predictions of several base estimators built with a given learning algorithm in order to improve generalizability / robustness over a single estimator.
    
    Two families of ensemble methods are usually distinguished:
    
    In averaging methods, the driving principle is to build several estimators independently and then to average their predictions. On average, the combined estimator is usually better than any of the single base estimator because its variance is reduced.
    Examples: Bagging methods, Forests of randomized trees, …

    By contrast, in boosting methods, base estimators are built sequentially and one tries to reduce the bias of the combined estimator. The motivation is to combine several weak models to produce a powerful ensemble.
    Examples: AdaBoost, Gradient Tree Boosting, …
    
Let's stick with the ***first*** family for now, which ***Random Forests*** are a member of.

The whole idea behind ***ensemble methods*** is to profit from the ***wisdom of the crowds***. In simple terms, they try to average out the error by using multiple estimators (models).
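To make "averaging out the error" a bit more concrete, here is a small worked equation. It rests on an idealized assumption that is not part of the text above: the $n$ models make errors $\varepsilon_1, \dots, \varepsilon_n$ that all have the same variance $\sigma^2$ and are uncorrelated with each other. The symbols $n$, $\varepsilon_i$ and $\sigma^2$ are introduced here for illustration only.

$$
\operatorname{Var}\left(\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i\right)
= \frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var}(\varepsilon_i)
= \frac{\sigma^2}{n}
$$

Under these assumptions, the variance of the averaged error shrinks as more models are added. The assumption of uncorrelated errors is doing a lot of work here.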

So, it is as easy as training a whole ***bunch of Decision Trees*** and ***averaging their predictions***! There is a ***catch***, though...

If you train multiple trees on the same dataset, they will be, er... ***all the same***! How do we solve that?
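A quick way to see the problem is to train a few Scikit-Learn trees on exactly the same data and compare their outputs. This is only a sketch: the 10 points below are made-up toy data (one feature, two classes), not the dataset used in this notebook. With a single feature there is nothing left for the random seed to shuffle, so the trees should come out identical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: 10 points, one numerical feature, two classes
X = np.array([[0.5], [1.0], [1.5], [2.0], [3.0], [3.5], [4.0], [5.0], [5.5], [6.0]])
y = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 1])

# Train five trees on exactly the same data (only the seed differs)
trees = [DecisionTreeClassifier(max_depth=3, random_state=seed).fit(X, y)
         for seed in range(5)]

# Probability of class 1 from each tree, evaluated on a grid of points
grid = np.linspace(0, 6.5, 50).reshape(-1, 1)
probas = np.array([tree.predict_proba(grid)[:, 1] for tree in trees])

# If every tree makes the same predictions, averaging them changes nothing
all_identical = bool(np.all(probas == probas[0]))
print(all_identical)
```

Averaging five copies of the same model gives back that same model, so some source of variation between the trees is needed.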

### 1.1 Bagging

#### ***BAGGing*** stands for ***B***ootstrapping and ***AGG***regat***ing***.

First, we ***bootstrap***, which is just a fancy word for ***sampling with replacement***.

Resampling a dataset while keeping its original size yields a dataset that is similar to the original one, but not identical.

***NOTE***: Some of the items in the original dataset will ***not make it*** into the resampled dataset. These are the ***out-of-bag (OOB)*** samples, and they can be handy for evaluating the model later!
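The bootstrap step and the resulting out-of-bag samples can be sketched with NumPy alone. The dataset size `n = 1000` and the seed are arbitrary choices made for this illustration. For a dataset of size $n$, each item is left out of a bootstrap sample with probability $(1 - 1/n)^n$, which approaches $1/e \approx 0.368$ as $n$ grows, so roughly a third of the items should end up out-of-bag.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, for reproducibility only
n = 1000                         # arbitrary dataset size for this sketch
indices = np.arange(n)

# Bootstrap: draw n indices with replacement (same size as the original)
boot = rng.choice(indices, size=n, replace=True)

# Out-of-bag: the indices that were never drawn
oob = np.setdiff1d(indices, boot)
oob_fraction = len(oob) / n

print(f"unique in-bag items: {len(np.unique(boot))}")
print(f"out-of-bag items:    {len(oob)} ({oob_fraction:.1%})")
```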

So, if we train a Decision Tree on this resampled dataset, it will yield a ***slightly different model***. We are ***injecting randomness*** into the process!

If we repeat this several times, we'll have several slightly different resampled datasets and, therefore, as many slightly different models.

Next, we ***aggregate*** them, which is just ***averaging their predictions***.

This is ***BAGGING*** in a nutshell! 
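The two steps, bootstrapping and aggregating, can be put together by hand in a few lines. This is a minimal sketch and not the code behind this notebook: the function names `fit_bagged_trees`, `positive_proba` and `predict_proba_bagged` are hypothetical, and so is the toy data. In practice, Scikit-Learn's own ensemble classes do this for you.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def fit_bagged_trees(X, y, n_trees=10, max_depth=3, seed=0):
    """Bootstrap: fit each tree on its own resampled copy of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)  # row indices drawn with replacement
        tree = DecisionTreeClassifier(max_depth=max_depth)
        trees.append(tree.fit(X[idx], y[idx]))
    return trees


def positive_proba(tree, X, positive=1):
    """Probability of the positive class, even if a tree never saw it."""
    classes = list(tree.classes_)
    if positive not in classes:
        return np.zeros(len(X))
    return tree.predict_proba(X)[:, classes.index(positive)]


def predict_proba_bagged(trees, X):
    """Aggregate: average the positive-class probability over all trees."""
    return np.mean([positive_proba(t, X) for t in trees], axis=0)


# Hypothetical toy data: one feature, labels 0 (red) and 1 (green)
X = np.array([[0.5], [1.0], [1.5], [2.0], [3.0], [3.5], [4.0], [5.0], [5.5], [6.0]])
y = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 1])

trees = fit_bagged_trees(X, y, n_trees=25, max_depth=3)
avg_proba = predict_proba_bagged(trees, X)
print(np.round(avg_proba, 2))
```

The `positive_proba` helper is there because a small bootstrap sample can, by chance, contain only one class, in which case that tree reports a single probability column.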

P.S.: There is actually more to it, as it is also possible to ***randomize the features used at each split*** (remember the Decision Tree notebook), for instance. Check [Scikit-Learn](https://scikit-learn.org/stable/modules/ensemble.html#random-forests) for more information on Random Forests.
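For reference, here is roughly how this looks with Scikit-Learn's `RandomForestClassifier`, which combines bootstrapping with feature randomization at each split. The synthetic dataset and the parameter values below are arbitrary choices for illustration, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used here only for illustration
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,      # number of trees in the forest
    max_features="sqrt",   # number of features considered at each split
    bootstrap=True,        # train each tree on a bootstrap sample
    oob_score=True,        # score the forest on the out-of-bag samples
    random_state=0,
)
forest.fit(X, y)

print(f"number of trees: {len(forest.estimators_)}")
print(f"OOB accuracy:    {forest.oob_score_:.3f}")
```

Setting `oob_score=True` uses the out-of-bag samples as a built-in validation set, so no separate hold-out split is needed for a first estimate of accuracy.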

![](https://imgs.xkcd.com/comics/ensemble_model.png)
<center>Source: <a href="https://xkcd.com/1885/">XKCD</a></center>

## 2. Experiment

Time to try it yourself!

You have 10 data points of two colors: red and green! These are your ***labels***.

Each point has a single numerical coordinate. This is your ***feature***.

The sliders below allow you to train one (shown as zero in the slider) or ***multiple Decision Trees*** and choose the ***maximum depth*** they are allowed to have.

Each tree outputs a probability (refer to the ***Understanding its Nodes*** section of the ***Decision Tree*** notebook for more details) associated with its predicted class.

The average of these probabilities is then shown in the plot.
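As a tiny numerical illustration of this averaging, suppose three trees each output a probability of "green" for two points. The numbers below are made up, and the 0.5 threshold used to turn the average into a color is an assumption of this sketch, not something taken from the plot's code.

```python
import numpy as np

# Hypothetical probabilities of "green": one row per tree, one column per point
tree_probas = np.array([
    [0.9, 0.2],   # tree 1
    [0.4, 0.1],   # tree 2
    [0.8, 0.6],   # tree 3
])

# Average over the trees (rows) to get one probability per point
avg = tree_probas.mean(axis=0)

# Assumed decision rule: call a point "green" if the average is at least 0.5
labels = np.where(avg >= 0.5, "green", "red")

print(np.round(avg, 2))   # approximately [0.7 0.3]
print(labels)
```

Note that the individual trees disagree on both points, yet the averaged probability still gives a single answer for each.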

Use the sliders to play with different configurations and answer the questions below.

In [3]:
x, y = data()
mydt = plotDecision(x=x, y=y, idx_mid=0)

In [4]:
vb = VBox(build_figure_ensemble(mydt), layout={'align_items': 'center'})
vb


#### Questions

1. What happens if you train 5 trees (keeping depth 3)?
2. How many trees (keeping depth 3) do you need to train to get ***all points correctly classified***? Why?
3. How many trees (with depth 1) do you need to train to get ***all points correctly classified***? Why?
4. Considering an individual tree, what's the main ***difference*** between a shallow (depth = 1) and a deep (depth = 3) tree?
5. Which one is better to use in Random Forests, ***shallow*** or ***deep*** trees? Why?

## 3. Scikit-Learn

[Ensemble methods](https://scikit-learn.org/stable/modules/ensemble.html)

[Probability calibration](https://scikit-learn.org/stable/modules/calibration.html)

Please check Aurélien Géron's "Hands-On Machine Learning with Scikit-Learn and TensorFlow" notebook on Ensemble Methods [here](http://nbviewer.jupyter.org/github/ageron/handson-ml/blob/master/07_ensemble_learning_and_random_forests.ipynb).

## 4. More Resources

[InfoGraphic](https://github.com/Avik-Jain/100-Days-Of-ML-Code/blob/master/Info-graphs/Day%2033.jpg)

#### This material is copyright Daniel Voigt Godoy and made available under the Creative Commons Attribution (CC-BY) license ([link](https://creativecommons.org/licenses/by/4.0/)). 

#### Code is also made available under the MIT License ([link](https://opensource.org/licenses/MIT)).

In [5]:
from IPython.display import HTML
HTML('''<script>
  function code_toggle() {
    if (code_shown){
      $('div.input').hide('500');
      $('#toggleButton').val('Show Code')
    } else {
      $('div.input').show('500');
      $('#toggleButton').val('Hide Code')
    }
    code_shown = !code_shown
  }

  $( document ).ready(function(){
    code_shown=false;
    $('div.input').hide()
  });
</script>
<form action="javascript:code_toggle()"><input type="submit" id="toggleButton" value="Show Code"></form>''')