# Oil analysis using random forest classification 
The essentials of a random forest are best illustrated by introducing its decision trees first. A decision tree shows how well a partitioning of the data predicts a label that has been assigned to the data. The picture shows the measurements ($x$,$y$) that have been labelled "ocher" or "green". The outlined circle represents two identical measurements ($x$,$y$) that differ in label. Therefore, the data cannot be partitioned into sets of identical labels.

![image](figures/Oilanalysis_rf01.png)

The CART algorithm (Classification And Regression Tree) uses the Gini impurity to quantify the diversity of the labels within a set by:

$I=\sum_{i=1}^{n} p_i \times{(1-p_i)}$

where $i$ indexes the $n$ distinct labels in a set, i.e. (ocher, green), and $p_i$ is the proportion of label $i$ in the set. The Gini impurity of the set of all measurements ($x$,$y$) is:

$I_{u}= p_{ocher} \times{(1-p_{ocher})}+p_{green} \times{(1-p_{green})}=3/7 \times{4/7} +4/7 \times{3/7}=24/49 \approx{0.49} $
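
The impurity formula can be checked with a few lines of Python. This is a minimal sketch, not the code of the linked notebook: the function name `gini_impurity` is made up for illustration, and the label counts (3 "ocher", 4 "green") are taken from the worked example above. Exact fractions are used so that the result can be compared with $24/49$ directly.

```python
from collections import Counter
from fractions import Fraction


def gini_impurity(labels):
    """Return the Gini impurity sum_i p_i * (1 - p_i) of a list of labels."""
    total = len(labels)
    if total == 0:
        # Convention chosen for this sketch: an empty set counts as pure.
        return Fraction(0)
    counts = Counter(labels)
    return sum(
        (Fraction(c, total) * (1 - Fraction(c, total)) for c in counts.values()),
        Fraction(0),
    )


# Label counts from the worked example: 3 "ocher" and 4 "green".
all_labels = ["ocher"] * 3 + ["green"] * 4
impurity_all = gini_impurity(all_labels)
print(impurity_all)  # exact value: 24/49
```

A set that contains a single label yields an impurity of $0$, and a set with two equally frequent labels yields the maximum of $1/2$.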

The Gini impurity of the sets resulting from the partitioning by $x_1$ in the picture is given by:

$I_{\lt{x_1}}= p_{ocher} \times{(1-p_{ocher})}+p_{green} \times{(1-p_{green})}=2/2 \times{(1-2/2)}+0/2 \times{(1-0/2)}=0$

$I_{\ge{x_1}}= p_{ocher} \times{(1-p_{ocher})}+p_{green} \times{(1-p_{green})}=1/5 \times{(1-1/5)}+4/5 \times{(1-4/5)}=8/25 \approx{0.32}$

The table shows the Gini impurity $I$ of the two sets generated by each binary partitioning of the measurements ($x$,$y$) in the picture:

|       | Gini impurity $I_\lt$   | Gini impurity $I_\ge$   |  Gini gain $G$  |
|:-----:|------------------------:|------------------------:|----------------:|
|$x_1$  |                0.00     |                0.32     |           0.26  |
|$x_2$  |                0.44     |                0.38     |           0.08  |
|$x_3$  |                0.48     |                0.00     |           0.15  |
|$x_4$  |                0.50     |                0.00     |           0.06  |
|$y_1$  |                0.48     |                0.50     |           0.01  |

The Gini gain $G$ equals the impurity of all measurements $I_{u}$ minus the sum of the impurities $I_{j}$ of the $m$ partitioned sets, each weighted by the proportion $p_{j}$ of the measurements that falls in set $j$:

$G=I_{u}-\sum_{j=1}^{m} I_{j} \times{p_{j}}$

The Gini gain $G$ resulting from the partitioning by $x_1$ is given by:

$G=24/49-(2/7 \times{0}+ 5/7 \times{8/25})=24/49-8/35=64/245 \approx{0.26}$
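
The Gini gain of the partitioning by $x_1$ can be verified in the same way. The sketch below is an illustration under stated assumptions: the helper names `gini_impurity` and `gini_gain` are hypothetical, and the label counts of the two subsets (2 "ocher" below $x_1$; 1 "ocher" and 4 "green" from $x_1$ upwards) are read from the impurity calculations above.

```python
from collections import Counter
from fractions import Fraction


def gini_impurity(labels):
    """Return the Gini impurity of a list of labels as an exact fraction."""
    total = len(labels)
    if total == 0:
        return Fraction(0)  # convention for this sketch: empty set is pure
    counts = Counter(labels)
    return sum(
        (Fraction(c, total) * (1 - Fraction(c, total)) for c in counts.values()),
        Fraction(0),
    )


def gini_gain(parent, subsets):
    """Impurity of the parent minus the size-weighted impurities of the subsets."""
    total = len(parent)
    weighted = sum(
        (Fraction(len(s), total) * gini_impurity(s) for s in subsets),
        Fraction(0),
    )
    return gini_impurity(parent) - weighted


parent = ["ocher"] * 3 + ["green"] * 4
left = ["ocher"] * 2                     # measurements with x < x_1
right = ["ocher"] * 1 + ["green"] * 4    # measurements with x >= x_1
gain_x1 = gini_gain(parent, [left, right])
print(gain_x1, round(float(gain_x1), 2))  # exact value: 64/245
```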

The table shows the Gini gain $G$ of the sets generated by all binary partitionings of the measurements ($x$,$y$). The partitioning by $x_1$ shows the largest Gini gain $G$. This means that the decision tree that partitions by $x_1$ best separates the labels ("green", "ocher") into different sets. As a decision tree should predict the label from the measurements ($x$,$y$) as well as possible, the CART algorithm selects the partitioning with the largest Gini gain.
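
Selecting the partitioning with the largest Gini gain amounts to a search over candidate thresholds. The sketch below illustrates this along the $x$ axis only. The coordinates are hypothetical (the picture's actual values are not given in the text), and the order of the labels is an assumption chosen to be consistent with the impurities in the table. The function name `best_threshold` is also made up for illustration.

```python
from collections import Counter


def gini_impurity(labels):
    """Return the Gini impurity of a list of labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return sum((c / total) * (1 - c / total) for c in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, gain) of the split `value < threshold` with the largest Gini gain."""
    parent_impurity = gini_impurity(labels)
    distinct = sorted(set(values))
    best = (None, 0.0)
    # Candidate thresholds are the midpoints between consecutive distinct values.
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [lab for v, lab in zip(values, labels) if v < threshold]
        right = [lab for v, lab in zip(values, labels) if v >= threshold]
        weighted = (
            len(left) * gini_impurity(left) + len(right) * gini_impurity(right)
        ) / len(labels)
        gain = parent_impurity - weighted
        if gain > best[1]:
            best = (threshold, gain)
    return best


# Hypothetical x coordinates; the two measurements at x = 3.0 mimic the
# identical measurements that differ in label.
x = [1.0, 1.0, 2.0, 3.0, 3.0, 4.0, 5.0]
labels = ["ocher", "ocher", "green", "ocher", "green", "green", "green"]
threshold, gain = best_threshold(x, labels)
print(threshold, round(gain, 2))
```

With these assumed coordinates, the first threshold (between the two "ocher" measurements and the rest) yields the largest gain, in line with the row for $x_1$ in the table.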


The picture shows a random forest of two decision trees that predict the label "green" by partitioning a training set of measurements ($x$,$y$). It illustrates that various decision trees may be selected and that the choice of the decision tree determines the prediction of the label.

Similarly, it can be shown that the Gini gain of the decision tree on the left-hand side equals $12/25$, as this decision tree yields data sets with impurity $0$. The Gini gain of the decision tree on the right-hand side exceeds that of the left-hand side, which provides a criterion to compare the decision trees in a random forest.
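
That the choice of the decision tree determines the predicted label can be illustrated with two hand-written trees. This is a sketch under assumptions: `tree_a`, `tree_b` and their thresholds are hypothetical and do not reproduce the trees in the picture. Each tree applies a single partitioning, and the forest simply counts the labels that its trees predict; a random forest typically reports the label with the most votes.

```python
from collections import Counter


def tree_a(x, y):
    """Hypothetical decision tree that partitions on x only."""
    return "ocher" if x < 1.5 else "green"


def tree_b(x, y):
    """Hypothetical decision tree that partitions on y only."""
    return "green" if y < 2.0 else "ocher"


def forest_votes(trees, x, y):
    """Count the labels predicted by each tree for one measurement (x, y)."""
    return Counter(tree(x, y) for tree in trees)


forest = [tree_a, tree_b]
votes_disagree = forest_votes(forest, 1.0, 1.0)  # the two trees predict different labels
votes_agree = forest_votes(forest, 3.0, 1.0)     # both trees predict "green"
print(votes_disagree)
print(votes_agree)
```

With only two trees a tie is possible, which is one reason why practical random forests combine many trees.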


### ... is insensitive to many scalings and transformations
An ...


Here, 

### ... is insensitive to irrelevant data
Dependencies
 

### ... is sensitive to the composition of the training set
..

# [Click here to see the random forest script](https://nbviewer.jupyter.org/github/chrisrijsdijk/RAMS/blob/master/notebook/Oilanalysis_randomforest.ipynb?flush_cache=true)