# Oil analysis using random forest classification 
The essentials of a random forest are best illustrated by first introducing its decision trees. A decision tree predicts a label by repeatedly splitting the data into subsets that are as pure as possible. The picture shows measurements ($x$,$y$) that have been labelled "ocher" or "green". The outlined circle represents two identical measurements ($x$,$y$) that differ only in label. Therefore, the measurements ($x$,$y$) cannot be split into entirely pure sets. 

![image](figures/Oilanalysis_rf01.png).

The Classification And Regression Tree (CART) algorithm uses the Gini impurity to quantify the diversity of the labels in a set by:

$I=\sum_{i=1}^{n} p_i \times{(1-p_i)}$

where $n$ is the number of distinct labels in the set, here two (ocher, green), and $p_i$ is the proportion of label $i$ in the set. The proportion $p_i$ ranges from zero to one. The picture shows that a label $i$ does not contribute to the Gini impurity $I$ when $p_i=0$ or $p_i=1$, and that it contributes maximally when $p_i=1/2$.

![image](figures/Oilanalysis_rf06.png).
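The shape of this curve can also be checked directly. Setting the derivative of the contribution of a single label to zero gives the location of its maximum:

$\frac{d}{dp_i}\left[p_i \times{(1-p_i)}\right]=1-2p_i=0 \Rightarrow p_i=1/2$

At $p_i=1/2$ the contribution equals $1/2 \times{(1-1/2)}=1/4$, whereas at $p_i=0$ and $p_i=1$ it equals zero.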

The Gini impurity of the set of all measurements ($x$,$y$) is:

$I_{u}= p_{ocher} \times{(1-p_{ocher})}+p_{green} \times{(1-p_{green})}=3/7 \times{4/7} +4/7 \times{3/7}=24/49 \approx{0.49} $

The Gini impurity of the sets resulting from the partitioning by $x_1$ in the picture is given by:

$I_{\lt{x_1}}= p_{ocher} \times{(1-p_{ocher})}+p_{green} \times{(1-p_{green})}=2/2 \times{(1-2/2)}+0/2 \times{(1-0/2)}=0$

$I_{\ge{x_1}}= p_{ocher} \times{(1-p_{ocher})}+p_{green} \times{(1-p_{green})}=1/5 \times{(1-1/5)}+4/5 \times{(1-4/5)}=8/25 \approx{0.32}$
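The three impurities above can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `gini_impurity` and the way the sets are written out as lists of labels are illustrative choices, not taken from the demo script. Exact fractions are used so that the results can be compared with the hand calculation.

```python
from collections import Counter
from fractions import Fraction


def gini_impurity(labels):
    """Return the Gini impurity sum_i p_i * (1 - p_i) of a list of labels."""
    total = len(labels)
    counts = Counter(labels)
    return sum(Fraction(c, total) * (1 - Fraction(c, total)) for c in counts.values())


# Label counts as used in the hand calculation above.
all_labels = ["ocher"] * 3 + ["green"] * 4  # all measurements
left = ["ocher"] * 2                        # x < x_1
right = ["ocher"] * 1 + ["green"] * 4       # x >= x_1

print(gini_impurity(all_labels))  # 24/49
print(gini_impurity(left))        # 0
print(gini_impurity(right))       # 8/25
```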

The Table shows the Gini impurity $I$ of the two sets generated by each candidate split of the measurements ($x$,$y$) in the picture:

|       | Gini impurity $I_\lt$   | Gini impurity $I_\ge$   |  Gini gain $G$  |
|:-----:|------------------------:|------------------------:|----------------:|
|$x_0$  |                0.00     |                0.44     |           0.11  |
|$x_1$  |                0.00     |                0.32     |           0.26  |
|$x_2$  |                0.44     |                0.38     |           0.08  |
|$x_3$  |                0.48     |                0.00     |           0.15  |
|$x_4$  |                0.50     |                0.00     |           0.06  |
|$y$    |                0.48     |                0.50     |           0.01  |

The best split is the one that yields subsets that are large and pure. The Gini gain $G$ is just one of the optimisation criteria that may be used. It equals the impurity of all measurements $I_{u}$ minus the weighted sum of the impurities $I_{j}$ of the $m$ partitioned sets, where the weight $p_{j}$ is the proportion of the measurements that ends up in set $j$:

$G=I_{u}-\sum_{j=1}^{m} I_{j} \times{p_{j}}$

The Gini gain $G$ resulting from the split by $x_1$ is given by:

$G=24/49-(2/7 \times{0.00}+ 5/7 \times{0.32})=24/49-8/35 \approx{0.26}$
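The same calculation can be sketched in Python. The helper names `gini_impurity` and `gini_gain` are illustrative and are not taken from the demo script; exact fractions are used to make the comparison with the hand calculation easy.

```python
from collections import Counter
from fractions import Fraction


def gini_impurity(labels):
    """Return the Gini impurity sum_i p_i * (1 - p_i) of a list of labels."""
    total = len(labels)
    counts = Counter(labels)
    return sum(Fraction(c, total) * (1 - Fraction(c, total)) for c in counts.values())


def gini_gain(parent, subsets):
    """Impurity of the parent set minus the size-weighted impurity of its subsets."""
    n = len(parent)
    weighted = sum(Fraction(len(s), n) * gini_impurity(s) for s in subsets)
    return gini_impurity(parent) - weighted


parent = ["ocher"] * 3 + ["green"] * 4  # all measurements
left = ["ocher"] * 2                    # x < x_1
right = ["ocher"] + ["green"] * 4       # x >= x_1

gain = gini_gain(parent, [left, right])
print(gain)  # 64/245, which is about 0.26
```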

The Table shows the Gini gain $G$ of each split of the measurements ($x$,$y$). The split by $x_1$ has the largest Gini gain $G$, which means that it yields subsets that are relatively large and pure and therefore best separates the labels ("green", "ocher"). The picture below shows the decision tree with the largest Gini gain $G$:

![image](figures/Oilanalysis_rf02.png).
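Searching for the split with the largest Gini gain $G$ can be sketched as a scan over all candidate thresholds of one variable. The coordinates below are hypothetical: they were chosen only so that the order of the labels along $x$, including the two identical measurements, reproduces the label counts behind the impurities in the Table. The function names are illustrative as well.

```python
from collections import Counter


def gini_impurity(labels):
    """Return the Gini impurity sum_i p_i * (1 - p_i) of a list of labels."""
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())


def gini_gain(parent, subsets):
    """Impurity of the parent set minus the size-weighted impurity of its subsets."""
    n = len(parent)
    return gini_impurity(parent) - sum(len(s) / n * gini_impurity(s) for s in subsets)


def best_threshold(values, labels):
    """Try every threshold t that leaves at least one point on each side of `value < t`."""
    best_t, best_gain = None, -1.0
    for t in sorted(set(values))[1:]:
        left = [lab for v, lab in zip(values, labels) if v < t]
        right = [lab for v, lab in zip(values, labels) if v >= t]
        gain = gini_gain(labels, [left, right])
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain


# Hypothetical x coordinates; only their order and the tie at 4.0 matter.
xs = [1.0, 2.0, 3.0, 4.0, 4.0, 5.0, 6.0]
labels = ["ocher", "ocher", "green", "ocher", "green", "green", "green"]

t, gain = best_threshold(xs, labels)
print(t, round(gain, 2))  # 3.0 0.26
```

In this sketch the threshold 3.0 plays the role of $x_1$: it puts two "ocher" measurements on one side and the remaining five measurements on the other.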

As the set {$x \lt{x_1}$} is pure, i.e. $I_\lt=0$, further splitting is superfluous. However, partitioning the set {$x \ge{x_1}$} may further reduce the impurity. The picture shows the candidate splits:

![image](figures/Oilanalysis_rf03.png).

The Table shows the Gini impurity $I$ and the Gini gain $G$ of these splits of the set {$x \ge{x_1}$}:

|       | Gini impurity $I_{\ge{x_1};\lt}$   | Gini impurity $I_{\ge{x_1};\ge}$   |  Gini gain $G$  |
|:-----:|------------------------:|------------------------:|----------------:|
|$x_3$  |                0.44     |                0.00     |           0.06  |
|$x_4$  |                0.38     |                0.00     |           0.02  |
|$y$    |                0.38     |                0.00     |           0.06  |

A split over $x_3$ yields the largest Gini gain $G$. Therefore, extending the decision tree by a split over $x_3$ will best separate the labels ("green", "ocher").

![image](figures/Oilanalysis_rf04.png).

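Ultimately, the decision tree cannot be extended beyond the tree shown below, as the two identical measurements ($x$,$y$) that differ only in label cannot be separated: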

![image](figures/Oilanalysis_rf05.png).

Decision trees are known to be:
- insensitive to many scalings and transformations of the measurements
- insensitive to irrelevant data
- very sensitive to the composition of the training set

A random forest is a means to reduce the sensitivity to the composition of the sample of measurements while largely preserving the useful insensitivities of a decision tree. A random forest classification aggregates the classifications of many independently built decision trees, each of which has been built on a bootstrapped sample of the measurements. Aggregating models built on bootstrapped samples is known as *bagging*.

Step 1: Create a bootstrapped sample of the measurements ($x$,$y$). *Bootstrapping* is random sampling with replacement. Note that the bootstrapped sample typically does not contain all measurements ($x$,$y$). The unselected measurements are called *out of bag* measurements.

![image](figures/Oilanalysis_rf07.png).
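Bootstrapping can be sketched with the Python standard library. The function name `bootstrap_sample` and the fixed seed are illustrative assumptions; the demo script may well draw its samples differently.

```python
import random


def bootstrap_sample(n_measurements, rng):
    """Draw n indices with replacement; return them plus the out-of-bag indices."""
    in_bag = [rng.randrange(n_measurements) for _ in range(n_measurements)]
    out_of_bag = sorted(set(range(n_measurements)) - set(in_bag))
    return in_bag, out_of_bag


rng = random.Random(0)  # fixed seed so that the sketch is repeatable
in_bag, out_of_bag = bootstrap_sample(7, rng)

print("in bag:    ", sorted(in_bag))  # some indices usually occur more than once
print("out of bag:", out_of_bag)      # indices that were never drawn
```

For a large sample, each measurement is left out of a given bootstrapped sample with a probability of about $1/e \approx{0.37}$, so roughly a third of the measurements are out of bag for each tree.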

Step 2: Create a decision tree on the bootstrapped sample of the measurements ($x$,$y$), but consider only a randomly selected subset of the measured variables at each node of the tree. For example, only the variable $x$ is considered when assessing the best split for the root node of the tree using the Gini gain $G$ criterion. For this specific bootstrapped sample, the split at the root node reduces the Gini impurity $I$ of both subsets {$x\lt{x_1}$} and {$x\ge{x_1}$} to zero, so no further nodes are needed. Otherwise, another node would be added to the decision tree, again using a randomly selected subset of the variables. 

![image](figures/Oilanalysis_rf08.png).

By repeating step 1 and step 2 many times, a random forest of different decision trees is created. A new measurement ($x_8$,$y_8$) with an unknown label is then evaluated by each of the decision trees in the random forest. Each decision tree votes whether the measurement ($x_8$,$y_8$) is likely to be labelled "ocher" or "green". The *random forest classifier* then expresses the support for each label among the decision trees in the random forest.
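The voting can be sketched as follows. The three trees are hypothetical one-node trees with made-up thresholds, and the function name `forest_support` is illustrative; a real random forest would hold many trees built as in step 1 and step 2.

```python
from collections import Counter


def forest_support(trees, measurement):
    """Return the share of trees voting for each label."""
    votes = Counter(tree(measurement) for tree in trees)
    return {label: count / len(trees) for label, count in votes.items()}


# Three hypothetical one-node trees; each maps a measurement (x, y) to a label.
trees = [
    lambda m: "ocher" if m[0] < 3.0 else "green",
    lambda m: "ocher" if m[0] < 4.5 else "green",
    lambda m: "ocher" if m[1] < 2.0 else "green",
]

support = forest_support(trees, (4.0, 5.0))
print(support)                            # two of the three trees vote "green"
print(max(support, key=support.get))      # green
```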

To validate the random forest, the demo script partitions the sample into a training set and a test set. The oil measurements in the training set will be used to construct the random forest. The oil measurements in the test set will be evaluated by the random forest. As the labels of the measurements in the test set are in fact known, it becomes clear whether the random forest classifier correctly labelled these oil measurements.

![image](figures/Oilanalysis_rf09.png).

The picture shows that the random forest in the demo script correctly predicts the label "Age > 1 year" for most of the oil measurements. However, the 17th oil measurement in the test set has been labelled incorrectly by the random forest. So, the validity of the random forest becomes clear by comparing its predictions with the actual labels.

| Measurement | Contribution |
|:----:|:-------:|
|CU | 0.809958 |
|VIS99 | 0.024242 |
|VIS40 | 0.021000 |
|WATER | 0.013833 |

The demo script also shows how much each measurement contributes to pure subsets at the nodes of the trees in the random forest. The Table reveals that splits by the copper (CU) measurement yield subsets that are quite pure, whereas the other measurements do not contribute much. So, the demo script also shows which measurements seem to be good predictors of the label "Age > 1 year". 
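The validation and the ranking of the measurements can be sketched with scikit-learn. Several things here are assumptions: the text does not state which library the demo script uses, the data below are synthetic random numbers rather than the oil measurements, and reading the Table as impurity-based importances (`feature_importances_`) is an interpretation. Only the column names are borrowed from the Table.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
feature_names = ["CU", "VIS99", "VIS40", "WATER"]

# Synthetic stand-in for the oil data: only the first column drives the label.
X = rng.uniform(size=(200, len(feature_names)))
y = (X[:, 0] > 0.5).astype(int)

# Partition the sample into a training set and a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# Share of correctly labelled measurements in the test set.
accuracy = forest.score(X_test, y_test)
# Impurity-based importance of each column; the values sum to one.
importances = forest.feature_importances_

print(f"test accuracy: {accuracy:.2f}")
for name, value in sorted(zip(feature_names, importances), key=lambda p: -p[1]):
    print(f"{name:6s} {value:.3f}")
```

Because the synthetic label depends only on the first column, that column should dominate the importances, much as copper dominates the Table.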

# [Click here to see the random forest script](https://nbviewer.jupyter.org/github/chrisrijsdijk/RAMS/blob/master/notebook/Oilanalysis_randomforest.ipynb?flush_cache=true)