# Random Forest


<img src="./images/forest.jpg" style="width:100%"/>

Random forest is a supervised machine learning algorithm that is widely used for classification and regression problems.

One of its most important features is that it can handle both continuous targets (regression) and categorical targets (classification). It usually performs better on classification problems.

### How does it work?

The random forest algorithm creates and trains multiple decision trees, then combines their outputs into a single response: the majority vote for classification, or the average value for regression.
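The idea can be sketched by hand in a few lines. The snippet below is only an illustration, not the library's implementation: it assumes scikit-learn's built-in copy of the iris data (`load_iris`), an arbitrary choice of 25 trees, and a fixed seed. Each tree is trained on a bootstrap sample of the rows, and the final prediction is the majority vote.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Built-in copy of the iris data, used here only for illustration.
X_demo, y_demo = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

trees = []
for _ in range(25):
    # Bootstrap sample: draw row indices with replacement.
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    # max_features="sqrt" makes each split consider a random subset of features.
    tree = DecisionTreeClassifier(
        max_features="sqrt", random_state=int(rng.integers(1_000_000))
    )
    trees.append(tree.fit(X_demo[idx], y_demo[idx]))

# Every tree votes for a class; the most frequent class per row wins.
votes = np.stack([t.predict(X_demo) for t in trees]).astype(int)  # (n_trees, n_rows)
forest_pred = np.array([np.bincount(col).argmax() for col in votes.T])

print("accuracy of the vote on the training rows:", (forest_pred == y_demo).mean())
```

Note that scikit-learn's `RandomForestClassifier` averages the class probabilities predicted by the trees instead of counting hard votes, so its results can differ slightly from this sketch.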


### What is a Decision Tree? 

At a high level, decision trees are algorithms that can be used for both classification and regression tasks. They provide a model that makes predictions based on a set of labeled training data.

A decision tree consists of several layers of IF-THEN-ELSE "forks" that are generated automatically during the training phase to fit the provided data.
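To make the IF-THEN-ELSE structure visible, the sketch below trains a deliberately shallow tree and prints its rules as text. It assumes scikit-learn's built-in iris data and a depth limit of 2, both chosen only to keep the output short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree so that the printed rules stay readable.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# export_text renders the fitted tree as nested threshold conditions.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

Each indented line is one "fork": a threshold test on a feature, followed either by further tests or by the predicted class.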


## Advantages
- A decision tree is easy to understand and interpret.
- Expert opinion and preferences can be included, as well as hard data.
- Can be used with other decision techniques.
- New scenarios can easily be added.

## Disadvantages
- If a decision tree is used with categorical variables that have multiple levels, information gain is biased in favor of the variables with more levels.

- Calculations can quickly become very complex, although this is usually only a problem if the tree is being created by hand.


## When should I consider using RandomForest?

#### Used only for tabular data

Random forests work on structured (tabular) data, meaning data that can be represented
as a CSV file or a similar row-and-column format. Unstructured data such as images, speech or text
cannot be fed to a random forest directly and will most likely require some form of neural network,
for example a CNN or an RNN.

#### Data requirements

Although "by the book" random forests should be able to handle both missing and nominal
data, the library that is widely used in the Python world (scikit-learn) does not fully
follow this, so before using it we need to:

- Clean missing values (either by removing them or by imputing them)
- Convert nominal data to a numeric (for example one-hot) encoding

Unlike neural networks, random forests do not need the data to be normalized or scaled.
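The two cleaning steps could look like the following sketch. The data frame and its column names (`height_cm`, `color`) are made up for this illustration, and filling with the median is just one of several reasonable choices.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for this sketch.
raw = pd.DataFrame({
    "height_cm": [170.0, np.nan, 182.0, 165.0],
    "color": ["red", "blue", "red", "green"],
})

# 1. Fill missing numeric values with the column median
#    (dropping the affected rows with dropna() is the alternative).
clean = raw.copy()
clean["height_cm"] = clean["height_cm"].fillna(clean["height_cm"].median())

# 2. One-hot encode the nominal column so that every feature is numeric.
clean = pd.get_dummies(clean, columns=["color"], dtype=int)

print(clean)
```

No scaling step is applied, because tree splits depend only on the ordering of the values within each feature.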


### Example

The iris data set that we have already solved using a neural network is ideal
for a simple example of random forest as well. The main reason is that
all the features are numeric and the output is a class (in other words,
the task is classification), so we can use the dataset as is, without any
feature preprocessing.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import numpy as np

In [2]:
# Load the data and shuffle the rows.
df = pd.read_csv("./data/iris-dataset.csv")
df = df.sample(frac=1)
display(df)

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
35,36,5.0,3.2,1.2,0.2,setosa
37,38,4.9,3.1,1.5,0.1,setosa
39,40,5.1,3.4,1.5,0.2,setosa
148,149,6.2,3.4,5.4,2.3,virginica
127,128,6.1,3.0,4.9,1.8,virginica
...,...,...,...,...,...,...
28,29,5.2,3.4,1.4,0.2,setosa
111,112,6.4,2.7,5.3,1.9,virginica
64,65,5.6,2.9,3.6,1.3,versicolor
40,41,5.0,3.5,1.3,0.3,setosa


## Convert targets to categorical

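Since the target is a nominal value, we convert it to a one-hot (categorical)
encoding with one 0/1 column per species. Scikit-learn classifiers can also be
trained directly on string labels; with the one-hot form used here the random
forest is trained as a multi-output classifier.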

In [3]:
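column_name = "Species"
# One-hot encode the target: one 0/1 column per species.
# dtype=int keeps the 1/0 output shown below (newer pandas versions default to True/False).
dummies = pd.get_dummies(df[column_name], prefix=column_name, dtype=int)
# Append the new columns, then drop the original target and the Id column.
new_frame = pd.concat([df, dummies], axis=1)
df = new_frame.drop(columns=[column_name, "Id"])
df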

Unnamed: 0,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species_setosa,Species_versicolor,Species_virginica
35,5.0,3.2,1.2,0.2,1,0,0
37,4.9,3.1,1.5,0.1,1,0,0
39,5.1,3.4,1.5,0.2,1,0,0
148,6.2,3.4,5.4,2.3,0,0,1
127,6.1,3.0,4.9,1.8,0,0,1
...,...,...,...,...,...,...,...
28,5.2,3.4,1.4,0.2,1,0,0
111,6.4,2.7,5.3,1.9,0,0,1
64,5.6,2.9,3.6,1.3,0,1,0
40,5.0,3.5,1.3,0.3,1,0,0


The following picture can help us understand the objective of our model now that we have expanded the nominal target into numeric columns:

<img src="./images/iris-input-output.png" style="width:60%"/>


## Separate the inputs and the outputs

The inputs (also known as features) must be in their own
data frame, and the same applies to the outputs (labels).

In [4]:
output_cols = ["Species_setosa", "Species_versicolor", "Species_virginica"]
X = df.copy()
Y = X[output_cols].copy()
X = X.drop(output_cols, axis=1)

display(X)
display(Y)

Unnamed: 0,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm
35,5.0,3.2,1.2,0.2
37,4.9,3.1,1.5,0.1
39,5.1,3.4,1.5,0.2
148,6.2,3.4,5.4,2.3
127,6.1,3.0,4.9,1.8
...,...,...,...,...
28,5.2,3.4,1.4,0.2
111,6.4,2.7,5.3,1.9
64,5.6,2.9,3.6,1.3
40,5.0,3.5,1.3,0.3


Unnamed: 0,Species_setosa,Species_versicolor,Species_virginica
35,1,0,0
37,1,0,0
39,1,0,0
148,0,0,1
127,0,0,1
...,...,...,...
28,1,0,0
111,0,0,1
64,0,1,0
40,1,0,0


## Split the data into train and test sets

The training data (consisting of both X and Y values) are used to train the algorithm,
while the testing data are never seen by the training algorithm; we use them
to see how well our model performs.

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)
print("Number of rows")
print("Train", X_train.shape[0])
print("Test", X_test.shape[0])

Number of rows
Train 120
Test 30
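Because the split is random, the rows that end up in each set change from run to run. A minimal sketch of a reproducible, class-balanced split follows. It assumes scikit-learn's built-in iris data with integer labels, and the seed value `42` is arbitrary.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_iris(return_X_y=True)

# random_state fixes the shuffle; stratify keeps the class proportions
# the same in the train and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("Train", X_tr.shape[0], "Test", X_te.shape[0])
print("Test rows per class:", np.bincount(y_te))
```

With 50 rows per species and a 20% test set, stratification puts exactly 10 rows of each species in the test set.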


## Create and train the model

In [6]:
clf = RandomForestClassifier(n_estimators=100, max_features='sqrt')
clf.fit(X_train, y_train.values)

RandomForestClassifier(max_features='sqrt')

## Find the accuracy of the model

In [7]:
accuracy_rf = clf.score(X_test, y_test)
print(f'accuracy: {accuracy_rf}')

accuracy: 0.9666666666666667
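When the targets have several columns, as with a one-hot encoding, `score` reports subset accuracy: a row counts as correct only if all of its columns match. The sketch below checks this by hand and maps the predicted rows back to species names. It rebuilds a comparable setup from scikit-learn's built-in iris data; the names `model`, `X_demo` and `Y_demo` are introduced only for this illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
X_demo = pd.DataFrame(iris.data, columns=iris.feature_names)
species = pd.Series(iris.target_names[iris.target], name="Species")
Y_demo = pd.get_dummies(species, prefix="Species", dtype=int)

X_tr, X_te, Y_tr, Y_te = train_test_split(X_demo, Y_demo, test_size=0.2, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, Y_tr)

# predict() returns one 0/1 column per species, in the column order of Y_demo.
pred = pd.DataFrame(model.predict(X_te), columns=Y_demo.columns, index=Y_te.index)

# A row is correct only when all three columns agree with the truth.
subset_acc = model.score(X_te, Y_te)
manual_acc = (pred.values == Y_te.values).all(axis=1).mean()
print(subset_acc, manual_acc)

# Map each one-hot row back to a species name.
# Caveat: a row predicted as all zeros would fall back to the first column.
pred_species = pred.idxmax(axis=1).str.replace("Species_", "", regex=False)
print(pred_species.head())
```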


## A visual representation of our model

<img src="./images/iris-decission-tree.png" style="width:60%"/>
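A picture like this can be produced from a single tree of a fitted forest. The sketch below is one possible approach and not necessarily how the image above was made: it assumes scikit-learn's built-in iris data, trains a small forest, and exports its first tree as Graphviz DOT text, which can then be rendered with any Graphviz tool.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_graphviz

iris = load_iris()
forest = RandomForestClassifier(n_estimators=10, random_state=0)
forest.fit(iris.data, iris.target)

# A fitted forest keeps its individual trees in estimators_.
first_tree = forest.estimators_[0]

# out_file=None returns the DOT description as a string instead of writing a file.
dot_source = export_graphviz(
    first_tree,
    out_file=None,
    feature_names=list(iris.feature_names),
    class_names=list(iris.target_names),
    filled=True,
)
print(dot_source[:200])
```

Every tree in the forest is different, so a single drawing shows only one of the voters rather than the whole model.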