## Overview

A random forest is made up of decision trees. A decision tree segments the predictor space into a number of simple regions. To make a prediction for a given observation, we take the mean (for regression) or the mode (for classification) of the training observations in the region to which it belongs.
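To make the idea of regions concrete, here is a minimal sketch that assumes a single hand-picked split on one predictor. The data, the threshold of 5 and the function name `predict_from_regions` are illustrative assumptions, not part of the implementation used later.

```python
import numpy as np

# Hypothetical training data: one predictor x and a continuous outcome y.
x_train = np.array([1.0, 2.0, 3.0, 6.0, 7.0, 8.0])
y_train = np.array([10.0, 12.0, 14.0, 30.0, 32.0, 34.0])

# An assumed split point that divides the predictor space into two regions.
threshold = 5.0


def predict_from_regions(x_new):
    """Predict with the mean outcome of the training rows in the same region."""
    in_left_region = x_train < threshold
    if x_new < threshold:
        return float(y_train[in_left_region].mean())
    return float(y_train[~in_left_region].mean())


left_prediction = predict_from_regions(2.5)   # falls in the region x < 5
right_prediction = predict_from_regions(9.0)  # falls in the region x >= 5
print(left_prediction, right_prediction)
```

For a classification problem the same lookup would return the most common class in the region instead of the mean.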

The set of rules used to define each region can be summarised as a tree, hence the name. A random forest grows many such trees and takes the mean or mode of the predictions across the trees, which typically improves predictive performance compared to a single decision tree.

If this all makes sense, you can safely jump to the implementation section. If not, don't worry: we will step through each part of the random forest in the following sections.

## Introducing some terminology

We've learned that a random forest is made up of decision trees. How many trees are in the forest? The number of decision trees is chosen by the user. The choice matters and often comes down to a simple cost-benefit trade-off: the cost of computing more trees versus the possible benefit of increased performance.

To create multiple decision trees from the same training data, a technique referred to as bagging (short for bootstrap aggregating) is used. Bagging is a common process in machine learning and is not limited to tree-based methods. Essentially, bagging means building multiple models, each based on a sample of the training data, and combining their predictions.

To generate these samples, bootstrapping is used. To understand bootstrapping, imagine you have a tiny data set:



In [5]:
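import numpy as np
import pandas as pd
from IPython.display import display  # built in to Jupyter; imported so the cell also runs elsewhere

np.random.seed(0)
# five rows of random integers: two features (a, c) and an outcome (y)
toy_data = pd.DataFrame({'a': np.random.choice(57, 5),
                         'c': np.random.choice(11, 5),
                         'y': np.random.choice(78, 5)})

display(toy_data)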

|   | a  | c | y  |
|---|----|---|----|
| 0 | 44 | 3 | 36 |
| 1 | 47 | 7 | 70 |
| 2 | 53 | 9 | 12 |
| 3 | 0  | 3 | 58 |
| 4 | 3  | 5 | 65 |


To bootstrap this tiny data set, you first randomly select one row. The selected row is our first observation sampled from `toy_data`. Next you sample another row, noting that the row you sampled first remains in the pool of rows that can be selected. Repeat this process until you have the number of observations you would like. Now you have a bootstrapped sample. When growing a random forest, the number of rows selected through bootstrapping will generally be equal to the number of rows in the training data. The number of bootstrapped samples you need is equal to the number of decision trees you want to grow.

As you can see, bootstrapping is simply sampling with replacement from the training data, with the number of rows in each sample set to the number of rows in the training data and the number of samples set to the number of trees required for your forest.
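The bootstrapping procedure described above can be sketched in a few lines. This is a minimal illustration, not the implementation used later: the function name `bootstrap_samples` and its parameters are assumptions made for this example, and the toy data is typed in by hand so the block runs on its own.

```python
import numpy as np
import pandas as pd

# The toy data from above, written out by hand so the sketch is self-contained.
toy_data = pd.DataFrame({'a': [44, 47, 53, 0, 3],
                         'c': [3, 7, 9, 3, 5],
                         'y': [36, 70, 12, 58, 65]})


def bootstrap_samples(data, no_of_samples, seed=None):
    """Return a list of bootstrapped samples, each with as many rows as data."""
    rng = np.random.default_rng(seed)
    samples = []
    for _ in range(no_of_samples):
        # Draw row positions with replacement, so a row can appear more than once.
        row_positions = rng.integers(0, len(data), size=len(data))
        samples.append(data.iloc[row_positions])
    return samples


# One bootstrapped sample per tree; here we pretend the forest has three trees.
samples = bootstrap_samples(toy_data, no_of_samples=3, seed=0)
for sample in samples:
    print(sample, end="\n\n")
```

Because rows are drawn with replacement, each sample will usually contain some rows more than once and leave others out entirely.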

So we have a number of decision trees, each grown on a bootstrapped sample of our training data. Do we have a random forest yet? Not quite. We need to address the random part of the random forest. At each stage of growing a standard decision tree, all features are considered in order to determine which one gives the best next split. In a random forest tree, a random sample of the features is taken at each of these stages, and only the sampled features can be chosen for that split.
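The feature sampling step can be sketched as follows. The function name `sample_features`, the feature names and the square-root rule for the sample size are assumptions for illustration; the square root of the number of features is a common rule of thumb, but the text above does not prescribe a size.

```python
import numpy as np


def sample_features(features, no_of_features, rng):
    """Pick the candidate features for one split, without replacement."""
    return rng.choice(features, size=no_of_features, replace=False).tolist()


rng = np.random.default_rng(0)
features = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']  # hypothetical feature names

# Assumed rule of thumb: consider roughly sqrt(number of features) at each split.
no_of_features = int(np.sqrt(len(features)))

# A fresh random subset is drawn at every split, not once per tree.
for split in range(3):
    candidates = sample_features(features, no_of_features, rng)
    print(f"split {split}: candidate features {candidates}")
```

Only the features in `candidates` would then be searched for the best split at that stage.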

Why is this important? Imagine three trees grown with this modified process, compared to three trees grown with the standard process. The three random forest trees will most likely be less similar to each other, because each has been forced to consider a randomly selected set of features. Each tree is likely considering features the other trees have ignored. Compare this with the standard process: these trees will most likely be quite similar to each other. They have all considered the same set of features at each stage; the only difference is the bootstrapped sample they receive as training input.

The result of the random forest tree process is reduced correlation between trees. A prediction in a random forest is simply a summary of the predictions from the decision trees in the forest. The goal of summarising over many trees is to reduce variance, and averaging reduces variance more effectively when the trees are less correlated.
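A short derivation shows why correlation matters. It rests on a simplifying assumption that is not stated in the text: the $B$ tree predictions $T_1, \dots, T_B$ each have the same variance $\sigma^2$ and every pair has the same correlation $\rho$.

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b\right)
  = \frac{1}{B^2}\left[ B\sigma^2 + B(B-1)\rho\sigma^2 \right]
  = \rho\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
```

The second term shrinks as more trees are added, but the first term does not. Under these assumptions, lowering $\rho$ is the only way to push the variance of the averaged prediction below $\rho\sigma^2$.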



If we are predicting a continuous outcome, the random forest prediction is the mean of the tree predictions. If we are predicting a categorical outcome, the random forest prediction is the category most frequently selected by the trees.
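As a minimal sketch of this final step, the two helpers below combine the predictions that each tree makes for a single observation. The function names and the example votes are assumptions for illustration only.

```python
from collections import Counter

import numpy as np


def aggregate_regression(tree_predictions):
    """Forest prediction for a continuous outcome: the mean across trees."""
    return float(np.mean(tree_predictions))


def aggregate_classification(tree_predictions):
    """Forest prediction for a categorical outcome: the most frequent category.

    If two categories tie, Counter returns the one that was seen first.
    """
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical predictions from the trees for one observation.
regression_votes = [21.0, 24.0, 22.5, 20.5]
classification_votes = ['cat', 'dog', 'cat', 'cat', 'dog']

forest_mean = aggregate_regression(regression_votes)
forest_mode = aggregate_classification(classification_votes)
print(forest_mean, forest_mode)
```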



In [4]:
import pandas as pd
from sklearn import datasets
from src import i_kit_learn

In [5]:
# training data
# note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
boston = datasets.load_boston()  # load once and reuse
X = boston['data']
y = boston['target']
data = pd.DataFrame(X, columns=boston['feature_names'])
data['y'] = y

# 70/30 split: test is every row that was not sampled into train
train = data.sample(frac=0.7)
test = data.drop(train.index)

In [6]:
boston_forest = i_kit_learn.grow_random_forest(train, max_depth=5, min_size=5, no_of_trees=10)

In [10]:
boston_predictions = i_kit_learn.predict(test, boston_forest)
display(i_kit_learn.mse(test.iloc[:, -1], boston_predictions))
display(i_kit_learn.r_squared(test.iloc[:, -1], boston_predictions))

1948.9641726721798

0.8398139414492283