# Random Forest

Now let's grow a forest out of our trees. 

In this notebook we'll cover the following topics:
- The problem of overfitting
- Pruning
- From a tree to a forest
- A voting forest

### Pruning

We saw the problem of overfitting in the previous notebook. One way to reduce overfitting (a form of **regularization**) is called **pruning**. Let's go through the steps of pruning a tree. (The term comes from horticulture, where pruning means cutting off branches so that the plant grows better in the future.)

For this you need to have split your data set into three parts: a **training set**, a **validation set**, and a **test set**. Once you have learned a tree on the training set, you start cutting away at the tree, for example by removing some of the leaf nodes. You then use the validation set to make sure that the tree does *not* lose any predictive power when these branches are cut away. If the accuracy on the validation set does drop, you've cut away a crucial branch, so better put it back!

Finally you can apply the tree to the test set for the final check that you have a model that is not overfitted.
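The three-way split can be sketched with scikit-learn. Note that this is a stand-in technique: scikit-learn does not cut leaves one by one as described above, but offers *cost-complexity pruning* through the `ccp_alpha` parameter, and here the validation set is used to choose how strongly to prune. The synthetic data and the 60/20/20 split are assumptions made only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the Titanic set used below).
X, y = make_classification(n_samples=600, n_features=10, random_state=0)

# Split into training (60%), validation (20%) and test (20%) sets.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0)

# Candidate pruning strengths, computed from the training set only.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_train, y_train)

best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative round-off
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    tree.fit(X_train, y_train)
    score = tree.score(X_val, y_val)
    # Alphas are increasing, so on ties we keep the more strongly pruned tree.
    if score >= best_score:
        best_alpha, best_score = alpha, score

unpruned = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha)
pruned.fit(X_train, y_train)

# The test set is only touched once, for the final check.
print("leaves:", unpruned.get_n_leaves(), "->", pruned.get_n_leaves())
print("test accuracy:", unpruned.score(X_test, y_test),
      "->", pruned.score(X_test, y_test))
```

The validation set decides how much to prune, and the test set is kept aside so that the final accuracy is measured on data that played no role in that decision.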

### Forest

A forest is also a method of **regularization**. Instead of cutting away branches that were overfitted you keep the overfitted tree, but simply generate lots of trees. Each tree is trained on a randomly sampled subset of the training data, so each tree is slightly different.

And what do you call a set of many trees? A **random forest**!

When a prediction has to be made, each tree goes through its nodes and determines an outcome, which it then casts as a vote. The forest looks at all the votes and sticks with the majority. The idea is that each tree will be overfitted differently from all the others, but the general trend is learned by all trees! Some trees will cast a ridiculous vote, but as long as they are in the minority the forest as a whole predicts correctly.
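To make the voting idea concrete, here is a minimal hand-rolled sketch: a handful of fully grown trees, each trained on its own random sample of the data, followed by a majority vote. The synthetic data, the number of trees, and all variable names are assumptions for illustration. This is not how scikit-learn implements its forest: `RandomForestClassifier` also randomizes the features considered at each split, and it averages the trees' predicted probabilities rather than counting hard votes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 0/1 labels (an assumption for illustration).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
rng = np.random.default_rng(0)

# Train 25 trees, each on a random sample drawn with replacement.
trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# Every tree casts a vote for each of the first five samples.
votes = np.array([tree.predict(X[:5]) for tree in trees])  # shape (25, 5)

# Majority vote for 0/1 labels: class 1 wins if more than half the trees say 1.
# With an odd number of trees there can be no tie.
majority = (votes.mean(axis=0) > 0.5).astype(int)

print("votes for class 1:", votes.sum(axis=0))
print("forest prediction:", majority)
```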

### Training a Forest

We'll start by importing the required modules and doing the same preprocessing of the data as in the previous notebook.

In [1]:
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split

In [2]:
df = sns.load_dataset('titanic')
df.sort_values(by=list(df.columns), axis=0, inplace=True)
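# Fill gaps with the next valid value (fillna(method='bfill') is deprecated in recent pandas).
df.bfill(inplace=True)
# Fill any remaining gaps in the numeric columns with the column mean.
df.fillna(df.mean(numeric_only=True), inplace=True)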
df = pd.get_dummies(df)
df_train, df_test = train_test_split(df, test_size=0.5, random_state=42)
X_train = df_train.drop(columns=['survived', 'alive_no', 'alive_yes'])
y_train = df_train['survived']
X_test = df_test.drop(columns=['survived', 'alive_no', 'alive_yes'])
y_test = df_test['survived']

The forest classifier is imported from the `ensemble` submodule of scikit-learn. Each tree is a separate model and we combine their outputs through voting. This is called an **ensemble** of models.

In [3]:
from sklearn.ensemble import RandomForestClassifier

Here you can see the power of scikit-learn: the syntax for training a different model is almost identical!

Let this inspire you to try the other models available in scikit-learn; they all share this easy interface.
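As a rough sketch of that shared interface, the loop below trains three different classifiers with exactly the same `fit` and `score` calls. The synthetic data and the choice of models are assumptions for illustration; on the Titanic data you would pass `X_train`, `y_train`, `X_test` and `y_test` from above instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption for illustration).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

models = {
    "tree": DecisionTreeClassifier(random_state=0),
    "forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "logistic": LogisticRegression(max_iter=1000),
}

# Every model is trained and scored through the same two method calls.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # fraction predicted correctly
    print(name, scores[name])
```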

In [11]:
forest = RandomForestClassifier(n_estimators=50)
forest.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=50,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

We'll use our trusty error function to determine how this model did.

In [5]:
def compute_error(truth, prediction):
    # Sum of squared differences; for 0/1 labels this is the number of mistakes.
    diff = 0

    for truth_i, prediction_i in zip(truth, prediction):
        diff += (truth_i - prediction_i)**2

    return diff
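As a small sanity check on what this error measures: for labels that are only 0 or 1, each squared difference is 1 for a wrong prediction and 0 for a correct one, so the sum is simply the number of mistakes. The toy arrays below are made up for illustration, and `accuracy_score` is scikit-learn's built-in fraction of correct predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score


def compute_error(truth, prediction):
    # Same function as above: sum of squared differences.
    diff = 0
    for truth_i, prediction_i in zip(truth, prediction):
        diff += (truth_i - prediction_i)**2
    return diff


# Made-up labels for illustration: two of the six predictions are wrong.
truth = np.array([1, 0, 1, 1, 0, 0])
prediction = np.array([1, 1, 1, 0, 0, 0])

mistakes = compute_error(truth, prediction)
n_wrong = int((truth != prediction).sum())
accuracy = accuracy_score(truth, prediction)

print(mistakes, n_wrong, accuracy)
```

So the error equals the number of samples times one minus the accuracy, which makes it easy to compare against the metrics that scikit-learn reports.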

In [12]:
pred_test = forest.predict(X_test)
pred_train = forest.predict(X_train)
err_test = compute_error(y_test, pred_test)
err_train = compute_error(y_train, pred_train)
err_test, err_train

(63, 1)

Hmm... it doesn't improve much on this small data set; one tree is actually enough.

But I challenge you to try it on the much larger Fraud Detection set! I bet you will see a change!

---
#### Exercise
Did you try retraining the forest? Each time you train and test it, you can get a slightly different number of mistakes. Can you explain why this is?

---

So far we have made hard class predictions (survived or not) with both the decision tree and the random forest. Some models can also give the probability of belonging to each class. Let's try this for the random forest.

In [17]:
forest.predict_proba(X_test)

array([[0.08      , 0.92      ],
       [1.        , 0.        ],
       [0.18      , 0.82      ],
       [0.        , 1.        ],
       [0.6       , 0.4       ],
       [1.        , 0.        ],
       [0.98      , 0.02      ],
       [1.        , 0.        ],
       [0.9       , 0.1       ],
       [1.        , 0.        ],
       [1.        , 0.        ],
       [0.04      , 0.96      ],
       [1.        , 0.        ],
       [0.94      , 0.06      ],
       [1.        , 0.        ],
       [0.22      , 0.78      ],
       [0.04      , 0.96      ],
       [0.9       , 0.1       ],
       [0.06      , 0.94      ],
       [1.        , 0.        ],
       [1.        , 0.        ],
       [0.96      , 0.04      ],
       [0.5       , 0.5       ],
       [0.44      , 0.56      ],
       [0.94      , 0.06      ],
       [0.5       , 0.5       ],
       [0.82      , 0.18      ],
       [0.52      , 0.48      ],
       [0.82      , 0.18      ],
       [0.86      , 0.14      ],
       ...

Now we get two columns as the prediction. The first column is the estimated probability that the passenger died, while the second is the probability that the passenger survived. You can also check that the two probabilities sum to 1 for each passenger: a passenger either survived or did not, and there is no other outcome.
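Both claims can be checked in code. The sketch below uses synthetic stand-in data (an assumption for illustration) and relies on the `classes_` attribute, which lists the class labels in the same order as the columns of `predict_proba`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with 0/1 labels (an assumption for illustration).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

proba = forest.predict_proba(X[:5])

# Column i of proba belongs to class forest.classes_[i].
print("column order:", forest.classes_)

# Each row sums to 1, up to floating-point round-off.
rows_sum_to_one = np.allclose(proba.sum(axis=1), 1.0)
print("rows sum to 1:", rows_sum_to_one)

# The hard prediction is the class with the highest probability.
labels = forest.classes_[proba.argmax(axis=1)]
print(labels, forest.predict(X[:5]))
```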

When measuring the performance of a model, having probabilities can work to your advantage. They work well with the ROC curve, for example, hint hint... ;)
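As a small teaser, here is a sketch of how the probability of the positive class feeds into scikit-learn's ROC tools. The synthetic data is an assumption for illustration; with the Titanic forest you would pass `y_test` and `forest.predict_proba(X_test)[:, 1]` instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption for illustration).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_train, y_train)

# Probability of the positive class (second column of predict_proba).
p_positive = forest.predict_proba(X_test)[:, 1]

# False and true positive rates for every probability threshold.
fpr, tpr, thresholds = roc_curve(y_test, p_positive)

# Area under the ROC curve: 0.5 is random guessing, 1.0 is perfect.
auc = roc_auc_score(y_test, p_positive)
print("AUC:", auc)
```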