<a href="https://colab.research.google.com/github/WhiteHum/Application-security/blob/main/3_07_Random_Forests.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Random Forest

## Overview

We saw how easy it was to create a Decision Tree, but we recognize that its accuracy may suffer on new data.  Random Forest is an approach that seeks to improve this by building many trees, using bagging (training each tree on a bootstrap sample of the data) and a measure of randomization.
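
To make the idea concrete, here is a minimal sketch of bagging plus per-split feature randomization, built by hand from individual decision trees. It uses a synthetic dataset from `make_classification` as a stand-in (an assumption; it is not the BackBlaze data), and the names `X_fit`, `X_eval`, and `votes` are illustrative only. It also combines trees with a simple majority vote, which is a simplification of what a library implementation may do.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the BackBlaze dataset).
X, y = make_classification(n_samples=600, n_features=12, random_state=0)
X_fit, y_fit = X[:450], y[:450]
X_eval, y_eval = X[450:], y[450:]

rng = np.random.default_rng(0)
n_trees = 25  # odd, so a binary majority vote cannot tie
votes = np.zeros((n_trees, len(X_eval)), dtype=int)

for t in range(n_trees):
    # Bagging: draw a bootstrap sample (with replacement) of the training rows.
    rows = rng.integers(0, len(X_fit), size=len(X_fit))
    # Randomization: each split only considers a random subset of the features.
    single_tree = DecisionTreeClassifier(max_features="sqrt", random_state=t)
    single_tree.fit(X_fit[rows], y_fit[rows])
    votes[t] = single_tree.predict(X_eval)

# Majority vote across the trees (the labels are 0/1 here).
bagged_predictions = (votes.mean(axis=0) >= 0.5).astype(int)
bagged_accuracy = (bagged_predictions == y_eval).mean()
print(f"Bagged accuracy: {bagged_accuracy:.3f}")
```

Each tree sees a slightly different view of the data, so their individual mistakes tend to differ, and the vote averages some of them away.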

## Goals

In this lab, you will:

* Build and train a Random Forest classifier
* Compare different parameter options when building the classifier
* Evaluate the accuracy in comparison to a Decision Tree

 
## Estimated Time: 30 minutes

We will once again be using the same BackBlaze data in this lab.  Hopefully, this has given us an opportunity to compare apples to apples across these four labs.

## Note Regarding This Lab

Our decision tree performs so well that we aren't going to see any real benefit from switching to a Random Forest.  We have included a completely different problem at the end of this lab to demonstrate how a random forest can perform better than a simple decision tree.

# <img src="../images/task.png" width=20 height=20> Task 7.1

Use the following cell to read in the first twenty files from the BackBlaze dataset, split the labels from the data, convert everything to NumPy arrays, replace missing values with zero, and create a test dataset from the final 25% of the overall dataset.

In [None]:
import os
import pandas as pd
import numpy as np

def get_file_list(starting_directory="../data/data_Q4_2020/"):
    final_list = list()
    files = os.listdir(starting_directory)
    for file in files:
        file_name = os.path.join(starting_directory, file)
        if os.path.isdir(file_name):
            final_list = final_list + get_file_list(file_name)
        else:
            final_list.append(file_name)
    return final_list

all_files = get_file_list()
columns = [
    'failure', 
    'capacity_bytes',
    'smart_1_normalized',
    'smart_2_normalized',
    'smart_3_normalized',
    'smart_4_normalized',
    'smart_5_normalized',
    'smart_7_normalized',
    'smart_8_normalized',
    'smart_9_normalized',
    'smart_10_normalized',
    'smart_11_normalized',
    'smart_12_normalized',
    'smart_13_normalized',
    'smart_15_normalized',
    'smart_16_normalized',
    'smart_17_normalized',
    'smart_18_normalized',
    'smart_22_normalized',
    'smart_23_normalized',
    'smart_24_normalized',
    'smart_168_normalized',
    'smart_170_normalized',
    'smart_173_normalized',
    'smart_174_normalized',
    'smart_175_normalized',
    'smart_177_normalized',
    'smart_179_normalized',
    'smart_180_normalized',
    'smart_181_normalized',
    'smart_182_normalized',
    'smart_183_normalized',
    'smart_184_normalized',
    'smart_187_normalized',
    'smart_188_normalized',
    'smart_189_normalized',
    'smart_190_normalized',
    'smart_191_normalized',
    'smart_192_normalized',
    'smart_193_normalized',
    'smart_194_normalized',
    'smart_195_normalized',
    'smart_196_normalized',
    'smart_197_normalized',
    'smart_198_normalized',
    'smart_199_normalized',
    'smart_200_normalized',
    'smart_201_normalized',
    'smart_202_normalized',
    'smart_206_normalized',
    'smart_210_normalized',
    'smart_218_normalized',
    'smart_220_normalized',
    'smart_222_normalized',
    'smart_223_normalized',
    'smart_224_normalized',
    'smart_225_normalized',
    'smart_226_normalized',
    'smart_231_normalized',
    'smart_232_normalized',
    'smart_233_normalized',
    'smart_234_normalized',
    'smart_235_normalized',
    'smart_240_normalized',
    'smart_241_normalized',
    'smart_242_normalized',
    'smart_245_normalized',
    'smart_247_normalized',
    'smart_248_normalized',
    'smart_250_normalized',
    'smart_251_normalized',
    'smart_252_normalized',
    'smart_254_normalized',
    'smart_255_normalized'
]
# Read the first twenty files, as the task describes, into one DataFrame.
# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead.
df = pd.concat(
    (pd.read_csv(f, usecols=columns) for f in all_files[:20]),
    ignore_index=True
)

# Grab the labels
labels = df['failure'].to_numpy()
# Drop the failure column
df.drop('failure', axis=1, inplace=True)

# Convert to an array and replace NaN values with 0
x = df.to_numpy()
x[np.isnan(x)] = 0

# Train on the first 75% of the rows and test on the final 25%
testing_split = int(len(labels)*0.75)

x_train = x[:testing_split]
y_train = labels[:testing_split]
x_test = x[testing_split:]
y_test = labels[testing_split:]



The Random Forest approach has been implemented in Scikit Learn within the `sklearn.ensemble` package.  Let's get started by importing it.

# <img src="../images/task.png" width=20 height=20> Task 7.2

Import the `RandomForestClassifier` from Scikit Learn.

In [None]:
from sklearn.ensemble import RandomForestClassifier

To build a Random Forest, there are several parameters that we can tune to speed up the process and possibly increase our accuracy:

* The number of *estimators* or trees that will be in the forest.  This is selected with the `n_estimators` parameter.
* The maximum depth of each tree, which is controlled by the `max_depth` parameter.  If you do not set this parameter, each tree grows until its leaves are pure or contain fewer than `min_samples_split` samples.
* We can specify the number of jobs, or `n_jobs`, which instructs the library to build trees in parallel.  Setting this to `-1` uses all available processors.  The default, `None`, means a single job in most contexts.
* The maximum number of samples to draw when training each tree.  This parameter is `max_samples`; it accepts either an integer count or a float giving a fraction of the training set.
* The maximum number of features to consider when splitting the data.  The default is $\sqrt{n}$ where $n$ is the number of features.  We can tune this value using `max_features`.
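
As a rough illustration of how these parameters fit together, the sketch below builds a small forest on synthetic data (an assumption; `X_demo`, `y_demo`, and `demo_forest` are illustrative names, and the values are arbitrary rather than recommendations) and then inspects the fitted trees.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with 16 features (an assumption for illustration).
X_demo, y_demo = make_classification(n_samples=500, n_features=16, random_state=0)

demo_forest = RandomForestClassifier(
    n_estimators=50,      # number of trees in the forest
    max_depth=5,          # cap on the depth of each tree
    n_jobs=-1,            # use all available processors
    max_samples=0.3,      # each tree is trained on 30% of the rows
    max_features="sqrt",  # features considered at each split
    random_state=0,
)
demo_forest.fit(X_demo, y_demo)

# The fitted trees are available in estimators_.
tree_count = len(demo_forest.estimators_)
deepest = max(est.get_depth() for est in demo_forest.estimators_)
features_per_split = demo_forest.estimators_[0].max_features_
print(tree_count, deepest, features_per_split)
```

With 16 features, the `"sqrt"` setting resolves to 4 features per split, which is what `max_features_` reports on each fitted tree.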

# <img src="../images/task.png" width=20 height=20> Task 7.3

Create a Random Forest with 100 trees and a maximum depth of 20.  What is the accuracy against the test dataset?  When training this and all future forests in this lab, set `max_samples` to 30% of the total number of samples in the training dataset.

In [None]:
def accuracy(y_hat, y):
    """Print and return the fraction of predictions that match the labels."""
    score = np.mean(np.asarray(y_hat) == np.asarray(y))
    print(f'Overall accuracy: {score * 100.0}%')
    return score

# Train each tree on 30% of the training samples; a float max_samples
# in (0, 1] is interpreted as a fraction of the training set.
forest = RandomForestClassifier(n_estimators=100, max_depth=20, max_samples=0.3)
forest.fit(x_train, y_train)
predictions = forest.predict(x_test)
accuracy(predictions, y_test)

Overall accuracy: 99.9973175571079%


0.9999731755710791

As mentioned in the discussion, we have to balance overall accuracy against training time.  One of the parameters that affects this most severely is the number of trees.  Let's train this again, this time with 1,000 trees.
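
One way to see this trade-off is to time the fit for several forest sizes. The sketch below does so on a small synthetic dataset (an assumption; the tree counts, `X_fit`/`X_eval` names, and the `results` list are illustrative only), so the absolute timings will not resemble those for the BackBlaze data.

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption; not the BackBlaze dataset).
X_demo, y_demo = make_classification(n_samples=1000, n_features=20, random_state=0)
X_fit, y_fit = X_demo[:750], y_demo[:750]
X_eval, y_eval = X_demo[750:], y_demo[750:]

results = []
for n_trees in (10, 50, 200):
    timed_forest = RandomForestClassifier(
        n_estimators=n_trees, max_depth=20, max_samples=0.3, random_state=0
    )
    start = time.perf_counter()
    timed_forest.fit(X_fit, y_fit)
    elapsed = time.perf_counter() - start
    # score() returns the mean accuracy on the held-out rows.
    results.append((n_trees, elapsed, timed_forest.score(X_eval, y_eval)))

for n_trees, elapsed, score in results:
    print(f"{n_trees:4d} trees: {elapsed:.2f}s to fit, accuracy {score:.4f}")
```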

# <img src="../images/task.png" width=20 height=20> Task 7.4

Does a forest with 1,000 trees and a maximum depth of 20 achieve significantly greater accuracy?

In [None]:

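# Same settings as before, but with 1,000 trees; max_samples=0.3 trains
# each tree on 30% of the training samples.
forest = RandomForestClassifier(n_estimators=1000, max_depth=20, max_samples=0.3)
forest.fit(x_train, y_train)
predictions = forest.predict(x_test)
accuracy(predictions, y_test)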

Overall accuracy: 99.9973175571079%


0.9999731755710791

Something that will tend to speed up training is reducing the number of features considered at each node.  This will simultaneously tend to decrease the accuracy of each tree, which can be smoothed out by having more trees.
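
The sketch below is one way to look at this effect: for two `max_features` settings it compares the average accuracy of the individual trees with the accuracy of the whole forest. It runs on synthetic data (an assumption; the settings 2 and 10 and the `summary` dictionary are illustrative only), so the numbers it prints say nothing about the BackBlaze results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption; not the BackBlaze dataset).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=5, random_state=0
)
X_fit, y_fit = X_demo[:750], y_demo[:750]
X_eval, y_eval = X_demo[750:], y_demo[750:]

summary = {}
for max_features in (2, 10):
    demo_forest = RandomForestClassifier(
        n_estimators=100, max_features=max_features, random_state=0
    )
    demo_forest.fit(X_fit, y_fit)
    # Mean accuracy of the individual trees, each evaluated on its own.
    per_tree = np.mean(
        [(est.predict(X_eval) == y_eval).mean() for est in demo_forest.estimators_]
    )
    # Accuracy of the combined forest on the same held-out rows.
    ensemble = demo_forest.score(X_eval, y_eval)
    summary[max_features] = (per_tree, ensemble)

for max_features, (per_tree, ensemble) in summary.items():
    print(f"max_features={max_features}: mean tree {per_tree:.3f}, forest {ensemble:.3f}")
```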

# <img src="../images/task.png" width=20 height=20> Task 7.5

Construct and test two forests.  The first has 100 trees and the second 1,000.  For each of these, set the `max_features` parameter to 4.

In [None]:

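# 100 trees, considering only 4 features at each split.  n_jobs=-1 and
# verbose=1 produce the parallel progress messages shown in the output.
forest = RandomForestClassifier(n_estimators=100, max_depth=20, max_features=4,
                                max_samples=0.3, n_jobs=-1, verbose=1)
forest.fit(x_train, y_train)
predictions = forest.predict(x_test)
accuracy(predictions, y_test)

# 1,000 trees with the same settings.
forest = RandomForestClassifier(n_estimators=1000, max_depth=20, max_features=4,
                                max_samples=0.3, n_jobs=-1, verbose=1)
forest.fit(x_train, y_train)
predictions = forest.predict(x_test)
accuracy(predictions, y_test)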

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:    2.9s finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 100 out of 100 | elapsed:    1.9s finished


Overall accuracy: 99.99740940472438%


[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:    4.3s
[Parallel(n_jobs=-1)]: Done 386 tasks      | elapsed:   11.9s
[Parallel(n_jobs=-1)]: Done 736 tasks      | elapsed:   22.3s
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed:   29.7s finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    3.1s
[Parallel(n_jobs=32)]: Done 386 tasks      | elapsed:    8.2s
[Parallel(n_jobs=32)]: Done 736 tasks      | elapsed:   15.4s
[Parallel(n_jobs=32)]: Done 1000 out of 1000 | elapsed:   20.4s finished


Overall accuracy: 99.99740940472438%


0.9999740940472438