# Introduction to Random Forest Classifier

This notebook will use the Iris flower dataset from sklearn to introduce classification with Random Forest.

Feature importance and partial dependency plots will be created once the model is trained.

## Prerequisites

- Python imports
- Train-test split
- Classification metrics
- Decision trees
- Measures of node impurity (Shannon entropy and Gini index)

# Learning Objectives

1. Apply a random forest classifier to a dataset
1. Visualize feature importances from a trained random forest model

# Set up data set 

In [None]:
import pandas as pd
from sklearn import datasets

data = datasets.load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target
df['label'] = df['target'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})
df.head()

In [None]:
df.shape

In [None]:
df.label.value_counts()

## Train-Test Split

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df[data.feature_names], 
    df['label'], 
    random_state=42,
    stratify=df['label']
)

In [None]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
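
As a quick check on the split above, the sketch below rebuilds it in a self-contained way and compares class proportions. With no `test_size` given, `train_test_split` holds out 25% of the rows, and `stratify` keeps the class mix roughly equal in both parts. The names `Xtr`, `Xte`, `ytr`, `yte` and `proportions` are illustrative and not part of the notebook.

```python
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
features = pd.DataFrame(iris.data, columns=iris.feature_names)
labels = pd.Series(iris.target_names[iris.target], name='label')

# Same arguments as the split above: default test_size (0.25), stratified on the label
Xtr, Xte, ytr, yte = train_test_split(
    features, labels, random_state=42, stratify=labels
)

# Class proportions should be close to 1/3 each in both parts
proportions = pd.DataFrame({
    'train': ytr.value_counts(normalize=True),
    'test': yte.value_counts(normalize=True),
})
print(proportions)
```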

# Decision Tree Review

We will first build a decision tree model to review its structure before moving on to random forest classification.

In [None]:
from matplotlib import pyplot as plt
from sklearn.tree import DecisionTreeClassifier 
from sklearn import tree

In [None]:
dt = DecisionTreeClassifier(random_state=42)
mdl = dt.fit(X_train, y_train)

In [None]:
fig = plt.figure(figsize=(25,20))
# class_names is passed as a list because some sklearn versions reject a NumPy array here
_ = tree.plot_tree(dt, 
                   feature_names=data.feature_names,  
                   class_names=list(data.target_names),
                   filled=True)

In [None]:
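from sklearn.metrics import ConfusionMatrixDisplay

predicted_labels = dt.predict(X_test)
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
ConfusionMatrixDisplay.from_estimator(dt, X_test, y_test, cmap=plt.cm.Blues)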

In [None]:
from sklearn.metrics import (
    accuracy_score,
    recall_score,
    precision_score,
    f1_score
)

def print_metrics(y_test, y_pred):
    scores = [accuracy_score, recall_score, precision_score, f1_score]
    s_labels = ['Accuracy', 'Recall', 'Precision', 'F1']
    for score, s_label in zip(scores, s_labels):
        if s_label == 'Accuracy':
            print(s_label + ': ' + str(score(y_test, y_pred)))
        else:
            print(s_label + ': ' + str(score(y_test, y_pred, average='weighted')))

In [None]:
print_metrics(y_test, predicted_labels)

# Decision Tree Knowledge
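1. Are decision trees deterministic?
    - Yes. At each step the tree searches for the best available split and uses it, so the same training data produces the same tree. (In sklearn, fix `random_state` so that ties between equally good splits are broken the same way each run.)
1. How are decision tree splits determined?
    - By choosing the split that most reduces node impurity, measured with the Gini index or with Shannon entropy (information gain).
1. Are decision trees parametric? 
    - No, they are nonparametric. A tree assumes no fixed functional form, and the number of splits grows with the data rather than being set in advance.
1. Decision trees often have high variance. Why might that be?
    - A fully grown tree can overfit by fitting noise in the training data, and a small change in the data can change an early split and everything below it.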
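
To make the split criterion concrete, here is a minimal sketch of the two impurity measures listed in the prerequisites and the impurity decrease a split achieves. The helper names (`gini`, `entropy`, `impurity_decrease`) and the class counts are hypothetical, chosen only for illustration.

```python
import numpy as np

def gini(counts):
    """Gini impurity of a node given its class counts."""
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    return 1.0 - np.sum(p ** 2)

def entropy(counts):
    """Shannon entropy (in bits) of a node given its class counts."""
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    p = p[p > 0]  # treat 0 * log(0) as 0
    return -np.sum(p * np.log2(p))

def impurity_decrease(parent, left, right, impurity=gini):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n, n_left, n_right = np.sum(parent), np.sum(left), np.sum(right)
    weighted = (n_left / n) * impurity(left) + (n_right / n) * impurity(right)
    return impurity(parent) - weighted

# Hypothetical node with 50 samples of each of three classes,
# split so that one class is isolated in the left child
parent, left, right = [50, 50, 50], [50, 0, 0], [0, 50, 50]
print('Gini decrease:   ', impurity_decrease(parent, left, right, gini))
print('Information gain:', impurity_decrease(parent, left, right, entropy))
```

A tree builder evaluates this quantity for every candidate split and keeps the one with the largest decrease.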

# Random Forest: Expanding on Decision Trees
1. How might decision trees be leveraged to reduce variance?
    1. Create multiple classifiers and average the results. Multiple weak learners can often produce a strong learner.
1. Would multiple deterministic decision trees be useful?
    1. Only if they were NOT deterministic. Identical trees trained on identical data make identical predictions, so averaging them changes nothing.
1. How could they be made non-deterministic?
    1. By bootstrapping the data (sampling rows with replacement) and limiting the features considered at each split.
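
The two sources of randomness named above can be sketched with NumPy alone. This is an illustration of the idea, not sklearn's internal code; the square-root rule for the number of candidate features is a common choice for classification and is an assumption here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_features = 112, 4  # size of the training set in this notebook

# Bootstrapping: draw n_rows row indices with replacement
boot_idx = rng.integers(0, n_rows, size=n_rows)
unique_fraction = np.unique(boot_idx).size / n_rows
print(f'Fraction of distinct rows in the bootstrap sample: {unique_fraction:.2f}')

# Feature limiting: at each split, consider only a random subset of the features
max_features = int(np.sqrt(n_features))
candidate_features = rng.choice(n_features, size=max_features, replace=False)
print('Features considered at this split:', candidate_features)
```

On average a bootstrap sample contains roughly 63% of the distinct rows, so each tree is trained on a noticeably different view of the data.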

# Random Forest
- An ensemble method that combines many decision trees, each given a different subset of the data and features, to create a strong learner.
- Reduces variance and creates a non-deterministic model (results vary between runs unless `random_state` is fixed)
- Generally uses a large number of bushy (deep, unpruned) trees
- Can get excellent performance with minimal tuning
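
The description above can be sketched as a hand-rolled ensemble: each tree is fit on its own bootstrap sample with a limited number of features per split, and the trees then vote. This is a simplified illustration under stated assumptions (25 trees, hard majority voting, integer labels); sklearn's `RandomForestClassifier` averages predicted class probabilities instead of counting votes.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = datasets.load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42, stratify=y)

rng = np.random.default_rng(0)
n_trees = 25  # illustrative value
trees = []
for _ in range(n_trees):
    # Each tree sees its own bootstrap sample of the training rows
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    # max_features='sqrt' limits the features considered at each split
    t = DecisionTreeClassifier(max_features='sqrt',
                               random_state=int(rng.integers(1_000_000)))
    trees.append(t.fit(X_tr[idx], y_tr[idx]))

# Majority vote across trees (labels are the integers 0, 1, 2 here)
votes = np.stack([t.predict(X_te) for t in trees])  # shape (n_trees, n_test)
ensemble_pred = np.array([np.bincount(col, minlength=3).argmax() for col in votes.T])
print('Hand-rolled ensemble accuracy:', (ensemble_pred == y_te).mean())
```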


# Random Forest Example

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=10, random_state=1)
rf.fit(X_train, y_train)
rf_preds = rf.predict(X_test)

In [None]:
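from sklearn.metrics import ConfusionMatrixDisplay  # replaces plot_confusion_matrix (removed in scikit-learn 1.2)
ConfusionMatrixDisplay.from_estimator(rf, X_test, y_test, cmap=plt.cm.Blues)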

In [None]:
print_metrics(y_test, rf_preds)

# Visualizing trees in the forest

In [None]:
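def plot_tree(est_num=0):
    """Plot one tree from the fitted forest, selected by its index."""
    fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(4, 4), dpi=800)
    # class_names is passed as a list because some sklearn versions reject a NumPy array here
    tree.plot_tree(rf.estimators_[est_num],
                   feature_names=data.feature_names,
                   class_names=list(data.target_names),
                   filled=True,
                   ax=ax)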

In [None]:
plot_tree(est_num=0)

In [None]:
plot_tree(est_num=1)

Focusing only on the first split in the two trees above, we can see differences in the splits used to build the forest.

The first tree splits on sepal length <= 5.35 and the second on petal length <= 2.45. The depth of the trees also varies. This is due to the randomness introduced into the trees by bootstrapping and feature selection. 
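
The same comparison can be made for every tree without plotting. The sketch below refits a forest with the same settings as above (on integer targets, for brevity) and tabulates each tree's root split and depth; `forest` and `summary` are illustrative names.

```python
import pandas as pd
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data, iris.target
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42, stratify=y)
forest = RandomForestClassifier(n_estimators=10, random_state=1).fit(X_tr, y_tr)

# For each tree: the feature and threshold used at the root (node 0), and the tree depth
summary = pd.DataFrame({
    'root_feature': [iris.feature_names[est.tree_.feature[0]] for est in forest.estimators_],
    'root_threshold': [est.tree_.threshold[0] for est in forest.estimators_],
    'depth': [est.get_depth() for est in forest.estimators_],
})
print(summary)
```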


# Random Forest Feature Importances

In [None]:
ft_import = pd.DataFrame()
ft_import['Features'] = data.feature_names
ft_import['Importance'] = rf.feature_importances_
ft_import

In [None]:
# All importances sum to 1
ft_import.Importance.sum()

In [None]:
ft_import.plot.bar(x='Features', y='Importance', rot=30)
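
For intuition about where these numbers come from: in sklearn, the forest-level importances are the average of the impurity-based importances of the individual trees. A minimal sketch, refitting a forest on the full Iris data for brevity (an assumption; the notebook fits on the training split):

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier

X, y = datasets.load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=10, random_state=1).fit(X, y)

# Impurity-based importances of each individual tree, shape (n_trees, n_features)
per_tree = np.array([est.feature_importances_ for est in forest.estimators_])

# Averaging across trees reproduces the forest-level importances
print('Mean over trees: ', per_tree.mean(axis=0))
print('Forest attribute:', forest.feature_importances_)

# The spread across trees gives a rough sense of how stable each importance is
print('Std over trees:  ', per_tree.std(axis=0))
```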

# Partial Dependency Plots

A partial dependence plot shows the marginal effect of a feature on the model's predictions.

It shows how the average prediction changes when all observations have that feature set to a particular value.

The relationship may be linear or more complex. 
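
The definition above can be computed by hand: for each value on a grid, overwrite the feature for every observation, predict, and average. This is a sketch of the brute-force idea under stated assumptions (petal length as the feature, a 20-point grid, a forest refit on the full data), not sklearn's implementation.

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier

iris = datasets.load_iris()
X, y = iris.data, iris.target
forest = RandomForestClassifier(n_estimators=10, random_state=1).fit(X, y)

feature_idx = 2  # petal length (cm)
grid = np.linspace(X[:, feature_idx].min(), X[:, feature_idx].max(), 20)

pd_curve = []
for value in grid:
    X_mod = X.copy()
    X_mod[:, feature_idx] = value  # every observation gets this value
    pd_curve.append(forest.predict_proba(X_mod).mean(axis=0))  # average over observations
pd_curve = np.array(pd_curve)  # shape (n_grid_points, n_classes)

for name, column in zip(iris.target_names, pd_curve.T):
    # Average predicted probability at the lowest and highest grid values
    print(name, np.round(column[[0, -1]], 2))
```

Plotting each column of `pd_curve` against `grid` gives one partial dependence curve per class.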

In [None]:
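from sklearn.inspection import PartialDependenceDisplay

# plot_partial_dependence was removed in scikit-learn 1.2; PartialDependenceDisplay replaces it
for target in data.target_names:
    print('Target is: ', target)
    PartialDependenceDisplay.from_estimator(rf, X_train, data.feature_names, target=target)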

# Advantages of Random Forests

1. Ensemble model (Wisdom of the Crowd)
1. Good out-of-box performance
1. Trees are built independently, so multiple trees can be trained in parallel
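
As an illustration of the last point, sklearn exposes parallel training through the `n_jobs` parameter. The sketch below uses 100 trees and 2 workers, both arbitrary choices, and checks that parallel fitting does not change the result when `random_state` is fixed.

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier

X, y = datasets.load_iris(return_X_y=True)

# n_jobs sets how many trees are fit in parallel; -1 would use all available cores
serial = RandomForestClassifier(n_estimators=100, random_state=1, n_jobs=1).fit(X, y)
parallel = RandomForestClassifier(n_estimators=100, random_state=1, n_jobs=2).fit(X, y)

# With the same random_state, the parallel fit should give the same predictions
same = np.array_equal(serial.predict(X), parallel.predict(X))
print('Identical predictions:', same)
```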

# Cons of Random Forests
1. Expensive to train
2. Can produce very large model files

# Model Comparison and Evaluation

1. Which model performed better?
2. Which feature had the most influence on the random forest model?
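
A single test split of 38 flowers gives a noisy comparison, so one hedged way to approach the first question is cross-validation. The sketch below uses 5 folds and the same model settings as earlier in the notebook; the `models` and `cv_scores` names are illustrative.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = datasets.load_iris(return_X_y=True)

models = {
    'Decision tree': DecisionTreeClassifier(random_state=42),
    'Random forest': RandomForestClassifier(n_estimators=10, random_state=1),
}

cv_scores = {}
for name, model in models.items():
    # 5-fold cross-validated accuracy (folds are stratified by default for classifiers)
    cv_scores[name] = cross_val_score(model, X, y, cv=5)
    print(f'{name}: {cv_scores[name].mean():.3f} +/- {cv_scores[name].std():.3f}')
```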


# Review Objectives 
1. Apply a random forest classifier to a dataset
1. Visualize feature importances from a trained random forest model

# Knowledge Check
1. Is a random forest the same as simply combining many decision trees?
1. Name two ways in which random forests are made non-deterministic. 
1. Are random forest classifiers parametric?

# Next Steps
1. Tune the random forest classifier
    1. Number of estimators
    1. Criterion: default is gini, can also try entropy
    1. Max depth, min samples
    1. Number of features
1. Create a random forest regressor and test on the sklearn California housing data (`fetch_california_housing`; the Boston housing dataset was removed in scikit-learn 1.2)
    1. Compare to decision tree model
    1. Create partial dependency plots
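
As a starting point for the tuning step, the hyperparameters listed above can be searched with `GridSearchCV`. This is a minimal sketch: the grid values are assumptions chosen to keep the search small, not recommendations.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = datasets.load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42, stratify=y)

# Illustrative grid covering the hyperparameters listed above
param_grid = {
    'n_estimators': [10, 50],
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 3],
    'max_features': [1, 2],
}

# 3-fold cross-validation over every combination, then a refit on the full training set
search = GridSearchCV(RandomForestClassifier(random_state=1), param_grid, cv=3)
search.fit(X_tr, y_tr)

print('Best parameters:', search.best_params_)
print('Cross-validated accuracy:', round(search.best_score_, 3))
print('Held-out accuracy:', round(search.score(X_te, y_te), 3))
```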

# Full Day Activities
1. Code a random forest from scratch
1. Code a partial dependency plot function from scratch