#**Module 9: Random Forest Classification**

In this notebook, we are going to set up a Random Forest model in Python. At the end of this module, you will be able to:

* Explain what Random Forest does
* Build and evaluate a Random Forest

**Be sure to expand all the hidden cells, run all the code, and do all the exercises--you will need the techniques for the lesson lab!**




#**What is Random Forest?**
In the previous module on simple tree construction, there's a chance that each of you got slightly different numeric results. Why? Because the sampling that produces the training and test sets is random. So, when running train_test_split, each student built the model on a somewhat different training set. This means that each student's model is also slightly different from that of her peers.

So, whose results are the actual, real results, then?

The answer is: We really can't tell. Each tree that each student built has some validity, and we can have some confidence in its final predictions.

But wouldn't it be great if we could have more confidence and come to a better overall result for the entire class? That's what the popular Random Forest algorithm does.

Random Forest doesn't build just one tree--it builds an entire classroom full of trees, each one of which is based on a slightly different training set (which is, in fact, a random sample drawn with replacement from the big overall training set). In addition, Random Forest considers only a random few of the attributes at each split when building a tree, so that the trees differ from one another instead of all leaning on the same strong attributes. Finally, for a given prediction, Random Forest collects the outputs of all the trees it has constructed and returns the class that is the mode of the individual predictions (classification) or, if you run it as a regression forest, the mean of the individual predictions (regression).

<center>
<img src="https://raw.githubusercontent.com/shstreuber/Data-Mining/master/images/randomforest2.png" width="600">
</center>

Here is an example:
<center>
<img src = "https://static.javatpoint.com/tutorial/machine-learning/images/random-forest-algorithm2.png">
</center>

So, we have:
* A number of trees
* Using a random subset of features in the dataset to make their split decisions
* Built on a number of slightly different training subsets, selected as random samples with replacement (= bootstrap aggregating or bagging) from the overall training set
* A voting function that selects the mode of the classes (classification) or the mean prediction (regression)

In other words, we introduce dual randomness into our classification and then combine the votes of all the individual trees, so that the mistakes of any single tree tend to be outvoted by the others. That usually leaves us with a more accurate and more stable model than a single tree.
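
To make the two building blocks concrete, here is a minimal sketch in plain Python (no scikit-learn). The helper names `bootstrap_sample` and `majority_vote` and the five tree votes are made up for illustration; this is not how scikit-learn implements it internally, just the idea behind bagging and voting.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows at random WITH replacement (one bagging sample)."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Return the most common class label (the mode) among the trees' votes."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(42)
rows = list(range(10))                 # stand-in for 10 training rows
sample = bootstrap_sample(rows, rng)   # some rows repeat, some are left out
print("bootstrap sample:", sample)

# Hypothetical votes from five trees for one customer
tree_votes = ["southeast", "northwest", "southeast", "southwest", "southeast"]
print("forest prediction:", majority_vote(tree_votes))  # southeast
```

Because the sample is drawn with replacement, it has the same length as the original data but usually contains duplicates, which is what makes each tree's training set slightly different.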

We are working with the [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) from the scikit-learn package.


#**0. Preparation and Setup**
As before, we are following the basic classification steps:

1. **Exploratory Data Analysis** to see how the data is distributed and to determine what the class attribute in the dataset should be. This will be the attribute you'll predict later on
2. **Preprocess the data** (remove n/a, transform data types as needed, deal with missing data) and ensure that the dependent attribute is CATEGORICAL
3. Split the data into a **training set and a test set**
4. **Build the model** based on the training set
5. **Test the model** on the test set
6. Determine the quality of the model with the help of a **Confusion Matrix** and a **Classification Report**.
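
If you would like to see steps 3 through 6 in one place before working through them, here is a compact sketch. It assumes a small synthetic dataset from `make_classification` as a stand-in for cleaned data (steps 1 and 2), and the variable names carry a `demo` or `d` marker so that they do not overwrite anything you build later in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Stand-in for steps 1-2: a clean synthetic dataset with a 3-class target
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

# Step 3: hold out 20% of the rows as the test set
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Step 4: build the model on the training set
demo_model = RandomForestClassifier(random_state=0).fit(Xd_train, yd_train)

# Step 5: test the model on the test set
yd_pred = demo_model.predict(Xd_test)

# Step 6: judge the quality of the model
demo_cm = confusion_matrix(yd_test, yd_pred, labels=demo_model.classes_)
print(demo_cm)
print(classification_report(yd_test, yd_pred))
```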

As with our previous problems, we will use the insurance dataset again.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import spatial
import statsmodels.api as sm
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import warnings
warnings.filterwarnings("ignore")
np.random.seed(42)

from IPython.display import HTML # This is just for me so I can embed videos
from IPython.display import Image # This is just for me so I can embed images

#Reading in the data as insurance dataframe
insurance = pd.read_csv("https://raw.githubusercontent.com/shstreuber/Data-Mining/master/data/insurance_with_categories.csv")

#Verifying that we can see the data
insurance.head()

#**1. Exploratory Data Analysis (EDA)**
As before, we have the option to either do this in a code cell, or to import the HTML-based ydata-profiling package (imported as ydata_profiling).

Test your EDA skills below:

In [None]:
print("***DATA OVERVIEW***")
insurance.describe(include='all')  # Build a data summary for ALL data in the set (not just numeric!)

In [None]:
insurance.dtypes

In [None]:
print("***DATA CORRELATIONS***")
insurance.corr(numeric_only=True)  # Pairwise correlations between the numeric columns only

In [None]:
# Build a histogram for the numeric values
insurance.hist()
insurance.plot()

In [None]:
insurance.groupby('children').size().plot(kind='pie', autopct='%.2f')

#**2. Preprocessing: Building the Analysis Set**
You have done this before. Build an insurance_rf dataset consisting of age, bmi, children, charges, and--again--region as the class attribute.

In [None]:
insurance_rf = pd.DataFrame(insurance, columns = ['age', 'bmi', 'children','charges','region'])
insurance_rf.head()

# **3. Building the Training and Test Datasets**
As before, we cannot do classification without training and test data. You did this previously. Do it again--we want 20% of the data set as test and 80% as training set.

In [None]:
insurance_rf.children.unique()

In [None]:
from sklearn.model_selection import train_test_split
x=insurance_rf.iloc[:,:4] # all parameters
y=insurance_rf['region'] # class labels 'southwest', 'southeast', 'northwest', 'northeast'
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)  # COMPLETE THIS LINE so that 20% of the data becomes the test set!
print("X_train shape: {}".format(X_train.shape))
print("y_test shape: {}".format(y_test.shape))
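
If you are unsure how the split sizes come about, the sketch below may help. It uses a made-up array of 100 rows (not the insurance data) and shows what `train_test_split` does when `test_size` is left out, what it does when `test_size` is given, and why `random_state` matters for reproducibility.

```python
import numpy as np
from sklearn.model_selection import train_test_split

toy_X = np.arange(200).reshape(100, 2)  # 100 made-up rows, 2 columns
toy_y = np.arange(100) % 4              # 4 made-up class labels

# No test_size given: scikit-learn holds out 25% of the rows by default
a_train, a_test, _, _ = train_test_split(toy_X, toy_y, random_state=0)
print(a_train.shape, a_test.shape)  # (75, 2) (25, 2)

# test_size=0.2: 20% of the rows are held out
b_train, b_test, _, _ = train_test_split(toy_X, toy_y, test_size=0.2, random_state=0)
print(b_train.shape, b_test.shape)  # (80, 2) (20, 2)

# The same random_state reproduces exactly the same split
c_train, _, _, _ = train_test_split(toy_X, toy_y, test_size=0.2, random_state=0)
print(np.array_equal(b_train, c_train))  # True
```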

# **4. Building and Training the Classifier**
We are going to use the [RandomForestClassifier from sklearn.ensemble](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html). The RandomForestClassifier has a number of really interesting parameters that we can control in order to optimize our model to run quickly and efficiently. One of them is the sub-sample size: if bootstrap=True (the default), the max_samples parameter controls how many samples are drawn to build each tree; if bootstrap=False, the whole dataset is used to build each tree.

##**4.1 Building the Classifier**

The most important parameters are:
* n_estimators int, default=100 --
The number of trees in the forest.
* criterion {"gini", "entropy"}, default="gini" --
The function to measure the quality of a split. Supported criteria are "gini" for the Gini impurity and "entropy" for the information gain. Note: this parameter is tree-specific. You've seen this in the previous workbook.
* max_features {"sqrt", "log2", None}, int or float, default="sqrt" --
The number of features to consider when looking for the best split: If int, then consider max_features features at each split. If "sqrt", then max_features=sqrt(n_features). If "log2", then max_features=log2(n_features). If None, then max_features=n_features. (Older scikit-learn versions called the default "auto", which meant the same as "sqrt" for classification; recent versions no longer accept "auto".)
* max_depth int, default=None --
The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
* min_samples_split int or float, default=2 --
The minimum number of samples required to split an internal node.
* bootstrap bool, default=True --
Whether bootstrap samples are used when building trees (which is 50% of the whole idea behind Random Forest). If False, the whole dataset is used to build each tree.
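
As an illustration, the parameters above could be set explicitly like this. The values are arbitrary examples chosen for this sketch, not tuned recommendations, and the name `rf_demo` is used so that it does not collide with the `rf` classifier you are about to build.

```python
from sklearn.ensemble import RandomForestClassifier

# Every value below is an illustrative choice, not a recommendation
rf_demo = RandomForestClassifier(
    n_estimators=200,       # number of trees
    criterion="entropy",    # split quality measured by information gain
    max_features="sqrt",    # features considered at each split
    max_depth=5,            # stop growing each tree after 5 levels
    min_samples_split=4,    # a node needs at least 4 samples to be split
    bootstrap=True,         # each tree sees a bootstrap sample ...
    max_samples=0.5,        # ... of half the training rows
    random_state=42,        # makes the forest reproducible
)

# get_params() returns the full configuration as a dictionary
params = rf_demo.get_params()
for name in ("n_estimators", "criterion", "max_features", "max_depth",
             "min_samples_split", "bootstrap", "max_samples"):
    print(name, "=", params[name])
```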

Let's get started!

In [None]:
# Configuring the classifier and using get_params to double-check all the parameters with which it is configured

rf = RandomForestClassifier()
rf.get_params()  # The parentheses are needed to actually call the method and list the settings

In [12]:
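# Re-configuring the classifier to split on information gain (entropy) instead of the default Gini impurity
rf = RandomForestClassifier(criterion='entropy')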

##**4.2 Training the Classifier**
Just like before, we are using .fit() to train our classifier! Remember that we named it rf. You'll want your training data inside the parentheses.

Give it a shot below!

In [None]:
rf.fit(X_train,y_train)

Just in case you're lost: The solutions are posted at the end of this workbook.

# **5. Use the Classifier to test and predict**
The steps below are no different from what you have already done. Uncomment the second line (the one starting with "# print") if you would like to see the output of your predictions.

In [14]:
y_pred = rf.predict(X_test)
# print(y_pred) # If you want to see the big long list, uncomment this line!

# **6. Evaluate the Quality of the Model**
Again, we will look at the following:
1. Accuracy score
2. Confusion matrix
3. Classification Report

The interesting part will be to see if any of the predictions have improved from the simple tree model in the previous module.

In [None]:
# First, the accuracy score
accuracy_score(y_test, y_pred)

What was the accuracy score for the simple tree? **Did using Random Forest improve it?** Record your observations below.
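
As a reminder of what the accuracy score measures, here is a hand computation on five made-up predictions (not taken from the insurance data). It also computes a simple baseline, namely the accuracy you would get by always guessing the most frequent class, which is a useful yardstick when you judge any score.

```python
from collections import Counter

# Made-up true and predicted regions for five customers
true_labels = ["northeast", "southeast", "southwest", "southeast", "northwest"]
predicted = ["northeast", "southwest", "southwest", "southeast", "northeast"]

# Accuracy = number of correct predictions / number of predictions
correct = sum(t == p for t, p in zip(true_labels, predicted))
accuracy = correct / len(true_labels)
print(correct, "of", len(true_labels), "correct -> accuracy =", accuracy)  # 3 of 5 -> 0.6

# Baseline: always predict the most frequent true class
most_common_count = Counter(true_labels).most_common(1)[0][1]
baseline = most_common_count / len(true_labels)
print("majority-class baseline =", baseline)  # 0.4
```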

In [None]:
# Next, the Confusion Matrix

from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay

cm = confusion_matrix(y_test, y_pred, labels=rf.classes_)
cm_display = ConfusionMatrixDisplay(cm, display_labels=rf.classes_).plot()

What did the Confusion Matrix look like for the simple tree? What differences do you notice? Record your observations in the field below.
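
If reading the plot is tricky, it can help to build a tiny confusion matrix by hand. The sketch below uses six made-up predictions and follows the same convention as scikit-learn: rows are the true classes, columns are the predicted classes, so correct predictions sit on the diagonal.

```python
from collections import Counter

labels = ["northeast", "northwest", "southeast", "southwest"]

# Made-up true and predicted regions for six customers
y_true_toy = ["northeast", "northeast", "northwest", "southeast", "southeast", "southwest"]
y_pred_toy = ["northeast", "northwest", "northwest", "southeast", "southwest", "southwest"]

# Count every (true, predicted) pair, then lay the counts out as a grid
pair_counts = Counter(zip(y_true_toy, y_pred_toy))
matrix = [[pair_counts[(t, p)] for p in labels] for t in labels]

for true_label, row in zip(labels, matrix):
    print(f"{true_label:>10}", row)

# The diagonal holds the correct predictions
diagonal = sum(matrix[i][i] for i in range(len(labels)))
print("correct predictions:", diagonal)  # 4
```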

In [None]:
# Finally, the Classification Report
import sklearn.metrics as metrics
from sklearn.metrics import classification_report

print(metrics.classification_report(y_test, y_pred, labels=['southwest', 'southeast', 'northwest', 'northeast']))

Again, compare this output with the output for the simple tree. What are the differences? Overall, would you say that Random Forest works better? Or, given that we're doing a whole lot more processing, is any improvement worth it? Record your answer below.
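
For the comparison, it helps to remember where the precision and recall columns of the report come from. The helper `precision_recall` below is a hypothetical function written for this sketch, applied to six made-up predictions; the Classification Report calculates the same two quantities for every class.

```python
def precision_recall(y_true, y_pred, label):
    """Precision and recall for one class, computed from raw counts."""
    true_positives = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted_as_label = sum(p == label for p in y_pred)
    actually_label = sum(t == label for t in y_true)
    # Precision: of everything predicted as this class, how much was right?
    precision = true_positives / predicted_as_label if predicted_as_label else 0.0
    # Recall: of everything that really is this class, how much did we find?
    recall = true_positives / actually_label if actually_label else 0.0
    return precision, recall


# Made-up true and predicted regions for six customers
y_true_toy = ["northeast", "northeast", "northwest", "southeast", "southeast", "southwest"]
y_pred_toy = ["northeast", "northwest", "northwest", "southeast", "southwest", "southwest"]

print(precision_recall(y_true_toy, y_pred_toy, "southwest"))  # (0.5, 1.0)
print(precision_recall(y_true_toy, y_pred_toy, "northeast"))  # (1.0, 0.5)
```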

# **What If ...**
So far, we have used only the default settings on the Random Forest algorithm. What if we play with different configuration settings, such as the number of trees? Or the depth of the trees? Or the minimum samples required to split?
<center>
<img src = "https://uploads-ssl.webflow.com/646218c67da47160c64a84d5/64634977cc057db29263d4a6_81.png" width = 300>
</center>

First, let's set up the parameters as variables so that we can easily change them:

In [19]:
# We are setting up the n_estimators and other configuration parameters so that we can easily change them
# Feel free to comment any of these out or change the values and re-run the cells below to see how this changes the result
n_estimators = 10000 # This is the number of different trees to build; default was 100; we are increasing this number a hundredfold.
min_samples_split = 5 # Previously, we ran this with the default split of 2
criterion='entropy' # This is for Information Gain, as in section 4.1; scikit-learn's own default would be the Gini Index

Then, let's build the classifier again, now with the different settings.

In [None]:
rf2 = RandomForestClassifier(verbose=1, n_estimators=n_estimators, min_samples_split=min_samples_split, criterion=criterion)
rf2.fit(X_train, y_train)

Time to predict and evaluate with **accuracy score, Confusion Matrix, and Classification Report!**

In [None]:
# Testing and predicting
y_pred = rf2.predict(X_test)

print(accuracy_score(y_test, y_pred))

cm2 = confusion_matrix(y_test, y_pred, labels=rf2.classes_)
cm2_display = ConfusionMatrixDisplay(cm2, display_labels=rf2.classes_).plot()  # plot cm2 (the new model), not cm

print(metrics.classification_report(y_test, y_pred, labels=['southwest', 'southeast', 'northwest', 'northeast']))

##Your Turn
**Try this with a number of different settings. Does using 1,000 trees improve the accuracy by a little--or a lot? What about 10,000 trees?**

Play around with the settings, then record below what you have done and what your results were. Interpret what you're seeing: Is more processing worth it? Or is there a point where we accept the results as "good enough"?
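
To answer the "is it worth it?" question with numbers, you can time the training alongside the accuracy. The sketch below assumes a synthetic four-class dataset instead of the insurance data and uses only a few small forest sizes so that it runs quickly; the absolute timings will differ on your machine.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with four classes
Xt, yt = make_classification(
    n_samples=400, n_features=4, n_informative=3, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)
Xt_train, Xt_test, yt_train, yt_test = train_test_split(
    Xt, yt, test_size=0.2, random_state=0
)

results = []  # one (n_trees, seconds, accuracy) entry per forest size
for n_trees in (10, 100, 300):
    start = time.perf_counter()
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    forest.fit(Xt_train, yt_train)
    seconds = time.perf_counter() - start
    acc = accuracy_score(yt_test, forest.predict(Xt_test))
    results.append((n_trees, seconds, acc))
    print(f"{n_trees:>4} trees: {seconds:.2f} s to train, accuracy {acc:.3f}")
```

Training time grows roughly in step with the number of trees, so comparing the two columns shows you how much accuracy each extra second buys.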

#**7. Towards Optimization**
<center>
<img src = "https://miro.medium.com/v2/resize:fit:640/format:webp/1*lIaapnR-0Cdf3kHsZtGcxw.jpeg">
</center>

You just played with the tree setting manually. What if we could cycle the algorithm through a list of "number of trees" settings and see what happens then?

All we need is a quick "for" loop with a range. This range setting is configured like this:


```
range(starting_point, termination_point, increment_size)
```
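In other words, range(20,200,20) means that we start with 20 trees and go up in steps of 20, stopping before we reach 200 (the termination point itself is not included). So, we will look at the behavior for 20, 40, 60, 80, 100, 120, 140, 160, and 180 trees. To include 200 trees as well, set the termination point to 201. The code is below.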
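
A quick way to check what a range setting produces is to turn it into a list before using it in the loop:

```python
# The termination point is excluded, so this stops at 180
print(list(range(20, 200, 20)))  # [20, 40, 60, 80, 100, 120, 140, 160, 180]

# Moving the termination point just past 200 includes 200 as well
print(list(range(20, 201, 20)))  # [20, 40, 60, 80, 100, 120, 140, 160, 180, 200]
```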




In [None]:
# We can even cycle through a number of trees in the Random Forest
for n_estimators in range(20,200,20):
    print('Accuracy score using n_estimators =', n_estimators, end = ': ')

    rf3 = RandomForestClassifier(n_estimators = n_estimators, verbose=1)
    rf3.fit(X_train, y_train)
    y_pred = rf3.predict(X_test)
    print(accuracy_score(y_test, y_pred))

How about setting this to 1,000 or even 10,000 trees and seeing the accuracy score change? **Go ahead and play with the range setting!** Then record below which range setting you think works best for you.
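
When you compare many settings, it is easier to store the scores than to read them off the screen. Here is one possible sketch, again on synthetic stand-in data rather than the insurance set. It fixes `random_state` so that the number of trees is the only thing that changes between runs, and the names `scores` and `best_n` are made up for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with four classes
Xs, ys = make_classification(
    n_samples=400, n_features=4, n_informative=3, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=1,
)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    Xs, ys, test_size=0.2, random_state=1
)

scores = {}  # maps number of trees -> accuracy on the test set
for n_trees in range(20, 201, 20):
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=1)
    forest.fit(Xs_train, ys_train)
    scores[n_trees] = accuracy_score(ys_test, forest.predict(Xs_test))

# Pick the setting with the highest accuracy (the first one wins on ties)
best_n = max(scores, key=scores.get)
print("best number of trees:", best_n, "with accuracy", scores[best_n])
```

Small differences between neighboring settings are often just noise from the random split, so a "best" value found this way is a hint rather than a final answer.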

#Solutions (to help you if you get stuck)

In [None]:
# This is the solution for task 2 above.
insurance_rf = pd.DataFrame(insurance, columns = ['age', 'bmi', 'children','charges','region'])
insurance_rf.head()

In [None]:
# This is the solution for task 3 above:
from sklearn.model_selection import train_test_split
x=insurance_rf.iloc[:,:4] # all parameters
y=insurance_rf['region'] # class labels 'southwest', 'southeast', 'northwest', 'northeast'
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.2)
print("X_train shape: {}".format(X_train.shape))
print("X_test shape: {}".format(X_test.shape))

In [None]:
# Solution for task 4.2
rf.fit(X_train, y_train)