# Random Forest

This notebook looks at random forests.

The easiest way to think of random forests is to imagine creating a whole series of decision trees, each of which sees a slightly different sample of the data and is limited in the features it can use.  We then gather the outcomes of all of those trees and use them to come to a consensus.  Let's compare our decision tree classifier from the prior notebook to the random forest classifier here and see if we can get a better result than before.
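To make that idea concrete, here is a minimal hand-rolled sketch of a forest. It is not how we will build the model in this notebook, and it runs on synthetic data from `make_classification` (an assumption, standing in for the heart attack dataset). Each tree trains on rows drawn with replacement and considers only a subset of features at each split; the consensus here is a simple majority vote. Note that scikit-learn's own `RandomForestClassifier` averages the trees' predicted probabilities rather than counting votes, so treat this as an illustration of the concept only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the heart attack dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

rng = np.random.default_rng(0)
trees = []
for i in range(25):
    # Each tree trains on a bootstrap sample: rows drawn with replacement
    rows = rng.integers(0, len(X_demo), size=len(X_demo))
    # max_features='sqrt' limits the features considered at each split
    t = DecisionTreeClassifier(max_depth=4, max_features='sqrt', random_state=i)
    trees.append(t.fit(X_demo[rows], y_demo[rows]))

# Gather every tree's prediction and take the majority as the consensus
votes = np.stack([t.predict(X_demo) for t in trees])  # shape: (25, 300)
consensus = (votes.mean(axis=0) > 0.5).astype(int)
print("Agreement with labels:", (consensus == y_demo).mean())
```

Using an odd number of trees avoids tied votes in this two-class setting.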


In [None]:
import numpy as np
import pandas as pd
from sklearn import ensemble
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
clf = ensemble.RandomForestClassifier(max_depth=4)

## Load Data

For this demo, we will load a dataset of individuals and whether they have a high chance of long-term heart disease (output = 1) after a heart attack.

In [None]:
heart_attack_data = "../data/HeartAttackData.csv"
df = pd.read_csv(heart_attack_data, header=0)

# Review the data
df

These measures aren't very self-explanatory, so let's explain them here.  These are the same explanations that we saw in the CART notebook, but they're included again here for clarity.

- `age` = Age of patient
- `sex` = Sex of the patient (0 = female, 1 = male)
- `cp` = Type of chest pain.
  - 1 = Typical angina
  - 2 = Atypical angina
  - 3 = Non-anginal pain
  - 4 = Asymptomatic
- `trtbps` = Resting blood pressure (mm Hg)
- `chol` = Cholesterol level
- `fbs` = Fasting blood sugar above 120 mg/dl
- `restecg` = Resting ECG result
  - 0 = Normal
  - 1 = ST-T wave abnormality
  - 2 = Probable or definite left ventricular hypertrophy
- `thalachh` = Maximum heart rate achieved
- `exng` = Exercise-induced angina (1 = yes, 0 = no)
- `oldpeak` = ST depression induced by exercise relative to rest
- `slp` = Slope of the peak exercise ST segment
- `caa` = Number of major vessels (0-3)
- `thall` = Thallium Stress Test result (ranges from 0-3)
- `output` = Diagnosis of heart disease (0 = < 50% diameter narrowing, 1 = > 50% diameter narrowing)


## Split Labels from Features

Let's now create two variables:  `y`, which is the thing we want to predict (output: `{ 0, 1 }`); and `X`, which is everything we can use to predict the specific value of `y`.

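With pandas, selecting one column and dropping it from the rest preserves row order, so `X` and `y` stay aligned row for row.  This is something we might have to worry about if we split the data up in SQL, where row order is not guaranteed unless we ask for it.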

In [None]:
y = df['output']
X = df.drop('output', axis=1)
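A quick way to confirm that features and labels stay aligned is to compare their indexes. The sketch below uses a tiny made-up frame (`toy`, with illustrative values only) rather than the real data, so it can run on its own.

```python
import pandas as pd

# Tiny made-up frame standing in for the real data (values are illustrative only)
toy = pd.DataFrame({'age': [63, 37, 41], 'chol': [233, 250, 204], 'output': [1, 1, 0]})

y_toy = toy['output']
X_toy = toy.drop('output', axis=1)

# Both pieces keep the original row index, in the original order
print(X_toy.index.equals(y_toy.index))
# drop() returns a new frame; the label column is gone from the features only
print(list(X_toy.columns))
```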

## Split into Training & Test Datasets

The sklearn library has a function called `train_test_split` which breaks our data out into training and test datasets.  This allows us to train a model on one set of data and then see how it performs on a completely different set of data.  This gives us a better idea of how our model might perform than simply using accuracy on the training dataset, as models tend to **overfit**:  they latch onto the peculiarities of the training dataset.  If those peculiarities do not also exist in the broader population, then the trained model may come up with the wrong answer.  Having a separate test dataset that the trained model knows nothing about gives us a better idea of realistic behavior.  It also gives us a measure of how much the trained model overfits:  we can compare the training accuracy to the test accuracy, and a substantial difference between the two means our model is overfitting quite a bit.  Below, `test_size=0.30` holds out 30% of the rows for testing, and `random_state` fixes the shuffle so the split is repeatable.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1740)
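The training-versus-test comparison described above can be sketched as follows. This uses synthetic data and a deliberately unconstrained decision tree (both assumptions chosen to make the gap easy to see), not the model from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the heart attack dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=0)

# With no depth limit, a tree can memorize the training rows
deep_tree = DecisionTreeClassifier(random_state=0).fit(Xd_train, yd_train)

train_acc = accuracy_score(yd_train, deep_tree.predict(Xd_train))
test_acc = accuracy_score(yd_test, deep_tree.predict(Xd_test))

# A large positive gap between the two is the signature of overfitting
print(f"train={train_acc:.3f}  test={test_acc:.3f}  gap={train_acc - test_acc:.3f}")
```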

## Perform Classification

We'll train the model on our training data, ignoring the test data for now.  With sklearn, this is easy:  use the `fit()` method.

In [None]:
clf = clf.fit(X_train, y_train)

## How'd we do?

Let's use the `accuracy_score` function in sklearn to see just how well we did on the test data.

In [None]:
predicted = clf.predict(X_test)
accuracy_score(y_test, predicted)

Our accuracy score is now 83.5%, whereas for CART, it was 75.8%.  It would appear that the random forest classifier is better than the decision tree classifier when it comes to this dataset.  Because we did not set a `random_state` on the classifier, your exact number may differ slightly from run to run.

This is fairly normal behavior:  we typically see random forests beat individual decision trees.
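A single train/test split on a small dataset can be noisy, so one hedged way to check a claim like this is to compare the two model types with cross-validation. The sketch below does that on synthetic data (an assumption, not the heart attack dataset); it makes no promise about which model wins on any particular dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the heart attack dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

models = {
    'decision tree': DecisionTreeClassifier(max_depth=4, random_state=0),
    'random forest': RandomForestClassifier(max_depth=4, random_state=0),
}

cv_means = {}
for name, model in models.items():
    # Five folds: each model is trained and scored five times on different splits
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    cv_means[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```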

## Randomized Search

We stuck with most of the defaults, save for the max depth of each tree. But there are several parameters we can work with along the way. The question is, how do we decide what settings to use? We could tune by hand, tweaking parameters ad hoc until we get an outcome that looks okay. That's not the only (or even the best) way to do things, however. Let's try using a technique known as randomized search.

The definition of `RandomForestClassifier()` is as follows:

```
class sklearn.ensemble.RandomForestClassifier(n_estimators=100, *, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='sqrt', max_leaf_nodes=None, min_impurity_decrease=0.0, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None, ccp_alpha=0.0, max_samples=None, monotonic_cst=None)
```

Things that look interesting include:

* n_estimators - The number of estimators (trees) we create in the forest.
* criterion - One of { gini, entropy, log_loss }
* max_depth - The maximum number of levels beneath the root node
* min_samples_split - The minimum number of samples needed to split an internal node
* min_samples_leaf - The minimum number of samples needed in a leaf node
* max_features - The maximum number of features to consider per split. One of { sqrt, log2 }, or specify an integer count or a float fraction of the features
* max_leaf_nodes - Grow up to this many leaf nodes in a best-first fashion

Let's come up with a set of parameters that we could try.

In [None]:
from sklearn.model_selection import RandomizedSearchCV

parameters = {
    'n_estimators': [40, 50, 75, 100, 125, 150],
    'criterion': ['gini', 'entropy', 'log_loss'],
    'max_depth': [2, 3, 4, 5, 6, 7],
    'max_features': ['sqrt', 'log2'],
    'min_samples_leaf': [1, 2, 3, 4, 5]
}

If you do the math on this, there are 6 * 3 * 6 * 2 * 5 = 1080 possible parameter combinations. This could take a while if we wanted to use a thorough grid search.

That's why we're going to use randomized search and pick 40 parameter combinations at random.
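The arithmetic above can be checked with scikit-learn's `ParameterGrid`, and `ParameterSampler` shows what drawing 40 combinations at random looks like. This is a small sketch that repeats the parameter dictionary so it runs on its own; the `random_state=0` seed is an assumption added for repeatability.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

parameters = {
    'n_estimators': [40, 50, 75, 100, 125, 150],
    'criterion': ['gini', 'entropy', 'log_loss'],
    'max_depth': [2, 3, 4, 5, 6, 7],
    'max_features': ['sqrt', 'log2'],
    'min_samples_leaf': [1, 2, 3, 4, 5]
}

# Every combination a full grid search would have to try
print(len(ParameterGrid(parameters)))  # 1080

# Draw 40 distinct combinations at random instead
sampled = list(ParameterSampler(parameters, n_iter=40, random_state=0))
print(len(sampled))  # 40
print(sampled[0])
```

Because every entry in the dictionary is a list, the sampler draws without replacement, so no combination is tried twice.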

In [None]:
rf = ensemble.RandomForestClassifier()
clf = RandomizedSearchCV(rf, parameters, n_iter=40)

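Now let's run the `fit()` method, which will evaluate each sampled set of parameters. This will take a little longer:  by default, `RandomizedSearchCV` scores every candidate with 5-fold cross-validation, so we're training 40 * 5 = 200 forests, plus one final refit of the best candidate on the full training set.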

In [None]:
clf.fit(X_train, y_train)

Once it finishes, we can determine what the best set of parameters is given our particular search criteria.

In [None]:
clf.best_params_
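Beyond `best_params_`, the fitted search object also records how every candidate scored. The sketch below shows one way to inspect that, using synthetic data and a much smaller, hypothetical grid (`small_grid`) so that it runs quickly on its own.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data and a reduced grid (both assumptions, for speed)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
small_grid = {'n_estimators': [20, 40], 'max_depth': [2, 3, 4], 'min_samples_leaf': [1, 3]}

search = RandomizedSearchCV(RandomForestClassifier(random_state=0), small_grid,
                            n_iter=5, cv=5, random_state=0)
search.fit(X_demo, y_demo)

# cv_results_ holds one row per sampled candidate
results = pd.DataFrame(search.cv_results_)
columns = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
print(results[columns].sort_values('rank_test_score'))

# best_score_ is the mean cross-validated accuracy of the top-ranked candidate
print("Best mean CV accuracy:", search.best_score_)
```

Looking at the spread of `mean_test_score` across candidates gives a sense of whether the top-ranked parameters are meaningfully better or just marginally ahead.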

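Let's see what it looks like when we fill in these values. We'll create one more classifier, using the parameters from one run of the search; because the search samples at random, your best parameters may differ.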

In [None]:
clf = ensemble.RandomForestClassifier(max_depth=3, n_estimators=125, min_samples_leaf=3, max_features='sqrt', criterion='gini')

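Now let's fit the new model. Arguably, we don't **need** to do this:  by default, `RandomizedSearchCV` already refits the best parameter combination on the full training set and exposes it as `best_estimator_`. One more iteration doesn't take much time, though.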

In [None]:
clf = clf.fit(X_train, y_train)

Then, we can generate predictions and check the accuracy.

In [None]:
predicted = clf.predict(X_test)
accuracy_score(y_test, predicted)

For this particular dataset, performing randomized or grid search appears to offer a small benefit. We're probably not going to break 85-86% accuracy, but tuned parameters should get us close to that peak more consistently.
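As an alternative to typing the winning values in by hand, the search results can be reused directly. The sketch below shows two options on synthetic data with a small, hypothetical grid (`small_grid`): scoring the already-refit `best_estimator_`, or unpacking `best_params_` into a fresh classifier.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in data and a reduced grid (both assumptions, for speed)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=0)
small_grid = {'n_estimators': [20, 40], 'max_depth': [2, 3, 4]}

search = RandomizedSearchCV(RandomForestClassifier(random_state=0), small_grid,
                            n_iter=4, cv=3, random_state=0)
search.fit(Xd_train, yd_train)

# Option 1: best_estimator_ has already been refit on all of Xd_train
acc_refit = accuracy_score(yd_test, search.best_estimator_.predict(Xd_test))

# Option 2: unpack best_params_ into a new classifier and fit it ourselves
manual = RandomForestClassifier(random_state=0, **search.best_params_)
manual.fit(Xd_train, yd_train)
acc_manual = accuracy_score(yd_test, manual.predict(Xd_test))

print(search.best_params_, acc_refit, acc_manual)
```

Either option avoids copying values out of one run's output, which matters because a randomized search can return different winners each time.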

## Viewing a Sample Tree

As mentioned above, the random forest classifier is really just a series of decision trees.  Let's use the `graphviz` library to visualize one of those trees (the one at index 20 of the fitted forest's `estimators_` list).

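In order to run this section, you will need the `graphviz` library.  You can get it from conda or pip.  Note that the pip package contains only the Python bindings, so with pip you will also need the Graphviz binaries installed on your system: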

`conda install python-graphviz`

`pip install graphviz`

In [None]:
import graphviz
from sklearn import tree
# Index 20 selects the 21st tree in the forest; any index below n_estimators works
dot_data = tree.export_graphviz(
    clf.estimators_[20], out_file=None, feature_names=X.columns.values.tolist(),
    class_names=['No heart disease', 'Heart disease'],
    filled=True, rounded=True, special_characters=True)
graph = graphviz.Source(dot_data)
graph
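If Graphviz is not available, scikit-learn's `export_text` offers a plain-text view of the same kind of tree. The sketch below fits a small forest on synthetic data with made-up feature names (both assumptions) so it can run on its own.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_text

# Synthetic stand-in data with hypothetical feature names
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]

forest = RandomForestClassifier(n_estimators=25, max_depth=3, random_state=0)
forest.fit(X_demo, y_demo)

# Render one tree from the forest as indented text rules
text_view = export_text(forest.estimators_[20], feature_names=feature_names)
print(text_view)
```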

Leaf nodes in **blue** were cases where there was long-term heart disease, and leaf nodes in **orange** had no long-term heart disease.  Non-leaf nodes are colored based on how likely they are to indicate heart disease.

As a reminder, the single decision tree from the prior notebook started with `thall <= 2.5`, so we can tell that this tree is definitely starting off differently.
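To see how much the trees in a forest differ from one another, one option is to count which feature each tree uses for its root split. This is a sketch on synthetic data with made-up feature names (both assumptions); it relies on the fitted tree's `tree_.feature` array, where position 0 is the root node.

```python
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with hypothetical feature names
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]

forest = RandomForestClassifier(n_estimators=50, max_depth=3, random_state=0)
forest.fit(X_demo, y_demo)

# tree_.feature[0] is the index of the feature tested at the root node;
# a negative value would mean the root is a leaf, so those are skipped
root_features = Counter(
    feature_names[est.tree_.feature[0]]
    for est in forest.estimators_
    if est.tree_.feature[0] >= 0
)
for name, count in root_features.most_common():
    print(f"{name}: root split in {count} trees")
```

Seeing several different features at the root is expected: each tree gets its own bootstrap sample and considers only a random subset of features at each split.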
