# Classification with Random Forests

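We will again look at a binary classification problem. The data set loaded below contains heart-disease records, and the goal is to predict the `num` column (0 or 1) from the other ten columns. A random forest model creates many decision trees; each tree is trained on a bootstrap sample of the training rows and considers only a random subset of the features when choosing each split. To make a prediction, the random forest asks each tree for its prediction, and the class predicted by the most trees is the final prediction. (scikit-learn's implementation averages the trees' predicted class probabilities instead of counting hard votes, which usually leads to the same answer.)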
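
To make the voting step concrete, here is a minimal sketch of a majority vote over the predictions of a few trees. The helper `majority_vote` and the toy predictions are hypothetical and only illustrate the idea; this is not how scikit-learn combines its trees internally.

```python
import numpy as np

def majority_vote(tree_predictions):
    """Return the most common class label per sample.

    tree_predictions: integer array of shape (n_trees, n_samples),
    where each row holds one tree's predicted class labels.
    """
    tree_predictions = np.asarray(tree_predictions)
    n_classes = tree_predictions.max() + 1
    # For every sample (column), count how many trees voted for each class.
    votes = np.apply_along_axis(np.bincount, 0, tree_predictions, minlength=n_classes)
    # votes has shape (n_classes, n_samples); ties go to the lowest label.
    return votes.argmax(axis=0)

# Three hypothetical trees voting on four samples.
toy_predictions = np.array([
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
])
forest_prediction = majority_vote(toy_predictions)
print(forest_prediction)  # [0 1 0 0]
```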

In [372]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn as sk

## Load Data

In [373]:
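# Load the data. The variable is called `wine`, but the file holds
# heart-disease records; `num` (0 or 1) is the target column.
wine = pd.read_csv('heart_disease_data.csv',header=0)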

In [374]:
wine.head()

   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  num
0   62    1   4       160   254    1        1      108      1      3.0    1
1   46    1   4       140   311    0        0      120      1      1.8    1
2   39    0   3       138   220    0        0      152      0      0.0    0
3   56    1   1       120   193    0        2      162      0      1.9    0
4   43    0   2       120   201    0        0      165      0      0.0    0


In [375]:
wine.shape

(561, 11)

In [376]:
wine.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 561 entries, 0 to 560
Data columns (total 11 columns):
age         561 non-null int64
sex         561 non-null int64
cp          561 non-null int64
trestbps    561 non-null int64
chol        561 non-null int64
fbs         561 non-null int64
restecg     561 non-null int64
thalach     561 non-null int64
exang       561 non-null int64
oldpeak     561 non-null float64
num         561 non-null int64
dtypes: float64(1), int64(10)
memory usage: 48.3 KB


In [377]:
# Create the numpy arrays
x = np.array(wine.iloc[:,0:10])
y = np.array(wine["num"])

In [378]:
# split data into training set and test set. I'm putting 33% into test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33)
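
The split above is random, so every run of the notebook gives slightly different scores. As a sketch of one way to make it repeatable, `train_test_split` accepts `random_state` (fixes the shuffle) and `stratify` (keeps the class balance the same in both parts). The synthetic arrays `x_demo` and `y_demo` and the seed 42 are assumptions for illustration, not part of the original analysis.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 samples, 10 features, balanced binary labels.
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(100, 10))
y_demo = np.array([0, 1] * 50)

# random_state fixes the shuffle; stratify keeps the 0/1 ratio in both parts.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.33, random_state=42, stratify=y_demo
)
print(x_tr.shape, x_te.shape)
print(y_tr.mean(), y_te.mean())
```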

## Create and Fit Random Forest Model

In [379]:
# Import the random forest classifier and fit it
# to the training data
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(x_train,y_train)



RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

## Evaluate Model on Test Set

In [380]:
# Ask the random forest model to predict the class (num)
# of every sample in the test set
y_test_pred = model.predict(x_test)

In [381]:
# And then evaluate our model performance on that test set
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test,y_test_pred)

array([[77, 21],
       [25, 63]], dtype=int64)

In [382]:
# Accuracy => the proportion of test samples that were correctly classified
# i.e. it is the true positives + true negatives divided by all samples
from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_test_pred)

0.7526881720430108

In [383]:
# Precision => out of all samples predicted positive (num = 1), what proportion truly were?
from sklearn.metrics import precision_score
precision_score(y_test,y_test_pred,pos_label=1)

0.75

In [384]:
# Recall => out of all samples that are positive (num = 1), what proportion were also predicted positive?
from sklearn.metrics import recall_score
recall_score(y_test,y_test_pred,pos_label=1)

0.7159090909090909
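
As a check on the three scores, they can be recomputed by hand from the confusion matrix printed earlier. In scikit-learn's layout the rows are the true labels and the columns are the predicted labels, so for labels 0 and 1 the flattened matrix reads TN, FP, FN, TP.

```python
import numpy as np

# Confusion matrix reported above: rows are true labels, columns are predictions.
cm = np.array([[77, 21],
               [25, 63]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # (63 + 77) / 186
precision = tp / (tp + fp)        # 63 / (63 + 21)
recall = tp / (tp + fn)           # 63 / (63 + 25)
print(accuracy, precision, recall)
```

The three values agree with the outputs of `accuracy_score`, `precision_score` and `recall_score` above.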

## Variable Importance

A useful feature of random forests is that the fitted trees can be used to estimate how important each variable is. In scikit-learn, the `feature_importances_` attribute holds impurity-based importances that sum to 1; see the sklearn documentation for the details. To find the variable importances, use this:

In [385]:
print(model.feature_importances_)

[0.13989814 0.04397072 0.12994805 0.10957627 0.12171914 0.02443111
 0.02773919 0.15292884 0.07361018 0.17617836]


In [386]:
# These correspond, in order, to the first ten columns (the predictors);
# the last column, num, is the target:
wine.columns

Index(['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
       'exang', 'oldpeak', 'num'],
      dtype='object')
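
Matching the numbers to the names by eye is error-prone, so here is a small sketch that pairs them up and sorts them. The values are copied from the output above; with a fitted model one would pass `model.feature_importances_` and `wine.columns[:10]` instead.

```python
import pandas as pd

# Predictor names and importances copied from the outputs above.
feature_names = ['age', 'sex', 'cp', 'trestbps', 'chol',
                 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak']
importances = [0.13989814, 0.04397072, 0.12994805, 0.10957627, 0.12171914,
               0.02443111, 0.02773919, 0.15292884, 0.07361018, 0.17617836]

# Label each importance with its feature and sort from most to least important.
ranked = pd.Series(importances, index=feature_names).sort_values(ascending=False)
print(ranked)
```

For this particular fit, `oldpeak`, `thalach` and `age` come out on top, while `fbs` and `restecg` contribute the least.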

## My model is bad! How do I make it better? (Model Parameters)

Lots of models allow you to set parameters that change the behaviour of those models. Trying out different parameters could lead to a better model for your data. Use the sklearn documentation to figure out what parameters a model takes. Here I'll use random forests as an example!
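
Besides the documentation, every scikit-learn estimator can list its own parameters through `get_params()`. A quick sketch (the three parameter names picked out here are just the ones used later in this notebook):

```python
from sklearn.ensemble import RandomForestClassifier

# get_params() returns every constructor parameter with its current value.
params = RandomForestClassifier().get_params()
for name in ("n_estimators", "max_depth", "max_features"):
    print(name, "=", params[name])
```

The printed defaults depend on the installed scikit-learn version.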

In [387]:
# Go to the random forest documentation at:
# https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
# On that page, you'll find various possible parameters for the random forest model.
# Let's try a random forest model with 50 decision trees, a max depth of 5 and up to
# 10 features considered at each split:

In [388]:
model2 = RandomForestClassifier(n_estimators = 50, max_depth = 5, max_features = 10)
model2.fit(x_train,y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=5, max_features=10, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=50,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [389]:
# let's evaluate that model:
y_test_pred2 = model2.predict(x_test)
print(confusion_matrix(y_test,y_test_pred2))
print("Accuracy:  {}".format(accuracy_score(y_test,y_test_pred2)))
print("Precision: {}".format(precision_score(y_test,y_test_pred2,pos_label=1)))
print("Recall:    {}".format(recall_score(y_test,y_test_pred2,pos_label=1)))

[[78 20]
 [20 68]]
Accuracy:  0.7849462365591398
Precision: 0.7727272727272727
Recall:    0.7727272727272727


## Cross Validation Grid Search Through Parameters

You can use Grid Search to try lots of different parameter combinations. Here is the code to do that, with random forests as an example.

In [390]:
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV

# Create a dictionary mapping parameter names to lists of values to try!
param_grid = {'n_estimators':[50,100,150,200,250,300],
              'max_depth': [3,5,50,100,150,200],
              'max_features': ["auto",10]}

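# The above means that we try 6*6*2 = 72 random forest models! The first one has
# 50 trees with max_depth of 3 and an automatically determined number of features
# considered at each split.
# Note: scikit-learn 1.3 and later no longer accept "auto"; for classifiers it
# meant the same as "sqrt".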

# Then to fit the model, we do this.
# 1. Create random forest model
rf = RandomForestClassifier()

# 2. Create a GridSearchCV object and tell it about your
#    random forest model and parameter grid. cv = 3 means
#    you want to use 3-fold cross validation.
rf_cv= GridSearchCV(rf,param_grid,cv=3)

# 3. Now fit the model to the training data
rf_cv.fit(x_train,y_train)

GridSearchCV(cv=3, error_score='raise-deprecating',
             estimator=RandomForestClassifier(bootstrap=True, class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators='warn', n_jobs=None,
                                              oob_score=False,
                                              random_state=None, verbose=0,
                                              warm_start=False),
             iid

In [391]:
# Now find the best mean cross-validated score and the parameters that achieved it:
print(rf_cv.best_score_)
print(rf_cv.best_params_)

0.8
{'max_depth': 5, 'max_features': 'auto', 'n_estimators': 200}
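
To see how much work the search does, the number of candidate models can be counted with `ParameterGrid`, which enumerates the same combinations that `GridSearchCV` tries. This sketch only counts combinations and does not fit anything.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the search above.
param_grid = {'n_estimators': [50, 100, 150, 200, 250, 300],
              'max_depth': [3, 5, 50, 100, 150, 200],
              'max_features': ["auto", 10]}

n_candidates = len(ParameterGrid(param_grid))  # 6 * 6 * 2 combinations
n_fits = n_candidates * 3                      # each one is fitted once per fold (cv=3)
print(n_candidates, n_fits)  # 72 216
```

With the default `refit=True` there is one more fit at the end, when the best combination is retrained on the whole training set.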


In [392]:
# so it seems from our 72 models, the best random forest model had
# 200 trees, automatically determined number of features and a
# max_depth of 5. We can now just create that model:

In [393]:
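# The same configuration written out explicitly. It is not fitted here, because
# GridSearchCV (refit=True by default) has already refit the best configuration
# on the whole training set and uses it for rf_cv.predict.
model_final = RandomForestClassifier(n_estimators = 200, max_depth = 5, max_features="auto")

In [394]:
y_test_pred_final = rf_cv.predict(x_test)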

In [395]:
confusion_matrix(y_test,y_test_pred_final)

array([[78, 20],
       [18, 70]], dtype=int64)

In [396]:
accuracy_score(y_test,y_test_pred_final)

0.7956989247311828

In [397]:
precision_score(y_test,y_test_pred_final,pos_label=1)

0.7777777777777778

In [398]:
recall_score(y_test,y_test_pred_final,pos_label=1)

0.7954545454545454
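
The predictions above come from `rf_cv` itself. As a self-contained sketch of why that works, the example below runs a tiny grid search on synthetic data and confirms that `predict` on the search object gives the same result as `predict` on its `best_estimator_`. The data from `make_classification`, the small grid and the seeds are assumptions chosen to keep the example fast; they are not the settings used above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the real data.
x_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)
x_tr, x_te, y_tr, y_te = train_test_split(x_demo, y_demo, test_size=0.33, random_state=0)

# A deliberately small grid so the search finishes quickly.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {'n_estimators': [10, 20], 'max_depth': [3, 5]},
    cv=3,
)
search.fit(x_tr, y_tr)

# With refit=True (the default), the best configuration is retrained on all
# of x_tr and stored as best_estimator_; search.predict delegates to it.
same = np.array_equal(search.predict(x_te), search.best_estimator_.predict(x_te))
print(search.best_params_, same)
```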