# Module 19

### 19.5.4 Random Forest Vs. Deep Learning Model

Random forest classifiers are one of Bek’s favorite model types. (Not only are they powerful, she can’t help thinking about actual forests when she sees the name!) She’s curious how they stack up against a deep learning model.


Random forest classifiers are a type of ensemble learning model that combines multiple smaller models into a more robust and accurate model. 

Random forest models use a number of weak learner algorithms (decision trees) and combine their output to make a final classification (or regression) decision. 
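As a rough illustration of how the individual outputs are combined, the sketch below uses three hand-written rules as stand-ins for trained decision trees and takes a simple majority vote. The rule names, thresholds, and applicant record are hypothetical, and real libraries may combine trees differently (for example, by averaging predicted probabilities).

```python
from collections import Counter

# Three hypothetical "trees", each reduced to a single true/false question.
def tree_credit(applicant):
    return "Fully_Paid" if applicant["credit_score"] >= 700 else "Not_Paid"

def tree_income(applicant):
    return "Fully_Paid" if applicant["annual_income"] >= 50_000 else "Not_Paid"

def tree_debt(applicant):
    return "Fully_Paid" if applicant["monthly_debt"] <= 1_500 else "Not_Paid"

def forest_predict(trees, applicant):
    """Return the class that receives the most votes from the trees."""
    votes = Counter(tree(applicant) for tree in trees)
    return votes.most_common(1)[0][0]

# Hypothetical applicant: two trees vote one way, one votes the other.
applicant = {"credit_score": 720, "annual_income": 42_000, "monthly_debt": 900}
prediction = forest_predict([tree_credit, tree_income, tree_debt], applicant)
print(prediction)
```

Even though one rule disagrees, the forest follows the majority, which is why a single poor tree has limited influence on the final decision.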

In one respect, random forest models resemble their neural network counterparts: both combine the outputs of many simple units to reach a final prediction. 

Random forest models have been a staple in machine learning algorithms for many years due to their robustness and scalability. 

Both output and feature selection of random forest models are easy to interpret, and they can easily handle outliers and nonlinear data.

![image.png](attachment:image.png)

An example of a random forest with three decision trees making a prediction based on a series of true and false questions.



**SKILL DRILL**
Take a moment to consider a few possible answers to the following question:

- If random forest models are fairly robust and clear, why would you want to replace them with a neural network?

The answer depends on the type and complexity of the entire dataset. 

- First and foremost, random forest models will only handle tabular data, so data such as images or natural language data cannot be used in a random forest without heavy modifications to the data. 
- Neural networks can handle all sorts of data types and structures in raw format or with general transformations (such as converting categorical data).

In addition, each model handles input data differently. 

- Random forest models are dependent on each weak learner being trained on a subset of the input data. Once each weak learner is trained, the random forest model predicts the classification based on a consensus of the weak learners. 
- In contrast, deep learning models evaluate input data within a single neuron, as well as across multiple neurons and layers.
- As a result, the deep learning model might be able to identify variability in a dataset that a random forest model could miss. 
- However, a random forest model with a sufficient number of estimators and tree depth should be able to perform at a similar capacity to most deep learning models.
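The idea that each weak learner sees only a subset of the input data can be sketched with bootstrap sampling, where rows are drawn with replacement. This is a simplified illustration using only Python’s standard library; the row IDs and the `bootstrap_sample` helper are hypothetical and not part of the lesson’s notebook.

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, so some repeat and some are left out."""
    return [rng.choice(rows) for _ in rows]

rng = random.Random(42)          # fixed seed so the sketch is repeatable
rows = list(range(10))           # stand-in for ten training rows

# One resampled training set per (hypothetical) tree
samples = [bootstrap_sample(rows, rng) for _ in range(3)]
for i, sample in enumerate(samples, start=1):
    left_out = sorted(set(rows) - set(sample))
    print(f"tree {i}: trained on {sorted(sample)}, never saw {left_out}")
```

Because every tree is trained on a slightly different view of the data, the trees make different mistakes, and the consensus tends to be more stable than any single tree.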

To compare the implementation and performance of a random forest model versus a deep learning model, we’ll train and evaluate both models on the same data. 

This time, we’ll use a dataset that has been adapted from bank loan data, with more than 36,000 rows and 16 feature columns. 

From this dataset, we want to build a classifier that can predict whether or not a loan will be paid, given the borrower’s current loan status and metrics.

First, we’ll download the bank loan status dataset (loan_status.csv) and place it in the folder where we’ll create a new Jupyter Notebook. 

Next, we’ll make a new Jupyter Notebook and name it “RandomForest_DeepLearning” (or something similar). 

This will help us easily locate the comparison example at another time. 

Once we have created our notebook and placed the dataset into the corresponding folder, we’ll start by importing our libraries and reading in the dataset. 

Copy the following code into the notebook and run it:


In [1]:

# Import our dependencies
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
import tensorflow as tf

# Import our input dataset
loans_df = pd.read_csv('loan_status.csv')
loans_df.head()


Unnamed: 0,Loan_Status,Current_Loan_Amount,Term,Credit_Score,Annual_Income,Years_in_current_job,Home_Ownership,Purpose,Monthly_Debt,Years_of_Credit_History,Months_since_last_delinquent,Number_of_Open_Accounts,Number_of_Credit_Problems,Current_Credit_Balance,Maximum_Open_Credit,Bankruptcies,Tax_Liens
0,Fully_Paid,99999999,Short_Term,741.0,2231892.0,8_years,Own_Home,Debt_Consolidation,29200.53,14.9,29.0,18,1,297996,750090.0,0.0,0.0
1,Fully_Paid,217646,Short_Term,730.0,1184194.0,<_1_year,Home_Mortgage,Debt_Consolidation,10855.08,19.6,10.0,13,1,122170,272052.0,1.0,0.0
2,Fully_Paid,548746,Short_Term,678.0,2559110.0,2_years,Rent,Debt_Consolidation,18660.28,22.6,33.0,4,0,437171,555038.0,0.0,0.0
3,Fully_Paid,99999999,Short_Term,728.0,714628.0,3_years,Rent,Debt_Consolidation,11851.06,16.0,76.0,16,0,203965,289784.0,0.0,0.0
4,Fully_Paid,99999999,Short_Term,740.0,776188.0,<_1_year,Own_Home,Debt_Consolidation,11578.22,8.5,25.0,6,0,134083,220220.0,0.0,0.0


The DataFrame shows columns of bank loan data, beginning with Loan_Status, Current_Loan_Amount, Term, Credit_Score, Annual_Income, Years_in_current_job, and Home_Ownership.

Because both Scikit-Learn’s RandomForestClassifier class and TensorFlow’s Sequential class require numeric, preprocessed input, we can perform our preprocessing steps on all of the data. Scaling is not strictly required by tree-based models, but it does not hurt them, so there is no need to keep track of separate scaled and unscaled data. 

For our first preprocessing workflow, let’s encode our categorical variables using Scikit-Learn’s OneHotEncoder class.

First, we must make sure that none of our categorical variables require bucketing. 
To check this, let’s get the column names of categorical variables and check their number of unique values. 

Add the following code to the notebook and run it:




In [2]:
# Generate our categorical variable list
loans_cat = loans_df.dtypes[loans_df.dtypes == "object"].index.tolist()

# Check the number of unique values in each column
loans_df[loans_cat].nunique()


Loan_Status              2
Term                     2
Years_in_current_job    11
Home_Ownership           4
Purpose                  7
dtype: int64

Check the number of unique values in the bank loan data.

Looking at the number of unique values in our categorical variables, the “Years_in_current_job” column has 11 unique values. Therefore, we should check the number of data points for each unique value to find out whether any categories can be bucketed together. Again, add the following code to the notebook and run it:



In [3]:
# Check the unique value counts to see if binning is required
loans_df.Years_in_current_job.value_counts()



10+_years    13149
2_years       3225
3_years       2997
<_1_year      2699
5_years       2487
4_years       2286
1_year        2247
6_years       2109
7_years       2082
8_years       1675
9_years       1467
Name: Years_in_current_job, dtype: int64

Check if any categorical variables can be bucketed together.

Looking at the number of data points for each unique value, all of the categorical values have a substantial number of data points. In this case, we have reason to leave the “Years_in_current_job” column alone because we don’t want to bucket common values together and cause confusion in the model.
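For contrast, the sketch below shows what bucketing could look like if some categories were rare. It is a minimal plain-Python illustration; the column values, the `bucket_rare` helper, and the cutoff of 3 are made up for the example and are not taken from the loan dataset.

```python
from collections import Counter

def bucket_rare(values, min_count, other_label="Other"):
    """Replace categories that occur fewer than min_count times with other_label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]

# Hypothetical column with two rare categories
purpose = ["Debt"] * 6 + ["Car"] * 4 + ["Vacation"] * 1 + ["Wedding"] * 1
bucketed = bucket_rare(purpose, min_count=3)
print(Counter(bucketed))
```

In the loan data, every value of “Years_in_current_job” is well populated, so a step like this would only blur distinctions the model could otherwise use.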

Since all of the categorical variables are ready for encoding, we can add and run the following code to the notebook:


In [4]:

# Create a OneHotEncoder instance
# (scikit-learn 1.2+ uses sparse_output; older releases used sparse=False)
enc = OneHotEncoder(sparse_output=False)

# Fit and transform the OneHotEncoder using the categorical variable list
encode_df = pd.DataFrame(enc.fit_transform(loans_df[loans_cat]))

# Add the encoded variable names to the DataFrame
# (get_feature_names_out replaces the removed get_feature_names)
encode_df.columns = enc.get_feature_names_out(loans_cat)
encode_df.head()


Unnamed: 0,Loan_Status_Fully_Paid,Loan_Status_Not_Paid,Term_Long_Term,Term_Short_Term,Years_in_current_job_10+_years,Years_in_current_job_1_year,Years_in_current_job_2_years,Years_in_current_job_3_years,Years_in_current_job_4_years,Years_in_current_job_5_years,...,Home_Ownership_Home_Mortgage,Home_Ownership_Own_Home,Home_Ownership_Rent,Purpose_Business_Loan,Purpose_Buy_House,Purpose_Buy_a_Car,Purpose_Debt_Consolidation,Purpose_Home_Improvements,Purpose_Medical_Bills,Purpose_Other
0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
3,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
4,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0


The DataFrame shows multiple columns of encoded loan data such as: Loan_Status_Fully_Paid, Loan_Status_Not_Paid, Term_Long_Term, and Term_Short_Term.
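To make the encoding itself concrete, here is a small plain-Python sketch of one-hot encoding for a single column. It is not how Scikit-Learn implements OneHotEncoder; the `one_hot` helper and the three sample values are assumptions chosen to mirror the Term column.

```python
def one_hot(values):
    """Return (categories, rows) with one 0.0/1.0 column per category."""
    categories = sorted(set(values))
    rows = [[1.0 if v == c else 0.0 for c in categories] for v in values]
    return categories, rows

term = ["Short_Term", "Long_Term", "Short_Term"]   # hypothetical sample values
categories, rows = one_hot(term)

# Prefix each category with the column name, as in the encoded DataFrame
columns = [f"Term_{c}" for c in categories]
print(columns)
for row in rows:
    print(row)
```

Each row contains exactly one 1.0, marking which category that row belonged to.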

Now that our categorical variables have been encoded, we need to merge them back into our original data frame and remove the unencoded columns. To do this, add and run the following code in the notebook:


In [5]:

# Merge one-hot encoded features and drop the originals
loans_df = loans_df.merge(encode_df,left_index=True, right_index=True)
loans_df = loans_df.drop(columns=loans_cat)
loans_df.head()


Unnamed: 0,Current_Loan_Amount,Credit_Score,Annual_Income,Monthly_Debt,Years_of_Credit_History,Months_since_last_delinquent,Number_of_Open_Accounts,Number_of_Credit_Problems,Current_Credit_Balance,Maximum_Open_Credit,...,Home_Ownership_Home_Mortgage,Home_Ownership_Own_Home,Home_Ownership_Rent,Purpose_Business_Loan,Purpose_Buy_House,Purpose_Buy_a_Car,Purpose_Debt_Consolidation,Purpose_Home_Improvements,Purpose_Medical_Bills,Purpose_Other
0,99999999,741.0,2231892.0,29200.53,14.9,29.0,18,1,297996,750090.0,...,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1,217646,730.0,1184194.0,10855.08,19.6,10.0,13,1,122170,272052.0,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2,548746,678.0,2559110.0,18660.28,22.6,33.0,4,0,437171,555038.0,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
3,99999999,728.0,714628.0,11851.06,16.0,76.0,16,0,203965,289784.0,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
4,99999999,740.0,776188.0,11578.22,8.5,25.0,6,0,134083,220220.0,...,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0


Use merge and drop to replace unencoded categorical variables in the DataFrame.

Next, we need to standardize our numerical variables using Scikit-Learn’s StandardScaler class. 

As before, we must split our data into training and testing sets before standardization so that the testing values are not incorporated into the scale. 

To perform the training/test split and standardize our numerical variables, add and run the following code in the notebook:


In [6]:

# Remove loan status target from features data
y = loans_df.Loan_Status_Fully_Paid
X = loans_df.drop(columns=["Loan_Status_Fully_Paid","Loan_Status_Not_Paid"])

# Split training/test datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)

# Create a StandardScaler instance
scaler = StandardScaler()

# Fit the StandardScaler
X_scaler = scaler.fit(X_train)

# Scale the data
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)
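To see why the scaler is fit on the training data only, the following sketch standardizes a single column by hand. It assumes the usual z-score formula with the population standard deviation; the credit scores and the helper names are hypothetical.

```python
from statistics import mean, pstdev

def fit_scaler(column):
    """Learn the mean and population standard deviation from training values."""
    return mean(column), pstdev(column)

def transform(column, mu, sigma):
    """Apply z = (x - mean) / std using previously learned statistics."""
    return [(x - mu) / sigma for x in column]

train_scores = [700.0, 720.0, 740.0]   # hypothetical training credit scores
test_scores = [730.0]                  # hypothetical test credit score

# The statistics come from the training values only...
mu, sigma = fit_scaler(train_scores)

# ...and are then reused, unchanged, for both sets.
train_scaled = transform(train_scores, mu, sigma)
test_scaled = transform(test_scores, mu, sigma)
print(train_scaled, test_scaled)
```

The test value is scaled with the training mean and standard deviation, so nothing about the test set leaks into the transformation.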


After standardizing variables in both the training and testing data, our dataset is ready for both models. First, we’ll train and evaluate our random forest classifier.

**REWIND**
Random forest models can be built using Scikit-Learn’s RandomForestClassifier class in the ensemble module.


For our purposes, we’ll use a random forest classifier with the n_estimators parameter set to 128. As a rule of thumb, adding estimators beyond roughly this number tends to increase training time more than it improves accuracy, so 128 is a reasonable upper bound for most models. 
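The diminishing return from adding estimators can be illustrated with an idealized calculation. The sketch below assumes every tree votes independently and is correct 60% of the time; both assumptions are hypothetical, and real trees are correlated, so actual gains usually flatten sooner.

```python
from math import comb

def majority_vote_accuracy(n_trees, p_correct):
    """Probability that more than half of n_trees independent voters are right."""
    needed = n_trees // 2 + 1
    return sum(
        comb(n_trees, k) * p_correct**k * (1 - p_correct) ** (n_trees - k)
        for k in range(needed, n_trees + 1)
    )

# Odd tree counts avoid tied votes in this simplified model
for n in (1, 9, 33, 129):
    print(n, round(majority_vote_accuracy(n, 0.6), 4))
```

Under these assumptions the accuracy climbs quickly at first and then approaches its ceiling, so each additional tree contributes less than the one before.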

To create our random forest classifier model and test the performance, add and run the following code:


In [7]:

# Create a random forest classifier.
rf_model = RandomForestClassifier(n_estimators=128, random_state=78)

# Fitting the model
rf_model = rf_model.fit(X_train_scaled, y_train)

# Evaluate the model
y_pred = rf_model.predict(X_test_scaled)
print(f" Random forest predictive accuracy: {accuracy_score(y_test,y_pred):.3f}")




 Random forest predictive accuracy: 0.849
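The accuracy figure is simply the fraction of test rows whose predicted class matches the true class. A minimal sketch of that calculation, using made-up labels rather than the loan data:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)

y_true_demo = [1, 1, 0, 1, 0]   # hypothetical true labels
y_pred_demo = [1, 0, 0, 1, 1]   # hypothetical predictions
print(f"{accuracy(y_true_demo, y_pred_demo):.3f}")
```

Accuracy treats both kinds of error the same way, which is worth keeping in mind when one class is much more common than the other.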



Next, we need to build, compile, and evaluate our deep learning model. Again, we’ll use our typical binary classifier parameters:

- Our first hidden layer will have an input_dim equal to the number of input features (computed from the scaled training data in the code), 24 neuron units, and will use the relu activation function.
- Our second hidden layer will have 12 neuron units and will also use the relu activation function.
- The loss function should be binary_crossentropy, using the adam optimizer.
- Our model should report accuracy as an additional scoring metric and train for a maximum of 50 epochs.
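Before building the model with Keras, it may help to see what a forward pass through this architecture computes. The sketch below is a plain-Python approximation with random, untrained weights; the five-feature input row, the weight range, and the helper names are all hypothetical.

```python
import math
import random

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(w . x + b) for each neuron."""
    return [
        activation(sum(w * x for w, x in zip(neuron_w, inputs)) + b)
        for neuron_w, b in zip(weights, biases)
    ]

def random_layer(n_in, n_out, rng):
    """Hypothetical untrained weights; a real model learns these during fit."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

rng = random.Random(0)
n_features = 5                       # small stand-in for the real feature count
layer1 = random_layer(n_features, 24, rng)
layer2 = random_layer(24, 12, rng)
output = random_layer(12, 1, rng)

x = [0.1, -1.2, 0.7, 0.0, 2.3]       # one hypothetical scaled input row
h1 = dense(x, *layer1, relu)         # first hidden layer: 24 units
h2 = dense(h1, *layer2, relu)        # second hidden layer: 12 units
probability = dense(h2, *output, sigmoid)[0]
print(len(h1), len(h2), probability)
```

The sigmoid output is a value between 0 and 1, which the binary classifier interprets as the probability of the positive class.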

To build and evaluate our deep learning model, add the following code to the notebook and run it:


In [8]:
import numpy  # converts the pandas Series targets to arrays for Keras
# Define the model - deep neural net
number_input_features = len(X_train_scaled[0])
hidden_nodes_layer1 =  24
hidden_nodes_layer2 = 12

nn = tf.keras.models.Sequential()

# First hidden layer
nn.add(
    tf.keras.layers.Dense(units=hidden_nodes_layer1, input_dim=number_input_features, activation="relu")
)

# Second hidden layer
nn.add(tf.keras.layers.Dense(units=hidden_nodes_layer2, activation="relu"))


# Output layer
nn.add(tf.keras.layers.Dense(units=1, activation="sigmoid"))

# Compile the Sequential model together and customize metrics
nn.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

# Train the model
fit_model = nn.fit(X_train_scaled, numpy.array(y_train), epochs=50)

# Evaluate the model using the test data
model_loss, model_accuracy = nn.evaluate(X_test_scaled, numpy.array(y_test),verbose=2)
print(f"Loss: {model_loss}, Accuracy: {model_accuracy}")


Train on 27317 samples
Epoch 1/50
...
Epoch 50/50
9106/1 - 0s - loss: 0.4284 - accuracy: 0.8461
Loss: 0.39050867314420373, Accuracy: 0.8461453914642334


Evaluate the deep learning model’s loss and accuracy metrics.
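The two reported metrics can be reproduced by hand on a tiny example. This sketch assumes the standard binary cross-entropy formula and a 0.5 decision threshold for accuracy; the labels and predicted probabilities are made up.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)], with p clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

labels = [1, 0, 1, 1]                # hypothetical true classes
probs = [0.9, 0.2, 0.6, 0.4]         # hypothetical sigmoid outputs

loss = binary_crossentropy(labels, probs)

# Accuracy: treat probabilities of at least 0.5 as the positive class (assumed threshold)
acc = sum((p >= 0.5) == bool(y) for y, p in zip(labels, probs)) / len(labels)
print(f"Loss: {loss:.4f}, Accuracy: {acc:.2f}")
```

Loss penalizes confident mistakes more heavily than hesitant ones, while accuracy only counts whether each prediction landed on the right side of the threshold.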

Comparing the two models’ predictive accuracy, the results are very similar: 0.849 for the random forest and 0.846 for the deep learning model. 

Both the random forest and deep learning models were able to predict correctly whether or not a loan will be repaid over 80% of the time. 

Although their predictive performance was comparable, their implementation and training times were not. The random forest classifier was able to train on the large dataset and predict values in seconds, while the deep learning model required a couple of minutes to train on the tens of thousands of data points. 

In other words, the random forest model is able to achieve comparable predictive accuracy on large tabular data with less code and faster performance. 
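One way to check the training-time difference on your own machine is to wrap each training call in a small timer. The helper below is a hypothetical sketch built on the standard library; the workload shown is a stand-in, not one of the models.

```python
import time

def timed(label, func, *args, **kwargs):
    """Run func once, print how long it took, and return (result, seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.3f} s")
    return result, elapsed

# Stand-in workload; in the notebook this could instead wrap a call such as
# rf_model.fit(X_train_scaled, y_train).
total, seconds = timed("sum of squares", lambda n: sum(i * i for i in range(n)), 100_000)
```

Timings vary with hardware and library versions, so the absolute numbers matter less than the relative gap between the two models.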

The ultimate decision of whether to use a random forest versus a neural network often comes down to the type of data, the time available for training, and personal preference. 

However, if your dataset is tabular, random forest is a great place to start.
