## 17.8.2
### Predict Loan Applications
Jill now asks you to run a random forest model to make classifications. As you have done before, the first step is to prepare the data for the random forest classifier model.

When we imported our dependencies to create the decision tree in the previous example, we used the "tree" module from the sklearn library: from sklearn import tree.

For the random forest model, we'll use the "ensemble" module from the sklearn library. All the remaining dependencies will be the same. In the dependencies, replace from sklearn import tree with from sklearn.ensemble import RandomForestClassifier so that our dependencies look like the following.

    # Initial imports.
    import pandas as pd
    from path import Path
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

Next, read in your loans_data_encoded.csv file from previous exercises.

    # Loading data.
    file_path = Path("../Resources/loans_data_encoded.csv")
    df_loans = pd.read_csv(file_path)
    df_loans.head()

After the data has been loaded, we'll preprocess it just as we did for the decision tree model.

### Preprocess the Data

Now, we're going to walk through the preprocessing steps for the loan applications' encoded data so that we can fit the random forest model to the training set and evaluate it on the testing set.

If you do not quite remember the steps for preprocessing, add the blocks of code in your Jupyter Notebook as follows.

First, we define the features set.

    # Define the features set.
    X = df_loans.copy()
    X = X.drop("bad", axis=1)
    X.head()

Next, we define the target set. Here, we're using the ravel() method, which returns the target column as a flat, one-dimensional array, the same result the values attribute gives for a single column.

    # Define the target set.
    y = df_loans["bad"].ravel()
    y[:5]

Now, we split into the training and testing sets.

    # Splitting into Train and Test sets.
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=78)

Lastly, we can create the StandardScaler instance, fit the scaler with the training set, and scale the data.

    # Creating a StandardScaler instance.
    scaler = StandardScaler()

    # Fitting the Standard Scaler with the training data.
    X_scaler = scaler.fit(X_train)

    # Scaling the data.
    X_train_scaled = X_scaler.transform(X_train)
    X_test_scaled = X_scaler.transform(X_test)
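
The split-then-scale steps can be tried in isolation. The following is a minimal sketch on synthetic data (the array X_demo and the labels y_demo are made up for illustration and are not the loans file). It shows that train_test_split holds out 25% of the rows by default, and that a scaler fitted only on the training rows gives training columns a mean of about 0 and a standard deviation of about 1.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the loan features (hypothetical data, not the real file).
rng = np.random.default_rng(78)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = rng.integers(0, 2, size=200)

# With no test_size given, 25% of the rows go to the testing set.
X_train, X_test, y_train, y_test = train_test_split(X_demo, y_demo, random_state=78)

# Fit on the training rows only, then reuse those means and
# standard deviations to transform both sets.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train.shape, X_test.shape)
print(X_train_scaled.mean(axis=0).round(6))  # approximately 0 for every column
print(X_train_scaled.std(axis=0).round(6))   # approximately 1 for every column
print(X_test_scaled.mean(axis=0).round(2))   # close to 0, but not exactly
```

The testing set is transformed with statistics learned from the training set, so its scaled means are only approximately zero. That is expected: the testing data should not influence the scaler.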

If you were able to do these steps without having to follow along, congratulations!

## 17.8.3
### Fit the Model, Make Predictions, and Evaluate Results
Now that you have prepared the data, you will put the random forest classifier model into practice, then evaluate the results.

Now that we have preprocessed the data into training and testing data for both features and target sets, we can fit the random forest model, make predictions, and evaluate the model.

### Fit the Random Forest Model

Before we fit the random forest model to our X_train_scaled and y_train training data, we'll create a random forest instance using the random forest classifier, RandomForestClassifier().

    # Create a random forest classifier.
    rf_model = RandomForestClassifier(n_estimators=128, random_state=78)

The RandomForestClassifier takes a variety of parameters, but for our purposes we only need the n_estimators and the random_state.

**Note**
Consult the sklearn documentation for additional information about the RandomForestClassifier and the parameters it takes.

The n_estimators parameter sets the number of trees that the algorithm creates. Generally, a higher number makes the predictions stronger and more stable, but it also lengthens the training time. The best practice is to use between 64 and 128 trees, though higher numbers are quite common despite the longer training time. For our purposes, we'll build a forest of 128 trees.
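
To see what n_estimators controls, here is a minimal sketch on synthetic data generated with make_classification (the data and the tree counts 16, 64, and 128 are illustrative assumptions, not the loans data). After fitting, the estimators_ attribute holds one fitted decision tree per estimator, so its length matches n_estimators.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic two-class data (hypothetical, for illustration only).
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=78)
X_train, X_test, y_train, y_test = train_test_split(X_demo, y_demo, random_state=78)

tree_counts = {}
for n_trees in (16, 64, 128):
    model = RandomForestClassifier(n_estimators=n_trees, random_state=78)
    model.fit(X_train, y_train)
    # estimators_ is the list of individual fitted decision trees.
    tree_counts[n_trees] = len(model.estimators_)
    print(n_trees, tree_counts[n_trees], round(model.score(X_test, y_test), 3))
```

On a small dataset like this one, the accuracy printed for each forest size may change very little, which mirrors what happens later in this section when the number of trees is raised.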

After we create the random forest instance, we need to fit the model with our training sets.

    # Fitting the model.
    rf_model = rf_model.fit(X_train_scaled, y_train)

### Make Predictions Using the Testing Data

After fitting the model, we can run the following code to make predictions using the scaled testing data:

    # Making predictions using the testing data.
    predictions = rf_model.predict(X_test_scaled)

The output will be similar to the predictions generated for the decision tree.

The predictions generated by the model.

### Evaluate the Model

After making predictions on the scaled testing data, we analyze how well our random forest model classifies loan applications by using the confusion_matrix.

    # Calculating the confusion matrix.
    cm = confusion_matrix(y_test, predictions)

    # Create a DataFrame from the confusion matrix.
    cm_df = pd.DataFrame(
        cm, index=["Actual 0", "Actual 1"], columns=["Predicted 0", "Predicted 1"])

    cm_df

The confusion matrix in DataFrame format after running the random forest model with 128 trees.
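
As a small illustration of how to read the matrix, the sketch below uses made-up labels (y_true and y_pred are hypothetical, not the loans data). In confusion_matrix, rows are the actual classes and columns are the predicted classes, so for two classes ravel() unpacks the cells in the order TN, FP, FN, TP.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 0 = good loan, 1 = bad loan.
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

cm_demo = confusion_matrix(y_true, y_pred)
print(cm_demo)
# [[1 1]
#  [1 2]]

# Row 0 is "Actual 0", row 1 is "Actual 1".
tn, fp, fn, tp = (int(v) for v in cm_demo.ravel())
print(tn, fp, fn, tp)  # 1 1 1 2
```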

These results are about the same as those of the decision tree model. To try to improve our predictions, let's increase n_estimators to 500. After changing n_estimators to 500 and running all the code again, our confusion matrix DataFrame is about the same as before.

The confusion matrix in DataFrame format after running the random forest model with 500 trees.

Accuracy measures how often the classifier predicts correctly, and is given by the equation (TP + TN) / Total. We can calculate it by running the following code. For this model, our accuracy score is 0.520:

    # Calculating the accuracy score.
    acc_score = accuracy_score(y_test, predictions)
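
The equation can also be checked by hand. This sketch takes the confusion matrix values from the output listed under Actual Code at the end of this section (51, 33, 23, 18) and applies (TP + TN) / Total directly. The variable names are illustrative.

```python
import numpy as np

# Confusion matrix values from the output under "Actual Code".
# Rows are actual classes, columns are predicted classes.
cm_values = np.array([[51, 33],
                      [23, 18]])

correct = np.trace(cm_values)  # TN + TP = 51 + 18 = 69
total = cm_values.sum()        # 51 + 33 + 23 + 18 = 125
accuracy = correct / total

print(accuracy)  # 0.552
```

The result matches the accuracy score of 0.552 printed in that output.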

Lastly, we can print out the above results along with the classification report for the two classes:

    # Displaying results.
    print("Confusion Matrix")
    display(cm_df)
    print(f"Accuracy Score : {acc_score}")
    print("Classification Report")
    print(classification_report(y_test, predictions))

The printout of the confusion matrix, the accuracy score, and the classification report.

From the confusion matrix results, the precision for the bad loan applications is low, which indicates a large number of false positives and therefore an unreliable positive classification. The recall is also low for the bad loan applications, which indicates a large number of false negatives. The F1 score is also low (0.33).
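
To make the link between the confusion matrix and the classification report concrete, the sketch below recomputes precision, recall, and F1 for the bad loan class (class 1). It uses the confusion matrix values from the output listed under Actual Code at the end of this section, so the figures are those of that run. The variable names are illustrative.

```python
# Cells of the confusion matrix from the output under "Actual Code",
# treating class 1 (bad loan) as the positive class.
tn, fp, fn, tp = 51, 33, 23, 18

precision = tp / (tp + fp)  # share of predicted bad loans that were actually bad
recall = tp / (tp + fn)     # share of actual bad loans that the model caught
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.35 0.44 0.39
```

These match the row for class 1 in that classification report: 33 false positives pull the precision down, and 23 false negatives pull the recall down.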

In summary, this random forest model is not good at classifying bad loan applications because the model's accuracy, 0.520, and F1 score are low.

### Rank the Importance of Features

One nice byproduct of the random forest algorithm is the ability to rank the features by their importance, which allows us to see which features have the most impact on the decision.

To calculate the feature importance, we can use the feature_importances_ attribute with the following code:

    # Calculate feature importance in the Random Forest model.
    importances = rf_model.feature_importances_
    importances

This code returns an array of scores, one for each feature in the X_test set, that sum to 1.0:

The importance of each feature, as numerical values.
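
The two properties of the array, one score per feature and a total of 1.0, can be verified with a quick sketch. The data here is synthetic (generated with make_classification), and the names X_demo, y_demo, and importances_demo are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with six feature columns (hypothetical, for illustration only).
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=78)

model = RandomForestClassifier(n_estimators=64, random_state=78).fit(X_demo, y_demo)
importances_demo = model.feature_importances_

print(importances_demo.shape)                    # (6,), one score per feature
print(np.isclose(importances_demo.sum(), 1.0))   # True
```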

To pair each importance score with its column name and sort the features by importance, we can modify our code above as follows:

    # We can sort the features by their importance.
    sorted(zip(rf_model.feature_importances_, X.columns), reverse=True)

In the code, zip pairs each importance score with its column name (X.columns), and the sorted function orders those pairs by score. Setting reverse=True sorts them in descending order, so the more important features come first.

Running this code will return the following output:

Each feature is associated with its importance.
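
The sorting step works on plain Python tuples, so it can be tried without a model. This sketch uses four scores rounded from the output listed under Actual Code at the end of this section. The list names scores and names are illustrative.

```python
# Importance scores (rounded) and their column names, in arbitrary order.
scores = [0.0545, 0.4328, 0.0800, 0.3297]
names = ["amount", "age", "term", "month_num"]

# Tuples compare element by element, so putting the score first
# makes sorted() order the pairs by importance.
ranked = sorted(zip(scores, names), reverse=True)

for score, name in ranked:
    print(f"{name}: {score}")
# age: 0.4328
# month_num: 0.3297
# term: 0.08
# amount: 0.0545
```

If the column name came first in each pair, the result would be sorted alphabetically by name instead, which is why the score is placed first in zip.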

Now we can clearly see which features, or columns, of the loan application are more relevant. The age and month_num of the loan application are the most relevant features.

To improve this model, we can drop some of the lower ranked features.
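
One possible way to drop lower-ranked features is sketched below on synthetic data. The cutoff used here, keeping features whose importance is at least the mean importance, is an assumption chosen for illustration and not a rule from this lesson. The column names feature_0 through feature_7 are hypothetical too.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data with 8 columns, only some of them informative (hypothetical).
features, target = make_classification(
    n_samples=400, n_features=8, n_informative=3, random_state=78
)
X_demo = pd.DataFrame(features, columns=[f"feature_{i}" for i in range(8)])
X_train, X_test, y_train, y_test = train_test_split(X_demo, target, random_state=78)

# Fit on all features and rank them.
full_model = RandomForestClassifier(n_estimators=128, random_state=78)
full_model.fit(X_train, y_train)
importances = pd.Series(full_model.feature_importances_, index=X_demo.columns)

# Assumed cutoff: keep features with at least the mean importance.
keep = importances[importances >= importances.mean()].index

# Refit using only the retained columns.
reduced_model = RandomForestClassifier(n_estimators=128, random_state=78)
reduced_model.fit(X_train[keep], y_train)

print(list(keep))
print(full_model.score(X_test, y_test), reduced_model.score(X_test[keep], y_test))
```

Whether the reduced model scores better depends on the data, so it is worth comparing both scores on the testing set before settling on a feature list.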

### Actual Code

    # Initial imports.
    import pandas as pd
    from path import Path
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

    # Loading data.
    file_path = Path("../17-6-1-label_encode/Resources/loans_data_encoded.csv")
    df_loans = pd.read_csv(file_path)
    df_loans.head()

    # Preprocess the Data

    # Define the features set.
    X = df_loans.copy()
    X = X.drop("bad", axis=1)
    X.head()

    # Define the target set.
    y = df_loans["bad"].ravel()
    y[:5]

    # Splitting into Train and Test sets.
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=78)

    # Creating a StandardScaler instance.
    scaler = StandardScaler()

    # Fitting the Standard Scaler with the training data.
    X_scaler = scaler.fit(X_train)

    # Scaling the data.
    X_train_scaled = X_scaler.transform(X_train)
    X_test_scaled = X_scaler.transform(X_test)

    # Create a random forest classifier.
    rf_model = RandomForestClassifier(n_estimators=128, random_state=78)

    # Fitting the model.
    rf_model = rf_model.fit(X_train_scaled, y_train)

    # Making predictions using the testing data.
    predictions = rf_model.predict(X_test_scaled)

    # Evaluate the Model

    # Calculating the confusion matrix.
    cm = confusion_matrix(y_test, predictions)

    # Create a DataFrame from the confusion matrix.
    cm_df = pd.DataFrame(
        cm, index=["Actual 0", "Actual 1"], columns=["Predicted 0", "Predicted 1"])

    cm_df

    # Calculating the accuracy score.
    acc_score = accuracy_score(y_test, predictions)

    # Displaying results.
    print("Confusion Matrix")
    display(cm_df)
    print(f"Accuracy Score : {acc_score}")
    print("Classification Report")
    print(classification_report(y_test, predictions))

    # Rank the Importance of Features

    # Calculate feature importance in the Random Forest model.
    importances = rf_model.feature_importances_
    importances

    # We can sort the features by their importance.
    sorted(zip(rf_model.feature_importances_, X.columns), reverse=True)

Output:

    Confusion Matrix

              Predicted 0  Predicted 1
    Actual 0           51           33
    Actual 1           23           18

    Accuracy Score : 0.552
    Classification Report
                  precision    recall  f1-score   support

               0       0.69      0.61      0.65        84
               1       0.35      0.44      0.39        41

        accuracy                           0.55       125
       macro avg       0.52      0.52      0.52       125
    weighted avg       0.58      0.55      0.56       125

    [(0.43280447750315343, 'age'),
     (0.32973986443922343, 'month_num'),
     (0.07997292251445517, 'term'),
     (0.05454782107242418, 'amount'),
     (0.021510631303272416, 'education_college'),
     (0.021102188881175144, 'education_High School or Below'),
     (0.01985561654170213, 'gender_male'),
     (0.018878176828577283, 'gender_female'),
     (0.018871722006693077, 'education_Bachelor'),
     (0.002716578909323729, 'education_Master or Above')]