In [138]:
# Import the modules
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

---

## Split the Data into Training and Testing Sets

### Step 1: Read the `lending_data.csv` data from the `Resources` folder into a Pandas DataFrame.

In [139]:
# Read the CSV file from the Resources folder into a Pandas DataFrame
credit_df = pd.read_csv(Path('Resources/lending_data.csv'))

# Review the DataFrame
credit_df.head()

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt,loan_status
0,10700.0,7.672,52800,0.431818,5,1,22800,0
1,8400.0,6.692,43600,0.311927,3,0,13600,0
2,9000.0,6.963,46100,0.349241,3,0,16100,0
3,10700.0,7.664,52700,0.43074,5,1,22700,0
4,10800.0,7.698,53000,0.433962,5,1,23000,0


In [140]:
# show number of rows for each category in loan_status
credit_df['loan_status'].value_counts()

loan_status
0    75036
1     2500
Name: count, dtype: int64
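
The counts above show a strong class imbalance. As a quick sketch, the share of each class can be computed directly from those counts. The `counts` Series below is rebuilt by hand from the output above rather than read from the CSV, and the variable names are illustrative.

```python
import pandas as pd

# Class counts copied from the value_counts() output above
counts = pd.Series({0: 75036, 1: 2500}, name="count")

# Share of each class in the full dataset
proportions = counts / counts.sum()
minority_share = proportions.loc[1]

print(proportions.round(4))
```

In the notebook itself, `credit_df['loan_status'].value_counts(normalize=True)` should give the same proportions without rebuilding the counts.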

### Step 2: Create the labels set (`y`) from the `loan_status` column, and then create the features (`X`) DataFrame from the remaining columns.

In [141]:
# Separate the data into labels and features
# Separate the y variable, the labels
y = credit_df['loan_status']

# Separate the X variable, the features
X = credit_df.drop(columns='loan_status')

In [142]:
# Review the y variable Series
print(f"Labels: {y[:10]}")

Labels: 0    0
1    0
2    0
3    0
4    0
5    0
6    0
7    0
8    0
9    0
Name: loan_status, dtype: int64


In [143]:
# Review the X variable DataFrame
print(f"Data: {X[:10]}")

Data:    loan_size  interest_rate  borrower_income  debt_to_income  num_of_accounts  \
0    10700.0          7.672            52800        0.431818                5   
1     8400.0          6.692            43600        0.311927                3   
2     9000.0          6.963            46100        0.349241                3   
3    10700.0          7.664            52700        0.430740                5   
4    10800.0          7.698            53000        0.433962                5   
5    10100.0          7.438            50600        0.407115                4   
6    10300.0          7.490            51100        0.412916                4   
7     8800.0          6.857            45100        0.334812                3   
8     9300.0          7.096            47400        0.367089                3   
9     9700.0          7.248            48800        0.385246                4   

   derogatory_marks  total_debt  
0                 1       22800  
1                 0       13600  


### Step 3: Split the data into training and testing datasets by using `train_test_split`.

In [144]:
# Import the train_test_split function
from sklearn.model_selection import train_test_split

# Split the data using train_test_split
# Assign a random_state of 1 to the function
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    random_state=1, 
                                                    stratify=y)
X_train.shape

(58152, 7)

In [145]:
X_test.shape

(19384, 7)
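
The shapes above are consistent with the default `test_size` of 0.25 (19,384 of 77,536 rows). The sketch below uses small, made-up data (`X_demo`, `y_demo` are hypothetical, not the lending data) to illustrate how `stratify` keeps the class ratio roughly equal across the two splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1,000 rows, 4% positive class
X_demo = np.arange(1000).reshape(-1, 1)
y_demo = np.array([1] * 40 + [0] * 960)

# Same call pattern as above; test_size is left at its default of 0.25
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=1, stratify=y_demo
)

# Split sizes, then the positive-class share overall, in train, and in test
print(X_tr.shape, X_te.shape)
print(y_demo.mean(), y_tr.mean(), y_te.mean())
```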

---

## Create a Logistic Regression Model with the Original Data

### Step 1: Fit a logistic regression model by using the training data (`X_train` and `y_train`).

In [146]:
# Import the LogisticRegression module from SKLearn
from sklearn.linear_model import LogisticRegression

# Instantiate the Logistic Regression model
# Assign a random_state parameter of 1 to the model
classifier = LogisticRegression(solver='lbfgs', random_state=1)


# Fit the model using training data
classifier.fit(X_train, y_train)

### Step 2: Save the predictions on the testing data labels by using the testing feature data (`X_test`) and the fitted model.

In [147]:
# Make a prediction using the testing data
predictions = classifier.predict(X_test)
pd.DataFrame({"Prediction": predictions, "Actual": y_test})

Unnamed: 0,Prediction,Actual
36831,0,0
75818,0,1
36563,0,0
13237,0,0
43292,0,0
...,...,...
38069,0,0
36892,0,0
5035,0,0
40821,0,0


### Step 3: Evaluate the model’s performance by doing the following:

* Generate a confusion matrix.

* Print the classification report.

In [148]:
# Generate a confusion matrix for the model
confusion_matrix(y_test, predictions)

array([[18679,    80],
       [   67,   558]], dtype=int64)

In [149]:
# Calculating the confusion matrix
cm = confusion_matrix(y_test, predictions)
cm_df = pd.DataFrame(
    cm, index=["Actual Healthy Loan", "Actual High Risk Loan"], columns=["Predicted Healthy Loan", "Predicted High Risk Loan"]
)
display(cm_df)
# Print the classification report for the model
target_names = ["Healthy loan", "High risk loan"]
print(classification_report(y_test, predictions, target_names=target_names))
reg_acc_score = accuracy_score(y_test, predictions)
print(f"Accuracy Score: {reg_acc_score}")

Unnamed: 0,Predicted Healthy Loan,Predicted High Risk Loan
Actual Healthy Loan,18679,80
Actual High Risk Loan,67,558


                precision    recall  f1-score   support

  Healthy loan       1.00      1.00      1.00     18759
High risk loan       0.87      0.89      0.88       625

      accuracy                           0.99     19384
     macro avg       0.94      0.94      0.94     19384
  weighted avg       0.99      0.99      0.99     19384

Accuracy Score: 0.9924164259182832


### Step 4: Answer the following question.

**Question:** How well does the logistic regression model predict both the `0` (healthy loan) and `1` (high-risk loan) labels?

**Answer:** Overall, the model is performing very well, especially in identifying the healthy loans. It's also doing a good job with the high-risk loans, given the imbalance in the dataset. 

Precision for healthy loans rounds to 1.00: nearly every loan the model labeled healthy was in fact healthy (18,679 of 18,746). Precision for high-risk loans is 0.87, meaning 87% of the loans the model flagged as high-risk were actually high-risk, while the remaining 13% of those flags were healthy loans mislabeled as high-risk [false positives]. Recall for high-risk loans is 0.89, meaning the model correctly identified 89% of all actual high-risk loans and missed 11% [false negatives].
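
As a rough check, the high-risk precision and recall can be recomputed by hand from the confusion matrix. The sketch below copies the four counts from the output above; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the logistic regression output above
# Rows: actual (healthy, high-risk); columns: predicted (healthy, high-risk)
cm = np.array([[18679, 80],
               [67, 558]])
tn, fp, fn, tp = cm.ravel()

# Precision: of the loans flagged high-risk, how many really were
precision_high_risk = tp / (tp + fp)
# Recall: of the actual high-risk loans, how many were flagged
recall_high_risk = tp / (tp + fn)
# Accuracy: share of all loans classified correctly
accuracy = (tp + tn) / cm.sum()

print(f"Precision (high-risk): {precision_high_risk:.2f}")
print(f"Recall (high-risk):    {recall_high_risk:.2f}")
print(f"Accuracy:              {accuracy:.4f}")
```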

---

## Create a Random Forest Model with the Original Data

The code used here is largely sourced from the random forest loan default activity.

In [150]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

In [151]:
# Define features set
X = credit_df.copy()
X.drop('loan_status', axis=1, inplace=True)
X.head()

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt
0,10700.0,7.672,52800,0.431818,5,1,22800
1,8400.0,6.692,43600,0.311927,3,0,13600
2,9000.0,6.963,46100,0.349241,3,0,16100
3,10700.0,7.664,52700,0.43074,5,1,22700
4,10800.0,7.698,53000,0.433962,5,1,23000


Create the target vector by assigning the values of the `loan_status` column from the `credit_df` DataFrame.

In [152]:
# Define target vector
y = credit_df["loan_status"].values.reshape(-1, 1)
y[:5]

array([[0],
       [0],
       [0],
       [0],
       [0]], dtype=int64)

Split the data into training and testing sets.

In [153]:
# Splitting into Train and Test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=78)

Use the `StandardScaler` to scale the features data.

In [154]:
# Create the StandardScaler instance
scaler = StandardScaler()

In [155]:
# Fit the Standard Scaler with the training data
X_scaler = scaler.fit(X_train)

In [156]:
# Scale the training and testing data
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)
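
Fitting the scaler on the training data only, then applying it to both sets, keeps test-set statistics out of the preprocessing step. The sketch below illustrates the effect on made-up data (all `*_demo` names are hypothetical): the scaled training features end up with mean 0 and standard deviation 1, while the test features are only approximately standardized.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy features drawn from the same distribution
rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
X_test_demo = rng.normal(loc=50.0, scale=10.0, size=(50, 3))

# Fit on the training data only, then transform both sets
scaler_demo = StandardScaler().fit(X_train_demo)
X_train_demo_scaled = scaler_demo.transform(X_train_demo)
X_test_demo_scaled = scaler_demo.transform(X_test_demo)

# Training columns: mean ~0 and std ~1; test columns: close, but not exact
print(X_train_demo_scaled.mean(axis=0).round(6))
print(X_train_demo_scaled.std(axis=0).round(6))
print(X_test_demo_scaled.mean(axis=0).round(3))
```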

## Fitting the Random Forest Model

In [157]:
# Create the random forest classifier instance
rf_model = RandomForestClassifier(n_estimators=500, random_state=78)

In [158]:
# Fit the model, using .ravel() to flatten "y_train" into a 1-D array
rf_model = rf_model.fit(X_train_scaled, y_train.ravel())

## Making Predictions Using the Random Forest Model

Validate the trained model by predicting loan defaults using the testing data (`X_test_scaled`).

In [159]:
# Making predictions using the testing data
predictions = rf_model.predict(X_test_scaled)

## Model Evaluation

Evaluate the model's results by using `sklearn` to calculate the confusion matrix and the accuracy score, and to generate the classification report.

In [160]:
# Calculating the confusion matrix
cm = confusion_matrix(y_test, predictions)
cm_df = pd.DataFrame(
    cm, index=["Actual Healthy Loan", "Actual High Risk Loan"], columns=["Predicted Healthy Loan", "Predicted High Risk Loan"]
)

# Calculating the accuracy score
acc_score = accuracy_score(y_test, predictions)

In [161]:
# Displaying results
print("Confusion Matrix")
display(cm_df)
print(f"Accuracy Score : {acc_score}")
print("Classification Report")
print(classification_report(y_test, predictions, target_names=target_names))

Confusion Matrix


Unnamed: 0,Predicted Healthy Loan,Predicted High Risk Loan
Actual Healthy Loan,18691,93
Actual High Risk Loan,70,530


Accuracy Score : 0.9915910028889806
Classification Report
                precision    recall  f1-score   support

  Healthy loan       1.00      1.00      1.00     18784
High risk loan       0.85      0.88      0.87       600

      accuracy                           0.99     19384
     macro avg       0.92      0.94      0.93     19384
  weighted avg       0.99      0.99      0.99     19384



## Feature Importance

In [162]:
# Get the feature importance array
importances = rf_model.feature_importances_
# List the most important features (up to 10), highest first
importances_sorted = sorted(zip(importances, X.columns), reverse=True)
importances_sorted[:10]

[(0.29703998492312944, 'interest_rate'),
 (0.186273936116309, 'borrower_income'),
 (0.17038411491427138, 'debt_to_income'),
 (0.16924829066855973, 'total_debt'),
 (0.11646576782842037, 'loan_size'),
 (0.06046908207975322, 'num_of_accounts'),
 (0.00011882346955686954, 'derogatory_marks')]
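
The importances can be easier to read as a labeled Series with a running total. The sketch below copies the values from the output above by hand; `importance_series` and `cumulative` are illustrative names.

```python
import pandas as pd

# Importances copied from the output above
importance_series = pd.Series({
    "interest_rate": 0.29703998492312944,
    "borrower_income": 0.186273936116309,
    "debt_to_income": 0.17038411491427138,
    "total_debt": 0.16924829066855973,
    "loan_size": 0.11646576782842037,
    "num_of_accounts": 0.06046908207975322,
    "derogatory_marks": 0.00011882346955686954,
}).sort_values(ascending=False)

# Running total shows how much of the importance the top features cover
cumulative = importance_series.cumsum()

print(importance_series.round(4))
print(cumulative.round(4))
```

In the notebook, an equivalent Series could be built with `pd.Series(rf_model.feature_importances_, index=X.columns)`.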

## Model Comparison

The random forest model exhibits strong performance when distinguishing between healthy and high-risk loans, with an overall accuracy score of 99.16%. This is only very slightly lower than the accuracy achieved by the logistic regression model (99.24%). Both models are therefore highly accurate in predicting the loan status.


PRECISION: 
For "High-risk loans", the Random Forest model has a precision of 85%, slightly lower than the Logistic Regression model (87%). This means the Random Forest model has a slightly higher rate of falsely labeling healthy loans as high-risk.

RECALL:

The recall for high-risk loans in the Random Forest model is 88%, which is very similar to the recall of 89% in the Logistic Regression model. This suggests that both models are almost equally good at identifying the actual high-risk loans.


F1-SCORE:

The F1-score for high-risk loans in the Random Forest model is 87%, slightly lower than the 88% in the Logistic Regression model. The F1-score gives a balanced measure of precision and recall, and in this case, the difference is marginal.

FALSE POSITIVES/FALSE NEGATIVES:

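The Random Forest model produced 93 false positives and 70 false negatives, compared with 80 and 67, respectively, for the Logistic Regression model. Because the two models were evaluated on different test splits (`random_state=1` with stratification versus `random_state=78` without), these counts are not strictly like-for-like. For a more risk-averse bank with enough applicants not to fear losing customers, the Logistic Regression model might be the better choice, since its confusion matrix indicates slightly fewer Type II errors (false negatives).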
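
Because the two test sets contain different numbers of high-risk loans (625 versus 600), rates can be easier to compare than raw counts. The sketch below derives false positive and false negative rates from the two confusion matrices shown earlier; `error_rates` is a hypothetical helper, not part of `sklearn`.

```python
import numpy as np

def error_rates(cm):
    """Return false positive and false negative rates for a 2x2 confusion matrix.

    Rows are actual classes and columns are predicted classes,
    ordered (healthy, high-risk), as in the outputs above.
    """
    tn, fp, fn, tp = np.asarray(cm).ravel()
    return {
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
    }

# Confusion matrices copied from the outputs above
lr_rates = error_rates([[18679, 80], [67, 558]])
rf_rates = error_rates([[18691, 93], [70, 530]])

print("Logistic Regression:", lr_rates)
print("Random Forest:      ", rf_rates)
```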

SCALING AND COMPLEXITY:

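The Random Forest workflow in this notebook added a feature-scaling step, and the model (500 trees) is generally more complex and computationally intensive than Logistic Regression. Tree-based models such as random forests are largely insensitive to feature scale, so the scaling step is optional here rather than required. Depending on the available computational resources and the need for interpretability, Logistic Regression might be preferred.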

CONCLUSION:

Both models exhibit strong and comparable performance in classifying loan status. The feature importance analysis for the Random Forest model reveals that 'interest_rate', 'borrower_income', 'debt_to_income', 'total_debt', and 'loan_size' are the strongest predictors of loan status, while 'derogatory_marks' contributes almost nothing. This helps to understand which features are driving the model's predictions and can inform risk management strategies and policy development for the bank.

The choice between Logistic Regression and Random Forest may depend on specific business requirements, such as the importance of minimizing false positives/negatives, computational efficiency, model interpretability, and gaining insights into feature importance. Given the minor differences in performance metrics, if computational efficiency, model simplicity, and interpretability are priorities, Logistic Regression might be a more suitable choice.

However, if the ability to handle non-linear relationships, generalize well, and provide insights into the relative importance of different features is more critical, the Random Forest model may have the edge. The detailed feature importance provided by the Random Forest model can be particularly useful for refining and optimizing the loan approval process and for identifying areas where additional data collection or feature engineering may improve model performance.

In conclusion, the decision to use either model should be aligned with the specific goals and constraints of the financial institution, with consideration given to the trade-offs between interpretability, complexity, accuracy, and the value of understanding feature influence on predictions.