In [1]:
# Import the modules
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.metrics import balanced_accuracy_score, confusion_matrix, classification_report

---

## Split the Data into Training and Testing Sets

### Step 1: Read the `lending_data.csv` data from the `Resources` folder into a Pandas DataFrame.

In [2]:
# Read the CSV file from the Resources folder into a Pandas DataFrame
# Define a path object for the 'Resources' directory
resources_path = Path('Resources')

# Define a path object for 'lending_data.csv' within the 'Resources' directory
data_file_path = resources_path / 'lending_data.csv'

# Check if the file exists
if data_file_path.exists():
    print(f"The file {data_file_path} exists.\n")
else:
    print(f"The file {data_file_path} does not exist.\n")

# Assuming 'lending_data.csv' is in the 'Resources' folder relative to your script
data = pd.read_csv(data_file_path)
    
# Review the DataFrame
# Display the first few rows to confirm it's loaded correctly
data.head()

The file Resources\lending_data.csv exists.



|   | loan_size | interest_rate | borrower_income | debt_to_income | num_of_accounts | derogatory_marks | total_debt | loan_status |
|---|---|---|---|---|---|---|---|---|
| 0 | 10700.0 | 7.672 | 52800 | 0.431818 | 5 | 1 | 22800 | 0 |
| 1 | 8400.0 | 6.692 | 43600 | 0.311927 | 3 | 0 | 13600 | 0 |
| 2 | 9000.0 | 6.963 | 46100 | 0.349241 | 3 | 0 | 16100 | 0 |
| 3 | 10700.0 | 7.664 | 52700 | 0.43074 | 5 | 1 | 22700 | 0 |
| 4 | 10800.0 | 7.698 | 53000 | 0.433962 | 5 | 1 | 23000 | 0 |


### Step 2: Create the labels set (`y`) from the `loan_status` column, and then create the features (`X`) DataFrame from the remaining columns.

In [3]:
# Separate the data into labels and features

# Separate the y variable, the labels
# Create the labels set (y)
y = data['loan_status']

# Separate the X variable, the features
# Create the features DataFrame (X) by dropping the 'loan_status' column
X = data.drop('loan_status', axis=1)

In [4]:
# Review the y variable Series
print(y.head())

0    0
1    0
2    0
3    0
4    0
Name: loan_status, dtype: int64


In [5]:
# Review the X variable DataFrame
print(X.head())

   loan_size  interest_rate  borrower_income  debt_to_income  num_of_accounts  \
0    10700.0          7.672            52800        0.431818                5   
1     8400.0          6.692            43600        0.311927                3   
2     9000.0          6.963            46100        0.349241                3   
3    10700.0          7.664            52700        0.430740                5   
4    10800.0          7.698            53000        0.433962                5   

   derogatory_marks  total_debt  
0                 1       22800  
1                 0       13600  
2                 0       16100  
3                 1       22700  
4                 1       23000  


### Step 3: Check the balance of the labels variable (`y`) by using the `value_counts` function.

In [6]:
# Check the balance of our target values
# Use the value_counts method on the labels (y) to check balance
label_counts = y.value_counts()

# Display the counts for each label
print(label_counts)

loan_status
0    75036
1     2500
Name: count, dtype: int64
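
The raw counts are easier to judge as proportions. The following is a minimal sketch that uses only the two counts printed above; the `counts` dictionary is a hypothetical stand-in for `label_counts`, not part of the original notebook.

```python
# Class counts copied from the value_counts output above
# (`counts` is an illustrative stand-in for `label_counts`)
counts = {0: 75036, 1: 2500}

total = sum(counts.values())
proportions = {label: n / total for label, n in counts.items()}

# Share of each class in the full dataset
for label, share in proportions.items():
    print(f"label {label}: {share:.2%}")

# Ratio of healthy loans to high-risk loans
imbalance_ratio = counts[0] / counts[1]
print(f"imbalance ratio: {imbalance_ratio:.1f} to 1")
```

With the DataFrame in memory, `y.value_counts(normalize=True)` should return the same proportions directly.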


### Step 4: Split the data into training and testing datasets by using `train_test_split`.

In [7]:
# Import the train_test_split function
from sklearn.model_selection import train_test_split

# Split the data using train_test_split
# Assign a random_state of 1 to the function
# Split the data into training and testing datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Print the shape of the resulting datasets to confirm the split
print("Training features shape:", X_train.shape)
print("Testing features shape:", X_test.shape)
print("Training labels shape:", y_train.shape)
print("Testing labels shape:", y_test.shape)

Training features shape: (62028, 7)
Testing features shape: (15508, 7)
Training labels shape: (62028,)
Testing labels shape: (15508,)
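
With a minority class this small, a purely random split can leave the training and testing sets with slightly different class shares. In this run the test set happens to land close to the overall share (507 of 15,508 rows are high-risk, per the classification report further down). The sketch below shows the optional `stratify` argument of `train_test_split`, which keeps the class shares matched by construction. It runs on synthetic data, and the names `X_demo` and `y_demo` are hypothetical; the notebook itself does not stratify.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the lending data: 1,000 rows, 3% positive class
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(1000, 2))
y_demo = np.array([1] * 30 + [0] * 970)

# stratify=y_demo asks for the same class proportions in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

print("train positive share:", y_tr.mean())
print("test positive share:", y_te.mean())
```

Whether stratifying would change the results reported here is not something this sketch can show; it only illustrates the option.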


---

## Create a Logistic Regression Model with the Original Data

###  Step 1: Fit a logistic regression model by using the training data (`X_train` and `y_train`).

In [8]:
# Import the LogisticRegression class from scikit-learn
from sklearn.linear_model import LogisticRegression

# Instantiate the Logistic Regression model with a random_state of 1
# max_iter is raised above the scikit-learn default of 100
logreg = LogisticRegression(max_iter=10000, random_state=1)

# Fit the model using training data
logreg.fit(X_train, y_train)

### Step 2: Save the predictions on the testing data labels by using the testing feature data (`X_test`) and the fitted model.

In [9]:
# Make a prediction using the testing data
y_pred = logreg.predict(X_test)

### Step 3: Evaluate the model’s performance by doing the following:

* Calculate the accuracy score of the model.

* Generate a confusion matrix.

* Print the classification report.

In [10]:
# Print the balanced_accuracy score of the model
balanced_acc_score = balanced_accuracy_score(y_test, y_pred)

print("Balanced Accuracy Score:", balanced_acc_score)

Balanced Accuracy Score: 0.9668615123225841


In [11]:
# Generate a confusion matrix for the model
conf_matrix = confusion_matrix(y_test, y_pred)

# Display names for the two classes (0 = healthy loan, 1 = high-risk loan)
labels = ['Healthy Loan', 'High-risk Loan']

# Wrap the confusion matrix in a DataFrame so the rows (actual) and columns (predicted) are labeled
conf_matrix_df = pd.DataFrame(conf_matrix, 
                              index=['Actual ' + label for label in labels], 
                              columns=['Predicted ' + label for label in labels])

print("Confusion Matrix:\n")
print(conf_matrix_df)

Confusion Matrix:

                       Predicted Healthy Loan  Predicted High-risk Loan
Actual Healthy Loan                     14924                        77
Actual High-risk Loan                      31                       476


In [12]:
# Print the classification report for the model
print("Classification Report:\n")
print(classification_report(y_test, y_pred, target_names=['Healthy Loan', 'High-risk Loan']))

Classification Report:

                precision    recall  f1-score   support

  Healthy Loan       1.00      0.99      1.00     15001
High-risk Loan       0.86      0.94      0.90       507

      accuracy                           0.99     15508
     macro avg       0.93      0.97      0.95     15508
  weighted avg       0.99      0.99      0.99     15508



### Step 4: Answer the following question.

**Question:** How well does the logistic regression model predict both the `0` (healthy loan) and `1` (high-risk loan) labels?

**Answer:** The model does a good to excellent job of separating healthy loans from high-risk loans, and it performs especially well on the healthy loan class. The balanced accuracy score is 0.97. Balanced accuracy is the average of the recall for each class, so it is not inflated by the class imbalance in our data (the high-risk class is much smaller than the healthy loan class); here it reflects a recall of 0.99 for healthy loans and 0.94 for high-risk loans.

The confusion matrix and classification report show where the misclassifications occur. In absolute terms, more healthy loans are flagged as high-risk (77) than high-risk loans are classified as healthy (31). As a rate, though, the model misses about 6% of the actual high-risk loans (31 of 507) while mislabeling only about 0.5% of the healthy loans (77 of 15,001). Which kind of error is preferable depends on the goal of the model: flagging marginally good borrowers as high-risk is probably preferable to approving marginally high-risk borrowers, but that depends on whether the goal is to maximize lending or to minimize risk.

The classification report indicates the model is extremely good at classifying healthy loans, with precision, recall, and f1-scores of 0.99 or higher for this class. The model is slightly to moderately less effective at classifying high-risk loans: precision (0.86), recall (0.94), and f1-score (0.90) are all below 0.95 and all lower than for the healthy loan class. In particular, the high-risk precision of 0.86 means that about 14% of the loans the model flags as high-risk are actually healthy, i.e., the model is less reliable for healthy loans near the high-risk margin. The recall of 0.94 shows that the model identifies most of the true high-risk loans, and the fairly high f1-score indicates a reasonable balance between precision and recall for this class.

The class imbalance also shows up in the macro and weighted averages. The weighted averages (0.99) are dominated by the much larger healthy loan class, while the lower macro averages (0.93 precision, 0.97 recall, 0.95 f1) reveal the weaker performance on the relatively small high-risk class. There is likely some room for improvement in classifying high-risk loans, but again, this depends on whether the goal is to maximize lending or minimize risk. If the goal is to minimize risk, this model does well, although recall for the high-risk class could still improve marginally. If the goal is to maximize lending, it would probably be beneficial to increase the precision of the high-risk classification.
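
The metrics discussed in this answer can be recomputed by hand from the four cells of the confusion matrix. The sketch below does so in plain Python; the variable names are illustrative, and the counts are copied from the confusion matrix output above.

```python
# Counts copied from the confusion matrix above
tn, fp = 14924, 77   # actual healthy loans: predicted healthy, predicted high-risk
fn, tp = 31, 476     # actual high-risk loans: predicted healthy, predicted high-risk

# Recall per class: share of each actual class that was labeled correctly
recall_healthy = tn / (tn + fp)
recall_high_risk = tp / (tp + fn)

# Precision for the high-risk class: share of high-risk predictions that were right
precision_high_risk = tp / (tp + fp)

# f1 is the harmonic mean of precision and recall
f1_high_risk = (
    2 * precision_high_risk * recall_high_risk
    / (precision_high_risk + recall_high_risk)
)

# Balanced accuracy is the unweighted mean of the per-class recalls
balanced_accuracy = (recall_healthy + recall_high_risk) / 2

# Plain accuracy, for comparison, is dominated by the large healthy class
accuracy = (tn + tp) / (tn + fp + fn + tp)

print(f"balanced accuracy: {balanced_accuracy:.4f}")
print(f"accuracy:          {accuracy:.4f}")
print(f"high-risk precision / recall / f1: "
      f"{precision_high_risk:.2f} / {recall_high_risk:.2f} / {f1_high_risk:.2f}")
```

The gap between plain accuracy and balanced accuracy is the effect of the class imbalance: the healthy class contributes 15,001 of the 15,508 test rows to the former but only half of the weight to the latter.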

---

## Predict a Logistic Regression Model with Resampled Training Data

### Step 1: Use the `RandomOverSampler` module from the imbalanced-learn library to resample the data. Be sure to confirm that the labels have an equal number of data points. 

In [13]:
# Import the RandomOverSampler class from imbalanced-learn
from imblearn.over_sampling import RandomOverSampler

# Instantiate the random oversampler model
# Assign a random_state parameter of 1 to the model
ros = RandomOverSampler(random_state=1)

# Fit the original training data to the random_oversampler model
X_train_resampled, y_train_resampled = ros.fit_resample(X_train, y_train)

In [14]:
# Count the distinct values of the resampled labels data
resamp_label_counts = y_train_resampled.value_counts()

# Display the counts for each label
print(resamp_label_counts)

loan_status
0    60035
1    60035
Name: count, dtype: int64
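
Conceptually, random oversampling duplicates randomly chosen minority-class rows until both classes are the same size. The sketch below illustrates that idea with NumPy on a toy array. It is an illustration under that assumption, not imbalanced-learn's actual implementation, and `X_toy` and `y_toy` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy labels: 8 majority (0) rows and 2 minority (1) rows
y_toy = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # one feature: the row id

minority_idx = np.flatnonzero(y_toy == 1)
majority_idx = np.flatnonzero(y_toy == 0)

# Draw minority rows with replacement until both classes have the same size
n_extra = len(majority_idx) - len(minority_idx)
extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)

resampled_idx = np.concatenate([majority_idx, minority_idx, extra_idx])
X_toy_resampled = X_toy[resampled_idx]
y_toy_resampled = y_toy[resampled_idx]

# Both classes now have the same number of rows
print(np.bincount(y_toy_resampled))
```

In the notebook, the shapes above imply that the training set held 60,035 healthy loans and 1,993 high-risk loans (62,028 rows in total) before resampling, so the resampled high-risk class consists largely of repeated rows. Only the training data is resampled; the test set keeps its original class balance.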


### Step 2: Use the `LogisticRegression` classifier and the resampled data to fit the model and make predictions.

In [15]:
# Instantiate the Logistic Regression model
# Assign a random_state parameter of 1 to the model
resamp_logreg = LogisticRegression(max_iter=10000, random_state=1)

# Fit the model using the resampled training data
resamp_logreg.fit(X_train_resampled, y_train_resampled)

# Make a prediction using the testing data
resamp_y_pred = resamp_logreg.predict(X_test)

### Step 3: Evaluate the model’s performance by doing the following:

* Calculate the accuracy score of the model.

* Generate a confusion matrix.

* Print the classification report.

In [16]:
# Print the balanced_accuracy score of the model 
resamp_balanced_acc_score = balanced_accuracy_score(y_test, resamp_y_pred)

print("Resampled Balanced Accuracy Score:", resamp_balanced_acc_score)

Resampled Balanced Accuracy Score: 0.9941416134387885


In [17]:
# Generate a confusion matrix for the model
resamp_conf_matrix = confusion_matrix(y_test, resamp_y_pred)

# Display names for the two classes (0 = healthy loan, 1 = high-risk loan)
labels = ['Healthy Loan', 'High-risk Loan']

# Wrap the confusion matrix in a DataFrame so the rows (actual) and columns (predicted) are labeled
resamp_conf_matrix_df = pd.DataFrame(resamp_conf_matrix, 
                              index=['Actual ' + label for label in labels], 
                              columns=['Predicted ' + label for label in labels])

print("Resampled Confusion Matrix:\n")
print(resamp_conf_matrix_df)

Resampled Confusion Matrix:

                       Predicted Healthy Loan  Predicted High-risk Loan
Actual Healthy Loan                     14914                        87
Actual High-risk Loan                       3                       504


In [18]:
# Print the classification report for the model
print("Resampled Classification Report:\n")
print(classification_report(y_test, resamp_y_pred, target_names=['Healthy Loan', 'High-risk Loan']))

Resampled Classification Report:

                precision    recall  f1-score   support

  Healthy Loan       1.00      0.99      1.00     15001
High-risk Loan       0.85      0.99      0.92       507

      accuracy                           0.99     15508
     macro avg       0.93      0.99      0.96     15508
  weighted avg       0.99      0.99      0.99     15508



### Step 4: Answer the following question

**Question:** How well does the logistic regression model, fit with oversampled data, predict both the `0` (healthy loan) and `1` (high-risk loan) labels?

**Answer:** The resampled model does perform marginally better, but only in the context of the specific goals of the model as stated above. If our goal is to minimize risk, then balancing the class sizes with the random oversampler did improve the recall (0.94 to 0.99) and f1-score (0.90 to 0.92) of the high-risk class. The improvement was not large, but it was enough to say that the oversampled model performs better than the original model. Specifically, the resampled model was better at identifying true high-risk loans (it missed 3 instead of 31) while maintaining an acceptable balance between precision and recall: the slightly higher f1-score for the high-risk class shows that the gain in recall was not offset by a loss in precision. The improved macro averages for recall and f1 also show that the model benefited slightly from balancing the class sizes.

However, the high-risk class precision was effectively unchanged (technically slightly lower, 0.85 vs. 0.86), so the model still classifies a non-trivial number of healthy loans as high-risk: about 15% of the loans flagged as high-risk are actually healthy (87 of 591), although that is under 1% of all healthy loans. The higher balanced accuracy score (0.99 vs. 0.97) appears to derive entirely from the improved high-risk recall, as almost every other metric remained unchanged or virtually unchanged. For the healthy loan class, the resampled model performed virtually identically to the original model and still performs extremely well.

Thus, we come to the crux of the problem: what is the goal of the model? If our goal is to maximize lending (while keeping risk controlled) by getting loans into the hands of as many acceptable borrowers as possible, this model doesn't perform much better than the first. If our goal is to minimize risk (while potentially rejecting a non-trivial number of applicants who would be good borrowers), then this model does perform marginally better than the first. Now we have to answer questions such as: what is the ultimate goal of the model (risk reduction or lending maximization), how much time and labor are we willing to give to reviewing loans at the margins to achieve that goal, and how much misclassification are we willing to tolerate in pursuit of it?
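
The trade-off described in this answer can be put side by side by recomputing the high-risk metrics for both models from their confusion matrices. This is a sketch in plain Python: the helper `summarize` is hypothetical, and the counts are copied from the two confusion matrix outputs above.

```python
def summarize(tn, fp, fn, tp):
    """Return high-risk class metrics from the four confusion matrix cells."""
    return {
        # Share of high-risk predictions that are truly high-risk
        "precision": tp / (tp + fp),
        # Share of actual high-risk loans that were caught
        "recall": tp / (tp + fn),
        # Share of actual healthy loans that were flagged as high-risk
        "false_positive_rate": fp / (fp + tn),
        # Share of high-risk predictions that are actually healthy (1 - precision)
        "false_discovery_rate": fp / (fp + tp),
    }

original = summarize(tn=14924, fp=77, fn=31, tp=476)
resampled = summarize(tn=14914, fp=87, fn=3, tp=504)

for name, metrics in [("original", original), ("resampled", resampled)]:
    print(name, {key: round(value, 4) for key, value in metrics.items()})
```

On this test set, the resampled model catches 28 more high-risk loans (31 missed versus 3) at the cost of flagging 10 more healthy loans (77 versus 87). How to weigh those two numbers is the question about the model's goal raised above.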