In [1]:
# Importing the modules
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.metrics import balanced_accuracy_score, confusion_matrix, classification_report, accuracy_score

---

## Splitting the Data into Training and Testing Sets

### Step 1: Reading the `lending_data.csv` data from the `Resources` folder into a Pandas DataFrame.

In [2]:
# Reading the lending_data.csv file from the Resources folder into a Pandas DataFrame
lending_data = Path('Resources/lending_data.csv')
lending_data_df = pd.read_csv(lending_data)

# Review the DataFrame
lending_data_df.head()

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt,loan_status
0,10700.0,7.672,52800,0.431818,5,1,22800,0
1,8400.0,6.692,43600,0.311927,3,0,13600,0
2,9000.0,6.963,46100,0.349241,3,0,16100,0
3,10700.0,7.664,52700,0.43074,5,1,22700,0
4,10800.0,7.698,53000,0.433962,5,1,23000,0


In [3]:
# Review tail of DataFrame
lending_data_df.tail()

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt,loan_status
77531,19100.0,11.261,86600,0.65358,12,2,56600,1
77532,17700.0,10.662,80900,0.629172,11,2,50900,1
77533,17600.0,10.595,80300,0.626401,11,2,50300,1
77534,16300.0,10.068,75300,0.601594,10,2,45300,1
77535,15600.0,9.742,72300,0.585062,9,2,42300,1


### Step 2: Creating the labels set (`y`) from the `loan_status` column, and then creating the features (`X`) DataFrame from the remaining columns.

In [4]:
# Separating the data into labels and features

# Separating the y variable, the labels
y = lending_data_df["loan_status"]

# Separating the X variable, the features
X = lending_data_df.drop(columns="loan_status")

In [5]:
# Reviewing the y variable Series
y.head()
y.tail()

77531    1
77532    1
77533    1
77534    1
77535    1
Name: loan_status, dtype: int64

In [6]:
# Reviewing the X variable DataFrame
X

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt
0,10700.0,7.672,52800,0.431818,5,1,22800
1,8400.0,6.692,43600,0.311927,3,0,13600
2,9000.0,6.963,46100,0.349241,3,0,16100
3,10700.0,7.664,52700,0.430740,5,1,22700
4,10800.0,7.698,53000,0.433962,5,1,23000
...,...,...,...,...,...,...,...
77531,19100.0,11.261,86600,0.653580,12,2,56600
77532,17700.0,10.662,80900,0.629172,11,2,50900
77533,17600.0,10.595,80300,0.626401,11,2,50300
77534,16300.0,10.068,75300,0.601594,10,2,45300


### Step 3: Checking the balance of the labels variable (`y`) by using the `value_counts` function.

In [7]:
# Checking the balance of our target values
y.value_counts()

loan_status
0    75036
1     2500
Name: count, dtype: int64
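
The counts above show a strong class imbalance. As a quick check (a sketch that uses only the two counts printed above; the variable names are illustrative), the share of high-risk loans and the accuracy of a trivial baseline that always predicts `0` can be computed directly:

```python
# Class counts copied from the value_counts() output above
healthy_count = 75036    # loan_status == 0
high_risk_count = 2500   # loan_status == 1
total = healthy_count + high_risk_count

# Share of the minority (high-risk) class
high_risk_share = high_risk_count / total

# Accuracy of a hypothetical baseline that labels every loan as healthy
majority_baseline_accuracy = healthy_count / total

print(f"High-risk share: {high_risk_share:.4f}")
print(f"Majority-class baseline accuracy: {majority_baseline_accuracy:.4f}")
```

Because a model that never flags a loan would already be roughly 97% accurate, balanced accuracy and per-class recall are more informative here than plain accuracy.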

### Step 4: Splitting the data into training and testing datasets by using `train_test_split`.

In [8]:
# Importing the train_test_split function
from sklearn.model_selection import train_test_split

# Splitting the data using train_test_split
# Assigning a random_state of 1 to the function
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
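
When neither `test_size` nor `train_size` is passed, `train_test_split` holds out 25% of the rows for testing. The arithmetic below is a sketch that checks the resulting split sizes against the label counts reported in this notebook (the training label counts and the test-set support in the classification report); the variable names are illustrative.

```python
import math

total_rows = 77536            # 75036 healthy + 2500 high-risk loans
default_test_fraction = 0.25  # scikit-learn default when no sizes are given

# Expected split sizes under the default test fraction
expected_test_rows = math.ceil(total_rows * default_test_fraction)
expected_train_rows = total_rows - expected_test_rows

# Class counts observed in this notebook's training and testing labels
train_counts = {0: 56271, 1: 1881}
test_counts = {0: 18765, 1: 619}

# Share of high-risk loans in each split
train_high_risk_share = train_counts[1] / sum(train_counts.values())
test_high_risk_share = test_counts[1] / sum(test_counts.values())

print(f"Train rows: {expected_train_rows}, test rows: {expected_test_rows}")
print(f"High-risk share, train: {train_high_risk_share:.4f}, test: {test_high_risk_share:.4f}")
```

The two shares are close here, but a purely random split does not guarantee that. Passing `stratify=y` to `train_test_split` is the usual way to preserve class proportions in both splits.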


---

## Creating a Logistic Regression Model with the Original Data

### Step 1: Fitting a logistic regression model by using the training data (`X_train` and `y_train`).

In [9]:
# Importing the LogisticRegression class from scikit-learn
from sklearn.linear_model import LogisticRegression

# Instantiating the Logistic Regression model
# Assigning a random_state parameter of 1 to the model
Log_reg_model_classifier = LogisticRegression(random_state=1)

# Fitting the model using training data
l_r_model = Log_reg_model_classifier.fit(X_train, y_train)

### Step 2: Saving the predictions on the testing data labels by using the testing feature data (`X_test`) and the fitted model.

In [10]:
# Making a prediction using the testing data
predictions = l_r_model.predict(X_test)
predictions

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

### Step 3: Evaluating the model’s performance by doing the following:

* Calculating the accuracy score of the model.

* Generating a confusion matrix.

* Printing the classification report.

In [11]:
# Printing the accuracy_score of the model
l_r_accuracy_score = accuracy_score(y_test, predictions)
print(f"The Accuracy Score is : {l_r_accuracy_score}")

# Printing the balanced_accuracy_score of the model
# (stored under its own name so the plain accuracy score is not overwritten)
l_r_balanced_accuracy_score = balanced_accuracy_score(y_test, predictions)
print(f"The Balanced Accuracy Score is : {l_r_balanced_accuracy_score}")

The Accuracy Score is : 0.9918489475856377
The Balanced Accuracy Score is : 0.9520479254722232


In [12]:
# Generating a confusion matrix for the model
l_r_confusion_matrix = confusion_matrix(y_test, predictions)
print(f"The Confusion Matrix : {l_r_confusion_matrix}")

The Confusion Matrix : [[18663   102]
 [   56   563]]


In [13]:
# Printing the classification report for the model
l_r_classification_report = classification_report(y_test, predictions)
print(f"The Classification Report : {l_r_classification_report}")

The Classification Report :               precision    recall  f1-score   support

           0       1.00      0.99      1.00     18765
           1       0.85      0.91      0.88       619

    accuracy                           0.99     19384
   macro avg       0.92      0.95      0.94     19384
weighted avg       0.99      0.99      0.99     19384



The logistic regression model reaches a balanced accuracy of 95% across the `0` (healthy loan) and `1` (high-risk loan) labels, compared with a plain accuracy of 99%. Plain accuracy is higher because it is dominated by the much larger healthy-loan class.

The confusion matrix shows 18663 true negatives (TN), 102 false positives (FP), 56 false negatives (FN), and 563 true positives (TP). In other words, 18663 of 18765 healthy loans were correctly identified as healthy, and 563 of 619 high-risk loans were correctly flagged as high-risk.

For healthy loans, precision rounds to 100% and recall is 99%.

For high-risk loans, precision is 85%, recall is 91%, and the f1-score is 88%.
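
The reported scores can be reproduced from the four cells of the confusion matrix. The snippet below is a minimal sketch that recomputes them by hand, assuming scikit-learn's layout where rows are actual labels and columns are predicted labels; the variable names are illustrative.

```python
# Confusion matrix printed above: rows are actual labels, columns are predictions
tn, fp = 18663, 102   # actual healthy loans (label 0)
fn, tp = 56, 563      # actual high-risk loans (label 1)

# Plain accuracy: share of all test rows that were predicted correctly
accuracy = (tn + tp) / (tn + fp + fn + tp)

# Recall per class: share of each actual class that was predicted correctly
recall_healthy = tn / (tn + fp)
recall_high_risk = tp / (tp + fn)

# Balanced accuracy is the mean of the per-class recalls
balanced_accuracy = (recall_healthy + recall_high_risk) / 2

# Precision and F1 for the high-risk class
precision_high_risk = tp / (tp + fp)
f1_high_risk = (
    2 * precision_high_risk * recall_high_risk
    / (precision_high_risk + recall_high_risk)
)

print(f"Accuracy: {accuracy:.4f}")
print(f"Balanced accuracy: {balanced_accuracy:.4f}")
print(f"High-risk precision: {precision_high_risk:.2f}")
print(f"High-risk recall: {recall_high_risk:.2f}")
print(f"High-risk F1: {f1_high_risk:.2f}")
```

Writing the scores out this way makes it clear why balanced accuracy (95%) sits below plain accuracy (99%): the 56 missed high-risk loans weigh heavily on the recall of the small class but barely move the overall hit rate.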

---

## Creating a Logistic Regression Model with Resampled Training Data

### Step 1: Using the `RandomOverSampler` class from the imbalanced-learn library to resample the data, and confirming that the labels have an equal number of data points.

In [14]:
# Importing the RandomOverSampler class from imbalanced-learn
from imblearn.over_sampling import RandomOverSampler

# Instantiating the random oversampler model
# Assigning a random_state parameter of 1 to the model
random_over_sampler_model = RandomOverSampler(random_state=1)

# Resampling the original training data with the random oversampler model
X_resampled, y_resampled = random_over_sampler_model.fit_resample(X_train, y_train)

In [15]:
# Counting the distinct values of the resampled labels data
y_resampled.value_counts()

loan_status
0    56271
1    56271
Name: count, dtype: int64

In [16]:
# Counting the distinct values of the original labels data
y_train.value_counts()

loan_status
0    56271
1     1881
Name: count, dtype: int64
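
Random oversampling balances the classes by drawing minority-class rows with replacement until every class matches the size of the largest one. The function below is a simplified, pure-Python sketch of that idea on a toy dataset; `random_oversample` is a hypothetical helper written for illustration and is not how imbalanced-learn implements `RandomOverSampler`.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=1):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())

    resampled_rows = list(rows)
    resampled_labels = list(labels)
    for label, count in counts.items():
        # Indices of the original rows that carry this label
        indices = [i for i, value in enumerate(labels) if value == label]
        # Draw extra rows with replacement until this class reaches the target size
        for _ in range(target - count):
            picked = rng.choice(indices)
            resampled_rows.append(rows[picked])
            resampled_labels.append(label)
    return resampled_rows, resampled_labels

# Toy data (made up for illustration): six healthy loans (0), two high-risk loans (1)
toy_rows = [[10700.0], [8400.0], [9000.0], [10800.0], [9500.0], [9900.0], [19100.0], [17700.0]]
toy_labels = [0, 0, 0, 0, 0, 0, 1, 1]

balanced_rows, balanced_labels = random_oversample(toy_rows, toy_labels)
print(Counter(balanced_labels))
```

Note that the notebook resamples only the training data. The test set keeps its original class balance, so the evaluation still reflects the proportions seen in practice.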

### Step 2: Using the `LogisticRegression` classifier and the resampled data to fit the model and make predictions.

In [17]:
# Instantiating the Logistic Regression model
# Assigning a random_state parameter of 1 to the model
l_r_model_classifier = LogisticRegression(solver='lbfgs', random_state=1)

# Fitting the model using the resampled training data
l_r_model_classifier.fit(X_resampled, y_resampled)

# Making a prediction using the testing data
predictions_random_over_sampler_model = l_r_model_classifier.predict(X_test)

### Step 3: Evaluating the model’s performance by doing the following:

* Calculating the accuracy score of the model.

* Generating a confusion matrix.

* Printing the classification report.

In [18]:
# Printing the accuracy_score of the model
random_over_sampler_model_accuracy_score = accuracy_score(y_test, predictions_random_over_sampler_model) 
print(f"The Accuracy Score : {random_over_sampler_model_accuracy_score}")

# Printing the balanced_accuracy_score of the model 
random_over_sampler_model_balanced_accuracy_score = balanced_accuracy_score(y_test, predictions_random_over_sampler_model)
print(f"The Balanced Accuracy Score : {random_over_sampler_model_balanced_accuracy_score}")

The Accuracy Score : 0.9938093272802311
The Balanced Accuracy Score : 0.9936781215845847


In [19]:
# Generating a confusion matrix for the model
random_over_sampler_model_confusion_matrix = confusion_matrix(y_test, predictions_random_over_sampler_model)
print(f"The Confusion Matrix : {random_over_sampler_model_confusion_matrix}")

The Confusion Matrix : [[18649   116]
 [    4   615]]


In [21]:
# Printing the classification report for the model
random_over_sampler_model_classification_report = classification_report(y_test, predictions_random_over_sampler_model)
print(f"The Classification Report : {random_over_sampler_model_classification_report}")

The Classification Report :               precision    recall  f1-score   support

           0       1.00      0.99      1.00     18765
           1       0.84      0.99      0.91       619

    accuracy                           0.99     19384
   macro avg       0.92      0.99      0.95     19384
weighted avg       0.99      0.99      0.99     19384



The logistic regression model fit with oversampled data reaches 99% balanced accuracy and 99% accuracy across the `0` (healthy loan) and `1` (high-risk loan) labels.

The confusion matrix shows 18649 true negatives (TN), 116 false positives (FP), 4 false negatives (FN), and 615 true positives (TP). In other words, 18649 of 18765 healthy loans were correctly identified as healthy, and 615 of 619 high-risk loans were correctly flagged as high-risk.

For healthy loans, precision rounds to 100% and recall is 99%.

For high-risk loans, precision is 84%, recall is 99%, and the f1-score is 91%.
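
The `macro avg` and `weighted avg` rows of the classification report summarize the per-class scores in two different ways. The sketch below recomputes both from the oversampled model's confusion matrix, assuming rows are actual labels and columns are predicted labels; the dictionary names are illustrative.

```python
# Confusion matrix of the oversampled model: rows are actual labels, columns are predictions
tn, fp = 18649, 116   # actual healthy loans (label 0)
fn, tp = 4, 615       # actual high-risk loans (label 1)

support = {0: tn + fp, 1: fn + tp}                   # actual rows per class
precision = {0: tn / (tn + fn), 1: tp / (tp + fp)}   # correct share of each predicted class
recall = {0: tn / (tn + fp), 1: tp / (tp + fn)}      # correct share of each actual class

total = sum(support.values())

# Macro average: unweighted mean over the classes
macro_precision = sum(precision.values()) / len(precision)
macro_recall = sum(recall.values()) / len(recall)

# Weighted average: mean over the classes, weighted by support
weighted_precision = sum(precision[c] * support[c] for c in support) / total
weighted_recall = sum(recall[c] * support[c] for c in support) / total

print(f"Macro precision: {macro_precision:.2f}, macro recall: {macro_recall:.2f}")
print(f"Weighted precision: {weighted_precision:.2f}, weighted recall: {weighted_recall:.2f}")
```

Two identities fall out of this calculation: macro recall equals the balanced accuracy score, and weighted recall equals the plain accuracy score. The weighted rows therefore mostly reflect the healthy-loan class, while the macro rows give the high-risk class equal weight.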

#### Summary

Trained on the original dataset, the logistic regression model reached a balanced accuracy of 95%. Trained on the oversampled data, it reached 99%, mainly because recall for high-risk loans rose from 91% to 99%.

For healthy loans (label 0), both models reached a precision that rounds to 100% and a recall of 99%. Few healthy loans were flagged as high-risk, and few high-risk loans were labeled healthy.

For high-risk loans (label 1), the oversampled model outperformed the original model, with an F1-score of 91% compared to 88%. Precision was nearly unchanged (84% versus 85%), so the gain came from catching more high-risk loans: false negatives fell from 56 to 4, while false positives rose from 102 to 116.
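
The trade-off between the two models can be read straight off their confusion matrices. The sketch below (dictionary names are illustrative) computes how each cell changed after oversampling and how many test loans each model misclassified in total:

```python
# Confusion-matrix cells reported for each model on the same test set
original = {"tn": 18663, "fp": 102, "fn": 56, "tp": 563}
oversampled = {"tn": 18649, "fp": 116, "fn": 4, "tp": 615}

# Change in each cell after training on oversampled data
change = {cell: oversampled[cell] - original[cell] for cell in original}

# Total misclassified loans per model (false positives plus false negatives)
original_errors = original["fp"] + original["fn"]
oversampled_errors = oversampled["fp"] + oversampled["fn"]

print(change)
print(f"Errors, original: {original_errors}, oversampled: {oversampled_errors}")
```

On this test set, the oversampled model catches 52 more high-risk loans at the cost of flagging 14 more healthy loans. Whether that exchange is worthwhile depends on the relative cost of each error type, which the notebook does not specify.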