In [1]:
# Import the modules
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.metrics import balanced_accuracy_score, confusion_matrix, classification_report
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import RandomOverSampler

---

## Split the Data into Training and Testing Sets

### Step 1: Read the `lending_data.csv` data from the `Resources` folder into a Pandas DataFrame.

In [2]:
# Read the CSV file from the Resources folder into a Pandas DataFrame
df_credit_data = pd.read_csv("../Resources/lending_data.csv")

# Review the DataFrame
df_credit_data

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt,loan_status
0,10700.0,7.672,52800,0.431818,5,1,22800,0
1,8400.0,6.692,43600,0.311927,3,0,13600,0
2,9000.0,6.963,46100,0.349241,3,0,16100,0
3,10700.0,7.664,52700,0.430740,5,1,22700,0
4,10800.0,7.698,53000,0.433962,5,1,23000,0
...,...,...,...,...,...,...,...,...
77531,19100.0,11.261,86600,0.653580,12,2,56600,1
77532,17700.0,10.662,80900,0.629172,11,2,50900,1
77533,17600.0,10.595,80300,0.626401,11,2,50300,1
77534,16300.0,10.068,75300,0.601594,10,2,45300,1


### Step 2: Create the labels set (`y`) from the `loan_status` column, and then create the features (`X`) DataFrame from the remaining columns.

In [3]:
# Separate the data into labels and features

# Separate the y variable, the labels
y = df_credit_data['loan_status']

# Separate the X variable, the features (every column except the label)
X = df_credit_data.drop(columns="loan_status")


In [4]:
# Review the y variable Series
y[:5]

0    0
1    0
2    0
3    0
4    0
Name: loan_status, dtype: int64

In [5]:
# Review the X variable DataFrame
X.head()

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt
0,10700.0,7.672,52800,0.431818,5,1,22800
1,8400.0,6.692,43600,0.311927,3,0,13600
2,9000.0,6.963,46100,0.349241,3,0,16100
3,10700.0,7.664,52700,0.43074,5,1,22700
4,10800.0,7.698,53000,0.433962,5,1,23000


### Step 3: Check the balance of the labels variable (`y`) by using the `value_counts` function.

In [6]:
# Check the balance of our target values
y.value_counts()

0    75036
1     2500
Name: loan_status, dtype: int64
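
To put the imbalance in perspective, the sketch below derives the class shares from the counts printed above. The variable names are illustrative and are not part of the original notebook.

```python
# Counts copied from the value_counts() output above
n_healthy = 75036    # loan_status == 0
n_high_risk = 2500   # loan_status == 1
n_total = n_healthy + n_high_risk

# Share of each class in the full dataset
share_high_risk = n_high_risk / n_total
share_healthy = n_healthy / n_total

# A model that always predicts "0" would already reach this accuracy,
# which is why plain accuracy is a weak yardstick for this dataset
always_healthy_accuracy = share_healthy

print(f"High-risk share: {share_high_risk:.2%}")
print(f"Accuracy of always predicting 0: {always_healthy_accuracy:.2%}")
```

Only about 3% of the loans are high-risk, so a trivial "always healthy" classifier would score close to 97% on plain accuracy without identifying a single high-risk loan.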

### Step 4: Split the data into training and testing datasets by using `train_test_split`.

In [7]:
# Split the data using train_test_split
# Assign a random_state of 1 to the function
# Stratify on y so the training and testing sets keep the same class proportions as the full dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, stratify=y)
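
One way to see what `stratify` does is to split synthetic labels that have the same class counts as `loan_status` and compare the high-risk share in each subset. This is only a sketch: `X_demo` and `y_demo` are stand-ins invented for the check, not the lending data itself.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as loan_status (assumption: only
# the counts matter for this check, so the features are a dummy index column)
y_demo = np.array([0] * 75036 + [1] * 2500)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# Same call shape as the notebook: default test_size (0.25), random_state=1
_, _, y_tr_strat, y_te_strat = train_test_split(
    X_demo, y_demo, random_state=1, stratify=y_demo
)

# For comparison, a split without stratification
_, _, y_tr_plain, y_te_plain = train_test_split(X_demo, y_demo, random_state=1)

# The mean of a 0/1 label array is the share of high-risk loans
print("full dataset :", y_demo.mean())
print("stratified   :", y_tr_strat.mean(), y_te_strat.mean())
print("unstratified :", y_tr_plain.mean(), y_te_plain.mean())
```

With stratification, the training and testing shares match the full dataset; without it, they can drift slightly, which matters more when the minority class is this small.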

---

## Create a Logistic Regression Model with the Original Data

###  Step 1: Fit a logistic regression model by using the training data (`X_train` and `y_train`).

In [8]:
# Instantiate the Logistic Regression model
# Assign a random_state parameter of 1 to the model
classifier = LogisticRegression(solver='lbfgs', random_state=1)

# Fit the model using training data
classifier.fit(X_train, y_train)

LogisticRegression(random_state=1)

In [9]:
# Score the model on the training data (score returns the mean accuracy)
classifier.score(X_train, y_train)

0.9914878250103177

In [10]:
# Score the model on the testing data (score returns the mean accuracy)
classifier.score(X_test, y_test)

0.9924164259182832

### Step 2: Save the predictions on the testing data labels by using the testing feature data (`X_test`) and the fitted model.

In [11]:
# Make a prediction using the testing data
test_predictions = classifier.predict(X_test)

In [12]:
# Put the predictions and the actual labels into a DataFrame
test_predictions_df = pd.DataFrame({'Predictions': test_predictions, 'Actual': y_test})
test_predictions_df.head(10)

Unnamed: 0,Predictions,Actual
36831,0,0
75818,0,1
36563,0,0
13237,0,0
43292,0,0
68423,0,0
37714,0,0
64870,0,0
47959,0,0
49,0,0


### Step 3: Evaluate the model’s performance by doing the following:

* Calculate the balanced accuracy score of the model.

* Generate a confusion matrix.

* Print the classification report.

In [13]:
# Print the balanced_accuracy score of the model
acc_score = balanced_accuracy_score(y_test, test_predictions)
acc_score

0.9442676901753825

In [14]:
# Generate a confusion matrix for the model
confuse_matrix = confusion_matrix(y_test, test_predictions)

confuse_matrix_df = pd.DataFrame(confuse_matrix, index=['actual 0', 'actual 1'], columns=['predicted 0', 'predicted 1'])
confuse_matrix_df

Unnamed: 0,predicted 0,predicted 1
actual 0,18679,80
actual 1,67,558
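
The balanced accuracy score can be reproduced by hand from this confusion matrix, which also shows why it differs from the plain accuracy returned by `classifier.score`. The sketch below uses only the four counts shown above; the variable names are illustrative.

```python
# Counts copied from the confusion matrix above (rows = actual, columns = predicted)
tn, fp = 18679, 80   # actual 0
fn, tp = 67, 558     # actual 1

# Recall of each class: the share of that class's actual members the model found
recall_healthy = tn / (tn + fp)
recall_high_risk = tp / (tp + fn)

# Balanced accuracy is the unweighted mean of the per-class recalls
balanced_acc = (recall_healthy + recall_high_risk) / 2

# Plain accuracy weights every loan equally, so the majority class dominates
plain_acc = (tn + tp) / (tn + fp + fn + tp)

print(f"balanced accuracy: {balanced_acc:.4f}")
print(f"plain accuracy:    {plain_acc:.4f}")
```

The lower balanced figure comes almost entirely from the high-risk class, where 67 of 625 loans were missed.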


In [15]:
# Print the classification report for the model
print(f"Confusion Matrix:\n{confuse_matrix_df}\n\n\nBalanced Accuracy Score:\n{acc_score}\n\n\nClassification Report:\n{classification_report(y_test, test_predictions)}")


Confusion Matrix:
          predicted 0  predicted 1
actual 0        18679           80
actual 1           67          558


Balanced Accuracy Score:
0.9442676901753825


Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     18759
           1       0.87      0.89      0.88       625

    accuracy                           0.99     19384
   macro avg       0.94      0.94      0.94     19384
weighted avg       0.99      0.99      0.99     19384



### Step 4: Answer the following question.

**Question:** How well does the logistic regression model predict both the `0` (healthy loan) and `1` (high-risk loan) labels?

**Answer:** 
### Overview
This analysis uses supervised machine learning to build a model that assesses the creditworthiness of loan applicants. The model classifies borrowers into low-risk and high-risk categories. It is trained on a dataset of historical lending activity from a peer-to-peer lending services company.

A limitation of this dataset is that "0 - healthy loan" is heavily overrepresented (75,036 healthy loans versus 2,500 high-risk loans), so the model is expected to identify high-risk applicants less reliably than low-risk applicants.

### Classification Summary
- Accuracy: How often the model is correct.
    - Overall accuracy is 99%, but because healthy loans dominate the dataset, that figure mostly reflects performance on the "0 - healthy loan" class. The balanced accuracy score, which averages the recall of the two classes, is 94%.
- Precision: High precision means few false positives among the predicted members of a class.
    - This model is very good at predicting who is not a credit risk: precision for the "0 - healthy loan" category rounds to 1.00.
    - However, precision for the "1 - high-risk" category is 0.87. The model predicts that **more applicants belong in the "1 - high-risk" category** than the actual data supports.
    - In other words, 80 of the 638 applicants the model flagged as high-risk were actually healthy (false positives).
    - This may result in loans not being approved for applicants who ought to be approved, resulting in less revenue for the business.
- Recall: High recall means few false negatives among the actual members of a class.
    - Recall for the "0 - healthy loan" category also rounds to 1.00, so almost every healthy loan is recognized as healthy.
    - Recall for the "1 - high-risk" category is 0.89. The model predicts that **some applicants are not in the "1 - high-risk" category** when they actually are.
    - In other words, 67 of the 625 actual high-risk applicants were classified as healthy (false negatives). Because recall (0.89) is slightly higher than precision (0.87) for this class, the model produces fewer false negatives (67) than false positives (80).
    - This may result in loans being approved for applicants who ought not to be approved, resulting in loss of revenue, as well as more bad debt provisioning and charge-offs than would otherwise be expected.
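
The class-1 figures in the classification report follow directly from the confusion matrix counts. As a minimal worked example (variable names are illustrative):

```python
# Counts for the "1 - high-risk" class, taken from the confusion matrix
tp = 558   # high-risk loans correctly flagged
fp = 80    # healthy loans wrongly flagged as high-risk
fn = 67    # high-risk loans wrongly classified as healthy

# Precision: of all loans flagged as high-risk, how many really were
precision = tp / (tp + fp)

# Recall: of all loans that really were high-risk, how many were flagged
recall = tp / (tp + fn)

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Rounded to two decimals, these values agree with the `1` row of the report (0.87, 0.89, and 0.88).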

### Model Recommendation and Justification
This model does a fantastic job of predicting which loan applicants are low-risk. However, it misclassifies about 11% of high-risk applicants as healthy, so there is real business risk in adopting it. The business must decide how much risk is acceptable. My recommendations take the possible scenarios into consideration:
1. If the business is highly risk-averse, I recommend continuing to experiment with classification models until we have one that correctly classifies high-risk applicants 95% (or more) of the time.
2. If some risk is acceptable, this model may work for now until we are able to develop a more accurate one. For high-risk applicants, the model produces more false positives than false negatives, meaning that the business would approve fewer loans than it otherwise would. This makes the model a bit conservative, so the business may decide that it is good enough for now.

Because the model is conservative when it comes to classifying high-risk applicants, my official recommendation is that it can be used for now until more data is collected and the data team can develop a more accurate model. The financial teams will, however, want to take the precision and recall scores into account when producing financial forecasts. I do not recommend that this model be put into permanent use. We must strive to create and use a better model that can correctly classify high-risk applicants 100% of the time.

---

## Predict a Logistic Regression Model with Resampled Training Data

### Step 1: Use the `RandomOverSampler` module from the imbalanced-learn library to resample the data. Be sure to confirm that the labels have an equal number of data points. 

In [16]:
# Instantiate the random oversampler model
# Assign a random_state parameter of 1 to the model
ros = RandomOverSampler(random_state=1)

# Fit the original training data to the random_oversampler model
X_resampled, y_resampled = ros.fit_resample(X_train, y_train)


In [17]:
# Count the distinct values of the resampled labels data
y_resampled.value_counts()

0    56277
1    56277
Name: loan_status, dtype: int64
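
Conceptually, random oversampling duplicates randomly chosen minority-class rows until both classes are the same size. The NumPy sketch below illustrates that idea on toy data; it is an assumption-laden illustration, not imbalanced-learn's actual implementation, and `X_toy`, `y_toy`, and the other names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: 12 majority-class rows (label 0) and 3 minority-class rows (label 1)
y_toy = np.array([0] * 12 + [1] * 3)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

minority_idx = np.flatnonzero(y_toy == 1)
majority_idx = np.flatnonzero(y_toy == 0)

# Draw minority rows with replacement until the class sizes match
n_extra = len(majority_idx) - len(minority_idx)
extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)

# Keep every original row and append the duplicated minority rows
keep_idx = np.concatenate([np.arange(len(y_toy)), extra_idx])
X_bal, y_bal = X_toy[keep_idx], y_toy[keep_idx]

counts = np.bincount(y_bal)
print(counts)
```

No new information is created: the extra rows are exact copies, which is why the resampling is applied to the training data only and the model is still evaluated on the untouched `X_test` and `y_test`.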

### Step 2: Use the `LogisticRegression` classifier and the resampled data to fit the model and make predictions.

In [18]:
# Instantiate the Logistic Regression model
# Assign a random_state parameter of 1 to the model
re_classifier = LogisticRegression(solver='lbfgs', random_state=1)

# Fit the model using the resampled training data
re_classifier.fit(X_resampled, y_resampled)

# Make a prediction using the testing data
resampled_predictions = re_classifier.predict(X_test)

# Put the predictions and the actual labels into a DataFrame
resampled_predictions_df = pd.DataFrame({'Resampled Predictions': resampled_predictions, 'Actual': y_test})
resampled_predictions_df.head(10)

Unnamed: 0,Resampled Predictions,Actual
36831,0,0
75818,1,1
36563,0,0
13237,0,0
43292,0,0
68423,0,0
37714,0,0
64870,0,0
47959,0,0
49,0,0


### Step 3: Evaluate the model’s performance by doing the following:

* Calculate the balanced accuracy score of the model.

* Generate a confusion matrix.

* Print the classification report.

In [19]:
# Print the balanced_accuracy score of the model 
resampled_acc_score = balanced_accuracy_score(y_test, resampled_predictions)
resampled_acc_score

0.9959744975744975

In [20]:
# Generate a confusion matrix for the model
resampled_confuse_matrix = confusion_matrix(y_test, resampled_predictions)

resampled_confuse_matrix_df = pd.DataFrame(resampled_confuse_matrix, index=['actual 0', 'actual 1'], columns=['predicted 0', 'predicted 1'])
resampled_confuse_matrix_df

Unnamed: 0,predicted 0,predicted 1
actual 0,18668,91
actual 1,2,623


In [21]:
# Print the classification report for the model
print(f"Confusion Matrix:\n{resampled_confuse_matrix_df}\n\n\nBalanced Accuracy Score:\n{resampled_acc_score}\n\n\nClassification Report:\n{classification_report(y_test, resampled_predictions)}")


Confusion Matrix:
          predicted 0  predicted 1
actual 0        18668           91
actual 1            2          623


Balanced Accuracy Score:
0.9959744975744975


Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     18759
           1       0.87      1.00      0.93       625

    accuracy                           1.00     19384
   macro avg       0.94      1.00      0.96     19384
weighted avg       1.00      1.00      1.00     19384



### Step 4: Answer the following question

**Question:** How well does the logistic regression model, fit with oversampled data, predict both the `0` (healthy loan) and `1` (high-risk loan) labels?

### Overview
This analysis uses the same data as above to classify borrowers into low-risk and high-risk categories. However, oversampling is used to balance the number of low-risk and high-risk applicants in the training data. The goal is to see whether oversampling produces a better classification model.

### Classification Summary
- Accuracy: How often the model is correct.
    - The model correctly predicts a loan applicant's credit risk classification **99.5%** of the time, and the balanced accuracy score rises to 99.6% (from 94% without oversampling).
- Precision: High precision means few false positives among the predicted members of a class.
    - This model is very good at predicting who is not a credit risk: precision for the "0 - healthy loan" category rounds to 1.00.
    - Unfortunately, oversampling did not improve the precision score for the "1 - high-risk" category, which remains at 0.87. This model also predicts that **more applicants belong in the "1 - high-risk" category** than the actual data supports.
    - In other words, 91 of the 714 applicants the model flagged as high-risk were actually healthy (false positives).
    - This may result in loans not being approved for applicants who ought to be approved, resulting in less revenue for the business.
- Recall: High recall means few false negatives among the actual members of a class.
    - Recall for the "0 - healthy loan" category rounds to 1.00, so almost every healthy loan is recognized as healthy.
    - Oversampling improved the recall score for the "1 - high-risk" category significantly, from 0.89 to 1.00 after rounding. Only 2 of the 625 actual high-risk applicants were classified as healthy (false negatives).

### Model Recommendation and Justification
This model does a fantastic job of predicting which loan applicants are low-risk and a much better job of identifying high-risk applicants. However, oversampling did not enable the model to classify high-risk applicants correctly 100% of the time.

While there was significant improvement, there is still risk associated with using this model because it incorrectly classifies some low-risk applicants as high-risk. This may result in the business approving fewer loans than it otherwise would. If the business is OK with this risk, then I strongly recommend the use of this model over the one described above.

If the business is not OK with this risk, then I recommend more experimentation until we can create a model with a perfect precision score.
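
To make the trade-off between the two models concrete, the sketch below recomputes the high-risk precision and recall for each one from its confusion matrix. The helper `high_risk_metrics` is a hypothetical function written for this comparison, not part of the notebook.

```python
def high_risk_metrics(tn, fp, fn, tp):
    """Return precision, recall, and error counts for the "1 - high-risk" class."""
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "false_positives": fp,
        "false_negatives": fn,
    }

# Counts copied from the two confusion matrices in this notebook
original = high_risk_metrics(tn=18679, fp=80, fn=67, tp=558)
oversampled = high_risk_metrics(tn=18668, fp=91, fn=2, tp=623)

for name, metrics in [("original", original), ("oversampled", oversampled)]:
    print(
        f"{name:<12} precision={metrics['precision']:.3f} "
        f"recall={metrics['recall']:.3f} "
        f"FP={metrics['false_positives']} FN={metrics['false_negatives']}"
    )
```

Oversampling cuts missed high-risk loans from 67 to 2 at the cost of 11 additional healthy applicants being flagged, which is the trade-off the recommendation above weighs.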