# Credit Risk Classification

Credit risk poses a classification problem that’s inherently imbalanced. This is because healthy loans easily outnumber risky loans. In this Challenge, you’ll use various techniques to train and evaluate models with imbalanced classes. You’ll use a dataset of historical lending activity from a peer-to-peer lending services company to build a model that can identify the creditworthiness of borrowers.

## Instructions:

This challenge consists of the following subsections:

* Split the Data into Training and Testing Sets

* Create a Logistic Regression Model with the Original Data

* Predict a Logistic Regression Model with Resampled Training Data

* Write a Credit Risk Analysis Report

### Split the Data into Training and Testing Sets

Open the starter code notebook and then use it to complete the following steps.

1. Read the `lending_data.csv` data from the `Resources` folder into a Pandas DataFrame.

2. Create the labels set (`y`)  from the “loan_status” column, and then create the features (`X`) DataFrame from the remaining columns.

    > **Note** A value of `0` in the “loan_status” column means that the loan is healthy. A value of `1` means that the loan has a high risk of defaulting.  

3. Check the balance of the labels variable (`y`) by using the `value_counts` function.

4. Split the data into training and testing datasets by using `train_test_split`.

### Create a Logistic Regression Model with the Original Data

Employ your knowledge of logistic regression to complete the following steps:

1. Fit a logistic regression model by using the training data (`X_train` and `y_train`).

2. Save the predictions on the testing data labels by using the testing feature data (`X_test`) and the fitted model.

3. Evaluate the model’s performance by doing the following:

    * Calculate the accuracy score of the model.

    * Generate a confusion matrix.

    * Print the classification report.

4. Answer the following question: How well does the logistic regression model predict both the `0` (healthy loan) and `1` (high-risk loan) labels?

### Predict a Logistic Regression Model with Resampled Training Data

Did you notice the small number of high-risk loan labels? Perhaps a model that uses resampled data will perform better. You’ll therefore resample the training data and then reevaluate the model. Specifically, you’ll use `RandomOverSampler`.

To do so, complete the following steps:

1. Use the `RandomOverSampler` class from the imbalanced-learn library to resample the data. Be sure to confirm that the labels have an equal number of data points.

2. Use the `LogisticRegression` classifier and the resampled data to fit the model and make predictions.

3. Evaluate the model’s performance by doing the following:

    * Calculate the accuracy score of the model.

    * Generate a confusion matrix.

    * Print the classification report.
    
4. Answer the following question: How well does the logistic regression model, fit with oversampled data, predict both the `0` (healthy loan) and `1` (high-risk loan) labels?

### Write a Credit Risk Analysis Report

For this section, you’ll write a brief report that includes a summary and an analysis of the performance of both machine learning models that you used in this challenge. You should write this report as the `README.md` file included in your GitHub repository.

Structure your report by using the report template that `Starter_Code.zip` includes, and make sure that it contains the following:

1. An overview of the analysis: Explain the purpose of this analysis.


2. The results: Using bulleted lists, describe the balanced accuracy scores and the precision and recall scores of both machine learning models.

3. A summary: Summarize the results from the machine learning models. Compare the two versions of the dataset predictions. Include your recommendation for the model to use, if any, on the original vs. the resampled data. If you don’t recommend either model, justify your reasoning.

In [1]:
# Import the modules
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.metrics import accuracy_score
from sklearn.metrics import balanced_accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from imblearn.metrics import classification_report_imbalanced

# import warnings
# warnings.filterwarnings('ignore')

---

## Split the Data into Training and Testing Sets

### Step 1: Read the `lending_data.csv` data from the `Resources` folder into a Pandas DataFrame.

In [2]:
# Read the CSV file from the Resources folder into a Pandas DataFrame
# Standardize the file path
file_nm_path = Path('Resources/lending_data.csv')
# Read the CSV file into a DataFrame
lending_data_df = pd.read_csv(file_nm_path)

# Review the DataFrame
display(lending_data_df.info(), lending_data_df)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77536 entries, 0 to 77535
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   loan_size         77536 non-null  float64
 1   interest_rate     77536 non-null  float64
 2   borrower_income   77536 non-null  int64  
 3   debt_to_income    77536 non-null  float64
 4   num_of_accounts   77536 non-null  int64  
 5   derogatory_marks  77536 non-null  int64  
 6   total_debt        77536 non-null  int64  
 7   loan_status       77536 non-null  int64  
dtypes: float64(3), int64(5)
memory usage: 4.7 MB


None

| | loan_size | interest_rate | borrower_income | debt_to_income | num_of_accounts | derogatory_marks | total_debt | loan_status |
|---|---|---|---|---|---|---|---|---|
| 0 | 10700.0 | 7.672 | 52800 | 0.431818 | 5 | 1 | 22800 | 0 |
| 1 | 8400.0 | 6.692 | 43600 | 0.311927 | 3 | 0 | 13600 | 0 |
| 2 | 9000.0 | 6.963 | 46100 | 0.349241 | 3 | 0 | 16100 | 0 |
| 3 | 10700.0 | 7.664 | 52700 | 0.430740 | 5 | 1 | 22700 | 0 |
| 4 | 10800.0 | 7.698 | 53000 | 0.433962 | 5 | 1 | 23000 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 77531 | 19100.0 | 11.261 | 86600 | 0.653580 | 12 | 2 | 56600 | 1 |
| 77532 | 17700.0 | 10.662 | 80900 | 0.629172 | 11 | 2 | 50900 | 1 |
| 77533 | 17600.0 | 10.595 | 80300 | 0.626401 | 11 | 2 | 50300 | 1 |
| 77534 | 16300.0 | 10.068 | 75300 | 0.601594 | 10 | 2 | 45300 | 1 |


### Step 2: Create the labels set (`y`)  from the “loan_status” column, and then create the features (`X`) DataFrame from the remaining columns.

In [3]:
# Separate the data into labels and features

# The label is the dependent discrete binary variable, y, representing loan status, where y == 0 for a healthy loan, and y == 1 for an impaired loan
# Separate the y variable, the labels
y = lending_data_df['loan_status']
#display(y.info(), y) # Pandas Series

# The features are the independent variables, collectively X, or the explanatory variables, or predictors
# Separate the X variable, the features
X = lending_data_df.drop(columns=['loan_status'])
#display(X.info(), X) # Pandas DataFrame

In [4]:
# Review the y variable Series
display(y.info(), y) # Pandas Series

<class 'pandas.core.series.Series'>
RangeIndex: 77536 entries, 0 to 77535
Series name: loan_status
Non-Null Count  Dtype
--------------  -----
77536 non-null  int64
dtypes: int64(1)
memory usage: 605.9 KB


None

0        0
1        0
2        0
3        0
4        0
        ..
77531    1
77532    1
77533    1
77534    1
77535    1
Name: loan_status, Length: 77536, dtype: int64

In [5]:
# Review the X variable DataFrame
display(X.info(), X) # Pandas DataFrame

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77536 entries, 0 to 77535
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   loan_size         77536 non-null  float64
 1   interest_rate     77536 non-null  float64
 2   borrower_income   77536 non-null  int64  
 3   debt_to_income    77536 non-null  float64
 4   num_of_accounts   77536 non-null  int64  
 5   derogatory_marks  77536 non-null  int64  
 6   total_debt        77536 non-null  int64  
dtypes: float64(3), int64(4)
memory usage: 4.1 MB


None

| | loan_size | interest_rate | borrower_income | debt_to_income | num_of_accounts | derogatory_marks | total_debt |
|---|---|---|---|---|---|---|---|
| 0 | 10700.0 | 7.672 | 52800 | 0.431818 | 5 | 1 | 22800 |
| 1 | 8400.0 | 6.692 | 43600 | 0.311927 | 3 | 0 | 13600 |
| 2 | 9000.0 | 6.963 | 46100 | 0.349241 | 3 | 0 | 16100 |
| 3 | 10700.0 | 7.664 | 52700 | 0.430740 | 5 | 1 | 22700 |
| 4 | 10800.0 | 7.698 | 53000 | 0.433962 | 5 | 1 | 23000 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 77531 | 19100.0 | 11.261 | 86600 | 0.653580 | 12 | 2 | 56600 |
| 77532 | 17700.0 | 10.662 | 80900 | 0.629172 | 11 | 2 | 50900 |
| 77533 | 17600.0 | 10.595 | 80300 | 0.626401 | 11 | 2 | 50300 |
| 77534 | 16300.0 | 10.068 | 75300 | 0.601594 | 10 | 2 | 45300 |


### Step 3: Check the balance of the labels variable (`y`) by using the `value_counts` function.

In [6]:
# Check the balance of our target values
y_labels_count = y.value_counts()
y_labels_distribution = y.value_counts(normalize=True)
print(f"y labels count:\n{y_labels_count}\n\ny labels weight distribution:\n{y_labels_distribution}")
# The target class majority label to minority label skew is about 30-to-1

y labels count:
0    75036
1     2500
Name: loan_status, dtype: int64

y labels weight distribution:
0    0.967757
1    0.032243
Name: loan_status, dtype: float64


### Step 4: Split the data into training and testing datasets by using `train_test_split`.

In [7]:
# Import the train_test_split function
from sklearn.model_selection import train_test_split

# Split the data using train_test_split
# Assign random_state=1 so that the split is reproducible.
# stratify=y makes the training and testing sets each preserve the label proportions of the original dataset. The default is stratify=None.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, stratify=y)
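
The effect of `stratify=y` can be illustrated with a minimal standard-library sketch. This is not scikit-learn's implementation: `stratified_split` is a hypothetical helper, and the label list is rebuilt from the class counts reported in Step 3 rather than read from the CSV file. It splits each class separately, so both subsets keep the original share of high-risk loans.

```python
import random
from collections import Counter

def stratified_split(labels, test_fraction=0.25, seed=1):
    """Hypothetical helper: split indices so each class keeps its share."""
    rng = random.Random(seed)

    # Group the row indices by class label
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    # Shuffle within each class, then carve off the same fraction from each
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Labels rebuilt from the counts in Step 3: 75,036 healthy and 2,500 high-risk
labels = [0] * 75036 + [1] * 2500
train_idx, test_idx = stratified_split(labels)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print("train:", dict(train_counts))
print("test: ", dict(test_counts))
print("high-risk share in test:", test_counts[1] / len(test_idx))
```

Under these assumptions the test subset holds 18,759 healthy and 625 high-risk labels, the same class sizes that appear as support values in the classification reports further down.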

---

## Create a Logistic Regression Model with the Original Data

###  Step 1: Fit a logistic regression model by using the training data (`X_train` and `y_train`).

In [8]:
# Import the LogisticRegression class from scikit-learn
from sklearn.linear_model import LogisticRegression

# Instantiate the Logistic Regression model
# Assign a random_state parameter of 1 to the model
# Selecting 'liblinear' as the solver because this is a simple binary classification problem.  Liblinear supports both L1 and L2 regularization.
lending_logistic_regression_model = LogisticRegression(solver='liblinear', random_state=1)

# Fit the model using training data
lending_logistic_regression_model.fit(X=X_train, y=y_train)

LogisticRegression(random_state=1, solver='liblinear')

### Step 2: Save the predictions on the testing data labels by using the testing feature data (`X_test`) and the fitted model.

In [9]:
# Make a prediction using the testing data
lending_test_predictions = lending_logistic_regression_model.predict(X_test)
display(len(lending_test_predictions), lending_test_predictions) # 'test' array size verified as 25% of original dataset, generated through train_test_split's default split

19384

array([0, 0, 0, ..., 0, 0, 0])

### Step 3: Evaluate the model’s performance, or classification accuracy, by doing the following:

* Calculate the accuracy score of the model.

* Generate a confusion matrix.

* Print the classification report.

In [1]:
# Create a DataFrame for the test results and inspect prediction of loan health versus actual health (High_Risk_Loan==1)
lending_test_results_df = pd.DataFrame({'lending test predictions': lending_test_predictions, 'actuals': y_test}).reset_index(drop=True)
display(lending_test_results_df)

# Print the 'balanced_accuracy_score' for the model's test data
lending_test_balanced_accuracy_score = balanced_accuracy_score(y_test, lending_test_predictions)
print(f"Balanced accuracy score for model's test data: {lending_test_balanced_accuracy_score}")

# Print the traditional 'accuracy_score' for the model's test data for comparison to the 'balanced_accuracy_score version'
lending_test_accuracy_score = accuracy_score(y_test, lending_test_predictions)
print(f"Traditional accuracy score for model's test data: {lending_test_accuracy_score}")

print(
    "\nVery interesting: if the dataset were balanced, the traditional 'accuracy_score' would by definition equal the 'balanced_accuracy_score'.\n"
    "In this case, the 'balanced_accuracy_score' is materially less than the traditional 'accuracy_score', consistent with our observation of an imbalanced dataset.\n"
    "The 'balanced_accuracy_score' accounts for imbalanced datasets by weighting each sample by the inverse prevalence of its true class,\n"
    "c.f. https://scikit-learn.org/stable/modules/model_evaluation.html#balanced-accuracy-score"
)


In [31]:
# Generate a confusion matrix for the model
lending_test_confusion_matrix = confusion_matrix(y_test, lending_test_predictions, labels = [0, 1]) # Method input order actuals, predictions.  Healthy_Loan==0, High_Risk_Loan==1.
tn, fp, fn, tp = lending_test_confusion_matrix.ravel() # ravel() flattens the 2x2 matrix row by row, so the values unpack as tn, fp, fn, tp
print(lending_test_confusion_matrix)
print(f"\nconfusion matrix components:\ntrue negatives: {tn}, false positives: {fp}, false negatives: {fn}, true positives: {tp}")

[[18678    81]
 [   62   563]]

confusion matrix components:
true negatives: 18678, false positives: 81, false negatives: 62, true positives: 563
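
As a cross-check on the unpacking order, the sketch below flattens the same 2x2 matrix with plain Python and derives the high-risk precision and recall from it. It uses only the counts printed above; the nested list stands in for the NumPy array that `confusion_matrix` returns.

```python
# Confusion matrix from the output above: rows are actual classes, columns are predicted classes
confusion = [[18678, 81],
             [62, 563]]

# Row-major flattening yields tn, fp, fn, tp when labels=[0, 1]
tn, fp, fn, tp = (cell for row in confusion for cell in row)

precision = tp / (tp + fp)    # share of predicted high-risk loans that are truly high-risk
recall = tp / (tp + fn)       # share of actual high-risk loans that the model catches
specificity = tn / (tn + fp)  # share of actual healthy loans that the model clears

print(f"precision={precision:.2f}, recall={recall:.2f}, specificity={specificity:.4f}")
```

The rounded precision and recall agree with the High-Risk Loan row of the classification reports that follow.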


In [12]:
# Define labels for use in classification reports
target_labels = ['Healthy Loan', 'High-Risk Loan']

# Print the classification report for the model
lending_test_imbalanced_classification_report = classification_report_imbalanced(y_test, lending_test_predictions, target_names=target_labels) 
print(f"Classification Report Taking Into Account Imbalanced Loan Status Class:\n{lending_test_imbalanced_classification_report}\n")

# Print the traditional 'classification_report' for the model's test data for comparison to the 'classification_report_imbalanced' version
lending_test_classification_report = classification_report(y_test, lending_test_predictions, target_names=target_labels) 
print(f"Traditional Classification Report Without Taking Into Account Loan Status Class Imbalance:\n{lending_test_classification_report}")

Classification Report Taking Into Account Imbalanced Loan Status Class:
                      pre       rec       spe        f1       geo       iba       sup

  Healthy Loan       1.00      1.00      0.90      1.00      0.95      0.91     18759
High-Risk Loan       0.87      0.90      1.00      0.89      0.95      0.89       625

   avg / total       0.99      0.99      0.90      0.99      0.95      0.90     19384


Traditional Classification Report Without Taking Into Account Loan Status Class Imbalance:
                precision    recall  f1-score   support

  Healthy Loan       1.00      1.00      1.00     18759
High-Risk Loan       0.87      0.90      0.89       625

      accuracy                           0.99     19384
     macro avg       0.94      0.95      0.94     19384
  weighted avg       0.99      0.99      0.99     19384



### Step 4: Answer the following question.

**Question:** How well does the logistic regression model predict both the `0` (healthy loan) and `1` (high-risk loan) labels?

**Answer:** The model predicts healthy loans almost perfectly (precision 1.00, recall 1.00) and high-risk loans less well (precision 0.87, recall 0.90): of the 625 high-risk loans in the test set it misses 62, and it flags 81 healthy loans as high-risk.

These figures need careful reading because the test dataset is imbalanced, with just 625 loans labeled 'high-risk' against 18,759 labeled 'healthy'. The traditional accuracy measure, defined as the ratio of total correct predictions to total predictions [(TP+TN)/(P+N)], can therefore look high even for a model that only ever predicts the majority class label, in this case 'healthy loan'. Such a model would appear highly accurate by traditional measures while having no ability to predict high-risk loans, or to differentiate between healthy and high-risk loans, which is its actual objective.
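
The point about a majority-only model can be checked with simple arithmetic. The sketch below assumes a hypothetical baseline that labels every test loan as healthy; the class sizes come from the support column of the classification report, and the mean of the two per-class recalls is the balanced accuracy discussed further below.

```python
# Test-set class sizes taken from the classification report's support column
n_healthy, n_high_risk = 18759, 625

# Hypothetical baseline: predict "healthy" (0) for every loan
tn, fp = n_healthy, 0      # every healthy loan is classified correctly
fn, tp = n_high_risk, 0    # every high-risk loan is missed

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall_healthy = tn / (tn + fp)
recall_high_risk = tp / (tp + fn)
balanced_accuracy = (recall_healthy + recall_high_risk) / 2

print(f"accuracy:          {accuracy:.4f}")
print(f"balanced accuracy: {balanced_accuracy:.4f}")
```

Plain accuracy stays close to 97% even though this baseline never identifies a single high-risk loan, while the mean per-class recall drops to 0.5.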

According to Brownlee (2021), two metric pairs mitigate bias when evaluating models trained on imbalanced datasets: sensitivity-specificity and recall-precision.
- Sensitivity-Specificity pairing:
    - "For imbalanced classification, the sensitivity [or recall] might be more interesting than the specificity."
    - "Sensitivity and Specificity can be combined into a single score that balances both concerns, called the geometric mean or G-Mean: G-Mean = sqrt(Sensitivity * Specificity)"
- Recall-Precision pairing:
    - "Precision and recall can be combined into a single score that seeks to balance both concerns, called the F-score or the F-measure: F-Measure = (2 * Precision * Recall) / (Precision + Recall)"
    - "The F-Measure is a popular metric for imbalanced classification."
- c.f. Brownlee, Jason, 2021, "Tour of Evaluation Metrics for Imbalanced Classification", https://machinelearningmastery.com/tour-of-evaluation-metrics-for-imbalanced-classification
- Furthermore, Jayaswal (2020) notes that the F-score "is the harmonic mean of precision and recall. It takes both false positive[s] and false negatives into account. Therefore, it performs well on an imbalanced dataset" and "gives the same weightage to recall and precision". There are also F-score versions in which the weighting is set by a beta parameter that states how many times more important recall is than precision; for example, a beta of 2 indicates that recall is twice as important as precision when evaluating or comparing models.
    - c.f. Jayaswal, Vaibhav, 2020, "Performance Metrics: Confusion matrix, Precision, Recall, and F1 Score: Unraveling the Confusion Behind the Confusion Matrix", https://towardsdatascience.com/performance-metrics-confusion-matrix-precision-recall-and-f1-score-a8fe076a2262
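
To make the F-measure and its beta-weighted variant concrete, here is a small sketch that applies the standard F-beta formula to the high-risk counts from the original model's confusion matrix. The `f_beta` function is a hypothetical helper written for this illustration, not a library call.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score; beta > 1 weights recall more heavily than precision."""
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)

# High-risk class counts from the original model's confusion matrix
tp, fp, fn = 563, 81, 62
precision = tp / (tp + fp)
recall = tp / (tp + fn)

f1 = f_beta(precision, recall)            # harmonic mean of precision and recall
f2 = f_beta(precision, recall, beta=2.0)  # recall treated as twice as important as precision

print(f"precision={precision:.4f}, recall={recall:.4f}")
print(f"F1={f1:.4f}, F2={f2:.4f}")
```

Because recall is slightly higher than precision for this class, the F2 score comes out slightly above the F1 score.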

According to scikit-learn.org, the balanced accuracy score function computes the balanced accuracy, which avoids inflated performance estimates on imbalanced datasets. It is the macro-average of recall scores per class or, *equivalently, raw accuracy where each sample is weighted according to the inverse prevalence of its true class.* (italics emphasis added)
- c.f. https://scikit-learn.org/stable/modules/model_evaluation.html#balanced-accuracy-score

And according to statology.org, balanced accuracy is the macro average (that is, the arithmetic mean) of sensitivity (the true positive rate, or recall on the positive class label) and specificity (the true negative rate, or recall on the negative class label):
- `balanced accuracy = (sensitivity + specificity) / 2`
    - c.f. https://www.statology.org/balanced-accuracy
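
The two descriptions of balanced accuracy quoted above can be checked against each other. The sketch below uses the original model's confusion matrix counts and computes the score both as the mean of the per-class recalls and as an accuracy in which every sample is weighted by the inverse size of its true class. It is plain arithmetic, not a call into scikit-learn.

```python
# Original model's confusion matrix counts (labels=[0, 1])
tn, fp, fn, tp = 18678, 81, 62, 563
n_healthy, n_high_risk = tn + fp, fn + tp

# View 1: macro-average of per-class recall (specificity and sensitivity)
specificity = tn / n_healthy
sensitivity = tp / n_high_risk
macro_recall = (sensitivity + specificity) / 2

# View 2: accuracy where each sample is weighted by 1 / (size of its true class)
weight_healthy, weight_high_risk = 1 / n_healthy, 1 / n_high_risk
weighted_correct = tn * weight_healthy + tp * weight_high_risk
weighted_total = n_healthy * weight_healthy + n_high_risk * weight_high_risk
weighted_accuracy = weighted_correct / weighted_total

plain_accuracy = (tn + tp) / (n_healthy + n_high_risk)
print(f"balanced accuracy (macro recall):      {macro_recall:.4f}")
print(f"balanced accuracy (weighted accuracy): {weighted_accuracy:.4f}")
print(f"plain accuracy:                        {plain_accuracy:.4f}")
```

Both views give the same value, which rounds to the 0.95 macro-average recall in the traditional classification report and sits below the plain accuracy.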

A measure similar to balanced accuracy is the geometric mean (G-mean) of recall (or sensitivity) and specificity, which replaces the arithmetic mean with a geometric one:
- `geometric mean, as used in the imbalanced classification report = sqrt(sensitivity * specificity)`
- Note that the geometric mean approaches the arithmetic mean as sensitivity and specificity move closer to one another, and the two means are identical when sensitivity equals specificity.
    - c.f. Anand, Aman, 2021, "Performance measures for Imbalanced Classes", https://dev.to/amananandrai/performance-measures-for-imbalanced-classes-2ojj

The imbalanced classification report also includes a newer metric for evaluating models on imbalanced datasets, the Index of Balanced Accuracy (IBA), introduced in a 2009 academic paper:
- `index balanced accuracy (iba) = (1 + α*Dominance)(GMean²)`, where Dominance is the true positive rate less the true negative rate (recall - specificity), and GMean² is the squared geometric mean, that is, the product of recall and specificity. The weight assigned to Dominance is α, whose default in the imbalanced classification report is 0.1. "The closer the Dominance is to 0, the more balanced both individual rates are.  In practice, Dominance can be interpreted as an indicator of how balanced the TPrate and the TNrate are."  (Garcia, 2009).
    - c.f. Garcia et al., 2009, "Index of Balanced Accuracy: A Performance Measure for Skewed Class Distributions", https://core.ac.uk/download/pdf/61392839.pdf
    - c.f. Anand, Aman, 2021, "Performance measures for Imbalanced Classes", https://dev.to/amananandrai/performance-measures-for-imbalanced-classes-2ojj
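
The G-mean and IBA formulas above can be applied by hand to the original model's confusion matrix. In this sketch `g_mean_and_iba` is a hypothetical helper, and α = 0.1 is the default stated above. It is an illustration of the formulas, not the imbalanced-learn implementation.

```python
import math

def g_mean_and_iba(recall, specificity, alpha=0.1):
    """Geometric mean and index of balanced accuracy for one class."""
    g_mean = math.sqrt(recall * specificity)
    dominance = recall - specificity
    iba = (1 + alpha * dominance) * g_mean ** 2
    return g_mean, iba

# Original model's confusion matrix counts (labels=[0, 1])
tn, fp, fn, tp = 18678, 81, 62, 563
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)

# For the high-risk class, recall is the sensitivity
g_high, iba_high = g_mean_and_iba(sensitivity, specificity)
# For the healthy class the roles swap: its recall is the specificity
g_healthy, iba_healthy = g_mean_and_iba(specificity, sensitivity)

print(f"High-Risk Loan: geo={g_high:.2f}, iba={iba_high:.2f}")
print(f"Healthy Loan:   geo={g_healthy:.2f}, iba={iba_healthy:.2f}")
```

Rounded to two decimals, these values agree with the `geo` and `iba` columns of the imbalanced classification report for the original model.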

---

## Predict a Logistic Regression Model with Resampled Training Data

### Step 1: Use the `RandomOverSampler` class from the imbalanced-learn library to resample the data. Be sure to confirm that the labels have an equal number of data points.

In [18]:
# Import the RandomOverSampler class from imbalanced-learn
from imblearn.over_sampling import RandomOverSampler

# Instantiate the random oversampler algo object.  The strategy here is therefore to randomly oversample, or duplicate, the minority class label (high-risk loans)
# Assign a random_state parameter of 1 to the model.  random_state is the seed used by the random number generator, for replication and validation purposes
rand_oversampler = RandomOverSampler(random_state=1)

# Resample the original training data with the random oversampler to produce a balanced training set
# The change to the class label distribution is applied only to the training data, not to the test data, which is ultimately used to evaluate the performance of the model
X_train_resampled, y_train_resampled = rand_oversampler.fit_resample(X_train, y_train)

In [20]:
# Count the distinct values of the resampled labels data
y_train_labels_resampled_count = y_train_resampled.value_counts()
y_train_labels_resampled_distribution = y_train_resampled.value_counts(normalize=True)
print(f"y_train labels resampled count:\n{y_train_labels_resampled_count}\n\ny_train labels resampled weight distribution:\n{y_train_labels_resampled_distribution}")
# The resampled training labels are now balanced at about 1-to-1, down from around 30-to-1 before random oversampling of the minority class label
# The downside of random oversampling is the risk of overfitting in the model fit on the resampled training data.
# The minority class has been inflated roughly 30 times by randomly duplicating existing minority rows, which adds no new information
# and implies greater certainty about the minority class than was originally observed

y_train labels resampled count:
0    56277
1    56277
Name: loan_status, dtype: int64

y_train labels resampled weight distribution:
0    0.5
1    0.5
Name: loan_status, dtype: float64
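
The idea behind random oversampling can be sketched with the standard library alone. This is a conceptual illustration, not imbalanced-learn's implementation: `random_oversample` is a hypothetical helper, each "row" is just a unique ID, and the minority training count of 1,875 is derived from the 2,500 high-risk loans overall less the 625 in the test set.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=1):
    """Hypothetical helper: duplicate minority rows at random until classes match."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())

    new_rows, new_labels = list(rows), list(labels)
    for label, count in counts.items():
        pool = [row for row, lab in zip(rows, labels) if lab == label]
        extra = rng.choices(pool, k=target - count)  # sampling with replacement
        new_rows.extend(extra)
        new_labels.extend([label] * len(extra))
    return new_rows, new_labels

# Stand-in training set sized like the one above; each "row" is just a unique ID
labels = [0] * 56277 + [1] * 1875
rows = list(range(len(labels)))

rows_resampled, labels_resampled = random_oversample(rows, labels)
resampled_counts = Counter(labels_resampled)
print(dict(resampled_counts))

minority_rows = [r for r, lab in zip(rows_resampled, labels_resampled) if lab == 1]
print("minority rows:", len(minority_rows), "distinct:", len(set(minority_rows)))
```

The class counts end up equal, but the number of distinct minority rows does not change, which is why oversampling adds no new information about high-risk loans.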


### Step 2: Use the `LogisticRegression` classifier and the resampled data to fit the model and make predictions.

In [21]:
# Instantiate another Logistic Regression model object to be used with the resampled data
# Assign a random_state parameter of 1 to the model
# Selecting 'liblinear' as the solver because this is a simple binary classification problem.  Liblinear supports both L1 and L2 regularization.
lending_resampled_logistic_regression_model = LogisticRegression(solver='liblinear', random_state=1)

# Fit the model using the resampled training data
lending_resampled_logistic_regression_model.fit(X=X_train_resampled, y=y_train_resampled)

# Make a new prediction using the testing data following model fit to resampled train data 
lending_resampled_model_test_predictions = lending_resampled_logistic_regression_model.predict(X_test)
display(len(lending_resampled_model_test_predictions), lending_resampled_model_test_predictions) # 'test' array size verified as unchanged: 25% of original dataset, generated through train_test_split's default split

19384

array([0, 1, 0, ..., 0, 0, 0])

### Step 3: Evaluate the model’s performance by doing the following:

* Calculate the accuracy score of the model.

* Generate a confusion matrix.

* Print the classification report.

In [22]:
# Print the 'balanced_accuracy_score' for the resampled model's test data
lending_resampled_test_balanced_accuracy_score = balanced_accuracy_score(y_test, lending_resampled_model_test_predictions)
print(f"Balanced accuracy score for model's test data: {lending_resampled_test_balanced_accuracy_score}")

# Print the traditional 'accuracy_score' for the resampled model's test data for comparison to the 'balanced_accuracy_score version'
lending_resampled_test_accuracy_score = accuracy_score(y_test, lending_resampled_model_test_predictions)
print(f"Traditional accuracy score for model's test data: {lending_resampled_test_accuracy_score}")

Balanced accuracy score for model's test data: 0.9959744975744975
Traditional accuracy score for model's test data: 0.9952022286421791


In [29]:
# Generate a confusion matrix for the resampled model
lending_resampled_test_confusion_matrix = confusion_matrix(y_test, lending_resampled_model_test_predictions, labels = [0, 1]) # Method input order actuals, predictions.  Healthy_Loan==0, High_Risk_Loan==1.
tn_resampled, fp_resampled, fn_resampled, tp_resampled = lending_resampled_test_confusion_matrix.ravel() # ravel() flattens the 2x2 matrix row by row, so the values unpack as tn, fp, fn, tp
print(f"resampled model confusion matrix:\n{lending_resampled_test_confusion_matrix}")
print(f"\nresampled model confusion matrix components:\ntrue negatives: {tn_resampled}, false positives: {fp_resampled}, false negatives: {fn_resampled}, true positives: {tp_resampled}")

resampled model confusion matrix:
[[18668    91]
 [    2   623]]

resampled model confusion matrix components:
true negatives: 18668, false positives: 91, false negatives: 2, true positives: 623
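
For a side-by-side view, the sketch below recomputes high-risk precision, high-risk recall, and balanced accuracy for both models directly from the two confusion matrices printed in this notebook. The dictionary layout and key names are chosen for this illustration only.

```python
# Confusion matrix counts printed earlier for each model (labels=[0, 1])
models = {
    "original":    {"tn": 18678, "fp": 81, "fn": 62, "tp": 563},
    "oversampled": {"tn": 18668, "fp": 91, "fn": 2,  "tp": 623},
}

summary = {}
for name, m in models.items():
    precision = m["tp"] / (m["tp"] + m["fp"])
    recall = m["tp"] / (m["tp"] + m["fn"])
    specificity = m["tn"] / (m["tn"] + m["fp"])
    summary[name] = {
        "precision": precision,
        "recall": recall,
        "balanced_accuracy": (recall + specificity) / 2,
    }
    print(f"{name:>11}: precision={precision:.4f}, recall={recall:.4f}, "
          f"balanced accuracy={summary[name]['balanced_accuracy']:.4f}")
```

The oversampled model trades ten additional false positives for sixty fewer false negatives, so its high-risk recall rises while its precision is essentially unchanged.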


In [32]:
# Define labels for use in resampled model classification report
target_labels = ['Healthy Loan', 'High-Risk Loan']

# Print the traditional 'classification_report' for the resampled model's test predictions
lending_resampled_test_classification_report = classification_report(y_test, lending_resampled_model_test_predictions, target_names=target_labels) 
print(f"Traditional Classification Report for Resampled Model:\n{lending_resampled_test_classification_report}")

Traditional Classification Report for Resampled Model:
                precision    recall  f1-score   support

  Healthy Loan       1.00      1.00      1.00     18759
High-Risk Loan       0.87      1.00      0.93       625

      accuracy                           1.00     19384
     macro avg       0.94      1.00      0.96     19384
  weighted avg       1.00      1.00      1.00     19384



### Step 4: Answer the following question

**Question:** How well does the logistic regression model, fit with oversampled data, predict both the `0` (healthy loan) and `1` (high-risk loan) labels?

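**Answer:** The logistic regression model fit with oversampled training data predicts both labels well on the test set. Healthy loans keep a precision and recall of 1.00. For high-risk loans, recall rises from 0.90 to 1.00 (false negatives fall from 62 to 2 of 625) while precision stays at 0.87 (false positives rise from 81 to 91 of 18,759 healthy loans). The balanced accuracy score is about 0.996, compared with a macro-average recall of 0.95 for the model fit on the original data. In short, the oversampled model catches nearly every high-risk loan at the cost of a few more healthy loans flagged as high-risk.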