In [1]:
# Import the modules

import numpy  as np
import pandas as pd

from pathlib         import Path
from sklearn.metrics import accuracy_score, balanced_accuracy_score, confusion_matrix, classification_report

---

## Split the Data into Training and Testing Sets

### Step 1: Read the `lending_data.csv` data from the `Resources` folder into a Pandas DataFrame.

In [2]:
# Read the CSV file from the Resources folder into a Pandas DataFrame
data = pd.read_csv(r'Resources/lending_data.csv')


# Review the DataFrame
data.sample(10, random_state = 123)

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt,loan_status
43735,7900.0,6.485,41600,0.278846,2,0,11600,0
7789,10500.0,7.607,52200,0.425287,4,1,22200,0
10760,10100.0,7.43,50500,0.405941,4,1,20500,0
73493,9500.0,7.169,48100,0.376299,4,0,18100,0
49005,8600.0,6.776,44400,0.324324,3,0,14400,0
39030,10100.0,7.414,50400,0.404762,4,1,20400,0
9743,10700.0,7.677,52800,0.431818,5,1,22800,0
56241,10200.0,7.48,51000,0.411765,4,1,21000,0
30577,8600.0,6.799,44600,0.327354,3,0,14600,0
60206,11400.0,7.956,55500,0.459459,5,1,25500,0


### Step 2: Create the labels set (`y`) from the `loan_status` column, and then create the features (`X`) DataFrame from the remaining columns.

In [3]:
# Separate the data into labels and features

# Separate the y variable, the labels
y = data.loan_status

# Separate the X variable, the features
X = data.iloc[:, :-1]

In [4]:
# Review the y variable Series

y[:10]

0    0
1    0
2    0
3    0
4    0
5    0
6    0
7    0
8    0
9    0
Name: loan_status, dtype: int64

In [5]:
# Review the X variable DataFrame

X.sample(10, random_state = 123)

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt
43735,7900.0,6.485,41600,0.278846,2,0,11600
7789,10500.0,7.607,52200,0.425287,4,1,22200
10760,10100.0,7.43,50500,0.405941,4,1,20500
73493,9500.0,7.169,48100,0.376299,4,0,18100
49005,8600.0,6.776,44400,0.324324,3,0,14400
39030,10100.0,7.414,50400,0.404762,4,1,20400
9743,10700.0,7.677,52800,0.431818,5,1,22800
56241,10200.0,7.48,51000,0.411765,4,1,21000
30577,8600.0,6.799,44600,0.327354,3,0,14600
60206,11400.0,7.956,55500,0.459459,5,1,25500


### Step 3: Check the balance of the labels variable (`y`) by using the `value_counts` function.

In [6]:
# Check the balance of our target values

y.value_counts()

0    75036
1     2500
Name: loan_status, dtype: int64
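
The raw counts are easier to judge as proportions. The following is a minimal sketch that uses only the two counts printed above; the variable names are illustrative and do not appear in the notebook.

```python
# Class counts copied from the value_counts() output above
counts = {0: 75036, 1: 2500}

total = sum(counts.values())

# Share of each label in the full dataset
shares = {label: n / total for label, n in counts.items()}

# Number of healthy loans for every high-risk loan
imbalance_ratio = counts[0] / counts[1]

for label, share in shares.items():
    print(f"label {label}: {share:.1%}")
print(f"healthy loans per high-risk loan: {imbalance_ratio:.1f}")
```

High-risk loans make up only about 3% of the rows, which is why accuracy alone is a weak measure for this dataset.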

### Step 4: Split the data into training and testing datasets by using `train_test_split`.

In [7]:
# Import the train_test_split function
from sklearn.model_selection import train_test_split

# Split the data using train_test_split
# Assign a random_state of 1 to the function
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=1)
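
Because no `test_size` is passed, `train_test_split` falls back to its default of 25% for the testing set. The sketch below reproduces the expected subset sizes from the row count alone; it is plain arithmetic under that assumption, not a call into scikit-learn.

```python
import math

n_samples = 77536          # 75036 + 2500 rows, from value_counts() above
default_test_size = 0.25   # scikit-learn default when no size is given

# The test subset takes a quarter of the rows; the rest go to training
n_test = math.ceil(default_test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

The testing size of `19384` matches the total `support` shown in the classification reports. The split here is not stratified; passing `stratify=y` would be one way to keep the label ratio identical in both subsets.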

---

## Create a Logistic Regression Model with the Original Data

### Step 1: Fit a logistic regression model by using the training data (`x_train` and `y_train`).

In [8]:
# Import the LogisticRegression class from scikit-learn
from sklearn.linear_model import LogisticRegression

# Instantiate the Logistic Regression model
# Assign a random_state parameter of 1 to the model
log_reg_model = LogisticRegression(random_state = 1)

# Fit the model using training data
log_reg_model.fit(x_train, y_train)

LogisticRegression(random_state=1)

### Step 2: Save the predictions on the testing data labels by using the testing feature data (`x_test`) and the fitted model.

In [9]:
# Make a prediction using the testing data

prediction = log_reg_model.predict(x_test)

### Step 3: Evaluate the model’s performance by doing the following:

* Calculate the accuracy score of the model.

* Generate a confusion matrix.

* Print the classification report.

In [10]:
# Print the balanced_accuracy score of the model

print(round(balanced_accuracy_score(y_test, prediction), 3))

0.952


In [11]:
# Print the accuracy score of the model

print(round(accuracy_score(y_test, prediction), 3))

0.992


In [12]:
# Generate a confusion matrix for the model

conf_matrix = confusion_matrix(y_test, prediction)

print('Confusion Matrix: \n',conf_matrix)

Confusion Matrix: 
 [[18663   102]
 [   56   563]]


In [13]:
# Print the classification report for the model
print(classification_report(y_test, prediction))

              precision    recall  f1-score   support

           0       1.00      0.99      1.00     18765
           1       0.85      0.91      0.88       619

    accuracy                           0.99     19384
   macro avg       0.92      0.95      0.94     19384
weighted avg       0.99      0.99      0.99     19384



### Step 4: Answer the following question.

**Question:** How well does the logistic regression model predict both the `0` (healthy loan) and `1` (high-risk loan) labels?

**Answer:** The testing set is as imbalanced as the original dataset: `18765` healthy loans versus `619` high-risk loans. Overall accuracy is `99%`, but that figure is dominated by the majority class, so the per-class scores are more informative. For `0` (healthy loans), precision is `100%` and recall is `99%`. For `1` (high-risk loans), precision is `85%` and recall is `91%`, which means the model misses `56` of the `619` high-risk loans and flags `102` healthy loans as high-risk.
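
The scores in the classification report can be recomputed by hand from the confusion matrix above. This sketch copies the four cell values from that output and also adds a hypothetical baseline that labels every loan as healthy, to show why accuracy and balanced accuracy diverge on imbalanced data. The baseline is an illustration only and is not part of the notebook.

```python
# Confusion matrix from the output above (rows = actual, columns = predicted)
tn, fp = 18663, 102   # actual 0: predicted 0, predicted 1
fn, tp = 56, 563      # actual 1: predicted 0, predicted 1

n_total = tn + fp + fn + tp

precision_1 = tp / (tp + fp)   # of loans flagged high-risk, how many were
recall_1 = tp / (tp + fn)      # of high-risk loans, how many were flagged
recall_0 = tn / (tn + fp)      # of healthy loans, how many were cleared
accuracy = (tp + tn) / n_total
balanced_accuracy = (recall_0 + recall_1) / 2

# Hypothetical baseline: predict 0 for every loan
baseline_accuracy = (tn + fp) / n_total
baseline_balanced_accuracy = (1.0 + 0.0) / 2  # recall is 1.0 for 0, 0.0 for 1

print(f"precision (1): {precision_1:.2f}")
print(f"recall (1):    {recall_1:.2f}")
print(f"accuracy:      {accuracy:.3f}")
print(f"balanced acc.: {balanced_accuracy:.3f}")
print(f"baseline accuracy / balanced: {baseline_accuracy:.3f} / "
      f"{baseline_balanced_accuracy:.3f}")
```

A model that never flags a loan would still reach roughly 97% accuracy here, while its balanced accuracy would drop to `0.5`.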

---

## Predict a Logistic Regression Model with Resampled Training Data

### Step 1: Use the `RandomOverSampler` module from the imbalanced-learn library to resample the data. Be sure to confirm that the labels have an equal number of data points. 

In [14]:
# Import the RandomOverSampler class from imbalanced-learn
from imblearn.over_sampling import RandomOverSampler

# Instantiate the random oversampler model
# Assign a random_state parameter of 1 to the model
randover_model = RandomOverSampler(random_state = 1)

# Resample the original training data with the random oversampler
x_resampled, y_resampled = randover_model.fit_resample(x_train, y_train)

In [15]:
# Count the distinct values of the resampled labels data

y_resampled.value_counts()

0    56271
1    56271
Name: loan_status, dtype: int64
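
Random oversampling balances the classes by drawing extra minority-class rows with replacement until both labels have the same count. The sketch below illustrates that idea on toy labels using only the standard library. The function `random_oversample` is hypothetical and is not how imbalanced-learn is implemented internally.

```python
import random
from collections import Counter


def random_oversample(labels, seed=1):
    """Return row indices that balance the classes by repeating minority rows."""
    rng = random.Random(seed)

    # Group row indices by their label
    rows_by_label = {}
    for idx, label in enumerate(labels):
        rows_by_label.setdefault(label, []).append(idx)

    # Every class is grown to the size of the largest class
    target = max(len(rows) for rows in rows_by_label.values())

    resampled = []
    for rows in rows_by_label.values():
        resampled.extend(rows)
        # Draw the shortfall from the same class, with replacement
        resampled.extend(rng.choices(rows, k=target - len(rows)))
    return resampled


# Toy labels with a similar kind of imbalance (not the lending data)
toy_labels = [0] * 12 + [1] * 3
indices = random_oversample(toy_labels)
balanced_counts = Counter(toy_labels[i] for i in indices)
print(balanced_counts)
```

In the notebook, the training set holds `56271` healthy loans and `1881` high-risk loans (`2500` minus the `619` in the testing set), so the high-risk rows are repeated until both labels reach `56271`. Only the training data is resampled; the testing data keeps its original imbalance.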

### Step 2: Use the `LogisticRegression` classifier and the resampled data to fit the model and make predictions.

In [16]:
# Instantiate the Logistic Regression model
# Assign a random_state parameter of 1 to the model
resampled_log_reg_model = LogisticRegression(random_state = 1)

# Fit the model using the resampled training data
resampled_log_reg_model.fit(x_resampled, y_resampled)

# Make a prediction using the testing data
resampled_prediction = resampled_log_reg_model.predict(x_test)

### Step 3: Evaluate the model’s performance by doing the following:

* Calculate the accuracy score of the model.

* Generate a confusion matrix.

* Print the classification report.

In [17]:
# Print the balanced_accuracy score of the model 

print(round(balanced_accuracy_score(y_test, resampled_prediction), 3))

0.994


In [18]:
# Print the accuracy score of the model 

print(round(accuracy_score(y_test, resampled_prediction), 3))

0.994


In [19]:
# Generate a confusion matrix for the model

print('Confusion Matrix: \n', confusion_matrix(y_test, resampled_prediction))

Confusion Matrix: 
 [[18649   116]
 [    4   615]]


In [20]:
# Print the classification report for the model

print('Classification Report: \n', classification_report(y_test, resampled_prediction))

Classification Report: 
               precision    recall  f1-score   support

           0       1.00      0.99      1.00     18765
           1       0.84      0.99      0.91       619

    accuracy                           0.99     19384
   macro avg       0.92      0.99      0.95     19384
weighted avg       0.99      0.99      0.99     19384



### Step 4: Answer the following question

**Question:** How well does the logistic regression model, fit with oversampled data, predict both the `0` (healthy loan) and `1` (high-risk loan) labels?

**Answer:** Accuracy is about `99%` for both models, so it does not separate them. The confusion matrix does: the model fit on oversampled data produced only `4` false negatives (high-risk loans predicted as healthy), down from `56`, which raises recall for `1` from `91%` to `99%`. The trade-off is a few more false positives (`116` versus `102`), so precision for `1` slips from `85%` to `84%`. Scores for `0` are unchanged at `100%` precision and `99%` recall. Overall, the original model was already accurate, but the oversampled model is better at catching high-risk loans.
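
Placing the two confusion matrices side by side makes the trade-off concrete. This sketch copies the cell values from the two outputs above; the dictionary layout and helper names are illustrative.

```python
# Confusion matrices from the two outputs (rows = actual, columns = predicted)
original = {"tn": 18663, "fp": 102, "fn": 56, "tp": 563}
resampled = {"tn": 18649, "fp": 116, "fn": 4, "tp": 615}


def recall(m):
    """Share of actual high-risk loans that the model flagged."""
    return m["tp"] / (m["tp"] + m["fn"])


def precision(m):
    """Share of flagged loans that were actually high-risk."""
    return m["tp"] / (m["tp"] + m["fp"])


for name, m in (("original", original), ("resampled", resampled)):
    print(f"{name:>9}: FN={m['fn']:>3}  FP={m['fp']:>3}  "
          f"recall={recall(m):.3f}  precision={precision(m):.3f}")

# Change when moving from the original model to the resampled one
fewer_missed = original["fn"] - resampled["fn"]
extra_false_alarms = resampled["fp"] - original["fp"]
print(f"high-risk loans no longer missed: {fewer_missed}")
print(f"additional healthy loans flagged: {extra_false_alarms}")
```

The resampled model catches 52 more high-risk loans at the price of 14 more healthy loans being flagged.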

---

# Module 20 Report


## Overview of the Analysis

In this section, describe the analysis you completed for the machine learning models used in this Challenge. This might include:

* Explain the purpose of the analysis.

    * **The purpose of this analysis is to train and evaluate a model that classifies loans by risk, in order to identify the creditworthiness of borrowers.**


* Explain what financial information the data was on, and what you needed to predict.
    * **The financial data described loan details: loan size, interest rate, borrower income, debt-to-income ratio, number of accounts, derogatory marks, and total debt. The value to predict was the loan status. To make predictions, the data was split into training and testing datasets, a logistic regression model was created and fit using the training data, and then `.predict()` was used on the testing data.**


* Provide basic information about the variables you were trying to predict (e.g., `value_counts`).
    * **The `y` values, the loan status of each loan in the dataset, are what we want to predict. `0` indicates a healthy loan and `1` indicates a high-risk loan. `.value_counts()` shows that there are `75036` healthy loans and `2500` high-risk loans in this dataset.**


* Describe the stages of the machine learning process you went through as part of this analysis.
    * **First, the labels (`y`) and features (`X`) were separated, and the data was split into training and testing sets using `train_test_split()`. Next, a logistic regression model was created and fit using the original training data. Predictions were then made on the testing data, and the model's performance was evaluated by calculating the `accuracy score` and `balanced accuracy score`, generating a `confusion matrix`, and printing a `classification report`. Finally, the training data was resampled using `RandomOverSampler()`, and the same fit, predict, and evaluate steps were repeated with a second logistic regression model.**


* Briefly touch on any methods you used (e.g., `LogisticRegression`, or any resampling method).
    * **`LogisticRegression` was used as the classifier throughout the analysis, and `RandomOverSampler()` was used to resample the training data so that both labels had an equal number of rows (`56271` each). A second logistic regression model was then fit on the resampled data, and its predictions were evaluated on the same testing data.**

## Results

Using bulleted lists, describe the balanced accuracy scores and the precision and recall scores of all machine learning models.

* Machine Learning ***`Model 1`***:

    * Description of ***`Model 1`*** `Accuracy`, `Precision`, and `Recall` scores.
    
    * Balanced Accuracy score: `95% or 0.952` which is good!
    
    * Accuracy Score: `99% or 0.992` which is really good!
    
    * Healthy loans see ideal `precision` and `f1-scores` at `100%`, while recall is at `99%`.
    
    * High-risk loans' `precision` is `85%`, 15 percentage points lower than for healthy loans.
    
    * High-risk loans also have lower `recall` (`91%`) and `f1-score` (`88%`) than healthy loans, by 8 and 12 percentage points (respectively).


* Machine Learning ***`Model 2`***:
    * Description of ***`Model 2`*** `Accuracy`, `Precision`, and `Recall` scores.
    
    * Balanced Accuracy score: `99% or 0.994` which is really good!
    
    * Accuracy score: `99% or 0.994` which is also really good!
    
    * `Precision`, `recall`, and `f1-score` have remained the same for healthy loans.
    
    * High-risk loans saw an increase in `recall` and `f1-score` of 8 and 3 percentage points (respectively), to `99%` and `91%`.
    
    * High-risk loans saw a decrease in `precision` of 1 percentage point, to `84%`.
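
The `f1-score` is the harmonic mean of precision and recall, so it can be checked from the confusion matrix counts for the high-risk label. A minimal sketch, with a hypothetical helper name:

```python
def f1_from_counts(tp, fp, fn):
    """Harmonic mean of precision and recall for the positive label."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Counts for label 1, taken from the two confusion matrices in the notebook
f1_model_1 = f1_from_counts(tp=563, fp=102, fn=56)
f1_model_2 = f1_from_counts(tp=615, fp=116, fn=4)

print(f"Model 1 f1-score (label 1): {f1_model_1:.2f}")
print(f"Model 2 f1-score (label 1): {f1_model_2:.2f}")
```

Because the harmonic mean is pulled toward the lower of the two inputs, the large gain in recall for Model 2 lifts the `f1-score` only modestly while precision stays near `84%`.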

## Summary

Summarize the results of the machine learning models, and include a recommendation on the model to use, if any. For example:

* Which one seems to perform best? How do you know it performs best?
    * **Machine Learning *`Model 2`* seems to perform best: recall and f1-score for high-risk loans increase, as do the macro averages and the balanced accuracy (`0.994` versus `0.952`). Accuracy is about `99%` for both models, so the increases in the other scores are what indicate *`Model 2`* as the better-performing model.**

* Does performance depend on the problem we are trying to solve? (For example, is it more important to predict the `1`'s, or predict the `0`'s? )
    * **It is important to predict both the ones and the zeros in order to determine the creditworthiness of borrowers. Being able to compare the two allows well-rounded predictions and conclusions to be made.**
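
One way to explore whether the choice of model depends on the problem is to attach a cost to each type of error and compare totals. The unit costs below are hypothetical values chosen only for illustration; they are not taken from the dataset or the assignment.

```python
# Hypothetical unit costs (assumptions, not from the source data)
COST_FALSE_NEGATIVE = 10   # a high-risk loan treated as healthy
COST_FALSE_POSITIVE = 1    # a healthy loan flagged as high-risk


def total_cost(fp, fn):
    """Weighted count of errors under the assumed unit costs."""
    return fn * COST_FALSE_NEGATIVE + fp * COST_FALSE_POSITIVE


# Error counts from the two confusion matrices in the notebook
model_1_cost = total_cost(fp=102, fn=56)
model_2_cost = total_cost(fp=116, fn=4)

print(f"Model 1 cost: {model_1_cost}")
print(f"Model 2 cost: {model_2_cost}")
```

With these error counts, Model 1 would come out ahead only if a false positive cost more than about 3.7 times as much as a false negative (52 fewer false negatives against 14 more false positives).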

---