In [1]:
# Importing the modules and dependencies needed to load the data and evaluate the model.
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.metrics import confusion_matrix, classification_report

---

## Split the Data into Training and Testing Sets

### Step 1: Read the `lending_data.csv` data from the `Resources` folder into a Pandas DataFrame.

In [2]:
# Reading the CSV file from the folder into a Pandas DataFrame.
df = pd.read_csv("lending_data.csv")

# Reviewing the DataFrame so we can verify our data and create labels and features from there.
print(df.head())

   loan_size  interest_rate  borrower_income  debt_to_income  num_of_accounts  \
0    10700.0          7.672            52800        0.431818                5   
1     8400.0          6.692            43600        0.311927                3   
2     9000.0          6.963            46100        0.349241                3   
3    10700.0          7.664            52700        0.430740                5   
4    10800.0          7.698            53000        0.433962                5   

   derogatory_marks  total_debt  loan_status  
0                 1       22800            0  
1                 0       13600            0  
2                 0       16100            0  
3                 1       22700            0  
4                 1       23000            0  


### Step 2: Create the labels set (`y`) from the `loan_status` column, and then create the features (`X`) DataFrame from the remaining columns.

In [3]:
# Separating the data into labels and features.
# We separate 'loan_status' from the DataFrame above and assign it to our 'y' variable, which will be the label.
y = df["loan_status"] 

# We keep every column except 'loan_status' by dropping the 'loan_status' column.
# The remaining columns become our 'X' variable, the features.
X = df.drop(columns=["loan_status"])

In [4]:
# Reviewing the y variable Series
print(y.value_counts())

loan_status
0    75036
1     2500
Name: count, dtype: int64
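
The counts above show a strong class imbalance, which matters when reading the accuracy figure later on. As a rough sketch (the counts are copied from the output above; the names `counts`, `shares`, and `baseline_accuracy` are illustrative and not part of the original notebook), the class shares and the accuracy of a trivial "always predict healthy" baseline can be computed like this:

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({0: 75036, 1: 2500}, name="count")

# Share of each class in the full dataset.
shares = counts / counts.sum()
print(shares.round(4))

# Accuracy of a trivial baseline that always predicts the majority class (0).
baseline_accuracy = shares.max()
print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
```

High-risk loans make up only about 3.2% of the rows, so a model has to beat roughly 96.8% accuracy before accuracy alone says much.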


In [5]:
# Reviewing the X variable DataFrame
print(X.head())

   loan_size  interest_rate  borrower_income  debt_to_income  num_of_accounts  \
0    10700.0          7.672            52800        0.431818                5   
1     8400.0          6.692            43600        0.311927                3   
2     9000.0          6.963            46100        0.349241                3   
3    10700.0          7.664            52700        0.430740                5   
4    10800.0          7.698            53000        0.433962                5   

   derogatory_marks  total_debt  
0                 1       22800  
1                 0       13600  
2                 0       16100  
3                 1       22700  
4                 1       23000  


### Step 3: Split the data into training and testing datasets by using `train_test_split`.

In [6]:
# Import the train_test_split function, which is needed to split the data into training and testing sets.
from sklearn.model_selection import train_test_split

# Split the data using train_test_split
# Assign a random_state of 1 to the function; this is so we can get the same result every time we test it.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
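
By default `train_test_split` holds out 25% of the rows for testing, which is consistent with the 19,384 test rows (out of 77,536) that appear in the classification report. With so few high-risk loans, passing `stratify=y` is one option for keeping the class ratio identical in both splits. The sketch below uses a small synthetic dataset (`X_demo` and `y_demo` are made up for illustration and are not the lending data) to show both behaviours:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the lending data: 1,000 rows, 4% positives (hypothetical).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.zeros(1000, dtype=int)
y_demo[:40] = 1

# Default split: 75% of the rows for training, 25% for testing.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)
print(len(X_tr), len(X_te))

# Stratified split: each subset keeps the same share of positives as y_demo.
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X_demo, y_demo, random_state=1, stratify=y_demo
)
print(y_tr_s.mean(), y_te_s.mean())
```

Without `stratify`, the number of positives that lands in the test set depends on the shuffle; with it, the split is proportional by construction.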

---

## Create a Logistic Regression Model with the Original Data

### Step 1: Fit a logistic regression model by using the training data (`X_train` and `y_train`).

In [7]:
# Import the LogisticRegression class from scikit-learn; it will be used to fit the model and make predictions.
from sklearn.linear_model import LogisticRegression

# Instantiate the Logistic Regression model
# Assign a random_state parameter of 1 to the model; again, this is so we get the same results every time we run it.
model = LogisticRegression(random_state=1)

# Fit the model using training data
model.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
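
The warning above means the solver stopped at its iteration limit (`max_iter` defaults to 100) before converging. The two remedies it suggests are raising `max_iter` or scaling the features. Here is a minimal sketch of the scaling option, assuming a `StandardScaler` plus `LogisticRegression` pipeline on synthetic data (`X_demo`, `y_demo`, and the column scales are invented for illustration; this is not what the notebook ran):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales, loosely imitating
# a loan size (thousands) and an interest rate (single digits).
rng = np.random.default_rng(1)
n = 2000
loan_size = rng.normal(10000, 2000, n)
rate = rng.normal(7.0, 1.0, n)
X_demo = np.column_stack([loan_size, rate])

# Hypothetical label: larger, more expensive loans are marked high-risk.
noise = rng.normal(0, 500, n)
y_demo = (loan_size + 1000 * (rate - 7.0) + noise > 12500).astype(int)

# Standardize each feature to zero mean and unit variance, then fit.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(random_state=1))
scaled_model.fit(X_demo, y_demo)
print(f"Training accuracy: {scaled_model.score(X_demo, y_demo):.3f}")

# Alternative: keep the raw features but allow more solver iterations,
# e.g. LogisticRegression(max_iter=1000, random_state=1).
```

Putting the scaler inside a pipeline means it is fitted on the training data only and then reused unchanged when predicting on the test data.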


### Step 2: Save the predictions on the testing data labels by using the testing feature data (`X_test`) and the fitted model.

In [8]:
# Make a prediction using the testing data. This generates an array of predicted loan risk classifications (0 = low risk, 1 = high risk),
# which will later be compared against the actual values (y_test) to evaluate model performance.
y_pred = model.predict(X_test)

### Step 3: Evaluate the model’s performance by doing the following:

* Generate a confusion matrix.

* Print the classification report.

In [12]:
# Generate a confusion matrix for the model. We import confusion_matrix and assign its result to 'confusion' so we can print it.
# The confusion matrix is a table that evaluates the performance of a classification model, showing the count of correct and incorrect predictions.
from sklearn.metrics import confusion_matrix

confusion = confusion_matrix(y_test, y_pred)
print(confusion)

[[18655   110]
 [   36   583]]
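
In scikit-learn's layout, rows are the actual classes and columns are the predicted classes, so the matrix above reads as 18,655 true negatives, 110 false positives, 36 false negatives, and 583 true positives. As a sketch (the values are copied from the output above; the variable names are illustrative), the scores for the `1` class can be derived by hand:

```python
import numpy as np

# Confusion matrix copied from the output above.
# Rows: actual class (0, 1). Columns: predicted class (0, 1).
confusion = np.array([[18655, 110],
                      [36, 583]])

# Flatten in row order: true negatives, false positives,
# false negatives, true positives.
tn, fp, fn, tp = confusion.ravel()

precision_1 = tp / (tp + fp)   # of loans flagged high-risk, how many really are
recall_1 = tp / (tp + fn)      # of truly high-risk loans, how many were flagged
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
accuracy = (tp + tn) / confusion.sum()

print(f"precision={precision_1:.2f} recall={recall_1:.2f} "
      f"f1={f1_1:.2f} accuracy={accuracy:.2f}")
```

These hand-computed values round to 0.84, 0.94, 0.89, and 0.99.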


In [13]:
# Print the classification report for the model. We import classification_report and assign its result to 'report' so we can print it.
# The classification report is a textual summary that evaluates a classification model's performance.
from sklearn.metrics import classification_report

report = classification_report(y_test, y_pred)
print(report)

              precision    recall  f1-score   support

           0       1.00      0.99      1.00     18765
           1       0.84      0.94      0.89       619

    accuracy                           0.99     19384
   macro avg       0.92      0.97      0.94     19384
weighted avg       0.99      0.99      0.99     19384
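
The `macro avg` and `weighted avg` rows differ only in how they combine the per-class scores: the macro average treats both classes equally, while the weighted average weights each class by its support. Here is a sketch that reproduces the recall column from the confusion matrix printed earlier (values copied from that output; variable names are illustrative):

```python
import numpy as np

# Confusion matrix from the earlier output (rows: actual, columns: predicted).
confusion = np.array([[18655, 110],
                      [36, 583]])

# Support is the number of actual samples in each class.
support = confusion.sum(axis=1)

# Per-class recall: correct predictions on the diagonal divided by support.
recall = np.diag(confusion) / support

macro_recall = recall.mean()                            # classes count equally
weighted_recall = np.average(recall, weights=support)   # classes count by size

print(f"recall per class: {recall.round(2)}")
print(f"macro avg recall: {macro_recall:.2f}")
print(f"weighted avg recall: {weighted_recall:.2f}")
```

The macro average (0.97) is pulled down by the weaker high-risk class, whereas the weighted average (0.99) is dominated by the 18,765 healthy loans.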



### Step 4: Answer the following question.

**Question:** How well does the logistic regression model predict both the `0` (healthy loan) and `1` (high-risk loan) labels?

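**Answer:** The model predicts both labels well, but it is stronger on healthy loans. For the `0` (healthy loan) label, precision is 1.00, recall is 0.99, and the f1-score is 1.00. For the `1` (high-risk loan) label, precision is 0.84 and recall is 0.94 (f1-score 0.89): the model catches 583 of the 619 high-risk loans in the test set, but 110 of the 693 loans it flags as high-risk are actually healthy. Overall accuracy is 0.99. That figure is dominated by the much larger healthy-loan class, so the per-class scores for `1` are the better guide to how well the model identifies risky loans.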

---