# Credit Risk Classification
---

## Overview of the Analysis
### Purpose
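* The purpose of this analysis is to train and compare classification models that predict whether a loan is healthy or at high risk of defaulting.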

### About the data
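* The data (`Resources/lending_data.csv`) contain 77,536 loan records with seven features: `loan_size`, `interest_rate`, `borrower_income`, `debt_to_income`, `num_of_accounts`, `derogatory_marks`, and `total_debt`. The variable to predict is `loan_status`, where 0 means a healthy loan and 1 means a loan with a high risk of defaulting.
* The target is strongly imbalanced: `value_counts` shows 75,036 healthy loans (0) and 2,500 high-risk loans (1).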


### Methods used
The following models from the `sklearn` library are evaluated:
* `LogisticRegression` with original data (Model 1) and scaled data (Model 2)
* `SVC` with scaled data (Model 3)
* `DecisionTreeClassifier` with scaled data (Model 4)
* `RandomForestClassifier` with scaled data (Model 5)
* `KNeighborsClassifier` with scaled data (Model 6)

Note that PCA is not used: with only seven features, the number of dimensions is reasonable and we do not expect any significant improvement from using principal components.

### Stages
We prepare the data for all the models in the next sections by performing the following steps:
* Import all necessary modules (there are no imports within the other code blocks)
* Load the data from the CSV file into a pandas DataFrame
* Split the data between the training and test sets using `train_test_split` from `sklearn`

For each of the models, we then perform the following steps:
* Scaling (optional)
* Fitting (i.e. training the model with the training set)
* Predictions (i.e. applying the fitted model to the test set)
* Performance evaluation (confusion matrix, accuracy score, and classification report, using the local `model_performance` helper)
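
The per-model steps above can be sketched as a single helper. This is a minimal illustration on synthetic data from `make_classification`, not the lending data; the function name `fit_and_predict` and its `scale` flag are assumptions introduced here for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def fit_and_predict(model, X_train, X_test, y_train, scale=True):
    """Optionally scale the features, fit the model, and predict on the test set."""
    if scale:
        # Fit the scaler on the training set only, then apply it to both sets
        scaler = StandardScaler().fit(X_train)
        X_train = scaler.transform(X_train)
        X_test = scaler.transform(X_test)
    model.fit(X_train, y_train)
    return model.predict(X_test)


# Synthetic, imbalanced stand-in for the lending data (illustration only)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=7, weights=[0.9, 0.1], random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=1, stratify=y_demo
)
demo_predictions = fit_and_predict(
    LogisticRegression(random_state=1), X_tr, X_te, y_tr
)
print(demo_predictions[:10])
```

Any of the six estimators could be passed as `model`, since they all expose `fit` and `predict`.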

## Preparation

### Dependencies

In [1]:
# Ignore all warnings
from warnings import simplefilter
simplefilter(action='ignore')

# Import dependencies
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Import the models from SKLearn (Model 1 through Model 6)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

# Local module
from ml_classification import model_performance
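
The source of the local `ml_classification` module is not shown in this notebook. Judging only from the output printed in the Performance sections, `model_performance` could look roughly like the sketch below. The function name `model_performance_sketch`, its body, and the returned DataFrame are assumptions for illustration, not the author's actual implementation.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix


def model_performance_sketch(y_true, y_pred):
    """Print a confusion matrix, the accuracy score, and a classification report."""
    # Rows are the actual classes, columns are the predicted classes
    cm_df = pd.DataFrame(
        confusion_matrix(y_true, y_pred, labels=[0, 1]),
        index=["Actual 0", "Actual 1"],
        columns=["Predicted 0", "Predicted 1"],
    )
    print("Confusion Matrix:")
    print(cm_df)
    print("---")
    print(f"Accuracy Score : {accuracy_score(y_true, y_pred)}")
    print("---")
    print("Classification Report:")
    print(classification_report(y_true, y_pred))
    return cm_df  # returned for convenience; an assumption of this sketch


# Tiny hand-made example: one healthy loan flagged as risky, one risky loan missed
cm_demo = model_performance_sketch([0, 0, 0, 1, 1], [0, 0, 1, 1, 0])
```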

### Load data

In [2]:
# Read the CSV file from the Resources folder into a Pandas DataFrame
csv = Path('Resources/lending_data.csv')
df = pd.read_csv(csv)

# Review the DataFrame
df.head()

| | loan_size | interest_rate | borrower_income | debt_to_income | num_of_accounts | derogatory_marks | total_debt | loan_status |
|---|---|---|---|---|---|---|---|---|
| 0 | 10700.0 | 7.672 | 52800 | 0.431818 | 5 | 1 | 22800 | 0 |
| 1 | 8400.0 | 6.692 | 43600 | 0.311927 | 3 | 0 | 13600 | 0 |
| 2 | 9000.0 | 6.963 | 46100 | 0.349241 | 3 | 0 | 16100 | 0 |
| 3 | 10700.0 | 7.664 | 52700 | 0.43074 | 5 | 1 | 22700 | 0 |
| 4 | 10800.0 | 7.698 | 53000 | 0.433962 | 5 | 1 | 23000 | 0 |


In [3]:
# Review the DataFrame information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77536 entries, 0 to 77535
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   loan_size         77536 non-null  float64
 1   interest_rate     77536 non-null  float64
 2   borrower_income   77536 non-null  int64  
 3   debt_to_income    77536 non-null  float64
 4   num_of_accounts   77536 non-null  int64  
 5   derogatory_marks  77536 non-null  int64  
 6   total_debt        77536 non-null  int64  
 7   loan_status       77536 non-null  int64  
dtypes: float64(3), int64(5)
memory usage: 4.7 MB


In [4]:
# Review the "loan_status" information
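loan_status_counts = df['loan_status'].value_counts()
# Label the classes explicitly rather than relying on the order of the counts
loan_status_counts = loan_status_counts.rename(index={0: 'Healthy (0)', 1: 'High Risk (1)'})
display(loan_status_counts)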

Healthy (0)      75036
High Risk (1)     2500
Name: loan_status, dtype: int64
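
The counts above show a strong class imbalance. A short calculation using those two counts makes it concrete and gives the accuracy of a trivial model that labels every loan as healthy, which is a useful baseline for the accuracy scores reported later. The variable names are illustrative.

```python
import pandas as pd

# Class counts copied from the value_counts output above
counts = pd.Series({"Healthy (0)": 75036, "High Risk (1)": 2500})

# Share of each class in the full dataset
shares = counts / counts.sum()
print(shares.round(4))

# Accuracy of a trivial classifier that always predicts "healthy"
baseline_accuracy = shares["Healthy (0)"]
print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
```

Because this baseline is already close to 97%, precision and recall on class 1 are more informative than accuracy alone.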

### Split data between Training and Test sets
We create the labels set (`y`) from the “loan_status” column, and then create the features (`X`) DataFrame from the remaining columns. We then split the data into training and testing datasets by using `train_test_split`.

In [5]:
# Separate the data into labels and features
# Separate the y variable, the labels
y = df['loan_status']

# Separate the X variable, the features
X = df.drop(columns='loan_status')

# Split the data using train_test_split
# Assign a random_state of 1 and stratify on y to preserve the class proportions
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, stratify=y)
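
The `stratify=y` argument keeps the class proportions the same in both splits, which matters with so few high-risk loans. The sketch below checks this on a synthetic label vector with the same class counts as `loan_status`; the single-column feature matrix is only a placeholder.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as loan_status (75036 vs 2500)
y_demo = np.array([0] * 75036 + [1] * 2500)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=1, stratify=y_demo
)

# With stratification, both splits keep the original share of class 1
print(f"Train share of class 1: {y_tr.mean():.4f}")
print(f"Test share of class 1:  {y_te.mean():.4f}")
print(f"Test set size: {len(y_te)}, class 1 in test: {y_te.sum()}")
```

With the default 25% test size, this gives the same test-set composition as the `support` column in the classification reports (18,759 healthy and 625 high-risk loans).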

---
## Model 1: Logistic Regression Model with the Original Data

### Fitting
We fit a logistic regression model using the training data (`X_train` and `y_train`).

In [6]:
# Instantiate the Logistic Regression model
# Assign a random_state parameter of 1 to the model
model_1 = LogisticRegression(solver='lbfgs', random_state=1, max_iter=100)

# Fit the model using training data
model_1 = model_1.fit(X_train, y_train)

# Score the model
print(f"Training Data Score: {model_1.score(X_train, y_train)}")
print(f"Testing Data Score: {model_1.score(X_test, y_test)}")

Training Data Score: 0.9914878250103177
Testing Data Score: 0.9924164259182832


### Predictions

In [7]:
# Make a prediction using the testing data
print('---')
print("Predictions vs Actual classification:")
predictions_1 = model_1.predict(X_test)
classification_df = pd.DataFrame({'Prediction': predictions_1, "Actual": y_test})
classification_df.head()

---
Predictions vs Actual classification:


| | Prediction | Actual |
|---|---|---|
| 36831 | 0 | 0 |
| 75818 | 0 | 1 |
| 36563 | 0 | 0 |
| 13237 | 0 | 0 |
| 43292 | 0 | 0 |


### Performance

In [8]:
model_performance(y_test, predictions_1)

Confusion Matrix:
          Predicted 0  Predicted 1
Actual 0        18679           80
Actual 1           67          558
---
Accuracy Score : 0.9924164259182832
---
Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     18759
           1       0.87      0.89      0.88       625

    accuracy                           0.99     19384
   macro avg       0.94      0.94      0.94     19384
weighted avg       0.99      0.99      0.99     19384
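
The class 1 figures in the report can be reproduced directly from the confusion matrix. The sketch below uses the four counts printed above, scoring class 1 (high risk); the variable names are illustrative.

```python
# Counts copied from the Model 1 confusion matrix above
actual0_pred0, actual0_pred1 = 18679, 80
actual1_pred0, actual1_pred1 = 67, 558

# Precision for class 1: of the loans predicted high risk, how many really are
precision_1 = actual1_pred1 / (actual1_pred1 + actual0_pred1)

# Recall for class 1: of the truly high-risk loans, how many were caught
recall_1 = actual1_pred1 / (actual1_pred1 + actual1_pred0)

# F1 is the harmonic mean of precision and recall
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Accuracy: correct predictions over all predictions
accuracy = (actual0_pred0 + actual1_pred1) / (
    actual0_pred0 + actual0_pred1 + actual1_pred0 + actual1_pred1
)

print(f"precision={precision_1:.2f} recall={recall_1:.2f} f1={f1_1:.2f} accuracy={accuracy:.4f}")
```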



### Conclusions for Model 1
#### Observations
- The precision for Class 0 is excellent: only 67 of the 18,746 loans predicted as healthy are actually high risk
- In other words, treating "healthy" as the positive outcome, the ratio of false positives (67) to true positives (18,679) is very small
- The precision and recall for Class 1 are not as good (0.87 and 0.89 respectively)
- In other words, this model still has a (very) small tendency to produce false positives (Type I errors): 67 of the 625 high-risk loans in the test set are predicted as healthy

#### Recommendations
**Question:** How well does the logistic regression model predict both the `0` (healthy loan) and `1` (high-risk loan) labels?

A value of 0 in the “loan_status” column means that the loan is healthy. A value of 1 means that the loan has a high risk of defaulting. With "healthy" as the positive outcome, a Type I error (false positive) therefore means that a loan is predicted as "healthy" when it is in fact "risky."

This is not in the favour of the creditor, and a more conservative model (perhaps with a higher tendency for Type II errors, but one that creates fewer Type I errors) may be preferred. The model still shows very good performance and can be recommended for the application, but getting the Class 1 precision and recall closer to 95% could be preferable for creditors who want to avoid exposing themselves to more risk than expected.
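
One way to make a probabilistic classifier more conservative is to lower the probability threshold at which a loan is flagged as high risk. The sketch below illustrates the idea on synthetic data. The 0.3 threshold and all variable names are arbitrary assumptions, and this adjustment is not evaluated anywhere in this analysis.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data (illustration only, not the lending data)
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=7, weights=[0.95, 0.05], random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=1, stratify=y_demo
)

clf = LogisticRegression(random_state=1, max_iter=1000).fit(X_tr, y_tr)

# Probability of class 1 (high risk) for each test sample
proba_high_risk = clf.predict_proba(X_te)[:, 1]

# 0.5 is the usual cut-off; 0.3 is an arbitrary, stricter choice
pred_default = (proba_high_risk >= 0.5).astype(int)
pred_conservative = (proba_high_risk >= 0.3).astype(int)

print(f"Recall for class 1 at 0.5: {recall_score(y_te, pred_default):.3f}")
print(f"Recall for class 1 at 0.3: {recall_score(y_te, pred_conservative):.3f}")
print(f"Loans flagged at 0.5: {pred_default.sum()}, at 0.3: {pred_conservative.sum()}")
```

A lower threshold can only flag the same loans or more, so fewer risky loans are predicted as healthy, at the cost of more healthy loans being flagged as risky.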

---
## Model 2: Logistic Regression Model with the Scaled Data

### Data scaling (`X` only)
The scaled data are saved in two variables: `X_train_scaled` and `X_test_scaled`

In [9]:
# Creating StandardScaler instance
scaler = StandardScaler()

# Fitting the StandardScaler on the training data only
X_scaler = scaler.fit(X_train)

# Scaling data
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)
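
The same "fit the scaler on the training set, then transform both sets" logic can also be expressed with a scikit-learn pipeline, which bundles the scaler and the model so that scaling is always applied consistently. This is an alternative sketch on synthetic data, not the approach used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustration only, not the lending data)
X_demo, y_demo = make_classification(n_samples=1000, n_features=7, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=1, stratify=y_demo
)

# The pipeline fits the scaler on the training data only, then fits the model
pipeline = make_pipeline(StandardScaler(), LogisticRegression(random_state=1))
pipeline.fit(X_tr, y_tr)

# Manual equivalent, mirroring the scaling cell above
scaler = StandardScaler().fit(X_tr)
manual = LogisticRegression(random_state=1).fit(scaler.transform(X_tr), y_tr)

# Both routes should give the same predictions on the test set
same = bool((pipeline.predict(X_te) == manual.predict(scaler.transform(X_te))).all())
print(f"Pipeline matches manual scaling: {same}")
```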

### Fitting, Predictions, and Performance

In [10]:
# Instantiate the Logistic Regression model
# Assign a random_state parameter of 1 to the model
model_2 = LogisticRegression(solver='lbfgs', random_state=1, max_iter=100)

# Fit the model using training data
model_2 = model_2.fit(X_train_scaled, y_train)

# Make a prediction using the testing data
predictions_2 = model_2.predict(X_test_scaled)

# Model performance
model_performance(y_test, predictions_2)


Confusion Matrix:
          Predicted 0  Predicted 1
Actual 0        18669           90
Actual 1           12          613
---
Accuracy Score : 0.9947379281881964
---
Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     18759
           1       0.87      0.98      0.92       625

    accuracy                           0.99     19384
   macro avg       0.94      0.99      0.96     19384
weighted avg       1.00      0.99      0.99     19384



---
## Model 3: SVM with the Scaled Data

In [15]:
# Create a support vector machine linear classifier, and fit it to the training data
model_3 = SVC(kernel='linear')
model_3.fit(X_train_scaled, y_train)

# Print the model score by using the test data
print(f"Test accuracy: {model_3.score(X_test_scaled, y_test):.3f}")

# Make predictions using the testing data
predictions_3 = model_3.predict(X_test_scaled)

# Model performance
model_performance(y_test, predictions_3)

Test accuracy: 0.995
Confusion Matrix:
          Predicted 0  Predicted 1
Actual 0        18669           90
Actual 1           12          613
---
Accuracy Score : 0.9947379281881964
---
Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     18759
           1       0.87      0.98      0.92       625

    accuracy                           0.99     19384
   macro avg       0.94      0.99      0.96     19384
weighted avg       1.00      0.99      0.99     19384



---
## Model 4: Decision Tree with the Scaled Data

In [26]:
# Creating the decision tree classifier instance
model_4 = DecisionTreeClassifier(max_depth=5)

# Fitting the model
model_4 = model_4.fit(X_train_scaled, y_train)

# Making predictions using the testing data
predictions_4 = model_4.predict(X_test_scaled)

# Model performance
model_performance(y_test, predictions_4)

Confusion Matrix:
          Predicted 0  Predicted 1
Actual 0        18668           91
Actual 1            6          619
---
Accuracy Score : 0.9949958728848535
---
Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     18759
           1       0.87      0.99      0.93       625

    accuracy                           0.99     19384
   macro avg       0.94      0.99      0.96     19384
weighted avg       1.00      0.99      1.00     19384



---
## Model 5: Random Forest with the Scaled Data

In [20]:
# Create a random forest classifier
model_5 = RandomForestClassifier(n_estimators=500, random_state=1)

# Fitting the model
model_5 = model_5.fit(X_train_scaled, y_train)

# Making predictions
predictions_5 = model_5.predict(X_test_scaled)

# Model performance
model_performance(y_test, predictions_5)

Confusion Matrix:
          Predicted 0  Predicted 1
Actual 0        18680           79
Actual 1           72          553
---
Accuracy Score : 0.9922100701609575
---
Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     18759
           1       0.88      0.88      0.88       625

    accuracy                           0.99     19384
   macro avg       0.94      0.94      0.94     19384
weighted avg       0.99      0.99      0.99     19384



---
## Model 6: K-nearest neighbours with the Scaled Data

In [16]:
# Instantiate the model
model_6 = KNeighborsClassifier(n_neighbors=4)

# Train the model
model_6.fit(X_train_scaled, y_train)

# Create predictions
predictions_6 = model_6.predict(X_test_scaled)

# Model performance
model_performance(y_test, predictions_6)

Confusion Matrix:
          Predicted 0  Predicted 1
Actual 0        18681           78
Actual 1           73          552
---
Accuracy Score : 0.9922100701609575
---
Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     18759
           1       0.88      0.88      0.88       625

    accuracy                           0.99     19384
   macro avg       0.94      0.94      0.94     19384
weighted avg       0.99      0.99      0.99     19384
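
To compare the six models side by side, the class 1 precision and recall can be recomputed from the confusion matrices reported above. The sketch below only rearranges numbers already shown in this notebook; the dictionary layout and column names are illustrative.

```python
import pandas as pd

# Each tuple holds, in order:
# (actual 0 / predicted 0, actual 0 / predicted 1, actual 1 / predicted 0, actual 1 / predicted 1)
confusion = {
    "Model 1: LogisticRegression (original)": (18679, 80, 67, 558),
    "Model 2: LogisticRegression (scaled)": (18669, 90, 12, 613),
    "Model 3: SVC (scaled)": (18669, 90, 12, 613),
    "Model 4: DecisionTreeClassifier (scaled)": (18668, 91, 6, 619),
    "Model 5: RandomForestClassifier (scaled)": (18680, 79, 72, 553),
    "Model 6: KNeighborsClassifier (scaled)": (18681, 78, 73, 552),
}

rows = []
for name, (a0p0, a0p1, a1p0, a1p1) in confusion.items():
    rows.append({
        "model": name,
        "accuracy": (a0p0 + a1p1) / (a0p0 + a0p1 + a1p0 + a1p1),
        "precision_1": a1p1 / (a1p1 + a0p1),
        "recall_1": a1p1 / (a1p1 + a1p0),
        "missed_high_risk": a1p0,  # high-risk loans predicted as healthy
    })

summary = pd.DataFrame(rows).set_index("model").round(4)
print(summary.sort_values("recall_1", ascending=False))
```

On these numbers, Models 2, 3, and 4 miss far fewer high-risk loans (12, 12, and 6) than Models 1, 5, and 6 (67, 72, and 73), while class 1 precision stays between 0.87 and 0.88 for all six.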

