# Module 19 Challenge - Supervised Machine Learning

## Credit Risk Evaluator - Predicting Credit Risk

### Logistic Regression vs Random Forest Classifier

**Prediction:** The Random Forest Classifier will perform better than Logistic Regression, because it fits many decision tree classifiers on different sub-samples of the dataset and averages their predictions, which improves predictive accuracy and controls overfitting.
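
The intuition behind averaging many classifiers can be illustrated with a simplified calculation. The sketch below assumes, hypothetically, that every tree votes independently and is correct with the same probability `p`. Real random forest trees are correlated, so the actual gain is smaller. The function name `majority_vote_accuracy` and the value `p = 0.7` are illustrative and do not come from the notebook.

```python
from math import comb


def majority_vote_accuracy(p: float, n_voters: int) -> float:
    """Probability that a majority of n_voters independent classifiers is correct.

    Assumes each classifier is right with probability p and that n_voters is odd,
    so a tie cannot occur.
    """
    needed = n_voters // 2 + 1  # smallest number of correct votes that forms a majority
    return sum(
        comb(n_voters, k) * p**k * (1 - p) ** (n_voters - k)
        for k in range(needed, n_voters + 1)
    )


# Illustrative ensemble sizes; accuracy of the vote grows with the number of voters.
for n in (1, 5, 25, 101):
    print(n, round(majority_vote_accuracy(0.7, n), 4))
```

Under these idealized assumptions the accuracy of the vote rises as voters are added, which is the effect the prediction relies on.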

In [1]:
# Import dependencies

import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.model_selection import train_test_split

Retrieving the data from the lending_data.csv file

In [2]:
# Load dataset
file_path = Path("Resources/lending_data.csv")
credit_df = pd.read_csv(file_path)
credit_df.head()

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt,loan_status
0,10700.0,7.672,52800,0.431818,5,1,22800,0
1,8400.0,6.692,43600,0.311927,3,0,13600,0
2,9000.0,6.963,46100,0.349241,3,0,16100,0
3,10700.0,7.664,52700,0.43074,5,1,22700,0
4,10800.0,7.698,53000,0.433962,5,1,23000,0


Setting the X and y variables from the data in the table

In [3]:
# Dependent variable (y) will be the loan_status and all other data will be Independent (X)

X = credit_df.drop('loan_status', axis=1)
y = credit_df['loan_status']

In [4]:
# showing the values for the variables - X
X.head()

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt
0,10700.0,7.672,52800,0.431818,5,1,22800
1,8400.0,6.692,43600,0.311927,3,0,13600
2,9000.0,6.963,46100,0.349241,3,0,16100
3,10700.0,7.664,52700,0.43074,5,1,22700
4,10800.0,7.698,53000,0.433962,5,1,23000


In [5]:
print("Shape: ", X.shape, y.shape)

Shape:  (77536, 7) (77536,)


Split the data into training and testing data

In [6]:
# train_test_split was imported in the first cell; by default it holds out 25% of the rows for testing

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
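
Because no `test_size` is passed, the split uses scikit-learn's default of 25% for testing. As a rough check of the resulting set sizes, the arithmetic can be sketched as follows. The variable names are illustrative, and the row count is taken from the `X.shape` output above.

```python
from math import ceil

n_samples = 77536  # rows reported by X.shape
test_size = 0.25   # scikit-learn's default when neither test_size nor train_size is given

# The test set size is rounded up; the training set gets the remaining rows.
n_test = ceil(test_size * n_samples)
n_train = n_samples - n_test

print(f"train rows: {n_train}, test rows: {n_test}")
```

The test row count agrees with the total support of 19384 shown in the classification report.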

Scaling the data to similar numeric scales so that the magnitude of one feature doesn’t bias the model during training.

In [7]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
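
To make the scaling step concrete, here is a minimal sketch of the same idea for a single feature, written with the standard library only. The values in `train` and `test` are hypothetical. The point it illustrates is that the mean and standard deviation are learned from the training values alone and then reused for the test values, which mirrors fitting the scaler on `X_train` only.

```python
from statistics import fmean, pstdev

# Hypothetical values of one feature, already split into train and test.
train = [8400.0, 9000.0, 10700.0, 10800.0]
test = [9500.0, 12000.0]

# Learn the statistics from the training values only.
mu = fmean(train)
sigma = pstdev(train)  # population standard deviation (divides by n)

# Apply the same training statistics to both sets.
train_scaled = [(x - mu) / sigma for x in train]
test_scaled = [(x - mu) / sigma for x in test]

# The scaled training values have mean 0 and standard deviation 1 (up to rounding).
print(round(fmean(train_scaled), 6), round(pstdev(train_scaled), 6))
print([round(x, 3) for x in test_scaled])
```

The scaled test values are generally not centered at exactly zero, which is expected: they are measured against the training distribution.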

## Logistic Regression Model

Create a logistic regression model

In [8]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier

LogisticRegression()

Fit (train) our model by using the training data

In [9]:
classifier.fit(X_train_scaled, y_train)

LogisticRegression()

Validate the Logistic Regression model by using the test data

In [10]:
print(f"Training Data Score: {classifier.score(X_train_scaled, y_train)}")
print(f"Testing Data Score: {classifier.score(X_test_scaled, y_test)}")

Training Data Score: 0.9941188609162196
Testing Data Score: 0.9941704498555509


Make Predictions

In [11]:
y_pred = classifier.predict(X_test_scaled)

print(f'Actual:\t\t{list(y_test[:10])}')
print(f'Predicted:\t{list(y_pred[:10])}')

Actual:		[0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
Predicted:	[0, 0, 0, 0, 0, 0, 0, 0, 1, 0]


Classification Report

In [12]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      0.99      1.00     18792
           1       0.85      0.98      0.91       592

    accuracy                           0.99     19384
   macro avg       0.93      0.99      0.95     19384
weighted avg       0.99      0.99      0.99     19384
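
The support column shows that the classes are heavily imbalanced, so overall accuracy should be read with care. As a rough illustration using only the numbers in the report, the sketch below computes the accuracy of a hypothetical baseline that always predicts class 0, and recomputes the class 1 F1 score from its rounded precision and recall. The variable names are illustrative.

```python
# Support counts and class 1 metrics copied from the classification report.
support_0, support_1 = 18792, 592
precision_1, recall_1 = 0.85, 0.98

# Accuracy of a hypothetical baseline that always predicts class 0.
baseline_accuracy = support_0 / (support_0 + support_1)

# F1 is the harmonic mean of precision and recall.
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"Always-predict-0 accuracy: {baseline_accuracy:.4f}")
print(f"Class 1 F1 from rounded precision and recall: {f1_1:.2f}")
```

Since the trivial baseline already scores close to 0.97, the per-class precision and recall for class 1 are more informative than accuracy alone.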



## Random Forest Classifier

Import the Random Forest Classifier

In [13]:
from sklearn.ensemble import RandomForestClassifier

Fit (train) the Random Forest Classifier model and validate it by using the test data

In [14]:
clf = RandomForestClassifier(random_state=1, n_estimators=500).fit(X_train_scaled, y_train)
print(f'Training Score: {clf.score(X_train_scaled, y_train)}')
print(f'Testing Score: {clf.score(X_test_scaled, y_test)}')

Training Score: 0.9971970009629936
Testing Score: 0.991900536524969
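
To compare the two models side by side, the reported scores can be placed in one small table and the gap between training and testing score computed for each. This is only a sketch that reuses the numbers printed above; the dictionary layout and variable names are illustrative.

```python
# Scores copied from the notebook outputs above.
scores = {
    "LogisticRegression": {"train": 0.9941188609162196, "test": 0.9941704498555509},
    "RandomForestClassifier": {"train": 0.9971970009629936, "test": 0.991900536524969},
}

for name, s in scores.items():
    gap = s["train"] - s["test"]  # a positive gap means a lower score on unseen data
    print(f"{name}: train={s['train']:.4f} test={s['test']:.4f} gap={gap:+.4f}")

# Model with the higher score on the held-out test data.
best_on_test = max(scores, key=lambda name: scores[name]["test"])
print("Higher testing score:", best_on_test)
```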


## Conclusion

There is not a big difference in performance between the two models on the lending data provided. The Random Forest Classifier had the higher training score, 0.9972 versus 0.9941 for the Logistic Regression model, a difference of only about 0.003. On the testing data, however, the Logistic Regression model scored slightly higher, 0.9942 versus 0.9919.

The prediction is therefore only partly supported. The Random Forest Classifier fit the training data better, but that advantage did not carry over to the testing data, and its larger gap between training and testing scores suggests mild overfitting. In principle, a random forest considers a random selection of features each time, which provides a more diverse ensemble to aggregate over and can produce a more accurate predictor. [Source](https://towardsdatascience.com/ensemble-methods-in-machine-learning-what-are-they-and-why-use-them-68ec3f9fef5f)