# Module 19 Challenge - Supervised Machine Learning

## Credit Risk Evaluator  - Predicting Credit Risk

### Logistic Regression vs Random Forests Classifier
**Prediction:** Random Forest Classifier will perform better than Logistic Regression classification for the following reasons:

1. Logistic regression is a classification algorithm used to predict a discrete set of classes or categories while Random Forest Cl
2. TBD

    

In [4]:
#Import dependencies

import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.model_selection import train_test_split

Retrieving the data from the lending_data.csv file

In [5]:
# Load dataset
file_path = Path("Resources/lending_data.csv")
credit_df = pd.read_csv(file_path)
credit_df.head()

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt,loan_status
0,10700.0,7.672,52800,0.431818,5,1,22800,0
1,8400.0,6.692,43600,0.311927,3,0,13600,0
2,9000.0,6.963,46100,0.349241,3,0,16100,0
3,10700.0,7.664,52700,0.43074,5,1,22700,0
4,10800.0,7.698,53000,0.433962,5,1,23000,0


Setting the X and y variables from the data in the table

In [6]:
# Dependent variable (y) will be the loan_status and all other data will be Independent (X)

X = credit_df.drop('loan_status', axis=1)
y = credit_df['loan_status']

In [14]:
# showing the values for the variables - X
X.head()

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt
0,10700.0,7.672,52800,0.431818,5,1,22800
1,8400.0,6.692,43600,0.311927,3,0,13600
2,9000.0,6.963,46100,0.349241,3,0,16100
3,10700.0,7.664,52700,0.43074,5,1,22700
4,10800.0,7.698,53000,0.433962,5,1,23000


In [19]:
print("Shape: ", X.shape, y.shape)

Shape:  (77536, 7) (77536,)


Split the data into training and testing data

In [69]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

## Logistic Regression Model

Create a logistic regression model

In [91]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier

LogisticRegression()

Fit (train) our model by using the training data

In [92]:
classifier.fit(X_train, y_train)

LogisticRegression()

Validate the model by using the test data

In [93]:
print(f"Training Data Score: {classifier.score(X_train, y_train)}")
print(f"Testing Data Score: {classifier.score(X_test, y_test)}")

Training Data Score: 0.9676537350392076
Testing Data Score: 0.9680664465538589


## Random Forests Classifier

Create a Random Forest Classifier model

In [73]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

In [74]:
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [78]:
clf = RandomForestClassifier(random_state=1, n_estimators=500).fit(X_train_scaled, y_train)
print(f'Training Score: {clf.score(X_train_scaled, y_train)}')
print(f'Testing Score: {clf.score(X_test_scaled, y_test)}')

Training Score: 0.9971970009629936
Testing Score: 0.9918489475856377


## Testing Logistic Regression with selected data

In [79]:
from sklearn.feature_selection import SelectFromModel

sel = SelectFromModel(clf)
sel.fit(X_train_scaled, y_train)
sel.get_support()

array([False,  True,  True,  True, False, False,  True])

In [80]:
X_selected_train, X_selected_test, y_train, y_test = train_test_split(sel.transform(X), y, random_state=1)
scaler = StandardScaler().fit(X_selected_train)
X_selected_train_scaled = scaler.transform(X_selected_train)
X_selected_test_scaled = scaler.transform(X_selected_test)

This is with the full dataset

In [81]:
clf = LogisticRegression().fit(X_train_scaled, y_train)
print(f'Training Score: {clf.score(X_train_scaled, y_train)}')
print(f'Testing Score: {clf.score(X_test_scaled, y_test)}')

Training Score: 0.9676537350392076
Testing Score: 0.9680664465538589


This is with a select few of the dataset

In [82]:
clf = LogisticRegression()
clf.fit(X_selected_train_scaled, y_train)
print(f'Training Score: {clf.score(X_selected_train_scaled, y_train)}')
print(f'Testing Score: {clf.score(X_selected_test_scaled, y_test)}')

Training Score: 0.9942908240473243
Testing Score: 0.9936545604622369


## Using function that will run for each test

In [86]:
%matplotlib inline
from matplotlib import pyplot as plt

In [83]:
def test_model(model, data):
    X_train_scaled, X_test_scaled, y_train, y_test = data
    reg = model.fit(X_train_scaled, y_train)
    print(f'Model: {type(reg).__name__}')
    print(f'Train score: {reg.score(X_train_scaled, y_train)}')
    print(f'Test Score: {reg.score(X_test_scaled, y_test)}\n')
    plt.show()   

In [84]:
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

data = [X_train_scaled, X_test_scaled, y_train, y_test]

In [88]:
test_model(LogisticRegression(), data)
test_model(RandomForestClassifier(), data)

Model: LogisticRegression
Train score: 0.9676537350392076
Test Score: 0.9680664465538589

Model: RandomForestClassifier
Train score: 0.968289998624295
Test Score: 0.9670862567065621



In [89]:
original_data= [X_train, X_test, y_train, y_test]

In [90]:
test_model(LogisticRegression(), original_data)
test_model(RandomForestClassifier(), original_data)

Model: LogisticRegression
Train score: 0.9676537350392076
Test Score: 0.9680664465538589

Model: RandomForestClassifier
Train score: 0.968289998624295
Test Score: 0.9669830788278992



## Conclusion

There is not a big difference in the performance between both models for the lending data provided. Random Forest Classifier had the higher Train Score of 0.968288 while Logistic Regression Model had a score of 0.967653.

# Solution from the provided starter code

In [111]:
# Split the data into X_train, X_test, y_train, y_test

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt
32815,8200.0,6.598,42700,0.297424,2,0,12700
59572,10500.0,7.591,52000,0.423077,4,1,22000
42325,7600.0,6.351,40400,0.257426,2,0,10400
39070,9700.0,7.254,48900,0.386503,4,0,18900
42524,11800.0,8.155,57300,0.47644,6,1,27300


In [109]:
# Train a Logistic Regression model print the model score

0.9908171687990095

In [110]:
# Train a Random Forest Classifier model and print the model score

0.9910751134956666