# Predicting Credit Risk - Supervised Machine Learning

### Prediction before creating, fitting and scoring the models:
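#### For the given data, I expect RandomForestClassifier to perform better than LogisticRegression. Both are classifiers that predict a discrete class label (here, `loan_status`); despite its name, LogisticRegression does not predict a continuous quantity. It fits a single linear decision boundary, whereas a random forest combines many decision trees, each of which tries to select the best feature at every split, so it can capture non-linear relationships between the features and the label.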

In [1]:
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.model_selection import train_test_split

In [2]:
# Import the data
file_path = Path("Resources/lending_data.csv")
lending_data = pd.read_csv(file_path)
lending_data.head()

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt,loan_status
0,10700.0,7.672,52800,0.431818,5,1,22800,0
1,8400.0,6.692,43600,0.311927,3,0,13600,0
2,9000.0,6.963,46100,0.349241,3,0,16100,0
3,10700.0,7.664,52700,0.43074,5,1,22700,0
4,10800.0,7.698,53000,0.433962,5,1,23000,0


In [3]:
# Assign the data to X and y
# Note: sklearn expects X to be two-dimensional. Selecting columns with a
# list returns a DataFrame, which is already 2-D, so no reshape() is needed.

X = lending_data[[ 'loan_size'
                 , 'interest_rate'
                 , 'borrower_income'
                 , 'debt_to_income'
                 , 'num_of_accounts'
                 , 'derogatory_marks'
                 , 'total_debt']]

y = lending_data['loan_status']

print("Shape: ", X.shape, y.shape)

Shape:  (77536, 7) (77536,)
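Because the scores later in this notebook are plain accuracy, it can help to check how the two `loan_status` classes are distributed before training. The sketch below uses a small made-up Series (the values are illustrative and not taken from `lending_data.csv`) and assumes 0 and 1 are the only labels.

```python
import pandas as pd

# Hypothetical labels standing in for lending_data['loan_status'];
# these values are made up for illustration.
y_demo = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 0], name="loan_status")

# Count of each class and its share of the rows
class_counts = y_demo.value_counts()
class_share = y_demo.value_counts(normalize=True)

# Accuracy of a model that always predicts the most common class
majority_baseline = class_share.max()

print(class_counts)
print(f"Majority-class baseline accuracy: {majority_baseline:.2f}")
```

If one class dominates, a high accuracy score says less than it seems, since always predicting the majority class already scores close to that share.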


In [4]:
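# Split the data into X_train, X_test, y_train, y_test
# (train_test_split was already imported in the first cell)
# No random_state is set, so the split, and therefore the scores below,
# will vary slightly from run to run.
X_train, X_test, y_train, y_test = train_test_split(X, y)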
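A possible variation on the split, shown here as a sketch on made-up data rather than as what this notebook ran: passing `random_state` makes the split repeatable, and `stratify` keeps the class ratio the same in both parts. The array names `X_demo` and `y_demo` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 rows, 2 features, 10% positive labels
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = np.array([1] * 10 + [0] * 90)

# random_state fixes the shuffle; stratify keeps the class ratio in both parts.
# With no test_size given, 25% of the rows go to the test set.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=1, stratify=y_demo
)

print("Train/test shapes:", X_tr.shape, X_te.shape)
print("Positive share in train/test:", y_tr.mean(), y_te.mean())
```

Re-running the notebook with a seeded split would change the exact scores reported below, so the recorded outputs here correspond to the unseeded call.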

# Train a Logistic Regression model and print the model score

In [5]:
# Create a logistic regression model
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=1)
classifier

LogisticRegression(random_state=1)

In [6]:
# Fit (train) the model by using the training data
classifier.fit(X_train, y_train)

LogisticRegression(random_state=1)

In [7]:
# Score the model on both the training data and the test data
print(f"Training Data Score: {classifier.score(X_train, y_train)}")
print(f"Testing Data Score : {classifier.score(X_test, y_test)}")

Training Data Score: 0.992192873847847
Testing Data Score : 0.9913330581923235


In [8]:
# make predictions
print(f'Actual:\t\t{list(y_test[:15])}')
print(f'Predicted:\t{list(classifier.predict(X_test[:15]))}')

Actual:		[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
Predicted:	[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
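Comparing the first 15 labels by eye only covers a handful of rows. One way to summarize every prediction at once is a confusion matrix; the sketch below uses short made-up label lists (`y_true_demo` and `y_pred_demo` are hypothetical), where in this notebook one would pass `y_test` and `classifier.predict(X_test)` instead.

```python
from sklearn.metrics import confusion_matrix

# Made-up actual and predicted labels, for illustration only
y_true_demo = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
y_pred_demo = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# Rows are actual classes, columns are predicted classes (label order 0, 1)
cm = confusion_matrix(y_true_demo, y_pred_demo)
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
recall = tp / (tp + fn)        # share of actual 1s that were found
precision = tp / (tp + fp)     # share of predicted 1s that were correct

print(cm)
print(f"accuracy={accuracy:.2f} recall={recall:.2f} precision={precision:.2f}")
```

Recall and precision for class 1 show how well the higher-risk loans are identified, which overall accuracy can hide when class 0 dominates.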


# Train a Random Forest Classifier model and print the model score

In [9]:
from sklearn.ensemble import RandomForestClassifier

# Create a RandomForestClassifier model
classifier2 = RandomForestClassifier(random_state=1, n_estimators=500)
classifier2

RandomForestClassifier(n_estimators=500, random_state=1)

In [10]:
# Using the same training data as the Logistic Regression model,
# train a Random Forest Classifier

# Fit (train) the model by using the training data
classifier2.fit(X_train, y_train)
classifier2

RandomForestClassifier(n_estimators=500, random_state=1)

In [11]:
# Score the model on both the training data and the test data
print(f'Training Score: {classifier2.score(X_train, y_train)}')
print(f'Testing Score : {classifier2.score(X_test, y_test)}')

Training Score: 0.997214197276104
Testing Score : 0.9920553033429633


In [12]:
# make predictions

print(f'Actual:\t\t{list(y_test[:15])}')
print(f'Predicted:\t{list(classifier2.predict(X_test[:15]))}')

Actual:		[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
Predicted:	[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]


# Using scaled data

In [13]:
from sklearn.preprocessing import StandardScaler

In [14]:
# Fit the scaler on the training data only, then apply the same
# transformation to both the training and the test sets
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
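To make the scaling step concrete, the sketch below checks on a tiny made-up array that `StandardScaler` subtracts each column's training mean and divides by its training standard deviation, and that the test rows are transformed with those same training statistics. The `*_demo` names and values are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical values: two columns on very different scales
X_train_demo = np.array([[8000.0, 6.5], [10000.0, 7.5], [12000.0, 8.5]])
X_test_demo = np.array([[11000.0, 7.0]])

scaler_demo = StandardScaler().fit(X_train_demo)

# Manual equivalent: subtract the training mean, divide by the training std
train_mean = X_train_demo.mean(axis=0)
train_std = X_train_demo.std(axis=0)  # population std (ddof=0), as StandardScaler uses
manual_test_scaled = (X_test_demo - train_mean) / train_std

print(scaler_demo.transform(X_test_demo))
print(manual_test_scaled)
```

Fitting on the training set alone keeps information about the test rows out of the preprocessing step.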

In [15]:
# LogisticRegression using scaled data
# Fit (train) the model by using the scaled training data

scaled_classifier = LogisticRegression().fit(X_train_scaled, y_train)
print(f'Training Score: {scaled_classifier.score(X_train_scaled, y_train)}')
print(f'Testing Score : {scaled_classifier.score(X_test_scaled, y_test)}')

# make predictions
print(f'Actual:\t\t{list(y_test[:15])}')
print(f'Predicted:\t{list(scaled_classifier.predict(X_test_scaled[:15]))}')

Training Score: 0.9941016646031091
Testing Score : 0.9942220387948824
Actual:		[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
Predicted:	[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]


In [16]:
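# RandomForestClassifier using scaled data
# Tree-based models split on feature thresholds, so scaling is not
# expected to change the result much
# Fit (train) the model by using the scaled training data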
scaled_classifier2 = RandomForestClassifier(random_state=1, n_estimators=500).fit(X_train_scaled, y_train)
print(f'Training Score: {scaled_classifier2.score(X_train_scaled, y_train)}')
print(f'Testing Score : {scaled_classifier2.score(X_test_scaled, y_test)}')

# make predictions
print(f'Actual:\t\t{list(y_test[:15])}')
print(f'Predicted:\t{list(scaled_classifier2.predict(X_test_scaled[:15]))}')

Training Score: 0.997214197276104
Testing Score : 0.9917973586463062
Actual:		[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
Predicted:	[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]


## Conclusion: Which model performed better? How does that compare to your prediction?
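#### On the unscaled data, RandomForestClassifier scored higher than LogisticRegression on the training set (0.9972 vs. 0.9922) and only marginally higher on the test set (0.9921 vs. 0.9913). After scaling, LogisticRegression improved to a test score of 0.9942, slightly ahead of RandomForestClassifier at 0.9918, whose scores barely changed. My initial prediction that RandomForestClassifier would perform better therefore holds only for the unscaled data, and even there the test-set difference is very small. Because the train/test split was not seeded, differences this small may not be repeatable across runs.
#### For the first 15 test samples shown, the predicted labels matched the actual labels for both models.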
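To compare the four runs side by side, the scores printed in the cells above can be collected into one table. This is only a convenience sketch: the numbers are copied from the outputs in this notebook, while the row labels and the `gap` column are my own additions.

```python
import pandas as pd

# Scores copied from the cell outputs above
scores = pd.DataFrame(
    {
        "train": [0.992192873847847, 0.997214197276104,
                  0.9941016646031091, 0.997214197276104],
        "test": [0.9913330581923235, 0.9920553033429633,
                 0.9942220387948824, 0.9917973586463062],
    },
    index=[
        "LogisticRegression (unscaled)",
        "RandomForestClassifier (unscaled)",
        "LogisticRegression (scaled)",
        "RandomForestClassifier (scaled)",
    ],
)

# Difference between training and test accuracy for each run
scores["gap"] = scores["train"] - scores["test"]

# Run with the highest test accuracy
best_on_test = scores["test"].idxmax()

print(scores.round(4))
print("Highest test score:", best_on_test)
```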