Prediction:
    - The Random Forest Classifier model will perform better.

Why?
    - A random forest's overall prediction is determined by the majority outcome of many smaller decision trees. By default each tree is trained on a bootstrap sample of the data and considers a random subset of features at each split, so the individual trees' errors tend to average out and the ensemble can pick up non-linear trends in the dataset that a single linear model may miss.
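
The majority-vote idea can be sketched by hand with a few decision trees. This is only an illustration on synthetic stand-in data (X_demo and y_demo are assumptions, not the lending dataset), and the loop is a simplified picture of bagging rather than scikit-learn's internal implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: NOT the lending dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

rng = np.random.default_rng(0)
trees = []
for i in range(25):
    # Each tree sees a bootstrap sample: rows drawn with replacement
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    # max_features="sqrt" makes each split consider a random subset of features
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    trees.append(tree.fit(X_demo[idx], y_demo[idx]))

# Stack per-tree predictions: shape (n_trees, n_samples)
votes = np.stack([t.predict(X_demo) for t in trees])

# Majority vote per sample (labels are 0/1 and the tree count is odd, so no ties)
majority = (votes.mean(axis=0) > 0.5).astype(int)
print(f"Ensemble accuracy on the demo data: {(majority == y_demo).mean():.3f}")
```

Note that scikit-learn's RandomForestClassifier combines its trees by averaging their predicted class probabilities rather than counting hard votes, so the sketch above conveys the idea, not the exact mechanics.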

In [1]:
# Import Dependencies

import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression

In [2]:
# Import the Data

file_path = Path("Resources/lending_data.csv")
df = pd.read_csv(file_path)
df.head()

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt,loan_status
0,10700.0,7.672,52800,0.431818,5,1,22800,0
1,8400.0,6.692,43600,0.311927,3,0,13600,0
2,9000.0,6.963,46100,0.349241,3,0,16100,0
3,10700.0,7.664,52700,0.43074,5,1,22700,0
4,10800.0,7.698,53000,0.433962,5,1,23000,0


In [3]:
# Split Data Into X_train, X_test, y_train, y_test

X = df.drop("derogatory_marks", axis=1)
y = df["derogatory_marks"].values
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Scale Data

scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [4]:
# Train a Logistic Regression Model and Print the Model Testing/Training Scores

clf = LogisticRegression().fit(X_train_scaled, y_train)
print(f'Training Score: {clf.score(X_train_scaled, y_train)}')
print(f'Testing Score: {clf.score(X_test_scaled, y_test)}')

Training Score: 0.9986586875773834
Testing Score: 0.9984523318200578


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
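
The warning text above suggests increasing max_iter. A minimal sketch of that change follows, run on synthetic stand-in data (X_demo, y_demo and the names log_reg and demo_scaler are assumptions for illustration). The value max_iter=1000 is an arbitrary choice, not a recommendation for the lending data.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: NOT the lending dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=7, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)

# Fit the scaler on the training split only, then apply it to both splits
demo_scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = demo_scaler.transform(X_tr), demo_scaler.transform(X_te)

with warnings.catch_warnings():
    # Turn a convergence warning into an error so it cannot go unnoticed
    warnings.simplefilter("error", ConvergenceWarning)
    log_reg = LogisticRegression(max_iter=1000).fit(X_tr_s, y_tr)

# n_iter_ reports how many solver iterations were actually used
print(f"Iterations used: {log_reg.n_iter_[0]} of {log_reg.max_iter}")
print(f"Testing Score: {log_reg.score(X_te_s, y_te)}")
```

If n_iter_ stays well below max_iter, the solver stopped because it converged rather than because it ran out of iterations.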


In [5]:
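# Get Parameters Already Used by Default

from pprint import pprint

# Note: this inspects a RandomForestRegressor; RandomForestClassifier shares most of
# these parameters but uses a different default criterion
rf = RandomForestRegressor(random_state=42)
# Look at parameters used by our current forest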
print('Parameters currently in use:\n')
pprint(rf.get_params())

Parameters currently in use:

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'mse',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 42,
 'verbose': 0,
 'warm_start': False}


In [6]:
# Import RandomizedSearchCV (Cross-Validated*) to Search For Hyper-Parameters
# *Cross-validation estimates how well a model will predict unseen data by repeatedly training on
# one part of the dataset and testing on the held-out part, which helps detect overfitting
# Examples of Cross-Validation:
# 1) K-Fold: The dataset is split into k equally sized groups (folds). Each fold is used once as the test set
#    while the model is trained on the remaining k-1 folds, and the k scores are averaged
# 2) Hold-Out: (Not Suggested/ Unstable/ Single-Validation) The dataset is split once into a training set
#    and a smaller test set, and the model is evaluated on that single split
# 3) Leave-One-Out: Hold out a single observation as the test set, build the model on all remaining
#    observations, then use the model to predict the held-out value and calculate the error (e.g. MSE).
#    Repeat with each data point and then average all of the test errors.
# 4) Bootstrap: (Uses Random Sampling With Replacement/ A Sample Can Contain the Same Observation Multiple Times)
#    Train on resampled datasets, evaluate on the observations left out of each resample, and average the results.

# Code and Comments From Towards Data Science:

from sklearn.model_selection import RandomizedSearchCV
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
pprint(random_grid)

{'bootstrap': [True, False],
 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, None],
 'max_features': ['auto', 'sqrt'],
 'min_samples_leaf': [1, 2, 4],
 'min_samples_split': [2, 5, 10],
 'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]}
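
The cross-validation schemes listed in the comments of In [6] can be tried side by side. The sketch below uses synthetic stand-in data and a plain LogisticRegression (both assumptions for illustration). scikit-learn has no built-in bootstrap splitter, so the bootstrap case is approximated here with sklearn.utils.resample and an out-of-bag test set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, LeaveOneOut, ShuffleSplit, cross_val_score
from sklearn.utils import resample

# Synthetic stand-in data (assumption: NOT the lending dataset)
X_demo, y_demo = make_classification(n_samples=60, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000)

schemes = {
    "k-fold (k=5)": KFold(n_splits=5, shuffle=True, random_state=0),
    "hold-out (single 75/25 split)": ShuffleSplit(n_splits=1, test_size=0.25, random_state=0),
    "leave-one-out": LeaveOneOut(),
}

cv_results = {}
for name, splitter in schemes.items():
    # One accuracy score per train/test split produced by the splitter
    scores = cross_val_score(model, X_demo, y_demo, cv=splitter)
    cv_results[name] = scores
    print(f"{name}: {len(scores)} fit(s), mean accuracy {scores.mean():.3f}")

# Bootstrap: draw row indices with replacement, test on the rows never drawn
all_idx = np.arange(len(X_demo))
boot_idx = resample(all_idx, replace=True, n_samples=len(all_idx), random_state=0)
oob_idx = np.setdiff1d(all_idx, boot_idx)
boot_score = model.fit(X_demo[boot_idx], y_demo[boot_idx]).score(X_demo[oob_idx], y_demo[oob_idx])
print(f"bootstrap (1 resample): out-of-bag accuracy {boot_score:.3f}")
```

The number of fits makes the cost trade-off visible: hold-out needs one fit, k-fold needs k, and leave-one-out needs one fit per observation.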


In [7]:
# Use the random grid to search for best hyperparameters
# First create the base model to tune
rf = RandomForestRegressor()
# Random search of parameters, using 3 fold cross validation, 
# search across 100 different combinations, and use all available cores
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 100, cv = 3, verbose=2, random_state=42, n_jobs = -1)
# Fit the random search model
rf_random.fit(X_train, y_train)

Fitting 3 folds for each of 100 candidates, totalling 300 fits


RandomizedSearchCV(cv=3, estimator=RandomForestRegressor(), n_iter=100,
                   n_jobs=-1,
                   param_distributions={'bootstrap': [True, False],
                                        'max_depth': [10, 20, 30, 40, 50, 60,
                                                      70, 80, 90, 100, 110,
                                                      None],
                                        'max_features': ['auto', 'sqrt'],
                                        'min_samples_leaf': [1, 2, 4],
                                        'min_samples_split': [2, 5, 10],
                                        'n_estimators': [200, 400, 600, 800,
                                                         1000, 1200, 1400, 1600,
                                                         1800, 2000]},
                   random_state=42, verbose=2)
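
The search above tunes a RandomForestRegressor, while the final model in In [9] is a RandomForestClassifier. One possible alternative is to search over the classifier directly, so that candidates are scored by accuracy. The sketch below does that on synthetic stand-in data with a deliberately small grid and n_iter=5 to keep the run time short. The names small_grid and clf_search and all grid values are assumptions for illustration. 'auto' is left out of max_features because newer scikit-learn releases no longer accept it.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption: NOT the lending dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=7, random_state=42)

# Hypothetical, deliberately small grid so the search finishes quickly
small_grid = {
    "n_estimators": [50, 100, 200],
    "max_features": ["sqrt", None],
    "max_depth": [5, 10, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False],
}

clf_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=small_grid,
    n_iter=5,            # 5 sampled combinations x 3 folds = 15 fits
    cv=3,
    scoring="accuracy",  # classification metric, matching the final model
    random_state=42,
    n_jobs=1,
)
clf_search.fit(X_demo, y_demo)

print(clf_search.best_params_)
print(f"Best mean CV accuracy: {clf_search.best_score_:.3f}")
```

Run time scales with n_iter times cv times the size of each forest, so lowering n_iter or capping n_estimators is the most direct way to shorten a search like the one in In [7].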

In [8]:
# If actually run, the search above takes a very long time (300 fits, each with up to 2000 trees)
# Next time: reduce n_iter or narrow the grid to cut the run time for similar results

rf_random.best_params_

{'n_estimators': 600,
 'min_samples_split': 2,
 'min_samples_leaf': 2,
 'max_features': 'sqrt',
 'max_depth': 110,
 'bootstrap': False}
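
Rather than retyping individual values, the whole best_params_ dictionary can be unpacked into the classifier, which avoids dropping or mistyping a setting. The sketch below copies the values from the output above into a literal dict and fits on synthetic stand-in data. The names best_params and tuned_clf and the random_state values are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Values copied from the best_params_ output above
best_params = {
    "n_estimators": 600,
    "min_samples_split": 2,
    "min_samples_leaf": 2,
    "max_features": "sqrt",
    "max_depth": 110,
    "bootstrap": False,
}

# Synthetic stand-in data (assumption: NOT the lending dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=7, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)

# ** unpacks the dictionary into keyword arguments
tuned_clf = RandomForestClassifier(**best_params, random_state=1).fit(X_tr, y_tr)

print(f"Training Score: {tuned_clf.score(X_tr, y_tr)}")
print(f"Testing Score: {tuned_clf.score(X_te, y_te)}")
```

In the notebook itself the same pattern would read RandomForestClassifier(**rf_random.best_params_), provided every key in the dictionary is a valid argument for the classifier.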

In [9]:
# Train a Random Forest Classifier Model, Input Results From Best_Params_ and Print Model Training/Testing Scores

clf = RandomForestClassifier(n_estimators=600, min_samples_leaf=2, max_depth=110, bootstrap=False).fit(X_train_scaled, y_train)
print(f'Training Score: {clf.score(X_train_scaled, y_train)}')
print(f'Testing Score: {clf.score(X_test_scaled, y_test)}')

Training Score: 0.9996904663640116
Testing Score: 0.9992261659100289


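Prediction was correct.

In this instance, the Random Forest Classifier, trained with the best_params_ found by the RandomForestRegressor search, reached a slightly higher testing score (0.99923) than Logistic Regression (0.99845), so it was the way to go if looking for increased accuracy.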