#**Capstone 2: Lending Club - Modeling**

Previously, in the "Capstone Two - Preprocessing and Training" notebook, we used AutoGluon to quickly train a sample of optimized models and see which would perform best. Since we did not know exactly which hyperparameters AutoGluon used, we hand-picked several of its well-performing models as a starting point for the models built here.

Five models selected from AutoGluon:
1. Dummy Regressor
2. Linear Regression
3. XGBoost
4. Random Forest
5. Decision Tree

The purpose of this notebook is to explore various models and determine the one that performs the best in predicting our dependent variable using statistical metrics.

This is a regression problem: we use loan data to predict a continuous numerical target, "interest_rate". Based on my analysis, the best model for predicting "interest_rate" is an XGBoost regressor. The work that leads to this conclusion follows below.

#**Importing the necessary packages**


In [20]:
#Import the necessary packages:
#data handling and plotting, performance metrics,
#GridSearchCV for tuning, and the regression models to compare

import pandas as pd
import numpy as np
import os
import pickle
import matplotlib.pyplot as plt
import seaborn as sns

#statistical measures for model performance
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

#Data splitting, cross-validation, and hyperparameter tuning
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV, learning_curve

#Regression models
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRegressor
from sklearn.tree import DecisionTreeRegressor
import lightgbm as lgb

from sklearn.utils import all_estimators
from sklearn.base import RegressorMixin

from sb_utils import save_file

#**Importing our training and testing data sets**

Loading the training and testing data files produced in our previous notebook, "Capstone Two - Preprocessing and Training".

In [2]:
y_train = pd.read_csv('y_train.csv')
y_test = pd.read_csv('y_test.csv')
X_train_scaled = pd.read_csv('X_train_scaled.csv')
X_test_scaled = pd.read_csv('X_test_scaled.csv')

#**Generating a base Dataframe to track our model performance**

In [3]:
results_df = pd.DataFrame(columns=['Model', 'R_Squared_train', 'R_Squared_test', 'R_Squared_Adj_train','R_Squared_Adj_test', 'RMSE', 'MAE'])

#**Dummy Regression Model (median)**

Based on our EDA plots, many of the individual variables were not symmetrically distributed and showed signs of skewness. For that reason, a dummy model using the "median" strategy is a more appropriate baseline than one using "mean", since the median is less sensitive to outliers.
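
To illustrate why the median is less sensitive to outliers than the mean, here is a minimal sketch. The rates are made-up values chosen for illustration, not values taken from the Lending Club data.

```python
from statistics import mean, median

# Hypothetical interest rates (illustrative only, not from the dataset)
rates = [0.08, 0.10, 0.11, 0.12, 0.13]

# The same sample with one unusually high rate added
rates_with_outlier = rates + [0.31]

# How far each summary statistic moves when the outlier is added
mean_shift = mean(rates_with_outlier) - mean(rates)
median_shift = median(rates_with_outlier) - median(rates)

print(f"mean shift:   {mean_shift:.4f}")
print(f"median shift: {median_shift:.4f}")
```

In this toy sample the single outlier moves the mean several times further than it moves the median, which is the behavior we want from a baseline on skewed data.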



In [4]:
#How Good is the Median?
#Generating a model that just predicts the median interest rate from the training set
train_median = y_train.median()
train_median

interest_rate    0.1189
dtype: float64

In [5]:
#Using DummyRegressor with the median strategy
#because the data may contain outliers, we use the median rather than the mean
dumb_reg_median = DummyRegressor(strategy='median')
dumb_reg_median.fit(X_train_scaled, y_train)

In [6]:
#predicting on the training data with the dummy model, which always guesses the training data's median
y_tr_predict_median = dumb_reg_median.predict(X_train_scaled)

In [7]:
#predicting on the testing data with the dummy model, which always guesses the training data's median
y_te_predict_median = dumb_reg_median.predict(X_test_scaled)

In [8]:
y_te_predict_median[:5]

array([0.1189, 0.1189, 0.1189, 0.1189, 0.1189])

In [9]:
#Using the coefficient of determination (R^2) to see how well this model performed
# R^2 close to 0 is what we expect from a constant baseline
# R^2 < 0 means the model did worse than always predicting the mean of the data it is scored on
# R^2 = 1 means the model predicted every value without error
#Calculating the coefficient of determination on the training data using the median as the guess
dumb_reg_median.score(X_train_scaled, y_train)

-6.127241931608296e-07

In [10]:
dumb_reg_median.score(X_test_scaled, y_test)

-0.00047817956077511603

In [11]:
model = "Dummy Regressor (Median)"

mae = round(mean_absolute_error(y_test,y_te_predict_median),4)
rmse = round(np.sqrt(mean_squared_error(y_test,y_te_predict_median)),4)

r2_train = r2_score(y_train, y_tr_predict_median)
r2_test = r2_score(y_test, y_te_predict_median)
r2_adj_train = (1 - (1 - r2_train) * (X_train_scaled.shape[0] - 1) / (X_train_scaled.shape[0] - X_train_scaled.shape[1] - 1))
r2_adj_test = (1 - (1 - r2_test) * (X_test_scaled.shape[0] - 1) / (X_test_scaled.shape[0] - X_test_scaled.shape[1] - 1))

#DataFrame.append was removed in pandas 2.0; add the row by label instead
#(values follow the column order of results_df)
results_df.loc[len(results_df)] = [model, r2_train, r2_test, r2_adj_train, r2_adj_test, rmse, mae]

results_df

Unnamed: 0,Model,R_Squared_train,R_Squared_test,R_Squared_Adj_train,R_Squared_Adj_test,RMSE,MAE
0,Dummy Regressor (Median),-6.127242e-07,-0.000478,-0.003534,-0.008768,0.0244,0.0194


The dummy regressor with the median strategy scores approximately 0, as expected. A model that always predicts the training mean has a training R-squared of exactly 0; predicting the training median instead gives a value at or slightly below 0, because the median does not minimize squared error. It is also normal for the test R-squared to be slightly lower than the training value, since the median was computed from the training data only.
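
The sketch below shows why a constant baseline lands at or just below 0. It implements the R-squared definition by hand (1 minus residual sum of squares over total sum of squares) on a small made-up sample; the sample values and the helper name `r2` are assumptions for illustration, not part of the notebook.

```python
from statistics import mean, median

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical right-skewed interest rates (illustrative only)
rates = [0.06, 0.08, 0.10, 0.11, 0.12, 0.13, 0.15, 0.19, 0.25]

# Constant predictions: always the mean, or always the median
r2_mean = r2(rates, [mean(rates)] * len(rates))
r2_median = r2(rates, [median(rates)] * len(rates))

print(f"R^2 when predicting the mean:   {r2_mean:.6f}")
print(f"R^2 when predicting the median: {r2_median:.6f}")
```

Predicting the mean makes the two sums of squares identical, so the score is 0. Any other constant, including the median, can only increase the residual sum of squares, which pushes the score below 0.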

#**Linear Regression**

In [12]:
#Training the model
lm = LinearRegression().fit(X_train_scaled, y_train)

In [13]:
#generating predictions on the scaled training and testing data
y_tr_pred = lm.predict(X_train_scaled)
y_te_pred = lm.predict(X_test_scaled)

In [14]:
#First calculating the coefficient of determination ("CoD") on the training data
#We need to make sure we're using the scaled version of our independent variables
lm.score(X_train_scaled, y_train)

0.6909249231774082

In [15]:
y_train.shape, y_tr_pred.shape

((5397, 1), (5397, 1))

In [16]:
#Calculating "CoD" on testing Data
lm.score(X_test_scaled, y_test)

0.6741315546902478

In [17]:
y_test.shape, y_te_pred.shape

((2313, 1), (2313, 1))

In [18]:
model = "Linear Regression"

mae = round(mean_absolute_error(y_test,y_te_pred),4)
rmse = round(np.sqrt(mean_squared_error(y_test,y_te_pred)),4)

r2_train = r2_score(y_train, y_tr_pred)
r2_test = r2_score(y_test,y_te_pred)
r2_adj_train = (1 - (1 - r2_train) * (X_train_scaled.shape[0] - 1) / (X_train_scaled.shape[0] - X_train_scaled.shape[1] - 1))
r2_adj_test = (1 - (1 - r2_test) * (X_test_scaled.shape[0] - 1) / (X_test_scaled.shape[0] - X_test_scaled.shape[1] - 1))

#DataFrame.append was removed in pandas 2.0; add the row by label instead
#(values follow the column order of results_df)
results_df.loc[len(results_df)] = [model, r2_train, r2_test, r2_adj_train, r2_adj_test, rmse, mae]

results_df

Unnamed: 0,Model,R_Squared_train,R_Squared_test,R_Squared_Adj_train,R_Squared_Adj_test,RMSE,MAE
0,Dummy Regressor (Median),-6.127242e-07,-0.000478,-0.003534,-0.008768,0.0244,0.0194
1,Linear Regression,0.6909249,0.674132,0.689833,0.671431,0.0139,0.0104
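
The adjusted R-squared expression is repeated in every model section, so it may help to see it in isolation. The helper below is a sketch (the name `adjusted_r2` is hypothetical), and the feature count of 19 is an assumption obtained by back-solving from the values reported in the table above; the notebook itself never prints the number of columns.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2: penalizes R^2 for the number of predictors used."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Training R^2 reported for Linear Regression, with 5397 training rows.
# n_features=19 is inferred from the reported table values (assumption).
lr_r2_train = 0.6909249231774082
lr_adj_train = adjusted_r2(lr_r2_train, n_samples=5397, n_features=19)

print(f"adjusted R^2 (train): {lr_adj_train:.6f}")
```

With these inputs the helper reproduces the R_Squared_Adj_train value shown for Linear Regression to six decimal places, which suggests the inferred feature count is consistent with the table.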


#**XGBoost Regression Model**

Based on the AutoGluon results, I expect the XGBoost regression model to perform very well if properly tuned.

In [21]:
#building XGBRegressor Model

#tuning the hyperparameters for the XGBRegressor model
#per mentor's recommendation, limit the GridSearchCV parameter grid to reduce the number of models that need to be built
#xgb_reg = XGBRegressor(random_state = 7) 

#parameters = {
#    #n_estimators determines the number of boosted trees to build
#    'n_estimators' : [50,100,150,200],
#    #max_depth determines the maximum depth of each tree
#    'max_depth' : [5,6,7,8],
#    #learning_rate shrinks each tree's contribution (step size) to help prevent overfitting
#    'learning_rate' : [0.01, 0.1, 1],
#    #subsample is the fraction of training rows used to build each tree
#    'subsample' : [0.5, 0.75, 1],
#    #colsample_bytree is the fraction of columns sampled to build each tree
#    'colsample_bytree' : [0.5, 0.75, 1]
#}

#GridSearchCV params
#cv = integer, the number of folds for cross-validation
#verbose = 1, controls how many progress messages are printed; a higher integer means more messages
#scoring = the metric used to evaluate each cross-validated model (one of sklearn's accepted scoring strings)
#n_jobs = None or -1, number of jobs to run in parallel (-1 uses all processors)

#grid_xgb = GridSearchCV(xgb_reg, parameters, cv=10, verbose = 1, 
#                         scoring='neg_mean_squared_error', n_jobs=-1)
 
#grid_xgb.fit(X_train_scaled, y_train)

#print(f"Best paramters: {grid_xgb.best_params_})")
#print("MSE: ", -grid_xgb.best_score_)

Fitting 10 folds for each of 432 candidates, totalling 4320 fits
Best paramters: {'colsample_bytree': 0.75, 'learning_rate': 0.1, 'max_depth': 7, 'n_estimators': 100, 'subsample': 1})
MSE:  0.00014637739063926301


The search took roughly 55 minutes to run.

Best parameters: {'colsample_bytree': 0.75, 'learning_rate': 0.1, 'max_depth': 7, 'n_estimators': 100, 'subsample': 1}, MSE: 0.00014637739063926301
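
The "432 candidates, totalling 4320 fits" line in the search output follows directly from the size of the grid. As a rough check, the sketch below enumerates the combinations of the grid listed in the search cell and multiplies by the 10 folds. The per-fit time is only a wall-clock average derived from the approximate 55-minute figure, with parallel jobs included, so treat it as an estimate.

```python
from itertools import product

# Parameter grid as listed in the XGBoost search cell
parameters = {
    "n_estimators": [50, 100, 150, 200],
    "max_depth": [5, 6, 7, 8],
    "learning_rate": [0.01, 0.1, 1],
    "subsample": [0.5, 0.75, 1],
    "colsample_bytree": [0.5, 0.75, 1],
}

# Every combination of one value per hyperparameter is one candidate
candidates = list(product(*parameters.values()))
n_candidates = len(candidates)

# Each candidate is fitted once per cross-validation fold
cv_folds = 10
total_fits = n_candidates * cv_folds

# Approximate average wall-clock time per fit, from the ~55 minute run
seconds_per_fit = 55 * 60 / total_fits

print(n_candidates, total_fits, round(seconds_per_fit, 2))
```

Because the candidate count is a product of the list lengths, adding even one more value to a single hyperparameter grows the search noticeably, which is why the grid was kept small.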

In [22]:
#Using the best parameters found by grid_xgb to build the final model
xgb_tuned = XGBRegressor(colsample_bytree=0.75, learning_rate=0.1,
                         max_depth=7, n_estimators=100, subsample=1, random_state = 7)

#fitting to the training
xgb_tuned.fit(X_train_scaled, y_train)

#predicting on X_train
y_tr_pred_xgb = xgb_tuned.predict(X_train_scaled)

#predicting for X_test
y_te_pred_xgb = xgb_tuned.predict(X_test_scaled)

In [23]:
model = "XGBoost"

mae = round(mean_absolute_error(y_test,y_te_pred_xgb),4)
rmse = round(np.sqrt(mean_squared_error(y_test,y_te_pred_xgb)),4)

r2_train = r2_score(y_train, y_tr_pred_xgb)
r2_test = r2_score(y_test,y_te_pred_xgb)
r2_adj_train = (1 - (1 - r2_train) * (X_train_scaled.shape[0] - 1) / (X_train_scaled.shape[0] - X_train_scaled.shape[1] - 1))
r2_adj_test = (1 - (1 - r2_test) * (X_test_scaled.shape[0] - 1) / (X_test_scaled.shape[0] - X_test_scaled.shape[1] - 1))

#DataFrame.append was removed in pandas 2.0; add the row by label instead
#(values follow the column order of results_df)
results_df.loc[len(results_df)] = [model, r2_train, r2_test, r2_adj_train, r2_adj_test, rmse, mae]

results_df

Unnamed: 0,Model,R_Squared_train,R_Squared_test,R_Squared_Adj_train,R_Squared_Adj_test,RMSE,MAE
0,Dummy Regressor (Median),-6.127242e-07,-0.000478,-0.003534,-0.008768,0.0244,0.0194
1,Linear Regression,0.6909249,0.674132,0.689833,0.671431,0.0139,0.0104
2,XGBoost,0.9220981,0.764047,0.921823,0.762092,0.0119,0.0083


#**Random Forest Regression Model**

In [24]:
#param_grid = {'n_estimators': [50, 100, 200, 300],
#               'max_depth': [None, 10, 20, 30],
#               'min_samples_split': [2, 5, 10],
#               'min_samples_leaf': [1, 2, 4]}


#reg_rf = RandomForestRegressor(random_state=7)

#grid_rf = GridSearchCV(reg_rf, param_grid, scoring="neg_mean_squared_error", n_jobs=-1, verbose=1, cv=5)
#grid_rf.fit(X_train_scaled, y_train)
#best_params = grid_rf.best_params_

#print(f"Best parameters: {best_params})")
#print("MSE: ", -grid_rf.best_score_)

Fitting 5 folds for each of 144 candidates, totalling 720 fits


Best parameters: {'max_depth': 20, 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 300})
MSE:  0.0001543863360799219


The search took roughly 40 minutes to run.

Best parameters: {'max_depth': 20, 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 300}
MSE: 0.0001543863360799219

In [25]:
rf_tuned = RandomForestRegressor(max_depth=20, min_samples_leaf=1, min_samples_split=5, n_estimators=300, random_state = 7) 
#RandomForestRegressor expects a 1-D target; y_train is a single-column DataFrame, so flatten it
rf_tuned.fit(X_train_scaled, y_train.values.ravel())
y_te_pred_rf = rf_tuned.predict(X_test_scaled)
y_tr_pred_rf = rf_tuned.predict(X_train_scaled)


In [26]:
model = "Random Forest"

mae = round(mean_absolute_error(y_test,y_te_pred_rf),4)
rmse = round(np.sqrt(mean_squared_error(y_test,y_te_pred_rf)),4)

r2_train = r2_score(y_train, y_tr_pred_rf)
r2_test = r2_score(y_test,y_te_pred_rf)
r2_adj_train = (1 - (1 - r2_train) * (X_train_scaled.shape[0] - 1) / (X_train_scaled.shape[0] - X_train_scaled.shape[1] - 1))
r2_adj_test = (1 - (1 - r2_test) * (X_test_scaled.shape[0] - 1) / (X_test_scaled.shape[0] - X_test_scaled.shape[1] - 1))

#DataFrame.append was removed in pandas 2.0; add the row by label instead
#(values follow the column order of results_df)
results_df.loc[len(results_df)] = [model, r2_train, r2_test, r2_adj_train, r2_adj_test, rmse, mae]

results_df

Unnamed: 0,Model,R_Squared_train,R_Squared_test,R_Squared_Adj_train,R_Squared_Adj_test,RMSE,MAE
0,Dummy Regressor (Median),-6.127242e-07,-0.000478,-0.003534,-0.008768,0.0244,0.0194
1,Linear Regression,0.6909249,0.674132,0.689833,0.671431,0.0139,0.0104
2,XGBoost,0.9220981,0.764047,0.921823,0.762092,0.0119,0.0083
3,Random Forest,0.9551438,0.757463,0.954985,0.755453,0.012,0.0082


#**Decision Tree Regressor Model**

In [27]:
#param_grid = { 
#    "max_depth":(list(range(1, 5))), 
#    "min_samples_split":[2, 3, 4], 
#    "min_samples_leaf":list(range(1, 5))
#}

#reg_dt = DecisionTreeRegressor(random_state = 7) 

#grid_dt = GridSearchCV(reg_dt, param_grid, scoring="neg_mean_squared_error", n_jobs=-1, verbose=1, cv=5)
#grid_dt.fit(X_train_scaled, y_train)
#best_params = grid_dt.best_params_

#print(f"Best parameters: {best_params})")
#print("MSE: ", -grid_dt.best_score_)

Fitting 5 folds for each of 48 candidates, totalling 240 fits
Best parameters: {'max_depth': 4, 'min_samples_leaf': 3, 'min_samples_split': 2})
MSE:  0.00022651371873831498


The search took about 5 seconds to run.

Best parameters: {'max_depth': 4, 'min_samples_leaf': 3, 'min_samples_split': 2}

MSE: 0.00022651371873831498
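
The grid searches report cross-validated MSE, while the results table reports RMSE. Taking the square root puts the cross-validation scores on the same scale as the table, in units of interest rate. The sketch below does this for the three best MSE values printed by the searches; the dictionary names are illustrative.

```python
from math import sqrt

# Best cross-validated MSE reported by each grid search
cv_mse = {
    "XGBoost": 0.00014637739063926301,
    "Random Forest": 0.0001543863360799219,
    "Decision Tree": 0.00022651371873831498,
}

# RMSE is the square root of MSE, so it shares the target's units
cv_rmse = {name: sqrt(mse) for name, mse in cv_mse.items()}

for name, rmse in cv_rmse.items():
    print(f"{name:<14} CV RMSE = {rmse:.4f}")
```

Note that XGBoost was searched with 10 folds and the other two models with 5, so these cross-validated values are not a strictly like-for-like comparison.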

In [28]:
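#max_features='auto' (use all features) was removed in scikit-learn 1.3; None is the equivalent setting
reg_dt = DecisionTreeRegressor(max_depth=4, min_samples_leaf=3, min_samples_split=2, splitter='best', max_features=None, random_state = 7) 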
reg_dt.fit(X_train_scaled, y_train)
y_te_pred_dt = reg_dt.predict(X_test_scaled)
y_tr_pred_dt = reg_dt.predict(X_train_scaled)



In [29]:
model = "Decision Tree"

mae = round(mean_absolute_error(y_test,y_te_pred_dt),4)
rmse = round(np.sqrt(mean_squared_error(y_test,y_te_pred_dt)),4)

r2_train = r2_score(y_train, y_tr_pred_dt)
r2_test = r2_score(y_test,y_te_pred_dt)
r2_adj_train = (1 - (1 - r2_train) * (X_train_scaled.shape[0] - 1) / (X_train_scaled.shape[0] - X_train_scaled.shape[1] - 1))
r2_adj_test = (1 - (1 - r2_test) * (X_test_scaled.shape[0] - 1) / (X_test_scaled.shape[0] - X_test_scaled.shape[1] - 1))

#DataFrame.append was removed in pandas 2.0; add the row by label instead
#(values follow the column order of results_df)
results_df.loc[len(results_df)] = [model, r2_train, r2_test, r2_adj_train, r2_adj_test, rmse, mae]

results_df

Unnamed: 0,Model,R_Squared_train,R_Squared_test,R_Squared_Adj_train,R_Squared_Adj_test,RMSE,MAE
0,Dummy Regressor (Median),-6.127242e-07,-0.000478,-0.003534,-0.008768,0.0244,0.0194
1,Linear Regression,0.6909249,0.674132,0.689833,0.671431,0.0139,0.0104
2,XGBoost,0.9220981,0.764047,0.921823,0.762092,0.0119,0.0083
3,Random Forest,0.9551438,0.757463,0.954985,0.755453,0.012,0.0082
4,Decision Tree,0.6836561,0.650789,0.682538,0.647895,0.0144,0.0106


#**Conclusion** 

Based on the metrics for the models tested, the best-performing models are the XGBoost and Random Forest regressors. They have the lowest errors and the test R-squared scores closest to 1. Comparing the two, XGBoost has the higher test R-squared (0.764 vs. 0.757) and the lower RMSE (0.0119 vs. 0.0120), while Random Forest has a marginally lower MAE (0.0082 vs. 0.0083). XGBoost also shows a smaller gap between its training and test R-squared, so it is the best choice among the models tested.
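
As a quick cross-check of this comparison, the sketch below ranks the models by test R-squared and computes the gap between training and test R-squared, which is one rough indicator of overfitting. The numbers are copied from the final results table; the dictionary layout is just an illustrative choice.

```python
# Metrics copied from the final results table
results = {
    "Dummy Regressor (Median)": {"r2_train": -6.127242e-07, "r2_test": -0.000478},
    "Linear Regression": {"r2_train": 0.6909249, "r2_test": 0.674132},
    "XGBoost": {"r2_train": 0.9220981, "r2_test": 0.764047},
    "Random Forest": {"r2_train": 0.9551438, "r2_test": 0.757463},
    "Decision Tree": {"r2_train": 0.6836561, "r2_test": 0.650789},
}

# Rank models from best to worst test R^2
ranked = sorted(results, key=lambda name: results[name]["r2_test"], reverse=True)

# A large train-minus-test gap suggests the model fits the training data
# more closely than it generalizes
gaps = {name: m["r2_train"] - m["r2_test"] for name, m in results.items()}

for name in ranked:
    print(f"{name:<26} test R^2 = {results[name]['r2_test']:.3f}  gap = {gaps[name]:.3f}")
```

Both tree ensembles show a sizable gap, so further regularization could be worth exploring, but XGBoost has the smaller gap of the two along with the higher test score.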