
# Content of this Notebook (Logistic Regression and Random Forest Learning Algorithms and Validation Applied to Predicting Loan Defaults)
In this notebook, the dataset is separated into a feature matrix and a label vector, and then split into training and testing sets (70% and 30%). This prepares the data for the two machine learning algorithms used here, logistic regression and random forest.

The logistic regression was first built without tuning the hyperparameters. The mean cross-validation accuracy on the training set and the accuracy on the training set as a whole were compared to check for over- or underfitting. The two scores are 0.9655 and 0.9659, which suggests no overfitting. The accuracy on the test set is 0.9645. For defaults, the precision is 0.98 and the recall is 0.79. The f1 scores of fully repaid loans and defaults differ by 0.11 (0.98 versus 0.87), a modest gap given the imbalance in the data, where about 15% of the loans are defaults. The logistic regression was then tuned and compared to the untuned model.
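The f1 comparison above can be reproduced by hand, since f1 is the harmonic mean of precision and recall. The following is a minimal sketch that plugs in the rounded precision and recall values from the untuned logistic regression report; `f1_from` is a hypothetical helper name, and because the inputs are already rounded to two decimals the results only approximate the reported f1 scores.

```python
def f1_from(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded values taken from the untuned logistic regression report.
f1_repaid = f1_from(0.96, 1.00)   # class 0: fully repaid loans
f1_default = f1_from(0.98, 0.79)  # class 1: defaults

print("f1 (fully repaid):", round(f1_repaid, 2))
print("f1 (defaults):", round(f1_default, 2))
print("difference of rounded f1 scores:",
      round(round(f1_repaid, 2) - round(f1_default, 2), 2))
```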

The logistic regression was tuned with a cross-validated grid search. The best penalty is l1 (lasso) and the best C value is about 0.000085. With these settings, the CV accuracy on the training set is 0.9706 and the accuracy on the training set as a whole is 0.9703. The accuracy on the testing set is 0.9697. These scores are slight improvements over the untuned model. The precision and recall for defaults also improved, to 0.99 and 0.81 respectively. The difference in f1 score between fully repaid loans and defaults decreased to 0.09. Tuning the hyperparameters improved the model on every metric reported here.
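One practical note on the penalty search: in scikit-learn releases where `lbfgs` is the default solver, the l1 penalty is not supported by that solver, so a grid over both penalties needs a solver such as `liblinear` that handles both. The sketch below shows the idea on synthetic data. The `make_classification` data, the smaller `C` grid, and the `demo_` names are assumptions made for illustration and are not the notebook's loan data or settings; depending on the installed version, scikit-learn may also emit deprecation warnings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the loan data (hypothetical, roughly 15% positives).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, weights=[0.85, 0.15], random_state=0
)

# liblinear supports both the l1 and the l2 penalty.
demo_reg = LogisticRegression(solver="liblinear")
demo_grid = {"penalty": ["l1", "l2"], "C": np.logspace(-3, 2, 6)}

# 5-fold cross-validated grid search over penalty and C.
demo_search = GridSearchCV(demo_reg, demo_grid, cv=5)
demo_search.fit(X_demo, y_demo)

print("Best parameters:", demo_search.best_params_)
print("Best CV accuracy:", demo_search.best_score_)
```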

The random forest classifier was also fit and compared with the tuned logistic regression model. It was first fit without tuning so that the improvement from tuning could be evaluated. The CV accuracy on the training set is 0.931 and the accuracy on the entire training set is 0.993. The full-training-set score is higher by more than 0.06, which suggests overfitting. The accuracy on the testing set is 0.932, essentially the same as the CV score. For defaults, the recall is 0.56 and the precision is 0.98. Unlike the untuned logistic regression model, the untuned random forest model suffers from overfitting. Class imbalance is less of a concern with the random forest because random forests tend to be more robust to that issue. The tuned random forest model shows great improvement compared to the untuned version.
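The overfitting check used above amounts to subtracting the cross-validation accuracy from the accuracy on the full training set. A minimal sketch of that arithmetic follows, using the scores from this notebook's output rounded to four decimals; `overfit_gap` is a hypothetical helper name introduced only for illustration.

```python
# (train accuracy, mean CV accuracy), rounded from the notebook output.
scores = {
    "logistic regression (untuned)": (0.9659, 0.9655),
    "random forest (untuned)": (0.9930, 0.9306),
}

def overfit_gap(train_score: float, cv_score: float) -> float:
    """Difference between training accuracy and cross-validation accuracy."""
    return train_score - cv_score

gaps = {name: overfit_gap(train, cv) for name, (train, cv) in scores.items()}
for name, gap in gaps.items():
    print(f"{name}: gap = {gap:.4f}")
```

A gap near zero, as for the logistic regression, indicates the model scores about the same on data it has and has not seen; a gap above 0.06, as for the untuned random forest, is the warning sign discussed above.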

The hyperparameters chosen to tune the random forest are the max depth, max features, and the number of estimators. Tuning was again done with a cross-validated grid search. A max depth of None was optimal. A max features setting of None was also optimal, beating the default 'auto' setting, which uses the square root of the number of features. With None, every split considers all of the features rather than a random subset, so the individual trees are less randomized; here that setting gave the best cross-validation score. The optimal number of trees is 1000, compared to the default setting of 10. More trees make up for the weakness of the individual trees. With these hyperparameters, the CV accuracy on the training set is 0.974 and the accuracy on the entire training set is 1.0, so the gap between the two shrank from about 0.06 for the untuned random forest to about 0.026. There might still be a slight overfitting issue that can be addressed with further feature selection, but this model suffices for this project. The accuracy on the testing set is 0.975, which is slightly better than the CV accuracy on the training set. For defaults, the precision is 0.99 and the recall is 0.84, which is a great improvement compared to the untuned random forest. The recall of the tuned random forest model is also higher than that of the tuned logistic regression model. Both have the same precision.
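To make the max features options concrete, the sketch below computes how many features each split would consider under the three settings in the grid. It mirrors scikit-learn's behavior for classifiers as best understood (square root and base-2 logarithm truncated to an integer, with a minimum of one). The function name `features_per_split` and the feature count of 30 are hypothetical, since the number of columns in the loan data is not stated here.

```python
import math
from typing import Optional

def features_per_split(max_features: Optional[str], n_features: int) -> int:
    """Number of candidate features examined at each split."""
    if max_features is None:
        return n_features  # all features, no random subsetting
    if max_features in ("auto", "sqrt"):
        return max(1, int(math.sqrt(n_features)))
    if max_features == "log2":
        return max(1, int(math.log2(n_features)))
    raise ValueError(f"unsupported setting: {max_features!r}")

n_features = 30  # hypothetical feature count, for illustration only
for setting in ("auto", "log2", None):
    print(setting, "->", features_per_split(setting, n_features))
```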

Overall, both models improved with tuning, especially the random forest. The imbalance in the dataset does not affect the logistic regression model significantly, given the small difference in f1 scores between the two classes. However, the tuned random forest model performs better than the tuned logistic regression in terms of recall, 0.84 versus 0.81. Both have a precision of 0.99.
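The recall figures can be translated into an approximate count of missed defaults by multiplying the miss rate (one minus recall) by the number of defaults in the test set, 1943 according to the classification reports. The sketch below does this arithmetic; because the recall values are rounded to two decimals, the counts are estimates rather than exact confusion-matrix entries.

```python
n_defaults = 1943  # support of class 1 in the test set

# Rounded recall on defaults for each model, from the classification reports.
recall_by_model = {
    "logistic regression (untuned)": 0.79,
    "logistic regression (tuned)": 0.81,
    "random forest (untuned)": 0.56,
    "random forest (tuned)": 0.84,
}

# Estimated number of defaults each model fails to flag.
missed = {
    name: round(n_defaults * (1 - recall))
    for name, recall in recall_by_model.items()
}
for name, count in missed.items():
    print(f"{name}: about {count} of {n_defaults} defaults missed")
```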

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('Loan_data_ML.csv', index_col='member_id')

In [9]:
df.index

Int64Index([    1,     2,     3,     4,     5,     6,     7,     8,     9,
               10,
            ...
            42526, 42527, 42528, 42529, 42530, 42531, 42532, 42533, 42534,
            42535],
           dtype='int64', name='member_id', length=42535)

In [10]:
X = df.drop('loan_status_Charged Off', axis=1).values
y = df['loan_status_Charged Off'].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=32)
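Because only about 15% of the loans are defaults, one optional variation on the split above is to pass `stratify` so that both sets keep the same class ratio. This is a sketch on toy data, not what the notebook ran; `X_toy` and `y_toy` are hypothetical stand-ins for `X` and `y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy labels with 15% positives, standing in for y.
y_toy = np.array([1] * 150 + [0] * 850)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify=y_toy keeps the class proportions equal in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=32, stratify=y_toy
)

print("Positive rate in training set:", y_tr.mean())
print("Positive rate in test set:", y_te.mean())
```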

In [11]:
import warnings
warnings.filterwarnings("ignore")

In [12]:
# Logistic regression without tuning
from sklearn.linear_model import LogisticRegression
reg = LogisticRegression()
reg.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [15]:
# Logistic regression without tuning
from sklearn.model_selection import cross_val_score
x_val = cross_val_score(reg, X_train, y_train, cv=5)
print('CV Score on training data:', np.mean(x_val))
print('Score on training data:', reg.score(X_train, y_train))
print('Score on test set:', reg.score(X_test, y_test))

CV Score on training data: 0.9655403428983511
Score on training data: 0.9659098542352388
Score on test set: 0.9645012146383513


In [27]:
# Logistic regression without tuning
from sklearn.metrics import classification_report
pred_y = reg.predict(X_test)
print(classification_report(y_test, pred_y))

              precision    recall  f1-score   support

           0       0.96      1.00      0.98     10818
           1       0.98      0.79      0.87      1943

   micro avg       0.96      0.96      0.96     12761
   macro avg       0.97      0.89      0.93     12761
weighted avg       0.96      0.96      0.96     12761



In [32]:
# Optimal logistic regression
from sklearn.model_selection import GridSearchCV
param_grid = {'penalty':['l1','l2'], 'C':np.logspace(-5,8,15)}
logregcv = GridSearchCV(reg, param_grid, cv=5)
logregcv.fit(X_train, y_train)
print('Best parameters:', logregcv.best_params_)
print('Score from the best parameters:', logregcv.best_score_)

Best parameters: {'C': 8.483428982440725e-05, 'penalty': 'l1'}
Score from the best parameters: 0.9706119433062403


In [33]:
best_logistic = logregcv.best_estimator_

In [34]:
# Optimal logistic regression
from sklearn.metrics import classification_report
y_pred = best_logistic.predict(X_test)
print('CV accuracy score of optimized model on the training data:', np.mean(cross_val_score(best_logistic, X_train, y_train, cv=5)))
print('Accuracy score of optimized model on the training data:', best_logistic.score(X_train, y_train))
print('Accuracy score of optimized model on the test data:', best_logistic.score(X_test, y_test))

CV accuracy score of optimized model on the training data: 0.9705783857989358
Accuracy score of optimized model on the training data: 0.9703432525021831
Accuracy score of optimized model on the test data: 0.9696732231016378


In [35]:
# Optimal logistic regression
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.97      1.00      0.98     10818
           1       0.99      0.81      0.89      1943

   micro avg       0.97      0.97      0.97     12761
   macro avg       0.98      0.90      0.94     12761
weighted avg       0.97      0.97      0.97     12761



In [28]:
# Random forest without tuning
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [31]:
# Random forest without tuning
x_vali = cross_val_score(rf, X_train, y_train, cv=5)
print('Untuned Cross Validation score on training set:',x_vali.mean())
print('Untuned Score on training set:',rf.score(X_train, y_train))
print('Untuned Score on the test set:',rf.score(X_test, y_test))

Untuned Cross Validation score on training set: 0.9306447490626546
Untuned Score on training set: 0.9930476254450191
Untuned Score on the test set: 0.931666797272941


In [30]:
# Random forest without tuning
print(classification_report(y_test, rf.predict(X_test)))

              precision    recall  f1-score   support

           0       0.93      1.00      0.96     10818
           1       0.98      0.56      0.71      1943

   micro avg       0.93      0.93      0.93     12761
   macro avg       0.96      0.78      0.84     12761
weighted avg       0.94      0.93      0.92     12761



In [135]:
# Optimal random forest classifier, tuning the RF
from sklearn.model_selection import GridSearchCV
param_grid = {
    'n_estimators': [900, 1000, 1100, 1200],
    'max_features': ['auto', 'log2', None],
    'max_depth': [7, 10, None],
}
rfcv = GridSearchCV(rf, param_grid, cv=4)

In [136]:
# Optimal random forest classifier
rfcv.fit(X_train, y_train)
print('Best parameters for random forest:', rfcv.best_params_)
print('Score of random forest with best params:', rfcv.best_score_)

Best parameters for random forest: {'max_depth': None, 'max_features': None, 'n_estimators': 1000}
Score of random forest with best params: 0.9739369920064486


In [137]:
yrf_predict = rfcv.predict(X_test)

In [138]:
# Optimal random forest classifier
print(classification_report(y_test, yrf_predict))

              precision    recall  f1-score   support

           0       0.97      1.00      0.99     10818
           1       0.99      0.84      0.91      1943

   micro avg       0.97      0.97      0.97     12761
   macro avg       0.98      0.92      0.95     12761
weighted avg       0.98      0.97      0.97     12761



In [139]:
# Optimal random forest classifier
best_model = rfcv.best_estimator_

from sklearn.model_selection import cross_val_score
x_val = cross_val_score(best_model, X_train, y_train, cv=5)

In [140]:
# Optimal random forest classifier
print('Optimal Cross Validation score on training set:', x_val.mean())
print('Optimal Score on training set:', rfcv.score(X_train, y_train))
print('Optimal Score on the test set:', rfcv.score(X_test, y_test))

Optimal Cross Validation score on training set: 0.974138549167629
Optimal Score on training set: 1.0
Optimal Score on the test set: 0.9747668678003292
