# Predicting loan default

Our goal is to compare a variety of models for predicting loan default from a consumer's loan history, aiming to maximize accuracy on a chosen test set with little regard for other factors such as model interpretability, training time, etc.

### Data

Data comes from LendingClub and has been prepared and saved in 'data_final.csv'.
We use data from before 2014 for training, and use 2014 data as the test set.

In [1]:
import math
import pprint
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.utils import shuffle
from scipy.optimize import minimize, LinearConstraint

data = pd.read_csv('data_final.csv', parse_dates=['issue_d'], index_col=0)

train = data.loc[data.issue_d < 'January 2014']
test = data.loc[data.issue_d > 'December 2013']

X_cols_full = [c for c in train.columns if not c in ('Default', 'issue_d')]
X_cols_minusCat = [c for c in X_cols_full if c != 'home_ownership']
X_cols = X_cols_full

  from numpy.core.umath_tests import inner1d


### K-nearest neighbors model

The parameters we want to tune here are k (the number of neighbors to check for each point), the distance metric, and the weighting scheme. In addition, there is a categorical variable ('home_ownership') which has been coded as an integer (ranking the categories by decreasing probability of default) and we want to test whether we should include this variable.
In principle each of these parameters may have an independent effect on model accuracy, so we can test every combination (within some reasonable range of parameters). 

As an experiment, we use a small sample test set to test every possible model where the parameters lie in the following ranges:
1. $k \in \{4, 8, 12\}$
2. Four distance metrics: Minkowski metrics for $p \in \{1, 2, 4\}$ and Chebyshev metric
3. Including/excluding the categorical variable home_ownership
4. Equal weights or inverse-distance weights

In total, this gives us $3 \cdot 4 \cdot 2 \cdot 2 = 48$ models to check. We'll compute the accuracy of each model, store the parameters and accuracy in a data frame, and compare them.

In [8]:
# Sample to test models
train_sample = train.sample(n=5000, replace=False)
test_sample = test.sample(n=5000, replace=False)

X_train_full = train_sample[X_cols_full]
X_test_full = test_sample[X_cols_full]
scaler_full = MinMaxScaler()
X_train_full = scaler_full.fit_transform(X_train_full)
X_test_full = scaler_full.transform(X_test_full)

X_train_minusCat = train_sample[X_cols_minusCat]
X_test_minusCat = test_sample[X_cols_minusCat]
scaler_minusCat = MinMaxScaler()
X_train_minusCat = scaler_minusCat.fit_transform(X_train_minusCat)
X_test_minusCat = scaler_minusCat.transform(X_test_minusCat)

y_train, y_test = train_sample.Default, test_sample.Default

# Wrapper for interfacing with sklearn
class KnnModel:
    def __init__(self, metric, metric_label, k, equalWeights, includeCategorical, p=None):
        """Compute knn model, score, and dict for row in dataframe"""
        if includeCategorical:
            X_train, X_test = X_train_full, X_test_full
        else:
            X_train, X_test = X_train_minusCat, X_test_minusCat
        if equalWeights:
            weights = 'uniform'
        else:
            weights = 'distance'
            
        if p:
            self.model = KNeighborsClassifier(
                n_neighbors=k,
                metric=metric,
                p=p,
                weights=weights
            ).fit(X_train, y_train)
        else:
            self.model = KNeighborsClassifier(
                n_neighbors=k,
                metric=metric,
                weights=weights
            ).fit(X_train, y_train)
        
        self.score = self.model.score(X_test, y_test)
        self.rowDict = {
            'metric': metric_label, 
            'categorical': includeCategorical, 
            'equal': equalWeights,
            'k': k, 
            'score': self.score
        }

In [9]:
# Computing model scores
metrics = ['minkowski', 'minkowski', 'minkowski', 'chebyshev']
ps = [1, 2, 4, None]
metric_labels = ['manhattan', 'euclidean', 'minkowski (p=4)', 'chebyshev']
krange = [4, 8, 12]
            
rows = []
for metric, p, metric_label in zip(metrics, ps, metric_labels):
    for includeCategorical in (True, False):
        for equalWeights in (True, False):
            for k in krange:
                print('Computing parameters: ' + \
                      ', '.join([metric_label, str(includeCategorical), str(equalWeights), str(k)]))
                km = KnnModel(metric, metric_label, k, equalWeights, includeCategorical, p)
                print('Score = ' + str(km.score))
                rows.append(km.rowDict)

dfModels  = pd.DataFrame(columns=['metric', 'categorical', 'equal', 'k', 'score'])
for i, row in enumerate(rows):
    dfModels.loc[i] = row
dfModels.to_csv('knn_test_model_scores.csv')
print('Done')

Computing parameters: manhattan, True, True, 4
Score = 0.8578
Computing parameters: manhattan, True, True, 8
Score = 0.8604
Computing parameters: manhattan, True, True, 12
Score = 0.8616
Computing parameters: manhattan, True, False, 4
Score = 0.838
Computing parameters: manhattan, True, False, 8
Score = 0.859
Computing parameters: manhattan, True, False, 12
Score = 0.8612
Computing parameters: manhattan, False, True, 4
Score = 0.858
Computing parameters: manhattan, False, True, 8
Score = 0.8596
Computing parameters: manhattan, False, True, 12
Score = 0.8612
Computing parameters: manhattan, False, False, 4
Score = 0.8348
Computing parameters: manhattan, False, False, 8
Score = 0.8558
Computing parameters: manhattan, False, False, 12
Score = 0.8604
Computing parameters: euclidean, True, True, 4
Score = 0.8574
Computing parameters: euclidean, True, True, 8
Score = 0.86
Computing parameters: euclidean, True, True, 12
Score = 0.8612
Computing parameters: euclidean, True, False, 4
Score = 0.

Now that we have a ranking for all parameters, let's take the top 5 models, run them for a larger portion of the test set, and see which has the highest accuracy. To check that our ranking is reasonable, we'll also test the worst 5 models.

In [11]:
dfModels = pd.read_csv('knn_test_model_scores.csv', index_col=0)
dfModels.sort_values(by=['score'], inplace=True)

# Get larger sample
train_sample = train.sample(n=5000, replace=False)
test_sample = test.sample(n=15000, replace=False)

X_train_full = train_sample[X_cols_full]
X_test_full = test_sample[X_cols_full]
scaler_full = MinMaxScaler()
X_train_full = scaler_full.fit_transform(X_train_full)
X_test_full = scaler_full.transform(X_test_full)

X_train_minusCat = train_sample[X_cols_minusCat]
X_test_minusCat = test_sample[X_cols_minusCat]
scaler_minusCat = MinMaxScaler()
X_train_minusCat = scaler_minusCat.fit_transform(X_train_minusCat)
X_test_minusCat = scaler_minusCat.transform(X_test_minusCat)

y_train, y_test = train_sample.Default, test_sample.Default

# Compute models on larger sample
metric_map = {
    'minkowski (p=4)': ('minkowski', 4),
    'manhattan': ('minkowski', 1),
    'euclidean': ('minkowski', 2),
    'chebyshev': ('chebyshev', None)
}
def modelFromDfRow(row):
    params = {
        'metric': metric_map[row.metric][0],
        'p': metric_map[row.metric][1],
        'metric_label': row.metric,
        'categorical': row.categorical,
        'equal': row.equal,
        'k': row.k
    }
    print(params)
    km = KnnModel(
        metric=params['metric'], 
        metric_label=params['metric_label'],
        k=params['k'],
        equalWeights=params['equal'],
        includeCategorical=params['categorical'],
        p=params['p']
    )
    return km
    
rows = []
M = 5
for model_i in range(M):
    # Worst M models
    row = dfModels.iloc[:M].iloc[model_i]
    km = modelFromDfRow(row)
    print('Score = ' + str(km.score))
    rowDict = km.rowDict
    rowDict['rank'] = -M + model_i
    rows.append(rowDict)
for model_i in range(M):
    # Best M models
    row = dfModels.iloc[-M:].iloc[model_i]
    km = modelFromDfRow(row)
    print('Score = ' + str(km.score))
    rowDict = km.rowDict
    rowDict['rank'] = 1 + model_i
    rows.append(rowDict)
    
dfBestWorst  = pd.DataFrame(columns=['metric', 'categorical', 'equal', 'k', 'score', 'rank'])
for i, row in enumerate(rows):
    dfBestWorst.loc[i] = row
dfBestWorst.to_csv('knn_best_worst_scores.csv')

print(dfBestWorst)

{'metric': 'minkowski', 'p': 4, 'metric_label': 'minkowski (p=4)', 'categorical': False, 'equal': False, 'k': 4}
Score = 0.8342666666666667
{'metric': 'chebyshev', 'p': None, 'metric_label': 'chebyshev', 'categorical': True, 'equal': False, 'k': 4}
Score = 0.8366666666666667
{'metric': 'chebyshev', 'p': None, 'metric_label': 'chebyshev', 'categorical': False, 'equal': False, 'k': 4}
Score = 0.8359333333333333
{'metric': 'minkowski', 'p': 4, 'metric_label': 'minkowski (p=4)', 'categorical': True, 'equal': False, 'k': 4}
Score = 0.8358
{'metric': 'minkowski', 'p': 1, 'metric_label': 'manhattan', 'categorical': False, 'equal': False, 'k': 4}
Score = 0.8310666666666666
{'metric': 'chebyshev', 'p': None, 'metric_label': 'chebyshev', 'categorical': False, 'equal': True, 'k': 8}
Score = 0.8623333333333333
{'metric': 'chebyshev', 'p': None, 'metric_label': 'chebyshev', 'categorical': False, 'equal': True, 'k': 12}
Score = 0.8633333333333333
{'metric': 'minkowski', 'p': 4, 'metric_label': 'mink

Based on these results, it looks like the best model uses the Minkowski metric of order 4 with equal weighting and includes the categorical 'home_ownership' variable. Now we just need to find the best value of k.

In [12]:
for k in range(10, 20):
    km = KnnModel(
        metric='minkowski',
        p=4,
        metric_label='minkowski (p=4)',
        k=k,
        equalWeights=True,
        includeCategorical=True
    )
    print('k: ' + str(k) + ', score: ' + str(km.score))

k: 10, score: 0.8638
k: 11, score: 0.8632
k: 12, score: 0.8639333333333333
k: 13, score: 0.8638666666666667
k: 14, score: 0.8637333333333334
k: 15, score: 0.8638
k: 16, score: 0.8636666666666667
k: 17, score: 0.8637333333333334
k: 18, score: 0.8636
k: 19, score: 0.8636


Model performance peaks at k=12.

The final k-nearest neighbors model uses the Minkowski order-4 metric with equal weighting and k=12, and its accuracy on the full test data is:

In [13]:
X_train_full = train[X_cols_full]
X_test_full = test[X_cols_full]
scaler_full = MinMaxScaler()
X_train_full = scaler_full.fit_transform(X_train_full)
X_test_full = scaler_full.transform(X_test_full)

X_train_minusCat = train[X_cols_minusCat]
X_test_minusCat = test[X_cols_minusCat]
scaler_minusCat = MinMaxScaler()
X_train_minusCat = scaler_minusCat.fit_transform(X_train_minusCat)
X_test_minusCat = scaler_minusCat.transform(X_test_minusCat)

y_train, y_test = train.Default, test.Default

km = KnnModel(
    metric='minkowski', 
    metric_label='minkowski (p=4)',
    k=12,
    equalWeights=True,
    includeCategorical=True
)
print('Accuracy of final KNN model: ' + str(km.score))

Accuracy of final KNN model: 0.8646614168257926


### Logistic regression model

For our logistic regression model we consider two parameters: L1 versus L2 regularization, and the strength of regularization. We use the same approach of testing different values within the parameter range and rank them by their score on the sample test set.

In [14]:
X_cols = [c for c in train.columns if not c in ('Default', 'issue_d')]
X_train = train[X_cols]
X_test = test[X_cols]
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
y_train, y_test = train.Default, test.Default

regs = ['l1', 'l2']
Cs = [0.005, 0.01, 0.015, 0.02, 0.04, 0.07, 0.1, 0.2, 0.4, 0.7, 1]

scores = []
for reg in regs:
    for C in Cs:
        print('Computing: ' + reg + ', '+ str(C))
        model = LogisticRegression(penalty=reg, C=C).fit(X_train, y_train)
        scores.append([reg, C, model.score(X_test, y_test)])
scores = sorted(scores, key=lambda x: x[2])
print(pprint.pformat(scores))

Computing: l1, 0.005
Computing: l1, 0.01
Computing: l1, 0.015
Computing: l1, 0.02
Computing: l1, 0.04
Computing: l1, 0.07
Computing: l1, 0.1
Computing: l1, 0.2
Computing: l1, 0.4
Computing: l1, 0.7
Computing: l1, 1
Computing: l2, 0.005
Computing: l2, 0.01
Computing: l2, 0.015
Computing: l2, 0.02
Computing: l2, 0.04
Computing: l2, 0.07
Computing: l2, 0.1
Computing: l2, 0.2
Computing: l2, 0.4
Computing: l2, 0.7
Computing: l2, 1
[['l1', 0.7, 0.8650322723183992],
 ['l2', 0.7, 0.8650322723183992],
 ['l1', 1, 0.8650401628607951],
 ['l2', 1, 0.8650480534031909],
 ['l2', 0.4, 0.8650559439455868],
 ['l1', 0.4, 0.8650638344879827],
 ['l2', 0.2, 0.8650796155727745],
 ['l1', 0.2, 0.8650875061151704],
 ['l1', 0.1, 0.8650953966575663],
 ['l1', 0.07, 0.8651032871999621],
 ['l2', 0.1, 0.8651032871999621],
 ['l1', 0.04, 0.865111177742358],
 ['l1', 0.005, 0.8651190682847539],
 ['l1', 0.01, 0.8651190682847539],
 ['l2', 0.005, 0.8651190682847539],
 ['l2', 0.01, 0.8651190682847539],
 ['l2', 0.04, 0.8651190

Among the best models are two L1 models with C parameters of .015 and .02 (note: smaller C means more regularization). L1 regularization causes many of the coefficients to be set to 0, performing de facto feature selection. Let's see which features these models selected.

In [15]:
print('L1 model, C=0.015')
model = LogisticRegression(penalty='l1', C=0.015).fit(X_train, y_train)
selected_features = list(zip(X_cols, model.coef_.ravel()))
selected_features = [f for f in selected_features if f[1] != 0]
print(pprint.pformat(selected_features))

print('----')

print('L1 model, C=0.02')
model = LogisticRegression(penalty='l1', C=0.02).fit(X_train, y_train)
selected_features = list(zip(X_cols, model.coef_.ravel()))
selected_features = [f for f in selected_features if f[1] != 0]
print(pprint.pformat(selected_features))

L1 model, C=0.015
[('home_ownership', -0.5952603981947261),
 ('acc_open_past_24mths', 1.5062144613709292),
 ('dti', 0.7050401115210349),
 ('inq_last_6mths', 0.25749025235936035),
 ('mo_sin_old_il_acct', -0.06004876133166225),
 ('mo_sin_old_rev_tl_op', -0.6473424652380454),
 ('mo_sin_rcnt_rev_tl_op', -0.012564525137396348),
 ('mths_since_recent_inq', -0.4666571562074985),
 ('num_il_tl', -0.518020756977413),
 ('num_tl_op_past_12m', 0.2654490310199626),
 ('pct_tl_nvr_dlq', -0.3661138397111568),
 ('percent_bc_gt_75', 0.45736388854062354),
 ('total_acc', -0.2717059913656659)]
----
L1 model, C=0.02
[('home_ownership', -0.5866106404340918),
 ('acc_open_past_24mths', 1.6050420403617753),
 ('dti', 0.7498210116131567),
 ('inq_last_6mths', 0.35396742895733146),
 ('mo_sin_old_il_acct', -0.07408169743568527),
 ('mo_sin_old_rev_tl_op', -0.6979957453146806),
 ('mo_sin_rcnt_rev_tl_op', -0.19318363040245998),
 ('mths_since_recent_inq', -0.42197501842372487),
 ('num_il_tl', -0.5548406870316974),
 ('num_

The features are almost the same, except for the variable 'num_rev_tl_bal_gt_0' (number of revolving trades with balance greater than zero). Both models have the same accuracy on the test set so for sake of parsimony we use the C=0.015 model with one fewer variable.

In [17]:
logModel = LogisticRegression(penalty='l1', C=0.015).fit(X_train, y_train)
print('Accuracy of final Logistic Regression model: ' + str(logModel.score(X_test, y_test)))

Accuracy of final Logistic Regression model: 0.8651269588271497


It's worth noting that the logistic regression model offers slightly better performance than the k-nearest neighbors model (86.51% compared to 86.49%) while being much faster to use, performing feature selection, and yielding insight into the importance of each variable in the prediction.

### Gradient boosting model

Now we'll try a gradient boosting model using a sequence of small decision trees. The parameters to be adjusted are the loss function, the learning rate (which weights contributions from each tree), the maximum tree depth, and the number of trees.

In [4]:
train_sample = train.sample(n=10000, replace=False)
test_sample = test.sample(n=10000, replace=False)

X_train, X_test = train_sample[X_cols], test_sample[X_cols]
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
y_train, y_test = train_sample.Default, test_sample.Default

class GBModel:
    def __init__(self, lossOption, learningRate, treeDepth, treeNum):
        """Compute gradient boosting model, score, and dict for row in dataframe"""
        self.model = GradientBoostingClassifier(
            loss=lossOption,
            learning_rate=learningRate,
            n_estimators=treeNum,
            subsample=0.5,
            max_depth=treeDepth
        ).fit(X_train, y_train)
        self.score = self.model.score(X_test, y_test)
        self.rowDict = {
            'loss': lossOption,
            'learn_rate': learningRate,
            'tree_depth': treeDepth,
            'tree_num': treeNum,
            'score': self.score
        }

In [20]:
lossOptions = ['deviance', 'exponential']
learningRates = [1, 0.5, 0.3, 0.2, 0.1, 0.05]
treeDepths = [2, 3, 4, 5]
treeNums = [80, 90, 100, 110, 120]

rows = []
for lossOption in lossOptions:
    for learningRate in learningRates:
        for treeDepth in treeDepths:
            for treeNum in treeNums:
                print(
                    'Computing parameters: ' + \
                    ', '.join( [lossOption, str(learningRate), str(treeDepth), str(treeNum)] )
                )
                gbm = GBModel(lossOption, learningRate, treeDepth, treeNum)
                print('Score = ' + str(gbm.score))
                rows.append(gbm.rowDict)

dfModels  = pd.DataFrame(columns=['loss', 'learn_rate', 'tree_depth', 'tree_num', 'score'])
for i, row in enumerate(rows):
    dfModels.loc[i] = row
dfModels.to_csv('gbm_test_model_scores.csv')
print('Done')

Computing parameters: deviance, 1, 2, 80
Score = 0.8342
Computing parameters: deviance, 1, 2, 90
Score = 0.8196666666666667
Computing parameters: deviance, 1, 2, 100
Score = 0.8235333333333333
Computing parameters: deviance, 1, 2, 110
Score = 0.8384666666666667
Computing parameters: deviance, 1, 2, 120
Score = 0.8076
Computing parameters: deviance, 1, 3, 80
Score = 0.7777333333333334
Computing parameters: deviance, 1, 3, 90
Score = 0.7885333333333333
Computing parameters: deviance, 1, 3, 100
Score = 0.7811333333333333
Computing parameters: deviance, 1, 3, 110
Score = 0.7972666666666667
Computing parameters: deviance, 1, 3, 120
Score = 0.7794666666666666
Computing parameters: deviance, 1, 4, 80
Score = 0.7592
Computing parameters: deviance, 1, 4, 90
Score = 0.7744
Computing parameters: deviance, 1, 4, 100
Score = 0.7644
Computing parameters: deviance, 1, 4, 110
Score = 0.7842
Computing parameters: deviance, 1, 4, 120
Score = 0.7703333333333333
Computing parameters: deviance, 1, 5, 80
Sc

Score = 0.8162666666666667
Computing parameters: exponential, 1, 2, 120
Score = 0.806
Computing parameters: exponential, 1, 3, 80
Score = 0.8101333333333334
Computing parameters: exponential, 1, 3, 90
Score = 0.8070666666666667
Computing parameters: exponential, 1, 3, 100
Score = 0.8040666666666667
Computing parameters: exponential, 1, 3, 110
Score = 0.7984
Computing parameters: exponential, 1, 3, 120
Score = 0.7938666666666667
Computing parameters: exponential, 1, 4, 80
Score = 0.8078
Computing parameters: exponential, 1, 4, 90
Score = 0.8
Computing parameters: exponential, 1, 4, 100
Score = 0.8015333333333333
Computing parameters: exponential, 1, 4, 110
Score = 0.7987333333333333
Computing parameters: exponential, 1, 4, 120
Score = 0.8024
Computing parameters: exponential, 1, 5, 80
Score = 0.8096
Computing parameters: exponential, 1, 5, 90
Score = 0.8006
Computing parameters: exponential, 1, 5, 100
Score = 0.8101333333333334
Computing parameters: exponential, 1, 5, 110
Score = 0.8037

In [21]:
dfModels = pd.read_csv('gbm_test_model_scores.csv', index_col=0)
dfModels.sort_values(by=['score'], inplace=True)

print(dfModels)
print('----')
print(dfModels['score'].describe())

            loss  learn_rate  tree_depth  tree_num     score
16      deviance        1.00           5        90  0.741133
18      deviance        1.00           5       110  0.758333
10      deviance        1.00           4        80  0.759200
12      deviance        1.00           4       100  0.764400
15      deviance        1.00           5        80  0.765267
19      deviance        1.00           5       120  0.766400
14      deviance        1.00           4       120  0.770333
11      deviance        1.00           4        90  0.774400
5       deviance        1.00           3        80  0.777733
17      deviance        1.00           5       100  0.777733
9       deviance        1.00           3       120  0.779467
7       deviance        1.00           3       100  0.781133
39      deviance        0.50           5       120  0.781267
37      deviance        0.50           5       100  0.784133
13      deviance        1.00           4       110  0.784200
6       deviance        

Some trends are immediately obvious: models with lower learning rates and exponential loss functions perform better. It seems that shallower trees also produce better performance, but the effect of tree number is not as clear. 

Now we train the worst and best 10 models on the full data set and compare.

In [22]:
train_sample = train.sample(n=20000, replace=False)
test_sample = test.sample(n=20000, replace=False)

X_train, X_test = train_sample[X_cols], test_sample[X_cols]
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
y_train, y_test = train_sample.Default, test_sample.Default

dfModels = pd.read_csv('gbm_test_model_scores.csv', index_col=0)
dfModels.sort_values(by=['score'], inplace=True)

dfWorst = dfModels.iloc[:10]
dfBest = dfModels.iloc[-10:]

print(dfWorst)
print(dfBest)

def scoreFromRow(row):
    print('Computing row:')
    print(row)
    gbm = GBModel(row.loss, row.learn_rate, row.tree_depth, row.tree_num)
    return gbm.score

dfBoth = pd.concat([dfWorst, dfBest])
dfBoth['full_score'] = dfBoth.apply(scoreFromRow, axis=1)

print('----')
print('Scores on larger data:')
print(dfBoth)

        loss  learn_rate  tree_depth  tree_num     score
16  deviance         1.0           5        90  0.741133
18  deviance         1.0           5       110  0.758333
10  deviance         1.0           4        80  0.759200
12  deviance         1.0           4       100  0.764400
15  deviance         1.0           5        80  0.765267
19  deviance         1.0           5       120  0.766400
14  deviance         1.0           4       120  0.770333
11  deviance         1.0           4        90  0.774400
5   deviance         1.0           3        80  0.777733
17  deviance         1.0           5       100  0.777733
            loss  learn_rate  tree_depth  tree_num     score
225  exponential        0.05           3        80  0.863667
222  exponential        0.05           2       100  0.863667
230  exponential        0.05           4        80  0.863667
232  exponential        0.05           4       100  0.863733
233  exponential        0.05           4       110  0.863733
223  ex

In this case there was a tie for the top-scoring models on the larger test set, so we train the top 3 models on the full data set and use the one with the best performance.

In [5]:
X_train, X_test = train[X_cols], test[X_cols]
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
y_train, y_test = train.Default, test.Default

gbms = [
    GBModel('exponential', 0.05, 2, 100),
    GBModel('exponential', 0.10, 2, 110),
    GBModel('exponential', 0.10, 3, 90)
]
gbms = sorted(gbms, key=lambda x: x.score)
best_gbm = gbms[-1]
print('Parameters of final Gradient Boosting model:')
print(best_gbm.rowDict)
print('Accuracy of final Gradient Boosting model: ' + str(best_gbm.score))

Parameters of final Gradient Boosting model:
{'loss': 'exponential', 'learn_rate': 0.05, 'tree_depth': 2, 'tree_num': 100, 'score': 0.8651190682847539}
Accuracy of final Gradient Boosting model: 0.8651190682847539


### Stacked model

This model uses the prediction of the Logistic Regression model as a feature in a new Gradient Boosting classifier. The simplest way to implement this is to simply add the predictions from the Logistic Regression model to the data sets and repeat the above steps to choose a new Gradient Boosting classifier. For simplicity, we'll just check the top 10 Gradient Boosting models.

In [26]:
N = len(dfBest.index)
stackedScores = []
for i in range(N):
    row = dfBest.iloc[i]
    
    # Get data to feed into log regression
    X_train, X_test = train_sample[X_cols], test_sample[X_cols]
    scaler = MinMaxScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    y_train, y_test = train_sample.Default, test_sample.Default

    # Fit log regression
    logModel = LogisticRegression(penalty='l1', C=0.015).fit(X_train, y_train)

    # Append predictions to train/test, scale
    X_train, X_test = train_sample[X_cols], test_sample[X_cols]
    X_train['log_prediction'] = logModel.predict(X_train)
    X_test['log_prediction'] = logModel.predict(X_test)
    scaler = MinMaxScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

    # Fit GB model using new feature
    gbModel = GradientBoostingClassifier(
        loss=row.loss,
        learning_rate=row.learn_rate,
        n_estimators=row.tree_num,
        subsample=0.5,
        max_depth=row.tree_depth
    ).fit(X_train, y_train)
    
    score = gbModel.score(X_test, y_test)
    print('Score: ' + str(score))
    stackedScores.append(score)
    
dfBest['stacked_score'] = stackedScores
print(dfBest)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


Score: 0.8653
Score: 0.8653
Score: 0.8653
Score: 0.8654
Score: 0.8653
Score: 0.8653
Score: 0.8653
Score: 0.86535
Score: 0.8652
Score: 0.86535
            loss  learn_rate  tree_depth  tree_num     score  stacked_score
225  exponential        0.05           3        80  0.863667        0.86530
222  exponential        0.05           2       100  0.863667        0.86530
230  exponential        0.05           4        80  0.863667        0.86530
232  exponential        0.05           4       100  0.863733        0.86540
233  exponential        0.05           4       110  0.863733        0.86530
223  exponential        0.05           2       110  0.863733        0.86530
100     deviance        0.05           2        80  0.863733        0.86530
227  exponential        0.05           3       100  0.863800        0.86535
203  exponential        0.10           2       110  0.863800        0.86520
201  exponential        0.10           2        90  0.864000        0.86535


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


In [27]:
bestGbms = [
    {'loss': 'deviance', 'learn': 0.05, 'depth': 2, 'num': 90},
    {'loss': 'exponential', 'learn': 0.10, 'depth': 3, 'num': 90},
    {'loss': 'exponential', 'learn': 0.05, 'depth': 2, 'num': 100},
]
for gbm in bestGbms:
    
    # Get data to feed into log regression
    X_train, X_test = train[X_cols], test[X_cols]
    scaler = MinMaxScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    y_train, y_test = train.Default, test.Default

    # Fit log regression
    logModel = LogisticRegression(penalty='l1', C=0.015).fit(X_train, y_train)

    # Append predictions to train/test, scale
    X_train, X_test = train[X_cols], test[X_cols]
    X_train['log_prediction'] = logModel.predict(X_train)
    X_test['log_prediction'] = logModel.predict(X_test)
    scaler = MinMaxScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

    # Fit GB model using new feature
    gbModel = GradientBoostingClassifier(
        loss=gbm['loss'],
        learning_rate=gbm['learn'],
        n_estimators=gbm['num'],
        subsample=0.5,
        max_depth=gbm['depth']
    ).fit(X_train, y_train)
    
    score = gbModel.score(X_test, y_test)
    print('Score: ' + str(score))
    gbm['score'] = score

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


Score: 0.8651190682847539
Score: 0.8650638344879827
Score: 0.8651190682847539


In [24]:
print(bestGbms)
bestGbms = sorted(bestGbms, key=lambda x: x['score'])
print('Accuracy of best stacked model (first attempt): ' + str(bestGbms[-1]['score']))

[{'loss': 'deviance', 'learn': 0.05, 'depth': 2, 'num': 90, 'score': 0.8651190682847539}, {'loss': 'exponential', 'learn': 0.1, 'depth': 3, 'num': 90, 'score': 0.8651032871999621}, {'loss': 'exponential', 'learn': 0.05, 'depth': 2, 'num': 100, 'score': 0.8651190682847539}]
Accuracy of best stacked model (first attempt): 0.8651190682847539


Our stacked model failed to produce an increase in accuracy over the best Gradient Boosting model. One problem could be that the linear model and the Gradient Boosting model both train with the same data. Instead, we can try randomly splitting the data in half and having each model train on a different half.

In [28]:
# Shuffle then split training data
train_sample = shuffle(train)
S = len(train_sample.index)
train_sample_1 = train_sample.iloc[:math.floor(S/2)]
train_sample_2 = train_sample.iloc[math.floor(S/2):]

N = len(dfBest.index)
stackedScores2 = []
for i in range(N):
    row = dfBest.iloc[i]
    
    # Get data to feed into log regression (from train_sample_1)
    X_train, X_test = train_sample_1[X_cols], test_sample[X_cols]
    scaler = MinMaxScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    y_train, y_test = train_sample_1.Default, test_sample.Default

    # Fit log regression
    logModel = LogisticRegression(penalty='l1', C=0.015).fit(X_train, y_train)

    # Append predictions to train/test for GB model (train_sample_2)
    X_train, X_test = train_sample_2[X_cols], test_sample[X_cols]
    X_train['log_prediction'] = logModel.predict(X_train)
    X_test['log_prediction'] = logModel.predict(X_test)
    scaler = MinMaxScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    y_train, y_test = train_sample_2.Default, test_sample.Default

    # Fit GB model using new feature
    gbModel = GradientBoostingClassifier(
        loss=row.loss,
        learning_rate=row.learn_rate,
        n_estimators=row.tree_num,
        subsample=0.5,
        max_depth=row.tree_depth
    ).fit(X_train, y_train)
    
    
    score = gbModel.score(X_test, y_test)
    print('Score: ' + str(score))
    stackedScores2.append(score)
    
dfBest['stacked_score_2'] = stackedScores2
print(dfBest)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


Score: 0.8653
Score: 0.8653
Score: 0.8653
Score: 0.8653
Score: 0.86525
Score: 0.8653
Score: 0.8653
Score: 0.8653
Score: 0.86525
Score: 0.8653
            loss  learn_rate  tree_depth  tree_num     score  stacked_score  \
225  exponential        0.05           3        80  0.863667        0.86530   
222  exponential        0.05           2       100  0.863667        0.86530   
230  exponential        0.05           4        80  0.863667        0.86530   
232  exponential        0.05           4       100  0.863733        0.86540   
233  exponential        0.05           4       110  0.863733        0.86530   
223  exponential        0.05           2       110  0.863733        0.86530   
100     deviance        0.05           2        80  0.863733        0.86530   
227  exponential        0.05           3       100  0.863800        0.86535   
203  exponential        0.10           2       110  0.863800        0.86520   
201  exponential        0.10           2        90  0.864000        

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


In [13]:
print('Accuracy of best stacked model (2nd attempt): ' + str(max(dfBest.stacked_score_2)))

Accuracy of best stacked model (2nd attempt): 0.86455


Unfortunately this change resulted in slightly worse models. Therefore we will preserve the first attempt as our best stacked model.

In [8]:
# Data for log regression
X_train, X_test = train[X_cols], test[X_cols]
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
y_train, y_test = train.Default, test.Default

# Fit log regression
logModel = LogisticRegression(penalty='l1', C=0.015).fit(X_train, y_train)

# Append predictions to train/test for GB model (train_sample_2)
X_train, X_test = train[X_cols], test[X_cols]
X_train['log_prediction'] = logModel.predict(X_train)
X_test['log_prediction'] = logModel.predict(X_test)
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
y_train, y_test = train.Default, test.Default

gbm = GradientBoostingClassifier(
    loss='exponential', learning_rate=0.05, n_estimators=100, subsample=0.5, max_depth=2
).fit(X_train, y_train)

# Save stacked model predictions for later
stackedPredictions = gbm.predict(X_test)

print('Accuracy of selected stacked model: ' + str(gbm.score(X_test, y_test)))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  del sys.path[0]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


Accuracy of selected stacked model: 0.8651190682847539


### Blended model

Another way to combine models is by taking a weighted average of votes from a collection of models. We use strong and weak versions of the logistic regression, k-nearest neighbors, and gradient boosting models as the base model.

In [30]:
# Data for base model predictions
X_train, X_test = train[X_cols], test[X_cols]
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
y_train, y_test = train.Default, test.Default

# Selected base models
baseModels = [
    LogisticRegression(penalty='l2', C=0.7),
    LogisticRegression(penalty='l1', C=0.015),
    GradientBoostingClassifier(
        loss='deviance', learning_rate=1, n_estimators=120, subsample=0.5, max_depth=5
    ),
    GradientBoostingClassifier(
        loss='exponential', learning_rate=0.05, n_estimators=100, subsample=0.5, max_depth=2
    ),
    KNeighborsClassifier(n_neighbors=12, metric='minkowski', p=2, weights='uniform'),
    KNeighborsClassifier(n_neighbors=19, metric='chebyshev', weights='distance')
]
N = len(baseModels)

In [None]:
# Train models, get predictions
print('Training base models')
baseModels_trained = []
for m in enumerate(baseModels):
    baseModels_trained.append(m.fit(X_train, y_train))
baseModels = 
print('Getting base model predictions')
baseTrainPredictions = [m.predict(X_train) for m in baseModels]
baseTestPredictions = [m.predict(X_test) for m in baseModels]

# Data for exterior model
print('Saving')
train_ext = pd.DataFrame()
for i, col in enumerate(baseTrainPredictions):
    train_ext['model_'+str(i)] = col
train_ext['Default'] = train['Default'].astype('int').values
    
test_ext = pd.DataFrame()
for i, col in enumerate(baseTestPredictions):
    test_ext['model_'+str(i)] = col
test_ext['Default'] = test['Default'].astype('int').values

train_ext.to_csv('blended_data_train.csv')
test_ext.to_csv('blended_data_test.csv')

In [10]:
train_ext = pd.read_csv('blended_data_train.csv', index_col=0)
test_ext = pd.read_csv('blended_data_test.csv', index_col=0)

X_cols_ext = [c for c in train_ext.columns if not c=='Default']
X_train, X_test = train_ext[X_cols_ext], test_ext[X_cols_ext]
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
y_train, y_test = train_ext.Default, test_ext.Default

# Optimize weights
def neg_score(weights):
    # To predict: for each row, multiply weights by row values and check if > 0.5
    ext_predictions = np.apply_along_axis(lambda x: sum(weights*x) > 0.5, 1, X_test)
    return -sum(np.equal(ext_predictions, y_test))/len(y_test)

# Minimize function neg_score over all possible weights
N = len(baseModels)
optResult = minimize(
    fun = neg_score,
    x0 = [1/N]*N,  #initial guess: uniform weights
    method = 'SLSQP',
    bounds = [(-10, 10)]*N,  #bounds on weights (no reason for magnitude > 10)
    constraints = [
        {'type': 'eq', 'fun': lambda x: sum(x)-1}  #weights sum to 1
    ]
)

def test_score(weights):
    ext_predictions = np.apply_along_axis(lambda x: sum(weights*x) > 0.5, 1, X_test)
    return sum(np.equal(ext_predictions, y_test))/len(y_test)

print(optResult.message)
if optResult.success:
    print('Optimal weights:')
    print(optResult.x)
    print('Score:')
    print(test_score(optResult.x))

Optimization terminated successfully.
Optimal weights:
[0.16674805 0.17626954 0.16674805 0.16674805 0.15673827 0.16674805]
Score:
0.8651269588271497


This blended model is not an improvement over the best individual model. In fact, we can try many different initial values and optimization will never improve performance over the best base model. This is because the input is really categorical: each model predits either 0 or 1 and there are 6 models, hence there are $2^6 = 64$ inputs. The weighted average is simply a way to map each of these 64 inputs to a prediction (0 or 1).

It's possible that we could fix this by increasing the number of models, creating more possible functions for the optimization procedure to move through so that it could approach a better function by small steps. Instead, we build our own prediction system which computes the probability of default given a particular input string and returns 1 if that probability is greater than 0.5.

In [13]:
def allBinaryStrings(n):
    """Return all binary strings of length n (as list of lists of bits)"""
    if n > 1:
        return [[0]+s for s in allBinaryStrings(n-1)] + [[1]+s for s in allBinaryStrings(n-1)]
    else:
        return [[0], [1]]

    
class BinaryStringClassifier:
    def __init__(self):
        pass
    
    def fit(self, X, y):
        # Set dimensions
        self.inputLength = X.shape[1]
        if X.shape[0] != y.shape[0]:
            raise Exception('Shape mismatch: ' + str(X.shape) + ' ' + str(y.shape))
        # Create counting dict (initialize with all binary strings)
        bStrings = allBinaryStrings(self.inputLength)
        bStrings = [''.join([str(int(t)) for t in bStr]) for bStr in bStrings]
        self.counts = {
            bStr: {'true_count': 0, 'total': 0, 'prob': 0} 
            for bStr in bStrings
        }
        # Fill dict, store count == 1 and total count
        X_bStrings = np.apply_along_axis(lambda x: ''.join([str(int(t)) for t in x]), 1, X)
        for bStr, yy in zip(X_bStrings, y):
            self.counts[bStr]['total'] += 1
            if yy == 1:
                self.counts[bStr]['true_count'] += 1
        # Compute probabilities, predictMap
        self.predictMap = {bStr: 0 for bStr in bStrings}
        for bStr, store in self.counts.items():
            if store['total'] > 0:
                store['prob'] = store['true_count']/store['total']
            if store['prob'] > 0.5:
                self.predictMap[bStr] = 1        
    
    def predict(self, X):
        """Return predicted values for rows of X"""
        if X.shape[1] != self.inputLength:
            raise Exception('Bad input length: ' + str(X.shape) + ', need ' + str(self.inputLength))
        return np.apply_along_axis(
            lambda x: self.predictMap[ ''.join([str(int(t)) for t in x]) ], 1, X
        )
    
    def score(self, X, y):
        """Return accuracy of predictions for X compared to y"""
        if X.shape[1] != self.inputLength:
            raise Exception('Bad input length: ' + str(X.shape) + ', need ' + str(self.inputLength))
        if X.shape[0] != y.shape[0]:
            raise Exception('Shape mismatch: ' + str(X.shape) + ' ' + str(y.shape))
        return sum(np.equal(self.predict(X), y))/len(y)  

In [17]:
train_ext = pd.read_csv('blended_data_train.csv', index_col=0)
test_ext = pd.read_csv('blended_data_test.csv', index_col=0)

X_cols_ext = [c for c in train_ext.columns if not c=='Default']
X_train, X_test = train_ext[X_cols_ext], test_ext[X_cols_ext]
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
y_train, y_test = train_ext.Default, test_ext.Default

bsc = BinaryStringClassifier()
bsc.fit(X_train, y_train)
bsc.score(X_test, y_test)

0.8649218047248568

This classifier performs slightly worse than the best base model. I think we can attribute this to overfitting: the model relies solely on the probability of default given an input string based solely on the probability meaured from the training data. A modification to this approach would be to have the model continuously update the counts and probability for each new test input it makes, and use this data in further inputs. Thus, the model trains itself after each prediction.

In [14]:
class BinaryStringUpdateClassifier(BinaryStringClassifier):
    def __init__(self):
        super().__init__()
        
    def predictAndUpdate(self, X_row, y_val):
        """Make prediction for X_row, update model, and return prediction"""
        bStr = ''.join([str(int(t)) for t in X_row])
        prediction = self.predictMap[bStr]
        store = self.counts[bStr]
        store['total'] += 1
        if y_val == 1:
            store['true_count'] += 1
        store['prob'] = store['true_count']/store['total']
        self.counts[bStr] = store
        if store['prob'] > 0.5:
            self.predictMap[bStr] = 1
        else:
            self.predictMap[bStr] = 0
        return prediction
        
    def score(self, X, y):
        """Return predicted values for rows of X; update model after each prediction"""
        if X.shape[1] != self.inputLength:
            raise Exception('Bad input length: ' + str(X.shape) + ', need ' + str(self.inputLength))
        predictions = [self.predictAndUpdate(X[i], y[i]) for i in range(X.shape[0])]
        return {'score': sum(np.equal(predictions, y))/len(y), 'predictions': predictions} 

In [46]:
bsc = BinaryStringUpdateClassifier()
bsc.fit(X_train, y_train)
print(bsc.score(X_test, y_test)['score'])

0.8649218047248568


Although the update mechanism was tested and works as expected, it did not change the predictions. This is probably because the effect is too weak, i.e. the probability is too heavily weighted by past values. So instead, we will give new values more weight when updating the model by having them directly "pull" the probability toward 1 or 0.

In [15]:
class BinaryStringUpdateClassifier(BinaryStringClassifier):
    def __init__(self, inverseUpdateWeight):
        super().__init__()
        if inverseUpdateWeight < 1:
            raise Exception('inverseUpdateWeight must be >= 1 (was ' + str(inverseUpdateWeight) + ')')
        self.inverseUpdateWeight = inverseUpdateWeight
        
    def predictAndUpdate(self, X_row, y_val):
        """Make prediction for X_row, update model, and return prediction"""
        bStr = ''.join([str(int(t)) for t in X_row])
        prediction = self.predictMap[bStr]
        store = self.counts[bStr]
        w = self.inverseUpdateWeight
        if y_val == 1:
            store['prob'] = store['prob'] + (1 - store['prob'])/w
        else:
            store['prob'] = store['prob'] + (0 - store['prob'])/w
        self.counts[bStr] = store
        if store['prob'] > 0.5:
            self.predictMap[bStr] = 1
        else:
            self.predictMap[bStr] = 0
        return prediction
        
    def score(self, X, y):
        """Return predicted values for rows of X; update model after each prediction"""
        if X.shape[1] != self.inputLength:
            raise Exception('Bad input length: ' + str(X.shape) + ', need ' + str(self.inputLength))
        predictions = [self.predictAndUpdate(X[i], y[i]) for i in range(X.shape[0])]
        return {'score': sum(np.equal(predictions, y))/len(y), 'predictions': predictions} 

In [21]:
optimalWeight = (1, 0.0)
for weight in range(1, 21):
    print('Inverse update weight: ' + str(weight))
    bsc = BinaryStringUpdateClassifier(weight)
    bsc.fit(X_train, y_train)
    score = bsc.score(X_test, y_test)['score']
    if score > optimalWeight[1]:
        optimalWeight = (weight, score)

print('Optimal weight/score:')
print(optimalWeight)

# Save predictions for later
bsc = BinaryStringUpdateClassifier(optimalWeight[0])
bsc.fit(X_train, y_train)
blendedPredictions = bsc.score(X_test, y_test)['predictions']

Inverse update weight: 1
Inverse update weight: 2
Inverse update weight: 3
Inverse update weight: 4
Inverse update weight: 5
Inverse update weight: 6
Inverse update weight: 7
Inverse update weight: 8
Inverse update weight: 9
Inverse update weight: 10
Inverse update weight: 11
Inverse update weight: 12
Inverse update weight: 13
Inverse update weight: 14
Inverse update weight: 15
Inverse update weight: 16
Inverse update weight: 17
Inverse update weight: 18
Inverse update weight: 19
Inverse update weight: 20
Optimal weight/score:
(13, 0.8650322723183992)


The new update mechanism squeezed a little bit more accuracy out of the model, but unfortunately does not quite match the performance of the best base model (about .8651).

### Comparison

In [50]:
# Collect all models and their predictions, save to csv
X_train, X_test = train[X_cols], test[X_cols]
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
y_train, y_test = train.Default, test.Default

print('Fitting models')
knnModel = KNeighborsClassifier(
    n_neighbors=12,
    metric='minkowski',
    p=4,
    weights='uniform'
).fit(X_train, y_train)
logModel = LogisticRegression(penalty='l1', C=0.015).fit(X_train, y_train)
gbmModel = GradientBoostingClassifier(
    loss='exponential',
    learning_rate=0.05,
    n_estimators=100,
    subsample=0.5,
    max_depth=2
).fit(X_train, y_train)
# Already have stackedPredictions, blendedPredictions

print('Predictions')
dfPredictions = pd.DataFrame(columns=['knn', 'log', 'gbm', 'stacked', 'blended', 'y'])
dfPredictions['stacked'] = stackedPredictions
dfPredictions['blended'] = blendedPredictions
dfPredictions['y'] = y_test.values
dfPredictions['log'] = logModel.predict(X_test)
print('Knn predictions')
dfPredictions['knn'] = knnModel.predict(X_test)
print('Gbm predictions')
dfPredictions['gbm'] = gbmModel.predict(X_test)
dfPredictions.to_csv('summary_predictions.csv')
print('Done')

Fitting models
Predictions
Knn predictions
Gbm predictions
Done


In [52]:
dfPredictions = pd.read_csv('summary_predictions.csv', index_col=0)

def modelMetrics(modelCode):
    """Return accuracy, true positive rate, false positive rate"""
    y = dfPredictions['y'].values
    mp = dfPredictions[modelCode].values
    metrics = {}
    metrics['accuracy'] = sum(np.equal(mp, y))/len(y)
    mp_posCheck = mp[np.where(y == 1)]
    mp_negCheck = mp[np.where(y == 0)]
    metrics['tpr'] = sum(mp_posCheck == 1)/len(mp_posCheck) # mp==1 when y==1
    metrics['fpr'] = sum(mp_negCheck == 1)/len(mp_negCheck) # mp==1 when y==0
    return metrics

# Print table of model info
dfComp = pd.DataFrame(columns=['model', 'accuracy', 'tpr', 'fpr'])
models = ['K-nearest neighbors', 'Logistic regression', 'Gradient boosting', 'Stacked', 'Blended']
codes = ['knn', 'log', 'gbm', 'stacked', 'blended']
metrics = [modelMetrics(modelCode=c) for c in codes]
for i, (model, metric) in enumerate(zip(models, metrics)):
    dfComp.loc[i] = {
        'model': model,
        'accuracy': metric['accuracy'],
        'tpr': metric['tpr'],
        'fpr': metric['fpr']
    }
print(dfComp)

                 model  accuracy       tpr       fpr
0  K-nearest neighbors  0.864646  0.001814  0.000830
1  Logistic regression  0.865127  0.000059  0.000000
2    Gradient boosting  0.865119  0.000000  0.000000
3              Stacked  0.865119  0.000000  0.000000
4              Blended  0.865032  0.000644  0.000201


The best (most accurate) model is the Logistic Regression model, with accuracy of 86.5127%. But the other models are extremely close: the K-Nearest Neighbors model has the lowest accuracy at 86.4646%.

A final note: although we selected models to maximize accuracy, in some cases we may be interested in different performance metrics. For example, if we want to make sure the model detects most of the defaults, then we would prefer the K-nearest neighbors model, which has the highest true positive rate. If, furthermore, there is a significant cost associated to false positives, then we might prefer the Blended model, which has the highest ratio of true positives to false positives.