In this notebook, we build classification models and test them on the credit card dataset. We begin by reading the data and splitting it into a training set (70%) and a test set (30%):

In [1]:
import os
import sys

# Make the parent directory importable so that the src package can be found.
sys.path.append(os.path.dirname(os.path.abspath('.')))

from src.models.logistic import LogReg
from src.preprocessing.preprocessing import get_design_matrix, get_target_values

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

path = '../data/raw/'

X, y = get_design_matrix(path=path), get_target_values(path=path)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=6)

## The simplest model
Most clients do not default on their debt. To obtain a performance benchmark, we simply predict that no one defaults:

In [2]:
naive_predictions = np.zeros(y_test.shape)
accuracy_score(y_test, naive_predictions)

0.7746666666666666
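
The accuracy of an all-zeros prediction equals the share of non-defaulting clients in the test set. The sketch below illustrates this on made-up labels, assuming the target is encoded as 0 (no default) and 1 (default). The arrays `X_demo` and `y_demo` are hypothetical and not part of the dataset; `DummyClassifier` is scikit-learn's built-in majority-class baseline.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels: 1 = default, 0 = no default (roughly 22% positives).
rng = np.random.default_rng(0)
y_demo = (rng.random(1000) < 0.22).astype(int)
X_demo = np.zeros((y_demo.size, 1))  # the dummy model ignores the features

# Always predict the most frequent class seen during fitting (here: 0).
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
baseline_accuracy = accuracy_score(y_demo, baseline.predict(X_demo))

# Predicting all zeros is correct exactly for the non-defaulting clients.
share_non_default = 1.0 - y_demo.mean()
print(baseline_accuracy, share_non_default)
```

Any model worth keeping should beat this number, since it can be reached without looking at the features at all.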

## Logistic regression
We begin with the logistic regression implementation from scikit-learn.

In [3]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(solver='lbfgs', max_iter=500)
lr.fit(X_train, y_train)
predictions = lr.predict(X_test)
print(accuracy_score(y_test, predictions))

0.8157777777777778


Now, we will try our own implementation of logistic regression:

In [4]:
lr = LogReg()
lr.fit(X_train,
       y_train,
       iterations=1000000,
       lr=0.01,
       stochastic=True,
       batch_size=128,
       validation=True,
       validation_size=0.2,
       seed=12,
       stopping_accuracy=0.82)
predictions = lr.predict(X_test)
print('Test accuracy:', accuracy_score(y_test, predictions))

Epoch 1000
Accuracy: 0.7635714285714286
Epoch 2000
Accuracy: 0.7830952380952381
Epoch 3000
Accuracy: 0.7978571428571428
Epoch 4000
Accuracy: 0.794047619047619
Epoch 5000
Accuracy: 0.7954761904761904
Epoch 6000
Accuracy: 0.7952380952380952
Epoch 7000
Accuracy: 0.7957142857142857
Epoch 8000
Accuracy: 0.794047619047619
Epoch 9000
Accuracy: 0.795952380952381
Epoch 10000
Accuracy: 0.7952380952380952
Epoch 11000
Accuracy: 0.795
Epoch 12000
Accuracy: 0.7992857142857143
Epoch 13000
Accuracy: 0.7947619047619048
Epoch 14000
Accuracy: 0.7952380952380952
Epoch 15000
Accuracy: 0.7954761904761904
Epoch 16000
Accuracy: 0.7961904761904762
Epoch 17000
Accuracy: 0.794047619047619
Epoch 18000
Accuracy: 0.7973809523809524
Epoch 19000
Accuracy: 0.7942857142857143
Epoch 20000
Accuracy: 0.7976190476190477
Epoch 21000
Accuracy: 0.7964285714285714
Epoch 22000
Accuracy: 0.7961904761904762
Epoch 23000
Accuracy: 0.8002380952380952
Epoch 24000
Accuracy: 0.7988095238095239
Epoch 25000
Accuracy: 0.8004761904761905
...

Epoch 203000
Accuracy: 0.8173809523809524
Epoch 204000
Accuracy: 0.815952380952381
Epoch 205000
Accuracy: 0.815952380952381
Epoch 206000
Accuracy: 0.8169047619047619
Epoch 207000
Accuracy: 0.8166666666666667
Epoch 208000
Accuracy: 0.8171428571428572
Epoch 209000
Accuracy: 0.815952380952381
Epoch 210000
Accuracy: 0.8147619047619048
Epoch 211000
Accuracy: 0.815952380952381
Epoch 212000
Accuracy: 0.8161904761904762
Epoch 213000
Accuracy: 0.8173809523809524
Epoch 214000
Accuracy: 0.8176190476190476
Epoch 215000
Accuracy: 0.815952380952381
Epoch 216000
Accuracy: 0.8169047619047619
Epoch 217000
Accuracy: 0.8173809523809524
Epoch 218000
Accuracy: 0.815
Epoch 219000
Accuracy: 0.8173809523809524
Epoch 220000
Accuracy: 0.8178571428571428
Epoch 221000
Accuracy: 0.8161904761904762
Epoch 222000
Accuracy: 0.8183333333333334
Epoch 223000
Accuracy: 0.8176190476190476
Test accuracy: 0.8132222222222222


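As we see, we obtain a test accuracy of 0.8132, slightly below the 0.8158 of the scikit-learn model but clearly above the naive benchmark of 0.7747.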
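
The arguments passed to `fit` above suggest mini-batch stochastic gradient descent with early stopping on a held-out validation set. The following is a minimal sketch of that idea, not the code in `src.models.logistic`: the function `fit_logreg_sgd`, the `check_every` parameter, and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg_sgd(X, y, iterations=20000, lr=0.1, batch_size=32,
                   validation_size=0.2, stopping_accuracy=0.9,
                   check_every=100, seed=0):
    """Hypothetical mini-batch SGD with validation-based early stopping."""
    rng = np.random.default_rng(seed)
    # Hold out a validation set that is never used for gradient updates.
    idx = rng.permutation(len(X))
    n_val = int(validation_size * len(X))
    val, train = idx[:n_val], idx[n_val:]
    w, b = np.zeros(X.shape[1]), 0.0
    val_acc = 0.0
    for it in range(1, iterations + 1):
        batch = rng.choice(train, size=batch_size, replace=False)
        # Gradient of the mean log loss with respect to the logits.
        error = sigmoid(X[batch] @ w + b) - y[batch]
        w -= lr * X[batch].T @ error / batch_size
        b -= lr * error.mean()
        if it % check_every == 0:
            val_acc = np.mean((sigmoid(X[val] @ w + b) >= 0.5) == y[val])
            if val_acc >= stopping_accuracy:
                break  # stop as soon as the validation target is reached
    return w, b, val_acc, it

# Synthetic, nearly linearly separable data (not the credit card dataset).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(2000, 3))
noise = rng.normal(scale=0.3, size=2000)
y_demo = (X_demo @ np.array([2.0, -1.0, 0.5]) + noise > 0).astype(int)

w, b, val_acc, n_iter = fit_logreg_sgd(X_demo, y_demo)
print(n_iter, val_acc)
```

Because the stopping rule looks only at the validation split, the test set stays untouched until the final evaluation, which is why the reported test accuracy can differ from the last validation accuracy in the log.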

## Neural networks
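Before trying our own implementation, we train a small feed-forward network with Keras: two hidden layers of 32 ReLU units each and a single sigmoid output unit.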

In [5]:
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(32, activation='relu', input_dim=X_train.shape[1]))
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])


# Train the model, iterating on the data in batches of 64 samples
model.fit(X_train, y_train, epochs=30, batch_size=64)

probabilities = model.predict(X_test)
predictions = np.where(probabilities < 0.5, 0, 1).ravel()
print('Test accuracy:', accuracy_score(y_test, predictions))
print(predictions)
print(y_test)

Using TensorFlow backend.


Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30
Test accuracy: 0.8118888888888889
[0 0 0 ... 0 0 0]
[1 0 0 ... 1 1 0]
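
The thresholding step in the Keras cell can be checked in isolation. With a single sigmoid output unit, `model.predict` returns an array of shape `(n_samples, 1)`, so the result has to be flattened before it is compared with `y_test`. The probabilities below are made up for illustration.

```python
import numpy as np

# Hypothetical sigmoid outputs with shape (n_samples, 1).
probabilities = np.array([[0.10], [0.50], [0.49], [0.93]])

# Values below 0.5 map to class 0; everything else, including exactly 0.5, maps to class 1.
predictions = np.where(probabilities < 0.5, 0, 1).ravel()
print(predictions)        # [0 1 0 1]
print(predictions.shape)  # (4,)
```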


Now, we will try our own implementation of a neural network:

In [18]:
from src.models.neural import NeuralNetwork
nn = NeuralNetwork(
    layers = [
        {
            'neurons': 20,
            'activation': 'tanh'
        },
        {
            'neurons': 20,
            'activation': 'tanh'
        },
        {
            'neurons': 2,
            'activation': 'softmax'
        }
    ]
)
nn.fit(X_train,
       y_train,
       iterations=500000,
       batch_size=128,
       learning_rate=0.1,
       regularization=0.0091,
       validation=True,
       validation_size=0.05,
       stopping_accuracy=0.83)
predictions = nn.predict(X_test)
print('Test accuracy:', accuracy_score(y_test, predictions))

Epoch 1000
Accuracy: 0.2361904761904762
Epoch 2000
Accuracy: 0.23047619047619047
Stopped iterating, validation accuracy now is 0.8333333333333334
Test accuracy: 0.814
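
Unlike the Keras model, which ends in one sigmoid unit, this network ends in a two-neuron softmax layer. For two classes the two formulations describe the same probabilities, as the sketch below shows on made-up logits. How `nn.predict` converts the softmax output into labels is not shown in this notebook; taking the argmax, as done here, is an assumption.

```python
import numpy as np

def softmax(logits):
    # Subtract the row-wise maximum for numerical stability; the result is unchanged.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Hypothetical outputs of the last layer before the softmax, one row per client.
logits = np.array([[2.0, 0.0], [-1.0, 1.0], [0.3, 0.2]])

probs = softmax(logits)         # each row sums to 1
labels = probs.argmax(axis=1)   # pick the more probable class
print(labels)                   # [0 1 0]

# With two classes, the softmax probability of class 1 equals a sigmoid
# applied to the difference of the two logits.
p1_sigmoid = 1.0 / (1.0 + np.exp(-(logits[:, 1] - logits[:, 0])))
print(np.allclose(probs[:, 1], p1_sigmoid))  # True
```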
