In this notebook I train and cross-validate a classifier to predict credit card fraud with the highest possible "accuracy". Because the data are highly imbalanced, we need to be careful about how we measure error. I distinguish between two types of error: (a) the percentage of non-fraudulent transactions that are flagged as fraud (error of the first kind, i.e. false alarms), and (b) the percentage of frauds that are not recognized (error of the second kind, i.e. missed frauds).

In [1]:
import pandas as pd 
import numpy as np 
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("data/creditcard.csv")
df_orig = df.copy(deep = True)        # keep a copy for cross-validation

I tried several algorithms, and Logistic Regression performed best (I cannot say anything about SVC: I tried it, but had to abort the run after many hours). Note the class_weight option, which compensates for the imbalance between the classes (i.e. fraud vs. non-fraud):

In [2]:
lrn = LogisticRegression(penalty = 'l2', C= 1, class_weight='balanced' )
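
For reference, scikit-learn documents the 'balanced' mode as weighting each class by n_samples / (n_classes * number of samples in that class). The following is a minimal sketch of that formula in plain Python. The function name and the toy labels are assumptions made for illustration; they are not part of the notebook or of the credit card data.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight per class: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Toy labels (hypothetical): 98 non-fraud, 2 fraud
toy_labels = [0] * 98 + [1] * 2
weights = balanced_class_weights(toy_labels)
print(weights)
```

The rare class receives a much larger weight, so a misclassified fraud costs the optimizer far more than a misclassified regular transaction.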

Next I shuffle the data, split it into a training set (80%) and a test set (20%), train the classifier on the training set, and then evaluate it on the test set. I repeat this N times in order to get a reliable measure of the quality of the classifier. For the sake of clarity, I first define functions that I loop over afterwards.

Create training and test sets:

In [3]:
def create_sets():
    df = df_orig.copy(deep = True)
    df = df.sample(frac=1).reset_index(drop=True)        #shuffle
    
    y = df.Class.tolist()
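    df = df.drop(columns='Class')     # positional axis argument is no longer accepted by recent pandas
    X = df.to_numpy()                 # as_matrix() was removed in pandas 1.0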
    
    # create test and training set
    p = 0.2                      #fraction of test sample
    X_test = X[:int(p*len(y))]
    y_test = y[:int(p*len(y))]
    X_train = X[int(p*len(y)):]
    y_train = y[int(p*len(y)):]
    return X_test, y_test, X_train, y_train
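
With a plain random split, the number of frauds that land in the test set changes from shuffle to shuffle. One possible alternative, sketched here under assumptions, is a stratified split that keeps the class proportions the same in both sets. The helper `stratified_split`, its parameters, and the toy arrays are hypothetical; scikit-learn's `train_test_split` offers a `stratify` argument for the same purpose.

```python
import numpy as np

def stratified_split(X, y, test_fraction=0.2, seed=0):
    """Split so that each class contributes the same fraction to the test set."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    test_idx = []
    for cls in np.unique(y):
        # shuffle the row indices of this class and take the first part for testing
        idx = rng.permutation(np.flatnonzero(y == cls))
        n_test = int(round(test_fraction * len(idx)))
        test_idx.append(idx[:n_test])
    mask = np.zeros(len(y), dtype=bool)
    mask[np.concatenate(test_idx)] = True
    # same return order as create_sets()
    return X[mask], y[mask], X[~mask], y[~mask]

# Toy data (hypothetical): 1000 rows, 10 of them labelled as fraud
X_toy = np.arange(2000).reshape(1000, 2)
y_toy = np.array([1] * 10 + [0] * 990)
X_test, y_test, X_train, y_train = stratified_split(X_toy, y_toy)
print(len(y_test), int(y_test.sum()), len(y_train), int(y_train.sum()))
```

Note that this sketch picks the rows at random but returns them in their original order, which does not matter for fitting a logistic regression.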

Train the classifier and then test it. For testing, we count the errors of the first and second kind (see the first paragraph):

In [8]:
def train_test():
    X_test, y_test, X_train, y_train = create_sets()
    
    lrn.fit(X_train, y_train)
    
    y_predict = lrn.predict(X_test)

    # count the errors:
    c_0 = 0
    c_1 = 0
    for i in range(len(y_test)):
        if (y_test[i] == 0) and (y_predict[i] == 1):
            c_0 += 1
        if (y_test[i] == 1) and (y_predict[i] == 0):
            c_1 += 1

    n_fraud = np.sum(y_test)
    return (100*c_0)/(len(y_test)-n_fraud), (100*c_1)/(n_fraud)
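
The counting loop above can also be written with NumPy boolean masks, which is considerably faster on a test set of this size. This is a sketch of an equivalent computation; the function name `error_rates` and the toy labels are assumptions for illustration.

```python
import numpy as np

def error_rates(y_true, y_pred):
    """Return (first-kind, second-kind) error rates in percent."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    false_alarms = np.sum((y_true == 0) & (y_pred == 1))   # non-fraud flagged as fraud
    missed = np.sum((y_true == 1) & (y_pred == 0))         # fraud not recognized
    n_fraud = np.sum(y_true == 1)
    n_legit = np.sum(y_true == 0)
    return 100 * false_alarms / n_legit, 100 * missed / n_fraud

# Toy labels (hypothetical): 8 regular transactions, 2 frauds
y_true_toy = [0, 0, 0, 0, 1, 1, 0, 0, 0, 0]
y_pred_toy = [0, 1, 0, 0, 1, 0, 0, 0, 0, 0]
fp_rate, fn_rate = error_rates(y_true_toy, y_pred_toy)
print(fp_rate, fn_rate)   # 12.5 50.0
```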


Now we run train_test() many times in order to get good estimates of the errors. (I have used this cross-validation to compare different algorithms, and also to find the optimal C for the logistic regression.)

In [10]:
N = 10       # number of iterations
f_1 = 0      # running sum of first-kind error rates (already in percent)
f_2 = 0      # running sum of second-kind error rates (already in percent)
for n in range(N):
    a, b = train_test()
    f_1 += a
    f_2 += b

print("Error of first kind  = {}%".format(((10*f_1)//N)/10))
print("Error of second kind = {}%".format(((10*f_2)//N)/10))


Error of first kind  = 2.2%
Error of second kind = 10.2%
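
The loop above only keeps running sums, so it reports averages but not how much the errors vary between shuffles. A possible extension, sketched here, stores the result of every run and reports the mean together with the sample standard deviation. The helper `summarize_runs` and the per-run values are hypothetical and are not results from this notebook.

```python
import numpy as np

def summarize_runs(results):
    """results: list of (first_kind_pct, second_kind_pct) tuples, one per run."""
    arr = np.asarray(results, dtype=float)
    return arr.mean(axis=0), arr.std(axis=0, ddof=1)

# Hypothetical per-run values for illustration only
toy_results = [(2.0, 10.0), (2.4, 12.0), (2.2, 8.0)]
mean, std = summarize_runs(toy_results)
print("Error of first kind  = {:.1f}% +/- {:.1f}".format(mean[0], std[0]))
print("Error of second kind = {:.1f}% +/- {:.1f}".format(mean[1], std[1]))
```

In the notebook, the list could be built with `[train_test() for _ in range(N)]`.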


These error rates are not too bad, but there is still room for improvement. One goal might be to reduce the error of the first kind: false alarms on about 2% of non-fraudulent transactions can be pretty annoying in practice.
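
One way to trade false alarms against missed frauds, which I have not evaluated here, is to raise the decision threshold instead of using the default of 0.5. The sketch below works on made-up scores; in the notebook the scores would come from `lrn.predict_proba(X_test)[:, 1]`. The helper `predict_with_threshold` and all toy values are assumptions for illustration.

```python
import numpy as np

def predict_with_threshold(fraud_probability, threshold=0.5):
    """Label a transaction as fraud (1) when its score reaches the threshold."""
    return (np.asarray(fraud_probability) >= threshold).astype(int)

# Hypothetical scores and true labels, for illustration only
toy_scores = np.array([0.05, 0.60, 0.30, 0.95, 0.55, 0.80])
toy_truth = np.array([0, 0, 0, 1, 1, 1])

for t in (0.5, 0.7):
    pred = predict_with_threshold(toy_scores, t)
    false_alarms = int(np.sum((toy_truth == 0) & (pred == 1)))
    missed = int(np.sum((toy_truth == 1) & (pred == 0)))
    print("threshold={}: false alarms={}, missed frauds={}".format(t, false_alarms, missed))
```

On these toy values the higher threshold removes the false alarm but misses one fraud, which is the trade-off to expect in general.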