In [1]:
import pandas as pd
import numpy as np
import scipy
import random
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn import linear_model
import statsmodels.formula.api as smf
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score

from sklearn.utils import resample

from sklearn.naive_bayes import BernoulliNB

from sklearn import ensemble
from sklearn.svm import SVC

# Predicting Credit Card Fraud

I want to make a model that can predict whether a credit card purchase is fraudulent.

I will use anonymized credit card purchase data to build the model.

In [2]:
# raw string so the backslashes in the Windows path are not read as escapes
fraud = pd.read_csv(r'C:\Code\Data\creditcard.csv')
print(fraud.shape)
display(fraud.head())

(284807, 31)


Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [3]:
display(fraud.Class.value_counts())

0    284315
1       492
Name: Class, dtype: int64

The features have been left nameless to protect identities, as part of the data protection measures this company takes. As such, it will be difficult to perform additional feature analysis. Instead, I will try several different models until I find the one best suited for the data.

I can also see that the classes are extremely imbalanced. There are more than 280,000 reputable cases and only 492 fraudulent ones.

### Resampling Classes and Creating Test Data

I will resample my data so that the classes are balanced. I will upsample fraudulent purchases and downsample real purchases until there are 1000 of each, 2000 data points in total.

In [4]:
real = fraud[fraud.Class==0]
fake = fraud[fraud.Class==1]
 
# Downsample real purchases class (without replacement, so no row repeats)
real_downsampled = resample(real, replace=False, n_samples=1000)
# Upsample fake purchases class (with replacement, since only 492 rows exist)
fake_upsampled = resample(fake, replace=True, n_samples=1000)

sampled_fraud = pd.concat([real_downsampled, fake_upsampled])
 
# Display new class counts
display(sampled_fraud.Class.value_counts())

1    1000
0    1000
Name: Class, dtype: int64

In [5]:
# create data and outcome values for the training, or sampled, set
Xs = sampled_fraud.drop(['Class'], axis=1)
Ys = sampled_fraud.Class

In [6]:
# set up data and outcome for the test set (the full, unsampled data)
Xt = fraud.drop(['Class'], axis=1)
Yt = fraud.Class
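
Because `Xt` is the full data set, it also contains the rows that were sampled into `Xs`. A minimal sketch of one alternative, holding out a test portion of each class before resampling, is shown below in plain Python rather than with scikit-learn. The function name `split_then_resample`, the 30% test fraction, the seed, and the toy rows are all assumptions for illustration and are not part of the notebook above.

```python
import random


def split_then_resample(rows, label_of, n_per_class=1000, test_frac=0.3, seed=0):
    """Hold out test rows per class, then resample only the remaining rows.

    Hypothetical helper: rows is any list of records and label_of returns
    the class of a record.
    """
    rng = random.Random(seed)

    # group the rows by class so each class is split separately (stratified)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)

    train, test = [], []
    for members in by_class.values():
        members = list(members)
        rng.shuffle(members)
        n_test = int(len(members) * test_frac)
        test.extend(members[:n_test])
        # sample with replacement from the rows that were NOT held out
        train.extend(rng.choices(members[n_test:], k=n_per_class))
    return train, test


# toy data: (row id, class) pairs with a rare positive class
rows = [(i, 0) for i in range(500)] + [(i, 1) for i in range(500, 520)]
train, test = split_then_resample(rows, label_of=lambda r: r[1], n_per_class=50)
```

With this arrangement the balanced training rows and the held-out test rows never overlap, so a test score cannot be inflated by rows the model has already seen.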

# Running Models

Now that the classes are balanced, I can begin running my various models. It should be noted that with this data, fraudulent purchases are the positive class and reputable purchases are the negative class. As such, false negatives, or Type II errors, are much worse than false positives, or Type I errors.

If a false positive happens, the customer will be called and alerted of fraud where none took place, so the customer will simply clear things up. However, if a false negative happens, fraud took place but wasn't detected at all. So it can continue to go on, damaging both the customer and the company.

So, I will be sure to make a model that's accurate overall but, more importantly, one that avoids Type II errors.
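
To make the two error types concrete, the sketch below computes accuracy and both error rates from confusion matrix counts. As an illustration, it scores a hypothetical model that labels every purchase as reputable, using the class counts shown earlier (284,315 reputable and 492 fraudulent). The function name `error_rates` is an assumption for illustration.

```python
def error_rates(tn, fp, fn, tp):
    """Return (accuracy, Type I rate, Type II rate) from confusion matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    # Type I: reputable purchases flagged as fraud, out of all reputable purchases
    type1 = fp / (fp + tn) if (fp + tn) else 0.0
    # Type II: fraudulent purchases that were missed, out of all fraudulent purchases
    type2 = fn / (fn + tp) if (fn + tp) else 0.0
    return accuracy, type1, type2


# hypothetical model that never predicts fraud
accuracy, type1, type2 = error_rates(tn=284315, fp=0, fn=492, tp=0)
print(round(accuracy, 4), type1, type2)
```

Under these assumptions the accuracy comes out above 99.8% even though every fraudulent purchase is missed, which is why accuracy alone is not enough for this data.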

I will also compute a cross validation score using 3 folds to gauge how overfit each of the models may be.
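
Each cross validation run returns one accuracy per fold, as a fraction between 0 and 1. A minimal sketch of summarizing those fold scores as a mean plus or minus two standard deviations, displayed as percentages, is shown below. The function name `summarize_cv` and the three fold scores are hypothetical.

```python
from statistics import mean, pstdev


def summarize_cv(fold_scores):
    """Format fold accuracies (fractions in [0, 1]) as 'mean +/- 2 * std' in percent."""
    center = mean(fold_scores)
    # population standard deviation, the same convention as numpy's default std()
    spread = 2 * pstdev(fold_scores)
    # the '%' format type multiplies by 100 before adding the percent sign
    return '{:.1%} +/- {:.1%}'.format(center, spread)


# hypothetical accuracies from 3 folds
summary = summarize_cv([0.90, 0.91, 0.89])
print(summary)
```

A wide spread relative to the mean would suggest that the model's score depends heavily on which rows it was trained on.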

### Naive Bayes

The first model type I will use is Naive Bayes. It has the benefit of being very simple, and I expect it to avoid Type II errors. But, because it assumes every feature is independent of the others, and BernoulliNB reduces each feature to a 0/1 value, it can be hard to get it to be very accurate.
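
As a rough illustration of what a Bernoulli Naive Bayes classifier does, the sketch below estimates a prior for each class and a probability that each binary feature equals 1 within that class, then picks the class with the highest log posterior. It is a simplified stand-in written in plain Python, not scikit-learn's implementation; the function names and the toy data are assumptions.

```python
import math


def fit_bernoulli_nb(X, y, alpha=1.0):
    """Estimate class priors and P(feature = 1 | class) with Laplace smoothing."""
    model = {}
    for label in set(y):
        rows = [x for x, target in zip(X, y) if target == label]
        prior = len(rows) / len(X)
        # smoothed share of rows in this class where each feature equals 1
        probs = [
            (sum(col) + alpha) / (len(rows) + 2 * alpha)
            for col in zip(*rows)
        ]
        model[label] = (prior, probs)
    return model


def predict_bernoulli_nb(model, x):
    """Pick the class with the highest log posterior, treating features as independent."""
    best_label, best_score = None, -math.inf
    for label, (prior, probs) in model.items():
        score = math.log(prior)
        for value, p in zip(x, probs):
            score += math.log(p if value else 1.0 - p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# toy binary features (already thresholded to 0/1)
X = [[1, 0], [1, 1], [0, 0], [0, 1]]
y = [1, 1, 0, 0]
model = fit_bernoulli_nb(X, y)
print(predict_bernoulli_nb(model, [1, 0]), predict_bernoulli_nb(model, [0, 1]))
```

In the toy data only the first feature separates the classes, so the prediction follows that feature and the second one contributes nothing.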

In [7]:
# create dataframe to hold the testing scores in 
scores = pd.DataFrame()

In [8]:
bnb = BernoulliNB()

bnb.fit(Xs, Ys)

Y_predbnb = bnb.predict(Xt)

cvscores = cross_val_score(bnb, Xs, Ys, cv=3)
cvscoret = cross_val_score(bnb, Xt, Yt, cv=3)
print("Naive Bayes Training Set:")
print("\nAccuracy Score:")
print(bnb.score(Xs,Ys))
print('\nCross Validation Score:')
# scores are fractions between 0 and 1, so no percent sign is printed
print('{} +/- {}'.format(round(cvscores.mean(),2),round(cvscores.std()*2,2)))
print("\nNaive Bayes Testing Set:")
print('\nTesting Accuracy Score:')
print(bnb.score(Xt,Yt))
print('\nCross Validation Score:')
print('{} +/- {}'.format(round(cvscoret.mean(),2),round(cvscoret.std()*2,2)))
tn, fp, fn, tp = confusion_matrix(Yt, Y_predbnb).ravel()
# Type II error rate: share of actual fraud cases the model missed
type2 = fn/(fn+tp)
print('\nType II Error Percentage:')
print(round(type2*100,2),"%")
print('\nConfusion Matrix:')
print(confusion_matrix(Yt, Y_predbnb))

scores['Naive Bayes'] = [bnb.score(Xt,Yt),type2]

Naive Bayes Training Set:

Accuracy Score:
0.901

Cross Validation Score:
0.9 +/- 0.02

Naive Bayes Testing Set:

Testing Accuracy Score:
0.9905374516778028

Cross Validation Score:
1.0 +/- 0.0

Type II Error Percentage:
17.48 %

Confusion Matrix:
[[281706   2609]
 [    86    406]]


This model is inaccurate, but not overly so. Still, it is unusable in its current state and impossible to tune without knowing more about the data. I will discard my Naive Bayes model from here.

### Lasso Logistic Regression

Logistic Regression uses the natural log of the odds (the logit) to make a categorical outcome behave like a continuous one, so that a linear, regression-like model can be fit to it. Additionally, I will use a Lasso (L1) penalty, which shrinks the coefficients of unhelpful features down to zero. This is handy since I don't know anything about the features.
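
The two ingredients described above can be sketched in a few lines: the sigmoid turns a linear score into a probability, and the L1 penalty adds the absolute size of the weights to the loss. The sketch below follows the general form `C * log loss + L1 norm`, where a smaller `C` means stronger regularization. It is an illustration in plain Python, not scikit-learn's internal code, and the function names and toy values are assumptions.

```python
import math


def sigmoid(z):
    """Map a linear score (the log-odds) to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def l1_penalized_log_loss(weights, bias, X, y, C=1.0):
    """Log loss scaled by C, plus the sum of absolute weights (the Lasso penalty)."""
    loss = 0.0
    for features, label in zip(X, y):
        z = bias + sum(w * x for w, x in zip(weights, features))
        p = sigmoid(z)
        loss += -(label * math.log(p) + (1 - label) * math.log(1 - p))
    # a weight of exactly zero adds nothing here, so unhelpful features are cheap to drop
    penalty = sum(abs(w) for w in weights)
    return C * loss + penalty


# toy data: two rows, two features; the second feature is given zero weight
X = [[0.0, 1.0], [1.0, 0.0]]
y = [0, 1]
total = l1_penalized_log_loss(weights=[2.0, 0.0], bias=-1.0, X=X, y=y, C=1.0)
```

Raising `C` makes the data-fit term dominate, which is why the grid search below tries a wide range of values.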

In [9]:
# find the best value of C to use when fitting the model
grid = [.01,.1, 1,10,100,200,300,400,500,600,700,900,1000,5000,10000]
out = []
for c in grid:
    # the L1 (Lasso) penalty needs a solver that supports it, such as liblinear
    lrl = linear_model.LogisticRegression(penalty='l1', solver='liblinear', C=c)
    score = cross_val_score(lrl, Xs, Ys, cv=3)
    out.append(score.mean())

# pick the C with the highest mean cross validation score
bestc = grid[out.index(max(out))]

lrl = linear_model.LogisticRegression(penalty='l1', solver='liblinear', C=bestc)
lrl.fit(Xs,Ys)
print("\nThe model was fit using C = ",bestc)

Y_predlrl = lrl.predict(Xt)


The model was fit using C =  400


In [26]:
cvscores = cross_val_score(lrl, Xs, Ys, cv=3)
cvscoret = cross_val_score(lrl, Xt, Yt, cv=3)
print("Lasso Logistic Training Set:")
print("\nTraining Accuracy Score:")
print(lrl.score(Xs,Ys))
print('\nCross Validation Score:')
# scores are fractions between 0 and 1, so no percent sign is printed
print('{} +/- {}'.format(round(cvscores.mean(),2),round(cvscores.std()*2,2)))
print("\nLasso Logistic Testing Set:")
print('\nTesting Accuracy Score:')
print(lrl.score(Xt,Yt))
print('\nCross Validation Score:')
print('{} +/- {}'.format(round(cvscoret.mean(),2),round(cvscoret.std()*2,2)))
tn, fp, fn, tp = confusion_matrix(Yt, Y_predlrl).ravel()
# Type II error rate: share of actual fraud cases the model missed
type2 = fn/(fn+tp)
print('\nType II Error Percentage:')
print(round(type2*100,2),"%")
print('\nConfusion Matrix:')
print(confusion_matrix(Yt, Y_predlrl))

scores['Lasso Logistic'] = [lrl.score(Xt,Yt),type2]

Lasso Logistic Training Set:

Training Accuracy Score:
0.946

Cross Validation Score:
0.94 +/- 0.01

Lasso Logistic Testing Set:

Testing Accuracy Score:
0.9693476635054616

Cross Validation Score:
1.0 +/- 0.0

Type II Error Percentage:
7.72 %

Confusion Matrix:
[[275623   8692]
 [    38    454]]


This model is on the edge of usability, but not quite there. While the overall testing accuracy, at 96.9%, is barely past the acceptable 95%, its Type II error percentage is not low enough. At 7.72%, it is just over the line of unacceptability. Since Type II errors are much more impactful for this data set, I will discard the Lasso Logistic model as well.

### Random Forest

Random Forests function as a group of decision trees, each modeling a slightly different sample of the same data, whose predictions are combined by vote. They are extremely robust and do most feature engineering for you, with the tradeoff of being very prone to overfitting.
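
Two of the ideas behind a forest can be sketched without any library: each tree sees its own bootstrap sample of the rows, and the final label is the majority vote across trees. The sketch below shows only those two steps and leaves out the trees themselves. The function names, the seed, and the vote values are hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, so each tree sees a different sample."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(votes):
    """Combine the class predictions of the individual trees."""
    return Counter(votes).most_common(1)[0][0]


rng = random.Random(0)
rows = list(range(10))
# one bootstrap sample per tree, here for three trees
samples = [bootstrap_sample(rows, rng) for _ in range(3)]

# hypothetical predictions from five trees for a single purchase
prediction = majority_vote([1, 0, 1, 1, 0])
```

Because each sample repeats some rows and omits others, the trees disagree in places, and the vote averages out some of their individual errors.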

In [14]:
rfc = ensemble.RandomForestClassifier()

rfc.fit(Xs,Ys)

Y_predrfc = rfc.predict(Xt)

In [15]:
#cvscores = cross_val_score(rfc, Xs, Ys)
#cvscoret = cross_val_score(rfc, Xt, Yt)
print("Random Forest Training Set:")
print("\nTraining Accuracy Score:")
print(rfc.score(Xs,Ys))
#print('\nCross Validation Score:')
#print('{}% +/- {}%'.format(round(cvscores.mean(),2),round(cvscores.std()*2,2)))
print("\nRandom Forest Testing Set:")
print('\nTesting Accuracy Score:')
print(rfc.score(Xt,Yt))
#print('\nCross Validation Score:')
#print('{}% +/- {}%'.format(round(cvscoret.mean(),2),round(cvscoret.std()*2,2)))
tn, fp, fn, tp = confusion_matrix(Yt, Y_predrfc).ravel()
type2 = fn/(fn+tp)
print('\nType II Error Percentage:')
print(round(type2*100,2),"%")
print('\nConfusion Matrix:')
print(confusion_matrix(Yt, Y_predrfc))

scores['Random Forest'] = [rfc.score(Xt,Yt),type2]

Random Forest Training Set:

Training Accuracy Score:
0.9995

Random Forest Testing Set:

Testing Accuracy Score:
0.9903689164943278

Type II Error Percentage:
2.44 %

Confusion Matrix:
[[281584   2731]
 [    12    480]]


The testing accuracy and Type II error percentage are superb. Not to mention, the testing accuracy stays close to the training accuracy, which suggests the model has avoided severe overfitting. This model is definitely usable and I will try it again.

### Support Vector Machine Classification

SVC works by making a boundary between the groups of data in n-dimensional space, where n is equal to the number of features. Here, the groups will simply be my binary outcomes. SVC is very powerful and accurate, but comes at the cost of being computationally intensive and prone to overfitting. 
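
The default SVC kernel is the RBF kernel, which scores how similar two points are from the squared distance between them. A minimal sketch of that similarity function is shown below. The `gamma` value and the sample points are assumptions chosen for illustration.

```python
import math


def rbf_kernel(a, b, gamma):
    """RBF similarity exp(-gamma * ||a - b||^2): 1 for identical points, near 0 for distant ones."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * squared_distance)


same = rbf_kernel([1.0, 2.0], [1.0, 2.0], gamma=0.5)
near = rbf_kernel([1.0, 2.0], [1.5, 2.0], gamma=0.5)
# a large gap in a single unscaled feature dominates the distance
far = rbf_kernel([1.0, 2.0], [1.0, 200.0], gamma=0.5)
```

When features sit on very different scales, most pairs of points look like the `far` case, so each training point mostly resembles only itself. That is one way a model like this can fit the training set perfectly.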

In [16]:
svm = SVC()  # SVC defaults: kernel='rbf' (degree=3 only applies to the poly kernel)
svm.fit(Xs,Ys)

Y_predsvm = svm.predict(Xt)

In [17]:
#cvscores = cross_val_score(svm, Xs, Ys)
#cvscoret = cross_val_score(svm, Xt, Yt)
print("Support Vector Classification Training Set:")
print("\nTraining Accuracy Score:")
print(svm.score(Xs,Ys))
#print('\nCross Validation Score:')
#print('{}% +/- {}%'.format(round(cvscores.mean(),2),round(cvscores.std()*2,2)))
print("\nSupport Vector Classification Testing Set:")
print('\nTesting Accuracy Score:')
print(svm.score(Xt,Yt))
#print('\nCross Validation Score:')
#print('{}% +/- {}%'.format(round(cvscoret.mean(),2),round(cvscoret.std()*2,2)))
tn, fp, fn, tp = confusion_matrix(Yt, Y_predsvm).ravel()
type2 = fn/(fn+tp)
print('\nType II Error Percentage:')
print(round(type2*100,2),"%")
print('\nConfusion Matrix:')
print(confusion_matrix(Yt, Y_predsvm))

scores['Support Vector Machines'] = [svm.score(Xt,Yt),type2]

Support Vector Classification Training Set:

Training Accuracy Score:
1.0

Support Vector Classification Testing Set:

Testing Accuracy Score:
0.9996559073337383

Type II Error Percentage:
15.65 %

Confusion Matrix:
[[284294     21]
 [    77    415]]


# Conclusion

Now that I have attempted to predict credit card fraud with all the model types, I can say what type of model worked best.

A quick recap of each model type's ability to predict fraud:

In [27]:
scores['Stat'] = ['Testing Accuracy','Type II Error']
scores = scores.set_index('Stat')
display(scores)

Stat,Naive Bayes,Lasso Logistic,Random Forest,Support Vector Machines
Testing Accuracy,0.990537,0.969348,0.990369,0.999656
Type II Error,0.174797,0.077236,0.02439,0.156504


Although the Support Vector Classification model looked best during training, with a training accuracy of 1.0, and also had the highest testing accuracy, the Random Forest model heavily outperformed it on Type II error during testing (2.44% versus 15.65%). The SVC model's perfect training score was only possible because of how overfit the model truly was. This shows just how important it is to train and test on different data sets, as well as how important it is to know your data and recognize which type of error hurts the most.

https://www.kaggle.com/mlg-ulb/creditcardfraud