# **Fraud detection in credit card transaction data with PyCaret**

We use the high-level API [PyCaret](https://github.com/pycaret/pycaret) and the outlier detection package [PyOD](https://github.com/yzhao062/pyod) to detect outliers (i.e. anomalies, or possibly fraudulent transactions) in a [credit card transaction dataset](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud). We also compare our results to those in the benchmark paper [ADBench](https://arxiv.org/abs/2206.09426).

In [None]:
#%%capture   # Uncomment to suppress output
!pip install pycaret[full]



In [None]:
import pandas as pd

from google.colab import drive
drive.mount('gdrive')

Mounted at gdrive


In [None]:
dataset = pd.read_csv('gdrive/MyDrive/Colab Notebooks/AnomalyDetection/creditcard.csv')

In [None]:
dataset.shape

(284807, 31)

In [None]:
dataset.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [None]:
dataset["Class"].mean()

0.001727485630620034

We see that only 0.17% of the transactions are fraudulent. Our dataset is heavily imbalanced and the fraudulent transactions are considered outliers. We will now consider which models will perform well under these conditions.



---


# **Model Selection**

The column 'Class' indicates whether the transaction is valid (Class = 0) or fraudulent (Class = 1). Given these labels, we first treat this as a supervised classification problem.

In the next section, we will consider the effect of dropping the 'Class' column. In the real world, financial institutions often do not have a labeled dataset of valid/fraudulent transactions on which to train machine learning algorithms.



## **First, consider the supervised classification algorithms.**


In [None]:
# import pycaret classification and init setup

from pycaret.classification import *

supervised = setup(dataset, target = 'Class', session_id = 123)

Unnamed: 0,Description,Value
0,Session id,123
1,Target,Class
2,Target type,Binary
3,Original data shape,"(284807, 31)"
4,Transformed data shape,"(284807, 31)"
5,Transformed train set shape,"(199364, 31)"
6,Transformed test set shape,"(85443, 31)"
7,Numeric features,30
8,Preprocess,True
9,Imputation type,simple


In [None]:
# compare baseline models

best = compare_models()

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
rf,Random Forest Classifier,0.9996,0.9566,0.7969,0.9528,0.8668,0.8666,0.8706,121.638
et,Extra Trees Classifier,0.9996,0.9595,0.7911,0.9456,0.8601,0.8599,0.864,13.688
xgboost,Extreme Gradient Boosting,0.9996,0.9827,0.8024,0.9371,0.8641,0.8638,0.8667,68.196
catboost,CatBoost Classifier,0.9996,0.9799,0.8082,0.9491,0.8726,0.8724,0.8754,46.791
lda,Linear Discriminant Analysis,0.9994,0.9085,0.7795,0.8771,0.824,0.8237,0.8258,1.092
lr,Logistic Regression,0.9992,0.9551,0.6395,0.8517,0.7284,0.728,0.7366,4.218
ada,Ada Boost Classifier,0.9992,0.9804,0.7035,0.8268,0.7582,0.7579,0.7613,38.413
dt,Decision Tree Classifier,0.9991,0.8822,0.7648,0.7555,0.7564,0.7559,0.7578,11.018
ridge,Ridge Classifier,0.9989,0.0,0.4247,0.8508,0.5625,0.562,0.5982,0.151
gbc,Gradient Boosting Classifier,0.9989,0.6422,0.5081,0.7811,0.5929,0.5924,0.6165,200.928



In [None]:
print(best)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='sqrt',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_samples_leaf=1,
                       min_samples_split=2, min_weight_fraction_leaf=0.0,
                       n_estimators=100, n_jobs=-1, oob_score=False,
                       random_state=123, verbose=0, warm_start=False)
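The comparison table above is ordered by accuracy, which barely separates the models on data this imbalanced. As a small sketch, the snippet below re-sorts the AUC and F1 values copied from the ten displayed rows so that the rankings on those two metrics are easier to read. It only covers the rows shown above; any model missing from that output is not included.

```python
# (AUC, F1) copied from the displayed rows of the comparison table.
displayed = {
    "rf": (0.9566, 0.8668),
    "et": (0.9595, 0.8601),
    "xgboost": (0.9827, 0.8641),
    "catboost": (0.9799, 0.8726),
    "lda": (0.9085, 0.8240),
    "lr": (0.9551, 0.7284),
    "ada": (0.9804, 0.7582),
    "dt": (0.8822, 0.7564),
    "ridge": (0.0, 0.5625),
    "gbc": (0.6422, 0.5929),
}


def rank_by(metric_index):
    """Return model IDs sorted from best to worst on the chosen metric."""
    return sorted(displayed, key=lambda name: displayed[name][metric_index], reverse=True)


by_auc = rank_by(0)  # index 0 is AUC
by_f1 = rank_by(1)   # index 1 is F1

print("Top 3 by AUC:", by_auc[:3])
print("Top 3 by F1: ", by_f1[:3])
```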


We choose AUC ROC and F1 score as the metrics to maximize (see [this article](https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9) for a more detailed explanation of the precision/recall tradeoff for imbalanced classes). We now briefly explain this choice.

Notice that the dataset is very imbalanced. Only 0.17% of transactions are fraudulent ('Class' = 1), so a model that predicts that every transaction is valid would immediately have an accuracy of 99.83%. Therefore, we care more about precision and recall than we do about accuracy. A model with a large AUC ROC tends to rank fraudulent transactions above valid ones, which makes it more likely that some choice of the threshold hyperparameter gives good precision and recall.
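To make the accuracy argument concrete, here is a minimal sketch that scores a hypothetical baseline labeling every transaction as valid. It uses only the row count and fraud rate reported earlier in this notebook; the variable names are illustrative.

```python
n_rows = 284807                      # from dataset.shape
fraud_rate = 0.001727485630620034    # from dataset["Class"].mean()

n_fraud = round(n_rows * fraud_rate)
n_valid = n_rows - n_fraud

# Hypothetical baseline: predict Class = 0 (valid) for every transaction.
true_pos = 0
true_neg = n_valid
false_neg = n_fraud

accuracy = (true_pos + true_neg) / n_rows
recall = true_pos / (true_pos + false_neg)

print(f"fraudulent rows:   {n_fraud}")
print(f"baseline accuracy: {accuracy:.4f}")
print(f"baseline recall:   {recall:.4f}")
```

The baseline reaches very high accuracy while detecting no fraud at all, which is why accuracy alone is not a useful guide here.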

We interpret precision as the fraction of transactions flagged as fraudulent that are actually fraudulent:
$$ \text{Precision} = \dfrac{ \text{True Pos} }{ \text{True Pos} + \text{False Pos} } \, . $$
If precision is low, the model flags many transactions as fraudulent that are, in reality, valid. This results in upset customers falsely accused of fraud.

On the other hand, we interpret recall as the fraction of all truly fraudulent transactions that the model detects:
$$ \text{Recall} = \dfrac{ \text{True Pos} }{ \text{True Pos} + \text{False Neg} } \, . $$
If recall is low, there are fraudulent transactions that our model fails to detect.

This is the **precision/recall tradeoff**. Lowering the threshold hyperparameter gives lower precision but higher recall: the model is more sensitive to outliers, so it detects more cases of fraud but also upsets more customers. Raising the threshold makes the model less sensitive: it ignores transactions that are only somewhat suspicious, resulting in fewer upset customers but missing some of the borderline fraud cases. A model with a high AUC ROC and F1 score handles this tradeoff well, offering thresholds at which both precision and recall are high. We will then tune the threshold hyperparameter to find a precision/recall combination that we are happy with.
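The tradeoff can be illustrated with a small sketch. The labels and scores below are made up for illustration (they are not taken from the credit card data), and `precision_recall_f1` is a hypothetical helper, not a PyCaret function. It also computes F1, the harmonic mean of precision and recall.

```python
# Toy data, invented for illustration: 1 = fraudulent, 0 = valid.
labels = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.55, 0.60, 0.80, 0.95]


def precision_recall_f1(threshold):
    """Flag every score >= threshold as fraud and evaluate the result."""
    flagged = [s >= threshold for s in scores]
    tp = sum(1 for f, y in zip(flagged, labels) if f and y == 1)
    fp = sum(1 for f, y in zip(flagged, labels) if f and y == 0)
    fn = sum(1 for f, y in zip(flagged, labels) if not f and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Sweep from a strict threshold to a lenient one.
for threshold in (0.9, 0.7, 0.5):
    p, r, f1 = precision_recall_f1(threshold)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}  F1={f1:.2f}")
```

On this toy data, moving the threshold from 0.7 to 0.5 catches the remaining fraudulent case (recall rises) at the cost of one false alarm (precision falls).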



From this investigation, we have our candidates for the best supervised models:


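* Random Forest Classifier
 * AUC ROC 0.9566 (6th best)
 * F1 0.8668 (second best among the models shown, behind CatBoost at 0.8726)

* Extreme Gradient Boosting (xgboost)
 * AUC ROC 0.9827 (best)
 * F1 0.8641 (third best among the models shown)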





## **Now, consider the unsupervised algorithms.**


In [None]:
#!pip3 install pycaret[analysis]

In [None]:
# !pip3 install shap

# !pip install shap

In [None]:
# interpret summary model (use SHAP)
# interpret_model(best, plot = 'summary')

In [None]:
# reason plot for test set observation 1
# interpret_model(best, plot = 'reason', observation = 1)

In [None]:
# plot confusion matrix
# plot_model(best, plot = 'confusion_matrix')

In [None]:
# plot AUC
# plot_model(best, plot = 'auc')

In [None]:
# plot feature importance
# plot_model(best, plot = 'feature')

In [None]:
# help(plot_model)

In [None]:
# dashboard function
# dashboard(best, display_format ='inline')

In [None]:
# eda function
# eda()

In [None]:
# create gradio app
# create_app(best)

In [None]:
#dataset = pd.read_csv('gdrive/MyDrive/Colab Notebooks/AnomalyDetection/creditcard.csv')

#dataset.head()


In [None]:
# Drop the 'Class' column and consider the problem as an unsupervised outlier detection problem

dataset_unsup = dataset.drop('Class', axis=1)
dataset_unsup.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,0.251412,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.069083,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.52498,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.208038,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,0.408542,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99


In [None]:
# Optionally subsample the dataset if the algorithms are too slow on the full data

#dataset_unsup = dataset_unsup.sample(n=20000, random_state=1)

dataset_unsup.shape

(284807, 30)

In [None]:
# Separate a training set and test set:

data = dataset_unsup.sample(frac=0.95, random_state=42)
data_unseen = dataset_unsup.drop(data.index)

data.reset_index(drop=True, inplace=True)
data_unseen.reset_index(drop=True, inplace=True)

print('Data for Modeling: ' + str(data.shape))
print('Unseen Data For Predictions: ' + str(data_unseen.shape))

Data for Modeling: (270567, 30)
Unseen Data For Predictions: (14240, 30)


In [None]:
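# Ground-truth labels aligned with `data`, kept aside for later evaluation only.
# Assumption: drawing the same sample (same frac and random_state) from the
# 'Class' column selects the same rows as the split above.
y_labels = dataset['Class'].sample(frac=0.95, random_state=42).reset_index(drop=True)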

In [None]:
from pycaret.anomaly import *

unsup = setup(data, normalize = True, session_id = 123)



Unnamed: 0,Description,Value
0,Session id,123
1,Original data shape,"(270567, 30)"
2,Transformed data shape,"(270567, 30)"
3,Numeric features,30
4,Preprocess,True
5,Imputation type,simple
6,Numeric imputation,mean
7,Categorical imputation,mode
8,Normalize,True
9,Normalize method,zscore
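The setup summary reports `Normalize method: zscore`. As a rough sketch of what that means, assuming the usual definition z = (x - mean) / std, the snippet below fits the statistics on a few training values and reuses them for a new value. The helper names are hypothetical, and the use of the population standard deviation is an assumption; the output above does not say which variant PyCaret applies.

```python
import statistics


def fit_zscore(values):
    """Learn the mean and (population) standard deviation from training values."""
    return statistics.fmean(values), statistics.pstdev(values)


def apply_zscore(values, mean, std):
    """Scale values with previously fitted statistics."""
    return [(v - mean) / std for v in values]


# 'Amount' values from the first five rows shown by dataset.head().
train_amounts = [149.62, 2.69, 378.66, 123.50, 69.99]
mean, std = fit_zscore(train_amounts)
train_z = apply_zscore(train_amounts, mean, std)

# A hypothetical unseen amount is scaled with the training statistics,
# not with statistics recomputed on the unseen data.
unseen_z = apply_zscore([1000.0], mean, std)

print([round(z, 3) for z in train_z])
print(round(unseen_z[0], 3))
```

Scaling matters for distance-based detectors, because otherwise a column with a large range such as 'Time' or 'Amount' would dominate the distances.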


In [None]:
# We have the following models available from PyCaret and PyOD
models()

Unnamed: 0_level_0,Name,Reference
ID,Unnamed: 1_level_1,Unnamed: 2_level_1
abod,Angle-base Outlier Detection,pyod.models.abod.ABOD
cluster,Clustering-Based Local Outlier,pycaret.internal.patches.pyod.CBLOFForceToDouble
cof,Connectivity-Based Local Outlier,pyod.models.cof.COF
iforest,Isolation Forest,pyod.models.iforest.IForest
histogram,Histogram-based Outlier Detection,pyod.models.hbos.HBOS
knn,K-Nearest Neighbors Detector,pyod.models.knn.KNN
lof,Local Outlier Factor,pyod.models.lof.LOF
svm,One-class SVM detector,pyod.models.ocsvm.OCSVM
pca,Principal Component Analysis,pyod.models.pca.PCA
mcd,Minimum Covariance Determinant,pyod.models.mcd.MCD


We refer to [ADBench](https://arxiv.org/abs/2206.09426) to help us select a few models that typically work well for fraud detection. ADBench reports the following results:


* LOF
 * AUC ROC 0.9492 (second best)
 * AUC PR 0.4740 (third best)

* KNN
 * AUC ROC 0.9356 (third best)
 * AUC PR 0.4730 (4th best)

Both models are available through PyCaret, so we fit them next.

In [None]:
import time

tic = time.time()

model_lof = create_model('lof')
print(model_lof)

toc = time.time()
print('Time elapsed, ', toc-tic)





LOF(algorithm='auto', contamination=0.05, leaf_size=30, metric='minkowski',
  metric_params=None, n_jobs=-1, n_neighbors=20, novelty=True, p=2)
Time elapsed,  500.44642066955566
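For intuition about what LOF measures, here is a simplified pure-Python sketch of the textbook Local Outlier Factor on a tiny 2D toy set. It is not PyOD's implementation: it assumes no duplicate points, ignores ties at the k-th neighbor, and uses brute-force distances. The function name `lof_scores` and the toy points are hypothetical.

```python
import math


def lof_scores(points, k):
    """Simplified Local Outlier Factor: values near 1 are inliers, larger is more anomalous."""
    neighbours = []   # indices of the k nearest neighbours of each point
    k_distance = []   # distance from each point to its k-th nearest neighbour
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        nearest = dists[:k]
        neighbours.append([j for _, j in nearest])
        k_distance.append(nearest[-1][0])

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i, p in enumerate(points):
        reach = [max(k_distance[j], math.dist(p, points[j])) for j in neighbours[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: average density of the neighbours relative to the point's own density.
    return [
        sum(lrd[j] for j in neighbours[i]) / (len(neighbours[i]) * lrd[i])
        for i in range(len(points))
    ]


# A 3x3 grid of "normal" points plus one isolated point.
grid = [(x, y) for x in range(3) for y in range(3)]
points = grid + [(6, 6)]
scores = lof_scores(points, k=3)

for point, score in zip(points, scores):
    print(point, round(score, 3))
```

The isolated point gets a score several times larger than the grid points, because its neighbors sit in a much denser region than it does.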


In [None]:
lof_anomalies = assign_model(model_lof)
lof_anomalies.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V22,V23,V24,V25,V26,V27,V28,Amount,Anomaly,Anomaly_Score
0,41505.0,-16.526506,8.584971,-18.649853,9.505593,-13.793818,-2.832404,-16.701694,7.517344,-8.507059,...,-1.12767,-2.358579,0.673461,-1.4137,-0.462762,-2.018575,-1.042804,364.190002,0,1.051213
1,44261.0,0.339812,-2.743745,-0.13407,-1.385729,-1.451413,1.015887,-0.524379,0.22406,0.899746,...,-0.942525,-0.526819,-1.156992,0.311211,-0.746647,0.040996,0.102038,520.119995,0,1.140476
2,35484.0,1.39959,-0.590701,0.168619,-1.02995,-0.539806,0.040444,-0.712567,0.002299,-0.971747,...,0.168269,-0.166639,-0.81025,0.505083,-0.23234,0.011409,0.004634,31.0,0,1.095092
3,167123.0,-0.432071,1.647895,-1.669361,-0.349504,0.785785,-0.630647,0.27699,0.586025,-0.484715,...,0.873663,-0.178642,-0.017171,-0.207392,-0.157756,-0.237386,0.001934,1.5,0,1.053872
4,168473.0,2.01416,-0.137394,-1.015839,0.327269,-0.182179,-0.956571,0.043241,-0.160746,0.363241,...,-0.6164,0.347045,0.061561,-0.360196,0.17473,-0.078043,-0.070571,0.89,0,0.990909
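The printed LOF parameters include `contamination=0.05`. Assuming, as PyOD documents, that contamination is the expected proportion of outliers used to set the decision threshold, roughly 5% of the training rows will be labeled `Anomaly = 1`. The back-of-the-envelope sketch below compares that with the fraud rate measured earlier; it further assumes the 95% sample keeps the overall fraud rate.

```python
n_train = 270567                     # rows in `data`
contamination = 0.05                 # from the printed LOF parameters
fraud_rate = 0.001727485630620034    # dataset["Class"].mean() on the full dataset

expected_flagged = round(n_train * contamination)
expected_fraud = round(n_train * fraud_rate)  # assumes the sample keeps the overall rate

# Even if every fraudulent row were flagged, precision could not exceed this ratio.
precision_ceiling = expected_fraud / expected_flagged

print(f"rows flagged as anomalies: about {expected_flagged}")
print(f"fraudulent rows expected:  about {expected_fraud}")
print(f"upper bound on precision:  {precision_ceiling:.3f}")
```

Under these assumptions the default contamination flags far more rows than there are fraudulent transactions, so this hyperparameter is a natural candidate for tuning when trading precision against recall.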


In [None]:
unseen_predictions_lof = predict_model(model_lof, data=data_unseen)
unseen_predictions_lof.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V22,V23,V24,V25,V26,V27,V28,Amount,Anomaly,Anomaly_Score
0,-1.996279,-0.216529,0.582063,0.75125,-0.11947,0.304855,-0.02299,0.383188,0.216951,-0.517491,...,-0.770894,-0.042165,-0.612769,-0.447749,0.219264,0.628509,0.243318,-0.340685,0,1.334791
1,-1.996069,-0.382963,0.209372,1.354282,-1.037677,-0.836659,-0.059066,-0.489298,0.002959,-0.396978,...,1.865882,-0.410824,-0.106971,-0.075979,-0.180838,-0.448402,0.388701,-0.291187,1,1.676183
2,-1.995753,-0.124801,0.286803,1.116295,0.184621,-0.007272,-0.458624,0.638741,-0.206153,0.12603,...,-0.461267,-0.125101,0.649006,-0.060685,0.410239,-0.434409,-0.613374,-0.23293,0,1.048846
3,-1.995753,-0.739725,1.069622,0.402784,0.830289,-0.321755,0.184331,-0.206978,0.910631,-0.552828,...,0.449476,-0.110571,0.035098,-0.086621,-0.50497,0.369303,0.362109,-0.348198,0,1.352627
4,-1.995606,0.580985,0.034534,0.42763,0.615802,-0.338007,-0.308211,-0.010996,-0.06043,0.278744,...,-0.340372,0.09526,0.753855,0.692112,0.568573,-0.006339,0.050819,-0.269692,0,1.242074


In [None]:
import time

tic = time.time()

model_knn = create_model('knn')
print(model_knn)

toc = time.time()
print('Time elapsed, ', toc-tic)





KNN(algorithm='auto', contamination=0.05, leaf_size=30, method='largest',
  metric='minkowski', metric_params=None, n_jobs=-1, n_neighbors=5, p=2,
  radius=1.0)
Time elapsed,  507.71187233924866
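The printed parameters show `method='largest'` and `n_neighbors=5`. A minimal sketch of that idea, assuming the score of a point is its distance to its k-th nearest neighbor, is given below on a toy 2D set. The helpers `kth_neighbor_distance` and `flag_top_fraction` are hypothetical, and the thresholding rule is a simplification of how a contamination level turns scores into labels, not PyOD's exact rule.

```python
import math


def kth_neighbor_distance(points, k):
    """Outlier score of each point: distance to its k-th nearest neighbour."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


def flag_top_fraction(scores, contamination):
    """Label the highest-scoring fraction of points as anomalies (1), the rest as 0."""
    n_flag = max(1, round(contamination * len(scores)))
    threshold = sorted(scores)[-n_flag]
    return [1 if s >= threshold else 0 for s in scores]


# Five clustered points and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
scores = kth_neighbor_distance(points, k=2)
flags = flag_top_fraction(scores, contamination=0.2)

for point, score, flag in zip(points, scores, flags):
    print(point, round(score, 3), flag)
```

Only the isolated point is flagged here, since its second-nearest neighbor is much farther away than for any clustered point.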


In [None]:
knn_anomalies = assign_model(model_knn)
knn_anomalies.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V22,V23,V24,V25,V26,V27,V28,Amount,Anomaly,Anomaly_Score
0,41505.0,-16.526506,8.584971,-18.649853,9.505593,-13.793818,-2.832404,-16.701694,7.517344,-8.507059,...,-1.12767,-2.358579,0.673461,-1.4137,-0.462762,-2.018575,-1.042804,364.190002,1,8.589607
1,44261.0,0.339812,-2.743745,-0.13407,-1.385729,-1.451413,1.015887,-0.524379,0.22406,0.899746,...,-0.942525,-0.526819,-1.156992,0.311211,-0.746647,0.040996,0.102038,520.119995,0,2.691225
2,35484.0,1.39959,-0.590701,0.168619,-1.02995,-0.539806,0.040444,-0.712567,0.002299,-0.971747,...,0.168269,-0.166639,-0.81025,0.505083,-0.23234,0.011409,0.004634,31.0,0,0.954768
3,167123.0,-0.432071,1.647895,-1.669361,-0.349504,0.785785,-0.630647,0.27699,0.586025,-0.484715,...,0.873663,-0.178642,-0.017171,-0.207392,-0.157756,-0.237386,0.001934,1.5,0,1.084385
4,168473.0,2.01416,-0.137394,-1.015839,0.327269,-0.182179,-0.956571,0.043241,-0.160746,0.363241,...,-0.6164,0.347045,0.061561,-0.360196,0.17473,-0.078043,-0.070571,0.89,0,0.11173


# Now that we have chosen our four models (Random Forest, XGBoost, LOF, and KNN), we will compare the results of these algorithms on a test set.

# We do so in another notebook, "compare_models_PyCaret.ipynb". See you over there. :)