# Transaction Fraud Anomaly Detection

In this project, I will build a simple anomaly detection algorithm and then apply it to a credit card transaction dataset containing 284,807 transactions, of which 492 (0.172%) are fraudulent. 

The dataset includes 30 numerical features. Twenty-eight of them (V1 to V28) are principal components obtained from a PCA transformation; the original features cannot be revealed for confidentiality reasons. The remaining two are 'Time' (seconds elapsed since the first transaction in the dataset) and 'Amount' (transaction value).

The target variable, 'Class', indicates fraud (1) or non-fraud (0).

The data was curated through a collaboration between Worldline and ULB’s Machine Learning Group.

Resources: https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud/data

In [274]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

In [275]:
credit = pd.read_csv("creditcard.csv")

In [276]:
# look at the first few rows
credit.head()

| | Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | -1.359807 | -0.072781 | 2.536347 | 1.378155 | -0.338321 | 0.462388 | 0.239599 | 0.098698 | 0.363787 | ... | -0.018307 | 0.277838 | -0.110474 | 0.066928 | 0.128539 | -0.189115 | 0.133558 | -0.021053 | 149.62 | 0 |
| 1 | 0.0 | 1.191857 | 0.266151 | 0.16648 | 0.448154 | 0.060018 | -0.082361 | -0.078803 | 0.085102 | -0.255425 | ... | -0.225775 | -0.638672 | 0.101288 | -0.339846 | 0.16717 | 0.125895 | -0.008983 | 0.014724 | 2.69 | 0 |
| 2 | 1.0 | -1.358354 | -1.340163 | 1.773209 | 0.37978 | -0.503198 | 1.800499 | 0.791461 | 0.247676 | -1.514654 | ... | 0.247998 | 0.771679 | 0.909412 | -0.689281 | -0.327642 | -0.139097 | -0.055353 | -0.059752 | 378.66 | 0 |
| 3 | 1.0 | -0.966272 | -0.185226 | 1.792993 | -0.863291 | -0.010309 | 1.247203 | 0.237609 | 0.377436 | -1.387024 | ... | -0.1083 | 0.005274 | -0.190321 | -1.175575 | 0.647376 | -0.221929 | 0.062723 | 0.061458 | 123.5 | 0 |
| 4 | 2.0 | -1.158233 | 0.877737 | 1.548718 | 0.403034 | -0.407193 | 0.095921 | 0.592941 | -0.270533 | 0.817739 | ... | -0.009431 | 0.798278 | -0.137458 | 0.141267 | -0.20601 | 0.502292 | 0.219422 | 0.215153 | 69.99 | 0 |


### Data Preprocessing

In [277]:
# separate the class attribute and drop amount and time variables
X = credit.drop(columns = ['Class','Amount', 'Time'])
y = credit['Class']

In [278]:
# split the data into a training set (80%) and a validation set (20%);
# the training labels are discarded because the model is fit without them
X_train, X_val, _ , y_val = train_test_split(X, y, test_size=0.2, random_state=24)
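
Because only about 0.17% of the rows are fraudulent, a purely random split can leave the validation set with a few more or fewer fraud cases than its share. The sketch below is an optional variant, not what this notebook ran: it passes `stratify` to `train_test_split` so both parts keep the same class ratio. The arrays `X_demo` and `y_demo` are synthetic stand-ins made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# synthetic stand-in data (assumption): 10,000 rows, 20 of them positive
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(10_000, 3))
y_demo = np.zeros(10_000, dtype=int)
y_demo[:20] = 1

# stratify=y_demo keeps the positive rate equal in both parts of the split
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=24, stratify=y_demo
)

print("positives in training part:  ", y_tr.sum())
print("positives in validation part:", y_va.sum())
```

With so few positive rows, a fixed class ratio also makes F1 scores from different random seeds easier to compare.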

In [279]:
# show the first five elements of X_train
print("The first 5 elements of X_train are:\n", X_train[:5])  

The first 5 elements of X_train are:
               V1        V2        V3        V4        V5        V6        V7  \
83806   1.236053  0.286630  0.186867  0.501575 -0.163476 -0.562945 -0.030210   
9321    1.061989 -0.087414  1.146973  1.704821 -0.657469  0.501540 -0.683246   
193863  2.373022 -1.355533 -1.325551 -1.693802 -0.905278 -0.519862 -1.027586   
229423 -0.619616  0.480535 -0.069708 -0.668737  0.946775 -0.620599  1.151250   
282611 -2.928106  2.084414  0.093552  1.084699 -0.782179  0.381218 -0.865072   

              V8        V9       V10  ...       V19       V20       V21  \
83806  -0.029673 -0.188527 -0.110280  ...  0.215519 -0.065490 -0.258603   
9321    0.317528  2.125357 -0.410437  ... -0.235149 -0.339817 -0.188658   
193863 -0.227397 -1.106646  1.594906  ...  0.260985 -0.438500 -0.241407   
229423 -0.058430  0.192752 -1.730746  ...  0.077501  0.152719 -0.120910   
282611  1.816209 -0.315849 -0.768210  ...  1.301298 -0.310797 -0.440210   

             V22       V23    

In [280]:
# show the first five elements of X_val
print("The first 5 elements of X_val are:\n", X_val[:5])  

The first 5 elements of X_val are:
               V1        V2        V3        V4        V5        V6        V7  \
240571 -4.119614 -2.625302 -3.427816  0.742588  0.787195 -2.011271  0.245571   
270711  1.313545 -1.266250 -1.421509  1.574286  0.040139  0.662495  0.312851   
115125 -0.775661  0.742194  0.813272 -0.198428  1.883426  4.269919 -0.539289   
73740   1.198189 -0.220877  0.335124 -0.105590 -0.268381  0.093669 -0.337323   
252707  1.615339 -1.891391 -0.566731 -0.487742 -1.210596  0.696941 -1.123560   

              V8        V9       V10  ...       V19       V20       V21  \
240571  0.775553 -0.087441 -1.370682  ... -0.045896 -1.327149 -0.174424   
270711  0.013220  0.851643 -0.008837  ...  0.790330  0.442007 -0.137653   
115125  1.233106  0.320917  0.041559  ...  0.958351  0.413513 -0.112725   
73740   0.011270  0.369994 -0.232766  ... -0.258793  0.078082  0.085909   
252707  0.229539  0.080977  0.849066  ... -0.597572 -0.028709 -0.194281   

             V22       V23    

In [281]:
# show the first five elements of y_val
print("The first 5 elements of y_val are:\n", y_val[:5])  

The first 5 elements of y_val are:
 240571    0
270711    0
115125    0
73740     0
252707    0
Name: Class, dtype: int64


In [282]:
# check the dimensions of the variables
print('The shape of X_train is:', X_train.shape)
print('The shape of X_val is:', X_val.shape)
print('The shape of y_val is: ', y_val.shape)

The shape of X_train is: (227845, 28)
The shape of X_val is: (56962, 28)
The shape of y_val is:  (56962,)


In [283]:
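# scale each feature to [0, 1] using the minimum and maximum of the training set only,
# then apply the same transformation to the validation set
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)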
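
To make the scaling step concrete, here is a minimal NumPy sketch of the arithmetic behind min-max scaling. The helpers `min_max_fit` and `min_max_apply` are hypothetical names for illustration, not part of scikit-learn. The sketch also shows that validation values outside the training range land outside [0, 1], which is expected when the scaler is fit on the training set only.

```python
import numpy as np

def min_max_fit(X):
    """Return the per-column minimum and range observed in X."""
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    # guard against division by zero for constant columns
    col_range[col_range == 0] = 1.0
    return col_min, col_range

def min_max_apply(X, col_min, col_range):
    """Scale X with statistics computed on another (training) array."""
    return (X - col_min) / col_range

train = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
val = np.array([[12.0, 5.0]])  # both values lie outside the training range

col_min, col_range = min_max_fit(train)
train_scaled = min_max_apply(train, col_min, col_range)
val_scaled = min_max_apply(val, col_min, col_range)

print(train_scaled)
print(val_scaled)
```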

### Fitting a model

In [284]:
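# estimate the Gaussian distribution parameters mu and var for each of the features
def estimate_gaussian(X):

    # number of samples
    m = X.shape[0]

    # per-feature mean
    mu = np.sum(X, axis=0) / m

    # per-feature variance (population form: divides by m, not m - 1)
    var = np.sum((X - mu)**2, axis=0) / m

    return mu, var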
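
As a quick sanity check, the same estimates can be obtained from NumPy's built-in reductions. This sketch uses a made-up three-row array `X_demo` and compares the manual formulas with `mean` and `var`, whose default `ddof=0` matches the division by `m`.

```python
import numpy as np

# toy data (assumption): 3 samples, 2 features
X_demo = np.array([[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]])
m = X_demo.shape[0]

# manual estimates, written the same way as in estimate_gaussian
mu_manual = np.sum(X_demo, axis=0) / m
var_manual = np.sum((X_demo - mu_manual) ** 2, axis=0) / m

# built-in equivalents; var() divides by m because ddof defaults to 0
mu_np = X_demo.mean(axis=0)
var_np = X_demo.var(axis=0)

print(mu_manual, var_manual)
print(np.allclose(mu_manual, mu_np), np.allclose(var_manual, var_np))
```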

In [285]:
# evaluate the multivariate Gaussian probability density of each sample given the mean and variance
def multivariate_gaussian(X, mu, var):

    # number of features
    k = len(mu)

    # if var is a 1D array, convert it into a covariance matrix with the values of var in the diagonal
    if var.ndim == 1:
        var = np.diag(var)

    # subtract the mean vector mu from each data point
    X_new = X - mu

    # use the probability density formula to find p
    p = (2* np.pi)**(-k/2) * np.linalg.det(var)**(-0.5) * \
        np.exp(-0.5 * np.sum(np.matmul(X_new, np.linalg.pinv(var)) * X_new, axis=1))
    
    return p
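
With 28 features, a product of densities can become extremely large or underflow to zero, which makes thresholds hard to compare. One common workaround, shown here as a sketch and not as what this notebook does, is to work with log-densities. The function `log_gaussian_diag` is a hypothetical helper that assumes a diagonal covariance, the case the function above handles when `var` is a 1D array.

```python
import numpy as np

def log_gaussian_diag(X, mu, var):
    """Log-density of each row of X under independent per-feature Gaussians."""
    k = mu.shape[0]
    # squared distance from the mean, scaled by the variance of each feature
    z = (X - mu) ** 2 / var
    return -0.5 * (k * np.log(2 * np.pi) + np.sum(np.log(var)) + np.sum(z, axis=1))

# toy data (assumption): 5 samples, 3 features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(5, 3))
mu_demo = X_demo.mean(axis=0)
var_demo = X_demo.var(axis=0)

log_p = log_gaussian_diag(X_demo, mu_demo, var_demo)

# the direct density formula, for comparison on well-behaved points
direct = (2 * np.pi) ** (-3 / 2) * np.prod(var_demo) ** (-0.5) * np.exp(
    -0.5 * np.sum((X_demo - mu_demo) ** 2 / var_demo, axis=1)
)

# a point 100 standard deviations from the mean in every feature
far_point = (mu_demo + 100 * np.sqrt(var_demo)).reshape(1, -1)
log_p_far = log_gaussian_diag(far_point, mu_demo, var_demo)

print(np.allclose(np.exp(log_p), direct))
print(log_p_far, np.exp(log_p_far))
```

For the far point the plain density underflows to zero while the log-density stays finite, so very unlikely samples can still be ranked against each other.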

In [321]:
# select the threshold epsilon that gives the best F1 score on the validation set
def select_threshold(y_val, p_val):

    # initialize variables
    best_epsilon = 0
    best_F1 = 0
    F1 = 0

    # set the step size from the range of p_val, with a lower bound of 1e-10
    step_size = max((np.max(p_val) - np.min(p_val)) / 1000, 1e-10)

    # candidate thresholds start at 1e-300 and stop before 0.01;
    # if step_size is 0.01 or larger, only epsilon = 1e-300 is evaluated
    for epsilon in np.arange(1e-300, 0.01, step_size):

        # boolean array: True (anomaly) where the p_val of the sample is smaller than epsilon
        predictions = (p_val < epsilon)

        # count true positives, false positives and false negatives
        tp = np.sum((predictions == 1) & (y_val == 1))
        fp = np.sum((predictions == 1) & (y_val == 0))
        fn = np.sum((predictions == 0) & (y_val == 1))

        # calculate precision, recall and F1, guarding against division by zero
        if tp + fp == 0:
            prec = 0
        else:
            prec = tp / (tp + fp)

        if tp + fn == 0:
            rec = 0
        else:
            rec = tp / (tp + fn)

        if prec == 0:
            F1 = 0
        else:
            F1 = 2 * prec * rec / (prec + rec)

        # keep the threshold with the best F1 score so far
        if F1 > best_F1:
            best_F1 = F1
            best_epsilon = epsilon

    return best_epsilon, best_F1
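
The search above steps through a fixed range of epsilon values, so how many candidates it tries depends on the scale of `p_val`. An alternative, sketched here under my own assumptions and not used to produce the results in this notebook, is to take candidate thresholds from the lowest quantiles of the scores themselves. The names `f1_at_threshold` and `select_threshold_quantiles`, the parameters `n_candidates` and `max_fraction`, and the synthetic scores are all hypothetical.

```python
import numpy as np

def f1_at_threshold(y_true, scores, epsilon):
    """F1 score when every sample with score < epsilon is flagged as an anomaly."""
    pred = scores < epsilon
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    fn = np.sum(~pred & (y_true == 1))
    if tp == 0:
        return 0.0
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def select_threshold_quantiles(y_true, scores, n_candidates=200, max_fraction=0.05):
    """Try thresholds placed at the lowest quantiles of the scores."""
    # fractions of the data that each candidate threshold would roughly flag
    fractions = np.linspace(0.0, max_fraction, n_candidates + 1)[1:]
    candidates = np.quantile(scores, fractions)
    best_eps, best_f1 = candidates[0], 0.0
    for eps in candidates:
        f1 = f1_at_threshold(y_true, scores, eps)
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps, best_f1

# synthetic scores (assumption): lower means more anomalous, e.g. log-densities
normal_scores = np.linspace(-3.0, 3.0, 1000)
anomaly_scores = np.linspace(-8.0, -6.0, 10)
scores_demo = np.concatenate([normal_scores, anomaly_scores])
y_demo = np.concatenate([np.zeros(1000, dtype=int), np.ones(10, dtype=int)])

best_eps, best_f1 = select_threshold_quantiles(y_demo, scores_demo)
print(best_eps, best_f1)
```

Because the candidates come from the data, this search works the same way whether the scores are raw densities or log-densities.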

### Run the model on the dataset

In [322]:
# estimate the Gaussian parameters
mu_train, var_train = estimate_gaussian(X_train_scaled)

# evaluate the density of each sample in the training set
p_train = multivariate_gaussian(X_train_scaled, mu_train, var_train)

# evaluate the density of each sample in the validation set
p_val = multivariate_gaussian(X_val_scaled, mu_train, var_train)

# find the best threshold on the validation set
epsilon, F1 = select_threshold(y_val, p_val)

print('Best epsilon found using cross-validation: %e'% epsilon)
print('Best F1 on Cross Validation Set:  %f'% F1)
print('# Anomalies found: %d'% sum(p_train < epsilon))

Best epsilon found using cross-validation: 1.000000e-300
Best F1 on Cross Validation Set:  0.292398
# Anomalies found: 194
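
An F1 score alone does not show whether precision or recall is the weaker side. The sketch below shows one way to break it down at a chosen threshold. The helper `precision_recall` is a hypothetical name and the six-sample arrays are made up; to apply it to this notebook, one would pass `y_val`, `p_val` and `epsilon` instead.

```python
import numpy as np

def precision_recall(y_true, p, epsilon):
    """Confusion counts, precision and recall when p < epsilon flags an anomaly."""
    y_true = np.asarray(y_true)
    pred = np.asarray(p) < epsilon
    tp = int(np.sum(pred & (y_true == 1)))
    fp = int(np.sum(pred & (y_true == 0)))
    fn = int(np.sum(~pred & (y_true == 1)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return tp, fp, fn, precision, recall

# toy labels and densities (assumption)
y_demo = np.array([0, 0, 1, 1, 0, 1])
p_demo = np.array([0.9, 0.05, 0.01, 0.2, 0.8, 0.02])

tp, fp, fn, precision, recall = precision_recall(y_demo, p_demo, epsilon=0.1)
print(f"tp={tp} fp={fp} fn={fn} precision={precision:.3f} recall={recall:.3f}")
```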
