# Fraudulent credit card transaction detection

First, the required modules are imported: numpy for array functions, pandas for loading the dataset and handling dataframes, and sklearn for the machine learning algorithms and for normalising the data.

In [4]:
import numpy as np
import pandas as pd

from sklearn.linear_model import LogisticRegression as LogReg
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.preprocessing import StandardScaler

Then the dataset is loaded from the GitHub repository for this project; the original dataset can be found [here](https://www.kaggle.com/mlg-ulb/creditcardfraud). This dataset consists of ~285,000 transactions, where each transaction has a time and an amount associated with it along with 28 features extracted using PCA (principal component analysis). In addition, each transaction has a class of either 1 or 0, where 0 means it was a genuine transaction and 1 means it was a fraudulent transaction.

In [5]:
df = pd.read_csv("https://media.githubusercontent.com/media/f1nn1711/Credit-Card-Fraud-Detection/main/creditcard.csv")

This dataset is highly unbalanced: there are about 578 genuine transactions for every fraudulent transaction.

In [6]:
print(f"Percent of transactions fraudulent: {round((len(df[df.Class == 1])/len(df))*100,5)}%")

Percent of transactions fraudulent: 0.17275%
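
The ratio of genuine to fraudulent transactions quoted above follows directly from the printed percentage. As a quick check (a minimal sketch; `genuine_per_fraud` is a hypothetical helper that is not part of the original notebook):

```python
def genuine_per_fraud(fraud_percent):
    """Return the number of genuine transactions per fraudulent one,
    given the percentage of all transactions that are fraudulent."""
    genuine_percent = 100 - fraud_percent
    return genuine_percent / fraud_percent


# 0.17275 is the percentage printed by the cell above.
ratio = genuine_per_fraud(0.17275)
print(f"About {round(ratio)} genuine transactions per fraudulent one")
```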


Each transaction has an amount feature associated with it, which can range from 0 to 25,691. This is a massive range for a single feature, so it is important that these values are scaled down so that their range is roughly in line with the range of the other features. If these values aren't normalised, the amount feature will have a larger effect on the model during fitting, even though it might not be more important when it comes to predicting.

In [7]:
scaler = StandardScaler()

amount_values = df["Amount"].values
amount_values = amount_values.reshape(-1,1)

df["Amount"] = scaler.fit_transform(amount_values)
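
For intuition, `StandardScaler` standardises a column by subtracting its mean and dividing by its standard deviation. The sketch below reproduces that arithmetic with numpy on a few made-up amounts (the values are illustrative and are not taken from the dataset):

```python
import numpy as np

# Made-up transaction amounts, including one very large value.
amounts = np.array([0.0, 12.5, 80.0, 250.0, 25691.0])

# Subtract the mean and divide by the (population) standard deviation,
# which is the per-column transformation StandardScaler applies.
standardised = (amounts - amounts.mean()) / amounts.std()

print(standardised.mean())  # close to 0
print(standardised.std())   # close to 1
```

After this step the column has a mean of 0 and a standard deviation of 1, although unusually large amounts still stand out as large positive values.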

The dataframe is then split into 2 arrays: the array called X holds the features and the array called y holds the classes. The features are made up of the PCA values along with the normalised amount (the `Time` column is dropped), and the classes consist of either a 1 or a 0.

In [8]:
X = df.drop(["Time","Class"], axis=1).values
y = df["Class"].values

The data is then split into a training set and a testing set. Only the training set is used to fit the models, so that the testing set can be used to evaluate the models' performance on data they have never seen before. `train_split` is the fraction of the data used for training; for example, 0.85 means 85% is used for training and 15% for testing.



In [9]:
train_split = 0.9

split_index = round(len(df)*train_split)

X_train = X[:split_index]
y_train = y[:split_index]

X_test = X[split_index:]
y_test = y[split_index:]
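
The slicing above can be wrapped in a small helper so that the split fraction is set in one place. This is a sketch that assumes the rows are taken in their stored order, as in the cell above; `split_by_fraction` is a hypothetical name and the toy arrays are made up:

```python
import numpy as np


def split_by_fraction(X, y, train_split):
    """Split X and y in their stored order: the first `train_split`
    fraction is used for training and the remainder for testing."""
    split_index = round(len(X) * train_split)
    return X[:split_index], y[:split_index], X[split_index:], y[split_index:]


# Toy data: 20 samples with 3 features each.
X_demo = np.arange(60).reshape(20, 3)
y_demo = np.zeros(20, dtype=int)

X_tr, y_tr, X_te, y_te = split_by_fraction(X_demo, y_demo, 0.85)
print(len(X_tr), len(X_te))
```

Because the rows are not shuffled, the testing set always consists of the last rows of the data.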

The first model to be trained is the logistic regression model. The logistic regression algorithm works by fitting a logistic curve to the dataset. For example, instead of trying to predict credit card fraud, let's imagine we are trying to predict whether a person has diabetes, and the dataset we have consists of people's blood glucose levels along with a 0 representing that they don't have diabetes and a 1 representing that they do. We would draw a 2D graph where the x-axis is the blood glucose level and the y-axis runs from 0 to 1, representing whether or not they have diabetes. All the elements in the dataset are plotted and a best-fit logistic curve is drawn through the data.

When it comes to predicting whether someone has diabetes, their blood glucose level is measured and the y-value of the logistic curve is calculated at that x-value (the blood glucose level). The y-value will be a number between 0 and 1, and it can be thought of as the model's confidence that the person has diabetes. For example, if a person has a low glucose level then the y-value of the logistic curve might be 0.1, which would indicate there is a low chance that this person has diabetes.

In [10]:
logreg_model = LogReg()
logreg_model.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
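
To make the diabetes example concrete, the sketch below evaluates a logistic curve at a few glucose levels. The weight and bias are made-up values chosen purely for illustration; a fitted model would learn them from the data.

```python
import math


def logistic(x, weight, bias):
    """Logistic curve: maps any real input to a value between 0 and 1."""
    return 1 / (1 + math.exp(-(weight * x + bias)))


# Hypothetical parameters, not fitted to any real data.
WEIGHT = 0.05
BIAS = -7.0

for glucose in (80, 140, 200):
    confidence = logistic(glucose, WEIGHT, BIAS)
    print(f"glucose={glucose}: model confidence of diabetes = {confidence:.2f}")
```

With these parameters a low glucose level gives a value near 0 and a high level gives a value near 1, which is the behaviour described in the example.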

The second model to be trained is the K-nearest-neighbour (KNN) model. KNN works by seeing how similar a sample is to other samples. For example, let's now say we are predicting whether someone has diabetes using their weight and hours of exercise per week. We would have a dataset which contains the hours of exercise, the weight, and whether they have diabetes, represented by a 1 if they do and a 0 if they do not. The dataset is plotted on a 2D graph with the hours of exercise per week on the x-axis and the weight on the y-axis. When it comes to using this model to predict, we plot the new sample on the graph. The classes of the nearest neighbouring samples then decide the classification of the new unknown sample. The K in KNN is the number of neighbouring samples that are looked at when classifying the new sample. For example, if the new sample's 5 nearest neighbours have the classes `0,1,0,1,1` then the new unknown sample will be classified as `1`, because 3 of the 5 nearest samples have that class.

In [11]:
k=10
knn_model = KNN(k)
knn_model.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=10, p=2,
                     weights='uniform')
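
The voting idea can be sketched in a few lines of plain Python. This is a simplified illustration rather than how sklearn implements it: the data is made up, `knn_predict` is a hypothetical helper, and Euclidean distance is assumed.

```python
import math
from collections import Counter


def knn_predict(samples, classes, new_sample, k):
    """Classify new_sample by majority vote among its k nearest samples."""
    # Euclidean distance from the new sample to every known sample.
    distances = [math.dist(sample, new_sample) for sample in samples]
    # Indices of the k closest samples.
    nearest = sorted(range(len(samples)), key=lambda i: distances[i])[:k]
    votes = Counter(classes[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Made-up data: (hours of exercise per week, weight in kg) and diabetes class.
samples = [(1, 95), (2, 90), (0, 100), (6, 70), (7, 65), (5, 72)]
classes = [1, 1, 1, 0, 0, 0]

print(knn_predict(samples, classes, (1, 92), k=3))
```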

The `calculate_accuracy` function calculates the accuracy of a model from the predicted values and the actual values. Each predicted value is compared with the actual value; if they are the same then the model was correct. At the end, the number of correct predictions is divided by the total number of predictions to get the accuracy of the model.

In [12]:
def calculate_accuracy(pred, actu):
    correct = 0

    for a, p in zip(actu, pred):
        if a == p:
            correct += 1

    return correct/pred.size
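
The same calculation can also be written without an explicit loop. The sketch below is an equivalent numpy version (`calculate_accuracy_vectorised` is a hypothetical name, and the example values are made up):

```python
import numpy as np


def calculate_accuracy_vectorised(pred, actu):
    """Fraction of predictions that match the actual classes."""
    pred = np.asarray(pred)
    actu = np.asarray(actu)
    # Comparing the arrays gives True/False per element; the mean of
    # those booleans is the share of correct predictions.
    return np.mean(pred == actu)


print(calculate_accuracy_vectorised([0, 1, 0, 0], [0, 1, 1, 0]))  # prints 0.75
```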

The `apply_threshold` function takes in a probability and a threshold; if the probability is less than the threshold the function returns 0, otherwise it returns 1. This function is then vectorized, allowing it to be applied to an array.

In [13]:
def apply_threshold(value, threshold):
    if value < threshold:
        return 0
    else:
        return 1

apply_threshold = np.vectorize(apply_threshold)
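
`np.vectorize` is mainly a convenience wrapper that still calls the function once per element. As an alternative sketch, the same thresholding can be done with a single array comparison (`apply_threshold_vectorised` is a hypothetical name, and the probabilities are made up):

```python
import numpy as np


def apply_threshold_vectorised(probabilities, threshold):
    """Return 1 where the probability is at least the threshold, else 0."""
    return (np.asarray(probabilities) >= threshold).astype(int)


print(apply_threshold_vectorised([0.02, 0.1, 0.45], 0.1))  # prints [0 1 1]
```

Note that a probability exactly equal to the threshold is classified as 1, matching the behaviour of `apply_threshold`.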

Once the logistic regression model has been fitted to the training dataset, we classify the testing dataset with it. For each element in the testing dataset the model returns an array of two values: the first is the probability that the transaction is genuine and the second is the probability that it is fraudulent. So the first step is to keep only the probability that the transaction is fraudulent. These probabilities then need to be turned into either a 1 or a 0, which is where the `apply_threshold` function is used. In this case the threshold is 0.1, so if the probability that the transaction is fraudulent is 0.1 or greater it will be classified as fraudulent (1), otherwise it will be classified as genuine (0).

In [14]:
# predict_proba returns an (n_samples, 2) array where each row is [P(class 0), P(class 1)]
logreg_prediction = logreg_model.predict_proba(X_test)
fraud_prob = logreg_prediction[:, 1]  # keep only the probability of fraud (class 1)

logreg_filtered_pred = apply_threshold(fraud_prob, 0.1)

logreg_accuracy = calculate_accuracy(logreg_filtered_pred, y_test)

The same is then done for the K-nearest-neighbour model.

In [15]:
knn_prediction = knn_model.predict_proba(X_test)
fraud_prob = knn_prediction[:, 1]  # keep only the probability of fraud (class 1)

knn_filtered_pred = apply_threshold(fraud_prob, 0.1)

knn_accuracy = calculate_accuracy(knn_filtered_pred, y_test)

Then the accuracies of both models are printed.

In [16]:
print(f"Logistic regression accuracy: {logreg_accuracy}")
print(f"K-nearest-neighbour accuracy: {knn_accuracy}")

Logistic regression accuracy: 0.9990519995786665
K-nearest-neighbour accuracy: 0.9982093325374811
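
Accuracy on its own can be misleading with classes this unbalanced. As a rough illustration, a baseline that labels every transaction as genuine would still score very highly. The labels below are made up to roughly match the fraud rate printed earlier, and `always_genuine_accuracy` is a hypothetical helper:

```python
import numpy as np


def always_genuine_accuracy(actual):
    """Accuracy of a baseline that predicts genuine (0) for every sample."""
    actual = np.asarray(actual)
    baseline_pred = np.zeros_like(actual)
    return np.mean(baseline_pred == actual)


# Made-up labels: 17 fraudulent transactions out of 10,000 (0.17%).
labels = np.zeros(10_000, dtype=int)
labels[:17] = 1

print(always_genuine_accuracy(labels))  # prints 0.9983
```

This is why the accuracy figures need to be read alongside a breakdown of which kinds of errors the model makes.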


The prediction results are then analysed to find out more about how the model performs. This counts the number of:

- true negatives: the model predicts genuine and the transaction was genuine
- true positives: the model predicts fraudulent and the transaction was fraudulent
- false negatives: the model predicts genuine and the transaction was fraudulent
- false positives: the model predicts fraudulent and the transaction was genuine

As before, a `0` represents a genuine transaction and a `1` represents a fraudulent transaction.

In [17]:
def analyse_predictions(pred, actu):
    results = {
        "tn" : 0,
        "tp" : 0,
        "fn" : 0,
        "fp" : 0
    }

    for a, p in zip(actu, pred):
        if a == 0:
            if p == 0:
                results["tn"] += 1  # genuine, predicted genuine
            else:
                results["fp"] += 1  # genuine, predicted fraudulent
        else:
            if p == 1:
                results["tp"] += 1  # fraudulent, predicted fraudulent
            else:
                results["fn"] += 1  # fraudulent, predicted genuine

    return results

The predictions are compared to the correct classifications using the `analyse_predictions` function and the results are printed. In this case the results from the KNN model are used; however, this could easily be changed to use the logistic regression model's results.

In [18]:
analysed_results = analyse_predictions(knn_filtered_pred, y_test)

print(analysed_results)

{'tn': 28414, 'tp': 16, 'fn': 6, 'fp': 45}
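
The four counts can also be computed with boolean masks instead of a loop. The sketch below uses the standard definitions with `1` (fraudulent) as the positive class; `confusion_counts` is a hypothetical name and the example inputs are made up:

```python
import numpy as np


def confusion_counts(pred, actu):
    """Count true/false negatives/positives, with 1 = fraudulent (positive)."""
    pred = np.asarray(pred)
    actu = np.asarray(actu)
    return {
        "tn": int(np.sum((actu == 0) & (pred == 0))),  # genuine, predicted genuine
        "tp": int(np.sum((actu == 1) & (pred == 1))),  # fraudulent, predicted fraudulent
        "fn": int(np.sum((actu == 1) & (pred == 0))),  # fraudulent, predicted genuine
        "fp": int(np.sum((actu == 0) & (pred == 1))),  # genuine, predicted fraudulent
    }


# Made-up predictions and labels for illustration.
print(confusion_counts(pred=[0, 1, 1, 0, 0], actu=[0, 1, 0, 1, 0]))
```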


The first value printed is the percentage of genuine transactions that the model correctly identified as genuine, and the second is the percentage of fraudulent transactions that it correctly identified as fraudulent. To calculate the percentage for the genuine transactions, the number of times the model correctly predicted a genuine transaction is divided by the total number of genuine transactions in the testing dataset; the fraudulent percentage is calculated in the same way.

In [19]:
print(f"{round((analysed_results['tn']/np.where(y_test == 0)[0].size)*100, 3)}% of genuine transactions identified.")
print(f"{round((analysed_results['tp']/np.where(y_test == 1)[0].size)*100, 3)}% of fraudulent transactions identified.")

99.842% of genuine transactions identified.
72.727% of fraudulent transactions identified.
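
These two percentages are commonly called specificity (share of genuine transactions identified) and recall or sensitivity (share of fraudulent transactions identified). A minimal sketch of both formulas in terms of the four counts, using made-up numbers and a hypothetical helper name:

```python
def identification_rates(tn, tp, fn, fp):
    """Return (share of genuine identified, share of fraudulent identified)."""
    genuine_rate = tn / (tn + fp)      # also known as specificity
    fraudulent_rate = tp / (tp + fn)   # also known as recall or sensitivity
    return genuine_rate, fraudulent_rate


# Made-up counts for illustration only.
genuine_rate, fraudulent_rate = identification_rates(tn=90, tp=8, fn=2, fp=10)
print(f"{genuine_rate:.1%} of genuine, {fraudulent_rate:.1%} of fraudulent identified")
```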
