# Credit Card Fraud Detection Predictions
### Alex Bartel
This project uses a dataset of credit card transactions to study fraud detection. I will apply data science and machine learning techniques to train a model that recognizes fraudulent transactions from the 29 features in the dataset: 28 anonymized numerical features plus the purchase amount.

### Dataset source: https://www.kaggle.com/datasets/nelgiriyewithana/credit-card-fraud-detection-dataset-2023

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

### Loading the Data
We'll take a look at the first five rows of the dataset to understand what we're dealing with. There are 28 anonymized features, the transaction amount, and the target, which is a binary variable where a 0 indicates a legitimate transaction and a 1 indicates a fraudulent transaction.

In [2]:
card_data = pd.read_csv("creditcard_2023.csv")
card_data.head(5)

Unnamed: 0,id,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0,-0.260648,-0.469648,2.496266,-0.083724,0.129681,0.732898,0.519014,-0.130006,0.727159,...,-0.110552,0.217606,-0.134794,0.165959,0.12628,-0.434824,-0.08123,-0.151045,17982.1,0
1,1,0.9851,-0.356045,0.558056,-0.429654,0.27714,0.428605,0.406466,-0.133118,0.347452,...,-0.194936,-0.605761,0.079469,-0.577395,0.19009,0.296503,-0.248052,-0.064512,6531.37,0
2,2,-0.260272,-0.949385,1.728538,-0.457986,0.074062,1.419481,0.743511,-0.095576,-0.261297,...,-0.00502,0.702906,0.945045,-1.154666,-0.605564,-0.312895,-0.300258,-0.244718,2513.54,0
3,3,-0.152152,-0.508959,1.74684,-1.090178,0.249486,1.143312,0.518269,-0.06513,-0.205698,...,-0.146927,-0.038212,-0.214048,-1.893131,1.003963,-0.51595,-0.165316,0.048424,5384.44,0
4,4,-0.20682,-0.16528,1.527053,-0.448293,0.106125,0.530549,0.658849,-0.21266,1.049921,...,-0.106984,0.729727,-0.161666,0.312561,-0.414116,1.071126,0.023712,0.419117,14278.97,0


### Examining the Features
We will examine the features to better understand their distributions. I will use the `describe` method and create a density plot for each feature.

In [3]:
summary = card_data.describe()
summary_rounded = summary.round(2)
summary_rounded

Unnamed: 0,id,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
count,568630.0,568630.0,568630.0,568630.0,568630.0,568630.0,568630.0,568630.0,568630.0,568630.0,...,568630.0,568630.0,568630.0,568630.0,568630.0,568630.0,568630.0,568630.0,568630.0,568630.0
mean,284314.5,-0.0,-0.0,-0.0,-0.0,0.0,-0.0,-0.0,0.0,0.0,...,0.0,0.0,0.0,-0.0,-0.0,-0.0,-0.0,0.0,12041.96,0.5
std,164149.49,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,6919.64,0.5
min,0.0,-3.5,-49.97,-3.18,-4.95,-9.95,-21.11,-4.35,-10.76,-3.75,...,-19.38,-7.73,-30.3,-4.07,-13.61,-8.23,-10.5,-39.04,50.01,0.0
25%,142157.25,-0.57,-0.49,-0.65,-0.66,-0.29,-0.45,-0.28,-0.19,-0.57,...,-0.17,-0.49,-0.24,-0.65,-0.55,-0.63,-0.3,-0.23,6054.89,0.0
50%,284314.5,-0.09,-0.14,0.0,-0.07,0.08,0.08,0.23,-0.11,0.09,...,-0.04,-0.03,-0.06,0.02,-0.01,-0.01,-0.17,-0.01,12030.15,0.5
75%,426471.75,0.83,0.34,0.63,0.71,0.44,0.5,0.53,0.05,0.56,...,0.15,0.46,0.16,0.7,0.55,0.67,0.33,0.41,18036.33,1.0
max,568629.0,2.23,4.36,14.13,3.2,42.72,26.17,217.87,5.96,20.27,...,8.09,12.63,31.71,12.97,14.62,5.62,113.23,77.26,24039.93,1.0


In [None]:
plt.figure(figsize=[12,36])
for i in range(1, 30):
    plt.subplot(10,3,i)
    sns.kdeplot(card_data.iloc[:, i])
    plt.title(card_data.columns[i])
    plt.ylabel("Density")  # kdeplot draws an estimated density, not a count
plt.tight_layout()
plt.show()

In [None]:
card_data.loc[:, 'Class'].value_counts().sort_index()

### Initial Exploration Findings
Based on this, we see that the features follow a variety of distributions and ranges, and that the dataset arrives with the anonymized features already standardized (means of 0, standard deviations of 1). The exception is the amount. Importantly, the class counts show that this is a balanced binary classification task. Since the amount could be an important feature, we will standardize it for our first model.
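As a minimal sketch of what standardizing a single column does, the snippet below compares `StandardScaler` with the manual z-score formula. The `amounts` array is a synthetic stand-in invented for illustration, not the real `Amount` column.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical stand-in for the Amount column (not the real data)
amounts = rng.uniform(50.0, 24000.0, size=(1000, 1))

# StandardScaler subtracts the column mean and divides by the
# population standard deviation (ddof=0)
scaled = StandardScaler().fit_transform(amounts)
manual = (amounts - amounts.mean(axis=0)) / amounts.std(axis=0)

print("Matches manual z-score:", np.allclose(scaled, manual))
print("Mean close to 0:", abs(scaled.mean()) < 1e-9)
print("Std close to 1:", abs(scaled.std() - 1.0) < 1e-9)
```

After this transformation the amount sits on the same scale as the other features, which keeps it from dominating a model that is sensitive to feature magnitude.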

In [None]:
scaler = StandardScaler()
card_data['Amount'] = scaler.fit_transform(card_data[['Amount']])
card_data.head(5)

### Training and Testing Data Split

Now I will split the dataset into a training dataset on which I will build my predictive models, as well as a validation dataset to gauge the relative performance of different models, and a final test dataset to determine the final model's performance.
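A three-way split like this can be sanity-checked on synthetic data first. The sketch below assumes an 80/10/10 split and a fixed `random_state` for reproducibility; `X_demo` and `y_demo` are hypothetical stand-ins for the real features and labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
# Hypothetical balanced dataset: 1000 rows, 29 features
X_demo = rng.normal(size=(1000, 29))
y_demo = np.repeat([0, 1], 500)

# First hold out 20% of the rows, then split that holdout in half
X_tr, X_hold, y_tr, y_hold = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.2, random_state=42
)
X_va, X_te, y_va, y_te = train_test_split(
    X_hold, y_hold, stratify=y_hold, test_size=0.5, random_state=42
)

# Stratification should preserve the class balance in every subset
for name, labels in [("train", y_tr), ("validation", y_va), ("test", y_te)]:
    print(f"{name}: {len(labels)} rows, fraud share {labels.mean():.2f}")
```

Stratifying on the label keeps the share of fraudulent rows the same in all three subsets, so metrics computed on them are directly comparable.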

In [None]:
X = card_data.iloc[:,1:30].values
y = card_data.loc[:,'Class'].values
X_train, X_hold, y_train, y_hold = train_test_split(X, y, stratify=y, test_size=0.2)
X_val, X_test, y_val, y_test = train_test_split(X_hold, y_hold, stratify=y_hold, test_size=0.5)

### Logistic Regression

I will start with logistic regression, a simple but effective model for binary classification.
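For intuition, a fitted binary logistic regression turns a weighted sum of the features into a probability with the sigmoid function. The sketch below checks this on synthetic data generated with `make_classification`; the data and the names `X_demo`, `y_demo`, and `clf` are assumptions for illustration, not part of this project.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical toy problem, unrelated to the fraud dataset
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=0)
clf = LogisticRegression().fit(X_demo, y_demo)

# Linear score z = w . x + b, squashed into (0, 1) by the sigmoid
z = X_demo @ clf.coef_.ravel() + clf.intercept_[0]
p_manual = 1.0 / (1.0 + np.exp(-z))
p_sklearn = clf.predict_proba(X_demo)[:, 1]

print("Sigmoid matches predict_proba:", np.allclose(p_manual, p_sklearn))
# predict() corresponds to thresholding that probability at 0.5
print("predict matches 0.5 cutoff:",
      np.array_equal(clf.predict(X_demo), (p_sklearn > 0.5).astype(int)))
```

The 0.5 cutoff is only a default, so the predicted probabilities can be thresholded differently when recall matters more than precision.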

In [None]:
model_1 = LogisticRegression().fit(X_train, y_train)

In [None]:
model_1_train_preds = model_1.predict(X_train)
model_1_train_accuracy = np.sum(model_1_train_preds==y_train)/len(y_train)
model_1_val_preds = model_1.predict(X_val)
model_1_val_accuracy = np.sum(model_1_val_preds==y_val)/len(y_val)
print(f"The logistic regression model's accuracy on the training data is {model_1_train_accuracy:.2f}")
print(f"The logistic regression model's accuracy on the validation data is {model_1_val_accuracy:.2f}")

In [None]:
from sklearn.metrics import precision_score, recall_score, confusion_matrix, ConfusionMatrixDisplay, accuracy_score
cm_model_1_train = confusion_matrix(y_train, model_1_train_preds)
disp = ConfusionMatrixDisplay(confusion_matrix=cm_model_1_train)
disp.plot(cmap=plt.cm.Blues)
plt.show()

model_1_train_precision = precision_score(y_train, model_1_train_preds)
model_1_train_recall = recall_score(y_train, model_1_train_preds)

print(f'Training Precision: {model_1_train_precision:.2f}')
print(f'Training Recall: {model_1_train_recall:.2f}')

In [None]:
cm_model_1_val = confusion_matrix(y_val, model_1_val_preds)
disp = ConfusionMatrixDisplay(confusion_matrix=cm_model_1_val)
disp.plot(cmap=plt.cm.Blues)
plt.show()

model_1_val_precision = precision_score(y_val, model_1_val_preds)
model_1_val_recall = recall_score(y_val, model_1_val_preds)

print(f'Validation Precision: {model_1_val_precision:.2f}')
print(f'Validation Recall: {model_1_val_recall:.2f}')

### Logistic Regression Results

With 96% accuracy, 98% precision, and 95% recall, this model is already pretty good, but we will continue to iterate with different models to see if we can make meaningful improvements.

Of particular concern is the recall. A recall of 95% means that the model is still failing to flag approximately 1 in 20 fraudulent transactions. If a more sophisticated model can improve the recall to ensure that fewer fraudulent transactions are ignored, it will likely be a worthwhile investment.
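To make the link between recall and missed fraud concrete, the sketch below computes the metrics from confusion-matrix counts. The counts are hypothetical numbers chosen for illustration, not results from this notebook.

```python
# Hypothetical confusion-matrix counts (illustration only)
tn, fp, fn, tp = 9_800, 200, 500, 9_500

precision = tp / (tp + fp)   # share of fraud flags that were truly fraud
recall = tp / (tp + fn)      # share of actual fraud that was flagged
accuracy = (tp + tn) / (tp + tn + fp + fn)
missed_per_100 = 100 * fn / (tp + fn)  # fraud cases missed per 100 fraud cases

print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")
print(f"Accuracy: {accuracy:.3f}")
print(f"Missed fraud per 100 fraudulent transactions: {missed_per_100:.1f}")
```

With these counts the recall is 0.95, so 5 of every 100 fraudulent transactions go unflagged, which is the "1 in 20" figure.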

The good news is that the training and validation metrics are similar, which suggests the model is not overfitting and should generalize well to unseen data.

### Tuning the Decision Threshold

One easy adjustment we could make to this model to increase the recall is to tune the threshold of the model to reduce false negatives. However, this will come at the cost of lowering the precision and overall accuracy. Still, if the business wants to focus on catching as many instances of fraud as possible, and is willing to accept more false positives (the model incorrectly predicting fraud), then this could be a viable option.

In [None]:
# Predicted probability of the positive (fraud) class for each validation row
model_1_probs = model_1.predict_proba(X_val)[:, 1]
threshold_list = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5,
                  0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 0.99]
for threshold in threshold_list:
    print(f'\n******** For threshold = {threshold} ******')
    # Flag a transaction as fraud when its probability exceeds the threshold
    preds = (model_1_probs > threshold).astype(int)
    accuracy = accuracy_score(y_val, preds)
    precision = precision_score(y_val, preds)
    recall = recall_score(y_val, preds)
    print(f'Validation accuracy is {accuracy:.2f}')
    print(f'Validation Precision is {precision:.2f}')
    print(f'Validation Recall is {recall:.2f}')

### Threshold Tuning Results

As we can see, if the goal is to maximize recall, a lower threshold value is preferable. Thresholds of roughly 0.2 to 0.45 even give a bump to recall without much reduction in precision or overall accuracy. The default threshold is 0.5, which is what we saw in our initial assessment of the model.
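An alternative to scanning a fixed list of thresholds is to let scikit-learn enumerate every candidate with `precision_recall_curve` and pick the one that meets a recall target. The sketch below runs on synthetic data; the 0.98 target and all names in it are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, recall_score
from sklearn.model_selection import train_test_split

# Hypothetical toy problem standing in for the fraud data
X_demo, y_demo = make_classification(n_samples=4000, n_features=10, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.25, random_state=0
)
clf = LogisticRegression().fit(X_tr, y_tr)
probs = clf.predict_proba(X_va)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_va, probs)
# precision and recall carry one extra trailing point with no threshold; drop it
target_recall = 0.98  # hypothetical business requirement
meets_target = recall[:-1] >= target_recall
# Among thresholds that reach the target, keep the one with the best precision
best_idx = np.argmax(np.where(meets_target, precision[:-1], -1.0))
chosen = thresholds[best_idx]

# The curve treats a score >= threshold as a positive prediction
preds = (probs >= chosen).astype(int)
print(f"Chosen threshold: {chosen:.3f}")
print(f"Recall at that threshold: {recall_score(y_va, preds):.3f}")
print(f"Precision at that threshold: {precision[best_idx]:.3f}")
```

Choosing the threshold on validation data and then confirming it on the held-out test set avoids tuning to the same rows used for the final performance estimate.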

While this could be helpful in a real-life application, we also want to explore other models that may perform better in all aspects so that we don't have to sacrifice precision in order to catch a very high percentage of fraud.

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

In [None]:
# Small fully connected network: 29 inputs -> 58 -> 8 -> 1 sigmoid output
model_2 = Sequential()
model_2.add(Dense(58, input_shape=(29,), activation='relu'))
model_2.add(Dense(8, activation='relu'))
model_2.add(Dense(1, activation='sigmoid'))
model_2.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model_2.summary()

In [None]:
# Fit the Keras model on the training data
model_2.fit(X_train, y_train, epochs=10, batch_size=10)

In [None]:
# evaluate() returns [loss, accuracy] because the model was compiled with metrics=['accuracy']
val_loss, val_accuracy = model_2.evaluate(X_val, y_val)
print(f'Validation loss: {val_loss:.4f}')
print(f'Validation accuracy: {val_accuracy:.2f}')

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam, RMSprop
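# Note: this wrapper is only available in older TensorFlow releases;
# newer environments need the separate scikeras package instead.
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier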
from sklearn.model_selection import RandomizedSearchCV
import numpy as np

# Define a function that creates the model.
# `optimizer` must be an optimizer class (e.g. Adam), because it is instantiated below.
def create_model(neurons_1=58, neurons_2=8, activation='relu', optimizer=Adam, learn_rate=0.001):
    model = Sequential()
    model.add(Dense(neurons_1, input_shape=(29,), activation=activation))
    model.add(Dense(neurons_2, activation=activation))
    model.add(Dense(1, activation='sigmoid'))
    opt_instance = optimizer(learning_rate=learn_rate)
    model.compile(loss='binary_crossentropy', optimizer=opt_instance, metrics=['accuracy'])
    return model

# Wrap the model in KerasClassifier
model = KerasClassifier(build_fn=create_model, verbose=0)

# Define the hyperparameter space
param_dist = {
    'neurons_1': [32, 58, 100],  # Number of neurons in the first dense layer
    'neurons_2': [8, 16, 32],    # Number of neurons in the second dense layer
    'activation': ['relu', 'tanh'],  # Activation functions
    'optimizer': [Adam, RMSprop],    # Optimizers
    'learn_rate': [0.001, 0.01, 0.1],  # Learning rates
    'batch_size': [16, 32, 64],  # Batch sizes
    'epochs': [10, 20, 30]       # Number of epochs
}

# Perform Randomized Search
random_search = RandomizedSearchCV(
    estimator=model, 
    param_distributions=param_dist, 
    n_iter=20,  # Number of combinations to try
    scoring='accuracy', 
    cv=3, 
    verbose=1, 
    n_jobs=-1
)

# Fit the random search
random_search_result = random_search.fit(X_train, y_train)

# Display the best parameters and the corresponding score
print("Best accuracy: {:.2f}%".format(random_search_result.best_score_ * 100))
print("Best hyperparameters: ", random_search_result.best_params_)