# Fraud Detection using Convolutional Neural Network (CNN)

In this notebook, we will use a Convolutional Neural Network (CNN) for fraud detection on the Bank Account Fraud dataset (NeurIPS 2022) from https://www.kaggle.com/datasets/sgpjesus/bank-account-fraud-dataset-neurips-2022/data.

The content of this notebook is as follows: 

1. Set-up and Exploratory Data Analysis \
    1.1. Import Libraries \
    1.2. Exploratory Data Analysis (EDA) 

2. Preprocessing and Data Cleaning \
    2.1. One Hot Encoding \
    2.2. Oversampling using SMOTE \
    2.3. Separation of Training and Test Sets \
    2.4. Standardizing the features

3. Training of Models and Hyperparameter Tuning \
    3.1. Hyperparameter Tuning \
    3.2. Feature Selection \
    3.3. Model Prediction

4. Evaluation of Results and Error Analysis \
    4.1. Confusion Matrix \
    4.2. Compute Loss using Cross-Entropy Loss with Probabilities

# ***1. Set-up and Exploratory Data Analysis***

**1.1. Import Libraries**

In [19]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import plotly.graph_objects as go
from sklearn.metrics import confusion_matrix, log_loss, accuracy_score, precision_score, recall_score, f1_score
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE
from scipy.stats import chi2_contingency
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense
from tensorflow.keras.optimizers import Adam
import keras_tuner as kt
from keras_tuner import Hyperband
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.callbacks import EarlyStopping


**1.2. Exploratory Data Analysis (EDA)**

In [2]:
#Import the dataset and convert it into a pandas dataframe
rawData = pd.read_csv('Base.csv')
print("Shape of the raw data", rawData.shape) 

#Overview of data instances
rawData.head()

Shape of the raw data (1000000, 32)


| | fraud_bool | income | name_email_similarity | prev_address_months_count | current_address_months_count | customer_age | days_since_request | intended_balcon_amount | payment_type | zip_count_4w | ... | has_other_cards | proposed_credit_limit | foreign_request | source | session_length_in_minutes | device_os | keep_alive_session | device_distinct_emails_8w | device_fraud_count | month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.3 | 0.986506 | -1 | 25 | 40 | 0.006735 | 102.453711 | AA | 1059 | ... | 0 | 1500.0 | 0 | INTERNET | 16.224843 | linux | 1 | 1 | 0 | 0 |
| 1 | 0 | 0.8 | 0.617426 | -1 | 89 | 20 | 0.010095 | -0.849551 | AD | 1658 | ... | 0 | 1500.0 | 0 | INTERNET | 3.363854 | other | 1 | 1 | 0 | 0 |
| 2 | 0 | 0.8 | 0.996707 | 9 | 14 | 40 | 0.012316 | -1.490386 | AB | 1095 | ... | 0 | 200.0 | 0 | INTERNET | 22.730559 | windows | 0 | 1 | 0 | 0 |
| 3 | 0 | 0.6 | 0.4751 | 11 | 14 | 30 | 0.006991 | -1.863101 | AB | 3483 | ... | 0 | 200.0 | 0 | INTERNET | 15.215816 | linux | 1 | 1 | 0 | 0 |
| 4 | 0 | 0.9 | 0.842307 | -1 | 29 | 40 | 5.742626 | 47.152498 | AA | 2339 | ... | 0 | 200.0 | 0 | INTERNET | 3.743048 | other | 0 | 1 | 0 | 0 |


In [3]:
#Check data types
rawData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 32 columns):
 #   Column                            Non-Null Count    Dtype  
---  ------                            --------------    -----  
 0   fraud_bool                        1000000 non-null  int64  
 1   income                            1000000 non-null  float64
 2   name_email_similarity             1000000 non-null  float64
 3   prev_address_months_count         1000000 non-null  int64  
 4   current_address_months_count      1000000 non-null  int64  
 5   customer_age                      1000000 non-null  int64  
 6   days_since_request                1000000 non-null  float64
 7   intended_balcon_amount            1000000 non-null  float64
 8   payment_type                      1000000 non-null  object 
 9   zip_count_4w                      1000000 non-null  int64  
 10  velocity_6h                       1000000 non-null  float64
 11  velocity_24h                      1000

In [4]:
print(rawData.dtypes)

fraud_bool                            int64
income                              float64
name_email_similarity               float64
prev_address_months_count             int64
current_address_months_count          int64
customer_age                          int64
days_since_request                  float64
intended_balcon_amount              float64
payment_type                         object
zip_count_4w                          int64
velocity_6h                         float64
velocity_24h                        float64
velocity_4w                         float64
bank_branch_count_8w                  int64
date_of_birth_distinct_emails_4w      int64
employment_status                    object
credit_risk_score                     int64
email_is_free                         int64
housing_status                       object
phone_home_valid                      int64
phone_mobile_valid                    int64
bank_months_count                     int64
has_other_cards                 

In [5]:
#Summary Statistics
rawData.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| fraud_bool | 1000000.0 | 0.011029 | 0.104438 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| income | 1000000.0 | 0.562696 | 0.290343 | 0.1 | 0.3 | 0.6 | 0.8 | 0.9 |
| name_email_similarity | 1000000.0 | 0.493694 | 0.289125 | 1.43455e-06 | 0.225216 | 0.492153 | 0.755567 | 0.999999 |
| prev_address_months_count | 1000000.0 | 16.718568 | 44.04623 | -1.0 | -1.0 | -1.0 | 12.0 | 383.0 |
| current_address_months_count | 1000000.0 | 86.587867 | 88.406599 | -1.0 | 19.0 | 52.0 | 130.0 | 428.0 |
| customer_age | 1000000.0 | 33.68908 | 12.025799 | 10.0 | 20.0 | 30.0 | 40.0 | 90.0 |
| days_since_request | 1000000.0 | 1.025705 | 5.381835 | 4.03686e-09 | 0.007193 | 0.015176 | 0.026331 | 78.456904 |
| intended_balcon_amount | 1000000.0 | 8.661499 | 20.236155 | -15.53055 | -1.181488 | -0.830507 | 4.984176 | 112.956928 |
| zip_count_4w | 1000000.0 | 1572.692049 | 1005.374565 | 1.0 | 894.0 | 1263.0 | 1944.0 | 6700.0 |
| velocity_6h | 1000000.0 | 5665.296605 | 3009.380665 | -170.6031 | 3436.365848 | 5319.769349 | 7680.717827 | 16715.565404 |


In [6]:
# Share of each class, ordered explicitly to match `labels` (0 = non-fraud, 1 = fraud)
colors = ['#004B87', 'LightBlue']
labels = ['Non-Fraud', 'Fraud']
values = rawData['fraud_bool'].value_counts(normalize=True).reindex([0, 1])
total_normal = (rawData['fraud_bool'] == 0).sum()
total_fraudulent = (rawData['fraud_bool'] == 1).sum()

fig = go.Figure(data=[go.Pie(labels=labels, values=values, hole=.3)])
fig.update_traces(hoverinfo='label+percent', textinfo='percent', textfont_size=20,
                  marker=dict(colors=colors, line=dict(color='black', width=0.1)))

fig.update_layout(
    title_text='<b>Bank Account Fraud</b>',
    title_font_color='black',
    title_font=dict(size=24),
    legend_title_font_color='black',
    paper_bgcolor='white',  # Background color
    plot_bgcolor='white',   # Plot background color
    font_color='black',
)

fig.show()


# ***2. Preprocessing and Data Cleaning***

Observation: \
The dataset is heavily imbalanced: only about 1.1% of the instances are fraudulent (the mean of `fraud_bool` is 0.011029).

Approach: \
We will oversample the minority class using SMOTE.
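
To make the approach concrete, here is a minimal sketch of the interpolation idea behind SMOTE: each synthetic point is placed on the segment between a minority sample and one of its nearest minority neighbours. The helper `smote_like_samples` and the toy data are hypothetical and only illustrate the principle; the notebook itself relies on `imblearn.over_sampling.SMOTE`, whose implementation differs in detail.

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority
    samples and their k nearest minority neighbours (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)               # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)            # random base samples
    picked = neighbours[base, rng.integers(0, k, size=n_new)] # one neighbour per base
    gap = rng.random((n_new, 1))                              # position on the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Toy minority class with two features (hypothetical values)
X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_samples(X_minority, n_new=6)
print(synthetic.shape)
```

Because every synthetic point is a convex combination of two real minority samples, it stays within the range spanned by the minority class.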

**2.1. One Hot Encoding**

In [7]:
# rawData = pd.get_dummies(rawData)
X = rawData.drop('fraud_bool', axis=1)
y = rawData['fraud_bool'] 
# scaler = StandardScaler()
# X = pd.DataFrame(scaler.fit_transform(X), columns=X.columns) 

In [8]:
#Get categorical features for One-hot Encoding
categorical_features = [x for x in X.columns if X[x].dtypes == "O"]
print(categorical_features)

['payment_type', 'employment_status', 'housing_status', 'source', 'device_os']


In [9]:
X = pd.DataFrame(pd.get_dummies(X, prefix=categorical_features))
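
As a small illustration of what the encoding step produces, the sketch below applies `pd.get_dummies` to a toy frame. The rows are hypothetical; only the column names `income`, `payment_type`, and `device_os` are borrowed from the dataset, and `dtype=int` is an assumption made here so that the dummies print as 0/1.

```python
import pandas as pd

# Toy frame with one numeric and two categorical columns (hypothetical rows)
toy = pd.DataFrame({
    "income": [0.3, 0.8, 0.8],
    "payment_type": ["AA", "AD", "AB"],
    "device_os": ["linux", "other", "windows"],
})

# Each category becomes its own 0/1 column named <column>_<category>
encoded = pd.get_dummies(toy, columns=["payment_type", "device_os"], dtype=int)
print(list(encoded.columns))

# Every row has exactly one active dummy per original categorical feature
print(encoded.filter(like="payment_type_").sum(axis=1).tolist())
```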

In [10]:
#Get numeric features
numeric_features = X.select_dtypes(include=['number'])
non_numeric_features = X.select_dtypes(exclude=['number'])

#Scaling
scaler = StandardScaler()
scaled_numeric_features = scaler.fit_transform(numeric_features)
scaled_numeric_df = pd.DataFrame(scaled_numeric_features, columns=numeric_features.columns, index=X.index)
X_scaled = pd.concat([scaled_numeric_df, non_numeric_features], axis=1)

# Verify the DataFrame
print("Columns in X_scaled:", X_scaled.columns)

Columns in X_scaled: Index(['income', 'name_email_similarity', 'prev_address_months_count',
       'current_address_months_count', 'customer_age', 'days_since_request',
       'intended_balcon_amount', 'zip_count_4w', 'velocity_6h', 'velocity_24h',
       'velocity_4w', 'bank_branch_count_8w',
       'date_of_birth_distinct_emails_4w', 'credit_risk_score',
       'email_is_free', 'phone_home_valid', 'phone_mobile_valid',
       'bank_months_count', 'has_other_cards', 'proposed_credit_limit',
       'foreign_request', 'session_length_in_minutes', 'keep_alive_session',
       'device_distinct_emails_8w', 'device_fraud_count', 'month',
       'payment_type_AA', 'payment_type_AB', 'payment_type_AC',
       'payment_type_AD', 'payment_type_AE', 'employment_status_CA',
       'employment_status_CB', 'employment_status_CC', 'employment_status_CD',
       'employment_status_CE', 'employment_status_CF', 'employment_status_CG',
       'housing_status_BA', 'housing_status_BB', 'housing_status_BC'
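
The cell above fits the scaler on the full dataset, and the split that follows is applied to `X` rather than `X_scaled`. A common alternative, sketched here on synthetic data under the assumption that statistics from the test rows should not influence preprocessing, is to fit `StandardScaler` on the training rows only and reuse its statistics for the test rows. The array names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(1000, 3))   # synthetic features
y_demo = (rng.random(1000) < 0.1).astype(int)               # roughly 10% positives

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=42, stratify=y_demo
)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)   # statistics come from the training rows only
X_te_scaled = scaler.transform(X_te)       # test rows reuse those statistics

print(X_tr_scaled.mean(axis=0).round(6), X_tr_scaled.std(axis=0).round(6))
```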

**2.3. Separation of Training and Test Sets**

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.4,random_state=42,stratify=y)

print('Training data shape:',X_train.shape)
print('Training ground truth values shape:',y_train.shape) 

print('Test data shape:',X_test.shape)
print('Test ground truth values shape:',y_test.shape)

Training data shape: (600000, 52)
Training ground truth values shape: (600000,)
Test data shape: (400000, 52)
Test ground truth values shape: (400000,)
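
The `stratify=y` argument keeps the class ratio the same in both splits, which matters when one class is rare. The sketch below shows this on toy labels; the data and the 1% positive rate are hypothetical and chosen only to mimic a strong imbalance.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 1,000 toy labels with 1% positives (hypothetical data)
y_toy = np.array([1] * 10 + [0] * 990)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

_, _, y_a, y_b = train_test_split(
    X_toy, y_toy, test_size=0.4, random_state=42, stratify=y_toy
)

print("positive rate, train:", y_a.mean())
print("positive rate, test: ", y_b.mean())
```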


**2.2. Oversampling using SMOTE**

In [12]:
smote = SMOTE(sampling_strategy='auto', random_state=42)
X_train, y_train = smote.fit_resample(X_train, y_train)
print("Instances after oversampling in X", X_train.shape) 
print("Instances after oversampling in y", y_train.shape) 

Instances after oversampling in X (1186766, 52)
Instances after oversampling in y (1186766,)
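
The row count after oversampling can be checked by hand. With `sampling_strategy='auto'`, SMOTE resamples every class except the majority, so the fraud class is grown until it matches the non-fraud class. The following sketch derives the expected counts from the figures reported above (1,000,000 rows, a fraud rate of 0.011029, a 60% training share), rounding to the nearest row; the variable names are illustrative.

```python
n_rows = 1_000_000
fraud_rate = 0.011029            # mean of fraud_bool from describe()
train_share = 0.6                # test_size=0.4

n_fraud = round(n_rows * fraud_rate)                # fraud rows in the full data
n_nonfraud = n_rows - n_fraud
train_nonfraud = round(n_nonfraud * train_share)    # majority rows in the training split

# After SMOTE both classes have as many rows as the majority class
rows_after_smote = 2 * train_nonfraud
print(n_fraud, train_nonfraud, rows_after_smote)
```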


In [13]:
X.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| income | 1000000.0 | 0.562696 | 0.290343 | 0.1 | 0.3 | 0.6 | 0.8 | 0.9 |
| name_email_similarity | 1000000.0 | 0.493694 | 0.289125 | 1.43455e-06 | 0.225216 | 0.492153 | 0.755567 | 0.999999 |
| prev_address_months_count | 1000000.0 | 16.718568 | 44.04623 | -1.0 | -1.0 | -1.0 | 12.0 | 383.0 |
| current_address_months_count | 1000000.0 | 86.587867 | 88.406599 | -1.0 | 19.0 | 52.0 | 130.0 | 428.0 |
| customer_age | 1000000.0 | 33.68908 | 12.025799 | 10.0 | 20.0 | 30.0 | 40.0 | 90.0 |
| days_since_request | 1000000.0 | 1.025705 | 5.381835 | 4.03686e-09 | 0.007193 | 0.015176 | 0.026331 | 78.456904 |
| intended_balcon_amount | 1000000.0 | 8.661499 | 20.236155 | -15.53055 | -1.181488 | -0.830507 | 4.984176 | 112.956928 |
| zip_count_4w | 1000000.0 | 1572.692049 | 1005.374565 | 1.0 | 894.0 | 1263.0 | 1944.0 | 6700.0 |
| velocity_6h | 1000000.0 | 5665.296605 | 3009.380665 | -170.6031 | 3436.365848 | 5319.769349 | 7680.717827 | 16715.565404 |
| velocity_24h | 1000000.0 | 4769.781965 | 1479.212612 | 1300.307 | 3593.179135 | 4749.921161 | 5752.574191 | 9506.896596 |


In [14]:
# Class shares after oversampling, ordered explicitly to match `labels`
colors = ['#004B87', 'LightBlue']
labels = ['Non-Fraud', 'Fraud']
values = y_train.value_counts(normalize=True).reindex([0, 1])

fig = go.Figure(data=[go.Pie(labels=labels, values=values, hole=.3)])
fig.update_traces(hoverinfo='label+percent', textinfo='percent', textfont_size=20,
                  marker=dict(colors=colors, line=dict(color='black', width=0.1)))

fig.update_layout(
    title_text='<b>Bank Account Fraud (after SMOTE)</b>',
    title_font_color='black',
    title_font=dict(size=24),
    legend_title_font_color='black',
    paper_bgcolor='white',  # Background color
    plot_bgcolor='white',   # Plot background color
    font_color='black',
)

fig.show()


In [15]:
def chi_squared_test(df, feature, target):
    """Chi-squared test of independence between a categorical feature and the target."""
    contingency_table = pd.crosstab(df[feature], df[target])
    chi2_stat, p_value, _, _ = chi2_contingency(contingency_table)
    return chi2_stat, p_value

# Significance level for rejecting independence
alpha = 0.05

# Only object-typed columns are tested
categorical_features = [col for col in X_train.columns if X_train[col].dtype == 'object']
selected_features = []

# Build the combined frame once instead of once per feature
train_df = pd.concat([X_train, y_train], axis=1)
for col in categorical_features:
    chi2_stat, p_value = chi_squared_test(train_df, col, 'fraud_bool')
    if p_value < alpha:
        selected_features.append(col)
    print(f"{col}: Chi2 Stat = {chi2_stat}, P-Value = {p_value}")
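
Because the one-hot encoding step has already replaced the object-typed columns, the loop above may find no columns to test, which would explain why the cell prints nothing. As a sketch of what the test computes, the example below runs `chi2_contingency` on a toy table in which a hypothetical binary feature `flag` is clearly associated with the target.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical data: fraud is far more common when flag == 1
toy = pd.DataFrame({
    "flag":       [1] * 40 + [1] * 10 + [0] * 10 + [0] * 40,
    "fraud_bool": [1] * 40 + [0] * 10 + [1] * 10 + [0] * 40,
})

table = pd.crosstab(toy["flag"], toy["fraud_bool"])   # 2x2 contingency table
chi2_stat, p_value, dof, expected = chi2_contingency(table)

print(table)
print(f"chi2 = {chi2_stat:.2f}, p = {p_value:.2e}, dof = {dof}")
```

A p-value below the chosen `alpha` suggests that the feature and the target are not independent, which is the criterion the notebook uses to keep a feature.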

In [16]:
feature_dim = X_train.shape[1]
feature_dim

52

In [17]:
def build_model(hp):
    """Build a small 1D CNN whose hyperparameters are sampled by the tuner."""
    model = Sequential()
    # Convolution over the feature axis; each row is treated as a 1-channel sequence
    model.add(Conv1D(filters=hp.Int('filters', min_value=32, max_value=64, step=32),
                     kernel_size=hp.Choice('kernel_size', values=[3, 5]),
                     activation='relu',
                     input_shape=(X_train.shape[1], 1)))
    model.add(MaxPooling1D(pool_size=hp.Choice('pool_size', values=[2, 3])))
    model.add(Flatten())
    model.add(Dense(units=hp.Int('dense_units', min_value=64, max_value=128, step=64), activation='relu'))
    # Single sigmoid unit outputs the probability of fraud
    model.add(Dense(1, activation='sigmoid'))

    model.compile(optimizer=hp.Choice('optimizer', values=['adam', 'rmsprop']),
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model

# Hyperband search that maximizes validation accuracy
tuner = Hyperband(
    build_model,
    objective='val_accuracy',
    max_epochs=10,
    hyperband_iterations=2,
    directory='my_dir',
    project_name='cnn_tuning'
)

# Perform the hyperparameter search (20% of the training data is held out for validation)
tuner.search(X_train, y_train, epochs=10, validation_split=0.2)


Trial 25 Complete [00h 08m 14s]
val_accuracy: 0.9913167953491211

Best val_accuracy So Far: 0.9913167953491211
Total elapsed time: 01h 05m 22s

Search: Running Trial #26

Value             |Best Value So Far |Hyperparameter
32                |64                |filters
5                 |3                 |kernel_size
3                 |2                 |pool_size
64                |128               |dense_units
rmsprop           |adam              |optimizer
10                |10                |tuner/epochs
4                 |4                 |tuner/initial_epoch
1                 |1                 |tuner/bracket
1                 |1                 |tuner/round
0018              |0022              |tuner/trial_id




Do not pass an `input_shape`/`input_dim` argument to a layer. When using Sequential models, prefer using an `Input(shape)` object as the first layer in the model instead.


Model 'sequential' had a build config, but the model cannot be built automatically in `build_from_config(config)`. You should implement `def build_from_config(self, config)`, and you might also want to implement the method  that generates the config at saving time, `def get_build_config(self)`. The method `build_from_config()` is meant to create the state of the model (i.e. its variables) upon deserialization.


Skipping variable loading for optimizer 'rmsprop', because it has 2 variables whereas the saved optimizer has 8 variables. 



Epoch 5/10
29670/29670 ━━━━━━━━━━━━━━━━━━━━ 38s 1ms/step - accuracy: 0.9815 - loss: 0.0663 - val_accuracy: 0.9849 - val_loss: 0.0363
Epoch 6/10
 4338/29670 ━━━━━━━━━━━━━━━━━━━━ 26s 1ms/step - accuracy: 0.9817 - loss: 0.0649

KeyboardInterrupt: 
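
To see what the search space means in terms of tensor shapes, the sketch below traces the layer output sizes for the configuration reported as best so far (64 filters, kernel size 3, pool size 2, 128 dense units) on the 52 input features. It assumes the Keras defaults of `padding='valid'` and stride 1 for `Conv1D`, and a pooling stride equal to the pool size for `MaxPooling1D`. The helper `cnn_shapes` is hypothetical and does not call Keras.

```python
def cnn_shapes(n_features, filters, kernel_size, pool_size, dense_units):
    """Trace output sizes and parameter counts of
    Conv1D -> MaxPooling1D -> Flatten -> Dense -> Dense(1)."""
    conv_len = n_features - kernel_size + 1              # 'valid' convolution, stride 1
    pool_len = (conv_len - pool_size) // pool_size + 1   # non-overlapping pooling windows
    flat = pool_len * filters                            # size after Flatten
    params = {
        "conv1d": kernel_size * 1 * filters + filters,   # one input channel, plus biases
        "dense": flat * dense_units + dense_units,
        "output": dense_units + 1,
    }
    return conv_len, pool_len, flat, params

conv_len, pool_len, flat, params = cnn_shapes(
    52, filters=64, kernel_size=3, pool_size=2, dense_units=128
)
print(conv_len, pool_len, flat, params, sum(params.values()))
```

Each row of the tabular input is treated as a sequence of 52 steps with one channel, so the kernel slides over neighbouring columns and the column order affects what the filters see.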

In [20]:
# Retrieve the best model found during the search
best_model = tuner.get_best_models(num_models=1)[0]

# Re-compile the best model with RMSprop and continue training it.
# Note: the test set is also used as validation data for early stopping here.
best_model.compile(loss='binary_crossentropy', optimizer=RMSprop(), metrics=['accuracy'])
history = best_model.fit(X_train, y_train, 
                         epochs=50, 
                         batch_size=512, 
                         validation_data=(X_test, y_test), 
                         callbacks=[EarlyStopping(monitor='val_loss', patience=3)])

# Predict fraud probabilities on the test data
y_pred = best_model.predict(X_test)

# Evaluate loss and accuracy on the test data
loss, accuracy = best_model.evaluate(X_test, y_test)
print(f"Test Loss: {loss:.4f}")
print(f"Test Accuracy: {accuracy:.4f}")



Skipping variable loading for optimizer 'adam', because it has 2 variables whereas the saved optimizer has 14 variables. 



Epoch 1/50
2318/2318 ━━━━━━━━━━━━━━━━━━━━ 24s 10ms/step - accuracy: 0.9844 - loss: 0.0460 - val_accuracy: 0.9880 - val_loss: 0.0525
Epoch 2/50
2318/2318 ━━━━━━━━━━━━━━━━━━━━ 27s 12ms/step - accuracy: 0.9845 - loss: 0.0458 - val_accuracy: 0.9816 - val_loss: 0.0630
Epoch 3/50
2318/2318 ━━━━━━━━━━━━━━━━━━━━ 30s 13ms/step - accuracy: 0.9848 - loss: 0.0453 - val_accuracy: 0.9835 - val_loss: 0.0589
Epoch 4/50
2318/2318 ━━━━━━━━━━━━━━━━━━━━ 26s 11ms/step - accuracy: 0.9848 - loss: 0.0454 - val_accuracy: 0.9819 - val_loss: 0.0631
12500/12500 ━━━━━━━━━━━━━━━━━━━━ 7s 540us/step
12500/12500 ━━━━━━━━━━━━━━━━━━━━ 7s 584us/step - accuracy: 0.9819 - loss: 0.0630
Test Loss: 0.0631
Test Accuracy: 0.9819


In [25]:
y_pred_train = best_model.predict(X_train)

37087/37087 ━━━━━━━━━━━━━━━━━━━━ 22s 594us/step


**3.3. Model Prediction**

In [26]:
def evaluate(y_true, y_pred):
    """Print accuracy, precision, recall, F1 score and the confusion matrix for binary labels."""
    accuracy = accuracy_score(y_true=y_true, y_pred=y_pred)
    precision = precision_score(y_true=y_true, y_pred=y_pred)
    recall = recall_score(y_true=y_true, y_pred=y_pred)
    f1 = f1_score(y_true=y_true, y_pred=y_pred)
    cm = confusion_matrix(y_true=y_true, y_pred=y_pred)
    print("Accuracy: ", accuracy)
    print("Precision:", precision)
    print("Recall:", recall)
    print("F1 Score:", f1)
    print("Confusion Matrix:\n", cm)

In [27]:
evaluate(y_train, np.round(y_pred_train))

Accuracy:  0.9851959021407759
Precision: 0.990761164161499
Recall: 0.979525871148988
F1 Score: 0.9851114837924574
Confusion Matrix:
 [[587963   5420]
 [ 12149 581234]]


In [23]:
evaluate(y_test, np.round(y_pred))

Accuracy:  0.9818625
Precision: 0.16607939863753818
Recall: 0.1602447869446963
F1 Score: 0.1631099319414004
Confusion Matrix:
 [[392038   3550]
 [  3705    707]]
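
The gap between the high test accuracy and the low precision and recall follows directly from the class imbalance. As a worked check, the sketch below recomputes the test metrics from the confusion matrix above and compares the accuracy with that of a hypothetical baseline that always predicts non-fraud.

```python
import numpy as np

# Test-set confusion matrix reported above: rows = true class, columns = predicted class
cm = np.array([[392038, 3550],
               [3705, 707]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Baseline that labels every instance as non-fraud (illustrative comparison)
baseline_accuracy = (tn + fp) / cm.sum()

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
print(f"always-non-fraud baseline accuracy={baseline_accuracy:.4f}")
```

On this split the baseline accuracy is higher than the model's, which is why precision, recall, and F1 are the more informative metrics here.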
