## Imbalanced classification: credit card fraud detection

This example uses the credit card fraud detection dataset
to demonstrate how to train a classification model when the classes are highly imbalanced.

url: 
https://keras.io/examples/structured_data/imbalanced_classification/

dataset:
https://www.kaggle.com/mlg-ulb/creditcardfraud/

In [1]:
import csv
import numpy as np

fname = "../dataset/creditcard.csv"

all_features = []
all_targets = []

with open(fname) as f:
    for i, line in enumerate(f):
        if i == 0:
            print("header:", line.strip())
            continue # skip the header
        fields = line.strip().split(",")
        all_features.append([float(v.replace('"', "")) for v in fields[:-1]])
        all_targets.append([int(fields[-1].replace('"', ""))])
        if i == 1:
            print("example features:", all_features[-1])

features = np.array(all_features, dtype="float32")
targets = np.array(all_targets, dtype="uint8")
print("features.shape:", features.shape)
print("targets.shape:", targets.shape)

header: "Time","V1","V2","V3","V4","V5","V6","V7","V8","V9","V10","V11","V12","V13","V14","V15","V16","V17","V18","V19","V20","V21","V22","V23","V24","V25","V26","V27","V28","Amount","Class"
example features: [0.0, -1.3598071336738, -0.0727811733098497, 2.53634673796914, 1.37815522427443, -0.338320769942518, 0.462387777762292, 0.239598554061257, 0.0986979012610507, 0.363786969611213, 0.0907941719789316, -0.551599533260813, -0.617800855762348, -0.991389847235408, -0.311169353699879, 1.46817697209427, -0.470400525259478, 0.207971241929242, 0.0257905801985591, 0.403992960255733, 0.251412098239705, -0.018306777944153, 0.277837575558899, -0.110473910188767, 0.0669280749146731, 0.128539358273528, -0.189114843888824, 0.133558376740387, -0.0210530534538215, 149.62]
features.shape: (284807, 30)
targets.shape: (284807, 1)

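The cell above imports `csv` but splits each line on commas and strips the quotes by hand. As a minimal sketch, the same parsing can be done with `csv.reader`, which handles the quoting itself. The helper name `load_features_and_targets` and the two-row in-memory sample are assumptions made for illustration; they are not part of the original notebook.

```python
import csv
import io

import numpy as np


def load_features_and_targets(lines):
    """Parse CSV rows into (header, features, targets); the last column is the label.

    `lines` is any iterable of text lines, for example an open file object.
    """
    reader = csv.reader(lines)
    header = next(reader)  # the first row holds the column names
    features, targets = [], []
    for row in reader:
        features.append([float(v) for v in row[:-1]])
        targets.append([int(row[-1])])
    return (
        header,
        np.array(features, dtype="float32"),
        np.array(targets, dtype="uint8"),
    )


# Hypothetical sample in the same layout as creditcard.csv (quoted header and label).
sample = io.StringIO('"Time","V1","Amount","Class"\n0,-1.5,149.62,"0"\n1,1.25,2.69,"1"\n')
header, sample_features, sample_targets = load_features_and_targets(sample)
print(header, sample_features.shape, sample_targets.shape)
```

With the real file, the same helper could be called as `load_features_and_targets(open(fname, newline=""))`.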

### Prepare a validation set

In [2]:
num_val_samples = int(len(features) * 0.2)
train_features = features[:-num_val_samples]
train_targets = targets[:-num_val_samples]
val_features = features[-num_val_samples:]
val_targets = targets[-num_val_samples:]

print("number of training samples:", len(train_features))
print("number of validation samples:", len(val_features))

number of training samples: 227846
number of validation samples: 56961

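Because the split above simply holds out the last 20% of the rows, the fraud rate in the two parts is not guaranteed to match. The sketch below shows one way to compare them; the helper name `positive_rate` and the ten toy labels are assumptions for illustration only.

```python
import numpy as np


def positive_rate(targets):
    """Return the fraction of samples labeled 1 in an (n, 1) target array."""
    return float(np.mean(targets[:, 0] == 1))


# Toy labels standing in for the real targets (assumed data, 10 samples).
toy_targets = np.array(
    [[0], [0], [1], [0], [0], [0], [0], [0], [1], [0]], dtype="uint8"
)
num_val = int(len(toy_targets) * 0.2)  # same 20% tail split as above
toy_train, toy_val = toy_targets[:-num_val], toy_targets[-num_val:]

print("train positive rate:", positive_rate(toy_train))
print("val positive rate:", positive_rate(toy_val))
```

In this notebook the training set holds 417 positives (0.18%), while the validation counts logged during training (`val_tp + val_fn = 75` out of 56961 samples) correspond to roughly 0.13%.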

### Analyze class imbalance in the targets

In [3]:
counts = np.bincount(train_targets[:, 0])

print(
    "number of positive samples in training data: {} ({:.2f}% of total)".format(
        counts[1], 100 * float(counts[1]) / len(train_targets)
    )
)

# Weight each class by the inverse of its sample count
weight_for_0 = 1.0 / counts[0]
weight_for_1 = 1.0 / counts[1]

number of positive samples in training data: 417 (0.18% of total)

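To see what the inverse-count weights do, the arithmetic can be worked through with the counts printed above (417 positives among 227846 training samples). This is only an illustrative calculation with hard-coded numbers, not a cell from the original notebook.

```python
# Counts taken from the output above.
num_train = 227846
num_pos = 417
num_neg = num_train - num_pos  # 227429 legitimate transactions

weight_for_0 = 1.0 / num_neg
weight_for_1 = 1.0 / num_pos

# Summed over all its samples, each class carries the same total weight.
total_weight_0 = weight_for_0 * num_neg
total_weight_1 = weight_for_1 * num_pos

# How many legitimate samples one fraud sample is worth in the loss.
relative_weight = weight_for_1 / weight_for_0

print(total_weight_0, total_weight_1, round(relative_weight, 1))
```

Each fraud sample therefore counts about 545 times as much as a legitimate one. Because both weights are tiny in absolute terms, the weighted training loss is also very small and is not directly comparable to the unweighted validation loss.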

### Normalize the data using training set statistics

In [4]:
mean = np.mean(train_features, axis=0)
train_features -= mean
val_features -= mean
std = np.std(train_features, axis=0)
train_features /= std
val_features /= std

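The same normalization can be written as a small function that leaves its inputs untouched. This is a sketch under assumptions: the name `standardize` and the `eps` guard against constant columns are not in the original, which divides by the standard deviation directly.

```python
import numpy as np


def standardize(train, val, eps=1e-8):
    """Scale both arrays with the mean and std of `train` only.

    `eps` is an assumed safeguard: columns whose std is below it are
    left unscaled instead of being divided by (nearly) zero.
    """
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    scale = np.where(std < eps, 1.0, std)
    return (train - mean) / scale, (val - mean) / scale


# Toy data: the second column is constant in the training part.
toy_train = np.array([[1.0, 5.0], [3.0, 5.0]], dtype="float32")
toy_val = np.array([[2.0, 6.0]], dtype="float32")
norm_train, norm_val = standardize(toy_train, toy_val)
print(norm_train)
print(norm_val)
```

Using only training statistics keeps information about the validation set out of the preprocessing step.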
### Build a binary classification model

In [5]:
from tensorflow import keras

model = keras.Sequential(
    [
        keras.layers.Dense(
            256, activation="relu", input_shape=(train_features.shape[-1],)
        ),
        keras.layers.Dense(256, activation="relu"),
        keras.layers.Dropout(0.3),
        keras.layers.Dense(256, activation="relu"),
        keras.layers.Dropout(0.3),
        keras.layers.Dense(1, activation="sigmoid"),
    ]
)
model.summary()

Metal device set to: Apple M1

systemMemory: 8.00 GB
maxCacheSize: 2.67 GB

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 256)               7936      
                                                                 
 dense_1 (Dense)             (None, 256)               65792     
                                                                 
 dropout (Dropout)           (None, 256)               0         
                                                                 
 dense_2 (Dense)             (None, 256)               65792     
                                                                 
 dropout_1 (Dropout)         (None, 256)               0         
                                                                 
 dense_3 (Dense)             (None, 1)                 257       
                                                                 
Total params: 139,777
Trainable params: 139,777
Non-trainable params: 0

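The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit, and dropout layers have no parameters. The helper `dense_params` below is a hypothetical name used only for this check.

```python
def dense_params(inputs, units):
    """Weights (inputs * units) plus biases (units) of a fully connected layer."""
    return inputs * units + units


# (inputs, units) for the four Dense layers; the model takes 30 features.
layer_sizes = [(30, 256), (256, 256), (256, 256), (256, 1)]
per_layer = [dense_params(i, u) for i, u in layer_sizes]

print(per_layer)       # [7936, 65792, 65792, 257]
print(sum(per_layer))  # 139777
```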
### Train the model with the class_weight argument

In [7]:
metrics = [
    keras.metrics.FalseNegatives(name="fn"),
    keras.metrics.FalsePositives(name="fp"),
    keras.metrics.TrueNegatives(name="tn"),
    keras.metrics.TruePositives(name="tp"),
    keras.metrics.Precision(name="precision"),
    keras.metrics.Recall(name="recall"),
]

model.compile(
    optimizer=keras.optimizers.Adam(1e-2), loss="binary_crossentropy", metrics=metrics
)

callbacks = [keras.callbacks.ModelCheckpoint("fraud_model_at_epoch_{epoch}.h5")]
class_weight = {0: weight_for_0, 1: weight_for_1}

model.fit(
    train_features,
    train_targets,
    batch_size=2028,
    epochs=30,
    verbose=2,
    callbacks=callbacks,
    validation_data=(val_features, val_targets),
    class_weight=class_weight,
)

Epoch 1/30
113/113 - 5s - loss: 2.4018e-06 - fn: 50.0000 - fp: 29164.0000 - tn: 198265.0000 - tp: 367.0000 - precision: 0.0124 - recall: 0.8801 - val_loss: 0.0922 - val_fn: 10.0000 - val_fp: 865.0000 - val_tn: 56021.0000 - val_tp: 65.0000 - val_precision: 0.0699 - val_recall: 0.8667 - 5s/epoch - 44ms/step
Epoch 2/30
113/113 - 2s - loss: 1.4385e-06 - fn: 29.0000 - fp: 7709.0000 - tn: 219720.0000 - tp: 388.0000 - precision: 0.0479 - recall: 0.9305 - val_loss: 0.1094 - val_fn: 9.0000 - val_fp: 988.0000 - val_tn: 55898.0000 - val_tp: 66.0000 - val_precision: 0.0626 - val_recall: 0.8800 - 2s/epoch - 14ms/step
Epoch 3/30
113/113 - 2s - loss: 1.5720e-06 - fn: 32.0000 - fp: 8826.0000 - tn: 218603.0000 - tp: 385.0000 - precision: 0.0418 - recall: 0.9233 - val_loss: 0.1296 - val_fn: 7.0000 - val_fp: 1560.0000 - val_tn: 55326.0000 - val_tp: 68.0000 - val_precision: 0.0418 - val_recall: 0.9067 - 2s/epoch - 14ms/step
Epoch 4/30
113/113 - 2s - loss: 1.3123e-06 - fn: 26.0000 - fp: 7685.0000 - tn: 219744.0000 - 

<keras.callbacks.History at 0x291960f10>
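
The precision and recall values in the log follow directly from the confusion-matrix counts printed next to them. As a quick sanity check, the sketch below recomputes the epoch 1 validation figures; the helper name `precision_recall` is an assumption for illustration.

```python
def precision_recall(tp, fp, fn):
    """Compute precision and recall from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Validation counts from epoch 1 of the log above.
val_precision, val_recall = precision_recall(tp=65, fp=865, fn=10)
print(round(val_precision, 4), round(val_recall, 4))  # 0.0699 0.8667
```

The model catches most fraudulent transactions (high recall) at the cost of many false alarms (low precision), which is the trade-off that the class weighting pushes toward.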