# Imbalanced classification: credit card fraud detection

## Introduction

This example looks at the
[Kaggle Credit Card Fraud Detection](https://www.kaggle.com/mlg-ulb/creditcardfraud/)
dataset to demonstrate how
to train a classification model on data with highly imbalanced classes.

## First, vectorize the CSV data

In [1]:
# Download the data
!wget https://www.dropbox.com/s/9vfy1vi6wsfkxyk/creditcard.csv.zip

--2024-12-06 07:10:29--  https://www.dropbox.com/s/9vfy1vi6wsfkxyk/creditcard.csv.zip
Resolving www.dropbox.com (www.dropbox.com)... 162.125.65.18, 2620:100:6021:18::a27d:4112
Connecting to www.dropbox.com (www.dropbox.com)|162.125.65.18|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://www.dropbox.com/scl/fi/sx8mbigscvgydbk4wrsux/creditcard.csv.zip?rlkey=4fn9y7cpau3qabvkt8z2tuje6 [following]
--2024-12-06 07:10:30--  https://www.dropbox.com/scl/fi/sx8mbigscvgydbk4wrsux/creditcard.csv.zip?rlkey=4fn9y7cpau3qabvkt8z2tuje6
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc217f829df246ad93a50d5162d0.dl.dropboxusercontent.com/cd/0/inline/CfuusYsWFN33gDPJQD08mvuDLHwf48xdMeCq9eeJr-TE_ulD-F5V90zqgICZEz6TET-UZTPrJuBzMuWp-wwLbSByacXXgBAT00vF9bTjHecsBPWq5gjeXBrzaiSIL6lZKngYkzdeHYml13o-vMdX_Acz/file# [following]
--2024-12-06 07:10:30--  https://uc217f829df246ad93a50d5162d0.dl.dropboxusercont

In [2]:
!unzip creditcard.csv.zip

Archive:  creditcard.csv.zip
  inflating: creditcard.csv          


In [3]:
import csv
import numpy as np

# Get the real data from https://www.kaggle.com/mlg-ulb/creditcardfraud/
fname = "creditcard.csv"

all_features = []
all_targets = []
with open(fname) as f:
    for i, line in enumerate(f):
        if i == 0:
            print("HEADER:", line.strip())
            continue  # Skip header
        fields = line.strip().split(",")
        all_features.append([float(v.replace('"', "")) for v in fields[:-1]])
        all_targets.append([int(fields[-1].replace('"', ""))])
        if i == 1:
            print("EXAMPLE FEATURES:", all_features[-1])

features = np.array(all_features, dtype="float32")
targets = np.array(all_targets, dtype="uint8")
print("features.shape:", features.shape)
print("targets.shape:", targets.shape)


HEADER: "Time","V1","V2","V3","V4","V5","V6","V7","V8","V9","V10","V11","V12","V13","V14","V15","V16","V17","V18","V19","V20","V21","V22","V23","V24","V25","V26","V27","V28","Amount","Class"
EXAMPLE FEATURES: [0.0, -1.3598071336738, -0.0727811733098497, 2.53634673796914, 1.37815522427443, -0.338320769942518, 0.462387777762292, 0.239598554061257, 0.0986979012610507, 0.363786969611213, 0.0907941719789316, -0.551599533260813, -0.617800855762348, -0.991389847235408, -0.311169353699879, 1.46817697209427, -0.470400525259478, 0.207971241929242, 0.0257905801985591, 0.403992960255733, 0.251412098239705, -0.018306777944153, 0.277837575558899, -0.110473910188767, 0.0669280749146731, 0.128539358273528, -0.189114843888824, 0.133558376740387, -0.0210530534538215, 149.62]
features.shape: (284807, 30)
targets.shape: (284807, 1)
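
The loop above splits each line on commas and strips the quotes by hand, while the imported `csv` module goes unused. As a minimal sketch of an alternative, `csv.reader` can handle the quoting itself. The helper name `parse_rows` and the two-row sample are hypothetical, and the sample uses fewer columns than `creditcard.csv`.

```python
import csv
import io


def parse_rows(lines):
    """Parse CSV lines into (header, features, targets); the last column is the label."""
    reader = csv.reader(lines)
    header = next(reader)  # first row holds the column names
    features, targets = [], []
    for row in reader:
        if not row:  # skip blank lines, e.g. at the end of the file
            continue
        features.append([float(v) for v in row[:-1]])
        # csv.reader already removes the surrounding quotes from "0" / "1".
        targets.append([int(row[-1])])
    return header, features, targets


# Hypothetical sample in the same layout as creditcard.csv, with fewer columns.
sample = io.StringIO(
    '"Time","V1","Amount","Class"\n'
    '0,-1.36,149.62,"0"\n'
    '1,1.19,2.69,"1"\n'
)
header, feats, targs = parse_rows(sample)
print(header)
print(feats, targs)
```

For the real file, the same function could be called as `parse_rows(open(fname, newline=""))`, and the returned lists converted with `np.array` as in the cell above.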


## Prepare a validation set

In [None]:
num_val_samples = int(len(features) * 0.2)
train_features = features[:-num_val_samples]
train_targets = targets[:-num_val_samples]
val_features = features[-num_val_samples:]
val_targets = targets[-num_val_samples:]

print("Number of training samples:", len(train_features))
print("Number of validation samples:", len(val_features))


Number of training samples: 227846
Number of validation samples: 56961


## Analyze class imbalance in the targets

In [None]:
counts = np.bincount(train_targets[:, 0])
print(
    "Number of positive samples in training data: {} ({:.2f}% of total)".format(
        counts[1], 100 * float(counts[1]) / len(train_targets)
    )
)

weight_for_0 = 1.0 / counts[0]
weight_for_1 = 1.0 / counts[1]

Number of positive samples in training data: 417 (0.18% of total)


In [None]:
counts = np.bincount(val_targets[:, 0])
print(
    "Number of positive samples in validation data: {} ({:.2f}% of total)".format(
        counts[1], 100 * float(counts[1]) / len(val_targets)
    )
)

Number of positive samples in validation data: 75 (0.13% of total)
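
The fraud rate differs between the two splits (0.18% in training, 0.13% in validation) because the validation set is simply the last 20% of the rows. One possible alternative, which this example does not use, is a stratified split that keeps the class ratio roughly equal in both parts. The sketch below runs on toy labels; `stratified_split`, `val_fraction` and `seed` are hypothetical names. Note that shuffling discards the time ordering of the transactions, which may or may not be desirable.

```python
import numpy as np


def stratified_split(targets, val_fraction=0.2, seed=0):
    """Return (train_idx, val_idx) with roughly the same class ratio in both."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(targets).reshape(-1)
    train_idx, val_idx = [], []
    for cls in np.unique(labels):
        # Shuffle the indices of this class, then cut off the validation share.
        idx = rng.permutation(np.flatnonzero(labels == cls))
        n_val = int(round(len(idx) * val_fraction))
        val_idx.append(idx[:n_val])
        train_idx.append(idx[n_val:])
    return np.sort(np.concatenate(train_idx)), np.sort(np.concatenate(val_idx))


# Toy labels: 1000 samples, 10 positives (hypothetical data, not the real dataset).
toy_targets = np.zeros((1000, 1), dtype="uint8")
toy_targets[::100] = 1
train_idx, val_idx = stratified_split(toy_targets)
print(len(train_idx), len(val_idx))
print(int(toy_targets[train_idx].sum()), int(toy_targets[val_idx].sum()))
```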


In [None]:
print(weight_for_0, weight_for_1)

4.396976638863118e-06 0.002398081534772182
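
The two weights are inverse class frequencies, and they are later passed to `model.fit` as `class_weight`, which scales each sample's loss by the weight of its class. The sketch below reproduces the weights from the counts printed above and shows the effect on a single sample's binary cross-entropy. `weighted_bce` is a hypothetical helper written for illustration, not a Keras function. Only the ratio between the weights changes the relative emphasis. Because both are far below 1, the weighted training loss is also very small in absolute terms, which is consistent with the `loss` values around 1e-06 in the training log.

```python
import math

# Counts implied by the outputs above: 227846 training samples, 417 of them positive.
n_train, n_pos = 227846, 417
n_neg = n_train - n_pos

weight_for_0 = 1.0 / n_neg
weight_for_1 = 1.0 / n_pos
# How many times more a fraud sample counts than a legitimate one.
print(weight_for_1 / weight_for_0)


def weighted_bce(y_true, p_pred, weights):
    """Binary cross-entropy of one sample, scaled by the weight of its true class."""
    loss = -(y_true * math.log(p_pred) + (1 - y_true) * math.log(1 - p_pred))
    return weights[y_true] * loss


class_weight = {0: weight_for_0, 1: weight_for_1}
# The same size of error on each class contributes very differently to the loss.
missed_fraud = weighted_bce(1, 0.1, class_weight)  # fraud predicted with p = 0.1
false_alarm = weighted_bce(0, 0.9, class_weight)   # legitimate predicted with p = 0.9
print(missed_fraud, false_alarm)
```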


## Normalize the data using training set statistics

In [None]:
print('Before normalization: ', train_features[2])
mean = np.mean(train_features, axis=0)
train_features -= mean
val_features -= mean

std = np.std(train_features, axis=0)
train_features /= std
val_features /= std
print('After normalization: ', train_features[2])

Before normalization:  [ 1.0000000e+00 -1.3583541e+00 -1.3401631e+00  1.7732093e+00
  3.7977961e-01 -5.0319815e-01  1.8004994e+00  7.9146093e-01
  2.4767579e-01 -1.5146543e+00  2.0764287e-01  6.2450147e-01
  6.6083685e-02  7.1729273e-01 -1.6594592e-01  2.3458650e+00
 -2.8900833e+00  1.1099694e+00 -1.2135931e-01 -2.2618570e+00
  5.2497971e-01  2.4799815e-01  7.7167940e-01  9.0941226e-01
 -6.8928093e-01 -3.2764184e-01 -1.3909657e-01 -5.5352796e-02
 -5.9751842e-02  3.7866000e+02]
After normalization:  [-2.000831   -0.6643839  -0.800215    1.0673089   0.23807637 -0.32006603
  1.3394924   0.6662744   0.20148349 -1.3502778   0.19138344  0.5304687
  0.10520667  0.68714315 -0.20587935  2.4593623  -3.2572083   1.252478
 -0.11686509 -2.7495122   0.6611577   0.34202614  1.1212966   1.4581116
 -1.1393514  -0.72019994 -0.29163548 -0.13880983 -0.18471171  1.1489743 ]


In [None]:
train_features.mean(axis=0)

array([ 1.19692659e-05,  4.37138226e-07, -3.57809427e-08,  3.00485590e-06,
       -8.66645451e-07, -6.25757195e-07,  4.03243689e-07, -4.61278717e-07,
        1.01057346e-07,  5.14432301e-08,  1.14444694e-08,  1.28217948e-06,
       -5.83659698e-07,  2.37660203e-07, -2.49136065e-07,  8.00532291e-07,
        1.66812217e-08,  1.94149749e-07,  1.51190221e-07,  9.69690532e-08,
        3.50928673e-07,  1.09110331e-07, -2.78042961e-07, -2.10823060e-07,
        7.19980386e-09,  2.20549555e-06,  4.22972818e-07,  3.48672025e-08,
        7.06674719e-09,  2.68111944e-05], dtype=float32)

In [None]:
train_features.std(axis=0)

array([1.0000184 , 1.0000031 , 0.99999654, 0.9999961 , 1.0000012 ,
       1.0000044 , 1.0000013 , 1.0000077 , 1.0000154 , 0.9999921 ,
       1.0000087 , 0.999995  , 1.0000061 , 1.000002  , 1.0000014 ,
       1.0000093 , 1.0000136 , 1.0000196 , 1.0000007 , 1.0000111 ,
       0.9999998 , 1.0000029 , 0.9999994 , 1.0000079 , 1.0000101 ,
       1.0000007 , 1.0000094 , 1.0000167 , 0.9999909 , 0.99979484],
      dtype=float32)
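
The normalization above can also be written as a pair of small functions, so that the statistics are always fitted on the training data and then reused for any other split. This is a sketch on toy data: `fit_standardizer` and `standardize` are hypothetical names, and the guard against zero standard deviation is an added assumption that the original cell does not include. The toy arrays are float64, so the resulting means land much closer to zero than the float32 values printed above, whose small residuals are probably rounding error.

```python
import numpy as np


def fit_standardizer(train):
    """Compute per-column mean and standard deviation from the training data only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    # Assumption: replace zero std (constant columns) by 1 to avoid division by zero.
    std = np.where(std == 0, 1.0, std)
    return mean, std


def standardize(x, mean, std):
    """Apply training statistics to any split, returning a new array."""
    return (x - mean) / std


# Hypothetical toy data standing in for train_features / val_features.
rng = np.random.default_rng(0)
toy_train = rng.normal(loc=5.0, scale=3.0, size=(1000, 4))
toy_val = rng.normal(loc=5.0, scale=3.0, size=(200, 4))

mean, std = fit_standardizer(toy_train)
train_norm = standardize(toy_train, mean, std)
val_norm = standardize(toy_val, mean, std)  # validation uses the training statistics
print(train_norm.mean(axis=0), train_norm.std(axis=0))
```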

## Build a binary classification model

In [None]:
from tensorflow import keras

hid_size = 256
model = keras.Sequential(
    [
        keras.Input(shape=(train_features.shape[-1],)),
        keras.layers.Dense(hid_size, activation="relu"),  # fully connected, output y^1
        keras.layers.Dense(hid_size * 2, activation="relu"),  # y^2
        keras.layers.Dropout(0.3),
        keras.layers.Dense(hid_size, activation="relu"),  # y^3
        keras.layers.Dropout(0.3),
        keras.layers.Dense(1, activation="sigmoid"),  # y^4: predicted fraud probability
    ]
)
model.summary()

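
The table printed by `model.summary()` is not reproduced here. As a rough check, the parameter count of each `Dense` layer can be worked out by hand as inputs times outputs plus one bias per output. The sketch below does this for the layer sizes defined above. `dense_params` is a hypothetical helper, and the totals are computed from the formula rather than copied from Keras output.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


n_features = 30  # from features.shape above
hid_size = 256
layer_sizes = [n_features, hid_size, hid_size * 2, hid_size, 1]

# Dropout layers have no trainable parameters, so only the Dense layers count.
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total_params = sum(per_layer)
print(per_layer, total_params)
```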


## Train the model with `class_weight` argument

In [None]:
metrics = [
    keras.metrics.FalseNegatives(name="fn"),
    keras.metrics.FalsePositives(name="fp"),
    keras.metrics.TrueNegatives(name="tn"),
    keras.metrics.TruePositives(name="tp"),
    keras.metrics.Precision(name="precision"),
    keras.metrics.Recall(name="recall"),
]

model.compile(
    optimizer=keras.optimizers.Adam(1e-2), loss="binary_crossentropy", metrics=metrics
)

callbacks = [keras.callbacks.ModelCheckpoint("fraud_model_at_epoch_{epoch}.keras")]
class_weight = {0: weight_for_0, 1: weight_for_1}

model.fit(
    train_features,
    train_targets,
    batch_size=2048,
    epochs=30,
    callbacks=callbacks,
    validation_data=(val_features, val_targets),
    class_weight=class_weight,
)


Epoch 1/30
112/112 ━━━━━━━━━━━━━━━━━━━━ 12s 57ms/step - fn: 28.5310 - fp: 20892.7793 - loss: 3.4427e-06 - precision: 0.0087 - recall: 0.8617 - tn: 95570.1797 - tp: 199.3097 - val_fn: 10.0000 - val_fp: 874.0000 - val_loss: 0.0639 - val_precision: 0.0692 - val_recall: 0.8667 - val_tn: 56012.0000 - val_tp: 65.0000
Epoch 2/30
112/112 ━━━━━━━━━━━━━━━━━━━━ 1s 5ms/step - fn: 14.4248 - fp: 3151.8672 - loss: 1.2898e-06 - precision: 0.0689 - recall: 0.9358 - tn: 113326.2656 - tp: 198.2389 - val_fn: 8.0000 - val_fp: 1219.0000 - val_loss: 0.1124 - val_precision: 0.0521 - val_recall: 0.8933 - val_tn: 55667.0000 - val_tp: 67.0000
Epoch 3/30
112/112 ━━━━━━━━━━━━━━━━━━━━ 1s 5ms/step - fn: 13.8142 - fp: 3407.5752 - loss: 1.0343e-06 - precision: 0.0705 - recall: 0.9483 - tn: 113059.2891 - tp: 210.1151 - val_fn: 12.0000 - val_fp: 216.0000 - val_loss: 0.0190 - val_precision: 0.2258 - val_recall: 0.8400 - 

<keras.src.callbacks.history.History at 0x7d85517f36d0>

In [None]:
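# Confusion matrix layout: columns give the predicted class,
# rows say whether that prediction was correct (T) or wrong (F).
#      P(1)   N(0)
# T    TP     TN
# F    FP     FN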
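
Precision and recall follow directly from the confusion-matrix counts: precision is TP / (TP + FP) and recall is TP / (TP + FN). The sketch below recomputes both from the validation counts logged at the end of epoch 1. `precision_recall` is a hypothetical helper, not part of Keras.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Validation counts reported at the end of epoch 1 in the log above.
precision, recall = precision_recall(tp=65, fp=874, fn=10)
print(round(precision, 4), round(recall, 4))
```

The rounded values agree with the `val_precision` and `val_recall` entries logged for that epoch.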

In [None]:
# Rough arithmetic for the training loop:
# n_samples ~ 300000
# batch_size ~ 3000
# n_steps ~ 100 (n_samples / batch_size)
# so ~100 steps make up ~1 epoch
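
The same arithmetic with the actual values from this run is sketched below. The sample count and batch size are taken from the cells above, and the variable names are chosen for illustration.

```python
import math

n_train = 227846   # training samples, from the split above
batch_size = 2048  # value passed to model.fit
epochs = 30

# The last batch of an epoch is smaller than the others, hence the ceiling.
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)
```

The result of 112 steps per epoch matches the `112/112` counter in the training log.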

## Conclusions

At the end of training, out of 56,961 validation transactions, we are:

- Correctly identifying 66 of them as fraudulent
- Missing 9 fraudulent transactions
- At the cost of incorrectly flagging 441 legitimate transactions

In the real world, one would put an even higher weight on class 1,
so as to reflect that False Negatives are more costly than False Positives.
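
One simple way to express such a preference is to scale the class 1 weight by a cost ratio before passing the dictionary to `fit`. This is a minimal sketch: `cost_sensitive_weights` and `fn_cost_ratio` are hypothetical names, and the value 5.0 is purely illustrative rather than a recommendation from this example.

```python
def cost_sensitive_weights(n_neg, n_pos, fn_cost_ratio=1.0):
    """Inverse-frequency weights, with class 1 scaled by a hypothetical cost ratio."""
    return {0: 1.0 / n_neg, 1: fn_cost_ratio / n_pos}


# Class counts from the training split; fn_cost_ratio=5.0 is an illustrative value.
class_weight = cost_sensitive_weights(n_neg=227429, n_pos=417, fn_cost_ratio=5.0)
print(class_weight)
```

With `fn_cost_ratio=1.0` the function returns the same weights as the ones used for training above.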

Next time your credit card gets declined in an online purchase -- this is why.