[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/aldomunaretto/immune_deep_learning/blob/main/notebooks/01_intro_DL/05_keras_imbalanced_classification.ipynb)

# Imbalanced classification: credit card fraud detection

Reference: https://keras.io/examples/structured_data/imbalanced_classification/

## Introduction

This example looks at the
[Kaggle Credit Card Fraud Detection](https://www.kaggle.com/mlg-ulb/creditcardfraud/)
dataset to demonstrate how
to train a classification model on data with highly imbalanced classes.

## First, vectorize the CSV data

In [None]:
import csv
import numpy as np
from kaggle.api.kaggle_api_extended import KaggleApi

# Authenticate and download the CSV file from Kaggle
api = KaggleApi()
api.authenticate()
api.dataset_download_file('mlg-ulb/creditcardfraud', 'creditcard.csv', path='/content/drive/MyDrive/data/')

# Unzip the downloaded file
import zipfile
with zipfile.ZipFile('/content/drive/MyDrive/data/creditcard.csv.zip', 'r') as zip_ref:
    zip_ref.extractall('/content/drive/MyDrive/data/')

# Read the data from the CSV file
fname = "/content/drive/MyDrive/data/creditcard.csv"

all_features = []
all_targets = []
with open(fname) as f:
    for i, line in enumerate(f):
        if i == 0:
            print("HEADER:", line.strip())
            continue  # Skip header
        fields = line.strip().split(",")
        all_features.append([float(v.replace('"', "")) for v in fields[:-1]])
        all_targets.append([int(fields[-1].replace('"', ""))])
        if i == 1:
            print("EXAMPLE FEATURES:", all_features[-1])

features = np.array(all_features, dtype="float32")
targets = np.array(all_targets, dtype="uint8")
print("features.shape:", features.shape)
print("targets.shape:", targets.shape)
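
The parsing loop above can be tried without downloading anything. The following is a minimal sketch that runs the same logic on a small in-memory sample; the sample rows and the reduced column set are made up for illustration and are not taken from the real file.

```python
import io
import numpy as np

# Hypothetical sample that mimics the quoted CSV layout (far fewer columns
# than the real file; the values are invented for illustration).
sample_csv = (
    '"Time","V1","Amount","Class"\n'
    '0,-1.36,149.62,"0"\n'
    '406,-2.31,0.0,"1"\n'
)

sample_features = []
sample_targets = []
for i, line in enumerate(io.StringIO(sample_csv)):
    if i == 0:
        continue  # Skip the header row
    fields = line.strip().split(",")
    # Every column except the last is a feature; the last one is the label.
    sample_features.append([float(v.replace('"', "")) for v in fields[:-1]])
    sample_targets.append([int(fields[-1].replace('"', ""))])

sample_features = np.array(sample_features, dtype="float32")
sample_targets = np.array(sample_targets, dtype="uint8")
print("sample_features.shape:", sample_features.shape)
print("sample_targets.shape:", sample_targets.shape)
```

Stripping the double quotes before converting matters because the label column in this sample is quoted, and `int('"1"')` would otherwise raise a `ValueError`.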

## Prepare a held-out test set

In [None]:
num_val_samples = int(len(features) * 0.2)
train_features = features[:-num_val_samples]
train_targets = targets[:-num_val_samples]
test_features = features[-num_val_samples:]
test_targets = targets[-num_val_samples:]

print("Number of training samples:", len(train_targets))
print("Number of test samples:", len(test_features))
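
Because the split simply takes the last 20% of the rows, it can be worth checking how many positive cases end up in each part. The sketch below does this on a tiny synthetic label array (`demo_targets` is a made-up stand-in, not the real data), using the same slicing rule as the cell above.

```python
import numpy as np

# Synthetic stand-in for `targets`: 10 rows, 2 of them positive.
demo_targets = np.array(
    [[0], [0], [1], [0], [0], [0], [0], [0], [1], [0]], dtype="uint8"
)

num_holdout = int(len(demo_targets) * 0.2)  # same 20% rule as above
demo_train_targets = demo_targets[:-num_holdout]
demo_test_targets = demo_targets[-num_holdout:]

for name, split in [("train", demo_train_targets), ("test", demo_test_targets)]:
    positives = int(split.sum())
    share = positives / len(split)
    print(f"{name}: {positives} positive out of {len(split)} ({share:.1%})")
```

On heavily imbalanced data, a split like this can leave very few positives in the held-out part, so a quick count helps interpret metrics such as the number of true positives.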

## [Recommended] Normalize the data

In [None]:
train_features = ...
test_features = ...
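
One common way to fill in this step is to standardize each feature with the mean and standard deviation of the training split only, and then reuse those statistics for the test split. The sketch below illustrates the idea on random data; the array names are hypothetical and it is one possible approach, not necessarily the intended solution.

```python
import numpy as np

# Random stand-ins for the real feature arrays (assumption: 4 features).
rng = np.random.default_rng(0)
demo_train = rng.normal(loc=5.0, scale=3.0, size=(1000, 4)).astype("float32")
demo_test = rng.normal(loc=5.0, scale=3.0, size=(200, 4)).astype("float32")

# Compute the statistics on the training split only, per feature column.
mean = demo_train.mean(axis=0)
std = demo_train.std(axis=0)

# Apply the same shift and scale to both splits.
demo_train_norm = (demo_train - mean) / std
demo_test_norm = (demo_test - mean) / std

print("train mean per feature:", demo_train_norm.mean(axis=0))
print("train std per feature:", demo_train_norm.std(axis=0))
```

Reusing the training statistics keeps information about the test split out of the preprocessing, so the test metrics stay an honest estimate.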

## Build a binary classification model

In [None]:
import keras

model = ...
model.summary()
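
As a possible starting point, a small fully connected network with a single sigmoid output is a typical shape for binary classification. The sketch below is loosely modeled on the referenced Keras example; the layer sizes, the dropout rate, and the name `sketch_model` are assumptions, and `num_features = 30` assumes the dataset's 30 input columns (`Time`, `V1` to `V28`, `Amount`).

```python
import keras

num_features = 30  # assumption: Time, V1..V28, Amount

# Hypothetical architecture; the sizes and dropout rate are illustrative only.
sketch_model = keras.Sequential(
    [
        keras.Input(shape=(num_features,)),
        keras.layers.Dense(256, activation="relu"),
        keras.layers.Dense(256, activation="relu"),
        keras.layers.Dropout(0.3),
        # One sigmoid unit outputs the estimated probability of fraud.
        keras.layers.Dense(1, activation="sigmoid"),
    ]
)
sketch_model.summary()
```

With the real data, `train_features.shape[1]` could replace the hard-coded `num_features`.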

## Exercise: detect at least 90% of the frauds in the test dataset (TP >= 68)

**Tip**: see the `class_weight` parameter in the documentation of the `fit` method: https://keras.io/api/models/model_training_apis/#fit-method
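
The `class_weight` argument takes a dictionary that maps each class index to a weight applied to the loss of samples from that class. A minimal sketch of one common choice, weights inversely proportional to class frequency, is shown below on synthetic labels; `demo_targets` and `demo_class_weight` are hypothetical names, and other weighting schemes may work as well or better.

```python
import numpy as np

# Synthetic labels: 95 negatives and 5 positives, shaped like `train_targets`.
demo_targets = np.array([0] * 95 + [1] * 5, dtype="uint8").reshape(-1, 1)

# Count how many samples belong to each class.
counts = np.bincount(demo_targets[:, 0])

# Inverse-frequency weights: the rarer class gets the larger weight.
demo_class_weight = {
    0: float(1.0 / counts[0]),
    1: float(1.0 / counts[1]),
}
print("counts:", counts.tolist())
print("class_weight:", demo_class_weight)
```

A dictionary of this form could be passed as `class_weight=...` in `model.fit`, so that each misclassified fraud contributes more to the loss than a misclassified legitimate transaction.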

In [None]:
metrics = [
  keras.metrics.TruePositives(name="tp"),
  ...
]

model.compile(
    ...
)

model.fit(
    train_features,
    train_targets,
    batch_size=...,
    epochs=...,
    callbacks=...,
    verbose=2,
    validation_split=...,
    class_weight=...,
)

In [None]:
results = model.evaluate(test_features, test_targets, verbose=0)
print('Test Loss: {}'.format(results[0]))
print('Test TP: {}'.format(results[1]))
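
To relate the reported true positives to the 90% goal, the count can be converted into recall, TP / (TP + FN). The sketch below shows the arithmetic; `num_frauds = 75` is an assumption chosen to be consistent with the `TP >= 68` target, and both helper functions are hypothetical.

```python
import math


def required_true_positives(num_frauds, target_recall=0.9):
    """Smallest TP count that reaches the target recall."""
    return math.ceil(target_recall * num_frauds)


def recall(tp, fn):
    """Fraction of actual frauds that were detected."""
    return tp / (tp + fn)


num_frauds = 75  # assumption: number of fraud cases in the test split
needed = required_true_positives(num_frauds)
print("TP needed for 90% recall:", needed)
print("Recall at that TP count:", recall(needed, num_frauds - needed))
```

With the real data, the number of frauds in the test split could be read from `int(test_targets.sum())` instead of being assumed.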