Author:
        
        PARK, JunHo, junho@ccnets.org

        
        KIM, JeongYoong, jeongyoong@ccnets.org
        
    COPYRIGHT (c) 2024. CCNets. All Rights reserved.

# Credit Card Fraud Detection: Handling Imbalanced Dataset with CCNet

## Introduction

This tutorial uses a Cooperative Encoding Network (CCNet) to address the class imbalance typical of credit card fraud datasets. CCNet generates synthetic transactions that increase the diversity and volume of the training data, with the aim of improving the robustness and accuracy of models that identify fraudulent transactions.

## Tutorial Goals

The objectives of this tutorial are designed to guide you through the process of enhancing data quality and model performance:

### Dataset Recreation with CCNet
- **Understand Data Augmentation**: Learn how encoding techniques can be used to generate synthetic data instances that closely mimic the characteristics of real-world fraudulent and non-fraudulent transactions.
- **Impact on Model Training**: Assess how augmenting the dataset influences training and, in turn, the model's ability to generalize from training data to real-world scenarios.

### Model Training and Evaluation
- **Dual Model Training**: Train two distinct models to directly compare performance metrics:
  - A model trained on the **original dataset**.
  - A model trained on the **CCNet-augmented dataset**.
- **Performance Metrics**: Use the F1 score, a critical measure for models operating on imbalanced datasets, to evaluate and compare the effectiveness of these models.

### Testing and Validation
- **Independent Model Testing**: Conduct a thorough evaluation of both models using a standalone test set that was not involved in the training phase.
- **Objective Analysis**: Critically analyze the outcomes to validate whether data augmentation through CCNet offers a tangible benefit in detecting credit card fraud.

## Conclusion

By the end of this tutorial, participants will not only grasp the theoretical underpinnings of using synthetic data to combat data imbalance but also gain hands-on experience in applying these concepts through CCNet to potentially enhance model performance in fraud detection tasks.


In [None]:
import sys
path_append = "../"
sys.path.append(path_append)  # Go up one directory from where you are.


In [None]:
# https://fraud-detection-handbook.github.io/fraud-detection-handbook/Chapter_7_DeepLearning/FeedForwardNeuralNetworks.html
import pandas as pd 

dataroot = path_append + "../data/credit_card_fraud_detection/creditcard.csv"
df = pd.read_csv(dataroot)
df

In [None]:
print('No Frauds', round(df['Class'].value_counts()[0] / len(df) * 100, 2), '% of the dataset')
print('Frauds', round(df['Class'].value_counts()[1] / len(df) * 100, 2), '% of the dataset')

In [None]:
import torch
from sklearn.preprocessing import StandardScaler
from torch.utils.data import Dataset

# Standardizing the features (excluding the class column)
sc = StandardScaler()
df.iloc[:, :-1] = sc.fit_transform(df.iloc[:, :-1])

# Total elements including features and class
n_elements = df.shape[1]
print(f"Number of features and classes (total columns): {n_elements}")
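
The cell above fits `StandardScaler` on every row, including rows that later end up in the test split, so test-set statistics influence the training features. One common alternative, sketched here with plain NumPy on made-up numbers, computes the mean and standard deviation from the training rows only and reuses them for the test rows. The array values and the ordered 50/50 split are illustrative assumptions, not the notebook's data.

```python
import numpy as np

# Made-up feature matrix: 6 rows, 2 features (assumption for illustration)
features = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0],
                     [4.0, 40.0], [5.0, 50.0], [6.0, 60.0]])
train_rows, test_rows = features[:3], features[3:]

# Statistics come from the training rows only
train_mean = train_rows.mean(axis=0)
train_std = train_rows.std(axis=0)  # population std (ddof=0), as StandardScaler uses

train_scaled = (train_rows - train_mean) / train_std
test_scaled = (test_rows - train_mean) / train_std  # reuse the training statistics

print(train_scaled.mean(axis=0))
print(test_scaled[0])
```

With scikit-learn, the same idea corresponds to calling `fit_transform` on the training features and `transform` on the test features.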

In [None]:

# Defining the labeled and unlabeled dataset classes
class LabeledDataset(Dataset):
    def __init__(self, x, y):
        self.x = x
        self.y = y
        
    def __len__(self):
        return len(self.x)

    def __getitem__(self, index):
        vals = torch.tensor(self.x[index], dtype=torch.float32)
        label = torch.tensor(self.y[index], dtype=torch.float32)
        return vals, label

class UnlabelledDataset(Dataset):
    def __init__(self, x):
        self.x = x
        
    def __len__(self):
        return len(self.x)

    def __getitem__(self, index):
        vals = torch.tensor(self.x[index], dtype=torch.float32)
        return vals, None
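
`UnlabelledDataset.__getitem__` returns `None` in the label slot, which PyTorch's default batch collation does not accept, so the loader used later relies on a custom `collate_fn` imported from `tools.loader`. That function's source is not shown in this notebook. The sketch below is only an assumption about what such a helper might do; `ToyUnlabelled` and `collate_allow_none` are hypothetical names.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ToyUnlabelled(Dataset):
    """Stand-in for UnlabelledDataset: yields (features, None)."""

    def __init__(self, x):
        self.x = x

    def __len__(self):
        return len(self.x)

    def __getitem__(self, index):
        return torch.tensor(self.x[index], dtype=torch.float32), None


def collate_allow_none(batch):
    """Hypothetical collate function: stack features, pass None labels through."""
    features = torch.stack([item[0] for item in batch])
    labels = [item[1] for item in batch]
    if all(label is None for label in labels):
        return features, None
    return features, torch.stack(labels)


toy_rows = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
toy_loader = DataLoader(ToyUnlabelled(toy_rows), batch_size=2,
                        collate_fn=collate_allow_none, shuffle=False)
toy_batches = list(toy_loader)
print([features.shape for features, _ in toy_batches])
```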

#### Dataset Splitting and Model Configuration

The next two cells split the data and configure CCNet. The dataset is first divided, in row order, into a training half and a test half so that the models can later be evaluated on unseen data. The environment is then initialized with a fixed random seed to ensure reproducibility, and the model parameters are set: no core model is used, and the encoder is a 'tabnet' model, which suits structured, tabular data such as credit card transactions.
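
`set_random_seed` comes from the project's `nn.utils.init` module, and its body is not shown here. As a rough idea of what such a helper usually covers, the sketch below seeds Python, NumPy, and PyTorch in one call. `seed_everything` is a hypothetical name, and the project's function may do more or less than this.

```python
import random

import numpy as np
import torch


def seed_everything(seed: int) -> None:
    """Hypothetical helper: seed the common random number generators."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)  # seeds PyTorch's random number generators


# Re-seeding before each draw should reproduce the same random tensor
seed_everything(0)
first_draw = torch.rand(3)
seed_everything(0)
second_draw = torch.rand(3)
print(torch.equal(first_draw, second_draw))
```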


In [None]:
from sklearn.model_selection import train_test_split

# Splitting the dataset into training and test sets for model evaluation
df_train, df_test = train_test_split(df, test_size=0.5, shuffle=False)
X_train, y_train = df_train.iloc[:, :-1].values, df_train.iloc[:, -1:].values
X_test, y_test = df_test.iloc[:, :-1].values, df_test.iloc[:, -1:].values

# Preparing the datasets for use in training and testing
df_train_values = df_train.iloc[:, :].values  # Full training rows (features plus the Class column) for CCNet training
unlabelled_dataset = UnlabelledDataset(df_train_values)  # Class stays as the last column, so no separate label is passed

# Labeled datasets for supervised learning tasks
supervised_trainset = LabeledDataset(X_train, y_train)  # Training features with their labels
supervised_testset = LabeledDataset(X_test, y_test)     # Held-out test features with their labels

# Printing the shapes of the datasets for verification
print(f"Unlabelled Dataset Shape: {len(unlabelled_dataset)}, {unlabelled_dataset.x.shape[1]}")
print(f"Labeled Trainset Shape: {len(supervised_trainset)}, {supervised_trainset.x.shape[1]}")
print(f"Labeled Testset Shape: {len(supervised_testset)}, {supervised_testset.x.shape[1]}")

In [None]:
# Set a fixed random seed for reproducibility of experiments
from nn.utils.init import set_random_seed
set_random_seed(0)

# Importing configuration setups for ML parameters and data
import torch
from tools.setting.ml_params import MLParameters
from tools.setting.data_config import DataConfig
from trainer_hub import TrainerHub

# Configuration for the data handling, defining dataset specifics and the task type
data_config = DataConfig(dataset_name='CreditCardFraudDetection', task_type='augmentation', obs_shape=[n_elements], label_size=None)

# Initializing ML parameters without a core model and setting the encoder model to 'tabnet' with specific configurations
ml_params = MLParameters(core_model='none', encoder_model='tabnet')
ml_params.model.encoder_config.num_layers = 4
ml_params.model.encoder_config.dropout = 0.0

# Setting training parameters and device configuration
ml_params.training.num_epoch = 4
device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 

# Create a TrainerHub instance to manage training and data processing
trainer_hub = TrainerHub(ml_params, data_config, device, use_print=True, use_wandb=False)


#### Training CCNet on the Training Split

The original dataset was split above, in row order, into a training half and a test half. Only the training half (`unlabelled_dataset`) is passed to the trainer here, so the test half remains unseen data for validating the models later.
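
Because the split cell passes `shuffle=False`, rows are not reordered: the earlier rows become the training set and the later rows the test set. The plain-Python sketch below mimics that behaviour for a fractional `test_size`. `ordered_split` is a hypothetical helper written for illustration, not part of scikit-learn.

```python
import math


def ordered_split(rows, test_fraction):
    """Split a sequence in order: leading rows for training, trailing rows for testing."""
    n_test = math.ceil(len(rows) * test_fraction)
    n_train = len(rows) - n_test
    return rows[:n_train], rows[n_train:]


transactions = list(range(7))  # stand-in for 7 rows in file order
train_part, test_part = ordered_split(transactions, test_fraction=0.5)
print(train_part, test_part)
```

An ordered split can leave the two halves with different class ratios, which is worth checking before comparing F1 scores.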


In [None]:
trainer_hub.train(unlabelled_dataset)

#### Data Loading and Synthetic Data Generation

This section loads the unlabelled dataset and passes it through the trained model to create synthetic data. This data augmentation step matters for tasks such as fraud detection, where models benefit from larger and more balanced training sets.
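
The generation cell that follows keeps only synthetic rows whose last column (the class value) lies in [0, 1], sorts them by that column, and takes rows from both ends, so each batch contributes low-valued and high-valued class examples in equal numbers. The toy sketch below isolates that selection step. `select_tails` and the sample tensor are illustrative assumptions, and the size cap that stops the two tails from overlapping is an addition that is not part of the notebook's loop.

```python
import torch


def select_tails(batch: torch.Tensor, per_tail: int) -> torch.Tensor:
    """Keep rows whose last column is in [0, 1], then return the `per_tail`
    lowest and `per_tail` highest rows ranked by that column."""
    class_value = batch[:, -1]
    kept = batch[(class_value >= 0) & (class_value <= 1)]
    per_tail = min(per_tail, kept.shape[0] // 2)  # cap so the tails cannot overlap
    kept = kept[torch.argsort(kept[:, -1])]
    n_rows = kept.shape[0]
    return torch.cat([kept[:per_tail], kept[n_rows - per_tail:]], dim=0)


# Made-up synthetic rows: one feature followed by a class value
toy = torch.tensor([
    [0.1, 0.90],
    [0.2, 0.05],
    [0.3, 1.40],   # class value above 1, dropped
    [0.4, 0.50],
    [0.5, 0.95],
    [0.6, -0.20],  # class value below 0, dropped
    [0.7, 0.10],
])
picked = select_tails(toy, per_tail=2)
print(picked[:, -1])
```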


In [None]:
import torch
from tools.loader import collate_fn

batch_size = 64
len_unlabelled_trainset = len(unlabelled_dataset)
unlabelled_train_loader = torch.utils.data.DataLoader(dataset=unlabelled_dataset, batch_size=batch_size, collate_fn=collate_fn, shuffle=False)

# Initialize the recreated dataset container
recreated_dataset = None

# Generate synthetic data through the model to augment the training dataset
with torch.no_grad():
    for data, _ in unlabelled_train_loader:
        data = data.to(device)
        # Generate synthetic rows for the current batch
        batch_recreated_data = trainer_hub.encoder_ccnet.synthesize(data)

        # Keep only entries whose last feature (the class value) lies in the range [0, 1]
        valid_entries = (batch_recreated_data[:, -1] >= 0) & (batch_recreated_data[:, -1] <= 1)
        batch_recreated_data = batch_recreated_data[valid_entries]

        # Check if any valid data remains after filtering; if none, skip this batch
        if batch_recreated_data.shape[0] == 0:
            continue
        
        # Sort the filtered data by the last feature (class label)
        sorted_batch = batch_recreated_data[torch.argsort(batch_recreated_data[:, -1])]

        # Select the lowest-valued and highest-valued rows from the sorted batch.
        # Each tail contributes batch_size // 2 rows, so a full batch yields batch_size rows;
        # if fewer than batch_size rows survive the filter, the two slices overlap.
        select_size = batch_size // 2  # Number of rows taken from each tail
        balanced_batch = torch.cat([sorted_batch[:select_size], sorted_batch[-select_size:]], dim=0)

        # Shuffle the selected batch
        indices = torch.randperm(len(balanced_batch))
        balanced_batch = balanced_batch[indices]

        # Concatenate the processed batch to the main dataset
        if recreated_dataset is None:
            recreated_dataset = balanced_batch
        else:
            recreated_dataset = torch.cat([recreated_dataset, balanced_batch], dim=0)
        
        # Ensure we do not exceed the original unlabelled dataset's size
        if len(recreated_dataset) >= len_unlabelled_trainset:
            recreated_dataset = recreated_dataset[:len_unlabelled_trainset]
            break

print(f"Recreated Dataset Shape: {recreated_dataset.shape}")

#### Data Preparation for Model Training

After synthetic data generation, this section separates the data and labels for training purposes, preparing them for use in machine learning models to ensure proper supervision and evaluation.
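
The separation relies on two slices: `[:, :-1]` takes every column but the last as features, and `[:, -1:]` takes the last column while keeping it two-dimensional, so the labels have shape `(N, 1)` and line up with the single-output model trained later. A small NumPy sketch on made-up values shows the difference between `-1:` and `-1`.

```python
import numpy as np

# Made-up recreated rows: two features followed by a class value
rows = np.array([[0.5, -1.2, 0.0],
                 [1.1, 0.3, 1.0],
                 [-0.7, 2.4, 0.0]])

features = rows[:, :-1]    # shape (3, 2)
labels_2d = rows[:, -1:]   # shape (3, 1): column dimension kept
labels_1d = rows[:, -1]    # shape (3,): column dimension dropped

print(features.shape, labels_2d.shape, labels_1d.shape)
```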


In [None]:
import torch

# recreated_dataset is the tensor produced by the generation cell above

# Separate the recreated data into features and labels for training
recreated_training_data = recreated_dataset[:, :-1].clone().detach().cpu().numpy()  # Features from recreated data
recreated_labels = recreated_dataset[:, -1:].clone().detach().cpu().numpy()  # Labels from recreated data
ccnet_recreated_dataset = LabeledDataset(recreated_training_data, recreated_labels)

# Print the shapes of the datasets for verification
print(f"Labeled Original Trainset Shape: {len(supervised_trainset)}, {supervised_trainset.x.shape[1]}")
print(f"CCNet Recreated Dataset Shape: {len(ccnet_recreated_dataset)}, {ccnet_recreated_dataset.x.shape[1]}")

print(f"Labeled Original Testset Shape: {len(supervised_testset)}, {supervised_testset.x.shape[1]}")

# Number of input features and of label columns (a single column for the binary Class label)
num_features = recreated_training_data.shape[1]
num_classes = recreated_labels.shape[1]

print(f"Number of Features: {num_features}")
print(f"Number of Classes: {num_classes}")

In [None]:
class DNN(torch.nn.Module):
    def __init__(self, input_size, output_size, num_layers=4, hidden_size=256):
        super().__init__()
        self.input_size = input_size
        self.output_size = output_size
        self.hidden_size = hidden_size
        
        # Create a list to hold all layers
        layers = []
        
        # Input layer
        layers.append(torch.nn.Linear(input_size, hidden_size))
        layers.append(torch.nn.ReLU())
        
        # Hidden layers
        for _ in range(num_layers - 2):
            layers.append(torch.nn.Linear(hidden_size, hidden_size))
            layers.append(torch.nn.ReLU())
        
        # Output layer
        layers.append(torch.nn.Linear(hidden_size, output_size))
        
        # Register all layers
        self.layers = torch.nn.Sequential(*layers)

    def forward(self, x):
        x = self.layers(x)
        return torch.sigmoid(x)

#### Training Supervised Models

This section outlines the process of training supervised learning models using both original and synthetic datasets. The `train_supervised_model` function is designed to iterate through the dataset, perform forward passes, compute loss, and update model weights using backpropagation.
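
The loss computed in the training loop is binary cross-entropy, which for a predicted probability p and a target y is -(y log p + (1 - y) log(1 - p)). The target does not have to be exactly 0 or 1, which matters here because the recreated labels are continuous values in [0, 1]. The pure-Python sketch below evaluates the formula for a few made-up pairs. `binary_cross_entropy` is a local helper written for illustration, not the PyTorch function.

```python
import math


def binary_cross_entropy(prob: float, target: float) -> float:
    """Per-sample BCE: -(y * log(p) + (1 - y) * log(1 - p))."""
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))


# Same prediction (p = 0.9) scored against three different targets
confident_right = binary_cross_entropy(prob=0.9, target=1.0)
confident_wrong = binary_cross_entropy(prob=0.9, target=0.0)
soft_target = binary_cross_entropy(prob=0.9, target=0.8)

print(confident_right, confident_wrong, soft_target)
```

For a fixed prediction the loss is linear in the target, so the soft-target value falls between the two hard-target values.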


In [None]:
def train_supervised_model(model, dataset):
    # Initialize the optimizer
    optimizer = torch.optim.Adam(model.parameters(), lr=0.0002)
    # Ensure reproducibility by resetting the random seed
    set_random_seed(0)
    # Create DataLoader for batch processing
    trainloader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=True)
    
    # Training loop
    for epoch in range(4):  # Train for 4 epochs as an example
        for i, (data, label) in enumerate(trainloader):
            data = data.to(device)
            label = label.to(device)
            # Perform forward pass
            output = model(data)
            # Compute loss
            loss = torch.nn.functional.binary_cross_entropy(output, label)
            # Backward pass to compute gradients
            loss.backward()
            # Update weights
            optimizer.step()
            # Reset gradients
            optimizer.zero_grad()


#### Model Training Using Recreated and Original Datasets

Two models are trained: one on the dataset recreated through the data augmentation process and one on the original dataset. Comparing them helps determine whether the synthetic data improves model performance.


In [None]:
# Initialize and train a model on the recreated dataset
model_trained_on_recreated = DNN(input_size=num_features, output_size=num_classes).to(device)
train_supervised_model(model_trained_on_recreated, ccnet_recreated_dataset)

# Initialize and train a model on the original dataset
model_trained_on_original = DNN(input_size=num_features, output_size=num_classes).to(device)
train_supervised_model(model_trained_on_original, supervised_trainset)

#### Evaluating Model Performance

After training, the models are evaluated using the F1 score, the harmonic mean of precision and recall, which is particularly useful for imbalanced datasets such as those in fraud detection. Both models are scored on the same held-out test set, so their scores are directly comparable.
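
To see why accuracy is a poor guide on data this skewed, consider a classifier that never predicts fraud. The sketch below computes accuracy, precision, recall, and F1 from scratch on a made-up test set of 1,000 transactions containing 2 frauds. The counts and the helper `f1_from_labels` are illustrative assumptions.

```python
def f1_from_labels(y_true, y_pred):
    """Return (accuracy, precision, recall, f1) for binary labels, positive class = 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1


y_true = [1, 1] + [0] * 998            # 2 frauds among 1,000 transactions
always_legit = [0] * 1000              # never flags fraud
catches_one = [1, 0, 1] + [0] * 997    # one hit, one miss, one false alarm

print(f1_from_labels(y_true, always_legit))
print(f1_from_labels(y_true, catches_one))
```

Both classifiers make the same number of mistakes and so have the same accuracy, yet only the second one has a non-zero F1 score.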


In [None]:
from sklearn.metrics import f1_score

def get_f1_score(model, testset, batch_size=batch_size):
    model.eval()  # Set the model to evaluation mode
    y_true = []
    y_pred = []
    # DataLoader for testing
    test_loader = torch.utils.data.DataLoader(testset, batch_size=batch_size, shuffle=False)

    # No gradient computation needed during inference
    with torch.no_grad():
        for data, label in test_loader:
            data = data.to(device)
            label = label.to(device)
            output = model(data)
            # Threshold the probabilities at 0.5; reshape(-1) keeps the result 1-D even for a batch of one
            predicted = (output.reshape(-1) > 0.5).long()
            y_true.extend(label.reshape(-1).long().cpu().numpy())
            y_pred.extend(predicted.cpu().numpy())

    # Compute and return the F1 score
    score = f1_score(y_true, y_pred, average='binary')
    return score

# Calculate F1 scores for both models
f1_score_original = get_f1_score(model_trained_on_original, supervised_testset)
f1_score_recreated = get_f1_score(model_trained_on_recreated, supervised_testset)

# Output the results
print("F1 score of the supervised learning model trained on the original data: ", f1_score_original)
print("F1 score of the supervised learning model trained on the recreated data: ", f1_score_recreated)
