Author:
        
        PARK, JunHo, junho@ccnets.org

        
        KIM, JeongYoong, jeongyoong@ccnets.org
        
    COPYRIGHT (c) 2024. CCNets. All Rights reserved.

# Credit Card Fraud Detection: Proving Causal Generation in Causal Cooperative Network (CCNet)

## Introduction

This tutorial applies the Causal Cooperative Network (CCNet), together with its CounterGenerate method, to the problem of imbalanced datasets in credit card fraud detection. We use CCNet not just as a data augmentation tool but as a causal model that generates realistic training data distributions through counterfactual scenarios. By showing that \(Y\) (outcomes) and \(E\) (explanations) act as necessary and sufficient causes within our model, we intend to demonstrate CCNet's ability to perform causal generation, thereby improving the diversity, realism, and balance of the training data. This, in turn, is expected to improve the robustness and accuracy of predictive models designed to identify fraudulent transactions.

## Tutorial Goals

The objectives of this tutorial are to guide you through the innovative use of causal data generation techniques and validate the effectiveness of these methods in a practical setting:

### Counterfactual Scenarios
- **Manipulate \(Y\), Maintain \(E\)**: Delve into the dynamics of manipulating the outcome variable \(Y\) while keeping the explanation variable \(E\) constant to create counterfactually augmented datasets. These datasets will represent balanced and redistributed conditions, allowing us to test the hypothesis that they can accurately mirror real-world data distributions. This approach underscores the causal influence of \(Y\) in the data generation process and tests the robustness of \(E\) as a stable explanatory factor.
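
The two label schemes used later in this tutorial (balanced and redistributed) both come down to drawing new outcomes \(Y\) independently of the original labels. The following is a minimal sketch of that sampling step in plain Python; `sample_labels` and `hypothetical_fraud_rate` are illustrative names and values introduced here, not part of CCNet or of the dataset.

```python
import random

def sample_labels(n, fraud_prob, seed):
    """Draw n binary labels, each equal to 1 (fraud) with probability fraud_prob."""
    rng = random.Random(seed)
    return [1 if rng.random() < fraud_prob else 0 for _ in range(n)]

n = 100_000
hypothetical_fraud_rate = 0.002  # placeholder value, not measured from the dataset

# Balanced: fraud and non-fraud are equally likely
balanced_y = sample_labels(n, 0.5, seed=0)
# Redistributed: same fraud frequency as the (assumed) original data, new positions
redistributed_y = sample_labels(n, hypothetical_fraud_rate, seed=1)

balanced_rate = sum(balanced_y) / n
redistributed_rate = sum(redistributed_y) / n
print(f"balanced fraud rate:      {balanced_rate:.4f}")
print(f"redistributed fraud rate: {redistributed_rate:.4f}")
```

In both cases only the labels are sampled; the observations themselves are produced afterwards by the causal model, conditioned on these labels and on the explanations \(E\) extracted from the real data.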

### Classifier Training and Causal Model Validation
- **Classifier Training**: Engage in the training of classifiers using datasets that have been created under specific causal assumptions to compare their performance metrics:
  - **First Classifier**: Trained on the **original dataset**, serving as the baseline for performance comparison.
  - **Second Classifier**: Trained on the **redistributed fraud outcome dataset**, designed to maintain the same frequency of fraud as the original dataset. This approach aims to test the model's effectiveness under different distribution conditions while expecting to achieve similar prediction accuracy to the first classifier.
  - **Third Classifier**: Trained on a **balanced fraud outcome dataset** in which each label is drawn as fraud with probability 0.5, so fraudulent and non-fraudulent outcomes are roughly equally frequent. This setup tests whether the classifier retains its detection capability on the original, highly imbalanced test data after training on a class distribution that differs sharply from it.
- **Performance Metrics**: Utilize key metrics such as the F1 score to evaluate and substantiate the causal generation effectiveness of CCNet. 
- **Counter Generation's Role in Imbalance Mitigation**: Explore how CCNet's counter generation capabilities can be leveraged to synthetically enhance the dataset. By generating realistic, counterfactual instances of both fraudulent and non-fraudulent transactions, CCNet aims to provide a richer, more balanced dataset that helps models learn more effective patterns for fraud detection without the typical bias introduced by skewed data distributions.

### Testing and Empirical Validation
- **Model Evaluation**: Conduct rigorous evaluations of models across all datasets using a consistent test set to ensure fair comparisons.
- **Causal Effectiveness Analysis**: Analyze whether the counterfactually generated data provides a statistically significant improvement in model performance, thereby supporting the hypothesis of effective causal generation.

## Conclusion

By the end of this tutorial, participants will not only comprehend the theoretical framework behind using CCNet for causal data generation but will also witness firsthand the practical benefits of this approach in enhancing model performance for fraud detection tasks. This exploration is intended to solidify the understanding of CCNet as a powerful tool for generating realistic and balanced datasets through principled causal mechanisms.


In [None]:
import sys
path_append = "../../"
sys.path.append(path_append)  # Go up one directory from where you are.


In [None]:
# https://fraud-detection-handbook.github.io/fraud-detection-handbook/Chapter_7_DeepLearning/FeedForwardNeuralNetworks.html
import pandas as pd 

file_name = 'credit_card_fraud_detection'
dataroot = path_append + f"../data/{file_name}/creditcard.csv"
df = pd.read_csv(dataroot)
df

In [None]:
print('No Frauds', round(df['Class'].value_counts()[0] / len(df) * 100, 2), '% of the dataset')
print('Frauds', round(df['Class'].value_counts()[1] / len(df) * 100, 2), '% of the dataset')

In [None]:
import torch
from torch.utils.data import Dataset
from tools.preprocessing.data_frame import auto_preprocess_dataframe

# convert column name time to v0
df.rename(columns={'Time': 'V0'}, inplace=True)
df, description = auto_preprocess_dataframe(df, target_columns=['Class'])

# Calculate the number of features and classes
num_features = description['num_features']
num_classes = description['num_classes']

print(num_features, num_classes)

In [None]:

# Defining the labeled and unlabeled dataset classes
class LabeledDataset(Dataset):
    def __init__(self, x, y):
        self.x = x
        self.y = y
        
    def __len__(self):
        return len(self.x)

    def __getitem__(self, index):
        if isinstance(self.x[index], torch.Tensor):
            vals = self.x[index]
        else:
            vals = torch.tensor(self.x[index], dtype=torch.float32)
        if isinstance(self.y[index], torch.Tensor):
            label = self.y[index]
        else:
            label = torch.tensor(self.y[index], dtype=torch.float32)
        return vals, label

#### Dataset Splitting for Training and Testing

The original dataset is split into training and testing parts to evaluate the model's performance accurately. This step is crucial for validating the effectiveness of the training on unseen data.


In [None]:
from sklearn.model_selection import train_test_split

# Splitting the dataset into training and test sets for model evaluation
df_train, df_test = train_test_split(df, test_size=0.5, shuffle=False)

# value_counts(normalize=True) already returns proportions, so the entry for class 1 is the fraud rate
train_fraud_rate = df_train['Class'].value_counts(normalize=True)[1]
print("fraud_rate: ", train_fraud_rate)

X_train, y_train = df_train.iloc[:, :-1].values, df_train.iloc[:, -1:].values
X_test, y_test = df_test.iloc[:, :-1].values, df_test.iloc[:, -1:].values

# Labeled datasets for supervised learning tasks
trainset = LabeledDataset(X_train, y_train)  # Training features with their original labels
testset = LabeledDataset(X_test, y_test)     # Held-out test features with their original labels

# Printing the shapes of the datasets for verification
print(f"Labeled Trainset Shape: {len(trainset)}, {trainset.x.shape[1]}")
print(f"Labeled Testset Shape: {len(testset)}, {testset.x.shape[1]}")

#### Initial Setup and Model Configuration

This section fixes the random seed for reproducibility, imports the configuration classes, and initializes the model parameters. As configured in the code below, `MLParameters` is created with `model_name='tabnet'` and `encoder_network='none'`, that is, TabNet as the core model and no separate encoder network. TabNet is designed for structured, tabular data such as credit card transactions.


In [None]:
# Set a fixed random seed for reproducibility of experiments
from nn.utils.init_layer import set_random_seed
set_random_seed(0)

# Importing configuration setups for ML parameters and data
import torch
from tools.setting.ml_params import MLParameters
from tools.setting.data_config import DataConfig
from trainer_hub import TrainerHub

# Configuration for the data handling, defining dataset specifics and the task type
data_config = DataConfig(dataset_name='credit_card_fraud_detection', task_type='binary_classification', obs_shape=[num_features], label_size=num_classes, explain_size=num_features-num_classes)

# Initialize ML parameters with 'tabnet' as the core model and no encoder network
ml_params = MLParameters(model_name='tabnet', encoder_network='none')

# Setting training parameters and device configuration
ml_params.training.num_epoch = 10
ml_params.model.num_layers = 4

device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 

# Create a TrainerHub instance to manage training and data processing
trainer_hub = TrainerHub(ml_params, data_config, device, use_print=False, use_wandb=True)


In [None]:
trainer_hub.train(trainset)

#### Data Loading and Counter Generation

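This section takes the training features, samples new outcome labels for them, and passes both through the trained CCNet to generate synthetic (counterfactual) data. Two label sets are sampled: a balanced one (fraud with probability 0.5) and a redistributed one (fraud with probability equal to the training fraud rate). This augmentation step matters for tasks such as fraud detection, where examples of the minority class are scarce.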
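
To make the idea of "change \(Y\), keep \(E\)" concrete, here is a deliberately tiny stand-in written in plain Python. It is only a sketch under strong assumptions: `explain`, `produce`, `counter_generate`, and `SHIFT` are hypothetical names for a hand-written linear toy, whereas CCNet learns these mappings as neural networks.

```python
SHIFT = 2.0  # hypothetical effect of the label on the second feature

def explain(x):
    """Toy explainer: recover the label-independent factor e from observation x."""
    return x[0]

def produce(y, e):
    """Toy producer: rebuild an observation from label y and explanation e."""
    return [e, e + y * SHIFT]

def counter_generate(x, y_new):
    """Keep e fixed, swap in a new label, and regenerate the observation."""
    return produce(y_new, explain(x))

x_original = produce(0, 0.3)                  # a non-fraud observation
x_counter = counter_generate(x_original, 1)   # same e, label flipped to fraud
x_recon = counter_generate(x_original, 0)     # same e, original label

print("original:      ", x_original)
print("counterfactual:", x_counter)
print("reconstruction:", x_recon)
```

In this toy, regenerating with the original label reproduces the input, while flipping the label changes only the label-dependent part of the observation. That is the behavior the tutorial expects from the trained causal model on real transactions.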


In [None]:
import torch

causal_model = trainer_hub.ccnet

# Training features and their original binary labels as float tensors
train_X = torch.tensor(X_train, dtype=torch.float32)
train_y = torch.tensor(y_train, dtype=torch.float32)

# Sample new target labels independently of the original ones.
# train_rand_y only serves as a shape/dtype template for the draws below.
# - balanced: each label is fraud (1) with probability 0.5
# - redistributed: each label is fraud (1) with probability train_fraud_rate
train_rand_y = torch.rand_like(train_y)  
balanced_train_y = torch.where(torch.rand_like(train_rand_y) > 0.5, torch.ones_like(train_y), torch.zeros_like(train_y)).float()
redistributed_train_y = torch.where(torch.rand_like(train_rand_y) < train_fraud_rate, torch.ones_like(train_y), torch.zeros_like(train_y)).float()

# Generate counterfactual observations conditioned on the sampled labels
balanced_data = causal_model.causal_generate(train_X.to(device), balanced_train_y.to(device))
redistributed_data = causal_model.causal_generate(train_X.to(device), redistributed_train_y.to(device))

# Pair the generated features (detached and moved to CPU) with the labels they were generated for
balanced_trainset = LabeledDataset(balanced_data.clone().detach().cpu(), balanced_train_y)
redistributed_trainset = LabeledDataset(redistributed_data.clone().detach().cpu(), redistributed_train_y)

print(f"Original Trainset Shape: {len(trainset)}, {trainset.x.shape[1]}")
print(f"Balanced Trainset Shape: {len(balanced_trainset)}, {balanced_trainset.x.shape[1]}")
print(f"Redistributed Trainset Shape: {len(redistributed_trainset)}, {redistributed_trainset.x.shape[1]}")

print(f"Original Trainset Fraud Rate: {trainset.y.sum().item()/len(trainset)}")
print(f"Balanced Trainset Fraud Rate: {balanced_trainset.y.sum().item()/len(balanced_trainset)}")
print(f"Redistributed Trainset Fraud Rate: {redistributed_trainset.y.sum().item()/len(redistributed_trainset)}")



#### Data Preparation for Model Training

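After synthetic data generation, this section defines the classifier that will be trained on each dataset. It consists of a linear input layer and a ReLU followed by a TabNet block configured with a `'sigmoid'` output, so the same architecture is used for the original, balanced, and redistributed data.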


In [None]:
from nn.tabnet import EncoderTabNet as TabNet
from tools.setting.ml_params import ModelParameters, CooperativeNetworkConfig

class Classifier(torch.nn.Module):
    def __init__(self, input_size, output_size, num_layers=3, hidden_size=256):
        super(Classifier, self).__init__()
        
        model_config = ModelParameters('tabnet')
        model_config.num_layers = num_layers
        model_config.d_model = hidden_size
        network_config = CooperativeNetworkConfig(model_config, "Classifier", input_size, output_size, 'sigmoid')
        self.input_size = input_size
        self.output_size = output_size
        self.hidden_size = hidden_size
        
        # Create a list to hold all layers
        layers = []
        
        # Input layer
        layers.append(torch.nn.Linear(input_size, hidden_size))
        layers.append(torch.nn.ReLU())
        
        ## Add TabNet layers
        layers.append(TabNet(network_config))

        # Register all layers
        self.layers = torch.nn.Sequential(*layers)

    def forward(self, x):
        return self.layers(x)

#### Training Supervised Models

This section outlines the process of training supervised learning models on both the original and the synthetic datasets. The `train_classifier` function iterates over the dataset in mini-batches, performs forward passes, computes the binary cross-entropy loss, and updates the model weights using backpropagation.


In [None]:
def train_classifier(model, dataset, num_epoch=4):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 
    model.train()  # Make sure the model is in training mode
    # Initialize the optimizer
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    # Create DataLoader for shuffled mini-batch processing
    trainloader = torch.utils.data.DataLoader(dataset, batch_size=64, shuffle=True)
    num_batches = len(trainloader)
    # Training loop
    for epoch in range(num_epoch):
        sum_loss = 0.0
        for data, label in trainloader:
            data = data.to(device)
            label = label.to(device).float()
            # Reset gradients left over from the previous step
            optimizer.zero_grad()
            # Perform forward pass
            output = model(data)
            # Compute loss (mean over the batch)
            loss = torch.nn.functional.binary_cross_entropy(output, label)
            # Backward pass to compute gradients
            loss.backward()
            # Update weights
            optimizer.step()
            sum_loss += loss.item()
            
        # Report the mean per-batch loss for this epoch
        print(f"Epoch {epoch+1}, Loss: {sum_loss/num_batches}")


#### Model Training Using Recreated and Original Datasets

Models are trained using both datasets generated through the Data Augmentation process and the original dataset. This comparison helps to determine the effectiveness of the synthetic data in improving model performance.


In [None]:
# Initialize and train a model on the balanced (counterfactually generated) dataset
model_trained_on_balanced = Classifier(input_size=num_features, output_size=num_classes).to(device)
train_classifier(model_trained_on_balanced, balanced_trainset)

In [None]:
# Initialize and train a model on the redistributed (counterfactually generated) dataset
model_trained_on_redistributed = Classifier(input_size=num_features, output_size=num_classes).to(device)
train_classifier(model_trained_on_redistributed, redistributed_trainset)

In [None]:

# Initialize and train a model on the original dataset
model_trained_on_original = Classifier(input_size=num_features, output_size=num_classes).to(device)
train_classifier(model_trained_on_original, trainset)

#### Evaluating Model Performance

After training, the models are evaluated using the F1 score, the harmonic mean of precision and recall, which is more informative than accuracy on imbalanced datasets such as those in fraud detection. All three models are scored on the same held-out test set, so the scores are directly comparable.
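
As a small worked example of why F1 is preferred over accuracy here, the sketch below computes both from confusion-matrix counts. The counts are made up for illustration and do not come from this tutorial's experiments; `precision_recall_f1` is a helper name introduced only for this example.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical case 1: 1000 transactions, 10 frauds, classifier predicts "non-fraud" for all
tp, fp, fn, tn = 0, 0, 10, 990
accuracy_all_negative = (tp + tn) / (tp + fp + fn + tn)
_, _, f1_all_negative = precision_recall_f1(tp, fp, fn)
print(f"all-negative: accuracy={accuracy_all_negative:.3f}, F1={f1_all_negative:.3f}")

# Hypothetical case 2: the classifier catches 8 of the 10 frauds with 4 false alarms
precision, recall, f1 = precision_recall_f1(tp=8, fp=4, fn=2)
print(f"detector: precision={precision:.3f}, recall={recall:.3f}, F1={f1:.3f}")
```

The trivial all-negative classifier reaches 99% accuracy yet has an F1 of zero, which is why accuracy alone says little about fraud detection quality.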


In [None]:
from sklearn.metrics import f1_score

def get_f1_score(model, testset, batch_size=256):
    model.eval()  # Set the model to evaluation mode
    y_true = []
    y_pred = []
    # DataLoader for testing
    test_loader = torch.utils.data.DataLoader(testset, batch_size=batch_size, shuffle=False)

    # No gradient computation needed during inference
    with torch.no_grad():
        for data, label in test_loader:
            data = data.to(device)
            label = label.to(device)
            output = model(data)
            # Threshold the sigmoid outputs at 0.5; reshape(-1) keeps a 1-D shape even for a batch of one
            predicted = (output.reshape(-1) > 0.5).long()
            y_true.extend(label.reshape(-1).long().cpu().numpy())
            y_pred.extend(predicted.cpu().numpy())

    # Compute and return the F1 score
    score = f1_score(y_true, y_pred, average='binary')
    return score

# Calculate F1 scores for all three models
f1_score_original = get_f1_score(model_trained_on_original, testset)
f1_score_balanced = get_f1_score(model_trained_on_balanced, testset)
f1_score_redistributed = get_f1_score(model_trained_on_redistributed, testset)

# Output the results
print("F1 score of the classifier trained on the original data: ", f1_score_original)
print("F1 score of the classifier trained on the balanced data: ", f1_score_balanced)
print("F1 score of the classifier trained on the redistributed data: ", f1_score_redistributed)
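
A single F1 score per model does not say whether a difference between two models is statistically meaningful. One common option, shown here as a sketch and not as the method used by the authors, is a percentile bootstrap over the test set. The helpers `f1_binary` and `bootstrap_f1_diff` and the toy predictions are assumptions made for this example; in practice the label and prediction lists would be collected from the test loop above.

```python
import random

def f1_binary(y_true, y_pred):
    """F1 score for the positive class (label 1), computed from raw counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def bootstrap_f1_diff(y_true, pred_a, pred_b, n_boot=1000, seed=0):
    """95% percentile-bootstrap interval for F1(pred_a) - F1(pred_b)."""
    rng = random.Random(seed)
    n = len(y_true)
    diffs = []
    for _ in range(n_boot):
        # Resample test examples with replacement, using the same indices for both models
        idx = [rng.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in idx]
        a = [pred_a[i] for i in idx]
        b = [pred_b[i] for i in idx]
        diffs.append(f1_binary(yt, a) - f1_binary(yt, b))
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot) - 1]

# Toy data: 20 frauds among 200 transactions, two hypothetical classifiers
y_true = [1] * 20 + [0] * 180
pred_a = [1] * 16 + [0] * 4 + [1] * 4 + [0] * 176
pred_b = [1] * 10 + [0] * 10 + [1] * 4 + [0] * 176

lo, hi = bootstrap_f1_diff(y_true, pred_a, pred_b)
print(f"F1 A: {f1_binary(y_true, pred_a):.3f}, F1 B: {f1_binary(y_true, pred_b):.3f}")
print(f"95% bootstrap interval for the difference: [{lo:.3f}, {hi:.3f}]")
```

If the interval excludes zero, the difference is unlikely to be due to test-set sampling noise alone. An interval that straddles zero is consistent with the two training sets yielding comparable classifiers.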
