Authors:
    PARK, JunHo, junho@ccnets.org
    KIM, JeongYoong, jeongyoong@ccnets.org

COPYRIGHT (c) 2024. CCNets. All Rights reserved.

# Introduction to CCNet: A New Paradigm in Causal Learning

## Overview of CCNet
CCNet (Cooperative Causal Network) is a framework for modeling causal dynamics in observed data. It separates the explanatory factors (E) behind an observation (X) from the target outcome (Y), using neural networks to extract factors from the observed data that are conditionally independent of the target variable given the observations.

## A Method for Causal Learning with Neural Networks
CCNet models the causal relationship between observed data (X) and target outcomes (Y) using three cooperating neural networks: the Explainer, the Reasoner, and the Producer. This tutorial tests empirically whether the explanatory factors (E) are conditionally independent of the target outcome (Y) given the observation (X).
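
To make the three roles concrete, here is a minimal sketch. It is not the CCNets library implementation. It assumes a wiring in which the Explainer maps X to E, the Reasoner maps (X, E) to a predicted Y, and the Producer maps (Y, E) back to a reconstructed X. The class name `ToyCCNet`, the helper `mlp`, and all layer sizes are hypothetical.

```python
import torch
from torch import nn


def mlp(in_size, out_size, hidden=32):
    # Small two-layer perceptron shared by all three toy networks.
    return nn.Sequential(nn.Linear(in_size, hidden), nn.ReLU(), nn.Linear(hidden, out_size))


class ToyCCNet(nn.Module):
    """Illustrative stand-in for the Explainer, Reasoner, and Producer."""

    def __init__(self, obs_size, label_size, explain_size):
        super().__init__()
        self.explainer = mlp(obs_size, explain_size)              # X -> E (assumed)
        self.reasoner = mlp(obs_size + explain_size, label_size)  # (X, E) -> Y (assumed)
        self.producer = mlp(label_size + explain_size, obs_size)  # (Y, E) -> X (assumed)

    def forward(self, x):
        e = self.explainer(x)
        y_hat = self.reasoner(torch.cat([x, e], dim=-1))
        x_hat = self.producer(torch.cat([y_hat, e], dim=-1))
        return e, y_hat, x_hat


model = ToyCCNet(obs_size=8, label_size=1, explain_size=7)
e, y_hat, x_hat = model(torch.randn(4, 8))
print(e.shape, y_hat.shape, x_hat.shape)
```

The sketch only shows how data could flow between the three networks; the cooperative training losses are out of scope here.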

## Experiment Description

### Objective
Empirically validate the conditional independence of explanatory factors (E) from the target outcome (Y) given the observations (X) using CCNet in the context of credit card fraud detection.
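
Stated formally, the property under test is the standard definition of conditional independence, written with the symbols used above:

$$
E \perp Y \mid X \iff p(E, Y \mid X) = p(E \mid X)\, p(Y \mid X)
$$

Equivalently, wherever $p(X, E) > 0$, we have $p(Y \mid X, E) = p(Y \mid X)$: once X is known, E carries no additional information about Y.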

### Dataset
The experiment uses a credit card fraud detection dataset, comprising features related to transaction details and the target variable indicating fraud. Specifically:
- `df_y_class`: The binary target indicating fraud status, used for binary classification.
- `df_y_amount`: The continuous target indicating transaction amount, used for regression.

This dataset serves as the basis for configuring the causal model dataset for training CCNet.

### Methodology

#### Dataset Preparation
- **CausalModelDataset**: Configures the dataset to include specific features hypothesized to be causally related to the target outcomes. For fraud detection, this includes features relevant to transactions, with either `df_y_class` or `df_y_amount` as the target.
- **EncodingDataset**: Generates encodings (explanatory factors) from the dataset using the causal model for classifier training.

#### Training Process
- **Causal Model Training**: Iteratively train the causal model over several epochs to refine the understanding of causal relationships between transaction features and the outcomes.
- **Classifier Training and Evaluation**: At each epoch, two fresh prediction models are trained on the explanatory encodings (E) produced by the causal model and evaluated on a held-out half of the data.
  - **Selected Feature Classifier**: Predicts the selected target, i.e., the target the causal model was trained with (`df_y_class` for binary classification or `df_y_amount` for regression).
  - **Non-Selected Feature Classifier**: Predicts the other target from the same encodings (`df_y_amount` when `df_y_class` is selected, and vice versa).
  - The experiment is then repeated with the roles swapped (first fraud status, then transaction amount as the selected target) to compare how well the encodings predict each target.
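
The swap between selected and non-selected targets can be sketched as a small experiment plan. This is an illustrative helper only; `TARGET_TASKS` and `build_experiment_plan` are hypothetical names and are not part of the notebook or of CCNet.

```python
# Hypothetical experiment plan: each target takes a turn as the "selected"
# one (used to train the causal model); the other is the "non-selected" one.
TARGET_TASKS = {
    "df_y_class": "binary_classification",
    "df_y_amount": "regression",
}


def build_experiment_plan(target_tasks):
    """Return one run per ordered (selected, non-selected) pair of targets."""
    plan = []
    for selected, selected_task in target_tasks.items():
        for other, other_task in target_tasks.items():
            if other == selected:
                continue
            plan.append({
                "selected": selected,
                "selected_task": selected_task,
                "non_selected": other,
                "non_selected_task": other_task,
            })
    return plan


for run in build_experiment_plan(TARGET_TASKS):
    print(run["selected"], "->", run["non_selected"])
```

With two targets this yields exactly two runs, one for each choice of selected target.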

#### Validation and Visualization
- **Performance Metrics**: Record accuracy and F1 score for the binary classification task, and mean squared error (MSE) and mean absolute error (MAE) for the regression task. These metrics are used to compare the effectiveness of the causal model.
- **Plotting Results**: Visualize performance trends over epochs to assess the impact of the causal model on classifier performance.
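
For reference, the four metrics can be written out in plain Python. The notebook itself computes them with scikit-learn; the functions below are a sketch that only illustrates the definitions, and `f1_binary` scores the positive class alone (a macro-averaged F1 would average the per-class values instead).

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_binary(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mse(y_true, y_pred):
    """Mean of the squared errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean of the absolute errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


labels, preds = [1, 0, 1, 0], [1, 0, 0, 0]
print(accuracy(labels, preds))      # 0.75
print(f1_binary(labels, preds))     # tp=1, fp=0, fn=1 -> 2/3
print(mse([1.0, 2.0], [1.5, 2.5]))  # 0.25
print(mae([1.0, 2.0], [1.5, 2.5]))  # 0.5
```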

### Summary

The experiment aims to validate CCNet's proficiency in identifying and utilizing causal relationships in the credit card fraud detection dataset. By focusing on conditional independence and evaluating classifier performance across both binary classification (fraud detection using `df_y_class`) and regression (transaction amount prediction using `df_y_amount`), the experiment provides empirical evidence of CCNet's efficacy in causal generation. This highlights CCNet's ability to learn a generative graph that accurately captures the causal dynamics in fraudulent transactions.

### Key Insights

- **Target Versatility**: The experiment demonstrates the flexibility of CCNet in handling different types of target variables. Whether predicting fraud status (binary classification) or transaction amount (regression), the model adapts to learn the causal relationships effectively.
- **Conditional Independence**: The analysis tests whether the explanatory factors (E) are conditionally independent of the target outcome (Y) given the observations (X), the property on which CCNet's causal interpretation rests.
- **Causal Generation**: The ability of CCNet to generate realistic transaction data based on learned causal relationships showcases its strength in modeling generative processes beyond simple prediction.


In [None]:
import sys
import pandas as pd 
path_append = "../"
sys.path.append(path_append)  # Go up one directory from where you are.

# https://fraud-detection-handbook.github.io/fraud-detection-handbook/Chapter_7_DeepLearning/FeedForwardNeuralNetworks.html

dataroot = path_append + "../data/credit_card_fraud_detection/creditcard.csv"
df = pd.read_csv(dataroot)
df

In [None]:
print('No Frauds', round(df['Class'].value_counts()[0] / len(df) * 100, 2), '% of the dataset')
print('Frauds', round(df['Class'].value_counts()[1] / len(df) * 100, 2), '% of the dataset')
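
The same class-share calculation can be sketched without pandas. The helper `class_shares` and the toy labels below are hypothetical and only illustrate the arithmetic; they do not reflect the real dataset's counts.

```python
from collections import Counter


def class_shares(labels):
    """Return the percentage of samples in each class, rounded to 2 decimals."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: round(100 * n / total, 2) for cls, n in counts.items()}


# Hypothetical toy labels: 2 fraud cases among 1000 transactions.
toy_labels = [0] * 998 + [1] * 2
print(class_shares(toy_labels))  # {0: 99.8, 1: 0.2}
```

With pandas, `df['Class'].value_counts(normalize=True)` returns the same shares as fractions.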

In [None]:
import torch
from sklearn.preprocessing import StandardScaler, RobustScaler
from torch.utils.data import Dataset

from tools.preprocessing.data_frame import auto_preprocess_dataframe
from tools.preprocessing.scaler import scale_dataframe

target_columns = ['Amount', 'Class']
df, description = auto_preprocess_dataframe(df, target_columns)
df, scale_amount = scale_dataframe(df, transform_columns = ['Amount'])
scale_class = None

df_x = df.drop(columns = target_columns)
df_y_class = df['Class']
task_type_class = 'binary_classification'

df_y_amount = df['Amount']
task_type_amount = 'regression'

X = torch.tensor(df_x.values[:], dtype=torch.float32)
y_class = torch.tensor(df_y_class.values[:], dtype=torch.float32).unsqueeze(-1)
y_amount = torch.tensor(df_y_amount.values[:], dtype=torch.float32).unsqueeze(-1)

num_features = description['num_features']

In [None]:
# Dataset class that pairs each input row with its label
class LabeledDataset(Dataset):
    def __init__(self, x, y):
        self.x = x
        self.y = y
        
    def __len__(self):
        return len(self.x)

    def __getitem__(self, index):
        vals = self.x[index] if isinstance(self.x, torch.Tensor) else torch.tensor(self.x[index], dtype=torch.float32)
        label = self.y[index] if isinstance(self.y, torch.Tensor) else torch.tensor(self.y[index], dtype=torch.float32)
        return vals, label

In [None]:

from nn.tabnet import TabNet 
from tools.setting.ml_params import ModelConfig

class PredictionModel(torch.nn.Module):
    def __init__(self, input_size, output_size, task_type, num_layers=3, hidden_size=256):
        super(PredictionModel, self).__init__()
        
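        if task_type == 'binary_classification':
            # Outputs are probabilities in [0, 1]
            final_act = torch.nn.Sigmoid()
        elif task_type == 'regression':
            # Outputs are unbounded real values
            final_act = torch.nn.Identity()
        else:
            raise ValueError(f"Unsupported task_type: {task_type}")
        
        self.final_act = final_act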
        
        model_config = ModelConfig('tabnet')
        model_config.num_layers = num_layers
        model_config.d_model = hidden_size
        
        self.input_size = input_size
        self.output_size = output_size
        self.hidden_size = hidden_size
        
        # Create a list to hold all layers
        layers = []
        
        # Input layer
        layers.append(torch.nn.Linear(input_size, hidden_size))
        layers.append(torch.nn.ReLU())
        
        ## Add TabNet layers
        layers.append(TabNet(model_config))
        layers.append(torch.nn.ReLU())

        # Output layer
        layers.append(torch.nn.Linear(hidden_size, output_size))
        
        # Register all layers
        self.layers = torch.nn.Sequential(*layers)

    def forward(self, x):
        x = self.layers(x)
        return self.final_act(x)
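
Because `PredictionModel` ends with a sigmoid for binary classification, its outputs are probabilities rather than raw logits, and the loss has to match. In PyTorch, `torch.nn.functional.binary_cross_entropy` expects probabilities, while `binary_cross_entropy_with_logits` applies the sigmoid itself. The plain-Python sketch below (helper names are hypothetical) checks that the two formulations agree, and that applying the sigmoid twice gives a different loss.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce_from_probability(p, y):
    """Binary cross-entropy when the model output is already a probability."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def bce_from_logit(z, y):
    """Binary cross-entropy computed directly from a raw logit z."""
    # Numerically stable form: max(z, 0) - z*y + log(1 + exp(-|z|))
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))


z, y = 1.3, 1.0
loss_prob = bce_from_probability(sigmoid(z), y)
loss_logit = bce_from_logit(z, y)
loss_double = bce_from_logit(sigmoid(z), y)  # sigmoid applied twice

print(loss_prob, loss_logit)  # the two agree up to rounding
print(loss_double)            # differs: the probability was treated as a logit
```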

In [None]:
# Helper Plotting Functions
import matplotlib.pyplot as plt
from IPython.display import clear_output
import numpy as np
from IPython.display import display

def display_plot(fig):
    plt.tight_layout()
    clear_output(wait=True)
    display(fig, display_id='fig')
    plt.pause(0.1)  # Pause to allow the plot to update

def update_annot(ind, line, annot):
    pos = line.get_offsets()[ind["ind"][0]]
    annot.xy = pos
    text = f"{pos[0]:.2f}, {pos[1]:.2f}"
    annot.set_text(text)
    annot.get_bbox_patch().set_alpha(0.4)

def hover(event, fig, ax, line, annot):
    vis = annot.get_visible()
    if event.inaxes == ax:
        cont, ind = line.contains(event)
        if cont:
            update_annot(ind, line, annot)
            annot.set_visible(True)
            fig.canvas.draw_idle()
        else:
            if vis:
                annot.set_visible(False)
                fig.canvas.draw_idle()

def initialize_plot():
    # Sample data
    x = np.linspace(0, 10, 100)
    y = np.sin(x)
    
    # Turn off interactive mode initially
    plt.ioff()
    
    # Create the figure and axes
    fig, axs = plt.subplots(1, 2, figsize=(14, 6))
    
    # Create the scatter plot on the first subplot
    ax = axs[0]
    line = ax.scatter(x, y)
    
    # Annotate point
    annot = ax.annotate("", xy=(0,0), xytext=(20,20),
                        textcoords="offset points",
                        bbox=dict(boxstyle="round", fc="w"),
                        arrowprops=dict(arrowstyle="->"))
    annot.set_visible(False)
    
    # Connect the hover event
    fig.canvas.mpl_connect("motion_notify_event", lambda event: hover(event, fig, ax, line, annot))

    return fig, axs, ax, line, annot

def plot_accuracy(ax, epochs, selected_results, none_selected_results, task_type_selected, task_type_non_selected):
    ax.cla()
    if task_type_selected == 'binary_classification':
        ax.plot(epochs, selected_results['accuracy'], label='Selected Attributes (Accuracy)')
    elif task_type_selected == 'regression':
        ax.plot(epochs, selected_results['mse'], label='Selected Attributes (MSE)')
    
    if task_type_non_selected == 'binary_classification':
        ax.plot(epochs, none_selected_results['accuracy'], label='Non-selected Attributes (Accuracy)')
    elif task_type_non_selected == 'regression':
        ax.plot(epochs, none_selected_results['mse'], label='Non-selected Attributes (MSE)')
    
    ax.set_title('Metrics over Epochs')
    ax.set_xlabel('Epoch')
    ax.set_ylabel('Accuracy/MSE')
    ax.legend()

def plot_f1_score(ax, epochs, selected_results, none_selected_results, task_type_selected, task_type_non_selected):
    ax.cla()
    if task_type_selected == 'binary_classification':
        ax.plot(epochs, selected_results['f1_score'], label='Selected Attributes (F1 Score)')
    elif task_type_selected == 'regression':
        ax.plot(epochs, selected_results['mae'], label='Selected Attributes (MAE)')
    
    if task_type_non_selected == 'binary_classification':
        ax.plot(epochs, none_selected_results['f1_score'], label='Non-selected Attributes (F1 Score)')
    elif task_type_non_selected == 'regression':
        ax.plot(epochs, none_selected_results['mae'], label='Non-selected Attributes (MAE)')
    
    ax.set_title('Metrics over Epochs')
    ax.set_xlabel('Epoch')
    ax.set_ylabel('F1 Score/MAE')
    ax.legend()


In [None]:
import torch
from torch.utils.data import DataLoader
from sklearn.metrics import accuracy_score, f1_score, mean_squared_error, mean_absolute_error
import matplotlib.pyplot as plt

def _test_prediction_model(model, dataset, task_type, target_scale, device):
    model.eval()
    all_preds = []
    all_labels = []
    target_scale = torch.tensor(target_scale, dtype = torch.float).to(device) if target_scale is not None else None
    
    dataloader = DataLoader(dataset, batch_size=64, shuffle=False)
    with torch.no_grad():
        for data, labels in dataloader:
            data, labels = data.to(device), labels.to(device).float()
            outputs = model(data)
            
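            if task_type == 'binary_classification':
                # The model's final activation is already a sigmoid, so round the probabilities directly
                preds = outputs.round()
            elif task_type == 'regression':
                # Rescale predictions only when a target scale is provided
                preds = outputs if target_scale is None else target_scale * outputs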

            all_preds.append(preds.cpu())
            all_labels.append(labels.cpu())
    
    all_preds = torch.cat(all_preds)
    all_labels = torch.cat(all_labels)
    
    if task_type == 'binary_classification':
        accuracy = accuracy_score(all_labels, all_preds)
        f1 = f1_score(all_labels, all_preds, average='macro')
        return accuracy, f1
    elif task_type == 'regression':
        mse = mean_squared_error(all_labels, all_preds)
        mae = mean_absolute_error(all_labels, all_preds)
        return mse, mae


In [None]:
def test_prediction_models(epoch, axs, selected_classifier, none_selected_classifier, testset_selected, testset_none_selected, 
                            task_type_selected, task_type_non_selected, scale_selected, scale_none_selected, selected_results, none_selected_results, device):
    print(f"Testing causal classifier on selected attributes at epoch {epoch}...")
    selected_metrics = _test_prediction_model(selected_classifier, testset_selected, task_type_selected, scale_selected, device)
    if task_type_selected == 'binary_classification':
        selected_results['accuracy'].append(selected_metrics[0])
        selected_results['f1_score'].append(selected_metrics[1])
    elif task_type_selected == 'regression':
        selected_results['mse'].append(selected_metrics[0])
        selected_results['mae'].append(selected_metrics[1])

    print(f"Testing classifier on non-selected attributes at epoch {epoch}...")
    none_selected_metrics = _test_prediction_model(none_selected_classifier, testset_none_selected, task_type_non_selected, scale_none_selected, device)
    if task_type_non_selected == 'binary_classification':
        none_selected_results['accuracy'].append(none_selected_metrics[0])
        none_selected_results['f1_score'].append(none_selected_metrics[1])
    elif task_type_non_selected == 'regression':
        none_selected_results['mse'].append(none_selected_metrics[0])
        none_selected_results['mae'].append(none_selected_metrics[1])

    # Update plots
    if task_type_selected == 'binary_classification':
        epochs = range(1, len(selected_results['accuracy']) + 1)
    elif task_type_selected == 'regression':
        epochs = range(1, len(selected_results['mse']) + 1)
    plot_accuracy(axs[0], epochs, selected_results, none_selected_results, task_type_selected, task_type_non_selected)
    plot_f1_score(axs[1], epochs, selected_results, none_selected_results, task_type_selected, task_type_non_selected)


In [None]:
import torch
from torch.utils.data import DataLoader

# Function to train classifier
DECAY_RATE = 0.01
ITERATION_100K = 100000
gamma = pow(DECAY_RATE, 1 / ITERATION_100K)

def train_prediction_model(model, trainset, task_type, target_scale, num_epochs=3, gamma=gamma, device='cuda'):
    model.train()
    train_loader = DataLoader(trainset, batch_size=64, shuffle=True)
    len_loader = len(train_loader)
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=gamma)

    for epoch in range(num_epochs):
        sum_loss = 0
        for data, labels in train_loader:
            data, labels = data.to(device), labels.to(device).float()
            optimizer.zero_grad()
            outputs = model(data)

            if task_type == 'binary_classification':
                # PredictionModel already applies a sigmoid, so its outputs are probabilities;
                # binary_cross_entropy expects probabilities, not raw logits
                loss = torch.nn.functional.binary_cross_entropy(outputs, labels)
            elif task_type == 'regression':
                # Labels are used as stored in the dataset (no rescaling by target_scale here)
                scaled_labels = labels
                # Using mean squared error for regression
                loss = torch.nn.functional.mse_loss(outputs, scaled_labels)

            loss.backward()
            optimizer.step()
            scheduler.step()
            sum_loss += loss.item()
        
        avg_loss = sum_loss / len_loader
        print(f"Epoch: {epoch + 1}, Loss: {avg_loss:.4f}")

    print(f"Final Learning rate: {optimizer.param_groups[0]['lr']:.8f}")
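
The decay factor `gamma` is chosen so that the learning rate shrinks by `DECAY_RATE` after `ITERATION_100K` scheduler steps, since `ExponentialLR` multiplies the rate by `gamma` on every `scheduler.step()`. The sketch below reproduces that arithmetic in plain Python; `lr_after` is a hypothetical helper, and the batch count in the last line is an illustrative assumption, not a figure from the dataset.

```python
DECAY_RATE = 0.01
ITERATION_100K = 100_000
gamma = DECAY_RATE ** (1 / ITERATION_100K)


def lr_after(steps, base_lr=0.001, gamma=gamma):
    """Learning rate after `steps` scheduler steps under exponential decay."""
    return base_lr * gamma ** steps


print(lr_after(0))        # the base learning rate
print(lr_after(100_000))  # about base_lr * DECAY_RATE

# Hypothetical run length: 2,000 batches per epoch for 3 epochs.
print(lr_after(3 * 2_000))
```

Because the scheduler steps once per batch rather than once per epoch, a short run only covers a small part of this decay curve.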


In [None]:
# Set a fixed random seed for reproducibility of experiments
from nn.utils.init import set_random_seed
set_random_seed(0)

# Importing configuration setups for ML parameters and data
import torch
from tools.setting.ml_params import MLParameters
from tools.setting.data_config import DataConfig
from trainer_hub import TrainerHub as CausalTrainer
device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 

def initialize_causal_trainer(task_type, target_scale):  
    num_classes = 1

    # Configuration for the data handling, defining dataset specifics and the task type
    data_config = DataConfig(dataset_name='CreditCardFraudDetection', task_type=task_type, obs_shape=[num_features], 
                            label_size=1, label_scale= target_scale, explain_size=num_features - num_classes)

    # Use TabNet for the CCNet networks and no separate encoder network
    ml_params = MLParameters(ccnet_network='tabnet', encoder_network='none')

    # Training parameters: number of epochs, number of layers, and error function
    ml_params.training.num_epoch = 1
    ml_params.model.ccnet_config.num_layers = 4
    ml_params.algorithm.error_function = 'mse'

    # Create a TrainerHub instance to manage training and data processing
    causal_trainer = CausalTrainer(ml_params, data_config, device, use_print=True, use_wandb=False)
    
    return causal_trainer


In [None]:
import matplotlib.pyplot as plt
import numpy as np
import torch
from sklearn.model_selection import train_test_split

def train_causal_and_prediction_models(X, selected_y, non_selected_y, task_type_selected, task_type_non_selected, scale_selected, scale_non_selected, num_epoch = 5):
    # Split data for "selected" target
    train_x_selected, test_x_selected, train_y_selected, test_y_selected = train_test_split(
        X, selected_y, test_size=0.5, shuffle=False
    )

    # Split data for "non_selected" target
    train_x_non_selected, test_x_non_selected, train_y_non_selected, test_y_non_selected = train_test_split(
        X, non_selected_y, test_size=0.5, shuffle=False
    )

    # Create LabeledDataset instances
    causal_model_dataset = LabeledDataset(train_x_selected, train_y_selected)

    # Initialize causal trainer
    causal_trainer = initialize_causal_trainer(task_type_selected, scale_selected)

    # Initialize the plot
    fig, axs, ax, line, annot = initialize_plot()

    # Create dictionaries to store results
    selected_results_dict = {'accuracy': [], 'f1_score': [], 'mse': [], 'mae': []}
    non_selected_results_dict = {'accuracy': [], 'f1_score': [], 'mse': [], 'mae': []}

    for epoch in range(num_epoch):
        if epoch > 0:
            print(f"Training causal model at epoch {epoch}...")
            causal_trainer.train(causal_model_dataset)
        causal_model = causal_trainer.ccnet

        selected_prediction_model = PredictionModel(causal_model.explain_size, 1, task_type=task_type_selected).to(device)
        non_selected_prediction_model = PredictionModel(causal_model.explain_size, 1, task_type=task_type_non_selected).to(device)

        # Train and evaluate classifiers on the explanation datasets
        print("Training causal classifier on selected attributes...")
        train_explain_selected = causal_model.explain(train_x_selected.to(device))
        trainset_explain_selected = LabeledDataset(train_explain_selected.detach().cpu(), train_y_selected)
        test_explain_selected = causal_model.explain(test_x_selected.to(device))
        testset_explain_selected = LabeledDataset(test_explain_selected.detach().cpu(), test_y_selected)
        train_prediction_model(selected_prediction_model, trainset_explain_selected, task_type_selected, scale_selected, device = device)

        print("Training classifier on non-selected attributes...")
        train_explain_non_selected = causal_model.explain(train_x_non_selected.to(device))
        trainset_explain_non_selected = LabeledDataset(train_explain_non_selected.detach().cpu(), train_y_non_selected)
        test_explain_non_selected = causal_model.explain(test_x_non_selected.to(device))
        testset_explain_non_selected = LabeledDataset(test_explain_non_selected.detach().cpu(), test_y_non_selected)
        train_prediction_model(non_selected_prediction_model, trainset_explain_non_selected, task_type_non_selected, scale_non_selected, device = device)

        # Test classifiers
        test_prediction_models(epoch, axs, 
                               selected_prediction_model, non_selected_prediction_model,
                               testset_explain_selected, testset_explain_non_selected,
                               task_type_selected, task_type_non_selected, 
                               scale_selected, scale_non_selected, 
                               selected_results_dict, non_selected_results_dict, 
                               device)

        # Update the plot
        display_plot(fig)

    plt.ioff()  # Turn off interactive mode
    plt.show()


In [None]:
train_causal_and_prediction_models(X, y_class, y_amount, task_type_class, task_type_amount, scale_class, scale_amount)

In [None]:
train_causal_and_prediction_models(X, y_amount, y_class, task_type_amount, task_type_class, scale_amount, scale_class)

# Credit Card Fraud Detection: Handling Imbalanced Dataset with CCNet

## Introduction

This tutorial explores the use of a Cooperative Causal Network (CCNet) to address the challenges of imbalanced datasets in credit card fraud detection. By generating data with the trained model, we aim to increase the diversity and volume of the training data and thereby improve the robustness and accuracy of models that identify fraudulent transactions.

## Tutorial Goals

The objectives of this tutorial are designed to guide you through the process of enhancing data quality and model performance:

### Dataset Recreation with CCNet
- **Understand Data Augmentation**: Learn how encoding techniques can be used to generate synthetic data instances that closely mimic the characteristics of real-world fraudulent and non-fraudulent transactions.
- **Impact on Model Training**: Assess how augmenting the dataset influences the training process and subsequently, the model's ability to generalize from training to real-world scenarios.

### Model Training and Evaluation
- **Dual Model Training**: Train two distinct models to directly compare performance metrics:
  - A model trained on the **original dataset**.
  - A model trained on the **CCNet-augmented dataset**.
- **Performance Metrics**: Use the F1 score, a critical measure for models operating on imbalanced datasets, to evaluate and compare the effectiveness of these models.
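
A small sketch shows why accuracy alone is misleading on imbalanced data and why F1 is the metric of choice here. The helper names and the toy test set are hypothetical; the counts are chosen only to make the arithmetic easy to follow.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = fraud)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn


def accuracy_and_f1(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    acc = (tp + tn) / len(y_true)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return acc, f1


# Hypothetical test set: 5 frauds among 1000 transactions.
y_true = [1] * 5 + [0] * 995
always_legit = [0] * 1000                            # never flags fraud
catches_some = [1, 1, 1, 0, 0] + [0] * 993 + [1, 1]  # 3 hits, 2 misses, 2 false alarms

print(accuracy_and_f1(y_true, always_legit))  # (0.995, 0.0)
print(accuracy_and_f1(y_true, catches_some))  # (0.996, 0.6)
```

The two predictors are nearly indistinguishable by accuracy, while F1 separates the one that never detects fraud from the one that detects most of it.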

### Testing and Validation
- **Independent Model Testing**: Conduct a thorough evaluation of both models using a standalone test set that was not involved in the training phase.
- **Objective Analysis**: Critically analyze the outcomes to validate whether data augmentation through CCNet offers a tangible benefit in detecting credit card fraud.

## Conclusion

By the end of this tutorial, participants will not only grasp the theoretical underpinnings of using synthetic data to combat data imbalance but also gain hands-on experience in applying these concepts through CCNet to potentially enhance model performance in fraud detection tasks.


In [None]:
# Set a fixed random seed for reproducibility of experiments
from nn.utils.init import set_random_seed
set_random_seed(0)

# Importing configuration setups for ML parameters and data
import torch
from tools.setting.ml_params import MLParameters
from tools.setting.data_config import DataConfig
from trainer_hub import TrainerHub

# Configuration for the data handling, defining dataset specifics and the task type
data_config = DataConfig(dataset_name='CreditCardFraudDetection', task_type='binary_classification', obs_shape=[num_features], label_size=1, explain_size=num_features - 1)

# Use TabNet for the CCNet networks and no separate encoder network
ml_params = MLParameters(ccnet_network='tabnet', encoder_network='none')

# Training parameters: number of epochs, number of layers, and error function
ml_params.training.num_epoch = 4
ml_params.model.ccnet_config.num_layers = 4
ml_params.algorithm.error_function = 'mse'

device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 

# Create a TrainerHub instance to manage training and data processing
trainer_hub = TrainerHub(ml_params, data_config, device, use_print=True, use_wandb=False)


In [None]:
X_train, X_test = train_test_split(X, test_size=0.5, shuffle=False)
y_train, y_test = train_test_split(y_class, test_size=0.5, shuffle=False)

# Labeled datasets for supervised learning tasks
trainset = LabeledDataset(X_train, y_train)  # First half of the data, used for training
testset = LabeledDataset(X_test, y_test)     # Second half, held out for testing

In [None]:
trainer_hub.train(trainset)

In [None]:
import torch
from tools.loader import collate_fn

len_trainset = len(trainset)
train_loader = torch.utils.data.DataLoader(dataset=trainset, batch_size=256, collate_fn=collate_fn, shuffle=False)
causal_model = trainer_hub.ccnet
# Recreate the training set with the trained causal model:
# encode the observations into explanatory factors, then generate data and labels from them
data = X_train.to(device)
explain = causal_model.explain(data)

generated_data, generated_label = causal_model.generate(explain)

recreated_training_data = generated_data.squeeze(1).detach().cpu()
# label_size is 1, so an argmax over the last dimension would always return 0.
# Assumption: generate() returns a fraud probability in [0, 1], thresholded here at 0.5.
recreated_labels = (generated_label.detach().cpu() > 0.5).float().reshape(-1, 1)


print(f"Recreated Training Data Shape: {recreated_training_data.shape}")
print(f"Recreated Labels Shape: {recreated_labels.shape}")


In [None]:
import torch

# Wrap the recreated data and labels (converted to NumPy arrays) in a LabeledDataset
ccnet_recreated_dataset = LabeledDataset(recreated_training_data.numpy(), recreated_labels.numpy())

# Print the shapes of the datasets for verification
print(f"Labeled Original Trainset Shape: {len(trainset)}, {trainset.x.shape[1]}")
print(f"CCNet Recreated Dataset Shape: {len(ccnet_recreated_dataset)}, {ccnet_recreated_dataset.x.shape[1]}")

print(f"Labeled Original Testset Shape: {len(testset)}, {testset.x.shape[1]}")

# Retrieve number of features and classes from the recreated dataset
num_features = recreated_training_data.shape[1]
num_classes = recreated_labels.shape[1]

print(f"Number of Features: {num_features}")
print(f"Number of Classes: {num_classes}")

In [None]:
def train_supervised_model(model, dataset, num_epoch=5):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 
    # Initialize the optimizer
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    # Create DataLoader for batch processing
    trainloader = torch.utils.data.DataLoader(dataset, batch_size=64, shuffle=True)
    model.train()  # Make sure the model is in training mode
    # Training loop
    for epoch in range(num_epoch):
        for i, (data, label) in enumerate(trainloader):
            data = data.to(device).clone().detach()
            label = label.to(device).float()
            # Perform forward pass
            output = model(data)
            # Compute loss
            loss = torch.nn.functional.binary_cross_entropy(output, label)
            # Backward pass to compute gradients
            loss.backward()
            # Update weights
            optimizer.step()
            # Reset gradients
            optimizer.zero_grad()


In [None]:
# Initialize and train a model on the recreated dataset
model_trained_on_recreated = PredictionModel(input_size=num_features, output_size=num_classes, task_type='binary_classification').to(device)
train_supervised_model(model_trained_on_recreated, ccnet_recreated_dataset)

# Initialize and train a model on the original dataset
model_trained_on_original = PredictionModel(input_size=num_features, output_size=num_classes, task_type='binary_classification').to(device)
train_supervised_model(model_trained_on_original, trainset)

In [None]:
from sklearn.metrics import f1_score

def get_f1_score(model, testset, batch_size=256):
    model.eval()  # Set the model to evaluation mode
    y_true = []
    y_pred = []
    # DataLoader for testing
    test_loader = torch.utils.data.DataLoader(testset, batch_size=batch_size, shuffle=False)

    # No gradient computation needed during inference
    with torch.no_grad():
        for data, label in test_loader:
            data = data.to(device)
            label = label.to(device)
            output = model(data)
            # Process output for binary classification
            # Flatten with reshape(-1) so a final batch of size 1 still yields a 1-D array
            predicted = (output.reshape(-1) > 0.5).long()
            y_true.extend(label.reshape(-1).long().cpu().numpy())
            y_pred.extend(predicted.cpu().numpy())

    # Compute and return the F1 score
    score = f1_score(y_true, y_pred, average='binary')
    return score

# Calculate F1 scores for both models
f1_score_original = get_f1_score(model_trained_on_original, testset)
f1_score_recreated = get_f1_score(model_trained_on_recreated, testset)

# Output the results
print("F1 score of the supervised learning model trained on the original data: ", f1_score_original)
print("F1 score of the supervised learning model trained on the recreated data: ", f1_score_recreated)
