# TO DO

1. finish objective function - look at code on how to call dataloaders and our custom classes on chatgpt
2. get it to plot train loss vs val loss on best model???
3. get it to spit out accuracy, hit rate, f1 and ROC AUC. get it to plot ROC AUC graph??
4. get it to output best HP combo
5. check paper for other tips and tricks they used.
6. look into Cell-based Architectures: Explore more structured approaches for dynamic architectures, such as cell-based architectures (like those used in NASNet or EfficientNet), which offer a balance between flexibility and control



---

# Intro

**Plan**: Import credit card fraud data. Use encoder only transformer network for classifying time series credit card data

**Purpose/Intro**: Task is to develop transformer architecture proof of concept for potential application at work, detecting fraud. In a normal data science project it might be considered best practice to begin with more interpretable models first, for research purposes, but this project is solely for the purpose of assessing the viability of a transformer for this task.

**Hypothesis**: The attention mechanism of the transformer, when combined with an appropriate positional embedding method, is able to capture both long-term and short-term dependencies in time series credit-card fraud data.

**Methodology**: Using cross valdiation techniques on test dataset to calculate appropriate accuracy metrics (adjusting for the significant class imbalance for the dataset), with an aim to assess the viability of transformer networks for fraud classification.

Credit to the below paper, **referred to as Source 1**, for the methodology design: Yu, C., Xu, Y., Cao, J., Zhang, Y., Jin, Y. and Zhu, M. (2024) 'Credit Card Fraud Detection Using Advanced Transformer Model', arXiv preprint arXiv:2406.03733. Available at: https://arxiv.org/abs/2406.03733 (Accessed: 18 December 2024)

This paper has demonstrated the utility for the transformer that we are about to create, by comparing the methodology with various other shallow learning techniques. In future projects I aim to validate this myself.





---

# Data Sourcing and Package loading



In [16]:

#import packages:

import os
os.environ['NUMBAPRO_NVVM'] = '/usr/local/cuda/nvvm/lib64/libnvvm.so'
os.environ['NUMBAPRO_LIBDEVICE'] = '/usr/local/cuda/nvvm/libdevice/'

from google.colab import drive

import locale
def getpreferredencoding(do_setlocale = True):
    return "UTF-8"
locale.getpreferredencoding = getpreferredencoding

try:
  import google.colab
  IN_COLAB = True
except:
  IN_COLAB = False

if IN_COLAB:
  # Check if drive is mounted by looking for the mount point in the file system.
  import os
  if not os.path.exists('/content/drive'):
    drive.mount('/content/drive')

#basics
import os
from google.colab import drive
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
!pip install optuna
import optuna

#cuML
import cudf
from cudf import DataFrame as cudfDataFrame
import cupy as cp
from imblearn.over_sampling import SMOTE, RandomOverSampler

#table one
!pip install tableone
from tableone import TableOne

#torch
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader, TensorDataset


#sklearn
from imblearn.over_sampling import RandomOverSampler
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split, RepeatedStratifiedKFold, KFold
from sklearn.metrics import accuracy_score, roc_auc_score, precision_score, recall_score, f1_score
from sklearn.preprocessing import MinMaxScaler

from imblearn.over_sampling import RandomOverSampler



In [11]:
data_set_filepath = '/content/drive/MyDrive/Colab_Notebooks/Data/creditcard.feather'

df = pd.read_feather(data_set_filepath)

missing_values = df.isnull().sum()

if missing_values.any():  # Check if any missing values exist
    print("Missing Values Found:")
    print(missing_values)
else:
    print("No Missing Values Found")

columns = df.columns.tolist()

print(f"The dataset lenghth is {str(len(df))}")
print(f"The number of columns is {str(len(columns))}")
print(f"The column names are {str(columns)}")
df.head(10)

#table1 = TableOne(df, columns=columns, groupby= 'Class', pval=True)
#print(table1)

data = df





No Missing Values Found
The dataset lenghth is 284807
The number of columns is 31
The column names are ['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount', 'Class']




---


# Data loading and preprocessing:
In Source 1 (listed above), it was found that there are performance boosts associated with removing outliers, as it may help with overfitting. This will be done on the training data only. This is to prevent information leakage from our training set.

The source also suggests there is value in oversampling the minority class. This may be due to the unique challenges of such a large class imbalance. This will be done on the training data only. This is to prevent information leakage from our training set.

In addition, we will min-max scale our validation and training sets, and apply this same scaling to the test data.

In [17]:

# Set device for GPU acceleration
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Warning if no GPU is detected
if device.type != 'cuda':
    print("WARNING: GPU is not available. The model will run on the CPU, which might be slower.")
else:
    print("Cuda setup successful")

# Separate data into features and targets
X = data.iloc[:, :-1]  # Features (all columns except the last one)
y = data.iloc[:, -1]   # Labels (the last column)

# Split into train/val and test sets
X, X_test, y, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Data Preprocessing Transformation Class
class DataPreprocessingTransform:
    def __init__(self):
        self.scaler = MinMaxScaler()

    def fit_transform(self, X_train):
        """
        Fit the scaler on the training set and transform it.
        Returns a numpy array.
        """
        # Ensure data is a numpy array
        if isinstance(X_train, pd.DataFrame):
            X_train = X_train.values
        X_scaled = self.scaler.fit_transform(X_train)
        return X_scaled

    def transform(self, X):
        """
        Apply scaling transformation based on the training set scaling.
        Returns a numpy array.
        """
        if isinstance(X, pd.DataFrame):
            X = X.values
        X_scaled = self.scaler.transform(X)
        return X_scaled

# Custom PyTorch Dataset
class CustomDataset(Dataset):
    def __init__(self, X, y, transform=None, resample=False, device="cuda"):
        """
        Custom dataset to handle preprocessing, transformations, and oversampling.

        Parameters:
        - X (pd.DataFrame or np.ndarray): Features.
        - y (pd.Series or np.ndarray): Labels.
        - transform (callable): A function to apply to the features.
        - resample (bool): Whether to apply random oversampling to balance classes.
        - device (str): Device to which the data should be moved (e.g., 'cuda' or 'cpu').
        """
        # Ensure X and y are numpy arrays
        if isinstance(X, pd.DataFrame):
            X = X.values
        if isinstance(y, pd.Series):
            y = y.values

        # Apply random oversampling if needed
        if resample:
            X, y = self._apply_random_oversampler(X, y)

        # Apply optional transformation (e.g., scaling)
        if transform:
            X = transform(X)

        # Convert to torch tensors and move to device
        self.X = torch.tensor(X, dtype=torch.float32, device=device)
        self.y = torch.tensor(y, dtype=torch.float32, device=device)

    def __len__(self):
        return len(self.X)

    def __getitem__(self, index):
        return self.X[index], self.y[index]

    @staticmethod
    def _apply_random_oversampler(X, y):
        """
        Applies random oversampling to balance the dataset.
        """
        ros = RandomOverSampler(random_state=42)
        X_resampled, y_resampled = ros.fit_resample(X, y)
        return X_resampled, y_resampled

# Function to remove outliers using IQR on GPU (optional)

def remove_outliers_gpu(X):
    """
    Removes outliers from the dataset based on the IQR method using cuDF for GPU acceleration.
    This function expects a pandas DataFrame as input, converts it to cuDF, performs outlier removal,
    and returns a pandas DataFrame.

    Steps:
    1. Convert pandas DataFrame (X) to cuDF DataFrame for GPU computations.
    2. Compute the 1st (Q1) and 3rd (Q3) quartiles.
    3. Calculate the Interquartile Range (IQR = Q3 - Q1).
    4. Determine the lower and upper bounds for outliers (Q1 - 1.5*IQR and Q3 + 1.5*IQR).
    5. Fill NaNs with median values to avoid issues with NaNs during filtering.
    6. Create a boolean mask to keep only rows within the [lower_bound, upper_bound] range for all columns.
    7. Apply the mask and drop rows that become NaN after filtering (i.e., outliers).
    8. Convert the filtered cuDF DataFrame back to a pandas DataFrame.
    9. Return the pandas DataFrame with outliers removed.

    Parameters:
    - X (pd.DataFrame): A pandas DataFrame containing the data from which outliers need to be removed.

    Returns:
    - X_filtered (pd.DataFrame): A pandas DataFrame with outliers removed.
    """
    # Convert pandas DataFrame to cuDF DataFrame for GPU operations
    X_cudf = cudf.DataFrame(X)

    # Compute quartiles and IQR
    Q1 = X_cudf.quantile(0.25)
    Q3 = X_cudf.quantile(0.75)
    IQR = Q3 - Q1

    # Determine outlier bounds
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Fill NaNs with median to avoid filtering issues
    X_filled = X_cudf.fillna(X_cudf.median())

    # Create a boolean mask to keep rows within the outlier bounds for all columns
    mask = (X_filled >= lower_bound) & (X_filled <= upper_bound)

    # Apply mask; 'other=None' sets out-of-bound values to None which are dropped subsequently
    X_filtered_cudf = X_filled.where(mask, other=None).dropna()

    # Convert back to pandas DataFrame
    X_filtered = X_filtered_cudf.to_pandas()

    return X_filtered


Cuda setup successful




---
# Transformer Model



We are going to implement the transformer and optimise the hyperparameters using the Optuna package. As per the paper listed above, we will resample from the minority class when training.

In [21]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import KFold
import optuna
import matplotlib.pyplot as plt

# Assuming these are defined from previous code snippets:
# from data_import_code import X, y, X_test, y_test, DataPreprocessingTransform, CustomDataset, remove_outliers_gpu, device, epochs, patience

#===================================== MODEL DEFINITION =====================================#
class FraudDetectionTransformer(nn.Module):
    def __init__(self, input_dim, embed_dim, num_heads, ff_dim_base, dropout, ff_dropout, activation_function, num_ff_layers, use_batchnorm, use_layernorm):
        super(FraudDetectionTransformer, self).__init__()

        # Embedding layer to project input features to embed_dim
        self.embedding = nn.Linear(input_dim, embed_dim)

        # Optional batch normalization on input features
        self.use_batchnorm = use_batchnorm
        self.batch_norm = nn.BatchNorm1d(input_dim) if use_batchnorm else None

        # Multi-head attention layer
        self.multihead_attn = nn.MultiheadAttention(embed_dim, num_heads, dropout=dropout)

        # Build the feed-forward network layers
        ff_layers = []
        current_dim = embed_dim
        for i in range(num_ff_layers):
            # Increase dimension according to ff_dim_base * 2^i
            next_dim = ff_dim_base * (2 ** i)
            ff_layers.append(nn.Linear(current_dim, next_dim))
            ff_layers.append(self.get_activation_function(activation_function))
            ff_layers.append(nn.Dropout(ff_dropout))
            current_dim = next_dim

        # **Important Fix**:
        # Add a final linear layer to bring the dimension back to embed_dim.
        # This ensures ff_output and x have matching dimensions for the residual connection.
        ff_layers.append(nn.Linear(current_dim, embed_dim))

        self.ff_network = nn.Sequential(*ff_layers)

        # Optional layer normalization
        self.use_layernorm = use_layernorm
        self.layer_norm = nn.LayerNorm(embed_dim) if use_layernorm else None

        # Adaptive average pooling to aggregate sequence dimension
        self.pooling = nn.AdaptiveAvgPool1d(1)

        # Final classification layer
        self.fc = nn.Linear(embed_dim, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # Apply batch normalization if enabled
        if self.use_batchnorm:
            x = self.batch_norm(x)

        # Embed the input
        x = self.embedding(x)  # shape: (batch, embed_dim)

        # Transformer expects (seq_len, batch, embed_dim)
        # Current shape: (batch, embed_dim) -> add seq dim at dim=1 -> (batch, 1, embed_dim)
        # Then permute: (1, batch, embed_dim)
        x = x.unsqueeze(1).permute(1, 0, 2)

        # Multi-head self-attention
        attn_output, _ = self.multihead_attn(x, x, x)
        x = x + attn_output  # Residual connection

        # Feed-forward network
        # Permute to (batch, seq, embed_dim) to feed into ff_network
        ff_output = self.ff_network(x.permute(1, 0, 2))

        # Residual connection: ff_output and x now match in embed_dim after the fix
        x = x.permute(1, 0, 2) + ff_output

        # Optional layer normalization
        if self.use_layernorm:
            x = self.layer_norm(x)

        # Pooling: (batch, seq, embed_dim) -> permute to (batch, embed_dim, seq)
        # AdaptiveAvgPool1d(1) -> (batch, embed_dim, 1) -> squeeze -> (batch, embed_dim)
        x = self.pooling(x.permute(0, 2, 1)).squeeze()

        # Final linear layer and sigmoid
        x = self.fc(x)
        return self.sigmoid(x)

    def get_activation_function(self, activation_function):
        if activation_function == "relu":
            return nn.ReLU()
        elif activation_function == "tanh":
            return nn.Tanh()
        elif activation_function == "sigmoid":
            return nn.Sigmoid()
        else:
            raise ValueError("Unsupported activation function!")
#===================================== TRAINING WITH EARLY STOPPING =====================================#
def train_model_with_early_stopping(model, train_loader, val_loader, criterion, optimizer, epochs, patience):
    """
    Train the model with early stopping. Early stopping will stop training if the validation
    loss doesn't improve after a certain number of epochs (patience).
    """
    best_val_loss = float('inf')
    best_model = None
    patience_counter = 0
    train_losses = []
    val_losses = []

    for epoch in range(epochs):
        model.train()
        running_loss = 0.0

        # Training loop
        for inputs, labels in train_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            optimizer.zero_grad()
            outputs = model(inputs).squeeze()
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()

        # Compute average training loss for this epoch
        train_epoch_loss = running_loss / len(train_loader)
        train_losses.append(train_epoch_loss)

        # Validation loop
        model.eval()
        val_loss = 0.0
        with torch.no_grad():
            for inputs, labels in val_loader:
                inputs, labels = inputs.to(device), labels.to(device)
                outputs = model(inputs).squeeze()
                loss = criterion(outputs, labels)
                val_loss += loss.item()

        # Compute average validation loss
        val_epoch_loss = val_loss / len(val_loader)
        val_losses.append(val_epoch_loss)

        # Check for improvement in validation loss
        if val_epoch_loss < best_val_loss:
            best_val_loss = val_epoch_loss
            best_model = model.state_dict()
            patience_counter = 0
        else:
            patience_counter += 1
            if patience_counter >= patience:
                print(f"Early stopping at epoch {epoch + 1}")
                break

    # Load the best model state found during training
    model.load_state_dict(best_model)
    return train_losses, val_losses, best_model

#===================================== EVALUATION =====================================#
def evaluate(model, data_loader):
    """
    Evaluate the model on a given dataset. Returns accuracy, hit_rate, and ROC AUC.
    """
    model.eval()
    y_true, y_pred = [], []
    with torch.no_grad():
        for inputs, labels in data_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            outputs = model(inputs).squeeze()
            y_true.extend(labels.cpu().numpy())
            y_pred.extend(outputs.cpu().numpy())

    # Convert predicted probabilities to binary labels
    y_pred_binary = [1 if p >= 0.5 else 0 for p in y_pred]

    # Compute metrics
    accuracy = accuracy_score(y_true, y_pred_binary)
    # Hit rate: Among the actual fraud cases (y=1), how many did we correctly predict as fraud?
    hit_rate = sum((np.array(y_true) == 1) & (np.array(y_pred_binary) == 1)) / sum(np.array(y_true) == 1)
    roc_auc = roc_auc_score(y_true, y_pred)

    return accuracy, hit_rate, roc_auc

def plot_losses(train_losses, val_losses):
    """
    Plot training and validation loss over epochs.
    """
    plt.figure(figsize=(10, 6))
    plt.plot(train_losses, label="Training Loss")
    plt.plot(val_losses, label="Validation Loss")
    plt.xlabel("Epochs")
    plt.ylabel("Loss")
    plt.title("Training and Validation Loss")
    plt.legend()
    plt.show()

#===================================== OPTUNA OBJECTIVE FUNCTION =====================================#
def objective(trial):
    """
    Objective function for Optuna optimization. Defines the hyperparameters to tune,
    runs K-fold cross-validation, and returns the average validation score across folds.
    """

    # Suggest hyperparameters for this trial
    num_heads = trial.suggest_int("num_heads", low=4, high=8, step=2)
    embed_dim = trial.suggest_int("embed_dim", low=64, high=512, step=num_heads)
    ff_dim_base = trial.suggest_int("ff_dim_base", low=64, high=512, step=64)
    dropout = trial.suggest_float("dropout", low=0.1, high=0.5, step=0.1)
    ff_dropout = trial.suggest_float("ff_dropout", low=0.1, high=0.5, step=0.1)
    activation_function = trial.suggest_categorical("activation_function", ["relu", "tanh", "sigmoid"])
    num_ff_layers = trial.suggest_int("num_ff_layers", low=1, high=3, step=1)
    use_batchnorm = trial.suggest_categorical("use_batchnorm", [True, False])
    use_layernorm = trial.suggest_categorical("use_layernorm", [True, False])
    batch_size = trial.suggest_int("batch_size", 64, 512, step=64)

    # New: Tune the learning rate as well
    # Using a log-uniform suggestion: a typical range for LR might be [1e-5, 1e-2]
    lr = trial.suggest_float("lr", 1e-5, 1e-2, log=True)

    print(f"Trial {trial.number}:")
    print(f"  num_heads: {num_heads}")
    print(f"  embed_dim: {embed_dim}")
    print(f"  ff_dim_base: {ff_dim_base}")
    print(f"  dropout: {dropout}")
    print(f"  ff_dropout: {ff_dropout}")
    print(f"  activation_function: {activation_function}")
    print(f"  num_ff_layers: {num_ff_layers}")
    print(f"  use_batchnorm: {use_batchnorm}")
    print(f"  use_layernorm: {use_layernorm}")
    print(f"  batch_size: {batch_size}")
    print(f"  lr: {lr}")
    print("-" * 50)

    # K-Fold Cross Validation
    kf = KFold(n_splits=5, shuffle=True, random_state=42)
    fold_scores = []
    best_val_score = -float('inf')
    best_model_state_dict = None

    for train_index, val_index in kf.split(X):
        # Split the data into training and validation sets for this fold
        X_train, X_val = X.iloc[train_index], X.iloc[val_index]
        y_train, y_val = y.iloc[train_index], y.iloc[val_index]

        # Preprocess the data (scaling)
        transform = DataPreprocessingTransform()
        X_train_scaled = transform.fit_transform(X_train)  # numpy array
        X_val_scaled = transform.transform(X_val)          # numpy array

        X_train_scaled_df = pd.DataFrame(X_train_scaled, index=X_train.index, columns=X_train.columns)
        X_val_scaled_df = pd.DataFrame(X_val_scaled, index=X_val.index, columns=X_val.columns)

        # Now call remove_outliers_gpu on the DataFrames
        X_train_filtered = remove_outliers_gpu(X_train_scaled_df)
        X_val_filtered = remove_outliers_gpu(X_val_scaled_df)

        # Align indices after outlier removal since both are DataFrames
        common_index = X_train_filtered.index.intersection(y_train.index)
        X_train_filtered = X_train_filtered.loc[common_index]
        y_train = y_train.loc[common_index]

        common_val_index = X_val_filtered.index.intersection(y_val.index)
        X_val_filtered = X_val_filtered.loc[common_val_index]
        y_val = y_val.loc[common_val_index]

        # Now you can create datasets as before
        train_dataset = CustomDataset(X_train_filtered, y_train, resample=True)
        val_dataset = CustomDataset(X_val_filtered, y_val)

        # Create DataLoaders
        train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
        val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)

        # Initialize the model
        model = FraudDetectionTransformer(
            input_dim=X_train.shape[1],
            embed_dim=embed_dim,
            num_heads=num_heads,
            ff_dim_base=ff_dim_base,
            dropout=dropout,
            ff_dropout=ff_dropout,
            activation_function=activation_function,
            num_ff_layers=num_ff_layers,
            use_batchnorm=use_batchnorm,
            use_layernorm=use_layernorm
        ).to(device)

        # Define optimizer and loss
        optimizer = optim.Adam(model.parameters(), lr=lr)
        criterion = nn.BCELoss()

        # Train the model with early stopping
        # For simplicity, define epochs and patience here or assume globally defined
        fold_epochs = 80
        fold_patience = 10
        train_losses, val_losses, _ = train_model_with_early_stopping(
            model, train_loader, val_loader, criterion, optimizer, epochs=fold_epochs, patience=fold_patience
        )

        # Evaluate the model on the validation set
        accuracy, hit_rate, roc_auc = evaluate(model, val_loader)

        # Use ROC AUC as the metric to maximize
        val_score = roc_auc

        # Keep track of the best model across folds
        if val_score > best_val_score:
            best_val_score = val_score
            best_model_state_dict = model.state_dict()

        fold_scores.append(val_score)

    # Compute average score across all folds
    avg_fold_score = np.mean(fold_scores)

    # Save the best model found during the folds
    if best_model_state_dict is not None:
        torch.save(best_model_state_dict, 'best_model.pth')

    return avg_fold_score

# Run Optuna optimization
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)

# Output the best parameters
print("Best hyperparameters:", study.best_params)

#=============================================== Test on Holdout Set ===============================================#
best_params = study.best_params

# Build best model with tuned hyperparameters
best_model = FraudDetectionTransformer(
    input_dim=X.shape[1],
    embed_dim=best_params['embed_dim'],
    num_heads=best_params['num_heads'],
    ff_dim_base=best_params['ff_dim_base'],
    dropout=best_params['dropout'],
    ff_dropout=best_params['ff_dropout'],
    activation_function=best_params['activation_function'],
    num_ff_layers=best_params['num_ff_layers'],
    use_batchnorm=best_params['use_batchnorm'],
    use_layernorm=best_params['use_layernorm']
).to(device)

# Preprocess train and test data using the same pipeline
transform = DataPreprocessingTransform()
X_scaled = transform.fit_transform(X)
X_test_scaled = transform.transform(X_test)

# Convert to DataFrame before outlier removal
X_scaled_df = pd.DataFrame(X_scaled, index=X.index, columns=X.columns)
X_test_scaled_df = pd.DataFrame(X_test_scaled, index=X_test.index, columns=X_test.columns)

# Remove outliers
X_filtered = remove_outliers_gpu(X_scaled_df)
common_index = X_filtered.index.intersection(y.index)
X_filtered = X_filtered.loc[common_index]
y = y.loc[common_index]

X_test_filtered = remove_outliers_gpu(X_test_scaled_df)
common_test_index = X_test_filtered.index.intersection(y_test.index)
X_test_filtered = X_test_filtered.loc[common_test_index]
y_test = y_test.loc[common_test_index]

# Create Datasets and Loaders
train_val_dataset = CustomDataset(X_filtered, y, resample=True)
test_dataset = CustomDataset(X_test_filtered, y_test)

train_val_loader = DataLoader(train_val_dataset, batch_size=best_params["batch_size"], shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=best_params["batch_size"], shuffle=False)

# Final training on combined train+val (if desired)
final_optimizer = optim.Adam(best_model.parameters(), lr=best_params['lr'])
final_criterion = nn.BCELoss()

train_losses, val_losses, best_model_state_dict = train_model_with_early_stopping(
    best_model, train_val_loader, test_loader, criterion=final_criterion, optimizer=final_optimizer,
    epochs=epochs, patience=patience
)

# Load the best model weights after final training
best_model.load_state_dict(best_model_state_dict)

# Evaluate on Test Set
test_accuracy, test_hit_rate, test_roc_auc = evaluate(best_model, test_loader)

print("Test Accuracy:", test_accuracy)
print("Test Hit Rate:", test_hit_rate)
print("Test ROC AUC:", test_roc_auc)


[I 2025-01-06 17:54:50,688] A new study created in memory with name: no-name-15f8a0d0-8014-4967-9833-a0bb44b1f627


Trial 0:
  num_heads: 4
  embed_dim: 492
  ff_dim_base: 384
  dropout: 0.30000000000000004
  ff_dropout: 0.5
  activation_function: tanh
  num_ff_layers: 2
  use_batchnorm: True
  use_layernorm: False
  batch_size: 192
  lr: 0.0003533678466831856
--------------------------------------------------




Early stopping at epoch 21




Early stopping at epoch 14




Early stopping at epoch 24


  hit_rate = sum((np.array(y_true) == 1) & (np.array(y_pred_binary) == 1)) / sum(np.array(y_true) == 1)


Early stopping at epoch 28




Early stopping at epoch 19


[W 2025-01-06 18:05:43,655] Trial 0 failed with parameters: {'num_heads': 4, 'embed_dim': 492, 'ff_dim_base': 384, 'dropout': 0.30000000000000004, 'ff_dropout': 0.5, 'activation_function': 'tanh', 'num_ff_layers': 2, 'use_batchnorm': True, 'use_layernorm': False, 'batch_size': 192, 'lr': 0.0003533678466831856} because of the following error: The value nan is not acceptable.
[W 2025-01-06 18:05:43,656] Trial 0 failed with value nan.


Trial 1:
  num_heads: 4
  embed_dim: 184
  ff_dim_base: 64
  dropout: 0.30000000000000004
  ff_dropout: 0.2
  activation_function: sigmoid
  num_ff_layers: 2
  use_batchnorm: False
  use_layernorm: True
  batch_size: 448
  lr: 0.003867546470789352
--------------------------------------------------




Early stopping at epoch 13




Early stopping at epoch 18




Early stopping at epoch 11


  hit_rate = sum((np.array(y_true) == 1) & (np.array(y_pred_binary) == 1)) / sum(np.array(y_true) == 1)


Early stopping at epoch 18




Early stopping at epoch 16


[W 2025-01-06 18:10:38,632] Trial 1 failed with parameters: {'num_heads': 4, 'embed_dim': 184, 'ff_dim_base': 64, 'dropout': 0.30000000000000004, 'ff_dropout': 0.2, 'activation_function': 'sigmoid', 'num_ff_layers': 2, 'use_batchnorm': False, 'use_layernorm': True, 'batch_size': 448, 'lr': 0.003867546470789352} because of the following error: The value nan is not acceptable.
[W 2025-01-06 18:10:38,633] Trial 1 failed with value nan.


Trial 2:
  num_heads: 6
  embed_dim: 276
  ff_dim_base: 384
  dropout: 0.1
  ff_dropout: 0.1
  activation_function: relu
  num_ff_layers: 1
  use_batchnorm: False
  use_layernorm: False
  batch_size: 320
  lr: 0.00017415047561694403
--------------------------------------------------




Early stopping at epoch 31




Early stopping at epoch 39




Early stopping at epoch 68


  hit_rate = sum((np.array(y_true) == 1) & (np.array(y_pred_binary) == 1)) / sum(np.array(y_true) == 1)


Early stopping at epoch 44




Early stopping at epoch 15


[W 2025-01-06 18:24:03,026] Trial 2 failed with parameters: {'num_heads': 6, 'embed_dim': 280, 'ff_dim_base': 384, 'dropout': 0.1, 'ff_dropout': 0.1, 'activation_function': 'relu', 'num_ff_layers': 1, 'use_batchnorm': False, 'use_layernorm': False, 'batch_size': 320, 'lr': 0.00017415047561694403} because of the following error: The value nan is not acceptable.
[W 2025-01-06 18:24:03,027] Trial 2 failed with value nan.


Trial 3:
  num_heads: 8
  embed_dim: 216
  ff_dim_base: 192
  dropout: 0.1
  ff_dropout: 0.1
  activation_function: tanh
  num_ff_layers: 3
  use_batchnorm: True
  use_layernorm: False
  batch_size: 384
  lr: 3.0786096741823035e-05
--------------------------------------------------




Early stopping at epoch 57




Early stopping at epoch 44




Early stopping at epoch 60


  hit_rate = sum((np.array(y_true) == 1) & (np.array(y_pred_binary) == 1)) / sum(np.array(y_true) == 1)


Early stopping at epoch 41




Early stopping at epoch 49


[W 2025-01-06 18:40:40,926] Trial 3 failed with parameters: {'num_heads': 8, 'embed_dim': 216, 'ff_dim_base': 192, 'dropout': 0.1, 'ff_dropout': 0.1, 'activation_function': 'tanh', 'num_ff_layers': 3, 'use_batchnorm': True, 'use_layernorm': False, 'batch_size': 384, 'lr': 3.0786096741823035e-05} because of the following error: The value nan is not acceptable.
[W 2025-01-06 18:40:40,927] Trial 3 failed with value nan.


Trial 4:
  num_heads: 4
  embed_dim: 96
  ff_dim_base: 64
  dropout: 0.2
  ff_dropout: 0.30000000000000004
  activation_function: tanh
  num_ff_layers: 3
  use_batchnorm: True
  use_layernorm: True
  batch_size: 256
  lr: 9.405361764739654e-05
--------------------------------------------------




Early stopping at epoch 49




Early stopping at epoch 29




Early stopping at epoch 64


  hit_rate = sum((np.array(y_true) == 1) & (np.array(y_pred_binary) == 1)) / sum(np.array(y_true) == 1)


Early stopping at epoch 58




Early stopping at epoch 24


[W 2025-01-06 18:59:43,377] Trial 4 failed with parameters: {'num_heads': 4, 'embed_dim': 96, 'ff_dim_base': 64, 'dropout': 0.2, 'ff_dropout': 0.30000000000000004, 'activation_function': 'tanh', 'num_ff_layers': 3, 'use_batchnorm': True, 'use_layernorm': True, 'batch_size': 256, 'lr': 9.405361764739654e-05} because of the following error: The value nan is not acceptable.
[W 2025-01-06 18:59:43,378] Trial 4 failed with value nan.


Trial 5:
  num_heads: 8
  embed_dim: 488
  ff_dim_base: 320
  dropout: 0.4
  ff_dropout: 0.2
  activation_function: relu
  num_ff_layers: 3
  use_batchnorm: True
  use_layernorm: True
  batch_size: 512
  lr: 0.003319619134711062
--------------------------------------------------




Early stopping at epoch 41




Early stopping at epoch 43




Early stopping at epoch 24


  hit_rate = sum((np.array(y_true) == 1) & (np.array(y_pred_binary) == 1)) / sum(np.array(y_true) == 1)


Early stopping at epoch 21




Early stopping at epoch 18


[W 2025-01-06 19:08:49,703] Trial 5 failed with parameters: {'num_heads': 8, 'embed_dim': 488, 'ff_dim_base': 320, 'dropout': 0.4, 'ff_dropout': 0.2, 'activation_function': 'relu', 'num_ff_layers': 3, 'use_batchnorm': True, 'use_layernorm': True, 'batch_size': 512, 'lr': 0.003319619134711062} because of the following error: The value nan is not acceptable.
[W 2025-01-06 19:08:49,704] Trial 5 failed with value nan.


Trial 6:
  num_heads: 6
  embed_dim: 330
  ff_dim_base: 64
  dropout: 0.1
  ff_dropout: 0.1
  activation_function: sigmoid
  num_ff_layers: 3
  use_batchnorm: True
  use_layernorm: False
  batch_size: 320
  lr: 0.00019792198198162415
--------------------------------------------------




Early stopping at epoch 29




Early stopping at epoch 42




Early stopping at epoch 25


  hit_rate = sum((np.array(y_true) == 1) & (np.array(y_pred_binary) == 1)) / sum(np.array(y_true) == 1)


Early stopping at epoch 25




Early stopping at epoch 17


[W 2025-01-06 19:19:07,327] Trial 6 failed with parameters: {'num_heads': 6, 'embed_dim': 334, 'ff_dim_base': 64, 'dropout': 0.1, 'ff_dropout': 0.1, 'activation_function': 'sigmoid', 'num_ff_layers': 3, 'use_batchnorm': True, 'use_layernorm': False, 'batch_size': 320, 'lr': 0.00019792198198162415} because of the following error: The value nan is not acceptable.
[W 2025-01-06 19:19:07,328] Trial 6 failed with value nan.


Trial 7:
  num_heads: 8
  embed_dim: 104
  ff_dim_base: 384
  dropout: 0.4
  ff_dropout: 0.4
  activation_function: sigmoid
  num_ff_layers: 1
  use_batchnorm: False
  use_layernorm: True
  batch_size: 128
  lr: 0.0006104866233913833
--------------------------------------------------




Early stopping at epoch 20




Early stopping at epoch 29




Early stopping at epoch 14


  hit_rate = sum((np.array(y_true) == 1) & (np.array(y_pred_binary) == 1)) / sum(np.array(y_true) == 1)


Early stopping at epoch 28




Early stopping at epoch 17


[W 2025-01-06 19:31:58,704] Trial 7 failed with parameters: {'num_heads': 8, 'embed_dim': 104, 'ff_dim_base': 384, 'dropout': 0.4, 'ff_dropout': 0.4, 'activation_function': 'sigmoid', 'num_ff_layers': 1, 'use_batchnorm': False, 'use_layernorm': True, 'batch_size': 128, 'lr': 0.0006104866233913833} because of the following error: The value nan is not acceptable.
[W 2025-01-06 19:31:58,705] Trial 7 failed with value nan.


Trial 8:
  num_heads: 6
  embed_dim: 318
  ff_dim_base: 512
  dropout: 0.1
  ff_dropout: 0.5
  activation_function: tanh
  num_ff_layers: 1
  use_batchnorm: False
  use_layernorm: False
  batch_size: 512
  lr: 0.0043565020145654694
--------------------------------------------------




Early stopping at epoch 12




Early stopping at epoch 12




Early stopping at epoch 17


  hit_rate = sum((np.array(y_true) == 1) & (np.array(y_pred_binary) == 1)) / sum(np.array(y_true) == 1)


Early stopping at epoch 23




Early stopping at epoch 13


[W 2025-01-06 19:36:12,061] Trial 8 failed with parameters: {'num_heads': 6, 'embed_dim': 322, 'ff_dim_base': 512, 'dropout': 0.1, 'ff_dropout': 0.5, 'activation_function': 'tanh', 'num_ff_layers': 1, 'use_batchnorm': False, 'use_layernorm': False, 'batch_size': 512, 'lr': 0.0043565020145654694} because of the following error: The value nan is not acceptable.
[W 2025-01-06 19:36:12,062] Trial 8 failed with value nan.


Trial 9:
  num_heads: 4
  embed_dim: 168
  ff_dim_base: 384
  dropout: 0.1
  ff_dropout: 0.30000000000000004
  activation_function: tanh
  num_ff_layers: 1
  use_batchnorm: False
  use_layernorm: True
  batch_size: 128
  lr: 1.0048509585797173e-05
--------------------------------------------------




Early stopping at epoch 74


  hit_rate = sum((np.array(y_true) == 1) & (np.array(y_pred_binary) == 1)) / sum(np.array(y_true) == 1)
[W 2025-01-06 20:23:26,130] Trial 9 failed with parameters: {'num_heads': 4, 'embed_dim': 168, 'ff_dim_base': 384, 'dropout': 0.1, 'ff_dropout': 0.30000000000000004, 'activation_function': 'tanh', 'num_ff_layers': 1, 'use_batchnorm': False, 'use_layernorm': True, 'batch_size': 128, 'lr': 1.0048509585797173e-05} because of the following error: The value nan is not acceptable.
[W 2025-01-06 20:23:26,132] Trial 9 failed with value nan.


Trial 10:
  num_heads: 8
  embed_dim: 96
  ff_dim_base: 64
  dropout: 0.30000000000000004
  ff_dropout: 0.1
  activation_function: tanh
  num_ff_layers: 1
  use_batchnorm: True
  use_layernorm: False
  batch_size: 64
  lr: 2.5775382574865433e-05
--------------------------------------------------




Early stopping at epoch 70




Early stopping at epoch 32




Early stopping at epoch 40


  hit_rate = sum((np.array(y_true) == 1) & (np.array(y_pred_binary) == 1)) / sum(np.array(y_true) == 1)


Early stopping at epoch 36




Early stopping at epoch 60


[W 2025-01-06 21:13:26,088] Trial 10 failed with parameters: {'num_heads': 8, 'embed_dim': 96, 'ff_dim_base': 64, 'dropout': 0.30000000000000004, 'ff_dropout': 0.1, 'activation_function': 'tanh', 'num_ff_layers': 1, 'use_batchnorm': True, 'use_layernorm': False, 'batch_size': 64, 'lr': 2.5775382574865433e-05} because of the following error: The value nan is not acceptable.
[W 2025-01-06 21:13:26,089] Trial 10 failed with value nan.


Trial 11:
  num_heads: 6
  embed_dim: 324
  ff_dim_base: 384
  dropout: 0.2
  ff_dropout: 0.1
  activation_function: tanh
  num_ff_layers: 1
  use_batchnorm: False
  use_layernorm: False
  batch_size: 448
  lr: 0.0006267525109483329
--------------------------------------------------




Early stopping at epoch 15




Early stopping at epoch 21




Early stopping at epoch 25


  hit_rate = sum((np.array(y_true) == 1) & (np.array(y_pred_binary) == 1)) / sum(np.array(y_true) == 1)


Early stopping at epoch 21




Early stopping at epoch 14


[W 2025-01-06 21:18:48,194] Trial 11 failed with parameters: {'num_heads': 6, 'embed_dim': 328, 'ff_dim_base': 384, 'dropout': 0.2, 'ff_dropout': 0.1, 'activation_function': 'tanh', 'num_ff_layers': 1, 'use_batchnorm': False, 'use_layernorm': False, 'batch_size': 448, 'lr': 0.0006267525109483329} because of the following error: The value nan is not acceptable.
[W 2025-01-06 21:18:48,195] Trial 11 failed with value nan.


Trial 12:
  num_heads: 8
  embed_dim: 160
  ff_dim_base: 128
  dropout: 0.2
  ff_dropout: 0.1
  activation_function: tanh
  num_ff_layers: 2
  use_batchnorm: False
  use_layernorm: False
  batch_size: 192
  lr: 2.4865620800999894e-05
--------------------------------------------------




Early stopping at epoch 25




Early stopping at epoch 19




Early stopping at epoch 28


  hit_rate = sum((np.array(y_true) == 1) & (np.array(y_pred_binary) == 1)) / sum(np.array(y_true) == 1)


Early stopping at epoch 21




Early stopping at epoch 15


[W 2025-01-06 21:28:42,876] Trial 12 failed with parameters: {'num_heads': 8, 'embed_dim': 160, 'ff_dim_base': 128, 'dropout': 0.2, 'ff_dropout': 0.1, 'activation_function': 'tanh', 'num_ff_layers': 2, 'use_batchnorm': False, 'use_layernorm': False, 'batch_size': 192, 'lr': 2.4865620800999894e-05} because of the following error: The value nan is not acceptable.
[W 2025-01-06 21:28:42,877] Trial 12 failed with value nan.


Trial 13:
  num_heads: 8
  embed_dim: 112
  ff_dim_base: 512
  dropout: 0.2
  ff_dropout: 0.2
  activation_function: relu
  num_ff_layers: 3
  use_batchnorm: True
  use_layernorm: True
  batch_size: 448
  lr: 0.0006227767630806624
--------------------------------------------------




Early stopping at epoch 21




Early stopping at epoch 21




Early stopping at epoch 33


  hit_rate = sum((np.array(y_true) == 1) & (np.array(y_pred_binary) == 1)) / sum(np.array(y_true) == 1)


Early stopping at epoch 21




Early stopping at epoch 22


[W 2025-01-06 21:36:18,994] Trial 13 failed with parameters: {'num_heads': 8, 'embed_dim': 112, 'ff_dim_base': 512, 'dropout': 0.2, 'ff_dropout': 0.2, 'activation_function': 'relu', 'num_ff_layers': 3, 'use_batchnorm': True, 'use_layernorm': True, 'batch_size': 448, 'lr': 0.0006227767630806624} because of the following error: The value nan is not acceptable.
[W 2025-01-06 21:36:18,995] Trial 13 failed with value nan.


Trial 14:
  num_heads: 8
  embed_dim: 96
  ff_dim_base: 448
  dropout: 0.1
  ff_dropout: 0.1
  activation_function: relu
  num_ff_layers: 3
  use_batchnorm: True
  use_layernorm: False
  batch_size: 256
  lr: 0.0002189990963220655
--------------------------------------------------




Early stopping at epoch 25




Early stopping at epoch 16




Early stopping at epoch 27


  hit_rate = sum((np.array(y_true) == 1) & (np.array(y_pred_binary) == 1)) / sum(np.array(y_true) == 1)


Early stopping at epoch 17




Early stopping at epoch 11


[W 2025-01-06 21:44:52,067] Trial 14 failed with parameters: {'num_heads': 8, 'embed_dim': 96, 'ff_dim_base': 448, 'dropout': 0.1, 'ff_dropout': 0.1, 'activation_function': 'relu', 'num_ff_layers': 3, 'use_batchnorm': True, 'use_layernorm': False, 'batch_size': 256, 'lr': 0.0002189990963220655} because of the following error: The value nan is not acceptable.
[W 2025-01-06 21:44:52,068] Trial 14 failed with value nan.


Trial 15:
  num_heads: 6
  embed_dim: 180
  ff_dim_base: 384
  dropout: 0.4
  ff_dropout: 0.1
  activation_function: sigmoid
  num_ff_layers: 3
  use_batchnorm: False
  use_layernorm: True
  batch_size: 192
  lr: 0.00023812843377188356
--------------------------------------------------




Early stopping at epoch 30




Early stopping at epoch 24




Early stopping at epoch 24


  hit_rate = sum((np.array(y_true) == 1) & (np.array(y_pred_binary) == 1)) / sum(np.array(y_true) == 1)


Early stopping at epoch 24




Early stopping at epoch 16


[W 2025-01-06 21:57:02,428] Trial 15 failed with parameters: {'num_heads': 6, 'embed_dim': 184, 'ff_dim_base': 384, 'dropout': 0.4, 'ff_dropout': 0.1, 'activation_function': 'sigmoid', 'num_ff_layers': 3, 'use_batchnorm': False, 'use_layernorm': True, 'batch_size': 192, 'lr': 0.00023812843377188356} because of the following error: The value nan is not acceptable.
[W 2025-01-06 21:57:02,436] Trial 15 failed with value nan.


Trial 16:
  num_heads: 6
  embed_dim: 330
  ff_dim_base: 512
  dropout: 0.1
  ff_dropout: 0.4
  activation_function: sigmoid
  num_ff_layers: 1
  use_batchnorm: True
  use_layernorm: False
  batch_size: 512
  lr: 3.235298834361406e-05
--------------------------------------------------




Early stopping at epoch 65




Early stopping at epoch 78


  hit_rate = sum((np.array(y_true) == 1) & (np.array(y_pred_binary) == 1)) / sum(np.array(y_true) == 1)


Early stopping at epoch 48




Early stopping at epoch 44


[W 2025-01-06 22:14:12,371] Trial 16 failed with parameters: {'num_heads': 6, 'embed_dim': 334, 'ff_dim_base': 512, 'dropout': 0.1, 'ff_dropout': 0.4, 'activation_function': 'sigmoid', 'num_ff_layers': 1, 'use_batchnorm': True, 'use_layernorm': False, 'batch_size': 512, 'lr': 3.235298834361406e-05} because of the following error: The value nan is not acceptable.
[W 2025-01-06 22:14:12,377] Trial 16 failed with value nan.


Trial 17:
  num_heads: 6
  embed_dim: 330
  ff_dim_base: 512
  dropout: 0.1
  ff_dropout: 0.2
  activation_function: sigmoid
  num_ff_layers: 3
  use_batchnorm: True
  use_layernorm: True
  batch_size: 448
  lr: 5.267799126039186e-05
--------------------------------------------------




Early stopping at epoch 63




Early stopping at epoch 44




Early stopping at epoch 62


  hit_rate = sum((np.array(y_true) == 1) & (np.array(y_pred_binary) == 1)) / sum(np.array(y_true) == 1)


Early stopping at epoch 26




Early stopping at epoch 41


[W 2025-01-06 22:30:09,636] Trial 17 failed with parameters: {'num_heads': 6, 'embed_dim': 334, 'ff_dim_base': 512, 'dropout': 0.1, 'ff_dropout': 0.2, 'activation_function': 'sigmoid', 'num_ff_layers': 3, 'use_batchnorm': True, 'use_layernorm': True, 'batch_size': 448, 'lr': 5.267799126039186e-05} because of the following error: The value nan is not acceptable.
[W 2025-01-06 22:30:09,637] Trial 17 failed with value nan.


Trial 18:
  num_heads: 4
  embed_dim: 180
  ff_dim_base: 320
  dropout: 0.1
  ff_dropout: 0.4
  activation_function: relu
  num_ff_layers: 2
  use_batchnorm: False
  use_layernorm: False
  batch_size: 256
  lr: 0.00031592181805483593
--------------------------------------------------




Early stopping at epoch 30




Early stopping at epoch 22




Early stopping at epoch 20


  hit_rate = sum((np.array(y_true) == 1) & (np.array(y_pred_binary) == 1)) / sum(np.array(y_true) == 1)


Early stopping at epoch 26




Early stopping at epoch 32


[W 2025-01-06 22:40:36,957] Trial 18 failed with parameters: {'num_heads': 4, 'embed_dim': 180, 'ff_dim_base': 320, 'dropout': 0.1, 'ff_dropout': 0.4, 'activation_function': 'relu', 'num_ff_layers': 2, 'use_batchnorm': False, 'use_layernorm': False, 'batch_size': 256, 'lr': 0.00031592181805483593} because of the following error: The value nan is not acceptable.
[W 2025-01-06 22:40:36,958] Trial 18 failed with value nan.


Trial 19:
  num_heads: 6
  embed_dim: 468
  ff_dim_base: 320
  dropout: 0.1
  ff_dropout: 0.2
  activation_function: tanh
  num_ff_layers: 2
  use_batchnorm: False
  use_layernorm: True
  batch_size: 512
  lr: 4.526432191086678e-05
--------------------------------------------------




Early stopping at epoch 22




Early stopping at epoch 17




Early stopping at epoch 15


  hit_rate = sum((np.array(y_true) == 1) & (np.array(y_pred_binary) == 1)) / sum(np.array(y_true) == 1)


Early stopping at epoch 37




Early stopping at epoch 12


[W 2025-01-06 22:46:35,056] Trial 19 failed with parameters: {'num_heads': 6, 'embed_dim': 472, 'ff_dim_base': 320, 'dropout': 0.1, 'ff_dropout': 0.2, 'activation_function': 'tanh', 'num_ff_layers': 2, 'use_batchnorm': False, 'use_layernorm': True, 'batch_size': 512, 'lr': 4.526432191086678e-05} because of the following error: The value nan is not acceptable.
[W 2025-01-06 22:46:35,057] Trial 19 failed with value nan.


Trial 20:
  num_heads: 4
  embed_dim: 324
  ff_dim_base: 320
  dropout: 0.1
  ff_dropout: 0.30000000000000004
  activation_function: tanh
  num_ff_layers: 3
  use_batchnorm: True
  use_layernorm: True
  batch_size: 64
  lr: 0.005992546319855461
--------------------------------------------------




Early stopping at epoch 27




Early stopping at epoch 47




Early stopping at epoch 33


  hit_rate = sum((np.array(y_true) == 1) & (np.array(y_pred_binary) == 1)) / sum(np.array(y_true) == 1)


Early stopping at epoch 56




Early stopping at epoch 45


[W 2025-01-06 23:41:53,309] Trial 20 failed with parameters: {'num_heads': 4, 'embed_dim': 324, 'ff_dim_base': 320, 'dropout': 0.1, 'ff_dropout': 0.30000000000000004, 'activation_function': 'tanh', 'num_ff_layers': 3, 'use_batchnorm': True, 'use_layernorm': True, 'batch_size': 64, 'lr': 0.005992546319855461} because of the following error: The value nan is not acceptable.
[W 2025-01-06 23:41:53,311] Trial 20 failed with value nan.


Trial 21:
  num_heads: 4
  embed_dim: 124
  ff_dim_base: 192
  dropout: 0.1
  ff_dropout: 0.1
  activation_function: relu
  num_ff_layers: 3
  use_batchnorm: True
  use_layernorm: True
  batch_size: 128
  lr: 5.0103547396957045e-05
--------------------------------------------------




Early stopping at epoch 25




Early stopping at epoch 28




Early stopping at epoch 32


  hit_rate = sum((np.array(y_true) == 1) & (np.array(y_pred_binary) == 1)) / sum(np.array(y_true) == 1)


Early stopping at epoch 40




Early stopping at epoch 22


[W 2025-01-07 00:02:12,338] Trial 21 failed with parameters: {'num_heads': 4, 'embed_dim': 124, 'ff_dim_base': 192, 'dropout': 0.1, 'ff_dropout': 0.1, 'activation_function': 'relu', 'num_ff_layers': 3, 'use_batchnorm': True, 'use_layernorm': True, 'batch_size': 128, 'lr': 5.0103547396957045e-05} because of the following error: The value nan is not acceptable.
[W 2025-01-07 00:02:12,339] Trial 21 failed with value nan.


Trial 22:
  num_heads: 6
  embed_dim: 306
  ff_dim_base: 256
  dropout: 0.30000000000000004
  ff_dropout: 0.5
  activation_function: sigmoid
  num_ff_layers: 2
  use_batchnorm: False
  use_layernorm: False
  batch_size: 128
  lr: 1.474732825409432e-05
--------------------------------------------------




Early stopping at epoch 41




Early stopping at epoch 60


  hit_rate = sum((np.array(y_true) == 1) & (np.array(y_pred_binary) == 1)) / sum(np.array(y_true) == 1)


Early stopping at epoch 76




Early stopping at epoch 53


[W 2025-01-07 00:40:56,262] Trial 22 failed with parameters: {'num_heads': 6, 'embed_dim': 310, 'ff_dim_base': 256, 'dropout': 0.30000000000000004, 'ff_dropout': 0.5, 'activation_function': 'sigmoid', 'num_ff_layers': 2, 'use_batchnorm': False, 'use_layernorm': False, 'batch_size': 128, 'lr': 1.474732825409432e-05} because of the following error: The value nan is not acceptable.
[W 2025-01-07 00:40:56,263] Trial 22 failed with value nan.


Trial 23:
  num_heads: 8
  embed_dim: 200
  ff_dim_base: 256
  dropout: 0.30000000000000004
  ff_dropout: 0.5
  activation_function: sigmoid
  num_ff_layers: 3
  use_batchnorm: True
  use_layernorm: True
  batch_size: 384
  lr: 0.0013278125733576444
--------------------------------------------------




Early stopping at epoch 23




Early stopping at epoch 20




Early stopping at epoch 23


  hit_rate = sum((np.array(y_true) == 1) & (np.array(y_pred_binary) == 1)) / sum(np.array(y_true) == 1)


Early stopping at epoch 23




Early stopping at epoch 17


[W 2025-01-07 00:48:01,623] Trial 23 failed with parameters: {'num_heads': 8, 'embed_dim': 200, 'ff_dim_base': 256, 'dropout': 0.30000000000000004, 'ff_dropout': 0.5, 'activation_function': 'sigmoid', 'num_ff_layers': 3, 'use_batchnorm': True, 'use_layernorm': True, 'batch_size': 384, 'lr': 0.0013278125733576444} because of the following error: The value nan is not acceptable.
[W 2025-01-07 00:48:01,624] Trial 23 failed with value nan.


Trial 24:
  num_heads: 4
  embed_dim: 276
  ff_dim_base: 320
  dropout: 0.2
  ff_dropout: 0.5
  activation_function: tanh
  num_ff_layers: 2
  use_batchnorm: False
  use_layernorm: True
  batch_size: 64
  lr: 0.006577769431562822
--------------------------------------------------




Early stopping at epoch 15






---

# Auto commit to github

In [None]:
import datetime
import os

# Navigate to the repository directory (if not already there)
%cd /content/drive/MyDrive/Colab_Notebooks/Deep_Learning_Practice

with open('/content/drive/MyDrive/IAM/PAT.txt', 'r') as file:
      github_pat = file.read().strip()
os.environ['GITHUB_PAT'] = github_pat

!git remote add origin "https://github.com/archiegoodman2/machine_learning_practice"

# Replace with your actual username and email (or configure globally)
USERNAME="archiegoodman2"
EMAIL="archiegoodman2011@gmail.com"

# Set global username and email configuration
!git config --global user.name "$USERNAME"
!git config --global user.email "$EMAIL"

now = datetime.datetime.now()
current_datetime = now.strftime("%Y-%m-%d %H:%M")

# Set remote URL using the PAT from environment variable
!git remote set-url origin https://{os.environ['GITHUB_PAT']}@github.com/archiegoodman2/machine_learning_practice.git

# Replace with your desired commit message
COMMIT_MESSAGE = str(current_datetime) + " " + " bug fixes, more comments "

# Stage all changes
!git add .

# Commit the changes
!git commit -m "$COMMIT_MESSAGE"

# Push to origin
!git push origin master


/content/drive/MyDrive/Colab_Notebooks/Deep_Learning_Practice
error: remote origin already exists.
[master 3e57343] 2024-12-20 16:54  bug fixes, more comments
 1 file changed, 1 insertion(+), 1 deletion(-)
 rewrite nn_transformer_creditcardfraud.ipynb (98%)
Enumerating objects: 5, done.
Counting objects: 100% (5/5), done.
Delta compression using up to 2 threads
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 8.08 KiB | 827.00 KiB/s, done.
Total 3 (delta 2), reused 0 (delta 0), pack-reused 0
remote: Resolving deltas: 100% (2/2), completed with 2 local objects.[K
To https://github.com/archiegoodman2/machine_learning_practice.git
   ec7ec6b..3e57343  master -> master
