# 1. Overview

This is a notebook for training a neural network to submit predictions to the "Binary Classification with a Bank Dataset" Kaggle competition ([playground-series-s5e8](https://www.kaggle.com/competitions/playground-series-s5e8)).

This playground competition uses synthetic data. The objective is to predict, for each client in the test set (identified by `id`), the probability that the client will subscribe to a bank term deposit. Although the target `y` is binary (the client either subscribes or does not), the predictions are probabilities, so their values lie between 0 and 1.
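Later sections of this notebook score the predicted probabilities with `roc_auc_score`. As a rough illustration of what that score measures, the sketch below computes ROC AUC as the fraction of (positive, negative) pairs in which the positive example receives the higher score. The function name `pairwise_auc` and the toy labels and scores are assumptions made for this example; they are not part of the competition data.

```python
from itertools import product


def pairwise_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a correct pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    correct = 0.0
    for p, n in product(positives, negatives):
        if p > n:
            correct += 1.0
        elif p == n:
            correct += 0.5
    return correct / (len(positives) * len(negatives))


# Toy example (hypothetical values)
labels = [0, 0, 1, 1]
probabilities = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_auc(labels, probabilities)
print(auc)
```

Because only the ordering of the scores matters, any strictly increasing transformation of the probabilities leaves this value unchanged.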

# 2. Setup

## 2.1 Imports

In [1]:
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms, utils

## 2.2 Reproducibility

To make the results reproducible, a fixed (arbitrarily chosen) random seed is set for NumPy and PyTorch.

In [2]:
random_seed = 11
np.random.seed(random_seed)
torch.manual_seed(random_seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed(random_seed)
    torch.cuda.manual_seed_all(random_seed)
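The effect of seeding can be checked with a minimal NumPy-only sketch: re-seeding with the same value restarts the generator, so the same numbers are drawn again. The variable names below are chosen for this illustration. Note that seeding alone may not make GPU training bit-for-bit identical across runs, since some CUDA operations can be non-deterministic.

```python
import numpy as np

seed = 11  # same arbitrary value as `random_seed` in the notebook

np.random.seed(seed)
first_draw = np.random.rand(3)

np.random.seed(seed)  # re-seeding restarts the generator from the same state
second_draw = np.random.rand(3)

print(np.array_equal(first_draw, second_draw))
```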

## 2.3 Device

`device` indicates whether CUDA or the CPU will be used. The CPU is sufficient for minor tasks such as debugging, but training the neural network in later sections needs an accelerator (GPU) to finish in a reasonable time.

In [3]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device

'cuda'

## 2.4 DataFrames

Next, the data provided for the competition will be read into dataframes.

In [4]:
# Read data files
input_filepath = '/kaggle/input'
train_data = pd.read_csv(input_filepath + '/playground-series-s5e8/train.csv')
test_data = pd.read_csv(input_filepath + '/playground-series-s5e8/test.csv')

# Set index
train_data.set_index('id', inplace=True)
test_data.set_index('id', inplace=True)

# Get numeric and categorical columns from training data (which should be the same for test data)
numeric_col_names = train_data.select_dtypes(include='number').columns.to_series()
categorical_col_names = train_data.select_dtypes(include='object').columns.to_series()
assert numeric_col_names.size + categorical_col_names.size == train_data.shape[1]

# Set target column
target_col = 'y'
# Drop 'y' column from numeric column names
numeric_col_names.drop(target_col, inplace=True)

# 3. Data Preprocessing

## 3.1 Feature Generation

In [5]:
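# log transformation of numeric features
# (signed log: sign(x) * log(max(|x|, 1)), which keeps the sign of negative
# values and maps any value with |x| <= 1 to 0)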

train_data['log_balance'] = np.sign(train_data['balance']) * np.log(train_data['balance'].abs().apply(lambda x: x if x > 1 else 1))
test_data['log_balance'] = np.sign(test_data['balance']) * np.log(test_data['balance'].abs().apply(lambda x: x if x > 1 else 1))

train_data['log_duration'] = np.sign(train_data['duration']) * np.log(train_data['duration'].abs().apply(lambda x: x if x > 1 else 1))
test_data['log_duration'] = np.sign(test_data['duration']) * np.log(test_data['duration'].abs().apply(lambda x: x if x > 1 else 1))

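# pdays: keep the -1 value (client not previously contacted) as-is, map other values below 1 to 0, otherwise take log(pdays)
train_data['log_pdays'] = train_data['pdays'].apply(lambda x: x if x == -1 else (0 if x < 1 else np.log(x)))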
test_data['log_pdays'] = test_data['pdays'].apply(lambda x: x if x == -1 else (0 if x < 1 else np.log(x)))

# sine transformation of numeric features

train_data['sine_balance'] = np.sin(2 * np.pi * train_data['balance'] / 1000)
test_data['sine_balance'] = np.sin(2 * np.pi * test_data['balance'] / 1000)

train_data['sine_duration'] = np.sin(2 * np.pi * train_data['duration'] / 720)
test_data['sine_duration'] = np.sin(2 * np.pi * test_data['duration'] / 720)

train_data['sine_pdays'] = np.sin(2 * np.pi * train_data['pdays'] / 7)
test_data['sine_pdays'] = np.sin(2 * np.pi * test_data['pdays'] / 7)

# transformation of numeric features to boolean ones based on threshold or specific value

balance_threshold = train_data['balance'].quantile(0.75)
print(f"balance_threshold for third quartile: {balance_threshold}")
train_data['balance_above_third_quartile'] = (train_data['balance'] > balance_threshold).astype(object)
test_data['balance_above_third_quartile'] = (test_data['balance'] > balance_threshold).astype(object)

duration_threshold = train_data['duration'].quantile(0.75)
print(f"duration_threshold for third quartile: {duration_threshold}")
train_data['duration_above_third_quartile'] = (train_data['duration'] > duration_threshold).astype(object)
test_data['duration_above_third_quartile'] = (test_data['duration'] > duration_threshold).astype(object)

campaign_threshold = train_data['campaign'].quantile(0.75)
print(f"campaign_threshold for third quartile: {campaign_threshold}")
train_data['campaign_above_third_quartile'] = (train_data['campaign'] > campaign_threshold).astype(object)
test_data['campaign_above_third_quartile'] = (test_data['campaign'] > campaign_threshold).astype(object)

train_data['client_not_previously_contacted'] = (train_data['pdays'] == -1).astype(object)
test_data['client_not_previously_contacted'] = (test_data['pdays'] == -1).astype(object)

balance_threshold for third quartile: 1390.0
duration_threshold for third quartile: 361.0
campaign_threshold for third quartile: 3.0
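The log and sine transformations in the cell above can be expressed as two small vectorised helpers. This is a sketch only: the helper names `signed_log` and `sine_feature` and the sample values are assumptions made for illustration, and `np.clip` is used here in place of the notebook's `apply(lambda ...)`.

```python
import numpy as np


def signed_log(values):
    """sign(x) * log(max(|x|, 1)): zero for |x| <= 1, symmetric for negatives."""
    values = np.asarray(values, dtype=float)
    return np.sign(values) * np.log(np.clip(np.abs(values), 1.0, None))


def sine_feature(values, period):
    """Map a value onto a sine wave that repeats every `period` units."""
    values = np.asarray(values, dtype=float)
    return np.sin(2 * np.pi * values / period)


# Hypothetical balances, including a negative value and values inside [-1, 1]
balances = np.array([-1000.0, 0.0, 0.5, 1000.0])
print(signed_log(balances))
print(sine_feature(np.array([0.0, 250.0, 500.0]), period=1000))
```

The signed log compresses large magnitudes while preserving the difference between negative and positive balances, which a plain `np.log` cannot do.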


In [6]:
# drop columns that are no longer necessary due to feature generation
cols_to_drop = [
    'balance',
    'duration',
    'pdays',
]
train_data.drop(cols_to_drop, axis=1, inplace=True)
test_data.drop(cols_to_drop, axis=1, inplace=True)

# get numeric and categorical column names for training data again, and check that
# the total number of numeric and categorical columns is as expected
numeric_col_names = train_data.select_dtypes(include='number').columns.to_series()
categorical_col_names = train_data.select_dtypes(include='object').columns.to_series()
assert numeric_col_names.size + categorical_col_names.size == train_data.shape[1]

# drop 'y' column from numeric column names
numeric_col_names.drop(target_col, inplace=True)

## 3.2 Categorical Encoding

In [7]:
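# encode categorical features; fit each encoder once on the combined train and test
# values so that the same category maps to the same integer in both dataframes
for col in categorical_col_names:
    le = LabelEncoder()
    le.fit(pd.concat([train_data[col], test_data[col]]).astype(str))
    train_data[col] = le.transform(train_data[col].astype(str))
    test_data[col] = le.transform(test_data[col].astype(str))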
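A label encoder assigns integer codes based on the sorted set of values it was fitted on, so an encoder fitted on a different set of values can give the same category a different code. The NumPy sketch below mimics that behaviour with `np.unique`; the category names are hypothetical and chosen only to make the mismatch visible.

```python
import numpy as np

# Hypothetical category values, for illustration only
train_jobs = np.array(["admin", "student", "technician", "student"])
test_jobs = np.array(["student", "technician"])  # "admin" is absent here

# Fitting separately: each array gets its own sorted vocabulary.
_, train_codes_separate = np.unique(train_jobs, return_inverse=True)
_, test_codes_separate = np.unique(test_jobs, return_inverse=True)

# Fitting once on the combined values: one shared vocabulary.
vocabulary = np.unique(np.concatenate([train_jobs, test_jobs]))
train_codes_shared = np.searchsorted(vocabulary, train_jobs)
test_codes_shared = np.searchsorted(vocabulary, test_jobs)

print("separate:", train_codes_separate, test_codes_separate)
print("shared:  ", train_codes_shared, test_codes_shared)
```

With separate vocabularies, "student" is encoded as 1 in the training array but as 0 in the test array, so an embedding learned for one category would be looked up for another at prediction time. A shared vocabulary avoids this.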

## 3.3 Review of Features

In [8]:
print(f"Numeric features:\n{numeric_col_names.values}")
print(f"Categorical features:\n{categorical_col_names.values}")

Numeric features:
['age' 'day' 'campaign' 'previous' 'log_balance' 'log_duration'
 'log_pdays' 'sine_balance' 'sine_duration' 'sine_pdays']
Categorical features:
['job' 'marital' 'education' 'default' 'housing' 'loan' 'contact' 'month'
 'poutcome' 'balance_above_third_quartile' 'duration_above_third_quartile'
 'campaign_above_third_quartile' 'client_not_previously_contacted']


# 4. Model

In [9]:
class BankModel(nn.Module):
    def __init__(self, num_numeric_features, num_embeddings_list, embeddings_dim):
        super().__init__()

        # embedding layers for categorical features
        self.categorical_embeddings = nn.ModuleList([
            nn.Embedding(num_embeddings, embeddings_dim) for num_embeddings in num_embeddings_list
        ])

        # layer for projecting numeric features on to the embedding space
        self.numeric_projection = nn.Linear(num_numeric_features, embeddings_dim)

        # all numeric features are projected into a single embedding vector (the "1"),
        # and each categorical feature contributes one embedding vector
        num_total_features = 1 + len(num_embeddings_list)
        input_size = num_total_features * embeddings_dim
        output_size = input_size

        self.layers = nn.ModuleList()
        for i in range(5):
            self.layers.append(nn.Linear(input_size, output_size))
            self.layers.append(nn.ReLU())
            self.layers.append(nn.BatchNorm1d(output_size))
            self.layers.append(nn.Dropout(0.1))

        # final output layer
        self.output_layer = nn.Linear(input_size, 1)

    def forward(self, numeric_inputs, categorical_inputs):
        numeric_embeddings = self.numeric_projection(numeric_inputs)
        numeric_embeddings = numeric_embeddings.unsqueeze(1)

        num_categorical_features = len(categorical_inputs)
        categorical_embeddings = [self.categorical_embeddings[i](categorical_inputs[i].unsqueeze(-1)) for i in range(num_categorical_features)]

        all_features = torch.cat([numeric_embeddings] + categorical_embeddings, dim=1)
        x = all_features
        x = x.view(x.size(0), -1)

        for layer in self.layers:
            x = layer(x)

        return self.output_layer(x)
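To make the tensor shapes in `forward` easier to follow, here is a NumPy-only sketch that reproduces the concatenate-then-flatten step with random stand-in arrays. It does not run the model: the arrays replace the outputs of `numeric_projection` and the embedding layers, the batch size of 4 is arbitrary, and the counts (13 categorical features, embedding size 32) are taken from elsewhere in this notebook.

```python
import numpy as np

batch_size = 4                 # arbitrary, for illustration
embeddings_dim = 32            # embedding size passed to BankModel in this notebook
num_categorical_features = 13  # count printed in section 3.3

rng = np.random.default_rng(0)

# Stand-in for the output of `numeric_projection`, after `unsqueeze(1)`.
numeric_embeddings = rng.normal(size=(batch_size, 1, embeddings_dim))

# Stand-ins for the embedding lookups: one (batch, 1, dim) array per categorical feature.
categorical_embeddings = [
    rng.normal(size=(batch_size, 1, embeddings_dim))
    for _ in range(num_categorical_features)
]

# Equivalent of `torch.cat(..., dim=1)` followed by `x.view(x.size(0), -1)`.
all_features = np.concatenate([numeric_embeddings] + categorical_embeddings, axis=1)
flattened = all_features.reshape(batch_size, -1)

print(all_features.shape, flattened.shape)
```

Under these assumptions the hidden layers see (1 + 13) * 32 = 448 inputs per example, which matches `input_size` in `__init__`.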


# 5. Helper Classes/Functions

In [10]:
class TabularDataset(Dataset):
    def __init__(self, numeric_data, categorical_data, target_col=None):
        self.numeric_data = torch.from_numpy(numeric_data).float()
        self.categorical_data = torch.from_numpy(categorical_data).long()
        self.target_col = torch.from_numpy(target_col).float() if target_col is not None else None

    def __len__(self):
        return len(self.numeric_data)
    
    def __getitem__(self, idx):
        numeric_tensors = self.numeric_data[idx]
        categorical_tensors = self.categorical_data[idx]
        if self.target_col is not None:
            return numeric_tensors, categorical_tensors, self.target_col[idx]
        return numeric_tensors, categorical_tensors

In [11]:
def get_data_loaders(train_indices, validation_indices):
    X_train_fold_numerical = train_data[numeric_col_names].iloc[train_indices]
    X_train_fold_categorical = train_data[categorical_col_names].iloc[train_indices]
    y_train_fold = train_data[target_col].iloc[train_indices]

    X_validation_fold_numerical = train_data[numeric_col_names].iloc[validation_indices]
    X_validation_fold_categorical = train_data[categorical_col_names].iloc[validation_indices]
    y_validation_fold = train_data[target_col].iloc[validation_indices]

    train_dataset = TabularDataset(
        X_train_fold_numerical.values,
        X_train_fold_categorical.values,
        y_train_fold.values
    )
    validation_dataset = TabularDataset(
        X_validation_fold_numerical.values,
        X_validation_fold_categorical.values,
        y_validation_fold.values
    )
    test_dataset = TabularDataset(
        test_data[numeric_col_names].values,
        test_data[categorical_col_names].values
    )

    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, num_workers=2)
    validation_loader = DataLoader(validation_dataset, batch_size=batch_size)
    test_loader = DataLoader(test_dataset, batch_size=batch_size)

    return train_loader, validation_loader, test_loader

In [12]:
def get_outputs(model, numeric_data, categorical_data):
    numeric_data = numeric_data.to(device)
    categorical_data = categorical_data.transpose(1,0).to(device)
    # squeeze only the last dimension so that a batch of size 1 keeps its batch dimension
    return model(numeric_data, categorical_data).squeeze(-1)

In [13]:
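def get_predictions(model, numeric_data, categorical_data, data_loader, is_test_data):
    # Note: the `numeric_data` and `categorical_data` arguments are not used; they are
    # overwritten by the batches drawn from `data_loader` below.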
    predictions = []

    with torch.no_grad():
        if is_test_data:
            for numeric_data, categorical_data in data_loader:
                outputs = get_outputs(model, numeric_data, categorical_data)
                predictions.extend(torch.sigmoid(outputs).cpu().numpy())
        else:
            for numeric_data, categorical_data, _ in data_loader:
                outputs = get_outputs(model, numeric_data, categorical_data)
                predictions.extend(torch.sigmoid(outputs).cpu().numpy())

    return predictions
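`get_predictions` applies `torch.sigmoid` because the model returns raw logits rather than probabilities. A minimal NumPy sketch of the same mapping is shown below; the function and the sample logits are written for this illustration and are not taken from the notebook.

```python
import numpy as np


def sigmoid(logits):
    """Logistic function: maps any real-valued logit into the open interval (0, 1)."""
    logits = np.asarray(logits, dtype=float)
    return 1.0 / (1.0 + np.exp(-logits))


logits = np.array([-4.0, 0.0, 4.0])  # hypothetical model outputs
probabilities = sigmoid(logits)
print(probabilities)
```

A logit of 0 corresponds to a probability of 0.5, and logits of equal magnitude and opposite sign give probabilities that sum to 1.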

# 6. Stratified K-Fold

In [14]:
kfold_num_splits = 10
batch_size = 1024
max_epochs = 5
patience = 50
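With `max_epochs = 5` and `patience = 50`, the early-stopping counter in the training loop can reach at most 4, so early stopping cannot trigger with these settings. The pure-Python sketch below mirrors the counter logic to show the interaction between the two values. The function name `epochs_run` and the validation scores are hypothetical.

```python
def epochs_run(validation_scores, patience):
    """Return how many epochs run before early stopping triggers.

    Mirrors the counter logic of the training loop: the counter resets on a new
    best score and training stops once the counter reaches `patience`.
    """
    best_score = 0.0
    counter = 0
    for epoch, score in enumerate(validation_scores, start=1):
        if score > best_score:
            best_score = score
            counter = 0
        else:
            counter += 1
            if counter >= patience:
                return epoch
    return len(validation_scores)


# Hypothetical per-epoch validation AUC values, for illustration only.
scores = [0.950, 0.960, 0.959, 0.958, 0.957]
print(epochs_run(scores, patience=2))
print(epochs_run(scores, patience=50))
```

A smaller `patience` (or a larger `max_epochs`) would be needed for the early-stopping branch to have any effect.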

In [15]:
kfold = StratifiedKFold(n_splits=kfold_num_splits, shuffle=True, random_state=random_seed)
kfold_splits = kfold.split(train_data.drop(target_col, axis=1), train_data[target_col])
kfold_splits_enumeration = enumerate(kfold_splits)

oof_predictions = np.zeros(len(train_data))
test_predictions = np.zeros(len(test_data))

# will be used later for getting values for 'num_embeddings'
train_test_combined = pd.concat([train_data.drop(target_col, axis=1), test_data], axis=0)

for fold, (train_indices, validation_indices) in kfold_splits_enumeration:
    print(f"\n##### Training Fold {fold + 1} of {kfold_num_splits} #####\n")

    # get data loaders for training, validation and test data
    train_loader, validation_loader, test_loader = get_data_loaders(train_indices, validation_indices)

    # will be used later for ROC AUC score calculation
    y_validation_fold = train_data[target_col].iloc[validation_indices]

    # PyTorch model
    model = BankModel(
        len(numeric_col_names),
        train_test_combined[categorical_col_names].nunique(), # `num_embeddings` Embedding parameter
        32 # `embedding_dim` Embedding parameter
    ).to(device)

    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    criterion = nn.BCEWithLogitsLoss()
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=max_epochs)

    best_validation_auc_score = 0
    early_stopping_counter = 0

    print(f"(Max epochs: {max_epochs})")

    for epoch in range(max_epochs):
        model.train()
        loss_train = 0
        for numeric_data, categorical_data, labels in train_loader:
            optimizer.zero_grad()
            outputs = get_outputs(model, numeric_data, categorical_data)
            loss = criterion(outputs, labels.to(device))
            loss.backward()
            optimizer.step()
            loss_train += loss.item()

        scheduler.step()

        # validation
        model.eval()
        validation_predictions = get_predictions(model, numeric_data, categorical_data, validation_loader, False)
        current_validation_auc_score = roc_auc_score(y_validation_fold, validation_predictions)
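        # Log every 25th epoch; with max_epochs = 5 only the first epoch is printed.
        # Note: `loss_train` sums per-batch mean losses, so dividing it by the number of
        # training samples reports roughly (mean loss / batch_size), not the mean loss.
        if epoch % 25 == 0: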
            print(f"[Epoch {epoch + 1}] Training Loss: {loss_train / len(train_indices):.5f}, Validation AUC Score: {current_validation_auc_score:.5f}")

        if current_validation_auc_score > best_validation_auc_score:
            best_validation_auc_score = current_validation_auc_score
            early_stopping_counter = 0
            torch.save(model.state_dict(), 'model.pth')
        else:
            early_stopping_counter += 1
            if early_stopping_counter >= patience:
                print(f"Stopping early - {patience} epochs passed without improvement in best validation ROC AUC score.")
                break

    model.load_state_dict(torch.load('model.pth'))

    model.eval()
    validation_predictions = get_predictions(model, numeric_data, categorical_data, validation_loader, False)

    oof_predictions[validation_indices] = np.array(validation_predictions)
    final_fold_roc_auc_score = roc_auc_score(y_validation_fold, oof_predictions[validation_indices])
    print(f"Final ROC AUC score for Fold {fold + 1}: {final_fold_roc_auc_score:.5f}")

    # Make predictions on test data
    test_predictions_fold = get_predictions(model, numeric_data, categorical_data, test_loader, True)
    test_predictions += np.array(test_predictions_fold) / kfold_num_splits


##### Training Fold 1 of 10 #####

(Max epochs: 5)
[Epoch 1] Training Loss: 0.00022, Validation AUC Score: 0.95400
Final ROC AUC score for Fold 1: 0.96336

##### Training Fold 2 of 10 #####

(Max epochs: 5)
[Epoch 1] Training Loss: 0.00022, Validation AUC Score: 0.95595
Final ROC AUC score for Fold 2: 0.96344

##### Training Fold 3 of 10 #####

(Max epochs: 5)
[Epoch 1] Training Loss: 0.00021, Validation AUC Score: 0.95437
Final ROC AUC score for Fold 3: 0.96215

##### Training Fold 4 of 10 #####

(Max epochs: 5)
[Epoch 1] Training Loss: 0.00022, Validation AUC Score: 0.95570
Final ROC AUC score for Fold 4: 0.96428

##### Training Fold 5 of 10 #####

(Max epochs: 5)
[Epoch 1] Training Loss: 0.00022, Validation AUC Score: 0.95692
Final ROC AUC score for Fold 5: 0.96310

##### Training Fold 6 of 10 #####

(Max epochs: 5)
[Epoch 1] Training Loss: 0.00022, Validation AUC Score: 0.95299
Final ROC AUC score for Fold 6: 0.96278

##### Training Fold 7 of 10 #####

(Max epochs: 5)
[Epoch 1] Tr
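The training loop steps a `CosineAnnealingLR` scheduler once per epoch with `T_max=max_epochs`. As a sketch of what that does to the learning rate, the snippet below evaluates the closed-form cosine annealing formula in plain Python, assuming a minimum learning rate of 0. The function name `cosine_annealing_lr` is chosen for this illustration and is not a PyTorch API.

```python
import math


def cosine_annealing_lr(epoch, base_lr, t_max, eta_min=0.0):
    """Closed-form cosine annealing: base_lr at epoch 0, eta_min at epoch t_max."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max))


base_lr = 0.001  # `lr` passed to Adam in the notebook
t_max = 5        # `T_max=max_epochs` in the notebook

# Learning rate in effect during each of the 5 training epochs.
schedule = [cosine_annealing_lr(epoch, base_lr, t_max) for epoch in range(t_max)]
for epoch, lr in enumerate(schedule, start=1):
    print(f"epoch {epoch}: lr = {lr:.6f}")
```

The rate starts at the base value and decays smoothly, approaching the minimum as the epoch count nears `t_max`.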

In [16]:
# calculate the ROC AUC score for all Out-Of-Fold predictions
oof_roc_auc_score = roc_auc_score(train_data[target_col], oof_predictions)
print(f"{kfold_num_splits}-Fold Cross Validation ROC AUC: {oof_roc_auc_score:.5f}")

10-Fold Cross Validation ROC AUC: 0.96309
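The out-of-fold score above relies on two bookkeeping steps in the training loop: every training row is predicted once, by the model of the fold that held it out, and the test predictions accumulate as a running mean over folds. The NumPy sketch below illustrates both steps with random stand-in numbers. It uses a plain shuffled split rather than `StratifiedKFold`, and the sizes are arbitrary.

```python
import numpy as np

num_samples, num_test, num_folds = 10, 4, 5  # arbitrary sizes, for illustration
rng = np.random.default_rng(11)

# Shuffle the row indices and cut them into disjoint validation folds.
folds = np.array_split(rng.permutation(num_samples), num_folds)

oof_predictions = np.full(num_samples, np.nan)
test_predictions = np.zeros(num_test)
per_fold_test_predictions = []

for validation_indices in folds:
    # Stand-ins for a fold model's outputs (random numbers, not real predictions).
    validation_predictions = rng.random(len(validation_indices))
    fold_test_predictions = rng.random(num_test)

    # Each training row is filled once, by the fold that held it out.
    oof_predictions[validation_indices] = validation_predictions

    # Adding prediction / k for each fold accumulates the mean over folds.
    test_predictions += fold_test_predictions / num_folds
    per_fold_test_predictions.append(fold_test_predictions)

print(np.isnan(oof_predictions).any())
print(np.allclose(test_predictions, np.mean(per_fold_test_predictions, axis=0)))
```

Since no row is ever predicted by a model that trained on it, a score computed on the out-of-fold array is an estimate of performance on unseen data.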


# 7. Submission

In [17]:
# Prepare submission
submission = pd.DataFrame({'id': test_data.index, 'y': test_predictions})
submission.to_csv('submission.csv', index=False)
print('Submission file prepared.')

Submission file prepared.
