# TabNet: Classification in Python
### Author: Antons Ruberts


This notebok gives a complete example of how to use TabNet for binary classification example. You can download the data from [here](https://www.kaggle.com/c/ieee-fraud-detection) and follow this example along. Alternatively, you can substitute it with your own data but you'll need to make appropriate changes to the pre-processing code. 

## Imports

In [1]:
from tqdm import tqdm
import pandas as pd
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from category_encoders.woe import WOEEncoder
from sklearn.compose import ColumnTransformer


from sklearn.metrics import average_precision_score, roc_auc_score, confusion_matrix

from tensorflow_addons.activations import sparsemax
from scipy.special import softmax

import matplotlib.pyplot as plt
import seaborn as sns

# Data

The data is split into 2 files, so we need to join them together.

In [2]:
train_transactions = pd.read_csv('../train_transaction.csv')
train_identity = pd.read_csv('../train_identity.csv')

# merge two datasets
train = pd.merge(train_transactions, train_identity, on='TransactionID', how='left')
train['isFraud'].value_counts(normalize=True)

0    0.96501
1    0.03499
Name: isFraud, dtype: float64

In [3]:
test_transactions = pd.read_csv('../test_transaction.csv')
test_identity = pd.read_csv('../test_identity.csv')

# merge two datasets
test = pd.merge(test_transactions, test_identity, on='TransactionID', how='left')

test.columns = [c.replace('-', '_') for c in test.columns]

## Feature Engineering

Feature engineering is the most crucial step in the fraud detection domain. However, since it's not the main goal of this project, I'm skipping this step and leaving it to the reader.

In [4]:
# Make an hour feature from datetime stamp 
def make_hour_feature(f):
    #Creates an hour of the day feature, encoded as 0-23.  
    hours = f / (3600)        
    encoded_hours = np.floor(hours) % 24
    return encoded_hours

train['hour'] = make_hour_feature(train['TransactionDT'])
test['hour'] = make_hour_feature(test['TransactionDT'])

## Data Cleaning
Here's the cleaning that we need to do:
* Drop columns with too many missing values
* Impute numeric missing values with median
* Impute categorical missing values with "missing" category

In [5]:
cat_features = ['ProductCD', 'card1', 'card2', 'card3', 'card4', 'card5', 'card6', 
               'addr1', 'addr2', 'P_emaildomain', 'R_emaildomain', 'M1',
               'M2', 'M3', 'M4', 'M5', 'M6', 'M7', 'M8', 'M9', 'DeviceType', 'DeviceInfo',
               'id_12', 'id_13', 'id_14', 'id_15', 'id_16', 'id_17', 'id_18', 'id_19', 'id_20',
               'id_21', 'id_22', 'id_23', 'id_24', 'id_25', 'id_26', 'id_27', 'id_28', 'id_29', 'id_30',
               'id_31', 'id_32', 'id_33', 'id_34', 'id_35', 'id_36', 'id_37', 'id_38']

exclude = ['TransactionID', 'TransactionDT', 'isFraud']
num_features = [f for f in train.columns if (f not in cat_features) & (f not in exclude)]

# drop more than 90% NAs
col_na = train.isna().sum()
to_drop = col_na[(col_na / train.shape[0]) > 0.9].index

use_cols = [f for f in train.columns if f not in to_drop]
cat_features = [f for f in cat_features if f not in to_drop]
num_features = [f for f in num_features if f not in to_drop]

train[cat_features] = train[cat_features].astype(str)
train[num_features] = train[num_features].astype(np.float)
train = train[use_cols]

test[cat_features] = test[cat_features].astype(str)
test[num_features] = test[num_features].astype(np.float)
test = test[[f for f in use_cols if f != 'isFraud']]

In [6]:
# fill numeric NAs with median
median_values = train[num_features].median() 
train[num_features] = train[num_features].fillna(median_values)

# fill categorical NAs with "missing"
train[cat_features] = train[cat_features].replace("nan", "missing")

train.isna().sum().sum()

0

In [7]:
test[num_features] = test[num_features].fillna(median_values)

# fill categorical NAs with "missing"
test[cat_features] = test[cat_features].replace("nan", "missing")
test.isna().sum().sum()

0

In [55]:
train.shape

(590540, 423)

## Train/Val Split

Validation split is done using datetime column since testing will be performed on the future time period. 

In [56]:
# Split based on time
train_split = train['TransactionDT'] <= np.quantile(train['TransactionDT'], 0.9)

In [57]:
train_X = train.loc[train_split.values, num_features+cat_features]
train_y = train.loc[train_split.values,'isFraud']

val_X = train.loc[~train_split.values, num_features+cat_features]
val_y = train.loc[~train_split.values,'isFraud']

In [58]:
print(len(train_X), 'train examples')
print(len(val_X), 'validation examples')

531486 train examples
59054 validation examples


## Pre-processing

Since neural network can process only numeric data, the input should be preprocessed. This is a very simplified pre-processing pipeline, and can be replaced with something more sophisticated like category embeddings. 

* Scale the numeric features
* Encode the categorical features

In [59]:
# Numeric columns will be scaled by StandardScaler
scaler = StandardScaler()

# Categorical will be transformed using Weight of Evidence approach
woe = WOEEncoder()

column_trans = ColumnTransformer(
    [ ('scaler',scaler, num_features),
    ('woe', woe, cat_features)], remainder='passthrough', n_jobs=-1)

train_X_transformed = column_trans.fit_transform(train_X, train_y)
val_X_transformed = column_trans.transform(val_X)

In [60]:
test_X_transformed = column_trans.transform(test[num_features + cat_features])

print(train_X_transformed.shape, val_X_transformed.shape, test_X_transformed.shape)

(531486, 420) (59054, 420) (506691, 420)


In [61]:
train_X_transformed = pd.DataFrame(train_X_transformed, columns=[num_features + cat_features])
val_X_transformed = pd.DataFrame(val_X_transformed, columns=[num_features + cat_features])
test_X_trinsformed = pd.DataFrame(test_X_transformed, columns=[num_features + cat_features])

## TF Data

To make training and inference faster, we need to transform the data into TF Data object. You can find out more about it at the [TF website](https://www.tensorflow.org/guide/data).

*Large batch sizes are used per paper instructions*

In [62]:
def prepare_tf_dataset(
    X,
    batch_size,
    y = None,
    shuffle = False,
    drop_remainder = False,
):
    size_of_dataset = len(X)
    if y is not None:
        y = tf.one_hot(y.astype(int), 2)
        ds = tf.data.Dataset.from_tensor_slices((np.array(X.astype(np.float32)), y))
    else:
        ds = tf.data.Dataset.from_tensor_slices(np.array(X.astype(np.float32)))
    if shuffle:
        ds = ds.shuffle(buffer_size=size_of_dataset)
    ds = ds.batch(batch_size, drop_remainder=drop_remainder)

    autotune = tf.data.experimental.AUTOTUNE
    ds = ds.prefetch(autotune)
    return ds

train_ds = prepare_tf_dataset(train_X_transformed, 16384, train_y)
val_ds = prepare_tf_dataset(val_X_transformed, 16384, val_y)
test_ds = prepare_tf_dataset(test_X_transformed, 16384)

# Model

## TabNet

Defines all the classes/functions covered in the blog. 

In [15]:
def glu(x, n_units=None):
    """Generalized linear unit nonlinear activation."""
    return x[:, :n_units] * tf.nn.sigmoid(x[:, n_units:])

In [16]:
class FeatureBlock(tf.keras.Model):
    """
    Implementation of a FL->BN->GLU block
    """
    def __init__(
        self,
        feature_dim,
        apply_glu = True,
        bn_momentum = 0.9,
        fc = None,
        epsilon = 1e-5,
    ):
        super(FeatureBlock, self).__init__()
        self.apply_gpu = apply_glu
        self.feature_dim = feature_dim
        units = feature_dim * 2 if apply_glu else feature_dim # desired dimension gets multiplied by 2
                                                              # because GLU activation halves it

        self.fc = tf.keras.layers.Dense(units, use_bias=False) if fc is None else fc # shared layers can get re-used
        self.bn = tf.keras.layers.BatchNormalization(momentum=bn_momentum, epsilon=epsilon)

    def call(self, x, training = None):
        x = self.fc(x) # inputs passes through the FC layer
        x = self.bn(x, training=training) # FC layer output gets passed through the BN
        if self.apply_gpu: 
            return glu(x, self.feature_dim) # GLU activation applied to BN output
        return x

    
class FeatureTransformer(tf.keras.Model):
    def __init__(
        self,
        feature_dim,
        fcs = [],
        n_total = 4,
        n_shared = 2,
        bn_momentum = 0.9,
    ):
        super(FeatureTransformer, self).__init__()
        self.n_total, self.n_shared = n_total, n_shared

        kwrgs = {
            "feature_dim": feature_dim,
            "bn_momentum": bn_momentum,
        }

        # build blocks
        self.blocks = []
        for n in range(n_total):
            # some shared blocks
            if fcs and n < len(fcs):
                self.blocks.append(FeatureBlock(**kwrgs, fc=fcs[n])) # Building shared blocks by providing FC layers
            # build new blocks
            else:
                self.blocks.append(FeatureBlock(**kwrgs)) # Step dependent blocks without the shared FC layers

    def call(self, x, training = None):
        # input passes through the first block
        x = self.blocks[0](x, training=training) 
        # for the remaining blocks
        for n in range(1, self.n_total):
            # output from previous block gets multiplied by sqrt(0.5) and output of this block gets added
            x = x * tf.sqrt(0.5) + self.blocks[n](x, training=training) 
        return x

    @property
    def shared_fcs(self):
        return [self.blocks[i].fc for i in range(self.n_shared)]
    
class AttentiveTransformer(tf.keras.Model):
    def __init__(self, feature_dim):
        super(AttentiveTransformer, self).__init__()
        self.block = FeatureBlock(
            feature_dim,
            apply_glu=False,
        )

    def call(self, x, prior_scales, training=None):
        x = self.block(x, training=training)
        return sparsemax(x * prior_scales)
    
class TabNet(tf.keras.Model):
    def __init__(
        self,
        num_features,
        feature_dim,
        output_dim,
        n_step = 2,
        n_total = 4,
        n_shared = 2,
        relaxation_factor = 1.5,
        bn_epsilon = 1e-5,
        bn_momentum = 0.7,
        sparsity_coefficient = 1e-5
    ):
        super(TabNet, self).__init__()
        self.output_dim, self.num_features = output_dim, num_features
        self.n_step, self.relaxation_factor = n_step, relaxation_factor
        self.sparsity_coefficient = sparsity_coefficient

        self.bn = tf.keras.layers.BatchNormalization(
            momentum=bn_momentum, epsilon=bn_epsilon
        )

        kargs = {
            "feature_dim": feature_dim + output_dim,
            "n_total": n_total,
            "n_shared": n_shared,
            "bn_momentum": bn_momentum
        }

        # first feature transformer block is built first to get the shared blocks
        self.feature_transforms = [FeatureTransformer(**kargs)]
        self.attentive_transforms = []
            
        # each step consists out of FT and AT
        for i in range(n_step):
            self.feature_transforms.append(
                FeatureTransformer(**kargs, fcs=self.feature_transforms[0].shared_fcs)
            )
            self.attentive_transforms.append(
                AttentiveTransformer(num_features)
            )
        
        # Final output layer
        self.head = tf.keras.layers.Dense(2, activation="softmax", use_bias=False)

    def call(self, features, training = None):

        bs = tf.shape(features)[0] # get batch shape
        out_agg = tf.zeros((bs, self.output_dim)) # empty array with outputs to fill
        prior_scales = tf.ones((bs, self.num_features)) # prior scales initialised as 1s
        importance = tf.zeros([bs, self.num_features]) # importances
        masks = []

        features = self.bn(features, training=training) # Batch Normalisation
        masked_features = features

        total_entropy = 0.0

        for step_i in range(self.n_step + 1):
            # (masked) features go through the FT
            x = self.feature_transforms[step_i](
                masked_features, training=training
            )
            
            # first FT is not used to generate output
            if step_i > 0:
                # first half of the FT output goes towards the decision 
                out = tf.keras.activations.relu(x[:, : self.output_dim])
                out_agg += out
                scale_agg = tf.reduce_sum(out, axis=1, keepdims=True) / (self.n_step - 1)
                importance += mask_values * scale_agg
                

            # no need to build the features mask for the last step
            if step_i < self.n_step:
                # second half of the FT output goes as input to the AT
                x_for_mask = x[:, self.output_dim :]
                
                # apply AT with prior scales
                mask_values = self.attentive_transforms[step_i](
                    x_for_mask, prior_scales, training=training
                )

                # recalculate the prior scales
                prior_scales *= self.relaxation_factor - mask_values
                
                # multiply the second half of the FT output by the attention mask to enforce sparsity
                masked_features = tf.multiply(mask_values, features)

                # entropy is used to penalize the amount of sparsity in feature selection
                total_entropy += tf.reduce_mean(
                    tf.reduce_sum(
                        tf.multiply(-mask_values, tf.math.log(mask_values + 1e-15)),
                        axis=1,
                    )
                )
                
                # append mask values for later explainability
                masks.append(tf.expand_dims(tf.expand_dims(mask_values, 0), 3))
                
        #Per step selection masks        
        self.selection_masks = masks
        
        # Final output
        final_output = self.head(out)
        
        # Add sparsity loss
        loss = total_entropy / (self.n_step-1)
        self.add_loss(self.sparsity_coefficient * loss)
        
        return final_output, importance

## HP Tuning
#### Don't run if short on resources/time

TabNet has quite a few parameters to tune. Below you can find the ranges which I usually use in tuning. There's no one-size-fits-all parameter set, so make sure to adjust uit for your problem.

* Feature Dimension - between 32 and 512
* Number of steps - from 2 to 9
* Relaxation factor - from 1 to 3
* Sparsity Coefficiet - from 0 to 0.1

Optional parameters to tune:
* Batch Momentum - from 0.9 to 0.9999
* Class weight - from 1 to 10

In [18]:
import optuna
from optuna import Trial, visualization

def Objective(trial):
    feature_dim = trial.suggest_categorical("feature_dim", [32, 64, 128, 256, 512])
    n_step = trial.suggest_int("n_step", 2, 9, step=1)
    n_shared = trial.suggest_int("n_shared", 0, 4, step=1)
    relaxation_factor = trial.suggest_float("relaxation_factor", 1., 3., step=0.1)
    sparsity_coefficient = trial.suggest_float("sparsity_coefficient", 0.00000001, 0.1, log=True)
    bn_momentum = trial.suggest_float('bn_momentum', 0.9, 0.9999)
    tabnet_params = dict(num_features=train_X_transformed.shape[1],
                         output_dim=feature_dim,
                         feature_dim=feature_dim,
                         n_step=n_step, 
                         relaxation_factor=relaxation_factor,
                         sparsity_coefficient=sparsity_coefficient,
                         n_shared = n_shared,
                         bn_momentum = bn_momentum
                     )
    class_weight =  trial.suggest_int("class_weight", 1, 10.5, step=0.5)
    
    cbs = [tf.keras.callbacks.EarlyStopping(
            monitor="val_loss", patience=5, restore_best_weights=True
        )]
    
    tn = TabNet(**tabnet_params)
    optimizer = tf.keras.optimizers.Adam(learning_rate=0.001,clipnorm=10)
    loss = [tf.keras.losses.CategoricalCrossentropy(from_logits=False),None]
    
    tn.compile(
            optimizer,
            loss=loss)

    tn.fit(train_ds, 
          epochs=100, 
          validation_data=val_ds,
          callbacks=cbs,
          verbose=1)
    
    
    val_preds, _ =  tn.predict(val_ds)
    pr_auc = average_precision_score(val_y, val_preds[:,1])
    
    return pr_auc

study = optuna.create_study(direction="maximize", study_name='TabNet optimization')
study.optimize(Objective, n_jobs=1, n_trials=100, gc_after_trial=True, show_progress_bar=True) 

## Training

In [71]:
# Params after 1 hour of tuning
tabnet = TabNet(num_features = train_X_transformed.shape[1],
                output_dim = 128,
                feature_dim = 128,
                n_step = 2, 
                relaxation_factor= 2.2,
                sparsity_coefficient=2.37e-07,
                n_shared = 2,
                bn_momentum = 0.9245)


# Early stopping based on validation loss    
cbs = [tf.keras.callbacks.EarlyStopping(
        monitor="val_loss", patience=30, restore_best_weights=True
    )]

# Optimiser 
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001, clipnorm=10)

# Second loss in None because we also output the importances
loss = [tf.keras.losses.CategoricalCrossentropy(from_logits=False), None]

# Compile the model
tabnet.compile(optimizer,
               loss=loss)

# Train the model
tabnet.fit(train_ds, 
           epochs=1000, 
           validation_data=val_ds,
           callbacks=cbs,
           verbose=1,
          class_weight={
              0:1,
              1: 10
          })

Epoch 1/1000
Epoch 2/1000
Epoch 3/1000
Epoch 4/1000
Epoch 5/1000
Epoch 6/1000
Epoch 7/1000
Epoch 8/1000
Epoch 9/1000
Epoch 10/1000
Epoch 11/1000
Epoch 12/1000
Epoch 13/1000
Epoch 14/1000
Epoch 15/1000
Epoch 16/1000
Epoch 17/1000
Epoch 18/1000
Epoch 19/1000
Epoch 20/1000
Epoch 21/1000
Epoch 22/1000
Epoch 23/1000
Epoch 24/1000
Epoch 25/1000
Epoch 26/1000
Epoch 27/1000
Epoch 28/1000
Epoch 29/1000
Epoch 30/1000
Epoch 31/1000
Epoch 32/1000
Epoch 33/1000
Epoch 34/1000
Epoch 35/1000
Epoch 36/1000
Epoch 37/1000
Epoch 38/1000
Epoch 39/1000
Epoch 40/1000
Epoch 41/1000
Epoch 42/1000
Epoch 43/1000
Epoch 44/1000
Epoch 45/1000
Epoch 46/1000
Epoch 47/1000
Epoch 48/1000
Epoch 49/1000
Epoch 50/1000
Epoch 51/1000
Epoch 52/1000
Epoch 53/1000
Epoch 54/1000
Epoch 55/1000
Epoch 56/1000
Epoch 57/1000
Epoch 58/1000
Epoch 59/1000
Epoch 60/1000
Epoch 61/1000
Epoch 62/1000
Epoch 63/1000
Epoch 64/1000
Epoch 65/1000
Epoch 66/1000
Epoch 67/1000
Epoch 68/1000
Epoch 69/1000
Epoch 70/1000
Epoch 71/1000
Epoch 72/1000
E

<keras.callbacks.History at 0x7f0d9534b668>

In [69]:
from sklearn.metrics import roc_auc_score, average_precision_score

val_preds, val_imps = tabnet.predict(val_ds)


In [65]:
print('Test ROC AUC', np.round(roc_auc_score(val_y, val_preds[:, 1]), 4))
print('Test PR AUC', np.round(average_precision_score(val_y, val_preds[:, 1]), 4))

Test ROC AUC 0.8505
Test PR AUC 0.464


## Test Submission

In [26]:
test.head()

Unnamed: 0,TransactionID,TransactionDT,TransactionAmt,ProductCD,card1,card2,card3,card4,card5,card6,...,id_32,id_33,id_34,id_35,id_36,id_37,id_38,DeviceType,DeviceInfo,hour
0,3663549,18403224,31.95,W,10409,111.0,150.0,visa,226.0,debit,...,missing,missing,missing,missing,missing,missing,missing,missing,missing,0.0
1,3663550,18403263,49.0,W,4272,111.0,150.0,visa,226.0,debit,...,missing,missing,missing,missing,missing,missing,missing,missing,missing,0.0
2,3663551,18403310,171.0,W,4476,574.0,150.0,visa,226.0,debit,...,missing,missing,missing,missing,missing,missing,missing,missing,missing,0.0
3,3663552,18403310,284.95,W,10989,360.0,150.0,visa,166.0,debit,...,missing,missing,missing,missing,missing,missing,missing,missing,missing,0.0
4,3663553,18403317,67.95,W,18018,452.0,150.0,mastercard,117.0,debit,...,missing,missing,missing,missing,missing,missing,missing,missing,missing,0.0


In [25]:
test_preds, test_imp = tabnet.predict(test_ds)

In [30]:
submission_df  = pd.DataFrame({"TransactionID": test['TransactionID'].values,
                              'isFraud': test_preds[:, 1]})

In [31]:
submission_df.to_csv('tabnet_sumbission.csv', index=False)