# TabM

This is a standalone usage example for the TabM project.
The easiest way to run it is [Pixi](https://pixi.sh/latest/#installation):

```shell
git clone https://github.com/yandex-research/tabm
cd tabm

# With GPU:
pixi run -e cuda jupyter-lab example.ipynb

# Without GPU:
pixi run jupyter-lab example.ipynb
```

For a full overview of the project and for non-Pixi environment setups, see the README in the repository:
https://github.com/yandex-research/tabm

This notebook is based on the original example: https://github.com/yandex-research/tabm/blob/main/example.ipynb

In [None]:
!git clone https://github.com/yandex-research/tabm
!pip install wldhx.yadisk-direct rtdl_num_embeddings

In [1]:
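import os
# Select which GPU to use. The index is machine-specific: adjust or remove this line for your setup.
os.environ['CUDA_VISIBLE_DEVICES'] = '3'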

import math
import random
import warnings

import numpy as np
import pandas as pd
import rtdl_num_embeddings  # https://github.com/yandex-research/rtdl-num-embeddings
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim
from torch import Tensor
from tqdm.auto import tqdm
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, QuantileTransformer, OrdinalEncoder

warnings.simplefilter('ignore')
from tabm_reference import Model, make_parameter_groups

In [2]:
seed = 0
random.seed(seed)
np.random.seed(seed + 1)
torch.manual_seed(seed + 2)
pass

### PyTorch settings

In [3]:
# Device
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# Automatic mixed precision (AMP)
# torch.float16 is implemented for completeness,
# but it was not tested in the project,
# so torch.bfloat16 is used by default.
amp_dtype = (
    torch.bfloat16
    if torch.cuda.is_available() and torch.cuda.is_bf16_supported()
    else torch.float16
    if torch.cuda.is_available()
    else None
)
# Changing False to True will result in faster training on compatible hardware.
amp_enabled = False and amp_dtype is not None
grad_scaler = torch.cuda.amp.GradScaler() if amp_dtype is torch.float16 else None  # type: ignore

# torch.compile
compile_model = False

# fmt: off
print(
    f'Device:        {device.type.upper()}'
    f'\nAMP:           {amp_enabled} (dtype: {amp_dtype})'
    f'\ntorch.compile: {compile_model}'
)
# fmt: on

Device:        CUDA
AMP:           False (dtype: torch.bfloat16)
torch.compile: False


### Dataset
The dataset is "Regression with an Insurance Dataset" from the Kaggle competition `playground-series-s4e12`:

https://www.kaggle.com/competitions/playground-series-s4e12/overview

If you have a Kaggle account, you can download the data using the Kaggle API:

```shell
kaggle competitions download -c playground-series-s4e12
```

In [14]:
! curl -L $(yadisk-direct https://disk.yandex.ru/d/YbkU_KfAqGtdXg) -o insurance_dataset_train.csv
! curl -L $(yadisk-direct https://disk.yandex.ru/d/_gs3p3yvp0TNRg) -o insurance_dataset_test.csv



### Data Preprocessing

In [4]:
# Load the train and test parts
train_df = pd.read_csv("./insurance_dataset_train.csv")
test_df = pd.read_csv("./insurance_dataset_test.csv")

Define lists of the numerical and categorical column names.

The `id` and `Policy Start Date` columns are excluded from the features.

In [5]:
target_col = "Premium Amount"

num_cols = ['Age', 'Annual Income', 'Number of Dependents', 'Health Score', 
            'Previous Claims', 'Vehicle Age', 'Credit Score', 'Insurance Duration']
cat_cols = ['Gender', 'Marital Status', 'Education Level', 'Occupation', 'Location',
            'Policy Type', 'Customer Feedback', 'Smoking Status', 'Exercise Frequency',
            'Property Type']

Split the training data into train and validation parts, and collect the test part.

In [6]:
# Set to True to train on a 50,000-row subsample if training is too slow
DEMO_MODE = True

In [7]:
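# Collect the number of unique values per categorical column (NaN counts as a value).
# The model needs these cardinalities later.
cat_cardinalities = train_df[cat_cols].nunique(dropna=False).to_list()

if DEMO_MODE:
    # Sample rows without replacement so that no row is duplicated
    # (duplicates could otherwise end up in both the train and val parts).
    idx = np.random.choice(len(train_df), size=min(50000, len(train_df)), replace=False)
    train_df = train_df.iloc[idx]

all_idx = np.arange(len(train_df))
# Split the indices into train and validation parts
train_idx, val_idx = train_test_split(all_idx, train_size=0.8)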

# Fill dict with all parts we have
_train_df = train_df.iloc[train_idx]
_val_df = train_df.iloc[val_idx]

data = {
    'train': {
        'x_num': _train_df[num_cols],
        'x_cat': _train_df[cat_cols],
        'y': _train_df[target_col].to_numpy().astype(np.float32)
    },
    'val': {
        'x_num': _val_df[num_cols],
        'x_cat': _val_df[cat_cols],
        'y': _val_df[target_col].to_numpy().astype(np.float32)
    }
}

# Also process test data, we'll need it in the end
test_data = {
    "x_num": test_df[num_cols],
    "x_cat": test_df[cat_cols],
    'id': test_df['id'].to_numpy()
}
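
When rows are subsampled before a train/validation split, drawing indices with replacement lets the same row land in both parts. Below is a minimal sketch of duplicate-free subsampling followed by a disjoint split. It uses only NumPy, and the helper name `subsample_and_split` and its arguments are illustrative, not part of the notebook.

```python
import numpy as np


def subsample_and_split(n_rows, sample_size, train_fraction=0.8, seed=0):
    """Return disjoint (train_idx, val_idx) arrays drawn from range(n_rows)."""
    rng = np.random.default_rng(seed)
    sample_size = min(sample_size, n_rows)
    # replace=False guarantees that every selected row index is unique.
    idx = rng.choice(n_rows, size=sample_size, replace=False)
    n_train = int(train_fraction * sample_size)
    # `idx` is already in random order, so a plain cut yields a random split.
    return idx[:n_train], idx[n_train:]


train_idx, val_idx = subsample_and_split(n_rows=1000, sample_size=200)
print(len(train_idx), len(val_idx))
print("shared rows:", np.intersect1d(train_idx, val_idx).size)
```

The resulting index arrays can be passed to `DataFrame.iloc` in the same way as the indices in the cell above.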

Define processing pipelines:

In [8]:
n_quantiles = max(min(len(train_idx) // 30, 1000), 10)

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", QuantileTransformer(n_quantiles=n_quantiles, 
                                   output_distribution='normal',
                                   subsample=10**9))
])

cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='Unknown')),
    ('encoder', OrdinalEncoder())
])
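
To make the numerical pipeline less of a black box, here is a rough sketch of two ideas behind it: the clamp used for `n_quantiles`, and a rank-based mapping of a skewed column to an approximately normal one. The function `rank_gaussianize` is a simplified illustration written for this note. It is not how `QuantileTransformer` is implemented, since scikit-learn interpolates between estimated quantiles.

```python
from statistics import NormalDist

import numpy as np


def choose_n_quantiles(n_train):
    # Same clamp as in the cell above: about one quantile per 30 rows,
    # but never fewer than 10 and never more than 1000.
    return max(min(n_train // 30, 1000), 10)


def rank_gaussianize(x):
    """Map values to roughly N(0, 1) by their rank (illustration only)."""
    x = np.asarray(x, dtype=np.float64)
    ranks = np.argsort(np.argsort(x))   # 0 .. n-1, ties broken by position
    probs = (ranks + 0.5) / len(x)      # strictly inside (0, 1)
    inv_cdf = NormalDist().inv_cdf      # inverse CDF of the standard normal
    return np.array([inv_cdf(p) for p in probs])


print(choose_n_quantiles(40_000))
print(choose_n_quantiles(100))

# A heavily skewed column becomes symmetric around zero after the mapping.
z = rank_gaussianize([1.0, 10.0, 100.0, 1000.0, 10000.0])
print(np.round(z, 3))
```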

Apply processing operations to the data

In [9]:
# The noise is added to improve the output of QuantileTransformer in some cases
noise = (
    np.random.default_rng(0)
    .normal(0.0, 1e-5, data['train']['x_num'].shape)
)

# Fit pipelines on train part
num_processor = num_pipeline.fit(data['train']['x_num'] + noise)
cat_processor = cat_pipeline.fit(data['train']['x_cat'])

# Apply the processing to all parts. Note that applying a processor converts
# each DataFrame to an np.ndarray. Make sure that x_num ends up as float32
# and x_cat as int64.

### YOUR CODE HERE

print(f"`x_num` dtype: {type(data['train']['x_num'])}\n`x_cat` dtype: {type(data['train']['x_cat'])}")

# Apply processing to test data as well

### YOUR CODE HERE


`x_num` dtype: <class 'numpy.ndarray'>
`x_cat` dtype: <class 'numpy.ndarray'>
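
Note that `type(...)` reports the container class (`numpy.ndarray`), not the element dtype that the comments ask you to check. The sketch below shows the difference and the two casts on made-up arrays that stand in for the pipeline output. The values are invented for illustration. The float64 starting point reflects the default output dtype of `QuantileTransformer` and `OrdinalEncoder`.

```python
import numpy as np

# Stand-ins for the arrays returned by `transform` (both float64 by default).
x_num = np.array([[0.1, -1.2], [0.7, 0.3]])
x_cat = np.array([[0.0, 2.0], [1.0, 0.0]])  # ordinal codes stored as floats

# Cast to the dtypes the model expects.
x_num = x_num.astype(np.float32)
x_cat = x_cat.astype(np.int64)

# `type` is the same for both arrays; `dtype` is what actually differs.
print(type(x_num).__name__, x_num.dtype)
print(type(x_cat).__name__, x_cat.dtype)
```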


Convert every part of the dataset to `torch.tensor`

In [10]:
# Convert data to tensors
data_torch = {
    part: {k: torch.as_tensor(v, device=device) for k, v in data[part].items()}
    for part in data
}

test_data_torch = {
    "x_num": torch.as_tensor(test_data["x_num"], device=device),
    "x_cat": torch.as_tensor(test_data["x_cat"], device=device)
}

### TabM model

In [11]:
# Choose one of the two configurations below.

# TabM or TabM-mini
# arch_type = 'tabm'
arch_type = 'tabm-mini'

# Use the piecewise-linear embeddings
bins = rtdl_num_embeddings.compute_bins(data_torch['train']['x_num'])
# bins = None

model = Model(
    n_num_features=len(num_cols),
    cat_cardinalities=cat_cardinalities,
    n_classes=None,
    backbone={
        'type': 'MLP',
        'n_blocks': 3 if bins is None else 2,
        'd_block': 512,
        'dropout': 0.1,
    },
    bins=bins,
    num_embeddings=(
        None
        if bins is None
        else {
            'type': 'PiecewiseLinearEmbeddings',
            'd_embedding': 16,
            'activation': False,
            'version': 'B',
        }
    ),
    arch_type=arch_type,
    k=32,
).to(device)

if compile_model:
    # NOTE
    # `torch.compile` is intentionally called without the `mode` argument
    # (mode="reduce-overhead" caused issues during training with torch==2.0.1).
    model = torch.compile(model)
    evaluation_mode = torch.no_grad
else:
    evaluation_mode = torch.inference_mode

In [12]:
print("Number of params: ", sum(p.numel() for p in model.parameters()))
model

Number of params:  373568


Model(
  (num_module): PiecewiseLinearEmbeddings(
    (linear0): LinearEmbeddings()
    (impl): _PiecewiseLinearEncodingImpl()
    (linear): _NLinear()
  )
  (cat_module): OneHotEncoding0d()
  (backbone): MLP(
    (blocks): ModuleList(
      (0): Sequential(
        (0): Linear(in_features=161, out_features=512, bias=True)
        (1): ReLU()
        (2): Dropout(p=0.1, inplace=False)
      )
      (1): Sequential(
        (0): Linear(in_features=512, out_features=512, bias=True)
        (1): ReLU()
        (2): Dropout(p=0.1, inplace=False)
      )
    )
  )
  (minimal_ensemble_adapter): ScaleEnsemble()
  (output): NLinear()
)

Since the Kaggle competition is scored with RMSLE (Root Mean Squared Logarithmic Error), let's use it as the validation metric.

In [13]:
from sklearn.metrics import root_mean_squared_log_error
def compute_rmsle(y_pred, y_true):
    y_pred = y_pred.detach().cpu().numpy()
    y_true = y_true.detach().cpu().numpy()
    return root_mean_squared_log_error(y_true, y_pred)
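
For reference, RMSLE is the root mean squared error computed on `log(1 + y)`. The sketch below reimplements it with NumPy only. The function name `rmsle` is chosen for this note. Like the scikit-learn function, it rejects negative values, so predictions that can go below zero would need to be clipped first.

```python
import numpy as np


def rmsle(y_true, y_pred):
    """sqrt(mean((log(1 + y_pred) - log(1 + y_true)) ** 2))."""
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    if (y_true < 0).any() or (y_pred < 0).any():
        raise ValueError("RMSLE is undefined for negative values")
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))


# Identical predictions give zero error.
print(rmsle([1, 2, 3], [1, 2, 3]))

# Here log(1 + y_true) = 1 and log(1 + y_pred) = 2, so the error is about 1.
print(rmsle([np.e - 1], [np.e ** 2 - 1]))
```

Because the error is measured on a log scale, it penalizes relative rather than absolute deviations.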

Let's define a class for training and evaluating the model

In [20]:
class Runner:
    """Runner that trains, evaluates, and predicts with the provided model."""
    def __init__(self, model, optimizer, loss, device, eval_metric=None, grad_scaler=None, 
                 epoch_bar=False, checkpoint_name="tabm_model.ckpt"):
        self.model = model
        self.optimizer = optimizer
        self.loss = loss
        self.device = device
        self.eval_metric = loss if eval_metric is None else eval_metric
        self.grad_scaler = grad_scaler
        self.epoch_bar = epoch_bar
        self.checkpoint_name = checkpoint_name

        self._train_mean = None
        self._train_std = None

    def compute_loss(self, y_pred, y_true):
        # TabM produces k predictions per object. Each of them must be trained separately.
        # (regression)     y_pred.shape == (batch_size, k)
        # (classification) y_pred.shape == (batch_size, k, n_classes)
        k = y_pred.shape[-1]
        return self.loss(y_pred.flatten(0, 1), y_true.repeat_interleave(k))

    def forward(self, batch_num, batch_cat=None, model=None):
        model = self.model if model is None else model
        return (
            model(batch_num, batch_cat)
            .squeeze(-1)
            .float()
        )

    def _train_step(self, y_pred, y_true, normalize_target):
        # We need to normalize target if needed and compute loss

        ### YOUR CODE HERE

        if self.grad_scaler is None:
            loss.backward()
            self.optimizer.step()
        else:
            self.grad_scaler.scale(loss).backward()
            self.grad_scaler.step(self.optimizer)
            self.grad_scaler.update()
        self.optimizer.zero_grad()
        return loss.detach().cpu().numpy().mean()
    
    def _eval_step(self, y_pred, y_true):
        # Compute the mean of the k predictions and scale the prediction
        # back into its original range

        y_pred = ### YOUR CODE HERE

        # Avoid computing gradients if train_phase is False
        with torch.set_grad_enabled(False):
            metric = self.eval_metric(y_pred, y_true)
        if isinstance(metric, torch.Tensor):
            return metric.detach().cpu().numpy()
        return metric

    def _run_epoch(self, data, batch_indices, train_phase=True, normalize_target=True, model=None):
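        # Switch the model that is actually used in this epoch to train/eval mode
        (self.model if model is None else model).train(train_phase)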
        epoch_history = []
        for batch_idx in tqdm(batch_indices, disable=not self.epoch_bar, leave=False):
            batch_num = data['x_num'][batch_idx]
            batch_cat = data['x_cat'][batch_idx]
            y_true = data['y'][batch_idx]
            y_pred = self.forward(batch_num, batch_cat, model=model)

            if train_phase:
                res = self._train_step(y_pred, y_true, normalize_target=normalize_target)
            else:
                res = self._eval_step(y_pred, y_true)
            epoch_history.append(res)
        return np.mean(epoch_history)

    def train(self, train_data, val_data, batch_size=256, n_epochs=1000, patience=16,
              eval_batch_size=None, normalize_target=True, save_best_model=False, 
              force_collect_stats=False):
        best = {
            'val': math.inf,
            'test': math.inf,
            'epoch': -1,
        }

        # Important!
        # For regression tasks it is highly recommended to standardize the training labels.
        self._train_mean = (train_data['y'].mean() 
                            if self._train_mean is None or force_collect_stats 
                            else self._train_mean)
        self._train_std = (train_data['y'].std() 
                           if self._train_std is None or force_collect_stats 
                           else self._train_std)

        # Early stopping: the training stops when
        # there are more than `patience` consecutive bad updates.
        remaining_patience = patience
        eval_batch_size = batch_size if eval_batch_size is None else eval_batch_size
        for epoch in tqdm(range(n_epochs), total=n_epochs):
            # Generate batch indices
            batch_indices = torch.randperm(len(train_data['y']), device=self.device).split(batch_size)
            # Perform train epoch
            train_score = self._run_epoch(train_data, batch_indices, train_phase=True,
                                          normalize_target=normalize_target)
            # Validate model after train epoch
            val_score = self.evaluate(val_data, eval_batch_size)

            msg = f"Epoch: {epoch} (train) {train_score:.4f} (val) {val_score:.4f}"
        
            if val_score < best['val']:
                best = {'val': val_score, 'epoch': epoch}
                remaining_patience = patience
                print(msg + "\t🌸 New best epoch! 🌸")
                if save_best_model:
                    # Passing the path lets torch.save open and close the file itself
                    torch.save(self.model, self.checkpoint_name)
            else:
                remaining_patience -= 1
                print(msg)
        
            if remaining_patience < 0:
                break
        print(f"\nTraining is complete.\nBest model: epoch {best['epoch']} metric {best['val']}")

    @evaluation_mode
    def evaluate(self, data, batch_size=256, model=None):
        batch_indices = torch.randperm(len(data['y']), device=self.device).split(batch_size)
        return self._run_epoch(data, batch_indices, train_phase=False, model=model)

    @evaluation_mode
    def predict(self, data, batch_size=256, model=None):
        batch_indices = torch.arange(len(data['x_num']), device=self.device).split(batch_size)
        preds = []
        for batch_idx in tqdm(batch_indices, disable=not self.epoch_bar, leave=False):
            batch_num = data['x_num'][batch_idx]
            batch_cat = data['x_cat'][batch_idx]
            # You need to make a prediction for the batch and store the result in `preds`
            # Don't forget to scale the prediction back into its original range
            y_pred = ### YOUR CODE HERE

            preds.extend(y_pred.detach().cpu().numpy())

        return preds
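
The class above relies on two ideas: each of the `k` predictions is trained against the same standardized target, and at evaluation time the `k` predictions are averaged and mapped back to the original target scale. The NumPy sketch below illustrates both on toy numbers. It is one possible reading of the placeholders, not the authors' reference solution, and the random array only stands in for real model output.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, k = 4, 3

y_true = np.array([100.0, 200.0, 300.0, 400.0])
train_mean, train_std = y_true.mean(), y_true.std()

# Pretend model output in standardized space: one column per ensemble member.
y_pred = rng.normal(size=(batch_size, k))

# Training: every member is compared with the same standardized target.
# np.repeat mirrors torch's repeat_interleave; reshape(-1) mirrors flatten(0, 1).
y_std = (y_true - train_mean) / train_std
per_member_loss = np.mean((y_pred.reshape(-1) - np.repeat(y_std, k)) ** 2)

# Evaluation: average the k members, then undo the standardization.
y_eval = y_pred.mean(axis=1) * train_std + train_mean

print(per_member_loss)
print(y_eval.shape)
```

Averaging happens only at evaluation time. During training, the loss is taken per member so that the `k` predictions are trained separately, as the comment in `compute_loss` requires.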

### Create runner instance and train the model

In [15]:
optimizer = torch.optim.AdamW(make_parameter_groups(model), lr=2e-3, weight_decay=3e-4)
loss = F.mse_loss
eval_metric = compute_rmsle

runner = Runner(model, optimizer, loss, device=device, eval_metric=eval_metric,
                grad_scaler=grad_scaler, epoch_bar=True)

Train model

In [16]:
runner.train(data_torch['train'], data_torch['val'], batch_size=1024, n_epochs=15, save_best_model=False)

Epoch: 0 (train) 1.0004 (val) 1.1702	🌸 New best epoch! 🌸
Epoch: 1 (train) 0.9984 (val) 1.1611	🌸 New best epoch! 🌸
Epoch: 2 (train) 0.9958 (val) 1.1669
Epoch: 3 (train) 0.9942 (val) 1.1566	🌸 New best epoch! 🌸
Epoch: 4 (train) 0.9929 (val) 1.1663
Epoch: 5 (train) 0.9900 (val) 1.1541	🌸 New best epoch! 🌸
Epoch: 6 (train) 0.9865 (val) 1.1585
Epoch: 7 (train) 0.9846 (val) 1.1549
Epoch: 8 (train) 0.9811 (val) 1.1528	🌸 New best epoch! 🌸
Epoch: 9 (train) 0.9781 (val) 1.1552
Epoch: 10 (train) 0.9758 (val) 1.1511	🌸 New best epoch! 🌸
Epoch: 11 (train) 0.9749 (val) 1.1411	🌸 New best epoch! 🌸
Epoch: 12 (train) 0.9693 (val) 1.1367	🌸 New best epoch! 🌸
Epoch: 13 (train) 0.9670 (val) 1.1553
Epoch: 14 (train) 0.9666 (val) 1.1489
Training is complete.
 Best model: epoch 12 metric 1.136688470840454
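
The early-stopping rule used in `Runner.train` can be replayed on the validation scores from the log above. This is a minimal sketch in plain Python. The helper `early_stopping` and its return tuple are illustrative, not part of the notebook.

```python
import math

# Validation RMSLE per epoch, copied from the training log above.
val_scores = [1.1702, 1.1611, 1.1669, 1.1566, 1.1663, 1.1541, 1.1585, 1.1549,
              1.1528, 1.1552, 1.1511, 1.1411, 1.1367, 1.1553, 1.1489]


def early_stopping(scores, patience):
    """Return (best_epoch, best_score, last_epoch_run)."""
    best_score, best_epoch = math.inf, -1
    remaining = patience
    for epoch, score in enumerate(scores):
        if score < best_score:
            best_score, best_epoch = score, epoch
            remaining = patience      # an improvement resets the budget
        else:
            remaining -= 1            # one more "bad" epoch
        if remaining < 0:             # more than `patience` bad epochs in a row
            return best_epoch, best_score, epoch
    return best_epoch, best_score, len(scores) - 1


print(early_stopping(val_scores, patience=16))
print(early_stopping(val_scores, patience=1))
```

With the default `patience=16`, a 15-epoch run can never exhaust the budget, so training stops only because `n_epochs` is reached. A small value such as `patience=1` would stop this run early and miss the later improvements.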


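### How to improve the result?
* Experiment with the network hyperparameters (e.g. `n_blocks`, `d_block`, `dropout`, `k`)
* Try training with and without the numerical embeddings (`bins = None`)
* Try both architectures (`arch_type = 'tabm'` and `arch_type = 'tabm-mini'`)
* (*) Tune the hyperparameters with Optuna ([source](https://optuna.org/), [examples](https://github.com/optuna/optuna-examples))
* Any other ideas?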

### What about standard approaches? Let's try CatBoost!

In [None]:
# if using colab
#!pip install catboost

In [17]:
from catboost import CatBoostRegressor

kwargs = {"task_type": 'GPU', "devices": "0"} if torch.cuda.is_available() else {}
catboost_model = CatBoostRegressor(**kwargs, iterations=500)

X_cb = train_df[cat_cols + num_cols].fillna(0)
y_cb = train_df[target_col]
catboost_model.fit(X_cb, y_cb, cat_features=cat_cols, verbose=50)

Learning rate set to 0.067434
0:	learn: 864.9865538	total: 11.8ms	remaining: 11.8s
50:	learn: 849.9454936	total: 1.27s	remaining: 23.7s
100:	learn: 845.9937747	total: 1.81s	remaining: 16.1s
150:	learn: 841.7127103	total: 2.98s	remaining: 16.7s
200:	learn: 838.3827540	total: 4.14s	remaining: 16.5s
250:	learn: 834.8489806	total: 4.79s	remaining: 14.3s
300:	learn: 831.5816795	total: 5.59s	remaining: 13s
350:	learn: 828.3659713	total: 6.81s	remaining: 12.6s
400:	learn: 825.5104448	total: 7.35s	remaining: 11s
450:	learn: 822.5501248	total: 8.58s	remaining: 10.4s
500:	learn: 819.7249818	total: 9.84s	remaining: 9.8s
550:	learn: 817.0348387	total: 11.1s	remaining: 9.01s
600:	learn: 814.4520659	total: 11.6s	remaining: 7.7s
650:	learn: 811.9044641	total: 12.8s	remaining: 6.87s
700:	learn: 809.4678153	total: 14s	remaining: 5.97s
750:	learn: 806.8525968	total: 14.8s	remaining: 4.92s
800:	learn: 804.4646745	total: 15.7s	remaining: 3.91s
850:	learn: 802.2513984	total: 17s	remaining: 2.97s
900:	learn

<catboost.core.CatBoostRegressor at 0x7fec8e456b40>

In [18]:
catboost_preds = catboost_model.predict(test_df[cat_cols + num_cols].fillna(0))

### Make submission

Tips:
* Keep a record of your best local validation results, so that you do not overfit to the validation part
* (simple) After tuning the hyperparameters and before making a submission, retrain the model on the whole training set (`insurance_dataset_train.csv`)
* (harder) Make the submission an ensemble of cross-validated models trained on the training set
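
The cross-validated ensemble from the last tip can be sketched as follows. A least-squares linear model on synthetic data stands in for TabM so that the example stays self-contained. All names here (`kfold_indices`, `fit_linear`, `predict_linear`, `true_w`) are made up for this illustration.

```python
import numpy as np


def kfold_indices(n_rows, n_folds, seed=0):
    """Shuffle row indices and cut them into n_folds disjoint parts."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_rows), n_folds)


def fit_linear(x, y):
    # Least-squares fit with a bias column (stand-in for training a model).
    design = np.column_stack([x, np.ones(len(x))])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def predict_linear(coef, x):
    return np.column_stack([x, np.ones(len(x))]) @ coef


# Synthetic regression data.
rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5])
x_train = rng.normal(size=(200, 3))
y_train = x_train @ true_w + 3.0 + rng.normal(scale=0.1, size=200)
x_test = rng.normal(size=(50, 3))

folds = kfold_indices(len(x_train), n_folds=5)
test_preds, val_rmse = [], []
for i, val_idx in enumerate(folds):
    # Train on every fold except the i-th one, validate on the i-th fold.
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    coef = fit_linear(x_train[train_idx], y_train[train_idx])
    val_error = predict_linear(coef, x_train[val_idx]) - y_train[val_idx]
    val_rmse.append(float(np.sqrt(np.mean(val_error ** 2))))
    test_preds.append(predict_linear(coef, x_test))

# The submission is the average of the per-fold test predictions.
ensemble_pred = np.mean(test_preds, axis=0)
print(np.round(val_rmse, 3))
print(ensemble_pred.shape)
```

With TabM, each fold would get its own freshly initialized model and optimizer, and the per-fold validation scores give an estimate of the expected test score.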

In [19]:
def make_submission(preds, ids, path=None):
    result = pd.DataFrame({"id": ids, "Premium Amount": preds})
    if path is None:
        return result
    result.to_csv(path, index=False)

In [None]:
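# Load the model from a checkpoint.
# The checkpoint stores the whole pickled model, so recent PyTorch versions
# need `weights_only=False`; only do this for files you trust.
# with open("tabm_model.ckpt", 'rb') as f:
#     best_model = torch.load(f, weights_only=False)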
# preds = runner.predict(test_data_torch, batch_size=2048, model=best_model)

In [None]:
# Use model from last epoch
preds = runner.predict(test_data_torch, batch_size=2048)

In [None]:
make_submission(preds, test_data["id"], "simple_submission.csv")

In [None]:
make_submission(catboost_preds, test_df["id"], "simple_catboost_submission.csv")

### Submit using API

You can submit your solution directly from Colab using the Kaggle API:


```shell
kaggle competitions submit -c playground-series-s4e12 -f simple_submission.csv -m "Message"
```