# TabM

This is a standalone usage example for the TabM project.
The easiest way to run it is [Pixi](https://pixi.sh/latest/#installation):

```shell
git clone https://github.com/yandex-research/tabm
cd tabm

# With GPU:
pixi run -e cuda jupyter-lab example.ipynb

# Without GPU:
pixi run jupyter-lab example.ipynb
```

For a full overview of the project, and for non-Pixi environment setups, see the README in the repository:
https://github.com/yandex-research/tabm

In [None]:
!git clone https://github.com/yandex-research/tabm
# `!cd` runs in a subshell and does not persist; `%cd` changes the notebook's working directory.
%cd tabm

In [None]:
! pip install rtdl_num_embeddings

In [18]:
# ruff: noqa: E402
import os
import math
import random
import warnings
from typing import Literal, NamedTuple

# Restrict PyTorch to the GPU with index 1; adjust or remove this line for your machine.
os.environ['CUDA_VISIBLE_DEVICES'] = '1'

import numpy as np
import pandas as pd
import rtdl_num_embeddings  # https://github.com/yandex-research/rtdl-num-embeddings
import scipy.special
import sklearn.datasets
import sklearn.metrics
import sklearn.model_selection
import sklearn.preprocessing
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim
from torch import Tensor
from tqdm.auto import tqdm

warnings.simplefilter('ignore')
from tabm_reference import Model, make_parameter_groups

In [2]:
seed = 0
random.seed(seed)
np.random.seed(seed + 1)
torch.manual_seed(seed + 2)
pass

### PyTorch settings

In [3]:
# Device
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# Automatic mixed precision (AMP)
# torch.float16 is implemented for completeness,
# but it was not tested in the project,
# so torch.bfloat16 is used by default.
amp_dtype = (
    torch.bfloat16
    if torch.cuda.is_available() and torch.cuda.is_bf16_supported()
    else torch.float16
    if torch.cuda.is_available()
    else None
)
# Changing False to True will result in faster training on compatible hardware.
amp_enabled = False and amp_dtype is not None
grad_scaler = torch.cuda.amp.GradScaler() if amp_dtype is torch.float16 else None  # type: ignore

# torch.compile
compile_model = False

# fmt: off
print(
    f'Device:        {device.type.upper()}'
    f'\nAMP:           {amp_enabled} (dtype: {amp_dtype})'
    f'\ntorch.compile: {compile_model}'
)
# fmt: on

Device:        CUDA
AMP:           False (dtype: torch.bfloat16)
torch.compile: False
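
For context, automatic mixed precision runs eligible operations in a lower-precision dtype inside a `torch.autocast` context. As far as this notebook shows, the `Runner` class defined later does not enter such a context, so `amp_enabled` only takes effect if one is added around the forward pass. The snippet below is a minimal sketch of the mechanism on CPU; `toy_model` is a hypothetical stand-in, not the TabM model.

```python
import torch
import torch.nn as nn

# Hypothetical toy model used only to illustrate autocast behavior.
toy_model = nn.Linear(8, 1)
x = torch.randn(4, 8)

# Inside the autocast context, eligible ops such as linear layers run in bfloat16.
with torch.autocast(device_type='cpu', dtype=torch.bfloat16):
    y_amp = toy_model(x)

# Outside the context, the same call uses the default float32.
y_fp32 = toy_model(x)

print(y_amp.dtype, y_fp32.dtype)
```

On a GPU the same pattern would use `device_type='cuda'` together with the `amp_dtype` chosen above.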


### Dataset
The dataset comes from the Kaggle competition "Regression with an Insurance Dataset":

https://www.kaggle.com/competitions/playground-series-s4e12/overview

In [14]:
! curl -L $(yadisk-direct https://disk.yandex.ru/d/YbkU_KfAqGtdXg) -o insurance_dataset_train.csv
! curl -L $(yadisk-direct https://disk.yandex.ru/d/_gs3p3yvp0TNRg) -o insurance_dataset_test.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  185M  100  185M    0     0  65.3M      0  0:00:02  0:00:02 --:--:-- 82.5M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  119M  100  119M    0     0  64.5M      0  0:00:01  0:00:01 --:--:-- 70.3M


### Data Preprocessing

In [4]:
# Load the train and test parts
train_df = pd.read_csv("./insurance_dataset_train.csv")
test_df = pd.read_csv("./insurance_dataset_test.csv")

Define lists with numerical and categorical column names.

We also exclude the `id` and `Policy Start Date` columns.

In [5]:
target_col = "Premium Amount"

num_cols = ['Age', 'Annual Income', 'Number of Dependents', 'Health Score', 
            'Previous Claims', 'Vehicle Age', 'Credit Score', 'Insurance Duration']
cat_cols = ['Gender', 'Marital Status', 'Education Level', 'Occupation', 'Location',
            'Policy Type', 'Customer Feedback', 'Smoking Status', 'Exercise Frequency',
            'Property Type']

Split the training data into train and validation parts, and collect the test part.

In [6]:
# Collect the number of unique values per categorical column (needed later to build the model).
cat_cardinalities = train_df[cat_cols].nunique(dropna=False).to_list()

all_idx = np.arange(len(train_df))
# Split the row indices into train and validation parts
train_idx, val_idx = sklearn.model_selection.train_test_split(
    all_idx, train_size=0.8
)

# Fill dict with all parts we have
_train_df = train_df.iloc[train_idx]
_val_df = train_df.iloc[val_idx]

data = {
    'train': {
        'x_num': _train_df[num_cols],
        'x_cat': _train_df[cat_cols],
        'y': _train_df[target_col].to_numpy().astype(np.float32)
    },
    'val': {
        'x_num': _val_df[num_cols],
        'x_cat': _val_df[cat_cols],
        'y': _val_df[target_col].to_numpy().astype(np.float32)
    }
}

test_data = {
    "x_num": test_df[num_cols],
    "x_cat": test_df[cat_cols],
    'id': test_df['id'].to_numpy()
}

In [7]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, QuantileTransformer, OrdinalEncoder

Define processing pipelines:

In [8]:
n_quantiles = max(min(len(train_idx) // 30, 1000), 10)

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", QuantileTransformer(n_quantiles=n_quantiles, 
                                   output_distribution='normal',
                                   subsample=10**9))
])

cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='Unknown')),
    ('encoder', OrdinalEncoder())
])
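
The cardinalities computed earlier with `nunique(dropna=False)` count a missing value as its own category, which matches the `'Unknown'` category the imputer introduces. The sketch below checks this on a hypothetical toy column (`toy` is made up for illustration); it reuses the same imputer and encoder settings as `cat_pipeline`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy column with one missing value.
toy = pd.DataFrame({
    'Occupation': pd.Series(
        ['Employed', np.nan, 'Self-Employed', 'Employed'], dtype=object
    )
})

# dropna=False counts NaN as its own category: 2 named values + NaN = 3.
cardinalities = toy.nunique(dropna=False).to_list()

# Same imputer and encoder settings as in the categorical pipeline.
toy_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='Unknown')),
    ('encoder', OrdinalEncoder()),
])
codes = toy_pipeline.fit_transform(toy).astype(np.int64)

# The largest code plus one equals the cardinality passed to the model.
print(cardinalities, codes.ravel().tolist())
```

Note that this only holds when the column actually contains missing values in the training part; a category seen only at test time would still make `OrdinalEncoder` raise with its default settings.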

Apply processing operations to the data

In [9]:
# The noise is added to improve the output of QuantileTransformer in some cases.
noise = (
    np.random.default_rng(0)
    .normal(0.0, 1e-5, data['train']['x_num'].shape)
)

# Fit pipelines on train part
num_processor = num_pipeline.fit(data['train']['x_num'] + noise)
cat_processor = cat_pipeline.fit(data['train']['x_cat'])

# Apply the processing to all parts. The transformers return np.ndarray,
# so each DataFrame is replaced by an array.
for part in data:
    data[part]['x_num'] = num_processor.transform(data[part]['x_num']).astype(np.float32)
    data[part]['x_cat'] = cat_processor.transform(data[part]['x_cat']).astype(np.int64)

# Single quotes inside the f-string keep this valid on Python versions before 3.12.
print(f"`x_num` dtype: {type(data['train']['x_num'])}\n`x_cat` dtype: {type(data['train']['x_cat'])}")

# Apply processing to test data as well

test_data["x_num"] = num_processor.transform(test_data['x_num']).astype(np.float32)
test_data["x_cat"] = cat_processor.transform(test_data['x_cat']).astype(np.int64)

`x_num` dtype: <class 'numpy.ndarray'>
`x_cat` dtype: <class 'numpy.ndarray'>
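
To see what the numerical pipeline does, the sketch below applies the same `n_quantiles` rule and `QuantileTransformer` settings to a hypothetical skewed feature (synthetic log-normal data, not the insurance dataset). The small noise mirrors the notebook; its presumed purpose is to break ties between repeated values before the quantiles are estimated.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)

# Hypothetical skewed feature: 3000 rows, one column.
x = rng.lognormal(mean=0.0, sigma=1.0, size=(3000, 1))

# Same rule as in the notebook: about one quantile per 30 rows, clipped to [10, 1000].
n_quantiles = max(min(len(x) // 30, 1000), 10)

qt = QuantileTransformer(
    n_quantiles=n_quantiles, output_distribution='normal', subsample=10**9
)

# Fit on slightly perturbed data, then transform the original values.
noise = rng.normal(0.0, 1e-5, x.shape)
qt.fit(x + noise)
x_t = qt.transform(x)

# After the transform the feature is approximately standard normal.
print(n_quantiles, float(x_t.mean()), float(x_t.std()))
```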


Convert every part of the dataset to `torch.Tensor`

In [10]:
# Convert data to tensors
data_torch = {
    part: {k: torch.as_tensor(v, device=device) for k, v in data[part].items()}
    for part in data
}

test_data_torch = {
    "x_num": torch.as_tensor(test_data["x_num"], device=device),
    "x_cat": torch.as_tensor(test_data["x_cat"], device=device)
}

### TabM model

In [56]:
# Choose one of the two configurations below.

# TabM
arch_type = 'tabm'
bins = None

# TabM-mini with the piecewise-linear embeddings.
# arch_type = 'tabm-mini'
# bins = rtdl_num_embeddings.compute_bins(data_torch['train']['x_num'])  # expects a Tensor

model = Model(
    n_num_features=len(num_cols),
    cat_cardinalities=cat_cardinalities,
    n_classes=None,
    backbone={
        'type': 'MLP',
        'n_blocks': 3 if bins is None else 2,
        'd_block': 512,
        'dropout': 0.1,
    },
    bins=bins,
    num_embeddings=(
        None
        if bins is None
        else {
            'type': 'PiecewiseLinearEmbeddings',
            'd_embedding': 16,
            'activation': False,
            'version': 'B',
        }
    ),
    arch_type=arch_type,
    k=32,
).to(device)

if compile_model:
    # NOTE
    # `torch.compile` is intentionally called without the `mode` argument
    # (mode="reduce-overhead" caused issues during training with torch==2.0.1).
    model = torch.compile(model)
    evaluation_mode = torch.no_grad
else:
    evaluation_mode = torch.inference_mode

In [57]:
model

Model(
  (cat_module): OneHotEncoding0d()
  (backbone): MLP(
    (blocks): ModuleList(
      (0-2): 3 x Sequential(
        (0): LinearEfficientEnsemble()
        (1): ReLU()
        (2): Dropout(p=0.1, inplace=False)
      )
    )
  )
  (output): NLinear()
)

In [26]:
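# `torchsummary.summary` failed on this model with an AssertionError,
# so the parameters are counted directly instead.
n_parameters = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'Trainable parameters: {n_parameters:,}')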

The Kaggle competition is scored with RMSLE (Root Mean Squared Logarithmic Error), so we use it as the validation metric.

In [58]:
class RMSLELoss(nn.Module):
    def __init__(self):
        super().__init__()
        self.mse = nn.MSELoss()
        
    def forward(self, input, target):
        # log1p(x) == log(1 + x), but is more accurate for small x.
        return torch.sqrt(self.mse(torch.log1p(input), torch.log1p(target)))
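
As a quick sanity check of the metric, here is a NumPy sketch of the same formula, sqrt(mean((log(1 + pred) - log(1 + true))^2)). The helper name `rmsle` and the numbers are made up for illustration.

```python
import numpy as np

def rmsle(y_pred, y_true):
    """Root mean squared logarithmic error for inputs greater than -1."""
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

y_true = np.array([100.0, 1000.0])
# Both predictions overshoot by 10%, so both contribute nearly equal log-errors
# even though the absolute errors differ by a factor of ten.
y_pred = np.array([110.0, 1100.0])

print(rmsle(y_pred, y_true))
print(rmsle(y_true, y_true))
```

Because the metric works on log-scaled values, it penalizes relative rather than absolute errors, which suits a long-tailed target such as an insurance premium.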

Let's define a class for training and evaluating the model

In [59]:
class Runner:
    """Runner for training, evaluating, and predicting with the provided model."""
    def __init__(self, model, optimizer, loss, device, eval_metric=None, grad_scaler=None, epoch_bar=False):
        self.model = model
        self.optimizer = optimizer
        self.loss = loss
        self.device = device
        self.eval_metric = loss if eval_metric is None else eval_metric
        self.grad_scaler = grad_scaler
        self.epoch_bar = epoch_bar

        self._train_mean = None
        self._train_std = None

    def compute_loss(self, y_pred, y_true):
        # TabM produces k predictions per object. Each of them must be trained separately.
        # (regression)     y_pred.shape == (batch_size, k)
        # (classification) y_pred.shape == (batch_size, k, n_classes)
        k = y_pred.shape[-1]
        return self.loss(y_pred.flatten(0, 1), y_true.repeat_interleave(k))

    def forward(self, batch_num, batch_cat=None):
        return (
            self.model(batch_num, batch_cat)
            .squeeze(-1)
            .float()
        )

    def _train_step(self, y_pred, y_true, normalize_target):
        if normalize_target :
            y_true = (y_true - self._train_mean) / self._train_std
        loss = self.compute_loss(y_pred, y_true)
        if self.grad_scaler is None:
            loss.backward()
            self.optimizer.step()
        else:
            self.grad_scaler.scale(loss).backward()  # type: ignore
            self.grad_scaler.step(self.optimizer)
            self.grad_scaler.update()
        self.optimizer.zero_grad()
        return loss.detach().cpu().numpy().mean()
    
    def _eval_step(self, y_pred, y_true):
        # Compute the mean of the k predictions
        y_pred = y_pred.mean(1) * self._train_std + self._train_mean
        # Avoid computing gradients if train_phase is False
        with torch.set_grad_enabled(False):
            metric = self.eval_metric(y_pred, y_true)
        return metric.detach().cpu().numpy()

    def _run_epoch(self, data, batch_indices, train_phase=True, normalize_target=True):
        self.model.train(train_phase)
        epoch_history = []
        for batch_idx in tqdm(batch_indices, disable=not self.epoch_bar, leave=False):
            batch_num = data['x_num'][batch_idx]
            batch_cat = data['x_cat'][batch_idx]
            y_true = data['y'][batch_idx]
            y_pred = self.forward(batch_num, batch_cat)

            if train_phase:
                res = self._train_step(y_pred, y_true, normalize_target=normalize_target)
            else:
                res = self._eval_step(y_pred, y_true)
            epoch_history.append(res)
        return np.mean(epoch_history)

    def train(self, train_data, val_data, batch_size=256, n_epochs=1000, patience=16,
              eval_batch_size=None, normalize_target=True, force_collect_stats=False):
        best = {
            'val': math.inf,
            'test': math.inf,
            'epoch': -1,
        }

        # Important!
        # For regression tasks it is highly recommended to standardize the training labels.
        self._train_mean = (train_data['y'].mean() 
                            if self._train_mean is None or force_collect_stats 
                            else self._train_mean)
        self._train_std = (train_data['y'].std() 
                           if self._train_std is None or force_collect_stats 
                           else self._train_std)

        # Early stopping: the training stops when
        # there are more than `patience` consecutive bad updates.
        remaining_patience = patience
        eval_batch_size = batch_size if eval_batch_size is None else eval_batch_size
        for epoch in tqdm(range(n_epochs), total=n_epochs):
            batch_indices = torch.randperm(len(train_data['y']), device=self.device).split(batch_size)
            train_score = self._run_epoch(train_data, batch_indices, train_phase=True,
                                          normalize_target=normalize_target)
            val_score = self.evaluate(val_data, eval_batch_size)

            msg = f"Epoch: {epoch} (train) {train_score:.4f} (val) {val_score:.4f}"
        
            if val_score < best['val']:
                best = {'val': val_score, 'epoch': epoch}
                remaining_patience = patience
                print(msg + "\t🌸 New best epoch! 🌸")
            else:
                remaining_patience -= 1
                print(msg)
        
            if remaining_patience < 0:
                break
        print(f"Training is complete.\n Best model: epoch {best['epoch']} metric {best['val']}")

    @evaluation_mode()
    def evaluate(self, data, batch_size=256):
        batch_indices = torch.randperm(len(data['y']), device=self.device).split(batch_size)
        return self._run_epoch(data, batch_indices, train_phase=False)

    @evaluation_mode()
    def predict(self, data, batch_size=256):
        batch_indices = torch.arange(len(data['x_num']), device=self.device).split(batch_size)
        preds = []
        for batch_idx in tqdm(batch_indices, disable=not self.epoch_bar, leave=False):
            batch_num = data['x_num'][batch_idx]
            batch_cat = data['x_cat'][batch_idx]
            y_pred = self.forward(batch_num, batch_cat)
            y_pred = y_pred.mean(1) * self._train_std + self._train_mean
            preds.extend(y_pred.detach().cpu().numpy())
        return preds
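
The way `compute_loss` and `_eval_step` treat the `k` predictions can be sketched with NumPy on hypothetical numbers: during training every one of the `k` predictions is compared with the label of its object, while at evaluation time the `k` predictions are averaged. `np.repeat` plays the role of `Tensor.repeat_interleave` here.

```python
import numpy as np

batch_size, k = 3, 4
rng = np.random.default_rng(0)

# Hypothetical outputs of a k-member ensemble for a regression task.
y_pred = rng.normal(size=(batch_size, k))
y_true = np.array([1.0, 2.0, 3.0])

# Training: flatten to (batch_size * k,) and repeat each label k times,
# so that every prediction is paired with the label of its own object.
flat_pred = y_pred.reshape(-1)
flat_true = np.repeat(y_true, k)
train_loss = np.mean((flat_pred - flat_true) ** 2)

# Evaluation: average the k predictions into one prediction per object.
ensemble_pred = y_pred.mean(axis=1)

print(flat_true.tolist())
print(flat_pred.shape, ensemble_pred.shape)
```

The training loss is therefore the mean of the `k` individual losses, not the loss of the averaged prediction.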

### Create runner instance and train the model

In [60]:
optimizer = torch.optim.AdamW(make_parameter_groups(model), lr=2e-3, weight_decay=3e-4)
loss = F.mse_loss
eval_metric = RMSLELoss()

runner = Runner(model, optimizer, loss, device=device, eval_metric=eval_metric, grad_scaler=grad_scaler, epoch_bar=True)

Train model

In [61]:
runner.train(data_torch['train'], data_torch['val'], batch_size=10048, n_epochs=15)

Epoch: 0 (train) 0.9901 (val) 1.1629	🌸 New best epoch! 🌸
Epoch: 1 (train) 0.9803 (val) 1.1563	🌸 New best epoch! 🌸
Epoch: 2 (train) 0.9765 (val) 1.1571
Epoch: 3 (train) 0.9741 (val) 1.1554	🌸 New best epoch! 🌸
Epoch: 4 (train) 0.9716 (val) 1.1548	🌸 New best epoch! 🌸
Epoch: 5 (train) 0.9690 (val) 1.1497	🌸 New best epoch! 🌸
Epoch: 6 (train) 0.9653 (val) 1.1550
Epoch: 7 (train) 0.9620 (val) 1.1512
Epoch: 8 (train) 0.9596 (val) 1.1475	🌸 New best epoch! 🌸
Epoch: 9 (train) 0.9585 (val) 1.1457	🌸 New best epoch! 🌸
Epoch: 10 (train) 0.9568 (val) 1.1398	🌸 New best epoch! 🌸
Epoch: 11 (train) 0.9558 (val) 1.1478
Epoch: 12 (train) 0.9546 (val) 1.1399
Epoch: 13 (train) 0.9532 (val) 1.1415
Epoch: 14 (train) 0.9514 (val) 1.1405
Training is complete.
 Best model: epoch 10 metric 1.1398487091064453
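
With the default `patience=16` and only 15 epochs, early stopping never triggers in this run. To illustrate the rule itself, the sketch below replays the validation scores from the log above through the same patience logic; `early_stopping` is a hypothetical helper written for this illustration, and `patience=2` is an arbitrary value.

```python
# Validation scores copied from the training log above.
val_scores = [1.1629, 1.1563, 1.1571, 1.1554, 1.1548, 1.1497, 1.1550, 1.1512,
              1.1475, 1.1457, 1.1398, 1.1478, 1.1399, 1.1415, 1.1405]

def early_stopping(scores, patience):
    """Return (best_epoch, last_epoch_run) under the Runner's patience rule."""
    best_score, best_epoch = float('inf'), -1
    remaining = patience
    for epoch, score in enumerate(scores):
        if score < best_score:
            # A new best score resets the patience counter.
            best_score, best_epoch = score, epoch
            remaining = patience
        else:
            remaining -= 1
        if remaining < 0:
            return best_epoch, epoch
    return best_epoch, len(scores) - 1

print(early_stopping(val_scores, patience=16))  # (10, 14): runs all 15 epochs
print(early_stopping(val_scores, patience=2))   # (10, 13): stops after 3 bad epochs
```

Note that the `Runner` only reports the best epoch; it does not save or restore the corresponding weights, so predictions come from the last epoch's model.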


### How to improve the result?
* Tune the network hyperparameters
* Try the piecewise-linear embeddings (`bins`) and `tabm-mini`
* Other ideas?
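
One possible idea, offered as an untested assumption rather than a result from this notebook: since the metric is RMSLE, the model could be trained on `log1p` of the target, so that the MSE training loss operates in the same space as the metric. The sketch below only shows the target transform and its inverse on hypothetical values.

```python
import numpy as np

# Hypothetical premium amounts; the real targets come from `Premium Amount`.
y = np.array([20.0, 150.0, 900.0, 4500.0])

# Move the target to log space, then standardize it as the Runner does.
y_log = np.log1p(y)
mean, std = y_log.mean(), y_log.std()
y_train = (y_log - mean) / std

# At prediction time, undo the standardization and then the log transform.
z_pred = y_train  # pretend the model predicts the training targets exactly
y_pred = np.expm1(z_pred * std + mean)

print(np.round(y_pred, 2))
```

In the notebook this would mean transforming `data[part]['y']` before training and applying `expm1` to the outputs of `predict`, with the evaluation metric adjusted accordingly.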

### What about standard approaches? Let's try CatBoost!

In [36]:
from catboost import Pool

In [51]:
from catboost import CatBoostRegressor

catboost_model = CatBoostRegressor()

X_cb = train_df[cat_cols + num_cols].fillna(0)
y_cb = train_df[target_col]
catboost_model.fit(X_cb, y_cb, cat_features=cat_cols, verbose=50)

Learning rate set to 0.125513
0:	learn: 861.9131440	total: 433ms	remaining: 7m 12s
50:	learn: 844.2595204	total: 16.1s	remaining: 4m 58s
100:	learn: 842.8552496	total: 29.4s	remaining: 4m 22s
150:	learn: 841.9049956	total: 44s	remaining: 4m 7s
200:	learn: 841.1426827	total: 59.5s	remaining: 3m 56s
250:	learn: 840.5975103	total: 1m 14s	remaining: 3m 42s
300:	learn: 840.2264102	total: 1m 29s	remaining: 3m 28s
350:	learn: 839.7998081	total: 1m 46s	remaining: 3m 16s
400:	learn: 839.4441462	total: 2m 2s	remaining: 3m 2s
450:	learn: 839.1979030	total: 2m 18s	remaining: 2m 48s
500:	learn: 838.9704194	total: 2m 34s	remaining: 2m 33s
550:	learn: 838.7191916	total: 2m 50s	remaining: 2m 18s
600:	learn: 838.4715651	total: 3m 6s	remaining: 2m 3s
650:	learn: 838.2279777	total: 3m 22s	remaining: 1m 48s
700:	learn: 837.9883450	total: 3m 38s	remaining: 1m 33s
750:	learn: 837.7801345	total: 3m 54s	remaining: 1m 17s
800:	learn: 837.5747252	total: 4m 10s	remaining: 1m 2s
850:	learn: 837.3360883	total: 4m 

<catboost.core.CatBoostRegressor at 0x7fd5f677e570>

In [52]:
catboost_preds = catboost_model.predict(test_df[cat_cols + num_cols].fillna(0))

### Make submission

In [29]:
def make_submission(preds, ids, path=None):
    result = pd.DataFrame({"id": ids, "Premium Amount": preds})
    if path is None:
        return result
    result.to_csv(path, index=False)

In [28]:
preds = runner.predict(test_data_torch, batch_size=2048)

  0%|          | 0/391 [00:00<?, ?it/s]

In [30]:
make_submission(preds, test_data["id"], "simple_submission.csv")

In [55]:
make_submission(catboost_preds, test_df["id"], "simple_catboost_submission.csv")

### Submit using API

You can submit your solution directly from Colab using the Kaggle CLI:


```shell
kaggle competitions submit -c playground-series-s4e12 -f simple_submission.csv -m "Message"
```