**To train neural networks faster, you need to enable GPUs for the notebook:**
* Navigate to Edit→Notebook Settings
* select GPU from the Hardware Accelerator drop-down

# Setup

## Installation

In [None]:
!pip install pytabkit
!pip install openml

## Getting a dataset

In [1]:
import openml
from sklearn.model_selection import train_test_split

task = openml.tasks.get_task(359946, download_splits=False) # pol dataset
dataset = openml.datasets.get_dataset(task.dataset_id, download_data=False)
X, y, categorical_indicator, attribute_names = dataset.get_data(
    dataset_format='dataframe',
    target=task.target_name
)
# X, _, y, _ = train_test_split(X, y, train_size=0.1, random_state=0)  # subsample data

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

CPU times: user 2 μs, sys: 0 ns, total: 2 μs
Wall time: 4.53 μs


  dataset = get_dataset(task.dataset_id, *dataset_args, **get_dataset_kwargs)
  dataset = openml.datasets.get_dataset(task.dataset_id, download_data=False)


# Using RealMLP

In [7]:
%%time
from pytabkit.models.sklearn.sklearn_interfaces import RealMLP_TD_Classifier
from sklearn.metrics import root_mean_squared_error

model = RealMLP_TD_Classifier()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
rmse = root_mean_squared_error(y_test, y_pred)
print(f"RMSE of RealMLP: {rmse}")

RMSE of RealMLP: 4.5089540546191715
CPU times: user 59.4 s, sys: 128 ms, total: 59.5 s
Wall time: 7.47 s


## With bagging
It is possible to do bagging (ensembling of models on 5-fold cross-validation) simply by passing `n_cv=5` to the constructor. Note that it doesn't take 5x as long because of vectorized training. 

In [8]:
%%time
from pytabkit.models.sklearn.sklearn_interfaces import RealMLP_TD_Classifier
from sklearn.metrics import root_mean_squared_error

model = RealMLP_TD_Classifier(n_cv=5)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
rmse = root_mean_squared_error(y_test, y_pred)
print(f"RMSE of RealMLP with bagging: {rmse}")

RMSE of RealMLP with bagging: 4.73849483837783
CPU times: user 2min 31s, sys: 15.6 s, total: 2min 46s
Wall time: 20.8 s


## With hyperparameter optimization
It is possible to do hyperparameter optimization directly inside a sklearn interface by using the `RealMLP_HPO_Regressor` interface.
This is also available for classification, and for other models, for instance `LGBM_HPO_Classifier` or `LGBM_HPO_TPE_Classifier` (to use the Tree-structured Parzen Estimator algorithm).

In [3]:
%%time
from pytabkit.models.sklearn.sklearn_interfaces import RealMLP_HPO_Regressor

n_hyperopt_steps = 10 # small number for demonstration purposes
model = RealMLP_HPO_Regressor(n_hyperopt_steps=n_hyperopt_steps)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
rmse = root_mean_squared_error(y_test, y_pred)
print(f"RMSE of RealMLP with {n_hyperopt_steps} steps HPO: {rmse}")

/home/david/.local/share/hatch/env/virtual/pytabkit/4Jss_1K9/pytabkit/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py:54: Detected KeyboardInterrupt, attempting graceful shutdown...
/home/david/.local/share/hatch/env/virtual/pytabkit/4Jss_1K9/pytabkit/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py:54: Detected KeyboardInterrupt, attempting graceful shutdown...
/home/david/.local/share/hatch/env/virtual/pytabkit/4Jss_1K9/pytabkit/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py:54: Detected KeyboardInterrupt, attempting graceful shutdown...
/home/david/.local/share/hatch/env/virtual/pytabkit/4Jss_1K9/pytabkit/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py:54: Detected KeyboardInterrupt, attempting graceful shutdown...
/home/david/.local/share/hatch/env/virtual/pytabkit/4Jss_1K9/pytabkit/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py:54: Detected KeyboardInterrupt, attempting graceful shutdown...
/home/david/.lo

RMSE of RealMLP with 10 steps HPO: 6.450466632843018
CPU times: user 2min 31s, sys: 921 ms, total: 2min 32s
Wall time: 19.3 s


/home/david/.local/share/hatch/env/virtual/pytabkit/4Jss_1K9/pytabkit/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py:54: Detected KeyboardInterrupt, attempting graceful shutdown...


# Using improved default for tree based models

`TD` stands for *tuned defaults*, which are the improved default we propose. `D` stands for *defaults*, which are the libraries defaults.

In [9]:
%%time
from pytabkit.models.sklearn.sklearn_interfaces import CatBoost_TD_Regressor, CatBoost_D_Regressor, LGBM_TD_Regressor, LGBM_D_Regressor, XGB_TD_Regressor, XGB_D_Regressor

for model in [CatBoost_TD_Regressor(), CatBoost_D_Regressor(), LGBM_TD_Regressor(), LGBM_D_Regressor(), XGB_TD_Regressor(), XGB_D_Regressor()]:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    rmse = root_mean_squared_error(y_test, y_pred)
    print(f"RMSE of {model.__class__.__name__}: {rmse}")

Fitting CatBoost
RMSE of CatBoost_TD_Regressor: 7.892775535583496
Fitting CatBoost
RMSE of CatBoost_D_Regressor: 8.191725730895996
Fitting LightGBM
RMSE of LGBM_TD_Regressor: 7.906896114349365
Fitting LightGBM
RMSE of LGBM_D_Regressor: 7.977582931518555
RMSE of XGB_TD_Regressor: 8.610333442687988
RMSE of XGB_D_Regressor: 8.964423179626465
CPU times: user 51.8 s, sys: 6.03 s, total: 57.8 s
Wall time: 4.29 s


# Ensembling tuned defaults of tree-based methods and RealMLP: a very good baseline 

In [10]:
%%time
from pytabkit.models.sklearn.sklearn_interfaces import Ensemble_TD_Regressor

model = Ensemble_TD_Regressor()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
rmse = root_mean_squared_error(y_test, y_pred)
print(f"RMSE of Ensemble_TD_Regressor: {rmse}")

Fitting LightGBM
Fitting CatBoost
RMSE of Ensemble_TD_Regressor: 4.701683521270752
CPU times: user 1min 53s, sys: 5.34 s, total: 1min 58s
Wall time: 11.6 s
