# TabPFN

TabPFN is a neural network that has learned to make predictions on tabular data. This is the original CUDA-supporting PyTorch implementation.

### Installation
`pip install tabpfn`


### TabPFN usage
TabPFN is different from other methods you might know for tabular classification. Here, we list some tips and tricks that might help you understand how to use it best.

* Do not preprocess inputs to TabPFN. TabPFN preprocesses inputs internally. It applies a z-score normalization (`(x - train_x.mean()) / train_x.std()`) per feature, fitted on the training set, and log-scales outliers heuristically. Finally, TabPFN applies a PowerTransform to all features for every second ensemble member. This internal preprocessing matters because it keeps the real-world dataset within the distribution of the synthetic datasets seen during training. So, to get the best results, do not apply a PowerTransformation to the inputs yourself.

* TabPFN expects scalar values only (you need to encode categoricals as integers, e.g. with OrdinalEncoder). It works best on data that contains no categorical features and no NaN values.

* TabPFN ensembles multiple input encodings by default. It feeds a different index rotation of the features and labels to the model for each ensemble member. You can control the ensembling with `TabPFNClassifier(..., N_ensemble_configurations=?)`.

* TabPFN does not use any statistics from the test set. That means predicting each test example one-by-one will yield the same result as feeding the whole test set together.

* TabPFN is differentiable in principle; only the pre-processing is not, because it relies on numpy.
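
The normalization and test-set independence described in the tips above can be illustrated with a minimal NumPy sketch. It is not the library's internal code: the helper names `fit_zscore` and `apply_zscore` are hypothetical, and the sketch leaves out the outlier handling and the PowerTransform. It only shows that statistics fitted on the training set give the same result whether test rows are transformed together or one at a time.

```python
import numpy as np


def fit_zscore(train_x):
    """Per-feature mean and standard deviation, computed on the training set only."""
    mean = train_x.mean(axis=0)
    std = train_x.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # avoid division by zero for constant features
    return mean, std


def apply_zscore(x, mean, std):
    """Normalize x with statistics that were fitted on the training set."""
    return (x - mean) / std


rng = np.random.default_rng(0)
train_x = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
test_x = rng.normal(loc=5.0, scale=2.0, size=(10, 3))

mean, std = fit_zscore(train_x)

# Whole test set at once versus one row at a time: same statistics, same result.
batch = apply_zscore(test_x, mean, std)
one_by_one = np.vstack([apply_zscore(row[None, :], mean, std) for row in test_x])
print(np.allclose(batch, one_by_one))
```

Because no statistic is ever computed from `test_x`, the size or composition of the test batch cannot change an individual prediction's input.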


### Training the TabPFN
In the prior-fitting phase, the TabPFN is trained on samples generated from a novel prior specifically designed for tabular data. The training process involves using a 12-layer Transformer and synthetic datasets. The training is computationally expensive, requiring significant time and computational resources. However, it is a one-time offline step done during algorithm development. The trained TabPFN model is then used for all subsequent experiments and evaluations.
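
As a hedged sketch of what prior-fitting optimizes, the objective below follows the general prior-data fitted network formulation. The notation is an assumption introduced here and does not appear in this README: $p(\mathcal{D})$ is the prior over synthetic datasets, $q_\theta$ is the Transformer's predictive distribution, and each sampled dataset is split into a training part and a held-out example.

```latex
\mathcal{L}_{\mathrm{PFN}}(\theta)
  = \mathbb{E}_{D_{\mathrm{train}} \cup \{(x_{\mathrm{test}},\, y_{\mathrm{test}})\} \sim p(\mathcal{D})}
    \left[ -\log q_\theta\!\left(y_{\mathrm{test}} \mid x_{\mathrm{test}},\, D_{\mathrm{train}}\right) \right]
```

Minimizing this cross-entropy over many sampled datasets is what lets the trained network approximate the posterior predictive distribution at inference time.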


### Inference with the TabPFN
During the inference phase, the TabPFN approximates the Posterior Predictive Distribution (PPD) for the dataset prior. It captures the marginal predictions across different spaces of Structural Causal Models (SCMs) and Bayesian Neural Networks (BNNs), with a focus on simplicity and causal explanations for the data. Predictions can be obtained with a single forward pass of the TabPFN, or with an ensemble of 32 forward passes on modified versions of the dataset. These modifications involve power transformations and index rotations of feature columns and class labels.
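
The index-rotation part of the ensemble can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: `toy_predict_proba` is a hypothetical nearest-centroid stand-in for one forward pass, `rotated_ensemble` is a hypothetical helper, and the power transformations are omitted. Each member sees the feature columns and class labels cyclically shifted, and its probabilities are shifted back before averaging.

```python
import numpy as np


def toy_predict_proba(X_train, y_train, X_test, n_classes):
    """Hypothetical stand-in for one forward pass (NOT the TabPFN model):
    softmax over negative distances to the class centroids."""
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in range(n_classes)])
    dists = np.linalg.norm(X_test[:, None, :] - centroids[None, :, :], axis=2)
    logits = -dists
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)


def rotated_ensemble(X_train, y_train, X_test, n_classes, n_members):
    """Average predictions over cyclic rotations of feature columns and class labels."""
    n_features = X_train.shape[1]
    total = np.zeros((X_test.shape[0], n_classes))
    for m in range(n_members):
        f_shift = m % n_features                  # feature-column rotation
        c_shift = (m // n_features) % n_classes   # class-label rotation
        X_tr = np.roll(X_train, f_shift, axis=1)
        X_te = np.roll(X_test, f_shift, axis=1)
        y_rot = (y_train + c_shift) % n_classes
        probs_rot = toy_predict_proba(X_tr, y_rot, X_te, n_classes)
        # Undo the label rotation so column k again refers to original class k.
        total += np.roll(probs_rot, -c_shift, axis=1)
    return total / n_members


rng = np.random.default_rng(0)
y_train = np.arange(60) % 3
X_train = rng.normal(size=(60, 4)) + y_train[:, None]  # class-dependent shift
X_test = rng.normal(size=(5, 4))

probs = rotated_ensemble(X_train, y_train, X_test, n_classes=3, n_members=12)
single = toy_predict_proba(X_train, y_train, X_test, 3)
```

The centroid stand-in is invariant to column and label order, so `probs` matches `single` here, which is a convenient check that the rotations are undone correctly. A model that is sensitive to those orderings would produce a different output per member, and that is where averaging helps.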

### A Prior for Tabular Data
The performance of the TabPFN relies on a suitable prior designed for tabular data. The prior incorporates distributions instead of point estimates for most hyperparameters. The notion of simplicity is at the core of the prior, aligning with Occam’s Razor and the Speed Prior. The prior leverages SCMs and BNNs as fundamental mechanisms for generating diverse data. SCMs capture causal relationships among columns in tabular data, while BNNs offer flexibility in modeling complex patterns. The prior also accounts for peculiarities of tabular data, such as correlated and categorical features, exponentially scaled data, and missing values.
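
To make the SCM idea concrete, here is a deliberately simplified sketch of drawing one synthetic dataset from a random causal graph. Everything in it is an assumption for illustration: the function name `sample_scm_dataset`, the edge probability, the `tanh` mechanism, and the Gaussian noise are not taken from the actual TabPFN prior, which is considerably richer.

```python
import numpy as np


def sample_scm_dataset(n_samples, n_nodes, n_features, rng):
    """Hypothetical toy SCM: nodes are in topological order, and each node is a
    noisy nonlinear function of a random subset of the earlier nodes."""
    # Strictly upper-triangular mask: an edge i -> j exists only if i < j (a DAG).
    mask = np.triu(rng.random((n_nodes, n_nodes)) < 0.5, k=1)
    weights = rng.normal(size=(n_nodes, n_nodes)) * mask

    nodes = np.zeros((n_samples, n_nodes))
    for j in range(n_nodes):
        noise = rng.normal(size=n_samples)
        parents = nodes @ weights[:, j]  # only earlier nodes carry nonzero weight
        nodes[:, j] = np.tanh(parents) + noise

    # Pick random nodes as observed features and one further node as the target.
    chosen = rng.choice(n_nodes, size=n_features + 1, replace=False)
    X = nodes[:, chosen[:-1]]
    y = nodes[:, chosen[-1]]
    return X, y


rng = np.random.default_rng(0)
X, y = sample_scm_dataset(n_samples=128, n_nodes=10, n_features=4, rng=rng)
print(X.shape, y.shape)
```

Because the features and the target are read off the same graph, some features cause the target, some are caused by it, and others are only correlated through shared ancestors.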

### Fundamentally Probabilistic Models
Unlike traditional models that rely on point estimates for hyperparameters, the TabPFN allows for full Bayesian treatment of hyperparameters. By defining a probability distribution over the hyperparameter space, including BNN architectures, the TabPFN integrates over the space and model weights. This probabilistic modeling approach extends to a mixture of hyperparameters and distinct priors, combining both SCMs and BNNs.
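
The difference between a point estimate and a distribution over hyperparameters can be sketched as below. The hyperparameter names, ranges, and the even SCM/BNN mixture are hypothetical placeholders, not the values used to train TabPFN. The point is only that every synthetic dataset would be generated from a freshly sampled configuration.

```python
import numpy as np


def sample_prior_config(rng):
    """Draw one hypothetical hyperparameter configuration from distributions
    instead of fixing a single value per hyperparameter."""
    family = rng.choice(["scm", "bnn"])  # mixture over distinct priors
    return {
        "family": str(family),
        "n_layers": int(rng.integers(1, 7)),  # uniform over 1..6
        # Log-uniform draws spread probability mass evenly across scales.
        "hidden_size": int(np.exp(rng.uniform(np.log(4), np.log(256)))),
        "noise_std": float(np.exp(rng.uniform(np.log(1e-3), np.log(1.0)))),
    }


rng = np.random.default_rng(0)
configs = [sample_prior_config(rng) for _ in range(1000)]
print(configs[0])
```

A network trained on data from many such draws is, in effect, fitted to the average over the whole hyperparameter distribution rather than to one chosen setting.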

### Multi-Class Prediction
To generate multi-class prediction targets, the scalar labels obtained from the described priors are converted into discrete class labels. This is done by dividing the range of the scalar labels into intervals, each of which corresponds to one class label. Since the intervals need not contain equal numbers of samples, this procedure also yields class-imbalanced datasets automatically.
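
A minimal sketch of this interval mapping is shown below. It assumes that the interval boundaries are drawn at random from the observed scalar values; the text above does not specify how boundaries are chosen, so treat that choice and the helper name `scalar_to_classes` as hypothetical.

```python
import numpy as np


def scalar_to_classes(y_scalar, n_classes, rng):
    """Map scalar targets to class labels 0..n_classes-1 via random interval boundaries."""
    # Assumption: boundaries are sampled from the observed values themselves.
    boundaries = np.sort(rng.choice(y_scalar, size=n_classes - 1, replace=False))
    # The label is the index of the interval that each value falls into.
    return np.searchsorted(boundaries, y_scalar, side="right")


rng = np.random.default_rng(0)
y_scalar = rng.normal(size=200)
labels = scalar_to_classes(y_scalar, n_classes=4, rng=rng)
print(np.bincount(labels, minlength=4))  # class counts are generally unequal
```

Randomly placed boundaries produce intervals of different widths, so the resulting class frequencies differ from dataset to dataset.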


In [1]:
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import xgboost as xgb

from tabpfn import TabPFNClassifier

In [2]:
def tabpfn_classifier(X_train, y_train, X_test):
    # N_ensemble_configurations controls the number of model predictions that are
    # ensembled with feature and class rotations (see our work for details).
    # When N_ensemble_configurations > #features * #classes, no further averaging is applied.
    tab_pfn_model = TabPFNClassifier(device='cpu', N_ensemble_configurations=32)
    tab_pfn_model.fit(X_train, y_train)
    y_pred, p_pred = tab_pfn_model.predict(X_test, return_winning_probability=True)
    return y_pred, p_pred

def xgboost_classifier(X_train, y_train, X_test):
    xgb_model = xgb.XGBClassifier()
    xgb_model.fit(X_train, y_train)
    y_pred = xgb_model.predict(X_test)
    return y_pred

### Load data

In [3]:
X, y = load_breast_cancer(return_X_y=True)

### Train classifier

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# TabPFN model
y_pred, p_pred = tabpfn_classifier(X_train, y_train, X_test)
tabpfn_accuracy = 100 * accuracy_score(y_test, y_pred)


# XGBoost model
y_pred = xgboost_classifier(X_train, y_train, X_test)
xgb_accuracy = 100 * accuracy_score(y_test, y_pred)

We have to download the TabPFN, as there is no checkpoint at  /home/kchauhan/miniconda3/envs/tabpfn/lib/python3.9/site-packages/tabpfn/models_diff/prior_diff_real_checkpoint_n_0_epoch_100.cpkt
It has about 100MB, so this might take a moment.




### Results

* TabPFN model accuracy = 97.872%
* XGBoost model accuracy = 96.809%
* TabPFN outperforms XGBoost on this train/test split

In [5]:
print(f'Accuracy (TabPFN) = {tabpfn_accuracy:.3f}%')
print(f'Accuracy (XGBoost) = {xgb_accuracy:.3f}%')

Accuracy (TabPFN) = 97.872%
Accuracy (XGBoost) = 96.809%
