# Introduction to **TabPFN** and **TabICL**

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/x-datascience-datacamp/datacamp-master/blob/main/11_tabular_foundational_models/01-tabpfn-tabicl.ipynb)

Authors: [Pedro L. C. Rodrigues](https://plcrodrigues.github.io) and [Thomas Moreau](https://tommoral.github.io)

- **TabPFN**: Hollmann et al. "Accurate predictions on small data with a tabular foundation model" (2025) [[link](https://www.nature.com/articles/s41586-024-08328-6)]
- **TabICL**: Qu et al. "TabICL: A Tabular Foundation Model for In-Context Learning on Large Data" (2025) [[link](https://arxiv.org/abs/2502.05564)]

In [10]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm

%matplotlib inline

The Python implementation of TabPFN is developed by [Prior Labs](https://docs.priorlabs.ai/overview) and follows the `scikit-learn` API.

Note, however, that using the latest version of the TabPFN foundation model requires authenticating with Hugging Face, which can be a bit messy. For this reason, we focus on TabPFN-V2, which should be more than enough here.

⚡ GPU recommended: for optimal performance, use a GPU (even older ones with ~8GB VRAM work well; 16GB is needed for some large datasets). On CPU, only small datasets (≲1000 samples) are feasible. Note that **this notebook can be run on Colab using the badge at the top**.
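
Before running the heavier cells, it can help to check whether a GPU is actually visible. The sketch below assumes PyTorch is the backend (it is normally pulled in when installing TabPFN) and falls back to CPU if it is missing; `pick_device` is a helper name introduced here, not part of TabPFN.

```python
def pick_device() -> str:
    """Return "cuda" if PyTorch sees a CUDA GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed yet (e.g. before installing tabpfn)
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Running on: {device}")
```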

First of all, install the package as shown below.

In [11]:
!pip install -U tabpfn



# Regression with TabPFN

We investigate how TabPFN can be used for regression and compare its performance against classic regressors.

In [12]:
from sklearn.utils import shuffle
from sklearn.model_selection import KFold, cross_val_score
from sklearn.datasets import fetch_california_housing, load_diabetes

from tabpfn import TabPFNRegressor
from tabpfn.constants import ModelVersion

# Load the Boston housing dataset from the UCI repository
cols = ["CRIM","ZN","INDUS","CHAS","NOX","RM","AGE","DIS","RAD","TAX","PTRATIO","B","LSTAT","MEDV"]
df_boston = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data",
                        sep='\\s+', header=None, names=cols)

print(df_boston.shape)
df_boston.head()

(506, 14)


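|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |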


In [13]:
# Convert to pure numpy arrays
X, y = df_boston.drop(columns=["MEDV"]).values, df_boston["MEDV"].values

# Choose cross-validation strategy
cv = KFold(shuffle=True, n_splits=5)

# Instantiate TabPFN for regression
# regressor_tabpfn = TabPFNRegressor()
regressor_tabpfn = TabPFNRegressor.create_default_for_version(ModelVersion.V2)
regressor_tabpfn.n_estimators = 1

scores = []
for idx_train, idx_test in tqdm(cv.split(X, y)):
    X_train, y_train = X[idx_train], y[idx_train]
    X_test, y_test = X[idx_test], y[idx_test]
    regressor_tabpfn.fit(X_train, y_train)
    scores.append(regressor_tabpfn.score(X_test, y_test))
print(np.mean(scores))

5it [00:01,  3.29it/s]

0.8938961256266159
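
The number printed above is a coefficient of determination, which is what `score` returns for scikit-learn regressors (we assume `TabPFNRegressor` follows that convention, since it mirrors the scikit-learn API). A minimal NumPy sketch of the formula, with `r2_score_manual` as an illustrative helper name:

```python
import numpy as np


def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot


y_true = np.array([3.0, 5.0, 7.0, 9.0])
perfect = r2_score_manual(y_true, y_true)  # exact predictions
constant = r2_score_manual(y_true, np.full_like(y_true, y_true.mean()))  # always the mean
print(perfect, constant)
```

A score of 1 means perfect predictions, and 0 means no better than always predicting the mean of the test targets.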





<div class="alert alert-success">
    <b>EXERCISE</b>:
     <ul>
         <li>What is happening in the <span><code>fit</code></span> method?</li>
     </ul>
</div>

Let's now see how a `RandomForestRegressor` and a `HistGradientBoostingRegressor` behave.

In [14]:
from sklearn.ensemble import RandomForestRegressor, HistGradientBoostingRegressor
regressor_rf = RandomForestRegressor()
regressor_hgbr = HistGradientBoostingRegressor()
est_dict = {'rf':regressor_rf, 'hgbr':regressor_hgbr}
for key, value in est_dict.items():
    scores = cross_val_score(value, X, y, cv=cv)
    print(key, np.mean(scores))

rf 0.8674679318196061
hgbr 0.8547686266368408
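
One caveat when comparing these numbers with the TabPFN score: `KFold(shuffle=True)` without a `random_state` reshuffles on every call to `split`, so the two evaluations do not necessarily use the same folds. A small sketch of the difference, on toy data invented for illustration (`fold_indices` is a helper name introduced here):

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(40).reshape(20, 2)  # toy data with 20 rows


def fold_indices(cv):
    """Collect the test indices of every fold as tuples."""
    return [tuple(test_idx) for _, test_idx in cv.split(X_demo)]


cv_seeded = KFold(n_splits=5, shuffle=True, random_state=0)
cv_unseeded = KFold(n_splits=5, shuffle=True)

# With an integer seed, two calls to split() give the same folds
seeded_repeatable = fold_indices(cv_seeded) == fold_indices(cv_seeded)
# Without a seed, the folds are generally different from call to call
unseeded_repeatable = fold_indices(cv_unseeded) == fold_indices(cv_unseeded)

print("seeded folds identical across calls:  ", seeded_repeatable)
print("unseeded folds identical across calls:", unseeded_repeatable)
```

Passing a fixed `random_state` to the splitter would make all models see identical folds.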


We see that TabPFN beats both baselines by a clear margin on this run (a score of about 0.89 versus 0.87 and 0.85). However, it took much more time...
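
To put a number on the time cost, one can wrap `fit` and `score` in a timer. The helper below (`timed_fit_score`, a name introduced here) works with any estimator exposing `fit` and `score`; the `MeanRegressor` class is only a tiny stand-in so that the sketch runs without extra dependencies.

```python
import time

import numpy as np


class MeanRegressor:
    """Stand-in estimator (hypothetical): always predicts the training mean."""

    def fit(self, X, y):
        self.mean_ = float(np.mean(y))
        return self

    def score(self, X, y):
        # Coefficient of determination of a constant prediction
        y = np.asarray(y, dtype=float)
        ss_res = np.sum((y - self.mean_) ** 2)
        ss_tot = np.sum((y - y.mean()) ** 2)
        return 1.0 - ss_res / ss_tot


def timed_fit_score(estimator, X_train, y_train, X_test, y_test):
    """Fit and score an estimator; return (score, elapsed seconds)."""
    start = time.perf_counter()
    estimator.fit(X_train, y_train)
    score = estimator.score(X_test, y_test)
    return score, time.perf_counter() - start


rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = rng.normal(size=100)
score, seconds = timed_fit_score(
    MeanRegressor(), X_toy[:80], y_toy[:80], X_toy[80:], y_toy[80:]
)
print(f"score={score:.3f}, time={seconds:.4f}s")
```

Replacing `MeanRegressor()` with `regressor_tabpfn`, `regressor_rf`, or `regressor_hgbr`, and the toy arrays with one train/test split of the real data, gives timings that can be compared directly.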

Let's now consider a different dataset.

In [15]:
from sklearn.datasets import fetch_california_housing
df_california, targets = fetch_california_housing(return_X_y=True, as_frame=True)
print(df_california.shape)
df_california.head()

(20640, 8)


|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 |


The dataset is bigger than the previous one, so let's see how TabPFN behaves.

In [16]:
from sklearn.model_selection import train_test_split  # skip cross-validation to save time
X, y = df_california.values, targets.values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
regressor_tabpfn.ignore_pretraining_limits = True
regressor_tabpfn.fit(X_train, y_train)
print(regressor_tabpfn.score(X_test, y_test))

0.8701703672039364


<div class="alert alert-success">
    <b>EXERCISE</b>:
     <ul>
         <li>Why do you think memory usage explodes here?</li>
     </ul>
</div>

One possible trick is to subsample the dataset and use an ensemble of TabPFN regressors as below.

In [17]:
regressor_tabpfn.ignore_pretraining_limits = True
regressor_tabpfn.n_estimators = 1
regressor_tabpfn.inference_config = {"SUBSAMPLE_SAMPLES": 500}
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
regressor_tabpfn.fit(X_train, y_train)
print(regressor_tabpfn.score(X_test, y_test))

0.7732634430672185
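
The idea of subsampling plus ensembling can also be written by hand: fit several copies of a model on different random subsets and average their predictions. This is a generic sketch of that idea, not a description of what TabPFN does internally with `SUBSAMPLE_SAMPLES`; `LeastSquares` and `subsample_ensemble_predict` are illustrative names, and the data are synthetic.

```python
import numpy as np


class LeastSquares:
    """Tiny stand-in regressor (hypothetical); any fit/predict estimator works."""

    def fit(self, X, y):
        A = np.c_[X, np.ones(len(X))]  # append an intercept column
        self.coef_ = np.linalg.lstsq(A, y, rcond=None)[0]
        return self

    def predict(self, X):
        return np.c_[X, np.ones(len(X))] @ self.coef_


def subsample_ensemble_predict(make_estimator, X_train, y_train, X_test,
                               n_members=4, n_samples=500, seed=0):
    """Average the predictions of models fitted on random row subsets."""
    rng = np.random.default_rng(seed)
    n_samples = min(n_samples, len(X_train))
    member_preds = []
    for _ in range(n_members):
        # Each member sees its own subset, drawn without replacement
        idx = rng.choice(len(X_train), size=n_samples, replace=False)
        model = make_estimator().fit(X_train[idx], y_train[idx])
        member_preds.append(model.predict(X_test))
    return np.mean(member_preds, axis=0)


rng = np.random.default_rng(42)
X_syn = rng.normal(size=(5000, 3))
y_syn = X_syn @ np.array([1.0, -2.0, 0.5]) + 3.0 + 0.1 * rng.normal(size=5000)

y_hat = subsample_ensemble_predict(
    LeastSquares, X_syn[:4000], y_syn[:4000], X_syn[4000:]
)
rmse = float(np.sqrt(np.mean((y_hat - y_syn[4000:]) ** 2)))
print(f"RMSE of the averaged predictions: {rmse:.3f}")
```

Each member only ever sees `n_samples` rows, so memory stays bounded, at the price of discarding part of the training set for each fit.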


Or even better, use a GPU :-)

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/x-datascience-datacamp/datacamp-master/blob/main/11_tabular_foundational_models/01-tabpfn-tabicl.ipynb)


In [18]:
regressor_rf = RandomForestRegressor()
regressor_hgbr = HistGradientBoostingRegressor()
est_dict = {'rf':regressor_rf, 'hgbr':regressor_hgbr}
for key, est in est_dict.items():
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
    est.fit(X_train, y_train)
    score = est.score(X_test, y_test)
    print(key, score)

rf 0.8038848075128919
hgbr 0.8346716411239447


In the slides, we saw that **TabICL** can, in principle, scale to any number of samples, due to the way that rows and columns are embedded in its architecture. So should we try to use it?

In [19]:
!pip install -U tabicl # watch out for the python version!

Collecting tabicl
  Downloading tabicl-0.1.4-py3-none-any.whl.metadata (13 kB)
Downloading tabicl-0.1.4-py3-none-any.whl (107 kB)
Installing collected packages: tabicl
Successfully installed tabicl-0.1.4


Checking the **TabICL** [documentation](https://github.com/soda-inria/tabicl) we notice that it currently does not work for regression... :'(

At least for now... ;)

<div class="alert alert-success">
    <b>EXERCISE</b>:
     <ul>
         <li>Why can TabPFN do regression out-of-the-box whereas TabICL cannot?</li>
     </ul>
</div>

# Classification with TabPFN and TabICL

Let's switch to a classification problem, first with a small dataset and then with a big one.

In [20]:
from sklearn.datasets import fetch_openml
df, target = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
X, y = df.values, target.values
df.head()

|   | pclass | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0 | 0 | 0 | 24160 | 211.3375 | B5 | S | 2.0 |  | St Louis, MO |
| 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.55 | C22 C26 | S | 11.0 |  | Montreal, PQ / Chesterville, ON |
| 2 | 1 | Allison, Miss. Helen Loraine | female | 2.0 | 1 | 2 | 113781 | 151.55 | C22 C26 | S |  |  | Montreal, PQ / Chesterville, ON |
| 3 | 1 | Allison, Mr. Hudson Joshua Creighton | male | 30.0 | 1 | 2 | 113781 | 151.55 | C22 C26 | S |  | 135.0 | Montreal, PQ / Chesterville, ON |
| 4 | 1 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0 | 1 | 2 | 113781 | 151.55 | C22 C26 | S |  |  | Montreal, PQ / Chesterville, ON |


We saw in previous classes that we cannot simply plug the Titanic dataset into standard scikit-learn estimators. First, it is necessary to pre-process the data, encode categorical features, etc. But what happens with TabPFN?

In [21]:
from sklearn.model_selection import StratifiedKFold
cv = StratifiedKFold(shuffle=True, n_splits=5)

# Instantiate TabPFN for classification
from tabpfn import TabPFNClassifier
clf_tabpfn = TabPFNClassifier.create_default_for_version(ModelVersion.V2)
clf_tabpfn.n_estimators = 1

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
clf_tabpfn.fit(X_train, y_train)
print(clf_tabpfn.score(X_test, y_test))

tabpfn-v2-classifier-finetuned-zk73skhh.(…):   0%|          | 0.00/29.0M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/37.0 [00:00<?, ?B/s]

0.9699074074074074
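
For comparison, the classical scikit-learn route requires spelling out the preprocessing. The sketch below uses a small hand-made table (not the real Titanic data) with one categorical and two numeric columns; the column choices and the `LogisticRegression` model are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy table with a categorical column and missing numeric values
toy = pd.DataFrame({
    "sex": ["female", "male", "male", "female", "male", "female", "male", "female"],
    "age": [29.0, np.nan, 2.0, 30.0, 25.0, 48.0, 39.0, np.nan],
    "fare": [211.3, 151.5, 151.5, 26.5, 7.2, 77.9, 0.0, 51.4],
})
y_toy = np.array([1, 0, 0, 1, 0, 1, 0, 1])

# Numeric columns: impute missing values, then standardize
numeric = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
# Categorical columns: impute, then one-hot encode
categorical = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"),
)
preprocess = ColumnTransformer([
    ("num", numeric, ["age", "fare"]),
    ("cat", categorical, ["sex"]),
])

model = make_pipeline(preprocess, LogisticRegression())
model.fit(toy, y_toy)
train_accuracy = model.score(toy, y_toy)
print(f"training accuracy on the toy table: {train_accuracy:.2f}")
```

Every column type has to be routed to a suitable transformer by hand, which is exactly the step that TabPFN takes care of in the cell above.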


What about TabICL?

In [22]:
from tabicl import TabICLClassifier
clf_icl = TabICLClassifier(n_estimators=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
clf_icl.fit(X_train, y_train)
print(clf_icl.score(X_test, y_test))

INFO: You are downloading 'tabicl-classifier-v1.1-0506.ckpt', the latest best-performing version of TabICL.
To reproduce results from the original paper, please use 'tabicl-classifier-v1-0208.ckpt'.

Checkpoint 'tabicl-classifier-v1.1-0506.ckpt' not cached.
 Downloading from Hugging Face Hub (jingang/TabICL-clf).



tabicl-classifier-v1.1-0506.ckpt:   0%|          | 0.00/108M [00:00<?, ?B/s]

ValueError: could not convert string to float: 'Collett, Mr. Sidney C Stuart'
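
This is the error NumPy raises when a string is cast to a float. A minimal reproduction, assuming (we have not checked the source) that TabICL converts its input to a numeric array before inference:

```python
import numpy as np

# One row mixing numbers and a free-text field, as in the Titanic table
row = np.array([1, "Collett, Mr. Sidney C Stuart", 24.0], dtype=object)

try:
    row.astype(float)  # numeric cast fails on the name column
    message = None
except ValueError as exc:
    message = str(exc)

print(message)
```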

The documentation can help us: https://github.com/soda-inria/tabicl?tab=readme-ov-file#basic-integration

In [23]:
!pip install skrub

Collecting skrub
  Downloading skrub-0.7.0-py3-none-any.whl.metadata (4.4 kB)
Downloading skrub-0.7.0-py3-none-any.whl (498 kB)
Installing collected packages: skrub
Successfully installed skrub-0.7.0


In [25]:
from skrub import TableVectorizer
from tabicl import TabICLClassifier
from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(
    TableVectorizer(),  # automatically handles various data types
    TabICLClassifier(n_estimators=8)  # beware of the default parameters!
)

X_train, X_test, y_train, y_test = train_test_split(df, target, test_size=0.33, random_state=42) # note that we pass the dataframe!
pipeline.fit(X_train, y_train)
print(pipeline.score(X_test, y_test))



0.9675925925925926


<div class="alert alert-success">
    <b>EXERCISE</b>:
     <ul>
         <li>Why can TabPFN preprocess categorical features directly while TabICL needs a pipeline?</li>
     </ul>
</div>

In [26]:
from skrub import tabular_pipeline
est = tabular_pipeline('classifier')
X_train, X_test, y_train, y_test = train_test_split(df, target, test_size=0.33, random_state=42) # note that we pass the dataframe!
est.fit(X_train, y_train)
print(est.score(X_test, y_test))

0.9629629629629629


Let's now consider a larger dataset and see how **TabICL** behaves.

In [27]:
import pandas as pd
from pathlib import Path
from urllib.request import urlretrieve

DATA_DIR = Path().parent / "data"

url = ("https://archive.ics.uci.edu/ml/machine-learning-databases"
       "/adult/adult.data")

local_filename = DATA_DIR / Path(url).name
if not local_filename.exists():
    print("Downloading Adult Census datasets from UCI")
    DATA_DIR.mkdir(exist_ok=True)
    urlretrieve(url, local_filename)

names = ("age, workclass, fnlwgt, education, education-num, "
         "marital-status, occupation, relationship, race, sex, "
         "capital-gain, capital-loss, hours-per-week, "
         "native-country, income").split(', ')
df = pd.read_csv(local_filename, names=names)
df = df.rename(columns={'income': 'class'})

columns_to_plot = [
    "age",
    "education-num",
    "capital-loss",
    "capital-gain",
    "hours-per-week",
    "class",
]
df = df[columns_to_plot]
print(df.shape)
df.head()

Downloading Adult Census datasets from UCI
(32561, 6)


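|   | age | education-num | capital-loss | capital-gain | hours-per-week | class |
|---|---|---|---|---|---|---|
| 0 | 39 | 13 | 0 | 2174 | 40 | <=50K |
| 1 | 50 | 13 | 0 | 0 | 13 | <=50K |
| 2 | 38 | 9 | 0 | 0 | 40 | <=50K |
| 3 | 53 | 7 | 0 | 0 | 40 | <=50K |
| 4 | 28 | 13 | 0 | 0 | 40 | <=50K |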


In [28]:
target_name = "class"
target = df[target_name]
X, y = df.iloc[:,:-1].values, target.values
y = (y == ' <=50K').astype(np.int8)
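
The leading space in the label string above is not a typo: the raw file separates fields with a comma followed by a space, and `pd.read_csv` keeps that space unless `skipinitialspace=True` is passed. A small sketch on two made-up rows in the same format:

```python
import io

import pandas as pd

# Two made-up rows using the same ", " separator as adult.data
raw = "39, State-gov, <=50K\n50, Self-emp-not-inc, >50K\n"
names = ["age", "workclass", "class"]

df_raw = pd.read_csv(io.StringIO(raw), names=names)
df_clean = pd.read_csv(io.StringIO(raw), names=names, skipinitialspace=True)

labels_raw = df_raw["class"].tolist()      # leading space kept
labels_clean = df_clean["class"].tolist()  # leading space stripped
print(labels_raw, labels_clean)
```

With the default settings used in this notebook, comparing against the label without the leading space would silently yield an all-zero target.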

In [31]:
clf_icl = TabICLClassifier(n_estimators=8)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
clf_icl.fit(X_train, y_train)
print(clf_icl.score(X_test, y_test))

0.8219802717290154


In [30]:
from sklearn.ensemble import HistGradientBoostingClassifier

clf_hgbr = HistGradientBoostingClassifier()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
clf_hgbr.fit(X_train, y_train)
print(clf_hgbr.score(X_test, y_test))

0.8400335008375209
