This repository accompanies the paper "A Data-Centric Perspective on Evaluating Machine Learning Models for Tabular Data" https://arxiv.org/abs/2407.02112. It contains the proposed evaluation framework, consisting of datasets from machine learning competitions and expert-level solutions for each task. The framework enables researchers to evaluate machine learning models with realistic preprocessing pipelines, going beyond the overly standardized evaluation setups typically used in academia.
If you use our evaluation framework, please cite the following bib entry:
@article{tschalzev2024data,
  title={A Data-Centric Perspective on Evaluating Machine Learning Models for Tabular Data},
  author={Tschalzev, Andrej and Marton, Sascha and L{\"u}dtke, Stefan and Bartelt, Christian and Stuckenschmidt, Heiner},
  journal={arXiv preprint arXiv:2407.02112},
  year={2024}
}
- Create a new Python 3.11.7 environment and install the dependencies from 'requirements.txt'. (Currently, only Linux systems are supported.)
- A Kaggle account and the Kaggle API are required to use our framework. If necessary, create a Kaggle account. Then follow https://www.kaggle.com/docs/api to create an API token, download it, and place the 'kaggle.json' file containing your API token in the directory '~/.kaggle/'. (A quick verification snippet is shown after this list.)
- Run download_datasets.py. If necessary, visit the Kaggle competition websites, accept the competition rules, and rerun the script. More details below.
- Run run_experiment.py.
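As a quick check that the API token is picked up correctly, the following minimal Python snippet can be used. It assumes the official `kaggle` package is installed (it is needed for the Kaggle API in any case); `authenticate()` and `competitions_list()` are standard calls of the Kaggle API client.

```python
# Minimal sanity check that the Kaggle API token in ~/.kaggle/kaggle.json works.
# Assumes the official 'kaggle' package is installed.
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()  # raises an error if ~/.kaggle/kaggle.json is missing or invalid

# List a competition to verify that the credentials are accepted by the server.
print(api.competitions_list(search="mercedes-benz-greener-manufacturing"))
```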
To download the datasets for the evaluation framework, run download_datasets.py. Before downloading, the Kaggle API must be configured correctly (see https://www.kaggle.com/docs/api), and the rules of each included competition must be accepted on its Kaggle page. An illustration of the underlying API calls follows the list of competitions below.
The following competitions are included:
- https://www.kaggle.com/competitions/mercedes-benz-greener-manufacturing
- https://www.kaggle.com/competitions/santander-value-prediction-challenge
- https://www.kaggle.com/competitions/amazon-employee-access-challenge
- https://www.kaggle.com/competitions/otto-group-product-classification-challenge
- https://www.kaggle.com/competitions/santander-customer-satisfaction
- https://www.kaggle.com/competitions/bnp-paribas-cardif-claims-management
- https://www.kaggle.com/competitions/santander-customer-transaction-prediction
- https://www.kaggle.com/competitions/homesite-quote-conversion
- https://www.kaggle.com/competitions/ieee-fraud-detection
- https://www.kaggle.com/competitions/porto-seguro-safe-driver-prediction
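For reference, the sketch below shows roughly what downloading and extracting a single competition via the Kaggle API looks like. It is only an illustration: download_datasets.py handles all competitions and manages the actual paths, and the 'data/' directory used here is an assumption, not necessarily the path used by the script.

```python
# Illustration only: download_datasets.py performs these steps for all competitions.
# The 'data/' directory is a placeholder, not necessarily the path used by the script.
import zipfile
from pathlib import Path
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()

competition = "mercedes-benz-greener-manufacturing"
target_dir = Path("data") / competition
target_dir.mkdir(parents=True, exist_ok=True)

# Fails with a 403 error until the competition rules are accepted on the Kaggle website.
api.competition_download_files(competition, path=str(target_dir), quiet=False)

# Extract whatever archive(s) were downloaded.
for archive_path in target_dir.glob("*.zip"):
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target_dir)
```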
To run an experiment, use and adapt 'run_experiment.py'. The script loads a predefined configuration file and runs the modeling pipeline for the specified model, hyperparameter optimization (HPO) regime, dataset, and preprocessing configuration. The configs directory contains one configuration file per model currently implemented in the framework, plus one example configuration. The example configuration runs a CatBoost model with default hyperparameters on the mercedes-benz-greener-manufacturing (MBGM) dataset with standardized (minimalistic) preprocessing. Each model config file specifies the respective model with extensive HPO on the MBGM dataset and standardized preprocessing and can be adapted to reproduce all settings reported in the paper. To help with this, we give an overview of the most important configuration choices (a short adaptation sketch follows the list):
- [dataset][dataset_name]: One of {'mercedes-benz-greener-manufacturing', 'santander-value-prediction-challenge', 'amazon-employee-access-challenge', 'otto-group-product-classification-challenge', 'santander-customer-satisfaction', 'bnp-paribas-cardif-claims-management', 'santander-customer-transaction-prediction', 'homesite-quote-conversion', 'ieee-fraud-detection', 'porto-seguro-safe-driver-prediction'}
- [dataset][preprocess_type]: One of {expert, minimalistic, null} - 'minimalistic' corresponds to the standardized preprocessing pipeline, 'expert' to the feature engineering pipeline
- [dataset][use_test]: One of {false, true} - If true, the test-time adaptation pipeline is executed; if false, the feature engineering pipeline is executed. This parameter only takes effect when preprocess_type is 'expert'
- [model][model_name]: One of {XGBoost, CatBoost, LightGBM, ResNet, FTTransformer, MLP-PLR, GRANDE, AutoGluon}
- [model][gpus]: Which GPUs of the machine or cluster to use
- [model][folds_parallel]: How many CV folds to train in parallel on one GPU. For small datasets, running cross-validation folds in parallel greatly reduces the overall training time. Consider increasing the single-GPU parallelization depending on your hardware.
- [hpo][n_trials]: Number of HPO trials: 100 for the extensive HPO setting, 20 for light HPO, and null for default hyperparameters.
- [hpo][n_startup_trials]: Number of random-search warm-up iterations. We set this parameter to 20 whenever HPO is used.
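To illustrate how these options fit together, the sketch below loads one of the provided config files and switches it to a different dataset and the light HPO setting. The file name 'configs/catboost.yaml' and the way run_experiment.py picks up the resulting file are assumptions made for illustration; only the keys shown above are taken from the framework.

```python
# Hypothetical adaptation of a provided config; the file names and the way
# run_experiment.py consumes the config are assumptions for illustration only.
import yaml

with open("configs/catboost.yaml") as f:
    config = yaml.safe_load(f)

# Switch to another dataset, using the expert pipeline with test-time adaptation.
config["dataset"]["dataset_name"] = "amazon-employee-access-challenge"
config["dataset"]["preprocess_type"] = "expert"
config["dataset"]["use_test"] = True

# Light HPO setting: 20 trials with 20 random-search warm-up iterations.
config["hpo"]["n_trials"] = 20
config["hpo"]["n_startup_trials"] = 20

with open("configs/my_experiment.yaml", "w") as f:
    yaml.safe_dump(config, f)
# Point run_experiment.py to 'configs/my_experiment.yaml' afterwards.
```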
The notebooks Final_Evaluation.ipynb and Final_Evaluation_orig_metric.ipynb gather all our results and execute the evaluation reported in the paper.
To add new datasets, the following steps are required (a sketch follows this list):
- Add a new class to datasets.py, following the instructions in the BaseDataset class header
- Add the class to the get_datasets function in datasets.py
- Add the dataset to download_datasets.py
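The rough shape of such an addition is sketched below. All method names in the sketch (load_data, apply_expert_preprocessing) are hypothetical; the authoritative interface is the one documented in the BaseDataset class header in datasets.py.

```python
# Hypothetical sketch of a new dataset class; the actual required methods and
# attributes are documented in the BaseDataset class header in datasets.py.
import pandas as pd

from datasets import BaseDataset


class MyNewCompetitionDataset(BaseDataset):
    """Wrapper for a new Kaggle competition (illustrative example)."""

    def load_data(self):  # hypothetical method name
        # Read the raw competition files fetched by download_datasets.py.
        train = pd.read_csv("data/my-new-competition/train.csv")
        test = pd.read_csv("data/my-new-competition/test.csv")
        return train, test

    def apply_expert_preprocessing(self, train, test):  # hypothetical method name
        # Reimplement the feature engineering of an expert competition solution here.
        return train, test
```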
Currently, the only supported platform is Kaggle, as it offers an API that allows for easy post-competition submissions. We appreciate any contributions that allow us to integrate datasets from other platforms, as well as contributions of datasets without expert preprocessing for comparison in the standardized preprocessing regime.
To add new models, the following steps are required (a sketch follows this list):
- Add a new class to models.py that can be initialized with a params dictionary and implements the following functions: fit, predict, get_default_hyperparameters, get_optuna_hyperparameters
- Add the class to the get_models function in models.py
- Add a configs/{new_model}.yaml file specifying the hyperparameters that are configurable but not tuned
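The sketch below shows the expected shape of such a wrapper class. The required method names are listed above; the exact signatures of fit and predict and the format of the Optuna search space are assumptions, so the existing implementations in models.py should be used as the reference.

```python
# Sketch of a new model wrapper. The method names come from the framework description;
# the fit/predict signatures and the Optuna search-space format are assumptions.
from sklearn.linear_model import LogisticRegression


class MyNewModel:
    def __init__(self, params):
        # 'params' holds the hyperparameters selected for this run.
        self.model = LogisticRegression(**params)

    def fit(self, X, y):  # assumed signature
        self.model.fit(X, y)
        return self

    def predict(self, X):  # assumed signature
        return self.model.predict(X)

    @staticmethod
    def get_default_hyperparameters():
        return {"C": 1.0}

    @staticmethod
    def get_optuna_hyperparameters(trial):
        # Assumed to return a hyperparameter dictionary sampled from an Optuna trial.
        return {"C": trial.suggest_float("C", 1e-3, 1e3, log=True)}
```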