collab-uniba/Software-Solutions-for-Reproducible-ML-Experiments

Software Solutions for Reproducible ML Experiments

This repository contains auxiliary material for the article: "A Taxonomy of Tools for Reproducible Machine Learning Experiments" by Luigi Quaranta, Fabio Calefato, and Filippo Lanubile.

The rest of this README classifies the full sample of analyzed tools according to the features of the taxonomy presented in the paper; for the reader's convenience, a figure representing the taxonomy is shown in the next section.

Creative Commons License
The tool categorization reported in this README as well as the figure representing the taxonomy are licensed under a Creative Commons Attribution 4.0 International License.

Please include the following citation if you intend to (re)use our work:

L. Quaranta, F. Calefato and F. Lanubile, “A Taxonomy of Tools for Reproducible Machine Learning Experiments,” Proceedings of the AIxIA 2021 Discussion Papers Workshop (AIxIA DP 2021), 2021, pp. 65-76, online: CEUR-WS.org/Vol-3078/paper-81.pdf.

The Taxonomy

[Figure: Taxonomy]

Tools Review

General

The tool sample classified according to the features of the General category.

| Tool | Interaction Mode | Workflow Coverage | Languages | License |
|---|---|---|---|---|
| DVC | CLI | All | Language agnostic | FLOSS (Apache 2.0) |
| Guild AI | CLI, API | Data Preparation + Model Building | Python. Built-in framework support: TensorFlow, PyTorch, Keras, Scikit-Learn | FLOSS (Apache 2.0) |
| Pachyderm | CLI, API | All | Language agnostic | Community Ed.: FLOSS (Apache 2.0); Enterprise Ed.: Proprietary |
| Comet.ml | API, CLI | Data Preparation + Model Building | Python, R, Java (beta). Built-in framework support: TensorFlow, PyTorch, Keras, Scikit-Learn, SageMaker | Proprietary |
| MLflow | API, CLI | All | Python, R, Java. Built-in framework support: Apache Spark, TensorFlow, PyTorch, Keras, Scikit-Learn, H2O | FLOSS (Apache 2.0) |
| Neptune | API, CLI | All | Language agnostic (CLI); Python and R (API). Built-in framework support: TensorFlow, PyTorch, Keras, MLflow, SageMaker | Proprietary |
| wandb | API, CLI | Data Preparation + Model Building | Python | Proprietary |
| Valohai | CLI, API | All | Language agnostic | Proprietary |
| Google Colab | Cloud IDE | Data Preparation + Model Building | Python | Proprietary |
| FloydHub | Cloud IDE, API, CLI | All | Python. Built-in framework support: TensorFlow, PyTorch, Keras, Scikit-Learn | Proprietary |
| Domino | Cloud IDE, API, CLI | All | Python, R, Julia. Built-in framework support: TensorFlow, PyTorch, H2O, Apache Spark, Hadoop | Proprietary |
| Spell.run | Cloud IDE, CLI | All | Python. Built-in framework support: TensorFlow, Keras, Weights & Biases | Proprietary |
| Polynote | Web-based IDE | Data Preparation + Model Building | Scala, Python, SQL. Built-in framework support: Apache Spark | FLOSS (Apache 2.0) |
| DataRobot | AutoML Platform | All | Language agnostic (Python API) | Proprietary |
| databricks | Cloud IDE, API, CLI | All | Python, R, Scala, SQL. Built-in framework support: Apache Spark, MLflow, Delta Lake, TensorFlow | Proprietary |
| Driverless AI | AutoML Platform | All | (Python recipes) | Proprietary |
| RapidMiner | AutoML Platform | All | (Python and R for custom code) | Proprietary |
| dstack.ai | API | Data Preparation | Python, R | Proprietary |
| Dotscience | Cloud IDE, API, CLI | All | Language agnostic (CLI); Python (Cloud IDE, API) | Proprietary |

Analysis Support

The tool sample classified according to the features of the Analysis Support category.

| Tool | Notebook support | Data Visualization | Web Dashboard | Collaboration mode | Computational Resources |
|---|---|---|---|---|---|
| DVC | No | No | No | Async (push/pull commands) | Local |
| Guild AI | Yes (on-premise) | No | Yes (local) | Async (push/pull commands) | Local |
| Pachyderm | Yes (on-premise) | No | Yes (local or remote) | Async (push/pull commands) | Local + On-premise + Remote (in-house*) |
| Comet.ml | Yes (on-premise) | No | Yes (remote) | No | Local + On-premise* + Remote* (in-house) |
| MLflow | Yes (on-premise) | No | Yes (local) | No | Local + On-premise |
| Neptune | Yes (on-premise) | No | Yes (remote) | Async (comments) | On-premise* + Remote (in-house) |
| wandb | Yes (on-premise) | No | Yes (remote) | No | On-premise* + Remote (in-house) |
| Valohai | Yes (on-premise or hosted) | No | Yes (remote) | No | On-premise* + Remote (in-house) |
| Google Colab | Yes (hosted) | No | No | Sync (co-editing) + Async (comments) | Local + Remote (in-house or third-party) |
| FloydHub | Yes (hosted) | No | Yes (remote) | No | On-premise* + Remote (in-house) |
| Domino | Yes (hosted) | No | Yes (remote) | Async (reviews) | Remote (in-house*) |
| Spell.run | Yes (hosted) | No | Yes (remote) | No | On-premise* + Remote (in-house) |
| Polynote | Yes (on-premise) | Yes | No | No | Local |
| DataRobot | No | Yes | Yes (remote) | No | On-premise* + Remote* (in-house or third-party) |
| databricks | Yes (hosted) | Yes | Yes (remote) | Sync (co-editing) + Async (comments) | Remote* (third-party) |
| Driverless AI | No | Yes | Yes (remote) | No | Remote* (in-house or third-party) |
| RapidMiner | Yes (hosted) | Yes | Yes (remote) | No | Local + Remote* (in-house or third-party) |
| dstack.ai | Yes (on-premise) | No | Yes (remote) | Async (comments) | On-premise* + Remote (in-house) |
| Dotscience | Yes (hosted) | No | Yes (remote) | Async (Fork&Pull for notebooks) | On-premise* + Remote (in-house or third-party*) |

Reproducibility Support

The tool sample classified according to the features of the Reproducibility Support category.

| Tool | Code Versioning | Data Access | Data Versioning | Experiment Logging | Reproducible Pipeline |
|---|---|---|---|---|---|
| DVC | Yes (external, git-based) | Local + Remote (third-party) | Yes | Yes (manual) | Yes (automatic) |
| Guild AI | Yes (external, git-based) | Local + Remote (third-party) | Yes | Yes (hybrid) | Yes (configuration file) |
| Pachyderm | Yes (integrated) | Local + Remote (third-party) | Yes | No | Yes |
| Comet.ml | Yes (external, git-based) | Local + Remote (internal) | Yes | Yes (hybrid) | ? |
| MLflow | Yes (external, git-based) | Local + Remote (third-party) | No | Yes (hybrid) | Yes (configuration file) |
| Neptune | Yes (integrated or external, git-based) | Local + Remote (third-party) | No | Yes (hybrid) | No |
| wandb | Yes (external, git-based) | Local + Remote (internal or third-party) | No | Yes (hybrid) | Local + Remote (third-party) |
| Valohai | Yes (integrated or external, git-based) | Local + Remote (third-party*) | Yes | Yes (manual) | Yes (configuration file) |
| Google Colab | Yes (file-sharing services - Google Drive) | Remote (internal or third-party) | Yes | No | No |
| FloydHub | Yes (integrated or external, git-based) | Remote (internal or third-party) | Yes | Yes (manual) | Yes |
| Domino | Yes (integrated) | Remote (internal or third-party) | Yes | No | Yes (automatic) |
| Spell.run | Yes (external, git-based) | Remote (internal or third-party) | ? | Yes (hybrid) | Yes (script) |
| Polynote | No | Local | No | No | No |
| DataRobot | ? | Remote | ? | Yes (automatic) | Yes (built-in) |
| databricks | Yes (integrated or external, git-based) | Remote (internal or third-party) | Yes | Yes (hybrid) | ? |
| Driverless AI | Yes (integrated) | Remote (internal or third-party) | Yes | Yes (automatic) | Yes (built-in) |
| RapidMiner | Yes (external, git-based) | Local + Remote (third-party) | ? | Yes (automatic) | Yes (visual or built-in) |
| dstack.ai | No | Local + Remote (internal) | Yes | Yes (manual) | No |
| Dotscience | Yes (integrated) | Remote (internal or third-party) | Yes | Yes (manual) | Yes (automatic) |

* = only available in paid plans

N.B.: Rows related to Dotscience are struck through because the service seems to be shutting down. We read this blog post a few days after our trial.


Repository contents

The tools/ folder contains environment templates for the tools that require a local installation. To try the tools we used, where possible, a realistic case study inspired by the lessons of Kaggle's micro-courses "Intro to Machine Learning" and "Intermediate Machine Learning". The kernels/ folder contains template notebooks implementing the case study, while the sample dataset is stored in the input/ folder.

Setup instructions

To try one of the reviewed tools, follow these steps:

  1. go to the tool's folder: /tools/<tool_name>;
  2. if a .env_template file exists, make a copy of it named .env, then edit .env to assign a value to each of the listed variables;
  3. if a README.md file is present, follow the specific instructions there.
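
Step 2 above can be sketched in Python as follows. This is only an illustrative helper, not a script shipped with this repository: the function name `prepare_env` is hypothetical, and the sketch assumes that each .env_template consists of simple `NAME=value` lines (with optional `#` comments), which may not hold for every tool.

```python
import shutil
from pathlib import Path


def prepare_env(tool_dir: Path) -> list[str]:
    """Copy .env_template to .env if needed; return variables that still lack a value.

    Hypothetical helper: assumes the template uses plain NAME=value lines.
    """
    template = tool_dir / ".env_template"
    env_file = tool_dir / ".env"

    if not template.exists():
        return []  # step 2 does not apply to this tool

    if not env_file.exists():
        # Never overwrite an .env that has already been filled in.
        shutil.copyfile(template, env_file)

    missing = []
    for raw_line in env_file.read_text().splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        if not value.strip():
            missing.append(name.strip())
    return missing
```

For example, `prepare_env(Path("tools") / "<tool_name>")` creates the .env file when it is missing and returns the names of the variables that still need to be edited by hand.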