Introduction: Background
========================

Reproducibility has long been a major problem across all scientific domains. The common solution is to rigorously capture all inputs, parameters, procedures, conditions, and environments and to provide them as part of the presented results. Conceptually, this overcomes the main reproducibility issue of scientific research, but _applied fields_ typically come with an additional (and in some ways conflicting) requirement - _portability_ - which is needed to achieve (among other things) _distributability_ of the particular ML solution.

In this workshop, we focus less on the problems of general scientific research and primarily on the practical aspects of delivering ML solutions.

Definitions
-----------

### Solution

> An ML solution is the complete set of procedures to model a particular problem using machine learning so that it can predict results that meet the expected performance.

The term _model_ is often used interchangeably with two different meanings:
* the actual **trained model** (i.e. the learned weights/parameters) generalizing the patterns found in the data
* the **logical model** - the particular choice of algorithms set up (in the form of a pipeline) to learn these patterns when exposed to the data

The scope of this workshop is constrained to Python-based solutions only.

### Reproducibility
    
[Reproducibility and Replicability in Science, page 46](https://nap.nationalacademies.org/read/25303/chapter/6#46):
    
> Reproducibility is the capacity to obtain consistent results using the **same inputs** (data, hyperparameters) and the **same processing** (code + hardware).

Alternative/related definitions:
* Replicability - obtaining consistent results when using conceptually the same but otherwise completely _independently obtained_ instances of the original datasets
* Repeatability - obtaining consistent results when performing ongoing (repeated) experimentation within the scope of the (same) research project
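
As a minimal sketch of the definition above, the toy experiment below treats the random seed as one of the inputs. The function name `run_experiment` and its parameters are hypothetical and exist only for illustration; a real solution would also need to pin data, code and environment.

```python
import random


def run_experiment(seed: int, n_samples: int = 5) -> list[float]:
    """Toy 'experiment' (hypothetical): draw samples from a seeded generator."""
    # A local generator avoids hidden global state shared with other code.
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n_samples)]


first = run_experiment(seed=42)
second = run_experiment(seed=42)
other = run_experiment(seed=43)

# Same inputs and same processing give identical results ...
print(first == second)
# ... while changing one input (here the seed) changes the outcome.
print(first == other)
```

Passing the seed explicitly, rather than relying on a global default, makes it part of the recorded inputs of the run.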

### Portability

> Portability is the ability to operate the same solution within alternative environments.

This can relate to a number of different _layers_ of the implementing stack:
* different **executors** (e.g. Local vs Dask vs PySpark)
* different script **interpreters/virtual machines** (e.g. Python (ABI) versions)
* different **Operating Systems**
* different **HW architectures** (e.g. i386 vs arm64)

Different levels of portability can be achieved depending on which of the two model definitions above we consider:
* the logical model - the pipeline (code) - can be fairly portable across all the different layers
* the physical trained model is generally much harder to port across different architectures and/or different Python (ABI) versions; this can be achieved only with specific model algorithms (through the support of special formats like [PMML](https://en.wikipedia.org/wiki/Predictive_Model_Markup_Language) and [ONNX](https://en.wikipedia.org/wiki/Open_Neural_Network_Exchange))

The key principle in achieving portability is _abstraction_ of the required functionality, which can then be delivered through a number of alternative providers.
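
The abstraction principle can be sketched roughly as follows, assuming a hypothetical `Runner` interface with two interchangeable providers. None of these names come from a real framework; they only illustrate how the same pipeline code can stay unchanged while the executor underneath is swapped.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Protocol


class Runner(Protocol):
    """Hypothetical abstraction: anything able to map a function over items."""

    def map(self, func: Callable[[float], float], items: Iterable[float]) -> list[float]: ...


class LocalRunner:
    """Provider 1: plain sequential execution in the current thread."""

    def map(self, func, items):
        return [func(x) for x in items]


class ThreadedRunner:
    """Provider 2: execution on a thread pool."""

    def __init__(self, workers: int = 2) -> None:
        self.workers = workers

    def map(self, func, items):
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            # Executor.map yields results in the order of the inputs.
            return list(pool.map(func, items))


def scale(x: float) -> float:
    """Stand-in for a real pipeline step."""
    return x * 2.0


def pipeline(runner: Runner, data: list[float]) -> list[float]:
    # The pipeline depends only on the abstraction, not on a concrete provider.
    return runner.map(scale, data)


data = [1.0, 2.0, 3.0]
print(pipeline(LocalRunner(), data) == pipeline(ThreadedRunner(), data))
```

The pipeline function never names a concrete executor, so adding another provider would not require touching it.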

### Distributability

> Distributability refers to the ease and flexibility with which an ML solution can be provided to, and operated by, a third party.

Distribution artifacts delivering an ML solution need to supply (at least):
* the solution code
* its dependencies
* configuration
* model files

Distributability is the fruit of reproducibility and portability.
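
One possible way to tie the four artifact kinds listed above together is a small manifest that records a fingerprint of each. This is only a sketch under stated assumptions: the `build_manifest` helper, the manifest keys and the sample values (including the dependency pins) are hypothetical placeholders, not a prescribed packaging format.

```python
import hashlib
import json


def sha256_hex(payload: bytes) -> str:
    """Return the SHA-256 digest of the payload as a hex string."""
    return hashlib.sha256(payload).hexdigest()


def build_manifest(code: bytes, dependencies: list[str], config: dict, model: bytes) -> dict:
    """Hypothetical manifest covering code, dependencies, configuration and model files."""
    # Serialize the config with sorted keys so that key order does not change the hash.
    config_bytes = json.dumps(config, sort_keys=True).encode("utf-8")
    return {
        "code_sha256": sha256_hex(code),
        "dependencies": sorted(dependencies),
        "config_sha256": sha256_hex(config_bytes),
        "model_sha256": sha256_hex(model),
    }


# Placeholder inputs for illustration only.
manifest = build_manifest(
    code=b"def predict(x): return x",
    dependencies=["examplelib==1.2.3", "anotherlib==0.4.0"],
    config={"threshold": 0.5},
    model=b"placeholder weights",
)
print(json.dumps(manifest, indent=2))
```

A recipient could recompute the same digests to check that the artifacts they received match what was published.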

Motivation
----------

Pursuing reproducibility, portability and distributability brings the following rewards:

### Sharing, Reuse & Collaboration

* allowing others to build on existing work and expand upon it
* saving resources by avoiding the need to reinvent existing solutions (efficiency)
* introducing a concept of ML solution marketplaces ([Hugging Face](https://huggingface.co/models), [BigML](https://bigml.com/gallery/models), ...)

### Verification, Error Detection, Troubleshooting

* helping to verify results and identify errors in the particular solution (the ability to reproduce an error is the key to fixing it)
* reproducibility is, in principle, the foundation for the necessary internal consistency between the train and predict modes of any (supervised) ML solution

### Cross-function Engagement

* allowing a switch between technologies that are optimal for different operational modes (which might themselves have conflicting requirements - e.g. using low-latency engines for prediction serving vs high-throughput platforms for training)
* simplifying the research-to-production transition

### Failure Prevention

* a cure for the [Replication Crisis](https://en.wikipedia.org/wiki/Replication_crisis) (according to the [Nature survey](https://www.nature.com/articles/533452a), more than 50% of researchers have failed to reproduce their own experiments)
* a deterrent to research fraud (deliberately avoiding reproducibility with malicious intent - e.g. just to get 15 minutes of fame)
* examples of reproducibility failures:
    * [a change to the popular ImageNet dataset](https://www.wired.com/story/researchers-blur-faces-launched-thousand-algorithms/) rendered all previously published models irreproducible
    * from our own experience: _"utf encoding"_ (incompatible train/predict implementations)
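
The details of the encoding incident are not given here, so the following is a purely hypothetical illustration of how incompatible train/predict implementations can diverge on text encoding. The two featurizer functions are invented for this sketch: one normalizes Unicode before encoding and the other does not.

```python
import unicodedata


def featurize_train(text: str) -> bytes:
    """Hypothetical training-side step: normalize to NFC, then encode as UTF-8."""
    return unicodedata.normalize("NFC", text).encode("utf-8")


def featurize_predict(text: str) -> bytes:
    """Hypothetical serving-side step that skips the normalization."""
    return text.encode("utf-8")


composed = "caf\u00e9"      # 'é' as a single code point
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent

# The training path maps both spellings to the same bytes ...
print(featurize_train(composed) == featurize_train(decomposed))
# ... but the serving path sees two different inputs for the same visible text.
print(featurize_predict(composed) == featurize_predict(decomposed))
```

Sharing a single preprocessing function between the train and predict modes would remove this class of mismatch.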

What's Stopping It?
-------------------

* extra effort (i.e. in the short term it looks like an additional and unjustified cost)
* little credit (i.e. it is not recognized as an obvious business value of the ML solution)