Introduction: Principles
========================

To achieve reproducibility, portability, and distributability in an ML solution, specific principles need to be applied across all key layers of the solution's architecture. Let's look at these principles in more detail.

### Project Structure
The project structure should be organized so that it is intuitively easy to understand (not just by other people but also programmatically). The structure should reflect the logical components of the solution:

* data sourcing
* data preprocessing
* model training
* model evaluation
* tests
* docs
* notebooks
* CI/CD

Examples:
* [Cookiecutter Data Science](https://drivendata.github.io/cookiecutter-data-science/)
* [ML Project Best Practices](https://neptune.ai/blog/how-to-organize-deep-learning-projects-best-practices)
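
As a minimal sketch of what "programmatically understandable" could mean, the following checks a project root against an expected layout. The directory names are hypothetical placeholders, not a prescribed convention; a real project would use the names from its own template.

```python
from pathlib import Path

# Hypothetical directory names mirroring some of the logical components above;
# a real project would pick its own (e.g. from a Cookiecutter template).
EXPECTED_DIRS = ("data", "src", "tests", "docs", "notebooks")


def missing_components(root: Path, expected=EXPECTED_DIRS) -> list:
    """Return the expected top-level directories that do not exist under root."""
    return [name for name in expected if not (root / name).is_dir()]
```

A check like `missing_components(Path("."))` could run as part of the test suite so that structural drift is noticed early.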

### Software Dependencies

ML solutions are typically built using a number of third-party libraries and frameworks. This exact setup (including the individual versions) needs to be captured so that the same environment can be rebuilt reproducibly.

A special case of a dependency is the version of the Python interpreter itself.

Standard tools:

* [Pipenv](https://github.com/pypa/pipenv)
* [Conda](https://docs.conda.io/en/latest/)
* [Docker](https://docs.docker.com/)
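
Independently of the tools above, the versions actually present in a running environment can be recorded with the standard library alone. This is only a sketch for capturing a snapshot (for example next to experiment results); the function name is an assumption, and it does not replace a proper lock file.

```python
import platform
from importlib import metadata


def environment_snapshot() -> dict:
    """Collect the interpreter version and the versions of installed distributions."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions with missing metadata
            packages[name] = dist.version
    return {
        "python": platform.python_version(),
        "packages": dict(sorted(packages.items())),
    }
```

Dumping the result with `json.dumps(environment_snapshot(), indent=2)` gives a record that can be compared between machines.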

### Hardware Factor

Hardware, at least within the same architecture family, is the layer where portability is most readily expected. It's the role of the system stack to abstract the hardware layer well enough that it generally doesn't have to be considered a factor. This is not necessarily the case for ML solutions, because they rely heavily on floating-point arithmetic, which is intrinsically prone to discrepancies and can therefore affect reproducibility.
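
The root of such discrepancies can be illustrated without any special hardware: floating-point addition is not associative, so any change in the order of operations (as may happen with different vectorization or parallel reduction strategies) can change the result. The snippet is a generic illustration, not a statement about any particular platform.

```python
import math

values = [0.1, 0.2, 0.3]

# The same three numbers summed with different grouping.
left_to_right = (values[0] + values[1]) + values[2]
right_to_left = values[0] + (values[1] + values[2])

# Floating-point addition is not associative, so the grouping changes the result.
order_matters = left_to_right != right_to_left

# Comparing with a tolerance is a common way to check "reproduced" numeric results.
close_enough = math.isclose(left_to_right, right_to_left, rel_tol=1e-9)
```

For this reason, reproducibility checks on numeric outputs are often better expressed with a tolerance than with exact equality.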

### Non-deterministic Operations

Stochastic processes are at the heart of many ML techniques, which poses a major reproducibility challenge.

Common practice is to explicitly set the **seed** used by the random generators within:

* the actual **algorithms** depending on randomization 
* core **libraries** themselves:
    * `random.seed()`
    * `torch.manual_seed()`
    * `tensorflow.random.set_seed()`
    * `numpy.random.seed()`

Note, however, that the choice of the seed can have a considerable impact on the results. It's advisable to run several experiments with different seed values to assess this effect.
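
A minimal sketch of both practices follows: one helper that seeds the generators listed above (NumPy and PyTorch only if they happen to be installed), and a loop over several seeds. `seed_everything` and `noisy_experiment` are hypothetical names, and the "experiment" is just a stand-in for a real training run.

```python
import random


def seed_everything(seed: int) -> None:
    """Seed the standard library generator and, if installed, NumPy and PyTorch."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass


def noisy_experiment(seed: int) -> float:
    """Hypothetical stand-in for a training run that returns a score."""
    seed_everything(seed)
    return sum(random.gauss(0.8, 0.05) for _ in range(100)) / 100


# Same seed gives an identical result; different seeds give a spread worth inspecting.
scores = {seed: noisy_experiment(seed) for seed in (0, 1, 2)}
```

Reporting the spread of `scores` alongside a single headline number makes the seed sensitivity visible.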

In addition, some popular libraries allow explicitly disabling the internal use of non-deterministic algorithms:

* `torch.use_deterministic_algorithms(True)`
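* `tfdeterminism.patch()` (for older TensorFlow releases; recent ones provide `tf.config.experimental.enable_op_determinism()`)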

### Source Code Implementation

ML solutions are implemented as a series of data-processing steps. Capturing these exact steps is essential for true reproducibility. In practice, this can be provided in the form of:

* **plain scripts** performing the steps directly upon execution (limited portability)
* robust **task dependency management** systems that separate the assembly of the task graph from its scheduling and execution; this concept makes it possible to:
    * transcode workflows for execution using **different runners**
    * **derive secondary workflows** from primary definitions (e.g. _evaluation_ workflow can be potentially derived from _train_ and _predict_ workflows)
    * transparently **manage (model) state persistence**
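
The separation between graph assembly and execution can be sketched with the standard library's `graphlib`. The task names and the trivial sequential runner are hypothetical; real task dependency systems are considerably richer.

```python
from graphlib import TopologicalSorter

# Assembly stage: only declare tasks and their prerequisites (task -> dependencies).
graph = {
    "preprocess": {"load"},
    "train": {"preprocess"},
    "evaluate": {"train", "preprocess"},
}


def plan(graph: dict) -> list:
    """Resolve a valid execution order without running anything."""
    return list(TopologicalSorter(graph).static_order())


def run_sequentially(order: list, tasks: dict) -> list:
    """One possible runner: call the task callables one by one in planned order."""
    return [tasks[name]() for name in order]


names = ("load", "preprocess", "train", "evaluate")
tasks = {name: (lambda name=name: f"{name} done") for name in names}
log = run_sequentially(plan(graph), tasks)
```

Because `plan` never executes a task, the same graph could be handed to a different runner (for example a parallel or remote one) without changing the definitions.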

Note that the implementer is responsible for maintaining logical **consistency between the training and predicting modes** (e.g. applying matching transformations during pre-processing).
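
A small sketch of what this consistency means in practice: parameters learned in training mode are stored and reused when predicting, rather than recomputed on the new data. The `Standardizer` class is a hypothetical illustration, not part of any particular library.

```python
from statistics import mean, pstdev


class Standardizer:
    """Learns scaling parameters in training mode and reuses them when predicting."""

    def fit(self, values):
        self.mean_ = mean(values)
        self.std_ = pstdev(values) or 1.0  # avoid division by zero for constant input
        return self

    def transform(self, values):
        return [(v - self.mean_) / self.std_ for v in values]


train_data = [1.0, 2.0, 3.0, 4.0, 5.0]
scaler = Standardizer().fit(train_data)

# Predict mode applies the same fitted parameters instead of refitting on new data.
new_data = [3.0, 6.0]
scaled = scaler.transform(new_data)
```

Refitting on `new_data` instead would silently change the meaning of the features the model was trained on.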

The general **code quality** is a major aspect of its direct usability and thus reproducibility. Therefore, standard clean-code development practice is the base guideline to follow:

* code **linting** ([Pylint](https://github.com/pylint-dev/pylint), [Pyflakes](https://github.com/PyCQA/pyflakes))
* code **formatting** ([Black](https://github.com/psf/black), [isort](https://pycqa.github.io/isort/), [PyCln](https://hadialqattan.github.io/pycln/#/), [pyupgrade](https://github.com/asottile/pyupgrade), ...)
* automated (unit) **testing** ([Pytest](https://docs.pytest.org/))

### Source Code Version Control

The use of a version control system for managing the code base and parameters of an ML solution is necessary to:

* track and review changes made over time
* enable better collaboration and teamwork
* identify and fix changes leading to bugs
* possibly serve as part of experiment tracking


### Dataset Management

Effective dataset management is absolutely critical for ensuring reliability of any data-driven solution.
 
The key mechanisms related to reproducible dataset management are:

* data **versioning** - tracking changes made to data over time and allowing access to any particular revision:
    * [Pachyderm](https://www.pachyderm.com/blog/what-is-data-lineage/)
    * [git-lfs](https://git-lfs.com/)
    * [git-annex](https://git-annex.branchable.com/)
* data **lineage tracking** - capturing the history of data as it moves through the various stages within the system:
    * [DataHub](https://datahubproject.io/docs/lineage/lineage-feature-guide/)
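
Where no dedicated tool is in place, a content hash is a lightweight way to pin which exact revision of a dataset file an experiment used. This is only a sketch; `dataset_fingerprint` is a hypothetical helper and does not describe how the tools above work internally.

```python
import hashlib
from pathlib import Path


def dataset_fingerprint(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Storing the fingerprint together with the experiment results makes it possible to detect later that the underlying data has changed.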

From a portability perspective, an ML solution shouldn't hardcode data-related parameters such as explicit data source locations, physical data formats, or encodings. This relates to a couple of additional concepts:

* dataset **abstraction** - decoupling solutions from physical datasets using their **schemas as data proxies**
* specifying data requirements within portable solutions using runtime-interpretable **queries**
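
The idea of a schema acting as a data proxy can be sketched as follows. The solution code refers only to the logical schema, while a runtime binding (here a plain dictionary) decides where the data physically comes from. All names (`Field`, `Schema`, `PASSENGERS`, `load`, `sources`) are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str
    kind: type


@dataclass(frozen=True)
class Schema:
    name: str
    fields: tuple


# The solution declares *what* data it needs, not where or how it is stored.
PASSENGERS = Schema("passengers", (Field("age", float), Field("survived", int)))


def load(schema: Schema, sources: dict) -> list:
    """Resolve the schema against a runtime source and project/cast its fields."""
    rows = sources[schema.name]()
    return [{f.name: f.kind(row[f.name]) for f in schema.fields} for row in rows]


# Runtime binding: this mapping could point to a file, a database, or a service.
sources = {"passengers": lambda: [{"age": "22", "survived": "1", "cabin": "C85"}]}
records = load(PASSENGERS, sources)
```

Moving the solution to another environment then only requires a different `sources` binding, not a change to the solution code.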

### Experiment Tracking

Keeping track of experiments during the ML solution research is the key mechanism for ensuring its _repeatability_ (see our original [definitions](1-background.ipynb)).

Options for experiment tracking:

* in its simplest form, it can be implemented using just plain version control
* a number of dedicated high-level tools are available:
    * [MLflow](https://mlflow.org/docs/latest/tracking.html)
    * [Neptune](https://neptune.ai/product/experiment-tracking)
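
For the simplest form mentioned above, a run log kept as a JSON Lines file under version control may already be enough. The sketch below is a bare-bones assumption of what such a log could look like; `log_run` and `load_runs` are hypothetical helpers and not related to the APIs of the dedicated tools.

```python
import json
import time
from pathlib import Path


def log_run(log_file: Path, params: dict, metrics: dict) -> dict:
    """Append one experiment record (parameters, metrics, timestamp) as a JSON line."""
    record = {"timestamp": time.time(), "params": params, "metrics": metrics}
    with log_file.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, sort_keys=True) + "\n")
    return record


def load_runs(log_file: Path) -> list:
    """Read back all recorded runs, oldest first."""
    with log_file.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

Committing the log together with the code ties each recorded result to the revision that produced it.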

### Model Persistence

A central component of the production lifecycle is the **model registry**, used to persist the internal **state** of trained models and their related **artifacts**. This makes it possible to _distribute_ the models across different teams and environments and also contributes to reproducibility by tracking the model versions, their code and hyperparameters.

In contrast to the special model formats of certain algorithms, the popular generic approach for persisting any model state in Python is serialization using the [Pickle protocol](https://docs.python.org/3/library/pickle.html). Since this is [notoriously insecure](https://medium.com/ochrona/python-pickle-is-notoriously-insecure-d6651f1974c9), only trusted models persisted this way should ever be loaded (possibly guarded by security checksums).

A few examples of model registry implementations are:
* [MLflow](https://mlflow.org/docs/latest/model-registry.html)
* [Neptune](https://neptune.ai/product/model-registry)
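
The checksum guard mentioned above for pickled models could look roughly like the sketch below. The helper names are hypothetical. Note the assumption that the expected digest is obtained through a trusted channel: a checksum only detects that the payload differs from what was published, it does not make unpickling untrusted data safe.

```python
import hashlib
import hmac
import pickle


def dump_with_checksum(obj) -> tuple:
    """Serialize obj and return (payload, SHA-256 hex digest of the payload)."""
    payload = pickle.dumps(obj)
    return payload, hashlib.sha256(payload).hexdigest()


def load_verified(payload: bytes, expected_digest: str):
    """Unpickle only if the payload matches the digest from a trusted source."""
    actual = hashlib.sha256(payload).hexdigest()
    if not hmac.compare_digest(actual, expected_digest):
        raise ValueError("checksum mismatch: refusing to unpickle")
    return pickle.loads(payload)
```

The important design point is that verification happens before `pickle.loads` is ever called on the payload.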

### Lifecycle Management, Orchestration and CI/CD

Eliminating manual steps when performing any of the (particularly production) lifecycle actions is a necessary practice for achieving reproducibility.

The main lifecycle steps that need to be handled non-interactively are:

* releasing/artifact packaging
* training
* tuning
* evaluation
* deployment

This is typically implemented through:

* integration with standard CI/CD tools and version control systems
* using task scheduling platforms for any periodic actions (model retraining/refreshing, serving performance evaluation/monitoring)
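
A prerequisite for both is that every lifecycle action can be triggered without human interaction, for example through a command-line entry point that a CI/CD job or a scheduler can call. The sketch below assumes a hypothetical program name and set of actions; the dispatch body is a placeholder.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Define a non-interactive command-line interface for lifecycle actions."""
    parser = argparse.ArgumentParser(prog="mlproject")  # hypothetical program name
    parser.add_argument("action", choices=["train", "tune", "evaluate"])
    parser.add_argument("--seed", type=int, default=0)
    return parser


def main(argv=None) -> str:
    args = build_parser().parse_args(argv)
    # A real project would dispatch to its training/tuning/evaluation code here.
    return f"{args.action} (seed={args.seed})"
```

With all inputs passed as arguments, the same call (e.g. `main(["train", "--seed", "7"])`) behaves identically whether it is started by a developer, a pipeline, or a scheduler.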


### Documentation

Good documentation is the cornerstone of successful reproducibility.

The documentation can come in a number of different forms (and their combinations):

* code itself and its docstrings
* [Jupyter](https://jupyter.org/) notebooks
* simple `README` files
* specialized documentation tools:
    * [mkdocs](https://www.mkdocs.org/)
    * [Sphinx](https://www.sphinx-doc.org/en/master/)
* in the case of scientific research, the published paper itself