
## ML workflow

Unless you are doing a [Kaggle](https://www.kaggle.com/) challenge or working on a research project, model training is usually only a small part of a machine learning project.

The workflow usually consists of the following steps:

 * Collect data
   * How easy is it to collect the data?
   * Labelling problem for supervised learning
   * Join datasets from different sources
   * Privacy and legal issues
 * Preprocess data
   * Cleaning data
   * Normalization
   * Tokenization for text input
   * Augmentation for image input
 * Train model
   * Framework selection
   * Model architecture
   * Cloud vs. own hardware
   * Local vs. distributed training
   * Scaling
   * Memory issues
 * Evaluate model
   * Hyperparameter selection
 * Deploy model
   * Online deployment (e.g. RPC service)
   * Offline deployment (e.g. as Spark, Flink or batch job)
   * Library deployment (e.g. model + code bundle that can be used by other teams)
 * Make predictions
   * Monitor predictions
   * Compare predictions with observed outcome
 * Improve model
   * Here it starts all over again

As the list shows, most of these tasks require software engineering or data engineering skills.
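
Two of the preprocessing steps listed above, normalization and tokenization, can be sketched in a few lines. This is a minimal illustration using only the standard library; the function names `zscore` and `tokenize` are made up for this example, and real projects would usually rely on their framework's preprocessing utilities.

```python
import re
from statistics import fmean, pstdev


def zscore(values):
    """Normalize numeric values to zero mean and unit variance."""
    mean, std = fmean(values), pstdev(values)
    if std == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def tokenize(text):
    """Very simple tokenizer: lowercase, then keep runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


normalized = zscore([1.0, 2.0, 3.0])
tokens = tokenize("Hello, World! 42")
```

In practice the mean and standard deviation are computed on the training data only and then reused unchanged for evaluation and serving.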


## Principles

Productivity:
 * Enable developers to use ML: standardize algorithms, datasets and workflows
 * Training a model should be easy for engineers of varying ML experience
 * Standard ML algorithms should be implemented only once in a reusable manner
 * Rerun pipelines with different inputs and parameters
 * Run multiple experiments at the same time, with support for distributed training and GPUs
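
Rerunning one pipeline with different parameters, several runs at a time, could look roughly like the sketch below. The helpers `param_grid` and `run_experiments` are hypothetical names, and the pipeline is assumed to be any callable taking `(dataset, params)`; threads stand in here for whatever scheduler or cluster a real platform would use.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def param_grid(**options):
    """Expand keyword lists into every combination of parameter values."""
    keys = sorted(options)
    return [dict(zip(keys, combo)) for combo in product(*(options[k] for k in keys))]


def run_experiments(pipeline, dataset, grid, max_workers=4):
    """Run the same pipeline once per parameter set, several at a time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps results in the same order as the grid.
        results = list(pool.map(lambda params: pipeline(dataset, params), grid))
    return list(zip(grid, results))


def toy_pipeline(dataset, params):
    """Stand-in for a real train-and-evaluate pipeline."""
    return sum(dataset) * params["scale"] + params["offset"]


grid = param_grid(scale=[1, 2], offset=[0, 10])
runs = run_experiments(toy_pipeline, [1, 2, 3], grid)
```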

Reproducibility:
 * Record train parameters and inputs, store model, record performance
 * Everybody should be able to easily search past experiments, view results, share with others, and start new variants of a given experiment.
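
As a rough sketch of these two points, an experiment log could store parameters, inputs, metrics and the model location per run, and offer search and "start a variant" operations. The classes `Experiment` and `ExperimentLog` and all field names are assumptions for illustration; a real system would persist the records in a database or a tracking service rather than in memory.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass
class Experiment:
    """One training run: what went in and what came out (hypothetical schema)."""
    params: dict      # training parameters
    inputs: dict      # references to the input data
    metrics: dict     # recorded performance
    model_path: str   # where the trained model was stored

    @property
    def run_id(self):
        # Same parameters and inputs always produce the same ID.
        payload = json.dumps({"params": self.params, "inputs": self.inputs}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


class ExperimentLog:
    """In-memory stand-in for a shared experiment store."""

    def __init__(self):
        self._runs = []

    def record(self, experiment):
        self._runs.append(experiment)
        return experiment.run_id

    def search(self, **param_filters):
        """Return all runs whose parameters match every given filter."""
        return [
            run for run in self._runs
            if all(run.params.get(key) == value for key, value in param_filters.items())
        ]

    def variant(self, run_id, **overrides):
        """Build the parameters for a new run based on a past one."""
        base = next(run for run in self._runs if run.run_id == run_id)
        return {**base.params, **overrides}
```

A new variant then starts from `log.variant(run_id, lr=0.01)` instead of from a hand-copied configuration.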
 
Model deployment:
 * Standardize model deployment so that engineering teams do not have to create a custom serving container for each project
 * Support for offline and near-realtime features

Deploy multiple model versions at the same time:
 * Transition between model versions
 * A/B testing
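
One common way to serve several versions side by side is to split traffic deterministically by hashing a stable key. The sketch below assumes a string user ID and a hypothetical `assign_version` helper; changing the weights over time gives a gradual transition from one version to the next, and fixed weights give an A/B test.

```python
import hashlib


def assign_version(user_id, weights):
    """Map a user to a model version according to traffic weights.

    The same user always gets the same version as long as the weights
    stay unchanged.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Turn the first 4 bytes of the hash into a point in [0, 1).
    point = int.from_bytes(digest[:4], "big") / 2**32
    versions = sorted(weights)
    cumulative = 0.0
    for version in versions:
        cumulative += weights[version] / total
        if point < cumulative:
            return version
    # Only reached if rounding leaves the cumulative sum just below 1.
    return versions[-1]


# 90% of users stay on v1, 10% see the candidate v2.
version = assign_version("user-42", {"v1": 0.9, "v2": 0.1})
```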


Monitor deployed models:
 * Monitor model performance
 * Detect slowly shifting distributions in the underlying data
 * Log a percentage of the predictions and later compare them to the observed outcome
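
The last two points can be illustrated with a small sketch: `live_accuracy` joins logged predictions with outcomes that arrive later, and `psi` computes a population stability index as one possible drift score between a reference sample and recent data. Both function names, the request-ID keying and the choice of PSI are assumptions made for this example; any alert threshold would have to be tuned per feature.

```python
import math
from bisect import bisect_right


def live_accuracy(logged, outcomes):
    """Share of logged predictions that match the outcome observed later.

    Both dicts are keyed by request ID; predictions that have no outcome
    yet are skipped.
    """
    matched = [rid for rid in logged if rid in outcomes]
    if not matched:
        raise ValueError("no logged prediction has an observed outcome yet")
    return sum(logged[rid] == outcomes[rid] for rid in matched) / len(matched)


def psi(reference, current, edges, eps=1e-6):
    """Population stability index of two samples over fixed bin edges."""
    def shares(values):
        counts = [0] * (len(edges) + 1)
        for value in values:
            counts[bisect_right(edges, value)] += 1
        # eps avoids log(0) for empty bins.
        return [max(count / len(values), eps) for count in counts]

    ref, cur = shares(reference), shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


accuracy = live_accuracy({"a": 1, "b": 0, "c": 1}, {"a": 1, "b": 1})
training_feature = [i / 100 for i in range(100)]
serving_feature = [0.5 + i / 200 for i in range(100)]
drift = psi(training_feature, serving_feature, edges=[0.25, 0.5, 0.75])
```

A score of zero means the binned distributions are identical; larger values indicate a stronger shift.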


Monitor model training:
 * Plot loss, accuracy, etc.
 * Plot computational graph
 * Mean and variance of activations over time
 * Mean and variance of gradients over time
 * Mean and variance of parameter updates over time
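
To make the statistics above concrete, here is a toy sketch that fits a two-parameter linear model with plain gradient descent and records loss, gradient statistics and update statistics at every step. The function `train_and_monitor` and the record layout are made up for illustration; in a real network the same statistics would be collected per layer by the training framework's logging tools.

```python
from statistics import fmean, pvariance


def summarize(values):
    """Mean and variance of a list of numbers."""
    return {"mean": fmean(values), "var": pvariance(values)}


def train_and_monitor(xs, ys, lr=0.05, steps=50):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    history = []
    for step in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        loss = fmean([e * e for e in errors])
        # Gradients of the loss with respect to w and b.
        grads = [
            2 * fmean([e * x for e, x in zip(errors, xs)]),
            2 * fmean(errors),
        ]
        updates = [-lr * g for g in grads]
        history.append({
            "step": step,
            "loss": loss,
            "grads": summarize(grads),
            "updates": summarize(updates),
        })
        w, b = w + updates[0], b + updates[1]
    return (w, b), history


(w, b), history = train_and_monitor([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0], steps=500)
```

Plotting `history` over the steps shows whether the loss is still falling and whether gradients or updates are vanishing or exploding.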

## Model deployment

Deployment types:

 * Offline deployment. The model is deployed to an offline container and run in a Spark job to generate batch predictions either on demand or on a repeating schedule.
 * Online deployment. The model is deployed to an online prediction service cluster (generally containing hundreds of machines behind a load balancer) where clients can send individual or batched prediction requests as network RPC calls.
 * Library deployment. The model is deployed to a serving container that is embedded as a library in another service and invoked via a Java API.
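
The three deployment types differ mainly in how the same prediction code is wrapped. The sketch below uses a hypothetical `Model` class with a fixed linear scorer as a stand-in for a trained model; `batch_predict` and `handle_request` are illustrative names, and a real online service would sit behind an RPC or HTTP framework rather than a bare function.

```python
import json


class Model:
    """Hypothetical stand-in for a trained model: a fixed linear scorer."""

    def __init__(self, weights, bias=0.0):
        self.weights = weights
        self.bias = bias

    def predict(self, features):
        # Library deployment: other code imports Model and calls this directly.
        return sum(w * x for w, x in zip(self.weights, features)) + self.bias


def batch_predict(model, rows):
    """Offline deployment: score a whole dataset in one job."""
    return [model.predict(row) for row in rows]


def handle_request(model, payload):
    """Online deployment: one JSON request in, one JSON response out."""
    request = json.loads(payload)
    return json.dumps({"prediction": model.predict(request["features"])})


model = Model([1.0, 2.0], bias=0.5)
offline = batch_predict(model, [[1, 1], [0, 2]])
online = handle_request(model, '{"features": [1, 1]}')
```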


## Training metadata

 * Who trained the model
 * Start and end time of the training job
 * Full model configuration (features used, hyper-parameter values, etc.)
 * Reference to training and test data sets
 * Distribution and relative importance of each feature
 * Model accuracy metrics
 * Standard charts and graphs for each model type (e.g. ROC curve, PR curve, and confusion matrix for a binary classifier)
 * Full learned parameters of the model
 * Summary statistics for model visualization
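
A metadata record covering part of this list might be assembled as in the following sketch. The `TrainingMetadata` schema, its field names and the example values are assumptions for illustration; the confusion matrix and the derived metrics are computed for a binary classifier with labels 0 and 1.

```python
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


def confusion_matrix(y_true, y_pred):
    """Count true/false positives and negatives for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    return {
        "tp": sum(t == 1 and p == 1 for t, p in pairs),
        "fp": sum(t == 0 and p == 1 for t, p in pairs),
        "fn": sum(t == 1 and p == 0 for t, p in pairs),
        "tn": sum(t == 0 and p == 0 for t, p in pairs),
    }


def binary_metrics(cm):
    """Accuracy, precision and recall derived from a confusion matrix."""
    tp, fp, fn, tn = cm["tp"], cm["fp"], cm["fn"], cm["tn"]
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


@dataclass
class TrainingMetadata:
    """Hypothetical record stored next to every trained model."""
    trained_by: str
    started_at: datetime
    finished_at: datetime
    config: dict          # features used, hyperparameter values, etc.
    train_data_ref: str
    test_data_ref: str
    confusion: dict
    metrics: dict


started = datetime.now(timezone.utc)
cm = confusion_matrix([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
meta = TrainingMetadata(
    trained_by="alice",
    started_at=started,
    finished_at=datetime.now(timezone.utc),
    config={"features": ["age", "country"], "lr": 0.1},
    train_data_ref="data/train-v1",
    test_data_ref="data/test-v1",
    confusion=cm,
    metrics=binary_metrics(cm),
)
record = asdict(meta)  # plain dict, ready to be serialized and stored
```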