Skip to content
Switch branches/tags

Name already in use

A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?
Go to file
Latest commit 981b21b Nov 1, 2022 History
Signed-off-by: Terry Cox <>
4 contributors

Users who have contributed to this file

@tdcox @michaelneale @AlmogBaku @ianhellstrom

MLOps Roadmap 2020

Warning This document is now outdated. Please see the 2021 edition

About this document

This document sets out the current state of MLOps and provides a five year roadmap for future customer needs which is intended to support pre-competitive collaboration across the industry with a view to improving the overall state of MLOps as a capability for all.

It is intended that this document be iteratively refined by group consensus within the MLOps SIG across a series of regular meetings and then published annually whilst relevant.


Current active contributors to the MLOps SIG Roadmap:

Terry Cox, Bootstrap Ltd

Michael Neale, CloudBees

Kara de la Marck, CloudBees

Ian Hellström, D2IQ

Almog Baku, Rimoto

Initial publication - Sept 25th, 2020


Current State of MLOps

What is MLOps?

MLOps could be narrowly defined as "the ability to apply DevOps principles to Machine Learning applications" however as we shall see shortly, this narrow definition misses the true value of MLOps to the customer. Instead, we define MLOps as “the extension of the DevOps methodology to include Machine Learning and Data Science assets as first class citizens within the DevOps ecology”.

MLOps should be viewed as a practice for consistently managing the ML aspects of products in a way that is unified with all of the other technical and non-technical elements necessary to successfully commercialise those products with maximum potential for viability in the marketplace. This includes DataOps, too, as machine learning without complete, consistent, semantically valid, correct, timely, and unbiased data is problematic or leads to flawed solutions that can exacerbate built-in biases.

MLOps is not to be confused with "AIOps". AIOps often means an application of AI technologies to Ops data with sometimes unclear aims for gaining insights. These terms are still evolving, but for the purposes of this document we do not mean AIOps in the latter usage. Some organisations are more comfortable with the designation 'AI' rather than 'Machine Learning' and so it is to be expected that MLOps may be referred to by AIOps in those domains, however the reverse is not true as use of the term 'AIOps' may not refer to the MLOps methodology.

What is MLOps not?

It sometimes helps to consider what anti-patterns exist around a concept in order to better understand it.

For example, MLOps is not "putting Jupyter Notebooks into production environments".

RAD tools like Jupyter Notebooks can be extremely useful, both in classroom environments and in exploring problem spaces to understand potential approaches to mathematical problems. However, like all Rapid Application Development tools, they achieve the rapid element of their name by trading off other key non-functional requirements like maintainability, testability and scalability.

In the next section, we will discuss the key drivers for MLOps and expand upon the requirements for a true DevOps approach to managing ML assets. At this point in the development of the practice, it perhaps helps to understand that much of ML and AI research and development activity has been driven by Data Science rather than Computer Science teams. This specialisation has enabled great leaps in the ML field but at the same time means that a significant proportion of ML practitioners have never been exposed to the lessons of the past seventy years of managing software assets in commercial environments.

As we shall see, this can result in large conceptual gaps between what is involved in creating a viable proof of concept of a trained ML model on a Data Scientist’s laptop vs what it subsequently takes to be able to safely transition that asset into a commercial product in production environments. It is not therefore unfair to describe the current state of MLOps in 2020 as still on the early path towards maturity and to consider that much of the early challenge for adoption will be one of education and communication rather than purely technical refinements to tooling.

Compounding this challenge, Machine Learning solutions tend to be decision-making systems rather than just data processing systems and thus will be required to be held accountable to much higher standards than those applied to the best quality software delivery projects. The bar for quality and governance processes is therefore very high, in many cases representing legal compliance processes mandated by regional legislation.

As these decision-making solutions increasingly displace human decision-makers in commerce and government, we encounter a new class of governance problems, collectively known as ‘Responsible AI’. These introduce a range of challenges around complex issues such as ethics, fairness and bias in ML models and often fall under government regulation requiring interpretability and explainability of models, often to a standard higher than that applied to human staff.

As a result, it will be necessary for MLOps to develop in a manner that facilitates complex and sensitive governance processes that are of a standard appropriate to what is expected to become a highly regulated area.


General DevOps drivers applied to MLOps

  • Optimising the process of taking ML features into production by reducing Lead Time

  • Optimising the feedback loop between production and development for ML assets

  • Supporting the problem-solving cycle of experimentation and feedback for ML applications

  • Unifying the release cycle for ML and conventional assets

  • Enabling automated testing of ML assets

  • Application of Agile principles to ML projects

  • Supporting ML assets as first class citizens within CI/CD systems

  • Enabling shift-left on Security to include ML assets

  • Improving quality through standardisation across conventional and ML assets

  • Applying Static Analysis, Dynamic Analysis, Dependency Scanning and Integrity Checking to ML assets

  • Reducing Mean Time To Restore for ML applications

  • Reducing Change Fail Percentage for ML applications

  • Management of technical debt across ML assets

  • Enabling whole-of-life cost savings at a product level

  • Reducing overheads of IT management through economies of scale

  • Facilitating the re-use of ML approaches through template or ‘quickstart’ projects

  • Managing risk by aligning ML deliveries to appropriate governance processes

Drivers unique to Machine Learning solutions that represent MLOps requirements

  • Mitigating the risks associated with producing decision-making products

  • Provide specific testing support for detecting ML-specific errors

  • Verifying that algorithms make decisions aligned to customer values

  • Tracking long-term stability of ML asset performance

  • Incorporate ethical governance into management of ML assets

  • Enabling automated bias detection testing

  • Ensuring explainability of decisions

  • Ensuring fairness in decisions

  • Providing transparency and audit-ability of ML assets

  • Allowing for provability of decision-making algorithms

  • Ensuring long term supportability of long-lived ML assets

  • Manage ML-specific security risks such as adversarial attacks

  • Providing clear action paths to suspend or revert failed ML assets in production environments

  • Managing economic risks associated with deployment of key decision-making systems on customer infrastructure

  • Leveraging ML techniques to improve MLOps solutions

  • Mitigating liability risk and accountability through appropriate due-diligence over ML assets

  • Facilitating compliance verification against regional AI legislation

  • Enabling fast development life cycles by offering out-of-the-box (correlated) sampling of relevant data sets

  • Enabling online (streaming) and offline (batch) model predictions

  • Accelerating enterprise-wide development of models through ready-to-reuse features (i.e. feature stores)


An optimal MLOps experience is one where Machine Learning assets are treated consistently with all other software assets within a CI/CD environment. Machine Learning models can be deployed alongside the services that wrap them and the services that consume them as part of a unified release process.

MLOps must be language-agnostic. Training scripts for ML models can be written in a wide range of different programming languages, so a focus on Python is artificially limiting to widespread adoption.

MLOps must be framework-agnostic. There are a plethora of different Machine Learning frameworks commonly in use today and it is to be expected that these will continue to proliferate over time. It must be possible to use any desired framework within an MLOps context and be possible to combine multiple frameworks in any given deployment.

MLOps must be platform and infrastructure agnostic. Adoption of MLOps is predicated upon being able to operate using this approach within the constraints of previously defined corporate infrastructure. It should be presumed unlikely that MLOps alone is a sufficient driver to motivate fundamental change to infrastructure.

It is very important that MLOps should make no over-simplified assumptions with respect to hardware. To achieve currently known customer requirements in a range of ML fields, it will be necessary to achieve performance gains of at least three orders of magnitude. This can be expected to lead to significant change with respect to the hardware utilised to train and execute models.

We make the following assertions:

  • Models may expect to be trained on various combinations CPU, GPU, TPU, dedicated ASICs or custom neuro-morphic silicon, with new hardware entering the market regularly.
  • Models may expect to be executed on various combinations CPU, GPU, TPU, dedicated ASICs or custom neuro-morphic silicon, again with new hardware entering the market regularly.
  • It must not be assumed that models will be executed on the same hardware as they are trained upon.

It is desirable to have the ability to train once but run anywhere, that is, models are trained on specific hardware to minimise training time, and optimised for different target devices based on cost or performance considerations, however this aspiration may need to be tempered with a degree of pragmatism. The ability to train on a given hardware platform requires, at minimum, dedicated driver support, and in practice usually also necessitates extensive library / framework support. The ability to execute on a given hardware platform may involve having to significantly optimise resource usage to lower product cost or require the production of dedicated silicon specific to a single task.

Whilst there will undoubtedly be common scenarios, for example in Cloud deployment, where CPU -> CPU or GPU -> GPU is sufficient to meet requirements, MLOps must allow for a broad range of potential cross-compilation scenarios.

MLOps implementations should follow a ‘convention over configuration’ pattern, seeking to minimise the level of build-specific assets that need to be maintained on top of the desired solution. Where configuration is necessary, it should be synthesised where possible and operate on the principle of always creating working examples that can be modified incrementally by customers to meet their needs.

The use of MLOps should teach best known methods of applying MLOps. It should be recognised that many customers will be experts in the field of Data Science but may have had relatively little exposure to DevOps or other SDLC principles. To minimise the learning curve, MLOps defaults should always align to best practice in production environments rather than ‘quick-and-dirty’ as might be the case for expert users wishing to trial an instance of the tooling in a test bench environment.


The focus of this roadmap is upon aspects relating to extending DevOps principles to the area of Machine Learning and does not discuss basic features of DevOps or CI/CD approaches that are common across both domains. The intention is for MLOps to decorate DevOps rather than differentiate.

Summary and Updates

This is the first edition of the MLOps Roadmap and sets out the initial challenges for the approach with suggested potential solutions.


In this section, we identify challenges in the near and far term that constrain the customer from successfully adopting and benefiting from MLOps principles:

Educating data science teams regarding the risks of trying to use Jupyter Notebooks in production - Communicating the SDLC and the lessons of DevOps in a way that is comfortable for the audience
- Highlighting the dangers of using RAD tools in production
- Demonstrate simple alternatives that are easy to use as part of an MLOps process
Treat ML assets as first class citizens in DevOps processes - Extend CI/CD tools to support ML assets
- Integrate ML assets with version control systems
Providing mechanisms by which training/testing/validation data sets, training scripts, models, experiments, and service wrappers may all be versioned appropriately - All assets under version control
- ML assets include sets of data
- Data volumes may be large
- Data may be sensitive
- Data ownership may be complex
- Multiple platforms may be involved
Providing mechanisms by which changes to training/testing/validation data sets, training scripts, models, experiments, and service wrappers may all be auditable across their full lifecycle - Models may be retrained based upon new training data
- Models may be retrained based upon new training approaches
- Models may be self-learning
- Models may degrade over time
- Models may be deployed in new applications
- Models may be subject to attack and require revision
- Incidents may occur that require root cause analysis and change
- Corporate or government compliance may require audit or investigation
Treating training/testing/validation data sets as managed assets under a MLOps workflow - Content of training/testing/validation data sets is essential to audit process, root cause analysis
- Data is not typically treated as a managed asset under conventional CI/CD
- Data may reside across multiple systems
- Data may only be able to reside in restricted jurisdictions
- Data storage may not be immutable
- Data ownership may be a factor
Managing the security of data in the MLOps process with particular focus upon the increased risk associated with aggregated data sets used for training or batch processing - Aggregated data carries additional risk and represents a higher value target
- Migration of data into Cloud environments, particularly for training, may be problematic
Implications of privacy, GDPR and the right to be forgotten upon training sets and deployed models - The right to use data for training purposes may require audit
- Legal compliance may require removing data from audited training sets, which in turn implies the need to retrain and redeploy models built from that data.
- The right to be forgotten or the right to opt out of certain data tracking may require per-user reset of model predictions
Methods for wrapping trained models as deployable services in scenarios where data scientists training the models may not be experienced software developers with a background in service-oriented design Operational use of a model brings new architectural requirements that may sit outside the domain of expertise of the data scientists who created it.

Model developers may not be software developers and therefore experience challenges implementing APIs around their models and integrating within solutions.
Approaches for enabling all Machine Learning frameworks to be used within the scope of MLOps, regardless of language or platform Common mistake to assume ML = Python

Many commonly used frameworks such as PyTorch and TensorFlow exist and can be expected to continue to proliferate.

MLOps must not be opinionated about frameworks or languages.
Approaches for enabling MLOps to support a broad range of target platforms, including but not limited to CPU, GPU, TPU, custom ASICs and neuromorphic silicon. Choice of training platform and operational platform may be varied and could be different for separate models in a single project
Methods for ensuring efficient use of hardware in both training and operational scenarios - Hardware accelerators are expensive and difficult to virtualise
- Cadence of training activities impacts cost of ownership
- Elastic scaling of models against demand in operation is challenging when based upon hardware acceleration
Approaches for applying MLOps to very large scale problems at petabyte scale and beyond - Problems associated with moving and refreshing large training sets
- Problems associated with distributing training loads across hardware accelerators
- Problems with speed of distributing training data to correct hardware accelerators
- Problems of provisioning / releasing large pools of hardware resources
Providing appropriate pipeline tools to manage MLOps workflows transparently as part of existing DevOps solutions - Integration of ML assets with existing CD/CD solutions
- Extending Cloud-native build tools to support allocation of ML assets, data and hardware during training builds
- Hardware pool management
Testing ML assets appropriately - Conventional Unit / Integration / BDD / UAT testing
- Adversarial testing
- Bias detection
- Fairness testing
- Ethics testing
- Interpretability
- Stress testing
- Security testing
Governance processes for managing the release cycle of MLOps assets, including Responsible AI principles - Managing release of new training sets to data science team
- Establishing thresholds for acceptable models
- Monitoring model performance (and drift) over time to feed into thresholds for retraining and deployments
- Managing competitive training of model variants against each other in dev environments
- Managing release of preferred models into staging environments for integration and UAT
- Managing release of specific model versions into production environments for specific clients/deployments
- Managing root cause analysis for incident investigation
- Observability / interpretability
- Explainability and compliance
Management of shared dependencies between training and operational phases A number of ML approaches require the ability to have reusable resources that are applied both during training and during the pre-processing of data being passed to operational models. It is necessary to be able to synchronise these assets across the lifecycle of the model. e.g. Preprocessing, Validation, Word embeddings etc.
Abstraction for models Stored models are currently often little more than serialised objects. To decouple training languages, platforms and hardware from operational languages, platforms and hardware it is necessary to have broadly supported standard intermediate storage formats for models that can be used reliably to decouple training and operational phases.
Longevity of ML assets Decision-making systems can be expected to require very long effective operating lifetimes. It will be necessary in some scenarios to be able to refer to instances of models across significant spans of time and therefore forward and backward compatibility issues, storage formats and migration of long running transactions are all to be considered.
Managing and tracking trade-offs Solutions including ML components will be required to manage trade-offs between multiple factors, for example in striking a balance between model accuracy and customer privacy, or explainability and the risk of revealing data about individuals in the data set. It may be necessary to provide intrinsic metrics to help customers balance these equations in production. It should also be anticipated that customers will need to be able to safely A/B test different scenarios to measure their impact upon this balance.
Escalation of data categories As a side effect of applying governance processes to check for fairness and bias within models, it may become necessary to hold special category data providing information about race, religion or belief, sexual orientation, disability, pregnancy, or gender reassignment in order to detect such biases. As a result of this, there will be an escalation of data sensitivity and in the legal constraints that apply to the solution.
Intrinsic protection of models Models are vulnerable to certain common classes of attack, such as:
   - Black box attempts to reverse engineer them
   - Model Inversion attacks attempting to extract data about individuals
   - Membership Inference attacks attempting to verify the presence of individuals in data sets
   - Adversarial attacks using tailored data to manipulate outcomes
It should be anticipated that there will be a need for generic protections against these classes of challenge across all deployed models.
Emergency cut out As models are trained typically with production/runtime data, and act on production data, there can be cases where undesirable behaviour of a recently deployed model change is only apparent in a production environment. One example is a chat bot that uses appropriate language in reaction to some interactions and training sets. There is a need to have the ability to quickly cut out a model or roll back immediately to an earlier version should this happen.
Online learning There are systems in use that use online learning, where a model or similar evolves in near real time with the data flowing in (and there is not necessarily a deploy stage). Systems also may modify themselves at runtime in this case without any sort of rebuilding or human approval. This will require further research and observation of emerging practices and use cases of this approach to machine learning.

Technology Requirements

In this section, we capture specific technology requirements to enable progress against the challenges identified above:

Educating data science teams regarding the risks of trying to use Jupyter Notebooks in production Jupyter Notebooks are the tool we use to educate Data Scientists as they can easily be used to explore ad hoc ML problems incrementally on a laptop. Unfortunately, when all you have is a hammer, everything tends to look like a nail.

We see Jupyter Notebooks featuring in production scenarios, not because this are the best tool for the job, but because this is the tool we taught people to use and because we didn't teach about any of the problems inherent in using that tool.

This approach persists because of an ongoing gap in the tool chain.

- Improved technology solutions are required that enable Data Scientists to easily run experiments at scale on elastic Cloud resources in a consistent, audited and repeatable way.
- These solutions should integrate with existing Integrated Development Environments, Source Code Control and Quality Management tools.
- Solutions should integrate with CI/CD environments so as to facilitate Data Scientists setting up multiple variations on a training approach and launching these as parallel, audited activities on Cloud infrastructure.
- Working with the training of models and the consumption of trained models should be a seamless experience within a single toolchain.
- Tooling should introduce new Data Scientists to the core concepts of software asset management, quality management and automated testing.
- Tooling should enable appropriate governance processes around the release of ML assets into team environments.
Treat ML assets as first class citizens in DevOps processes ML assets include but are not limited to training sets, training configurations (hyper-parameters), scripts and models. Some of these assets can be large, some are more like source code assets. We should be able to track changes to all these assets, and see how the changes relate to each other. Reporting on popular DevOps "metrics" like "Change Failure Rate" and "Mean cycle time" should be possible if done correctly. Whilst some of these assets are large, they are not without precedent in the DevOps world. People have been handling changes to database configurations, copying and backup of data for some time, so they same should apply to all ML assets. In the following sections we can explore how this may be implemented.
Providing mechanisms by which training sets, training scripts and service wrappers may all be versioned appropriately Training sets are prepared data sets which are extracted from production data. They may contain PII or sensitive information, so making the data available to developers in a way analogous to source code may be problematic. Training scripts are smaller artefacts but are sometimes coupled to the training sets. All scripts should be versioned in a way that makes the connected to the training sets that they are associated with. Scripts and training sets are also coupled to meta data, such as instructions as to how training sets are split up for testing and validation, so a model can be reproduced in ideally a deterministic manner if required.
Providing mechanisms by which changes to training sets, training scripts and service wrappers may all be auditable across their full lifecycle It is an essential requirement that all changes to ML assets must result in a robust audit trail capable of meeting forensic standards of analysis. It must be possible to work backwards from a given deployment of assets at a specified time, tracing all changes to this set of assets and the originators of each change. Tooling supporting decision-making systems in applications working with sensitive personal data or life-threatening situations must support non-repudiation and immutable audit records.

It should be anticipated that customers will be operating in environments subject to legal or regulatory compliance requirements that will vary by industry and jurisdiction that may require varying standards of audit transparency, including requirements for third party audit.
Treating training sets as managed assets under a MLOps workflow One of the difficult challenges for MLOps tooling is to be able to treat data as a managed asset in an MLOps workflow. Typically, it should be expected that traditional source code control techniques are inapplicable to data assets, which may be very large and reside in environments that are not amenable to tracking changes in the same manner as source code. New meta-data techniques must be created to effectively record and manage aggregates of data that represent specific versions of training, testing or validation sets. Given the variety of data storage mechanisms in common usage, this will likely necessitate pluggable extensions to tooling.

It must be possible to define a specific set of data which can be introduced into MLOps tooling for a given training run to produce a model version, and to retrospectively inspect the set of data that was used to create a known model version in the past. Multiple model instances may share one data set and subsequent iterations of a model may be required to be regression tested against specific data set instances. It should be recognised that data sets are long-lived assets that may have compliance implications but which may also be required to be edited in response to data protection requests, invalidating all models based upon a set.
Managing the security of data in the MLOps process with particular focus upon the increased risk associated with aggregated data sets used for training or batch processing The vast majority of MLOps use-cases can be expected to involve mission-critical applications, sensitive personal data and large aggregated data sets which all represent high value targets with high impact from a security breach. As a result, MLOps tooling must adopt a 'secure-by-design' position rather than assuming that customers will harden their deployments as part of their responsibilities.

Solutions must not default to insecure configurations for convenience, nor should they provide user-facing options that invalidate system security as a side effect of adding functionality.
Implications of privacy, GDPR and the right to be forgotten upon training sets and deployed models - Tooling should provide mechanisms for auditing individual permissions to use source data as part of training sets, with the assumption that permission may be withdrawn at any time

- Tooling should provide mechanisms to invalidate the status of deployed models where permissions to use have been revoked

- Tooling may optionally provide mechanisms to automatically retrain and revalidate models on the basis of revoked permissions

- Tooling should facilitate user-specific exceptions to model invocation rules where this is necessary to enable the right to be forgotten or to support the right to opt out of certain types of data tracking
Methods for wrapping trained models as deployable services in scenarios where data scientists training the models may not be experienced software developers with a background in service-oriented design Models that have been trained and which have passed acceptance testing need to be deployed as part of a broader application. This might take the form of elastically scalable Cloud services within a distributed web application, or as an embedded code/data bundle within a mobile application or other physical device. In some cases, it may be expected that models need to be translated to an alternate form prior to deployment, perhaps as components of a dedicated FPGA or ASIC in a hardware solution.

- MLOps tooling should integrate and simplify the deployment of models into customer applications, according to the architecture specified by the customer.

- MLOps tooling should not force a specific style of deployment for models, such as a dedicated, central 'model server'

- Tooling should assume that model execution in Cloud environments must be able to scale elastically

- Tooling should allow for careful management of execution cost of models in Cloud environments, to mitigate the risk of unexpected proliferation of consumption of expensive compute resources

- Tooling should provide mechanisms for out-of-the-box deployment of models against common architectures, with the assumption that customers may not be expert service developers

- Tooling should provide automated governance processes to manage the release of models into production environments in a controlled manner

- Tooling should provide the facility to upgrade and roll-back deployed models across environments

- It should be assumed that models represent reusable assets that may be deployed in the form of multiple instances at differing point release versions across many independent production environments

- It should be assumed that more than one point release version of a model may be deployed concurrently in order to support phased upgrades of system functionality across customer environments
Approaches for enabling all Machine Learning frameworks to be used within the scope of MLOps, regardless of language or platform MLOps is a methodology that must be applicable in all environments, using any programming language or framework. Implementations of MLOps tooling may be opinionated about the approach to the methodology but must be agnostic to the underlying technologies used to implement the models and services associated.

It should be possible to use MLOps tooling to deploy solution components utilising different languages or frameworks using loosely-coupled principles to provide compatibility layers.
Approaches for enabling MLOps to support a broad range of target platforms, including but not limited to CPU, GPU, TPU, custom ASICs and neuromorphic silicon. MLOps should be considered as a cross-compilation problem where the architectures of the source and target platforms may be different. In trivial cases, models may be trained on say CPU or GPU and deployed to execute on the same CPU or GPU architecture, however other scenarios already exist and should be expected to be increasingly likely in the future. This may include training on GPU / TPU and executing on CPU or, in edge devices, training on any architecture and then translating the models into physical logic that can be implemented at very low cost / size / power directly on FPGA or ASIC devices.

This implies the need for architecture-independent intermediate formats to facilitate cross-deployment or cross-compilation onto target platforms.
Methods for ensuring efficient use of hardware in both training and operational scenarios ML training and inferencing operations are typically very processing intensive operations that can expect to be accelerated by the use of dedicated hardware. Training is essentially an intermittent ad-hoc process that may run for hours-to-days to complete a single training run, demanding full utilisation of large scale compute resources for this period and then releasing that demand outside active training runs. Similarly, inferencing on a model may be processing intensive during a given execution. Elastic scaling of hardware resources will be essential for minimising cost of ownership however the dedicated nature of existing accelerator cards makes it currently hard to scale these elastically in today's Cloud infrastructure.

Additionally, some accelerators provide the ability to subdivide processing resource into smaller allocation units to allow for efficient allocation of smaller work units within high capacity infrastructure.

It will be necessary to extend the capabilities of existing Cloud platforms to permit more efficient utilisation of expensive compute resources whilst managing overall demand across multiple customers in a way that mitigates security and privacy concerns.
Approaches for applying MLOps to very large scale problems at petabyte scale and beyond As of 2020, large ML data sets are considered to start at around 50TB and very large data sets may derive from petabytes of source data, especially in visual applications such as autonomous vehicle control. At these scales, it becomes necessary to spread ML workloads across thousands of GPU instances in order to keep overall training times within acceptable elapsed time windows (less than a week per run).

Individual GPUs are currently able to process in the order of 1-10GB of data per second but only have around 40GB of local RAM. An individual server can be expected to have around 1TB of conventional RAM and around 15TB of local high speed storage as cache for around 8 GPUs, so may be able to transfer data between these and the compute units at high hundreds of GB/s with upstream connections to network storage running at low hundreds of GB/s.

Efficient workflows rely upon being able to reduce problems into smaller sub-units with constrained data requirements or systems start to become I/O bound. MLOps tooling for large problems must be able to efficiently decompose training and inferencing workloads into individual operations and data sets that can be effectively distributed as parallel activities across a homogeneous infrastructure with a supercomputing-style architecture. This can be expected to exceed the capabilities of conventional Cloud computing infrastructure and require dedicated hardware components and architecture so any MLOps tooling must have appropriate awareness of the target architecture in order to optimise deployments.

At this scale, it is not feasible to create multiple physical copies of petabytes of training data due to storage capacity constraints and limitations of data transfer rates, so strategies for versioning sets of data with metadata against an incrementally growing pool will be necessary.
Providing appropriate pipeline tools to manage MLOps workflows transparently as part of existing DevOps solutions Existing projects using DevOps practices typically will have pipelines and delivery automated. Ideally MLOps solutions would extend this rather than replace it. In some cases it may be necessary to have a ML model (for example) have its own tooling and pipeline, but that should be the exception (as typically there is a non trivial amount of source code that goes along with the training and model for scripts and endpoints, as covered previously).
Testing ML assets appropriately ML assets should be considered at least at the same level as traditional source code in terms of testing (unit, integration, end to end, acceptance etc). Metrics like coverage may still apply to scripts and service endpoints, it not for the model itself (as it isn't derived from source code). Further to this, typically models when deployed are used in decision making capacities where the stakes are higher, or there are potential governance or compliance or bias risks. This implies that testing will need to cover far more than source code, but actively show in a way suitable to a variety of stakeholders, how it was tested (for example was socioeconomic bias testing included). The testing involved should be in an accessible fashion so it is not only available for developers to audit.
Governance processes for managing the release cycle of MLOps assets, including Responsible AI principles MLOps as a process extends governance requirements into areas beyond those typically considered as part of conventional DevOps practices. It is necessary to be able to extend auditing and traceability of MLOps assets all the way back into the data that is chosen for the purposes of training models in the first instance. MLOps tooling will need to provide mechanisms for managing the release of new training sets for the purposes of training, with the consideration that many data scientists may be working on a given model, and more than one model may be trained against a specific instance of a training data set. Customers must have the capability to retain a history of prior versions of training data and be able to recreate specific environments as the basis of new avenues of investigation, remedial work or root cause analysis, potentially under forensic conditions.

The training process is predicated upon the idea of setting predefined success criteria for a given training session. Tooling should make it easy for data science teams to clearly and expressively declare success criteria against which an automated training execution will be evaluated, such that no manual intervention is required to determine when a particular training run meets the standard required for promotion to a staging environment.

Tooling should also offer the ability to manage competitive training of model variants against each other as part of automated batch training activities. This could involve predefined sets of hyper-parameters to test in parallel training executions, use of smart hyper-parameter tuning libraries as part of training scripts, or evolutionary approaches to iteratively creating model variants.

It should be possible to promote preferred candidate models into staging environments for integration and acceptance testing using a defined set of automated governance criteria and providing a full audit trail that can be retained across the lifetime of the ML asset being created.

Tooling should permit the selective promotion of specific model versions into target production environments with the assumption that customers may need to manage multiple live versions of any given model in production for multiple client environments. Again, this requires a persistent audit trail for each deployment.

It should be assumed that the decision-making nature of ML-based products will require that any incident or defect in a production environment may result in the need for a formal investigation or root-cause analysis, potentially as part of a compliance audit or litigation case. Tooling should facilitate the ability to easily walk backwards through audit trail pathways to establish the full state of all assets associated with a given deployment and all governance decisions associated. This should be implemented in such a way as to be easily initiated by non-technical staff and provide best efforts at non-repudiation and tamper protection of any potential evidence.

Models themselves can be expected to be required to be constructed with a level of conformance to observability, interpretability and explainability standards, which may be defined in legislation in some cases. This is a fundamentally hard problem and one which will have a significant impact upon the practical viability of deploying ML solutions in some fields. Typically, these concerns have an inverse relationship with security and privacy requirements so it is important that tooling consider the balance of these trade-offs when providing capabilities in these areas.
Management of shared dependencies between training and operational phases Appropriate separation of concerns should be facilitated in the implementation of training scripts so that aspects associated with the preprocessing of data are handled independently from the core training activities.

Tooling should provide clean mechanisms for efficiently and safely deploying code associated with preprocessing activities to both training and service deployment environments. It is critical that preprocessing algorithms are kept consistent between these environments at all times and a change in one must trigger a change in the other or raise an alert status capable of blocking an accidental release of mismatched libraries.

It should be considered that the target environment for deployment may differ from that of training so it may be necessary to support implementations of preprocessing functions in different languages or architectures. Under such circumstances, tooling should provide robust mechanisms for ensuring an ongoing link between the implementations of these algorithms, preferably with a single, unified form of testing to prevent divergence.
Abstraction layer for models It is unsafe to assume that models will be deployed into environments that closely match those it which the model was originally trained. As a result, the use of object serialisation for model description is only viable in a narrow range of potential use cases. Typically, we need to be able to use a platform independent intermediate format to describe models so that this can act as a normalised form for data exchange between training and operational environments.

This form must be machine readable and structured in such a way that it is extensible to support a wide range of evolving ML techniques and can easily have new marshalling and unmarshalling components added as new source and target environments are adopted. The normalised form should be structured such that the training environment has no need to know about the target operational environment in advance.
Longevity of ML assets With traditional software assets, such as binaries, there is some expectation of backwards compatibility (in the case of, for example, Java, this compatibility of binaries has spanned decades). ML Assets such as binaries will need to have some reasonable backwards compatibility. Versioning of artefacts for serving the model are also important (but typically that is a better known challenge). In the case of backwards breaking changes, ML assets such as training scripts, tests and data sets will need to be accessed to reproduce a model for the new target runtime.
Managing and tracking trade-offs ML solutions always involve trade-offs between different compromises. A typical example might be the trade-off between model accuracy, model explainability and data privacy. Viewed as the points of a triangle, we can select for a model that fits some point within the triangle where its distance from each vertex represents proximity to that ideal case. Since we train models by discovery, we can only make changes to our training data sets and hyper-parameters and then evaluate the properties of any resulting model by testing against these desired properties. Because of this, it is important that MLOps tooling provides capabilities to automate much of the heavy lifting associated with managing trade-offs in general. This might take the form of automated training of variations upon a basic model which are subsequently tested against a panel of selection criteria, evaluated and presented to customers in such a way as to make their relevant properties easily interpretable.
Escalation of data categories To obtain an accurate model, or to prevent the production of models with undesirable biases, it may be necessary to store data of a very sensitive nature or legally protected categories. This data may be used to vet a model pre release, or for training, in any case it will likely be persisted along with training scripts. This will require strong data protections (as with any database of a sensitive nature) and auditability of access. MLOps systems are likely to be targets of attack to obtain this data, meaning stronger considerations for protections that just source code would be required.

It is anticipated that regulatory requirements intended to reduce the impact of bias or fairness issues will have unintended consequences relating to the sensitivity of other data that must be collected to fulfil these requirements, creating additional privacy risks.
Intrinsic protection of models Model inferencing will have to embrace modern application security techniques to protect the model against these kinds of attacks. Inferencing might be protected through restriction of access(tokens), rate-limiting, and monitoring of incoming traffic. In addition, as part of the integration test phase, there is a requirement to test the model against adversarial attacks(both common attacks, and domain-specific attacks) in a sandboxed environment.

It must also be recognised that Python, whilst convenient as a language for expressing ML concepts, is an interpreted scripting language that is intrinsically insecure in production environments since any ad-hoc Python source that can be injected into a Python environment can be executed without constraint, even if shell access is disabled. The long term use of Python to build mission-critical ML models should be discouraged in favour of more secure-by-design options.
Emergency cut out As a model may need to abruptly be cut out, this may need to be done at the service wrapper level as an emergency measure. Rolling back to a previous version of the model and a service wrapper is desirable, but only if it is fast enough for safety reasons. At the very least, the ability to deny service in the span of minutes in cases of misbehaviour is required. This cut out needs to be human triggered at least, and possibly triggered via a live health check in the service wrapper. It is common in traditional service deployments to have health and liveness checks, a similar thing exists for deployed models where health includes acceptable behaviour.
Online learning Self-learning ML techniques require that models are trained live in production environments, often continuously. This places the behaviour of such models outside the governance constraints of current MLOps platforms and introduces potentially unconstrained risk that must be managed in end user code. Further consideration must be given to identifying ways in which MLOps capabilities can be extended into this space in order to provide easier access to best known methods for mitigating the risk of degrading quality.

Potential Solutions

This section addresses a timeline of potential solutions to assist in understanding the relative maturity of each area of concern and to highlight where significant gaps exist:


Cross-Cutting Concerns

The following areas represent aspects of MLOps that overlap with issues covered by other SIGs and indicate where we must act to ensure alignment of work on these elements:

The following cross-cutting concerns are identified:

  • CI/CD tooling across DevOps and MLOps scenarios

  • Pipeline management

  • Common testing aspects such as code coverage, code quality, licence tracking, CVE checking etc.

  • Data quality

  • Security

Conclusions and Recommendations

As of 2020 we believe that, whilst some techniques exist for treating ML assets and practices as first class, they are not evenly applied, and should be widely adopted, practiced and continuously improved.

Throughout 2020 there have been interesting developments in the application of AI and ML, often negative, in the case of knee jerk regulation, bias and negative press for example. Many of these could be addressed or improved (and in some cases prevented) with the application of the MLOps practices discussed in this document. Some techniques require more research, however many of them are already in existence in some form and merely require continuous improvement in the future. The one area we do not have a clear line of sight in is Online Learning (including but not limited to scenarios like model-less reinforcement learning) and more work must be done to understand how we can further governance goals in this context.

We recommend that development continue in the next year toward improved auditability of changes to training data, scripts and service wrappers across their full lifecycle. As AI-powered systems mature there will be more requirement to treat models like other software assets through their lifecycle. Training data must become a managed asset like any other input to software development.

We anticipate that the security of data that is used for training, and the tension between the issues of machine learning, privacy, GDPR and the "right to be forgotten" will remain as ongoing challenges until well in 2022.


Short definitions of terms used that aren't explained inline.

  • Training: act of combining data with (hyper) parameters to yield a model
  • Model: A deployable (usually binary) artefact that is the result of training. Can be used at runtime to make predictions for example
  • Hyper-parameter: parameters used during training of a model, typically set by a data scientist/human
  • Parameter: a configuration variable set after the training phase (usually part of a model)
  • Endpoint: typically a model is deployed to an endpoint which may be a HTTPS service which serves up predictions (models may be also deployed to devices and other places)
  • Training Pipeline: all the steps needed to prepare a model
  • Training set: a set of example data used for training a model