# Machine Learning Operations (MLOps): Planning, Model Development, Deployment & Governance

#### A Consulting mindset

The single most important shift from being a student, software developer, or data scientist within an established team to being a data science *consultant* is: 

**Your client probably doesn't have the full picture of what they want; it's your job to help them figure that out *and* to build it for them.**

So for example, your client might say something like the following: 

- "We want a machine learning model that can be deployed to a hospital system and predict which patients will have a stroke." 
- "We want an application that takes streaming video at an outpost in *insert foreign country* and identifies potential threats."
- "We want something to help us prioritize patients based on the probability of remission. Can you help us with that?" 

Each of these is a decent starting point (sometimes you won't even get something this specific, or you might get conflicting requests and have to work with the client to set priorities), and a possible application of machine learning technologies.

In each case, though, there are tens, if not hundreds, of questions to answer before arriving at a picture of what will actually work best for your client. Some of these questions you'll be able to answer yourself and others you'll have to ask your client directly, ideally with your recommendation and an analysis of pros and cons. 

At IBM, every data scientist has to understand the client's business needs and make technical decisions with those needs in mind. 

## Machine-Learning Operations (MLOps)



Machine Learning Operations, or MLOps, takes a holistic view of the machine learning development process. As a data scientist, you're not strictly concerned with developing a model. You're also responsible for understanding how that model will be used to satisfy business needs, how it will be developed and deployed, and how it will be monitored and improved. 

This process is cyclical, requiring ongoing revisitation and improvements. There are many graphical representations of this cycle (just as there are many approaches to MLOps); this is the version that Databricks uses: 

![DataBricks Lifecycle](static/Databricks_mlops.png)

## A Framework for developing a "complete" ML solution: 

Try this framework for working out solutions with a client. These steps will be constantly overlapping (e.g., as soon as you understand the business case you should be thinking in the back of your mind about possible approaches), but this is the general order I've found effective. 

Note that this is based in part on the O'Reilly book [Introducing MLOps](https://www.oreilly.com/library/view/introducing-mlops/9781492083283/), which you have access to via IBM YourLearning, but I've reordered and dramatically condensed the material. 

#### Initiate the Engagement: 

1. Describe your approach and seek client buy-in

#### Gather Information 

2. Understand the Business Need 
3. Understand Data 
4. Understand IT Architecture

#### Develop an Approach

5. Develop a model(s)
6. Prep Model for Deployment and Deploy
7. Model Monitoring, Improvement, and Governance

## Initiate the Engagement 

#### 1. Describe your approach and seek client buy-in

This may sound redundant, but the most important aspect of a client engagement is maintaining confidence and trust. With trust, you'll be able to take more risks and do far better work. And the best way of maintaining trust is communication, i.e., telling the client what you're doing, why you're doing it, and what you're doing next. 

Start the engagement with a kick-off with your client. Walk them through your plans for steps 2-7. Ask your Project Manager what the timeframe for your project is, and if they don't know, ask the client during this kick-off. You can also ask your PM what the budget is and how many staff will be committed. 

## Gather Information

#### 2. Understand the Business Need

You should understand exactly what need your client is trying to fill with a machine learning model and why. This is also the element of the engagement in which your relationship with your client should be most conversational. Keep up an ongoing rapport on what your model is trying to accomplish. 

Some of the questions you might ask include: 

- What are we trying to accomplish with an ML model? 
- What Key Performance Indicators will we be targeting? 
- How will you determine whether the return on investment in ML is worth ongoing cost?
- What is the level of risk tolerance, e.g., for a wrong prediction? Should sensitivity or specificity be optimized? 
- What level of interpretability is required? 
- Are there any ethical, regulatory, or compliance concerns at play? 
- What cost constraints are we working with?
- What time constraints are we working with? 
- What other teams will we be able to work with/rely on? Legal, data, DevOps, model risk manager/auditor, etc.? 

Note that I use "we" rather than "you" when describing the model development process. This is a joint endeavor -- you should view your consulting team as, for the duration of the engagement, part of your client team, and see these problems from their point of view. The "We" mentality is the key to building client relationships that last years, well beyond the initial contract. 
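To make the sensitivity-versus-specificity question concrete for a client, it can help to show how both metrics move as the decision threshold changes. The following is a minimal sketch using only the standard library; the labels, the scores, and the `sensitivity_specificity` helper are made up for illustration and do not come from any real engagement.

```python
def sensitivity_specificity(y_true, y_score, threshold):
    """Return (sensitivity, specificity) for binary labels at a score threshold."""
    tp = fp = tn = fn = 0
    for label, score in zip(y_true, y_score):
        predicted = 1 if score >= threshold else 0
        if label == 1 and predicted == 1:
            tp += 1
        elif label == 1 and predicted == 0:
            fn += 1
        elif label == 0 and predicted == 1:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


# Hypothetical labels (1 = event occurred) and model scores, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_score = [0.9, 0.6, 0.4, 0.5, 0.2, 0.1, 0.7, 0.8]

for threshold in (0.3, 0.5, 0.7):
    sens, spec = sensitivity_specificity(y_true, y_score, threshold)
    print(f"threshold={threshold:.1f}  sensitivity={sens:.2f}  specificity={spec:.2f}")
```

Lowering the threshold catches more true cases at the cost of more false alarms. Which side to favor is a business decision, which is why it belongs in this conversation rather than being settled during modeling.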

#### 3. Understand Data 

Once you understand the business case, turn to the data. The data are the building blocks on which you'll be developing a machine learning solution. Without data that speaks to the business need(s) you've identified, you're unlikely to be able to meet those needs with an ML solution. 

At this point, you should be asking questions like: 

- What relevant datasets are available? 
- Is this data sufficiently accurate and reliable? 
- How can stakeholders (e.g., our team) get access to data? 
- What features currently exist in the data? Do we have a data dictionary? 
- Is data made available in real time? If not, at what intervals is it updated? 
- How large are the relevant data sets? 
- Do we have “ground truth?” Is there a need to create it, and will that be resource-intensive? 
- How will data be updated once we’ve deployed a model? 
- Is there PII that must be redacted? 
- Are there features, such as gender, that cannot be legally used?
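
Several of these questions can be partly answered with a quick first pass over a sample extract. The sketch below is one hedged way to do that with the standard library. The column names, the sample rows, and the `SENSITIVE_COLUMNS` set are all hypothetical; in practice they would come from the client's data dictionary and legal guidance.

```python
import csv
import io

# Hypothetical extract; in practice this would be a file or query result from the client.
SAMPLE_CSV = """patient_id,name,gender,age,systolic_bp,had_stroke
1,Ann Lee,F,64,150,1
2,Bob Roy,M,,135,0
3,Cy Diaz,M,71,,1
4,Di Wu,F,58,120,0
"""

# Assumed list of PII or legally restricted features, agreed with the client.
SENSITIVE_COLUMNS = {"name", "gender"}


def profile(rows):
    """Return {column: fraction of missing values} for a list of dict rows."""
    columns = rows[0].keys()
    return {
        col: sum(1 for row in rows if row[col] == "") / len(rows)
        for col in columns
    }


def drop_sensitive(rows, sensitive):
    """Return copies of the rows without the sensitive columns."""
    return [{k: v for k, v in row.items() if k not in sensitive} for row in rows]


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
missing = profile(rows)
clean_rows = drop_sensitive(rows, SENSITIVE_COLUMNS)

print(f"{len(rows)} rows, columns: {list(rows[0].keys())}")
print("missing fraction:", missing)
print("columns after redaction:", list(clean_rows[0].keys()))
```

Dropping columns is only the simplest form of redaction. Free-text fields and combinations of seemingly harmless features can still identify individuals, so treat this as a starting point for the conversation, not a compliance measure.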

If the client has an established data governance framework, then there may be additional questions to ask, such as: 

- Can the selected datasets be used for this purpose? 
- What are the terms of use? 



#### 4. Understand IT Architecture

Finally, before moving to the planning and development stage, you'll want to have at least a cursory understanding of the IT Architecture on which you'll be deploying. Also, the sooner you can get your hands on technical documentation, such as an architecture diagram, and circulate it to your team, the better. 

You'll ask far more detailed questions during the solution process, but initially, you'll want to establish basic premises, like: 

- Do we have an architecture diagram or current state diagram for the organization, as well as logical and physical data models? 
- Do applications in the organization tend to be on-prem, cloud-based, or hybrid? 
- How many applications would this model serve and touch? 
- Are there any additional dependencies that we should keep in mind? 
- How are applications deployed in the organization, e.g., through Dev, Test, and Prod environments? What barriers to deployment should we be aware of? 

## Develop an Approach

#### 5. Develop a model(s)

Finally, the data science part! 

*As you develop, you should be continually referencing and updating the information you gathered in steps 1-4, because the nature of the predictive model will be determined by the business need, the available data, and the IT Architecture within which it has to function.* 

This is the part of MLOps for which you will have been trained for years as an IBM data scientist, so I won't go into much detail on this phase. 

Here are some general principles to keep in mind: 

- Experiment and have fun! Consider lots of sources -- coworkers, textbooks, Kaggle, etc. 
- Look to subject-matter experts and sources for feature engineering ideas (e.g., if you work on the FDA account, like I do, this is the part of the engagement where you should be working very closely with medical SMEs) 
- Keep checking in on business needs, data, and IT Architecture (mainly relevant for constraints in model training, re-training, and deployment)
- Consider ensembling methods (and, where the problem is suited to it, reinforcement learning) to get the best model you can within the project's constraints 
- Keep train/test completely separate
- Keep interpretability in mind! For government clients and clients in heavily regulated industries, you often have to be able to explain decision criteria - which rules out most deep learning methods 
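
One way to keep train and test completely separate, even as the client's data is refreshed, is to assign each record to a split using a stable hash of its identifier. This is a minimal sketch of that idea, not the only approach; the `assign_split` helper and the record IDs are hypothetical.

```python
import hashlib


def assign_split(record_id, test_fraction=0.2):
    """Deterministically assign a record to 'train' or 'test' from its ID."""
    digest = hashlib.sha256(str(record_id).encode("utf-8")).hexdigest()
    # Map the first 8 hex digits to a number in [0, 1).
    bucket = int(digest[:8], 16) / 16**8
    return "test" if bucket < test_fraction else "train"


# Hypothetical record IDs, for illustration only.
record_ids = [f"patient-{i}" for i in range(1000)]
splits = {rid: assign_split(rid) for rid in record_ids}

train_ids = {rid for rid, split in splits.items() if split == "train"}
test_ids = {rid for rid, split in splits.items() if split == "test"}

# No record can appear in both sets, and re-running gives the same assignment.
assert train_ids.isdisjoint(test_ids)
assert all(assign_split(rid) == splits[rid] for rid in record_ids)
print(f"train={len(train_ids)}  test={len(test_ids)}")
```

Because the assignment depends only on the ID, a record that lands in the test set stays there across data refreshes and retraining runs, which avoids accidental leakage when new rows are appended.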

Document everything! What worked, what didn't, and everything in between. You never know what you might want to revisit later, and when you present a final solution to your client, the first thing they'll ask is what else you tried. 
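
A lightweight way to keep that record is to append every run to a JSON Lines file. The sketch below assumes nothing beyond the standard library; the field names, model names, and metric values are invented for illustration. If the client already has an experiment-tracking tool in place, use that instead.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def log_experiment(log_path, model_name, params, metrics, notes=""):
    """Append one experiment record to a JSON Lines file and return it."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model_name,
        "params": params,
        "metrics": metrics,
        "notes": notes,
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record


def load_experiments(log_path):
    """Read all experiment records back, oldest first."""
    with open(log_path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical runs; the model names, parameters, and metric values are made up.
log_path = Path(tempfile.mkdtemp()) / "experiments.jsonl"
log_experiment(log_path, "logistic_regression", {"C": 1.0}, {"auc": 0.81},
               notes="Baseline; interpretable coefficients.")
log_experiment(log_path, "gradient_boosting", {"n_estimators": 200}, {"auc": 0.86},
               notes="Better AUC, harder to explain to auditors.")

for run in load_experiments(log_path):
    print(run["model"], run["metrics"], "-", run["notes"])
```

Recording the approaches that did not work, along with a short note on why, is what lets you answer the "what else did you try?" question quickly.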

There are any number of decision trees available for model development. This is an example: 

![Model Development Decision Tree](attachment:image.png)

##### Examples: 

For Python coding snippets on the fundamental machine learning algorithms, you can see some of the other Jupyter notebooks in this repository: 

- Regression
- Classification
- Clustering
- Deep Learning

#### 6. Prep Model for Deployment and Deploy

This is where we develop a much deeper understanding of the client's IT Architecture. As a student, you probably developed a machine learning script in Python, ran it on a dataset on your local machine, and then saved a model on your local machine. For a number of reasons, especially IT security, system integration, and computational performance, that will almost never be how you deploy a model. Your model will have to be packaged, deployed, and mounted within a much larger organization, where it accepts data from a database and sends outputs to any number of applications for follow-on actions.

There are two main phases here. The first is understanding what changes you need to make to your model to prepare it for deployment (so, an extension of step 5, in a sense). The second is understanding the logistics of deployment at your client site. 

#### 6.1 Prepare Model for Deployment to Production 

Factors you will want to consider for preparing a model for deployment to production include: 

* How are applications packaged for the Production environment? E.g., custom-built services, data science platforms, Kubernetes clusters/Docker, etc.? 
* What DevOps practices and tech stack are in place? E.g., Jenkins for the CI/CD pipeline? GitLab? 
* What are the software testing requirements? E.g., pytest? 
* What applications will the model be integrated with, and what integration testing will be required?
* Does the model need to be performance-optimized, e.g., to conserve storage or RAM? 
* Is the tech stack tooling appropriate? E.g., does a Python model need to be converted to C++ for performance and integration?
* What are the model auditing and testing requirements? 
* How will we be able to monitor the model in Prod? E.g., logging of results, health checks, etc.?  
* How will we test the model? Will we be able to conduct all necessary testing in a Test or PreProd environment? 
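
As one illustration of what model-level software tests can look like, here is a hedged sketch written in the pytest naming style (functions prefixed with `test_`). The `predict_risk` function is a stand-in for a real trained model, and its coefficients are invented; the checks themselves (valid output range, a domain sanity check, determinism) are examples to adapt, not a complete test plan.

```python
import math


def predict_risk(age, systolic_bp):
    """Stand-in for a trained model: returns a risk score between 0 and 1.

    The coefficients are invented for illustration and have no clinical meaning.
    """
    z = -8.0 + 0.05 * age + 0.03 * systolic_bp
    return 1.0 / (1.0 + math.exp(-z))


def test_output_is_a_probability():
    for age, bp in [(20, 100), (50, 130), (90, 200)]:
        assert 0.0 <= predict_risk(age, bp) <= 1.0


def test_risk_increases_with_age():
    # A domain sanity check agreed with subject-matter experts (assumed here).
    assert predict_risk(80, 140) > predict_risk(40, 140)


def test_prediction_is_deterministic():
    assert predict_risk(65, 150) == predict_risk(65, 150)


if __name__ == "__main__":
    # pytest would discover the test_* functions; this runs them without it.
    for test in (test_output_is_a_probability,
                 test_risk_increases_with_age,
                 test_prediction_is_deterministic):
        test()
    print("all checks passed")
```

Tests like these are cheap to run in a CI/CD pipeline on every change, and they give auditors something concrete to review alongside the model's accuracy metrics.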

This is also when questions about regulatory risk, business risk, bugs in the runtime environment, etc., sometimes emerge, if they haven't already. As with the other steps, you'll want to be looping back and making sure you are continuously updating your information and changing plans, if necessary. 


#### 6.2 Deploy Model to Production


Finally, make a plan for deployment! Prior to deployment, you should know: 

- How should we package release documentation? 
- Is there downtime required?
- Who signs off on deployment? 
- How does the model ingest data? There are two main options: 
  - Batch scoring, where whole datasets are processed using the model 
  - Real-time scoring, where one or a small number of records are scored
- What is the deployment method?
  - Deployment in place (downtime, not ideal) 
  - Rolling or blue-green 
- How is the model maintained in production?
  - Resource monitoring, e.g., CPU, memory, disk, network usage
  - Health check, i.e., query the model at a fixed interval and log the result
  - ML metrics monitoring 
- Questions for containerization: 
    - Which Docker host should release the containers? 
    - When a model is deployed in several copies, how is the workload balanced? 
- What is the switchover protocol? 
- When deploying a model, there are several possible scenarios: 
    - One model is deployed on one server
    - One model is deployed on multiple servers 
    - Multiple versions of a model are deployed on one server
    - Multiple versions of a model are deployed on multiple servers
    - Multiple versions of multiple models deployed on multiple servers 
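
To ground a few of the items above, the following sketch shows how batch scoring, real-time scoring, and a health check can sit behind one small wrapper. It is an illustration under stated assumptions, not a production design: `ModelService`, `toy_score`, and the canary record are hypothetical names, and a real deployment would put this behind whatever serving infrastructure the client uses.

```python
import time


class ModelService:
    """Wraps a scoring function with real-time, batch, and health-check entry points."""

    def __init__(self, score_fn, canary_record):
        self.score_fn = score_fn
        self.canary_record = canary_record  # known-good input used for health checks

    def score_one(self, record):
        """Real-time scoring: one record in, one score out."""
        return self.score_fn(record)

    def score_batch(self, records):
        """Batch scoring: process a whole dataset in one call."""
        return [self.score_fn(r) for r in records]

    def health_check(self):
        """Score the canary record and report status and latency."""
        start = time.perf_counter()
        try:
            score = self.score_one(self.canary_record)
            healthy = 0.0 <= score <= 1.0
        except Exception:
            score, healthy = None, False
        latency_ms = (time.perf_counter() - start) * 1000
        return {"healthy": healthy, "score": score, "latency_ms": latency_ms}


def toy_score(record):
    """Hypothetical model: risk rises with age, capped to the range [0, 1]."""
    return min(max(record["age"] / 100, 0.0), 1.0)


service = ModelService(toy_score, canary_record={"age": 50})

print(service.score_one({"age": 72}))
print(service.score_batch([{"age": 30}, {"age": 65}, {"age": 120}]))
print(service.health_check()["healthy"])
```

In production, a scheduler or orchestrator would call `health_check` at a fixed interval and log the result, and the same wrapper could be replicated behind a load balancer when the model is deployed in several copies.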


#### 7. Model Monitoring, Improvement, and Governance

