# Operational level monitoring

<img src="_images/03op_om.jpg">

## System performance and reliability

The following system and application metrics give you an idea of how the model is performing in production:

- CPU/GPU utilization while the model computes predictions for each API call; this tells you how much compute your model consumes per request.
- Memory utilization, for example when the model caches data or input data is held in memory for faster I/O.
- Number of failed requests per event/operation.
- Total number of API calls.
- Response time of the model server or prediction service.
- System reliability: infrastructure and network uptime, and so on.
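
As a minimal sketch of the request-level metrics in the list above (total API calls, failed requests per operation, and response time), the snippet below keeps the counters in memory. `RequestMetrics` and its `track` method are hypothetical names introduced for illustration, not part of any library, and CPU/GPU and memory utilization are assumed to be collected separately by a host-level agent.

```python
import math
import time
from collections import Counter
from contextlib import contextmanager


class RequestMetrics:
    """In-memory request counters for one prediction service (illustrative only)."""

    def __init__(self):
        self.total_calls = 0
        self.failed_by_operation = Counter()
        self.latencies_s = []

    @contextmanager
    def track(self, operation):
        """Count one API call, time it, and record a failure if it raises."""
        self.total_calls += 1
        start = time.perf_counter()
        try:
            yield
        except Exception:
            self.failed_by_operation[operation] += 1
            raise
        finally:
            # Latency is recorded for successful and failed calls alike.
            self.latencies_s.append(time.perf_counter() - start)

    def summary(self):
        """Return total calls, failures per operation, and p95 response time."""
        ordered = sorted(self.latencies_s)
        p95 = None
        if ordered:
            # Nearest-rank percentile: smallest sample with >= 95% of samples at or below it.
            p95 = ordered[math.ceil(0.95 * len(ordered)) - 1]
        return {
            "total_calls": self.total_calls,
            "failed_by_operation": dict(self.failed_by_operation),
            "p95_latency_s": p95,
        }
```

In a prediction handler, the model call would be wrapped as `with metrics.track("predict"): ...`, and `metrics.summary()` would be exported to whatever monitoring backend you already use. A production service would normally rely on that backend's client library rather than hand-rolled counters.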



## Pipelines

Monitor the health of your data and model pipelines. Unhealthy data pipelines degrade data quality, and leakage or unexpected changes in your model pipeline can easily generate negative value.



### Data pipelines

Monitoring the health of data pipelines is crucial because data quality issues often originate in bad or unhealthy pipelines. This is especially tricky for your IT Ops/DevOps team to monitor, and it may require empowering your data engineering/DataOps team to monitor and troubleshoot issues.

It also has to be a shared responsibility. Work with your DataOps team: communicate what your model expects, and have the team tell you what their data pipeline outputs. This can help you tighten up your system and drive positive results.

If you’re responsible for monitoring your data pipeline, here are some metrics and factors you may want to track:

- __Input data__ – do the data and files entering the pipeline have the appropriate structure, schema, and completeness? Are there data validation tests and checks in place so that the team is alerted when ingested data looks odd? Monitor what comes into the data pipeline to keep it healthy.
- __Intermediate workflow steps__ – are the inputs and outputs of every task and flow in the DAG as expected, in terms of the number of files and file types? How long does a task take to run in the pipeline? This could be the data preprocessing task, or the validation task, or even the data distribution monitoring task.
- __Output data__ – is the output data schema as expected by the machine learning model in terms of features and feature embeddings? What’s the typical file size expected from an output file?
- __Data quality metrics__ – track statistical metrics of the data that flows in. These could be basic statistical properties such as mean, standard deviation, and correlation, or distance metrics such as KL divergence or the Kolmogorov-Smirnov statistic. Which metric you use depends mostly on the dimensionality of the data you expect: a couple of features or many.
- __Job runs__ – the scheduled run time of a job, its actual run time, how long it took to run, and its final state (successful or failed).
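
The input-data and data-quality checks above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not a prescribed implementation: `validate_records` and `ks_statistic` are hypothetical helper names, records are assumed to arrive as dictionaries, and the schema is assumed to be a simple mapping from field name to Python type.

```python
from bisect import bisect_right


def validate_records(records, expected_schema):
    """Return a list of issues found in a batch; an empty list means it passed."""
    issues = []
    for i, row in enumerate(records):
        missing = expected_schema.keys() - row.keys()
        if missing:
            issues.append(f"row {i}: missing fields {sorted(missing)}")
        for field, expected_type in expected_schema.items():
            if field not in row:
                continue  # already reported as missing
            value = row[field]
            if value is None:
                issues.append(f"row {i}: {field} is null")
            elif not isinstance(value, expected_type):
                issues.append(f"row {i}: {field} has type {type(value).__name__}")
    return issues


def ks_statistic(reference, current):
    """Two-sample Kolmogorov-Smirnov statistic for one numeric feature.

    This is the largest gap between the two empirical CDFs; 0.0 means the
    samples are distributed identically, 1.0 means they do not overlap.
    """
    ref, cur = sorted(reference), sorted(current)
    if not ref or not cur:
        raise ValueError("both samples must be non-empty")
    # The largest gap between two step functions occurs at a sample point.
    return max(
        abs(bisect_right(ref, x) / len(ref) - bisect_right(cur, x) / len(cur))
        for x in ref + cur
    )
```

A pipeline task could call `validate_records` on each ingested batch and alert when the returned list is non-empty, then compare each feature against a reference sample with `ks_statistic` and alert above a threshold that you choose. The threshold is a judgment call that depends on your data; if SciPy is available, `scipy.stats.ks_2samp` also returns a p-value.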



### Model pipeline

You want to track crucial factors that can cause your model to break in production after it is retrained and redeployed. These include:

- Dependencies – you don’t want a situation where your model was built with TensorFlow 2.0 and a recent dependency update by someone else on your team, bundled with TensorFlow 2.4, causes part of your retraining script to fail. Validate the version of each dependency your model runs on and log it as pipeline metadata, so that dependency updates that cause failures are easier to debug.
- Retraining jobs – when a retraining job was triggered, how long it took to run, its resource usage, and its final state (model successfully retrained and redeployed, or failed).
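
One way to capture both items above as pipeline metadata is sketched below, using only the standard library. The function names and the shape of the record are assumptions made for illustration; `train_fn` stands in for whatever callable runs your retraining, and resource usage is left out because it usually comes from the orchestrator or the host.

```python
import platform
import time
from datetime import datetime, timezone
from importlib import metadata


def collect_dependency_versions(packages):
    """Map each package name to its installed version, or None if not installed."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def diff_versions(previous, current):
    """Return {package: (old, new)} for every package whose version changed."""
    return {
        name: (previous.get(name), current.get(name))
        for name in sorted(previous.keys() | current.keys())
        if previous.get(name) != current.get(name)
    }


def run_retraining_job(train_fn, packages):
    """Run train_fn and return a metadata record for the pipeline log."""
    record = {
        "triggered_at": datetime.now(timezone.utc).isoformat(),
        "dependencies": collect_dependency_versions(packages),
    }
    start = time.perf_counter()
    try:
        train_fn()
        record["state"] = "succeeded"
    except Exception as exc:  # keep the failure in the record instead of losing it
        record["state"] = "failed"
        record["error"] = repr(exc)
    record["duration_s"] = time.perf_counter() - start
    return record
```

Storing each record next to the model artifact lets you run `diff_versions` on the `dependencies` of the last successful run and of a failed one, which points directly at any package that changed between them.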



## Cost

You need to keep an eye on how much it costs you and your organization to host your entire machine learning application, including data storage, compute, retraining, and other orchestrated jobs. These costs add up fast, especially when they are not tracked. Every prediction request also consumes compute, so track inference costs as well.
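
As a rough sketch of how inference cost can be tracked, the function below spreads the hourly price of always-on serving instances over the requests they handle. The function name and its parameters are hypothetical, and the model is deliberately simple: it assumes a fixed fleet billed by the hour and ignores storage, networking, and autoscaling.

```python
def cost_per_1k_requests(hourly_instance_price, instance_count, requests_per_hour):
    """Average serving cost per 1,000 requests for a fixed fleet of instances."""
    if requests_per_hour <= 0:
        raise ValueError("requests_per_hour must be positive")
    hourly_cost = hourly_instance_price * instance_count
    return hourly_cost / requests_per_hour * 1000
```

Feeding this with the request counts you already monitor gives a per-request figure to watch over time; a rising value at constant traffic suggests over-provisioned or less efficient serving.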