# Overview

[How to Monitor Your Models in Production](https://neptune.ai/blog/how-to-monitor-your-models-in-production-guide)

## Goal of monitoring step

![image.png](_images/6_Models_monitoring/model_degrade.png)

**Cause:**
- Machine learning models degrade over time. They’re dynamic and sensitive to changes in the real world.
- Validation results during development seldom fully reflect the model's performance in production.
- Differences between the development and production environments can cause differences in performance.

**So, the goals of model monitoring are:**
- To detect problems with your model and the system serving your model in production before they start to generate negative business value,
- To take action by triaging and troubleshooting models in production or the inputs and systems that enable them,
- To ensure their predictions and results can be explained and reported, 
- To ensure the model’s prediction process is transparent to relevant stakeholders for proper governance, 
- Finally, to provide a path for maintaining and improving the model in production.

## Criteria of metrics selection

Metric selection is specific to each business case and depends on:
- What does your business define as success, and which KPIs were set in the business analysis phase?
- What were the expectations for performance, or for the distribution of results, before deploying to production?

**Criteria for selecting metrics that are meaningful and practical**
- Comparable across models
- Simple and easy to understand
- Can be collected/computed in real time
- Allow thresholds to be set for actionable alerting on problems.

For example, metrics for a loan approval system could answer:
1. How accurately does the model predict whether a customer pays back at the stipulated time? __(Functional level monitoring)__
2. How quickly is the model score returned after the client's request is received? __(Operational level monitoring)__

## Type of monitoring

1. __Functional level monitoring__
- Data (input data level)
    - Data quality issues
    - Data/feature drift
    - Outliers
- Model
    - Monitoring model drift
    - Model configuration and artifacts
    - Model versions
    - Concerted adversaries
- Predictions (Output)
    - Model evaluation metrics
        - When y_true (the ground truth) is available
        - When y_true is not available
        
2. __Operational level monitoring__
- System performance monitoring for ML models in production
    - System performance metrics
        - CPU/GPU utilization
        - Memory utilization
        - Total number of `failed request`
        - Total number of `API calls`
        - Response time
    - System reliability
- Pipelines
    - Data pipelines
    - Model pipeline
- Cost
    
__Challenges you might meet when monitoring__
- At input level:
    - Data sources in production may be scattered and unreliable
    - Data requirements are not clearly defined
    - Data sources don’t have defined ownership
    - Metadata for your production data workflow is not discoverable
- Teamwork
- Model quality
    - Ground truth (y_true) is not available
    - Model bias
    - Black-box models
    - Tracking hyperparameters

## Stage of monitoring model

![image.jpg](_images/6_Models_monitoring/Essential-Signals-to-Monitor.jpg)


### Level 0: training and deploying models manually

![image.jpg](_images/6_Models_monitoring/lv0.png)

At this stage, you probably aren’t even thinking of monitoring your model yet, perhaps just finding a way to validate your model on the test set and hand it off to your IT Ops or software developers to deploy.

I know because I was there. I celebrated when I handed it off, as mentioned at the beginning of this article, but as you know—a couple of months later—it has indeed ended in tears and on the hospital bed.

For you to avoid this scenario, I propose you prioritize the lowest-hanging fruit. Although these signals are less informative and won’t help you monitor model performance, they can still serve as a reasonable proxy to tell you if your general application is working as intended.

You don’t want to spend long hours focusing on monitoring your model’s metrics or try to justify its performance in line with a business KPI when your workflow is still in its manual deployment stage; such metrics will get easier to measure and analyze when your MLOps system gets mature, and you can collect ground truth labels or integrate other performance proxies in the absence of ground truth.



### Level 1: continuous training of models

![image.jpg](_images/6_Models_monitoring/lv1.png)

Being at this level means that you have automated the machine learning pipeline to enable continuous training of your machine learning models based on triggers that have been set by criteria or a defined threshold.

At this stage, I reckon you focus more on monitoring:

- The business metric used to gauge your model’s performance (see “What Could Go Right” section), provided it doesn’t turn out to be too difficult to measure, especially if you can’t spend resources on getting ground truth for monitoring model metrics.
- The properties of your production data and your model’s performance in production, to detect model staleness and degradation. This can help with continuous training through triggers that automate the ML production pipelines to retrain models with new production data.
- Your model’s retraining process, which needs to log pipeline metadata, model configuration, and model metadata, because you’re most likely going to manually deploy a retrained model and you want to make sure you can monitor the properties of that model before redeploying it to production.
- Your production pipeline health, as retraining steps are automated and your data pipeline validates and preprocesses data from one or more sources.
- How much your continuous training process costs, so you don’t wake up one day with a gigantic AWS bill that you or your company did not plan for.



### Level 2: completely mature in your MLOps

![image.jpg](_images/6_Models_monitoring/lv2.png)

Being at this level indicates that you’re completely mature in your MLOps implementation and pretty much the entire pipeline is a robust, automated CI/CD system. Your training, validation, and deployment phases are all automated in a complementary feedback loop.

At this stage, you should pretty much monitor everything but your team’s focus should be on the more informative metrics, making sure that all the relevant stakeholders are empowered with the more informative metrics before spending more time on the least informative metrics.

## Best practices for monitoring

__General monitoring best practices__

- ___Focus on people first___. If you build a culture where data is also treated as the product in your organization, people will most likely be inclined to take ownership of the product to ensure it serves its intended purpose end-to-end. You can learn a lot from DevOps cultural change.
- If it’s possible, don’t give the application’s “monitoring power” to one person. If you have a cross-functional team of data professionals and Ops engineers, let everyone handle their service and communicate effectively. This will help decentralize knowledge and know-how and when the use cases scale, no one will be overwhelmed.
- Take a lean approach; using too many tools can be very tasking. Centralize your tools but decentralize the team; everyone staying on top of a task.
- Monitoring doesn’t start after deployment, it starts when you begin experimentation. Build a culture of monitoring right from the model development stage (monitoring model experimentation metrics, logs, and so on).
- Always consider what’s optimal for the productivity of your team when you encounter any crucial decision-making point.
- Encourage your team to properly document their troubleshooting framework and create a framework for going from alerting to action to troubleshooting for effective model maintenance.

__Best practices for data monitoring__

- Batch and streaming data should be processed in the same manner, using the same pipeline so that issues with the data pipeline are a lot more intuitive to troubleshoot.
- Go beyond checking drift for an entire dataset and also look at drift feature by feature, as that can provide more insights.
- Invest in a global data catalog that can help log high-quality metadata for your data that every user (your data and ML team) can rely on; it will help you tackle challenges with streaming and maintaining reliable data quality. It will also make lineage tracking easier.
- Perform a pre-launch validation on your evaluation set before moving your model to production to establish a baseline performance.

__Best practices for model monitoring__

- Model performance will inevitably degrade over time, but beware of a big dip in performance which is often indicative of something wrong—you can select tools that detect this automatically.
- Perform shadow deployment and testing with the challenger model vs the champion model, and log the predictions so that the new model’s performance can be tracked alongside the current model in production before you decide to deploy the newly trained (challenger) model.
- You can use a metadata store (like Neptune.ai) to store hyperparameters for models that have been versioned and retrained in production; this improves auditing, compliance, lineage traceability, and troubleshooting. 

__Best practices for monitoring predictions/output__

- Prediction drift can be a good performance proxy for model metrics, especially when ground truth isn’t available to collect, but it shouldn’t be used as the sole metric.
- Track unreasonable outputs from your model. For example, your classification model predicting the wrong class for a set of inputs with a high confidence score, or your regression model predicting a negative score (when the base metric score should be 0) for a given set of features. 

## Bonus contents

### Monitoring vs Observability

[Comparison of the two](https://christophergs.com/machine%20learning/2020/03/14/how-to-monitor-machine-learning-models/#monitoring-vs-observability)

__Observability__ is your ability to look at the metrics you’ve been monitoring and perform root-cause analysis on them to understand why they are a certain way, and what threat they pose to the overall performance of your system—all to improve system quality. 

__Monitoring__ is pretty much everything that happens before observability:
- Collecting performance metrics, 
- tracking them, 
- detecting potential problems, 
- alerting the right user. 

__To put it simply, you can monitor without observing, but can’t observe your system’s overall performance without monitoring it. Monitoring is about collecting the dots, observability is about connecting them!__

### Setting alerts the right way

- Test your alerts before they go into production
- Monitor the primary metrics as concluded in your needs analysis.
- Agree on the media for the alert, so every service owner is comfortable with their medium (email, Slack, ...)
- Send context to the alert by including descriptive information and action by the primary service owner.
- Make sure to set up a feedback loop that makes your monitoring better.
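
To illustrate "sending context with the alert", here is a minimal sketch of an alert payload that carries the metric, its threshold, the owner, the agreed channel, and a suggested action. The `Alert` class, `check_metric` function, field names, and the example values (`psi_income`, `data-team`) are all hypothetical and not part of any alerting library.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Alert:
    """Hypothetical alert payload carrying the context a service owner needs."""
    metric: str
    value: float
    threshold: float
    owner: str             # primary service owner
    channel: str           # agreed medium for this owner, e.g. "email"
    suggested_action: str  # what the owner should do first
    raised_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def check_metric(metric: str, value: float, threshold: float, owner: str,
                 channel: str, suggested_action: str) -> Optional[Alert]:
    """Return an Alert when the value exceeds the threshold, otherwise None."""
    if value <= threshold:
        return None
    return Alert(metric, value, threshold, owner, channel, suggested_action)


alert = check_metric(
    metric="psi_income",
    value=0.27,
    threshold=0.2,
    owner="data-team",
    channel="email",
    suggested_action="Inspect the income feature source, then consider retraining.",
)
print(alert)
```

Because the threshold check is a plain function, it can be exercised with known values before it goes live, which is one simple way to test an alert before production.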

### Log everything
[Read more](https://neptune.ai/blog/how-to-monitor-your-models-in-production-guide)

# Functional level monitoring
![image.png](_images/6_Models_monitoring/Functional-Monitoring.jpg)

## Data input

### Data quality issues
The integrity of the input data has changed. To validate data integrity before feeding the data to the model, check metrics related to data properties and data types.

__Causes:__
- A break in the data preprocessing pipeline
- A change in the data source
- Data lost at the source

__Detection techniques:__
- Testing input data for duplicates,
- Testing input data for missing values,
- Catching syntax errors,
- Catching data type and format errors,
- Checking the data source of any feature flagged with an issue.

__Possible solutions after detecting data quality issues:__
- Create an alert when the data source changes
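
The detection techniques above can be sketched as a small batch check. This is a minimal illustration using only the standard library; the schema, the column names, and the `data_quality_report` function are assumptions made for the example, and a real pipeline would more likely use a dataframe or a validation library.

```python
from typing import Any, Dict, List

# Hypothetical schema: the Python type expected for each input column.
EXPECTED_SCHEMA = {"age": int, "income": float, "city": str}


def data_quality_report(rows: List[Dict[str, Any]],
                        schema: Dict[str, type]) -> Dict[str, int]:
    """Count duplicate rows, missing values, and type errors in a batch."""
    seen = set()
    report = {"duplicates": 0, "missing": 0, "type_errors": 0}
    for row in rows:
        # A row is a duplicate if all schema columns match an earlier row.
        key = tuple(row.get(col) for col in schema)
        if key in seen:
            report["duplicates"] += 1
        seen.add(key)
        for col, expected_type in schema.items():
            value = row.get(col)
            if value is None:
                report["missing"] += 1
            elif not isinstance(value, expected_type):
                report["type_errors"] += 1
    return report


batch = [
    {"age": 31, "income": 52000.0, "city": "Hanoi"},
    {"age": 31, "income": 52000.0, "city": "Hanoi"},      # duplicate row
    {"age": None, "income": 41000.0, "city": "Hue"},      # missing value
    {"age": "45", "income": 38000.0, "city": "Da Nang"},  # wrong data type
]
report = data_quality_report(batch, EXPECTED_SCHEMA)
print(report)
```

Each counter can then be compared against a threshold to decide whether the batch is passed on to the model or triggers an alert.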

### Data/feature drift
A change in the distribution/histogram between the training data and the production data, examined at the feature/variable level.

__Causes:__
- Data quality issues
- Changes in data properties in the real world

__Detection techniques:__
- Testing __summary statistics__ of input features: mean, standard deviation, median, variance, range, ...
- For __continuous features__: use divergence and distance measures to compare distributions: KL divergence, KS statistic, Population Stability Index (PSI), Hellinger distance, ...
- For __categorical features__: use the chi-square test, entropy, cardinality (number of distinct values), mode, ...
- Box plots

_(If there are many features, dimensionality reduction techniques such as PCA can be applied before testing.)_
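
As a rough sketch of two of the tests listed above, the example below applies the two-sample KS test to a continuous feature and the chi-square test to category counts, using `scipy.stats`. The data is synthetic, and the feature names, sample sizes, and the 0.05 significance level are assumptions chosen for illustration.

```python
import numpy as np
from scipy.stats import chi2_contingency, ks_2samp

rng = np.random.default_rng(0)

# Continuous feature: synthetic training sample vs. a shifted production sample.
train_income = rng.normal(loc=50.0, scale=10.0, size=2000)
prod_income = rng.normal(loc=58.0, scale=10.0, size=2000)
ks_result = ks_2samp(train_income, prod_income)

# Categorical feature: one row of category counts per dataset.
train_counts = [500, 300, 200]
prod_counts = [350, 300, 350]
chi2, chi2_p, dof, _ = chi2_contingency([train_counts, prod_counts])

ALPHA = 0.05  # assumed significance level for flagging drift
print(f"KS statistic={ks_result.statistic:.3f}, drift={ks_result.pvalue < ALPHA}")
print(f"chi2={chi2:.1f}, dof={dof}, drift={chi2_p < ALPHA}")
```

With large production samples even tiny shifts become statistically significant, so in practice the test statistic itself is often thresholded alongside (or instead of) the p-value.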

__Possible solutions after detecting data drift:__
- Create an alert and send a notification when the detected data drift exceeds the threshold
- Periodically retrain the model on newly collected data

#### PSI
__Rules__:
- `PSI` < 0.1: no significant change. You can continue using the existing model.
- 0.1 <= `PSI` < 0.2: slight change. Some adjustment may be required.
- `PSI` >= 0.2: significant change. Ideally, you should not use this model any more.
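
The thresholds above are applied to a single number per feature. For reference, a common way to write PSI over $B$ bins is shown here; the symbols $e_i$, $a_i$, and $B$ are notation introduced for this note, not taken from the original.

$$
\mathrm{PSI} = \sum_{i=1}^{B} (e_i - a_i)\,\ln\frac{e_i}{a_i}
$$

Here $e_i$ is the fraction of expected (training) observations falling in bin $i$ and $a_i$ is the fraction of actual (production) observations in the same bin. Each term is non-negative, so PSI is zero only when the two binned distributions match.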

In [None]:
import numpy as np

def _psi(expected: np.ndarray, actual: np.ndarray, bucket_type: str = "bins", n_bins: int = 10) -> float:
    """Calculate PSI metric for two arrays.

    Parameters
    ----------
        expected : list-like
            Array of expected values
        actual : list-like
            Array of actual values
        bucket_type : str
            Binning strategy. Accepts two options: 'bins' and 'quantiles'. Defaults to 'bins'.
            'bins': input arrays are split into bins with equal
                and fixed steps based on 'expected' array
            'quantiles': input arrays are binned according to 'expected' array
                with given number of n_bins
        n_bins : int
            Number of buckets for binning. Defaults to 10.

    Returns
    -------
        A single float number
    """
    breakpoints = np.arange(0, n_bins + 1) / (n_bins) * 100
    if bucket_type == "bins":
        breakpoints = np.histogram(expected, n_bins)[1]
    elif bucket_type == "quantiles":
        breakpoints = np.percentile(expected, breakpoints)

    # Calculate frequencies
    expected_percents = np.histogram(expected, breakpoints)[0] / len(expected)
    actual_percents = np.histogram(actual, breakpoints)[0] / len(actual)
    # Clip frequencies to avoid zero division
    expected_percents = np.clip(expected_percents, a_min=0.0001, a_max=None)
    actual_percents = np.clip(actual_percents, a_min=0.0001, a_max=None)
    # Calculate PSI
    psi_value = (expected_percents - actual_percents) * np.log(expected_percents / actual_percents)
    psi_value = sum(psi_value)

    return psi_value


def calculate_psi(
        expected: np.ndarray, actual: np.ndarray, bucket_type: str = "bins", n_bins: int = 10, axis: int = 0
) -> np.ndarray:
    """Apply PSI calculation to 2 1-d or 2-d arrays.

    Parameters
    ----------
    expected : list-like
        Array of expected values
    actual : list-like
        Array of actual values
    bucket_type : str
        Binning strategy. Accepts two options: 'bins' and 'quantiles'. Defaults to 'bins'.
            'bins' - input arrays are split into bins with equal
                and fixed steps based on 'expected' array
            'quantiles' - input arrays are binned according to 'expected' array
                with given number of n_bins
    n_bins : int
        Number of buckets for binning. Defaults to 10.
    axis : int
        For 2-d input, the axis that holds the observations:
        0 - each column is a feature, 1 - each row is a feature.
        Ignored for 1-d input. Defaults to 0.

    Returns
    -------
        np.ndarray
            A 0-d array for 1-d input, otherwise one PSI value per feature.
    """
    expected = np.asarray(expected)
    actual = np.asarray(actual)

    # 1-d input: a single PSI value
    if expected.ndim == 1:
        return np.array(_psi(expected, actual, bucket_type, n_bins))

    # 2-d input: features lie along the axis opposite to the observations
    n_features = expected.shape[1 - axis]
    psi_values = np.empty(n_features)
    for i in range(n_features):
        if axis == 0:
            psi_values[i] = _psi(expected[:, i], actual[:, i], bucket_type, n_bins)
        else:
            psi_values[i] = _psi(expected[i, :], actual[i, :], bucket_type, n_bins)
    # Return only after every feature has been processed
    return psi_values


calculate_psi(feature_train_proba, feature_produ_proba, bucket_type="bins", n_bins=10, axis=0)

### Outliers
Frequent outliers can hurt model performance, or they can signal a new pattern that the training data did not cover.

__Detection techniques:__
- Determine how far production points fall from the training data, and how often this happens

__Possible solutions after detecting outliers:__
- Create a new subset containing the outliers, retrain a new model, and evaluate the difference between the new model and the primary model.
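
One simple way to measure "how far and how often" is a z-score against the training statistics. The sketch below is an assumption about how this could be done for a single numeric feature; the `outlier_rate` function and the 3-standard-deviation threshold are illustrative choices, and other detectors (IQR rules, isolation forests) would fit the same pattern.

```python
import numpy as np


def outlier_rate(train: np.ndarray, production: np.ndarray,
                 z_threshold: float = 3.0) -> float:
    """Fraction of production values lying more than z_threshold training
    standard deviations away from the training mean."""
    mean, std = train.mean(), train.std()
    z_scores = np.abs(production - mean) / std
    return float(np.mean(z_scores > z_threshold))


# Toy data: two of the five production values are far from the training range.
train = np.array([10.0, 11.0, 9.0, 10.5, 9.5, 10.0, 11.5, 8.5])
production = np.array([10.2, 9.8, 25.0, 10.1, -4.0])

rate = outlier_rate(train, production)
print(f"outlier rate: {rate:.0%}")
```

Tracking this rate per batch gives a single number that can be alerted on, and the flagged rows are the natural candidates for the retraining subset mentioned above.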

## Model

### Model drift
A change in the relationship between the target Y and the features X (supervised), or among the features X themselves (unsupervised), to the point where the correlation may disappear entirely. As a result, the model becomes less accurate over time relative to its benchmark/KPIs.

__Cause:__
- Real-world data changes naturally over time, or suddenly during stress events

__Model drift detection__
- Tracking changes in correlation, AUC, and similar measures between X and Y, or among the Xs
- Detecting a decline in predictive performance over time by setting a threshold on a predictive metric
- Detecting label drift (a change in the distribution of the target)
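
The threshold-based detection above can be sketched as a rolling accuracy monitor over the most recent labelled predictions. The `RollingAccuracyMonitor` class, the window size, and the 0.75 threshold are hypothetical values chosen for illustration.

```python
from collections import deque


class RollingAccuracyMonitor:
    """Track accuracy over the last `window` labelled predictions."""

    def __init__(self, window: int, threshold: float):
        self.outcomes = deque(maxlen=window)  # 1 = correct, 0 = wrong
        self.threshold = threshold

    def accuracy(self) -> float:
        return sum(self.outcomes) / len(self.outcomes)

    def update(self, y_true: int, y_pred: int) -> bool:
        """Record one outcome; return True when the window is full and
        accuracy has dropped below the threshold."""
        self.outcomes.append(int(y_true == y_pred))
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.accuracy() < self.threshold


monitor = RollingAccuracyMonitor(window=4, threshold=0.75)
stream = [(1, 1), (0, 0), (1, 1), (0, 0), (1, 0), (0, 1)]  # (y_true, y_pred)
flags = [monitor.update(y_true, y_pred) for y_true, y_pred in stream]
print(flags)
```

Waiting for a full window before flagging avoids alerts on the first few predictions, at the cost of a short detection delay.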

__Possible solutions after detecting model/concept drift__
- If your business objectives and environment change frequently, you may want to consider automating your system to schedule and execute retraining at predefined intervals, more often than a more stable business would need.
- If retraining your models doesn’t improve performance, you may want to consider remodeling or redeveloping models from scratch.
- If you’re working on larger scale projects with a good budget and little trade-off between cost and performance (in terms of how well your model catches up with a very dynamic business climate), you may want to consider __online learning algorithms__ for your project.

### Model configuration and artifacts, version
The model configuration file and artifacts contain all the components that were used to build that model, including:

- Training dataset location and version,
- Test dataset location and version,
- Model version
- Hyperparameters used,
- Default feature values,
- Dependencies and their versions; you want to monitor changes in dependency versions to easily find them for root cause analysis when model failure is caused by dependency changes,
- Environment variables,
- Model type (classification vs regression),
- Model author,
- Target variable name,
- Features to select from the data,
- Code and data for testing scenarios,
- Code for the model and its preprocessing.

Track the configurations for relevance—especially the hyperparameter values used by the model during retraining for any abnormality.
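
One way to track configuration changes between retraining runs is to fingerprint the configuration and list the keys that differ. This is a minimal sketch under the assumption that the configuration is a JSON-serialisable dictionary; the function names, keys, and values below are made up for the example.

```python
import hashlib
import json


def config_fingerprint(config: dict) -> str:
    """Stable SHA-256 hash of a JSON-serialisable configuration."""
    payload = json.dumps(config, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def changed_keys(old: dict, new: dict) -> list:
    """Top-level keys whose values differ between two configurations."""
    return sorted(k for k in old.keys() | new.keys() if old.get(k) != new.get(k))


# Hypothetical configurations logged by two consecutive training runs.
previous = {
    "model_version": "1.3.0",
    "hyperparameters": {"max_depth": 6, "eta": 0.1},
    "target": "repaid",
}
retrained = {
    "model_version": "1.4.0",
    "hyperparameters": {"max_depth": 12, "eta": 0.1},
    "target": "repaid",
}

if config_fingerprint(previous) != config_fingerprint(retrained):
    print("changed:", changed_keys(previous, retrained))
```

Sorting keys before hashing makes the fingerprint independent of dictionary ordering, so only real value changes produce a new hash.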

### Protecting the model from attacks

Monitor your system for adversarial attacks by using the same steps you use to flag inputs with outlier events, because adversarial threats don’t follow a pattern; they’re atypical events.

## Predictions (Output)

Monitoring model output in production is not just the best indicator of model performance, but it also tells us if business KPIs are being met. In terms of model predictions, the most important thing to monitor is model performance in line with business metrics.

### Model evaluation metrics

(Scoring models when ground truth is available)

Using metrics to evaluate model performance is a big part of monitoring your model in production. Different metrics apply depending on the task: classification, regression, clustering, reinforcement learning, and so on.

You typically evaluate the model using predefined model scoring metrics (accuracy, AUC, precision, etc.) when you have a ground truth/label to compare your model’s predictions with.

![image.png](_images/6_Models_monitoring/avai_y.png)

At `1`, a part of the production data (input data) is channeled to the ground truth service which typically involves real-time ground truth generated by your system (for example, logging if a user clicked on an ad when the model predicted they would), a human label annotator, or other data labeling vendors for more complicated tasks (such as confirming if a customer repaid a loan at the stipulated time, or confirming if a transaction was fraudulent or legitimate after contacting a customer).

The event id that tracks prediction and model details is tagged with that ground truth event and logged to a data store. The data is then ingested into the monitoring platform, which computes the model performance metric given the model’s prediction and the actual label.

- As you probably already know, metrics for a classification model include:
    - Accuracy
    - Confusion Matrix,
    - ROC-AUC Score,
    - Precision and Recall Scores,
    - F1-Score.

- Metrics for a regression model include:
    - Root Mean Square Error (RMSE),
    - R-Squared and Adjusted R-Square Metrics,
    - Mean Absolute Error (MAE),
    - Mean Absolute Percentage Error (MAPE).

Calculating the model metrics above is only possible when you have the ground truth available.
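
The flow described above, where predictions are matched to ground truth by event id before metrics are computed, can be sketched as follows. The event ids, the dictionary layout, and the `classification_metrics` function are assumptions for illustration; only events that already have a label are scored.

```python
from typing import Dict


def classification_metrics(predictions: Dict[str, int],
                           ground_truth: Dict[str, int]) -> Dict[str, float]:
    """Accuracy, precision, and recall over events that have both a
    prediction and a ground truth label (binary labels, 1 = positive)."""
    matched = [(predictions[e], ground_truth[e])
               for e in predictions if e in ground_truth]
    tp = sum(1 for p, y in matched if p == 1 and y == 1)
    fp = sum(1 for p, y in matched if p == 1 and y == 0)
    fn = sum(1 for p, y in matched if p == 0 and y == 1)
    correct = sum(1 for p, y in matched if p == y)
    nan = float("nan")
    return {
        "n_labelled": len(matched),
        "accuracy": correct / len(matched) if matched else nan,
        "precision": tp / (tp + fp) if tp + fp else nan,
        "recall": tp / (tp + fn) if tp + fn else nan,
    }


# Hypothetical logged predictions and the labels that have arrived so far.
predictions = {"evt-1": 1, "evt-2": 0, "evt-3": 1, "evt-4": 1, "evt-5": 0}
ground_truth = {"evt-1": 1, "evt-2": 0, "evt-3": 0, "evt-4": 1}  # evt-5 unlabelled

metrics = classification_metrics(predictions, ground_truth)
print(metrics)
```

Reporting `n_labelled` next to the scores matters because labels often arrive late, and a metric computed on very few matched events is not reliable.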

### Prediction Drift

(Scoring models when ground truth is NOT available)

![image.png](_images/6_Models_monitoring/not_avai_y.png)

- Metrics:
    - Hellinger Distance (HDDDM)
    - Kullback-Leibler Divergence: measures the difference between two discrete distributions
    - Population Stability Index (PSI): measures the difference between two continuous distributions (computed after binning)
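
The Hellinger distance from the list above can be computed directly with NumPy. This is a minimal sketch of the plain distance between two binned score distributions, not the full HDDDM drift detection procedure; the helper names, the 10 bins, and the synthetic beta-distributed scores are assumptions for the example.

```python
import numpy as np


def hellinger_distance(p, q) -> float:
    """Hellinger distance between two discrete distributions
    (0 = identical, 1 = no overlap). Inputs may be raw bin counts."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()  # normalise counts to probabilities
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))


def score_histogram(scores, n_bins: int = 10) -> np.ndarray:
    """Bin predicted probabilities into equal-width bins on [0, 1]."""
    counts, _ = np.histogram(scores, bins=n_bins, range=(0.0, 1.0))
    return counts


rng = np.random.default_rng(42)
train_scores = rng.beta(2.0, 5.0, size=5000)  # synthetic training-time scores
prod_scores = rng.beta(5.0, 2.0, size=5000)   # synthetic, clearly shifted scores

drift = hellinger_distance(score_histogram(train_scores),
                           score_histogram(prod_scores))
print(f"Hellinger distance: {drift:.3f}")
```

Unlike KL divergence, this distance is symmetric and bounded, which makes a fixed alert threshold easier to reason about.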


In [None]:
# PSI
calculate_psi(y_train_proba, y_produ_proba, bucket_type="bins", n_bins=10, axis=0)

In [1]:
# Kullback
from scipy.special import rel_entr
P = [.05, .1, .2, .05, .15, .25, .08, .12]
Q = [.3, .1, .2, .1, .1, .02, .08, .1]
#calculate (P || Q)
sum(rel_entr(P, Q))

0.589885181619163

# Operational level monitoring
![image.jpg](_images/6_Models_monitoring/Operational-Monitoring.jpg)

## System performance and reliability

The system/application performance metrics to monitor that will give you an idea of model performance include:

- CPU/GPU utilization when the model is computing predictions on incoming data from each API call; tells you how much your model is consuming per request.
- Memory utilization for when the model caches data or input data is cached in memory for faster I/O performance.
- Number of failed requests by an event/operation.
- Total number of API calls.
- Response time of the model server or prediction service.
- System reliability: infrastructure and network uptime,...
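
Several of the metrics above (total calls, failed requests, response time) can be collected with a thin wrapper around the prediction function. The decorator below is a minimal in-process sketch; the `monitored` decorator, the `stats` dictionary, and the stand-in `predict` function are hypothetical, and a production service would normally export these numbers to a metrics system instead.

```python
import time
from functools import wraps

# In-memory counters; illustrative only.
stats = {"calls": 0, "failed": 0, "latencies_ms": []}


def monitored(fn):
    """Record call count, failures, and response time of a prediction function."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        stats["calls"] += 1
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        except Exception:
            stats["failed"] += 1
            raise
        finally:
            # Latency is recorded for both successful and failed requests.
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            stats["latencies_ms"].append(elapsed_ms)
    return wrapper


@monitored
def predict(features):
    # Stand-in for a real model call.
    if not features:
        raise ValueError("empty feature vector")
    return sum(features) / len(features)


for payload in ([0.2, 0.4], [0.9], []):
    try:
        predict(payload)
    except ValueError:
        pass  # the failure has already been counted by the wrapper

print(stats["calls"], stats["failed"], f"{max(stats['latencies_ms']):.3f} ms")
```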



## Pipelines

Monitor the health of your data and model pipelines. Unhealthy data pipelines can affect data quality, and leakages or unexpected changes in your model pipeline can easily generate negative value.



### Data pipelines

Monitoring the health of data pipelines is crucial because data quality issues can arise from bad or unhealthy data pipelines. This can be especially tricky for your IT Ops/DevOps team to monitor, and may require empowering your data engineering/DataOps team to monitor and troubleshoot issues.

It also has to be a shared responsibility. Work with your DataOps team, communicate what your model expects, and the team will tell you what the output of their data pipeline is—this can help you tighten up your system and drive positive results.

If you’re charged with the responsibility of monitoring your data pipeline, here are some metrics and factors you may want to track:

- __Input data__ – are the data and files in the pipeline with the appropriate structure, schema, and completeness? Are there data validation tests and checks in place so that the team can be alerted in case of an oddity in ingested data? Monitor what comes into the data pipeline to keep it healthy.
- __Intermediate workflow steps__ – are the inputs and outputs of every task and flow in the DAG as expected, in terms of the number of files and file types? How long does a task take to run in the pipeline? This could be the data preprocessing task, or the validation task, or even the data distribution monitoring task.
- __Output data__ – is the output data schema as expected by the machine learning model in terms of features and feature embeddings? What’s the typical file size expected from an output file?
- __Data quality metrics__ – tracking the statistical metrics according to the data that flows in. This could be basic statistical properties of the data such as mean, standard deviation, correlation, and so on, or distance metrics (such as KL divergence, Kolmogorov-Smirnov statistic). The statistical metric used will be mostly dependent on the dimension of data expected; a couple of features or several features.
- __Scheduled run time__ of a job, actual run time, how long it took to run, and the state of the job (successful, or failed job?).



### Model pipeline

You want to track crucial factors that can cause your model to break in production after retraining and being redeployed. This includes:

- Dependencies: you don’t want a situation where your model was built with TensorFlow 2.0 and a recent dependency update by someone else on your team that’s bundled with TensorFlow 2.4 causes part of your retraining script to fail. Validate the versions of each dependency your model runs on and log that as your pipeline metadata, so dependency updates that cause failure can be easier to debug.
- The actual time a retraining job was triggered, how long it took the retraining job to run, the resource usage of the job, and the state of the job (successfully retrained and redeployed model, or failed?).
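
Dependency validation can be sketched with the standard library's `importlib.metadata`, which reports the installed version of a package. The helper functions and the pinned versions in `trained_with` are hypothetical; the printed result depends on what is installed in the environment where this runs.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def version_mismatches(expected, installed):
    """Return {package: (expected, installed)} for every version that differs."""
    return {
        name: (want, installed.get(name))
        for name, want in expected.items()
        if installed.get(name) != want
    }


# Hypothetical versions recorded in the pipeline metadata at training time.
trained_with = {"tensorflow": "2.0.0", "numpy": "1.18.5"}

current = installed_versions(trained_with)
print(version_mismatches(trained_with, current))
```

Logging the output of `installed_versions` with every retraining run gives a record to compare against when a retraining job starts failing after a dependency update.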



## Cost

You need to keep an eye out for how much it’s costing you and your organization to host your entire machine learning application, including data storage and compute costs, retraining, or other types of orchestrated jobs. These costs can add up fast, especially if they’re not being tracked. Also, it takes computational power for your models to make predictions for every request, so you also need to track inference costs.