# Monitoring ML datasets on Verta

Verta provides an extensible [model monitoring framework](https://docs.verta.ai/verta/monitoring) that lets the user fully define and configure what data to monitor and how to monitor it.

This notebook shows how Verta model monitoring can be used to define custom monitors on the datasets used to build ML models.

## 0. Imports

In [None]:
from __future__ import print_function

import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.filterwarnings("ignore", category=ConvergenceWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

import itertools
import os
import time

import six

import numpy as np
import pandas as pd

import sklearn
from sklearn import model_selection
from sklearn import linear_model
from sklearn import metrics

### 0.1 Verta import and setup

In [None]:
# restart your notebook if prompted on Colab
try:
    import verta
except ImportError:
    !pip install verta

In [None]:
# uncomment and fill in these values if they are not already set in your environment
# import os
# os.environ['VERTA_EMAIL'] = 
# os.environ['VERTA_DEV_KEY'] = 
# os.environ['VERTA_HOST'] = 

from verta import Client
client = Client(os.environ['VERTA_HOST'])

## 1. Fetch data

In [None]:
import wget
import pandas as pd

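def download_data_if_missing(filename):
    """Download `filename` from the Verta demo bucket into ../data unless it is already there."""
    url = "https://verta-demo.s3-us-west-2.amazonaws.com/{}".format(filename)
    local_data_file = "../data/{}".format(filename)
    if not os.path.isfile(local_data_file):
        # make sure ../data exists before downloading into it
        os.makedirs(os.path.dirname(local_data_file), exist_ok=True)
        wget.download(url, out=local_data_file)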

def clean_data(df):
    df = df.drop(['id'], axis=1)

    df.loc[df['Gender'] == 'Male', 'Gender'] = 1
    df.loc[df['Gender'] == 'Female', 'Gender'] = 0

    df.loc[df['Vehicle_Age'] == '> 2 Years', 'Vehicle_Age'] = 2
    df.loc[df['Vehicle_Age'] == '1-2 Year', 'Vehicle_Age'] = 1
    df.loc[df['Vehicle_Age'] == '< 1 Year', 'Vehicle_Age'] = 0

    df.loc[df['Vehicle_Damage'] == 'Yes', 'Vehicle_Damage'] = 1
    df.loc[df['Vehicle_Damage'] == 'No', 'Vehicle_Damage'] = 0
    return df

filename = "xselltrain.csv"
download_data_if_missing(filename)
train = pd.read_csv("../data/{}".format(filename))
train = clean_data(train)
train.head()

## 2. Define monitored entities

In Verta Model Monitoring, a monitored entity (ME) encapsulates the thing being monitored, such as a model or a pipeline, and acts as the context within which data summaries are produced and analyzed.

In [None]:
me = client.monitoring.get_or_create_monitored_entity("insurance-cross-sell")

### 2.1 Define data summaries and summary samples

For a specific ME, there are particular aspects of the data that we wish to monitor, e.g., for a model, we may want to monitor the inputs and outputs; for a dataset, we may want to monitor values in each column of the dataset.

So the next step is to define the data _summaries_ we wish to capture.

#### Summaries and summary samples

Users who are monitoring a deployed model, a data pipeline, or a model training process may be interested in many kinds of summary data including any number of summary statistics or data distribution summaries such as histograms. Within Verta's tools for data monitoring, a _summary_ defines a class of data which the user is interested in, for example a mean squared error or a histogram of data table column values. A _summary sample_ is an instance of that summary, which might be logged from a training epoch or a batch of inputs and outputs for a deployed model.
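
To make the distinction concrete, here is a minimal in-memory sketch of the relationship between a summary and its samples. `SketchSummary` and the dictionary layout of each sample are hypothetical stand-ins invented for illustration; they are not Verta classes, and the real client stores samples on the Verta backend.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class SketchSummary:
    """Hypothetical stand-in for a summary: a named class of data with one data type."""
    name: str
    data_type: str
    samples: list = field(default_factory=list)

    def log_sample(self, content, labels, time_window_start, time_window_end):
        # A sample is one instance of the summary, tagged with labels and a time window.
        sample = {
            "content": content,
            "labels": labels,
            "start": time_window_start,
            "end": time_window_end,
        }
        self.samples.append(sample)
        return sample


now = datetime(2021, 1, 1, tzinfo=timezone.utc)  # fixed timestamp for the example
age_histogram = SketchSummary("Age-Histogram", "FloatHistogram")

# One summary, two samples: one from reference data, one from a later batch.
age_histogram.log_sample({"counts": [120, 80]}, {"source": "reference"}, now, now)
age_histogram.log_sample(
    {"counts": [60, 140]},
    {"source": "live"},
    now + timedelta(hours=1),
    now + timedelta(hours=1),
)

# Labels make it possible to pick out specific samples later, e.g. the reference.
reference_samples = [
    s for s in age_histogram.samples if s["labels"].get("source") == "reference"
]
```

The label-based lookup at the end mirrors how the reference sample is retrieved by its `source` label when alerts are defined later in this notebook.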

In [None]:
# Suppose we want to monitor data summaries for a specific set of columns in our data.
# Here is a generic function that defines those summaries and logs samples for them.

continuous_columns = ["Age", "Annual_Premium", "Vintage"]
discrete_columns = ["Gender", "Driving_License", "Previously_Insured", "Vehicle_Damage", "Vehicle_Age"]
all_columns = continuous_columns + discrete_columns

from verta.data_types import (
    DiscreteHistogram,
    FloatHistogram,
    NumericValue,
)

from verta.monitoring.profiler import (
    MissingValuesProfiler,
    BinaryHistogramProfiler,
    ContinuousHistogramProfiler,
)

def profile(data, labels, start_time, end_time, monitored_entity):
    """Log histogram and missing-value summary samples for each monitored column of `data`."""
    for col in continuous_columns:
        summary_name = col + "-Histogram"
        summary = client.monitoring.summaries.get_or_create(summary_name, FloatHistogram, monitored_entity)
        summary_samples = ContinuousHistogramProfiler(columns=[col], bins=[x*3000 for x in range(20)]).profile(data)

        for _, histogram in summary_samples.items():  
            summary.log_sample(histogram, labels, start_time, end_time)
        
    for col in discrete_columns:    
        summary_name = col + "-Histogram"
        summary = client.monitoring.summaries.get_or_create(summary_name, DiscreteHistogram, monitored_entity)
        summary_samples = BinaryHistogramProfiler(columns=[col]).profile(data)

        for _, histogram in summary_samples.items():  
            summary.log_sample(histogram, labels, start_time, end_time)

    for col in all_columns:
        missing_summary_name = col + "-Missing"
        missing_summary = client.monitoring.summaries.get_or_create(missing_summary_name, DiscreteHistogram, monitored_entity)                
        summary_samples = MissingValuesProfiler(columns=[col]).profile(data)

        for _, missing_counts in summary_samples.items():  
            missing_summary.log_sample(missing_counts, labels, start_time, end_time)
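
As a rough picture of what the histogram and missing-value profilers compute, the sketch below counts values per bucket for a fixed list of bucket edges and tallies missing entries. It is a plain-Python illustration, not Verta's implementation: `bucket_counts` and `missing_counts` are hypothetical helpers, missing values are represented as `None` rather than NaN, and the handling of out-of-range values is an assumption.

```python
from bisect import bisect_right


def bucket_counts(values, bucket_limits):
    """Count values per half-open bucket [limit_i, limit_{i+1}).

    Missing (None) and out-of-range values are skipped in this sketch.
    """
    counts = [0] * (len(bucket_limits) - 1)
    for v in values:
        if v is None or v < bucket_limits[0] or v >= bucket_limits[-1]:
            continue
        counts[bisect_right(bucket_limits, v) - 1] += 1
    return counts


def missing_counts(values):
    """Tally present and missing entries."""
    missing = sum(1 for v in values if v is None)
    return {"present": len(values) - missing, "missing": missing}


# Same edges as in profile(): 0, 3000, 6000, ..., 57000
bucket_limits = [x * 3000 for x in range(20)]

premiums = [2630.0, 2630.0, 31500.0, None, 40454.0, 61000.0]  # made-up values
premium_hist = bucket_counts(premiums, bucket_limits)
premium_missing = missing_counts(premiums)

# Values with a much smaller range all land in the first bucket with these edges.
ages = [25, 44, 76]  # made-up values
age_hist = bucket_counts(ages, bucket_limits)
```

With edges spaced 3000 apart, a column such as an age in years falls entirely into the first bucket, so narrower per-column edges may be worth considering when a histogram needs to resolve shifts in such a column.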

### 2.2 Define alerts

In many ways, monitors and summaries are just a means to an end: knowing when unexpected things happen in the system. So next, we define alerts that notify us when something unexpected happens.

In [None]:
from verta.monitoring.notification_channel import SlackNotificationChannel
from verta.monitoring.alert import ReferenceAlerter
from verta.monitoring.comparison import GreaterThan
from verta.monitoring.summaries.queries import SummaryQuery
from verta.monitoring.summaries.queries import SummarySampleQuery

In [None]:
from datetime import datetime, timedelta, timezone

today = datetime.now(timezone.utc)

In [None]:
channel = None

# supply a Slack notification channel, if available
# (webhook_url must be set to your Slack incoming-webhook URL)
# channel = client.monitoring.notification_channels.get_or_create(
#     "Demo Monitoring Alerts",
#     SlackNotificationChannel(webhook_url)
# )

def set_alerts(monitored_entity):
    summaries = client.monitoring.summaries.find(SummaryQuery(
            monitored_entities=[monitored_entity.id],
        ))
    for summary in summaries:
        threshold = 0.2
        ref_sample = summary.find_samples(SummarySampleQuery(labels={"source":"reference"}))[0]
        alerter = ReferenceAlerter(
            GreaterThan(threshold),
            ref_sample,
        )
        alert = summary.alerts.create(
            summary.name + "- ReferenceDeviation GT {}".format(threshold),
            alerter,
            # notification_channels=[channel], # uncomment if channel is supplied
            starting_from=today-timedelta(hours=30), # pick a suitable time from which the alerter should be enabled
        )
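
The idea behind a reference-based alert is to compare each new sample against a stored reference sample and fire when the difference exceeds a threshold. The notebook does not state which distance measure `ReferenceAlerter` applies, so the sketch below assumes, purely for illustration, the total variation distance between normalized histograms. All function names here are hypothetical.

```python
def normalize(counts):
    """Turn raw bucket counts into proportions that sum to 1."""
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]


def total_variation_distance(ref_counts, live_counts):
    """Half the L1 distance between two normalized histograms (0 = identical, 1 = disjoint)."""
    ref, live = normalize(ref_counts), normalize(live_counts)
    return 0.5 * sum(abs(r - l) for r, l in zip(ref, live))


def should_alert(ref_counts, live_counts, threshold=0.2):
    # Mirrors the GreaterThan(threshold) comparison: alert only above the threshold.
    return total_variation_distance(ref_counts, live_counts) > threshold


reference = [50, 30, 20]  # made-up reference histogram
similar = [48, 32, 20]    # close to the reference
shifted = [10, 20, 70]    # mass moved to the last bucket

similar_alert = should_alert(reference, similar)
shifted_alert = should_alert(reference, shifted)
```

Under this assumed measure, the `similar` histogram stays well below the 0.2 threshold while the `shifted` one exceeds it, which is the behavior the alerts defined above are meant to provide for live data.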

## 3. Incorporate profiling functions into your workflow
A typical data monitoring workflow has two steps:
1. Log reference summaries (e.g., for training data, at training time)
2. Log live/new summaries (e.g., when a daily job is re-run or when a model makes predictions)

### 3.1 Log reference summaries

In [None]:
profile(train, {"source" : "reference"}, today - timedelta(hours=30), today - timedelta(hours=30), me)

In [None]:
# note: as defined above, our alerts need a reference sample to work correctly, so alerts must be set after logging
# reference summary samples
set_alerts(me)

### 3.2 Log live/new summaries

Suppose we have a new dataset and want to check that it matches the reference data.

In [None]:
test_filename = "xselltest.csv"
download_data_if_missing(test_filename)
test = pd.read_csv("../data/{}".format(test_filename))
test = clean_data(test)
test.head()

In [None]:
# let's create some interesting subsets of data
import numpy as np

test_low_premium_customers = test[test.Annual_Premium < 5000]
test_high_premium_customers = test[test.Annual_Premium >= 5000]
hp_splits = np.array_split(test_high_premium_customers, 25)
lp_splits = np.array_split(test_low_premium_customers, 5)
test_prev_insured_customers = test[test.Previously_Insured == 1]
pi_splits = np.array_split(test_prev_insured_customers, 10)

In [None]:
# Here's some data that looks like the reference and should not produce alerts
for idx in range(len(hp_splits[:5])):
    profile(
        hp_splits[idx], 
        {"source": "high_premium"}, 
        today-timedelta(hours=30 - idx - 1), 
        today-timedelta(hours=30 - idx - 1),
        me
    )
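
The timestamp arithmetic in the loop above is easier to follow when written out. This sketch recomputes the five logging times with a fixed stand-in for `today` (an assumption made so the example is reproducible) and checks that each falls after the point from which the alerts were enabled.

```python
from datetime import datetime, timedelta, timezone

# Fixed stand-in for datetime.now(timezone.utc), chosen only for reproducibility.
today = datetime(2021, 6, 1, 12, 0, tzinfo=timezone.utc)

# The alerts above were created with starting_from = today - 30 hours.
alerts_enabled_from = today - timedelta(hours=30)

# Same expression as in the loop: one sample per hour, moving toward the present.
high_premium_times = [today - timedelta(hours=30 - idx - 1) for idx in range(5)]

# How many hours before `today` each sample is logged.
hours_before_today = [
    int((today - t).total_seconds() // 3600) for t in high_premium_times
]

all_after_alert_start = all(t >= alerts_enabled_from for t in high_premium_times)
```

Because `start_time` and `end_time` are passed the same value, each sample is logged at a single point in time rather than over an interval, and the samples are spaced one hour apart.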

In [None]:
# Here's some data that does not look like the reference and should produce alerts
for idx in range(len(lp_splits)):
    profile(
        lp_splits[idx], 
        {"source": "low_premium"}, 
        today-timedelta(hours=5 - idx - 1), 
        today-timedelta(hours=5 - idx - 1),
        me
    )

In [None]:
# Here's some data that does not look like the reference and should produce alerts
for idx in range(len(pi_splits[:5])):
    profile(
        pi_splits[idx], 
        {"source": "prev_insured"}, 
        today-timedelta(hours=5 - idx - 1),
        today-timedelta(hours=5 - idx - 1),
        me
    )
