# Monitoring time series datasets on Verta

Verta provides an extensible [model monitoring framework](https://docs.verta.ai/verta/monitoring) that lets the user fully define and configure what data to monitor and how to monitor it.

This notebook shows an example of how Verta model monitoring can be used with time series data and a Prophet model.

## 0. Imports

In [None]:
from __future__ import print_function

import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.filterwarnings("ignore", category=ConvergenceWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

import itertools
import os
import time
from datetime import date, datetime, timedelta, timezone

import six

import numpy as np
import pandas as pd

import sklearn
from sklearn import metrics
from fbprophet import Prophet

### 0.1 Verta import and setup

In [None]:
# restart your notebook if prompted on Colab
try:
    import verta
except ImportError:
    !pip install verta

In [None]:
# import os
# os.environ['VERTA_EMAIL'] = 
# os.environ['VERTA_DEV_KEY'] = 
# os.environ['VERTA_HOST'] = 

from verta import Client
client = Client(os.environ['VERTA_HOST'])

## 1. Fetch data

In [None]:
import wget
import pandas as pd

def download_data_if_missing(filename):
    url = "https://verta-starter.s3.amazonaws.com/{}".format(filename)
    local_data_file = "../data/{}".format(filename)
    if not os.path.isfile(local_data_file):
        wget.download(url, out=local_data_file)

filename = "clean_manning_regressors.csv"
download_data_if_missing(filename)
df = pd.read_csv("../data/{}".format(filename))
df.head()

In [None]:
# define training set
df_train = df[(df['ds'] <= '2011-12-31')]
df_groundtruth = df

## Define Prophet model

In [None]:
m_train = Prophet()
m_train.add_regressor('open')
m_train.add_regressor('close')
m_train.add_regressor('high')
m_train.add_regressor('low')
m_train.add_regressor('volume')
m_train.add_regressor('adj_close')
m_train.fit(df_train)

### (for simulation) build full future forecast

In [None]:
full_future = df[(df['ds'] > '2011-12-31') & (df['ds'] <= '2015-12-31')]

In [None]:
# drop() returns a new DataFrame, so assign the result back
full_future = full_future.drop(columns=["y"])
full_forecast = m_train.predict(full_future)

In [None]:
def df_between_start_end(start_str, end_str, df):
    return df[(df['ds'] >= start_str) & (df['ds'] < end_str)]

def get_forecast(start_str, end_str, forecast):
    return df_between_start_end(start_str, end_str, forecast)

def get_groundtruth(start_str, end_str, groundtruth):
    return df_between_start_end(start_str, end_str, groundtruth)

In [None]:
import sklearn.metrics
def compute_mse(start, end, forecast, groundtruth):
    # compute accuracy of forecast by using the true data
    predicted = get_forecast(start, end, forecast)
    actual = get_groundtruth(start, end, groundtruth)
    # scikit-learn's convention is (y_true, y_pred)
    return sklearn.metrics.mean_squared_error(actual["y"], predicted["yhat"])
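
As a sanity check on what `compute_mse` reports, the following standard-library sketch computes the mean squared error over a half-open date window by hand. The helper name `window_mse` is hypothetical, the values are invented for illustration, and the sketch assumes forecast and ground truth are keyed by the same ISO-formatted dates.

```python
def window_mse(start_str, end_str, forecast, groundtruth):
    """Mean squared error over dates in the half-open window [start_str, end_str)."""
    # ISO-8601 date strings sort chronologically, so string comparison is enough.
    dates = [d for d in forecast if start_str <= d < end_str and d in groundtruth]
    if not dates:
        raise ValueError("no overlapping dates in the requested window")
    squared_errors = [(forecast[d] - groundtruth[d]) ** 2 for d in dates]
    return sum(squared_errors) / len(squared_errors)

# Illustrative values only (not taken from the dataset).
forecast = {"2012-01-01": 8.0, "2012-01-02": 9.0, "2012-02-01": 7.0}
groundtruth = {"2012-01-01": 7.0, "2012-01-02": 11.0, "2012-02-01": 7.5}

january_mse = window_mse("2012-01-01", "2012-02-01", forecast, groundtruth)
print(january_mse)  # (1.0 + 4.0) / 2 = 2.5
```

Because the end of the window is exclusive, the 2012-02-01 row is left out of the January result and would be counted in the February window instead.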

## 2. Define monitored entities

In Verta Model Monitoring, a monitored entity (ME) encapsulates the thing being monitored (for example, a model or a pipeline) and acts as the context within which data summaries are produced and analyzed.

In [None]:
me = client.monitoring.get_or_create_monitored_entity("time-series-regression")

### 2.1 Define data summaries and summary samples

For a given ME, we choose which aspects of the data to monitor. For a model, these might be its inputs and outputs; for a dataset, they might be the values in each column.

So the next step is to define the data _summaries_ we wish to capture.

## Summaries and Summary Samples

Users who are monitoring a deployed model, a data pipeline, or a model training process may be interested in many kinds of summary data including any number of summary statistics or data distribution summaries such as histograms. Within Verta's tools for data monitoring, a _summary_ defines a class of data which the user is interested in, for example a mean squared error or a histogram of data table column values. A _summary sample_ is an instance of that summary, which might be logged from a training epoch or a batch of inputs and outputs for a deployed model.
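
The relationship between the two can be pictured with a small plain-Python model. This is only a conceptual sketch: the `Summary` and `SummarySample` classes below are hypothetical stand-ins, not Verta's classes, and the values are invented.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class SummarySample:
    """One logged instance of a summary, covering a time window."""
    content: float
    labels: dict
    time_window_start: datetime
    time_window_end: datetime

@dataclass
class Summary:
    """A named class of data to monitor; collects samples over time."""
    name: str
    samples: list = field(default_factory=list)

    def log_sample(self, content, labels, start, end):
        self.samples.append(SummarySample(content, labels, start, end))

    def find_samples(self, labels):
        # Keep samples whose labels include every requested key/value pair.
        return [s for s in self.samples
                if all(s.labels.get(k) == v for k, v in labels.items())]

mse = Summary("mse_summary")
jan = datetime(2015, 1, 1, tzinfo=timezone.utc)
feb = datetime(2015, 2, 1, tzinfo=timezone.utc)
mse.log_sample(0.3, {"source": "reference"}, jan, jan)
mse.log_sample(0.5, {"source": "time_series"}, jan, feb)

reference = mse.find_samples({"source": "reference"})
print(len(mse.samples), reference[0].content)  # 2 0.3
```

One summary thus accumulates many samples, and labels such as `source` are what let a reference sample be told apart from live ones later.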

In [None]:
from verta.monitoring.profiler import (
    MissingValuesProfiler,
    BinaryHistogramProfiler,
    ContinuousHistogramProfiler,
)

continuous_profilers = [MissingValuesProfiler, ContinuousHistogramProfiler]
from verta.data_types import (
    DiscreteHistogram,
    FloatHistogram,
    NumericValue,
)

continuous_columns = ["open", "close", "high", "low", "volume", "adj_close"]

def profile(data, mse, labels, start_time, end_time, monitored_entity):        
    for col in continuous_columns:
        summary_name = col + "-Histogram"
        summary = client.monitoring.summaries.get_or_create(summary_name, FloatHistogram, monitored_entity)
        # TODO: add binning
        summary_samples = ContinuousHistogramProfiler(columns=[col]).profile(data)

        for _, histogram in summary_samples.items():  
            summary.log_sample(histogram, labels, start_time, end_time)

    for col in continuous_columns:
        missing_summary_name = col + "-Missing"
        missing_summary = client.monitoring.summaries.get_or_create(missing_summary_name, DiscreteHistogram, monitored_entity)                
        summary_samples = MissingValuesProfiler(columns=[col]).profile(data)

        for _, missing_counts in summary_samples.items():  
            missing_summary.log_sample(missing_counts, labels, start_time, end_time)
            
    mse_summary = client.monitoring.summaries.get_or_create("mse_summary", NumericValue, monitored_entity)
    mse_summary.log_sample(NumericValue(mse), labels, start_time, end_time)

### 2.2 Define alerts

In many ways, monitors and summaries are just a means to our real objective: knowing when unexpected things happen in the system. So next, we define alerts to notify us when something unexpected happens.
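
The notebook uses two kinds of alerter: one that compares a value to a fixed threshold, and one that compares a sample to a reference sample. The sketch below illustrates the idea in plain Python. The function names are hypothetical, and the distance used for histograms (half the L1 distance between normalized bucket counts) is an assumption chosen for illustration; it is not necessarily the measure Verta applies.

```python
def fixed_alert(value, threshold):
    """Fire when a scalar summary (e.g. MSE) exceeds a fixed threshold."""
    return value > threshold

def normalize(counts):
    total = sum(counts)
    if total == 0:
        raise ValueError("histogram has no observations")
    return [c / total for c in counts]

def reference_alert(counts, reference_counts, threshold):
    """Fire when a histogram drifts from a reference by more than threshold.

    Assumed distance: half the L1 distance between the normalized
    bucket counts (0 = identical shape, 1 = no overlap).
    """
    if len(counts) != len(reference_counts):
        raise ValueError("histograms must share the same buckets")
    p, q = normalize(counts), normalize(reference_counts)
    distance = 0.5 * sum(abs(a - b) for a, b in zip(p, q))
    return distance > threshold

reference = [10, 20, 10]  # illustrative bucket counts
similar = [11, 19, 10]
shifted = [0, 10, 30]

print(fixed_alert(2.5, 2.0))                     # True
print(reference_alert(similar, reference, 0.2))  # False
print(reference_alert(shifted, reference, 0.2))  # True
```

A fixed threshold suits a metric with a known acceptable range, while a reference comparison suits distributions whose "normal" shape is only known from training data.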

In [None]:
from verta.monitoring.notification_channel import SlackNotificationChannel
from verta.monitoring.alert import ReferenceAlerter, FixedAlerter
from verta.monitoring.comparison import GreaterThan
from verta.monitoring.summaries.queries import SummaryQuery
from verta.monitoring.summaries.queries import SummarySampleQuery

In [None]:
from datetime import datetime, timedelta, timezone

today = datetime.now(timezone.utc)

In [None]:
channel = None

# supply a Slack notification channel, if available
# channel = client.monitoring.notification_channels.get_or_create(
#     "Demo Monitoring Alerts",
#     SlackNotificationChannel(webhook_url)
# )

def set_alerts(monitored_entity):
    summaries = client.monitoring.summaries.find(SummaryQuery(
            monitored_entities=[monitored_entity.id],
        ))
    for summary in summaries:
        if summary.name == "mse_summary":
            mse_alert = summary.alerts.create(
                "MSE",
                FixedAlerter(GreaterThan(2.0)),
                # notification_channels=[channel], # uncomment if channel is supplied
                labels={"source":"time_series"},
            )
            continue
        threshold = 0.2
        ref_sample = summary.find_samples(SummarySampleQuery(labels={"source":"reference"}))[0]
        alerter = ReferenceAlerter(
            GreaterThan(threshold),
            ref_sample,
        )
        alert = summary.alerts.create(
            summary.name + "- ReferenceDeviation GT {}".format(threshold),
            alerter,
            # notification_channels=[channel], # uncomment if channel is supplied
            starting_from=today-timedelta(hours=30), # pick a suitable time from which the alerter should be enabled
        )

## 3. Incorporate profiling functions into your workflow
A typical data monitoring workflow works as follows: 
1. Log reference summaries (e.g., for training data, at training time)
2. Log live/new summaries (e.g., when a daily job is re-run or when a model makes predictions)

### 3.1 Log reference summaries

In [None]:
profile(df_train, 0.3, {"source" : "reference"}, today - timedelta(hours=30), today - timedelta(hours=30), me)

In [None]:
# note: as defined above, our alerts need a reference sample to work correctly, so alerts must be set after logging
# reference summary samples
set_alerts(me)

### 3.2 Log live/new summaries

Suppose we receive new data and want to check that it still matches the reference data.

In [None]:
# month by month: loop over months
years = [2015]
# years = [2012, 2013, 2014, 2015]

for year in years:
    for month in range(1, 13):
        start = datetime(year, month, 1, tzinfo=timezone.utc)
        end_year, end_month = (year, month + 1) if month < 12 else (year + 1, 1)
        end = datetime(end_year, end_month, 1, tzinfo=timezone.utc)

        start_str = start.date().isoformat()
        end_str = end.date().isoformat()
        print("simulating summary for ({}, {})".format(start_str, end_str))
        mse = compute_mse(start_str, end_str, full_forecast, df_groundtruth)
        
        predicted = get_forecast(start_str, end_str, full_forecast)
        actual = get_groundtruth(start_str, end_str, df_groundtruth)
        profile(actual, mse, {"source":"time_series"}, start, end, me)
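
The month arithmetic in the loop above can be checked in isolation. This is a minimal sketch using only the standard library; the helper name `month_windows` is hypothetical.

```python
from datetime import datetime, timezone

def month_windows(year):
    """Yield (start, end) UTC datetimes for each month; end is exclusive."""
    for month in range(1, 13):
        start = datetime(year, month, 1, tzinfo=timezone.utc)
        # December rolls over to January of the following year.
        end_year, end_month = (year, month + 1) if month < 12 else (year + 1, 1)
        end = datetime(end_year, end_month, 1, tzinfo=timezone.utc)
        yield start, end

windows = list(month_windows(2015))
first_start, first_end = windows[0]
last_start, last_end = windows[-1]
print(first_start.date().isoformat(), first_end.date().isoformat())  # 2015-01-01 2015-02-01
print(last_start.date().isoformat(), last_end.date().isoformat())    # 2015-12-01 2016-01-01
```

Each window ends exactly where the next begins, so with an exclusive end bound every row falls into one window and none is counted twice.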