In [1]:
# @title Copyright & License (click to expand)
# Copyright 2021 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Author: Chanchal Chatterjee
# Email: cchatterjee@google.com

# Vertex Model Monitoring



## Overview

### What is Model Monitoring?

Modern applications rely on a well-established set of capabilities to monitor the health of their services. Examples include:

* software versioning
* rigorous deployment processes
* event logging
* alerting/notification of situations requiring intervention
* on-demand and automated diagnostic tracing
* automated performance and functional testing

You should be able to manage your ML services with the same degree of power and flexibility with which you can manage your applications. That's what MLOps is all about - managing ML services with the best practices Google and the broader computing industry have learned from generations of experience deploying well engineered, reliable, and scalable services.

Model monitoring is only one piece of the MLOps puzzle - it helps answer the following questions:

* How well do recent service requests match the training data used to build your model? This is called **training-serving skew**.
* How significantly are service requests evolving over time? This is called **drift detection**.

If production traffic differs from training data, or varies substantially over time, the quality of the answers your model produces is likely to suffer. When that happens, you want to be alerted automatically and promptly, so that **you can anticipate problems before they affect your customer experiences or your revenue streams**.

### Objective

In this notebook, you will learn how to... 

* deploy a pre-trained model
* configure model monitoring
* generate some artificial traffic
* understand how to interpret the statistics, visualizations, and other data reported by the model monitoring feature

### Costs 

This tutorial uses billable components of Google Cloud:

* Vertex AI
* BigQuery

Learn about [Vertex AI
pricing](https://cloud.google.com/vertex-ai/pricing) and [Cloud Storage
pricing](https://cloud.google.com/storage/pricing), and use the [Pricing
Calculator](https://cloud.google.com/products/calculator/)
to generate a cost estimate based on your projected usage.

### The example model

The model you'll use in this notebook is based on [this blog post](https://cloud.google.com/blog/topics/developers-practitioners/churn-prediction-game-developers-using-google-analytics-4-ga4-and-bigquery-ml). The idea behind this model is that your company has extensive log data describing how your game users have interacted with the site. The raw data contains the following categories of information:

- identity - unique player identity numbers
- demographic features - information about the player, such as the geographic region in which a player is located
- behavioral features - counts of the number of times a player has triggered certain game events, such as reaching a new level
- churn propensity - this is the label or target feature; it provides an estimated probability that this player will churn, i.e. stop being an active player.

The blog article referenced above explains how to use BigQuery to store the raw data, pre-process it for use in machine learning, and train a model. Because this notebook focuses on model monitoring, rather than training models, you're going to reuse a pre-trained version of this model, which has been exported to Google Cloud Storage. In the next section, you will set up your environment and import this model into your own project.

## Before you begin

### Set up your dependencies

In [33]:
import os
import sys

assert sys.version_info.major == 3, "This notebook requires Python 3."

# Google Cloud Notebook requires dependencies to be installed with '--user'
IS_GOOGLE_CLOUD_NOTEBOOK = os.path.exists("/opt/deeplearning/metadata/env_version")
USER_FLAG = ""
if IS_GOOGLE_CLOUD_NOTEBOOK:
    USER_FLAG = "--user"

if 'google.colab' in sys.modules: 
    from google.colab import auth
    auth.authenticate_user()

# Install Python package dependencies.
! pip3 install {USER_FLAG} --quiet --upgrade google-api-python-client google-auth-oauthlib \
                                             google-auth-httplib2 oauth2client requests \
                                             google-cloud-aiplatform google-cloud-storage==1.32.0

if not os.getenv("IS_TESTING"): 
    # Automatically restart kernel after installs
    import IPython
    app = IPython.Application.instance() 
    app.kernel.do_shutdown(True)


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tfx 0.21.5 requires absl-py<0.9,>=0.1.6, but you have absl-py 0.11.0 which is incompatible.
tfx 0.21.5 requires apache-beam[gcp]<2.18,>=2.17, but you have apache-beam 2.30.0 which is incompatible.
tfx 0.21.5 requires docker<5,>=4.1, but you have docker 5.0.0 which is incompatible.
tfx 0.21.5 requires google-api-python-client<2,>=1.7.8, but you have google-api-python-client 2.13.0 which is incompatible.
tfx 0.21.5 requires kubernetes<11,>=10.0.1, but you have kubernetes 12.0.1 which is incompatible.
tfx 0.21.5 requires pyarrow<0.16,>=0.15, but you have pyarrow 2.0.0 which is incompatible.
tfx 0.21.5 requires tensorflow-data-validation<0.22,>=0.21.4, but you have tensorflow-data-validation 1.1.0 which is incompatible.
tfx 0.21.5 requires tfx-bsl<0.22,>=0.21.3, but you have tfx-bsl 1.1.0 which is incompatible.
t

### Set up your Google Cloud project

**The following steps are required, regardless of your notebook environment.**

1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.

1. Enter your project id in the first line of the cell below.

1. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).

1. If you are running this notebook locally, you will need to install the [Cloud SDK](https://cloud.google.com/sdk).

1. You'll use the *gcloud* command throughout this notebook. In the following cell, enter your project ID and run the cell to authenticate with Google Cloud and initialize your *gcloud* configuration settings.

**Model monitoring is currently supported in regions us-central1, europe-west4, asia-east1, and asia-southeast1. To keep things simple for this lab, we're going to use region us-central1 for all our resources (BigQuery training data, Cloud Storage bucket, model and endpoint locations, etc.). You can use any supported region, so long as all resources are co-located.**
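The co-location requirement above can be checked before you create any resources. The following is a minimal sketch, not part of the original notebook: the helper name check_colocated and the resource_regions mapping are hypothetical, and the region list is simply copied from the note above, so it may be out of date.

```python
# Regions listed in the note above (copied from this notebook; may change over time).
SUPPORTED_MONITORING_REGIONS = {
    "us-central1",
    "europe-west4",
    "asia-east1",
    "asia-southeast1",
}


def check_colocated(region, resource_regions):
    """Hypothetical helper: verify that `region` is supported and that every
    resource in `resource_regions` (name -> region) lives in that same region."""
    if region not in SUPPORTED_MONITORING_REGIONS:
        raise ValueError(f"Region {region!r} is not in the supported list.")
    misplaced = {
        name: loc for name, loc in resource_regions.items() if loc != region
    }
    if misplaced:
        raise ValueError(f"Resources outside {region!r}: {misplaced}")
    return True


# Example: all resources in us-central1, as in this lab.
check_colocated(
    "us-central1",
    {"bucket": "us-central1", "endpoint": "us-central1", "training_data": "us-central1"},
)
```

Running a check like this next to your REGION setting makes a mismatch visible early, rather than when the monitoring job fails to start.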

In [1]:
# Automatically restart kernel after installs
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}

In [1]:
# Import globally needed dependencies here, after kernel restart.
import copy
import numpy as np
import os
import random
import sys
import time

PROJECT_ID = "cchatterjee-sandbox"  # @param {type:"string"}
REGION = "us-central1"  # @param {type:"string"}
SUFFIX = "aiplatform.googleapis.com"
API_ENDPOINT = f"{REGION}-{SUFFIX}"
PREDICT_API_ENDPOINT = f"{REGION}-prediction-{SUFFIX}"
if os.getenv("IS_TESTING"):
    !gcloud --quiet components install beta
    !gcloud --quiet components update
!gcloud config set project $PROJECT_ID
!gcloud config set ai/region $REGION


Updated property [core/project].
Updated property [ai/region].


### Log in to your Google Cloud account and enable AI services

In [2]:
# The Google Cloud Notebook product has specific requirements
IS_GOOGLE_CLOUD_NOTEBOOK = os.path.exists("/opt/deeplearning/metadata/env_version")

# If on Google Cloud Notebooks, then don't execute this code
if not IS_GOOGLE_CLOUD_NOTEBOOK:
    if "google.colab" in sys.modules:
        from google.colab import auth as google_auth

        google_auth.authenticate_user()

    # If you are running this notebook locally, replace the string below with the
    # path to your service account key and run this cell to authenticate your GCP
    # account.
    elif not os.getenv("IS_TESTING"):
        %env GOOGLE_APPLICATION_CREDENTIALS ''

# Execute this line on cloud shell
!gcloud services enable aiplatform.googleapis.com


### Define utilities

Run the following cells to define some utility functions and distributions used later in this notebook. Although these utilities are not critical to understand the main concepts, feel free to expand the cells
in this section if you're curious or want to dive deeper into how some of your API requests are made.

In [3]:
# @title Utility imports and constants
from google.protobuf.struct_pb2 import Value
from google.protobuf.duration_pb2 import Duration
from google.cloud.aiplatform_v1beta1.services.job_service import JobServiceClient

from google.cloud.aiplatform_v1beta1.types.model_monitoring import ThresholdConfig
from google.cloud.aiplatform_v1beta1.types.model_monitoring import SamplingStrategy
from google.cloud.aiplatform_v1beta1.types.model_monitoring import ModelMonitoringAlertConfig
from google.cloud.aiplatform_v1beta1.types.model_monitoring import ModelMonitoringObjectiveConfig
from google.cloud.aiplatform_v1beta1.types.model_deployment_monitoring_job import ModelDeploymentMonitoringJob
from google.cloud.aiplatform_v1beta1.types.model_deployment_monitoring_job import ModelDeploymentMonitoringObjectiveConfig
from google.cloud.aiplatform_v1beta1.types.model_deployment_monitoring_job import ModelDeploymentMonitoringScheduleConfig

from google.cloud.aiplatform_v1beta1.services.endpoint_service import EndpointServiceClient
from google.cloud.aiplatform_v1beta1.services.prediction_service import PredictionServiceClient
from google.cloud.aiplatform_v1beta1.types.io import BigQuerySource
from google.cloud.aiplatform_v1beta1.types.io import GcsSource
from google.cloud.aiplatform_v1beta1.types.prediction_service import PredictRequest
from google.protobuf import json_format

# This is the default value at which you would like the monitoring function to trigger an alert.
# In other words, this value fine tunes the alerting sensitivity. This threshold can be customized
# on a per feature basis but this is the global default setting.
DEFAULT_THRESHOLD_VALUE = 1


In [4]:
# @title Utility functions

def create_monitoring_job(objective_configs):
    # Create sampling configuration.
    random_sampling = SamplingStrategy.RandomSampleConfig(sample_rate=LOG_SAMPLE_RATE)
    sampling_config = SamplingStrategy(random_sample_config=random_sampling)

    # Create schedule configuration.
    duration = Duration(seconds=MONITOR_INTERVAL_IN_SECONDS)
    schedule_config = ModelDeploymentMonitoringScheduleConfig(monitor_interval=duration)

    # Create alerting configuration.
    emails = [USER_EMAIL]
    email_config = ModelMonitoringAlertConfig.EmailAlertConfig(user_emails=emails)
    alerting_config = ModelMonitoringAlertConfig(email_alert_config=email_config)

    # Create the monitoring job.
    endpoint = f"projects/{PROJECT_ID}/locations/{REGION}/endpoints/{ENDPOINT_ID}"
    predict_schema = ""
    analysis_schema = ""
    job = ModelDeploymentMonitoringJob(
        display_name=JOB_NAME,
        endpoint=endpoint,
        model_deployment_monitoring_objective_configs=objective_configs,
        logging_sampling_strategy=sampling_config,
        model_deployment_monitoring_schedule_config=schedule_config,
        model_monitoring_alert_config=alerting_config,
        predict_instance_schema_uri=predict_schema,
        analysis_instance_schema_uri=analysis_schema,
    )
    options = dict(api_endpoint=API_ENDPOINT)
    client = JobServiceClient(client_options=options)
    parent = f"projects/{PROJECT_ID}/locations/{REGION}"
    response = client.create_model_deployment_monitoring_job(
        parent=parent, model_deployment_monitoring_job=job
    )
    print("Created monitoring job:")
    print(response)
    return response


def get_thresholds(default_thresholds, custom_thresholds):
    """Build a {feature: ThresholdConfig} map from comma-separated strings."""
    thresholds = {}
    default_threshold = ThresholdConfig(value=DEFAULT_THRESHOLD_VALUE)
    for feature in default_thresholds.split(","):
        feature = feature.strip()
        if feature:
            thresholds[feature] = default_threshold
    for custom_threshold in custom_thresholds.split(","):
        if not custom_threshold.strip():
            continue  # Allow an empty custom-threshold string.
        pair = custom_threshold.split(":")
        if len(pair) != 2:
            print(f"Invalid custom skew threshold: {custom_threshold}")
            return
        feature, value = pair
        # Strip whitespace so " ltv:1" overrides the default entry for "ltv".
        thresholds[feature.strip()] = ThresholdConfig(value=float(value))
    return thresholds


def get_deployed_model_ids(endpoint_id):
    client_options = dict(api_endpoint=API_ENDPOINT)
    client = EndpointServiceClient(client_options=client_options)
    parent = f"projects/{PROJECT_ID}/locations/{REGION}"
    response = client.get_endpoint(name=f"{parent}/endpoints/{endpoint_id}")
    model_ids = []
    for model in response.deployed_models:
        model_ids.append(model.id)
    return model_ids


def set_objectives(model_ids, objective_template):
    # Use the same objective config for all models.
    objective_configs = []
    for model_id in model_ids:
        objective_config = copy.deepcopy(objective_template)
        objective_config.deployed_model_id = model_id
        objective_configs.append(objective_config)
    return objective_configs


def send_predict_request(endpoint, input_list):
    client_options = {"api_endpoint": PREDICT_API_ENDPOINT}
    client = PredictionServiceClient(client_options=client_options)
    params = {}
    params = json_format.ParseDict(params, Value())
    #request = PredictRequest(endpoint=endpoint, parameters=params)
    #inputs = [json_format.ParseDict(input_list, Value())]
    #request.instances.extend(inputs)
    #response = client.predict(request)
    response = client.predict(endpoint=endpoint, instances=input_list, parameters=params)
    return response


def list_monitoring_jobs():
    client_options = dict(api_endpoint=API_ENDPOINT)
    parent = f"projects/{PROJECT_ID}/locations/{REGION}"
    client = JobServiceClient(client_options=client_options)
    response = client.list_model_deployment_monitoring_jobs(parent=parent)
    jobs_list = []
    for job in response:
        jobs_list.append(
            {
                "name": job.name,
                "state": job.state.name
             }
        )
    return(jobs_list)

def pause_monitoring_job(job):
    client_options = dict(api_endpoint=API_ENDPOINT)
    client = JobServiceClient(client_options=client_options)
    response = client.pause_model_deployment_monitoring_job(name=job)
    print(response)


def delete_monitoring_job(job):
    client_options = dict(api_endpoint=API_ENDPOINT)
    client = JobServiceClient(client_options=client_options)
    response = client.delete_model_deployment_monitoring_job(name=job)
    print(response)


In [5]:
# Print current job names and status
jobs_list = list_monitoring_jobs()
njobs = len(jobs_list)
if (njobs == 0):
    print("No monitoring jobs")
else:
    for i in range(len(jobs_list)):
        print(jobs_list[i]['name'])
        print(jobs_list[i]['state'])

No monitoring jobs


In [6]:
# Delete existing monitoring jobs
jobs_list = list_monitoring_jobs()
if not jobs_list:
    print("No jobs to delete.")
for job in jobs_list:
    if job['state'] == "JOB_STATE_PENDING":
        # Pending jobs are skipped.
        print("{} pending cannot be stopped".format(job['name']))
        continue
    if job['state'] == "JOB_STATE_RUNNING":
        # Pause a running job before deleting it.
        pause_monitoring_job(job['name'])
    delete_monitoring_job(job['name'])
    print("Deleted job")

No jobs to delete.


## Import your model

The churn propensity model you'll be using in this notebook has been trained in BigQuery ML and exported to a Google Cloud Storage bucket. This illustrates how you can easily export a trained model and move a model from one cloud service to another. 

Run the next cell to import this model into your project. **If you've already imported your model, you can skip this step.**

In [7]:
# List all deployed models
from google.cloud.aiplatform import gapic as aip
def list_models():
    PARENT = "projects/" + PROJECT_ID + "/locations/" + REGION
    API_ENDPOINT = "{}-aiplatform.googleapis.com".format(REGION)
    client_options = {"api_endpoint": API_ENDPOINT}
    client = aip.ModelServiceClient(client_options=client_options)
    response = client.list_models(parent=PARENT)
    model_list = []
    for model in response:
        model_list.append(
            {
                "name": model.name,
                "display_name": model.display_name,
                "create_time": model.create_time,
                "container":  model.container_spec.image_uri,
                "artifact_uri": model.artifact_uri
            }
        )
    return(model_list)

model_list = list_models()
model_list


[{'name': 'projects/901951554789/locations/us-central1/models/4920050651506933760',
  'display_name': 'image_tensorflow_model',
  'create_time': DatetimeWithNanoseconds(2021, 7, 16, 3, 51, 13, 308910, tzinfo=datetime.timezone.utc),
  'container': 'us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.2-2:latest',
  'artifact_uri': 'gs://vapit_data/tf_models/models/model_07152021_2019/checkpoints/cp-203503-0-0.9740/'},
 {'name': 'projects/901951554789/locations/us-central1/models/6353180495429238784',
  'display_name': 'freddiemacdata_20216247028',
  'create_time': DatetimeWithNanoseconds(2021, 6, 24, 7, 1, 20, 984405, tzinfo=datetime.timezone.utc),
  'container': '',
  'artifact_uri': ''},
 {'name': 'projects/901951554789/locations/us-central1/models/7870893569853095936',
  'display_name': 'my_first_tensorflow_model',
  'create_time': DatetimeWithNanoseconds(2021, 6, 23, 5, 7, 45, 691387, tzinfo=datetime.timezone.utc),
  'container': 'us-docker.pkg.dev/cloud-aiplatform/prediction/tf2-cpu.2-4:

In [8]:
# List all Endpoints
from google.cloud.aiplatform import gapic as aip
def list_endpoints():
    PARENT = "projects/" + PROJECT_ID + "/locations/" + REGION
    API_ENDPOINT = "{}-aiplatform.googleapis.com".format(REGION)
    client_options = {"api_endpoint": API_ENDPOINT}
    client = aip.EndpointServiceClient(client_options=client_options)
    response = client.list_endpoints(parent=PARENT)
    endpoint_list = []
    for endpoint in response:
        model_name = ''
        if (len(endpoint.deployed_models) > 0):
            model_name = endpoint.deployed_models[0].model
        endpoint_list.append(
            {
                "name": endpoint.name,
                "display_name": endpoint.display_name,
                "create_time": endpoint.create_time,
                "deployed_models": model_name
            }
        )
    return(endpoint_list)

endpoint_list = list_endpoints()
endpoint_list


[{'name': 'projects/901951554789/locations/us-central1/endpoints/5417038703354707968',
  'display_name': 'freddimac_deployed',
  'create_time': DatetimeWithNanoseconds(2021, 6, 24, 15, 26, 21, 375987, tzinfo=datetime.timezone.utc),
  'deployed_models': 'projects/901951554789/locations/us-central1/models/6353180495429238784'},
 {'name': 'projects/901951554789/locations/us-central1/endpoints/5035921584888479744',
  'display_name': 'my_first_tensorflow_model_endpoint',
  'create_time': DatetimeWithNanoseconds(2021, 6, 23, 5, 14, 35, 882989, tzinfo=datetime.timezone.utc),
  'deployed_models': 'projects/901951554789/locations/us-central1/models/7870893569853095936'},
 {'name': 'projects/901951554789/locations/us-central1/endpoints/7537108227939368960',
  'display_name': 'my_first_tensorflow_model_endpoint',
  'create_time': DatetimeWithNanoseconds(2021, 6, 21, 16, 20, 50, 482709, tzinfo=datetime.timezone.utc),
  'deployed_models': 'projects/901951554789/locations/us-central1/models/28251731

### If you already have a deployed endpoint

You can reuse your existing endpoint by filling in the value of your endpoint ID in the next cell and running it. **If you've just deployed an endpoint in the previous cell, you should skip this step.**
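The endpoint ID is the last segment of the endpoint's full resource name. As a small illustration, here is a sketch that splits a resource name of the form shown in the listing above into its parts. The function parse_endpoint_name is a hypothetical helper, not part of the Vertex AI client library.

```python
def parse_endpoint_name(name):
    """Hypothetical helper: split 'projects/P/locations/L/endpoints/E'
    into its project, location, and endpoint ID."""
    parts = name.split("/")
    # Even positions hold the fixed keywords, odd positions hold the values.
    if len(parts) != 6 or parts[0::2] != ["projects", "locations", "endpoints"]:
        raise ValueError(f"Unexpected endpoint resource name: {name!r}")
    return {"project": parts[1], "location": parts[3], "endpoint_id": parts[5]}


# Resource name taken from the endpoint listing output above.
info = parse_endpoint_name(
    "projects/901951554789/locations/us-central1/endpoints/5417038703354707968"
)
print(info["endpoint_id"])
```

Validating the shape of the name first guards against accidentally passing a display name or a model resource name where an endpoint is expected.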

In [9]:
# Find Endpoint ID for a given display name
ENDPOINT_NAME = 'freddimac_deployed'
endpoint = ''
endpoint_id = ''
for i in range(len(endpoint_list)):
    if (endpoint_list[i]['display_name'] == ENDPOINT_NAME):
        endpoint = endpoint_list[i]['name']
        endpoint_id = endpoint.split('/')[-1]
        break

if (endpoint != ''):
    print("Endpoint found with name {}".format(ENDPOINT_NAME))
    print("name = {}".format(endpoint))
    print("id = {}".format(endpoint_id))

if (endpoint == ''):
    print("No endpoint found with name {}".format(ENDPOINT_NAME))


Endpoint found with name freddimac_deployed
name = projects/901951554789/locations/us-central1/endpoints/5417038703354707968
id = 5417038703354707968


In [10]:
# @title Run this cell only if you want to reuse an existing endpoint.
if not os.getenv("IS_TESTING"):
    ENDPOINT_ID = endpoint_id  # @param {type:"string"}
    if ENDPOINT_ID:
        ENDPOINT = endpoint
        print(f"Using endpoint {ENDPOINT}")
    else:
        print("If you want to reuse an existing endpoint, you must specify the endpoint id above.")

Using endpoint projects/901951554789/locations/us-central1/endpoints/5417038703354707968


## Get Training and Test data

 -----------------

In [24]:
# Read and prepare the test data
import pandas as pd
print(ENDPOINT)
TEST_FEATURE_PATH = f"gs://tuti_asset/datasets/mortgage_structured_x_test.csv" 
x_test = pd.read_csv(TEST_FEATURE_PATH)
x_test = x_test.fillna(0)
x_test.drop(['Unnamed: 0'], axis=1, inplace=True)
cols = x_test.columns
a = [x for x in range(len(cols))]
for cnt, val in enumerate(cols):
    x1 = val.replace('\"', '')
    x2 = x1.replace(',', '_')
    x3 = x2.replace('&', '_')
    x4 = x3.replace(' ', '_')
    a[cnt] = x4
x_test.columns = a
x_test.shape


projects/901951554789/locations/us-central1/endpoints/5417038703354707968


(10405, 148)

## Run a prediction test

Now that you have imported a model and deployed that model to an endpoint, you are ready to verify that it's working. Run the next cell to send a test prediction request. If everything works as expected, you should receive a response encoded in a text representation called JSON.

**Try this now by running the next cell and examine the results.**

In [25]:
# Create the instances for AutoML Tables model prediction
instances_dict = x_test.iloc[0:100].astype(str).to_dict(orient="index")
instances_list = []
for i in range(len(instances_dict)):
    instances_list.append(instances_dict[i])


In [26]:
# Predict with AutoML Table model
import pprint as pp
print(ENDPOINT)
#print("request:")
#pp.pprint(instances_list)
try:
    resp = send_predict_request(ENDPOINT, instances_list)
    print("Prediction Succeeded")
    #pp.pprint(resp)
except Exception as err:
    print("Prediction Failed: {}".format(err))


projects/901951554789/locations/us-central1/endpoints/5417038703354707968
Prediction Succeeded


## Start your monitoring job

Now that you've created an endpoint to serve prediction requests on your model, you're ready to start a monitoring job to keep an eye on model quality and to alert you if and when input begins to deviate in a way that may impact your model's prediction quality.

In this section, you will configure and create a model monitoring job based on the churn propensity model you imported from BigQuery ML.

In [16]:
# Organize the features to monitor
def convert_list_to_string(org_list, separator=' '):
    """ Convert a list to a string by joining all items with the given separator.
        Returns the concatenated string. """
    return separator.join(org_list)
# Join all the strings in list
feature_list = convert_list_to_string(list(x_test.columns), ', ')
feature_list_thresh = convert_list_to_string(list(x_test.columns), ':1, ')
feature_list_thresh+=':1'
feature_list_thresh


'credit_score:1, metropolitan_division:1, mortgage_insurance_percentage:1, Number_of_units:1, cltv:1, original_upb:1, ltv:1, original_interest_rate:1, original_loan_term:1, number_of_borrowers:1, min_CURRENT_ACTUAL_UPB:1, max_CURRENT_ACTUAL_UPB:1, Range_CURRENT_ACTUAL_UPB:1, stdev_CURRENT_ACTUAL_UPB:1, mode_CURRENT_ACTUAL_UPB:1, average_CURRENT_ACTUAL_UPB:1, min_CURRENT_DEFERRED_UPB:1, max_CURRENT_DEFERRED_UPB:1, Range_CURRENT_DEFERRED_UPB:1, mode_CURRENT_DEFERRED_UPB:1, average_CURRENT_DEFERRED_UPB:1, stdev_CURRENT_DEFERRED_UPB:1, min_CURRENT_INTEREST_RATE:1, max_CURRENT_INTEREST_RATE:1, Range_CURRENT_INTEREST_RATE:1, mode_CURRENT_INTEREST_RATE:1, stdev_CURRENT_INTEREST_RATE:1, average_CURRENT_INTEREST_RATE:1, PREFINAL_LOAN_DELINQUENCY_STATUS:1, frequency_0:1, frequency_1:1, frequency_2:1, frequency_3:1, Recency_0:1, Recency_1:1, Recency_2:1, Recency_3:1, first_time_home_buyer_flag_9:1, first_time_home_buyer_flag_N:1, occupancy_status_I:1, occupancy_status_P:1, occupancy_status_S:1, c

### Configure the following fields:

1. User email - The email address to which you would like monitoring alerts sent.
1. Log sample rate - Your prediction requests and responses are logged to BigQuery tables, which are automatically created when you create a monitoring job. This parameter specifies the desired logging frequency for those tables.
1. Monitor interval - The time window over which to analyze your data and report anomalies. The minimum window is one hour (3600 seconds).
1. Target field - The prediction target column name in the training dataset.
1. Skew detection threshold - The skew threshold for each feature you want to monitor.
1. Prediction drift threshold - The drift threshold for each feature you want to monitor.
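Before filling in the fields above, it can help to sanity-check the values. The sketch below is illustrative only: validate_monitoring_settings is a hypothetical helper, the one-hour minimum comes from the list above, and the assumption that the sample rate must lie in (0, 1] is mine rather than something this notebook states.

```python
MIN_MONITOR_INTERVAL_SECONDS = 3600  # One hour, as stated in the list above.


def validate_monitoring_settings(user_email, log_sample_rate, monitor_interval_seconds):
    """Hypothetical helper: return a list of problems (empty if none found)."""
    problems = []
    if "@" not in user_email:
        problems.append("user email does not look like an email address")
    # Assumption: the sample rate is a fraction of requests, so 0 < rate <= 1.
    if not 0 < log_sample_rate <= 1:
        problems.append("log sample rate should be in the range (0, 1]")
    if monitor_interval_seconds < MIN_MONITOR_INTERVAL_SECONDS:
        problems.append("monitor interval is shorter than 3600 seconds")
    return problems


print(validate_monitoring_settings("someone@example.com", 0.8, 3600))
```

Returning a list of problems, instead of raising on the first one, lets you fix every misconfigured field in a single pass.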

In [17]:
USER_EMAIL = "cchatterjee@google.com"  # @param {type:"string"}
JOB_NAME = "monitoring_job"
TRAIN_DATA = f"gs://tuti_asset/datasets/chanchal_mortgage_structured_train.csv"

# Sampling rate (optional, default=.8)
LOG_SAMPLE_RATE = 0.8  # @param {type:"number"}

# Monitoring Interval in seconds (optional, default=3600).
MONITOR_INTERVAL_IN_SECONDS = 3600  # @param {type:"number"}

# URI to training dataset.
DATASET_GCS_URI = TRAIN_DATA  # @param {type:"string"}
# Prediction target column name in training dataset.
TARGET = "label"

# Skew and drift thresholds.
SKEW_DEFAULT_THRESHOLDS = feature_list  # @param {type:"string"}
SKEW_CUSTOM_THRESHOLDS = feature_list_thresh  # @param {type:"string"}
DRIFT_DEFAULT_THRESHOLDS = feature_list  # @param {type:"string"}
DRIFT_CUSTOM_THRESHOLDS = feature_list_thresh  # @param {type:"string"}


### Create your monitoring job

The following code uses the Google Python client library to translate your configuration settings into a programmatic request to start a model monitoring job. To do this successfully, you need to specify your alerting thresholds (for both skew and drift), your training data source, and apply those settings to all deployed models on your new endpoint (of which there should only be one at this point).

Instantiating a monitoring job can take some time. If everything looks good with your request, you'll get a successful API response. Then, you'll need to check your email to receive a notification that the job is running.
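Besides waiting for the email, you could poll the job state yourself. The following is a rough sketch under stated assumptions: wait_for_job_state is a hypothetical helper, and get_state is any callable you supply that returns the current state name (for example, one built on the list_monitoring_jobs utility defined earlier). The state names match those used in the cleanup cell above.

```python
import time


def wait_for_job_state(get_state, target="JOB_STATE_RUNNING",
                       timeout_s=600, poll_s=30, sleep=time.sleep):
    """Hypothetical helper: poll get_state() until it returns `target`.

    Returns True if the target state is seen, False once about timeout_s
    seconds of waiting have elapsed without seeing it.
    """
    waited = 0
    while True:
        if get_state() == target:
            return True
        if waited >= timeout_s:
            return False
        sleep(poll_s)
        waited += poll_s


# Example with a fake state source; in the notebook, get_state could look up
# the job's "state" entry in the result of list_monitoring_jobs().
fake_states = iter(["JOB_STATE_PENDING", "JOB_STATE_PENDING", "JOB_STATE_RUNNING"])
print(wait_for_job_state(lambda: next(fake_states), sleep=lambda s: None))
```

The sleep parameter is injectable only so the example runs instantly; in real use you would leave it at its default.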

Drift Detection

You need to specify the data drift threshold for the features you want to monitor. The whole idea behind the alerting is to see if a feature's data distribution distance is above the threshold you set. If it is, we will send email alerts to the $USER_EMAIL you specified above.

How do we calculate the feature distribution distance? We use [L-infinity distance](https://en.wikipedia.org/wiki/Chebyshev_distance) for categorical features and [Jensen-Shannon divergence](https://en.wikipedia.org/wiki/Jensen%E2%80%93Shannon_divergence) for numerical features. More details are [here](https://www.tensorflow.org/tfx/guide/tfdv#drift_detection).
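To make the two distance measures concrete, here is a small pure-Python sketch. It is not the service's implementation: how the monitoring service bins numerical features is not described here, so the sketch assumes both distributions are already normalized (categorical frequencies as dicts, numerical histograms as equal-length lists), and it uses a base-2 logarithm so the Jensen-Shannon divergence falls between 0 and 1.

```python
import math


def l_infinity_distance(p, q):
    """Largest absolute difference in probability over all categories.
    p and q map category -> probability; missing categories count as 0."""
    categories = set(p) | set(q)
    return max(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two histograms given as
    equal-length lists of bin probabilities."""
    def kl(a, b):
        # Terms with a zero numerator contribute nothing to the sum.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]  # Mixture distribution.
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Categorical feature: region shares in the baseline vs. recent traffic.
print(l_infinity_distance({"US": 0.5, "EU": 0.5}, {"US": 0.7, "EU": 0.2, "ASIA": 0.1}))

# Numerical feature: identical histograms give 0, disjoint ones give 1.
print(jensen_shannon_divergence([0.5, 0.5], [0.5, 0.5]))
print(jensen_shannon_divergence([1.0, 0.0], [0.0, 1.0]))
```

An alert corresponds to one of these distances exceeding the threshold configured for that feature.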

Below, you just need to list the features that should use the default threshold (DEFAULT_THRESHOLD_VALUE, which is set to 1 earlier in this notebook) and the features that need a custom threshold. If you don't want to monitor a feature, leave it out of both fields.
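The two fields are plain strings: a comma-separated list of feature names for the default threshold, and comma-separated feature:value pairs for custom thresholds. The sketch below shows how such strings could be turned into a per-feature mapping using plain floats. The helper parse_thresholds is hypothetical and only illustrates the format; the notebook itself builds ThresholdConfig objects in its own utility function.

```python
def parse_thresholds(default_features, custom_thresholds, default_value):
    """Hypothetical helper: map each feature name to a float threshold.
    Custom 'feature:value' pairs override the default value."""
    thresholds = {}
    for feature in default_features.split(","):
        feature = feature.strip()
        if feature:
            thresholds[feature] = default_value
    for item in custom_thresholds.split(","):
        item = item.strip()
        if not item:
            continue  # Tolerate empty input and trailing commas.
        feature, sep, value = item.partition(":")
        if not sep:
            raise ValueError(f"Expected 'feature:value', got {item!r}")
        thresholds[feature.strip()] = float(value)
    return thresholds


# Feature names taken from the feature list printed above.
print(parse_thresholds("cltv, ltv", "ltv:0.3", 1.0))
```

A feature that appears in neither string ends up with no threshold at all, which is how you leave it unmonitored.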

Note: if you want to enable monitoring of feature attribution scores
(based on the Sampled Shapley method, more details are [here](https://cloud.google.com/ai-platform-unified/docs/explainable-ai)), just change "enable_feature_attributes" to True in monitoring_objective_config_template. Make sure your model is configured with the explanations [requirement](https://cloud.google.com/ai-platform-unified/docs/explainable-ai/configuring-explanations). 

In [None]:
# Set thresholds specifying alerting criteria for training/serving skew and create config object.
skew_thresholds = get_thresholds(SKEW_DEFAULT_THRESHOLDS, SKEW_CUSTOM_THRESHOLDS)
skew_config = ModelMonitoringObjectiveConfig.TrainingPredictionSkewDetectionConfig(
    skew_thresholds=skew_thresholds
)

# Set thresholds specifying alerting criteria for serving drift and create config object.
drift_thresholds = get_thresholds(DRIFT_DEFAULT_THRESHOLDS, DRIFT_CUSTOM_THRESHOLDS)
drift_config = ModelMonitoringObjectiveConfig.PredictionDriftDetectionConfig(
    drift_thresholds=drift_thresholds)
#explanation_config = ModelMonitoringObjectiveConfig.ExplanationConfig(
#    enable_feature_attributes = False)

# Specify training dataset source location (used for schema generation).
# training_dataset = ModelMonitoringObjectiveConfig.TrainingDataset(target_field=TARGET)
# training_dataset.bigquery_source = BigQuerySource(input_uri=DATASET_BQ_URI)
training_dataset = ModelMonitoringObjectiveConfig.TrainingDataset(target_field=TARGET)
training_dataset.data_format = "csv"
training_dataset.gcs_source = GcsSource(uris=[DATASET_GCS_URI])

# Aggregate the above settings into a ModelMonitoringObjectiveConfig object and use
# that object to adjust the ModelDeploymentMonitoringObjectiveConfig object.
objective_config = ModelMonitoringObjectiveConfig(
    training_dataset=training_dataset,
    training_prediction_skew_detection_config=skew_config,
    prediction_drift_detection_config=drift_config,
    #explanation_config = explanation_config,
)
objective_template = ModelDeploymentMonitoringObjectiveConfig(
    objective_config=objective_config
)

# Find all deployed model ids on the created endpoint and set objectives for each.
model_ids = get_deployed_model_ids(ENDPOINT_ID)
objective_configs = set_objectives(model_ids, objective_template)

# Create the monitoring job for all deployed models on this endpoint.
monitoring_job = create_monitoring_job(objective_configs)


In [19]:
#  Predict with AutoML Table model

instances_dict = x_test.iloc[0:100].astype(str).to_dict(orient="index")
instances_list = []
for i in range(len(instances_dict)):
    instances_list.append(instances_dict[i])

# # Run a prediction request to generate schema, if necessary.
for i in range(100):
    try:
        _ = send_predict_request(ENDPOINT, instances_list)
        print("{}th Prediction Succeeded".format(i))
    except Exception:
        print("{}th Prediction Failed".format(i))

0th Prediction Succeeded
1th Prediction Succeeded
2th Prediction Succeeded
3th Prediction Succeeded
4th Prediction Succeeded
5th Prediction Succeeded
6th Prediction Succeeded
7th Prediction Succeeded
8th Prediction Succeeded
9th Prediction Succeeded
10th Prediction Succeeded
11th Prediction Succeeded
12th Prediction Succeeded
13th Prediction Succeeded
14th Prediction Succeeded
15th Prediction Succeeded
16th Prediction Succeeded
17th Prediction Succeeded
18th Prediction Succeeded
19th Prediction Succeeded
20th Prediction Succeeded
21th Prediction Succeeded
22th Prediction Succeeded
23th Prediction Succeeded
24th Prediction Succeeded
25th Prediction Succeeded
26th Prediction Succeeded
27th Prediction Succeeded
28th Prediction Succeeded
29th Prediction Succeeded
30th Prediction Succeeded
31th Prediction Succeeded
32th Prediction Succeeded
33th Prediction Succeeded
34th Prediction Succeeded
35th Prediction Succeeded
36th Prediction Succeeded
37th Prediction Succeeded
38th Prediction Succee

After a minute or two, you should receive an email at the address you configured above for USER_EMAIL. This email confirms successful deployment of your monitoring job. Here's a sample of what this email might look like:
<br>
<br>
<img src="https://storage.googleapis.com/mco-general/img/mm6.png" />
<br>
As your monitoring job collects data, measurements are stored in Google Cloud Storage and you are free to examine your data at any time. The circled path in the image above specifies the location of your measurements in Google Cloud Storage. Run the following cell to take a look at your measurements in Cloud Storage.


In [20]:
!gsutil ls gs://cloud-ai-platform-e3baca16-39fe-4ad9-9cd1-957fa54da12d/instance_schemas/*
!gsutil ls gs://cloud-ai-platform-e3baca16-39fe-4ad9-9cd1-957fa54da12d/model_monitoring/*
    

gs://cloud-ai-platform-e3baca16-39fe-4ad9-9cd1-957fa54da12d/instance_schemas/job-3248190444915392512/:
gs://cloud-ai-platform-e3baca16-39fe-4ad9-9cd1-957fa54da12d/instance_schemas/job-3248190444915392512/analysis

gs://cloud-ai-platform-e3baca16-39fe-4ad9-9cd1-957fa54da12d/instance_schemas/job-3602285965617397760/:
gs://cloud-ai-platform-e3baca16-39fe-4ad9-9cd1-957fa54da12d/instance_schemas/job-3602285965617397760/analysis

gs://cloud-ai-platform-e3baca16-39fe-4ad9-9cd1-957fa54da12d/instance_schemas/job-3851672794983038976/:
gs://cloud-ai-platform-e3baca16-39fe-4ad9-9cd1-957fa54da12d/instance_schemas/job-3851672794983038976/analysis

gs://cloud-ai-platform-e3baca16-39fe-4ad9-9cd1-957fa54da12d/instance_schemas/job-4401111949522239488/:
gs://cloud-ai-platform-e3baca16-39fe-4ad9-9cd1-957fa54da12d/instance_schemas/job-4401111949522239488/analysis

gs://cloud-ai-platform-e3baca16-39fe-4ad9-9cd1-957fa54da12d/instance_schemas/job-4461470739840630784/:
gs://cloud-ai-platform-e3baca16-39fe-4ad9

You will notice the following components in these Cloud Storage paths:

- **cloud-ai-platform-..** - This is a bucket created for you and assigned to capture your service's prediction data. Each monitoring job you create will trigger creation of a new folder in this bucket.
- **[model_monitoring|instance_schemas]/job-..** - This is your unique monitoring job number, which you can see above in both the response to your job creation request and the email notification.
- **instance_schemas/job-../analysis** - This is the monitoring job's understanding and encoding of your training data's schema (field names, types, etc.).
- **instance_schemas/job-../predict** - This is the first prediction made to your model after the current monitoring job was enabled.
- **model_monitoring/job-../serving** - This folder is used to record data relevant to drift calculations. It contains measurement summaries for every hour your model serves traffic.
- **model_monitoring/job-../training** - This folder is used to record data relevant to training-serving skew calculations. It contains an ongoing summary of prediction data relative to training data.
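
The path layout described above can be taken apart programmatically, for example to group `gsutil ls` output by monitoring job. The following is a minimal sketch, not part of the original notebook: the `MonitoringPath` type and `parse_monitoring_path` function are hypothetical names, and the code assumes the `gs://<bucket>/<category>/job-<id>/<subfolder>` layout shown in the listing.

```python
from typing import NamedTuple, Optional


class MonitoringPath(NamedTuple):
    """Components of a monitoring path (hypothetical helper type)."""

    bucket: str
    category: str  # "instance_schemas" or "model_monitoring"
    job_id: str  # the digits that follow "job-"
    subfolder: Optional[str]  # e.g. "analysis", "predict", "serving", "training"


def parse_monitoring_path(uri: str) -> MonitoringPath:
    """Split gs://<bucket>/<category>/job-<id>/[<subfolder>] into its parts."""
    if not uri.startswith("gs://"):
        raise ValueError("not a Cloud Storage URI: {}".format(uri))
    # gsutil prints a trailing ':' after directory names; drop it and any slashes.
    parts = uri[len("gs://"):].rstrip(":").strip("/").split("/")
    if len(parts) < 3 or not parts[2].startswith("job-"):
        raise ValueError("unexpected path layout: {}".format(uri))
    subfolder = parts[3] if len(parts) > 3 else None
    return MonitoringPath(parts[0], parts[1], parts[2][len("job-"):], subfolder)


# Example using a path taken from the listing above.
path = parse_monitoring_path(
    "gs://cloud-ai-platform-e3baca16-39fe-4ad9-9cd1-957fa54da12d"
    "/instance_schemas/job-3248190444915392512/analysis"
)
print(path.category, path.job_id, path.subfolder)
```

Paths that don't follow this layout raise a `ValueError` rather than being guessed at, so a truncated listing line fails loudly.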

In [21]:
# List all monitoring jobs
job_list = list_monitoring_jobs()
job_list

[{'name': 'projects/901951554789/locations/us-central1/modelDeploymentMonitoringJobs/804583026787876864',
  'state': 'JOB_STATE_PENDING'}]
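
The job listed above is still in `JOB_STATE_PENDING`. If you'd rather wait programmatically than re-run the cell, the sketch below polls a job-listing function until the job reaches a target state. It is an illustration only: `wait_for_job_state` is a hypothetical helper, and it assumes the listing function returns dictionaries with `name` and `state` keys, as in the output above. The timeout and polling interval are arbitrary defaults.

```python
import time
from typing import Callable, Dict, List


def wait_for_job_state(
    list_jobs: Callable[[], List[Dict[str, str]]],
    job_name: str,
    target_state: str = "JOB_STATE_RUNNING",
    timeout_s: float = 3600.0,
    poll_interval_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll list_jobs() until job_name reports target_state.

    Returns True once the state matches, or False if timeout_s elapses first.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        # Map each job name to its current state for a direct lookup.
        states = {job["name"]: job["state"] for job in list_jobs()}
        if states.get(job_name) == target_state:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_interval_s)


# Usage in this notebook might look like the following (not executed here):
# wait_for_job_state(list_monitoring_jobs, job_list[0]["name"])
```

The `sleep` parameter is injectable so the function can be exercised without real delays.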

### You can create monitoring jobs with other user interfaces

In the previous cells, you created a monitoring job using the Python client library. You can also use the *gcloud* command line tool to create a model monitoring job and, in the near future, you will also be able to use the Cloud Console for this function.


## Interpret your results

While waiting for your results, which, as noted, may take up to an hour, you can read ahead to get a sense of the alerting experience.

### Here's what a sample email alert looks like...

<img src="https://storage.googleapis.com/mco-general/img/mm7.png" />


This email warns you that the production values of the *cnt_user_engagement*, *country*, and *language* features have skewed from their training values by more than your configured threshold. It also tells you that the *cnt_user_engagement* feature is drifting significantly over time, again relative to the threshold you specified.
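
To build intuition for what "above your threshold" means, the sketch below compares the distribution of a categorical feature in training data against serving data and flags it when the distance exceeds a threshold. This is an illustration under stated assumptions, not the service's implementation: the function names, the sample values, and the 0.1 threshold are all made up, and the distance used here is the L-infinity distance (the largest difference in any category's relative frequency), which the Vertex AI documentation describes for categorical features.

```python
from collections import Counter
from typing import Dict, Iterable


def category_frequencies(values: Iterable[str]) -> Dict[str, float]:
    """Return the relative frequency of each category in values."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot compute frequencies of an empty sample")
    return {category: count / total for category, count in counts.items()}


def l_infinity_distance(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Largest absolute difference in frequency over all categories in p or q."""
    return max(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


def exceeds_threshold(
    training_values: Iterable[str], serving_values: Iterable[str], threshold: float
) -> bool:
    """True when the serving distribution differs from training by more than threshold."""
    distance = l_infinity_distance(
        category_frequencies(training_values), category_frequencies(serving_values)
    )
    return distance > threshold


# Made-up samples of a "country" feature (hypothetical data).
training_country = ["US"] * 6 + ["IN"] * 3 + ["JP"] * 1
serving_country = ["US"] * 3 + ["IN"] * 3 + ["JP"] * 4
print(exceeds_threshold(training_country, serving_country, threshold=0.1))
```

Comparing serving data against training data in this way corresponds to skew detection; comparing two successive windows of serving data corresponds to drift detection.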

### Monitoring results in the Cloud Console

You can examine your model monitoring data from the Cloud Console. Below are screenshots of those capabilities.

#### Monitoring Status

<img src="https://storage.googleapis.com/mco-general/img/mm1.png" />

#### Monitoring Alerts

<img src="https://storage.googleapis.com/mco-general/img/mm2.png" />

## Clean up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud
project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, you can delete the individual resources you created in this tutorial:

In [22]:
# Print current job names and states
jobs_list = list_monitoring_jobs()
if not jobs_list:
    print("No monitoring jobs")
else:
    for job in jobs_list:
        print(job["name"])
        print(job["state"])

projects/901951554789/locations/us-central1/modelDeploymentMonitoringJobs/804583026787876864
JOB_STATE_PENDING


In [23]:
# Delete all monitoring jobs. A running job is paused before it is deleted;
# a pending job cannot be stopped yet, so it is skipped.
jobs_list = list_monitoring_jobs()
if not jobs_list:
    print("No jobs to delete.")
for job in jobs_list:
    if job["state"] == "JOB_STATE_PENDING":
        print("{} pending cannot be stopped".format(job["name"]))
        continue
    if job["state"] == "JOB_STATE_RUNNING":
        pause_monitoring_job(job["name"])
    delete_monitoring_job(job["name"])
    print("Deleted job")

projects/901951554789/locations/us-central1/modelDeploymentMonitoringJobs/804583026787876864 pending cannot be stopped


In [None]:
# out = !gcloud ai endpoints undeploy-model $ENDPOINT_ID --deployed-model-id $DEPLOYED_MODEL_ID
# if _exit_code == 0:
#     print("Model undeployed.")
# else:
#     print("Error undeploying model:", out)

# out = !gcloud ai endpoints delete $ENDPOINT_ID --quiet
# if _exit_code == 0:
#     print("Endpoint deleted.")
# else:
#     print("Error deleting endpoint:", out)

# out = !gcloud ai models delete $MODEL_ID --quiet
# if _exit_code == 0:
#     print("Model deleted.")
# else:
#     print("Error deleting model:", out)

## Learn more about model monitoring

**Congratulations!** You've now learned what model monitoring is, how to configure and enable it, and how to find and interpret the results. Check out the following resources to learn more about model monitoring and ML Ops.

- [TensorFlow Data Validation](https://www.tensorflow.org/tfx/guide/tfdv)
- [Data Understanding, Validation, and Monitoring At Scale](https://blog.tensorflow.org/2018/09/introducing-tensorflow-data-validation.html)
- [Vertex Product Documentation](https://cloud.google.com/vertex-ai/docs)
- [Model Monitoring Reference Docs](https://cloud.google.com/vertex-ai/docs/model-monitoring)
