# Sample External Model with Domino Model Monitoring

Example notebook to set up external Domino Model Monitoring (DMM):
- Models hosted outside of Domino 
- Batch inference models scored through Domino Jobs

## Background
The key difference between monitoring external models and Domino's integrated model monitoring is that with external models, Domino does not capture your model's training data using TrainingSets or prediction data through the DataCaptureClient. You will need to provide the model's training data, prediction data and, optionally, ground truth labels in a Monitoring Data Source, and then point your model in DMM to that data.

Other notes:

(1) The model does not need to be trained in Domino. It can be an existing model trained elsewhere.

(2) It does not matter where the external model is hosted. It could be on an edge device, on-prem, in your cloud hosting service, or even hosted in Domino.

The steps below can be done via the DMM UI, or automated using DMM's API. To use the UI, follow the steps documented here:

https://docs.dominodatalab.com/en/latest/user_guide/679cc1/set-up-model-monitor/

The examples below demonstrate the setup & monitoring of external models using DMM's API. 
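
The API calls in this notebook all follow the same pattern: a PUT request to a path under `/model-monitor/v2/api`, sent with an `X-Domino-Api-Key` header. As a minimal sketch, the URL and header construction could be kept in one place. The helper names `dmm_url` and `dmm_headers` are hypothetical and not part of any Domino library.

```python
def dmm_url(domino_host, *path_parts):
    """Build a DMM API URL such as https://<host>/model-monitor/v2/api/model."""
    base = "https://{}/model-monitor/v2/api".format(domino_host.strip("/"))
    return "/".join([base] + [str(part).strip("/") for part in path_parts])


def dmm_headers(api_key):
    """Headers sent with every request in this notebook."""
    return {"X-Domino-Api-Key": api_key, "Content-Type": "application/json"}


# The four endpoints used in the steps below ("<model_id>" is a placeholder)
host = "demo2.dominodatalab.com"
print(dmm_url(host, "datasource"))
print(dmm_url(host, "model"))
print(dmm_url(host, "model", "<model_id>", "register-dataset", "prediction"))
print(dmm_url(host, "model", "<model_id>", "register-dataset", "ground_truth"))
```

Centralizing the URL pattern this way makes it harder to pair a stale URL with a new model ID when cells are run out of order.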

### Step 1: Connect a Monitoring Data Source

Domino requires an external data source to register an external model. For integrated models, you only need a Monitoring Data Source if you are ingesting ground truth labels.

The external data source stores:

(1) The training dataset

(2) Prediction data & model predictions

(3) Ground truth labels (optional)

One Monitoring Data Source can be used for multiple DMM models. The same Monitoring Data Source can also be used for both ground truth labels for integrated models and data used for external models.

The Monitoring Data Sources are registered independently of the data sources used in Domino Workbench. Model monitoring can read in data from multiple cloud data sources or on-prem data sources. A list of available Monitoring Data Sources is here:
https://docs.dominodatalab.com/en/latest/user_guide/8c7833/connect-a-data-source/

You can register your Monitoring Data Source through the DMM UI or using DMM's API (see example API call below). If using the example below, be sure to update inputs 1-4 (labeled "UPDATE"), and the datasource type specific to your DMM data source. 

In [23]:
# Example: Register a Monitoring Data Source using the API.

# API Reference: https://docs.dominodatalab.com/en/latest/api_guide/f31cde/model-monitoring-api-reference/#_datasource

import os
import json
import requests

# UPDATE: (1) Your Domino API key
# https://docs.dominodatalab.com/en/latest/user_guide/40b91f/domino-api-authentication/#_authenticate_with_an_api_key
API_key = os.environ['MY_API_KEY']

# UPDATE: (2) Your organization's Domino url
your_domino_url = 'demo2.dominodatalab.com'

# UPDATE: (3) Your new DMM datasource name
datasource_name = 'New_DataSource'

# UPDATE: (4) DMM Datasource Type & Attributes. These credentials will be different for each datasource.
# This example is for AWS s3, other data sources are documented here: 
# https://docs.dominodatalab.com/en/latest/api_guide/f31cde/model-monitoring-api-reference/#_dataSourceRequestCommon

datasource_type = "s3"
S3_Bucket_Name = "se-demo-bucket"
S3_Region = "us-west-2"
AWS_Access_Key = os.environ.get("AWS_ACCESS_KEY_ID")
AWS_Secret_Key = os.environ.get("AWS_SECRET_ACCESS_KEY")

datasource_url = "https://{}/model-monitor/v2/api/datasource".format(your_domino_url)

# Set up call headers
headers = {
           'X-Domino-Api-Key': API_key,
           'Content-Type': 'application/json'
          }

data_source_request = {
    "name": datasource_name,
    "type": datasource_type,
    "config" : {
        "bucket": S3_Bucket_Name,
        "region": S3_Region,
        "instance_role" : False,
        "access_key": AWS_Access_Key,
        "secret_key": AWS_Secret_Key
    }
}

# Make api call
datasource_response = requests.request("PUT", datasource_url, headers=headers, data=json.dumps(data_source_request))

# Print response
print(datasource_response.text.encode('utf8'))
 
print('DONE!')

b'Datasource with name New_DataSource and type s3 is already registered.'
DONE!
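
Note that `os.environ.get` returns `None` when a variable is unset, so the request above would be sent with empty credentials if the AWS variables are missing. A minimal sketch of a guard, assuming the same S3 config fields as the example (the function name `build_s3_datasource_request` is hypothetical):

```python
def build_s3_datasource_request(name, bucket, region, access_key, secret_key):
    """Return the request body for an S3 Monitoring Data Source, or raise if a value is missing."""
    fields = {
        "name": name,
        "bucket": bucket,
        "region": region,
        "access_key": access_key,
        "secret_key": secret_key,
    }
    missing = [key for key, value in fields.items() if not value]
    if missing:
        raise ValueError("Missing data source settings: {}".format(", ".join(missing)))
    return {
        "name": name,
        "type": "s3",
        "config": {
            "bucket": bucket,
            "region": region,
            "instance_role": False,
            "access_key": access_key,
            "secret_key": secret_key,
        },
    }


# Example with placeholder credentials (not real keys)
request_body = build_s3_datasource_request(
    "New_DataSource", "se-demo-bucket", "us-west-2", "EXAMPLE_KEY", "EXAMPLE_SECRET"
)
print(sorted(request_body["config"]))
```

Failing early with a clear message is usually easier to debug than an authentication error returned later by the data source.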


### Step 2: Register an External Model

Once you have a Monitoring Data Source registered:

(1) Upload the training data used for your model to that Monitoring Data Source, and note the path to your training data file. DMM needs this path to register the model.

(2) Prepare your **Monitoring Config JSON** file. In the UI, the config json looks like the example below.

It contains 3 components:

(A) **variables**: A list of variable names, data types, and variable types for each column that you want to monitor. This can include the target variable if you'd like to monitor drift in your model's predictions.

(B) **datasetDetails**: The name and location of your training dataset that you just uploaded into the DMM datasource

(C) **modelMetadata**: The name and description of your model to render in Domino Model Monitoring

As with Monitoring Data Sources, Monitoring Config JSONs can be copied and pasted into the UI or sent to Domino through the API. The full documentation for Monitoring Config JSONs is here:

https://docs.dominodatalab.com/en/latest/user_guide/bb88ca/monitoring-config-json/

**Pro Tip:** Domino recommends saving your config json as a file in your Project files for future reference and modification. See "Example_Model_Config.json"



```
{
    "variables": [
        {
            "valueType": "numerical",
            "variableType": "feature",
            "name": "petal.length"
        },
        {
            "valueType": "numerical",
            "variableType": "feature",
            "name": "sepal.length"
        },
        {
            "valueType": "numerical",
            "variableType": "feature",
            "name": "petal.width"
        },
        {
            "valueType": "numerical",
            "variableType": "feature",
            "name": "sepal.width"
        },
        {
            "valueType": "categorical",
            "variableType": "prediction",
            "name": "variety"
        }
    ],
    "datasetDetails": {
        "name": "iris.csv",
        "datasetType": "file",
        "datasetConfig": {
            "path": "iris.csv",
            "fileFormat": "csv"
        },
        "datasourceName": "dmm-shared-bucket",
        "datasourceType": "s3"
    },
    "modelMetadata": {
        "name": "iris_model",
        "modelType": "classification",
        "version": "1.01",
        "description": "classification_iris_model",
        "author": "John Doe"
    }
}
```
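
Before sending a config, it can help to check that it has the three components described above. The following is a minimal sketch that checks only the structure shown in the example, not the full schema that DMM enforces; the function name `check_monitoring_config` is hypothetical.

```python
REQUIRED_SECTIONS = ("variables", "datasetDetails", "modelMetadata")
REQUIRED_VARIABLE_KEYS = ("name", "valueType", "variableType")


def check_monitoring_config(config):
    """Return a list of problems found in a Monitoring Config dict (empty if none)."""
    problems = []
    for section in REQUIRED_SECTIONS:
        if section not in config:
            problems.append("missing section: {}".format(section))
    for index, variable in enumerate(config.get("variables", [])):
        for key in REQUIRED_VARIABLE_KEYS:
            if key not in variable:
                problems.append("variable {} is missing '{}'".format(index, key))
    return problems


example_config = {
    "variables": [
        {"valueType": "numerical", "variableType": "feature", "name": "petal.length"},
        {"valueType": "categorical", "variableType": "prediction"},  # 'name' left out on purpose
    ],
    "datasetDetails": {"name": "iris.csv"},
    # "modelMetadata" left out on purpose
}
print(check_monitoring_config(example_config))
```

If you save your config as a file in your Project, you could load it with `json.load` and pass the result to the same function.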

### Link the Monitoring Data Source to this Project (Optional)

Data sources are registered separately in Domino Model Monitoring and Domino's workbench. 

In this example, we attach the same external data source (such as an S3 bucket) to both DMM and Domino's Workbench, because we upload the training data from the Workbench. This is optional: you can also add your training data directly to your DMM data source outside of Domino.

To upload the training dataset from the Workbench, add the same Monitoring Data Source registered above to this project. In a workspace, you can add it through the Data tab on the left.

The Iris training dataset is already saved as "iris_training_data.csv" in the "data" folder under /mnt/code.

In [20]:
# from sklearn.datasets import load_iris
# import pandas as pd

# data = load_iris()
# target_column_name = "variety"

# training_df = pd.DataFrame(data['data'], columns = data.feature_names)
# training_df['variety'] = [data.target_names[y] for y in data["target"]]
# training_df.tail(20)
# training_df.to_csv('/mnt/code/data/iris_training_data.csv', index=False)

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),variety
130,7.4,2.8,6.1,1.9,virginica
131,7.9,3.8,6.4,2.0,virginica
132,6.4,2.8,5.6,2.2,virginica
133,6.3,2.8,5.1,1.5,virginica
134,6.1,2.6,5.6,1.4,virginica
135,7.7,3.0,6.1,2.3,virginica
136,6.3,3.4,5.6,2.4,virginica
137,6.4,3.1,5.5,1.8,virginica
138,6.0,3.0,4.8,1.8,virginica
139,6.9,3.1,5.4,2.1,virginica


In [24]:
# Upload the Training Data from Domino Workbench to the DMM data source

from domino.data_sources import DataSourceClient

# The name of the DMM data source in Domino's Workbench.
external_datasource = "demo-bucket" # UPDATE 

# Instantiate a client and fetch the datasource instance
object_store = DataSourceClient().get_datasource(external_datasource)

# Upload the existing training data and the example scoring data to your DMM datasource from the Workbench
object_store.upload_file('iris_training_data.csv', '/mnt/code/data/iris_training_data.csv')
object_store.upload_file('external_model_scoring_data.csv', '/mnt/code/data/external_model_scoring_data.csv')

In [25]:
#### Example to register a model via the API
# API Reference: https://docs.dominodatalab.com/en/latest/user_guide/a94c1c/model-monitoring-apis/#_model

import os
import json
import requests

# UPDATE: (1) Your Domino API key
API_key = os.environ['MY_API_KEY']

# UPDATE: (2) Your organization's Domino url
your_domino_url = 'demo2.dominodatalab.com'

# UPDATE: (3) Your DMM datasource name
datasource_name = 'se-demo-bucket'

# UPDATE: (4) Your DMM datasource type
datasource_type = 's3'

# UPDATE: (5) Training dataset name, path & file format. These will be different for each datasource.
training_dataset_name = "iris_training_data.csv"
training_dataset_path = "iris_training_data.csv"
training_dataset_fileFormat = "csv"

datasource_url = "https://{}/model-monitor/v2/api/model".format(your_domino_url)

# Set up call headers
headers = {
           'X-Domino-Api-Key': API_key,
           'Content-Type': 'application/json'
          }

# Update each variable name, variableType and valueType for your model:

model_register_request = {
    "variables": [
        {
            "valueType": "numerical",
            "variableType": "feature",
            "name": "petal length (cm)"
        },
        {
            "valueType": "numerical",
            "variableType": "feature",
            "name": "sepal length (cm)"
        },
        {
            "valueType": "numerical",
            "variableType": "feature",
            "name": "petal width (cm)"
        },
        {
            "valueType": "numerical",
            "variableType": "feature",
            "name": "sepal width (cm)"
        },
        {
            "valueType": "categorical",
            "variableType": "prediction",
            "name": "variety"
        }
    ],
    "datasetDetails": {
        "name": training_dataset_name,
        "datasetType": "file",
        "datasetConfig": {
            "path": training_dataset_path,
            "fileFormat": training_dataset_fileFormat
        },
        "datasourceName": datasource_name,
        "datasourceType": datasource_type
    },
    "modelMetadata": {
        "name": "Example External Model",
        "modelType": "classification",
        "version": "1.01",
        "description": "classification_iris_model",
        "author": "John Doe"
    }
}

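# Make api call
model_response = requests.request("PUT", datasource_url, headers=headers, data=json.dumps(model_register_request))

# Print response
print(model_response.text.encode('utf8'))

# On success the response is a JSON object whose "id" field is the new model id (needed in the later steps)
if model_response.ok:
    print("New model id is:")
    print(model_response.json().get("id"))

print('DONE!')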

b'{"id": "662967dff19b7cc360c275ac", "createdAt": 1713989599, "updatedAt": 1713989599, "name": "Example External Model", "description": "classification_iris_model", "modelType": "classification", "author": "John Doe", "version": "1.01", "userId": "4b684539-9bd4-46d4-bb64-60c8094ccb15", "isDeleted": false, "ingestionStatus": "created", "registrationStatus": "created", "sourceType": "standalone", "visibility": "public", "collaborators": [], "tagIds": []}'
DONE!
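
The later steps need the model id from this response. The outputs in this notebook show three response shapes: a JSON object on success, a JSON list of messages, and a plain-text message. A minimal sketch of a parser that tolerates all three follows; the function name `extract_model_id` is hypothetical.

```python
import json


def extract_model_id(response_text):
    """Return the "id" field of a JSON-object response body, or None if there is none."""
    try:
        body = json.loads(response_text)
    except ValueError:
        return None  # plain-text body, e.g. an "already registered" message
    if isinstance(body, dict):
        return body.get("id")
    return None  # valid JSON, but not an object (e.g. a list of messages)


print(extract_model_id('{"id": "662967dff19b7cc360c275ac", "name": "Example External Model"}'))
print(extract_model_id('["Dataset already registered with the model."]'))
print(extract_model_id('Datasource with name New_DataSource and type s3 is already registered.'))
```

Returning `None` instead of raising lets a scheduled job decide for itself whether a missing id is an error.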


### Step 3: Set up Drift Detection
https://docs.dominodatalab.com/en/latest/user_guide/86bc1f/set-up-drift-detection/


While integrated models can capture prediction data using the DataCaptureClient, external models need to ingest prediction data from a connected Monitoring Data Source. Just like with the initial model registration, information needed to ingest the prediction data is provided to Domino using a **Prediction Config JSON**.


There are two approaches to automating prediction data ingest from a Monitoring Data Source:


(1) Append new data to the same file in your Monitoring Data Source.


  -  Only register your Prediction Data config with the path to the prediction data file once. Domino will automatically retrieve new data every 24 hours from that file. You can schedule the daily check in the DMM UI.
  -  This approach requires registering a **timestamp** variable so that DMM knows which prediction rows are new.


(2) Upload prediction data as separate files to your Monitoring Data Source.


  - This requires updating the datasetDetails in the Prediction Data Config every time new prediction data is added. This is best automated through the API, using a Domino Job or some other scheduler.
  - When you update the Prediction Data Config, only update the "datasetDetails" with the new prediction data file path. Variables are set only the first time; if you re-register variable names, DMM will throw an error.
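
Approach (2) can be sketched as a small builder that includes "variables" only on the first registration and otherwise sends just "datasetDetails". The function names and the one-file-per-day naming scheme are assumptions for illustration; the field names follow the Prediction Data Config examples in this notebook.

```python
import datetime


def build_prediction_config(file_path, datasource_name, datasource_type,
                            variables=None, file_format="csv"):
    """Build a Prediction Data Config; pass `variables` only on the first registration."""
    config = {
        "datasetDetails": {
            "name": file_path,
            "datasetType": "file",
            "datasetConfig": {"path": file_path, "fileFormat": file_format},
            "datasourceName": datasource_name,
            "datasourceType": datasource_type,
        }
    }
    if variables:
        config["variables"] = variables
    return config


def dated_file_name(prefix, day):
    """Hypothetical naming scheme: one prediction file per day."""
    return "{}_{}.csv".format(prefix, day.isoformat())


# First registration: declare the row identifier once
first = build_prediction_config(
    "external_model_scoring_data.csv", "se-demo-bucket", "s3",
    variables=[{"valueType": "string", "variableType": "row_identifier", "name": "id"}],
)

# Later uploads: only datasetDetails, pointing at the new file
later = build_prediction_config(
    dated_file_name("external_model_scoring_data", datetime.date(2024, 4, 24)),
    "se-demo-bucket", "s3",
)
print(sorted(first), sorted(later))
print(later["datasetDetails"]["datasetConfig"]["path"])
```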


Below is an example of an initial Prediction Data Config, which can be copied and pasted into the UI or sent to Domino through the API. Full docs here:


https://docs.dominodatalab.com/en/latest/user_guide/bb88ca/monitoring-config-json/


At the end is an example of only updating the "datasetDetails" via the API if you choose to follow approach #2.

**Pro Tip:** Domino recommends saving your config json as a file in your Project files for future reference and modification. See "Example_Prediction_Config.json"

In [22]:
### Example to register the initial Prediction Config via the API
# API Reference: https://docs.dominodatalab.com/en/latest/user_guide/a94c1c/model-monitoring-apis/#_model

import os
import json
import requests

# UPDATE: (1) Your Domino API key
API_key = os.environ['MY_API_KEY']

# UPDATE: (2) Your Model Monitoring Model ID, created when the model was registered in Step 2.
model_id='66295e50965e21e5b0d56b84'

# UPDATE: (3) Your organization's Domino url
your_domino_url = 'demo2.dominodatalab.com'

# UPDATE: (4) Your DMM datasource name
datasource_name = 'se-demo-bucket'

# UPDATE: (5) Your DMM datasource type
datasource_type = 's3'

# UPDATE: (6) Your RowID Name (Optional, for model quality monitoring. Do this only once.)
Prediction_ID_name = 'id'

# UPDATE: (7) Prediction dataset name, path & file format. These will be different for each datasource.
prediction_dataset_name = "external_model_scoring_data.csv"
prediction_dataset_path = "external_model_scoring_data.csv"
prediction_dataset_fileFormat = "csv"

prediction_data_url = "https://{}/model-monitor/v2/api/model/{}/register-dataset/prediction".format(your_domino_url, model_id)


# Set up call headers
headers = {
           'X-Domino-Api-Key': API_key,
           'Content-Type': 'application/json'
          }

# Update each variable name, variableType and valueType for your model:

prediction_registration_request = {
    "variables": [
        {
            "valueType": "string",
            "variableType": "row_identifier",
            "name": Prediction_ID_name
        }
    ],
    "datasetDetails": {
        "name": prediction_dataset_name,
        "datasetType": "file",
        "datasetConfig": {
            "path": prediction_dataset_path,
            "fileFormat": prediction_dataset_fileFormat
        },
        "datasourceName": datasource_name,
        "datasourceType": datasource_type
    }
}

# Make api call
prediction_response = requests.request("PUT", prediction_data_url, headers=headers, data=json.dumps(prediction_registration_request))

# Print response
print(prediction_response.text.encode('utf8'))
 
print('DONE!')

b'["Dataset already registered with the model."]'
DONE!


#### Option 2: Upload additional prediction data as separate files to your Monitoring Data Source.

Next is an example for updating the prediction data file via the API if you choose option (2).

Example scripts to automate these steps using Domino Jobs are in the "external_model_scripts" folder.

**domino_batch_job.py** simulates the external model scoring step, using a Domino job for batch inference. Batch inference in Domino using Domino Jobs can be monitored the same way as an external model. For external models, the scoring data, external model predictions, and prediction ID must be captured manually.

**daily_scoring_upload.py** 

1) First, the script uploads the scoring data, including the external model predictions and prediction ID, to the DMM data source for our external model. 
2) Next, it uploads a csv file containing the ground truth labels to the DMM data source.
3) Finally it updates the file paths for both scoring data and ground truth data using the DMM API so that DMM can find the new data when it ingests data from the DMM data source. 

To test these scripts, ensure that the batch job runs before daily_scoring_upload.py, and that DMM is scheduled to ingest the scoring and ground truth data after daily_scoring_upload.py has finished.

In [None]:
### Example to update Prediction Config if uploading prediction data as separate files to your Monitoring Data Source.
# Only update the "datasetDetails". You could run this snippet as a Domino Job, updating the prediction dataset name and path.

import os
import json
import requests

# Your Domino API key
API_key = os.environ['MY_API_KEY']

# Your Model Monitoring Model ID, created when the model was registered in Step 2.
model_id='6628103c965e21e5b0d56b29'

# Your organization's Domino url
your_domino_url = 'demo2.dominodatalab.com'

# Your DMM datasource name
datasource_name = 'se-demo-bucket'

# Your DMM datasource type
datasource_type = 's3'

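# The updated path to your prediction dataset
prediction_dataset_name = "iris_ground_truth_1_25_2024.csv"
prediction_dataset_path = "iris_ground_truth_1_25_2024.csv"
prediction_dataset_fileFormat = "csv"

# Rebuild the endpoint URL for this model_id (same endpoint as the initial registration)
prediction_data_url = "https://{}/model-monitor/v2/api/model/{}/register-dataset/prediction".format(your_domino_url, model_id)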

# Set up call headers
headers = {
           'X-Domino-Api-Key': API_key,
           'Content-Type': 'application/json'
          }

prediction_registration_request = {
    "datasetDetails": {
        "name": prediction_dataset_name,
        "datasetType": "file",
        "datasetConfig": {
            "path": prediction_dataset_path,
            "fileFormat": prediction_dataset_fileFormat
        },
        "datasourceName": datasource_name,
        "datasourceType": datasource_type
    }
}

# Make api call
prediction_response = requests.request("PUT", prediction_data_url, headers=headers, data=json.dumps(prediction_registration_request))

# Print response
print(prediction_response.text.encode('utf8'))
 
print('DONE!')


### Step 4: Set up Model Quality Monitoring (Optional)

Setting up Model Quality Monitoring is nearly the same for integrated and external models, because Domino Model APIs cannot capture actual outcomes after the fact. The process closely mirrors registering prediction data for external models.

Typically for this step you would fetch actual ground truth data (the actual outcomes from what your model predicted on), 
join the actual outcomes with your prediction data, and upload into a Monitoring Data Source for Model Quality 
analysis.
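
For a real model, that join might look like the following sketch. It assumes your predictions carry an `event_id` column and that actual outcomes arrive later in a separate table keyed by the same ID. The column name `actual_variety` and all sample values are made up for illustration.

```python
import pandas as pd

# Predictions as captured at scoring time (sample values for illustration)
predictions = pd.DataFrame({
    "event_id": ["a1", "a2", "a3"],
    "variety": ["setosa", "versicolor", "virginica"],
})

# Actual outcomes collected later; "a3" has no outcome yet
outcomes = pd.DataFrame({
    "event_id": ["a2", "a1"],
    "actual_variety": ["virginica", "setosa"],
})

# Inner join keeps only predictions that already have a known outcome
joined = predictions.merge(outcomes, on="event_id", how="inner")

# The ground truth file needs just the join key and the label column
ground_truth = joined[["event_id", "actual_variety"]].rename(
    columns={"actual_variety": "iris_ground_truth"}
)
print(ground_truth)
```

An inner join is used here so that rows without a known outcome are left out of the ground truth file rather than uploaded with empty labels.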

However, for purposes of creating a quick demo, we'll make up some fake ground truth data using the data we used in the previous step.

In [None]:
import pandas as pd

# Navigate to the most recent predictions and copy the file path to one of the parquet files in there. 
# This is where you can find data captured by the Data Capture Client in your Model API

# /mnt/data/prediction_data/{PREDICTION_DATA_ID}/{DATE}/{TIME}/predictions_{ID}.parquet

path = '/mnt/data/prediction_data/65b04f6b1266902edb95b260/$$date$$=2024-04-23Z/$$hour$$=07Z/predictions_96f154f9-99c3-4da0-ae7c-878b21ddffa7.parquet'

predictions = pd.read_parquet(path)

The Ground Truth dataset needs 2 columns: 

1) The existing event ID column from the model predictions.
   
    This column has the join keys for joining ground truth labels to your model's predictions.

2) Your new column containing ground truth labels.


In [None]:
event_id = predictions['event_id']
iris_ground_truth = predictions['variety']

# Create a new dataframe
ground_truth = pd.DataFrame(columns=['event_id', 'iris_ground_truth'])
ground_truth['event_id'] = event_id
ground_truth['iris_ground_truth'] = iris_ground_truth

# These row indices help find some different iris types in our initial scoring data
end_index = predictions.shape[0]
mid_index = int(round(predictions.shape[0] / 2, 0))

# Simulate some classification errors. This makes our confusion matrix interesting.
ground_truth.iloc[0, 1] = 'virginica'
ground_truth.iloc[1, 1] = 'versicolor'
ground_truth.iloc[mid_index-1, 1] = 'versicolor'
ground_truth.iloc[mid_index, 1] = 'virginica'
ground_truth.iloc[end_index-2, 1] = 'setosa'
ground_truth.iloc[end_index-1, 1] = 'setosa'

# Save this example ground truth csv to your Project files for reference.
import datetime

date = datetime.datetime.today()
month = date.month
day = date.day
year = date.year

ground_truth.to_csv('data/iris_ground_truth_{}_{}_{}.csv'.format(month, day, year), index=False)

In [None]:
import pandas as pd
import numpy as np
import random
import math
import pickle
import json
import os
import requests
import datetime
import boto3
from botocore.exceptions import NoCredentialsError
 
# UPDATE: (1) The name of your monitoring data source in Domino Model Monitoring
data_source = 'se-demo-bucket'

# UPDATE: (2) Your Model Monitoring Model ID (NOT Model API model ID)
model_id='65b0525c54ac3acc8cb495d1'

# UPDATE: (3) Your Domino API key
API_key = os.environ['MY_API_KEY']
 
# UPDATE: (4) The name of the file uploaded to s3 above
gt_file_name = "iris_ground_truth_{}_{}_{}.csv".format(month, day, year)

# UPDATE: (5) Ground Truth column name
GT_column_name = 'iris_ground_truth'

# UPDATE: (6) Your original target column name
target_column_name = 'variety'

# UPDATE: (7) Your organization's Domino url
your_domino_url = 'demo2.dominodatalab.com'

# UPDATE: (8) Your DataSource Type
datasource_type = "s3"

ground_truth_url = "https://{}/model-monitor/v2/api/model/{}/register-dataset/ground_truth".format(your_domino_url, model_id)

print('Registering {} From S3 Bucket in DMM'.format(gt_file_name))
 
# create GT payload    
 
# Set up call headers
headers = {
           'X-Domino-Api-Key': API_key,
           'Content-Type': 'application/json'
          }

 
ground_truth_payload = """
{{
    "variables": [{{
    
            "valueType": "categorical",
            "variableType": "ground_truth",
            "name": "{2}", 
            "forPredictionOutput": "{3}"
        
    }}],
    "datasetDetails": {{
            "name": "{0}",
            "datasetType": "file",
            "datasetConfig": {{
                "path": "{0}",
                "fileFormat": "csv"
            }},
            "datasourceName": "{1}",
            "datasourceType": "{4}"
        }}
}}
""".format(gt_file_name, data_source, GT_column_name, target_column_name, datasource_type)
 
# Make api call
ground_truth_response = requests.request("PUT", ground_truth_url, headers=headers, data = ground_truth_payload)
 
# Print response
print(ground_truth_response.text.encode('utf8'))
 
print('DONE!')


### Next Steps

Going forward, Domino will automatically capture all prediction data going across your Model API. It will ingest these predictions for Drift detection once per day. You can set a schedule to determine when this ingest happens.

To periodically upload ground truth labels, repeat the previous step, but without the “variables” in the ground truth payload (this only needs to be done once). As new ground truth labels are added, point Domino to the path to the new labels in the monitoring data source by pinging the same Model Monitoring API:

    ground_truth_payload = """
    {{
        "datasetDetails": {{
            "name": "{0}",
            "datasetType": "file",
            "datasetConfig": {{
                "path": "{0}",
                "fileFormat": "csv"
            }},
            "datasourceName": "{1}",
            "datasourceType": "s3"
        }}
    }}
    """.format(gt_file_name, data_source)
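
The doubled braces in the payload templates are only there because `str.format` treats single braces as placeholders. As an alternative sketch (not how the example scripts are written), the same update payload can be built as a dict and serialized with `json.dumps`, which also escapes any special characters in file names. The function name `ground_truth_update_payload` and the sample file name are hypothetical.

```python
import json


def ground_truth_update_payload(gt_file_name, data_source, datasource_type="s3"):
    """Serialize a datasetDetails-only ground truth payload (no "variables")."""
    payload = {
        "datasetDetails": {
            "name": gt_file_name,
            "datasetType": "file",
            "datasetConfig": {"path": gt_file_name, "fileFormat": "csv"},
            "datasourceName": data_source,
            "datasourceType": datasource_type,
        }
    }
    return json.dumps(payload, indent=4)


ground_truth_payload = ground_truth_update_payload(
    "iris_ground_truth_4_24_2024.csv", "se-demo-bucket"
)
print(ground_truth_payload)
```

The resulting string can be passed as `data=` in the same PUT request to `ground_truth_url` that registers the ground truth dataset.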



### Automation with Domino Jobs
To simulate Domino Model Monitoring over time, you can run the following two scripts as scheduled Domino Jobs:

**(1) daily_scoring.py**

Daily scoring simulates a daily batch scoring script. Data is read in, sent to the Domino Model API, and predictions are returned.
Domino's Prediction Capture Client captures this scoring data, and every 24 hours, it gets ingested into the Drift Monitoring dashboard.

**(2) daily_ground_truth.py**

Daily ground truth simulates uploading actual outcomes after the predictions have been made. A scheduled Domino Job writes the latest ground truth labels to an s3 bucket, then calls the Domino Model Monitoring API with the path to the file with the latest ground truth labels.

If you schedule these two jobs, be sure that ground truth runs after the predictions!