## Model Drift Detection Preparation

This notebook generates the inference data used by the drift detection notebooks.  Run it an hour or a day before the workshop so that participants have inference history available for their training.

This notebook contains only the bare minimum needed for the training.  The rest is covered in the N4_drift_detection.ipynb notebook, which demonstrates using assays for drift detection.

## Steps

* Load the workspace, pipeline, and model versions.
* Perform sample inferences to:
  * Set the baseline.
  * Perform "normal" inferences.
  * Perform inferences that should trigger alerts.

### Import Libraries

The first step is to import the libraries and set the variables used throughout this tutorial.

In [1]:
import wallaroo
from wallaroo.object import EntityNotFoundError
from wallaroo.framework import Framework

import datetime
import time

# used to display DataFrame information without truncating
from IPython.display import display
import pandas as pd
pd.set_option('display.max_colwidth', None)

# ignoring warnings for demonstration
import warnings
warnings.filterwarnings('ignore')

# workspace, pipeline, and model names
workspace_name = 'workshop-workspace-john-05'
main_pipeline_name = 'houseprice-estimator'
model_name_control = 'house-price-prime'

### Connect to the Wallaroo Instance

Next, connect to Wallaroo through the Wallaroo client.  The Python library is included in the Wallaroo install and available through the JupyterHub interface provided with your Wallaroo environment.

This is accomplished using the `wallaroo.Client()` command, which provides a URL to grant the SDK permission to your specific Wallaroo environment.  When displayed, enter the URL into a browser and confirm permissions.  Store the connection into a variable that can be referenced later.

If logging into the Wallaroo instance through the internal JupyterHub service, use `wl = wallaroo.Client()`.  For more information on Wallaroo Client settings, see the [Client Connection guide](https://docs.wallaroo.ai/wallaroo-developer-guides/wallaroo-sdk-guides/wallaroo-sdk-essentials-guide/wallaroo-sdk-essentials-client/).

In [3]:
# Login through local Wallaroo instance

wl = wallaroo.Client()

Please log into the following URL in a web browser:

	https://doc-test.keycloak.wallaroocommunity.ninja/auth/realms/master/device?user_code=VIVM-QUKD

Login successful!


### Retrieve Workspace, Pipeline, and Models

Retrieve the workspace, pipeline, and model that were established in N1_deploy_a_model.ipynb.  Verify that the workspace, pipeline, and model names all match that notebook.

In [5]:
## blank space to log in 

wl = wallaroo.Client()

# retrieve the previous workspace, model, and pipeline version

workspace_name = "workshop-workspace-john-05"

workspace = wl.get_workspace(name=workspace_name, create_if_not_exist=True)

# set your current workspace to the workspace that you just created
wl.set_current_workspace(workspace)

# optionally, examine your current workspace
wl.get_current_workspace()

model_name = 'house-price-prime'

prime_model_version = wl.get_model(model_name)

pipeline_name = 'houseprice-estimator'

pipeline = wl.get_pipeline(pipeline_name)

display(workspace)
display(prime_model_version)
display(pipeline)


{'name': 'workshop-workspace-john-05', 'id': 10, 'archived': False, 'created_by': 'fa780cd9-154a-4456-848b-5934f703fcdb', 'created_at': '2024-03-11T17:58:57.996784+00:00', 'models': [{'name': 'house-price-prime', 'versions': 1, 'owner_id': '""', 'last_update_time': datetime.datetime(2024, 3, 11, 17, 58, 59, 18588, tzinfo=tzutc()), 'created_at': datetime.datetime(2024, 3, 11, 17, 58, 59, 18588, tzinfo=tzutc())}], 'pipelines': [{'name': 'houseprice-estimator', 'create_time': datetime.datetime(2024, 3, 11, 17, 58, 59, 194422, tzinfo=tzutc()), 'definition': '[]'}]}

0,1
Name,house-price-prime
Version,6082ad4c-e034-4bb1-a9e7-dc267b149adc
File Name,xgb_model.onnx
SHA,31e92d6ccb27b041a324a7ac22cf95d9d6cc3aa7e8263a229f7c4aec4938657c
Status,ready
Image Path,
Updated At,2024-11-Mar 17:58:59


0,1
name,houseprice-estimator
created,2024-03-11 17:58:59.194422+00:00
last_updated,2024-03-11 18:15:20.007804+00:00
deployed,True
tags,
versions,"e2c02684-b936-4eaa-ae22-16cc425ac1a7, 90680a22-b46c-4c4c-9c93-cecf87860321, d7ae395c-c5db-41aa-abfa-37aab4050924, e2c920d7-f993-4974-86ff-fdb5230ff590"
steps,house-price-prime


### Deploy Pipeline

Deploy the pipeline with the model version.

In [6]:
pipeline.clear()
pipeline.add_model_step(prime_model_version)

deploy_config = wallaroo.DeploymentConfigBuilder().replica_count(1).cpus(0.5).memory("1Gi").build()
pipeline.deploy(deployment_config=deploy_config)

 ok


0,1
name,houseprice-estimator
created,2024-03-11 17:58:59.194422+00:00
last_updated,2024-03-11 19:09:18.809008+00:00
deployed,True
tags,
versions,"77ad5fac-733b-40aa-8471-2b623731a1c2, e2c02684-b936-4eaa-ae22-16cc425ac1a7, 90680a22-b46c-4c4c-9c93-cecf87860321, d7ae395c-c5db-41aa-abfa-37aab4050924, e2c920d7-f993-4974-86ff-fdb5230ff590"
steps,house-price-prime
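
Deployment can take a short while to finish, so it may help to confirm the pipeline is ready before sending inferences. The following is a minimal sketch under stated assumptions: `wait_until_running` is a hypothetical helper, not part of the Wallaroo SDK, and it assumes the status call returns a dictionary whose `'status'` key reads `'Running'` once the pipeline is ready. Check that response shape against your Wallaroo version. The helper takes any zero-argument callable, so it can be tried without a live instance.

```python
import time

def wait_until_running(get_status, timeout_s=120.0, poll_s=5.0, sleep=time.sleep):
    """Poll get_status() until it reports 'Running' or the timeout elapses.

    get_status: zero-argument callable returning a dict; assumed to carry a
    'status' key (hypothetical shape, modeled on a pipeline status response).
    Returns True if 'Running' was seen, False on timeout.
    """
    waited = 0.0
    while True:
        if get_status().get('status') == 'Running':
            return True
        if waited >= timeout_s:
            return False
        sleep(poll_s)
        waited += poll_s


# Stand-in for a live pipeline: reports 'Starting' twice, then 'Running'.
_responses = iter([{'status': 'Starting'}, {'status': 'Starting'}, {'status': 'Running'}])
ready = wait_until_running(lambda: next(_responses), poll_s=0.0)
print(ready)
```

With a deployed pipeline, this might be called as `wait_until_running(pipeline.status)`, provided the status response matches the assumed shape.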


### Generate Sample Data

Before creating the assays, we must generate data for the assays to build from.

For this example, we will:

* Perform sample inferences based on lower priced houses and use those as our baseline.
* Generate inferences from a specific set of high priced houses to create inference outputs that fall outside the baseline.  This is used in later steps to demonstrate comparing assay analyses against the baseline.
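
To illustrate why the high priced inferences should stand out, here is a minimal sketch using made-up numbers rather than real model output. It measures what fraction of a later window falls above the baseline's 95th percentile. This is a simple stand-in for illustration only and is not the scoring method that assays use; the function name and all values are assumptions.

```python
import random
import statistics

random.seed(0)

# Synthetic stand-ins for model outputs (not real inference results).
baseline = [random.gauss(400_000, 50_000) for _ in range(500)]
normal_window = [random.gauss(400_000, 50_000) for _ in range(1000)]
high_window = [random.gauss(1_200_000, 150_000) for _ in range(1000)]

def fraction_above_baseline(window, baseline, pct=95):
    """Share of window values above the baseline's pct-th percentile."""
    cutoff = statistics.quantiles(baseline, n=100)[pct - 1]
    return sum(v > cutoff for v in window) / len(window)

normal_share = fraction_above_baseline(normal_window, baseline)
high_share = fraction_above_baseline(high_window, baseline)
print(f"normal window: {normal_share:.2%} above cutoff")
print(f"high window:   {high_share:.2%} above cutoff")
```

A window drawn from the same distribution as the baseline should land only a small share of values above the cutoff, while the high priced window should land nearly all of them there.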

#### Inference Results History Generation

To start the demonstration, we'll create a baseline of values from houses with small estimated prices and set that as our baseline.

We will save the beginning and end periods of our baseline data to the variables `assay_baseline_start` and `assay_baseline_end`.

In [10]:
small_houses_inputs = pd.read_json('../data/lowprice.df.json')
baseline_size = 500

# Where the baseline data will start
assay_baseline_start = datetime.datetime.now()

# These inputs will be random samples of small priced houses.  Around 30,000 is a good number
small_houses = small_houses_inputs.sample(baseline_size, replace=True).reset_index(drop=True)

# Wait 60 seconds to set this data apart from the rest
time.sleep(60)
small_results = pipeline.infer(small_houses)

# Set the baseline end

assay_baseline_end = datetime.datetime.now()
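
The start and end bookkeeping above can also be wrapped in a small helper. This is a minimal sketch, where `record_window` is a hypothetical context manager (not part of Wallaroo) that records when a block of work began and ended:

```python
import datetime
from contextlib import contextmanager

@contextmanager
def record_window(window):
    """Store start and end datetimes of the enclosed block in `window`."""
    window['start'] = datetime.datetime.now()
    try:
        yield window
    finally:
        window['end'] = datetime.datetime.now()

baseline_window = {}
with record_window(baseline_window):
    # In the notebook, this is where the wait and pipeline.infer(...) would run.
    total = sum(range(1000))

print(baseline_window['start'] <= baseline_window['end'])
```

The recorded values are naive local datetimes, matching the `datetime.datetime.now()` calls used in this notebook.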

#### Generate Numpy Baseline Values

This process generates a numpy array of the inference results used as baseline data in later steps.

In [11]:
# extract the scalar prediction from each single-element 'out.variable' list
small_results_baseline_df = small_results.copy()
small_results_baseline_df['variable'] = small_results['out.variable'].map(lambda x: x[0])

# convert the scalar column into the numpy array used as the baseline
small_results_baseline = small_results_baseline_df['variable'].to_numpy()
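
To show what this cell does on a small scale, here is a sketch with a hand-made DataFrame standing in for `small_results`. The values are invented for illustration, and the single-element-list layout of `out.variable` is taken from the code above.

```python
import pandas as pd

# Invented stand-in for inference results; each prediction is a one-element list.
sample_results = pd.DataFrame({
    'out.variable': [[450000.0], [512500.0], [389000.0]],
})

# Pull the scalar out of each list, then convert the column to a numpy array.
flat = sample_results['out.variable'].map(lambda x: x[0])
baseline_values = flat.to_numpy()

print(baseline_values.shape)
print(baseline_values.dtype)
```

The result is a flat, one-dimensional array of floats, with one entry per inference.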

#### Assay Test Data

The following will generate inference data for us to test against the assay baseline.  For this, we will add in house data that generate higher house prices than the baseline data we used earlier.

This process should take about 6 minutes to generate the historical data we'll later use in our assays.  We store the datetime `assay_window_start` to mark where our assay analyses start.

In [12]:
# Set the start for our assay window period.
assay_window_start = datetime.datetime.now()

time.sleep(65)
inference_size = 1000

# Get a spread of small house values

small_houses_inputs = pd.read_json('../data/lowprice.df.json', orient="records")
small_houses = small_houses_inputs.sample(inference_size, replace=True).reset_index(drop=True)

pipeline.infer(small_houses)

time.sleep(65)

In [13]:
# Get a spread of large house values

time.sleep(65)
inference_size = 1000

big_houses_inputs = pd.read_json('../data/highprice.df.json', orient="records")
big_houses = big_houses_inputs.sample(inference_size, replace=True).reset_index(drop=True)

pipeline.infer(big_houses)

time.sleep(65)
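
The two cells above follow the same pattern: wait, run a batch, then wait again, which spaces the batches apart in the inference history. Here is a minimal sketch of that pattern as a reusable helper. `run_spaced_batch` is a hypothetical name, and the `infer` argument is any callable; in this notebook it would be `pipeline.infer`.

```python
import time

def run_spaced_batch(infer, batch, gap_s=65, sleep=time.sleep):
    """Wait gap_s seconds, run infer(batch), then wait gap_s seconds again."""
    sleep(gap_s)
    result = infer(batch)
    sleep(gap_s)
    return result

# Dry run with a fake inference function and no real waiting.
waits = []
result = run_spaced_batch(lambda rows: [r * 2 for r in rows], [1, 2, 3],
                          sleep=waits.append)
print(result, waits)
```

Passing `sleep` as a parameter keeps the default 65 second gap for real runs while letting the helper be exercised instantly in a dry run.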

### Undeploy Main Pipeline

With the inference data generated, we will undeploy the main pipeline and return the resources back to the Wallaroo instance.

In [None]:
pipeline.undeploy()

### Store the Assay Values

We will store the following values in local files:

* `small_results_baseline`:  Used to create the baseline from the numpy values from sample inferences.
* `assay_baseline_start`: When to start the baseline from the inference history.
* `assay_baseline_end`: When to end the baseline from the inference history.
* `assay_window_start`: When to start the assay window period for assay samples.

In [15]:
# skip this step if the file is already there

import numpy

numpy.save('./small_results_baseline.npy', small_results_baseline)

In [16]:
baseline_numpy = numpy.load('./small_results_baseline.npy')
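
A quick way to confirm that the saved file reproduces the array is to compare the two. This is a minimal sketch that uses a temporary directory and placeholder values rather than the notebook's real baseline:

```python
import os
import tempfile
import numpy as np

placeholder_baseline = np.array([450000.0, 512500.0, 389000.0])  # invented values

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'small_results_baseline.npy')
    np.save(path, placeholder_baseline)
    reloaded = np.load(path)

matches = np.array_equal(placeholder_baseline, reloaded)
print(matches)
```

In the notebook itself, the equivalent check would compare `small_results_baseline` against `baseline_numpy`.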

In [22]:
with open('./assay_baseline_start', 'w') as file:
    file.write(assay_baseline_start.strftime("%d-%b-%Y (%H:%M:%S.%f)"))
assay_baseline_start

datetime.datetime(2024, 3, 11, 19, 14, 59, 997423)

In [23]:
with open('./assay_baseline_end', 'w') as file:
    file.write(assay_baseline_end.strftime("%d-%b-%Y (%H:%M:%S.%f)"))
assay_baseline_end

datetime.datetime(2024, 3, 11, 19, 16, 0, 128604)

In [24]:
with open('./assay_window_start', 'w') as file:
    file.write(assay_window_start.strftime("%d-%b-%Y (%H:%M:%S.%f)"))
assay_window_start

datetime.datetime(2024, 3, 11, 19, 16, 0, 149220)

In [25]:
# read the assay baseline start datetime

with open('./assay_baseline_start', 'r') as file:
    assay_baseline_start_test = datetime.datetime.strptime(file.read(), "%d-%b-%Y (%H:%M:%S.%f)")
assay_baseline_start_test

datetime.datetime(2024, 3, 11, 19, 14, 59, 997423)

In [26]:
# read the assay baseline end datetime

with open('./assay_baseline_end', 'r') as file:
    assay_baseline_end_test = datetime.datetime.strptime(file.read(), "%d-%b-%Y (%H:%M:%S.%f)")
assay_baseline_end_test

datetime.datetime(2024, 3, 11, 19, 16, 0, 128604)

In [27]:
# read the assay window start datetime

with open('./assay_window_start', 'r') as file:
    assay_window_start_test = datetime.datetime.strptime(file.read(), "%d-%b-%Y (%H:%M:%S.%f)")
assay_window_start_test

datetime.datetime(2024, 3, 11, 19, 16, 0, 149220)
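
The three write and read pairs above share one format string. Here is a minimal sketch that factors them into two helpers. `save_timestamp` and `load_timestamp` are hypothetical names, and the example writes to a temporary directory rather than the notebook's working directory.

```python
import datetime
import os
import tempfile

TIMESTAMP_FORMAT = "%d-%b-%Y (%H:%M:%S.%f)"  # same format string as above

def save_timestamp(path, value):
    """Write a datetime to path as text."""
    with open(path, 'w') as file:
        file.write(value.strftime(TIMESTAMP_FORMAT))

def load_timestamp(path):
    """Read a datetime previously written by save_timestamp."""
    with open(path, 'r') as file:
        return datetime.datetime.strptime(file.read(), TIMESTAMP_FORMAT)

original = datetime.datetime(2024, 3, 11, 19, 14, 59, 997423)
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'assay_baseline_start')
    save_timestamp(path, original)
    restored = load_timestamp(path)

print(restored == original)
```

Because `%b` writes the month abbreviation in the current locale, the files should be read under the same locale they were written in.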