## NBIC Workflow Provena Toy Example - Redux

This reworks the original NBIC Workflow Provena Toy Example, embracing a number of principles for capturing provenance in use cases where users may not have the time or knowledge to build and manage detailed provenance records.

It relies on nbic.py, a very rough prototype and mock of utilities for automating provenance record creation. Fundamental to it is the create_or_fetch pattern, which proposes that there is often enough information to create a dataset_template, dataset, model_run_template and model run from information implicit in user activities or from very minimal user input. The create_or_fetch pattern, when fully implemented, would fuzzy search existing entities and offer suggestions so that users can select an existing entity or, when no match is found, automatically create the missing entity.
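
As a rough illustration of the create_or_fetch idea, the sketch below fuzzy matches a display name against an in-memory store and only creates a record when nothing is close enough. Everything in it is an assumption made for illustration: `InMemoryRegistry`, the `mock/…` ids, the record fields, the 0.8 threshold and the `choose` callback are hypothetical and are not the Provena registry API or the code in nbic.py.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher
from itertools import count


@dataclass
class InMemoryRegistry:
    """Hypothetical stand-in for a registry; NOT the Provena API."""
    records: dict = field(default_factory=dict)
    _ids: count = field(default_factory=lambda: count(1))

    def create(self, display_name: str, subtype: str) -> dict:
        # Mint a fake id and store a minimal record.
        record = {
            "id": f"mock/{next(self._ids)}",
            "display_name": display_name,
            "item_subtype": subtype,
        }
        self.records[record["id"]] = record
        return record

    def search(self, display_name: str, subtype: str, threshold: float = 0.8) -> list:
        """Return records of the given subtype, best fuzzy match first."""
        scored = []
        for record in self.records.values():
            if record["item_subtype"] != subtype:
                continue
            score = SequenceMatcher(
                None, display_name.lower(), record["display_name"].lower()
            ).ratio()
            if score >= threshold:
                scored.append((score, record))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [record for _, record in scored]


def create_or_fetch(registry, display_name, subtype, choose=lambda matches: matches[0]):
    """Fetch a matching record if one exists, otherwise create it.

    `choose` stands in for the user picking one of the suggestions.
    Returns (record, created) where `created` is True for a new record.
    """
    matches = registry.search(display_name, subtype)
    if matches:
        return choose(matches), False
    return registry.create(display_name, subtype), True
```

In an interactive setting the `choose` callback could present the suggestions to the user; defaulting to the best match keeps the sketch non-interactive.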

In lieu of fully implementing the create_or_fetch pattern, a number of resources have been created directly in the Provena interface for use in mocks in this workflow. For example, the entity https://hdl.handle.net/10378.1/1741058 represents a generic dataset template with a display name of wind and a simple deferred resource with a key of "path". It would be possible to create this resource from a call like p.register_input_dataset('s3://nbic1-stage-shared-artifacts/nbic-stage1/weather/projected/AU_hourly_temperature_C.zarr', 'wind_speed'). Furthermore, the corresponding dataset record, also created for use in mocks at https://hdl.handle.net/10378.1/1740705, could have been created automatically with minimal metadata using the S3 path passed in the register_input_dataset call.
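
To make "minimal metadata from the S3 path" concrete, here is a small sketch of what could be derived from the path alone. The function name and the output field names (`display_name`, `bucket`, `key`, `format`, `resources`) are hypothetical and chosen for illustration; they are not the Provena dataset schema.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def minimal_dataset_metadata(s3_uri: str) -> dict:
    """Derive a minimal, hypothetical metadata record from an S3 URI."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {s3_uri!r}")

    key = parsed.path.lstrip("/")
    name = PurePosixPath(key)
    return {
        # Human-readable name guessed from the file name.
        "display_name": name.stem.replace("_", " "),
        "bucket": parsed.netloc,
        "key": key,
        # File extension without the leading dot, e.g. "zarr".
        "format": name.suffix.lstrip("."),
        # Mirrors the single deferred resource keyed "path" described above.
        "resources": {"path": s3_uri},
    }


metadata = minimal_dataset_metadata(
    "s3://nbic1-stage-shared-artifacts/nbic-stage1/weather/projected/AU_hourly_temperature_C.zarr"
)
```

A record like this is thin, but it is enough to identify the data later and can be enriched once someone has time to review it.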

Similarly, model run workflow templates and model runs themselves can be automatically generated. In nbic.py, model_run_payload best demonstrates the automatic creation of records from minimal user input.
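
The following sketch shows one way a workflow template and a model run record might be derived from the inputs and outputs that were already registered, with the description as the only extra user input. The function names and dictionary fields are hypothetical; the real payload expected by Provena and the real model_run_payload in nbic.py will differ.

```python
def build_workflow_template(display_name, inputs, outputs):
    """Derive a hypothetical workflow template from registered datasets.

    `inputs` and `outputs` are lists of dicts with the keys
    'dataset_id' and 'template_id'.
    """
    def unique_template_ids(entries):
        # Preserve order while dropping repeated template ids.
        seen = []
        for entry in entries:
            if entry["template_id"] not in seen:
                seen.append(entry["template_id"])
        return seen

    return {
        "display_name": display_name,
        "input_template_ids": unique_template_ids(inputs),
        "output_template_ids": unique_template_ids(outputs),
    }


def build_model_run(description, workflow_template_id, inputs, outputs,
                    start_time, end_time):
    """Assemble a hypothetical model run record from what is already known."""
    return {
        "description": description,
        "workflow_template_id": workflow_template_id,
        "inputs": [dict(entry) for entry in inputs],
        "outputs": [dict(entry) for entry in outputs],
        "start_time": start_time,
        "end_time": end_time,
    }
```

The point of the sketch is that nothing beyond the description has to be typed by the user; the rest is collected as a side effect of registering paths.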

All functionality is encapsulated in a provenance object that stores inputs and outputs for later use, when model run templates and model run records need to be generated. The provenance object also captures a default start and end time and would, if fully implemented, guess the organisation and person associated with the workflow.
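
A minimal sketch of such an object is shown below, assuming that the start time is taken at construction and the end time when the summary is requested. `ProvenanceSketch` and its `summary` method are hypothetical names for illustration; the actual Provenance class in nbic.py also talks to the registry, which is omitted here.

```python
import time
from dataclasses import dataclass, field


@dataclass
class ProvenanceSketch:
    """Hypothetical, registry-free illustration of the provenance object."""
    # Creating the object starts the timer.
    start_time: int = field(default_factory=lambda: int(time.time()))
    inputs: dict = field(default_factory=dict)
    outputs: dict = field(default_factory=dict)

    def register_input_dataset(self, path: str, name: str) -> str:
        # Remember the input and hand the path straight back, so the call
        # can wrap a path wherever the path itself would have been used.
        self.inputs[name] = path
        return path

    def register_output_dataset(self, path: str, name: str) -> str:
        self.outputs[name] = path
        return path

    def summary(self, description: str) -> dict:
        # The end time defaults to "now", i.e. when the record is requested.
        return {
            "description": description,
            "start_time": self.start_time,
            "end_time": int(time.time()),
            "inputs": dict(self.inputs),
            "outputs": dict(self.outputs),
        }
```

Returning the path unchanged from the register methods is what keeps the wrapping minimal: existing code that takes a path only needs the extra call around it.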

The quality of provenance records generated through this workflow may vary. There are numerous ways to improve these records if that is required. Dataset, dataset template, model run workflow template and model run records could all be edited post hoc if Provena allows this. Further enhancements might include marking these records as "draft" using automated annotations and filtering them out until they are improved or reviewed by the creator or a provenance manager. Quick capture of provenance matters because important data products are often produced under pressure, important variations on workflows may be lost when the provenance overhead is too high, and science experts may struggle to comprehend and implement detailed provenance requirements.
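
The "draft" idea could look something like the sketch below, which tags automatically created records and hides them from default listings until someone reviews them. The `annotations` field, the `review_status` key and the function names are hypothetical; whether and how Provena supports annotations of this kind is not confirmed here.

```python
import copy

DRAFT = "draft"
REVIEWED = "reviewed"


def annotate_as_draft(record: dict) -> dict:
    """Return a copy of the record tagged as an automatically created draft."""
    annotated = copy.deepcopy(record)
    annotated.setdefault("annotations", {})["review_status"] = DRAFT
    return annotated


def mark_reviewed(record: dict, reviewer: str) -> dict:
    """Return a copy of the record marked as reviewed by `reviewer`."""
    reviewed = copy.deepcopy(record)
    annotations = reviewed.setdefault("annotations", {})
    annotations["review_status"] = REVIEWED
    annotations["reviewed_by"] = reviewer
    return reviewed


def visible_records(records, include_drafts: bool = False) -> list:
    """Filter out drafts unless they are explicitly requested."""
    return [
        record for record in records
        if include_drafts
        or record.get("annotations", {}).get("review_status") != DRAFT
    ]
```

Records without any annotation are treated as visible, so manually curated records are unaffected by the filter.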

In [16]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [17]:
# This is a small helper class which provides a config object for validation and
# a loader function
import workflow_config

# this contains helpers for interacting with the registry
import registry

# This is a helper function for managing authentication with Provena
import mdsisclienttools.auth.TokenManager as ProvenaAuth


import json
import time
import requests
import os

In [18]:
import nbic
from nbic import *

## The Provenance object (creating it starts a timer)

In [31]:
p = Provenance()

In [None]:
## Authorization
provena_auth = ProvenaAuth.DeviceFlowManager(
    stage=stage,
    keycloak_endpoint=kc_endpoint
)

# expose the get auth function which is used for provena methods 
get_auth = provena_auth.get_auth

No storage or object provided, using default location: .tokens.json.
Using storage type: FILE.
Using DEVICE auth flow.
Attempting to generate authorisation tokens.

Looking for existing tokens in local storage.

Validating found tokens

Trying to use found tokens to refresh the access token.

Tokens found in storage but they are not valid.

Initiating device auth flow to generate access and refresh tokens.

Decoding response

Please authorise using the following endpoint.

Verification URL: https://auth.dev.provena.nbic.cloud/auth/realms/provena/device?user_code=UFKT-HWLU
User Code: UFKT-HWLU
[?1l>4;1H[2J[?47l8TPS connection to auth.dev.provena.nbic.cloud[m[m                   [22;39H[m[m                           [2;1H                                                                                [3;1H                                                                                [4;1H                                                                                [5;1H  

In [23]:
# For mocks - we shouldn't need config at this point in a full implementation
config_path = "./configs/example_workflow.json"
config = workflow_config.load_config(path=config_path)

In [24]:
# For mocks - we shouldn't need config at this point in a full implementation
nbic.config = config
nbic.registry = registry
nbic.get_auth = get_auth

## Minimal wrapping of paths to register inputs

In [25]:
## The provenance object's register_input_dataset either finds and retrieves existing dataset records and dataset template records, or creates them
## In the current implementation this happens immediately, but creation could be delayed until model run registration if we didn't want to clog up the workflow
temperature_path = p.register_input_dataset('s3://nbic1-stage-shared-artifacts/nbic-stage1/weather/projected/AU_hourly_temperature_C.zarr', 'temperature')
humidity_path = p.register_input_dataset('s3://nbic1-stage-shared-artifacts/nbic-stage1/weather/projected/AU_hourly_temperature_C.zarr', 'humidity')
wind_speed_path = p.register_input_dataset('s3://nbic1-stage-shared-artifacts/nbic-stage1/weather/projected/AU_hourly_temperature_C.zarr', 'wind_speed')
mc_adf_path = p.register_input_dataset('s3://nbic1-stage-shared-artifacts/nbic-stage1/weather/projected/AU_hourly_temperature_C.zarr', 'mc_adf_path')

Fetching from registry, id: 10378.1/1740705...
Fetching from registry, id: 10378.1/1741057...
Fetching from registry, id: 10378.1/1740705...
Fetching from registry, id: 10378.1/1741060...
Fetching from registry, id: 10378.1/1740705...
Fetching from registry, id: 10378.1/1741058...
Fetching from registry, id: 10378.1/1740705...
Fetching from registry, id: 10378.1/1741059...


### Running our fake model

In [26]:
## This is similar to what happened in the original notebook, except that we register an output in the same way as the inputs above
# open each input from its own registered path (in this mock all four point at the same store)
temperature = xr.open_zarr(temperature_path)
humidity = xr.open_zarr(humidity_path)
wind_speed = xr.open_zarr(wind_speed_path)
mc_adf = xr.open_zarr(mc_adf_path)

def fake_model(temperature, humidity, wind_speed, mc_adf):
    # stand-in for a model that does some heavy lifting; here it just sleeps for 1 second
    time.sleep(1)
    # mock only: the inputs are ignored and the xr object is returned as a placeholder result
    return xr

# run the model
fake_model_output = fake_model(
    temperature=temperature,
    humidity=humidity,
    wind_speed=wind_speed,
    mc_adf=mc_adf
)

fake_model_output_path = p.register_output_dataset("s3://nbic1-stage-shared-artifacts/nbic-stage1/weather/projected/AU_FFDI.zarr", 'ffdi')
fake_model_output.to_zarr(fake_model_output_path)

Fetching from registry, id: 10378.1/1740705...
Fetching from registry, id: 10378.1/1741061...


In [None]:
# Uncomment if you want to see the template, internally the template would be fetched or created from inputs/outputs
# p._create_or_fetch_workflow_template_record()

## Generate the model run record

In [None]:
# generate the model run record, passing in only a description
model_run_payload = p.model_run_payload("hourly FFDI")
model_run_payload

In [29]:
## Validation is skipped here
# If the items aren't valid, presumably we would still want to register them with a warning.
# This workflow is valid by definition.

#if not valid:
#    print("FAILED VALIDATION")
#    raise Exception("Workflow config validation exception occurred. See output above.")

In [30]:
# Registering the model run 
endpoint = provenance_endpoint + "/model_run/register_complete"
payload = model_run_payload

# send off request
print("Registering model run")
response = requests.post(url=endpoint, json=payload, auth=get_auth())

# use helper function to check response
registry.check_response(response=response, status_check=True)

response_content = response.json()
method_two_record_info = response_content["record_info"]
method_two_record_info

Registering model run


{'id': '10378.1/1741747',
 'prov_json': '{"prefix": {"default": "http://hdl.handle.net/"}, "activity": {"10378.1/1741747": {"model_run/10378.1/1741747": true, "item_category": "ACTIVITY", "item_subtype": "MODEL_RUN"}}, "entity": {"10378.1/1740705": {"model_run/10378.1/1741747": true, "item_category": "ENTITY", "item_subtype": "DATASET"}, "10378.1/1741046": {"model_run/10378.1/1741747": true, "item_category": "ENTITY", "item_subtype": "MODEL_RUN_WORKFLOW_TEMPLATE", "prov:type": {"$": "prov:Collection", "type": "prov:QUALIFIED_NAME"}}, "10378.1/1741057": {"model_run/10378.1/1741747": true, "item_category": "ENTITY", "item_subtype": "DATASET_TEMPLATE"}, "10378.1/1741060": {"model_run/10378.1/1741747": true, "item_category": "ENTITY", "item_subtype": "DATASET_TEMPLATE"}, "10378.1/1741058": {"model_run/10378.1/1741747": true, "item_category": "ENTITY", "item_subtype": "DATASET_TEMPLATE"}, "10378.1/1741059": {"model_run/10378.1/1741747": true, "item_category": "ENTITY", "item_subtype": "DATA