## Statsmodel Forecast with Wallaroo Features: ML Workload Orchestration

Wallaroo provides Data Connections and ML Workload Orchestrations so that organizations can create and manage automated tasks that run either on demand or on a regular schedule.

## Prerequisites

* A Wallaroo instance version 2023.2.1 or greater.

## References

* [Wallaroo SDK Essentials Guide: Model Uploads and Registrations: Python Models](https://docs.wallaroo.ai/wallaroo-developer-guides/wallaroo-sdk-guides/wallaroo-sdk-essentials-guide/wallaroo-sdk-model-uploads/wallaroo-sdk-model-upload-python/)
* [Wallaroo SDK Essentials Guide: Pipeline Management](https://docs.wallaroo.ai/wallaroo-developer-guides/wallaroo-sdk-guides/wallaroo-sdk-essentials-guide/wallaroo-sdk-essentials-pipelines/wallaroo-sdk-essentials-pipeline/)
* [Wallaroo SDK Essentials Guide: ML Workload Orchestration](https://docs.wallaroo.ai/wallaroo-developer-guides/wallaroo-sdk-guides/wallaroo-sdk-essentials-guide/wallaroo-sdk-essentials-ml-workload-orchestration/)

## Orchestrations, Tasks, and Task Runs

We've detailed how Wallaroo Connections work.  Now we'll use Orchestrations, Tasks, and Task Runs.

| Item | Description |
|---|---|
| Orchestration | ML Workload orchestration allows data scientists and ML Engineers to automate and scale production ML workflows in Wallaroo to ensure a tight feedback loop and continuous tuning of models from training to production. Wallaroo platform users (data scientists or ML Engineers) have the ability to deploy, automate and scale recurring batch production ML workloads that can ingest data from predefined data sources to run inferences in Wallaroo, chain pipelines, and send inference results to predefined destinations to analyze model insights and assess business outcomes. |
| Task | An implementation of an Orchestration.  Tasks are one of two types.  `Run Once`: the task runs once and stops upon completion.  `Run Scheduled`: the task runs whenever a specific `cron`-like schedule is reached.  Scheduled tasks run until the `kill` command is issued. |
| Task Run | The execution of a task.  A `Run Once` task has only one Task Run.  A `Run Scheduled` task has multiple Task Runs, one for every time the schedule parameter is met.  Task Runs have their own log files that can be examined to track progress and results. |
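
To make the idea of a `cron`-like schedule concrete, the following is a minimal sketch of how a five-field cron expression can be matched against a point in time.  It assumes the standard `minute hour day-of-month month day-of-week` layout; the exact schedule syntax Wallaroo accepts is not shown in this section, and the helper names `field_matches` and `cron_matches` are hypothetical, not part of the Wallaroo SDK.

```python
from datetime import datetime


def field_matches(field: str, value: int) -> bool:
    """Return True if one cron field matches the given value.

    Only "*", "*/n", single numbers, and comma-separated lists are handled.
    """
    for part in field.split(","):
        if part == "*":
            return True
        if part.startswith("*/"):
            if value % int(part[2:]) == 0:
                return True
        elif int(part) == value:
            return True
    return False


def cron_matches(expr: str, when: datetime) -> bool:
    """Simplified check: all five fields must match the given time."""
    minute, hour, day, month, weekday = expr.split()
    # cron counts Sunday as 0; isoweekday() counts Sunday as 7
    cron_weekday = when.isoweekday() % 7
    return (field_matches(minute, when.minute)
            and field_matches(hour, when.hour)
            and field_matches(day, when.day)
            and field_matches(month, when.month)
            and field_matches(weekday, cron_weekday))


# "every five minutes" matches 21:15 but not 21:17
print(cron_matches("*/5 * * * *", datetime(2023, 8, 2, 21, 15)))
print(cron_matches("*/5 * * * *", datetime(2023, 8, 2, 21, 17)))
```

Each time the schedule matches, a `Run Scheduled` task would produce a new Task Run.  Real cron implementations support more syntax (ranges, names) than this sketch.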

## Statsmodel Forecast Connection Steps

### Import Libraries

The first step is to import the libraries that we will need.

In [1]:
import json
import os
import datetime

import wallaroo
from wallaroo.object import EntityNotFoundError
from wallaroo.framework import Framework

# used to display dataframe information without truncating
from IPython.display import display
import pandas as pd
import numpy as np

from resources import simdb
from resources import util

pd.set_option('display.max_colwidth', None)

import time

In [2]:
display(wallaroo.__version__)

'2023.2.1+f07257bc2'

### Connect to the Wallaroo Instance

The next step is to connect to Wallaroo through the Wallaroo client.  The Python library is included in the Wallaroo install and available through the Jupyter Hub interface provided with your Wallaroo environment.

This is accomplished using the `wallaroo.Client()` command, which provides a URL to grant the SDK permission to your specific Wallaroo environment.  When displayed, enter the URL into a browser and confirm permissions.  Store the connection into a variable that can be referenced later.

If logging into the Wallaroo instance through the internal JupyterHub service, use `wl = wallaroo.Client()`.  For more information on Wallaroo Client settings, see the [Client Connection guide](https://docs.wallaroo.ai/wallaroo-developer-guides/wallaroo-sdk-guides/wallaroo-sdk-essentials-guide/wallaroo-sdk-essentials-client/).

In [3]:
# Login through local Wallaroo instance

# wl = wallaroo.Client()

wallarooPrefix = "doc-test."
wallarooSuffix = "wallaroocommunity.ninja"

wl = wallaroo.Client(api_endpoint=f"https://{wallarooPrefix}api.{wallarooSuffix}", 
                    auth_endpoint=f"https://{wallarooPrefix}keycloak.{wallarooSuffix}", 
                    auth_type="sso")

### Set Configurations

The following will set the workspace, model name, and pipeline that will be used for this example.  If the workspace or pipeline already exist, then they will be assigned for use in this example.  If they do not exist, they will be created based on the names listed below.

Workspace names must be unique.  To allow this tutorial to run in the same Wallaroo instance for multiple users, set the `suffix` variable or share the workspace with other users.

#### Set Configurations References

* [Wallaroo SDK Essentials Guide: Workspace Management](https://docs.wallaroo.ai/wallaroo-developer-guides/wallaroo-sdk-guides/wallaroo-sdk-essentials-guide/wallaroo-sdk-essentials-workspace/)
* [Wallaroo SDK Essentials Guide: Pipeline Management](https://docs.wallaroo.ai/wallaroo-developer-guides/wallaroo-sdk-guides/wallaroo-sdk-essentials-guide/wallaroo-sdk-essentials-pipelines/wallaroo-sdk-essentials-pipeline/)

In [4]:
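# used to make workspace names unique per user

import string
import random

# generate a random four-letter suffix
suffix= ''.join(random.choice(string.ascii_lowercase) for i in range(4))

# fixed suffix used for this run; comment out the next line to keep the random suffix
suffix='john'

workspace_name = f'forecast-model-workshop{suffix}'

pipeline_name = 'forecast-workshop-pipeline'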

### Set the Workspace and Pipeline

The workspace will be either used or created if it does not exist, along with the pipeline.  The models uploaded in the Upload and Deploy tutorial are referenced in this step.

In [5]:
def get_workspace(name):
    workspace = None
    for ws in wl.list_workspaces():
        if ws.name() == name:
            workspace= ws
    if workspace is None:
        workspace = wl.create_workspace(name)
    return workspace

def get_pipeline(name):
    try:
        pipeline = wl.pipelines_by_name(name)[0]
    except EntityNotFoundError:
        pipeline = wl.build_pipeline(name)
    return pipeline

workspace = get_workspace(workspace_name)

wl.set_current_workspace(workspace)

pipeline = get_pipeline(pipeline_name)

# Get the most recent version of a model in the workspace
# Assumes that the most recent version is the first in the list of versions.
# wl.get_current_workspace().models() returns a list of models in the current workspace

def get_model(mname):
    modellist = wl.get_current_workspace().models()
    model = [m.versions()[0] for m in modellist if m.name() == mname]
    if len(model) <= 0:
        raise KeyError(f"model {mname} not found in this workspace")
    return model[0]

# names of the three models:  the control and two challengers

control_model_name = 'forecast-control-model'
challenger01_model_name = 'forecast-challenger01-model'
challenger02_model_name = 'forecast-challenger02-model'

# retrieve the models

bike_day_model = get_model(control_model_name)

challenger_model_01 = get_model(challenger01_model_name)

challenger_model_02 = get_model(challenger02_model_name)

### Deploy Pipeline

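The pipeline is already set with the model.  For our demo we'll redeploy it: undeploy and clear any previous steps, add the control model as a pipeline step, then deploy with a new deployment configuration.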

In [6]:
# Set the deployment to allow for additional engines to run
# Undeploy and clear the pipeline in case it was used in other demonstrations
pipeline.undeploy()
pipeline.clear()
deploy_config = (wallaroo.DeploymentConfigBuilder()
                        .replica_count(1)
                        .replica_autoscale_min_max(minimum=2, maximum=5)
                        .cpus(0.25)
                        .memory("512Mi")
                        .build()
                    )

pipeline.add_model_step(bike_day_model)
# pipeline.add_model_step(step)
pipeline.deploy(deployment_config = deploy_config)

 ok
Waiting for deployment - this will take up to 45s ............................ ok


| | |
|---|---|
| name | forecast-workshop-pipeline |
| created | 2023-08-02 15:50:59.480547+00:00 |
| last_updated | 2023-08-02 21:16:55.320303+00:00 |
| deployed | True |
| tags | |
| versions | f8188956-8b3e-4479-8b15-e8747fe915a6, 33e5cc2c-2bb2-4dc2-8a9e-c058e60f6163, 5d419693-97cc-461b-b72a-a389ab7a001b, 56c78f52-cba5-415c-913a-fee0e1863a90, a109a040-c8f2-46dc-8c0b-373ae10d4fa0, dcaec327-1358-42a7-88de-931602a42a72, debc509f-9481-464b-af7f-5c3138a9cdb4, b0d167aa-cc98-440a-8e85-1ae3f089745a, d9e69c40-c83b-48af-b6b9-caafcb85f08b, 186ffdd2-3a8f-40cc-8362-13cc20bd2f46, 535e6030-ebe5-4c79-b5cd-69b161637a99, c5c0218a-800b-4235-8767-64d18208e68a, 4559d934-33b0-4872-a788-4ef27f554482, 94d3e20b-add7-491c-aedd-4eb094a8aebf, ab4e58bf-3b75-4bf6-b6b3-f703fe61e7af, 3773f5c5-e4c5-4e46-a839-6945af15ca13, 3abf03dd-8eab-4a8d-8432-aa85a30c0eda, 5ec5e8dc-7492-498b-9652-b3733e4c87f7, 1d89287b-4eff-47ec-a7bb-8cedaac1f33f |
| steps | forecast-control-model |


### Forecast Sample Orchestration

The orchestration that will automate this process is `./resources/forecast-orchestration.zip`.  The files used are stored in the directory `forecast-orchestration`, created with the command:

`zip -r forecast-orchestration.zip forecast-orchestration/`.

This contains the following:

* `requirements.txt`:  The Python requirements file that specifies the libraries the orchestration needs.  For this example, it is empty.
* `main.py`: The entry file that uses a deployed pipeline and performs an inference request against it.  The results are visible in the task run's log files.
* `data/testdata_dict.json`: An inference input file.
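
As an alternative to the `zip -r` command, an archive with the same layout can be built from Python.  The sketch below uses only the standard library and runs against a temporary directory filled with placeholder files; the helper name `zip_directory` and the placeholder file contents are assumptions for illustration.  Unlike `zip -r`, it stores file entries only, not separate directory entries.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_directory(source_dir: Path, zip_path: Path) -> list:
    """Zip source_dir recursively, keeping its name as the top-level folder."""
    names = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for file in sorted(source_dir.rglob("*")):
            if file.is_file():
                # store paths relative to the parent, e.g. forecast-orchestration/main.py
                arcname = file.relative_to(source_dir.parent).as_posix()
                zf.write(file, arcname)
                names.append(arcname)
    return names


# Demonstration with placeholder files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "forecast-orchestration"
    (root / "data").mkdir(parents=True)
    (root / "requirements.txt").write_text("")
    (root / "main.py").write_text("print('placeholder')\n")
    (root / "data" / "testdata_dict.json").write_text("{}")
    archived = zip_directory(root, Path(tmp) / "forecast-orchestration.zip")

print(archived)
```

Pointing `zip_directory` at the real `forecast-orchestration` directory would produce an archive comparable to the one created on the command line.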

The `main.py` script performs a workspace and pipeline retrieval, then an inference against the inference input file.

```python
import json
import os


import wallaroo
from wallaroo.object import EntityNotFoundError
from wallaroo.framework import Framework

import pandas as pd
import numpy as np

pd.set_option('display.max_colwidth', None)

wl = wallaroo.Client()

# get the arguments
arguments = wl.task_args()

if "workspace_name" in arguments:
    workspace_name = arguments['workspace_name']
else:
    workspace_name="multiple-replica-forecast-tutorial"

if "pipeline_name" in arguments:
    pipeline_name = arguments['pipeline_name']
else:
    pipeline_name="bikedaypipe"

if "bigquery_connection_input_name" in arguments:
    bigquery_connection_name = arguments['bigquery_connection_input_name']
else:
    bigquery_connection_name = "statsmodel-bike-rentals"

print(bigquery_connection_name)
def get_workspace(name):
    workspace = None
    for ws in wl.list_workspaces():
        if ws.name() == name:
            workspace= ws
    return workspace

def get_pipeline(name):
    try:
        pipeline = wl.pipelines_by_name(name)[0]
    except EntityNotFoundError:
        print(f"Pipeline not found:{name}")
        # re-raise so the task run fails here instead of on an undefined variable
        raise
    return pipeline

print(f"Workspace: {workspace_name}")
workspace = get_workspace(workspace_name)

wl.set_current_workspace(workspace)
print(workspace)

# the pipeline is assumed to be deployed
print(f"Pipeline: {pipeline_name}")
pipeline = get_pipeline(pipeline_name)
print(pipeline)

# read the bundled inference input, closing the file once it is loaded
with open("./data/testdata_dict.json") as f:
    inferencedata = json.load(f)

results = pipeline.infer(inferencedata)

print(results)
```

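This orchestration allows a user to specify the workspace, pipeline, and data connection through task arguments.  The script reads the connection name but does not use it for this inference, which comes from the bundled input file.  As long as the workspace and pipeline exist and the pipeline is deployed, the orchestration will run successfully.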
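
The argument handling in `main.py` follows a "use the supplied value, else a default" pattern.  A more compact way to express it is sketched below with `dict.get`; the plain dictionary stands in for the value returned by `wl.task_args()`, and the helper name `resolve_args` is hypothetical.  The default values are the ones that appear in the script above.

```python
DEFAULTS = {
    "workspace_name": "multiple-replica-forecast-tutorial",
    "pipeline_name": "bikedaypipe",
    "bigquery_connection_input_name": "statsmodel-bike-rentals",
}


def resolve_args(arguments: dict, defaults: dict = DEFAULTS) -> dict:
    """Return the defaults, overridden by any matching keys in arguments."""
    return {key: arguments.get(key, default) for key, default in defaults.items()}


# Stand-in for wl.task_args(): only the pipeline name is supplied
resolved = resolve_args({"pipeline_name": "forecast-workshop-pipeline"})
print(resolved)
```

Keys that are not listed in `DEFAULTS` are ignored by this sketch, so a misspelled argument name falls back to the default without any warning.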

### Upload the Orchestration

Orchestrations are uploaded with the Wallaroo client `upload_orchestration(path)` method with the following parameters.

| Parameter | Type | Description |
| --- | --- | ---|
| **path** | string (Required) | The path to the ZIP file to be uploaded. |

Once uploaded, the deployment will be prepared and any requirements will be downloaded and installed.


For this example, we first run a sample inference to confirm that the deployed pipeline responds.  Then the orchestration `./forecast-orchestration/forecast-orchestration.zip` will be uploaded and saved to the variable `orchestration`, and we will loop until the uploaded orchestration's `status` displays `ready`.

In [7]:
inferencedata = pd.read_json("./data/testdata_standard.df.json")
display(inferencedata)

results = pipeline.infer(inferencedata)

display(results)

| | count |
|---|---|
| 0 | [1526, 1550, 1708, 1005, 1623, 1712, 1530, 1605, 1538, 1746, 1472, 1589, 1913, 1815, 2115, 2475, 2927, 1635, 1812, 1107, 1450, 1917, 1807, 1461, 1969, 2402, 1446, 1851] |

| | time | in.count | out.forecast | out.weekly_average | check_failures |
|---|---|---|---|---|---|
| 0 | 2023-08-02 21:17:32.990 | [1526, 1550, 1708, 1005, 1623, 1712, 1530, 1605, 1538, 1746, 1472, 1589, 1913, 1815, 2115, 2475, 2927, 1635, 1812, 1107, 1450, 1917, 1807, 1461, 1969, 2402, 1446, 1851] | [1764, 1749, 1743, 1741, 1740, 1740, 1740] | 1745.285714 | 0 |
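
In this result, `out.weekly_average` appears to be the arithmetic mean of the seven values in `out.forecast`.  That is an inference from the numbers shown, not something stated about the model itself, and it can be checked directly:

```python
# Values copied from the out.forecast column of the inference result
forecast = [1764, 1749, 1743, 1741, 1740, 1740, 1740]

weekly_average = sum(forecast) / len(forecast)
print(round(weekly_average, 6))  # 1745.285714
```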


In [8]:
orchestration = wl.upload_orchestration(name="statsmodel-orchestration 6", path="./forecast-orchestration/forecast-orchestration.zip")

while orchestration.status() != 'ready':
    print(orchestration.status())
    time.sleep(5)

pending_packaging
pending_packaging
packaging
packaging
packaging
packaging
packaging
packaging
packaging
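
The loop above runs until the status is `ready`, so it would never exit if packaging stalled or failed.  The following is a minimal sketch of a polling helper with a timeout and optional failure states.  The helper name `wait_for_status` and the `"error"` state are hypothetical; the statuses replayed in the demonstration are the ones printed above.

```python
import time


def wait_for_status(get_status, target, failure_states=(), timeout=300.0,
                    interval=5.0):
    """Call get_status() until it returns target, a failure state, or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status == target:
            return status
        if status in failure_states:
            raise RuntimeError(f"reached failure state: {status}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {status!r} after {timeout} seconds")
        time.sleep(interval)


# Stand-in for orchestration.status, replaying the statuses seen above
states = iter(["pending_packaging", "packaging", "packaging", "ready"])
final = wait_for_status(lambda: next(states), "ready", interval=0)
print(final)
```

With a real orchestration, the call would look like `wait_for_status(orchestration.status, "ready")`.  The same helper could be reused for task status checks.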


In [9]:
wl.list_orchestrations()

| id | name | status | filename | sha | created at | updated at |
|---|---|---|---|---|---|---|
| fc4fd8cb-a108-404b-8ef9-8a1c9e279bb7 | statsmodel-orchestration 5 | ready | forecast-orchestration.zip | 2c1f30...0f0761 | 2023-02-Aug 20:50:47 | 2023-02-Aug 20:51:36 |
| f8cccfd4-5ef3-49f7-a0e0-4bbccc0fc664 | statsmodel-orchestration 6 | ready | forecast-orchestration.zip | 1b675d...3a4a32 | 2023-02-Aug 21:07:51 | 2023-02-Aug 21:08:37 |
| 8a476448-06da-43b8-96a6-6f4b492973b0 | statsmodel-orchestration 6 | ready | forecast-orchestration.zip | 1b675d...3a4a32 | 2023-02-Aug 21:13:18 | 2023-02-Aug 21:14:02 |
| db9cdef8-4171-43c2-97ae-2188c7d29b41 | statsmodel-orchestration 6 | ready | forecast-orchestration.zip | 1b675d...3a4a32 | 2023-02-Aug 21:17:33 | 2023-02-Aug 21:18:17 |


In [10]:
orchestration

| Field | Value |
|---|---|
| ID | db9cdef8-4171-43c2-97ae-2188c7d29b41 |
| Name | statsmodel-orchestration 6 |
| File Name | forecast-orchestration.zip |
| SHA | 1b675deaf51f53992ee2c9bc434d7e453bd03b4ddff49e7d11949e0c693a4a32 |
| Status | ready |
| Created At | 2023-02-Aug 21:17:33 |
| Updated At | 2023-02-Aug 21:18:17 |


### Create the Task

The orchestration is now ready to be implemented as a Wallaroo Task.  We'll run it once as an example.  This Orchestration assumes that the pipeline is deployed, and its `main.py` accepts the following arguments:

* workspace_name
* pipeline_name
* bigquery_connection_input_name

We'll supply the workspace and pipeline set in the initial variables above.  The connection argument is omitted, so the script falls back to its default.  Verify that the workspace and pipeline exist and match those used in the previous notebooks in this series.

Tasks are generated and run once with the Orchestration `run_once(name, json_args, timeout)` method.  Any arguments for the orchestration are passed in as a `Dict`.  If there are no arguments, then an empty dict `{}` is passed.

In [11]:
pipeline_name

'forecast-workshop-pipeline'

In [12]:
task = orchestration.run_once(name="statsmodel single run finale", 
                              json_args={"workspace_name":workspace_name,
                                         "pipeline_name":pipeline_name}
                              )
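
The parameter name `json_args` suggests that the dictionary is serialized as JSON on its way to the task; that is an assumption, not something confirmed in this section.  Under that assumption, a quick local check that the arguments survive a JSON round trip can catch problems such as non-string keys or unsupported value types before the task is created.  The helper name `check_json_args` is hypothetical.

```python
import json


def check_json_args(args: dict) -> dict:
    """Return args unchanged if they survive a JSON round trip, else raise ValueError."""
    try:
        round_tripped = json.loads(json.dumps(args))
    except TypeError as err:
        raise ValueError(f"task arguments are not JSON-serializable: {err}") from err
    if round_tripped != args:
        raise ValueError("task arguments change when serialized to JSON")
    return args


# The same argument names and values used in the run_once call above
args = check_json_args({
    "workspace_name": "forecast-model-workshopjohn",
    "pipeline_name": "forecast-workshop-pipeline",
})
print(args)
```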

### Monitor Run with Task Status

We'll first monitor the run through its status.

For this example, we check the status of the previously created task in a loop until it reaches the status `started`.

In [13]:
while task.status() != "started":
    display(task.status())
    time.sleep(5)

'pending'

'pending'

'started'

### List Tasks

We'll use the Wallaroo client `list_tasks` method to view the tasks currently running.

In [14]:
wl.list_tasks()

| id | name | last run status | type | active | schedule | created at | updated at |
|---|---|---|---|---|---|---|---|
| d897ef95-911e-42ee-a874-2a7435b5ca77 | statsmodel single run finale | success | Temporary Run | True | - | 2023-02-Aug 21:18:19 | 2023-02-Aug 21:18:30 |
| f406497a-d8c1-4b20-8fe9-d83c8102da40 | statsmodel single run finale | success | Temporary Run | True | - | 2023-02-Aug 21:08:42 | 2023-02-Aug 21:08:48 |
| 7117f780-5fc4-476a-a5d2-0654fdb6271f | statsmodel single run finale | failure | Temporary Run | True | - | 2023-02-Aug 20:55:17 | 2023-02-Aug 20:55:23 |
| f209c52a-88e2-43e3-a614-b08a35b72a94 | statsmodel single run finale | failure | Temporary Run | True | - | 2023-02-Aug 20:52:24 | 2023-02-Aug 20:52:35 |


### Display Task Run Results

The Task Run is the implementation of the task - the actual running of the script and its results.  Tasks that are Run Once will only have one Task Run, while a Task set to Run Scheduled will have a Task Run for each time the task is executed.  Each Task Run has its own set of logs and results that are monitored through the Task Run `logs()` method.

We'll retrieve the task run for our generated task, wait 30 seconds, then check its status and logs.  It may take longer than 30 seconds to launch the task, so be prepared to run the `.logs()` method again to view the logs.

In [15]:
task

| Field | Value |
|---|---|
| ID | d897ef95-911e-42ee-a874-2a7435b5ca77 |
| Name | statsmodel single run finale |
| Last Run Status | success |
| Type | Temporary Run |
| Active | True |
| Schedule | - |
| Created At | 2023-02-Aug 21:18:19 |
| Updated At | 2023-02-Aug 21:18:30 |


In [16]:
statsmodel_task_run = task.last_runs()[0]

In [17]:
time.sleep(30)
statsmodel_task_run._status

'success'

In [18]:
statsmodel_task_run.logs()

### Undeploy the Pipeline

Undeploy the pipeline to return its resources to the Wallaroo instance.

In [19]:
pipeline.undeploy()

Waiting for undeployment - this will take up to 45s .................................... ok


| | |
|---|---|
| name | forecast-workshop-pipeline |
| created | 2023-08-02 15:50:59.480547+00:00 |
| last_updated | 2023-08-02 21:16:55.320303+00:00 |
| deployed | False |
| tags | |
| versions | f8188956-8b3e-4479-8b15-e8747fe915a6, 33e5cc2c-2bb2-4dc2-8a9e-c058e60f6163, 5d419693-97cc-461b-b72a-a389ab7a001b, 56c78f52-cba5-415c-913a-fee0e1863a90, a109a040-c8f2-46dc-8c0b-373ae10d4fa0, dcaec327-1358-42a7-88de-931602a42a72, debc509f-9481-464b-af7f-5c3138a9cdb4, b0d167aa-cc98-440a-8e85-1ae3f089745a, d9e69c40-c83b-48af-b6b9-caafcb85f08b, 186ffdd2-3a8f-40cc-8362-13cc20bd2f46, 535e6030-ebe5-4c79-b5cd-69b161637a99, c5c0218a-800b-4235-8767-64d18208e68a, 4559d934-33b0-4872-a788-4ef27f554482, 94d3e20b-add7-491c-aedd-4eb094a8aebf, ab4e58bf-3b75-4bf6-b6b3-f703fe61e7af, 3773f5c5-e4c5-4e46-a839-6945af15ca13, 3abf03dd-8eab-4a8d-8432-aa85a30c0eda, 5ec5e8dc-7492-498b-9652-b3733e4c87f7, 1d89287b-4eff-47ec-a7bb-8cedaac1f33f |
| steps | forecast-control-model |
