This tutorial and the assets can be downloaded as part of the [Wallaroo Tutorials repository](https://github.com/WallarooLabs/Wallaroo_Tutorials/tree/main/wallaroo-model-cookbooks/mlflow-tutorial).

## MLFlow Inference with Wallaroo Tutorial 

Wallaroo users can register their trained [MLFlow ML Models](https://www.mlflow.org/docs/latest/models.html) from a container registry into their Wallaroo instance and run inferences with them through a Wallaroo pipeline.

As of this time, Wallaroo only supports **MLFlow 1.3.0** containerized models.  For information on how to containerize an MLFlow model, see the [MLFlow Documentation](https://mlflow.org/docs/latest/projects.html).

This tutorial assumes that you have a Wallaroo instance, that you have either your own containerized model or the one from the reference, and that you are running this notebook from the Wallaroo JupyterHub service.

See the [Wallaroo Private Containerized Model Container Registry Guide](https://docs.wallaroo.ai/wallaroo-operations-guide/wallaroo-configuration/wallaroo-private-model-registry/) for details on how to configure a Wallaroo instance with a private model registry.

## MLFlow Data Formats

When using containerized MLFlow models with Wallaroo, the inputs and outputs must be named.  For example, the following output:

```json
[-12.045839810372835]
```

Would need to be wrapped with the data values named:

```json
[{"prediction": -12.045839810372835}]
```

A short code sample for wrapping the data might be:

```python
output_df = pd.DataFrame(prediction, columns=["prediction"])
return output_df
```
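
As a dependency-free illustration of the same idea, the sketch below names each raw value before serializing it. The helper `name_outputs` and its `field` parameter are hypothetical names introduced for this example; they are not part of MLFlow or the Wallaroo SDK.

```python
import json


def name_outputs(values, field="prediction"):
    """Wrap each bare value in a dict keyed by the output field name."""
    return [{field: value} for value in values]


raw_output = [-12.045839810372835]
named_output = name_outputs(raw_output)

# Serialize to the named JSON form: a list of {field: value} records.
print(json.dumps(named_output))
```

The same shape is what a one-column `pandas` DataFrame produces when it is exported as records, which is why the DataFrame wrapper is a convenient choice inside a model's prediction code.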

### MLFlow Models and Wallaroo

MLFlow models are composed of two parts: the model and the flavors. When submitting an MLFlow model to Wallaroo, both parts must be included in the container. For full information about MLFlow model structure, see the [MLFlow Documentation](https://www.mlflow.org/docs/latest/index.html).

Wallaroo registers models from container registries. Organizations must make their containers available either in a public registry or through a private container registry service. For examples of setting up a private container registry service, see the [Docker Documentation "Deploy a registry server"](https://docs.docker.com/registry/deploying/). For more details on setting up a container registry in a cloud environment, see the related documentation for your preferred cloud provider:
  * [Google Cloud Platform Container Registry](https://cloud.google.com/container-registry)
  * [Amazon Web Services Elastic Container Registry](https://docs.aws.amazon.com/AmazonECR/latest/userguide/what-is-ecr.html)
  * [Microsoft Azure Container Registry](https://azure.microsoft.com/en-us/free/container-registry/)

For this example, we will be using the MLFlow containers that were registered in a GitHub container registry service in MLFlow Creation Tutorial Part 03: Container Registration. The addresses of those containers are:

* postprocess: `ghcr.io/johnhansarickwallaroo/mlflowtests/mlflow-postprocess-example`. Used to format data after the statsmodel inferences.
* statsmodel: `ghcr.io/johnhansarickwallaroo/mlflowtests/mlflow-statsmodels-example`. The statsmodel generated in MLFlow Creation Tutorial Part 01: Model Creation.

### Prerequisites

Before uploading an MLFlow model to Wallaroo and running an inference with it, the following are required:

* **MLFlow Input Schema**:  The input schema with the fields and data types for each MLFlow model type uploaded to Wallaroo.  In the examples below, the data types are imported using the `pyarrow` library.
* An installed Wallaroo instance.
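* The following Python libraries installed:
  * `os`
  * [`wallaroo`](https://pypi.org/project/wallaroo/): The Wallaroo SDK. Included with the Wallaroo JupyterHub service by default.
  * `pyarrow`: Used to define the input and output schemas.
  * `pandas`: Used to load and display the sample data.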

**IMPORTANT NOTE**:  Wallaroo supports MLFlow 1.3.0.  Please ensure the MLFlow models used in Wallaroo meet this specification.

## MLFlow Inference Steps

To register a containerized MLFlow ML Model into Wallaroo, use the following general steps:

* Import Libraries
* Connect to Wallaroo
* Set MLFlow Input Schemas
* Register MLFlow Model
* Create Pipeline and Add Model Steps
* Run Inference

### Import Libraries

We start by importing the libraries we will need to connect to Wallaroo and use our MLFlow models. This includes the `wallaroo` libraries, `pyarrow` for data types, and `pandas` for loading and displaying the JSON sample data.

```python
import wallaroo
from wallaroo.object import EntityNotFoundError
import pyarrow as pa
import pandas as pd

import os
# Used for the Wallaroo SDK version 2023.1
os.environ["ARROW_ENABLED"] = "True"
```

### Connect to Wallaroo

Connect to Wallaroo and store the connection in the variable `wl`.

The following methods are used to create the workspace and pipeline for this tutorial. A workspace is created and set as the current workspace that will contain the registered models and pipelines.

```python
bike_day = pd.read_json('./resources/bike_day_eval_engine.df.json', orient="records")
display(bike_day)
```

| | holiday | temp | windspeed | workingday |
|---|---|---|---|---|
| 0 | 0 | 0.317391 | 0.184309 | 1 |
| 1 | 0 | 0.365217 | 0.203117 | 1 |
| 2 | 0 | 0.415 | 0.209579 | 1 |
| 3 | 0 | 0.54 | 0.231017 | 1 |
| 4 | 0 | 0.4725 | 0.368167 | 0 |
| 5 | 0 | 0.3325 | 0.207721 | 0 |
| 6 | 0 | 0.430435 | 0.288783 | 1 |


```python
# Login through local Wallaroo instance

wl = wallaroo.Client()

# SSO login through keycloak
# Replace the prefix and suffix with the values for your own instance;
# the values below are from the example instance used for this tutorial.

wallarooPrefix = "doc-test"
wallarooSuffix = "wallaroocommunity.ninja"

wl = wallaroo.Client(api_endpoint=f"https://{wallarooPrefix}.api.{wallarooSuffix}", 
                    auth_endpoint=f"https://{wallarooPrefix}.keycloak.{wallarooSuffix}", 
                    auth_type="sso")
```

```
Please log into the following URL in a web browser:

	https://doc-test.keycloak.wallaroocommunity.ninja/auth/realms/master/device?user_code=ZNTN-HDMI

Login successful!
```


```python
def get_workspace(name):
    # Reuse the workspace if it already exists; otherwise create it.
    workspace = None
    for ws in wl.list_workspaces():
        if ws.name() == name:
            workspace = ws
    if workspace is None:
        workspace = wl.create_workspace(name)
    return workspace

def get_pipeline(name):
    # Use the `name` argument rather than the global `pipeline_name`.
    try:
        pipeline = wl.pipelines_by_name(name)[0]
    except EntityNotFoundError:
        pipeline = wl.build_pipeline(name)
    return pipeline
```
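
The get-or-create pattern behind `get_workspace` can be exercised without a Wallaroo instance. The sketch below substitutes hypothetical `StubWorkspace` and `StubClient` classes for the SDK objects; they mimic only the `list_workspaces`, `create_workspace`, and `name` calls used in the helper and are not part of the Wallaroo SDK.

```python
class StubWorkspace:
    """Hypothetical stand-in for a Wallaroo workspace object."""

    def __init__(self, name):
        self._name = name

    def name(self):
        return self._name


class StubClient:
    """Hypothetical stand-in for the client; keeps workspaces in memory."""

    def __init__(self):
        self._workspaces = []

    def list_workspaces(self):
        return list(self._workspaces)

    def create_workspace(self, name):
        workspace = StubWorkspace(name)
        self._workspaces.append(workspace)
        return workspace


def get_or_create_workspace(client, name):
    """Return the workspace called `name`, creating it only if it is missing."""
    for workspace in client.list_workspaces():
        if workspace.name() == name:
            return workspace
    return client.create_workspace(name)


client = StubClient()
first = get_or_create_workspace(client, "mlflowstatsmodelworkspace")
second = get_or_create_workspace(client, "mlflowstatsmodelworkspace")

# The second call finds the existing workspace instead of creating a duplicate.
print(first is second, len(client.list_workspaces()))
```

Because the lookup runs before the create call, rerunning the notebook cell should not fail or pile up duplicate workspaces.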

```python
prefix = 'mlflow'
workspace_name = f"{prefix}statsmodelworkspace"
pipeline_name = f"{prefix}statsmodelpipeline"

mlflowworkspace = get_workspace(workspace_name)
wl.set_current_workspace(mlflowworkspace)

pipeline = get_pipeline(pipeline_name)
```

### Set MLFlow Input Schemas

Set the MLFlow input schemas through the `pyarrow` library. In the examples below, the input schemas are defined for both the MLFlow model `statsmodels-test` and the `statsmodels-test-postprocess` model.

```python
sm_input_schema = pa.schema([
    pa.field('temp', pa.float32()),
    pa.field('holiday', pa.uint8()),
    pa.field('workingday', pa.uint8()),
    pa.field('windspeed', pa.float32())
])

pp_input_schema = pa.schema([
    pa.field('predicted_mean', pa.float32())
])
```

### Register MLFlow Model

Use the `register_model_image` method to register the Docker container containing the MLFlow models.

```python
statsmodelUrl = "ghcr.io/wallaroolabs/wallaroo_tutorials/mlflow-statsmodels-example:2023.1"
postprocessUrl = "ghcr.io/wallaroolabs/wallaroo_tutorials/mlflow-postprocess-example:2023.1"

sm_model = wl.register_model_image(
    name=f"{prefix}statmodels",
    image=f"{statsmodelUrl}"
).configure("mlflow", input_schema=sm_input_schema, output_schema=pp_input_schema)
pp_model = wl.register_model_image(
    name=f"{prefix}postprocess",
    image=f"{postprocessUrl}"
).configure("mlflow", input_schema=pp_input_schema, output_schema=pp_input_schema)
```

### Create Pipeline and Add Model Steps

With the models registered, we can add the MLFlow models as steps in the pipeline.  Once ready, we will deploy the pipeline so it is available for submitting data for running inferences.

```python
pipeline.add_model_step(sm_model)
pipeline.add_model_step(pp_model)
```

| | |
|---|---|
| name | mlflowstatsmodelpipeline |
| created | 2023-03-29 15:30:47.168398+00:00 |
| last_updated | 2023-03-30 19:39:33.909512+00:00 |
| deployed | False |
| tags | |
| versions | a0f9b98c-1111-41b1-a196-2caffa9e41c8, 06c42cf6-a4c3-4610-a0f0-f3f2464c4c1f, ace1242f-bf62-4018-b47a-b9ef2de2f80a, e20bf245-2a81-4d84-a58a-6264cf988819, ba67924f-7e4b-4b8c-be97-1c3ef995f4ac, a8ae51f2-9027-4dc6-b262-ef247948fd16 |
| steps | mlflowstatmodels |


```python
pipeline.deploy()
```

| | |
|---|---|
| name | mlflowstatsmodelpipeline |
| created | 2023-03-29 15:30:47.168398+00:00 |
| last_updated | 2023-03-31 21:27:02.378296+00:00 |
| deployed | True |
| tags | |
| versions | 42d7e5be-61eb-4b12-b5a1-cfcd50d1ad10, a0f9b98c-1111-41b1-a196-2caffa9e41c8, 06c42cf6-a4c3-4610-a0f0-f3f2464c4c1f, ace1242f-bf62-4018-b47a-b9ef2de2f80a, e20bf245-2a81-4d84-a58a-6264cf988819, ba67924f-7e4b-4b8c-be97-1c3ef995f4ac, a8ae51f2-9027-4dc6-b262-ef247948fd16 |
| steps | mlflowstatmodels |


```python
pipeline.status()
```

```
{'status': 'Running',
 'details': [],
 'engines': [{'ip': '10.244.1.27',
   'name': 'engine-7544765498-vx8sv',
   'status': 'Running',
   'reason': None,
   'details': [],
   'pipeline_statuses': {'pipelines': [{'id': 'mlflowstatsmodelpipeline',
      'status': 'Running'}]},
   'model_statuses': {'models': [{'name': 'mlflowstatmodels',
      'version': '38ed4b71-2eed-4f6a-b8aa-535760cb9123',
      'sha': '3afd13d9c5070679e284050cd099e84aa2e5cb7c08a788b21d6cb2397615d018',
      'status': 'Running'},
     {'name': 'mlflowpostprocess',
      'version': '8cce5ed1-7f9e-44ac-9408-10221d1125e8',
      'sha': '825ebae48014d297134930028ab0e823bc0d9551334b9a4402c87a714e8156b2',
      'status': 'Running'}]}}],
 'engine_lbs': [{'ip': '10.244.2.20',
   'name': 'engine-lb-ddd995646-b8wms',
   'status': 'Running',
   'reason': None,
   'details': []}],
 'sidekicks': [{'ip': '10.244.1.28',
   'name': 'engine-sidekick-mlflowstatmodels-74-6bd9b6648f-wgd9r',
   'status': 'Running',
   'reason': None,
```

### Run Inference

Once the pipeline is running, we can submit our data to the pipeline and retrieve the results. Once finished, we will undeploy the pipeline to return its resources to the cluster.

```python
results = pipeline.infer_from_file('./resources/bike_day_eval_engine.df.json')
display(results.loc[:,["out.predicted_mean"]])
```

| | out.predicted_mean |
|---|---|
| 0 | 0.281983 |
| 1 | 0.658847 |
| 2 | 0.572368 |
| 3 | 0.619873 |
| 4 | -1.217801 |
| 5 | -1.849156 |
| 6 | 0.933885 |


```python
display(results)
```

| | time | in.holiday | in.temp | in.windspeed | in.workingday | out.predicted_mean | check_failures |
|---|---|---|---|---|---|---|---|
| 0 | 2023-03-31 21:28:09.665 | 0 | 0.317391 | 0.184309 | 1 | 0.281983 | 0 |
| 1 | 2023-03-31 21:28:09.665 | 0 | 0.365217 | 0.203117 | 1 | 0.658847 | 0 |
| 2 | 2023-03-31 21:28:09.665 | 0 | 0.415 | 0.209579 | 1 | 0.572368 | 0 |
| 3 | 2023-03-31 21:28:09.665 | 0 | 0.54 | 0.231017 | 1 | 0.619873 | 0 |
| 4 | 2023-03-31 21:28:09.665 | 0 | 0.4725 | 0.368167 | 0 | -1.217801 | 0 |
| 5 | 2023-03-31 21:28:09.665 | 0 | 0.3325 | 0.207721 | 0 | -1.849156 | 0 |
| 6 | 2023-03-31 21:28:09.665 | 0 | 0.430435 | 0.288783 | 1 | 0.933885 | 0 |
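
The result columns follow a visible naming pattern: inputs are prefixed with `in.`, outputs with `out.`, alongside `time` and `check_failures`. As a small sketch in plain `pandas`, the snippet below rebuilds the first two result rows by hand so that it runs without a deployed pipeline, then separates the output columns. It assumes that a non-zero `check_failures` value marks a row that failed a validation check; that reading is an assumption, not something this tutorial states.

```python
import pandas as pd

# First two rows of the inference results, rebuilt by hand for illustration.
results = pd.DataFrame({
    "time": pd.to_datetime(["2023-03-31 21:28:09.665"] * 2),
    "in.holiday": [0, 0],
    "in.temp": [0.317391, 0.365217],
    "in.windspeed": [0.184309, 0.203117],
    "in.workingday": [1, 1],
    "out.predicted_mean": [0.281983, 0.658847],
    "check_failures": [0, 0],
})

# Keep only the model outputs, identified by their "out." prefix.
output_columns = [c for c in results.columns if c.startswith("out.")]
outputs = results[output_columns]

# Assumption: rows with check_failures > 0 did not pass validation.
failed = results[results["check_failures"] > 0]

print(outputs)
print(len(failed))
```

Selecting by prefix avoids hard-coding output names, which may help when a pipeline returns more than one output field.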


```python
pipeline.undeploy()
```

| | |
|---|---|
| name | mlflowstatsmodelpipeline |
| created | 2023-03-29 15:30:47.168398+00:00 |
| last_updated | 2023-03-31 21:27:02.378296+00:00 |
| deployed | False |
| tags | |
| versions | 42d7e5be-61eb-4b12-b5a1-cfcd50d1ad10, a0f9b98c-1111-41b1-a196-2caffa9e41c8, 06c42cf6-a4c3-4610-a0f0-f3f2464c4c1f, ace1242f-bf62-4018-b47a-b9ef2de2f80a, e20bf245-2a81-4d84-a58a-6264cf988819, ba67924f-7e4b-4b8c-be97-1c3ef995f4ac, a8ae51f2-9027-4dc6-b262-ef247948fd16 |
| steps | mlflowstatmodels |
