This tutorial and the assets can be downloaded as part of the [Wallaroo Tutorials repository](https://github.com/WallarooLabs/Wallaroo_Tutorials/blob/wallaroo2024.1_tutorials/wallaroo-run-anywhere/edge-architecture-publish-hf-summarization-model).

## Run Anywhere for ARM Architecture Tutorial: Hugging Face Summarization Model

Wallaroo Run Anywhere provides model deployment on any device, in any cloud, and on any architecture.  Models uploaded to Wallaroo are set to their targeted architecture.  By default, this is `X64`.

The model's deployment configuration is tied to its architecture settings.  That deployment configuration is carried over to the model's publication in Open Container Initiative (OCI) Registries, which allows edge model deployments on X64 and ARM architectures.

This tutorial demonstrates deploying a Hugging Face summarization model to ARM deployments and ARM edge locations through the following steps.

* Set up a workspace and pipeline.
* Upload a model set to the ARM architecture.
* Deploy the model to the ARM architecture and demonstrate that the deployment configuration inherits its architecture from the model configuration.
* Perform a sample set of inferences to verify the deployment.
* Publish the deployed model to an Open Container Initiative (OCI) Registry for ARM deployments and verify it inherits the edge deployment architecture from the model configuration.
* Deploy the model to an ARM edge device.
* Perform similar inferences on the edge device and show the results.

In this notebook, we use a Hugging Face summarization LLM.  The sample model is available at the following URL.  **This model should be downloaded and placed into the `./models` folder before beginning this demonstration.**

[model-auto-conversion_hugging-face_complex-pipelines_hf-summarisation-bart-large-samsun.zip (1.4 GB)](https://storage.googleapis.com/wallaroo-public-data/llm-models/model-auto-conversion_hugging-face_complex-pipelines_hf-summarisation-bart-large-samsun.zip)

## Goal

Demonstrate that models uploaded to Wallaroo with the targeted architecture set to ARM are deployable on ARM architecture nodepools and ARM edge devices.

### Resources

This tutorial provides the following:

* Models:
  * `models/model-auto-conversion_hugging-face_complex-pipelines_hf-summarisation-bart-large-samsun.zip`: **This model should be downloaded and placed into the `./models` folder before beginning this demonstration.** [model-auto-conversion_hugging-face_complex-pipelines_hf-summarisation-bart-large-samsun.zip (1.4 GB)](https://storage.googleapis.com/wallaroo-public-data/llm-models/model-auto-conversion_hugging-face_complex-pipelines_hf-summarisation-bart-large-samsun.zip)
  * Various inputs:
    * `./data/test_summarization.df.json`: A DataFrame with sample text to summarize.

### Prerequisites

* A deployed Wallaroo instance with [Edge Registry Services](https://docs.wallaroo.ai/wallaroo-operations-guide/wallaroo-configuration/wallaroo-edge-deployment/#enable-wallaroo-edge-deployment-registry) and [Edge Observability enabled](https://docs.wallaroo.ai/wallaroo-operations-guide/wallaroo-configuration/wallaroo-edge-deployment/#set-edge-observability-service).
* The following Python libraries installed:
  * [`wallaroo`](https://pypi.org/project/wallaroo/): The Wallaroo SDK. Included with the Wallaroo JupyterHub service by default.
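  * [`pandas`](https://pypi.org/project/pandas/): Pandas, mainly used for Pandas DataFrames.
  * [`pyarrow`](https://pypi.org/project/pyarrow/): PyArrow, used for the model's input and output data schemas.
  * `json`: Used to format input data for inference requests.
* An ARM Docker deployment to deploy the model on an edge location.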


## Steps

* Upload the model with the targeted architecture set to `ARM`.
* Create the pipeline and add the model as a model step.
* Deploy the model in the targeted architecture and perform sample inferences.
* Publish the pipeline to an OCI registry.
* Deploy the model from the pipeline publish to the edge deployment with ARM architecture.
* Perform sample inferences on the ARM edge model deployment.

### Import Libraries

The first step is to import our libraries and set the variables used throughout this tutorial.

In [6]:
import wallaroo
from wallaroo.object import EntityNotFoundError
from wallaroo.framework import Framework
from wallaroo.engine_config import Architecture

# used to display DataFrame information without truncating
from IPython.display import display
import pandas as pd
pd.set_option('display.max_colwidth', None)

# used for input/output data schemas
import pyarrow as pa

import datetime
import time

# names used for the workspace, pipeline, and model throughout this tutorial
workspace_name = 'run-anywhere-architecture-hf-summarizer-demonstration-tutorial'
arm_pipeline_name = 'architecture-demonstration-arm'
model_name_arm = 'hf-summarizer-arm'
model_file_name = './models/model-auto-conversion_hugging-face_complex-pipelines_hf-summarisation-bart-large-samsun.zip'

# ignoring warnings for demonstration
import warnings
warnings.filterwarnings('ignore')

### Connect to the Wallaroo Instance

The next step is to connect to Wallaroo through the Wallaroo client.  The Python library is included in the Wallaroo install and available through the JupyterHub interface provided with your Wallaroo environment.

This is accomplished using the `wallaroo.Client()` command, which provides a URL to grant the SDK permission to your specific Wallaroo environment.  When displayed, enter the URL into a browser and confirm permissions.  Store the connection into a variable that can be referenced later.

If logging into the Wallaroo instance through the internal JupyterHub service, use `wl = wallaroo.Client()`.  For more information on Wallaroo Client settings, see the [Client Connection guide](https://docs.wallaroo.ai/wallaroo-developer-guides/wallaroo-sdk-guides/wallaroo-sdk-essentials-guide/wallaroo-sdk-essentials-client/).

In [3]:
# Login through local Wallaroo instance

wl = wallaroo.Client()

Please log into the following URL in a web browser:

	https://keycloak.autoscale-uat-gcp.wallaroo.dev/auth/realms/master/device?user_code=CPAZ-MUGU

Login successful!


### Create Workspace

We will create a workspace to manage our pipeline and models.  The following code retrieves the workspace by name, creating it if it does not exist, then sets it as the current workspace.

Workspace, pipeline, and model names should be unique to each user.  This tutorial uses fixed names, so if multiple people run it in the same Wallaroo instance, update the names set in the Import Libraries step (for example, by appending a suffix) to avoid affecting each other.

In [4]:
workspace = wl.get_workspace(name=workspace_name, create_if_not_exist=True)

wl.set_current_workspace(workspace)

{'name': 'run-anywhere-architecture-hf-summarizer-demonstration-tutorial', 'id': 64, 'archived': False, 'created_by': 'b4a9aa3d-83fc-407a-b4eb-37796e96f1ef', 'created_at': '2024-03-05T16:09:41.98492+00:00', 'models': [], 'pipelines': []}
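
One way to keep names unique per user is to append a short random suffix before creating the workspace.  The following is a minimal sketch under that assumption; `with_suffix` is a hypothetical helper for illustration and is not part of the Wallaroo SDK.

```python
import random
import string


def with_suffix(base_name: str, length: int = 4) -> str:
    """Append a random lowercase suffix so names do not collide between users."""
    suffix = "".join(random.choice(string.ascii_lowercase) for _ in range(length))
    return f"{base_name}-{suffix}"


# Hypothetical usage: derive per-user names from the tutorial's base names.
workspace_name = with_suffix("run-anywhere-architecture-hf-summarizer-demonstration-tutorial")
arm_pipeline_name = with_suffix("architecture-demonstration-arm")
print(workspace_name, arm_pipeline_name)
```

Each run generates a new suffix, so record the generated names if you want to return to the same workspace and pipeline later.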

### Upload Models and Set Target Architecture to ARM

For our example, we will upload the Hugging Face Summarizer model.  The model file is `./models/model-auto-conversion_hugging-face_complex-pipelines_hf-summarisation-bart-large-samsun.zip`, and is uploaded with the name `hf-summarizer-arm`.

Models are uploaded to Wallaroo via the [`wallaroo.client.upload_model`](https://docs.wallaroo.ai/wallaroo-developer-guides/wallaroo-sdk-guides/wallaroo-sdk-reference-guide/client/#Client.upload_model) method which takes the following arguments:

| Parameter | Type | Description |
|---|---|---|
| **path** | *String* (*Required*) | The file path to the model. |
| **framework** | *wallaroo.framework.Framework* (*Required*) | The model's framework.  See [Wallaroo SDK Essentials Guide: Model Uploads and Registrations](https://docs.wallaroo.ai/wallaroo-developer-guides/wallaroo-sdk-guides/wallaroo-sdk-essentials-guide/wallaroo-sdk-model-uploads/) for supported model frameworks. |
| **input_schema** | *pyarrow.lib.Schema* (*Optional*)  | The model's input schema.  **Only required for non-native Wallaroo frameworks.**  See [Wallaroo SDK Essentials Guide: Model Uploads and Registrations](https://docs.wallaroo.ai/wallaroo-developer-guides/wallaroo-sdk-guides/wallaroo-sdk-essentials-guide/wallaroo-sdk-model-uploads/) for more details. |
| **output_schema** | *pyarrow.lib.Schema* (*Optional*)  | The model's output schema.  **Only required for non-native Wallaroo frameworks.**  See [Wallaroo SDK Essentials Guide: Model Uploads and Registrations](https://docs.wallaroo.ai/wallaroo-developer-guides/wallaroo-sdk-guides/wallaroo-sdk-essentials-guide/wallaroo-sdk-model-uploads/) for more details. |
| **convert_wait** | *bool* (*Optional*)  | Whether to wait in the SDK session to complete the auto-packaging process for non-native Wallaroo frameworks. |
| **arch** | *wallaroo.engine_config.Architecture* (*Optional*)  | The targeted architecture for the model.  Options are <ol><li>`X86` (*Default*)</li><li>`ARM`</li></ol> |

Verify the ML model is downloaded from [model-auto-conversion_hugging-face_complex-pipelines_hf-summarisation-bart-large-samsun.zip (1.4 GB)](https://storage.googleapis.com/wallaroo-public-data/llm-models/model-auto-conversion_hugging-face_complex-pipelines_hf-summarisation-bart-large-samsun.zip) and placed into the `./models` directory.
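
A quick check before uploading can catch a missing or partially downloaded archive.  The sketch below uses only the Python standard library; `model_file_status` is a hypothetical helper name, and the path matches the `model_file_name` value used in this tutorial.

```python
from pathlib import Path

# Path used throughout this tutorial for the downloaded model archive.
MODEL_FILE = Path("./models/model-auto-conversion_hugging-face_complex-pipelines_hf-summarisation-bart-large-samsun.zip")


def model_file_status(path: Path) -> str:
    """Return a short human-readable status for the model archive at `path`."""
    if not path.is_file():
        return f"missing: {path}"
    size_gb = path.stat().st_size / 1e9
    return f"found: {path} ({size_gb:.2f} GB)"


print(model_file_status(MODEL_FILE))
```

The download is listed as 1.4 GB, so a much smaller reported size suggests an incomplete download.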

In [None]:
input_schema = pa.schema([
    pa.field('inputs', pa.string()),
    pa.field('return_text', pa.bool_()),
    pa.field('return_tensors', pa.bool_()),
    pa.field('clean_up_tokenization_spaces', pa.bool_()),
    # pa.field('generate_kwargs', pa.map_(pa.string(), pa.null())), # dictionaries are not currently supported by the engine
])

output_schema = pa.schema([
    pa.field('summary_text', pa.string()),
])

model_name_arm = f'hf-summarizer-arm'
model_file_name = './models/model-auto-conversion_hugging-face_complex-pipelines_hf-summarisation-bart-large-samsun.zip'

model_arm = wl.upload_model(model_name_arm, 
                        model_file_name, 
                        framework=wallaroo.framework.Framework.HUGGING_FACE_SUMMARIZATION, 
                        input_schema=input_schema, 
                        output_schema=output_schema,
                        arch=Architecture.ARM
                        )

In [12]:
display(model_arm)

| Field | Value |
|---|---|
| Name | house-price-estimator-arm |
| Version | 0d7bb3f4-db0f-448d-8100-22c1c49e6a2e |
| File Name | model-auto-conversion_hugging-face_complex-pipelines_hf-summarisation-bart-large-samsun.zip |
| SHA | ee71d066a83708e7ca4a3c07caf33fdc528bb000039b6ca2ef77fa2428dc6268 |
| Status | ready |
| Image Path | proxy.replicated.com/proxy/wallaroo/ghcr.io/wallaroolabs/mac-deploy:v2024.1.0-main-4609 |
| Architecture | |
| Acceleration | |
| Updated At | 2024-05-Mar 16:16:49 |


### Deploy Models to ARM Architecture

#### Build the Pipeline

This pipeline is used as an example of the edge deployment environment for testing.  We will create a pipeline and add our model set to the `ARM` architecture as a pipeline step.

The **model's targeted architecture** is inherited by the pipeline version; no additional pipeline settings are required to set that architecture.  The remaining deployment configuration settings allocate the CPU and memory used by the model.

In [13]:
pipeline_arm = wl.build_pipeline(arm_pipeline_name)


# clear the steps if used before
pipeline_arm.clear()

# set the model step with the ARM targeted model
pipeline_arm.add_model_step(model_arm)

| Field | Value |
|---|---|
| name | architecture-demonstration-arm |
| created | 2024-03-05 16:18:38.768602+00:00 |
| last_updated | 2024-03-05 16:18:38.768602+00:00 |
| deployed | (none) |
| arch | |
| accel | |
| tags | |
| versions | d033152c-494c-44a6-8981-627c6b6ad72e |
| steps | |
| published | False |


#### Deploy Pipeline

We deploy the pipeline, then show that the deployment configuration architecture is set to `ARM`.

In [14]:
#minimum deployment config
deployment_config = wallaroo.DeploymentConfigBuilder() \
    .cpus(0.25).memory('1Gi') \
    .sidekick_cpus(model_arm, 4) \
    .sidekick_memory(model_arm, "8Gi") \
    .build()

In [15]:
pipeline_arm.deploy(deployment_config=deployment_config)

Waiting for deployment - this will take up to 45s ...........................................

WaitForDeployError: Deployment failed. See status for details.
Status: {'status': 'Error', 'details': [], 'engines': [{'ip': '10.124.0.140', 'name': 'engine-98974ffff-f99tm', 'status': 'Running', 'reason': None, 'details': [], 'pipeline_statuses': {'pipelines': [{'id': 'architecture-demonstration-arm', 'status': 'Running'}]}, 'model_statuses': {'models': [{'config': {'batch_config': None, 'filter_threshold': None, 'id': 82, 'input_schema': '/////xgBAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAQAAAC8AAAAfAAAAEgAAAAEAAAAZP///wAAAQYQAAAAMAAAAAQAAAAAAAAAHAAAAGNsZWFuX3VwX3Rva2VuaXphdGlvbl9zcGFjZXMAAAAAbP///6T///8AAAEGEAAAACAAAAAEAAAAAAAAAA4AAAByZXR1cm5fdGVuc29ycwAAnP///9T///8AAAEGEAAAABwAAAAEAAAAAAAAAAsAAAByZXR1cm5fdGV4dADI////EAAUAAgABgAHAAwAAAAQABAAAAAAAAEFEAAAABwAAAAEAAAAAAAAAAYAAABpbnB1dHMAAAQABAAEAAAA', 'model_version_id': 47, 'output_schema': '/////3gAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAEAAAAUAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAEFEAAAACQAAAAEAAAAAAAAAAwAAABzdW1tYXJ5X3RleHQAAAAABAAEAAQAAAA=', 'runtime': 'flight', 'sidekick_uri': None, 'tensor_fields': None}, 'model_version': {'conversion': {'framework': 'hugging-face-summarization', 'python_version': '3.8', 'requirements': []}, 'file_info': {'file_name': 'model-auto-conversion_hugging-face_complex-pipelines_hf-summarisation-bart-large-samsun.zip', 'sha': 'ee71d066a83708e7ca4a3c07caf33fdc528bb000039b6ca2ef77fa2428dc6268', 'version': '0d7bb3f4-db0f-448d-8100-22c1c49e6a2e'}, 'id': 47, 'image_path': 'proxy.replicated.com/proxy/wallaroo/ghcr.io/wallaroolabs/mac-deploy:v2024.1.0-main-4609', 'name': 'house-price-estimator-arm', 'status': 'ready', 'task_id': 'a872f5f5-e971-45dc-8d42-e74b839f9610', 'visibility': 'private', 'workspace_id': 64}, 'status': 'Running'}]}}], 'engine_lbs': [{'ip': '10.124.0.139', 'name': 'engine-lb-d7cc8fc9c-5wxp2', 'status': 'Running', 'reason': None, 'details': []}], 'sidekicks': [{'ip': None, 'name': 'engine-sidekick-house-price-estimator-arm-47-74966db799-wfcnw', 'status': 'Pending', 'reason': '0/4 nodes are 
available: 1 node(s) had untolerated taint {kubernetes.io/arch: arm64}, 3 Insufficient cpu. preemption: 0/4 nodes are available: 1 Preemption is not helpful for scheduling, 3 No preemption victims found for incoming pod..', 'details': ['0/4 nodes are available: 1 node(s) had untolerated taint {kubernetes.io/arch: arm64}, 3 Insufficient cpu. preemption: 0/4 nodes are available: 1 Preemption is not helpful for scheduling, 3 No preemption victims found for incoming pod..'], 'statuses': None}]}

In [None]:
pipeline_arm.status()
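
The status output is long, so it can help to pull out only the components that are not running along with their scheduling reasons.  The following is a minimal sketch that assumes the status is a dictionary shaped like the output shown above; `unscheduled_components` is a hypothetical helper, and `sample_status` is an abbreviated copy of that output with most fields omitted.

```python
def unscheduled_components(status: dict) -> list:
    """Return (kind, name, reason) for every component that is not Running."""
    problems = []
    for kind in ("engines", "engine_lbs", "sidekicks"):
        for component in status.get(kind) or []:
            if component.get("status") != "Running":
                problems.append((kind, component.get("name"), component.get("reason")))
    return problems


# Abbreviated copy of the status shown above (most fields omitted, reason shortened).
sample_status = {
    "status": "Error",
    "engines": [{"name": "engine-98974ffff-f99tm", "status": "Running", "reason": None}],
    "engine_lbs": [{"name": "engine-lb-d7cc8fc9c-5wxp2", "status": "Running", "reason": None}],
    "sidekicks": [{
        "name": "engine-sidekick-house-price-estimator-arm-47-74966db799-wfcnw",
        "status": "Pending",
        "reason": "0/4 nodes are available: 1 node(s) had untolerated taint "
                  "{kubernetes.io/arch: arm64}, 3 Insufficient cpu.",
    }],
}

for kind, name, reason in unscheduled_components(sample_status):
    print(f"{kind}: {name}\n  {reason}")
```

In the notebook, the same helper could be applied as `unscheduled_components(pipeline_arm.status())`, assuming the returned status has this structure.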

### Inferences on ARM Architecture Model Deployments

With the pipeline deployed, we'll perform sample inferences through the pipeline's inference API endpoints.  This will be the same method used for inference requests for model edge deployments.

In [None]:
# inferences on ARM model deployments

arm_deploy_url = pipeline_arm._deployment._url()

# get authorization header
headers = wl.auth.auth_header()

# set the content type for pandas records
headers['Content-Type'] = 'application/json; format=pandas-records'

# set accept as pandas-records
headers['Accept'] = 'application/json; format=pandas-records'

!curl -X POST {arm_deploy_url} \
    -H "Authorization: {headers['Authorization']}" \
    -H "Content-Type: {headers['Content-Type']}" \
    -H "Accept: {headers['Accept']}" \
    --data @./data/test_summarization.df.json > ./arm_results.df.json

In [None]:
# show the first 5 results from the outputs

arm_df = pd.read_json("./arm_results.df.json", orient="records")
arm_df.head(5)
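
The request body is sent in pandas-records format, so each row should carry the fields declared in the input schema during the model upload.  The sketch below checks rows against those field names and types with the standard library only; `EXPECTED_FIELDS` and `check_records` are hypothetical names, and the sample rows are placeholders rather than content from the tutorial's data file.

```python
import json

# Expected fields and Python types, mirroring the pyarrow input schema used at upload.
EXPECTED_FIELDS = {
    "inputs": str,
    "return_text": bool,
    "return_tensors": bool,
    "clean_up_tokenization_spaces": bool,
}


def check_records(records: list) -> list:
    """Return a list of problems found in pandas-records rows; empty means OK."""
    problems = []
    for i, row in enumerate(records):
        for name, expected in EXPECTED_FIELDS.items():
            if name not in row:
                problems.append(f"row {i}: missing field '{name}'")
            elif not isinstance(row[name], expected):
                problems.append(f"row {i}: '{name}' should be {expected.__name__}")
    return problems


# Illustrative rows; the text and flag values are placeholders.
good = [{"inputs": "Placeholder text.", "return_text": True,
         "return_tensors": False, "clean_up_tokenization_spaces": False}]
bad = [{"inputs": "Placeholder text.", "return_text": "yes"}]

print(check_records(good))
print(check_records(bad))

# To check the tutorial's input file instead:
# with open("./data/test_summarization.df.json") as f:
#     print(check_records(json.load(f)))
```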

### Undeploy the Pipeline

With the inference examples complete, we can undeploy the pipeline.

In [None]:
pipeline_arm.undeploy()

## Edge Deployment on ARM Architecture

We now deploy the pipeline version to our edge device.

* Publish the pipeline version:  Publishes the pipeline version to the OCI registry.  The deployment configuration is inherited from the pipeline version, and the architecture setting is inherited from the model version.
* Deploy Edge:  Deploy the edge device with the edge location settings.

### Publish Models with ARM Architecture Deployment Configuration

Publishing the pipeline uses the `wallaroo.pipeline.publish()` method.  This requires that the Wallaroo Ops instance have [Edge Registry Services](https://docs.wallaroo.ai/wallaroo-operations-guide/wallaroo-configuration/wallaroo-edge-deployment/#enable-wallaroo-edge-deployment-registry) enabled.

The following publishes the pipeline to the OCI registry and displays the container details.  For more information, see [Wallaroo SDK Essentials Guide: Pipeline Edge Publication](https://docs.wallaroo.ai/wallaroo-developer-guides/wallaroo-sdk-guides/wallaroo-sdk-essentials-guide/wallaroo-sdk-essentials-pipelines/wallaroo-sdk-essentials-pipeline-publication/).

Note that the `engine_url` field specifies the Wallaroo engine type based on the targeted architecture, which is inherited from the ML model.

In [None]:
assay_pub_arm = pipeline_arm.publish()
assay_pub_arm

In [None]:
print(assay_pub_arm.engine_url)

### Deploy Models to ARM Architecture Edge Devices

The edge deployment is performed with `docker run`, `docker compose`, or `helm` installations.  The following code generates the `docker run` command, with the following values provided by the DevOps Engineer:

* `$REGISTRYURL`: The URL of the OCI registry service hosting the pipeline publish.
* `$OCI_USERNAME`: The username for the OCI registry service.
* `$OCI_PASSWORD`: The password or token for the OCI registry service.

For more details on model edge deployments with Wallaroo, see [Model Operations: Run Anywhere](https://staging.docs.wallaroo.ai/wallaroo-model-operations/wallaroo-model-operations-run-anywhere/).

The edge deployment configuration is taken from the pipeline publish, which shows the ARM engine architecture as part of its edge deployment configuration that was inherited from the model architecture setting.

In [None]:
# create docker run 

docker_command = f'''
docker run -p 8080:8080 \\
    -e DEBUG=true \\
    -e OCI_REGISTRY=$REGISTRYURL \\
    -e CONFIG_CPUS=1 \\
    -e OCI_USERNAME=$OCI_USERNAME \\
    -e OCI_PASSWORD=$OCI_PASSWORD \\
    -e PIPELINE_URL={assay_pub_arm.pipeline_url} \\
    {assay_pub_arm.engine_url}
'''

print(docker_command)

### Edge Inference Examples on ARM Architecture

The following examples demonstrate performing inferences on the model deployed on an `ARM` architecture edge device.  `HOSTNAME_ARM` is the hostname for the ARM architecture edge deployment.  Update this variable to match your deployment.

In [None]:
HOSTNAME_ARM = 'localhost'
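
If the edge deployment runs on the same machine as this notebook, as `localhost` suggests, it may be worth confirming that the host really is ARM before running inferences.  The following is a minimal sketch using the standard library; `is_arm_machine` is a hypothetical helper, and the set of identifiers covers the values commonly reported for 64-bit ARM hosts.

```python
import platform

# Machine identifiers commonly reported for 64-bit ARM hosts (Linux: aarch64, macOS: arm64).
ARM_MACHINES = {"aarch64", "arm64"}


def is_arm_machine(machine: str) -> bool:
    """Return True when a `platform.machine()` value names a 64-bit ARM CPU."""
    return machine.lower() in ARM_MACHINES


machine = platform.machine()
print(f"{machine}: {'ARM' if is_arm_machine(machine) else 'not ARM'}")
```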

In [None]:
# perform the inference through the ARM edge device
!curl -X POST {HOSTNAME_ARM}:8080/pipelines/{arm_pipeline_name} \
    -H "Content-Type: application/json; format=pandas-records" \
    --data @./data/test_summarization.df.json > results.df.json

# display the first 5 results
result_df = pd.read_json('./results.df.json', orient="records")
result_df.head(5)
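
The same edge request can also be assembled in Python rather than through `curl`.  This sketch builds the request with the standard library and does not send it; `build_edge_request` is a hypothetical helper, the URL follows the `/pipelines/{pipeline name}` path used above, and the record content is a placeholder.

```python
import json
import urllib.request


def build_edge_request(host: str, pipeline_name: str, records: list,
                       port: int = 8080) -> urllib.request.Request:
    """Build (but do not send) a POST request for an edge pipeline inference."""
    return urllib.request.Request(
        url=f"http://{host}:{port}/pipelines/{pipeline_name}",
        data=json.dumps(records).encode("utf-8"),
        headers={"Content-Type": "application/json; format=pandas-records"},
        method="POST",
    )


# Placeholder record; the text and flag values are illustrative assumptions.
record = {"inputs": "Placeholder text to summarize.", "return_text": True,
          "return_tensors": False, "clean_up_tokenization_spaces": False}
request = build_edge_request("localhost", "architecture-demonstration-arm", [record])
print(request.get_method(), request.full_url)

# To send it against a running edge deployment:
# with urllib.request.urlopen(request) as response:
#     results = json.loads(response.read())
```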