This tutorial and the assets are available as part of the [Wallaroo Tutorials repository](https://github.com/WallarooLabs/Wallaroo_Tutorials/blob/wallaroo2025.1_tutorials/development/mlops-api).

## Wallaroo Dashboard Metrics Retrieval Tutorial

The following tutorial demonstrates using the Wallaroo MLOps API to retrieve Wallaroo metrics data.  These requests are compliant with Prometheus API endpoints.  

This tutorial is split into two sections:

* Inference Data Generation:  This section creates a Wallaroo pipeline and submits inference requests to generate the log files and other data.
* Wallaroo Dashboard Metrics Retrieval via the Wallaroo MLOps API:  Details the Wallaroo MLOps API metrics retrieval endpoints and provides a demonstration of retrieving metrics data.

### Prerequisites

This tutorial assumes the following:

* A Wallaroo Ops environment is installed.
* The Wallaroo SDK is installed.  These examples use the Wallaroo SDK to generate the initial inference data for the metrics requests.

## Inference Data Generation

This part of the tutorial generates the inference results used for the rest of the tutorial.

### Import libraries

The first step is to import the libraries required.

In [1]:
import json
import numpy as np
import pandas as pd

import pytz
import datetime

import requests
from requests.auth import HTTPBasicAuth

import wallaroo

### Connect to the Wallaroo Instance

A connection to Wallaroo is established via the Wallaroo client.  The Python library is included in the Wallaroo install and available through the JupyterHub interface provided with your Wallaroo environment.

This is accomplished using the `wallaroo.Client()` command, which provides a URL to grant the SDK permission to your specific Wallaroo environment.  When displayed, enter the URL into a browser and confirm permissions.  Store the connection into a variable that can be referenced later.

If logging into the Wallaroo instance through the internal JupyterHub service, use `wl = wallaroo.Client()`.  For more information on Wallaroo Client settings, see the [Client Connection guide](https://docs.wallaroo.ai/wallaroo-developer-guides/wallaroo-sdk-guides/wallaroo-sdk-essentials-guide/wallaroo-sdk-essentials-client/).

In [None]:
wl = wallaroo.Client()

### Create Workspace

Next create the Wallaroo workspace and set it as the default workspace for this session - from this point on, model uploads and other commands will default to this workspace.

The workspace id is stored for further use.

In [49]:
workspace = wl.get_workspace(name="metric-retrieval-tutorial", create_if_not_exist=True)
wl.set_current_workspace(workspace)

{'name': 'metric-retrieval-tutorial', 'id': 1713, 'archived': False, 'created_by': '7d603858-88e0-472e-8f71-e41094afd7ec', 'created_at': '2025-08-05T18:41:42.646046+00:00', 'models': [], 'pipelines': []}

### Upload Model

For this example, the model `ccfraud.onnx` is used.  This model is trained to detect credit card fraud and outputs a score from 0 to 1:  the closer to 0, the less likely the transactions indicate fraud, while the closer to 1, the more likely the transactions indicate fraud.

This model runs in the Wallaroo Native Runtimes, so it requires no additional settings at model upload.  For more details on supported models, see [Wallaroo Supported Models](https://docs.wallaroo.ai/wallaroo-model-operations/wallaroo-model-operations-deploy/wallaroo-model-operations-upload-register/).

In [50]:
model_name = "ccfraud-model"
model_file_name = "./models/ccfraud.onnx"
ccfraud_model = (wl.upload_model(name=model_name, 
                                 path=model_file_name, 
                                 framework=wallaroo.framework.Framework.ONNX)
                                 .configure(tensor_fields=["tensor"])
                )

### Deploy Model

Models are deployed through the following process:

* Create a Wallaroo pipeline
* Add the model as a pipeline step
* Define a deployment configuration.  This defines what resources are allocated for the pipeline's exclusive use.

For more details on this process, see [ML Operations: Inference](https://docs.wallaroo.ai/wallaroo-model-operations/wallaroo-model-operations-serve/).

In [51]:
# create the pipeline
pipeline_name = "metrics-retrieval-tutorial-pipeline"
pipeline = wl.build_pipeline(pipeline_name)

# add the model as a pipeline step

pipeline.add_model_step(ccfraud_model)

# set the deployment configuration for 0.5 cpu, 1 replica, 1 Gi RAM
deploy_config = wallaroo.DeploymentConfigBuilder().replica_count(1).cpus(0.5).memory("1Gi").build()

# deploy the pipeline
pipeline.deploy(deployment_config=deploy_config, wait_for_status=False)
# saved for later steps
deploy = pipeline._deployment


Deployment initiated for metrics-retrieval-tutorial-pipeline. Please check pipeline status.


In [52]:
# wait until deployment is complete before continuing
import time
time.sleep(15)

# poll the pipeline status every 15 seconds until it reports 'Running'
while pipeline.status()['status'] != 'Running':
    time.sleep(15)
    print("Waiting for deployment.")
pipeline.status()['status']


Waiting for deployment.
Waiting for deployment.
Waiting for deployment.
Waiting for deployment.
Waiting for deployment.


'Running'
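
The wait loop above polls until the pipeline reports `Running`, with no upper bound on how long it waits. The following is a minimal sketch of a bounded variant. The helper name `wait_for_status` and its parameters are hypothetical and not part of the Wallaroo SDK; it accepts any zero-argument callable that returns a status string.

```python
import time

def wait_for_status(get_status, target="Running", timeout_s=300.0, interval_s=15.0):
    """Poll get_status() until it returns target or timeout_s elapses.

    get_status is any zero-argument callable returning a status string.
    Returns True if the target status was reached, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if get_status() == target:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

With the pipeline from this tutorial, the call could look like `wait_for_status(lambda: pipeline.status()['status'])`, which gives up after five minutes instead of looping indefinitely.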

### Sample Inferences

The following sample inferences are used to generate inference log records.  Metric retrieval works best with a longer history of inference results; feel free to rerun this section as needed to create additional records for further testing.

The following will run for one minute.

In [53]:
import time
timeout = time.time() + 60   # 1 minute from now
while True:
    if time.time() > timeout:
        break
    pipeline.infer_from_file("./data/cc_data_10k.arrow")

### Retrieve Pipeline Logs

The following retrieves the inference log results for the pipeline.

In [54]:
pipeline.logs()



| &nbsp; | time | in.tensor | out.dense_1 | anomaly.count |
|---|---|---|---|---|
| 0 | 2025-08-05 18:44:22.769 | [-0.12405868, 0.73698884, 1.0311689, 0.5991753... | [0.0010648072] | 0 |
| 1 | 2025-08-05 18:44:22.769 | [-2.1694233, -3.1647356, 1.2038506, -0.2649221... | [0.00024175644] | 0 |
| 2 | 2025-08-05 18:44:22.769 | [-0.24798988, 0.40499672, 0.49408177, -0.37252... | [0.00150159] | 0 |
| 3 | 2025-08-05 18:44:22.769 | [-0.2260837, 0.12802614, -0.8732004, -2.089788... | [0.00037947297] | 0 |
| 4 | 2025-08-05 18:44:22.769 | [-0.90164274, -0.50116056, 1.2045985, 0.407885... | [0.0001988411] | 0 |
| ... | ... | ... | ... | ... |
| 95 | 2025-08-05 18:44:22.769 | [-0.1093998, -0.031678658, 0.9885652, -0.68602... | [0.00020942092] | 0 |
| 96 | 2025-08-05 18:44:22.769 | [0.44973943, -0.35288164, 0.5224735, 0.910402,... | [0.00031492114] | 0 |
| 97 | 2025-08-05 18:44:22.769 | [0.82174337, -0.50793207, -1.358988, 0.3713617... | [0.00081187487] | 0 |
| 98 | 2025-08-05 18:44:22.769 | [1.0252348, 0.37717652, -1.4182774, 0.7057443,... | [0.001860708] | 0 |


## Wallaroo Dashboard Metrics Retrieval via the Wallaroo MLOps API

The Wallaroo MLOps API allows for metrics retrieval.  These are used to track:

* Inference result performance.
* Deployed replicas.
* Inference Latency.

These metrics endpoints are compliant with Prometheus API endpoints.

<details>
<summary><h3 id="supported-queries">Supported Queries</h3></summary>
The following queries are supported through the Metrics endpoints.  The following references are used here:

* `pipelineID`:  The pipeline's name, retrieved from the Wallaroo SDK with `wallaroo.pipeline.Pipeline.name()`.  For example:

    ```python
    pipeline.name()
    ```

    ```text
    sample-pipeline-name
    ```

* `deployment_id`: The Kubernetes namespace for the deployment.

| English Name | Parameterized Query | Example Query | Description |
|---|---|---|---|
| Requests per second | `sum by (pipeline_name) (rate(latency_histogram_ns_count{pipeline_name="{pipelineID}"}[{step}s]))` | `sum by (deploy_id) (rate(latency_histogram_ns_count{deploy_id="deployment_id"}[10s]))` | Number of processed requests per second to a pipeline. |
| Cluster inference rate | `sum by (pipeline_name) (rate(tensor_throughput_batch_count{pipeline_name="{pipelineID}"}[{step}s]))` | `sum by (deploy_id) (rate(tensor_throughput_batch_count{deploy_id="deployment_id"}[10s]))` | Number of inferences processed per second.  This notably differs from requests per second when batch inference requests are made. |
| P50 inference latency | `histogram_quantile(0.50, sum(rate(latency_histogram_ns_bucket{{deploy_id="{deploy_id}"}}[{step_interval}])) by (le)) / 1e6` | `histogram_quantile(0.50, sum(rate(latency_histogram_ns_bucket{deploy_id="deployment_id"}[10s])) by (le)) / 1e6` | Histogram for P50 total inference time spent per message in an engine, includes transport to and from the sidekick in the case there is one. |
| P95 inference latency | `histogram_quantile(0.95, sum(rate(latency_histogram_ns_bucket{{deploy_id="{deploy_id}"}}[{step_interval}])) by (le)) / 1e6` | `histogram_quantile(0.95, sum(rate(latency_histogram_ns_bucket{deploy_id="deployment_id"}[10s])) by (le)) / 1e6` | Histogram for P95 total inference time spent per message in an engine, includes transport to and from the sidekick in the case there is one. |
| P99 inference latency | `histogram_quantile(0.99, sum(rate(latency_histogram_ns_bucket{{deploy_id="{deploy_id}"}}[{step_interval}])) by (le)) / 1e6` | `histogram_quantile(0.99, sum(rate(latency_histogram_ns_bucket{deploy_id="deployment_id"}[10s])) by (le)) / 1e6` | Histogram for P99 total inference time spent per message in an engine, includes transport to and from the sidekick in the case there is one. |
| Engine replica count | `count(container_memory_usage_bytes{namespace="{pipeline_namespace}", container="engine"}) or vector(0)` | `count(container_memory_usage_bytes{namespace="deployment_id", container="engine"}) or vector(0)` | Number of engine replicas currently running in a pipeline |
| Sidekick replica count | `count(container_memory_usage_bytes{namespace="{pipeline_namespace}", container=~"engine-sidekick-.*"}) or vector(0)` | `count(container_memory_usage_bytes{namespace="deployment_id", container=~"engine-sidekick-.*"}) or vector(0)` | Number of sidekick replicas currently running in a pipeline |
| Output tokens per second (TPS) | `sum by (kubernetes_namespace) (rate(vllm:generation_tokens_total{kubernetes_namespace="{pipeline_namespace}"}[{step_interval}]))` | `sum by (kubernetes_namespace) (rate(vllm:generation_tokens_total{kubernetes_namespace="deployment_id"}[10s]))` | LLM output tokens per second: this is the number of tokens generated per second for a LLM deployed in Wallaroo with vLLM |
| P99 Time to first token (TTFT) | `histogram_quantile(0.99, sum(rate(vllm:time_to_first_token_seconds_bucket{kubernetes_namespace="{pipeline_namespace}"}[{step_interval}])) by (le)) * 1000` | `histogram_quantile(0.99, sum(rate(vllm:time_to_first_token_seconds_bucket{kubernetes_namespace="deployment_id"}[10s])) by (le)) * 1000` | P99 time to first token: P99 for time to generate the first token for LLMs deployed in Wallaroo with vLLM |
| P95 Time to first token (TTFT) | `histogram_quantile(0.95, sum(rate(vllm:time_to_first_token_seconds_bucket{kubernetes_namespace="{pipeline_namespace}"}[{step_interval}])) by (le)) * 1000` | `histogram_quantile(0.95, sum(rate(vllm:time_to_first_token_seconds_bucket{kubernetes_namespace="deployment_id"}[10s])) by (le)) * 1000` | P95 time to first token: P95 for time to generate the first token for LLMs deployed in Wallaroo with vLLM |
| P50 Time to first token (TTFT) | `histogram_quantile(0.50, sum(rate(vllm:time_to_first_token_seconds_bucket{kubernetes_namespace="{pipeline_namespace}"}[{step_interval}])) by (le)) * 1000` | `histogram_quantile(0.50, sum(rate(vllm:time_to_first_token_seconds_bucket{kubernetes_namespace="deployment_id"}[10s])) by (le)) * 1000` | P50 time to first token: P50 for time to generate the first token for LLMs deployed in Wallaroo with vLLM |

</details>
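
The latency rows in the table above differ only in the quantile value. As a minimal sketch, a query string for any of them can be built from one template. The function name `latency_quantile_query` and the sample deployment id are hypothetical; the division by `1e6` is taken from the table and presumably converts nanoseconds to milliseconds, given the `_ns` suffix in the metric name.

```python
def latency_quantile_query(deploy_id: str, quantile: float, window: str = "1m") -> str:
    """Fill the table's parameterized latency query for one deployment.

    deploy_id is the deployment's Kubernetes namespace, quantile is a value
    between 0 and 1, and window is a Prometheus duration such as "10s".
    """
    if not 0.0 <= quantile <= 1.0:
        raise ValueError("quantile must be between 0 and 1")
    return (
        f"histogram_quantile({quantile}, "
        f'sum(rate(latency_histogram_ns_bucket{{deploy_id="{deploy_id}"}}[{window}])) '
        f"by (le)) / 1e6"
    )

# "sample-deployment-id" is a placeholder, not a real deployment.
for q in (0.5, 0.95, 0.99):
    print(latency_quantile_query("sample-deployment-id", q, "10s"))
```

The doubled braces in the f-string produce the literal `{` and `}` that PromQL uses for label selectors.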

### Query Metric Request Endpoints

* **Endpoints**: 
  * `/v1/metrics/api/v1/query` (**GET**)
  * `/v1/metrics/api/v1/query` (**POST**)

#### Query Metric Request Parameters

| Parameter | Type | Description |
|---|---|---|
| query | *String* | The Prometheus expression query string. |
| time | *String* | The evaluation timestamp in either RFC3339 format or Unix timestamp. |
| timeout | *String* | The evaluation timeout in duration format (`5m` for 5 minutes, etc). |
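
The parameters above can be assembled into a GET request URL as in the following sketch. The helper name `build_instant_query_url` and the host `wallaroo.example.com` are hypothetical, and the path is an assumption modeled on the `query_range` URL used in the example request later in this tutorial; adjust it to match your Wallaroo release.

```python
from urllib.parse import urlencode

def build_instant_query_url(api_endpoint: str, query: str, at_time=None, timeout=None) -> str:
    """Return a GET URL for an instant metrics query.

    Only `query` is always sent; `time` and `timeout` are added when provided.
    The path used here is an assumption, not a documented constant.
    """
    params = {"query": query}
    if at_time is not None:
        params["time"] = at_time      # RFC3339 string or Unix timestamp
    if timeout is not None:
        params["timeout"] = timeout   # duration string such as "30s"
    return f"{api_endpoint.rstrip('/')}/v1/metrics/api/v1/query?{urlencode(params)}"

# Engine replica count query from the Supported Queries table,
# with the placeholder namespace "deployment_id".
query = 'count(container_memory_usage_bytes{namespace="deployment_id", container="engine"}) or vector(0)'
url = build_instant_query_url("https://wallaroo.example.com", query, at_time=1754419440, timeout="30s")
print(url)
```

`urlencode` percent-encodes the braces, quotes, and spaces in the PromQL expression, which is why the query should be passed as a parameter rather than concatenated into the URL by hand.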

#### Query Metric Request Returns



| Field | &nbsp; | Type | Description |
|---|---|---|---|
| **status** | &nbsp; | *String* | The status of the request of either `success` or `error`. |
| **data** | &nbsp; | *Dict* | The response data. |
| &nbsp; | **data.resultType** | *String* | The type of query result. |
| &nbsp; | **data.result** | *Array* | The query results.  The structure depends on `data.resultType`. |
| **errorType** | &nbsp; | *String* | The error type if `status` is `error`. |
| **error** | &nbsp; | *String* | The error message if `status` is `error`. |
| **warnings** | &nbsp; | *Array[String]* | An array of warning messages. |
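
A response with these fields can be checked before its data is used. The following is a sketch only: the function name `unwrap_metrics_response` is hypothetical, and it assumes the Prometheus convention in which the error message is carried in an `error` field alongside `errorType`.

```python
def unwrap_metrics_response(body: dict):
    """Return (resultType, result) on success; raise RuntimeError on error."""
    if body.get("status") != "success":
        error_type = body.get("errorType", "unknown")
        message = body.get("error", "no message")
        raise RuntimeError(f"{error_type}: {message}")
    # Warnings do not fail the request, so they are only reported.
    for warning in body.get("warnings", []):
        print("warning:", warning)
    data = body.get("data", {})
    return data.get("resultType"), data.get("result", [])

# Hand-written sample bodies, not output from a real server.
ok_body = {"status": "success", "data": {"resultType": "vector", "result": []}}
print(unwrap_metrics_response(ok_body))
```

With `requests`, the parsed body would come from `response.json()`.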

### Query Range Metric Endpoints

* **Endpoints**
  * `/v1/metrics/api/v1/query_range` (**GET**)
  * `/v1/metrics/api/v1/query_range` (**POST**)

Evaluates a Prometheus expression query over a range of time.

#### Query Range Metric Request Parameters

| Parameter | Type | Description |
|---|---|---|
| query | *String* | The Prometheus expression query string. |
| start | *String* | The starting timestamp in either RFC3339 format or Unix timestamp, inclusive. |
| end | *String* | The ending timestamp in either RFC3339 format or Unix timestamp. |
| step | *String* | Query resolution step width in either duration format or as a float number of seconds. |
| timeout | *String* | The evaluation timeout in duration format (`5m` for 5 minutes, etc). |
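
The `start`, `end`, and `step` parameters together determine how many evaluation timestamps a range query covers. The sketch below computes that count; the helper names are hypothetical, and the duration parser handles only single-unit values such as `30s` or `1m`. Stock Prometheus rejects range queries that would return more than 11,000 points per series; whether the Wallaroo endpoint applies the same limit is not stated here.

```python
import math

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}

def step_to_seconds(step) -> float:
    """Convert a step such as "1m", "30s", or 15 into seconds."""
    if isinstance(step, (int, float)):
        return float(step)
    return float(step[:-1]) * _UNIT_SECONDS[step[-1]]

def expected_points(start: int, end: int, step) -> int:
    """Number of evaluation timestamps from start to end at the given step."""
    return math.floor((end - start) / step_to_seconds(step)) + 1

# Unix timestamps for 2025-08-04 09:00:00 and 2025-08-06 09:59:59 US/Central,
# the range used in this tutorial's example request.
print(expected_points(1754316000, 1754492399, "1m"))
```

This is an upper bound per series: timestamps at which a series has no data are omitted from the response.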

#### Query Range Metric Request Returns

| Field | &nbsp; | Type | Description |
|---|---|---|---|
| **status** | &nbsp; | *String* | The status of the request of either `success` or `error`. |
| **data** | &nbsp; | *Dict* | The response data. |
| &nbsp; | **data.resultType** | *String* | The type of query result. For query range, always `matrix`. |
| &nbsp; | **data.result** | *Array* | The query results, one entry per series, each with its `metric` labels and a list of `[timestamp, value]` pairs in `values`. |
| **errorType** | &nbsp; | *String* | The error type if `status` is `error`. |
| **error** | &nbsp; | *String* | The error message if `status` is `error`. |
| **warnings** | &nbsp; | *Array[String]* | An array of warning messages. |
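
In the Prometheus response format, a `matrix` result carries a list of `[timestamp, value]` pairs under `values`, while a `vector` result from an instant query carries a single pair under `value`. The following sketch flattens both shapes into the same tuples. The function name `iter_samples` is hypothetical, and the vector sample is hand-written rather than taken from a server.

```python
def iter_samples(data: dict):
    """Yield (labels, timestamp, value) tuples from a 'vector' or 'matrix' result."""
    result_type = data.get("resultType")
    for series in data.get("result", []):
        labels = series.get("metric", {})
        if result_type == "matrix":
            pairs = series.get("values", [])
        elif result_type == "vector":
            pairs = [series["value"]]
        else:
            raise ValueError(f"unsupported resultType: {result_type!r}")
        for timestamp, value in pairs:
            # Sample values arrive as strings and are converted to floats.
            yield labels, float(timestamp), float(value)

matrix = {"resultType": "matrix", "result": [
    {"metric": {"pipeline_name": "p"}, "values": [[1754419440, "0.61195"], [1754419500, "0"]]}]}
vector = {"resultType": "vector", "result": [
    {"metric": {"pipeline_name": "p"}, "value": [1754419440, "0.61195"]}]}

for labels, ts, value in iter_samples(matrix):
    print(labels["pipeline_name"], ts, value)
```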

### Example Metric Request

The following example issues a Query Range request for the requests per second metric.  For this example, the following Wallaroo SDK methods are used:

* `wl.api_endpoint`: Retrieves the API endpoint for the Wallaroo Ops server.
* `wl.auth.auth_header()`: Retrieves the authentication bearer tokens.

In [57]:
# set prometheus requirements
pipeline_id = pipeline_name # the name of the pipeline
step = "1m" # the step of the calculation

# this will also format the timezone in the parsing section
timezone = "US/Central"

selected_timezone = pytz.timezone(timezone)

# Define the start and end times
data_start = selected_timezone.localize(datetime.datetime(2025, 8, 4, 9, 0, 0))
data_end = selected_timezone.localize(datetime.datetime(2025, 8, 6, 9, 59, 59))

# this is the URL to get prometheus metrics
query_url = f"{wl.api_endpoint}/v1/metrics/api/v1/query_range"

In [70]:
# Retrieve the token 
headers = wl.auth.auth_header()

# Convert to UTC and get the Unix timestamps
start_timestamp = int(data_start.astimezone(pytz.UTC).timestamp())
end_timestamp = int(data_end.astimezone(pytz.UTC).timestamp())    

query_rps = f'sum by (pipeline_name) (rate(latency_histogram_ns_count{{pipeline_name="{pipeline_id}"}}[{step}]))'
#requests per second
params_rps = {
    'query': query_rps,
    'start': start_timestamp,
    'end': end_timestamp,
    'step': step
}

response_rps = requests.get(query_url, headers=headers, params=params_rps)

if response_rps.status_code == 200:
    print("Requests Per Second Data:", response_rps.json())
else:
    print("Failed to fetch RPS data:", response_rps.status_code, response_rps.text)

def parse_prometheus_data(data, timezone):
    parsed_data = []
    for series in data:
        pipeline_name = series['metric']['pipeline_name']
        for timestamp, value in series['values']:
            # parse the Unix timestamp directly as UTC (datetime.utcfromtimestamp is deprecated since Python 3.12)
            utc_time = pd.to_datetime(int(timestamp), unit='s', utc=True)
            central_time = utc_time.tz_convert(timezone)
            
            parsed_data.append({
                'timestamp': central_time,
                'pipeline_name': pipeline_name,
                'value': float(value)
            })
    return parsed_data

# extract the results from the request response
rps_data = response_rps.json().get('data', {}).get('result', [])

# Parse the requests per second data
rps_parsed_data = parse_prometheus_data(rps_data,timezone)

rps_df = pd.DataFrame(rps_parsed_data)

print("")
print("Requests Per Second Data:")
display(rps_df.head()) 
print("")

# Display the rows where value is not 0
print("Non-zero Requests Per Second Data:")
display(rps_df[rps_df['value'] != 0])

Requests Per Second Data: {'status': 'success', 'data': {'resultType': 'matrix', 'result': [{'metric': {'pipeline_name': 'metrics-retrieval-tutorial-pipeline'}, 'values': [[1754419440, '0.61195'], [1754419500, '0.43636363636363634'], [1754419560, '0'], [1754419620, '0'], [1754419680, '0'], [1754419740, '0'], [1754419800, '0'], [1754419860, '0'], [1754419920, '0'], [1754419980, '0'], [1754420040, '0'], [1754420100, '0'], [1754420160, '0'], [1754420220, '0'], [1754420280, '0'], [1754420340, '0'], [1754420400, '0'], [1754420460, '0']]}]}}

Requests Per Second Data:


| &nbsp; | timestamp | pipeline_name | value |
|---|---|---|---|
| 0 | 2025-08-05 13:44:00-05:00 | metrics-retrieval-tutorial-pipeline | 0.61195 |
| 1 | 2025-08-05 13:45:00-05:00 | metrics-retrieval-tutorial-pipeline | 0.436364 |
| 2 | 2025-08-05 13:46:00-05:00 | metrics-retrieval-tutorial-pipeline | 0.0 |
| 3 | 2025-08-05 13:47:00-05:00 | metrics-retrieval-tutorial-pipeline | 0.0 |
| 4 | 2025-08-05 13:48:00-05:00 | metrics-retrieval-tutorial-pipeline | 0.0 |



Non-zero Requests Per Second Data:


| &nbsp; | timestamp | pipeline_name | value |
|---|---|---|---|
| 0 | 2025-08-05 13:44:00-05:00 | metrics-retrieval-tutorial-pipeline | 0.61195 |
| 1 | 2025-08-05 13:45:00-05:00 | metrics-retrieval-tutorial-pipeline | 0.436364 |
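
Because each sample is a per-second rate averaged over a one-minute window and the samples are one minute apart, multiplying each rate by the step and summing gives a rough estimate of the total number of requests. The sketch below applies this to the first values returned above; the function name `estimate_total` is hypothetical, and the result is an approximation, since Prometheus extrapolates rates at window boundaries.

```python
def estimate_total(values, step_seconds):
    """Approximate an event count by summing rate * step over all samples."""
    return sum(float(v) for _, v in values) * step_seconds

# First three [timestamp, value] pairs from the response shown above.
rps_values = [
    [1754419440, "0.61195"],
    [1754419500, "0.43636363636363634"],
    [1754419560, "0"],
]
print(round(estimate_total(rps_values, 60), 1))  # → 62.9
```

Roughly 63 requests is the estimate for the one-minute inference loop run earlier in this tutorial.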


The following query retrieves the cluster inference rate.

In [64]:
query_inference_rate = f'sum by (pipeline_name) (rate(tensor_throughput_batch_count{{pipeline_name="{pipeline_id}"}}[{step}]))'

# inference rate
params_inference_rate = {
    'query': query_inference_rate,
    'start': start_timestamp,
    'end': end_timestamp,
    'step': step
}

response_inference_rate = requests.get(query_url, headers=headers, params=params_inference_rate)

if response_inference_rate.status_code == 200:
    print("Cluster Inference Rate Data:", response_inference_rate.json())
else:
    print("Failed to fetch Inference Rate data:", response_inference_rate.status_code, response_inference_rate.text)


Cluster Inference Rate Data: {'status': 'success', 'data': {'resultType': 'matrix', 'result': [{'metric': {'pipeline_name': 'metrics-retrieval-tutorial-pipeline'}, 'values': [[1754419440, '6274.9353'], [1754419500, '4474.472727272727'], [1754419560, '0'], [1754419620, '0'], [1754419680, '0'], [1754419740, '0'], [1754419800, '0'], [1754419860, '0'], [1754419920, '0'], [1754419980, '0'], [1754420040, '0'], [1754420100, '0']]}]}}


In [73]:
inference_rate_data = response_inference_rate.json().get('data', {}).get('result', [])

inference_rate_parsed_data = parse_prometheus_data(inference_rate_data, timezone)

inference_rate_df = pd.DataFrame(inference_rate_parsed_data)

print("\nCluster Inference Rate Data:")
display(inference_rate_df.head())

# Display the rows where value is not 0
print("Non-zero Cluster Inference Rate Data:")
display(inference_rate_df[inference_rate_df['value'] != 0])


Cluster Inference Rate Data:


| &nbsp; | timestamp | pipeline_name | value |
|---|---|---|---|
| 0 | 2025-08-05 13:44:00-05:00 | metrics-retrieval-tutorial-pipeline | 6274.9353 |
| 1 | 2025-08-05 13:45:00-05:00 | metrics-retrieval-tutorial-pipeline | 4474.472727 |
| 2 | 2025-08-05 13:46:00-05:00 | metrics-retrieval-tutorial-pipeline | 0.0 |
| 3 | 2025-08-05 13:47:00-05:00 | metrics-retrieval-tutorial-pipeline | 0.0 |
| 4 | 2025-08-05 13:48:00-05:00 | metrics-retrieval-tutorial-pipeline | 0.0 |


Non-zero Cluster Inference Rate Data:


| &nbsp; | timestamp | pipeline_name | value |
|---|---|---|---|
| 0 | 2025-08-05 13:44:00-05:00 | metrics-retrieval-tutorial-pipeline | 6274.9353 |
| 1 | 2025-08-05 13:45:00-05:00 | metrics-retrieval-tutorial-pipeline | 4474.472727 |
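
The cluster inference rate differs from requests per second when requests are batched. Dividing one series by the other at matching timestamps shows the average number of inferences per request. This is a sketch using the values returned above; the function name `inferences_per_request` is hypothetical.

```python
def inferences_per_request(rps_values, inference_values):
    """Pair samples by timestamp and divide inference rate by request rate."""
    rps_by_ts = {ts: float(v) for ts, v in rps_values}
    ratios = {}
    for ts, v in inference_values:
        rps = rps_by_ts.get(ts, 0.0)
        if rps > 0:  # skip timestamps with no requests to avoid dividing by zero
            ratios[ts] = float(v) / rps
    return ratios

# [timestamp, value] pairs copied from the two responses shown in this tutorial.
rps_values = [[1754419440, "0.61195"], [1754419500, "0.43636363636363634"], [1754419560, "0"]]
inference_values = [[1754419440, "6274.9353"], [1754419500, "4474.472727272727"], [1754419560, "0"]]

for ts, ratio in inferences_per_request(rps_values, inference_values).items():
    print(ts, round(ratio))
```

Both non-zero samples give about 10,254 inferences per request, which is consistent with each request submitting the batch file `cc_data_10k.arrow`.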
