# Ray Dashboard with Prometheus + Grafana Metrics Enabled
This approach starts the Ray on Spark cluster and its dashboard interactively (from a notebook or Python script), then runs a bash script (`setup_monitoring.sh`, in the same directory) to enable the additional Prometheus and Grafana metrics.

This code was tested on a cluster with the following config:
```
        "cluster_name": "ray_multinode_no_init",
        "spark_version": "15.4.x-scala2.12",
        "spark_conf": {
            "spark.databricks.pyspark.dataFrameChunk.enabled": "true"
        },
        "single_user_name": "...@databricks.com",
        "data_security_mode": "DATA_SECURITY_MODE_AUTO",
        "runtime_engine": "STANDARD",
        "kind": "CLASSIC_PREVIEW",
        "use_ml_runtime": true,
        "is_single_node": false,
        "num_workers": 6
```

## Step 1: Install and Set Up the Ray Cluster

In [0]:
%pip install -U "ray[default]"
dbutils.library.restartPython()

Collecting ray[default]
  Obtaining dependency information for ray[default] from https://files.pythonhosted.org/packages/c1/2b/f2efd0e7bcef06d51422db1af48cc5695a3f9b40a444f9d270a2d4663252/ray-2.49.2-cp311-cp311-manylinux2014_x86_64.whl.metadata
  Downloading ray-2.49.2-cp311-cp311-manylinux2014_x86_64.whl.metadata (21 kB)
Collecting opentelemetry-sdk>=1.30.0 (from ray[default])
  Obtaining dependency information for opentelemetry-sdk>=1.30.0 from https://files.pythonhosted.org/packages/9f/62/9f4ad6a54126fb00f7ed4bb5034964c6e4f00fcd5a905e115bd22707e20d/opentelemetry_sdk-1.37.0-py3-none-any.whl.metadata
  Downloading opentelemetry_sdk-1.37.0-py3-none-any.whl.metadata (1.5 kB)
Collecting opentelemetry-exporter-prometheus (from ray[default])
  Obtaining dependency information for opentelemetry-exporter-prometheus from https://files.pythonhosted.org/packages/a6/e3/50e9cdc5a52c2ab19585dd69e668ec9fee0343fafc4bffa919ca79230a4f/opentelemetry_exporter_prometheus-0.58b0-py3-none-any.whl.metadata


In [0]:
# Build the driver-proxy URL for Grafana; this uses the same approach as the Ray on Spark setup: https://github.com/ray-project/ray/blob/c11c8583cbebf62408204c0c75a6570cf56b37c9/python/ray/util/spark/databricks_hook.py#L62

import os
# Note: the URL built below must include the protocol and must NOT end with a trailing slash after the port
grafana_port = 3000
driverLocal = spark._jvm.com.databricks.backend.daemon.driver.DriverLocal
commandContextTags = driverLocal.commandContext().get().toStringMap().apply("tags")
orgId = commandContextTags.apply("orgId")
clusterId = commandContextTags.apply("clusterId")
proxy_link = f"/driver-proxy/o/{orgId}/{clusterId}/{grafana_port}"
proxy_url = f"https://dbc-dp-{orgId}.cloud.databricks.com{proxy_link}"

# REQUIRED: the bash script needs these environment variables set to function properly
print(f"Setting Grafana IFrame host to: {proxy_url}")
os.environ["RAY_GRAFANA_IFRAME_HOST"] = proxy_url
os.environ["CLUSTER_ID"] = clusterId

Setting Grafana IFrame host to: https://dbc-dp-1444828305810485.cloud.databricks.com/driver-proxy/o/1444828305810485/0304-200350-iqaebq6s/3000
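
The URL assembly in the cell above can be factored into a small helper, which makes the no-trailing-slash rule easy to check. This is a minimal sketch: `build_proxy_url` is a hypothetical name, not part of Ray or Databricks, and the `dbc-dp-{orgId}.cloud.databricks.com` host pattern is simply the one this notebook uses on AWS; other clouds or workspaces may need a different host.

```python
def build_proxy_url(org_id: str, cluster_id: str, port: int) -> str:
    """Return the driver-proxy URL for a port on the driver (hypothetical helper)."""
    # Host pattern copied from the cell above; assumed to be AWS-specific.
    host = f"https://dbc-dp-{org_id}.cloud.databricks.com"
    # Path layout copied from the cell above: /driver-proxy/o/<org>/<cluster>/<port>
    url = f"{host}/driver-proxy/o/{org_id}/{cluster_id}/{port}"
    # The protocol must be present and there must be no trailing slash.
    assert url.startswith("https://") and not url.endswith("/")
    return url


# Placeholder IDs for illustration only.
example_url = build_proxy_url("1234567890", "0000-000000-example", 3000)
print(example_url)
```

In the notebook, the same function could be called with the `orgId`, `clusterId`, and `grafana_port` values read from the command context.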


In [0]:
# Standard Global Ray cluster setup script - see docs: https://docs.databricks.com/aws/en/machine-learning/ray/ray-create
import ray
from ray.util.spark import setup_global_ray_cluster
setup_global_ray_cluster(
    min_worker_nodes=1,
    max_worker_nodes=6,
    # collect_log_to_path="/Volumes/your_catalog/your_schema/ray_logs",
    is_blocking=False,
    head_node_options={"dashboard_port": 9999},
)

ray.init(ignore_reinit_error=True)

2025-09-23 18:16:24,471	INFO cluster_init.py:569 -- Ray head hostname: 10.0.22.190, port: 9502, ray client server port: 10001.


2025-09-23 18:16:25,689	INFO usage_lib.py:473 -- Usage stats collection is enabled by default without user confirmation because this terminal is detected to be non-interactive. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
2025-09-23 18:16:25,689	INFO scripts.py:913 -- Local node IP: 10.0.22.190
2025-09-23 18:16:28,198	SUCC scripts.py:949 -- --------------------
2025-09-23 18:16:28,198	SUCC scripts.py:950 -- Ray runtime started.
2025-09-23 18:16:28,198	SUCC scripts.py:951 -- --------------------
2025-09-23 18:16:28,198	INFO scripts.py:953 -- Next steps
2025-09-23 18:16:28,199	INFO scripts.py:956 -- To add another node to this Ray cluster, run
2025-09-23 18:16:28,199	INFO scripts.py:959 --   ray start --address='10.0.22.190:950

2025-09-23 18:16:44,494	INFO cluster_init.py:693 -- Ray head node started.
2025-09-23 18:16:44,495	INFO databricks_hook.py:116 -- The Ray cluster will keep running until you manually detach the Databricks notebook or call `ray.util.spark.shutdown_ray_cluster()`.
2025-09-23 18:16:44,506	INFO worker.py:1771 -- Connecting to existing Ray cluster at address: 10.0.22.190:9502...
2025-09-23 18:16:44,523	INFO worker.py:1942 -- Connected to Ray cluster. View the dashboard at http://10.0.22.190:9999


To monitor and debug Ray from Databricks, view the dashboard at 
 https://dbc-dp-1444828305810485.cloud.databricks.com/driver-proxy/o/1444828305810485/0304-200350-iqaebq6s/9999/


2025-09-23 18:16:48,637	INFO cluster_init.py:168 -- Started 1 Ray worker nodes, meet the minimum number of Ray worker nodes required.
2025-09-23 18:16:48,731	INFO worker.py:1630 -- Using address 10.0.22.190:9502 set in the environment variable RAY_ADDRESS
2025-09-23 18:16:48,733	INFO worker.py:1771 -- Connecting to existing Ray cluster at address: 10.0.22.190:9502...
2025-09-23 18:16:48,739	INFO worker.py:1942 -- Connected to Ray cluster. View the dashboard at http://10.0.22.190:9999


Python version: 3.11.11
Ray version: 2.49.2
Dashboard: http://10.0.22.190:9999


## Step 2: Run the Script to Initialize Prometheus + Grafana on the Running Ray Cluster

Advanced users can adapt this baseline script to their own monitoring requirements.

In [0]:
%sh ./setup_monitoring.sh

[INFO] Using Cloud=AWS
[INFO] Using Org ID=1444828305810485
[INFO] Using Cluster Id=0304-200350-iqaebq6s
[INFO] Using Data Plane URL=dbc-dp-1444828305810485.cloud.databricks.com
[INFO] Using ray_grafana_iframe_host=https://dbc-dp-1444828305810485.cloud.databricks.com/driver-proxy/o/1444828305810485/0304-200350-iqaebq6s/3000
[INFO] Setup completed! Check logs in /local_disk0/tmp/ for details.


./setup_monitoring.sh: line 119: RAY_DASHBOARD_URL: unbound variable
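
The `unbound variable` message is what bash typically prints when a script running with `set -u` (nounset) reads a variable that was never set. A preflight check run from the notebook before the script can surface this earlier. The sketch below is only an illustration under assumptions: `missing_env_vars` is a hypothetical helper, the first two names come from the setup cell in Step 1, and `RAY_DASHBOARD_URL` is included only because the error above names it; what value the script expects for it is not shown here.

```python
import os
from typing import Iterable, List, Mapping, Optional


def missing_env_vars(
    names: Iterable[str], env: Optional[Mapping[str, str]] = None
) -> List[str]:
    """Return the names that are unset or empty in env (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# First two names are set in Step 1; the third is taken from the error message.
REQUIRED = ["RAY_GRAFANA_IFRAME_HOST", "CLUSTER_ID", "RAY_DASHBOARD_URL"]

missing = missing_env_vars(REQUIRED)
if missing:
    print(f"Unset before running setup_monitoring.sh: {', '.join(missing)}")
else:
    print("All expected environment variables are set.")
```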


Open the Ray Dashboard and go to the Metrics tab to see metrics in real time as Ray tasks run.
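
If the Metrics tab stays empty, one quick check is whether Prometheus and Grafana are answering on the driver at all. The sketch below, meant to run on the driver, probes their standard health endpoints (`/-/healthy` for Prometheus, `/api/health` for Grafana). The Grafana port 3000 matches `grafana_port` in Step 1; the Prometheus port 9090 is only its upstream default and an assumption about what `setup_monitoring.sh` configures. `is_up` is a hypothetical helper.

```python
import http.client
import urllib.error
import urllib.request


def is_up(url: str, timeout: float = 2.0) -> bool:
    """Return True if an HTTP GET to url answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Connection refused, timeout, HTTP error status, or malformed URL.
        return False


# Ports: 3000 comes from Step 1; 9090 is assumed (Prometheus default).
CHECKS = {
    "Prometheus": "http://127.0.0.1:9090/-/healthy",
    "Grafana": "http://127.0.0.1:3000/api/health",
}

for name, url in CHECKS.items():
    print(f"{name}: {'up' if is_up(url) else 'not reachable'} ({url})")
```

A "not reachable" result here points at the monitoring services themselves rather than at the dashboard or the driver proxy.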

The cell below runs a smoke-test Ray workload to confirm that metrics are being populated.

In [0]:
# Arbitrary Ray workload to check that metrics render
import ray

@ray.remote
class Counter:
    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1
        return self.value

    def get_counter(self):
        return self.value

# Create an actor from this class.
counter = Counter.remote()


obj_ref = counter.increment.remote()
print(ray.get(obj_ref))

# Create ten Counter actors.
counters = [Counter.remote() for _ in range(10)]

# Increment each Counter once and get the results. These tasks all happen in
# parallel.
results = ray.get([c.increment.remote() for c in counters])
print(results)

# Increment the first Counter five times. These tasks are executed serially
# and share state.
results = ray.get([counters[0].increment.remote() for _ in range(5)])
print(results)

1
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[2, 3, 4, 5, 6]
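
Beyond checking the Metrics tab visually, the smoke test can be confirmed from raw Prometheus text exposition output, in which each sample line starts with a metric name. The sketch below extracts metric names from such text and keeps those with the `ray_` prefix that Ray's exported metrics use. `ray_metric_names` is a hypothetical helper and the sample text is made up for illustration; where to fetch the real text from depends on how `setup_monitoring.sh` wires up the scrape targets, which is not shown here.

```python
import re

# A sample line looks like: name{label="value"} 1.0   (labels are optional)
_SAMPLE_LINE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{|\s)")


def ray_metric_names(exposition_text: str, prefix: str = "ray_") -> set:
    """Return the distinct metric names in the text that start with prefix."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE comments
            continue
        match = _SAMPLE_LINE.match(line)
        if match and match.group(1).startswith(prefix):
            names.add(match.group(1))
    return names


# Made-up sample, not real output from this cluster.
sample = """\
# HELP ray_actors Current number of actors.
# TYPE ray_actors gauge
ray_actors{State="ALIVE"} 11.0
ray_tasks{State="FINISHED"} 16.0
process_cpu_seconds_total 3.2
"""
print(sorted(ray_metric_names(sample)))
```

An empty result after running the Counter cell would suggest that Ray's metrics are not being scraped, even if Prometheus itself is up.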
