[Bug]: GPU utilization metrics are not exposed in case of Runpod #2800

@pranitnaik43

Steps to reproduce

  • Enable Prometheus metrics
  • Create a task that runs a container on Runpod

Actual behaviour

The GPU utilization metrics are available in the dstack dashboard but are not exposed at /metrics.

  • Output of /metrics:
# HELP dstack_instance_duration_seconds_total Total seconds the instance is running
# TYPE dstack_instance_duration_seconds_total counter
dstack_instance_duration_seconds_total{dstack_project_name="main",dstack_fleet_name="temp-159ec52a73",dstack_fleet_id="9044f8e7-54d9-4143-a8c8-5bf3f2730a55",dstack_instance_name="temp-159ec52a73-0",dstack_instance_id="3087ba21-b9a1-4463-a26c-fc21a9f9b727",dstack_instance_type="NVIDIA H100 NVL",dstack_backend="runpod",dstack_gpu="H100NVL"} 337767.301653
# HELP dstack_instance_price_dollars_per_hour Instance price, USD/hour
# TYPE dstack_instance_price_dollars_per_hour gauge
dstack_instance_price_dollars_per_hour{dstack_project_name="main",dstack_fleet_name="temp-159ec52a73",dstack_fleet_id="9044f8e7-54d9-4143-a8c8-5bf3f2730a55",dstack_instance_name="temp-159ec52a73-0",dstack_instance_id="3087ba21-b9a1-4463-a26c-fc21a9f9b727",dstack_instance_type="NVIDIA H100 NVL",dstack_backend="runpod",dstack_gpu="H100NVL"} 2.79
# HELP dstack_instance_gpu_count Instance GPU count
# TYPE dstack_instance_gpu_count gauge
dstack_instance_gpu_count{dstack_project_name="main",dstack_fleet_name="temp-159ec52a73",dstack_fleet_id="9044f8e7-54d9-4143-a8c8-5bf3f2730a55",dstack_instance_name="temp-159ec52a73-0",dstack_instance_id="3087ba21-b9a1-4463-a26c-fc21a9f9b727",dstack_instance_type="NVIDIA H100 NVL",dstack_backend="runpod",dstack_gpu="H100NVL"} 1.0
# HELP dstack_run_count_total Total runs count
# TYPE dstack_run_count_total counter
dstack_run_count_total{dstack_project_name="main",dstack_user_name="admin"} 6591.0
dstack_run_count_total{dstack_project_name="main",dstack_user_name="admin"} 334.0
# HELP dstack_run_count_terminated_total Terminated runs count
# TYPE dstack_run_count_terminated_total counter
dstack_run_count_terminated_total{dstack_project_name="main",dstack_user_name="admin"} 5318.0
dstack_run_count_terminated_total{dstack_project_name="main",dstack_user_name="admin"} 10.0
# HELP dstack_run_count_failed_total Failed runs count
# TYPE dstack_run_count_failed_total counter
dstack_run_count_failed_total{dstack_project_name="main",dstack_user_name="admin"} 1272.0
dstack_run_count_failed_total{dstack_project_name="main",dstack_user_name="admin"} 109.0
# HELP dstack_run_count_done_total Done runs count
# TYPE dstack_run_count_done_total counter
dstack_run_count_done_total{dstack_project_name="main",dstack_user_name="admin"} 0.0
dstack_run_count_done_total{dstack_project_name="main",dstack_user_name="admin"} 215.0
# HELP dstack_job_duration_seconds_total Total seconds the job is running
# TYPE dstack_job_duration_seconds_total counter
dstack_job_duration_seconds_total{dstack_project_name="main",dstack_user_name="admin",dstack_run_name="temp-159ec52a73",dstack_run_id="693bad60-0491-44fa-a2bb-daf4cdb45849",dstack_job_name="temp-159ec52a73-0-0",dstack_job_id="fc21864d-6e2b-4ca1-a1e9-062b0ebb04da",dstack_job_num="0",dstack_replica_num="0",dstack_run_type="task",dstack_backend="runpod",dstack_gpu="H100NVL"} 337777.3958
# HELP dstack_job_price_dollars_per_hour Job instance price, USD/hour
# TYPE dstack_job_price_dollars_per_hour gauge
dstack_job_price_dollars_per_hour{dstack_project_name="main",dstack_user_name="admin",dstack_run_name="temp-159ec52a73",dstack_run_id="693bad60-0491-44fa-a2bb-daf4cdb45849",dstack_job_name="temp-159ec52a73-0-0",dstack_job_id="fc21864d-6e2b-4ca1-a1e9-062b0ebb04da",dstack_job_num="0",dstack_replica_num="0",dstack_run_type="task",dstack_backend="runpod",dstack_gpu="H100NVL"} 2.79
# HELP dstack_job_gpu_count Job GPU count
# TYPE dstack_job_gpu_count gauge
dstack_job_gpu_count{dstack_project_name="main",dstack_user_name="admin",dstack_run_name="temp-159ec52a73",dstack_run_id="693bad60-0491-44fa-a2bb-daf4cdb45849",dstack_job_name="temp-159ec52a73-0-0",dstack_job_id="fc21864d-6e2b-4ca1-a1e9-062b0ebb04da",dstack_job_num="0",dstack_replica_num="0",dstack_run_type="task",dstack_backend="runpod",dstack_gpu="H100NVL"} 1.0
# HELP dstack_job_cpu_count Job CPU count
# TYPE dstack_job_cpu_count gauge
dstack_job_cpu_count{dstack_project_name="main",dstack_user_name="admin",dstack_run_name="temp-159ec52a73",dstack_run_id="693bad60-0491-44fa-a2bb-daf4cdb45849",dstack_job_name="temp-159ec52a73-0-0",dstack_job_id="fc21864d-6e2b-4ca1-a1e9-062b0ebb04da",dstack_job_num="0",dstack_replica_num="0",dstack_run_type="task",dstack_backend="runpod",dstack_gpu="H100NVL"} 16.0
# HELP dstack_job_cpu_time_seconds_total Total CPU time consumed by the job, seconds
# TYPE dstack_job_cpu_time_seconds_total counter
dstack_job_cpu_time_seconds_total{dstack_project_name="main",dstack_user_name="admin",dstack_run_name="temp-159ec52a73",dstack_run_id="693bad60-0491-44fa-a2bb-daf4cdb45849",dstack_job_name="temp-159ec52a73-0-0",dstack_job_id="fc21864d-6e2b-4ca1-a1e9-062b0ebb04da",dstack_job_num="0",dstack_replica_num="0",dstack_run_type="task",dstack_backend="runpod",dstack_gpu="H100NVL"} 494407.917682
# HELP dstack_job_memory_total_bytes Total memory allocated for the job, bytes
# TYPE dstack_job_memory_total_bytes gauge
dstack_job_memory_total_bytes{dstack_project_name="main",dstack_user_name="admin",dstack_run_name="temp-159ec52a73",dstack_run_id="693bad60-0491-44fa-a2bb-daf4cdb45849",dstack_job_name="temp-159ec52a73-0-0",dstack_job_id="fc21864d-6e2b-4ca1-a1e9-062b0ebb04da",dstack_job_num="0",dstack_replica_num="0",dstack_run_type="task",dstack_backend="runpod",dstack_gpu="H100NVL"} 193273528320.0
# HELP dstack_job_memory_usage_bytes Memory used by the job (including cache), bytes
# TYPE dstack_job_memory_usage_bytes gauge
dstack_job_memory_usage_bytes{dstack_project_name="main",dstack_user_name="admin",dstack_run_name="temp-159ec52a73",dstack_run_id="693bad60-0491-44fa-a2bb-daf4cdb45849",dstack_job_name="temp-159ec52a73-0-0",dstack_job_id="fc21864d-6e2b-4ca1-a1e9-062b0ebb04da",dstack_job_num="0",dstack_replica_num="0",dstack_run_type="task",dstack_backend="runpod",dstack_gpu="H100NVL"} 130823139328.0
# HELP dstack_job_memory_working_set_bytes Memory used by the job (not including cache), bytes
# TYPE dstack_job_memory_working_set_bytes gauge
dstack_job_memory_working_set_bytes{dstack_project_name="main",dstack_user_name="admin",dstack_run_name="temp-159ec52a73",dstack_run_id="693bad60-0491-44fa-a2bb-daf4cdb45849",dstack_job_name="temp-159ec52a73-0-0",dstack_job_id="fc21864d-6e2b-4ca1-a1e9-062b0ebb04da",dstack_job_num="0",dstack_replica_num="0",dstack_run_type="task",dstack_backend="runpod",dstack_gpu="H100NVL"} 122454171648.0

Expected behaviour

The GPU utilization and VRAM metrics should be available at /metrics for Runpod tasks.
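
For anyone who wants to check this against their own server, here is a minimal sketch that scans a /metrics payload for GPU-related metric families. It relies only on the `# TYPE` lines of the Prometheus text format. The helper names (`metric_families`, `gpu_usage_families`) are made up for this example, the sample payload is abbreviated from the output in this report (labels shortened), and the "contains gpu but is not a GPU count" rule is an assumption about how the missing metrics would be named.

```python
import re

# Matches "# TYPE <name> <type>" lines in the Prometheus text exposition format.
TYPE_LINE = re.compile(r"^# TYPE (\S+) (\S+)\s*$")


def metric_families(payload: str) -> dict[str, str]:
    """Map each metric family name to its declared type."""
    families = {}
    for line in payload.splitlines():
        match = TYPE_LINE.match(line.strip())
        if match:
            families[match.group(1)] = match.group(2)
    return families


def gpu_usage_families(payload: str) -> list[str]:
    """Return GPU-related families, ignoring the static *_gpu_count gauges.

    Assumption: utilization/VRAM metrics would have "gpu" in their name.
    """
    return sorted(
        name
        for name in metric_families(payload)
        if "gpu" in name and not name.endswith("_gpu_count")
    )


# Abbreviated sample based on the /metrics output in this report.
SAMPLE = """\
# HELP dstack_instance_gpu_count Instance GPU count
# TYPE dstack_instance_gpu_count gauge
dstack_instance_gpu_count{dstack_backend="runpod"} 1.0
# HELP dstack_job_gpu_count Job GPU count
# TYPE dstack_job_gpu_count gauge
dstack_job_gpu_count{dstack_backend="runpod"} 1.0
# HELP dstack_job_memory_usage_bytes Memory used by the job (including cache), bytes
# TYPE dstack_job_memory_usage_bytes gauge
dstack_job_memory_usage_bytes{dstack_backend="runpod"} 130823139328.0
"""

# An empty list means only GPU counts are exposed, which matches the bug.
print(gpu_usage_families(SAMPLE))
```

The same functions can be pointed at the body of a real /metrics response; only the payload string changes.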

dstack version

0.19.9

Additional information

The dstack server collects metrics from two sources: the Runner and the shim.

The metrics collected from the Runner are displayed on the dstack metrics dashboard.

The metrics exposed at /metrics contain data from both collectors (Runner and shim). However, the GPU metrics exposed there are collected only from the shim. On virtualized platforms like Runpod, the shim does not run, so the GPU metrics are missing from /metrics.
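
To make the behaviour concrete, here is a toy model of the two-collector setup as I understand it. This is not dstack's actual code: the function `exported_metrics`, the sample keys, and the GPU values are all hypothetical placeholders (the CPU and memory values are taken from the output in this report). It only illustrates why GPU values vanish when the shim is absent, even though the Runner has them.

```python
from typing import Optional


def exported_metrics(
    runner: dict[str, float],
    shim: Optional[dict[str, float]],
) -> dict[str, float]:
    """Toy model of /metrics: job values from the Runner, GPU values from the shim only."""
    # Runner-side GPU samples are deliberately ignored here, mirroring the
    # behaviour described in this issue (hypothetical key prefix "gpu_").
    exported = {k: v for k, v in runner.items() if not k.startswith("gpu_")}
    if shim is not None:
        exported.update({k: v for k, v in shim.items() if k.startswith("gpu_")})
    return exported


# Placeholder GPU values; they are not real measurements.
runner_samples = {
    "cpu_time_seconds": 494407.917682,
    "memory_usage_bytes": 130823139328.0,
    "gpu_util_percent": 50.0,
}

with_shim = exported_metrics(runner_samples, {"gpu_util_percent": 50.0})
without_shim = exported_metrics(runner_samples, None)  # e.g. Runpod

print("gpu_util_percent" in with_shim)     # True
print("gpu_util_percent" in without_shim)  # False
```

In this model the Runner already holds a GPU sample in the shim-less case, which is consistent with the dashboard showing GPU utilization while /metrics does not.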

Labels

bug (Something isn't working), no-stale