[Feature]: Collect and export DCGM metrics #2359

@un-def

Problem

Currently, the only GPU metrics dstack collects are VRAM usage and GPU utilization. The latter is not very useful: it represents “the percentage of time during which one or more kernels were executing on the GPU”, meaning that a single kernel running a busy loop would be reported as 100% GPU utilization.
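To illustrate why this metric can mislead, here is a toy model (not how the driver actually samples). It assumes a hypothetical GPU with 100 SMs and a list of samples, each recording how many SMs were busy. The function names and numbers are made up for illustration.

```python
def gpu_utilization(active_sms_per_sample):
    """Percentage of samples in which at least one kernel was executing."""
    busy = sum(1 for active in active_sms_per_sample if active > 0)
    return 100.0 * busy / len(active_sms_per_sample)


def sm_activity(active_sms_per_sample, total_sms):
    """Average percentage of SMs that were busy across all samples."""
    return 100.0 * sum(active_sms_per_sample) / (len(active_sms_per_sample) * total_sms)


TOTAL_SMS = 100      # hypothetical SM count
samples = [1] * 10   # a single busy-loop kernel occupying 1 SM in every sample

print(gpu_utilization(samples))         # 100.0
print(sm_activity(samples, TOTAL_SMS))  # 1.0
```

In this model the GPU is reported as fully utilized while only 1% of its SMs do any work. Finer-grained metrics of the kind DCGM exposes are meant to close that gap.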

DCGM Exporter is the de facto standard way to collect detailed metrics from NVIDIA GPUs.

Solution

  1. Install DCGM Exporter on the hosts
  2. Collect metrics from the hosts
  3. Accumulate, enrich, and re-export the metrics via dstack
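Steps 2 and 3 could look roughly like the sketch below. It is not the dstack implementation. It assumes DCGM Exporter serves Prometheus text format at `/metrics` on its default port 9400. The helpers `fetch_metrics` and `enrich` and the `dstack_project_name` label are hypothetical names chosen for illustration.

```python
import re
from urllib.request import urlopen


def fetch_metrics(host, port=9400, timeout=5.0):
    """Scrape the raw Prometheus text from a DCGM Exporter on the given host.

    Assumption: the exporter listens on port 9400 and serves /metrics.
    """
    with urlopen(f"http://{host}:{port}/metrics", timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# metric_name, optional {labels}, then the value (and an optional timestamp)
_SAMPLE_RE = re.compile(r"^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(.+)$")


def enrich(text, extra_labels):
    """Append extra labels to every sample line; keep comments and blanks as-is.

    Simplification: label values are assumed to need no escaping.
    """
    extra = ",".join(f'{key}="{value}"' for key, value in extra_labels.items())
    out = []
    for line in text.splitlines():
        match = _SAMPLE_RE.match(line)
        if not match:  # "# HELP", "# TYPE", or an empty line
            out.append(line)
            continue
        name, labels, rest = match.groups()
        merged = ",".join(part for part in (labels, extra) if part)
        out.append(f"{name}{{{merged}}} {rest}" if merged else f"{name} {rest}")
    return "\n".join(out) + "\n"


scraped = '# TYPE DCGM_FI_DEV_GPU_UTIL gauge\nDCGM_FI_DEV_GPU_UTIL{gpu="0"} 87\n'
print(enrich(scraped, {"dstack_project_name": "main"}), end="")
```

A real implementation would likely use a proper Prometheus parser and handle label escaping. The regex here only serves to show the enrich-and-re-export idea.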

Workaround

No response

Would you like to help us implement this feature by sending a PR?

Yes
