[Feature]: Collect and export DCGM metrics

### Problem

Currently, the only GPU metrics dstack collects are VRAM usage and GPU utilization. The latter is not very useful: it represents “the percentage of time during which one or more kernels were executing on the GPU”, meaning that a single kernel running a busy loop would be reported as 100% GPU utilization.
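
To make the gap concrete, here is a toy model (not how the driver actually samples). It assumes we can observe how many SMs are busy in each sampling interval; `TOTAL_SMS` and the sample values are made up for illustration.

```python
# Toy model: each sample is the number of SMs that were busy during one
# sampling interval. The numbers below are hypothetical.

def time_based_utilization(samples):
    """Fraction of intervals in which at least one kernel was executing."""
    return sum(1 for active_sms in samples if active_sms > 0) / len(samples)


def mean_sm_activity(samples, total_sms):
    """Average fraction of SMs that were actually busy across intervals."""
    return sum(samples) / (len(samples) * total_sms)


TOTAL_SMS = 108        # hypothetical SM count
busy_loop = [1] * 10   # one kernel spinning on a single SM, all the time

print(time_based_utilization(busy_loop))       # 1.0, reported as "100% utilized"
print(mean_sm_activity(busy_loop, TOTAL_SMS))  # under 1% of the GPU is doing work
```

The second number is closer to what DCGM profiling fields such as `DCGM_FI_PROF_SM_ACTIVE` are meant to capture.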

[DCGM Exporter](https://docs.nvidia.com/datacenter/dcgm/latest/gpu-telemetry/dcgm-exporter.html) is the de facto standard way to collect detailed metrics from NVIDIA GPUs.



### Solution

1. Install DCGM Exporter on the hosts
2. Collect metrics from the hosts
3. Accumulate, enrich, and re-export the metrics in dstack
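
A rough sketch of steps 2 and 3, to show the shape of the data rather than a proposed implementation. It assumes DCGM Exporter serves Prometheus text format over HTTP (port 9400 and the `/metrics` path are assumed defaults), and the `dstack_run_name` label plus all function names are hypothetical.

```python
import re
import urllib.request

# One sample line, e.g.: DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-aaaa"} 97
# An optional trailing timestamp is accepted and ignored.
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)(?:\s+-?\d+)?$'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def fetch_metrics(host, port=9400, timeout=5.0):
    """Step 2: scrape the exporter on a host (port/path are assumptions)."""
    url = f"http://{host}:{port}/metrics"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def parse_samples(text):
    """Parse Prometheus text into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comments
            continue
        m = SAMPLE_RE.match(line)
        if m is None:
            continue
        labels = dict(LABEL_RE.findall(m.group("labels") or ""))
        samples.append((m.group("name"), labels, float(m.group("value"))))
    return samples


def enrich(samples, extra_labels):
    """Step 3a: attach dstack-specific labels to every sample."""
    return [(n, {**lbls, **extra_labels}, v) for n, lbls, v in samples]


def render(samples):
    """Step 3b: re-export in Prometheus text format (HELP/TYPE omitted)."""
    lines = []
    for name, labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


# Offline demo with a hand-written payload instead of a live scrape.
EXAMPLE = """\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-aaaa"} 97
"""
print(render(enrich(parse_samples(EXAMPLE), {"dstack_run_name": "train-1"})))
```

In a real setup `fetch_metrics` would be called per host on a schedule, and a proper Prometheus client library would likely replace the hand-rolled parsing and rendering.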

### Workaround

_No response_

### Would you like to help us implement this feature by sending a PR?

Yes

[Feature]: Collect and export DCGM metrics #2359