Problem
Currently, the only GPU metrics collected by dstack are VRAM usage and GPU utilization. The latter is not very useful: it represents “the percentage of time during which one or more kernels were executing on the GPU”, meaning that a single kernel running a busy loop would be reported as 100% GPU utilization.
DCGM Exporter is the de facto standard way to collect detailed metrics from NVIDIA GPUs.
Solution
- Install DCGM Exporter on the hosts
- Collect metrics from the hosts
- Accumulate, enrich, and re-export the metrics from dstack
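
To illustrate the "enrich and re-export" step, here is a minimal sketch, not a proposed implementation. It assumes the metrics arrive in the Prometheus text format that DCGM Exporter serves, and that enrichment means appending extra labels to every sample. The function name `enrich_metrics` and the label `dstack_run_name` are hypothetical, and label values are not escaped here.

```python
import re

# Matches one sample line of the Prometheus text exposition format:
# metric name, optional {labels}, then the value (and optional timestamp).
SAMPLE_RE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(.+)$")


def enrich_metrics(text: str, extra_labels: dict[str, str]) -> str:
    """Add extra_labels to every sample line; pass comments through unchanged.

    Hypothetical helper: assumes label values need no escaping.
    """
    extra = ",".join(f'{k}="{v}"' for k, v in extra_labels.items())
    out = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            out.append(line)  # keep HELP/TYPE comments and blank lines
            continue
        match = SAMPLE_RE.match(stripped)
        if match is None:
            out.append(line)  # leave anything unrecognized untouched
            continue
        name, labels, value = match.groups()
        labels = labels.rstrip(",") if labels else ""
        merged = ",".join(part for part in (labels, extra) if part)
        out.append(f"{name}{{{merged}}} {value}")
    return "\n".join(out) + "\n"
```

For example, passing a scraped `DCGM_FI_DEV_GPU_UTIL{gpu="0"} 42` sample with `{"dstack_run_name": "my-run"}` would yield the same sample with the run name appended to its label set, so downstream Prometheus queries could group GPU metrics by run.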
Workaround
No response
Would you like to help us implement this feature by sending a PR?
Yes