@un-def un-def commented Feb 25, 2025

  • shim: start dcgm-exporter if available, proxy requests
  • periodically collect and store last metrics
  • enrich metrics with dstack labels, export

Closes: #2359
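
To illustrate the "enrich metrics with dstack labels" step, here is a minimal Python sketch, not the code from this PR. It assumes the collected metrics are in the Prometheus text exposition format that `dcgm-exporter` serves. The function name `enrich_metrics` and the label name `dstack_run_name` are hypothetical and chosen only for the example.

```python
import re

# A sample line: metric name, optional {labels}, then the value (and optional timestamp).
_SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<rest>.+)$'
)


def _escape(value: str) -> str:
    # Escape backslash, double quote, and newline as the text format requires.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def enrich_metrics(text: str, extra_labels: dict[str, str]) -> str:
    """Append extra labels to every sample line; pass comments through unchanged."""
    extra = ",".join(f'{k}="{_escape(v)}"' for k, v in extra_labels.items())
    out = []
    for line in text.splitlines():
        m = _SAMPLE_RE.match(line)
        if not extra or line.startswith("#") or m is None:
            # "# HELP" / "# TYPE" lines, blank lines, and anything unparsable.
            out.append(line)
            continue
        existing = (m.group("labels") or "").rstrip(",")
        labels = f"{existing},{extra}" if existing else extra
        out.append(f'{m.group("name")}{{{labels}}} {m.group("rest")}')
    return "\n".join(out) + "\n" if out else ""


# Hypothetical label name, used only for illustration.
sample = 'DCGM_FI_DEV_GPU_UTIL{gpu="0"} 42\n'
enriched = enrich_metrics(sample, {"dstack_run_name": "my-run"})
```

A line-based rewrite like this keeps `# HELP` and `# TYPE` lines untouched, so the result should remain scrapeable by Prometheus; a full parser would be more robust for label values containing unusual characters.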

@un-def un-def requested a review from r4victor February 25, 2025 17:00
Comment on lines 217 to 219
`dcgm-exporter` and `libdcgm` must be installed on the instance to enable these metrics.
On AWS, Azure, GCP, and OCI backends the required packages are already installed.
If you use SSH fleets, install `datacenter-gpu-manager-4-core` and `datacenter-gpu-manager-exporter`.

What about other non-SSH VM backends like Lambda, TensorDock, Vultr, etc.? It seems they don't support DCGM metrics, but the docs can be read as if users could install something to make it work.

@r4victor r4victor left a comment

🔥

@un-def un-def merged commit d4a6061 into master Feb 27, 2025
24 checks passed
@un-def un-def deleted the issue_2359_reexport_dcgm_metrics branch February 27, 2025 10:51

Linked issue that merging this pull request may close: [Feature]: Collect and export DCGM metrics