From 7e018616ca77865d8f07a4d77270e8112e9b1f31 Mon Sep 17 00:00:00 2001 From: peterschmidt85 Date: Wed, 19 Mar 2025 23:18:30 -0700 Subject: [PATCH 1/2] [Docs]: Update the `Metrics` guide --- docs/assets/stylesheets/extra.css | 2 + docs/docs/guides/metrics.md | 164 +++++++++++++++++++----------- 2 files changed, 107 insertions(+), 59 deletions(-) diff --git a/docs/assets/stylesheets/extra.css b/docs/assets/stylesheets/extra.css index 2b5319d61..6d6346ce1 100644 --- a/docs/assets/stylesheets/extra.css +++ b/docs/assets/stylesheets/extra.css @@ -201,6 +201,8 @@ .md-typeset__scrollwrap { margin-top: 0; margin-bottom: 0; + margin-block-start: 1em; + margin-block-end: 1em; } .md-typeset__table { diff --git a/docs/docs/guides/metrics.md b/docs/docs/guides/metrics.md index b3bf1092a..0104bb483 100644 --- a/docs/docs/guides/metrics.md +++ b/docs/docs/guides/metrics.md @@ -1,73 +1,119 @@ -# Prometheus metrics +# Metrics -If enabled, `dstack` collects and exports Prometheus metrics. Metrics are available at the `/metrics` path. +## Prometheus -By default, metrics are disabled. To enable, set the `DSTACK_ENABLE_PROMETHEUS_METRICS` variable. +When enabled, `dstack` is able to collect various from fleets and runs and export them +to Prometheus. -!!! info "Convention" - *type?* denotes an optional type. If a type is optional, an empty string is a valid value. +### Setup -## Instance metrics +To enable collecting and exporting metrics to Prometheus, +set the `DSTACK_ENABLE_PROMETHEUS_METRICS` environment variable, and point Prometheus to collect metrics +from the `/metrics` endpoint. -| Metric | Type | Description | Examples | -|---|---|---|---| -| `dstack_instance_duration_seconds_total` | *counter* | Total seconds the instance is running | `1123763.22` | -| `dstack_instance_price_dollars_per_hour` | *gauge* | Instance price, USD/hour | `16.0`| -| `dstack_instance_gpu_count` | *gauge* | Instance GPU count | `4.0`, `0.0` | +??? 
info "NVIDIA DCGM" + NVIDIA DCGM metrics are automatically collected for AWS, Azure, GCP, and OCI backends, as well as for SSH fleets. + + To collect NVIDIA DCGM metrics from SSH fleets, make sure the `datacenter-gpu-manager-4-core`, + `datacenter-gpu-manager-4-proprietary`, and `datacenter-gpu-manager-exporter` packages are installed on the hosts. -| Label | Type | Examples | -|---|---|---| -| `dstack_project_name` | *string* | `main` | -| `dstack_fleet_name` | *string?* | `my-fleet` | -| `dstack_fleet_id` | *string?* | `51e837bf-fae9-4a37-ac9c-85c005606c22` | -| `dstack_instance_name` | *string* | `my-fleet-0` | -| `dstack_instance_id` | *string* | `8c28c52c-2f94-4a19-8c06-12f1dfee4dd2` | -| `dstack_instance_type` | *string?* | `g4dn.xlarge` | -| `dstack_backend` | *string?* | `aws`, `runpod` | -| `dstack_gpu` | *string?* | `T4` | +### Fleets -## Job metrics +Fleet metrics cover each instance within a fleet, including the instance's running +time, price, GPU name, and more. 
-| Metric | Type | Description | Examples | -|---|---|---|---| -| `dstack_job_duration_seconds_total` | *counter* | Total seconds the job is running | `520.37` | -| `dstack_job_price_dollars_per_hour` | *gauge* | Job instance price, USD/hour | `8.0`| -| `dstack_job_gpu_count` | *gauge* | Job GPU count | `2.0`, `0.0` | +=== "Metrics" + | Name | Type | Description | Examples | + |------------------------------------------|-----------|-----------------------------------|--------------| + | `dstack_instance_duration_seconds_total` | *counter* | Total instance runtime in seconds | `1123763.22` | + | `dstack_instance_price_dollars_per_hour` | *gauge* | Instance price, USD/hour | `16.0` | + | `dstack_instance_gpu_count` | *gauge* | Instance GPU count | `4.0`, `0.0` | -| Label | Type | Examples | -|---|---|---| -| `dstack_project_name` | *string* | `main` | -| `dstack_user_name` | *string* | `alice` | -| `dstack_run_name` | *string* | `nccl-tests` | -| `dstack_run_id` | *string* | `51e837bf-fae9-4a37-ac9c-85c005606c22` | -| `dstack_job_name` | *string* | `nccl-tests-0-0` | -| `dstack_job_id` | *string* | `8c28c52c-2f94-4a19-8c06-12f1dfee4dd2` | -| `dstack_job_num` | *integer* | `0` | -| `dstack_replica_num` | *integer* | `0` | -| `dstack_run_type` | *string* | `task`, `dev-environment` | -| `dstack_backend` | *string* | `aws`, `runpod` | -| `dstack_gpu` | *string?* | `T4` | +=== "Labels" + | Name | Type | Description | Examples | + |------------------------|-----------|:--------------|----------------------------------------| + | `dstack_project_name` | *string* | Project name | `main` | + | `dstack_fleet_name` | *string?* | Fleet name | `my-fleet` | + | `dstack_fleet_id` | *string?* | Fleet ID | `51e837bf-fae9-4a37-ac9c-85c005606c22` | + | `dstack_instance_name` | *string* | Instance name | `my-fleet-0` | + | `dstack_instance_id` | *string* | Instance ID | `8c28c52c-2f94-4a19-8c06-12f1dfee4dd2` | + | `dstack_instance_type` | *string?* | Instance type | `g4dn.xlarge` | + | 
`dstack_backend` | *string?* | Backend | `aws`, `runpod` | + | `dstack_gpu` | *string?* | GPU name | `H100` | -## NVIDIA DCGM job metrics +### Runs -A fixed subset of NVIDIA GPU metrics from [DCGM Exporter :material-arrow-top-right-thin:{ .external }](https://docs.nvidia.com/datacenter/dcgm/latest/gpu-telemetry/dcgm-exporter.html){:target="_blank"} on supported cloud backends — AWS, Azure, GCP, OCI — and SSH fleets. +Run metrics include metrics for each job within a run. +This includes information such as job runtime, price, GPU name, DCGM metrics, and more. -??? info "SSH fleets" - In order for DCGM metrics to work, the following packages must be installed on the instances: +=== "Metrics" - * `datacenter-gpu-manager-4-core` - * `datacenter-gpu-manager-4-proprietary` - * `datacenter-gpu-manager-exporter` + | Name | Type | Description | Examples | + |-------------------------------------------------|-----------|--------------------------------------------------------------------------------------------|--------------| + | `dstack_job_duration_seconds_total` | *counter* | Total job runtime in seconds | `520.37` | + | `dstack_job_price_dollars_per_hour` | *gauge* | Job instance price, USD/hour | `8.0` | + | `dstack_job_gpu_count` | *gauge* | Job GPU count | `2.0`, `0.0` | + | `DCGM_FI_DEV_GPU_UTIL` | gauge | GPU utilization (in %). | | + | `DCGM_FI_DEV_MEM_COPY_UTIL` | gauge | Memory utilization (in %). | | + | `DCGM_FI_DEV_ENC_UTIL` | gauge | Encoder utilization (in %). | | + | `DCGM_FI_DEV_DEC_UTIL` | gauge | Decoder utilization (in %). | | + | `DCGM_FI_DEV_FB_FREE` | gauge | Framebuffer memory free (in MiB). | | + | `DCGM_FI_DEV_FB_USED` | gauge | Framebuffer memory used (in MiB). | | + | `DCGM_FI_PROF_GR_ENGINE_ACTIVE` | gauge | The ratio of cycles during which a graphics engine or compute engine remains active. | | + | `DCGM_FI_PROF_SM_ACTIVE` | gauge | The ratio of cycles an SM has at least 1 warp assigned. 
| | + | `DCGM_FI_PROF_SM_OCCUPANCY` | gauge | The ratio of number of warps resident on an SM. | | + | `DCGM_FI_PROF_PIPE_TENSOR_ACTIVE` | gauge | Ratio of cycles the tensor (HMMA) pipe is active. | | + | `DCGM_FI_PROF_PIPE_FP64_ACTIVE` | gauge | Ratio of cycles the fp64 pipes are active. | | + | `DCGM_FI_PROF_PIPE_FP32_ACTIVE` | gauge | Ratio of cycles the fp32 pipes are active. | | + | `DCGM_FI_PROF_PIPE_FP16_ACTIVE` | gauge | Ratio of cycles the fp16 pipes are active. | | + | `DCGM_FI_PROF_PIPE_INT_ACTIVE` | gauge | Ratio of cycles the integer pipe is active. | | + | `DCGM_FI_PROF_DRAM_ACTIVE` | gauge | Ratio of cycles the device memory interface is active sending or receiving data. | | + | `DCGM_FI_PROF_PCIE_TX_BYTES` | counter | The number of bytes of active PCIe tx (transmit) data including both header and payload. | | + | `DCGM_FI_PROF_PCIE_RX_BYTES` | counter | The number of bytes of active PCIe rx (read) data including both header and payload. | | + | `DCGM_FI_DEV_SM_CLOCK` | gauge | SM clock frequency (in MHz). | | + | `DCGM_FI_DEV_MEM_CLOCK` | gauge | Memory clock frequency (in MHz). | | + | `DCGM_FI_DEV_MEMORY_TEMP` | gauge | Memory temperature (in C). | | + | `DCGM_FI_DEV_GPU_TEMP` | gauge | GPU temperature (in C). | | + | `DCGM_FI_DEV_POWER_USAGE` | gauge | Power draw (in W). | | + | `DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION` | counter | Total energy consumption since boot (in mJ). | | + | `DCGM_FI_DEV_PCIE_REPLAY_COUNTER` | counter | Total number of PCIe retries. | | + | `DCGM_FI_DEV_XID_ERRORS` | gauge | Value of the last XID error encountered. | | + | `DCGM_FI_DEV_POWER_VIOLATION` | counter | Throttling duration due to power constraints (in us). | | + | `DCGM_FI_DEV_THERMAL_VIOLATION` | counter | Throttling duration due to thermal constraints (in us). | | + | `DCGM_FI_DEV_SYNC_BOOST_VIOLATION` | counter | Throttling duration due to sync-boost constraints (in us). 
| | + | `DCGM_FI_DEV_BOARD_LIMIT_VIOLATION` | counter | Throttling duration due to board limit constraints (in us). | | + | `DCGM_FI_DEV_LOW_UTIL_VIOLATION` | counter | Throttling duration due to low utilization (in us). | | + | `DCGM_FI_DEV_RELIABILITY_VIOLATION` | counter | Throttling duration due to reliability constraints (in us). | | + | `DCGM_FI_DEV_ECC_SBE_VOL_TOTAL` | counter | Total number of single-bit volatile ECC errors. | | + | `DCGM_FI_DEV_ECC_DBE_VOL_TOTAL` | counter | Total number of double-bit volatile ECC errors. | | + | `DCGM_FI_DEV_ECC_SBE_AGG_TOTAL` | counter | Total number of single-bit persistent ECC errors. | | + | `DCGM_FI_DEV_ECC_DBE_AGG_TOTAL` | counter | Total number of double-bit persistent ECC errors. | | + | `DCGM_FI_DEV_RETIRED_SBE` | counter | Total number of retired pages due to single-bit errors. | | + | `DCGM_FI_DEV_RETIRED_DBE` | counter | Total number of retired pages due to double-bit errors. | | + | `DCGM_FI_DEV_RETIRED_PENDING` | counter | Total number of pages pending retirement. | | + | `DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS` | counter | Number of remapped rows for uncorrectable errors | | + | `DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS` | counter | Number of remapped rows for correctable errors | | + | `DCGM_FI_DEV_ROW_REMAP_FAILURE` | gauge | Whether remapping of rows has failed | | + | `DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL` | counter | Total number of NVLink flow-control CRC errors. | | + | `DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_COUNT_TOTAL` | counter | Total number of NVLink data CRC errors. | | + | `DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL` | counter | Total number of NVLink retries. | | + | `DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL` | counter | Total number of NVLink recovery errors. | | + | `DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL` | counter | Total number of NVLink bandwidth counters for all lanes. 
| | + | `DCGM_FI_DEV_NVLINK_BANDWIDTH_L0` | counter | The number of bytes of active NVLink rx or tx data including both header and payload. | | + | `DCGM_FI_PROF_NVLINK_RX_BYTES` | counter | The number of bytes of active NVLink rx (read) data including both header and payload. | | + | `DCGM_FI_PROF_NVLINK_TX_BYTES` | counter | The number of bytes of active NVLink tx (transmit) data including both header and payload. | | -Check [`dcgm/exporter.go`](https://github.com/dstackai/dstack/blob/master/runner/internal/shim/dcgm/exporter.go) for the list of metrics. - -| Label | Type | Examples | -|---|---|---| -| `dstack_project_name` | *string* | `main` | -| `dstack_user_name` | *string* | `alice` | -| `dstack_run_name` | *string* | `nccl-tests` | -| `dstack_run_id` | *string* | `51e837bf-fae9-4a37-ac9c-85c005606c22` | -| `dstack_job_name` | *string* | `nccl-tests-0-0` | -| `dstack_job_id` | *string* | `8c28c52c-2f94-4a19-8c06-12f1dfee4dd2` | -| `dstack_job_num` | *integer* | `0` | -| `dstack_replica_num` | *integer* | `0` | +=== "Labels" + | Label | Type | Description | Examples | + |-----------------------|-----------|:-----------------------|----------------------------------------| + | `dstack_project_name` | *string* | Project name | `main` | + | `dstack_user_name` | *string* | User name | `alice` | + | `dstack_run_name` | *string* | Run name | `nccl-tests` | + | `dstack_run_id` | *string* | Run ID | `51e837bf-fae9-4a37-ac9c-85c005606c22` | + | `dstack_job_name` | *string* | Job name | `nccl-tests-0-0` | + | `dstack_job_id` | *string* | Job ID | `8c28c52c-2f94-4a19-8c06-12f1dfee4dd2` | + | `dstack_job_num` | *integer* | Job number | `0` | + | `dstack_replica_num` | *integer* | Replica number | `0` | + | `dstack_run_type` | *string* | Run configuration type | `task`, `dev-environment` | + | `dstack_backend` | *string* | Backend | `aws`, `runpod` | + | `dstack_gpu` | *string?* | GPU name | `H100` | From e1f4f28e450d464661c8a3931f2d155a15b58952 Mon Sep 17 00:00:00 2001 From: 
peterschmidt85 Date: Thu, 20 Mar 2025 00:02:45 -0700 Subject: [PATCH 2/2] [Docs]: Update the `Metrics` guide (review feedback) --- docs/docs/guides/metrics.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/docs/guides/metrics.md b/docs/docs/guides/metrics.md index 0104bb483..c9a6b952e 100644 --- a/docs/docs/guides/metrics.md +++ b/docs/docs/guides/metrics.md @@ -2,7 +2,7 @@ ## Prometheus -When enabled, `dstack` is able to collect various from fleets and runs and export them +When enabled, `dstack` is able to collect various metrics from fleets and runs and export them to Prometheus. ### Setup
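
To check the `/metrics` endpoint described under Setup before pointing Prometheus at it, the instance metrics can be grouped by instance with a few lines of Python. This is a minimal sketch and not part of the patch: it assumes the endpoint returns the standard Prometheus text exposition format, and `SAMPLE`, `parse_samples`, and `instance_summary` are hypothetical names with hand-written data based on the examples in the tables, not actual `dstack` output.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
# One label pair inside the braces: key="value" (escaped quotes allowed).
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')

# Hand-written payload for illustration only (assumed shape, not real output).
SAMPLE = """\
# TYPE dstack_instance_price_dollars_per_hour gauge
dstack_instance_price_dollars_per_hour{dstack_project_name="main",dstack_instance_name="my-fleet-0",dstack_gpu="H100"} 16.0
# TYPE dstack_instance_gpu_count gauge
dstack_instance_gpu_count{dstack_project_name="main",dstack_instance_name="my-fleet-0",dstack_gpu="H100"} 4.0
# TYPE dstack_job_gpu_count gauge
dstack_job_gpu_count{dstack_project_name="main",dstack_job_name="nccl-tests-0-0"} 2.0
"""


def parse_samples(text):
    """Return a list of (metric_name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and # HELP / # TYPE comments.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def instance_summary(samples):
    """Group dstack_instance_* metrics by the dstack_instance_name label."""
    summary = {}
    for name, labels, value in samples:
        if not name.startswith("dstack_instance_"):
            continue
        instance = labels.get("dstack_instance_name", "")
        summary.setdefault(instance, {})[name] = value
    return summary


if __name__ == "__main__":
    for instance, metrics in instance_summary(parse_samples(SAMPLE)).items():
        print(instance, metrics)
```

To run this against a live server, the `SAMPLE` text could be replaced with the body fetched from the server's `/metrics` URL (for example via `urllib.request.urlopen`); the host and port depend on the deployment and are not specified in this patch.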