2 changes: 2 additions & 0 deletions docs/assets/stylesheets/extra.css
@@ -201,6 +201,8 @@
.md-typeset__scrollwrap {
margin-top: 0;
margin-bottom: 0;
margin-block-start: 1em;
margin-block-end: 1em;
}

.md-typeset__table {
164 changes: 105 additions & 59 deletions docs/docs/guides/metrics.md
@@ -1,73 +1,119 @@
# Prometheus metrics
# Metrics

If enabled, `dstack` collects and exports Prometheus metrics. Metrics are available at the `/metrics` path.
## Prometheus

By default, metrics are disabled. To enable them, set the `DSTACK_ENABLE_PROMETHEUS_METRICS` environment variable.
When enabled, `dstack` collects various metrics from fleets and runs and exports them to Prometheus.

!!! info "Convention"
*type?* denotes an optional type. If a type is optional, an empty string is a valid value.
### Setup

## Instance metrics
To enable collecting and exporting metrics to Prometheus, set the `DSTACK_ENABLE_PROMETHEUS_METRICS` environment variable
on the `dstack` server, and configure Prometheus to scrape the `<dstack server URL>/metrics` endpoint.
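
For example, a minimal sketch of enabling metrics and checking the endpoint (it assumes setting the variable to `true` enables the flag and that the server listens on the default local address; adjust both to your deployment):

```shell
# Enable Prometheus metrics before starting the dstack server
export DSTACK_ENABLE_PROMETHEUS_METRICS=true
dstack server

# Verify that the endpoint responds (replace the address with your server URL)
curl http://localhost:3000/metrics
```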

| Metric | Type | Description | Examples |
|---|---|---|---|
| `dstack_instance_duration_seconds_total` | *counter* | Total seconds the instance is running | `1123763.22` |
| `dstack_instance_price_dollars_per_hour` | *gauge* | Instance price, USD/hour | `16.0`|
| `dstack_instance_gpu_count` | *gauge* | Instance GPU count | `4.0`, `0.0` |
??? info "NVIDIA DCGM"
NVIDIA DCGM metrics are automatically collected for the AWS, Azure, GCP, and OCI backends, as well as for SSH fleets.
For SSH fleets, make sure the `datacenter-gpu-manager-4-core`, `datacenter-gpu-manager-4-proprietary`,
and `datacenter-gpu-manager-exporter` packages are installed on the hosts.
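
For example, on Ubuntu hosts these packages might be installed with `apt` (a sketch only; it assumes the NVIDIA DCGM apt repository is already configured on the host):

```shell
# Install the DCGM packages dstack needs to collect GPU metrics from SSH fleet hosts
sudo apt-get update
sudo apt-get install -y \
  datacenter-gpu-manager-4-core \
  datacenter-gpu-manager-4-proprietary \
  datacenter-gpu-manager-exporter
```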

| Label | Type | Examples |
|---|---|---|
| `dstack_project_name` | *string* | `main` |
| `dstack_fleet_name` | *string?* | `my-fleet` |
| `dstack_fleet_id` | *string?* | `51e837bf-fae9-4a37-ac9c-85c005606c22` |
| `dstack_instance_name` | *string* | `my-fleet-0` |
| `dstack_instance_id` | *string* | `8c28c52c-2f94-4a19-8c06-12f1dfee4dd2` |
| `dstack_instance_type` | *string?* | `g4dn.xlarge` |
| `dstack_backend` | *string?* | `aws`, `runpod` |
| `dstack_gpu` | *string?* | `T4` |
### Fleets

## Job metrics
Fleet metrics are reported per instance within a fleet and include the instance's running time, price, GPU name, and more.

| Metric | Type | Description | Examples |
|---|---|---|---|
| `dstack_job_duration_seconds_total` | *counter* | Total seconds the job is running | `520.37` |
| `dstack_job_price_dollars_per_hour` | *gauge* | Job instance price, USD/hour | `8.0`|
| `dstack_job_gpu_count` | *gauge* | Job GPU count | `2.0`, `0.0` |
=== "Metrics"
| Name | Type | Description | Examples |
|------------------------------------------|-----------|-----------------------------------|--------------|
| `dstack_instance_duration_seconds_total` | *counter* | Total instance runtime in seconds | `1123763.22` |
| `dstack_instance_price_dollars_per_hour` | *gauge* | Instance price, USD/hour | `16.0` |
| `dstack_instance_gpu_count` | *gauge* | Instance GPU count | `4.0`, `0.0` |

| Label | Type | Examples |
|---|---|---|
| `dstack_project_name` | *string* | `main` |
| `dstack_user_name` | *string* | `alice` |
| `dstack_run_name` | *string* | `nccl-tests` |
| `dstack_run_id` | *string* | `51e837bf-fae9-4a37-ac9c-85c005606c22` |
| `dstack_job_name` | *string* | `nccl-tests-0-0` |
| `dstack_job_id` | *string* | `8c28c52c-2f94-4a19-8c06-12f1dfee4dd2` |
| `dstack_job_num` | *integer* | `0` |
| `dstack_replica_num` | *integer* | `0` |
| `dstack_run_type` | *string* | `task`, `dev-environment` |
| `dstack_backend` | *string* | `aws`, `runpod` |
| `dstack_gpu` | *string?* | `T4` |
=== "Labels"
| Name | Type | Description | Examples |
|------------------------|-----------|:--------------|----------------------------------------|
| `dstack_project_name` | *string* | Project name | `main` |
| `dstack_fleet_name` | *string?* | Fleet name | `my-fleet` |
| `dstack_fleet_id` | *string?* | Fleet ID | `51e837bf-fae9-4a37-ac9c-85c005606c22` |
| `dstack_instance_name` | *string* | Instance name | `my-fleet-0` |
| `dstack_instance_id` | *string* | Instance ID | `8c28c52c-2f94-4a19-8c06-12f1dfee4dd2` |
| `dstack_instance_type` | *string?* | Instance type | `g4dn.xlarge` |
| `dstack_backend` | *string?* | Backend | `aws`, `runpod` |
| `dstack_gpu` | *string?* | GPU name | `H100` |

## NVIDIA DCGM job metrics
### Runs

A fixed subset of NVIDIA GPU metrics from [DCGM Exporter :material-arrow-top-right-thin:{ .external }](https://docs.nvidia.com/datacenter/dcgm/latest/gpu-telemetry/dcgm-exporter.html){:target="_blank"} is collected on supported cloud backends (AWS, Azure, GCP, OCI) and on SSH fleets.
Run metrics are reported per job within a run and include job runtime, price, GPU name, DCGM metrics, and more.

??? info "SSH fleets"
In order for DCGM metrics to work, the following packages must be installed on the instances:
=== "Metrics"

* `datacenter-gpu-manager-4-core`
* `datacenter-gpu-manager-4-proprietary`
* `datacenter-gpu-manager-exporter`
| Name | Type | Description | Examples |
|-------------------------------------------------|-----------|--------------------------------------------------------------------------------------------|--------------|
| `dstack_job_duration_seconds_total` | *counter* | Total job runtime in seconds | `520.37` |
| `dstack_job_price_dollars_per_hour` | *gauge* | Job instance price, USD/hour | `8.0` |
| `dstack_job_gpu_count` | *gauge* | Job GPU count | `2.0`, `0.0` |
| `DCGM_FI_DEV_GPU_UTIL` | gauge | GPU utilization (in %). | |
| `DCGM_FI_DEV_MEM_COPY_UTIL` | gauge | Memory utilization (in %). | |
| `DCGM_FI_DEV_ENC_UTIL` | gauge | Encoder utilization (in %). | |
| `DCGM_FI_DEV_DEC_UTIL` | gauge | Decoder utilization (in %). | |
| `DCGM_FI_DEV_FB_FREE` | gauge | Framebuffer memory free (in MiB). | |
| `DCGM_FI_DEV_FB_USED` | gauge | Framebuffer memory used (in MiB). | |
| `DCGM_FI_PROF_GR_ENGINE_ACTIVE` | gauge | The ratio of cycles during which a graphics engine or compute engine remains active. | |
| `DCGM_FI_PROF_SM_ACTIVE` | gauge | The ratio of cycles an SM has at least 1 warp assigned. | |
| `DCGM_FI_PROF_SM_OCCUPANCY` | gauge | The ratio of number of warps resident on an SM. | |
| `DCGM_FI_PROF_PIPE_TENSOR_ACTIVE` | gauge | Ratio of cycles the tensor (HMMA) pipe is active. | |
| `DCGM_FI_PROF_PIPE_FP64_ACTIVE` | gauge | Ratio of cycles the fp64 pipes are active. | |
| `DCGM_FI_PROF_PIPE_FP32_ACTIVE` | gauge | Ratio of cycles the fp32 pipes are active. | |
| `DCGM_FI_PROF_PIPE_FP16_ACTIVE` | gauge | Ratio of cycles the fp16 pipes are active. | |
| `DCGM_FI_PROF_PIPE_INT_ACTIVE` | gauge | Ratio of cycles the integer pipe is active. | |
| `DCGM_FI_PROF_DRAM_ACTIVE` | gauge | Ratio of cycles the device memory interface is active sending or receiving data. | |
| `DCGM_FI_PROF_PCIE_TX_BYTES` | counter | The number of bytes of active PCIe tx (transmit) data including both header and payload. | |
| `DCGM_FI_PROF_PCIE_RX_BYTES` | counter | The number of bytes of active PCIe rx (read) data including both header and payload. | |
| `DCGM_FI_DEV_SM_CLOCK` | gauge | SM clock frequency (in MHz). | |
| `DCGM_FI_DEV_MEM_CLOCK` | gauge | Memory clock frequency (in MHz). | |
| `DCGM_FI_DEV_MEMORY_TEMP` | gauge | Memory temperature (in C). | |
| `DCGM_FI_DEV_GPU_TEMP` | gauge | GPU temperature (in C). | |
| `DCGM_FI_DEV_POWER_USAGE` | gauge | Power draw (in W). | |
| `DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION` | counter | Total energy consumption since boot (in mJ). | |
| `DCGM_FI_DEV_PCIE_REPLAY_COUNTER` | counter | Total number of PCIe retries. | |
| `DCGM_FI_DEV_XID_ERRORS` | gauge | Value of the last XID error encountered. | |
| `DCGM_FI_DEV_POWER_VIOLATION` | counter | Throttling duration due to power constraints (in us). | |
| `DCGM_FI_DEV_THERMAL_VIOLATION` | counter | Throttling duration due to thermal constraints (in us). | |
| `DCGM_FI_DEV_SYNC_BOOST_VIOLATION` | counter | Throttling duration due to sync-boost constraints (in us). | |
| `DCGM_FI_DEV_BOARD_LIMIT_VIOLATION` | counter | Throttling duration due to board limit constraints (in us). | |
| `DCGM_FI_DEV_LOW_UTIL_VIOLATION` | counter | Throttling duration due to low utilization (in us). | |
| `DCGM_FI_DEV_RELIABILITY_VIOLATION` | counter | Throttling duration due to reliability constraints (in us). | |
| `DCGM_FI_DEV_ECC_SBE_VOL_TOTAL` | counter | Total number of single-bit volatile ECC errors. | |
| `DCGM_FI_DEV_ECC_DBE_VOL_TOTAL` | counter | Total number of double-bit volatile ECC errors. | |
| `DCGM_FI_DEV_ECC_SBE_AGG_TOTAL` | counter | Total number of single-bit persistent ECC errors. | |
| `DCGM_FI_DEV_ECC_DBE_AGG_TOTAL` | counter | Total number of double-bit persistent ECC errors. | |
| `DCGM_FI_DEV_RETIRED_SBE` | counter | Total number of retired pages due to single-bit errors. | |
| `DCGM_FI_DEV_RETIRED_DBE` | counter | Total number of retired pages due to double-bit errors. | |
| `DCGM_FI_DEV_RETIRED_PENDING` | counter | Total number of pages pending retirement. | |
| `DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS`        | counter   | Number of remapped rows for uncorrectable errors.                                            |              |
| `DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS`          | counter   | Number of remapped rows for correctable errors.                                              |              |
| `DCGM_FI_DEV_ROW_REMAP_FAILURE`                  | gauge     | Whether remapping of rows has failed.                                                        |              |
| `DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL` | counter | Total number of NVLink flow-control CRC errors. | |
| `DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_COUNT_TOTAL` | counter | Total number of NVLink data CRC errors. | |
| `DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL` | counter | Total number of NVLink retries. | |
| `DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL` | counter | Total number of NVLink recovery errors. | |
| `DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL` | counter | Total number of NVLink bandwidth counters for all lanes. | |
| `DCGM_FI_DEV_NVLINK_BANDWIDTH_L0` | counter | The number of bytes of active NVLink rx or tx data including both header and payload. | |
| `DCGM_FI_PROF_NVLINK_RX_BYTES`                   | counter   | The number of bytes of active NvLink rx (read) data including both header and payload.      |              |
| `DCGM_FI_PROF_NVLINK_TX_BYTES` | counter | The number of bytes of active NvLink tx (transmit) data including both header and payload. | |

Check [`dcgm/exporter.go`](https://github.com/dstackai/dstack/blob/master/runner/internal/shim/dcgm/exporter.go) for the list of metrics.

| Label | Type | Examples |
|---|---|---|
| `dstack_project_name` | *string* | `main` |
| `dstack_user_name` | *string* | `alice` |
| `dstack_run_name` | *string* | `nccl-tests` |
| `dstack_run_id` | *string* | `51e837bf-fae9-4a37-ac9c-85c005606c22` |
| `dstack_job_name` | *string* | `nccl-tests-0-0` |
| `dstack_job_id` | *string* | `8c28c52c-2f94-4a19-8c06-12f1dfee4dd2` |
| `dstack_job_num` | *integer* | `0` |
| `dstack_replica_num` | *integer* | `0` |
=== "Labels"
| Name                   | Type      | Description             | Examples                               |
|------------------------|-----------|:------------------------|----------------------------------------|
| `dstack_project_name` | *string* | Project name | `main` |
| `dstack_user_name` | *string* | User name | `alice` |
| `dstack_run_name` | *string* | Run name | `nccl-tests` |
| `dstack_run_id` | *string* | Run ID | `51e837bf-fae9-4a37-ac9c-85c005606c22` |
| `dstack_job_name` | *string* | Job name | `nccl-tests-0-0` |
| `dstack_job_id` | *string* | Job ID | `8c28c52c-2f94-4a19-8c06-12f1dfee4dd2` |
| `dstack_job_num` | *integer* | Job number | `0` |
| `dstack_replica_num` | *integer* | Replica number | `0` |
| `dstack_run_type` | *string* | Run configuration type | `task`, `dev-environment` |
| `dstack_backend` | *string* | Backend | `aws`, `runpod` |
| `dstack_gpu` | *string?* | GPU name | `H100` |