
Implement metrics API #1827

Merged
r4victor merged 17 commits into master from issue_1809_metrics_api
Oct 15, 2024

@r4victor (Collaborator) commented Oct 14, 2024

Closes #1809

This PR:

  • Implements the runner metrics API. The runner collects CPU and memory metrics from the cgroups filesystem and NVIDIA GPU metrics via nvidia-smi.
  • Adds server background tasks that collect and delete job metrics. By default, metrics are collected every 10 seconds and retained for 1 hour.
  • Implements the server metrics API.
  • Implements the dstack stats command to view run metrics.
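
To illustrate the kind of computation behind the CPU and memory metrics, here is a minimal Python sketch for a cgroups v2 host. It is not the runner's code. The function names are hypothetical, and the working-set formula (current usage minus `inactive_file`) follows a common convention that this PR does not confirm. The functions take file contents as arguments so the sketch runs without a cgroup filesystem.

```python
from typing import Dict


def parse_flat_keyed(text: str) -> Dict[str, int]:
    """Parse a cgroup "flat keyed" file such as cpu.stat or memory.stat."""
    result: Dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            result[parts[0]] = int(parts[1])
    return result


def cpu_usage_percent(
    prev_usage_usec: int, prev_ts_usec: int, cur_usage_usec: int, cur_ts_usec: int
) -> float:
    """CPU usage between two samples of cpu.stat's usage_usec.

    The result can exceed 100 when the job uses more than one core.
    """
    elapsed = cur_ts_usec - prev_ts_usec
    if elapsed <= 0:
        raise ValueError("samples must be taken at increasing timestamps")
    return 100.0 * (cur_usage_usec - prev_usage_usec) / elapsed


def memory_working_set_bytes(memory_current: int, memory_stat_text: str) -> int:
    """Assumed convention: working set = memory.current - inactive_file."""
    inactive_file = parse_flat_keyed(memory_stat_text).get("inactive_file", 0)
    return max(memory_current - inactive_file, 0)
```

With two samples taken 10 seconds apart, 14.7 seconds of accumulated CPU time corresponds to 147%, which is why a value above 100 is not an error for a multi-core job.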

Backward compatibility:

  • The server still works with old runners that don't provide the metrics API; for those it emits warnings that metrics collection failed.
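
The fallback described above could look roughly like the following Python sketch. It is an illustration only. `MetricsNotSupported`, `collect_job_metrics`, and the `fetch` callable are hypothetical names, and the PR does not say how the server detects an old runner.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("metrics_sketch")


class MetricsNotSupported(Exception):
    """Hypothetical error raised when a runner has no metrics endpoint."""


def collect_job_metrics(fetch: Callable[[], dict], job_name: str) -> Optional[dict]:
    """Return one metrics sample, or None if the runner is too old.

    `fetch` stands in for the call to the runner's metrics API.
    """
    try:
        return fetch()
    except MetricsNotSupported:
        # Warn and skip the job instead of failing the whole background task.
        logger.warning("Failed to collect metrics for job %s", job_name)
        return None
```

Returning `None` lets the caller skip storage for that job while the rest of the collection cycle continues.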

Tested:

  • Metrics of a run on a cgroups v1 host.
  • Metrics of a run on a cgroups v2 host.
  • Metrics of a run with no GPUs.
  • Metrics of a run with one GPU.
  • Metrics of a run with multiple GPUs.
  • Metrics of a run executed by an old runner.
  • Multi-replica server metrics collection.

TODOs:

  • Support AMD GPU metrics.

dstack stats example:

✗ dstack stats hot-frog-1 -w
 NAME        CPU  MEMORY           GPU                        
 hot-frog-1  2%   15307MB/49152MB  #0 22764MB/24576MB 0% Util 

Metrics API example: GET /api/project/{project_name}/metrics/job/{run_name}

{
	"metrics": [
		{
			"name": "cpu_usage_percent",
			"timestamps": [
				"2024-10-14T08:27:59.721935+00:00"
			],
			"values": [
				147
			]
		},
		{
			"name": "memory_usage_bytes",
			"timestamps": [
				"2024-10-14T08:27:59.721935+00:00"
			],
			"values": [
				652775424
			]
		},
		{
			"name": "memory_working_set_bytes",
			"timestamps": [
				"2024-10-14T08:27:59.721935+00:00"
			],
			"values": [
				167317504
			]
		},
		{
			"name": "gpus_detected_num",
			"timestamps": [
				"2024-10-14T08:27:59.721935+00:00"
			],
			"values": [
				2
			]
		},
		{
			"name": "gpu_memory_usage_bytes_gpu0",
			"timestamps": [
				"2024-10-14T08:27:59.721935+00:00"
			],
			"values": [
				2097152
			]
		},
		{
			"name": "gpu_util_percent_gpu0",
			"timestamps": [
				"2024-10-14T08:27:59.721935+00:00"
			],
			"values": [
				0
			]
		},
		{
			"name": "gpu_memory_usage_bytes_gpu1",
			"timestamps": [
				"2024-10-14T08:27:59.721935+00:00"
			],
			"values": [
				2097152
			]
		},
		{
			"name": "gpu_util_percent_gpu1",
			"timestamps": [
				"2024-10-14T08:27:59.721935+00:00"
			],
			"values": [
				0
			]
		}
	]
}
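
A client might reduce a response of this shape to the newest value per metric and group the per-GPU series by index. The Python sketch below assumes only the JSON structure shown above. The helper names are hypothetical, and the `_gpuN` suffix pattern is inferred from the example rather than from a documented naming scheme.

```python
import re
from datetime import datetime
from typing import Dict

TS = "2024-10-14T08:27:59.721935+00:00"
# Trimmed copy of the example response above.
SAMPLE = {
    "metrics": [
        {"name": "cpu_usage_percent", "timestamps": [TS], "values": [147]},
        {"name": "gpus_detected_num", "timestamps": [TS], "values": [2]},
        {"name": "gpu_memory_usage_bytes_gpu0", "timestamps": [TS], "values": [2097152]},
        {"name": "gpu_util_percent_gpu0", "timestamps": [TS], "values": [0]},
    ]
}

# Matches names such as "gpu_util_percent_gpu0" (inferred from the example).
PER_GPU_NAME = re.compile(r"^(gpu_.+)_gpu(\d+)$")


def latest_values(payload: dict) -> Dict[str, float]:
    """Map each metric name to the value with the newest timestamp."""
    latest: Dict[str, float] = {}
    for metric in payload["metrics"]:
        points = zip(metric["timestamps"], metric["values"])
        newest = max(points, key=lambda p: datetime.fromisoformat(p[0]), default=None)
        if newest is not None:
            latest[metric["name"]] = newest[1]
    return latest


def group_by_gpu(latest: Dict[str, float]) -> Dict[int, Dict[str, float]]:
    """Collect per-GPU metrics under their GPU index."""
    gpus: Dict[int, Dict[str, float]] = {}
    for name, value in latest.items():
        match = PER_GPU_NAME.match(name)
        if match:
            gpus.setdefault(int(match.group(2)), {})[match.group(1)] = value
    return gpus
```

Selecting by timestamp rather than by list position avoids assuming an ordering of the `timestamps` array, which the example with a single point does not reveal.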

@r4victor r4victor marked this pull request as ready for review October 14, 2024 09:32
@r4victor r4victor requested a review from un-def October 14, 2024 09:32
Development

Successfully merging this pull request may close these issues.

Add job metrics API
