[collectd 6] Add 'gpu_sysman' plugin for (Intel) GPU metrics #3968

eero-t · 2022-01-27T17:34:35Z

ChangeLog: gpu_sysman plugin: Add collectd v6 'gpu_sysman' plugin for Intel GPU metrics

This is based on earlier v5 version PR of the plugin: #3883

New Sysman plugin for collectd

Pre-production beta version.

Plugin links against Level Zero loader:

https://github.com/oneapi-src/level-zero

And provides GPU metrics using Level Zero Sysman API:

https://spec.oneapi.com/level-zero/latest/sysman/api.html

Metric reporting

One of the uses for this plugin is providing GPU metrics for Prometheus in Kubernetes cluster environment.

Because some metric values can fluctuate between their possible min & max values much more frequently than those values can be reported in such an environment, plugin can be configured to query few of the GPU metrics (currently frequency & memory usage) internally at higher rate + cache the results internally, and provide aggregate metrics of the cached data (currently min/max for given period) at the reporting intervals.

If there are errors with some metric types when their values are queried, plugin disables that metrics type for corresponding GPU. Each of the GPU metrics type queries can also be disabled from the plugin configuration.

Supported metrics

Plugin supports most of the GPU metric APIs provided by the Level-Zero Sysman v1.1 (frontend) specification:

Temperature (Celsius)
Power usage (ujoules counter / Watts)
Memory usage (bytes used / usage ratio)
Memory BW (byte counter / usage ratio)
Frequency (requested + actual frequency)
- Frequency throttling (usec counter / throttle ratio)
Engine utilization (usec counter / utilization ratio)
RAS error counters (several correctable + uncorrectable categories)

What metrics are available depends on the underlying HW, FW and (Level-Zero backend) driver stack. Which of those metrics get queried + reported, can be controlled with the plugin options.

Metric units

While Gauge values are in units recommended by Prometheus and Open Metrics spec (e.g. seconds), "raw" Counter values are not (they use ujoules / usecs). This is collectd issue as it stores Counter values as integers, instead of using floating point also for those, as expected by the Open Metrics spec:

https://github.com/OpenObservability/OpenMetrics/blob/main/specification/OpenMetrics.md#units-and-base-units

Metric labels

Plugin helper function adds originating GPU identification label(s) to all metric values:

"pci_bdf": PCI device address in BDF notation
"dev_file": primary device node file name (enabled with ADD_DEV_FILE define, for use with Kubernetes Intel GPU plugin)

Values can also include following additional labels (when Sysman provides corresponding information for given metric type):

"sub_dev": sub-device ID
"location": location for given metric
"direction": e.g. read or write for bandwidth metric
"type": e.g. memory type for memory metric
"function": aggregation function used to provide value for data that was internally sampled at higher rate (min / max)

Possible other metrics

Support could be later added for other metrics with Sysman APIs, when Sysman backend starts supporting them, such as:

Fabric port read/write BW
PSU voltage, amperage, temp
GPU voltage
Fan speed
Memory health

Quality expectations

GCC gives no warnings for plugin when much stricter warnings are enabled than what collectd build uses by default.

80-90% unit-testing code coverage (depending on used GCC optimization level), with no memory access nor leakage warnings when unit-tests are run under Valgrind.

Among other things, unit-testing validates error handling for all Sysman functions used at run-time. See gpu_sysman_test.c for more information.

In addition to unit tests, most of the metrics have been tested also with data from real HW.

Different versions of the code have been tested semi-regularly since 2020 in (a development) setup consisting of few Intel iGPU and discrete (Xe) GPUs, but there has been no testing on production servers.

Testing has been done using Intel "compute-runtime" Sysman GPU backend, both with the public and internal Intel GPU SW stacks supporting Xe HW (in theory plugin should work also with the other Sysman API GPU backends than Intel one).

Validation

Comment at top of gpu_sysman_test.c tells how to run validation tests.

Intel compute-runtime provides standalone Level Zero test program which can be used to check / validate the metric values reported by the plugin:

https://github.com/intel/compute-runtime/blob/master/level_zero/tools/test/black_box_tests/zello_sysman.cpp

eero-t · 2022-01-27T18:04:01Z

debian_default_toolchain container:collectd/ci:trusty_amd64 Failing after 1m — Task Summary
FAIL: test_plugin_capabilities

@mrunge / @elfiesmelfie please remove that broken Trusty config from CI, or replace it with still maintained / working distro version.

clang-format — Please run: contrib/format.sh src/gpu_sysman.c

Why this test fails? contrib/format.sh does NOT show any changes (any more):

$ contrib/format.sh src/gpu_sysman.c
$ git diff

PS. What clang-format version and config file is used by the https://format.collectd.org/ called by format.sh? It sorts includes in different order than latest (local) clang-format, which is run from collectd directory:

$ clang --version
clang version 13.0.0 (Fedora 13.0.0-3.fc35)

$ cat .clang-format 
---
BasedOnStyle: LLVM
IncludeCategories:
  - Regex: '"collectd.h"'
  - Priority: -1

$ clang-format -i src/gpu_sysman.c

$ git diff
diff --git a/src/gpu_sysman.c b/src/gpu_sysman.c
index a721b618..d731eea0 100644
--- a/src/gpu_sysman.c
+++ b/src/gpu_sysman.c
@@ -60,9 +60,9 @@
 #include <libgen.h>
 #endif
 
-#include "collectd.h"
 #include "plugin.h"
 #include "utils/common/common.h"
+#include "collectd.h"
 
 #define PLUGIN_NAME "gpu_sysman"
 #define METRIC_PREFIX "collectd_" PLUGIN_NAME "_"

eero-t · 2022-02-03T13:12:35Z

@mrunge Thanks for merging the v6 "write_prometheus" PR (it's useful for testing this PR)!

Any idea how that merge could fix the earlier clang-format warning for the code in gpu_sysman.c, which it not touch? Is the "clang-format" CI check flaky?

PS. There are some review comments for all the changes between the v5 and v6 PRs of this plugin here: eero-t#1

mrunge · 2022-02-03T14:09:13Z

I feel bad for keeping you waiting here. I wanted to submit a PR to move the CI from trusty to some not so ancient Ubuntu, but then ... reality here kicked in.

clang-format has been an issue before, then it did not, and apparently it became an issue again.
I've seen clang-format complaining about files not even touched, but it's been working fine for the last ... 2(?) years without an issue, so I forgot about it.

eero-t · 2022-02-03T15:56:28Z

No problem. This plugin and its test code have grown pretty large (line-wise it's now in plugins' top-10), so there's a lot to review.

And the "trusty" CI issue needs to be fixed first, both in main and v6 branch, as its blocking also other things than this.

After that is fixed, I could e.g. rebase this to "collectd-6.0", so that prometheus merge drops from changes list and tests get re-run.

If you want to test building & running the plugin and its test code, all you need is Level-Zero frontend: https://github.com/oneapi-src/level-zero/

However, for the plugin init to actually succeed, and to get metrics, you will need also Level-Zero backend, i.e. Level-Zero enabled version of Intel compute stack. Because that is not yet in distros, you either need to build it yourself (hard), or get it from Intel's driver packages repo.

Besides, most of the metrics come only from discrete GPUs, only few of them are available for Intel integrated GPUs (just frequency and engine utilization, latter only with fairly recent compute-runtime and kernel version). I.e. to test most of the metrics you would need at least DG1 discrete GPU, very latest kernel and enable the dGPU support with kernel force-probe option.

EDIT: tested latest drm-tip + compute-runtime on TigerLake and updated what metrics are available on iGPUs.

eero-t · 2022-02-03T19:13:27Z

I decided to rebase all my v6 based branches on "collectd-6.0" already (after rebasing first to help testing it, I could not really leave rest unrebased as these branches are stacked on each other, sorry about that).

eero-t · 2022-02-24T14:28:51Z

@mrunge if there are some commits that you'd like to be squashed, dropped or added to this PR, just comment.

For example, last commit in the PR is useful only with Kubernetes Intel GPU plugin, so you might not want it in the first version: f049b82

And e.g. the later test code refactoring from my dev-branch could be nice: https://github.com/eero-t/collectd/commits/gpu_sysman-6.0-dev

(Extra metrics added in the dev-branch do not make sense at this point as publicly available SW stack does not provide data even for all of the metrics in this PR on currently publicly available HW. I've filed bugs about latter: https://github.com/intel/compute-runtime/issues?q=is%3Aissue+is%3Aopen+DG1+)

Metrics data is provided by OneAPI Level Zero Sysman API.

See comment at start of src/gpu_sysman_test.c for details.

…c(...) Except for gpu_subarray_alloc(), all allocs are done with calloc(). This way correctness of all of them is easy to check just by grepping for calloc (especially now that clang-format does not wrap those lines any more), and reviewing gpu_subarray_alloc().

Prometheus & OpenMetrics require metric names to be suffixed by the metric unit, and ratios (0-1) to be used instead of percentages (0-100).

…names There's now also support for multiple metrics per family although they are not used yet. "sstrncpy" is not needed any more.

Following labels are used: - sub_dev: subdevice ID (unsigned integer) - location: e.g. "gpu" / "memory" - type: e.g. "request" / "actual" - direction: "read" / "write" Additionally: * Two location label values were fixed * GPU engine indeces are now per engine type (instead of single index being used for all types) * All metric family and label names have been changed to use underscores instead of dashes to separate words, as required by Prometheus i.e. collectd does not need to convert them any more: https://prometheus.io/docs/concepts/data_model/#metric-names-and-labels

NOTE: providing NULL as label value to delete it is NOT supported. Test code will assert on labels with NULL values.

Also rename GPU struct "name" member to more explicit "pci_bdf". This allowed simplifying the code slightly. Sysman API supports nowadays also other devices than GPUs, so prefix is removed to to simplify code and to be more future-proof: https://spec.oneapi.io/level-zero/latest/core/api.html#_CPPv416ze_device_type_t (Plugin will still query only GPU devices from Sysman though.)

- do not add "pci_pdf" to metric name for matching - fix for adding metric labels to family copies of them

* Fix memory "type" label overwrite * Replace "free" memory metric with "memory_usage_ratio" one, and rename "memory_bytes" to "memory_used_bytes" metric * Split metric value aggregate function name to a separate "function" label * Have metric family declares always in same place in code * Avoid both setting metric labels, and reporting empty metrics, when higher internal sampling rate is used or there are L0 errors

* Add "memory_usage_ratio" checks * Update validation for metrics that can be sampled at higher rate i.e. have now the new aggregate function label * With empty metrics avoided, dispatch mock-up can assert on them * With extra L0 calls being skipped when not needed, number of calls can differ between query rounds: - refactor multi-sampling test to handle count changes - change error handing checks to be done in single-sampled mode * Debug output is needed to debug triggered multisample asserts, so do that when assert would have been triggered, then abort

And document why const-qual cast is safe, and why GCC does not warn about other assignments to .name & .help members.

More powerful GPUs can have a large number of engines of given type, but user may be interested only on the higher level engine groups utilization. "DisableEngineSingle" option allows skipping individual engine metrics.

This can be used to speciify whether output metrics values will be raw, derived or both. This commit add support just for the configuration option itself, adding / changing metrics to use it happens in next commit.

This adds new counter type metrics for: * memory bandwidth * frequency throttle time * engine execution time (activity) * energy usage Because collecd internally handles counters as integers, all units cannot be ones recommended by Prometheus, but microseconds and microjoules reported by Sysman.

Zero time intervals or max bandwidth would cause div-by-zero issues and (very rare) time wrap around would cause bogus metric value. Skip all of them.

Empty label equals to a missing one, and Prometheus queries can check for non-existence of a label, so let's just skip empty / unneeded ones. Main difference to earlier is that LevelZero error categories that provide non-zero values only for uncorrectable type (according to spec), are now without a type label. Correctable i.e. zero metrics for those categories were skipped already earlier.

And contrib/format.sh include re-order. "dev_file" support is behind a define (enabled by default) because it needs functions that are only part of POSIX, not C99. Intel Kubernetes GPU plugin uses primary GPU node device file names (card0, card1...) as its GPU identifiers. This new label helps in mapping Kubernetes custom metrics to them.

eero-t · 2022-05-30T14:55:09Z

@mrunge I rebased this to latest collectd-6.0 to get all checks passing. Any chance of getting it merged?

I think distro packages for Intel compute-runtime will soon enable level-zero Sysman support, so that one does not need to use Intel repositories for that (or build the stack oneself). Discrete Intel GPUs (= HW that offers more than just the frequency + engine utilization information) are getting more widely available too.

src/gpu_sysman.c

mrunge

Thank you @eero-t . In general, this looks really good.
I have added a few nits here and there, the gist is, we have common memory usage functions. If you could adjust the code to use them: awesome.

src/gpu_sysman.c

And document with what GCC warning options the code is tested / passes.

While for plugin that change does not really help (as target buffer is always larger than source), for test code it is useful. And it shuts up less capabable static checking tools than GCC. As test code cannot use existing collectd functionality for this (test code needs modified versions of some collectd functions, and all collectd code does not pass GCC warnings I use), sstrncpy() is copied to test code. For test code there's also a fix to size given for snprintf(), and removal of redundant string termination for modified plugin_log() copy (vsnprintf() already terminates string).

scalloc() wraps calloc() with exit on alloc failure, similarly to what smalloc() does for malloc().

If asserts were disabled, allocation failures would result in collectd memory errors => replace alloc+assert in the plugin with collectd smalloc/scalloc wrappers that exits after logging allocation error. Downsides are that this does not invoke debugger (which could be in a different control group with plenty of memory), nor tell where / what allocation failed, like enabled assert would, so test code variants of the wrappers still do asserts.

eero-t · 2022-06-07T15:04:58Z

@mrunge I added new scalloc() safety wrapper for calloc() to common.c, and another commit using that + smalloc() in sysman plugin + test code.

Is this is what you wanted?

=> If yes, I could also create a separate PR for v5 main branch, to add scalloc() wrapper there too.

Btw. There appears to be quite a few files places where calloc() is called:

$ git grep '[= ]calloc *(' src/ | wc -l
477

But almost all of them sees to try[1] to handle calloc failing gracefully. src/processes.c was only one that I noticed not to check whether its calloc() succeeds.

[1] I did not see there tests on whether alloc failures are handled gracefully, so most of that handling code could be broken. I myself prefer just exiting if alloc error handling is not tested to work properly, to reduce amount of code to something that is known to work.

PS. sorry about forgetting to run clang-format & having separate commits for that.

mrunge

thank you

This looks good to me now. If we find additional issues, we could fix them afterwards.

eero-t · 2022-06-07T18:08:06Z

@mrunge Thanks! Do you want scalloc() commit as PR for main branch? Possibly with changing src/processes.c to use it?

mrunge · 2022-06-07T18:37:06Z

Oh, the PR would be great, yes!

eero-t · 2022-06-08T09:18:17Z

@mrunge v5 PR for scalloc(): #4014

eero-t · 2024-02-23T12:42:53Z

Note for anybody reading this... Names of metrics and their labels listed in the original description change with OpenTelemetry support being introduced to Sysman plugin for v6.0 release.

eero-t requested a review from a team as a code owner January 27, 2022 17:34

collectd-bot added this to the 6.0 milestone Jan 27, 2022

eero-t force-pushed the gpu_sysman-6.0 branch from eeb2ecb to 1a2ecb2 Compare January 27, 2022 17:40

eero-t mentioned this pull request Jan 28, 2022

gpu_sysman plugin update to v6.0 API + new labels + "raw" metrics + output options eero-t/collectd#1

Closed

eero-t force-pushed the gpu_sysman-6.0 branch from 9e48877 to f049b82 Compare February 3, 2022 19:04

eero-t added 19 commits May 30, 2022 17:20

Add 'gpu_sysman' plugin for (Intel) GPU metrics

3259e83

Metrics data is provided by OneAPI Level Zero Sysman API.

Add unit-testing for 'gpu_sysman' plugin

8cc26f1

See comment at start of src/gpu_sysman_test.c for details.

Integrate 'gpu_sysman' plugin and its unit-testing to collectd build

c65927d

Add 'gpu_sysman' plugin configuration and documentation

c51c8c8

gpu_sysman: minimal v6 API support + add units to metric names

16038a9

Prometheus & OpenMetrics require metric names to be suffixed by the metric unit, and ratios (0-1) to be used instead of percentages (0-100).

gpu_sysman: update test code for minimal v6 API support + new metric …

77a132f

…names There's now also support for multiple metrics per family although they are not used yet. "sstrncpy" is not needed any more.

gpu_sysman: update test code to handle metrics split with labels

1d67d25

NOTE: providing NULL as label value to delete it is NOT supported. Test code will assert on labels with NULL values.

gpu_sysman: fix test code for "pci_bdf" added to metrics family

6883d36

- do not add "pci_pdf" to metric name for matching - fix for adding metric labels to family copies of them

gpu_sysman: add help information for all metric families

4f0e293

And document why const-qual cast is safe, and why GCC does not warn about other assignments to .name & .help members.

gpu_sysman: option for specifying metrics output type

8beb8e9

This can be used to speciify whether output metrics values will be raw, derived or both. This commit add support just for the configuration option itself, adding / changing metrics to use it happens in next commit.

gpu_sysman: skip metrics with div-by-zero or time wrap around issues

da2041b

Zero time intervals or max bandwidth would cause div-by-zero issues and (very rare) time wrap around would cause bogus metric value. Skip all of them.

gpu_sysman: fix test code -Wpedantic + -Wcast-qual warnings

fb56381

eero-t added 2 commits May 30, 2022 17:21

eero-t force-pushed the gpu_sysman-6.0 branch from f049b82 to 8f9843f Compare May 30, 2022 14:21

mrunge reviewed Jun 7, 2022

View reviewed changes

src/gpu_sysman.c Outdated Show resolved Hide resolved

mrunge reviewed Jun 7, 2022

View reviewed changes

src/gpu_sysman.c Outdated Show resolved Hide resolved

mrunge reviewed Jun 7, 2022

View reviewed changes

src/gpu_sysman.c Outdated Show resolved Hide resolved

mrunge reviewed Jun 7, 2022

View reviewed changes

src/gpu_sysman.c Show resolved Hide resolved

src/gpu_sysman.c Show resolved Hide resolved

src/gpu_sysman.c Show resolved Hide resolved

eero-t added 5 commits June 7, 2022 15:14

Move test defines from Sysman plugin to its test code

a79cb63

And document with what GCC warning options the code is tested / passes.

Pass clang-format check for gpu_sysman_test.c comments

7bb89a8

Add scalloc() wrapper similar to smalloc() to common utils

4ca4a40

scalloc() wraps calloc() with exit on alloc failure, similarly to what smalloc() does for malloc().

eero-t requested a review from a team as a code owner June 7, 2022 14:30

Pass clang-format check for gpu_sysman_test.c

910b8f2

mrunge approved these changes Jun 7, 2022

View reviewed changes

mrunge merged commit 662a189 into collectd:collectd-6.0 Jun 7, 2022

eero-t mentioned this pull request Jun 8, 2022

Add 'gpu_sysman' plugin for (Intel) GPU metrics #3883

Closed

eero-t mentioned this pull request Mar 1, 2023

monitoring_shared-dev_nfd fails to apply with "error: evalsymlink failure" intel/intel-device-plugins-for-kubernetes#1342

Closed

eero-t mentioned this pull request Jan 15, 2024

Any release plans? 5.12.1 or 5.13.0 or 6.0.0 ? #4186

Open

eero-t added the Feature label Jan 15, 2024

eero-t mentioned this pull request Feb 23, 2024

Intel Arc? intel/xpumanager#26

Closed

eero-t mentioned this pull request May 14, 2024

intel_pmu plugin: add i915 PMU #2529

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[collectd 6] Add 'gpu_sysman' plugin for (Intel) GPU metrics #3968

[collectd 6] Add 'gpu_sysman' plugin for (Intel) GPU metrics #3968

eero-t commented Jan 27, 2022 •

edited

eero-t commented Jan 27, 2022

eero-t commented Feb 3, 2022

mrunge commented Feb 3, 2022

eero-t commented Feb 3, 2022 •

edited

eero-t commented Feb 3, 2022

eero-t commented Feb 24, 2022

eero-t commented May 30, 2022

mrunge left a comment

eero-t commented Jun 7, 2022 •

edited

mrunge left a comment

eero-t commented Jun 7, 2022

mrunge commented Jun 7, 2022

eero-t commented Jun 8, 2022

eero-t commented Feb 23, 2024

[collectd 6] Add 'gpu_sysman' plugin for (Intel) GPU metrics #3968

[collectd 6] Add 'gpu_sysman' plugin for (Intel) GPU metrics #3968

Conversation

eero-t commented Jan 27, 2022 • edited

New Sysman plugin for collectd

Metric reporting

Supported metrics

Metric units

Metric labels

Possible other metrics

Quality expectations

Validation

eero-t commented Jan 27, 2022

eero-t commented Feb 3, 2022

mrunge commented Feb 3, 2022

eero-t commented Feb 3, 2022 • edited

eero-t commented Feb 3, 2022

eero-t commented Feb 24, 2022

eero-t commented May 30, 2022

mrunge left a comment

Choose a reason for hiding this comment

eero-t commented Jun 7, 2022 • edited

mrunge left a comment

Choose a reason for hiding this comment

eero-t commented Jun 7, 2022

mrunge commented Jun 7, 2022

eero-t commented Jun 8, 2022

eero-t commented Feb 23, 2024

eero-t commented Jan 27, 2022 •

edited

eero-t commented Feb 3, 2022 •

edited

eero-t commented Jun 7, 2022 •

edited