Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[collectd 6] Add 'gpu_sysman' plugin for (Intel) GPU metrics #3968

Merged
merged 27 commits into from
Jun 7, 2022

Conversation

eero-t
Copy link
Contributor

@eero-t eero-t commented Jan 27, 2022

ChangeLog: gpu_sysman plugin: Add collectd v6 'gpu_sysman' plugin for Intel GPU metrics

This is based on earlier v5 version PR of the plugin: #3883

New Sysman plugin for collectd

Pre-production beta version.

Plugin links against Level Zero loader:

And provides GPU metrics using Level Zero Sysman API:

Metric reporting

One of the uses for this plugin is providing GPU metrics for Prometheus in Kubernetes cluster environment.

Because some metric values can fluctuate between their possible min & max values much more frequently than those values can be reported in such an environment, plugin can be configured to query few of the GPU metrics (currently frequency & memory usage) internally at higher rate + cache the results internally, and provide aggregate metrics of the cached data (currently min/max for given period) at the reporting intervals.

If there are errors with some metric types when their values are queried, plugin disables that metrics type for corresponding GPU. Each of the GPU metrics type queries can also be disabled from the plugin configuration.

Supported metrics

Plugin supports most of the GPU metric APIs provided by the Level-Zero Sysman v1.1 (frontend) specification:

  • Temperature (Celsius)
  • Power usage (ujoules counter / Watts)
  • Memory usage (bytes used / usage ratio)
  • Memory BW (byte counter / usage ratio)
  • Frequency (requested + actual frequency)
    • Frequency throttling (usec counter / throttle ratio)
  • Engine utilization (usec counter / utilization ratio)
  • RAS error counters (several correctable + uncorrectable categories)

What metrics are available depends on the underlying HW, FW and (Level-Zero backend) driver stack. Which of those metrics get queried + reported, can be controlled with the plugin options.

Metric units

While Gauge values are in units recommended by Prometheus and Open Metrics spec (e.g. seconds), "raw" Counter values are not (they use ujoules / usecs). This is collectd issue as it stores Counter values as integers, instead of using floating point also for those, as expected by the Open Metrics spec:

Metric labels

Plugin helper function adds originating GPU identification label(s) to all metric values:

  • "pci_bdf": PCI device address in BDF notation
  • "dev_file": primary device node file name (enabled with ADD_DEV_FILE define, for use with Kubernetes Intel GPU plugin)

Values can also include following additional labels (when Sysman provides corresponding information for given metric type):

  • "sub_dev": sub-device ID
  • "location": location for given metric
  • "direction": e.g. read or write for bandwidth metric
  • "type": e.g. memory type for memory metric
  • "function": aggregation function used to provide value for data that was internally sampled at higher rate (min / max)

Possible other metrics

Support could be later added for other metrics with Sysman APIs, when Sysman backend starts supporting them, such as:

  • Fabric port read/write BW
  • PSU voltage, amperage, temp
  • GPU voltage
  • Fan speed
  • Memory health

Quality expectations

GCC gives no warnings for plugin when much stricter warnings are enabled than what collectd build uses by default.

80-90% unit-testing code coverage (depending on used GCC optimization level), with no memory access nor leakage warnings when unit-tests are run under Valgrind.

Among other things, unit-testing validates error handling for all Sysman functions used at run-time. See gpu_sysman_test.c for more information.

In addition to unit tests, most of the metrics have been tested also with data from real HW.

Different versions of the code have been tested semi-regularly since 2020 in (a development) setup consisting of few Intel iGPU and discrete (Xe) GPUs, but there has been no testing on production servers.

Testing has been done using Intel "compute-runtime" Sysman GPU backend, both with the public and internal Intel GPU SW stacks supporting Xe HW (in theory plugin should work also with the other Sysman API GPU backends than Intel one).

Validation

Comment at top of gpu_sysman_test.c tells how to run validation tests.

Intel compute-runtime provides standalone Level Zero test program which can be used to check / validate the metric values reported by the plugin:

@eero-t eero-t requested a review from a team as a code owner January 27, 2022 17:34
@collectd-bot collectd-bot added this to the 6.0 milestone Jan 27, 2022
@eero-t
Copy link
Contributor Author

eero-t commented Jan 27, 2022

debian_default_toolchain container:collectd/ci:trusty_amd64 Failing after 1m — Task Summary
FAIL: test_plugin_capabilities

@mrunge / @elfiesmelfie please remove that broken Trusty config from CI, or replace it with still maintained / working distro version.

clang-format — Please run: contrib/format.sh src/gpu_sysman.c

Why this test fails? contrib/format.sh does NOT show any changes (any more):

$ contrib/format.sh src/gpu_sysman.c
$ git diff

PS. What clang-format version and config file is used by the https://format.collectd.org/ called by format.sh? It sorts includes in different order than latest (local) clang-format, which is run from collectd directory:

$ clang --version
clang version 13.0.0 (Fedora 13.0.0-3.fc35)

$ cat .clang-format 
---
BasedOnStyle: LLVM
IncludeCategories:
  - Regex: '"collectd.h"'
  - Priority: -1

$ clang-format -i src/gpu_sysman.c

$ git diff
diff --git a/src/gpu_sysman.c b/src/gpu_sysman.c
index a721b618..d731eea0 100644
--- a/src/gpu_sysman.c
+++ b/src/gpu_sysman.c
@@ -60,9 +60,9 @@
 #include <libgen.h>
 #endif
 
-#include "collectd.h"
 #include "plugin.h"
 #include "utils/common/common.h"
+#include "collectd.h"
 
 #define PLUGIN_NAME "gpu_sysman"
 #define METRIC_PREFIX "collectd_" PLUGIN_NAME "_"

@eero-t
Copy link
Contributor Author

eero-t commented Feb 3, 2022

@mrunge Thanks for merging the v6 "write_prometheus" PR (it's useful for testing this PR)!

Any idea how that merge could fix the earlier clang-format warning for the code in gpu_sysman.c, which it not touch? Is the "clang-format" CI check flaky?

PS. There are some review comments for all the changes between the v5 and v6 PRs of this plugin here: eero-t#1

@mrunge
Copy link
Member

mrunge commented Feb 3, 2022

I feel bad for keeping you waiting here. I wanted to submit a PR to move the CI from trusty to some not so ancient Ubuntu, but then ... reality here kicked in.

clang-format has been an issue before, then it did not, and apparently it became an issue again.
I've seen clang-format complaining about files not even touched, but it's been working fine for the last ... 2(?) years without an issue, so I forgot about it.

@eero-t
Copy link
Contributor Author

eero-t commented Feb 3, 2022

No problem. This plugin and its test code have grown pretty large (line-wise it's now in plugins' top-10), so there's a lot to review.

And the "trusty" CI issue needs to be fixed first, both in main and v6 branch, as its blocking also other things than this.

After that is fixed, I could e.g. rebase this to "collectd-6.0", so that prometheus merge drops from changes list and tests get re-run.


If you want to test building & running the plugin and its test code, all you need is Level-Zero frontend: https://github.com/oneapi-src/level-zero/

However, for the plugin init to actually succeed, and to get metrics, you will need also Level-Zero backend, i.e. Level-Zero enabled version of Intel compute stack. Because that is not yet in distros, you either need to build it yourself (hard), or get it from Intel's driver packages repo.

Besides, most of the metrics come only from discrete GPUs, only few of them are available for Intel integrated GPUs (just frequency and engine utilization, latter only with fairly recent compute-runtime and kernel version). I.e. to test most of the metrics you would need at least DG1 discrete GPU, very latest kernel and enable the dGPU support with kernel force-probe option.

EDIT: tested latest drm-tip + compute-runtime on TigerLake and updated what metrics are available on iGPUs.

@eero-t
Copy link
Contributor Author

eero-t commented Feb 3, 2022

I decided to rebase all my v6 based branches on "collectd-6.0" already (after rebasing first to help testing it, I could not really leave rest unrebased as these branches are stacked on each other, sorry about that).

@eero-t
Copy link
Contributor Author

eero-t commented Feb 24, 2022

@mrunge if there are some commits that you'd like to be squashed, dropped or added to this PR, just comment.

For example, last commit in the PR is useful only with Kubernetes Intel GPU plugin, so you might not want it in the first version: f049b82

And e.g. the later test code refactoring from my dev-branch could be nice: https://github.com/eero-t/collectd/commits/gpu_sysman-6.0-dev

(Extra metrics added in the dev-branch do not make sense at this point as publicly available SW stack does not provide data even for all of the metrics in this PR on currently publicly available HW. I've filed bugs about latter: https://github.com/intel/compute-runtime/issues?q=is%3Aissue+is%3Aopen+DG1+)

eero-t added 19 commits May 30, 2022 17:20
Metrics data is provided by OneAPI Level Zero Sysman API.
See comment at start of src/gpu_sysman_test.c for details.
…c(...)

Except for gpu_subarray_alloc(), all allocs are done with calloc().
This way correctness of all of them is easy to check just by grepping
for calloc (especially now that clang-format does not wrap those lines
any more), and reviewing gpu_subarray_alloc().
Prometheus & OpenMetrics require metric names to be suffixed by the
metric unit, and ratios (0-1) to be used instead of percentages
(0-100).
…names

There's now also support for multiple metrics per family although they
are not used yet. "sstrncpy" is not needed any more.
Following labels are used:
- sub_dev: subdevice ID (unsigned integer)
- location: e.g. "gpu" / "memory"
- type: e.g. "request" / "actual"
- direction: "read" / "write"

Additionally:

* Two location label values were fixed

* GPU engine indeces are now per engine type
  (instead of single index being used for all types)

* All metric family and label names have been changed to use
  underscores instead of dashes to separate words, as required by
  Prometheus i.e. collectd does not need to convert them any more:
  https://prometheus.io/docs/concepts/data_model/#metric-names-and-labels
NOTE: providing NULL as label value to delete it is NOT supported.
Test code will assert on labels with NULL values.
Also rename GPU struct "name" member to more explicit "pci_bdf".

This allowed simplifying the code slightly.

Sysman API supports nowadays also other devices than GPUs, so prefix
is removed to to simplify code and to be more future-proof:
https://spec.oneapi.io/level-zero/latest/core/api.html#_CPPv416ze_device_type_t

(Plugin will still query only GPU devices from Sysman though.)
- do not add "pci_pdf" to metric name for matching
- fix for adding metric labels to family copies of them
* Fix memory "type" label overwrite

* Replace "free" memory metric with "memory_usage_ratio" one,
  and rename "memory_bytes" to "memory_used_bytes" metric

* Split metric value aggregate function name to a separate
  "function" label

* Have metric family declares always in same place in code

* Avoid both setting metric labels, and reporting empty metrics,
  when higher internal sampling rate is used or there are L0
  errors
* Add "memory_usage_ratio" checks

* Update validation for metrics that can be sampled at higher
  rate i.e. have now the new aggregate function label

* With empty metrics avoided, dispatch mock-up can assert on them

* With extra L0 calls being skipped when not needed, number of calls
  can differ between query rounds:
  - refactor multi-sampling test to handle count changes
  - change error handing checks to be done in single-sampled mode

* Debug output is needed to debug triggered multisample asserts,
  so do that when assert would have been triggered, then abort
And document why const-qual cast is safe, and why GCC does
not warn about other assignments to .name & .help members.
More powerful GPUs can have a large number of engines of given type,
but user may be interested only on the higher level engine groups
utilization.

"DisableEngineSingle" option allows skipping individual engine metrics.
This can be used to speciify whether output metrics values will be
raw, derived or both.

This commit add support just for the configuration option itself,
adding / changing metrics to use it happens in next commit.
This adds new counter type metrics for:
* memory bandwidth
* frequency throttle time
* engine execution time (activity)
* energy usage

Because collecd internally handles counters as integers, all units
cannot be ones recommended by Prometheus, but microseconds and
microjoules reported by Sysman.
Zero time intervals or max bandwidth would cause div-by-zero issues
and (very rare) time wrap around would cause bogus metric value.
Skip all of them.
Empty label equals to a missing one, and Prometheus queries can check
for non-existence of a label, so let's just skip empty / unneeded ones.

Main difference to earlier is that LevelZero error categories that
provide non-zero values only for uncorrectable type (according to
spec), are now without a type label. Correctable i.e. zero metrics for
those categories were skipped already earlier.
And contrib/format.sh include re-order.

"dev_file" support is behind a define (enabled by default) because it
needs functions that are only part of POSIX, not C99.

Intel Kubernetes GPU plugin uses primary GPU node device file names
(card0, card1...) as its GPU identifiers.  This new label helps in
mapping Kubernetes custom metrics to them.
@eero-t
Copy link
Contributor Author

eero-t commented May 30, 2022

@mrunge I rebased this to latest collectd-6.0 to get all checks passing. Any chance of getting it merged?

I think distro packages for Intel compute-runtime will soon enable level-zero Sysman support, so that one does not need to use Intel repositories for that (or build the stack oneself). Discrete Intel GPUs (= HW that offers more than just the frequency + engine utilization information) are getting more widely available too.

src/gpu_sysman.c Outdated Show resolved Hide resolved
src/gpu_sysman.c Outdated Show resolved Hide resolved
src/gpu_sysman.c Outdated Show resolved Hide resolved
Copy link
Member

@mrunge mrunge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @eero-t . In general, this looks really good.
I have added a few nits here and there, the gist is, we have common memory usage functions. If you could adjust the code to use them: awesome.

src/gpu_sysman.c Show resolved Hide resolved
src/gpu_sysman.c Show resolved Hide resolved
src/gpu_sysman.c Show resolved Hide resolved
And document with what GCC warning options the code is tested / passes.
While for plugin that change does not really help (as target buffer is
always larger than source), for test code it is useful. And it shuts
up less capabable static checking tools than GCC.

As test code cannot use existing collectd functionality for this (test
code needs modified versions of some collectd functions, and all
collectd code does not pass GCC warnings I use), sstrncpy() is copied
to test code.

For test code there's also a fix to size given for snprintf(), and
removal of redundant string termination for modified plugin_log() copy
(vsnprintf() already terminates string).
scalloc() wraps calloc() with exit on alloc failure,
similarly to what smalloc() does for malloc().
If asserts were disabled, allocation failures would result in collectd
memory errors => replace alloc+assert in the plugin with collectd
smalloc/scalloc wrappers that exits after logging allocation error.

Downsides are that this does not invoke debugger (which could be in a
different control group with plenty of memory), nor tell where / what
allocation failed, like enabled assert would, so test code variants of
the wrappers still do asserts.
@eero-t eero-t requested a review from a team as a code owner June 7, 2022 14:30
@eero-t
Copy link
Contributor Author

eero-t commented Jun 7, 2022

@mrunge I added new scalloc() safety wrapper for calloc() to common.c, and another commit using that + smalloc() in sysman plugin + test code.

Is this is what you wanted?

=> If yes, I could also create a separate PR for v5 main branch, to add scalloc() wrapper there too.

Btw. There appears to be quite a few files places where calloc() is called:

$ git grep '[= ]calloc *(' src/ | wc -l
477

But almost all of them sees to try[1] to handle calloc failing gracefully. src/processes.c was only one that I noticed not to check whether its calloc() succeeds.

[1] I did not see there tests on whether alloc failures are handled gracefully, so most of that handling code could be broken. I myself prefer just exiting if alloc error handling is not tested to work properly, to reduce amount of code to something that is known to work.

PS. sorry about forgetting to run clang-format & having separate commits for that.

Copy link
Member

@mrunge mrunge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you

This looks good to me now. If we find additional issues, we could fix them afterwards.

@mrunge mrunge merged commit 662a189 into collectd:collectd-6.0 Jun 7, 2022
@eero-t
Copy link
Contributor Author

eero-t commented Jun 7, 2022

@mrunge Thanks! Do you want scalloc() commit as PR for main branch? Possibly with changing src/processes.c to use it?

@mrunge
Copy link
Member

mrunge commented Jun 7, 2022

Oh, the PR would be great, yes!

@eero-t
Copy link
Contributor Author

eero-t commented Jun 8, 2022

@mrunge v5 PR for scalloc(): #4014

@eero-t
Copy link
Contributor Author

eero-t commented Feb 23, 2024

Note for anybody reading this... Names of metrics and their labels listed in the original description change with OpenTelemetry support being introduced to Sysman plugin for v6.0 release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants