-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[collectd 6] Add 'gpu_sysman' plugin for (Intel) GPU metrics #3968
Conversation
eeb2ecb
to
1a2ecb2
Compare
@mrunge / @elfiesmelfie please remove that broken Trusty config from CI, or replace it with still maintained / working distro version.
Why this test fails?
PS. What clang-format version and config file is used by the https://format.collectd.org/ called by
|
@mrunge Thanks for merging the v6 "write_prometheus" PR (it's useful for testing this PR)! Any idea how that merge could fix the earlier clang-format warning for the code in gpu_sysman.c, which it not touch? Is the "clang-format" CI check flaky? PS. There are some review comments for all the changes between the v5 and v6 PRs of this plugin here: eero-t#1 |
I feel bad for keeping you waiting here. I wanted to submit a PR to move the CI from trusty to some not so ancient Ubuntu, but then ... reality here kicked in. clang-format has been an issue before, then it did not, and apparently it became an issue again. |
No problem. This plugin and its test code have grown pretty large (line-wise it's now in plugins' top-10), so there's a lot to review. And the "trusty" CI issue needs to be fixed first, both in main and v6 branch, as its blocking also other things than this. After that is fixed, I could e.g. rebase this to "collectd-6.0", so that prometheus merge drops from changes list and tests get re-run. If you want to test building & running the plugin and its test code, all you need is Level-Zero frontend: https://github.com/oneapi-src/level-zero/ However, for the plugin init to actually succeed, and to get metrics, you will need also Level-Zero backend, i.e. Level-Zero enabled version of Intel compute stack. Because that is not yet in distros, you either need to build it yourself (hard), or get it from Intel's driver packages repo. Besides, most of the metrics come only from discrete GPUs, only few of them are available for Intel integrated GPUs (just frequency and engine utilization, latter only with fairly recent compute-runtime and kernel version). I.e. to test most of the metrics you would need at least DG1 discrete GPU, very latest kernel and enable the dGPU support with kernel force-probe option. EDIT: tested latest drm-tip + compute-runtime on TigerLake and updated what metrics are available on iGPUs. |
9e48877
to
f049b82
Compare
I decided to rebase all my v6 based branches on "collectd-6.0" already (after rebasing first to help testing it, I could not really leave rest unrebased as these branches are stacked on each other, sorry about that). |
@mrunge if there are some commits that you'd like to be squashed, dropped or added to this PR, just comment. For example, last commit in the PR is useful only with Kubernetes Intel GPU plugin, so you might not want it in the first version: f049b82 And e.g. the later test code refactoring from my dev-branch could be nice: https://github.com/eero-t/collectd/commits/gpu_sysman-6.0-dev (Extra metrics added in the dev-branch do not make sense at this point as publicly available SW stack does not provide data even for all of the metrics in this PR on currently publicly available HW. I've filed bugs about latter: https://github.com/intel/compute-runtime/issues?q=is%3Aissue+is%3Aopen+DG1+) |
Metrics data is provided by OneAPI Level Zero Sysman API.
See comment at start of src/gpu_sysman_test.c for details.
…c(...) Except for gpu_subarray_alloc(), all allocs are done with calloc(). This way correctness of all of them is easy to check just by grepping for calloc (especially now that clang-format does not wrap those lines any more), and reviewing gpu_subarray_alloc().
Prometheus & OpenMetrics require metric names to be suffixed by the metric unit, and ratios (0-1) to be used instead of percentages (0-100).
…names There's now also support for multiple metrics per family although they are not used yet. "sstrncpy" is not needed any more.
Following labels are used: - sub_dev: subdevice ID (unsigned integer) - location: e.g. "gpu" / "memory" - type: e.g. "request" / "actual" - direction: "read" / "write" Additionally: * Two location label values were fixed * GPU engine indeces are now per engine type (instead of single index being used for all types) * All metric family and label names have been changed to use underscores instead of dashes to separate words, as required by Prometheus i.e. collectd does not need to convert them any more: https://prometheus.io/docs/concepts/data_model/#metric-names-and-labels
NOTE: providing NULL as label value to delete it is NOT supported. Test code will assert on labels with NULL values.
Also rename GPU struct "name" member to more explicit "pci_bdf". This allowed simplifying the code slightly. Sysman API supports nowadays also other devices than GPUs, so prefix is removed to to simplify code and to be more future-proof: https://spec.oneapi.io/level-zero/latest/core/api.html#_CPPv416ze_device_type_t (Plugin will still query only GPU devices from Sysman though.)
- do not add "pci_pdf" to metric name for matching - fix for adding metric labels to family copies of them
* Fix memory "type" label overwrite * Replace "free" memory metric with "memory_usage_ratio" one, and rename "memory_bytes" to "memory_used_bytes" metric * Split metric value aggregate function name to a separate "function" label * Have metric family declares always in same place in code * Avoid both setting metric labels, and reporting empty metrics, when higher internal sampling rate is used or there are L0 errors
* Add "memory_usage_ratio" checks * Update validation for metrics that can be sampled at higher rate i.e. have now the new aggregate function label * With empty metrics avoided, dispatch mock-up can assert on them * With extra L0 calls being skipped when not needed, number of calls can differ between query rounds: - refactor multi-sampling test to handle count changes - change error handing checks to be done in single-sampled mode * Debug output is needed to debug triggered multisample asserts, so do that when assert would have been triggered, then abort
And document why const-qual cast is safe, and why GCC does not warn about other assignments to .name & .help members.
More powerful GPUs can have a large number of engines of given type, but user may be interested only on the higher level engine groups utilization. "DisableEngineSingle" option allows skipping individual engine metrics.
This can be used to speciify whether output metrics values will be raw, derived or both. This commit add support just for the configuration option itself, adding / changing metrics to use it happens in next commit.
This adds new counter type metrics for: * memory bandwidth * frequency throttle time * engine execution time (activity) * energy usage Because collecd internally handles counters as integers, all units cannot be ones recommended by Prometheus, but microseconds and microjoules reported by Sysman.
Zero time intervals or max bandwidth would cause div-by-zero issues and (very rare) time wrap around would cause bogus metric value. Skip all of them.
Empty label equals to a missing one, and Prometheus queries can check for non-existence of a label, so let's just skip empty / unneeded ones. Main difference to earlier is that LevelZero error categories that provide non-zero values only for uncorrectable type (according to spec), are now without a type label. Correctable i.e. zero metrics for those categories were skipped already earlier.
And contrib/format.sh include re-order. "dev_file" support is behind a define (enabled by default) because it needs functions that are only part of POSIX, not C99. Intel Kubernetes GPU plugin uses primary GPU node device file names (card0, card1...) as its GPU identifiers. This new label helps in mapping Kubernetes custom metrics to them.
@mrunge I rebased this to latest collectd-6.0 to get all checks passing. Any chance of getting it merged? I think distro packages for Intel compute-runtime will soon enable level-zero Sysman support, so that one does not need to use Intel repositories for that (or build the stack oneself). Discrete Intel GPUs (= HW that offers more than just the frequency + engine utilization information) are getting more widely available too. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @eero-t . In general, this looks really good.
I have added a few nits here and there, the gist is, we have common memory usage functions. If you could adjust the code to use them: awesome.
And document with what GCC warning options the code is tested / passes.
While for plugin that change does not really help (as target buffer is always larger than source), for test code it is useful. And it shuts up less capabable static checking tools than GCC. As test code cannot use existing collectd functionality for this (test code needs modified versions of some collectd functions, and all collectd code does not pass GCC warnings I use), sstrncpy() is copied to test code. For test code there's also a fix to size given for snprintf(), and removal of redundant string termination for modified plugin_log() copy (vsnprintf() already terminates string).
scalloc() wraps calloc() with exit on alloc failure, similarly to what smalloc() does for malloc().
If asserts were disabled, allocation failures would result in collectd memory errors => replace alloc+assert in the plugin with collectd smalloc/scalloc wrappers that exits after logging allocation error. Downsides are that this does not invoke debugger (which could be in a different control group with plenty of memory), nor tell where / what allocation failed, like enabled assert would, so test code variants of the wrappers still do asserts.
@mrunge I added new Is this is what you wanted? => If yes, I could also create a separate PR for v5 main branch, to add Btw. There appears to be quite a few files places where
But almost all of them sees to try[1] to handle calloc failing gracefully. [1] I did not see there tests on whether alloc failures are handled gracefully, so most of that handling code could be broken. I myself prefer just exiting if alloc error handling is not tested to work properly, to reduce amount of code to something that is known to work. PS. sorry about forgetting to run clang-format & having separate commits for that. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thank you
This looks good to me now. If we find additional issues, we could fix them afterwards.
@mrunge Thanks! Do you want scalloc() commit as PR for main branch? Possibly with changing src/processes.c to use it? |
Oh, the PR would be great, yes! |
Note for anybody reading this... Names of metrics and their labels listed in the original description change with OpenTelemetry support being introduced to Sysman plugin for v6.0 release. |
ChangeLog: gpu_sysman plugin: Add collectd v6 'gpu_sysman' plugin for Intel GPU metrics
This is based on earlier v5 version PR of the plugin: #3883
New Sysman plugin for collectd
Pre-production beta version.
Plugin links against Level Zero loader:
And provides GPU metrics using Level Zero Sysman API:
Metric reporting
One of the uses for this plugin is providing GPU metrics for Prometheus in Kubernetes cluster environment.
Because some metric values can fluctuate between their possible min & max values much more frequently than those values can be reported in such an environment, plugin can be configured to query few of the GPU metrics (currently frequency & memory usage) internally at higher rate + cache the results internally, and provide aggregate metrics of the cached data (currently min/max for given period) at the reporting intervals.
If there are errors with some metric types when their values are queried, plugin disables that metrics type for corresponding GPU. Each of the GPU metrics type queries can also be disabled from the plugin configuration.
Supported metrics
Plugin supports most of the GPU metric APIs provided by the Level-Zero Sysman v1.1 (frontend) specification:
What metrics are available depends on the underlying HW, FW and (Level-Zero backend) driver stack. Which of those metrics get queried + reported, can be controlled with the plugin options.
Metric units
While Gauge values are in units recommended by Prometheus and Open Metrics spec (e.g. seconds), "raw" Counter values are not (they use ujoules / usecs). This is collectd issue as it stores Counter values as integers, instead of using floating point also for those, as expected by the Open Metrics spec:
Metric labels
Plugin helper function adds originating GPU identification label(s) to all metric values:
Values can also include following additional labels (when Sysman provides corresponding information for given metric type):
Possible other metrics
Support could be later added for other metrics with Sysman APIs, when Sysman backend starts supporting them, such as:
Quality expectations
GCC gives no warnings for plugin when much stricter warnings are enabled than what collectd build uses by default.
80-90% unit-testing code coverage (depending on used GCC optimization level), with no memory access nor leakage warnings when unit-tests are run under Valgrind.
Among other things, unit-testing validates error handling for all Sysman functions used at run-time. See
gpu_sysman_test.c
for more information.In addition to unit tests, most of the metrics have been tested also with data from real HW.
Different versions of the code have been tested semi-regularly since 2020 in (a development) setup consisting of few Intel iGPU and discrete (Xe) GPUs, but there has been no testing on production servers.
Testing has been done using Intel "compute-runtime" Sysman GPU backend, both with the public and internal Intel GPU SW stacks supporting Xe HW (in theory plugin should work also with the other Sysman API GPU backends than Intel one).
Validation
Comment at top of
gpu_sysman_test.c
tells how to run validation tests.Intel compute-runtime provides standalone Level Zero test program which can be used to check / validate the metric values reported by the plugin: