Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[collectd 6] gpu_sysman: switch from zeInit() to zesInit() #4293

Merged
merged 9 commits into from
Mar 27, 2024

Conversation

eero-t
Copy link
Contributor

@eero-t eero-t commented Mar 1, 2024

ChangeLog: gpu_sysman: switch from zeInit() to zesInit() for Xe KMD support

LevelZero spec v1.5 introduced zesInit(), and deprecated zeInit(), for initializing Sysman part of L0: https://spec.oneapi.io/level-zero/latest/sysman/PROG.html#initialization

Another motivation for introducing this, is L0 Intel backend supporting the new Xe KMD (for the future GPUs [1]), in addition to earlier i915 KMD (for current Intel GPUs), only when Sysman is initialized with zesInit().

Changing to zesInit() requires also dropping use of L0 core API functions, and using only L0 Sysman API ones, which is not an exact match. Therefore:

  • Some of the information logged at plugin init (when GpuInfo option is used) needed to be dropped
  • pci_dev (PCI device ID) label value needs to be digged from sysfs
  • There are also few other minor changes to the logged GPU info

dev_name label (matching one used by Intel XPU Manager) is added for user-friendly device name.

(Its name can be changed to OTEL compatible one along with other device specific labels when switching to resources attributes.)

This is on top of another PR, which should be merged before this: #4177.


[1] I've tested this with current HW using Xe KMD experimental support for them: https://docs.kernel.org/gpu/rfc/xe.html

@eero-t eero-t requested a review from a team as a code owner March 1, 2024 12:13
@collectd-bot collectd-bot added this to the 6.0 milestone Mar 1, 2024
@eero-t eero-t added the Feature label Mar 1, 2024
@eero-t
Copy link
Contributor Author

eero-t commented Mar 1, 2024

L0 dependencies:

Distro versions of them:

Versions in Debian 12 (Stable / bookworm) are both too old, that's why those CI checks failed. Trixie (Testing) versions are both new enough though.

All Fedora versions have both new enough frontend & backend, so for Fedora & Debian it would be enough to check for >= 1.9 frontend version.

Ubuntu 23.10 is problematic, it has new enough frontend (1.12), but backend is way too old (22.x). Only in forthcoming 24.04 LTS, both are new enough. I.e. in 23.10, L0 backend would return "uninitialized" error when Sysman plugin calls frontend zesInit() function.

I think Ubuntu 23.10 can be ignored, as it's soon out of support, and frontend >= 1.9 check will at least allow plugin to be built when it's enabled (although its init would fail with too old backend).

Only way to support older backends would be having fallback intialization using the old L0 core functions, but I'd rather not go there.

@octo, any comments?

@eero-t
Copy link
Contributor Author

eero-t commented Mar 1, 2024

CI checks pass now, Debian unstable fail is unrelated:

PASS: test_plugin_gpu_sysman
...
FAIL: test_plugin_virt

Sysman plugin is not built any more for the included stable Debian versions, so installing of the related L0 packages for them could be dropped from CI configs once this is merged.

(Those packages could be added for Ubuntu 24.04 LTS once it's released, and added to CI test matrix.)

@eero-t
Copy link
Contributor Author

eero-t commented Mar 1, 2024

Only way to support older backends would be having fallback intialization using the old L0 core functions, but I'd rather not go there.

Investigated Level-Zero frontend implementation. It will return "unitialized" error for zesInit() regardless of whether zesInit() support is missing from backend, or backend function returned some error (e.g. because there was no GPU). I.e. plugin cannot differentiate between these cases (and fallback to older initialization, or just log a warning, if newer backend would be needed).

@eero-t eero-t changed the title [collectd 6, WIP] gpu_sysman: switch from zeInit() to zesInit() [collectd 6] gpu_sysman: switch from zeInit() to zesInit() Mar 12, 2024
@eero-t eero-t force-pushed the xe-kmd-support branch 2 times, most recently from 9a3ae97 to 4e9a756 Compare March 14, 2024 15:20
zesInit() requires Level-Zero frontend v1.9 implementing spec v1.5.

Intel L0 Sysman backend supports only "i915" GPU KMD uAPI variants
when initialized using L0 core zeInit() function. Support for the new
"xe" GPU KMD uAPI requires backend to be initialized using (new)
Sysman-specific zesInit() function instead.

Use of zesInit() means that called driver and device functions also
need to be be L0 Sysman versions instead L0 core versions (skipping L0
core initialization may speed plugin initialization).

Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
In the GPU info logging function, as L0 core functions do nothing when
(first) init call (zesInit) is done for Sysman.

Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
Sysman may return zero for physical memory size, but allocatable size
should always be valid.  This can be useful in identifying different
GPU types (on same machine) from each other.

Skip showing of invalid memory information.

Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
When just Sysman is initialized (by using zesInit() instead of
zeInit()), L0 core functions are no-ops and core structs are not
filled (except for UUID), so drop those.

Try also to present the datata in slightly more readable format
by re-ordering it a bit and making indentation more consistent.

As Sysman does not provide device PCI ID for "pci_dev" label,
take device (model/marketing) name instead, and assing it to
"dev_name" label (like Intel XPU Manager already does).

Changing the "pci_dev" member name, is left for later, as it
would conflict with resource attribute and OpenTelemetry metric
name work being done in parallel.

Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
@eero-t
Copy link
Contributor Author

eero-t commented Mar 14, 2024

Changes:

  • Rebased on top of merged "LogMetrics" code, so this is easier to review
  • Simplified sysfs reading code a bit by using read_text_file_contents()
  • Added header checks for glob() / basename() used by that code

Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
@eero-t
Copy link
Contributor Author

eero-t commented Mar 19, 2024

Simplified sysfs reading code a bit by using read_text_file_contents()

New code had off-by-one check bug which caused above not to do what it's supposed, which is now fixed (I'm not sure how I did not notice that before, maybe I accidentally tested old version?).

@octo The only CI failure is for test_plugin_virt i.e. nothing related to this PR, but for some reason summary lists 2 fails, although log shows only the virt plugin one: https://github.com/collectd/collectd/actions/runs/8345888360/job/22841843932?pr=4293 ?

Copy link
Member

@octo octo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @eero-t, sorry for the late reply. I have done more of a surface level review, and overall this looks good to me.

Optimizing new code for new architectures makes sense to me. It is unfortunate that the new version of LevelZero does not interact with older backends, but that's not something we can fix.

Why is is always the virt plugin that's breaking? This time it's an invalid write inside liblzma

@eero-t eero-t merged commit 3ce45a7 into collectd:collectd-6.0 Mar 27, 2024
22 of 23 checks passed
@eero-t eero-t deleted the xe-kmd-support branch March 27, 2024 09:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants