[collectd 6] gpu_sysman: switch from zeInit() to zesInit() #4293
L0 dependencies:
Distro versions of them:
Versions in Debian 12 (Stable / bookworm) are both too old, which is why those CI checks failed. Trixie (Testing) versions are both new enough though. All Fedora versions have a new enough frontend & backend, so for Fedora & Debian it would be enough to check for frontend version >= 1.9.

Ubuntu 23.10 is problematic: it has a new enough frontend (1.12), but its backend is way too old (22.x). Only in the forthcoming 24.04 LTS are both new enough. I.e. in 23.10, the L0 backend would return an "uninitialized" error when the Sysman plugin calls the frontend. I think Ubuntu 23.10 can be ignored, as it's soon out of support, and the frontend >= 1.9 check will at least allow the plugin to be built when it's enabled (although its init would fail with a too old backend).

The only way to support older backends would be a fallback initialization using the old L0 core functions, but I'd rather not go there. @octo, any comments?
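To illustrate why the ">= 1.9" check has to compare version components numerically (1.12 is newer than 1.9, although it sorts lower as a decimal number or a string), here is a minimal sketch. The function name `l0_frontend_is_new_enough` is hypothetical and is not how the collectd build system performs the check; it only shows the comparison logic.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical helper (not part of collectd or Level-Zero):
 * returns true when a "major.minor[.patch]" version string
 * denotes version 1.9 or newer. */
bool l0_frontend_is_new_enough(const char *version) {
  unsigned major = 0, minor = 0;

  /* Both the major and minor components must parse as integers. */
  if (version == NULL || sscanf(version, "%u.%u", &major, &minor) != 2)
    return false;

  /* Compare component by component, never as a float or a string. */
  if (major != 1)
    return major > 1;
  return minor >= 9;
}
```

With this, "1.12" (the Ubuntu 23.10 frontend mentioned above) is accepted, while e.g. "1.8" would be rejected. Note that a check like this only covers the frontend; it cannot detect a too old backend.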
CI checks pass now, Debian unstable fail is unrelated:
The Sysman plugin is no longer built for the included stable Debian versions, so installation of the related L0 packages for them could be dropped from the CI configs once this is merged. (Those packages could be added for Ubuntu 24.04 LTS once it's released and added to the CI test matrix.)
Investigated the Level-Zero frontend implementation. It will return an "uninitialized" error for
9a3ae97 to 4e9a756
zesInit() requires Level-Zero frontend v1.9 implementing spec v1.5. Intel L0 Sysman backend supports only "i915" GPU KMD uAPI variants when initialized using L0 core zeInit() function. Support for the new "xe" GPU KMD uAPI requires backend to be initialized using (new) Sysman-specific zesInit() function instead. Use of zesInit() means that called driver and device functions also need to be L0 Sysman versions instead of L0 core versions (skipping L0 core initialization may speed up plugin initialization). Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
In the GPU info logging function, as L0 core functions do nothing when (first) init call (zesInit) is done for Sysman. Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
Sysman may return zero for physical memory size, but allocatable size should always be valid. This can be useful in identifying different GPU types (on same machine) from each other. Skip showing of invalid memory information. Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
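The "skip invalid memory information" rule above could look roughly like the following sketch. The struct `gpu_mem_info_t`, its field names and the function `format_mem_info` are hypothetical stand-ins for illustration, not the plugin's actual types or the Sysman API; the only behavior taken from the commit message is that a zero physical size is treated as invalid while the allocatable size is assumed valid.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical container for values that would be read from Sysman. */
typedef struct {
  uint64_t physical_size; /* bytes, may be reported as zero */
  uint64_t alloc_size;    /* bytes, assumed to be always valid */
} gpu_mem_info_t;

/* Formats memory sizes in MiB into buf, leaving out the physical
 * size when it is zero. Returns the snprintf() result. */
int format_mem_info(const gpu_mem_info_t *mem, char *buf, size_t len) {
  const uint64_t mib = 1024 * 1024;

  if (mem->physical_size == 0)
    return snprintf(buf, len, "allocatable: %" PRIu64 " MiB",
                    mem->alloc_size / mib);

  return snprintf(buf, len,
                  "physical: %" PRIu64 " MiB, allocatable: %" PRIu64 " MiB",
                  mem->physical_size / mib, mem->alloc_size / mib);
}
```

Logging only the valid fields keeps the output usable for telling different GPU types on the same machine apart, without printing a misleading "0" for the physical size.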
When just Sysman is initialized (by using zesInit() instead of zeInit()), L0 core functions are no-ops and core structs are not filled (except for UUID), so drop those. Try also to present the data in slightly more readable format by re-ordering it a bit and making indentation more consistent. As Sysman does not provide device PCI ID for "pci_dev" label, take device (model/marketing) name instead, and assign it to "dev_name" label (like Intel XPU Manager already does). Changing the "pci_dev" member name is left for later, as it would conflict with resource attribute and OpenTelemetry metric name work being done in parallel. Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
Changes:
The new code had an off-by-one check bug which caused the above not to do what it's supposed to; that is now fixed (I'm not sure how I did not notice that before, maybe I accidentally tested an old version?). @octo The only CI failure is for
Hi @eero-t, sorry for the late reply. I have done more of a surface level review, and overall this looks good to me.
Optimizing new code for new architectures makes sense to me. It is unfortunate that the new version of LevelZero does not interact with older backends, but that's not something we can fix.
Why is it always the virt plugin that's breaking? This time it's an invalid write inside liblzma …
ChangeLog: gpu_sysman: switch from zeInit() to zesInit() for Xe KMD support
LevelZero spec v1.5 introduced `zesInit()`, and deprecated `zeInit()`, for initializing Sysman part of L0: https://spec.oneapi.io/level-zero/latest/sysman/PROG.html#initialization

Another motivation for introducing this is that the L0 Intel backend supports the new Xe KMD (for the future GPUs [1]), in addition to the earlier i915 KMD (for current Intel GPUs), only when Sysman is initialized with `zesInit()`.

Changing to `zesInit()` also requires dropping use of L0 core API functions, and using only L0 Sysman API ones, which is not an exact match. Therefore:
- `pci_dev` (PCI device ID) label value needs to be dug from sysfs
- `dev_name` label (matching the one used by Intel XPU Manager) is added for a user-friendly device name. (Its name can be changed to an OTEL compatible one along with other device specific labels when switching to resource attributes.)
This is on top of another PR, which should be merged before this: #4177.
[1] I've tested this with current HW using Xe KMD experimental support for them: https://docs.kernel.org/gpu/rfc/xe.html