cpuallocator, topology-aware: handle weird die setups better#643

Merged
askervin merged 2 commits into containers:main from klihub:fixes/weird-die-setup
Mar 18, 2026

Conversation

@klihub
Collaborator

@klihub klihub commented Mar 16, 2026

This PR fixes the most immediate problems with weird die setups in the topology reported by the kernel. In particular, the commits

  • remove some strict assumptions about the relationship between sockets, dies and NUMA nodes, which have now been shown to be violated occasionally
  • omit dies from the topology tree for weird setups

@klihub klihub requested review from askervin and fmuyassarov March 16, 2026 07:02
Collaborator

@askervin askervin left a comment

...and after reading the second commit, too, now I think I recall this case. If we suspect there is an error in kernel die reporting, then Error in log is fine, too.

I'm already good taking this in as is.

LGTM

@klihub
Collaborator Author

klihub commented Mar 17, 2026

> ...and after reading the second commit, too, now I think I recall this case. If we suspect there is an error in kernel die reporting, then Error in log is fine, too.
>
> I'm already good taking this in as is.
>
> LGTM

@askervin I'm not sure whether this is due to the kernel or whether it is what topology enumeration using CPUID leaf 0x1f really returns on those platforms. We should probably retest this once the 7.0 kernel is out, because it has some topology enumeration fixes. But TBH, I suspect that this is what leaf 0x1f really returns on that particular HW...

klihub added 2 commits March 17, 2026 11:44
Do not panic for weird die setups. If an LLC group spans multiple
dies, sort cache groups only by socket, NUMA node and lowest CPU.

Signed-off-by: Krisztian Litkey <krisztian.litkey@intel.com>
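
The fallback ordering described in this commit message can be illustrated with a minimal Go sketch. It is not the code from this PR: the `cacheGroup` type, its fields and `sortCacheGroups` are hypothetical names, and using the die as a sort key in the regular case is an assumption. The only part taken from the commit message is that when an LLC group spans multiple dies, groups are ordered by socket, NUMA node and lowest CPU without panicking.

```go
package main

import (
	"fmt"
	"sort"
)

// cacheGroup is a hypothetical stand-in for a set of CPUs sharing an LLC.
type cacheGroup struct {
	socket int
	dies   []int // die IDs seen among the group's CPUs, normally exactly one
	node   int   // NUMA node
	minCPU int   // lowest CPU ID in the group
}

// sortCacheGroups orders groups by socket, die, NUMA node and lowest CPU.
// If any group spans several dies, the die key is skipped for all groups
// instead of treating the situation as a fatal error.
func sortCacheGroups(groups []cacheGroup) {
	useDie := true
	for _, g := range groups {
		if len(g.dies) != 1 {
			useDie = false // weird setup: fall back to socket, node, CPU
			break
		}
	}
	sort.SliceStable(groups, func(i, j int) bool {
		a, b := groups[i], groups[j]
		if a.socket != b.socket {
			return a.socket < b.socket
		}
		if useDie && a.dies[0] != b.dies[0] {
			return a.dies[0] < b.dies[0]
		}
		if a.node != b.node {
			return a.node < b.node
		}
		return a.minCPU < b.minCPU
	})
}

func main() {
	groups := []cacheGroup{
		{socket: 0, dies: []int{0, 1}, node: 1, minCPU: 8}, // spans two dies
		{socket: 0, dies: []int{1}, node: 0, minCPU: 4},
		{socket: 0, dies: []int{0}, node: 0, minCPU: 0},
	}
	sortCacheGroups(groups)
	for _, g := range groups {
		fmt.Println(g.socket, g.node, g.minCPU)
	}
}
```
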
Omit die pools if dies are the same as clusters.

Signed-off-by: Krisztian Litkey <krisztian.litkey@intel.com>
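
One way to read "dies are the same as clusters" is that both levels partition the CPUs into identical groups, so a die pool would only duplicate a cluster. The Go sketch below illustrates that check under this assumption. The `cpuGroups` type and `sameGrouping` function are hypothetical and not taken from the PR, and the CPU lists are assumed to be sorted.

```go
package main

import "fmt"

// cpuGroups maps a topology ID (die or cluster) to the sorted CPU IDs it holds.
type cpuGroups map[int][]int

// sameGrouping reports whether a and b split the CPUs into identical groups,
// regardless of which IDs the groups carry.
func sameGrouping(a, b cpuGroups) bool {
	if len(a) != len(b) {
		return false
	}
	seen := map[string]bool{}
	for _, cpus := range a {
		seen[fmt.Sprint(cpus)] = true
	}
	for _, cpus := range b {
		key := fmt.Sprint(cpus)
		if !seen[key] {
			return false
		}
		delete(seen, key) // each group may be matched only once
	}
	return true
}

func main() {
	dies := cpuGroups{0: {0, 1}, 1: {2, 3}}
	clusters := cpuGroups{8: {0, 1}, 16: {2, 3}}
	if sameGrouping(dies, clusters) {
		fmt.Println("dies match clusters: omit die pools")
	} else {
		fmt.Println("dies differ from clusters: keep die pools")
	}
}
```
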
@klihub klihub force-pushed the fixes/weird-die-setup branch from a4522ad to 2e933df Compare March 17, 2026 09:44
Collaborator

@askervin askervin left a comment

LGTM

@askervin
Collaborator

> @askervin I'm not sure if this is due to the kernel or if it is really what the topology-enumeration using the CPUID leaf 0x1f

That's right, sounds more like an ACPI issue.

Since this PR looks safe when die numbering works fine and writes nice warnings to the log in other cases, I'll merge it now, and let's test with Linux 7.0. (I don't want to keep this PR pending while our topology detection does weird things on weirdly reported dies...)

@askervin askervin merged commit aab9dd0 into containers:main Mar 18, 2026
14 of 15 checks passed