topology-aware: add l3Cache topology/pool nodes.#635

Merged
klihub merged 4 commits into containers:main from wongchar:l3cache-unlimitedBurstable
Mar 13, 2026

Conversation

@wongchar
Contributor

@wongchar wongchar commented Feb 26, 2026

Hello!

I would like to propose L3 cache restriction/affinity in the NRI Topology Aware resource policy.
The existing die node level supports resource optimization for AMD EPYC SKUs where the die contains a single L3 cache.

However, some AMD EPYC SKUs contain up to two L3 caches on a single die.
AMD refers to these L3 cache core groupings as a Core Complex (CCX), with up to two CCXs on a Core Complex Die (CCD).

The other motivation is to extend the unlimitedBurstable feature to the L3 Cache level as well.

Please let me know your thoughts

@wongchar wongchar force-pushed the l3cache-unlimitedBurstable branch from e931356 to e92e4d9 Compare February 26, 2026 23:41
@klihub
Collaborator

klihub commented Feb 27, 2026


@wongchar Thank you! We are definitely interested in getting this merged. But I think it would be good to add e2e test cases for verification. Would you be able to provide a sample sysfs dump, or the relevant subset of it, for such HW? It would help us check how easily we could add such tests.

We probably cannot emulate this directly with qemu, but this is not the only such feature, and we already have some environment-based overrides in the sysfs/detection code specifically to be able to fake and test things which can't be emulated properly.

@wongchar
Contributor Author

amd-128c-8l3-sysfs.tar.gz

Agreed, attached is the sysfs dump of an EPYC processor (8534P) with two L3 caches per die (single NUMA node, single socket, SMT enabled).

Happy to help add e2e tests based on what is most cohesive with the existing codebase :)

@klihub
Collaborator

klihub commented Mar 3, 2026

@wongchar Here is a minimal e2e test case, added to your tree. The test case verifies confining unlimited burstability to an L3 cache cluster, which IIUC is the primary motivation for this PR.

I think for getting this merged, we still need to

  • add more e2e test cases, if you have other important use cases in mind
  • update the documentation, at least to mention the newly introduced topology level

With the documentation update we should first wait for #628 to get merged (it's really intrusive), then rebase this PR on main/HEAD, then update the docs.

@klihub klihub force-pushed the l3cache-unlimitedBurstable branch from e92e4d9 to b827509 Compare March 3, 2026 17:16
@klihub
Collaborator

klihub commented Mar 3, 2026


@wongchar Oh, and it looks like your sign-off is incorrect. AFAICT the problem is that you used your personal e-mail address as the commit author, but then signed it off with a different e-mail address, apparently your work one.

@wongchar wongchar force-pushed the l3cache-unlimitedBurstable branch 2 times, most recently from a3921bc to ae6eb05 Compare March 5, 2026 00:01
@wongchar
Contributor Author

wongchar commented Mar 5, 2026

Added some more test cases. Will address docs next (on my to-do)

@wongchar wongchar force-pushed the l3cache-unlimitedBurstable branch 4 times, most recently from 63050e8 to 961e17d Compare March 5, 2026 23:30
@wongchar
Contributor Author

wongchar commented Mar 5, 2026

Added a few more test cases and squashed to the previous commit.

Updated the documentation in a separate commit.

Appreciate your review and feedback, thanks!

@klihub klihub requested review from askervin and klihub March 6, 2026 16:07
} else {
// Single NUMA node per die (or no NUMA subdivision).
// Check for L3 cache groups within this die.
if l3CacheIDs := p.sys.Package(socketID).L3CacheIDs(); len(l3CacheIDs) > 1 {
Collaborator


Until now, the topology tree has been constructed so that there have been no nodes with only one child.

Currently that would be subject to change. For instance, two dies per socket, both dies having their own L3 cache, and no sub-NUMA nodes within the dies, results in a tree where each die node has one child: its own L3 cache node. The same happens if there are multiple NUMA nodes that have their own L3 caches.

@klihub, what do you think, should we try to stick with a minimum-depth tree where the resources of a child node are always a proper subset of the resources of its parent?

If so, then buildL3CachePool() should be called only if there is more than one L3 pool in a die/node scope, instead of more than one L3 cache in a package.

Collaborator

@klihub klihub Mar 9, 2026


Yes, I think it's good to adhere to that principle. We should not have non-branching subtrees, because they do not bring in any new topological information.

I guess the easiest fix here is to do a final filtering at the end of each iteration in buildNumaNodePool() and, if we end up with a single sole child, just remove it.
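The final filtering could be sketched as a small tree pass. This is a stand-alone illustration with made-up types, not the policy's actual node implementation:

```go
package main

import "fmt"

// node is a minimal stand-in for a pool node in the topology tree; the
// real type in cmd/plugins/topology-aware/policy is more involved.
type node struct {
	name     string
	children []*node
}

// pruneSoleChildren splices out any sole child, hoisting its children
// up a level, so non-branching subtrees (which carry no additional
// topological information) disappear.
func pruneSoleChildren(n *node) {
	for len(n.children) == 1 {
		n.children = n.children[0].children
	}
	for _, c := range n.children {
		pruneSoleChildren(c)
	}
}

func main() {
	// A die containing a single L3 cache node would otherwise
	// produce a non-branching subtree.
	die := &node{name: "die #0", children: []*node{
		{name: "L3 cache #0", children: []*node{
			{name: "cpu group A"}, {name: "cpu group B"},
		}},
	}}
	pruneSoleChildren(die)
	fmt.Println(len(die.children)) // the sole L3 node is spliced out
}
```

The same effect can also be achieved by never creating the sole child in the first place, which is what the helper proposed later in this thread does.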

Collaborator


+1 to @klihub, nodes that have only one child do not bring much value to the algorithm.
The only exception might be a scenario where we have a very distinct group of cores, for example on hybrid CPUs a group of E or LPE cores grouped by cache cluster.

@askervin
Collaborator

askervin commented Mar 9, 2026

L3 cache nodes are not always present in the topology tree, for instance when L3 does not split NUMA nodes. This is the case in our topology-aware/n4c16 tests. If a pod is scheduled on such a node and it is annotated with unlimited-burstable...: l3cache, do you have good ideas on how the CPU set should be scoped with such annotations? Should we push it to the numa or die level? @kad, recommendations?

@klihub
Collaborator

klihub commented Mar 9, 2026

> L3 cache nodes are not always present in the topology tree, for instance when L3 does not split NUMA nodes. This is the case in our topology-aware/n4c16 tests. If a pod is scheduled on such a node and it is annotated with unlimited-burstable...: l3cache, do you have good ideas on how the CPU set should be scoped with such annotations? Should we push it to the numa or die level? @kad, recommendations?

I think the same problem is already present; the most typical candidate is burstability limited to a (non-existent) die (topology level/node). And we don't try to do anything very intelligent in that case. We simply always prefer an exactly matched topology level over unmatched ones. But at the moment we do not, for instance, consider for a target topology T a pool with level L1 > T better than another pool with level L2 > T, when L1 < L2 (where for the sake of discussion > means higher in the tree, so lower in terms of L.Value()). IOW, if the target cannot be satisfied, we do not consider a tighter fit better than a looser one, instead considering them equally good/bad and letting other comparison criteria take precedence.

And I need to check, but I have a vague recollection that I also very recently found some bug related to limited burstability when there are target nodes in the pool tree but the container does not fit into any of them... I need to check that one.
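For illustration only, the "tighter fit" comparison described above, which the policy does NOT currently do, could look roughly like this. The level values follow the PR's ordering; the function name is a hypothetical stand-in:

```go
package main

import "fmt"

// level mirrors the CPUTopologyLevel ordering, where lower values sit
// higher in the tree (this PR adds l3cache=5 between numa=4 and
// l2cache=6).
type level int

const (
	system  level = iota + 1 // 1
	socket                   // 2
	die                      // 3
	numa                     // 4
	l3cache                  // 5
	l2cache                  // 6
)

// tighterFit sketches preferring, among two pools both above an
// unsatisfiable target level, the one closer to the target, instead
// of treating all unmatched levels as equally good.
func tighterFit(target, a, b level) level {
	if target-a <= target-b {
		return a
	}
	return b
}

func main() {
	// Target l3cache pools do not exist; candidates are numa and die.
	fmt.Println(tighterFit(l3cache, numa, die) == numa) // true
}
```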


Copilot AI left a comment


Pull request overview

This PR adds L3 cache as a new topology level to the NRI topology-aware resource policy. This is motivated by AMD EPYC CPUs where a single die can contain multiple L3 caches (Core Complexes / CCXs), enabling finer-grained resource affinity than the existing die-level support.

Changes:

  • Adds CPUTopologyLevelL3Cache as a new topology level with value 5 (between NUMA=4 and L2Cache=6), updating the CRD validation and config to allow "l3cache" as a valid unlimitedBurstable option.
  • Extends pkg/sysfs to discover L3 cache groupings per package (L3CacheIDs(), L3CacheCPUSet()), and adds buildL3CachePool to the topology pool builder that creates L3 cache pools under NUMA nodes, dies, or sockets depending on the detected hierarchy.
  • Adds an l3cachenode implementation in node.go with HintScore, GetPhysicalNodeIDs, and GetMemset methods, along with a comprehensive e2e test suite using cache overrides to simulate AMD CCX topology.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.

Show a summary per file

| File | Description |
| --- | --- |
| pkg/sysfs/system.go | Adds the l3CacheCPUs field and discovery logic to cpuPackage, and exposes L3CacheIDs()/L3CacheCPUSet() on the CPUPackage interface |
| pkg/apis/config/v1alpha1/resmgr/policy/config.go | Adds the CPUTopologyLevelL3Cache constant with value 5 and updates the level ordering |
| pkg/apis/config/v1alpha1/resmgr/policy/topologyaware/config.go | Re-exports the new level constant and adds l3cache to the kubebuilder validation enum |
| cmd/plugins/topology-aware/policy/node.go | Adds the L3CacheNode kind, the l3cachenode struct, and its Node interface implementations |
| cmd/plugins/topology-aware/policy/pools.go | Adds buildL3CachePool and wires L3 cache pool creation into the topology builder |
| cmd/plugins/topology-aware/policy/mocks_test.go | Adds stub implementations of the two new CPUPackage interface methods |
| config/crd/bases/config.nri_topologyawarepolicies.yaml | Adds l3cache to the CRD unlimitedBurstable enum |
| deployment/helm/topology-aware/crds/config.nri_topologyawarepolicies.yaml | Same CRD update for the Helm chart |
| docs/resource-policy/policy/topology-aware.md | Documents the new l3cache topology level and updates the pool hierarchy description |
| test/e2e/policies.test-suite/topology-aware/n4c128/topology.var.json | New test topology for the n4c128 test variant |
| test/e2e/policies.test-suite/topology-aware/n4c128/test19-cacheclusters/ | New e2e test for L3 cache pool placement using OVERRIDE_SYS_CACHES |


func (p *policy) NewL3CacheNode(id idset.ID, cpus cpuset.CPUSet, parent Node) *l3cachenode {
n := &l3cachenode{}
n.self.node = n
n.init(p, fmt.Sprintf("L3 cache #%v", id), L3CacheNode, parent)

Copilot AI Mar 10, 2026


The NewL3CacheNode function uses fmt.Sprintf("L3 cache #%v", id) to generate the pool name, where id is the sysfs cache ID. On real hardware, cache IDs may not be globally unique across different sockets — they can repeat per-socket (e.g., socket 0 and socket 1 may both have an L3 cache with id=0). This means that on a multi-socket system where L3 cache IDs are not globally unique, multiple L3 cache pool nodes would map to the same key in p.nodes, with later entries silently overwriting earlier ones. When UpdateResources is called and tries to look up the pool by the previously assigned node's name, it may select the wrong pool.

Compare with NewDieNode, which uses fmt.Sprintf("die #%v/%v", pkg.id, id) to include the parent socket ID for uniqueness. The L3 cache node name should similarly include information that makes it globally unique, such as a combination of the parent node's name and the cache ID (e.g., fmt.Sprintf("%s/L3 cache #%v", parent.Name(), id)).

Collaborator


@wongchar I wonder if it is really true that cache IDs are not unique across sockets. For die IDs I know it is true, typically on our HW, but AFAICT not on yours. Anyway, if it is true for cache IDs, then we should name them with a disambiguating prefix as suggested. If it is not, then there is no need for it.

Contributor Author


Cache IDs are unique, but I agree with this feedback. More detail to make it obvious how to map back to the parent ID is helpful. I meant to add this but it slipped my mind. I will update soon.

for _, l3CacheID := range l3CacheIDs {
l3CacheCPUs := p.sys.Package(socketID).L3CacheCPUSet(l3CacheID)
// Only create L3 pool if its CPUs are within this die
if !l3CacheCPUs.Intersection(cpus).IsEmpty() && l3CacheCPUs.Intersection(cpus).Equals(l3CacheCPUs) {

Copilot AI Mar 10, 2026


The condition !l3CacheCPUs.Intersection(cpus).IsEmpty() && l3CacheCPUs.Intersection(cpus).Equals(l3CacheCPUs) computes l3CacheCPUs.Intersection(cpus) twice. This same pattern appears in both buildDiePool and buildNumaNodePool. The result should be stored in a local variable and reused to avoid the duplicate computation.

Contributor Author


I leveraged this in getL3CacheIDsForCPUs() of cmd/plugins/topology-aware/policy/pools.go

I will simplify !l3CacheCPUs.Intersection(cpus).IsEmpty() && l3CacheCPUs.Intersection(cpus).Equals(l3CacheCPUs)
to just
cpus.Intersection(l3CacheCPUs).Equals(l3CacheCPUs)
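The simplification works because intersecting with the scope and comparing against the L3 set is exactly a subset test, which also implies the intersection is non-empty whenever the L3 set is non-empty, making the IsEmpty() guard redundant. A self-contained sketch with a toy set type (the real code uses the cpuset package):

```go
package main

import "fmt"

// cpuSet is a tiny stand-in for the cpuset.CPUSet type used in the PR.
type cpuSet map[int]struct{}

// intersection returns the CPUs present in both sets.
func (s cpuSet) intersection(o cpuSet) cpuSet {
	r := cpuSet{}
	for c := range s {
		if _, ok := o[c]; ok {
			r[c] = struct{}{}
		}
	}
	return r
}

// equals reports whether two sets contain exactly the same CPUs.
func (s cpuSet) equals(o cpuSet) bool {
	if len(s) != len(o) {
		return false
	}
	for c := range s {
		if _, ok := o[c]; !ok {
			return false
		}
	}
	return true
}

func main() {
	dieCPUs := cpuSet{0: {}, 1: {}, 2: {}, 3: {}}
	l3CPUs := cpuSet{0: {}, 1: {}}

	// scope ∩ l3 == l3  ⇔  l3 ⊆ scope
	fmt.Println(dieCPUs.intersection(l3CPUs).equals(l3CPUs)) // true
}
```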

for _, l3CacheID := range l3CacheIDs {
l3CacheCPUs := p.sys.Package(socketID).L3CacheCPUSet(l3CacheID)
// Only create L3 pool if its CPUs are within this NUMA node
if !l3CacheCPUs.Intersection(cpus).IsEmpty() && l3CacheCPUs.Intersection(cpus).Equals(l3CacheCPUs) {

Copilot AI Mar 10, 2026


Same double-computation of l3CacheCPUs.Intersection(cpus) as in buildDiePool. The intersection should be stored in a local variable to avoid the redundant computation.

@wongchar
Contributor Author

I see. Just as another proposal, in the recent commit: thoughts on a helper function to determine whether there are multiple L3 caches within the current scope (die/NUMA node)? The helper function takes the determined CPUs as input to narrow down to the right scope. This would prevent creating nodes with only one child.

In terms of setting l3cache as unlimitedBurstable when no l3cache pools are created, I understand it to default to the deepest pool available. To prevent a silent fallback, should this be a configuration error, or require a loud warning at startup that the configuration was not honored?
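The proposed helper could be sketched roughly as follows. The type and function names here are hypothetical stand-ins for illustration, not the actual PR code:

```go
package main

import "fmt"

// cpuSet is a toy stand-in for cpuset.CPUSet.
type cpuSet map[int]struct{}

// contains reports whether every CPU in sub is also in s.
func (s cpuSet) contains(sub cpuSet) bool {
	for c := range sub {
		if _, ok := s[c]; !ok {
			return false
		}
	}
	return true
}

// countL3GroupsWithin counts how many per-L3-cache CPU groups fall
// entirely inside the CPUs of the current scope (die or NUMA node).
// An L3 pool level would only be built when this returns more than
// one, avoiding sole-child nodes in the topology tree.
func countL3GroupsWithin(scope cpuSet, groups []cpuSet) int {
	n := 0
	for _, g := range groups {
		if len(g) > 0 && scope.contains(g) {
			n++
		}
	}
	return n
}

func main() {
	scope := cpuSet{0: {}, 1: {}, 2: {}, 3: {}}
	groups := []cpuSet{
		{0: {}, 1: {}},
		{2: {}, 3: {}},
		{4: {}, 5: {}}, // belongs to another die
	}
	fmt.Println(countL3GroupsWithin(scope, groups)) // 2
}
```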

@klihub
Collaborator

klihub commented Mar 11, 2026

> I see. Just as another proposal, in the recent commit: thoughts on a helper function to determine whether there are multiple L3 caches within the current scope (die/NUMA node)? The helper function takes the determined CPUs as input to narrow down to the right scope. This would prevent creating nodes with only one child.

I think that's reasonable.

> In terms of setting l3cache as unlimitedBurstable when no l3cache pools are created, I understand it to default to the deepest pool available. To prevent a silent fallback, should this be a configuration error, or require a loud warning at startup that the configuration was not honored?

A warning sounds like the better choice to me here.

@klihub klihub requested a review from askervin March 11, 2026 13:58
@wongchar
Contributor Author

wongchar commented Mar 11, 2026

> A warning sounds like the better choice to me here.

Agreed, I will double check that it's called out in the logs at a minimum.

@wongchar
Contributor Author

Never mind, I see you already handle the warning in findExistingTopologyLevel. Nothing to do.

Added one more commit to address the items highlighted by Copilot. Let me know if I need to squash any commits for simplicity.

Thanks!

@klihub
Collaborator

klihub commented Mar 13, 2026

> Added one more commit to address the items highlighted by Copilot. Let me know if I need to squash any commits for simplicity.
>
> Thanks!

Yes, I think it would be good to squash the last two fix-commits into the original add l3cache topology commit (and preferably also update the summary to be of the form topology-aware: ..., like you have in the commits to be squashed).

Otherwise this now LGTM.

@klihub klihub changed the title add l3Cache node level to topology topology-aware: add l3Cache topology/pool nodes. Mar 13, 2026
  - Extend sysfs package to discover L3 cache topology from the system
  - Add L3 cache pool node type that groups CPUs sharing the same L3 cache
  - Only create L3 cache pools when multiple L3 caches exist within the
    current scope to avoid single-child parent nodes

Signed-off-by: Charles Wong <charles.wong2@amd.com>
klihub and others added 3 commits March 13, 2026 15:16
Add a topology aware test case to verify that burstability can be
limited to an L3 cache cluster in a new n4c128 topology setup with
8 physical cores in an L3 cluster.

Signed-off-by: Krisztian Litkey <krisztian.litkey@intel.com>
Signed-off-by: Charles Wong <charles.wong2@amd.com>
Signed-off-by: Charles Wong <charles.wong2@amd.com>
@wongchar wongchar force-pushed the l3cache-unlimitedBurstable branch from b44cea1 to a25233f Compare March 13, 2026 15:19
@wongchar
Contributor Author

> Yes, I think it would be good to squash the last two fix-commits into the original add l3cache topology commit (and preferably also update the summary to be of the form topology-aware: ..., like you have in the commits to be squashed).

done!

Collaborator

@klihub klihub left a comment


LGTM.

@klihub klihub merged commit 70af369 into containers:main Mar 13, 2026
10 checks passed