topology-aware: add l3Cache topology/pool nodes.#635

Merged
klihub merged 4 commits into containers:main from wongchar:l3cache-unlimitedBurstable
Mar 13, 2026

Conversation

@wongchar
Contributor

@wongchar wongchar commented Feb 26, 2026

Hello!

I would like to propose L3 cache restriction/affinity in the NRI Topology Aware resource policy.
The existing die node level supports resource optimization for AMD EPYC SKUs where the die contains a single L3 cache.

However, some AMD EPYC SKUs contain up to two L3 caches on a single die.
AMD refers to these L3 cache core groupings as a Core Complex (CCX), with up to two CCXs on a Core Complex Die (CCD).

The other motivation is to extend the unlimitedBurstable feature to the L3 Cache level as well.

Please let me know your thoughts

@wongchar wongchar force-pushed the l3cache-unlimitedBurstable branch from e931356 to e92e4d9 Compare February 26, 2026 23:41
@klihub
Collaborator

klihub commented Feb 27, 2026


@wongchar Thank you! We are definitely interested in getting this merged. But I think it would be good to add e2e test cases for verification. Would you be able to provide a sample sysfs dump, or the relevant subset of it, for such HW? It would help us check how easily we could add such tests.

We probably cannot emulate this directly with qemu, but this is not the only such feature, and we already have some environment-based overrides in the sysfs/detection code specifically to be able to fake and test things which can't be emulated properly.

@wongchar
Contributor Author

amd-128c-8l3-sysfs.tar.gz

Agreed, attached is the sysfs dump of an EPYC processor (8534P) with two L3 caches per die (single NUMA node, single socket, SMT enabled).

Happy to help add e2e tests based on what is most cohesive with the existing codebase :)

@klihub
Collaborator

klihub commented Mar 3, 2026

@wongchar Here is a minimal e2e test case, added to your tree. The test case verifies confining unlimited burstability to an L3 cache cluster, which IIUC is the primary motivation for this PR.

I think for getting this merged, we still need to

  • add more e2e test cases, if you have other important use cases in mind
  • update the documentation, at least to mention the newly introduced topology level

With the documentation update we should first wait for #628 to get merged (it's really intrusive), then rebase this PR on main/HEAD, then update the docs.

@klihub klihub force-pushed the l3cache-unlimitedBurstable branch from e92e4d9 to b827509 Compare March 3, 2026 17:16
@klihub
Collaborator

klihub commented Mar 3, 2026


@wongchar Oh, and it looks like your sign-off is incorrect. AFAICT the problem is that you used your personal e-mail address as the commit author, but then signed it off with a different e-mail address, apparently your work one.

@wongchar wongchar force-pushed the l3cache-unlimitedBurstable branch 2 times, most recently from a3921bc to ae6eb05 Compare March 5, 2026 00:01
@wongchar
Contributor Author

wongchar commented Mar 5, 2026

Added some more test cases. Will address docs next (on my to-do)

@wongchar wongchar force-pushed the l3cache-unlimitedBurstable branch 4 times, most recently from 63050e8 to 961e17d Compare March 5, 2026 23:30
@wongchar
Contributor Author

wongchar commented Mar 5, 2026

Added a few more test cases and squashed to the previous commit.

Updated the documentation in a separate commit.

Appreciate your review and feedback, thanks!

@klihub klihub requested review from askervin and klihub March 6, 2026 16:07
} else {
// Single NUMA node per die (or no NUMA subdivision).
// Check for L3 cache groups within this die.
if l3CacheIDs := p.sys.Package(socketID).L3CacheIDs(); len(l3CacheIDs) > 1 {
Collaborator


Until now, the topology tree has been constructed so that there have been no nodes with only one child.

Currently that would be subject to change. For instance, two dies per socket, both dies having their own L3 cache, and no sub-NUMA nodes within the dies, results in a tree where each die node has one child: its own L3 cache node. The same happens if there are multiple NUMA nodes that have their own L3 caches.

@klihub, what do you think, should we try to stick with a minimum-depth tree where the resources of a child node are always a proper subset of the resources of its parent?

If so, then buildL3CachePool() should be called only if there is more than one L3 pool in a die/node scope, instead of more than one L3 cache in a package.

Collaborator

@klihub klihub Mar 9, 2026


Yes, I think it's good to adhere to that principle. We should not have non-branching subtrees, because they do not bring in any new topological information.

I guess the easiest fix here is to do a final filtering at the end of each iteration in buildNumaNodePool() and, if we end up with a single sole child, just remove it.
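The final filtering could be sketched as a small tree pass. This is a stand-alone illustration with made-up types, not the policy's actual node implementation:

```go
package main

import "fmt"

// node is a minimal stand-in for a pool node in the topology tree; the
// real type in cmd/plugins/topology-aware/policy is more involved.
type node struct {
	name     string
	children []*node
}

// pruneSoleChildren splices out any sole child, hoisting its children
// up a level, so non-branching subtrees (which carry no additional
// topological information) disappear.
func pruneSoleChildren(n *node) {
	for len(n.children) == 1 {
		n.children = n.children[0].children
	}
	for _, c := range n.children {
		pruneSoleChildren(c)
	}
}

func main() {
	// A die containing a single L3 cache node would otherwise
	// produce a non-branching subtree.
	die := &node{name: "die #0", children: []*node{
		{name: "L3 cache #0", children: []*node{
			{name: "cpu group A"}, {name: "cpu group B"},
		}},
	}}
	pruneSoleChildren(die)
	fmt.Println(len(die.children)) // the sole L3 node is spliced out
}
```

The same effect can also be achieved by never creating the sole child in the first place, which is what the helper proposed later in this thread does.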

Collaborator


+1 to @klihub, nodes that have only one child do not bring much value to the algorithm.
The only exception might be a scenario where we have a very distinct group of cores, for example on hybrid CPUs a group of E or LPE cores grouped by cache cluster.

@askervin
Collaborator

askervin commented Mar 9, 2026

L3 cache nodes are not always present in the topology tree, for instance when L3 does not split NUMA nodes. This is the case in our topology-aware/n4c16 tests. If a pod is scheduled on such a node and it is annotated with unlimited-burstable...: l3cache, do you have good ideas on how the CPU set should be scoped with such annotations? Should we push it to the numa or die level? @kad, recommendations?

@klihub
Collaborator

klihub commented Mar 9, 2026

> L3 cache nodes are not always present in the topology tree, for instance when L3 does not split NUMA nodes. This is the case in our topology-aware/n4c16 tests. If a pod is scheduled on such a node and it is annotated with unlimited-burstable...: l3cache, do you have good ideas on how the CPU set should be scoped with such annotations? Should we push it to the numa or die level? @kad, recommendations?

I think the same problem is already present; the most typical candidate is burstability limited to a (non-existent) die (topology level/node). And we don't try to do anything very intelligent in that case. We simply always prefer an exactly matched topology level over unmatched ones. But at the moment we do not, for instance, consider for a target topology T a pool with level L1 > T better than another pool with level L2 > T, when L1 < L2 (where for the sake of discussion > means higher in the tree, so lower in terms of L.Value()). IOW, if the target cannot be satisfied, we do not consider a tighter fit better than a looser one, instead considering them equally good/bad and letting other comparison criteria take precedence.

And I need to check, but I have a vague recollection that I also very recently found some bug related to limited burstability when there are target nodes in the pool tree but the container does not fit into any of them... I need to check that one.
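For illustration only, the "tighter fit" comparison described above, which the policy does NOT currently do, could look roughly like this. The level values follow the PR's ordering; the function name is a hypothetical stand-in:

```go
package main

import "fmt"

// level mirrors the CPUTopologyLevel ordering, where lower values sit
// higher in the tree (this PR adds l3cache=5 between numa=4 and
// l2cache=6).
type level int

const (
	system  level = iota + 1 // 1
	socket                   // 2
	die                      // 3
	numa                     // 4
	l3cache                  // 5
	l2cache                  // 6
)

// tighterFit sketches preferring, among two pools both above an
// unsatisfiable target level, the one closer to the target, instead
// of treating all unmatched levels as equally good.
func tighterFit(target, a, b level) level {
	if target-a <= target-b {
		return a
	}
	return b
}

func main() {
	// Target l3cache pools do not exist; candidates are numa and die.
	fmt.Println(tighterFit(l3cache, numa, die) == numa) // true
}
```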


Copilot AI left a comment


Pull request overview

This PR adds L3 cache as a new topology level to the NRI topology-aware resource policy. This is motivated by AMD EPYC CPUs where a single die can contain multiple L3 caches (Core Complexes / CCXs), enabling finer-grained resource affinity than the existing die-level support.

Changes:

  • Adds CPUTopologyLevelL3Cache as a new topology level with value 5 (between NUMA=4 and L2Cache=6), updating the CRD validation and config to allow "l3cache" as a valid unlimitedBurstable option.
  • Extends pkg/sysfs to discover L3 cache groupings per package (L3CacheIDs(), L3CacheCPUSet()), and adds buildL3CachePool to the topology pool builder that creates L3 cache pools under NUMA nodes, dies, or sockets depending on the detected hierarchy.
  • Adds an l3cachenode implementation in node.go with HintScore, GetPhysicalNodeIDs, and GetMemset methods, along with a comprehensive e2e test suite using cache overrides to simulate AMD CCX topology.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.

Show a summary per file

| File | Description |
| --- | --- |
| pkg/sysfs/system.go | Adds the l3CacheCPUs field and discovery logic to cpuPackage, and exposes L3CacheIDs()/L3CacheCPUSet() on the CPUPackage interface |
| pkg/apis/config/v1alpha1/resmgr/policy/config.go | Adds the CPUTopologyLevelL3Cache constant with value 5 and updates the level ordering |
| pkg/apis/config/v1alpha1/resmgr/policy/topologyaware/config.go | Re-exports the new level constant and adds l3cache to the kubebuilder validation enum |
| cmd/plugins/topology-aware/policy/node.go | Adds the L3CacheNode kind, the l3cachenode struct, and its Node interface implementations |
| cmd/plugins/topology-aware/policy/pools.go | Adds buildL3CachePool and wires L3 cache pool creation into the topology builder |
| cmd/plugins/topology-aware/policy/mocks_test.go | Adds stub implementations of the two new CPUPackage interface methods |
| config/crd/bases/config.nri_topologyawarepolicies.yaml | Adds l3cache to the CRD unlimitedBurstable enum |
| deployment/helm/topology-aware/crds/config.nri_topologyawarepolicies.yaml | Same CRD update for the Helm chart |
| docs/resource-policy/policy/topology-aware.md | Documents the new l3cache topology level and updates the pool hierarchy description |
| test/e2e/policies.test-suite/topology-aware/n4c128/topology.var.json | New test topology for the n4c128 test variant |
| test/e2e/policies.test-suite/topology-aware/n4c128/test19-cacheclusters/ | New e2e test for L3 cache pool placement using OVERRIDE_SYS_CACHES |


func (p *policy) NewL3CacheNode(id idset.ID, cpus cpuset.CPUSet, parent Node) *l3cachenode {
n := &l3cachenode{}
n.self.node = n
n.init(p, fmt.Sprintf("L3 cache #%v", id), L3CacheNode, parent)

Copilot AI Mar 10, 2026


The NewL3CacheNode function uses fmt.Sprintf("L3 cache #%v", id) to generate the pool name, where id is the sysfs cache ID. On real hardware, cache IDs may not be globally unique across different sockets — they can repeat per-socket (e.g., socket 0 and socket 1 may both have an L3 cache with id=0). This means that on a multi-socket system where L3 cache IDs are not globally unique, multiple L3 cache pool nodes would map to the same key in p.nodes, with later entries silently overwriting earlier ones. When UpdateResources is called and tries to look up the pool by the previously assigned node's name, it may select the wrong pool.

Compare with NewDieNode, which uses fmt.Sprintf("die #%v/%v", pkg.id, id) to include the parent socket ID for uniqueness. The L3 cache node name should similarly include information that makes it globally unique, such as a combination of the parent node's name and the cache ID (e.g., fmt.Sprintf("%s/L3 cache #%v", parent.Name(), id)).

Collaborator


@wongchar I wonder if it is really true that cache IDs are not unique across sockets. For die IDs I know it is true, typically on our HW, but AFAICT not on yours. Anyway, if it is true for cache IDs, then we should name them with a disambiguating prefix as suggested. If it is not, then there is no need for it.

Contributor Author


Cache IDs are unique, but I agree with this feedback. More detail to make it obvious how to map back to the parent ID is helpful. I meant to add this but it slipped my mind. I will update soon.

for _, l3CacheID := range l3CacheIDs {
l3CacheCPUs := p.sys.Package(socketID).L3CacheCPUSet(l3CacheID)
// Only create L3 pool if its CPUs are within this die
if !l3CacheCPUs.Intersection(cpus).IsEmpty() && l3CacheCPUs.Intersection(cpus).Equals(l3CacheCPUs) {

Copilot AI Mar 10, 2026


The condition !l3CacheCPUs.Intersection(cpus).IsEmpty() && l3CacheCPUs.Intersection(cpus).Equals(l3CacheCPUs) computes l3CacheCPUs.Intersection(cpus) twice. This same pattern appears in both buildDiePool and buildNumaNodePool. The result should be stored in a local variable and reused to avoid the duplicate computation.

Contributor Author


I leveraged this in getL3CacheIDsForCPUs() of cmd/plugins/topology-aware/policy/pools.go

I will simplify !l3CacheCPUs.Intersection(cpus).IsEmpty() && l3CacheCPUs.Intersection(cpus).Equals(l3CacheCPUs)
to just
cpus.Intersection(l3CacheCPUs).Equals(l3CacheCPUs)
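The simplification works because intersecting with the scope and comparing against the L3 set is exactly a subset test, which also implies the intersection is non-empty whenever the L3 set is non-empty, making the IsEmpty() guard redundant. A self-contained sketch with a toy set type (the real code uses the cpuset package):

```go
package main

import "fmt"

// cpuSet is a tiny stand-in for the cpuset.CPUSet type used in the PR.
type cpuSet map[int]struct{}

// intersection returns the CPUs present in both sets.
func (s cpuSet) intersection(o cpuSet) cpuSet {
	r := cpuSet{}
	for c := range s {
		if _, ok := o[c]; ok {
			r[c] = struct{}{}
		}
	}
	return r
}

// equals reports whether two sets contain exactly the same CPUs.
func (s cpuSet) equals(o cpuSet) bool {
	if len(s) != len(o) {
		return false
	}
	for c := range s {
		if _, ok := o[c]; !ok {
			return false
		}
	}
	return true
}

func main() {
	dieCPUs := cpuSet{0: {}, 1: {}, 2: {}, 3: {}}
	l3CPUs := cpuSet{0: {}, 1: {}}

	// scope ∩ l3 == l3  ⇔  l3 ⊆ scope
	fmt.Println(dieCPUs.intersection(l3CPUs).equals(l3CPUs)) // true
}
```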

for _, l3CacheID := range l3CacheIDs {
l3CacheCPUs := p.sys.Package(socketID).L3CacheCPUSet(l3CacheID)
// Only create L3 pool if its CPUs are within this NUMA node
if !l3CacheCPUs.Intersection(cpus).IsEmpty() && l3CacheCPUs.Intersection(cpus).Equals(l3CacheCPUs) {

Copilot AI Mar 10, 2026


Same double-computation of l3CacheCPUs.Intersection(cpus) as in buildDiePool. The intersection should be stored in a local variable to avoid the redundant computation.

@wongchar
Contributor Author

I see. Just as another proposal, in the recent commit: thoughts on a helper function to determine whether there are multiple L3 caches within the current scope (die/NUMA node)? The helper function takes the determined CPUs as input to narrow down to the right scope. This would prevent creating nodes with only one child.

In terms of setting l3cache as unlimitedBurstable when no l3cache pools are created, I understand it to default to the deepest pool available. To prevent a silent fallback, should this be a configuration error, or require a loud warning at startup that the configuration was not honored?
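The proposed helper could be sketched roughly as follows. The type and function names here are hypothetical stand-ins for illustration, not the actual PR code:

```go
package main

import "fmt"

// cpuSet is a toy stand-in for cpuset.CPUSet.
type cpuSet map[int]struct{}

// contains reports whether every CPU in sub is also in s.
func (s cpuSet) contains(sub cpuSet) bool {
	for c := range sub {
		if _, ok := s[c]; !ok {
			return false
		}
	}
	return true
}

// countL3GroupsWithin counts how many per-L3-cache CPU groups fall
// entirely inside the CPUs of the current scope (die or NUMA node).
// An L3 pool level would only be built when this returns more than
// one, avoiding sole-child nodes in the topology tree.
func countL3GroupsWithin(scope cpuSet, groups []cpuSet) int {
	n := 0
	for _, g := range groups {
		if len(g) > 0 && scope.contains(g) {
			n++
		}
	}
	return n
}

func main() {
	scope := cpuSet{0: {}, 1: {}, 2: {}, 3: {}}
	groups := []cpuSet{
		{0: {}, 1: {}},
		{2: {}, 3: {}},
		{4: {}, 5: {}}, // belongs to another die
	}
	fmt.Println(countL3GroupsWithin(scope, groups)) // 2
}
```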

@klihub
Collaborator

klihub commented Mar 11, 2026

> I see. Just as another proposal, in the recent commit: thoughts on a helper function to determine whether there are multiple L3 caches within the current scope (die/NUMA node)? The helper function takes the determined CPUs as input to narrow down to the right scope. This would prevent creating nodes with only one child.

I think that's reasonable.

> In terms of setting l3cache as unlimitedBurstable when no l3cache pools are created, I understand it to default to the deepest pool available. To prevent a silent fallback, should this be a configuration error, or require a loud warning at startup that the configuration was not honored?

A warning sounds like the better choice to me here.

@klihub klihub requested a review from askervin March 11, 2026 13:58
@wongchar
Contributor Author

wongchar commented Mar 11, 2026

> A warning sounds like the better choice to me here.

Agreed, I will double check that it's called out in the logs at a minimum.

@wongchar
Contributor Author

Never mind, I see you already handle the warning in findExistingTopologyLevel. Nothing to do.

Added one more commit to address the items highlighted by Copilot. Let me know if I need to squash any commits for simplicity.

Thanks!

@klihub
Collaborator

klihub commented Mar 13, 2026

> Added one more commit to address the items highlighted by Copilot. Let me know if I need to squash any commits for simplicity.
>
> Thanks!

Yes, I think it would be good to squash the last two fix-commits into the original add l3cache topology commit (and preferably also update the summary to be of the form topology-aware: ..., like you have in the commits to be squashed).

Otherwise this now LGTM.

@klihub klihub changed the title add l3Cache node level to topology topology-aware: add l3Cache topology/pool nodes. Mar 13, 2026
  - Extend sysfs package to discover L3 cache topology from the system
  - Add L3 cache pool node type that groups CPUs sharing the same L3 cache
  - Only create L3 cache pools when multiple L3 caches exist within the
    current scope to avoid single-child parent nodes

Signed-off-by: Charles Wong <charles.wong2@amd.com>
klihub and others added 3 commits March 13, 2026 15:16
Add a topology aware test case to verify that burstability can be
limited to an L3 cache cluster in a new n4c128 topology setup with
8 physical cores in an L3 cluster.

Signed-off-by: Krisztian Litkey <krisztian.litkey@intel.com>
Signed-off-by: Charles Wong <charles.wong2@amd.com>
Signed-off-by: Charles Wong <charles.wong2@amd.com>
@wongchar wongchar force-pushed the l3cache-unlimitedBurstable branch from b44cea1 to a25233f Compare March 13, 2026 15:19
@wongchar
Contributor Author

> Yes, I think it would be good to squash the last two fix-commits into the original add l3cache topology commit (and preferably also update the summary to be of the form topology-aware: ..., like you have in the commits to be squashed).

done!

Collaborator

@klihub klihub left a comment


LGTM.

@klihub klihub merged commit 70af369 into containers:main Mar 13, 2026
10 checks passed