
NUMA-aware backend/frontend ring, kthread, and IRQ placement for vNUMA guests #182

Merged: tycho merged 1 commit into main from steven/xen-pvh-numa-pinning on May 22, 2026

tycho commented May 20, 2026

Xen PV drivers today run their per-queue kthreads and event-channel IRQs wherever the scheduler happens to place them, typically all clustered on a single CPU regardless of the guest's vNUMA layout. When a per-queue ring lives on a different NUMA node from the worker servicing it, every request pays cross-node interconnect cost to walk the ring and to grant-copy payload pages. Multi-queue parallelism is effectively defeated: backend kthreads pile onto one host node because all of the guest's rings end up on one guest node.

This series makes the full Xen I/O path NUMA-aware, on both the backend and frontend sides, so that on a guest whose vNUMA layout is mapped onto host nodes, each queue's ring, backend kthread, backend IRQ, frontend IRQ, and TX steering all land on the same node. The result is an end-to-end NUMA-local path with no cross-node payload movement up to the hypervisor boundary. The work depends on a Xen-side hypercall (XENMEM_get_mfn_pxms) and companion Edera Protect changes; this PR is the Linux portion.

The foundation is making the host node of a foreign frame visible to the mapping side. xen_alloc_unpopulated_pages() now partitions its ZONE_DEVICE placeholder free list by Linux node id and registers each section against a specific node, so page_to_nid() of a placeholder is meaningful. A new xen_alloc_unpopulated_pages_node() entry point takes an explicit node, and the existing API becomes a wrapper that passes numa_node_id(). On top of that, xenbus_ring_host_node() exposes the host node backing a grant-mapped ring. The node is learned at map time via the XENMEM_get_mfn_pxms hypercall, and placeholders that land on the wrong node are relocated after the map; from then on the query collapses to a cheap vmalloc_to_page() + page_to_nid() lookup.
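As a rough illustration of the per-node free list and the wrapper relationship described above, here is a minimal userspace sketch. It is not the code in this PR: `struct placeholder`, `placeholder_free()`, `placeholder_alloc_node()`, `placeholder_alloc()`, and `current_node` (a stand-in for numa_node_id()) are hypothetical names, and the sketch simply returns NULL on an empty list where a real allocator would presumably populate more placeholders.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES 4

/* Hypothetical stand-in for a ZONE_DEVICE placeholder page. */
struct placeholder {
    int nid;                  /* node the placeholder was registered against */
    struct placeholder *next; /* link in that node's free list */
};

/* One free list per node instead of a single global list. */
static struct placeholder *free_list[MAX_NODES];

/* Stand-in for numa_node_id(): the node of the calling CPU. */
static int current_node;

/* Return a placeholder to the free list of the node it belongs to. */
static void placeholder_free(struct placeholder *p)
{
    assert(p->nid >= 0 && p->nid < MAX_NODES);
    p->next = free_list[p->nid];
    free_list[p->nid] = p;
}

/* Node-explicit entry point: only hands out placeholders from 'nid'. */
static struct placeholder *placeholder_alloc_node(int nid)
{
    struct placeholder *p;

    assert(nid >= 0 && nid < MAX_NODES);
    p = free_list[nid];
    if (p) {
        free_list[nid] = p->next;
        p->next = NULL;
    }
    return p; /* NULL when this node's list is empty */
}

/* Legacy entry point becomes a thin wrapper using the caller's node. */
static struct placeholder *placeholder_alloc(void)
{
    return placeholder_alloc_node(current_node);
}
```

Because every placeholder sits on the list of the node it was registered against, asking a placeholder for its node always gives a meaningful answer, which is the property the rest of the series relies on.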

With that information available, the placement work follows. New _on_node variants of the lateeoi evtchn bind helpers allocate the IRQ descriptor on the caller's node, so /proc/irq/N/node is accurate and irqbalance treats the IRQ as NUMA-local instead of floating. A new xenbus_setup_ring_node() draws ring pages from a requested node's buddy list, paired with a xenbus_node_for_queue() helper that rotates per-queue indices over the set of nodes with online CPUs. The backends (netback, blkback) create their per-queue and per-ring kthreads with kthread_create_on_node() on the ring's host node, pin them to that node's cpumask, and bind and steer the per-queue IRQs to the same node. The frontends (netfront, blkfront) allocate per-queue rings on per-queue nodes and steer their IRQs to match; netfront additionally installs an XPS map so a sender on a CPU in node N selects the queue whose rings live on node N.
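The queue-to-node rotation and the TX steering idea can be modelled in a few lines of userspace C. This is a sketch under assumptions, not the helpers added by this PR: `node_for_queue()` and `queue_for_sender_node()` are hypothetical names, the set of nodes with online CPUs is represented as a plain bitmask, and the steering function picks a single queue where a real XPS map associates sets of CPUs with sets of queues.

```c
#include <assert.h>

#define MAX_NODES 8
#define NO_NODE   (-1)

/*
 * Bit n of cpu_nodes_mask set means node n has at least one online CPU.
 * Queue indices rotate over those nodes only, in ascending node order.
 */
static int node_for_queue(unsigned int cpu_nodes_mask, unsigned int queue)
{
    unsigned int count = 0, n, target;

    for (n = 0; n < MAX_NODES; n++)
        if (cpu_nodes_mask & (1u << n))
            count++;
    if (count == 0)
        return NO_NODE;

    /* Pick the (queue mod count)-th eligible node. */
    target = queue % count;
    for (n = 0; n < MAX_NODES; n++) {
        if (!(cpu_nodes_mask & (1u << n)))
            continue;
        if (target-- == 0)
            return (int)n;
    }
    return NO_NODE;
}

/*
 * Simplified model of the steering goal: a sender running on sender_node
 * gets the first queue whose rings were placed on that node, or -1 when
 * no queue matches and the default queue selection should apply.
 */
static int queue_for_sender_node(unsigned int cpu_nodes_mask,
                                 unsigned int nr_queues, int sender_node)
{
    unsigned int q;

    if (sender_node < 0)
        return -1;
    for (q = 0; q < nr_queues; q++)
        if (node_for_queue(cpu_nodes_mask, q) == sender_node)
            return (int)q;
    return -1;
}
```

With nodes 0 and 2 eligible (mask 0x5) and four queues, the rotation places queues 0 and 2 on node 0 and queues 1 and 3 on node 2, so a sender on node 2 is steered to queue 1.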

The whole feature is gated by CONFIG_XEN_BACKEND_NUMA_AFFINITY (default y) and is designed to fall back cleanly. NUMA-aware placement applies in two cases: a guest with no NUMA topology at all, where there is a single node and nothing to get wrong, or a guest with a proper vNUMA topology, which is necessarily a PVH guest. A PV guest is still fine as long as it is small enough to be placed within a single NUMA node, with the hypervisor-level domU pinning enforcing that it stays there; the frontend side simply skips NUMA affinitization when no topology is available. The same applies on a hypervisor without the XENMEM_get_mfn_pxms hypercall: the first query latches a global "unsupported" flag and every subsequent call returns NUMA_NO_NODE with no further hypercalls, so this is safe to ship ahead of a hypervisor update. Whenever node information is unavailable, every NUMA-aware step is skipped and behaviour is identical to the previous NUMA-oblivious path. Single-node guests degenerate to node 0 everywhere, matching prior behaviour. Operator overrides are preserved throughout: writes to /proc/irq/N/smp_affinity and /sys/class/net/.../xps_cpus still win, and kthread pins can be overridden with taskset.
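The "latch once, then stop asking" fallback can be sketched as follows. This is a userspace model with hypothetical names (`mfn_to_node()`, `pxm_unsupported`, `fake_hypercall_unsupported()`, `ERR_NOSYS`), not the code in this PR; in particular, latching only on a "not implemented" error and leaving other failures unlatched is an assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>

#define NO_NODE   (-1)  /* stand-in for NUMA_NO_NODE */
#define ERR_NOSYS (-38) /* stand-in for -ENOSYS */

/* Hypothetical shape of the node query: returns 0 and fills *node on success. */
typedef int (*query_fn)(unsigned long mfn, int *node);

static bool pxm_unsupported;           /* global latch */
static unsigned int hypercalls_issued; /* counts calls reaching the "hypervisor" */

/* Models a hypervisor that does not implement the query at all. */
static int fake_hypercall_unsupported(unsigned long mfn, int *node)
{
    (void)mfn;
    (void)node;
    hypercalls_issued++;
    return ERR_NOSYS;
}

/* Look up the host node of a frame, giving up for good once the
 * hypervisor reports that the query is not implemented. */
static int mfn_to_node(query_fn query, unsigned long mfn)
{
    int node = NO_NODE;
    int rc;

    if (pxm_unsupported)
        return NO_NODE; /* latched: no further hypercalls */

    rc = query(mfn, &node);
    if (rc == ERR_NOSYS) {
        pxm_unsupported = true; /* first failure latches the flag */
        return NO_NODE;
    }
    return rc == 0 ? node : NO_NODE;
}
```

Callers treat the NO_NODE result as "skip the NUMA-aware step", so on an older hypervisor the cost is a single failed query and placement then follows the previous NUMA-oblivious path.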

tycho force-pushed the steven/xen-pvh-numa-pinning branch from 3e1f1e0 to 47c5d75 on May 20, 2026 16:41
Signed-off-by: Steven Noonan <steven@edera.dev>
tycho force-pushed the steven/xen-pvh-numa-pinning branch from 47c5d75 to e9090d2 on May 22, 2026 19:16
tycho changed the title from "implement NUMA binding for PV I/O drivers on PVH" to "NUMA-aware backend/frontend ring, kthread, and IRQ placement for vNUMA guests" on May 22, 2026
tycho marked this pull request as ready for review on May 22, 2026 19:34
tycho requested review from azenla, bleggett and kaniini as code owners on May 22, 2026 19:34
tycho merged commit e9090d2 into main on May 22, 2026
8 checks passed
tycho deleted the steven/xen-pvh-numa-pinning branch on May 22, 2026 21:46