NUMA-aware shard selection and work stealing by emeryberger · Pull Request #89 · emeryberger/Hoard

emeryberger · 2026-05-28T01:32:22Z

Summary

This PR improves Hoard's performance on NUMA systems by making the sharded global heap NUMA-aware:

1. NUMA-aware shard selection

Instead of just masking the CPU number (`cpu & (NumShards - 1)`), we now use `syscall(SYS_getcpu)` to get both the CPU and the NUMA node, then compute:

`shard = (node * shardsPerNode) + (cpu % shardsPerNode)`

This ensures threads on different NUMA nodes use disjoint shard ranges, while still spreading load within each node's range. With 8 shards and 2 NUMA nodes, shardsPerNode is 4: node 0 uses shards 0-3, node 1 uses shards 4-7.
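The mapping above can be sketched in C++ as follows. This is a minimal illustration, not Hoard's actual code: `selectShard`, `currentShard`, and their parameters are hypothetical names, and the sketch assumes the shard count is a multiple of the node count. Only the `SYS_getcpu` call and the formula come from the description above.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__linux__)
#include <sys/syscall.h>
#include <unistd.h>
#endif

// Hypothetical helper (not Hoard's API): map a (cpu, node) pair to a shard so
// that each NUMA node owns a contiguous range of shardsPerNode shards.
inline std::size_t selectShard(unsigned cpu, unsigned node,
                               std::size_t numShards, std::size_t numNodes) {
  assert(numNodes > 0 && numShards >= numNodes && numShards % numNodes == 0);
  const std::size_t shardsPerNode = numShards / numNodes;
  // Assumption: wrap out-of-range node IDs instead of indexing past the end.
  return (node % numNodes) * shardsPerNode + (cpu % shardsPerNode);
}

// Hypothetical wrapper: ask the kernel for the calling thread's CPU and node.
inline std::size_t currentShard(std::size_t numShards, std::size_t numNodes) {
  unsigned cpu = 0, node = 0;
#if defined(__linux__) && defined(SYS_getcpu)
  // Assumed fallback: if the call fails, treat the thread as CPU 0 on node 0.
  if (syscall(SYS_getcpu, &cpu, &node, nullptr) != 0) {
    cpu = 0;
    node = 0;
  }
#endif
  return selectShard(cpu, node, numShards, numNodes);
}
```

With 8 shards and 2 nodes, `selectShard(191, 1, 8, 2)` yields shard 7, which falls inside node 1's 4-7 range.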

2. NUMA-aware work stealing

Modified get() to search for superblocks in priority order:

1. Local shard first
2. Power-of-two random choices within the same NUMA node
3. Remaining same-node shards
4. Other NUMA nodes (last resort)

This reduces cross-node memory traffic when stealing superblocks from the global heap.
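The search order can be sketched as a function that returns the sequence of shards to probe. This is a hedged illustration only: `probeOrder` and its parameters are hypothetical, the two-choice step is reduced to probing two randomly picked same-node shards (the PR does not say how the two candidates are compared), and the order used for remote shards is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical sketch (not Hoard's get()): order in which shards are probed.
// Assumes numShards is a multiple of shardsPerNode and localShard < numShards.
inline std::vector<std::size_t> probeOrder(std::size_t localShard,
                                           std::size_t numShards,
                                           std::size_t shardsPerNode,
                                           std::mt19937& rng) {
  assert(shardsPerNode > 0 && numShards % shardsPerNode == 0);
  assert(localShard < numShards);
  const std::size_t nodeBegin = (localShard / shardsPerNode) * shardsPerNode;
  const std::size_t nodeEnd = nodeBegin + shardsPerNode;

  std::vector<std::size_t> order;
  order.reserve(numShards);

  // Phase 1: the local shard.
  order.push_back(localShard);

  // Phases 2 and 3: the other shards on the same node. After shuffling, the
  // first two entries act as the two random choices; the rest are scanned in
  // index order.
  std::vector<std::size_t> sameNode;
  for (std::size_t i = nodeBegin; i < nodeEnd; ++i) {
    if (i != localShard) sameNode.push_back(i);
  }
  std::shuffle(sameNode.begin(), sameNode.end(), rng);
  const std::size_t picks = std::min<std::size_t>(2, sameNode.size());
  std::sort(sameNode.begin() + static_cast<std::ptrdiff_t>(picks), sameNode.end());
  order.insert(order.end(), sameNode.begin(), sameNode.end());

  // Phase 4 (last resort): shards on other nodes, starting with the node
  // after the local one and wrapping around (assumed order).
  for (std::size_t k = nodeEnd; k < nodeBegin + numShards; ++k) {
    order.push_back(k % numShards);
  }
  return order;
}
```

A caller would walk the returned order and stop at the first shard that yields a superblock, so remote-node shards are only touched when every same-node shard comes up empty.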

Benchmark Results

NUMA stress test (20 touches per object to isolate NUMA effects)

At 128 threads on a 2-node NUMA system:

| Allocator | Throughput | Hoard speedup over allocator |
|---|---|---|
| Hoard | 19.5M ops/sec | 1.00x |
| mimalloc | 13.6M ops/sec | 1.43x |
| jemalloc | 14.1M ops/sec | 1.38x |
| glibc | 12.1M ops/sec | 1.61x |
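The speedup column is consistent with dividing Hoard's throughput by each allocator's throughput. A small sketch of that arithmetic, with `speedupOver` as a hypothetical helper name:

```cpp
#include <cassert>

// Hypothetical helper: how many times faster Hoard is than another allocator,
// computed as the ratio of the two throughputs (same units for both).
inline double speedupOver(double hoardOpsPerSec, double otherOpsPerSec) {
  assert(otherOpsPerSec > 0.0);
  return hoardOpsPerSec / otherOpsPerSec;
}
```

For example, `speedupOver(19.5, 13.6)` is roughly 1.43 and `speedupOver(19.5, 12.1)` is roughly 1.61, matching the mimalloc and glibc rows.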

Hardware performance counter verification

Used AMD perf counters to measure actual NUMA locality at 64 threads:

| Counter | Description |
|---|---|
| `ls_dmnd_fills_from_sys.mem_io_local` | Memory I/O fulfilled locally |
| `ls_dmnd_fills_from_sys.mem_io_remote` | Memory I/O fulfilled from remote NUMA node |
| `ls_dmnd_fills_from_sys.ext_cache_local` | L3 cache fills from local node |
| `ls_dmnd_fills_from_sys.ext_cache_remote` | L3 cache fills from remote node |

Results: Hoard's NUMA optimization reduces remote memory access from 40.6% to 33.4% (memory I/O) and 45.5% to 37.1% (cache fills), while improving throughput by 9%:

| Allocator | Throughput | Remote Memory I/O | Remote Cache Fills |
|---|---|---|---|
| Hoard (NUMA opt) | 83.3M ops/s | 33.4% | 37.1% |
| Hoard (no NUMA) | 76.4M ops/s | 40.6% | 45.5% |
| mimalloc | 87.0M ops/s | 18.3% | 33.4% |
| jemalloc | 53.6M ops/s | 15.5% | 40.8% |
| glibc | 68.7M ops/s | 51.4% | 47.8% |

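Note: mimalloc and jemalloc have lower remote memory I/O percentages (18.3% and 15.5%) because they use per-thread arenas (less sharing), but this comes at the cost of higher memory usage; jemalloc's remote cache fills (40.8%) are nonetheless higher than Hoard's (37.1%). glibc shows the highest remote access on both counters (51.4% and 47.8%), and its throughput trails both Hoard variants and mimalloc.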
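The PR does not spell out how the remote percentages are derived from the counter pairs. One plausible reading, assumed here, is the remote count divided by the sum of the local and remote counts. `remotePercent` is a hypothetical helper, and the inputs in the usage note are made-up values, not measurements.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: share of fills served by the remote node, in percent.
// Assumption: percentage = remote / (local + remote); the PR does not state
// the exact formula.
inline double remotePercent(std::uint64_t localFills, std::uint64_t remoteFills) {
  const std::uint64_t total = localFills + remoteFills;
  if (total == 0) return 0.0;  // no samples: report 0 rather than divide by zero
  return 100.0 * static_cast<double>(remoteFills) / static_cast<double>(total);
}
```

With illustrative counts of 750 local and 250 remote fills, `remotePercent(750, 250)` gives 25.0. The same function would apply to both the `mem_io` and `ext_cache` counter pairs.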

Standard benchmarks (no regression)

Compared against hoard_opt_work branch:

| Benchmark | Threads | hoard_opt_work | numa-aware-sharding |
|---|---|---|---|
| threadtest | 16 | 14.3s | 11.3s (faster) |
| threadtest | 64 | 2.9s | 2.9s (same) |
| threadtest | 128 | 6.6s | 6.6s (same) |
| threadtest | 256 | 1.8s | 1.8s (same) |

Test plan

NUMA stress test at 16-128 threads with 20 touches per object
Verified NUMA effects with hardware performance counters
Standard benchmarks (threadtest, linux-scalability, Phong)
Verified no regression vs hoard_opt_work branch
Tested on 192-core, 2-node NUMA system (AMD EPYC, node 0: CPUs 0-95, node 1: CPUs 96-191)
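The stress test is described as pinning threads across NUMA nodes. One way such pinning could look on the topology above is sketched here. Both functions are hypothetical and not taken from the benchmark source: the sketch assumes consecutive threads alternate between nodes and that each node owns a contiguous block of CPU IDs.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__linux__)
#include <sched.h>
#endif

// Hypothetical mapping: alternate consecutive threads between NUMA nodes,
// assuming node n owns CPU IDs [n * cpusPerNode, (n + 1) * cpusPerNode).
inline std::size_t cpuForThread(std::size_t thread, std::size_t numNodes,
                                std::size_t cpusPerNode) {
  assert(numNodes > 0 && cpusPerNode > 0);
  const std::size_t node = thread % numNodes;
  const std::size_t slot = (thread / numNodes) % cpusPerNode;
  return node * cpusPerNode + slot;
}

// Pin the calling thread to one CPU. Returns false on failure or when the
// platform is not Linux.
inline bool pinCurrentThread(std::size_t cpu) {
#if defined(__linux__)
  if (cpu >= static_cast<std::size_t>(CPU_SETSIZE)) return false;
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(cpu, &set);
  // A pid of 0 applies the mask to the calling thread.
  return sched_setaffinity(0, sizeof(set), &set) == 0;
#else
  (void)cpu;
  return false;
#endif
}
```

On the 2-node, 96-CPUs-per-node layout listed above, threads 0, 1, 2, 3 would map to CPUs 0, 96, 1, 97, so neighboring threads land on different nodes and objects passed between them cross the node boundary.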

🤖 Generated with Claude Code

Instead of just masking the CPU number, now use syscall(SYS_getcpu) to get both the CPU and NUMA node, then compute:

shard = (node * shardsPerNode) + (cpu % shardsPerNode)

This ensures threads on different NUMA nodes use different shard ranges, while still spreading load within each node's range. With 8 shards and 2 nodes: node 0 uses shards 0-3, node 1 uses 4-7.

Also adds a NUMA stress test benchmark that measures cross-node allocation performance by pinning threads across NUMA nodes and passing objects between them.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Modified get() to search in four phases:

1. Try local shard
2. Power-of-two random choices within SAME NUMA node
3. Scan remaining same-node shards
4. Only then try shards on OTHER NUMA nodes (last resort)

This reduces cross-node memory traffic when stealing superblocks from the global heap, improving performance on NUMA systems.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Includes:

- numa_throughput.png: Throughput comparison across thread counts
- numa_perf_counters.png: Remote memory access percentages from perf
- numa_speedup.png: Hoard speedup over other allocators
- generate_numa_benchmarks.py: Script to regenerate graphs

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

emeryberger and others added 3 commits May 28, 2026 00:21

emeryberger merged commit 5c3fc45 into master May 28, 2026

emeryberger merged 3 commits into master from numa-aware-sharding
