NUMA-aware shard selection and work stealing #89

Merged
emeryberger merged 3 commits into master from numa-aware-sharding on May 28, 2026


@emeryberger emeryberger commented May 28, 2026

Summary

This PR improves Hoard's performance on NUMA systems by making the sharded global heap NUMA-aware:

1. NUMA-aware shard selection

Instead of just masking the CPU number (cpu & (NumShards - 1)), we now use syscall(SYS_getcpu) to get both the CPU and NUMA node, then compute:

shard = (node * shardsPerNode) + (cpu % shardsPerNode)

This ensures threads on different NUMA nodes use different shard ranges. With 8 shards and 2 NUMA nodes, shardsPerNode is 4: node 0 uses shards 0-3, node 1 uses shards 4-7.
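The mapping can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the names shardFor and currentShard are hypothetical, shardsPerNode is assumed to be the shard count divided by the node count, and the fallback to CPU 0 / node 0 on syscall failure is an assumption made only for the sketch. It is Linux-only because it relies on SYS_getcpu.

```cpp
#include <cassert>
#include <sys/syscall.h>
#include <unistd.h>

// Pure mapping from (cpu, node) to a shard index, following the formula above.
// Assumes each node owns a contiguous range of shardsPerNode shards.
inline unsigned shardFor(unsigned cpu, unsigned node, unsigned shardsPerNode) {
  assert(shardsPerNode > 0);
  return node * shardsPerNode + cpu % shardsPerNode;
}

// Hypothetical wrapper: asks the kernel which CPU and NUMA node the calling
// thread is running on, then maps that pair to a shard.
inline unsigned currentShard(unsigned numShards, unsigned numNodes) {
  assert(numNodes > 0 && numShards >= numNodes);
  unsigned cpu = 0, node = 0;
  if (syscall(SYS_getcpu, &cpu, &node, nullptr) != 0) {
    cpu = 0;   // assumption: fall back to CPU 0 / node 0 if the call fails
    node = 0;
  }
  const unsigned shardsPerNode = numShards / numNodes;
  // Wrap the node id in case the kernel reports more nodes than configured.
  return shardFor(cpu, node % numNodes, shardsPerNode);
}
```

On the test machine described in the test plan (node 1 owns CPUs 96-191), shardFor(96, 1, 4) yields shard 4 and shardFor(191, 1, 4) yields shard 7, so every thread on node 1 stays within shards 4-7.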

2. NUMA-aware work stealing

Modified get() to search for superblocks in priority order:

  1. Local shard first
  2. Power-of-two random choices within same NUMA node
  3. Remaining same-node shards
  4. Other NUMA nodes (last resort)

This reduces cross-node memory traffic when stealing superblocks from the global heap.
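To make the priority order concrete, here is a sketch that only computes the order in which shards would be probed. It is not the get() implementation from this PR: probeOrder is a hypothetical name, shards are assumed to be grouped contiguously by node, and step 2 simply probes two random same-node shards, since how get() chooses between its two random candidates is not described here.

```cpp
#include <cassert>
#include <random>
#include <vector>

// Returns the order in which shards would be probed for a thread whose
// local shard is `local`. Each shard appears exactly once.
std::vector<unsigned> probeOrder(unsigned local, unsigned numShards,
                                 unsigned shardsPerNode, std::mt19937& rng) {
  assert(shardsPerNode > 0 && numShards % shardsPerNode == 0);
  assert(local < numShards);
  // First shard belonging to the local NUMA node.
  const unsigned base = (local / shardsPerNode) * shardsPerNode;
  std::vector<unsigned> order;
  std::vector<bool> seen(numShards, false);
  auto visit = [&](unsigned s) {
    if (!seen[s]) {
      seen[s] = true;
      order.push_back(s);
    }
  };

  visit(local);                                  // 1. local shard first
  std::uniform_int_distribution<unsigned> pick(0, shardsPerNode - 1);
  visit(base + pick(rng));                       // 2. two random picks on the
  visit(base + pick(rng));                       //    same NUMA node
  for (unsigned i = 0; i < shardsPerNode; ++i)   // 3. remaining same-node shards
    visit(base + i);
  for (unsigned s = 0; s < numShards; ++s)       // 4. other nodes, last resort
    visit(s);
  return order;
}
```

With 8 shards, 4 per node, and local shard 5, the first four entries are always some ordering of shards 4-7 starting with 5, and shards 0-3 appear only afterwards.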

Benchmark Results

NUMA stress test (20 touches per object to isolate NUMA effects)

[Figure: NUMA Throughput]

[Figure: Hoard Speedup]

At 128 threads on a 2-node NUMA system:

| Allocator | Throughput | Hoard speedup over allocator |
|---|---|---|
| Hoard | 19.5M ops/sec | 1.00x |
| mimalloc | 13.6M ops/sec | 1.43x |
| jemalloc | 14.1M ops/sec | 1.38x |
| glibc | 12.1M ops/sec | 1.61x |

Hardware performance counter verification

Used AMD perf counters to measure actual NUMA locality at 64 threads:

| Counter | Description |
|---|---|
| ls_dmnd_fills_from_sys.mem_io_local | Memory I/O fulfilled locally |
| ls_dmnd_fills_from_sys.mem_io_remote | Memory I/O fulfilled from remote NUMA node |
| ls_dmnd_fills_from_sys.ext_cache_local | L3 cache fills from local node |
| ls_dmnd_fills_from_sys.ext_cache_remote | L3 cache fills from remote node |

Results: Hoard's NUMA optimization reduces remote memory access from 40.6% to 33.4% (memory I/O) and 45.5% to 37.1% (cache fills), while improving throughput by 9%:

[Figure: Performance Counters]

| Allocator | Throughput | Remote Memory I/O | Remote Cache Fills |
|---|---|---|---|
| Hoard (NUMA opt) | 83.3M ops/s | 33.4% | 37.1% |
| Hoard (no NUMA) | 76.4M ops/s | 40.6% | 45.5% |
| mimalloc | 87.0M ops/s | 18.3% | 33.4% |
| jemalloc | 53.6M ops/s | 15.5% | 40.8% |
| glibc | 68.7M ops/s | 51.4% | 47.8% |

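Note: mimalloc and jemalloc have lower remote memory I/O percentages (18.3% and 15.5%) because they use per-thread arenas (less sharing), but this comes at the cost of higher memory usage. jemalloc's remote cache fills (40.8%) are still higher than Hoard's (37.1%). glibc shows the highest remote memory I/O (51.4%), and its throughput (68.7M ops/s) trails both Hoard and mimalloc.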
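For readers reproducing these numbers, the helpers below sketch how such figures could be derived from raw values. Both function names are hypothetical, and the definition of the remote percentage as remote / (local + remote) over a local/remote counter pair is an assumption; the exact formula used for the table is not stated here.

```cpp
#include <cassert>

// Share of fills served by the remote node, given one local/remote counter pair.
// Assumption: remote share = remote / (local + remote).
inline double remoteShare(double localFills, double remoteFills) {
  assert(localFills >= 0.0 && remoteFills >= 0.0);
  assert(localFills + remoteFills > 0.0);
  return remoteFills / (localFills + remoteFills);
}

// Relative throughput change between two runs (0.10 means 10% faster).
inline double relativeGain(double newOps, double oldOps) {
  assert(oldOps > 0.0);
  return newOps / oldOps - 1.0;
}
```

For example, relativeGain(83.3, 76.4) evaluates to roughly 0.09, which matches the 9% throughput improvement reported for the NUMA-optimized build.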

Standard benchmarks (no regression)

Compared against hoard_opt_work branch:

| Benchmark | Threads | hoard_opt_work | numa-aware-sharding |
|---|---|---|---|
| threadtest | 16 | 14.3s | 11.3s (faster) |
| threadtest | 64 | 2.9s | 2.9s (same) |
| threadtest | 128 | 6.6s | 6.6s (same) |
| threadtest | 256 | 1.8s | 1.8s (same) |

Test plan

  • NUMA stress test at 16-128 threads with 20 touches per object
  • Verified NUMA effects with hardware performance counters
  • Standard benchmarks (threadtest, linux-scalability, Phong)
  • Verified no regression vs hoard_opt_work branch
  • Tested on 192-core, 2-node NUMA system (AMD EPYC, node 0: CPUs 0-95, node 1: CPUs 96-191)

🤖 Generated with Claude Code

emeryberger and others added 3 commits May 28, 2026 00:21
Instead of just masking the CPU number, now use syscall(SYS_getcpu)
to get both the CPU and NUMA node, then compute:

  shard = (node * shardsPerNode) + (cpu % shardsPerNode)

This ensures threads on different NUMA nodes use different shard
ranges, while still spreading load within each node's range.
With 8 shards and 2 nodes: node 0 uses shards 0-3, node 1 uses 4-7.

Also adds a NUMA stress test benchmark that measures cross-node
allocation performance by pinning threads across NUMA nodes and
passing objects between them.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Modified get() to search in four phases:
1. Try local shard
2. Power-of-two random choices within SAME NUMA node
3. Scan remaining same-node shards
4. Only then try shards on OTHER NUMA nodes (last resort)

This reduces cross-node memory traffic when stealing superblocks
from the global heap, improving performance on NUMA systems.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Includes:
- numa_throughput.png: Throughput comparison across thread counts
- numa_perf_counters.png: Remote memory access percentages from perf
- numa_speedup.png: Hoard speedup over other allocators
- generate_numa_benchmarks.py: Script to regenerate graphs

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@emeryberger emeryberger merged commit 5c3fc45 into master May 28, 2026