NUMA-aware shard selection and work stealing #89
Merged
Instead of just masking the CPU number, use `syscall(SYS_getcpu)` to get both the CPU and the NUMA node, then compute:

`shard = (node * shardsPerNode) + (cpu % shardsPerNode)`

This ensures threads on different NUMA nodes use different shard ranges, while still spreading load within each node's range. With 8 shards and 2 nodes: node 0 uses shards 0-3, node 1 uses shards 4-7.

Also adds a NUMA stress test benchmark that measures cross-node allocation performance by pinning threads across NUMA nodes and passing objects between them.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Modified `get()` to search in four phases:

1. Try the local shard
2. Power-of-two random choices within the SAME NUMA node
3. Scan the remaining same-node shards
4. Only then try shards on OTHER NUMA nodes (last resort)

This reduces cross-node memory traffic when stealing superblocks from the global heap, improving performance on NUMA systems.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Includes:

- numa_throughput.png: Throughput comparison across thread counts
- numa_perf_counters.png: Remote memory access percentages from perf
- numa_speedup.png: Hoard speedup over other allocators
- generate_numa_benchmarks.py: Script to regenerate graphs

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Summary
This PR improves Hoard's performance on NUMA systems by making the sharded global heap NUMA-aware:
1. NUMA-aware shard selection
Instead of just masking the CPU number (`cpu & (NumShards - 1)`), we now use `syscall(SYS_getcpu)` to get both the CPU and the NUMA node, then compute:

`shard = (node * shardsPerNode) + (cpu % shardsPerNode)`

This ensures threads on different NUMA nodes use different shard ranges, while still spreading load within each node's range. With 8 shards and 2 NUMA nodes: node 0 uses shards 0-3, node 1 uses shards 4-7.
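The shard computation described above can be sketched as follows. This is a minimal illustration, not the code from this PR: the names `shardFor` and `currentShard` are hypothetical, the sketch assumes Linux and a shard count that is a multiple of the node count, and the `node % numNodes` guard and the fallback to CPU 0 / node 0 on syscall failure are assumptions added for safety.

```cpp
#include <cassert>
#include <cstddef>
#include <sys/syscall.h>
#include <unistd.h>

// Hypothetical helper: maps a (cpu, node) pair to a shard index.
// Assumes numShards is a multiple of numNodes, so every node owns a
// contiguous range of shardsPerNode shards.
inline std::size_t shardFor(unsigned cpu, unsigned node,
                            std::size_t numShards, std::size_t numNodes) {
  assert(numNodes > 0 && numShards >= numNodes && numShards % numNodes == 0);
  const std::size_t shardsPerNode = numShards / numNodes;
  // The "% numNodes" guard is an assumption of this sketch: it keeps the
  // result in range if the kernel reports a node id >= numNodes.
  return (node % numNodes) * shardsPerNode + (cpu % shardsPerNode);
}

// Hypothetical wrapper: asks the kernel where the calling thread is running.
// Linux-specific; falls back to CPU 0 / node 0 if the syscall fails.
inline std::size_t currentShard(std::size_t numShards, std::size_t numNodes) {
  unsigned cpu = 0;
  unsigned node = 0;
  if (syscall(SYS_getcpu, &cpu, &node, nullptr) != 0) {
    cpu = 0;
    node = 0;
  }
  return shardFor(cpu, node, numShards, numNodes);
}
```

With 8 shards and 2 nodes, `shardFor(5, 0, 8, 2)` lands in node 0's range (shards 0-3) and `shardFor(5, 1, 8, 2)` lands in node 1's range (shards 4-7), matching the example above.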
2. NUMA-aware work stealing
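Modified `get()` to search for superblocks in priority order:

1. Try the local shard
2. Power-of-two random choices within the same NUMA node
3. Scan the remaining same-node shards
4. Only then try shards on other NUMA nodes (last resort)

This reduces cross-node memory traffic when stealing superblocks from the global heap.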
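The search order used by `get()` can be sketched as below. This is an illustrative model only, not Hoard's implementation: `pickShard`, `kNoShard`, and the `available` vector (a stand-in for "how many superblocks shard i currently holds") are hypothetical, and the sketch assumes the power-of-two-choices step keeps whichever of two random same-node shards holds more superblocks.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Sentinel meaning "no shard has a superblock to give".
constexpr std::size_t kNoShard = static_cast<std::size_t>(-1);

// Hypothetical model of the search order. available.size() is assumed to be
// a multiple of shardsPerNode, and local must be a valid shard index.
inline std::size_t pickShard(const std::vector<int>& available,
                             std::size_t local, std::size_t shardsPerNode,
                             std::mt19937& rng) {
  assert(shardsPerNode > 0 && local < available.size());
  const std::size_t n = available.size();

  // Phase 1: the caller's own shard.
  if (available[local] > 0) return local;

  // Phase 2: power-of-two random choices within the same NUMA node.
  const std::size_t base = (local / shardsPerNode) * shardsPerNode;
  std::uniform_int_distribution<std::size_t> pick(0, shardsPerNode - 1);
  const std::size_t a = base + pick(rng);
  const std::size_t b = base + pick(rng);
  const std::size_t best = (available[a] >= available[b]) ? a : b;
  if (available[best] > 0) return best;

  // Phase 3: linear scan of the remaining same-node shards.
  for (std::size_t i = base; i < base + shardsPerNode; ++i) {
    if (available[i] > 0) return i;
  }

  // Phase 4 (last resort): shards that belong to other NUMA nodes.
  for (std::size_t i = 0; i < n; ++i) {
    if (i >= base && i < base + shardsPerNode) continue;  // already checked
    if (available[i] > 0) return i;
  }
  return kNoShard;
}
```

The point of the ordering is that a remote shard is only returned when every shard on the caller's node is empty, so cross-node traffic is confined to the last phase.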
Benchmark Results
NUMA stress test (20 touches per object to isolate NUMA effects)
At 128 threads on a 2-node NUMA system:
Hardware performance counter verification
Used AMD perf counters to measure actual NUMA locality at 64 threads:
- ls_dmnd_fills_from_sys.mem_io_local
- ls_dmnd_fills_from_sys.mem_io_remote
- ls_dmnd_fills_from_sys.ext_cache_local
- ls_dmnd_fills_from_sys.ext_cache_remote

Results: Hoard's NUMA optimization reduces remote memory access from 40.6% to 33.4% (memory I/O) and from 45.5% to 37.1% (cache fills), while improving throughput by 9%.
Note: mimalloc and jemalloc have lower remote access percentages because they use per-thread arenas (less sharing), but this comes at the cost of higher memory usage. glibc shows the highest remote access (51.4%) and throughput suffers accordingly.
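For readers reproducing the counter measurements, a remote-access percentage can be derived from a local/remote counter pair as sketched below. This assumes the percentage is remote fills divided by local plus remote fills, which is not stated explicitly in this PR; `remoteSharePercent` is a hypothetical helper name.

```cpp
#include <cassert>

// Hypothetical helper: share of fills served remotely, in percent.
// Assumes the metric is remote / (local + remote); returns 0 when both
// counters are zero to avoid dividing by zero.
inline double remoteSharePercent(unsigned long long localFills,
                                 unsigned long long remoteFills) {
  const unsigned long long total = localFills + remoteFills;
  if (total == 0) return 0.0;
  return 100.0 * static_cast<double>(remoteFills) /
         static_cast<double>(total);
}
```

The same function would apply to either counter pair: the `mem_io_local` / `mem_io_remote` counts or the `ext_cache_local` / `ext_cache_remote` counts.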
Standard benchmarks (no regression)
Compared against hoard_opt_work branch:
Test plan
🤖 Generated with Claude Code