Performance optimizations: TLAB fast path, realloc, sharded global heap, and CPU-based heap selection #88
Merged
Key changes:
- Skip magic number validation on TLAB free fast path (trust freelist pointers)
- Skip normalize() on fast path (require exact pointer per C standard)
- Add getObjectSizeUnchecked() to avoid redundant validation
- Implement custom xxrealloc to avoid double malloc_usable_size lookup
- Use per-bin object counts instead of byte arithmetic for threshold checks
Performance impact:
- Phong benchmark: 2x faster than mimalloc (realloc-heavy workload)
- threadtest: 10% faster at 4-8 threads
- Larson: within 15% of mimalloc (malloc/free workload)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
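The xxrealloc change in this commit can be illustrated with a toy allocator. This is only a sketch under stated assumptions: `Header`, `toyMalloc`, `toyFree`, `toyUsableSize`, `toyRealloc`, and `g_sizeLookups` are hypothetical names, the size-in-header layout is not Hoard's, and the double lookup being avoided is assumed to come from a generic realloc that queries the size once for the fit check and again for the copy.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Toy layout (assumption, not Hoard's): a header in front of each object
// records its usable size. alignas keeps the payload suitably aligned.
struct alignas(std::max_align_t) Header {
  std::size_t usable;
};

// Counts size lookups so the effect of the optimization is observable.
static std::size_t g_sizeLookups = 0;

inline void* toyMalloc(std::size_t sz) {
  auto* h = static_cast<Header*>(std::malloc(sizeof(Header) + sz));
  if (h == nullptr) return nullptr;
  h->usable = sz;
  return h + 1;  // payload starts right after the header
}

inline void toyFree(void* p) {
  if (p != nullptr) std::free(static_cast<Header*>(p) - 1);
}

// Stand-in for malloc_usable_size(): one metadata lookup per call.
inline std::size_t toyUsableSize(void* p) {
  ++g_sizeLookups;
  return (static_cast<Header*>(p) - 1)->usable;
}

// realloc that looks the old size up once and reuses the result for both
// the "does it already fit?" check and the copy length.
inline void* toyRealloc(void* p, std::size_t newSize) {
  if (p == nullptr) return toyMalloc(newSize);
  if (newSize == 0) {
    toyFree(p);
    return nullptr;
  }
  const std::size_t oldSize = toyUsableSize(p);  // the only lookup
  if (newSize <= oldSize) return p;              // fits in place
  void* q = toyMalloc(newSize);
  if (q == nullptr) return nullptr;              // original block stays valid
  std::memcpy(q, p, oldSize);
  toyFree(p);
  return q;
}
```

Growing a block through `toyRealloc` increments `g_sizeLookups` exactly once, which is the property the commit describes for the real implementation.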
Introduces ShardedGlobalHeap with configurable number of shards (default 8). Uses power-of-two random choices for superblock retrieval to balance load while preserving memory efficiency.
Design:
- put(): CPU-local shard selection for NUMA locality (last-touch policy)
- get(): Try local shard first, then power-of-two from fuller shard
- Fallback: try all shards sequentially (no memory stranding)
- Uses sched_getcpu() on Linux for NUMA-aware shard selection
Performance (192-core NUMA system):
- Larson: Same speed as mimalloc, 50% less memory at all thread counts
- No blowup: total capacity unchanged, any superblock findable via fallback
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace tid-based heap selection with sched_getcpu() on Linux.
The old approach (tid % NumHeaps) had two problems:
1. Thread IDs don't correlate with concurrent execution - a thread that exits leaves its ID "slot" unused while new threads may collide with active ones on the same heap.
2. No NUMA awareness - threads on different NUMA nodes could share heaps, causing cross-node memory traffic.
The new approach selects heaps based on which CPU is executing:
- Threads on the same CPU use the same heap (cache locality)
- CPUs on the same NUMA node use nearby heap indices (NUMA locality)
- Naturally load-balanced across actually-running threads
Benchmarks (192-core, 2-node NUMA system, threadtest 100k iterations):
- 96 threads: 5.0s -> 1.5s (3.3x faster)
- 128 threads: 3.2s -> 0.9s (3.6x faster)
- 192 threads: 0.85s -> 0.46s (1.8x faster)
Gap vs mimalloc at 192 threads reduced from 2.4x to 1.04x.
Memory efficiency preserved - Hoard still uses less memory than mimalloc and jemalloc on Larson benchmark.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Generate benchmark comparison graphs for Larson, threadtest, linux-scalability, and Phong benchmarks
- Add summary graph showing all benchmarks
- Update README with benchmark results section showing graphs
- Include script to regenerate graphs from benchmark data
Benchmarks run on 192-core, 2-node NUMA system comparing Hoard vs mimalloc vs jemalloc.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Use seaborn for cleaner styling
- Switch to line plots (normalized to Hoard = 1.0)
- Add glibc as fourth allocator for comparison
- Values > 1.0 mean worse than Hoard, < 1.0 mean better
- Green shaded region shows where Hoard wins
- Update README with new graph descriptions
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Use Helvetica font for cleaner appearance
- All benchmark graphs now show both time AND memory panels
- Clearer caption: "Hoard is the green line (1.0). Above = slower/more memory. Below = faster/less memory."
- Consistent styling across all graphs
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Left y-axis: normalized to Hoard (1.0)
- Right y-axis: actual values based on Hoard's baseline
- Makes it easy to see both relative performance and absolute values
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Each benchmark graph now shows BOTH time AND memory panels
- Right y-axis label is now just "seconds" or "MB" (not "Hoard (s)")
- Use DejaVu Sans font (Helvetica not available on this system)
- All titles use consistent sans-serif font
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Create individual time and memory graphs for each benchmark (8 total)
- Create summary graphs: one for time, one for memory (2x2 grid each)
- Use serif font (Times) for academic paper style
- Update README with all new graphs
- Remove old combined graphs
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The right y-axis showing absolute values was misleading when combined with normalization. Graphs now show only the normalized values (relative to Hoard = 1.0).
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Green shading above the Hoard baseline (1.0) indicates Hoard performs better (other allocators are slower or use more memory). Pink shading below indicates Hoard performs worse.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Extended all benchmark graphs to start from 1 thread (previously 16)
- Changed Larson metric from execution time to throughput (ops/sec)
- Hoard shows strong performance at low thread counts:
  - 5-8x faster than others on linux-scalability (1-8 threads)
  - 2x faster than others on threadtest (1-8 threads)
  - 1.3-1.5x higher throughput on Larson across all thread counts
- Updated shading: green = Hoard better, pink = Hoard worse
- For throughput metrics, shading is inverted (below line = worse)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Summary
This PR introduces several performance optimizations to Hoard:
1. TLAB Fast Path Optimizations
- Skip normalize() on fast path (require exact pointer per C standard)
- Add getObjectSizeUnchecked() to avoid redundant validation
2. Custom xxrealloc Implementation
- Avoid double malloc_usable_size lookup (was calling it twice per realloc)
3. Sharded Global Heap
- Uses sched_getcpu() on Linux for locality
4. CPU-Based Heap Selection
Benchmark Summary
All graphs normalized to Hoard (1.0 = Hoard, shown as green line). Values above the line mean worse than Hoard.
Execution Time
Memory Usage
Why Thread-ID-Based Heap Selection Was Problematic
The old approach used tid % NumHeaps to assign threads to heaps. This had two fundamental problems:
Thread IDs don't correlate with concurrent execution. When a thread exits, its ID slot becomes unused, but the ID isn't recycled immediately. New threads get fresh IDs that may hash to heaps already used by active threads, causing contention. Meanwhile, heaps assigned to exited threads sit idle.
No NUMA awareness. Threads on different NUMA nodes could hash to the same heap, causing cross-node memory traffic. On our 192-core test system (2 NUMA nodes), this caused up to 1.8x slowdown due to remote memory access.
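The difference between the old thread-ID scheme and a CPU-based replacement of the form `sched_getcpu() % NumHeaps` can be sketched as follows. This is a minimal illustration, not Hoard's code: `kNumHeaps`, `heapIndexFromThread`, and `heapIndexFromCpu` are hypothetical names, the heap count of 64 is an arbitrary assumption, and the fallback to thread hashing on non-Linux platforms is also an assumption.

```cpp
#include <cstddef>
#include <functional>
#include <thread>
#if defined(__linux__)
#include <sched.h>  // sched_getcpu(); g++ and clang++ define _GNU_SOURCE by default
#endif

// Hypothetical heap count, chosen only for this example.
constexpr std::size_t kNumHeaps = 64;

// Old scheme: the heap index depends only on the thread's identity, so it
// says nothing about where (or whether) the thread is currently running.
inline std::size_t heapIndexFromThread() {
  const std::size_t tid =
      std::hash<std::thread::id>{}(std::this_thread::get_id());
  return tid % kNumHeaps;
}

// New scheme: the heap index follows the CPU executing the caller, so
// threads running on the same CPU share a heap.
inline std::size_t heapIndexFromCpu() {
#if defined(__linux__)
  const int cpu = sched_getcpu();
  if (cpu >= 0) {
    return static_cast<std::size_t>(cpu) % kNumHeaps;
  }
#endif
  // Assumed fallback when no CPU number is available.
  return heapIndexFromThread();
}
```

Because a thread can migrate between CPUs, the index returned by `heapIndexFromCpu()` is only a hint about locality at the moment of the call; any heap it selects still needs its own synchronization.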
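The new approach uses sched_getcpu() % NumHeaps, selecting heaps based on which CPU is executing:
- Threads on the same CPU use the same heap (cache locality)
- CPUs on the same NUMA node use nearby heap indices (NUMA locality)
- Naturally load-balanced across actually-running threads
Sharded Global Heap: Locality and Blowup Bound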
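The sharded global heap uses different strategies for put() and get() to achieve both NUMA locality and memory efficiency:
put(): Local shard for NUMA locality
Superblocks are returned to the CPU-local shard (last-touch policy), which preserves NUMA locality.
get(): Power-of-two choices from fuller shard
get() tries the local shard first, then takes from the fuller of two randomly chosen shards, and finally falls back to trying all shards sequentially.
This design maintains the blowup bound because: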
No memory stranding: The fallback ensures any superblock in any shard can be found. A thread is never forced to allocate fresh memory when a reusable superblock exists somewhere.
Fuller-first concentrates superblocks: Taking from the fuller shard (power-of-two choices) causes superblocks to concentrate in fewer shards over time. This improves reuse and reduces fragmentation.
Total capacity unchanged: Sharding is purely internal reorganization. The sum of all shards equals what the single global heap held before.
Emptiness threshold unchanged: Per-thread heaps still return superblocks at the same threshold. Sharding doesn't change when superblocks move, only where they're stored.
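The put()/get() strategy and the no-stranding fallback described above can be sketched in a few dozen lines. This is an illustration under assumptions, not the ShardedGlobalHeap implementation: `Superblock`, `ShardedPoolSketch`, the per-shard mutex, and the seeded `std::mt19937` are all hypothetical choices made to keep the example self-contained.

```cpp
#include <array>
#include <cstddef>
#include <mutex>
#include <random>
#include <vector>

struct Superblock { int id; };  // stand-in for a real superblock

template <std::size_t NumShards = 8>
class ShardedPoolSketch {
 public:
  // put(): return a superblock to the caller's local shard.
  void put(Superblock* sb, std::size_t localShard) {
    Shard& s = shards_[localShard % NumShards];
    std::lock_guard<std::mutex> guard(s.lock);
    s.blocks.push_back(sb);
  }

  // get(): local shard, then the fuller of two random shards, then all shards.
  Superblock* get(std::size_t localShard) {
    if (Superblock* sb = takeFrom(localShard % NumShards)) return sb;

    std::size_t a = 0, b = 0;
    {
      std::lock_guard<std::mutex> guard(rngLock_);
      std::uniform_int_distribution<std::size_t> pick(0, NumShards - 1);
      a = pick(rng_);
      b = pick(rng_);
    }
    const std::size_t fuller = (sizeOf(a) >= sizeOf(b)) ? a : b;
    if (Superblock* sb = takeFrom(fuller)) return sb;

    // Fallback: scan every shard so no reusable superblock is stranded.
    for (std::size_t i = 0; i < NumShards; ++i) {
      if (Superblock* sb = takeFrom(i)) return sb;
    }
    return nullptr;  // nothing reusable anywhere; caller needs fresh memory
  }

 private:
  struct Shard {
    std::mutex lock;
    std::vector<Superblock*> blocks;
  };

  std::size_t sizeOf(std::size_t i) {
    std::lock_guard<std::mutex> guard(shards_[i].lock);
    return shards_[i].blocks.size();
  }

  Superblock* takeFrom(std::size_t i) {
    Shard& s = shards_[i];
    std::lock_guard<std::mutex> guard(s.lock);
    if (s.blocks.empty()) return nullptr;
    Superblock* sb = s.blocks.back();
    s.blocks.pop_back();
    return sb;
  }

  std::array<Shard, NumShards> shards_;
  std::mutex rngLock_;
  std::mt19937 rng_{12345};  // fixed seed, for reproducibility of the sketch
};
```

In this sketch `get()` returns `nullptr` only after every shard has been checked, which mirrors the claim that a thread never allocates fresh memory while a reusable superblock exists in some shard. The size comparison between the two candidate shards is a snapshot and may be stale by the time the superblock is taken; the sequential fallback keeps that race harmless.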
Why the Blowup Bound Matters
Hoard guarantees memory consumption is bounded by O(U + c·P·S·log M) where:
This bound matters for several reasons:
Predictable overhead: Memory consumption scales with actual usage (U) plus a logarithmic term. An application using 1GB will not suddenly consume 10GB due to fragmentation.
Scalability: The P·S·log(M) term means per-thread overhead grows only logarithmically with allocation history, not linearly. A long-running server won't accumulate unbounded fragmentation.
No pathological cases: Some allocators can exhibit O(M) blowup under adversarial or unlucky allocation patterns—memory proportional to peak usage even after most is freed. Hoard's bound prevents this.
Production safety: For memory-constrained environments (containers, embedded systems), a formal bound lets you provision memory with confidence.
The global heap memory is counted as "available to all threads" in this analysis. Since any superblock remains findable via the fallback (no stranding), the sharded design preserves the bound.
Design Notes: Comparison with Other Allocators
We benchmarked against mimalloc, jemalloc, and glibc. While Hoard, mimalloc, and jemalloc all target scalable multithreaded allocation, their approaches differ significantly:
What mimalloc does differently
What jemalloc does differently
What Hoard does (unchanged fundamentals)
New optimizations (this PR)
sched_getcpu() instead of thread ID hashing for NUMA-aware heap assignment.
The key distinction: these optimizations make Hoard faster while preserving its memory efficiency guarantees, rather than trading memory bounds for speed.
Detailed Benchmark Results (192-core, 2-node NUMA system)
All benchmarks run with LD_PRELOAD to inject each allocator.
Larson (server workload simulation)
threadtest (per-thread malloc/free throughput)
linux-scalability (malloc/free pairs)
Phong (realloc-heavy workload)
Test plan
🤖 Generated with Claude Code