It's an allocator. A little messy; has a test suite and that's about all. Don't use it for anything that matters unless you want to run the risk of bugs.
Intentionally disables just about every safety/bounds check/panic in release. Better performance but no fun to debug. Any misuse ever at all is UB. This is a base on which a less risky API is meant to go.
This uses a very large amount of unsafe rust. Because it's an allocator. It has a fine test suite, is model-checked via loom, and checks UB under miri.
Supports only 64-bit targets and latest nightly. Do not run it on 32-bit targets or those without an instruction for 128-bit atomic CAS. It will be slow.
Operates on the assumption that cycles are often wasted, RAM is plentiful but dog-slow, and therefore cache-friendliness is everything. Secondly, avoid stopping the world via global contention because if I wanted lousy tail latency I'd just use a GC.
This is why the old method of metadata in front of a block is now wrong for specific access patterns like simulation/ECS. Hot path is data -> data -> data and cold path, infrequently during sync points, is meta -> meta -> meta. So, keep metadata separately. Here, all metadata for a block fits into a single u64. This means 8 blocks metadata per cache line; alloc/free don't take up too many.
Uses a three-level hierarchical bit-tree, a la Unreal MallocBinned but static size and so therefore limited to 16,384 blocks per 2KB tree. Becauase of this constraint, simply chain when capacity exceeded. Unlike dynamic, finding first free block is O(1) via tzcnt or similar instructions for blocks >=16KB; amortized O(1) for smaller.
Global recycler is lock-free though technically not wait-free. It's technically achievable (cf. Crystalline-W, Nikolaev '21) but often slower. Use a lock-free Trieber stack with ABA guarded by 128-bit CAS (thanks modern architectures); ptr + gen count. Seems the whole hyaline family and subsequent crystalline do not outperform here. Threads only touch global lock when recycler is full.
Thread-local caches batch work when interacting with global pool to amortize cost over more allocations.
Cache flushing is co-operative; trim increments an atomic gen ctr and threads
flush themselves.
Generally, syscalls rarely happen while lock is held. Possibly never; don't recall.
Uses huge pages by default, falling back to smaller sizes if unavailable. 1GB -> 2MB -> system default.
Consider stealing mimalloc's sharded metadata approach; might ding perf a bit but allow free() without size. Probably introduces extra cacheline touches and definitely extra overhead. Maybe not.
Pretty sure true wait-free guarantee on the recycler would cost more than it returns perf-wise vs. existing lock-free. Want to tune recycler cap and bundle size pretty carefully so as to avoid backpressure through locked path. Could possibly use a per-thread remote queue to further reduce global contention under asymmetric workloads. Don't think I want to go full snmalloc message passing.
I really don't want to mess around with ARM MTE. Yeah security etc but come on.
Currently I decommit whole empty blocks and trim trailing empty which is lower- risk, but perhaps take a look at Mesh, which has an interesting compaction strategy of moving things to another page with space then using vmem to remap the old page's address space, finally returning old physical page to OS. Unsure if this even works on anything besides Linux, or if it could be made to, and don't care to go do another deep dive into vmem right now to check. Not sure I like this, though, since necessarily has some form of locking (believe mesh does it with protect + write fault handler that waits) that will mess with tail latencies. "Now you can write in a systems language and still enjoy the benefits of GC latency spikes!" gee thanks. Not nearly as bad but still likely shows in p99.
Google did some interesting work with tcmalloc on tuning via LSTM; could be interesting but want to make sure I don't thrash everything to death with constant config tweaks. Also don't want to force checking atomics all the time in the hot path. Maybe go expose an API for this and check config infrequently at coarse cadence. Keep param space small; maybe bounds and hysteresis.
Currently 128-bit atomic (ptr + generation (curse you red baron (rust core team)
for making gen a keyword)) prevents using LDAPR (ARM 8.3+ instruction with
somewhat weaker guarantees and better performance for mem barrier). CASP
probably dominates anyway though not LDAR vs LDAPR. Not acquiring loads of
shared read-mostly state in tight loops; it's fine.
Considered rejiggering API surface to allow multiple TLCs but this is a rare
use-case and would make it very ugly. Passing &mut cache to every alloc is
annoying. Probably skip.