This post gives a brief tour of Rust's new allocation interface, and some experiments with a custom `Vec` implementation.
Rust has recently made progress in defining an interface for general memory allocation. This interface is different from the standard `malloc`/`free` interface from C. Here is a shorter version of the full interface:
pub unsafe trait Alloc {
    unsafe fn alloc(&mut self, layout: Layout) -> Result<*mut u8, AllocErr>;
    unsafe fn dealloc(&mut self, ptr: *mut u8, layout: Layout);
    // alloc and dealloc are sufficient to implement this trait, but there are
    // several other methods in the full trait that have default implementations.
}
Where `Layout` is a type that encapsulates size and alignment information, and `AllocErr` specifies a few common errors that allocators can exhibit (e.g. OOM). The biggest difference between this interface and the more traditional one is that allocation requests take an explicit alignment, and that size and alignment are provided on deallocation and reallocation.
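To make that contract concrete, here is a minimal sketch of a round trip through the trait. This is nightly-only, and the module paths are my assumption about the current `allocator_api`; the key point is that `dealloc` receives the same `Layout` that `alloc` did, so the allocator never has to record the size itself:

#![feature(allocator_api)] // nightly-only
use std::heap::{Alloc, AllocErr, Layout};

// A minimal round trip: unlike free(), dealloc takes the size and
// alignment (via Layout) that the caller already knows.
unsafe fn round_trip<A: Alloc>(a: &mut A) -> Result<(), AllocErr> {
    let layout = Layout::from_size_align(1024, 8).unwrap();
    let ptr = a.alloc(layout.clone())?;
    // ... use the 1024 bytes behind `ptr` ...
    a.dealloc(ptr, layout);
    Ok(())
}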
`elfmalloc` is currently designed to support the C interface. As a result, it does extra work to store size information on the heap so that it can be looked up during a call to `free`. While this means that the current implementation has enough information to implement the `Alloc` trait, I have started thinking about ways that the overall design could be improved or simplified if it only had to implement that trait. The end result is a similar design that differs from the original in two chief ways:
- We do not require complex metadata management to differentiate allocation subsystems (see "Determining Allocation Source..." at the bottom of the `elfmalloc` doc).
- We only use the slab subsystem for relatively small objects. Medium-sized objects are now allocated directly from `PageAlloc`-like structures. This cuts down on metadata management and fragmentation for larger objects (see the schematic below).
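As a rough schematic of this split (not `elfmalloc`'s actual code; the thresholds here are made up for illustration):

// Size-based dispatch between the allocation subsystems described above.
const SMALL_MAX: usize = 1 << 10;  // hypothetical slab cutoff
const MEDIUM_MAX: usize = 1 << 20; // hypothetical PageAlloc cutoff

enum Subsystem { Slab, PageAlloc, Mmap }

fn subsystem_for(size: usize) -> Subsystem {
    if size <= SMALL_MAX {
        Subsystem::Slab       // slab subsystem: small objects only
    } else if size <= MEDIUM_MAX {
        Subsystem::PageAlloc  // PageAlloc-like structure for medium objects
    } else {
        Subsystem::Mmap       // very large objects fall back to mmap
    }
}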
For more information on this design, check out the code.
My main goal here was to try this out in some fairly idiomatic Rust code; the path of least resistance was definitely `Vec`. Not only is `Vec` one of the most widely used Rust constructs that rely on heap allocation, but a lot of the work is already done for me. Much of the lower-level memory management that `Vec` performs is factored out into `RawVec`, and `RawVec` is now parametric on an `Alloc` instance. Given that, implementing a data structure with essentially the same interface as `Vec` was as simple as:
// Assumes `use std::ptr;` and the nightly `RawVec` (from the `alloc`
// crate), which now takes an allocator type parameter.
pub struct AVec<T, A: Alloc> {
    buf: RawVec<T, A>, // buffer plus the allocator instance
    len: usize,        // number of initialized elements
}

impl<T, A: Alloc> AVec<T, A> {
    fn push(&mut self, val: T) {
        if self.len == self.buf.cap() {
            // Out of capacity: let RawVec reallocate at double the size.
            self.buf.double();
        }
        unsafe {
            // Write into the first uninitialized slot.
            ptr::write(self.buf.ptr().offset(self.len as isize), val);
        }
        self.len += 1;
    }

    unsafe fn to_slice(&self) -> *mut [T] {
        ::std::slice::from_raw_parts_mut(self.buf.ptr(), self.len)
    }
}
This, along with some boilerplate trait implementations, was all it took. The smallvec project was an excellent guide when doing this.
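For completeness, construction might look something like the following; `RawVec::new_in` is my reading of the current nightly API, so treat this as a sketch rather than the crate's actual code:

impl<T, A: Alloc> AVec<T, A> {
    // Hypothetical constructor: an empty, zero-capacity vector that will
    // allocate through `a` on the first call to double().
    pub fn new_in(a: A) -> Self {
        AVec { buf: RawVec::new_in(a), len: 0 }
    }
}

fn main() {
    // The same data structure, parametric over the allocator:
    let mut v = AVec::new_in(::std::heap::Heap);
    v.push(42usize);
}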
Warning: benchmarks with larger objects appear to stress the `mmap` implementation in Windows Subsystem for Linux (WSL), where these were run. As such, the numbers may not easily transfer to other systems.
This performance evaluation includes 3 separate `Vec` types:
- Rust's native `Vec` (suffix `_vec`)
- `AVec` instantiated with `Heap` (suffix `_avec_heap`)
- `AVec` instantiated with the Rust-specific `elfmalloc` (suffix `_avec_elf`)
Options 1 and 2 both allocate memory using `jemalloc`. They should ideally be the same, but both are included in order to flag any cases in which the implementations of `AVec` and `Vec` might diverge. There are currently measurements for 3 workloads:
PSA: If you are ever doing operations like this in a tight loop and performance is a serious concern, you are probably better off calling `reserve` or initializing using `with_capacity` to pre-allocate all the needed space ahead of time. These are microbenchmarks, and as a result they come with caveats about how well they translate to impacts on real workloads.
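For instance (stable Rust, not part of the benchmarks themselves):

fn fill_preallocated() -> Vec<usize> {
    // One allocation up front; the loop never triggers a doubling.
    let mut v = Vec::with_capacity(500);
    for i in 0..500 {
        v.push(i); // capacity is already sufficient
    }
    v
}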
Workload 1: Small Allocations: This involves creating a vector and pushing 500 `usize`s onto it. We start all vectors at size 0, so this should involve a small number of reallocations from doubling operations. We include results with both 1 thread and 32 threads. For this workload only, we have an "inner loop" where the 500-push workload is repeated 8 times; this is to reduce the overhead caused by having too many samples.
The multithreaded numbers represent the same workload run across several threads, with each thread's run reported as a separate sample for the mean and variation statistics. One would expect an algorithm that scales perfectly to have numbers matching the single-threaded performance of the workload.
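As a rough sketch of what a single sample measures (using the standard `Vec` for illustration; this is not the actual harness code):

// Per-sample body for Workload 1: 8 repetitions of 500 pushes, each
// starting from an empty, zero-capacity vector. The vector (and its
// buffer) is dropped at the end of each repetition.
fn sample_small() {
    for _ in 0..8 {
        let mut v = Vec::new();
        for i in 0..500usize {
            v.push(i); // a handful of doubling reallocations per run
        }
    }
}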
benchmark-n01 bench_push_avec_elf 15.883 μs per iteration (+/- 10.59%)
benchmark-n32 bench_push_avec_elf 38.035 μs per iteration (+/- 4.01%)
benchmark-n01 bench_push_vec 17.954 μs per iteration (+/- 9.27%)
benchmark-n32 bench_push_vec 37.127 μs per iteration (+/- 4.12%)
benchmark-n01 bench_push_avec_heap 17.865 μs per iteration (+/- 9.27%)
benchmark-n32 bench_push_avec_heap 37.468 μs per iteration (+/- 4.01%)
Here we can see that these are about the same, perhaps with a slight edge to `elfmalloc` in a single-threaded setting. This makes sense: all of these allocations are going to hit thread-local caching layers, and the corresponding data structures seem to be fairly well optimized in both cases. There is also fairly little room for improvement here: 16 microseconds for 4000 push operations on a processor that turbos to 2.3GHz is pretty good.
Workload 2: Medium Allocations: This workload is very similar to the first, except that it starts the benchmark with a vector of 16K elements and pushes 500,000 more `usize`s onto the vector. This workload is repeated 50 times.
benchmark-n01 bench_push_medium_avec_elf 2.565 ms per iteration (+/- 0.83%)
benchmark-n32 bench_push_medium_avec_elf 23.415 ms per iteration (+/- 38.84%)
benchmark-n01 bench_push_medium_vec 8.239 ms per iteration (+/- 2.45%)
benchmark-n32 bench_push_medium_vec 271.837 ms per iteration (+/- 4.97%)
benchmark-n01 bench_push_medium_avec_heap 7.901 ms per iteration (+/- 2.94%)
benchmark-n32 bench_push_medium_avec_heap 272.775 ms per iteration (+/- 4.85%)
These numbers are quite good! Too good, I suspect; and what is going on with the high variance for 32 threads? While part of this speedup may be due to the scalable `BagPipe` data structure we use for medium objects, it is possible that WSL is getting in `jemalloc`'s way here. My current theory is that `jemalloc` is more aggressive about reclaiming memory than we are in this scenario, which means it has to call `mmap` and `munmap` a lot more than `elfmalloc` does. This is exacerbated by the fact that these system calls appear to be something of a scaling bottleneck on WSL, though that is just an educated guess from looking at per-core load while running certain workloads.
Workload 3: Large Allocations: This workload is essentially the same, but instead of pushing `usize`s onto the vectors we push `[usize; 1024]`s. The benchmark pushes 1000 of these and repeats the workload 50 times.
benchmark-n01 bench_push_large_avec_elf 10.407 ms per iteration (+/- 4.07%)
benchmark-n32 bench_push_large_avec_elf 368.045 ms per iteration (+/- 5.18%)
benchmark-n01 bench_push_large_vec 14.317 ms per iteration (+/- 2.80%)
benchmark-n32 bench_push_large_vec 560.585 ms per iteration (+/- 3.27%)
benchmark-n01 bench_push_large_avec_heap 15.015 ms per iteration (+/- 3.07%)
benchmark-n32 bench_push_large_avec_heap 568.226 ms per iteration (+/- 4.10%)
Here, `elfmalloc` again performs better than the default heap allocator. I suspect that this is due to our use of the `BagPipe` data structure to cache these objects; `jemalloc` may simply fall back on `mmap` for allocations this large. `elfmalloc` does this as well, but its threshold is higher. The improvement at this end demonstrates the benefit of being able to easily configure one's allocator for a particular workload when required.
In the `elfmalloc` crate, build with

cargo build --release --features=local_cache,use_default_allocator

and then run `target/release/bench_vec`. It will automatically scale to the number of available hardware threads on the current machine. It is highly unlikely that these benchmarks will run on anything but Linux.
I currently do not have a good solution for tracking fine-grained memory consumption in pure-Rust benchmarks like this. I have an idea of how to replicate the techniques used in the more mature `malloc` benchmarks we have access to, but getting everything working well will take a good deal of time.
As for rough estimates, I did keep a close eye on the task manager throughout the benchmark execution. While that is a very rough metric, I believe `elfmalloc` and `jemalloc` used comparable memory (within, say, a factor of 2), with `jemalloc` often using less.