elfmalloc Performance Evaluation
This document describes the performance of
elfmalloc in existing benchmarks
that have been used to evaluate
malloc performance in C/C++. Using the
LD_PRELOAD mechanism on Linux, we can get something of an apples-to-apples
measure of how our work stacks up.
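As a rough illustration of the LD_PRELOAD approach, the following sketch builds a benchmark invocation with an allocator library preloaded. The function name `preload_command`, both paths passed to it, and the convention that the benchmark takes its thread count as its first argument are all assumptions made for this example; none of them come from the benchmark artifact.

```rust
use std::process::Command;

/// Build a command that runs `benchmark` with `allocator_lib` preloaded.
/// Both paths are hypothetical placeholders supplied by the caller.
fn preload_command(benchmark: &str, allocator_lib: &str, threads: usize) -> Command {
    let mut cmd = Command::new(benchmark);
    // The dynamic linker loads this shared object before the others, so its
    // malloc/free definitions take precedence over the ones in libc.
    cmd.env("LD_PRELOAD", allocator_lib);
    // Assumed CLI: the benchmark takes its thread count as the first argument.
    cmd.arg(threads.to_string());
    cmd
}
```

Calling `.status()` or `.output()` on the returned command would then run the unmodified benchmark binary against the preloaded allocator, which is what makes the comparison close to apples-to-apples.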
This document is a work in progress. We are currently working on providing
results for more
malloc implementations in more workloads. We are also
exploring various optimizations for the allocator. Various tuning parameters
affect elfmalloc performance substantially in some of these benchmarks,
sometimes at the cost of performance elsewhere. We are still investigating
the best balance of these parameters, as well as the benchmark settings in
use, to ensure that all settings are fair with respect to
different allocators' thread-local cache sizes.
Two variants of elfmalloc
In addition to a standard version (called simply
elfmalloc), we provide the
elfmalloc-l variant. The -l refers to the use of a different front-end
for the allocator, as opposed to the
MagazineCache front-end used in the
default configuration. While
elfmalloc is almost always superior in terms of
both throughput and memory usage,
elfmalloc-l is useful for performance
evaluation as it provides a "no-cache" baseline for allocator performance.
We use a subset of the benchmarks used in the
paper. These benchmarks
come from several teams of developers that have worked on allocator
performance over the years. We describe them briefly here, but we
encourage anyone curious to examine Section 7 of that paper.
Threadtest: This benchmark allocates and deallocates a number of small objects in rounds, each round with several allocations, causing thread-local caches to be overrun. Although it runs with multiple threads, we call this workload "thread-local" because all pointers are freed in the same thread in which they were allocated. Threadtest was developed to assess allocator performance during Hoard's development.
Shbench: Similar to Threadtest but with varying object sizes and varying object lifetimes. Our understanding is that this benchmark is due to Larson et al.
rpmalloc-benchmark: This benchmark performs a (somewhat unrealistic) workload of randomly-distributed allocation sizes. This benchmark is included because it can be configured to provide a succinct producer-consumer workload: one where objects are allocated and freed in different threads. This is the configuration we use in the measurements below.
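The difference between the "thread-local" and producer-consumer patterns described above can be sketched as follows. This is a minimal illustration and not code from any of the benchmarks: the function names, the use of `Vec<u8>` as the allocated object, and the use of a standard channel to hand objects between threads are all assumptions chosen for brevity.

```rust
use std::sync::mpsc;
use std::thread;

/// Thread-local pattern: every object is freed by the thread that allocated it.
/// Returns the total number of objects allocated across all threads.
fn thread_local_rounds(threads: usize, rounds: usize, per_round: usize, size: usize) -> usize {
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            thread::spawn(move || {
                let mut allocated = 0;
                for _ in 0..rounds {
                    // Allocate a batch of objects...
                    let batch: Vec<Vec<u8>> =
                        (0..per_round).map(|_| vec![0u8; size]).collect();
                    allocated += batch.len();
                    // ...then free the whole batch on this same thread.
                    drop(batch);
                }
                allocated
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}

/// Producer-consumer pattern: objects are allocated on one thread and freed
/// on another. Returns the number of objects the consumer freed.
fn producer_consumer(objects: usize, size: usize) -> usize {
    let (tx, rx) = mpsc::channel::<Vec<u8>>();
    let producer = thread::spawn(move || {
        for _ in 0..objects {
            // Allocated here, on the producer thread.
            tx.send(vec![0u8; size]).unwrap();
        }
        // `tx` is dropped when this closure returns, which closes the channel.
    });
    let consumer = thread::spawn(move || {
        let mut freed = 0;
        for obj in rx {
            // Freed here, on the consumer thread.
            drop(obj);
            freed += 1;
        }
        freed
    });
    producer.join().unwrap();
    consumer.join().unwrap()
}
```

In the first function a per-thread cache can satisfy most frees and subsequent allocations locally. In the second, freed memory originates on another thread, so the allocator has to move it between threads somehow, which is why the two patterns stress different parts of an allocator.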
The data here were gathered using a version of the
artifact. The only
modifications to the artifact were tweaks of object sizes (for the "large
object" variants) and iteration counts to reduce variance and test robustness
for larger workloads. The rpmalloc benchmark is not included in the artifact, so
we adapted it to output similar results.
These benchmarks were conducted on a 16-core, 32-thread workstation with two Xeon E5-2620v4 CPUs, running on WSL. We benchmark these workloads at 1, 2, 4, 8, 16, 24, and 32 threads. The 1- to 8-thread configurations all use physical cores. For 16 threads, all threads are scheduled on a single socket using all of its hardware threads. The 24-thread configuration uses a subset of the available hardware threads, this time crossing a NUMA domain. Finally, the 32-thread benchmark uses all available hardware threads across both sockets.
Other Allocators Measured
We include two representative allocators to benchmark against. We plan
to add more in the future (e.g.
scalloc, which we are having trouble
getting to run on WSL due to some
jemalloc: A mature, efficient
malloc used as the default in the Rust ecosystem and used in production at Facebook. We pulled a dev version of this allocator in early August 2017. Many of its numbers appear improved over the earlier version benchmarked in the
llalloc: An efficient, proprietary
malloc from Lockless Inc.
ptmalloc2: The default allocator on Linux. Throughput numbers are expressed as a multiple of the numbers for this allocator. These numbers represent performance for ptmalloc version 2.19, the default shipped with the version of glibc in use.
rpmalloc: link for more information.
We provide measurements of both memory consumption and throughput for the 3
workloads described above. A common theme here is that elfmalloc provides
consistently high throughput, sometimes at the cost of increased memory usage.
By the same token, if you see an allocator failing to scale up, it is worth
looking at heap growth for the same workload. Some allocators
return memory to the OS more aggressively than
elfmalloc, effectively gaining
memory efficiency at the cost of throughput. We are still working on tuning
elfmalloc to be less profligate with memory under certain conditions.
In order to make the graphs readable, we express throughput as a
multiple of the performance of
ptmalloc2, which is consistently the slowest.
For threadtest and shbench, we provide numbers for both small (64
bytes or smaller) and medium-sized (a few KB) objects. For throughput, more is
better; for memory consumption, less is better.
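The normalization used for the throughput graphs amounts to dividing each allocator's raw throughput by the baseline's throughput for the same configuration. A minimal sketch, assuming raw results are available as (allocator name, operations per second) pairs; the function name `relative_throughput` and the input layout are hypothetical and do not reflect the artifact's actual output format:

```rust
/// Express each allocator's throughput as a multiple of the baseline's.
/// Returns None if the baseline is missing or its throughput is not positive.
fn relative_throughput(results: &[(&str, f64)], baseline: &str) -> Option<Vec<(String, f64)>> {
    // Look up the raw throughput of the baseline allocator (e.g. "ptmalloc2").
    let base = results.iter().find(|(name, _)| *name == baseline)?.1;
    if base <= 0.0 {
        return None;
    }
    // Divide every raw figure by the baseline; the baseline itself maps to 1.0.
    Some(
        results
            .iter()
            .map(|(name, ops)| (name.to_string(), ops / base))
            .collect(),
    )
}
```

Under this scheme a value of 1.0 means "as fast as the baseline", and a value above 1.0 means proportionally higher throughput than the baseline at the same thread count.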
For small objects, threadtest shows
elfmalloc having the highest throughput (except for 32
threads, where it is a bit behind
jemalloc), at the cost of a noticeable
increase in memory usage.
Threadtest Throughput (Small Objects)
Threadtest Memory Consumption (Small Objects)
For larger objects,
elfmalloc comes out on top in terms of throughput for 32
threads. Unlike the case of smaller objects,
elfmalloc has memory usage
roughly in line with that of
jemalloc. One interesting point here is the
memory usage of
elfmalloc-l. Here we see the counter-intuitive result that
increased thread-local cache sizes can lead to lower memory consumption. We
suspect that this is because
threadtest includes long stretches of allocation
followed by relatively limited deallocation, limiting the re-use of allocated
memory.
Threadtest Throughput (Medium Objects)
Threadtest Memory Consumption (Medium Objects)
elfmalloc out-performs all competition in terms of both memory
and throughput for higher core-counts, though it lags behind
elfmalloc-l performs similarly well, though slightly worse on average for
smaller objects, and slightly better for larger objects.
provides very good memory performance across the board, sometimes better than
Also of note is the memory consumption of
something about the varying lifetimes of objects seems to trip these allocators
up here, as they use over an order of magnitude more memory than
Shbench Throughput (Small Objects)
Shbench Memory Consumption (Small Objects)
We see a similar improvement for shbench on larger objects.
Shbench Throughput (Medium Objects)
Shbench Memory Consumption (Medium Objects)
In the producer-consumer workload, elfmalloc has the highest throughput, at the
cost of increased memory consumption.
Producer-Consumer Memory Consumption