Find file
Fetching contributors…
Cannot retrieve contributors at this time
275 lines (210 sloc) 11.5 KB
\setlength{\parskip}{1.9ex plus 0.5ex minus 0.2ex}
\title{Proposal for an Efficient Single Object Allocator}
\author{Dhruv Matani}
With increasing memory sizes, and increasing difference between the
size of CPU caches and main memory, improving locality and reducing
the per-object allocation overhead seems to be a desirable thing.
The C++ memory allocation model allows allocators to take advantage of
the fact that the type (and hence the size) of the object being
allocated (and deallocated) is known told to the memory allocator
run-time by the language itself. Many allocators fail to take
advantage of this subtle fact.
Containers such as \texttt{list, map, set, etc\ldots{}} are used very
often, and result in problems such as:
\item Memory fragmentation since nodes are not necessarily freed in
the same order as they are allocated. If an allocator uses free
lists to string together objects of the same size, then objects
lying on different cache lines and pages tend to get grouped
\item High per-object overhead. A linked list node for
\texttt{list$<$double$>$} on a 64-bit system occupies $8 \times 3 =
24$ bytes. If the allocator overhead is $8$ bytes per object, then
we are looking at a $33\%$ overhead, which is pretty high. Some
allocators try to alleviate this problem by allocating small objects
in \textit{arenas or pages} and rounding off the address of the
deallocated memory to the arena or page boundary to determine the
metadata related to the memory being deallocated. These allocators
have a constant overhead, which is not proportional to the number of
objects allocated. However, they suffer from memory
fragmentation\footnote{By \textit{memory fragmentation}, I mean the
behaviour of the allocator to return memory from different pages
even though free memory on the same page is available. Usually,
\textit{free-list} based allocators tend to return memory in the
order it was deallocated.} since free blocks are usually strung as
a linked list.
C++ allocators accept an optional \textit{hint} parameter, which is
usually ignored by most allocators. If the \textit{hint} parameter is
set by containers such as \textit{list, map, set, etc\ldots{}}, then
it could potentially increase the cache usage of the processor if the
allocator allocates objects near the object at the location of the
\textit{hint} parameter.
\section{Current state-of-the-practice}
\textit{Note: I have borrowed the term \textbf{state-of-the-practice}
from Dr. Michael Bender's presentation on Tokutek}
The current \texttt{bitmap\_allocator} is attractive because:
\item The per-object overhead is just 1-bit
\item The allocator metadata and the user memory are at separate
locations, so updating the allocator metadata never touches memory
close to the user memory. Since the bitmaps are very tightly packed
together, it means that the allocator's working set is very hot and
will hopefully reside in the processor cache.
The current \texttt{bitmap\_allocator} achieves the objectives
mentioned above except for the following drawbacks:
\item The bound on the worst-case cost of a single allocation request
is $O(n)$. However, on the average, the running time is $o(\log{n})$
(yes, that's a little-Oh) since the average case is assumed to be a
sequence of allocation requests or that there are enough free
objects available.
\item The free-list-bitmap comes just before the actual memory, so
faulty programs run the risk of corrupting the allocator metadata.
\item The code seems to be unnecessarily complicated and doesn't have
a standard data structure backing it when in fact a \textit{Segment
Tree} is the perfect data structure for such a use-case. The
\textit{Segment Tree} also provides an upper-bound of $\O(\log{n})$
on the worst-case running time of a single call to the
\texttt{allocate()} function.
\section{The Static Segment Tree}
The Static Segment
Tree\footnote{\url{}} is a
data structure commonly used in a geometric setting to do
\textit{range queries}. It is efficient at performing \textit{range
aggregation queries} over a static data set. Fortunately, that is
all we need it for.
The Segment Tree is a summary data structure that stores data in a
heap-like tree-ordered fashion, with the root node holding the summary
information for the complete data set, the left \& right children of
the root node holding the summary information for the left \& right
half of the data set, and so on till you reach the leaf nodes, which
hold information for just a single data item. It is easy to see that
the space requirements for storing $n$ elements is $2n$, which is an
$O(n)$ space requirement to store $n$ elements.
For the purposes of the \texttt{bitmap\_allocator}, we shall use a
segment tree of \textit{bits}, which means that we shall use $2n$ bits
to summarize $n$ objects (or memory regions) each of which correspond
to a single object returned by the \texttt{allocate()} function.
\subsection{Locating free objects}
A set bit (\texttt{1}) at an internal node indicates that there exists
a free object somewhere below this node, whereas a reset bit
(\texttt{0}) indicates that there is no free object below this
node. This allows us to quickly determine whether we should go down
this node in search of a free node or not.
\subsection{Locating a free object close to a given allocated object}
To respect the \textit{hint} parameter that might be passed by the
caller in the \texttt{allocate()} function, we locate the leaf-node
that the allocated \texttt{hint} pointer points to and try to work our
way \textit{up} the tree from there, stopping at a node that has a set
(\texttt{1}) bit and working our way down from that node using the
same algorithm as before.
An idea that is slightly wasteful of space has been shown
here diagrammatically:\\
\section{The Big Picture}
Similar to the older \texttt{bitmap\_allocator}, we use a structure
that is composed of exponentially increasing sized Segment Trees. The
structure is composed of Segment Trees that have sizes 16, 32, 64,
128, 256, etc\ldots{} as the memory allocation requests increase.
A pointer to each such Segment Tree is stored in an array sorted by
the start address of the memory region that each Segment Tree points
to. In the worst case, we will have $\Theta(\log{n})$ Segment Trees,
so doing a linear search on them to find a Segment Tree that has a
free object is acceptable. This linear search is followed by a lookup
into the Segment Tree to actually find that object and mark it as
The same strategy is used to locate the Segment Tree to which an
object to be deallocated belongs. i.e. We just traverse the array of
Segment Trees and check if the pointer being deallocated lies in the
active range of the Segment Tree.
The older \texttt{bitmap\_allocator} was parameterized by
\textit{type}, and that meant that a separate memory pool was used for
\texttt{bitmap\_allocator<int>} \&
\texttt{bitmap\_allocator<float>}. Instead, we can use one more level
of indirection and create an implementation
\texttt{bitmap\_allocator\_impl<SIZE>} that is parameterized on the
size of the object rather than the type of the object. This allows
multiple objects having a different type, but the same size to share
the underlying memory pools.
\textbf{Note:} This optimization is currently \textbf{NOT} implemented
since objects of different types tend to have different lifetimes that
are related mostly to the type of the object rather than the size of
the object. Additionally, we expect there to be locality of reference
as far as different objects of the same type rather than size are
We need not incur a cost of $O(\log{n})$ per
\texttt{allocate/deallocate} request. This is because we can walk the
tree from the place we last left off. For example, if we start off
with an initially empty tree, we will go down $\log{n}$ nodes till we
reach a leaf node, after which we go up just 1 step and cache the
location of this node. The next time we need to allocate a node, we
will just visit the right child of this node and walk up 2 steps to
the grand-parent of the allocated node. This results in an algorithm
that has a \textit{worst-case} cost/operation of $O(\log{n})$, but the
amortized cost/operation is $O(1)$ if we ignore degenerate cases. In
\textit{expectation} though, this algorithm incurs a cost of $O(1)$
per operation (i.e. \texttt{allocate/deallocate}).
Every access in a tree that is full except for 1 empty leaf node will
cost $\log{N}$. This is happens $\dfrac{1}{N}$ times. Similarly, for the 2
subtrees with $\dfrac{N}{2}$ leaves each, we have a cost of $(\log{N}) - 1$.
Summing, we get:
Average Cost = $\dfrac{\log{N}}{N} + \dfrac{2((\log{N}) - 1)}{N} + \dfrac{4((\log{N}) - 2)}{N} + \ldots{} + \dfrac{N(1)}{N}$
$= \dfrac{1}{N}\left((2 \log{N} - 2) + (4 \log{N} - 8) + (8 \log{N} - 24) + \ldots{} + \dfrac{5N}{16} + \dfrac{4N}{8} + \dfrac{3N}{4} + \dfrac{2N}{2} + N\right)$
$= \dfrac{\log{N}}{N} + \dfrac{(\log{N}) - 1}{N/2} + \dfrac{(\log{N}) - 2}{N/4} + \ldots{} + \dfrac{5}{16} + \dfrac{4}{8} + \dfrac{3}{4} + \dfrac{2}{2} + 1$
$= O(1)$
The degenerate case arises when we allocate the \textit{last} free
block in a segment tree and immediately deallocate it, and repeat this
over and over again. This causes the algorithm to repeatedly reset and
set $O(\log{n})$ bits from the root to the affected leaf node.
We notice that we need not update the cache location of our node in
case of a \texttt{deallocate} operation since a \texttt{deallocate}
operation only frees nodes up, which is okay as far as our
optimization is concerned. The cached node location is updated only
when we allocate nodes.
Another optimization we use is that we cache the most recent Segment
Tree from which an allocation/deallocation request succeeded in the
hope that subsequent requests for the same will be satisfied by the
same Segment Tree. This may not always be true, but works well for
many practical workloads.
All these optimizations greatly reduce the constants and asymptotic
complexity of each operation, and improve the performance of the
\section{Debugging Aids}
The older \texttt{bitmap\_allocator} was pretty easy to corrupt (as so
will the new one) since the bitmaps used to locate free objects lies
just before the user memory. There were no checks performed to verify
the sanity and integrity of the bitmaps.
The new allocator should have some sort or magic words on the
boundaries of not just the bitmap, but also the boundaries of the
memory region that stores user memory. Additionally, a debug mode
could enable a more computationally intensive parity checksum on the
\section{Profiling Aids}
In \texttt{Profile Mode}, the \texttt{bitmap\_allocator} should be
able to provide some information about the locality of the memory
access patterns. As a first attempt, it could (in each Segment Tree)
compute the quotient:
$\dfrac{\#\ of\ allocated\ objects}{Total\ \#\ of\ objects\ between\ the\ first\ and\ the\ last\ allocated\ object}$
and determine the \textit{tightness} of the active set. The higher
this ratio (closer to $1$), the better the performance of the