GitHub - benvanik/nedmalloc: An EXTREMELY FAST portable thread caching malloc implementation written in C for multiple threads without lock contention based on dlmalloc. Optimised for x86 and x64. Compatible with C++. Can patch itself into existing binaries on Windows.

benvanik / nedmalloc Public
forked from ned14/nedmalloc
www.nedprod.com/programs/portable/nedmalloc/
BSL-1.0 license
.be
BEBugsAsHTML
nedmalloc.xcodeproj
nedtries @ dbbc2dd
unsupported
!GenCHM.bat
!MakeMSVCProjs.bat
._build
.gitignore
.gitmodules
Benchmarks.xlsx
Benchmarks_old.xls
Doxyfile
License.txt
Makefile
Readme.html
SConscript
SConstruct
ScalingTestResults.xlsx
doxygen.css
embedded_printf.c
embedded_printf.h
issue8.cpp
make_pgos.c
malloc.c.h
nedalloc.chm
nedmalloc.c
nedmalloc.h
nedmalloc_dll.rc
resource.h
scalingtest.cpp
test.c
test.cpp
unittests.cpp
unittests.vcproj
usermodepageallocator.c
usermodepageallocatortest.cpp
valgrind32.supp
winpatcher.c
winpatcher_errorh.h

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">

<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type" />
<title>nedalloc Readme</title>
<style type="text/css">
<!--
body {
	text-align: justify;
}
h1, h2, h3, h4, h5, h6 {
	margin-bottom: -0.5em;
}
h1 {
	text-align: center;
}
h2 {
	text-decoration: underline;
	margin-bottom: -0.25em;
}
p {
	margin-top: 0.5em;
	margin-bottom: 0.5em;
}
ul li, ol li {
	margin-top: 0.2em;
	margin-bottom: 0.2em;
}
dl {
	margin-left: 2em;
}
dl dt {
	font-weight: bold;
}
dt + dd {
	margin-bottom: 1em;
}
.gitcommit {
	font-family: "Courier New", Courier, monospace;
	font-size: smaller;
}
-->
</style>
</head>

<body>

<div style="text-align: center">
	<h1 style="text-decoration: underline">nedalloc v1.10 beta 4 (?)</h1>
	<h2 style="text-decoration: none;">by Niall Douglas</h2>
	<p>Web site: <a href="http://www.nedprod.com/programs/portable/nedmalloc/">http://www.nedprod.com/programs/portable/nedmalloc/</a></p>
	<hr /></div>
<p>Enclosed is nedalloc, an alternative malloc implementation for multiple threads 
without lock contention based on <a href="http://g.oswego.edu/" target="_blank">
dlmalloc</a> v2.8.4 and a specialised user mode page allocator (Windows Vista or 
later only). It has the following features:</p>
<ol>
	<li>A per-thread small block cache for maximum CPU scalability.</li>
	<li>A per-thread arena to minimise lock contention.</li>
	<li>The ability to patch Windows binaries to replace the C memory allocation 
	API (malloc(), realloc(), free() et al.) such that simply inserting nedmalloc.dll 
	into a process yields performance improvements without recompilation.</li>
	<li>On POSIX, it knows how to talk to valgrind so you can track memory 
	corruption and/or memory leaks.</li>
	<li>A unique user mode page allocator implementation which delivers O(1) scaling 
	for blocks of any size, including an O(1) very fast realloc(). Improves medium 
	sized block (~1Mb) allocation speeds by about 25 times on current hardware. 
	Requires Windows Vista or later only, and requires Administrator privileges 
	as well as either UAC disabled or a UAC prompt at the start of each program 
	run.</li>
    <li>A malloc v2 API which enables considerable improvements in efficiency by
    allowing client code to better inform the allocator on what (not) to do.</li>
	<li>An enhanced C++ STL allocator implementation to enable super-fast std::vector&lt;&gt;
	<strong>[unfinished]</strong></li>
</ol>
<p>It is licensed under the
<a href="http://www.boost.org/LICENSE_1_0.txt" target="_blank">Boost Software License</a> 
which basically means you can do anything you like with it. This does not apply 
to the malloc.c.h file which remains copyright to others. Commercial support is 
available from <a href="http://www.nedproductions.biz/" target="_blank">ned Productions 
Limited</a>.</p>
<p>It has been tested on win32 (x86), win64 (x64), Linux (x64), FreeBSD (x64) and 
Apple Mac OS X (x86). It works very well on all of these and is very significantly 
faster than the system allocator on Windows XP and FreeBSD &lt;v7. If you are using 
&gt;= 10.6 Apple Mac OS X or you are on Windows 7 or later then you probably won&#39;t 
see much improvement without modifying your source to use the v2 malloc API (and 
kudos to Apple and Microsoft for adopting excellent allocators).</p>
<p>The user mode page allocator returns jaw dropping real world performance improvements 
but requires running the process as the superuser. Without it, nedalloc still offers sizeable 
gains on all older operating systems and, through the v2 malloc API, modest gains 
on all very recent operating systems, especially in these situations:</p>
<ol>
	<li>If you are repeatedly extending large vector arrays, you will see a LARGE 
	improvement if you use the address space reservation features.</li>
	<li>If you do a lot of work with 16 byte aligned vectors e.g. SSE or AVX vector 
	arrays, you will find the v2 malloc API a godsend.</li>
</ol>
<p style="text-decoration: underline"><strong>Table of Contents: </strong></p>
<ol style="list-style-type: upper-alpha; position: relative; margin-top: -0.5em;">
	<li><a href="#touse">How to use</a><ul style="list-style-type: none; margin-left: 0; padding-left: 0">
		<li>A1. <a href="#CPPAPI">The C++ API</a></li>
		<li>A2. <a href="#v2mallocAPI">The v2 malloc C API</a></li>
	</ul>
	</li>
	<li><a href="#notes">Notes</a><ul style="list-style-type: none; margin-left: 0; padding-left: 0">
		<li>B1. <a href="#memorybloat">Memory Bloating</a></li>
		<li>B2. <a href="#memoryleaks">Memory Leakage</a></li>
		<li>B3. <a href="#threadcache">The Threadcache</a></li>
		<li>B4. <a href="#largepages">Large Page support</a></li>
		<li>B5. <a href="#logger">Memory operation logging</a></li>
		<li>B6. <a href="#windowsonly">Windows-only features</a></li>
	</ul>
	</li>
	<li><a href="#speedcomparisons">Speed Comparisons</a></li>
	<li><a href="#troubleshooting">Troubleshooting</a></li>
	<li><a href="#changelog">Changelog</a></li>
</ol>
<h2><a name="touse">A. To use:</a></h2>
<p>The quickest way is to drop nedmalloc.h, nedmalloc.c and malloc.c.h into your 
project. Call nedmalloc(), nedcalloc(), nedrealloc() and nedfree() instead of your 
normal allocator, or nedpmalloc(), nedpcalloc(), nedprealloc() and nedpfree() if 
you want to segment your memory usage into pools. Make sure that you call neddisablethreadcache() 
for every pool you use on thread exit, and don&#39;t forget neddisablethreadcache(0) 
for the system pool if necessary. Run and enjoy!</p>
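<p>As a rough illustration of the drop-in approach described above, the sketch below 
routes a program&#39;s allocations through a few wrapper macros. The USE_NEDMALLOC switch, 
the app_* macros and both helper functions are hypothetical names invented for this 
example; only nedmalloc(), nedrealloc(), nedfree() and neddisablethreadcache() come 
from nedmalloc.h. Without the switch the wrappers fall back to the system allocator, 
so the sketch compiles on its own.</p>

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Build with -DUSE_NEDMALLOC (a hypothetical switch) and add nedmalloc.c to
   the project to route these wrappers to nedalloc. Otherwise they fall back
   to the system allocator so that the sketch compiles anywhere. */
#ifdef USE_NEDMALLOC
#include "nedmalloc.h"
#define app_malloc(n)      nedmalloc(n)
#define app_realloc(p, n)  nedrealloc((p), (n))
#define app_free(p)        nedfree(p)
#define app_thread_exit()  neddisablethreadcache(0) /* 0 = the system pool */
#else
#define app_malloc(n)      malloc(n)
#define app_realloc(p, n)  realloc((p), (n))
#define app_free(p)        free(p)
#define app_thread_exit()  ((void)0)
#endif

/* Returns a heap copy of `text` with `suffix` appended, or NULL on failure. */
char *concat_copy(const char *text, const char *suffix)
{
    size_t text_len = strlen(text);
    size_t suffix_len = strlen(suffix);
    char *buf = app_malloc(text_len + 1);
    if (buf == NULL)
        return NULL;
    memcpy(buf, text, text_len + 1);

    /* Grow the block; keep the old pointer until the realloc succeeds. */
    char *grown = app_realloc(buf, text_len + suffix_len + 1);
    if (grown == NULL) {
        app_free(buf);
        return NULL;
    }
    memcpy(grown + text_len, suffix, suffix_len + 1);
    return grown;
}

/* Frees a block with the matching allocator, then releases this thread's
   cache for the system pool, as recommended for threads about to exit. */
void finish_thread(char *owned)
{
    app_free(owned);
    app_thread_exit();
}
```

<p>Funnelling every allocation through one set of wrappers also makes it harder to 
pass a nedalloc block to the system free(), or the reverse, by accident.</p>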
<p>To test, compile <a href="test.c">test.c</a> (C) and <a href="test.cpp">test.cpp</a> 
(C++). Both will run a comparison between your system allocator and nedalloc and 
tell you how much faster nedalloc is. They also serve as examples of usage.</p>
<p>If you&#39;d like nedalloc as a Windows DLL or POSIX ELF shared object, the easiest 
thing to do is to use <a href="http://www.scons.org/" target="_blank">scons</a>, 
which comes with a myriad of build options listed using scons -h. <b>If you want 
to build some MSVC project files for use with Microsoft Visual Studio</b>, do the 
following:</p>
<ol>
	<li>Install <a href="http://www.python.org/" target="_blank">python</a>.</li>
	<li>Install <a href="http://www.scons.org/" target="_blank">scons</a>.</li>
	<li>Open a Visual Studio Command Prompt for the Visual Studio you wish to use via 
	Start Menu =&gt; Programs =&gt; Microsoft Visual Studio XXXX =&gt; Visual Studio Tools 
	=&gt; Visual Studio XXXX Command Prompt.</li>
	<li>Change directory to the nedmalloc directory (e.g. by dragging in its folder).</li>
	<li>Type &quot;!MakeMSVCProjs&quot; and hit Return.</li>
</ol>
<p>Note that for Visual Studio 2008 and later support you need scons v2.1 or later.</p>
<p>nedalloc comes with two new memory allocator APIs: one is for C++, and the other 
is for C. <strong>Full documentation</strong> for all nedalloc&#39;s APIs and features 
is provided in the enclosed <a href="nedalloc.chm">nedalloc.chm</a> which is in 
Microsoft HTML Help format (Linux and Apple Mac OS X will happily read this format 
too). If you don&#39;t want to use the CHM documentation, <a href="nedmalloc.h">nedmalloc.h</a> 
is extensively commented with <a href="http://www.doxygen.org/" target="_blank">
doxygen markup</a>.</p>
<h3><a name="CPPAPI">A1: The C++ API:</a></h3>
<p>For the v1.10 release which was generously sponsored by
<a href="http://www.ara.com/" target="_blank">Applied Research Associates (USA)</a>, 
a C++ metaprogrammed STL allocator was designed which makes use of advanced nedalloc 
features to remedy many of the long standing problems and inefficiencies caused 
by C++&#39;s traditional over-fondness for copying things. While its implementation 
is complex, usage is extremely easy - simply supply nedallocator&lt;&gt; as the custom 
allocator to STL container classes.</p>
<p>As nedmalloc can do even better for vector extension, nedmalloc.h also contains 
a nedvector&lt;&gt; implementation which is the standard STL vector&lt;&gt; implementation except 
that it makes use of the non-relocating facilities of realloc2() (see below). This 
means nedvector&lt;&gt; does not need to overallocate memory (most STL vector&lt;&gt; implementations 
overallocate by 50% or more), which saves a lot of memory and <strong>completely 
avoids the array copy construction</strong> that makes std::vector&lt;&gt;::resize() so 
very, very slow.</p>
<p>Even without nedalloc&#39;s major speed improvements as a simple C style allocator, 
the improvements to the C++ memory infrastructure alone can generate huge performance 
gains.</p>
<h3><a name="v2mallocAPI">A2: The v2 malloc C API:</a></h3>
<p><strong>[Note: This API will be completely replaced in v1.2]</strong></p>
<p>For the v1.10 release which was generously sponsored by
<a href="http://www.ara.com/" target="_blank">Applied Research Associates (USA)</a>, 
a new general purpose allocator API was designed which is intended to remedy many 
of the long standing problems and inefficiencies introduced by the ISO C allocator 
API. Internally nedalloc&#39;s implementations of nedmalloc(), nedcalloc(), nedmemalign() 
and nedrealloc() all call into this API:</p>
<ul>
	<li><code>void* malloc2(size_t bytes, size_t alignment, unsigned flags)</code></li>
	<li><code>void* realloc2(void* mem, size_t bytes, size_t alignment, unsigned 
	flags)</code></li>
</ul>
<p>If nedmalloc.h is being included by C++ code, the alignment and flags parameters 
default to zero which makes the new API identical to the old API (roll on the introduction 
of default parameters to C!). The ability for realloc2() to take an alignment is
<em>particularly</em> useful for extending aligned vector arrays such as SSE/AVX 
vector arrays. Hitherto SSE/AVX vector code had to jump through all sorts of unpleasant 
hoops to maintain alignment during array extension :(.</p>
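<p>To show what those hoops look like, here is a minimal sketch of growing an aligned 
array using only ISO C11: allocate a new aligned block, copy, then free the old one. 
aligned_grow(), round_up() and is_aligned() are hypothetical helpers written for this 
illustration and are not part of nedalloc. The forced copy, and the need for the caller 
to remember the old size, are the costs an alignment-aware realloc2() is meant to 
remove.</p>

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Round `n` up to a multiple of `alignment` (which must be a power of two). */
static size_t round_up(size_t n, size_t alignment)
{
    return (n + alignment - 1) & ~(alignment - 1);
}

/* Grow an aligned block the plain ISO C way: allocate, copy, free.
   `old_bytes` must be tracked by the caller because ISO C offers no way
   to query it. Returns NULL, leaving `mem` untouched, on failure. */
void *aligned_grow(void *mem, size_t old_bytes, size_t new_bytes,
                   size_t alignment)
{
    /* C11 aligned_alloc() wants the size to be a multiple of the alignment. */
    void *fresh = aligned_alloc(alignment, round_up(new_bytes, alignment));
    if (fresh == NULL)
        return NULL;
    if (mem != NULL) {
        memcpy(fresh, mem, old_bytes < new_bytes ? old_bytes : new_bytes);
        free(mem);
    }
    return fresh;
}

/* Returns non-zero if `p` sits on an `alignment`-byte boundary. */
int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p & (alignment - 1)) == 0;
}
```

<p>Every call copies the whole array, so repeatedly extending a large SSE/AVX buffer 
this way costs time proportional to its size on each extension.</p>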
<p>The flags supported include the ability to zero memory, to prevent realloc2() 
from moving a memory block, to force mmap() to be used from the beginning (useful 
when you know an array will be repeatedly extended) and to cause malloc2() to reserve 
additional address space after the allocation such that a realloc2() up to that 
reserved space will be very quick. On 32 bit Windows and Linux this reservation 
costs no address space in your process, so using it will NOT cause premature address 
space exhaustion.</p>
<p>You should note that realloc()&#39;s thunk to realloc2() defaults the flags to M2_RESERVE_MULT(8), 
i.e. if realloc() needs to allocate a block larger than mmap_threshold, it will 
also reserve eight times the address space of that allocation in order to make future 
realloc() calls up to that point much faster. This catches the vast majority of situations 
where large arrays are repeatedly extended.</p>
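<p>The reservation idea can be illustrated on POSIX systems with plain mmap() and 
mprotect(). This is a sketch of the general technique, assuming a Linux-like mmap() 
with MAP_ANONYMOUS, and is not nedalloc&#39;s actual implementation; the region type 
and the region_* functions are hypothetical names. Address space is reserved as inaccessible 
pages up front, and growing the block only changes page protections, so nothing is 
copied and the base address stays put.</p>

```c
#define _DEFAULT_SOURCE /* exposes MAP_ANONYMOUS on glibc in strict modes */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* A growable region: `reserved` bytes of address space, `committed` usable. */
typedef struct {
    unsigned char *base;
    size_t committed;
    size_t reserved;
} region;

/* Round `n` up to a whole number of pages. */
static size_t page_round(size_t n)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    return (n + page - 1) / page * page;
}

/* Reserve `mult` times the initial size as inaccessible address space, then
   make only the first `bytes` usable. Returns 0 on success, -1 on failure.
   (Overflow of commit * mult is not checked in this sketch.) */
int region_create(region *r, size_t bytes, size_t mult)
{
    size_t commit = page_round(bytes);
    size_t reserve = commit * mult;
    void *p = mmap(NULL, reserve, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    if (mprotect(p, commit, PROT_READ | PROT_WRITE) != 0) {
        munmap(p, reserve);
        return -1;
    }
    r->base = p;
    r->committed = commit;
    r->reserved = reserve;
    return 0;
}

/* Extend in place: no copy, and the base address never changes.
   Returns -1 once the request exceeds the reservation. */
int region_grow(region *r, size_t new_bytes)
{
    size_t commit = page_round(new_bytes);
    if (commit > r->reserved)
        return -1;
    if (commit > r->committed) {
        if (mprotect(r->base, commit, PROT_READ | PROT_WRITE) != 0)
            return -1;
        r->committed = commit;
    }
    return 0;
}

/* Give the whole reservation back to the system. */
void region_destroy(region *r)
{
    munmap(r->base, r->reserved);
    r->base = NULL;
    r->committed = 0;
    r->reserved = 0;
}
```

<p>With a multiplier of 8, as in M2_RESERVE_MULT(8), a region created at 1Mb can be 
extended up to 8Mb before a move to a new address becomes necessary.</p>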
<h2><a name="notes">B. Notes:</a></h2>
<p>If you want the very latest version of this allocator, get it from the nedmalloc 
Git repository at either of the following (both are identical mirrors):</p>
<ul>
	<li>
	<a href="git://nedmalloc.git.sourceforge.net/gitroot/nedmalloc/nedmalloc">git://nedmalloc.git.sourceforge.net/gitroot/nedmalloc/nedmalloc</a></li>
	<li><a href="git://github.com/ned14/nedmalloc.git">git://github.com/ned14/nedmalloc.git</a></li>
</ul>
<p>IF YOU THINK YOU HAVE FOUND A BUG, PLEASE CHECK ONE OF THESE REPOS FIRST BEFORE 
REPORTING IT!</p>
<h3><a name="memorybloat">B1: Memory Bloating</a></h3>
<p>Because of how nedalloc allocates an mspace per thread, it <em>can</em> cause 
severe bloating of memory usage under certain allocation patterns. You can substantially 
reduce this wastage by setting DEFAULTMAXTHREADSINPOOL or the threads parameter 
to nedcreatepool() to a fraction of the number of threads which would normally be 
in a pool at once. This will reduce bloating at the cost of an increase in lock 
contention, with DEFAULTMAXTHREADSINPOOL=1 removing almost all bloating. If the 
block sizes typically allocated are less than THREADCACHEMAX, locking is avoided 
90-99% of the time and if most of your allocations are below this value, you can 
safely set DEFAULTMAXTHREADSINPOOL or even MAXTHREADSINPOOL to one.</p>
<p>If you have LOTS of threads you may find that the threadcache held per thread 
is causing memory bloating. You can call nedtrimthreadcache() to trim the cache 
in a thread when you know that it won&#39;t be doing memory allocation (e.g. just before 
going to sleep), or alternatively you can set THREADCACHEMAXFREESPACE to something 
smaller than its default of 1Mb.</p>
<p>Lastly, some people find that memory is not returned to the system when they 
think it ought to be. dlmalloc only returns free memory to the system when there 
is DEFAULT_TRIM_THRESHOLD (default=2Mb) free in an mspace, and it only checks how 
much there is free outside the topmost segment every MAX_RELEASE_CHECK_RATE calls to free(). 
In other words, if your program very rapidly deallocates an awful lot of memory 
and then does not call free() for some time thereafter, dlmalloc will not release 
memory to the system. Generally in any real world code scenario free() will be called 
fairly frequently, and if not then you can always force release using nedmalloc_trim().</p>
<h3><a name="memoryleaks">B2: Memory Leakage</a></h3>
<p>You will suffer memory leakage unless you call neddisablethreadcache() per pool 
for every thread which exits (unless you are using nedalloc from its DLL on Windows). 
This is because nedalloc cannot portably know when a thread exits and thus when 
its thread cache can be returned for use by other code. Don&#39;t forget pool zero, 
the system pool. On some POSIX threads implementations there exists a pthread_atexit() 
which registers a termination handler for thread exit - if you don&#39;t have one of 
these then you&#39;ll have to do it manually.</p>
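<p>Where no thread-exit hook is available, one portable alternative is a POSIX thread-specific 
data key, whose destructor runs as each thread exits. The sketch below shows only that 
mechanism and makes no claim about how nedalloc itself handles thread exit. on_thread_exit() 
is a stand-in which merely counts invocations; in a real program it is where neddisablethreadcache() 
would be called for every pool the thread used, followed by neddisablethreadcache(0) 
for the system pool. All other names are hypothetical, and whether the allocator is 
safe to call at that point in your threading library is an assumption worth testing.</p>

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static pthread_key_t exit_key;
static pthread_once_t exit_key_once = PTHREAD_ONCE_INIT;
static atomic_int cleanups_run;

/* Stand-in for the real per-thread cleanup (see the text above). */
static void on_thread_exit(void *unused)
{
    (void)unused;
    atomic_fetch_add(&cleanups_run, 1);
}

static void make_exit_key(void)
{
    pthread_key_create(&exit_key, on_thread_exit);
}

/* Call once near the start of every thread that allocates memory. The
   destructor only fires for threads whose key value is non-NULL. */
void register_thread_cleanup(void)
{
    pthread_once(&exit_key_once, make_exit_key);
    pthread_setspecific(exit_key, &exit_key);
}

static void *worker(void *arg)
{
    (void)arg;
    register_thread_cleanup();
    /* ... allocate and free memory here ... */
    return NULL;
}

/* Starts up to 16 worker threads, joins them all, and returns how many
   exit handlers ran. */
int run_workers(int count)
{
    pthread_t threads[16];
    int started = 0;

    if (count > 16)
        count = 16;
    atomic_store(&cleanups_run, 0);
    for (int i = 0; i < count; i++) {
        if (pthread_create(&threads[started], NULL, worker, NULL) != 0)
            break;
        started++;
    }
    for (int i = 0; i < started; i++)
        pthread_join(threads[i], NULL);
    return atomic_load(&cleanups_run);
}
```

<p>Threads which never call register_thread_cleanup() are not covered, so the registration 
belongs in whatever common entry point your threads share.</p>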
<p>Equally if you use nedalloc from a dynamically loaded DLL or shared object which 
you later kick out of memory, you will leak memory if you don&#39;t disable all thread 
caches for all pools (as per the preceding paragraph), destroy all thread pools 
using neddestroypool() and destroy the system pool using neddestroysyspool().</p>
<h3><a name="threadcache">B3: The Threadcache</a></h3>
<p>For C++ type allocation patterns (where the same small sizes of memory are regularly 
allocated and deallocated as objects are created and destroyed), the threadcache 
always benefits performance as it will cache all malloc/free allocations under THREADCACHEMAX 
in size. If however your allocation patterns are different, searching the threadcache 
may significantly slow down your code - as a rule of thumb, if cache utilisation 
is below 80% (see the source for neddisablethreadcache() for how to enable debug 
printing in release mode) then you should disable the thread cache for that thread. 
You can compile out the threadcache code by setting THREADCACHEMAX to zero.</p>
<h3><a name="largepages">B4: Large Page support</a></h3>
<p>For some applications defining ENABLE_LARGE_PAGES can give a 10-15% performance 
increase by having nedalloc allocate using large pages only (which are 2Mb on x86/x64). 
Large pages take much less space in the TLB cache and can greatly benefit programs 
with a large working set, particularly on 64 bit systems.</p>
<p>Support for large pages is limited to Linux and Windows. On Linux one must employ 
the libhugetlbfs library anyway as this is the &quot;official&quot; form of large page support, 
and setting it up and configuring it involves mounting a special hugetlbfs filing 
system. dlmalloc does not require a dependency on the libhugetlbfs headers, rather 
it searches for the library in the current process and if not found it silently 
disables support.</p>
<p>On Windows, large page support is only implemented on Windows Server 2003/Vista 
or later and they are only permitted to be allocated by users holding the &quot;Lock 
pages in memory&quot; local security setting which is DISABLED by default. Furthermore, 
the process using nedalloc must hold the SeLockMemoryPrivilege privilege. If you 
are using the DLL then the DLL attempts to enable the SeLockMemoryPrivilege during 
initialisation - therefore if you are not using the DLL you will have to do this 
manually yourself. As with Linux support, if at any stage large pages cannot be 
allocated, then dlmalloc silently disables support - this allows one binary to function 
correctly in any environment. <strong>Note that on Windows</strong> if your process 
allocates a lot of memory at once when the machine has been running for an extended 
period, then the whole computer may hang for several seconds as the Windows kernel 
copies memory around in order to coalesce a large page. This is a problem with the 
Windows kernel and its VM design, not nedmalloc! If you would like to see how large 
pages ought to be implemented, research how FreeBSD implemented them.</p>
<h3><a name="logger">B5: Memory operation logging</a></h3>
<p>It is often very useful to have a log of the memory operations which an application 
performs - you would be amazed at the inefficiencies in memory usage that this can 
reveal. nedalloc contains a very fast memory operation logger which keeps a per-thread 
log of selected operations, including an optional stack backtrace. On pool destruction, 
or nedflushlogs(), nedalloc will write out the log as a Comma Separated Value format 
file which can be loaded into applications such as Excel for analysis.</p>
<p>To use, define ENABLE_LOGGING to the bitmask of enum LogEntryType items in which 
you are interested, so 0xffffffff would log absolutely everything. The macro NEDMALLOC_TESTLOGENTRY, 
whose default is (ENABLE_LOGGING &amp; logentrytype), is then used to determine which 
items should be logged. You can also enable stack backtracing on MSVC and GCC using 
NEDMALLOC_STACKBACKTRACEDEPTH.</p>
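<p>The masking described above can be sketched as follows. The DEMO_* names and their 
bit values are hypothetical and chosen purely for illustration; the real enum LogEntryType 
and its values are defined in the nedalloc sources and should be consulted before building 
an actual ENABLE_LOGGING mask.</p>

```c
#include <assert.h>

/* Hypothetical one-bit-per-operation values, for illustration only. */
enum demo_log_entry_type {
    DEMO_LOG_MALLOC  = 1 << 0,
    DEMO_LOG_REALLOC = 1 << 1,
    DEMO_LOG_FREE    = 1 << 2,
    DEMO_LOG_TRIM    = 1 << 3
};

/* Select malloc and free only; 0xffffffff would select everything. */
#define DEMO_ENABLE_LOGGING (DEMO_LOG_MALLOC | DEMO_LOG_FREE)

/* Same shape as the default test quoted above: mask AND entry type. */
#define DEMO_TESTLOGENTRY(logentrytype) (DEMO_ENABLE_LOGGING & (logentrytype))

/* Returns non-zero if an operation of the given type would be logged. */
int should_log(unsigned logentrytype)
{
    return DEMO_TESTLOGENTRY(logentrytype) != 0;
}
```

<p>Because the mask is a compile-time constant, a compiler can usually remove the 
logging branch entirely for entry types that are not selected.</p>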
<h3><a name="windowsonly">B6: Windows-only features</a></h3>
<p>If you are running on Windows, there are quite a few extra options available 
thanks to work generously sponsored by
<a href="http://www.ara.com/" target="_blank">Applied Research Associates (USA)</a>:</p>
<dl>
	<dt>Automatic threadcache cleanup and log output</dt>
	<dd>If you build nedalloc as a DLL and link that into your application, then 
	the DLL can trap thread exits in your application and call neddisablethreadcache() 
	on all currently existing nedpool&#39;s for you. On process exit, the DLL will also 
	call nedflushlogs() for you on all still extant nedpool&#39;s.</dd>
	<dt>Replacing the system allocator in the whole process</dt>
	<dd>
	<p>If you define REPLACE_SYSTEM_ALLOCATOR when building the DLL then the DLL 
	will replace <em>most</em> usage of the MSVCRT allocator (release MSVCRT,<strong> 
	not</strong> debug MSVCRTD)<strong> </strong>within any process it is loaded 
	into with nedalloc&#39;s routines instead, whilst remaining able to handle the odd 
	free() of a MSVCRT allocated block allocated during CRT init. This very conveniently 
	allows you to simply link with the nedalloc DLL and your application magically 
	now uses it with no code changes required, and because the MSVC implementation 
	of operators new and delete both call malloc() and free() it also covers all 
	C++ code. The following code is suggested:</p>
	<code>#pragma comment(lib, &quot;nedmalloc.lib&quot;)</code>
	<p>This asks the linker to link against nedmalloc.lib during linking - without 
	this pragma the linker will generally leave out nedmalloc as there are no explicitly 
	imported routines that it understands. This auto-patching feature can also be 
	combined with
	<a href="http://research.microsoft.com/en-us/projects/detours/" target="_blank">
	Microsoft&#39;s Detours</a> to run any arbitrary application using nedalloc instead 
	of the system allocator:</p>
	<code>withdll /d:nedmalloc.dll program.exe</code>
	<p>For those not able to use Microsoft Detours, there is an enclosed unsupported/nedmalloc_loader 
	program which does one variant of the same thing. It may or may not be useful 
	to you - it is not intended to be maintained, and it probably doesn&#39;t work on 
	newer systems.</p>
	<p>The reason that only the release MSVCRT not the debug MSVCRTD is patched 
	is twofold: (i) usually one <em>wants</em> the debug heap in debug builds so 
	it does memory corruption checking and reports memory leaks and (ii) the MSVC 
	CRT actually implements operator new and malloc using a completely different 
	implementation based on the Windows kernel HeapAlloc() function and it does 
	a lot of hoop jumping to handle mismatching CRT versions and lots of other stuff. 
	You can enable patching of the debug memory allocation functions in winpatcher.c 
	by uncommenting the relevant lines.</p>
	</dd>
	<dt>User mode page allocation</dt>
	<dd>The user mode page allocator is a user space implementation of kernel memory 
	page allocation made possible by misusing the Address Windowing Extensions (AWE) 
	provided by newer versions of Microsoft Windows. AWE allows - with a bit of 
	persuasion - direct control of the Memory Management Unit of the CPU, thus allowing 
	memory pages to be arbitrarily remapped from one address to another. The user 
	mode page allocator can therefore allocate memory in microseconds by simply 
	mapping it into where it needs to be, or it can realloc() gigabytes of memory 
	from its old location into a new bigger space in microseconds. This O(1) scaling 
	gives processes running on the user mode page allocator an <strong>unholy</strong> 
	speed increase which gets exponentially better the larger the data set.<br />
	<br />
    Want to know more in lots of detail? Here are two academic papers on the topic:
    <ol>
      <li>Douglas, N, (2011-May), '<a href="http://arxiv.org/abs/1105.1815">User Mode Memory Page Management: An old idea applied anew to the memory wall problem</a>', ArXiv e-prints, vol: 1105.1815.</li>
      <li>Douglas, N, (2011-May), '<a href="http://arxiv.org/abs/1105.1811">User Mode Memory Page Allocation: A Silver Bullet For Memory Allocation?</a>', ArXiv e-prints, vol: 1105.1811.</li>
    </ol>
  </dd>
</dl>
<h2><a name="speedcomparisons">C. Speed comparisons:</a></h2>
<p>See Benchmarks.xlsx for details.</p>
<p>The enclosed test.c can do one of two things: it can be a torture test which 
mostly hammers realloc() or it can be a pure speed test which sticks to simple malloc() 
and free(). If you enable C++ mode, half of the allocation sizes will be a power 
of two less than 512 bytes (to mimic C++ stack instantiated objects), as such sizes are 
extremely common in C++ code.</p>
<p>The torture test is designed to mercilessly work realloc() which is the most 
complex and complete code path in any memory allocator. Most allocators have
<strong>very</strong> poor realloc() performance - not so nedalloc which makes use 
of mremap() support on Linux and Windows. Even without mremap() support nedalloc&#39;s 
realloc() tends to be significantly faster than any standard allocator.</p>
<p>The speed test is designed to be a representative synthetic memory allocator 
test where most allocations follow a stack pattern. It works by randomly mixing 
allocations with frees with sizes being a random value less than 16Kb. </p>
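<p>A stripped-down sketch of that kind of test is shown below. It is not the code from 
test.c: the function names, the slot count and the small random number generator are 
all assumptions made for this illustration. It mixes malloc() and free() at random, 
always freeing the newest live block so that the pattern is stack-like, with every 
size below 16Kb.</p>

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SLOTS 256                 /* maximum number of live blocks */
#define MAX_BLOCK (16 * 1024)     /* sizes stay below 16Kb */

/* Small deterministic generator so that runs are repeatable. */
static unsigned next_random(unsigned *state)
{
    *state = *state * 1664525u + 1013904223u;
    return *state >> 8;
}

/* Performs up to `ops` randomly mixed malloc()/free() calls on a stack of
   live blocks and returns how many allocations succeeded. Every block is
   freed before returning. */
size_t mixed_alloc_free(size_t ops, unsigned seed)
{
    void *live[SLOTS];
    size_t depth = 0, allocated = 0;
    unsigned state = seed;

    for (size_t i = 0; i < ops; i++) {
        int want_alloc = (next_random(&state) & 1) != 0;
        if (depth == 0 || (want_alloc && depth < SLOTS)) {
            size_t bytes = 1 + next_random(&state) % (MAX_BLOCK - 1);
            void *p = malloc(bytes);
            if (p == NULL)
                break;
            memset(p, 0xA5, bytes);   /* touch the memory */
            live[depth++] = p;
            allocated++;
        } else {
            free(live[--depth]);      /* stack pattern: newest block first */
        }
    }
    while (depth > 0)
        free(live[--depth]);
    return allocated;
}
```

<p>To turn this into a comparison, time the same call with malloc() and free() swapped 
for nedmalloc() and nedfree(), using the same seed for both runs.</p>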
<p>The C++ test.cpp simply benchmarks how much difference nedalloc::nedallocatorise&lt;&gt; 
makes to std::vector&lt;&gt; performance, particularly the performance of push_back(), 
pop_back() and vector assignment all of which are very common in real world code. 
As you will see, the STL - even with C++0x move constructor support - does not perform 
anywhere close to nedalloc&#39;s version which achieves its gains by simply avoiding 
copy and move construction completely.</p>
<p>The real world code results are from Tn&#39;s TestIO benchmark. This is a heavily 
multithreaded and memory intensive benchmark with a lot of branching and other stuff 
modern processors don&#39;t like so much. As you&#39;ll note, the test doesn&#39;t show the 
benefits of the threadcache mostly due to the saturation of the memory bus being 
the limiting factor.</p>
<h2><a name="troubleshooting">D. Troubleshooting:</a></h2>
<p>I get quite a few bug reports about code not working properly under nedalloc. 
I do not wish to sound presumptuous, however in an overwhelming majority of cases 
the problem is in your application code and not nedalloc (see below for all the 
bugs reported and fixed since 2006). Some of the largest corporations and IT deployments 
in the world use nedalloc pre-v1.10, and pre-v1.10 has been very heavily stress 
tested on everything from 32 processor SMP clusters right through to root DNS servers, 
ATM networks and embedded operating systems requiring a very high uptime. 
The v1.10 release adds a LOT of new code and features, and hence there are quite 
likely a lot of new bugs in the new code.</p>
<p>In particular, the fact that your application appears to work under the system 
allocator does not mean that it is free of memory corruption 
and non-ANSI usage of the API! And<strong> this is usually not your code&#39;s fault: 
more often it is the third party libraries being used, which sadly often 
include system libraries</strong>.</p>
<p>Even though debugging an application for memory errors is a true black art made 
possible only with a great deal of patience, intuition and skill, here is a checklist 
for things to do before reporting a bug in nedalloc:</p>
<ol>
	<li>Make SURE you try nedalloc from GIT HEAD. For around six months of 2007 
	I kept getting the same report of a bug long fixed in GIT HEAD.</li>
	<li>Make SURE you try nedalloc v1.06. If it works in v1.06 but isn&#39;t working 
	in nedalloc &gt;= v1.10, then it&#39;s probably a bug in the new code (please report 
	it to me!)</li>
	<li>Make use of nedalloc&#39;s internal debug routines. Try turning on full sanity 
	checks by #define FULLSANITYCHECKS 1. Also make use of all the assertion checking 
	performed when DEBUG is defined as 1. A lot of bug reports are made before running 
	under a debug build where an assertion trip clearly showed the problem. Lastly, 
	try changing the thread cache by #defining THREADCACHEMAX - this fundamentally 
	changes how the memory allocator behaves: if everything is fine with the thread 
	cache fully on or fully off, then this strongly suggests the source of your 
	problem.</li>
	<li>Make SURE you are matching allocations and frees belonging to nedalloc if 
	you are not defining REPLACE_SYSTEM_ALLOCATOR. Attempting to free a block not 
	allocated by nedalloc will end badly, similarly passing one of nedalloc&#39;s blocks 
	to another allocator will likely also end badly. I have inserted as many assertion 
	and debug checks for this possibility as I can think of (further suggestions 
	are welcome), but no system can ever be watertight. If you&#39;re using C++, make 
	use of the C++ nedallocatorise API provided or else use some form of strong 
	template type system to have the compiler guarantee membership of a memory pointer 
	- see <a href="http://www.boost.org/" target="_blank">the Boost libraries</a>, 
	or indeed <a href="http://www.nedprod.com/TnFOX/" target="_blank">my own TnFOX 
	portability toolkit</a>.</li>
	<li>If you&#39;re still having problems, or more likely your code runs absolutely 
	fine under debug builds but trips up under release which suggests a timing bug, 
	it is time to deploy heavyweight tools. Under Linux, you should use
	<a href="http://valgrind.org/" target="_blank">valgrind</a>. Under Windows, 
	there is an excellent commercial tool called
	<a href="http://www.glowcode.com/" target="_blank">Glowcode</a>. Any programming 
	team serious about quality should ALWAYS run their projects through these tools 
	before each and every release anyway - you would be amazed at what you miss 
	during all other testing.</li>
	<li>Lastly, in the worst case scenario, consider hiring in a memory debugging 
	expert. There are quite a few on the market and they often are authors of memory 
	allocators. <a href="http://www.malloc.de/en/" target="_blank">Wolfram Gloger 
	(the author of ptmalloc) provides consulting services</a>.
	<a href="http://www.nedproductions.biz/" target="_blank">My own consulting company 
	ned Productions Limited</a> may be able to provide such a service depending 
	on our current workload.</li>
</ol>
<p>I hope that these tips help. And I urge anyone considering simply dropping back 
to the system allocator as a quick fix to reconsider: squashing memory bugs often 
brings with it <strong>significant</strong> extra benefits in performance and reliability. 
It may cost what appears to be a lot extra now, but it usually will save itself 
many times its cost over the next few years. I know of one large multinational corporation 
who <strong>saved hundreds of millions of dollars</strong> due to the debugging 
of their system software performed when trying to get it working with nedalloc - 
they found one bug in nedalloc but over a hundred in their own code, and in the 
process improved performance <strong>threefold</strong> which saved an expensive 
hardware upgrade and deployment. The conclusion can only be that fixing memory bugs 
now tends to be worth it in the long run.</p>
<h2><a name="changelog">E. ChangeLog:</a></h2>
<h3>v1.10 beta 4 ?:</h3>
<ul>
	<li><span class="gitcommit">[master 726d9c7]</span> Fixed memory corruption
	introduced when creating more than two nedpool's (issue #7). Thanks to mxmauro
	for reporting this.</li>
	<li><span class="gitcommit">[master c191ea9]</span> Merged dlmalloc v2.8.6.</li>
	<li><span class="gitcommit">[master 06f1c70]</span> Added support for clang, 
	plus fixed up some compile errors in C++11.</li>
	<li><span class="gitcommit">[master 8f8256c]</span> Added support for 
	valgrind instrumentation so valgrind can track programs using nedmalloc.</li>
	<li><span class="gitcommit">[master 69825ca]</span> Fixed issue #8 where 
	memory allocated via the independent_*() functions was being incorrectly 
	identified as system allocated. Thanks to Geri for reporting this.</li>
	<li><span class="gitcommit">[master a6a0dec]</span> Fixed issue #10 where 
    a failure to allocate memory on POSIX was not being trapped correctly. Thanks
    to btaudul for reporting this.</li>
</ul>
<h3>v1.10 beta 3 17th July 2012:</h3>
<ul>
	<li><span class="gitcommit">[master 5f26c1a]</span> Due to a bug introduced
    in sha 7a9dd5c (17th April 2010), nedmalloc has never allocated more than a
    single mspace when using the system pool. This had effectively disabled
    concurrency for any allocation &gt; THREADCACHEMAX (8KB) which no doubt made
    nedmalloc v1.10 betas 1 and 2 appear no faster than system allocators. My
    thanks to the eagle eyes of Gavin Lambert for spotting this.</li>
</ul>
<h3>v1.10 beta 2 10th July 2012:</h3>
<ul>
	<li><span class="gitcommit">[master 51ab2a2]</span> scons now tests for C++0x
	support before turning it on and tries multiple libraries for clock_gettime()
	rather than assuming it lives in librt. This ought to fix miscompilation on
	Mac OS X. Thanks to Robert D. Blanchet Jr. for reporting this.</li>
	<li><span class="gitcommit">[master b2c3517]</span> Mac OS X declares the
    parameter of malloc_size() as const void *ptr, not void *ptr</li>
	<li><span class="gitcommit">[master 9333e50]</span> Updated to use the new
    O(1) Cfind(rounds=1) feature in nedtries</li>
	<li><span class="gitcommit">[master 54c7e44]</span> Avoid overflowing allocation
    size. Thanks to Xi Wang for supplying a patch fixing this.</li>
	<li><span class="gitcommit">[master 5b614a0]</span> Removed __try1 and __finally1
    from MinGW support as x64 target no longer supports SEH. Thanks to Geri for
    reporting this.</li>
	<li><span class="gitcommit">[master 48f1aa9]</span> Tidied up bitrot which 
	had broken compilation due to mismatched #if...#endif.</li>
</ul>
<h3>v1.10 beta 1 19th May 2011:</h3>
<ul>
	<li><span class="gitcommit">[master 89f1806]</span> Moved from SVN to GIT. Bumped 
	version to v1.10 as new ARA contract will involve significant further improvements 
	mainly centering around realloc() performance.</li>
	<li><span class="gitcommit">[master 254fe7c]</span> Added nedmemsize() for API 
	compatibility with other allocators. Added DEFAULTMAXTHREADSINPOOL and set it 
	to FOUR which is a BREAKING CHANGE from previous versions of nedalloc (which 
	set it to 16).</li>
	<li><span class="gitcommit">[nedmalloc_fast_realloc 97d1420]</span> Added win32mremap() 
	implementation.</li>
	<li><span class="gitcommit">[nedmalloc_fast_realloc 8a1001e]</span> Significantly 
	improved test.c with new test options TESTCPLUSPLUS, BLOCKSIZE, TESTTYPE and 
	MAXMEMORY.</li>
	<li><span class="gitcommit">[nedmalloc_fast_realloc 7ea606d]</span> Implemented 
	two variants of direct mremap() on Windows, one using file mappings and the 
	other using over-reservation. The former is used on 32 bit and the latter on 
	64 bit.</li>
	<li><span class="gitcommit">[nedmalloc_fast_realloc 26ff9a7]</span> Added the 
	malloc2() interface to nedalloc.</li>
	<li><span class="gitcommit">[nedmalloc_fast_realloc 5bc5d97]</span> Rewrote 
	Readme.txt to become Readme.html which makes it much clearer to read.</li>
	<li><span class="gitcommit">[nedmalloc_fast_realloc 2efa595]</span> Added doxygen 
	markup to nedmalloc.h and a first go at a policy driven STL allocator class.</li>
	<li><span class="gitcommit">[nedmalloc_fast_realloc d851bde]</span> Added a 
	CHM documenting the nedalloc API.</li>
	<li><span class="gitcommit">[nedmalloc_fast_realloc dbd3991]</span> Added a 
	fast malloc operations logger which outputs a CSV log on process exit.</li>
	<li><span class="gitcommit">[nedmalloc_fast_realloc d6a8585]</span> Added stack 
	backtracing to the logger.</li>
	<li><span class="gitcommit">[master c7ea06d]</span> Finished user mode page 
	allocator, so merged nedmalloc_fast_realloc branch.</li>
	<li><span class="gitcommit">[master 9a8800f]</span> Fixed small bug which was 
	preventing the windows patcher from correctly finding the proper MSVCRT.</li>
	<li><span class="gitcommit">[master 37c58b1]</span> Fixed leak of mutexes when 
	using pthread or win32 mutexes as locks. Thanks to Gavin Lambert for reporting 
	this.</li>
	<li><span class="gitcommit">[master f67e284]</span> Fixed nedflushlogs() not 
	actually flushing data and/or causing a segfault. Thanks to Roman Tatkin for 
	reporting this.</li>
	<li><span class="gitcommit">[master 1324bf3]</span> Finally got round to retiring 
	the MSVC project files as they were sources of never ending hassle due to being 
	out of sync with the SConstruct config. Rebuilt scons build system to be fully 
	compatible with MSVC instead (long overdue!)</li>
	<li><span class="gitcommit">[master 068494e]</span> As the release of v1.10 
	RC1 approaches, fixed a long standing problem with the binary patcher where 
	multiple MSVCRT versions in the process weren&#39;t handled - everything was sent 
	to one MSVCRT only, and needless to say that sorta worked sometimes and sometimes 
	not. Now when nedmalloc passes a foreign block to the system allocator, it runs 
	a stack backtrace to figure out what MSVCRT in the process it ought to pass 
	it to. It&#39;s slow, but fixes a very common segfault on process exit on VS2010.</li>
	<li><span class="gitcommit">[master 4cca52c]</span> Very embarrassingly, nedmalloc 
	has been severely but unpredictably broken on POSIX for over a year now when built with DEBUG defined. 
	This was turning on DEFAULT_GRANULARITY_ALIGNED whose POSIX implementation 
	was causing random segfaults so mysterious that neither gdb nor valgrind 
	could pick them up - in other words, the very worst kind of memory 
	corruption: undetectable, untraceable and undebuggable. I only found them 
	myself due to a recent bug report for TnFOX on POSIX where due to luck, very 
	recent Linux kernels just happened by pure accident to cause this bug to 
	manifest itself as preventing process init right at the very start - so 
	early that no debugger could attach. After over a week of trial &amp; error I 
	narrowed it down to being somewhere in nedmalloc, then having something to 
	do with DEBUG being defined or not, then two hours ago the eureka moment 
	arrived and I quite literally did a jig around the room in joy. Problem is 
	now fixed thank the heavens!!!</li>
	<li><span class="gitcommit">[master 3d55a01]</span> Fixed a problem where the
	binary patcher was exiting early and therefore failing to patch all
	the binaries properly. It would seem that the Microsoft linker doesn't sort
	the import table like I had thought it did - I would guess it sorts per DLL
	location, otherwise is unsorted. Thanks to Roman Tatkin for reporting this bug.</li>
	<li><span class="gitcommit">[master 6c74071]</span> Added override of _GNU_SOURCE
	for when HAVE_MREMAP is auto-detected. Thanks to Maxim Zakharov for reporting
	this issue.</li>
	<li><span class="gitcommit">[master dee2d27]</span> Marked off the v2 malloc API
    as deprecated in preparation for beta release. Updated CHM documentation.</li>
</ul>
<h3>v1.06 beta 2 21st March 2010:</h3>
<ul>
	<li>{ 1153 } Added detection of whether host process is using MSVCRT or MSVCRTD 
	and the fixing up of which runtime tolerant nedalloc should use if nedalloc 
	was linked differently. This ought to save a great deal of hassle later on by 
	preventing failed-to-RTM user bug reports :)</li>
	<li>{ 1154 } Fixed nedalloc trying to use MLOCK_T even when USE_LOCKS=0. Thanks 
	to Ariel Manzur for reporting this.</li>
	<li>{ 1155 } Fixed USE_SPIN_LOCKS=0 not compiling on Windows.</li>
	<li>{ 1157 } Fixed bug where foreign blocks entering the threadcache weren&#39;t 
	being marked as such, thus typically causing a segfault on process exit.</li>
	<li>{ 1158 } Fixed compilation problems on mingw. Thanks to Amanieu d&#39;Antras 
	for reporting these.</li>
	<li>{ 1159 } Released as beta2.</li>
</ul>
<h3>v1.06 beta 1 13th January 2010:</h3>
<ul>
	<li>{ 1079 } Fixed misdeclaration of struct mallinfo as C++ type. Thanks to 
	James Mansion for reporting this.</li>
	<li>{ 1082 } Fixed dlmalloc bug which caused header corruption to mmap() allocations 
	when running under multiple threads.</li>
	<li>{ 1088 } Fixed assertion failure for nedblksize() with latest dlmalloc. 
	Thanks to Anteru for reporting this.</li>
	<li>{ 1088 } Added neddestroysyspool(). Thanks to Lars Wehmeyer for suggesting 
	this.</li>
	<li>{ 1088 } Fixed thread id high bit set bug causing SIGABRT on Mac OS X. Thanks 
	to Chris Dillman for reporting this.</li>
	<li>{ 1094 } Integrated dlmalloc v2.8.4 final.</li>
	<li>{ 1095 } Added nedtrimthreadcache(). Thanks to Hayim Hendeles for suggesting 
	this.</li>
	<li>{ 1095 } Fixed silly assertion of null pointer dereference. Thanks to Ullrich 
	Heinemann for reporting this.</li>
	<li>{ 1096 } Fixed lots of level 4 warnings on MSVC. Thanks to Anteru for suggesting 
	this.</li>
	<li>{ 1098 } Improved non-nedalloc block detection to 6.25% probability of being 
	wrong. Thanks to Applied Research Associates for sponsoring this.</li>
	<li>{ 1099 } Added USE_MAGIC_HEADERS which allows nedalloc to handle freeing 
	a system allocated block. Added USE_ALLOCATOR which allows the changing of which 
	backend allocator to use (with choices between the system allocator and dlmalloc 
	- choosing the system allocator is intended for debug situations only e.g. valgrind). 
	Thanks to Applied Research Associates for sponsoring this.</li>
	<li>{ 1105 } Added ability to build nedalloc as a DLL. Added support for a run 
	time PE binary patcher which can patch all usage of the system allocator replacing 
	it with nedalloc. Thanks to Applied Research Associates for sponsoring this.</li>
	<li>{ 1108 } Added patcher loader which can load any arbitrary program injecting 
	the nedalloc DLL which then patches in its replacement for the system allocator. 
	Doesn&#39;t work on all programs, but does on most e.g. Microsoft Word. Thanks to 
	Applied Research Associates for sponsoring this.</li>
	<li>{ 1116 } Finished debugging and optimising the latest additions to the codebase. 
	The patcher now works well on x64 as well as x86. Added support for large pages 
	on Windows. Thanks to Applied Research Associates for sponsoring this.</li>
	<li>{ 1125 } Added nedpoollist() which returns a snapshot of the nedpools currently 
	existing. The Windows DLL thread exit code now disables the thread cache for 
	all currently existing nedpools. Thanks to Applied Research Associates for 
	sponsoring this.</li>
	<li>{ 1126 } Added ENABLE_TOLERANT_NEDMALLOC which allows nedalloc to recognise 
	system allocator blocks and to do the right thing with them.</li>
	<li>{ 1139 } Added link time code generation support for Windows builds. This 
	currently has zero performance improvement on x64 (on MSVC9) but can add 15% 
	to x86 performance (on MSVC9). Also added scons SConstruct and SConscript files.</li>
</ul>
<h3>v1.05 15th June 2008:</h3>
<ul>
	<li>{ 1042 } Added error check for TLSSET() and TLSFREE() macros. Thanks to 
	Markus Elfring for reporting this.</li>
	<li>{ 1043 } Fixed a segfault when freeing memory allocated using nedindependent_comalloc(). 
	Thanks to Pavel Vozenilek for reporting this.</li>
</ul>
<h3>v1.04 14th July 2007:</h3>
<ul>
	<li>Fixed a bug with the new optimised implementation that failed to lock on 
	a realloc under certain conditions.</li>
	<li>Fixed lack of thread synchronisation in InitPool() causing pool corruption.</li>
	<li>Fixed a memory leak of thread cache contents on disabling. Thanks to Earl 
	Chew for reporting this.</li>
	<li>Added a sanity check for freed blocks being valid.</li>
	<li>Reworked test.c into being a torture test.</li>
	<li>Fixed GCC assembler optimisation misspecification.</li>
</ul>
<h3>v1.04alpha_svn915 7th October 2006:</h3>
<ul>
	<li>Fixed failure to unlock thread cache list if allocating a new list failed. 
	Thanks to Dmitry Chichkov for reporting this. Further thanks to Aleksey Sanin.</li>
	<li>Fixed realloc(0, &lt;size&gt;) segfaulting. Thanks to Dmitry Chichkov for reporting 
	this.</li>
	<li>Made config defines #ifndef so they can be overridden by the build system. 
	Thanks to Aleksey Sanin for suggesting this.</li>
	<li>Fixed deadlock in nedprealloc() due to unnecessary locking of preferred 
	thread mspace when mspace_realloc() always uses the original block&#39;s mspace 
	anyway. Thanks to Aleksey Sanin for reporting this.</li>
	<li>Made some speed improvements by hacking mspace_malloc() to no longer lock 
	its mspace, thus allowing the recursive mutex implementation to be removed with 
	an associated speed increase. Thanks to Aleksey Sanin for suggesting this.</li>
	<li>Fixed a bug where mspace allocation overran its maximum limit. Thanks to Aleksey 
	Sanin for reporting this.</li>
</ul>
<h3>v1.03 10th July 2006:</h3>
<ul>
	<li>Fixed memory corruption bug in threadcache code which only appeared with 
	&gt;4 threads and in heavy use of the threadcache.</li>
</ul>
<h3>v1.02 15th May 2006:</h3>
<ul>
	<li>Integrated dlmalloc v2.8.3, fixing the win32 memory release problem and 
	improving performance still further. Speed is now up to twice the speed of v1.01 
	(average is 67% faster).</li>
	<li>Fixed win32 critical section implementation. Thanks to Pavel Kuznetsov for 
	reporting this.</li>
	<li>Wasn&#39;t locking mspace if all mspaces were locked. Thanks to Pavel Kuznetsov 
	for reporting this.</li>
	<li>Added Apple Mac OS X support.</li>
</ul>
<h3>v1.01 24th February 2006:</h3>
<ul>
	<li>Fixed multiprocessor scaling problems by removing sources of cache sloshing.</li>
	<li>Earl Chew &lt;earl_chew &lt;at&gt; agilent &lt;dot&gt; com&gt; sent patches for the following:
	<ol>
		<li>size2binidx() wasn&#39;t working for default code path (non x86).</li>
		<li>Fixed failure to release mspace lock under certain circumstances which 
		caused a deadlock.</li>
	</ol>
	</li>
</ul>
<h3>v1.00 1st January 2006:</h3>
<ul>
	<li>First release</li>
</ul>

</body>

</html>