A from-scratch implementation of malloc, free, calloc, and realloc in C, built without using any heap allocation functions from libc. Implements core allocator features including thread safety, per-thread caching, and large allocation handling via mmap.
| Feature | Description |
|---|---|
| Doubly linked heap list | Tracks all allocated/free blocks with prev/next pointers |
| Power-of-2 bins (8 bins) | Free blocks organized by size class for O(1) lookup |
| Block splitting | Large free blocks split to minimize internal fragmentation |
| Block coalescing | Adjacent free blocks merged to reduce external fragmentation |
| 8-byte alignment | All allocations aligned via ALIGN(size) macro |
| mmap chunk allocator | Heap grows in 2MB chunks via mmap instead of sbrk |
| Large allocation path | Allocations ≥ 128KB get dedicated mmap regions |
| Large block cache | Up to 32 large mmap blocks cached on free to avoid syscall overhead |
| Per-thread cache (tcache) | Each thread holds up to 128 free blocks per bin, zero-lock fast path |
| Thread safety | Global heaplock mutex protects all shared heap state |
| Allocator statistics | Tracks tcache hit rate, mmap calls, fragmentation, live allocations |
```c
void *mymalloc(size_t size);
void myfree(void *ptr);
void *mycalloc(size_t n, size_t size);
void *myrealloc(void *ptr, size_t size);
void allocator_print_stats(void);
```

```
mymalloc(size)
│
├─ size < 128KB ──► check per-thread tcache
│         │
│         hit  ─► return block (no lock)
│         miss ─► lock heaplock
│                  │
│                  find_free_block() ─► split if needed
│                  reqestspace()     ─► carve from 2MB chunk
│
└─ size ≥ 128KB ──► check large_cache[]
          │
          hit  ─► return cached mmap block (heaplock)
          miss ─► mmap(size) syscall
```
```
myfree(ptr)
│
├─ heap block ──► push to per-thread tcache
│         │
│         count ≥ 128 ─► flush_tcache() ─► global bins + coalesce
│
└─ mmap block ──► push to large_cache[], or munmap() if the cache is full
```
Each allocation is preceded by a block_header_t (48 bytes):
```
┌──────────────────────────────┐
│ block_header_t               │ 48 bytes
│   size_t size                │
│   int isfree                 │
│   int ismmapped              │
│   block_header_t *next/prev  │ heap linked list
│   block_header_t *bin_next   │ bin free list
│   block_header_t *bin_prev   │ bin free list
├──────────────────────────────┤
│ user data                    │ size bytes ← returned pointer
└──────────────────────────────┘
```
8 bins covering power-of-two size ranges; the bin index is `32 - __builtin_clz(size - 1) - 3` (i.e. `ceil(log2(size)) - 3`), clamped to [0, 7]:

```
bin[0]:   1 – 8 bytes
bin[1]:   9 – 16 bytes
bin[2]:  17 – 32 bytes
bin[3]:  33 – 64 bytes
bin[4]:  65 – 128 bytes
bin[5]: 129 – 256 bytes
bin[6]: 257 – 512 bytes
bin[7]: 513+ bytes
```
Each thread gets its own `tcache_t` via `__thread` (TLS):

```c
typedef struct {
    block_header_t *bins[NUM_BINS];  // free list per bin
    int count[NUM_BINS];             // block count per bin
} tcache_t;

__thread tcache_t tcache = {0};
```

The hot path (tcache hit) requires zero locks. The cache flushes to the global bins when any bin reaches TCACHE_MAX = 128 blocks.
```sh
# compile allocator + tests
gcc -pthread -fsanitize=thread allocator.c tests.c -o test
# or just
make

# run correctness tests
./test

# compile benchmarks
gcc -pthread -O2 -DUSE_CUSTOM benchmark.c allocator.c -o bench_custom
gcc -pthread -O2 benchmark.c -o bench_libc

# run benchmarks
./bench_libc
./bench_custom
```

Tested on x86-64 Linux, 8 threads, -O2.
| Benchmark | custom | libc | Verdict |
|---|---|---|---|
| single alloc+free | 58 ms | 29 ms | 2x slower |
| batch alloc | 1.44 ms | 3.59 ms | 2.5x faster |
| batch free | 0.36 ms | 1.54 ms | 4x faster |
| mixed sizes | 6.46 ms | 2.95 ms | 2x slower |
| realloc chain | 6.42 ms | 2.56 ms | 2.5x slower |
| multithreaded | 64 ms | 67 ms | tie (barely faster) |
tcache hit rate: ~99% on typical workloads.
- Batch alloc/free — tcache wins here, beats libc because blocks stay thread-local
- Mixed sizes — large block cache reduced mmap syscalls from 10,000 to ~1, 10x improvement
- Multithreaded — the single global `heaplock` is the bottleneck; per-thread arenas would close this gap
- Single alloc+free — libc wins due to ptmalloc's per-arena design
Call `allocator_print_stats()` after a workload to see:
```
===allocator stats===
tcache hits: 5197038
tcache misses: 12962
tcache hit rate: 99.8%
total allocs: 5210001
total frees: 5210001
live allocations: 0
mmap calls: 1
munmap calls: 0
large cache hits: 9999
large cache wasted: 0 bytes (0.0 MB)
```
- Single global lock — all threads contend on `heaplock` on a cache miss; per-thread arenas would fix this
- No in-place realloc — `myrealloc` always allocates and copies, because tcache blocks can't be safely distinguished from global heap blocks without an extra flag
- tcache fragmentation — no coalescing inside the tcache; adjacent free blocks in the cache won't merge until flushed
- Large block splitting — mmap'd blocks can't be partially munmap'd, so oversized cached blocks waste memory (tracked by `large_cache_wasted_bytes`)
- No thread exit cleanup — tcache blocks held by a thread are leaked if the thread exits without flushing
- Ownership is the key to thread safety — most bugs came from unclear ownership boundaries between tcache, global bins, and heap blocks
- The fast path creates the hardest races — every optimization (tcache, lockless large alloc) introduced a new race condition
- Atomic ops aren't free — replacing per-operation atomics with thread-local counters + periodic flush recovered significant performance
- Test harnesses can be wrong too — `rand()` is not thread-safe; using it in multithreaded tests caused false corruption reports that looked like allocator bugs
- Per-thread arenas to eliminate heaplock contention
- Exact size classes to remove tcache fragmentation
- An `in_tcache` flag in the block header to re-enable in-place realloc
- A thread exit hook (`pthread_key_create`) to flush the tcache on thread death
- Heap introspection — walk all blocks and compute a fragmentation percentage