emeryberger/smash

smash

smash — a compression-aware memory allocator that transparently compresses cold pages to reduce resident set size (RSS).

Overview

Smash is a drop-in malloc replacement that monitors page access patterns and compresses pages that haven't been touched recently. When compressed pages are accessed again, a signal handler transparently decompresses them before the application sees the data. Smash reduces physical memory usage for applications with large working sets where significant portions of allocated memory are idle at any given time.

Key Features

  • Transparent compression: No application changes required, works via malloc interposition.
  • ROI-driven algorithm selection: A return-on-investment model picks zstd-1 (fast tier) or zstd-9 (deep tier) per page at compression time, weighing observed compressibility and observed compression cost against the page's accumulated cold time.
  • Per-origin learning: Compression-ratio and compression-cost statistics are kept per (arena, size class), so call-site arena routing translates directly into more accurate ROI decisions.
  • Adaptive worker pool: The compressor scales its active worker count each tick using Little's Law (N = ⌈λ/μ⌉), so an idle process uses one worker and a bulk-allocation phase uses several.

How It Works

  1. Allocation: smash replaces malloc/free via alloc8 interposition. All slab data pages come from a single large virtual memory reservation (VmRegion). A return-address hash routes calls from the same call site into the same arena, producing structurally homogeneous pages.

  2. Access tracking: A background compressor thread periodically sets active pages to read-only (mprotect PROT_READ). Write faults mark pages as "accessed"; pages without writes across multiple intervals are considered cold.

  3. Compression: Cold pages are compressed with zstd-1 or zstd-9, selected by the ROI model based on the page's (arena, size class) observed ratio and the per-tier observed compression cost. Physical backing is released (MADV_FREE_REUSABLE on macOS, MADV_DONTNEED on Linux); compressed data is stored in a separate sharded region.

  4. Decompression: When the application accesses a compressed page, a SIGSEGV/SIGBUS handler recommits the page, decompresses the data, and resumes execution transparently.

Building

Prerequisites

  • C++20 compiler (Clang 14+ or GCC 12+)
  • CMake 3.15+
  • alloc8 source as a sibling directory (or specify -DALLOC8_DIR=...)

Build

mkdir build && cd build
cmake ..
make -j$(nproc)

This produces libsmash.dylib (macOS) or libsmash.so (Linux).

Build with benchmarks

cmake .. -DSMASH_BUILD_BENCH=ON
make -j$(nproc)

SMASH_BUILD_BENCH=ON enables three groups of benchmark targets, each with its own toggle:

| Option | Default | What it gates |
|---|---|---|
| `SMASH_BUILD_BENCH_DEPS` | `ON` | Build Redis, memcached, DuckDB, and RocksDB from source via `make bench_deps` (see "Build with benchmark dependencies" below). Set `OFF` to skip; the rest of the benchmark targets still build. |
| `SMASH_BUILD_BENCH_ALLOCATORS` | `ON` | Build the allocator-comparison benches (mimalloc, jemalloc, tcmalloc, hoard, mesh, diehard, dieharder) and the `bench_allocator_compare.py` runner. Pulls in tcmalloc and mimalloc via `FetchContent` + `ExternalProject_Add`, which adds significant build time and several optional `find_library` probes. Set `OFF` for fast smash-only builds (e.g., CI regression runs). |

To build only the smash-internal benches (bench_rss, bench_sqlite, bench_throughput, bench_compression, bench_algo_compare, etc.) without external services or competing allocators:

cmake .. -DSMASH_BUILD_BENCH=ON \
         -DSMASH_BUILD_BENCH_DEPS=OFF \
         -DSMASH_BUILD_BENCH_ALLOCATORS=OFF
make -j$(nproc)

This is what the CI regression-spotting workflow uses (see .github/workflows/ci.yml).

Build with benchmark dependencies (Redis, memcached, DuckDB, RocksDB)

For full A/B benchmarking, build the external dependencies from source. This ensures they use system malloc (libc) instead of their default allocators (jemalloc), which is required for Smash to effectively compress their memory.

cmake .. -DSMASH_BUILD_BENCH=ON -DSMASH_BUILD_BENCH_DEPS=ON
make -j$(nproc)
make bench_deps   # Builds Redis, memcached, DuckDB, RocksDB from source

Note: Building DuckDB from source takes significant time (10-20 minutes). The bench_deps target builds:

  • Redis 8.0.2 with MALLOC=libc (instead of jemalloc)
  • memcached 1.6.34 (requires libevent-devel)
  • DuckDB 1.2.0 CLI
  • RocksDB 9.8.4 static library

Usage

macOS

DYLD_INSERT_LIBRARIES=./build/libsmash.dylib DYLD_FORCE_FLAT_NAMESPACE=1 ./your_application

Linux

LD_PRELOAD=./build/libsmash.so ./your_application

Large-Only Mode

For applications with their own small-object allocator (e.g., Python 3.13+ uses mimalloc internally), Smash can manage only large allocations while letting the native allocator handle small objects:

# macOS
SMASH_LARGE_ONLY=1 DYLD_INSERT_LIBRARIES=./build/libsmash.dylib ./your_application

# Linux
SMASH_LARGE_ONLY=1 LD_PRELOAD=./build/libsmash.so ./your_application

Allocations <= 16KB pass through to the system allocator; larger allocations go through Smash and are eligible for compression. This avoids interfering with language runtimes that have their own optimized small-object allocators.

Compress-Only Mode

For applications that use custom allocators (jemalloc, tcmalloc, etc.), Smash can run in compress-only mode where it only monitors and compresses pages without replacing malloc:

# macOS
SMASH_MODE=compress_only DYLD_INSERT_LIBRARIES=./build/libsmash.dylib ./your_application

# Linux
SMASH_MODE=compress_only LD_PRELOAD=./build/libsmash.so ./your_application

This mode tracks all heap pages via /proc/self/maps (Linux) or vm_region (macOS) and compresses cold regions regardless of which allocator manages them.

External-Mapping Tracking (SMASH_TRACK_EXTERNAL=1)

Standard smash compresses pages within its own MAP_ANON arena (where malloc-routed allocations live). Application code that calls mmap() / mach_vm_allocate() directly bypasses malloc and so escapes the compressor — this is the long pole on Firefox, where SpiderMonkey JS GC arenas, Skia / mozalloc_aligned graphics surfaces, and IPC-aligned shared memory all bypass malloc.

SMASH_TRACK_EXTERNAL=1 registers anonymous-writable application-direct mappings with smash so the compressor can see them too:

# macOS
SMASH_TRACK_EXTERNAL=1 DYLD_INSERT_LIBRARIES=./build/libsmash.dylib ./your_application

# Linux
SMASH_TRACK_EXTERNAL=1 LD_PRELOAD=./build/libsmash.so ./your_application

Filter rules: mmap is tracked only with MAP_ANON | PROT_WRITE (file-backed and read-only mappings are skipped). On macOS mach_vm_allocate and vm_allocate are tracked when allocated in the current task. The interposers themselves always install (cost: one branch per mmap / mach_vm call); only the page registration path is gated by the env var.

This is opt-in because the registration path has not yet been validated for stability on Firefox-class workloads — see smash-benchmarks/FIREFOX_STUDY.md for context.

Optional API

Applications can provide hints for better compression behavior:

#include <smash/smash.h>

smash_hint_cold(ptr, size);   // Suggest region for immediate compression
smash_hint_hot(ptr, size);    // Suggest region should stay decompressed

SmashStats stats;
smash_get_stats(&stats);      // Query compression statistics

Testing

cd build
ctest --output-on-failure

14 tests covering:

| Test | What it verifies |
|---|---|
| `test_bootstrap` | Bootstrap bump allocator |
| `test_size_classes` | Size-class mapping and ordering |
| `test_span` | Bitmap-based span allocation |
| `test_slab` | Per-class slab management |
| `test_vm_region` | Virtual memory reservation and page states |
| `test_compression` | LZ4 compress/decompress round trip, access tracking |
| `test_integration` | Full SmashHeap malloc/free/memalign |
| `test_interpose` | malloc interposition via DYLD_INSERT |
| `test_dictionary` | Dictionary training, ratio improvement, fallback |
| `test_prefetch` | Adjacent-page prefetch, span boundary clipping |
| `test_contention` | 8-thread concurrent alloc/free stress test |
| `test_fault_cycle` | Real SIGSEGV → decompress → verify data integrity |
| `test_external_mapping` | mmap + mach_vm_allocate round trip through the compressor (under SMASH_TRACK_EXTERNAL=1); negative tests confirm file-backed and read-only mappings are not tracked |
| `test_malloc_compression` | End-to-end compression on the malloc/free path: the SIGUSR2 stats line shows compressed > 0, then read-back verifies fault-decompress integrity |

The two end-to-end tests run under DYLD_INSERT_LIBRARIES (macOS) / LD_PRELOAD (Linux) with a live compressor, so they exercise the full Phase 1 → Phase 2 → fault-decompress cycle, not just unit-level invariants.

Continuous integration: .github/workflows/ci.yml builds and runs the full ctest suite on ubuntu-latest and macos-latest for every push to master and every pull request.

Benchmarks

Micro-benchmarks

cd build

# Compression ratio comparison: LZ4 vs zstd vs zstd+dict
./bench/bench_compression

# Malloc/free throughput (ops/sec)
./bench/bench_throughput

# Alloc/free latency percentiles (p50/p99/p999)
./bench/bench_latency

# RSS reduction over time
./bench/bench_rss

# Algorithm comparison: WKdm vs LZ4 vs zstd
./bench/bench_algo_compare

Application Benchmarks

These scripts run A/B comparisons (baseline vs Smash) on real applications:

cd build

# Redis (SET → cool → GET workload)
bash bench/bench_redis.sh [--quick]

# Memcached (fill → cool → serve → cold re-access)
bash bench/bench_memcached.sh [--quick]

# DuckDB (TPC-H OLAP queries)
bash bench/bench_duckdb.sh [--quick]

# RocksDB (block cache with hot/cold access)
bash bench/bench_rocksdb.sh [--quick]

Paper Experiments

For reproducible research results:

cd build

# Run all experiments (full — for paper-quality results)
python3 ../bench/run_paper_experiments.py --runs 3

# Quick smoke test
python3 ../bench/run_paper_experiments.py --quick --runs 1

# Results written to paper_results/

Architecture

┌─────────────────────────────────────────────┐
│              Application                     │
│         malloc() / free()                    │
├─────────────────────────────────────────────┤
│  alloc8 interposition layer                  │
├─────────────────────────────────────────────┤
│  SmashHeap                                   │
│  ┌──────────┐ ┌──────────┐ ┌──────────────┐ │
│  │ Thread   │ │ Slab[36] │ │ LargeAlloc   │ │
│  │ Cache    │→│ (per-sc) │ │ (>16KB)      │ │
│  └──────────┘ └────┬─────┘ └──────────────┘ │
│                    │                         │
│  ┌─────────────────▼───────────────────────┐ │
│  │           VmRegion                       │ │
│  │  (single contiguous virtual reservation) │ │
│  └─────────────────┬───────────────────────┘ │
│                    │                         │
│  ┌────────────┐ ┌──▼──────────┐ ┌─────────┐ │
│  │ PageState  │ │ Compressor  │ │ Fault   │ │
│  │ Table      │ │ Thread      │ │ Handler │ │
│  └────────────┘ └──┬──────────┘ └────┬────┘ │
│                    │                 │       │
│  ┌─────────────────▼─────────────────▼─────┐ │
│  │         CompressEngine                   │ │
│  │    LZ4 │ zstd │ zstd+dict              │ │
│  └─────────────────────────────────────────┘ │
└─────────────────────────────────────────────┘

Configuration

Key tuning constants in include/smash/config.h:

| Constant | Default | Description |
|---|---|---|
| `kCompressIntervalMs` | 1000 | Compression scan interval (ms) |
| `kColdTicks` | 2 | Ticks without access before fast-tier compression is considered |
| `kVeryColdTicks` | 60 | Cold-tick threshold for the deep-tier (zstd-9) profile in the ROI model |
| `kMinCompressRatio` | 0.75 | Keep the compressed copy only if it is < 75% of the original size |
| `kPrefetchWindow` | 2 | Pages prefetched in each direction on a fault |
| `kDictTrainSamples` | 0 | Pages sampled before dictionary training (0 = disabled) |
| `kNumClasses` | 36 | Size classes (16 B to 16 KB) |
| `kNumArenas` | 4 | Call-site arenas (must be a power of 2) |
| `kCompressorWorkers` | 2 | Initial compressor worker count |
| `kMaxCompressorWorkers` | 8 | Cap for adaptive worker scaling |
| `kCompressStoreShards` | 8 | Sharded lock count in CompressStore |

License

TBD
