Cachepawl

Pre-alpha. Interfaces are sketched well enough to plan against but may still change; nothing behind them is implemented yet.

Cachepawl is a hybrid cache allocator for next-generation language models that mix attention, state-space (SSM/Mamba), and mixture-of-experts layers. Existing KV cache managers (vLLM, SGLang, TensorRT-LLM) were built for pure transformer stacks; they assume a uniform per-layer cache shape and an append-only access pattern. Hybrid Mamba-Transformer-MoE models break both assumptions at once: attention layers want variable-length KV blocks, SSM layers want fixed-size state blocks, and MoE routing turns request shape into a runtime decision.

Cachepawl is an experiment to close that gap: one allocator that owns a single VRAM budget and serves both cache kinds without leaving most of the device idle.
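
To make the shape mismatch concrete, here is a rough back-of-the-envelope sketch in Python. It is not Cachepawl code: the class names and layer sizes are made up for illustration, it assumes 2-byte (FP16/BF16) elements, and it ignores the small convolution state that Mamba-style layers also keep.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttentionLayer:
    """Hypothetical attention layer description (illustrative sizes only)."""
    num_kv_heads: int
    head_dim: int

    def cache_bytes(self, seq_len: int, bytes_per_elem: int = 2) -> int:
        # Keys and values: two tensors of shape (seq_len, num_kv_heads, head_dim).
        return 2 * seq_len * self.num_kv_heads * self.head_dim * bytes_per_elem


@dataclass(frozen=True)
class SSMLayer:
    """Hypothetical SSM layer description (illustrative sizes only)."""
    d_inner: int
    d_state: int

    def cache_bytes(self, seq_len: int, bytes_per_elem: int = 2) -> int:
        # The recurrent state is (d_inner, d_state) no matter how long the sequence is.
        return self.d_inner * self.d_state * bytes_per_elem


attn = AttentionLayer(num_kv_heads=8, head_dim=128)
ssm = SSMLayer(d_inner=4096, d_state=16)

for seq_len in (1, 1024, 8192):
    print(seq_len, attn.cache_bytes(seq_len), ssm.cache_bytes(seq_len))
```

Per request, the attention cache grows linearly with sequence length while the SSM state stays constant, which is why a block size tuned for one kind fits the other poorly.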

Target architectures

The initial design targets these published hybrid models:

Mamba-2
Jamba
Zamba2
Samba
Hymba
RecurrentGemma

Reference configs live in src/cachepawl/models/spec.py and are intentionally left as None placeholders until each upstream config is mapped in.
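
As a rough illustration of the None-placeholder pattern, the sketch below shows one way such a registry could behave. Everything in it (HybridModelSpec, REFERENCE_SPECS, get_spec, and the field names) is hypothetical and is not the actual content of spec.py.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HybridModelSpec:
    """Hypothetical layout descriptor; the real type in spec.py may differ."""
    name: str
    num_attention_layers: int
    num_ssm_layers: int


# Hypothetical registry: None marks a model whose upstream config is not mapped in yet.
REFERENCE_SPECS: dict[str, Optional[HybridModelSpec]] = {
    "jamba": None,
    "zamba2": None,
}


def get_spec(name: str) -> HybridModelSpec:
    """Return a mapped spec, or fail loudly if it is unknown or still a placeholder."""
    if name not in REFERENCE_SPECS:
        raise KeyError(f"unknown model: {name!r}")
    spec = REFERENCE_SPECS[name]
    if spec is None:
        raise NotImplementedError(f"reference config for {name!r} is not mapped in yet")
    return spec
```

Keeping the key present with a None value lets callers tell "planned but unmapped" apart from "not a target at all".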

Install

Requires Python 3.10 or newer. Triton is gated to Linux because there are no upstream Triton wheels for macOS or Windows.

uv sync

For environments without a CUDA GPU (CI, laptops, Codespaces), use the CPU-only PyTorch index to keep wheel size down:

uv sync --extra-index-url https://download.pytorch.org/whl/cpu
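
Because Triton is only installed on Linux, code that wants to use it has to check before importing. A minimal sketch of such a guard follows; the helper name triton_available is an assumption for illustration and is not part of Cachepawl.

```python
import importlib.util
import sys


def triton_available() -> bool:
    """Return True only on Linux with an importable triton package."""
    if not sys.platform.startswith("linux"):
        return False
    # find_spec returns None when the top-level package is not installed.
    return importlib.util.find_spec("triton") is not None


if triton_available():
    print("Triton kernels can be used")
else:
    print("Triton unavailable; skipping kernel paths")
```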

Quickstart

import cachepawl

print(cachepawl.__version__)

The full allocator API is wired up but not implemented. Calls into MemoryPool, KVCacheManager, StateCacheManager, HybridCacheCoordinator, vram_info, and cuda_capability will raise NotImplementedError with a message pointing at the design milestone that unblocks them.
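
Until the allocator lands, callers that probe the API may want to treat NotImplementedError as "not available yet" rather than as a crash. The sketch below shows that pattern with a stand-in function, since the real call signatures are not documented here; call_or_none and vram_info_stub are hypothetical names, and the stub's message is invented.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_or_none(fn: Callable[[], T]) -> Optional[T]:
    """Run fn and return its result, or None if it is not implemented yet."""
    try:
        return fn()
    except NotImplementedError as exc:
        print(f"skipped: {exc}")
        return None


def vram_info_stub() -> int:
    # Stand-in for an unimplemented entry point; not the real cachepawl function.
    raise NotImplementedError("vram_info is not implemented yet")


result = call_or_none(vram_info_stub)
```

Only NotImplementedError is caught, so genuine bugs still surface once real implementations arrive.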

Layout

src/cachepawl/
  allocator/   block pool and eviction policy interfaces
  cache/       KV, SSM state, and hybrid coordinator managers
  kernels/     reserved for Triton kernels
  models/      hybrid model layout descriptors
  quant/       cache element dtypes (FP16, BF16, INT8, FP8, FP4)
  utils/       device and VRAM helpers
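
The element dtypes listed for quant/ differ in width, which directly changes how many cache entries fit in a block. The following sketch maps each listed dtype to its bit width and computes packed storage size. The names BITS_PER_ELEM and packed_bytes are hypothetical, and the calculation ignores any scale factors or metadata a real quantization scheme would store alongside the values.

```python
# Bit widths of the element types named in the layout (FP4 packs two values per byte).
BITS_PER_ELEM = {
    "FP16": 16,
    "BF16": 16,
    "INT8": 8,
    "FP8": 8,
    "FP4": 4,
}


def packed_bytes(dtype: str, num_elems: int) -> int:
    """Bytes needed to store num_elems values of dtype, rounded up to whole bytes."""
    total_bits = BITS_PER_ELEM[dtype] * num_elems
    return (total_bits + 7) // 8  # ceiling division by 8


for dtype in BITS_PER_ELEM:
    print(dtype, packed_bytes(dtype, 1024))
```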

Paper

The AVMP allocator is described in arXiv:2605.22416. Source under research/avmp/.

Documentation

docs/architecture.md: two-pool vs unified-pool tradeoff and prior art in vLLM and SGLang.
docs/design-rationale.md: why hybrid Mamba-Transformer-MoE workloads break existing cache solutions.
benchmarks/README.md: benchmarking strategy and target hardware.

Status

Pre-alpha. APIs may change without notice until the first concrete allocator lands. Track progress in the issues queue.

License

MIT. See LICENSE.
