v0.1.10
What's Changed
- Full-context draft features above a context threshold: the draft full-attention layer attends all context features once projected context reaches `--draft-full-context-min-ctx` (default `16384`), backed by an in-place step-grown draft KV buffer that removes progressive Metal heap churn on long generations by @bstnxbt in f1b0427 and 76c3d1d
- Prefix snapshot lifecycle reform: end-of-request generation snapshots adopt the live cache arrays instead of cloning multi-GB buffers, prefill publishes are skipped when a generation snapshot follows, snapshot refs are released once prefill consumes them, and L2 disk writes pause while a request is served by @bstnxbt in e64e30c, c18629e, dca28bd and 58fcd02
- GDN boundary sidecar: generation snapshots capture recurrent state and logits at the stable-prefix boundary, so next-turn requests that diverge inside a previous generation still restore at the boundary instead of missing (GDN state cannot be rewound) by @bstnxbt in c3e72ad
- L1/L2 correctness wave: coverage-aware fingerprints and lookups (a trimmed snapshot can no longer shadow or be served in place of a full-coverage one), true-LRU L2 eviction (mtime refresh on served hits), generation snapshots exempt from the L1 token cap and consumed on serve, and resident L1 entries spilled to L2 at shutdown for next-session warm starts by @bstnxbt in 4f06598, 5343565, ab22c5a and 4f4580b
- Positional sparse prefill API: new `prompt_token_positions` runtime parameter with position-mapped RoPE on the target, including gemma4 sliding-window masks built from true token positions by @popfido in #43
- `dflash benchmark --sustained-minutes N`: continuous-load benchmark mode that reports the post-throttle plateau instead of fresh-GPU bursts by @bstnxbt in ec6bacd
- Serve: emit a valid completion when tool-call parsing fails after a truncated generation by @bstnxbt in 5fa98cd
- Numeric contract gate for the default-on verify kernels: bitwise determinism, pinned per-shape deviation bounds vs stock quantized matmul, margin-certified argmax stability, and per-GPU-profile golden output fingerprints with a regeneration path by @bstnxbt in b1aa3ee
- Observability: per-cycle proposed/posterior/committed token ids in cycle events, `DFLASH_CAPTURE_LOGITS` slot-level logit capture (active only with cycle profiling), and `sidecar_hits` in the prefix-cache stats line by @bstnxbt in f778281, 9becba0 and d7660bd
- Docs: `--copyspec-mode`, `--quantize-kv-cache`, and `--prefix-cache-l2-frontier-stride` are now documented across serve/generate/benchmark surfaces, env vars and validation rules included, plus a new architecture section on the snapshot lifecycle by @bstnxbt in 9eb1e49
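
The "true-LRU L2 eviction (mtime refresh on served hits)" item above can be illustrated with a minimal sketch. This is not the dflash implementation: the function names `touch_on_hit` and `evict_lru`, the flat directory of snapshot files, and the byte budget are all assumptions made for illustration. The idea is only that refreshing a file's mtime whenever it is served makes "oldest mtime" equivalent to "least recently used".

```python
import os
from pathlib import Path


def touch_on_hit(path: Path) -> None:
    """Hypothetical helper: mark a served entry as recently used."""
    # Passing None sets both atime and mtime to the current time.
    os.utime(path, None)


def evict_lru(cache_dir: Path, max_bytes: int) -> list[Path]:
    """Hypothetical helper: delete oldest-mtime files until the
    directory fits within max_bytes. Returns the evicted paths."""
    entries = sorted(
        (p for p in cache_dir.iterdir() if p.is_file()),
        key=lambda p: p.stat().st_mtime,  # oldest first
    )
    total = sum(p.stat().st_size for p in entries)
    evicted = []
    for p in entries:
        if total <= max_bytes:
            break
        total -= p.stat().st_size
        p.unlink()
        evicted.append(p)
    return evicted
```

Without the refresh in `touch_on_hit`, eviction would order entries by write time and could drop an entry that is still being served regularly.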
Performance Notes
- Long-context decode is the headline: with full-context draft features, maintainer validation on Qwen3.6-27B-4bit (M5 Max, 64 GB) measured +17–32% tok/s on long-context generations with parity-exact output, and a 32k-context AIME run that previously degraded to 11.2 tok/s now sustains 55.2 tok/s.
- Cached multi-request sessions: the snapshot lifecycle reform cut a long agentic replay wall from 2039 s to 1594 s (−21.8%) with byte-exact outputs across all 9 requests.
- Sparse prefill is experimental and Python-API-only (no CLI flag). Dense requests are bitwise-unaffected. With non-contiguous positions the target output is position-correct and still fully target-verified; draft-side position mapping lands in a follow-up.
- `--sustained-minutes` exists because Apple Silicon throttles sustained 27B decode after ~2–3 minutes; fresh-GPU benchmark numbers overstate long-session throughput.
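
To make the difference between a fresh-GPU burst and a post-throttle plateau concrete, here is a small sketch. It is an assumption about how such a figure could be computed, not how `dflash benchmark` computes it: `plateau_tok_per_s`, the sample format, and the warm-up cutoff are hypothetical, and the numbers are made up for illustration.

```python
def plateau_tok_per_s(samples, warmup_s):
    """Throughput over the steady-state window only.

    samples: list of (elapsed_seconds, cumulative_tokens), sorted by time.
    warmup_s: hypothetical cutoff; samples before it are ignored.
    """
    steady = [(t, n) for t, n in samples if t >= warmup_s]
    if len(steady) < 2:
        raise ValueError("need at least two samples after warm-up")
    (t0, n0), (t1, n1) = steady[0], steady[-1]
    return (n1 - n0) / (t1 - t0)


# Synthetic run: 60 tok/s for two minutes, then 30 tok/s once throttled.
samples = [(0, 0), (60, 3600), (120, 7200), (180, 9000), (240, 10800), (300, 12600)]

whole_run = samples[-1][1] / samples[-1][0]          # 42.0, inflated by the burst
plateau = plateau_tok_per_s(samples, warmup_s=180)   # 30.0, the sustained rate
```

Averaging over the whole run mixes the unthrottled start into the result, which is why the sustained figure is the more useful one for long sessions.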
New Contributors
Full Changelog: v0.1.9...v0.1.10