v0.1.10

What's Changed

Full-context draft features above a context threshold: the draft full-attention layer attends all context features once projected context reaches --draft-full-context-min-ctx (default 16384), backed by an in-place step-grown draft KV buffer that removes progressive Metal heap churn on long generations by @bstnxbt in f1b0427 and 76c3d1d
Prefix snapshot lifecycle reform: end-of-request generation snapshots adopt the live cache arrays instead of cloning multi-GB buffers, prefill publishes are skipped when a generation snapshot follows, snapshot refs are released once prefill consumes them, and L2 disk writes pause while a request is served by @bstnxbt in e64e30c, c18629e, dca28bd and 58fcd02
GDN boundary sidecar: generation snapshots capture recurrent state and logits at the stable-prefix boundary, so next-turn requests that diverge inside a previous generation still restore at the boundary instead of missing (GDN state cannot be rewound) by @bstnxbt in c3e72ad
L1/L2 correctness wave: coverage-aware fingerprints and lookups (a trimmed snapshot can no longer shadow or be served in place of a full-coverage one), true-LRU L2 eviction (mtime refresh on served hits), generation snapshots exempt from the L1 token cap and consumed on serve, and resident L1 entries spilled to L2 at shutdown for next-session warm starts by @bstnxbt in 4f06598, 5343565, ab22c5a and 4f4580b
Positional sparse prefill API: new prompt_token_positions runtime parameter with position-mapped RoPE on the target, including gemma4 sliding-window masks built from true token positions by @popfido in #43
dflash benchmark --sustained-minutes N: continuous-load benchmark mode that reports the post-throttle plateau instead of fresh-GPU bursts by @bstnxbt in ec6bacd
Serve: emit a valid completion when tool-call parsing fails after a truncated generation by @bstnxbt in 5fa98cd
Numeric contract gate for the default-on verify kernels: bitwise determinism, pinned per-shape deviation bounds vs stock quantized matmul, margin-certified argmax stability, and per-GPU-profile golden output fingerprints with a regeneration path by @bstnxbt in b1aa3ee
Observability: per-cycle proposed/posterior/committed token ids in cycle events, DFLASH_CAPTURE_LOGITS slot-level logit capture (active only with cycle profiling), and sidecar_hits in the prefix-cache stats line by @bstnxbt in f778281, 9becba0 and d7660bd
Docs: --copyspec-mode, --quantize-kv-cache, and --prefix-cache-l2-frontier-stride are now documented across serve/generate/benchmark surfaces, env vars and validation rules included, plus a new architecture section on the snapshot lifecycle by @bstnxbt in 9eb1e49

Performance Notes

Long-context decode is the headline: with full-context draft features, maintainer validation on Qwen3.6-27B-4bit (M5 Max, 64 GB) measured +17–32% tok/s on long-context generations with parity-exact output, and a 32k-context AIME run that previously degraded to 11.2 tok/s now sustains 55.2 tok/s.
Cached multi-request sessions: the snapshot lifecycle reform cut a long agentic replay wall from 2039 s to 1594 s (−21.8%) with byte-exact outputs across all 9 requests.
Sparse prefill is experimental and Python-API-only (no CLI flag). Dense requests are bitwise-unaffected. With non-contiguous positions the target output is position-correct and still fully target-verified; draft-side position mapping lands in a follow-up.
--sustained-minutes exists because Apple Silicon throttles sustained 27B decode after ~2–3 minutes; fresh-GPU benchmark numbers overstate long-session throughput.

New Contributors

@popfido made their first contribution in #43

Full Changelog: v0.1.9...v0.1.10

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

v0.1.10

Choose a tag to compare

Sorry, something went wrong.

Sorry, something went wrong.

Uh oh!

No results found

v0.1.10

What's Changed

Performance Notes

New Contributors

Contributors

Uh oh!