Commit cadb762

NubsCarson and claude committed
server: SWA-aware fallback for spec-decode probe (--spec-type dflash on Qwen3.6, gemma-4, …)
`common_context_can_seq_rm()` in `common/common.cpp:1401` does a 2-token test decode to classify the target context. When that `llama_decode()` returns nonzero it currently classifies the context as `COMMON_CONTEXT_SEQ_RM_TYPE_NO`, which causes `tools/server/server-context.cpp:836` to disable speculative decoding entirely:

    common_speculative_is_compat: the target context does not support partial sequence removal
    srv load_model: speculative decoding not supported by this context

On SWA-based bodies (Qwen3.6-27B / 35B-A3B, gemma-4-31B, future Eliza-1-27b once the Qwen3.6 backbone Qwen3.5/3.6 SFT lands) the 2-token test fails for reasons unrelated to whether spec-decode actually works on the body. The downstream code already has the right idea: lines 2680-2682 enable `do_checkpoint` when `n_swa > 0`, so spec-decode + checkpoint mode is the supported path for SWA bodies.

This patch closes the gap. If the probe classifies the context as `_TYPE_NO` but the model declares SWA (and the operator did not opt into `--swa-full`), demote the classification to `_TYPE_FULL` (use checkpoints) so spec-decode initializes.

All non-SWA bodies are unaffected: the demotion only fires when both `_TYPE_NO` AND `llama_model_n_swa(model) > 0`.

Companion to elizaOS/eliza#7635 in the downstream eliza repo.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent d629a37 commit cadb762

1 file changed

Lines changed: 21 additions & 0 deletions

tools/server/server-context.cpp

@@ -833,6 +833,27 @@ struct server_context_impl {
         slots.clear();
 
         ctx_tgt_seq_rm_type = common_context_can_seq_rm(ctx_tgt);
+
+        // SWA-aware probe fallback. The `common_context_can_seq_rm` probe
+        // does a 2-token test decode that returns `_TYPE_NO` whenever the
+        // first `llama_decode()` call returns nonzero. On SWA-based bodies
+        // (Qwen3.6, gemma-4) the probe's all-zero positions / single-seq
+        // batch can fail validation (M-RoPE position constraints, SWA
+        // window setup edge cases) even though the body would otherwise
+        // be perfectly usable for spec-decode via the checkpoint path
+        // (the `do_checkpoint` block ~1850 lines below already enables
+        // it when `n_swa > 0`). When the probe returns `_TYPE_NO` but
+        // the model declares SWA, demote the probe result to `_TYPE_FULL`
+        // (use checkpoints) rather than disable spec-decode entirely.
+        // This unblocks DFlash speculative decoding on Qwen3.6-class
+        // bodies paired with their community / Eliza-1 distilled drafters.
+        if (ctx_tgt_seq_rm_type == COMMON_CONTEXT_SEQ_RM_TYPE_NO &&
+            llama_model_n_swa(model_tgt) > 0 &&
+            !params_base.swa_full) {
+            SRV_WRN("%s", "seq_rm probe failed but model declares SWA — falling back to checkpoint-mode spec-decode\n");
+            ctx_tgt_seq_rm_type = COMMON_CONTEXT_SEQ_RM_TYPE_FULL;
+        }
+
         if (ctx_tgt_seq_rm_type == COMMON_CONTEXT_SEQ_RM_TYPE_NO) {
             SRV_WRN("%s", "speculative decoding not supported by this context\n");
         }
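
For review purposes, the three-part condition added by this patch can be pulled out into a pure function and checked in isolation. The following is a minimal sketch, not code from the patch: `seq_rm_type` and `apply_swa_fallback` are hypothetical names, the enum models only the two classifications named in the commit message, and `n_swa` / `swa_full` stand in for `llama_model_n_swa(model_tgt)` and `params_base.swa_full`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the probe classification. Only the two values
// discussed in this commit are modelled here; the real enum is defined in
// llama.cpp's common library.
enum class seq_rm_type {
    NO,   // probe failed: spec-decode would be disabled
    FULL, // checkpoint-mode spec-decode
};

// Hypothetical helper mirroring the condition added by the patch.
//   probed   : result of the 2-token probe
//   n_swa    : sliding-window size declared by the model (0 means no SWA)
//   swa_full : true when the operator opted into --swa-full
seq_rm_type apply_swa_fallback(seq_rm_type probed, int32_t n_swa, bool swa_full) {
    // The fallback fires only when all three conditions hold.
    if (probed == seq_rm_type::NO && n_swa > 0 && !swa_full) {
        return seq_rm_type::FULL;
    }
    // Every other combination keeps the probe's own classification.
    return probed;
}
```

Under these assumptions, a failed probe on a model with a nonzero window and no `--swa-full` is the only input that changes: `apply_swa_fallback(seq_rm_type::NO, 4096, false)` yields `FULL`, while a non-SWA model (`n_swa == 0`) or a run with `--swa-full` keeps `NO` and still hits the existing "speculative decoding not supported by this context" warning.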
