server: SWA-aware fallback for spec-decode probe (--spec-type dflash on Qwen3.6, gemma-4, …)

NubsCarson · claude · NubsCarson · commit cadb762da441 · 2026-05-13T01:35:29.000Z
`common_context_can_seq_rm()` in `common/common.cpp:1401` does a 2-token test decode to classify the target context. When that `llama_decode()` returns nonzero, it currently classifies the context as `COMMON_CONTEXT_SEQ_RM_TYPE_NO`, which causes `tools/server/server-context.cpp:836` to disable speculative decoding entirely:

    common_speculative_is_compat: the target context does not support partial sequence removal
    srv load_model: speculative decoding not supported by this context

On SWA-based bodies (Qwen3.6-27B / 35B-A3B, gemma-4-31B, and the future Eliza-1-27b once the SFT on the Qwen3.5/3.6 backbone lands) the 2-token test fails for reasons unrelated to whether spec-decode actually works on the body. The downstream code already has the right idea: lines 2680-2682 of the same file enable `do_checkpoint` when `n_swa > 0`, so spec-decode + checkpoint mode is the supported path for SWA bodies.

This patch closes the gap. If the probe classifies the context as `_TYPE_NO` but the model declares SWA (and the operator did not opt into `--swa-full`), demote the classification to `_TYPE_FULL` (use checkpoints) so that spec-decode initializes.

All non-SWA bodies are unaffected: the demotion only fires when the probe result is `_TYPE_NO` and `llama_model_n_swa(model) > 0`.

Companion to elizaOS/eliza#7635 in the downstream eliza repo.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp
@@ -833,6 +833,27 @@ struct server_context_impl {
         slots.clear();
 
         ctx_tgt_seq_rm_type = common_context_can_seq_rm(ctx_tgt);
+
+        // SWA-aware probe fallback. The `common_context_can_seq_rm` probe
+        // does a 2-token test decode that returns `_TYPE_NO` whenever the
+        // first `llama_decode()` call returns nonzero. On SWA-based bodies
+        // (Qwen3.6, gemma-4) the probe's all-zero positions / single-seq
+        // batch can fail validation (M-RoPE position constraints, SWA
+        // window setup edge cases) even though the body would otherwise
+        // be perfectly usable for spec-decode via the checkpoint path
+        // (the `do_checkpoint` block ~1850 lines below already enables
+        // it when `n_swa > 0`). When the probe returns `_TYPE_NO` but
+        // the model declares SWA, demote the probe result to `_TYPE_FULL`
+        // (use checkpoints) rather than disable spec-decode entirely.
+        // This unblocks DFlash speculative decoding on Qwen3.6-class
+        // bodies paired with their community / Eliza-1 distilled drafters.
+        if (ctx_tgt_seq_rm_type == COMMON_CONTEXT_SEQ_RM_TYPE_NO &&
+            llama_model_n_swa(model_tgt) > 0 &&
+            !params_base.swa_full) {
+            SRV_WRN("%s", "seq_rm probe failed but model declares SWA — falling back to checkpoint-mode spec-decode\n");
+            ctx_tgt_seq_rm_type = COMMON_CONTEXT_SEQ_RM_TYPE_FULL;
+        }
+
         if (ctx_tgt_seq_rm_type == COMMON_CONTEXT_SEQ_RM_TYPE_NO) {
             SRV_WRN("%s", "speculative decoding not supported by this context\n");
         }
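
The fallback condition in the hunk above can be read as a small pure function, which makes the affected cases easy to enumerate. The following is a minimal sketch, not code from the patch: `seq_rm_type` and `resolve_seq_rm_type` are hypothetical stand-ins for the real `COMMON_CONTEXT_SEQ_RM_TYPE_*` values and the inline `if`, and only the two enum values named in the commit are modeled (the real enum may define more).

```cpp
#include <cstdint>

// Hypothetical stand-in for the COMMON_CONTEXT_SEQ_RM_TYPE_* values.
// Only the two values mentioned in the commit are modeled here.
enum class seq_rm_type { NO, FULL };

// Hypothetical helper mirroring the patch's condition:
//   probed   - result of the 2-token seq_rm probe
//   n_swa    - sliding-window size reported for the model (0 means no SWA)
//   swa_full - true when the operator opted into --swa-full
seq_rm_type resolve_seq_rm_type(seq_rm_type probed, int32_t n_swa, bool swa_full) {
    // Override a failed probe only for SWA models running without --swa-full,
    // so that the checkpoint-based spec-decode path can still initialize.
    if (probed == seq_rm_type::NO && n_swa > 0 && !swa_full) {
        return seq_rm_type::FULL;
    }
    // Every other combination keeps the probe's classification unchanged.
    return probed;
}
```

Under these assumptions, a failed probe on a non-SWA model (`n_swa == 0`) or with `--swa-full` set still yields `NO`, so speculative decoding stays disabled in those cases exactly as before the patch.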