FLy: a lossy, draft-agnostic "loose accept" gate for speculative decoding #24804

Seanspt · 2026-06-19T11:45:11Z


FLy (Training-Free Loosely Speculative Decoding) is a small, training-free change to the speculative-decoding acceptance step. Under greedy decoding, standard speculative decoding accepts a draft token only when it exactly matches the target's argmax. FLy also accepts a draft token that the target itself considers almost as likely as its own top choice. Such loose accepts are gated by a log-probability margin ΔlogP and re-checked by a deferred verification window.

It is lossy (output can diverge from the target's greedy trajectory) and it touches the sampling core, so this is a Discussion rather than a PR.

Results on the MTP self-draft path (Qwen3-27B, RTX 5090, greedy temp=0, recommended τ=0.3):

- Open-ended generation (creative writing): faster than lossless SPD, +5.6% t/s, 95% CI [+4.0, +7.1], n=450.
- Strict generation (math): on par with lossless SPD, about 1% slower (−1.0% t/s, 95% CI [−1.8, −0.2]), with no measurable accuracy loss (GSM8K 97.0% vs 97.0%, n=300).
- The speed effect is task-dependent, and the mechanism is understood: loose accepts pay off only where the target distribution is soft.

The gate is draft-agnostic — it only reads target logits — so it should drop onto any draft source unchanged (EAGLE/Medusa heads, external draft models, etc.). I'd love for people to try it where I can't.

Branch: https://github.com/Seanspt/llama.cpp/tree/feature/fly-loosely-speculative-decoding

1. The method

For each drafted position the target produces logits. FLy computes:

- ΔlogP = logP_target(target_top1) − logP_target(draft_token): how much less likely the draft token is than the target's own top choice, under the target. ΔlogP is never negative, and it is 0 on an exact match.
- An exact match is accepted as in normal SPD.
- A mismatch is loose-accepted only if ΔlogP < τ (the draft token is nearly as probable as the target's pick) and it passes the control-token safety checks; otherwise it is rejected and normal SPD verification resumes.
- A deferred verification window W: a loose accept is provisional; if the next W positions don't re-converge, it is rolled back.

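Intuition: ΔlogP directly bounds how much worse an accepted token can be in the target's own eyes. Because ΔlogP < τ, a loose-accepted token has more than e^(−τ) times the probability of the target's top choice (about 0.74× at τ=0.3). τ is a single speed/fidelity knob; τ → 0 reduces FLy to exact-match SPD. Control-sensitive tokens (chat markers, EOS, …) are never loose-accepted.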
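The per-position decision described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code in the branch: the function names, the `protected_tokens` set, and the string return values are hypothetical, and the deferred verification window W is left out because the re-convergence criterion is not spelled out in this post.

```python
import math
from typing import FrozenSet, List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one position's target logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def fly_gate(
    target_logits: Sequence[float],
    draft_token: int,
    tau: float,
    protected_tokens: FrozenSet[int] = frozenset(),  # hypothetical: EOS, chat markers, ...
) -> str:
    """Classify one drafted position as 'exact', 'loose', or 'reject'."""
    logp = log_softmax(target_logits)
    top1 = max(range(len(logp)), key=logp.__getitem__)
    if draft_token == top1:
        return "exact"  # normal SPD accept
    if draft_token in protected_tokens:
        return "reject"  # control-sensitive tokens are never loose-accepted
    delta_logp = logp[top1] - logp[draft_token]  # >= 0 by construction
    return "loose" if delta_logp < tau else "reject"


if __name__ == "__main__":
    logits = [2.0, 1.8, -1.0]
    print(fly_gate(logits, draft_token=1, tau=0.3))  # near-tie with the top token
    print(fly_gate(logits, draft_token=2, tau=0.3))  # far below the top token
```

With `tau=0.0` the strict inequality can never hold for a mismatch, so the sketch falls back to exact-match acceptance, which mirrors the τ → 0 limit described above.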

2. Setup


| Item | Value |
|---|---|
| Target | Qwen3-27B (Q4_K_M), MTP self-draft (`--spec-type draft-mtp`, `n-max=4`) |
| Hardware | RTX 5090 (32 GB) + Ryzen 9 9950X |
| Decoding | greedy, `temp=0`, fixed seed |
| FLy config | `W=6`, sweep `τ ∈ {0.2, 0.3, 0.4}` |
| Datasets | GSM8K (strict, 300 q) · WritingPrompts (open, 450 prompts / 3 seeds) |
| Timing | `eval`-phase tokens/s from `slot print_timing` (generation only) |
| Baseline | lossless MTP speculative decoding (same flags minus `--spec-fly`) |

These results cover one hardware point and one draft path. Speculative gains depend strongly on whether the target is the bottleneck; FLy's behavioral metrics (loose%, gate decisions) should be hardware-independent, but absolute t/s should not be extrapolated across hardware.

3. Speed: the effect is task-dependent

Measured with CUDA graphs on (the default) and large n, so the standard error on the mean is ≈ 0.5 t/s.

Strict — GSM8K (300 problems, ~320 tok each)

| Config | t/s | vs SPD [95% CI] | draft acc | loose% |
|---|---|---|---|---|
| SPD | 144.4 | - | 0.849 | 0% |
| FLy τ=0.2 | 142.3 | −1.4% [−2.2, −0.6] | 0.858 | 0.4% |
| FLy τ=0.3 | 143.0 | −1.0% [−1.8, −0.2] | 0.861 | 0.5% |
| FLy τ=0.4 | 142.5 | −1.4% [−2.2, −0.5] | 0.865 | 0.7% |

Open — WritingPrompts (creative writing, ~500+ tok)

| Config | t/s | vs SPD [95% CI] | draft acc | loose% |
|---|---|---|---|---|
| SPD | 89.6 | - | 0.416 | 0% |
| FLy τ=0.2 | 91.4 | +2.1% [+0.5, +3.6] | 0.435 | 2.3% |
| FLy τ=0.3 | 94.6 | +5.6% [+4.0, +7.1] | 0.443 | 3.5% |
| FLy τ=0.4 | 95.9 | +7.0% [+5.4, +8.6] | 0.454 | 4.6% |

The two tasks point in opposite directions, and that's the real signal.

Why: FLy's speedup comes from loose-accepting near-tie draft tokens, which only pays off when there's room:

- Strict (math): draft acceptance is already high (~85%) and the target distribution is sharp, so there are few near-ties to loose-accept and gate overhead slightly exceeds the gain (about −1%, small).
- Open (creative): draft acceptance is low (~42%) and the distribution is soft. Many wordings are nearly equally probable, so loose accepts lift acceptance by 2–4 pp and lengthen the accepted run. Net +2% to +7%.

FLy accelerates exactly where the target distribution is soft. Note the diminishing return: τ=0.2→0.3 is a significant +3.2 t/s, but τ=0.3→0.4 is only +1.3 t/s with CI [−0.1, +2.7], which is not significant. Speed has essentially saturated by τ=0.3.
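The diminishing-return arithmetic can be reproduced from the WritingPrompts table. The sketch below only re-derives numbers from the rounded t/s means; the confidence intervals come from the per-prompt data and cannot be recomputed this way. Function and variable names are illustrative.

```python
from typing import Dict, List

# Eval-phase tokens/s copied from the WritingPrompts table above.
OPEN_TPS: Dict[str, float] = {
    "SPD": 89.6,
    "tau=0.2": 91.4,
    "tau=0.3": 94.6,
    "tau=0.4": 95.9,
}


def relative_speedup_pct(tps: float, baseline_tps: float) -> float:
    """Speed change versus the baseline, in percent."""
    return 100.0 * (tps / baseline_tps - 1.0)


def marginal_gains(tps_by_config: Dict[str, float], order: List[str]) -> List[float]:
    """Absolute t/s gained at each step along `order`, rounded to 0.1 t/s."""
    return [
        round(tps_by_config[b] - tps_by_config[a], 1)
        for a, b in zip(order, order[1:])
    ]


if __name__ == "__main__":
    for name in ("tau=0.2", "tau=0.3", "tau=0.4"):
        pct = relative_speedup_pct(OPEN_TPS[name], OPEN_TPS["SPD"])
        print(f"{name}: {pct:+.1f}% vs SPD")
    print(marginal_gains(OPEN_TPS, ["tau=0.2", "tau=0.3", "tau=0.4"]))
```

Because the inputs are already rounded to 0.1 t/s, a recomputed percentage can differ from the reported one by about 0.1 pp (for τ=0.2 this gives roughly +2.0% against the reported +2.1%).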

4. Quality: no measurable loss, with a characterized residual risk

Strict — GSM8K accuracy (300 problems)

| τ | Accuracy | vs SPD | Harmful flips (SPD✓→FLy✗) |
|---|---|---|---|
| SPD | 97.0% | - | - |
| 0.2 | 96.7% | −0.3 pp | 0.67% |
| 0.3 | 97.0% | ±0.0 | 0.33% |
| 0.4 | 96.7% | −0.3 pp | 0.67% |

FLy ≈ SPD within sampling noise: at n=300, a 0.3 pp difference is a single problem. Flips are bidirectional (FLy also fixes some SPD errors) and non-monotonic in τ, with the minimum at τ=0.3.
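To put "within sampling noise" on a scale, here is a rough sketch of the two quantities involved: the binomial standard error of an accuracy measured on 300 problems, and a flip count between two runs. It treats each run as an independent binomial estimate, which is an assumption (a paired test on the same 300 problems would be more precise), and the per-problem correctness lists in the demo are made up for illustration.

```python
import math
from typing import Sequence, Tuple


def binomial_se_pp(accuracy: float, n: int) -> float:
    """Standard error of an accuracy estimate, in percentage points."""
    return 100.0 * math.sqrt(accuracy * (1.0 - accuracy) / n)


def count_flips(
    baseline_correct: Sequence[bool], variant_correct: Sequence[bool]
) -> Tuple[int, int]:
    """Return (harmful, helpful) flips between two runs over the same problems.

    harmful: baseline right, variant wrong. helpful: baseline wrong, variant right.
    """
    harmful = sum(b and not v for b, v in zip(baseline_correct, variant_correct))
    helpful = sum(v and not b for b, v in zip(baseline_correct, variant_correct))
    return harmful, helpful


if __name__ == "__main__":
    # About 1 pp of standard error at 97% accuracy on 300 problems.
    print(f"SE at 97.0%, n=300: {binomial_se_pp(0.97, 300):.2f} pp")
    # Hypothetical per-problem results, not data from this post.
    spd = [True, True, False, True]
    fly = [True, False, True, True]
    print(count_flips(spd, fly))
```

A standard error near 1 pp means the ±0.3 pp differences in the table are well inside one standard error.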

Open — WritingPrompts (450 prompts, 3 seeds). No ground truth, and FLy/SPD legitimately diverge into different stories, so quality is scored on the objective degradation axis (loops, broken syntax, incoherence), not "which story is better."

| τ | True loops | Speed vs SPD | Breakdown rate |
|---|---|---|---|
| SPD | 0% | - | - |
| 0.2 | 0% (0/450) | +2.1% | 0.3% |
| 0.3 | 0% (0/450) | +5.6% | 0.3% |
| 0.4 | 0.44% (2/450) ⚠️ | +7.0% | 0.3% |

Failure mode. At τ=0.4, 2/450 prompts (across two independent seeds) degraded into repetition loops. The traced mechanism: a loose accept nudges the model onto a semantic path it handles poorly (e.g. listing a bibliography it then hallucinates), and the model deadlocks while self-correcting. FLy's role is to steer, not drive: inside one 20+ token loop only a single token was loose-accepted, and the loop itself is the model's own self-correction deadlock. The same shape appears in math, where a loose accept pushes the model onto a verbose path that overruns the token budget. So the failure mode is unified: loose accept → semantic path drift → model enters a region it handles poorly.

τ=0.3 as a safety ceiling. There were zero true loops across 450 prompts / 3 seeds at τ=0.3, while τ=0.4 reproduced loops in two separate batches (0.44%), so the boundary is repeatable across seeds, not one lucky batch. Combined with saturated speed and the lowest GSM8K flip rate, τ=0.3 is the recommended setting. To be explicit about residual risk rather than overclaim: τ=0.3 accepts ~4.9 loose tokens/prompt with ΔlogP ∈ [0.2, 0.3), almost all harmless wording swaps; the true-loop rate is unobserved at τ≤0.3 but not provably zero. The observed loop rate is non-decreasing in τ (0%, 0%, 0.44%). This is the residual risk of a lossy method: controllable but not eliminable.
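"Unobserved but not provably zero" can be made concrete with the standard exact bound for zero events in n trials. The sketch below assumes the 450 generations are independent trials, which is only approximately true if they are prompts repeated across seeds, so read the result as a rough ceiling and not as a measured rate. The function name is illustrative.

```python
def zero_event_upper_bound(n: int, confidence: float = 0.95) -> float:
    """One-sided upper bound on an event rate when 0 of n trials showed the event.

    Solves (1 - p)^n = 1 - confidence for p (the exact binomial bound);
    for 95% confidence this is close to the "rule of three", 3 / n.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return 1.0 - (1.0 - confidence) ** (1.0 / n)


if __name__ == "__main__":
    ub = zero_event_upper_bound(450)
    print(f"0/450 loops: true rate below {100 * ub:.2f}% at 95% confidence")
    print(f"rule-of-three approximation: {100 * 3 / 450:.2f}%")
```

Under that independence assumption, 0/450 is compatible with a true loop rate of up to roughly 0.66%, so it does not by itself separate τ=0.3 from the 0.44% seen at τ=0.4; the repeatability across seeds carries that part of the argument.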

5. Status and invitation

Solid (at this hardware point):

- The ΔlogP gate is precise: loose accepts are wording-level (mean ΔlogP_loose ≈ 0.20); rejected mismatches are genuine wrong guesses (mean ΔlogP_kill ≈ 4.3, ~70× probability gap).
- Task-dependent speed: +5.6% open / ~−1% strict at τ=0.3, with CIs.
- No measurable quality loss (GSM8K 97.0% = SPD).
- Characterized failure mode whose observed rate does not decrease with τ; τ=0.3 safety ceiling with stated residual risk.
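The ΔlogP figures above translate into probability ratios by exponentiation. The sketch assumes natural logarithms, which is consistent with the quoted ~70× gap (e^4.3 ≈ 74); the function name is illustrative. Exponentiating a mean ΔlogP gives the geometric-mean ratio, not the arithmetic mean of the per-token ratios.

```python
import math


def top1_to_draft_ratio(delta_logp: float) -> float:
    """How many times more probable the target's top token is than the draft token."""
    return math.exp(delta_logp)


if __name__ == "__main__":
    for label, delta in [
        ("mean loose accept", 0.20),
        ("gate threshold tau", 0.30),
        ("mean rejected mismatch", 4.3),
    ]:
        print(f"{label}: dlogP={delta} -> top1 is {top1_to_draft_ratio(delta):.2f}x as likely")
```

At the threshold τ=0.3 the top token is at most about 1.35× as likely as a loose-accepted draft token, while a typical rejected mismatch sits near 74×.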

Limitations: one hardware point, one target, one draft path (MTP); absolute t/s won't transfer; lossy by construction (not for applications needing exact reproduction of the target's greedy output).

Invitation: the gate only reads target logits, so it's draft-agnostic and should drop onto any draft source unchanged. I'd love for people to test it where I can't: EAGLE/Medusa heads, larger targets (where the target is more clearly the bottleneck and gains should grow), slower GPUs, other task mixes. Behavioral metrics (loose%, ΔlogP distributions, loop rate vs τ) should be hardware-independent, so divergence there is a real signal worth digging into.

Branch: https://github.com/Seanspt/llama.cpp/tree/feature/fly-loosely-speculative-decoding

Feedback on the method very welcome.
