FLy: a lossy, draft-agnostic "loose accept" gate for speculative decoding
FLy (Training-Free Loosely Speculative Decoding) is a small, training-free change to the speculative-decoding acceptance step. Standard speculative decoding accepts a draft token only on an exact match with the target's argmax. FLy also accepts a draft token when the target itself considers it almost as likely as its own top choice — gated by a log-probability margin ΔlogP and a deferred verification window.
It is lossy (output can diverge from the target's greedy trajectory) and it touches the sampling core, so this is a Discussion rather than a PR.
Results on the MTP self-draft path (Qwen3-27B, RTX 5090, greedy temp=0, recommended τ=0.3):
Open-ended generation (creative writing): faster than lossless speculative decoding (SPD), +5.6% t/s, 95% CI [+4.0, +7.1], n=450.
Strict generation (math): about 1% slower than lossless SPD (−1.0% t/s, 95% CI [−1.8, −0.2]), with no measurable accuracy loss (GSM8K 97.0% vs 97.0%, n=300).
The speed effect is task-dependent, with an understood mechanism.
The gate is draft-agnostic — it only reads target logits — so it should drop onto any draft source unchanged (EAGLE/Medusa heads, external draft models, etc.). I'd love for people to try it where I can't.
1. The method
For each drafted position the target produces logits. FLy computes:
ΔlogP = logP_target(target_top1) − logP_target(draft_token) — how much less likely the draft token is than the target's own top choice, under the target.
An exact match is accepted as in normal SPD.
A mismatch is loose-accepted only if ΔlogP < τ (the draft token is nearly as probable as the target's pick) and it passes control-token safety checks; otherwise it's rejected and normal SPD verification resumes.
A deferred verification window W: a loose accept is provisional; if the next W positions don't re-converge, it rolls back.
Intuition: ΔlogP directly bounds how much worse an accepted token can be in the target's own eyes. τ is a single speed/fidelity knob; τ → 0 reduces FLy to exact-match SPD. Control-sensitive tokens (chat markers, EOS, …) are never loose-accepted.
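The per-position decision described above can be sketched in a few lines. This is a minimal illustration, not the code on the branch: the function names (log_softmax, fly_gate, verify_draft), the return labels, and the control_tokens set are all hypothetical, the control-token check is reduced to "never loose-accept a draft token in the set", and the deferred verification window W is deliberately left out because the post does not specify its re-convergence rule.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def fly_gate(target_logits, draft_token, tau, control_tokens=frozenset()):
    """Classify one drafted position as 'exact', 'loose', or 'reject'.

    Returns (decision, emitted_token). On a reject the target's own
    top-1 token is emitted, as in ordinary greedy speculative decoding.
    """
    logp = log_softmax(target_logits)
    top1 = max(range(len(logp)), key=logp.__getitem__)
    if draft_token == top1:
        return "exact", top1
    # How much less likely the draft token is than the target's pick.
    delta = logp[top1] - logp[draft_token]
    # Assumption: control-sensitive tokens are simply never loose-accepted.
    if draft_token not in control_tokens and delta < tau:
        return "loose", draft_token
    return "reject", top1

def verify_draft(per_position_logits, draft_tokens, tau, control_tokens=frozenset()):
    """Walk the draft left to right and stop at the first rejection.

    Assumption: no deferred window W; a loose accept is treated as final.
    """
    decisions = []
    for logits, tok in zip(per_position_logits, draft_tokens):
        decision, emitted = fly_gate(logits, tok, tau, control_tokens)
        decisions.append((decision, emitted))
        if decision == "reject":
            break
    return decisions
```

Because the softmax normalizer cancels in the difference, ΔlogP equals the difference of the two raw target logits, so a real implementation would not need a full softmax to evaluate the gate.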
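2. Setup
Target and hardware: Qwen3-27B, RTX 5090
Draft path: MTP self-draft (--spec-type draft-mtp,n-max=4)
Sampling: greedy, temp=0, fixed seed
FLy parameters: W=6, sweep τ ∈ {0.2, 0.3, 0.4}
Metric: eval-phase tokens/s from slot print_timing (generation only)
Baseline: lossless MTP speculative decoding (same flags minus --spec-fly)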
These are one hardware point and one draft path. Speculative gains depend strongly on whether the target is the bottleneck; FLy's behavioral metrics (loose%, gate decisions) should be hardware-independent, but absolute t/s should not be extrapolated across hardware.
3. Speed: the effect is task-dependent
CUDA graphs on (default), large-n (SE on the mean ≈ 0.5 t/s).
Strict — GSM8K (300 problems, ~320 tok each)
| Config | t/s | vs SPD [95% CI] | draft acc | loose% |
|---|---|---|---|---|
| SPD | 144.4 | - | 0.849 | 0% |
| FLy τ=0.2 | 142.3 | −1.4% [−2.2, −0.6] | 0.858 | 0.4% |
| FLy τ=0.3 | 143.0 | −1.0% [−1.8, −0.2] | 0.861 | 0.5% |
| FLy τ=0.4 | 142.5 | −1.4% [−2.2, −0.5] | 0.865 | 0.7% |
Open — WritingPrompts (creative writing, ~500+ tok)
| Config | t/s | vs SPD [95% CI] | draft acc | loose% |
|---|---|---|---|---|
| SPD | 89.6 | - | 0.416 | 0% |
| FLy τ=0.2 | 91.4 | +2.1% [+0.5, +3.6] | 0.435 | 2.3% |
| FLy τ=0.3 | 94.6 | +5.6% [+4.0, +7.1] | 0.443 | 3.5% |
| FLy τ=0.4 | 95.9 | +7.0% [+5.4, +8.6] | 0.454 | 4.6% |
The two tasks point in opposite directions, and that's the real signal.
Why — FLy's speedup comes from loose-accepting near-tie draft tokens, which only pays off when there's room:
Strict (math): draft acceptance is already high (~85%) and the target distribution is sharp, so there are few near-ties to loose-accept and gate overhead slightly exceeds the gain (about −1%; small, though the CIs exclude zero at every τ tested).
Open (creative): draft acceptance is low (~42%) and the distribution is soft — many wordings are nearly equally probable, so loose accepts lift acceptance 2–4 pp and lengthen the accepted run. Net +2…+7%.
FLy accelerates exactly where the target distribution is soft. Note the diminishing return: τ=0.2→0.3 is a significant +3.2 t/s, but τ=0.3→0.4 is +1.3 t/s, CI [−0.1, +2.7] — not significant. Speed has essentially saturated by τ=0.3.
4. Quality: no measurable loss, with a characterized residual risk
Strict — GSM8K accuracy (300 problems)
| τ | Accuracy | vs SPD | Harmful flips (SPD✓→FLy✗) |
|---|---|---|---|
| SPD | 97.0% | - | - |
| 0.2 | 96.7% | −0.3 pp | 0.67% |
| 0.3 | 97.0% | ±0.0 | 0.33% |
| 0.4 | 96.7% | −0.3 pp | 0.67% |
FLy ≈ SPD within sampling noise. Flips are bidirectional (FLy also fixes some SPD errors) and non-monotonic in τ — τ=0.3 is the minimum.
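One way to check the "within sampling noise" reading is a binomial interval on the baseline accuracy. The sketch below computes a 95% Wilson score interval; wilson_interval is an illustrative helper name, and the count 291/300 is inferred from the reported 97.0% rather than stated in the post.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Assumption: 97.0% of 300 problems corresponds to 291 correct answers.
lo, hi = wilson_interval(291, 300)
print(f"SPD accuracy 95% interval: [{lo:.3f}, {hi:.3f}]")
```

The interval runs from roughly 94.4% to 98.4%, so the 96.7% measured at τ=0.2 and τ=0.4 (290/300, one problem fewer) sits well inside it. This treats the two runs as independent samples; a paired test on the same 300 problems would be the stricter comparison.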
Open — WritingPrompts (450 prompts, 3 seeds). No ground truth, and FLy/SPD legitimately diverge into different stories, so quality is scored on the objective degradation axis (loops, broken syntax, incoherence), not "which story is better."
| τ | True loops | Speed vs SPD | Breakdown rate |
|---|---|---|---|
| SPD | 0% | - | - |
| 0.2 | 0% (0/450) | +2.1% | 0.3% |
| 0.3 | 0% (0/450) | +5.6% | 0.3% |
| 0.4 | 0.44% (2/450) ⚠️ | +7.0% | 0.3% |
Failure mode. At τ=0.4, 2/450 prompts (across two independent seeds) degraded into repetition loops. The mechanism (traced): a loose accept nudges the model onto a semantic path it handles poorly (e.g. listing a bibliography it then hallucinates), and the model dead-locks while self-correcting. FLy's role is to steer, not drive — inside one 20+ token loop only a single token was loose-accepted; the loop is the model's own self-correction dead-lock. The same shape appears in math (a loose accept pushes onto a verbose path that overruns the token budget). So the failure mode is unified: loose accept → semantic path drift → model enters a region it handles poorly.
τ=0.3 as a safety ceiling. Zero true loops were observed across 450 prompts / 3 seeds at τ=0.3, while τ=0.4 reproduced loops in two separate batches (2/450, 0.44%), so the boundary is repeatable across seeds rather than an artifact of one batch. Combined with saturated speed and the lowest GSM8K flip rate, τ=0.3 is the recommended setting. To be explicit about residual risk rather than overclaim: τ=0.3 accepts ~4.9 loose tokens/prompt with ΔlogP ∈ [0.2, 0.3), almost all harmless wording swaps; the true-loop rate is unobserved at τ≤0.3 but not provably zero. In these runs it does not decrease as τ grows (0%, 0%, 0.44%), which is the residual risk of a lossy method: controllable but not eliminable.
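"Unobserved but not provably zero" can be given a rough number. Assuming the 450 runs behave like independent trials (an assumption; they share prompts across 3 seeds), the largest loop rate still consistent with seeing zero loops at 95% confidence follows from (1 − p)^n = 0.05. The helper name zero_event_upper_bound is illustrative.

```python
def zero_event_upper_bound(n, confidence=0.95):
    """Upper bound on a per-trial event rate after 0 events in n independent trials.

    Solves (1 - p) ** n = 1 - confidence for p; for 95% this is close to
    the familiar "rule of three" approximation 3 / n.
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / n)

bound = zero_event_upper_bound(450)
print(f"95% upper bound on loop rate after 0/450: {bound:.2%}")
```

Under that assumption the bound is about 0.66%, which is higher than the 0.44% observed at τ=0.4. The 0/450 result at τ≤0.3 therefore does not by itself separate the two settings statistically; the case for τ=0.3 rests on it together with the traced mechanism and the speed saturation.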
5. Status and invitation
Solid (this point):
ΔlogP gate is precise — loose accepts are wording-level (mean ΔlogP_loose ≈ 0.20); rejected mismatches are genuine wrong guesses (mean ΔlogP_kill ≈ 4.3, ~70× probability gap).
Task-dependent speed: +5.6% open / ~−1% strict at τ=0.3, with CIs.
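For reference, a ΔlogP value converts to a probability ratio by exponentiation. This assumes natural logarithms, which is consistent with the ~70× figure quoted above for ΔlogP ≈ 4.3:

```latex
\frac{p_{\text{target}}(\text{draft})}{p_{\text{target}}(\text{top1})} = e^{-\Delta\log P},
\qquad
e^{-0.20} \approx 0.82, \quad
e^{-0.30} \approx 0.74, \quad
e^{-4.3} \approx \tfrac{1}{74}.
```

So a token loose-accepted at the τ=0.3 boundary is still at least about 74% as probable as the target's top choice, while a typical rejected mismatch is roughly 74 times less probable.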
Limitations: one hardware point, one target, one draft path (MTP); absolute t/s won't transfer; lossy by construction (not for applications needing exact reproduction of the target's greedy output).
Invitation: the gate only reads target logits, so it's draft-agnostic and should drop onto any draft source unchanged. I'd love for people to test it where I can't — EAGLE/Medusa heads, larger targets (where the target is more clearly the bottleneck and gains should grow), slower GPUs, other task mixes. Behavioral metrics (loose%, ΔlogP distributions, loop rate vs τ) are hardware-independent, so divergence there is a real signal worth digging into.
Branch: https://github.com/Seanspt/llama.cpp/tree/feature/fly-loosely-speculative-decoding
Feedback on the method very welcome.