Status: NO-GO (2026-04-17). The core finding was an artifact of truncated context. See Post-Mortem below.
Blog post: vLLM Silently Truncated My Reasoning Traces and I Didn't Notice for a Week
Training-free method for improving LLM reasoning accuracy. Prunes low-entropy steps from reasoning traces using [SKIP] placeholders, then lets the model re-derive the answer from the pruned context with sufficient continuation budget.
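The pruning step can be sketched as follows. This is a minimal illustration, not the repo's implementation: the function names, the entropy proxy (mean negative logprob per step), and the keep-ratio thresholding are all assumptions for exposition.

```python
def step_entropy(token_logprobs):
    """Proxy for a step's entropy: mean negative logprob over its tokens."""
    return -sum(token_logprobs) / len(token_logprobs)

def prune_low_entropy(steps, step_logprobs, keep_ratio=0.5):
    """Replace the lowest-entropy steps with a [SKIP] placeholder.

    `steps` is a list of step strings; `step_logprobs` is a parallel list of
    per-token logprob lists. Low-entropy steps (the model was confident and
    likely redundant) are pruned; high-entropy steps are kept so the model
    can re-derive the answer from the remaining scaffold.
    """
    order = sorted(range(len(steps)), key=lambda i: step_entropy(step_logprobs[i]))
    n_keep = max(1, int(len(steps) * keep_ratio))
    drop = set(order[:len(steps) - n_keep])
    return [("[SKIP]" if i in drop else s) for i, s in enumerate(steps)]
```

The pruned list is then rejoined into a prompt and the model continues from it with a generous `max_tokens` budget.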
The paper's central claim, that entropy pruning + long continuation budget produces a "sign flip" from negative to positive gain, does not hold at proper baselines.
Session 1 ran at max_model_len=8192, which truncated 8% of MATH-500 and 50% of AIME24 traces. The resulting baselines were degraded:
- MATH-500: 66.0% (published: 94.3% with official eval settings)
- AIME24: 46.7% (published: 72.6%)
At this degraded baseline, the re-reasoning ablation showed a "sign flip":
- C7-extract (256 tok): -22.8pp → C7-long (2048 tok): +3.4pp
- This appeared to be a real finding about continuation budget.
Session 2 ran at max_model_len=16384, fixing AIME truncation:
- MATH-500: 66.8% (+0.8pp; truncation was not the main issue here)
- AIME24: 66.7% (+20.0pp; half the traces were previously truncated)
At the corrected baseline, the sign flip disappeared:
- C7-long (2048 tok): -1.2pp MATH-500, +0.0pp AIME24
- No variant produced positive gain on MATH-500.
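The fix would have been cheap to catch up front: vLLM reports a per-completion `finish_reason` that is `"length"` when generation hit the token limit rather than a stop condition. A small guard over those values (sketch below; the helper name is ours) flags a truncated benchmark run before any intervention is measured.

```python
def truncation_report(finish_reasons):
    """Count generations cut off by the length limit.

    `finish_reasons` is the per-completion finish_reason from vLLM outputs:
    "stop" means EOS/stop string, "length" means max_tokens/max_model_len
    was hit. A non-trivial "length" fraction on a reasoning benchmark means
    the baseline accuracy is not trustworthy.
    """
    n = len(finish_reasons)
    truncated = sum(1 for r in finish_reasons if r == "length")
    return truncated, n, truncated / n if n else 0.0
```

At max_model_len=8192 this check would have shown roughly half of AIME24 traces ending with `"length"`.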
Two separate problems were compounding:
- Truncation (max_model_len): primarily affected AIME24. Fixing this recovered +20pp on AIME24 but only +0.8pp on MATH-500, so truncation was not the main issue for MATH-500.
- Sampling mismatch: DeepSeek's published 94.3% on MATH-500 uses temperature=0.6, top_p=0.95 with 64 samples per query (pass@1 estimation). We used temperature=0.0 (greedy, single sample). This alone likely accounts for most of the 27pp gap on MATH-500. The 5.9pp residual gap on AIME24 (66.7% vs 72.6%) may also be sampling-related.
Additionally, we used a system prompt ("Please reason step by step..."), while the official DeepSeek recommendation is no system prompt for R1 models.
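For concreteness, the published protocol's pass@1 is not a single greedy run; it is the fraction of correct samples, averaged over problems. A sketch (assuming 0/1 correctness flags; the helper names are ours):

```python
def pass_at_1(correct_flags):
    """Unbiased pass@1 estimate for one problem from n sampled completions:
    the fraction of samples that are correct."""
    return sum(correct_flags) / len(correct_flags)

def benchmark_pass_at_1(per_problem_flags):
    """Average per-problem pass@1 over the benchmark, as in the DeepSeek-R1
    protocol (64 temperature-0.6 samples per problem)."""
    return sum(pass_at_1(f) for f in per_problem_flags) / len(per_problem_flags)
```

A greedy single-sample run measures something different: accuracy of the mode of the sampling distribution, which can sit well below (or above) the sampled pass@1.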
The truncated AIME24 baseline (46.7%) was the primary driver of the fake result. Pruning removes steps from the middle of a trace, preserving structural completeness. Re-reasoning from this scaffold could recover answers that truncation (which cuts off ends) had destroyed. This appeared as a gain from pruning but was actually recovery of truncation damage.
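The asymmetry is easy to see on a toy trace: truncation and pruning damage different parts of it. A minimal illustration (function names are ours):

```python
def truncate_end(steps, budget):
    """Context-length truncation: drops the END of the trace, destroying
    the final answer and any late derivation."""
    return steps[:budget]

def prune_middle(steps, drop_idx):
    """Entropy pruning: replaces INTERIOR steps with [SKIP], keeping the
    problem setup and the concluding steps intact."""
    return [("[SKIP]" if i in drop_idx else s) for i, s in enumerate(steps)]
```

Re-reasoning from a pruned scaffold still sees the trace's ending; re-reasoning from a truncated trace can regenerate the ending that was cut off. At the truncated baseline those regenerations were scored as gains from pruning.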
The AIME24 +0.0pp result at 16384 is the clearest falsification: zero repairs, zero damage; the method changed nothing.
- Verify baselines against published numbers before running interventions. Our MATH-500 was 28pp below published and AIME24 was 26pp below. Both should have been investigated first.
- Understand what "published accuracy" means. The 94.3% uses a specific sampling regime (temperature=0.6, 64 samples). Greedy decoding gives very different numbers. Published baselines are not always greedy.
- Separate your confounds. We had truncation AND sampling mismatch at the same time. The 16384 rerun only fixed truncation. The AIME24 result was enough to kill the paper, but we should have diagnosed the MATH-500 gap too.
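The first lesson can be automated. A hedged sketch of a pre-flight check (the tolerance is an illustrative choice, not a principled threshold):

```python
def check_baseline(measured_pct, published_pct, tol_pp=3.0):
    """Compare a measured baseline against the published number.

    Returns (ok, gap_pp). A gap beyond `tol_pp` percentage points means
    stop and diagnose (truncation, sampling regime, prompt format) before
    running any intervention.
    """
    gap = published_pct - measured_pct
    return gap <= tol_pp, gap
```

Run against Session 1's numbers, both benchmarks fail this check, which is exactly the signal that was missed.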
These results are retained for reference but do not support the paper's claims.
Session 1, max_model_len=8192 (truncated baselines); MATH-500 n=500, AIME24 n=30:

| max_tokens | MATH-500 gain | MATH-500 repair/damage | AIME24 gain | AIME24 repair/damage | Note |
|---|---|---|---|---|---|
| 256 | -22.8pp | 11/125 | -16.7pp | 2/7 | answer extraction |
| 512 | -2.6pp | 18/31 | -3.3pp | 4/5 | |
| 1024 | +2.6pp | 23/10 | +13.3pp | 4/0 | apparent sign flip |
| 2048 | +3.4pp | 26/9 | +10.0pp | 3/0 | appeared significant |
Session 2, max_model_len=16384 (corrected baselines); MATH-500 n=500, AIME24 n=30:

| max_tokens | MATH-500 gain | MATH-500 repair/damage | AIME24 gain | AIME24 repair/damage | Note |
|---|---|---|---|---|---|
| 256 | -25.4pp | 7/134 | -23.3pp | 0/7 | |
| 512 | -5.6pp | 9/37 | -13.3pp | 0/4 | |
| 1024 | -1.2pp | 6/12 | -3.3pp | 0/1 | |
| 2048 | -1.2pp | 5/11 | +0.0pp | 0/0 | no gain |
| 4096 | -0.6pp | 6/9 | +0.0pp | 0/0 | |
uv sync
uv run pytest tests/ -v # 108 tests
uv run python scripts/download_data.py

src/rethink/
segment.py Step segmentation (transition markers + blank lines + code fences)
signals.py EAT via vLLM logprobs + ACR + commitment detection
anchors.py Rule-based anchor classification
compress.py Trace compression with [SKIP] placeholders
reanswer.py Pass 2: native continuation (batched)
run_condition.py All condition handlers (batched execution)
evaluate.py LaTeX normalization, AIME integer extraction, code execution
analysis.py Results tables, cost frontier, difficulty stratification
scripts/
rereasoning_ablation.py THE key experiment: max_tokens sweep
day1_generate.py C0 trace generation
day2_compute_signals.py EAT/ACR signal computation
day3_run_experiment.py Condition execution
analyze_results.py Results analysis
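For illustration, a minimal version of the segmentation rule (split at blank lines and sentence-initial transition markers). The marker list is an assumption, and code-fence handling from segment.py is omitted:

```python
import re

# Illustrative transition markers; segment.py's actual list may differ.
MARKERS = re.compile(r"^(So|Therefore|Wait|Alternatively|First|Next|Now|Thus)\b")

def segment_steps(trace):
    """Split a reasoning trace into steps at blank lines and transition
    markers. Simplified sketch of the segmentation described above."""
    steps, current = [], []
    for line in trace.splitlines():
        stripped = line.strip()
        # Flush the current step at a blank line, or when a new step
        # begins with a transition marker.
        if not stripped or (MARKERS.match(stripped) and current):
            if current:
                steps.append("\n".join(current))
                current = []
        if stripped:
            current.append(line)
    if current:
        steps.append("\n".join(current))
    return steps
```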
| Condition | Description | MATH-500 (8192) | MATH-500 (16384) |
|---|---|---|---|
| C0 | Baseline | 66.0% | 66.8% |
| C7-extract | Entropy prune + extract (256 tok) | -22.8pp | -25.4pp |
| C7-scaffold | Entropy prune + re-reason (1024 tok) | +2.6pp | -1.2pp |
| C7-scaffold-long | Entropy prune + re-reason (2048 tok) | +3.4pp | -1.2pp |
| C9 | SC@2 (2x cost) | +4.2pp | not re-tested |
| C4 | Anchor-preserve (original method) | +0.2pp | not re-tested |