deveworld/rethink
ReThink: Entropy Pruning as Re-Reasoning Scaffold

Status: NO-GO (2026-04-17). The core finding was an artifact of truncated context. See Post-Mortem below.

Blog post: vLLM Silently Truncated My Reasoning Traces and I Didn't Notice for a Week

Training-free method for improving LLM reasoning accuracy. Prunes low-entropy steps from reasoning traces using [SKIP] placeholders, then lets the model re-derive the answer from the pruned context with sufficient continuation budget.
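The pruning step can be sketched in a few lines. This is an illustrative helper, not the repo's actual API (the real implementation lives in `compress.py`); the entropy threshold and function name are assumptions:

```python
def prune_low_entropy_steps(steps, entropies, threshold=0.5, skip_token="[SKIP]"):
    """Replace reasoning steps whose mean token entropy falls below
    `threshold` with a placeholder, keeping the trace's overall shape.

    `steps` and `entropies` are parallel lists (one entropy per step).
    Both the 0.5 threshold and this signature are illustrative.
    """
    pruned = [skip_token if h < threshold else step
              for step, h in zip(steps, entropies)]
    # Collapse runs of consecutive placeholders so the scaffold stays compact.
    out = []
    for s in pruned:
        if s == skip_token and out and out[-1] == skip_token:
            continue
        out.append(s)
    return out
```

The model is then asked to continue from the pruned scaffold and re-derive the answer, with the continuation budget (`max_tokens`) as the key variable in the ablation below.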

Post-Mortem

The paper's central claim, that entropy pruning plus a long continuation budget produces a "sign flip" from negative to positive gain, does not hold at properly configured baselines.

What happened

  1. Session 1 ran at max_model_len=8192, which truncated 8% of MATH-500 and 50% of AIME24 traces. The resulting baselines were degraded:

    • MATH-500: 66.0% (published: 94.3% with official eval settings)
    • AIME24: 46.7% (published: 72.6%)
  2. At this degraded baseline, the re-reasoning ablation showed a "sign flip":

    • C7-extract (256 tok): -22.8pp → C7-long (2048 tok): +3.4pp
    • This appeared to be a real finding about continuation budget.
  3. Session 2 ran at max_model_len=16384, fixing AIME truncation:

    • MATH-500: 66.8% (+0.8pp, truncation was not the main issue here)
    • AIME24: 66.7% (+20.0pp; half the traces were previously truncated)
  4. At the corrected baseline, the sign flip disappeared:

    • C7-long (2048 tok): -1.2pp MATH-500, +0.0pp AIME24
    • No variant produced positive gain on MATH-500.
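The truncation that drove this sequence is detectable at generation time: vLLM reports a per-output finish reason ("length" when the token limit was hit, "stop" on a natural stop). A minimal guard like the one below, run over Session 1's traces, would have flagged the 50% AIME24 truncation immediately (the function name is illustrative):

```python
def truncation_report(finish_reasons):
    """Count traces cut off by the token limit.

    `finish_reasons` is a list of per-trace finish reasons as reported
    by the generation backend (e.g. vLLM's RequestOutput, where
    outputs[0].finish_reason is "length" when max_tokens or the model
    context length was exhausted).
    """
    n = len(finish_reasons)
    truncated = sum(1 for r in finish_reasons if r == "length")
    return {"total": n,
            "truncated": truncated,
            "truncated_frac": truncated / n if n else 0.0}
```

Asserting `truncated_frac == 0.0` before any intervention runs is cheap insurance against exactly this artifact.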

Why our baselines were so low

Two separate problems were compounding:

Truncation (max_model_len): Primarily affected AIME24. Fixing this recovered +20pp on AIME24 but only +0.8pp on MATH-500, meaning truncation was not the main issue for MATH-500.

Sampling mismatch: DeepSeek's published 94.3% on MATH-500 uses temperature=0.6, top_p=0.95 with 64 samples per query (pass@1 estimation). We used temperature=0.0 (greedy, single sample). This alone likely accounts for most of the 27pp gap on MATH-500. The 5.9pp residual gap on AIME24 (66.7% vs 72.6%) may also be sampling-related.
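For concreteness, "pass@1 with 64 samples" typically means the unbiased pass@k estimator evaluated at k=1, which reduces to mean accuracy over the samples. A sketch of the standard estimator (DeepSeek's exact harness may differ in details):

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: probability that at least one of k
    draws (without replacement) from n sampled generations, c of which
    are correct, is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# pass@1 over n samples reduces to mean accuracy c / n, e.g.
# pass_at_k(64, 40, 1) == 40 / 64
```

Averaging 64 temperature-0.6 samples per query is a different (and usually higher) number than a single greedy decode, which is why comparing a greedy run against this baseline is apples to oranges.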

Additionally, we used a system prompt ("Please reason step by step..."), while the official DeepSeek recommendation is no system prompt for R1 models.

Why the artifact occurred

The truncated AIME24 baseline (46.7%) was the primary driver of the fake result. Pruning removes steps from the middle of a trace, preserving structural completeness. Re-reasoning from this scaffold could recover answers that truncation (which cuts off ends) had destroyed. This appeared as a gain from pruning but was actually recovery of truncation damage.

The AIME24 +0.0pp result at 16384 is the clearest falsification: zero repairs, zero damage, the method changed nothing.

Lessons learned

  1. Verify baselines against published numbers before running interventions. Our MATH-500 was 28pp below published and AIME24 was 26pp below. Both should have been investigated first.
  2. Understand what "published accuracy" means. The 94.3% uses a specific sampling regime (temperature=0.6, 64 samples). Greedy decoding gives very different numbers. Published baselines are not always greedy.
  3. Separate your confounds. We had truncation AND sampling mismatch at the same time. The 16384 rerun only fixed truncation. The AIME24 result was enough to kill the paper, but we should have diagnosed the MATH-500 gap too.

Session 1 Results (max_model_len=8192, INVALIDATED)

These results are retained for reference but do not support the paper's claims.

              MATH-500 (n=500)           AIME24 (n=30)
max_tokens   Gain    Repair/Damage     Gain    Repair/Damage
---------   ------   -------------    ------   -------------
   256      -22.8pp    11/125         -16.7pp     2/7        ← answer extraction
   512       -2.6pp    18/31           -3.3pp     4/5
  1024       +2.6pp    23/10          +13.3pp     4/0        ← apparent sign flip
  2048       +3.4pp    26/9           +10.0pp     3/0        ← appeared significant

Session 2 Results (max_model_len=16384, DEFINITIVE)

              MATH-500 (n=500)           AIME24 (n=30)
max_tokens   Gain    Repair/Damage     Gain    Repair/Damage
---------   ------   -------------    ------   -------------
   256      -25.4pp     7/134         -23.3pp     0/7
   512       -5.6pp     9/37          -13.3pp     0/4
  1024       -1.2pp     6/12           -3.3pp     0/1
  2048       -1.2pp     5/11           +0.0pp     0/0        ← no gain
  4096       -0.6pp     6/9            +0.0pp     0/0
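The Gain and Repair/Damage columns in both tables come from a per-problem comparison against the baseline. A minimal version of that bookkeeping (a hypothetical helper; the repo's `analysis.py` holds the real one):

```python
def repair_damage(baseline_correct, treated_correct):
    """Compare per-problem correctness before and after an intervention.

    `baseline_correct` and `treated_correct` are parallel lists of bools.
    A repair flips wrong -> right; damage flips right -> wrong.
    Gain in percentage points is (repairs - damage) / n * 100.
    """
    repairs = sum(1 for b, t in zip(baseline_correct, treated_correct)
                  if not b and t)
    damage = sum(1 for b, t in zip(baseline_correct, treated_correct)
                 if b and not t)
    gain_pp = 100.0 * (repairs - damage) / len(baseline_correct)
    return repairs, damage, gain_pp
```

For example, 5 repairs against 11 damage over n=500 gives (5 - 11) / 500 = -1.2pp, matching the 2048-token MATH-500 row above.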

Quick Start

uv sync
uv run pytest tests/ -v           # 108 tests
uv run python scripts/download_data.py

Project Structure

src/rethink/
  segment.py        Step segmentation (transition markers + blank lines + code fences)
  signals.py        EAT via vLLM logprobs + ACR + commitment detection
  anchors.py        Rule-based anchor classification
  compress.py       Trace compression with [SKIP] placeholders
  reanswer.py       Pass 2: native continuation (batched)
  run_condition.py  All condition handlers (batched execution)
  evaluate.py       LaTeX normalization, AIME integer extraction, code execution
  analysis.py       Results tables, cost frontier, difficulty stratification

scripts/
  rereasoning_ablation.py   THE key experiment: max_tokens sweep
  day1_generate.py          C0 trace generation
  day2_compute_signals.py   EAT/ACR signal computation
  day3_run_experiment.py    Condition execution
  analyze_results.py        Results analysis

Conditions Reference

Condition          Description                             MATH-500 (8192)  MATH-500 (16384)
----------------   ------------------------------------   ---------------  ----------------
C0                 Baseline                                    66.0%            66.8%
C7-extract         Entropy prune + extract (256 tok)          -22.8pp          -25.4pp
C7-scaffold        Entropy prune + re-reason (1024 tok)        +2.6pp           -1.2pp
C7-scaffold-long   Entropy prune + re-reason (2048 tok)        +3.4pp           -1.2pp
C9                 SC@2 (2x cost)                              +4.2pp       not re-tested
C4                 Anchor-preserve (original method)           +0.2pp       not re-tested
