Grammar-constrained chain-of-thought for reasoning LMs. Zero training. 22× compression on HumanEval+; +14 pp on a recent LiveCodeBench public-test slice.
On HumanEval+ (164 problems) with unsloth/Qwen3.6-35B-A3B-GGUF at Q4_K_M, running on one H100 via llama-cpp-python:
| Mode | pass@1 | mean thinking tokens | mean total tokens |
|---|---|---|---|
| Free-form `<think>...</think>` | 151 / 164 = 92.1% | 3087 | 3410 |
| Grammar-constrained `<think>GOAL/APPROACH/EDGE</think>` | 152 / 164 = 92.7% | 138 | 408 |
| Prompt-only terse GOAL/APPROACH/EDGE | 153 / 164 = 93.3% | 2298 | 2764 |
| FSM vs FREE Δ | +0.6 pp | 22.4× shorter | 8.4× shorter |
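The Δ row follows directly from the rows above it; a quick sketch of the arithmetic:

```python
# Reproduce the FSM vs FREE delta row from the HumanEval+ table.
free_pass, fsm_pass, n = 151, 152, 164
free_think, fsm_think = 3087, 138
free_total, fsm_total = 3410, 408

pass_delta_pp = round((fsm_pass - free_pass) / n * 100, 1)  # percentage points
think_ratio = round(free_think / fsm_think, 1)              # thinking-token compression
total_ratio = round(free_total / fsm_total, 1)              # total-token compression
```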
No distillation. No fine-tuning. No reward model. A ~20-line GBNF grammar applied to the <think> block at inference time matches full-thinking accuracy with an order of magnitude fewer tokens. The prompt-only terse control slightly improves pass@1 on this run but still uses 2298 thinking tokens on average; the grammar is what reliably enforces the compact token regime.
On a harder, recent LiveCodeBench v6 LeetCode slice (50 problems, contest_date >= 2025-01-01, public functional tests), the richer grammar in `grammars/fsm_grammar_lcb_plan.gbnf` improves both reliability and token use:
| Mode | pass@1 | mean thinking tokens | mean total tokens | extraction issues |
|---|---|---|---|---|
| Free-form `<think>...</think>` | 25 / 50 = 50.0% | 11553 | 13632 | empty_code=18 |
| FSM_PLAN GOAL/STATE/ALGO/EDGE/VERIFY | 32 / 50 = 64.0% | 267 | 2743 | none |
| FSM_PLAN vs FREE Δ | +14.0 pp | 43.3× shorter | 5.0× shorter | - |
The LiveCodeBench result is a public-test result, not an official private leaderboard score. It also reveals a useful caveat: on harder tasks, the model can move some deliberation into comments or post-think answer text, so total tokens and comment bloat need to be tracked alongside <think> tokens.
See RESULTS.md for the full experimental writeup, per-problem breakdown, and discussion of limitations.
Reasoning models such as Qwen3, DeepSeek-R1, and QwQ spend thousands of tokens on verbose prose thinking: exploring alternatives, restating, hedging. This work shows, on one benchmark, that for a large chunk of that reasoning, the verbose scaffolding isn't doing real work. The model already has the reasoning capability internally; the grammar constraint just extracts it in a denser form.
The engineering implication is direct: inference-time thinking compute can be cut dramatically via a grammar file alone, with no training pipeline or serving changes beyond a GBNF argument. HumanEval+ shows the clean compression case; LiveCodeBench shows the harder-task behavior, where the grammar also prevents many no-code failures but can displace some reasoning into the answer channel.
Single GBNF grammar (see `grammars/fsm_grammar.gbnf`):

```
root ::= think code
think ::= "<think>\n" "GOAL: " line "APPROACH: " line "EDGE: " line "</think>\n\n"
line ::= [^\n]+ "\n"
code ::= [\x09\x0A\x0D\x20-\x7E]+
```
Three lines of structured plan. Code after is unconstrained.
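For intuition, the shape the grammar enforces can be checked with a plain regex. This is a hedged sketch of the target format only; the real constraint is applied token-by-token by the GBNF engine during sampling, not post hoc:

```python
import re

# Mirrors grammars/fsm_grammar.gbnf: three labeled one-line fields
# inside <think>, followed by an unconstrained code body.
THINK_RE = re.compile(
    r"<think>\n"
    r"GOAL: [^\n]+\n"
    r"APPROACH: [^\n]+\n"
    r"EDGE: [^\n]+\n"
    r"</think>\n\n"
)

def matches_think_format(text: str) -> bool:
    return THINK_RE.match(text) is not None
```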
The grammar is applied via llama-cpp-python's server, which accepts a `grammar` field in the request body. At each generation step inside <think>, logits for tokens that would violate the grammar are masked to -∞, and the model samples from the constrained distribution.
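Conceptually, the masking step looks like this (an illustrative sketch, not llama.cpp's actual implementation; `allowed` stands in for the token ids the grammar's current parser state permits):

```python
import math

def constrained_softmax(logits: list[float], allowed: set[int]) -> list[float]:
    # Mask grammar-violating token ids to -inf, then renormalize.
    masked = [l if i in allowed else float("-inf") for i, l in enumerate(logits)]
    m = max(masked)
    exps = [math.exp(l - m) for l in masked]  # exp(-inf) == 0.0
    z = sum(exps)
    return [e / z for e in exps]
```

Disallowed tokens end up with exactly zero probability, so any sampling strategy (greedy, top-p, temperature) stays inside the grammar.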
Tested on an H100 with CUDA 12.4.
```sh
uv sync
# llama-cpp-python is pulled from the prebuilt cu124 index via tool.uv.sources.
# CUDA runtime libs are pinned so it coexists with torch's cu13 default.
# Before running, expose the bundled CUDA libs to the dynamic loader:
export LD_LIBRARY_PATH=$(uv run python -c "
import site, os
sp = site.getsitepackages()[0]
nv = os.path.join(sp, 'nvidia')
print(':'.join(os.path.join(nv, d, 'lib') for d in os.listdir(nv)
               if os.path.isdir(os.path.join(nv, d, 'lib'))))
")
```

For long generations, native llama-server is faster than llama-cpp-python and can use llama.cpp speculative decoding:
```sh
sudo apt-get update
sudo apt-get install -y git cmake build-essential libcurl4-openssl-dev
git clone https://github.com/ggml-org/llama.cpp ~/llama.cpp
cmake -S ~/llama.cpp -B ~/llama.cpp/build \
    -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
cmake --build ~/llama.cpp/build --config Release -j
export PATH="$HOME/llama.cpp/build/bin:$PATH"
llama-server --help | head
```

The repo includes `run_llama_server.sh`, which defaults to:
```sh
llama-server -hf ggml-org/Qwen3.6-27B-GGUF --spec-default \
    --host 127.0.0.1 --port 8000 -c 32768 -ngl 999 \
    --flash-attn on --reasoning-format none
```

The older `run_server.sh` still starts llama-cpp-python and is kept for reproducing the original run.
```sh
export HF_TOKEN=hf_...
huggingface-cli download unsloth/Qwen3.6-35B-A3B-GGUF \
    --include "*Q4_K_M*" --local-dir ~/models/qwen3.6-gguf
```

The server can run in the foreground in one pane, or in the background from a single terminal:
```sh
BACKGROUND=1 ./run_llama_server.sh
tail -f server.log
curl http://127.0.0.1:8000/v1/models
```

By default this downloads and serves ggml-org/Qwen3.6-27B-GGUF through native llama-server with `--spec-default` and `--reasoning-format none`, so <think>...</think> remains visible for token accounting. Override with env vars:
```sh
HF_REPO=ggml-org/Qwen3.6-27B-GGUF N_CTX=32768 BACKGROUND=1 ./run_llama_server.sh
MODEL_PATH=/path/to/model.gguf BACKGROUND=1 ./run_llama_server.sh
KV_TYPE=q8_0 BACKGROUND=1 ./run_llama_server.sh
REASONING_FORMAT=deepseek-legacy BACKGROUND=1 ./run_llama_server.sh
```

Background mode writes server.log and server.pid. Stop it with:
```sh
kill "$(cat server.pid)"
```

If you prefer the original Python server, use `./run_server.sh`. It auto-discovers the GGUF from `~/.cache/huggingface/hub/` or `~/models/`, uses an 8-bit KV cache (q8_0), flash attention, and n_ctx=65536.
```sh
# Smoke test
uv run python fsm_vs_free_eval.py --n-problems 10 --max-tokens 4096

# Full HumanEval+ with FREE + FSM + PROMPT_TERSE controls
uv run python fsm_vs_free_eval.py --n-problems 164 --max-tokens 8192 --only all

# MBPP+
uv run python fsm_vs_free_eval.py --n-problems 100 --dataset mbpp --max-tokens 8192

# LiveCodeBench v6 recent subset (public functional tests)
uv run python fsm_vs_free_eval.py --dataset livecodebench \
    --lcb-version release_v6 --date-cutoff 2025-01-01 --platform leetcode \
    --n-problems 50 --max-tokens 16384 --only all \
    --out-dir lcb_v6_2025_01_01_n50_all

# LiveCodeBench FSM-only baseline grammar
uv run python fsm_vs_free_eval.py --dataset livecodebench \
    --lcb-version release_v6 --date-cutoff 2025-01-01 --platform leetcode \
    --n-problems 50 --max-tokens 16384 --only fsm \
    --grammar-file grammars/fsm_grammar.gbnf \
    --out-dir lcb_v6_2025_01_01_fsm_base_grammar

# LiveCodeBench FSM-only plan grammar
uv run python fsm_vs_free_eval.py --dataset livecodebench \
    --lcb-version release_v6 --date-cutoff 2025-01-01 --platform leetcode \
    --n-problems 50 --max-tokens 16384 --only fsm \
    --grammar-file grammars/fsm_grammar_lcb_plan.gbnf \
    --out-dir lcb_v6_2025_01_01_fsm_lcb_plan
```

Each run produces, in `fsm_vs_free/`:
- `results.jsonl`: per-problem raw generations, extracted think/code, pass/fail, errors, extraction metadata
- `summary.json`: aggregate stats, pass-set overlap, and failure accounting
- `per_problem.md`: human-readable report with outcome tags (🔺 / 🔻 / 🟰 / ❌)
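A minimal aggregation over the per-problem records might look like the following. The field names (`passed`, `think_tokens`) are illustrative assumptions, not the exact schema of `results.jsonl`:

```python
import statistics

def summarize(records: list[dict]) -> dict:
    # records: one dict per problem, as parsed from a results.jsonl-style file.
    n = len(records)
    return {
        "pass_at_1": sum(bool(r["passed"]) for r in records) / n,
        "think_tokens_mean": statistics.mean(r["think_tokens"] for r in records),
    }

# Hypothetical records for illustration:
records = [
    {"task_id": "HumanEval/0", "passed": True, "think_tokens": 130},
    {"task_id": "HumanEval/1", "passed": False, "think_tokens": 150},
]
```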
To make a side-by-side generation animation for one problem:
```sh
uv run python make_tps_animation.py \
    --task-id 3781 \
    --left-results lcb_v6_2025_01_01_free_n50/results.jsonl \
    --left-mode free --left-label FREE --left-seconds 237 \
    --right-results lcb_v6_2025_01_01_fsm_lcb_plan_n50/results.jsonl \
    --right-mode fsm --right-label FSM_PLAN --right-seconds 279 \
    --out animations/lcb_3781_free_vs_fsm_plan.html
```

The summary also reports `post_think_tokens_mean`, `answer_channel_bloat`, `code_comment_tokens_mean`, and `comment_bloat`. These are useful on LiveCodeBench because a model can obey a short `<think>` grammar while moving some of the missing scratchpad into comments or other post-think answer text.
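A rough way to measure that displacement for Python answers is the share of comment lines in the extracted code. This is a sketch only; the repo's actual `comment_bloat` metric may be computed differently:

```python
def comment_line_fraction(code: str) -> float:
    # Fraction of non-blank lines that are pure '#' comment lines;
    # a high value suggests the scratchpad moved into the answer channel.
    lines = [ln.strip() for ln in code.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    return sum(ln.startswith("#") for ln in lines) / len(lines)
```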
Qwen3.6-35B-A3B is a MoE hybrid:

- 40 layers total, pattern 10 × (3 × GatedDeltaNet + 1 × GatedAttention)
- 256 experts, 9 active per token (3B active params)
- Native 262K context, extensible to 1M via YaRN
For the FSM experiment, the base-LM architecture doesn't matter much — we're constraining what tokens are emitted, not how they're processed internally. The same GBNF approach should work with any model served through llama.cpp, vLLM, SGLang, TGI, etc., as long as the server accepts a grammar parameter.
- Contamination. HumanEval has been in training corpora for years, so all modes may be recalling solutions rather than reasoning. The "FSM matches FREE" result could mean "the grammar extracts the same memorized solution in fewer tokens" rather than "the grammar preserves reasoning capability." LiveCodeBench release_v6 currently reaches April 2025, so it is a useful recent, lower-contamination check but not a strict post-Qwen3.6-release cutoff.
- Grammar specificity. The GOAL/APPROACH/EDGE format was tuned for short coding tasks. The active LiveCodeBench grammar, `grammars/fsm_grammar_lcb_plan.gbnf`, adds GOAL/STATE/ALGO/EDGE/VERIFY while leaving the answer permissive. Math, logic, and planning domains may need different symbolic formats; it is unclear whether a single "universal" compressed-thinking grammar exists or whether each domain needs its own.
- Reasoning depth. The current result is on problems solvable in one forward pass. For multi-step problems (SWE-Bench, long-horizon planning, agentic tasks), whether grammar compression preserves capability is untested.
- Model dependency. Tested with one model (Qwen3.6-35B-A3B). We don't yet know whether smaller models (1B-7B) can be grammar-constrained the same way or whether the capability requires scale.
- ✅ HumanEval+ full (164 problems): FSM 152/164 vs FREE 151/164, 22× think-token compression
- ✅ LiveCodeBench v6 recent LeetCode subset, public tests (50 problems): FREE 25/50 vs `grammars/fsm_grammar_lcb_plan.gbnf` 32/50; FSM_PLAN improves pass@1 while cutting mean total tokens from 13632 to 2743
- ⏳ MBPP+ (planned)
- 🔲 Other domains (math, logic, planning)
- 🔲 Cross-model transfer (smaller models)
The direct technical antecedent is Coconut (*Training Large Language Models to Reason in a Continuous Latent Space*, Hao et al., 2024), which compresses CoT into continuous latents via fine-tuning. This work tests whether a much cheaper intervention (grammar constraint at inference, no training) captures a meaningful fraction of the same benefit.