GitHub - davidheineman/traces: Annotate traces from thinking LLMs

setup

pip install -r requirements.txt

# Spacy for sentence segmentation
pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl

GPT OSS Install

# GPT OSS requires pre-release vLLM
# https://cookbook.openai.com/articles/gpt-oss/run-vllm
uv pip install --pre vllm==0.10.1+gptoss \
    --extra-index-url https://wheels.vllm.ai/gpt-oss/ \
    --extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
    --index-strategy unsafe-best-match

# ... and install flash infer
uv pip install --system https://download.pytorch.org/whl/cu128/flashinfer/flashinfer_python-0.2.6.post1%2Bcu128torch2.7-cp39-abi3-linux_x86_64.whl

Run eval

minieval -t minerva_500:cot -m deepseek-ai/DeepSeek-R1-0528-Qwen3-8B -b vllm --writer.save_path out/DeepSeek-R1-0528-Qwen3-8B
minieval -t minerva_500:cot -m open-thoughts/OpenThinker3-7B -b vllm --writer.save_path out/OpenThinker3-7B
minieval -t minerva_500:cot -m Qwen/Qwen3-30B-A3B-Thinking-2507 -b vllm --writer.save_path out/Qwen3-30B-A3B-Thinking-2507

# Args required for A100, not for H100
VLLM_ATTENTION_BACKEND=TRITON_ATTN_VLLM_V1 TORCH_CUDA_ARCH_LIST=8.0 minieval -t minerva_500:cot -m openai/gpt-oss-20b -b vllm --writer.save_path out/gpt-oss-20b

Additional models

open-thoughts/OpenThinker3-7B
Qwen/Qwen3-235B-A22B-Thinking-2507
Qwen/Qwen3-30B-A3B-Thinking-2507
deepseek-ai/DeepSeek-R1
deepseek-ai/DeepSeek-R1-0528
deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
deepseek-ai/DeepSeek-R1-0528-Qwen3-8B
moonshotai/Kimi-K2-Instruct
openai/gpt-oss-20b
openai/gpt-oss-120b

Annotate reasoning trace (per-sentence)

# Use GPT to annotate (~1M input tokens per trace)
python src/annotate_sentences.py -t minerva_500:cot -m OpenThinker3-7B
python src/annotate_sentences.py -t minerva_500:cot -m DeepSeek-R1-0528-Qwen3-8B
python src/annotate_sentences.py -t minerva_500:cot -m gpt-oss-20b

Local models

# see checkpoints
tree -L 2 /oe-eval-default/ai2-llm/checkpoints/davidh/OLMo-RLVR/

# eval
minieval -t minerva_500:cot -m /oe-eval-default/ai2-llm/checkpoints/davidh/OLMo-RLVR/1806rl_qwen2_5_integration_mix_12022__1__1750443080_checkpoints/step_100 -b vllm --writer.save_path out/olmo-rlvr-qwen-2_5-step100
minieval -t minerva_500:cot -m /oe-eval-default/ai2-llm/checkpoints/davidh/OLMo-RLVR/1806rl_qwen2_5_integration_mix_12022__1__1750443080_checkpoints/step_2200 -b vllm --writer.save_path out/olmo-rlvr-qwen-2_5-step2200

# annotate
python src/annotate_sentences.py -t minerva_500:cot -m olmo-rlvr-qwen-2_5-step100
python src/annotate_sentences.py -t minerva_500:cot -m olmo-rlvr-qwen-2_5-step2200

Token-level annotation

Run vLLM server

vllm serve Qwen/Qwen3-32B --port 8000 --max-model-len 32768

Annotate reasoning trace (per-token, custom decoder)

# Custom decoder that re-generates the same output, but allows tagging:
    # "Compute 2+3=5.\n" ==> "[problem_setup]Compute 2+3=5.[/problem_setup]"
python src/annotate_constrained.py # currently 6 TPS on 4o mini (40 minutes for 1 13K token trace)

# run in background
nohup python src/annotate_constrained.py > /tmp/out.out 2>&1 &

Visualize

python analysis/distribution.py

More ideas

Fixes:
- can we split "active computation" into computation for the original answer vs. checking the answer?
How do trends look:
- for intermediate RL steps?
- for different amounts of GPT OSS reasoning?
- for different tasks?
- for the subset of tasks with lots of tokens? a small amount of tokens?
Different properties of traces
- what % is spent developing towards the correct answers? compared to computation that is effectively thrown away?
- how many tokens in is the final answer suggested? (i.e., how long is spent checking the answer?)

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
analysis		analysis
src		src
.gitignore		.gitignore
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

setup

Visualize

More ideas

About

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

setup

Visualize

More ideas

About

Resources

Uh oh!

Stars

Watchers

Forks

Contributors

Uh oh!

Languages