
AdderBoard

Challenge: Build the smallest transformer that can add two 10-digit numbers with >= 99% accuracy on a held-out 10K test set.

This started with Addition Under Pressure, where I gave Claude Code and Codex the same prompt: train the smallest possible transformer that can do 10-digit addition with at least 99% accuracy. Claude Code came back with 6,080 parameters and Codex came back with 1,644. The community has since pushed this dramatically lower.

Maintained by Dimitris Papailiopoulos (@dimitrispapail).

We track two categories:

  • Trained — weights learned from data by any training algorithm (SGD, Adam, evolutionary search, etc.). The algorithm must be generic — it should work with any model and dataset, not just this specific problem. This encourages creative ideas around data format, tokenization, curriculum learning, and architecture search.
  • Hand-coded — weights set analytically. This is a constructive proof that the architecture can represent addition, regardless of whether SGD would find it.

Both are valid. Both are interesting.

Leaderboard

Hand-Coded Weights (Constructive Proofs)

| Rank | Params | Accuracy | Author | Built with | Architecture | Key Tricks | Link |
|---|---|---|---|---|---|---|---|
| 1 | 10 | 100% | lokimorty | | 1L Qwen-derived decoder, d=2, 1h, hd=2, ff=2 | RoPE period-19, parametric tied embedding, gate tying via algebraic identity, merged carry scalar | gist |
| 2 | 12 | 100% | lokimorty | | 1L Qwen-derived decoder, d=2, 1h, hd=2, ff=2 | RoPE period-19, parametric tied embedding, sparse attention/MLP, constructive carry hinge | gist |
| 3 | 20 | 100% | yieldthought | | 1L decoder, d=2, 1h, hd=2 | Quadratic tied embedding + tied output head, RoPE-19 digit routing, sparse tied V/O, two-hinge ReLU MLP, parameterless pre-norm | gist |
| 4 | 27 | 100% | Wonderfall (@w0nderfall) | | 1L decoder, d=2, 1h, hd=2 | Tied Q/K + V/O, cross-tied W_vo as MLP w2, factorized quadratic embedding, compressed MLP w1, RoPE period-19 | gist |
| 5 | 28 | 100% | jacobli99 | | 1L decoder, d=2, 5h (MQA), hd=2, ff=4 | Tied parabolic decode, RoPE digit routing, sparse O-proj, tied MLP, matrix broadcast | gist |
| 6 | 31 | 100% | Arch222 | | 1L decoder, d=3, 4h/1kv, hd=2, ff=4 | RoPE offset-targeted queries, sparse O-proj, SwiGLU carry detection, tied embed decode | repo |
| 7 | 33 | 100% | fblissjr | Claude Code + Gemini | 1L decoder, d=3, 3h (d_head=1), ff=4 | ALiBi prefix sum for carry, e^80 softmax anchoring, residual cancellation head, 2-hinge ReLU step, parabolic LM head, float64 | repo |
| 8 | 36 | 100% | alexlitz | | 2L decoder, d=5, 5h+1h | ALiBi slope=log(10) for base-10 weighting, sparse embed, gated ReLU FFN, float64 | gist |
| 9 | 50 | 100% | lichengliu03 | | 1L custom GPT, d=4, 2h, hd=2 | Factorized embed, rotation Q (2 angles), tied embed+V dir, rank-1 MLP, parabolic head, sinusoidal PE (period 11) | repo |
| 10 | 66 | 100% | cosminscn | | 1L nanoGPT, d=4, 2h | Rotation Q (2 angles), sparse c_proj (2 nonzero), parabolic lm_head, factorized embed, sinusoidal PE (period 11) | gist |
| 11 | 87 | 100% | bingbangboom-lab | | 2L Qwen3, d=5, 2h/1kv, hd=2, ff=3 | Cross-layer sharing, rank-1 projections, sparse gate, low-rank head, frozen scaling params | gist |
| 12 | 93 | 100% | jacobli99 | | 1L decoder, d=2, 5h (MQA), hd=2, ff=4 | Tied parabolic decode, RoPE digit routing, ReLU carry detection | gist |
| 13 | 111 | 100% | corbensorenson | Codex | 1L decoder, d=3, 4h/1kv, hd=2, ff=2 | Tied embed, RoPE, SwiGLU, GQA | repo |
| 14 | 116 | 100% | nino | | 1L Qwen3, d=3, 4h/1kv, hd=2 | Tied embed, shared RMSNorm vectors, RoPE (hd=2) | gist |
| 15 | 121 | 100% | Wonderfall (@w0nderfall) | Codex | 1L Qwen3, d=3, 4h/1kv, hd=2, ff=2 | Tied embed, RoPE digit routing, carry via final norm, SiLU wrap detection | gist |
| 16 | 130 | 100% | cosminscn | | 1L nanoGPT, d=4, 2h | Rank-1 linear, factorized embed, sinusoidal PE (period 11), ReLU carry detection, parabolic logit decoding | gist |
| 17 | 130 | 100% | Wonderfall (@w0nderfall) | Codex | 1L Qwen3, d=3, 4h/1kv, hd=2, ff=3 | Tied embed, RoPE digit routing, SiLU carry logic | gist |
| 18 | 139 | 100% | Wonderfall (@w0nderfall) | GPT-5.2 Pro + Codex | 1L Qwen3, d=3, 4h/1kv, hd=2 | Tied embed, RoPE digit routing, SiLU carry logic | gist |
| 19 | 148 | 100% | bingbangboom-lab | | 2L Qwen3, d=5, 2h/1kv, hd=2, ff=3 | Rank-1 linear, factorized embed, sparse gate, param-free norm, low-rank head, cross-layer sharing | gist |
| 20 | 177 | 100% | xangma (@xangma) | GPT + Codex | 2L Qwen3, d=5, 2h/1kv, hd=2 | Rank-1 linear, factorized embed, sparse gate, param-free norm, low-rank head | gist |
| 21 | 197 | ~100%* | xangma (@xangma) | GPT + Codex | 2L Qwen3, d=5, 2h/1kv, hd=2 | Rank-1 linear, factorized embed, sparse gate, param-free norm | gist |

* Passed 8,192 random tests; not independently verified on our 10K test suite yet.

Trained Weights (Learned from Data)

| Rank | Params | Accuracy | Author | Built with | Architecture | Key Tricks | Link |
|---|---|---|---|---|---|---|---|
| 1 | 62 | 100% | tbukic | SuperchargeAI + Claude Code | 1L Qwen3, d=3, 1h/1kv, hd=4, ff=2, RoPE θ=3, SwiGLU | Circular arc embedding (3 params), tied K=V, tied O=Q^T, tied lm_head, Adam no weight decay | repo |
| 2 | 67 | 100% | evindor | Claude Code + Codex | 1L decoder, d=5(2+3), 1h, qk=4, hd=5, ff=2 | Parametric circular embed (3p), tied V/O, tied Q/K+phase, rank-1 out, shared norm, carry-mix curriculum | repo |
| 3 | 83 | 100% | tbukic | SuperchargeAI + Claude Code | 1L Qwen3, d=3, 1h/1kv, hd=4, ff=2, RoPE θ=3, SwiGLU | Tied embed, tied K=V, tied O=Q^T, shared all RMSNorms, iterated targeted fine-tuning | repo |
| 4 | 86 | 100% | tbukic | SuperchargeAI + Claude Code | 1L Qwen3, d=3, 1h/1kv, hd=4, ff=2, RoPE θ=3, SwiGLU | Tied embed, tied K=V, tied O=Q^T, shared block RMSNorms, L-BFGS + targeted fine-tuning | repo |
| 5 | 89 | 100% | tbukic | SuperchargeAI + Claude Code | 1L Qwen3, d=3, 1h/1kv, hd=4, ff=2, RoPE θ=3, SwiGLU | Tied embed, tied K=V, tied O=Q^T, RoPE (zero params), QK norms, 4-stage grokking-aware training | repo |
| 6 | 95 | 99.03% | tbukic | SuperchargeAI + Claude Code | 1L Qwen3 + circular arc embed, d=3, 1h/1kv, hd=4, ff=3, RoPE θ=3, SwiGLU | Circular arc embedding (3 params), tied lm_head to dynamic embed, RoPE, QK norms | repo |
| 7 | 101 | 100% | tbukic | SuperchargeAI + Claude Code | 1L Qwen3, d=3, 1h/1kv, hd=4, ff=2, RoPE θ=3, SwiGLU | Tied embed, tied O=Q^T, RoPE (zero params), QK norms, cosine LR + targeted fine-tuning | repo |
| 8 | 122 | 99.95% | staghado | | 1L Qwen3, d=3, 1h/1kv, hd=4, ff=3 | Tied embed, RoPE θ=3 | repo |
| 9 | 234 | 99.91% | JackCai1206 | Claude Code | 1L decoder, d=6 (3 tok + 3 pos), 2h, hd=3, ff=2 | Parametric spiral PE (4 params), split-head attn (QK-pos/V-tok), shared XYZ pos, tied output head, LSB-first | repo |
| 10 | 262 | 99.95% | lichengliu03 | | 1L decoder, d=4, 1h, ff=8 | Rank-3 factorization, shared-A tied-KV, RMSNorm, tied embed, curriculum learning | repo |
| 11 | 275 | 99.98% | ryanyord | Gemini | 1L decoder, d=4, 1h, ff=8, ranks=(3,3,2,2) | SVD truncation of 311p, tied embed, low-rank factorization, shareA_tieKV, RMSNorm | repo |
| 12 | 305 | 99.98% | h3nock | | 1L decoder, d=4, 1h, ff=9 | Low-rank factorization, shared-A tied-KV, RMSNorm, tied embed, learned PE, curriculum learning | repo |
| 13 | 311 | 99.999% | rezabyt (@reza_byt) | | 1L decoder, d=4, 1h, ff=8 | Rank-3 factorization, shared-A tied-KV, RMSNorm, grokking | repo |
| 14 | 456 | 100% | yinglunz | | 1L decoder, d=7, 1h, ff=14 | Rank-3 factorization, shared-A tied-KV, rank-2 attn out, tied embed | repo |
| 15 | 491 | 99.97% | rezabyt (@reza_byt) | | 1L decoder, d=7 | Rank-3 factorization, RMSNorm, curriculum learning | repo |
| 16 | 512 | 99.988% | yinglunz (@yinglun122) | | 1L decoder, d=7, 1h, ff=14 | Rank-3 factorization | repo |
| 17 | 777 | 99.69% | Yeb Havinga (@YebHavinga) | Claude Code | 1L decoder, d=7, 1h, ff=14 | Tied embeddings, no FFN bias, curriculum learning | repo |
| 18 | 1,644 | 99.04% | anadim (@dimitrispapail) | Codex | 1L decoder, pair tokens | Pair token encoding (digit pairs as single tokens) | repo |
| 19 | 6,080 | 100% | anadim (@dimitrispapail) | Claude Code | 2L decoder, d=16, ff=48 | Systematic scaling, found phase transition at d=16 | repo |
Rules

The Core Constraint: Autoregressive Transformer

The model must operate as a genuine autoregressive transformer. This means:

  1. Self-attention is required. The model must contain at least one self-attention layer. This is the defining feature of a transformer — without it, you have an MLP or RNN, not a transformer.

  2. The model must be autoregressive. It receives a token sequence as input and predicts the next token. Output digits are generated one at a time, with each new token fed back as input for predicting the next. The carry propagation must emerge from this autoregressive process — not from explicit state variables passed between steps in Python.

  3. Standard forward pass. The model's forward() method must be a standard tensor-in, logits-out computation. No problem-specific control flow (for-loops over digits, explicit carry variables, string manipulation) inside forward(). The autoregressive generation loop lives outside the model, exactly as it would for any language model.

  4. The model does the work, not the code. The inference code should be generic autoregressive decoding that would work with any transformer checkpoint. If your generation loop contains addition-specific logic — manually pairing digits, threading carry state, indexing into specific positions — then the Python code is solving the problem, not the model.

In short: if you can swap in a different set of weights and use the exact same inference code for a different task, your setup is legitimate. If the inference code is inseparable from the algorithm, it's not.
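As a concrete reference, a generic greedy decoding loop that satisfies these rules might look like the sketch below. Nothing in it knows about addition: `model` is any callable mapping a token-id sequence to next-token logits, and the same loop would drive any checkpoint. (The names `generate`, `model`, and `eos_id` are illustrative, not the repo's actual API.)

```python
def generate(model, prompt_ids, eos_id, max_new_tokens=16):
    """Generic greedy autoregressive decoding: works for any checkpoint."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = model(ids)  # tensor-in, logits-out: one standard forward pass
        next_id = max(range(len(logits)), key=logits.__getitem__)  # greedy argmax
        ids.append(next_id)  # feed the prediction back in as input
        if next_id == eos_id:
            break
    return ids[len(prompt_ids):]
```

Any carry logic must live inside `model`; the loop only does argmax-and-append.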

What's Allowed

  • Architectural variations: rank-1/low-rank projections, factorized embeddings, custom positional encodings, alternative norms
  • Hand-coded weights (constructive proofs are valid — they show the architecture can represent addition)
  • Trained weights via any generic learning algorithm (shows the solution is learnable — encourages creative ideas on data format, tokenization, and curriculum)
  • Input formatting choices (reversed digits, delimiters, etc.) as long as the format is fixed and doesn't encode the answer
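For example, one common formatting choice (the exact format varies by submission; this is only a sketch) is to reverse the digits so the least-significant digit comes first, letting the model emit carries in the order it needs them. The format is fixed, zero-padded, and encodes nothing about the answer:

```python
def format_prompt(a: int, b: int, width: int = 10) -> str:
    """Fixed-width, LSB-first prompt format with '+' and '=' delimiters."""
    ra = str(a).zfill(width)[::-1]  # zero-pad to 10 digits, then reverse
    rb = str(b).zfill(width)[::-1]
    return f"{ra}+{rb}="

format_prompt(12, 3400)  # '2100000000+0043000000='
```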

Qualification

  • Must achieve >= 99% accuracy on 10,000 random test pairs (held-out, fixed seed)
  • Inputs: two integers in [0, 9,999,999,999]
  • Output: their sum as an integer
  • Verified using verify.py with --seed 2025
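A sketch of what a fixed-seed harness for this contract could look like (verify.py's actual sampling scheme may differ; `make_test_set` and `accuracy` are hypothetical names illustrating the stated contract: 10,000 random pairs in [0, 9,999,999,999] with seed 2025):

```python
import random

def make_test_set(n=10_000, seed=2025, hi=9_999_999_999):
    """Deterministic held-out set: same seed always yields the same pairs."""
    rng = random.Random(seed)
    return [(rng.randint(0, hi), rng.randint(0, hi)) for _ in range(n)]

def accuracy(predict, pairs):
    """Fraction of pairs where the model's predicted sum is exactly right."""
    correct = sum(predict(a, b) == a + b for a, b in pairs)
    return correct / len(pairs)
```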

Parameter Counting

  • Count unique parameters (after weight tying/deduplication)
  • Fixed/sinusoidal positional encodings are not counted (following the original Transformer paper convention)
  • Learned positional encodings are counted
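Counting unique parameters follows from the fact that tied weights are literally the same tensor object, so deduplicating by identity counts them once. A minimal sketch with plain lists standing in for weight tensors (with PyTorch you would iterate `model.parameters()` the same way):

```python
def count_unique_params(tensors):
    """Total scalar count, counting each distinct tensor object once."""
    seen, total = set(), 0
    for t in tensors:
        if id(t) not in seen:  # tied weights share one object -> skipped on repeat
            seen.add(id(t))
            total += len(t)    # number of scalars in this tensor
    return total

embed = [0.0] * 6  # stands in for an embedding matrix tied to the lm_head
mlp = [0.0] * 4
count_unique_params([embed, mlp, embed])  # embed counted once -> 10
```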

How to Submit

Option A: Open an Issue (easiest)

  1. Click New Issue and fill in the template
  2. Include a link to your code (GitHub repo, gist, etc.)
  3. Include test results (accuracy on random pairs)
  4. We'll verify and add you to the leaderboard

Option B: Open a Pull Request

  1. Fork this repo
  2. Update the leaderboard in README.md with your entry
  3. Include verification results
  4. We'll review and merge

Updates to the leaderboard are welcome via pull request.

Verification

python verify.py submissions/your_submission.py

This runs:

  • 10 edge cases (boundary values, max carry chains)
  • 10,000 random pairs (seed=2025)
  • Reports accuracy, pass/fail, and timing
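The boundary cases are of the following kind (illustrative only; the actual 10 edge cases live in verify.py): zero operands, operands at the boundary, and sums whose carry ripples through all ten digits.

```python
EDGE_CASES = [
    (0, 0),                          # smallest inputs
    (0, 9_999_999_999),              # one operand at the boundary
    (9_999_999_999, 1),              # carry chain through every digit
    (9_999_999_999, 9_999_999_999),  # largest possible sum (11 digits)
    (5_000_000_000, 4_999_999_999),  # no carry at any position
]

# Ground truth is exact integer addition:
expected = [a + b for a, b in EDGE_CASES]
```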

Context

This challenge explores a fundamental question: what is the minimal transformer that can represent integer addition?

Addition requires three capabilities:

  1. Digit alignment — pairing corresponding digits from two numbers
  2. Per-digit arithmetic — computing sum and carry for each pair
  3. Carry propagation — threading carry information across positions

Transformers solve these using attention (for alignment), MLPs (for arithmetic), and autoregressive generation (for carry propagation). The question is how small the architecture can be while still implementing all three.
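Written out imperatively, the three capabilities are exactly the schoolbook algorithm below, which the model must implement internally: step 1 maps to attention, step 2 to the MLP, and step 3 rides on autoregressive generation.

```python
def add_digits(a_digits, b_digits):
    """LSB-first digit lists in, LSB-first sum digits out."""
    out, carry = [], 0
    for da, db in zip(a_digits, b_digits):  # 1. digit alignment
        s = da + db + carry                 # 2. per-digit arithmetic
        out.append(s % 10)
        carry = s // 10                     # 3. carry propagation
    out.append(carry)                       # possible extra leading digit
    return out

add_digits([9, 9, 1], [1, 0, 0])  # 199 + 1 -> [0, 0, 2, 0] (= 200, LSB-first)
```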

Key Findings from the Community

  • Parameter cliff at ~800: Sharp accuracy transition observed by multiple researchers
  • Single layers beat two layers at equivalent parameter budgets (for trained models)
  • d=7 was the sweet spot for early trained models — multiple independent teams converged on this
  • d=4 now works with rank-3 factorization + grokking (311 params trained)
  • Hand-coded models can go much smaller (10 vs 62 trained) since they don't need to be discoverable by SGD
  • Rank-3 factorization is the key trick for trained models
  • ALiBi enables extreme compression: a 36-param hand-coded entry uses ALiBi with slope log(10) for base-10 positional weighting, achieving 100% accuracy with a 2-layer decoder (d=5) in float64
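Why slope log(10) amounts to base-10 weighting can be seen numerically (a sketch of the idea, not the 36-parameter entry's actual code): ALiBi subtracts slope × distance from each attention score, so after softmax a token one position further away is down-weighted by exactly a factor of 10.

```python
import math

slope = math.log(10)
# Equal content scores, shifted by the ALiBi bias -slope * distance:
scores = [0.0 - slope * d for d in range(4)]
weights = [math.exp(s) for s in scores]
total = sum(weights)
weights = [w / total for w in weights]  # softmax over the biased scores

# Successive weights fall off by 10x each step: attention computes a
# base-10 positional weighting "for free", with zero extra parameters.
ratios = [weights[i] / weights[i + 1] for i in range(3)]
```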

License

MIT
