
AdderBoard

Challenge: Build the smallest transformer that can add two 10-digit numbers with >= 99% accuracy on a held-out 10K test set.

This started with Addition Under Pressure, where I gave Claude Code and Codex the same prompt: train the smallest possible transformer that can do 10-digit addition with at least 99% accuracy. Claude Code came back with 6,080 parameters and Codex came back with 1,644. The community has since pushed this dramatically lower.

Maintained by Dimitris Papailiopoulos (@dimitrispapail).

We track two categories:

  • Trained — weights learned from data by any training algorithm (SGD, Adam, evolutionary search, etc.). The algorithm must be generic — it should work with any model and dataset, not just this specific problem. This encourages creative ideas around data format, tokenization, curriculum learning, and architecture search.
  • Hand-coded — weights set analytically. This is a constructive proof that the architecture can represent addition, regardless of whether SGD would find it.

Both are valid. Both are interesting.

Leaderboard

Hand-Coded Weights (Constructive Proofs)

| Rank | Params | Accuracy | Author | Built with | Architecture | Key Tricks | Link |
|---|---|---|---|---|---|---|---|
| 1 | 10 | 100% | lokimorty | | 1L Qwen-derived decoder, d=2, 1h, hd=2, ff=2 | RoPE period-19, parametric tied embedding, gate tying via algebraic identity, merged carry scalar | gist |
| 2 | 12 | 100% | lokimorty | | 1L Qwen-derived decoder, d=2, 1h, hd=2, ff=2 | RoPE period-19, parametric tied embedding, sparse attention/MLP, constructive carry hinge | gist |
| 3 | 20 | 100% | yieldthought | | 1L decoder, d=2, 1h, hd=2 | Quadratic tied embedding + tied output head, RoPE-19 digit routing, sparse tied V/O, two-hinge ReLU MLP, parameterless pre-norm | gist |
| 4 | 27 | 100% | Wonderfall (@w0nderfall) | | 1L decoder, d=2, 1h, hd=2 | Tied Q/K + V/O, cross-tied W_vo as MLP w2, factorized quadratic embedding, compressed MLP w1, RoPE period-19 | gist |
| 5 | 28 | 100% | jacobli99 | | 1L decoder, d=2, 5h (MQA), hd=2, ff=4 | Tied parabolic decode, RoPE digit routing, sparse O-proj, tied MLP, matrix broadcast | gist |
| 6 | 31 | 100% | Arch222 | | 1L decoder, d=3, 4h/1kv, hd=2, ff=4 | RoPE offset-targeted queries, sparse O-proj, SwiGLU carry detection, tied embed decode | repo |
| 7 | 33 | 100% | fblissjr | Claude Code + Gemini | 1L decoder, d=3, 3h (d_head=1), ff=4 | ALiBi prefix sum for carry, e^80 softmax anchoring, residual cancellation head, 2-hinge ReLU step, parabolic LM head, float64 | repo |
| 8 | 36 | 100% | alexlitz | | 2L decoder, d=5, 5h+1h | ALiBi slope=log(10) for base-10 weighting, sparse embed, gated ReLU FFN, float64 | gist |
| 9 | 50 | 100% | lichengliu03 | | 1L custom GPT, d=4, 2h, hd=2 | Factorized embed, rotation Q (2 angles), tied embed+V dir, rank-1 MLP, parabolic head, sinusoidal PE (period 11) | repo |
| 10 | 66 | 100% | cosminscn | | 1L nanoGPT, d=4, 2h | Rotation Q (2 angles), sparse c_proj (2 nonzero), parabolic lm_head, factorized embed, sinusoidal PE (period 11) | gist |
| 11 | 87 | 100% | bingbangboom-lab | | 2L Qwen3, d=5, 2h/1kv, hd=2, ff=3 | Cross-layer sharing, rank-1 projections, sparse gate, low-rank head, frozen scaling params | gist |
| 12 | 93 | 100% | jacobli99 | | 1L decoder, d=2, 5h (MQA), hd=2, ff=4 | Tied parabolic decode, RoPE digit routing, ReLU carry detection | gist |
| 13 | 111 | 100% | corbensorenson | Codex | 1L decoder, d=3, 4h/1kv, hd=2, ff=2 | Tied embed, RoPE, SwiGLU, GQA | repo |
| 14 | 116 | 100% | nino | | 1L Qwen3, d=3, 4h/1kv, hd=2 | Tied embed, shared RMSNorm vectors, RoPE (hd=2) | gist |
| 15 | 121 | 100% | Wonderfall (@w0nderfall) | Codex | 1L Qwen3, d=3, 4h/1kv, hd=2, ff=2 | Tied embed, RoPE digit routing, carry via final norm, SiLU wrap detection | gist |
| 16 | 130 | 100% | cosminscn | | 1L nanoGPT, d=4, 2h | Rank-1 linear, factorized embed, sinusoidal PE (period 11), ReLU carry detection, parabolic logit decoding | gist |
| 17 | 130 | 100% | Wonderfall (@w0nderfall) | Codex | 1L Qwen3, d=3, 4h/1kv, hd=2, ff=3 | Tied embed, RoPE digit routing, SiLU carry logic | gist |
| 18 | 139 | 100% | Wonderfall (@w0nderfall) | GPT-5.2 Pro + Codex | 1L Qwen3, d=3, 4h/1kv, hd=2 | Tied embed, RoPE digit routing, SiLU carry logic | gist |
| 19 | 148 | 100% | bingbangboom-lab | | 2L Qwen3, d=5, 2h/1kv, hd=2, ff=3 | Rank-1 linear, factorized embed, sparse gate, param-free norm, low-rank head, cross-layer sharing | gist |
| 20 | 177 | 100% | xangma (@xangma) | GPT + Codex | 2L Qwen3, d=5, 2h/1kv, hd=2 | Rank-1 linear, factorized embed, sparse gate, param-free norm, low-rank head | gist |
| 21 | 197 | ~100%* | xangma (@xangma) | GPT + Codex | 2L Qwen3, d=5, 2h/1kv, hd=2 | Rank-1 linear, factorized embed, sparse gate, param-free norm | gist |

* Passed 8,192 random tests; not independently verified on our 10K test suite yet.

Trained Weights (Learned from Data)

| Rank | Params | Accuracy | Author | Built with | Architecture | Key Tricks | Link |
|---|---|---|---|---|---|---|---|
| 1 | 62 | 100% | tbukic | SuperchargeAI + Claude Code | 1L Qwen3, d=3, 1h/1kv, hd=4, ff=2, RoPE θ=3, SwiGLU | Circular arc embedding (3 params), tied K=V, tied O=Q^T, tied lm_head, Adam no weight decay | repo |
| 2 | 67 | 100% | evindor | Claude Code + Codex | 1L decoder, d=5(2+3), 1h, qk=4, hd=5, ff=2 | Parametric circular embed (3p), tied V/O, tied Q/K+phase, rank-1 out, shared norm, carry-mix curriculum | repo |
| 3 | 83 | 100% | tbukic | SuperchargeAI + Claude Code | 1L Qwen3, d=3, 1h/1kv, hd=4, ff=2, RoPE θ=3, SwiGLU | Tied embed, tied K=V, tied O=Q^T, shared all RMSNorms, iterated targeted fine-tuning | repo |
| 4 | 86 | 100% | tbukic | SuperchargeAI + Claude Code | 1L Qwen3, d=3, 1h/1kv, hd=4, ff=2, RoPE θ=3, SwiGLU | Tied embed, tied K=V, tied O=Q^T, shared block RMSNorms, L-BFGS + targeted fine-tuning | repo |
| 5 | 89 | 100% | tbukic | SuperchargeAI + Claude Code | 1L Qwen3, d=3, 1h/1kv, hd=4, ff=2, RoPE θ=3, SwiGLU | Tied embed, tied K=V, tied O=Q^T, RoPE (zero params), QK norms, 4-stage grokking-aware training | repo |
| 6 | 95 | 99.03% | tbukic | SuperchargeAI + Claude Code | 1L Qwen3 + circular arc embed, d=3, 1h/1kv, hd=4, ff=3, RoPE θ=3, SwiGLU | Circular arc embedding (3 params), tied lm_head to dynamic embed, RoPE, QK norms | repo |
| 7 | 101 | 100% | tbukic | SuperchargeAI + Claude Code | 1L Qwen3, d=3, 1h/1kv, hd=4, ff=2, RoPE θ=3, SwiGLU | Tied embed, tied O=Q^T, RoPE (zero params), QK norms, cosine LR + targeted fine-tuning | repo |
| 8 | 122 | 99.95% | staghado | | 1L Qwen3, d=3, 1h/1kv, hd=4, ff=3 | Tied embed, RoPE θ=3 | repo |
| 9 | 234 | 99.91% | JackCai1206 | Claude Code | 1L decoder, d=6 (3 tok + 3 pos), 2h, hd=3, ff=2 | Parametric spiral PE (4 params), split-head attn (QK-pos/V-tok), shared XYZ pos, tied output head, LSB-first | repo |
| 10 | 262 | 99.95% | lichengliu03 | | 1L decoder, d=4, 1h, ff=8 | Rank-3 factorization, shared-A tied-KV, RMSNorm, tied embed, curriculum learning | repo |
| 11 | 275 | 99.98% | ryanyord | Gemini | 1L decoder, d=4, 1h, ff=8, ranks=(3,3,2,2) | SVD truncation of 311p, tied embed, low-rank factorization, shareA_tieKV, RMSNorm | repo |
| 12 | 305 | 99.98% | h3nock | | 1L decoder, d=4, 1h, ff=9 | Low-rank factorization, shared-A tied-KV, RMSNorm, tied embed, learned PE, curriculum learning | repo |
| 13 | 311 | 99.999% | rezabyt (@reza_byt) | | 1L decoder, d=4, 1h, ff=8 | Rank-3 factorization, shared-A tied-KV, RMSNorm, grokking | repo |
| 14 | 456 | 100% | yinglunz | | 1L decoder, d=7, 1h, ff=14 | Rank-3 factorization, shared-A tied-KV, rank-2 attn out, tied embed | repo |
| 15 | 491 | 99.97% | rezabyt (@reza_byt) | | 1L decoder, d=7 | Rank-3 factorization, RMSNorm, curriculum learning | repo |
| 16 | 512 | 99.988% | yinglunz (@yinglun122) | | 1L decoder, d=7, 1h, ff=14 | Rank-3 factorization | repo |
| 17 | 777 | 99.69% | Yeb Havinga (@YebHavinga) | Claude Code | 1L decoder, d=7, 1h, ff=14 | Tied embeddings, no FFN bias, curriculum learning | repo |
| 18 | 1,644 | 99.04% | anadim (@dimitrispapail) | Codex | 1L decoder, pair tokens | Pair token encoding (digit pairs as single tokens) | repo |
| 19 | 6,080 | 100% | anadim (@dimitrispapail) | Claude Code | 2L decoder, d=16, ff=48 | Systematic scaling, found phase transition at d=16 | repo |
Rules

The Core Constraint: Autoregressive Transformer

The model must operate as a genuine autoregressive transformer. This means:

  1. Self-attention is required. The model must contain at least one self-attention layer. This is the defining feature of a transformer — without it, you have an MLP or RNN, not a transformer.

  2. The model must be autoregressive. It receives a token sequence as input and predicts the next token. Output digits are generated one at a time, with each new token fed back as input for predicting the next. The carry propagation must emerge from this autoregressive process — not from explicit state variables passed between steps in Python.

  3. Standard forward pass. The model's forward() method must be a standard tensor-in, logits-out computation. No problem-specific control flow (for-loops over digits, explicit carry variables, string manipulation) inside forward(). The autoregressive generation loop lives outside the model, exactly as it would for any language model.

  4. The model does the work, not the code. The inference code should be generic autoregressive decoding that would work with any transformer checkpoint. If your generation loop contains addition-specific logic — manually pairing digits, threading carry state, indexing into specific positions — then the Python code is solving the problem, not the model.

In short: if you can swap in a different set of weights and use the exact same inference code for a different task, your setup is legitimate. If the inference code is inseparable from the algorithm, it's not.
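As a concrete reference, a generic greedy decoding loop that satisfies these rules might look like the sketch below. Nothing in it knows about addition: `model` is any callable mapping a token-id sequence to next-token logits, and the same loop would drive any checkpoint. (The names `generate`, `model`, and `eos_id` are illustrative, not the repo's actual API.)

```python
def generate(model, prompt_ids, eos_id, max_new_tokens=16):
    """Generic greedy autoregressive decoding: works for any checkpoint."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = model(ids)  # tensor-in, logits-out: one standard forward pass
        next_id = max(range(len(logits)), key=logits.__getitem__)  # greedy argmax
        ids.append(next_id)  # feed the prediction back in as input
        if next_id == eos_id:
            break
    return ids[len(prompt_ids):]
```

Any carry logic must live inside `model`; the loop only does argmax-and-append.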

What's Allowed

  • Architectural variations: rank-1/low-rank projections, factorized embeddings, custom positional encodings, alternative norms
  • Hand-coded weights (constructive proofs are valid — they show the architecture can represent addition)
  • Trained weights via any generic learning algorithm (shows the solution is learnable — encourages creative ideas on data format, tokenization, and curriculum)
  • Input formatting choices (reversed digits, delimiters, etc.) as long as the format is fixed and doesn't encode the answer
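For example, one common formatting choice (the exact format varies by submission; this is only a sketch) is to reverse the digits so the least-significant digit comes first, letting the model emit carries in the order it needs them. The format is fixed, zero-padded, and encodes nothing about the answer:

```python
def format_prompt(a: int, b: int, width: int = 10) -> str:
    """Fixed-width, LSB-first prompt format with '+' and '=' delimiters."""
    ra = str(a).zfill(width)[::-1]  # zero-pad to 10 digits, then reverse
    rb = str(b).zfill(width)[::-1]
    return f"{ra}+{rb}="

format_prompt(12, 3400)  # '2100000000+0043000000='
```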

Qualification

  • Must achieve >= 99% accuracy on 10,000 random test pairs (held-out, fixed seed)
  • Inputs: two integers in [0, 9,999,999,999]
  • Output: their sum as an integer
  • Verified using verify.py with --seed 2025
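A sketch of what a fixed-seed harness for this contract could look like (verify.py's actual sampling scheme may differ; `make_test_set` and `accuracy` are hypothetical names illustrating the stated contract: 10,000 random pairs in [0, 9,999,999,999] with seed 2025):

```python
import random

def make_test_set(n=10_000, seed=2025, hi=9_999_999_999):
    """Deterministic held-out set: same seed always yields the same pairs."""
    rng = random.Random(seed)
    return [(rng.randint(0, hi), rng.randint(0, hi)) for _ in range(n)]

def accuracy(predict, pairs):
    """Fraction of pairs where the model's predicted sum is exactly right."""
    correct = sum(predict(a, b) == a + b for a, b in pairs)
    return correct / len(pairs)
```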

Parameter Counting

  • Count unique parameters (after weight tying/deduplication)
  • Fixed/sinusoidal positional encodings are not counted (following the original Transformer paper convention)
  • Learned positional encodings are counted
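Counting unique parameters follows from the fact that tied weights are literally the same tensor object, so deduplicating by identity counts them once. A minimal sketch with plain lists standing in for weight tensors (with PyTorch you would iterate `model.parameters()` the same way):

```python
def count_unique_params(tensors):
    """Total scalar count, counting each distinct tensor object once."""
    seen, total = set(), 0
    for t in tensors:
        if id(t) not in seen:  # tied weights share one object -> skipped on repeat
            seen.add(id(t))
            total += len(t)    # number of scalars in this tensor
    return total

embed = [0.0] * 6  # stands in for an embedding matrix tied to the lm_head
mlp = [0.0] * 4
count_unique_params([embed, mlp, embed])  # embed counted once -> 10
```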

How to Submit

Option A: Open an Issue (easiest)

  1. Click New Issue and fill in the template
  2. Include a link to your code (GitHub repo, gist, etc.)
  3. Include test results (accuracy on random pairs)
  4. We'll verify and add you to the leaderboard

Option B: Open a Pull Request

  1. Fork this repo
  2. Update the leaderboard in README.md with your entry
  3. Include verification results
  4. We'll review and merge

Updates to the leaderboard are welcome via pull request.

Verification

python verify.py submissions/your_submission.py

This runs:

  • 10 edge cases (boundary values, max carry chains)
  • 10,000 random pairs (seed=2025)
  • Reports accuracy, pass/fail, and timing
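The boundary cases are of the following kind (illustrative only; the actual 10 edge cases live in verify.py): zero operands, operands at the boundary, and sums whose carry ripples through all ten digits.

```python
EDGE_CASES = [
    (0, 0),                          # smallest inputs
    (0, 9_999_999_999),              # one operand at the boundary
    (9_999_999_999, 1),              # carry chain through every digit
    (9_999_999_999, 9_999_999_999),  # largest possible sum (11 digits)
    (5_000_000_000, 4_999_999_999),  # no carry at any position
]

# Ground truth is exact integer addition:
expected = [a + b for a, b in EDGE_CASES]
```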

Context

This challenge explores a fundamental question: what is the minimal transformer that can represent integer addition?

Addition requires three capabilities:

  1. Digit alignment — pairing corresponding digits from two numbers
  2. Per-digit arithmetic — computing sum and carry for each pair
  3. Carry propagation — threading carry information across positions

Transformers solve these using attention (for alignment), MLPs (for arithmetic), and autoregressive generation (for carry propagation). The question is how small the architecture can be while still implementing all three.
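Written out imperatively, the three capabilities are exactly the schoolbook algorithm below, which the model must implement internally: step 1 maps to attention, step 2 to the MLP, and step 3 rides on autoregressive generation.

```python
def add_digits(a_digits, b_digits):
    """LSB-first digit lists in, LSB-first sum digits out."""
    out, carry = [], 0
    for da, db in zip(a_digits, b_digits):  # 1. digit alignment
        s = da + db + carry                 # 2. per-digit arithmetic
        out.append(s % 10)
        carry = s // 10                     # 3. carry propagation
    out.append(carry)                       # possible extra leading digit
    return out

add_digits([9, 9, 1], [1, 0, 0])  # 199 + 1 -> [0, 0, 2, 0] (= 200, LSB-first)
```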

Key Findings from the Community

  • Parameter cliff at ~800: Sharp accuracy transition observed by multiple researchers
  • Single layers beat two layers at equivalent parameter budgets (for trained models)
  • d=7 was the sweet spot for early trained models — multiple independent teams converged on this
  • d=4 now works with rank-3 factorization + grokking (311 params trained)
  • Hand-coded models can go much smaller (10 vs 62 trained) since they don't need to be discoverable by SGD
  • Rank-3 factorization is the key trick for trained models
  • ALiBi enables extreme compression: a 36-param hand-coded entry uses ALiBi with slope log(10) for base-10 positional weighting, achieving 100% accuracy with a 2-layer decoder (d=5) in float64
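Why slope log(10) amounts to base-10 weighting can be seen numerically (a sketch of the idea, not the 36-parameter entry's actual code): ALiBi subtracts slope × distance from each attention score, so after softmax a token one position further away is down-weighted by exactly a factor of 10.

```python
import math

slope = math.log(10)
# Equal content scores, shifted by the ALiBi bias -slope * distance:
scores = [0.0 - slope * d for d in range(4)]
weights = [math.exp(s) for s in scores]
total = sum(weights)
weights = [w / total for w in weights]  # softmax over the biased scores

# Successive weights fall off by 10x each step: attention computes a
# base-10 positional weighting "for free", with zero extra parameters.
ratios = [weights[i] / weights[i + 1] for i in range(3)]
```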

License

MIT
