A pre-registered, audit-resolved diagnostic for physics literacy in large language models. PhysLit asks whether a frontier LLM can reason inside an unfamiliar physics framework — not whether it can solve textbook problems. Outputs are binary cognitive judgments, not leaderboard scores.
PhysLit is a research artifact, not a product. Every design decision optimizes for methodological auditability: pre-registered predictions, SHA-256-sealed inputs, fresh API session per stage, dual-LLM judging with an IRR gate, and a human-audit pathway for disagreement.
Two predictions, locked at SHA-256 769818275e6a256...0c7df425 (tag prereg-v0.1-locked) before any production trial, evaluated on Aristotelian Mechanics across Claude Opus 4.7, GPT-5.5, and Gemini 3.1 Pro at N=5 trials each:
- P1 — Induction failure under training-data conflict: CONFIRMED. 2 of 3 models (Claude, Gemini) introduce banned modern-physics concepts (`dense`, `forceful`, `surface-supported`, …) in ≥ 3/5 trials of Stage 1, despite an explicit ban in the prompt.
- P3 — Meta-cognitive miscalibration: CONFIRMED. 10 trials contain at least one Stage 1–3 failure; in 7 of those 10 (70 %) the model fails to identify its own failure during Stage 4 self-reflection — well above the pre-registered 30 % threshold.
A third finding emerged from the methodology itself:
- Cross-vendor LLM-judge disagreement rate = 36.67 %. Two independent judges (Claude + OpenAI) disagreed on more than a third of all PASS/FAIL classifications, triggering the prereg-mandated human audit. No single-judge LLM benchmark would have been reliable on this material.
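For concreteness, the disagreement rate behind this finding reduces to a simple pairwise comparison of verdict lists. The sketch below is illustrative only — the function name and the 11-of-30 counts are assumptions, not PhysLit's actual judging code:

```python
def disagreement_rate(judge_a: list[str], judge_b: list[str]) -> float:
    """Fraction of PASS/FAIL verdicts on which two independent judges differ."""
    if len(judge_a) != len(judge_b):
        raise ValueError("judges must score the same list of items")
    mismatches = sum(a != b for a, b in zip(judge_a, judge_b))
    return mismatches / len(judge_a)

# Illustrative only: 11 disagreements over 30 shared verdicts ≈ 36.67 %
rate = disagreement_rate(["PASS"] * 19 + ["FAIL"] * 11, ["PASS"] * 30)
print(f"{rate:.2%}")  # → 36.67%
```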
Scope: 1 framework × 3 models × N=5 × 4 stages = 60 production API calls + 120 judge calls = 180 calls, ≈ $14 USD total.
| Where to look | What's in it |
|---|---|
| `analysis/v0_1_report.md` | English narrative report — motivation, design, results, next steps |
| `analysis/v0_1_findings.md` | Auto-generated pre- and post-audit numerics + pipeline diagram |
| `analysis/v0_1_audit_human_review.md` | All 22 human-audit verdicts on DISAGREE cases |
| `results/<model-id>/` | Verbatim trial JSONs + judge verdicts for every API call |
Existing LLM physics benchmarks count correct answers and report a percentage. Two structural flaws follow:
- The percentage cannot distinguish "understands physics" from "has seen similar problems during training."
- The percentage carries no information about cognitive boundaries — 90 % vs 91 % tells you nothing about what the model can and cannot do.
PhysLit asks a different question: can the model do the cognitive work that constitutes physical reasoning — induction, formulation, prediction — inside a framework whose conclusions don't match its training prior? Aristotelian Mechanics is the cleanest test case: historically real, internally consistent, present in training data primarily as a position the training data argues against. A model that has "learned Aristotle" is precisely a model that has learned to dismiss this framework; the test is whether it can suspend that dismissal long enough to reason inside the framework on its own terms.
Full motivation and design rationale: `docs/product-spec.md` (in Chinese).
```mermaid
flowchart LR
    PRE["prereg<br/>(SHA-256 sealed)"] --> RUN["Production runner<br/>3 models × 5 trials × 4 stages"]
    RUN --> JUDGE["Dual-judge<br/>(Claude + OpenAI)"]
    JUDGE --> IRR{"disagreement > 25 %?"}
    IRR -->|no| PUB["Publish verdicts"]
    IRR -->|yes| AUDIT["Human audit"] --> PUB
```
Four design rules, all enforced in code:
- Pre-registration is irreversible. Predictions live in `predictions/v0_1_prereg.md`, SHA-256-sealed and git-tag-locked. A pre-commit hook (`scripts/verify_prereg_integrity.py`) and a matching CI check fail any silent edit. New predictions require a new tag.
- Fresh API session per stage. Stages 1, 2, 3, 4 each create a new client and a new session UUID. No context reuse, no multi-turn — the model only sees its own prior outputs replayed as text.
- Open data verbatim. Every prompt sent + every response received is committed under `prompts/` and `results/`. Selective publishing is forbidden — failed trials are committed as failure records.
- Dual-judge IRR + human-audit gate. Stage 1–3 PASS/FAIL judgments run through two independent LLM judges; disagreement > 25 % on any stage triggers a human audit before results can be published.
Full architectural rules: CLAUDE.md.
Every verdict in the v0.1 report is reproducible from the locked tag.
```bash
git clone https://github.com/dongzhang84/physlit
cd physlit
git checkout prereg-v0.1-locked
uv sync
```

```bash
# .env.local
ANTHROPIC_API_KEY=...
OPENAI_API_KEY=...
GEMINI_API_KEY=...
```

```bash
uv run python scripts/run_v0_1.py    # ≈ $5.76, 60 production API calls, ~30 min
uv run python scripts/judge_v0_1.py  # ≈ $8.23, 120 judge API calls
uv run python scripts/apply_audit.py # $0 — replays the 22 committed audit verdicts
```

`analysis/v0_1_findings.md` will now contain both pre-audit and post-audit blocks. The 22 audit verdicts are committed both as prose (`analysis/v0_1_audit_human_review.md`) and as an embedded dict in `scripts/apply_audit.py`; no human re-audit is required to reproduce the published verdicts. Tested-model output is non-deterministic across vendors, so your trial responses will not be byte-identical to ours — but the verdict pattern is robust per prereg.
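The audit-replay step is conceptually just a dictionary override of the automated judge verdicts. A hypothetical sketch of the idea — the real `scripts/apply_audit.py` embeds 22 concrete verdicts, and the key shape and names here are invented for illustration:

```python
# Keys identify a judged item, e.g. (model_id, trial_index, stage); values are verdicts.
Verdicts = dict[tuple[str, int, int], str]

# Hypothetical human-audit overrides for judge-DISAGREE cases.
AUDIT_OVERRIDES: Verdicts = {
    ("example-model", 0, 1): "FAIL",
    ("example-model", 3, 2): "PASS",
}

def apply_audit(pre_audit: Verdicts, overrides: Verdicts) -> Verdicts:
    """Post-audit verdicts: human override where present, judge verdict otherwise."""
    return {key: overrides.get(key, verdict) for key, verdict in pre_audit.items()}
```

Because the overrides are committed data rather than an interactive step, replaying them costs nothing and is fully deterministic.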
```text
physlit/
├── predictions/v0_1_prereg.md            pre-reg, SHA-256 sealed, tag-locked
├── frameworks/01_aristotelian/           12 observations, criteria, prediction scenarios
├── prompts/                              all stage + judge prompts, frozen at lock
├── results/<model-id>/01_aristotelian/   60 trial JSONs + 120 judge verdicts, verbatim
├── analysis/                             findings, audit, narrative report
├── scripts/                              run / judge / audit / verify
├── src/physlit/                          runners, schema, judges (Python, mypy strict)
├── docs/product-spec.md                  methodology, design rules, predictions
├── docs/implementation-guide.md          phase-by-phase build plan
├── CLAUDE.md                             architectural rules (load-bearing)
├── CHANGELOG.md                          phase-by-phase release notes
├── LICENSE                               MIT — code
└── LICENSE-DATA                          CC BY 4.0 — frameworks, predictions, prompts, results, analysis
```
```bash
uv sync                    # install deps + dev tools
uv run pre-commit install  # one-time: hook ruff + prereg-integrity + spec validators
```

Local gates (must all pass before commit):

```bash
uv run ruff format --check .
uv run ruff check .
uv run mypy
uv run pytest
uv run python scripts/verify_prereg_integrity.py  # confirms prereg SHA-256 unchanged
```

CI never runs real API calls — only mocks in `tests/test_runners_with_mock.py`. Costly runs (`run_v0_1.py`, `judge_v0_1.py`) are gated by a confirmation prompt when the estimated spend exceeds $5.
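The $5 confirmation gate mentioned above can be as small as a threshold check plus a prompt. A sketch under assumptions — this is not the actual runner code, and the function name is invented; injecting the prompt function keeps the gate testable:

```python
from typing import Callable

def confirm_spend(estimated_usd: float, threshold_usd: float = 5.0,
                  ask: Callable[[str], str] = input) -> bool:
    """Auto-approve cheap runs; require an explicit 'y' above the threshold."""
    if estimated_usd <= threshold_usd:
        return True
    answer = ask(f"Estimated spend ${estimated_usd:.2f} > ${threshold_usd:.2f}. Proceed? [y/N] ")
    return answer.strip().lower() == "y"
```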
| Version | Scope | Budget cap | Status |
|---|---|---|---|
| v0.1 | Aristotelian × 3 models × N=5 | $50 | ✅ Done — 2026-05-11 |
| v0.1.1 | Re-judge v0.1 trials with structural criteria (N9-N12: parsimony, independence, traceability, hierarchy) | ~$10 | Planned |
| v0.2 | 5 frameworks across Categories A/B/C × 3 models × N=5 | $250 | Planned, gated on v0.1.1 reliability check |
The original v1.0 ambition of 15 frameworks has been retired in favor of methodology-first iteration. Reasoning in analysis/v0_1_report.md §4.
PhysLit welcomes:
- Reproduction reports — run `scripts/run_v0_1.py` + `judge_v0_1.py` + `apply_audit.py` yourself and open an issue if your verdict pattern diverges from ours.
- Methodology critique as GitHub issues — especially around the 25 % disagreement threshold and the audit pathway.
- Framework proposals for v0.2 — open an issue describing the framework, its Category (A: historical / B: counterfactual self-consistent / C: arbitrary rules), and a draft observation set. Authoring tier and minimum content checklist live in `docs/implementation-guide.md`.
- Code PRs — must pass `ruff check`, `mypy --strict`, `pytest`, and the prereg integrity hook.
PhysLit does not accept:
- Changes to the locked prereg or any frozen artifact under the `prereg-v0.1-locked` tag.
- Pull requests that compromise the four design rules (multi-turn shortcuts, judge-pruning to lower IRR, selective result publishing, alias-pinned model IDs).

Licensing:

- Code (`src/`, `tests/`, `scripts/`, configs) — MIT
- Data (`frameworks/`, `predictions/`, `prompts/`, `results/`, `analysis/`) — CC BY 4.0
The split is deliberate: re-use the code freely without attribution friction; re-use the data with attribution so the prereg trail stays traceable.
If you use PhysLit in academic or evaluation work, please cite the locked prereg tag:
```bibtex
@misc{physlit_v0_1_2026,
  author       = {Zhang, Dong},
  title        = {{PhysLit v0.1}: A Pre-Registered Diagnostic of LLM Physics Literacy on Aristotelian Mechanics},
  year         = {2026},
  howpublished = {\url{https://github.com/dongzhang84/physlit}},
  note         = {Pre-registration tag \texttt{prereg-v0.1-locked}, SHA-256 \texttt{769818275e6a25665116f13be2a4be440f00a8f49453fd8587239b410c7df425}}
}
```

PhysLit grew out of indie-product-playbook. The original spec lives at `ideas/physlit.md` upstream.