Adaptive radiation-tolerance scheduling and an economic Pareto curve for commercial-GPU LLM inference under measured space-radiation dose.
A small Python library and CLI that answers two questions, quantitatively:
- Economic Pareto question. For a given orbit / surface location, shielding mass, and target output quality, what combination of replica count and weight-scrubbing rate minimizes $/M-tokens?
- Adaptive scheduling question. Given a live (or simulated) particle-flux signal, what policy of detection + recovery primitives keeps quality above threshold X with throughput cost below Y%, across both quiet sun and SEP events?
The output is a curve and a controller, not a claim. Engineers argue with curves.
- Not a from-scratch LLM-fault-tolerance framework. That ground is already well covered (see
docs/prior-art.md); this project is an additive layer. - Not a proof of flight-grade radiation tolerance. Software fault injection grounded in published beam-test cross-sections and measured surface-dose data is not the same as a real beam test on the actual hardware.
- Not a competitor to ReaLM, SAVE, RedNet, or Suncatcher. It cites and depends conceptually on all of them.
| # | Contribution | Why it isn't already done |
|---|---|---|
| 1 | Economic Pareto curve linking shielding mass × replica count × scrubbing rate → $/M-tokens at iso-quality, parameterized by orbit/surface location. | No published work draws this curve. Industry pieces (Suncatcher economics, Introl/SpaceInvestments reports) discuss launch economics; academic work doesn't connect that to the tolerance-knob trade-off. |
| 2 | Adaptive scheduler that reads a live (simulated) particle-flux signal and dials tolerance primitives — replica count, scrub interval, range guards, voting quorum — in real time. SEP-event-mode escalation. | Primitives exist piecemeal in literature (ATTNChecker, ReaLM, FT-Transformer); the controller that turns environment data into a real-time tolerance policy is unbuilt. |
| Source | Role |
|---|---|
| ReaLM — Xie et al., DAC 2025. arXiv:2503.24053, code (MIT) | LLM-inference fault-model methodology. Cited, not forked (scope mismatch: ReaLM assumes ASIC error detection, we assume plain commercial GPUs). |
| SAVE — Zheng et al., USENIX ATC 2025. USENIX | Closest hardware target: software-only fault tolerance on commodity GPUs. Methodology reference. |
| RedNet — Wang, Qiu et al., 2024. arXiv:2407.11853 | Closest published space-environment → AI-inference bridge (DNN, not LLM). |
| Google Project Suncatcher — Nov 2025. paper | Empirical TPU + AMD-host beam-test data; sanity check on our HBM cross-section assumptions. |
| Chang'E-4 LND — Zhang et al., Science Advances 2020. PMC | Primary anchor. First time-resolved dose-rate measurement on the lunar surface (~116 mGy(Si)/yr unshielded). Every downstream λ_SEU number is traceable to this measurement, not to CREME96 extrapolation. |
| LRO CRaTER — Mazur et al., Space Weather 2011. AGU | Secondary anchor; cross-checks Chang'E-4 within ~15%. Orbital-comparison branch. |
More detail on what we cite and what we deliberately don't do: docs/prior-art.md.
# Install (editable, with dev extras)
pip install -e ".[dev]"
# Verify install
cosmo-regulus --version
# Compute the first-cut Pareto curve for the lunar polar baseline
cosmo-regulus pareto --site connecting-ridge --quality 0.95Output (writes experiments/01-pareto-baseline/result.png + points.csv):
3 Pareto-optimal points (of 84 evaluated) at quality >= 0.95:
shielding_cm replicas scrub_h quality shield_$ $/M-tok
---------------------------------------------------------------------
100 1 168.0 0.9627 2000 0.15
50 2 168.0 0.9755 1000 0.30
25 3 168.0 0.9723 500 0.46
First-cut numbers, anchored on Zhang 2020 Chang'E-4 LND dose data; cross-section and several rate constants are planning placeholders. See docs/results.md for the full readout, the assumptions ledger, and what shifts in v1.
The simulate subcommand (adaptive scheduler) and env validate (LND-data reproduction) are still scaffolded -- they will return "not yet implemented."
cosmo-regulus/
├── README.md ← this file
├── LICENSE ← Apache-2.0
├── pyproject.toml
├── docs/
│ ├── architecture.md what's added, how it depends on others
│ ├── prior-art.md what we cite and what we deliberately don't do
│ └── limitations.md page-1 honest list
├── src/cosmo_regulus/
│ ├── env/ measured space dose → per-GPU λ_SEU
│ │ ├── change4_lnd.py Chang'E-4 LND ground-truth parsing
│ │ ├── crater.py LRO CRaTER cross-check
│ │ ├── shielding.py regolith / Al-equivalent attenuation
│ │ └── seu_rate.py (env, shielding, SKU) → λ_SEU
│ ├── policy/ ← contribution #2: adaptive scheduler
│ │ ├── adaptive.py the controller
│ │ ├── primitives.py detection + recovery primitive interfaces
│ │ └── sep_event.py burst-mode escalation
│ ├── economic/ ← contribution #1: economic Pareto
│ │ ├── pareto.py the headline curve builder
│ │ ├── tokens_per_dollar.py cost-per-M-tokens model
│ │ └── shielding_mass.py kg-of-shielding → launched cost
│ └── cli.py
└── tests/
└── test_smoke.py
- Simulation, not flight test. Software bit injection grounded in published cross-sections is not radiation. A real beam test on the target GPU can shift any number by factors.
- HBM only. Compute-unit transient faults (Tensor Cores, register files, instruction cache) not modeled in v0.
- Latch-up not addressed. Hardware-mitigation problem; assumed solved upstream.
Full list: docs/limitations.md.
Apache-2.0. See LICENSE.
Chosen specifically because AGPL would be a poison pill for SpaceX adoption — a commercial proprietary stack cannot inherit the AGPL obligation. Apache-2.0's explicit patent grant matters at scale.
Built as part of a larger lunar program package at ../ (see ../README.md). The economic Pareto curve is what enables the "compute revenue anchors the cascade" thesis in that program; the adaptive scheduler is what makes commercial-GPU lunar compute survivable over a 25-year design life. The parent program is the why; this repo is the how.
Pre-alpha. First end-to-end Pareto pipeline lands at this commit -- see docs/results.md for the v0 numbers. Next rungs: adaptive scheduler (policy/), and grounding the env model in the actual Chang'E-4 LND time series rather than the published aggregate. See docs/architecture.md for the build sequence.
