
dubthree/cosmo-regulus


cosmo-regulus

Adaptive radiation-tolerance scheduling and an economic Pareto curve for commercial-GPU LLM inference under measured space-radiation dose.


What this is

A small Python library and CLI that answers two questions, quantitatively:

  1. Economic Pareto question. For a given orbit / surface location, shielding mass, and target output quality, what combination of replica count and weight-scrubbing rate minimizes $/M-tokens?
  2. Adaptive scheduling question. Given a live (or simulated) particle-flux signal, what policy of detection + recovery primitives keeps quality above threshold X with throughput cost below Y%, across both quiet sun and SEP events?

The output is a curve and a controller, not a claim. Engineers argue with curves.

What this is not

  • Not a from-scratch LLM-fault-tolerance framework. That ground is already well covered (see docs/prior-art.md); this project is an additive layer.
  • Not a proof of flight-grade radiation tolerance. Software fault injection grounded in published beam-test cross-sections and measured surface-dose data is not the same as a real beam test on the actual hardware.
  • Not a competitor to ReaLM, SAVE, RedNet, or Suncatcher. It cites and depends conceptually on all of them.

Contributions (what is genuinely new)

  1. Contribution: Economic Pareto curve linking shielding mass × replica count × scrubbing rate → $/M-tokens at iso-quality, parameterized by orbit/surface location.
     Why it isn't already done: No published work draws this curve. Industry pieces (Suncatcher economics, Introl/SpaceInvestments reports) discuss launch economics; academic work doesn't connect that to the tolerance-knob trade-off.
  2. Contribution: Adaptive scheduler that reads a live (simulated) particle-flux signal and dials tolerance primitives (replica count, scrub interval, range guards, voting quorum) in real time, with SEP-event-mode escalation.
     Why it isn't already done: Primitives exist piecemeal in the literature (ATTNChecker, ReaLM, FT-Transformer); the controller that turns environment data into a real-time tolerance policy is unbuilt.
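To make the second contribution concrete, here is a minimal sketch of a threshold-based controller that maps a flux reading to the four knobs named above. It is not the code in policy/adaptive.py, which is not implemented yet. The class name, function name, flux thresholds, and per-mode settings are all hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TolerancePolicy:
    """One setting of the four tolerance knobs (hypothetical structure)."""
    replicas: int            # model replicas run per request
    scrub_interval_h: float  # hours between weight-scrub passes
    range_guards: bool       # reject activations outside a sane range
    quorum: int              # replicas that must agree on an output


# Hypothetical flux thresholds in particles / (cm^2 s). Placeholders only:
# real values would come from the environment model, not from this sketch.
ELEVATED_FLUX = 10.0
SEP_EVENT_FLUX = 1000.0


def choose_policy(flux: float) -> TolerancePolicy:
    """Pick a policy for the current flux reading, escalating as flux rises."""
    if flux < 0:
        raise ValueError("flux must be non-negative")
    if flux >= SEP_EVENT_FLUX:
        # SEP-event mode: most replicas, most frequent scrubbing.
        return TolerancePolicy(replicas=3, scrub_interval_h=1.0,
                               range_guards=True, quorum=2)
    if flux >= ELEVATED_FLUX:
        return TolerancePolicy(replicas=2, scrub_interval_h=24.0,
                               range_guards=True, quorum=2)
    # Quiet sun: cheapest setting, weekly scrub.
    return TolerancePolicy(replicas=1, scrub_interval_h=168.0,
                           range_guards=False, quorum=1)
```

A real controller would likely add hysteresis so the policy does not flap when flux hovers near a threshold; that is left out here to keep the sketch short.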

What this builds on (cited foundations)

Source: ReaLM (Xie et al., DAC 2025). arXiv:2503.24053, code (MIT)
Role: LLM-inference fault-model methodology. Cited, not forked (scope mismatch: ReaLM assumes ASIC error detection, we assume plain commercial GPUs).

Source: SAVE (Zheng et al., USENIX ATC 2025). USENIX
Role: Closest hardware target: software-only fault tolerance on commodity GPUs. Methodology reference.

Source: RedNet (Wang, Qiu et al., 2024). arXiv:2407.11853
Role: Closest published space-environment → AI-inference bridge (DNN, not LLM).

Source: Google Project Suncatcher (Nov 2025). paper
Role: Empirical TPU + AMD-host beam-test data; sanity check on our HBM cross-section assumptions.

Source: Chang'E-4 LND (Zhang et al., Science Advances 2020). PMC
Role: Primary anchor. First time-resolved dose-rate measurement on the lunar surface (~116 mGy(Si)/yr unshielded). Every downstream λ_SEU number is traceable to this measurement, not to CREME96 extrapolation.

Source: LRO CRaTER (Mazur et al., Space Weather 2011). AGU
Role: Secondary anchor; cross-checks Chang'E-4 within ~15%. Orbital-comparison branch.

More detail on what we cite and what we deliberately don't do: docs/prior-art.md.
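For readers who want the primary anchor in other units, the arithmetic below converts the ~116 mGy(Si)/yr figure to an hourly rate and to a cumulative unshielded dose over the 25-year design life mentioned in the acknowledgments. The function names are hypothetical and are not part of the library; the only inputs are numbers already stated in this README.

```python
HOURS_PER_YEAR = 365.25 * 24  # 8766 h, using a Julian year


def annual_to_hourly_uGy(dose_mGy_per_yr: float) -> float:
    """Convert an annual dose in mGy/yr to an hourly rate in µGy/h."""
    return dose_mGy_per_yr * 1000.0 / HOURS_PER_YEAR


def cumulative_dose_Gy(dose_mGy_per_yr: float, years: float) -> float:
    """Total dose in Gy accumulated at a constant annual rate."""
    return dose_mGy_per_yr * years / 1000.0


hourly = annual_to_hourly_uGy(116.0)        # about 13.2 µGy/h
lifetime = cumulative_dose_Gy(116.0, 25.0)  # about 2.9 Gy, unshielded
print(f"{hourly:.1f} µGy/h, {lifetime:.1f} Gy over 25 years")
```

The lifetime figure assumes a constant quiet-time rate with no shielding and no SEP contribution, so it is a rough bound for orientation rather than a design number.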

Quickstart

# Install (editable, with dev extras)
pip install -e ".[dev]"

# Verify install
cosmo-regulus --version

# Compute the first-cut Pareto curve for the lunar polar baseline
cosmo-regulus pareto --site connecting-ridge --quality 0.95

Output (writes experiments/01-pareto-baseline/result.png + points.csv):

[Figure: Economic Pareto curve]

3 Pareto-optimal points (of 84 evaluated) at quality >= 0.95:

 shielding_cm  replicas   scrub_h   quality     shield_$      $/M-tok
---------------------------------------------------------------------
          100         1     168.0    0.9627         2000         0.15
           50         2     168.0    0.9755         1000         0.30
           25         3     168.0    0.9723          500         0.46

First-cut numbers, anchored on Zhang 2020 Chang'E-4 LND dose data; cross-section and several rate constants are planning placeholders. See docs/results.md for the full readout, the assumptions ledger, and what shifts in v1.
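The readout above reports the non-dominated subset of the evaluated configurations. A minimal sketch of such a filter follows. It assumes three objectives (minimize shielding cost, minimize $/M-tokens, maximize quality) because that is the smallest set under which the three rows shown are mutually non-dominated; the objectives actually used in economic/pareto.py may differ. The `Point` type and function names are hypothetical, and the fourth row is a made-up dominated point added only to show the filter removing something.

```python
from typing import List, NamedTuple


class Point(NamedTuple):
    shielding_cm: float
    replicas: int
    scrub_h: float
    quality: float
    shield_usd: float
    usd_per_mtok: float


def dominates(a: Point, b: Point) -> bool:
    """True if a is no worse than b on every objective and better on one."""
    no_worse = (a.shield_usd <= b.shield_usd
                and a.usd_per_mtok <= b.usd_per_mtok
                and a.quality >= b.quality)
    better = (a.shield_usd < b.shield_usd
              or a.usd_per_mtok < b.usd_per_mtok
              or a.quality > b.quality)
    return no_worse and better


def pareto_front(points: List[Point], min_quality: float) -> List[Point]:
    """Keep feasible points that no other feasible point dominates."""
    feasible = [p for p in points if p.quality >= min_quality]
    return [p for p in feasible
            if not any(dominates(q, p) for q in feasible)]


rows = [
    Point(100, 1, 168.0, 0.9627, 2000, 0.15),  # from the readout above
    Point(50, 2, 168.0, 0.9755, 1000, 0.30),   # from the readout above
    Point(25, 3, 168.0, 0.9723, 500, 0.46),    # from the readout above
    Point(100, 2, 168.0, 0.9700, 2000, 0.31),  # made up: dominated by row 2
]
front = pareto_front(rows, min_quality=0.95)
print(len(front))  # the three readout rows survive; the made-up row does not
```

The pairwise check is quadratic in the number of points, which is fine for a grid of 84 configurations.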

The simulate subcommand (adaptive scheduler) and env validate (LND-data reproduction) are still scaffolded; both currently return "not yet implemented."

Repository layout

cosmo-regulus/
├── README.md                              ← this file
├── LICENSE                                ← Apache-2.0
├── pyproject.toml
├── docs/
│   ├── architecture.md                    what's added, how it depends on others
│   ├── prior-art.md                       what we cite and what we deliberately don't do
│   └── limitations.md                     page-1 honest list
├── src/cosmo_regulus/
│   ├── env/                               measured space dose → per-GPU λ_SEU
│   │   ├── change4_lnd.py                 Chang'E-4 LND ground-truth parsing
│   │   ├── crater.py                      LRO CRaTER cross-check
│   │   ├── shielding.py                   regolith / Al-equivalent attenuation
│   │   └── seu_rate.py                    (env, shielding, SKU) → λ_SEU
│   ├── policy/                            ← contribution #2: adaptive scheduler
│   │   ├── adaptive.py                    the controller
│   │   ├── primitives.py                  detection + recovery primitive interfaces
│   │   └── sep_event.py                   burst-mode escalation
│   ├── economic/                          ← contribution #1: economic Pareto
│   │   ├── pareto.py                      the headline curve builder
│   │   ├── tokens_per_dollar.py           cost-per-M-tokens model
│   │   └── shielding_mass.py              kg-of-shielding → launched cost
│   └── cli.py
└── tests/
    └── test_smoke.py
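As a rough illustration of what the env/ stage feeds into the scrub-interval knob, the sketch below uses the textbook form for an upset rate (flux × per-bit cross-section × number of bits) and the Poisson probability of at least one upset between scrubs. This is an assumption-laden stand-in: the repo derives λ_SEU from measured dose by a route not shown in this README, the single-exponential shielding model is a crude simplification, and every function name and parameter here is hypothetical.

```python
import math


def attenuated_flux(flux_unshielded: float, shielding_cm: float,
                    attenuation_length_cm: float) -> float:
    """Crude single-exponential attenuation (assumed model, not the repo's)."""
    return flux_unshielded * math.exp(-shielding_cm / attenuation_length_cm)


def seu_rate_per_hour(flux_cm2_s: float, sigma_cm2_per_bit: float,
                      n_bits: float) -> float:
    """Upsets per hour: flux [1/(cm^2 s)] x cross-section [cm^2/bit] x bits."""
    return flux_cm2_s * sigma_cm2_per_bit * n_bits * 3600.0


def p_upset_between_scrubs(rate_per_h: float, scrub_interval_h: float) -> float:
    """Poisson probability of at least one upset in one scrub interval."""
    return -math.expm1(-rate_per_h * scrub_interval_h)


# Placeholder inputs for illustration only; none are measured values.
rate = seu_rate_per_hour(flux_cm2_s=1.0, sigma_cm2_per_bit=1e-15, n_bits=1e12)
print(rate, p_upset_between_scrubs(rate, scrub_interval_h=1.0))
```

The point of the last function is the trade-off the Pareto curve prices: a shorter scrub interval lowers the chance that corrupted weights persist, at the cost of throughput spent scrubbing.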

Limitations (read this first)

  1. Simulation, not flight test. Software bit injection grounded in published cross-sections is not radiation. A real beam test on the target GPU could shift any number here by a multiplicative factor rather than a few percent.
  2. HBM only. Compute-unit transient faults (Tensor Cores, register files, instruction cache) are not modeled in v0.
  3. Latch-up not addressed. It is treated as a hardware-mitigation problem and assumed solved upstream.
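To show what "software bit injection" means in the first limitation, here is a minimal sketch that flips one bit in the IEEE 754 float32 encoding of a weight, plus a simple range guard of the kind listed among the tolerance primitives. The helper names and the guard limit are hypothetical; the library's own injection code is not shown in this README.

```python
import math
import struct


def flip_bit_float32(value: float, bit: int) -> float:
    """Return value with one bit (0 = LSB, 31 = sign) of its float32 form flipped."""
    if not 0 <= bit < 32:
        raise ValueError("bit must be in [0, 32)")
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped


def range_guard(x: float, limit: float) -> bool:
    """True if x is finite and within [-limit, limit] (hypothetical guard)."""
    return math.isfinite(x) and abs(x) <= limit


w = 1.0
sign_flip = flip_bit_float32(w, 31)      # sign bit: 1.0 becomes -1.0
exponent_flip = flip_bit_float32(w, 30)  # top exponent bit: 1.0 becomes inf
mantissa_flip = flip_bit_float32(w, 0)   # lowest mantissa bit: tiny change

print(range_guard(exponent_flip, limit=10.0))  # False: the guard catches it
print(range_guard(mantissa_flip, limit=10.0))  # True: passes silently
```

The two guard results illustrate why a range guard alone is not sufficient: high-exponent flips are easy to detect, while sign and mantissa flips stay in range and need another primitive, such as replica voting or scrubbing against a known-good copy.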

Full list: docs/limitations.md.

License

Apache-2.0. See LICENSE.

Chosen specifically because AGPL would be a poison pill for SpaceX adoption — a commercial proprietary stack cannot inherit the AGPL obligation. Apache-2.0's explicit patent grant matters at scale.

Acknowledgments

Built as part of a larger lunar program package at ../ (see ../README.md). The economic Pareto curve is what enables the "compute revenue anchors the cascade" thesis in that program; the adaptive scheduler is what makes commercial-GPU lunar compute survivable over a 25-year design life. The parent program is the why; this repo is the how.

Status

Pre-alpha. First end-to-end Pareto pipeline lands at this commit -- see docs/results.md for the v0 numbers. Next rungs: adaptive scheduler (policy/), and grounding the env model in the actual Chang'E-4 LND time series rather than the published aggregate. See docs/architecture.md for the build sequence.

About

Open-source Python library quantifying commercial-GPU radiation tolerance on the lunar surface (Chang'E-4 LND data, Apache-2.0)
