Bad PI: a coordination layer for distributed autoresearch swarms.
Autoresearch by yourself is great. Autoresearch with friends is better!!
1000+ workers, each running 5-minute training experiments on their own GPU, coordinated by Bad PI: maintaining scientific hypotheses, updating beliefs from evidence, eliminating bad ideas deliberately, and continuously narrowing the search space.
Developer smoke-test / regression guide: dev_checklist_readme.md
Bad PI now assumes a shared base template and a mutable live-update block in program.md.
This is required for coordinated worker behavior.
- Meta-agent workspace root must contain `program.md` (or set `META_BASE_PROGRAM_MD_PATH`).
- Worker-side autoresearch folder should start with the same base `program.md` content.
Workers use that base locally; the server sends updates only after the first generated update.
Your base program.md should include immutable charter text plus these markers:
<!-- BAD_PI_MUTABLE_START -->
... live update block (managed by Meta-PI) ...
<!-- BAD_PI_MUTABLE_END -->

Behavior:
- Everything outside the markers is treated as stable charter guidance.
- Meta-PI rewrites only the mutable block when generating updates.
- Workers check for updates every run and apply them only when digest/content changes.
- One-time startup bootstrap: on server initialization, Meta-PI can use base program.md + meta_server/schema.sql + current dimensions to suggest missing structural dimensions (for example architecture axes).
- Later proposals only on stall: after startup, new-dimension proposals are generated only when progress stalls (not on every mutable rewrite).
- Normal mutable rewrites: routine `program.md` updates use evidence + current state; they do not repeatedly force a schema-wide dimension rewrite each cycle.
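As a sketch, the marker-based rewrite and the worker-side change check could be implemented like this (`replace_mutable_block` and `mutable_digest` are illustrative names, not the actual Meta-PI code):

```python
import hashlib
import re

# Matches the managed block between the two marker comments.
MUTABLE_RE = re.compile(
    r"(<!-- BAD_PI_MUTABLE_START -->)(.*?)(<!-- BAD_PI_MUTABLE_END -->)",
    re.DOTALL,
)

def replace_mutable_block(program_md: str, new_block: str) -> str:
    # Rewrite only the text between the markers; charter text is untouched.
    return MUTABLE_RE.sub(
        lambda m: f"{m.group(1)}\n{new_block}\n{m.group(3)}", program_md
    )

def mutable_digest(program_md: str) -> str:
    # Workers compare this digest each run and apply updates only on change.
    block = MUTABLE_RE.search(program_md).group(2)
    return hashlib.sha256(block.encode()).hexdigest()
```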
The original standalone autoresearch flow assumes a mostly static program.md.
This distributed Meta-PI flow depends on a stable base + controlled incremental updates,
so all workers stay aligned while still receiving evolving instructions.
Workers pull a suggested config from Bad PI, run their 5-minute experiment, and push results back. Bad PI aggregates everything and updates the search strategy every 60 seconds. Green arrows carry data upward (results), teal arrows carry directives downward (configs, program.md updates).
Unlike a plain hyperparameter optimizer, the meta-agent maintains hypotheses — falsifiable scientific claims — and tracks a probability distribution over each one as experiments arrive.
Each hypothesis is assigned a posterior P via Bayesian (Beta-Binomial) updating. Workers are allocated proportionally to information value:
information_value(H) = uncertainty(H) × importance(H)
= 4·P·(1−P) × expected_impact
worker_allocation(H) ∝ softmax(information_value)
To prevent "eternal uncertainty" hypotheses from consuming too much capacity, Bad PI applies an allocation penalty after long indecision streaks and triggers a focused decision sprint to force decisive evidence.
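In code, the allocation rule sketches out roughly like this (the hypothesis names and importance values are made up for illustration):

```python
import math

def information_value(p: float, importance: float) -> float:
    # uncertainty(H) = 4*P*(1-P): peaks at P = 0.5, vanishes as P -> 0 or 1
    return 4.0 * p * (1.0 - p) * importance

def worker_allocation(hypotheses: dict) -> dict:
    # hypotheses: name -> (posterior P, expected impact); returns softmax shares
    iv = {h: information_value(p, imp) for h, (p, imp) in hypotheses.items()}
    z = sum(math.exp(v) for v in iv.values())
    return {h: math.exp(v) / z for h, v in iv.items()}

shares = worker_allocation({
    "depth_helps":    (0.84, 0.8),  # fairly settled
    "lr_batch_inter": (0.51, 0.8),  # maximum uncertainty
    "window_matters": (0.12, 0.8),  # probably false
})
# The maximally uncertain hypothesis receives the largest worker share.
```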
For LLM-proposed hypotheses, allocation is intentionally ramped to avoid wasting resources on hallucinations:
information_value(H) = 4·P·(1−P) × importance × llm_credibility
llm_credibility = 0.25 at n=0, linearly ramping to 1.0 by n=12
So new LLM hypotheses start cheap, then earn full worker share only after accumulating evidence.
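The ramp itself is a one-liner (a sketch of the schedule described above):

```python
def llm_credibility(n: int) -> float:
    # 0.25 with no evidence, linear ramp to full credibility by n = 12 runs
    return min(1.0, 0.25 + 0.75 * n / 12)

def llm_information_value(p: float, importance: float, n: int) -> float:
    # Same information value as engine hypotheses, discounted while unproven
    return 4.0 * p * (1.0 - p) * importance * llm_credibility(n)
```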
- P ≈ 0.84 (high confidence): mostly exploit — refine around the known good region
- P ≈ 0.51 (maximum uncertainty): maximum workers — we genuinely don't know
- P ≈ 0.12 (probably false): deliberate falsification run — hold everything fixed, only vary the relevant dimension, run until we have clean statistical evidence
- P ≈ 0.28 (speculative): small moonshot allocation
Each hypothesis starts with a mild prior:
{
"prior": {"alpha": 2, "beta": 2, "posterior": 0.5}
}

Each completed experiment is converted into binary evidence:
{
"experiment": {"delta_bpb": -0.012},
"outcome": "win"
}

The outcome is `win` if `delta_bpb < 0`, `loss` otherwise.
Posterior update:
{
"posterior_update": {
"alpha": 2 + wins,
"beta": 2 + losses,
"posterior_mean": "alpha / (alpha + beta)"
}
}

The engine also computes exact Beta posterior evidence terms:
{
"bayesian_evidence": {
"credible_interval_90": [0.41, 0.82],
"support_probability": "Pr(theta > 0.60)",
"refute_probability": "Pr(theta < 0.40)",
"rope_probability": "Pr(0.40 <= theta <= 0.60)"
}
}

Status is decided from posterior mass, not just the mean:
{
"status_rule": {
"supported": "support_probability >= 0.90 and n >= 10",
"refuted": "refute_probability >= 0.90 and n >= 10",
"active": "otherwise"
}
}

The LLM does not generate hypotheses as freeform text. Every proposal goes through a strict three-step pipeline:
Step 1 — Anthropic tool_use call with enforced Pydantic schema
The PI calls propose_hypotheses as a structured tool (program_writer.py), so the LLM response is always valid JSON. The HypothesisProposal schema the LLM must fill exactly:
{
"statement": "Falsifiable claim, e.g. 'DEPTH > 12 interacts with learning_rate'",
"type": "positive | comparative | interaction | null",
"importance": 0.72,
"rationale": "1-sentence statistical reason based only on the data shown",
"config_constraint": {"DEPTH": 12},
"phase": "exploration | validation",
"test_spec": {"type": "single_factor_effect", "variable": "DEPTH", "values": [8, 12], "min_runs_per_cell": 3, "decision_rule": {"threshold": 0.05}},
"parent_id": "optional_parent_hypothesis_id"
}

`rationale` is required — the LLM must cite the observed data, not just assert a claim. `config_constraint` holds values that must be frozen for a controlled experiment; an empty dict means a global hypothesis. `test_spec` is required for all proposals and defines how the hypothesis will be tested or falsified.
Step 2 — Registry gate (HypothesisRegistry.evaluate_llm_proposal)
After schema validation, each proposal must pass:
| Gate | Rule |
|---|---|
| `schema_valid` | Pydantic parse succeeded (guaranteed by tool_use) |
| `novel` | Statement text does not normalize-match any existing hypothesis |
| `semantic_novel` | Statement must not be a near-duplicate by semantic similarity |
| `importance_threshold` | `importance >= 0.15` — proposals below this are too vague to test |
| `valid_constraint` | `config_constraint` must be a dict |
Accepted example:
{
"llm_proposal": {
"statement": "DEPTH > 12 interacts with learning_rate",
"type": "interaction",
"importance": 0.72,
"config_constraint": {}
},
"engine_gate": {
"accepted": true,
"reason": "schema_valid_and_novel",
"registry_add": true,
"immediate_forced_pursuit": false
}
}

Rejected example:
{
"llm_proposal": {
"statement": "WINDOW_PATTERN matters",
"importance": 0.05
},
"engine_gate": {
"accepted": false,
"reason": "importance_too_low",
"registry_add": false,
"immediate_forced_pursuit": false
}
}

Step 3 — Registry add only, no forced allocation
Accepted proposals are added with a flat Beta(2,2) prior (P=0.5, maximum uncertainty). They are not immediately given workers. Allocation is recomputed at the next cycle from information_value = 4·P·(1−P) × importance × llm_credibility.
Adding to the registry does not guarantee worker allocation. Allocation still depends on the Bayesian evidence and information value.
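For intuition, the Beta-Binomial update and the posterior-mass status rule described earlier can be sketched without any dependencies (crude midpoint integration stands in for an exact incomplete-beta function; thresholds mirror the `status_rule` above):

```python
from math import exp, lgamma

def beta_pdf(x: float, a: float, b: float) -> float:
    # Density of Beta(a, b) at x
    log_norm = lgamma(a + b) - lgamma(a) - lgamma(b)
    return exp(log_norm) * x ** (a - 1) * (1 - x) ** (b - 1)

def prob_above(a: float, b: float, t: float, steps: int = 20_000) -> float:
    # Pr(theta > t) under Beta(a, b), by midpoint integration
    width = (1 - t) / steps
    return sum(beta_pdf(t + (i + 0.5) * width, a, b) for i in range(steps)) * width

def hypothesis_status(wins: int, losses: int) -> str:
    a, b = 2 + wins, 2 + losses          # Beta(2, 2) prior
    n = wins + losses
    support = prob_above(a, b, 0.60)     # Pr(theta > 0.60)
    refute = 1 - prob_above(a, b, 0.40)  # Pr(theta < 0.40)
    if support >= 0.90 and n >= 10:
        return "supported"
    if refute >= 0.90 and n >= 10:
        return "refuted"
    return "active"
```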
Each population receives its own program.md generated by the PI (Claude), tailored to its hypothesis and strategy:
| Population | Strategy | What the program.md says |
|---|---|---|
| Pop A | Exploit | "Best region is depth 12-14, lr 1e-3 to 3e-3. Refine here." |
| Pop B | Investigate | "Map the LR x batch interaction surface broadly." |
| Pop C | Falsify | "CONTROLLED: fix all params at best known values. Only vary WINDOW_PATTERN." |
| Pop D | Moonshot | "Try unusual/extreme combinations. High variance is fine." |
The server now maintains a persistent runtime state (meta_server/runtime.py) that:
- Bootstraps hypotheses from the active `dimensions` table on a fresh project (`experiment_count == 0`), so the registry starts aligned with the current schema. On non-fresh runs it loads from `runtime_state.json`.
- Spawns a `Population` for each active hypothesis — one population per hypothesis, each with its own strategy and `program.md`.
- Assigns workers to populations on first contact (`/register` or `/next_config`), allocated by Bayesian information-value softmax.
- Shapes every `next_config` response with the hypothesis's `config_constraint` (locked values required for controlled experiments).
- Returns a population-specific `program.md` through `/sync/{worker_id}`, so different worker populations get different research instructions.
- Ingests each completed experiment result into the relevant hypotheses, updates Beta posteriors, and re-syncs populations.
- Archives refuted hypotheses (status=`refuted`, n≥12) and frees their workers.
- Generates hypothesis proposals from the LLM on the global `program.md` write cycle, gates them through `HypothesisRegistry.ingest_llm_proposals`, and only adds accepted ones.
- Checkpoints the meta-hypothesis log every 100 experiments.
- Persists all state to `runtime_state.json` (path configurable via `META_RUNTIME_STATE_PATH`).
Example `/next_config` response:

{
"exp_id": "a1b2c3d4-...",
"config_delta": {"DEPTH": 12, "learning_rate": 0.0018},
"budget_seconds": 420,
"priority": 0.72,
"note": "exploit · pop_a8365b — Depth > 10 improves val_bpb",
"population_id": "pop_a8365b",
"population_strategy": "exploit",
"hypothesis_id": "9f3c1a",
"hypothesis_statement": "Depth > 10 improves val_bpb"
}

Example `/sync/{worker_id}` response:

{
"program_md": "# pop_a8365b — EXPLOIT\n*...*",
"experiment_count": 347,
"active_workers": 24,
"population_id": "pop_a8365b",
"population_strategy": "exploit",
"hypothesis_id": "9f3c1a",
"hypothesis_statement": "Depth > 10 improves val_bpb",
"dimensions": [...],
"top_configs": [...]
}

meta_server/
runtime_state.json ← persistent registry + population assignments
meta_hypothesis_log.md ← checkpoint journal (auto-written every 100 exps)
Configurable via environment:
META_RUNTIME_STATE_PATH=/data/runtime_state.json # default: meta_server/runtime_state.json
Each hypothesis now also tracks an effect-size Gaussian summary (effect_mu, effect_sem) alongside Beta-Binomial evidence.
Inspect parent/child and linked hypothesis structure:
- `GET /theory_graph` → `{ "nodes": [...], "edges": [...] }`
- `GET /theory_graph/human` → human-readable derived summary layer

Edge types:
- `decomposes_into` (parent → child)
- `linked` (related hypotheses)
The graph JSON is the source of truth. The /theory_graph/human response is explicitly derived (derived_not_authoritative=true) and can use LLM translation with deterministic fallback.
When search is stalled, Bad PI still asks the LLM for new dimensions (not just new hypotheses). Those proposals can now be auto-adopted into live search only if all gates pass:
- Schema-valid and unique dimension name
- Repeated signal: same normalized proposal appears in at least 2 stall cycles
- Bounded search space
- numeric ranges must be sane (`min < max`, capped span)
- categorical proposals capped at 8 categories
- Canary phase on adoption
- starts with low importance and canary sampling probability (~12% of proposed configs)
- evaluated after 40 completed experiments
- auto-reverted if best delta does not improve by at least 0.001
If canary improves best delta enough, it is promoted to normal full-search behavior automatically.
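A simplified sketch of the canary gate (the constants mirror the numbers above; `maybe_apply_canary` and `evaluate_canary` are illustrative names, not the actual server code):

```python
import random

CANARY_SAMPLE_PROB = 0.12   # ~12% of proposed configs include the new dimension
CANARY_EVAL_AFTER = 40      # completed experiments before judging the canary
MIN_IMPROVEMENT = 0.001     # required best-delta improvement to keep it

def maybe_apply_canary(config: dict, dim: str, values: list) -> dict:
    # Only a small fraction of configs explore the candidate dimension.
    if random.random() < CANARY_SAMPLE_PROB:
        config = {**config, dim: random.choice(values)}
    return config

def evaluate_canary(best_before: float, best_with_canary: float, n_done: int) -> str:
    # Deltas are lower-is-better, so improvement = best_before - best_with_canary.
    if n_done < CANARY_EVAL_AFTER:
        return "pending"
    if best_before - best_with_canary >= MIN_IMPROVEMENT:
        return "promote"
    return "revert"
```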
Every LLM-proposed hypothesis must include an executable test_spec that defines how to test or falsify the claim. The phase field indicates the maturity level:
- `phase="exploration"` → hypothesis is early-stage; test_spec guides signal collection and incremental belief updates
- `phase="validation"` → hypothesis is mature; test_spec is a strict deterministic protocol with fixed sample sizes and decision thresholds
Two deterministic test types are live in v1:
single_factor_effect
- Vary one variable over specific arms (e.g. `DEPTH` in `[8, 12]`)
- Require `min_runs_per_cell` repeats per arm
- Compute arm means and compare effect size to `decision_rule.threshold`
interaction_grid
- Build a 2D grid over two variables (e.g. `DEPTH` × `learning_rate`)
- Fill each cell to `min_runs_per_cell`
- Compute deterministic interaction strength (deviation from additive expectation)
- Mark test win/loss by threshold
How they work together in practice:
- Use `single_factor_effect` first to verify a clean main effect.
- Use `interaction_grid` next to test whether variable combinations produce non-additive gains.
Important: for validation hypotheses, Bad PI updates belief on completed tests (one vote per completed protocol), not every individual run.
Detailed spec and examples: docs/testspec_validation.md
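To make the protocol concrete, here's a minimal, dependency-free sketch of the `single_factor_effect` decision (field names follow the test_spec schema shown earlier; run records are assumed to carry a config dict and a lower-is-better metric):

```python
def single_factor_effect(runs: list, test_spec: dict) -> dict:
    # runs: list of {"config": {...}, "metric": float}, lower-is-better metric
    var = test_spec["variable"]
    arms = {v: [r["metric"] for r in runs if r["config"].get(var) == v]
            for v in test_spec["values"]}
    # Protocol is incomplete until every arm has min_runs_per_cell repeats.
    if any(len(ms) < test_spec["min_runs_per_cell"] for ms in arms.values()):
        return {"complete": False}
    means = {v: sum(ms) / len(ms) for v, ms in arms.items()}
    effect = max(means.values()) - min(means.values())
    return {
        "complete": True,
        "effect_size": effect,
        "win": effect >= test_spec["decision_rule"]["threshold"],
    }
```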
Every 100 experiments the PI writes a new checkpoint to meta_hypothesis_log.md:
## Checkpoint 4 · 350 experiments · 2026-04-07 14:22
### Belief movements
| Hypothesis | Prior P | Current P | Delta | n | Status |
|--------------------------------|---------|-----------|--------|----|-----------|
| Depth > 10 improves val_bpb | 0.72 | 0.84 | +0.12 | 40 | supported |
| LR x batch size interact | 0.38 | 0.51 | +0.13 | 22 | active |
| WINDOW_PATTERN affects val_bpb | 0.34 | 0.12 | -0.22 | 22 | refuted |
### Eliminated this cycle
**WINDOW_PATTERN affects val_bpb** — REFUTED (P=0.12, n=22)
> Evidence: [n=20] WIN delta=-0.011 | [n=21] LOSS delta=+0.002 | [n=22] WIN delta=-0.008
### New hypotheses generated
- **"Depth x learning rate interaction"** — P=0.50 (NEW)
Rationale: high-depth experiments show different LR sensitivity curves
### Population changes
- Pop C dissolved (WINDOW refuted) — 8 workers freed
- Spawned pop_e3f1a2 (investigate, 15 workers) for "Depth x learning rate interaction"

This log is the institutional memory of the swarm.
Workers report a compact tick at each 20% progress checkpoint:
{ "id": "run-uuid", "p": 0.2, "m": 1.9, "d": -0.05 }

The scheduler compares the metric against all other runs that have reached that bucket. Kill decisions are probabilistic — being below the cutoff is not an automatic death sentence. Kill probability scales linearly with depth below the threshold:
p_kill = STOCHASTIC_KILL_MAX_PROB × (threshold_pct − rank_pct) / threshold_pct
= 0.65 × (33.3 − rank_pct) / 33.3 [with eta=3]
| Rank percentile in pool | Kill probability |
|---|---|
| 33rd pct (at the cutoff) | 0% — no risk |
| ~10th percentile | ~46% |
| 0th pct (absolute worst) | 65% |
A run that ranks poorly in one bucket may survive the stochastic reprieve and recover in the next. A consistently bad run accumulates kill draws across multiple buckets and approaches near-certainty of termination. The scheduler avoids being overconfident about single noisy measurements.
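The kill-probability curve is simple to sketch (constants follow the formula above; with eta=3 the bottom third of the bucket pool is at risk):

```python
STOCHASTIC_KILL_MAX_PROB = 0.65
THRESHOLD_PCT = 100 / 3   # eta=3: bottom third of the bucket pool is at risk

def kill_probability(rank_pct: float) -> float:
    # rank_pct: the run's percentile within its progress bucket (0 = worst)
    if rank_pct >= THRESHOLD_PCT:
        return 0.0
    return STOCHASTIC_KILL_MAX_PROB * (THRESHOLD_PCT - rank_pct) / THRESHOLD_PCT
```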
The agent responds immediately to each tick:
{ "action": "stop" } // stochastic kill fired
{ "action": "extend", "budget": 420 } // extend — top 11% at final bucket
{}                              // continue (or stochastic reprieve)

This creates three distinct run shapes: short killed runs (coral), full normal runs (gray), and extended top-performer runs (teal, with a dashed extra segment). Only the useful compute budget is consumed.
Run lifecycle: start → update (×5) → complete | stop | extend → complete
Writing a new program.md via the Claude API takes ~30 seconds. Without pipelining, every batch transition causes a GPU idle gap.
The pipeline solves this with ideas from CPU architecture:
Speculative execution — after 10 ticks arrive from the current batch, the meta-agent analyzes early trends, estimates confidence, and starts drafting the next program.md in the background. By the time Batch N finishes, program N+1 is already READY.
Zero idle — workers pre-cache the speculative program via GET /pipeline/spec (or structured GET /pipeline/spec_payload) between runs. When Batch N ends they load N+1 instantly.
Pipeline flush — if a late-arriving anomaly (outlier result, hypothesis flip) invalidates the prediction, POST /pipeline/flush is called. Workers polling GET /pipeline/flush_token detect the token mismatch and re-pull from /program.md. A new spec is drafted for the corrected direction.
The pipeline tracks a hit rate (confirmed / (confirmed + flushed)). A well-calibrated swarm with stable hypotheses achieves 80-90% hit rate, meaning GPUs are idle for only 10-20% of transitions instead of 100%.
This mode is off by default. When enabled, workers can use predicted program.md early, before normal sync.
Deployment is gated by two confidence checks:
- Spec confidence — confidence of the current predicted spec.
- Deployment confidence bank — long-term trust score that increases on confirmed predictions and drops on flushes.
If a deployed prediction is later wrong, a flush is issued and the confidence bank takes a penalty. The bank must rebuild above threshold before speculative deploy is allowed again.
Server env vars:
BAD_PI_SPEC_EXEC_ENABLED=1
BAD_PI_SPEC_AUTO_DEPLOY=1
BAD_PI_SPEC_CONF_THRESHOLD=0.65
BAD_PI_DEPLOY_CONF_THRESHOLD=0.70
BAD_PI_PIPELINE_BATCH_EVERY=20

Worker opt-in flag:

python worker/run.py --use-spec-pipeline

Worker flow in this mode:
- preloads `GET /pipeline/spec_payload`
- checks `GET /pipeline/flush_token`
- if the flush token matches the cached spec id, drops the speculative file and re-syncs from `/sync/{worker_id}`.
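A sketch of that polling logic (the response field names `spec_id` and `token` are assumptions about the payload shape, not confirmed API fields):

```python
import json
import urllib.request

META_URL = "http://META_IP:8000"  # your meta-agent server

def get_json(path: str) -> dict:
    with urllib.request.urlopen(META_URL + path, timeout=10) as resp:
        return json.loads(resp.read())

def spec_was_flushed(cached_spec_id, flush_token) -> bool:
    # The flush token identifies an invalidated spec; a match means our
    # cached speculative program.md must be dropped.
    return flush_token is not None and flush_token == cached_spec_id

def next_program_md(worker_id: str):
    spec = get_json("/pipeline/spec_payload")    # pre-cached between runs
    flush = get_json("/pipeline/flush_token")
    if spec_was_flushed(spec.get("spec_id"), flush.get("token")):
        return get_json(f"/sync/{worker_id}")    # fall back to normal sync
    return spec.get("program_md")
```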
Queueing now uses a duplicate policy per exact config_delta fingerprint:
- high-priority configs get a few repeats for confidence
- hard cap on total repeats per config
- cap on simultaneous in-flight duplicates
Current defaults:
MAX_TRIALS_PER_CONFIG = 6
MAX_INFLIGHT_PER_CONFIG = 2
priority >= 0.85 -> desired repeats 4
priority >= 0.70 -> desired repeats 3
priority >= 0.55 -> desired repeats 2
else -> desired repeats 1
This gives repeated validation for promising configs without flooding the cluster with identical runs.
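The repeat policy reduces to a small lookup (thresholds copied from the defaults above):

```python
MAX_TRIALS_PER_CONFIG = 6
MAX_INFLIGHT_PER_CONFIG = 2

def desired_repeats(priority: float) -> int:
    # Higher-priority configs earn more confirmation runs, up to the hard cap.
    if priority >= 0.85:
        repeats = 4
    elif priority >= 0.70:
        repeats = 3
    elif priority >= 0.55:
        repeats = 2
    else:
        repeats = 1
    return min(repeats, MAX_TRIALS_PER_CONFIG)
```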
# Simulate 50 concurrent workers and 3 rounds locally
python simulate.py
# More workers, more rounds
python simulate.py --workers 100 --rounds 5
# Run against a live meta-agent server
python simulate.py --against-server --meta-url http://META_IP:8000

The simulator uses a synthetic simulate_metric(config, progress) function that mimics a realistic noisy training curve. It lets you validate the early-stopping thresholds and pipeline behavior without any GPUs.
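The real simulate_metric lives in simulate.py; a stand-in with the same flavor (the particular floor/decay constants here are invented for illustration) might look like:

```python
import math
import random

def simulate_metric(config: dict, progress: float) -> float:
    # Noisy, exponentially improving loss curve whose floor depends on config.
    lr = config.get("learning_rate", 1e-3)
    depth = config.get("DEPTH", 8)
    # Invented "ground truth": deeper helps a bit, LR far from 1e-3 hurts.
    floor = 1.6 - 0.01 * min(depth, 16) + 0.05 * abs(math.log10(lr / 1e-3))
    start = 2.2
    curve = floor + (start - floor) * math.exp(-4.0 * progress)
    return curve + random.gauss(0.0, 0.01)  # measurement noise
```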
An interactive dashboard is included at dashboard/app.py.
It shows:
- server health KPIs
- leaderboard table
- hypothesis posterior trajectories over time (during dashboard session)
- current theory graph status mix
- worker-facing `program_md` preview via `/sync/{worker_id}`
pip install -r dashboard/requirements.txt
streamlit run dashboard/app.py

Then open the local URL printed by Streamlit (usually http://localhost:8501).
By default, the dashboard runs only on your local machine and is intended for the organizer to monitor swarm health and belief evolution during active runs. Workers do not have access to it.
To share the dashboard with your team, deploy it behind a reverse proxy with authentication:
# Install Nginx and create auth file
sudo apt-get install nginx apache2-utils   # htpasswd ships with apache2-utils
sudo htpasswd -c /etc/nginx/dashboard.htpasswd $USERNAME
# Create /etc/nginx/sites-available/dashboard
server {
listen 8501;
server_name _;
auth_basic "Bad PI Dashboard";
auth_basic_user_file /etc/nginx/dashboard.htpasswd;
location / {
proxy_pass http://127.0.0.1:8502;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
}
}
# Enable and restart
sudo ln -s /etc/nginx/sites-available/dashboard /etc/nginx/sites-enabled/
sudo systemctl restart nginx
# Run Streamlit on localhost:8502 (not exposed directly)
streamlit run dashboard/app.py --server.port 8502 --server.address 127.0.0.1

Then access via http://YOUR_VM_IP:8501 with credentials.
Wrap the Streamlit app in a Docker container that runs behind your existing meta-agent proxy:
FROM python:3.11-slim
WORKDIR /app
COPY dashboard/requirements.txt .
RUN pip install -r requirements.txt
COPY dashboard/app.py .
CMD ["streamlit", "run", "app.py", "--server.port", "8502", "--server.address", "0.0.0.0"]

Build and run:
docker build -f Dockerfile.dashboard -t badpi-dashboard .
docker run -p 127.0.0.1:8502:8502 \
-e STREAMLIT_SERVER_PORT=8502 \
  badpi-dashboard

Then proxy through Nginx as above (listen on 8501, proxy to 127.0.0.1:8502).
Modify dashboard/app.py to check a token query parameter:
import streamlit as st
import os
# At the top of app.py, before any other streamlit calls:
if "REQUIRE_DASHBOARD_AUTH" in os.environ:
    # st.query_params values are plain strings (the older
    # experimental_get_query_params API returned lists)
    token = st.query_params.get("token", "")
    valid_tokens = os.environ.get("DASHBOARD_TOKENS", "").split(",")
    if token not in valid_tokens:
        st.error("Invalid or missing dashboard token")
        st.stop()

Then deploy with:
export REQUIRE_DASHBOARD_AUTH=1
export DASHBOARD_TOKENS="token1,token2,token3"
streamlit run dashboard/app.py

Workers would access via http://YOUR_VM_IP:8501?token=TOKEN.
git clone https://github.com/yourname/autoresearch-meta
cd autoresearch-meta
# With Docker (recommended)
ANTHROPIC_API_KEY=sk-... META_ENROLL_TOKEN=team-shared-invite docker compose up -d
# Without Docker
pip install -r meta_server/requirements.txt
uvicorn meta_server.main:app --host 0.0.0.0 --port 8000

A $6/mo cloud VM is plenty — the server is CPU-only. Share http://YOUR_VM_IP:8000 with all workers.
If you are the person hosting the meta-agent, do this exactly:
This is the shared secret that workers need the first time they register.
Use one of these commands to generate a strong token:
# Python (works on most machines)
python -c "import secrets; print(secrets.token_urlsafe(24))"
# Or OpenSSL
openssl rand -base64 24

Example result:
5qD7nL4sF2KjV8vQmYp2bZ0uHcT9xA1e
Call this your enroll token.
export META_ENROLL_TOKEN="PASTE_YOUR_TOKEN_HERE"
export BAD_PI_LLM_PROVIDER="auto" # auto|anthropic|openai|gemini
# Provide one key for your selected provider (or multiple if using auto)
export ANTHROPIC_API_KEY="sk-..." # optional
export OPENAI_API_KEY="sk-..." # optional
export GEMINI_API_KEY="..." # optional (or GOOGLE_API_KEY)
docker compose up -d

If you are not using Docker:
export META_ENROLL_TOKEN="PASTE_YOUR_TOKEN_HERE"
pip install -r meta_server/requirements.txt
# Optional provider selection + keys
export BAD_PI_LLM_PROVIDER="auto" # auto|anthropic|openai|gemini
# export ANTHROPIC_API_KEY="sk-..."
# export OPENAI_API_KEY="sk-..."
# export GEMINI_API_KEY="..." # or GOOGLE_API_KEY
uvicorn meta_server.main:app --host 0.0.0.0 --port 8000

Open this in a browser:
http://YOUR_VM_IP:8000/health
You should see JSON like:
{
"status": "ok",
"experiments": 0,
"queue_depth": 200,
"active_workers": 0
}

Send each teammate this info:

{
"meta_url": "http://YOUR_VM_IP:8000",
"enroll_token": "PASTE_YOUR_TOKEN_HERE"
}

That is all they need to join.
If you are a teammate joining the swarm, you need:
{
"meta_url": "http://YOUR_VM_IP:8000",
"enroll_token": "THE_TOKEN_THE_ORGANIZER_SENT_YOU"
}

Then do this exactly once:
- Clone this repo
- Install worker requirements
- Make sure your `train.py` follows the contract below
- Run `worker/setup_worker.py`
- Start `worker/run.py`
What happens during setup:
{
"step_1": "run your unmodified train.py once to measure baseline",
"step_2": "send worker_id + gpu_type + baseline_bpb + enroll_token to /register",
"step_3": "server verifies enroll token",
"step_4": "server returns a private worker_token",
"step_5": "worker saves worker_token into .worker_config.json",
"step_6": "future requests use X-Worker-Token automatically"
}

Important: after setup, you do not need to type the token again. The worker stores it and uses it automatically.
git clone https://github.com/yourname/autoresearch-meta
cd autoresearch-meta
pip install -r worker/requirements.txt
python worker/setup_worker.py \
--worker-id YOUR_NAME \
--gpu-type H100 \
--train-py /path/to/autoresearch/train.py \
--meta-url http://META_IP:8000 \
  --enroll-token YOUR_ENROLL_TOKEN

Example:
python worker/setup_worker.py \
--worker-id alice-h100 \
--gpu-type H100 \
--train-py /Users/alice/autoresearch/train.py \
--meta-url http://203.0.113.10:8000 \
  --enroll-token 5qD7nL4sF2KjV8vQmYp2bZ0uHcT9xA1e

If setup succeeds, it will:
- print the current `program.md`
- save a local `worker/.worker_config.json`
- store your private `worker_token` there
- tell you to run `python worker/run.py`
If the enroll token is wrong, registration will fail with 401 Invalid enroll token.
To avoid confusion, every worker's train.py should follow this contract before running setup_worker.py.
- Expose tunable top-level constants (plain `KEY = value` assignments)
- Include `TOTAL_WALL_CLOCK_TIME` as a top-level constant
- Call `report(metric, progress)` at each evaluation/checkpoint in training
If this contract is missing, workers may still run, but you lose key meta-agent behavior:
- no reliable patching of hyperparameters
- no dynamic budget control
- no early stopping / extension decisions
# ---- agent-patchable top-level constants ----
# These must match the active schema dimensions.
# Current default profile in this repo (MNIST demo):
LR = 1e-3
BATCH_SIZE = 32
HIDDEN_SIZE = 128
N_LAYERS = 2
WEIGHT_DECAY = 1e-4
OPTIMIZER = "adam"
# Required: worker/run.py patches this every run
TOTAL_WALL_CLOCK_TIME = 300
from worker.report import report
def train_loop():
    # Example: report at 5 checkpoints (20%, 40%, 60%, 80%, 100%)
    for step in range(total_steps):
        # ... training ...
        if should_eval(step):
            val_bpb = evaluate()
            progress = step / max(1, total_steps)
            report(val_bpb, progress)

- `progress` must be between `0.0` and `1.0`.
- Metric should be lower-is-better (for nanochat this is `val_bpb`).
- Keep constants as simple top-level assignments so patching works reliably.
python worker/run.py

Pull config → patch train.py → run 5 min → push result → repeat, forever.
This repo now supports a fast, low-overhead worker authentication model:
{
"registration": {
"requires_enroll_token": true,
"server_env": "META_ENROLL_TOKEN",
"worker_cli": "--enroll-token"
},
"after_registration": {
"server_issues": "per-worker token",
"worker_stores": ".worker_config.json",
"worker_sends": "X-Worker-Token header on protected endpoints"
}
}

Protected worker endpoints:
{
"protected": [
"/next_config/{worker_id}",
"/result",
"/sync/{worker_id}",
"/tick",
"/runs/start/{worker_id}"
],
"public_read_only": [
"/health",
"/leaderboard",
"/program.md",
"/runs/active",
"/runs/stats",
"/pipeline/status",
"/pipeline/spec",
"/pipeline/flush_token",
"/meta_log"
]
}

This keeps the system light and fast:
- no OAuth
- no external identity service
- one shared invite token for onboarding
- one per-worker token for ongoing authenticated access
For a small trusted team on one project, this is usually the right tradeoff.
First-time registration:
{
"request": {
"method": "POST",
"path": "/register",
"json": {
"worker_id": "alice-h100",
"gpu_type": "H100",
"baseline_bpb": 1.9234,
"contact": null,
"enroll_token": "TEAM_SHARED_INVITE"
}
},
"response": {
"ok": true,
"message": "Welcome alice-h100! You are worker #1.",
"current_program_md": "...",
"worker_token": "PRIVATE_SERVER_ISSUED_TOKEN"
}
}

After that, all protected worker calls include:
{
"headers": {
"X-Worker-Token": "PRIVATE_SERVER_ISSUED_TOKEN"
}
}

So the security model is:
{
"first_join": "shared enroll token",
"after_join": "private per-worker token",
"manual_token_entry_after_setup": false
}

| Method | Path | Description |
|---|---|---|
| POST | `/register` | Register a new worker |
| GET | `/next_config/{worker_id}` | Pull next config to run |
| POST | `/result` | Submit a completed experiment |
| POST | `/tick` | Compact heartbeat: `{"id","p","m","d"}` → `{"action":"stop"}` or `{}` |
| GET | `/sync/{worker_id}` | Get latest program.md + belief status |
| GET | `/runs/active` | All currently active runs |
| GET | `/runs/stats` | Kill rate, bucket pool percentiles, best delta |
| DELETE | `/runs/{run_id}` | Manually stop a run |
| GET | `/pipeline/status` | Speculative pipeline state + hit rate |
| GET | `/pipeline/spec` | Pre-fetch speculative next program.md |
| GET | `/pipeline/spec_payload` | Confidence-gated speculative payload (spec_id, program_md) |
| POST | `/pipeline/flush` | Manually flush speculative cache |
| GET | `/pipeline/flush_token` | Poll for flush signal |
| GET | `/dimension_proposals` | Organizer queue of LLM-proposed new dimensions (on stall) |
| DELETE | `/dimension_proposals` | Clear reviewed dimension proposals |
| GET | `/leaderboard` | Best delta_bpb per worker |
| GET | `/health` | Server health + queue depth |
| GET | `/program.md` | Latest program.md as plain text |
| GET | `/meta_log` | Full meta-hypothesis log as Markdown |
Interactive docs: http://META_IP:8000/docs
- Thompson Sampling + fANOVA: dimension-level search (every 60s)
- Hypothesis registry: Bayesian Beta-Binomial updates (every experiment)
- Population manager: per-hypothesis program.md + worker allocation (every 100 experiments)
- Speculative pipeline (optional): confidence-gated predeploy + flush recovery
- ASHA promotion: neighborhood exploitation (every 100 experiments)
- Meta-hypothesis log: timestamped belief history (every 100 experiments)
| Dimension | Type | Range |
|---|---|---|
| `DEPTH` | int | 4-24 |
| `learning_rate` | float (log) | 1e-4 to 3e-2 |
| `TOTAL_BATCH_SIZE` | categorical | 16k / 32k / 64k / 128k |
| `DEVICE_BATCH_SIZE` | int | 4-64 |
| `WINDOW_PATTERN` | categorical | L / SL / SSL / SSSL |
| `head_dim` | categorical | 64 / 128 |
| `weight_decay` | float (log) | 1e-4 to 1e-1 |
| `muon_lr` | float (log) | 1e-4 to 1e-2 |
After 1000+ experiments:
- Frozen dimensions — hypothesis falsification proves e.g. `WINDOW_PATTERN` is irrelevant with statistical evidence; locked and never sampled again
- Concentrated search — all workers sampling a 2-3 dimensional subspace
- Evolved program.md — the PI has synthesised findings into human-readable conclusions
- Population-specific programs — workers investigating different hypotheses get different instructions and config constraints
- Meta-hypothesis log — a complete research journal you could excerpt into a paper
- Best train.py — apply the top config delta for an optimised training file
- Choose the search space for your problem
- Edit `meta_server/schema.sql` to define the dimensions the meta-agent is allowed to search
- Edit `meta_server/hypotheses.py` `DEFAULT_HYPOTHESES` to match your domain
- Deploy the meta-agent server
- Share only the server URL and enroll token with workers
- Use your custom `train.py` (metric must be lower-is-better)
- Make sure `train.py` follows the worker contract in this README:
  - top-level tunable constants
  - `TOTAL_WALL_CLOCK_TIME`
  - `report(metric, progress)` calls
- Run `worker/setup_worker.py`
- Run `worker/run.py`
{
"organizer": {
"edits": [
"meta_server/schema.sql",
"meta_server/hypotheses.py"
],
"deploys_server": true,
"shares_with_workers": [
"meta_url",
"enroll_token"
]
},
"worker": {
"edits": [
"their own train.py"
],
"does_not_edit": [
"meta_server/schema.sql",
"meta_server/hypotheses.py"
]
}
}

MIT License