Ξ epistemic
The open-source research-discipline coding agent.
epistemic gives your coding agent the norms of good ML research. Instead of running experiments, eyeballing a number, and moving on, it enforces a real method: pre-register a hypothesis, reproduce the baseline, run the experiment, attack your own claim, then decide to ship or kill — with an interactive monitor and gates that make the rules automatic.
The skills are the portable manual the agent follows step by step. The harnesses inject that manual into Claude Code, Codex, or the epistemic TUI. The gates are the safety net that enforces it where the harness supports runtime hooks.
Give your agent epistemic: Claude Code, Codex CLI, Codex App, epistemic TUI.
It starts from the moment your agent picks up an empirical task. Instead of jumping straight to running code, it steps back and asks what you're really trying to prove.
Once it has a rough claim, it asks one question at a time — Socratic-style — until the hypothesis is falsifiable, the falsifier is concrete, and the budget is realistic. Before locking in, it generates 2–3 competing explanations with unique disconfirming predictions so you pick the strongest one.
After you sign off, the agent locks the hypothesis in a pre-registration file before touching any experiment code. The prereg gate then blocks any experiment-shaped command that has no matching prereg.md — so there's no way to accidentally run something unregistered.
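As a rough illustration of the prereg gate described above, the sketch below decides whether a command should be blocked. It is not the epistemic implementation: the `preregGate` function, the `EXPERIMENT_PATTERNS` heuristics for what counts as "experiment-shaped", and the `H-003`-style id extraction are all assumptions made for this example. The set `registeredIds` stands in for "hypotheses that already have a `prereg.md`".

```typescript
// Hypothetical sketch: none of these names come from the epistemic codebase.

/** Verdict a gate hands back to the harness. */
type GateVerdict = { block: false } | { block: true; reason: string };

// ASSUMPTION: "experiment-shaped" is approximated by a few command prefixes.
const EXPERIMENT_PATTERNS: RegExp[] = [
  /^python\s+\S*train\S*\.py\b/,
  /^python\s+\S*eval\S*\.py\b/,
  /^modal\s+run\b/,
];

/** Pulls a hypothesis id such as "H-003" out of a command line, if present. */
function extractHypothesisId(command: string): string | null {
  const match = /\bH-\d{3}\b/.exec(command);
  return match ? match[0] : null;
}

/**
 * Blocks experiment-shaped commands unless their hypothesis id is registered.
 * `registeredIds` models the ids that already have a prereg.md on disk.
 */
function preregGate(command: string, registeredIds: ReadonlySet<string>): GateVerdict {
  const trimmed = command.trim();
  const looksLikeExperiment = EXPERIMENT_PATTERNS.some((p) => p.test(trimmed));
  if (!looksLikeExperiment) return { block: false };

  const id = extractHypothesisId(trimmed);
  if (id !== null && registeredIds.has(id)) return { block: false };

  return {
    block: true,
    reason: id === null ? "no hypothesis id in command" : `no prereg.md found for ${id}`,
  };
}
```

Commands that do not look like experiments pass through untouched, so the gate only gets in the way when something unregistered is about to run.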
From there it reproduces the competitor's baseline under your locked judge, runs the full experiment, applies proper statistics, and sends the claim to adversary models that each try to disprove it. If any adversary succeeds, the result is blocked. If all pass, it lands in RESULTS.md and you decide: ship, kill, pivot, or refine.
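The "any adversary succeeds, the result is blocked" rule can be sketched as a small pure function. The `AdversaryVerdict` and `ReviewOutcome` shapes and the `falsificationOutcome` name are hypothetical; only the rule itself (at least two adversaries, one successful falsification blocks) is taken from this README.

```typescript
// Hypothetical shapes for illustration; the real verdict format is not documented here.
type AdversaryVerdict = { model: string; falsified: boolean; note: string };

type ReviewOutcome =
  | { status: "blocked"; falsifiedBy: string[] }
  | { status: "passed" };

/** A claim passes only if no adversary falsified it. */
function falsificationOutcome(verdicts: AdversaryVerdict[]): ReviewOutcome {
  if (verdicts.length < 2) {
    throw new Error("need at least 2 adversary verdicts");
  }
  // Collect every adversary that managed to falsify the claim.
  const falsifiedBy = verdicts.filter((v) => v.falsified).map((v) => v.model);
  return falsifiedBy.length > 0
    ? { status: "blocked", falsifiedBy }
    : { status: "passed" };
}
```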
Because the skills trigger automatically, you don't need to orchestrate anything. Your coding agent just has epistemic.
$ epistemic "does LoRA at rank 8 outperform rank 4 on math benchmarks"
→ Opens research-question skill, refines to a falsifiable hypothesis
$ epistemic "run the registered experiment H-003"
→ Checks prereg gate, routes to the correct compute target, logs costs
$ epistemic monitor
→ Full-screen experiment tree: running, shipped, killed, pending
$ epistemic fleet
→ Launches a parallel agent fleet across all pending hypotheses
$ /skill:falsification-review
→ Sends the current claim to ≥2 adversary models; blocks if any falsify it
$ /skill:kill-or-ship
→ Decision gate: KILL / PIVOT / REFINE / RECOMMIT / SHIP
Ask naturally or use slash commands as shortcuts.
| Command | What it does |
|---|---|
| `/skill:research-question` | Refine a rough idea into a falsifiable, pre-registerable hypothesis |
| `/skill:preregistration` | Lock hypothesis, judge config, and compute scaffold before running |
| `/skill:baseline-reproduction` | Reproduce the competitor's result under your locked judge |
| `/skill:experiment-execution` | Run with discipline: locked env, full sample, cost logging |
| `/skill:statistical-rigor` | Effect sizes, test selection, multiple-comparison correction, APA reporting |
| `/skill:falsification-review` | Adversary models try to disprove the claim; blocks if any succeed |
| `/skill:surprise-triage` | Diagnose results that diverge >15% before they reach RESULTS.md |
| `/skill:kill-or-ship` | Final decision gate with five outcomes |
| `/skill:verification-before-publication` | Full pre-publish checklist |
curl -fsSL https://raw.githubusercontent.com/moralespanitz/epistemic/master/install.sh | sh
To pin a version:
curl -fsSL https://raw.githubusercontent.com/moralespanitz/epistemic/master/install.sh | sh -s -- --ref v1.0.0
Installs to ~/.epistemic with a symlink in ~/.local/bin/epistemic. Requires Node.js v18+.
Default model: openrouter/deepseek/deepseek-v4-pro. If you have OpenAI Codex authenticated, it uses that instead. If nothing is authenticated, pi prompts /login.
Installs just the research methodology skills — no TUI, no runtime dependency.
curl -fsSL https://raw.githubusercontent.com/moralespanitz/epistemic/master/install-skills.sh | sh
With the optional Hugging Face skills:
curl -fsSL https://raw.githubusercontent.com/moralespanitz/epistemic/master/install-skills.sh | sh -s -- --hf
Skills land in ~/.claude/skills/ and are active in Claude Code immediately.
/plugin install epistemic@claude-plugins-official
Or via the epistemic marketplace:
/plugin marketplace add moralespanitz/epistemic
/plugin install epistemic-skills@epistemic
Hooks (prereg gate + session bootstrap) install automatically via the marketplace, or manually:
epistemic hooks install # add hooks to ~/.claude/settings.json
epistemic hooks status # check what's active
epistemic hooks on | off # toggle without restart
- Open the plugin search interface:
/plugins
- Search for `epistemic` and select Install Plugin.
- In the Codex app, click Plugins in the sidebar.
- Find `epistemic` in the Research section and click `+`.
- research-question: Socratic refinement from rough idea to falsifiable hypothesis. Generates 2–3 competing alternatives with unique disconfirming predictions; archives the ones you don't pick.
- preregistration: Validates all fields, writes `experiments/{id}/prereg.md`, hashes the judge config → `judge.lock`, generates the compute scaffold (`Dockerfile` / `modal-app.py` / `environment.lock`), and commits. Locks in the rules before code runs.
- baseline-reproduction: Reads the competitor's paper, validates HuggingFace datasets, reproduces the result under your locked judge. Baselines older than 30 days must be refreshed.
- experiment-execution: Confirms prereg + locks match, routes by compute target (`local` / `docker` / `modal`), logs every API/compute cost to `.epistemic/cost-ledger.jsonl`, writes results to `experiments/{id}/smokes/` (provisional only).
- statistical-rigor: Assumption checking → test selection → effect sizes (Cohen's d, η², R²) alongside p-values → multiple-comparison correction → APA reporting with exact p-values.
- falsification-review: Sends the claim to ≥2 adversary models. Each returns the cheapest experiment that would disprove it. If that experiment is <$1 and unrun, the agent insists on running it first.
- surprise-triage: Triggered when results diverge >15%. Produces ranked explanations (sampling, judge mismatch, data leakage, ceiling effects, prompt drift, version change, bugs) and the cheapest disambiguating test for each.
- kill-or-ship: Decision gate. Sunk-cost rule: killed hypotheses can't be silently revived. Expected kill-to-ship ratio ~5:1; killing fast is the point.
- verification-before-publication: Full pre-publish checklist: locks present and matching, baselines fresh, falsifier verdicts evaluated, cost ledger current, stats done, alternatives documented.
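Two of the statistical-rigor steps listed above, effect sizes and multiple-comparison correction, are easy to show in a few lines. This is a minimal sketch, not the skill's actual procedure: it computes Cohen's d with a pooled standard deviation and applies a Bonferroni correction, which is one simple correction chosen here for illustration (the README does not say which correction the skill uses). All function names are made up for this example.

```typescript
/** Arithmetic mean of a non-empty sample. */
function mean(xs: number[]): number {
  return xs.reduce((acc, x) => acc + x, 0) / xs.length;
}

/** Unbiased sample variance (divides by n - 1). */
function sampleVariance(xs: number[]): number {
  const m = mean(xs);
  return xs.reduce((acc, x) => acc + (x - m) * (x - m), 0) / (xs.length - 1);
}

/** Cohen's d for two independent groups, using the pooled standard deviation. */
function cohensD(a: number[], b: number[]): number {
  if (a.length < 2 || b.length < 2) {
    throw new Error("each group needs at least 2 observations");
  }
  const pooledVariance =
    ((a.length - 1) * sampleVariance(a) + (b.length - 1) * sampleVariance(b)) /
    (a.length + b.length - 2);
  return (mean(a) - mean(b)) / Math.sqrt(pooledVariance);
}

/** Bonferroni correction: multiply each p-value by the number of tests, capped at 1. */
function bonferroni(pValues: number[]): number[] {
  const m = pValues.length;
  return pValues.map((p) => Math.min(1, p * m));
}
```

Reporting the effect size next to the corrected p-value keeps a tiny but "significant" difference from being mistaken for a meaningful one.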
kill-or-ship outcomes
| Option | When | Effect |
|---|---|---|
| KILL | Spend > 1.5× cap, or >21 days stale | Write KILLED.md, record a lesson |
| PIVOT | Failed but suggests a new direction | Kill old, open a new hypothesis |
| REFINE | Same claim, adjusted method | Re-run from execution (needs override) |
| RECOMMIT | Continue past kill criteria | New cap + override |
| SHIP | All gates pass, falsification clean | Tag and publish |
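The KILL row in the table above amounts to two threshold checks. The sketch below assumes "stale" means days since the last recorded activity, which this README does not define; `HypothesisState` and `killReasons` are hypothetical names.

```typescript
// ASSUMPTION: staleness is measured from the last recorded activity on the hypothesis.
interface HypothesisState {
  spentUsd: number;
  capUsd: number;
  lastActivity: Date;
}

const KILL_SPEND_MULTIPLIER = 1.5; // spend > 1.5x cap triggers KILL
const STALE_AFTER_DAYS = 21; // more than 21 days idle triggers KILL
const MS_PER_DAY = 24 * 60 * 60 * 1000;

/** Returns every kill criterion the hypothesis currently violates (empty if none). */
function killReasons(state: HypothesisState, now: Date): string[] {
  const reasons: string[] = [];
  if (state.spentUsd > KILL_SPEND_MULTIPLIER * state.capUsd) {
    reasons.push("spend exceeds 1.5x cap");
  }
  const idleDays = (now.getTime() - state.lastActivity.getTime()) / MS_PER_DAY;
  if (idleDays > STALE_AFTER_DAYS) {
    reasons.push("stale for more than 21 days");
  }
  return reasons;
}
```

Both comparisons are strict, so a hypothesis sitting exactly at 1.5× its cap or exactly 21 days idle is not yet flagged.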
- using-epistemic: Bootstrap skill injected at `SessionStart` in research repos (has `HYPOTHESES.md` / `experiments/`). Tells the agent to load the correct stage skill before any empirical work.
- epistemic: Umbrella skill, the shared method contract loaded by all harnesses.
Four bundled research agents, dispatched automatically.
- Researcher — gather evidence across papers, HuggingFace, web, repos, and docs
- Adversary — simulated peer review with the cheapest disconfirming experiment per claim
- Statistician — assumption checking, test selection, effect sizes, APA reporting
- Verifier — pre-publish checklist, lock matching, baseline freshness, cost ledger audit
epistemic monitor (or /monitor in the TUI) opens a full-screen interactive view of every experiment:
Ξ epistemic · mission control [████░░ 16%] $34/$210 2 running · 1 shipped · 1 killed
● ✓ H-001 LoRA fine-tuning…
├─▶ ▶ H-004 Scaling LoRA to 7B…
│ ◇ if acc ≥ 0.80 → ship / H-006 pivot
└─▶ ☓ H-002 High learning rate…
| Key | Action |
|---|---|
| ↑ / ↓ | select an experiment |
| → / ← | open detail / back to tree |
| enter | actions: chat / approve / reject / modify |
| q | back to the chat |
If the skills are well written, the gates never fire — the agent follows the manual. The gates are the safety net.
| Gate | Blocks |
|---|---|
| Prereg | running experiments before pre-registration |
| Judge / Environment lock | changing the judge or environment mid-run |
| Smoke | quoting provisional numbers as results |
| Cost ledger | (transparent — logs every call with cost + category) |
| Claim intercept | comparing to an unreproduced baseline |
| Kill criteria | overrunning 1.5× the cost cap, or going stale |
| Baseline staleness | comparing to a >30-day-old baseline |
Overrides go in OVERRIDES.md with a mandatory reason.
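To make the Judge / Environment lock row concrete, here is one way a lock check could work: hash a canonical serialization of the judge config, store the hash, and compare it again before each run. This is a sketch under assumptions. The README says the judge config is hashed into `judge.lock` but does not specify the algorithm or file format, so SHA-256 over key-sorted JSON, and the names `canonicalJson`, `judgeLockHash`, and `lockMatches`, are choices made for this example. It uses Node's built-in `node:crypto` module.

```typescript
import { createHash } from "node:crypto";

/** Serializes a value to JSON with object keys sorted, so key order cannot change the hash. */
function canonicalJson(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map((item) => canonicalJson(item)).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const record = value as Record<string, unknown>;
    const body = Object.keys(record)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${canonicalJson(record[key])}`)
      .join(",");
    return `{${body}}`;
  }
  return JSON.stringify(value);
}

/** ASSUMPTION: the lock is a SHA-256 hex digest of the canonical config. */
function judgeLockHash(config: unknown): string {
  return createHash("sha256").update(canonicalJson(config)).digest("hex");
}

/** True when the current config still matches the hash recorded at lock time. */
function lockMatches(config: unknown, lockedHash: string): boolean {
  return judgeLockHash(config) === lockedHash;
}
```

Sorting keys before hashing means a cosmetic reordering of the config does not trip the gate, while any change to a value does.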
Internal tools the agent can call during research workflows.
- Docker — isolated container execution for reproducible experiment environments
- Modal — serverless GPU compute for burst training and inference
- Web search — evidence gathering across papers, docs, and repos
Hugging Face (optional, requires hf auth login)
| Tool | What it does |
|---|---|
| `huggingface-papers` | Read any arXiv paper as markdown; structured metadata (authors, linked models, citations) |
| `hf-cli` | Download/upload models & datasets, manage repos, run HF Jobs |
| `huggingface-datasets` | Paginate rows, full-text search, filter predicates, get Parquet URLs |
| `huggingface-community-evals` | Run evals locally with inspect-ai or lighteval |
| `huggingface-trackio` | Log metrics + alerts during training, sync real-time dashboard to HF Space |
| `huggingface-llm-trainer` | Fine-tune with TRL (SFT/DPO/GRPO) on HF Jobs cloud GPUs |
@epistemic/omp exposes a typed plugin API to extend epistemic with your own commands, event handlers, and gates:
import type { EpistemicPlugin } from "@epistemic/omp";

export const myPlugin: EpistemicPlugin = (api) => {
  api.registerCommand("my-cmd", {
    description: "My custom command",
    handler: async (args, ctx) => {
      ctx.ui.notify(`Running my-cmd with: ${args}`, "info");
    },
  });

  api.on("session_start", async (_event, ctx) => {
    ctx.ui.setStatus?.("my-plugin", "● active");
  });

  api.gate(async (event, ctx) => {
    // Return { block: true, reason: "..." } to interrupt
  });
};

| Method | What it does |
|---|---|
| `registerCommand(name, opts)` | Registers a `/name` slash command in the agent chat |
| `on(event, handler)` | Subscribes to `session_start`, `session_shutdown`, `before_agent_start`, or `tool_call` |
| `gate(handler)` | Registers a `tool_call` gate; return `{ block, reason }` to interrupt |
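For a feel of what a custom gate body might look like, the sketch below writes one gate handler and a toy dispatcher that runs gates in order and stops at the first block. The types are local stand-ins so the example runs without `@epistemic/omp` installed; the `toolName` and `input.command` fields, the `"bash"` tool name, and the `runGates` dispatcher are assumptions and are not taken from the real plugin API. Only the `{ block, reason }` return shape comes from this README.

```typescript
// Local stand-ins; ASSUMPTION: the real event type in @epistemic/omp may differ.
type ToolCallEvent = { toolName: string; input: { command?: string } };
type GateResult = { block: true; reason: string } | undefined;
type GateHandler = (event: ToolCallEvent) => Promise<GateResult>;

/** Hypothetical gate: refuse shell commands that delete a judge.lock file. */
const protectJudgeLock: GateHandler = async (event) => {
  const command = event.input.command ?? "";
  if (event.toolName === "bash" && /\brm\b.*\bjudge\.lock\b/.test(command)) {
    return { block: true, reason: "judge.lock must not be deleted mid-run" };
  }
  return undefined; // no objection, let the tool call proceed
};

/** Toy dispatcher (not the real runtime): the first gate that blocks wins. */
async function runGates(gates: GateHandler[], event: ToolCallEvent): Promise<GateResult> {
  for (const gate of gates) {
    const result = await gate(event);
    if (result !== undefined && result.block) return result;
  }
  return undefined;
}
```

In a real plugin the handler body would be passed to `api.gate(...)` as in the example above, with the event fields adjusted to whatever the runtime actually provides.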
Built on Pi for the agent runtime, with the epistemic skill core as the shared method contract across all harnesses. Runtime resources follow Pi's package, extension, and skill model.
| Layer | What it does |
|---|---|
| Portable skill core | using-epistemic, epistemic, and stage skills — the shared method across harnesses |
| Harness bootstrap | Claude SessionStart and the Codex manifest inject the skill core at the right time |
| Runtime gates | Invisible enforcement that blocks rule violations automatically |
| Monitor | Navigate the experiment tree, drill into a hypothesis, approve / reject / modify |
| Fleet | Parallel agent fleet runner for multi-experiment orchestration |
| HF stack | 6 HF skills (papers, datasets, evals, training, tracking, CLI) auto-discovered |
| State | File-based ledger: HYPOTHESES.md, .epistemic/cost-ledger.jsonl, experiments/{id}/ |
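Since the state layer is file-based, the cost ledger can be pictured as one JSON object per line. The sketch below serializes, parses, and totals such entries. The field names in `LedgerEntry` are illustrative assumptions; this README only says that each call is logged with a cost and a category, and does not document the actual schema of `.epistemic/cost-ledger.jsonl`.

```typescript
// ASSUMPTION: illustrative fields; the real ledger schema is not documented here.
interface LedgerEntry {
  timestamp: string; // ISO 8601
  hypothesisId: string;
  category: string;
  costUsd: number;
}

/** Serializes entries as JSON Lines: one JSON object per line, newline-terminated. */
function toJsonl(entries: LedgerEntry[]): string {
  return entries.map((entry) => JSON.stringify(entry)).join("\n") + "\n";
}

/** Parses JSON Lines text back into entries, skipping blank lines. */
function parseJsonl(text: string): LedgerEntry[] {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as LedgerEntry);
}

/** Sums cost per category, e.g. to compare spend against a cap. */
function totalByCategory(entries: LedgerEntry[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const entry of entries) {
    totals.set(entry.category, (totals.get(entry.category) ?? 0) + entry.costUsd);
  }
  return totals;
}
```

An append-only line format like this keeps the ledger easy to audit with ordinary text tools and safe to extend one call at a time.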
| Path | What it is |
|---|---|
| `packages/omp/` | `@epistemic/omp`: forked oh-my-pi TUI shell (Amber Lab theme, ResearchSidebar) |
| `src/` | Extension: gates, commands, monitor, board |
| `src/tui/render3d.ts` | Shared software 3D renderer (Z-buffer, Lambert shading, amber glow) |
| `src/cli/intro.ts` | Startup animation (3D spin → name reveal) |
| `skills/` | Research methodology skills + 6 HF skills |
| `hooks/` | Claude Code hooks (SessionStart, prereg gate) |
| `.pi/extensions/welcome-header/` | Persistent 3D Ξ header in the TUI |
| `.pi/settings.json` | Project settings (Codex default, theme, extensions) |
| `.claude-plugin/` | Claude Code plugin manifest and marketplace metadata |
| `.codex-plugin/` | Codex plugin manifest for the portable skill core |
| `themes/epistemic.json` | Amber-on-transparent dark theme |
| `tests/claude-code/` | Headless Claude Code harness tests for skill triggering |
- Pre-registration — lock the hypothesis before running a single line
- Reproducibility — you can't beat what you can't run
- Falsification over confirmation — guilty until proven defensible
- Evidence over claims — verify before publishing
- Fork the repository
- Create a branch for your work
- Follow the `writing-skills` skill for creating and testing new and modified skills
- Submit a PR
Every skill must work across all supported harnesses (Claude Code, Codex, TUI).
- GUIDE.md: prompt-driven walkthrough and use cases
- TESTING.md: the agent-driven TUI test suite (`npm run verify`)
- Issues: https://github.com/moralespanitz/epistemic/issues