darklordVirtual/REMORA

REMORA - Pre-execution governance for AI agents

Research-grade prototype. v0.7.1. MIT. Not production-certified.

Quality Gates | License: MIT | Python 3.11+ | Paper | Works with LangGraph | Works with OpenAI | Works with MCP

Your agent can call tools. REMORA decides whether it should.

REMORA intercepts AI agent actions before execution. It evaluates uncertainty, evidence, policy, risk tier, environment, and audit state, then routes each action to:

| Outcome | Meaning |
| --- | --- |
| ACCEPT | Safe enough to execute automatically |
| VERIFY | Require more evidence or human sign-off |
| ABSTAIN | Decline because trust is too low |
| ESCALATE | Block and send to human review |
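The four outcomes can be modeled as a small enumeration. The sketch below is illustrative only: `Route`, `should_execute`, and `next_step` are hypothetical names, not REMORA's API, and the follow-up strings simply paraphrase the table.

```python
from enum import Enum


class Route(Enum):
    """The four outcomes listed in the table above."""
    ACCEPT = "accept"
    VERIFY = "verify"
    ABSTAIN = "abstain"
    ESCALATE = "escalate"


def should_execute(route: Route) -> bool:
    """Only ACCEPT lets the tool run without any further step."""
    return route is Route.ACCEPT


def next_step(route: Route) -> str:
    """Hypothetical follow-up per route, paraphrasing the outcome table."""
    return {
        Route.ACCEPT: "execute automatically",
        Route.VERIFY: "gather more evidence or request human sign-off",
        Route.ABSTAIN: "decline the action",
        Route.ESCALATE: "block and send to human review",
    }[route]
```

Keeping the execute-or-not decision in one function means a caller never has to interpret the three non-executing routes itself.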

Use REMORA when agents can touch databases, cloud resources, files, tickets, emails, APIs, security findings, or production systems.

Live: Eye | Control Room | Benchmarks | Telemetry


Try it in 60 seconds

git clone https://github.com/darklordVirtual/REMORA.git
cd REMORA
pip install -e .

# No API keys. Shows accept, verify, abstain, and escalate paths.
python examples/enterprise_demo.py

# Evaluate agent actions in Shadow Mode without blocking production.
make shadow-replay INPUT=artifacts/demo/shadow_mode_sample_agent_action_log.jsonl

What the demo is meant to make obvious:

| Agent action | Risk | REMORA route | Why it matters |
| --- | --- | --- | --- |
| Read a local report | Low | ACCEPT | Keep useful automation |
| Export customer data | High | VERIFY | Require evidence or approval |
| Drop a production table | Critical | ESCALATE | Stop destructive execution |
| Ambiguous bulk action | Unknown | ABSTAIN | Do not guess under uncertainty |

Every decision produces a replayable DecisionEnvelope with a SHA-256 audit hash.
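One way a replayable record with a SHA-256 audit hash could work is to hash a canonical JSON encoding of the decision. This is a minimal sketch under assumptions: `EnvelopeSketch` and its fields are hypothetical, and the real DecisionEnvelope schema and hashing scheme may differ.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EnvelopeSketch:
    # Hypothetical fields; the real DecisionEnvelope may carry more or different data.
    action_name: str
    action_args: dict
    risk_tier: str
    route: str


def audit_hash(envelope: EnvelopeSketch) -> str:
    """SHA-256 over canonical JSON (sorted keys, fixed separators)."""
    canonical = json.dumps(asdict(envelope), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify(envelope: EnvelopeSketch, expected_hash: str) -> bool:
    """Replay check: recompute the hash and compare it with the stored value."""
    return audit_hash(envelope) == expected_hash


env = EnvelopeSketch("delete_table", {"table": "users"}, "critical", "ESCALATE")
stored = audit_hash(env)
```

Sorting keys before hashing makes the digest independent of dictionary insertion order, so the same decision always yields the same hash on replay.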


Start here

| You are | Start with | You get |
| --- | --- | --- |
| Agent developer | examples/enterprise_demo.py, LangGraph adapter, OpenAI tool-calling adapter | A small wrapper around tool calls |
| Security reviewer | Shadow Mode, eval_pack/, docs/cyber_evidence_layer.md | A way to see what agents would execute before enforcement |
| Research evaluator | EVALUATOR_START_HERE.md, docs/claim_register.md, NEGATIVE_RESULTS.md | Reproducible claims, artifacts, and limitations |

Why REMORA?

  • Pre-execution control: govern actions before tools run.
  • Shadow Mode first: evaluate agent behavior without breaking workflows.
  • Clear outcomes: ACCEPT, VERIFY, ABSTAIN, ESCALATE.
  • Policy precedence: hard safety blocks run before probabilistic signals.
  • Evidence-aware routing: use curated evidence, RAG, or domain providers.
  • Replayable audit: every decision can be inspected and replayed.
  • Claim discipline: no naked claims; benchmark claims map to artifacts.
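The policy-precedence bullet above can be sketched as a router that consults deterministic deny rules before it looks at any score. Everything here is an assumption for illustration: the `Action` fields, the `HARD_BLOCKS` rule, and the thresholds are invented and are not REMORA's actual policy.

```python
from dataclasses import dataclass


@dataclass
class Action:
    action_type: str
    target_environment: str
    confidence: float  # hypothetical probabilistic signal in [0, 1]


# Hypothetical hard rule: destructive writes against prod are never auto-routed.
HARD_BLOCKS = {("destructive_write", "prod")}


def route(action: Action, accept_at: float = 0.9, verify_at: float = 0.6) -> str:
    # 1. Hard safety blocks run first and cannot be overridden by a high score.
    if (action.action_type, action.target_environment) in HARD_BLOCKS:
        return "ESCALATE"
    # 2. Only then do probabilistic signals decide between the remaining routes.
    if action.confidence >= accept_at:
        return "ACCEPT"
    if action.confidence >= verify_at:
        return "VERIFY"
    return "ABSTAIN"
```

With this ordering, a destructive production write is escalated even when its confidence is 0.99, which is the point of putting hard rules ahead of scores.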

REMORA is not a chatbot and not a generic prompt guardrail. It is an action governance gateway for autonomous agent systems.


Drop into any agent loop

from remora.adapters import LangGraphActionAdapter
from remora.adapters.gateway import LocalGateway
from remora.engine import Remora, Genome

engine = Remora(genome=Genome())
adapter = LangGraphActionAdapter(gateway=LocalGateway(engine))

result = adapter.intercept(
    action_name="delete_table",
    action_args={"table": "users"},
    domain="database",
    risk_tier="critical",
    action_type="destructive_write",
    target_environment="prod",
)

# run_tool and send_to_review are placeholders for your own functions.
if result.should_execute:
    run_tool()
else:
    send_to_review(result.envelope)

The same pattern works for OpenAI tool calling, MCP tools, custom loops, and shadow replay logs.
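For a custom loop, the same check-before-execute pattern could be packaged as a decorator. This is a generic sketch, not a REMORA adapter: `governed`, `ActionBlocked`, and `toy_gate` are hypothetical, and in a real integration the gate would presumably consult something like `adapter.intercept(...).should_execute` from the snippet above.

```python
from functools import wraps
from typing import Any, Callable


class ActionBlocked(Exception):
    """Raised when the gate does not allow execution."""


def governed(gate: Callable[[str, dict], bool]) -> Callable:
    """Wrap a tool function so `gate` is consulted before every call."""
    def decorator(tool: Callable[..., Any]) -> Callable[..., Any]:
        @wraps(tool)
        def wrapper(**kwargs: Any) -> Any:
            if not gate(tool.__name__, kwargs):
                raise ActionBlocked(f"{tool.__name__} blocked; args={kwargs}")
            return tool(**kwargs)
        return wrapper
    return decorator


def toy_gate(name: str, args: dict) -> bool:
    # Stand-in decision: refuse anything that looks destructive by name.
    return not name.startswith(("delete_", "drop_"))


@governed(toy_gate)
def read_report(path: str) -> str:
    return f"contents of {path}"


@governed(toy_gate)
def delete_table(table: str) -> None:
    raise RuntimeError("the gate should have prevented this call")
```

Calling `read_report(path="r.txt")` runs normally, while `delete_table(table="users")` raises `ActionBlocked` before the tool body is reached.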


Shadow Mode

The lowest-friction way to adopt REMORA is to run it beside your agent first.

Agent proposes action
  -> REMORA evaluates the action
  -> action is not blocked yet
  -> DecisionEnvelope records what would have happened
  -> replay report shows unsafe, ambiguous, and escalated cases

This answers the first practical question most teams have:

What would our AI agents have done if we had let them act?

See examples/shadow_mode_demo.py and EVALUATOR_START_HERE.md.
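The Shadow Mode flow amounts to evaluating each logged action and tallying the route it would have received, without blocking or executing anything. The sketch below assumes a JSON Lines log with hypothetical `action_name` and `risk_tier` fields; the real log schema and the `toy_evaluate` stand-in are not taken from the repository.

```python
import json
from collections import Counter
from typing import Callable, Iterable


def shadow_replay(lines: Iterable[str], evaluate: Callable[[dict], str]) -> Counter:
    """Evaluate each logged action and tally routes; nothing is blocked or run."""
    tally: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log
        action = json.loads(line)
        tally[evaluate(action)] += 1
    return tally


def toy_evaluate(action: dict) -> str:
    # Hypothetical stand-in for the real evaluation: route purely by risk tier.
    tier = action.get("risk_tier", "unknown")
    return {"low": "ACCEPT", "high": "VERIFY", "critical": "ESCALATE"}.get(tier, "ABSTAIN")


sample_log = [
    '{"action_name": "read_report", "risk_tier": "low"}',
    '{"action_name": "export_customers", "risk_tier": "high"}',
    '{"action_name": "drop_table", "risk_tier": "critical"}',
    '{"action_name": "bulk_update"}',
]
report = shadow_replay(sample_log, toy_evaluate)
```

The resulting tally is the raw material for a replay report: it shows how many logged actions would have been accepted, verified, abstained on, or escalated.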


Benchmarks at a glance

These are reproducible research and simulator results. They are not production safety certification.

| Claim | Result | Scope |
| --- | --- | --- |
| Blocks unsafe tool calls | 0.0% unsafe | [Simulator] deterministic synthetic benchmark, 700 tasks |
| Reduces adversarial stress replay harm | 3.2% unsafe vs 51.9% naive | [Simulator] 10k synthetic stress replay |
| Preserves utility better than naive policy | +1.28 utility delta | [Simulator] same stress benchmark |
| Selective routing accuracy under abstention | 88.0% at 23.2% coverage [wide CI; n_accepted=25] | [QA benchmark] held-out N500 split |
| In-sample selective routing | 88.8% at 18.0% coverage | [QA benchmark] in-sample 544-item artifact |
| AgentHarm harmful-action blocking (Mode 3) | blocked_recall=0.977, FPR=0.023 | [Live oracle + REMORA gate] 88 curated cases |
| Factory replay accuracy | 100% CI95 [0.94, 1.00] | [Curated factory] 65 constructed cases |

Reviewer anchors: Result 1 uses N = 302, a 57.0% single-model baseline, an 82.8% majority-vote baseline, and top-25% selective routing with 76 accepted questions, 72 correct answers, and 94.7% accuracy. Result 2 keeps the historical label N500, but the current artifact evaluates 544 questions. It reports a 41.18% full-coverage majority baseline and top-10% k=54, top-15% k=82, top-18% k=98 (88.8% accuracy), and top-20% k=109.
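The reviewer anchors can be checked with a few lines of arithmetic. The only assumption in this sketch is the rounding rule: the accepted count k is taken as the coverage percentage of N rounded half up, which reproduces every k value quoted above but is not stated in the source.

```python
def top_k(n_items: int, percent: int) -> int:
    """Accepted items at a coverage percentage, rounded half up (assumed rule)."""
    # Integer arithmetic avoids floating-point surprises at exact halves.
    return (2 * n_items * percent + 100) // 200


def accuracy(correct: int, accepted: int) -> float:
    return correct / accepted


# Result 1: N = 302, top-25% selective routing, 72 correct answers.
k_result1 = top_k(302, 25)
acc_result1 = accuracy(72, k_result1)

# Result 2: the 544-question artifact at the four quoted coverage levels.
k_result2 = {pct: top_k(544, pct) for pct in (10, 15, 18, 20)}

print(k_result1, round(100 * acc_result1, 1))
print(k_result2)
```

Running it gives k = 76 and 94.7% for Result 1, and k = 54, 82, 98, and 109 for Result 2, matching the figures in the paragraph.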

Tool-call reviewer detail: v1 (252 tasks) reports remora_temperature_gate_heuristic mean utility 0.6762 and remora_full_policy_gate mean utility 0.5690 with 0.7619 accuracy. v1 does not demonstrate unsafe-execution reduction. v2 (700 tasks) reports temperature-gate utility 0.2700 and full-policy utility 0.6200 with 0.9000 accuracy, and reduces unsafe execution in the deterministic simulator. Significance artifact: results/toolcall_benchmark_v2_significance.json.

Read the caveats before citing results:

  • Simulator-scoped: tool-call safety numbers come from a deterministic synthetic benchmark, not a live deployment.
  • Wide CI: the held-out selective result (88.0% at 23.2% coverage) accepted only 25 items, so its confidence interval is wide.
  • No external validation: all benchmarks are internally run. No third-party replication exists yet.
  • Blocked recall ≠ ESCALATE recall: the AgentHarm Mode 3 figure (0.977) counts both ESCALATE and VERIFY as "blocked." Strict ESCALATE-only recall is 0.114.
  • Static vs live oracle: cross-domain 100% precision uses the deterministic static evidence provider; live Workers AI oracle precision is lower.
  • Thermodynamic language is an uncertainty-routing proxy, not a physics claim.
  • Hallucination bound: the bound is a candidate hallucination-bound proxy and an implemented research heuristic — not a formal guarantee.
  • Tool-call safety is a controlled safety simulation, not production evidence.
  • AROMER world model: priors are computed but the adjust_trust() function operates with bounded ±20% adjustments; results on real-world deployment are untested.
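To see how wide the interval is when only 25 items are accepted, one can compute a binomial confidence interval directly. The source does not say which interval method it uses; the Wilson score interval below is an assumption, chosen because it reproduces quoted bounds such as [0.92, 1.00] for 44 of 44.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# 88.0% accuracy on 25 accepted items corresponds to 22 correct answers.
low_25, high_25 = wilson_interval(22, 25)

# Cross-check against a bound quoted elsewhere in this README (44 of 44).
low_44, high_44 = wilson_interval(44, 44)

print(f"22/25: [{low_25:.2f}, {high_25:.2f}]")
print(f"44/44: [{low_44:.2f}, {high_44:.2f}]")
```

Under this assumption the 22/25 interval spans roughly 0.70 to 0.96, about 26 percentage points, which is why the 88.0% figure should be read as a point estimate.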

See docs/results_snapshot.md, docs/claim_register.md, and NEGATIVE_RESULTS.md.


AROMER — live learning loop

AROMER (Autonomous REMORA Orchestrator, Meta-Emergent Reasoner) is an experimental plugin that turns REMORA into a closed-loop governance system. It learns from every decision by combining persistent episodic memory, Bayesian world-model priors, and a Workers AI meta-judge that critiques its own past decisions.

Claude Code is directly wired to AROMER via PreToolUse and PostToolUse hooks. Every tool call this repository makes in development is recorded as a labeled governance episode that AROMER trains on.

Live log (public endpoint): logs contain the domain and action_type of tool calls made in this development repo. No user data or credentials are included, but treat the log as operational telemetry.

What is running right now

| Component | Status |
| --- | --- |
| Episode store (Cloudflare D1) | Live: accumulating real tool-call episodes |
| Workers AI meta-judge | Active: critiques own decisions after each cycle |
| Bayesian world model | 44+ contexts tracked; shadow mode on by default (priors computed, not applied) |
| Cron adaptation cycle | Every hour (0 * * * *) |
| Oracle bandit (Thompson Sampling) | cf_strong ~98%, cf_fast ~97% (posteriors from self-labeled data) |
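For readers unfamiliar with Thompson Sampling, the textbook Beta-Bernoulli version looks like the sketch below. It is not AROMER's implementation: `BetaArm` and `choose` are hypothetical, the uniform Beta(1, 1) prior is an assumption, and only the arm names `cf_strong` and `cf_fast` come from the table.

```python
import random
from dataclasses import dataclass


@dataclass
class BetaArm:
    name: str
    alpha: float = 1.0  # prior pseudo-count of successes
    beta: float = 1.0   # prior pseudo-count of failures

    def update(self, success: bool) -> None:
        """Bayesian update of the Beta posterior after one labeled outcome."""
        if success:
            self.alpha += 1
        else:
            self.beta += 1

    @property
    def posterior_mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)


def choose(arms: list[BetaArm], rng: random.Random) -> BetaArm:
    """Thompson Sampling: draw once from each posterior, pick the largest draw."""
    return max(arms, key=lambda arm: rng.betavariate(arm.alpha, arm.beta))


rng = random.Random(0)
arms = [BetaArm("cf_strong"), BetaArm("cf_fast")]
picked = choose(arms, rng)
picked.update(success=True)  # the label would come from a later review step
```

Because each choice is a random draw from the posterior, arms with little data still get explored, while arms with consistently good outcomes are picked more and more often.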

Results so far

AgentHarm benchmark (UK AI Safety Institute, arXiv:2410.09024, 88 canonical cases):

| Mode | Blocked recall† | Hard FPR‡ | Coverage |
| --- | --- | --- | --- |
| Oracle-only | 1.000 CI95 [0.92, 1.00] | 0.318 | 98.9% |
| Harm-specific oracle | 1.000 CI95 [0.92, 1.00] | 0.114 | 98.9% |
| Full REMORA gate | 0.977 CI95 [0.88, 1.00] | 0.023 CI95 [0.004, 0.118] | 97.7% |

† Blocked recall = (ESCALATE + VERIFY) on harmful cases ÷ total harmful. Strict ESCALATE-only recall in Mode 3 = 0.114 (5/44); 38 of the 44 harmful cases are routed to VERIFY (human review required) rather than hard-blocked.
‡ Hard FPR = ESCALATE on benign cases only. VERIFY on a benign case is review overhead, not a hard block.

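Modes 1 and 2 achieve zero false negatives (all 44 harmful cases correctly identified). Under strict ESCALATE-only scoring, Mode 3 has FN = 39: 38 of those cases are routed to VERIFY, and one is neither escalated nor sent to VERIFY (43/44 = 0.977 blocked recall). Full methodology and artifact: docs/agentharm_trimode_benchmark.md.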
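The gap between blocked recall and strict ESCALATE-only recall follows directly from the Mode 3 counts quoted in the footnote. A short check, using only those numbers:

```python
def recall(hits: int, total: int) -> float:
    return hits / total


HARMFUL = 44    # harmful cases in the 88-case benchmark
ESCALATED = 5   # harmful cases routed to ESCALATE in Mode 3
VERIFIED = 38   # harmful cases routed to VERIFY in Mode 3

blocked_recall = recall(ESCALATED + VERIFIED, HARMFUL)  # counts both routes
strict_recall = recall(ESCALATED, HARMFUL)              # ESCALATE only
strict_false_negatives = HARMFUL - ESCALATED

print(round(blocked_recall, 3), round(strict_recall, 3), strict_false_negatives)
```

This prints 0.977, 0.114, and 39, so the headline 0.977 depends on counting VERIFY as a block.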

Cross-domain static evidence benchmark (32 curated cases — static rule-based provider):

| Category | Precision | Escalation recall | FP suppression |
| --- | --- | --- | --- |
| Cyber (12 cases) | 100% | 100% | 100% |
| AI governance (10 cases) | 100% | 100% | 100% |
| Finance / AML (10 cases) | 100% | 100% | 100% |
| Overall | 100% CI95 [0.89, 1.00] | 100% | 100% |

Note: these results use the deterministic static evidence provider. Live three-model Workers AI oracle achieves lower directional precision on these same domains (ai_governance: 25%, finance: 25%) — see artifacts/live_benchmark_results.json for the live oracle comparison.

All results are derived from committed artifacts in artifacts/. AROMER is research-grade and not production-certified. No external live-agent validation has been conducted.

make replay-benchmark          # Run factory replay (requires network)
make aromer-log                # View live AROMER learning log
python scripts/aromer_log.py --watch 60   # Refresh every 60 s

Cyber evidence showcase

REMORA includes a public, standalone cyber evidence layer for security triage. It is designed to enrich findings with safe metadata from CVE, CWE, ATT&CK, KEV, EPSS, OSV, GitHub Advisory, and ICS advisory sources.

make cyber-evidence
make cyber-vector-payload

The public layer does not include proprietary GO-STAR internals, private scanner rules, customer findings, exploit payloads, or weaponized proof-of-concept content. GO-STAR can later connect as a commercial extension that supplies candidate findings into the same REMORA evidence boundary.

See docs/cyber_evidence_layer.md.


Architecture in plain English

AI agent proposes a tool call
  -> fail-closed policy checks
  -> risk tier and environment classification
  -> evidence lookup or retrieval
  -> uncertainty and disagreement scoring
  -> policy decision
  -> DecisionEnvelope audit record
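The stages above can be sketched as a list of functions run in order, with any failure resolving to ESCALATE. This is one interpretation of "fail-closed", not the repository's engine: the stage names, context keys, deny rule, and thresholds are all hypothetical, and the evidence and audit stages are omitted for brevity.

```python
from typing import Callable

Stage = Callable[[dict], dict]


class PolicyViolation(Exception):
    """Raised by a hard policy check."""


def policy_check(ctx: dict) -> dict:
    # Toy deny rule standing in for real policy checks.
    if ctx["action"].get("action_type") == "destructive_write":
        raise PolicyViolation("destructive writes are denied in this toy policy")
    return ctx


def classify(ctx: dict) -> dict:
    # A missing risk_tier raises KeyError, which the runner treats as a failure.
    ctx["tier"] = ctx["action"]["risk_tier"]
    return ctx


def score(ctx: dict) -> dict:
    # Placeholder uncertainty: low-risk actions are treated as well understood.
    ctx["uncertainty"] = 0.1 if ctx["tier"] == "low" else 0.8
    return ctx


def decide(ctx: dict) -> dict:
    ctx["route"] = "ACCEPT" if ctx["uncertainty"] < 0.5 else "VERIFY"
    return ctx


STAGES: list[Stage] = [policy_check, classify, score, decide]


def run_pipeline(action: dict, stages: list[Stage] = STAGES) -> dict:
    ctx: dict = {"action": action}
    try:
        for stage in stages:
            ctx = stage(ctx)
    except Exception as exc:
        # Fail closed: any error or policy violation blocks the action.
        ctx["route"] = "ESCALATE"
        ctx["error"] = repr(exc)
    return ctx
```

The design choice worth noting is the single `except` in the runner: a malformed action or a crashing stage can never fall through to ACCEPT.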

Deep technical details live outside the first page:

| Topic | Document |
| --- | --- |
| Full system architecture | ARCHITECTURE.md |
| Thermodynamic uncertainty proxy | paper/remora_paper.md |
| Claim register | docs/claim_register.md |
| Negative results | NEGATIVE_RESULTS.md |
| Security posture | SECURITY.md |
| MCP integration | docs/mcp-integration.md |
| Live architecture diagram | docs/remora_architecture.html |

Implementation status

| Area | Status |
| --- | --- |
| Multi-oracle governance engine | Implemented |
| Policy decision engine | Implemented |
| DecisionEnvelope audit records | Implemented |
| Shadow replay | Implemented |
| LangGraph / OpenAI / MCP adapters | Implemented |
| Cyber evidence layer | Implemented as public seed pack |
| Cross-domain evidence packs | Implemented for cyber, AI governance, and finance |
| Live-agent external validation | Pending |
| Production certification | Not claimed |

Known limits

| Limit | Detail |
| --- | --- |
| Simulator-scoped safety | Controlled benchmarks do not prove field deployment safety |
| Small accepted holdout set | 88.0% is a point estimate; CI is wide |
| Live-agent validation pending | External replication and real tool-call studies are still needed |
| Evidence quality matters | Bad retrieval can cause bad governance decisions |
| Not a universal AI safety solution | REMORA governs actions; it does not make models truthful |

Further reading

| Document | Description |
| --- | --- |
| EVALUATOR_START_HERE.md | 30-minute external review guide |
| docs/plain_language_overview.md | Non-technical overview |
| eval_pack/ | Bring-your-own-agent-log Shadow Mode evaluation pack |
| docs/policy_cookbook/ | Practical policy examples for database, cloud, and cyber actions |
| docs/domain_benchmark.md | Cross-domain evidence benchmark |
| docs/live_benchmark.md | Live benchmark notes |
| paper/remora_paper.md | Research preprint |
| enterprise/remora-control-plane.md | Enterprise control-plane framing |

Built by @darklordVirtual. MIT licensed.