RPL Judgement

ankurCES edited this page Jun 8, 2026 · 2 revisions

Raskolnikov's Psychological Loop (RPL-Judgement)

RPL-Judgement is blumi's adversarial, regret-minimizing pre-execution reasoning loop for AI coding agents. Before a risky tool call touches your real files, repo, or network, blumi maps the action's blast radius, puts the plan on trial before an adversarial LLM-as-a-judge ("Porfiry"), and executes only the plan that survives cross-examination. It then writes the predicted-vs-actual "regret" back into long-term memory so the agent learns from consequences. A standard agent maximizes success; an RPL agent minimizes regret.

Inspired by Fyodor Dostoevsky's Crime and Punishment — the agent doesn't just plan; it simulates the crime, prosecutes itself, and updates its logic before ever touching a live system.

RPL is off by default (opt-in via rpl.enabled) — it trades extra LLM calls for far fewer catastrophic, irreversible actions, so you turn it on for high-stakes, autonomous, or unattended work.

  • TL;DR: blast-radius risk model → adversarial judge gate → typed actuation → regret-to-memory.
  • Where it lives: crates/blumi-core/src/rpl.rs (pure core) + crates/blumi-core/src/agent.rs (the live loop) + RplConfig in crates/blumi-config.
  • Related: Self-Management · Memory & Knowledge · Configuration.

What is RPL-Judgement?

Most agent frameworks let the model plan and then immediately execute — a single chain of thought followed by a tool call. The failure mode is well known: the model, biased toward task completion, rationalizes a risky step ("just git push --force, it'll be fine") and the side effect is already done by the time anyone notices.

RPL-Judgement inserts a deliberation stage between deciding and doing. It replaces "plan → act" with "plan → map risk → prosecute → (re-plan or) act → confess." It applies three ideas:

  1. A cheap, deterministic risk prior (the blast radius) decides whether the expensive review is even worth running — so safe, read-only work pays nothing.
  2. An adversarial LLM-as-a-judge — a separate model invocation whose only instruction is to find the flaw — gates execution. Adversarial review beats self-critique precisely because the judge has no stake in finishing the task.
  3. Outcome-based credit assignment — the gap between the predicted risk and what actually happened (the Error Delta, i.e. "regret") is persisted to memory and feeds value-based memory fitness, so the system gets calibrated over time.

The science of the judgement

1. A structured risk model (the blast radius)

Risk isn't a vibe — it's computed from the capabilities a tool batch requests (blumi-core/src/rpl.rs::BlastRadius::assess). Each finalized tool call's required_capabilities are flattened and bucketed: files written, shell commands, network egress, VCS mutations, sub-agent spawns, and whether any command matches blumi's destructive-command heuristic (rm -rf, sudo, git reset --hard, git push --force, mkfs, dd if=, … — shared with the permission engine).

From that, a 0–100 severity is derived (BlastRadius::severity):

| Factor | Contribution |
|---|---|
| A destructive command present | +60 |
| Shell commands | +20 each (capped at 2) |
| VCS mutations (commit/reset/push) | +25 each (capped at 2) |
| Network egress to a host | +15 each (capped at 2) |
| File writes | +10 each (capped at 3) |
| Irreversible (destructive / network / VCS) | +15 |

A batch is reversible when it has no destructive command, no network egress, and no VCS mutation (file writes are reversible via the /undo change journal). Only a mutating batch whose severity clears blast_threshold (default 40) triggers the full loop — everything else takes the cheap path and executes normally.
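The scoring rules above can be sketched in a few lines of Rust. This is a minimal illustration, not the code in rpl.rs: the struct and field names are hypothetical, and clamping the sum to 100 is an assumption made so the result stays on the documented 0–100 scale.

```rust
/// Hypothetical mirror of the buckets described above. Field names are
/// illustrative and are not the actual `BlastRadius` definition in rpl.rs.
#[derive(Debug, Default, Clone)]
pub struct BlastRadiusSketch {
    pub destructive: bool,
    pub shell_commands: u32,
    pub vcs_mutations: u32,
    pub network_hosts: u32,
    pub file_writes: u32,
}

impl BlastRadiusSketch {
    /// Reversible = no destructive command, no network egress, no VCS mutation.
    pub fn is_reversible(&self) -> bool {
        !self.destructive && self.network_hosts == 0 && self.vcs_mutations == 0
    }

    /// Sum the contributions from the table. The final clamp to 100 is an
    /// assumption; the page only states that severity is a 0-100 value.
    pub fn severity(&self) -> u32 {
        let mut score = 0;
        if self.destructive {
            score += 60;
        }
        score += 20 * self.shell_commands.min(2);
        score += 25 * self.vcs_mutations.min(2);
        score += 15 * self.network_hosts.min(2);
        score += 10 * self.file_writes.min(3);
        if !self.is_reversible() {
            score += 15;
        }
        score.min(100)
    }
}
```

Under these assumptions, two plain file writes score 20 and stay on the cheap path at the default threshold of 40, while one shell command that performs one VCS mutation scores 20 + 25 + 15 = 60 and would be reviewed.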

Opaque mutations are covered too. A tool that mutates but declares no capabilities (many MCP tools, self_config, manage_skill, restart_gateway) would otherwise produce an empty blast radius and slip through. BlastRadius::should_review folds in each tool's is_read_only flag, so an undeclared side effect is reviewed anyway and its predicted risk is floored to the threshold — never recorded as "safe".
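One way to picture the opaque-mutation rule is the sketch below. The `CallSketch` type, the free functions, and the choice of `>=` for "clears the threshold" are all assumptions for illustration; the actual signatures of BlastRadius::should_review and predicted_risk are not shown on this page.

```rust
/// Illustrative per-call summary. The real tool-call type is not shown here.
pub struct CallSketch {
    pub is_read_only: bool,
    pub declares_capabilities: bool,
}

/// A mutating call that declared no capabilities is "opaque".
fn has_opaque_mutation(calls: &[CallSketch]) -> bool {
    calls
        .iter()
        .any(|c| !c.is_read_only && !c.declares_capabilities)
}

/// Review a mutating batch when its severity reaches the threshold
/// (assumed inclusive) or when it contains an opaque mutation.
pub fn should_review(calls: &[CallSketch], severity: u32, blast_threshold: u32) -> bool {
    let mutating = calls.iter().any(|c| !c.is_read_only);
    mutating && (severity >= blast_threshold || has_opaque_mutation(calls))
}

/// Predicted risk is floored to the threshold when an opaque mutation is
/// present, so an undeclared side effect is never recorded as "safe".
pub fn predicted_risk(calls: &[CallSketch], severity: u32, blast_threshold: u32) -> u32 {
    if has_opaque_mutation(calls) {
        severity.max(blast_threshold)
    } else {
        severity
    }
}
```

With the default threshold of 40, a mutating tool that declares nothing has severity 0 but is still reviewed, and its predicted risk is recorded as 40 rather than 0.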

Code-graph fan-in. When the code graph is built and knowledge.graph.rpl_impact is on, editing a heavily-referenced file adds its incoming-reference count to the severity — so changing a high-fan-in symbol is likelier to face review. No-op when the graph isn't available.

2. Adversarial LLM-as-a-judge ("Porfiry")

The surviving plan is submitted to Porfiry (agent.rs::rpl_porfiry) — a focused LLM call whose system directive (PORFIRY_POLICY) is only to find the flaw: the ignored edge case, the state the plan failed to re-read, the irreversible step taken for granted, the way it could break the environment. The judge returns a single line of JSON — {"approved":bool,"risk":0-100,"flaw":"…"} — parsed leniently and fail-open (parse_porfiry): an unreachable or unparseable judge approves (at the blast severity), so a flaky model never deadlocks the agent.
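The fail-open behaviour can be illustrated with a std-only stand-in. Everything here is hypothetical: `Verdict`, `parse_verdict`, and the string-scanning approach are assumptions chosen to keep the example dependency-free, and the real parse_porfiry may well use a JSON library instead.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Verdict {
    pub approved: bool,
    pub risk: u32,
    pub flaw: String,
}

/// Text that follows `"key":` in `raw`, with leading whitespace trimmed.
fn value_after<'a>(raw: &'a str, key: &str) -> Option<&'a str> {
    let needle = format!("\"{}\"", key);
    let start = raw.find(needle.as_str())? + needle.len();
    let after_colon = raw[start..].trim_start().strip_prefix(':')?;
    Some(after_colon.trim_start())
}

/// Lenient: tolerates chatter around the JSON and missing fields.
/// Fail-open: if `approved` cannot be read, approve at the blast severity.
pub fn parse_verdict(raw: &str, blast_severity: u32) -> Verdict {
    let fail_open = Verdict {
        approved: true,
        risk: blast_severity,
        flaw: String::new(),
    };
    let approved = match value_after(raw, "approved") {
        Some(v) if v.starts_with("true") => true,
        Some(v) if v.starts_with("false") => false,
        _ => return fail_open,
    };
    // A missing or unreadable risk falls back to the blast severity.
    let risk = value_after(raw, "risk")
        .map(|v| v.chars().take_while(|c| c.is_ascii_digit()).collect::<String>())
        .and_then(|digits| digits.parse::<u32>().ok())
        .map(|r| r.min(100))
        .unwrap_or(blast_severity);
    // Simplification: the flaw is cut at the first quote, so escaped
    // quotes inside the string are not handled.
    let flaw = value_after(raw, "flaw")
        .and_then(|v| v.strip_prefix('"'))
        .and_then(|v| v.split('"').next())
        .unwrap_or("")
        .to_string();
    Verdict { approved, risk, flaw }
}
```

The design point is the asymmetry: a readable rejection bounces the plan, but anything unreadable is treated as approval, so the permission engine and hooks remain the backstop when the judge is unavailable.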

Why a separate adversarial call instead of asking the planner to double-check itself? Because the planner is optimizing for finishing; a self-review inherits that bias. A judge instructed to assume the plan is flawed and prosecute it surfaces failure modes the planner has already rationalized away. It runs at temperature 0 with a tight token budget, which keeps it cheap and close to deterministic.

3. Regret minimization, not reward maximization

After execution, blumi computes the Error Delta (rpl.rs::ErrorDelta::compute) — how far reality diverged from the prediction:

  • Failed ⇒ high regret: 100 − predicted_risk, floored at 40. The lower the predicted risk, the bigger the surprise and the more there is to learn from it.
  • Succeeded ⇒ a small residual regret proportional to how scared we were: predicted_risk / 4.

That delta is written as an rpl_delta episode in the agent memory namespace and feeds value-based memory fitness — so memories and decisions that lead to low-regret outcomes gain value, and the agent's risk judgement self-calibrates. This is the credit-assignment loop: regret → memory value → what survives eviction.
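The two regret rules reduce to a short function. This is a sketch of the arithmetic only: the function name and signature are hypothetical (the real entry point is ErrorDelta::compute), and integer division is an assumption.

```rust
/// Regret on a 0-100 scale, following the two rules described above.
/// `predicted_risk` is clamped to 100 so the subtraction cannot underflow.
pub fn error_delta(predicted_risk: u32, succeeded: bool) -> u32 {
    let predicted = predicted_risk.min(100);
    if succeeded {
        // Small residual regret, proportional to how cautious the prediction was.
        predicted / 4
    } else {
        // A failure always costs at least 40, and more when it was unexpected.
        (100 - predicted).max(40)
    }
}
```

For example, a failure at predicted risk 10 yields a regret of 90, a failure at predicted risk 90 is floored to 40, and a success at predicted risk 80 leaves a residual regret of 20.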


How it works — the five phases

RPL is injected into the agent turn between finalize_tool_calls and execute_calls (blumi-core/src/agent.rs) — the tool calls are fully materialized but nothing has run yet.

Phase 1 — The Hypothesis (blast radius)

Map what the batch would touch into a structured BlastRadius + severity. Read-only / low-blast batches skip the loop entirely (the cheap path), keeping cost bounded.

Phase 2 — The Fever Dream (simulation)

For a batch over blast_threshold, predict the worst-case outcome and a risk score. (MVP: a single-plan dry prediction. Multi-branch Tree-of-Thoughts simulation and real sandboxed execution are planned — see Implementation status.)

Phase 3 — The Porfiry Node (adversarial judgment)

The plan goes to the adversarial judge. If it approves, proceed. If it rejects, the plan is bounced: the agent keeps its reasoning, the flaw is injected as guidance, and the model re-plans — bounded by max_defend_rounds (default 2), after which blumi proceeds under caution so the turn can never deadlock. The bounce is protocol-safe (it never leaves a dangling tool_use, mirroring blumi's doom-loop guard).
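The bounded bounce can be sketched as a generic loop. The names `defend` and `Outcome` are hypothetical, the judge and re-planner are passed in as closures rather than real LLM calls, and the sketch assumes that the plan produced by the last allowed re-plan is judged once more before blumi proceeds under caution.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum Outcome {
    /// The judge approved after `rounds_used` re-plans.
    Approved { rounds_used: u32 },
    /// Re-plan budget exhausted; proceed anyway so the turn cannot deadlock.
    ProceedUnderCaution { last_flaw: String },
}

/// `judge` returns `Err(flaw)` to reject; `replan` gets the old plan and the flaw.
pub fn defend<P, J, R>(
    mut plan: P,
    max_defend_rounds: u32,
    mut judge: J,
    mut replan: R,
) -> (P, Outcome)
where
    J: FnMut(&P) -> Result<(), String>,
    R: FnMut(P, &str) -> P,
{
    let mut last_flaw = String::new();
    for round in 0..=max_defend_rounds {
        match judge(&plan) {
            Ok(()) => return (plan, Outcome::Approved { rounds_used: round }),
            Err(flaw) => {
                // Only re-plan while there is budget left.
                if round < max_defend_rounds {
                    plan = replan(plan, &flaw);
                }
                last_flaw = flaw;
            }
        }
    }
    (plan, Outcome::ProceedUnderCaution { last_flaw })
}
```

Because the loop has a fixed upper bound, the worst case is max_defend_rounds + 1 judge calls per reviewed batch, which keeps the added cost predictable.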

Phase 4 — The Strike (deterministic actuation)

The surviving plan flows through the normal typed tool pipeline (execute_tool_call) unchanged — the permission engine and pre_tool_use hooks still apply. RPL only decides which plan reaches this point.

Phase 5 — The Confession (regret → memory)

blumi compares the predicted risk to the actual result, computes the Error Delta, and writes it back to memory as an rpl_delta episode, priming future blast-radius judgments and feeding value-based fitness. The agent learns from the consequence.


Configuration

In ~/.blumi/settings.json (defaults shown — enabled is false):

"rpl": {
  "enabled": false,
  "blast_threshold": 40,
  "branches": 3,
  "max_defend_rounds": 2,
  "judge_model": "",
  "sandbox": "dry"
}

| Field | Meaning |
|---|---|
| enabled | Master switch. Off by default; opt in for high-stakes / unattended work. |
| blast_threshold | 0–100 severity a mutating batch must reach to be reviewed. Lower = more reviews. |
| branches | Tree-of-Thoughts branches to simulate per review (1–5). Reserved; see roadmap. |
| max_defend_rounds | Porfiry reject → re-plan rounds before proceeding under caution. |
| judge_model | Model for the Porfiry judge. Empty = reuse the main model. |
| sandbox | dry (predict) or worktree (real sandboxed sim). worktree reserved; see roadmap. |
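For readers who think in types, the settings above map naturally onto a struct with a `Default` impl. This is a hypothetical mirror, not the actual RplConfig in crates/blumi-config: the type names, the `validate` helper, and the `judge_model_or` helper are assumptions, and the range checks simply restate the ranges given in the table.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum SandboxMode {
    Dry,
    Worktree, // reserved per the roadmap
}

#[derive(Debug, Clone, PartialEq)]
pub struct RplConfigSketch {
    pub enabled: bool,
    pub blast_threshold: u32,
    pub branches: u32,
    pub max_defend_rounds: u32,
    pub judge_model: String,
    pub sandbox: SandboxMode,
}

impl Default for RplConfigSketch {
    /// Same values as the documented defaults.
    fn default() -> Self {
        Self {
            enabled: false,
            blast_threshold: 40,
            branches: 3,
            max_defend_rounds: 2,
            judge_model: String::new(),
            sandbox: SandboxMode::Dry,
        }
    }
}

impl RplConfigSketch {
    /// An empty `judge_model` means "reuse the main model".
    pub fn judge_model_or<'a>(&'a self, main_model: &'a str) -> &'a str {
        if self.judge_model.is_empty() {
            main_model
        } else {
            &self.judge_model
        }
    }

    /// Check the documented ranges: threshold 0-100, branches 1-5.
    pub fn validate(&self) -> Result<(), String> {
        if self.blast_threshold > 100 {
            return Err(format!("blast_threshold {} exceeds 100", self.blast_threshold));
        }
        if !(1..=5).contains(&self.branches) {
            return Err(format!("branches {} is outside 1-5", self.branches));
        }
        Ok(())
    }
}
```

Turning RPL on for an unattended run then amounts to overriding a single field, for example `RplConfigSketch { enabled: true, ..Default::default() }`, while every other knob keeps its default.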

RPL composes with — it does not replace — your permissions and pre_tool_use hooks. Those still gate the Strike; RPL adds an upstream adversarial review so a risky plan is caught and re-thought before it ever reaches them.


The Dostoevsky mapping

RPL-Judgement borrows its structure from Raskolnikov's psychology in Crime and Punishment:

| Crime and Punishment | RPL-Judgement |
|---|---|
| Raskolnikov's "Extraordinary Man" theory: deciding the normal rules don't apply | The Hypothesis: mapping the blast radius / what protections are being bypassed |
| The fever dream: rehearsing the act, obsessing over chaotic variables | The Fever Dream: worst-case simulation + paranoia score |
| Porfiry Petrovich, the investigator who traps him in his own logic | The Porfiry Node: the adversarial judge that must be satisfied |
| The crime itself | The Strike: deterministic, typed actuation |
| Guilt: reality didn't match the sterile plan | The Confession: the Error Delta written to memory |

Implementation status & roadmap

Live today (v0.5.0):

  • ✅ Phase 1 blast-radius gate, including opaque-mutation coverage (should_review / predicted_risk).
  • ✅ Phase 3 Porfiry adversarial judge with reject → re-plan bounce, bounded + fail-open.
  • ✅ Phase 4 typed actuation through the normal pipeline (permissions + hooks intact).
  • ✅ Phase 5 Error-Delta "regret" written to memory, feeding value-based fitness.

Planned (config present, not yet active):

  • 🚧 Multi-branch Tree-of-Thoughts (Phase 2). The MVP judges the single materialized plan; the branches knob and the ParanoiaScore / least-regret machinery in rpl.rs are unit-tested but not yet wired into the live loop.
  • 🚧 sandbox: "worktree". Real sandboxed branch execution in throwaway git worktrees (observe outcomes instead of predicting them). The MVP runs sandbox: "dry".

FAQ

Is RPL-Judgement on by default?

No. It's opt-in via rpl.enabled = true. It adds LLM calls, so it's meant for high-stakes or unattended runs rather than every keystroke.

Does it slow down every action?

No. Only mutating batches whose blast-radius severity clears blast_threshold are reviewed, plus mutating tools that declare no capabilities (opaque mutations), which are always reviewed. Read-only and low-risk work takes the cheap path with zero added cost.

What happens if the judge model is down or returns garbage?

It fails open — the action is approved (recorded at the blast severity) — so a flaky judge can never deadlock the agent. RPL is a safety enhancement layered over the permission engine, not a single point of failure.

How is this different from the permission engine or pre_tool_use hooks?

Permissions and hooks gate an individual call at execution time. RPL adds an upstream, plan-level adversarial review that can bounce a whole batch back for re-planning before it reaches those gates. They compose.

Does RPL change what executes?

It changes whether a plan executes (approve / bounce), not how. The surviving plan runs through the exact same typed pipeline, permissions, and approvals as without RPL.

What is the "Error Delta" used for?

It's the regret signal — predicted risk vs. actual outcome — written to long-term memory, where it feeds value-based fitness so the agent's judgement self-calibrates over time.


See also: Self-Management · Memory & Knowledge · Configuration · FAQ.
