[AIG-644] First-class HITL primitives - interrupt/edit/resume (M2-16)#33
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughAdds a Human‑In‑The‑Loop interrupt system: types and errors, durable in-memory store with atomic operations, notifier sinks, REST endpoints with fail‑closed auth, CLI inspect/resume commands, a React Interrupts page, documentation, and tests. ChangesHITL Interrupt System
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 ESLint
tests/unit/consensus-service.test.tsParsing error: "parserOptions.project" has been provided for Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
blackms
left a comment
There was a problem hiding this comment.
Review: AIG-644 — HITL interrupt() Primitives
Verdict:
Excellent scope and code quality overall. Promise-based API is clean, the in-process EventEmitter + cross-process poll fallback is the right architecture, tests cover happy path, schema rejection, timeout, state edits, and end-to-end workflow flow (2197/2197 green). Docs (docs/HITL.md) are thorough.
✅ What's good
- Clear separation:
interrupt-types.ts(pure types/errors) /interrupt.ts(store + primitive) /interrupt-notifier.ts(pluggable sinks). InterruptValidationErrorreopens the interrupt so the operator can retry — nice touch, covered bye2e-3test.- Notifier fan-out is failure-isolated: each sink wraps
sendin try/catch and only logs (Webhook notifier failed,Slack notifier failed). One bad sink can't poison the others. - Persistence is a plug-in interface (
InterruptPersistence) rather than a hard dep on AIG-633 — store falls back to in-memory ifsavethrows (continuing in-memory), so a broken checkpointer never blocks a workflow. - Both Zod-like (
safeParse) and minimal JSON-Schema-subset validators are supported. - Web UI auto-claims on dialog open so concurrent operators see who's reviewing.
⚠️ Must-fix before merge
-
No auth on
/api/v1/interrupts/*(src/web/routes/interrupts.ts).resume,cancel,claimare unauthenticated POSTs — any caller who can reach the dashboard origin can resolve any interrupt with arbitraryinput/stateEdits. The interrupt ID is a UUIDv4 (unguessable), which mitigates random replay, but: (a) IDs leak viaGET /api/v1/interruptswhich is also unauth'd, (b) theclaimmechanism is purely advisory with no ownership check onresume. Either gate the routes behind the existingAuthService(consensus routes likely do — verify) or add a per-record resume token (HMAC ofid+createdAtreturned only to notified channels) and validate it inresumeInterrupt. This is the resume-token-replay concern from the AIG-644 spec. -
Outgoing webhook payload is unsigned (
WebhookSink.send, line ~52). Posts the fullInterruptRecord(incl.state) to the configured URL with no HMAC header. Receivers can't distinguish a real interrupt from a forged POST. RecommendX-Aistack-Signature: sha256=<hmac(secret, body)>like the Slack/GitHub convention. Also: no retry on 5xx and no timeout onfetch()— a slow webhook can stack up open sockets even though the call is fire-and-forget. -
Race in validation-failure reopen path (
interrupt.tsonResolved, ~line 280). On validation failure the handler mutatesr.status = 'pending'directly on the record returned by the event without going through the store API; meanwhilestore.resolve()has already setstatus = 'resolved'and emitted. If two listeners are attached (e.g. another in-process awaiter on the same id — unlikely but possible if the workflow retries) they both seestatus === 'resolved'before the rollback lands. Suggest adding anInterruptStore.reopen(id)method that re-emitspendingatomically. Also:r.resumeValue = undefineddeletes the bad value butdelete r.resolvedAtis missing — set it toundefinedis fine but the persisted record now has staleresolvedAtif persistence already flushed. -
resumeInterruptis not transactional. State edits are persisted viapersistence.save(record)beforestore.resolve(); if resolve throws (e.g. status changed concurrently), the persisted state is mutated but the Promise never resolves and the next operator sees the edits already applied. Wrap the edit+resolve in a single store method or apply edits after a successful resolve.
🧪 Edge cases / lower-priority
- Interrupt during interrupt: nothing prevents a workflow from awaiting
interrupt()twice concurrently on the samesessionId.resumeLatestForSessionpicks the most recently created pending — operator could resume the wrong one. Worth a doc note + a--interrupt-idrecommendation when more than one is pending (CLI already supports--interrupt-id, good). - Per-interrupt polling timer: every
interrupt()opens a 250mssetInterval. With dozens of pending HITL pauses this is fine but noisy; consider a single shared poller indexed by id, or rely solely on the EventEmitter when no cross-process persistence is configured. - Coordination barrel export:
src/coordination/index.tsdoes not re-exportinterrupt,resumeInterrupt,getInterruptStore. Callers must import fromcoordination/interrupt.jsdirectly; inconsistent with the rest of the module. applyStateEditarray indexing:cursor[key] = {}overwrites arrays at intermediate segments (the guardArray.isArray(next)treats arrays as non-objects). Documented behaviour, but worth calling out —path: 'items.0.name'will silently clobber theitemsarray. Consider supporting numeric segments or rejecting them with a clear error.- AIG-633 checkpointer integration is documentation-only. The PR claims persistence "can plug in" but ships no adapter and no test that exercises hydrate→crash→resume. Fine to ship as the in-memory MVP, but the issue title says "first-class HITL primitives" — recommend a follow-up issue to actually wire AIG-633 in and add a
tests/integration/interrupt-persistence.test.ts. - InterruptsPage
useEffectpolls every 5s; could be swapped to the existing WebSocket event bridge to align with how consensus/agents/tasks update.
📝 Nits
interrupt.tsonResolvedre-mutatesr.statuswithout re-emitting; if a third listener was attached elsewhere it would not see the rollback.InterruptsPage.tsxinlines the API client + types (66 lines) — would be cleaner underweb/src/api/interrupts.tsto matchapi/client.tsconvention. Documented as intentional ("minimize file footprint"), but the rationale is weak.consolesink writes directly toprocess.stderr; in TUI/JSON-only environments this leaks into structured logs. Gate behind a flag or use the logger consistently.parseEdit(CLI) does not validatepathis non-empty before passing toapplyStateEdit— error surfaces from deep inside but message is fine.
Recommendation
Block merge on items 1 & 2 (auth + webhook signing — security-relevant). Items 3 & 4 are correctness bugs that should be fixed in this PR. The rest can land as follow-ups under a new AIG-XXX (notably the real AIG-633 wiring).
Good work overall — the API shape is right and the test coverage is genuinely useful.
🤖 Review by Claude Opus 4.7 (sub-agent)
…orphan-record on reopen-race Critical security fix: - requireInterruptsAuth in src/web/routes/interrupts.ts no longer delegates to createAuthMiddleware when neither AISTACK_INTERRUPTS_TOKEN nor a real AuthService is configured. Previously, in non-production NODE_ENV the generic middleware's dev-mode 'allow when service missing' branch was silently granting access with zero credentials. We now pre-flight check isAuthServiceInitialized() (new export from src/web/middleware/auth.ts) and reject with 401 before ever calling the middleware. - Added integration-style test cases that exercise the REAL middleware (not the mocked one) under NODE_ENV=development with no AuthService and no env token, asserting 401 on GET list, POST resume, and verifying record state stays untouched. Bonus fix (orphaned-record on validation-failure reopen race): - src/coordination/interrupt.ts now tracks in-process awaiters per record via registerAwaiter / unregisterAwaiter / hasAwaiter. The interrupt() Promise holds its awaiter slot across the async reopen(), so a racing operator resume between validation rejection and reopen completion still observes a live listener; the slot is dropped only after reopen settles. - resumeInterrupt() now refuses to resolve an interrupt that was created in THIS process but has zero awaiters left (the workflow Promise died), throwing InterruptNoListenerError with code 'no_listener'. Records originating in a different process skip the check via record.originPid to avoid false positives on cross-process resumes (CLI <-> daemon). - InterruptNoListenerError exported from interrupt.ts/interrupt-types.ts.
…-primitives # Conflicts: # src/cli/commands/workflow.ts # src/web/server.ts
There was a problem hiding this comment.
Actionable comments posted: 8
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/HITL.md`:
- Around line 17-33: The unlabeled fenced code blocks in docs/HITL.md (the ASCII
diagram block and the CLI output block) are triggering markdownlint MD040;
update both triple-backtick fences to include a language tag (e.g., use "text"
for the ASCII diagram block and "bash" or "text" for the CLI output block) so
the fences become ```text and ```bash respectively, and apply the same change to
the other occurrences noted (around lines 90–110) to silence the linter.
In `@src/cli/commands/workflow-inspect.ts`:
- Around line 65-70: The printed resume examples use only r.sessionId which can
resume the latest pending interrupt for that session; change the examples to use
the interrupt-specific identifier (e.g., r.id or r.interruptId) so the command
targets the exact record shown. Update the strings in the block where r.status
=== 'pending' to include the interrupt id (reference r.id / r.interruptId and
the resume-interrupt invocation) and show both the direct resume and the
--edit-state form using that interrupt-scoped identifier instead of r.sessionId.
In `@src/cli/commands/workflow.ts`:
- Around line 217-230: When --interrupt-id is supplied the CLI currently calls
resumeInterrupt(options.interruptId, ...) without validating it belongs to the
provided sessionId, which can resume the wrong session; fix this by fetching the
interrupt metadata for options.interruptId (create or call a helper like
getInterruptById/fetchInterruptById), verify its sessionId matches the sessionId
argument, and throw/exit with a clear error if it does not; only call
resumeInterrupt when the check passes and keep using
resumeLatestForSession(sessionId, ...) when no interruptId is provided, and
ensure the log/info messages (record.id and sessionId) reflect the validated
interrupt.
In `@src/web/routes/interrupts.ts`:
- Around line 102-107: The handler currently treats unknown status query values
as undefined; update the validation around statusParam (and related
InterruptStatus) in the interrupts route so that if params.query.status is
provided but not one of 'pending'|'resolved'|'cancelled' the endpoint returns a
400 Bad Request instead of silently ignoring it; locate the status parsing logic
(the statusParam / status assignment) and replace it with explicit validation
that replies with a 400 (with a brief error message) when an invalid status is
present, otherwise continue using the validated InterruptStatus value.
In `@tests/unit/interrupt.test.ts`:
- Line 160: Replace the hardcoded sleeps (e.g., the literal await new
Promise((r) => setTimeout(r, 5)) and the other setTimeout(10/30) usages) with
condition-based waits: poll or use a test helper like waitFor/awaitUntil that
repeatedly checks the specific async condition/state you rely on (for example, a
flag, mock call count, or the result of getInterruptState() / the function under
test) with a short interval and an overall timeout; update each occurrence so
the test asserts the target state before proceeding instead of sleeping.
In `@tests/unit/web/routes/interrupts.test.ts`:
- Around line 17-20: After calling vi.resetModules() in the second describe,
replace usages of the top-level getInterruptStore/resetInterruptStore with the
fresh module instance by dynamically importing src/coordination/interrupt.js
(e.g. const interrupt = await import('.../coordination/interrupt.js')) and then
call interrupt.getInterruptStore() and interrupt.resetInterruptStore() for
seeding and assertions; ensure registerInterruptRoutes and resumeInterrupt
remain imported dynamically as done now so the test reads/writes the same
interrupt store instance used by the route handlers.
In `@web/src/pages/InterruptsPage.tsx`:
- Line 149: The code that builds stateEdits currently returns the RHS as a raw
string (return { path: line.slice(0, idx).trim(), value: line.slice(idx + 1) });
but the UI claims values are JSON-parsed; change the construction in the parsing
logic (the function creating these objects around the return at the line with
that expression and the similar occurrence at the other spot ~294) to attempt
JSON.parse on the RHS (trimmed) and, if JSON.parse throws, fall back to the
original string value—ensure you handle empty values correctly and preserve the
property name from line.slice(0, idx).trim().
- Around line 61-65: The apiCall function is sending requests without
credentials so auth-guarded endpoints will 401; update apiCall to include
credentials by adding credentials: 'include' to the fetch options and ensure the
Authorization/API-Key header passed via init.headers is preserved (keep the
existing header merge but add credentials: 'include' to the object sent to fetch
in apiCall).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 7c762f06-a1f8-46e1-93f1-9a54202539e1
📒 Files selected for processing (14)
docs/HITL.mdsrc/cli/commands/workflow-inspect.tssrc/cli/commands/workflow.tssrc/coordination/interrupt-notifier.tssrc/coordination/interrupt-types.tssrc/coordination/interrupt.tssrc/web/middleware/auth.tssrc/web/routes/index.tssrc/web/routes/interrupts.tssrc/web/server.tstests/unit/interrupt.test.tstests/unit/web/routes/interrupts.test.tsweb/src/App.tsxweb/src/pages/InterruptsPage.tsx
| .map((line) => { | ||
| const idx = line.indexOf('='); | ||
| if (idx === -1) throw new Error(`Invalid edit (need path=value): ${line}`); | ||
| return { path: line.slice(0, idx).trim(), value: line.slice(idx + 1) }; |
There was a problem hiding this comment.
stateEdits values are not JSON-parsed despite UI claiming they are.
The helper says values are JSON-parsed, but current code always sends strings. Parse RHS as JSON when possible (fallback to string) to match contract and avoid type drift.
Suggested patch
.map((line) => {
const idx = line.indexOf('=');
if (idx === -1) throw new Error(`Invalid edit (need path=value): ${line}`);
- return { path: line.slice(0, idx).trim(), value: line.slice(idx + 1) };
+ const raw = line.slice(idx + 1).trim();
+ let value: unknown = raw;
+ try {
+ value = JSON.parse(raw);
+ } catch {
+ value = raw;
+ }
+ return { path: line.slice(0, idx).trim(), value };
});Also applies to: 294-294
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@web/src/pages/InterruptsPage.tsx` at line 149, The code that builds
stateEdits currently returns the RHS as a raw string (return { path:
line.slice(0, idx).trim(), value: line.slice(idx + 1) }); but the UI claims
values are JSON-parsed; change the construction in the parsing logic (the
function creating these objects around the return at the line with that
expression and the similar occurrence at the other spot ~294) to attempt
JSON.parse on the RHS (trimmed) and, if JSON.parse throws, fall back to the
original string value—ensure you handle empty values correctly and preserve the
property name from line.slice(0, idx).trim().
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (1)
tests/unit/consensus-service.test.ts (1)
448-452: ⚡ Quick win
expires_atpast-timestamp precondition is aligned with the store’s time unit—only the delay looks unnecessary.
consensus_checkpoints.expires_atisINTEGER, andexpireOldCheckpoints()compares it toDate.now()viaexpires_at < ?, so setting it toDate.now() - 1correctly creates a past-expiry precondition for the “approved should not expire” assertion.delay(10)doesn’t affect the logic ofservice.expireCheckpoints()(it’s called directly), so it can be removed to keep the test faster and more deterministic.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/consensus-service.test.ts` around lines 448 - 452, The test sets the checkpoint expiry correctly with store.db.prepare('UPDATE consensus_checkpoints SET expires_at = ? WHERE id = ?').run(Date.now() - 1, checkpoint.id) but then calls an unnecessary delay; remove the delay(10) call and call service.expireCheckpoints() directly so expireCheckpoints() runs deterministically against the past-timestamp precondition, keeping the approveCheckpoint(...) and the DB update as-is.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@tests/unit/consensus-service.test.ts`:
- Around line 448-452: The test sets the checkpoint expiry correctly with
store.db.prepare('UPDATE consensus_checkpoints SET expires_at = ? WHERE id =
?').run(Date.now() - 1, checkpoint.id) but then calls an unnecessary delay;
remove the delay(10) call and call service.expireCheckpoints() directly so
expireCheckpoints() runs deterministically against the past-timestamp
precondition, keeping the approveCheckpoint(...) and the DB update as-is.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: d63034e8-0007-4535-9a8a-37d34f9e0b81
📒 Files selected for processing (9)
docs/HITL.mdsrc/cli/commands/workflow-inspect.tssrc/cli/commands/workflow.tssrc/coordination/interrupt-notifier.tssrc/web/routes/interrupts.tstests/unit/consensus-service.test.tstests/unit/interrupt.test.tstests/unit/web/routes/interrupts.test.tsweb/src/pages/InterruptsPage.tsx
✅ Files skipped from review due to trivial changes (1)
- docs/HITL.md
🚧 Files skipped from review as they are similar to previous changes (7)
- src/cli/commands/workflow-inspect.ts
- src/cli/commands/workflow.ts
- tests/unit/web/routes/interrupts.test.ts
- src/web/routes/interrupts.ts
- src/coordination/interrupt-notifier.ts
- tests/unit/interrupt.test.ts
- web/src/pages/InterruptsPage.tsx
blackms
left a comment
There was a problem hiding this comment.
Reviewed current head after the nit fix. CodeRabbit latest review has 0 actionable comments; older inline comments are already marked addressed by CodeRabbit. CI is green.
Implements AIG-644.
Milestone: M2 - Differentiation
Estimate: 5 pts
Summary
See commit message for full implementation details (schema, API, CLI, tests, docs).
Notes from PM agent
Only branch where sub-agent had Node toolchain: vitest 2197/2197 passing locally.
Auto-opened by aistack PM agent on 2026-05-28 10:22. Review with /review or human dispatch.
Summary by CodeRabbit
New Features
Security
Documentation
Tests