feat: add worker execution state#234
Conversation
|
@kapi-agent review Revision explanation for current head What changed:
Why this closes the issue:
Verification:
|
kapi-agent review completedThe formal GitHub PR review and required check were submitted separately. This request comment is kept concise to avoid duplicating the full review body. kapi-agent request lifecycle:
|
|
@kapi-agent review Correction: the current PR head is Please review against that head. Verification remains:
|
There was a problem hiding this comment.
kapi-agent review
Verdict: REQUEST_CHANGES
Blocking issues
-
src/domain/task-graph.ts:captureWorkerReport()accepts any report matching a historical dispatch, but it does not verify that the reported claim is still the task’s active, unexpired claim atreportedAt.This means a stale worker report can mutate the graph after the claim has expired or been recovered by another worker.
completeTaskFromWorkerReport()later callscompleteTask()and may fail, but by thencaptureWorkerReport()has already moved the task toverifying/failed/repair_required/blocked, copied evidence refs, and updated the worker state.Suggested fix: in
validateWorkerReport()orcaptureWorkerReport(), look up the current task and call the existing active-claim validation againstreport.claimTokenandreport.reportedAt; also ensure the claim still belongs toreport.workerId. Add regression coverage for an expired/recovered claim report being rejected without changing task/worker/report state.
Warnings / risks
-
src/domain/runtime-state.ts:RuntimeTaskStatusstill lacks"in_progress"whileTaskStatusnow uses"in_progress"as the dispatch state. If runtime-state tasks are expected to persist/project dispatched graph tasks, parsing that state will fail. If these schemas are intentionally separate, consider documenting that boundary or adding a test proving no projection path requires"in_progress". -
src/domain/task-graph.ts: readiness nonce enforcement is optional at dispatch time. That may be intentional for non-nonce substrates, but the README says readiness requires a nonce-equivalent proof. If nonce-backed readiness is mandatory for this phase, dispatch should fail when a ready worker has a nonce and the request omits it.
Suggestions
- Add tests for stale/expired worker reports and recovered-claim reports.
- Consider preventing duplicate dispatch records for the same active claim unless there is a deliberate redispatch/retry model.
- Consider validating non-empty
workerIdand non-empty readiness nonce inputs consistently with other domain text validation.
Looks good
- The worker execution state is kept domain-only and adapter-neutral, which matches the stated phase-3 scope.
- Dispatch checks worker readiness, claim token validity, and worker ownership before moving a task to
in_progress. - Completed reports require summary and evidence refs before task verification/completion can proceed.
- Runtime worker status alignment for
cleanup-releasedandclosedis straightforward and covered by tests. - README updates match the new worker dispatch/report boundary at a high level.
Verification notes
- Verifier gate status: PASS —
npm ci && npm run verifyexited 0. - Size gate status: PASS — 178 changed lines, below the configured 200-line review threshold.
- Revision-explanation status: not required for this first kapi-agent review; explanation was present.
- Ilchul review harness: PASS as gate evidence — STANDARD risk, neutral conclusion, no blocking findings.
- Local/CI evidence in PR body includes full test/check/unused/budget verification and targeted task-graph/runtime-state coverage, but the stale report after expired/recovered claim case appears uncovered.
Engine: pi
|
@kapi-agent review Revision explanation for current head Changes since the requested-changes review:
Verification after fixes:
|
kapi-agent review completedThe formal GitHub PR review and required check were submitted separately. This request comment is kept concise to avoid duplicating the full review body. kapi-agent request lifecycle:
|
There was a problem hiding this comment.
kapi-agent review
Verdict: REQUEST_CHANGES
Blocking issues
- Revision explanation gate failed. This PR was updated after a prior kapi-agent requested-changes review, but the configured review gate reports that no recognized current-revision explanation was found. Per policy, I cannot approve until the author provides a recognized review response explaining what changed since the prior kapi-agent feedback and why those changes close it.
Warnings / risks
- None blocking from the inspected code. The prior stale-report mutation issue appears addressed in
src/domain/task-graph.tsby validating the report against the task’s current active claim and worker ownership before mutating state. markMissingHeartbeats()marks all stale workers unhealthy regardless of current worker state. That may be intended for this domain projection, but future integration should ensure terminal retained/closed workers are not accidentally reclassified if passed into this helper.
Suggestions
- Add a focused assertion in the recovered stale-report test that worker state and reports remain unchanged after the recovered-claim rejection as well, matching the expired-claim branch.
- Consider documenting whether heartbeat staleness should apply to terminal worker states once the runtime projection is wired into adapters.
Looks good
captureWorkerReport()now callsvalidateWorkerReport()before any graph, worker, or report mutation.validateWorkerReport()checks:- timestamp validity,
- non-empty summary,
- evidence refs for completed reports,
- current active unexpired claim token at
reportedAt, - claim ownership by
report.workerId, - matching prior dispatch.
RuntimeTaskStatusnow includesin_progress, aligning runtime-state persistence with the task graph dispatch state.- Tests cover dispatch nonce rejection, heartbeat update, report evidence gating, completion from reports, stale heartbeat projection, and stale expired/recovered report rejection.
- README updates are aligned with the new worker dispatch/report execution boundary.
Verification notes
- Verifier gate status: PASS —
npm ci && npm run verifyexited 0. - Size gate status: PASS with semantic review required — 206 changed lines, mostly bounded source/test/docs changes around worker execution state.
- Ilchul review harness gate: PASS — STANDARD risk, neutral/comment verdict, no blocking findings.
- Revision-explanation status: FAIL — required now, found false by the configured gate.
- Local inspected files:
src/domain/task-graph.ts,src/domain/runtime-state.ts,test/task-graph.test.ts,test/runtime-state.test.ts,README.md.
Engine: pi
|
@kapi-agent review Revision explanation for current head What changed:
Why this closes the prior feedback:
Verification:
|
b9851f9 to
ac1af2b
Compare
|
@kapi-agent review Revision explanation for current head What changed:
Why this closes the prior feedback:
Verification:
|
kapi-agent review completedThe formal GitHub PR review and required check were submitted separately. This request comment is kept concise to avoid duplicating the full review body. kapi-agent request lifecycle:
|
There was a problem hiding this comment.
kapi-agent review
Verdict: REQUEST_CHANGES
Blocking issues
-
src/domain/task-graph.ts:dispatchClaimedTask()validates an active claim token and worker ownership, but it does not validate that the task is currently in the"claimed"state before moving it to"in_progress".This can regress a completed or otherwise non-dispatchable task back into execution if it still carries an active claim. For example,
completeTask()setsstatus: "completed"but does not clearclaim; a laterdispatchClaimedTask()with that still-active token and a ready worker would pass validation, create a dispatch, and change the completed task to"in_progress".Please make
dispatchClaimedTask()fail closed unlesstask.status === "claimed"before mutation, and add regression coverage for a non-claimed task with a valid active claim not being dispatched or mutated.
Warnings / risks
markMissingHeartbeats()currently relies ongetStaleWorkerIds()to only mark"busy"/"ready"workers unhealthy. That is fine as implemented, but worth preserving in future changes so terminal states such as"completed-retained","safe-to-close","cleanup-released", and"closed"are not accidentally reclassified.
Suggestions
- In the recovered stale-report test, also assert the worker state and
reportsremain unchanged for the recovered-claim rejection path, matching the expired-claim assertions.
Looks good
- The prior stale worker-report issue is addressed:
captureWorkerReport()now validates against the task’s current active claim atreportedAt, verifies worker ownership, and only mutates after validation. - Runtime-state alignment for
"in_progress","cleanup-released", and"closed"is appropriate and covered. - The README update is aligned with the new worker execution boundary.
- The new tests cover nonce mismatch, heartbeat refresh, evidence-required completed reports, completion from report evidence, and stale-report rejection paths.
Verification notes
- Verifier gate status: PASS —
npm ci && npm run verifyexited 0. - Size gate status: PASS with semantic review required — 206 changed lines, with 141 source lines and 61 test lines.
- Revision-explanation status: PASS — current-head revision explanation was found.
- Ilchul review harness gate: PASS — STANDARD risk profile, neutral/comment verdict, no blocking findings.
- Local inspected files:
src/domain/task-graph.ts,src/domain/runtime-state.ts,test/task-graph.test.ts, and README diff.
Engine: pi
|
@kapi-agent review Revision explanation for current head What changed:
Why this closes the prior feedback:
Verification:
|
kapi-agent review completedThe formal GitHub PR review and required check were submitted separately. This request comment is kept concise to avoid duplicating the full review body. kapi-agent request lifecycle:
|
There was a problem hiding this comment.
kapi-agent review
Verdict: REQUEST_CHANGES
Blocking issues
-
src/domain/task-graph.ts:captureWorkerReport()can still mutate a terminal/completed task if the original claim is still active.completeTaskFromWorkerReport()ultimately callscompleteTask(), which setsstatus: "completed"but leaves the activeclaimin place. BecausecaptureWorkerReport()validates only active claim ownership plus a historical dispatch, a second report for the same dispatched claim can later move the completed task back toverifying,failed,repair_required, orblocked, overwrite evidence refs, append another report, and update the worker state.Suggested fix: make
captureWorkerReport()fail closed unless the task is in the expected reportable execution state, likelyin_progressonly, or otherwise explicitly reject terminal/non-reportable statuses before any mutation. Add regression coverage for a task completed throughcompleteTaskFromWorkerReport()receiving a later report on the same still-active claim and prove graph, worker, and reports remain unchanged.
Warnings / risks
src/domain/task-graph.ts: if multiple reports for a single dispatch are intended, the state-machine rules should be explicit. Right now duplicate reports are accepted and always mutate task/worker state. That is risky around evidence integrity and terminal-state stability.
Suggestions
- Consider clearing or terminalizing the task claim during
completeTask()if claims should never remain usable after completion. Even with acaptureWorkerReport()status guard, clearing terminal claims would make this class of bug harder to reintroduce. - Add a small assertion to the recovered stale-report test that the recovered task status remains
"claimed"after rejection, not only worker/reports state.
Looks good
- The prior dispatch regression appears addressed:
dispatchClaimedTask()now rejects tasks whose status is not"claimed"before mutation. - The prior stale worker-report issue is substantially improved: report validation now checks the current active claim at
reportedAtand verifies claim worker ownership. - Runtime-state status alignment for
in_progress,cleanup-released, andclosedis consistent with the new execution model. - README updates match the domain-level worker execution boundary.
Verification notes
- Verifier gate status: PASS —
npm ci && npm run verifyexited 0. - Size gate status: PASS with semantic review required — 219 changed lines exceeds the 200-line threshold, but the semantic scope is bounded to domain worker execution state, runtime status alignment, tests, and README documentation.
- Revision-explanation status: PASS — current revision explanation was found.
- Ilchul review harness gate: PASS — STANDARD risk, neutral conclusion, no blocking findings.
- Local/CI evidence provided includes full verify pass, task-graph test coverage, type/check/unused/quality gates, and
git diff --check. - I inspected
src/domain/task-graph.ts,test/task-graph.test.ts,src/domain/runtime-state.ts, and the README diff.
Engine: pi
|
@kapi-agent review Revision explanation for current head What changed:
Why this closes the prior feedback:
Verification:
|
kapi-agent review completedThe formal GitHub PR review and required check were submitted separately. This request comment is kept concise to avoid duplicating the full review body. kapi-agent request lifecycle:
|
There was a problem hiding this comment.
kapi-agent review
Verdict: APPROVE
Final approval summary
Review journey
This PR adds the phase-3 worker execution domain contract for claimed TaskGraph tasks. The semantic scope is bounded to domain/runtime-state execution state, regression tests, and README alignment. Earlier kapi-agent rounds found stale report mutation risks, missing current-revision explanation, dispatch of non-claimed tasks, and late report mutation of completed tasks. The current revision includes an author explanation and addresses those issues.
What changed
Key paths inspected:
src/domain/task-graph.tssrc/domain/runtime-state.tstest/task-graph.test.tstest/runtime-state.test.tsREADME.md
The final code adds worker readiness nonce tracking, dispatch records, heartbeat/staleness helpers, structured worker reports, evidence-gated report completion, in_progress runtime task status, and terminal worker statuses.
Why this is correct
The previously blocking stale-report and terminal-task mutation issues are closed:
dispatchClaimedTask()now rejects tasks unlesstask.status === "claimed"before mutation.captureWorkerReport()now rejects reports unless the current task isin_progress, validates the active claim atreportedAt, checks claim ownership, and requires a matching dispatch before mutating graph/worker/report state.- Regression tests cover non-claimed dispatch rejection, expired stale reports, recovered claim reports, and late report rejection after completion.
Evidence
- Verifier gate: PASS —
npm ci && npm run verifyexited 0. - Size gate: PASS with semantic review required — 225 changed lines, with 143 semantic source lines and 78 test lines.
- Revision-explanation status: PASS — current revision explanation was found.
- Ilchul harness gate: PASS — STANDARD risk, no blocking findings.
- Tests inspected include task-graph worker execution and runtime-state parsing coverage.
Remaining risks and approval rationale
Remaining risk is limited to future integration behavior: live adapters must preserve these domain preconditions when projecting tmux/worktree worker events. No blocking issue remains in this domain-only slice, and the test coverage now matches the previously identified regression cases. Approval is justified.
Blocking issues
- None.
Warnings / risks
- Future adapter integration should ensure it does not bypass
captureWorkerReport()/dispatchClaimedTask()state preconditions when replaying or projecting live worker events.
Suggestions
- Consider adding a future test for
completeTaskFromWorkerReport()rejecting completion if an external transition has moved the task out ofverifying, if that becomes part of the formal lifecycle contract.
Looks good
- The worker report validation now fails closed before mutation.
- Runtime-state status alignment covers
in_progress,cleanup-released, andclosed. - README updates match the new execution boundary without overstating live adapter integration.
Verification notes
- Verifier gate: PASS.
- Size gate: PASS, semantic review performed.
- Revision-explanation status: PASS.
- PR evidence reports
npm run verify, targeted tests, type/check gates, unused checks, quality budgets, and diff whitespace checks passing.
Engine: pi
Summary
TaskGraphtasks.Linked issue
Closes #196
Problem
Issue #196 asks for the MVP worker execution boundary: ready/claimed tasks should dispatch to ready workers, worker heartbeat/readiness should be modeled, structured reports should provide evidence, and task completion should be gated by evidence.
Before this PR, the domain had claim/lease and basic worker heartbeat helpers, but no first-class execution state connecting claimed tasks, ready workers, dispatch records, structured worker reports, and evidence-gated completion.
Options considered
WorkerExecutionStatecontract around claimed tasks and worker reports.RuntimeStateworker schema.Selected approach
Selected option: 2, with a small
RuntimeStatestatus alignment.Why this one: it keeps phase 3 focused on the execution contract. The existing local substrate can later project into these helpers without changing task ownership or evidence semantics.
Implementation by file/surface
src/domain/task-graph.tsWorkerExecutionState,WorkerTaskDispatch, andWorkerTaskReport.WorkerRuntimeRefwith an optional readiness nonce.recordWorkerReady()andrecordWorkerHeartbeat().dispatchClaimedTask()to require a ready worker, optional matching readiness nonce, matching active claim token, and claim worker ownership before moving a task toin_progressand the worker tobusy.captureWorkerReport()to require a prior dispatch, summary, and evidence refs for completed reports; completed reports move tasks toverifyingand workers tocompleted-retained.completeTaskFromWorkerReport()to complete only from a completed report with evidence refs and then mark the worker retained or safe-to-close.markMissingHeartbeats()for heartbeat timeout projection tounhealthy.src/domain/runtime-state.tscleanup-releasedandclosedin addition to existing phase-3 states.test/task-graph.test.tstest/runtime-state.test.tsREADME.mdWorkerExecutionStatein the graph-execution boundary.Why this fixes it
A claimed task can now be dispatched only to the worker that owns its active claim. Worker readiness can carry a nonce-equivalent proof. Heartbeats update runtime liveness and missing heartbeats project to unhealthy. Structured reports are captured only after dispatch, and completed reports require evidence refs before the task can move through
verifyingtocompleted.QA / Verification
npm ci— pass; installed dependencies for fresh worktree.npm test -- test/task-graph.test.ts— pass; package script ran full suite (533tests,522pass,11skipped).npm run check— pass.npm run check:unused— pass.npm run quality:budgets— pass with configured non-failingcode_smellswarning.npm run verify— pass (533tests,522pass,11skipped; then check, unused check, quality budgets).git diff --check— pass.Anomalies observed
npm test -- test/task-graph.test.tsstill runs the full suite because the package script includestest/*.test.tsbefore forwarded args.[DEP0205] module.register()deprecation warnings during tests.npm run quality:budgetsretains the existing non-failing warning:code_smells=60vs target<=20.Risks / Follow-up
captureWorkerReport()stores completed task evidence and moves the task toverifying;completeTaskFromWorkerReport()remains the explicit evidence-gated completion step.kapi-agent review expectations and current-head merge gate
76788d3373238468209ad40dfc908fa5f8e6bb64.origin/dev(172insertions,6deletions), under the review-size gate.kapi-agentshould verify claimed-task dispatch ownership, readiness nonce checking, report evidence gating, heartbeat stale projection, runtime-state status alignment, and README alignment before merge.