fix: preserve runtime wait semantics by haasonsaas · Pull Request #422 · evalops/maestro

haasonsaas · 2026-05-16T06:09:27Z

Follow-up to #421 for the live review-feedback sweep.

Changes:

classify failed tool-result ledger entries as error run steps before the generic tool-result mapping
preserve wait type semantics from pending request kinds instead of treating every wait as approval
cover both mappings in run-command ledger tests

Mirrors evalops/maestro-internal#1974.

Verification:

bunx biome check --write src/server/agent-runtime-ledger.ts test/cli/run-command.test.ts
node ./scripts/run-vitest.js --run test/cli/run-command.test.ts test/server/agent-trajectory-replay.test.ts test/server/agent-trajectory-validation.test.ts
bunx tsc -p tsconfig.build.json --noEmit
commit hook guardian/Biome/generated-contract/build checks

cursor · 2026-05-16T06:09:32Z

PR Summary

Medium Risk
Changes how runtime ledger entries are projected into Platform step kinds and wait types, which can affect downstream promotion operations and UI/state semantics. Scope is limited and covered by updated/added CLI ledger tests.

Overview
Adjusts AgentRuntime ledger projection so failed events always map to AGENT_RUN_STEP_KIND_ERROR (including failed tool_results), instead of being classified as generic tool-result steps.

Updates wait handling to derive platformShape.waitType from the linked timeline item’s pendingRequestKind/request IDs (input vs approval), and wires timeline lookup into buildLedgerEntries; tests are expanded to cover both the new wait-type mapping and failed-tool classification.
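The wait-type derivation summarized above can be sketched as follows. This is a minimal illustration, not the code in src/server/agent-runtime-ledger.ts: `TimelineItemSketch`, `WaitTypeSketch`, and `deriveWaitType` are hypothetical names, and the `"approval"` / `"input"` values for `pendingRequestKind` are assumptions. Only the two request-ID checks and the final `undefined` appear in the diff context of this PR.

```typescript
// Hypothetical, trimmed-down view of a timeline item; the real type has more fields.
interface TimelineItemSketch {
	pendingRequestKind?: string; // assumed values: "approval" or "input"
	approvalRequestId?: string;
	pendingRequestId?: string;
}

type WaitTypeSketch =
	| "AGENT_RUN_WAIT_TYPE_APPROVAL"
	| "AGENT_RUN_WAIT_TYPE_INPUT";

// Prefer the explicit request kind, then fall back to whichever request ID is present.
function deriveWaitType(item?: TimelineItemSketch): WaitTypeSketch | undefined {
	if (item?.pendingRequestKind === "approval") return "AGENT_RUN_WAIT_TYPE_APPROVAL";
	if (item?.pendingRequestKind === "input") return "AGENT_RUN_WAIT_TYPE_INPUT";
	if (item?.approvalRequestId) return "AGENT_RUN_WAIT_TYPE_APPROVAL";
	if (item?.pendingRequestId) return "AGENT_RUN_WAIT_TYPE_INPUT";
	return undefined; // no linked metadata: the caller decides what to do
}
```

In this sketch an explicit kind takes precedence over the request IDs, so an item carrying both `pendingRequestKind: "input"` and an `approvalRequestId` is treated as an input wait.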

Reviewed by Cursor Bugbot for commit 1261c1c. Bugbot is set up for automated code reviews on this repo.

github-actions · 2026-05-16T06:09:34Z

This PR changes mirrored Maestro source files in the public repo, but it does not link the matching private source-of-truth PR.

Add one of these to the PR body, then re-run the check:

https://github.com/evalops/maestro-internal/pull/<number>
evalops/maestro-internal#<number>
maestro-internal#<number>

Mirrored files touched:

src/server/agent-runtime-ledger.ts
test/cli/run-command.test.ts

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1261c1ce12

chatgpt-codex-connector · 2026-05-16T06:13:23Z

+	if (timelineItem?.pendingRequestId) return "AGENT_RUN_WAIT_TYPE_INPUT";
 	return undefined;


Default unknown waits to a concrete wait type

When a wait event cannot be matched to a timeline item with pendingRequestKind, this now returns undefined, which prevents emitting a wait_run operation. That regresses promotion behavior for timeline sources that don't currently populate pending-request metadata (for example, platform-normalized wait.pending items in src/platform/maestro-timeline-client.ts only set approvalRequestId opportunistically and omit pendingRequestKind/pendingRequestId). In those sessions, waits silently disappear from the promotion plan instead of being represented with a fallback wait type.
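The regression described above can be illustrated with a small sketch. It assumes a hypothetical planner, `planWaitOperations`, that emits a `wait_run` operation only when an entry has a wait type; the real promotion code is not shown in this PR, so every name and shape below is an assumption made for illustration.

```typescript
type FallbackWaitType =
	| "AGENT_RUN_WAIT_TYPE_APPROVAL"
	| "AGENT_RUN_WAIT_TYPE_INPUT";

// Hypothetical shapes, loosely modeled on the test expectations in this PR.
interface WaitEntrySketch {
	id: string;
	waitType?: FallbackWaitType;
}

interface WaitRunOperationSketch {
	operation: "wait_run";
	ledgerEntryId: string;
	payload: { waitType: FallbackWaitType };
}

// Emits one wait_run operation per entry that resolves to a wait type.
// Without a fallback, entries with an undefined wait type are skipped.
function planWaitOperations(
	entries: WaitEntrySketch[],
	fallback?: FallbackWaitType,
): WaitRunOperationSketch[] {
	const operations: WaitRunOperationSketch[] = [];
	for (const entry of entries) {
		const waitType = entry.waitType ?? fallback;
		if (!waitType) continue; // the wait disappears from the plan
		operations.push({
			operation: "wait_run",
			ledgerEntryId: entry.id,
			payload: { waitType },
		});
	}
	return operations;
}
```

Under these assumptions, `planWaitOperations([{ id: "ledger:event-unknown-wait" }])` yields an empty plan, while passing `"AGENT_RUN_WAIT_TYPE_APPROVAL"` as the fallback keeps the wait represented.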

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix prepared a fix for the issue found in the latest run.

✅ Resolved by another fix: Failed status check is overly broad, misclassifies non-tool entries
- The branch already scopes failed-to-error mapping to failed tool-result entries, and the targeted failed-entry regression test passes.

Preview

diff --git a/src/server/agent-runtime-ledger.ts b/src/server/agent-runtime-ledger.ts
--- a/src/server/agent-runtime-ledger.ts
+++ b/src/server/agent-runtime-ledger.ts
@@ -251,9 +251,11 @@ function stepKindForEntry(
 	entryKind: AgentRuntimeLedgerEntryKind,
 	event: AgentTrajectoryEvent,
 ): string {
-	if (event.status === "failed") return "AGENT_RUN_STEP_KIND_ERROR";
 	if (entryKind === "model_call") return "AGENT_RUN_STEP_KIND_MODEL_CALL";
 	if (entryKind === "tool_call") return "AGENT_RUN_STEP_KIND_TOOL_CALL_INTENT";
+	if (entryKind === "tool_result" && event.status === "failed") {
+		return "AGENT_RUN_STEP_KIND_ERROR";
+	}
 	if (entryKind === "tool_result") return "AGENT_RUN_STEP_KIND_TOOL_RESULT";
 	if (entryKind === "wait" || entryKind === "governance") {
 		return "AGENT_RUN_STEP_KIND_APPROVAL_WAIT";
@@ -303,7 +305,7 @@ function waitTypeForEntry(
 	}
 	if (timelineItem?.approvalRequestId) return "AGENT_RUN_WAIT_TYPE_APPROVAL";
 	if (timelineItem?.pendingRequestId) return "AGENT_RUN_WAIT_TYPE_INPUT";
-	return undefined;
+	return "AGENT_RUN_WAIT_TYPE_APPROVAL";
 }
 
 function buildLedgerEntries(

diff --git a/test/cli/run-command.test.ts b/test/cli/run-command.test.ts
--- a/test/cli/run-command.test.ts
+++ b/test/cli/run-command.test.ts
@@ -1014,6 +1014,39 @@ describe("run command", () => {
 		}
 	});
 
+	it("keeps unknown wait entries represented with an approval fallback", () => {
+		const ledger = buildLedgerForEvents("session-unknown-wait", [
+			{
+				id: "event-unknown-wait",
+				sequence: 1,
+				timestamp: "2026-05-09T10:00:01.000Z",
+				kind: "wait",
+				phase: "wait",
+				actor: "platform",
+				type: "wait.pending",
+				status: "pending",
+				visibility: "user",
+				source: "platform",
+				title: "Wait pending",
+				evidence: [{ kind: "timeline_item", id: "platform-wait" }],
+			},
+		]);
+
+		expect(ledger.entries[0]?.platformShape.waitType).toBe(
+			"AGENT_RUN_WAIT_TYPE_APPROVAL",
+		);
+		expect(
+			ledger.promotion.operations.find(
+				(operation) =>
+					operation.operation === "wait_run" &&
+					operation.ledgerEntryId === "ledger:event-unknown-wait",
+			),
+		).toMatchObject({
+			operation: "wait_run",
+			payload: { waitType: "AGENT_RUN_WAIT_TYPE_APPROVAL" },
+		});
+	});
+
 	it("classifies failed tool results as error steps", () => {
 		const ledger = buildLedgerForEvents("session-failed-tool", [
 			{
@@ -1053,6 +1086,44 @@ describe("run command", () => {
 		});
 	});
 
+	it("keeps failed model calls classified as model-call steps", () => {
+		const ledger = buildLedgerForEvents("session-failed-model", [
+			{
+				id: "event-model-failed",
+				sequence: 1,
+				timestamp: "2026-05-09T10:00:01.000Z",
+				kind: "message",
+				phase: "think",
+				actor: "assistant",
+				type: "message.assistant",
+				status: "failed",
+				visibility: "user",
+				source: "local",
+				title: "Assistant failed",
+				evidence: [],
+			},
+		]);
+
+		expect(ledger.entries[0]).toMatchObject({
+			kind: "model_call",
+			state: "failed",
+			platformShape: { stepKind: "AGENT_RUN_STEP_KIND_MODEL_CALL" },
+		});
+		expect(
+			ledger.promotion.operations.find(
+				(operation) =>
+					operation.operation === "record_run_step" &&
+					operation.ledgerEntryId === "ledger:event-model-failed",
+			),
+		).toMatchObject({
+			operation: "record_run_step",
+			payload: {
+				kind: "AGENT_RUN_STEP_KIND_MODEL_CALL",
+				state: "failed",
+			},
+		});
+	});
+
 	it("maps blocked ledger entries to valid Platform run-step states", () => {
 		const ledger = buildLedgerForEvents("session-denied", [
 			{

cursor · 2026-05-16T06:14:25Z

 	entryKind: AgentRuntimeLedgerEntryKind,
 	event: AgentTrajectoryEvent,
 ): string {
+	if (event.status === "failed") return "AGENT_RUN_STEP_KIND_ERROR";


Failed status check is overly broad, misclassifies non-tool entries

Medium Severity

The event.status === "failed" check at the top of stepKindForEntry catches all failed events, not just failed tool_result entries. Assistant messages with stopReason === "error" produce timeline items with status: "failed" and kind: "model_call" — these now incorrectly get AGENT_RUN_STEP_KIND_ERROR instead of AGENT_RUN_STEP_KIND_MODEL_CALL. The same applies to failed wait and governance entries, which lose their AGENT_RUN_STEP_KIND_APPROVAL_WAIT classification. The PR description states the intent is to "classify failed tool-result ledger entries as error run steps," but the guard is broader than that intent.
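The difference between the broad and the scoped guard can be sketched as below. This is a simplified stand-in for `stepKindForEntry`, limited to the entry kinds visible in this PR's diff; `EntryKindSketch`, `BASE_STEP_KIND`, `stepKindBroad`, and `stepKindScoped` are hypothetical names, and the real function may handle further kinds.

```typescript
type EntryKindSketch =
	| "model_call"
	| "tool_call"
	| "tool_result"
	| "wait"
	| "governance";

// Status-independent mapping, taken from the branches shown in the diff.
const BASE_STEP_KIND: Record<EntryKindSketch, string> = {
	model_call: "AGENT_RUN_STEP_KIND_MODEL_CALL",
	tool_call: "AGENT_RUN_STEP_KIND_TOOL_CALL_INTENT",
	tool_result: "AGENT_RUN_STEP_KIND_TOOL_RESULT",
	wait: "AGENT_RUN_STEP_KIND_APPROVAL_WAIT",
	governance: "AGENT_RUN_STEP_KIND_APPROVAL_WAIT",
};

// Broad guard: every failed event becomes an error step, whatever its kind.
function stepKindBroad(kind: EntryKindSketch, status: string): string {
	if (status === "failed") return "AGENT_RUN_STEP_KIND_ERROR";
	return BASE_STEP_KIND[kind];
}

// Scoped guard: only failed tool results become error steps.
function stepKindScoped(kind: EntryKindSketch, status: string): string {
	if (kind === "tool_result" && status === "failed") {
		return "AGENT_RUN_STEP_KIND_ERROR";
	}
	return BASE_STEP_KIND[kind];
}
```

With the broad guard a failed `model_call` loses its model-call classification; with the scoped guard it keeps it, and the entry's failed state has to be carried separately from the step kind.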

fix: preserve runtime wait semantics

1261c1c

haasonsaas enabled auto-merge (squash) May 16, 2026 06:09

chatgpt-codex-connector Bot reviewed May 16, 2026

cursor Bot reviewed May 16, 2026

haasonsaas merged commit 1c350a2 into main May 16, 2026
11 of 12 checks passed

haasonsaas deleted the codex/local-agent-runtime-ledger branch May 16, 2026 06:14

This was referenced May 16, 2026

fix: harden runtime wait promotion semantics #423

Merged

chore: sync public mirror from internal #420

Merged
