chore: sync public mirror from internal#395
PR Summary
Medium Risk
Overview
Extends Fermata LLM judging with pairwise rubric support (
Tweaks scripted replay validation/error messaging (tool-call
Reviewed by Cursor Bugbot for commit 8132ac6.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d703f2fdc6
  if (statement.kind === "error") {
    partial.stopReason = "error";
-   partial.errorMessage = scriptedErrorMessage(statement);
+   partial.errorMessage = statement.message;
    yield { type: "error", reason: "error", error: partial };
Keep transient scripted errors retryable
When a scripted frame emits kind: "error" with type: "transient", we now pass statement.message through verbatim, which drops the explicit retry hint that isRetryableError previously matched (e.g., "try again"). In practice this turns transient replay failures into non-retryable terminal errors unless the fixture author manually includes retry keywords, so scripted eval runs can fail immediately instead of exercising retry behavior.
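One way to keep transient scripted errors retryable is to append the hint only when the fixture message lacks it. The sketch below is illustrative: the `ScriptedErrorStatement` shape, the `try again` pattern, and both function bodies are assumptions, not the repository's actual `scriptedErrorMessage` or `isRetryableError`.

```typescript
// Hypothetical shape of a scripted error statement; the real type may differ.
interface ScriptedErrorStatement {
  kind: "error";
  type?: "transient" | "fatal";
  message: string;
}

// Assumed stand-in for isRetryableError: treats a "try again" hint as retryable.
const RETRY_HINT = /try again/i;

function isRetryableErrorSketch(message: string): boolean {
  return RETRY_HINT.test(message);
}

// Pass non-transient messages through verbatim; for transient ones, append a
// retry hint unless the fixture author already included one.
function scriptedErrorMessageSketch(statement: ScriptedErrorStatement): string {
  if (statement.type !== "transient" || isRetryableErrorSketch(statement.message)) {
    return statement.message;
  }
  return `${statement.message} (transient, try again)`;
}
```

With this shape, fixture authors keep full control over fatal messages, while transient frames stay retryable without anyone having to remember the keyword.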
      `Replay scenario ${label} frame ${frame.index} statement ${statementOffset} tool_call tool must be a non-empty string`,
    );
Reinstate validation for tool_call id type
The parser no longer rejects non-string tool_call.id values even though the scenario contract expects id?: string. That allows invalid fixtures (for example numeric IDs) to pass parsing and flow into tool-call emission/matching paths, where IDs are treated as strings, creating hard-to-debug mismatches in tool result correlation and violating the expected wire shape.
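A minimal sketch of the kind of check that could be reinstated is shown below. The function name `validateToolCallId`, its parameters, and the error wording are hypothetical; only the `id?: string` contract comes from the review comment above.

```typescript
// Hypothetical helper: validates the optional tool_call id of a raw parsed
// statement. `context` is assumed to carry the scenario/frame/statement prefix
// used in the parser's other error messages.
function validateToolCallId(
  raw: Record<string, unknown>,
  context: string,
): string | undefined {
  const id = raw.id;
  // The contract is id?: string, so an absent id is valid.
  if (id === undefined) {
    return undefined;
  }
  // Reject anything that is not a string (numbers, objects, null, ...).
  if (typeof id !== "string") {
    throw new Error(`${context} tool_call id must be a string when provided`);
  }
  return id;
}
```

Rejecting bad IDs at parse time keeps the failure next to the offending fixture line rather than surfacing later as a tool-result correlation mismatch.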
d703f2f to 8132ac6
Summary
- evalops/maestro-internal
- evalops/maestro as a generated public mirror of the private source of truth
- bf0d22c180540dad2fd97f83400b14628300b625
- 6aa0e4d7456d3c430baac1f2d6d7488a60fdbb92
- 8 file(s) to copy/update and 0 stale file(s) to delete
- 0
Source-of-truth status
Public Mirror Drift Audit
- @evalops/maestro
- https://github.com/evalops/maestro-internal@main (bf0d22c18054)
- https://github.com/evalops/maestro@main (6aa0e4d7456d)
- 8
- 0
- public_projection_has_drift
Sample Changed Paths
Guidance
Let internal main generate and merge the public sync PR before relying on public main.
Drift sample
Public-only commits since last generated sync
Validation
- sync-public-release-mirror workflow in public-tree mode
Test Plan
- sync-public-release-mirror workflow in public-tree mode
- require-internal-pr check confirms internal source PR lineage
Staged Rollout
evalops/maestro-internal@bf0d22c180540dad2fd97f83400b14628300b625, including existing hidden/evaluation surfaces, and keeps public package parity behind the established public-source-provenance gate.