diff --git a/docs/src/content/docs/reference/effective-tokens-specification.md b/docs/src/content/docs/reference/effective-tokens-specification.md
index e82b36015ee..de93e16c512 100644
--- a/docs/src/content/docs/reference/effective-tokens-specification.md
+++ b/docs/src/content/docs/reference/effective-tokens-specification.md
@@ -127,6 +127,20 @@ Any invocation triggered by another LLM call or orchestration layer. Examples in
 
 A directed structure representing all invocations associated with a single top-level request. The root node has no parent; sub-agents reference their triggering invocation as their parent.
 
+### 3.6 Execution-Graph Traversal Entities
+
+For deterministic aggregation and reporting, implementations MUST distinguish the following traversal
+entities when processing an execution graph:
+
+- **Local invocation cost**: The ET computed from the current node's own `usage.*` payload only.
+- **Descendant contribution**: The subtotal accumulated from child nodes and deeper descendants before
+  the current node's local invocation cost is added.
+- **Observed subtree**: A subtree whose invocation nodes have concrete usage payloads and therefore
+  contribute measured ET rather than fallback zeros.
+- **Unobservable subtree**: A subtree whose invocation nodes are known to exist but whose concrete
+  usage payloads are unavailable; these nodes remain part of traversal order even when their ET is
+  serialized as `0`.
+
 ---
 
 ## 4. Token Accounting Model
@@ -365,6 +379,23 @@ implementations **MUST** serialize `usage.input_tokens`, `usage.cached_input_tok
 include a `flagged` object with schema `{ "code": "UNOBSERVABLE_INVOCATION", "reason": string }`. For
 fully observed invocation nodes, implementations **MAY** omit `flagged`.
 
+**R-SAFE-007**: Before ET computation begins, implementations **MUST** validate the active model
+multiplier registry described in [Model Multiplier Registry](#model-multiplier-registry).
Registry
+validation **MUST** confirm that `version` and `reference_model` are non-empty strings and that the
+reference model has a numeric multiplier entry.
+
+**R-SAFE-008**: Every declared token class weight and model multiplier loaded from the registry
+**MUST** be finite numeric data. `NaN`, infinite values, strings, `null`, and negative multiplier
+values **MUST** be rejected before any ET output is produced.
+
+**R-SAFE-009**: If registry validation fails, implementations **MUST NOT** continue with partially
+parsed multiplier data. They **MUST** fail deterministically with an error that identifies the
+invalid registry field or model entry.
+
+**R-SAFE-010**: When a runtime override or custom multiplier map is merged with the embedded
+registry, implementations **MUST** apply the same validation rules to the merged result before using
+it for ET computation.
+
 ---
 
 ## 9. Extensibility
@@ -645,6 +676,8 @@ To keep specification and implementation synchronized:
 3. When deprecating a model, add a `deprecated` comment alongside the entry and keep it in the registry for at least one minor version before removal (R-REG-009). Update the registry `version` field on removal.
 4. Verify loading and fallback behavior in `pkg/cli/effective_tokens_test.go` (`TestModelMultipliersJSONEmbedded`, `TestResolveEffectiveWeightsDefault`, and inventory checks).
 5. Run `make build` so the embedded registry is rebuilt into the `gh-aw` binary.
+6. Re-run registry validation coverage after any registry edit so malformed multiplier entries fail
+   before ET computation paths are exercised.
 
 Conforming releases SHOULD include a test assertion for newly added model multipliers to ensure implementation-registry parity.
diff --git a/docs/src/content/docs/reference/forecast-specification.md b/docs/src/content/docs/reference/forecast-specification.md
index 438ef6bee32..b44347e4491 100644
--- a/docs/src/content/docs/reference/forecast-specification.md
+++ b/docs/src/content/docs/reference/forecast-specification.md
@@ -420,6 +420,14 @@ The implementation MUST use:
 
 - **R-MC-001**: For `λ = 0`, the implementation MUST return a projected token total of 0 for that trial without invoking either algorithm.
 - **R-FC-060**: Implementations MUST use `λ = 15` as the crossover threshold: Knuth's exact algorithm for `λ ≤ 15`, and Normal approximation only for `λ > 15`. Implementations MUST NOT raise this threshold above 15 without a specification revision, because the documented error and comparability assumptions are calibrated to this crossover.
+- **R-MC-002**: `λ` MUST be derived from `observed_runs_per_period` using the formula in §3.7 and
+  MUST be reused unchanged for every trial of the same workflow forecast. Implementations MUST NOT
+  recalculate or modify `λ` within a single forecast run.
+- **R-MC-003**: `λ` MUST be treated as a real-valued rate parameter. Implementations MUST NOT round,
+  floor, or ceil `λ` before selecting the Poisson branch or before drawing the projected run count.
+- **R-MC-004**: If the computed `λ` is negative, `NaN`, or otherwise non-finite, implementations
+  MUST replace it with `0`, emit a warning, and continue in the same zero-projection mode required
+  by **R-MC-001**.
 
 #### 7.2.2 Per-Run Token Usage (Bootstrap Resampling)
 
@@ -952,6 +960,15 @@ Sync procedure:
 2. Update corresponding Go implementation/tests in the files above in the same change.
 3. Re-run forecast tests to verify normative parity.
 
+Sync follow-up tasks:
+
+- Add an implementation-level assertion that verbose diagnostics and JSON output are derived from the
+  same `λ` value used by the Monte Carlo engine.
+- Expand forecast fixtures to cover invalid/non-finite `λ` derivation paths and zero-projection
+  fallback behavior.
+- Re-review Appendix B whenever the Poisson branch threshold or `observed_runs_per_period`
+  calculation changes.
+
 ---
 
 ## 14. Appendices
@@ -1004,7 +1021,7 @@ std_dev ≈ 40,000
 
 Knuth's exact Poisson algorithm is used for small λ (≤ 15) because it produces exact integer draws from the Poisson distribution without bias. For large λ, the Poisson distribution converges to a Normal distribution (`N(λ, λ)`), making the Normal approximation computationally efficient and sufficiently accurate.
 
-The threshold of λ = 15 is chosen as the crossover point where Normal approximation error is below 1% for the tails relevant to P10/P90 computation. Implementations MAY lower this threshold (e.g., to λ = 30) for greater accuracy at a minor performance cost.
+The threshold of λ = 15 is chosen as the crossover point where Normal approximation error is below 1% for the tails relevant to P10/P90 computation. This fixed crossover is mandated by **R-FC-060** and MUST NOT be changed without a specification revision.
 
 ### Appendix C: Bootstrap Resampling Rationale
 
diff --git a/docs/src/content/docs/reference/frontmatter-hash-specification.md b/docs/src/content/docs/reference/frontmatter-hash-specification.md
index 54321a12c28..c6731423a22 100644
--- a/docs/src/content/docs/reference/frontmatter-hash-specification.md
+++ b/docs/src/content/docs/reference/frontmatter-hash-specification.md
@@ -65,7 +65,12 @@ BFS queue order: `[root.md, a.md, b.md, shared.md]`
 `shared.md` appears twice but is processed only once (after `a.md` in queue order).
 Canonical hash input order: root → a → b → shared.
 
-This rule ensures that the hash is deterministic regardless of which traversal path first discovers a shared dependency.
+If the root import list were reversed to `[b.md, a.md]`, the canonical order would be
+`root → b → a → shared`.
+
+The first sibling encountered in BFS order always claims the shared dependency. Later duplicates are
+skipped. This rule ensures that the hash is deterministic regardless of which traversal path first
+discovers a shared dependency.
 
 ### 2. Field Selection
 
@@ -210,6 +215,29 @@ Both Go and JavaScript implementations MUST:
 - Special characters and escaping
 - All workflows in the repository
 
+### 5.1 Cross-Language Validation Protocol
+
+The project maintains Go and JavaScript implementations of the frontmatter hash algorithm. A
+conforming change to either implementation MUST follow this validation protocol:
+
+1. Update both implementations in the same change whenever the authoritative runtime algorithm or
+   normalization behavior changes.
+2. Execute the shared cross-language test vectors so each implementation validates the other
+   implementation's output, not just its own fixtures.
+3. Treat any byte-level mismatch in canonical JSON or final SHA-256 output as a release-blocking
+   failure until both implementations are aligned.
+4. Recompile workflow lock files only after the cross-language checks pass, so newly generated hashes
+   reflect a synchronized algorithm.
+
+**R-XLANG-001**: The shared validation corpus **MUST** include at least one empty-frontmatter case,
+one single-file case, one multi-level import case, and one diamond-import case.
+
+**R-XLANG-002**: A change that alters canonical JSON generation in either language **MUST** update
+the shared validation corpus in the same change.
+
+**R-XLANG-003**: CI or pre-release validation **MUST** fail if Go and JavaScript produce different
+hashes for any corpus member.
+
 ## Implementation Notes
 
 ### Go Implementation
diff --git a/docs/src/content/docs/reference/fuzzy-schedule-specification.md b/docs/src/content/docs/reference/fuzzy-schedule-specification.md
index b2efbc6ef58..ca5912b3d90 100644
--- a/docs/src/content/docs/reference/fuzzy-schedule-specification.md
+++ b/docs/src/content/docs/reference/fuzzy-schedule-specification.md
@@ -1262,6 +1262,14 @@ After changing fuzzy schedule semantics:
 2. Update parser/scatter implementation in the mapped files.
 3. Re-run parser/scatter tests to verify behavior remains deterministic.
 
+Integration coverage notes:
+
+- Conforming changes SHOULD exercise end-to-end compile coverage in addition to parser-only tests so
+  fuzzy expressions are validated after placeholder expansion into emitted cron schedules.
+- Changes that affect calendar rendering or weighted slot selection SHOULD include integration
+  assertions against `pkg/cli/compile_schedule_calendar.go` output, not only unit assertions against
+  parser helpers.
+
 ---
 
 ## 12. Calendar Output Schema
diff --git a/docs/src/content/docs/reference/mcp-scripts-specification.md b/docs/src/content/docs/reference/mcp-scripts-specification.md
index 0f5ac2780c8..ce5dad9c803 100644
--- a/docs/src/content/docs/reference/mcp-scripts-specification.md
+++ b/docs/src/content/docs/reference/mcp-scripts-specification.md
@@ -365,6 +365,27 @@ Implementations SHOULD validate:
 6. Handler captures output and errors
 7. Server returns JSON-RPC response to agent
 
+### 5.1.1 Operations Ordering
+
+A conforming implementation MUST preserve the following operation order for each tool invocation
+attempt:
+
+1. Authenticate the request and resolve the target tool name before executing any user-defined code.
+2. Apply input validation and default-value expansion before runtime startup or dependency
+   installation.
+3. Complete any required dependency installation or runtime bootstrap before invoking the tool body.
+4.
Execute the tool body exactly once for the current attempt.
+5. Sanitize stdout-derived results before classifying success, generating previews, or writing
+   oversized payloads to disk.
+6. Apply the large-output transformation in §8 only after the sanitized success payload has been
+   fully materialized for the current attempt.
+7. Classify failures and set `data.recoverable` before cleanup, then clean up ephemeral resources
+   before the server returns the final JSON-RPC response.
+
+Implementations MUST NOT reorder these steps in a way that allows unsanitized output to bypass
+§7.4 (Output Sanitization) or allows retry classification to observe partially cleaned-up state from
+a different attempt.
+
 ### 5.2 Input Validation
 
 Implementations MUST:
@@ -465,6 +486,14 @@ plus retries) permitted for a single invocation.
 5. Because tool invocations may be non-idempotent, callers **MUST** treat retry safety as a caller
    responsibility and **MUST** apply idempotency safeguards (e.g., idempotency keys or side-effect
    checks) before retrying state-changing tools.
+6. Each retry **MUST** begin from a fresh invocation attempt: callers and servers **MUST NOT** reuse
+   partially emitted stdout, partially written large-output files, or partially initialized runtime
+   state from a previous failed attempt as the result for the retry.
+7. When a recoverable attempt fails after producing side effects outside the tool process (for
+   example, creating a remote resource before timing out), callers **SHOULD** perform explicit
+   side-effect checks or compensating cleanup before retrying.
+8. Once the retry budget is exhausted, the caller **MUST** surface the final failure as terminal and
+   **SHOULD** include the total attempts made when reporting the error to operators.
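A caller-side retry loop consistent with items 6 and 8 above might look like the following Go sketch. It is only an illustration: `AttemptError`, `Invoke`, and the `maxAttempts` parameter are hypothetical names (the `Recoverable` field stands in for `data.recoverable`), and the idempotency safeguards and side-effect checks from items 5 and 7 are assumed to live inside the caller-supplied `attempt` function.

```go
package main

import (
	"errors"
	"fmt"
)

// AttemptError is a hypothetical error type carrying the recoverable flag
// that a server would report as `data.recoverable`.
type AttemptError struct {
	Recoverable bool
	Msg         string
}

func (e *AttemptError) Error() string { return e.Msg }

// Invoke calls attempt up to maxAttempts times (initial attempt plus retries).
// Output from a failed attempt is discarded, so every retry starts fresh.
// It returns the output, the number of attempts made, and a terminal error
// that reports the attempt count once the budget is exhausted.
func Invoke(maxAttempts int, attempt func(n int) (string, error)) (string, int, error) {
	if maxAttempts < 1 {
		return "", 0, errors.New("maxAttempts must be at least 1")
	}
	var lastErr error
	for n := 1; n <= maxAttempts; n++ {
		out, err := attempt(n)
		if err == nil {
			return out, n, nil
		}
		lastErr = err
		var ae *AttemptError
		if !errors.As(err, &ae) || !ae.Recoverable {
			// Non-recoverable failures are terminal immediately.
			return "", n, fmt.Errorf("terminal failure after %d attempt(s): %w", n, err)
		}
	}
	return "", maxAttempts, fmt.Errorf("retry budget exhausted after %d attempt(s): %w", maxAttempts, lastErr)
}

func main() {
	out, attempts, err := Invoke(3, func(n int) (string, error) {
		if n < 3 {
			// Any partial output returned here is ignored by Invoke.
			return "partial", &AttemptError{Recoverable: true, Msg: "timeout"}
		}
		return "ok", nil
	})
	fmt.Println(out, attempts, err)
}
```

Returning the attempt count alongside the error keeps the operator-facing report in item 8 cheap to produce without extra bookkeeping at the call site.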
 
 ---
 
@@ -831,6 +860,19 @@ When tool output exceeds 500 characters, implementations MUST:
 - `preview.first_item`: First item in array/list
 - `preview.item_count`: Number of items in collection
 
+### 8.2.1 Response Structure Norms
+
+- The large-output response **MUST** preserve the original tool result envelope and replace only the
+  oversized content payload with the `content` metadata object shown above.
+- The `content` object **MUST NOT** embed the full original payload inline once the file indirection
+  path is chosen.
+- `preview` is OPTIONAL, but when present it **MUST** summarize sanitized content from the same
+  attempt that produced `content.path`; implementations **MUST NOT** mix preview data from a prior
+  failed or retried attempt.
+- For collection-shaped outputs, `preview.first_item` and `preview.item_count` SHOULD describe the
+  collection shape without requiring the client to open the file immediately. For non-collection
+  outputs, implementations MAY omit these fields and return only `preview.schema`.
+
 ### 8.3 File Access
 
 Implementations MUST: