github · pelikhan · May 20, 2026
diff --git a/docs/src/content/docs/reference/effective-tokens-specification.md b/docs/src/content/docs/reference/effective-tokens-specification.md
@@ -127,6 +127,20 @@ Any invocation triggered by another LLM call or orchestration layer. Examples in
 
 A directed structure representing all invocations associated with a single top-level request. The root node has no parent; sub-agents reference their triggering invocation as their parent.
 
+### 3.6 Execution-Graph Traversal Entities
+
+For deterministic aggregation and reporting, implementations MUST distinguish the following traversal
+entities when processing an execution graph:
+
+- **Local invocation cost**: The ET computed from the current node's own `usage.*` payload only.
+- **Descendant contribution**: The subtotal accumulated from child nodes and deeper descendants before
+  the current node's local invocation cost is added.
+- **Observed subtree**: A subtree whose invocation nodes have concrete usage payloads and therefore
+  contribute measured ET rather than fallback zeros.
+- **Unobservable subtree**: A subtree whose invocation nodes are known to exist but whose concrete
+  usage payloads are unavailable; these nodes remain part of traversal order even when their ET is
+  serialized as `0`.
+
 ---
 
 ## 4. Token Accounting Model
@@ -365,6 +379,23 @@ implementations **MUST** serialize `usage.input_tokens`, `usage.cached_input_tok
 include a `flagged` object with schema `{ "code": "UNOBSERVABLE_INVOCATION", "reason": string }`.
 For fully observed invocation nodes, implementations **MAY** omit `flagged`.
 
+**R-SAFE-007**: Before ET computation begins, implementations **MUST** validate the active model
+multiplier registry described in [Model Multiplier Registry](#model-multiplier-registry). Registry
+validation **MUST** confirm that `version` and `reference_model` are non-empty strings and that the
+reference model has a numeric multiplier entry.
+
+**R-SAFE-008**: Every declared token class weight and model multiplier loaded from the registry
+**MUST** be finite numeric data. `NaN`, infinite values, strings, `null`, and negative multiplier
+values **MUST** be rejected before any ET output is produced.
+
+**R-SAFE-009**: If registry validation fails, implementations **MUST NOT** continue with partially
+parsed multiplier data. They **MUST** fail deterministically with an error that identifies the
+invalid registry field or model entry.
+
+**R-SAFE-010**: When a runtime override or custom multiplier map is merged with the embedded
+registry, implementations **MUST** apply the same validation rules to the merged result before using
+it for ET computation.
+
 ---
 
 ## 9. Extensibility
@@ -645,6 +676,8 @@ To keep specification and implementation synchronized:
 3. When deprecating a model, add a `deprecated` comment alongside the entry and keep it in the registry for at least one minor version before removal (R-REG-009). Update the registry `version` field on removal.
 4. Verify loading and fallback behavior in `pkg/cli/effective_tokens_test.go` (`TestModelMultipliersJSONEmbedded`, `TestResolveEffectiveWeightsDefault`, and inventory checks).
 5. Run `make build` so the embedded registry is rebuilt into the `gh-aw` binary.
+6. Re-run the registry validation tests after any registry edit so malformed multiplier entries fail
+   before ET computation paths are exercised.
 
 Conforming releases SHOULD include a test assertion for newly added model multipliers to ensure implementation-registry parity.
 
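Review note on §3.6: the distinction between local invocation cost and descendant contribution can be pictured as a post-order walk. The sketch below is illustrative only. `Node`, `Totals`, and `Aggregate` are hypothetical names that do not come from the specification or from `pkg/cli`, and the sketch assumes each node already carries its computed ET.

```go
package main

// Node is a hypothetical execution-graph node; the field names are
// illustrative and are not taken from the specification.
type Node struct {
	LocalET  float64 // ET computed from this node's own usage.* payload
	Observed bool    // false when the concrete usage payload is unavailable
	Children []*Node
}

// Totals keeps the §3.6 traversal entities separate.
type Totals struct {
	Local        float64 // local invocation cost of the current node
	Descendants  float64 // descendant contribution (children and deeper)
	Unobservable int     // unobservable nodes visited in this subtree
}

// Subtotal is the local invocation cost plus the descendant contribution.
func (t Totals) Subtotal() float64 { return t.Local + t.Descendants }

// Aggregate walks the subtree post-order: the descendant contribution is
// accumulated first, and the local cost is added only afterwards.
func Aggregate(n *Node) Totals {
	var t Totals
	if n == nil {
		return t
	}
	for _, child := range n.Children {
		ct := Aggregate(child)
		t.Descendants += ct.Subtotal()
		t.Unobservable += ct.Unobservable
	}
	if n.Observed {
		t.Local = n.LocalET
	} else {
		// The node stays in traversal order but contributes 0 ET.
		t.Unobservable++
	}
	return t
}
```

Calling `Aggregate(root).Subtotal()` yields the total for the whole graph, while the separate `Unobservable` count lets a report show how much of the tree was measured rather than zero-filled.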

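Review note on R-SAFE-007 through R-SAFE-010: a minimal validation sketch, assuming the registry has been decoded into a struct with a raw multiplier map. The `Registry` type, its field names, and both functions are hypothetical; the real registry schema and loader may differ. Token class weights would need the same checks and are omitted here for brevity.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// Registry is a hypothetical decoded form of the multiplier registry.
// Multipliers holds raw decoded values so that strings and null (nil)
// can be detected instead of being silently coerced to zero.
type Registry struct {
	Version        string
	ReferenceModel string
	Multipliers    map[string]any
}

// ValidateRegistry applies the R-SAFE-007 and R-SAFE-008 checks and returns
// an error naming the offending field or model entry (R-SAFE-009).
func ValidateRegistry(r Registry) error {
	if r.Version == "" {
		return fmt.Errorf("registry field %q must be a non-empty string", "version")
	}
	if r.ReferenceModel == "" {
		return fmt.Errorf("registry field %q must be a non-empty string", "reference_model")
	}
	if _, ok := r.Multipliers[r.ReferenceModel]; !ok {
		return fmt.Errorf("reference model %q has no multiplier entry", r.ReferenceModel)
	}
	// Sort model names so the first reported error is the same on every run.
	models := make([]string, 0, len(r.Multipliers))
	for m := range r.Multipliers {
		models = append(models, m)
	}
	sort.Strings(models)
	for _, m := range models {
		v, ok := r.Multipliers[m].(float64) // encoding/json decodes numbers as float64
		if !ok {
			return fmt.Errorf("multiplier for model %q is not numeric", m)
		}
		if math.IsNaN(v) || math.IsInf(v, 0) || v < 0 {
			return fmt.Errorf("multiplier for model %q is invalid: %v", m, v)
		}
	}
	return nil
}

// MergeAndValidate overlays runtime overrides on a copy of the base registry
// and validates the merged result before returning it (R-SAFE-010).
func MergeAndValidate(base Registry, overrides map[string]any) (Registry, error) {
	merged := Registry{
		Version:        base.Version,
		ReferenceModel: base.ReferenceModel,
		Multipliers:    make(map[string]any, len(base.Multipliers)+len(overrides)),
	}
	for m, v := range base.Multipliers {
		merged.Multipliers[m] = v
	}
	for m, v := range overrides {
		merged.Multipliers[m] = v
	}
	if err := ValidateRegistry(merged); err != nil {
		return Registry{}, err
	}
	return merged, nil
}
```

Returning an empty `Registry` on failure means a caller cannot accidentally continue with partially merged data.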
diff --git a/docs/src/content/docs/reference/forecast-specification.md b/docs/src/content/docs/reference/forecast-specification.md
@@ -420,6 +420,14 @@ The implementation MUST use:
 
 - **R-MC-001**: For `λ = 0`, the implementation MUST return a projected token total of 0 for that trial without invoking either algorithm.
 - **R-FC-060**: Implementations MUST use `λ = 15` as the crossover threshold: Knuth's exact algorithm for `λ ≤ 15`, and Normal approximation only for `λ > 15`. Implementations MUST NOT raise this threshold above 15 without a specification revision, because the documented error and comparability assumptions are calibrated to this crossover.
+- **R-MC-002**: `λ` MUST be derived from `observed_runs_per_period` using the formula in §3.7 and
+  MUST be reused unchanged for every trial of the same workflow forecast. Implementations MUST NOT
+  recalculate or modify `λ` within a single forecast run.
+- **R-MC-003**: `λ` MUST be treated as a real-valued rate parameter. Implementations MUST NOT round,
+  floor, or ceil `λ` before selecting the Poisson branch or before drawing the projected run count.
+- **R-MC-004**: If the computed `λ` is negative, `NaN`, or infinite, implementations
+  MUST replace it with `0`, emit a warning, and continue in the same zero-projection mode required
+  by **R-MC-001**.
 
 #### 7.2.2 Per-Run Token Usage (Bootstrap Resampling)
 
@@ -952,6 +960,15 @@ Sync procedure:
 2. Update corresponding Go implementation/tests in the files above in the same change.
 3. Re-run forecast tests to verify normative parity.
 
+Sync follow-up tasks:
+
+- Add an implementation-level assertion that verbose diagnostics and JSON output are derived from the
+  same `λ` value used by the Monte Carlo engine.
+- Expand forecast fixtures to cover invalid/non-finite `λ` derivation paths and zero-projection
+  fallback behavior.
+- Re-review Appendix B whenever the Poisson branch threshold or `observed_runs_per_period`
+  calculation changes.
+
 ---
 
 ## 14. Appendices
@@ -1004,7 +1021,7 @@ std_dev ≈ 40,000
 
 Knuth's exact Poisson algorithm is used for small λ (≤ 15) because it produces exact integer draws from the Poisson distribution without bias. For large λ, the Poisson distribution converges to a Normal distribution (`N(λ, λ)`), making the Normal approximation computationally efficient and sufficiently accurate.
 
-The threshold of λ = 15 is chosen as the crossover point where Normal approximation error is below 1% for the tails relevant to P10/P90 computation. Implementations MAY lower this threshold (e.g., to λ = 30) for greater accuracy at a minor performance cost.
+The threshold of λ = 15 is chosen as the crossover point where Normal approximation error is below 1% for the tails relevant to P10/P90 computation. This fixed crossover is mandated by **R-FC-060** and MUST NOT be changed without a specification revision.
 
 ### Appendix C: Bootstrap Resampling Rationale
 

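Review note on R-MC-001 through R-MC-004 and R-FC-060: the `λ` handling rules can be sketched as follows. This is not the project's Monte Carlo engine. `SanitizeLambda`, `DrawRuns`, and `NewRNG` are hypothetical names, and rounding and clamping the Normal draw to a non-negative integer is an assumption of this sketch rather than something the specification text above states.

```go
package main

import (
	"math"
	"math/rand"
)

// Crossover is the fixed Poisson branch threshold from R-FC-060.
const Crossover = 15.0

// NewRNG returns a seeded generator so forecasts can be reproduced.
func NewRNG(seed int64) *rand.Rand { return rand.New(rand.NewSource(seed)) }

// SanitizeLambda applies R-MC-004: a negative, NaN, or infinite rate becomes 0.
// The boolean reports whether the caller should emit a warning.
func SanitizeLambda(lambda float64) (float64, bool) {
	if math.IsNaN(lambda) || math.IsInf(lambda, 0) || lambda < 0 {
		return 0, true
	}
	return lambda, false
}

// DrawRuns draws one projected run count for an already sanitized lambda.
// lambda is used as a real value and is never rounded (R-MC-003).
func DrawRuns(lambda float64, rng *rand.Rand) int {
	if lambda == 0 {
		return 0 // R-MC-001: neither algorithm is invoked
	}
	if lambda <= Crossover {
		// Knuth's algorithm: multiply uniform draws until the running
		// product drops to exp(-lambda) or below.
		limit := math.Exp(-lambda)
		k, p := 0, 1.0
		for {
			p *= rng.Float64()
			if p <= limit {
				return k
			}
			k++
		}
	}
	// Normal approximation N(lambda, lambda); rounding and clamping at zero
	// are assumptions of this sketch.
	n := math.Round(lambda + math.Sqrt(lambda)*rng.NormFloat64())
	if n < 0 {
		return 0
	}
	return int(n)
}
```

To satisfy R-MC-002, a caller would sanitize `λ` once per workflow forecast and pass the same value to `DrawRuns` for every trial.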
diff --git a/docs/src/content/docs/reference/frontmatter-hash-specification.md b/docs/src/content/docs/reference/frontmatter-hash-specification.md
@@ -65,7 +65,12 @@ BFS queue order: `[root.md, a.md, b.md, shared.md]`
 `shared.md` appears twice but is processed only once (after `a.md` in queue order).  
 Canonical hash input order: root → a → b → shared.
 
-This rule ensures that the hash is deterministic regardless of which traversal path first discovers a shared dependency.
+If the root import list were reversed to `[b.md, a.md]`, the canonical order would be
+`root → b → a → shared`.
+
+The first sibling encountered in BFS order always claims the shared dependency, and later duplicates
+are skipped. This rule ensures that the hash is deterministic for a given import order, no matter
+how many traversal paths reach a shared dependency.
 
 ### 2. Field Selection
 
@@ -210,6 +215,29 @@ Both Go and JavaScript implementations MUST:
 - Special characters and escaping
 - All workflows in the repository
 
+### 5.1 Cross-Language Validation Protocol
+
+The project maintains Go and JavaScript implementations of the frontmatter hash algorithm. A
+conforming change to either implementation MUST follow this validation protocol:
+
+1. Update both implementations in the same change whenever the authoritative runtime algorithm or
+   normalization behavior changes.
+2. Execute the shared cross-language test vectors so each implementation validates the other
+   implementation's output, not just its own fixtures.
+3. Treat any byte-level mismatch in canonical JSON or final SHA-256 output as a release-blocking
+   failure until both implementations are aligned.
+4. Recompile workflow lock files only after the cross-language checks pass, so newly generated hashes
+   reflect a synchronized algorithm.
+
+**R-XLANG-001**: The shared validation corpus **MUST** include at least one empty-frontmatter case,
+one single-file case, one multi-level import case, and one diamond-import case.
+
+**R-XLANG-002**: A change that alters canonical JSON generation in either language **MUST** update
+the shared validation corpus in the same change.
+
+**R-XLANG-003**: CI or pre-release validation **MUST** fail if Go and JavaScript produce different
+hashes for any corpus member.
+
 ## Implementation Notes
 
 ### Go Implementation

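Review note on the BFS deduplication rule in the frontmatter hash diff above: a minimal sketch of the ordering, assuming imports are available as an in-memory map from file name to its ordered import list. `CanonicalOrder` is a hypothetical helper and not the project's Go or JavaScript implementation.

```go
package main

// CanonicalOrder returns files in BFS discovery order, visiting each file
// once. A shared dependency is claimed by the first file that enqueues it;
// later references to the same file are skipped.
func CanonicalOrder(root string, imports map[string][]string) []string {
	seen := map[string]bool{root: true}
	queue := []string{root}
	order := []string{}
	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		order = append(order, cur)
		for _, dep := range imports[cur] {
			if seen[dep] {
				continue // duplicate: already claimed earlier in BFS order
			}
			seen[dep] = true
			queue = append(queue, dep)
		}
	}
	return order
}
```

With `root.md` importing `[a.md, b.md]` and both importing `shared.md`, this produces root, a, b, shared. Reversing the root import list produces root, b, a, shared, which matches the two orders given in the diff.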
diff --git a/docs/src/content/docs/reference/fuzzy-schedule-specification.md b/docs/src/content/docs/reference/fuzzy-schedule-specification.md
@@ -1262,6 +1262,14 @@ After changing fuzzy schedule semantics:
 2. Update parser/scatter implementation in the mapped files.
 3. Re-run parser/scatter tests to verify behavior remains deterministic.
 
+Integration coverage notes:
+
+- Conforming changes SHOULD exercise end-to-end compile coverage in addition to parser-only tests so
+  fuzzy expressions are validated after placeholder expansion into emitted cron schedules.
+- Changes that affect calendar rendering or weighted slot selection SHOULD include integration
+  assertions against `pkg/cli/compile_schedule_calendar.go` output, not only unit assertions against
+  parser helpers.
+
 ---
 
 ## 12. Calendar Output Schema

diff --git a/docs/src/content/docs/reference/mcp-scripts-specification.md b/docs/src/content/docs/reference/mcp-scripts-specification.md
@@ -365,6 +365,27 @@ Implementations SHOULD validate:
 6. Handler captures output and errors
 7. Server returns JSON-RPC response to agent
 
+### 5.1.1 Operations Ordering
+
+A conforming implementation MUST preserve the following operation order for each tool invocation
+attempt:
+
+1. Authenticate the request and resolve the target tool name before executing any user-defined code.
+2. Apply input validation and default-value expansion before runtime startup or dependency
+   installation.
+3. Complete any required dependency installation or runtime bootstrap before invoking the tool body.
+4. Execute the tool body exactly once for the current attempt.
+5. Sanitize stdout-derived results before classifying success, generating previews, or writing
+   oversized payloads to disk.
+6. Apply the large-output transformation in §8 only after the sanitized success payload has been
+   fully materialized for the current attempt.
+7. Classify failures and set `data.recoverable` before cleanup, then clean up ephemeral resources
+   before the server returns the final JSON-RPC response.
+
+Implementations MUST NOT reorder these steps in a way that allows unsanitized output to bypass
+§7.4 (Output Sanitization) or allows retry classification to observe partially cleaned-up state from
+a different attempt.
+
 ### 5.2 Input Validation
 
 Implementations MUST:
@@ -465,6 +486,14 @@ plus retries) permitted for a single invocation.
 5. Because tool invocations may be non-idempotent, callers **MUST** treat retry safety as a
    caller responsibility and **MUST** apply idempotency safeguards (e.g., idempotency keys or
    side-effect checks) before retrying state-changing tools.
+6. Each retry **MUST** begin from a fresh invocation attempt: callers and servers **MUST NOT** reuse
+   partially emitted stdout, partially written large-output files, or partially initialized runtime
+   state from a previous failed attempt as the result for the retry.
+7. When a recoverable attempt fails after producing side effects outside the tool process (for
+   example, creating a remote resource before timing out), callers **SHOULD** perform explicit
+   side-effect checks or compensating cleanup before retrying.
+8. Once the retry budget is exhausted, the caller **MUST** surface the final failure as terminal and
+   **SHOULD** include the total attempts made when reporting the error to operators.
 
 ---
 
@@ -831,6 +860,19 @@ When tool output exceeds 500 characters, implementations MUST:
 - `preview.first_item`: First item in array/list
 - `preview.item_count`: Number of items in collection
 
+### 8.2.1 Response Structure Norms
+
+- The large-output response **MUST** preserve the original tool result envelope and replace only the
+  oversized content payload with the `content` metadata object shown above.
+- The `content` object **MUST NOT** embed the full original payload inline once the file indirection
+  path is chosen.
+- `preview` is OPTIONAL, but when present it **MUST** summarize sanitized content from the same
+  attempt that produced `content.path`; implementations **MUST NOT** mix preview data from a prior
+  failed or retried attempt.
+- For collection-shaped outputs, `preview.first_item` and `preview.item_count` SHOULD describe the
+  collection shape without requiring the client to open the file immediately. For non-collection
+  outputs, implementations MAY omit these fields and return only `preview.schema`.
+
 ### 8.3 File Access
 
 Implementations MUST:
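
Review note on retry rules 6 to 8 in the MCP scripts diff above: a caller-side retry loop might look like the sketch below. `AttemptResult`, `Invoke`, and `ErrTimeout` are hypothetical names; only the `Recoverable` flag is meant to mirror `data.recoverable`. Idempotency safeguards and side-effect checks remain the caller's responsibility and are not modeled here.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrTimeout is an illustrative sentinel for a recoverable failure.
var ErrTimeout = errors.New("tool timed out")

// AttemptResult is a hypothetical outcome of one invocation attempt.
type AttemptResult struct {
	Output      string
	Err         error
	Recoverable bool // mirrors data.recoverable in the error response
}

// Invoke runs attempt up to maxAttempts times (the initial call plus retries).
// Each call to attempt must start from scratch: nothing from a failed attempt
// is reused as the result of a later one.
func Invoke(maxAttempts int, attempt func(n int) AttemptResult) (string, error) {
	if maxAttempts < 1 {
		return "", errors.New("maxAttempts must be at least 1")
	}
	var last AttemptResult
	for n := 1; n <= maxAttempts; n++ {
		last = attempt(n)
		if last.Err == nil {
			return last.Output, nil
		}
		if !last.Recoverable {
			return "", fmt.Errorf("terminal failure after %d attempt(s): %w", n, last.Err)
		}
	}
	// Budget exhausted: surface the failure as terminal and report attempts.
	return "", fmt.Errorf("retry budget exhausted after %d attempt(s): %w", maxAttempts, last.Err)
}
```

Wrapping the last error with `%w` keeps the original cause available to `errors.Is` while the message carries the total attempt count for operators.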