Merged
33 changes: 33 additions & 0 deletions docs/src/content/docs/reference/effective-tokens-specification.md
@@ -127,6 +127,20 @@ Any invocation triggered by another LLM call or orchestration layer. Examples in

A directed structure representing all invocations associated with a single top-level request. The root node has no parent; sub-agents reference their triggering invocation as their parent.

### 3.6 Execution-Graph Traversal Entities

For deterministic aggregation and reporting, implementations MUST distinguish the following traversal
entities when processing an execution graph:

- **Local invocation cost**: The ET computed from the current node's own `usage.*` payload only.
- **Descendant contribution**: The subtotal accumulated from child nodes and deeper descendants before
the current node's local invocation cost is added.
- **Observed subtree**: A subtree whose invocation nodes have concrete usage payloads and therefore
contribute measured ET rather than fallback zeros.
- **Unobservable subtree**: A subtree whose invocation nodes are known to exist but whose concrete
usage payloads are unavailable; these nodes remain part of traversal order even when their ET is
serialized as `0`.
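
The traversal entities above can be illustrated with a minimal Go sketch. The `Node` and `Totals` types and the `Aggregate` function are hypothetical names introduced for illustration, not part of the specification or the `gh-aw` code base. The sketch assumes a post-order walk in which each node's descendant contribution is accumulated before its local invocation cost is added, and in which unobservable nodes are still visited but contribute `0`.

```go
package main

// Node is a hypothetical execution-graph node. The field names are
// illustrative assumptions, not taken from the specification.
type Node struct {
	Name     string
	LocalET  float64 // ET computed from this node's own usage payload
	Observed bool    // false when the concrete usage payload is unavailable
	Children []*Node
}

// Totals keeps the traversal entities separate for one node.
type Totals struct {
	Local        float64 // local invocation cost
	Descendants  float64 // descendant contribution
	Unobservable int     // unobservable nodes seen in this subtree
}

// Subtree returns the descendant contribution plus the local cost.
func (t Totals) Subtree() float64 { return t.Local + t.Descendants }

// Aggregate visits children in declared order, accumulates their subtree
// totals as the descendant contribution, and only then adds the local cost.
func Aggregate(n *Node) Totals {
	var t Totals
	for _, child := range n.Children {
		ct := Aggregate(child)
		t.Descendants += ct.Subtree()
		t.Unobservable += ct.Unobservable
	}
	if n.Observed {
		t.Local = n.LocalET
	} else {
		// The node stays in traversal order but contributes a fallback zero.
		t.Unobservable++
	}
	return t
}
```

Keeping `Local` and `Descendants` as separate fields means a report can show both subtotals without recomputing the walk.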

---

## 4. Token Accounting Model
@@ -365,6 +379,23 @@ implementations **MUST** serialize `usage.input_tokens`, `usage.cached_input_tok
include a `flagged` object with schema `{ "code": "UNOBSERVABLE_INVOCATION", "reason": string }`.
For fully observed invocation nodes, implementations **MAY** omit `flagged`.

**R-SAFE-007**: Before ET computation begins, implementations **MUST** validate the active model
multiplier registry described in [Model Multiplier Registry](#model-multiplier-registry). Registry
validation **MUST** confirm that `version` and `reference_model` are non-empty strings and that the
reference model has a numeric multiplier entry.

**R-SAFE-008**: Every declared token class weight and model multiplier loaded from the registry
**MUST** be finite numeric data. `NaN`, infinite values, strings, `null`, and negative multiplier
values **MUST** be rejected before any ET output is produced.

**R-SAFE-009**: If registry validation fails, implementations **MUST NOT** continue with partially
parsed multiplier data. They **MUST** fail deterministically with an error that identifies the
invalid registry field or model entry.

**R-SAFE-010**: When a runtime override or custom multiplier map is merged with the embedded
registry, implementations **MUST** apply the same validation rules to the merged result before using
it for ET computation.

---

## 9. Extensibility
@@ -645,6 +676,8 @@ To keep specification and implementation synchronized:
3. When deprecating a model, add a `deprecated` comment alongside the entry and keep it in the registry for at least one minor version before removal (R-REG-009). Update the registry `version` field on removal.
4. Verify loading and fallback behavior in `pkg/cli/effective_tokens_test.go` (`TestModelMultipliersJSONEmbedded`, `TestResolveEffectiveWeightsDefault`, and inventory checks).
5. Run `make build` so the embedded registry is rebuilt into the `gh-aw` binary.
6. Re-run registry validation coverage after any registry edit so malformed multiplier entries fail
before ET computation paths are exercised.

Conforming releases SHOULD include a test assertion for newly added model multipliers to ensure implementation-registry parity.

19 changes: 18 additions & 1 deletion docs/src/content/docs/reference/forecast-specification.md
@@ -420,6 +420,14 @@ The implementation MUST use:

- **R-MC-001**: For `λ = 0`, the implementation MUST return a projected token total of 0 for that trial without invoking either algorithm.
- **R-FC-060**: Implementations MUST use `λ = 15` as the crossover threshold: Knuth's exact algorithm for `λ ≤ 15`, and Normal approximation only for `λ > 15`. Implementations MUST NOT raise this threshold above 15 without a specification revision, because the documented error and comparability assumptions are calibrated to this crossover.
- **R-MC-002**: `λ` MUST be derived from `observed_runs_per_period` using the formula in §3.7 and
MUST be reused unchanged for every trial of the same workflow forecast. Implementations MUST NOT
recalculate or modify `λ` within a single forecast run.
- **R-MC-003**: `λ` MUST be treated as a real-valued rate parameter. Implementations MUST NOT round,
floor, or ceil `λ` before selecting the Poisson branch or before drawing the projected run count.
- **R-MC-004**: If the computed `λ` is negative, `NaN`, or otherwise non-finite, implementations
MUST replace it with `0`, emit a warning, and continue in the same zero-projection mode required
by **R-MC-001**.

#### 7.2.2 Per-Run Token Usage (Bootstrap Resampling)

@@ -952,6 +960,15 @@ Sync procedure:
2. Update corresponding Go implementation/tests in the files above in the same change.
3. Re-run forecast tests to verify normative parity.

Sync follow-up tasks:

- Add an implementation-level assertion that verbose diagnostics and JSON output are derived from the
same `λ` value used by the Monte Carlo engine.
- Expand forecast fixtures to cover invalid/non-finite `λ` derivation paths and zero-projection
fallback behavior.
- Re-review Appendix B whenever the Poisson branch threshold or `observed_runs_per_period`
calculation changes.

---

## 14. Appendices
@@ -1004,7 +1021,7 @@ std_dev ≈ 40,000

Knuth's exact Poisson algorithm is used for small λ (≤ 15) because it produces exact integer draws from the Poisson distribution without bias. For large λ, the Poisson distribution converges to a Normal distribution (`N(λ, λ)`), making the Normal approximation computationally efficient and sufficiently accurate.

-The threshold of λ = 15 is chosen as the crossover point where Normal approximation error is below 1% for the tails relevant to P10/P90 computation. Implementations MAY lower this threshold (e.g., to λ = 30) for greater accuracy at a minor performance cost.
+The threshold of λ = 15 is chosen as the crossover point where Normal approximation error is below 1% for the tails relevant to P10/P90 computation. This fixed crossover is mandated by **R-FC-060** and MUST NOT be changed without a specification revision.

### Appendix C: Bootstrap Resampling Rationale

@@ -65,7 +65,12 @@ BFS queue order: `[root.md, a.md, b.md, shared.md]`
`shared.md` appears twice but is processed only once (after `a.md` in queue order).
Canonical hash input order: root → a → b → shared.

-This rule ensures that the hash is deterministic regardless of which traversal path first discovers a shared dependency.
+If the root import list were reversed to `[b.md, a.md]`, the canonical order would be
+`root → b → a → shared`.
+
+The first sibling encountered in BFS order always claims the shared dependency. Later duplicates are
+skipped. This rule ensures that the hash is deterministic regardless of which traversal path first
+discovers a shared dependency.
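
The first-claim rule can be sketched in Go as a plain breadth-first traversal with a visited set. `CanonicalOrder` and the `imports` map are hypothetical names for illustration; the real implementation resolves imports from frontmatter rather than from an in-memory map.

```go
package main

// CanonicalOrder returns files in BFS order starting at root. imports maps
// each file to its import list in declaration order. A file is claimed the
// first time it is discovered; later duplicates are skipped.
func CanonicalOrder(root string, imports map[string][]string) []string {
	seen := map[string]bool{root: true}
	queue := []string{root}
	var order []string
	for len(queue) > 0 {
		current := queue[0]
		queue = queue[1:]
		order = append(order, current)
		for _, dep := range imports[current] {
			if seen[dep] {
				continue // already claimed by an earlier sibling
			}
			seen[dep] = true
			queue = append(queue, dep)
		}
	}
	return order
}
```

With `root.md` importing `[a.md, b.md]` and both importing `shared.md`, the result is `root.md, a.md, b.md, shared.md`. Reversing the root list to `[b.md, a.md]` yields `root.md, b.md, a.md, shared.md`.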

### 2. Field Selection

@@ -210,6 +215,29 @@ Both Go and JavaScript implementations MUST:
- Special characters and escaping
- All workflows in the repository

### 5.1 Cross-Language Validation Protocol

The project maintains Go and JavaScript implementations of the frontmatter hash algorithm. A
conforming change to either implementation MUST follow this validation protocol:

1. Update both implementations in the same change whenever the authoritative runtime algorithm or
normalization behavior changes.
2. Execute the shared cross-language test vectors so each implementation validates the other
implementation's output, not just its own fixtures.
3. Treat any byte-level mismatch in canonical JSON or final SHA-256 output as a release-blocking
failure until both implementations are aligned.
4. Recompile workflow lock files only after the cross-language checks pass, so newly generated hashes
reflect a synchronized algorithm.

**R-XLANG-001**: The shared validation corpus **MUST** include at least one empty-frontmatter case,
one single-file case, one multi-level import case, and one diamond-import case.

**R-XLANG-002**: A change that alters canonical JSON generation in either language **MUST** update
the shared validation corpus in the same change.

**R-XLANG-003**: CI or pre-release validation **MUST** fail if Go and JavaScript produce different
hashes for any corpus member.
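
As a rough illustration of R-XLANG-001 and R-XLANG-003, the Go sketch below checks a corpus whose hashes have already been produced by each implementation. The `Vector` type, the `Kind` labels, and `CheckCorpus` are hypothetical; how the JavaScript hashes are obtained (for example, from a fixture file written by the JavaScript test run) is assumed and not shown.

```go
package main

import "fmt"

// Vector is a hypothetical shared test vector with the hash each
// implementation produced for the same input.
type Vector struct {
	Name   string
	Kind   string // one of the labels in requiredKinds
	GoHash string
	JSHash string
}

// requiredKinds lists the case categories from R-XLANG-001. The label
// strings themselves are assumptions of this sketch.
var requiredKinds = []string{"empty", "single-file", "multi-level", "diamond"}

// CheckCorpus fails on the first hash mismatch and on any missing
// required case category.
func CheckCorpus(corpus []Vector) error {
	have := make(map[string]bool)
	for _, v := range corpus {
		have[v.Kind] = true
		if v.GoHash != v.JSHash {
			return fmt.Errorf("hash mismatch for %q: go=%s js=%s", v.Name, v.GoHash, v.JSHash)
		}
	}
	for _, kind := range requiredKinds {
		if !have[kind] {
			return fmt.Errorf("corpus is missing a %s case", kind)
		}
	}
	return nil
}
```

A CI job could call `CheckCorpus` and exit non-zero on any returned error, which would make a mismatch release-blocking in practice.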

## Implementation Notes

### Go Implementation
@@ -1262,6 +1262,14 @@ After changing fuzzy schedule semantics:
2. Update parser/scatter implementation in the mapped files.
3. Re-run parser/scatter tests to verify behavior remains deterministic.

Integration coverage notes:

- Conforming changes SHOULD exercise end-to-end compile coverage in addition to parser-only tests so
fuzzy expressions are validated after placeholder expansion into emitted cron schedules.
- Changes that affect calendar rendering or weighted slot selection SHOULD include integration
assertions against `pkg/cli/compile_schedule_calendar.go` output, not only unit assertions against
parser helpers.

---

## 12. Calendar Output Schema
42 changes: 42 additions & 0 deletions docs/src/content/docs/reference/mcp-scripts-specification.md
@@ -365,6 +365,27 @@ Implementations SHOULD validate:
6. Handler captures output and errors
7. Server returns JSON-RPC response to agent

### 5.1.1 Operations Ordering

A conforming implementation MUST preserve the following operation order for each tool invocation
attempt:

1. Authenticate the request and resolve the target tool name before executing any user-defined code.
2. Apply input validation and default-value expansion before runtime startup or dependency
installation.
3. Complete any required dependency installation or runtime bootstrap before invoking the tool body.
4. Execute the tool body exactly once for the current attempt.
5. Sanitize stdout-derived results before classifying success, generating previews, or writing
oversized payloads to disk.
6. Apply the large-output transformation in §8 only after the sanitized success payload has been
fully materialized for the current attempt.
7. Classify failures and set `data.recoverable` before cleanup, then clean up ephemeral resources
before the server returns the final JSON-RPC response.

Implementations MUST NOT reorder these steps in a way that allows unsanitized output to bypass
§7.4 (Output Sanitization) or allows retry classification to observe partially cleaned-up state from
a different attempt.
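
One way to make this ordering hard to violate is to drive every attempt through a single function that calls the stages in a fixed sequence. The Go sketch below is an assumption-laden illustration: the `Steps` struct, `Result`, `ErrTransient`, and `RunAttempt` are hypothetical names, and failure classification is reduced to a single sentinel error.

```go
package main

import "errors"

// Steps bundles the stages of one invocation attempt. All fields are
// hypothetical hooks supplied by the server implementation.
type Steps struct {
	Authenticate func() error           // step 1
	Validate     func() error           // step 2
	Bootstrap    func() error           // step 3
	Execute      func() (string, error) // step 4, called exactly once
	Sanitize     func(string) string    // step 5
	LargeOutput  func(string) string    // step 6
	Cleanup      func()                 // step 7, after classification
}

// Result is the outcome of one attempt.
type Result struct {
	Payload     string
	Err         error
	Recoverable bool
}

// ErrTransient is an assumed marker for failures treated as recoverable.
var ErrTransient = errors.New("transient failure")

// RunAttempt runs the stages in order, classifies any failure, and only
// then cleans up ephemeral resources before returning.
func RunAttempt(s Steps) Result {
	res := runStages(s)
	if res.Err != nil {
		res.Recoverable = errors.Is(res.Err, ErrTransient)
	}
	s.Cleanup()
	return res
}

func runStages(s Steps) Result {
	if err := s.Authenticate(); err != nil {
		return Result{Err: err}
	}
	if err := s.Validate(); err != nil {
		return Result{Err: err}
	}
	if err := s.Bootstrap(); err != nil {
		return Result{Err: err}
	}
	raw, err := s.Execute()
	if err != nil {
		return Result{Err: err}
	}
	// The large-output transformation only ever sees sanitized output.
	return Result{Payload: s.LargeOutput(s.Sanitize(raw))}
}
```

Because `LargeOutput` receives the return value of `Sanitize`, unsanitized output has no path to the large-output stage in this design.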

### 5.2 Input Validation

Implementations MUST:
@@ -465,6 +486,14 @@ plus retries) permitted for a single invocation.
5. Because tool invocations may be non-idempotent, callers **MUST** treat retry safety as a
caller responsibility and **MUST** apply idempotency safeguards (e.g., idempotency keys or
side-effect checks) before retrying state-changing tools.
6. Each retry **MUST** begin from a fresh invocation attempt: callers and servers **MUST NOT** reuse
partially emitted stdout, partially written large-output files, or partially initialized runtime
state from a previous failed attempt as the result for the retry.
7. When a recoverable attempt fails after producing side effects outside the tool process (for
example, creating a remote resource before timing out), callers **SHOULD** perform explicit
side-effect checks or compensating cleanup before retrying.
8. Once the retry budget is exhausted, the caller **MUST** surface the final failure as terminal and
**SHOULD** include the total attempts made when reporting the error to operators.
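
A caller-side retry loop consistent with items 6 and 8 might look like the Go sketch below. `AttemptError`, its `Recoverable` field (standing in for `data.recoverable`), and `Invoke` are hypothetical names, and `maxAttempts` is assumed to count the initial attempt plus retries. Idempotency safeguards and side-effect checks from items 5 and 7 are left to the `attempt` callback and are not shown.

```go
package main

import (
	"errors"
	"fmt"
)

// AttemptError is a hypothetical error type whose Recoverable field mirrors
// the data.recoverable flag of a failed invocation.
type AttemptError struct {
	Recoverable bool
	Msg         string
}

func (e *AttemptError) Error() string { return e.Msg }

// Invoke runs attempt up to maxAttempts times. Each call must build its own
// state from scratch; output returned alongside an error is discarded and
// never reused as the result of a later attempt.
func Invoke(maxAttempts int, attempt func(n int) (string, error)) (string, error) {
	if maxAttempts < 1 {
		return "", errors.New("maxAttempts must be at least 1")
	}
	var last error
	for n := 1; n <= maxAttempts; n++ {
		out, err := attempt(n)
		if err == nil {
			return out, nil
		}
		last = err
		var ae *AttemptError
		if !errors.As(err, &ae) || !ae.Recoverable {
			return "", fmt.Errorf("terminal failure after %d attempt(s): %w", n, err)
		}
	}
	// Budget exhausted: surface a terminal error with the attempt count.
	return "", fmt.Errorf("retry budget exhausted after %d attempt(s): %w", maxAttempts, last)
}
```

Wrapping the last error with `%w` keeps the original failure available to `errors.As` while the message reports the total attempts to operators.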

---

@@ -831,6 +860,19 @@ When tool output exceeds 500 characters, implementations MUST:
- `preview.first_item`: First item in array/list
- `preview.item_count`: Number of items in collection

### 8.2.1 Response Structure Norms

- The large-output response **MUST** preserve the original tool result envelope and replace only the
oversized content payload with the `content` metadata object shown above.
- The `content` object **MUST NOT** embed the full original payload inline once the file indirection
path is chosen.
- `preview` is OPTIONAL, but when present it **MUST** summarize sanitized content from the same
attempt that produced `content.path`; implementations **MUST NOT** mix preview data from a prior
failed or retried attempt.
- For collection-shaped outputs, `preview.first_item` and `preview.item_count` SHOULD describe the
collection shape without requiring the client to open the file immediately. For non-collection
outputs, implementations MAY omit these fields and return only `preview.schema`.
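
The preview norms can be illustrated with a small Go sketch that derives preview data from the sanitized payload of the current attempt only. Everything here is an assumption for illustration: the `Preview` struct does not reproduce the JSON layout of the response, the `"array"` and `"text"` schema labels are invented, and "characters" is interpreted as Unicode code points.

```go
package main

import (
	"encoding/json"
	"unicode/utf8"
)

// largeOutputThreshold is the 500-character limit for inline output.
const largeOutputThreshold = 500

// Preview is a hypothetical in-memory summary. It does not claim to match
// the wire format of the large-output response.
type Preview struct {
	Schema       string          // assumed labels: "array" or "text"
	IsCollection bool            // true when the payload is a JSON array
	FirstItem    json.RawMessage // nil for non-collection or empty output
	ItemCount    int
}

// NeedsIndirection reports whether the sanitized payload exceeds the
// threshold. Counting code points rather than bytes is an assumption.
func NeedsIndirection(sanitized string) bool {
	return utf8.RuneCountInString(sanitized) > largeOutputThreshold
}

// BuildPreview summarizes the sanitized payload of the current attempt.
// Collection fields are filled only when the payload is a JSON array.
func BuildPreview(sanitized string) Preview {
	var items []json.RawMessage
	if err := json.Unmarshal([]byte(sanitized), &items); err == nil && items != nil {
		p := Preview{Schema: "array", IsCollection: true, ItemCount: len(items)}
		if len(items) > 0 {
			p.FirstItem = items[0]
		}
		return p
	}
	return Preview{Schema: "text"}
}
```

Because `BuildPreview` takes the sanitized string as its only input, it cannot pick up data from a prior failed or retried attempt unless the caller passes the wrong payload.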

### 8.3 File Access

Implementations MUST: