Observability Cost Accounting

Nick edited this page Jun 20, 2026 · 2 revisions

Observability + Cost Accounting (v0.1.31)

Time/duration, failure/verifier/acceptance rates, and token/cost — all DERIVED from run state CW already keeps, with host-attested usage so cost is honest. Shipped in v0.1.31. Repo doc: docs/observability-cost-accounting.7.md.

The design mantra for this layer:

Derive, never collect.
A counter you cannot trust is worse than none.
Cost is attested, never measured.
Pricing is data, not kernel.
Fail closed: n/a over a fabricated zero.

The Borrowed Idea: Base-System Observability, Not a Telemetry Pipeline

CW models metrics the way a Unix base system exposes state — a DERIVED projection of records that already exist, never a separate database. It follows the same philosophy as Run Registry Control Plane and State Explosion Management: the per-run .cw/runs/<id>/state.json is the SINGLE source of truth, metrics are a read-only projection of it, and there is no telemetry pipeline, no background collector daemon, and no hidden counters. Before v0.1.31 there was no metrics module and no token or cost field anywhere; run state already carried createdAt/updatedAt/completedAt/dispatchedAt and outcome statuses on tasks, workers, verifier nodes, candidates, memberships, and feedback. This release projects those into a report — and adds an additive, host-attested usage record so cost can be accounted honestly — without changing the ResultEnvelope schema and without taking ownership of source truth.

durable run state -> pure projection -> rates/durations + attested usage -> cost under a policy
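As a rough illustration of the pipeline above, the sketch below projects durations out of a run record using only its recorded timestamps and an injected now. The RunState and TaskState shapes, the projectDurations name, and the millisecond unit are assumptions made for this example; the real deriveMetricsReport in src/observability.ts may be shaped differently.

```typescript
// Hypothetical, trimmed-down shapes: only the timestamp fields named in the text.
interface TaskState {
  id: string;
  dispatchedAt?: string; // ISO 8601
  completedAt?: string; // ISO 8601
}

interface RunState {
  id: string;
  createdAt: string;
  updatedAt: string;
  tasks: TaskState[];
}

interface DurationReport {
  runId: string;
  generatedAt: string; // the only field derived from `now`
  runDurationMs: number | null; // milliseconds are an assumption of this sketch
  taskDurationsMs: Record<string, number | null>;
}

// Whole milliseconds between two recorded timestamps,
// or null when either one is missing or unparsable.
function elapsedMs(start?: string, end?: string): number | null {
  if (start === undefined || end === undefined) return null;
  const a = Date.parse(start);
  const b = Date.parse(end);
  if (Number.isNaN(a) || Number.isNaN(b)) return null;
  return b - a;
}

// Pure projection: reads only `run` and `now`; no clock, file, or global state.
function projectDurations(run: RunState, now: Date): DurationReport {
  const taskDurationsMs: Record<string, number | null> = {};
  for (const task of run.tasks) {
    taskDurationsMs[task.id] = elapsedMs(task.dispatchedAt, task.completedAt);
  }
  return {
    runId: run.id,
    generatedAt: now.toISOString(),
    runDurationMs: elapsedMs(run.createdAt, run.updatedAt),
    taskDurationsMs,
  };
}

const sampleRun: RunState = {
  id: "run-1",
  createdAt: "2026-06-20T10:00:00.000Z",
  updatedAt: "2026-06-20T10:05:00.000Z",
  tasks: [
    { id: "t1", dispatchedAt: "2026-06-20T10:00:10.000Z", completedAt: "2026-06-20T10:01:40.000Z" },
    { id: "t2", dispatchedAt: "2026-06-20T10:02:00.000Z" }, // never completed
  ],
};

const fixedNow = new Date("2026-06-21T00:00:00.000Z");
const durationReport = projectDurations(sampleRun, fixedNow);
```

Because nothing reads the wall clock, calling projectDurations twice over the same snapshot with the same now serializes to identical JSON, and a task without completedAt yields null rather than a made-up duration.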

Mechanism vs Policy: Pricing Is Data

The runtime is MECHANISM — it records attested usage and derives rates/durations. The pricing table is POLICY — supplied as DATA (CostPolicy), not baked into the kernel. The same attested usage yields different cost reports under different pricing without touching the runtime. A bundled EXAMPLE policy lives at manifest/pricing.policy.json (USD per 1e6 tokens, an editable starting point — not a live price feed); pass --pricing <path> to use your own, or --pricing default for the bundled example. With no policy supplied, cost is unpriced/unreported, never guessed. CW never measures cost.
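To make the mechanism/policy split concrete, here is a minimal sketch that prices one set of attested usage under several policies. The field names (models, inputPerMTok, outputPerMTok), the priceUsage helper, the model id "model-a", and every price are hypothetical; only CostPolicy, defaultPrice, the per-1e6-token unit, and the four cost states come from this page. Returning unpriced with no USD figure as soon as any model lacks both an entry and a default is this sketch's own fail-closed choice.

```typescript
// Hypothetical shapes; the real CostPolicy and UsageRecord fields may differ.
interface TokenPrice {
  inputPerMTok: number; // USD per 1e6 input tokens
  outputPerMTok: number; // USD per 1e6 output tokens
}

interface CostPolicy {
  models: Record<string, TokenPrice>;
  defaultPrice?: TokenPrice;
}

interface AttestedUsage {
  model: string;
  inputTokens: number;
  outputTokens: number;
}

type CostState = "attested" | "estimated" | "unpriced" | "unreported";

interface CostReport {
  state: CostState;
  attestedUsd: number | null; // exact model matches only
  estimatedUsd: number | null; // priced via defaultPrice, kept separate
}

function usd(u: AttestedUsage, p: TokenPrice): number {
  return (u.inputTokens * p.inputPerMTok + u.outputTokens * p.outputPerMTok) / 1e6;
}

function priceUsage(usage: AttestedUsage[], policy?: CostPolicy): CostReport {
  if (usage.length === 0) return { state: "unreported", attestedUsd: null, estimatedUsd: null };
  if (policy === undefined) return { state: "unpriced", attestedUsd: null, estimatedUsd: null };

  let attestedUsd = 0;
  let estimatedUsd = 0;
  let exactCount = 0;
  let fallbackCount = 0;
  for (const u of usage) {
    const exact: TokenPrice | undefined = policy.models[u.model];
    if (exact !== undefined) {
      attestedUsd += usd(u, exact);
      exactCount += 1;
    } else if (policy.defaultPrice !== undefined) {
      estimatedUsd += usd(u, policy.defaultPrice);
      fallbackCount += 1;
    } else {
      // Assumption of this sketch: one unpriceable model makes the whole report unpriced.
      return { state: "unpriced", attestedUsd: null, estimatedUsd: null };
    }
  }
  return {
    state: fallbackCount > 0 ? "estimated" : "attested",
    attestedUsd: exactCount > 0 ? attestedUsd : null,
    estimatedUsd: fallbackCount > 0 ? estimatedUsd : null,
  };
}

const attestedUsage: AttestedUsage[] = [
  { model: "model-a", inputTokens: 1_000_000, outputTokens: 500_000 },
];

const policyA: CostPolicy = { models: { "model-a": { inputPerMTok: 3, outputPerMTok: 15 } } };
const policyB: CostPolicy = { models: { "model-a": { inputPerMTok: 1, outputPerMTok: 5 } } };
const policyFallback: CostPolicy = { models: {}, defaultPrice: { inputPerMTok: 2, outputPerMTok: 10 } };

const underA = priceUsage(attestedUsage, policyA); // attested, 10.5 USD
const underB = priceUsage(attestedUsage, policyB); // attested, 3.5 USD
const underFallback = priceUsage(attestedUsage, policyFallback); // estimated, 7 USD
const withoutPolicy = priceUsage(attestedUsage); // unpriced, no figure
```

The usage array never changes between calls; only the policy data does, which is why a pricing update needs no runtime change.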

The same attested usage also drives bounded dynamic control flow: a loop(...) with until:{kind:"budget-target", target} scales its rounds toward a token target read from this attested-only usage, while the fail-closed limits.tokenBudget cap stays the absolute backstop that can never be overshot.

Principles

  1. Derive, never collect. Every number is a projection of existing durable state. deriveMetricsReport(run, { now, policy }) is a PURE function of one run's state, an injected now, and an optional pricing policy. There is no metrics store. The only now-derived field is generatedAt; durations are computed from recorded timestamps, so a report over a fixed snapshot is byte-reproducible (eval/replay agnostic).

  2. Durations come from recorded timestamps. dispatchedAt→completedAt for tasks, createdAt→worker output recordedAt for workers, createdAt→updatedAt for the run. Nothing is sampled live.

  3. Three rates, each with its own arithmetic. The failure rate pools failed/rejected workers, failed memberships, failed un-worker-backed tasks, and unresolved (open/tasked) feedback over the total of those samples. The verifier pass rate counts verifier state nodes whose status is a pass (verified/completed/committed) against decided gates (pass + failed/rejected/blocked); pending/running gates are undecided and excluded. The candidate acceptance rate counts selected/verified candidates over all candidate records.

  4. A counter you cannot trust is worse than none. Each rate is a RateMetric carrying state (ok/n/a), count, total, rate, and per-bucket sample counts. Over zero samples the state is n/a and count/rate are null — never 0. No divide-by-zero, no partial-data rate presented as complete. Sample counts and buckets accompany every rate so a reader can audit the numerator and denominator.

  5. Cost is attested, never measured or fabricated. CW does not call the model; the host/worker does. Token usage is recorded as HOST-ATTESTED provenance — a UsageRecord accepted on the existing intake path and stored on the task or worker record, never on ResultEnvelope. CW records what the host attests verbatim and synthesizes nothing. When the host reports no usage the value is an explicit unreported — never 0, never a silent guess. The report surfaces usage.coverage (the fraction of work units carrying attested usage) and usage.unreportedUnits so the gap is visible.

  6. Attested and estimated cost are never conflated. A monetary figure is attested ONLY when derived from attested usage × a recorded pricing policy with an EXACT model match. When a model is priced by the policy's defaultPrice fallback, that portion is a SEPARATE estimated figure and the cost state becomes estimated; the two USD figures are never conflated into one. Cost states: attested (every attested model exact-matched a policy entry), estimated (some attested usage priced by the policy default/fallback), unpriced (attested usage present but no policy entry and no default), unreported (no attested usage to price).

  7. One source, every surface. The metrics verbs are declared once in src/capability-registry.ts, so the CLI and MCP surfaces are two renderings of one core (src/observability.ts) and pass the v0.1.27 parity gate — cw <cmd> --json is byte-identical to cw_<tool> (durations are integers from recorded timestamps; only the ISO generatedAt is now-derived and neutralized by the parity probe). The v0.1.30 Workbench renders a read-only metrics panel from the same payload, showing coverage and unreported/n/a honestly — it shows nothing the CLI/MCP cannot.

  8. Snapshots are rebuildable and freshness is fail-closed. The per-run report is persisted as a rebuildable, fingerprinted snapshot under .cw/runs/<id>/metrics/metrics-report.json; the cross-repo summary reports each run's snapshot freshness as valid|stale|absent against current source — fail closed, exactly like the registry.

  9. Backward compatible by construction. Usage/cost fields are additive and optional. Old runs load and report unreported cost while still yielding correct time and rate metrics from their existing timestamps and outcomes; the ResultEnvelope schema is unchanged.
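Principles 3 and 4 can be sketched for the verifier pass rate as follows. The RateMetric fields mirror the ones listed in principle 4, but the exact types, the buckets field name, the verifierPassRate helper, and the choice to bucket only decided gates are assumptions of this example.

```typescript
// Fields follow principle 4; the precise types are assumed.
interface RateMetric {
  state: "ok" | "n/a";
  count: number | null; // numerator; null when there are no samples
  total: number; // denominator (decided samples)
  rate: number | null; // count / total; null when there are no samples
  buckets: Record<string, number>; // per-status sample counts for auditing
}

const PASS_STATUSES = new Set(["verified", "completed", "committed"]);
const FAIL_STATUSES = new Set(["failed", "rejected", "blocked"]);

// Pass rate over decided gates only; pending/running gates are skipped.
function verifierPassRate(statuses: string[]): RateMetric {
  const buckets: Record<string, number> = {};
  let pass = 0;
  let decided = 0;
  for (const status of statuses) {
    const isPass = PASS_STATUSES.has(status);
    if (!isPass && !FAIL_STATUSES.has(status)) continue; // undecided, excluded
    buckets[status] = (buckets[status] ?? 0) + 1;
    decided += 1;
    if (isPass) pass += 1;
  }
  if (decided === 0) {
    // Fail closed: no samples means n/a with nulls, never a fabricated 0.
    return { state: "n/a", count: null, total: 0, rate: null, buckets };
  }
  return { state: "ok", count: pass, total: decided, rate: pass / decided, buckets };
}

const mixedGates = verifierPassRate([
  "verified", "failed", "pending", "committed", "running", "blocked",
]);
const onlyPending = verifierPassRate(["pending", "running"]);
```

In this example mixedGates has two passes over four decided gates (rate 0.5), while onlyPending reports n/a with null count and rate, so a reader can tell "nothing decided yet" apart from "everything failed".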

Commands

cw metrics show <run-id>   derived per-run report: durations, the three rates with
                           sample counts, attested usage with coverage, and cost.
                           --json for the canonical payload; --pricing <path>|default
cw metrics summary         cross-repo rollup over the v0.1.28 run registry: pooled
                           rates, summed attested usage/cost with coverage, per-app
                           and per-backend breakdowns. --scope repo|home;
                           unreadable runs counted (unreadableRuns), never dropped
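One plausible reading of "pooled rates" is that the summary adds up each run's numerator and denominator before dividing, rather than averaging per-run rates. The sketch below takes that reading as an assumption; the RunEntry shape and the poolVerifierRate helper are hypothetical, and only the unreadableRuns counter is named on this page.

```typescript
// Hypothetical per-run input: either a readable report or an unreadable run.
interface RateCounts {
  count: number | null; // null when the run's own rate is n/a
  total: number;
}

type RunEntry =
  | { kind: "readable"; runId: string; verifier: RateCounts }
  | { kind: "unreadable"; runId: string };

interface PooledRate {
  state: "ok" | "n/a";
  count: number | null;
  total: number;
  rate: number | null;
  readableRuns: number;
  unreadableRuns: number; // counted, never silently dropped
}

// Assumed pooling rule: sum of counts over sum of totals.
function poolVerifierRate(entries: RunEntry[]): PooledRate {
  let count = 0;
  let total = 0;
  let readableRuns = 0;
  let unreadableRuns = 0;
  for (const entry of entries) {
    if (entry.kind === "unreadable") {
      unreadableRuns += 1;
      continue;
    }
    readableRuns += 1;
    total += entry.verifier.total;
    count += entry.verifier.count ?? 0; // an n/a run contributes no samples
  }
  if (total === 0) {
    return { state: "n/a", count: null, total: 0, rate: null, readableRuns, unreadableRuns };
  }
  return { state: "ok", count, total, rate: count / total, readableRuns, unreadableRuns };
}

const pooled = poolVerifierRate([
  { kind: "readable", runId: "run-a", verifier: { count: 1, total: 2 } },
  { kind: "readable", runId: "run-b", verifier: { count: 9, total: 10 } },
  { kind: "unreadable", runId: "run-c" },
  { kind: "readable", runId: "run-d", verifier: { count: null, total: 0 } },
]);
```

Here the pooled rate is 10/12, whereas averaging the two per-run rates (0.5 and 0.9) would give 0.7 and overweight the smaller run; the unreadable run still shows up in unreadableRuns.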

Usage attestation rides the existing intake path:

cw result <run-id> <task-id> <file> --usage-input-tokens N --usage-output-tokens M --usage-model ID --usage-source host-attested
cw worker output <run-id> <worker-id> <file> --usage-input-tokens N --usage-output-tokens M --usage-model ID
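Once usage has been attested this way, the coverage figures from principle 5 follow directly from which work units carry a record. A minimal sketch, assuming a simplified UsageRecord shape and a hypothetical summarizeUsage helper; the coverage and unreportedUnits names come from the report fields above, while the remaining field names and the null coverage for zero units are choices of this example.

```typescript
// Simplified, assumed shape of a host-attested usage record.
interface UsageRecord {
  model: string;
  inputTokens: number;
  outputTokens: number;
  source: "host-attested";
}

// A task or worker; `usage` is optional so old runs still load.
interface WorkUnit {
  id: string;
  usage?: UsageRecord;
}

interface UsageSummary {
  state: "reported" | "unreported";
  inputTokens: number | null; // null, not 0, when nothing was attested
  outputTokens: number | null;
  coverage: number | null; // attested units / all units; null over zero units
  unreportedUnits: number;
}

function summarizeUsage(units: WorkUnit[]): UsageSummary {
  let inputTokens = 0;
  let outputTokens = 0;
  let attestedUnits = 0;
  for (const unit of units) {
    if (unit.usage === undefined) continue; // the gap is counted, never filled in
    attestedUnits += 1;
    inputTokens += unit.usage.inputTokens;
    outputTokens += unit.usage.outputTokens;
  }
  const unreportedUnits = units.length - attestedUnits;
  const coverage = units.length === 0 ? null : attestedUnits / units.length;
  if (attestedUnits === 0) {
    return { state: "unreported", inputTokens: null, outputTokens: null, coverage, unreportedUnits };
  }
  return { state: "reported", inputTokens, outputTokens, coverage, unreportedUnits };
}

const currentRun = summarizeUsage([
  { id: "task-1", usage: { model: "model-a", inputTokens: 1200, outputTokens: 300, source: "host-attested" } },
  { id: "worker-1", usage: { model: "model-a", inputTokens: 800, outputTokens: 200, source: "host-attested" } },
  { id: "task-2" }, // host attested nothing for this unit
]);

const oldRun = summarizeUsage([{ id: "task-1" }, { id: "task-2" }]);
```

In this example currentRun sums 2000 input and 500 output tokens at a coverage of 2/3 with one unreported unit, while oldRun stays unreported with null token totals instead of zeros.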

MCP hosts call cw_metrics_show and cw_metrics_summary with identical payloads.

Why It Matters

Collected metrics drift from the truth, fabricate zeros over empty samples, and quietly conflate measured cost with guessed cost. CW does none of that: every rate can be audited against its own numerator and denominator and fails to n/a rather than lying, and cost is host-attested usage priced by a data policy, never measured by CW, with the unreported gap always visible. The result is honest accounting that stays byte-reproducible across CLI, MCP, and Workbench.
