feat: sandbox tags and usage visibility API #184

Merged
ZIJ merged 18 commits into main from feat/sandbox-tags-and-usage
Apr 22, 2026

@ZIJ ZIJ commented Apr 22, 2026

Ships

Five endpoints for per-sandbox and per-tag spend attribution, plus tags + tagsLastUpdatedAt on existing sandbox reads. Units are GB-seconds — Stripe remains the pricing source of truth.

GET /api/usage                    aggregator: groupBy=sandbox | tag:<key>
GET /api/tags                     org-wide tag-key discovery
GET /api/sandboxes/{id}/usage     per-sandbox drilldown
GET /api/sandboxes/{id}/tags      read current tags
PUT /api/sandboxes/{id}/tags      full-replace

GET /api/sandboxes and GET /api/sandboxes/{id} responses gain tags + tagsLastUpdatedAt (additive, all four code paths).

Tag storage: new table, PK on (org_id, sandbox_id, key)

Sandbox IDs are short sb-xxxxxxxx strings generated independently per create path and not schema-unique across orgs. A (sandbox_id, key) PK plus sandbox-id-only lookups would let a single cross-org ID collision alias tag state across tenants, or deny the rightful owner access to their own sandbox via ownsSandbox. Every tag read, write, and join — including the session lookups used by ownsSandbox and the drilldown — scopes on (org_id, sandbox_id). TestBuildUsageQuery_JoinsIncludeOrgID pins the SQL-level invariant.

sandbox_sessions.metadata is left untouched — semantically it's a per-session create-time snapshot, not sandbox-level tags. No SDK surface for it changes.

Math: bit-for-bit identical to GetOrgUsage

Per-sandbox, per-tag, and untagged sums must reconcile to the org rollup the Stripe pipeline reports. Inline GB-second math mirrors GetOrgUsage and DiskOverageGBSeconds verbatim — same COALESCE(ended_at, LEAST(now(), $to)) - GREATEST(started_at, $from) idiom, same max(0, disk_mb - 20480) / 1024 * duration formula. The pgfixture test pins the full chain within 1e-6:

GetOrgUsage rollup == ExecuteOrgTotals
                   == Σ ExecuteUsageQuery(groupBy=sandbox)
                   == Σ ExecuteUsageQuery(groupBy=tag:<k>) + ExecuteUntaggedTotals

Any future change to the rollup math must change this code in lockstep.

Attribution: retagging rewrites history

Queries join live sandbox_tags. A retag re-buckets all of that sandbox's historical spend under the new tags, which is fine for ops but hazardous for chargebacks. tagsLastUpdatedAt (max updated_at across the sandbox's tag rows) is surfaced on every sandbox-level response so dashboards can annotate edits. No snapshot audit in v1; if one is needed, the upgrade path is a separate sandbox_tag_changes table.

Freshness: fresh on clean shutdown, lagged until worker restart otherwise

ReconcileWorkerSessions closes zombie scale events at worker startup, not on a heartbeat — a silently-dead worker leaves events open and COALESCE(ended_at, now()) accruing until it restarts. This matches what GetOrgUsage and the Stripe rollup already do today; we surface the existing behavior through new endpoints rather than introducing a new risk. Called out in the /usage reference docs.

Not in v1

Dollars (Stripe), time-series bucketing (adds as ?interval=1d), multi-dim groupBy (extends to comma-separated), filter trees, historical tag snapshotting / audit log, GET /sandboxes?tag=k:v, CSV export, CLI/dashboard. The endpoint shape accommodates each additively.

PUT is full-replace only — no merge mode. One filter[<dim>] param per dimension; comma-OR values inside, AND across dimensions, repeats return 400. Aligns the HTTP surface with the map-shaped SDK.

Testing

  • Go: go test ./internal/db/ ./internal/api/ covers query builder SQL shape, tenancy-predicate pinning, parsing, cursor round-trip, input validation, PUT size/charset/reserved-prefix rules, and handler smoke. All pass.
  • TS SDK: npx tsc --noEmit under sdks/typescript/ is clean.
  • Python SDK: module imports and dataclasses load cleanly.
  • Reconciliation invariant (internal/db/usage_query_pgfixture_test.go, build tag pgfixture): deterministic fixture covering tagged + untagged + still-running + pre-window sandboxes, asserts the four-way equality chain above. Gated on TEST_DATABASE_URL.
TEST_DATABASE_URL=postgres://... go test -tags=pgfixture \
  ./internal/db/ -run Reconciliation -v

Test plan

  • go test ./internal/db/ ./internal/api/ green in CI
  • Migration 026_sandbox_tags applies cleanly on an environment with existing sandbox_sessions data
  • TS SDK npx tsc --noEmit clean in CI; Python SDK imports
  • pgfixture reconciliation test wired into CI and passing — merge gate

🤖 Generated with Claude Code

ZIJ and others added 3 commits April 22, 2026 21:22
The directory is no longer a one-off WIP experiment — it's where
agent-facing design docs and completed plans live. Renaming drops the
"wip" qualifier and introduces the durable layout:

  .agents/design/   — in-flight design docs
  .agents/done/     — shipped / superseded docs, kept for time-travel

docs-plan.md moves to done/ as a completed plan; future design docs
land in design/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Capture the prefix pattern that's already dominant on origin so future
agents (and humans) default to it instead of following whichever branch
they happened to land on.

Current state on origin at time of writing:
  feat/*  — 26 branches
  fix/*   — 25 branches
  docs/*  — 17 branches

Plus a long tail of unprefixed kebab-case names and one-offs. The rule
prefers the three prefixes for any new branch, and explicitly warns
off personal-initials prefixes (ig/..., etc.) because they make
in-flight work harder to find by topic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
API-only surface for attributing sandbox spend to customer-defined
groupings (team, env, customer, etc.) and drilling down to individual
sandboxes. Unit is GB-seconds — Stripe stays the source of truth for
dollars.

Key decisions captured in the doc:

- Tags reuse the existing `sandbox_sessions.metadata` JSONB column.
  Audit showed the field is persisted but never queried today, so we
  rescope it as tags rather than introducing a parallel Tags field
  that would overlap semantically. Reusing the column breaks no SDKs.

- Attribution model: current tags drive all historical spend.
  Retagging re-buckets. Chose this over per-event tag snapshotting
  for simplicity; revisit if stable historical attribution is asked
  for.

- Unit: GB-seconds (memory + disk overage). Disk exposed as overage
  only (matches what's actually billed). CPU not exposed — it's
  deterministic from memory (1 vCPU per 4GB).

- API shape: one `GET /usage` aggregator with `groupBy` and
  `filter[]` query params treating dimensions as data (sandbox,
  tag:<key>, future status/template/region), plus `GET /tags` for
  discovery, `GET /sandboxes/{id}/usage` as a drilldown, and
  `PATCH /sandboxes/{id}` for tag updates.

Alternatives considered and rejected:

- Narrow per-dimension endpoints (/usage/by-sandbox, /usage/by-tag).
  Privileges tags over other dimensions in the URL, forces a new
  route per future dimension.

- Composable query DSL (POST /usage/query with full filter trees and
  nested aggregations, ELK/Prometheus shape). Reinvents a query
  engine for a problem without multi-dim demand yet.

The middle — REST aggregator with dimensioned query params, Stripe
list-endpoint style — keeps the door open to both without paying up
front for either.

Explicit non-goals for v1: dollars, time series / bucketing,
multi-dim group-by, filter trees, per-event snapshotting, tag audit
log, CSV export, dashboard, CLI. Each noted as additive-later, not
breaking.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

A pass from a fresh reviewer surfaced a load-bearing flaw and several
real issues in the first cut. Revisions:

- Tags move to a new `sandbox_tags` table, off
  `sandbox_sessions.metadata`. The first cut assumed one-session-per-
  sandbox. Verified: a sandbox owns MANY `sandbox_sessions` rows
  (every get sorts `ORDER BY started_at DESC LIMIT 1`), each with
  metadata captured at that session's create call. Reusing that
  column would silently require a "latest session" subquery on every
  tag query and conflict with PATCH mutation semantics (overwriting
  an otherwise-immutable create-time snapshot). New table:
  (org_id, sandbox_id, key, value, updated_at) PK (sandbox_id, key),
  indexed on (org_id, key, value). Row-per-tag keeps grouping SQL
  trivial and enables tag-count limits. `sandbox_sessions.metadata`
  left intact and unused; no SDK surface for it changes.

- Endpoint scope narrowed from `PATCH /sandboxes/{id}` to `PUT
  /sandboxes/{id}/tags` + `GET /sandboxes/{id}/tags`. Closes the
  door on PATCH feature-creep for alias/memory/etc, each of which
  has its own semantic issues we haven't designed.

- Retagging addressed explicitly. Attribution stays live (retag
  rewrites all history), with `tagsLastUpdatedAt` surfaced in every
  sandbox-level response so dashboards can annotate edits. Full
  snapshotting / audit log named as an upgrade path, out of v1
  scope. Previously this was a single sentence; now it's a whole
  section.

- Response shape normalized: variable key `"tag:team"` replaced with
  `{tagKey, tagValue}`. Untagged bucket moved from a null-valued
  item to a sibling `untagged` field — typed SDKs no longer have to
  null-check items. `sandboxCount` removed when groupBy=sandbox
  (always 1).

- Status values reconciled with the actual state machine: removed
  "destroyed" (not a real state), kept
  running|hibernated|stopped|error.

- GB-second math explicitly tied to `GetOrgUsage` and
  `DiskOverageGBSeconds` bit-for-bit. Reconciliation test required:
  Σ by-sandbox = Σ by-tag + untagged = GetOrgUsage(org) within float
  epsilon.

- Data freshness claim qualified: fresh on clean shutdown,
  reconciliation-lagged on crashes (worker heartbeat closes zombie
  scale events via `ReconcileWorkerSessions`).

- Validation rules added: ≤50 tags, 128/256 char limits, reserved
  `oc:` key prefix for future system-set tags.

- Query guardrails added: ≤90-day window, ≤500 limit, 10s
  handler timeout.

- Added SDK + docs implementation notes. Added authz note (inherits
  org-scope, no sub-org visibility — pre-existing pattern).

- Trimmed the narrow-endpoints-vs-composable-DSL alternatives
  discussion from a full subsection to three sentences. The prose
  was re-litigating the chat.

- `GET /sandboxes?tag=k:v` filter explicitly named as a follow-up
  PR, not v1 scope — obvious next ask once tags exist.

Open questions list updated: tag-destroy behavior, optional PUT
merge-mode.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ZIJ and others added 2 commits April 22, 2026 22:30
Working doc for the sandbox-tags-and-usage delivery. Captures the
plan of record and calls out issues the design doesn't flag that
surfaced during code exploration. Flagged:

- alias lives in sandbox_sessions.config JSONB, not a column — the
  design's responses show a top-level alias; plan is to extract via
  config->>'alias'.
- ReconcileWorkerSessions runs at worker startup, not on a heartbeat,
  so the design's "reconciliation-lagged (minutes)" framing is
  misleading (inherited from the existing billing pipeline, not new).
- Repo has no Postgres test fixture, so the reconciliation invariant
  test needs a pgfixture-tagged integration path gated on
  TEST_DATABASE_URL, plus pure-Go query-builder tests for the rest.
- `:` in tag keys collides with tag:<key> syntax; resolved by
  SplitN(s, ":", 2).
- Additive tags/tagsLastUpdatedAt response fields land in four read
  paths (getSandbox, getSandboxRemote, listSandboxes,
  listSandboxesRemote), not two.

Also records the three design-level open-question decisions
(firstStartedAt/lastEndedAt from sandbox_sessions; tag rows left on
destroy; no PUT merge-mode) and the planned commit order.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fresh code reading against the referenced primary sources
(internal/db/usage.go, cmd/worker/main.go) surfaced one load-bearing
inaccuracy and let the three open questions be closed. Revisions:

- Freshness section rewritten. Previous wording framed staleness as
  "reconciliation-lagged" with an implied heartbeat sweep via
  ReconcileWorkerSessions. Verified: that function is called once at
  worker boot (cmd/worker/main.go:253), not on a heartbeat. A
  silently-dead worker that never restarts leaves scale events open
  and COALESCE(ended_at, now()) accruing indefinitely. Same behavior
  as GetOrgUsage and the Stripe pipeline today — we expose the
  existing risk, we don't introduce a new one. Section now says
  "lagged until worker restart" and names the Stripe pipeline as the
  precedent.

- Open-question 1 (firstStartedAt / lastEndedAt source) closed in
  favour of sandbox_sessions MIN(started_at) / MAX(COALESCE(stopped_at,
  now())), clamped to the query window. Rationale folded into the
  /sandboxes/{id}/usage section: stable across scale-event churn and
  matches the user's "when did this sandbox exist" intent rather than
  "when was it billed."

- Open-question 2 (tag-row behaviour on destroy) closed as "leave,"
  with the minor /tags key-count overstatement accepted. Rationale
  folded into the /sandboxes/{id}/usage section alongside the
  "works for torn-down sandboxes" note.

- Open-question 3 (PUT ?mode=merge) closed as "no." Rationale folded
  into the PUT section: one semantic per verb, atomicity stays
  simple, GET+PUT covers partial updates.

- Validation section gained a bullet on `:`-in-keys parsing
  (SplitN(s, ":", 2) — everything after the first `:` is the tag
  key). Resolves ambiguity between user-namespaced keys like
  "team:payments" and the tag:<key> syntax in groupBy / filter.

- PUT section gained a one-line empty-tags response contract:
  `tags` is `{}` (not null), `tagsLastUpdatedAt` is `null` — typed
  SDKs avoid a null map-check.

- "Open questions for implementation" block removed entirely; all
  three items now have in-body answers.
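
The `:`-in-keys rule from the validation bullet above can be sketched in Go like this. `parseDimension` is a hypothetical name chosen for the example; only the `SplitN(s, ":", 2)` call and the `sandbox` / `tag:<key>` dimension names come from the design.

```go
package main

import (
	"fmt"
	"strings"
)

// parseDimension splits a groupBy / filter dimension into its kind and,
// for tags, the tag key. Everything after the first ':' is the key, so
// user-namespaced keys such as "team:payments" survive intact.
// Hypothetical helper; the name and error text are assumptions.
func parseDimension(s string) (kind, tagKey string, err error) {
	if s == "sandbox" {
		return "sandbox", "", nil
	}
	parts := strings.SplitN(s, ":", 2)
	if len(parts) == 2 && parts[0] == "tag" && parts[1] != "" {
		return "tag", parts[1], nil
	}
	return "", "", fmt.Errorf("unsupported dimension %q", s)
}
```

Here `tag:team:payments` parses as kind `tag` with key `team:payments`, whereas a bare `tag` or `tag:` is rejected.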

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ZIJ and others added 7 commits April 22, 2026 22:35
New table owning per-sandbox customer-defined tags for the upcoming
usage-attribution endpoints (design: .agents/design/sandbox-tags-and-
usage.md). Row-per-tag with PK (sandbox_id, key) so PUT semantics stay
cheap and grouping SQL is a single join. org_id is denormalized so
GET /tags can filter on (org_id, key) without joining sandbox_sessions
— matches the indexing pattern already used in sandbox_scale_events.

No change to sandbox_sessions.metadata — left intact and unused, as
the design spelled out.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
GetSandboxTags, GetSandboxTagsMulti, ReplaceSandboxTags,
ListOrgTagKeys. The multi variant avoids an N+1 when hydrating tags
onto GET /sandboxes list responses.

ReplaceSandboxTags diffs against current state before mutating so
no-op PUTs don't bump updated_at. tagsLastUpdatedAt is the signal
dashboards use to annotate retagging events (design: "Attribution is
live"); an idempotent PUT would confuse that signal if it refreshed
the timestamp on rows whose value didn't change.
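
A minimal Go sketch of the diff-before-write idea described above, assuming tags are held as plain string maps. `tagDiff` and `diffTags` are illustrative names, not the store's actual types; the point is that rows whose value is unchanged never appear in the write set, so their updated_at is left alone.

```go
package main

import "sort"

// tagDiff is the minimal write set needed to turn current into desired.
type tagDiff struct {
	Upsert map[string]string // new keys, or keys whose value changed
	Delete []string          // keys present today but absent from the PUT body
}

// diffTags compares the stored tags with the full-replace payload.
// Unchanged rows are omitted, so an idempotent PUT produces an empty diff.
func diffTags(current, desired map[string]string) tagDiff {
	d := tagDiff{Upsert: map[string]string{}}
	for k, v := range desired {
		if old, ok := current[k]; !ok || old != v {
			d.Upsert[k] = v
		}
	}
	for k := range current {
		if _, ok := desired[k]; !ok {
			d.Delete = append(d.Delete, k)
		}
	}
	sort.Strings(d.Delete) // deterministic order for logs and tests
	return d
}

// isNoop reports whether the PUT would change nothing.
func (d tagDiff) isNoop() bool {
	return len(d.Upsert) == 0 && len(d.Delete) == 0
}
```

A caller could skip the write transaction entirely when `isNoop()` is true, which keeps tagsLastUpdatedAt meaningful as an edit signal.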

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pure-Go builder for the /usage aggregator: BuildUsageQuery,
BuildUntaggedTotals, BuildOrgTotals. Split into separate queries so
the items page and the untagged sibling bucket can paginate
independently — the design puts untagged in a sibling field, not in
items, and keyset pagination doesn't want a sentinel row interleaved.

GB-second math mirrors GetOrgUsage and DiskOverageGBSeconds
bit-for-bit — same `COALESCE(ended_at, LEAST(now(), $to)) -
GREATEST(started_at, $from)` idiom, same `max(0, disk_mb - 20480) /
1024` formula. Reconciliation depends on the per-sandbox /
per-tag sums adding back to GetOrgUsage totals by linearity; any
change here requires an audit against those two call sites.

Cursor is base64(JSON {v, t}) — sort value + tiebreaker (sandbox_id
or tag value). Keyset filter lives in an outer WHERE over a subquery,
not HAVING — Postgres disallows output aliases in HAVING.

Pure-Go tests exercise: groupBy sandbox / tag, tag key containing `:`
(SplitN rule from the design), value-present and key-absent filters,
default and disk-overage sorts, cursor round-trip, all input-
validation error paths. No DB required. The reconciliation invariant
itself still needs a pgfixture-tagged integration test (tracked).

SandboxUsageWindow covers the drilldown endpoint: per-sandbox GB-
seconds plus first/last session bounds from sandbox_sessions (the
closed decision on that open question — more stable than scale-event
edges across scale churn).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five new endpoints from the design:

  GET  /api/usage                       # aggregator
  GET  /api/tags                        # org-wide key discovery
  GET  /api/sandboxes/:id/usage         # per-sandbox drilldown
  GET  /api/sandboxes/:id/tags          # read tag set
  PUT  /api/sandboxes/:id/tags          # full-replace

Tenancy enforcement lives in the handler (ownsSandbox), not in the
sandbox_tags PK — (sandbox_id, key) is sufficient because the PUT
path reads sandbox_sessions.org_id and rejects mismatches as 404
(does not leak existence across orgs). Design F3: PK stays narrow
intentionally.

PUT body is decoded as map[string]json.RawMessage so nested values
fail fast with a typed 400 instead of coercing to string. Validation
applies:
- ≤ 50 tags
- key 1..128 chars, [A-Za-z0-9_.\-:]+  (`:` allowed for user
  namespacing — see SplitN parsing rule in the design)
- value ≤ 256 chars
- `oc:` prefix reserved

Empty-tag responses render as `"tags": {}`, `"tagsLastUpdatedAt":
null` — typed SDKs avoid a null map-check.

/usage delegates SQL to the query builder; handler hydrates items
with alias/status/tags on groupBy=sandbox (alias extracted from
sandbox_sessions.config JSONB per design F1). A per-row
GetSandboxSession call is N+1 but bounded by the 500-row limit and
the 10s handler timeout — fine for v1; promote to a batched store
method if it shows up hot.

isUserInputError inspects BuildUsageQuery's well-known error strings
to map validation failures to 400 instead of 500. Brittle but
self-contained; only the builder emits these.

Pure-Go validation tests cover the PUT guardrails end-to-end. A
handler-level HTTP test requires a Postgres fixture the repo
doesn't have yet — same gap as the reconciliation integration test
(tracked).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Additive fields on four read paths — getSandbox and listSandboxes in
both local/combined and server/remote modes (design F8: easy to
miss). Responses now carry a tags map (always, empty when no tags
are set) and tagsLastUpdatedAt (null when empty).

Extraction factored into sandbox_tags_hydrate.go:
- mergeTagsInto for remote paths that already build response maps
- withTagsHydrated / withTagsHydratedList for local paths that
  return sandbox types as-is — marshal-through-JSON preserves the
  original shape without coupling this file to pkg/types.Sandbox.

Fail-soft semantics: a tag query error never fails the primary
sandbox read — the response comes back without the tag fields
rather than 500. Tags are informational; making them load-bearing
for the list/get endpoints would be a regression.

listSandboxesRemote batches via GetSandboxTagsMulti — one query per
page, not N+1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Thin wrappers over the five new endpoints. Types mirror the server
responses; dataclasses on the Python side, interfaces + classes on
the TS side. Convenience wrappers live in the SDK layer, not the
server — bySandbox / byTag / forSandbox are sugar around the single
GET /usage route with different groupBy values.

TS: type-checks under tsc --noEmit. Python: imports cleanly.

Filter param handling preserves the repeatable `filter[tag:<k>]=...`
shape — URLSearchParams in TS and list-of-tuples in Python both
retain duplicates. Tag keys containing `:` pass through as literal
substrings, matching the server's SplitN parsing rule.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five new pages under docs/api-reference/, plus additive-field notes
on the existing GET /sandboxes and GET /sandboxes/{id} pages. mint.json
wires these into two places:

- api-reference/sandboxes/{get,set}-tags.mdx appended to the existing
  "Sandboxes" group (PUT/GET are sandbox-scoped, so they belong with
  the other sandbox verbs).
- A new "Usage" navigation group holds api-reference/usage/{get-
  usage,get-sandbox-usage,list-tags}.mdx — these are org-level
  reporting, a distinct concept from sandbox lifecycle.

Docs mention the two reader-facing caveats from the design: retagging
re-buckets a sandbox's historical attribution (with tagsLastUpdatedAt
as the signal), and after a worker dies silently, freshness lags until
that worker restarts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ZIJ and others added 3 commits April 22, 2026 22:55
Update the design and working notes after critical review.

Key changes:
- move sandbox_tags keying to (org_id, sandbox_id, key)
- require org-scoped tag joins and reads instead of relying on sandbox ID uniqueness
- narrow filter semantics to one param per dimension with comma-separated OR values
- require clamped drilldown timestamps and explicit to<=from rejection
- call out all four sandbox read paths, especially listSandboxesRemote
- raise the testing bar so reconciliation against GetOrgUsage must be proven against real Postgres
- make SDK/docs parity part of review readiness

The main reason for the schema change is tenancy safety: sandbox IDs are short sb-xxxxxxxx values today, so sandbox_id-only tag keying is not a safe boundary.
The design and work docs were updated to tighten four areas that the
first cut did not handle. All four are coordinated (they change the
same method signatures and code paths), so they land together.

F3 — org_id in the keyspace, not just in a lookup index.
  Sandbox IDs are short `sb-xxxxxxxx` strings generated independently
  per create path (internal/api/sandbox.go, internal/qemu/manager.go,
  internal/firecracker/manager.go), a short 32-bit space with no
  schema-enforced cross-org uniqueness. A `(sandbox_id, key)` PK plus
  sandbox_id-only lookups would let a single ID collision alias tag
  state across tenants. Changes:
  - Migration 026 PK is now (org_id, sandbox_id, key); the lookup
    index on (org_id, key, value) is unchanged.
  - GetSandboxTags / GetSandboxTagsMulti / ReplaceSandboxTags all
    take orgID and scope every read/write on (org_id, sandbox_id).
  - Every sandbox_tags join in the usage query builder (filter joins
    for /usage, the groupBy=tag join, the untagged-bucket join)
    adds `x.org_id = e.org_id` alongside the sandbox_id predicate.
  - All handler call sites plumb orgID from auth.GetOrgID down to the
    store. Hydration helpers (mergeTagsInto, withTagsHydrated*) take
    orgID too — hydration must not cross tenancy.
  - Added TestBuildUsageQuery_JoinsIncludeOrgID pinning the invariant
    at the SQL level so a future refactor can't drop the org scope
    silently. Migration has not been applied in prod (feature branch),
    so editing 026 in place rather than stacking a DROP/RECREATE fix.

F9 — drilldown timestamps clamped, invalid window rejected.
  SandboxUsageWindow now wraps MIN(started_at) in GREATEST(…, $from)
  and MAX(COALESCE(stopped_at, now())) in LEAST(…, $to), so
  firstStartedAt / lastEndedAt can never leak outside the query
  window. The /sandboxes/:id/usage handler rejects to <= from with
  400 to match the aggregator path.

F10 — one filter param per dimension.
  The design was narrowed: `filter[tag:<key>]` may only appear once;
  multiple OR values go comma-separated within that one param.
  Accepting repeated filter[] keys would make the SDK map-shaped API
  diverge from the HTTP surface and buy no additional expressiveness.
  parseUsageQuery now returns 400 when it sees more than one value
  for a `filter[...]` key.

F11 — batched session hydration for groupBy=sandbox.
  Under a 500-row × 10s handler budget, issuing one GetSandboxSession
  per result row is a silent multiplier. Added
  GetLatestSandboxSessionsMulti — a single DISTINCT ON
  (sandbox_id)-ordered-by-started_at DESC query that returns latest
  session per ID. hydrateSandboxUsageItems now does two batched
  reads (tags + sessions) and no per-row lookups.

Builder tests still pass; the new org-scope test was added on top,
not in place of existing ones.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The design and parseUsageQuery now reject duplicate filter[] params
— one param per dimension, comma-separated OR values inside. The
earlier copy on the SDK type and the /usage reference page still
said "repeatable, AND'd," which no longer matches server behavior.
Tightening the wording so readers don't write SDK calls that the
server will 400.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ZIJ and others added 2 commits April 22, 2026 23:13
Second review against the branch surfaced four more items; all now
addressed.

F12 — org-scoped session lookups throughout the feature.
  The F3 pass put tag storage on (org_id, sandbox_id, key) but left
  ownsSandbox and the drilldown handler reading sandbox_sessions by
  sandbox_id alone via GetSandboxSession. Sandbox IDs aren't globally
  unique, so on a cross-org ID collision the old code could return
  another org's session row — causing ownsSandbox to deny the
  rightful owner, or hydrating the drilldown's alias/status from the
  wrong tenant. Added GetSandboxSessionInOrg as the org-scoped
  primitive and swapped both call sites. ownsSandbox's 404 rationale
  is unchanged (same response for "doesn't exist" and "not your
  org"), but now it actually rejects the collision case correctly.

F13 — SandboxUsageWindow no longer swallows DB errors.
  The session-bounds query treated any scan failure as "no sessions
  in window" and returned nil timestamps. That downgraded real
  DB/context failures into silent 200 responses with missing fields.
  Rewrote the scan: COALESCE(BOOL_OR(...), false) so an empty result
  set produces a normal successful scan with NULL pointer fields,
  and a scan error now means a genuine failure and is returned up
  the stack.

F14 — reconciliation invariant has an executable test.
  internal/db/usage_query_pgfixture_test.go under build tag
  `pgfixture`, gated on TEST_DATABASE_URL. Seeds a fresh org with a
  deterministic fixture (two tagged sandboxes including one with
  disk overage and one still running, one untagged sandbox, and one
  sandbox entirely before the window) and asserts the full chain:

    GetOrgUsage rollup == ExecuteOrgTotals
                       == Σ(ExecuteUsageQuery bySandbox items)
                       == Σ(ExecuteUsageQuery byTag items)
                          + ExecuteUntaggedTotals

  within 1e-6 float epsilon. Runs with:
    TEST_DATABASE_URL=... go test -tags=pgfixture ./internal/db/ \
      -run Reconciliation -v
  Still needs CI wiring to be a merge gate; the test itself now
  exists so "we don't have the test" stops being the blocker.

F15 — Python SDK docstring matches the narrowed filter contract.
  sdks/python/opencomputer/usage.py _build_params docstring was
  carrying the old "repeatable, httpx preserves duplicates"
  language. Rewrote to match the server: one param per dimension,
  comma-separated OR values, repeated keys rejected. Runtime surface
  (dict[str, str]) already matched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ZIJ ZIJ marked this pull request as ready for review April 22, 2026 22:18
@ZIJ ZIJ requested review from breardon2011 and motatoes April 22, 2026 22:24
@ZIJ ZIJ merged commit 8f2e4b1 into main Apr 22, 2026
3 checks passed