Skip to content

feat(metering): VM/snapshot lifecycle event recorder#56

Merged
CMGS merged 13 commits into
masterfrom
feat/lifecycle-metering
May 20, 2026
Merged

feat(metering): VM/snapshot lifecycle event recorder#56
CMGS merged 13 commits into
masterfrom
feat/lifecycle-metering

Conversation

@CMGS
Copy link
Copy Markdown
Contributor

@CMGS CMGS commented May 20, 2026

Summary

Cocoonv2 emits raw VM/snapshot lifecycle endpoints to an append-only ledger; the recorder is a leaf subpkg with zero internal cocoon imports. Tenant attribution and pricing live above this layer (vk-cocoon / billing rules), not in cocoonv2.

Two commits:

  1. metering/ subpkgEntry + Shape, 6 Kind constants (vm.compute.{start,stop}, vm.storage.{start,stop}, snap.storage.{start,stop}), 9 Reason constants. Three recorders: FileRecorder (JSON line append under sync.Mutex, falls back to NopRecorder on open failure so callers never see nil), NopRecorder, CaptureRecorder (in-memory for tests). Recorder.Emit(ctx, Entry) is fire-and-forget — errors are logged, never bubbled to the state machine.

  2. Hook wiringBackend / LocalFile gain a Metering field (nil-safe via metering.OrNop). cmd/core lazy-inits one process-wide FileRecorder at $RootDir/metering/ledger.jsonl and injects it into every backend ctor. Six emit sites:

    • vm.compute.start hypervisor/state.go BatchMarkStarted (reason: boot/restart by FirstBooted)
    • vm.compute.stop hypervisor/state.go UpdateStates (only Running→Stopped, reason: stop-user)
    • vm.storage.start hypervisor/create.go FinalizeCreate (reason: boot)
    • vm.storage.stop hypervisor/stop.go DeleteAll (reason: vm-rm)
    • snap.storage.start snapshot/localfile/localfile.go Create
    • snap.storage.stop snapshot/localfile/localfile.go Delete (reason: snap-rm)

DB.Update closures only collect entries; emit runs outside the lock so the ledger write never extends DB lock duration.

Out of scope (future commits)

  • Tenant attribution — vk-cocoon side, Pod label → ledger join
  • Caller-passed Reason for clone/restore/crash paths (today these don't emit; will be a follow-up to thread Reason through BatchMarkStarted etc.)
  • Central sink (rsync ledger → GCS → BigQuery)
  • Snapshot/clone source-snapshot lineage in Entry

Test plan

  • make fmt-check clean
  • make vet (linux + darwin) 0 issue
  • make lint (linux + darwin) 0 issue
  • make test (-race -count=1 -cover) all packages green
  • New unit tests:
    • metering/: round-trip + concurrent emit + Nop/Capture/File recorders + wire-format kind strings locked
    • hypervisor/state_test.go: 5 tests covering all 3 state-machine emit cases, boot/restart reason flip, no-emit on Running/Error states, nil-recorder safety
    • snapshot/localfile/localfile_test.go: Create→Delete emit round-trip with CaptureRecorder

Architecture notes

  • metering package only imports stdlib + projecteru2/core/log; no internal cocoon dependencies, so it stays a pure leaf
  • Nil-safe via metering.OrNop(r) rather than non-nil contract: lets state_test.go build raw Backend{} without injecting a recorder
  • Singleton recorder via sync.Once in cmd/core; the process-wide instance means all backends share one ledger fd (multiple O_APPEND fds in one process is allowed but wasteful)

New subpkg `metering` for append-only VM/snapshot lifecycle endpoints
(vm.compute.{start,stop}, vm.storage.{start,stop},
snap.storage.{start,stop}). Exposes raw signals only; tenant attribution
and pricing remain in upper layers.

- Entry / Shape types, Kind + Reason constants
- Recorder interface (Emit-only, fire-and-forget)
- FileRecorder appends JSON lines under sync.Mutex, falls back to
  NopRecorder on open failure so callers never see nil
- NopRecorder for tests and unconfigured nodes
- CaptureRecorder for in-process assertions

No hypervisor integration yet; subsequent commits will wire the 6 hook
points in hypervisor/state.go, hypervisor/create.go, snapshot/, etc.
@CMGS CMGS force-pushed the feat/lifecycle-metering branch 6 times, most recently from 6123f49 to 624514e Compare May 20, 2026 02:57
…ot backends

Backend / LocalFile gain a Metering field; nil-safe via metering.OrNop so
tests that build raw structs still work. cmd/core lazy-inits one
process-wide FileRecorder at $RootDir/metering/ledger.jsonl and injects
it into every backend ctor.

Hook points emitting metering.Entry:
- vm.compute.start   hypervisor/state.go BatchMarkStarted (reason: boot/restart by FirstBooted)
- vm.compute.stop    hypervisor/state.go UpdateStates (only Running→Stopped, reason: stop-user)
- vm.storage.start   hypervisor/create.go FinalizeCreate (reason: boot)
- vm.storage.stop    hypervisor/stop.go DeleteAll (reason: vm-rm)
- snap.storage.start snapshot/localfile/localfile.go Create
- snap.storage.stop  snapshot/localfile/localfile.go Delete (reason: snap-rm)

DB.Update closures only collect entries; emit runs outside the lock so
the ledger write never extends DB lock duration.

Tests: hypervisor/state_test.go covers all 3 state-machine emit cases +
reason flip + nil-recorder safety; localfile_test.go adds Create/Delete
emit round-trip. make lint / vet / test (race+cover) pass on
linux+darwin.
@CMGS CMGS force-pushed the feat/lifecycle-metering branch from 624514e to 334b083 Compare May 20, 2026 03:08
CMGS added 11 commits May 20, 2026 11:20
Two cocoon processes racing snapshot rm: flock serializes the store.Update
closures, so the loser's closure sees a nil rec. Previously the loser still
emitted a snap.storage.stop with an empty Hypervisor field, duplicating the
winner's correct event.

- Extract Delete's per-id loop body into deleteOne (also testable).
- Track deletedRecord via a captured bool in the closure; emit only when this
  call actually committed the delete. Caller still gets the id in 'deleted'
  because the on-disk state is correct (data is gone).
…timestamp consistency

SKILL.md rule-by-rule pass on the branch:

Helpers extracted in hypervisor/metering.go:
- makeEntry: collapses the 6-field Entry literal at 4 emit sites
  (UpdateStates, BatchMarkStarted ×2, FinalizeCreate).
- emitAll: fans out a batch of entries through a single meter() lookup
  (BatchMarkStarted, UpdateStates).
- emitOpenInterval: storage.start + compute.start pair used by both
  FinalizeClone and emitRestoreSuccess; takes now from caller to keep the
  timestamp consistent with adjacent close events.
- emitDeleteClose: storage.stop unconditionally + compute.stop when the
  record had an open Running interval (DeleteAll body shrinks by ~20 lines).

Time-skew fix: emitRestoreSuccess now captures one now and passes it to
emitOpenInterval so storage.stop and the open pair share a timestamp.

Both emitOpenInterval and emitDeleteClose cache rec := b.meter() before
their emit loops (matches emitAll), removing per-emit interface nil-check.

Comment cleanup across the branch:
- Drop restate-code godocs (NopRecorder.Emit, CaptureRecorder.Emit,
  hypervisor.meter, hypervisor.shapeFromConfig).
- Collapse multi-line godocs to one line where the second line was a
  paraphrase of the first (FinalizeCreate, FinalizeClone, FinalizeRestore,
  UpdateStates, StartAll, BatchMarkStarted, emitRestoreComputeStop,
  Delete, deleteOne, FileRecorder.Emit, metering/file.go header block).
- Drop RestoreSequence inline comment that duplicated
  emitRestoreComputeStop's godoc.
- Merge duplicate Delete godoc in snapshot/localfile/localfile.go.

File ordering: snapshot/localfile/gc.go — backfillSizeBytes moved next to
its sibling helpers (after pickLRU/logEvictRow, before cleanResolvedRecords),
snapshotMeta moved up so types appear before the funcs that use them.
- hypervisor.StartSequence/StopOneSequence/NewBackend collapse the CH+FC
  per-id start/stop and ctor boilerplate into ~10-line backend shims
- snapshot/localfile/metering.go centralizes the 4 Create/Import/Delete/GC
  snapshot.{start,stop} emit sites
- compress 3 multi-line godoc blocks to single lines per project style
- UpdateStates emits vm.compute.stop on Running→Error (stop-crash) so
  MarkError can't silently swallow an open interval; the downstream
  rm --force path won't double-fire because State=Error already closed.
- emitRestoreComputeStop flips State→Stopped alongside the emit so a
  restore failure path (MarkError on Populate/AfterExtract error) sees
  oldState=Stopped and skips the re-emit.
- Layout sweep enforcing SKILL.md (public-above-private, single const
  block, struct methods grouped, standalone utils after methods) — 7
  files: hypervisor/{metering,restore,clone,utils}.go,
  snapshot/localfile/localfile.go, cmd/vm/{run,status}.go.
- BatchMarkStarted now sets r.StoppedAt = &now when closing a
  stale-running interval so DB introspection matches the ledger.
- NewBackend / localfile.New default rec→NopRecorder when nil, so
  hot emit paths can call b.Metering.Emit / lf.metering.Emit directly
  without going through the OrNop helper. OrNop is removed.
- Compress 10 verbose godoc/inline comments introduced this branch.
drops or shortens ~20 godoc/inline comments that restated identifiers
or carried design narration that belongs in the PR body, including:
hypervisor backend Spec types, restore.go helpers, create.go
sequences, metering package doc, FileRecorder/Emit, snapshot/localfile
Create/Delete/deleteOne/gc helpers.
…h,fc}

drops or compresses ~30 godoc/inline comments that restated identifiers
or carried design narration: cmd/core/helpers.go, cmd/snapshot/handler.go,
cmd/vm/{lifecycle,run,status}.go, hypervisor/cloudhypervisor/{clone,stop}.go,
hypervisor/firecracker/clone.go. Removes TODO(inspect) marker (track in
issue tracker instead).
UpdateStates(Error) no longer emits. Many MarkError callers (Shutdown
errors, ctx-cancel, AbortLaunch best-effort) can't prove the process
is dead, so closing the interval there risked a still-alive VM with
no compute account.

PrepareStart now closes the stale interval (DB Running + process not
alive) via a new closeStaleComputeInterval helper — the only point
at start time where we can prove the old VMM is gone.

BatchMarkStarted's stale-close branch stays as a safety net.
…it sweep

metering:
- UpdateStates(Error) no longer writes StoppedAt — leaves it nil so the
  compute interval stays open in the ledger. Many MarkError callers
  (Shutdown error, ctx-cancel, AbortLaunch best-effort) can't prove the
  process is dead.
- closeStaleComputeInterval + DeleteAll + BatchMarkStarted now gate on
  hasOpenComputeInterval(rec) rather than rec.State == Running. After
  MarkError leaves State=Error with an open interval, the next start
  or rm --force confirms the process is dead and closes the account.
- New tests cover Running→MarkError→PrepareStart and
  Running→MarkError→rm --force paths.

audit sweep (~80 pre-existing comments deleted/compressed):
- images/baseconfig.go, images/{cloudimg,oci}/{config,image,*.go,oci.go}
- hypervisor/inspect.go, hypervisor/cloudhypervisor/extend.go
- network/bridge/{bridge_other,bridge_linux}.go, network/cni/config.go
- console/relay.go, lock/flock/flock.go, storage/json/json.go
- gc/orchestrator.go
- cmd/{vm,images,others,snapshot}/commands.go
UpdateStates(Stopped) now gates emit on hasOpenComputeInterval(r) rather
than oldState==Running. After MarkError leaves an open interval, a later
vm stop --force / rm --force succeeds → UpdateStates writes the StoppedAt
sentinel and emits the missing compute.stop(stop-user). Idempotent
Stopped→Stopped no longer overwrites StoppedAt (preserves first stop).

Regression: TestStopAfterMarkErrorEmitsComputeStop covers
Running→MarkError→Stopped.
…ename

Three concerns from the final audit round:

1. hasOpenComputeInterval was StartedAt > StoppedAt — NTP backward step
   could flip the comparison and miss-emit. Switch to the cleaner
   sentinel: StoppedAt == nil. Every transition into Running
   (UpdateStates, BatchMarkStarted, FinalizeRestore) now clears
   StoppedAt explicitly.

2. emitRestoreComputeStop wrote a ledger entry even when DB.Update
   failed or the record had vanished mid-flight. Now fail-closed on
   DB error, skip emit when the closure didn't actually close, and
   route through the new makeSourceEntry helper.
   closeStaleComputeInterval gets the same didClose gate.

3. cap (the builtin) was shadowed 20× across state_test.go,
   localfile_test.go, gc_test.go. Renamed to rec.

Layout: hypervisor/backend.go now keeps Backend's Type() method
adjacent to NewBackend instead of after the 7 Spec types.
@CMGS CMGS merged commit 65a9e8e into master May 20, 2026
4 checks passed
@CMGS CMGS deleted the feat/lifecycle-metering branch May 20, 2026 07:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant