feat(metering): VM/snapshot lifecycle event recorder#56
Merged
Conversation
New subpkg `metering` for append-only VM/snapshot lifecycle endpoints
(vm.compute.{start,stop}, vm.storage.{start,stop},
snap.storage.{start,stop}). Exposes raw signals only; tenant attribution
and pricing remain in upper layers.
- Entry / Shape types, Kind + Reason constants
- Recorder interface (Emit-only, fire-and-forget)
- FileRecorder appends JSON lines under sync.Mutex, falls back to
NopRecorder on open failure so callers never see nil
- NopRecorder for tests and unconfigured nodes
- CaptureRecorder for in-process assertions
No hypervisor integration yet; subsequent commits will wire the 6 hook
points in hypervisor/state.go, hypervisor/create.go, snapshot/, etc.
6123f49 to
624514e
Compare
…ot backends Backend / LocalFile gain a Metering field; nil-safe via metering.OrNop so tests that build raw structs still work. cmd/core lazy-inits one process-wide FileRecorder at $RootDir/metering/ledger.jsonl and injects it into every backend ctor. Hook points emitting metering.Entry: - vm.compute.start hypervisor/state.go BatchMarkStarted (reason: boot/restart by FirstBooted) - vm.compute.stop hypervisor/state.go UpdateStates (only Running→Stopped, reason: stop-user) - vm.storage.start hypervisor/create.go FinalizeCreate (reason: boot) - vm.storage.stop hypervisor/stop.go DeleteAll (reason: vm-rm) - snap.storage.start snapshot/localfile/localfile.go Create - snap.storage.stop snapshot/localfile/localfile.go Delete (reason: snap-rm) DB.Update closures only collect entries; emit runs outside the lock so the ledger write never extends DB lock duration. Tests: hypervisor/state_test.go covers all 3 state-machine emit cases + reason flip + nil-recorder safety; localfile_test.go adds Create/Delete emit round-trip. make lint / vet / test (race+cover) pass on linux+darwin.
624514e to
334b083
Compare
Two cocoon processes racing snapshot rm: flock serializes the store.Update closures, so the loser's closure sees a nil rec. Previously the loser still emitted a snap.storage.stop with an empty Hypervisor field, duplicating the winner's correct event. - Extract Delete's per-id loop body into deleteOne (also testable). - Track deletedRecord via a captured bool in the closure; emit only when this call actually committed the delete. Caller still gets the id in 'deleted' because the on-disk state is correct (data is gone).
…timestamp consistency SKILL.md rule-by-rule pass on the branch: Helpers extracted in hypervisor/metering.go: - makeEntry: collapses the 6-field Entry literal at 4 emit sites (UpdateStates, BatchMarkStarted ×2, FinalizeCreate). - emitAll: fans out a batch of entries through a single meter() lookup (BatchMarkStarted, UpdateStates). - emitOpenInterval: storage.start + compute.start pair used by both FinalizeClone and emitRestoreSuccess; takes now from caller to keep the timestamp consistent with adjacent close events. - emitDeleteClose: storage.stop unconditionally + compute.stop when the record had an open Running interval (DeleteAll body shrinks by ~20 lines). Time-skew fix: emitRestoreSuccess now captures one now and passes it to emitOpenInterval so storage.stop and the open pair share a timestamp. Both emitOpenInterval and emitDeleteClose cache rec := b.meter() before their emit loops (matches emitAll), removing per-emit interface nil-check. Comment cleanup across the branch: - Drop restate-code godocs (NopRecorder.Emit, CaptureRecorder.Emit, hypervisor.meter, hypervisor.shapeFromConfig). - Collapse multi-line godocs to one line where the second line was a paraphrase of the first (FinalizeCreate, FinalizeClone, FinalizeRestore, UpdateStates, StartAll, BatchMarkStarted, emitRestoreComputeStop, Delete, deleteOne, FileRecorder.Emit, metering/file.go header block). - Drop RestoreSequence inline comment that duplicated emitRestoreComputeStop's godoc. - Merge duplicate Delete godoc in snapshot/localfile/localfile.go. File ordering: snapshot/localfile/gc.go — backfillSizeBytes moved next to its sibling helpers (after pickLRU/logEvictRow, before cleanResolvedRecords), snapshotMeta moved up so types appear before the funcs that use them.
- hypervisor.StartSequence/StopOneSequence/NewBackend collapse the CH+FC
per-id start/stop and ctor boilerplate into ~10-line backend shims
- snapshot/localfile/metering.go centralizes the 4 Create/Import/Delete/GC
snapshot.{start,stop} emit sites
- compress 3 multi-line godoc blocks to single lines per project style
- UpdateStates emits vm.compute.stop on Running→Error (stop-crash) so
MarkError can't silently swallow an open interval; the downstream
rm --force path won't double-fire because State=Error already closed.
- emitRestoreComputeStop flips State→Stopped alongside the emit so a
restore failure path (MarkError on Populate/AfterExtract error) sees
oldState=Stopped and skips the re-emit.
- Layout sweep enforcing SKILL.md (public-above-private, single const
block, struct methods grouped, standalone utils after methods) — 7
files: hypervisor/{metering,restore,clone,utils}.go,
snapshot/localfile/localfile.go, cmd/vm/{run,status}.go.
- BatchMarkStarted now sets r.StoppedAt = &now when closing a stale-running interval so DB introspection matches the ledger. - NewBackend / localfile.New default rec→NopRecorder when nil, so hot emit paths can call b.Metering.Emit / lf.metering.Emit directly without going through the OrNop helper. OrNop is removed. - Compress 10 verbose godoc/inline comments introduced this branch.
drops or shortens ~20 godoc/inline comments that restated identifiers or carried design narration that belongs in the PR body, including: hypervisor backend Spec types, restore.go helpers, create.go sequences, metering package doc, FileRecorder/Emit, snapshot/localfile Create/Delete/deleteOne/gc helpers.
…h,fc}
drops or compresses ~30 godoc/inline comments that restated identifiers
or carried design narration: cmd/core/helpers.go, cmd/snapshot/handler.go,
cmd/vm/{lifecycle,run,status}.go, hypervisor/cloudhypervisor/{clone,stop}.go,
hypervisor/firecracker/clone.go. Removes TODO(inspect) marker (track in
issue tracker instead).
UpdateStates(Error) no longer emits. Many MarkError callers (Shutdown errors, ctx-cancel, AbortLaunch best-effort) can't prove the process is dead, so closing the interval there risked a still-alive VM with no compute account. PrepareStart now closes the stale interval (DB Running + process not alive) via a new closeStaleComputeInterval helper — the only point at start time where we can prove the old VMM is gone. BatchMarkStarted's stale-close branch stays as a safety net.
…it sweep
metering:
- UpdateStates(Error) no longer writes StoppedAt — leaves it nil so the
compute interval stays open in the ledger. Many MarkError callers
(Shutdown error, ctx-cancel, AbortLaunch best-effort) can't prove the
process is dead.
- closeStaleComputeInterval + DeleteAll + BatchMarkStarted now gate on
hasOpenComputeInterval(rec) rather than rec.State == Running. After
MarkError leaves State=Error with an open interval, the next start
or rm --force confirms the process is dead and closes the account.
- New tests cover Running→MarkError→PrepareStart and
Running→MarkError→rm --force paths.
audit sweep (~80 pre-existing comments deleted/compressed):
- images/baseconfig.go, images/{cloudimg,oci}/{config,image,*.go,oci.go}
- hypervisor/inspect.go, hypervisor/cloudhypervisor/extend.go
- network/bridge/{bridge_other,bridge_linux}.go, network/cni/config.go
- console/relay.go, lock/flock/flock.go, storage/json/json.go
- gc/orchestrator.go
- cmd/{vm,images,others,snapshot}/commands.go
UpdateStates(Stopped) now gates emit on hasOpenComputeInterval(r) rather than oldState==Running. After MarkError leaves an open interval, a later vm stop --force / rm --force succeeds → UpdateStates writes the StoppedAt sentinel and emits the missing compute.stop(stop-user). Idempotent Stopped→Stopped no longer overwrites StoppedAt (preserves first stop). Regression: TestStopAfterMarkErrorEmitsComputeStop covers Running→MarkError→Stopped.
…ename Three concerns from the final audit round: 1. hasOpenComputeInterval was StartedAt > StoppedAt — NTP backward step could flip the comparison and miss-emit. Switch to the cleaner sentinel: StoppedAt == nil. Every transition into Running (UpdateStates, BatchMarkStarted, FinalizeRestore) now clears StoppedAt explicitly. 2. emitRestoreComputeStop wrote a ledger entry even when DB.Update failed or the record had vanished mid-flight. Now fail-closed on DB error, skip emit when the closure didn't actually close, and route through the new makeSourceEntry helper. closeStaleComputeInterval gets the same didClose gate. 3. cap (the builtin) was shadowed 20× across state_test.go, localfile_test.go, gc_test.go. Renamed to rec. Layout: hypervisor/backend.go now keeps Backend's Type() method adjacent to NewBackend instead of after the 7 Spec types.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Cocoonv2 emits raw VM/snapshot lifecycle endpoints to an append-only ledger; the recorder is a leaf subpkg with zero internal cocoon imports. Tenant attribution and pricing live above this layer (vk-cocoon / billing rules), not in cocoonv2.
Two commits:
metering/subpkg —Entry+Shape, 6Kindconstants (vm.compute.{start,stop},vm.storage.{start,stop},snap.storage.{start,stop}), 9Reasonconstants. Three recorders:FileRecorder(JSON line append undersync.Mutex, falls back toNopRecorderon open failure so callers never see nil),NopRecorder,CaptureRecorder(in-memory for tests).Recorder.Emit(ctx, Entry)is fire-and-forget — errors are logged, never bubbled to the state machine.Hook wiring —
Backend/LocalFilegain aMeteringfield (nil-safe viametering.OrNop).cmd/corelazy-inits one process-wideFileRecorderat$RootDir/metering/ledger.jsonland injects it into every backend ctor. Six emit sites:vm.compute.starthypervisor/state.go BatchMarkStarted (reason:boot/restartbyFirstBooted)vm.compute.stophypervisor/state.go UpdateStates (only Running→Stopped, reason:stop-user)vm.storage.starthypervisor/create.go FinalizeCreate (reason:boot)vm.storage.stophypervisor/stop.go DeleteAll (reason:vm-rm)snap.storage.startsnapshot/localfile/localfile.go Createsnap.storage.stopsnapshot/localfile/localfile.go Delete (reason:snap-rm)DB.Updateclosures only collect entries; emit runs outside the lock so the ledger write never extends DB lock duration.Out of scope (future commits)
Reasonfor clone/restore/crash paths (today these don't emit; will be a follow-up to thread Reason throughBatchMarkStartedetc.)Test plan
make fmt-checkcleanmake vet(linux + darwin) 0 issuemake lint(linux + darwin) 0 issuemake test(-race -count=1 -cover) all packages greenmetering/: round-trip + concurrent emit + Nop/Capture/File recorders + wire-format kind strings lockedhypervisor/state_test.go: 5 tests covering all 3 state-machine emit cases, boot/restart reason flip, no-emit on Running/Error states, nil-recorder safetysnapshot/localfile/localfile_test.go: Create→Delete emit round-trip withCaptureRecorderArchitecture notes
meteringpackage only imports stdlib + projecteru2/core/log; no internal cocoon dependencies, so it stays a pure leafmetering.OrNop(r)rather than non-nil contract: letsstate_test.gobuild rawBackend{}without injecting a recordersync.Oncein cmd/core; the process-wide instance means all backends share one ledger fd (multipleO_APPENDfds in one process is allowed but wasteful)