feat(activity): per-miner success/failure detail for bulk actions [backend] #25
Conversation
Introduces migration 000033 and extends UpsertCommandOnDeviceLog to accept an optional error string. Existing callers pass the zero-value (NULL) until M2 wires worker and reaper failure reasons through. This is the first piece of issue #22 -- per-miner success/failure detail in the activity log. Also adds docs/issue-22-activity-log-per-miner-detail.md as a progress tracker for the rest of the milestones. Made-with: Cursor
Propagate the worker's error string (truncated to 2048 runes with a suffix marker) and the reaper reason into command_on_device_log.error_info. The reap queries now RETURN error_info so the Go code doesn't have to duplicate the reason strings baked into the SQL updates. The new helper lives in error_info.go with unit coverage for empty / short / long / multibyte inputs and the nil-error convention. Made-with: Cursor
Migration 000034 adds a dedicated batch_id TEXT column plus two partial indexes: a btree for lookups by batch and a unique index scoped to '*.completed' event types. The unique index guards the finalizer (added in M6) against duplicate completion rows when the crash-recovery reconciler re-runs. InsertActivityLog and ListActivityLogs now read/write the column. Made-with: Cursor
…(M4) Add BatchID to models.Event and models.Entry, and have SQLActivityStore.Insert populate / SQLActivityStore.List read the new column. When the command finalizer (M6) writes a '<event_type>.completed' row for a batch, a concurrent or reconciler-driven retry can collide with the partial unique index added in M3. We detect that specific pgconn unique_violation and silently return nil, so the finalizer remains idempotent. Table-driven unit tests cover isCompletedBatchDuplicate for all relevant combinations (nil event, missing batch id, wrong event suffix, wrong pg code, non-pg error, and the legitimate duplicate case). Made-with: Cursor
Extends the proto schema with CommandBatchDeviceResult plus the new GetCommandBatchDeviceResults request/response, and adds the supporting sqlc queries:
- GetBatchHeaderForOrg returns the batch header only when the creator belongs to the caller's organization, giving the RPC tenant isolation without a dedicated org_id column on command_batch_log.
- ListBatchDeviceResults returns per-device outcomes with deterministic ordering so the UI can paginate or virtualize without reshuffling.
Handler and service gain stubs that route into a CodeUnimplemented error until M8 wires the real logic. Made-with: Cursor
Add composeFinalizers and buildActivityCompletedCallback to command.Service, and wire them into every command RPC. The finalizer runs when the batch reaches FINISHED, reads the success/failure counts via GetBatchStatusAndDeviceCounts, and logs a '<event_type>.completed' activity row with summary metadata (batch_id, total_count, success_count, failure_count) and result=success|failure. Per issue #22, both the '<event_type>' (initiated) and '<event_type>.completed' rows share the same batch_id column so the UI can collapse them into a single timeline entry. The partial unique index plus the store's duplicate swallow keeps completion writes idempotent so the reconciler (M7) can safely backfill after crashes. For DownloadLogs, the finalizer is composed so the ZIP bundle callback runs first and the completion row appears only after the bundle is persisted. CompletedEventSuffix moves into the activity models package so the command and sqlstores packages share one source of truth. Made-with: Cursor
Periodically backfills '<event_type>.completed' activity rows for batches that reached FINISHED without one (server crash between mark and write, or finalizer exhausting its 3 retries). Only acts on batches that have an initiated activity row, so internally-triggered batches (worker-name reapply) stay out of the activity timeline. Attribution for the backfilled row is copied from the initiated row so the completion event matches the original action even when the triggering session is long gone. Idempotency is enforced by the partial unique index on (batch_id, event_type) -- concurrent reconciler runs across replicas are safe. The metadata includes "reconciled": true so operators can distinguish reconciler output from live finalizer output. ActivityLogger interface keeps the reconciler decoupled from the activity domain service. Wired into main.go in M11. Made-with: Cursor
Replaces the M5 stub with a real implementation that:
- scopes by session organization via GetBatchHeaderForOrg (NotFound for
unknown batches or those in another org),
- reads per-device rows via ListBatchDeviceResults, counts success/failure,
- maps the status enum to the proto string ("success" / "failed"),
- caps rows at maxBatchDeviceResults = 5000 to bound memory and response
size; paging left as a follow-up once real fleets trip the cap,
- sets details_pruned = true only when the batch is FINISHED with zero
per-device rows, so mid-run responses stay non-pruned.
Small helper deviceCommandStatusToProto has table-driven unit coverage.
Made-with: Cursor
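The status-to-proto mapping in the list above is a plain switch. A minimal sketch, with the enum spelled as strings standing in for the generated sqlc/proto types:

```go
package main

import "fmt"

// deviceCommandStatusToProto sketches the mapping named above. The
// SUCCESS/FAILED status names are assumptions standing in for the
// generated enum; only the "success"/"failed" output strings come
// from the commit message.
func deviceCommandStatusToProto(status string) string {
	switch status {
	case "SUCCESS":
		return "success"
	case "FAILED":
		return "failed"
	default:
		return "" // non-terminal rows carry no result string yet
	}
}

func main() {
	fmt.Println(deviceCommandStatusToProto("SUCCESS")) // success
	fmt.Println(deviceCommandStatusToProto("FAILED"))  // failed
}
```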
Introduces the first retention policy for command-audit tables. Before this, command_batch_log, command_on_device_log, and queue_message grew unbounded; only TimescaleDB hypertables had retention. RetentionCleaner runs on a configurable interval (default 1h) and drains three tables in FK-safe order, each with a per-query LIMIT (DeleteBatchLimit, default 1000); the drain loop keeps going until a page comes back short. Defaults: queue_message terminals 30d, command_on_device_log 90d, command_batch_log 180d. The drain loop itself is unit-tested for the happy path, early error, and ctx cancellation; wiring into main.go lands in M11. Made-with: Cursor
Activity log had no retention policy. The command audit cleaner (M9) sweeps command_batch_log and its children; this change adds a parallel sweep for activity_log itself, with longer defaults that match its audit/compliance purpose (1 year vs 180 days).
- ActivityStore.DeleteOlderThan added to the interface, implemented by SQLActivityStore and regenerated in the mock store.
- activity.Config (new) exposes ActivityLogRetention, CleanupInterval (6h), and DeleteBatchLimit (1000) via env vars.
- activity.RetentionCleaner drains activity_log in a loop using the same short-page termination rule as the command cleaner.
Unit-tested via a hand-rolled stub store covering the happy loop, mid-pass error propagation, and ctx cancellation. Made-with: Cursor
…M11) Starts three new goroutines with graceful shutdown:
- CompletionReconciler backfills missing '<event_type>.completed' activity rows for FINISHED batches whose finalizer didn't write one (server crash or retry exhaustion).
- Command RetentionCleaner ages out queue_message terminals, command_on_device_log, and command_batch_log rows.
- Activity RetentionCleaner ages out activity_log rows on a longer schedule (1 year default vs 180 days for batch headers).
activity.Config is added to the top-level fleetd Config with the ACTIVITY_ env prefix so operators can tune retention independently of the command settings. Made-with: Cursor
Integration tests (require live Postgres):
- mark_status_integration_test.go: two new subtests that assert error_info is persisted on FAILED rows with the worker's reason, and left NULL on SUCCESS rows.
- reaper_integration_test.go: reap reason propagates end-to-end from queue_message.error_info into command_on_device_log.error_info.
- retention_integration_test.go: full FK-safe drain across queue_message, command_on_device_log, and command_batch_log.
- reconciler_integration_test.go: end-to-end backfill with correct counts and attribution; idempotent re-run; no backfill when there is no initiated row (worker-name reapply case).
Also exposes RunOnceForTest on both RetentionCleaner and CompletionReconciler so the integration tests can drive a single pass without starting the goroutine. All added tests pass locally against the fleet test Postgres. Made-with: Cursor
Closes the multi-org tenant leak flagged in the review. Previously GetBatchHeaderForOrg authorized by joining through user_organization on the batch creator, which meant any org that creator ever belonged to could see the batch. A dual-org user could read batches they never ran against. This commit:
- Adds command_batch_log.organization_id via migration 000035, with a partial index, FK to organization, and a deterministic backfill from user_organization. Column stays nullable for safe deploy; a follow-up migration can flip to NOT NULL after soak.
- Captures the caller's session.OrganizationID in CreateCommandBatchLog so every new batch records its owning org.
- Rewrites GetBatchHeaderForOrg to filter on the column directly. No more creator-membership JOIN.
- Threads orgID through saveCommandBatchLogToDB and the two integration seed helpers that create command_batch_log rows.
Made-with: Cursor
…hDeviceResults (R2/H1) Counts returned to the UI no longer derive from the capped per-device slice, so success_count + failure_count always equals total_count even for batches with more than maxBatchDeviceResults (5000) devices. Source of truth: GetBatchStatusAndDeviceCounts, which already aggregates per-status counts from command_on_device_log (the same query the finalizer and the stream use). A new bool field Truncated is set when the raw list exceeds the cap so the UI can surface a "showing first N of total_count" affordance and point operators at the CSV/ZIP export for the full list. Pagination is deferred to a follow-up; the cap is plenty for the activity-log drill-down today. Made-with: Cursor
A batch whose selector matched no miners reaches FINISHED with devices_count=0 and never generates per-device rows. The original details_pruned condition (FINISHED + rows==0) false-positive'd it, misleading the UI into saying "Per-miner details are no longer available" when in fact there was nothing to show. Tighten the guard with header.DevicesCount > 0 so we only claim pruning for batches that actually had miners. Made-with: Cursor
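The two response flags described above (the H1 count-vs-cap fix and the tightened pruning guard) can be sketched together. All names here are assumptions mirroring the text; the point is that `Truncated` derives from the raw row count against the cap, while `DetailsPruned` only fires for FINISHED batches that actually had miners.

```go
package main

import "fmt"

// batchHeader carries the two header fields the guards need; field and
// status names are assumptions standing in for the sqlc-generated types.
type batchHeader struct {
	Status       string
	DevicesCount int
}

// Assumed cap from the commit messages.
const maxBatchDeviceResults = 5000

// responseFlags sketches both guards: Truncated fires when the raw
// per-device list exceeds the cap, and DetailsPruned only claims pruning
// for FINISHED batches that really had miners (DevicesCount > 0) yet
// have no per-device rows left.
func responseFlags(h batchHeader, rowCount int) (truncated, detailsPruned bool) {
	truncated = rowCount > maxBatchDeviceResults
	detailsPruned = h.Status == "FINISHED" && h.DevicesCount > 0 && rowCount == 0
	return
}

func main() {
	t, p := responseFlags(batchHeader{Status: "FINISHED", DevicesCount: 3}, 0)
	fmt.Println(t, p) // false true: rows aged out, details genuinely pruned
	t, p = responseFlags(batchHeader{Status: "FINISHED", DevicesCount: 0}, 0)
	fmt.Println(t, p) // false false: empty-selector batch is not "pruned"
}
```

Without the `DevicesCount > 0` conjunct, the second case would misreport "Per-miner details are no longer available" for a batch that never had any miners to report on.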
Adds service-level integration tests for every branch the review highlighted as unguarded:
- Happy path: FINISHED batch with one SUCCESS and one FAILED row. Asserts error_info propagates onto the proto's error_message.
- Cross-org NotFound (exercises R1/M1): a caller in Org B cannot read a batch whose organization_id is Org A.
- details_pruned=true when FINISHED + devices_count>0 + no codl rows.
- details_pruned=false when PENDING/PROCESSING.
- details_pruned=false when devices_count=0 (empty-selector batch); covers the R4/M2 guard.
- Truncated=true when more than maxBatchDeviceResults (5000) codl rows exist; asserts that aggregate-sourced counts still sum to total_count (the H1 invariant).
- InvalidArgument for empty batch_identifier.
Also adds two handler-level tests that wire the real command.Service through handler.Handler.GetCommandBatchDeviceResults and verify the FleetError round-trips unchanged for the interceptor to map. While touching the RPC, the sql.ErrNoRows -> NotFound translation now happens inside the WithTransaction callback, since the retry wrapper re-wraps non-FleetError errors as Internal, which had been erasing the sentinel before the service's errors.Is check could see it. Pre-existing bug that the new cross-org test caught. Made-with: Cursor
Adds session.Info.Actor so internal orchestrators can declare what kind of principal is acting on the request. The schedulerContext now sets it to "scheduler"; user / API-key traffic leaves it empty. logCommandActivity and buildActivityCompletedCallback translate the string into an activity ActorType via actorTypeFromSession (scheduler -> ActorScheduler, system -> ActorSystem, anything else -> empty so the activity service's existing defaulting to ActorUser applies). This removes the hardcoded ActorUser in the finalizer, which previously forced scheduler-driven completions to misattribute as user events and was inconsistent with the reconciler's backfill (which copies the initiated row's actor type verbatim). Table-driven unit test in actor_test.go covers nil, empty, scheduler, system, and unknown-actor cases. Made-with: Cursor
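The actor translation above is a small switch with deliberate fall-through semantics. A minimal sketch, with strings standing in for the activity package's ActorType constants; only the mapping rule comes from the commit message.

```go
package main

import "fmt"

// actorTypeFromSession sketches the translation: known internal actors
// map to their activity constants, and anything else (user / API-key
// traffic, which leaves Actor empty) maps to the empty value so the
// activity service's existing defaulting to ActorUser applies.
func actorTypeFromSession(actor string) string {
	switch actor {
	case "scheduler":
		return "ActorScheduler"
	case "system":
		return "ActorSystem"
	default:
		return "" // empty: downstream defaults this to ActorUser
	}
}

func main() {
	fmt.Println(actorTypeFromSession("scheduler"))      // ActorScheduler
	fmt.Println(actorTypeFromSession("") == "")         // user / API-key traffic
	fmt.Println(actorTypeFromSession("unknown") == "")  // unknown actors also default
}
```

Returning empty rather than ActorUser directly is what keeps the defaulting in one place (the activity service) instead of duplicating it in the finalizer.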
… (R6/M5) Before this change, a transient failure in the activity finalizer (e.g. a DB blip while reading GetBatchStatusAndDeviceCounts) caused the initializeStatusUpdateRoutine retry loop to re-run the entire composed chain, including the DownloadLogs bundle builder. The bundle builder happens to be idempotent in practice today, but its ScheduleBatchLogCleanup timer got stacked on each retry and the pattern is a latent footgun for any future non-idempotent callback. The composed closure now tracks per-callback completion: a retry picks up only the callbacks that haven't yet succeeded. Single-callback composition still short-circuits with zero tracking overhead. New unit test TestComposeFinalizers_RetrySkipsAlreadySucceededCallbacks drives the failing-callback path and asserts the earlier callback runs exactly once across retries. Made-with: Cursor
defaultRetentionConfig in both the command and activity retention cleaners now:
- Enforces FK-safe ordering in the command cleaner: if QueueMessageRetention > DeviceLogRetention, clamp it down (and log a warning) so the cleaner doesn't delete batches while their queue-message siblings are still live. Same for DeviceLog vs BatchLog.
- Clamps CleanupInterval to a one-minute minimum so a typo doesn't turn the cleaner into a busy loop.
- Clamps DeleteBatchLimit to a 50,000 maximum so a misconfiguration can't issue a million-row delete per tick.
All clamps log through slog.Warn so operators see the drift. Five new unit tests cover each clamp path; the existing happy-path tests are adjusted where they crossed the new minimums. Made-with: Cursor
…(R8/L3) Adds activity.Service.LogStrict(ctx, event) error, and routes the command finalizer (buildActivityCompletedCallback) and the reconciler's backfill through it. The existing Service.Log remains fire-and-forget for everyone else. Before: a transient DB failure while writing the .completed row was only visible via slog.Error inside Log; the status routine's retry loop never saw it. The row could silently go missing until the reconciler caught up 2-7 minutes later. After: transient failures bubble up through composeFinalizers (R6) so the status routine retries only the failing leg. The partial unique index + store-level duplicate swallow keep retries idempotent. ActivityLogger interface in the reconciler gains LogStrict; backfillOne uses it so systematic errors (e.g. FK violations) also surface. Made-with: Cursor
L1: Drop the "stubbed here" sentence on Handler.GetCommandBatchDeviceResults
now that it's the real implementation.
L2: Add sync.Mutex around Start/Stop lifecycle in both retention cleaners
and the completion reconciler so back-to-back start+stop calls don't
race on cancel/done fields.
L5: Wrap the three queries in GetCommandBatchDeviceResults (header, counts,
per-device rows) inside a single db.WithTransaction callback so they
come from a consistent snapshot. Prevents a retention sweep between
queries from skewing the counts vs. the returned device list.
L6: Populate testutil.TestUser.ExternalUserID from the fixture's generated
value and use it in reconciler_integration_test.go instead of the
synthetic "ext-"+username literal. Keeps test attributions in sync
with what the auth interceptor would produce.
L7: Add a file-level comment to retention_test.go explaining the
DeleteBatchLimit convention the drain-loop tests rely on.
M4 note: Prominent operational comment at the top of migration 000034
explaining the ACCESS EXCLUSIVE lock behavior of its CREATE INDEX
statements and pointing at the CONCURRENTLY follow-up. No code
change, just operator guidance.
Made-with: Cursor
🔐 Codex Security Review
Review Summary: Overall Risk: HIGH
Findings:
[HIGH] The new batch-results auth boundary trusts the batch header, not the devices in the batch
[MEDIUM]
…(R10/H1) M10 added DeleteOlderThan to the ActivityStore interface and regenerated the generated mock, but the hand-rolled noopActivityStore / spyActivityStore in pools/service_test.go wasn't updated, leaving `go test ./internal/domain/pools/...` with a build failure. This is the only hand-rolled fake that had to be updated -- the activity package's own stubStore and the generated MockActivityStore were handled in M10. Made-with: Cursor
…/M1..M4)
M1: Retention clamp cascades correctly when multiple invariants are
violated. Previous order (Queue<=Device, then Device<=Batch) could leave
Queue > Device after a later Device<=Batch clamp reduced Device below the
already-clamped Queue. New order applies the outermost invariant first
(Batch -> Device -> Queue) so every subsequent clamp sees the already-
reduced upstream value. Added
TestDefaultRetentionConfig_ClampsCascadingViolations covering
Queue=1y / Device=30d / Batch=10d; asserts all three end up at 10d.
M2: Added db.WithReadOnlyTransaction backed by sql.TxOptions
{ReadOnly: true, Isolation: RepeatableRead}. GetCommandBatchDeviceResults
now uses it so the three queries (header, aggregate counts, per-device
rows) share a true snapshot; the comment's "consistent snapshot" claim is
now accurate and a concurrent retention sweep cannot perturb the response.
M3: Migration 000035 gained the same operational note as 000034 about the
backfill UPDATE and CREATE INDEX lock durations, plus the pointer to the
CONCURRENTLY follow-up.
M4: Dropped the unreachable sql.ErrNoRows branch on
GetBatchStatusAndDeviceCounts. Under the new read-only REPEATABLE READ
transaction (M2), if the header query has resolved, the aggregate query
cannot return ErrNoRows -- the snapshot is stable and GROUP BY cbl.id
always yields one row when cbl exists. The comment documents the
reasoning.
Made-with: Cursor
Strip out work that isn't required to satisfy issue #22 (per-miner success/failure detail for bulk actions) so the diff is feasible for a single human review pass. Net impact: server/proto diff drops from ~5,481 additions across 57 files to ~2,641 additions across 38 files.
What is kept:
- 3 migrations (error_info on command_on_device_log, batch_id on activity_log with partial unique index, organization_id on command_batch_log + single-org backfill + FK)
- Per-device error_info persistence via sanitizedErrorInfo (plugin text still collapses to a generic marker; FleetError DebugMessage still passes through, documented as a follow-up)
- activity_log batch_id linkage + idempotent '*.completed' finalizer, with isCompletedBatchDuplicate now pinned to the uq_activity_log_batch_completed constraint
- GetCommandBatchDeviceResults RPC, org-scoped via command_batch_log.organization_id, with SQL-enforced LIMIT, Truncated / DetailsPruned flags, and empty-selector guard
- Session organization_id validation on processCommand and ReapplyCurrentPoolsWithWorkerNames
- session.Actor alias + ActorScheduler for scheduler-triggered completion row attribution
What is removed (deferred to follow-up PRs):
- Command retention cleaner (queue_message / command_on_device_log / command_batch_log), activity retention cleaner, their configs, SQL DELETE queries, ActivityStore.DeleteOlderThan, main.go wiring, and the noopActivityStore DeleteOlderThan stub in pools/service_test.go
- Completion reconciler, ListFinishedBatchesWithoutCompletion, ResultUnknown activity state, ActorSystem session constant, and reconciler config fields (ReconcilerInterval / GracePeriod / MaxBatches)
- Aged-out RPC fallback: GetLatestCompletedActivityForBatch, synthesizeAgedOutResponse, parseCompletionMetadata, and the three header-aged-out integration tests. Header miss now returns NotFound directly.
- WithReadOnlyTransaction infrastructure and GetBatchDeviceCounts narrow query; the RPC and finalizer now use WithTransaction + GetBatchStatusAndDeviceCounts
Verification:
- go build ./... clean
- go vet ./... clean
- golangci-lint: 0 issues
- Short tests green in the touched packages: command, activity, session, pools, stores/interfaces, stores/sqlstores, infrastructure/db, handlers/command
Made-with: Cursor
The migration added both the column and a best-effort backfill for historical rows. Per the review, historical batches can simply stay NULL: they predate the feature, have no per-miner detail worth showing, and remain invisible to GetBatchHeaderForOrg (closed-by-default). With no backfill, the multi-org-ambiguity rationale, the integration test that exercised it, and a handful of comments all go away.
- Strip the UPDATE ... WHERE block from migration 000036; keep the nullable column, the partial index (WHERE organization_id IS NOT NULL), and the FK unchanged
- Trim the multi-paragraph operator note to the essentials
- Delete org_backfill_integration_test.go (it existed solely to pin the single-org backfill rule)
- Trim the "precondition for the NOT NULL flip" rationale in saveCommandBatchLogToDB and processCommand; the refuse-OrganizationID <= 0 check remains as a runtime guard against NULL writes
- Trim the "legacy backfill miss / closed-by-default posture" comment on GetBatchHeaderForOrg
No Go call sites change (organization_id stays nullable in SQL, so CreateCommandBatchLogParams.OrganizationID remains sql.NullInt64 and new writes continue to set Valid: true).
Verified:
- go build ./... clean
- go vet ./... clean
- golangci-lint: 0 issues
- Short tests green in touched packages (command, activity, stores/sqlstores, handlers/command)
Made-with: Cursor
Self-review follow-ups on feat/22; no functional behavior change.
- Lowercase batch-level status in GetCommandBatchDeviceResults so it matches
the device-level "success"/"failed" convention already in the response
("finished" | "pending" | "processing"). Frontend is not yet wired.
- Centralize the session-org invariant: the OrganizationID <= 0 guard now
lives at the top of saveCommandBatchLogToDB (the single funnel for every
command batch write). Removes the duplicated check in processCommand and
ReapplyCurrentPoolsWithWorkerNames so any future entry point inherits
the invariant automatically. Existing validation tests still pass
unchanged because the error message flows through the wrapper.
- Drop the redundant "batch_id" key from buildActivityCompletedCallback's
metadata map; the first-class BatchID column introduced in this PR
already carries it and grep confirms no reader.
- Extend Service.LogStrict godoc to name the uq_activity_log_batch_completed
swallow contract so callers understand why idempotent finalizer retries
look like success.
- Trim trailing blank lines in activity.sql and command.sql.
- sqlc regen picked up a stale GetBatchHeaderForOrg doc comment that drifted
when the backfill scaffolding was dropped earlier on this branch.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
Adds backend support for per-miner success/failure detail on bulk command batches by persisting per-device outcomes (with bounded failure reasons), emitting a terminal “completed” activity row, and exposing the per-device slice via a new org-scoped RPC.
Changes:
- Add
batch_idtoactivity_logand write a*.completedactivity event at batch FINISHED time with idempotent retry behavior. - Persist per-device
error_infooncommand_on_device_logand propagate it from worker failures / reaper paths. - Add
organization_idtocommand_batch_logand implementGetCommandBatchDeviceResults(SQL + service + handler), scoped by org and capped at 5000 rows.
Reviewed changes
Copilot reviewed 30 out of 38 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| server/sqlc/queries/queue.sql | Return error_info from reap queries so it can be persisted to the audit log. |
| server/sqlc/queries/command.sql | Add organization_id, persist error_info, and add org-scoped header + capped device-result queries. |
| server/sqlc/queries/activity.sql | Add batch_id to insert/list queries to link initiated/completed activity rows. |
| server/migrations/000034_add_error_info_to_command_on_device_log.up.sql | Add nullable error_info column for per-device failure reasons. |
| server/migrations/000034_add_error_info_to_command_on_device_log.down.sql | Rollback for error_info column addition. |
| server/migrations/000035_add_batch_id_to_activity_log.up.sql | Add nullable batch_id plus partial index + partial unique index for completion idempotency. |
| server/migrations/000035_add_batch_id_to_activity_log.down.sql | Rollback for batch_id column and indexes. |
| server/migrations/000036_add_org_id_to_command_batch_log.up.sql | Add nullable organization_id with FK + partial index for org scoping. |
| server/migrations/000036_add_org_id_to_command_batch_log.down.sql | Rollback for organization_id FK/index/column. |
| server/internal/handlers/command/handler.go | Add handler passthrough for GetCommandBatchDeviceResults. |
| server/internal/handlers/command/handler_test.go | Add DB-backed handler tests for happy path + error passthrough. |
| server/internal/domain/stores/sqlstores/activity.go | Insert batch_id and swallow only the specific completion-row unique violation for idempotent retries. |
| server/internal/domain/stores/sqlstores/activity_test.go | Add unit tests pinning the duplicate-detection logic to SQLSTATE + constraint name. |
| server/internal/domain/session/models.go | Introduce session.Actor to tag synthesized sessions (scheduler). |
| server/internal/domain/schedule/processor.go | Set session.Info.Actor for scheduler-driven contexts to attribute activity correctly. |
| server/internal/domain/command/service.go | Add completion finalizer chaining, enforce org-id on new batch writes, and implement GetCommandBatchDeviceResults. |
| server/internal/domain/command/execution_service.go | Persist sanitized error_info for worker failures and propagate reaper error info to device logs. |
| server/internal/domain/command/error_info.go | Add bounded + sanitized error persistence helpers. |
| server/internal/domain/command/error_info_test.go | Unit tests for truncation + sanitization behavior. |
| server/internal/domain/command/finalizers_test.go | Unit tests for callback composition + retry semantics. |
| server/internal/domain/command/actor_test.go | Unit tests for mapping session.Actor to activity actor type. |
| server/internal/domain/command/session_org_validation_test.go | Unit tests ensuring missing org IDs are rejected before DB access. |
| server/internal/domain/command/results_mapping_test.go | Unit tests for device-status-to-proto mapping. |
| server/internal/domain/command/integration_helpers_test.go | Shared DB integration helpers for results tests. |
| server/internal/domain/command/results_integration_test.go | DB-backed integration tests for org scoping, pruning semantics, and truncation. |
| server/internal/domain/command/mark_status_integration_test.go | Integration tests asserting error_info persistence on FAILED and NULL on SUCCESS. |
| server/internal/domain/command/reaper_integration_test.go | Integration test asserting reaper reason persists into the audit row. |
| server/internal/domain/activity/service.go | Add LogStrict API so callers (finalizers) can observe persistence errors. |
| server/internal/domain/activity/models/models.go | Add CompletedEventSuffix and BatchID fields to activity models. |
| proto/minercommand/v1/command.proto | Define new messages and GetCommandBatchDeviceResults RPC contract. |
| server/generated/sqlc/queue.sql.go | Generated: sqlc output updated for new error_info return columns. |
| server/generated/sqlc/models.go | Generated: sqlc models updated for new columns. |
| server/generated/sqlc/db.go | Generated: sqlc prepared statements updated for new queries. |
| server/generated/sqlc/command.sql.go | Generated: sqlc output updated for org/header/results queries and new params. |
| server/generated/sqlc/activity.sql.go | Generated: sqlc output updated for batch_id column. |
| server/generated/grpc/minercommand/v1/minercommandv1connect/command.connect.go | Generated: Connect stubs updated for new RPC. |
| client/src/protoFleet/api/generated/minercommand/v1/command_pb.ts | Generated: TS protobuf output updated for new messages/RPC. |
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a99c56c1b5
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Address review feedback on PR #25: - proto: aggregate counts only sum to total_count for FINISHED batches with intact rows; total_count is from the batch header, success/failure from codl aggregates. - proto: details_pruned doc no longer claims to cover the missing-header case (that surfaces as connect.CodeNotFound, not as this response). - service: document the intentional pre-mark ordering of buildActivityCompletedCallback and the transactional follow-up. Doc-only; no behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Picks up the comment changes from b94806e in the generated Go and TS proto files. Comment-only; no behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…per-miner-detail-backend
💡 Codex Review (reviewed commit: b45121d906)
The CommandBatchDeviceResult.device_identifier comment claimed the value was "(e.g. MAC address)", but device.device_identifier is a UUID v7 (VARCHAR(36), populated by id.GenerateID via uuid.NewV7). MAC lives in a separate device.mac_address column. Updated the proto source comment and regenerated the dependent Go and TS pb files. Doc-only; no behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The RPC bundles three sequential reads (header, counts, rows) in one
db.WithTransaction, but the helper opens transactions with the default
isolation (READ COMMITTED) which gives each statement its own snapshot.
A worker writing command_on_device_log between statements could make
success_count/failure_count disagree with device_results, and could
shift truncated/details_pruned to a mismatched basis.
Fix: extend WithTransaction / WithTransactionNoResult with optional
*sql.TxOptions (variadic last arg, additive — no existing callers need
to change), and pass {Isolation: LevelRepeatableRead, ReadOnly: true}
from this RPC so all three reads share one snapshot.
Read-only RR transactions in PostgreSQL don't get serialization
failures, so the existing retry path is unchanged.
Addresses Codex P2 review: discussion_r3128315071.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
💡 Codex Review (reviewed commit: 99419830c9)
The batch_id column was added to activity_log (migration 000035) and the domain model already carries it, but entryToProto() dropped it before the wire. Add optional string batch_id = 15 to ActivityEntry and map it in the handler so the frontend can group initiated + completed rows by batch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1f08c3d8e9
composeFinalizers previously stopped on the first callback error, which meant a DownloadLogs bundle-builder failure could prevent the activity finalizer from ever running — leaving a permanent gap in the completion audit trail. Switch to best-effort: run every callback even if earlier ones fail, return the first error so the retry loop still retries on the next tick. Already-succeeded callbacks remain skipped. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e42fdee437
initializeStatusUpdateRoutine permanently abandoned the onFinished callback after maxCallbackRetries, creating an unrecoverable audit gap when transient DB failures lasted ≥3 polling ticks. The batch was marked FINISHED but the *.completed activity row was never written. Decouple the two concerns with a batchMarkedFinished flag: pre-finish retries (up to maxCallbackRetries) block FINISHED marking as before; once exhausted, the batch is marked FINISHED to unblock the client, and the callback gets up to maxCallbackRetries more post-finish attempts before giving up. Addresses Codex P1 discussion_r3128491939. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Backend implementation for #22.
Problem
Bulk actions appear in the activity log as a single row written at initiation time, with no record of which miners succeeded or failed. Operators have no audit trail for diagnosing recurring per-miner issues after a partial bulk failure.
What ships
Backend only. Frontend follow-up tracked separately.
- The initiated row (event type `<event_type>`, e.g. `reboot`) is joined by a new completion row (event type `<event_type>.completed`, e.g. `reboot.completed`) written at batch FINISHED time, carrying the success / failure split in its metadata. Both rows share a new `batch_id TEXT` column on `activity_log` so the UI can collapse them into one logical entry. The initiated event type is unchanged from the pre-existing string, so existing UI filters keep working.
- Per-device outcome (`SUCCESS`/`FAILED`) plus a bounded failure reason is stored on `command_on_device_log`. A new `GetCommandBatchDeviceResults` Connect RPC returns the org-scoped slice so the activity-log UI can drill into which miners failed and why. Capped at 5000 rows per response (SQL-enforced `LIMIT`); the response signals `truncated=true` when the cap fires.

Architecture
```mermaid
flowchart LR
  subgraph activity [activity_log - append-only]
    A1["event_type: reboot<br/>metadata: batch_id"]
    A2["event_type: reboot.completed<br/>metadata: success_count, failure_count"]
  end
  subgraph audit [command audit tables]
    B1["command_batch_log<br/>+ organization_id column"]
    B2["command_on_device_log<br/>+ error_info column"]
  end
  subgraph rpc [On-demand detail]
    R1["GetCommandBatchDeviceResults"]
  end
  A2 -- "batch_id column" --> R1
  R1 --> B1
  R1 --> B2
```

Migrations
Three non-destructive migrations. All columns are added nullable; no data backfill.
- `000034_add_error_info_to_command_on_device_log` — `error_info TEXT NULL` for the per-device failure reason.
- `000035_add_batch_id_to_activity_log` — `batch_id TEXT NULL` + partial btree + partial unique index on `(batch_id, event_type) WHERE event_type LIKE '%.completed'` for finalizer retry idempotency.
- `000036_add_org_id_to_command_batch_log` — `organization_id BIGINT NULL` + FK to `organization(id)` + partial index. Pre-migration rows stay NULL and are intentionally invisible to `GetCommandBatchDeviceResults` (closed-by-default); no cross-org guess is ever made for historical data.

Proto
`proto/minercommand/v1/command.proto`:
- `message CommandBatchDeviceResult` — typed per-device row (identifier, status, updated_at, optional error_message).
- `message GetCommandBatchDeviceResultsRequest / Response` — the response carries `batch_identifier`, `command_type`, `status`, `total_count`, `success_count`, `failure_count`, `device_results`, `details_pruned`, `truncated`.
- `rpc GetCommandBatchDeviceResults` on `MinerCommandService`. Typed messages rather than free-form `Struct` fields, so the API is forward-compatible with future per-device fields.

`proto/activity/v1/activity.proto`:
- `optional string batch_id = 15` added to `ActivityEntry` so the frontend can group initiated + completed rows by batch. Wired through `entryToProto()` in the activity handler. Commit `1f08c3d`.

Where to review
Per-device error persistence
- `server/internal/domain/command/error_info.go`, `server/internal/domain/command/execution_service.go`
- `sanitizedErrorInfo` surfaces only `fleeterror.FleetError` messages; everything else (plugin / device / transport errors) collapses to `"command failed"`. Raw `err.Error()` still reaches `slog.Error` for admins.
Activity finalizer
- `server/internal/domain/command/service.go` — `buildActivityCompletedCallback`, `composeFinalizers`, `initializeStatusUpdateRoutine`
- The activity callback is composed with the pre-existing finalizer (the `DownloadLogs` bundle). `composeFinalizers` is best-effort: all callbacks run even if earlier ones fail, so a bundle-builder failure cannot block the activity write. Already-succeeded callbacks are skipped on retry.
- Three-phase lifecycle: `maxCallbackRetries` pre-mark attempts → batch marked FINISHED (client unblocked) → `maxCallbackRetries` more post-mark attempts before giving up. This prevents both client stalls and permanent audit gaps from transient DB failures.
RPC
- `server/internal/domain/command/service.go` — `GetCommandBatchDeviceResults`
- `server/internal/handlers/command/handler.go` — thin pass-through
- `WithTransaction` covers header + counts + rows, opened at `sql.LevelRepeatableRead` + `ReadOnly: true` so all three reads share one snapshot (default READ COMMITTED would let a concurrent worker write shift counts away from `device_results`). Org-scoped via `command_batch_log.organization_id`.
- `details_pruned` is true only when `DevicesCount > 0 && status == FINISHED && len(rows) == 0`. `truncated` signals that the SQL cap fired.
- `total_count` comes from the batch header while `success_count`/`failure_count` are `command_on_device_log` aggregates — they sum to `total_count` only for FINISHED batches with intact rows; in-progress batches have `success + failure <= total`.
Snapshot isolation for bundled reads
- `server/internal/infrastructure/db/with_transaction.go` — `WithTransaction`/`WithTransactionNoResult` now accept an optional variadic `*sql.TxOptions` (additive; no existing callers change). The only caller passing options is `GetCommandBatchDeviceResults`.
- `firstTxOpts(opts)` threads the value through `withTransactionWithRetry` → `executeTransaction` → `db.BeginTx(ctx, txOpts)`; nil preserves the historical default (READ COMMITTED). The `IsRetryablePostgresError` retry path is unaffected.
Session / scheduler attribution
- `server/internal/domain/session/models.go` — `session.Actor` alias + `ActorScheduler` constant
- `server/internal/domain/schedule/processor.go` — `schedulerContext` sets `Actor: session.ActorScheduler`
- `actorTypeFromSession` in `service.go` maps it into activity `models.ActorScheduler`, so scheduler-triggered completion rows are attributed correctly.

Session org_id guard
- `server/internal/domain/command/service.go` — `saveCommandBatchLogToDB` refuses `OrganizationID <= 0`, so new writes can't land with `organization_id = NULL` regardless of entry point. The guard lives at the top of the helper before `getDevicesCount`, so unit tests without a wired deviceStore still hit the check cleanly. The column stays nullable for historical compatibility, but the invariant "new writes always have a real org" holds.

Idempotency swallow
- `server/internal/domain/stores/sqlstores/activity.go` — `isCompletedBatchDuplicate`
- Swallows only the `uq_activity_log_batch_completed` unique-violation (pinned by SQLSTATE + constraint name), so a future unique constraint added to `activity_log` cannot be swallowed by accident.
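The swallow condition reduces to a pure predicate. In production the SQLSTATE and constraint name come from a `pgconn.PgError` extracted via `errors.As`; the sketch below takes them as plain strings so the pinning logic is visible, and the constraint name is the one stated above:

```go
package main

import (
	"fmt"
	"strings"
)

const (
	pgUniqueViolation        = "23505" // SQLSTATE unique_violation
	completedBatchConstraint = "uq_activity_log_batch_completed"
	completedEventSuffix     = ".completed"
)

// isCompletedBatchDuplicateSketch returns true only for the one collision the
// finalizer is allowed to swallow: a duplicate '*.completed' row for a batch,
// pinned to the specific partial unique index by constraint name so any other
// unique violation still surfaces as an error.
func isCompletedBatchDuplicateSketch(sqlstate, constraint, eventType, batchID string) bool {
	if batchID == "" || !strings.HasSuffix(eventType, completedEventSuffix) {
		return false
	}
	return sqlstate == pgUniqueViolation && constraint == completedBatchConstraint
}

func main() {
	fmt.Println(isCompletedBatchDuplicateSketch("23505", "uq_activity_log_batch_completed", "reboot.completed", "b1")) // prints "true"
	fmt.Println(isCompletedBatchDuplicateSketch("23505", "some_other_constraint", "reboot.completed", "b1"))           // prints "false"
}
```

Pinning on the constraint name, not just SQLSTATE 23505, is what keeps a future unique constraint on `activity_log` from being swallowed by accident.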
SQL
- `server/sqlc/queries/activity.sql` — `InsertActivityLog` carries `batch_id`
- `server/sqlc/queries/command.sql` — `CreateCommandBatchLog` carries `organization_id`; `UpsertCommandOnDeviceLog` carries `error_info`; new `GetBatchHeaderForOrg` and `ListBatchDeviceResults` (with SQL `LIMIT`).
- `server/sqlc/queries/queue.sql` — reap queries now return `error_info` so it propagates to `command_on_device_log`.

Security review findings (all addressed)
- `GetBatchHeaderForOrg` could leak across orgs if scoped via `JOIN user_organization ON user_id = created_by`: any org the creator belonged to would see the detail. Fix: dedicated `organization_id` column captured from `session.Info.OrganizationID` at write time; pre-migration rows stay NULL and are invisible to the query (closed-by-default, no backfill guess).
- Unbounded result set from `ListBatchDeviceResults`; cost scaled with total batch size. Fix: SQL `LIMIT max_rows`; the RPC passes `cap + 1` so truncation detection stays cheap.
- `sanitizedErrorInfo` restricts persisted text to server-authored `FleetError` messages; everything else is `"command failed"`.

Post-review follow-ups (all addressed)
- The `details_pruned` doc text covered "neither is available"; neither matched actual behavior. Updated the comments (e.g. `reboot`/`reboot.completed`). Commit `b94806e` + regen `d619282`.
- (mcharles-square) `CommandBatchDeviceResult.device_identifier` was labeled "(e.g. MAC address)" but it's a UUID v7 (`VARCHAR(36)` populated by `id.GenerateID`). MAC lives in `device.mac_address` (`VARCHAR(17)`). The comment now reads `// Stable device UUID, preserved for soft-deleted devices.`, with regen in the same commit. A wider doc-accuracy sweep found no other factual errors. Commit `486c23e`.
- `GetCommandBatchDeviceResults` ran at the PG default READ COMMITTED; a concurrent worker write to `command_on_device_log` between statements could make counts disagree with `device_results`. `WithTransaction`/`WithTransactionNoResult` extended with variadic `*sql.TxOptions`; the RPC passes `{LevelRepeatableRead, ReadOnly: true}`. Commit `9941983`.
- The finalizer wrote the `*.completed` activity row BEFORE `MarkCommandBatchFinished`, so a transient mark failure leaves the activity log briefly ahead of `batch_log.status`. Reworked `initializeStatusUpdateRoutine` to a three-phase lifecycle: the callback gets `maxCallbackRetries` pre-mark attempts, then the batch is marked FINISHED (unblocking the client), then `maxCallbackRetries` more post-mark attempts before giving up. `composeFinalizers` switched to best-effort so independent callbacks (bundle + activity) don't block each other.

Test plan
- `go build ./...` and `go vet ./...` clean
- `just _lint-server` — 0 issues
- Unit tests pass across `command`, `activity`, `activity/models`, `session`, `pools`, `stores/interfaces`, `stores/sqlstores`, `infrastructure/db`, `handlers/command`
- `TestGetCommandBatchDeviceResults_{HappyPath, NotFoundForCrossOrg, DetailsPrunedWhenFinishedWithNoRows, NotPrunedWhilePending, NotPrunedForEmptySelector, TruncatesLargeBatchesWithConsistentCounts, InvalidArgumentOnEmptyIdentifier}`
- `TestMarkQueueMessageStatusTransitions` / reaper integration — `error_info` persistence on failed and reaped rows
- `TestSanitizedErrorInfo_*`, `TestBoundedErrorInfo_*`, table-driven `TestIsCompletedBatchDuplicate` (pinned to the specific constraint name)
- `TestProcessCommand_RejectsMissingOrgID` / `TestReapplyCurrentPoolsWithWorkerNames_RejectsMissingOrgID`
- `TestHandler_GetCommandBatchDeviceResults_{PassesThroughHappyPath, PropagatesInvalidArgument, PropagatesNotFound}`

Rollout notes
- Pre-migration `command_batch_log` rows stay with `organization_id = NULL` permanently and are invisible to `GetCommandBatchDeviceResults` (closed-by-default). New writes always carry a real org because the service guard refuses `OrganizationID <= 0`. Operators who want the drill-down for a specific old batch can repair it manually: `UPDATE command_batch_log SET organization_id = $ORG WHERE uuid = $UUID`.
- The index migrations don't use `CONCURRENTLY`, so they hold `ACCESS EXCLUSIVE` briefly. Run during a low-traffic window on large `activity_log`/`command_batch_log` tables.