Skip to content

feat(activity): per-miner success/failure bulk actions detail [backend]#25

Merged
rongxin-liu merged 50 commits intomainfrom
feat/22-activity-log-per-miner-detail-backend
Apr 23, 2026
Merged

feat(activity): per-miner success/failure bulk actions detail [backend]#25
rongxin-liu merged 50 commits intomainfrom
feat/22-activity-log-per-miner-detail-backend

Conversation

@rongxin-liu
Copy link
Copy Markdown
Contributor

@rongxin-liu rongxin-liu commented Apr 22, 2026

Backend implementation for #22.

Problem

Bulk actions appear in the activity log as a single row written at initiation time, with no record of which miners succeeded or failed. Operators have no audit trail for diagnosing recurring per-miner issues after a partial bulk failure.

What ships

Backend only. Frontend follow-up tracked separately.

  • Two-event append-only activity model. The existing initiated row (event_type <event_type>, e.g. reboot) is joined by a new completion row (event_type <event_type>.completed, e.g. reboot.completed) written at batch FINISHED time, carrying the success / failure split in its metadata. Both rows share a new batch_id TEXT column on activity_log so the UI can collapse them into one logical entry. The initiated event_type is unchanged from the pre-existing string so existing UI filters keep working.
  • Per-miner detail persisted and served via a dedicated RPC. Each device's outcome (SUCCESS / FAILED) plus a bounded failure reason is stored on command_on_device_log. A new GetCommandBatchDeviceResults Connect RPC returns the org-scoped slice so the activity-log UI can drill into which miners failed and why. Capped at 5000 rows per response (SQL-enforced LIMIT); the response signals truncated=true when the cap fires.

Architecture

flowchart LR
    subgraph activity [activity_log - append-only]
        A1["event_type: reboot<br/>metadata: batch_id"]
        A2["event_type: reboot.completed<br/>metadata: success_count, failure_count"]
    end

    subgraph audit [command audit tables]
        B1["command_batch_log<br/>+ organization_id column"]
        B2["command_on_device_log<br/>+ error_info column"]
    end

    subgraph rpc [On-demand detail]
        R1["GetCommandBatchDeviceResults"]
    end

    A2 -- "batch_id column" --> R1
    R1 --> B1
    R1 --> B2
Loading

Migrations

Three non-destructive migrations. All columns are added nullable; no data backfill.

  • 000034_add_error_info_to_command_on_device_logerror_info TEXT NULL for per-device failure reason.
  • 000035_add_batch_id_to_activity_logbatch_id TEXT NULL + partial btree + partial unique index on (batch_id, event_type) WHERE event_type LIKE '%.completed' for finalizer retry idempotency.
  • 000036_add_org_id_to_command_batch_logorganization_id BIGINT NULL + FK to organization(id) + partial index. Pre-migration rows stay NULL and are intentionally invisible to GetCommandBatchDeviceResults (closed-by-default); no cross-org guess is ever made for historical data.

Proto

proto/minercommand/v1/command.proto:

  • message CommandBatchDeviceResult — typed per-device row (identifier, status, updated_at, optional error_message).
  • message GetCommandBatchDeviceResultsRequest / Response — response carries batch_identifier, command_type, status, total_count, success_count, failure_count, device_results, details_pruned, truncated.
  • rpc GetCommandBatchDeviceResults on MinerCommandService.

Typed messages rather than free-form Struct fields so the API is forward-compatible with future per-device fields.

proto/activity/v1/activity.proto:

  • optional string batch_id = 15 added to ActivityEntry so the frontend can group initiated + completed rows by batch. Wired through entryToProto() in the activity handler. Commit 1f08c3d.

Where to review

Per-device error persistence

  • server/internal/domain/command/error_info.go, server/internal/domain/command/execution_service.go
  • sanitizedErrorInfo surfaces only fleeterror.FleetError messages; everything else (plugin / device / transport errors) collapses to "command failed". Raw err.Error() still reaches slog.Error for admins.

Activity finalizer

  • server/internal/domain/command/service.gobuildActivityCompletedCallback, composeFinalizers, initializeStatusUpdateRoutine
  • Every command RPC composes the activity finalizer alongside any existing callback (e.g. DownloadLogs bundle). composeFinalizers is best-effort: all callbacks run even if earlier ones fail, so a bundle-builder failure cannot block the activity write. Already-succeeded callbacks are skipped on retry.
  • Three-phase callback lifecycle: up to maxCallbackRetries pre-mark attempts → batch marked FINISHED (client unblocked) → maxCallbackRetries more post-mark attempts before giving up. This prevents both client stalls and permanent audit gaps from transient DB failures.

RPC

  • server/internal/domain/command/service.goGetCommandBatchDeviceResults
  • server/internal/handlers/command/handler.go — thin pass-through
  • Single WithTransaction covers header + counts + rows, opened at sql.LevelRepeatableRead + ReadOnly: true so all three reads share one snapshot (default READ COMMITTED would let a concurrent worker write shift counts away from device_results). Org-scoped via command_batch_log.organization_id. details_pruned is true only when DevicesCount > 0 && status == FINISHED && len(rows) == 0. truncated signals that the SQL cap fired. total_count comes from the batch header while success_count/failure_count are command_on_device_log aggregates — they sum to total_count only for FINISHED batches with intact rows; in-progress batches have success + failure <= total.

Snapshot isolation for bundled reads

  • server/internal/infrastructure/db/with_transaction.goWithTransaction / WithTransactionNoResult now accept an optional variadic *sql.TxOptions (additive; no existing callers change).
  • The only caller passing opts today is GetCommandBatchDeviceResults. firstTxOpts(opts) threads the value through withTransactionWithRetryexecuteTransactiondb.BeginTx(ctx, txOpts); nil preserves the historical default (READ COMMITTED).
  • PG read-only REPEATABLE READ takes one snapshot at the first non-locking statement and never hits serialization failures, so the existing IsRetryablePostgresError retry path is unaffected.

Session / scheduler attribution

  • server/internal/domain/session/models.gosession.Actor alias + ActorScheduler constant
  • server/internal/domain/schedule/processor.goschedulerContext sets Actor: session.ActorScheduler
  • actorTypeFromSession in service.go maps it into activitymodels.ActorScheduler so scheduler-triggered completion rows are attributed correctly.

Session org_id guard

  • server/internal/domain/command/service.gosaveCommandBatchLogToDB
  • The single funnel for every command batch write refuses OrganizationID <= 0, so new writes can't land with organization_id = NULL regardless of entry point. Lives at the top of the helper before getDevicesCount, so unit tests without a wired deviceStore still hit the check cleanly. The column stays nullable for historical compatibility, but the invariant "new writes always have a real org" holds.

Idempotency swallow

  • server/internal/domain/stores/sqlstores/activity.goisCompletedBatchDuplicate
  • Only swallows the specific uq_activity_log_batch_completed unique-violation (pinned by SQLSTATE + constraint name), so a future unique constraint added to activity_log cannot be swallowed by accident.

SQL

  • server/sqlc/queries/activity.sqlInsertActivityLog carries batch_id
  • server/sqlc/queries/command.sqlCreateCommandBatchLog carries organization_id; UpsertCommandOnDeviceLog carries error_info; new GetBatchHeaderForOrg and ListBatchDeviceResults (with SQL LIMIT).
  • server/sqlc/queries/queue.sql — reap queries now return error_info so it propagates to command_on_device_log.

Security review findings (all addressed)

Severity Finding Resolution
HIGH GetBatchHeaderForOrg could leak across orgs if scoped via JOIN user_organization ON user_id = created_by: any org the creator belonged to would see the detail. Dedicated organization_id column captured from session.Info.OrganizationID at write time; pre-migration rows stay NULL and are invisible to the query (closed-by-default, no backfill guess).
MEDIUM No SQL cap on ListBatchDeviceResults; cost scaled with total batch size. Query takes max_rows; RPC passes cap + 1 so truncation detection stays cheap.
MEDIUM Raw plugin/device error text persisted and surfaced to org members. sanitizedErrorInfo restricts persisted text to server-authored FleetError messages; everything else is "command failed".

Post-review follow-ups (all addressed)

Source Finding Resolution
Copilot Proto doc claimed counts "stay consistent with total_count" and details_pruned covered "neither is available". Neither matched actual behavior. Proto comments rewritten to spell out the split-source counts and the FINISHED-with-pruned-rows scope. PR mermaid + text aligned to the real event_type strings (reboot / reboot.completed). Commit b94806e + regen d619282.
Human (mcharles-square) CommandBatchDeviceResult.device_identifier was labeled "(e.g. MAC address)" but it's a UUID v7 (VARCHAR(36) populated by id.GenerateID). MAC lives in device.mac_address (VARCHAR(17)). Comment rewritten to // Stable device UUID, preserved for soft-deleted devices. with regen in the same commit. Wider doc-accuracy sweep found no other factual errors. Commit 486c23e.
Codex (P2) Bundled reads in GetCommandBatchDeviceResults ran at PG default READ COMMITTED; a concurrent worker write to command_on_device_log between statements could make counts disagree with device_results. WithTransaction / WithTransactionNoResult extended with variadic *sql.TxOptions; RPC passes {LevelRepeatableRead, ReadOnly: true}. Commit 9941983.
Codex (P2) Finalizer writes *.completed activity row BEFORE MarkCommandBatchFinished*, so a transient mark failure leaves the activity log briefly ahead of batch_log.status. Refactored initializeStatusUpdateRoutine to a three-phase lifecycle: callback gets maxCallbackRetries pre-mark attempts, then the batch is marked FINISHED (unblocking the client), then maxCallbackRetries more post-mark attempts before giving up. composeFinalizers switched to best-effort so independent callbacks (bundle + activity) don't block each other.

Test plan

  • go build ./... and go vet ./... clean
  • just _lint-server0 issues
  • Short tests green in command, activity, activity/models, session, pools, stores/interfaces, stores/sqlstores, infrastructure/db, handlers/command
  • DB-backed integration tests (Postgres 15) cover:
    • TestGetCommandBatchDeviceResults_{HappyPath, NotFoundForCrossOrg, DetailsPrunedWhenFinishedWithNoRows, NotPrunedWhilePending, NotPrunedForEmptySelector, TruncatesLargeBatchesWithConsistentCounts, InvalidArgumentOnEmptyIdentifier}
    • TestMarkQueueMessageStatusTransitions / reaper integration — error_info persistence on failed and reaped rows
    • TestSanitizedErrorInfo_*, TestBoundedErrorInfo_*, table-driven TestIsCompletedBatchDuplicate (pinned to the specific constraint name)
    • TestProcessCommand_RejectsMissingOrgID / TestReapplyCurrentPoolsWithWorkerNames_RejectsMissingOrgID
    • TestHandler_GetCommandBatchDeviceResults_{PassesThroughHappyPath, PropagatesInvalidArgument, PropagatesNotFound}

Rollout notes

  • Historical command_batch_log rows stay with organization_id = NULL permanently and are invisible to GetCommandBatchDeviceResults (closed-by-default). New writes always carry a real org because the service guard refuses OrganizationID <= 0. Operators who want the drill-down for a specific old batch can repair it manually: UPDATE command_batch_log SET organization_id = $ORG WHERE uuid = $UUID.
  • Migrations 000035 / 000036 build indexes without CONCURRENTLY, so they hold ACCESS EXCLUSIVE briefly. Run during a low-traffic window on large activity_log / command_batch_log tables.

Introduces migration 000033 and extends UpsertCommandOnDeviceLog to accept
an optional error string. Existing callers pass the zero-value (NULL) until
M2 wires worker and reaper failure reasons through. This is the first piece
of issue #22 -- per-miner success/failure detail in the activity log.

Also adds docs/issue-22-activity-log-per-miner-detail.md as a progress
tracker for the rest of the milestones.

Made-with: Cursor
Propagate the worker's error string (truncated to 2048 runes with a suffix
marker) and the reaper reason into command_on_device_log.error_info. The
reap queries now RETURN error_info so the Go code doesn't have to duplicate
the reason strings baked into the SQL updates.

The new helper lives in error_info.go with unit coverage for empty / short
/ long / multibyte inputs and the nil-error convention.

Made-with: Cursor
Migration 000034 adds a dedicated batch_id TEXT column plus two partial
indexes: a btree for lookups by batch and a unique index scoped to
'*.completed' event types. The unique index guards the finalizer (added
in M6) against duplicate completion rows when the crash-recovery
reconciler re-runs.

InsertActivityLog and ListActivityLogs now read/write the column.

Made-with: Cursor
…(M4)

Add BatchID to models.Event and models.Entry, and have SQLActivityStore.Insert
populate / SQLActivityStore.List read the new column.

When the command finalizer (M6) writes a '<event_type>.completed' row for a
batch, a concurrent or reconciler-driven retry can collide with the partial
unique index added in M3. We detect that specific pgconn unique_violation and
silently return nil, so the finalizer remains idempotent.

Table-driven unit tests cover isCompletedBatchDuplicate for all relevant
combinations (nil event, missing batch id, wrong event suffix, wrong pg code,
non-pg error, and the legitimate duplicate case).

Made-with: Cursor
Extends the proto schema with CommandBatchDeviceResult plus the new
GetCommandBatchDeviceResults request/response, and adds the supporting
sqlc queries:

 - GetBatchHeaderForOrg returns the batch header only when the creator
   belongs to the caller's organization, giving the RPC tenant isolation
   without a dedicated org_id column on command_batch_log.
 - ListBatchDeviceResults returns per-device outcomes with deterministic
   ordering so the UI can paginate or virtualize without reshuffling.

Handler and service gain stubs that route into a CodeUnimplemented error
until M8 wires the real logic.

Made-with: Cursor
Add composeFinalizers and buildActivityCompletedCallback to command.Service,
and wire them into every command RPC. The finalizer runs when the batch
reaches FINISHED, reads the success/failure counts via
GetBatchStatusAndDeviceCounts, and logs a '<event_type>.completed' activity
row with summary metadata (batch_id, total_count, success_count,
failure_count) and result=success|failure.

Per issue #22, both the '<event_type>' (initiated) and
'<event_type>.completed' rows share the same batch_id column so the UI can
collapse them into a single timeline entry. The partial unique index plus
the store's duplicate swallow keeps completion writes idempotent so the
reconciler (M7) can safely backfill after crashes.

For DownloadLogs, the finalizer is composed so the ZIP bundle callback runs
first and the completion row appears only after the bundle is persisted.
CompletedEventSuffix moves into the activity models package so the command
and sqlstores packages share one source of truth.

Made-with: Cursor
Periodically backfills '<event_type>.completed' activity rows for
batches that reached FINISHED without one (server crash between mark
and write, or finalizer exhausting its 3 retries).

Only acts on batches that have an initiated activity row, so
internally-triggered batches (worker-name reapply) stay out of the
activity timeline. Attribution for the backfilled row is copied from
the initiated row so the completion event matches the original action
even when the triggering session is long gone.

Idempotency is enforced by the partial unique index on
(batch_id, event_type) -- concurrent reconciler runs across replicas
are safe. The metadata includes "reconciled": true so operators can
distinguish reconciler output from live finalizer output.

ActivityLogger interface keeps the reconciler decoupled from the
activity domain service. Wired into main.go in M11.

Made-with: Cursor
Replaces the M5 stub with a real implementation that:
 - scopes by session organization via GetBatchHeaderForOrg (NotFound for
   unknown batches or those in another org),
 - reads per-device rows via ListBatchDeviceResults, counts success/failure,
 - maps the status enum to the proto string ("success" / "failed"),
 - caps rows at maxBatchDeviceResults = 5000 to bound memory and response
   size; paging left as a follow-up once real fleets trip the cap,
 - sets details_pruned = true only when the batch is FINISHED with zero
   per-device rows, so mid-run responses stay non-pruned.

Small helper deviceCommandStatusToProto has table-driven unit coverage.

Made-with: Cursor
Introduces the first retention policy for command-audit tables. Before
this, command_batch_log, command_on_device_log, and queue_message grew
unbounded; only TimescaleDB hypertables had retention.

RetentionCleaner runs on a configurable interval (default 1h) and drains
three tables in FK-safe order, each with a per-query LIMIT
(DeleteBatchLimit, default 1000) and the drain loop keeps going until a
page comes back short. Defaults: queue_message terminals 30d,
command_on_device_log 90d, command_batch_log 180d.

The drain loop itself is unit-tested for happy path, early error, and
ctx cancellation; wiring into main.go lands in M11.

Made-with: Cursor
Activity log had no retention policy. The command audit cleaner (M9)
sweeps command_batch_log and its children; this change adds a parallel
sweep for activity_log itself, with longer defaults that match its
audit/compliance purpose (1 year vs 180 days).

 - ActivityStore.DeleteOlderThan added to the interface, implemented
   by SQLActivityStore and regenerated in the mock store.
 - activity.Config (new) exposes ActivityLogRetention, CleanupInterval
   (6h), and DeleteBatchLimit (1000) via env vars.
 - activity.RetentionCleaner drains activity_log in a loop using the
   same short-page termination rule as the command cleaner.

Unit-tested via a hand-rolled stub store covering the happy loop,
mid-pass error propagation, and ctx cancellation.

Made-with: Cursor
…M11)

Starts three new goroutines with graceful shutdown:

 - CompletionReconciler backfills missing '<event_type>.completed'
   activity rows for FINISHED batches whose finalizer didn't write one
   (server crash or retry exhaustion).
 - Command RetentionCleaner ages out queue_message terminals,
   command_on_device_log, and command_batch_log rows.
 - Activity RetentionCleaner ages out activity_log rows on a longer
   schedule (1 year default vs 180 days for batch headers).

activity.Config is added to the top-level fleetd Config with the
ACTIVITY_ env prefix so operators can tune retention independently of
the command settings.

Made-with: Cursor
Integration tests (require live Postgres):
 - mark_status_integration_test.go: two new subtests that assert
   error_info is persisted on FAILED rows with the worker's reason,
   and left NULL on SUCCESS rows.
 - reaper_integration_test.go: reap reason propagates end-to-end from
   queue_message.error_info into command_on_device_log.error_info.
 - retention_integration_test.go: full FK-safe drain across
   queue_message, command_on_device_log, and command_batch_log.
 - reconciler_integration_test.go: end-to-end backfill with correct
   counts and attribution; idempotent re-run; no backfill when there
   is no initiated row (worker-name reapply case).

Also exposes RunOnceForTest on both RetentionCleaner and
CompletionReconciler so the integration tests can drive a single pass
without starting the goroutine.

All added tests pass locally against the fleet test Postgres.

Made-with: Cursor
Closes the multi-org tenant leak flagged in the review. Previously
GetBatchHeaderForOrg authorized by joining through user_organization on
the batch creator, which meant any org that creator ever belonged to
could see the batch. A dual-org user could read batches they never ran
against.

This commit:
 - Adds command_batch_log.organization_id via migration 000035, with a
   partial index, FK to organization, and a deterministic backfill from
   user_organization. Column stays nullable for safe deploy; a follow-up
   migration can flip to NOT NULL after soak.
 - Captures the caller's session.OrganizationID in CreateCommandBatchLog
   so every new batch records its owning org.
 - Rewrites GetBatchHeaderForOrg to filter on the column directly.
   No more creator-membership JOIN.
 - Threads orgID through saveCommandBatchLogToDB and the two integration
   seed helpers that create command_batch_log rows.

Made-with: Cursor
…hDeviceResults (R2/H1)

Counts returned to the UI no longer derive from the capped per-device
slice, so success_count + failure_count always equals total_count even
for batches with more than maxBatchDeviceResults (5000) devices.

Source of truth: GetBatchStatusAndDeviceCounts, which already aggregates
per-status counts from command_on_device_log (the same query the
finalizer and the stream use).

A new bool field Truncated is set when the raw list exceeds the cap so
the UI can surface a "showing first N of total_count" affordance and
point operators at the CSV/ZIP export for the full list.

Pagination is deferred to a follow-up; the cap is plenty for the
activity-log drill-down today.

Made-with: Cursor
A batch whose selector matched no miners reaches FINISHED with
devices_count=0 and never generates per-device rows. The original
details_pruned condition (FINISHED + rows==0) false-positive'd it,
misleading the UI into saying "Per-miner details are no longer
available" when in fact there was nothing to show.

Tighten the guard with header.DevicesCount > 0 so we only claim
pruning for batches that actually had miners.

Made-with: Cursor
Adds service-level integration tests for every branch the review
highlighted as unguarded:

 - Happy path: FINISHED batch with one SUCCESS and one FAILED row.
   Asserts error_info propagates onto the proto's error_message.
 - Cross-org NotFound (exercises R1/M1): a caller in Org B cannot read
   a batch whose organization_id is Org A.
 - details_pruned=true when FINISHED + devices_count>0 + no codl rows.
 - details_pruned=false when PENDING/PROCESSING.
 - details_pruned=false when devices_count=0 (empty-selector batch),
   covers the R4/M2 guard.
 - Truncated=true when more than maxBatchDeviceResults (5000) codl
   rows exist; asserts that aggregate-sourced counts still sum to
   total_count (the H1 invariant).
 - InvalidArgument for empty batch_identifier.

Also adds two handler-level tests that wire the real command.Service
through handler.Handler.GetCommandBatchDeviceResults and verify the
FleetError round-trips unchanged for the interceptor to map.

While touching the RPC, the sql.ErrNoRows -> NotFound translation now
happens inside the WithTransaction callback since the retry wrapper
re-wraps non-FleetError errors as Internal, which had been erasing the
sentinel before the service's errors.Is check could see it. Pre-existing
bug that the new cross-org test caught.

Made-with: Cursor
Adds session.Info.Actor so internal orchestrators can declare what kind
of principal is acting on the request. The schedulerContext now sets it
to "scheduler"; user / API-key traffic leaves it empty.

logCommandActivity and buildActivityCompletedCallback translate the
string into an activity ActorType via actorTypeFromSession (scheduler ->
ActorScheduler, system -> ActorSystem, anything else -> empty so the
activity service's existing defaulting to ActorUser applies).

This removes the hardcoded ActorUser in the finalizer, which previously
forced scheduler-driven completions to misattribute as user events and
was inconsistent with the reconciler's backfill (which copies the
initiated row's actor type verbatim).

Table-driven unit test in actor_test.go covers nil, empty, scheduler,
system, and unknown-actor cases.

Made-with: Cursor
… (R6/M5)

Before this change, a transient failure in the activity finalizer (e.g.
a DB blip while reading GetBatchStatusAndDeviceCounts) caused the
initializeStatusUpdateRoutine retry loop to re-run the entire composed
chain, including the DownloadLogs bundle builder. The bundle builder
happens to be idempotent in practice today, but its ScheduleBatchLogCleanup
timer got stacked on each retry and the pattern is a latent footgun for
any future non-idempotent callback.

The composed closure now tracks per-callback completion: a retry picks
up only the callbacks that haven't yet succeeded. Single-callback
composition still short-circuits with zero tracking overhead.

New unit test TestComposeFinalizers_RetrySkipsAlreadySucceededCallbacks
drives the failing-callback path and asserts the earlier callback runs
exactly once across retries.

Made-with: Cursor
defaultRetentionConfig in both the command and activity retention
cleaners now:

 - Enforces FK-safe ordering in the command cleaner: if
   QueueMessageRetention > DeviceLogRetention, clamp it down (and log a
   warning) so the cleaner doesn't delete batches while their queue-
   message siblings are still live. Same for DeviceLog vs BatchLog.
 - Clamps CleanupInterval to a one-minute minimum so a typo doesn't
   turn the cleaner into a busy loop.
 - Clamps DeleteBatchLimit to a 50,000 maximum so a misconfiguration
   can't issue a million-row delete per tick.

All clamps log through slog.Warn so operators see the drift. Five new
unit tests cover each clamp path; the existing happy-path tests are
adjusted where they crossed the new minimums.

Made-with: Cursor
…(R8/L3)

Adds activity.Service.LogStrict(ctx, event) error, and routes the
command finalizer (buildActivityCompletedCallback) and the reconciler's
backfill through it. The existing Service.Log remains fire-and-forget
for everyone else.

Before: a transient DB failure while writing the .completed row was
only visible via slog.Error inside Log; the status routine's retry
loop never saw it. The row could silently go missing until the
reconciler caught up 2-7 minutes later.

After: transient failures bubble up through composeFinalizers (R6)
so the status routine retries only the failing leg. The partial unique
index + store-level duplicate swallow keep retries idempotent.

ActivityLogger interface in the reconciler gains LogStrict; backfillOne
uses it so systematic errors (e.g. FK violations) also surface.

Made-with: Cursor
L1: Drop the "stubbed here" sentence on Handler.GetCommandBatchDeviceResults
    now that it's the real implementation.

L2: Add sync.Mutex around Start/Stop lifecycle in both retention cleaners
    and the completion reconciler so back-to-back start+stop calls don't
    race on cancel/done fields.

L5: Wrap the three queries in GetCommandBatchDeviceResults (header, counts,
    per-device rows) inside a single db.WithTransaction callback so they
    come from a consistent snapshot. Prevents a retention sweep between
    queries from skewing the counts vs. the returned device list.

L6: Populate testutil.TestUser.ExternalUserID from the fixture's generated
    value and use it in reconciler_integration_test.go instead of the
    synthetic "ext-"+username literal. Keeps test attributions in sync
    with what the auth interceptor would produce.

L7: Add a file-level comment to retention_test.go explaining the
    DeleteBatchLimit convention the drain-loop tests rely on.

M4 note: Prominent operational comment at the top of migration 000034
    explaining the ACCESS EXCLUSIVE lock behavior of its CREATE INDEX
    statements and pointing at the CONCURRENTLY follow-up. No code
    change, just operator guidance.

Made-with: Cursor
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 22, 2026

🔐 Codex Security Review

Note: This is an automated security-focused code review generated by Codex.
It should be used as a supplementary check alongside human review.
False positives are possible - use your judgment.

Scope summary

  • Reviewed pull request diff only (62a7f3eff1c2856bbbbe2296c52d90c3b2af54d4...eadbb8fac56d5947c7c8ce66b00a3ad796e171c2, exact PR three-dot diff)
  • Model: gpt-5.4

💡 Click "edited" above to see previous reviews for this PR.


Review Summary

Overall Risk: HIGH

Findings

[HIGH] The new batch-results auth boundary trusts the batch header, not the devices in the batch

  • Category: Auth
  • Location: server/internal/domain/command/service.go:307, server/sqlc/queries/command.sql:102, server/sqlc/queries/command.sql:125
  • Description: GetCommandBatchDeviceResults authorizes access by checking only command_batch_log.organization_id, and that column is now written directly from the caller’s session when the batch is created. The endpoint then returns every command_on_device_log row for that batch without re-checking device ownership. That is only safe if batch creation guarantees all selected devices belong to the same org, but the explicit-identifier selection paths elsewhere still rely on org-unscoped identifier resolution.
  • Impact: A mixed-org batch will now leak foreign device identifiers, statuses, and per-device error messages back to the caller through the new RPC. Because the same mixed batch also drives command execution, this turns the new results API into a confirmation/disclosure path for cross-org command abuse.
  • Recommendation: Enforce org ownership before batch creation for every explicit device-identifier path, and add defense-in-depth on the read side by joining device.org_id (or equivalent ownership data) when listing batch device results so mixed-org rows are rejected instead of disclosed.

[MEDIUM] devices_count is persisted before real device resolution, so the new totals can be wrong or orphaned

  • Category: Reliability
  • Location: server/internal/domain/command/service.go:290, server/internal/domain/command/service.go:514, server/internal/domain/command/service.go:1199
  • Description: The batch row still saves devices_count before processCommand() resolves the actual deviceIDs. For all_devices, the helper counts paired devices without mirroring the full selector filters; for explicit lists, duplicates and unknown identifiers are counted before resolution. The new results RPC returns that persisted header value as total_count, and the completion callback/streaming code also treat it as authoritative.
  • Impact: Filtered batches can report incorrect totals, and explicit selections that resolve to no queue rows can still leave behind a batch header plus a polling goroutine. That produces permanently pending batches, misleading totals, and missing .completed activity rows.
  • Recommendation: Resolve and deduplicate the actual device IDs first, persist len(resolvedIDs) into command_batch_log.devices_count, and reject or immediately finish zero-device batches instead of starting the status polling routine.

[MEDIUM] The new 5,000-row cap is bypassed by an unbounded JSON_AGG

  • Category: Reliability
  • Location: server/sqlc/queries/command.sql:78, server/internal/domain/command/service.go:216, server/internal/domain/command/service.go:1134
  • Description: GetCommandBatchDeviceResults caps ListBatchDeviceResults to 5001 rows, but it also calls GetBatchStatusAndDeviceCounts, which still JSON_AGGs every success and failure device identifier in the batch. The new activity completion finalizer reuses that same heavy query even though it only needs counts.
  • Impact: Large batches can still force full-batch JSON materialization in PostgreSQL and in the Go driver, defeating the new truncation safeguard. In practice this can cause slow result RPCs, excess memory/CPU, and timeout-driven loss of .completed activity entries on large fleets.
  • Recommendation: Split out a lightweight counts/status query with no identifier aggregation and use that for the results RPC and completion finalizer. Keep the full identifier arrays only on paths that truly need them, or paginate them separately.

Notes


Generated by Codex Security Review |
Triggered by: @rongxin-liu |
Review workflow run

@github-actions github-actions Bot added the javascript Pull requests that update javascript code label Apr 22, 2026
…(R10/H1)

M10 added DeleteOlderThan to the ActivityStore interface and regenerated the
generated mock, but the hand-rolled noopActivityStore / spyActivityStore in
pools/service_test.go wasn't updated, leaving `go test
./internal/domain/pools/...` with a build failure.

This is the only hand-rolled fake that had to be updated -- the activity
package's own stubStore and the generated MockActivityStore were handled
in M10.

Made-with: Cursor
…/M1..M4)

M1: Retention clamp cascades correctly when multiple invariants are
violated. Previous order (Queue<=Device, then Device<=Batch) could leave
Queue > Device after a later Device<=Batch clamp reduced Device below the
already-clamped Queue. New order applies the outermost invariant first
(Batch -> Device -> Queue) so every subsequent clamp sees the already-
reduced upstream value. Added
TestDefaultRetentionConfig_ClampsCascadingViolations covering
Queue=1y / Device=30d / Batch=10d; asserts all three end up at 10d.

M2: Added db.WithReadOnlyTransaction backed by sql.TxOptions
{ReadOnly: true, Isolation: RepeatableRead}. GetCommandBatchDeviceResults
now uses it so the three queries (header, aggregate counts, per-device
rows) share a true snapshot; the comment's "consistent snapshot" claim is
now accurate and a concurrent retention sweep cannot perturb the response.

M3: Migration 000035 gained the same operational note as 000034 about the
backfill UPDATE and CREATE INDEX lock durations, plus the pointer to the
CONCURRENTLY follow-up.

M4: Dropped the unreachable sql.ErrNoRows branch on
GetBatchStatusAndDeviceCounts. Under the new read-only REPEATABLE READ
transaction (M2), if the header query has resolved, the aggregate query
cannot return ErrNoRows -- the snapshot is stable and GROUP BY cbl.id
always yields one row when cbl exists. The comment documents the
reasoning.

Made-with: Cursor
@rongxin-liu rongxin-liu changed the title feat(activity): per-miner success/failure detail for bulk actions feat(activity): per-miner success/failure detail for bulk actions [backend] Apr 22, 2026
@rongxin-liu rongxin-liu changed the title feat(activity): per-miner success/failure detail for bulk actions [backend] feat(activity): per-miner success/failure bulk actions detail [backend] Apr 22, 2026
Strip out work that isn't required to satisfy issue #22 (per-miner
success/failure detail for bulk actions) so the diff is feasible for a
single human review pass. Net impact: server/proto diff drops from
~5,481 additions across 57 files to ~2,641 additions across 38 files.

What is kept
- 3 migrations (error_info on command_on_device_log, batch_id on
  activity_log with partial unique index, organization_id on
  command_batch_log + single-org backfill + FK)
- Per-device error_info persistence via sanitizedErrorInfo (plugin text
  still collapses to a generic marker; FleetError DebugMessage still
  passes through, documented as a follow-up)
- activity_log batch_id linkage + idempotent '*.completed' finalizer,
  with isCompletedBatchDuplicate now pinned to the
  uq_activity_log_batch_completed constraint
- GetCommandBatchDeviceResults RPC, org-scoped via
  command_batch_log.organization_id, with SQL-enforced LIMIT,
  Truncated / DetailsPruned flags, and empty-selector guard
- Session organization_id validation on processCommand and
  ReapplyCurrentPoolsWithWorkerNames
- session.Actor alias + ActorScheduler for scheduler-triggered
  completion row attribution

What is removed (deferred to follow-up PRs)
- Command retention cleaner (queue_message / command_on_device_log /
  command_batch_log), activity retention cleaner, their configs, SQL
  DELETE queries, ActivityStore.DeleteOlderThan, main.go wiring, and
  the noopActivityStore DeleteOlderThan stub in pools/service_test.go
- Completion reconciler, ListFinishedBatchesWithoutCompletion,
  ResultUnknown activity state, ActorSystem session constant, and
  reconciler config fields (ReconcilerInterval / GracePeriod /
  MaxBatches)
- Aged-out RPC fallback: GetLatestCompletedActivityForBatch,
  synthesizeAgedOutResponse, parseCompletionMetadata, and the three
  header-aged-out integration tests. Header miss now returns NotFound
  directly.
- WithReadOnlyTransaction infrastructure and GetBatchDeviceCounts
  narrow query; the RPC and finalizer now use WithTransaction +
  GetBatchStatusAndDeviceCounts

Verification
- go build ./... clean
- go vet ./... clean
- golangci-lint: 0 issues
- Short tests green in the touched packages: command, activity,
  session, pools, stores/interfaces, stores/sqlstores,
  infrastructure/db, handlers/command

Made-with: Cursor
rongxin-liu and others added 3 commits April 22, 2026 22:39
The migration added both the column and a best-effort backfill for
historical rows. Per the review, historical batches can simply stay
NULL: they predate the feature, have no per-miner detail worth showing,
and remain invisible to GetBatchHeaderForOrg (closed-by-default). With
no backfill, the multi-org-ambiguity rationale, the integration test
that exercised it, and a handful of comments all go away.

- Strip the UPDATE ... WHERE block from migration 000036; keep the
  nullable column, the partial index (WHERE organization_id IS NOT NULL),
  and the FK unchanged
- Trim the multi-paragraph operator note to the essentials
- Delete org_backfill_integration_test.go (it existed solely to pin the
  single-org backfill rule)
- Trim the "precondition for the NOT NULL flip" rationale in
  saveCommandBatchLogToDB and processCommand; refuse
  OrganizationID <= 0 remains as a runtime guard against NULL writes
- Trim the "legacy backfill miss / closed-by-default posture" comment
  on GetBatchHeaderForOrg

No Go call sites change (organization_id stays nullable in SQL, so
CreateCommandBatchLogParams.OrganizationID remains sql.NullInt64 and
new writes continue to set Valid: true).

Verified:
- go build ./... clean
- go vet ./... clean
- golangci-lint: 0 issues
- Short tests green in touched packages (command, activity,
  stores/sqlstores, handlers/command)

Made-with: Cursor
Self-review follow-ups on feat/22; no functional behavior change.

- Lowercase batch-level status in GetCommandBatchDeviceResults so it matches
  the device-level "success"/"failed" convention already in the response
  ("finished" | "pending" | "processing"). Frontend is not yet wired.
- Centralize the session-org invariant: the OrganizationID <= 0 guard now
  lives at the top of saveCommandBatchLogToDB (the single funnel for every
  command batch write). Removes the duplicated check in processCommand and
  ReapplyCurrentPoolsWithWorkerNames so any future entry point inherits
  the invariant automatically. Existing validation tests still pass
  unchanged because the error message flows through the wrapper.
- Drop the redundant "batch_id" key from buildActivityCompletedCallback's
  metadata map; the first-class BatchID column introduced in this PR
  already carries it and grep confirms no reader.
- Extend Service.LogStrict godoc to name the uq_activity_log_batch_completed
  swallow contract so callers understand why idempotent finalizer retries
  look like success.
- Trim trailing blank lines in activity.sql and command.sql.
- sqlc regen picked up a stale GetBatchHeaderForOrg doc comment that drifted
  when the backfill scaffolding was dropped earlier on this branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rongxin-liu rongxin-liu marked this pull request as ready for review April 23, 2026 03:15
@rongxin-liu rongxin-liu requested a review from a team as a code owner April 23, 2026 03:15
Copilot AI review requested due to automatic review settings April 23, 2026 03:15
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds backend support for per-miner success/failure detail on bulk command batches by persisting per-device outcomes (with bounded failure reasons), emitting a terminal “completed” activity row, and exposing the per-device slice via a new org-scoped RPC.

Changes:

  • Add batch_id to activity_log and write a *.completed activity event at batch FINISHED time with idempotent retry behavior.
  • Persist per-device error_info on command_on_device_log and propagate it from worker failures / reaper paths.
  • Add organization_id to command_batch_log and implement GetCommandBatchDeviceResults (SQL + service + handler), scoped by org and capped at 5000 rows.

Reviewed changes

Copilot reviewed 30 out of 38 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
server/sqlc/queries/queue.sql Return error_info from reap queries so it can be persisted to the audit log.
server/sqlc/queries/command.sql Add organization_id, persist error_info, and add org-scoped header + capped device-result queries.
server/sqlc/queries/activity.sql Add batch_id to insert/list queries to link initiated/completed activity rows.
server/migrations/000034_add_error_info_to_command_on_device_log.up.sql Add nullable error_info column for per-device failure reasons.
server/migrations/000034_add_error_info_to_command_on_device_log.down.sql Rollback for error_info column addition.
server/migrations/000035_add_batch_id_to_activity_log.up.sql Add nullable batch_id plus partial index + partial unique index for completion idempotency.
server/migrations/000035_add_batch_id_to_activity_log.down.sql Rollback for batch_id column and indexes.
server/migrations/000036_add_org_id_to_command_batch_log.up.sql Add nullable organization_id with FK + partial index for org scoping.
server/migrations/000036_add_org_id_to_command_batch_log.down.sql Rollback for organization_id FK/index/column.
server/internal/handlers/command/handler.go Add handler passthrough for GetCommandBatchDeviceResults.
server/internal/handlers/command/handler_test.go Add DB-backed handler tests for happy path + error passthrough.
server/internal/domain/stores/sqlstores/activity.go Insert batch_id and swallow only the specific completion-row unique violation for idempotent retries.
server/internal/domain/stores/sqlstores/activity_test.go Add unit tests pinning the duplicate-detection logic to SQLSTATE + constraint name.
server/internal/domain/session/models.go Introduce session.Actor to tag synthesized sessions (scheduler).
server/internal/domain/schedule/processor.go Set session.Info.Actor for scheduler-driven contexts to attribute activity correctly.
server/internal/domain/command/service.go Add completion finalizer chaining, enforce org-id on new batch writes, and implement GetCommandBatchDeviceResults.
server/internal/domain/command/execution_service.go Persist sanitized error_info for worker failures and propagate reaper error info to device logs.
server/internal/domain/command/error_info.go Add bounded + sanitized error persistence helpers.
server/internal/domain/command/error_info_test.go Unit tests for truncation + sanitization behavior.
server/internal/domain/command/finalizers_test.go Unit tests for callback composition + retry semantics.
server/internal/domain/command/actor_test.go Unit tests for mapping session.Actor to activity actor type.
server/internal/domain/command/session_org_validation_test.go Unit tests ensuring missing org IDs are rejected before DB access.
server/internal/domain/command/results_mapping_test.go Unit tests for device-status-to-proto mapping.
server/internal/domain/command/integration_helpers_test.go Shared DB integration helpers for results tests.
server/internal/domain/command/results_integration_test.go DB-backed integration tests for org scoping, pruning semantics, and truncation.
server/internal/domain/command/mark_status_integration_test.go Integration tests asserting error_info persistence on FAILED and NULL on SUCCESS.
server/internal/domain/command/reaper_integration_test.go Integration test asserting reaper reason persists into the audit row.
server/internal/domain/activity/service.go Add LogStrict API so callers (finalizers) can observe persistence errors.
server/internal/domain/activity/models/models.go Add CompletedEventSuffix and BatchID fields to activity models.
proto/minercommand/v1/command.proto Define new messages and GetCommandBatchDeviceResults RPC contract.
server/generated/sqlc/queue.sql.go Generated: sqlc output updated for new error_info return columns.
server/generated/sqlc/models.go Generated: sqlc models updated for new columns.
server/generated/sqlc/db.go Generated: sqlc prepared statements updated for new queries.
server/generated/sqlc/command.sql.go Generated: sqlc output updated for org/header/results queries and new params.
server/generated/sqlc/activity.sql.go Generated: sqlc output updated for batch_id column.
server/generated/grpc/minercommand/v1/minercommandv1connect/command.connect.go Generated: Connect stubs updated for new RPC.
client/src/protoFleet/api/generated/minercommand/v1/command_pb.ts Generated: TS protobuf output updated for new messages/RPC.

Comment thread server/internal/domain/command/service.go
Comment thread proto/minercommand/v1/command.proto Outdated
Comment thread proto/minercommand/v1/command.proto Outdated
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a99c56c1b5

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread server/internal/domain/command/service.go
rongxin-liu and others added 2 commits April 22, 2026 23:57
Address review feedback on PR #25:

- proto: aggregate counts only sum to total_count for FINISHED batches
  with intact rows; total_count is from the batch header, success/failure
  from codl aggregates.
- proto: details_pruned doc no longer claims to cover the missing-header
  case (that surfaces as connect.CodeNotFound, not as this response).
- service: document the intentional pre-mark ordering of
  buildActivityCompletedCallback and the transactional follow-up.

Doc-only; no behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Picks up the comment changes from b94806e in the generated Go and TS
proto files. Comment-only; no behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread proto/minercommand/v1/command.proto Outdated
Comment thread server/internal/domain/command/error_info.go
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b45121d906

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread server/internal/domain/command/service.go
rongxin-liu and others added 2 commits April 23, 2026 00:22
The CommandBatchDeviceResult.device_identifier comment claimed the value
was "(e.g. MAC address)", but device.device_identifier is a UUID v7
(VARCHAR(36), populated by id.GenerateID via uuid.NewV7). MAC lives in a
separate device.mac_address column. Updated the proto source comment and
regenerated the dependent Go and TS pb files.

Doc-only; no behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The RPC bundles three sequential reads (header, counts, rows) in one
db.WithTransaction, but the helper opens transactions with the default
isolation (READ COMMITTED) which gives each statement its own snapshot.
A worker writing command_on_device_log between statements could make
success_count/failure_count disagree with device_results, and could
shift truncated/details_pruned to a mismatched basis.

Fix: extend WithTransaction / WithTransactionNoResult with optional
*sql.TxOptions (variadic last arg, additive — no existing callers need
to change), and pass {Isolation: LevelRepeatableRead, ReadOnly: true}
from this RPC so all three reads share one snapshot.

Read-only RR transactions in PostgreSQL don't get serialization
failures, so the existing retry path is unchanged.

Addresses Codex P2 review: discussion_r3128315071.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 99419830c9

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread server/internal/domain/command/service.go
Comment thread server/internal/domain/command/service.go
The batch_id column was added to activity_log (migration 000035) and
the domain model already carries it, but entryToProto() dropped it
before the wire. Add optional string batch_id = 15 to ActivityEntry
and map it in the handler so the frontend can group initiated +
completed rows by batch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1f08c3d8e9

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread server/internal/domain/command/error_info.go
Comment thread server/internal/domain/command/service.go
composeFinalizers previously stopped on the first callback error,
which meant a DownloadLogs bundle-builder failure could prevent the
activity finalizer from ever running — leaving a permanent gap in the
completion audit trail. Switch to best-effort: run every callback even
if earlier ones fail, return the first error so the retry loop still
retries on the next tick. Already-succeeded callbacks remain skipped.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e42fdee437

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread server/internal/domain/command/service.go
initializeStatusUpdateRoutine permanently abandoned the onFinished
callback after maxCallbackRetries, creating an unrecoverable audit gap
when transient DB failures lasted ≥3 polling ticks. The batch was
marked FINISHED but the *.completed activity row was never written.

Decouple the two concerns with a batchMarkedFinished flag: pre-finish
retries (up to maxCallbackRetries) block FINISHED marking as before;
once exhausted, the batch is marked FINISHED to unblock the client,
and the callback gets up to maxCallbackRetries more post-finish
attempts before giving up.

Addresses Codex P1 discussion_r3128491939.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@rongxin-liu rongxin-liu merged commit acdfb94 into main Apr 23, 2026
60 checks passed
@rongxin-liu rongxin-liu deleted the feat/22-activity-log-per-miner-detail-backend branch April 23, 2026 05:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

client javascript Pull requests that update javascript code server shared

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants