
feat(uptime): 3-state uptime chart#17

Merged
ankitgoswami merged 11 commits into main from ankitg/uptime-snapshots
Apr 22, 2026

Conversation

Contributor

@ankitgoswami ankitgoswami commented Apr 21, 2026

(screenshot of the new uptime chart)

Summary

  • New miner_state_snapshots TimescaleDB hypertable storing one row per paired device per tick with a 4-state code (offline/sleeping/broken/hashing). Classifier CASE mirrors CountMinersByState, so the chart and the live FleetHealth legend agree.
  • Uptime chart rendered as 3 segments: Hashing / Needs attention / Not hashing. Drill-through buttons deep-link to the matching miner-list filters via the existing encodeFilterToURL helper.
  • Writer is a single INSERT ... SELECT per 60 s tick — one round-trip, no per-org loop, no Go-side array packing.
  • Read query (GetMinerStateSnapshots) does DISTINCT ON (bucket, device_identifier) + SUM-by-state and applies the device filter at the CTE level, so uptime_status_counts honors the full device_selector (fleet, group, rack, arbitrary list). Drops in cleanly for future group/rack overview pages without server changes.
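The shared classifier described above can be sketched as a pure function. The real logic lives in the snapshot INSERT's SQL CASE and in CountMinersByState; the state codes, field names, and ordering of checks here are assumptions for illustration only.

```go
package main

import "fmt"

// State codes assumed for illustration; the real mapping lives in the
// snapshot INSERT's CASE expression and CountMinersByState.
const (
	StateOffline  = 0
	StateSleeping = 1
	StateBroken   = 2
	StateHashing  = 3
)

// classify shows the idea of one shared classifier: a device counts as
// "hashing" only when it is online, active, authenticated, and error-free;
// anything less healthy falls into a worse bucket.
func classify(online, active, authNeeded bool, openErrors int) int {
	switch {
	case !online:
		return StateOffline
	case !active:
		return StateSleeping
	case authNeeded || openErrors > 0:
		return StateBroken
	default:
		return StateHashing
	}
}

func main() {
	fmt.Println(classify(true, true, false, 0)) // healthy miner: hashing
	fmt.Println(classify(true, true, false, 2)) // open errors: broken
}
```

Because both the snapshot writer and the live legend apply the same decision tree, a miner can never appear healthy in one view and broken in the other.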

fixes #12

Test plan

  • go build, go vet, golangci-lint run clean on server/internal/... and server/cmd/...
  • go test ./internal/domain/telemetry/... ./internal/handlers/telemetry/... green
  • Unit tests for writeFleetStateSnapshot (happy path + log-on-error) and lifecycle tests (Start/Stop) cover the tick fires
  • cd client && npx tsc --noEmit, npm run lint, vitest run src/protoFleet/features/dashboard green
  • Manual: bring up local stack, verify miner_state_snapshots grows by ~fleet-size rows/min, open dashboard, confirm all 3 segments render, confirm drill-throughs land on the right filter URLs, confirm totals match FleetHealth legend

🤖 Generated with Claude Code

ankitgoswami and others added 3 commits April 21, 2026 13:01
Introduce a per-org miner_state_snapshots hypertable written every 60s by
fleetStateSnapshotRoutine, and route the Uptime chart through it. Chart and
live legend now share one classifier (CountMinersByState), and the chart
renders three segments: Hashing / Needs attention / Not hashing, with
drill-throughs to the corresponding miner-list filters.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Collapses generated protobuf, sqlc, and mockgen output in GitHub diffs and
excludes them from language stats.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Matches the canonical formatter output (protobuf-es + prettier + goimports).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ankitgoswami ankitgoswami force-pushed the ankitg/uptime-snapshots branch from 7c1eca5 to c6b9e02 on April 21, 2026 20:01

github-actions Bot commented Apr 21, 2026

🔐 Codex Security Review

Note: This is an automated security-focused code review generated by Codex.
It should be used as a supplementary check alongside human review.
False positives are possible - use your judgment.

Scope summary

  • Reviewed pull request diff only (b8945fb1b4db0f6486a33066a2640c9263246d2f...ff020b23d39417819b502028c9c0c7fa83c949a4, exact PR three-dot diff)
  • Model: gpt-5.4



Review Summary

Overall Risk: HIGH

Findings

[HIGH] Per-device minute snapshots will create unsustainable TimescaleDB growth

  • Category: Reliability
  • Location: server/sqlc/queries/miner_state_snapshots.sql:9
  • Description: The new snapshot pipeline inserts one row per paired miner on every snapshot tick, and the migration retains those rows for one year. At the default 60s interval this is 1,440,000 rows/day per 1,000 miners, or 525,600,000 rows/year per 1,000 miners. On fleets of a few thousand miners, this becomes billions of rows, and every tick also runs a full fleet-wide join over device, device_pairing, device_status, and errors.
  • Impact: Large fleets can drive sustained write amplification, compression backlog, storage exhaustion, and increasingly slow snapshot queries, eventually degrading or taking down the primary database.
  • Recommendation: Store aggregated org/bucket counts instead of per-device snapshots, or record only state transitions and downsample aggressively. If per-device history is truly required, shorten retention substantially and add capacity limits/observability before rollout.
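The growth figures in this finding follow from straightforward arithmetic (one row per miner per tick); a quick sketch restating them:

```go
package main

import "fmt"

// rowsPerDay computes snapshot rows written per day for a fleet, assuming
// one row per paired miner per tick, as the finding describes.
func rowsPerDay(miners, tickSeconds int) int {
	ticksPerDay := 86400 / tickSeconds // seconds per day / tick interval
	return miners * ticksPerDay
}

func main() {
	perDay := rowsPerDay(1000, 60)
	fmt.Println(perDay)       // 1440000 rows/day per 1,000 miners
	fmt.Println(perDay * 365) // 525600000 rows/year per 1,000 miners
}
```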

[MEDIUM] Switching uptime reads to the new snapshot table drops all pre-deploy history

  • Category: Reliability
  • Location: server/internal/infrastructure/timescaledb/telemetry_store.go:556
  • Description: Authenticated combined-metrics reads now overwrite the old raw/hourly/daily uptime counts with uptimeCountsForQuery(), which reads only miner_state_snapshots. The new migration only creates the table and policies; it does not backfill from existing telemetry/status history.
  • Impact: After deployment, any 24h/7d/30d/etc. uptime window is mostly zero-filled until enough new snapshots accumulate, and all uptime history from before the rollout is effectively unavailable.
  • Recommendation: Backfill the snapshot table before switching reads, or keep the previous uptime-count path as a fallback until snapshot coverage exists for the requested window.

[MEDIUM] Historical GetCombinedMetrics requests always include a synthetic “now” bucket

  • Category: gRPC
  • Location: server/internal/domain/telemetry/service.go:1027
  • Description: GetCombinedMetrics() now unconditionally appends a live uptime bar built from time.Now() and GetMinerStateCounts(). That happens even when the caller supplied an explicit historical end_time.
  • Impact: Pull responses for past ranges are no longer time-bounded: the last uptime bar can fall outside the requested interval and reveal current fleet state in what should be a historical snapshot.
  • Recommendation: Only append the live bar when the query is effectively “up to now” (for example, no end_time or end_time within one snapshot interval of now), or keep this behavior limited to the streaming endpoint.

[MEDIUM] Uptime history disappears whenever there are no metric rows, even if snapshot rows exist

  • Category: Reliability
  • Location: server/internal/infrastructure/timescaledb/telemetry_store.go:547
  • Description: The raw/hourly/daily combined-metrics paths still return immediately on len(data) == 0 / len(rows) == 0 before they reach the new uptime snapshot query path. The same pattern repeats in the hourly and daily branches at lines 595 and 643.
  • Impact: Exactly the cases the uptime panel should explain, such as fully offline fleets or filters matching miners with no recent telemetry, collapse to an empty history or only the synthetic live bar despite miner_state_snapshots containing the needed status history.
  • Recommendation: Decouple uptime retrieval from metric retrieval so status-history responses can be returned even when the metric series is empty.

Notes

  • I did not find new JWT handling, SQL injection, command injection, pool-hijack, or plugin-boundary issues in the changed code.
  • The new tests cover device-scoped live counts and UI bucket wiring, but I did not see coverage for migration/backfill behavior, end_time in the past, or empty-metric windows.

Generated by Codex Security Review |
Triggered by: @ankitgoswami |
Review workflow run

ankitgoswami and others added 5 commits April 21, 2026 13:41
Switch miner_state_snapshots from per-org aggregate rows to per-device state
rows, and aggregate at read time. This lets uptime_status_counts honor the
full device_selector (fleet, group, rack, arbitrary list) with the same
CountMinersByState classifier the live legend uses.

Other simplifications that fall out:
- Replace the per-org Go loop in the writer with a single INSERT...SELECT that
  materializes state for the whole paired fleet in one round-trip.
- Drop OrgIDLister, SQLOrganizationStore, and MinerStateCountsRow — no longer
  needed now that the snapshot query itself produces the rows.
- GetMinerStateSnapshots aggregates with DISTINCT ON per (bucket, device) so
  bucket sums stay truthful regardless of snapshot alignment, and applies the
  device-identifier filter at the CTE level.

Requires a local `just db-reset` since the unmerged migration 000033 is
edited in place (schema change, not a new migration).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
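The DISTINCT ON semantics described above can be illustrated in plain Go: within each (bucket, device) pair, only the latest snapshot row votes, and the per-bucket sums are taken over those winners. The struct fields and bucket representation here are assumptions for illustration; the real logic is the SQL in GetMinerStateSnapshots.

```go
package main

import "fmt"

// snapshot is a simplified stand-in for a miner_state_snapshots row.
type snapshot struct {
	bucket int64  // bucket start (unix seconds); assumed representation
	device string // device identifier
	tickAt int64  // snapshot time within the bucket
	state  int    // classifier state code
}

// sumByState mimics DISTINCT ON (bucket, device) ... ORDER BY tick DESC
// followed by SUM-per-state: one vote per device per bucket, latest row wins,
// so bucket totals stay truthful regardless of snapshot alignment.
func sumByState(rows []snapshot) map[int64]map[int]int {
	type key struct {
		bucket int64
		device string
	}
	latest := map[key]snapshot{}
	for _, r := range rows {
		k := key{r.bucket, r.device}
		if prev, ok := latest[k]; !ok || r.tickAt > prev.tickAt {
			latest[k] = r
		}
	}
	counts := map[int64]map[int]int{}
	for _, r := range latest {
		if counts[r.bucket] == nil {
			counts[r.bucket] = map[int]int{}
		}
		counts[r.bucket][r.state]++
	}
	return counts
}

func main() {
	rows := []snapshot{
		{0, "m1", 10, 3}, // superseded by the later tick below
		{0, "m1", 50, 2}, // m1's winning row for bucket 0
		{0, "m2", 10, 3},
	}
	// m1 counts once (state 2), m2 once (state 3); no double counting.
	fmt.Println(sumByState(rows)[0])
}
```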
Tighten prose across the snapshot code so comments explain non-obvious why
only. Cross-link the classifier between InsertMinerStateSnapshot and
CountMinersByState so future edits don't drift silently.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
UI labels only; underlying buckets (hashing / broken / not-hashing) unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…fier

- Hourly and daily metric paths normalize start/end to complete buckets
  before querying aggregates. Thread the normalized range into
  uptimeCountsForQuery so the uptime series stops at the same edge instead
  of leaking a partial current hour/day.

- InsertMinerStateSnapshot's hashing branch now requires ACTIVE + not
  auth-needed + no open errors (matching CountMinersByState exactly). A new
  state code 4 ("unknown") catches any status that doesn't fit the four
  named buckets so historical rows don't misclassify those devices as
  healthy. The read query already sums only 0..3, so unknown rows are
  excluded from every bucket — same as CountMinersByState.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ankitgoswami ankitgoswami marked this pull request as ready for review April 21, 2026 22:03
@ankitgoswami ankitgoswami changed the title from "feat(uptime): snapshot-based 3-state uptime chart" to "feat(uptime): 3-state uptime chart" Apr 21, 2026
Append a synthetic UptimeStatusCount at time.Now() to GetCombinedMetrics (and
keep the streaming path parity) using a live CountMinersByState call. The
chart's right-most "live" bar now reflects current fleet state instead of
lagging up to one snapshot interval behind the FleetHealth legend.

Factor the filter-build + counts-fetch into appendLiveUptimeBar so the unary
and streaming paths share one code path (previously only streaming populated
MinerStateCounts).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor

@flesher flesher left a comment


One small thing I noticed, otherwise looks good

Comment thread on client/src/protoFleet/features/dashboard/components/UptimePanel/UptimePanel.tsx (Outdated)
The uptime chart's Not-Hashing bucket (and FleetHealth's Sleeping segment)
count miners in either MAINTENANCE or INACTIVE as sleeping — matching the
CountMinersByState classifier. The URL filter plumbing only mapped
sleeping -> INACTIVE, so miners in MAINTENANCE counted toward the dashboard
bars but vanished from the filtered list page when users drilled through.

Fix symmetrically in three translators so sleeping <-> {INACTIVE, MAINTENANCE}
everywhere:
- encodeFilterToURL: either status now maps to status=sleeping.
- parseFilterFromURL: sleeping expands to both device statuses.
- MinerList dropdown: Sleeping chip selects both device statuses.

Call sites updated to pass both statuses for self-documentation:
- UptimePanel "Not hashing" drill-through.
- FleetHealth Sleeping segment link.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
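The symmetric mapping described in this commit lives in TypeScript helpers (encodeFilterToURL, parseFilterFromURL); this Go sketch only illustrates the invariant being fixed, with assumed status strings:

```go
package main

import "fmt"

// Device statuses assumed for illustration.
const (
	StatusInactive    = "INACTIVE"
	StatusMaintenance = "MAINTENANCE"
)

// encodeStatus maps either sleeping-equivalent device status to the single
// URL token "sleeping". Before the fix, only INACTIVE mapped this way, so
// MAINTENANCE miners vanished from drilled-through list pages.
func encodeStatus(status string) string {
	if status == StatusInactive || status == StatusMaintenance {
		return "sleeping"
	}
	return status
}

// parseStatus expands the "sleeping" token back to both device statuses,
// keeping dashboard counts and filtered lists in agreement.
func parseStatus(token string) []string {
	if token == "sleeping" {
		return []string{StatusInactive, StatusMaintenance}
	}
	return []string{token}
}

func main() {
	fmt.Println(encodeStatus(StatusMaintenance)) // sleeping
	fmt.Println(parseStatus("sleeping"))         // both statuses
}
```

The round trip is what matters: every status that counts toward the dashboard's sleeping bucket must survive encode-then-parse.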
@ankitgoswami ankitgoswami requested a review from a team as a code owner April 22, 2026 16:53
@github-actions github-actions Bot added the dependencies and javascript labels Apr 22, 2026
@ankitgoswami ankitgoswami merged commit a2f3e8e into main Apr 22, 2026
60 checks passed
@ankitgoswami ankitgoswami deleted the ankitg/uptime-snapshots branch April 22, 2026 17:41

Labels

client, dependencies, javascript, server, shared

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: Uptime chart "not hashing" miner count is inaccurate

2 participants