
feat(uptime): 3-state uptime chart#17

Merged
ankitgoswami merged 11 commits into main from ankitg/uptime-snapshots
Apr 22, 2026

Conversation

Contributor

@ankitgoswami ankitgoswami commented Apr 21, 2026

(screenshot of the new uptime chart)

Summary

  • New miner_state_snapshots TimescaleDB hypertable storing one row per paired device per tick with a 4-state code (offline/sleeping/broken/hashing). Classifier CASE mirrors CountMinersByState, so the chart and the live FleetHealth legend agree.
  • Uptime chart rendered as 3 segments: Hashing / Needs attention / Not hashing. Drill-through buttons deep-link to the matching miner-list filters via the existing encodeFilterToURL helper.
  • Writer is a single INSERT ... SELECT per 60 s tick — one round-trip, no per-org loop, no Go-side array packing.
  • Read query (GetMinerStateSnapshots) does DISTINCT ON (bucket, device_identifier) + SUM-by-state and applies the device filter at the CTE level, so uptime_status_counts honors the full device_selector (fleet, group, rack, arbitrary list). Drops in cleanly for future group/rack overview pages without server changes.
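The shared classifier described above can be sketched as a pure function. The real logic lives in the snapshot INSERT's SQL CASE and in CountMinersByState; the state codes, field names, and ordering of checks here are assumptions for illustration only.

```go
package main

import "fmt"

// State codes assumed for illustration; the real mapping lives in the
// snapshot INSERT's CASE expression and CountMinersByState.
const (
	StateOffline  = 0
	StateSleeping = 1
	StateBroken   = 2
	StateHashing  = 3
)

// classify shows the idea of one shared classifier: a device counts as
// "hashing" only when it is online, active, authenticated, and error-free;
// anything less healthy falls into a worse bucket.
func classify(online, active, authNeeded bool, openErrors int) int {
	switch {
	case !online:
		return StateOffline
	case !active:
		return StateSleeping
	case authNeeded || openErrors > 0:
		return StateBroken
	default:
		return StateHashing
	}
}

func main() {
	fmt.Println(classify(true, true, false, 0)) // healthy miner: hashing
	fmt.Println(classify(true, true, false, 2)) // open errors: broken
}
```

Because both the snapshot writer and the live legend apply the same decision tree, a miner can never appear healthy in one view and broken in the other.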

fixes #12

Test plan

  • go build, go vet, golangci-lint run clean on server/internal/... and server/cmd/...
  • go test ./internal/domain/telemetry/... ./internal/handlers/telemetry/... green
  • Unit tests for writeFleetStateSnapshot (happy path + log-on-error) and lifecycle tests (Start/Stop) cover the tick fires
  • cd client && npx tsc --noEmit, npm run lint, vitest run src/protoFleet/features/dashboard green
  • Manual: bring up local stack, verify miner_state_snapshots grows by ~fleet-size rows/min, open dashboard, confirm all 3 segments render, confirm drill-throughs land on the right filter URLs, confirm totals match FleetHealth legend

🤖 Generated with Claude Code

ankitgoswami and others added 3 commits April 21, 2026 13:01
Introduce a per-org miner_state_snapshots hypertable written every 60s by
fleetStateSnapshotRoutine, and route the Uptime chart through it. Chart and
live legend now share one classifier (CountMinersByState), and the chart
renders three segments: Hashing / Needs attention / Not hashing, with
drill-throughs to the corresponding miner-list filters.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Collapses generated protobuf, sqlc, and mockgen output in GitHub diffs and
excludes them from language stats.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Matches the canonical formatter output (protobuf-es + prettier + goimports).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ankitgoswami ankitgoswami force-pushed the ankitg/uptime-snapshots branch from 7c1eca5 to c6b9e02 on April 21, 2026 20:01

github-actions Bot commented Apr 21, 2026

🔐 Codex Security Review

Note: This is an automated security-focused code review generated by Codex.
It should be used as a supplementary check alongside human review.
False positives are possible - use your judgment.

Scope summary

  • Reviewed pull request diff only (b8945fb1b4db0f6486a33066a2640c9263246d2f...ff020b23d39417819b502028c9c0c7fa83c949a4, exact PR three-dot diff)
  • Model: gpt-5.4



Review Summary

Overall Risk: HIGH

Findings

[HIGH] Per-device minute snapshots will create unsustainable TimescaleDB growth

  • Category: Reliability
  • Location: server/sqlc/queries/miner_state_snapshots.sql:9
  • Description: The new snapshot pipeline inserts one row per paired miner on every snapshot tick, and the migration retains those rows for one year. At the default 60s interval this is 1,440,000 rows/day per 1,000 miners, or 525,600,000 rows/year per 1,000 miners. On fleets of a few thousand miners, this becomes billions of rows, and every tick also runs a full fleet-wide join over device, device_pairing, device_status, and errors.
  • Impact: Large fleets can drive sustained write amplification, compression backlog, storage exhaustion, and increasingly slow snapshot queries, eventually degrading or taking down the primary database.
  • Recommendation: Store aggregated org/bucket counts instead of per-device snapshots, or record only state transitions and downsample aggressively. If per-device history is truly required, shorten retention substantially and add capacity limits/observability before rollout.
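The growth figures in this finding follow from straightforward arithmetic (one row per miner per tick); a quick sketch restating them:

```go
package main

import "fmt"

// rowsPerDay computes snapshot rows written per day for a fleet, assuming
// one row per paired miner per tick, as the finding describes.
func rowsPerDay(miners, tickSeconds int) int {
	ticksPerDay := 86400 / tickSeconds // seconds per day / tick interval
	return miners * ticksPerDay
}

func main() {
	perDay := rowsPerDay(1000, 60)
	fmt.Println(perDay)       // 1440000 rows/day per 1,000 miners
	fmt.Println(perDay * 365) // 525600000 rows/year per 1,000 miners
}
```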

[MEDIUM] Switching uptime reads to the new snapshot table drops all pre-deploy history

  • Category: Reliability
  • Location: server/internal/infrastructure/timescaledb/telemetry_store.go:556
  • Description: Authenticated combined-metrics reads now overwrite the old raw/hourly/daily uptime counts with uptimeCountsForQuery(), which reads only miner_state_snapshots. The new migration only creates the table and policies; it does not backfill from existing telemetry/status history.
  • Impact: After deployment, any 24h/7d/30d/etc. uptime window is mostly zero-filled until enough new snapshots accumulate, and all uptime history from before the rollout is effectively unavailable.
  • Recommendation: Backfill the snapshot table before switching reads, or keep the previous uptime-count path as a fallback until snapshot coverage exists for the requested window.

[MEDIUM] Historical GetCombinedMetrics requests always include a synthetic “now” bucket

  • Category: gRPC
  • Location: server/internal/domain/telemetry/service.go:1027
  • Description: GetCombinedMetrics() now unconditionally appends a live uptime bar built from time.Now() and GetMinerStateCounts(). That happens even when the caller supplied an explicit historical end_time.
  • Impact: Pull responses for past ranges are no longer time-bounded: the last uptime bar can fall outside the requested interval and reveal current fleet state in what should be a historical snapshot.
  • Recommendation: Only append the live bar when the query is effectively “up to now” (for example, no end_time or end_time within one snapshot interval of now), or keep this behavior limited to the streaming endpoint.

[MEDIUM] Uptime history disappears whenever there are no metric rows, even if snapshot rows exist

  • Category: Reliability
  • Location: server/internal/infrastructure/timescaledb/telemetry_store.go:547
  • Description: The raw/hourly/daily combined-metrics paths still return immediately on len(data) == 0 / len(rows) == 0 before they reach the new uptime snapshot query path. The same pattern repeats in the hourly and daily branches at lines 595 and 643.
  • Impact: Exactly the cases the uptime panel should explain, such as fully offline fleets or filters matching miners with no recent telemetry, collapse to an empty history or only the synthetic live bar despite miner_state_snapshots containing the needed status history.
  • Recommendation: Decouple uptime retrieval from metric retrieval so status-history responses can be returned even when the metric series is empty.

Notes

  • I did not find new JWT handling, SQL injection, command injection, pool-hijack, or plugin-boundary issues in the changed code.
  • The new tests cover device-scoped live counts and UI bucket wiring, but I did not see coverage for migration/backfill behavior, end_time in the past, or empty-metric windows.

Generated by Codex Security Review |
Triggered by: @ankitgoswami |
Review workflow run

ankitgoswami and others added 5 commits April 21, 2026 13:41
Switch miner_state_snapshots from per-org aggregate rows to per-device state
rows, and aggregate at read time. This lets uptime_status_counts honor the
full device_selector (fleet, group, rack, arbitrary list) with the same
CountMinersByState classifier the live legend uses.

Other simplifications that fall out:
- Replace the per-org Go loop in the writer with a single INSERT...SELECT that
  materializes state for the whole paired fleet in one round-trip.
- Drop OrgIDLister, SQLOrganizationStore, and MinerStateCountsRow — no longer
  needed now that the snapshot query itself produces the rows.
- GetMinerStateSnapshots aggregates with DISTINCT ON per (bucket, device) so
  bucket sums stay truthful regardless of snapshot alignment, and applies the
  device-identifier filter at the CTE level.

Requires a local `just db-reset` since the unmerged migration 000033 is
edited in place (schema change, not a new migration).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
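The DISTINCT ON semantics described above can be illustrated in plain Go: within each (bucket, device) pair, only the latest snapshot row votes, and the per-bucket sums are taken over those winners. The struct fields and bucket representation here are assumptions for illustration; the real logic is the SQL in GetMinerStateSnapshots.

```go
package main

import "fmt"

// snapshot is a simplified stand-in for a miner_state_snapshots row.
type snapshot struct {
	bucket int64  // bucket start (unix seconds); assumed representation
	device string // device identifier
	tickAt int64  // snapshot time within the bucket
	state  int    // classifier state code
}

// sumByState mimics DISTINCT ON (bucket, device) ... ORDER BY tick DESC
// followed by SUM-per-state: one vote per device per bucket, latest row wins,
// so bucket totals stay truthful regardless of snapshot alignment.
func sumByState(rows []snapshot) map[int64]map[int]int {
	type key struct {
		bucket int64
		device string
	}
	latest := map[key]snapshot{}
	for _, r := range rows {
		k := key{r.bucket, r.device}
		if prev, ok := latest[k]; !ok || r.tickAt > prev.tickAt {
			latest[k] = r
		}
	}
	counts := map[int64]map[int]int{}
	for _, r := range latest {
		if counts[r.bucket] == nil {
			counts[r.bucket] = map[int]int{}
		}
		counts[r.bucket][r.state]++
	}
	return counts
}

func main() {
	rows := []snapshot{
		{0, "m1", 10, 3}, // superseded by the later tick below
		{0, "m1", 50, 2}, // m1's winning row for bucket 0
		{0, "m2", 10, 3},
	}
	// m1 counts once (state 2), m2 once (state 3); no double counting.
	fmt.Println(sumByState(rows)[0])
}
```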
Tighten prose across the snapshot code so comments explain non-obvious why
only. Cross-link the classifier between InsertMinerStateSnapshot and
CountMinersByState so future edits don't drift silently.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
UI labels only; underlying buckets (hashing / broken / not-hashing) unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…fier

- Hourly and daily metric paths normalize start/end to complete buckets
  before querying aggregates. Thread the normalized range into
  uptimeCountsForQuery so the uptime series stops at the same edge instead
  of leaking a partial current hour/day.

- InsertMinerStateSnapshot's hashing branch now requires ACTIVE + not
  auth-needed + no open errors (matching CountMinersByState exactly). A new
  state code 4 ("unknown") catches any status that doesn't fit the four
  named buckets so historical rows don't misclassify those devices as
  healthy. The read query already sums only 0..3, so unknown rows are
  excluded from every bucket — same as CountMinersByState.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ankitgoswami ankitgoswami marked this pull request as ready for review April 21, 2026 22:03
@ankitgoswami ankitgoswami changed the title from "feat(uptime): snapshot-based 3-state uptime chart" to "feat(uptime): 3-state uptime chart" Apr 21, 2026
Append a synthetic UptimeStatusCount at time.Now() to GetCombinedMetrics (and
keep the streaming path parity) using a live CountMinersByState call. The
chart's right-most "live" bar now reflects current fleet state instead of
lagging up to one snapshot interval behind the FleetHealth legend.

Factor the filter-build + counts-fetch into appendLiveUptimeBar so the unary
and streaming paths share one code path (previously only streaming populated
MinerStateCounts).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor

@flesher flesher left a comment


One small thing I noticed, otherwise looks good

Comment thread on client/src/protoFleet/features/dashboard/components/UptimePanel/UptimePanel.tsx (Outdated)
The uptime chart's Not-Hashing bucket (and FleetHealth's Sleeping segment)
count miners in either MAINTENANCE or INACTIVE as sleeping — matching the
CountMinersByState classifier. The URL filter plumbing only mapped
sleeping -> INACTIVE, so miners in MAINTENANCE counted toward the dashboard
bars but vanished from the filtered list page when users drilled through.

Fix symmetrically in three translators so sleeping <-> {INACTIVE, MAINTENANCE}
everywhere:
- encodeFilterToURL: either status now maps to status=sleeping.
- parseFilterFromURL: sleeping expands to both device statuses.
- MinerList dropdown: Sleeping chip selects both device statuses.

Call sites updated to pass both statuses for self-documentation:
- UptimePanel "Not hashing" drill-through.
- FleetHealth Sleeping segment link.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
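The symmetric mapping described in this commit lives in TypeScript helpers (encodeFilterToURL, parseFilterFromURL); this Go sketch only illustrates the invariant being fixed, with assumed status strings:

```go
package main

import "fmt"

// Device statuses assumed for illustration.
const (
	StatusInactive    = "INACTIVE"
	StatusMaintenance = "MAINTENANCE"
)

// encodeStatus maps either sleeping-equivalent device status to the single
// URL token "sleeping". Before the fix, only INACTIVE mapped this way, so
// MAINTENANCE miners vanished from drilled-through list pages.
func encodeStatus(status string) string {
	if status == StatusInactive || status == StatusMaintenance {
		return "sleeping"
	}
	return status
}

// parseStatus expands the "sleeping" token back to both device statuses,
// keeping dashboard counts and filtered lists in agreement.
func parseStatus(token string) []string {
	if token == "sleeping" {
		return []string{StatusInactive, StatusMaintenance}
	}
	return []string{token}
}

func main() {
	fmt.Println(encodeStatus(StatusMaintenance)) // sleeping
	fmt.Println(parseStatus("sleeping"))         // both statuses
}
```

The round trip is what matters: every status that counts toward the dashboard's sleeping bucket must survive encode-then-parse.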
@ankitgoswami ankitgoswami requested a review from a team as a code owner April 22, 2026 16:53
@github-actions github-actions Bot added the dependencies and javascript labels Apr 22, 2026
@ankitgoswami ankitgoswami merged commit a2f3e8e into main Apr 22, 2026
60 checks passed
@ankitgoswami ankitgoswami deleted the ankitg/uptime-snapshots branch April 22, 2026 17:41

Labels

client, dependencies, javascript, server, shared

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: Uptime chart "not hashing" miner count is inaccurate

2 participants