feat(uptime): hourly rollup + 30d raw retention for uptime snapshots #35
Draft
ankitgoswami wants to merge 4 commits into main
Conversation
Adds miner_state_snapshots_hourly, a continuous aggregate on the per-device snapshot table keyed by (hour, org, device) with last(state, time) — one row per device per hour. Chart queries whose window predates the 30-day raw retention now read the rollup instead of the raw hypertable, which is orders of magnitude cheaper at scale and unblocks shrinking raw retention.

Schema
- CREATE MATERIALIZED VIEW ... WITH (timescaledb.continuous)
- Refresh policy matching the existing hourly CAGGs (30m schedule, 1h offset)
- Compression segmentby=device_identifier, retention 1 year
- Raw miner_state_snapshots retention dropped from 1y to 30d

Read path
- uptimeSnapshotRawRetention const with 1h slack
- useHourlyRollup(startTime) routes <=30d to raw, older to rollup
- queryUptimeRaw / queryUptimeHourly share the result-shape translation

Storage rough numbers (compressed, 10x): 1k miners -> ~4 GB raw + rollup; 10k -> ~40 GB; 100k -> ~400 GB. Raw retention cut to 30d saves 12x hot storage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Split uptimeSnapshotRawRetentionPolicy (the migration contract) from
uptimeSnapshotRawRetention (router cutoff with slack) so a future retention
bump is a single const change and the coupling to the migration is called
out explicitly.
- Replace the magic sql.NullString{"1", true} sentinel used to activate
narg filters with a named package-level var.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
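The sentinel refactor can be sketched as follows; the variable name is an assumption (the commit only says "a named package-level var"), and the call-site field is hypothetical:

```go
package main

import (
	"database/sql"
	"fmt"
)

// filterEnabled is a named replacement for the opaque positional literal
// sql.NullString{"1", true} that activates sqlc narg() filters. The name
// documents intent at call sites; it is not necessarily the one used in
// the codebase.
var filterEnabled = sql.NullString{String: "1", Valid: true}

func main() {
	// A call site now reads e.g. params.OnlyActive = filterEnabled
	// instead of an unexplained struct literal.
	fmt.Println(filterEnabled.Valid, filterEnabled.String)
}
```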
🔐 Codex Security Review
Review Summary
Overall Risk: HIGH

Findings
- [HIGH] Migration can permanently discard existing uptime history before the new rollups are populated
- [MEDIUM] New uptime routing chooses the source table by window width only, so historical short-range queries can hit already-purged data

Generated by Codex Security Review
Follow the existing CAGG routing idiom (selectDataSource + String + switch) for the raw-vs-hourly decision on miner_state_snapshots. Makes the two data-source routers in this file visually parallel and extends cleanly if a future daily rollup is added. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extends the miner_state_snapshots rollup story to match the two-tier CAGG
pattern used by device_metrics and device_status:
- New miner_state_snapshots_daily CAGG sourced from raw (no CAGG-on-CAGG;
matches the codebase convention). Policies mirror device_metrics_daily:
start_offset=7d, end_offset=1d, schedule=6h, compression=7d, retention=3y.
- Hourly CAGG retention adjusted to 3 months to match device_metrics_hourly.
- Router rewritten to duration-based selectUptimeDataSource, mirroring
selectDataSource (<=24h raw, <=10d hourly, >10d daily). Drops the
start-time-age consts I had; the new shape is symmetric with the other
router a few lines above in the same file.
- New GetMinerStateSnapshotsDaily sqlc query + sqlc.yaml column overrides.
- queryUptimeDaily parallels queryUptimeHourly; switch-dispatch clamps
bucket duration per source (1 min / 1 h / 1 d).
Migration file renamed to 000034_miner_state_snapshots_rollups.{up,down}.sql
since it now owns both CAGGs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
- `miner_state_snapshots_hourly` — a TimescaleDB continuous aggregate on the per-device snapshot hypertable merged in feat(uptime): 3-state uptime chart #17, keyed by `(hour, org, device)` with `last(state, time)`. One row per device per hour.
- `miner_state_snapshots` retention from 1y → 30d. Anything older is served from the rollup.
- `GetMinerStateSnapshotsHourly` with the same aggregation shape as the raw query (`DISTINCT ON` → SUM-by-state), and a router in the uptime read path that picks raw vs. rollup based on `startTime`.

Why
Follow-up to #17's code-review concern that per-device minute snapshots grow unsustainably on large fleets. Post-merge math:
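The table with the raw math did not survive the copy, but the row counts behind it follow from figures stated elsewhere in this PR (one snapshot per minute per device, 30d raw retention, 1y hourly rollup); byte sizes are deliberately not re-derived here:

```go
package main

import "fmt"

const (
	minutesPerDay = 24 * 60
	rawDays       = 30  // new raw retention
	oldRawDays    = 365 // previous raw retention (1y)
	rollupDays    = 365 // hourly rollup retention (1y)
)

// rawRowsPerDevice: one snapshot per minute across the 30-day raw window.
func rawRowsPerDevice() int { return rawDays * minutesPerDay }

// hourlyRowsPerDevice: one rollup row per hour across a year.
func hourlyRowsPerDevice() int { return rollupDays * 24 }

func main() {
	fmt.Println("raw rows/device (30d):", rawRowsPerDevice())
	fmt.Println("hourly rows/device (1y):", hourlyRowsPerDevice())
	fmt.Println("hot-retention ratio:", oldRawDays/rawDays) // the "12x" from the first commit
}
```

So per device, a year of hourly history (8,760 rows) costs less than a quarter of the 30-day raw window (43,200 rows), which is why cutting raw retention dominates the storage picture.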
(Raw numbers; cold chunks compress another ~10× via `segmentby=device_identifier`.) The cap shifts from "how much hot storage can we keep per miner" to "how much cheap-compressed hourly history can we keep per miner" — a much better trade.

Design notes
- `device_pairing`, `device_status`, `errors`). After feat(uptime): 3-state uptime chart #17 the source is a hypertable, so CAGGs are back on the table — and they're strictly better than a cron-driven rollup: incremental materialization, refresh policy, compression, retention all managed by TimescaleDB.
- `last(state, time)` per hour. Keeps one state value per device per hour. The read query's `DISTINCT ON` + SUM-by-state already handles further aggregation into larger bar intervals (daily / weekly), so the rollup doesn't need to pre-compute per-bar counts — it just compresses the time axis.
- `startTime`, not bucket size. Raw is available for the last 30d regardless of bucket size; older windows must use the rollup (raw rows are gone). One hour of slack on the cutoff so boundary queries don't race the retention policy.
- `CountMinersByState`.

Critical files
- `GetMinerStateSnapshotsHourly` mirrors the raw query.
- (`bucket`, `state`).
- `useHourlyRollup` router + `queryUptimeRaw` / `queryUptimeHourly`.

Test plan
- `go build ./...`, `go vet ./...`, `golangci-lint run` clean on `server/internal/...` and `server/cmd/...`.
- `go test ./internal/domain/telemetry/...` + `./internal/handlers/telemetry/...` green. `TestUseHourlyRollup` covers 1d / 29d / 31d / 1y windows.
- `just db-migrate` applies 000034, CAGG materializes.
- `SELECT COUNT(*) FROM miner_state_snapshots` stays bounded (confirm the retention policy kicks in after the 30d mark; the new policy takes effect immediately for out-of-window rows).

Compatibility / deploy notes
- `miner_state_snapshots` itself, no code path changes on the write side. Safe to deploy incrementally.
- `WITH NO DATA`; the refresh policy backfills incrementally (mirroring `000016_recreate_metrics_aggregates`). Queries against long windows will return empty results until the refresh job has covered the backfill window — same trade-off the existing metric CAGGs accept.

🤖 Generated with Claude Code