
fix(overview): keep current-month snapshots live and query Tenant kind correctly #5

Open
tym83 wants to merge 1 commit into main from fix/current-month-live-and-tenant-label

Conversation

tym83 (Contributor) commented Apr 22, 2026

Summary

The /api/overview endpoint currently freezes the current-month snapshot at the first request of the month. For April 2026 this has been returning 43 clusters / 164 nodes / 83 tenants since April 1, while live VictoriaMetrics has climbed to 112 / 450 / 444 as of 2026-04-22.

Two bugs fixed:

  1. Current-month snapshots are now regenerated on every request and never written to disk. Only final end-of-month snapshots are persisted. Non-final snapshots already on the PVC (e.g. the stale 2026-04.json from April 1) are ignored on load, so the existing file self-heals at month turnover.
  2. total_tenants was always 0 because the scalar query used kind="tenant" while the emitted metric label is kind="Tenant" (see client code in cozystack/cozystack/internal/telemetry and docs v1/operations/configuration/telemetry.md). Fixed to match the real label.

What

  • overview.go:
    • ensureSnapshot: for the current/future calendar month, skip memory+disk cache and always call collectSnapshot.
    • collectSnapshot: persist only when isFinalSnapshot(s); otherwise store in memory via the new storeSnapshotInMemory helper.
    • loadSnapshots + loadSnapshotFromDisk: skip non-final files so the legacy stale April file does not repopulate the cache.
    • saveSnapshot delegates the in-memory update to storeSnapshotInMemory (no behaviour change).
    • New helpers: isCurrentOrFutureMonth, isFinalSnapshot.
  • Tenant query: `kind="tenant"` → `kind="Tenant"`.

Why

Without this fix every current-month snapshot reflects whatever VM reported at the first request that month — frozen until the month ends. The website fetches this endpoint daily and exposes the stale number on cozystack.io/oss-health/telemetry/. The Tenant-label bug hid behind a workaround in the website fetcher that reads apps.Tenant instead of total_tenants; that workaround still works after this fix and can be dropped in a follow-up.

Operational note

After deployment the stale 2026-04.json on the PVC remains, but it is no longer read (isFinalSnapshot returns false for it). On May 1 a correct end-of-April snapshot will be generated and will overwrite it. No manual cleanup is needed.

Test plan

  • go build ./... and go vet ./... pass (done locally)
  • Deploy to staging; request /api/overview?year=2026&month=04 twice and confirm the response tracks live VM, not a frozen value
  • Request a past month (e.g. ?year=2026&month=03) and confirm it stays cached
  • Verify total_tenants is non-zero and matches sum(cozy_application_count{kind="Tenant"})
  • After 2026-05-01, confirm April snapshot on disk has been regenerated with CollectedAt past end of April
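The tenant-count check in the test plan can be scripted against the Prometheus-compatible query API; a hedged sketch (the base URL and `queryURL` helper are illustrative, the metric and label come from this PR):

```go
package main

import (
	"fmt"
	"net/url"
)

// tenantQuery is the corrected scalar query: the emitted label is
// PascalCase "Tenant"; the lowercase form matched no series, which is
// why total_tenants was always 0.
const tenantQuery = `sum(cozy_application_count{kind="Tenant"})`

// queryURL builds a /api/v1/query request URL for a VictoriaMetrics
// (Prometheus-compatible) endpoint. Helper name is illustrative.
func queryURL(base, query string) string {
	v := url.Values{}
	v.Set("query", query)
	return base + "/api/v1/query?" + v.Encode()
}

func main() {
	fmt.Println(queryURL("http://victoria-metrics:8428", tenantQuery))
}
```

Fetching that URL on staging should return a non-zero scalar once the fix is deployed.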

…d correctly

The current-month snapshot is now regenerated from VictoriaMetrics on every
/api/overview request and never written to disk. Only final, end-of-month
snapshots are persisted. Previously the first request of the month froze
the numbers to whatever VM reported at that moment, which is why April
stayed at 43 clusters / 164 nodes / 83 tenants while Grafana kept climbing.

Stale in-progress snapshots that may already be present on the snapshot
PVC are ignored when loading, so the existing bad April file self-heals
once the month ends (a fresh end-of-April snapshot is generated on first
request in May and written to disk).

Also fix the total tenants query: the application kind label is Tenant
(PascalCase), not lowercase tenant, so the scalar query was always
returning 0. This hid behind a workaround in the website fetcher that
read apps.Tenant instead; total_tenants will now be correct on its own.

Signed-off-by: tym83 <6355522@gmail.com>

coderabbitai Bot commented Apr 22, 2026

Warning

Rate limit exceeded

@tym83 has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 59 minutes and 38 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 59 minutes and 38 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 67d4ff77-8114-485d-9f33-d1127dde9018

📥 Commits

Reviewing files that changed from the base of the PR and between e2bb204 and a51ffbb.

📒 Files selected for processing (1)
  • overview.go

tym83 added a commit to cozystack/website that referenced this pull request Apr 22, 2026
…h cron (#508)

## Summary

Manually correct the April 2026 telemetry snapshot on
`cozystack.io/oss-health/telemetry/` with live VictoriaMetrics values
(112 clusters / 450 nodes / 444 tenants as of 2026-04-22) and pause the
daily fetch cron until the upstream server fix ships.

Root cause: the telemetry-server `/api/overview` endpoint freezes the
current-month snapshot at the first-of-month request. See
cozystack/cozystack-telemetry-server#5 for the server fix. Until that is
deployed, every `fetch-telemetry.yml` run would overwrite this
correction with stale 43 / 164 / 83.

## What

- `static/oss-health-data/telemetry.json`:
  - `month` (April 2026) → 112 / 450 / 444. Apps scaled linearly by the cluster ratio 112 / 43 ≈ 2.6047.
  - `quarter` and `year` averages recomputed: March derived from `(2 × old_avg − stale_april)`, then the new averages computed with the corrected April. Apps given the same treatment.
- `.github/workflows/fetch-telemetry.yml`: `schedule:` block commented out with a TODO pointing at the upstream fix; `workflow_dispatch:` kept for manual refresh.
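The March back-derivation above can be sanity-checked in a few lines. In this sketch the old published average is a hypothetical value; 43 and 112 are the stale and corrected April cluster counts from this PR:

```go
package main

import "fmt"

func main() {
	oldAvg := 60.0          // hypothetical previously published Mar–Apr average
	staleApril := 43.0      // frozen April cluster count
	correctedApril := 112.0 // live value as of 2026-04-22

	// The old 2-month average was (march + staleApril) / 2, so March can be
	// recovered and the average recomputed with the corrected April.
	march := 2*oldAvg - staleApril
	newAvg := (march + correctedApril) / 2
	fmt.Println(march, newAvg) // 77 94.5
}
```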

## Why

Without pausing the cron the next 08:00 UTC run would revert this
commit. Keeping `workflow_dispatch:` means we can trigger the fetch
on-demand once the server fix is live, to verify it returns live data
before re-enabling the schedule.

## Test plan

- [ ] Hugo build + Netlify preview show April 112 / 450 / 444 and scaled
apps on `/oss-health/telemetry/`
- [ ] GitHub Actions page shows `Fetch Telemetry Data` with only manual
trigger
- [ ] Once the server fix is deployed: trigger the workflow manually,
confirm it produces a diff of 0 (or within rounding) against this
committed JSON, then re-enable the daily schedule in a follow-up PR
@tym83 tym83 marked this pull request as ready for review April 22, 2026 18:26
gemini-code-assist (bot) left a comment


Code Review

This pull request introduces logic to distinguish between final and non-final snapshots, ensuring that data for the current month is always fresh and not persisted to disk. The review identifies two significant issues: the implementation bypasses the singleflight mechanism for the current month, which will cause redundant network queries, and the in-memory cache check for past months does not distinguish between final and non-final snapshots, potentially serving stale data until the service is restarted.

Comment thread overview.go
Comment on lines +110 to 119

```go
if m.isCurrentOrFutureMonth(monthLabel) {
	// In-progress month: always regenerate. collectSnapshot will choose
	// in-memory-only storage because the resulting snapshot is non-final.
	m.collectSnapshot(monthLabel)
	return
}

if m.hasSnapshot(monthLabel) {
	return
}
```

high

The current implementation of ensureSnapshot has two significant issues:

  1. Singleflight Bypass: For the current month, it calls collectSnapshot directly and returns (lines 110-115). This bypasses the singleflight logic, meaning that multiple concurrent requests for the current month will trigger redundant, expensive network queries to VictoriaMetrics and GitHub.
  2. Stale Memory Cache: For past months, the check if m.hasSnapshot(monthLabel) (line 117) returns true even if the in-memory snapshot is a non-final one from when the month was still in progress. This prevents the snapshot from ever being finalized (loaded from disk or regenerated) until the service is restarted.

I recommend refactoring the logic to use singleflight for all regenerations and ensuring that only final snapshots are used as a cache for past months.

```go
if !m.isCurrentOrFutureMonth(monthLabel) {
	m.mu.RLock()
	hasFinal := false
	for _, s := range m.snapshots {
		if s.Month == monthLabel && isFinalSnapshot(s) {
			hasFinal = true
			break
		}
	}
	m.mu.RUnlock()
	if hasFinal {
		return
	}

	// Try loading from disk cache before querying VM.
	if m.loadSnapshotFromDisk(monthLabel) {
		return
	}
}
```

