Conversation
Introduces the chart service for dashboard time-series data:
- server/chart/ bounded context (service, HTTP handler, MySQL store, SCD bitmaps)
- Two migrations for host_hourly_data and host_scd_data tables
- Cron wiring for chart data collection
- charts-backfill and charts-collect CLI tools
- render.yaml updates for the data collection cron job
@coderabbitai review
✅ Actions performed: Review triggered.
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. This behavior can be configured, and reviews can be resumed or managed with CodeRabbit's review commands.
Walkthrough: Implements a chart bounded context: new exported chart API and HTTP types, endpoint wiring and routes, a concrete service with viewer-backed authorization and a host-filter TTL cache, and a MySQL SCD datastore implementation and migration.
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 6
🧹 Nitpick comments (6)
server/chart/internal/service/service.go (1)
85-85: Consider injecting a clock for deterministic `GetChartData` tests.
`time.Now()` here means end-to-end tests of `GetChartData` can't assert on the exact start/end passed to the datastore. `computeBucketRange` is unit-tested in isolation, so this is only a minor ergonomics concern — a `func() time.Time` field on `Service` (defaulted to `time.Now`) would close the gap.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/chart/internal/service/service.go` at line 85, Replace the direct call to time.Now() in GetChartData by injecting a clock on the Service so tests can control "now": add a field Clock func() time.Time to the Service struct (default it to time.Now in constructor/new service), update GetChartData to call s.Clock() instead of time.Now(), and keep computeBucketRange usage unchanged so unit tests remain valid while end-to-end tests can supply a deterministic clock.
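A minimal sketch of the injectable-clock pattern being suggested; the `nowFn` field name and constructor shape here are illustrative, not the actual `Service` definition:

```go
package service

import "time"

// Service carries an injectable clock. nowFn is a hypothetical field name;
// it defaults to time.Now so production behavior is unchanged.
type Service struct {
	nowFn func() time.Time
}

func NewService() *Service {
	return &Service{nowFn: time.Now}
}

// now returns the injected clock's time; tests can swap nowFn for a fixed
// value to make bucket-range assertions deterministic.
func (s *Service) now() time.Time {
	return s.nowFn()
}
```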
tools/charts-backfill/main.go (1)
30-82: LGTM for a dev tool. Flag parsing, DSN validation via `Ping`, host-ID sourcing, and UTC-midnight alignment all look right. `INSERT ... ON DUPLICATE KEY UPDATE` against `(dataset, entity_id, valid_from)` correctly makes re-runs idempotent.
Minor: consider validating `*days > 0` up front; with `days=0` the tool silently does nothing (`for day := range days` is a no-op), and with negative values the `AddDate(0, 0, -(*days-1))` computes a future start before exiting the empty loop.
Based on learnings: For test harnesses and CLI tools in the Fleet codebase, resource cleanup on error paths may not be necessary since the OS handles cleanup when the process exits.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tools/charts-backfill/main.go` around lines 30 - 82, The CLI doesn't validate the -days flag so days==0 or negative values produce no work or a future start date; in main() after flag.Parse() add a validation check for *days to ensure it's > 0 and call log.Fatalf with a clear message (e.g. "invalid days: must be > 0") if not, so functions like backfill, the start date calculation and loops (see variables days, startDate, start and call to backfill) always receive a sane positive day count.
server/chart/internal/mysql/data.go (2)
260-330: `GetSCDData` looks correct; one minor determinism nit for snapshot aggregation.
The overlap SQL predicate (`valid_from < lastBucketEnd AND valid_to > firstBucketStart`), bucket labels (`startDate + (i+1)*bucketSize`, first label = `startDate+bucketSize`, last = `endDate`), and filter-mask AND all line up with the documented semantics and the snapshot/accumulate tests in `data_test.go`.
Minor: `aggregateBucket` in the snapshot path picks the first row per entity encountered in `rows`, but the SQL has no `ORDER BY`. Under correct write invariants at most one row per entity satisfies `valid_from < bucketEnd AND valid_to >= bucketEnd`, so the dedup is a guard. If you ever breach that invariant (e.g., a bug or partial-write race), the selected row becomes driver-order-dependent and hard to reproduce. Adding `ORDER BY entity_id, valid_from DESC` (or explicitly picking `max(valid_from)`) would make the behavior deterministic under failure modes.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/chart/internal/mysql/data.go` around lines 260 - 330, GetSCDData's SQL scan can return rows in nondeterministic order so aggregateBucket may pick different "first" rows under races; fix by making the query deterministic: in the query string built in GetSCDData (select from host_scd_data) append an ORDER BY clause such as "ORDER BY entity_id, valid_from DESC" (or alternatively use an aggregate like max(valid_from) per entity) so that when aggregateBucket operates it consistently picks the newest valid_from per entity; update the query variable (and re-run sqlx.In/rewind) accordingly.
369-381: Consider chunking the cleanup DELETE to avoid long locks.
`DELETE FROM host_scd_data WHERE valid_to < CURDATE() - INTERVAL ? DAY AND valid_to <> ?` will run as one statement. On large tables with a big backlog (first run after deploy, or after a long cron outage) this can hold row/gap locks for a while and churn the binlog. Consider a bounded loop with `LIMIT N` per statement — the pattern used elsewhere in the repo for `DELETE` sweeps. Safe to defer if the retention window is small and cron runs regularly.
The `AND valid_to <> ?` guard against the sentinel is redundant (the sentinel `9999-12-31` can never satisfy `valid_to < CURDATE()-N`) but harmless as defense-in-depth.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/chart/internal/mysql/data.go` around lines 369 - 381, The single large DELETE in Datastore.CleanupSCDData can hold long locks; change CleanupSCDData to run deletes in a bounded loop that issues the same DELETE with a LIMIT (e.g., chunkSize constant) until no rows are affected, using ds.writer(ctx).ExecContext to execute each chunk and checking result.RowsAffected to exit; keep the existing scdOpenSentinel guard (or optionally drop it) and propagate errors from ExecContext as before.
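A rough sketch of the bounded-loop DELETE suggested above, assuming a plain `database/sql` handle; the `cleanupChunkSize` constant and function name are illustrative rather than the datastore's real API:

```go
package mysql

import (
	"context"
	"database/sql"
	"time"
)

const cleanupChunkSize = 10000 // assumed batch size

// cleanupInChunks deletes expired SCD rows in bounded batches so a large
// backlog never holds locks for one giant statement.
func cleanupInChunks(ctx context.Context, db *sql.DB, cutoff, sentinel time.Time) error {
	for {
		res, err := db.ExecContext(ctx,
			`DELETE FROM host_scd_data WHERE valid_to < ? AND valid_to <> ? LIMIT ?`,
			cutoff, sentinel, cleanupChunkSize)
		if err != nil {
			return err
		}
		n, err := res.RowsAffected()
		if err != nil {
			return err
		}
		if n < cleanupChunkSize {
			return nil // backlog drained
		}
	}
}
```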
tools/charts-collect/main.go (1)
133-150: `apiClient.get` lacks timeout/retry semantics for a cron tool. The HTTP client has a 30s timeout (line 86), but there's no retry/backoff for transient 5xx or network errors. Combined with the "one host failure → false CVE closure" issue above, this makes the collector fragile against brief API hiccups. Consider adding retries with exponential backoff, or treating any non-2xx / transport error as sufficient cause to abort the snapshot reconciliation for the run.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tools/charts-collect/main.go` around lines 133 - 150, The get method on apiClient currently performs a single HTTP request (apiClient.get -> http.NewRequest -> a.http.Do) and returns errors directly, which makes the cron collector fragile; update apiClient.get to implement retry/backoff for transient network errors and 5xx responses (or alternatively treat any transport or non-2xx error as a fatal abort for the snapshot run depending on desired behaviour). Specifically, wrap the a.http.Do call and the StatusCode check in a retry loop with exponential backoff and a small maxAttempts, only returning success when resp.StatusCode is 2xx; for 5xx or temporary network errors retry (with increasing sleep) and for client errors (4xx) or exhausting retries return a wrapped error indicating path and final status; ensure resp.Body is closed on non-success and that the Authorization header logic (fmt.Sprintf("Bearer %s", a.token)) and baseURL+path composition remain unchanged.
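A sketch of the retry/backoff shape being suggested; `maxAttempts`, the backoff schedule, and the standalone function are illustrative, not the real `apiClient` wiring:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

const maxAttempts = 3 // assumed retry budget

// getWithRetry retries transient failures (transport errors and 5xx) with
// exponential backoff, and fails fast on 4xx client errors.
func getWithRetry(client *http.Client, url, token string) ([]byte, error) {
	var lastErr error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if attempt > 0 {
			time.Sleep(time.Second << attempt) // 2s, 4s, ...
		}
		req, err := http.NewRequest(http.MethodGet, url, nil)
		if err != nil {
			return nil, err
		}
		req.Header.Set("Authorization", "Bearer "+token)
		resp, err := client.Do(req)
		if err != nil {
			lastErr = err // transport error: retry
			continue
		}
		body, readErr := io.ReadAll(resp.Body)
		resp.Body.Close()
		switch {
		case resp.StatusCode >= 200 && resp.StatusCode < 300:
			return body, readErr
		case resp.StatusCode >= 500:
			lastErr = fmt.Errorf("GET %s: status %d", url, resp.StatusCode) // retry
		default:
			return nil, fmt.Errorf("GET %s: status %d", url, resp.StatusCode) // 4xx: don't retry
		}
	}
	return nil, fmt.Errorf("GET %s failed after %d attempts: %w", url, maxAttempts, lastErr)
}
```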
server/chart/internal/service/service_test.go (1)
256-281: Consider adding a test for the `CollectDatasets` error path.
`TestCollectDatasetsUptime` only exercises the happy path, so the `s.logger.ErrorContext(...)` branch in `service.go` is untested. Passing `nil` for the logger (as other tests do) combined with a dataset that returns an error would reproduce the NPE called out on `service.go` (lines 42–50). A small test with a stub dataset whose `Collect` returns an error would lock down the behavior you choose for that path.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/chart/internal/service/service_test.go` around lines 256 - 281, Add a test that exercises the error path of svc.CollectDatasets by registering a stub dataset whose Collect method returns an error and invoking svc.CollectDatasets with a nil logger (like other tests do) to reproduce and lock down the NPE on logger.ErrorContext; specifically, create a minimal fake dataset type with Collect(...) error that returns an error, call svc := NewService(nil, ds, nil); svc.RegisterDataset(&faultyDataset{}); err := svc.CollectDatasets(t.Context(), someTime) and assert err is non-nil and that the call does not panic (use require.Error and no panic assertion), which will validate or force handling around logger.ErrorContext in service.go.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cmd/fleet/serve.go`:
- Around line 1949-1950: The chart service currently only registers
UptimeDataset, so register the implemented CVE dataset by calling
chartSvc.RegisterDataset with a CVEDataset instance (e.g.,
chartSvc.RegisterDataset(&chart.CVEDataset{})); place this call alongside the
existing chartSvc.RegisterDataset(&chart.UptimeDataset{}) after
chart_bootstrap.New(...) so CVE chart requests have the proper dataset
metadata/resolution.
In `@server/chart/blob.go`:
- Around line 14-32: HostIDsToBlob currently sizes its allocation by the largest
ID which allows an attacker or corrupted input to force huge allocations; add a
sanity cap check in HostIDsToBlob: compute requiredBytes := maxID/8+1 and if
requiredBytes exceeds a defined constant cap (e.g. MaxBlobBytes or
MaxHostBlobSize) then handle it safely (either return nil/empty blob or return
an error path the callers expect — keep the existing function signature by
returning nil if you can't change APIs) and/or filter out IDs > cap*8 before
allocating. Reference HostIDsToBlob, maxID, and the blob allocation site when
adding the guard and unit tests to cover a very large ID to ensure it no longer
triggers large allocations.
In `@server/chart/internal/mysql/charts.go`:
- Around line 97-104: The WHERE clause's timestamp comparison only uses
h.detail_updated_at and h.created_at as fallbacks when both hst.seen_time and
ne.last_seen_at are NULL, so newer detail_updated_at or created_at values can be
ignored; update the SQL expression to take the greatest of all possible activity
timestamps (hst.seen_time, ne.last_seen_at, NULLIF(h.detail_updated_at,
'2000-01-01 00:00:00'), h.created_at) and compare that GREATEST(...) >= ? so the
newest signal wins (adjust the COALESCE/GREATEST nesting accordingly to include
h.detail_updated_at and h.created_at as candidates alongside hst.seen_time and
ne.last_seen_at).
In `@server/chart/internal/service/service.go`:
- Around line 42-50: CollectDatasets currently dereferences s.logger (calling
s.logger.ErrorContext) which can panic if NewService allowed a nil logger, and
it always returns nil hiding dataset errors; fix by guarding the logger and
aggregating errors: in Service.CollectDatasets iterate datasets calling
dataset.Collect(ctx, s.store, now), for each error first check if s.logger !=
nil before calling s.logger.ErrorContext (or use a safe discard logger set in
NewService), and append each error to a collection (use errors.Join to combine)
so CollectDatasets returns a non-nil aggregated error when any dataset failed
while still continuing to process remaining datasets; update NewService to
either enforce non-nil logger or set a discard logger to avoid nil-receiver
panics in s.logger.
In `@tools/charts-collect/main.go`:
- Around line 227-264: The loop that builds cveHosts currently logs per-host
fetchHostCVEs errors and continues, which can cause reconcileSnapshot to close
CVE rows incorrectly; modify the logic to detect any fetch error (e.g., set a
hadFetchError bool or collect failed hostIDs) during the hostIDs loop and if any
error occurred, return early (or skip reconciliation) instead of calling
reconcileSnapshot with a partial entityBitmaps; alternatively implement bounded
retries with backoff inside the loop for fetchHostCVEs before marking failure so
reconcileSnapshot is only invoked when the full scan completed successfully and
entityBitmaps accurately reflects all hosts for use with bucketStart.
- Around line 192-210: The current db.QueryRow()/Scan() swallow treats any
non-nil err the same as sql.ErrNoRows and can drop accumulated bitmap merges;
change the logic around QueryRow/Scan so you specifically handle sql.ErrNoRows
(i.e., if errors.Is(err, sql.ErrNoRows) then skip merging) and for any other
non-nil err return it (wrap with fmt.Errorf as in the insert error handling).
Use errors.Is(err, sql.ErrNoRows) (add "errors" to imports) and only call
chart.BlobOR(existing, newBlob) when Scan succeeded (err == nil); otherwise
return the real DB error.
---
Nitpick comments:
In `@server/chart/internal/mysql/data.go`:
- Around line 260-330: GetSCDData's SQL scan can return rows in nondeterministic
order so aggregateBucket may pick different "first" rows under races; fix by
making the query deterministic: in the query string built in GetSCDData (select
from host_scd_data) append an ORDER BY clause such as "ORDER BY entity_id,
valid_from DESC" (or alternatively use an aggregate like max(valid_from) per
entity) so that when aggregateBucket operates it consistently picks the newest
valid_from per entity; update the query variable (and re-run sqlx.In/rewind)
accordingly.
- Around line 369-381: The single large DELETE in Datastore.CleanupSCDData can
hold long locks; change CleanupSCDData to run deletes in a bounded loop that
issues the same DELETE with a LIMIT (e.g., chunkSize constant) until no rows are
affected, using ds.writer(ctx).ExecContext to execute each chunk and checking
result.RowsAffected to exit; keep the existing scdOpenSentinel guard (or
optionally drop it) and propagate errors from ExecContext as before.
In `@server/chart/internal/service/service_test.go`:
- Around line 256-281: Add a test that exercises the error path of
svc.CollectDatasets by registering a stub dataset whose Collect method returns
an error and invoking svc.CollectDatasets with a nil logger (like other tests
do) to reproduce and lock down the NPE on logger.ErrorContext; specifically,
create a minimal fake dataset type with Collect(...) error that returns an
error, call svc := NewService(nil, ds, nil);
svc.RegisterDataset(&faultyDataset{}); err := svc.CollectDatasets(t.Context(),
someTime) and assert err is non-nil and that the call does not panic (use
require.Error and no panic assertion), which will validate or force handling
around logger.ErrorContext in service.go.
In `@server/chart/internal/service/service.go`:
- Line 85: Replace the direct call to time.Now() in GetChartData by injecting a
clock on the Service so tests can control "now": add a field Clock func()
time.Time to the Service struct (default it to time.Now in constructor/new
service), update GetChartData to call s.Clock() instead of time.Now(), and keep
computeBucketRange usage unchanged so unit tests remain valid while end-to-end
tests can supply a deterministic clock.
In `@tools/charts-backfill/main.go`:
- Around line 30-82: The CLI doesn't validate the -days flag so days==0 or
negative values produce no work or a future start date; in main() after
flag.Parse() add a validation check for *days to ensure it's > 0 and call
log.Fatalf with a clear message (e.g. "invalid days: must be > 0") if not, so
functions like backfill, the start date calculation and loops (see variables
days, startDate, start and call to backfill) always receive a sane positive day
count.
In `@tools/charts-collect/main.go`:
- Around line 133-150: The get method on apiClient currently performs a single
HTTP request (apiClient.get -> http.NewRequest -> a.http.Do) and returns errors
directly, which makes the cron collector fragile; update apiClient.get to
implement retry/backoff for transient network errors and 5xx responses (or
alternatively treat any transport or non-2xx error as a fatal abort for the
snapshot run depending on desired behaviour). Specifically, wrap the a.http.Do
call and the StatusCode check in a retry loop with exponential backoff and a
small maxAttempts, only returning success when resp.StatusCode is 2xx; for 5xx
or temporary network errors retry (with increasing sleep) and for client errors
(4xx) or exhausting retries return a wrapped error indicating path and final
status; ensure resp.Body is closed on non-success and that the Authorization
header logic (fmt.Sprintf("Bearer %s", a.token)) and baseURL+path composition
remain unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: f49d4152-ab3c-4290-91c5-78b76cd13920
⛔ Files ignored due to path filters (2)
- `tools/charts-backfill/README.md` is excluded by `!**/*.md`
- `tools/charts-collect/README.md` is excluded by `!**/*.md`
📒 Files selected for processing (25)
- cmd/fleet/cron.go
- cmd/fleet/serve.go
- pkg/str/str.go
- pkg/str/str_test.go
- server/chart/api/chart.go
- server/chart/api/http/types.go
- server/chart/api/service.go
- server/chart/arch_test.go
- server/chart/blob.go
- server/chart/blob_test.go
- server/chart/bootstrap/bootstrap.go
- server/chart/datasets.go
- server/chart/internal/mysql/charts.go
- server/chart/internal/mysql/data.go
- server/chart/internal/mysql/data_test.go
- server/chart/internal/service/endpoint_utils.go
- server/chart/internal/service/handler.go
- server/chart/internal/service/service.go
- server/chart/internal/service/service_test.go
- server/chart/internal/types/chart.go
- server/datastore/mysql/migrations/tables/20260421182057_AddHostSCDData.go
- server/datastore/mysql/migrations/tables/20260421182057_AddHostSCDData_test.go
- server/fleet/cron_schedules.go
- tools/charts-backfill/main.go
- tools/charts-collect/main.go
Codecov Report
❌ Patch coverage is …
Additional details and impacted files:
@@ Coverage Diff @@
## main #43910 +/- ##
==========================================
- Coverage 66.83% 66.76% -0.08%
==========================================
Files 2610 2623 +13
Lines 210529 211148 +619
Branches 9424 9424
==========================================
+ Hits 140712 140966 +254
- Misses 57000 57353 +353
- Partials 12817 12829 +12
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
@coderabbitai review
✅ Actions performed: Review triggered.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@server/chart/internal/mysql/charts.go`:
- Around line 88-93: FindRecentlySeenHostIDs is performing a read but
incorrectly calls ds.writer(ctx), which routes traffic to primary; change the
call to use ds.reader(ctx) instead (i.e., replace ds.writer(ctx) with
ds.reader(ctx) in the sqlx.SelectContext call) so reads use the replica helper;
if this read truly needs to be pinned to primary for replication reasons, ensure
the caller sets ctxdb.RequirePrimary(ctx) so ds.reader(ctx) will honor the
primary requirement.
In `@server/chart/internal/mysql/data.go`:
- Around line 361-371: The DELETE uses CURDATE() which is session-local; instead
compute the UTC cutoff in Go and pass it as a parameter to the query so the
retention boundary is deterministic. In Datastore.CleanupSCDData, replace the
CURDATE() - INTERVAL ? DAY expression with a placeholder (e.g. valid_to < ?) and
compute cutoff := time.Now().UTC().AddDate(0,0,-days) (or derive from the
caller's UTC logic), then call ds.writer(ctx).ExecContext with cutoff and
scdOpenSentinel as the bound parameters; keep the valid_to <> ? sentinel check
as-is and wrap/return errors the same way.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 1db182b8-eba2-46a6-bde9-db6f97d4b065
📒 Files selected for processing (10)
- server/chart/api/chart.go
- server/chart/api/http/types.go
- server/chart/internal/mysql/charts.go
- server/chart/internal/mysql/data.go
- server/chart/internal/service/handler.go
- server/chart/internal/service/host_cache.go
- server/chart/internal/service/host_cache_test.go
- server/chart/internal/service/service.go
- server/chart/internal/service/service_test.go
- server/chart/internal/types/chart.go
🚧 Files skipped from review as they are similar to previous changes (3)
- server/chart/api/http/types.go
- server/chart/internal/service/service.go
- server/chart/internal/types/chart.go
@coderabbitai review
✅ Actions performed: Review triggered.
Actionable comments posted: 2
♻️ Duplicate comments (1)
server/chart/internal/service/service.go (1)
47-54: ⚠️ Potential issue | 🟠 Major — Make collection failures observable and nil-logger safe.
`NewService` allows `logger == nil`, so this path can panic on collection errors in tests or alternate wiring. Also, returning `nil` hides dataset failures from cron/monitoring; continue collecting, but return an aggregate error.
🛠️ Proposed fix
import ( "context" + "errors" "fmt" "log/slog" "time" @@ func (s *Service) CollectDatasets(ctx context.Context, now time.Time) error { + var errs []error for name, dataset := range s.datasets { if err := dataset.Collect(ctx, s.store, now); err != nil { // Log and continue — don't let one dataset failure block others. - s.logger.ErrorContext(ctx, "collect chart dataset", "dataset", name, "err", ctxerr.Wrap(ctx, err, "collect chart dataset")) + wrapped := ctxerr.Wrap(ctx, err, "collect chart dataset") + if s.logger != nil { + s.logger.ErrorContext(ctx, "collect chart dataset", "dataset", name, "err", wrapped) + } + errs = append(errs, fmt.Errorf("dataset %q: %w", name, wrapped)) } } - return nil + return errors.Join(errs...) }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/chart/internal/service/service.go` around lines 47 - 54, The CollectDatasets method currently can panic if s.logger is nil and also drops all dataset errors; update Service.CollectDatasets to be nil-logger safe and return an aggregate error: for each entry in s.datasets call dataset.Collect(ctx, s.store, now) and if it fails, use a safe logging helper that checks s.logger != nil (or use a no-op logger) to log "collect chart dataset" with dataset name and wrapped error, append the error into a slice, continue iterating, and at the end return nil if no errors or a combined error (e.g., multi-error or fmt.Errorf("multiple errors: %v", errs)) so callers/crons can observe failures; reference the Service.CollectDatasets, s.logger, s.datasets, and dataset.Collect symbols when making the changes.
🧹 Nitpick comments (1)
server/chart/internal/service/host_cache.go (1)
103-112: Canonicalize duplicate filter values before hashing.
Sorting alone means `LabelIDs: []uint{1, 2}` and `LabelIDs: []uint{1, 2, 2}` produce different cache keys even though downstream filtering is semantically the same. Deduplicate each sorted slice to avoid cache fragmentation from repeated query params.
♻️ Proposed canonicalization
```diff
-	teams := slices.Clone(f.TeamIDs)
-	slices.Sort(teams)
-	labels := slices.Clone(f.LabelIDs)
-	slices.Sort(labels)
-	platforms := slices.Clone(f.Platforms)
-	slices.Sort(platforms)
-	include := slices.Clone(f.IncludeHostIDs)
-	slices.Sort(include)
-	exclude := slices.Clone(f.ExcludeHostIDs)
-	slices.Sort(exclude)
+	teams := canonicalUintSlice(f.TeamIDs)
+	labels := canonicalUintSlice(f.LabelIDs)
+	platforms := canonicalStringSlice(f.Platforms)
+	include := canonicalUintSlice(f.IncludeHostIDs)
+	exclude := canonicalUintSlice(f.ExcludeHostIDs)
@@
 }
+
+func canonicalUintSlice(in []uint) []uint {
+	out := slices.Clone(in)
+	slices.Sort(out)
+	return slices.Compact(out)
+}
+
+func canonicalStringSlice(in []string) []string {
+	out := slices.Clone(in)
+	slices.Sort(out)
+	return slices.Compact(out)
+}
```
Also applies to: 120-121
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/chart/internal/service/host_cache.go` around lines 103 - 112, The cloned filter slices (teams, labels, platforms, include, exclude) are only being sorted but not deduplicated, so semantically-identical filters with repeated IDs yield different cache keys; after cloning and sorting each slice (teams, labels, platforms, include, exclude) remove consecutive duplicates (canonicalize) before using them for hashing/key construction in host_cache.go so that []{1,2,2} becomes []{1,2}; apply the same deduplication to the other cloned slices used for cache key generation in the same function.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@server/chart/internal/service/host_cache.go`:
- Around line 24-35: The hostFilterCache currently leaves expired entries in the
entries map indefinitely; add bounded-pruning logic to prevent unbounded growth
by implementing a size or time-based eviction: modify hostFilterCache to include
a maxEntries (or implement an LRU) and a background janitor goroutine (or
prune-on-insert) that removes expired entries based on ttl and evicts oldest
entries when entries exceeds maxEntries; update places that populate entries
(look for methods using hostFilterCache.entries and singleflight via sf) to call
the prune/evict helper after writes, and ensure the janitor uses
hostFilterCache.clock and respects shutdown semantics if applicable.
In `@server/chart/internal/service/service.go`:
- Around line 95-109: The Downsample divisor check should only apply for
sub-daily datasets: keep the negative-value check for opts.Downsample but move
or gate the "24%opts.Downsample != 0" validation so it runs only when
dataset.BucketSize() < 24*time.Hour (i.e., sub-daily). Update the logic around
the existing validation and the bucketSize computation (referencing
opts.Downsample and dataset.BucketSize()) so daily datasets (bucketSize >=
24*time.Hour) do not fail on non-divisor downsample values.
---
Duplicate comments:
In `@server/chart/internal/service/service.go`:
- Around line 47-54: The CollectDatasets method currently can panic if s.logger
is nil and also drops all dataset errors; update Service.CollectDatasets to be
nil-logger safe and return an aggregate error: for each entry in s.datasets call
dataset.Collect(ctx, s.store, now) and if it fails, use a safe logging helper
that checks s.logger != nil (or use a no-op logger) to log "collect chart
dataset" with dataset name and wrapped error, append the error into a slice,
continue iterating, and at the end return nil if no errors or a combined error
(e.g., multi-error or fmt.Errorf("multiple errors: %v", errs)) so callers/crons
can observe failures; reference the Service.CollectDatasets, s.logger,
s.datasets, and dataset.Collect symbols when making the changes.
---
Nitpick comments:
In `@server/chart/internal/service/host_cache.go`:
- Around line 103-112: The cloned filter slices (teams, labels, platforms,
include, exclude) are only being sorted but not deduplicated, so
semantically-identical filters with repeated IDs yield different cache keys;
after cloning and sorting each slice (teams, labels, platforms, include,
exclude) remove consecutive duplicates (canonicalize) before using them for
hashing/key construction in host_cache.go so that []{1,2,2} becomes []{1,2};
apply the same deduplication to the other cloned slices used for cache key
generation in the same function.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 88a3e721-736e-448f-836b-c0f33633eb45
📒 Files selected for processing (13)
- cmd/fleet/serve.go
- server/acl/chartacl/fleet_adapter.go
- server/acl/chartacl/fleet_adapter_test.go
- server/chart/api/chart.go
- server/chart/api/http/types.go
- server/chart/api/service.go
- server/chart/bootstrap/bootstrap.go
- server/chart/internal/mysql/charts.go
- server/chart/internal/service/host_cache.go
- server/chart/internal/service/host_cache_test.go
- server/chart/internal/service/service.go
- server/chart/internal/service/service_test.go
- server/chart/internal/types/chart.go
✅ Files skipped from review due to trivial changes (1)
- server/chart/internal/types/chart.go
🚧 Files skipped from review as they are similar to previous changes (4)
- server/chart/api/service.go
- server/chart/bootstrap/bootstrap.go
- server/chart/api/http/types.go
- server/chart/internal/mysql/charts.go
🧹 Nitpick comments (1)
server/chart/datasets.go (1)
22-35: Small nit: consider logging / a metric when `hostIDs` is empty.
Silent early return on an empty recently-seen list is correct for the accumulate strategy, but makes it hard to distinguish "legitimately no hosts active in the last 10 minutes" from "collector never sees anything" at runtime. A debug log with the sample `now` would help operators spot stuck collectors.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/chart/datasets.go` around lines 22 - 35, In UptimeDataset.Collect, add an explicit debug log and/or metric increment right before the early return when len(hostIDs) == 0: log the sample time (now) and the uptimeRecentlySeenWindow so operators can distinguish a legitimately empty window from a stuck collector, and optionally increment a counter like "uptime_empty_recently_seen" via your existing metrics facility; place the log/metric call immediately above the current return nil in UptimeDataset.Collect so it runs only for the empty-hostIDs case.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@server/chart/datasets.go`:
- Around line 22-35: In UptimeDataset.Collect, add an explicit debug log and/or
metric increment right before the early return when len(hostIDs) == 0: log the
sample time (now) and the uptimeRecentlySeenWindow so operators can distinguish
a legitimately empty window from a stuck collector, and optionally increment a
counter like "uptime_empty_recently_seen" via your existing metrics facility;
place the log/metric call immediately above the current return nil in
UptimeDataset.Collect so it runs only for the empty-hostIDs case.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: fe46d7b6-bdf7-4356-aa97-9ad9588712d3
📒 Files selected for processing (5)
- server/chart/api/chart.go
- server/chart/datasets.go
- server/chart/internal/mysql/charts.go
- server/chart/internal/service/service_test.go
- server/chart/internal/types/chart.go
🚧 Files skipped from review as they are similar to previous changes (1)
- server/chart/internal/types/chart.go
|
@sgress454 Victor is focused on wrapping up a story for 4.84.0 so @ksykulev is going to take review on this, which works well since he has context from the hackathon. |
ksykulev
left a comment
I think besides making Response.Data work with multiple series as demonstrated in ksykulev/dashboard-policy-chart, this looks like a good template to start for adding more relevant charts.
```sql
	),
	NULLIF(h.detail_updated_at, '2000-01-01 00:00:00'),
	h.created_at
) >= ?
```
Does the host_seen_times.idx_host_seen_times_seen_time index cover this query?
Will double-check
Ah it doesn't use that index because of the COALESCE (I don't think there's any way to index that). So it's a full table scan on hosts, but PKs for all the joins. For something that runs once every ten minutes I think it'll be fine even at 100k hosts but if you feel otherwise I'll look into updating it, perhaps to a UNION. Note that nano_enrollments.last_seen_at doesn't have an index at all so that'd always be a bigger scan (although we could look into adding one).
I'm ok with it, as far as I know the hosts table isn't a very hot table, right?
**Related issue:** For #43769

# Details

This PR adds "Hosts active" and "Hosts enrolled" charts to the dashboard.

New components:

* **ChartCard.tsx**: encapsulates a visualization-agnostic chart, for data provided by the new `/charts` endpoint created in #43910
* **ChartFilterModal.tsx**: modal for setting filters on a chart. Currently supports filtering by label, platform and individual host.
* **CheckerboardViz.tsx**: a checkerboard visualization for use in ChartCard. Capable of charting 1, 7, 14 or 30 days at a time, although only 30-day charts are used right now. Bespoke rendering using SVG, since recharts' scatter chart was harder to wrangle than it was worth.
* **LineChartViz.tsx**: a line-chart visualization using Recharts
* **HostsEnrolledCard.tsx**: a bar chart of enrolled hosts using Recharts

# Checklist for submitter

If some of the following don't apply, delete the relevant line.

- [X] Changes file added for user-visible changes in `changes/`, `orbit/changes/` or `ee/fleetd-chrome/changes`. See [Changes files](https://github.com/fleetdm/fleet/blob/main/docs/Contributing/guides/committing-changes.md#changes-files) for more information.

## Testing

- [X] Added/updated automated tests
- [X] QA'd all new/changed functionality manually
- With backend provided by #43910: <img width="1426" height="428" alt="image" src="https://github.com/user-attachments/assets/0f53b9d1-c87b-4225-a175-2d40af5afe41" />

## Summary by CodeRabbit

* **New Features**
  * Dashboard now shows interactive "Hosts active" (line/heatmap) and "Hosts enrolled" (bar) charts with metric selection, filter modal (labels/platforms/hosts), legends, tooltips, and responsive layout.
* **Tests**
  * Added comprehensive tests covering chart rendering, checkerboard heatmap, and no-data states.
* **Chores**
  * Added charting library dependency to support visualizations.
Related issue: For #42812
Details
This PR implements a new bounded context, `chart`, with a single endpoint, `/charts`. The context encompasses a framework for recording, querying, and aggregating historical data for Fleet hosts, and returning that data via the API for charting.

This initial iteration has a full implementation of a dataset called "uptime", which captures which hosts were online hour-by-hour (online meaning having been "seen" at some point during that hour). It has a partial implementation of a "cve" dataset, which will capture which hosts were vulnerable to which CVEs on a given day.
Data storage
Data is stored in an SCD (slowly-changing dimension) format in the `host_scd_data` table, where the main "value" in a row is stored in the `host_bitmap` column, a `mediumblob` in which each bit encodes a host ID (bit one represents host ID 1, bit 1444 represents host ID 1444, etc.). The set of bits set on a row represents the hosts for which that dataset is "on" during the time period given by the `valid_from` (inclusive) and `valid_to` (exclusive) dates, where `valid_to` can have the special "sentinel" value 9999-12-31T00:00:00.000, meaning that the row is still "open" (the value represents everything from `valid_from` to the present). Additionally, an `entity_id` column can be used for datasets with multiple dimensions, e.g. CVE exposure or software usage, which would have entity IDs representing CVEs or software items respectively.
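As an illustration of the bit-per-host layout described above, a minimal sketch of building such a bitmap in Go; the function name and exact bit numbering are assumptions, not the exported blob helpers themselves:

```go
package chart

// hostIDsToBitmap sets one bit per host ID: host 1 maps to the lowest bit of
// the first byte, host 1444 to bit 3 of byte 180. The real helpers may use a
// different bit numbering or cap the allocation size.
func hostIDsToBitmap(ids []uint) []byte {
	var maxID uint
	for _, id := range ids {
		if id > maxID {
			maxID = id
		}
	}
	if maxID == 0 {
		return nil
	}
	blob := make([]byte, (maxID-1)/8+1)
	for _, id := range ids {
		if id == 0 {
			continue // host IDs start at 1
		}
		bit := id - 1
		blob[bit/8] |= 1 << (bit % 8)
	}
	return blob
}
```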
Data collection

Data is collected via a cron job that runs every 10 minutes. Each dataset has its own `Collect` method, which samples the data for the given moment. For example, the "uptime" dataset gathers the set of hosts that are online at the moment, and the "cve" dataset will gather the set of hosts that are vulnerable to each CVE at that moment. The sample can then be recorded using one of two strategies (a condensed sketch follows the list):

- accumulate: bitwise OR the sample with any data already recorded for the current hour, or add a new pre-closed row for that hour.
- snapshot: if there is no open row, create one with the sample and `valid_to` set to the sentinel. Otherwise:
  - if `valid_from` is within the same hour, update the current row's value
  - if `valid_from` is not within the same hour, close the current open row and start a new one with `valid_from` = the start of the current hour
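A condensed sketch of those two write strategies; the types and helper names are illustrative, not the datastore's real API:

```go
package chart

import "time"

// scdOpenSentinel marks a row that is still "open" (valid through the present).
var scdOpenSentinel = time.Date(9999, 12, 31, 0, 0, 0, 0, time.UTC)

type row struct {
	ValidFrom, ValidTo time.Time
	Bitmap             []byte
}

// applyAccumulate ORs the sample into the current hour's pre-closed row,
// widening the set of hosts recorded for that hour.
func applyAccumulate(cur *row, sample []byte, hourStart, hourEnd time.Time) *row {
	if cur == nil {
		return &row{ValidFrom: hourStart, ValidTo: hourEnd, Bitmap: sample}
	}
	cur.Bitmap = orBitmaps(cur.Bitmap, sample)
	return cur
}

// applySnapshot keeps one open row per value: update it if it was opened in
// the current hour, otherwise close it and open a new one.
func applySnapshot(open *row, sample []byte, hourStart time.Time) (closed, current *row) {
	switch {
	case open == nil:
		return nil, &row{ValidFrom: hourStart, ValidTo: scdOpenSentinel, Bitmap: sample}
	case !open.ValidFrom.Before(hourStart): // opened within the current hour
		open.Bitmap = sample
		return nil, open
	default:
		open.ValidTo = hourStart
		return open, &row{ValidFrom: hourStart, ValidTo: scdOpenSentinel, Bitmap: sample}
	}
}

func orBitmaps(a, b []byte) []byte {
	if len(b) > len(a) {
		a, b = b, a
	}
	out := append([]byte(nil), a...)
	for i := range b {
		out[i] |= b[i]
	}
	return out
}
```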
Data retrieval

- … `fleet_id` param was passed to the `/charts` endpoint), and further whittled down by any filter options supplied with the request (labels, platforms, etc.).
- `host_scd_data` rows for the requested dataset and date range (i.e. all rows whose `valid_from` is < the date range end and `valid_to` is > the date range start); the per-bucket counting is sketched below.
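A small sketch of the aggregation step implied here — AND each overlapping row's bitmap with the allowed-host mask and count the surviving bits; the helper name is illustrative:

```go
package chart

import "math/bits"

// countAllowedHosts ANDs a stored host bitmap with the mask of hosts the
// viewer is allowed to see (already narrowed by label/platform filters) and
// counts how many hosts remain for the bucket.
func countAllowedHosts(rowBitmap, filterMask []byte) int {
	n := len(rowBitmap)
	if len(filterMask) < n {
		n = len(filterMask)
	}
	count := 0
	for i := 0; i < n; i++ {
		count += bits.OnesCount8(rowBitmap[i] & filterMask[i])
	}
	return count
}
```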
This PR includes two dev tools that don't require deep review:
Checklist for submitter
If some of the following don't apply, delete the relevant line.
Changes file added for user-visible changes in
changes/,orbit/changes/oree/fleetd-chrome/changes.See Changes files for more information.
Input data is properly validated,
SELECT *is avoided, SQL injection is prevented (using placeholders for values in statements), JS inline code is prevented especially for url redirects, and untrusted data interpolated into shell scripts/commands is validated against shell metacharacters.Testing
Added/updated automated tests
Where appropriate, automated tests simulate multiple hosts and test for host isolation (updates to one hosts's records do not affect another)
QA'd all new/changed functionality manually
Database migrations
COLLATE utf8mb4_unicode_ci).Summary by CodeRabbit
New Features
Database
Tests