Dashboard charts backend#43910

Merged
sgress454 merged 27 commits into main from sgress454/dashboard-charts-backend
Apr 23, 2026

Conversation

@sgress454
Contributor

@sgress454 sgress454 commented Apr 21, 2026

Related issue: For #42812

Details

This PR implements a new bounded context, chart, with a single endpoint /charts. The context encompasses a framework for recording, querying, and aggregating historical data for Fleet hosts, and for returning that data via the API for charting purposes.

This initial iteration has a full implementation of a dataset called "uptime", which captures which hosts were online hour by hour (online meaning the host was "seen" at some point during that hour). It also has a partial implementation of a "cve" dataset, which will capture which hosts were vulnerable to which CVEs on a given day.

Data storage

Data is stored in SCD (slowly-changing dimension) format in the host_scd_data table. The main "value" in each row is the host_bitmap column, a mediumblob in which each bit encodes a host ID (bit 1 represents host ID 1, bit 1444 represents host ID 1444, and so on). The bits set on a row represent the hosts for which that dataset is "on" during the time period bounded by the valid_from (inclusive) and valid_to (exclusive) dates. A valid_to can hold the special "sentinel" value 9999-12-31T00:00:00.000, meaning the row is still "open" (its value covers everything from valid_from to the present). Additionally, an entity_id column can be used for datasets with multiple dimensions, e.g. CVE exposure or software usage, where the entity IDs represent CVEs or software items respectively.
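
The bit-to-host-ID mapping can be sketched as follows (a minimal illustration; hostIDsToBlob and blobHasHost are hypothetical names, not necessarily the actual helpers in server/chart/blob.go):

```go
package main

import "fmt"

// hostIDsToBlob encodes a set of host IDs as a bitmap blob: host ID n maps
// to byte n/8, bit n%8. (Illustrative sketch, not Fleet's actual code.)
func hostIDsToBlob(ids []uint) []byte {
	var maxID uint
	for _, id := range ids {
		if id > maxID {
			maxID = id
		}
	}
	blob := make([]byte, maxID/8+1)
	for _, id := range ids {
		blob[id/8] |= 1 << (id % 8)
	}
	return blob
}

// blobHasHost reports whether the bit for the given host ID is set.
func blobHasHost(blob []byte, id uint) bool {
	if id/8 >= uint(len(blob)) {
		return false
	}
	return blob[id/8]&(1<<(id%8)) != 0
}

func main() {
	blob := hostIDsToBlob([]uint{1, 1444})
	fmt.Println(len(blob), blobHasHost(blob, 1), blobHasHost(blob, 2))
}
```

Note that the allocation is sized by the largest ID, which is why the review below flags a sanity cap for very large IDs.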

Data collection

Data is collected via a cron job that runs every 10 minutes. Each dataset has its own Collect method that samples the data at the given moment. For example, the "uptime" dataset gathers the set of hosts that are online at that moment, and the "cve" dataset will gather the set of hosts that are vulnerable to each CVE at that moment. The sample can then be recorded using one of two strategies:

  • accumulate: bitwise OR the sample with any data already recorded for the current hour, or add a new pre-closed row for that hour.
  • snapshot: if there is no open row, create one with the sample and valid_to set to the sentinel. Otherwise:
    • If the sample has the same value as the current open row, do nothing
    • If the sample has a different value and the current open row's valid_from is within the same hour, update the current row's value
    • If the sample has a different value and the current open row's valid_from is not within the same hour, close the current open row and start a new one with valid_from = the start of the current hour

Data retrieval

  1. Gets the set of host IDs to retrieve data for. This starts with the host IDs in the requested fleet (or all hosts the user has access to, if no fleet_id param was passed to the /charts endpoint) and is further whittled down by any filter options supplied with the request (labels, platforms, etc.).
  2. Finds all host_scd_data rows for the requested dataset and date range (i.e. all rows whose valid_from is < the date range end and valid_to is > the date range start).
  3. Calculates the date ranges of the "buckets" to return datapoints for. For the uptime chart we default to 3-hour buckets, so we want 8 buckets per day.
  4. Iterates over each bucket and finds the row or rows from host_scd_data that cover that bucket's range. For datasets using the "accumulate" strategy, the values of those rows are ORed together. For "snapshot" datasets, we take the row active at the bucket end time to represent the bucket (e.g., "which hosts had a given CVE at the end of the day").

Tools

This PR includes two dev tools that don't require deep review:

  • charts-backfill - used to backfill data to various datasets for testing
  • charts-collect - used to collect data from a live server via the API and put it into a local host_scd_data table

Checklist for submitter

If some of the following don't apply, delete the relevant line.

  • Changes file added for user-visible changes in changes/, orbit/changes/ or ee/fleetd-chrome/changes.
    See Changes files for more information.

  • Input data is properly validated, SELECT * is avoided, SQL injection is prevented (using placeholders for values in statements), JS inline code is prevented especially for url redirects, and untrusted data interpolated into shell scripts/commands is validated against shell metacharacters.

Testing

(screenshot attached in the PR)

Database migrations

  • Checked schema for all modified tables for columns that will auto-update timestamps during migration.
  • Confirmed that updating the timestamps is acceptable, and will not cause unwanted side effects.
  • Ensured the correct collation is explicitly set for character columns (COLLATE utf8mb4_unicode_ci).

Summary by CodeRabbit

  • New Features

    • Added charting bounded context: HTTP API for metrics (uptime, CVE), dataset registry, hosted dataset collection, background collection/cleanup with opt-out env.
    • New utilities: host bitmap operations and string-list/uint-list parsers.
    • New CLI tools to collect and backfill chart data.
  • Database

    • Migration and schema to store host time-series SCD chart data.
  • Tests

    • Extensive unit and integration tests for service, storage, caching, cron, and utilities.

Introduces the chart service for dashboard time-series data:
- server/chart/ bounded context (service, HTTP handler, MySQL store, SCD bitmaps)
- Two migrations for host_hourly_data and host_scd_data tables
- Cron wiring for chart data collection
- charts-backfill and charts-collect CLI tools
- render.yaml updates for the data collection cron job
@sgress454
Contributor Author

@coderabbitai review

@coderabbitai
Contributor

coderabbitai Bot commented Apr 21, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai
Contributor

coderabbitai Bot commented Apr 21, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Implements a chart bounded context: new exported chart API and HTTP types, endpoint wiring and routes, a concrete service with viewer-backed authorization and host-filter TTL cache, MySQL SCD datastore implementation and migration for host_scd_data, bitmap utilities and host-ID blob ops, two datasets (uptime, cve) with collection/cleanup flows and cron schedules, CLI tools for collection/backfill, string parsing helpers, and extensive unit and integration tests across the new modules.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): docstring coverage is 31.82%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
  • Title check (✅ Passed): the title 'Dashboard charts backend' is concise and accurately reflects the main objective of implementing a new chart bounded context with a backend endpoint.
  • Description check (✅ Passed): the PR description is comprehensive and complete, covering the related issue reference, data storage (SCD format), collection strategies, retrieval flow, and the required checklist items.
  • Linked Issues check (✅ Passed): skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check (✅ Passed): skipped because no linked issues were found for this pull request.



Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 6

🧹 Nitpick comments (6)
server/chart/internal/service/service.go (1)

85-85: Consider injecting a clock for deterministic GetChartData tests.

time.Now() here means end-to-end tests of GetChartData can't assert on the exact start/end passed to the datastore. computeBucketRange is unit-tested in isolation, so this is only a minor ergonomics concern — a func() time.Time field on Service (defaulted to time.Now) would close the gap.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/chart/internal/service/service.go` at line 85, Replace the direct call
to time.Now() in GetChartData by injecting a clock on the Service so tests can
control "now": add a field Clock func() time.Time to the Service struct (default
it to time.Now in constructor/new service), update GetChartData to call
s.Clock() instead of time.Now(), and keep computeBucketRange usage unchanged so
unit tests remain valid while end-to-end tests can supply a deterministic clock.
tools/charts-backfill/main.go (1)

30-82: LGTM for a dev tool.

Flag parsing, DSN validation via Ping, host-ID sourcing, and UTC-midnight alignment all look right. INSERT ... ON DUPLICATE KEY UPDATE against (dataset, entity_id, valid_from) correctly makes re-runs idempotent.

Minor: consider validating *days > 0 up front; with days=0 the tool silently does nothing (for day := range days is a no-op), and with negative values the AddDate(0, 0, -(*days-1)) computes a future start before exiting the empty loop.

Based on learnings: For test harnesses and CLI tools in the Fleet codebase, resource cleanup on error paths may not be necessary since the OS handles cleanup when the process exits.
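The suggested guard could look like this (a sketch showing only the flag-validation portion; validateDays is an illustrative helper, not part of the actual tool):

```go
package main

import (
	"flag"
	"fmt"
	"log"
)

// validateDays rejects zero or negative day counts before any work happens.
func validateDays(days int) error {
	if days <= 0 {
		return fmt.Errorf("invalid days: must be > 0 (got %d)", days)
	}
	return nil
}

func main() {
	days := flag.Int("days", 30, "number of days to backfill")
	flag.Parse()
	if err := validateDays(*days); err != nil {
		log.Fatal(err)
	}
	log.Printf("backfilling %d days", *days)
}
```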

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/charts-backfill/main.go` around lines 30 - 82, The CLI doesn't validate
the -days flag so days==0 or negative values produce no work or a future start
date; in main() after flag.Parse() add a validation check for *days to ensure
it's > 0 and call log.Fatalf with a clear message (e.g. "invalid days: must be >
0") if not, so functions like backfill, the start date calculation and loops
(see variables days, startDate, start and call to backfill) always receive a
sane positive day count.
server/chart/internal/mysql/data.go (2)

260-330: GetSCDData looks correct; one minor determinism nit for snapshot aggregation.

The overlap SQL predicate (valid_from < lastBucketEnd AND valid_to > firstBucketStart), bucket labels (startDate + (i+1)*bucketSize, first label = startDate+bucketSize, last = endDate), and filter-mask AND all line up with the documented semantics and the snapshot/accumulate tests in data_test.go.

Minor: aggregateBucket in the snapshot path picks the first row per entity encountered in rows, but the SQL has no ORDER BY. Under correct write invariants at most one row per entity satisfies valid_from < bucketEnd AND valid_to >= bucketEnd, so the dedup is a guard. If you ever breach that invariant (e.g., a bug or partial-write race), the selected row becomes driver-order-dependent and hard to reproduce. Adding ORDER BY entity_id, valid_from DESC (or explicitly picking max(valid_from)) would make the behavior deterministic under failure modes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/chart/internal/mysql/data.go` around lines 260 - 330, GetSCDData's SQL
scan can return rows in nondeterministic order so aggregateBucket may pick
different "first" rows under races; fix by making the query deterministic: in
the query string built in GetSCDData (select from host_scd_data) append an ORDER
BY clause such as "ORDER BY entity_id, valid_from DESC" (or alternatively use an
aggregate like max(valid_from) per entity) so that when aggregateBucket operates
it consistently picks the newest valid_from per entity; update the query
variable (and re-run sqlx.In/rewind) accordingly.

369-381: Consider chunking the cleanup DELETE to avoid long locks.

DELETE FROM host_scd_data WHERE valid_to < CURDATE() - INTERVAL ? DAY AND valid_to <> ? will run as one statement. On large tables with a big backlog (first run after deploy, or after a long cron outage) this can hold row/gap locks for a while and churn the binlog. Consider a bounded loop with LIMIT N per statement — the pattern used elsewhere in the repo for DELETE sweeps. Safe to defer if the retention window is small and cron runs regularly.

The AND valid_to <> ? guard against the sentinel is redundant (sentinel 9999-12-31 can never satisfy valid_to < CURDATE()-N) but harmless as defense-in-depth.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/chart/internal/mysql/data.go` around lines 369 - 381, The single large
DELETE in Datastore.CleanupSCDData can hold long locks; change CleanupSCDData to
run deletes in a bounded loop that issues the same DELETE with a LIMIT (e.g.,
chunkSize constant) until no rows are affected, using ds.writer(ctx).ExecContext
to execute each chunk and checking result.RowsAffected to exit; keep the
existing scdOpenSentinel guard (or optionally drop it) and propagate errors from
ExecContext as before.
tools/charts-collect/main.go (1)

133-150: apiClient.get lacks timeout/retry semantics for a cron tool.

The HTTP client has a 30s timeout (line 86), but there's no retry/backoff for transient 5xx or network errors. Combined with the "one host failure → false CVE closure" issue above, this makes the collector fragile against brief API hiccups. Consider adding retries with exponential backoff, or treating any non-2xx / transport error as sufficient cause to abort the snapshot reconciliation for the run.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/charts-collect/main.go` around lines 133 - 150, The get method on
apiClient currently performs a single HTTP request (apiClient.get ->
http.NewRequest -> a.http.Do) and returns errors directly, which makes the cron
collector fragile; update apiClient.get to implement retry/backoff for transient
network errors and 5xx responses (or alternatively treat any transport or
non-2xx error as a fatal abort for the snapshot run depending on desired
behaviour). Specifically, wrap the a.http.Do call and the StatusCode check in a
retry loop with exponential backoff and a small maxAttempts, only returning
success when resp.StatusCode is 2xx; for 5xx or temporary network errors retry
(with increasing sleep) and for client errors (4xx) or exhausting retries return
a wrapped error indicating path and final status; ensure resp.Body is closed on
non-success and that the Authorization header logic (fmt.Sprintf("Bearer %s",
a.token)) and baseURL+path composition remain unchanged.
server/chart/internal/service/service_test.go (1)

256-281: Consider adding a test for the CollectDatasets error path.

TestCollectDatasetsUptime only exercises the happy path, so the s.logger.ErrorContext(...) branch in service.go is untested. Passing nil for the logger (as other tests do) combined with a dataset that returns an error would reproduce the NPE called out on service.go (lines 42–50). A small test with a stub dataset whose Collect returns an error would lock down the behavior you choose for that path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/chart/internal/service/service_test.go` around lines 256 - 281, Add a
test that exercises the error path of svc.CollectDatasets by registering a stub
dataset whose Collect method returns an error and invoking svc.CollectDatasets
with a nil logger (like other tests do) to reproduce and lock down the NPE on
logger.ErrorContext; specifically, create a minimal fake dataset type with
Collect(...) error that returns an error, call svc := NewService(nil, ds, nil);
svc.RegisterDataset(&faultyDataset{}); err := svc.CollectDatasets(t.Context(),
someTime) and assert err is non-nil and that the call does not panic (use
require.Error and no panic assertion), which will validate or force handling
around logger.ErrorContext in service.go.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cmd/fleet/serve.go`:
- Around line 1949-1950: The chart service currently only registers
UptimeDataset, so register the implemented CVE dataset by calling
chartSvc.RegisterDataset with a CVEDataset instance (e.g.,
chartSvc.RegisterDataset(&chart.CVEDataset{})); place this call alongside the
existing chartSvc.RegisterDataset(&chart.UptimeDataset{}) after
chart_bootstrap.New(...) so CVE chart requests have the proper dataset
metadata/resolution.

In `@server/chart/blob.go`:
- Around line 14-32: HostIDsToBlob currently sizes its allocation by the largest
ID which allows an attacker or corrupted input to force huge allocations; add a
sanity cap check in HostIDsToBlob: compute requiredBytes := maxID/8+1 and if
requiredBytes exceeds a defined constant cap (e.g. MaxBlobBytes or
MaxHostBlobSize) then handle it safely (either return nil/empty blob or return
an error path the callers expect — keep the existing function signature by
returning nil if you can't change APIs) and/or filter out IDs > cap*8 before
allocating. Reference HostIDsToBlob, maxID, and the blob allocation site when
adding the guard and unit tests to cover a very large ID to ensure it no longer
triggers large allocations.

In `@server/chart/internal/mysql/charts.go`:
- Around line 97-104: The WHERE clause's timestamp comparison only uses
h.detail_updated_at and h.created_at as fallbacks when both hst.seen_time and
ne.last_seen_at are NULL, so newer detail_updated_at or created_at values can be
ignored; update the SQL expression to take the greatest of all possible activity
timestamps (hst.seen_time, ne.last_seen_at, NULLIF(h.detail_updated_at,
'2000-01-01 00:00:00'), h.created_at) and compare that GREATEST(...) >= ? so the
newest signal wins (adjust the COALESCE/GREATEST nesting accordingly to include
h.detail_updated_at and h.created_at as candidates alongside hst.seen_time and
ne.last_seen_at).

In `@server/chart/internal/service/service.go`:
- Around line 42-50: CollectDatasets currently dereferences s.logger (calling
s.logger.ErrorContext) which can panic if NewService allowed a nil logger, and
it always returns nil hiding dataset errors; fix by guarding the logger and
aggregating errors: in Service.CollectDatasets iterate datasets calling
dataset.Collect(ctx, s.store, now), for each error first check if s.logger !=
nil before calling s.logger.ErrorContext (or use a safe discard logger set in
NewService), and append each error to a collection (use errors.Join to combine)
so CollectDatasets returns a non-nil aggregated error when any dataset failed
while still continuing to process remaining datasets; update NewService to
either enforce non-nil logger or set a discard logger to avoid nil-receiver
panics in s.logger.

In `@tools/charts-collect/main.go`:
- Around line 227-264: The loop that builds cveHosts currently logs per-host
fetchHostCVEs errors and continues, which can cause reconcileSnapshot to close
CVE rows incorrectly; modify the logic to detect any fetch error (e.g., set a
hadFetchError bool or collect failed hostIDs) during the hostIDs loop and if any
error occurred, return early (or skip reconciliation) instead of calling
reconcileSnapshot with a partial entityBitmaps; alternatively implement bounded
retries with backoff inside the loop for fetchHostCVEs before marking failure so
reconcileSnapshot is only invoked when the full scan completed successfully and
entityBitmaps accurately reflects all hosts for use with bucketStart.
- Around line 192-210: The current db.QueryRow()/Scan() swallow treats any
non-nil err the same as sql.ErrNoRows and can drop accumulated bitmap merges;
change the logic around QueryRow/Scan so you specifically handle sql.ErrNoRows
(i.e., if errors.Is(err, sql.ErrNoRows) then skip merging) and for any other
non-nil err return it (wrap with fmt.Errorf as in the insert error handling).
Use errors.Is(err, sql.ErrNoRows) (add "errors" to imports) and only call
chart.BlobOR(existing, newBlob) when Scan succeeded (err == nil); otherwise
return the real DB error.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: f49d4152-ab3c-4290-91c5-78b76cd13920

📥 Commits

Reviewing files that changed from the base of the PR and between c2379ea and a38d0bf.

⛔ Files ignored due to path filters (2)
  • tools/charts-backfill/README.md is excluded by !**/*.md
  • tools/charts-collect/README.md is excluded by !**/*.md
📒 Files selected for processing (25)
  • cmd/fleet/cron.go
  • cmd/fleet/serve.go
  • pkg/str/str.go
  • pkg/str/str_test.go
  • server/chart/api/chart.go
  • server/chart/api/http/types.go
  • server/chart/api/service.go
  • server/chart/arch_test.go
  • server/chart/blob.go
  • server/chart/blob_test.go
  • server/chart/bootstrap/bootstrap.go
  • server/chart/datasets.go
  • server/chart/internal/mysql/charts.go
  • server/chart/internal/mysql/data.go
  • server/chart/internal/mysql/data_test.go
  • server/chart/internal/service/endpoint_utils.go
  • server/chart/internal/service/handler.go
  • server/chart/internal/service/service.go
  • server/chart/internal/service/service_test.go
  • server/chart/internal/types/chart.go
  • server/datastore/mysql/migrations/tables/20260421182057_AddHostSCDData.go
  • server/datastore/mysql/migrations/tables/20260421182057_AddHostSCDData_test.go
  • server/fleet/cron_schedules.go
  • tools/charts-backfill/main.go
  • tools/charts-collect/main.go

@codecov

codecov Bot commented Apr 21, 2026

Codecov Report

❌ Patch coverage is 41.06280% with 366 lines in your changes missing coverage. Please review.
✅ Project coverage is 66.76%. Comparing base (5da912a) to head (ef625c4).
⚠️ Report is 5 commits behind head on main.

Files with missing lines Patch % Lines
server/chart/internal/mysql/data.go 8.09% 157 Missing and 2 partials ⚠️
server/chart/internal/mysql/charts.go 0.00% 80 Missing ⚠️
server/chart/internal/service/endpoint_utils.go 0.00% 26 Missing ⚠️
server/chart/internal/service/handler.go 0.00% 25 Missing ⚠️
cmd/fleet/serve.go 0.00% 24 Missing ⚠️
cmd/fleet/cron.go 0.00% 15 Missing ⚠️
server/chart/internal/service/service.go 87.37% 9 Missing and 4 partials ⚠️
server/chart/bootstrap/bootstrap.go 0.00% 6 Missing ⚠️
server/chart/datasets.go 68.42% 4 Missing and 2 partials ⚠️
server/chart/internal/service/host_cache.go 92.72% 2 Missing and 2 partials ⚠️
... and 4 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #43910      +/-   ##
==========================================
- Coverage   66.83%   66.76%   -0.08%     
==========================================
  Files        2610     2623      +13     
  Lines      210529   211148     +619     
  Branches     9424     9424              
==========================================
+ Hits       140712   140966     +254     
- Misses      57000    57353     +353     
- Partials    12817    12829      +12     
Flag Coverage Δ
backend 68.53% <41.06%> (-0.10%) ⬇️

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

@sgress454
Contributor Author

@coderabbitai review

@coderabbitai
Contributor

coderabbitai Bot commented Apr 21, 2026

✅ Actions performed

Review triggered.


Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@server/chart/internal/mysql/charts.go`:
- Around line 88-93: FindRecentlySeenHostIDs is performing a read but
incorrectly calls ds.writer(ctx), which routes traffic to primary; change the
call to use ds.reader(ctx) instead (i.e., replace ds.writer(ctx) with
ds.reader(ctx) in the sqlx.SelectContext call) so reads use the replica helper;
if this read truly needs to be pinned to primary for replication reasons, ensure
the caller sets ctxdb.RequirePrimary(ctx) so ds.reader(ctx) will honor the
primary requirement.

In `@server/chart/internal/mysql/data.go`:
- Around line 361-371: The DELETE uses CURDATE() which is session-local; instead
compute the UTC cutoff in Go and pass it as a parameter to the query so the
retention boundary is deterministic. In Datastore.CleanupSCDData, replace the
CURDATE() - INTERVAL ? DAY expression with a placeholder (e.g. valid_to < ?) and
compute cutoff := time.Now().UTC().AddDate(0,0,-days) (or derive from the
caller's UTC logic), then call ds.writer(ctx).ExecContext with cutoff and
scdOpenSentinel as the bound parameters; keep the valid_to <> ? sentinel check
as-is and wrap/return errors the same way.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 1db182b8-eba2-46a6-bde9-db6f97d4b065

📥 Commits

Reviewing files that changed from the base of the PR and between a38d0bf and 3735695.

📒 Files selected for processing (10)
  • server/chart/api/chart.go
  • server/chart/api/http/types.go
  • server/chart/internal/mysql/charts.go
  • server/chart/internal/mysql/data.go
  • server/chart/internal/service/handler.go
  • server/chart/internal/service/host_cache.go
  • server/chart/internal/service/host_cache_test.go
  • server/chart/internal/service/service.go
  • server/chart/internal/service/service_test.go
  • server/chart/internal/types/chart.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • server/chart/api/http/types.go
  • server/chart/internal/service/service.go
  • server/chart/internal/types/chart.go

@sgress454
Contributor Author

@coderabbitai review

@coderabbitai
Contributor

coderabbitai Bot commented Apr 21, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (1)
server/chart/internal/service/service.go (1)

47-54: ⚠️ Potential issue | 🟠 Major

Make collection failures observable and nil-logger safe.

NewService allows logger == nil, so this path can panic on collection errors in tests or alternate wiring. Also, returning nil hides dataset failures from cron/monitoring; continue collecting, but return an aggregate error.

🛠️ Proposed fix
 import (
 	"context"
+	"errors"
 	"fmt"
 	"log/slog"
 	"time"
@@
 func (s *Service) CollectDatasets(ctx context.Context, now time.Time) error {
+	var errs []error
 	for name, dataset := range s.datasets {
 		if err := dataset.Collect(ctx, s.store, now); err != nil {
 			// Log and continue — don't let one dataset failure block others.
-			s.logger.ErrorContext(ctx, "collect chart dataset", "dataset", name, "err", ctxerr.Wrap(ctx, err, "collect chart dataset"))
+			wrapped := ctxerr.Wrap(ctx, err, "collect chart dataset")
+			if s.logger != nil {
+				s.logger.ErrorContext(ctx, "collect chart dataset", "dataset", name, "err", wrapped)
+			}
+			errs = append(errs, fmt.Errorf("dataset %q: %w", name, wrapped))
 		}
 	}
-	return nil
+	return errors.Join(errs...)
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/chart/internal/service/service.go` around lines 47 - 54, The
CollectDatasets method currently can panic if s.logger is nil and also drops all
dataset errors; update Service.CollectDatasets to be nil-logger safe and return
an aggregate error: for each entry in s.datasets call dataset.Collect(ctx,
s.store, now) and if it fails, use a safe logging helper that checks s.logger !=
nil (or use a no-op logger) to log "collect chart dataset" with dataset name and
wrapped error, append the error into a slice, continue iterating, and at the end
return nil if no errors or a combined error (e.g., multi-error or
fmt.Errorf("multiple errors: %v", errs)) so callers/crons can observe failures;
reference the Service.CollectDatasets, s.logger, s.datasets, and dataset.Collect
symbols when making the changes.
🧹 Nitpick comments (1)
server/chart/internal/service/host_cache.go (1)

103-112: Canonicalize duplicate filter values before hashing.

Sorting alone means LabelIDs: []uint{1, 2} and LabelIDs: []uint{1, 2, 2} produce different cache keys even though downstream filtering is semantically the same. Deduplicate each sorted slice to avoid cache fragmentation from repeated query params.

♻️ Proposed canonicalization
-	teams := slices.Clone(f.TeamIDs)
-	slices.Sort(teams)
-	labels := slices.Clone(f.LabelIDs)
-	slices.Sort(labels)
-	platforms := slices.Clone(f.Platforms)
-	slices.Sort(platforms)
-	include := slices.Clone(f.IncludeHostIDs)
-	slices.Sort(include)
-	exclude := slices.Clone(f.ExcludeHostIDs)
-	slices.Sort(exclude)
+	teams := canonicalUintSlice(f.TeamIDs)
+	labels := canonicalUintSlice(f.LabelIDs)
+	platforms := canonicalStringSlice(f.Platforms)
+	include := canonicalUintSlice(f.IncludeHostIDs)
+	exclude := canonicalUintSlice(f.ExcludeHostIDs)
@@
 }
+
+func canonicalUintSlice(in []uint) []uint {
+	out := slices.Clone(in)
+	slices.Sort(out)
+	return slices.Compact(out)
+}
+
+func canonicalStringSlice(in []string) []string {
+	out := slices.Clone(in)
+	slices.Sort(out)
+	return slices.Compact(out)
+}

Also applies to: 120-121

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/chart/internal/service/host_cache.go` around lines 103 - 112, The
cloned filter slices (teams, labels, platforms, include, exclude) are only being
sorted but not deduplicated, so semantically-identical filters with repeated IDs
yield different cache keys; after cloning and sorting each slice (teams, labels,
platforms, include, exclude) remove consecutive duplicates (canonicalize) before
using them for hashing/key construction in host_cache.go so that []{1,2,2}
becomes []{1,2}; apply the same deduplication to the other cloned slices used
for cache key generation in the same function.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@server/chart/internal/service/host_cache.go`:
- Around line 24-35: The hostFilterCache currently leaves expired entries in the
entries map indefinitely; add bounded-pruning logic to prevent unbounded growth
by implementing a size or time-based eviction: modify hostFilterCache to include
a maxEntries (or implement an LRU) and a background janitor goroutine (or
prune-on-insert) that removes expired entries based on ttl and evicts oldest
entries when entries exceeds maxEntries; update places that populate entries
(look for methods using hostFilterCache.entries and singleflight via sf) to call
the prune/evict helper after writes, and ensure the janitor uses
hostFilterCache.clock and respects shutdown semantics if applicable.

In `@server/chart/internal/service/service.go`:
- Around line 95-109: The Downsample divisor check should only apply for
sub-daily datasets: keep the negative-value check for opts.Downsample but move
or gate the "24%opts.Downsample != 0" validation so it runs only when
dataset.BucketSize() < 24*time.Hour (i.e., sub-daily). Update the logic around
the existing validation and the bucketSize computation (referencing
opts.Downsample and dataset.BucketSize()) so daily datasets (bucketSize >=
24*time.Hour) do not fail on non-divisor downsample values.

---

Duplicate comments:
In `@server/chart/internal/service/service.go`:
- Around line 47-54: The CollectDatasets method currently can panic if s.logger
is nil and also drops all dataset errors; update Service.CollectDatasets to be
nil-logger safe and return an aggregate error: for each entry in s.datasets call
dataset.Collect(ctx, s.store, now) and if it fails, use a safe logging helper
that checks s.logger != nil (or use a no-op logger) to log "collect chart
dataset" with dataset name and wrapped error, append the error into a slice,
continue iterating, and at the end return nil if no errors or a combined error
(e.g., multi-error or fmt.Errorf("multiple errors: %v", errs)) so callers/crons
can observe failures; reference the Service.CollectDatasets, s.logger,
s.datasets, and dataset.Collect symbols when making the changes.

---

Nitpick comments:
In `@server/chart/internal/service/host_cache.go`:
- Around line 103-112: The cloned filter slices (teams, labels, platforms,
include, exclude) are only being sorted but not deduplicated, so
semantically-identical filters with repeated IDs yield different cache keys;
after cloning and sorting each slice (teams, labels, platforms, include,
exclude) remove consecutive duplicates (canonicalize) before using them for
hashing/key construction in host_cache.go so that []{1,2,2} becomes []{1,2};
apply the same deduplication to the other cloned slices used for cache key
generation in the same function.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 88a3e721-736e-448f-836b-c0f33633eb45

📥 Commits

Reviewing files that changed from the base of the PR and between 3735695 and f2acaec.

📒 Files selected for processing (13)
  • cmd/fleet/serve.go
  • server/acl/chartacl/fleet_adapter.go
  • server/acl/chartacl/fleet_adapter_test.go
  • server/chart/api/chart.go
  • server/chart/api/http/types.go
  • server/chart/api/service.go
  • server/chart/bootstrap/bootstrap.go
  • server/chart/internal/mysql/charts.go
  • server/chart/internal/service/host_cache.go
  • server/chart/internal/service/host_cache_test.go
  • server/chart/internal/service/service.go
  • server/chart/internal/service/service_test.go
  • server/chart/internal/types/chart.go
✅ Files skipped from review due to trivial changes (1)
  • server/chart/internal/types/chart.go
🚧 Files skipped from review as they are similar to previous changes (4)
  • server/chart/api/service.go
  • server/chart/bootstrap/bootstrap.go
  • server/chart/api/http/types.go
  • server/chart/internal/mysql/charts.go

Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
server/chart/datasets.go (1)

22-35: Small nit: consider logging / metric when hostIDs is empty.

Silent early return on an empty recently-seen list is correct for the accumulate strategy, but makes it hard to distinguish "legitimately no hosts active in the last 10 minutes" from "collector never sees anything" at runtime. A debug log with the sample time (`now`) would help operators spot stuck collectors.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/chart/datasets.go` around lines 22 - 35, In UptimeDataset.Collect, add
an explicit debug log and/or metric increment right before the early return when
len(hostIDs) == 0: log the sample time (now) and the uptimeRecentlySeenWindow so
operators can distinguish a legitimately empty window from a stuck collector,
and optionally increment a counter like "uptime_empty_recently_seen" via your
existing metrics facility; place the log/metric call immediately above the
current return nil in UptimeDataset.Collect so it runs only for the
empty-hostIDs case.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@server/chart/datasets.go`:
- Around line 22-35: In UptimeDataset.Collect, add an explicit debug log and/or
metric increment right before the early return when len(hostIDs) == 0: log the
sample time (now) and the uptimeRecentlySeenWindow so operators can distinguish
a legitimately empty window from a stuck collector, and optionally increment a
counter like "uptime_empty_recently_seen" via your existing metrics facility;
place the log/metric call immediately above the current return nil in
UptimeDataset.Collect so it runs only for the empty-hostIDs case.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: fe46d7b6-bdf7-4356-aa97-9ad9588712d3

📥 Commits

Reviewing files that changed from the base of the PR and between e642667 and b4058b3.

📒 Files selected for processing (5)
  • server/chart/api/chart.go
  • server/chart/datasets.go
  • server/chart/internal/mysql/charts.go
  • server/chart/internal/service/service_test.go
  • server/chart/internal/types/chart.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • server/chart/internal/types/chart.go

@lukeheath
Member

@sgress454 Victor is focused on wrapping up a story for 4.84.0 so @ksykulev is going to take review on this, which works well since he has context from the hackathon.

@sgress454 sgress454 mentioned this pull request Apr 22, 2026
3 tasks
Contributor

@ksykulev ksykulev left a comment


I think besides making Response.Data work with multiple series as demonstrated in ksykulev/dashboard-policy-chart, this looks like a good template to start for adding more relevant charts.

),
NULLIF(h.detail_updated_at, '2000-01-01 00:00:00'),
h.created_at
) >= ?`
Contributor

Does the host_seen_times.idx_host_seen_times_seen_time index cover this query?

Contributor Author

Will double-check

Contributor Author

@sgress454 sgress454 Apr 23, 2026

Ah it doesn't use that index because of the COALESCE (I don't think there's any way to index that). So it's a full table scan on hosts, but PKs for all the joins. For something that runs once every ten minutes I think it'll be fine even at 100k hosts but if you feel otherwise I'll look into updating it, perhaps to a UNION. Note that nano_enrollments.last_seen_at doesn't have an index at all so that'd always be a bigger scan (although we could look into adding one).

Contributor

I'm ok with it, as far as I know the hosts table isn't a very hot table, right?

ksykulev previously approved these changes Apr 23, 2026
@sgress454 sgress454 merged commit 28908e6 into main Apr 23, 2026
43 checks passed
@sgress454 sgress454 deleted the sgress454/dashboard-charts-backend branch April 23, 2026 17:43
sgress454 added a commit that referenced this pull request Apr 24, 2026
**Related issue:** For #43769

# Details

This PR adds "Hosts active" and "Hosts enrolled" charts to the
dashboard.

New components:

* **ChartCard.tsx**: encapsulates a visualization-agnostic chart, for
data provided by the new `/charts` endpoint created in
#43910
* **ChartFilterModal.tsx**: modal for setting filters on a chart.
Currently supports filtering by label, platform and individual host.
* **CheckerboardViz.tsx**: a checkerboard visualization for use in
ChartCard. Capable of charting 1, 7, 14 or 30 days at a time, although
only 30 day charts are used right now. Bespoke rendering using SVG,
since recharts scatterchart was harder to wrangle than it was worth.
* **LineChartViz.tsx**: a line-chart visualization using Recharts
* **HostsEnrolledCard.tsx**: a bar chart of enrolled hosts using
Recharts

# Checklist for submitter

If some of the following don't apply, delete the relevant line.

- [X] Changes file added for user-visible changes in `changes/`,
`orbit/changes/` or `ee/fleetd-chrome/changes`.
See [Changes
files](https://github.com/fleetdm/fleet/blob/main/docs/Contributing/guides/committing-changes.md#changes-files)
for more information.

## Testing

- [X] Added/updated automated tests
- [X] QA'd all new/changed functionality manually
- With backend provided by #43910:
<img width="1426" height="428" alt="image"
src="https://github.com/user-attachments/assets/0f53b9d1-c87b-4225-a175-2d40af5afe41"
/>


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Dashboard now shows interactive "Hosts active" (line/heatmap) and
"Hosts enrolled" (bar) charts with metric selection, filter modal
(labels/platforms/hosts), legends, tooltips, and responsive layout.
* **Tests**
* Added comprehensive tests covering chart rendering, checkerboard
heatmap, and no-data states.
* **Chores**
  * Added charting library dependency to support visualizations.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->