
Optimize data collection: add index and batch deletes#44692

Merged
sgress454 merged 5 commits into main from
sgress454/44609-add-index-to-host-scd-table
May 5, 2026

Conversation

@sgress454
Contributor

@sgress454 sgress454 commented May 4, 2026

Related issue: Resolves #44609

Details

This PR optimizes the historical data collection system in two ways:

  1. Adds an index on the host_scd_data table to allow more efficient lookups of rows by valid_to, speeding up both closing out open rows and deleting old rows
  2. Implements batching in the job that deletes old rows, so that it no longer blocks writes when the collection job runs at the same time as the cleanup job

Checklist for submitter

If some of the following don't apply, delete the relevant line.

  • Changes file added for user-visible changes in changes/, orbit/changes/ or ee/fleetd-chrome/changes.
    See Changes files for more information.
    n/a, unreleased
  • Input data is properly validated, SELECT * is avoided, SQL injection is prevented (using placeholders for values in statements), JS inline code is prevented especially for url redirects, and untrusted data interpolated into shell scripts/commands is validated against shell metacharacters.
  • Timeouts are implemented and retries are limited to avoid infinite loops

Testing

SQL EXPLAIN output before the change:

+----+-------------+---------------+------------+------+---------------+------+---------+------+--------+----------+-------------+
| id | select_type | table         | partitions | type | possible_keys | key  | key_len | ref  | rows   | filtered | Extra       |
+----+-------------+---------------+------------+------+---------------+------+---------+------+--------+----------+-------------+
|  1 | DELETE      | host_scd_data | NULL       | ALL  | NULL          | NULL | NULL    | NULL | 144320 |   100.00 | Using where |
+----+-------------+---------------+------------+------+---------------+------+---------+------+--------+----------+-------------+

+----+-------------+---------------+------------+-------+--------------------------------------+--------------------+---------+-------------+------+----------+-------------+
| id | select_type | table         | partitions | type  | possible_keys                        | key                | key_len | ref         | rows | filtered | Extra       |
+----+-------------+---------------+------------+-------+--------------------------------------+--------------------+---------+-------------+------+----------+-------------+
|  1 | UPDATE      | host_scd_data | NULL       | range | uniq_entity_bucket,idx_dataset_range | uniq_entity_bucket | 604     | const,const | 3030 |   100.00 | Using where |
+----+-------------+---------------+------------+-------+--------------------------------------+--------------------+---------+-------------+------+----------+-------------+

Using a test set of data (~144k "open" rows), UPDATEs ran at 9 ops per second.

After the change:

+----+-------------+---------------+------------+-------+----------------------+----------------------+---------+-------+-------+----------+-------------+
| id | select_type | table         | partitions | type  | possible_keys        | key                  | key_len | ref   | rows  | filtered | Extra       |
+----+-------------+---------------+------------+-------+----------------------+----------------------+---------+-------+-------+----------+-------------+
|  1 | DELETE      | host_scd_data | NULL       | range | idx_valid_to_dataset | idx_valid_to_dataset | 5       | const | 55749 |   100.00 | Using where |
+----+-------------+---------------+------------+-------+----------------------+----------------------+---------+-------+-------+----------+-------------+

+----+-------------+---------------+------------+-------+-----------------------------------------------------------+----------------------+---------+-------------------+------+----------+------------------------------+
| id | select_type | table         | partitions | type  | possible_keys                                             | key                  | key_len | ref               | rows | filtered | Extra                        |
+----+-------------+---------------+------------+-------+-----------------------------------------------------------+----------------------+---------+-------------------+------+----------+------------------------------+
|  1 | UPDATE      | host_scd_data | NULL       | range | uniq_entity_bucket,idx_dataset_range,idx_valid_to_dataset | idx_valid_to_dataset | 609     | const,const,const |    4 |   100.00 | Using where; Using temporary |
+----+-------------+---------------+------------+-------+-----------------------------------------------------------+----------------------+---------+-------------------+------+----------+------------------------------+

Using the same test set of data, UPDATEs ran at 4,910 ops per second.

For unreleased bug fixes in a release candidate, one of:

  • Confirmed that the fix is not expected to adversely impact load test results
    this should significantly improve results!
  • Alerted the release DRI if additional load testing is needed

Database migrations

  • Checked schema for all modified tables for columns that will auto-update timestamps during migration.
  • Confirmed that updating the timestamps is acceptable, and will not cause unwanted side effects.
  • Ensured the correct collation is explicitly set for character columns (COLLATE utf8mb4_unicode_ci).

Summary by CodeRabbit

  • Chores
    • Cleanup now runs in controlled, ordered batches, removing only closed/historical records while respecting cancellation; error reporting for cleanup was strengthened.
    • Added a new composite index on historical data to improve cleanup and query performance.
  • Tests
    • Added tests and test helpers validating batched cleanup behavior, preservation of open records, multi-batch operation, and cancellation handling.

@sgress454 sgress454 requested a review from a team as a code owner May 4, 2026 19:11
Copilot AI review requested due to automatic review settings May 4, 2026 19:11

@claude (bot) left a comment


Claude Code Review

This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.


@sgress454 changed the title from "Sgress454/44609 add index to host scd table" to "Optimize data collection: add index and batch deletes" on May 4, 2026
@sgress454
Contributor Author

@coderabbitai review

@coderabbitai
Contributor

coderabbitai Bot commented May 4, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Contributor

Copilot AI left a comment


Pull request overview

This PR aims to reduce MySQL load from the chart historical data (SCD) system by (1) adding an index to accelerate host_scd_data lookups by valid_to, and (2) changing cleanup to delete old rows in smaller batches to shorten lock windows and better interleave with concurrent writers.

Changes:

  • Add a new secondary index on host_scd_data intended to optimize queries filtering on valid_to.
  • Update CleanupSCDData to delete rows in batches (looping DELETE ... LIMIT ...) instead of one large delete.
  • Add a batch-size constant for SCD cleanup.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

Files changed:
  • server/datastore/mysql/migrations/tables/20260423161823_AddHostSCDData.go: adds an index definition to the host_scd_data table creation SQL.
  • server/chart/internal/mysql/data.go: implements batched deletion for old SCD rows and introduces scdCleanupBatch.


Comment thread server/chart/internal/mysql/data.go
Comment thread server/chart/internal/mysql/data.go
Comment thread server/chart/internal/mysql/data.go
@coderabbitai
Contributor

coderabbitai Bot commented May 4, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c039f64f-4680-41af-b09a-b6273ea77f6b

📥 Commits

Reviewing files that changed from the base of the PR and between cfd66ac and e46cb56.

📒 Files selected for processing (1)
  • server/chart/arch_test.go

Walkthrough

Added package-level variable scdCleanupBatch and refactored CleanupSCDData to perform repeated, ordered, batched deletes: each iteration checks ctx.Err(), computes the UTC cutoff, executes DELETE ... WHERE valid_to < ? AND valid_to <> ? ORDER BY valid_to LIMIT ?, reads RowsAffected() and stops when a batch deletes fewer than scdCleanupBatch rows; errors from the delete execution and RowsAffected() are wrapped. Added composite index idx_valid_to_dataset (valid_to, dataset, entity_id) to host_scd_data. Added tests for CleanupSCDData and a chart testutils package to support those tests; adjusted arch test ignores.

Possibly related PRs

  • fleetdm/fleet#43910: Modifies the same CleanupSCDData implementation to use ordered, batched deletes with context checks and updated error handling.
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: docstring coverage is 25.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main changes: adding an index and implementing batch deletes for data collection optimization.
Description check ✅ Passed The PR description covers the two main changes, includes SQL EXPLAIN outputs, performance measurements, testing approach, and database migration considerations.
Linked Issues check ✅ Passed The PR addresses issue #44609 by adding an index for efficient lookups and implementing batching to reduce database load during cleanup operations.
Out of Scope Changes check ✅ Passed All changes are directly related to optimizing historical data collection: index creation, batch delete implementation, tests, test utilities, and architecture test adjustments.




Contributor

@coderabbitai (bot) left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
server/chart/internal/mysql/data.go (1)

384-389: 💤 Low value

Consider adding ORDER BY valid_to to the batched DELETE for statement-based replication safety.

DELETE ... LIMIT without ORDER BY is flagged as unsafe by MySQL in binlog_format=STATEMENT mode and can produce replication errors or warnings in mixed-mode setups. With RDS's default ROW format this is benign, but adding an explicit ORDER BY costs nothing and makes the statement portable:

♻️ Proposed change
  res, err := ds.writer(ctx).ExecContext(ctx,
      `DELETE FROM host_scd_data
       WHERE valid_to < ?
         AND valid_to <> ?
+      ORDER BY valid_to
       LIMIT ?`,
      cutoff, scdOpenSentinel, scdCleanupBatch)
Inline comment on server/datastore/mysql/migrations/tables/20260423161823_AddHostSCDData.go (around lines 35-36):

The CREATE TABLE migration in 20260423161823_AddHostSCDData.go will not add idx_valid_to_dataset on already-migrated databases. Create a new migration file (e.g., 20260503000000_AddHostSCDDataValidToIdx.go) that implements Up_20260503000000(tx *sql.Tx) to run ALTER TABLE host_scd_data ADD KEY idx_valid_to_dataset (valid_to, dataset, entity_id), wrapping and returning any error (e.g., fmt.Errorf("add idx_valid_to_dataset to host_scd_data: %w", err)); include a corresponding Down migration if the framework requires one.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 74fcc29e-30a5-42de-a49a-94abbef0ff53

📥 Commits

Reviewing files that changed from the base of the PR and between 1e6e8b1 and 72b815c.

📒 Files selected for processing (2)
  • server/chart/internal/mysql/data.go
  • server/datastore/mysql/migrations/tables/20260423161823_AddHostSCDData.go

@codecov

codecov Bot commented May 4, 2026

Codecov Report

❌ Patch coverage is 92.30769% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 66.69%. Comparing base (5f05ffe) to head (e46cb56).
⚠️ Report is 47 commits behind head on main.

Files with missing lines:
  • server/chart/internal/mysql/data.go: patch 76.47%, 2 missing and 2 partials ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##             main   #44692    +/-   ##
========================================
  Coverage   66.69%   66.69%            
========================================
  Files        2651     2652     +1     
  Lines      213440   213559   +119     
  Branches     9638     9638            
========================================
+ Hits       142344   142438    +94     
- Misses      58135    58154    +19     
- Partials    12961    12967     +6     
Flag coverage: backend 68.56% <92.30%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.



Member

@JordanMontgomery left a comment


This seems like a good solution. We may want to ping infra, or possibly just manually add the index on QA-Wolf (I think our DB access will allow it), to see if it resolves things.

@sgress454 sgress454 merged commit 5e7f5a7 into main May 5, 2026
53 checks passed
@sgress454 sgress454 deleted the sgress454/44609-add-index-to-host-scd-table branch May 5, 2026 13:29
sgress454 added a commit that referenced this pull request May 5, 2026


Development

Successfully merging this pull request may close these issues.

Significantly increased DB load on QA-Wolf-Premium as of April 30 due to scd_ queries

3 participants