Use swap table pattern and batch delete to improve DB access patterns for vuln cron#41729
Conversation
@coderabbitai full review

✅ Actions performed: Full review triggered.
**Walkthrough**

Replaces in-place updates to `kernel_host_counts` with a batched swap-table flow: create a swap table, compute counts via a read pass, batch-insert counts into the swap table, then atomically rename the swap table into the live `kernel_host_counts` and refresh OS version vulnerability aggregates. Adds a migration to drop the foreign key on `kernel_host_counts` and adjusts tests to count rows for host counts. Also changes orphaned software cleanup and software title cleanup to use batched reader-first queries and batched deletes.
🚥 Pre-merge checks: ✅ 2 passed, ❌ 3 failed (1 warning, 2 inconclusive).
🧹 Nitpick comments (1)
`server/datastore/mysql/operating_system_vulnerabilities.go` (1)

Line 519: **Consider increasing batch size for better insert throughput.** A batch size of 100 rows is relatively small for bulk inserts. Larger batches (e.g., 1000–5000) typically reduce round-trip overhead and improve throughput, especially when inserting many rows. However, the current size is functional and may be intentionally conservative.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/datastore/mysql/operating_system_vulnerabilities.go` at line 519, The constant batchSize = 100 in operating_system_vulnerabilities.go is small for bulk inserts; increase it to a larger value (for example 1000 or 5000) to improve insert throughput by reducing round-trips, while ensuring memory and transaction size remain acceptable; update the const batchSize declaration and re-run any insert tests related to the functions that use batchSize (search for references to batchSize in this file) and adjust if you encounter memory or DB transaction limits.
ℹ️ Review info — Configuration: `.coderabbit.yaml` · Review profile: CHILL · Plan: Pro · Run ID: dc3016a9-af4b-48d3-889d-77c3a6afc09d
📒 Files selected for processing (2)

- changes/41664-vulnerability-cron-db-contention
- server/datastore/mysql/operating_system_vulnerabilities.go
Pull request overview
This PR updates the MySQL datastore’s kernel-to-OS mapping refresh (InsertKernelSoftwareMapping) to use a swap-table rebuild approach, aiming to reduce contention/overload during the vulnerability cron when large host_software scans coincide with per-host software writes.
Changes:
- Rebuild `kernel_host_counts` via a newly populated swap table and atomically rename it into place.
- Read the aggregate source data from the reader and batch-insert results into the swap table on the writer.
- Update the release note entry for the vulnerability-cron contention work.
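The swap-table rebuild is a build-then-publish pattern: fill a copy nobody reads, then make it live in one atomic step. A minimal in-memory Go analogue (purely illustrative, not Fleet code — the real flow creates a swap table, batch-inserts into it, and uses `RENAME TABLE` to publish it):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// hostCounts maps a kernel version string to the number of hosts running it.
type hostCounts map[string]int

// live is the "published" table; readers only ever see complete snapshots.
var live atomic.Pointer[hostCounts]

// rebuild computes counts into a fresh swap map and publishes it in one
// atomic step — the analogue of CREATE swap table, batch INSERT, then
// RENAME TABLE over the live kernel_host_counts.
func rebuild(kernels []string) {
	swap := hostCounts{} // nobody reads the swap copy while it is filled
	for _, k := range kernels {
		swap[k]++
	}
	live.Store(&swap) // the "RENAME TABLE": readers switch in one step
}

func main() {
	rebuild([]string{"6.8.0-41", "6.8.0-41", "5.15.0-113"})
	fmt.Println((*live.Load())["6.8.0-41"]) // 2
}
```

Because readers only ever observe either the old or the new snapshot, the cron never exposes a half-written `kernel_host_counts` to queries.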
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| server/datastore/mysql/operating_system_vulnerabilities.go | Reworks `InsertKernelSoftwareMapping` to populate a swap table and atomically swap it in. |
| changes/41664-vulnerability-cron-db-contention | Updates the changelog entry describing the contention reduction work. |
**Codecov Report**

❌ Patch coverage — additional details and impacted files:

@@ Coverage Diff @@
## main #41729 +/- ##
==========================================
- Coverage 66.41% 66.41% -0.01%
==========================================
Files 2504 2505 +1
Lines 200618 200687 +69
Branches 8930 8930
==========================================
+ Hits 133243 133286 +43
- Misses 55330 55348 +18
- Partials 12045 12053 +8
Flags with carried forward coverage won't be shown.
@coderabbitai review

✅ Actions performed: Review triggered.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
server/datastore/mysql/operating_system_vulnerabilities.go (1)
511-576: ⚠️ Potential issue | 🟠 Major

**Don't swap the live `kernel_host_counts` table from an unguarded replica snapshot.**

`refreshOSVersionVulnerabilities()` runs immediately after this rename and treats the swapped table as authoritative. If the reader is lagging, this can publish a partial/empty kernel mapping and then delete still-valid Linux vuln rows. Please gate the reader path on replica freshness or fall back to the writer before the swap.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/datastore/mysql/operating_system_vulnerabilities.go` around lines 511 - 576, The code currently reads kernel host counts via ds.reader(ctx).QueryContext and then immediately renames swapTable into kernel_host_counts and calls ds.refreshOSVersionVulnerabilities(), risking publishing partial data if the replica reader is stale; modify the logic in the function that builds and swaps kernel_host_counts to first verify replica freshness (or fall back to ds.writer) before using ds.reader: add a freshness check using the same connection used by ds.reader (e.g., query replica replication lag/seconds_behind_master or a heartbeat table) and if the replica is stale, switch to ds.writer(ctx).QueryContext for the selectStmt so the data used to populate swapTable is authoritative; ensure the swap/rename and subsequent call to ds.refreshOSVersionVulnerabilities() remain unchanged and still run inside ds.withRetryTxx for atomicity.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@server/datastore/mysql/software.go`:
- Around line 2850-2856: The writer-side batch delete currently runs a DELETE
built from sqlx.In(`DELETE FROM software_titles WHERE id IN (?)`, ids) and can
remove rows that became non-orphaned after the read; modify the writer-side
delete (the stmt used with ds.writer(ctx).ExecContext) to include the same
orphan-check predicates used when building ids (e.g. add the NOT EXISTS or LEFT
JOIN + IS NULL conditions that verify there are no remaining referencing rows
such as software_instance_titles and any other orphan criteria) so the final
DELETE is conditional (e.g. DELETE ... WHERE id IN (?) AND <orphan-predicates>)
and thus safe against concurrent attaches while keeping the candidate-list
optimization.
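The guarded DELETE the comment describes can be sketched as follows; the referencing table and column names are illustrative, not Fleet's exact schema, and the `?` placeholders stand in for the `sqlx.In` expansion:

```go
package main

import (
	"fmt"
	"strings"
)

// guardedDeleteStmt builds the conditional DELETE: besides the candidate
// id list from the reader pass, the WHERE clause repeats the orphan check
// so a title that gained a reference after the read pass survives a
// concurrent attach.
func guardedDeleteStmt(nIDs int) string {
	in := strings.TrimSuffix(strings.Repeat("?,", nIDs), ",")
	return fmt.Sprintf(
		"DELETE st FROM software_titles st WHERE st.id IN (%s) "+
			"AND NOT EXISTS (SELECT 1 FROM software s WHERE s.title_id = st.id)",
		in)
}

func main() {
	fmt.Println(guardedDeleteStmt(3))
}
```

The candidate list keeps the delete cheap; the repeated `NOT EXISTS` predicate makes it correct under concurrency.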
- Around line 2830-2861: The orphaned-title cleanup currently re-runs the same
LIMIT ? query each iteration causing repeated full-table scans; change the query
and loop to page by st.id using a lastID cursor (e.g., add "WHERE st.id > ?
ORDER BY st.id LIMIT ?" to findOrphanedSoftwareTitlesStmt) and pass lastID into
sqlx.SelectContext, update the loop to set lastID to the highest id from ids
after each successful batch, continue until no ids returned, and ensure deletes
are executed with sqlx.In as before; update references to
findOrphanedSoftwareTitlesStmt, the loop using
cleanupMaxIterations/cleanupBatchSize, ids, n, and the ds.reader/ds.writer calls
accordingly.
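The keyset-pagination loop described above can be simulated over an in-memory, sorted id slice (illustrative only — the real code would run `WHERE st.id > ? ORDER BY st.id LIMIT ?` against MySQL):

```go
package main

import "fmt"

// fetchPage mimics "WHERE st.id > ? ORDER BY st.id LIMIT ?": it returns
// up to limit ids strictly greater than lastID, so each iteration
// resumes from a cursor instead of re-scanning with a bare LIMIT.
func fetchPage(sortedIDs []uint, lastID uint, limit int) []uint {
	var page []uint
	for _, id := range sortedIDs { // assumed sorted, like an index scan
		if id > lastID {
			page = append(page, id)
			if len(page) == limit {
				break
			}
		}
	}
	return page
}

func main() {
	orphaned := []uint{3, 7, 9, 12, 20}
	var lastID uint
	for {
		page := fetchPage(orphaned, lastID, 2)
		if len(page) == 0 {
			break
		}
		fmt.Println(page)          // each batch would be deleted here
		lastID = page[len(page)-1] // advance the cursor past this batch
	}
}
```

Because the cursor advances past deleted rows, every batch starts where the previous one ended rather than re-reading the front of the table.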
ℹ️ Review info — Configuration: `.coderabbit.yaml` · Review profile: CHILL · Plan: Pro · Run ID: d27756cf-33df-4d18-82d9-86d1cbb4eb25
📒 Files selected for processing (5)

- server/datastore/mysql/migrations/tables/20260315120000_DropKernelHostCountsForeignKey.go
- server/datastore/mysql/operating_system_vulnerabilities.go
- server/datastore/mysql/operating_system_vulnerabilities_test.go
- server/datastore/mysql/schema.sql
- server/datastore/mysql/software.go
```go
if batchCount == batchSize {
	values := strings.TrimSuffix(strings.Repeat(valuesPart, batchCount), ",")
	if _, err := ds.writer(ctx).ExecContext(ctx, fmt.Sprintf(insertStmt, values), args...); err != nil {
		return ctxerr.Wrap(ctx, err, "insert kernel host counts batch into swap table")
```
If this fails mid-batch, what happens? We aren't in a transaction. Do we care?
This query is inserting into a new kernel_host_counts_swap table that no one else is using. There is no contention. Any way it could fail?
Ways it could fail:
- Parent context is cancelled.
- MySQL server restart or crash.
- Connection timeout / network failure.

But I don't think we actually care if the swap table is left partial.
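The batched multi-row INSERT construction quoted above can be sketched as a standalone helper (statement text and column names are illustrative, not the PR's exact schema; a partially filled swap table is harmless because the next run recreates it before the rename):

```go
package main

import (
	"fmt"
	"strings"
)

// buildBatchInsert assembles the multi-row INSERT for one flushed batch,
// mirroring the strings.Repeat/TrimSuffix pattern in the quoted diff:
// repeat one "(?,?,?)," placeholder group batchCount times, trim the
// trailing comma, and splice the result into the statement template.
func buildBatchInsert(insertStmt, valuesPart string, batchCount int) string {
	values := strings.TrimSuffix(strings.Repeat(valuesPart, batchCount), ",")
	return fmt.Sprintf(insertStmt, values)
}

func main() {
	fmt.Println(buildBatchInsert(
		"INSERT INTO kernel_host_counts_swap (os_version_id, kernel, hosts_count) VALUES %s",
		"(?,?,?),",
		3,
	))
}
```

One round-trip per batch (rather than per row) is what makes the larger batch sizes suggested in the nitpick pay off.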
**Related issue:** Resolves #41664

Switching `kernel_host_counts` to the established swap pattern. Reduce load on the DB writer by moving the large read to the DB reader.

Do `CleanupSoftwareTitles` in batches. With a single large select/delete, it took > 16 minutes. In batches, it took ~1.5 minutes in loadtest with 100K hosts.

**Checklist for submitter**

If some of the following don't apply, delete the relevant line.

- [x] Changes file added for user-visible changes in `changes/`, `orbit/changes/` or `ee/fleetd-chrome/changes`.
- [x] QA'd all new/changed functionality manually

**Testing**

For unreleased bug fixes in a release candidate, one of:

- [x] Alerted the release DRI if additional load testing is needed

**Summary by CodeRabbit**

- **Performance Improvements**
  - Added indexing and a batched swap/insert flow to speed up vulnerability-related queries and lower maintenance contention.
  - Batched cleanup of orphaned records to reduce long-running delete operations.
- **Reliability**
  - Migration removes a legacy constraint to simplify data maintenance and avoid migration failures.
  - Scheduled vulnerability refresh now runs more atomically to reduce disruption.
- **Tests**
  - Updated assertion logic to improve test clarity for host-count verification.