Skip to content

Use swap table pattern and batch delete to improve DB access patterns for vuln cron#41729

Merged
getvictor merged 6 commits intomainfrom
victor-cron
Mar 17, 2026
Merged

Use swap table pattern and batch delete to improve DB access patterns for vuln cron#41729
getvictor merged 6 commits intomainfrom
victor-cron

Conversation

@getvictor
Copy link
Copy Markdown
Member

@getvictor getvictor commented Mar 15, 2026

Related issue: Resolves #41664

Switching kernel_host_counts to the established swap pattern. Reduce load on the DB writer by moving the large read to the DB reader.

Do CleanupSoftwareTitles in batches. With a single large select/delete, it took > 16 minutes. In batches, it took ~1.5 minutes in loadtest with 100K hosts.

Checklist for submitter

If some of the following don't apply, delete the relevant line.

  • Changes file added for user-visible changes in changes/, orbit/changes/ or ee/fleetd-chrome/changes.

Testing

  • QA'd all new/changed functionality manually

For unreleased bug fixes in a release candidate, one of:

  • Alerted the release DRI if additional load testing is needed

Summary by CodeRabbit

  • Performance Improvements

    • Added indexing and a batched swap/insert flow to speed up vulnerability-related queries and lower maintenance contention.
    • Batched cleanup of orphaned records to reduce long-running delete operations.
  • Reliability

    • Migration removes a legacy constraint to simplify data maintenance and avoid migration failures.
    • Scheduled vulnerability refresh now runs more atomically to reduce disruption.
  • Tests

    • Updated assertion logic to improve test clarity for host-count verification.

@getvictor
Copy link
Copy Markdown
Member Author

@coderabbitai full review

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Mar 15, 2026

✅ Actions performed

Full review triggered.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Mar 15, 2026

Walkthrough

Replaces in-place updates to kernel_host_counts with a batched swap-table flow: create a swap table, compute counts via a read pass, batch-insert counts into the swap table, then atomically rename the swap into live kernel_host_counts and refresh OS version vulnerability aggregates. Adds a migration to drop the foreign key on kernel_host_counts and adjusts tests to count rows for host counts. Also changes orphaned software cleanup and software title cleanup to use batched reader-first queries and batched deletes.

Possibly related PRs

  • PR 35317: Implements the same atomic table-swap pattern for replacing vulnerability host counts, closely matching the swap/rename approach used here.
  • PR 31722: Modifies InsertKernelSoftwareMapping and related kernel-to-software mapping logic that this change rewrites into a swap-table, batched flow.
  • PR 41375: Converts orphaned software cleanup to a batched find-and-delete process in server/datastore/mysql/software.go, the same file and batching pattern updated here.
🚥 Pre-merge checks | ✅ 2 | ❌ 3

❌ Failed checks (1 warning, 2 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Linked Issues check ❓ Inconclusive The PR partially addresses the linked issue #41664. It implements the swap table pattern for kernel_host_counts and batches software cleanup, but does not fully implement all three stated objectives: missing batching for SyncHostsSoftwareTitles by ID ranges and the 100ms sleep between batches. Verify that batching of SyncHostsSoftwareTitles by title ID ranges and 100ms sleep delays are being addressed in this PR or a follow-up PR as part of the #41664 fix.
Description check ❓ Inconclusive The PR description provides the related issue number and covers the main changes (swap pattern for kernel_host_counts, batching for CleanupSoftwareTitles), but omits several key checklist items including validation of input data, backwards compatibility, automated tests, database migration checks, and timestamp implications. Add or clarify the status of: input validation/SQL injection prevention, automated tests added/updated, database migration review (timestamp handling, collation), and confirmation about load test impact per the release checklist.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main changes: using swap table pattern and batch delete to improve DB access patterns for vulnerability cron operations, which directly matches the code modifications in kernel_host_counts and software cleanup logic.
Out of Scope Changes check ✅ Passed All code changes are in scope: migration drops foreign key, swap table logic is implemented for kernel_host_counts, software cleanup is batched, and tests are updated to match new behavior. No unrelated modifications detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch victor-cron
📝 Coding Plan
  • Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
server/datastore/mysql/operating_system_vulnerabilities.go (1)

519-519: Consider increasing batch size for better insert throughput.

A batch size of 100 rows is relatively small for bulk inserts. Larger batches (e.g., 1000–5000) typically reduce round-trip overhead and improve throughput, especially when inserting many rows. However, the current size is functional and may be intentionally conservative.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/datastore/mysql/operating_system_vulnerabilities.go` at line 519, The
constant batchSize = 100 in operating_system_vulnerabilities.go is small for
bulk inserts; increase it to a larger value (for example 1000 or 5000) to
improve insert throughput by reducing round-trips, while ensuring memory and
transaction size remain acceptable; update the const batchSize declaration and
re-run any insert tests related to the functions that use batchSize (search for
references to batchSize in this file) and adjust if you encounter memory or DB
transaction limits.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@server/datastore/mysql/operating_system_vulnerabilities.go`:
- Line 519: The constant batchSize = 100 in operating_system_vulnerabilities.go
is small for bulk inserts; increase it to a larger value (for example 1000 or
5000) to improve insert throughput by reducing round-trips, while ensuring
memory and transaction size remain acceptable; update the const batchSize
declaration and re-run any insert tests related to the functions that use
batchSize (search for references to batchSize in this file) and adjust if you
encounter memory or DB transaction limits.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: dc3016a9-af4b-48d3-889d-77c3a6afc09d

📥 Commits

Reviewing files that changed from the base of the PR and between 4128819 and 422f555.

📒 Files selected for processing (2)
  • changes/41664-vulnerability-cron-db-contention
  • server/datastore/mysql/operating_system_vulnerabilities.go

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the MySQL datastore’s kernel-to-OS mapping refresh (InsertKernelSoftwareMapping) to use a swap-table rebuild approach, aiming to reduce contention/overload during the vulnerability cron when large host_software scans coincide with per-host software writes.

Changes:

  • Rebuild kernel_host_counts via a newly populated swap table and atomically rename it into place.
  • Read the aggregate source data from the reader and batch-insert results into the swap table on the writer.
  • Update the release note entry for the vulnerability-cron contention work.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
server/datastore/mysql/operating_system_vulnerabilities.go Reworks InsertKernelSoftwareMapping to populate a swap table and atomically swap it in.
changes/41664-vulnerability-cron-db-contention Updates the changelog entry describing the contention reduction work.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Comment thread server/datastore/mysql/operating_system_vulnerabilities.go
Comment thread server/datastore/mysql/operating_system_vulnerabilities.go Outdated
@codecov
Copy link
Copy Markdown

codecov Bot commented Mar 15, 2026

Codecov Report

❌ Patch coverage is 70.08547% with 35 lines in your changes missing coverage. Please review.
✅ Project coverage is 66.41%. Comparing base (616578a) to head (1293b7a).
⚠️ Report is 14 commits behind head on main.

Files with missing lines Patch % Lines
...atastore/mysql/operating_system_vulnerabilities.go 66.21% 16 Missing and 9 partials ⚠️
server/datastore/mysql/software.go 82.85% 3 Missing and 3 partials ⚠️
...s/20260316120000_DropKernelHostCountsForeignKey.go 50.00% 3 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #41729      +/-   ##
==========================================
- Coverage   66.41%   66.41%   -0.01%     
==========================================
  Files        2504     2505       +1     
  Lines      200618   200687      +69     
  Branches     8930     8930              
==========================================
+ Hits       133243   133286      +43     
- Misses      55330    55348      +18     
- Partials    12045    12053       +8     
Flag Coverage Δ
backend 68.20% <70.08%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@getvictor getvictor changed the title Use swap table pattern for InsertKernelSoftwareMapping Use swap table pattern and batch delete to improve DB access patterns for vuln cron Mar 16, 2026
@getvictor
Copy link
Copy Markdown
Member Author

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Mar 16, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
server/datastore/mysql/operating_system_vulnerabilities.go (1)

511-576: ⚠️ Potential issue | 🟠 Major

Don't swap the live kernel_host_counts table from an unguarded replica snapshot.

refreshOSVersionVulnerabilities() runs immediately after this rename and treats the swapped table as authoritative. If the reader is lagging, this can publish a partial/empty kernel mapping and then delete still-valid Linux vuln rows. Please gate the reader path on replica freshness or fall back to the writer before the swap.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/datastore/mysql/operating_system_vulnerabilities.go` around lines 511
- 576, The code currently reads kernel host counts via
ds.reader(ctx).QueryContext and then immediately renames swapTable into
kernel_host_counts and calls ds.refreshOSVersionVulnerabilities(), risking
publishing partial data if the replica reader is stale; modify the logic in the
function that builds and swaps kernel_host_counts to first verify replica
freshness (or fall back to ds.writer) before using ds.reader: add a freshness
check using the same connection used by ds.reader (e.g., query replica
replication lag/seconds_behind_master or a heartbeat table) and if the replica
is stale, switch to ds.writer(ctx).QueryContext for the selectStmt so the data
used to populate swapTable is authoritative; ensure the swap/rename and
subsequent call to ds.refreshOSVersionVulnerabilities() remain unchanged and
still run inside ds.withRetryTxx for atomicity.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@server/datastore/mysql/software.go`:
- Around line 2850-2856: The writer-side batch delete currently runs a DELETE
built from sqlx.In(`DELETE FROM software_titles WHERE id IN (?)`, ids) and can
remove rows that became non-orphaned after the read; modify the writer-side
delete (the stmt used with ds.writer(ctx).ExecContext) to include the same
orphan-check predicates used when building ids (e.g. add the NOT EXISTS or LEFT
JOIN + IS NULL conditions that verify there are no remaining referencing rows
such as software_instance_titles and any other orphan criteria) so the final
DELETE is conditional (e.g. DELETE ... WHERE id IN (?) AND <orphan-predicates>)
and thus safe against concurrent attaches while keeping the candidate-list
optimization.
- Around line 2830-2861: The orphaned-title cleanup currently re-runs the same
LIMIT ? query each iteration causing repeated full-table scans; change the query
and loop to page by st.id using a lastID cursor (e.g., add "WHERE st.id > ?
ORDER BY st.id LIMIT ?" to findOrphanedSoftwareTitlesStmt) and pass lastID into
sqlx.SelectContext, update the loop to set lastID to the highest id from ids
after each successful batch, continue until no ids returned, and ensure deletes
are executed with sqlx.In as before; update references to
findOrphanedSoftwareTitlesStmt, the loop using
cleanupMaxIterations/cleanupBatchSize, ids, n, and the ds.reader/ds.writer calls
accordingly.

---

Outside diff comments:
In `@server/datastore/mysql/operating_system_vulnerabilities.go`:
- Around line 511-576: The code currently reads kernel host counts via
ds.reader(ctx).QueryContext and then immediately renames swapTable into
kernel_host_counts and calls ds.refreshOSVersionVulnerabilities(), risking
publishing partial data if the replica reader is stale; modify the logic in the
function that builds and swaps kernel_host_counts to first verify replica
freshness (or fall back to ds.writer) before using ds.reader: add a freshness
check using the same connection used by ds.reader (e.g., query replica
replication lag/seconds_behind_master or a heartbeat table) and if the replica
is stale, switch to ds.writer(ctx).QueryContext for the selectStmt so the data
used to populate swapTable is authoritative; ensure the swap/rename and
subsequent call to ds.refreshOSVersionVulnerabilities() remain unchanged and
still run inside ds.withRetryTxx for atomicity.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d27756cf-33df-4d18-82d9-86d1cbb4eb25

📥 Commits

Reviewing files that changed from the base of the PR and between 422f555 and 82b910d.

📒 Files selected for processing (5)
  • server/datastore/mysql/migrations/tables/20260315120000_DropKernelHostCountsForeignKey.go
  • server/datastore/mysql/operating_system_vulnerabilities.go
  • server/datastore/mysql/operating_system_vulnerabilities_test.go
  • server/datastore/mysql/schema.sql
  • server/datastore/mysql/software.go

Comment thread server/datastore/mysql/software.go Outdated
Comment thread server/datastore/mysql/software.go Outdated
@getvictor getvictor marked this pull request as ready for review March 17, 2026 11:53
@getvictor getvictor requested a review from a team as a code owner March 17, 2026 11:53
Comment thread changes/41664-vulnerability-cron-db-contention
if batchCount == batchSize {
values := strings.TrimSuffix(strings.Repeat(valuesPart, batchCount), ",")
if _, err := ds.writer(ctx).ExecContext(ctx, fmt.Sprintf(insertStmt, values), args...); err != nil {
return ctxerr.Wrap(ctx, err, "insert kernel host counts batch into swap table")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this fails mid batch what happens? We aren't in a transaction. Do we care?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This query is inserting into a new kernel_host_counts_swap table that no one else is using. There is no contention. Any way it could fail?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parent context is cancelled.
Mysql server restart or crash.
Connection timeout / network failure.

But I don't think we actually care if the swap table is left partial.

@getvictor getvictor requested a review from ksykulev March 17, 2026 14:55
@getvictor getvictor merged commit 02e8d47 into main Mar 17, 2026
49 checks passed
@getvictor getvictor deleted the victor-cron branch March 17, 2026 15:34
georgekarrv pushed a commit that referenced this pull request Mar 17, 2026
… for vuln cron (#41729)

<!-- Add the related story/sub-task/bug number, like Resolves #123, or
remove if NA -->
**Related issue:** Resolves #41664

Switching `kernel_host_counts` to the established swap pattern. Reduce
load on the DB writer by moving the large read to the DB reader.

Do `CleanupSoftwareTitles` in batches. With a single large
select/delete, it took > 16 minutes. In batches, it took ~1.5 minutes in
loadtest with 100K hosts.

If some of the following don't apply, delete the relevant line.

- [x] Changes file added for user-visible changes in `changes/`,
`orbit/changes/` or `ee/fleetd-chrome/changes`.

- [x] QA'd all new/changed functionality manually

For unreleased bug fixes in a release candidate, one of:

- [x] Alerted the release DRI if additional load testing is needed

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

* **Performance Improvements**
* Added indexing and a batched swap/insert flow to speed up
vulnerability-related queries and lower maintenance contention.
* Batched cleanup of orphaned records to reduce long-running delete
operations.

* **Reliability**
* Migration removes a legacy constraint to simplify data maintenance and
avoid migration failures.
* Scheduled vulnerability refresh now runs more atomically to reduce
disruption.

* **Tests**
* Updated assertion logic to improve test clarity for host-count
verification.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
JordanMontgomery pushed a commit that referenced this pull request Mar 24, 2026
…e DB access patterns for vuln cron (#41729)

<!-- Add the related story/sub-task/bug number, like Resolves #123, or
remove if NA -->
**Related issue:** Resolves #41664

Switching `kernel_host_counts` to the established swap pattern. Reduce
load on the DB writer by moving the large read to the DB reader.

Do `CleanupSoftwareTitles` in batches. With a single large
select/delete, it took > 16 minutes. In batches, it took ~1.5 minutes in
loadtest with 100K hosts.

If some of the following don't apply, delete the relevant line.

- [x] Changes file added for user-visible changes in `changes/`,
`orbit/changes/` or `ee/fleetd-chrome/changes`.

- [x] QA'd all new/changed functionality manually

For unreleased bug fixes in a release candidate, one of:

- [x] Alerted the release DRI if additional load testing is needed

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

* **Performance Improvements**
* Added indexing and a batched swap/insert flow to speed up
vulnerability-related queries and lower maintenance contention.
* Batched cleanup of orphaned records to reduce long-running delete
operations.

* **Reliability**
* Migration removes a legacy constraint to simplify data maintenance and
avoid migration failures.
* Scheduled vulnerability refresh now runs more atomically to reduce
disruption.

* **Tests**
* Updated assertion logic to improve test clarity for host-count
verification.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
JordanMontgomery added a commit that referenced this pull request Mar 24, 2026
#42300)

…e DB access patterns for vuln cron (#41729)

<!-- Add the related story/sub-task/bug number, like Resolves #123, or
remove if NA -->
**Related issue:** Resolves #41664

Switching `kernel_host_counts` to the established swap pattern. Reduce
load on the DB writer by moving the large read to the DB reader.

Do `CleanupSoftwareTitles` in batches. With a single large
select/delete, it took > 16 minutes. In batches, it took ~1.5 minutes in
loadtest with 100K hosts.

If some of the following don't apply, delete the relevant line.

- [x] Changes file added for user-visible changes in `changes/`,
`orbit/changes/` or `ee/fleetd-chrome/changes`.

- [x] QA'd all new/changed functionality manually

For unreleased bug fixes in a release candidate, one of:

- [x] Alerted the release DRI if additional load testing is needed

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

* **Performance Improvements**
* Added indexing and a batched swap/insert flow to speed up
vulnerability-related queries and lower maintenance contention.
* Batched cleanup of orphaned records to reduce long-running delete
operations.

* **Reliability**
* Migration removes a legacy constraint to simplify data maintenance and
avoid migration failures.
* Scheduled vulnerability refresh now runs more atomically to reduce
disruption.

* **Tests**
* Updated assertion logic to improve test clarity for host-count
verification.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Co-authored-by: Victor Lyuboslavsky <2685025+getvictor@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Vulnerability cron causes DB overload due to contention with per-host software writes

3 participants