Use swap table pattern and batch delete to improve DB access patterns for vuln cron#41729
Conversation
@coderabbitai full review

✅ Actions performed: Full review triggered.
**Walkthrough**

Replaces in-place updates to `kernel_host_counts` with a batched swap-table flow: create a swap table, compute counts via a read pass, batch-insert counts into the swap table, then atomically rename the swap table into the live `kernel_host_counts` and refresh OS version vulnerability aggregates. Adds a migration to drop the foreign key on `kernel_host_counts` and adjusts tests to count rows for host counts. Also changes orphaned software cleanup and software title cleanup to use batched reader-first queries and batched deletes.
🚥 Pre-merge checks: ✅ 2 passed, ❌ 3 failed (1 warning, 2 inconclusive).
🧹 Nitpick comments (1)
`server/datastore/mysql/operating_system_vulnerabilities.go` (1)

Line 519: **Consider increasing batch size for better insert throughput.** A batch size of 100 rows is relatively small for bulk inserts. Larger batches (e.g., 1000–5000) typically reduce round-trip overhead and improve throughput, especially when inserting many rows. However, the current size is functional and may be intentionally conservative.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/datastore/mysql/operating_system_vulnerabilities.go` at line 519, The constant batchSize = 100 in operating_system_vulnerabilities.go is small for bulk inserts; increase it to a larger value (for example 1000 or 5000) to improve insert throughput by reducing round-trips, while ensuring memory and transaction size remain acceptable; update the const batchSize declaration and re-run any insert tests related to the functions that use batchSize (search for references to batchSize in this file) and adjust if you encounter memory or DB transaction limits.
ℹ️ Review info — Configuration: `.coderabbit.yaml` · Review profile: CHILL · Plan: Pro · Run ID: dc3016a9-af4b-48d3-889d-77c3a6afc09d
📒 Files selected for processing (2)

- changes/41664-vulnerability-cron-db-contention
- server/datastore/mysql/operating_system_vulnerabilities.go
Pull request overview
This PR updates the MySQL datastore’s kernel-to-OS mapping refresh (InsertKernelSoftwareMapping) to use a swap-table rebuild approach, aiming to reduce contention/overload during the vulnerability cron when large host_software scans coincide with per-host software writes.
Changes:
- Rebuild `kernel_host_counts` via a newly populated swap table and atomically rename it into place.
- Read the aggregate source data from the reader and batch-insert results into the swap table on the writer.
- Update the release note entry for the vulnerability-cron contention work.
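The swap-table rebuild is a build-then-publish pattern: fill a copy nobody reads, then make it live in one atomic step. A minimal in-memory Go analogue (purely illustrative, not Fleet code — the real flow creates a swap table, batch-inserts into it, and uses `RENAME TABLE` to publish it):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// hostCounts maps a kernel version string to the number of hosts running it.
type hostCounts map[string]int

// live is the "published" table; readers only ever see complete snapshots.
var live atomic.Pointer[hostCounts]

// rebuild computes counts into a fresh swap map and publishes it in one
// atomic step — the analogue of CREATE swap table, batch INSERT, then
// RENAME TABLE over the live kernel_host_counts.
func rebuild(kernels []string) {
	swap := hostCounts{} // nobody reads the swap copy while it is filled
	for _, k := range kernels {
		swap[k]++
	}
	live.Store(&swap) // the "RENAME TABLE": readers switch in one step
}

func main() {
	rebuild([]string{"6.8.0-41", "6.8.0-41", "5.15.0-113"})
	fmt.Println((*live.Load())["6.8.0-41"]) // 2
}
```

Because readers only ever observe either the old or the new snapshot, the cron never exposes a half-written `kernel_host_counts` to queries.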
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| server/datastore/mysql/operating_system_vulnerabilities.go | Reworks `InsertKernelSoftwareMapping` to populate a swap table and atomically swap it in. |
| changes/41664-vulnerability-cron-db-contention | Updates the changelog entry describing the contention reduction work. |
**Codecov Report**

❌ Patch coverage — additional details and impacted files:

@@ Coverage Diff @@
## main #41729 +/- ##
==========================================
- Coverage 66.41% 66.41% -0.01%
==========================================
Files 2504 2505 +1
Lines 200618 200687 +69
Branches 8930 8930
==========================================
+ Hits 133243 133286 +43
- Misses 55330 55348 +18
- Partials 12045 12053 +8
Flags with carried forward coverage won't be shown.
@coderabbitai review

✅ Actions performed: Review triggered.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
server/datastore/mysql/operating_system_vulnerabilities.go (1)
511-576: ⚠️ Potential issue | 🟠 Major

**Don't swap the live `kernel_host_counts` table from an unguarded replica snapshot.**

`refreshOSVersionVulnerabilities()` runs immediately after this rename and treats the swapped table as authoritative. If the reader is lagging, this can publish a partial/empty kernel mapping and then delete still-valid Linux vuln rows. Please gate the reader path on replica freshness or fall back to the writer before the swap.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/datastore/mysql/operating_system_vulnerabilities.go` around lines 511 - 576, The code currently reads kernel host counts via ds.reader(ctx).QueryContext and then immediately renames swapTable into kernel_host_counts and calls ds.refreshOSVersionVulnerabilities(), risking publishing partial data if the replica reader is stale; modify the logic in the function that builds and swaps kernel_host_counts to first verify replica freshness (or fall back to ds.writer) before using ds.reader: add a freshness check using the same connection used by ds.reader (e.g., query replica replication lag/seconds_behind_master or a heartbeat table) and if the replica is stale, switch to ds.writer(ctx).QueryContext for the selectStmt so the data used to populate swapTable is authoritative; ensure the swap/rename and subsequent call to ds.refreshOSVersionVulnerabilities() remain unchanged and still run inside ds.withRetryTxx for atomicity.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@server/datastore/mysql/software.go`:
- Around line 2850-2856: The writer-side batch delete currently runs a DELETE
built from sqlx.In(`DELETE FROM software_titles WHERE id IN (?)`, ids) and can
remove rows that became non-orphaned after the read; modify the writer-side
delete (the stmt used with ds.writer(ctx).ExecContext) to include the same
orphan-check predicates used when building ids (e.g. add the NOT EXISTS or LEFT
JOIN + IS NULL conditions that verify there are no remaining referencing rows
such as software_instance_titles and any other orphan criteria) so the final
DELETE is conditional (e.g. DELETE ... WHERE id IN (?) AND <orphan-predicates>)
and thus safe against concurrent attaches while keeping the candidate-list
optimization.
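The guarded DELETE the comment describes can be sketched as follows; the referencing table and column names are illustrative, not Fleet's exact schema, and the `?` placeholders stand in for the `sqlx.In` expansion:

```go
package main

import (
	"fmt"
	"strings"
)

// guardedDeleteStmt builds the conditional DELETE: besides the candidate
// id list from the reader pass, the WHERE clause repeats the orphan check
// so a title that gained a reference after the read pass survives a
// concurrent attach.
func guardedDeleteStmt(nIDs int) string {
	in := strings.TrimSuffix(strings.Repeat("?,", nIDs), ",")
	return fmt.Sprintf(
		"DELETE st FROM software_titles st WHERE st.id IN (%s) "+
			"AND NOT EXISTS (SELECT 1 FROM software s WHERE s.title_id = st.id)",
		in)
}

func main() {
	fmt.Println(guardedDeleteStmt(3))
}
```

The candidate list keeps the delete cheap; the repeated `NOT EXISTS` predicate makes it correct under concurrency.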
- Around line 2830-2861: The orphaned-title cleanup currently re-runs the same
LIMIT ? query each iteration causing repeated full-table scans; change the query
and loop to page by st.id using a lastID cursor (e.g., add "WHERE st.id > ?
ORDER BY st.id LIMIT ?" to findOrphanedSoftwareTitlesStmt) and pass lastID into
sqlx.SelectContext, update the loop to set lastID to the highest id from ids
after each successful batch, continue until no ids returned, and ensure deletes
are executed with sqlx.In as before; update references to
findOrphanedSoftwareTitlesStmt, the loop using
cleanupMaxIterations/cleanupBatchSize, ids, n, and the ds.reader/ds.writer calls
accordingly.
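The keyset-pagination loop described above can be simulated over an in-memory, sorted id slice (illustrative only — the real code would run `WHERE st.id > ? ORDER BY st.id LIMIT ?` against MySQL):

```go
package main

import "fmt"

// fetchPage mimics "WHERE st.id > ? ORDER BY st.id LIMIT ?": it returns
// up to limit ids strictly greater than lastID, so each iteration
// resumes from a cursor instead of re-scanning with a bare LIMIT.
func fetchPage(sortedIDs []uint, lastID uint, limit int) []uint {
	var page []uint
	for _, id := range sortedIDs { // assumed sorted, like an index scan
		if id > lastID {
			page = append(page, id)
			if len(page) == limit {
				break
			}
		}
	}
	return page
}

func main() {
	orphaned := []uint{3, 7, 9, 12, 20}
	var lastID uint
	for {
		page := fetchPage(orphaned, lastID, 2)
		if len(page) == 0 {
			break
		}
		fmt.Println(page)          // each batch would be deleted here
		lastID = page[len(page)-1] // advance the cursor past this batch
	}
}
```

Because the cursor advances past deleted rows, every batch starts where the previous one ended rather than re-reading the front of the table.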
ℹ️ Review info — Configuration: `.coderabbit.yaml` · Review profile: CHILL · Plan: Pro · Run ID: d27756cf-33df-4d18-82d9-86d1cbb4eb25
📒 Files selected for processing (5)

- server/datastore/mysql/migrations/tables/20260315120000_DropKernelHostCountsForeignKey.go
- server/datastore/mysql/operating_system_vulnerabilities.go
- server/datastore/mysql/operating_system_vulnerabilities_test.go
- server/datastore/mysql/schema.sql
- server/datastore/mysql/software.go
```go
if batchCount == batchSize {
	values := strings.TrimSuffix(strings.Repeat(valuesPart, batchCount), ",")
	if _, err := ds.writer(ctx).ExecContext(ctx, fmt.Sprintf(insertStmt, values), args...); err != nil {
		return ctxerr.Wrap(ctx, err, "insert kernel host counts batch into swap table")
```
If this fails mid-batch, what happens? We aren't in a transaction. Do we care?
This query is inserting into a new kernel_host_counts_swap table that no one else is using. There is no contention. Any way it could fail?
Ways it could fail:
- Parent context is cancelled.
- MySQL server restart or crash.
- Connection timeout / network failure.

But I don't think we actually care if the swap table is left partial.
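The batched multi-row INSERT construction quoted above can be sketched as a standalone helper (statement text and column names are illustrative, not the PR's exact schema; a partially filled swap table is harmless because the next run recreates it before the rename):

```go
package main

import (
	"fmt"
	"strings"
)

// buildBatchInsert assembles the multi-row INSERT for one flushed batch,
// mirroring the strings.Repeat/TrimSuffix pattern in the quoted diff:
// repeat one "(?,?,?)," placeholder group batchCount times, trim the
// trailing comma, and splice the result into the statement template.
func buildBatchInsert(insertStmt, valuesPart string, batchCount int) string {
	values := strings.TrimSuffix(strings.Repeat(valuesPart, batchCount), ",")
	return fmt.Sprintf(insertStmt, values)
}

func main() {
	fmt.Println(buildBatchInsert(
		"INSERT INTO kernel_host_counts_swap (os_version_id, kernel, hosts_count) VALUES %s",
		"(?,?,?),",
		3,
	))
}
```

One round-trip per batch (rather than per row) is what makes the larger batch sizes suggested in the nitpick pay off.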
**Related issue:** Resolves #41664

Switching `kernel_host_counts` to the established swap pattern. Reduce load on the DB writer by moving the large read to the DB reader.

Do `CleanupSoftwareTitles` in batches. With a single large select/delete, it took > 16 minutes. In batches, it took ~1.5 minutes in loadtest with 100K hosts.

**Checklist for submitter**

If some of the following don't apply, delete the relevant line.

- [x] Changes file added for user-visible changes in `changes/`, `orbit/changes/` or `ee/fleetd-chrome/changes`.
- [x] QA'd all new/changed functionality manually

**Testing**

For unreleased bug fixes in a release candidate, one of:

- [x] Alerted the release DRI if additional load testing is needed

**Summary by CodeRabbit**

- **Performance Improvements**
  - Added indexing and a batched swap/insert flow to speed up vulnerability-related queries and lower maintenance contention.
  - Batched cleanup of orphaned records to reduce long-running delete operations.
- **Reliability**
  - Migration removes a legacy constraint to simplify data maintenance and avoid migration failures.
  - Scheduled vulnerability refresh now runs more atomically to reduce disruption.
- **Tests**
  - Updated assertion logic to improve test clarity for host-count verification.