Fixed DB lock contention during vulnerability cron's software cleanup that caused failures under load #41375

Merged
getvictor merged 2 commits into main from victor/unused-software
Mar 10, 2026

@getvictor getvictor commented Mar 10, 2026

Related issue: Resolves #41374

Checklist for submitter

If some of the following don't apply, delete the relevant line.

  • Changes file added for user-visible changes in changes/, orbit/changes/ or ee/fleetd-chrome/changes.

Testing

  • QA'd all new/changed functionality manually

For unreleased bug fixes in a release candidate, one of:

  • Alerted the release DRI if additional load testing is needed

Summary by CodeRabbit

  • Bug Fixes
    • Resolved database lock contention that occurred during software cleanup operations, which previously caused failures under heavy load. The cleanup process now uses an optimized batched approach for improved reliability and performance.

@getvictor
Member Author

@coderabbitai full review

Contributor

Copilot AI left a comment

Pull request overview

This PR addresses DB lock contention observed during the vulnerability cron’s SyncHostsSoftware cleanup by changing unused/orphaned software deletion from a single large DELETE to a batched cleanup loop.

Changes:

  • Introduce cleanupUnusedSoftware to find and delete orphaned software rows in batches.
  • Add a cleanupBatchSize tuning knob for the batched cleanup.
  • Add a changelog entry for the fix.
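
The batched cleanup described above can be sketched roughly as follows. This is a minimal illustration and not the code from the PR: the `orphanStore` interface, `cleanupInBatches`, and the in-memory `memStore` are hypothetical names introduced here so the loop can run without a database.

```go
package main

import (
	"errors"
	"fmt"
)

// orphanStore is a hypothetical abstraction over the two queries a batched
// cleanup needs; it is not part of the Fleet codebase.
type orphanStore interface {
	// FindOrphanIDs returns up to limit IDs of unreferenced software rows.
	FindOrphanIDs(limit int) ([]uint, error)
	// DeleteByIDs deletes the given software rows.
	DeleteByIDs(ids []uint) error
}

// cleanupInBatches deletes orphans in batches of batchSize until a select
// returns no rows. It returns the total number of rows deleted.
func cleanupInBatches(s orphanStore, batchSize int) (int, error) {
	if batchSize <= 0 {
		return 0, errors.New("batch size must be positive")
	}
	total := 0
	for {
		ids, err := s.FindOrphanIDs(batchSize)
		if err != nil {
			return total, fmt.Errorf("find unused software: %w", err)
		}
		if len(ids) == 0 {
			return total, nil
		}
		if err := s.DeleteByIDs(ids); err != nil {
			return total, fmt.Errorf("delete unused software batch: %w", err)
		}
		total += len(ids)
	}
}

// memStore is an in-memory stand-in used only to exercise the loop.
type memStore struct {
	orphans []uint
	batches int
}

func (m *memStore) FindOrphanIDs(limit int) ([]uint, error) {
	if limit > len(m.orphans) {
		limit = len(m.orphans)
	}
	return append([]uint(nil), m.orphans[:limit]...), nil
}

func (m *memStore) DeleteByIDs(ids []uint) error {
	m.orphans = m.orphans[len(ids):]
	m.batches++
	return nil
}

func main() {
	s := &memStore{orphans: []uint{1, 2, 3, 4, 5}}
	n, err := cleanupInBatches(s, 2)
	fmt.Println(n, s.batches, err)
}
```

Each iteration issues a short statement instead of one unbounded DELETE, so locks are presumably held for less time per statement; the right batch size depends on the workload.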

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| server/datastore/mysql/software.go | Replaces single unbounded orphan-software DELETE with a batched cleanup helper invoked by SyncHostsSoftware. |
| changes/41374-unused-software | Adds release note for the lock contention fix. |

Comment on lines +2788 to +2794

    stmt, args, err := sqlx.In(`DELETE FROM software WHERE id IN (?)`, ids)
    if err != nil {
        return ctxerr.Wrap(ctx, err, "build delete unused software query")
    }
    if _, err := ds.writer(ctx).ExecContext(ctx, stmt, args...); err != nil {
        return ctxerr.Wrap(ctx, err, "delete unused software batch")
    }

Copilot AI Mar 10, 2026

cleanupUnusedSoftware is now a two-phase SELECT-then-DELETE. That introduces a race window where a concurrent ingestion transaction can insert a host_software row referencing one of the selected software IDs after the SELECT but before the DELETE executes; the DELETE would then remove that software row and cascade-delete the newly inserted host_software row, losing data. To keep the safety property the previous single-statement DELETE had, re-check the orphan conditions at DELETE time (e.g., include the NOT EXISTS/LEFT JOIN predicates in the DELETE statement, or use a batched DELETE with LIMIT in a single SQL statement/derived table) so rows that became referenced are not deleted.

Member Author

Not a huge concern. The chance of a host suddenly reporting that exact orphaned software in the millisecond window between SELECT and DELETE is negligible. If it did happen, the next hourly ingestion would simply re-add it. It's self-healing.

@coderabbitai
Contributor

coderabbitai Bot commented Mar 10, 2026

✅ Actions performed

Full review triggered.

@coderabbitai
Contributor

coderabbitai Bot commented Mar 10, 2026

Walkthrough

This PR addresses database lock contention in the vulnerability cleanup cron by replacing an unbounded DELETE statement with a batched cleanup approach. The SyncHostsSoftware function's inline cleanup logic is moved to a new cleanupUnusedSoftware function that iteratively selects batches of orphaned software IDs (not referenced by any host and not in software_host_counts) and deletes them in small batches. A cleanupBatchSize configuration is introduced. A changelog entry documents the fix for DB lock contention during the vulnerability cron's software cleanup.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ⚠️ Warning | The PR description is incomplete and lacks critical testing details required by the template. | Add explicit coverage of: whether input validation and SQL injection prevention were checked, if automated tests were added/updated, whether the fix was confirmed to not adversely impact load test results (given the PR targets load-related issues), and any database migration considerations. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title directly and clearly summarizes the main change: fixing DB lock contention in the vulnerability cron's software cleanup that causes failures under load, which matches the core objective in the linked issue. |
| Linked Issues check | ✅ Passed | The code changes implement the core requirement from issue #41374: replacing the single unbounded DELETE with a batched approach via the new cleanupUnusedSoftware function that selectively deletes orphaned software in batches. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to issue #41374: the changelog entry documents the fix, and the code modifications implement the batched cleanup strategy as specified in the linked issue. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@server/datastore/mysql/software.go`:
- Around line 2757-2794: The current cleanupUnusedSoftware function can delete
software rows that are mid-ingestion because preInsertSoftwareInventory writes
to software before linkSoftwareToHost inserts host_software; change
cleanupUnusedSoftware (and its findUnusedSoftwareStmt) to only select candidates
that are older than a safe threshold (e.g., created_at or last_seen is at least
N minutes/hours old) so newly inserted-but-not-yet-linked rows are skipped;
update the query to include a time cutoff (or a processed boolean column) and
ensure the threshold constant (cleanupBatchAge or similar) is used when calling
sqlx.SelectContext, leaving other logic (DELETE IN (?) using ids and
cleanupBatchSize) unchanged and referencing functions/objects:
cleanupUnusedSoftware, preInsertSoftwareInventory, linkSoftwareToHost,
host_software, software_host_counts, and cleanupBatchSize.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d23c7058-a2e5-40a9-a656-f8cf4c1a1701

📥 Commits

Reviewing files that changed from the base of the PR and between 54a9160 and 44e1c13.

📒 Files selected for processing (2)
  • changes/41374-unused-software
  • server/datastore/mysql/software.go

Comment thread server/datastore/mysql/software.go
@getvictor getvictor marked this pull request as ready for review March 10, 2026 18:39
@getvictor getvictor requested a review from a team as a code owner March 10, 2026 18:39
@getvictor getvictor merged commit 989e503 into main Mar 10, 2026
15 checks passed
@getvictor getvictor deleted the victor/unused-software branch March 10, 2026 18:44
getvictor added a commit that referenced this pull request Mar 10, 2026
… that caused failures under load (#41375)

**Related issue:** Resolves #41374

If some of the following don't apply, delete the relevant line.

- [x] Changes file added for user-visible changes in `changes/`,
`orbit/changes/` or `ee/fleetd-chrome/changes`.

- [x] QA'd all new/changed functionality manually

For unreleased bug fixes in a release candidate, one of:

- [x] Alerted the release DRI if additional load testing is needed

* **Bug Fixes**
  * Resolved database lock contention that occurred during software
    cleanup operations, which previously caused failures under heavy load.
    The cleanup process now uses an optimized batched approach for improved
    reliability and performance.
getvictor added a commit that referenced this pull request Mar 10, 2026
… that caused failures under load (#41375) (#41380)

Cherry pick.

@codecov

codecov Bot commented Mar 10, 2026

Codecov Report

❌ Patch coverage is 63.63636% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 66.36%. Comparing base (f051c5a) to head (b44a64b).
⚠️ Report is 39 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| server/datastore/mysql/software.go | 63.63% | 4 Missing and 4 partials ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #41375      +/-   ##
==========================================
+ Coverage   66.34%   66.36%   +0.02%     
==========================================
  Files        2477     2477              
  Lines      198395   198542     +147     
  Branches     8854     8854              
==========================================
+ Hits       131619   131762     +143     
+ Misses      54875    54872       -3     
- Partials    11901    11908       +7     
| Flag | Coverage Δ |
| --- | --- |
| backend | 68.15% <63.63%> (+0.02%) ⬆️ |

Development

Successfully merging this pull request may close these issues.

SyncHostsSoftware cleanup DELETE causes DB lock contention with software ingestion

3 participants