Vulnerability cron causes DB overload due to contention with per-host software writes #41664

@getvictor

Description

Fleet versions

  • Discovered: 4.83 (unreleased)
  • Reproduced: 4.83 (unreleased)

Actual behavior

When the vulnerability cron runs on a Fleet server with a large number of agents, the MySQL writer DB becomes overloaded. The top queries by load are all host_software table operations:

  • DELETE FROM host_software WHERE host_id = ? AND software_id IN (...) (~17.6 combined load)
  • INSERT IGNORE INTO host_software (host_id, software_id, last_opened_at) VALUES (...) (~2.47 load)
  • UPDATE host_software hs JOIN (...) ... (~3.6 combined load)

These per-host software inventory writes contend with the vulnerability cron's heavy table scans:

  • SyncHostsSoftware runs 3 full scans of host_software per batch (global, team, no-team counts)
  • SyncHostsSoftwareTitles runs 3 unbounded full table scans (no batching by ID range)
  • cleanupUnusedSoftware runs NOT EXISTS (SELECT 1 FROM host_software ...) on the writer up to 100 times

The concurrent long-running reads from the cron cause row-level lock waits, increased undo log pressure, and buffer pool contention with the per-host write transactions.

To fix

Three changes to reduce contention:

  1. Use the reader replica for the first SELECT in cleanupUnusedSoftware: The first iteration of the cleanup loop queries the reader to find orphaned software IDs, which reduces writer load. Subsequent iterations use the writer so that we see our own deletes and don't re-select the same rows because of replica lag.

  2. Add a 100ms sleep between cron batches: Insert time.Sleep(100 * time.Millisecond) between batch iterations in SyncHostsSoftware, SyncHostsSoftwareTitles, and cleanupUnusedSoftware. This gives per-host transactions a window to acquire locks and commit.

  3. Batch SyncHostsSoftwareTitles by title ID ranges: Add WHERE st.id > ? AND st.id <= ? to all 3 count queries and process in countHostSoftwareBatchSize (100K) chunks, matching the approach already used in SyncHostsSoftware. Previously, each of the 3 queries scanned the entire host_software table in one go.

Steps to reproduce

These steps:

  • Have been confirmed to consistently lead to reproduction in multiple Fleet instances.
  • Describe the workflow that led to the error, but have not yet been reproduced in multiple Fleet instances.
  1. Deploy a Fleet server with a large number of agents (50K+), each reporting 500+ software items.
  2. Wait for the vulnerability cron to trigger (runs every hour by default).
  3. Observe MySQL performance metrics: the host_software DELETE/INSERT/UPDATE queries spike in load while the cron is running.
  4. Check MySQL SHOW PROCESSLIST or performance schema for lock waits on host_software.

More info

The issue only manifests when the vulnerability cron runs concurrently with normal agent check-ins. Outside the cron window, the per-host software writes operate at acceptable load levels.

Metadata

Labels

  • #g-security-compliance: Security & Compliance product group
  • bug: Something isn't working as documented
  • ~unreleased bug: This bug was found in an unreleased version of Fleet.

Projects

Status

Done
