Fleet versions
- Discovered: 4.83 (unreleased)
- Reproduced: 4.83 (unreleased)
Actual behavior
When the vulnerability cron runs on a Fleet server with a large number of agents, the MySQL writer DB becomes overloaded. The top queries by load are all host_software table operations:
DELETE FROM host_software WHERE host_id = ? AND software_id IN (...) (~17.6 combined load)
INSERT IGNORE INTO host_software (host_id, software_id, last_opened_at) VALUES (...) (~2.47 load)
UPDATE host_software hs JOIN (...) ... (~3.6 combined load)
These per-host software inventory writes contend with the vulnerability cron's heavy table scans:
SyncHostsSoftware runs 3 full scans of host_software per batch (global, team, no-team counts)
SyncHostsSoftwareTitles runs 3 unbounded full table scans (no batching by ID range)
cleanupUnusedSoftware runs NOT EXISTS (SELECT 1 FROM host_software ...) on the writer up to 100 times
The concurrent long-running reads from the cron cause row-level lock waits, increased undo log pressure, and buffer pool contention with the per-host write transactions.
To fix
Three changes to reduce contention:
-
Use reader replica for first SELECT in cleanupUnusedSoftware: The first iteration of the cleanup loop queries the reader to find orphaned software IDs, reducing writer load. Subsequent iterations use the writer so we see our own deletes and don't re-select the same rows due to replica lag.
-
Add 100ms sleep between cron batches: Insert time.Sleep(100ms) between batch iterations in SyncHostsSoftware, SyncHostsSoftwareTitles, and cleanupUnusedSoftware. This gives per-host transactions a window to acquire locks and commit.
-
Batch SyncHostsSoftwareTitles by title ID ranges: Add WHERE st.id > ? AND st.id <= ? to all 3 count queries and process in countHostSoftwareBatchSize (100K) chunks, matching the approach already used in SyncHostsSoftware. Previously, each of the 3 queries scanned the entire host_software table in one go.
Steps to reproduce
These steps:
- Deploy a Fleet server with a large number of agents (50K+), each reporting 500+ software items.
- Wait for the vulnerability cron to trigger (runs every hour by default).
- Observe MySQL performance metrics — the
host_software DELETE/INSERT/UPDATE queries spike in load while the cron is running.
- Check MySQL
SHOW PROCESSLIST or performance schema for lock waits on host_software.
More info
The issue only manifests when the vulnerability cron runs concurrently with normal agent check-ins. Outside the cron window, the per-host software writes operate at acceptable load levels.
Fleet versions
Actual behavior
When the vulnerability cron runs on a Fleet server with a large number of agents, the MySQL writer DB becomes overloaded. The top queries by load are all
host_softwaretable operations:DELETE FROM host_software WHERE host_id = ? AND software_id IN (...)(~17.6 combined load)INSERT IGNORE INTO host_software (host_id, software_id, last_opened_at) VALUES (...)(~2.47 load)UPDATE host_software hs JOIN (...) ...(~3.6 combined load)These per-host software inventory writes contend with the vulnerability cron's heavy table scans:
SyncHostsSoftwareruns 3 full scans ofhost_softwareper batch (global, team, no-team counts)SyncHostsSoftwareTitlesruns 3 unbounded full table scans (no batching by ID range)cleanupUnusedSoftwarerunsNOT EXISTS (SELECT 1 FROM host_software ...)on the writer up to 100 timesThe concurrent long-running reads from the cron cause row-level lock waits, increased undo log pressure, and buffer pool contention with the per-host write transactions.
To fix
Three changes to reduce contention:
Use reader replica for first SELECT in
cleanupUnusedSoftware: The first iteration of the cleanup loop queries the reader to find orphaned software IDs, reducing writer load. Subsequent iterations use the writer so we see our own deletes and don't re-select the same rows due to replica lag.Add 100ms sleep between cron batches: Insert
time.Sleep(100ms)between batch iterations inSyncHostsSoftware,SyncHostsSoftwareTitles, andcleanupUnusedSoftware. This gives per-host transactions a window to acquire locks and commit.Batch
SyncHostsSoftwareTitlesby title ID ranges: AddWHERE st.id > ? AND st.id <= ?to all 3 count queries and process incountHostSoftwareBatchSize(100K) chunks, matching the approach already used inSyncHostsSoftware. Previously, each of the 3 queries scanned the entirehost_softwaretable in one go.Steps to reproduce
These steps:
host_softwareDELETE/INSERT/UPDATE queries spike in load while the cron is running.SHOW PROCESSLISTor performance schema for lock waits onhost_software.More info
The issue only manifests when the vulnerability cron runs concurrently with normal agent check-ins. Outside the cron window, the per-host software writes operate at acceptable load levels.