Long running software transaction causing massive deadlocks #32201
Status: Closed (Done)
Labels: #g-security-compliance (Security & Compliance product group), :product (Product Design department), bug (Something isn't working as documented), ~aging bug (open more than 90 days), ~released bug (found in a stable release), ~software-ingestion (ingesting software inventory from a host into Fleet)
Fleet version: main (79d431e)
💥 Actual behavior
While trying to reproduce the deadlocks from issue #31173, I saw a massive spike in deadlocks caused by a long-running software transaction.
Load test details: basic load test with 15,000 online hosts and ~80,000 offline hosts.
According to Claude Code (analysis done by enabling MySQL logging of all deadlocks and feeding the error log to Claude):
Key Details:
The Problem Query:
What's Happening:
Root Cause:
This appears to be an inefficient bulk update operation that's:
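For reference, the deadlock logging mentioned above can be reproduced with MySQL's `innodb_print_all_deadlocks` setting, which writes every InnoDB deadlock to the error log instead of only the latest one in `SHOW ENGINE INNODB STATUS`. A minimal sketch of that setup (the exact steps used here were not recorded, so treat this as an assumption):

```sql
-- Log every InnoDB deadlock to the MySQL error log.
SET GLOBAL innodb_print_all_deadlocks = ON;

-- Confirm the setting and locate the error log file to analyze.
SHOW VARIABLES LIKE 'innodb_print_all_deadlocks';
SHOW VARIABLES LIKE 'log_error';
```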
🛠️ To fix
Try to speed up this transaction so it completes in the seconds range. Alternatively, we may need to prevent other related edits on the server while it is running.
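One common way to shrink a long-running bulk update is to split the work into fixed-size batches, each committed in its own short transaction, so row locks are held briefly and the deadlock window shrinks. A minimal sketch, assuming the bulk update iterates over host IDs; `batchIDs` and the batch size are hypothetical, not Fleet's actual code:

```go
package main

import "fmt"

// batchIDs splits a list of host IDs into fixed-size batches so each batch
// can be updated in its own short transaction instead of one long one.
func batchIDs(ids []uint, size int) [][]uint {
	var batches [][]uint
	for start := 0; start < len(ids); start += size {
		end := start + size
		if end > len(ids) {
			end = len(ids)
		}
		batches = append(batches, ids[start:end])
	}
	return batches
}

func main() {
	ids := make([]uint, 10)
	for i := range ids {
		ids[i] = uint(i + 1)
	}
	// The caller would wrap each batch in its own BEGIN/COMMIT.
	for _, b := range batchIDs(ids, 4) {
		fmt.Println(b)
	}
}
```

Batching trades a single atomic update for many small ones, so it only fits if partial progress is acceptable (or the job is idempotent and can be re-run).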
🧑‍💻 Steps to reproduce
I tried looking in APM for details but could not find that query there, which suggests it comes from one of the cron jobs. This is an example where fully working telemetry could help us catch issues like this before our customers do. cc: @ksykulev
To reproduce, we could monitor long-running transactions manually, but that seems unpredictable.
Alternatively, add telemetry or logs for this transaction to measure how long it actually takes. Maybe the issue is instance sizing: we currently have few signals indicating whether the DB instance is undersized, and this transaction's duration may be a key one.
🕯️ More info (optional)
N/A