Long running software transaction causing massive deadlocks #32201

@getvictor

Description

Fleet version: main (79d431e)


💥  Actual behavior

While trying to reproduce deadlocks for issue #31173, I saw a massive spike in deadlocks caused by a long-running software transaction.

Load test details: a basic load test with 15,000 online hosts and ~80,000 offline hosts.

According to Claude Code's analysis (produced by enabling MySQL logging of all deadlocks and feeding the error log to Claude):

Key Details:

  • Duration: Running for 402+ seconds (over 6.7 minutes) at the time of the deadlocks
  • Thread ID: MySQL thread 10057, from IP 10.12.1.63 (one of the Fleet containers)
  • Lock Statistics: Holding 34,530+ lock structures with 7.3MB heap size
  • Row Locks: Only 13 row locks, but massive lock structure count
  • Undo Log: Only 5 entries (very few actual changes made)

The Problem Query:

UPDATE software s
JOIN software_titles st
  ON s.bundle_identifier = st.bundle_identifier
 AND IF(s.source IN ('apps', 'ios_apps', 'ipados_apps'), s.source = st.source, 1)
SET s.title_id = st.id
WHERE s.title_id IS NULL
   OR s.title_id != st.id

What's Happening:

  1. This is a bulk UPDATE that's trying to match software records with software_titles based on bundle_identifier
  2. It's been running for over 6 minutes and accumulating massive amounts of lock structures (34,530+)
  3. The query is "fetching rows" - still in the process of finding rows to update
  4. It's blocking hundreds of other transactions trying to access the same software table
  5. The massive lock count with few actual row locks suggests it's scanning huge portions of the table
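The table-scan hypothesis in point 5 can be checked with EXPLAIN on the same statement; a `type` of `ALL` on either table would confirm a full scan. Note also that the `IF(...)` join condition is not sargable, so the optimizer cannot use an index on `source` to satisfy it:

```sql
EXPLAIN
UPDATE software s
JOIN software_titles st
  ON s.bundle_identifier = st.bundle_identifier
 AND IF(s.source IN ('apps', 'ios_apps', 'ipados_apps'), s.source = st.source, 1)
SET s.title_id = st.id
WHERE s.title_id IS NULL
   OR s.title_id != st.id;
```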

Root Cause:
This appears to be an inefficient bulk update operation that's:

  • Missing proper indexes on the JOIN columns
  • Potentially updating a massive number of rows without batching
  • Holding locks for the entire duration while scanning the table
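If the join columns are indeed unindexed, indexes along these lines might help (index names and column choices here are assumptions; the actual schema and existing indexes should be checked first):

```sql
-- Hypothetical supporting indexes for the join and the WHERE filter:
ALTER TABLE software
  ADD INDEX idx_software_bundle_identifier (bundle_identifier, title_id);
ALTER TABLE software_titles
  ADD INDEX idx_software_titles_bundle_identifier (bundle_identifier);
```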

🛠️ To fix

Speed this transaction up into the seconds range. Alternatively, we may need to prevent other related edits on the server while it is running.
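One way to keep each transaction in the seconds range is to batch: select a bounded set of (software row, title id) pairs, then update those rows by primary key in a short transaction. A sketch, assuming `software.id` is the primary key (MySQL disallows LIMIT on a multi-table UPDATE, hence the two-step form):

```sql
-- Step 1: fetch one batch of rows that still need linking. No update
-- locks are held during this read.
SELECT s.id AS software_id, st.id AS title_id
FROM software s
JOIN software_titles st
  ON s.bundle_identifier = st.bundle_identifier
 AND IF(s.source IN ('apps', 'ios_apps', 'ipados_apps'), s.source = st.source, 1)
WHERE s.title_id IS NULL OR s.title_id != st.id
LIMIT 1000;

-- Step 2: apply the batch by primary key in its own short transaction,
-- commit, then repeat from step 1 until no rows are returned.
UPDATE software SET title_id = ? WHERE id = ?;
```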

🧑‍💻  Steps to reproduce

I tried looking at APM for details but could not find that query there, which suggests it is one of the cron jobs. This is an example where fully working telemetry could help us catch issues like this before our customers do. cc: @ksykulev

To reproduce, we can try to monitor long-running transactions manually, which seems unpredictable.

Or, add telemetry or logs for this transaction to measure how long it actually takes. Maybe the issue is sizing: we currently have few signals indicating whether the DB instance is undersized, and this transaction's length may be a key one.
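Until such telemetry exists, long-running transactions can be spotted directly from MySQL's `information_schema.innodb_trx` view (the 60-second threshold here is an arbitrary choice):

```sql
-- List transactions open longer than 60 seconds, oldest first.
SELECT trx_mysql_thread_id, trx_started,
       TIMESTAMPDIFF(SECOND, trx_started, NOW()) AS age_seconds,
       trx_rows_locked, LEFT(trx_query, 120) AS query_head
FROM information_schema.innodb_trx
WHERE trx_started < NOW() - INTERVAL 60 SECOND
ORDER BY trx_started;
```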

🕯️ More info (optional)

N/A

Metadata

Assignees

Labels

#g-security-compliance: Security & Compliance product group
:product: Product Design department (shows up on 🦢 Drafting board)
bug: Something isn't working as documented
~aging bug: Bug has been open more than 90 days
~released bug: This bug was found in a stable release.
~software-ingestion: Issue regarding ingesting software inventory from a host into Fleet.

Type

No type

Projects

Status

Done

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests
