Long running software transaction causing massive deadlocks #32201

@getvictor

Description

Fleet version: main (79d431e)


💥  Actual behavior

While trying to reproduce deadlocks for issue #31173, I saw a massive spike in deadlocks caused by a long-running software transaction.

Load test details: a basic load test with 15,000 online hosts and ~80,000 offline hosts.

According to Claude Code's analysis (produced by enabling MySQL logging of all deadlocks and feeding the error log to Claude):

Key Details:

  • Duration: Running for 402+ seconds (over 6.7 minutes) at the time of the deadlocks
  • Thread ID: MySQL thread 10057, from IP 10.12.1.63 (one of the Fleet containers)
  • Lock Statistics: Holding 34,530+ lock structures with 7.3MB heap size
  • Row Locks: Only 13 row locks, but massive lock structure count
  • Undo Log: Only 5 entries (very few actual changes made)

The Problem Query:

UPDATE software s
JOIN software_titles st
  ON s.bundle_identifier = st.bundle_identifier
 AND IF(s.source IN ('apps', 'ios_apps', 'ipados_apps'), s.source = st.source, 1)
SET s.title_id = st.id
WHERE s.title_id IS NULL
   OR s.title_id != st.id

What's Happening:

  1. This is a bulk UPDATE that's trying to match software records with software_titles based on bundle_identifier
  2. It's been running for over 6 minutes and accumulating massive amounts of lock structures (34,530+)
  3. The query is "fetching rows" - still in the process of finding rows to update
  4. It's blocking hundreds of other transactions trying to access the same software table
  5. The massive lock count with few actual row locks suggests it's scanning huge portions of the table
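The table-scan hypothesis in point 5 can be checked with EXPLAIN on the same statement; a `type` of `ALL` on either table would confirm a full scan. Note also that the `IF(...)` join condition is not sargable, so the optimizer cannot use an index on `source` to satisfy it:

```sql
EXPLAIN
UPDATE software s
JOIN software_titles st
  ON s.bundle_identifier = st.bundle_identifier
 AND IF(s.source IN ('apps', 'ios_apps', 'ipados_apps'), s.source = st.source, 1)
SET s.title_id = st.id
WHERE s.title_id IS NULL
   OR s.title_id != st.id;
```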

Root Cause:
This appears to be an inefficient bulk update operation that's:

  • Missing proper indexes on the JOIN columns
  • Potentially updating a massive number of rows without batching
  • Holding locks for the entire duration while scanning the table
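If the join columns are indeed unindexed, indexes along these lines might help (index names and column choices here are assumptions; the actual schema and existing indexes should be checked first):

```sql
-- Hypothetical supporting indexes for the join and the WHERE filter:
ALTER TABLE software
  ADD INDEX idx_software_bundle_identifier (bundle_identifier, title_id);
ALTER TABLE software_titles
  ADD INDEX idx_software_titles_bundle_identifier (bundle_identifier);
```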

🛠️ To fix

Speed this transaction up into the seconds range. Alternatively, we may need to prevent other related edits on the server while it is running.
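One way to keep each transaction in the seconds range is to batch: select a bounded set of (software row, title id) pairs, then update those rows by primary key in a short transaction. A sketch, assuming `software.id` is the primary key (MySQL disallows LIMIT on a multi-table UPDATE, hence the two-step form):

```sql
-- Step 1: fetch one batch of rows that still need linking. No update
-- locks are held during this read.
SELECT s.id AS software_id, st.id AS title_id
FROM software s
JOIN software_titles st
  ON s.bundle_identifier = st.bundle_identifier
 AND IF(s.source IN ('apps', 'ios_apps', 'ipados_apps'), s.source = st.source, 1)
WHERE s.title_id IS NULL OR s.title_id != st.id
LIMIT 1000;

-- Step 2: apply the batch by primary key in its own short transaction,
-- commit, then repeat from step 1 until no rows are returned.
UPDATE software SET title_id = ? WHERE id = ?;
```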

🧑‍💻  Steps to reproduce

I tried looking at APM for details but could not find that query there, which suggests it is one of the cron jobs. This is an example where fully working telemetry could help us catch issues like this before our customers do. cc: @ksykulev

To reproduce, we can try to monitor long-running transactions manually, which seems unpredictable.

Or, add telemetry or logs for this transaction to measure how long it actually takes. Maybe the issue is sizing: we currently have few signals indicating whether the DB instance is undersized, and this transaction's length may be a key one.
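Until such telemetry exists, long-running transactions can be spotted directly from MySQL's `information_schema.innodb_trx` view (the 60-second threshold here is an arbitrary choice):

```sql
-- List transactions open longer than 60 seconds, oldest first.
SELECT trx_mysql_thread_id, trx_started,
       TIMESTAMPDIFF(SECOND, trx_started, NOW()) AS age_seconds,
       trx_rows_locked, LEFT(trx_query, 120) AS query_head
FROM information_schema.innodb_trx
WHERE trx_started < NOW() - INTERVAL 60 SECOND
ORDER BY trx_started;
```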

🕯️ More info (optional)

N/A

Metadata

Assignees

Labels

#g-security-compliance: Security & Compliance product group
:product: Product Design department (shows up on 🦢 Drafting board)
bug: Something isn't working as documented
~aging bug: Bug has been open more than 90 days
~released bug: This bug was found in a stable release.
~software-ingestion: Issue regarding ingesting software inventory from a host into Fleet.

Type

No type

Projects

Status

Done

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests
