Description
customer-mozartia: Slack thread (contains logs).
Fleet version: v4.78.1
Web browser and operating system: N/A
💥 Actual behavior
Windows hosts enter an infinite BitLocker encryption/decryption loop after Fleet enrollment and disk encryption enforcement.
- Fleet UI shows Disk encryption: Off, OS settings: Pending.
- BitLocker starts encrypting the OS drive (C:), progresses to Verifying (or partially encrypts, e.g. 86.9%), eventually reverts/fails, loops back to Pending, and repeats indefinitely.
- Uninstalling Fleet or moving the host back to a team without encryption enforced stops the loop.
- Manual BitLocker encryption works without issues/looping.
- Reinstalling Windows via Recovery option (no full disk wipe/format) does not fix the issue.
- Partition layouts show non-standard elements (multiple/extra recovery partitions, occasional ReFS data volumes alongside NTFS OS; see More info below).
🛠️ To fix
Timebox TBD at estimation
🧑‍💻 Steps to reproduce
These steps:
- Have been confirmed to consistently lead to reproduction in multiple Fleet instances.
- Set up a Windows device with a non-standard partition map (see More info below).
- Enroll that host into Fleet.
- Assign the host to a team with disk encryption enforcement enabled.
🕯️ More info
Partition data examples:
- One case: C: NTFS + large D: ReFS + multiple NTFS/FAT32/recovery partitions.
- Another: All NTFS but 3+ recovery partitions + non-standard map.
Might be related to #37454.
Root cause
Three concurrent orbit subsystems share a single COM thread managed by the comshim library: BitLocker, MDM Bridge, and Windows Update. The comshim library maintains a global reference count: Add(1) increments it (initializing COM if the count was 0), and Done() decrements it (tearing down COM if it reaches 0).
BitLocker's GetEncryptionStatus() enumerates all logical volumes and queries each one's BitLocker status via WMI. On a VM with 5 drives, this means 5 sequential cycles of:
`comshim.Add(1)` → `bitlockerConnect` → WMI query → `bitlockerClose` → `comshim.Done()`
Each Done() can drop the ref count to 0, triggering COM teardown. The next Add(1) re-initializes COM. This rapid oscillation through zero, with MDM Bridge and Windows Update also calling Add/Done on their own schedules, creates a race condition. If a Done() triggers teardown while another goroutine's Add(1) is trying to re-initialize, they deadlock on comshim's internal mutex and COM's initialization locks.
This is not a transient timing collision (BitLocker only holds COM for 1-2 seconds). It's a structural deadlock caused by the teardown/re-init lifecycle race, which permanently blocks all participating threads.
Fix
Create a dedicated COM worker goroutine for BitLocker that bypasses comshim entirely.
The worker:
- Locks itself to an OS thread via `runtime.LockOSThread()`
- Calls `ole.CoInitializeEx(0, ole.COINIT_MULTITHREADED)` once at startup
- Processes BitLocker operations (encrypt, decrypt, status check) sequentially via a channel
- Keeps COM initialized for the lifetime of orbit: no ref count oscillation, no teardown races
MDM Bridge and Windows Update continue using comshim. Without BitLocker in the mix, they have minimal conflict risk.
Bonus fix
Also fixed a pre-existing nil pointer panic in `orbit.go` when running with `--disable-updates`: `updateRunner.OsqueryVersion` was accessed, but `updateRunner` is nil when updates are disabled. Fixed by introducing an `osqueryVersion` variable set in both code paths.