Windows BitLocker encryption looping #38405

@spalmesano0

Description

Fleet version: v4.78.1

Web browser and operating system: N/A


💥 Actual behavior

Windows hosts enter an infinite BitLocker encryption/decryption loop after Fleet enrollment and disk encryption enforcement.

  • Fleet UI shows Disk encryption: Off, OS settings: Pending.
  • BitLocker starts encrypting the OS drive (C:), progresses to Verifying (or partially encrypts, e.g. 86.9%), eventually reverts/fails, loops back to Pending, and repeats indefinitely.
  • Uninstalling Fleet or moving the host back to a team without encryption enforced stops the loop.
  • Manual BitLocker encryption works without issues/looping.
  • Reinstalling Windows via the Recovery option (without a full disk wipe/format) does not fix the issue.
  • Partition layouts show non-standard elements (multiple/extra recovery partitions, occasional ReFS data volumes alongside NTFS OS; see More info below).

🛠️ To fix

Timebox TBD at estimation

🧑‍💻 Steps to reproduce

These steps have been confirmed to consistently reproduce the issue in multiple Fleet instances:
  1. Set up a Windows device with a non-standard partition map (see More info below).
  2. Enroll that host into Fleet.
  3. Assign the host to a team with disk encryption enforcement enabled.

🕯️ More info

Partition data examples:

  • One case: C: NTFS + large D: ReFS + multiple NTFS/FAT32/recovery partitions.
  • Another: All NTFS but 3+ recovery partitions + non-standard map.

This might be related to #37454.

Root cause

Three concurrent orbit subsystems share a single COM thread managed by the comshim library: BitLocker, MDM Bridge, and Windows Update. The comshim library maintains a global reference count: Add(1) increments it (initializing COM if the count was 0), and Done() decrements it (tearing down COM if it reaches 0).

BitLocker's GetEncryptionStatus() enumerates all logical volumes and queries each one's BitLocker status via WMI. On a VM with 5 drives, this means 5 sequential cycles of:

comshim.Add(1) → bitlockerConnect → WMI query → bitlockerClose → comshim.Done()

Each Done() can drop the ref count to 0, triggering COM teardown. The next Add(1) re-initializes COM. This rapid oscillation through zero, with MDM Bridge and Windows Update also calling Add/Done on their own schedules, creates a race condition. If a Done() triggers teardown while another goroutine's Add(1) is trying to re-initialize, they deadlock on comshim's internal mutex and COM's initialization locks.
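The oscillation through zero can be illustrated with a minimal model of comshim-style reference counting. This is a simplified sketch, not the library's actual implementation; the `inits`/`teardowns` counters are hypothetical instrumentation standing in for `CoInitializeEx`/`CoUninitialize`:

```go
package main

import (
	"fmt"
	"sync"
)

// shim models comshim's global reference counter: Add(1) initializes
// COM when the count rises from 0, Done() tears it down when the count
// falls back to 0. Field names are illustrative, not the real library's.
type shim struct {
	mu        sync.Mutex
	count     int
	inits     int // times COM would have been initialized
	teardowns int // times COM would have been torn down
}

func (s *shim) Add(n int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.count == 0 {
		s.inits++ // real comshim: CoInitializeEx on its locked OS thread
	}
	s.count += n
}

func (s *shim) Done() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.count--
	if s.count == 0 {
		s.teardowns++ // real comshim: CoUninitialize
	}
}

func main() {
	var s shim
	// Five sequential per-volume status queries, as GetEncryptionStatus
	// performs on a VM with 5 drives:
	for i := 0; i < 5; i++ {
		s.Add(1)
		// ... bitlockerConnect -> WMI query -> bitlockerClose ...
		s.Done()
	}
	fmt.Printf("inits=%d teardowns=%d\n", s.inits, s.teardowns)
	// With no other subsystem holding a reference, every cycle passes
	// through zero, so COM is initialized and torn down five times.
}
```

With MDM Bridge and Windows Update interleaving their own Add/Done calls, the count crosses zero at unpredictable moments, which is the window for the teardown/re-init race described above.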

This is not a transient timing collision (BitLocker only holds COM for 1-2 seconds). It's a structural deadlock caused by the teardown/re-init lifecycle race, which permanently blocks all participating threads.

Fix

Create a dedicated COM worker goroutine for BitLocker that bypasses comshim entirely.

The worker:

  • Locks itself to an OS thread via runtime.LockOSThread()
  • Calls ole.CoInitializeEx(0, ole.COINIT_MULTITHREADED) once at startup
  • Processes BitLocker operations (encrypt, decrypt, status check) sequentially via a channel
  • COM stays initialized for the lifetime of orbit; no ref count oscillation, no teardown races
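The worker pattern might look like the following sketch. The go-ole calls are left as comments so the example is portable; `comWorker`, `newCOMWorker`, and `Run` are hypothetical names, not the actual orbit implementation:

```go
package main

import (
	"fmt"
	"runtime"
)

// comWorker pins one goroutine to one OS thread and executes all
// BitLocker operations on it sequentially via a channel.
type comWorker struct {
	ops chan func()
}

func newCOMWorker() *comWorker {
	w := &comWorker{ops: make(chan func())}
	go func() {
		// COM apartment membership is per OS thread, so this goroutine
		// must never migrate.
		runtime.LockOSThread()
		defer runtime.UnlockOSThread()

		// ole.CoInitializeEx(0, ole.COINIT_MULTITHREADED)
		// COM stays initialized for the lifetime of orbit: no reference
		// count, no teardown, no re-init race with other subsystems.
		for op := range w.ops {
			op()
		}
		// ole.CoUninitialize() // only on shutdown
	}()
	return w
}

// Run executes op on the dedicated COM thread and waits for completion.
func (w *comWorker) Run(op func()) {
	done := make(chan struct{})
	w.ops <- func() {
		op()
		close(done)
	}
	<-done
}

func main() {
	w := newCOMWorker()
	// Encrypt, decrypt, and status operations all funnel through the
	// same pinned thread:
	for i := 0; i < 3; i++ {
		w.Run(func() { fmt.Println("status check on COM thread") })
	}
}
```

Because every operation is serialized through one channel onto one pinned thread, the design also removes any need for per-call COM setup and teardown.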

MDM Bridge and Windows Update continue using comshim. Without BitLocker in the mix, they have minimal conflict risk.

Bonus fix

Also fixed a pre-existing nil pointer panic in orbit.go when running with --disable-updates: updateRunner.OsqueryVersion was accessed but updateRunner is nil when updates are disabled. Fixed by introducing an osqueryVersion variable set in both code paths.
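A sketch of that pattern, with the surrounding code and the fallback value hypothetical (only `updateRunner` and `OsqueryVersion` come from the issue; `resolveOsqueryVersion` and the `"stable"` default are illustrative):

```go
package main

import "fmt"

// runner stands in for the update runner type; only the field name
// OsqueryVersion is taken from the issue.
type runner struct{ OsqueryVersion string }

// resolveOsqueryVersion sets osqueryVersion in both code paths instead
// of dereferencing updateRunner unconditionally, which panicked when
// orbit ran with --disable-updates (updateRunner == nil).
func resolveOsqueryVersion(updateRunner *runner, disableUpdates bool) string {
	var osqueryVersion string
	if disableUpdates || updateRunner == nil {
		osqueryVersion = "stable" // hypothetical fallback when updates are disabled
	} else {
		osqueryVersion = updateRunner.OsqueryVersion
	}
	return osqueryVersion
}

func main() {
	fmt.Println(resolveOsqueryVersion(nil, true)) // previously a nil pointer panic
	fmt.Println(resolveOsqueryVersion(&runner{OsqueryVersion: "5.12.0"}, false))
}
```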

Metadata

Labels

  • #g-security-compliance: Security & Compliance product group
  • :release: Ready to write code. Scheduled in a release. See "Making changes" in handbook.
  • bug: Something isn't working as documented
  • customer-jultayu
  • customer-mozartia


Status

✅ Ready for release
