
Low disk space forcing Orbit into an unrecoverable state #44561

@mason-buettner

Description

Fleet versions

  • Discovered: 4.83.2
  • Reproduced: N/A

Web browser and operating system: macOS 26.4


💥 Actual behavior

Two hosts are unable to install configuration profiles, run scripts, or install software. According to logs, both hosts ran critically low on disk space and did not recover after space was freed.

🛠️ Expected behavior

Orbit recovers gracefully from disk-full scenarios. Once disk space is freed, all device management operations resume without manual intervention.

🧑‍💻 Steps to reproduce

These steps:

  • Have been confirmed to consistently lead to reproduction in multiple Fleet instances.
  • Describe the workflow that led to the error, but have not yet been reproduced in multiple Fleet instances.
  1. Run standard device management tasks on a host with critically low disk space (indicated by "No space left on device" messages on stderr).
  2. Free disk space on the host, then observe that configuration profiles, scripts, and software installs still fail even though sufficient space is available.

🕯️ More info (optional)

N/A


Spec

Problem: After critically low disk space, Orbit enters an unrecoverable state where configuration profiles, scripts, and software installs all fail permanently, even after disk space is freed.

Hypothesized root cause (needs reproduction to confirm): writeData() in orbit/pkg/update/filestore/filestore.go opens the TUF metadata file (updates-metadata.json) with O_RDWR|O_CREATE but no O_TRUNC. When disk fills mid-write, the JSON is partially written over old data, leaving a corrupted file. On next startup, readData() fails to decode the corrupted JSON, NewUpdater() returns a fatal error, and Orbit cannot function. This hypothesis is based on code analysis -- to confirm, check the updates-metadata.json file on affected hosts and verify NewUpdater() is what fails in the logs.

Fix behavior:

  1. Atomic writes in filestore: writeData() must use atomic write-and-rename (write to a temp file, sync, then rename over the target). This prevents corruption from partial writes.
  2. Corrupted file recovery in filestore: readData() must handle corrupted/unparseable JSON gracefully. If the file exists but cannot be decoded, delete it and initialize with an empty metadata map (same as the file-not-found path). Log a warning when this happens. This is safe because TUF will re-fetch all metadata from the server on the next update check.
  3. Startup resilience: If NewUpdater() fails in orbit.go, Orbit should not exit fatally. It should delete the corrupted file, retry immediately, and continue startup.

Proposed subtasks

  • Reproduce the issue and confirm the root cause hypothesis
  • Atomic writes in filestore -- change writeData() in filestore.go to write-to-temp-then-rename. Tests for partial write recovery.
  • Corrupted file recovery in filestore -- update readData() in filestore.go to delete and re-initialize when JSON decode fails, with a warning log. Tests for corrupted file handling.
  • Startup resilience in orbit.go -- make NewUpdater() failure non-fatal, delete corrupted file, retry immediately, continue startup. Tests for startup with bad metadata.

Metadata

Assignees

Labels

  • #g-orchestration: Orchestration product group
  • :product: Product Design department (shows up on 🦢 Drafting board)
  • bug: Something isn't working as documented
  • customer-shackleton

Type

No type

Projects

Status

🐥 Ready to estimate

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests
