Fleet versions
- Discovered: 4.83.2
- Reproduced: N/A
Web browser and operating system: macOS 26.4
💥 Actual behavior
Two hosts are unable to install configuration profiles, run scripts, or install software. Both devices experienced critically low disk space according to logs, and then failed to recover when space was freed.
🛠️ Expected behavior
Orbit recovers gracefully from disk-full scenarios. Once disk space is freed, all device management operations resume without manual intervention.
🧑💻 Steps to reproduce
These steps:
- Run standard device management tasks on a host with critically low disk space (indicated by stderr messages of No space left on device).
- Observe that configuration profiles, scripts, and software installs still fail even once sufficient disk space is available again.
🕯️ More info (optional)
N/A
Spec
Problem: After critically low disk space, Orbit enters an unrecoverable state where configuration profiles, scripts, and software installs all fail permanently, even after disk space is freed.
Hypothesized root cause (needs reproduction to confirm): writeData() in orbit/pkg/update/filestore/filestore.go opens the TUF metadata file (updates-metadata.json) with O_RDWR|O_CREATE but no O_TRUNC. When disk fills mid-write, the JSON is partially written over old data, leaving a corrupted file. On next startup, readData() fails to decode the corrupted JSON, NewUpdater() returns a fatal error, and Orbit cannot function. This hypothesis is based on code analysis -- to confirm, check the updates-metadata.json file on affected hosts and verify NewUpdater() is what fails in the logs.
Fix behavior:
- Atomic writes in filestore: writeData() must use atomic write-and-rename (write to a temp file, sync, then rename over the target). This prevents corruption from partial writes.
- Corrupted file recovery in filestore: readData() must handle corrupted/unparseable JSON gracefully. If the file exists but cannot be decoded, delete it and initialize with an empty metadata map (same as the file-not-found path). Log a warning when this happens. This is safe because TUF will re-fetch all metadata from the server on the next update check.
- Startup resilience: If NewUpdater() fails in orbit.go, Orbit should not exit fatally. It should delete the corrupted file, retry immediately, and continue startup.
Proposed subtasks
- writeData() in filestore.go: change to write-to-temp-then-rename. Tests for partial write recovery.
- readData() in filestore.go: delete and re-initialize when JSON decode fails, with a warning log. Tests for corrupted file handling.
- orbit.go: make NewUpdater() failure non-fatal, delete corrupted file, retry immediately, continue startup. Tests for startup with bad metadata.