Low disk space forcing Orbit into an unrecoverable state

**Fleet versions**
  - *Discovered:* 4.83.2
  - *Reproduced:* N/A

**Web browser and operating system**: macOS 26.4

<hr/>

### 💥  Actual behavior
Two hosts are unable to install configuration profiles, run scripts, or install software. According to logs, both devices ran critically low on disk space and then failed to recover after space was freed.

### 🛠️ Expected behavior
Orbit recovers gracefully from disk-full scenarios. Once disk space is freed, all device management operations resume without manual intervention.

### 🧑‍💻  Steps to reproduce

These steps:

- [ ] Have been confirmed to consistently lead to reproduction in multiple Fleet instances.
- [x] Describe the workflow that led to the error, but have not yet been reproduced in multiple Fleet instances.

1. Run standard device management tasks on a host with critically low disk space (indicated by stderr messages of `No space left on device`).
2. Observe that configuration profiles, scripts, and software installs still fail after sufficient disk space has been freed.

### 🕯️ More info _(optional)_
N/A

---

## Spec

**Problem:** After critically low disk space, Orbit enters an unrecoverable state where configuration profiles, scripts, and software installs all fail permanently, even after disk space is freed.

**Hypothesized root cause (needs reproduction to confirm):** `writeData()` in `orbit/pkg/update/filestore/filestore.go` opens the TUF metadata file (`updates-metadata.json`) with `O_RDWR|O_CREATE` but no `O_TRUNC`. When disk fills mid-write, the JSON is partially written over old data, leaving a corrupted file. On next startup, `readData()` fails to decode the corrupted JSON, `NewUpdater()` returns a fatal error, and Orbit cannot function. This hypothesis is based on code analysis -- to confirm, check the `updates-metadata.json` file on affected hosts and verify `NewUpdater()` is what fails in the logs.

**Fix behavior:**

1. **Atomic writes in filestore:** `writeData()` must use atomic write-and-rename (write to a temp file, sync, then rename over the target). This prevents corruption from partial writes.
2. **Corrupted file recovery in filestore:** `readData()` must handle corrupted/unparseable JSON gracefully. If the file exists but cannot be decoded, delete it and initialize with an empty metadata map (same as the file-not-found path). Log a warning when this happens. This is safe because TUF will re-fetch all metadata from the server on the next update check.
3. **Startup resilience:** If `NewUpdater()` fails in `orbit.go`, Orbit should not exit fatally. It should delete the corrupted file, retry immediately, and continue startup.

## Proposed subtasks

- [ ] Reproduce the issue and confirm the root cause hypothesis
- [ ] Atomic writes in filestore -- change `writeData()` in `filestore.go` to write-to-temp-then-rename. Tests for partial write recovery.
- [ ] Corrupted file recovery in filestore -- update `readData()` in `filestore.go` to delete and re-initialize when JSON decode fails, with a warning log. Tests for corrupted file handling.
- [ ] Startup resilience in `orbit.go` -- make `NewUpdater()` failure non-fatal, delete corrupted file, retry immediately, continue startup. Tests for startup with bad metadata.
