Summary
After any clean shutdown of askalf-agent (`systemctl restart`, host reboot, env change triggering daemon-reload, anything), the agent restarts and fails authentication with the bridge, leaving the device permanently offline. Recovery requires minting a new bridge `api_keys` row and overwriting agent.json with plaintext.
Reproduced on Linux (Hetzner CX33, systemd v3.1.5) and Windows (alf-prod-docker desktop service v3.1.5).
Symptom
```
Connecting to ws://localhost:3005/ws/agent-bridge...
Connected. Registering device...
Server error: [AUTH_FAILED] Invalid or expired token
Disconnected (4001: Invalid token)
Registration timeout — starting heartbeat for auto-register...
```
The "auto-register" mode is cosmetic — it heartbeats but never re-pairs. Device stays offline indefinitely.
Root cause
`~/.askalf/agent.json` stores `{ apiKey, _encrypted: true }`. The agent re-encrypts the apiKey at shutdown using process-local entropy (suspected: env-derived or random-at-init key). On next start, the agent decrypts using a different key derivation, gets garbage, and sends garbage as the Bearer token. The bridge's `validateBridgeToken` finds no matching `key_hash` row in `api_keys` and rejects with AUTH_FAILED.
Test that proved per-process decryption (not transport corruption)
Restored a known-good `agent.json.bak.v5` backup from earlier in the same day → got the exact same AUTH_FAILED. Two different encrypted blobs both unreadable after restart confirms decryption isn't stable across process lifetimes.
Schema reference for anyone debugging downstream
The bridge auth table is `api_keys` (not `forge_api_keys` — those are forge-API-key-scoped, separate concern). Schema columns relevant to debugging:
```
id, user_id, tenant_id, name, key_hash, key_prefix, key_preview,
scopes[], status, created_at, last_used_at, revoked_at
```
`key_hash` format: `.` where salt=16 bytes, hash=`pbkdf2_hmac_sha256(plaintext, salt, 100_000, 32)`. Validation lives in `apps/forge/src/runtime/agent-bridge.ts::validateBridgeToken` on the forge side.
Current recovery recipe (Linux)
- `systemctl stop askalf-agent`
- Generate plaintext: `"sk_live_" + secrets.token_urlsafe(32)`
- INSERT new `api_keys` row with computed `key_hash`, same scopes as old
- Revoke prior row: `UPDATE api_keys SET status='revoked', revoked_at=NOW() WHERE id=`
- Write agent.json: `{"apiKey":"<plaintext>","url":"ws://localhost:3005","deviceName":"","_encrypted":false}`
- `chown askalf-agent` + `chmod 600`
- `systemctl start askalf-agent` → expect "Registered as device …" within ~1s
The `_encrypted: false` flag avoids the broken re-encrypt path on next shutdown (until the agent encrypts it again on its own shutdown, looping the issue — but it gives you a clean window).
Suggested fixes (any one resolves)
- A. Stable on-disk encryption key. Derive the agent.json encryption key from a persistent secret (e.g. a generated keyfile written once during install and chmod-600 to the agent user). All restarts decrypt with the same key.
- B. Plaintext fallback when encryption is unreliable. Detect that the prior encrypted blob can't be safely decrypted with the current process entropy and fall back to writing/reading plaintext. Honest plaintext is better than silent corruption that masquerades as "auto-register has it covered."
- C. Detect garbage-decrypt and trigger re-pair flow. When a decrypt produces non-printable bytes / wrong-prefix output, surface that to the user instead of looping in cosmetic auto-register. At minimum the AUTH_FAILED log line should say "re-pair required" with a link to the recovery recipe.
A + C together would be ideal — stable encryption is the real fix, detect-and-surface is the safety net.
What doesn't work (please don't suggest in the fix discussion)
- `.bak.v*` backups are not recoverable — they share the same brokenness.
- "auto-register" mode does not eventually re-pair. It just heartbeats forever against a closed WS.
- Restarting the agent without a recovery plan staged is the failure mode that surfaces this bug.
Impact
Every `systemctl restart askalf-agent` (which is the documented way to apply config / env / version changes) requires a full re-pair on every fleet device. Effectively makes the agent non-restartable in production until this is fixed.
Tracked in platform memory
This bug was first encountered on 2026-05-12 during alf-platform-1 recovery (v8 mint, sk_live_8EaW…). Documented under platform memory `reference_askalf_agent_encryption_bug.md`.
Summary
After any clean shutdown of askalf-agent (`systemctl restart`, host reboot, env change triggering daemon-reload, anything), the agent restarts and fails authentication with the bridge, leaving the device permanently offline. Recovery requires minting a new bridge `api_keys` row and overwriting agent.json with plaintext.
Reproduced on Linux (Hetzner CX33, systemd v3.1.5) and Windows (alf-prod-docker desktop service v3.1.5).
Symptom
```
Connecting to ws://localhost:3005/ws/agent-bridge...
Connected. Registering device...
Server error: [AUTH_FAILED] Invalid or expired token
Disconnected (4001: Invalid token)
Registration timeout — starting heartbeat for auto-register...
```
The "auto-register" mode is cosmetic — it heartbeats but never re-pairs. Device stays offline indefinitely.
Root cause
`~/.askalf/agent.json` stores `{ apiKey, _encrypted: true }`. The agent re-encrypts the apiKey at shutdown using process-local entropy (suspected: env-derived or random-at-init key). On next start, the agent decrypts using a different key derivation, gets garbage, and sends garbage as the Bearer token. The bridge's `validateBridgeToken` finds no matching `key_hash` row in `api_keys` and rejects with AUTH_FAILED.
Test that proved per-process decryption (not transport corruption)
Restored a known-good `agent.json.bak.v5` backup from earlier in the same day → got the exact same AUTH_FAILED. Two different encrypted blobs both unreadable after restart confirms decryption isn't stable across process lifetimes.
Schema reference for anyone debugging downstream
The bridge auth table is `api_keys` (not `forge_api_keys` — those are forge-API-key-scoped, separate concern). Schema columns relevant to debugging:
```
id, user_id, tenant_id, name, key_hash, key_prefix, key_preview,
scopes[], status, created_at, last_used_at, revoked_at
```
`key_hash` format: `.` where salt=16 bytes, hash=`pbkdf2_hmac_sha256(plaintext, salt, 100_000, 32)`. Validation lives in `apps/forge/src/runtime/agent-bridge.ts::validateBridgeToken` on the forge side.
Current recovery recipe (Linux)
The `_encrypted: false` flag avoids the broken re-encrypt path on next shutdown (until the agent encrypts it again on its own shutdown, looping the issue — but it gives you a clean window).
Suggested fixes (any one resolves)
A + C together would be ideal — stable encryption is the real fix, detect-and-surface is the safety net.
What doesn't work (please don't suggest in the fix discussion)
Impact
Every `systemctl restart askalf-agent` (which is the documented way to apply config / env / version changes) requires a full re-pair on every fleet device. Effectively makes the agent non-restartable in production until this is fixed.
Tracked in platform memory
This bug was first encountered on 2026-05-12 during alf-platform-1 recovery (v8 mint, sk_live_8EaW…). Documented under platform memory `reference_askalf_agent_encryption_bug.md`.