agent.json apiKey re-encrypted with per-process entropy — restart corrupts credential, requires manual re-pair

## Summary

After any clean shutdown of askalf-agent (\`systemctl restart\`, host reboot, env change triggering daemon-reload, anything), the agent restarts and fails authentication with the bridge, leaving the device permanently offline. Recovery requires minting a new bridge \`api_keys\` row and overwriting agent.json with plaintext.

Reproduced on Linux (Hetzner CX33, systemd v3.1.5) and Windows (alf-prod-docker desktop service v3.1.5).

## Symptom

\`\`\`
Connecting to ws://localhost:3005/ws/agent-bridge...
Connected. Registering device...
Server error: [AUTH_FAILED] Invalid or expired token
Disconnected (4001: Invalid token)
Registration timeout — starting heartbeat for auto-register...
\`\`\`

The "auto-register" mode is cosmetic — it heartbeats but never re-pairs. Device stays offline indefinitely.

## Root cause

\`~/.askalf/agent.json\` stores \`{ apiKey, _encrypted: true }\`. The agent re-encrypts the apiKey at shutdown using **process-local entropy** (suspected: env-derived or random-at-init key). On next start, the agent decrypts using a *different* key derivation, gets garbage, and sends garbage as the Bearer token. The bridge's \`validateBridgeToken\` finds no matching \`key_hash\` row in \`api_keys\` and rejects with AUTH_FAILED.

## Test that proved per-process decryption (not transport corruption)

Restored a known-good \`agent.json.bak.v5\` backup from earlier in the same day → got the exact same AUTH_FAILED. Two different encrypted blobs both unreadable after restart confirms decryption isn't stable across process lifetimes.

## Schema reference for anyone debugging downstream

The bridge auth table is \`api_keys\` (not \`forge_api_keys\` — those are forge-API-key-scoped, separate concern). Schema columns relevant to debugging:

\`\`\`
id, user_id, tenant_id, name, key_hash, key_prefix, key_preview,
scopes[], status, created_at, last_used_at, revoked_at
\`\`\`

\`key_hash\` format: \`<saltHex>.<hashHex>\` where salt=16 bytes, hash=\`pbkdf2_hmac_sha256(plaintext, salt, 100_000, 32)\`. Validation lives in \`apps/forge/src/runtime/agent-bridge.ts::validateBridgeToken\` on the forge side.

## Current recovery recipe (Linux)

1. \`systemctl stop askalf-agent\`
2. Generate plaintext: \`"sk_live_" + secrets.token_urlsafe(32)\`
3. INSERT new \`api_keys\` row with computed \`key_hash\`, same scopes as old
4. Revoke prior row: \`UPDATE api_keys SET status='revoked', revoked_at=NOW() WHERE id=<prev>\`
5. Write agent.json: \`{"apiKey":"<plaintext>","url":"ws://localhost:3005","deviceName":"<name>","_encrypted":false}\`
6. \`chown askalf-agent\` + \`chmod 600\`
7. \`systemctl start askalf-agent\` → expect "Registered as device …" within ~1s

The \`_encrypted: false\` flag avoids the broken re-encrypt path on next shutdown (until the agent encrypts it again on its own shutdown, looping the issue — but it gives you a clean window).

## Suggested fixes (any one resolves)

- **A. Stable on-disk encryption key.** Derive the agent.json encryption key from a persistent secret (e.g. a generated keyfile written once during install and chmod-600 to the agent user). All restarts decrypt with the same key.
- **B. Plaintext fallback when encryption is unreliable.** Detect that the prior encrypted blob can't be safely decrypted with the current process entropy and fall back to writing/reading plaintext. Honest plaintext is better than silent corruption that masquerades as "auto-register has it covered."
- **C. Detect garbage-decrypt and trigger re-pair flow.** When a decrypt produces non-printable bytes / wrong-prefix output, surface that to the user instead of looping in cosmetic auto-register. At minimum the AUTH_FAILED log line should say "re-pair required" with a link to the recovery recipe.

A + C together would be ideal — stable encryption is the real fix, detect-and-surface is the safety net.

## What doesn't work (please don't suggest in the fix discussion)

- \`.bak.v*\` backups are not recoverable — they share the same brokenness.
- "auto-register" mode does not eventually re-pair. It just heartbeats forever against a closed WS.
- Restarting the agent without a recovery plan staged is the failure mode that surfaces this bug.

## Impact

Every \`systemctl restart askalf-agent\` (which is the documented way to apply config / env / version changes) requires a full re-pair on every fleet device. Effectively makes the agent non-restartable in production until this is fixed.

## Tracked in platform memory

This bug was first encountered on 2026-05-12 during alf-platform-1 recovery (v8 mint, sk_live_8EaW…). Documented under platform memory \`reference_askalf_agent_encryption_bug.md\`.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

agent.json apiKey re-encrypted with per-process entropy — restart corrupts credential, requires manual re-pair #15

Summary

Symptom

Root cause

Test that proved per-process decryption (not transport corruption)

Schema reference for anyone debugging downstream

Current recovery recipe (Linux)

Suggested fixes (any one resolves)

What doesn't work (please don't suggest in the fix discussion)

Impact

Tracked in platform memory

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

agent.json apiKey re-encrypted with per-process entropy — restart corrupts credential, requires manual re-pair #15

Description

Summary

Symptom

Root cause

Test that proved per-process decryption (not transport corruption)

Schema reference for anyone debugging downstream

Current recovery recipe (Linux)

Suggested fixes (any one resolves)

What doesn't work (please don't suggest in the fix discussion)

Impact

Tracked in platform memory

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions