Skip to content

agent.json apiKey re-encrypted with per-process entropy — restart corrupts credential, requires manual re-pair #15

@askalf

Description

@askalf

Summary

After any clean shutdown of askalf-agent (`systemctl restart`, host reboot, env change triggering daemon-reload, anything), the agent restarts and fails authentication with the bridge, leaving the device permanently offline. Recovery requires minting a new bridge `api_keys` row and overwriting agent.json with plaintext.

Reproduced on Linux (Hetzner CX33, systemd v3.1.5) and Windows (alf-prod-docker desktop service v3.1.5).

Symptom

```
Connecting to ws://localhost:3005/ws/agent-bridge...
Connected. Registering device...
Server error: [AUTH_FAILED] Invalid or expired token
Disconnected (4001: Invalid token)
Registration timeout — starting heartbeat for auto-register...
```

The "auto-register" mode is cosmetic — it heartbeats but never re-pairs. Device stays offline indefinitely.

Root cause

`~/.askalf/agent.json` stores `{ apiKey, _encrypted: true }`. The agent re-encrypts the apiKey at shutdown using process-local entropy (suspected: env-derived or random-at-init key). On next start, the agent decrypts using a different key derivation, gets garbage, and sends garbage as the Bearer token. The bridge's `validateBridgeToken` finds no matching `key_hash` row in `api_keys` and rejects with AUTH_FAILED.

Test that proved per-process decryption (not transport corruption)

Restored a known-good `agent.json.bak.v5` backup from earlier in the same day → got the exact same AUTH_FAILED. Two different encrypted blobs both unreadable after restart confirms decryption isn't stable across process lifetimes.

Schema reference for anyone debugging downstream

The bridge auth table is `api_keys` (not `forge_api_keys` — those are forge-API-key-scoped, separate concern). Schema columns relevant to debugging:

```
id, user_id, tenant_id, name, key_hash, key_prefix, key_preview,
scopes[], status, created_at, last_used_at, revoked_at
```

`key_hash` format: `.` where salt=16 bytes, hash=`pbkdf2_hmac_sha256(plaintext, salt, 100_000, 32)`. Validation lives in `apps/forge/src/runtime/agent-bridge.ts::validateBridgeToken` on the forge side.

Current recovery recipe (Linux)

  1. `systemctl stop askalf-agent`
  2. Generate plaintext: `"sk_live_" + secrets.token_urlsafe(32)`
  3. INSERT new `api_keys` row with computed `key_hash`, same scopes as old
  4. Revoke prior row: `UPDATE api_keys SET status='revoked', revoked_at=NOW() WHERE id=`
  5. Write agent.json: `{"apiKey":"<plaintext>","url":"ws://localhost:3005","deviceName":"","_encrypted":false}`
  6. `chown askalf-agent` + `chmod 600`
  7. `systemctl start askalf-agent` → expect "Registered as device …" within ~1s

The `_encrypted: false` flag avoids the broken re-encrypt path on next shutdown (until the agent encrypts it again on its own shutdown, looping the issue — but it gives you a clean window).

Suggested fixes (any one resolves)

  • A. Stable on-disk encryption key. Derive the agent.json encryption key from a persistent secret (e.g. a generated keyfile written once during install and chmod-600 to the agent user). All restarts decrypt with the same key.
  • B. Plaintext fallback when encryption is unreliable. Detect that the prior encrypted blob can't be safely decrypted with the current process entropy and fall back to writing/reading plaintext. Honest plaintext is better than silent corruption that masquerades as "auto-register has it covered."
  • C. Detect garbage-decrypt and trigger re-pair flow. When a decrypt produces non-printable bytes / wrong-prefix output, surface that to the user instead of looping in cosmetic auto-register. At minimum the AUTH_FAILED log line should say "re-pair required" with a link to the recovery recipe.

A + C together would be ideal — stable encryption is the real fix, detect-and-surface is the safety net.

What doesn't work (please don't suggest in the fix discussion)

  • `.bak.v*` backups are not recoverable — they share the same brokenness.
  • "auto-register" mode does not eventually re-pair. It just heartbeats forever against a closed WS.
  • Restarting the agent without a recovery plan staged is the failure mode that surfaces this bug.

Impact

Every `systemctl restart askalf-agent` (which is the documented way to apply config / env / version changes) requires a full re-pair on every fleet device. Effectively makes the agent non-restartable in production until this is fixed.

Tracked in platform memory

This bug was first encountered on 2026-05-12 during alf-platform-1 recovery (v8 mint, sk_live_8EaW…). Documented under platform memory `reference_askalf_agent_encryption_bug.md`.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions