fix(crypto): stable keyfile-based config encryption (closes #15)#18

Merged: askalf merged 2 commits into main from fix/15-stable-encryption-keyfile on May 22, 2026

askalf (Owner) commented May 22, 2026

Closes #15.

TL;DR

The pre-3.3 config encryption key was derived from a SHA-256 hash over machine identifiers that included every currently-up non-loopback MAC address. Docker/VPN/Wi-Fi churn → MAC list changed between agent process lifetimes → derived key changed → AES-GCM auth tag failed → loadConfig's catch-all returned the encrypted ciphertext as the Bearer token → AUTH_FAILED on every systemctl restart, every reboot, every config change, with no useful diagnostic.

Fix: switch to a randomly-generated 32-byte keyfile at ~/.askalf/keyfile (chmod 600) — stable across every restart. Plus a one-shot legacy-key migration on upgrade for hosts where the old key still happens to work, and a clear re-pair error for hosts where it doesn't.

Root cause walkthrough

src/crypto.ts:deriveMachineKey() ran on every agent start:

const macs = Object.values(networkInterfaces()).flat()
  .filter(i => !!i && !i.internal && i.mac !== '00:00:00:00:00:00')
  .map(i => i.mac).sort();
const machineId = [hostname(), user, arch, platform, macs.join(',')].join(':');
return scryptSync(sha256(machineId), 'askalf-agent-v2', 32);

i.internal only excludes loopback. docker0, veth*, br-*, tun0, wg0 all sail through the filter. The MAC list reorders/shrinks/grows on:

  • Docker container start/stop (every askalf-fleet device runs Docker — Hetzner CX33 has the 8-service compose stack; alf-prod-docker also runs containers)
  • ProtonVPN/gluetun reconnecting (searxng routes through ProtonVPN)
  • Wi-Fi reconnecting with MAC-randomization (Win 10/11 default)
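
To make the churn concrete, here is a minimal sketch of the same style of derivation with the MAC list passed in explicitly. deriveKeyFromMacs is a hypothetical stand-in, not the real deriveMachineKey(): it drops the user/arch/platform fields and assumes the SHA-256 digest is fed to scrypt as raw bytes. The MAC addresses are made up.

```typescript
import { createHash, scryptSync } from 'node:crypto';

// Hypothetical stand-in for the pre-3.3 derivation. The MAC list is an
// explicit argument so the effect of interface churn is deterministic.
function deriveKeyFromMacs(host: string, macs: string[]): Buffer {
  const machineId = [host, [...macs].sort().join(',')].join(':');
  const digest = createHash('sha256').update(machineId).digest();
  return scryptSync(digest, 'askalf-agent-v2', 32);
}

// Made-up addresses: the second list models a veth interface appearing
// when a container starts.
const beforeDocker = ['aa:bb:cc:00:00:01'];
const afterDocker = ['aa:bb:cc:00:00:01', '02:42:ac:11:00:02'];

const keyA = deriveKeyFromMacs('host-1', beforeDocker);
const keyA2 = deriveKeyFromMacs('host-1', beforeDocker);
const keyB = deriveKeyFromMacs('host-1', afterDocker);

console.log(keyA.equals(keyA2)); // true: same interfaces, same key
console.log(keyA.equals(keyB));  // false: one extra interface, different key
```

Any config encrypted under keyA becomes undecryptable the moment the agent restarts and derives keyB.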

And the silent-corruption mechanism — src/cli.ts:loadConfig pre-fix:

try {
  return decryptConfig(raw);
} catch {
  // Backwards compatibility: old plaintext config files
  return raw;            // ← returns the encrypted ciphertext unchanged
}

The catch block could not tell "this file was never encrypted in the first place" apart from "this file was encrypted with a different key", so the agent happily kept booting and sending the still-encrypted base64 blob as Authorization: Bearer <blob> until the bridge slammed the WS shut.
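
The failure mode can be reproduced in miniature. This is a sketch, not the code in src/crypto.ts: the Sealed blob layout and the names seal, unseal, and loadApiKeyPreFix are assumptions made for illustration. Only the AES-256-GCM behaviour (a wrong key fails auth-tag verification and throws) matches what the PR describes.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from 'node:crypto';

// Hypothetical blob layout: IV, auth tag, and ciphertext as base64 strings.
interface Sealed { iv: string; tag: string; data: string }

function seal(plain: string, key: Buffer): Sealed {
  const iv = randomBytes(12);
  const cipher = createCipheriv('aes-256-gcm', key, iv);
  const data = Buffer.concat([cipher.update(plain, 'utf8'), cipher.final()]);
  return {
    iv: iv.toString('base64'),
    tag: cipher.getAuthTag().toString('base64'),
    data: data.toString('base64'),
  };
}

function unseal(blob: Sealed, key: Buffer): string {
  const decipher = createDecipheriv('aes-256-gcm', key, Buffer.from(blob.iv, 'base64'));
  decipher.setAuthTag(Buffer.from(blob.tag, 'base64'));
  // final() throws if the auth tag does not verify, e.g. under the wrong key.
  const out = Buffer.concat([decipher.update(Buffer.from(blob.data, 'base64')), decipher.final()]);
  return out.toString('utf8');
}

// Pre-fix behaviour in miniature: any decrypt failure falls back to the raw value.
function loadApiKeyPreFix(blob: Sealed, key: Buffer): string {
  try {
    return unseal(blob, key);
  } catch {
    return blob.data; // ciphertext escapes as if it were the API key
  }
}

const writeKey = randomBytes(32);
const restartKey = randomBytes(32); // stands in for a key re-derived after MAC churn
const blob = seal('secret-api-key', writeKey);

console.log(loadApiKeyPreFix(blob, writeKey) === 'secret-api-key'); // true
console.log(loadApiKeyPreFix(blob, restartKey) === blob.data);      // true: ciphertext returned
```

The second line is the bug: the caller gets a string back either way and has no signal that decryption failed.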

Fix design (issue's "A + C together")

A — stable on-disk keyfile (src/crypto.ts)

  • New getStableKey() reads or creates ~/.askalf/keyfile: 32 random bytes, chmod 600, O_EXCL write to be race-safe against parallel installs.
  • Old deriveMachineKey() renamed to deriveLegacyMachineKey(), exported, used only for one-shot migration. Comment in the file explains why it can never be the primary key again.
  • encrypt / decrypt now default to the stable key; the existing optional key?: Buffer parameter is preserved for migration callers.
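
A minimal sketch of the read-or-create pattern described for getStableKey(), assuming Node's 'wx' open flag as the O_EXCL mechanism. The function name getStableKeySketch and its dir parameter are hypothetical (the parameter exists only so the example can run against a scratch directory); the merged implementation may differ.

```typescript
import { randomBytes } from 'node:crypto';
import { closeSync, mkdirSync, mkdtempSync, openSync, readFileSync, writeSync } from 'node:fs';
import { homedir, tmpdir } from 'node:os';
import { join } from 'node:path';

// Sketch only: `dir` is a hypothetical parameter for testability.
function getStableKeySketch(dir: string = join(homedir(), '.askalf')): Buffer {
  const keyPath = join(dir, 'keyfile');
  mkdirSync(dir, { recursive: true, mode: 0o700 });
  try {
    // 'wx' includes O_CREAT | O_EXCL: the open fails with EEXIST if the file
    // already exists, so two racing processes cannot both write a key.
    const fd = openSync(keyPath, 'wx', 0o600);
    try {
      const key = randomBytes(32);
      writeSync(fd, key);
      return key;
    } finally {
      closeSync(fd);
    }
  } catch (err) {
    if ((err as NodeJS.ErrnoException).code !== 'EEXIST') throw err;
  }
  // Someone else created the file first: read their key. A reader that gets
  // here before the winner has finished writing hits the length check below;
  // a production version might retry instead of throwing.
  const key = readFileSync(keyPath);
  if (key.length !== 32) throw new Error(`keyfile at ${keyPath} is not 32 bytes`);
  return key;
}

// Usage against a throwaway directory rather than the real ~/.askalf.
const scratch = mkdtempSync(join(tmpdir(), 'askalf-sketch-'));
const first = getStableKeySketch(scratch);
const second = getStableKeySketch(scratch);
console.log(first.length, first.equals(second)); // 32 true
```

The first call creates the keyfile and the second reads it back, which is the property that makes the key stable across restarts.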

C — loadConfig distinguishes wrong-key vs legacy-plaintext (src/cli.ts)

if (!raw['_encrypted']) return raw as AgentConfig;          // legacy plaintext

try {
  return decryptConfig(raw);                                  // stable v3.3+ key
} catch {
  try {
    const migrated = decryptConfig(raw, deriveLegacyMachineKey());
    saveConfig(migrated);                                     // transparent migration
    return migrated;
  } catch {
    throw new Error('Could not decrypt ... re-pair: link to #15');
  }
}

Migration story

| Pre-3.3 install state | First load under 3.3 | Outcome |
|---|---|---|
| Plaintext (_encrypted: false) | Returned as-is, encrypted with stable key on next save | Healed |
| Encrypted, network state stable since write | Legacy key decrypts, re-saved under stable key | Healed transparently, logs migration line |
| Encrypted, network state churned | Both keys fail | Re-pair error with #15 link (instead of silent AUTH_FAILED) |

No data loss in any case — the recovery recipe in #15 is one INSERT + one file write, ~30 seconds.
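
The three migration cases can be exercised without real crypto or disk I/O by injecting the decrypt and save steps. The sketch below is illustrative only: loadWithMigration, RawConfig, Outcome, and the stub functions are hypothetical names, not part of the PR.

```typescript
type RawConfig = Record<string, unknown>;
type Outcome = 'plaintext' | 'stable-key' | 'migrated';

// Hypothetical helper mirroring the load logic, with dependencies injected.
function loadWithMigration(
  raw: RawConfig,
  decryptStable: (raw: RawConfig) => RawConfig,
  decryptLegacy: (raw: RawConfig) => RawConfig,
  persist: (config: RawConfig) => void,
): { config: RawConfig; outcome: Outcome } {
  if (!raw['_encrypted']) return { config: raw, outcome: 'plaintext' };

  try {
    return { config: decryptStable(raw), outcome: 'stable-key' };
  } catch {
    // Stable key failed: fall through to the one-shot legacy attempt.
  }

  let migrated: RawConfig | undefined;
  try {
    migrated = decryptLegacy(raw);
  } catch {
    migrated = undefined;
  }
  if (migrated === undefined) {
    throw new Error('Could not decrypt config with stable or legacy key; re-pair required');
  }
  persist(migrated); // re-save under the stable key
  return { config: migrated, outcome: 'migrated' };
}

// Stubs standing in for real decryption and saveConfig.
const alwaysFails = (): never => { throw new Error('auth tag mismatch'); };
const alwaysWorks = (): RawConfig => ({ apiKey: 'k' });
const persisted: RawConfig[] = [];
const persist = (config: RawConfig): void => { persisted.push(config); };

console.log(loadWithMigration({ apiKey: 'k' }, alwaysFails, alwaysFails, persist).outcome);       // plaintext
console.log(loadWithMigration({ _encrypted: true }, alwaysWorks, alwaysFails, persist).outcome);  // stable-key
console.log(loadWithMigration({ _encrypted: true }, alwaysFails, alwaysWorks, persist).outcome);  // migrated
```

One deliberate difference in this sketch: the persist call sits outside the legacy try block, so a failed save would surface as its own error rather than being reported as a decryption failure.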

Smoke tests (run locally before push)

PASS round-trip via stable keyfile
PASS keyfile created at ~/.askalf/keyfile (32 bytes)
PASS wrong-key decrypt throws (instead of returning garbage)

After merge

  1. Cut release v3.3.0 — publish.yml fires on release: published and will ship to npm now that the token + repo-public fixes from earlier today are in place.
  2. On Hetzner + alf-prod-docker: npm i -g @askalf/agent@3.3.0, then re-pair each device once (per the recipe in #15). Future restarts won't need this dance ever again.
  3. Close #15 with a link to the published release.

Out of scope (deliberate)

  • No test infrastructure added — the repo has none, and adding tsx/vitest/etc. is a separate concern. The smoke test was sufficient to verify the change locally; the code is small enough to read.
  • No CHANGELOG backfill for v3.1.6 / v3.2.0 / v3.2.1 (which had stale CHANGELOG entries pre-this-PR). Worth a separate housekeeping PR.
  • No README update yet for the keyfile location / security implications. Worth a follow-up.

askalf added 2 commits May 22, 2026 12:09
Daily npm whoami check against the configured registry; opens (and
de-dupes via marker comment) a GitHub issue if NPM_TOKEN no longer
authenticates, and auto-closes that issue on the first passing run
after rotation.

Motivated by today's discovery that NPM_TOKEN had been rotated 7
weeks ago (2026-04-01 → present) without updating this repo's
secret, which silently stranded v3.1.6 / v3.2.0 / v3.2.1 from
reaching npm. Anyone running 'npm i -g @askalf/agent' got the stale
v3.1.5 — including the WS-subprotocol crash fix the v3.2.1 release
was specifically shipping.

Pattern mirrors the existing @askalf/dario npm-token-health.yml;
issue body includes the gh-secret-set + gh-run-rerun recipe so the
recovery is self-documenting for any future incident.

Pre-3.3 the agent.json apiKey was encrypted under a key derived from a
SHA-256 hash that included every currently-up non-loopback interface's
MAC address. networkInterfaces() returns *currently up* interfaces, and
i.internal only excludes loopback — docker0, veth*, br-*, tun0, wg0 all
showed up. On any host with Docker (every askalf-fleet device), a VPN
(searxng routes through ProtonVPN), or Wi-Fi MAC randomization (Win
10/11 default), the MAC list churned between agent process lifetimes,
the derived key changed, AES-GCM auth-tag verification then failed at
decrypt time, and loadConfig's catch-all swallowed the failure as 'old
plaintext config' — returning the still-encrypted ciphertext as the
Bearer token. Result: AUTH_FAILED on every restart, indefinitely, with
no useful diagnostic and a manual mint-and-replace recovery dance.

Fix is the issue's recommended 'A + C together':

A. Stable on-disk keyfile. Generate 32 random bytes once at
   ~/.askalf/keyfile (chmod 600), use those bytes directly as the
   AES-256-GCM key. No derivation, no machine-state dependency, stable
   across every restart and reconfiguration. Race-safe creation via
   open(O_EXCL).

C. Detect 'wrong key' vs 'legacy plaintext' in loadConfig. The old
   path was 'if decrypt throws, return raw' — which conflated 'pre-
   encryption-existed config' with 'valid encrypted but key changed'.
   Now we first check the _encrypted flag and only treat un-flagged
   configs as plaintext. Flagged blobs that won't decrypt surface a
   clear re-pair error pointing at #15.

Migration: on upgrade, if the new stable key fails, the legacy
machine-derivation is tried once as a fallback. If it succeeds (network
state happened to be stable since the config was written), the config
is transparently re-saved under the stable key and the host is healed
forever. If both fail, the operator gets the re-pair error with the
recovery recipe link instead of silent AUTH_FAILED.

Smoke-tested locally:
- Round-trip via stable keyfile: PASS
- Keyfile created at ~/.askalf/keyfile with 32 bytes: PASS
- Wrong-key decrypt throws (instead of returning ciphertext as Bearer): PASS

askalf merged commit 7797d9f into main on May 22, 2026. 3 checks passed.
askalf deleted the fix/15-stable-encryption-keyfile branch on May 22, 2026 at 16:26.

Development

Successfully merging this pull request may close these issues.

agent.json apiKey re-encrypted with per-process entropy — restart corrupts credential, requires manual re-pair