fix(crypto): stable keyfile-based config encryption (closes #15) #18
Merged
Daily npm whoami check against the configured registry; opens (and de-dupes via marker comment) a GitHub issue if NPM_TOKEN no longer authenticates, and auto-closes that issue on the first passing run after rotation. Motivated by today's discovery that NPM_TOKEN had been rotated 7 weeks ago (2026-04-01 → present) without updating this repo's secret, which silently stranded v3.1.6 / v3.2.0 / v3.2.1 from reaching npm. Anyone running 'npm i -g @askalf/agent' got the stale v3.1.5 — including the WS-subprotocol crash fix the v3.2.1 release was specifically shipping. Pattern mirrors the existing @askalf/dario npm-token-health.yml; issue body includes the gh-secret-set + gh-run-rerun recipe so the recovery is self-documenting for any future incident.
Pre-3.3 the agent.json apiKey was encrypted under a key derived from a SHA-256 hash that included every currently-up non-loopback interface's MAC address. networkInterfaces() returns *currently up* interfaces, and i.internal only excludes loopback: docker0, veth*, br-*, tun0, wg0 all showed up. On any host with Docker (every askalf-fleet device), a VPN (searxng routes through ProtonVPN), or Wi-Fi MAC randomization (Win 10/11 default), the MAC list churned between agent process lifetimes, the derived key changed, AES-GCM auth-tag verification then failed at decrypt time, and loadConfig's catch-all swallowed the failure as 'old plaintext config', returning the still-encrypted ciphertext as the Bearer token. Result: AUTH_FAILED on every restart, indefinitely, with no useful diagnostic and a manual mint-and-replace recovery dance.

Fix is the issue's recommended 'A + C together':

A. Stable on-disk keyfile. Generate 32 random bytes once at ~/.askalf/keyfile (chmod 600), use those bytes directly as the AES-256-GCM key. No derivation, no machine-state dependency, stable across every restart and reconfiguration. Race-safe creation via open(O_EXCL).

C. Detect 'wrong key' vs 'legacy plaintext' in loadConfig. The old path was 'if decrypt throws, return raw', which conflated 'pre-encryption-existed config' with 'valid encrypted but key changed'. Now we first check the _encrypted flag and only treat un-flagged configs as plaintext. Flagged blobs that won't decrypt surface a clear re-pair error pointing at #15.

Migration: on upgrade, if the new stable key fails, the legacy machine-derivation is tried once as a fallback. If it succeeds (network state happened to be stable since the config was written), the config is transparently re-saved under the stable key and the host is healed forever. If both fail, the operator gets the re-pair error with the recovery recipe link instead of silent AUTH_FAILED.
Smoke-tested locally:
- Round-trip via stable keyfile: PASS
- Keyfile created at ~/.askalf/keyfile with 32 bytes: PASS
- Wrong-key decrypt throws (instead of returning ciphertext as Bearer): PASS
Closes #15.
TL;DR
The pre-3.3 config encryption key was a SHA-256 hash that included every currently-up non-loopback MAC address. Docker/VPN/Wi-Fi churn → MAC list changed between agent process lifetimes → derived key changed → AES-GCM auth tag failed → loadConfig's catch-all returned the encrypted ciphertext as the Bearer token → AUTH_FAILED on every systemctl restart, every reboot, every config change, with no recoverable diagnostic.

Fix: switch to a randomly-generated 32-byte keyfile at ~/.askalf/keyfile (chmod 600), stable across every restart. Plus a one-shot legacy-key migration on upgrade for hosts where the old key still happens to work, and a clear re-pair error for hosts where it doesn't.

Root cause walkthrough
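To make the key-churn failure concrete, here is a minimal sketch of how a MAC-derived key changes when one interface disappears. `deriveKeyFromMacs` is a hypothetical stand-in, not the code from src/crypto.ts: the real derivation's exact inputs and ordering are not shown in this PR, so only the dependency on the MAC list is modelled here.

```typescript
import { createHash } from "node:crypto";

// Hypothetical stand-in for the legacy derivation: SHA-256 over the MAC list.
// Assumption: the real function may hash additional inputs; only the
// MAC-list dependency matters for this illustration.
function deriveKeyFromMacs(macs: string[]): Buffer {
  return createHash("sha256").update(macs.join(",")).digest();
}

// Same host, two agent process lifetimes. The second MAC is an invented
// docker0-style address that is only present while Docker's bridge is up.
const beforeRestart = deriveKeyFromMacs(["aa:bb:cc:00:00:01", "02:42:ac:11:00:02"]);
const afterRestart = deriveKeyFromMacs(["aa:bb:cc:00:00:01"]);

console.log(beforeRestart.equals(afterRestart)); // false
```

Any ciphertext written under the first key fails AES-GCM tag verification under the second, which is the failure the rest of this section walks through.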
src/crypto.ts:deriveMachineKey() ran on every agent start. i.internal only excludes loopback; docker0, veth*, br-*, tun0, wg0 all sail through the filter. The MAC list reorders/shrinks/grows whenever Docker, a VPN, or Wi-Fi MAC randomization changes the set of up interfaces.

And the silent-corruption mechanism, src/cli.ts:loadConfig pre-fix: if decrypt threw, the catch-all returned the raw stored value. Whether the catch saw "this file was never encrypted in the first place" or "this file was encrypted with a different key" was indistinguishable, so the agent happily kept booting and sending the still-encrypted base64 blob as Authorization: Bearer <blob> until the bridge slammed the WS shut.

Fix design (issue's "A + C together")
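As a rough illustration of part A (the stable keyfile), the read-or-create step could look like the sketch below. This is not the merged implementation: the `keyPath` parameter is an assumption added so the demo can run against a temp directory instead of the real ~/.askalf/keyfile, and the error message is invented.

```typescript
import { randomBytes } from "node:crypto";
import { closeSync, mkdirSync, mkdtempSync, openSync, readFileSync, writeSync } from "node:fs";
import { homedir, tmpdir } from "node:os";
import { dirname, join } from "node:path";

// Sketch only: read the keyfile, creating it on first use.
function getStableKey(keyPath: string = join(homedir(), ".askalf", "keyfile")): Buffer {
  mkdirSync(dirname(keyPath), { recursive: true });
  try {
    // "wx" opens with O_CREAT | O_EXCL, so exactly one process creates the file;
    // 0o600 restricts it to the owner on POSIX systems.
    const fd = openSync(keyPath, "wx", 0o600);
    try {
      writeSync(fd, randomBytes(32));
    } finally {
      closeSync(fd);
    }
  } catch (err) {
    // EEXIST means another run (or process) already created the key: reuse it.
    if ((err as { code?: string }).code !== "EEXIST") throw err;
  }
  const key = readFileSync(keyPath);
  // A reader that loses the creation race could observe a not-yet-written file;
  // this sketch reports that instead of using a short key.
  if (key.length !== 32) throw new Error(`keyfile at ${keyPath} is not 32 bytes`);
  return key;
}

// Demo against a throwaway directory so the real keyfile is never touched.
const demoPath = join(mkdtempSync(join(tmpdir(), "askalf-demo-")), "keyfile");
const first = getStableKey(demoPath);
const second = getStableKey(demoPath);
console.log(first.length, first.equals(second)); // 32 true
```

The key bytes are used directly as the AES-256-GCM key, so nothing about the host's network state can change them between restarts.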
A: stable on-disk keyfile (src/crypto.ts)

- getStableKey() reads or creates ~/.askalf/keyfile: 32 random bytes, chmod 600, O_EXCL write to be race-safe against parallel installs.
- deriveMachineKey() renamed to deriveLegacyMachineKey(), exported, used only for one-shot migration. Comment in the file explains why it can never be the primary key again.
- encrypt/decrypt default key changed; old optional key?: Buffer parameter preserved for migration callers.

C: loadConfig distinguishes wrong-key vs legacy-plaintext (src/cli.ts)

Migration story
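- Encrypted config, legacy machine-derived key still decrypts: transparently re-saved under the stable key.
- Encrypted config, neither key decrypts: clear re-pair error pointing at #15 instead of silent AUTH_FAILED.
- Legacy plaintext config (no _encrypted flag, or _encrypted: false): read as plaintext.

No data loss in any case: the recovery recipe in #15 is one INSERT + one file write, ~30 seconds.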
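The decision order described above (flag check, stable key, one-shot legacy fallback, re-pair error) can be sketched as follows. Everything here is illustrative: `resolveApiKey`, the `StoredConfig` shape, and the blob layout (base64 of a 12-byte IV, 16-byte tag, then ciphertext) are assumptions, not the actual src/cli.ts or src/crypto.ts code.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// Assumed blob layout, not confirmed by the PR: base64(iv[12] | tag[16] | ciphertext).
function encrypt(plain: string, key: Buffer): string {
  const iv = randomBytes(12);
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const body = Buffer.concat([cipher.update(plain, "utf8"), cipher.final()]);
  return Buffer.concat([iv, cipher.getAuthTag(), body]).toString("base64");
}

function decrypt(blob: string, key: Buffer): string {
  const raw = Buffer.from(blob, "base64");
  const decipher = createDecipheriv("aes-256-gcm", key, raw.subarray(0, 12));
  decipher.setAuthTag(raw.subarray(12, 28));
  // final() throws when the auth tag does not verify, e.g. under the wrong key.
  return Buffer.concat([decipher.update(raw.subarray(28)), decipher.final()]).toString("utf8");
}

interface StoredConfig { apiKey: string; _encrypted?: boolean }

// Hypothetical helper: returns the usable apiKey plus whether the caller
// should re-save the config under the stable key.
function resolveApiKey(cfg: StoredConfig, stableKey: Buffer, legacyKey: Buffer) {
  // Only un-flagged configs are treated as legacy plaintext.
  if (!cfg._encrypted) return { apiKey: cfg.apiKey, migrated: false };
  try {
    return { apiKey: decrypt(cfg.apiKey, stableKey), migrated: false };
  } catch {
    // Stable key failed: fall through to the one-shot legacy attempt.
  }
  try {
    return { apiKey: decrypt(cfg.apiKey, legacyKey), migrated: true };
  } catch {
    // Never hand the ciphertext back as if it were a token.
    throw new Error("agent.json apiKey cannot be decrypted; re-pair this device (see #15)");
  }
}

const stableKey = randomBytes(32);
const legacyKey = randomBytes(32);
const healed = resolveApiKey(
  { apiKey: encrypt("tok_123", legacyKey), _encrypted: true },
  stableKey,
  legacyKey,
);
console.log(healed); // { apiKey: 'tok_123', migrated: true }
```

The property that matters is that a flagged blob which fails under both keys raises an error, so the ciphertext can no longer be sent as the Bearer token.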
Smoke tests (run locally before push)
After merge
- release: published and will ship to npm now that the token + repo-public fixes from earlier today are in place.
- npm i -g @askalf/agent@3.3.0, then re-pair each device once (per the recipe in #15, "agent.json apiKey re-encrypted with per-process entropy, restart corrupts credential, requires manual re-pair"). Future restarts won't need this dance ever again.

Out of scope (deliberate)