feat(broker): credential broker laptop CLI + Ansible roles (spec 005) #32
pofallon wants to merge 7 commits
Remove long-lived developer credentials from Remo instances. Provisioning
creds now come from laptop `fnox` (FR-006), per-instance bootstrap tokens
are delivered via provider-specific transport (SSH push / IMDS / bind-mount),
and a per-project allowlist manifest plus on-instance broker socket gate
secret access from devcontainers.
Cross-repo dependency: `BROKER_PINNED_VERSION = "0.1.0"` — the Rust broker
daemon is owned by `get2knowio/remo-broker` and consumed as a signed binary
release.
CLI surface:
remo init --backend {1password|vault|aws-sm|age-git}
remo incus add-node / remo proxmox add-node
remo audit <instance> [--tail|--since|--json]
remo rotate-bootstrap [<instance>|--all] [--force]
remo destroy — pre-deletion bootstrap-token revoke (FR-020, exit 5)
passive overdue-rotation reminder on every invocation
Implementation breakdown (phases 1–9, all tasks in tasks.md):
• core: fnox subprocess wrapper, nodes.yml registry, manifest TOML
discover/synthesize/validate, devcontainer socket helper + language
auto-synthesis, audit JSON-lines parser, broker mint/revoke dispatcher
for 1Password SCIM / Vault accessor / AWS IAM / age-git
• providers: per-developer-per-region AWS IAM instance profile with
scoped secretsmanager:GetSecretValue, Hetzner SSH stdin push, Incus
`lxc config device add … readonly=true`, Proxmox `pct set -mp0 …,ro=1`
• ansible: broker_install role (binary download + SHA-256 verify + systemd
unit with LoadCredential), 3× bootstrap_token_* assertion roles,
group_vars migrated to `lookup('pipe', 'fnox get …')`
• polish: docs/credential-broker.md (threat model + runbook),
scripts/grep-credential-leaks.sh enforcing FR-005/FR-006, version
bump 2.0.0rc4 → 2.1.0rc1
Spec & supporting work:
• Authoritative spec moved from docs/remo-fnox-spec.md to
specs/005-credential-broker/ alongside plan/research/data-model/
contracts/quickstart/tasks
• Pre-existing 005-provider-snapshots specs renumbered to 007 to free
the 005 slot
Tests: 644 → 711 (+67 new). Ruff clean. grep gate clean. mypy clean
modulo pre-existing missing third-party stubs.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
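For reviewers who want the `broker_install` SHA-256 verify step spelled out, here is a minimal Python sketch of the check. The real role does this in Ansible; the function names `parse_sha256_body` and `verify_binary` are hypothetical, and the accepted body shapes (`<hex>` or `<hex>  <filename>`) are an assumption about the published `.sha256` files.

```python
import hashlib
import re

# A SHA-256 digest is exactly 64 hexadecimal characters.
_HEX64 = re.compile(r"[0-9a-fA-F]{64}")


def parse_sha256_body(body: str) -> str:
    """Extract the digest from a .sha256 file body.

    Assumed shapes: "<hex>" or "<hex>  <filename>". Anything else,
    including an empty or whitespace-only body, is rejected so that a
    bad checksum file can never turn verification into a no-op.
    """
    fields = body.split()
    if not fields or not _HEX64.fullmatch(fields[0]):
        raise ValueError("checksum body is not a 64-char hex SHA-256 digest")
    return fields[0].lower()


def verify_binary(data: bytes, sha256_body: str) -> None:
    """Raise ValueError unless `data` hashes to the published digest."""
    expected = parse_sha256_body(sha256_body)
    actual = hashlib.sha256(data).hexdigest()
    if actual != expected:
        raise ValueError("SHA-256 mismatch for downloaded binary")
```

The point of parsing before comparing is that a malformed checksum file fails loudly instead of being passed on as an empty checksum.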
Closes the defects surfaced by the high-effort code review of #32. Most-severe-first highlights:

1. Gate broker_install role on REMO_BROKER_BACKEND so missing v0.1.0 release URL doesn't 404 every `remo {provider} create`.
2. Never format mint-call payloads into BackendError messages; surface field-presence info only so 1P/Vault schema drift can't leak the freshly-minted token to shell scrollback / CI logs.
3. Wrap `lookup('pipe', 'fnox get ...')` in tolerant defaults plus preflight asserts so eager pipe-lookup failures don't crash before the friendly-error path can fire.
4. Subclass click.Group to wrap invoke() in try/finally so the passive-update + overdue-rotation hook fires past subcommand sys.exit() (it was completely dead).
5. Per-instance AWS broker IAM role (not per-developer); a destroy on one instance can no longer break IMDS creds on a sibling.
6. Rename Hetzner labels remo:* → remo_* (the colon form is rejected by the cloud API; FR-020 revoke was silently broken).
7. Distinguish "no token to revoke" (None, success) from "lookup failed" (raise; honor --force) in core/broker_revoke.py.
8. Wire revoke_before_destroy + --force-broker into AWS/Incus/Proxmox destroy (was Hetzner-only).
9. Gate the remo-broker restart handler on broker_token_present so a fresh install doesn't transition the service into failed state.
10. Drop the empty `touch /usr/local/libexec/remo-broker-tokens` that blocks the real Ansible-installed helper (force: false → true on the copy task).
11. Validate the broker .sha256 body is a 64-char hex digest before handing it to get_url; whitespace-only bodies silently disabled integrity checking.
12. Add core/broker_config.{get_backend,get_admin_sa_fnox_key} with env-first/file-fallback resolution so `remo init` is no longer a no-op.
13. Replace the regex-based JSONC stripper with a string-aware tokenizer that doesn't corrupt `"path": "a//b"`.
14. Normalize _parse_iso/_parse_ts to aware-UTC so bare-ISO timestamps don't crash `remo rotate-bootstrap` / `remo audit --since`.
15. Fetch Hetzner-reported SSH host keys and verify via ssh-keyscan before piping the bootstrap token on stdin; close the accept-new MITM window on freshly-allocated public IPs.

Also dropped the backwards-compat fallback for the colon-form Hetzner label keys per direct user guidance (only the underscore form is canonical).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
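To illustrate item 13, here is a minimal sketch of a string-aware JSONC comment stripper. This is not the code from the branch; `strip_jsonc_comments` is a hypothetical name, and the sketch handles only `//` and `/* */` comments (trailing commas are out of scope).

```python
import json


def strip_jsonc_comments(text: str) -> str:
    """Remove // and /* */ comments, leaving string literals untouched."""
    out: list[str] = []
    i, n = 0, len(text)
    in_string = False
    while i < n:
        ch = text[i]
        if in_string:
            out.append(ch)
            if ch == "\\" and i + 1 < n:
                # Copy the escaped character verbatim so an escaped quote
                # (\") does not end the string early.
                out.append(text[i + 1])
                i += 2
                continue
            if ch == '"':
                in_string = False
            i += 1
        elif ch == '"':
            in_string = True
            out.append(ch)
            i += 1
        elif text.startswith("//", i):
            # Line comment: skip to end of line but keep the newline itself.
            while i < n and text[i] != "\n":
                i += 1
        elif text.startswith("/*", i):
            # Block comment: skip past the closing marker (or to EOF).
            end = text.find("*/", i + 2)
            i = n if end == -1 else end + 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


sample = '{\n  // dev settings\n  "path": "a//b", /* inline */ "n": 1\n}'
parsed = json.loads(strip_jsonc_comments(sample))
```

Because the scanner tracks whether it is inside a string, the `//` in `"a//b"` is copied through instead of being treated as a comment start, which is the failure mode a regex-only approach runs into.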
Adds the missing link in `remo rotate-bootstrap`: after minting a fresh sub-token at the backend, push it to /etc/remo-broker/bootstrap-token on the instance and call the broker daemon's `rotate-bootstrap` admin-socket op (per get2knowio/remo-broker docs/wire-protocol.md) so the daemon atomically swaps to the fresh session before the old token is revoked. Previously the mint succeeded and the old token was revoked but the new token was never delivered, leaving the instance broker serving against a revoked credential.

- New core/broker_admin.py: NDJSON-over-SSH bridge to the admin Unix socket (mode 0600 root-owned), using a small `sudo python3 -c` shim to avoid a socat / nc dependency on the instance.
- cli/rotate.py: drop the SIGHUP TODO; add `_deliver_and_reload(host, token)` that pushes + calls rotate-bootstrap for Hetzner (only fully wired provider today). Non-Hetzner providers emit a clear warning; AWS-SM (no on-disk token) still issues the admin reload so the broker re-fetches creds from IMDS.
- On delivery failure: do NOT revoke the previous token (broker is still serving with it) and return exit 7.

Tests: 6 cover the admin-socket client (happy path, broker error, SSH transport failure, empty/garbage response, default options); 3 cover the rotate wiring (Hetzner push+reload, unsupported-provider warning, delivery-failure aborts revoke).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
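The ordering described above (mint, deliver + reload, then revoke the old token only on success) can be sketched as follows. This is a simplified illustration, not the code in cli/rotate.py: `RotationSteps` and `rotate` are hypothetical names, the three callables stand in for the backend mint, the push plus admin-socket reload, and the backend revoke, and catching a bare `Exception` is a shortcut for the sketch.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

EXIT_OK = 0
EXIT_DELIVERY_FAILED = 7  # exit code named in the commit message above


@dataclass
class RotationSteps:
    mint: Callable[[], Tuple[str, str]]        # returns (token_id, token)
    deliver_and_reload: Callable[[str], None]  # push token, then admin reload
    revoke: Callable[[str], None]              # revoke a token by its id


def rotate(steps: RotationSteps, previous_token_id: Optional[str]) -> int:
    """Run one rotation; the old token is revoked only after delivery succeeds."""
    _new_id, new_token = steps.mint()
    try:
        steps.deliver_and_reload(new_token)
    except Exception:
        # The broker is still serving with the previous token, so it must
        # stay valid. What happens to the undelivered new token is not
        # covered by this sketch.
        return EXIT_DELIVERY_FAILED
    if previous_token_id is not None:
        steps.revoke(previous_token_id)
    return EXIT_OK
```

The invariant worth testing is that `revoke` is never reached when delivery raises, which is the bug this commit fixes.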
Adds `--cadence-days N` to `remo {hetzner,aws,incus,proxmox} create`
and persists the value in the provider's native metadata primitive so
the passive overdue-rotation reminder and `remo rotate-bootstrap` can
respect per-instance cadence overrides (FR-021).
Provider coverage:
- Hetzner: writes `remo_rotation_cadence_days` server label; rotate-
bootstrap now also writes `remo_last_rotation_at` +
`remo_bootstrap_token_id` labels on success.
- AWS: writes `remo:rotation-cadence-days` EC2 instance tag; rotate-
bootstrap writes `remo:last-rotation-at` tag. `_read_rotation_metadata`
extended to read these via `ec2.describe_tags`. token_id stays
derived from the per-instance role name (not stored in tags).
- Incus: writes `user.remo.rotation_cadence_days` container config via
`incus config set` over SSH. Rotation flow itself remains deferred.
- Proxmox: LXC lacks a clean `user.*` primitive — flag emits a clear
deferred-warning rather than half-implementing it. Persistence +
rotation for Proxmox is a follow-up.
Tests: 5 new in tests/unit/providers/test_cadence_writes.py (one per
provider, including the deferred-warning case for Proxmox) and 3 in
tests/unit/cli/test_rotate.py covering the post-rotation metadata
write paths and the AWS tag reader.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
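As a rough illustration of how a per-instance cadence feeds the overdue-rotation reminder, here is a minimal sketch. `parse_ts`, `is_overdue` and `DEFAULT_CADENCE_DAYS` are hypothetical names, and the default of 30 days is an assumption (the PR does not state the default). Naive timestamps are treated as UTC, in the spirit of the aware-UTC normalization from the review fixes.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

DEFAULT_CADENCE_DAYS = 30  # assumption: not taken from the PR


def parse_ts(value: str) -> datetime:
    """Parse an ISO-8601 timestamp into an aware UTC datetime."""
    if value.endswith("Z"):
        # Older Python versions do not accept a trailing "Z" in fromisoformat.
        value = value[:-1] + "+00:00"
    dt = datetime.fromisoformat(value)
    if dt.tzinfo is None:
        # Bare ISO timestamps are interpreted as UTC.
        return dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc)


def is_overdue(last_rotation_at: str, cadence_days: Optional[int],
               now: datetime) -> bool:
    """True when more than `cadence_days` have passed since the last rotation."""
    cadence = DEFAULT_CADENCE_DAYS if cadence_days is None else cadence_days
    return now - parse_ts(last_rotation_at) > timedelta(days=cadence)
```

Passing `now` explicitly keeps the check deterministic in tests; a caller would typically supply `datetime.now(timezone.utc)`.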
Update: three follow-up commits landed since the review

Coordination: companion PR get2knowio/remo-broker#6 adds the release workflow that unblocks the …

Tests: 764 passing, ruff clean.

Deferred for Phase 3: Incus + Proxmox rotation flow (cadence write lands here; admin-socket reload via container exec is next).
Extends `remo rotate-bootstrap` to drive the full mint → push → broker-reload → revoke flow on Incus containers and Proxmox LXC containers, matching the Hetzner coverage that landed in c326727.

The laptop reaches each container's `/run/remo-broker/admin.sock` by tunneling the existing NDJSON `sudo python3` bridge through `incus exec <container> -- …` (Incus, with localhost fast-path) or `pct exec <vmid> -- …` (Proxmox, always over SSH). Token pushes use the same indirection with stdin forwarding so the secret never appears in argv / ps output.

Metadata persistence:
- Incus: `incus config set <container> user.remo.{rotation_cadence_days, last_rotation_at, bootstrap_token_id}` (host-side config keys).
- Proxmox: in-container files under `/etc/remo-broker/{rotation_cadence_days, last_rotation_at, bootstrap_token_id}` written via `pct exec`. Proxmox LXC has no host-side `user.*` primitive; the in-container file has the same lifetime as the bootstrap-token file the broker already consumes from `/etc/remo-broker/`.

`_lookup_token_id` (broker_revoke) now reads both stores, distinguishing "no token minted yet" (empty stdout → None) from "lookup transport failed" (rc != 0 → TokenLookupError, blocks destroy unless --force).

`destroy()` in providers/{incus,proxmox} now constructs the pre-revoke candidate KnownHost with the right fields (`name="<host>/<container>"`, `instance_id=<vmid>` on Proxmox / `<host-user>` on Incus, `region=<host-user>` on Proxmox) so `_lookup_token_id` can resolve the exec target.

Tests: 13 new across `test_broker_admin.py`, `test_rotate.py`, `test_broker_revoke.py`, and two new files `test_incus_token_push.py` and `test_proxmox_token_push.py`. `test_rotate_warns_for_unsupported_provider` retargeted from incus → proxmox at Incus landing and then deleted at Proxmox landing (no unwired provider remains).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
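The two ideas in this commit, wrapping a command in the provider's exec indirection and telling "no token yet" apart from "lookup failed", can be sketched like this. It is an illustration only: `exec_argv` and `interpret_lookup` are hypothetical helpers, and only the `TokenLookupError` name and the rc / stdout semantics come from the commit message.

```python
from typing import List, Optional, Sequence


class TokenLookupError(RuntimeError):
    """The lookup command itself failed, so the token state is unknown."""


def exec_argv(provider: str, target: str, command: Sequence[str]) -> List[str]:
    """Wrap `command` so it runs inside the container or LXC guest."""
    if provider == "incus":
        return ["incus", "exec", target, "--", *command]
    if provider == "proxmox":
        return ["pct", "exec", target, "--", *command]
    raise ValueError(f"unsupported provider: {provider}")


def interpret_lookup(returncode: int, stdout: str) -> Optional[str]:
    """Map a finished lookup command to a token id, None, or an error.

    rc != 0      -> transport failure: raise, so destroy can refuse to proceed
    empty stdout -> nothing minted yet: None, which is a success case
    """
    if returncode != 0:
        raise TokenLookupError(f"token-id lookup failed (rc={returncode})")
    token_id = stdout.strip()
    return token_id or None
```

Keeping the secret on stdin rather than in `command` is what keeps it out of argv; this sketch only builds the argv and leaves the process invocation to the caller.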
Phase 3 landed (`644c131`): Incus + Proxmox rotation lifecycle

Rotation flow now complete on all four providers. Same shape as Hetzner:

Tests: 791 passing, ruff clean. 13 new tests across …
remo-broker v0.1.0 was cut today (release workflow merged in get2knowio/remo-broker#6 + tag pushed), so the binary URL the broker_install Ansible role downloads no longer 404s.

Removing the REMO_BROKER_BACKEND env-gate that Finding 1 added as a temporary workaround. The systemd unit installs enabled-but-stopped when no token is present (Finding 9), so running broker_install on every create is harmless even when the user hasn't configured a backend.

The bootstrap_token_{file,imds,mount} assertion roles stay gated: those correctly only fire when the user has opted in to a backend and minted a token.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
`deb72a3`: un-gate broker_install now that remo-broker v0.1.0 ships

remo-broker #6 merged and tag `v0.1.0` pushed; release workflow currently running (will publish the cross-arch binaries + `.sha256` files + `remo-broker.v1.json` schema to https://github.com/get2knowio/remo-broker/releases/tag/v0.1.0).

This commit drops the temporary `REMO_BROKER_BACKEND` env-gate that Finding 1 added on the `broker_install` Ansible role include in the four `*_configure.yml` plays. The systemd unit installs enabled-but-stopped when no token is present (Finding 9), so running `broker_install` on every `create` is harmless even without a backend configured.

The `bootstrap_token_{file,imds,mount}` assertion roles stay gated: those correctly only run when the user has opted in to a backend and minted a token.

Tests: 791 passing.
…ox cadence
Two bugs surfaced by the first end-to-end Proxmox test against a real
container (lab1):
1. broker_install was only included in the *_configure.yml playbooks, but
`remo {provider} create` runs `*_site.yml` — so the broker daemon was
never installed on create for *any* of the four providers. The role
only fired on the standalone `remo {provider} update` path.
2. The per-instance cadence write on Proxmox tried to `echo N >
/etc/remo-broker/rotation_cadence_days` before broker_install had run
to provision /etc/remo-broker/. Result: a non-fatal warning at every
create and the cadence value was never persisted.
Note: these patches keep the *current* (external-backend) design working,
which we're no longer planning to ship — see PR #32 conversation for the
redesign. Committed so the branch is internally consistent for posterity.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Closing without merge.

End-to-end testing on 2026-05-29 against a real Proxmox container surfaced a categorical mismatch between this branch's architecture (external secret backend + bootstrap-token-on-instance + broker-fetches-on-demand) and the threat model it set out to defend. Specifically: the bootstrap token at … Six findings from the test session reinforced this, most notably that …

Superseded by spec 006: laptop-push model.

The redesign: laptop encrypts a project-scoped secret bundle (age, X25519 + ChaCha20-Poly1305) and pushes it to the instance over SSH; broker decrypts in memory using a key loaded via systemd's …

~60% of the laptop-side code in this PR carries forward to the new branch (broker chassis, manifest model, install machinery, fnox-as-laptop-store, per-provider create/destroy plumbing, SSH/incus exec/pct exec bridges). On the broker side, ~80% of the chassis carries forward; …
Summary

- Provisioning creds now come from laptop `fnox` (FR-006); per-instance bootstrap tokens are delivered via provider-specific transport (SSH push / IMDS / bind-mount); a per-project allowlist manifest plus on-instance broker socket gates secret access from devcontainers.
- CLI surface: `remo init --backend …`, `remo {incus,proxmox} add-node`, `remo audit`, `remo rotate-bootstrap`, pre-destroy bootstrap-token revoke on `remo destroy` (FR-020, exit 5), passive overdue-rotation reminder on every invocation.
- Cross-repo dependency: `BROKER_PINNED_VERSION = "0.1.0"`. The Rust broker daemon is owned by `get2knowio/remo-broker` and consumed as a signed binary release. This PR is the laptop CLI + Ansible half.

Scope

Phases delivered (all tasks in `specs/005-credential-broker/tasks.md`):

- `jsonschema` dep, broker version constants, vendored manifest schema baseline
- `core/fnox.py`, `core/nodes.py` (0600 enforced), `core/manifest.py`, `core/broker_install.py`, `models/node.py`, `models/manifest.py`, `providers/broker.py` skeleton, three Ansible roles (`broker_install` + 3× `bootstrap_token_*`), `group_vars` migrated to `lookup('pipe', 'fnox get …')`
- … `secretsmanager:GetSecretValue`, Incus `lxc config device add … readonly=true`, Proxmox `pct set -mp0 …,ro=1`. `add-node` CLI for self-hosted nodes.
- `core/devcontainer.py` socket helper + language auto-synthesis.
- … `Restart=on-failure`, `RestartSec=5s`, `LoadCredential=bootstrap-token:…`; post-install verify; no device-bound state.
- `remo init` with backend selection, fnox-missing rejection (exit 3), age-git downgrade warning (exit 2), interactive-identity rejection (exit 4); Hetzner env reads → `_get_hetzner_api_token` via fnox.
- `.gitignore` ensure-block, `remo audit` CLI with `--tail`/`--since`/`--json`.
- `remo rotate-bootstrap` with cadence reading + 1-hour fresh-skip + partial-success exit 7, pre-destroy revoke hook wired into Hetzner, passive overdue reminder.
- `docs/credential-broker.md` (threat model + operator runbook), README updates, `scripts/grep-credential-leaks.sh` enforcing FR-005/FR-006, version bump `2.0.0rc4` → `2.1.0rc1`.

Spec & supporting work:

- Authoritative spec moved from `docs/remo-fnox-spec.md` to `specs/005-credential-broker/` alongside plan/research/data-model/contracts/quickstart/tasks.
- Pre-existing `005-provider-snapshots` specs renumbered to `007` to free the slot.

Deferred / partial (`[~]` in `tasks.md`)

- … `cli/shell.py` hooks are a no-op because project workspaces live on the instance; helpers (`core.devcontainer.ensure_socket_mount`, `core.manifest.synthesize_default`, `core.devcontainer.synthesize_devcontainer_json`) are reachable for instance-side automation when remo-broker ships its hooks.
- … `destroy()` flows can adopt the same `core/broker_revoke.py` hook.
- … `--cadence-days N` writer on `remo {provider} create` not yet exposed.

Test plan

- `uv run pytest`: 711 passed (was 644; +67 new tests)
- `uv run ruff check src/remo_cli`: all checks passed
- `bash scripts/grep-credential-leaks.sh`: credential-leak grep gate: ok
- `uv run mypy src/remo_cli`: clean modulo pre-existing missing third-party stubs (jsonschema/yaml/boto3)
- `remo init --backend 1password` round-trip on a fresh devcontainer

🤖 Generated with Claude Code