feat(broker): credential broker laptop CLI + Ansible roles (spec 005) #32

Closed
pofallon wants to merge 7 commits into main from 005-credential-broker


@pofallon

Summary

  • Removes long-lived developer credentials from Remo instances. Provisioning creds now come from laptop fnox (FR-006); per-instance bootstrap tokens are delivered via provider-specific transport (SSH push / IMDS / bind-mount); a per-project allowlist manifest plus on-instance broker socket gates secret access from devcontainers.
  • New CLI surface: remo init --backend …, remo {incus,proxmox} add-node, remo audit, remo rotate-bootstrap, pre-destroy bootstrap-token revoke on remo destroy (FR-020, exit 5), passive overdue-rotation reminder on every invocation.
  • Cross-repo dep: BROKER_PINNED_VERSION = "0.1.0" — Rust broker daemon owned by get2knowio/remo-broker, consumed as a signed binary release. This PR is the laptop CLI + Ansible half.

Scope

Phases delivered (all tasks in specs/005-credential-broker/tasks.md):

  • 1 — Setup: jsonschema dep, broker version constants, vendored manifest schema baseline
  • 2 — Foundational: core/fnox.py, core/nodes.py (0600 enforced), core/manifest.py, core/broker_install.py, models/node.py, models/manifest.py, providers/broker.py skeleton, four Ansible roles (broker_install + 3× bootstrap_token_*), group_vars migrated to lookup('pipe', 'fnox get …')
  • 3 — US1 (bootstrap delivery): per-provider transport — Hetzner SSH stdin push, AWS per-dev-per-region IAM instance profile with scoped secretsmanager:GetSecretValue, Incus lxc config device add … readonly=true, Proxmox pct set -mp0 …,ro=1. add-node CLI for self-hosted nodes. core/devcontainer.py socket helper + language auto-synthesis.
  • 4 — US2 (multi-device + reboot survival): systemd unit with Restart=on-failure, RestartSec=5s, LoadCredential=bootstrap-token:…; post-install verify; no device-bound state.
  • 5 — US3 (no creds on instance): remo init with backend selection, fnox-missing rejection (exit 3), age-git downgrade warning (exit 2), interactive-identity rejection (exit 4); Hetzner env reads → _get_hetzner_api_token via fnox.
  • 6 — US4 (manifest + audit): manifest discover/synthesize/validate with JSON-Schema + TOML position info, .gitignore ensure-block, remo audit CLI with --tail/--since/--json.
  • 7 — US5 (rotation + revocation): all four backend mint/revoke impls (1Password SCIM, Vault accessor, AWS IAM deny-all + teardown, age-git warn-only), remo rotate-bootstrap with cadence reading + 1-hour fresh-skip + partial-success exit 7, pre-destroy revoke hook wired into Hetzner, passive overdue reminder.
  • 8 — US6 (devcontainer auto-synth): language-marker priority table (Node/Python/Rust/Go/Ruby/Ubuntu), socket mount in every synthesized json, exit-warning state.yml plumbing.
  • 9 — Polish: docs/credential-broker.md (threat model + operator runbook), README updates, scripts/grep-credential-leaks.sh enforcing FR-005/FR-006, version bump 2.0.0rc4 → 2.1.0rc1.

Spec & supporting work:

  • Authoritative spec moved from docs/remo-fnox-spec.md to specs/005-credential-broker/ alongside plan/research/data-model/contracts/quickstart/tasks.
  • Pre-existing 005-provider-snapshots specs renumbered to 007 to free the slot.

Deferred / partial ([~] in tasks.md)

  • T039 / T063 / T086 — laptop-side cli/shell.py hooks are a no-op because project workspaces live on the instance; helpers (core.devcontainer.ensure_socket_mount, core.manifest.synthesize_default, core.devcontainer.synthesize_devcontainer_json) are reachable for instance-side automation when remo-broker ships its hooks.
  • T074 — pre-destroy revoke wired into Hetzner; AWS/Incus/Proxmox destroy() flows can adopt the same core/broker_revoke.py hook.
  • T078 — cadence reader works for Hetzner labels; --cadence-days N writer on remo {provider} create not yet exposed.
  • T088 — explicit "exit to instance shell" menu option lives in the server-side picker script; state.yml plumbing is laptop-side.
  • T097 — quickstart end-to-end runs against real cloud accounts deferred to pre-GA validation.

Test plan

  • uv run pytest — 711 passed (was 644; +67 new tests)
  • uv run ruff check src/remo_cli — all checks passed
  • bash scripts/grep-credential-leaks.sh — credential-leak grep gate: ok
  • uv run mypy src/remo_cli — clean modulo pre-existing missing third-party stubs (jsonschema/yaml/boto3)
  • Live quickstart on real AWS + Hetzner + Incus + Proxmox (gated on remo-broker daemon release)
  • Verify remo init --backend 1password round-trip on a fresh devcontainer

🤖 Generated with Claude Code

Remove long-lived developer credentials from Remo instances. Provisioning
creds now come from laptop `fnox` (FR-006), per-instance bootstrap tokens
are delivered via provider-specific transport (SSH push / IMDS / bind-mount),
and a per-project allowlist manifest plus on-instance broker socket gate
secret access from devcontainers.

Cross-repo dependency: `BROKER_PINNED_VERSION = "0.1.0"` — the Rust broker
daemon is owned by `get2knowio/remo-broker` and consumed as a signed binary
release.

CLI surface:
  remo init --backend {1password|vault|aws-sm|age-git}
  remo incus  add-node / remo proxmox add-node
  remo audit  <instance>     [--tail|--since|--json]
  remo rotate-bootstrap [<instance>|--all] [--force]
  remo destroy — pre-deletion bootstrap-token revoke (FR-020, exit 5)
  passive overdue-rotation reminder on every invocation

Implementation breakdown (phases 1–9, all tasks in tasks.md):
  • core: fnox subprocess wrapper, nodes.yml registry, manifest TOML
    discover/synthesize/validate, devcontainer socket helper + language
    auto-synthesis, audit JSON-lines parser, broker mint/revoke dispatcher
    for 1Password SCIM / Vault accessor / AWS IAM / age-git
  • providers: per-developer-per-region AWS IAM instance profile with
    scoped secretsmanager:GetSecretValue, Hetzner SSH stdin push, Incus
    `lxc config device add … readonly=true`, Proxmox `pct set -mp0 …,ro=1`
  • ansible: broker_install role (binary download + SHA-256 verify + systemd
    unit with LoadCredential), 3× bootstrap_token_* assertion roles,
    group_vars migrated to `lookup('pipe', 'fnox get …')`
  • polish: docs/credential-broker.md (threat model + runbook),
    scripts/grep-credential-leaks.sh enforcing FR-005/FR-006, version
    bump 2.0.0rc4 → 2.1.0rc1

Spec & supporting work:
  • Authoritative spec moved from docs/remo-fnox-spec.md to
    specs/005-credential-broker/ alongside plan/research/data-model/
    contracts/quickstart/tasks
  • Pre-existing 005-provider-snapshots specs renumbered to 007 to free
    the 005 slot

Tests: 644 → 711 (+67 new). Ruff clean. grep gate clean. mypy clean
modulo pre-existing missing third-party stubs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
pofallon and others added 3 commits May 26, 2026 22:51
Closes the defects surfaced by the high-effort code review of #32.
Most-severe-first highlights:

1. Gate broker_install role on REMO_BROKER_BACKEND so missing v0.1.0
   release URL doesn't 404 every `remo {provider} create`.
2. Never format mint-call payloads into BackendError messages — surface
   field-presence info only so 1P/Vault schema drift can't leak the
   freshly-minted token to shell scrollback / CI logs.
3. Wrap `lookup('pipe', 'fnox get ...')` in tolerant defaults plus
   preflight asserts so eager pipe-lookup failures don't crash before
   the friendly-error path can fire.
4. Subclass click.Group to wrap invoke() in try/finally so the
   passive-update + overdue-rotation hook fires past subcommand
   sys.exit() (it was completely dead).
5. Per-instance AWS broker IAM role (not per-developer); a destroy on
   one instance can no longer break IMDS creds on a sibling.
6. Rename Hetzner labels remo:* → remo_* (the colon form is rejected by
   the cloud API; FR-020 revoke was silently broken).
7. Distinguish "no token to revoke" (None, success) from "lookup
   failed" (raise; honor --force) in core/broker_revoke.py.
8. Wire revoke_before_destroy + --force-broker into AWS/Incus/Proxmox
   destroy (was Hetzner-only).
9. Gate the remo-broker restart handler on broker_token_present so a
   fresh install doesn't transition the service into failed state.
10. Drop the empty `touch /usr/local/libexec/remo-broker-tokens` that
    blocks the real Ansible-installed helper (force: false → true on
    the copy task).
11. Validate the broker .sha256 body is a 64-char hex digest before
    handing it to get_url — whitespace-only bodies silently disabled
    integrity checking.
12. Add core/broker_config.{get_backend,get_admin_sa_fnox_key} with
    env-first/file-fallback resolution so `remo init` is no longer a
    no-op.
13. Replace the regex-based JSONC stripper with a string-aware
    tokenizer that doesn't corrupt `"path": "a//b"`.
14. Normalize _parse_iso/_parse_ts to aware-UTC so bare-ISO
    timestamps don't crash `remo rotate-bootstrap` / `remo audit
    --since`.
15. Fetch Hetzner-reported SSH host keys and verify via ssh-keyscan
    before piping the bootstrap token on stdin — close the
    accept-new MITM window on freshly-allocated public IPs.
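
Finding 13 can be illustrated with a small string-aware comment stripper. This is only a sketch of the technique, not the code in this branch: the function name `strip_jsonc_comments` is hypothetical, and trailing commas (which some JSONC dialects allow) are not handled.

```python
import json


def strip_jsonc_comments(text: str) -> str:
    """Remove // and /* */ comments, leaving string literals untouched."""
    out = []
    i, n = 0, len(text)
    in_string = False
    while i < n:
        ch = text[i]
        if in_string:
            out.append(ch)
            if ch == "\\" and i + 1 < n:
                # Copy the escaped character verbatim so \" does not end the string.
                out.append(text[i + 1])
                i += 2
                continue
            if ch == '"':
                in_string = False
            i += 1
        elif ch == '"':
            in_string = True
            out.append(ch)
            i += 1
        elif text.startswith("//", i):
            # Line comment: skip up to (but keep) the newline.
            while i < n and text[i] != "\n":
                i += 1
        elif text.startswith("/*", i):
            # Block comment: skip past the terminator, or to EOF if unterminated.
            end = text.find("*/", i + 2)
            i = n if end == -1 else end + 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


sample = '{\n  // mount the broker socket\n  "path": "a//b" /* keep */\n}'
parsed = json.loads(strip_jsonc_comments(sample))
```

Because the scanner tracks whether it is inside a string, the `//` in `"a//b"` is copied through instead of being treated as the start of a comment, which is the failure mode a regex-only approach tends to hit.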

Also dropped the backwards-compat fallback for the colon-form Hetzner
label keys per direct user guidance (only the underscore form is
canonical).
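
Finding 4 relies on the fact that `sys.exit()` raises `SystemExit`, so a `finally` block around `invoke()` still runs. A minimal sketch of that mechanism follows; `HookedGroup` is a plain stand-in class (assumed here so the example runs without click installed), not the actual `click.Group` subclass from this branch.

```python
import sys


class HookedGroup:
    """Stand-in for a click.Group subclass; only invoke() matters here."""

    def __init__(self, command, hook):
        self._command = command
        self._hook = hook

    def invoke(self):
        try:
            return self._command()
        finally:
            # Runs on normal return, on exceptions, and on SystemExit.
            self._hook()


calls = []


def subcommand():
    sys.exit(5)  # a subcommand that terminates via sys.exit()


group = HookedGroup(subcommand, lambda: calls.append("reminder"))
try:
    group.invoke()
except SystemExit as exc:
    exit_code = exc.code
```

The hook fires and the original exit code still propagates, since `finally` does not swallow the `SystemExit`.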

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds the missing link in `remo rotate-bootstrap`: after minting a fresh
sub-token at the backend, push it to /etc/remo-broker/bootstrap-token on
the instance and call the broker daemon's `rotate-bootstrap` admin-
socket op (per get2knowio/remo-broker docs/wire-protocol.md) so the
daemon atomically swaps to the fresh session before the old token is
revoked. Previously the mint succeeded and the old token was revoked
but the new token was never delivered — leaving the instance broker
serving against a revoked credential.

- New core/broker_admin.py: NDJSON-over-SSH bridge to the admin Unix
  socket (mode 0600 root-owned), using a small `sudo python3 -c` shim
  to avoid a socat / nc dependency on the instance.
- cli/rotate.py: drop the SIGHUP TODO; add `_deliver_and_reload(host,
  token)` that pushes + calls rotate-bootstrap for Hetzner (only fully
  wired provider today). Non-Hetzner providers emit a clear warning;
  AWS-SM (no on-disk token) still issues the admin reload so the broker
  re-fetches creds from IMDS.
- On delivery failure: do NOT revoke the previous token (broker is
  still serving with it) and return exit 7.
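
The ordering described above (mint, deliver and reload, and only then revoke) could be sketched roughly as follows. The callables and the `DeliveryError` type are hypothetical placeholders for the real backend and SSH plumbing; only the exit code 7 comes from this commit message.

```python
from typing import Callable

EXIT_OK = 0
EXIT_PARTIAL = 7  # exit code this commit returns on delivery failure


class DeliveryError(RuntimeError):
    """Hypothetical error raised when push or broker reload fails."""


def rotate(
    mint: Callable[[], str],
    deliver_and_reload: Callable[[str], None],
    revoke_old: Callable[[], None],
) -> int:
    """Mint, deliver, then revoke; keep the old token if delivery fails."""
    fresh = mint()
    try:
        deliver_and_reload(fresh)
    except DeliveryError:
        # The broker is still serving with the previous token: leave it valid.
        return EXIT_PARTIAL
    revoke_old()
    return EXIT_OK


events = []


def failing_delivery(token: str) -> None:
    raise DeliveryError("ssh transport failed")


code = rotate(
    mint=lambda: "fresh-token",
    deliver_and_reload=failing_delivery,
    revoke_old=lambda: events.append("revoked"),
)
```

Revoking last means a failed delivery leaves the instance no worse off than before the rotation attempt.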

Tests: 6 cover the admin-socket client (happy path, broker error,
SSH transport failure, empty/garbage response, default options); 3
cover the rotate wiring (Hetzner push+reload, unsupported-provider
warning, delivery-failure aborts revoke).
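
For readers unfamiliar with NDJSON framing, the request and response handling that these tests exercise might look something like the sketch below. The `{"op": ...}` request shape and the `AdminSocketError` name are assumptions made for illustration; the authoritative format is whatever docs/wire-protocol.md in get2knowio/remo-broker specifies.

```python
import json


class AdminSocketError(RuntimeError):
    """Hypothetical error type for this sketch."""


def encode_request(op: str) -> bytes:
    # NDJSON framing: one compact JSON object terminated by a newline.
    # The "op" field name is an assumption, not taken from the wire protocol.
    return (json.dumps({"op": op}, separators=(",", ":")) + "\n").encode()


def decode_response(raw: bytes) -> dict:
    """Parse the first NDJSON line, rejecting empty or garbage replies."""
    line = raw.decode("utf-8", errors="replace").strip()
    if not line:
        raise AdminSocketError("empty response from admin socket")
    try:
        reply = json.loads(line.splitlines()[0])
    except json.JSONDecodeError as exc:
        raise AdminSocketError("response is not valid JSON") from exc
    if not isinstance(reply, dict):
        raise AdminSocketError("response is not a JSON object")
    return reply


request = encode_request("rotate-bootstrap")
```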

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds `--cadence-days N` to `remo {hetzner,aws,incus,proxmox} create`
and persists the value in the provider's native metadata primitive so
the passive overdue-rotation reminder and `remo rotate-bootstrap` can
respect per-instance cadence overrides (FR-021).

Provider coverage:
- Hetzner: writes `remo_rotation_cadence_days` server label; rotate-
  bootstrap now also writes `remo_last_rotation_at` +
  `remo_bootstrap_token_id` labels on success.
- AWS: writes `remo:rotation-cadence-days` EC2 instance tag; rotate-
  bootstrap writes `remo:last-rotation-at` tag. `_read_rotation_metadata`
  extended to read these via `ec2.describe_tags`. token_id stays
  derived from the per-instance role name (not stored in tags).
- Incus: writes `user.remo.rotation_cadence_days` container config via
  `incus config set` over SSH. Rotation flow itself remains deferred.
- Proxmox: LXC lacks a clean `user.*` primitive — flag emits a clear
  deferred-warning rather than half-implementing it. Persistence +
  rotation for Proxmox is a follow-up.
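
As a rough illustration of how a reader of these metadata values could decide that a rotation is overdue, the sketch below combines a cadence in days with a last-rotation timestamp. The function names are hypothetical; the bare-timestamp handling follows the aware-UTC normalization described in the review-fix commit. Note that `datetime.fromisoformat` only accepts a trailing `Z` from Python 3.11 onward, so the examples use explicit offsets.

```python
from datetime import datetime, timedelta, timezone


def parse_aware_utc(value: str) -> datetime:
    """Parse an ISO-8601 timestamp; treat a bare (naive) value as UTC."""
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        return parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


def is_overdue(last_rotation_at: str, cadence_days: int, now: datetime) -> bool:
    """True once `cadence_days` have elapsed since the last rotation."""
    due = parse_aware_utc(last_rotation_at) + timedelta(days=cadence_days)
    return now >= due


now = datetime(2026, 5, 30, tzinfo=timezone.utc)
bare = is_overdue("2026-05-01T00:00:00", 30, now)
offset = is_overdue("2026-05-01T00:00:00+02:00", 28, now)
```

Normalizing both sides to aware UTC avoids the `TypeError` Python raises when a naive and an aware `datetime` are compared.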

Tests: 5 new in tests/unit/providers/test_cadence_writes.py (one per
provider, including the deferred-warning case for Proxmox) and 3 in
tests/unit/cli/test_rotate.py covering the post-rotation metadata
write paths and the AWS tag reader.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@pofallon

Update: three follow-up commits landed since the review

Pushed 0daf172, c326727, 795e086 resolving the 15 review findings in #33 and pushing the credential-broker rotation lifecycle to completion. Branch now ships:

  1. 0daf172 — Resolve 15 code-review findings (closes #33, "Code review findings: 15 defects from /code-review of #32 (005-credential-broker)"). Token leak in mint exceptions, Hetzner label key rename (: → _), per-instance AWS IAM role (not per-developer), result_callback hook now actually fires past sys.exit, broker_config bridge so remo init isn't a no-op, JSONC string-aware stripper, aware-UTC datetime normalization, SSH host-key verification before bootstrap-token push.
  2. c326727 — Admin-socket client + complete rotation lifecycle. New core/broker_admin.py (NDJSON-over-SSH bridge to /run/remo-broker/admin.sock); remo rotate-bootstrap now pushes the fresh token + tells the broker to reload before revoking the old token. Previously the daemon ended up with a revoked credential.
  3. 795e086 — Per-instance rotation cadence at create time (T078). --cadence-days N on remo {hetzner,aws,incus,proxmox} create. Hetzner + AWS fully wired (label/tag at create + last_rotation/token_id on rotate + AWS-tag reader). Incus persists cadence; Proxmox emits a clear deferred warning.
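
The token-leak fix in item 1 comes down to reporting which fields a mint response contained without ever formatting their values. A minimal sketch, assuming a hypothetical response shape with `token`, `id`, and `expires_at` keys (the real 1Password and Vault payloads differ) and an assumed `BackendError` constructor:

```python
class BackendError(RuntimeError):
    """Name taken from the commit message; this definition is assumed."""


def describe_fields(payload: dict, expected: tuple[str, ...]) -> str:
    """Report which expected keys are present, never their values."""
    return ", ".join(
        f"{key}={'present' if key in payload else 'missing'}" for key in expected
    )


def extract_token(payload: dict) -> str:
    try:
        return payload["token"]
    except KeyError:
        summary = describe_fields(payload, ("token", "id", "expires_at"))
        # `from None` keeps the original KeyError out of the traceback chain.
        raise BackendError(f"unexpected mint response ({summary})") from None


# Simulated schema drift: the secret arrives under an unexpected key.
drifted = {"secret": "s3cr3t-value", "id": "abc"}
try:
    extract_token(drifted)
    message = ""
except BackendError as exc:
    message = str(exc)
```

Even when the backend renames the field carrying the secret, the error text only ever contains key names from the fixed `expected` tuple.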

Coordination: companion PR get2knowio/remo-broker#6 adds the release workflow that unblocks the broker_install Ansible role's current 404.

Tests: 764 passing, ruff clean.

Deferred for Phase 3: Incus + Proxmox rotation flow (cadence write lands here; admin-socket reload via container exec is next).

Extends `remo rotate-bootstrap` to drive the full mint → push → broker-
reload → revoke flow on Incus containers and Proxmox LXC containers,
matching the Hetzner coverage that landed in c326727.

The laptop reaches each container's `/run/remo-broker/admin.sock` by
tunneling the existing NDJSON `sudo python3` bridge through `incus exec
<container> -- …` (Incus, with localhost fast-path) or `pct exec <vmid>
-- …` (Proxmox, always over SSH). Token pushes use the same indirection
with stdin forwarding so the secret never appears in argv / ps output.
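
The stdin-forwarding idea can be sketched with `subprocess.run(..., input=...)`. The helper name `push_via_stdin` is hypothetical, and the child process below is a local stand-in for the real `incus exec` / `pct exec` command line, which is not reproduced here.

```python
import subprocess
import sys


def push_via_stdin(argv: list[str], token: str) -> subprocess.CompletedProcess:
    """Run argv with the token on stdin so it never appears in the argument list."""
    if any(token in arg for arg in argv):
        raise ValueError("token must not be passed in argv")
    return subprocess.run(
        argv, input=token, text=True, capture_output=True, check=True
    )


# Local stand-in for the remote exec command: a child process that reads the
# secret from stdin and reports only its length.
child = [sys.executable, "-c", "import sys; print(len(sys.stdin.read()))"]
result = push_via_stdin(child, "example-token")
```

Anything placed in `argv` is visible to other users through `ps` and `/proc`, whereas data written to the child's stdin pipe is not.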

Metadata persistence:
- Incus: `incus config set <container> user.remo.{rotation_cadence_days,
  last_rotation_at, bootstrap_token_id}` (host-side config keys).
- Proxmox: in-container files under `/etc/remo-broker/{rotation_cadence_
  days, last_rotation_at, bootstrap_token_id}` written via `pct exec`.
  Proxmox LXC has no host-side `user.*` primitive; the in-container file
  has the same lifetime as the bootstrap-token file the broker already
  consumes from `/etc/remo-broker/`.

`_lookup_token_id` (broker_revoke) now reads both stores, distinguishing
"no token minted yet" (empty stdout → None) from "lookup transport
failed" (rc != 0 → TokenLookupError, blocks destroy unless --force).
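
The two outcomes can be told apart from the exit status and stdout of the lookup command alone. A minimal sketch, with `lookup_token_id` as a simplified hypothetical stand-in for `_lookup_token_id` and local Python child processes in place of the real `incus` / `pct exec` invocations:

```python
import subprocess
import sys
from typing import Optional


class TokenLookupError(RuntimeError):
    """Raised when the metadata store could not be read at all."""


def lookup_token_id(argv: list[str]) -> Optional[str]:
    proc = subprocess.run(argv, capture_output=True, text=True)
    if proc.returncode != 0:
        # Transport failure: the caller should block destroy unless forced.
        raise TokenLookupError(f"lookup exited with status {proc.returncode}")
    token_id = proc.stdout.strip()
    # Empty stdout means no token has been minted yet, which is not an error.
    return token_id or None


never_minted = lookup_token_id([sys.executable, "-c", "pass"])
minted = lookup_token_id([sys.executable, "-c", "print('tok-123')"])
```

Collapsing both cases into `None` would let a transient SSH failure skip revocation silently, which is the gap this distinction closes.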

`destroy()` in providers/{incus,proxmox} now constructs the pre-revoke
candidate KnownHost with the right fields (`name="<host>/<container>"`,
`instance_id=<vmid>` on Proxmox / `<host-user>` on Incus, `region=<host
-user>` on Proxmox) so `_lookup_token_id` can resolve the exec target.

Tests: 13 new across `test_broker_admin.py`, `test_rotate.py`,
`test_broker_revoke.py`, and two new files `test_incus_token_push.py`
and `test_proxmox_token_push.py`. `test_rotate_warns_for_unsupported_
provider` retargeted from incus → proxmox at Incus landing and then
deleted at Proxmox landing (no unwired provider remains).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@pofallon

Phase 3 landed — `644c131`: Incus + Proxmox rotation lifecycle

Rotation flow now complete on all four providers. Same shape as Hetzner:

  • Incus: _push_bootstrap_token_to_container via incus exec <container> -- install … (localhost fast-path + remote SSH); admin-socket reload via incus exec … -- sudo python3 … (broker_admin.rotate_bootstrap_via_incus). Metadata in host-side user.remo.* config keys.
  • Proxmox: mirror via pct exec <vmid> -- … (always over SSH; no localhost flavour). Metadata in in-container files under /etc/remo-broker/ — Proxmox LXC has no host-side user.* primitive, but the in-container file has the same lifetime as the bootstrap-token file the broker already consumes there.

broker_revoke._lookup_token_id now reads both stores. Empty stdout = "no token minted yet" (None, success). Non-zero exit = transport failure (TokenLookupError, blocks destroy unless --force).

providers/{incus,proxmox}.destroy() now build the pre-revoke candidate KnownHost with the right fields so _lookup_token_id can resolve the exec target.

Tests: 791 passing, ruff clean. 13 new tests across test_broker_admin.py, test_rotate.py, test_broker_revoke.py, and two new files test_incus_token_push.py / test_proxmox_token_push.py. The "unsupported provider" warning test is retired (no unwired provider remains).

remo-broker v0.1.0 was cut today (release workflow merged in
get2knowio/remo-broker#6 + tag pushed), so the binary URL the
broker_install Ansible role downloads no longer 404s. Removing the
REMO_BROKER_BACKEND env-gate that Finding 1 added as a temporary
workaround.

The systemd unit installs enabled-but-stopped when no token is present
(Finding 9), so running broker_install on every create is harmless even
when the user hasn't configured a backend.

The bootstrap_token_{file,imds,mount} assertion roles stay gated — those
correctly only fire when the user has opted in to a backend and minted
a token.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@pofallon

`deb72a3`: un-gate broker_install now that remo-broker v0.1.0 ships

remo-broker #6 merged and tag `v0.1.0` pushed — release workflow currently running (will publish the cross-arch binaries + `.sha256` files + `remo-broker.v1.json` schema to https://github.com/get2knowio/remo-broker/releases/tag/v0.1.0).
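
Since the role consumes those `.sha256` files, the digest validation from Finding 11 can be sketched as below. The function name is hypothetical, and tolerating a trailing filename (the `sha256sum` output layout) is an assumption about the file format rather than something stated in this PR.

```python
import hashlib
import re

_SHA256_HEX = re.compile(r"[0-9a-fA-F]{64}")


def parse_sha256_body(body: str) -> str:
    """Return the lowercase digest from a .sha256 file body, or raise."""
    fields = body.split()
    if not fields or not _SHA256_HEX.fullmatch(fields[0]):
        raise ValueError("checksum body is not a 64-character hex digest")
    return fields[0].lower()


digest = hashlib.sha256(b"example").hexdigest()
checked = parse_sha256_body(f"{digest}  remo-broker\n")
```

Rejecting empty or whitespace-only bodies up front ensures a malformed checksum file fails the install instead of quietly turning the integrity check off.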

This commit drops the temporary `REMO_BROKER_BACKEND` env-gate that Finding 1 added on the `broker_install` Ansible role include in the four `*_configure.yml` plays. The systemd unit installs enabled-but-stopped when no token is present (Finding 9), so running `broker_install` on every `create` is harmless even without a backend configured.

The `bootstrap_token_{file,imds,mount}` assertion roles stay gated — those correctly only run when the user has opted in to a backend and minted a token.

Tests: 791 passing.

…ox cadence

Two bugs surfaced by the first end-to-end Proxmox test against a real
container (lab1):

1. broker_install was only included in the *_configure.yml playbooks, but
   `remo {provider} create` runs `*_site.yml` — so the broker daemon was
   never installed on create for *any* of the four providers. The role
   only fired on the standalone `remo {provider} update` path.

2. The per-instance cadence write on Proxmox tried to `echo N >
   /etc/remo-broker/rotation_cadence_days` before broker_install had run
   to provision /etc/remo-broker/. Result: a non-fatal warning at every
   create and the cadence value was never persisted.

Note: these patches keep the *current* (external-backend) design working,
which we're no longer planning to ship — see PR #32 conversation for the
redesign. Committed so the branch is internally consistent for posterity.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@pofallon

Closing without merge.

End-to-end testing on 2026-05-29 against a real Proxmox container surfaced a categorical mismatch between this branch's architecture (external secret backend + bootstrap-token-on-instance + broker-fetches-on-demand) and the threat model it set out to defend.

Specifically: the bootstrap token at /etc/remo-broker/bootstrap-token is itself the kind of "credential laying around" the origin-story principle (@nateberkopec) was specifically scrubbing. An AI agent or supply-chain attacker in the devcontainer who escalates to read that file gets every secret behind it, bypassing the per-project manifest gate.

Six findings from the test session reinforced this — most notably that age-git was advertised as a downgrade path but errors on first use; that no create-time initial-token mint+push was wired (rotate-bootstrap was doing double duty); and that broker_install was never being included in the *_site.yml playbooks the create flow actually runs (fixed in 803667a for branch-integrity, but moot given the supersede).

Superseded by spec 006: laptop-push model.

  • remo: specs/006-credential-broker-laptop-push/ (spec, plan)
  • remo-broker: specs/002-laptop-push-secrets/ (spec) — supersedes 001-broker-daemon

The redesign: laptop encrypts a project-scoped secret bundle (age, X25519 + ChaCha20-Poly1305) and pushes it to the instance over SSH; broker decrypts in memory using a key loaded via systemd's LoadCredentialEncrypted= (TPM2 → host-key → plaintext-mode-0600 fallback ladder); devcontainers fetch via the existing per-project Unix socket protocol with manifest allowlist enforcement. No external backend, no on-disk bootstrap token, no fnox-core dependency (which cascade-deletes Cross.toml, all 6 deny.toml advisory ignores, and shrinks the binary from ~32 MiB toward the 15 MiB NFR target).

~60% of the laptop-side code in this PR carries forward to the new branch (broker chassis, manifest model, install machinery, fnox-as-laptop-store, per-provider create/destroy plumbing, SSH/incus exec/pct exec bridges). On the broker side, ~80% of the chassis carries forward; src/backend.rs and src/bootstrap.rs delete entirely.

The 005-credential-broker branch is being kept intact as historical reference; not deleting.
