fix(gateway): harden deploy contract for infra-as-code consumers#649
Conversation
Reads AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and the optional AWS_SESSION_TOKEN from secret files in loadGatewayConfig, validates the access/secret pair, and injects them into the S3Client constructor. When neither is set, the SDK default credential chain takes over so IRSA/EC2-instance-role deployments keep working. - AwsCredentials interface added next to ObjectStoreConfig - Pair validation: throws when exactly one of the access/secret pair is set - AWS_SESSION_TOKEN is optional and only attached when the pair is present - Orphan session token (set without the pair) logs an info-level warning and falls through to the SDK default chain - S3Client receives credentials only when explicitly configured Gateway tests: 87 -> 95 (+8). Runtime tests: 350 -> 352 (+2).
…irectory secrets deploy/compose.yaml now mounts the three AWS credential bind-mounts (access key id, secret access key, session token) alongside the existing Discord and S3 secret files, so operators wiring infra-as-code can supply them through the same pattern. Each service also gains a bounded json-file logging block (10m x 3) to keep small-VM disk usage in check. readOptionalSecret now asserts the secret path resolves to a file via statSync, not just that the path exists. When a bind-mount source doesn't exist on the host, Docker materializes the path as a directory; the new guard surfaces a clear startup error pointing operators at the likely cause instead of failing later with a raw EISDIR. Gateway tests: 95 -> 96 (+1). Docker compose config validates clean once the three AWS secret files exist (empty files for the default-chain path).
The Dockerfile healthcheck no longer succeeds before the gateway is genuinely connected to Discord. A new readiness helper clears any stale `/tmp/gateway-ready` flag at process startup and registers a one-time `clientReady` listener that writes the flag once Discord confirms the bot is fully connected. The healthcheck polls for the flag plus PID 1 liveness, so `docker compose up --wait` only returns once the daemon is actually ready. - packages/gateway/src/readiness.ts: setupReadinessFlag(client, logger) - main.ts wires it between client creation and login() - Dockerfile HEALTHCHECK switched from no-op to file-and-pid probe with --interval=10s --timeout=3s --retries=12 --start-period=45s - compose.yaml drops the gateway healthcheck override (CA-cert wait is already gated by depends_on: mitmproxy: service_healthy) - deploy/README.md documents the new readiness semantics and the AWS credential `touch` block for operators wiring infra-as-code Gateway tests: 96 -> 101 (+5).
- `stat.isFile()` check uses explicit boolean comparison - Orphan AWS_SESSION_TOKEN warning emits via `console.warn` rather than `console.log` (matches makeLogger's warn-level pattern; lifts the eslint-disable directive) - Drop the redundant `NodeJS.ErrnoException` cast in readiness stale-flag cleanup -- in-operator narrowing is sufficient - Strengthen the stale-readiness test to assert flag unlink ordering at listener-registration time, not just final state - Replace `as string` / `as unknown` assertions in config.test.ts with proper narrowing - Add custom-endpoint+credentials coverage to s3-adapter - Document the AWS credential pair contract for operators in deploy/README.md Runtime tests: 352 -> 353 (+1).
Captures the v0.44.x deploy contract hardening pass: AWS credential plumbing, compose mounts and bounded logging, secret-path directory guard, and a real readiness healthcheck backed by Discord clientReady.
Static analyzers flag predictable paths in /tmp as a temp-file weakness (CWE-377/378). The flag is intentionally deterministic — the Dockerfile healthcheck has to know where to find it — so fs.mkdtempSync() doesn't apply. Move the default to /var/run/fro-bot/gateway-ready, which the Dockerfile creates with 0700 permissions before the HEALTHCHECK runs. The FRO_BOT_READY_FLAG_PATH env override is preserved; tests already use isolated mkdtempSync paths and didn't need changes. Rebuilt dist/ for the readiness path baked into main.mjs.
fro-bot
left a comment
There was a problem hiding this comment.
Verdict: PASS
The four deploy-contract gaps are addressed correctly. Logic, tests, and the Docker wiring are all sound. One non-blocking concern and one documentation inaccuracy noted below.
Blocking issues
None.
Non-blocking concerns
1. Static credential objects don't rotate. AwsCredentials is passed as a plain AwsCredentialIdentity object — not a provider function — so if an STS session token expires the process must restart to pick up new credentials. This is fine for a compose-on-a-single-droplet setup, but worth a comment in s3-adapter.ts near the spread for whoever next tries to add token refresh.
2. PR description inaccuracy. The description says "existing deployers don't need to change anything". That's true at the code level (SDK default chain still wins when no env vars are set), but it's not true at the compose level. The updated compose.yaml unconditionally bind-mounts three new files (./secrets/aws-access-key-id, etc.). If an existing deployer pulls and runs docker compose up without touching those files, Docker silently creates directories at the mount points, and the new readOptionalSecret directory guard throws a clear error at startup. The README does document the touch steps, so operators who read it are covered — the claim in the PR body is just slightly misleading.
3. kill -0 1 || exit 1 in the HEALTHCHECK is redundant. test -f ... && kill -0 1 || exit 1 — when test -f fails, the shell exits with code 1 before reaching || exit 1. The guard is harmless but adds noise. Not worth a change on its own.
Missing tests
None — happy path, pair-validation, orphan-token, empty-file, and directory-guard cases are all covered. The TOCTOU caveat is honestly documented in the test file.
Risk assessment (LOW/MED/HIGH): LOW
- Credential plumbing is additive (SDK default chain unchanged when env vars absent).
- The readiness gate moves a pre-existing no-op healthcheck to a real one; the only new risk is a startup that never reaches
clientReady, which surfaces as a faileddocker compose up --waitrather than a silent bad state. - The directory guard turns a previously-silent
EISDIRcrash into a structured error — strictly better. - Blast radius if something is wrong: gateway fails to start, Discord bot goes offline; object store and workspace are unaffected.
Run Summary
| Field | Value |
|---|---|
| Event | pull_request |
| Repository | fro-bot/agent |
| Run ID | 26140941619 |
| Cache | hit |
| Session | ses_1bc64d3a3ffeyweZAYgbxB4fby |
…ness write
- Collapse `existsSync` + `statSync` in `readOptionalSecret` into a single
`try/catch` on `statSync`. ENOENT falls through to the env-var fallback;
any other error rethrows. Behavior is identical (missing file → null,
directory → clear error, file → read) without the check-then-act window.
- Pass `{mode: 0o600}` to `writeFileSync` for the gateway-ready flag and
its corresponding test fixture. Defense-in-depth on the readiness flag
permissions and a signal to static analyzers that the write is
intentional and owner-scoped.
fro-bot
left a comment
There was a problem hiding this comment.
Verdict: PASS
Re-reviewing after the new commit 676ff36 (my prior approval was dismissed). The single new commit addresses the TOCTOU window flagged by static analysis and adds 0o600 permissions to the readiness flag write — both are improvements.
Changes since prior review
readOptionalSecret TOCTOU fix. The existsSync + statSync check-then-act sequence is replaced with a single statSync inside a try/catch. ENOENT is caught and silently falls through (file absent → env-var fallback path); any other error (e.g. EACCES, EPERM) rethrows. Directory paths still surface the clear operator-facing error. Behavior is identical to before; the race window is gone.
Readiness flag permissions. writeFileSync(flagPath, '', {mode: 0o600}) scopes the flag to owner-read/write, consistent with the 0700 directory created in the Dockerfile. Defense-in-depth; no functional change since the container runs as root.
Test fixture. writeFileSync(flagPath, 'stale', {mode: 0o600}) in the stale-flag test matches the production write mode. Correct.
Blocking issues
None.
Non-blocking concerns
The three non-blocking items from the prior review still apply (static credentials don't rotate; compose.yaml requires new secret files to be touched; redundant || exit 1 in HEALTHCHECK). None are blocking. The commit message explicitly acknowledges the TOCTOU fix motivation, so no new documentation gap.
Missing tests
The existing test suite covers the updated code path. The ENOENT catch is exercised by the "tolerates ENOENT" and "file absent" tests; the directory guard by the "throws with clear message" test; the flag permissions by the updated stale-flag test.
Risk assessment (LOW/MED/HIGH): LOW
Same blast radius as before. The refactor is behaviorally equivalent — same outputs for all inputs — with no new code paths added.
Run Summary
| Field | Value |
|---|---|
| Event | pull_request |
| Repository | fro-bot/agent |
| Run ID | 26141212536 |
| Cache | hit |
| Session | ses_1bc64d3a3ffeyweZAYgbxB4fby |
…; harden compose - `readOptionalSecret` collapses to a single `readFileSync` that catches ENOENT (fall through to env-var) and EISDIR (clear directory error). Removes the stat-then-read pattern that static analyzers flag as a file-system race. - Drop redundant `|| exit 1` from the gateway HEALTHCHECK -- the shell already exits non-zero when test/kill fail. - Convert secret bind mounts in compose.yaml to long syntax with `bind.create_host_path: false`. Missing secret source files now fail at `docker compose up` time instead of silently materializing as directories. The readOptionalSecret directory guard remains as a backstop. - Document static-credential rotation in deploy/README.md: rotate by writing the new value into the secret file and restarting the gateway. STS or instance-role credentials should leave the pair empty and rely on the SDK default credential chain.
fro-bot
left a comment
There was a problem hiding this comment.
Verdict: PASS
Re-reviewing after commit 791a0d5. This commit closes out all three non-blocking concerns raised in the prior reviews.
Changes since prior review
readOptionalSecret is now a single atomic readFileSync — the stat-then-read sequence is completely gone. ENOENT falls through to the env-var fallback; EISDIR maps to the operator-facing directory error; every other error rethrows. No TOCTOU window exists. The test comment is updated to match.
compose.yaml bind mounts converted to long syntax with create_host_path: false — missing secret source files now cause docker compose up to fail immediately with a clear Compose-level error, before any container starts. The directory guard in readOptionalSecret remains as a backstop for non-Compose deployment paths. This fully closes the "existing deployers must touch the new files" footgun identified in the first review.
Redundant || exit 1 removed from HEALTHCHECK — CMD test -f /var/run/fro-bot/gateway-ready && kill -0 1 is the correct form; when test -f fails the shell exits non-zero automatically.
Credential rotation documented in README — explicit "rotate by writing new value and restarting; use SDK default chain for STS/instance-role refresh" guidance added.
Blocking issues
None.
Non-blocking concerns
None — all prior concerns are resolved.
Missing tests
None. The EISDIR path is exercised by the directory-guard test using a real temp directory; ENOENT fall-through is exercised by the "tolerates ENOENT" test. create_host_path: false is a Compose-engine behavior and does not need a unit test.
Risk assessment (LOW/MED/HIGH): LOW
readOptionalSecretbehavior is unchanged for all existing inputs; the refactor only removes a race window.- The bind-mount change fails earlier and more visibly for misconfigured deployments — strictly better.
- Blast radius unchanged: gateway startup failure, bot goes offline; object store and workspace unaffected.
Run Summary
| Field | Value |
|---|---|
| Event | pull_request |
| Repository | fro-bot/agent |
| Run ID | 26141629388 |
| Cache | hit |
| Session | ses_1bc64d3a3ffeyweZAYgbxB4fby |
The v0.44.2 compose contract (fro-bot/agent#649 + #651) adds the optional AWS_SESSION_TOKEN_FILE bind mount and uses create_host_path: false on all secret bind sources. Without aws-session-token in buildSecretFileList, docker compose up would fail on a missing host file even for deployments that don't use STS temporary credentials. Materialize aws-session-token as empty when AWS_SESSION_TOKEN is unset, matching the existing s3-endpoint pattern. Bump apps/gateway/upstream.json to v0.44.2.
* fix(gateway): rename secret files to match upstream compose contract
buildSecretFileList wrote snake_case filenames like discord_token to
/opt/gateway/deploy/secrets/, but upstream fro-bot/agent compose mounts
kebab-case paths like ./secrets/discord-token. With short-syntax bind
mounts, Docker silently creates the missing source as a directory and
container creation fails.
Rename all 5 existing files to kebab-case (discord-token,
discord-application-id, discord-guild-id, aws-access-key-id,
aws-secret-access-key) and add 3 more files required by the upstream
contract: s3-bucket and s3-region (required), s3-endpoint (optional —
written as empty file when unset, matching the existing optional-secret
pattern).
8 files total now, matching what the upstream compose contract expects
once it lands the parallel coordinated changes for AWS creds and the
guild ID mount.
* fix(gateway): use path-style hostname for OBJECT_STORE_HOSTS
computeObjectStoreHosts() returned the virtual-hosted-style hostname
(`${bucket}.${url.hostname}`) for custom S3 endpoints. The upstream
S3 adapter constructs its client with `forcePathStyle: true`, so
requests go to the bare endpoint hostname with the bucket in the path.
mitmproxy egress allowlist was therefore allowing a hostname the
gateway never requests, while blocking the real one. R2 / MinIO / any
custom-endpoint deploys would silently fail on first object-store
operation.
Return `url.hostname` for the custom-endpoint branch. AWS S3 path
(no S3_ENDPOINT set) is unchanged.
* fix(cli): pass full deploy env to local gateway deploy child
getGatewayDeployEnv() returned only PATH, HOME, SSH_AUTH_SOCK, and
GATEWAY_HOST. Bun.spawn({ env }) does not merge with process.env, so
the spawned apps/gateway/src/deploy.ts subprocess never saw
DISCORD_TOKEN, AWS_ACCESS_KEY_ID, S3_BUCKET, or the other required
env vars — even when they were set in the user's shell or local .env.
`gateway deploy --local` would therefore fail at validateRequiredEnv
even on a perfectly-configured host.
Add the 7 missing required env vars to the child env using the
explicit-allowlist pattern already used in cliproxy and keeweb deploy
commands.
* fix(gateway): use ssh -i key file instead of ssh-agent in CI
validateRequiredEnv() in deploy.ts required GATEWAY_SSH_KEY when
CI=true but buildDeployEnv() only forwarded SSH_AUTH_SOCK and
sshCommand() had no -i flag — the secret was validated but unused.
The actual auth path went through webfactory/ssh-agent in the
workflow.
ssh-agent has a real production failure mode: when the operator's
DigitalOcean account holds multiple SSH keys, OpenSSH cycles through
them all before reaching the right one and trips MaxAuthTries. Hit
this manually during gateway provisioning tonight (workaround was
extracting the key to a tmp file and using ssh -i / IdentitiesOnly).
Switch CI to the same pattern. In CI mode, write GATEWAY_SSH_KEY to
a mkdtemp file with mode 0600 (defensive chmodSync too), pass -i
$path and -o IdentitiesOnly=yes to every ssh/scp/writeRemoteFile
invocation, wrap main() in try/finally so the tmp dir is removed
even on mid-deploy failure. Key bytes never enter argv.
Local mode is unchanged — still requires SSH_AUTH_SOCK.
Workflow: drop the Setup SSH agent step; pass GATEWAY_SSH_KEY into
the Deploy step's env block directly.
* fix(gateway): tighten OBJECT_STORE_HOSTS validation to RFC1123
validateObjectStoreHosts() used /^[\w.,\-]+$/ which accepts
underscores via \w. Upstream mitmproxy allowlist parser rejects
underscores and other non-RFC1123 chars at runtime, so an invalid
hostname like `foo_bucket.example.com` passed infra validation,
shipped to the droplet, and only failed when the egress filter
silently went down.
Replace with strict per-label RFC1123 validation:
- comma-separated list of hostnames
- each hostname is dot-separated labels
- each label matches /^(?!-)[a-z0-9-]{1,63}(?<!-)$/
- full hostname max length 253
Reject uppercase, underscores, leading/trailing hyphens per label,
empty labels, labels > 63 chars, total hostname > 253 chars. Error
messages name the offending host and reason category.
* fix(gateway): propagate readRemoteChecksum exit failures
readRemoteChecksum awaited proc.exited but discarded the exit code,
returning an empty string on any non-zero exit. The caller treated
empty string as 'no prior checksum exists' and force-recreated
containers, masking real SSH failures (transient network, auth,
permissions) as phantom first deploys.
The SSH command itself is unchanged — it uses
`cat $CHECKSUM_PATH 2>/dev/null || echo ''` so genuine
missing-file cases still exit 0 with empty stdout. Only structural
SSH/auth failures now surface as thrown errors with the exit code
and stderr tail for diagnosis.
* fix(gateway): apply ce:review autofix bundle (6 findings)
Applied 6 safe_auto findings from ce:review autofix run on PR #273:
F-A: getGatewayDeployEnv() forwards S3_ENDPOINT and OBJECT_STORE_HOSTS
(Fro Bot blocker + 5 reviewer agreement). Without these, gateway
deploy --local silently produced wrong mitmproxy egress allowlists
for R2/MinIO endpoints.
F-G: Move CI key materialization inside try/catch with cleanup on
failure, so writeFileSync/chmodSync errors don't leak the mkdtemp
directory.
F-H: Extend CI key permission test to actually stat the key file and
assert mode is 0o600 while it exists, not just confirm the file
was created.
F-I: Delete dead Phase 11 DISCORD_OPERATOR_ROLE_ID warning block and
test. Upstream fro-bot/agent v0.44.0 doesn't ship role-gating yet,
and the secret file was already removed from buildSecretFileList.
F-J: Add patch changeset for @marcusrbrown/infra describing the env
passthrough fix.
F-K: Replace 4 non-null assertions on validated env vars in main()
with a typed parser (narrowValidatedEnv + ValidatedDeployEnv). Lint
warning count drops by exactly 4.
* fix(gateway): add aws-session-token secret and bump upstream to v0.44.2
The v0.44.2 compose contract (fro-bot/agent#649 + #651) adds the
optional AWS_SESSION_TOKEN_FILE bind mount and uses create_host_path:
false on all secret bind sources. Without aws-session-token in
buildSecretFileList, docker compose up would fail on a missing host
file even for deployments that don't use STS temporary credentials.
Materialize aws-session-token as empty when AWS_SESSION_TOKEN is unset,
matching the existing s3-endpoint pattern.
Bump apps/gateway/upstream.json to v0.44.2.
* fix(gateway): forward AWS_SESSION_TOKEN through CLI and CI workflow
Commit 1af69d4 added aws-session-token as an optional secret file but
missed the env forwarding paths: getGatewayDeployEnv() didn't include
it in the local deploy passthrough, and deploy-gateway.yaml didn't
expose it as a workflow_call secret or pass it to the Deploy step.
Same bug class as the S3_ENDPOINT/OBJECT_STORE_HOSTS omission Fro Bot
caught on the first pass.
- packages/cli/src/commands/gateway/deploy.ts: forward AWS_SESSION_TOKEN
- packages/cli/src/commands/gateway/deploy.test.ts: dedicated forward
test + assertion in the required-env test
- .github/workflows/deploy-gateway.yaml: declare AWS_SESSION_TOKEN as
optional workflow_call secret and forward to the Deploy step env
* docs: include AWS_SESSION_TOKEN in changeset release note
Three reliability improvements that landed from PR #649's review cycle: - readiness flag now re-arms on Discord reconnect cycles. Previously the flag was written once via clientReady and persisted for the lifetime of the process, so a permanent disconnect (rate-limit ban, gateway revocation, network partition past discord.js's retry budget) left the healthcheck green forever. Now clientReady uses on (not once) so reconnects re-write the flag, and shardDisconnect clears it so the healthcheck goes red during outages. - readOptionalSecret rejects values containing embedded newlines. AWS credentials copy-pasted from a wrapped terminal can land with mid-value newlines that pass trimEnd but break S3 request signing later with an opaque AWS error. The check applies symmetrically to file-read and env-var paths and throws with a clear, actionable message pointing to the file path or env var name. - main.ts now exposes makeGatewayProgram as an injectable factory so the startup ordering between setupReadinessFlag and client.login is observable from a unit test. A reversal of that order would silently reintroduce the stale-flag bug; the new test fails fast if anyone moves the wiring.
Three reliability improvements that landed from PR #649's review cycle: - readiness flag now re-arms on Discord reconnect cycles. Previously the flag was written once via clientReady and persisted for the lifetime of the process, so a permanent disconnect (rate-limit ban, gateway revocation, network partition past discord.js's retry budget) left the healthcheck green forever. Now clientReady uses on (not once) so reconnects re-write the flag, and shardDisconnect clears it so the healthcheck goes red during outages. - readOptionalSecret rejects values containing embedded newlines. AWS credentials copy-pasted from a wrapped terminal can land with mid-value newlines that pass trimEnd but break S3 request signing later with an opaque AWS error. The check applies symmetrically to file-read and env-var paths and throws with a clear, actionable message pointing to the file path or env var name. - main.ts now exposes makeGatewayProgram as an injectable factory so the startup ordering between setupReadinessFlag and client.login is observable from a unit test. A reversal of that order would silently reintroduce the stale-flag bug; the new test fails fast if anyone moves the wiring.
Four gaps in the v0.44.0 deploy contract surfaced during first-deploy attempts from infra-as-code consumers (marcusrbrown/infra targeting a DigitalOcean droplet). Two were already fixed; two were real.
What changes
AWS credentials reach the S3 client.
loadGatewayConfigreadsAWS_ACCESS_KEY_ID_FILE,AWS_SECRET_ACCESS_KEY_FILE, andAWS_SESSION_TOKEN_FILEas optional secrets. Pair validation enforces both-or-neither for the access-key/secret pair; an orphan session token logs a warning and falls back to the SDK default chain. The S3 adapter conditionally spreadscredentialsso deployments using IRSA or~/.awsprofiles continue to work untouched.docker compose up --waitis a real readiness gate. The Dockerfile's prior no-opnode -e 'process.exit(0)'is replaced with a probe for/tmp/gateway-readyplus PID 1 liveness. The flag is cleared at startup and written by aclientReadylistener registered beforeclient.login(), so docker compose only reports the gateway ready once Discord has confirmed the connection. The compose-level healthcheck override is removed.Compose mounts the credential secrets alongside the existing Discord secrets, using the same
_FILEenv-var convention. All three services get boundedjson-filelog rotation (10m × 3) — small droplets won't fill on a chatty incident.readOptionalSecretrejects directories. When a bind-mount source is missing on the host, Docker creates an empty directory at the mount point; the prior reader would fail with a rawEISDIRonreadFileSync. AstatSync().isFile()guard catches this and surfaces a clear operator-facing error pointing at the missing source.What stays
The SDK default credential chain still wins when no
AWS_*_FILEenv is set — existing deployers don't need to change anything to opt in.ObjectStoreConfiggains an optionalcredentialsfield, so downstream consumers compile unchanged.Verification
docker compose config -f deploy/compose.yamlvalidatesRelease notes
Patch — v0.44.x. AWS credential support and the readiness probe are additive; the directory guard exposes a previously-silent misconfiguration rather than introducing a new failure path.