
scripts: forward ADMIN_* env vars to remote SSH heredoc (follow-up to #669)#678

Merged
bootjp merged 3 commits into main from
fix/admin-rolling-env-forward
Apr 27, 2026

Conversation

@bootjp
Owner

@bootjp bootjp commented Apr 26, 2026

Summary

Follow-up to #669 (now merged). The admin-flag plumbing I added in
that PR has a real bug I caught while re-reading the script: every
rollout would crash on the first remote node because the ADMIN_*
variables were never forwarded across the SSH heredoc.

What was broken

build_admin_flags lives inside the remote SSH heredoc in
update_one_node (bash -s <<'REMOTE'), but the env block that
seeds the remote shell only forwarded the existing IMAGE /
RAFT_PORT / EXTRA_ENV / etc. variables — no ADMIN_*. With
set -u active on the remote, the first access of
${ADMIN_ENABLED} inside build_admin_flags crashes the rollout
regardless of whether admin is enabled (the helper is invoked
unconditionally from run_container).
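
The failure mode can be reproduced without SSH. The following is only a sketch under stated assumptions: `demo_remote` is a hypothetical helper that pipes a quoted heredoc into a child `bash -s`, standing in for the `ssh` call in update_one_node, and the body of `build_admin_flags` is reduced to the single unguarded reference that matters.

```shell
#!/usr/bin/env bash
# demo_remote is hypothetical: it runs a quoted heredoc in a child bash,
# standing in for `ssh <host> bash -s <<'REMOTE'`. Its arguments are passed
# to env(1) as VAR=value pairs, mimicking the script's forwarding block.
demo_remote() {
  env -u ADMIN_ENABLED "$@" bash -s <<'REMOTE' 2>&1
set -euo pipefail
build_admin_flags() {
  # Unguarded reference: fatal under `set -u` when nothing was forwarded.
  if [[ "${ADMIN_ENABLED}" != "true" ]]; then
    return 0
  fi
  echo "admin flags built"
}
build_admin_flags
echo "rollout continues"
REMOTE
}

# The local default lives only in the control shell, not in the child.
ADMIN_ENABLED=false

# Without forwarding, the child aborts on the first reference.
not_forwarded="$(demo_remote || true)"
# With forwarding, the disabled path returns early and the rollout goes on.
forwarded="$(demo_remote ADMIN_ENABLED="${ADMIN_ENABLED}")"

echo "${not_forwarded}"
echo "${forwarded}"
```

The first call shows that the local default of `false` does not help: the child shell never sees it unless it is passed explicitly.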

So an operator running this script after #669 with the default
ADMIN_ENABLED=false would have seen ADMIN_ENABLED: unbound variable
on the first node touched. At most one node would have been
restarted and the rest of the cluster left intact, because the
per-node health check exits non-zero before the script moves on.

Fix

  1. Forward all 9 ADMIN_* variables through env, alongside
    the existing forwarding pattern. Path-like values (*_FILE,
    *_KEYS, ADDRESS) get printf %q quoting at the bottom of
    the local script (matches the existing RAFT_TO_REDIS_MAP_Q
    etc. pattern). The three boolean flags (ADMIN_ENABLED,
    ADMIN_ALLOW_PLAINTEXT_NON_LOOPBACK, ADMIN_ALLOW_INSECURE_DEV_COOKIE)
    are forwarded unquoted for readability — but only after a
    local validation pass that rejects anything other than the
    literal true / false so the unquoted forwarding stays
    metacharacter-safe.

  2. Defense-in-depth :- defaults inside build_admin_flags.
    Every ADMIN_* reference inside the helper now reads through
    ${VAR:-} once at the top into a local. A future refactor
    that ever drops one of the forwarded variables will produce
    the targeted "ADMIN_* required" error instead of an opaque
    unbound variable crash with no hint at which variable.
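
As a rough illustration of step 1, the sketch below validates a boolean, quotes a path-like value with `printf %q`, and sends both through one extra shell parse. It rests on assumptions: `bash -c` stands in for the remote login shell that re-parses the ssh command line, the key path is made up, and the `_Q` variable name only follows the pattern described above rather than quoting the script.

```shell
#!/usr/bin/env bash
# Hypothetical values; only the variable names come from the PR.
ADMIN_ENABLED=true
ADMIN_SESSION_SIGNING_KEY_FILE='/etc/demo/signing key;$(id).pem'

# Booleans: validate first so the unquoted forward cannot carry metacharacters.
case "${ADMIN_ENABLED}" in
  true|false) ;;
  *)
    echo "ADMIN_ENABLED must be 'true' or 'false', got '${ADMIN_ENABLED}'" >&2
    exit 1
    ;;
esac

# Path-like values: quote once locally so they survive one more shell parse.
ADMIN_SESSION_SIGNING_KEY_FILE_Q="$(printf '%q' "${ADMIN_SESSION_SIGNING_KEY_FILE}")"

# ssh joins its arguments into a single string that the remote shell parses
# again; `bash -c` plays that role in this local sketch.
remote_cmd="env ADMIN_ENABLED=${ADMIN_ENABLED} ADMIN_SESSION_SIGNING_KEY_FILE=${ADMIN_SESSION_SIGNING_KEY_FILE_Q} bash -s"

received="$(bash -c "${remote_cmd}" <<'REMOTE'
set -euo pipefail
printf '%s|%s\n' "${ADMIN_ENABLED}" "${ADMIN_SESSION_SIGNING_KEY_FILE}"
REMOTE
)"
echo "${received}"
```

The awkward path arrives byte-for-byte intact, and the `$(id)` inside it is never executed, which is the property the quoting is meant to provide.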

Test plan

  • bash -n scripts/rolling-update.sh — passes
  • ADMIN_ENABLED=invalid bash scripts/rolling-update.sh →
    "must be 'true' or 'false', got 'invalid'"
  • ADMIN_ALLOW_PLAINTEXT_NON_LOOPBACK=yes → same validator catches it
  • ADMIN_ENABLED=true (no signing key) → reaches the remote branch
    where build_admin_flags would abort ('aborting') on the missing key
  • End-to-end rollout against a 3-node staging cluster with
    ADMIN_ENABLED=true (operator to verify before merging into
    a production deploy.env)
  • End-to-end rollout with ADMIN_ENABLED=false (the previously
    broken default path)

Summary by CodeRabbit

  • Chores
    • Improved validation and error reporting for deployment configuration parameters.
    • Enhanced environment variable forwarding and path handling in deployment scripts to increase reliability.

The admin-flag plumbing I added in cadffe4 had a real bug:
build_admin_flags lives inside the remote SSH heredoc (the `bash -s
<<'REMOTE'` block in update_one_node), but the env block that
forwards values from local to remote did not include any ADMIN_*
variable. With `set -u` active inside the heredoc, the first
access of `${ADMIN_ENABLED}` inside build_admin_flags would have
crashed every rollout that landed on a node where ADMIN_ENABLED
was unset on the remote — i.e., all of them, because the local
defaults at line 172 only populate the local control shell, not
the SSH target.

The bug only fires when run_container is reached on a remote node
(every rollout), so a deploy attempt with ADMIN_ENABLED=false
also crashed: build_admin_flags was invoked unconditionally at
line 800. The script was effectively unusable in this state.

Fix is two-part:

1. **Forward every ADMIN_* through env**. Added 9 ADMIN_*
   variables to the env block in update_one_node, alongside the
   existing IMAGE / RAFT_PORT / EXTRA_ENV forwarding. Path-like
   values get printf %q quoting at the bottom of the local
   script (matches the existing _Q variants for
   RAFT_TO_REDIS_MAP, RAFT_TO_S3_MAP, etc.). The three boolean
   flags (ADMIN_ENABLED, ADMIN_ALLOW_PLAINTEXT_NON_LOOPBACK,
   ADMIN_ALLOW_INSECURE_DEV_COOKIE) are forwarded unquoted for
   readability — but only after a local validation pass that
   rejects anything other than the literal "true" / "false" so
   the unquoted forwarding stays metacharacter-safe.

2. **Defense-in-depth `:-` defaults inside build_admin_flags**.
   Even with explicit env forwarding, a future refactor that
   drops one of the variables would surface as an opaque
   "unbound variable" crash with no hint at which variable. The
   helper now reads each variable through `${VAR:-}` once at the
   top, then refers to the locals — so a missing forward would
   produce the targeted "ADMIN_* required" error instead.

Smoke-tested with three cases:
  - ADMIN_ENABLED=invalid        → "must be 'true' or 'false'"
  - ADMIN_ALLOW_PLAINTEXT_…=yes  → same validator catches it
  - ADMIN_ENABLED=true (no key)  → reaches the remote branch
                                   (where the remote build_admin_flags
                                   would abort with 'aborting' on the
                                   missing signing key)

`bash -n scripts/rolling-update.sh` passes.

@coderabbitai

coderabbitai Bot commented Apr 26, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6a631592-5718-4d27-a593-4a1ed31fde85

📥 Commits

Reviewing files that changed from the base of the PR and between 2bae541 and 3fe2c45.

📒 Files selected for processing (1)
  • scripts/rolling-update.sh

📝 Walkthrough

Walkthrough

The scripts/rolling-update.sh script now includes enhanced validation of boolean ADMIN_* flags as literal "true"/"false" strings locally. It forwards all required ADMIN_* configuration values to remote hosts while properly escaping path-like values using printf '%q' to preserve integrity through SSH parsing. The remote build logic is refactored to use localized defaults with required-file and TLS pair validation.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Admin Configuration Validation & Forwarding: scripts/rolling-update.sh | Added strict local validation of boolean ADMIN_* flags; enhanced environment forwarding during node updates with full required ADMIN_* configuration values; implemented shell-escaping of path-like values via printf '%q' for SSH safety; refactored remote build_admin_flags to use localized defaults with :- operator, required-file checks, TLS pair validation, and flag construction. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 Hops through configs with careful grace,
Validating flags in every place,
printf '%q' guards the paths from strife,
SSH echoes ripple through the night,
True and false, so crisp and bright,
Rolling updates done just right!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: forwarding ADMIN_* environment variables to a remote SSH heredoc as a follow-up fix to a previous issue (#669). |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@bootjp
Owner Author

bootjp commented Apr 26, 2026

@claude review

@claude

claude Bot commented Apr 26, 2026

Claude finished @bootjp's task in 2m 12s


Code Review

  • Read the changed file
  • Analyze the diff vs origin/main
  • Review for correctness, security, and shell best practices
  • Post comprehensive feedback

Overall: LGTM with one minor nit. The fix correctly identifies and resolves the set -u crash. The approach is consistent with the established forwarding pattern, and the defense-in-depth changes are well-motivated. No blocking issues.


The Core Fix

The root cause analysis is accurate. build_admin_flags runs inside the bash -s <<'REMOTE' heredoc under set -euo pipefail, and set -u makes any reference to an unset variable an immediate fatal error. The env block only forwarded ~17 variables — none of the 9 ADMIN_* ones — so the very first ${ADMIN_ENABLED} reference in the function would crash with "unbound variable" regardless of whether admin was enabled. The fix is complete and correct.

Validation loop (scripts/rolling-update.sh lines 190–199)

for _bool_var in ADMIN_ENABLED ADMIN_ALLOW_PLAINTEXT_NON_LOOPBACK ADMIN_ALLOW_INSECURE_DEV_COOKIE; do
  case "${!_bool_var}" in
    true|false) ;;
    *)
      echo "rolling-update: ${_bool_var} must be 'true' or 'false', got '${!_bool_var}'" >&2
      exit 1
      ;;
  esac
done
unset _bool_var

Correct. ${!_bool_var} (bash indirect expansion) is fine given the script already requires bash 4.x (local -n, [[ ]], arrays with +=). The unset is good hygiene. The early-exit error message names the offending variable, which is exactly what an operator needs.
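
For anyone who wants to exercise the validator in isolation, here is a hedged sketch. `validate_admin_bools` is a hypothetical wrapper, not a function in the script: it reuses the loop quoted above but returns non-zero instead of calling `exit`, and adds a `:-` default so it can also be called with a variable unset.

```shell
#!/usr/bin/env bash
# validate_admin_bools is a hypothetical wrapper around the quoted loop.
# It returns 1 instead of exiting so it can be called more than once.
validate_admin_bools() {
  local _bool_var
  for _bool_var in ADMIN_ENABLED ADMIN_ALLOW_PLAINTEXT_NON_LOOPBACK ADMIN_ALLOW_INSECURE_DEV_COOKIE; do
    # ${!name:-} is bash indirect expansion with an empty default.
    case "${!_bool_var:-}" in
      true|false) ;;
      *)
        echo "rolling-update: ${_bool_var} must be 'true' or 'false', got '${!_bool_var:-}'" >&2
        return 1
        ;;
    esac
  done
}

# Defaults as described in the PR: everything off.
ADMIN_ENABLED=false
ADMIN_ALLOW_PLAINTEXT_NON_LOOPBACK=false
ADMIN_ALLOW_INSECURE_DEV_COOKIE=false

defaults_status=0
validate_admin_bools || defaults_status=$?

# A non-literal value is rejected and the message names the variable.
bad_status=0
bad_message="$(ADMIN_ALLOW_PLAINTEXT_NON_LOOPBACK=yes; validate_admin_bools 2>&1)" || bad_status=$?
echo "${bad_message}"
```

Because only the literals `true` and `false` pass, a value such as `true; reboot` is rejected before it could ever be spliced unquoted into the remote command line.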

Forwarding pattern consistency

Path-like vars get printf '%q' (matching RAFT_TO_REDIS_MAP_Q, S3_CREDENTIALS_FILE_Q, etc.). Booleans are forwarded unquoted after being proven metacharacter-safe by the validation loop. Empty-string printf '%q' produces '', which survives the SSH argument serialization → remote login shell reparse → env parsing chain intact (as the existing RAFT_TO_REDIS_MAP= case already demonstrates). The approach is sound and consistent.
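
The empty-string claim can be checked locally with a small sketch. Assumptions: `roundtrip` and `DEMO_VALUE` are hypothetical names, and `bash -c` stands in for the remote login shell's reparse of the ssh command line.

```shell
#!/usr/bin/env bash
# roundtrip is hypothetical: quote a value with %q, splice it into a command
# string, let a second shell parse that string, and print what arrives.
roundtrip() {
  local quoted
  quoted="$(printf '%q' "$1")"
  bash -c "env DEMO_VALUE=${quoted} bash -s" <<'REMOTE'
set -u
printf '[%s]' "${DEMO_VALUE}"
REMOTE
}

empty_quoted="$(printf '%q' "")"   # bash renders the empty string as ''
empty_result="$(roundtrip "")"
spaced_result="$(roundtrip "a b;c")"

echo "${empty_quoted} ${empty_result} ${spaced_result}"
```

An empty value arrives as a set-but-empty variable, so `set -u` on the far side stays quiet, and a value containing a space and a semicolon is delivered as a single word.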

Defense-in-depth :- defaults in build_admin_flags

Using local signing_key="${ADMIN_SESSION_SIGNING_KEY_FILE:-}" (and similar) as a first read is a good pattern. If a future refactor drops a forwarded variable, the operator gets the explicit "ADMIN_ENABLED=true requires …" error rather than the opaque "unbound variable: ADMIN_SESSION_SIGNING_KEY_FILE". The :-false default on the boolean reads is the safe choice — not forwarding a boolean defaults to "feature off" rather than crash. ✓
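
A simplified sketch of that pattern is below. It is not the real helper: `build_admin_flags_demo` is a hypothetical reduction to three variables, and the error text is illustrative rather than the script's actual message.

```shell
#!/usr/bin/env bash
# build_admin_flags_demo is a simplified, hypothetical stand-in for the real
# helper. The error text is illustrative, not the script's actual message.
build_admin_flags_demo() {
  # Read each global once, through a default, so `set -u` cannot fire here.
  local enabled="${ADMIN_ENABLED:-false}"
  local signing_key="${ADMIN_SESSION_SIGNING_KEY_FILE:-}"
  local insecure_cookie="${ADMIN_ALLOW_INSECURE_DEV_COOKIE:-false}"
  local flags=()

  if [[ "${enabled}" != "true" ]]; then
    return 0
  fi
  if [[ -z "${signing_key}" ]]; then
    echo "demo: ADMIN_ENABLED=true requires ADMIN_SESSION_SIGNING_KEY_FILE" >&2
    return 1
  fi
  if [[ "${insecure_cookie}" == "true" ]]; then
    flags+=(--adminAllowInsecureDevCookie)
  fi
  # ${flags[*]-} keeps older bash versions quiet when the array is empty.
  echo "${flags[*]-}"
}

# Case 1: nothing forwarded at all; the helper treats admin as off.
off_status=0
off_output="$(unset ADMIN_ENABLED ADMIN_SESSION_SIGNING_KEY_FILE ADMIN_ALLOW_INSECURE_DEV_COOKIE
  set -u; build_admin_flags_demo 2>&1)" || off_status=$?

# Case 2: admin is on, but the key path was dropped from the forwarding.
missing_status=0
missing_output="$(unset ADMIN_SESSION_SIGNING_KEY_FILE ADMIN_ALLOW_INSECURE_DEV_COOKIE
  set -u; ADMIN_ENABLED=true; build_admin_flags_demo 2>&1)" || missing_status=$?

echo "${missing_output}"
```

In both cases `set -u` is active and never triggers; the second case fails with a message that names the missing variable, which is the behaviour the pattern is after.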


One nit: comment says "two" boolean flags but there are three

scripts/rolling-update.sh line ~1192:

# The two boolean flags (ADMIN_ENABLED, ADMIN_ALLOW_*) are validated

There are actually three boolean flags: ADMIN_ENABLED, ADMIN_ALLOW_PLAINTEXT_NON_LOOPBACK, and ADMIN_ALLOW_INSECURE_DEV_COOKIE. The validation loop above it handles all three correctly; only the count in the comment is off.


End-to-end coverage

The two unchecked test plan items (staging cluster with ADMIN_ENABLED=true and with ADMIN_ENABLED=false) are the right ones to flag — they're the only way to confirm the env forwarding actually reaches build_admin_flags on real nodes. The local bash -n + manual flag tests are a reasonable pre-merge gate for a script change of this scope.


Result: Approve once the "two" → "three" nit is addressed (or left as-is — it's cosmetic). The functional fix is correct.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request enhances the rolling-update.sh script by introducing validation for ADMIN_* boolean flags, applying shell-safe escaping to configuration variables, and updating the build_admin_flags function with local variable defaults to prevent unbound variable errors. The review feedback recommends further refactoring build_admin_flags to consistently use local variables for all ADMIN_* parameters, ensuring a uniform defense-in-depth approach.

Comment thread scripts/rolling-update.sh Outdated
Comment on lines +876 to +886
if [[ "${ADMIN_ENABLED:-false}" != "true" ]]; then
return 0
fi

- if [[ -z "${ADMIN_SESSION_SIGNING_KEY_FILE}" ]]; then
+ local signing_key="${ADMIN_SESSION_SIGNING_KEY_FILE:-}"
+ local full_keys="${ADMIN_FULL_ACCESS_KEYS:-}"
+ local read_only_keys="${ADMIN_READ_ONLY_ACCESS_KEYS:-}"
+ local previous_key="${ADMIN_SESSION_SIGNING_KEY_PREVIOUS_FILE:-}"
+ local admin_listen="${ADMIN_ADDRESS:-}"
+ local tls_cert="${ADMIN_TLS_CERT_FILE:-}"
+ local tls_key="${ADMIN_TLS_KEY_FILE:-}"
Contributor

medium

For consistency and to fully adhere to the "defense-in-depth" strategy described in the function's comment and the PR summary, all ADMIN_* variables should be read into local variables with defaults at the start of the function. This ensures that even if a variable is missing from the env block in a future refactor, the script fails gracefully with a clear error message instead of an unbound variable crash.

Suggested change
- if [[ "${ADMIN_ENABLED:-false}" != "true" ]]; then
-   return 0
- fi
- if [[ -z "${ADMIN_SESSION_SIGNING_KEY_FILE}" ]]; then
- local signing_key="${ADMIN_SESSION_SIGNING_KEY_FILE:-}"
- local full_keys="${ADMIN_FULL_ACCESS_KEYS:-}"
- local read_only_keys="${ADMIN_READ_ONLY_ACCESS_KEYS:-}"
- local previous_key="${ADMIN_SESSION_SIGNING_KEY_PREVIOUS_FILE:-}"
- local admin_listen="${ADMIN_ADDRESS:-}"
- local tls_cert="${ADMIN_TLS_CERT_FILE:-}"
- local tls_key="${ADMIN_TLS_KEY_FILE:-}"
+ local enabled="${ADMIN_ENABLED:-false}"
+ if [[ "$enabled" != "true" ]]; then
+   return 0
+ fi
+ local signing_key="${ADMIN_SESSION_SIGNING_KEY_FILE:-}"
+ local full_keys="${ADMIN_FULL_ACCESS_KEYS:-}"
+ local read_only_keys="${ADMIN_READ_ONLY_ACCESS_KEYS:-}"
+ local previous_key="${ADMIN_SESSION_SIGNING_KEY_PREVIOUS_FILE:-}"
+ local admin_listen="${ADMIN_ADDRESS:-}"
+ local tls_cert="${ADMIN_TLS_CERT_FILE:-}"
+ local tls_key="${ADMIN_TLS_KEY_FILE:-}"
+ local allow_plaintext="${ADMIN_ALLOW_PLAINTEXT_NON_LOOPBACK:-false}"
+ local insecure_cookie="${ADMIN_ALLOW_INSECURE_DEV_COOKIE:-false}"

Comment thread scripts/rolling-update.sh Outdated
Comment on lines 947 to 952
if [[ "${ADMIN_ALLOW_PLAINTEXT_NON_LOOPBACK:-false}" == "true" ]]; then
_flags+=(--adminAllowPlaintextNonLoopback)
fi
- if [[ "${ADMIN_ALLOW_INSECURE_DEV_COOKIE}" == "true" ]]; then
+ if [[ "${ADMIN_ALLOW_INSECURE_DEV_COOKIE:-false}" == "true" ]]; then
_flags+=(--adminAllowInsecureDevCookie)
fi
Contributor

medium

Use the local variables defined at the top of the function for consistency and to avoid direct references to potentially unbound global variables.

Suggested change
- if [[ "${ADMIN_ALLOW_PLAINTEXT_NON_LOOPBACK:-false}" == "true" ]]; then
-   _flags+=(--adminAllowPlaintextNonLoopback)
- fi
- if [[ "${ADMIN_ALLOW_INSECURE_DEV_COOKIE}" == "true" ]]; then
- if [[ "${ADMIN_ALLOW_INSECURE_DEV_COOKIE:-false}" == "true" ]]; then
-   _flags+=(--adminAllowInsecureDevCookie)
- fi
+ if [[ "$allow_plaintext" == "true" ]]; then
+   _flags+=(--adminAllowPlaintextNonLoopback)
+ fi
+ if [[ "$insecure_cookie" == "true" ]]; then
+   _flags+=(--adminAllowInsecureDevCookie)
+ fi

Two gemini medium findings on PR #678 caught that the
defense-in-depth localization I added was incomplete:
ADMIN_SESSION_SIGNING_KEY_FILE et al got `${VAR:-}` defaults into
locals at the top of the helper, but ADMIN_ENABLED and the two
ADMIN_ALLOW_* booleans were still accessed directly from the
calling environment further down in the function.

The gap defeated the comment's own claim. If a future refactor
ever drops one of those three booleans from the env forwarding,
`set -u` would crash on `${ADMIN_ENABLED}` (and the code path
below would silently fall through to defaults for the two
ALLOW_* flags, masking the misconfiguration). The point of the
local-with-default pattern is that every ADMIN_* reference goes
through one place where the safety net is guaranteed.

Localized all nine into `enabled`, `signing_key`, `full_keys`,
`read_only_keys`, `previous_key`, `admin_listen`, `tls_cert`,
`tls_key`, `allow_plaintext`, `insecure_cookie`. The two
ALLOW_* check sites at the bottom now read the locals instead of
re-fetching the globals — same value, but consistent with the
rest of the helper and the comment's contract.

No behaviour change for any valid input. Smoke-tested both
boolean validators (`ADMIN_ENABLED=invalid` and
`ADMIN_ALLOW_PLAINTEXT_NON_LOOPBACK=yes`) — local script-level
errors still fire with the targeted message before reaching
update_one_node.
@bootjp
Owner Author

bootjp commented Apr 27, 2026

@claude review

Addressed both gemini medium findings (build_admin_flags localization gap):

  • Localized ADMIN_ENABLED → enabled, ADMIN_ALLOW_PLAINTEXT_NON_LOOPBACK → allow_plaintext, ADMIN_ALLOW_INSECURE_DEV_COOKIE → insecure_cookie at the top of the function alongside the existing path-like locals.
  • Updated the two ALLOW_* check sites at the bottom to read the locals instead of re-fetching the globals.

The previous shape mixed localized path vars with direct global access on the booleans, which defeated the helper's own defense-in-depth comment — set -u would still have crashed on ${ADMIN_ENABLED} if a future refactor dropped that variable from env forwarding. Now every ADMIN_* reference inside the helper goes through one place where the safety net is guaranteed.

Latest commit: 3fe2c45. bash -n passes; both boolean validators still fire with the targeted message.

@claude

claude Bot commented Apr 27, 2026

Claude finished @bootjp's task in 1m 11s


Code Review

  • Read the changed file
  • Analyze the latest commit addressing gemini findings
  • Verify build_admin_flags localization is complete and correct
  • Check for any remaining issues
  • Post comprehensive review

LGTM. The gemini findings are fully and correctly addressed. All 9 ADMIN_* variables are now localized at the top of build_admin_flags, and every reference inside the function goes through a local — the defense-in-depth comment's contract is now actually satisfied.


What the latest commit does

The previous shape of build_admin_flags localized the 6 path-like vars (signing_key, full_keys, etc.) but kept direct global reads for ADMIN_ENABLED and the two ALLOW_* booleans. The commit:

  1. Pulls ADMIN_ENABLED into local enabled before the early-return guard (if [[ "$enabled" != "true" ]]), so the guard itself now reads through the safety net.
  2. Adds local allow_plaintext and local insecure_cookie alongside the other path locals.
  3. Updates the two flag-append sites at the bottom (lines 953–958) to read $allow_plaintext / $insecure_cookie instead of ${ADMIN_ALLOW_PLAINTEXT_NON_LOOPBACK:-false} / ${ADMIN_ALLOW_INSECURE_DEV_COOKIE:-false}.

The result: all 9 ADMIN_* globals are read exactly once — at the top of the function — and the rest of the function is entirely in terms of locals. A future refactor that drops one of the forwarded variables now produces the targeted "ADMIN_* required" error instead of an opaque set -u crash, exactly as the function comment claims.


One outstanding nit (cosmetic, carry-over from the previous review)

scripts/rolling-update.sh:1198:

# The two boolean flags (ADMIN_ENABLED, ADMIN_ALLOW_*) are validated

There are three boolean flags: ADMIN_ENABLED, ADMIN_ALLOW_PLAINTEXT_NON_LOOPBACK, and ADMIN_ALLOW_INSECURE_DEV_COOKIE. The validation loop on line 190 handles all three correctly; only the count in the prose is off.

This is cosmetic — it doesn't affect correctness. Fine to fold into this PR or defer.


Result: Approve. The gemini findings are fully resolved; the defense-in-depth is now complete and consistent with its own documentation.

@bootjp bootjp merged commit df5ea5e into main Apr 27, 2026
9 checks passed
@bootjp bootjp deleted the fix/admin-rolling-env-forward branch April 27, 2026 07:20
bootjp added a commit that referenced this pull request Apr 27, 2026
## Summary

State-refresh of `docs/design/2026_04_24_partial_admin_dashboard.md` now
that all four phases of the original P1–P4 plan have shipped.

| Phase | Was | Now |
|---|---|---|
| P1 | shipped | shipped (no change) |
| P2 | partial: read-only #658 only | **shipped** (+ #669 writes + #673 forwarding) |
| P3 | shipped | shipped (no change) |
| P4 | mostly | **shipped** (+ operator doc #674, deployment runbook #669, script wiring #669+#678) |

Also added the **AdminDeleteBucket TOCTOU** to Outstanding open items —
coderabbitai flagged it during PR #669 review (pre-existing race,
`s3.go:deleteBucket` inherits the same shape; recorded as a code comment
on `AdminDeleteBucket` for the future fix).

Doc stays at `_partial_` rather than promoting to `_implemented_`
because two outstanding items remain: AdminForward acceptance criterion
5 (rolling-upgrade flag, explicitly deferred at design time) and the new
TOCTOU. Updated the closing rename trigger to cover both.

## Test plan

- [x] No design changes — this is a state refresh
- [x] Status table entries cross-checked against the merge commits on
main
- [x] AdminDeleteBucket TOCTOU description matches the code comment in
adapter/s3_admin.go and the operator-side workaround in
docs/admin_deployment.md