Skip to content

ops(rolling-update): add KeyViz heatmap support#693

Open
bootjp wants to merge 4 commits intomainfrom
ops/rolling-update-keyviz-support
Open

ops(rolling-update): add KeyViz heatmap support#693
bootjp wants to merge 4 commits intomainfrom
ops/rolling-update-keyviz-support

Conversation

@bootjp
Copy link
Copy Markdown
Owner

@bootjp bootjp commented Apr 27, 2026

Summary

Mirror the existing ADMIN_* env-driven plumbing for the KeyViz heatmap sampler so operators can flip --keyvizEnabled and the --keyvizFanoutNodes peer list from deploy.env without editing the script per-deploy.

What's added

  • KEYVIZ_ENABLED master switch (validated as true|false at the top of the local script alongside ADMIN_ENABLED).
  • KEYVIZ_FANOUT_NODES comma-separated host:port list. Pre-quoted via printf %q for the same reason ADMIN_FULL_ACCESS_KEYS is: commas survive an unquoted env pass but pre-quoting keeps the pattern uniform.
  • build_keyviz_flags helper (nameref-output, mirrors build_admin_flags). Empty array when KEYVIZ_ENABLED != "true", so existing deploys see no behaviour change.
  • KEYVIZ_* env vars threaded through the SSH env passthrough so the heredoc-resident build_keyviz_flags reads them on the remote.
  • env.example block documenting the variables and pointing at the Phase 2-C fan-out auth follow-up (docs/design/2026_04_27_proposed_keyviz_cluster_fanout.md).

What's not added

  • KEYVIZ_STEP, KEYVIZ_MAX_TRACKED_ROUTES, KEYVIZ_HISTORY_COLUMNS, KEYVIZ_FANOUT_TIMEOUT — these have sensible binary-side defaults and operators have not asked for env-driven overrides yet. Adding them is a one-line _flags+= edit per knob if needed.
  • Inter-node fan-out auth — Phase 2-C does not yet ship a pre-shared bearer token, so a multi-node KEYVIZ_FANOUT_NODES setup surfaces 5/N peer 401s in the SPA degraded banner. The env.example comment documents this.

Verification

Deployed to a live 5-node cluster (192.168.0.210-214) with KEYVIZ_ENABLED=true and KEYVIZ_FANOUT_NODES set across all five admin listeners; --keyviz* flags appeared on every container's docker inspect, and the admin keyviz handler returned a matrix with the fanout block populated.

$ docker inspect elastickv --format '{{json .Args}}' | python3 -c "import sys,json;a=json.load(sys.stdin);print([v for v in a if v.startswith('--keyviz')])"
['--keyvizEnabled', '--keyvizFanoutNodes']

$ curl -s -b cookies.txt https://elastickv.bootjp.dev/admin/api/v1/keyviz/matrix \
  | jq '.fanout | {responded, expected}'
{
  "responded": 1,
  "expected": 6
}

(The 5/6 401s are the expected fan-out-auth gap, surfaced by the PR #687 FanoutBanner in the SPA. Tracked separately.)

Self-review (CLAUDE.md 5 lenses)

  1. Data loss — None. Build/runbook plumbing only.
  2. Concurrency — None.
  3. Performance — None on the deploy path. Per-node sampler cost is covered by the sampler's own design doc.
  4. Data consistency — None.
  5. Test coveragebash -n syntax-check passes. Existing rolling-update.sh has no automated tests in this repo; the verification path is "deploy to a real cluster and inspect docker run args."

Test plan

  • bash -n scripts/rolling-update.sh — syntax OK
  • Live 5-node deploy with KEYVIZ_ENABLED=true — verified
  • CI: shellcheck (if/when added)

@claude review

Mirror the existing ADMIN_* env-driven plumbing for the KeyViz
heatmap sampler so operators can flip --keyvizEnabled and the
--keyvizFanoutNodes peer list from deploy.env without editing the
docker run command. The previous workflow required either editing
the script per-deploy or running the binary outside the rolling-
update path.

What's added:

- KEYVIZ_ENABLED master switch (validated as true|false at the top
  of the local script alongside ADMIN_ENABLED).
- KEYVIZ_FANOUT_NODES comma-separated host:port list. Pre-quoted
  via printf %q for the same reason ADMIN_FULL_ACCESS_KEYS is:
  commas survive an unquoted env pass but pre-quoting keeps the
  pattern uniform.
- build_keyviz_flags helper (nameref-output, mirrors
  build_admin_flags). Empty array when KEYVIZ_ENABLED != "true",
  so existing deploys see no behaviour change.
- KEYVIZ_* env vars threaded through the SSH env passthrough so
  the heredoc-resident build_keyviz_flags reads them on the remote.
- env.example block documenting the variables and pointing at the
  Phase 2-C fan-out auth follow-up
  (docs/design/2026_04_27_proposed_keyviz_cluster_fanout.md).

What's not added:

- KEYVIZ_STEP, KEYVIZ_MAX_TRACKED_ROUTES, KEYVIZ_HISTORY_COLUMNS,
  KEYVIZ_FANOUT_TIMEOUT — these have sensible binary-side defaults
  and operators have not asked for env-driven overrides yet.
  Adding them is a one-line _flags+= edit per knob if needed.
- Inter-node fan-out auth — Phase 2-C does not yet ship a pre-
  shared bearer token, so a multi-node KEYVIZ_FANOUT_NODES setup
  surfaces 5/N peer 401s in the SPA degraded banner. The env.example
  comment documents this.

Verified locally: deployed to a 5-node cluster with KEYVIZ_ENABLED=true
and KEYVIZ_FANOUT_NODES set across all five admin listeners; --keyviz*
flags appeared on every container's docker inspect, and the admin
keyviz handler returned a matrix with the fan-out block populated.

Self-review (CLAUDE.md 5 lenses):
1. Data loss -- None. Build/runbook plumbing only.
2. Concurrency -- None.
3. Performance -- None on the deploy path. Per-node sampler cost is
   covered by the sampler's own design doc.
4. Data consistency -- None.
5. Test coverage -- bash -n syntax-check passes. Existing rolling-
   update.sh has no automated tests in this repo; the verification
   path is "deploy to a real cluster and inspect docker run args."
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 27, 2026

Warning

Rate limit exceeded

@bootjp has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 38 minutes and 43 seconds before requesting another review.

To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 14c1c398-c3fe-493a-a56b-ab09e17efdcf

📥 Commits

Reviewing files that changed from the base of the PR and between 303ebe2 and 76f2906.

📒 Files selected for processing (2)
  • scripts/rolling-update.env.example
  • scripts/rolling-update.sh
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch ops/rolling-update-keyviz-support

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 27, 2026

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented Apr 27, 2026

Claude finished @bootjp's task in 3m 3s —— View job


Code Review

  • Gather context and understand the request
  • Read changed files
  • Review scripts/rolling-update.sh
  • Review scripts/rolling-update.env.example
  • Post review findings

Overall: Clean, backward-compatible PR. The env-driven plumbing faithfully mirrors the ADMIN_* pattern. No correctness bugs found. A few minor style/consistency nits follow.


scripts/rolling-update.sh

1. Nameref quote inconsistency (rolling-update.sh:873)

# build_keyviz_flags
local -n _flags=$1       # no quotes

# build_admin_flags
local -n _flags="$1"     # quoted

$1 is always a bare variable name here so it's functionally harmless, but omitting the quotes is inconsistent with the established pattern. Fix this →


2. Env vars read inline rather than snapshotted into locals (rolling-update.sh:874,878)

build_admin_flags opens with a locals-first pattern explicitly called out in a comment (referencing PR #678 where a missing local caused a bug):

local enabled="${ADMIN_ENABLED:-false}"   # snapshot into local

build_keyviz_flags reads the vars inline instead:

if [[ "${KEYVIZ_ENABLED:-false}" != "true" ]]; then   # direct global read
  ...
if [[ -n "${KEYVIZ_FANOUT_NODES:-}" ]]; then           # direct global read

With only two vars this is low-risk today, but it's inconsistent with the defence-in-depth rationale documented in build_admin_flags. Worth snapshotting for symmetry, especially since future knobs (KEYVIZ_STEP, etc.) will expand this list. Fix this →


3. Stale comment at line 1230

# The two boolean flags (ADMIN_ENABLED, ADMIN_ALLOW_*) are validated
# at the top of the local script to be the literal "true" or "false"...

KEYVIZ_ENABLED was added to that same validation loop, so the comment is now inaccurate. Fix this →


scripts/rolling-update.env.example

4. Ambiguous 401 note (line 90)

# Phase 2-C does not yet ship inter-node auth, so peers reject the
# fan-out call with 401 unless --adminEnabled is off on the peer.

"unless --adminEnabled is off on the peer" is easy to misread. If --adminEnabled is OFF, the admin listener doesn't exist — you'd get a connection refused, not a 401. The intended meaning is that a peer running with --adminEnabled (but without the future bearer-token support) rejects fan-out with 401; a peer without --adminEnabled has an unauthenticated keyviz endpoint and responds 200. Consider rewording:

# Phase 2-C does not yet ship inter-node auth, so peers running with
# --adminEnabled will reject the fan-out call with 401; peers without
# --adminEnabled expose an unauthenticated keyviz endpoint and respond
# normally.

Fix this →


Non-issues (confirmed correct)

  • KEYVIZ_FANOUT_NODES_Q ordering — assigned at line 1245 (after update_one_node is defined, before it's called at line 1249). Correct.
  • KEYVIZ_ENABLED passed unquoted in the env block — validated to be "true" or "false" at the top of the local script, exactly like ADMIN_ENABLED. Safe.
  • build_keyviz_flags defined after run_container — both are inside the SSH heredoc (lines 538–1099). All functions are defined before any are called. No forward-reference issue.
  • Referenced design doc (docs/design/2026_04_27_proposed_keyviz_cluster_fanout.md) exists in-tree. ✓
  • Backward compatibilityKEYVIZ_ENABLED defaults to false; build_keyviz_flags returns an empty array when disabled. Existing deploys see no change. ✓

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces support for the KeyViz heatmap sampler in the rolling update scripts. It adds new environment variables, KEYVIZ_ENABLED and KEYVIZ_FANOUT_NODES, to the example configuration and the update script, ensuring the boolean flag is validated and the settings are correctly passed to the container as CLI flags. Feedback was provided to refactor the build_keyviz_flags function to use local variables and quoted namerefs for better consistency and safety.

Comment thread scripts/rolling-update.sh
bootjp and others added 3 commits April 28, 2026 01:30
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Address claude bot review on 6c7ae79. Findings 1 and 2 already
landed in 94ade0d (gemini's nameref-quote + locals-snapshot
suggestion accepted by the author). This commit picks up the two
remaining items:

- Stale comment at scripts/rolling-update.sh:1233 still said
  "two boolean flags (ADMIN_ENABLED, ADMIN_ALLOW_*)" -- the
  validation loop above also covers KEYVIZ_ENABLED now. Rewrite
  to "boolean flags (ADMIN_ENABLED, ADMIN_ALLOW_*, KEYVIZ_ENABLED)".

- env.example fan-out note was both ambiguously phrased and
  outdated. PR #692 ("forward session cookie on KeyViz fan-out so
  peers do not 401") shipped on main while this PR was in review,
  so the prior "Phase 2-C does not yet ship inter-node auth" note
  is no longer accurate. Rewrite to describe the as-shipped
  behaviour: the aggregator forwards the operator's session
  cookie, so peers with --adminEnabled accept the call as long as
  the same signing key and role allow-lists are configured
  cluster-wide; peers without --adminEnabled expose an
  unauthenticated endpoint and respond unconditionally.

bash -n scripts/rolling-update.sh passes.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 27, 2026

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented Apr 27, 2026

Claude finished @bootjp's task in 3m 1s —— View job


Code Review

  • Gather context and understand the request
  • Read changed files (scripts/rolling-update.sh, scripts/rolling-update.env.example)
  • Analyze code quality, consistency, and correctness
  • Post review findings

Overall: The PR is clean and well-structured. The four nits from the previous Claude review (nameref quoting, local-var snapshotting, stale boolean-flags comment, ambiguous 401 wording in env.example) have all been addressed in the latest commit. One new edge case is worth noting:


scripts/rolling-update.sh

Finding: printf '%q' applied to KEYVIZ_FANOUT_NODES before emptiness check (rolling-update.sh:1248)

printf '%q' "" emits the two-character string ''. That value travels through the SSH env block and lands in the remote shell as a non-empty variable. When KEYVIZ_ENABLED=true but KEYVIZ_FANOUT_NODES is unset (the default), build_keyviz_flags sees fanout_nodes='\'\'', [[ -n "$fanout_nodes" ]] fires, and the docker invocation gets --keyvizFanoutNodes ''.

The ADMIN pattern has the same root issue but is guarded by file-existence checks and required-field validation before any flag is emitted. KEYVIZ_FANOUT_NODES has no such guard.

The comment on line 1245 already acknowledges that commas survive unquoted, so the simplest fix is to only apply printf %q when the value is non-empty:

# Before (line 1248):
KEYVIZ_FANOUT_NODES_Q="$(printf '%q' "$KEYVIZ_FANOUT_NODES")"

# After:
if [[ -n "$KEYVIZ_FANOUT_NODES" ]]; then
  KEYVIZ_FANOUT_NODES_Q="$(printf '%q' "$KEYVIZ_FANOUT_NODES")"
else
  KEYVIZ_FANOUT_NODES_Q=""
fi

This matches the intent of the guard in build_keyviz_flags and is safe because commas, colons, and dots in host:port lists need no shell-quoting. Fix this →

Severity: low — only affects deployments where KEYVIZ_ENABLED=true without KEYVIZ_FANOUT_NODES set. The binary likely rejects or ignores '' as a fanout list, but it's worth closing before the flag surfaces in more deployments.


Confirmed correct (non-issues)

  • Previous review nits — all four addressed in 94ade0d: nameref quoting, local-var snapshotting, stale boolean-flags comment, and env.example 401 wording. ✓
  • build_keyviz_flags defined after run_container — both are inside the SSH heredoc; bash resolves function references at call time, not definition time. ✓
  • Boolean pass-through without %qKEYVIZ_ENABLED is validated to be true|false before the env block, matching ADMIN_ENABLED. ✓
  • KEYVIZ_FANOUT_NODES_Q ordering — computed on line 1248, used in update_one_node called on line 1252. ✓
  • Backward compatibilityKEYVIZ_ENABLED=false (default) causes build_keyviz_flags to return immediately. No change for existing deploys. ✓
  • Referenced design doc docs/design/2026_04_27_proposed_keyviz_cluster_fanout.md exists in-tree. ✓

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 76f290618f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +94 to +95
# --adminEnabled expose an unauthenticated keyviz endpoint and
# respond unconditionally.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Fix incorrect auth guidance for non-admin fanout peers

This comment says peers without --adminEnabled expose an unauthenticated keyviz endpoint, but that does not match the current server behavior: startAdminFromFlags returns immediately when admin is disabled (main_admin.go), so those peers do not expose /admin/api/v1/keyviz/matrix at all, and the route is documented as auth-protected when admin is running (internal/admin/server.go). Following this guidance can lead operators to include non-admin nodes in KEYVIZ_FANOUT_NODES and get avoidable fan-out failures (connection/refused or degraded peer status).

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant