ops(rolling-update): add KeyViz heatmap support by bootjp · Pull Request #693 · bootjp/elastickv

bootjp · 2026-04-27T16:17:36Z

Summary

Mirror the existing ADMIN_* env-driven plumbing for the KeyViz heatmap sampler so operators can flip --keyvizEnabled and the --keyvizFanoutNodes peer list from deploy.env without editing the script per-deploy.

What's added

KEYVIZ_ENABLED master switch (validated as true|false at the top of the local script alongside ADMIN_ENABLED).
KEYVIZ_FANOUT_NODES comma-separated host:port list. Pre-quoted via printf %q for the same reason ADMIN_FULL_ACCESS_KEYS is: commas survive an unquoted env pass but pre-quoting keeps the pattern uniform.
build_keyviz_flags helper (nameref-output, mirrors build_admin_flags). Empty array when KEYVIZ_ENABLED != "true", so existing deploys see no behaviour change.
KEYVIZ_* env vars threaded through the SSH env passthrough so the heredoc-resident build_keyviz_flags reads them on the remote.
env.example block documenting the variables and pointing at the Phase 2-C fan-out auth follow-up (docs/design/2026_04_27_proposed_keyviz_cluster_fanout.md).

What's not added

KEYVIZ_STEP, KEYVIZ_MAX_TRACKED_ROUTES, KEYVIZ_HISTORY_COLUMNS, KEYVIZ_FANOUT_TIMEOUT — these have sensible binary-side defaults and operators have not asked for env-driven overrides yet. Adding them is a one-line _flags+= edit per knob if needed.
Inter-node fan-out auth — Phase 2-C does not yet ship a pre-shared bearer token, so a multi-node KEYVIZ_FANOUT_NODES setup surfaces 5/N peer 401s in the SPA degraded banner. The env.example comment documents this.

Verification

Deployed to a live 5-node cluster (192.168.0.210-214) with KEYVIZ_ENABLED=true and KEYVIZ_FANOUT_NODES set across all five admin listeners; --keyviz* flags appeared on every container's docker inspect, and the admin keyviz handler returned a matrix with the fanout block populated.

$ docker inspect elastickv --format '{{json .Args}}' | python3 -c "import sys,json;a=json.load(sys.stdin);print([v for v in a if v.startswith('--keyviz')])"
['--keyvizEnabled', '--keyvizFanoutNodes']

$ curl -s -b cookies.txt https://elastickv.bootjp.dev/admin/api/v1/keyviz/matrix \
  | jq '.fanout | {responded, expected}'
{
  "responded": 1,
  "expected": 6
}

(The 5/6 401s are the expected fan-out-auth gap, surfaced by the PR #687 FanoutBanner in the SPA. Tracked separately.)

Self-review (CLAUDE.md 5 lenses)

Data loss — None. Build/runbook plumbing only.
Concurrency — None.
Performance — None on the deploy path. Per-node sampler cost is covered by the sampler's own design doc.
Data consistency — None.
Test coverage — bash -n syntax-check passes. Existing rolling-update.sh has no automated tests in this repo; the verification path is "deploy to a real cluster and inspect docker run args."

Test plan

bash -n scripts/rolling-update.sh — syntax OK
Live 5-node deploy with KEYVIZ_ENABLED=true — verified
CI: shellcheck (if/when added)

@claude review

Mirror the existing ADMIN_* env-driven plumbing for the KeyViz heatmap sampler so operators can flip --keyvizEnabled and the --keyvizFanoutNodes peer list from deploy.env without editing the docker run command. The previous workflow required either editing the script per-deploy or running the binary outside the rolling- update path. What's added: - KEYVIZ_ENABLED master switch (validated as true|false at the top of the local script alongside ADMIN_ENABLED). - KEYVIZ_FANOUT_NODES comma-separated host:port list. Pre-quoted via printf %q for the same reason ADMIN_FULL_ACCESS_KEYS is: commas survive an unquoted env pass but pre-quoting keeps the pattern uniform. - build_keyviz_flags helper (nameref-output, mirrors build_admin_flags). Empty array when KEYVIZ_ENABLED != "true", so existing deploys see no behaviour change. - KEYVIZ_* env vars threaded through the SSH env passthrough so the heredoc-resident build_keyviz_flags reads them on the remote. - env.example block documenting the variables and pointing at the Phase 2-C fan-out auth follow-up (docs/design/2026_04_27_proposed_keyviz_cluster_fanout.md). What's not added: - KEYVIZ_STEP, KEYVIZ_MAX_TRACKED_ROUTES, KEYVIZ_HISTORY_COLUMNS, KEYVIZ_FANOUT_TIMEOUT — these have sensible binary-side defaults and operators have not asked for env-driven overrides yet. Adding them is a one-line _flags+= edit per knob if needed. - Inter-node fan-out auth — Phase 2-C does not yet ship a pre- shared bearer token, so a multi-node KEYVIZ_FANOUT_NODES setup surfaces 5/N peer 401s in the SPA degraded banner. The env.example comment documents this. Verified locally: deployed to a 5-node cluster with KEYVIZ_ENABLED=true and KEYVIZ_FANOUT_NODES set across all five admin listeners; --keyviz* flags appeared on every container's docker inspect, and the admin keyviz handler returned a matrix with the fan-out block populated. Self-review (CLAUDE.md 5 lenses): 1. Data loss -- None. Build/runbook plumbing only. 2. Concurrency -- None. 3. Performance -- None on the deploy path. Per-node sampler cost is covered by the sampler's own design doc. 4. Data consistency -- None. 5. Test coverage -- bash -n syntax-check passes. Existing rolling- update.sh has no automated tests in this repo; the verification path is "deploy to a real cluster and inspect docker run args."

coderabbitai · 2026-04-27T16:17:44Z

Warning

Rate limit exceeded

@bootjp has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 38 minutes and 43 seconds before requesting another review.

To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 14c1c398-c3fe-493a-a56b-ab09e17efdcf

📥 Commits

Reviewing files that changed from the base of the PR and between 303ebe2 and 76f2906.

📒 Files selected for processing (2)

scripts/rolling-update.env.example
scripts/rolling-update.sh

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch ops/rolling-update-keyviz-support

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

bootjp · 2026-04-27T16:20:22Z

@claude review

claude · 2026-04-27T16:20:38Z

Claude finished @bootjp's task in 3m 3s —— View job

Code Review

Gather context and understand the request
Read changed files
Review scripts/rolling-update.sh
Review scripts/rolling-update.env.example
Post review findings

Overall: Clean, backward-compatible PR. The env-driven plumbing faithfully mirrors the ADMIN_* pattern. No correctness bugs found. A few minor style/consistency nits follow.

`scripts/rolling-update.sh`

1. Nameref quote inconsistency (rolling-update.sh:873)

# build_keyviz_flags
local -n _flags=$1       # no quotes

# build_admin_flags
local -n _flags="$1"     # quoted

$1 is always a bare variable name here so it's functionally harmless, but omitting the quotes is inconsistent with the established pattern. Fix this →

2. Env vars read inline rather than snapshotted into locals (rolling-update.sh:874,878)

build_admin_flags opens with a locals-first pattern explicitly called out in a comment (referencing PR #678 where a missing local caused a bug):

local enabled="${ADMIN_ENABLED:-false}"   # snapshot into local

build_keyviz_flags reads the vars inline instead:

if [[ "${KEYVIZ_ENABLED:-false}" != "true" ]]; then   # direct global read
  ...
if [[ -n "${KEYVIZ_FANOUT_NODES:-}" ]]; then           # direct global read

With only two vars this is low-risk today, but it's inconsistent with the defence-in-depth rationale documented in build_admin_flags. Worth snapshotting for symmetry, especially since future knobs (KEYVIZ_STEP, etc.) will expand this list. Fix this →

3. Stale comment at line 1230

# The two boolean flags (ADMIN_ENABLED, ADMIN_ALLOW_*) are validated
# at the top of the local script to be the literal "true" or "false"...

KEYVIZ_ENABLED was added to that same validation loop, so the comment is now inaccurate. Fix this →

`scripts/rolling-update.env.example`

4. Ambiguous 401 note (line 90)

# Phase 2-C does not yet ship inter-node auth, so peers reject the
# fan-out call with 401 unless --adminEnabled is off on the peer.

"unless --adminEnabled is off on the peer" is easy to misread. If --adminEnabled is OFF, the admin listener doesn't exist — you'd get a connection refused, not a 401. The intended meaning is that a peer running with --adminEnabled (but without the future bearer-token support) rejects fan-out with 401; a peer without --adminEnabled has an unauthenticated keyviz endpoint and responds 200. Consider rewording:

# Phase 2-C does not yet ship inter-node auth, so peers running with
# --adminEnabled will reject the fan-out call with 401; peers without
# --adminEnabled expose an unauthenticated keyviz endpoint and respond
# normally.

Fix this →

Non-issues (confirmed correct)

KEYVIZ_FANOUT_NODES_Q ordering — assigned at line 1245 (after update_one_node is defined, before it's called at line 1249). Correct.
KEYVIZ_ENABLED passed unquoted in the env block — validated to be "true" or "false" at the top of the local script, exactly like ADMIN_ENABLED. Safe.
build_keyviz_flags defined after run_container — both are inside the SSH heredoc (lines 538–1099). All functions are defined before any are called. No forward-reference issue.
Referenced design doc (docs/design/2026_04_27_proposed_keyviz_cluster_fanout.md) exists in-tree. ✓
Backward compatibility — KEYVIZ_ENABLED defaults to false; build_keyviz_flags returns an empty array when disabled. Existing deploys see no change. ✓

gemini-code-assist

Code Review

This pull request introduces support for the KeyViz heatmap sampler in the rolling update scripts. It adds new environment variables, KEYVIZ_ENABLED and KEYVIZ_FANOUT_NODES, to the example configuration and the update script, ensuring the boolean flag is validated and the settings are correctly passed to the container as CLI flags. Feedback was provided to refactor the build_keyviz_flags function to use local variables and quoted namerefs for better consistency and safety.

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Address claude bot review on 6c7ae79. Findings 1 and 2 already landed in 94ade0d (gemini's nameref-quote + locals-snapshot suggestion accepted by the author). This commit picks up the two remaining items: - Stale comment at scripts/rolling-update.sh:1233 still said "two boolean flags (ADMIN_ENABLED, ADMIN_ALLOW_*)" -- the validation loop above also covers KEYVIZ_ENABLED now. Rewrite to "boolean flags (ADMIN_ENABLED, ADMIN_ALLOW_*, KEYVIZ_ENABLED)". - env.example fan-out note was both ambiguously phrased and outdated. PR #692 ("forward session cookie on KeyViz fan-out so peers do not 401") shipped on main while this PR was in review, so the prior "Phase 2-C does not yet ship inter-node auth" note is no longer accurate. Rewrite to describe the as-shipped behaviour: the aggregator forwards the operator's session cookie, so peers with --adminEnabled accept the call as long as the same signing key and role allow-lists are configured cluster-wide; peers without --adminEnabled expose an unauthenticated endpoint and respond unconditionally. bash -n scripts/rolling-update.sh passes.

bootjp · 2026-04-27T16:45:16Z

@claude review

claude · 2026-04-27T16:45:31Z

Claude finished @bootjp's task in 3m 1s —— View job

Code Review

Gather context and understand the request
Read changed files (scripts/rolling-update.sh, scripts/rolling-update.env.example)
Analyze code quality, consistency, and correctness
Post review findings

Overall: The PR is clean and well-structured. The four nits from the previous Claude review (nameref quoting, local-var snapshotting, stale boolean-flags comment, ambiguous 401 wording in env.example) have all been addressed in the latest commit. One new edge case is worth noting:

`scripts/rolling-update.sh`

Finding: printf '%q' applied to KEYVIZ_FANOUT_NODES before emptiness check (rolling-update.sh:1248)

printf '%q' "" emits the two-character string ''. That value travels through the SSH env block and lands in the remote shell as a non-empty variable. When KEYVIZ_ENABLED=true but KEYVIZ_FANOUT_NODES is unset (the default), build_keyviz_flags sees fanout_nodes='\'\'', [[ -n "$fanout_nodes" ]] fires, and the docker invocation gets --keyvizFanoutNodes ''.

The ADMIN pattern has the same root issue but is guarded by file-existence checks and required-field validation before any flag is emitted. KEYVIZ_FANOUT_NODES has no such guard.

The comment on line 1245 already acknowledges that commas survive unquoted, so the simplest fix is to only apply printf %q when the value is non-empty:

# Before (line 1248):
KEYVIZ_FANOUT_NODES_Q="$(printf '%q' "$KEYVIZ_FANOUT_NODES")"

# After:
if [[ -n "$KEYVIZ_FANOUT_NODES" ]]; then
  KEYVIZ_FANOUT_NODES_Q="$(printf '%q' "$KEYVIZ_FANOUT_NODES")"
else
  KEYVIZ_FANOUT_NODES_Q=""
fi

This matches the intent of the guard in build_keyviz_flags and is safe because commas, colons, and dots in host:port lists need no shell-quoting. Fix this →

Severity: low — only affects deployments where KEYVIZ_ENABLED=true without KEYVIZ_FANOUT_NODES set. The binary likely rejects or ignores '' as a fanout list, but it's worth closing before the flag surfaces in more deployments.

Confirmed correct (non-issues)

Previous review nits — all four addressed in 94ade0d: nameref quoting, local-var snapshotting, stale boolean-flags comment, and env.example 401 wording. ✓
build_keyviz_flags defined after run_container — both are inside the SSH heredoc; bash resolves function references at call time, not definition time. ✓
Boolean pass-through without %q — KEYVIZ_ENABLED is validated to be true|false before the env block, matching ADMIN_ENABLED. ✓
KEYVIZ_FANOUT_NODES_Q ordering — computed on line 1248, used in update_one_node called on line 1252. ✓
Backward compatibility — KEYVIZ_ENABLED=false (default) causes build_keyviz_flags to return immediately. No change for existing deploys. ✓
Referenced design doc docs/design/2026_04_27_proposed_keyviz_cluster_fanout.md exists in-tree. ✓

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 76f290618f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-04-27T16:49:35Z

+# --adminEnabled expose an unauthenticated keyviz endpoint and
+# respond unconditionally.


Fix incorrect auth guidance for non-admin fanout peers

This comment says peers without --adminEnabled expose an unauthenticated keyviz endpoint, but that does not match the current server behavior: startAdminFromFlags returns immediately when admin is disabled (main_admin.go), so those peers do not expose /admin/api/v1/keyviz/matrix at all, and the route is documented as auth-protected when admin is running (internal/admin/server.go). Following this guidance can lead operators to include non-admin nodes in KEYVIZ_FANOUT_NODES and get avoidable fan-out failures (connection/refused or degraded peer status).

Useful? React with 👍 / 👎.

gemini-code-assist Bot reviewed Apr 27, 2026

View reviewed changes

Comment thread scripts/rolling-update.sh

bootjp and others added 3 commits April 28, 2026 01:30

Merge branch 'main' into ops/rolling-update-keyviz-support

0dd12ed

Update scripts/rolling-update.sh

94ade0d

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

chatgpt-codex-connector Bot reviewed Apr 27, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ops(rolling-update): add KeyViz heatmap support#693

ops(rolling-update): add KeyViz heatmap support#693
bootjp wants to merge 4 commits intomainfrom
ops/rolling-update-keyviz-support

bootjp commented Apr 27, 2026

Uh oh!

coderabbitai Bot commented Apr 27, 2026 •

edited

Loading

Rate limit exceeded

Uh oh!

bootjp commented Apr 27, 2026

Uh oh!

claude Bot commented Apr 27, 2026 •

edited

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

bootjp commented Apr 27, 2026

Uh oh!

claude Bot commented Apr 27, 2026 •

edited

Loading

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Apr 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		# --adminEnabled expose an unauthenticated keyviz endpoint and
		# respond unconditionally.

Conversation

bootjp commented Apr 27, 2026

Summary

What's added

What's not added

Verification

Self-review (CLAUDE.md 5 lenses)

Test plan

Uh oh!

coderabbitai Bot commented Apr 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rate limit exceeded

Uh oh!

bootjp commented Apr 27, 2026

Uh oh!

claude Bot commented Apr 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review

scripts/rolling-update.sh

scripts/rolling-update.env.example

Non-issues (confirmed correct)

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

bootjp commented Apr 27, 2026

Uh oh!

claude Bot commented Apr 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review

scripts/rolling-update.sh

Confirmed correct (non-issues)

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Apr 27, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

coderabbitai Bot commented Apr 27, 2026 •

edited

Loading

claude Bot commented Apr 27, 2026 •

edited

Loading

`scripts/rolling-update.sh`

`scripts/rolling-update.env.example`

claude Bot commented Apr 27, 2026 •

edited

Loading

`scripts/rolling-update.sh`