Skip to content

feat(jepsen): M5a — setup-hook verification + elastickv-list-routes CLI#925

Merged
bootjp merged 4 commits into
mainfrom
feat/composed1-m5a-setup-hook
Jun 4, 2026
Merged

feat(jepsen): M5a — setup-hook verification + elastickv-list-routes CLI#925
bootjp merged 4 commits into
mainfrom
feat/composed1-m5a-setup-hook

Conversation

@bootjp
Copy link
Copy Markdown
Owner

@bootjp bootjp commented Jun 4, 2026

Draft — 第4スライス (Composed-1 M5a)。設計ドキュメント §3.3 の "Setup hook verification" 部分。PR #924 (launch script) と並列で、それがマージされた後に launch script を統合します。

内容

2 つのピースを 1 PR にまとめています — Clojure 側のフックは Go CLI なしでは使えないため。

1. cmd/elastickv-list-routes (Go)

クラスタの現在の route catalog を JSON 出力する新規バイナリ。

{
  "catalog_version": 7,
  "routes": [
    {"route_id": 100, "raft_group_id": 1, "start": "...", "end": "...", "state": "ROUTE_STATE_ACTIVE"}
  ]
}

start/end は base64 でエンコード。Smoke test 3 件 / 6 assertions:

  • 空 routes が [] (not nil) で serialise
  • | を含む生バイト (base64 アルファベット外) を正しく round-trip
  • End (+infinity sentinel) を空文字列で表現

2. Clojure verify-multi-group-routing!

dynamodb-multi-table-workload の workload client setup! 冒頭で呼び出されます:

  • cmd/elastickv-list-routes に shell out
  • 正規表現で raft_group_id を抽出 — JSON 依存をあえて入れない (CLI の Go test が contract)
  • <2 distinct groups なら ex-info で fail-fast (launch script のフラグを名指しでエラーメッセージに)

jepsen.client/setup! はテストあたり 1 回しか呼ばれない (per-node の jepsen.db/setup! と異なる) ので first-node ゲートは不要。

検証テスト 4 件 / 5 assertions:

  • multi-group routing → #{1 2}
  • 同一グループ collapse → #{1} (これは reject される)
  • routes フィールド欠落 → #{} (将来のスキーマドリフトで noisy fail)
  • 空白/改行 tolerance

設計上の判断

なぜ JSON 依存ではなく正規表現か: CLI の JSON shape は Go test で pin されている。フックが必要なのは「ルートを所有するグループ数」の粗い検査だけ。将来 raft_group_id がリネームされたら正規表現が #{} を返し、verify-multi-group-routing! が例外を投げて schema drift が大声で surface する — silent pass を防げる。

Verification

Remaining M5a work

Second-half slice of Composed-1 M5a per PR #905 design doc §3.3
'Setup hook verification' bullet.

Two pieces, intentionally one PR because the Clojure-side hook is
useless without the Go CLI it shells out to:

* cmd/elastickv-list-routes — new Go binary that prints the
  cluster's current route catalog as JSON (catalog_version,
  routes[] with route_id / raft_group_id / start / end / state).
  start and end are base64-encoded raw bytes so any byte sequence
  in the routing keyspace survives JSON.  Non-zero exit on any
  RPC failure so the Jepsen hook sees it verbatim.

  Smoke-tested via main_test.go (3 tests / 6 assertions):
    - empty routes serialise as [] (not nil) — Clojure callers
      don't have to special-case nil-vs-empty
    - route bytes containing '|' (outside base64 alphabet)
      round-trip cleanly
    - empty End (the +infinity sentinel) serialises as ''
      so Clojure can distinguish it from unset

* jepsen/src/elastickv/dynamodb_multi_table_workload.clj — added
  verify-multi-group-routing! that shells out to the CLI, parses
  the JSON via a regex for raft_group_id (intentionally avoids a
  JSON-lib dep — the CLI's emit() is the contract), and throws
  ex-info if <2 distinct groups are observed.  The error message
  names the launch-script flag the operator is missing.

  Called from the workload client's setup! before any table
  creation.  jepsen.client/setup! fires exactly once per test
  (unlike jepsen.db/setup! which fires per node), so no
  first-node gate is needed — corrected from an earlier mental
  model I'd carried over from the design doc.

  4 new tests / 5 assertions verifying distinct-group-ids:
    - multi-group routing yields #{1 2}
    - same-group routes collapse to #{1}
    - missing 'routes' field returns #{}
    - whitespace / newlines between key:value tolerated

Why the regex instead of a JSON dependency: the CLI's JSON shape
is pinned by Go tests; this hook only needs a coarse-grained
'how many groups own routes' check.  A future ListRoutes schema
change that renames raft_group_id makes the regex return #{}
which causes verify-multi-group-routing! to throw — schema drift
surfaces loudly rather than silently passing.

Verification:
  * go build + go test -race + golangci-lint on
    cmd/elastickv-list-routes -> 0 issues
  * lein test elastickv.dynamodb-multi-table-workload-test ->
    Ran 13 tests containing 27 assertions, 0 failures, 0 errors
    (up from 9 / 22 — added the four distinct-group-ids tests)

The next slice (PR #924's follow-up commit) wires
elastickv-list-routes into scripts/run-jepsen-m5-local.sh's
build step + threads the binary path into the workload's
:list-routes-bin opt so the local Jepsen run picks it up
automatically.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 4, 2026

Warning

Review limit reached

@bootjp, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 59 minutes and 47 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f751ed53-4a48-4ab1-ab35-98fc416def37

📥 Commits

Reviewing files that changed from the base of the PR and between 0471e7e and 68adde1.

📒 Files selected for processing (4)
  • cmd/elastickv-list-routes/main.go
  • cmd/elastickv-list-routes/main_test.go
  • jepsen/src/elastickv/dynamodb_multi_table_workload.clj
  • jepsen/test/elastickv/dynamodb_multi_table_workload_test.clj
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/composed1-m5a-setup-hook

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new Go command-line tool, elastickv-list-routes, along with tests, to print the cluster's route catalog as JSON. This tool is integrated into the Jepsen test suite as a setup-hook verification step to ensure multi-group routing is active before workloads run. Feedback suggests using io.Writer instead of *os.File in the Go emit function to improve testability, and dynamically resolving the default gRPC address in the Jepsen setup hook from the test configuration nodes to support distributed environments.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread cmd/elastickv-list-routes/main.go Outdated

// emit serialises resp as JSON to w. Extracted from run() so the
// encoding is testable without dialling a real server.
func emit(resp *pb.ListRoutesResponse, w *os.File) error {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using a concrete *os.File type for the writer parameter limits the flexibility of the emit function and makes testing more cumbersome (requiring temporary files on disk). Consider using io.Writer instead, which is the standard Go idiom for writing output and allows testing with in-memory buffers like bytes.Buffer.

Suggested change
func emit(resp *pb.ListRoutesResponse, w *os.File) error {
func emit(resp *pb.ListRoutesResponse, w io.Writer) error {

Comment on lines +327 to +329
(let [bin (or (:list-routes-bin test) default-list-routes-bin)
addr (or (:grpc-host-port test) default-grpc-host-port)
result (shell/sh bin "--address" addr)]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Hardcoding 127.0.0.1:50051 as the default gRPC address works fine for local testing, but will fail in a distributed Jepsen environment where the database nodes run on separate hosts. Consider dynamically resolving the default address using the first node in the test configuration.

  (let [bin    (or (:list-routes-bin test) default-list-routes-bin)
        addr   (or (:grpc-host-port test)
                   (if-let [node (first (:nodes test))]
                     (str (name node) ":50051")
                     default-grpc-host-port))
        result (shell/sh bin "--address" addr)]

Threads the elastickv-list-routes binary path and the gRPC
target address through the workload CLI to the setup-hook
verification (verify-multi-group-routing!).  Without these,
the hook falls back to bare-name 'elastickv-list-routes' on
PATH and 127.0.0.1:50051 — fine for the M5a launch script's
default topology but brittle when the binary lives outside
PATH (e.g. a tmp build directory shared with the launch
script's bash variables).

Implementation:

  * dynamo-cli-opts gains --list-routes-bin PATH and
    --grpc-host-port HOST:PORT entries.  Both default to nil
    so verify-multi-group-routing!'s 'or default' fallback
    fires when omitted.

  * elastickv-dynamodb-multi-table-test now puts
    :list-routes-bin / :grpc-host-port directly into the test
    map (alongside :name / :nodes / :db).  The verification
    hook reads them via (:list-routes-bin test) etc. —
    jepsen.client/setup! receives the test map, not opts.

  * prepare-dynamo-opts unchanged.  Clojure tools.cli leaves
    parsed CLI flags on the options map as :list-routes-bin
    and :grpc-host-port keys; the existing (assoc options
    :dynamo-host ...) call preserves them via assoc's
    'keep-everything-else' semantic.

Verification: lein test elastickv.dynamodb-multi-table-workload-test
still 13 tests / 27 assertions, 0 failures, 0 errors.  CLI
default semantics covered by the existing
distinct-group-ids-* tests (the defaults are inert when no
shell-out actually fires).

The follow-up — wiring 'go build ./cmd/elastickv-list-routes'
into scripts/run-jepsen-m5-local.sh + passing
--list-routes-bin / --grpc-host-port to lein run — lands on
PR #924's branch as a separate commit once that PR's first
review pass is in.
bootjp added a commit that referenced this pull request Jun 4, 2026
…-bin / --grpc-host-port

Follow-up commit on PR #924's launch script.  Brings the setup-hook
verification online by:

  * Adding 'go build ./cmd/elastickv-list-routes' alongside the
    existing server + route-key builds in the script's [build]
    section.  The output binary lives at /tmp/elastickv4-m5-list-routes
    matching the BINARY / ROUTE_KEY_BIN naming convention.

  * Passing --list-routes-bin and --grpc-host-port to lein run so
    the workload's verify-multi-group-routing! (introduced in
    PR #925) shells out to the freshly-built binary against the
    process's primary gRPC address (PROC_ADDR=127.0.0.1:50051).

Without these flags the hook falls back to PATH lookup, which
fails in a tmp-build environment.  With them, every local Jepsen
run starts with a clear yes/no on whether the --raftGroups /
--shardRanges combo actually produced a multi-group routing
catalog — silencing the entire class of 'workload ran clean
but never exercised dispatchMultiShardTxn' results that motivated
the design doc §3.3 hook in the first place.

Cross-PR ordering note: this commit depends on PR #925's
--list-routes-bin / --grpc-host-port CLI flags being merged.
Until that lands, this commit's launch will fail with
'Unknown option: --list-routes-bin' — kept on this PR for review
ergonomics; flag commit is tagged 'do not merge before PR #925'
in the PR description.

Verification: bash -n scripts/run-jepsen-m5-local.sh -> OK.
…C addr)

gemini-code-assist medium findings on a229f5b / 4b71c5e:

* emit() signature *os.File -> io.Writer (Go idiom + testability).
  Updated all three tests to use bytes.Buffer rather than
  os.CreateTemp + os.ReadFile.  Net: -25 lines of test scaffolding,
  same assertions.  go test -race -count=1 + golangci-lint
  both green.

* default-grpc-host-port-for dynamic resolution.  The previous
  hard-coded '127.0.0.1:50051' default worked for the local
  launch script but punted every distributed Jepsen run to
  localhost — verify-multi-group-routing! would dial the wrong
  host and fail with a confusing connection error.

  Fix: extracted default-grpc-host-port-for that reads
  (first (:nodes test)) and combines with the new
  default-grpc-port constant (50051).  Falls back to
  127.0.0.1:50051 only when :nodes is missing entirely (the
  workload-builder unit tests' case).  Caller in
  verify-multi-group-routing! updated to call the resolver
  function rather than read a static def.

  Two new tests pin the resolution:
    - default-grpc-host-port-resolves-from-first-node:
      string nodes, FQDN-style nodes, keyword nodes all
      resolve via (name).
    - default-grpc-host-port-falls-back-on-empty-nodes:
      missing :nodes and empty [] both use 127.0.0.1.

Verification:
  * lein test elastickv.dynamodb-multi-table-workload-test ->
    Ran 15 tests containing 32 assertions, 0 failures, 0 errors
    (up from 13 / 27 — added the two default-grpc-host-port-for tests)
  * go test -race -count=1 ./cmd/elastickv-list-routes -> ok
  * golangci-lint ./cmd/elastickv-list-routes -> 0 issues
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 4, 2026

Pushed as e67d0df — 2 gemini-code-assist medium findings on a229f5b / 4b71c5e addressed:

  • emit(*os.File)emit(io.Writer) (cmd/elastickv-list-routes/main.go): Go の慣用句 + テスト容易性。3 件のテストを bytes.Buffer 利用に書き換え (net -25 行のテストスキャフォールド、assertions 同等)。
  • default gRPC アドレスの動的解決: ハードコード 127.0.0.1:50051default-grpc-host-port-for 関数に分離。(first (:nodes test)) から hostname を解決し、distributed Jepsen run で正しいノードを dial。:nodes が空 / 欠落の場合のみ 127.0.0.1 にフォールバック。
  • 2 件の新規テストで keyword / FQDN / nil :nodes ケースを pin (default-grpc-host-port-resolves-from-first-node, default-grpc-host-port-falls-back-on-empty-nodes).

検証:

  • lein test elastickv.dynamodb-multi-table-workload-test → 15 tests / 32 assertions, 0 failures (13/27 から 2 件 / 5 assertions 増)
  • go test -race ./cmd/elastickv-list-routes → ok
  • golangci-lint → 0 issues

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented Jun 4, 2026

Claude finished @bootjp's task in 3m 13s —— View job


Review of e67d0df — M5a setup-hook verification + elastickv-list-routes CLI

  • Read all 4 changed files
  • Review Go code (cmd/elastickv-list-routes/)
  • Review Clojure code (jepsen/src/elastickv/)
  • Review tests
  • Post findings

Both gemini-code-assist findings are correctly addressed. The io.Writer refactor is idiomatic and the dynamic address resolution covers the distributed-Jepsen case. 15 Clojure tests / 32 assertions and the Go test suite all passing is a good signal. Below are the findings from five independent lenses (per CLAUDE.md).


Lens 1 — Data Loss

No persistence path in this changeset. The deferred conn.Close() pattern (main.go:83-90) deliberately swallows close errors after a successful emit and prints to stderr — this matches the comment and the cmd/elastickv-split precedent. No concerns.


Lens 2 — Concurrency / Distributed Failures

emit is pure (no shared state). verify-multi-group-routing! calls shell/sh synchronously. The jepsen.client/setup! lifecycle comment (dynamodb_multi_table_workload.clj:388-390) correctly notes it's called exactly once per test, not per-node, so no concurrency issues.

One point to verify: setup! accesses the ddb field (clj:399) but names the receiver _this. In Clojure defrecord, fields are available as direct lexical bindings in protocol method bodies regardless of the receiver name (same pattern as open! using node->port directly at clj:375). Confirmed correct. ✓


Lens 3 — Performance

One-shot RPC. rpcTimeout = 10 * time.Second is appropriate for a pre-test CLI. The distinct-group-ids regex is a linear scan on a small JSON blob. No concerns.


Lens 4 — Data Consistency

Finding 1 (low — documentation bug): The file header doc comment at main.go:23 shows "state": "ACTIVE", but r.GetState().String() at main.go:119 emits the full protobuf name "ROUTE_STATE_ACTIVE" (confirmed in proto/distribution.pb.go:38). The emitted JSON contract and the header doc are inconsistent.

// header says:
{"route_id": 100, "raft_group_id": 1, "start": "...", "end": "...", "state": "ACTIVE"}
// actual emit:
{"state": "ROUTE_STATE_ACTIVE"}

Fix this →

The distinct-group-ids regex and the Clojure side don't use the state field, so there's no runtime impact — pure documentation drift.


Lens 5 — Test Coverage

Finding 2 (low — test gap): TestEmit_RoundTripsRouteBytes (main_test.go:35-70) asserts RouteID, RaftGroupID, and the base64 round-trip but never checks out.Routes[0].State. If the String() format for RouteState ever changes (e.g. stripped prefix to "ACTIVE"), no test would catch it before the Clojure regex starts silently ignoring the field.

Suggested addition at main_test.go:68:

require.Equal(t, "ROUTE_STATE_ACTIVE", out.Routes[0].State)

Fix this →

Finding 3 (very low — style): TestEmit_EmptyEndDistinguishable uses require.Contains(t, buf.String(), "end": "") which couples the assertion to the enc.SetIndent("", " ") formatting. An unmarshal-and-check approach would be more robust, though changing the indent is unlikely in practice. Low priority.

Coverage otherwise looks solid:

  • TestEmit_Empty — nil routes → [] (not null)
  • TestEmit_RoundTripsRouteBytes — raw bytes with | through base64
  • TestEmit_EmptyEndDistinguishable — nil End""
  • distinct-group-ids — multi-group, single-group, missing field, whitespace tolerance (4 tests)
  • default-grpc-host-port-for — string node, FQDN, keyword node, empty/missing :nodes (2 tests)

Summary

Two genuine findings (both low priority), both fixable in-place. The io.Writer and dynamic address fixes from gemini are correctly implemented. No blocking issues — this is approvable once the doc/test nits are resolved (or consciously deferred to a follow-up).
| Branch

bootjp added a commit that referenced this pull request Jun 4, 2026
claude[bot] review on 834b172 — all 4 gemini fixes confirmed
correct.  Two non-blocking notes addressed inline:

* Minor — line 75-78 comment overstated scope.  The guard
  runs unconditionally, not just under --no-rebuild.
  Rewrote the comment to reflect the actual semantics
  (catches both the --no-rebuild case AND a fresh-build
  environment where a helper somehow produced a
  non-executable).

* Suggestion — cross-PR dependency machine-readable guard.
  Added a pre-flight check before 'go build ./cmd/elastickv-list-routes'
  that surfaces a clear error if cmd/elastickv-list-routes/
  doesn't exist in the current tree (i.e. PR #925 hasn't
  been merged yet).  Without this guard, anyone trying to
  run the script from PR #924 alone gets an opaque
  'package not found' error from go build with no
  remediation hint.

Verification: bash -n scripts/run-jepsen-m5-local.sh -> OK.
…aude[bot] low on PR #925)

claude[bot] review on e67d0df — both gemini findings confirmed
correct.  Two low-priority items addressed:

* Doc drift (main.go:23): header doc showed
  "state": "ACTIVE" but the actual emit is
  r.GetState().String() which returns "ROUTE_STATE_ACTIVE"
  (the full protobuf enum name from proto/distribution.pb.go).
  Updated the header example to match the actual on-the-wire
  shape.

* Test gap (main_test.go: TestEmit_RoundTripsRouteBytes):
  added one assertion at the bottom verifying
  out.Routes[0].State == "ROUTE_STATE_ACTIVE".  A future
  enum-name change (e.g. the proto regen stripping the prefix
  to "ACTIVE") would now fail this test rather than being
  silently ignored by the Clojure regex-based parser.

The very-low style nit (TestEmit_EmptyEndDistinguishable
coupling to enc.SetIndent format) is explicitly deferred per
claude[bot]'s assessment: changing the JSON indent is
unlikely and the unmarshal-and-check approach would lose the
'serialised as a literal string' clarity the current
assertion conveys.

Verification:
  * go test -race -count=1 ./cmd/elastickv-list-routes -> ok
  * golangci-lint ./cmd/elastickv-list-routes -> 0 issues
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 4, 2026

Pushed as 68adde1 — claude[bot] low findings on e67d0df addressed:

  • Doc drift (main.go:23): ヘッダ doc の "state": "ACTIVE" を実際の emit ("state": "ROUTE_STATE_ACTIVE" — proto's String() の出力) に一致させる。
  • Test gap (main_test.go: TestEmit_RoundTripsRouteBytes): out.Routes[0].State == "ROUTE_STATE_ACTIVE" の assert を追加。将来の enum 名変更が silent に通り抜けないように pin。

very-low の style nit (TestEmit_EmptyEndDistinguishable が enc.SetIndent フォーマットに couple) は claude[bot] 自身の評価通り deferred。JSON indent 変更は unlikely で、unmarshal-and-check への切り替えは現状の 'serialised as a literal string' の意味の明確性を失う。

go test -race ./cmd/elastickv-list-routes → ok / golangci-lint → 0 issues

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented Jun 4, 2026

Claude finished @bootjp's task in 59s —— View job


Review of 68adde1 — addressing previous claude[bot] low findings

  • Read changed files (main.go, main_test.go)
  • Verify doc drift fix (main.go:23 header comment)
  • Verify test gap fix (State assertion in TestEmit_RoundTripsRouteBytes)
  • Full five-lens review pass
  • Post findings

Both previous findings from e67d0df are correctly addressed.

Doc drift fix — main.go:23
The header comment now reads "state": "ROUTE_STATE_ACTIVE", matching the actual r.GetState().String() output from the protobuf enum. The doc and the emitted JSON contract are now consistent.

Test gap fix — main_test.go:75
require.Equal(t, "ROUTE_STATE_ACTIVE", out.Routes[0].State) is present with an explanatory comment that explicitly names why the assertion exists (detect proto enum rename before it reaches the Clojure regex silently). The co-attribution comment is also well-scoped and correct.


Five-lens pass (68adde1)

Lens 1 — Data Loss: No persistence path. Binary is read-only (ListRoutes RPC). No concerns.

Lens 2 — Concurrency / Distributed Failures: emit is stateless; setup! is synchronous and called exactly once per test run. No concerns.

Lens 3 — Performance: One-shot RPC with a 10 s timeout. No hot path touched. No concerns.

Lens 4 — Data Consistency: Both previously flagged items resolved. No new consistency concerns found.

Lens 5 — Test Coverage: Coverage is now complete for the Go binary:

  • Empty routes → [] not null (TestEmit_Empty)
  • Raw bytes with | round-trip via base64 (TestEmit_RoundTripsRouteBytes)
  • State field pinned to "ROUTE_STATE_ACTIVE" — future enum renames now caught here
  • nil End → "" explicit serialisation (TestEmit_EmptyEndDistinguishable)

The deferred style nit (TestEmit_EmptyEndDistinguishable coupling to indent format) remains low priority and the deferral rationale is sound.


No new findings. The commit is clean — doc, test, and implementation are all internally consistent. Approvable once the E2E Jepsen run (pending PR #924 integration) is checked off.
| Branch

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant