feat(jepsen): M5a — setup-hook verification + elastickv-list-routes CLI#925
Second-half slice of Composed-1 M5a per PR #905 design doc §3.3 'Setup hook verification' bullet.

Two pieces, intentionally one PR because the Clojure-side hook is useless without the Go CLI it shells out to:

* cmd/elastickv-list-routes: new Go binary that prints the cluster's current route catalog as JSON (catalog_version, routes[] with route_id / raft_group_id / start / end / state). start and end are base64-encoded raw bytes so any byte sequence in the routing keyspace survives JSON. Non-zero exit on any RPC failure so the Jepsen hook sees it verbatim. Smoke-tested via main_test.go (3 tests / 6 assertions):
  - empty routes serialise as [] (not nil), so Clojure callers don't have to special-case nil-vs-empty
  - route bytes containing '|' (outside base64 alphabet) round-trip cleanly
  - empty End (the +infinity sentinel) serialises as '' so Clojure can distinguish it from unset
* jepsen/src/elastickv/dynamodb_multi_table_workload.clj: added verify-multi-group-routing! that shells out to the CLI, parses the JSON via a regex for raft_group_id (intentionally avoids a JSON-lib dep; the CLI's emit() is the contract), and throws ex-info if <2 distinct groups are observed. The error message names the launch-script flag the operator is missing. Called from the workload client's setup! before any table creation. jepsen.client/setup! fires exactly once per test (unlike jepsen.db/setup! which fires per node), so no first-node gate is needed. This is corrected from an earlier mental model I'd carried over from the design doc. 4 new tests / 5 assertions verifying distinct-group-ids:
  - multi-group routing yields #{1 2}
  - same-group routes collapse to #{1}
  - missing 'routes' field returns #{}
  - whitespace / newlines between key:value tolerated

Why the regex instead of a JSON dependency: the CLI's JSON shape is pinned by Go tests; this hook only needs a coarse-grained 'how many groups own routes' check. A future ListRoutes schema change that renames raft_group_id makes the regex return #{}, which causes verify-multi-group-routing! to throw, so schema drift surfaces loudly rather than silently passing.

Verification:
* go build + go test -race + golangci-lint on cmd/elastickv-list-routes -> 0 issues
* lein test elastickv.dynamodb-multi-table-workload-test -> Ran 13 tests containing 27 assertions, 0 failures, 0 errors (up from 9 / 22; added the four distinct-group-ids tests)

The next slice (PR #924's follow-up commit) wires elastickv-list-routes into scripts/run-jepsen-m5-local.sh's build step + threads the binary path into the workload's :list-routes-bin opt so the local Jepsen run picks it up automatically.
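The coarse regex check described above can be sketched in Go for illustration. The real hook is Clojure (verify-multi-group-routing! and distinct-group-ids); the function names, the regex, and the error text below are assumptions made for this sketch, not the PR's code.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
	"strconv"
)

// groupIDPattern matches `"raft_group_id": <digits>` and tolerates
// whitespace (including newlines) on either side of the colon.
// Assumption: the actual Clojure regex may differ in detail.
var groupIDPattern = regexp.MustCompile(`"raft_group_id"\s*:\s*(\d+)`)

// distinctGroupIDs returns the sorted set of raft group IDs found in raw
// JSON text. It never parses the JSON, so a renamed field yields an empty set.
func distinctGroupIDs(jsonText string) []uint64 {
	seen := map[uint64]struct{}{}
	for _, m := range groupIDPattern.FindAllStringSubmatch(jsonText, -1) {
		id, err := strconv.ParseUint(m[1], 10, 64)
		if err != nil {
			continue // out-of-range number: ignore rather than guess
		}
		seen[id] = struct{}{}
	}
	ids := make([]uint64, 0, len(seen))
	for id := range seen {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(i, j int) bool { return ids[i] < ids[j] })
	return ids
}

// verifyMultiGroup fails when fewer than two distinct groups own routes.
func verifyMultiGroup(jsonText string) error {
	ids := distinctGroupIDs(jsonText)
	if len(ids) < 2 {
		return fmt.Errorf("expected >= 2 distinct raft groups, observed %v", ids)
	}
	return nil
}

func main() {
	multi := `{"catalog_version": 7, "routes": [
	  {"route_id": 100, "raft_group_id": 1},
	  {"route_id": 101, "raft_group_id" :
	    2}]}`
	renamed := `{"routes": [{"route_id": 100, "group": 1}]}`

	fmt.Println(distinctGroupIDs(multi), verifyMultiGroup(multi))
	fmt.Println(distinctGroupIDs(renamed), verifyMultiGroup(renamed))
}
```

The second input shows the schema-drift case: once the field is renamed, the set is empty and the check fails instead of passing silently.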
Code Review
This pull request introduces a new Go command-line tool, elastickv-list-routes, along with tests, to print the cluster's route catalog as JSON. This tool is integrated into the Jepsen test suite as a setup-hook verification step to ensure multi-group routing is active before workloads run. Feedback suggests using io.Writer instead of *os.File in the Go emit function to improve testability, and dynamically resolving the default gRPC address in the Jepsen setup hook from the test configuration nodes to support distributed environments.
// emit serialises resp as JSON to w. Extracted from run() so the
// encoding is testable without dialling a real server.
func emit(resp *pb.ListRoutesResponse, w *os.File) error {
Using a concrete *os.File type for the writer parameter limits the flexibility of the emit function and makes testing more cumbersome (requiring temporary files on disk). Consider using io.Writer instead, which is the standard Go idiom for writing output and allows testing with in-memory buffers like bytes.Buffer.
- func emit(resp *pb.ListRoutesResponse, w *os.File) error {
+ func emit(resp *pb.ListRoutesResponse, w io.Writer) error {
(let [bin (or (:list-routes-bin test) default-list-routes-bin)
      addr (or (:grpc-host-port test) default-grpc-host-port)
      result (shell/sh bin "--address" addr)]
Hardcoding 127.0.0.1:50051 as the default gRPC address works fine for local testing, but will fail in a distributed Jepsen environment where the database nodes run on separate hosts. Consider dynamically resolving the default address using the first node in the test configuration.
(let [bin (or (:list-routes-bin test) default-list-routes-bin)
      addr (or (:grpc-host-port test)
               (if-let [node (first (:nodes test))]
                 (str (name node) ":50051")
                 default-grpc-host-port))
      result (shell/sh bin "--address" addr)]
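The fallback order in the suggestion above (explicit option, then first node, then localhost) can be written out as a small Go sketch. This is only an illustration of the resolution order; the hook itself is Clojure, and resolveGRPCHostPort and its constants are hypothetical names.

```go
package main

import (
	"fmt"
	"net"
)

const (
	defaultGRPCPort  = "50051"           // port used in the review suggestion
	fallbackHostPort = "127.0.0.1:50051" // last-resort default for local runs
)

// resolveGRPCHostPort picks the gRPC target in priority order:
// an explicit host:port, then the first test node, then localhost.
func resolveGRPCHostPort(explicit string, nodes []string) string {
	if explicit != "" {
		return explicit
	}
	if len(nodes) > 0 && nodes[0] != "" {
		return net.JoinHostPort(nodes[0], defaultGRPCPort)
	}
	return fallbackHostPort
}

func main() {
	fmt.Println(resolveGRPCHostPort("10.0.0.9:6000", []string{"n1"}))
	fmt.Println(resolveGRPCHostPort("", []string{"n1", "n2"}))
	fmt.Println(resolveGRPCHostPort("", nil))
}
```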
Threads the elastickv-list-routes binary path and the gRPC
target address through the workload CLI to the setup-hook
verification (verify-multi-group-routing!). Without these,
the hook falls back to bare-name 'elastickv-list-routes' on
PATH and 127.0.0.1:50051 — fine for the M5a launch script's
default topology but brittle when the binary lives outside
PATH (e.g. a tmp build directory shared with the launch
script's bash variables).
Implementation:
* dynamo-cli-opts gains --list-routes-bin PATH and
--grpc-host-port HOST:PORT entries. Both default to nil
so verify-multi-group-routing!'s 'or default' fallback
fires when omitted.
* elastickv-dynamodb-multi-table-test now puts
:list-routes-bin / :grpc-host-port directly into the test
map (alongside :name / :nodes / :db). The verification
hook reads them via (:list-routes-bin test) etc. —
jepsen.client/setup! receives the test map, not opts.
* prepare-dynamo-opts unchanged. Clojure tools.cli leaves
parsed CLI flags on the options map as :list-routes-bin
and :grpc-host-port keys; the existing (assoc options
:dynamo-host ...) call preserves them via assoc's
'keep-everything-else' semantic.
Verification: lein test elastickv.dynamodb-multi-table-workload-test
still 13 tests / 27 assertions, 0 failures, 0 errors. CLI
default semantics covered by the existing
distinct-group-ids-* tests (the defaults are inert when no
shell-out actually fires).
The follow-up — wiring 'go build ./cmd/elastickv-list-routes'
into scripts/run-jepsen-m5-local.sh + passing
--list-routes-bin / --grpc-host-port to lein run — lands on
PR #924's branch as a separate commit once that PR's first
review pass is in.
…-bin / --grpc-host-port

Follow-up commit on PR #924's launch script. Brings the setup-hook verification online by:
* Adding 'go build ./cmd/elastickv-list-routes' alongside the existing server + route-key builds in the script's [build] section. The output binary lives at /tmp/elastickv4-m5-list-routes, matching the BINARY / ROUTE_KEY_BIN naming convention.
* Passing --list-routes-bin and --grpc-host-port to lein run so the workload's verify-multi-group-routing! (introduced in PR #925) shells out to the freshly-built binary against the process's primary gRPC address (PROC_ADDR=127.0.0.1:50051). Without these flags the hook falls back to PATH lookup, which fails in a tmp-build environment.

With them, every local Jepsen run starts with a clear yes/no on whether the --raftGroups / --shardRanges combo actually produced a multi-group routing catalog, silencing the entire class of 'workload ran clean but never exercised dispatchMultiShardTxn' results that motivated the design doc §3.3 hook in the first place.

Cross-PR ordering note: this commit depends on PR #925's --list-routes-bin / --grpc-host-port CLI flags being merged. Until that lands, this commit's launch will fail with 'Unknown option: --list-routes-bin'. It is kept on this PR for review ergonomics; the flag commit is tagged 'do not merge before PR #925' in the PR description.

Verification: bash -n scripts/run-jepsen-m5-local.sh -> OK.
…C addr)

gemini-code-assist medium findings on a229f5b / 4b71c5e:
* emit() signature *os.File -> io.Writer (Go idiom + testability). Updated all three tests to use bytes.Buffer rather than os.CreateTemp + os.ReadFile. Net: -25 lines of test scaffolding, same assertions. go test -race -count=1 + golangci-lint both green.
* default-grpc-host-port-for dynamic resolution. The previous hard-coded '127.0.0.1:50051' default worked for the local launch script but punted every distributed Jepsen run to localhost, so verify-multi-group-routing! would dial the wrong host and fail with a confusing connection error. Fix: extracted default-grpc-host-port-for, which reads (first (:nodes test)) and combines it with the new default-grpc-port constant (50051). Falls back to 127.0.0.1:50051 only when :nodes is missing or empty (the workload-builder unit tests' case). Caller in verify-multi-group-routing! updated to call the resolver function rather than read a static def. Two new tests pin the resolution:
  - default-grpc-host-port-resolves-from-first-node: string nodes, FQDN-style nodes, keyword nodes all resolve via (name).
  - default-grpc-host-port-falls-back-on-empty-nodes: missing :nodes and empty [] both use 127.0.0.1.

Verification:
* lein test elastickv.dynamodb-multi-table-workload-test -> Ran 15 tests containing 32 assertions, 0 failures, 0 errors (up from 13 / 27; added the two default-grpc-host-port-for tests)
* go test -race -count=1 ./cmd/elastickv-list-routes -> ok
* golangci-lint ./cmd/elastickv-list-routes -> 0 issues
Pushed as e67d0df — 2 gemini-code-assist medium findings on a229f5b / 4b71c5e addressed:
Verification:
@claude review
Claude finished @bootjp's task in 3m 13s
Review of e67d0df: M5a setup-hook verification +
claude[bot] review on 834b172: all 4 gemini fixes confirmed correct. Two non-blocking notes addressed inline:
* Minor: the comment at lines 75-78 overstated scope. The guard runs unconditionally, not just under --no-rebuild. Rewrote the comment to reflect the actual semantics (catches both the --no-rebuild case AND a fresh-build environment where a helper somehow produced a non-executable).
* Suggestion: machine-readable guard for the cross-PR dependency. Added a pre-flight check before 'go build ./cmd/elastickv-list-routes' that surfaces a clear error if cmd/elastickv-list-routes/ doesn't exist in the current tree (i.e. PR #925 hasn't been merged yet). Without this guard, anyone trying to run the script from PR #924 alone gets an opaque 'package not found' error from go build with no remediation hint.

Verification: bash -n scripts/run-jepsen-m5-local.sh -> OK.
…aude[bot] low on PR #925)

claude[bot] review on e67d0df: both gemini findings confirmed correct. Two low-priority items addressed:
* Doc drift (main.go:23): header doc showed "state": "ACTIVE" but the actual emit is r.GetState().String(), which returns "ROUTE_STATE_ACTIVE" (the full protobuf enum name from proto/distribution.pb.go). Updated the header example to match the actual on-the-wire shape.
* Test gap (main_test.go: TestEmit_RoundTripsRouteBytes): added one assertion at the bottom verifying out.Routes[0].State == "ROUTE_STATE_ACTIVE". A future enum-name change (e.g. the proto regen stripping the prefix to "ACTIVE") would now fail this test rather than being silently ignored by the Clojure regex-based parser.

The very-low style nit (TestEmit_EmptyEndDistinguishable coupling to enc.SetIndent format) is explicitly deferred per claude[bot]'s assessment: changing the JSON indent is unlikely and the unmarshal-and-check approach would lose the 'serialised as a literal string' clarity the current assertion conveys.

Verification:
* go test -race -count=1 ./cmd/elastickv-list-routes -> ok
* golangci-lint ./cmd/elastickv-list-routes -> 0 issues
Pushed as 68adde1 — claude[bot] low findings on e67d0df addressed:
The very-low style nit (TestEmit_EmptyEndDistinguishable …
@claude review
Claude finished @bootjp's task in 59s
Review of 68adde1: addressing previous claude[bot] low findings
Both previous findings from e67d0df are correctly addressed.
Doc drift fix: …
Test gap fix: …
Five-lens pass (68adde1)
Lens 1, Data Loss: No persistence path. Binary is read-only ( …
Lens 2, Concurrency / Distributed Failures: …
Lens 3, Performance: One-shot RPC with a 10 s timeout. No hot path touched. No concerns.
Lens 4, Data Consistency: Both previously flagged items resolved. No new consistency concerns found.
Lens 5, Test Coverage: Coverage is now complete for the Go binary: …
The deferred style nit ( …
No new findings. The commit is clean: doc, test, and implementation are all internally consistent. Approvable once the E2E Jepsen run (pending PR #924 integration) is checked off.
Draft: fourth slice (Composed-1 M5a). Covers the "Setup hook verification" part of design doc §3.3. Developed in parallel with PR #924 (launch script); the launch-script integration follows once that PR is merged.

Contents
Two pieces are combined into one PR because the Clojure-side hook is unusable without the Go CLI.

1. cmd/elastickv-list-routes (Go)
A new binary that prints the cluster's current route catalog as JSON.
  { "catalog_version": 7, "routes": [ {"route_id": 100, "raft_group_id": 1, "start": "...", "end": "...", "state": "ROUTE_STATE_ACTIVE"} ] }
start / end are base64-encoded. Smoke tests: 3 tests / 6 assertions:
  - empty routes serialise as [] (not nil)
  - raw bytes containing | (outside the base64 alphabet) round-trip correctly
  - empty End (the +infinity sentinel) is represented as an empty string

2. Clojure verify-multi-group-routing!
Called at the start of the workload client's setup! in dynamodb-multi-table-workload:
  - shells out to cmd/elastickv-list-routes
  - extracts raft_group_id, deliberately without a JSON dependency (the CLI's Go tests are the contract)
  - fails fast with ex-info if <2 distinct groups are observed (the error message names the launch-script flag)
jepsen.client/setup! is called only once per test (unlike the per-node jepsen.db/setup!), so no first-node gate is needed. Verification tests: 4 tests / 5 assertions:
  - multi-group routing → #{1 2}
  - same-group routes → #{1} (this is rejected)
  - missing routes field → #{} (fails noisily on future schema drift)

Design decisions
Why a regex rather than a JSON dependency: the CLI's JSON shape is pinned by Go tests. All the hook needs is a coarse check of how many groups own routes. If raft_group_id is ever renamed, the regex returns #{}, verify-multi-group-routing! throws, and the schema drift surfaces loudly, which prevents a silent pass.

Verification
  - go build + go test -race + golangci-lint on cmd/elastickv-list-routes: 0 issues
  - lein test elastickv.dynamodb-multi-table-workload-test → 13 tests / 27 assertions, 0 failures (up from the existing 9 / 22; +4 tests / 5 assertions)

Remaining M5a work
  - add the elastickv-list-routes build step + thread through the :list-routes-bin opt
  - dispatchMultiShardTxn is exercised / Elle reports zero G1c (E2E run)