ZK multi path transaction API support#18380
Conversation
Replace PinotZkOp + PinotZkMultiResult + ZkMultiWriter with a single ZkMultiWriteBuilder; PinotHelixResourceManager.multiWriteZK() returns a fresh builder. execute() returns void and throws KeeperException on atomic rollback, so callers branch on BadVersionException / NoNodeException / NodeExistsException for the retry-vs-hard-error trichotomy without a custom result type. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #18380 +/- ##
============================================
+ Coverage 63.68% 63.71% +0.03%
- Complexity 1682 1684 +2
============================================
Files 3262 3263 +1
Lines 199826 199903 +77
Branches 31031 31041 +10
============================================
+ Hits 127255 127366 +111
+ Misses 62421 62388 -33
+ Partials 10150 10149 -1
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
- Switch ZkMultiWriteBuilder ctor to concrete ZkClient (matches the rest of Pinot, which uses ZkClient over RealmAwareZkClient). - Document the builder is single-use and not thread-safe. - Restrict multiWriteZK to Helix property-store paths: builder takes a propertyStoreRoot prefix and op paths are property-store-relative (the same paths callers pass to ZkHelixPropertyStore). Validates prefix shape in the constructor. - Simplify PinotHelixResourceManager multi-write client setup: derive zkAddress from the started Helix manager instead of caching it + drop unused _stopped flag. - Add PinotHelixResourceManagerStatelessTest#testMultiWriteZkSegment MetadataUpdates: pre-creates two SegmentZKMetadata znodes, updates both atomically via multiWriteZK(), and verifies round-trip read through the property store. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
xiangfu0
left a comment
There was a problem hiding this comment.
Flagging one backward-compatibility risk in multiWriteZK() root selection.
| */ | ||
| public ZkMultiWriteBuilder multiWriteZK() { | ||
| return new ZkMultiWriteBuilder(getOrBuildMultiWriteZkClient(), | ||
| PropertyPathBuilder.propertyStore(_helixClusterName)); |
There was a problem hiding this comment.
Hard-coding PropertyPathBuilder.propertyStore(_helixClusterName) here drops the same fallback behavior that the rest of Pinot gets from Helix's property store for legacy /<cluster>/HELIX_PROPERTYSTORE roots. On an upgraded cluster that still stores metadata under the legacy subtree, multiWriteZK().set(...) will start failing with NoNodeException, and create(...) can silently write shadow znodes under /<cluster>/PROPERTYSTORE that the normal controller reads never see. This public API needs to derive the actual active property-store root or preserve the same fallback semantics before it is safe to expose.
There was a problem hiding this comment.
Took a closer look.
The concerns are true, but it's not a regression: existing _propertyStore.set(...) and PinotHelixResourceManager.setZKData(...) already bypass the auto-promote that only AutoFallbackPropertyStore.update triggers, so they have the same hazard on a legacy-root cluster. The only Pinot reference to HELIX_PROPERTYSTORE anywhere in the codebase is the read-side PredownloadZKClient; no modern Pinot cluster runs against the legacy root.
Given that this API doesn't introduce a failure mode beyond what existing direct-set paths already have on the same hypothetical legacy cluster, I'd rather not expand the surface area with fallback-aware logic for a path with no real callers.
Let me know if you think its still important to add
| Preconditions.checkState(_helixZkManager != null, | ||
| "multiWriteZK unavailable: PinotHelixResourceManager has not been started"); | ||
| String zkAddress = _helixZkManager.getMetadataStoreConnectionString(); | ||
| int sessionTimeoutMs = CommonConstants.Helix.ZkClient.DEFAULT_SESSION_TIMEOUT_MS; |
There was a problem hiding this comment.
can sessionTimeoutMs and connectTimeoutMs configurable from the current zk configure way, e.g. through system properties ?
There was a problem hiding this comment.
Done in [0db74ff] — threaded ControllerConf into PinotHelixResourceManager so the dedicated multi-write ZkClient now honors the same zk.client.session.timeout.ms / zk.client.connection.timeout.ms controller config keys that HelixSetupUtils already uses, falling back to the existing defaults when no config is supplied.
On the system-property side (jute.maxbuffer and friends): those are read by the ZooKeeper client library itself from JVM system properties at client init time, so they're picked up automatically through the shared ZkClient.Builder — no extra plumbing needed.
Thread ControllerConf into PinotHelixResourceManager so the dedicated multi-write ZkClient honors the existing zk.client.session.timeout.ms and zk.client.connection.timeout.ms controller config keys (same keys HelixSetupUtils already uses), instead of always using the hardcoded defaults. JVM-level ZooKeeper system properties (e.g. jute.maxbuffer) continue to be picked up automatically by the ZooKeeper client library through the shared ZkClient.Builder. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Adds a fluent
ZkMultiWriteBuilderAPI for issuing atomic ZooKeepermulti()transactions over Helix property-store paths, exposed viaPinotHelixResourceManager.multiWriteZK(). This lets callers mutate multiple property-store znodes (e.g. severalSegmentZKMetadataentries ) as a single all-or-nothing batch instead of issuing per-path writes that can leave the property store in a half-applied state on a crash or version-mismatch retry.What's in the API
set(path, record[, expectedVersion])— overwrite a znode with optional CAS.create(path, record)— create a persistent znode.delete(path[, expectedVersion])— delete with optional CAS.check(path, expectedVersion)— version assertion that gates the rest of the batch.execute()— submits all accumulated ops as a single ZKmulti().Op paths are property-store-relative (the same paths callers pass to
ZkHelixPropertyStore). The builder prepends/{cluster}/PROPERTYSTOREbefore submitting. Multi-path writes outside the property store are intentionally unsupported.Error model
On atomic rollback,
execute()throws the underlyingKeeperExceptionsubtype so callers can branch cleanly:BadVersionException/NoNodeException/NodeExistsException→ retryable concurrent-state changes.Per-op offender info is reachable via
KeeperException#getResults(). Connectivity / session failures (timeout, interrupt, session expiry) propagate as the originalZkExceptionsince they are not atomic outcomes. The builder is single-use and not thread-safe — obtain a fresh one per transaction.Why a dedicated
ZkClientHelix 1.3.2 does not expose
multi()onBaseDataAccessor, and theZkClientinsideZKHelixManageris not publicly reachable. Reusing it would require reflection, which breaks across Helix point-releases.PinotHelixResourceManagerlazily builds one dedicatedZkClienton firstmultiWriteZK()call (derived from the started Helix manager's ZK address) and closes it asynchronously onstop(). This is consistent with the controller's existing footprint —_propertyStoreand_leadControllerManageralready hold their own ZK sessions.Test plan
ZkMultiWriteBuilderTestcovering ops, fluent chaining, single-use guard, path-prefix validation, andKeeperExceptionunwrapping fromZkException.PinotHelixResourceManagerStatelessTest#testMultiWriteZkSegmentMetadataUpdates— pre-creates twoSegmentZKMetadataznodes, updates both atomically viamultiWriteZK(), and verifies round-trip read through the property store.mvn spotless:apply && mvn checkstyle:check