Skip to content

ZK multi path transaction API support#18380

Merged
noob-se7en merged 5 commits into
apache:masterfrom
krishan1390:zk_multi_path_api
May 13, 2026
Merged

ZK multi path transaction API support#18380
noob-se7en merged 5 commits into
apache:masterfrom
krishan1390:zk_multi_path_api

Conversation

@krishan1390
Copy link
Copy Markdown
Contributor

@krishan1390 krishan1390 commented Apr 30, 2026

Summary

Adds a fluent ZkMultiWriteBuilder API for issuing atomic ZooKeeper multi() transactions over Helix property-store paths, exposed via PinotHelixResourceManager.multiWriteZK(). This lets callers mutate multiple property-store znodes (e.g. several SegmentZKMetadata entries ) as a single all-or-nothing batch instead of issuing per-path writes that can leave the property store in a half-applied state on a crash or version-mismatch retry.

helixResourceManager.multiWriteZK()
    .set(segmentPath1, seg1.toZNRecord(), expectedVersion1)
    .set(segmentPath2, seg2.toZNRecord(), expectedVersion2)
    .check(guardPath, guardVersion)
    .execute();   // throws KeeperException on atomic rollback

What's in the API

  • set(path, record[, expectedVersion]) — overwrite a znode with optional CAS.
  • create(path, record) — create a persistent znode.
  • delete(path[, expectedVersion]) — delete with optional CAS.
  • check(path, expectedVersion) — version assertion that gates the rest of the batch.
  • execute() — submits all accumulated ops as a single ZK multi().

Op paths are property-store-relative (the same paths callers pass to ZkHelixPropertyStore). The builder prepends /{cluster}/PROPERTYSTORE before submitting. Multi-path writes outside the property store are intentionally unsupported.

Error model

On atomic rollback, execute() throws the underlying KeeperException subtype so callers can branch cleanly:

  • BadVersionException / NoNodeException / NodeExistsException → retryable concurrent-state changes.
  • Other subtypes → hard errors.

Per-op offender info is reachable via KeeperException#getResults(). Connectivity / session failures (timeout, interrupt, session expiry) propagate as the original ZkException since they are not atomic outcomes. The builder is single-use and not thread-safe — obtain a fresh one per transaction.

Why a dedicated ZkClient

Helix 1.3.2 does not expose multi() on BaseDataAccessor, and the ZkClient inside ZKHelixManager is not publicly reachable. Reusing it would require reflection, which breaks across Helix point-releases. PinotHelixResourceManager lazily builds one dedicated ZkClient on first multiWriteZK() call (derived from the started Helix manager's ZK address) and closes it asynchronously on stop(). This is consistent with the controller's existing footprint — _propertyStore and _leadControllerManager already hold their own ZK sessions.

Test plan

  • Added ZkMultiWriteBuilderTest covering ops, fluent chaining, single-use guard, path-prefix validation, and KeeperException unwrapping from ZkException.
  • Added PinotHelixResourceManagerStatelessTest#testMultiWriteZkSegmentMetadataUpdates — pre-creates two SegmentZKMetadata znodes, updates both atomically via multiWriteZK(), and verifies round-trip read through the property store.
  • mvn spotless:apply && mvn checkstyle:check

krishan1390 and others added 2 commits April 23, 2026 17:18
Replace PinotZkOp + PinotZkMultiResult + ZkMultiWriter with a single
ZkMultiWriteBuilder; PinotHelixResourceManager.multiWriteZK() returns a
fresh builder. execute() returns void and throws KeeperException on
atomic rollback, so callers branch on BadVersionException /
NoNodeException / NodeExistsException for the retry-vs-hard-error
trichotomy without a custom result type.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Apr 30, 2026

Codecov Report

❌ Patch coverage is 86.07595% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.71%. Comparing base (5fd3183) to head (a64da9c).

Files with missing lines Patch % Lines
...ntroller/helix/core/PinotHelixResourceManager.java 77.77% 3 Missing and 5 partials ⚠️
.../pinot/common/utils/helix/ZkMultiWriteBuilder.java 93.02% 1 Missing and 2 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18380      +/-   ##
============================================
+ Coverage     63.68%   63.71%   +0.03%     
- Complexity     1682     1684       +2     
============================================
  Files          3262     3263       +1     
  Lines        199826   199903      +77     
  Branches      31031    31041      +10     
============================================
+ Hits         127255   127366     +111     
+ Misses        62421    62388      -33     
+ Partials      10150    10149       -1     
Flag Coverage Δ
custom-integration1 100.00% <ø> (ø)
integration 100.00% <ø> (ø)
integration1 100.00% <ø> (ø)
integration2 0.00% <ø> (ø)
java-21 63.71% <86.07%> (+0.03%) ⬆️
temurin 63.71% <86.07%> (+0.03%) ⬆️
unittests 63.71% <86.07%> (+0.03%) ⬆️
unittests1 55.86% <93.02%> (+0.08%) ⬆️
unittests2 34.97% <63.29%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

- Switch ZkMultiWriteBuilder ctor to concrete ZkClient (matches the
  rest of Pinot, which uses ZkClient over RealmAwareZkClient).
- Document the builder is single-use and not thread-safe.
- Restrict multiWriteZK to Helix property-store paths: builder takes a
  propertyStoreRoot prefix and op paths are property-store-relative
  (the same paths callers pass to ZkHelixPropertyStore). Validates
  prefix shape in the constructor.
- Simplify PinotHelixResourceManager multi-write client setup: derive
  zkAddress from the started Helix manager instead of caching it +
  drop unused _stopped flag.
- Add PinotHelixResourceManagerStatelessTest#testMultiWriteZkSegment
  MetadataUpdates: pre-creates two SegmentZKMetadata znodes, updates
  both atomically via multiWriteZK(), and verifies round-trip read
  through the property store.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@krishan1390 krishan1390 marked this pull request as ready for review April 30, 2026 08:38
Copy link
Copy Markdown
Contributor

@xiangfu0 xiangfu0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Flagging one backward-compatibility risk in multiWriteZK() root selection.

*/
public ZkMultiWriteBuilder multiWriteZK() {
return new ZkMultiWriteBuilder(getOrBuildMultiWriteZkClient(),
PropertyPathBuilder.propertyStore(_helixClusterName));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hard-coding PropertyPathBuilder.propertyStore(_helixClusterName) here drops the same fallback behavior that the rest of Pinot gets from Helix's property store for legacy /<cluster>/HELIX_PROPERTYSTORE roots. On an upgraded cluster that still stores metadata under the legacy subtree, multiWriteZK().set(...) will start failing with NoNodeException, and create(...) can silently write shadow znodes under /<cluster>/PROPERTYSTORE that the normal controller reads never see. This public API needs to derive the actual active property-store root or preserve the same fallback semantics before it is safe to expose.

Copy link
Copy Markdown
Contributor Author

@krishan1390 krishan1390 May 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took a closer look.

The concerns are true, but it's not a regression: existing _propertyStore.set(...) and PinotHelixResourceManager.setZKData(...) already bypass the auto-promote that only AutoFallbackPropertyStore.update triggers, so they have the same hazard on a legacy-root cluster. The only Pinot reference to HELIX_PROPERTYSTORE anywhere in the codebase is the read-side PredownloadZKClient; no modern Pinot cluster runs against the legacy root.

Given that this API doesn't introduce a failure mode beyond what existing direct-set paths already have on the same hypothetical legacy cluster, I'd rather not expand the surface area with fallback-aware logic for a path with no real callers.

Let me know if you think its still important to add

Preconditions.checkState(_helixZkManager != null,
"multiWriteZK unavailable: PinotHelixResourceManager has not been started");
String zkAddress = _helixZkManager.getMetadataStoreConnectionString();
int sessionTimeoutMs = CommonConstants.Helix.ZkClient.DEFAULT_SESSION_TIMEOUT_MS;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can sessionTimeoutMs and connectTimeoutMs configurable from the current zk configure way, e.g. through system properties ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in [0db74ff] — threaded ControllerConf into PinotHelixResourceManager so the dedicated multi-write ZkClient now honors the same zk.client.session.timeout.ms / zk.client.connection.timeout.ms controller config keys that HelixSetupUtils already uses, falling back to the existing defaults when no config is supplied.

On the system-property side (jute.maxbuffer and friends): those are read by the ZooKeeper client library itself from JVM system properties at client init time, so they're picked up automatically through the shared ZkClient.Builder — no extra plumbing needed.

krishan1390 and others added 2 commits May 11, 2026 21:31
Thread ControllerConf into PinotHelixResourceManager so the dedicated
multi-write ZkClient honors the existing zk.client.session.timeout.ms
and zk.client.connection.timeout.ms controller config keys (same keys
HelixSetupUtils already uses), instead of always using the hardcoded
defaults. JVM-level ZooKeeper system properties (e.g. jute.maxbuffer)
continue to be picked up automatically by the ZooKeeper client library
through the shared ZkClient.Builder.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@noob-se7en noob-se7en merged commit ac1eb38 into apache:master May 13, 2026
15 of 16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants