[fix](local shuffle) fix bucket shuffle exchange incorrectly marked as serial in pooling mode#62054
924060929 wants to merge 4 commits into apache:master
Conversation
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
e935b63 to 4c3e765

run buildall
…s serial in pooling mode

1. ExchangeNode.toThrift: is_serial = isSerialOperator() && useSerialSource(). isSerialOperator() is pure operator-level: only UNPARTITIONED or use_serial_exchange. useSerialSource() is the fragment-level pooling guard, kept in toThrift (not in isSerialOperator) to avoid infinite recursion: useSerialSource() calls planRoot.isSerialOperator() internally.
2. DistributePlanner.filterInstancesWhichCanReceiveDataFromRemote(): use linkNode.isSerialOperator() && linkNode.getFragment().useSerialSource() to decide whether to dedupe to first-per-worker destinations. Bucket shuffle / hash exchanges are never serial; scan seriality is handled by the scan pipeline independently.
3. Add unit tests for both changes.
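The two predicates in the commit message compose as follows. This is a minimal, self-contained sketch with simplified stand-ins for the FE classes (the method names mirror the commit message, but the shapes here are illustrative, not the actual Doris API):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of the serial-exchange decision and the first-per-worker dedupe
// described above. Simplified stand-ins, not the actual Doris FE classes.
public class SerialExchangeSketch {

    // Operator-level check: only UNPARTITIONED exchanges or an explicit
    // use_serial_exchange hint make the exchange itself serial.
    static boolean isSerialOperator(String partitionType, boolean useSerialExchange) {
        return partitionType.equals("UNPARTITIONED") || useSerialExchange;
    }

    // Fragment-level pooling guard combined in toThrift rather than inside
    // isSerialOperator(), to avoid the recursion noted in the commit message.
    static boolean isSerial(String partitionType, boolean useSerialExchange,
                            boolean fragmentUsesSerialSource) {
        return isSerialOperator(partitionType, useSerialExchange) && fragmentUsesSerialSource;
    }

    // Dedupe destinations to the first instance per worker, as in
    // filterInstancesWhichCanReceiveDataFromRemote(); each entry is a
    // (workerId, instanceId) pair in plan order.
    static List<String> firstInstancePerWorker(List<String[]> instances) {
        Map<String, String> firstPerWorker = new LinkedHashMap<>();
        for (String[] inst : instances) {
            firstPerWorker.putIfAbsent(inst[0], inst[1]);
        }
        return new ArrayList<>(firstPerWorker.values());
    }
}
```

With this predicate, a bucket-shuffle exchange on a pooled fragment is non-serial (no dedupe), while an UNPARTITIONED exchange on the same fragment still collapses to one destination per worker.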
4c3e765 to 122110b

run buildall
FE UT Coverage Report: Increment line coverage
TPC-H: Total hot run time: 16520 ms
TPC-DS: Total hot run time: 73983 ms
…in pooling mode

### What problem does this PR solve?

Issue Number: close #xxx
Related PR: apache#62054

Problem Summary: When ExchangeNode.isSerialOperator() is changed to not include hasSerialScanNode() (PR apache#62054), non-serial BUCKET_SHUFFLE exchanges with pooled scans cause query hangs. Two issues are fixed:

1. **FE: Sender destination mismatch** - `sortDestinationInstancesByBuckets()` used scan source buckets (`bucketIndexToScanNodeToTablets`), which are pooled to one instance per BE, causing ALL bucket data to be sent to just one instance per BE. But `computeBucketIdToInstanceId()` (receiver side) uses join bucket assignments (`getAssignedJoinBucketIndexes()`), which distribute buckets across instances. Fix: Use `getAssignedJoinBucketIndexes()` for `LocalShuffleBucketJoinAssignedJob` in non-serial mode, consistent with `computeBucketIdToInstanceId()`.
2. **BE: Padding instance hang** - When num_instances > bucket count on a BE, extra "padding" instances create VDataStreamRecvrs that never receive EOS. Fix: Identify padding instances (those not among `bucket_seq_to_instance_idx` values) and create their recvrs with num_senders=0 so they complete immediately.

### Release note

None

### Check List (For Author)

- Test: Manual test with 3-BE cluster, all combinations of serial/non-serial, share_hash_table on/off, parallel_pipeline_task_num 0-8
- Behavior changed: No
- Does this need documentation: No

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
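The sender/receiver mismatch in point 1 can be illustrated with a toy bucket-to-instance mapping. The numbers below are hypothetical; the real logic lives in sortDestinationInstancesByBuckets() (sender) and computeBucketIdToInstanceId() (receiver):

```java
import java.util.Map;
import java.util.TreeMap;

// Toy illustration of the FE sender/receiver mismatch: with pooled scans,
// scan-source buckets collapse onto one instance per BE, while the receiver
// side distributes join buckets across that BE's instances.
public class BucketRoutingSketch {

    // Pooled scan source: every bucket on a BE is assigned to that BE's
    // first instance (the buggy sender-side view).
    static Map<Integer, Integer> pooledScanBuckets(int numBuckets, int firstInstance) {
        Map<Integer, Integer> mapping = new TreeMap<>();
        for (int b = 0; b < numBuckets; b++) {
            mapping.put(b, firstInstance);
        }
        return mapping;
    }

    // Join bucket assignment: buckets spread round-robin over the BE's
    // instances (the receiver-side view, as getAssignedJoinBucketIndexes()
    // would distribute them; round-robin is an assumption for illustration).
    static Map<Integer, Integer> joinBucketAssignment(int numBuckets, int firstInstance,
                                                      int numInstances) {
        Map<Integer, Integer> mapping = new TreeMap<>();
        for (int b = 0; b < numBuckets; b++) {
            mapping.put(b, firstInstance + b % numInstances);
        }
        return mapping;
    }
}
```

With 4 buckets and 2 instances on a BE, the sender-side view targets a single instance while the receiver expects two, so data for the second instance's buckets never arrives and the receiver waits forever.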
baeaebc to d2aa5c4

run buildall
BE UT Coverage Report: Increment line coverage / Increment coverage report
FE UT Coverage Report: Increment line coverage

/review
1 finding
fe/fe-core/src/test/java/org/apache/doris/nereids/trees/plans/distribute/BucketShuffleDestinationTest.java: the new unit test is wired to the pre-change private API and does not actually execute the code path it is meant to validate.
Critical checkpoints
- Goal of the task: Partially achieved. The FE/BE logic change for non-serial BUCKET_SHUFFLE exchanges looks internally consistent with ThriftPlansBuilder.computeBucketIdToInstanceId(), but the added FE test is broken, so the PR does not currently provide working proof for this fix.
- Scope/minimality: Yes. The production changes are focused on exchange seriality and bucket-to-instance routing.
- Concurrency: Reviewed the BE dependency/sender-count changes in exchange_source_operator and pipeline_fragment_context; I did not find a concrete lock-order or readiness regression in the inspected paths.
- Lifecycle/static initialization: Reviewed the recvr reset/close path; no static-init issue found, and the recvr lifecycle still appears bounded.
- Configuration: No new config items.
- Compatibility: No incompatible FE/BE protocol or storage-format change identified in the reviewed paths.
- Parallel code paths: Checked the FE destination ordering against ThriftPlansBuilder.computeBucketIdToInstanceId() and the BE bucket_seq_to_instance_idx handling; those paths are aligned after this change.
- Special conditional checks: The new serial-exchange guard and padding-instance handling are documented well enough to follow.
- Test coverage: Not sufficient yet. One newly added FE test is stale and will fail before exercising the target logic.
- Observability: Sufficient for this change; logging adjustments are minor.
- Transaction/persistence: Not applicable.
- Data write/atomicity: Not applicable.
- FE-BE variable passing: Checked the relevant bucket_seq_to_instance_idx flow; no concrete mismatch found.
- Performance: The change addresses the reported parallelism regression; I did not find a new obvious hot-path issue beyond the broken test.
- Other issues: None beyond the test failure below.
d2aa5c4 to d4ec761
/review

run buildall
Found 1 issue.
- High: DistributePlanner.filterInstancesWhichCanReceiveDataFromRemote() now drops the enableShareHashTableForBroadcastJoin && linkNode.isRightChildOfBroadcastHashJoin() branch, so Nereids no longer collapses broadcast-build receivers to one instance per BE. PhysicalPlanTranslator still marks those exchanges, and the legacy coordinator still relies on that contract to let sibling instances share a single build hash table. With this patch, ordinary broadcast joins can start broadcasting the build side to every local pipeline instance again, which regresses the shared-hash-table optimization outside the bucket-shuffle bug being fixed.
Critical checkpoint conclusions:
- Goal of current task: Partially accomplished. The bucket-shuffle seriality/routing fix looks correct for the targeted path, and the added FE tests cover that path, but the broader receiver-selection change regresses shared broadcast-hash-table planning.
- Small, clear, focused modification: Mostly focused, but filterInstancesWhichCanReceiveDataFromRemote() now changes broadcast-join behavior unrelated to the reported bucket-shuffle issue.
- Concurrency: Applicable in the BE exchange changes. I did not find a new lock-order or atomic-order bug in the patched BE paths.
- Lifecycle/static initialization: Applicable in the BE receiver lifecycle. The new zero-sender receiver lifecycle appears internally consistent; no static-init issue found.
- Configuration items: No new configs added.
- Compatibility: No FE/BE protocol or storage compatibility issue identified in the reviewed paths.
- Functionally parallel code paths: Not fully handled. The legacy coordinator still preserves one-receiver-per-BE behavior for shared broadcast hash tables, but the Nereids planner path removed the equivalent branch.
- Special conditional checks: The new serial-exchange comments are clear enough.
- Test coverage: Insufficient for the full behavior change. Added FE tests cover bucket-shuffle seriality and bucket routing, but there is no test for enableShareHashTableForBroadcastJoin on a broadcast join, and no BE test for the new zero-sender receiver path.
- Observability: Adequate. Logging changes are minor and do not block review.
- Transaction/persistence: Not applicable.
- Data writes/modifications: Not applicable.
- New FE-BE variables passed across all paths: No missing new variable path found in the reviewed changes.
- Performance: Regression found in the shared broadcast-hash-table path described above.
- Other issues: None beyond the item above.
Overall opinion: needs follow-up before merge because the FE receiver-selection predicate now regresses an existing optimization on broadcast joins unrelated to bucket shuffle.
return getFirstInstancePerWorker(receiverPlan.getInstanceJobs());
} else if (enableShareHashTableForBroadcastJoin && linkNode.isRightChildOfBroadcastHashJoin()) {
// isSerialOperator(): UNPARTITIONED or use_serial_exchange (operator-level)
// useSerialSource(): fragment is in pooling mode (fragment-level guard)
filterInstancesWhichCanReceiveDataFromRemote() used to collapse destinations not only for local-shuffle serial exchanges, but also when enableShareHashTableForBroadcastJoin and linkNode.isRightChildOfBroadcastHashJoin() were true. That second branch is still part of the execution contract: PhysicalPlanTranslator still marks the right side of broadcast joins (fe/fe-core/src/main/java/org/apache/doris/nereids/glue/translator/PhysicalPlanTranslator.java:1705-1710), and the legacy coordinator still sends only one destination per BE while the sibling instances share that hash table (fe/fe-core/src/main/java/org/apache/doris/qe/Coordinator.java:1451-1470, 1594-1614). After this change, Nereids will add every local instance as a destination for those exchanges, so the build side gets broadcast to every pipeline instance again. That reintroduces the per-instance network/build cost the shared-hash-table optimization was avoiding for ordinary broadcast joins, even though this PR is only trying to relax bucket-shuffle seriality. Please preserve the broadcast-join dedupe branch here (or fold it into the new predicate) alongside the bucket-shuffle fix.
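One way to fold both conditions into the receiver-dedupe predicate, as this comment suggests, is sketched below with simplified boolean inputs (this is an illustration of the suggested fix, not the actual FE code):

```java
// Sketch of a combined dedupe predicate for
// filterInstancesWhichCanReceiveDataFromRemote(): collapse destinations to
// one instance per BE either for serial pooled exchanges or for the
// broadcast-join shared-hash-table case. Simplified booleans, not real FE code.
public class ReceiverDedupeSketch {
    static boolean dedupeToFirstPerWorker(boolean isSerialOperator,
                                          boolean useSerialSource,
                                          boolean enableShareHashTableForBroadcastJoin,
                                          boolean isRightChildOfBroadcastHashJoin) {
        // Serial pooled exchange: one receiver per BE by design.
        boolean serialExchange = isSerialOperator && useSerialSource;
        // Shared broadcast build: one receiver per BE so siblings share the hash table.
        boolean sharedBroadcastBuild =
                enableShareHashTableForBroadcastJoin && isRightChildOfBroadcastHashJoin;
        return serialExchange || sharedBroadcastBuild;
    }
}
```

The second disjunct is exactly the branch the reviewer says must be preserved: it keeps broadcast builds deduped even when the exchange itself is non-serial.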
d4ec761 to aa0752f

run buildall

/review
Found 1 issue.
be/src/exec/pipeline/pipeline_fragment_context.cpp: the new dest_instances set is built from bucket_seq_to_instance_idx, whose values are FE fragment-instance indexes (ThriftPlansBuilder.instanceToIndex()), but ExchangeLocalState later checks membership with info.task_idx, which is the BE pipeline task index. Those two index spaces are only equal while parallel_instances == 1. In pooled fragments the BE creates multiple tasks per fragment instance, so any task with task_idx >= fragment_instance_count is now misclassified as a padding instance, gets num_senders=0, and returns EOS immediately. That drops valid receivers for real fragment instances instead of just the extra local-shuffle tasks. Please key this logic by fragment instance index / sender id rather than pipeline task index, or derive the destination set in BE terms before checking it.
Critical checkpoint conclusions:
- Goal / correctness: The PR clearly aims to stop BUCKET_SHUFFLE from being forced serial in pooled fragments and to avoid hangs from padding receivers. The FE routing fix is consistent with computeBucketIdToInstanceId(), but the BE destination-instance check is keyed to the wrong index space, so the end-to-end fix is not yet correct. Existing tests cover FE seriality and bucket routing, but there is no BE test for the new padding-instance path.
- Change size / focus: The change is reasonably focused on exchange seriality and receiver routing.
- Concurrency: The touched BE paths are mostly existing dependency/receiver flows; I did not find a new lock-order issue in the changed code. The main problem is incorrect task classification, not locking.
- Lifecycle / initialization: No new special lifecycle or static-init issue identified.
- Config / compatibility: No new config or storage compatibility issue identified.
- Parallel paths: The review checked FE destination building and BE receiver creation together; that cross-path consistency is exactly where the bug appears.
- Conditional checks: The new conditional around destination instances lacks the needed invariant that FE instance indexes must match BE task indexes; today that invariant does not hold in pooled execution.
- Test coverage: FE unit tests were added, but BE coverage is missing for the new num_senders=0 path and for pooled multi-task fragments.
- Observability: Logging changes are acceptable; no extra observability seems required for this small fix.
- Transaction / persistence / data writes: Not applicable.
- FE-BE variable passing: The existing FE->BE variables are used consistently, but interpreted inconsistently on BE in the new code.
- Performance: The intended performance improvement is valid, but the index mismatch can suppress legitimate receivers and break execution.
- Other issues: None beyond the incorrect index-space assumption above.
tnode.exchange_node.partition_type == TPartitionType::BUCKET_SHFFULE_HASH_PARTITIONED &&
!(tnode.__isset.is_serial_operator && tnode.is_serial_operator)) {
std::unordered_set<int> dest_instances;
for (const auto& [bucket, idx] : _params.bucket_seq_to_instance_idx) {
bucket_seq_to_instance_idx stores FE fragment-instance indexes (ThriftPlansBuilder.instanceToIndex()), but later ExchangeLocalState checks this set with info.task_idx, which is the BE pipeline task index. In pooled fragments those are different once parallel_instances > 1, so real receiver tasks can be treated as padding tasks and created with num_senders=0. This makes them return EOS immediately instead of receiving remote data. The destination set here needs to be expressed in the same index space that is_destination_instance() will use, or the later check needs to compare against fragment-instance identity rather than task index.
This concern does not apply here. Let me explain:
- bucket_seq_to_instance_idx values are 0-based local instance indices per worker (from ThriftPlansBuilder.instanceToIndex() at line 421-422).
- task_idx equals instance_idx, the loop index into _params.local_params passed to the PipelineTask constructor at line 483.
- In our specific case (non-serial BUCKET_SHUFFLE exchange), the exchange source operator's is_serial_operator() returns false, so add_operator(op, _parallel_instances) at line 1362 does NOT call set_num_tasks(parallelism) (see pipeline.cpp:77). The pipeline keeps num_tasks = _num_instances.
- Since num_tasks > 1, tasks are created for ALL instance_idx from 0 to _num_instances-1 (line 435: pipeline->num_tasks() > 1 || instance_idx == 0).
- Therefore task_idx == instance_idx for every task, which is the same index space as bucket_seq_to_instance_idx values.
The parallel_instances field is set to 1 for pooled fragments (ignoreDataDistribution() at line 503), but it only takes effect when is_serial_operator() returns true (pipeline.cpp:77). For our non-serial exchange, parallel_instances has no effect on task creation.
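Under the author's stated invariant (task_idx == instance_idx for non-serial exchanges), the padding check reduces to set membership on the values of bucket_seq_to_instance_idx. A minimal sketch with names simplified from the BE code:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of the padding-instance classification being debated above: an
// instance index that never appears among bucket_seq_to_instance_idx values
// receives no bucket, so its receiver can be created with zero senders.
// Simplified names; the real logic is C++ in pipeline_fragment_context.cpp.
public class PaddingInstanceSketch {

    // Instances that are real destinations of the bucket shuffle.
    static Set<Integer> destinationInstances(Map<Integer, Integer> bucketSeqToInstanceIdx) {
        return new HashSet<>(bucketSeqToInstanceIdx.values());
    }

    // num_senders for each instance: 0 for padding instances, so their
    // receivers finish immediately instead of waiting forever for EOS.
    static int numSenders(int instanceIdx, Set<Integer> destInstances, int realSenders) {
        return destInstances.contains(instanceIdx) ? realSenders : 0;
    }
}
```

The reviewer's objection is precisely about whether instanceIdx here is compared in the same index space as the map's values; the sketch only works if the invariant holds.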
TPC-H: Total hot run time: 28693 ms
TPC-DS: Total hot run time: 177299 ms
BE UT Coverage Report: Increment line coverage / Increment coverage report
FE UT Coverage Report: Increment line coverage
BE Regression && UT Coverage Report: Increment line coverage / Increment coverage report
PR approved by at least one committer and no changes requested.
PR approved by anyone and no changes requested.
return;
} else {
_finished_channels.emplace(channel_id);
if (_working_channels_count.fetch_sub(1) == 1) {
why change the code? no need?
not removed, just making the code clearer: the remaining count is _working_channels_count.fetch_sub(1) - 1, and remaining == 0 looks more meaningful
auto remaining = _working_channels_count.fetch_sub(1) - 1;
if (remaining == 0)
the old code seems clearer; the new code looks weird
This BE change has been reverted in the latest push. The fix is now FE-only.
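The two formulations debated above are equivalent. In Java's AtomicInteger terms (an analogous sketch, not the C++ BE code), `getAndDecrement()` mirrors `fetch_sub(1)` in returning the value before the decrement, so `fetch_sub(1) == 1` is the same condition as `remaining == 0`:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Analogous sketch in Java of the C++ snippet under discussion: both
// formulations fire exactly once, for the last channel to finish.
public class LastChannelSketch {

    // Old style: _working_channels_count.fetch_sub(1) == 1
    static boolean lastFinisherOldStyle(AtomicInteger workingChannels) {
        return workingChannels.getAndDecrement() == 1; // returns the pre-decrement value
    }

    // New style: auto remaining = fetch_sub(1) - 1; remaining == 0
    static boolean lastFinisherNewStyle(AtomicInteger workingChannels) {
        int remaining = workingChannels.getAndDecrement() - 1;
        return remaining == 0;
    }
}
```

Either way the atomicity guarantee is the same; the disagreement in this thread is purely about readability.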
### What problem does this PR solve?

Issue Number: close #xxx
Related PR: apache#62054

Problem Summary: Restore the BE-side exchange source and pipeline fragment logic to the state before the padding-instance fix so the branch can be validated with FE-only changes.

### Release note

None

### Check List (For Author)

- Test: No need to test (validation will be done in CI/manual runs after this temporary revert)
- Behavior changed: Yes (temporarily removes the BE-side padding-instance handling for validation)
- Does this need documentation: No

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
run buildall

FE UT Coverage Report: Increment line coverage
…change

### What problem does this PR solve?

Problem Summary: When a bucket scan fragment has a non-serial BUCKET_SHUFFLE exchange (after removing hasSerialScanNode() from ExchangeNode.isSerialOperator()), the BE does not insert local exchange because the data distribution matches. However, the FE degreeOfParallelism() in UnassignedScanBucketOlapTableJob uses getTotalTabletsNum() (total across ALL BEs) to inflate the instance count. This creates "padding" instances with no bucket assignments that create VDataStreamRecvrs expecting senders that will never arrive, causing query hangs.

Fix: When the fragment has a non-serial exchange, cap instanceNum at maxParallel (per-BE bucket count) to avoid creating padding instances.

### Release note

None

### Check List (For Author)

- Test: Manual test on 3-BE cluster, all pptn/serial/shareHT combinations pass
- Behavior changed: No
- Does this need documentation: No

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
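The cap described in this commit reduces to clamping the inflated instance count to the per-BE bucket count. A sketch with hypothetical inputs (the real method is UnassignedScanBucketOlapTableJob.degreeOfParallelism(); the parameter names here are illustrative):

```java
// Sketch of the FE-only fix: when the fragment contains a non-serial
// exchange, cap the instance count at the per-BE bucket count (maxParallel)
// instead of inflating it from the cluster-wide tablet count.
// Hypothetical signature, not the actual Doris method.
public class DegreeOfParallelismSketch {
    static int degreeOfParallelism(int instanceNumFromTotalTablets, int maxParallel,
                                   boolean useLocalShuffleToAddParallel,
                                   boolean hasNonSerialExchangeInFragment) {
        if (useLocalShuffleToAddParallel && hasNonSerialExchangeInFragment) {
            // Avoid creating padding instances with no bucket assignment.
            return Math.min(instanceNumFromTotalTablets, maxParallel);
        }
        return instanceNumFromTotalTablets;
    }
}
```

For example, on a 3-BE cluster with 12 tablets total and 4 buckets per BE, the inflated count of 12 would be capped to 4 when the guard fires, and left at 12 otherwise.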
run buildall

/review
Found 1 issue.
Critical checkpoint conclusions:
- Goal of the task: Partially accomplished. The PR fixes the BUCKET_SHUFFLE seriality/destination problem, but the FE-only fallback in UnassignedScanBucketOlapTableJob.degreeOfParallelism() is broader than the bug it describes and can reduce valid fragment parallelism. The added tests do not cover this path.
- Modification size/focus: Mostly small and focused, except for the new fragment-wide hasNonSerialExchangeInFragment() cap, which changes behavior for more than the BUCKET_SHUFFLE case.
hasNonSerialExchangeInFragment()cap, which changes behavior for more than the BUCKET_SHUFFLE case. - Concurrency: No new FE locking/concurrency issues found in the changed code.
- Lifecycle/static initialization: No special lifecycle or static-init concerns found.
- Configuration: No new configs.
- Compatibility: No incompatible FE/BE protocol or storage-format change in this version of the PR.
- Functionally parallel paths: Not fully handled. Plain HASH_PARTITIONED exchanges share the same non-serial predicate here, but they do not have the same sender/receiver padding problem as BUCKET_SHUFFLE.
- Special conditional checks: The new hasNonSerialExchangeInFragment() guard is not narrow enough for the stated failure mode.
- Test coverage: New unit tests cover exchange seriality and bucket destination ordering, but there is no regression test for the new degreeOfParallelism() cap or for pooled bucket-scan fragments with non-BUCKET exchanges.
- Observability: No additional observability needed for this planner-only change.
- Transaction/persistence/data writes: Not applicable.
- FE/BE variable passing: No new variables added.
- Performance: Regression risk exists because the new cap can unnecessarily cut instance count to per-BE bucket count.
- Other issues: None beyond the regression above.
Summary opinion: not ready as-is because the new FE-only cap conflates all non-serial exchanges with the specific BUCKET_SHUFFLE padding-instance hang.
// for the exchange pipeline (distribution matches BUCKET_HASH_SHUFFLE). Instances
// beyond per-BE bucket count are "padding" with no bucket assignment — they create
// VDataStreamRecvrs that never receive data and hang. Cap at per-BE bucket count.
if (useLocalShuffleToAddParallel && hasNonSerialExchangeInFragment()) {
This guard looks too broad for the hang described in the PR. hasNonSerialExchangeInFragment() is also true for ordinary HASH_PARTITIONED exchanges, but those plans still include every receiver instance in the sender destination list, so the extra local-shuffle instances are not the same kind of untargeted "padding" instances as in BUCKET_SHUFFLE. On the BE side Pipeline::need_to_local_exchange() treats BUCKET_HASH_SHUFFLE and HASH_SHUFFLE as compatible hash distributions (be/src/exec/pipeline/pipeline.cpp:57-72), so a pooled bucket-scan fragment can still legitimately use more instances than its local bucket count when the downstream operator only requires hash shuffle. Capping to maxParallel here would silently reduce that parallelism.
Can this be narrowed to the actual failing case (BUCKET_SHFFULE_HASH_PARTITIONED / bucket-index destinations), with a regression test that proves a pooled bucket-scan fragment with a non-serial HASH_PARTITIONED exchange still keeps its higher instance count?
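Narrowing the guard as requested could look like keying on the exchange's partition type instead of any non-serial exchange. A sketch (the partition-type string follows the spelling used in the PR's diff, and this predicate is the reviewer's suggestion, not code from the PR):

```java
// Sketch of the narrower guard the reviewer requests: cap parallelism only
// when the fragment has a non-serial BUCKET_SHUFFLE exchange, leaving plain
// HASH_PARTITIONED exchanges free to keep the higher instance count.
public class NarrowGuardSketch {
    static boolean shouldCapParallelism(String exchangePartitionType,
                                        boolean isSerialOperator,
                                        boolean useLocalShuffleToAddParallel) {
        return useLocalShuffleToAddParallel
                && exchangePartitionType.equals("BUCKET_SHFFULE_HASH_PARTITIONED")
                && !isSerialOperator;
    }
}
```

With this shape, a pooled bucket-scan fragment feeding a non-serial HASH_PARTITIONED exchange keeps its inflated instance count, which is exactly the regression case the review asks to protect with a test.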
FE UT Coverage Report: Increment line coverage
What problem does this PR solve?
Issue Number: close #xxx
Problem Summary:
When a fragment has pooled scan (ignore_data_distribution enabled), BUCKET_SHUFFLE exchanges were incorrectly treated as serial on the BE side (due to hasSerialScanNode() in ExchangeNode.isSerialOperator()). This caused the exchange to reduce receiver instances to only 1 per BE instead of multiple instances matching the bucket count, significantly hurting parallelism and query performance (e.g., TPCH Q9 lineitem+orders bucket shuffle join).

Root cause: The seriality of an exchange should depend solely on the exchange type and the use_serial_exchange setting, not on whether the downstream scan is pooled. Scan pooling is handled independently by the scan pipeline's local shuffle mechanism.

Changes

FE changes:

- ExchangeNode.isSerialOperator(): Remove the hasSerialScanNode() guard. Only UNPARTITIONED exchanges or those with use_serial_exchange=true are serial. BUCKET_SHUFFLE exchanges are never serial.
- DistributePlanner.filterInstancesWhichCanReceiveDataFromRemote(): Use isSerialOperator() && useSerialSource() to decide whether to dedupe receivers to 1 per BE, instead of checking the LocalShuffleAssignedJob type. Preserve the independent enableShareHashTableForBroadcastJoin branch for the broadcast join shared hash table optimization (consistent with Coordinator.computeFragmentExecParams()).
- DistributePlanner.sortDestinationInstancesByBuckets(): For non-serial exchanges with LocalShuffleBucketJoinAssignedJob, use getAssignedJoinBucketIndexes() (the distributed join bucket assignments) to determine sender destinations, consistent with ThriftPlansBuilder.computeBucketIdToInstanceId() on the receiver side. Previously, it always used scan source buckets, which are pooled to one instance per BE, causing a sender-receiver bucket mapping mismatch.

BE changes:

- pipeline_fragment_context.cpp: For non-serial BUCKET_SHUFFLE exchanges, compute the set of destination instances from bucket_seq_to_instance_idx values. Pass this to ExchangeSourceOperatorX.
- exchange_source_operator: Padding instances (those with num_instances > bucket_count per BE, having no bucket assignment) create VDataStreamRecvr with num_senders=0 so they return EOS immediately instead of waiting indefinitely for senders that will never target them.

Release note
None
Check List (For Author)