branch-4.0: [fix](coordinator) fix computeDestIdToInstanceId picking wrong ExchangeNode for multi-input fragments #63615 #63819

Open
github-actions[bot] wants to merge 2 commits into branch-4.0 from auto-pick-63615-branch-4.0

@github-actions
Contributor

Cherry-picked from #63615

…geNode for multi-input fragments (#63615)

## Proposed changes

Fix `Rows mismatched! Data may be lost` error when a fragment receives data from multiple ExchangeNode inputs with different partition types (e.g. NLJ with HASH-partitioned probe + BROADCAST build).

### Root cause

`ThriftPlansBuilder.filterInstancesWhichReceiveDataFromRemote` used `.iterator().next()` to pick the first input ExchangeNode. The iteration order over a `Set<Entry>` is non-deterministic. When it happens to pick the BROADCAST input (1 destination per BE), `shuffle_idx_to_instance_idx` has only 1 entry, while the HASH LOCAL_EXCHANGE expects N entries (one per pipeline task). Most hash partition indices find no mapping, and BE reports the error.
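
To illustrate the mismatch described above, here is a minimal toy model in Java. It is not the Doris implementation: the class `ShuffleMapSketch`, the method `unmappedPartitions`, and the identity-style map are all hypothetical, and the real `shuffle_idx_to_instance_idx` is built differently. The sketch only assumes what the paragraph states: one map entry per destination on the BE, and N hash partition indices that each need an entry.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy model; names and structure are assumptions, not Doris code.
public class ShuffleMapSketch {

    /**
     * Builds a map with one entry per destination on the worker, then counts
     * how many of the expected hash partition indices have no entry.
     */
    static int unmappedPartitions(int destinationsOnWorker, int pipelineTasks) {
        Map<Integer, Integer> shuffleIdxToInstanceIdx = new HashMap<>();
        for (int i = 0; i < destinationsOnWorker; i++) {
            shuffleIdxToInstanceIdx.put(i, i);
        }
        int missing = 0;
        for (int partition = 0; partition < pipelineTasks; partition++) {
            if (!shuffleIdxToInstanceIdx.containsKey(partition)) {
                missing++;
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        int pipelineTasks = 4;
        // BROADCAST input picked: 1 destination per BE, so 3 of 4 indices are unmapped.
        System.out.println(unmappedPartitions(1, pipelineTasks)); // prints 3
        // HASH input picked: one destination per pipeline task, so nothing is unmapped.
        System.out.println(unmappedPartitions(pipelineTasks, pipelineTasks)); // prints 0
    }
}
```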

Reproduction: a CTE query with `MultiCastDataSinks` sending UNPARTITIONED (to a BROADCAST build) and HASH_PARTITIONED (to an INNER JOIN build) into the same scan-free fragment. The bug is non-deterministic because it depends on Set iteration order.

### Fix

Iterate all input exchanges and select the one with the most destinations on the target worker. This correctly identifies the main data-carrying (HASH-partitioned) exchange, ensuring the map is complete.
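
The selection rule described in the fix could be sketched roughly as follows. This is an illustrative stand-in, not the patch itself: `ExchangeSelectionSketch`, `Destination`, `InputExchange`, `countOnWorker`, and `pickMainExchange` are hypothetical names, and the tie-breaking rule (earliest input wins) is a choice made for this sketch rather than something the PR description specifies.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of "pick the input with the most destinations on the target worker".
public class ExchangeSelectionSketch {

    /** Stand-in for one destination instance of an exchange (assumed shape). */
    static final class Destination {
        final String workerId;
        final int instanceId;

        Destination(String workerId, int instanceId) {
            this.workerId = workerId;
            this.instanceId = instanceId;
        }
    }

    /** Stand-in for one input exchange and all of its destinations (assumed shape). */
    static final class InputExchange {
        final String name;
        final List<Destination> destinations;

        InputExchange(String name, List<Destination> destinations) {
            this.name = name;
            this.destinations = destinations;
        }
    }

    /** Counts the destinations of one exchange that live on the given worker. */
    static int countOnWorker(InputExchange exchange, String workerId) {
        int count = 0;
        for (Destination d : exchange.destinations) {
            if (d.workerId.equals(workerId)) {
                count++;
            }
        }
        return count;
    }

    /** Returns the input with the most destinations on workerId, or null if inputs is empty. */
    static InputExchange pickMainExchange(List<InputExchange> inputs, String workerId) {
        InputExchange best = null;
        int bestCount = -1;
        for (InputExchange candidate : inputs) {
            int count = countOnWorker(candidate, workerId);
            if (count > bestCount) { // strict '>' keeps the earliest input on ties
                best = candidate;
                bestCount = count;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        InputExchange broadcast = new InputExchange("broadcast", Arrays.asList(
                new Destination("be1", 0), new Destination("be2", 4)));
        InputExchange hash = new InputExchange("hash", Arrays.asList(
                new Destination("be1", 0), new Destination("be1", 1),
                new Destination("be1", 2), new Destination("be1", 3),
                new Destination("be2", 4), new Destination("be2", 5)));
        // Listing the broadcast input first mimics the unlucky iteration order.
        InputExchange picked = pickMainExchange(Arrays.asList(broadcast, hash), "be1");
        System.out.println(picked.name); // prints "hash"
    }
}
```

Unlike `.iterator().next()`, the result here does not depend on the order in which the inputs are visited, except when two inputs have the same destination count on the worker.
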
@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@hello-stephen
Contributor

run buildall

@924060929
Contributor

run buildall

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 81.25% (13/16) 🎉
Increment coverage report
Complete coverage report
