
[refactor](local shuffle) Move local exchange planning from BE to FE#63366

Open
924060929 wants to merge 7 commits into master from fe_local_shuffle_rebase3


@924060929
Contributor

@924060929 924060929 commented May 18, 2026

What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary:

Move local exchange (LE) planning from BE's _plan_local_exchange (pipeline build time) to a new FE-side planner. The FE planner mirrors BE semantics, brings several correctness fixes, and is gated by a session variable so the legacy BE path stays available as a fallback.

Core design

  • New AddLocalExchange pass runs after DistributePlanner, walking each fragment's plan tree bottom-up via the polymorphic PlanNode.enforceAndDeriveLocalExchange(). Each node declares what distribution it requires of its children; the framework inserts LocalExchangeNode where needed.
  • LocalExchangeNode represents intra-fragment data redistribution and supports PASSTHROUGH, GLOBAL/LOCAL/BUCKET HASH_SHUFFLE, BROADCAST, PASS_TO_ONE, ADAPTIVE_PASSTHROUGH, LOCAL_MERGE_SORT, NOOP.
  • Per-BE instance semantics: maxPerBeInstances (max pipeline instances assigned to any single BE) is used instead of global instance count to match BE's _num_instances check. Planning is a no-op when maxPerBeInstances == 1.
  • Serial → non-serial fan-out: when a serial operator feeds a non-serial parent without an intermediate LE, the framework inserts a PASSTHROUGH LE to restore N-task parallelism, matching BE's required_data_distribution() rule.
  • Requirement-based exchange type resolution via LocalExchangeTypeRequire: RequireHash adapts to any hash flavour, RequireSpecific preserves the exact requested type.
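
The bottom-up pass described in the list above can be sketched with a toy model. This is a minimal illustration under stated assumptions, not the PR's implementation: `Node`, `Dist`, `enforce`, and `render` are hypothetical names, and the real planner resolves requirements with far more detail (hash keys, serial operators, per-BE instance counts).

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a bottom-up local-exchange pass; all names are hypothetical. */
class AddLocalExchangeSketch {
    enum Dist { ANY, HASH, PASSTHROUGH }

    static class Node {
        final String name;
        final Dist requiredOfChildren; // what this node requires of each child
        final Dist output;             // distribution this node produces
        final List<Node> children = new ArrayList<>();

        Node(String name, Dist requiredOfChildren, Dist output, Node... kids) {
            this.name = name;
            this.requiredOfChildren = requiredOfChildren;
            this.output = output;
            for (Node k : kids) {
                children.add(k);
            }
        }
    }

    /** Rewrites the subtree bottom-up, wrapping children that miss the requirement. */
    static Node enforce(Node node) {
        for (int i = 0; i < node.children.size(); i++) {
            Node child = enforce(node.children.get(i)); // children are planned first
            boolean satisfied = node.requiredOfChildren == Dist.ANY
                    || child.output == node.requiredOfChildren;
            if (!satisfied) {
                // Insert a local exchange that produces the required distribution.
                child = new Node("LocalExchange(" + node.requiredOfChildren + ")",
                        Dist.ANY, node.requiredOfChildren, child);
            }
            node.children.set(i, child);
        }
        return node;
    }

    /** Renders the tree as Name[child, child] for easy inspection. */
    static String render(Node n) {
        StringBuilder sb = new StringBuilder(n.name);
        if (!n.children.isEmpty()) {
            sb.append('[');
            for (int i = 0; i < n.children.size(); i++) {
                if (i > 0) {
                    sb.append(", ");
                }
                sb.append(render(n.children.get(i)));
            }
            sb.append(']');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Node scan = new Node("Scan", Dist.ANY, Dist.PASSTHROUGH);
        Node agg = new Node("Agg", Dist.HASH, Dist.HASH, scan);
        System.out.println(render(enforce(agg))); // Agg[LocalExchange(HASH)[Scan]]
    }
}
```

Keeping the requirement on the parent and the produced distribution on the child means each node only describes itself; the insertion decision lives in one place in the framework.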

AggregationNode correctness fixes (DORIS-25413)

PR #62438 introduced a semantic split for required_data_distribution=HASH (correctness-required vs performance-only). BE's !_needs_finalize && !enable_local_exchange_before_agg → base early-return conflates both intents in AggSinkOperatorX and DistinctStreamingAggOperatorX, wrongly catching FIRST_MERGE (correctness) / non-streaming dedup (correctness) and producing PASSTHROUGH-over-serial-child → wrong aggregation results. The FE planner adds the missing !isMerge() / useStreamingPreagg=true guards so FIRST_MERGE and non-streaming dedup always emit HASH, regardless of the flag. Also adds requiresShuffleForCorrectness() (mirrors BE's is_shuffled_operator()) so SetOperationNode propagates the "downstream depends on hash" flag correctly through chains.

Session variables

  • enable_local_shuffle_planner: use the FE planner (default true)
  • enable_local_shuffle: master switch for local shuffle
  • enable_local_exchange_before_agg: insert a HASH LE before a non-final aggregation (default true, mirrors #62438)

Architectural notes

This PR puts the FE planner in the driver's seat for LE insertion but intentionally keeps BE-side machinery as a fallback:

  1. is_serial_operator is still computed on both sides — any future change to BE's per-operator C++ override must be mirrored in FE.
  2. Legacy BE planner (pipeline_fragment_context.cpp::_plan_local_exchange) is preserved and gated by runtime_state.h::plan_local_shuffle(); the two paths are mutually exclusive.
  3. _propagate_local_exchange_num_tasks is kept as a runtime safety net for paired-pipeline num_tasks mismatches.

Build fixes (cross-toolchain portability)

  • multi_version.h: replace atomic_load/atomic_store (deprecated in libstdc++ C++20 / LLVM 20) with std::shared_mutex-based RW locking.
  • memory.cpp: fix std::max type mismatch (long vs int64_t) on macOS.
  • bucketed_aggregation_sink_operator.h: fix ExchangeType::NOOP → TLocalPartitionType::NOOP after thrift enum rename.

Release note

Add session variable enable_local_shuffle_planner (default true) to control whether local exchange nodes are planned in FE (new path) or in BE (legacy _plan_local_exchange). The two paths are mutually exclusive; the legacy path remains intact behind this flag.

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes. Local exchange node insertion now happens at FE planning time when enable_local_shuffle_planner=true (default). Plan shapes (LOCAL_EXCHANGE_NODE in TPlanNode) and exchange counts may differ from the legacy BE-planned path, but query results remain equivalent. Setting enable_local_shuffle_planner=false restores the legacy behavior bit-for-bit.
  • Does this need documentation?

    • No.
    • Yes. Session variable enable_local_shuffle_planner should be added to the documentation; doc PR will be filed separately.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@924060929
Contributor Author

/review

@924060929 924060929 force-pushed the fe_local_shuffle_rebase3 branch from 2e22e0a to 6fa1901 Compare May 18, 2026 11:56
@924060929
Contributor Author

run buildall

Contributor

@github-actions github-actions Bot left a comment

I found two blocking issues in the FE local exchange planner path. Critical checkpoints:

  • Goal/test coverage: the PR implements FE-side local exchange planning with substantial unit/regression coverage, but the new default path is not safe for mixed FE/BE versions and has a correctness gap in serial-source gating.
  • Scope: the change is focused on local exchange planning, though it is broad and cross-module.
  • Concurrency/lifecycle: no direct new shared Java concurrency issue found; BE deferred local exchanger lifecycle appears intentionally wired before pipeline prepare.
  • Configuration/compatibility: enable_local_shuffle_planner defaults to true and sends new thrift plan nodes/types without a visible BE-version gate, which is a rolling-upgrade blocker.
  • Parallel paths: the old coordinator is forced to BE planning, but the Nereids FE path still needs the compatibility and serial-gating fixes.
  • Testing: good coverage was added, but it does not cover mixed-version execution or the non-serial-source fragment cases noted inline.
  • User focus: no additional user-provided focus points.

description = {"是否在FE规划Local Shuffle",
"Whether to plan local shuffle in frontend"}, needForward = true)
private boolean enableLocalShufflePlanner = true;

Contributor

This default enables the FE planner to serialize the new LOCAL_EXCHANGE_NODE/TLocalPartitionType protocol to every BE as soon as the FE is upgraded. During a rolling upgrade, an old BE does not have TPlanNodeType::LOCAL_EXCHANGE_NODE handling in _create_operator and will reject/fail such fragments, while RuntimeState::plan_local_shuffle() also disables the legacy BE planner because enable_local_shuffle_planner is set. Please gate this path on BE capability/version (or keep the default off until all BEs support the new node) so mixed FE/BE deployments continue to execute queries during upgrade.
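
One way to express the capability gate suggested above is sketched here. It is only an illustration under assumptions: `Backend`, `supportsLocalExchangeNode`, and `planLocalExchangeInFe` are hypothetical names, and nothing in this PR shows how Doris actually exposes BE capabilities or versions to the FE.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical capability gate; Doris may expose BE capabilities differently. */
class LocalShuffleGateSketch {
    static class Backend {
        final String host;
        final boolean supportsLocalExchangeNode; // assumed capability bit

        Backend(String host, boolean supportsLocalExchangeNode) {
            this.host = host;
            this.supportsLocalExchangeNode = supportsLocalExchangeNode;
        }
    }

    /** FE plans local exchange only if the flag is on and every BE understands the node. */
    static boolean planLocalExchangeInFe(boolean enableLocalShufflePlanner, List<Backend> backends) {
        if (!enableLocalShufflePlanner) {
            return false;
        }
        for (Backend be : backends) {
            if (!be.supportsLocalExchangeNode) {
                return false; // an old BE is present: keep the legacy BE planner
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Backend> mixed = Arrays.asList(new Backend("be1", true), new Backend("be2", false));
        System.out.println(planLocalExchangeInFe(true, mixed)); // false
    }
}
```

Because RuntimeState::plan_local_shuffle() disables the legacy BE planner whenever the session flag is set, a gate like this would only help if the effective decision, rather than the raw session variable, is what the BE receives.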

Contributor Author

Our deployment process upgrades BE before FE, so during a rolling upgrade an old BE never receives LOCAL_EXCHANGE_NODE from a new FE — the new thrift node type is only sent after all BEs in the cluster have been upgraded. Keeping the default true is intentional so we get the FE planner enabled out of the box once the rollout completes. If you have a deployment topology that flips this order, please set enable_local_shuffle_planner=false in fe.conf or via SQL and the legacy BE path stays bit-for-bit identical to the old behavior.

// 4. Layer 1: skip LE when serial operator or ancestor in same pipeline
// Equivalent to BE's need_to_local_exchange: any_of(operators[idx..end], is_serial) → skip
if (translatorContext.hasSerialAncestorInPipeline(this) || isSerialNode()) {
return childOutput;
Contributor

This skip uses isSerialNode(), even though the comment above isSerialOperatorOnBe() says that a node with isSerialNode()==true only actually runs with one BE task when fragment.useSerialSource(context) is true. For fragments where useSerialSource is false (for example ignore_storage_data_distribution=false, query cache, or NAAJ), a node such as a scalar aggregate or unpartitioned exchange can still return isSerialNode()==true, but BE will execute it with normal parallelism (is_serial_operator=false in thrift). In that case this branch skips a required LocalExchange even though BE would not consider the ancestor serial, so downstream hash/passthrough requirements can be silently dropped. The serial-ancestor propagation at line 1094 has the same issue. Please base these planner decisions on isSerialOperatorOnBe(translatorContext.getConnectContext()), not the syntactic isSerialNode(), except for the explicitly documented heavy-op/local-fragment cases.

Contributor Author

Good catch — fixed in 15d92ba. Both the Layer 1 skip and the serial-ancestor propagation in enforceRequire now use isSerialOperatorOnBe(translatorContext.getConnectContext()) instead of the raw isSerialNode().

Verified that BE's OperatorBase constructs from the Thrift is_serial_operator flag (which FE writes via isSerialOperatorOnBe, not isSerialNode) — so Pipeline::need_to_local_exchange's op->is_serial_operator() check returns false when fragment.useSerialSource(ctx) is false, even if isSerialNode() is true. The previous code would have over-skipped LocalExchange in exactly the scenarios you listed (ignore_storage_data_distribution=false, query cache, NAAJ).

Updated LocalShuffleNodeCoverageTest.testMaterializationNode and testSetOperationAndAssertNumRowsNode to reflect the corrected behavior: in the fragment-less unit-test path isSerialOperatorOnBe returns false (the fragment != null guard) so the framework no longer skips Layer 1 and inserts the required LocalExchange.

Other isSerialNode() call sites in PlanNode.java were audited and left as-is:

  • toThrift() already uses isSerialOperatorOnBe
  • hasSerialChildren() is a pure node-level tree walk used only for fragment-internal heuristics
  • createLocalExchange() heavy-op gate is already inside a fragment.useSerialSource(ctx) branch, so isSerialNode and isSerialOperatorOnBe are equivalent there
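
The difference between the two checks discussed in this thread can be sketched as follows. This is a simplified model, not the code in PlanNode.java: the real isSerialOperatorOnBe takes a ConnectContext, and the `Fragment` and `PlanNode` classes here are stand-ins that keep only the `fragment != null` guard and the useSerialSource condition mentioned above.

```java
/** Simplified model of the two serial checks; the classes are stand-ins. */
class SerialCheckSketch {
    static class Fragment {
        final boolean useSerialSource;

        Fragment(boolean useSerialSource) {
            this.useSerialSource = useSerialSource;
        }
    }

    static class PlanNode {
        final boolean serialNode; // node-level property, e.g. a scalar aggregate
        final Fragment fragment;  // null in the fragment-less unit-test path

        PlanNode(boolean serialNode, Fragment fragment) {
            this.serialNode = serialNode;
            this.fragment = fragment;
        }

        /** Syntactic check: the node kind could run serially. */
        boolean isSerialNode() {
            return serialNode;
        }

        /** Effective check: serial on BE only when the fragment uses a serial source. */
        boolean isSerialOperatorOnBe() {
            return fragment != null && fragment.useSerialSource && isSerialNode();
        }
    }

    public static void main(String[] args) {
        PlanNode scalarAgg = new PlanNode(true, new Fragment(false));
        System.out.println(scalarAgg.isSerialNode());         // true
        System.out.println(scalarAgg.isSerialOperatorOnBe()); // false
    }
}
```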

@924060929 924060929 force-pushed the fe_local_shuffle_rebase3 branch from 6fa1901 to 15d92ba Compare May 18, 2026 13:21
@924060929
Contributor Author

run buildall

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 73.93% (431/583) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Contributor

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 78.06% (1854/2375)
Line Coverage 64.52% (33325/51653)
Region Coverage 65.21% (16520/25335)
Branch Coverage 55.70% (8827/15848)

@hello-stephen
Contributor

TPC-H: Total hot run time: 31734 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 15d92ba024b265d9441145d7bd26d93590aa9c63, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17803	3914	3926	3914
q2	q3	10929	1434	801	801
q4	4816	481	340	340
q5	10658	2257	2105	2105
q6	397	175	139	139
q7	914	793	617	617
q8	9633	1832	1667	1667
q9	6923	5130	4985	4985
q10	6518	2090	1770	1770
q11	430	268	239	239
q12	668	423	298	298
q13	18212	3499	2774	2774
q14	260	253	239	239
q15	q16	823	779	721	721
q17	918	940	963	940
q18	7080	5849	6162	5849
q19	1241	1333	1062	1062
q20	503	407	258	258
q21	6216	2696	2699	2696
q22	458	385	320	320
Total cold run time: 105400 ms
Total hot run time: 31734 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4514	4779	4621	4621
q2	q3	4801	5222	4621	4621
q4	2181	2213	1441	1441
q5	4809	4669	4682	4669
q6	232	191	134	134
q7	1807	1659	1400	1400
q8	2178	1885	1875	1875
q9	7379	7481	7388	7388
q10	4474	4422	3979	3979
q11	540	377	348	348
q12	725	726	508	508
q13	2994	3423	2767	2767
q14	270	277	253	253
q15	q16	670	702	606	606
q17	1267	1240	1233	1233
q18	7261	6876	6773	6773
q19	1096	1105	1118	1105
q20	2217	2232	1921	1921
q21	5321	4605	4449	4449
q22	523	482	399	399
Total cold run time: 55259 ms
Total hot run time: 50490 ms

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 72.45% (305/421) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 63.48% (23997/37804)
Line Coverage 47.20% (247315/523994)
Region Coverage 44.16% (203214/460125)
Branch Coverage 45.44% (87932/193505)

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 63.93% (468/732) 🎉
Increment coverage report
Complete coverage report

@924060929 924060929 force-pushed the fe_local_shuffle_rebase3 branch 2 times, most recently from f87c73e to 4affd22 Compare May 19, 2026 03:40
@924060929
Contributor Author

run buildall

@924060929 924060929 force-pushed the fe_local_shuffle_rebase3 branch from 4affd22 to 287ef12 Compare May 19, 2026 06:15
@hello-stephen
Contributor

TPC-H: Total hot run time: 31488 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 4affd222d6550dd103b7829a6fced0b7d45d317a, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17594	3891	3973	3891
q2	q3	10951	1405	844	844
q4	4808	482	346	346
q5	10383	2275	2135	2135
q6	365	183	144	144
q7	936	776	645	645
q8	9783	1852	1610	1610
q9	7005	4945	4940	4940
q10	6473	2240	1807	1807
q11	428	262	242	242
q12	653	436	297	297
q13	18141	3324	2779	2779
q14	270	250	237	237
q15	q16	811	768	721	721
q17	908	941	925	925
q18	7073	5714	5921	5714
q19	1229	1310	1127	1127
q20	519	416	272	272
q21	5733	2807	2498	2498
q22	451	371	314	314
Total cold run time: 104514 ms
Total hot run time: 31488 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4640	4749	4525	4525
q2	q3	4851	5185	4650	4650
q4	2126	2222	1435	1435
q5	4774	4715	4650	4650
q6	224	166	119	119
q7	1673	1601	1396	1396
q8	2223	1908	1912	1908
q9	7411	7410	7408	7408
q10	4480	4414	3997	3997
q11	522	382	348	348
q12	713	720	507	507
q13	3056	3419	2783	2783
q14	282	275	251	251
q15	q16	678	701	610	610
q17	1267	1251	1255	1251
q18	7354	6931	6785	6785
q19	1154	1100	1150	1100
q20	2228	2221	1933	1933
q21	5311	4653	4496	4496
q22	530	453	396	396
Total cold run time: 55497 ms
Total hot run time: 50548 ms

@924060929
Contributor Author

run buildall

@924060929 924060929 force-pushed the fe_local_shuffle_rebase3 branch from 287ef12 to 2d021f7 Compare May 19, 2026 07:28
Previously, local exchange (LE) nodes were inserted exclusively by the
BE's `_plan_local_exchange` at pipeline build time.  The FE had no
visibility into which operators needed a fan-out or shuffle before
execution, making it impossible to validate, optimize, or override LE
decisions at planning time.

This PR introduces a full FE-side local exchange planner that mirrors
BE semantics, brings several correctness fixes, and leaves the legacy
BE path fully intact behind a feature flag.  See "Current architecture
notes" at the bottom for what the FE planner does and does not own.

A new `AddLocalExchange` pass runs after normal fragment assignment.
It walks each fragment's plan tree bottom-up, calling the polymorphic
`PlanNode.enforceAndDeriveLocalExchange()` on every node.  Nodes
declare what distribution they require of their children; the framework
inserts `LocalExchangeNode` where needed.

`LocalExchangeNode` represents intra-fragment data redistribution and
supports the full set of exchange types: PASSTHROUGH, HASH_SHUFFLE,
BUCKET_HASH_SHUFFLE, GLOBAL_EXECUTION_HASH_SHUFFLE, BROADCAST,
PASS_TO_ONE, ADAPTIVE_PASSTHROUGH, LOCAL_MERGE_SORT, and NOOP.

The pass is guarded by `enable_local_shuffle_planner` (default true).
When disabled, BE continues to run its own `_plan_local_exchange` as
before, keeping the old path fully intact.

`maxPerBeInstances` (max pipeline instances assigned to any single BE)
is used instead of a global `instanceCount`.  Planning is a no-op when
`maxPerBeInstances == 1` — inserting LE on a single-threaded pipeline
would cause task-count mismatches and pipeline starvation.
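
A small sketch of why the per-BE maximum differs from the global count. The map of BE host to instance count and the method names are assumptions made for illustration; only the rule itself (use the per-BE maximum, skip planning when it is 1) comes from the text above.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative computation of maxPerBeInstances; the input map is an assumption. */
class MaxPerBeInstancesSketch {
    /** Largest number of pipeline instances assigned to any single BE. */
    static int maxPerBeInstances(Map<String, Integer> instancesPerBe) {
        int max = 0;
        for (int n : instancesPerBe.values()) {
            max = Math.max(max, n);
        }
        return max;
    }

    /** Local exchange planning is skipped unless some BE runs more than one instance. */
    static boolean shouldPlanLocalExchange(Map<String, Integer> instancesPerBe) {
        return maxPerBeInstances(instancesPerBe) > 1;
    }

    public static void main(String[] args) {
        Map<String, Integer> perBe = new HashMap<>();
        perBe.put("be1", 1);
        perBe.put("be2", 1);
        perBe.put("be3", 1);
        // Global instance count is 3, yet every BE runs a single instance.
        System.out.println(shouldPlanLocalExchange(perBe)); // false
    }
}
```

With three BEs running one instance each, a check on the global count would plan local exchanges that no BE can use, which is the mismatch the per-BE rule avoids.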

When a serial operator (e.g. OlapScanNode with a single tablet bucket)
feeds a non-serial parent without an intermediate LE, downstream tasks
starve waiting for data that never arrives.  The framework detects this
case and inserts a PASSTHROUGH LE to restore N-task parallelism, exactly
matching BE's `required_data_distribution()` serial → PASSTHROUGH rule.

`LocalExchangeTypeRequire` abstracts two strategies:
- `RequireHash` — always resolves to `LOCAL_EXECUTION_HASH_SHUFFLE`
  (safe for intra-fragment hash partitioning).
- `RequireSpecific` — preserves BUCKET_HASH_SHUFFLE /
  GLOBAL_EXECUTION_HASH_SHUFFLE without degradation.
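
The two strategies can be sketched as a small interface. This is an interpretation, not the PR's class: the `satisfiedBy` and `resolve` method names are hypothetical, hash keys are ignored, and treating any hash flavour as satisfying `RequireHash` is an assumption.

```java
/** Sketch of requirement-based exchange type resolution; method names are hypothetical. */
class LocalExchangeTypeRequireSketch {
    enum ExchangeType {
        PASSTHROUGH,
        LOCAL_EXECUTION_HASH_SHUFFLE,
        GLOBAL_EXECUTION_HASH_SHUFFLE,
        BUCKET_HASH_SHUFFLE;

        boolean isHashShuffle() {
            return this != PASSTHROUGH;
        }
    }

    interface Require {
        /** True if a child already distributed as {@code existing} needs no new exchange. */
        boolean satisfiedBy(ExchangeType existing);

        /** Exchange type to insert when the requirement is not yet satisfied. */
        ExchangeType resolve();
    }

    static final class RequireHash implements Require {
        public boolean satisfiedBy(ExchangeType existing) {
            return existing.isHashShuffle(); // assumption: any hash flavour is acceptable
        }

        public ExchangeType resolve() {
            return ExchangeType.LOCAL_EXECUTION_HASH_SHUFFLE;
        }
    }

    static final class RequireSpecific implements Require {
        private final ExchangeType type;

        RequireSpecific(ExchangeType type) {
            this.type = type;
        }

        public boolean satisfiedBy(ExchangeType existing) {
            return existing == type; // exact match only
        }

        public ExchangeType resolve() {
            return type; // the requested type is preserved as-is
        }
    }

    public static void main(String[] args) {
        System.out.println(new RequireHash().resolve());
        System.out.println(new RequireSpecific(ExchangeType.BUCKET_HASH_SHUFFLE).resolve());
    }
}
```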

PR #62438 added `enable_local_exchange_before_agg`, but its BE guard
`!_needs_finalize && !enable_local_exchange_before_agg → base` conflated
two semantically different cases in AggSink and DistinctStreamingAgg:

- **AggSink**: `!finalize && hasKeys` covered both LOCAL preagg
  (performance-only) and FIRST_MERGE dedup (correctness-critical).
  The flag-gated early-return wrongly skipped HASH for FIRST_MERGE,
  producing PASSTHROUGH-over-serial-child → wrong aggregation results.

- **DistinctStreamingAgg**: `!finalize` covered both streaming preagg
  (`useStreamingPreagg=true`, performance) and non-streaming dedup
  (`useStreamingPreagg=false`, correctness).  Same class of bug.

FE fix:
- AggSink: restrict the flag-gated base path to `!isMerge()` LOCAL
  phases.  FIRST_MERGE always emits HASH regardless of the flag.
- DistinctStreamingAgg: restrict to `useStreamingPreagg=true`.
  Non-streaming dedup always emits HASH.
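
The guard change can be written out as boolean predicates. This is a sketch of the conditions as described above, with hypothetical method names; other inputs the real code may consult (for example whether the aggregation has keys) are deliberately left out.

```java
/** Boolean sketch of the HASH requirement before and after the fix; names are hypothetical. */
class AggLocalExchangeSketch {
    /** Legacy guard: every non-finalizing phase takes the base path when the flag is off. */
    static boolean legacyRequiresHash(boolean needsFinalize, boolean enableLocalExchangeBeforeAgg) {
        return needsFinalize || enableLocalExchangeBeforeAgg;
    }

    /** AggSink after the fix: only non-merge (LOCAL) phases may skip HASH via the flag. */
    static boolean aggSinkRequiresHash(boolean needsFinalize, boolean isMerge,
            boolean enableLocalExchangeBeforeAgg) {
        return needsFinalize || isMerge || enableLocalExchangeBeforeAgg;
    }

    /** DistinctStreamingAgg after the fix: only streaming preagg may skip HASH via the flag. */
    static boolean distinctStreamingRequiresHash(boolean needsFinalize, boolean useStreamingPreagg,
            boolean enableLocalExchangeBeforeAgg) {
        return needsFinalize || !useStreamingPreagg || enableLocalExchangeBeforeAgg;
    }

    public static void main(String[] args) {
        // FIRST_MERGE with the flag off: not finalizing, but merging.
        System.out.println(legacyRequiresHash(false, false));         // false
        System.out.println(aggSinkRequiresHash(false, true, false));  // true
    }
}
```

The FIRST_MERGE case in `main` is the one the legacy guard got wrong: it falls through to the base path, while the fixed predicate still demands HASH.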

Also add `requiresShuffleForCorrectness()` to mirror BE's
`is_shuffled_operator()`, so SetOperationNode propagates the
"downstream depends on hash" flag correctly instead of using the
coarser `parentRequire.preferType().isHashShuffle()` check that
over-inserted HASH LE on every union branch under a streaming preagg.

These fixes reduce FE/BE consistency mismatches from 8 to 3
(only pre-existing NLJ optimization differences remain).

- `enable_local_shuffle_planner` — use FE planner (default true)
- `enable_local_shuffle` — master switch for local shuffle
- `enable_local_exchange_before_agg` — HASH LE before non-final agg
  (default true, mirrors #62438)

`validateNoSerialWithoutLocalExchange()` walks the final plan tree and
logs a warning whenever a serial operator feeds a non-serial parent
without an intermediate LocalExchangeNode, catching planning gaps
before execution.
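
A minimal sketch of such a validation walk, assuming a toy `Node` type with `serial` and `localExchange` flags. Unlike the real check, which logs warnings, this version collects them in a list so they can be inspected; all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy validation walk; the real check logs warnings instead of returning them. */
class SerialValidationSketch {
    static class Node {
        final String name;
        final boolean serial;
        final boolean localExchange;
        final List<Node> children;

        Node(String name, boolean serial, boolean localExchange, Node... children) {
            this.name = name;
            this.serial = serial;
            this.localExchange = localExchange;
            this.children = Arrays.asList(children);
        }
    }

    /** Returns one warning per serial child that feeds a non-serial, non-exchange parent. */
    static List<String> validate(Node root) {
        List<String> warnings = new ArrayList<>();
        collect(root, warnings);
        return warnings;
    }

    private static void collect(Node parent, List<String> warnings) {
        for (Node child : parent.children) {
            boolean gap = child.serial && !parent.serial && !parent.localExchange;
            if (gap) {
                warnings.add("serial " + child.name + " feeds non-serial " + parent.name
                        + " without a LocalExchangeNode");
            }
            collect(child, warnings);
        }
    }

    public static void main(String[] args) {
        Node bad = new Node("Agg", false, false, new Node("Scan", true, false));
        Node good = new Node("Agg", false, false,
                new Node("LocalExchange", false, true, new Node("Scan", true, false)));
        System.out.println(validate(bad).size());  // 1
        System.out.println(validate(good).size()); // 0
    }
}
```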

- `test_enable_local_exchange_before_agg.groovy` — 10 agg patterns
  with the flag on and off; covers the FIRST_MERGE and
  DistinctStreamingAgg correctness fixes.
- `test_local_shuffle_fe_be_consistency.groovy` — runs the same SQL
  with `enable_local_shuffle_planner=true` and `=false` across the
  full operator matrix (Agg, Sort, Analytic, HashJoin, NLJ, Set, Union,
  TableFunction, AssertNumRows, RQG-derived corner cases) and asserts
  result rows are identical.  Only data correctness is asserted — the
  two planners legitimately differ on the exact exchange counts/types
  they emit, so plan-shape equality is intentionally not checked.
- `test_local_shuffle_rqg_bugs.groovy` — reproduces 20+ RQG-found
  crashes and wrong-result cases.
- `test_old_coordinator_local_shuffle.groovy` — verifies the old
  coordinator path is unaffected.
- `test_multilevel_join_agg_local_shuffle.groovy` — multi-level join
  and aggregation plan shapes.

- `multi_version.h`: replace `atomic_load/atomic_store` (deprecated in
  libstdc++ C++20 / LLVM 20) with `std::shared_mutex`-based RW locking.
- `memory.cpp`: fix `std::max` type mismatch (`long` vs `int64_t`)
  on macOS.
- `bucketed_aggregation_sink_operator.h`: fix `ExchangeType::NOOP` →
  `TLocalPartitionType::NOOP` after thrift enum rename.

Current architecture notes

This PR puts the FE planner in the driver's seat for LE insertion but
intentionally does NOT remove the BE-side machinery — readers should be
aware of three pieces the FE planner shares with or defers to BE:

1. **`is_serial_operator` is computed on both sides.**  FE computes the
   flag and writes it into Thrift, but BE's
   `OperatorBase::is_serial_operator()` is still overridden per operator
   in C++ and used for BE-side runtime decisions.  Any future change to
   the BE override needs to be mirrored on the FE side (and vice versa)
   to keep the planner's view consistent with execution.

2. **The legacy BE planner stays as a fallback.**
   `pipeline_fragment_context.cpp::_plan_local_exchange` is preserved
   and gated by `runtime_state.h::plan_local_shuffle()`: when
   `enable_local_shuffle_planner=false`, BE plans LE itself, exactly as
   before.  The two paths are mutually exclusive, never both running on
   the same query.

3. **`_propagate_local_exchange_num_tasks` is kept as a runtime safety
   net.**  The two propagation passes in
   `pipeline_fragment_context.cpp` fix up paired pipelines whose
   `num_tasks` end up mismatched (e.g. when AGG/SORT/JOIN pipeline
   splits leave a serial Exchange feeding an N-task sink).  FE's
   framework-level serial→non-serial fan-out (`enforceRequire` step 3)
   and the `validateNoSerialWithoutLocalExchange` check aim to make
   these mismatches impossible by construction, but the BE-side fixup
   remains as a defensive guard.

Co-authored-by: Gabriel <liwenqiang@selectdb.com>
@924060929 924060929 force-pushed the fe_local_shuffle_rebase3 branch from 2d021f7 to 9016e5a Compare May 19, 2026 08:06
@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 11.11% (46/414) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.56% (20733/38712)
Line Coverage 37.18% (196066/527339)
Region Coverage 33.52% (153723/458547)
Branch Coverage 34.55% (66975/193832)

… DSL

- Delete duplicate testAggFromScanUsesLocalExecutionHashShuffle
- Rewrite 8 substring-based tests as DSL shape assertions
- Add testUnionAllScanAndValues (Tier B from Trino)
- Add assertNoLocalExchangeOfType helper for negative checks
- Add nestedLoopJoin/partitionSort/olapScan() factories to PlanShape(Dsl)
@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 11.11% (46/414) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.59% (20747/38712)
Line Coverage 37.21% (196215/527345)
Region Coverage 33.52% (153718/458547)
Branch Coverage 34.57% (67007/193832)

…_multilevel_join_agg_local_shuffle

The local-shuffle FE planner inserts LocalExchange nodes after the Nereids
physical plan stage, so `explain shape plan` output is independent of
enable_local_shuffle_planner.  When the on/off variants did differ, the
diff was always driven by stats-sensitive rewrites (e.g. cost-based
InferSetOperatorDistinct), not by the planner mode itself — meaning the
shape check was effectively asserting stats stability across environments,
which it cannot guarantee.

Keep result-equality (qt_*_result_on / _result_off) and check_sql_equal
between planner modes; drop the shape assertions and the 72 _shape_on
blocks from the .out file.
@924060929
Contributor Author

run buildall

@hello-stephen
Contributor

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 78.06% (1854/2375)
Line Coverage 64.52% (33334/51662)
Region Coverage 65.20% (16523/25341)
Branch Coverage 55.75% (8838/15852)

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 11.11% (46/414) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.56% (20734/38712)
Line Coverage 37.19% (196094/527345)
Region Coverage 33.52% (153693/458547)
Branch Coverage 34.56% (66979/193832)

@hello-stephen
Contributor

TPC-H: Total hot run time: 31328 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 9016e5a105f84231f1a535be233aaf3e595c711d, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17620	4009	3866	3866
q2	q3	10989	1467	845	845
q4	4810	476	354	354
q5	10550	2299	2104	2104
q6	382	181	139	139
q7	1020	761	642	642
q8	9603	1863	1716	1716
q9	6948	5174	4937	4937
q10	6492	2085	1794	1794
q11	428	269	245	245
q12	647	422	295	295
q13	18188	3370	2776	2776
q14	268	259	238	238
q15	q16	823	778	708	708
q17	931	931	900	900
q18	6829	5674	5604	5604
q19	1154	1225	1129	1129
q20	504	412	262	262
q21	5488	2532	2467	2467
q22	434	374	307	307
Total cold run time: 104108 ms
Total hot run time: 31328 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4246	4221	4168	4168
q2	q3	4537	4942	4338	4338
q4	2110	2242	1392	1392
q5	4393	4312	5249	4312
q6	256	194	144	144
q7	1913	1816	1568	1568
q8	2442	2154	2075	2075
q9	7958	7888	7894	7888
q10	4565	4579	4065	4065
q11	565	428	374	374
q12	738	735	518	518
q13	3221	3668	2992	2992
q14	290	298	281	281
q15	q16	722	771	666	666
q17	1355	1339	1339	1339
q18	7802	7376	6840	6840
q19	1155	1076	1078	1076
q20	2217	2227	1956	1956
q21	5329	4679	4490	4490
q22	533	449	416	416
Total cold run time: 56347 ms
Total hot run time: 50898 ms

…local-shuffle planner

FE-side local-shuffle planner wraps a serial fragment root with a PASSTHROUGH
LocalExchangeNode (AddLocalExchange#addLocalExchangeForFragment) so the data
sink can fan out across pipeline tasks.  That replaces `fragment.getPlanRoot()`
with a LocalExchangeNode wherever the original root was a serial FileScanNode.

InsertIntoTableCommand#applyInsertPlanStatistic was using
`fragment.getPlanRoot() instanceof FileScanNode` to find load-source scans, so
after the LE wrap the instanceof check fails, addLoadFileInfo is never called,
and LoadStatistic.fileNum / totalFileSizeB stay 0 even though the BE-side
scannedRows / loadBytes counters work normally.

Symptom: job_p0.streaming_job.test_streaming_insert_job fails with
  loadStat.fileNumber == 0 (expected 2) and loadStat.fileSize == 0 (expected 256)
while scannedRows / loadBytes are correct.

Fix: peel any LocalExchangeNode wrappers off the fragment root before the
instanceof check, then extract fileNum / totalFileSize from the underlying
FileScanNode as before.
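
The peeling step can be sketched as a short loop. The classes below are stand-ins for the planner's PlanNode, LocalExchangeNode, and FileScanNode, and `peelLocalExchange` is a hypothetical helper name; the file count and size are illustrative values.

```java
import java.util.Arrays;
import java.util.List;

/** Stand-in classes showing how LocalExchangeNode wrappers can be peeled off a root. */
class PeelLocalExchangeSketch {
    static class PlanNode {
        private final List<PlanNode> children;

        PlanNode(PlanNode... children) {
            this.children = Arrays.asList(children);
        }

        PlanNode getChild(int i) {
            return children.get(i);
        }
    }

    static class LocalExchangeNode extends PlanNode {
        LocalExchangeNode(PlanNode child) {
            super(child);
        }
    }

    static class FileScanNode extends PlanNode {
        final int fileNum;
        final long totalFileSize;

        FileScanNode(int fileNum, long totalFileSize) {
            super();
            this.fileNum = fileNum;
            this.totalFileSize = totalFileSize;
        }
    }

    /** Skips any chain of LocalExchangeNode wrappers and returns the node underneath. */
    static PlanNode peelLocalExchange(PlanNode root) {
        PlanNode node = root;
        while (node instanceof LocalExchangeNode) {
            node = node.getChild(0);
        }
        return node;
    }

    public static void main(String[] args) {
        PlanNode root = new LocalExchangeNode(new FileScanNode(2, 256L));
        PlanNode peeled = peelLocalExchange(root);
        if (peeled instanceof FileScanNode) {
            FileScanNode scan = (FileScanNode) peeled;
            System.out.println(scan.fileNum + " files, " + scan.totalFileSize + " bytes");
        }
    }
}
```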

Verified locally: INSERT INTO t SELECT * FROM LOCAL(...) shows FileNumber=1
FileSize=12 with the fix, FileNumber=0 FileSize=0 without it.
…rk it serial (DORIS-25865)

The FE local-shuffle planner used to insert a LocalExchangeNode directly
under RecursiveCteNode, which broke two RecursiveCte invariants:

1. ThriftPlansBuilder locates the recursive sender fragment via
   `recursiveCteNode.getChild(1).getChild(0).getFragment()`.  A wrapper LE
   between RecCte and the cross-fragment ExchangeNode shifts that path off
   the receiver and pulls the recursive producer fragment into
   `fragmentsToReset`, so BE rejects with
     [INTERNAL_ERROR]Fragment N contains a recursive CTE node
   from RecCTESourceOperatorX::prepare().

2. BE's RecCTESourceOperatorX::is_serial_operator() always returns true.
   RecursiveCteNode#isSerialNode() on the FE side defaulted to false, so
   the planner left the producer fragment with parallel=N sender pipelines
   even though only one instance actually emits data.  The downstream
   cross-fragment Exchange then waits forever on the N-1 silent senders.

Fix in RecursiveCteNode:
  - override isSerialNode() to return true so addLocalExchangeForFragment
    wraps the fragment root with PASSTHROUGH LE and fans the single
    producer out to N parallel sinks (mirrors BE-native behaviour);
  - override enforceAndDeriveLocalExchange to call children's own
    enforceAndDeriveLocalExchange directly, bypassing the framework's
    enforceRequire so no LE gets inserted between RecCte and its
    cross-fragment Exchange children — children's subtrees still get LE
    planning as normal.

Add regression test test_local_shuffle_recursive_cte covering the three
downstream consumer shapes the JIRA listed plus join / negative control:
  rec_cte_agg, rec_cte_window, rec_cte_grouping_sets, rec_cte_select,
  rec_cte_join.  Each is asserted to produce identical rows under
  enable_local_shuffle_planner=true vs =false.
@924060929
Contributor Author

run buildall

Two regression checks were comparing actual rows against .out in a
specific order even though the SQL was order-insensitive at that point.
The FE local-shuffle planner can legitimately change row delivery order
within a fragment, exposing the latent assumption and failing the test.

- unnest_order_by_list_test.groovy: qt_window_function_order_by_unnested_value
  has a window RANK() over UNNEST without an outer ORDER BY.  Switch to
  order_qt_* (framework sorts the actual rows before comparing) and re-sort
  the corresponding .out block so the expected side is also in sorted order.

- test_python_udaf_complex.groovy: qt_json_array_agg → order_qt_json_array_agg.
  The query's GROUP BY category already returns rows in alphabetical category
  order, so no .out change is needed.  Note: this only stabilises row order;
  the python UDAF's per-group array contents still depend on row arrival
  order inside each group, so a stricter pin (ORDER BY id in a subquery or
  array_sort around the agg) would still be needed if that variability
  resurfaces.
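
The difference between the two comparison styles, and the caveat about per-group array contents, can be sketched as follows. This is a Python illustration, not the regression framework's implementation: `strict_match` and `sorted_match` are hypothetical helpers and the rows are made up.

```python
def strict_match(actual_rows, expected_rows):
    """Order-sensitive comparison (the qt_* style described above)."""
    return list(actual_rows) == list(expected_rows)


def sorted_match(actual_rows, expected_rows):
    """Sort the actual rows first (the order_qt_* style described above).

    The expected rows must already be stored in sorted order, which is why
    the .out block has to be re-sorted when a check is switched over.
    """
    return sorted(actual_rows) == list(expected_rows)


expected = [("a", 1), ("b", 2), ("c", 3)]   # expected side, in sorted order
delivered = [("c", 3), ("a", 1), ("b", 2)]  # same rows, different arrival order

strict_ok = strict_match(delivered, expected)
sorted_ok = sorted_match(delivered, expected)

# Sorting rows does not fix order inside an aggregated value: an array built
# in arrival order still differs unless its elements are sorted as well.
agg_run_1 = [("fruit", [1, 2, 3])]
agg_run_2 = [("fruit", [3, 1, 2])]
rows_still_differ = sorted(agg_run_1) != sorted(agg_run_2)
pinned_equal = (
    [(k, sorted(v)) for k, v in agg_run_1]
    == [(k, sorted(v)) for k, v in agg_run_2]
)
```

The last two values show why switching to a sorted comparison is enough for the window query but may not be enough for an array-producing aggregate.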
@924060929
Contributor Author

run buildall

@hello-stephen
Contributor

TPC-H: Total hot run time: 31839 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 4e05e7bfbb8aace37eb65198a1530e9325047d34, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17008	4093	4055	4055
q2
q3	10494	1424	891	891
q4	4685	481	365	365
q5	7816	2357	2142	2142
q6	267	191	146	146
q7	1026	794	642	642
q8	9380	1750	1655	1655
q9	5537	5094	5054	5054
q10	6382	2144	1768	1768
q11	429	268	244	244
q12	638	431	315	315
q13	18180	3449	2784	2784
q14	268	257	237	237
q15
q16	825	813	716	716
q17	1009	982	1006	982
q18	7125	5794	5761	5761
q19	1179	1232	1085	1085
q20	519	422	274	274
q21	5609	2650	2411	2411
q22	446	375	312	312
Total cold run time: 98822 ms
Total hot run time: 31839 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4361	4345	4290	4290
q2
q3	4528	4949	4331	4331
q4	2184	2261	1419	1419
q5	4568	4406	4368	4368
q6	311	331	177	177
q7	2301	1961	1738	1738
q8	2704	2361	2325	2325
q9	8320	7948	8070	7948
q10	4572	4454	4060	4060
q11	610	441	418	418
q12	751	745	553	553
q13	3315	3705	2988	2988
q14	292	313	297	297
q15
q16	753	755	686	686
q17	1422	1440	1554	1440
q18	8205	7423	7590	7423
q19	1195	1153	1091	1091
q20	2232	2272	1928	1928
q21	5465	4899	4652	4652
q22	549	483	401	401
Total cold run time: 58638 ms
Total hot run time: 52533 ms

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 80.85% (477/590) 🎉
Increment coverage report
Complete coverage report

…E-planned LE

Two correctness issues in the FE-planned local-shuffle path, both surfaced
by single-tablet POOLING / share-scan fragments.

1. FE planner inserted LE(LOCAL_HASH) below a streaming partial agg with
   distributeExprLists = child table distribution (e.g. [id]) instead of
   grouping_exprs (e.g. [category]).  BE's AggSinkOperatorX /
   StreamingAggOperatorX::update_operator picks _partition_exprs =
   grouping_exprs when the chain is not followed_by_shuffled_operator —
   the common case for a streaming preagg at fragment root with only a
   cross-fragment HASH ExchangeSink above.  Using child distribution
   scattered same-group rows across N partial-agg instances, turning the
   preagg into a no-op and breaking row-arrival order at the downstream
   merge-finalize (manifests as non-deterministic group_concat /
   py_json_array_agg output, e.g. test_python_udaf_complex json_array_agg).

   Fix: add overridable PlanNode#getLocalExchangeDistributeExprs(childIndex,
   followedByShuffled) defaulting to the child's distribution, and override
   it on AggregationNode to mirror BE's update_operator: use child
   distribution only when (followedByShuffled || hasDistinct); otherwise
   use grouping_exprs.

2. BE _create_deferred_local_exchangers used sender_count =
   upstream_pipe->num_tasks() with no max(_, _num_instances).  When the
   upstream pipeline has a serial source (POOLING OlapScan, serial
   Exchange), num_tasks() stays at 1 and _propagate_local_exchange_num_tasks
   Pass 1 deliberately does not raise it.  The exchanger, however, is
   shared across all _num_instances fragment instances on this BE, and each
   instance closes once, so the total close count is _num_instances.
   Starting from 1 and applying _num_instances closes drove
   _running_sink_operators negative (e.g. -5 for 6 instances, -15 for 16),
   so the exchanger never reached "all senders done", downstream sources
   blocked on SHUFFLE_DATA_DEPENDENCY forever, and the query hung.
   Fragments hold block references through the hang, so on BE shutdown
   mem_tracker_limiter::~MemTrackerLimiter fired FATAL, aborting BE and
   producing the build-948971 "stop grace fail".  The root cause there was
   dictionary_p0.test_dict_load_and_get_ip_trie, whose refresh dictionary
   runs scan + LE(PASSTHROUGH) + cross-fragment DICTIONARY_SINK.

   Fix: mirror BE-planned _add_local_exchange_impl (~line 1023) which uses
   std::max(cur_pipe->num_tasks(), _num_instances).
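
The counter arithmetic in item 2 can be reproduced with a small Python sketch. `remaining_senders` is a hypothetical helper, not a BE function. It models only the start value and the number of closes quoted above, not how BE detects "all senders done".

```python
def remaining_senders(upstream_num_tasks: int, num_instances: int,
                      use_max: bool) -> int:
    """Toy model of the running-sender counter after every instance has closed.

    The counter starts at sender_count, and each of the num_instances
    fragment instances closes the shared exchanger exactly once.
    """
    if use_max:
        # Fixed behaviour: never start below the number of closers.
        sender_count = max(upstream_num_tasks, num_instances)
    else:
        # Pre-fix behaviour: a serial upstream leaves this at 1.
        sender_count = upstream_num_tasks
    return sender_count - num_instances


# Serial upstream pipeline (num_tasks == 1), as in the failing case.
before_fix_6 = remaining_senders(1, 6, use_max=False)
before_fix_16 = remaining_senders(1, 16, use_max=False)
after_fix_6 = remaining_senders(1, 6, use_max=True)
after_fix_16 = remaining_senders(1, 16, use_max=True)
```

With the max applied, the counter for a serial upstream ends at zero instead of a negative value.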

Tests
- New LocalExchangePlannerTest#testStreamingAggHashShuffleUsesGroupingExprs:
  with t1 DISTRIBUTED BY HASH(k1) and SELECT k2, count(*) GROUP BY k2
  (k2 non-bucket -> two-phase agg), asserts the LE below the streaming
  partial agg carries [k2] (grouping key) not [k1] (child distribution).
  Verified failing pre-fix, passing post-fix.  Whole class (26 tests) green.
- Local cluster (output_test, 29030): group_concat probe stable 1,2,3,4,5
  across 20 runs after both fixes; matches BE-planner=false output.
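
The key-selection rule from item 1, and the scattering it prevents, can be sketched in Python. Both functions are hypothetical stand-ins: the first paraphrases the rule stated above rather than the body of PlanNode#getLocalExchangeDistributeExprs, and the second uses `sum(key) % num_instances` as a deterministic substitute for a real hash function.

```python
def local_exchange_distribute_exprs(child_distribution, grouping_exprs,
                                    followed_by_shuffled, has_distinct):
    """Pick the hash keys for the LE below an aggregation (toy version)."""
    if followed_by_shuffled or has_distinct:
        return list(child_distribution)
    return list(grouping_exprs)


def instances_per_group(rows, key_columns, group_column, num_instances):
    """Route rows by key_columns and report which instances each group reaches."""
    placement = {}
    for row in rows:
        key = tuple(row[c] for c in key_columns)
        instance = sum(key) % num_instances  # stand-in for a hash function
        placement.setdefault(row[group_column], set()).add(instance)
    return placement


# Table bucketed on k1, query grouping on k2 (two distinct groups).
rows = [{"k1": k1, "k2": k1 % 2} for k1 in range(8)]

exprs = local_exchange_distribute_exprs(
    child_distribution=["k1"], grouping_exprs=["k2"],
    followed_by_shuffled=False, has_distinct=False)

by_child_distribution = instances_per_group(rows, ["k1"], "k2", num_instances=4)
by_grouping_key = instances_per_group(rows, exprs, "k2", num_instances=4)
```

Routing on k1 spreads each k2 group over several instances, while routing on the selected key keeps every group on one instance, which is the property the new unit test checks at the plan level.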
@924060929
Contributor Author

run buildall

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 74.54% (448/601) 🎉
Increment coverage report
Complete coverage report
