[improvement](fe) Avoid two-phase agg for single instance#63732
[improvement](fe) Avoid two-phase agg for single instance#63732BiteTheDDDDt wants to merge 12 commits into
Conversation
### What problem does this PR solve?
Issue Number: None
Related PR: None
Problem Summary: In a single-BE single-instance execution, non-distinct aggregation does not benefit from splitting into local and global phases. The split can add an unnecessary hash exchange and extra aggregate operator for high-cardinality group-by queries. This change detects the single execution instance case during Nereids non-distinct aggregate implementation and only generates the one-phase aggregate candidate. It also lets the global aggregate request ANY child distribution in that case so the optimizer does not add a redundant exchange.
### Release note
Optimize aggregation planning for single-BE single-instance execution by avoiding unnecessary local/global aggregate split.
### Check List (For Author)
- Test: FE Unit Test
- ./run-fe-ut.sh --run org.apache.doris.nereids.rules.implementation.BucketedAggregateTest,org.apache.doris.nereids.properties.RequestPropertyDeriverTest
- cd fe && mvn checkstyle:check -pl fe-core
- Behavior changed: Yes. Non-distinct aggregate in single-BE single-instance execution uses a one-phase plan in auto agg phase.
- Does this need documentation: No
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
### What problem does this PR solve?
Issue Number: None
Related PR: None
Problem Summary: Temporarily force parallel_pipeline_task_num to 1 and disable fuzzy randomization for this variable so pipeline tests can deterministically expose plan-shape cases affected by single-instance aggregation planning. This commit is intended to be reverted after collecting affected cases.
### Release note
None
### Check List (For Author)
- Test: FE style check
- cd fe && mvn checkstyle:check -pl fe-core
- Behavior changed: Yes. Temporarily defaults parallel_pipeline_task_num to 1 and excludes it from fuzzy session variables.
- Does this need documentation: No
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
run buildall |
There was a problem hiding this comment.
Pull request overview
Optimize Nereids planning for single-BE single-instance execution by suppressing two-phase and bucketed aggregate candidates for non-distinct aggregates and by relaxing the global aggregate's child distribution request to ANY so no exchange enforcer is added. Also changes the default of parallelPipelineTaskNum from 0 (auto) to 1 and removes its fuzzy randomization.
Changes:
- Add
AggregateUtils.isSingleExecutionInstanceand use it inSplitAggWithoutDistinct(skip 2-phase/bucketed candidates) andRequestPropertyDeriver(requestANYfor global agg child). - Change
parallelPipelineTaskNumdefault from0to1and drop its fuzzy randomization ininitFuzzyModeVariables. - Update
BucketedAggregateTestandRequestPropertyDeriverTestto setparallelPipelineTaskNum/beNumberForTestexplicitly and add new single-instance coverage.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| fe/fe-core/src/main/java/org/apache/doris/nereids/util/AggregateUtils.java | New helper detecting one-BE, one-parallel-instance execution. |
| fe/fe-core/src/main/java/org/apache/doris/nereids/rules/implementation/SplitAggWithoutDistinct.java | Skip 2-phase/bucketed candidates in auto mode when single instance. |
| fe/fe-core/src/main/java/org/apache/doris/nereids/properties/RequestPropertyDeriver.java | Request ANY for global hash-agg child when single instance. |
| fe/fe-core/src/main/java/org/apache/doris/qe/SessionVariable.java | Change parallelPipelineTaskNum default to 1 and disable fuzzy randomization. |
| fe/fe-core/src/test/java/org/apache/doris/nereids/rules/implementation/BucketedAggregateTest.java | Set parallel instance explicitly; add single-instance one-phase test. |
| fe/fe-core/src/test/java/org/apache/doris/nereids/properties/RequestPropertyDeriverTest.java | Add single-instance global agg test; use proper ConnectContext with beNumberForTest=3. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| @VarAttrDef.VarAttr(name = PARALLEL_PIPELINE_TASK_NUM, fuzzy = false, needForward = true, | ||
| setter = "setPipelineTaskNum") | ||
| public int parallelPipelineTaskNum = 0; | ||
| public int parallelPipelineTaskNum = 1; |
| this.feDebug = true; | ||
| this.enableConditionCache = Config.pull_request_id % 2 == 0; | ||
| this.parallelPipelineTaskNum = random.nextInt(8); | ||
| this.parallelPrepareThreshold = random.nextInt(32) + 1; |
| int beNumber = connectContext.getSessionVariable().getBeNumberForTest(); | ||
| if (beNumber < 0) { | ||
| beNumber = connectContext.getEnv().getClusterInfo().getAllBackendByCurrentCluster(true).size(); | ||
| } | ||
| beNumber = Math.max(1, beNumber); | ||
| String clusterName = connectContext.getSessionVariable().resolveCloudClusterName(connectContext); | ||
| int parallelInstance = Math.max(1, connectContext.getSessionVariable().getParallelExecInstanceNum(clusterName)); | ||
| return beNumber == 1 && parallelInstance == 1; |
### What problem does this PR solve?
Issue Number: None
Related PR: None
Problem Summary: Some plan-shape tests rely on multi-instance aggregate planning. When parallel_pipeline_task_num is forced to 1 for pipeline validation, those tests can choose the single-instance one-phase aggregate path and fail their old plan assertions. Pin these tests to parallel_pipeline_task_num=2 so they continue covering the existing multi-instance plan shapes deterministically.
### Release note
None
### Check List (For Author)
- Test: FE Unit Test
- ./run-fe-ut.sh --run org.apache.doris.nereids.rules.rewrite.SplitMultiDistinctTest,org.apache.doris.nereids.rules.rewrite.AggregateStrategiesTest,org.apache.doris.nereids.rules.rewrite.AggregateUnionPlanTest,org.apache.doris.nereids.processor.post.ShuffleKeyPrunerTest,org.apache.doris.planner.StatisticDeriveTest,org.apache.doris.qe.OlapQueryCacheTest,org.apache.doris.nereids.glue.translator.PhysicalPlanTranslatorTest,org.apache.doris.nereids.rules.exploration.mv.PartitionColumnTraceTest
- cd fe && mvn checkstyle:check -pl fe-core
- Behavior changed: No. Test-only fixed session variables.
- Does this need documentation: No
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: The single-instance aggregate optimization changed global aggregate child request properties to ANY when only one BE and one pipeline task were detected. That skips required hash/gather distribution enforcement for inputs whose physical distribution still is not globally aggregated, such as TVF or insert paths, and can produce wrong aggregate results. Keep the optimization that avoids generating redundant two-phase aggregate candidates, but preserve the normal global aggregate request property derivation so correctness-sensitive exchanges remain when needed. Remove the temporary vault regression pin because it masked this correctness issue instead of fixing the root cause.
### Release note
None
### Check List (For Author)
- Test: Unit Test
- FE UT: mvn test -pl fe-core -am -Dtest=org.apache.doris.nereids.properties.RequestPropertyDeriverTest,org.apache.doris.nereids.rules.implementation.BucketedAggregateTest -DfailIfNoTests=false -Dcheckstyle.skip=true -q
- FE Checkstyle: mvn checkstyle:check -pl fe-core
- Behavior changed: No
- Does this need documentation: No
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
run buildall |
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: After pruning two-phase aggregate candidates for single-BE single-instance execution, one-phase global aggregate can still need a hash or gather enforcer to preserve aggregate correctness. ChildrenPropertiesRegulator still banned one-phase aggregate when its child was a distribute enforcer, assuming a two-phase candidate remained available. In the single-instance optimization path that leaves no valid alternative and can produce lowestCostPlans missing GATHER errors. Allow one-phase aggregate with required enforcers when the plan is known to execute on one BE with one pipeline task.
### Release note
None
### Check List (For Author)
- Test: Unit Test
- FE UT: mvn test -pl fe-core -am -Dtest=org.apache.doris.nereids.properties.ChildrenPropertiesRegulatorTest,org.apache.doris.nereids.properties.RequestPropertyDeriverTest,org.apache.doris.nereids.rules.implementation.BucketedAggregateTest -DfailIfNoTests=false -Dcheckstyle.skip=true -q
- FE Checkstyle: mvn checkstyle:check -pl fe-core
- Behavior changed: No
- Does this need documentation: No
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
run buildall |
FE Regression Coverage ReportIncrement line coverage |
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: The single-instance aggregate optimization only generated one-phase aggregate candidates, but ChildrenPropertiesRegulator still applied the UnionAll one-phase aggregate ban first. Nested set operations such as INTERSECT/EXCEPT under UNION DISTINCT can be rewritten through aggregate nodes; when they require GATHER at the root, the one-phase candidate was rejected and no two-phase fallback existed, so planning failed with missing lowestCostPlans for GATHER. This change lets explicit agg_phase=1 and detected single-execution-instance cases bypass the UnionAll ban before other one-phase aggregate checks, preserving a valid one-phase plan while keeping the existing ban for multi-instance planning.
### Release note
None
### Check List (For Author)
- Test: Unit Test
- ./run-fe-ut.sh --run org.apache.doris.nereids.rules.rewrite.AggregateUnionPlanTest#testNestedSetOperationDistinctUnionSingleInstance
- ./run-fe-ut.sh --run org.apache.doris.nereids.rules.rewrite.AggregateUnionPlanTest,org.apache.doris.nereids.properties.ChildrenPropertiesRegulatorTest,org.apache.doris.nereids.rules.implementation.BucketedAggregateTest
- cd fe && mvn checkstyle:check -pl fe-core
- Behavior changed: No
- Does this need documentation: No
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
run buildall |
FE UT Coverage ReportIncrement line coverage |
FE Regression Coverage ReportIncrement line coverage |
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: The single-instance aggregate optimization changes aggregate physical plans when tests run with one BE and one pipeline task. Existing plan and shape regression cases record multi-instance aggregate plans, so they should explicitly run with more than one pipeline task instead of depending on the environment default.
### Release note
None
### Check List (For Author)
- Test: Manual test
- Verified regression suite diffs and ran git diff --check. Local regression execution was blocked because no FE/MySQL service was listening on 127.0.0.1:9030.
- Behavior changed: No
- Does this need documentation: No
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: QueryCacheNormalizerTest validates query cache normalization against multi-instance aggregate plans. The single-instance aggregate optimization can choose one-phase aggregate plans under the default FE UT session, which changes the cache point and normalized digest behavior being tested. Set the test session to use two pipeline tasks so these query cache normalizer cases continue to cover the intended multi-instance plan shape.
### Release note
None
### Check List (For Author)
- Test: Unit Test
- Ran QueryCacheNormalizerTest targeted methods with fe-core dependencies.
- Ran FE checkstyle for fe-core.
- Behavior changed: No
- Does this need documentation: No
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
run buildall |
FE UT Coverage ReportIncrement line coverage |
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: The single-instance aggregate optimization skipped two-phase aggregate candidates in auto mode. Some aggregate functions, such as orthogonal bitmap expression calculate, do not support one-phase execution, so single-instance planning could produce no valid implementation and fail with a missing GATHER root plan. Keep two-phase candidates when no one-phase candidate can be generated, and pin affected plan regression cases to multi-instance execution where they explicitly validate multi-phase plan shapes.
### Release note
None
### Check List (For Author)
- Test: Unit Test
- Ran AggregateUnionPlanTest with fe-core dependencies.
- Ran FE checkstyle for fe-core.
- Behavior changed: No
- Does this need documentation: No
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
FE Regression Coverage ReportIncrement line coverage |
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Several regression failures were exposed by the single-instance aggregate optimization. count_by_enum returned JSON from an unordered map, so foreach aggregate baselines could fail on key ordering only. Decimal view persistence with distinct sum could lose the session-variable guard when the distinct aggregate was rewritten to final multi-distinct phases, causing Decimal256 result type mismatches. Query cache and Python UDAF merge-propagation tests also validate multi-phase aggregate behavior, so pin those specific cases to multi-instance execution.
### Release note
None
### Check List (For Author)
- Test: Unit Test
- Ran AggregateUnionPlanTest with fe-core dependencies.
- Ran FE checkstyle for fe-core.
- Ran BE clang-format/check-format and clang-tidy for the changed BE header.
- Behavior changed: No
- Does this need documentation: No
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
run buildall |
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: The foreach aggregate regression cases compared the full JSON string produced by count_by_enum_foreach. That JSON contains a cbe object backed by unordered map iteration, so key order can vary without a semantic result change. The previous test stabilization changed BE JSON serialization order, which is a behavior change for a test-only instability. Revert the BE serialization change and instead split the regression check: keep stable foreach aggregate outputs in the original query, and validate count_by_enum_foreach through JSON path extraction of each expected field so object key order no longer matters.
### Release note
None
### Check List (For Author)
- Test: Manual test
- Ran git diff --check
- Ran build-support/check-format.sh
- Attempted ./run-regression-test.sh --run -d function_p0 -s test_agg_foreach,test_agg_foreach_not_null, but local Doris connection was refused
- Behavior changed: No
- Does this need documentation: No
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
run buildall |
FE Regression Coverage ReportIncrement line coverage |
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Two regression failures remained after enabling the single-instance aggregate path. The foreach suite still expected array_agg_foreach over nested arrays to fail, but the single-instance plan no longer needs array_agg merge/write and can return a valid result; the test now checks stable shape/counts under an explicit single-instance hint. The decimal256 persisted-view case failed because the final multi-distinct rewrite wrapped the local aggregate output with SessionVarGuardExpr, which is not needed for the local alias and could break later aggregate translation; the guard is kept on the final aggregate result where the persisted session-sensitive type must be preserved. A FE unit reproducer covers the decimal256 distinct aggregate view without group by.
### Release note
None
### Check List (For Author)
- Test: Regression test / Unit Test
- mvn -pl fe-core -am -DskipUT=false -DskipITs -Dtest=org.apache.doris.nereids.rules.rewrite.AggregateUnionPlanTest test
- cd fe && mvn checkstyle:check -pl fe-core
- Attempted ./run-regression-test.sh --run -d function_p0 -s test_agg_foreach, but local Doris connection was refused
- Behavior changed: No
- Does this need documentation: No
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
run buildall |
FE Regression Coverage ReportIncrement line coverage |
What problem does this PR solve?
Issue Number: None
Related PR: None
Problem Summary: In a single-BE single-instance execution, non-distinct aggregation does not benefit from splitting into local and global phases. The split can add an unnecessary hash exchange and extra aggregate operator for high-cardinality group-by queries. This change detects the single execution instance case during Nereids non-distinct aggregate implementation and only generates the one-phase aggregate candidate. It also lets the global aggregate request ANY child distribution in that case so the optimizer does not add a redundant exchange.
Release note
Optimize aggregation planning for single-BE single-instance execution by avoiding unnecessary local/global aggregate split.
Check List (For Author)