Skip to content

[improvement](fe) Avoid two-phase agg for single instance#63732

Open
BiteTheDDDDt wants to merge 12 commits into
apache:masterfrom
BiteTheDDDDt:optimize-single-instance-agg
Open

[improvement](fe) Avoid two-phase agg for single instance#63732
BiteTheDDDDt wants to merge 12 commits into
apache:masterfrom
BiteTheDDDDt:optimize-single-instance-agg

Conversation

@BiteTheDDDDt
Copy link
Copy Markdown
Contributor

What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary: In a single-BE single-instance execution, non-distinct aggregation does not benefit from splitting into local and global phases. The split can add an unnecessary hash exchange and extra aggregate operator for high-cardinality group-by queries. This change detects the single execution instance case during Nereids non-distinct aggregate implementation and only generates the one-phase aggregate candidate. It also lets the global aggregate request ANY child distribution in that case so the optimizer does not add a redundant exchange.

Release note

Optimize aggregation planning for single-BE single-instance execution by avoiding unnecessary local/global aggregate split.

Check List (For Author)

  • Test: FE Unit Test
    • ./run-fe-ut.sh --run org.apache.doris.nereids.rules.implementation.BucketedAggregateTest,org.apache.doris.nereids.properties.RequestPropertyDeriverTest
    • cd fe && mvn checkstyle:check -pl fe-core
  • Behavior changed: Yes. Non-distinct aggregate in single-BE single-instance execution uses a one-phase plan in auto agg phase.
  • Does this need documentation: No

### What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary: In a single-BE single-instance execution, non-distinct aggregation does not benefit from splitting into local and global phases. The split can add an unnecessary hash exchange and extra aggregate operator for high-cardinality group-by queries. This change detects the single execution instance case during Nereids non-distinct aggregate implementation and only generates the one-phase aggregate candidate. It also lets the global aggregate request ANY child distribution in that case so the optimizer does not add a redundant exchange.

### Release note

Optimize aggregation planning for single-BE single-instance execution by avoiding unnecessary local/global aggregate split.

### Check List (For Author)

- Test: FE Unit Test
    - ./run-fe-ut.sh --run org.apache.doris.nereids.rules.implementation.BucketedAggregateTest,org.apache.doris.nereids.properties.RequestPropertyDeriverTest
    - cd fe && mvn checkstyle:check -pl fe-core
- Behavior changed: Yes. Non-distinct aggregate in single-BE single-instance execution uses a one-phase plan in auto agg phase.
- Does this need documentation: No

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 27, 2026 08:00
@hello-stephen
Copy link
Copy Markdown
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

### What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary: Temporarily force parallel_pipeline_task_num to 1 and disable fuzzy randomization for this variable so pipeline tests can deterministically expose plan-shape cases affected by single-instance aggregation planning. This commit is intended to be reverted after collecting affected cases.

### Release note

None

### Check List (For Author)

- Test: FE style check
    - cd fe && mvn checkstyle:check -pl fe-core
- Behavior changed: Yes. Temporarily defaults parallel_pipeline_task_num to 1 and excludes it from fuzzy session variables.
- Does this need documentation: No

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@BiteTheDDDDt
Copy link
Copy Markdown
Contributor Author

run buildall

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Optimize Nereids planning for single-BE single-instance execution by suppressing two-phase and bucketed aggregate candidates for non-distinct aggregates and by relaxing the global aggregate's child distribution request to ANY so no exchange enforcer is added. Also changes the default of parallelPipelineTaskNum from 0 (auto) to 1 and removes its fuzzy randomization.

Changes:

  • Add AggregateUtils.isSingleExecutionInstance and use it in SplitAggWithoutDistinct (skip 2-phase/bucketed candidates) and RequestPropertyDeriver (request ANY for global agg child).
  • Change parallelPipelineTaskNum default from 0 to 1 and drop its fuzzy randomization in initFuzzyModeVariables.
  • Update BucketedAggregateTest and RequestPropertyDeriverTest to set parallelPipelineTaskNum/beNumberForTest explicitly and add new single-instance coverage.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
fe/fe-core/src/main/java/org/apache/doris/nereids/util/AggregateUtils.java New helper detecting one-BE, one-parallel-instance execution.
fe/fe-core/src/main/java/org/apache/doris/nereids/rules/implementation/SplitAggWithoutDistinct.java Skip 2-phase/bucketed candidates in auto mode when single instance.
fe/fe-core/src/main/java/org/apache/doris/nereids/properties/RequestPropertyDeriver.java Request ANY for global hash-agg child when single instance.
fe/fe-core/src/main/java/org/apache/doris/qe/SessionVariable.java Change parallelPipelineTaskNum default to 1 and disable fuzzy randomization.
fe/fe-core/src/test/java/org/apache/doris/nereids/rules/implementation/BucketedAggregateTest.java Set parallel instance explicitly; add single-instance one-phase test.
fe/fe-core/src/test/java/org/apache/doris/nereids/properties/RequestPropertyDeriverTest.java Add single-instance global agg test; use proper ConnectContext with beNumberForTest=3.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@VarAttrDef.VarAttr(name = PARALLEL_PIPELINE_TASK_NUM, fuzzy = false, needForward = true,
setter = "setPipelineTaskNum")
public int parallelPipelineTaskNum = 0;
public int parallelPipelineTaskNum = 1;
this.feDebug = true;
this.enableConditionCache = Config.pull_request_id % 2 == 0;
this.parallelPipelineTaskNum = random.nextInt(8);
this.parallelPrepareThreshold = random.nextInt(32) + 1;
Comment on lines +95 to +102
int beNumber = connectContext.getSessionVariable().getBeNumberForTest();
if (beNumber < 0) {
beNumber = connectContext.getEnv().getClusterInfo().getAllBackendByCurrentCluster(true).size();
}
beNumber = Math.max(1, beNumber);
String clusterName = connectContext.getSessionVariable().resolveCloudClusterName(connectContext);
int parallelInstance = Math.max(1, connectContext.getSessionVariable().getParallelExecInstanceNum(clusterName));
return beNumber == 1 && parallelInstance == 1;
BiteTheDDDDt and others added 2 commits May 27, 2026 17:55
### What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary: Some plan-shape tests rely on multi-instance aggregate planning. When parallel_pipeline_task_num is forced to 1 for pipeline validation, those tests can choose the single-instance one-phase aggregate path and fail their old plan assertions. Pin these tests to parallel_pipeline_task_num=2 so they continue covering the existing multi-instance plan shapes deterministically.

### Release note

None

### Check List (For Author)

- Test: FE Unit Test
    - ./run-fe-ut.sh --run org.apache.doris.nereids.rules.rewrite.SplitMultiDistinctTest,org.apache.doris.nereids.rules.rewrite.AggregateStrategiesTest,org.apache.doris.nereids.rules.rewrite.AggregateUnionPlanTest,org.apache.doris.nereids.processor.post.ShuffleKeyPrunerTest,org.apache.doris.planner.StatisticDeriveTest,org.apache.doris.qe.OlapQueryCacheTest,org.apache.doris.nereids.glue.translator.PhysicalPlanTranslatorTest,org.apache.doris.nereids.rules.exploration.mv.PartitionColumnTraceTest
    - cd fe && mvn checkstyle:check -pl fe-core
- Behavior changed: No. Test-only fixed session variables.
- Does this need documentation: No

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: The single-instance aggregate optimization changed global aggregate child request properties to ANY when only one BE and one pipeline task were detected. That skips required hash/gather distribution enforcement for inputs whose physical distribution still is not globally aggregated, such as TVF or insert paths, and can produce wrong aggregate results. Keep the optimization that avoids generating redundant two-phase aggregate candidates, but preserve the normal global aggregate request property derivation so correctness-sensitive exchanges remain when needed. Remove the temporary vault regression pin because it masked this correctness issue instead of fixing the root cause.

### Release note

None

### Check List (For Author)

- Test: Unit Test

    - FE UT: mvn test -pl fe-core -am -Dtest=org.apache.doris.nereids.properties.RequestPropertyDeriverTest,org.apache.doris.nereids.rules.implementation.BucketedAggregateTest -DfailIfNoTests=false -Dcheckstyle.skip=true -q

    - FE Checkstyle: mvn checkstyle:check -pl fe-core

- Behavior changed: No

- Does this need documentation: No

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@BiteTheDDDDt
Copy link
Copy Markdown
Contributor Author

run buildall

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: After pruning two-phase aggregate candidates for single-BE single-instance execution, one-phase global aggregate can still need a hash or gather enforcer to preserve aggregate correctness. ChildrenPropertiesRegulator still banned one-phase aggregate when its child was a distribute enforcer, assuming a two-phase candidate remained available. In the single-instance optimization path that leaves no valid alternative and can produce lowestCostPlans missing GATHER errors. Allow one-phase aggregate with required enforcers when the plan is known to execute on one BE with one pipeline task.

### Release note

None

### Check List (For Author)

- Test: Unit Test

    - FE UT: mvn test -pl fe-core -am -Dtest=org.apache.doris.nereids.properties.ChildrenPropertiesRegulatorTest,org.apache.doris.nereids.properties.RequestPropertyDeriverTest,org.apache.doris.nereids.rules.implementation.BucketedAggregateTest -DfailIfNoTests=false -Dcheckstyle.skip=true -q

    - FE Checkstyle: mvn checkstyle:check -pl fe-core

- Behavior changed: No

- Does this need documentation: No

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@BiteTheDDDDt
Copy link
Copy Markdown
Contributor Author

run buildall

@hello-stephen
Copy link
Copy Markdown
Contributor

FE Regression Coverage Report

Increment line coverage 92.31% (12/13) 🎉
Increment coverage report
Complete coverage report

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: The single-instance aggregate optimization only generated one-phase aggregate candidates, but ChildrenPropertiesRegulator still applied the UnionAll one-phase aggregate ban first. Nested set operations such as INTERSECT/EXCEPT under UNION DISTINCT can be rewritten through aggregate nodes; when they require GATHER at the root, the one-phase candidate was rejected and no two-phase fallback existed, so planning failed with missing lowestCostPlans for GATHER. This change lets explicit agg_phase=1 and detected single-execution-instance cases bypass the UnionAll ban before other one-phase aggregate checks, preserving a valid one-phase plan while keeping the existing ban for multi-instance planning.

### Release note

None

### Check List (For Author)

- Test: Unit Test
    - ./run-fe-ut.sh --run org.apache.doris.nereids.rules.rewrite.AggregateUnionPlanTest#testNestedSetOperationDistinctUnionSingleInstance
    - ./run-fe-ut.sh --run org.apache.doris.nereids.rules.rewrite.AggregateUnionPlanTest,org.apache.doris.nereids.properties.ChildrenPropertiesRegulatorTest,org.apache.doris.nereids.rules.implementation.BucketedAggregateTest
    - cd fe && mvn checkstyle:check -pl fe-core
- Behavior changed: No
- Does this need documentation: No

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@BiteTheDDDDt
Copy link
Copy Markdown
Contributor Author

run buildall

@hello-stephen
Copy link
Copy Markdown
Contributor

FE UT Coverage Report

Increment line coverage 93.33% (14/15) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Copy link
Copy Markdown
Contributor

FE Regression Coverage Report

Increment line coverage 31.11% (14/45) 🎉
Increment coverage report
Complete coverage report

BiteTheDDDDt and others added 2 commits May 28, 2026 14:21
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: The single-instance aggregate optimization changes aggregate physical plans when tests run with one BE and one pipeline task. Existing plan and shape regression cases record multi-instance aggregate plans, so they should explicitly run with more than one pipeline task instead of depending on the environment default.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Verified regression suite diffs and ran git diff --check. Local regression execution was blocked because no FE/MySQL service was listening on 127.0.0.1:9030.
- Behavior changed: No
- Does this need documentation: No

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: QueryCacheNormalizerTest validates query cache normalization against multi-instance aggregate plans. The single-instance aggregate optimization can choose one-phase aggregate plans under the default FE UT session, which changes the cache point and normalized digest behavior being tested. Set the test session to use two pipeline tasks so these query cache normalizer cases continue to cover the intended multi-instance plan shape.

### Release note

None

### Check List (For Author)

- Test: Unit Test
    - Ran QueryCacheNormalizerTest targeted methods with fe-core dependencies.
    - Ran FE checkstyle for fe-core.
- Behavior changed: No
- Does this need documentation: No

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@BiteTheDDDDt
Copy link
Copy Markdown
Contributor Author

run buildall

@hello-stephen
Copy link
Copy Markdown
Contributor

FE UT Coverage Report

Increment line coverage 93.33% (14/15) 🎉
Increment coverage report
Complete coverage report

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: The single-instance aggregate optimization skipped two-phase aggregate candidates in auto mode. Some aggregate functions, such as orthogonal bitmap expression calculate, do not support one-phase execution, so single-instance planning could produce no valid implementation and fail with a missing GATHER root plan. Keep two-phase candidates when no one-phase candidate can be generated, and pin affected plan regression cases to multi-instance execution where they explicitly validate multi-phase plan shapes.

### Release note

None

### Check List (For Author)

- Test: Unit Test
    - Ran AggregateUnionPlanTest with fe-core dependencies.
    - Ran FE checkstyle for fe-core.
- Behavior changed: No
- Does this need documentation: No

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@hello-stephen
Copy link
Copy Markdown
Contributor

FE Regression Coverage Report

Increment line coverage 18.42% (14/76) 🎉
Increment coverage report
Complete coverage report

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Several regression failures were exposed by the single-instance aggregate optimization. count_by_enum returned JSON from an unordered map, so foreach aggregate baselines could fail on key ordering only. Decimal view persistence with distinct sum could lose the session-variable guard when the distinct aggregate was rewritten to final multi-distinct phases, causing Decimal256 result type mismatches. Query cache and Python UDAF merge-propagation tests also validate multi-phase aggregate behavior, so pin those specific cases to multi-instance execution.

### Release note

None

### Check List (For Author)

- Test: Unit Test
    - Ran AggregateUnionPlanTest with fe-core dependencies.
    - Ran FE checkstyle for fe-core.
    - Ran BE clang-format/check-format and clang-tidy for the changed BE header.
- Behavior changed: No
- Does this need documentation: No

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@BiteTheDDDDt
Copy link
Copy Markdown
Contributor Author

run buildall

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: The foreach aggregate regression cases compared the full JSON string produced by count_by_enum_foreach. That JSON contains a cbe object backed by unordered map iteration, so key order can vary without a semantic result change. The previous test stabilization changed BE JSON serialization order, which is a behavior change for a test-only instability. Revert the BE serialization change and instead split the regression check: keep stable foreach aggregate outputs in the original query, and validate count_by_enum_foreach through JSON path extraction of each expected field so object key order no longer matters.

### Release note

None

### Check List (For Author)

- Test: Manual test

    - Ran git diff --check

    - Ran build-support/check-format.sh

    - Attempted ./run-regression-test.sh --run -d function_p0 -s test_agg_foreach,test_agg_foreach_not_null, but local Doris connection was refused

- Behavior changed: No

- Does this need documentation: No

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@BiteTheDDDDt
Copy link
Copy Markdown
Contributor Author

run buildall

@hello-stephen
Copy link
Copy Markdown
Contributor

FE Regression Coverage Report

Increment line coverage 49.23% (32/65) 🎉
Increment coverage report
Complete coverage report

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Two regression failures remained after enabling the single-instance aggregate path. The foreach suite still expected array_agg_foreach over nested arrays to fail, but the single-instance plan no longer needs array_agg merge/write and can return a valid result; the test now checks stable shape/counts under an explicit single-instance hint. The decimal256 persisted-view case failed because the final multi-distinct rewrite wrapped the local aggregate output with SessionVarGuardExpr, which is not needed for the local alias and could break later aggregate translation; the guard is kept on the final aggregate result where the persisted session-sensitive type must be preserved. A FE unit reproducer covers the decimal256 distinct aggregate view without group by.

### Release note

None

### Check List (For Author)

- Test: Regression test / Unit Test

    - mvn -pl fe-core -am -DskipUT=false -DskipITs -Dtest=org.apache.doris.nereids.rules.rewrite.AggregateUnionPlanTest test

    - cd fe && mvn checkstyle:check -pl fe-core

    - Attempted ./run-regression-test.sh --run -d function_p0 -s test_agg_foreach, but local Doris connection was refused

- Behavior changed: No

- Does this need documentation: No

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@BiteTheDDDDt
Copy link
Copy Markdown
Contributor Author

run buildall

@hello-stephen
Copy link
Copy Markdown
Contributor

FE Regression Coverage Report

Increment line coverage 96.77% (30/31) 🎉
Increment coverage report
Complete coverage report

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants