@924060929
Contributor

cherry pick from #35871

@doris-robot

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website.
See Doris Document.

@924060929 924060929 changed the title [enhancement](Nereids) support 4 phases distinct aggregate with full … [enhancement](Nereids) support 4 phases distinct aggregate with full distribution Jun 7, 2024
@924060929
Contributor Author

run buildall

…distribution (apache#35871)

The original implementation of the 4-phase distinct aggregate only supports queries that contain no `group by` and have exactly one distinct aggregate function.

For example:
```sql
select count(distinct sex), sum(age)
from student
```

This PR extends the 4-phase distinct aggregate to support full distribution, which avoids data skew when shuffling by the `group by` columns.

For example:
```sql
select sex, sum(distinct age)
from student
group by sex;
```
The `sex` column contains only two distinct values, `male` and `female`, while the table stores millions of rows.
Shuffling by `sex` alone causes data skew: at most two instances receive all the rows, and the remaining instances process no rows at all.
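
The skew can be illustrated with a small simulation. This is only a sketch under stated assumptions: the row data, the instance count of 8, and the MD5-based `instance_for` routing function are all hypothetical and are not how Doris hashes shuffle keys. It compares how many instances receive rows when the shuffle key is `sex` versus `(sex, age)`.

```python
import hashlib
from collections import Counter


def instance_for(key, num_instances):
    """Route a shuffle key to an instance using a stable hash.

    Illustrative only: this is NOT the hash function Doris uses.
    """
    digest = hashlib.md5(repr(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_instances


def rows_per_instance(rows, key_columns, num_instances):
    """Count how many rows each instance receives for the given shuffle key."""
    counts = Counter(
        instance_for(tuple(row[c] for c in key_columns), num_instances)
        for row in rows
    )
    return [counts.get(i, 0) for i in range(num_instances)]


# Hypothetical table: 100,000 rows, 2 sexes, 60 distinct ages per sex.
rows = [
    {"sex": "male" if i % 2 == 0 else "female", "age": 18 + (i // 2) % 60}
    for i in range(100_000)
]

NUM_INSTANCES = 8  # assumed degree of parallelism
by_sex = rows_per_instance(rows, ["sex"], NUM_INSTANCES)
by_sex_age = rows_per_instance(rows, ["sex", "age"], NUM_INSTANCES)

busy_by_sex = sum(1 for n in by_sex if n > 0)
busy_by_sex_age = sum(1 for n in by_sex_age if n > 0)

print("rows per instance, shuffle by sex:       ", by_sex)
print("rows per instance, shuffle by (sex, age):", by_sex_age)
```

With only two key values, the first shuffle can never keep more than two instances busy, no matter how many instances exist; the second shuffle has 120 distinct keys to spread across them.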

The 4-phase aggregate first shuffles by `sex, age` to deduplicate rows, so more instances can perform the distinct step in parallel. The plan shape looks like this:
```
PhysicalAggregate(groupBy=[sex], output=[sex, sum(partial_sum(age))], mode=BUFFER_TO_RESULT)
                                        |
                         PhysicalDistribute(columns=[sex])
                                        |
PhysicalAggregate(groupBy=[sex], output=[sex, partial_sum(age)], mode=INPUT_TO_BUFFER)
                                        |
    PhysicalAggregate(groupBy=[sex, age], output=[sex, age], mode=BUFFER_TO_BUFFER)
                                        |
                         PhysicalDistribute(columns=[sex, age])   # shuffle by more columns to avoid data skew
                                        |
PhysicalAggregate(groupBy=[sex, age], output=[sex, age], mode=INPUT_TO_BUFFER)
                                        |
                          PhysicalOlapScan(name=student)
```
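
To make the data flow of this plan concrete, here is a minimal in-memory sketch of the four phases for `sum(distinct age) ... group by sex`. It is a toy model, not the Nereids implementation: the function name `four_phase_sum_distinct`, the sample rows, and the use of Python's built-in `hash` for routing are all assumptions made for illustration.

```python
from collections import defaultdict


def four_phase_sum_distinct(scan_instances, num_instances=4):
    """Toy model of the plan: returns {sex: sum(distinct age)}.

    scan_instances is a list of row lists, one per scan instance;
    each row is a (sex, age) tuple. Hypothetical helper, not a Doris API.
    """
    # Phase 1: local distinct on (sex, age) within each scan instance.
    local = [set(rows) for rows in scan_instances]

    # Distribute by (sex, age), then phase 2: global distinct.
    # Identical keys always land on the same instance, so a set per
    # instance removes the duplicates that survived phase 1.
    deduped = [set() for _ in range(num_instances)]
    for part in local:
        for key in part:
            deduped[hash(key) % num_instances].add(key)

    # Phase 3: partial_sum(age) grouped by sex on each instance.
    partials = []
    for part in deduped:
        acc = defaultdict(int)
        for sex, age in part:
            acc[sex] += age
        partials.append(acc)

    # Distribute by sex, then phase 4: merge the partial sums.
    final_parts = [defaultdict(int) for _ in range(num_instances)]
    for partial in partials:
        for sex, partial_sum in partial.items():
            final_parts[hash(sex) % num_instances][sex] += partial_sum

    result = {}
    for part in final_parts:
        result.update(part)
    return result


# Hypothetical input: two scan instances with overlapping rows.
scan_instances = [
    [("male", 20), ("male", 20), ("female", 30)],
    [("male", 20), ("male", 25), ("female", 30), ("female", 31)],
]
result = four_phase_sum_distinct(scan_instances)
print(result)
```

Only the final merge is keyed on `sex` alone, and by then each instance sends at most one partial sum per group, so the skewed shuffle carries very little data.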

(cherry picked from commit 03f1cbd)
@morningman morningman force-pushed the branch-2.1-4phase branch from cb1c156 to 12179b3 Compare June 7, 2024 07:37
@morningman
Contributor

run buildall
