Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GLUTEN-4889][VL] Support approx_percentile #5007

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

WangGuangxin
Copy link
Contributor

What changes were proposed in this pull request?

Support approx_percentile for VL backend

(Fixes: #4889)

How was this patch tested?

UT

Copy link

#4889

Copy link

Run Gluten Clickhouse CI

@zhouyuan
Copy link
Contributor

CC: @zhztheplayer

@WangGuangxin
Copy link
Contributor Author

I believe the change on Gluten side is ready now. But the ApproxPercentileAggregate implements in velox has to make some minor modification, becasue when add intermediate data, it checks the row vector encoding in addIntermediateImpl
https://github.com/facebookincubator/velox/blob/main/velox/functions/prestosql/aggregates/ApproxPercentileAggregate.cpp#L652 ,
I believe the encoding is guaranteed by extractAccumulators https://github.com/facebookincubator/velox/blob/main/velox/functions/prestosql/aggregates/ApproxPercentileAggregate.cpp#L262.

But when it comes in Gluten, the output of PartialAgg are processed by Shuffle, which will flatten all vectors, so that the encoding is not guaranteed when added to FinalAgg.
https://github.com/apache/incubator-gluten/blob/main/cpp/velox/shuffle/VeloxShuffleWriter.cc#L343

cc @zhztheplayer @liujiayi771 @ulysses-you

@WangGuangxin WangGuangxin marked this pull request as ready for review March 20, 2024 01:58
@zhztheplayer
Copy link
Member

Thank you in advance!

I believe the change on Gluten side is ready now.

Did that mean this PR can be merged prior to merging Velox changes?


test("Support ApproximatePercentile") {
runQueryAndCompare("""
|SELECT approx_percentile(col, array(0.5, 0.4, 0.1), 100)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remember Spark has percentile_approx. Is that one supported by this patch either?

Copy link
Member

@zhztheplayer zhztheplayer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good if test passes. Thanks a lot for helping on this.

@PHILO-HE If you have further comments

@liujiayi771
Copy link
Contributor

@WangGuangxin Perhaps we could create an issue in the upstream Velox community to remove this verification.

@zhztheplayer
Copy link
Member

zhztheplayer commented Mar 21, 2024

Hi @WangGuangxin so the current plan is to merge

VELOX_REPO=https://github.com/wangguangxin/velox.git
VELOX_BRANCH=2024_03_15_approx_percentile

to upstream Velox before this ? Am I understanding correctly?

@WangGuangxin
Copy link
Contributor Author

this

yes, we should merge to upsteam first before merge this PR

ae.aggregateFunction match {
case _: ApproximatePercentile =>
ae.mode match {
case Partial | PartialMerge => true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It only support Partial and Final in HashAggregateExecTransformer.scala. Shall we add PartialMerge here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, it should be removed

@liujiayi771
Copy link
Contributor

@WangGuangxin Any progress of this PR?

.getOrElse(attr)
case other => other
}
copyBaseAggregateExec(agg)(newResultExpressions = newResultExpressions)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC this could make the offloaded partial aggregate operator incompatible with vanilla Spark's final aggregate operator? Say if final aggregation is fallen back then UB will be led.

I am considering whether we could use the approach like we had done for approx_count_distinct (#1676), to replace approx_percentile with something like velox_approx_percentile that has different intermediate type definition (or use an internal project to match up with vanilla's intermediate type) at the very early planning phase? Then we can distinguish between vanilla's and Velox's implementations for this function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I'll rebase and refact it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[VL] Add support of percentile_approx / approx_percentile
4 participants