-
Notifications
You must be signed in to change notification settings - Fork 28k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-4233] [SQL] WIP:Simplify the UDAF API (Interface) #3247
Conversation
Test build #23312 has finished for PR 3247 at commit
|
@rxin, @marmbrus, most of ideas are from the Hive/Shark, can you please go thru the code and give some feedbacks? Sorry, it's a BIG commit, but I really want to merge all of the aggregation improvement(totally 3 - 4 BIG PRs) in next release. (Most of the tricks happens in |
@chenghao-intel, I glanced at this really quickly and will take a closer look once we cut an RC for 1.2. Overall this is probably a good direction to go in. |
Test build #23833 has finished for PR 3247 at commit
|
Test build #24242 has finished for PR 3247 at commit
|
Test build #24297 has finished for PR 3247 at commit
|
Test build #24526 has finished for PR 3247 at commit
|
Test build #24941 has finished for PR 3247 at commit
|
Test build #24950 has finished for PR 3247 at commit
|
test this please |
Test build #25020 has finished for PR 3247 at commit
|
Test build #25043 has finished for PR 3247 at commit
|
Test build #25047 has finished for PR 3247 at commit
|
@marmbrus , this PR passed the unit test, but some of details need to be discussed. Can you review this? Particularly for the UDAF interface design. Sorry about so many code changes, as I almost rewrote all of the UDAF relevant code. |
CountDistinct(args.map(nodeToExpr)) | ||
case Token("TOK_FUNCTIONDI", Token(SUM(), Nil) :: arg :: Nil) => Sum(nodeToExpr(arg), true) | ||
case Token("TOK_FUNCTIONDI", Token(MAX(), Nil) :: arg :: Nil) => Max(nodeToExpr(arg), true) | ||
case Token("TOK_FUNCTIONDI", Token(MIN(), Nil) :: arg :: Nil) => Min(nodeToExpr(arg), true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does MIN distinct mean?
I only looked at this quickly, but I like the goals, especially the middle one. Our current implementation is really wasteful on memory. Some thoughts:
Other things:
|
Test build #25783 has finished for PR 3247 at commit
|
Test build #26013 has finished for PR 3247 at commit
|
@marmbrus I've rebased to the latest master, and also updated the benchmark result, Sorry, the interface has slight different than the design doc in jira, I will update that soon, but the general idea would be the same. |
Test build #26014 has finished for PR 3247 at commit
|
Sorry, @maropu I'v updated. Let's see if will break anything. |
Test build #29286 has finished for PR 3247 at commit
|
Test build #29287 has finished for PR 3247 at commit
|
Test build #29289 has finished for PR 3247 at commit
|
Test build #29291 has finished for PR 3247 at commit
|
Test build #29310 has finished for PR 3247 at commit
|
Test build #29457 has finished for PR 3247 at commit
|
@chenghao-intel I'm also with your refactoring idea though, it's too big to merge into the master in bulk.
Thought? |
@maropu Glad to know you're interested with the refactoring! @maropu I will be glad to merge/review your PRs if you make the change against my repo, or @marmbrus @rxin is it possible to create a branch for this PR in apache repo? |
Is it not possible to create that simple patch that removes DISTINCT aggregation expressions? |
OK, I got your mean, as I put into the description of this PR, we want to make a unified UDAF interface in this PR, |
@inline | ||
final def getStruct(bound: BoundReference): Row = getStruct(bound.ordinal) | ||
/* end of the syntactic sugar it as API */ | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This changes must be needed for this patch?
The interfaces of Row
are related to all the other operator.
I think that if necessary, you make a PR first to add these interfaces in Row
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with that.
This PR is superseded by another PR and can be close, right? |
Yea, I will close this for #5542 |
Simplify the UDAF API is the first step of optimization for Aggregation (see https://issues.apache.org/jira/browse/SPARK-4366).
Currently UDAF cannot scale up when data volume grows, particularly for the
distinct
aggregation expressions. This PR doesn't aim for fixing thedistinct
performance, but facilitateUDAF Developers will not write the distinct expression any more, like
DistinctAverage
is not necessary, asAverage
is provided, the framework will handle thedistinct
internally.Aggregation Buffer will be stored as
MutableRow
with schema, it means the UDAF developers will benefit from the Catalyst Expression Evaluation framework in UDAF development.The following sub tasks have been done:
max
,first
,last
etc.)Still some open issues need to be done with follow-up PRs:
DISTINCT
(before the data shuffling)Data Gen Code:
And the benchmark code and result:
CMD: bin/spark-shell --master local[6] --jars conf/hive-site.xml --driver-memory 10g
CMD: bin/spark-shell --master local[6] --jars conf/hive-site.xml --driver-memory 5g
In general, the new implementation has better in memory usage, and faster while group by keys specified, in the meantime, we have room to optimize the non-group by key cases for the new implementation, which is supposed to have no impact with the UDAF interface.
PS: the old implementation seems doesn't support the query like
select avg(distinct value) from xxx