[SQL] Minor changes for dataframe implementation #4336
Conversation
Test build #26653 has started for PR 4336 at commit

Test build #26653 has finished for PR 4336 at commit

Test PASSed.
override def collectAsList(): java.util.List[Row] = java.util.Arrays.asList(rdd.collect() :_*)

override def count(): Long = groupBy().count().rdd.collect().head.getLong(0)
override def count(): Long = rdd.count()
Are these changes correct? Or are you removing the optimizations that we have in place for count and collect?
Oh? If I understand correctly, I think rdd.count() is the most optimized (partial aggregation is done before shuffling). @rxin, can you confirm that? Sorry if I am wrong.
@marmbrus is correct. rdd.count() doesn't go through the optimizer; the original solution does.
Maybe a better change would be to add an inline comment explaining that this implementation makes sure the count goes through the optimizer, etc.
Hmm, but rdd.count() doesn't need to go through the Catalyst optimizer, does it? It's already parallel processing.
As an example of a query that can take advantage of the optimizer:
df.count()
If you run count via the RDD, then all columns are extracted. If you run count as is, no actual columns are read.
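To make the trade-off above concrete, here is a small self-contained sketch in plain Scala. It is a toy model of a columnar table, not Spark's actual storage or Catalyst itself: an "optimized" count answers from metadata without touching any column data (analogous to df.count() after column pruning), while a row-scan count materializes every full row first (analogous to rdd.count()).

```scala
// Toy columnar table: each column's values are stored separately,
// and the row count is known as metadata. Illustrative only; this
// is NOT how Spark actually stores or plans queries.
case class ColumnarTable(numRows: Long, columns: Map[String, Array[Any]]) {
  var cellsRead = 0 // tracks how much column data each strategy touches

  // Analogous to an optimized df.count(): the planner knows no column
  // values are needed, so it answers from metadata without any scan.
  def optimizedCount(): Long = numRows

  // Analogous to rdd.count(): assemble every full row (reading every
  // cell of every column), then count the assembled rows.
  def rowScanCount(): Long = {
    val rows = (0L until numRows).map { i =>
      columns.values.map { col => cellsRead += 1; col(i.toInt) }.toSeq
    }
    rows.size.toLong
  }
}

object CountDemo {
  def main(args: Array[String]): Unit = {
    val table = ColumnarTable(3, Map(
      "name" -> Array[Any]("a", "b", "c"),
      "age"  -> Array[Any](1, 2, 3)))

    println(table.optimizedCount()) // 3, with no column data touched
    println(table.cellsRead)        // 0
    println(table.rowScanCount())   // 3, the same answer, but...
    println(table.cellsRead)        // 6: every cell of every column was read
  }
}
```

Both strategies return the same answer; the difference is how much data they read, which is exactly why routing count through the optimizer matters.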
You should always go through the optimizer :)
Ok, that makes sense, thanks for the explanation. :)