
[SQL] Minor changes for dataframe implementation #4336

Closed
wants to merge 1 commit

Conversation

chenghao-intel
Contributor

No description provided.

@SparkQA

SparkQA commented Feb 3, 2015

Test build #26653 has started for PR 4336 at commit 3293408.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Feb 3, 2015

Test build #26653 has finished for PR 4336 at commit 3293408.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26653/


override def collectAsList(): java.util.List[Row] = java.util.Arrays.asList(rdd.collect() :_*)

- override def count(): Long = groupBy().count().rdd.collect().head.getLong(0)
+ override def count(): Long = rdd.count()
Contributor

Are these changes correct? Or are you removing the optimizations that we have in place for count and collect?

Contributor Author

Oh? If I understand correctly, I think rdd.count() is the most optimized (partial aggregation is done before shuffling). @rxin, can you confirm that? Sorry if I am wrong.

Contributor

@marmbrus is correct. rdd.count() doesn't go through the optimizer; the original implementation does.

Maybe a better change is to add an inline comment explaining that this implementation makes sure the query goes through the optimizer.
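A sketch of what that suggested inline comment might look like (the exact wording is an assumption; the implementation line is taken from the diff above):

```scala
// Sketch: keep the groupBy-based implementation and document why it is
// preferred over rdd.count().
override def count(): Long = {
  // Deliberately go through groupBy().count() rather than rdd.count():
  // the aggregate is planned by the Catalyst optimizer, which can prune
  // every column (count(1) needs no column data from columnar sources),
  // whereas rdd.count() first materializes each row as a Row object.
  groupBy().count().rdd.collect().head.getLong(0)
}
```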

Contributor Author

Hmm, but rdd.count() doesn't need to go through the Catalyst optimizer, does it? It's already parallel processing.

Contributor

As an example of a query that can take advantage of the optimizer:

df.count()

If you run count through the RDD, every column is extracted. If you run count as is, no actual columns are read.
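The difference can be sketched as follows (a hypothetical comparison, assuming a DataFrame `df` backed by a columnar source such as Parquet; `rdd` and `count` are the standard DataFrame members):

```scala
// df.rdd.count(): the query result is first converted to an RDD[Row],
// so every column is decoded and materialized before counting.
val viaRdd: Long = df.rdd.count()

// df.count(): planned by Catalyst as an aggregate over count(1);
// column pruning means no column data needs to be read at all.
val viaOptimizer: Long = df.count()
```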

Contributor

You should always go through the optimizer :)

Contributor Author

Ok, that makes sense, thanks for the explanation. :)

5 participants