[SQL] Minor changes for dataframe implementation #4336
Conversation
Test build #26653 has started for PR 4336 at commit

Test build #26653 has finished for PR 4336 at commit

Test PASSed.
override def collectAsList(): java.util.List[Row] = java.util.Arrays.asList(rdd.collect() :_*)

override def count(): Long = groupBy().count().rdd.collect().head.getLong(0)
override def count(): Long = rdd.count()
Are these changes correct? Or are you removing the optimizations that we have in place for count and collect?
Oh? If I understand correctly, I think rdd.count() is the most optimized (partial aggregation is done before shuffling). @rxin, can you confirm that? Sorry if I am wrong.
@marmbrus is correct. rdd.count() doesn't go through the optimizer; the original solution does.
Maybe a better change would be to add an inline comment explaining that this implementation makes sure the count goes through the optimizer, etc.
Hmm, but rdd.count() doesn't need to go through the Catalyst optimizer, does it? It's already parallel processing.
As an example of a query that can take advantage of the optimizer:
df.count()
If you run count via the RDD, then all columns are extracted. If you run count as is, no actual columns are read.
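To make the trade-off above concrete, here is a small self-contained sketch in plain Scala. It is a toy model of a columnar table, not Spark's actual storage or Catalyst itself: an "optimized" count answers from metadata without touching any column data (analogous to df.count() after column pruning), while a row-scan count materializes every full row first (analogous to rdd.count()).

```scala
// Toy columnar table: each column's values are stored separately,
// and the row count is known as metadata. Illustrative only; this
// is NOT how Spark actually stores or plans queries.
case class ColumnarTable(numRows: Long, columns: Map[String, Array[Any]]) {
  var cellsRead = 0 // tracks how much column data each strategy touches

  // Analogous to an optimized df.count(): the planner knows no column
  // values are needed, so it answers from metadata without any scan.
  def optimizedCount(): Long = numRows

  // Analogous to rdd.count(): assemble every full row (reading every
  // cell of every column), then count the assembled rows.
  def rowScanCount(): Long = {
    val rows = (0L until numRows).map { i =>
      columns.values.map { col => cellsRead += 1; col(i.toInt) }.toSeq
    }
    rows.size.toLong
  }
}

object CountDemo {
  def main(args: Array[String]): Unit = {
    val table = ColumnarTable(3, Map(
      "name" -> Array[Any]("a", "b", "c"),
      "age"  -> Array[Any](1, 2, 3)))

    println(table.optimizedCount()) // 3, with no column data touched
    println(table.cellsRead)        // 0
    println(table.rowScanCount())   // 3, the same answer, but...
    println(table.cellsRead)        // 6: every cell of every column was read
  }
}
```

Both strategies return the same answer; the difference is how much data they read, which is exactly why routing count through the optimizer matters.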
You should always go through the optimizer :)
Ok, that makes sense, thanks for the explanation. :)