[SPARK-2973] [SQL] Avoid spark job for take on all ExecutedCommands #5247
Conversation
Test build #29355 has started for PR 5247 at commit
Test build #29355 has finished for PR 5247 at commit
Test PASSed.
Can we use LocalRelation?
Yes, that's also what I was going to say. As described in the JIRA ticket title, using a
OK, got it.
Test build #29365 has started for PR 5247 at commit
Test build #29365 has finished for PR 5247 at commit
Test PASSed.
@rxin @liancheng is this OK?
@liancheng can you look at this? Thanks.
/cc @liancheng
Thanks for working on this! I have one issue with the current implementation: it is essentially doing query planning inside of [[DataFrame]]. If you look, this is very close to the logic that can be found here: spark/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala (line 263 in a95043b).
I'd rather not spread this logic out over several different places. Really, it seems to me like
What do you think?
Hi @marmbrus. Second, your suggestion is really useful; how about changing it as follows:
will transform LocalRelation to LocalTableScan
You are correct, thanks for clarifying. Query planning was not the right phrase; really my point was that ideally the logic in DataFrame would handle only ensuring commands are executed eagerly. Hopefully this can be as simple as matching on
I like your second point, but I'd make one change. Let's not put it in the optimizer, since that feels a little weird to me (it's not really an optimization). Instead, in the query planner let's go directly from
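A sketch of the planner rule being suggested here. This is illustrative only: `Row`, `RunnableCommand`, and `LocalTableScan` are simplified stand-ins for Spark's real classes so the idea runs on its own, and, as noted further down in the thread, applying this rule naively can execute a command twice.

```scala
object PlannerSketch {
  case class Row(values: Any*)

  trait LogicalPlan
  trait RunnableCommand extends LogicalPlan { def run(): Seq[Row] }

  // A physical node that already holds its rows locally; scanning it
  // never needs a Spark job.
  case class LocalTableScan(rows: Seq[Row])

  // The proposed rule: when the planner sees a RunnableCommand, it runs
  // the command immediately and plans its output as a LocalTableScan.
  def planCommand(plan: LogicalPlan): Option[LocalTableScan] = plan match {
    case cmd: RunnableCommand => Some(LocalTableScan(cmd.run()))
    case _                    => None
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical command standing in for e.g. "show tables".
    object ShowTables extends RunnableCommand {
      def run(): Seq[Row] = Seq(Row("t1"), Row("t2"))
    }
    val scan = planCommand(ShowTables).get
    assert(scan.rows.length == 2) // output is already local; no job needed
    println(scan.rows.length)
  }
}
```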
Yeah, it's good to go directly from RunnableCommand to LocalTableScan; I am updating this.
Test build #30216 has started for PR 5247 at commit
Test build #30216 has finished for PR 5247 at commit
Test FAILed.
Test build #30317 has started for PR 5247 at commit
Here is a problem that needs to be fixed: the DDL command will be executed twice.
Test build #30317 has finished for PR 5247 at commit
Test FAILed.
_: WriteToFile =>
case _: Command =>
  queryExecution.sparkPlan.executeCollect()
  queryExecution.analyzed
This will lead to executing the command twice when we perform an action operator on the DataFrame, such as:
sql(s"CREATE DATABASE xxx").count()
The first execution is the eager execution when constructing the DataFrame; the second is when count is executed, which also triggers execution.
So maybe we still need to construct a LocalRelation here?
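A self-contained sketch of the fix this comment points toward. The names mirror Spark's classes but the code is a toy model, not Spark's actual implementation: the command runs exactly once, eagerly, while the DataFrame is constructed, and its output is captured in a LocalRelation so a later count() does not trigger a second execution.

```scala
object EagerCommandSketch {
  case class Row(values: Any*)

  trait LogicalPlan
  trait Command extends LogicalPlan { def run(): Seq[Row] }
  case class LocalRelation(rows: Seq[Row]) extends LogicalPlan

  // Simplified DataFrame: a Command is run once at construction time and
  // replaced by a LocalRelation holding its output.
  class DataFrame(plan: LogicalPlan) {
    private val logical: LogicalPlan = plan match {
      case c: Command => LocalRelation(c.run())
      case other      => other
    }
    // Actions read the already-materialized rows; the command is not re-run.
    def count(): Long = logical match {
      case LocalRelation(rows) => rows.length.toLong
      case _                   => 0L
    }
  }

  def main(args: Array[String]): Unit = {
    var executions = 0
    // Hypothetical command standing in for CREATE DATABASE.
    val createDatabase = new Command {
      def run(): Seq[Row] = { executions += 1; Seq.empty }
    }
    val df = new DataFrame(createDatabase) // first (and only) execution
    df.count()                             // reuses the LocalRelation
    assert(executions == 1)
    println(executions)
  }
}
```

The design point is that eagerness lives in DataFrame construction, so any later action sees only local rows.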
@marmbrus I think we cannot go directly from RunnableCommand to LocalTableScan in the planner; that will lead to executing the command twice, as I described before. So here are two other ways to do this: your idea?
Test build #30711 has started for PR 5247 at commit
Test build #30711 has finished for PR 5247 at commit
Test FAILed.
Jenkins failed.
Jenkins, retest this please.
Test build #30715 has started for PR 5247 at commit
Test build #30715 has finished for PR 5247 at commit
Test PASSed.
@marmbrus any more comments?
Retest this please.
Merged build triggered.
Merged build started.
Test build #31679 has started for PR 5247 at commit
Test build #31679 has finished for PR 5247 at commit
Merged build finished. Test PASSed.
Test PASSed.
/cc @marmbrus
Merged build triggered.
Merged build started.
Test build #32289 has started for PR 5247 at commit
Test build #32289 has finished for PR 5247 at commit
Merged build finished. Test PASSed.
Test PASSed.
ping @marmbrus
This implementation is still making changes to the query plan in
sql("show tables").take(1) still starts a Spark job on the master branch.
If we can do that in the query planner, that sounds reasonable to me. It
Closing this; I will file a new PR when I have time to fix the test failure.
I have manually tested this with
It will not start a Spark job.