Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-2973] [SQL] Avoid spark job for take on all ExecutedCommands #5247

Closed
wants to merge 3 commits into from

Conversation

scwf
Copy link
Contributor

@scwf scwf commented Mar 29, 2015

I have manually tested this with

sql("show tables").take(1)

it will not start a spark job.

@SparkQA
Copy link

SparkQA commented Mar 29, 2015

Test build #29355 has started for PR 5247 at commit 17358b8.

  • This patch merges cleanly.

@scwf scwf changed the title [SQL] [SPARK-2973] Avoid spark job for take on all ExecutedCommands [SPARK-2973] [SQL] Avoid spark job for take on all ExecutedCommands Mar 29, 2015
@SparkQA
Copy link

SparkQA commented Mar 29, 2015

Test build #29355 has finished for PR 5247 at commit 17358b8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class LogicalLocalRDD(output: Seq[Attribute], sideEffect: Seq[Row])(sqlContext: SQLContext)
    • case class PhysicalLocalRDD(output: Seq[Attribute], sideEffectResult: Seq[Row]) extends LeafNode

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29355/
Test PASSed.

@rxin
Copy link
Contributor

rxin commented Mar 29, 2015

Can we use LocalRelation?

@liancheng
Copy link
Contributor

Yes, that's also what I'm going to say. As described in the JIRA ticket title, using a LocalRelation can be simpler.

@scwf
Copy link
Contributor Author

scwf commented Mar 29, 2015

Ok. Get it

@SparkQA
Copy link

SparkQA commented Mar 29, 2015

Test build #29365 has started for PR 5247 at commit a129816.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Mar 29, 2015

Test build #29365 has finished for PR 5247 at commit a129816.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29365/
Test PASSed.

@scwf
Copy link
Contributor Author

scwf commented Mar 31, 2015

@rxin @liancheng is this ok?

@rxin
Copy link
Contributor

rxin commented Mar 31, 2015

@liancheng can you look at this ? thanks.

@scwf
Copy link
Contributor Author

scwf commented Apr 6, 2015

/cc @liancheng

@marmbrus
Copy link
Contributor

marmbrus commented Apr 7, 2015

Thanks for working on this! I have one issue with the current implementation. In particular, it is essentially doing query planning inside of [[DataFrame]]. If you look, this is very close to the logic that can be found here:

case r: RunnableCommand => ExecutedCommand(r) :: Nil

I'd rather not spread this logic out over several different places.

Really, it seems to me like ExecutedCommand is probably redundant with LocalRelation. I haven't spent a lot of time looking at this, but it seems like things would be cleaner if we do the following:

  • High level trait called Command. All things that should be eagerly evaluated will mix in this trait. We can simplify the matching in DataFrame to only check if something is a Command and if so call queryExecution.executedPlan to trigger eager execution.
  • Remove ExecutedCommand, do the conversion LocalRelation in the query planner.

What do you think?

@scwf
Copy link
Contributor Author

scwf commented Apr 13, 2015

Hi @marmbrus,
First i think this change is not doing query planning because
query planning is to convert a logical plan to spark plan (correct me if i misunderstanding)
and here i just make a new logical node from RunnableCommand to LocalRelation

Second, your suggestion is really useful, how about change as following:
1 add a rule to optimizer to transform from RunnableCommand to LocalRelation
2 remove ExecutedCommand, since

case logical.LocalRelation(output, data) =>
        LocalTableScan(output, data) :: Nil

will transform LocalRelation to LocalTableScan

@marmbrus
Copy link
Contributor

You are correct, thanks for clarifying. Query planning was not the right phrase, but really my point was that ideally the logic in DataFrame would handle only ensuring commands are executed eagerly. Hopefully this can be as simple as matching on Command and making sure executedCommand is evaluated. (However this would require refactoring to make sure that every operation implements the Command trait)

I like your second point, but I'd make one change. Lets not put it in the optimizer since that feels a little weird to me (its not really an optimization). Instead, in the query planner lets go directly from RunnableCommand to LocalTableScan What do you think?

@scwf
Copy link
Contributor Author

scwf commented Apr 14, 2015

yeah, its good to directly from RunnableCommand to LocalTableScan, i am updating this.

@SparkQA
Copy link

SparkQA commented Apr 14, 2015

Test build #30216 has started for PR 5247 at commit b1232ed.

  • This patch does not merge cleanly.

@SparkQA
Copy link

SparkQA commented Apr 14, 2015

Test build #30216 has finished for PR 5247 at commit b1232ed.

  • This patch fails Scala style tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30216/
Test FAILed.

@SparkQA
Copy link

SparkQA commented Apr 15, 2015

Test build #30317 has started for PR 5247 at commit 7f51f7e.

@scwf
Copy link
Contributor Author

scwf commented Apr 15, 2015

Here is a problem need to be fixed: the ddl command will be executed twice

@SparkQA
Copy link

SparkQA commented Apr 15, 2015

Test build #30317 has finished for PR 5247 at commit 7f51f7e.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30317/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30316/
Test FAILed.

_: WriteToFile =>
case _ : Command =>
queryExecution.sparkPlan.executeCollect()
queryExecution.analyzed
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will leads to execute command twice when we do action operator on dataframe, such as
sql(s"CREATE DATABASE xxx").count()
first execution is the eager execution when constructing dataframe
second is when execute count, it also trigger execution.

So maybe we still need construct LocalRelation here?

@scwf
Copy link
Contributor Author

scwf commented Apr 15, 2015

@marmbrus i think we can not directly from RunnableCommand to LocalTableScan in planner, that will leads to execute command twice as i described before. So here is other two ways to do this:
1 add a rule in analyzer/optimizer to change RunnableCommand to localRelation
2 revert to my first commit version

your idea?

@SparkQA
Copy link

SparkQA commented Apr 21, 2015

Test build #30711 has started for PR 5247 at commit 0f623ca.

@SparkQA
Copy link

SparkQA commented Apr 22, 2015

Test build #30711 has finished for PR 5247 at commit 0f623ca.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30711/
Test FAILed.

@scwf
Copy link
Contributor Author

scwf commented Apr 22, 2015

Jenkins failed org.apache.spark.sql.hive.thriftserver.CliSuite.Simple commands but locally test ok.

@scwf
Copy link
Contributor Author

scwf commented Apr 22, 2015

Jenkins, retest this please.

@SparkQA
Copy link

SparkQA commented Apr 22, 2015

Test build #30715 has started for PR 5247 at commit 0f623ca.

@SparkQA
Copy link

SparkQA commented Apr 22, 2015

Test build #30715 has finished for PR 5247 at commit 0f623ca.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30715/
Test PASSed.

@scwf
Copy link
Contributor Author

scwf commented Apr 22, 2015

@marmbrus any more comments?

@scwf
Copy link
Contributor Author

scwf commented May 3, 2015

Retest this please

@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@SparkQA
Copy link

SparkQA commented May 3, 2015

Test build #31679 has started for PR 5247 at commit 0f623ca.

@SparkQA
Copy link

SparkQA commented May 3, 2015

Test build #31679 has finished for PR 5247 at commit 0f623ca.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Merged build finished. Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31679/
Test PASSed.

@scwf
Copy link
Contributor Author

scwf commented May 6, 2015

/cc @marmbrus

@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@SparkQA
Copy link

SparkQA commented May 9, 2015

Test build #32289 has started for PR 5247 at commit 6402c61.

@SparkQA
Copy link

SparkQA commented May 9, 2015

Test build #32289 has finished for PR 5247 at commit 6402c61.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Merged build finished. Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32289/
Test PASSed.

@scwf
Copy link
Contributor Author

scwf commented May 12, 2015

ping @marmbrus

@marmbrus
Copy link
Contributor

This implementation is still making changes to the query plan in DataFrame that aren't reflected in the queryExecution. This is something I want to avoid. Also, it seems to me that this issue is pretty much already fixed as most things inherit from RunnableCommand, which should never run a Spark job. So I'd rather not add hacks in the wrong places that are doing things like double conversion into catalyst format (which I think the current implementation does).

@scwf
Copy link
Contributor Author

scwf commented May 15, 2015

sql("show tables").take(1) still start a spark job on the master branch.
after thinking this more, i think we can convert runnablecommand to LocalRelation in analyzer and also we do not need executecommand

@marmbrus
Copy link
Contributor

If we can do that in the query planner that sounds reasonable to me. It
would also be nice to add Command to all commands so we can just match on
that and call executedPlan to force the lazy val calculation for all all of
them with out special cases here.
On May 14, 2015 8:41 PM, "Fei Wang" notifications@github.com wrote:

sql("show tables").take(1) still start a spark job on the master branch.
after thinking this more, i think we can convert runnablecommand to
LocalRelation and also we do not need executecommand


Reply to this email directly or view it on GitHub
#5247 (comment).

@scwf
Copy link
Contributor Author

scwf commented Jul 1, 2015

close this, will file a new PR when i have time to fix the test failure

@scwf scwf closed this Jul 1, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants