
[SPARK-30508][SQL] Add SparkSession.executeCommand API for external datasource #27199

Closed
wants to merge 14 commits

Conversation

@Ngone51 (Member) commented Jan 14, 2020

What changes were proposed in this pull request?

This PR adds a SparkSession.executeCommand API so that external data sources can execute an arbitrary command, like

val df = spark.executeCommand("xxxCommand", "xxxSource", "xxxOptions")

Note that the command is not executed by Spark itself; it runs inside an external execution engine, depending on the data source. The command is executed eagerly as soon as executeCommand is called, and the returned DataFrame contains the output of the command (if any).
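
For example, with a hypothetical JDBC-like source (the source name, command, and option key below are placeholders for illustration, not names from this PR):

    import scala.collection.JavaConverters._

    val options = Map("url" -> "jdbc:postgresql://host/db").asJava
    // The command runs on the external engine as soon as executeCommand is called;
    // no job is launched on Spark's executors for the command itself.
    val df = spark.executeCommand("CREATE INDEX idx ON t (col)", "xxxJdbcSource", options)
    df.show()  // displays whatever output the external engine returned, if any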

Why are the changes needed?

This can be useful when a user wants to execute a command outside of Spark, for example, running a custom DDL/DML command for JDBC, creating an index for Elasticsearch, creating cores for Solr, and so on (as @HyukjinKwon suggested).

Previously, a user had to pass the command through an option, e.g. spark.read.format("xxxSource").option("command", "xxxCommand").load(), which is cumbersome. With this change, achieving the same goal is more convenient.

Does this PR introduce any user-facing change?

Yes: a new executeCommand API on SparkSession and a new interface, ExternalCommandRunnableProvider.

How was this patch tested?

Added a new test suite.

@SparkQA commented Jan 14, 2020

Test build #116696 has finished for PR 27199 at commit 86a17f7.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51 (Member Author) commented Jan 14, 2020

cc @cloud-fan

@SparkQA commented Jan 14, 2020

Test build #116697 has finished for PR 27199 at commit d157a3d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51 (Member Author) commented Jan 14, 2020

Jenkins, retest this please.

@SparkQA commented Jan 14, 2020

Test build #116705 has started for PR 27199 at commit d157a3d.


override def run(sparkSession: SparkSession): Seq[Row] = {
  val output = provider.executeCommand(command, parameters)
  Seq(Row(output.mkString("\n")))
}
Contributor review comment:

we can output each string as one row: output.map(Row(_))
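
Applied to the run method quoted above, the suggestion would look roughly like this (a sketch against this PR's snippet, not the merged code; the return type of executeCommand is assumed to be a collection of strings):

    override def run(sparkSession: SparkSession): Seq[Row] = {
      // Each output string from the external engine becomes its own row.
      provider.executeCommand(command, parameters).map(Row(_)).toSeq
    }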

@SparkQA commented Jan 14, 2020

Test build #116711 has finished for PR 27199 at commit 9ed683f.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 14, 2020

Test build #116714 has finished for PR 27199 at commit 5f553f1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Unstable
public interface ExternalCommandRunnableProvider {
  /**
   * Execute a random DDL/DML command inside an external execution engine rather than Spark,
Member review comment:

Does this assume that the command runs on the driver side only, or can a data source execute it in parallel on all available executors and combine the results?

Member Author reply:

Yeah, this only happens on the driver. But ideally the driver won't run the command itself; it delegates to an external execution engine, depending on the data source. For example, a JDBC source could establish a connection and run the command on the external DBMS.
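
As an illustration of that delegation, a JDBC-backed provider might look roughly like the following (a hypothetical sketch: the class name, the exact executeCommand signature, and the connection handling are assumptions, not code from this PR):

    class JdbcCommandProvider extends ExternalCommandRunnableProvider {
      // Assumed signature: the PR snippets only show executeCommand(command, parameters)
      // returning a collection of output strings, so the exact types are guesses.
      override def executeCommand(
          command: String,
          parameters: java.util.Map[String, String]): Array[String] = {
        // Run the command on the external DBMS instead of in Spark.
        val conn = java.sql.DriverManager.getConnection(parameters.get("url"))
        try {
          val stmt = conn.createStatement()
          if (stmt.execute(command)) {
            // The command produced a result set: collect its first column as strings.
            val rs = stmt.getResultSet
            val out = scala.collection.mutable.ArrayBuffer.empty[String]
            while (rs.next()) out += rs.getString(1)
            out.toArray
          } else {
            // DDL/DML: report the update count instead.
            Array(s"updated ${stmt.getUpdateCount} rows")
          }
        } finally {
          conn.close()
        }
      }
    }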

@Ngone51 (Member Author) commented Jan 16, 2020

@cloud-fan @MaxGekk @HyukjinKwon updated, thanks.

@SparkQA commented Jan 16, 2020

Test build #116809 has finished for PR 27199 at commit bcd7338.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51 changed the title from "[SPARK-30508][SQL] Add DataFrameReader.executeCommand API for external datasource" to "[SPARK-30508][SQL] Add SparkSession.executeCommand API for external datasource" on Jan 17, 2020
@SparkQA commented Jan 17, 2020

Test build #116904 has finished for PR 27199 at commit 48e5a26.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 17, 2020

Test build #116901 has finished for PR 27199 at commit 7d0e59f.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51 (Member Author) commented Jan 17, 2020

Jenkins, retest this please.

@SparkQA commented Jan 17, 2020

Test build #116919 has finished for PR 27199 at commit 48e5a26.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

 * @since 3.0.0
 */
@Unstable
public interface ExternalCommandRunnableProvider {
Contributor review comment:

how about ExternalCommandRunnerProvider?

@Unstable
public interface ExternalCommandRunnableProvider {
  /**
   * Execute a random command inside an external execution engine rather than Spark.
@cloud-fan (Contributor) commented Jan 21, 2020:

a random command -> an arbitrary string command

val df = spark.executeCommand("hello", "cmdSource", parameters)
// executeCommand should execute the command eagerly
assert(System.getProperty("command") === "world")
val output1 = df.collect()(0).getString(0)
Contributor review comment:

can we check the result with checkAnswer?
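
With Spark's QueryTest helper, the manual collect()/getString check above could become roughly the following (the expected value is illustrative; it depends on what the fake source returns):

    // In a suite extending QueryTest:
    checkAnswer(df, Row("hello") :: Nil)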

@SparkQA commented Jan 22, 2020

Test build #117218 has finished for PR 27199 at commit d2c28e0.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

import java.util.Map;

/**
 * @since 3.0.0
Contributor review comment:

can we add some classdoc?
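
A possible classdoc, paraphrasing the PR description (illustrative wording, not necessarily what was merged):

    /**
     * An interface for data sources that can run an arbitrary string command inside
     * an external execution engine rather than Spark. The command is executed eagerly,
     * and its output, if any, is returned to the caller as strings.
     *
     * @since 3.0.0
     */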

@@ -2,6 +2,7 @@ org.apache.spark.sql.sources.FakeSourceOne
org.apache.spark.sql.sources.FakeSourceTwo
org.apache.spark.sql.sources.FakeSourceThree
org.apache.spark.sql.sources.FakeSourceFour
org.apache.spark.sql.sources.CommandRunnableDataSource
Contributor review comment:

we don't need to register it as we are not testing the short name. We can just use the full class name when calling spark.executeCommand

System.setProperty("command", "hello")
val parameters = Map("one" -> "1", "two" -> "2").asJava
assert(System.getProperty("command") === "hello")
val df = spark.executeCommand("hello", "cmdSource", parameters)
Contributor review comment:

here we can use classOf[CommandRunnableDataSource].getName instead of cmdSource
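
Applying that suggestion, the quoted call would become:

    val df = spark.executeCommand("hello", classOf[CommandRunnableDataSource].getName, parameters)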

@SparkQA commented Jan 31, 2020

Test build #117682 has finished for PR 27199 at commit 5ae5bd1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member) commented:

LGTM

Test build #117671 has passed all the tests. I am merging it to master.

@SparkQA commented Jan 31, 2020

Test build #117671 has finished for PR 27199 at commit 1fdbe30.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
