
[SPARK-32160][CORE][PYSPARK] Disallow to create SparkContext in executors. #28986

Conversation

@ueshin ueshin commented Jul 2, 2020

What changes were proposed in this pull request?

This PR proposes to disallow creating SparkContext in executors, e.g., in UDFs.

Why are the changes needed?

Currently, executors can create a SparkContext, but they shouldn't be able to:

sc.range(0, 1).foreach { _ =>
  new SparkContext(new SparkConf().setAppName("test").setMaster("local"))
}

Does this PR introduce any user-facing change?

Yes, users won't be able to create SparkContext in executors.

How was this patch tested?

Adds tests.

@ueshin ueshin changed the title [SPARK-32160][CORE] Disallow to create SparkContext in executors. [SPARK-32160][CORE][PYSPARK] Disallow to create SparkContext in executors. Jul 2, 2020
* Throws an exception if a SparkContext is about to be created in executors.
*/
private[spark] def assertOnDriver(): Unit = {
if (TaskContext.get != null) {
Contributor

Have you tested this under local mode?

Member Author

Under local mode:

scala> sc.range(0, 1).foreach { _ => new SparkContext(new SparkConf().setAppName("test").setMaster("local")) }
java.lang.IllegalStateException: SparkContext should only be created and accessed on the driver.
...

before this patch:

scala> sc.range(0, 1).foreach { _ => new SparkContext(new SparkConf().setAppName("test").setMaster("local")) }
org.apache.spark.SparkException: Only one SparkContext should be running in this JVM (see SPARK-2243). The currently running SparkContext was created at:
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:921)
...

Although the exception is different, it fails anyway.

I think the new error message is more reasonable.
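For reference, a sketch of the assertion under review, reconstructed from the hunk above and the error output shown in this thread (the exact body merged into SparkContext may differ):

```scala
/**
 * Throws an exception if a SparkContext is about to be created in executors.
 */
private[spark] def assertOnDriver(): Unit = {
  if (TaskContext.get != null) {
    // A non-null TaskContext means this code is running inside a task, i.e. on an executor.
    throw new IllegalStateException(
      "SparkContext should only be created and accessed on the driver.")
  }
}
```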

@jiangxb1987 (Contributor)

I agree we should not create a SparkContext on an executor. We should make sure the approach doesn't break the case where the executor runs in the same JVM as the driver.

@HyukjinKwon (Member) left a comment

LGTM otherwise

@dongjoon-hyun (Member) left a comment

+1, LGTM. Thank you, @ueshin.


SparkQA commented Jul 5, 2020

Test build #124930 has started for PR 28986 at commit aeed617.

@dongjoon-hyun (Member)

The current run seems to hit a flaky test.

[info] - SPARK-30656: minPartitions *** FAILED *** (1 minute)
[info]   The code passed to eventually never returned normally. Attempted 3865 times over 1.0001984272333333 minutes. Last failure message: isPropagated was false Partition [topic-15, 0] metadata not propagated after timeout. (KafkaTestUtils.scala:643)

@dongjoon-hyun (Member)

Retest this please.

@dongjoon-hyun (Member)

The current run fails again, at a different test case.

[info] - read Kafka transactional messages: read_committed *** FAILED *** (1 minute)
[info]   The code passed to eventually never returned normally. Attempted 3685 times over 1.0001156077 minutes. Last failure message: isPropagated was false Partition [topic-39, 0] metadata not propagated after timeout. (KafkaTestUtils.scala:643)
[info]   org.scalatest.exceptions.TestFailedDueToTimeoutException:

@dongjoon-hyun (Member)

Retest this please.


SparkQA commented Jul 6, 2020

Test build #124974 has finished for PR 28986 at commit aeed617.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jul 6, 2020

Test build #125015 has finished for PR 28986 at commit aeed617.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987 (Contributor) left a comment

LGTM too

@HyukjinKwon (Member)

retest this please


SparkQA commented Jul 6, 2020

Test build #125049 has finished for PR 28986 at commit aeed617.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

retest this please


SparkQA commented Jul 6, 2020

Test build #125083 has finished for PR 28986 at commit aeed617.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) left a comment

Looks reasonable

@HyukjinKwon (Member)

retest this please


SparkQA commented Jul 7, 2020

Test build #125240 has finished for PR 28986 at commit aeed617.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jul 7, 2020

Test build #125243 has finished for PR 28986 at commit 0faac2a.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


ueshin commented Jul 7, 2020

Jenkins, retest this please.


SparkQA commented Jul 7, 2020

Test build #125263 has finished for PR 28986 at commit 0faac2a.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.


maropu commented Jul 7, 2020

retest this please


SparkQA commented Jul 8, 2020

Test build #125264 has finished for PR 28986 at commit 0faac2a.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

retest this please


SparkQA commented Jul 8, 2020

Test build #125265 has finished for PR 28986 at commit 0faac2a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

retest this please


SparkQA commented Jul 8, 2020

Test build #125285 has finished for PR 28986 at commit 0faac2a.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jul 8, 2020

Test build #125282 has finished for PR 28986 at commit 0faac2a.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.


maropu commented Jul 8, 2020

retest this please


SparkQA commented Jul 8, 2020

Test build #125336 has finished for PR 28986 at commit 0faac2a.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

retest this please


SparkQA commented Jul 8, 2020

Test build #125342 has finished for PR 28986 at commit 0faac2a.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

retest this please


SparkQA commented Jul 8, 2020

Test build #125357 has finished for PR 28986 at commit 0faac2a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

dongjoon-hyun added a commit that referenced this pull request Jul 8, 2020
### What changes were proposed in this pull request?

This PR aims to disable SBT `unidoc` generation testing in the Jenkins environment because it's flaky there and not used for the official documentation generation. Also, GitHub Actions has the correct test coverage for the official documentation generation.

- #28848 (comment) (amp-jenkins-worker-06)
- #28926 (comment) (amp-jenkins-worker-06)
- #28969 (comment) (amp-jenkins-worker-06)
- #28975 (comment) (amp-jenkins-worker-05)
- #28986 (comment)  (amp-jenkins-worker-05)
- #28992 (comment) (amp-jenkins-worker-06)
- #28993 (comment) (amp-jenkins-worker-05)
- #28999 (comment) (amp-jenkins-worker-04)
- #29010 (comment) (amp-jenkins-worker-03)
- #29013 (comment) (amp-jenkins-worker-04)
- #29016 (comment) (amp-jenkins-worker-05)
- #29025 (comment) (amp-jenkins-worker-04)
- #29042 (comment) (amp-jenkins-worker-03)

### Why are the changes needed?

Apache Spark `release-build.sh` generates the official document by using the following command.
- https://github.com/apache/spark/blob/master/dev/create-release/release-build.sh#L341

```bash
PRODUCTION=1 RELEASE_VERSION="$SPARK_VERSION" jekyll build
```

And, this is executed by the following `unidoc` command for Scala/Java API doc.
- https://github.com/apache/spark/blob/master/docs/_plugins/copy_api_dirs.rb#L30

```ruby
system("build/sbt -Pkinesis-asl clean compile unidoc") || raise("Unidoc generation failed")
```

However, the PR builder disabled `Jekyll build` and instead has a different test coverage.
```python
# determine if docs were changed and if we're inside the amplab environment
# note - the below commented out until *all* Jenkins workers can get `jekyll` installed
# if "DOCS" in changed_modules and test_env == "amplab_jenkins":
#    build_spark_documentation()
```

```
Building Unidoc API Documentation
========================================================================
[info] Building Spark unidoc using SBT with these arguments:
-Phadoop-3.2 -Phive-2.3 -Pspark-ganglia-lgpl -Pkubernetes -Pmesos
-Phadoop-cloud -Phive -Phive-thriftserver -Pkinesis-asl -Pyarn unidoc
```

### Does this PR introduce _any_ user-facing change?

No. (This is used only for testing and not used in the official doc generation.)

### How was this patch tested?

Pass the Jenkins without doc generation invocation.

Closes #29017 from dongjoon-hyun/SPARK-DOC-GEN.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

ueshin commented Jul 8, 2020

Jenkins, retest this please.


SparkQA commented Jul 9, 2020

Test build #125415 has finished for PR 28986 at commit 0faac2a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

Merged to master, and branch-3.0.

HyukjinKwon pushed a commit that referenced this pull request Jul 9, 2020
…tors

### What changes were proposed in this pull request?

This PR proposes to disallow to create `SparkContext` in executors, e.g., in UDFs.

### Why are the changes needed?

Currently executors can create SparkContext, but shouldn't be able to create it.

```scala
sc.range(0, 1).foreach { _ =>
  new SparkContext(new SparkConf().setAppName("test").setMaster("local"))
}
```

### Does this PR introduce _any_ user-facing change?

Yes, users won't be able to create `SparkContext` in executors.

### How was this patch tested?

Adds tests.

Closes #28986 from ueshin/issues/SPARK-32160/disallow_spark_context_in_executors.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit cfecc20)
Signed-off-by: HyukjinKwon <gurwls223@apache.org>

ueshin commented Jul 9, 2020

Finally tests passed. Thanks all for triggering tests again and again!

* Throws an exception if a SparkContext is about to be created in executors.
*/
private def assertOnDriver(): Unit = {
if (TaskContext.get != null) {
Member

@ueshin could you submit a follow-up PR to add a conf? In Spark 3.0, turn it off by default; in Spark 3.1 turn it on by default? Also add it to the migration guide?

Member

Does this actually affect any legitimate use case that would otherwise work? This should be more of a fail-fast for things that will already fail.

Member

This is a user error. However, users may use a Spark library in executors where the library calls SparkContext.getOrCreate. We should still let their workloads work if they worked in previous releases.

Member

I see, but would that ever succeed, using the SparkContext that was created? I'm trying to work out what library would do this and not fail. Is it, perhaps, some boilerplate initialization code that gets executed on driver and executor, and it makes a SparkContext that is never actually used on the executor, but now it fails fast?

I get it, if that's the use case. A release note and/or hidden config to disable it might be an OK workaround.

Alternatively, if the argument is that this is not uncommon, maybe we just don't do this at all and revert it.

Contributor

BTW, had been discussing this offline with @ueshin and @HyukjinKwon - one use case that previously (and admittedly unfortunately) relied on the ability to create SparkContexts in the executors is MLflow's mlflow.pyfunc.spark_udf API (see docs & usage example), which provides a unified interface for scoring models trained in arbitrary ML frameworks (tensorflow, scikit-learn, pyspark.ml) on Spark DataFrames as a pandas UDF. Most ML frameworks (e.g. sklearn) are single-node frameworks that can operate on pandas dataframes, so applying them via pandas UDF works well. For a pyspark.ml model to be applied via pandas UDF, we need to convert the input pandas series -> spark dataframe on the executors, which requires a SparkContext on the executors.

I'll dig to see if there's a way to keep the MLflow use case working with this change (e.g. pyspark.ml models may ultimately perform inference via spark UDF, so maybe we can somehow extract the underlying UDF from the model and return that from mlflow.pyfunc.spark_udf?), but otherwise agree that we may want to revert this change.
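To make the pattern under discussion concrete, here is a minimal, illustrative Scala sketch of a library-style call to SparkContext.getOrCreate from inside a task. The object name, app names, and the local master are assumptions for the example, not code from MLflow or this PR:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object GetOrCreateInTaskExample {
  def main(args: Array[String]): Unit = {
    // Driver-side context, as a normal application would create it.
    val sc = SparkContext.getOrCreate(
      new SparkConf().setAppName("driver-side").setMaster("local[2]"))

    sc.range(0, 1).foreach { _ =>
      // A library helper running inside the task lazily grabs "the" SparkContext.
      // In local mode this used to return the driver's context (same JVM);
      // with this patch it fails fast with IllegalStateException instead.
      val inner = SparkContext.getOrCreate(
        new SparkConf().setAppName("library-side").setMaster("local[2]"))
      println(inner.appName)
    }

    sc.stop()
  }
}
```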

Member Author

I submitted a PR to add configs #29278.
Could you take a look at it when you are available?

@HyukjinKwon (Member) Jul 30, 2020

I think it's a bit odd to have two separate configurations for Scala and Python. I don't think we will ever allow this in PySpark specifically in the future, even if it worked. I think we will discourage this behaviour anyway, whether it is in Scala or Python, and deprecate/remove both configurations together eventually. In which case would we disable it on the Scala side but enable it on the Python side?

Member

In which case would we disable it on the Scala side but enable it on the Python side?

Like how MLflow uses it today. It's pretty hard to make it work correctly on the Scala side, but it's working in Python today.

Member

I see, but I guess we should discourage/deprecate/remove the behaviour in the end - am I correct? It's a bit weird (to end users) to have two configurations controlling the same behaviour. And I think it's unlikely that it should be disabled in Scala but enabled in Python specifically. We could just enable it on both sides.

Member Author

I pushed a commit to use a single config.

HyukjinKwon pushed a commit that referenced this pull request Jul 31, 2020
… create SparkContext in executors

### What changes were proposed in this pull request?

This is a follow-up of #28986.
This PR adds a config to switch allow/disallow to create `SparkContext` in executors.

- `spark.driver.allowSparkContextInExecutors`

### Why are the changes needed?

Some users or libraries actually create `SparkContext` in executors.
We shouldn't break their workloads.

### Does this PR introduce _any_ user-facing change?

Yes, users will be able to create `SparkContext` in executors with the config enabled.

### How was this patch tested?

More tests are added.

Closes #29278 from ueshin/issues/SPARK-32160/add_configs.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
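A usage note on the config named in the commit message above: a hedged sketch of how a workload that relied on the old behaviour could opt back in. Setting it via SparkConf is just one option (it could equally be passed with spark-submit --conf), and the object name, app name, and master here are illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object AllowContextInExecutorsExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("legacy-executor-context") // illustrative name
      .setMaster("local[2]")                 // illustrative master
      // Config from #29278: when enabled, creating SparkContext in executors is allowed again.
      .set("spark.driver.allowSparkContextInExecutors", "true")

    val sc = new SparkContext(conf)
    // ... workload that (directly or via a library) touches SparkContext inside tasks ...
    sc.stop()
  }
}
```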