[SPARK-32160][CORE][PYSPARK] Disallow to create SparkContext in executors. #28986
Conversation
```scala
 * Throws an exception if a SparkContext is about to be created in executors.
 */
private[spark] def assertOnDriver(): Unit = {
  if (TaskContext.get != null) {
```
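The check above works because executors install a thread-local `TaskContext` for the duration of each task, while no such context exists on the driver thread. A minimal self-contained model of that mechanism (the `Fake*` objects are hypothetical stand-ins for illustration, not Spark's real classes):

```scala
// Hypothetical model of Spark's TaskContext thread-local: executors set it
// while a task runs, so a non-null get() means "inside a task on an executor".
object FakeTaskContext {
  private val ctx = new ThreadLocal[AnyRef]()
  def get: AnyRef = ctx.get()
  def setForTask(): Unit = ctx.set(new Object) // simulate entering a task
  def unset(): Unit = ctx.remove()             // simulate task completion
}

object FakeSparkContext {
  // Mirrors the assertOnDriver check added in this PR.
  def assertOnDriver(): Unit = {
    if (FakeTaskContext.get != null) {
      throw new IllegalStateException(
        "SparkContext should only be created and accessed on the driver.")
    }
  }
}

object AssertOnDriverDemo {
  def main(args: Array[String]): Unit = {
    FakeSparkContext.assertOnDriver() // "driver" thread: no TaskContext, passes
    FakeTaskContext.setForTask()      // simulate running inside a task
    val threw =
      try { FakeSparkContext.assertOnDriver(); false }
      catch { case _: IllegalStateException => true }
    FakeTaskContext.unset()
    println(s"threw inside task: $threw") // prints: threw inside task: true
  }
}
```

Because the marker is per-thread, the same JVM can host both a "driver" thread (no `TaskContext`) and "executor" task threads (with one), which is why the check also fires under local mode, as discussed below.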
Have you tested this under local mode?
Under local mode:

```scala
scala> sc.range(0, 1).foreach { _ => new SparkContext(new SparkConf().setAppName("test").setMaster("local")) }
java.lang.IllegalStateException: SparkContext should only be created and accessed on the driver.
...
```

Before this patch:

```scala
scala> sc.range(0, 1).foreach { _ => new SparkContext(new SparkConf().setAppName("test").setMaster("local")) }
org.apache.spark.SparkException: Only one SparkContext should be running in this JVM (see SPARK-2243). The currently running SparkContext was created at:
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:921)
...
```
Although the exception is different, it fails anyway.
I think the new error message is more reasonable.
I agree we should not create a SparkContext on an executor. We should make sure the approach doesn't break the case where the executor runs in the same JVM as the driver.
LGTM otherwise
+1, LGTM. Thank you, @ueshin .
Test build #124930 has started for PR 28986 at commit
The current run seems to hit a flaky test.
Retest this please.
The current run fails again at a different test case.
Retest this please.
Test build #124974 has finished for PR 28986 at commit
Test build #125015 has finished for PR 28986 at commit
LGTM too
retest this please
Test build #125049 has finished for PR 28986 at commit
retest this please
Test build #125083 has finished for PR 28986 at commit
Looks reasonable
retest this please
Test build #125240 has finished for PR 28986 at commit
Test build #125243 has finished for PR 28986 at commit
Jenkins, retest this please.
Test build #125263 has finished for PR 28986 at commit
retest this please
Test build #125264 has finished for PR 28986 at commit
retest this please
Test build #125265 has finished for PR 28986 at commit
retest this please
Test build #125285 has finished for PR 28986 at commit
Test build #125282 has finished for PR 28986 at commit
retest this please
Test build #125336 has finished for PR 28986 at commit
retest this please
Test build #125342 has finished for PR 28986 at commit
retest this please
Test build #125357 has finished for PR 28986 at commit
### What changes were proposed in this pull request?

This PR aims to disable SBT `unidoc` generation testing in the Jenkins environment because it's flaky there and not used for the official documentation generation. Also, GitHub Actions has the correct test coverage for the official documentation generation.

- #28848 (comment) (amp-jenkins-worker-06)
- #28926 (comment) (amp-jenkins-worker-06)
- #28969 (comment) (amp-jenkins-worker-06)
- #28975 (comment) (amp-jenkins-worker-05)
- #28986 (comment) (amp-jenkins-worker-05)
- #28992 (comment) (amp-jenkins-worker-06)
- #28993 (comment) (amp-jenkins-worker-05)
- #28999 (comment) (amp-jenkins-worker-04)
- #29010 (comment) (amp-jenkins-worker-03)
- #29013 (comment) (amp-jenkins-worker-04)
- #29016 (comment) (amp-jenkins-worker-05)
- #29025 (comment) (amp-jenkins-worker-04)
- #29042 (comment) (amp-jenkins-worker-03)

### Why are the changes needed?

Apache Spark's `release-build.sh` generates the official documentation using the following command:

- https://github.com/apache/spark/blob/master/dev/create-release/release-build.sh#L341

```bash
PRODUCTION=1 RELEASE_VERSION="$SPARK_VERSION" jekyll build
```

This in turn runs the following `unidoc` command for the Scala/Java API docs:

- https://github.com/apache/spark/blob/master/docs/_plugins/copy_api_dirs.rb#L30

```ruby
system("build/sbt -Pkinesis-asl clean compile unidoc") || raise("Unidoc generation failed")
```

However, the PR builder disabled the `Jekyll build` and instead has different test coverage:

```python
# determine if docs were changed and if we're inside the amplab environment
# note - the below commented out until *all* Jenkins workers can get `jekyll` installed
# if "DOCS" in changed_modules and test_env == "amplab_jenkins":
#     build_spark_documentation()
```

```
Building Unidoc API Documentation
========================================================================
[info] Building Spark unidoc using SBT with these arguments: -Phadoop-3.2 -Phive-2.3 -Pspark-ganglia-lgpl -Pkubernetes -Pmesos -Phadoop-cloud -Phive -Phive-thriftserver -Pkinesis-asl -Pyarn unidoc
```

### Does this PR introduce _any_ user-facing change?

No. (This is used only for testing and not used in the official doc generation.)

### How was this patch tested?

Pass the Jenkins without the doc generation invocation.

Closes #29017 from dongjoon-hyun/SPARK-DOC-GEN.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
Jenkins, retest this please.
Test build #125415 has finished for PR 28986 at commit
Merged to master and branch-3.0.
[SPARK-32160][CORE][PYSPARK] Disallow to create SparkContext in executors

### What changes were proposed in this pull request?

This PR proposes to disallow creating `SparkContext` in executors, e.g., in UDFs.

### Why are the changes needed?

Currently executors can create a SparkContext, but they shouldn't be able to:

```scala
sc.range(0, 1).foreach { _ =>
  new SparkContext(new SparkConf().setAppName("test").setMaster("local"))
}
```

### Does this PR introduce _any_ user-facing change?

Yes, users won't be able to create `SparkContext` in executors.

### How was this patch tested?

Added tests.

Closes #28986 from ueshin/issues/SPARK-32160/disallow_spark_context_in_executors.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit cfecc20)
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
Finally tests passed. Thanks all for triggering tests again and again!
```scala
 * Throws an exception if a SparkContext is about to be created in executors.
 */
private def assertOnDriver(): Unit = {
  if (TaskContext.get != null) {
```
@ueshin could you submit a follow-up PR to add a conf? In Spark 3.0, turn it off by default; in Spark 3.1 turn it on by default? Also add it to the migration guide?
Does this actually affect any legitimate use case that would otherwise work? This should be more of a fail-fast for things that would already fail.
This is a user error. However, users may use a Spark library in executors where the library itself calls SparkContext.getOrCreate. We should still let their workloads run if they worked in previous releases.
I see, but would that ever succeed, using the SparkContext that was created? I'm trying to work out what library would do this and not fail. Is it, perhaps, some boilerplate initialization code that gets executed on driver and executor, and it makes a SparkContext that is never actually used on the executor, but now it fails fast?
I get it, if that's the use case. A release note and/or hidden config to disable it might be an OK workaround.
Alternatively if arguing that this is not-uncommon, maybe we just don't do this at all, and revert it
BTW, had been discussing this offline with @ueshin and @HyukjinKwon - one use case that previously (and admittedly unfortunately) relied on the ability to create `SparkContext`s in the executors is MLflow's `mlflow.pyfunc.spark_udf` API (see docs & usage example), which provides a unified interface for scoring models trained in arbitrary ML frameworks (tensorflow, scikit-learn, pyspark.ml) on Spark DataFrames as a pandas UDF. Most ML frameworks (e.g. sklearn) are single-node frameworks that can operate on pandas dataframes, so applying them via pandas UDF works well. For a pyspark.ml model to be applied via pandas UDF, we need to convert the input pandas series -> Spark DataFrame on the executors, which requires a SparkContext on the executors.

I'll dig to see if there's a way to keep the MLflow use case working with this change (e.g. pyspark.ml models may ultimately perform inference via a Spark UDF, so maybe we can somehow extract the underlying UDF from the model and return that from `mlflow.pyfunc.spark_udf`?), but otherwise agree that we may want to revert this change.
I submitted a PR to add configs #29278.
Could you take a look at it when you are available?
I think it's a bit odd to have two separate configurations for Scala and Python. I don't think we would ever allow this in PySpark specifically in the future, even if it worked. I think we will discourage this behaviour anyway, whether in Scala or Python, and deprecate/remove both configurations together eventually. In which case would we disable it on the Scala side but enable it on the Python side?
In which case would we disable it in Scala side but enable it in Python side?
Like how MLflow uses it today. It's pretty hard to make it work correctly on the Scala side, but it's working in Python today.
I see, but I guess we should discourage/deprecate/remove the behaviour in the end - am I correct? It's a bit weird (to end users) to have two configurations controlling the same behaviour, and it seems unlikely that it should be disabled in Scala but enabled in Python specifically. We could just enable it on both sides.
I pushed a commit to use a single config.
… create SparkContext in executors

### What changes were proposed in this pull request?

This is a follow-up of #28986. This PR adds a config to switch allow/disallow to create `SparkContext` in executors.

- `spark.driver.allowSparkContextInExecutors`

### Why are the changes needed?

Some users or libraries actually create `SparkContext` in executors. We shouldn't break their workloads.

### Does this PR introduce _any_ user-facing change?

Yes, users will be able to create `SparkContext` in executors with the config enabled.

### How was this patch tested?

More tests are added.

Closes #29278 from ueshin/issues/SPARK-32160/add_configs.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
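As an illustration, with the follow-up applied, users who depend on the old behavior (such as the MLflow use case above) could opt back in via the config named in the PR. This is a sketch; the exact default and scope are as defined in #29278:

```
# spark-defaults.conf — re-enable creating SparkContext in executors (sketch)
spark.driver.allowSparkContextInExecutors true
```

The same property could equally be set on a `SparkConf` before the driver's context is created.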