[SPARK-24954][Core] Fail fast on job submit if run a barrier stage with dynamic resource allocation enabled #21915
@jiangxb1987, thanks! I am a bot who has found some folks who might be able to help with the review: @mateiz, @rxin and @kayousterhout
cc @squito
Test build #93791 has finished for PR 21915 at commit
*/
private def checkBarrierStageWithDynamicAllocation(rdd: RDD[_]): Unit = {
  if (rdd.isBarrier() && Utils.isDynamicAllocationEnabled(sc.getConf)) {
    throw new SparkException("Don't support run a barrier stage with dynamic resource " +
[SPARK-24942]: Barrier execution mode does not support dynamic resource allocation for now. You can disable dynamic resource allocation by setting Spark conf "spark.dynamicAllocation.enabled" to "false".
- Make the error message a constant to simplify test.
test("submit a barrier ResultStage with dynamic resource allocation enabled") {
  val conf = new SparkConf()
    .set("spark.dynamicAllocation.enabled", "true")
Does it work if we use LocalSparkContext? nvm, I was thinking MLlibTestSparkContext
...
thanks @jiangxb1987, lgtm aside from @mengxr's comments
Test build #93877 has finished for PR 21915 at commit
retest this please
Test build #93889 has finished for PR 21915 at commit
retest this please
Test build #93910 has finished for PR 21915 at commit
retest this please
LGTM pending jenkins. Was there a JIRA reported about the KafkaContinuousSourceSuite failure? cc: @tedyu?
Test build #93931 has finished for PR 21915 at commit
retest this please
Test build #93971 has finished for PR 21915 at commit
retest this please
Test build #94006 has finished for PR 21915 at commit
…allocation enabled
663b900 to f3ea9c6
Test build #94089 has finished for PR 21915 at commit
Test build #94090 has finished for PR 21915 at commit
test this please
Test build #94115 has finished for PR 21915 at commit
retest this please
Test build #94131 has finished for PR 21915 at commit
LGTM. Merged into master. Thanks!
What changes were proposed in this pull request?
We don't support running a barrier stage with dynamic resource allocation enabled, because it leads to confusing behavior. For example, with dynamic resource allocation enabled, we may acquire some executors (but not enough to launch all the tasks in a barrier stage), later release them when the executor idle timeout expires, and then acquire executors again.
We perform the check on job submit and fail fast if the job runs a barrier stage with dynamic resource allocation enabled.
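To make the intent of the check concrete, here is a minimal stand-alone sketch that does not depend on Spark. The `Stage` case class, `UnsupportedBarrierJob`, and all helper names are hypothetical stand-ins and not the code in this patch; only the config key and the error message text come from the review thread. The real `Utils.isDynamicAllocationEnabled` may consider more than this one key, so treat the conf lookup as a simplification.

```scala
// Stand-alone model of the fail-fast check; every name here is hypothetical
// except the config key and the error message quoted in the review.
object BarrierCheckSketch {
  val DynamicAllocationKey = "spark.dynamicAllocation.enabled"

  val ErrorMessage: String =
    "[SPARK-24942]: Barrier execution mode does not support dynamic resource allocation for " +
      "now. You can disable dynamic resource allocation by setting Spark conf " +
      "\"spark.dynamicAllocation.enabled\" to \"false\"."

  // Hypothetical stand-in for a stage and its parent stages.
  final case class Stage(name: String, isBarrier: Boolean, parents: Seq[Stage] = Nil)

  final class UnsupportedBarrierJob(msg: String) extends RuntimeException(msg)

  // Simplified: only looks at the single config key, defaulting to disabled.
  def dynamicAllocationEnabled(conf: Map[String, String]): Boolean =
    conf.getOrElse(DynamicAllocationKey, "false").toBoolean

  // True if the stage or any of its ancestors is a barrier stage.
  def containsBarrier(stage: Stage): Boolean =
    stage.isBarrier || stage.parents.exists(containsBarrier)

  // Runs at submit time, before any resources are requested for the job.
  def checkOnSubmit(finalStage: Stage, conf: Map[String, String]): Unit =
    if (containsBarrier(finalStage) && dynamicAllocationEnabled(conf)) {
      throw new UnsupportedBarrierJob(ErrorMessage)
    }
}
```

Rejecting the job at submit time means the user sees one clear error instead of watching executors being acquired and released repeatedly.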
How was this patch tested?
Added a new test suite, BarrierStageOnSubmittedSuite, to cover all the fail-fast cases in which a submitted job contains one or more barrier stages.
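The shape of such fail-fast tests can be sketched without Spark or a test framework. Everything below is hypothetical: `submit` is a toy stand-in for job submission, `ErrorMsg` is a placeholder string, and `failsFast` is an illustrative helper. The point is only that comparing against a shared constant, as suggested in review, keeps each test case to one line.

```scala
import scala.util.{Failure, Try}

object BarrierSubmitTestSketch {
  // Hypothetical shared constant, so tests compare against it instead of a copied string.
  val ErrorMsg = "barrier stage with dynamic resource allocation is not supported"

  // Hypothetical stand-in for job submission: one Boolean per stage (true = barrier).
  def submit(stagesAreBarrier: Seq[Boolean], dynamicAllocation: Boolean): Int = {
    if (stagesAreBarrier.contains(true) && dynamicAllocation) {
      throw new IllegalStateException(ErrorMsg)
    }
    stagesAreBarrier.size // pretend every stage ran
  }

  // True only when the job is rejected with exactly the shared message.
  def failsFast(job: => Int): Boolean = Try(job) match {
    case Failure(e: IllegalStateException) => e.getMessage == ErrorMsg
    case _ => false
  }

  def main(args: Array[String]): Unit = {
    // A single barrier stage, and a barrier stage mixed with a normal one.
    assert(failsFast(submit(Seq(true), dynamicAllocation = true)))
    assert(failsFast(submit(Seq(false, true), dynamicAllocation = true)))
    // Jobs that should be accepted.
    assert(!failsFast(submit(Seq(true), dynamicAllocation = false)))
    assert(!failsFast(submit(Seq(false), dynamicAllocation = true)))
  }
}
```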