[SPARK-28769][CORE] Improve warning message of BarrierExecutionMode when required slots > maximum slots #25487
sarutak wants to merge 3 commits into apache:master
Conversation
Test build #109298 has finished for PR 25487 at commit
cc: @jiangxb1987
core/src/main/scala/org/apache/spark/scheduler/BarrierJobAllocationFailed.scala
| s" (Retry ${retryCount}/$maxFailureNumTasksCheck failed)" | ||
| } | ||
|
|
||
| logWarning(s"The job $jobId requires to run a barrier stage " + |
There was a problem hiding this comment.
I think this needs to be rewritten a little, and it doesn't need the special case above:
Barrier stage in job $jobId requires ${...} slots, but only ${...} are available. Failure ${numCheckFailures} / ${maxFailureNumTasksCheck}
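For illustration, a minimal sketch in plain Scala of the suggested format; the value names are hypothetical stand-ins for state the DAGScheduler would have in scope, not identifiers from the PR:

    // Hypothetical stand-ins for DAGScheduler state.
    val jobId = 0
    val numPartitions = 3           // slots the barrier stage requires
    val maxNumConcurrentTasks = 2   // slots currently available
    val numCheckFailures = 1
    val maxFailureNumTasksCheck = 3

    // Prints: Barrier stage in job 0 requires 3 slots, but only 2 are available. Failure 1 / 3
    println(s"Barrier stage in job $jobId requires $numPartitions slots, " +
      s"but only $maxNumConcurrentTasks are available. " +
      s"Failure $numCheckFailures / $maxFailureNumTasksCheck")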
My first idea was similar to what you suggest, but if we do so, we get the following messages:
19/08/22 01:48:34 WARN DAGScheduler: Barrier stage in job 0 requires 3 slots, but only 2 are available. Failure 1 / 3
19/08/22 01:48:49 WARN DAGScheduler: Barrier stage in job 0 requires 3 slots, but only 2 are available. Failure 2 / 3
19/08/22 01:49:04 WARN DAGScheduler: Barrier stage in job 0 requires 3 slots, but only 2 are available. Failure 3 / 3
19/08/22 01:49:19 WARN DAGScheduler: Barrier stage in job 0 requires 3 slots, but only 2 are available. Failure 4 / 3
Failure 4 / 3 looks weird. Another solution would be to use maxFailureNumTasksCheck + 1 as the maximum number of attempts.
How about rendering a message like "Will retry up to x more times"?
Hmm... it would be like the following?
19/08/22 02:54:37 WARN DAGScheduler: Barrier stage in job 0 requires 3 slots, but only 2 are available. Will retry up to 3 more times
19/08/22 02:54:52 WARN DAGScheduler: Barrier stage in job 0 requires 3 slots, but only 2 are available. Will retry up to 2 more times
19/08/22 02:55:07 WARN DAGScheduler: Barrier stage in job 0 requires 3 slots, but only 2 are available. Will retry up to 1 more times
19/08/22 02:55:22 WARN DAGScheduler: Barrier stage in job 0 requires 3 slots, but only 2 are available. Will retry up to 0 more times
org.apache.spark.scheduler.BarrierJobSlotsNumberCheckFailed: [SPARK-24819]: Barrier execution mode does not allow run a barrier stage that requires more slots than the total number of slots in the cluster currently. Please init a new cluster with more CPU cores or repartition the input RDD(s) to reduce the number of slots required to run this barrier stage.
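For reference, a minimal standalone sketch of the countdown behind these messages; the `maxFailureNumTasksCheck + 1 - numCheckFailures` formula is an assumption inferred from the sample output above, not the PR's exact code:

    // Standalone sketch of the countdown variant. The remaining-retries
    // arithmetic is inferred from the log lines above (failures 1..4 map
    // to "3, 2, 1, 0 more times" when maxFailureNumTasksCheck is 3).
    object RetryCountdownSketch {
      def main(args: Array[String]): Unit = {
        val maxFailureNumTasksCheck = 3
        for (numCheckFailures <- 1 to maxFailureNumTasksCheck + 1) {
          val retriesLeft = maxFailureNumTasksCheck + 1 - numCheckFailures
          println(s"Barrier stage in job 0 requires 3 slots, but only 2 are " +
            s"available. Will retry up to $retriesLeft more times")
        }
      }
    }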
Heh, OK, I accept that's a little odd too, but I think it's understandable enough and a little less complex to deal with.
core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
Test build #109507 has finished for PR 25487 at commit
      private def checkBarrierStageWithNumSlots(rdd: RDD[_]): Unit = {
    -   if (rdd.isBarrier() && rdd.getNumPartitions > sc.maxNumConcurrentTasks) {
    -     throw new BarrierJobSlotsNumberCheckFailed
    +   lazy val numPartitions = rdd.getNumPartitions
Oh wait, one last little thing: why are these local vars lazy? They're cheap to access. Even if not, just nest two if statements below to conditionally access them.
Yes, they are cheap to access, especially rdd.getNumPartitions. It's just to avoid an unused variable and method call in case rdd.isBarrier() returns false.
I have no strong preference among making them lazy, nesting if statements, or just removing the lazy modifier.
Just out of curiosity: does making those variables lazy add overhead or hurt readability?
Anyway, I'll remove lazy.
I'm actually not sure what, if anything, it does for a local. It introduces another function and a check on each access though, so it probably isn't even saving anything.
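Putting the thread's resolution together, the method presumably ends up roughly like the following sketch; the two-argument BarrierJobSlotsNumberCheckFailed constructor and the exact shape are inferred from this discussion, not copied from the merged patch:

    // Sketch of the method inside DAGScheduler (where `sc` is the
    // SparkContext), with the lazy modifiers removed as agreed above.
    private def checkBarrierStageWithNumSlots(rdd: RDD[_]): Unit = {
      val numPartitions = rdd.getNumPartitions
      val maxNumConcurrentTasks = sc.maxNumConcurrentTasks
      if (rdd.isBarrier() && numPartitions > maxNumConcurrentTasks) {
        // Carries both counts so the caller can build the detailed warning.
        throw new BarrierJobSlotsNumberCheckFailed(numPartitions, maxNumConcurrentTasks)
      }
    }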
Test build #109586 has finished for PR 25487 at commit
Merged to master
What changes were proposed in this pull request?
Improved the warning message in Barrier Execution Mode when the required slots exceed the maximum slots.
The new message contains the number of required slots, the maximum number of slots, and how many retries have failed.
Why are the changes needed?
Providing users with the number of required slots, the maximum number of slots, and how many retries have failed may help them decide what to do:
for example, whether to keep waiting for a retry to succeed or to kill the job.
Does this PR introduce any user-facing change?
Yes.
If spark.scheduler.barrier.maxConcurrentTasksCheck.maxFailures=3, we get the following warning messages.
Before applying this change:

    WARN DAGScheduler: The job 0 requires to run a barrier stage that requires more slots than the total number of slots in the cluster currently.

After applying this change:

    WARN DAGScheduler: Barrier stage in job 0 requires 3 slots, but only 2 are available. Will retry up to 3 more times
    WARN DAGScheduler: Barrier stage in job 0 requires 3 slots, but only 2 are available. Will retry up to 2 more times
    WARN DAGScheduler: Barrier stage in job 0 requires 3 slots, but only 2 are available. Will retry up to 1 more times
    WARN DAGScheduler: Barrier stage in job 0 requires 3 slots, but only 2 are available. Will retry up to 0 more times
How was this patch tested?
I tested this manually using the Spark shell with the following configuration and script, and then checked the log messages.
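The original configuration and script are not preserved in this excerpt; a hypothetical reproduction along these lines (a local[2] master, so 2 slots, and a 3-partition barrier stage) would trigger the new warning:

    $ bin/spark-shell --master local[2] \
        --conf spark.scheduler.barrier.maxConcurrentTasksCheck.maxFailures=3

    // In the shell: 3 barrier partitions but only 2 local slots, so the
    // slot check fails and the warning is logged on each retry until the
    // job is aborted with BarrierJobSlotsNumberCheckFailed.
    import org.apache.spark.BarrierTaskContext

    val rdd = sc.parallelize(1 to 3, numSlices = 3)
    rdd.barrier().mapPartitions { iter =>
      BarrierTaskContext.get().barrier()
      iter
    }.collect()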