[SPARK-29177] [Core] fix zombie tasks after stage abort #25850
Conversation
retest this please.
Test build #110991 has finished for PR 25850 at commit
Test build #110993 has finished for PR 25850 at commit
Test build #110999 has finished for PR 25850 at commit
Test build #111040 has finished for PR 25850 at commit
Test build #111042 has finished for PR 25850 at commit
@xuanyuanking Could you please help review this?
Thanks for pinging me. I think it makes sense to handle a successful task as a killed task for resource cleanup; we did the same thing in `TaskSetManager.handleSuccessfulTask` for speculative tasks.
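The speculative-task analogy mentioned above can be modeled roughly as follows. This is a hypothetical, self-contained sketch: `Attempt` and `markSuccessful` are illustrative names, not Spark's real API; in Spark itself this logic lives in `TaskSetManager.handleSuccessfulTask`.

```scala
object SpeculativeDemo {
  // A simplified stand-in for a task attempt; Spark tracks far more state.
  final case class Attempt(id: Int, var state: String = "RUNNING")

  // When one speculative copy of a task finishes, the remaining running
  // copies are killed so their executors/containers can be freed.
  def markSuccessful(attempts: Seq[Attempt], winner: Int): Unit =
    attempts.foreach { a =>
      if (a.id == winner) a.state = "SUCCESS"
      else if (a.state == "RUNNING") a.state = "KILLED"
    }

  def main(args: Array[String]): Unit = {
    val attempts = Seq(Attempt(0), Attempt(1), Attempt(2))
    markSuccessful(attempts, winner = 1)
    attempts.foreach(a => println(s"attempt ${a.id}: ${a.state}"))
  }
}
```

The point of the analogy: in both cases a task that will never deliver a usable result to the scheduler still has to go through an explicit end-of-task path, or its resources stay pinned.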
@@ -64,6 +64,8 @@ private[spark] class TaskResultGetter(sparkEnv: SparkEnv, scheduler: TaskSchedul
        val (result, size) = serializer.get().deserialize[TaskResult[_]](serializedData) match {
          case directResult: DirectTaskResult[_] =>
            if (!taskSetManager.canFetchMoreResults(serializedData.limit())) {
+             scheduler.handleFailedTask(taskSetManager, tid, TaskState.KILLED, TaskKilled(
How about directly calling `taskSetManager.handleFailedTask` here? If `canFetchMoreResults` returns false, `taskSetManager.isZombie` has already been set to true, so `scheduler.handleFailedTask` ends up doing the same thing as `taskSetManager.handleFailedTask`, and calling the latter would make the UT easier to write.
Calling `scheduler.handleFailedTask` keeps this consistent with the other cases in this function.
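The fix being discussed can be sketched with a self-contained toy model. This is hypothetical code, not Spark's implementation: `ResultSizeDemo`, `Scheduler`, and `enqueue` are illustrative names, and the real logic lives in `TaskResultGetter` / `TaskSchedulerImpl` with far more state. The essential shape, though, is that an oversized result is routed through the failed-task path with a killed reason instead of being dropped silently.

```scala
object ResultSizeDemo {
  sealed trait TaskEndReason
  case object Success extends TaskEndReason
  final case class Killed(reason: String) extends TaskEndReason

  // Toy stand-in for the scheduler: tracks total fetched result bytes
  // against a cap, and records how each task ended.
  final class Scheduler(val maxResultSize: Long) {
    private var fetched = 0L
    var ended: List[(Long, TaskEndReason)] = Nil

    // Mirrors the role of canFetchMoreResults: accumulate result sizes
    // and refuse once the total passes the cap.
    def canFetchMoreResults(size: Long): Boolean = {
      fetched += size
      fetched <= maxResultSize
    }

    def handleSuccessfulTask(tid: Long): Unit =
      ended = (tid, Success) :: ended

    // The fix: an oversized result still goes through the failed-task
    // path, so the task is accounted for and its resources reclaimed.
    def handleFailedTask(tid: Long, reason: String): Unit =
      ended = (tid, Killed(reason)) :: ended
  }

  def enqueue(sched: Scheduler, tid: Long, size: Long): Unit =
    if (!sched.canFetchMoreResults(size)) {
      // Before SPARK-29177 this branch returned without notifying the
      // scheduler, leaving a "zombie" task that pinned its container.
      sched.handleFailedTask(tid, s"result exceeded maxResultSize (${sched.maxResultSize})")
    } else {
      sched.handleSuccessfulTask(tid)
    }

  def main(args: Array[String]): Unit = {
    val sched = new Scheduler(maxResultSize = 100L)
    enqueue(sched, tid = 1L, size = 60L) // fits: handled as success
    enqueue(sched, tid = 2L, size = 60L) // total 120 > 100: handled as killed
    sched.ended.reverse.foreach(println)
  }
}
```

Whether the call goes through the scheduler or the `TaskSetManager` directly is the point debated above; either way, the key change is that the "over the cap" branch now ends the task explicitly.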
core/src/test/scala/org/apache/spark/scheduler/TaskResultGetterSuite.scala
Test build #111119 has finished for PR 25850 at commit
Just one nit.
LGTM, cc @jiangxb1987 @cloud-fan
@@ -64,6 +64,8 @@ private[spark] class TaskResultGetter(sparkEnv: SparkEnv, scheduler: TaskSchedul
        val (result, size) = serializer.get().deserialize[TaskResult[_]](serializedData) match {
          case directResult: DirectTaskResult[_] =>
            if (!taskSetManager.canFetchMoreResults(serializedData.limit())) {
+             scheduler.handleFailedTask(taskSetManager, tid, TaskState.KILLED, TaskKilled(
Better to leave a comment here to explain why we handle the oversized task as a killed task.
Updated, thanks.
add comments
Test build #111204 has finished for PR 25850 at commit
### What changes were proposed in this pull request?
Do task end handling even if the task result exceeds the configured `maxResultSize`. More details are in the JIRA description: https://issues.apache.org/jira/browse/SPARK-29177

### Why are the changes needed?
Without this patch, the zombie tasks will prevent YARN from recycling the containers running those tasks, which affects other applications.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Unit test, and production test with a very large `SELECT` in the Spark Thriftserver.

Closes #25850 from adrian-wang/zombie.

Authored-by: Daoyuan Wang <me@daoyuan.wang>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit c08bc37)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Thanks, merging to master/2.4!
@cloud-fan @adrian-wang Oops, looks like this doesn't compile in 2.4:
Want to revert it or just hot-fix forward? It may be pretty easy.
@srowen thanks for catching! I've pushed a commit to fix it. |