[SPARK-23888][CORE] correct the comment of hasAttemptOnHost() #20998
Conversation
ping @pwendell @kayousterhout. Please help review, thanks :)
Jenkins, ok to test
Sounds fair, but shouldn't this be up to the scheduler backend? Can multiple tasks/attempts run simultaneously on the same physical host?
Hi @felixcheung, thanks for triggering a test and for your comments.
Actually, it is
I think multiple task attempts (actually, speculative tasks) can run on the same physical host, but not simultaneously, as long as there's no running attempt on it. In the PR description, I illustrate a case in which a speculative task chose to run on a host where a previous attempt had run and eventually failed. I think if the task's failure is not relevant to the host, running on the same host again can be acceptable.
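For reference, a minimal sketch of the behavior being proposed (a paraphrase, not the exact PR diff; it assumes the per-attempt TaskInfo exposes host and running, as in the test excerpts below):

// Sketch: treat a host as occupied only while an attempt of this task is
// still running there, so a host holding only failed/finished attempts
// becomes eligible for a new speculative copy.
private def hasAttemptOnHost(taskIndex: Int, host: String): Boolean = {
  taskAttempts(taskIndex).exists(info => info.host == host && info.running)
}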
Test build #89026 has finished for PR 20998 at commit
Adding isRunning means a single 'bad' node (from the task's point of view - not necessarily bad hardware: just a node the task keeps failing on) can cause the task to fail repeatedly, eventually causing the app to exit. Particularly with blacklisting, I am not very sure how the interactions will play out... @squito might have more comments. In the specific use case of only two machines, it is an unfortunate side effect.
Hi @mridulm, thanks for your comment. Actually, I share the same worry. Maybe we can make this change as a second choice for
This change certainly makes it agree with the comment, so I think we should either make this change, or change the comment.
Blacklisting should still work as expected. dequeueSpeculativeTask also checks the blacklist, so if host1 is blacklisted, you'll still skip it. But with blacklisting off, it's a more significant change. Even on a large cluster, I can imagine this happening most of the time when the non-speculative task fails, due to locality preferences.
Basically there is a behavior choice: Should a speculative task ever be allowed to run on a host where the task has failed previously? I think it should, as that is better handled by blacklisting.
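To make that interaction concrete, here is a rough sketch of how the speculative dequeue path combines the two checks (reconstructed from memory of TaskSetManager around this version; the exact names and structure are an assumption, not the code under review):

// Inside dequeueSpeculativeTask(execId, host, locality): a speculative copy
// of task `index` is only offered to (execId, host) if no attempt of that
// task is on the host AND the blacklist does not exclude it there.
def canRunOnHost(index: Int): Boolean = {
  !hasAttemptOnHost(index, host) &&
    !isTaskBlacklistedOnExecOrNode(index, execId, host)
}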
// there's already a running copy.
clock.advance(1000)
info1.finishTime = clock.getTimeMillis()
assert(info1.running === false)
assert(!info1.running)
// no more running copy of task0
assert(manager.resourceOffer("execA", "host1", PROCESS_LOCAL).get.index === 0)
val info3 = manager.taskAttempts(0)(0)
assert(info3.running === true)
assert(info3.running)
// after a long long time, task0.0 failed, and task0.0 can not re-run since
// there's already a running copy.
clock.advance(1000)
info1.finishTime = clock.getTimeMillis()
It would be better here for you to call manager.handleFailedTask, to more accurately simulate the real behavior; it also makes the purpose of the test a little clearer.
nice suggestion.
You shouldn't need to set info.finishTime anymore; that should be taken care of by manager.handleFailedTask.
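A hedged sketch of what that suggestion could look like in the test (the failure reason and surrounding assertions in the final test may differ):

// Drive the failure through the manager instead of mutating the TaskInfo
// directly; handleFailedTask is expected to set finishTime itself.
clock.advance(1000)
manager.handleFailedTask(info1.taskId, TaskState.FAILED, TaskResultLost)
assert(!info1.running)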
Hi @squito. Thanks for the review and comments.
Test build #89164 has finished for PR 20998 at commit
@mridulm more thoughts? I think this is the right change but I will leave it open for a bit to get more input.
test("speculative task should not run on a given host where another attempt " + | ||
"is already running on") { | ||
test("SPARK-23888: speculative task should not run on a given host " + | ||
"where another attempt is already running on") { |
I'd reword this to be a bit more specific to what you're trying to test:
speculative task cannot run on host with another running attempt, but can run on a host with a failed attempt.
Sure. Also, do we need to reword the PR and JIRA title? @squito
@squito My concern is that, in large workloads, some nodes simply become bad for some tasks (transient environment or hardware issues, colocated containers, etc.) while being fine for others; speculative tasks should alleviate performance concerns, not increase the chance of application failure due to locality preference affinity. For very small clusters, speculative execution is less relevant than for large ones - and here we are tuning for the former.
Test build #89228 has finished for PR 20998 at commit
I'm not even really concerned about the case of two hosts - I agree it's fine if we do something sub-optimal there. I'm more concerned about code clarity and the behavior in general. It seems cleaner to me if speculation doesn't worry about where it has failed before, and those exclusions are left to the blacklist. But it sounds like you're saying the prior behavior was really desirable - you think it's better if speculation always excludes hosts the task has ever failed on? I'm happy to defer to your opinion on this, as I haven't really stressed speculative execution yet. Then let's just change that comment in the code to be consistent.
@squito I completely agree that the comment is inaccurate.
@Ngone51 can you instead leave the behavior as is, and just update the comment? Sorry that it's going to be a small change in the end, after all the extra work the bad comments led you to do, but I still appreciate you noticing this and fixing it. A good PR with a quality test too.
Will do, and it's okay.
Test build #89460 has finished for PR 20998 at commit
LGTM
@@ -287,7 +287,7 @@ private[spark] class TaskSetManager(
       None
     }

-  /** Check whether a task is currently running an attempt on a given host */
+  /** Check whether a task once run an attempt on a given host */
Should this be "once ran"?
Yes. Thank you.
Test build #89728 has finished for PR 20998 at commit
Merged to master, thanks @Ngone51. I also updated the commit message a bit before committing; I thought it best to focus on the eventual change and figured it wasn't worth bugging you for another update cycle.
Agreed, and thank you @squito. And thanks to all of you: @felixcheung @mridulm @jiangxb1987 @srowen
What changes were proposed in this pull request?
There's a bug in hasAttemptOnHost():
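A sketch of the method in question, reconstructed from the comment diff discussed above (the body is assumed to match the existing implementation):

/** Check whether a task is currently running an attempt on a given host */
private def hasAttemptOnHost(taskIndex: Int, host: String): Boolean = {
  // The body matches any past attempt on the host, running or not, which
  // is what this PR points out.
  taskAttempts(taskIndex).exists(_.host == host)
}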
This causes hosts which only have finished attempts to be skipped for speculative tasks, so, to match the comment, we should check whether an attempt is currently running on the given host.
With the proposed check, it would also become possible for a speculative task to run on a host where another attempt had failed before.
Assume we have only two machines: host1 and host2. We first run task0.0 on host1. Then, after a long wait for task0.0, we launch a speculative task0.1 on host2. Eventually, task0.0 fails on host1, but it cannot be re-run since there's already a copy running on host2. After another long wait, we launch a new speculative task0.2, and now we can run task0.2 on host1 again, since there's no longer a running attempt on host1.
After discussion, we simply make the comment consistent with the method's behavior.
How was this patch tested?
A unit test was added.