
[SPARK-29976][CORE] Trigger speculation for stages with too few tasks #26614

Closed
wants to merge 13 commits

Conversation

@yuchenhuo (Contributor) commented Nov 21, 2019

What changes were proposed in this pull request?

This PR adds an optional Spark conf for speculation to allow speculative runs for stages where there are only a few tasks.

spark.speculation.task.duration.threshold

If provided, tasks will be speculatively run when the TaskSet contains fewer tasks than the number of slots on a single executor and a task is taking longer than the threshold.
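
For illustration, a minimal sketch of setting the new conf alongside the existing speculation flag (the values below are arbitrary examples, not recommendations):

    import org.apache.spark.SparkConf

    // Enable speculation and the new duration threshold (example values only).
    // "30min" is arbitrary; per the config definition discussed below, a bare
    // number with no unit would be interpreted as milliseconds.
    val conf = new SparkConf()
      .set("spark.speculation", "true")
      .set("spark.speculation.task.duration.threshold", "30min")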

Why are the changes needed?

This change helps avoid the scenario where a single executor hangs forever due to a disk issue, the single task in a TaskSet is unfortunately assigned to that executor, and the whole job therefore hangs forever.

Does this PR introduce any user-facing change?

Yes. If the new config spark.speculation.task.duration.threshold is provided, the TaskSet contains fewer tasks than the number of slots on a single executor, and a task is taking longer than the threshold, then speculative tasks will be submitted for the running tasks in the TaskSet.

How was this patch tested?

Unit tests are added to TaskSetManagerSuite.

@jiangxb1987 (Contributor) commented Nov 21, 2019

test this please

@jiangxb1987 (Contributor) left a comment

Looks good, only some minor comments.

private def testSingleTaskSpeculation(singleTaskEnabled: Boolean): Unit = {
  sc = new SparkContext("local", "test")
  // Set the speculation multiplier to be 0 so speculative tasks are launched immediately
  sc.conf.set(config.SPECULATION_MULTIPLIER, 0.0)

@jiangxb1987 (Contributor) Nov 21, 2019

Why do we need to set this config here?

@yuchenhuo (Author, Contributor) Nov 21, 2019

yeah, this is not really useful. I can remove it

  sc.conf.set(config.SPECULATION_ENABLED, true)
  sc.conf.set(config.SPECULATION_SINGLETASKSTAGE_ENABLED, singleTaskEnabled)
  // Set the threshold to be 60 minutes
  sc.conf.set(config.SPECULATION_SINGLETASKSTAGE_DURATION_THRESHOLD.key, "60min")

@jiangxb1987 (Contributor) Nov 21, 2019

nit: why not keep it the same as the default value?

@yuchenhuo (Author, Contributor) Nov 21, 2019

This would validate that setting the conf actually works. IMO it's better than testing the default value. I can add another test to test the default value is 30 if that's preferred.

@jiangxb1987 (Contributor) Nov 21, 2019

sounds good, let's keep the current way


@SparkQA commented Nov 21, 2019

Test build #114180 has finished for PR 26614 at commit bc0f7e1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@squito (Contributor) commented Nov 22, 2019

I understand the case you're trying to address, but special-casing one task doesn't feel right to me. If there are two tasks, couldn't you end up in the same situation, with two tasks running on the same bad executor? Would it make sense to have this as just a global min time for speculation?

OTOH, I do know that 1 task is common, especially when doing a final aggregation. It's not a bad solution, I just don't know if I want to be stuck w/ this conf.

@yuchenhuo (Author, Contributor) commented Nov 23, 2019

@squito Good point! If I understand correctly, the better way to do this is to have a flag like spark.speculation.aggressive and maybe check the elapsed time of all the running tasks and, if it exceeds the threshold, speculatively run the task. Do you think it's necessary to additionally check whether the tasks are running on the same executor and only speculate if so?

@cloud-fan (Contributor) commented Nov 25, 2019

We need finished tasks to have an estimation of task duration. I agree with @squito that the only missing thing is a user-supplied task duration. Can we only have one config to specify the task duration for speculation?
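
For context, a simplified sketch of the existing heuristic this refers to (illustrative names and values; the real logic lives in TaskSetManager.checkSpeculatableTasks): the threshold comes from the median duration of successful tasks, which is why finished tasks are needed.

    // Illustrative only: how a duration estimate is derived from finished tasks.
    val successfulDurations = Seq(120000L, 90000L, 150000L)   // ms, from finished tasks
    val speculationMultiplier = 1.5
    val medianDuration = successfulDurations.sorted.apply(successfulDurations.size / 2)
    val threshold = speculationMultiplier * medianDuration    // speculate tasks running longer than this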

yuchenhuo added 3 commits Nov 25, 2019
        foundTasks |= checkAndSubmitSpeculatableTask(tid, time, threshold)
      }
    }
    if (speculationTaskDurationThresOpt.isDefined) {

@yuchenhuo (Author, Contributor) Nov 25, 2019

There should be a way to combine the two if conditions here, but it might make the logic a bit more complicated. Not sure if it's worth doing.

@yuchenhuo (Author, Contributor) commented Nov 25, 2019

@cloud-fan @squito I've updated the PR according to your comments. Though I guess the "ideal" way to do this is to have a mechanism to always speculatively run if the threshold is exceeded && there are some empty slots, but I'm not sure that's necessary, since if there are no empty slots the speculatable tasks wouldn't be run anyway.

private[spark] val SPECULATION_TASK_DURATION_THRESHOLD =
  ConfigBuilder("spark.speculation.task.duration.threshold")
    .timeConf(TimeUnit.MILLISECONDS)
    .createOptional

@cloud-fan (Contributor) Nov 25, 2019

can we add a simple .doc to explain what this config does?

@cloud-fan (Contributor) Nov 25, 2019

yes please

if (speculationTaskDurationThresOpt.isDefined) {
  val time = clock.getTimeMillis()
  val threshold = speculationTaskDurationThresOpt.get
  logDebug("Checking tasks taking long time than provided speculation threshold: " + threshold)

@viirya (Member) Nov 25, 2019

taking long time -> taking longer time

@viirya (Member) commented Nov 25, 2019

Please also update the title and description accordingly.

private[spark] val SPECULATION_TASK_DURATION_THRESHOLD =
  ConfigBuilder("spark.speculation.task.duration.threshold")
    .doc("Task duration after which scheduler would try to speculative run the task. If " +
      "provided, tasks would be rerun as long as they exceed the threshold no matter whether" +

@tgravescs (Contributor) Nov 25, 2019

should change rerun to something like speculatively run.

@tgravescs (Contributor) commented Nov 25, 2019

So while I agree that this could easily happen for 2 tasks instead of 1, in cases like when both are put on the same executor, if you make this apply all the time (or to more than 1 task) then you have to estimate your worst-case timeout across the entire application. So if you have stages with 10000 tasks that take a long time, and then another stage with 1 task that takes a shorter time, you have to set the config to the longer time. The 10000-task stage I would think you would want the normal speculation configs to apply to. Perhaps we either want a config for the max number of tasks to apply it to, or we make it smarter and say apply it when you only have a single executor or when tasks <= what a single executor can fit.

@tgravescs (Contributor) commented Nov 25, 2019

Note I realize there are a bunch of corner cases here, but I take back my suggestion of the number of executors: you could have 1 executor but run 10000 tasks on it, and you still might want to use the regular logic. I think our best bet is to either leave it at 1, add a config, or use tasks <= the number that can fit on a single executor; the last option makes the most sense to me.

@squito (Contributor) commented Nov 25, 2019

Yeah, I am also torn like @tgravescs. There are a ton of corner cases. I really don't like special-casing one task, but I'm not sure of a clean way to configure this. Even with 10 tasks, if you've got 4 cores per executor you've easily got 4 tasks stuck on your one bad executor, and with a default speculation quantile of 0.75 you wouldn't finish 8 tasks successfully to start speculation. If you add in the fact that the poor performance may be across an entire node, and 64 cores per node is not uncommon, the limit goes way higher.

Speculative execution is always a heuristic, we know it's not going to be perfect. I feel like when you enable speculation, you are saying you're willing to accept some wasted resources, so it's more acceptable to run some speculative tasks when you don't really need to. But how much waste is OK? In Tom's example, say you had 10k tasks that each took an hour, but all are actually running fine -- the waste is pretty serious, you'll launch a speculative version of each task so it's 10k cpu-hours wasted.

One alternative might be to only have this kick in when all tasks are running on the same host (the TSM already knows the hosts of the running tasks, it's in TaskInfo, it would be easy to see if there is just one host used across all tasks).
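
As a rough sketch of that alternative (not adopted in this PR), assuming the TaskSetManager's existing runningTasksSet and taskInfos members and written here as a standalone helper:

    import scala.collection.mutable
    import org.apache.spark.scheduler.TaskInfo

    // Hypothetical helper: true when every currently running task in the TaskSet
    // has been scheduled on the same host (via TaskInfo.host).
    def allRunningTasksOnSameHost(
        runningTasksSet: mutable.HashSet[Long],
        taskInfos: mutable.HashMap[Long, TaskInfo]): Boolean = {
      val hosts = runningTasksSet.flatMap(tid => taskInfos.get(tid).map(_.host))
      hosts.nonEmpty && hosts.size == 1
    }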

@yuchenhuo yuchenhuo changed the title [SPARK-29976] New conf for single task stage speculation [SPARK-29976][CORE] New conf for single task stage speculation Nov 25, 2019
@yuchenhuo (Author, Contributor) commented Nov 25, 2019

@tgravescs @squito Looks like there are two potential solutions: (1) we check if all of the tasks are running on the same executor first, and if so do the time-threshold speculative run; (2) add another conf specifying a cap on the number of total tasks in the stage under which we would trigger the time-threshold check.

I think both of the solutions solve the problem I'm hitting, but option (2) seems more configurable and handles slightly more corner cases (e.g. multiple problematic nodes).

@jiangxb1987 (Contributor) commented Nov 25, 2019

+1 for restricting the total running tasks <= the number of slots in one executor. This ensures that when there are suspicious tasks that didn't finish after a while (most likely hung) we could start speculative runs; on the other hand, it ensures the speculative tasks started won't waste more resources than one executor's worth.

@tgravescs (Contributor) commented Nov 25, 2019

So with the all-tasks-running-on-the-same-host solution, the behavior might change partway through the stage. It might start out with all tasks on the same executor and use this new timeout config, but then you get more executors and it changes to the other speculative configs. This might be confusing to users.

@tgravescs (Contributor) commented Nov 26, 2019

It seems like either the number of slots on 1 executor or a config might be best. I think there are corner cases for all of these; it's just picking the one that seems to cover the most.

Just a side note, Tez has a similar config but it only applies when a single task is run. It is obviously different but did solve a problem we saw at Yahoo. So based on what we saw there I'm kind of leaning towards tasks <= number of slots on 1 executor. That doesn't add another config, covers the 1-task case, plus the 1-bad-executor case. @squito thoughts?

@squito (Contributor) commented Nov 26, 2019

Sure, I think I'm OK with that, it's a decent compromise. You wouldn't launch speculative tasks if you've got multiple executors on a bad node, but that's OK (IIRC we also won't make a dynamic allocation to get an executor on a new node, which would be needed to really handle that case).

A couple of nitpicky points:

  • when you say the number of tasks <= number of slots of 1 executor -- is that the total number of tasks in the taskset, or the delta minFinishedForSpeculation - numSuccessfulTasks? The reason to do the delta is, say you've got 10 tasks in the taskset, but the last 4 are all running on the bad executor. The taskset as a whole is too big to meet that condition, but with minFinishedForSpeculation=7 and numSuccessfulTasks=6 you'd meet the delta (see the sketch after this list).
  • Doesn't it still need another config to decide what the timeout is in this case?
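
A rough sketch of the numbers in the first bullet, just to make the two candidate conditions concrete (illustrative values and names, not the PR's code):

    val numTasks = 10
    val slotsPerExecutor = 4                 // e.g. 4 cores per executor, 1 cpu per task
    val speculationQuantile = 0.75
    val minFinishedForSpeculation = (speculationQuantile * numTasks).floor.toInt  // 7
    val numSuccessfulTasks = 6               // the other 4 tasks are stuck on the bad executor
    val totalTasksCondition = numTasks <= slotsPerExecutor                        // false: 10 > 4
    val deltaCondition =
      (minFinishedForSpeculation - numSuccessfulTasks) <= slotsPerExecutor        // true: 1 <= 4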
@tgravescs (Contributor) commented Nov 26, 2019

Yes, you still need the config for the timeout; you just don't need a second config to decide when to apply it. I.e. when you have <= that number of tasks, use the timeout config; otherwise use the normal speculation logic.

I was originally thinking the total number of tasks <= number of slots on 1 executor, then apply the timeout config; that seemed the most straightforward and obvious to the user. I'm fine with either way though, as long as it can be explained to the user. I think using the delta does complicate things again, as it uses the new algorithm sometimes and the original algorithm at other times. My initial thought is to keep it simple in the initial implementation; they can always turn spark.speculation.quantile down when you have a larger number of tasks, but lots of corner cases again. The thing with 1 task is that the current settings will never work for it, because you need at least 1 finished task to compare against.

Note you will ask for a new executor if you speculate and the executors are all used. It might not be on a different node though.

@jiangxb1987 (Contributor) commented Nov 26, 2019

I was proposing something like this:

    if (tasksSuccessful >= minFinishedForSpeculation && tasksSuccessful > 0) {
      // Try to add speculative tasks that have been running more than
      // SPECULATION_MULTIPLIER * medianDuration.
    } else if (speculationTaskDurationThresOpt.isDefined &&
        runningTasks <= conf.getInt("spark.executor.cores", 1) / conf.getInt("spark.task.cpus", 1)) {
      // Try to add speculative tasks that have been running more than a specified duration.
    } else {
      // Do not add speculative tasks.
    }

So this just introduces a new way to add speculative tasks that have been running more than a specified duration, which should be easy to reason about.

Please note I only consider the number of running tasks in the TaskSet, because the original speculation logic didn't include pending tasks either. On the other hand, if we keep getting those long-running tasks, in the end more executors would be required to run speculative tasks.

@tgravescs (Contributor) commented Nov 26, 2019

The original speculation logic doesn't include pending tasks because it can only use the time of successful tasks. Unless you are referring to something else.

I'm OK with any of these as they will be better than what we have now. If you look at running tasks, it could change very quickly, because with dynamic allocation I might have 1 executor, start 4 tasks on it, but then some time later get another executor, so then I no longer apply the timeout and use the original speculation logic.

@jiangxb1987 (Contributor) commented Nov 26, 2019

Ah, good point. Considering both the running and pending tasks would make the speculation strategy more stable (in the sense that once you entered the speculation task duration threshold branch, unless more tasks finish successfully, you would always choose the same speculation strategy in the next few scheduling iterations).

Also, if runningTasks + pendingTasks > SlotsPerExecutor, that means either too many tasks have been running, or we are waiting for extra executors to be launched; in both cases we'd better not rush into adding more speculative tasks.

yuchenhuo added 3 commits Nov 27, 2019
@yuchenhuo (Author, Contributor) commented Nov 27, 2019

@squito @tgravescs @jiangxb1987 Thanks for all the suggestions. I've updated the PR correspondingly.

@tgravescs (Contributor) commented Dec 2, 2019

test this please

@tgravescs (Contributor) commented Dec 2, 2019

So just want to confirm with @squito and @jiangxb1987:

It looks like what is implemented now is the number of unfinished tasks <= the number of slots on 1 executor and you don't hit the normal speculation logic:

    val numUnfinishedTasks = numTasks - numSuccessfulTasks
    val speculationTaskNumThres = conf.get(EXECUTOR_CORES) / conf.get(CPUS_PER_TASK)

Based on the discussions above I thought we were going to do the number of total tasks <= the number of slots on 1 executor, to keep it simple. Thoughts?
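
To make the two checks being compared explicit, an illustrative sketch (made-up names and values, not the PR's actual variables):

    val slotsPerExecutor = 4                  // EXECUTOR_CORES / CPUS_PER_TASK
    val numTasks = 10
    val numSuccessfulTasks = 7
    val unfinishedBasedCheck = (numTasks - numSuccessfulTasks) <= slotsPerExecutor  // implemented here: 3 <= 4, true
    val totalTasksBasedCheck = numTasks <= slotsPerExecutor                         // variant discussed: 10 <= 4, false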

@yuchenhuo yuchenhuo changed the title [SPARK-29976][CORE] New conf for single task stage speculation [SPARK-29976][CORE] Trigger speculation for stages with too few tasks Dec 2, 2019
@SparkQA commented Dec 2, 2019

Test build #114734 has finished for PR 26614 at commit 1dcd5d3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@jiangxb1987 (Contributor) commented Dec 3, 2019

We would enter the original speculation logic anyway when enough tasks have finished successfully, so the difference is only in the case when some of the tasks have succeeded but the number of successful tasks is less than minFinishedForSpeculation.
For example, you may have 4 slots on each executor, and the currently running TaskSet has 5 tasks, 2 of which have succeeded, 1 task is running, and the remaining 2 tasks are pending. In this case, when the task running time has exceeded speculationTaskDurationThresOpt.get, I think it should be reasonable to submit a speculative task, because the risk that it would consume too many resources is low, and it could possibly resolve the potential task hang issue (like the scenario described in the JIRA).
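
As a rough back-of-the-envelope version of this example (hypothetical values, mirroring the scenario above rather than the PR's code):

    val slotsPerExecutor = 4                                      // 4 cores, 1 cpu per task
    val numTasks = 5
    val numSuccessfulTasks = 2
    val numUnfinishedTasks = numTasks - numSuccessfulTasks        // 3
    val unfinishedCheck = numUnfinishedTasks <= slotsPerExecutor  // true: the threshold path would apply
    val totalTasksCheck = numTasks <= slotsPerExecutor            // false: it would not apply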

@tgravescs (Contributor) commented Dec 3, 2019

Right, I get that it still uses the regular speculation logic if enough tasks have finished; my concerns are confusion for the user, or it kicking in when users don't want it to.

Let's say I set both speculation policies because I have different stages with different requirements. I have one stage with 1 task that is problematic, but the speculationTaskDurationThresOpt will also be applied to all my other stages that I configured the normal Spark speculation configs for. If the speculationTaskDurationThresOpt is something that could be widely different for different stages, then it's harder to configure this way and it can kick in when I don't want it to or when I don't expect it to. The normal speculation configs are based on a multiplier of other task times; this is just a hardcoded timeout. So let's say my normal speculation config multiplier would kick in only after an hour and my speculationTaskDurationThresOpt is set to 15 minutes. I'm going to start speculating a lot more when the number of unfinished tasks gets below that threshold.

I totally get that this perhaps covers more scenarios, which in my opinion is both good and bad, as shown above. I was thinking of keeping this simple for now and just having it apply if total tasks <= slots on 1 executor. That should be very easy for a user to understand and to know when it will apply, and it solves the issue reported in this JIRA. If we start to find more specific cases where we want to get smarter, we can enhance it later.

@jiangxb1987 (Contributor) commented Dec 3, 2019

I think I get your concern now: we might have two stages running concurrently, where the expected task duration for the first stage could be 15 min and for the second stage it could be 1 hr. Thus if we set the speculationTaskDurationThresOpt to 30 min, then tasks from the second stage would all get speculated, which is not desired.

However, I don't see why this is related to comparing unfinished tasks versus total tasks against the task-count threshold. Even if we choose total tasks instead of unfinished tasks, it can still happen that one stage contains only one task but its duration is actually expected to be longer than speculationTaskDurationThresOpt, and then a speculative task would get launched anyway.

@tgravescs (Contributor) commented Dec 4, 2019

It minimizes the impact and makes it predictable when the new speculationTaskDurationThresOpt is applied. If you only apply it when the number of tasks is small (< the number of slots per executor), it's easier to reason about; if it can apply during any stage, then I need to worry about it being applied to my large stages even if I configured the other speculation configs to be what I really want it to use.

I agree with you that if you have 2 stages of 1 task each, picking the timeout here can be tricky, which is why the normal speculation configs use a multiple of the run time. You can't do that with only 1 task, though. But I don't see how to get around that.

My point is that using the unfinished count expands that same impact from only stages with 1 task to all my stages.

@jiangxb1987 (Contributor) commented Dec 4, 2019

Now I agree, we should be conservative about the behavior change and limit it to "small" TaskSets (those that contain fewer tasks than the slots per executor); thus we'd better use numTasks instead of numUnfinishedTasks when comparing with speculationTaskNumThres.

@yuchenhuo (Author, Contributor) commented Dec 6, 2019

@tgravescs @jiangxb1987 Thanks for the feedback! I've updated the PR according to the discussion. May I get another review?

@tgravescs (Contributor) commented Dec 6, 2019

test this please

Task duration after which scheduler would try to speculative run the task. If provided, tasks
would be speculatively run if current stage contains less tasks than the number of slots on a
single executor and the task is taking longer time than the threshold. This config helps
speculate stage with very few tasks.

@tgravescs (Contributor) Dec 6, 2019

It might be nice to add a sentence that the regular speculation configs may also apply if the number of executor slots is large enough.

@tgravescs (Contributor) Dec 6, 2019

It also might be nice to say that the default unit is milliseconds if no unit is specified.

// speculative run based on the time threshold. SPARK-29976: We set this value to be the number
// of slots on a single executor so that we wouldn't speculate too aggressively but still
// handle basic cases.
val speculationTaskNumThres = conf.get(EXECUTOR_CORES) / conf.get(CPUS_PER_TASK)

@tgravescs (Contributor) Dec 6, 2019

use sched.CPUS_PER_TASK instead of conf.get(CPUS_PER_TASK).

        }
        foundTasks |= checkAndSubmitSpeculatableTask(tid, time, threshold)
      }
    } else if (speculationTaskDurationThresOpt.isDefined && numTasks <= speculationTaskNumThres) {

@tgravescs (Contributor) Dec 6, 2019

We can do the comparison numTasks <= speculationTaskNumThres once when the TaskSetManager is created; numTasks isn't changing in the TaskSet, so do it once at the top, and then we don't even need speculationTaskNumThres.
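
A minimal sketch of that suggestion, assuming hypothetical member names inside TaskSetManager (the merged code may differ):

    // Evaluated once at TaskSetManager construction; numTasks never changes for a
    // TaskSet, so the per-call comparison (and speculationTaskNumThres) can go away.
    private[scheduler] val isSpeculationThresholdSpecified =
      speculationTaskDurationThresOpt.isDefined &&
        numTasks <= conf.get(EXECUTOR_CORES) / conf.get(CPUS_PER_TASK)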

      sc.conf.set(config.SPECULATION_TASK_DURATION_THRESHOLD.key, "60min")
    }
    sched = new FakeTaskScheduler(sc, ("exec1", "host1"), ("exec2", "host2"))
    // Create a task set with only one task

@tgravescs (Contributor) Dec 6, 2019

remove comment since numTasks passed in

      assert(!manager.checkSpeculatableTasks(0))
      assert(sched.speculativeTasks.size == numTasks)
    } else {
      // If the feature flag is turned off, or the stage contains too few tasks

@tgravescs (Contributor) Dec 6, 2019

I think you mean too many tasks.

test("SPARK-29976 when a speculation time threshold is provided, should not speculative " +
"if there are too many tasks in the stage even though time threshold is provided") {
testSpeculationDurationThreshold(true, 2, 1)
}

@tgravescs (Contributor) Dec 6, 2019

It would be nice to add another test here that tests the interaction of the speculative configs. Meaning: if I have both the threshold set and the speculation quantile is smaller, the threshold can still apply; and vice versa, the quantile can still apply.

@SparkQA commented Dec 6, 2019

Test build #114947 has finished for PR 26614 at commit bf22446.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@tgravescs (Contributor) left a comment

changes look good, thanks @yuchenhuo

@asfgit closed this in ad238a2 on Dec 10, 2019
@squito (Contributor) commented Dec 11, 2019

Sorry for the delays on my end, a late LGTM from me too. Thanks @yuchenhuo!
