
[SPARK-20286][core] Improve logic for timing out executors in dynamic allocation. #24704

Closed
wants to merge 16 commits into from

Conversation

vanzin
Contributor

@vanzin vanzin commented May 24, 2019

This change refactors the portions of the ExecutorAllocationManager class that
track executor state into a new class, to achieve a few goals:

  • make the code easier to understand
  • better separate concerns (task backlog vs. executor state)
  • less synchronization between event and allocation threads
  • less coupling between the allocation code and executor state tracking

The executor tracking code was moved to a new class (ExecutorMonitor) that
encapsulates all the logic of tracking what happens to executors and when
they can be timed out. The logic to actually remove the executors remains
in the EAM, since it still requires information that is not tracked by the
new executor monitor code.

Of particular interest within the executor monitor itself is a change in how
cached blocks are tracked; instead of polling the block manager, the monitor
now uses events to track which executors have cached blocks. It also detects
unpersist events and adjusts the time at which the executor should be removed
accordingly. (That's the bug mentioned in the PR title.)

Because of the refactoring, a few tests in the old EAM test suite were removed,
since they're now covered by the newly added test suite. The EAM suite was
also changed a little bit to not instantiate a SparkContext every time. This
allowed some cleanup, and the tests also run faster.

Tested with new and updated unit tests, and with multiple TPC-DS workloads
running with dynamic allocation on; also some manual tests for the caching
behavior.
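To illustrate the event-driven cached-block tracking described above, here is a minimal, hypothetical sketch. All names and the overall structure are illustrative assumptions for this write-up, not the actual ExecutorMonitor code in this PR:

// Hypothetical, simplified sketch of event-driven cached-block tracking.
// Names and structure are illustrative only; not the actual ExecutorMonitor.
import scala.collection.mutable

class CachedBlockSketch(idleTimeoutMs: Long, storageTimeoutMs: Long, clock: () => Long) {

  private case class ExecState(
      // rddId -> set of cached partition indices on this executor
      cachedBlocks: mutable.HashMap[Int, mutable.BitSet] = mutable.HashMap(),
      var idleStart: Long = -1L,
      var timeoutAt: Long = Long.MaxValue)

  private val executors = mutable.HashMap[String, ExecState]()

  def onExecutorIdle(execId: String): Unit = {
    val exec = executors.getOrElseUpdate(execId, ExecState())
    exec.idleStart = clock()
    updateTimeout(exec)
  }

  def onBlockCached(execId: String, rddId: Int, split: Int): Unit = {
    val exec = executors.getOrElseUpdate(execId, ExecState())
    exec.cachedBlocks.getOrElseUpdate(rddId, new mutable.BitSet()) += split
    updateTimeout(exec)
  }

  def onUnpersistRDD(rddId: Int): Unit = {
    executors.values.foreach { exec =>
      exec.cachedBlocks -= rddId
      updateTimeout(exec)
    }
  }

  // The deadline is anchored at the time the executor went idle, and uses the
  // longer "storage" timeout only while the executor still holds cached blocks.
  private def updateTimeout(exec: ExecState): Unit = {
    exec.timeoutAt = if (exec.idleStart >= 0) {
      val timeout = if (exec.cachedBlocks.nonEmpty) storageTimeoutMs else idleTimeoutMs
      exec.idleStart + timeout
    } else {
      Long.MaxValue
    }
  }

  def timedOutExecutors(now: Long): Seq[String] =
    executors.collect { case (id, exec) if exec.timeoutAt <= now => id }.toSeq
}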

@vanzin
Contributor Author

vanzin commented May 24, 2019

@dhruve I cleaned up the code I had lying around and made it work on top of the existing EAM. This PR is a replacement for #22015.

@attilapiros I removed some code you added recently as part of this patch.

@@ -70,10 +71,12 @@ class ExternalShuffleServiceDbSuite extends SparkFunSuite {
}
}

def shuffleServiceConf: SparkConf = sparkConf.clone().set(SHUFFLE_SERVICE_PORT, 0)
Contributor Author

This change isn't exactly related, but this test was failing because I had left a shuffle service running locally while running tests for the rest of this PR. So might as well fix it.

@SparkQA

SparkQA commented May 25, 2019

Test build #105774 has finished for PR 24704 at commit 05b6802.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -142,30 +137,22 @@ private[spark] class ExecutorAllocationManager(
// Executors that have been requested to be removed but have not been killed yet
private val executorsPendingToRemove = new mutable.HashSet[String]
Member

Is it possible to also move this into ExecutorMonitor? That way ExecutorAllocationManager would be purer (onExecutorRemoved could go away) and ExecutorMonitor would know more about executor state, including pending removal.

@@ -744,10 +592,6 @@ private[spark] class ExecutorAllocationManager(
if (totalPendingTasks() == 0) {
allocationManager.onSchedulerQueueEmpty()
}

// Mark the executor on which this task is scheduled as busy
executorIdToTaskIds.getOrElseUpdate(executorId, new mutable.HashSet[Long]) += taskId
Member

Should executorId on line 719 also be removed?

// and if it changed, don't consider this executor for removal yet, delaying
// the decision until the next check.
if (exec.timeoutAt > deadline) {
recomputeTimedOutExecs = true
Member

I'm a little confused about recomputeTimedOutExecs. Why do we need to eagerly recompute timed out executors when exec.timeoutAt becomes larger? If exec.timeoutAt ends up larger than the final newNextTimeout, we can just rely on the now >= nextTimeout.get() condition to decide whether the check is needed. And instead of setting recomputeTimedOutExecs = true, maybe we should just update newNextTimeout based on the changed exec.timeoutAt. WDYT?

Contributor Author

we should just update newNextTimeout based on the changed exec.timeoutAt

I think that can work and would avoid a little bit of unnecessary work. Let me do that instead.
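(For context, a rough sketch of what that adjustment could look like inside the scan; the variable names are taken from the snippets quoted in this thread, and the surrounding code is assumed:)

// Sketch only: instead of flagging a full recompute, fold a deadline that
// moved forward (because of a concurrent event) into this scan's result.
val newDeadline = exec.timeoutAt
if (newDeadline > deadline) {
  // Not timed out after all; just make sure the next scan happens no later
  // than the new deadline.
  newNextTimeout = math.min(newNextTimeout, newDeadline)
  false
} else {
  true
}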

@SparkQA

SparkQA commented May 28, 2019

Test build #105879 has finished for PR 24704 at commit 7f72720.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 29, 2019

Test build #105918 has finished for PR 24704 at commit 7927239.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// the executor to change after it was read above. Check the deadline again,
// and if it changed, don't consider this executor for removal yet.
val newDeadline = exec.timeoutAt
if (newDeadline > deadline) {
Contributor

why isn't this newDeadline > now? somewhat pathological case, but if this thread was really slow, and you set your timeouts super tiny, could happen. more importantly, I just think it would be more clear.

throw new SparkException(s"${DYN_ALLOCATION_EXECUTOR_IDLE_TIMEOUT.key} must be >= 0!")
}
if (storageTimeoutMs < 0) {
throw new SparkException(s"${DYN_ALLOCATION_CACHED_EXECUTOR_IDLE_TIMEOUT.key} must be >= 0!")
Contributor

these checks could now go on the configs themselves

exec.updateTimeout()
}
} else {
exec.cachedBlocks.get(blockId.rddId).foreach { blocks =>
Contributor

do you want to do this when storageLevel.isValid && fetchFromShuffleSvcEnabled && storageLevel.useDisk? If it's the first time the block is cached, then it's a no-op. If it has already been cached in memory, then this would no longer apply the cached timeout to this executor. I guess that is desired? Just want to check.

Contributor

ok I see this is explicitly checked in a test too.

Contributor Author

Yep, that's basically what SPARK-27677 does.

blocks -= blockId.splitIndex
if (blocks.isEmpty) {
exec.cachedBlocks -= blockId.rddId
exec.updateTimeout()
Contributor

could be done only if exec.cachedBlocks.isEmpty, though I guess there is nothing wrong this way.

private var runningTasks: Int = 0

// This should only be used in the event thread.
val cachedBlocks = new mutable.HashMap[Int, mutable.BitSet]()
Contributor

would help to add a comment that it's rdd_id -> partition_ids
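For example, something along these lines (exact wording is only a suggestion):

// Map of RDD id -> BitSet of cached partition indices on this executor.
// This should only be used in the event thread.
val cachedBlocks = new mutable.HashMap[Int, mutable.BitSet]()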

assert(monitor.timedOutExecutors(storageDeadline) === Seq("1"))

conf.set(SHUFFLE_SERVICE_ENABLED, true).set(SHUFFLE_SERVICE_FETCH_RDD_ENABLED, true)
monitor = new ExecutorMonitor(conf, client, clock)
Contributor

set the monitor back at the end of this test (will be pretty confusing if somebody tries to add another test after this one otherwise)

Contributor Author

Not sure what you mean? The monitor is re-initialized for each test.

Contributor

ooops my fault.

@SparkQA

SparkQA commented May 29, 2019

Test build #105930 has finished for PR 24704 at commit 50d9c91.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// If the executor was thought to be timed out, but the new deadline is later than the
// old one, ask the EAM thread to update the list of timed out executors.
if (timedOut && newDeadline > oldDeadline) {
nextTimeout.set(Long.MinValue)
Member

Why set the current nextTimeout to MinValue? Do you mean we need to eagerly recompute timed out executors in the next round to update the list of timed out executors? But as far as I can tell from the code, each time the EAM thread invokes timedOutExecutors(), it gets the list of timed out executors and immediately asks the cluster manager to kill them via ExecutorAllocationClient. So if an executor has already been considered timed out and killed after the last round, will there be any difference for that executor if we eagerly update the list of timed out executors in the next round?

I guess you mean to handle the case where an event arrives and makes the executor non-timed-out again, right after it was considered timed out by the scan. But, as I mentioned above, I don't think that works. Instead of handling it after the scan, maybe we can optimize for the case where it happens during the scan, for example by wrapping the scan in a while loop that exits once all timed out executors' timedOut flags are true:

// this check guards against an event that updates a deadline during scanning
while (!checkTimedOut(timedOutExecs)) {
  ...
  timedOutExecs = scanning {
  }
  ...
}

WDYT ?

Contributor

it gets the list of timed out executors and immediately asks the cluster manager to kill them via ExecutorAllocationClient. So if an executor has already been considered timed out and killed after the last round, will there be any difference for that executor if we eagerly update the list of timed out executors in the next round?

only the CoarseGrainedSchedulerBackend has a consistent view of which executors are actually running tasks, and so it has ultimate say in whether or not to kill an executor. You might not have killed an executor earlier because it was actually busy.

So here is where you "catch up" with the view of the CGSB, so you know that executor shouldn't be considered timed out and requested to be killed next time around.

(at least I think so, marcelo can confirm ...)

Contributor Author

@vanzin vanzin May 30, 2019

What Imran said, with a complication.

The scenario here is decidedly rare. The EAM thread sees a timed out executor at the same time a new task is starting on it. So the EAM tries to kill it, but the CGSB says no.

While that's happening the task start event arrives, and the monitor now knows the executor is not idle anymore. But the executor is still there in the list of timed out executors, so on every cycle, the EAM would still try to kill it and the CGSB will keep saying no. Until it becomes idle again (but not timed out!), and then it will be killed, but at the wrong time.

So it needs to be removed from the timed out list to avoid that problem.

There's still a case where the task start event isn't processed until after the task has actually ended, and the EAM thread runs again and actually manages to kill an executor before it should. But that should be even rarer, and I'm not sure that can be fixed, since it involves synchronization with threads that this code has no control over.

Member

@Ngone51 Ngone51 May 31, 2019

Makes sense. @squito @vanzin

But the executor is still there in the list of timed out executors, so on every cycle, the EAM would still try to kill it and the CGSB will keep saying no.

I just realized that we have a cached timedOutExecs list in the monitor. But there's one last point we may have overlooked. Once an executor has been considered timed out in a previous scanning round, its pendingRemoval would be marked as true. So in the next scanning round, we'll filter it out before checking its deadline. Then it won't appear in timedOutExecs because of pendingRemoval rather than because of its new deadline. We should cover this.

But why do we need to cache timedOutExecs in the monitor? Could we always return the latest list of timed out execs to the EAM on every cycle (e.g. return empty when now < nextTimeout.get)?

Contributor

So in the next scanning round, we'll filter it out before checking its deadline. Then it won't appear in timedOutExecs because of pendingRemoval rather than because of its new deadline. We should cover this.

I'm not sure I understand this part. pendingRemoval is set after the CGSB has acknowledged that it will remove the executor. We shouldn't need to send any more messages about killing the executor after this.

But why do we need to cache timedOutExecs in the monitor? Could we always return the latest list of timed out execs to the EAM on every cycle (e.g. return empty when now < nextTimeout.get)?

I think you might be right about this, I would need to read it more carefully ... but in any case, I think that would make it significantly harder to follow. It's a lot easier to reason about if ExecutorMonitor.timedOutExecutors() consistently returns the right list.

Contributor Author

@vanzin vanzin May 31, 2019

Yeah, this doesn't work because of an edge case.

When calculating the timed out executors, the next timeout is set to when the next non-timed-out executor will time out. That is the important part.

Now imagine that you return 3 timed out executors and the EAM doesn't kill any of them (either because of limits, or because the CGSB doesn't do it). Now you will not get a call to executorsKilled, so the next timeout will not be reset.

Which means that next time you call timedOutExecutors() it will return an empty list, instead of the 3 timed out executors.

So you need that cache.

Contributor Author

BTW you could say "just call executorsKilled with an empty list" but then that means the optimization doesn't work, and you're basically calculating the timed out list on every EAM schedule call.

Member

Okay, I see the problem here. But what I care about is that, with a cached list, the EAM may perform unnecessary executor removals many times before nextTimeout changes. For example, say we have 3 executors in the monitor and the EAM gets those 3 timed out executors in a single scanning round. That means we cannot update nextTimeout during this scan, and nextTimeout would just stay at Long.MaxValue until new events (which update nextTimeout) arrive. So before nextTimeout is updated, the EAM always gets those 3 cached timed out executors every 100ms and performs unnecessary executor removal again and again. This case may be rare, or have little impact on the EAM, but I do worry about this kind of unnecessary behavior with the cached list.

I think removing that cached list can be achieved with a little more change:

if (deadline > now) {
 newNextTimeout = math.min(newNextTimeout, deadline)
 exec.timedOut = false
 false
} else {
 exec.timedOut = true
 true
}
  1. instead of only updating newNextTimeout for non timed out executors, we should update it for timed out executors, too

  2. when we get a call to executorsKilled(executorIds), update nextTimeout based on those killed executors' timeoutAt, rather than just setting it to Long.MinValue

Contributor Author

with a cached list, the EAM may perform unnecessary executor removals many times before nextTimeout changes

That's actually the point. If the EAM did not kill the executors, it most probably means that there's an event that will be delivered to the listener and cause the next timeout to be updated (e.g. a task start).

If such an event doesn't arrive, then the executors are still timed out, and will always be returned to the EAM. The EAM may then just decide not to kill them again (e.g. because the lower limit of executors has been reached).

All of that is fine, and computationally cheap.

instead of only updating newNextTimeout for non timed out executors, we should update it for timed out executors, too

That means that if no executors are killed, you'll be re-scanning the whole executor list on every EAM interval until something changes, which is the exact thing the cache is trying to avoid.

Member

Makes sense.

timedOutExecs = Nil
}

def timedOutExecutors(): Seq[String] = {
Contributor

I think it's worth a comment that this can only be called from the EAM scheduling thread.


// An event arriving while this scan is happening may cause the deadline for
// the executor to change after it was read above. Check the deadline again,
// and if it changed, don't consider this executor for removal yet.
Contributor

this is only important for block updates, right? Because if the executor is actually running a task, the CoarseGrainedSchedulerBackend should ignore the request to kill the executor (and it is the only thing which has a consistent view of that).

I'm not so sure this complexity is really necessary, as it will be imperfect anyway (the update could happen immediately after you do this). And it even exists for the other case -- you decided to not time out an executor, but then you just get an update to drop a block or unpersist an RDD, and now it should timeout.

Contributor Author

I think you're right. I was trying to close a very specific and narrow race here, but yes, different races still exist and there isn't a good way to avoid them. The best we can do is hope the CGSB doesn't kill the executor.

then you just get an update to drop a block or unpersist an RDD, and now it should timeout

That is fine, the EAM should pick that up in the next cycle 100ms later.


@SparkQA

SparkQA commented May 30, 2019

Test build #105976 has finished for PR 24704 at commit fe8eb1c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

exec.timedOut = false
false
} else {
exec.timedOut = true
Member

The change is good here. Previously I thought it overlapped with some behavior in the Tracker. Now it is quite simple, more readable, and easy to understand.

} else {
exec.timedOut = true
true
}
Member

if (deadline > now) {
 newNextTimeout = math.min(newNextTimeout, deadline)
 exec.timedOut = false
 false
} else {
 exec.timedOut = true
 true
}

Maybe ?

if (deadline > now) {
 newNextTimeout = math.min(newNextTimeout, deadline)
 exec.timedOut = false
} else {
 exec.timedOut = true
}
exec.timedOut

Contributor Author

I was trying to avoid another read of a volatile variable.
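(In other words, a hedged illustration assuming timedOut is the @volatile field from the diff above:)

// Ending the block with literal booleans avoids re-reading the volatile field:
if (deadline > now) {
  newNextTimeout = math.min(newNextTimeout, deadline)
  exec.timedOut = false  // volatile write
  false                  // local constant, no extra volatile read
} else {
  exec.timedOut = true
  true
}
// whereas ending the block with `exec.timedOut` would perform an additional volatile read.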


def updateTimeout(): Unit = {
val oldDeadline = timeoutAt
val newDeadline = if (idleStart >= 0) {
Contributor

idleStart seems to be redundant. We could just use runningTasks == 0

Contributor

I think you need it when updating the timeout, specifically when you drop cached blocks. You want to track the original time the executor went idle. (you could make the approximation to just restart the idle time when the blocks get dropped, but might as well do it right)

Contributor Author

Yeah, it's not redundant. When checking for idleness, this check is the same as runningTasks == 0, but we need both pieces of information for the timeout calculation to be correct.
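(A sketch of the point being made; the deadline computation here is an assumption beyond what the diff above shows:)

// Sketch: the deadline is anchored at the time the executor went idle,
// not at the time of the latest block or unpersist event, so idleStart is
// needed even though "idle" itself is equivalent to runningTasks == 0.
def updateTimeout(): Unit = {
  timeoutAt = if (idleStart >= 0) {
    val timeout = if (cachedBlocks.nonEmpty) storageTimeoutMs else idleTimeoutMs
    idleStart + timeout
  } else {
    Long.MaxValue
  }
}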

val cachedBlocks = new mutable.HashMap[Int, mutable.BitSet]()

// For testing.
def isIdle: Boolean = idleStart >= 0
Contributor

Can replace it with runningTasks == 0

assert(monitor.timedOutExecutors(idleDeadline).isEmpty)
assert(monitor.timedOutExecutors(storageDeadline) === Seq("1"))

monitor.onUnpersistRDD(SparkListenerUnpersistRDD(2))
Contributor

From @dhruve's comment above, I realized it might be good to add to this test case a bit:

    // if we get an unpersist event much later, which moves an executor from having cached blocks
    // to no longer having cached blocks, we still know the timeout from the original time the
    // executor went idle
    clock.setTime(idleDeadline)
...

Contributor Author

Actually the clock.setTime() is not needed, because the timedOutExecutors call here is a test-specific one that takes the time as a parameter.

But I'll add the comment.

Contributor

@squito squito left a comment

one small suggestion on the tests, otherwise lgtm.

one thing I was thinking is that we don't have any tests on the handling when the CGSB decides to not kill an executor. That test really would need to go in ExecutorAllocationManagerSuite, turn off the dynamic allocation testing config, and mock the cgsb

@squito
Contributor

squito commented May 31, 2019

one thing I was thinking is that we don't have any tests on the handling when the CGSB decides to not kill an executor. That test really would need to go in ExecutorAllocationManagerSuite, turn off the dynamic allocation testing config, and mock the cgsb

actually, that is totally wrong -- you can do this test in ExecutorMonitorSuite, just by interleaving calls to timedOutExecutors(), executorsKilled(), onTaskStart(), & executorsRemoved().
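A rough sketch of such a test (setup details are assumed; the method names come from the discussion and the test snippets quoted in this thread):

// Sketch: verify that executors the scheduler declines to kill remain in the
// timed-out list on the next cycle.
test("executors not killed by the scheduler stay timed out") {
  // ... register executors "1" and "2" and advance the clock past the idle timeout ...
  assert(monitor.timedOutExecutors().toSet === Set("1", "2"))

  // The cluster manager only actually kills "1".
  monitor.executorsKilled(Seq("1"))

  // "2" must still be reported on the next call.
  assert(monitor.timedOutExecutors() === Seq("2"))
}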

Marcelo Vanzin added 3 commits May 31, 2019 11:41
Now that the monitor is also keeping track of which executors are
pending for removal, that cache becomes unnecessary, since it knows
exactly when it should look at the executor list again to see which
ones can be killed.
Contributor

@squito squito left a comment

lgtm

updateNextTimeout(newNextTimeout)
timedOutExecs
} else {
Nil
Contributor

I actually thought it was more clear with the cached value, but fine with this too

Contributor Author

I actually found an issue with the current code after I removed another piece of code that seemed to be doing nothing useful. That might also be related to the unit test failure in the last run. Taking a look...

assert(monitor.timedOutExecutors().toSet === Set("1", "2", "3"))
assert(monitor.pendingRemovalCount === 0)

monitor.executorsKilled(Seq("1"))
Contributor

a brief comment here would be helpful, eg. "The scheduler only kills some of the executors we asked it to (eg. because another task just started there, but really for whatever reason the scheduler feels like). Make sure the timedOut list gets updated appropriately."
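For example (wording is only a suggestion):

// The scheduler may only kill some of the executors we asked it to kill
// (e.g. because a task just started on one of them, or for any other reason
// it sees fit). Make sure the timed-out list is updated accordingly.
monitor.executorsKilled(Seq("1"))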

@SparkQA

SparkQA commented May 31, 2019

Test build #106031 has finished for PR 24704 at commit 42e5e9e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 1, 2019

Test build #106037 has finished for PR 24704 at commit 5eb29c4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 1, 2019

Test build #106043 has finished for PR 24704 at commit 9bf4ece.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51
Member

Ngone51 commented Jun 2, 2019

LGTM

Contributor

@attilapiros attilapiros left a comment

I read the changes, but this was just a superficial review. Besides that, I do not want my super tiny nit to be lost.

// "task start" event has just been posted to the listener bus and hasn't yet been delivered to
// this listener. There are safeguards in other parts of the code that would prevent that executor
// from being removed.
private var nextTimeout = new AtomicLong(Long.MaxValue)
Contributor

Nit: This can be a val.
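i.e. (assuming nothing else reassigns the reference):

private val nextTimeout = new AtomicLong(Long.MaxValue)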

// of timed out executors needs to be updated due to the executor's state changing.
@volatile var timedOut: Boolean = false

var pendingRemoval: Boolean = false
Contributor

I think this should be volatile too -- it's accessed by the EAM thread and the listener bus thread

Contributor Author

I don't think it needs to. I tried very hard to avoid volatile when not necessary; in this code, it's generally only needed when ordering is important when checking multiple variables. That is not the case for this one, since it doesn't really depend on any other variables.

At first I also didn't have timedOut and timeoutAt as volatiles, but I couldn't 100% convince myself that it was not needed. But in this case I'm pretty sure it doesn't do anything useful.

Contributor

ok, I guess the one access which is in the listener thread is in onExecutorRemoved(). If that sees the stale value, then it will incorrectly set the nextTimeout, telling the monitor to recompute the list. But while that is inefficient, it's never incorrect. And the hope is, it's rare enough that it's still better to avoid making it volatile.

is that right?

I'd say it at least deserves a comment in onExecutorRemoved.

Contributor Author

If that sees the stale value, then it will incorrectly set the nextTimeout

Right, but notice that isn't really related to the variable being volatile or not, it's just a race that can also occur if the variable is volatile.

Writing to that variable is an atomic operation. Volatile controls what happens to other reads / writes around that volatile write. That's why it's needed when the order of the reads and writes is important. But here it isn't.

@SparkQA

SparkQA commented Jun 3, 2019

Test build #106118 has finished for PR 24704 at commit 2e2432f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 5, 2019

Test build #106177 has finished for PR 24704 at commit 9ae62d9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@squito
Contributor

squito commented Jun 5, 2019

merged to master

@asfgit asfgit closed this in b312033 Jun 5, 2019
@vanzin vanzin deleted the SPARK-20286 branch June 5, 2019 17:05
emanuelebardelli pushed a commit to emanuelebardelli/spark that referenced this pull request Jun 15, 2019

Closes apache#24704 from vanzin/SPARK-20286.

Authored-by: Marcelo Vanzin <vanzin@cloudera.com>
Signed-off-by: Imran Rashid <irashid@cloudera.com>
cfmcgrady pushed a commit to cfmcgrady/spark that referenced this pull request Aug 1, 2019