[SPARK-24817][Core] Implement BarrierTaskContext.barrier() #21898
Conversation
Test build #93685 has finished for PR 21898 at commit
Force-pushed from de517f5 to 5c5db85
Test build #93706 has finished for PR 21898 at commit
retest this please
Test build #93730 has finished for PR 21898 at commit
* } catch {
*   case e: Exception => logWarning("...", e)
* }
* context.barrier()
I think there should not be another context.barrier() here?
This is to demonstrate that in one task there is only one call of barrier(), while in other tasks there may be two calls of barrier(). Please refer to the "throw exception if barrier() call mismatched" test in BarrierTaskContextSuite. However, I'm still considering what the most proper behavior for this scenario should be.
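A minimal sketch of the mismatch being discussed, assuming a hypothetical barrier job and the BarrierTaskContext.get() entry point (not the doc example itself):

```scala
rdd.barrier().mapPartitions { iter =>
  val context = BarrierTaskContext.get()
  if (context.partitionId() == 0) {
    context.barrier() // task 0 performs a single global sync...
  } else {
    context.barrier() // ...which pairs with this first call,
    context.barrier() // but this second call waits for task 0 forever
  }
  iter
}
```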
try {
  barrierCoordinator.askSync[Unit](
    message = RequestToSync(numTasks, stageId, stageAttemptNumber, taskAttemptId, barrierEpoch),
    timeout = new RpcTimeout(31536000 /** = 3600 * 24 * 365 */ seconds, "barrierTimeout"))
Use BARRIER_SYNC_TIMEOUT here?
I set a fixed timeout for the RPC intentionally, so users will get a SparkException thrown by the BarrierCoordinator, instead of an RpcTimeoutException from the RPC framework.
You should add an inline comment so readers understand why.
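A sketch of how that inline comment might read over the diff above (the wording is ours, not the merged code):

```scala
barrierCoordinator.askSync[Unit](
  message = RequestToSync(numTasks, stageId, stageAttemptNumber, taskAttemptId, barrierEpoch),
  // Use a fixed, effectively-infinite RPC timeout on purpose: the
  // BarrierCoordinator enforces the real (configurable) barrier timeout and
  // fails the sync with a SparkException, so callers never see a raw
  // RpcTimeoutException from the RPC framework.
  timeout = new RpcTimeout(31536000 /* = 3600 * 24 * 365 */ seconds, "barrierTimeout"))
```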
logInfo(s"Task $taskAttemptId from Stage $stageId(Attempt $stageAttemptNumber) finished " + | ||
"global sync successfully, waited for " + | ||
s"${(System.currentTimeMillis() - startTime) / 1000} seconds, current barrier epoch is " + | ||
s"$barrierEpoch.") |
Shall we stop timer for this epoch here if the global sync finished successfully?
Nice catch! just updated.
Test build #93738 has finished for PR 21898 at commit
Test build #93737 has finished for PR 21898 at commit
retest this please
Test build #93742 has finished for PR 21898 at commit
Test build #93743 has finished for PR 21898 at commit
retest this please
Test build #93748 has finished for PR 21898 at commit
Force-pushed from 766381d to cb1861d
Test build #93785 has finished for PR 21898 at commit
// Barrier epoch for each stage attempt, fail a sync request if the barrier epoch in the request
// mismatches the barrier epoch in the coordinator.
private val barrierEpochByStageIdAndAttempt = new HashMap[Int, HashMap[Int, Int]]
is it better to use HashMap[(Int, Int), Int]?
+1.
Also, how about using AtomicLong to remember the epoch?
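A sketch of the flattened, tuple-keyed map being suggested, with hypothetical wrapper names (not the merged code); the compound (stageId, stageAttemptId) key replaces the nested HashMap[Int, HashMap[Int, Int]]:

```scala
import scala.collection.mutable

// Hypothetical wrapper to illustrate the suggestion.
class BarrierEpochs {
  // (stageId, stageAttemptId) -> current barrier epoch.
  private val epochByStageAttempt = mutable.HashMap.empty[(Int, Int), Int]

  // Look up or initialize the epoch for one stage attempt.
  def getOrInit(stageId: Int, stageAttemptId: Int): Int =
    epochByStageAttempt.getOrElseUpdate((stageId, stageAttemptId), 0)

  // Advance the epoch after a successful global sync.
  def increment(stageId: Int, stageAttemptId: Int): Unit =
    epochByStageAttempt((stageId, stageAttemptId)) =
      getOrInit(stageId, stageAttemptId) + 1
}
```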
private[spark] val BARRIER_SYNC_TIMEOUT =
  ConfigBuilder("spark.barrier.sync.timeout")
    .doc("The timeout in milliseconds for each barrier() call from a barrier task. If the " +
will users set this config in milliseconds? I feel seconds should be more common.
+1
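A sketch of the seconds-based definition those +1s point toward, using Spark's ConfigBuilder timeConf; the doc wording, check, and default are our assumptions, not the merged definition:

```scala
import java.util.concurrent.TimeUnit
import org.apache.spark.internal.config.ConfigBuilder

// Sketch: would live alongside the other entries in org.apache.spark.internal.config.
private[spark] val BARRIER_SYNC_TIMEOUT =
  ConfigBuilder("spark.barrier.sync.timeout")
    .doc("The timeout in seconds for each barrier() call from a barrier task. If the " +
      "coordinator didn't receive all the sync messages from barrier tasks within the " +
      "configured time, throw a SparkException to fail all the tasks.")
    .timeConf(TimeUnit.SECONDS)
    .checkValue(t => t > 0, "The timeout value must be positive.")
    .createWithDefaultString("365d")
```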
@@ -61,6 +61,9 @@ private[spark] trait TaskScheduler {
   */
  def killTaskAttempt(taskId: Long, interruptThread: Boolean, reason: String): Boolean

  // Kill all the running task attempts in a stage.
  def killAllTaskAttempts(stageId: Int, interruptThread: Boolean, reason: String): Unit
why do we need this?
I'm also confused here. Is it part of this PR? We should kill all task attempts in case of any task failures in a barrier stage, not limited to context.barrier() failures. Right?
IIRC killing all tasks is just best effort; we can't guarantee the tasks are all killed. Shall we tolerate this in the barrier scheduling?
One high-level comment: move the fail-all-task-attempts change to a separate PR to keep this one minimal.
import org.apache.spark.internal.Logging
import org.apache.spark.rpc.{RpcCallContext, RpcEnv, ThreadSafeRpcEndpoint}

class BarrierCoordinator(
- package private
- add ScalaDoc
private val timer = new Timer("BarrierCoordinator barrier epoch increment timer")

// Barrier epoch for each stage attempt, fail a sync request if the barrier epoch in the request
Epoch counter for each barrier (stage, attempt).
- Remove "fail ..." because it is not implemented by this variable.
// mismatches the barrier epoch in the coordinator.
private val barrierEpochByStageIdAndAttempt = new HashMap[Int, HashMap[Int, Int]]

// Any access to this should be synchronized.
Then shall we switch to Java's ConcurrentHashMap?
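A sketch of what that switch could look like (hypothetical names; assumes Scala 2.12's SAM conversion for the computeIfAbsent lambda). Note the map operations become thread-safe, but each ArrayBuffer value would still need its own synchronization:

```scala
import java.util.concurrent.ConcurrentHashMap
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.rpc.RpcCallContext

object SyncRequests {
  // (stageId, stageAttemptId) -> pending barrier sync requests.
  private val byStageAttempt =
    new ConcurrentHashMap[(Int, Int), ArrayBuffer[RpcCallContext]]()

  // Atomic get-or-init for one stage attempt.
  def getOrInit(stageId: Int, stageAttemptId: Int): ArrayBuffer[RpcCallContext] =
    byStageAttempt.computeIfAbsent(
      (stageId, stageAttemptId), _ => new ArrayBuffer[RpcCallContext]())
}
```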
// Any access to this should be synchronized.
private val syncRequestsByStageIdAndAttempt =
  new HashMap[Int, HashMap[Int, ArrayBuffer[RpcCallContext]]]
Ditto (stage, attempt) -> contexts.
test("global sync by barrier() call") { | ||
val conf = new SparkConf() | ||
.setMaster("local-cluster[4, 1, 1024]") |
should comment why we need a local cluster
val times = rdd2.collect()
// All the tasks shall finish global sync within a short time slot.
assert(times.max - times.min <= 5)
5 ms seems too risky to me. Actually, 1 second is perhaps okay here.
assert(error.contains("within 100 ms"))
}

ignore("throw exception if barrier() call mismatched") {
Why ignored? To create this scenario, we might need to create a new thread to call context.barrier() and then interrupt the thread.
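A hypothetical sketch of that pattern, assuming context is the task's BarrierTaskContext; the extra barrier() call runs on its own thread so the test can interrupt it instead of blocking forever:

```scala
import java.util.concurrent.atomic.AtomicReference

val error = new AtomicReference[Throwable]()
val extraCall = new Thread("extra-barrier-call") {
  override def run(): Unit = {
    try {
      // Mismatched call: the other tasks never reach this barrier.
      context.barrier()
    } catch {
      case e: Throwable => error.set(e)
    }
  }
}
extraCall.start()
extraCall.join(1000)  // let the call block on the global sync for a moment
extraCall.interrupt() // break it out of the barrier wait
extraCall.join()
assert(error.get() != null)
```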
Test build #93890 has finished for PR 21898 at commit
retest this please
Test build #93911 has finished for PR 21898 at commit
private def cleanupSyncRequests(stageId: Int, stageAttemptId: Int): Unit = {
  val requests = syncRequestsByStageIdAndAttempt.remove((stageId, stageAttemptId))
  if (requests != null) {
    requests.clear()
Is this needed? When we call syncRequestsByStageIdAndAttempt.remove((stageId, stageAttemptId)), the array buffer becomes dangling and will be GCed.
This is just to be safe: in case the requests are held in other places, we can still GC the RpcCallContexts.
Agree with @cloud-fan that this is not necessary. It only explicitly clears the ArrayBuffer object instead of the contexts.
private def getOrInitSyncRequests(
    stageId: Int,
    stageAttemptId: Int,
    numTasks: Int = 0): ArrayBuffer[RpcCallContext] = {
when will we use the default value 0?
if (syncRequests.size == numTasks) {
  syncRequests.foreach(_.reply(()))
  return true
}
nit:
if (...) {
  ...
  true
} else {
  false
}
timer.schedule(new TimerTask {
  override def run(): Unit = {
    // Timeout for current barrier() call, fail all the sync requests.
    val requests = getOrInitSyncRequests(stageId, stageAttemptId)
What if all the sync requests finish before the timeout? Then here we may init the request array again.
we should have some tests for the timeout behavior, by setting a very small timeout.
Em, how about cancelling the TimerTask when the sync request finishes successfully?
Yea, we should do that, but we also need to consider races, like a sync request finishing and the timer triggering at the same time.
We will also remove the internal data on stage completion, so I assume the race condition you mentioned won't cause serious issues; the internal data will be removed after all.
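A hypothetical sketch of the cancel-plus-lock idea: both the success path and the timeout path take the same lock, so only one of them can reply for a given epoch even if they race (names are ours, not the merged code):

```scala
import java.util.{Timer, TimerTask}

// Hypothetical per-stage-attempt coordinator state.
class BarrierState(timer: Timer) {
  private val lock = new Object
  private var timerTask: TimerTask = _

  def scheduleTimeout(timeoutMs: Long)(onTimeout: => Unit): Unit = lock.synchronized {
    timerTask = new TimerTask {
      override def run(): Unit = lock.synchronized {
        // Timeout path: if the success path already ran, this task was
        // cancelled and cleared, so do nothing.
        if (timerTask eq this) {
          onTimeout
          timerTask = null
        }
      }
    }
    timer.schedule(timerTask, timeoutMs)
  }

  def onAllRequestsReceived(replySuccess: => Unit): Unit = lock.synchronized {
    // Success path: stop the pending timeout before replying.
    if (timerTask != null) {
      timerTask.cancel()
      timerTask = null
    }
    replySuccess
  }
}
```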
}, timeout * 1000)
}

syncRequests += context
although very unlikely, shall we add an assert that syncRequests.length == numTasks? Just in case we have a bug and some barrier tasks have a different value of numTasks.
We don't remember the numTasks in BarrierCoordinator; if it's really worth it, then we have to use another map to store the information.
Each barrier task remembers numTasks, so here we can make sure barrier tasks of the same group have the same numTasks.
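A small sketch of that check, assuming the coordinator validates the numTasks carried by each incoming RequestToSync against the first value it saw for the stage attempt (hypothetical names):

```scala
// Hypothetical guard: every RequestToSync for one stage attempt must carry
// the same numTasks; the first request fixes the expected value.
def validateNumTasks(requestNumTasks: Int, expectedNumTasks: Int): Unit =
  require(requestNumTasks == expectedNumTasks,
    s"Number of tasks mismatch: got $requestNumTasks, expected $expectedNumTasks")
```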
Good to see it finish without failure.
They are not - I made the variable
I see. got it, thanks
// Number of tasks of the current barrier stage, a barrier() call must collect enough requests
// from different tasks within the same barrier stage attempt to succeed.
private lazy val numTasks = getTaskInfos().size
this can be a def.
If we change it to a def then we have to call getTaskInfos() every time; the current lazy val only calls getTaskInfos() once.
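A tiny REPL-style demo of the difference (hypothetical names): the lazy val body runs once and is cached, while a def re-runs on every access:

```scala
var calls = 0
def taskInfos(): Seq[Int] = { calls += 1; Seq(1, 2, 3) }

class Demo {
  lazy val numTasksCached = taskInfos().size // evaluated on first access only
  def numTasksFresh = taskInfos().size       // evaluated on every access
}

val d = new Demo
d.numTasksCached; d.numTasksCached // taskInfos() ran once: calls == 1
d.numTasksFresh; d.numTasksFresh   // two more runs: calls == 3
```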
@@ -1930,6 +1930,12 @@ class SparkContext(config: SparkConf) extends Logging {
  Utils.tryLogNonFatalError {
    _executorAllocationManager.foreach(_.stop())
  }
  if (_dagScheduler != null) {
why this change?
This is to fix #21898 (comment); previously LiveListenerBus was stopped before we stopped DAGScheduler.
val callSite = Utils.getCallSite()
logInfo(s"Task $taskAttemptId from Stage $stageId(Attempt $stageAttemptNumber) has entered " +
  s"the global sync, current barrier epoch is $barrierEpoch.")
logTrace(s"Current callSite: $callSite")
or simpler: logTrace("Current callSite: " + Utils.getCallSite())
listenerBus: LiveListenerBus,
override val rpcEnv: RpcEnv) extends ThreadSafeRpcEndpoint with Logging {

private lazy val timer = new Timer("BarrierCoordinator barrier epoch increment timer")
Will we identify the underlying reason before merging to master?
This is certainly a potential bug in SparkSubmit and not related to the changes made in this PR; I don't feel it should block this PR.
I opened https://issues.apache.org/jira/browse/SPARK-25030 to track the issue.
Add a comment above this line?
ok to test
retest this please
LGTM, pending jenkins
Test build #94279 has finished for PR 21898 at commit
Test build #94286 has finished for PR 21898 at commit
Test build #94277 has finished for PR 21898 at commit
Test build #94275 has finished for PR 21898 at commit
test this please
test this please
Test build #94318 has finished for PR 21898 at commit
retest this please
is there a way to increase the build timeout? cc @shaneknapp
Test build #94326 has finished for PR 21898 at commit
retest this please
test this please
@rxin, it seems we are indeed starting to hit the time limit now.
Test build #94333 has finished for PR 21898 at commit
great, finally tests pass! thanks, merging to master!
Test build #94344 has finished for PR 21898 at commit
Test build #94341 has finished for PR 21898 at commit
What changes were proposed in this pull request?
Implement BarrierTaskContext.barrier(), to support global sync between all the tasks in a barrier stage.
The function sets a global barrier and waits until all tasks in this stage hit this barrier. Similar to the MPI_Barrier function in MPI, the barrier() call blocks until all tasks in the same stage have reached this routine. The global sync finishes immediately once all tasks in the same barrier stage reach the same barrier.
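A usage sketch of the call, assuming the RDD.barrier() entry point that accompanies this feature and an existing SparkContext sc:

```scala
import org.apache.spark.BarrierTaskContext

val rdd = sc.parallelize(1 to 100, numSlices = 4)
val synced = rdd.barrier().mapPartitions { iter =>
  val context = BarrierTaskContext.get()
  // ... per-partition work that must complete on every task first ...
  context.barrier() // blocks until all 4 tasks reach this point
  iter
}
synced.collect()
```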
This PR implements BarrierTaskContext.barrier() on top of the netty-based RPC client, introduces a new BarrierCoordinator and a new BarrierCoordinatorMessage, and adds a new config to handle the timeout issue.
How was this patch tested?
Added BarrierTaskContextSuite to test BarrierTaskContext.barrier().