[SPARK-9026] Refactor SimpleFutureAction.onComplete to not launch separate thread for every callback #7385

JoshRosen · 2015-07-14T02:47:56Z

This commit refactors SimpleFutureAction so that registering a callback with onComplete does not immediately tie up a thread in the provided execution context.

As @zsxwing noticed in #7276 (comment), the existing implementation of SimpleFutureAction.onComplete creates a separate thread to wrap the blocking awaitResult() callback:

  override def onComplete[U](func: (Try[T]) => U)(implicit executor: ExecutionContext) {
    executor.execute(new Runnable {
      override def run() {
        func(awaitResult())
      }
    })
  }

This PR addresses this issue by making JobWaiter into a Future and using that future in the implementation of SimpleFutureAction's future methods.
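To make the approach concrete, here is a minimal sketch of a promise-backed waiter, assuming nothing about the real classes beyond what is described above. `SketchWaiter` and `SketchAction` are hypothetical stand-ins for JobWaiter and SimpleFutureAction, not the code in this patch.

```scala
import scala.concurrent.{ExecutionContext, Future, Promise}
import scala.util.Try

// Hypothetical stand-in for JobWaiter: completion is signalled through a
// Promise instead of a monitor that callers must block on.
class SketchWaiter(totalTasks: Int) {
  private val promise = Promise[Unit]()
  private var finishedTasks = 0

  // Called by the (hypothetical) scheduler each time a task finishes.
  def taskSucceeded(): Unit = synchronized {
    finishedTasks += 1
    if (finishedTasks == totalTasks) promise.trySuccess(())
  }

  def toFuture: Future[Unit] = promise.future
}

// Hypothetical stand-in for SimpleFutureAction.
class SketchAction[T](waiter: SketchWaiter, resultFunc: => T) {
  // Registering a callback only attaches it to the future; no thread is
  // parked in the execution context while the job is still running.
  def onComplete[U](func: Try[T] => U)(implicit executor: ExecutionContext): Unit =
    waiter.toFuture.map(_ => resultFunc).onComplete(func)
}
```

In this sketch the executor is only used once the waiter's future has completed, which is the property the refactoring is after.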

This patch was pair-programmed with @tdas.

JoshRosen · 2015-07-14T02:48:54Z

@zsxwing, TD and I think that this patch's fix may eliminate the need to introduce a new submitAsyncJob API. I'm going to push another commit in a few minutes to add some code comments plus more tests, but I just wanted to submit this PR now to get your initial feedback on the approach.

JoshRosen · 2015-07-14T02:50:20Z

core/src/main/scala/org/apache/spark/scheduler/JobWaiter.scala

  private var finishedTasks = 0

  // Is the job as a whole finished (succeeded or failed)?
  @volatile
  private var _jobFinished = totalTasks == 0

+  if (_jobFinished) {


@zsxwing, this if statement fixes a subtle bug that we found in your JobWaiter future: if a job has no tasks, then it is marked as finished immediately and taskSucceeded will not be called, so we need to complete the promise here. We noticed this because a test in AsyncRDDActionsSuite was hanging.
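A simplified sketch of the zero-task case, assuming a promise-backed waiter; `ZeroTaskAwareWaiter` is a hypothetical name and only the completion logic is modelled:

```scala
import scala.concurrent.{Future, Promise}

// Hypothetical, simplified waiter; only the completion logic is modelled.
class ZeroTaskAwareWaiter(totalTasks: Int) {
  private val promise = Promise[Unit]()
  private var finishedTasks = 0

  // With zero tasks, taskSucceeded() is never invoked, so the promise must be
  // completed at construction time or the future would never finish.
  if (totalTasks == 0) {
    promise.success(())
  }

  def taskSucceeded(): Unit = synchronized {
    finishedTasks += 1
    if (finishedTasks == totalTasks) promise.trySuccess(())
  }

  def toFuture: Future[Unit] = promise.future
}
```

Without the constructor-time check, a callback registered on the future of an empty job would simply never fire, which matches the hang described above.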


JoshRosen · 2015-07-14T03:25:53Z

@tdas, PTAL @ the regression test that I added.

SparkQA · 2015-07-14T04:50:14Z

Test build #37179 has finished for PR 7385 at commit 11c7450.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-07-14T05:01:31Z

Test build #37184 has finished for PR 7385 at commit 7b25e6b.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

JoshRosen · 2015-07-14T05:07:09Z

Ah, interesting: it looks like this failed some of the JavaAsyncRDDActions tests. I'll investigate.

SparkQA · 2015-07-14T05:27:37Z

Test build #37188 has finished for PR 7385 at commit d779af8.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Least(children: Expression*) extends Expression
- case class Greatest(children: Expression*) extends Expression

JoshRosen · 2015-07-14T05:33:44Z

core/src/main/scala/org/apache/spark/FutureAction.scala

+  override def onComplete[U](func: (Try[T]) => U)(implicit executor: ExecutionContext): Unit = {
+    jobWaiter.toFuture.onComplete { (jobWaiterResult: Try[Unit]) =>
+      // If the job succeeded, then evaluate the result function; otherwise, preserve the exception.
+      _value = jobWaiterResult.map(_ => resultFunc)


@tdas, I think there's a bug here because we'll re-assign to _value if there are multiple onCompletes. There's also a race in allowing _value to be assigned here, since there's a lag between when the jobWaiter future completes and when this callback runs. Fixing this now...
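One way to avoid both problems, sketched here under the assumption that the value can be derived on demand, is to drop the mutable field and compute the value from the job future's current state. `DerivedValueAction` and `jobFuture` are hypothetical names for illustration, not necessarily the fix that was pushed.

```scala
import scala.concurrent.Future
import scala.util.Try

// Hypothetical stand-in for SimpleFutureAction; `jobFuture` plays the role of
// the JobWaiter future and `resultFunc` the by-name result expression.
class DerivedValueAction[T](jobFuture: Future[Unit], resultFunc: => T) {
  // No mutable `_value` field: the value is computed from the job future's
  // current state, so it is visible as soon as the job future completes and
  // cannot be overwritten by a second onComplete registration.
  def value: Option[Try[T]] = jobFuture.value.map(_.map(_ => resultFunc))

  def isCompleted: Boolean = jobFuture.isCompleted
}
```

Because `value` reads the future directly, there is no window in which the job is finished but the value is still unset.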

zsxwing · 2015-07-14T06:33:50Z

@JoshRosen thank you for fixing this. Actually, I have another requirement for submitJob. The current submitJob signature is

  def submitJob[T, U, R](
      rdd: RDD[T],
      processPartition: Iterator[T] => U,
      partitions: Seq[Int],
      resultHandler: (Int, U) => Unit,
      resultFunc: => R): SimpleFutureAction[R]

I cannot access TaskContext in this method. Could you change processPartition: Iterator[T] => U to processPartition: (TaskContext, Iterator[T]) => U, or add a new method for it?

JoshRosen · 2015-07-14T06:35:45Z

@zsxwing will TaskContext.get() work for you? AFAIK we're trying to avoid the introduction of new withContext methods when TaskContext.get() will suffice (see some of the old discussions regarding mapPartitionsWithContext, etc.).

zsxwing · 2015-07-14T06:37:22Z

> @zsxwing will TaskContext.get() work for you? AFAIK we're trying to avoid the introduction of new withContext methods when TaskContext.get() will suffice (see some of the old discussions regarding mapPartitionsWithContext, etc.).

Forgot it. It should work. Thanks.

zsxwing · 2015-07-14T07:01:33Z

core/src/main/scala/org/apache/spark/FutureAction.scala

-      }
-    })
+  override def onComplete[U](func: (Try[T]) => U)(implicit executor: ExecutionContext): Unit = {
+    jobWaiterFuture.map { _ => resultFunc }.onComplete(func)


How about adding private lazy val _resultFunc = resultFunc and using _resultFunc in this class? This should avoid calling resultFunc multiple times.

JoshRosen · 2015-07-14

I think that @tdas and I considered this and ended up not doing it because we thought that resultFunc would only be computed once, but I suppose it doesn't hurt to be more explicit here. Even if the current code works, if it's confusing enough to merit a comment then I think we should just be explicit and use a lazy val. I'll update this now to do this.
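For reference, the difference between referencing a by-name parameter directly and going through a lazy val can be seen in a small standalone sketch. `ResultHolder` is a hypothetical class, not part of this patch:

```scala
// Hypothetical holder that receives a by-name argument, like resultFunc above.
class ResultHolder[R](resultFunc: => R) {
  // A by-name parameter is re-evaluated on every reference...
  def recomputed: R = resultFunc

  // ...whereas a lazy val evaluates it on first access and caches the result.
  private lazy val _resultFunc: R = resultFunc
  def cached: R = _resultFunc
}
```

Calling `recomputed` twice runs the argument expression twice, while any number of calls to `cached` run it once, so the lazy val makes the "evaluate once" intent explicit rather than dependent on how many call sites exist.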

SparkQA · 2015-07-14T08:06:13Z

Test build #37200 has finished for PR 7385 at commit 1e2db7f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.


JoshRosen · 2015-07-14T18:14:36Z

@zsxwing I've updated this to use a lazy val; please take another look.

SparkQA · 2015-07-14T20:12:50Z

Test build #37245 has finished for PR 7385 at commit 1346313.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

tdas · 2015-07-14T23:57:28Z

core/src/main/scala/org/apache/spark/FutureAction.scala

-      case JobSucceeded => scala.util.Success(resultFunc)
-      case JobFailed(e: Exception) => scala.util.Failure(e)
+    } else {
+      jobWaiter.awaitResult() match {


Using awaitResult to get the result here seems like a bad hack. Instead, there should be a public JobWaiter.jobResult that returns Option[JobResult], and this code should use that.

JoshRosen · 2015-07-15T05:44:30Z

I just pushed a commit which changes JobWaiter to extend Future[Unit], which I think should resolve the most recent round of review feedback.

SparkQA · 2015-07-15T07:43:27Z

Test build #37319 has finished for PR 7385 at commit e08623a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

zsxwing · 2015-07-15T10:35:30Z

core/src/main/scala/org/apache/spark/scheduler/JobWaiter.scala

      throw new UnsupportedOperationException("taskSucceeded() called on a finished JobWaiter")
    }
    resultHandler(index, result.asInstanceOf[T])
    finishedTasks += 1
    if (finishedTasks == totalTasks) {
-      _jobFinished = true
-      jobResult = JobSucceeded
+      promise.trySuccess()
      this.notifyAll()


This line can be removed. Right?
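A sketch of why the notification may become redundant, assuming that blocking callers wait on the promise's future rather than on the object's monitor. `AwaitableWaiter` and its method names are hypothetical:

```scala
import scala.concurrent.{Await, Promise}
import scala.concurrent.duration.Duration

// Hypothetical waiter whose blocking API is layered on the promise's future.
class AwaitableWaiter {
  private val promise = Promise[Unit]()

  // Completing the promise is what wakes up blocked callers.
  def jobSucceeded(): Unit = {
    promise.trySuccess(())
    ()
  }

  def isFinished: Boolean = promise.isCompleted

  // Blocking callers park on the future, so there is no wait()/notifyAll()
  // pair that has to be kept in sync with the promise.
  def awaitResult(): Unit = Await.result(promise.future, Duration.Inf)
}
```

If any caller still uses wait() on the waiter itself, the notifyAll() would of course have to stay.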

SparkQA · 2015-07-18T02:04:21Z

Test build #37677 has finished for PR 7385 at commit 1a19268.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

JoshRosen · 2015-07-18T08:16:01Z

Jenkins, retest this please.

SparkQA · 2015-07-18T10:31:33Z

Test build #37706 has finished for PR 7385 at commit 1a19268.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

JoshRosen · 2015-07-19T01:09:14Z

Jenkins, retest this please.

SparkQA · 2015-07-19T03:11:45Z

Test build #37745 has finished for PR 7385 at commit 1a19268.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

zsxwing · 2015-07-20T06:55:55Z

core/src/main/scala/org/apache/spark/scheduler/JobWaiter.scala

+    // There are certain situations where jobFailed can be called multiple times for the same
+    // job. We guard against this by making this method idempotent.
+    if (!isCompleted) {
+      promise.failure(exception)


Looks like tryFailure would be simpler.
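A standalone sketch of the behaviour being suggested, using only scala.concurrent.Promise; the exception values are made up for illustration:

```scala
import scala.concurrent.Promise

val promise = Promise[Unit]()
val first = new RuntimeException("first failure")
val second = new RuntimeException("second failure")

// tryFailure returns true only for the call that actually completes the promise.
val firstAccepted = promise.tryFailure(first)

// A repeated call is a no-op that returns false instead of throwing.
val secondAccepted = promise.tryFailure(second)

// By contrast, calling promise.failure(second) at this point would throw
// IllegalStateException because the promise is already completed.
```

Since tryFailure performs the check and the completion as one atomic step, it also avoids the gap between `!isCompleted` and `promise.failure(exception)` in which another thread could complete the promise.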

zsxwing · 2015-07-22T15:24:47Z

retest this please

zsxwing · 2015-07-22T15:25:25Z

@tdas could you take a look at this PR again? I think it's better to merge this one first.

SparkQA · 2015-07-22T15:26:31Z

Test build #61 has finished for PR 7385 at commit 1a19268.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-07-22T15:31:00Z

Test build #38083 has finished for PR 7385 at commit 1a19268.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-07-22T16:35:35Z

Test build #1162 has finished for PR 7385 at commit 1a19268.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.


JoshRosen · 2015-07-22T22:27:53Z

Jenkins, retest this please.

SparkQA · 2015-07-23T00:46:40Z

Test build #38119 has finished for PR 7385 at commit 1a19268.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.


SparkQA · 2015-07-23T20:15:20Z

Test build #38258 has finished for PR 7385 at commit 692b3a4.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-07-24T01:30:00Z

Test build #38280 has finished for PR 7385 at commit 17edbcd.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-07-24T03:16:45Z

Test build #1191 has finished for PR 7385 at commit 17edbcd.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-07-24T03:43:34Z

Test build #1192 has finished for PR 7385 at commit 17edbcd.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

JoshRosen · 2015-07-27T18:07:42Z

Jenkins, retest this please.

SparkQA · 2015-07-27T20:02:56Z

Test build #38573 has finished for PR 7385 at commit 17edbcd.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-11-21T15:54:37Z

Test build #2093 has finished for PR 7385 at commit 17edbcd.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

zsxwing · 2015-12-11T00:47:11Z

@JoshRosen could you close this one, since #9264 fixes the issue as well? Thanks!

jaceklaskowski · 2016-01-02T12:11:41Z

@zsxwing @JoshRosen Does the comment at https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceiverTracker.scala#L438 need attention now that the PR is closed?

rxin · 2016-01-03T06:30:49Z

That's a good find.

cc @zsxwing

zsxwing · 2016-01-03T06:34:41Z

Thanks @jaceklaskowski I will submit a PR to remove submitJobThreadPool

JoshRosen reviewed Jul 14, 2015

JoshRosen added 4 commits July 13, 2015 20:23

Add regression test

df20ed5

Refactor SimpleFutureAction to not block threads for every onComplete callback.

55c41d3

Add some comments

1deed38

Back out log4j.properties changes.

d779af8

JoshRosen force-pushed the simplefutureaction-refactoring branch from 7b25e6b to d779af8 Compare July 14, 2015 03:25

JoshRosen reviewed Jul 14, 2015

Fix race.

1e2db7f

zsxwing reviewed Jul 14, 2015

zsxwing mentioned this pull request Jul 14, 2015

[SPARK-8882][Streaming]Add a new Receiver scheduling mechanism #7276

Closed

Use lazy val to make ot clear that resultfunc should only be evaluated once

1346313

tdas reviewed Jul 14, 2015

Convert JobWaiter into a Future

e08623a

zsxwing reviewed Jul 15, 2015

zsxwing reviewed Jul 20, 2015

Merge remote-tracking branch 'origin/master' into simplefutureaction-refactoring

c9ef8d4

Merge remote-tracking branch 'origin/master' into simplefutureaction-refactoring

692b3a4

Exception -> Throwable

17edbcd

JoshRosen mentioned this pull request Oct 24, 2015

[SPARK-9026] [SPARK-4514] Modifications to JobWaiter, FutureAction, and AsyncRDDActions to support non-blocking operation #9264

Closed

JoshRosen closed this Dec 11, 2015

JoshRosen deleted the simplefutureaction-refactoring branch December 11, 2015 00:52
