[SPARK-19889][SQL] Make TaskContext callbacks thread safe #17244

hvanhovell · 2017-03-10T15:02:11Z

What changes were proposed in this pull request?

It is sometimes useful to use multiple threads within a task to parallelize work. These threads might register completion/failure listeners to clean up when the task completes or fails. We currently cannot register such a callback and be sure that it will get called, because the context might already be in the process of invoking its callbacks when the callback gets registered.

This PR fixes this by making sure that a completion/failure listener cannot be added from one thread while the context is being marked as completed/failed in another thread. This is done by synchronizing these methods on the task context itself.

Failure listeners were already called only once. Completion listeners now follow the same pattern; this lifts the idempotency requirement for completion listeners and makes them easier to implement. In some cases we can (accidentally) add a completion/failure listener after the fact; these listeners will be called immediately, to make sure we can safely clean up after a task.

As a result of this change we could make the failure and completed flags non-volatile. The isCompleted() method now uses synchronization to ensure that updates are visible across threads.
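To make the description above concrete, here is a minimal sketch of the pattern, not the actual `TaskContextImpl` code. `SketchContext` and its method names are hypothetical stand-ins, listener ordering is not modeled, and exception handling is left out.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for the task context; names are illustrative only.
class SketchContext {
  private val onComplete = new ArrayBuffer[SketchContext => Unit]
  // Guarded by `this`, so the flag does not need to be @volatile.
  private var completed = false

  def isCompleted: Boolean = synchronized { completed }

  def addCompletionListener(f: SketchContext => Unit): this.type = synchronized {
    if (completed) {
      f(this)          // registered after the fact: run it immediately
    } else {
      onComplete += f  // otherwise queue it for markCompleted()
    }
    this
  }

  def markCompleted(): Unit = synchronized {
    if (!completed) {  // a second call is a no-op, so listeners run exactly once
      completed = true
      onComplete.foreach(f => f(this))
    }
  }
}
```

Because both methods hold the same monitor, a listener is either queued before the flag flips (and run by `markCompleted()`) or sees the flag already set (and runs immediately). It should never be dropped silently.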

How was this patch tested?

Added tests to TaskContextSuite that add listeners to a completed/failed context.

hvanhovell · 2017-03-10T15:02:30Z

cc @rxin @sameeragarwal @zsxwing

SparkQA · 2017-03-10T17:07:01Z

Test build #74323 has finished for PR 17244 at commit d16ad88.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2017-03-10T17:41:31Z

LGTM

zsxwing · 2017-03-10T18:25:34Z

Using TaskContext.synchronized outside of TaskContext should not be encouraged. How about making TaskContext.addTaskCompletionListener check isCompleted internally? If the task is done, don't add the listener to the list; then either ignore the listener or call it immediately.

SparkQA · 2017-03-10T19:54:44Z

Test build #74327 has finished for PR 17244 at commit 535349d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

mridulm · 2017-03-10T21:08:30Z

core/src/main/scala/org/apache/spark/TaskContextImpl.scala

@@ -57,57 +68,75 @@ private[spark] class TaskContextImpl(
  // Whether the task has failed.
  @volatile private var failed: Boolean = false


This need not be volatile anymore - given that it is updated and queried within a synchronized block.
We could revisit for completed too - though that would be an extension.

If we drop the volatility then we need to make isCompleted synchronized as well, to ensure safe publication.

Yes, which is why I mentioned it as extension :-)
For failed, it is already valid to remove volatile
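As a rough illustration of the visibility point in this thread (a sketch, not Spark code): a plain `var` that is only written and read while holding the same monitor is safely published without `@volatile`. `CompletionFlag` and `VisibilityDemo` are hypothetical names, and the lambda-to-`Runnable` conversion assumes Scala 2.12 or later.

```scala
// Hypothetical holder for the flag; not the real TaskContextImpl.
class CompletionFlag {
  private var completed = false // deliberately not @volatile

  // Writes and reads both go through the same monitor (`this`).
  def markCompleted(): Unit = synchronized { completed = true }
  def isCompleted: Boolean = synchronized { completed }
}

object VisibilityDemo {
  def main(args: Array[String]): Unit = {
    val flag = new CompletionFlag
    val writer = new Thread(() => flag.markCompleted())
    writer.start()
    // Releasing the monitor in markCompleted() happens-before a later
    // acquisition in isCompleted, so this loop observes the update.
    while (!flag.isCompleted) Thread.sleep(1)
    writer.join()
    println("completed flag observed from the main thread")
  }
}
```

If the getter were left unsynchronized and the field non-volatile, the Java memory model would not guarantee that the reader ever sees the write.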

mridulm · 2017-03-10T21:12:12Z

core/src/main/scala/org/apache/spark/TaskContextImpl.scala

+  @GuardedBy("this")
  override def addTaskCompletionListener(listener: TaskCompletionListener): this.type = {
-    onCompleteCallbacks += listener
+    synchronized {


nit: method synchronized instead of block ?

mridulm · 2017-03-10T21:12:54Z

core/src/main/scala/org/apache/spark/TaskContextImpl.scala

+  @GuardedBy("this")
  override def addTaskFailureListener(listener: TaskFailureListener): this.type = {
-    onFailureCallbacks += listener
+    synchronized {


nit: method synchronized instead of block ?

mridulm · 2017-03-10T21:13:53Z

core/src/main/scala/org/apache/spark/TaskContextImpl.scala

+      if (completed) {
+        listener.onTaskCompletion(this)
+      }
+      // Always add the listener because it is legal to call them multiple times.


I did not realize this, interesting !

Well, I was rather surprised about this, but the current code path seems to allow this.

I have updated the doc in TaskContext to reflect this.

SparkQA · 2017-03-10T23:00:51Z

Test build #74335 has finished for PR 17244 at commit 12f947e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-03-12T17:52:23Z

Test build #74404 has finished for PR 17244 at commit 41448ab.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-03-12T18:02:37Z

Test build #74405 has finished for PR 17244 at commit f3b9b97.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

mridulm · 2017-03-12T19:02:23Z

LGTM. Would be great if other reviewers can also take a look.
+CC @zsxwing, @rxin

sameeragarwal

minor nits, LGTM!

sameeragarwal · 2017-03-13T21:45:35Z

core/src/main/scala/org/apache/spark/TaskContext.scala

   * Adds a (Java friendly) listener to be executed on task completion.
-   * This will be called in all situation - success, failure, or cancellation.
+   * This will be called in all situation - success, failure, or cancellation. Adding a listener
+   * to an already completed task will result in that listeners being called immediately.


micro nit: s/listeners/listener here and below

sameeragarwal · 2017-03-13T21:49:00Z

core/src/main/scala/org/apache/spark/TaskContext.scala


  /**
-   * Adds a listener to be executed on task failure.
-   * Operations defined here must be idempotent, as `onTaskFailure` can be called multiple times.


Why delete this? onTaskFailure can also be called multiple times right?

This was disabled in #11504. So the comment does not make sense anymore.

sameeragarwal · 2017-03-13T22:16:21Z

core/src/test/scala/org/apache/spark/scheduler/TaskContextSuite.scala

+    context.addTaskCompletionListener(_ => invocations += 1)
+    assert(invocations == 1)
+    context.markTaskCompleted()
+    assert(invocations == 2)


can we call context.markTaskCompleted() once again and assert invocations == 2 to have a test for idempotency?

hvanhovell · 2017-03-13T22:37:58Z

Ok, had a small discussion offline. It seems weird that we have different calling policies for failure and completion listeners. I am going to change the invocation of completion listeners to exactly once as well.

SparkQA · 2017-03-14T01:25:11Z

Test build #74466 has finished for PR 17244 at commit 4199619.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

zsxwing · 2017-03-15T07:09:30Z

LGTM

hvanhovell · 2017-03-15T09:44:52Z

Thanks for the reviews! Merging to master.

cloud-fan · 2017-03-16T00:57:18Z

core/src/main/scala/org/apache/spark/TaskContextImpl.scala

+  override def addTaskCompletionListener(listener: TaskCompletionListener)
+      : this.type = synchronized {
+    if (completed) {
+      listener.onTaskCompletion(this)


shall we also try catch here?

or call the invokeListeners

Why would we do that, if we are going to rethrow the exception anyway? The only difference is that it would be a TaskCompletionListenerException instead. Calling invokeListeners would also call already invoked listeners, which is what we are trying to avoid.

invokeListeners takes a list of listeners, so we are able to only call this listener.

I think it's better to make these listeners consistent, i.e. throw TaskCompletionListenerException when a failure happens while calling a listener.
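The suggestion in this thread could look roughly like the sketch below. It is an assumption-laden illustration: `ListenerException` is a hypothetical stand-in for `TaskCompletionListenerException`, and the shape of `invokeListeners` is guessed from the comment that it "takes a list of listeners"; the real signature may differ.

```scala
import scala.util.control.NonFatal

// Hypothetical stand-in for TaskCompletionListenerException.
class ListenerException(messages: Seq[String])
  extends RuntimeException(messages.mkString("; "))

object ListenerSketch {
  // Runs every listener, collects non-fatal failures, and rethrows them
  // together as a single ListenerException once all listeners have run.
  def invokeListeners[L](listeners: Seq[L])(call: L => Unit): Unit = {
    val errors = listeners.flatMap { listener =>
      try {
        call(listener)
        None
      } catch {
        case NonFatal(e) => Some(String.valueOf(e.getMessage))
      }
    }
    if (errors.nonEmpty) throw new ListenerException(errors)
  }
}
```

With a helper of this shape, the late-registration path could pass a single-element list, for example `ListenerSketch.invokeListeners(Seq(listener))(_.onTaskCompletion(ctx))` with hypothetical `listener` and `ctx` values, so a failure surfaces as the same exception type as in the normal completion path without re-running listeners that were already invoked.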

Make TaskContext callbacks threadsafe.

d16ad88

Fix UT and move synchronization

535349d

CR

12f947e

mridulm reviewed Mar 10, 2017

hvanhovell added 5 commits March 12, 2017 15:45

Merge remote-tracking branch 'apache-github/master' into SPARK-19889

fc9a046

Make completed and failed non-volative. Make methods instead of blocks synchronized. Update documentation.

41448ab

Improve wording

631e7ab

Remove comma

0757ea3

Remove comma

f3b9b97

sameeragarwal approved these changes Mar 13, 2017

hvanhovell added 2 commits March 13, 2017 23:47

Make completion exactly once

1052d78

typo

4199619

asfgit closed this in 9ff85be Mar 15, 2017

cloud-fan reviewed Mar 16, 2017
