[SPARK-20131][Core]Don't use this lock in StandaloneSchedulerBackend.stop #17610

Closed. zsxwing wants to merge 3 commits into master from zsxwing/SPARK-20131

Conversation

@zsxwing (Member) commented Apr 11, 2017

What changes were proposed in this pull request?

`o.a.s.streaming.StreamingContextSuite.SPARK-18560 Receiver data should be deserialized properly` is flaky because there is a potential deadlock in `StandaloneSchedulerBackend` which causes an `await` timeout. Here is the related stack trace:

```
"Thread-31" #211 daemon prio=5 os_prio=31 tid=0x00007fedd4808000 nid=0x16403 waiting on condition [0x00007000239b7000]
   java.lang.Thread.State: TIMED_WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x000000079b49ca10> (a scala.concurrent.impl.Promise$CompletionLatch)
	at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
	at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)
	at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
	at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
	at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201)
	at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
	at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
	at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76)
	at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stop(CoarseGrainedSchedulerBackend.scala:402)
	at org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.org$apache$spark$scheduler$cluster$StandaloneSchedulerBackend$$stop(StandaloneSchedulerBackend.scala:213)
	- locked <0x00000007066fca38> (a org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend)
	at org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.stop(StandaloneSchedulerBackend.scala:116)
	- locked <0x00000007066fca38> (a org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend)
	at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:517)
	at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1657)
	at org.apache.spark.SparkContext$$anonfun$stop$8.apply$mcV$sp(SparkContext.scala:1921)
	at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1302)
	at org.apache.spark.SparkContext.stop(SparkContext.scala:1920)
	at org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:708)
	at org.apache.spark.streaming.StreamingContextSuite$$anonfun$43$$anonfun$apply$mcV$sp$66$$anon$3.run(StreamingContextSuite.scala:827)

"dispatcher-event-loop-3" #18 daemon prio=5 os_prio=31 tid=0x00007fedd603a000 nid=0x6203 waiting for monitor entry [0x0000700003be4000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.org$apache$spark$scheduler$cluster$CoarseGrainedSchedulerBackend$DriverEndpoint$$makeOffers(CoarseGrainedSchedulerBackend.scala:253)
	- waiting to lock <0x00000007066fca38> (a org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend)
	at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:124)
	at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
	at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
```

This PR removes `synchronized` and changes `stopping` to an `AtomicBoolean` to make `stop()` idempotent, which fixes the deadlock.
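
For illustration, here is a minimal sketch of the resulting pattern, using a simplified, hypothetical stand-in for `StandaloneSchedulerBackend` (the class and names below are not the actual Spark code):

```
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical, simplified backend illustrating the fix: the first caller
// to win the compareAndSet performs the shutdown; later callers return
// immediately instead of blocking on a monitor.
class BackendSketch {
  private val stopping = new AtomicBoolean(false)

  def stop(): Unit = {
    if (!stopping.compareAndSet(false, true)) {
      return // stop() already in progress or finished: idempotent no-op
    }
    // The blocking shutdown work (e.g. the askSync in the superclass stop)
    // now runs without holding the `this` monitor, so the dispatcher thread
    // in the second stack trace above can still lock `this` in makeOffers.
  }
}
```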

How was this patch tested?

Jenkins

@mridulm (Contributor) commented Apr 11, 2017

Does anything in the superclass methods invoked in the call subtree depend on `this` being locked?

If not, then why not simply restrict the lock to flipping the `stopping` flag (if already set, return; otherwise set it and proceed through the method)? In that case locking on `this` will do, without the need for a new lock.

@zsxwing (Member, Author) commented Apr 11, 2017

@mridulm yeah, I was thinking of just changing `stopping` to an `AtomicBoolean` flag. However, that changes the semantics a little: e.g., a second `stop` will return at once while the first `stop` is still running. I'm not sure if this is safe.
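
To make the semantic difference concrete, here is a hypothetical two-thread race against the `BackendSketch` from the sketch above:

```
// Hypothetical race illustrating the changed semantics: with the old
// `synchronized`, the second call would block until the first finished;
// with compareAndSet, it returns at once while shutdown may still be
// in flight.
val backend = new BackendSketch
val t = new Thread(new Runnable {
  override def run(): Unit = backend.stop() // races for the CAS
})
t.start()
backend.stop() // if it loses the CAS: returns immediately, even though
               // the other thread's shutdown may not have completed
t.join()
```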

@zsxwing (Member, Author) commented Apr 11, 2017

> Does anything in the superclass methods invoked in the call subtree depend on `this` being locked?

I don't get it, but I think the stack trace shows why this deadlock happens.

@SparkQA commented Apr 11, 2017

Test build #75712 has finished for PR 17610 at commit e6d0841.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mridulm (Contributor) commented Apr 11, 2017

> I don't get it, but I think the stack trace shows why this deadlock happens.

Based on your description/stack trace, I get why the deadlock happens. What I meant was: do any of the `super.*` methods invoked in the `stop` call tree assume they are invoked with `this` already locked?

If not, then a narrow lock on `this`, just to flip the state of `stopping`, might be better. An AtomicBoolean will introduce a new lock (which I think is not required here) and works the same.
The deadlock occurs because we are calling into RPC with the lock already held (which is probably a pattern we should somehow catch, since it will invariably cause deadlocks!); the flag flipping will not incur deadlocks due to that, unless `stop` is invoked from within some other `receive`.
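
A rough sketch of this narrow-lock alternative, using the same simplified, hypothetical backend as above (not the actual Spark code):

```
// Hypothetical narrow-lock variant: `this` is held only while flipping
// the flag, never across the blocking RPC call, so the deadlock in the
// stack trace above cannot occur either.
class NarrowLockBackendSketch {
  private var stopping = false

  def stop(): Unit = {
    val alreadyStopping = this.synchronized {
      val prev = stopping
      stopping = true
      prev
    }
    if (alreadyStopping) {
      return // another stop() is (or was) in progress
    }
    // Blocking shutdown (askSync etc.) runs with no monitor held.
  }
}
```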

@vanzin (Contributor) commented Apr 12, 2017

> An AtomicBoolean will introduce a new lock

I don't think it introduces a new lock; it's just an easy way to make `stop()` idempotent.

I guess the question to ask is (and it's basically what you already asked): does `stop()` need to hold any locks? That `synchronized` has been there since the class was added, so it's hard to recover the original intent, but a quick look at the code suggests that a lock is not necessary, and a boolean check of whether stop has already been called should be enough.

The new lock being added here only prevents two threads from concurrently executing `stop()`. I think that's not something that usually happens, and if it ever does, having one of them return immediately (instead of blocking) shouldn't cause any problems.

But in the end, either solution should be fine, as long as `stop()` behaves correctly when it's not holding the `this` lock.

@zsxwing (Member, Author) commented Apr 13, 2017

I removed the lock and changed `stopping` to an `AtomicBoolean` to make `stop()` idempotent.

zsxwing changed the title from "[SPARK-20131][Core]Use a separate lock for StandaloneSchedulerBackend.stop" to "[SPARK-20131][Core]Don't use this lock in StandaloneSchedulerBackend.stop" on Apr 13, 2017

@cloud-fan (Contributor)

LGTM

@vanzin (Contributor) commented Apr 13, 2017

LGTM.

@zsxwing (Member, Author) commented Apr 13, 2017

Merging to master and 2.1. Thanks!

asfgit closed this in c5f1cc3 on Apr 13, 2017
asfgit pushed a commit that referenced this pull request on Apr 13, 2017

[SPARK-20131][Core]Don't use this lock in StandaloneSchedulerBackend.stop

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #17610 from zsxwing/SPARK-20131.

(cherry picked from commit c5f1cc3)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
zsxwing deleted the SPARK-20131 branch on April 13, 2017 at 00:47

@SparkQA commented Apr 13, 2017

Test build #75751 has finished for PR 17610 at commit 929061c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

peter-toth pushed a commit to peter-toth/spark that referenced this pull request on Oct 6, 2018

[SPARK-20131][Core]Don't use this lock in StandaloneSchedulerBackend.stop