Conversation

@StephanEwen
Contributor

The deadlock could occur when the SpilledSubpartitionViewAsyncIO released a buffer in one thread while
another thread reported an error.

The point of contention was the listener field, which is now held in an AtomicReference, removing the
need to take a lock when reporting an error.

The deadlock stack traces were:

Found one Java-level deadlock:
=============================
"pool-1-thread-2":
  waiting to lock monitor 0x00007fec2c006168 (object 0x00000000ef661c20, a java.lang.Object),
  which is held by "IOManager reader thread #1"
"IOManager reader thread #1":
  waiting to lock monitor 0x00007fec2c005ea8 (object 0x00000000ef62c8a8, a java.lang.Object),
  which is held by "pool-1-thread-2"

Java stack information for the threads listed above:
===================================================
"pool-1-thread-2":
        at org.apache.flink.runtime.io.network.partition.SpilledSubpartitionViewAsyncIO.notifyError(SpilledSubpartitionViewAsyncIO.java:309)
        - waiting to lock <0x00000000ef661c20> (a java.lang.Object)
        at org.apache.flink.runtime.io.network.partition.SpilledSubpartitionViewAsyncIO.onAvailableBuffer(SpilledSubpartitionViewAsyncIO.java:261)
        at org.apache.flink.runtime.io.network.partition.SpilledSubpartitionViewAsyncIO.access$300(SpilledSubpartitionViewAsyncIO.java:42)
        at org.apache.flink.runtime.io.network.partition.SpilledSubpartitionViewAsyncIO$BufferProviderCallback.onEvent(SpilledSubpartitionViewAsyncIO.java:380)
        at org.apache.flink.runtime.io.network.partition.SpilledSubpartitionViewAsyncIO$BufferProviderCallback.onEvent(SpilledSubpartitionViewAsyncIO.java:366)
        at org.apache.flink.runtime.io.network.util.TestPooledBufferProvider$PooledBufferProviderRecycler.recycle(TestPooledBufferProvider.java:135)
        - locked <0x00000000ef62c8a8> (a java.lang.Object)
        at org.apache.flink.runtime.io.network.buffer.Buffer.recycle(Buffer.java:118)
        - locked <0x00000000ef9597c0> (a java.lang.Object)
        at org.apache.flink.runtime.io.network.util.TestConsumerCallback$RecyclingCallback.onBuffer(TestConsumerCallback.java:72)
        at org.apache.flink.runtime.io.network.util.TestSubpartitionConsumer.call(TestSubpartitionConsumer.java:87)
        at org.apache.flink.runtime.io.network.util.TestSubpartitionConsumer.call(TestSubpartitionConsumer.java:39)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
"IOManager reader thread #1":
        at org.apache.flink.runtime.io.network.util.TestPooledBufferProvider$PooledBufferProviderRecycler.recycle(TestPooledBufferProvider.java:126)
        - waiting to lock <0x00000000ef62c8a8> (a java.lang.Object)
        at org.apache.flink.runtime.io.network.buffer.Buffer.recycle(Buffer.java:118)
        - locked <0x00000000efa016f0> (a java.lang.Object)
        at org.apache.flink.runtime.io.network.partition.SpilledSubpartitionViewAsyncIO.returnBufferFromIOThread(SpilledSubpartitionViewAsyncIO.java:275)
        - locked <0x00000000ef661c20> (a java.lang.Object)
        at org.apache.flink.runtime.io.network.partition.SpilledSubpartitionViewAsyncIO.access$100(SpilledSubpartitionViewAsyncIO.java:42)
        at org.apache.flink.runtime.io.network.partition.SpilledSubpartitionViewAsyncIO$IOThreadCallback.requestSuccessful(SpilledSubpartitionViewAsyncIO.java:343)
        at org.apache.flink.runtime.io.network.partition.SpilledSubpartitionViewAsyncIO$IOThreadCallback.requestSuccessful(SpilledSubpartitionViewAsyncIO.java:333)
        at org.apache.flink.runtime.io.disk.iomanager.AsynchronousFileIOChannel.handleProcessedBuffer(AsynchronousFileIOChannel.java:199)
        at org.apache.flink.runtime.io.disk.iomanager.BufferReadRequest.requestDone(AsynchronousFileIOChannel.java:435)
        at org.apache.flink.runtime.io.disk.iomanager.IOManagerAsync$ReaderThread.run(IOManagerAsync.java:408)

Found 1 deadlock.
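For illustration, the fix described above can be sketched as follows. This is a minimal, self-contained model of the pattern, not the actual Flink code; the class `AsyncViewSketch` and all of its members are hypothetical names. Because the listener lives in an AtomicReference, `notifyError()` is lock-free and can never block on a monitor held by a thread that is busy recycling a buffer, which breaks the lock cycle shown in the trace:

```java
import java.util.concurrent.atomic.AtomicReference;

// Minimal sketch (hypothetical names): errors are reported without taking
// the view's monitor, so error reporting cannot participate in a lock cycle
// with the buffer-recycling path.
class AsyncViewSketch {
    interface Listener {
        void onError(Throwable cause);
    }

    private final AtomicReference<Listener> listener = new AtomicReference<>();
    private final AtomicReference<Throwable> error = new AtomicReference<>();

    void register(Listener l) {
        Throwable cause = error.get();
        if (cause != null) {
            // An error was already reported: deliver immediately, do not retain.
            l.onError(cause);
        } else {
            listener.set(l);
            // Re-check to close the race with a concurrent notifyError().
            cause = error.get();
            if (cause != null) {
                Listener pending = listener.getAndSet(null);
                if (pending != null) {
                    pending.onError(cause);
                }
            }
        }
    }

    // May be called from any thread. Lock-free: only the first error wins,
    // and the listener is consumed exactly once via getAndSet.
    void notifyError(Throwable cause) {
        if (error.compareAndSet(null, cause)) {
            Listener l = listener.getAndSet(null);
            if (l != null) {
                l.onError(cause);
            }
        }
    }
}
```

The key property is that delivery always goes through an atomic `getAndSet(null)` on the listener, so no monitor is ever held while calling back into foreign code.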

@StephanEwen
Contributor Author

@uce I think you'd be the best person to review this.

@uce (Contributor) left a comment

Thanks for looking into this. Changes are good.

+1 to merge.

StephanEwen added a commit to StephanEwen/flink that referenced this pull request Sep 27, 2016
…iewAsyncIO.

This closes apache#2444
@asfgit closed this in 9090291 on Sep 27, 2016
liuyuzhong pushed a commit to liuyuzhong/flink that referenced this pull request Dec 5, 2016
…iewAsyncIO.

This closes apache#2444
3 participants