[FLINK-3466] [runtime] Cancel state handled on state restore #2252

StephanEwen · 2016-07-14T20:50:32Z

This pull request fixes the issue that state restore operations can get stuck when tasks are cancelled during state restore. That happens due to a bug in HDFS, which deadlocks (or livelocks) when the reading thread is interrupted.

This introduces two things:

All state handles and key/value snapshots are now Closeable. Closing does not delete any checkpoint data; it only closes pending streams and data fetch handles. Operations that concurrently access a closed state handle's state should fail.
The StreamTask holds a set of "Closeables" that it closes upon cancellation. This is a cleaner way of stopping in-progress work than relying on "interrupt()" to interrupt that work.
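
The two points above could be sketched roughly as follows. This is a minimal illustration, not the code in this pull request: the class name `CloseableRegistrySketch` and its methods are hypothetical, and only `java.io.Closeable` is a real API. The assumption is that anything registered after cancellation should be closed immediately, so late-starting restore work cannot slip past the cancel.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical registry of in-progress work that a task closes on cancellation. */
class CloseableRegistrySketch implements Closeable {

    private final Set<Closeable> closeables = new HashSet<>();
    private boolean closed;

    /** Registers a closeable; if the registry is already closed, closes it right away. */
    void register(Closeable c) throws IOException {
        synchronized (this) {
            if (!closed) {
                closeables.add(c);
                return;
            }
        }
        // Cancellation already happened: do not let this work start unnoticed.
        c.close();
    }

    /** Removes a closeable once its work has finished normally. */
    synchronized void unregister(Closeable c) {
        closeables.remove(c);
    }

    /** Called on cancellation: closes everything registered so far, exactly once. */
    @Override
    public void close() {
        Set<Closeable> toClose;
        synchronized (this) {
            if (closed) {
                return;
            }
            closed = true;
            toClose = new HashSet<>(closeables);
            closeables.clear();
        }
        // Close outside the lock so a slow close() cannot block register().
        for (Closeable c : toClose) {
            try {
                c.close();
            } catch (IOException ignored) {
                // Best effort: keep closing the remaining entries.
            }
        }
    }

    synchronized boolean isClosed() {
        return closed;
    }
}
```

Closing only releases streams and fetch handles in this sketch; nothing here deletes data, which matches the distinction drawn in the description.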

This mechanism should eventually be extended to also cancel operators and state handles pending asynchronous materialization.

There is a test with an interrupt-sensitive state handle (mimicking HDFS's deadlock behavior) that stalls without this pull request and finishes cleanly with the changes in this pull request.
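
To illustrate the idea behind such a test handle, here is a small sketch under stated assumptions. `InterruptIgnoringHandle` and `blockingRead()` are hypothetical names, not the classes used in the actual test; the sketch only assumes that the problematic read swallows interrupts and can be unblocked solely by closing the handle.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.concurrent.CountDownLatch;

/** Hypothetical handle whose read ignores interrupts and only ends on close(). */
class InterruptIgnoringHandle implements Closeable {

    private final CountDownLatch closedLatch = new CountDownLatch(1);

    /** Lets a test wait until the reader thread has entered the blocking read. */
    final CountDownLatch readStarted = new CountDownLatch(1);

    /** Blocks until close() is called; interrupts are swallowed on purpose. */
    void blockingRead() throws IOException {
        readStarted.countDown();
        boolean interrupted = false;
        while (true) {
            try {
                closedLatch.await();
                break;
            } catch (InterruptedException e) {
                // Mimics a read that does not react to interrupt().
                interrupted = true;
            }
        }
        if (interrupted) {
            // Restore the flag for callers once the read is over.
            Thread.currentThread().interrupt();
        }
        throw new IOException("handle was closed during read");
    }

    @Override
    public void close() {
        closedLatch.countDown();
    }
}
```

A test built on this would interrupt the reading thread, observe that it stays blocked, and then check that closing the handle lets it terminate.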

This also adds a test validating that all state handles and key/value snapshots define a proper serialVersionUID.
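
A check of this kind can be done with plain reflection. The sketch below is an assumption about the general approach, not the test added in this pull request: the helper `SerialVersionUidCheck` and the two example classes are hypothetical, and the classes to validate are passed in explicitly rather than discovered on the classpath.

```java
import java.io.Serializable;
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that checks for an explicitly declared serialVersionUID. */
class SerialVersionUidCheck {

    /** True if the class itself declares a static final long serialVersionUID. */
    static boolean declaresSerialVersionUid(Class<?> clazz) {
        try {
            Field field = clazz.getDeclaredField("serialVersionUID");
            int mods = field.getModifiers();
            return Modifier.isStatic(mods)
                    && Modifier.isFinal(mods)
                    && field.getType() == long.class;
        } catch (NoSuchFieldException e) {
            return false;
        }
    }

    /** Returns the names of all given classes that lack the field. */
    static List<String> findMissing(Class<?>... classes) {
        List<String> missing = new ArrayList<>();
        for (Class<?> clazz : classes) {
            if (!declaresSerialVersionUid(clazz)) {
                missing.add(clazz.getName());
            }
        }
        return missing;
    }
}

/** Example class that declares the field (illustrative only). */
class GoodHandle implements Serializable {
    private static final long serialVersionUID = 1L;
}

/** Example class that forgets the field (illustrative only). */
class BadHandle implements Serializable {
}
```

Using `getDeclaredField` rather than `getField` matters here: the field is usually private, and each class in a hierarchy needs its own declaration.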

uce · 2016-07-15T09:12:40Z

flink-runtime/src/main/java/org/apache/flink/runtime/state/AbstractCloseableHandle.java

+ * A simple base for closable handles.
+ * 
+ * Offers to register a stream (or other closable object) that close calls are delegated to if
+ * the handel is closed or was already closed.


typo: handel => handle

Thanks, will fix it.

uce · 2016-07-15T09:42:33Z

Looks very good! The test failures seem unrelated:

ClientTest: https://issues.apache.org/jira/browse/FLINK-4220 (newly created)
JobManagerHACheckpointRecoveryITCase: https://issues.apache.org/jira/browse/FLINK-3516 (known instability)
Travis Scala dependency issue

The added tests and refactorings are very readable.

I think this is good to merge mod some minor inline comments.

StephanEwen · 2016-07-15T11:37:59Z

Thanks, I'll address your comments and merge this...

State handles are cancelable, to make sure long-running checkpoint restore operations do finish early on cancellation, even if the code does not properly react to interrupts. This is especially important since HDFS client code is so buggy that it deadlocks when interrupted without closing.

StephanEwen · 2016-07-15T15:23:20Z

Manually merged in e9f660d

liuml07 · 2017-03-22T20:11:01Z

Is this related to https://issues.apache.org/jira/browse/HADOOP-14214? Thanks,

StephanEwen · 2017-03-27T12:40:52Z

I think it is, yes. We worked around it in the meantime...

uce reviewed Jul 15, 2016
[FLINK-3466] [tests] Add serialization validation for state handles

67ab5a5

StephanEwen force-pushed the state_handle_cancellation branch from c411b37 to ff52e0e Compare July 15, 2016 11:43

StephanEwen force-pushed the state_handle_cancellation branch from ff52e0e to a340bf2 Compare July 15, 2016 13:24

StephanEwen closed this Jul 15, 2016

StephanEwen deleted the state_handle_cancellation branch August 1, 2016 18:09

StephanEwen restored the state_handle_cancellation branch August 1, 2016 18:09

rmetzger added the component=Runtime/StateBackends label Mar 14, 2019
