[FLINK-5823] [checkpoints] State Backends also handle Checkpoint Metadata and introduce Global Cleanup Hooks #3522

StephanEwen · 2017-03-13T10:43:50Z

Core Changes

Store Metadata in State Backend

The state backend is now responsible for storing the checkpoint metadata. There is no longer an implicit assumption that the checkpoint metadata is stored in a file system.

All checkpoint directory / savepoint directory specific config settings are now part of the state backends. The Checkpoint Coordinator simply calls the relevant methods on the state backends to store metadata.
Similar to the CheckpointStreamFactory for storing checkpoint state, there is now a CheckpointMetadataStreamFactory for the metadata.
State backends are not required to be able to persist metadata; this is only needed for HA setups and when externalized checkpoints are requested.
All checkpoints with persisted metadata are addressable via a "pointer", which is state-backend specific. For file-system-based state backends (all state backends in Flink currently), this pointer is the file path.
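To illustrate the pointer idea, here is a minimal sketch of a file-based metadata store. It is not the interface from this PR: `MetadataStoreSketch`, `FileMetadataStoreSketch`, and the `_metadata` file name are assumptions made for the example, and plain `java.nio` stands in for Flink's file system abstraction.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical, simplified stand-in for a metadata store; not Flink's actual interface. */
interface MetadataStoreSketch {
    /** Persists the metadata bytes and returns a backend-specific pointer to them. */
    String persist(byte[] metadata) throws IOException;

    /** Resolves a pointer produced by persist() back to the metadata bytes. */
    byte[] resolve(String pointer) throws IOException;
}

/** File-based variant: the pointer is the fully qualified URI of the metadata file. */
final class FileMetadataStoreSketch implements MetadataStoreSketch {
    private final Path checkpointDir;

    FileMetadataStoreSketch(Path checkpointDir) {
        this.checkpointDir = checkpointDir;
    }

    @Override
    public String persist(byte[] metadata) throws IOException {
        Files.createDirectories(checkpointDir);
        // "_metadata" is an illustrative file name chosen for this sketch.
        Path file = checkpointDir.resolve("_metadata");
        Files.write(file, metadata);
        return file.toUri().toString();
    }

    @Override
    public byte[] resolve(String pointer) throws IOException {
        return Files.readAllBytes(Paths.get(URI.create(pointer)));
    }
}
```

The caller only ever sees the string, so a backend that stores metadata elsewhere could return a different kind of pointer without changing the calling code.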

Global cleanup hooks

State backends can implement an extended interface to create global cleanup hooks. For example, for a file system, a global cleanup hook simply deletes the checkpoint directory recursively, which on most file systems is much faster than issuing a delete call per file.

The MemoryStateBackend and FsStateBackend use fast cleanup hooks; the RocksDBStateBackend should get them in a follow-up (see below).
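A rough sketch of what a directory-level cleanup hook could look like. The names `CleanupHookSketch` and `DirectoryCleanupHookSketch` are hypothetical. Note that with plain `java.nio` the tree is still walked entry by entry; the speed-up described above comes from file system clients that offer a single recursive delete call.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical hook type; the name and signature are illustrative only. */
interface CleanupHookSketch {
    void dispose() throws IOException;
}

/** Disposes a whole checkpoint by removing its directory, not by tracking single files. */
final class DirectoryCleanupHookSketch implements CleanupHookSketch {
    private final Path checkpointDir;

    DirectoryCleanupHookSketch(Path checkpointDir) {
        this.checkpointDir = checkpointDir;
    }

    @Override
    public void dispose() throws IOException {
        if (!Files.exists(checkpointDir)) {
            return; // nothing to clean up
        }
        List<Path> entries;
        try (Stream<Path> walk = Files.walk(checkpointDir)) {
            // Reverse order puts children before their parent directories.
            entries = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path entry : entries) {
            Files.delete(entry);
        }
    }
}
```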

Application-defined State Backends pick up additional values from the configuration

We need to keep supporting the scenario of setting a state backend in the user program while configuring parameters like the checkpoint directory in the cluster config. To support that, state backends may implement an additional interface which lets them pick up configuration values from the cluster configuration.
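As a sketch of that pattern: the interface name, the config key, and the rule that values set in the program take precedence are all assumptions for illustration, and a plain `Map` stands in for the cluster configuration.

```java
import java.util.Map;

/** Hypothetical interface: lets a backend fill in unset values from the cluster config. */
interface ConfigurableBackendSketch {
    ConfigurableBackendSketch configure(Map<String, String> clusterConfig);
}

final class FileBackendSketch implements ConfigurableBackendSketch {
    // Illustrative key; the real option name is not given in this thread.
    static final String CHECKPOINT_DIR_KEY = "example.checkpoints.dir";

    // null means "not set by the application"
    private final String checkpointDir;

    FileBackendSketch(String checkpointDir) {
        this.checkpointDir = checkpointDir;
    }

    String getCheckpointDir() {
        return checkpointDir;
    }

    @Override
    public FileBackendSketch configure(Map<String, String> clusterConfig) {
        // Assumption: values set in the program win; the cluster config only fills gaps.
        if (checkpointDir != null) {
            return this;
        }
        return new FileBackendSketch(clusterConfig.get(CHECKPOINT_DIR_KEY));
    }
}
```

Returning a new instance instead of mutating the backend keeps the object created in the user program unchanged, which is one possible design choice here.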

Altered user-facing Behavior

Externalized checkpoints now store all files (state and metadata) strictly in the same directory. (Savepoints were already contained in a single directory before.)
Both savepoint commands and configuration parameters now require qualified URIs as well (i.e., file:///path/do/savepoint), whereas before the configs and command line also accepted /path/do/savepoint.
Because not having qualified URIs is error-prone anyways (auto fallback to local file system) I am actually in favor of doing this change.
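For illustration, a qualified-URI check can be as small as testing for a scheme. `QualifiedUriCheckSketch` is a hypothetical helper, not code from this PR:

```java
import java.net.URI;

final class QualifiedUriCheckSketch {
    /** Accepts only locations that carry an explicit scheme such as file:// or hdfs://. */
    static URI requireQualified(String location) {
        URI uri = URI.create(location);
        if (uri.getScheme() == null) {
            throw new IllegalArgumentException(
                    "Location '" + location + "' has no scheme; use a qualified URI such as file:///...");
        }
        return uri;
    }
}
```

With this check, `file:///path/do/savepoint` passes, while `/path/do/savepoint` is rejected instead of silently falling back to the local file system.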

Tests

This adds a lot of tests which, due to the changed design, can be done completely on the state backends, without instantiating a CheckpointCoordinator.

Checkpoint / Savepoint delete with global hook works as expected
Interaction of old cleanup logic and optional global cleanup hook
State backends are properly loaded from configuration
Application-configured State backends properly pick up additional configuration values

Followups

Implement the global hooks for the RocksDB state backend. Because the RocksDB state backend internally delegates to a "storage backend", this needs a few extra tricks and was not done in this already large pull request.
The HA checkpoint store no longer needs to store checkpoint metadata itself; it can simply store the pointer.
This abstraction allows state backends to offer periodic cleanup hooks that can search for lost state (files that are not referenced any more), which can happen when TaskManager / JobManager processes fail while finalizing a checkpoint or when handing over state between JobManager / TaskManager.

StephanEwen · 2017-03-13T16:40:10Z

CI tests pass (except one profile which does not complete within the limit of 50 minutes)

tillrohrmann

Really good changes @StephanEwen. The design and implementation are sound and well tested. Furthermore, the new code is thoroughly documented and difficult code sections are explained. +1 for merging :-)

tillrohrmann · 2017-03-28T15:16:45Z

flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java

+			catch (Exception e) {
+				// catch all exception, to not let errors in cleanup hooks fail the checkpoint
+				LOG.warn("Failed to create the cleanup hook for checkpoint " + checkpointId + 
+						". Generic cleanup path will be executed...");


e might be interesting to log, since it could contain information about what went wrong.
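A minimal sketch of the suggestion, using `java.util.logging` so it runs without extra dependencies (the reviewed code uses a different logger). The class and method names are hypothetical:

```java
import java.util.function.Supplier;
import java.util.logging.Level;
import java.util.logging.Logger;

final class CleanupHookLoggingSketch {
    private static final Logger LOG = Logger.getLogger(CleanupHookLoggingSketch.class.getName());

    /** Returns the hook, or null if creation failed, so that the generic cleanup path runs. */
    static Runnable createHookOrNull(long checkpointId, Supplier<Runnable> hookFactory) {
        try {
            return hookFactory.get();
        } catch (Exception e) {
            // Passing e to the logger keeps the stack trace and message of the root cause.
            LOG.log(Level.WARNING,
                    "Failed to create the cleanup hook for checkpoint " + checkpointId
                            + ". Generic cleanup path will be executed.",
                    e);
            return null;
        }
    }
}
```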

tillrohrmann · 2017-03-28T15:18:03Z

flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java

-						"belonging to task {} of job {}.", checkpointId, executionAttemptID, jobId,
-						throwable);
+					}
+					catch (Throwable t) {


Code reformatting

tillrohrmann · 2017-03-28T16:00:00Z

flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CompletedCheckpoint.java

+				catch (Exception e) {
+					// we catch all exceptions here to make sure we go through the generic path as a
+					// fallback in case where cleanup fails
+					LOG.warn("Fast cleanup hook f");


e is being swallowed and the log message seems to be truncated.

tillrohrmann · 2017-03-28T16:25:06Z

flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/savepoint/SavepointV1.java

+			}
+			catch (Exception e) {
+				if (firstException == null) {
+					firstException = e;


We could add all following exceptions as suppressed exceptions.
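A sketch of that pattern with `Throwable.addSuppressed`; `SuppressedExceptionsSketch` and its `Disposable` interface are hypothetical stand-ins for the state handles being disposed:

```java
import java.util.List;

final class SuppressedExceptionsSketch {
    /** Stand-in for anything that can be disposed and may fail while doing so. */
    interface Disposable {
        void dispose() throws Exception;
    }

    /** Disposes every handle, then rethrows the first failure with later ones attached. */
    static void disposeAll(List<? extends Disposable> handles) throws Exception {
        Exception firstException = null;
        for (Disposable handle : handles) {
            try {
                handle.dispose();
            } catch (Exception e) {
                if (firstException == null) {
                    firstException = e;
                } else {
                    // Later failures are not lost; they travel with the first one.
                    firstException.addSuppressed(e);
                }
            }
        }
        if (firstException != null) {
            throw firstException;
        }
    }
}
```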

tillrohrmann · 2017-03-28T16:28:23Z

flink-runtime/src/main/java/org/apache/flink/runtime/state/CheckpointMetadataStreamFactory.java

+	CheckpointMetadataOutputStream createCheckpointStateOutputStream() throws IOException;
+
+	/**
+	 * Gets the location (as a string pointer) where the metadata factory 


Sentence does not seem to be complete

tillrohrmann · 2017-03-29T16:00:26Z

...untime/src/main/java/org/apache/flink/runtime/state/filesystem/AbstractFileStateBackend.java

+				filesystem = FileSystem.get(checkpointDataUri);
+			} catch (IOException e) {
+				// should never happen
+				throw new FlinkRuntimeException("cannot access file system for URI " + checkpointDataUri);


Maybe adding e as cause?

tillrohrmann · 2017-03-29T16:01:35Z

...untime/src/main/java/org/apache/flink/runtime/state/filesystem/AbstractFileStateBackend.java

+						"the available configurations.";
+				LOG.warn("Could not verify checkpoint path. This might be caused by a genuine " +
+						"problem or by the fact that the file system is not accessible from the " +
+						"client. Reason:{}", reason);


We could fuse reason and the logging statement into a single string.

tillrohrmann · 2017-03-29T16:07:17Z

...e/src/main/java/org/apache/flink/runtime/state/filesystem/FsCheckpointStateOutputStream.java

+		}
+
+		this.basePath = basePath;
+		this.fs = fs;


checkNotNull checks could be inserted here.

tillrohrmann · 2017-03-29T16:50:22Z

...k-runtime/src/main/java/org/apache/flink/runtime/state/filesystem/FsStateBackendFactory.java

-	 * rather than in files */
-	public static final String MEMORY_THRESHOLD_CONF_KEY = "state.backend.fs.memory-threshold";
+	/** Config key for the maximum state size that is stored with the metadata */
+	public static final ConfigOption<Integer> MEMORY_THRESHOLD_CONF = ConfigOptions


I'm wondering whether we want to group all relevant state options into a dedicated StateOptions class or keep them in the respective class where they are used. I fear that it will be quite hard for the user to find all available options.

tillrohrmann · 2017-03-29T16:52:27Z

flink-runtime/src/main/java/org/apache/flink/runtime/state/memory/MemoryStateBackend.java

- * capabilities to spill to disk. Checkpoints are serialized and the serialized data is
- * transferred
+ * This state backend holds the working state in the memory (JVM heap) of the TaskManagers.
+ * The state backend checkpoints state directly the JobManager's memory (hence the backend's name),


preposition seems to be missing here

StefanRRichter · 2017-04-27T09:52:17Z

...kend-rocksdb/src/main/java/org/apache/flink/contrib/streaming/state/RocksDBStateBackend.java

+	}
+
+	@Override
+	public StreamStateHandle resolveCheckpointLocation(String pointer) throws IOException {


I would suggest introducing an actual class for the concept of pointers to improve type safety and readability. Through subclasses, it can also be easier to reason about what kinds of pointer a backend accepts or rejects, compared to thinking about how a string was parsed and failed.
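To make the suggestion concrete, a typed pointer could look roughly like this. All names are hypothetical; the thread only proposes the idea, not this API:

```java
import java.net.URI;
import java.util.Objects;

/** Hypothetical typed pointer; backends could accept or reject specific subclasses. */
abstract class CheckpointPointerSketch {
    /** The external string form, e.g. as passed on the command line. */
    abstract String asExternalString();
}

/** Pointer variant for file-system-based backends: wraps a qualified URI. */
final class FilePointerSketch extends CheckpointPointerSketch {
    private final URI uri;

    private FilePointerSketch(URI uri) {
        this.uri = uri;
    }

    /** Parsing happens once, at the boundary; invalid strings fail here, not deep in a backend. */
    static FilePointerSketch parse(String pointer) {
        URI uri = URI.create(Objects.requireNonNull(pointer, "pointer"));
        if (uri.getScheme() == null || uri.getPath() == null || uri.getPath().isEmpty()) {
            throw new IllegalArgumentException("Not a qualified file pointer: " + pointer);
        }
        return new FilePointerSketch(uri);
    }

    URI getUri() {
        return uri;
    }

    @Override
    String asExternalString() {
        return uri.toString();
    }
}
```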

StefanRRichter · 2017-04-27T09:53:48Z

...kend-rocksdb/src/main/java/org/apache/flink/contrib/streaming/state/RocksDBStateBackend.java

+
+	@Nullable
+	@Override
+	public String getMetadataPersistenceLocation() {


The metadata location might be a concept that justifies introducing a class. In particular, it seems that there might be different location concepts in the future, so simply encoding all this in a string is questionable.

StefanRRichter · 2017-04-27T10:14:26Z

flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CompletedCheckpoint.java

+			// check if we have a shortcut hook to dispose the state
+			final StateDisposeHook disposeHook = this.disposeHook;
+			if (disposeHook != null) {
+				// fast path!


This method is now a bit more lengthy than it could be; I would suggest factoring out the two paths (fast and standard) into two private methods. The textual comments are hinting at this and could become obsolete once the code is split up into two named methods.

StefanRRichter · 2017-04-27T10:18:47Z

flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CompletedCheckpoint.java

+			// check if we have a shortcut hook to dispose the state
+			final StateDisposeHook disposeHook = this.disposeHook;
+			if (disposeHook != null) {
+				// fast path!


On a second look, do you think it could even make sense to have the previous behaviour also encapsulated as a StateDisposalHook and get rid of the if completely?

Agree, that should be the case in the long run.
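One way to read this suggestion, as a sketch only: both the per-object path and the fast path become hooks, and the fallback is expressed by composition instead of an `if`. The names `DisposalStrategySketch`, `DisposeHook`, `perObjectHook`, and `withFallback` are assumptions for the example.

```java
import java.util.List;

final class DisposalStrategySketch {
    /** Common shape for every disposal path. */
    interface DisposeHook {
        void dispose() throws Exception;
    }

    /** Generic path: dispose each state object individually. */
    static DisposeHook perObjectHook(List<? extends DisposeHook> stateObjects) {
        return () -> {
            for (DisposeHook state : stateObjects) {
                state.dispose();
            }
        };
    }

    /** Runs the primary hook; if it fails, runs the fallback and keeps the first failure attached. */
    static DisposeHook withFallback(DisposeHook primary, DisposeHook fallback) {
        return () -> {
            try {
                primary.dispose();
            } catch (Exception primaryFailure) {
                try {
                    fallback.dispose();
                } catch (Exception fallbackFailure) {
                    fallbackFailure.addSuppressed(primaryFailure);
                    throw fallbackFailure;
                }
            }
        };
    }
}
```

A checkpoint without a fast hook would then simply hold the per-object hook, so the disposing code calls `dispose()` on whatever it was given.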

StefanRRichter · 2017-04-27T12:25:47Z

...rc/main/java/org/apache/flink/runtime/state/filesystem/FsCheckpointMetadataOutputStream.java

+
+/**
+ * A {@link CheckpointMetadataOutputStream} that writes into a file and
+ * returns a {@link StreamStateHandle} upon closing.


I think the comment here is outdated and should link to StreamHandleAndPointer.

StefanRRichter · 2017-04-27T12:26:48Z

...rc/main/java/org/apache/flink/runtime/state/filesystem/FsCheckpointMetadataOutputStream.java

+
+	@Override
+	public void close() {
+		if (!closed) {


Either the volatile on closed is not required or this does not actually give the intended thread safety and should be replaced by AtomicBoolean.
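For reference, a check like `if (!closed) { closed = true; ... }` is a check-then-act sequence, so two threads can both pass the check even when the flag is volatile. A sketch of the `AtomicBoolean` variant, where `CloseOnceSketch` is a hypothetical class and the counter stands in for releasing the underlying stream:

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

final class CloseOnceSketch implements AutoCloseable {
    private final AtomicBoolean closed = new AtomicBoolean(false);
    private final AtomicInteger releaseCount = new AtomicInteger();

    @Override
    public void close() {
        // compareAndSet lets exactly one caller win, even under concurrent close() calls.
        if (closed.compareAndSet(false, true)) {
            releaseCount.incrementAndGet(); // stand-in for closing the stream and deleting the file
        }
    }

    /** Number of times the release logic actually ran. */
    int releaseCount() {
        return releaseCount.get();
    }
}
```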

StefanRRichter · 2017-04-27T12:31:03Z

...rc/main/java/org/apache/flink/runtime/state/filesystem/FsCheckpointMetadataOutputStream.java

+
+			try {
+				out.close();
+				fileSystem.delete(path, false);


I wonder if this delete is intended to be inside the try or should be moved to a finally clause
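A sketch of the `finally` variant, using `java.nio` in place of the file system abstraction in the reviewed code. `DiscardingCloseSketch` is a hypothetical helper:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

final class DiscardingCloseSketch {
    /** Closes the stream and removes the partial file even if close() throws. */
    static void closeAndDiscard(OutputStream out, Path file) throws IOException {
        try {
            out.close();
        } finally {
            // Runs on both the normal and the exceptional path of close().
            Files.deleteIfExists(file);
        }
    }
}
```

One caveat of this shape: if both the close and the delete fail, the exception from the delete replaces the one from the close.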

StefanRRichter · 2017-04-27T12:47:56Z

flink-runtime/src/main/java/org/apache/flink/runtime/state/CheckpointMetadataStreamFactory.java

+	/**
+	 * A combination of a {@code StreamStateHandle} and an external pointer (in the form of a String).
+	 */
+	final class StreamHandleAndPointer {


Not entirely sure why we need this "AndPointer" part. The pointer in the end is a string and the interpretation of the string is generic and up to the backend that is confronted with it. So right now, the string is always a path, which is already contained in the plain StreamStateHandle that happens to be a FileStateHandle for the current backends. So similar to expecting a certain string format, the backend could as well expect certain subclasses of StreamStateHandle, like FileStateHandle (or something for a different backend), and resolve the "pointer" from there. Or maybe I am missing something fundamental here?

The idea of the pointer is to have a state-backend independent string that indicates a checkpoint location. I picked a string, because the location is ultimately passed as a string, either from the user command flink run -s <savepoint path> or from ZooKeeper's "last completed checkpoint" entry.

Introducing the StreamHandleAndPointer here spares the code from having to reason about state backend specific file handles. We can change that at some point, but for the breadth of changes already in this PR, this helped reduce the scope a bit.

StephanEwen · 2017-05-07T11:45:53Z

I would like to merge this for 1.3 with a slightly reduced scope.

I would include the refactoring to make the State Backends also responsible for storing the Metadata.

I would NOT merge the started work on cleanup hooks. The reason is that, after some digging, I found we actually have to treat different file systems quite differently. For example, for HDFS and POSIX-style file systems, we want to dispose the checkpoint-exclusive state via an rm -r.
However, for S3 we explicitly do not want to do that, but rather issue DELETE commands for each state object independently, because Flink has a more consistent view of what objects exist for a checkpoint than S3 has. In addition, an rm -r on the S3 file system means "list objects with path prefix; foreach object: delete", which is even worse.
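Purely as an illustration of that distinction, a disposal mode could be selected per file system. Recognizing object stores by URI scheme, the scheme list, and all names below are assumptions of this sketch, not something decided in this thread:

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class DisposalModeSketch {
    enum Mode { RECURSIVE_DIRECTORY_DELETE, PER_OBJECT_DELETE }

    // Assumption: object stores are recognized by URI scheme; the list is illustrative.
    private static final Set<String> OBJECT_STORE_SCHEMES =
            new HashSet<>(Arrays.asList("s3", "s3a", "s3n"));

    /** Picks per-object deletes for object stores and a recursive delete otherwise. */
    static Mode modeFor(URI checkpointDir) {
        String scheme = checkpointDir.getScheme();
        if (scheme != null && OBJECT_STORE_SCHEMES.contains(scheme.toLowerCase(Locale.ROOT))) {
            return Mode.PER_OBJECT_DELETE;
        }
        return Mode.RECURSIVE_DIRECTORY_DELETE;
    }
}
```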

StephanEwen · 2017-05-07T11:59:52Z

Here is a summary of why I picked a String initially for the metadata persistence location:

The idea of the pointer is to have a state-backend independent string that indicates a checkpoint location. I picked a string, because the location is ultimately passed as a string, either from the user command flink run -s or from ZooKeeper's "last completed checkpoint" entry.

Introducing the StreamHandleAndPointer here spares the code from having to reason about state backend specific file handles. We can change that at some point, but for the breadth of changes already in this PR, this helped reduce the scope a bit.

StefanRRichter · 2017-05-07T12:26:33Z

I agree that we can change the Strings to some proper types later. Besides what I mentioned before, I just thought that even if a string is the underlying data type for the moment, having it wrapped in a named Pointer class to make code more readable and "typesafe" could be beneficial.

[FLINK-5823] [checkpoints] State Backends also handle Checkpoint Meta…

460d480

…data and use global cleanup hooks This closes apache#3522

StephanEwen force-pushed the filestatebackend branch from 2bd8172 to 460d480 Compare March 13, 2017 10:58

StephanEwen mentioned this pull request Mar 17, 2017

[FLINK-6014][checkpoint] Allow the registration of state objects in checkpoints #3524

Closed

tillrohrmann approved these changes Mar 30, 2017

StefanRRichter reviewed Apr 27, 2017

StephanEwen mentioned this pull request Oct 25, 2017

[FLINK-5823] [checkpoints] State Backends also handle Checkpoint Metadata (part 1) #4907

Closed

StephanEwen closed this Jan 18, 2018

rmetzger added the component=Runtime/StateBackends label Mar 14, 2019
