
[FLINK-13633][coordination] Move submittedJobGraph and completedCheckpoint to cluster-id subdirectory of high-availability storage #9598

Closed
wants to merge 3 commits

Conversation

wangyang0918 (Contributor)

What is the purpose of the change

This pull request moves the submittedJobGraph and completedCheckpoint files to a cluster-id subdirectory of the high-availability storage. If the Flink cluster terminates exceptionally, external tools can then clean up these residual files by removing the cluster's subdirectory.

Brief change log

Use a cluster-id subdirectory instead of the HA storage root in ZooKeeperUtils#createCompletedCheckpoints() and ZooKeeperUtils#createJobGraphs(); see the path sketch below.
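For orientation, here is a minimal sketch of the resulting path layout. The configuration options are real Flink options; the helper class and method are illustrative, not the PR's exact code:

import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.HighAvailabilityOptions;
import org.apache.flink.core.fs.Path;

final class HaStoragePathSketch {
    // Illustrative helper: compute the per-cluster HA storage directory.
    static Path clusterStoragePath(Configuration configuration) {
        final String rootPath = configuration.getValue(HighAvailabilityOptions.HA_STORAGE_PATH);
        final String clusterId = configuration.getValue(HighAvailabilityOptions.HA_CLUSTER_ID);
        // Before: state files were written directly under rootPath,
        //   e.g. hdfs:///flink/ha/completedCheckpointXYZ
        // After: they live under a per-cluster subdirectory,
        //   e.g. hdfs:///flink/ha/<cluster-id>/completedCheckpointXYZ,
        // so one recursive delete of <rootPath>/<cluster-id> cleans up a terminated cluster.
        return new Path(rootPath, clusterId);
    }
}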

Verifying this change

This change adds tests and can be verified as follows:

  • Added the integration test ZooKeeperHaStorageITCase to check the high-availability storage directory structure.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (yes / no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
  • The serializers: (yes / no / don't know)
  • The runtime per-record code paths (performance sensitive): (yes / no / don't know)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes / no / don't know)
  • The S3 file system connector: (yes / no / don't know)

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

@flinkbot (Collaborator)

flinkbot commented Sep 3, 2019

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit 140ccb0 (Wed Oct 16 08:18:32 UTC 2019)

Warnings:

  • No documentation files were touched! Remember to keep the Flink docs up to date!

Mention the bot in a comment to re-run the automated checks.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into Flink.
  • ❓ 3. Needs [attention] from a reviewer.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.


The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands

The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

@wangyang0918 wangyang0918 changed the title Flink 13633 [FLINK-13633][coordination] Move submittedJobGraph and completedCheckpoint to cluster-id subdirectory of high-availability storage Sep 3, 2019
@flinkbot (Collaborator)

flinkbot commented Sep 3, 2019

CI report:

@tisonkun tisonkun (Member) left a comment

Thanks for your contribution @wangyang0918! I left several comments.

Also, I think it is worth adding some Javadoc in ZooKeeperUtils to describe the layout the HA storage should have. It would help contributors, including ourselves, understand the design here later. See also ZooKeeperHaServices.
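For illustration, the kind of layout Javadoc suggested here might look like the following sketch (the concrete entries are assumptions, not the PR's actual documentation):

/**
 * Sketch of the HA storage directory layout:
 *
 * <pre>
 * &lt;high-availability.storageDir&gt;
 *   +- &lt;high-availability.cluster-id&gt;
 *        +- blob/                      (HA blobs)
 *        +- completedCheckpointXYZ     (completed checkpoint metadata)
 *        +- submittedJobGraphXYZ       (submitted job graphs)
 * </pre>
 */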

@@ -189,6 +189,22 @@ under the License.
<scope>test</scope>
</dependency>

<dependency>
tisonkun (Member)

I notice you mentioned

Dependencies (does it add or upgrade a dependency): (yes / no)

in the pull request description, but it actually does upgrade a dependency. Could you please explain why this upgrade is needed and update the description correspondingly?

@wangyang0918 wangyang0918 (Contributor, Author) commented Sep 16, 2019

Hi @tisonkun,
Do we need to add the doc describing the layout to the util classes? I think it is more reasonable for it to stay in ZooKeeperHaServices.

For the HDFS-related dependencies: they have to be added so that we can use MiniDFSCluster in the tests, and their scope is test only.

tisonkun (Member)

For the HDFS-related dependencies: they have to be added so that we can use MiniDFSCluster in the tests, and their scope is test only.

OK, I see.

I think it is more reasonable for it to stay in ZooKeeperHaServices.

Makes sense to me.

String rootPath = configuration.getValue(HighAvailabilityOptions.HA_STORAGE_PATH);

if (rootPath == null || StringUtils.isBlank(rootPath)) {
    throw new IllegalConfigurationException("Missing high-availability storage path for metadata." +
        " Specify via configuration key '" + HighAvailabilityOptions.HA_STORAGE_PATH + "'.");
} else {
-   return new FileSystemStateStorageHelper<T>(rootPath, prefix);
+   final String clusterId = configuration.getValue(HighAvailabilityOptions.HA_CLUSTER_ID);
tisonkun (Member)

What if HighAvailabilityOptions.HA_CLUSTER_ID is configured wrongly? Shall we perform a runtime check to guard that clusterId is valid?

@wangyang0918 (Contributor, Author)

What do you mean by configured wrongly? In standalone mode, it is configured by the user. In YARN/Mesos mode, it is set automatically. It also has a default value and will not be null.

tisonkun (Member)

Yes, you're right. It's my mistake.

tisonkun (Member)

I think it is partially addressed, because Till checks that the clusterId is not empty or whitespace-only. My earlier concern here was that if the concatenated path is not a valid path, it might fail much later.
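A minimal sketch of the kind of guard discussed, assuming it sits next to where the path is built (the method name and error message are hypothetical):

private static String validatedClusterId(Configuration configuration) {
    final String clusterId = configuration.getValue(HighAvailabilityOptions.HA_CLUSTER_ID);
    // Reject an empty or whitespace-only cluster-id early; otherwise the
    // concatenated storage path is malformed and only fails at file access time.
    if (clusterId == null || clusterId.trim().isEmpty()) {
        throw new IllegalConfigurationException(
            "Invalid '" + HighAvailabilityOptions.HA_CLUSTER_ID.key() + "': must not be blank.");
    }
    return clusterId;
}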

@tisonkun (Member)

Any updates?

Also pinging @azagrebin: could you take a look at this?

@tillrohrmann tillrohrmann self-assigned this Sep 16, 2019
@tillrohrmann tillrohrmann (Contributor) left a comment

Thanks a lot for this improvement @wangyang0918. The change itself looks good to me.

The thing I would like to improve is the test case. I don't think we need to spin up an HDFS testing cluster just to test a path. Instead, I would simply test what getClusterHighAvailabilityStoragePath returns. Of course this does not give the same test coverage, but I think it is good enough. I'll push a commit which simplifies the test a bit.
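Roughly in the spirit described, such a direct path test might look like this sketch (class and method names follow the commit message below; the JUnit 4 assertion style and the signature taking a Configuration are assumptions):

// Sketch of a test method inside a test class; imports of Configuration,
// HighAvailabilityOptions, Path, org.junit.Test and org.junit.Assert assumed.
@Test
public void testClusterStoragePathEndsWithClusterId() {
    final Configuration configuration = new Configuration();
    configuration.setString(HighAvailabilityOptions.HA_STORAGE_PATH, "hdfs:///flink/ha");
    configuration.setString(HighAvailabilityOptions.HA_CLUSTER_ID, "cluster-1337");

    final Path storagePath =
        HighAvailabilityServiceUtils.getClusterHighAvailableStoragePath(configuration);

    // The cluster-id must be the last path segment so that deleting
    // <storageDir>/<cluster-id> removes everything belonging to this cluster.
    assertEquals("cluster-1337", storagePath.getName());
}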

tillrohrmann added a commit to wangyang0918/flink that referenced this pull request Sep 16, 2019
…eStoragePath

Let BlobUtils and ZooKeeperUtils call HighAvailabilityServiceUtils.getClusterHighAvailableStoragePath
to obtain cluster wide high available storage path.

This closes apache#9598.
@tillrohrmann (Contributor)

I've pushed an update which simplifies the test and makes sure that BlobUtils also uses the cluster-wide high-availability storage directory. Please take a look @wangyang0918 and @tisonkun.
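For context, a sketch of what this means for BlobUtils (variable names assumed; the point is reusing the shared helper rather than concatenating the root path by hand):

// Inside BlobUtils (sketch): derive the HA blob storage directory from the
// shared cluster-wide helper, so blobs also land under <storageDir>/<cluster-id>.
final Path clusterStoragePath =
    HighAvailabilityServiceUtils.getClusterHighAvailableStoragePath(configuration);
final Path blobStoragePath = new Path(clusterStoragePath, "blob");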

@tisonkun tisonkun (Member) left a comment

Thanks for your update @tillrohrmann! The de-duplication commits look good to me.

Checkstyle complains about an unused import. Please fix it.

@@ -23,9 +23,9 @@
import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.ConfigurationUtils;
import org.apache.flink.configuration.HighAvailabilityOptions;
tisonkun (Member)

Unused import that Checkstyle complains about.

tillrohrmann (Contributor)

Good catch. Will fix it.

wangyang0918 and others added 3 commits September 17, 2019 11:15
…point to cluster-id subdirectory of high-availability storage
…eStoragePath

Let BlobUtils and ZooKeeperUtils call HighAvailabilityServiceUtils.getClusterHighAvailableStoragePath
to obtain cluster wide high available storage path.

This closes apache#9598.
@tillrohrmann (Contributor)

I've updated the PR to remove the unused import.

@tillrohrmann (Contributor)

Travis passed. I will merge this PR now. Thanks a lot for the initial authoring @wangyang0918 and the review @tisonkun.

yanghua pushed a commit to yanghua/flink that referenced this pull request Sep 18, 2019
…eStoragePath

Let BlobUtils and ZooKeeperUtils call HighAvailabilityServiceUtils.getClusterHighAvailableStoragePath
to obtain cluster wide high available storage path.

This closes apache#9598.
@wangyang0918 wangyang0918 deleted the FLINK-13633 branch October 22, 2019 05:14