[FLINK-14091][coordination] Allow updates to connection state when ZKCheckpointIDCounter reconnects to ZK #10754

tisonkun · 2020-01-03T03:39:19Z

What is the purpose of the change

ZKCheckpointIDCounter does not tolerate the ZooKeeper connection being suspended and then reconnected, although it could. As a result, a job can never trigger a checkpoint again after ZooKeeper changes its leader.

Brief change log

Allow updates to connection state when ZKCheckpointIDCounter reconnects to ZK.
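The idea can be illustrated with a minimal, self-contained sketch. It assumes the counter rejects operations while the last observed connection state is SUSPENDED or LOST, and that the fix amounts to recording every state transition so that a later RECONNECTED clears the error condition. `ConnState` and `SketchCheckpointIdCounter` are hypothetical stand-ins, not Curator's `ConnectionState` or Flink's actual class.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Stand-in for Curator's ConnectionState; only the values used here. */
enum ConnState { CONNECTED, SUSPENDED, RECONNECTED, LOST }

/** Hypothetical simplified counter, not Flink's ZooKeeperCheckpointIDCounter. */
final class SketchCheckpointIdCounter {
    private final AtomicLong counter = new AtomicLong(1L);

    // Last state reported by the client; null until the first notification.
    private volatile ConnState lastState;

    /** Records every transition, so RECONNECTED overwrites an earlier SUSPENDED. */
    void stateChanged(ConnState newState) {
        lastState = newState;
    }

    /** Returns the current ID and advances the counter, unless the connection is down. */
    long getAndIncrement() {
        checkConnectionState();
        return counter.getAndIncrement();
    }

    private void checkConnectionState() {
        // Read the volatile field once so the check and the message agree.
        final ConnState current = lastState;
        if (current == ConnState.SUSPENDED || current == ConnState.LOST) {
            throw new IllegalStateException("Connection state: " + current);
        }
    }
}
```

If the listener only ever stored SUSPENDED or LOST, the check above would keep failing after a reconnect, which matches the symptom described in the purpose section.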

Verifying this change

This change is a trivial fix that can be verified by reading the code.

Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): (no)
The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
The serializers: (no)
The runtime per-record code paths (performance sensitive): (no)
Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (no)
The S3 file system connector: (no)

Documentation

Does this pull request introduce a new feature? (no)
If yes, how is the feature documented? (not applicable)

tisonkun · 2020-01-03T03:39:35Z

also cc @lamber-ken

flinkbot · 2020-01-03T03:41:39Z

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit 760a023 (Fri Jan 03 03:41:39 UTC 2020)

Warnings:

No documentation files were touched! Remember to keep the Flink docs up to date!

Mention the bot in a comment to re-run the automated checks.

Review Progress

❓ 1. The [description] looks good.
❓ 2. There is [consensus] that the contribution should go into Flink.
❓ 3. Needs [attention] from.
❓ 4. The change fits into the overall [architecture].
❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.

Details

The bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
@flinkbot approve all to approve all aspects
@flinkbot approve-until architecture to approve everything until architecture
@flinkbot attention @username1 [@username2 ..] to require somebody's attention
@flinkbot disapprove architecture to remove an approval you gave earlier

flinkbot · 2020-01-03T04:00:44Z

CI report:

760a023 Travis: SUCCESS Azure: SUCCESS
7bfd2b1 Travis: FAILURE Azure: SUCCESS
4d0d330 Travis: FAILURE Azure: SUCCESS
357601e Travis: SUCCESS Azure: SUCCESS

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot run travis re-run the last Travis build
@flinkbot run azure re-run the last Azure build

tillrohrmann

Thanks for fixing this problem @tisonkun. The changes look good to me. I was wondering whether we could add a test case for the ZooKeeperCheckpointIDCounter which ensures that we can increment the counter after reconnecting.

tisonkun · 2020-01-14T09:05:42Z

Thanks for your review @tillrohrmann! I pushed a follow-up commit that adds a dedicated test for it.

tillrohrmann

Thanks for adding the test case @tisonkun. The changes look good to me. I'll address my remaining comments while merging this PR.

tillrohrmann · 2020-01-14T14:14:31Z

...src/test/java/org/apache/flink/runtime/checkpoint/ZKCheckpointIDCounterMultiServersTest.java

+	private static final ZooKeeperTestEnvironment ZOOKEEPER = new ZooKeeperTestEnvironment(3);
+
+	@AfterClass
+	public static void tearDown() throws Exception {
+		ZOOKEEPER.shutdown();
+	}
+
+	@Before
+	public void cleanUp() throws Exception {
+		ZOOKEEPER.deleteAll();
+	}


Instead of the ZooKeeperTestEnvironment I would recommend using the ZooKeeperResource. Combining this with the @Rule will replace the AfterClass and Before methods. Maybe one needs to make the TestingServer accessible, though.

tillrohrmann · 2020-01-14T14:15:16Z

...src/test/java/org/apache/flink/runtime/checkpoint/ZKCheckpointIDCounterMultiServersTest.java

+		// encountered connected loss, this prevents us from getting false positive
+		while (true) {
+			try {
+				idCounter.get();
+			} catch (IllegalStateException ignore) {
+				log.debug("Encountered connection loss.");
+				break;
+			}
+		}


Are you sure that this always happens? This looks quite brittle to me. What if the restart is so fast that the client does not lose its connection?

tillrohrmann · 2020-01-14T14:16:33Z

...src/test/java/org/apache/flink/runtime/checkpoint/ZKCheckpointIDCounterMultiServersTest.java

+		while (true) {
+			try {
+				long id = idCounter.get();
+				assertThat(id, is(localCounter.get()));
+				break;
+			} catch (IllegalStateException ignore) {
+				log.debug("During ZooKeeper client reconnecting...");
+			}
+		}


Adding a timeout/deadline here might make sense.
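A deadline-bounded version of such a retry loop could look like the sketch below. `DeadlineRetry` and `getWithDeadline` are hypothetical names, and the 10 ms back-off is an arbitrary choice. The sketch only assumes, as in the snippet above, that reads throw `IllegalStateException` while the client is reconnecting.

```java
import java.time.Duration;
import java.util.function.LongSupplier;

/** Hypothetical test helper; not part of Flink. */
final class DeadlineRetry {

    private DeadlineRetry() {}

    /**
     * Retries the read until it stops throwing IllegalStateException
     * or the timeout elapses, whichever comes first.
     */
    static long getWithDeadline(LongSupplier read, Duration timeout) {
        final long deadlineNanos = System.nanoTime() + timeout.toNanos();
        while (true) {
            try {
                return read.getAsLong();
            } catch (IllegalStateException e) {
                // Compare via subtraction so nanoTime overflow is handled correctly.
                if (System.nanoTime() - deadlineNanos >= 0) {
                    throw new IllegalStateException("No successful read within " + timeout, e);
                }
                try {
                    Thread.sleep(10L); // brief back-off instead of busy spinning
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while waiting", ie);
                }
            }
        }
    }
}
```

In the test, the loop body would then shrink to something like `long id = DeadlineRetry.getWithDeadline(idCounter::get, Duration.ofSeconds(30));`, so a client that never reconnects fails the test instead of hanging the build.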

tillrohrmann · 2020-01-14T14:17:24Z

...src/test/java/org/apache/flink/runtime/checkpoint/ZKCheckpointIDCounterMultiServersTest.java

+			}
+		}
+
+		assertThat(idCounter.getLastState(), is(ConnectionState.RECONNECTED));


I think it is not important for the test to ensure that some internal state is RECONNECTED. What we should try to test is that we can increment the ID counter under loss of connection but it is not important how exactly this works.

tillrohrmann · 2020-01-14T14:18:01Z

...-runtime/src/main/java/org/apache/flink/runtime/checkpoint/ZooKeeperCheckpointIDCounter.java

+	@VisibleForTesting
+	ConnectionState getLastState() {
+		return connStateListener.lastState;
+	}


I think this exposes internal details which are not relevant for the test. I would try to write the test without exposing these internals.

tillrohrmann · 2020-01-14T14:21:38Z

...-runtime/src/main/java/org/apache/flink/runtime/checkpoint/ZooKeeperCheckpointIDCounter.java

-
-		private volatile ConnectionState lastState;
+	private void checkConnectionState() {
+		final ConnectionState lastState = this.lastState;


I would not shadow the field with a local variable of the same name. One could rename the local variable to currentLastState.

tillrohrmann · 2020-01-14T15:13:27Z

...-runtime/src/main/java/org/apache/flink/runtime/checkpoint/ZooKeeperCheckpointIDCounter.java

-				client.getConnectionStateListenable().addListener(connStateListener);
+
+				for (ConnectionStateListener listener : connectionStateListeners) {
+					client.getConnectionStateListenable().addListener(listener);


Since getConnectionStateListenable does not guarantee the order in which the listeners are called, the added test case is unstable (if the testing listener is called before the one which sets lastState). I suggest introducing a LastStateConnectionStateListener (similar to SharedCountConnectionStateListener) which we pass into the class.

Nice catch!

Another way is that we instead implement a chain of listeners so that the order is deterministic.

I also thought about this but I think the other approach is easier to understand.
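The injected-listener approach discussed in this thread can be sketched as follows. All types here are hypothetical, self-contained stand-ins (no Curator dependency), and the real LastStateConnectionStateListener may differ in shape. The point is that a single listener both records the state and signals the test, so the ordering between the two steps is fixed by the code rather than by the notification order.

```java
import java.util.Optional;
import java.util.concurrent.CompletableFuture;

/** Stand-in for Curator's ConnectionState; only the values used here. */
enum LinkState { CONNECTED, SUSPENDED, RECONNECTED, LOST }

/** Hypothetical listener contract: record transitions, expose the latest one. */
interface LastStateListener {
    void stateChanged(LinkState newState);

    Optional<LinkState> getLastState();
}

/** Production-style implementation that only remembers the last state. */
final class DefaultLastStateListener implements LastStateListener {
    private volatile LinkState lastState;

    @Override
    public void stateChanged(LinkState newState) {
        lastState = newState;
    }

    @Override
    public Optional<LinkState> getLastState() {
        return Optional.ofNullable(lastState);
    }
}

/** Test-side wrapper: updates the state first, then signals the test. */
final class SignallingLastStateListener implements LastStateListener {
    private final LastStateListener delegate = new DefaultLastStateListener();
    final CompletableFuture<Void> connectionLost = new CompletableFuture<>();

    @Override
    public void stateChanged(LinkState newState) {
        delegate.stateChanged(newState); // state is visible before the signal fires
        if (newState == LinkState.SUSPENDED || newState == LinkState.LOST) {
            connectionLost.complete(null);
        }
    }

    @Override
    public Optional<LinkState> getLastState() {
        return delegate.getLastState();
    }
}

/** Hypothetical counter that receives its listener from the caller. */
final class ListenerBackedCounter {
    private final LastStateListener listener;
    private long next = 1L;

    ListenerBackedCounter(LastStateListener listener) {
        this.listener = listener;
    }

    synchronized long getAndIncrement() {
        final LinkState state = listener.getLastState().orElse(LinkState.CONNECTED);
        if (state == LinkState.SUSPENDED || state == LinkState.LOST) {
            throw new IllegalStateException("Connection state: " + state);
        }
        return next++;
    }
}
```

A test can wait on `connectionLost` and be sure the counter already observes the lost connection, which addresses the brittleness of polling until an exception happens to appear.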

…CheckpointIDCounter reconnects to ZK

…rd from connection loss

… testable codebase

In order to avoid race conditions between notifying different listeners, this commit introduces the LastStateConnectionStateListener which is passed into the ZooKeeperCheckpointIDCounter. This listener can be modified to fulfill the required testing purposes in ZKCheckpointIDCounterMultiServersTest#testRecoveredAfterConnectionLoss.

tisonkun · 2020-01-15T08:38:37Z

Thanks for reviewing and merging this patch!

Wangtao87 · 2020-02-20T14:23:37Z

Thanks for reviewing and merging this patch!

So, does it need to be fixed in Flink 1.7?

tillrohrmann · 2020-02-25T16:58:39Z

Thanks for reviewing and merging this patch!

So, does it need to be fixed in Flink 1.7?

@Wangtao87 the community no longer actively supports Flink 1.7. Hence you would need to backport the fix to this version yourself if needed.

tisonkun requested a review from tillrohrmann January 3, 2020 03:39

rmetzger added the review=description? label Jan 3, 2020

rmetzger added the component=Runtime/Checkpointing label Jan 3, 2020

tillrohrmann self-assigned this Jan 10, 2020

tillrohrmann approved these changes Jan 10, 2020

tisonkun force-pushed the FLINK-14091 branch from 760a023 to f703fcc Compare January 14, 2020 09:05

tisonkun force-pushed the FLINK-14091 branch 2 times, most recently from 5fd6901 to 7bfd2b1 Compare January 14, 2020 09:23

tillrohrmann approved these changes Jan 14, 2020

tillrohrmann reviewed Jan 14, 2020

tisonkun and others added 5 commits January 14, 2020 18:15

[FLINK-14091][coordination] Allow updates to connection state when ZK…

59f2c38

…CheckpointIDCounter reconnects to ZK

[FLINK-14091][tests] Tests ZooKeeperCheckpointIDCounter can be recove…

598e84f

…rd from connection loss

[FLINK-14091][tests] Refactor ZooKeeperCheckpointIDCounter for a more…

200fa10

… testable codebase

[hotfix] Fix checkstyle violations in ZooKeeperCheckpointIDCounter

7814b87

tillrohrmann force-pushed the FLINK-14091 branch from 4d0d330 to 357601e Compare January 14, 2020 17:16

tillrohrmann closed this in 7455a09 Jan 15, 2020

tisonkun deleted the FLINK-14091 branch January 15, 2020 08:38
