Skip to content

Conversation

@tisonkun
Copy link
Member

@tisonkun tisonkun commented Jan 3, 2020

What is the purpose of the change

ZKCheckpointIDCounter doesn't tolerate ZK suspended & reconnected while it could do. This causes that job can not trigger checkpoint forever after zookeeper change leader.

Brief change log

Allow updates to connection state when ZKCheckpointIDCounter reconnects to ZK.

Verifying this change

This change is a trivial fix that can be reasoned by code.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
  • The serializers: (no)
  • The runtime per-record code paths (performance sensitive): (no)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (no)
  • The S3 file system connector: (no)

Documentation

  • Does this pull request introduce a new feature? (no)
  • If yes, how is the feature documented? (not applicable)

@tisonkun tisonkun requested a review from tillrohrmann January 3, 2020 03:39
@tisonkun
Copy link
Member Author

tisonkun commented Jan 3, 2020

also cc @lamber-ken

@flinkbot
Copy link
Collaborator

flinkbot commented Jan 3, 2020

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit 760a023 (Fri Jan 03 03:41:39 UTC 2020)

Warnings:

  • No documentation files were touched! Remember to keep the Flink docs up to date!

Mention the bot in a comment to re-run the automated checks.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into to Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.

Details
The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer of PMC member is required Bot commands
The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

@flinkbot
Copy link
Collaborator

flinkbot commented Jan 3, 2020

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run travis re-run the last Travis build
  • @flinkbot run azure re-run the last Azure build

Copy link
Contributor

@tillrohrmann tillrohrmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing this problem @tisonkun. The changes look good to me. I was wondering whether we could add a test case for the ZooKeeperCheckpointIDCounter which ensures that we can increment the counter after reconnecting.

@tisonkun
Copy link
Member Author

Thanks for your review @tillrohrmann ! I push a follow-up commit for adding dedicate test for it.

@tisonkun tisonkun force-pushed the FLINK-14091 branch 2 times, most recently from 5fd6901 to 7bfd2b1 Compare January 14, 2020 09:23
Copy link
Contributor

@tillrohrmann tillrohrmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding the test case @tisonkun. The changes look good to me. I'll address my remaining comments while merging this PR.

Comment on lines 42 to 52
private static final ZooKeeperTestEnvironment ZOOKEEPER = new ZooKeeperTestEnvironment(3);

@AfterClass
public static void tearDown() throws Exception {
ZOOKEEPER.shutdown();
}

@Before
public void cleanUp() throws Exception {
ZOOKEEPER.deleteAll();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of the ZooKeeperTestEnvironment I would recommend using the ZooKeeperResource. Combining this with the @Rule will replace the AfterClass and Before methods. Maybe one needs to make the TestingServer accessible, though.

Comment on lines 80 to 91
// encountered connected loss, this prevents us from getting false positive
while (true) {
try {
idCounter.get();
} catch (IllegalStateException ignore) {
log.debug("Encountered connection loss.");
break;
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure that this always happens? This looks quite brittle to me. What if the restart is so fast that the client does not lose its connection?

Comment on lines 91 to 102
while (true) {
try {
long id = idCounter.get();
assertThat(id, is(localCounter.get()));
break;
} catch (IllegalStateException ignore) {
log.debug("During ZooKeeper client reconnecting...");
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a timeout/deadline here might make sense.

}
}

assertThat(idCounter.getLastState(), is(ConnectionState.RECONNECTED));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is not important for the test to ensure that some internal state is RECONNECTED. What we should try to test is that we can increment the ID counter under loss of connection but it is not important how exactly this works.

Comment on lines 153 to 156
@VisibleForTesting
ConnectionState getLastState() {
return connStateListener.lastState;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this exposes internal details which are not relevant for the test. I would try to write the test without exposing these internals.


private volatile ConnectionState lastState;
private void checkConnectionState() {
final ConnectionState lastState = this.lastState;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would not shadow the local variable. One could rename the variable currentLastState.

client.getConnectionStateListenable().addListener(connStateListener);

for (ConnectionStateListener listener : connectionStateListeners) {
client.getConnectionStateListenable().addListener(listener);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since getConnectionStateListenable does not guarantee the order in which the listener are called, the added test case is unstable (if the testing listener is called before the one which sets lastState). I suggest to introduce a LastStateConnectionStateListener (similar to SharedCountConnectionStateListener) which we pass into the class.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another way is that we instead implement a chain of listeners so that the order is deterministic.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also thought about this but I think the other approach is easier to understand.

tisonkun and others added 5 commits January 14, 2020 18:15
In order to avoid race conditions between notifying different listeners,
this commit introduces the LastStateConnectionStateListener which is passed
into the ZooKeeperCheckpointIDCounter. This listener can be modified to
fulfill the required testing purposes in
ZKCheckpointIDCounterMultiServersTest#testRecoveredAfterConnectionLoss.
@tisonkun tisonkun deleted the FLINK-14091 branch January 15, 2020 08:38
@tisonkun
Copy link
Member Author

Thanks for reviewing and merging this patch!

@Wangtao87
Copy link

Thanks for reviewing and merging this patch!

SO, does it need be fixed in FLINK 1.7 ??

@tillrohrmann
Copy link
Contributor

Thanks for reviewing and merging this patch!

SO, does it need be fixed in FLINK 1.7 ??

@Wangtao87 the community no longer actively supports Flink 1.7. Hence you would need to backport the fix to this version yourself if needed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants