[FLINK-20928] Fix flaky test by retrying notifyCheckpointComplete until either commit success or timeout #17342

Merged: 1 commit merged into apache:master on Oct 6, 2021

Conversation

@lindong28 (Member) commented on Sep 23, 2021

What is the purpose of the change

The test KafkaSourceReaderTest.testOffsetCommitOnCheckpointComplete is flaky according to the test failure history in FLINK-20928. This PR attempts to fix this flaky test.

Brief change log

Here are the problems with the existing code that could explain why the test is flaky:

  1. The test calls KafkaSourceReader.notifyCheckpointComplete(...) once and expects the offset commit to be successful.
  2. However, KafkaSourceReader.notifyCheckpointComplete(...) does not guarantee that the offset commit succeeds. This is because it calls KafkaConsumer.commitAsync(...) just once and won't retry even if the commit fails with a retriable exception (see the sketch after this list).
  3. During the test, if the coordinator is temporarily unavailable due to e.g. coordinator movement or network disconnection, the test will fail with a TimeoutException.
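
To make the failure mode in (2) concrete, here is a minimal sketch of a single fire-and-forget commit using the plain Kafka consumer API. It is not the Flink reader code; the class and variable names are illustrative.

    import java.util.Map;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.clients.consumer.RetriableCommitFailedException;
    import org.apache.kafka.common.TopicPartition;

    class CommitOnceSketch {
        // A single commitAsync() call fires one request and reports the outcome to the callback.
        // If the group coordinator is temporarily unavailable, the callback receives a
        // RetriableCommitFailedException and nothing retries the commit on the caller's behalf.
        static void commitOnce(
                KafkaConsumer<?, ?> consumer,
                Map<TopicPartition, OffsetAndMetadata> offsetsToCommit) {
            consumer.commitAsync(offsetsToCommit, (offsets, exception) -> {
                if (exception instanceof RetriableCommitFailedException) {
                    // Transient failure (e.g. coordinator movement): the offsets were not
                    // committed, and a test asserting on committed offsets would time out.
                    System.err.println("Transient offset commit failure: " + exception.getMessage());
                }
            });
        }
    }

A test that triggers such a commit exactly once therefore races against coordinator availability, which matches the observed flakiness.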

This PR made the following changes to address the issues described above:

  1. Updated KafkaSourceReader.notifyCheckpointComplete so that it can be called multiple times with the same checkpointId.
  2. Updated CommonTestUtils.waitUtil(...) to support a user-specified sleep time. Previously, waitUtil(...) hardcoded the sleep time to 1 ms.
  3. Updated KafkaSourceReaderTest.testOffsetCommitOnCheckpointComplete to retry KafkaSourceReader.notifyCheckpointComplete once per second until either the offset commit has completed or the max wait time has been reached (a sketch of this retry pattern follows the list).
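
A minimal sketch of the retry pattern described in change (3), assuming a one-second retry interval and a bounded overall wait. The method and parameter names are illustrative stand-ins, not the actual CommonTestUtils or KafkaSourceReaderTest code.

    import java.time.Duration;
    import java.util.function.BooleanSupplier;

    class RetryUntilCommittedSketch {
        // Re-trigger the commit once per second until the committed offsets are observed
        // or the deadline passes, instead of relying on a single notifyCheckpointComplete call.
        static void retryUntil(
                Runnable triggerCommit,           // e.g. reader.notifyCheckpointComplete(checkpointId)
                BooleanSupplier offsetsCommitted, // condition checked against the committed offsets
                Duration timeout) throws InterruptedException {
            long deadline = System.nanoTime() + timeout.toNanos();
            while (!offsetsCommitted.getAsBoolean()) {
                if (System.nanoTime() >= deadline) {
                    throw new AssertionError("Offsets were not committed before the timeout");
                }
                triggerCommit.run();
                Thread.sleep(1000L); // retry interval of one second, as described above
            }
        }
    }

This is also why notifyCheckpointComplete must tolerate being called multiple times with the same checkpointId, which is what change (1) provides.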

Verifying this change

The test KafkaSourceReaderTest#testOffsetCommitOnCheckpointComplete passed consistently across 200 runs.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
  • The serializers: (no)
  • The runtime per-record code paths (performance sensitive): (no)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: (no)
  • The S3 file system connector: (no)

Documentation

  • Does this pull request introduce a new feature? (no)
  • If yes, how is the feature documented? (no)

@flinkbot (Collaborator) commented:

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit 7f92b13 (Thu Sep 23 14:53:20 UTC 2021)

Warnings:

  • No documentation files were touched! Remember to keep the Flink docs up to date!

Mention the bot in a comment to re-run the automated checks.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into to Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.


The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands
The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

@lindong28 changed the title from "[FLINK-20928] Fix flaky test by invoking notifyCheckpointComplete periodically" to "[FLINK-20928] Fix flaky test by retrying notifyCheckpointComplete until either commit success or timeout" on Sep 23, 2021
@lindong28 force-pushed the FLINK-20928 branch 2 times, most recently from 46d4952 to a6fc284 on September 23, 2021 15:14
@flinkbot (Collaborator) commented on Sep 23, 2021:

CI report:

Bot commands

The @flinkbot bot supports the following commands:
  • @flinkbot run travis to re-run the last Travis build
  • @flinkbot run azure to re-run the last Azure build

@lindong28 (Member, Author) commented:

@flinkbot run azure

@fapaul left a comment:
@lindong28 very nice catch. I think your analysis of the cause is correct: the KafkaSourceReader does not treat the offset commit as mandatory, which can lead to flaky tests. Your retry loop should harden the test.

@@ -73,7 +73,7 @@ public void commitOffsets(
         if (offsetsToCommit.isEmpty()) {
             return;
         }
-        SplitFetcher<Tuple3<T, Long, Long>, KafkaPartitionSplit> splitFetcher = fetchers.get(0);
+        SplitFetcher<Tuple3<T, Long, Long>, KafkaPartitionSplit> splitFetcher = getRunningFetcher();
Review comment:
Nit: Does this change have any effect on the fix? If not, maybe make it a separate commit.

Review comment:
Can you explain a bit more why this is a performance improvement?

@lindong28 (Member, Author) replied:
Thanks for the review @fapaul. This change does not affect the fix. I have updated the PR to remove this change.

Regarding the reason why this could improve performance: let's assume the first fetcher created by this KafkaSourceFetcherManager has been closed and removed from fetchers. Prior to this change, every time commitOffsets() was called it would create a new SplitFetcher just to commit the offsets. If commitOffsets() is called N times, then N SplitFetchers will be created, which is quite inefficient.

To fix this problem, we can commit the offsets using any running fetcher in fetchers, which is achieved by using getRunningFetcher() here.
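
A simplified sketch of that idea, assuming a hypothetical Fetcher interface; it is not the actual KafkaSourceFetcherManager or SplitFetcher API.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicInteger;

    class FetcherReuseSketch {
        // Illustrative stand-in for a split fetcher; not the real Flink interface.
        interface Fetcher {
            boolean isRunning();
            void enqueue(Runnable commitTask);
        }

        private final Map<Integer, Fetcher> fetchers = new ConcurrentHashMap<>();
        private final AtomicInteger fetcherIds = new AtomicInteger();

        void commitOffsets(Runnable commitTask) {
            // Reuse any fetcher that is still running instead of always taking fetchers.get(0).
            // Once fetcher 0 is closed, the old approach created a brand-new fetcher on every
            // commit; here a new fetcher is only created when none is running at all.
            Fetcher fetcher = getRunningFetcher();
            if (fetcher == null) {
                fetcher = createFetcher(); // hypothetical factory standing in for the real one
            }
            fetcher.enqueue(commitTask);
        }

        private Fetcher getRunningFetcher() {
            for (Fetcher fetcher : fetchers.values()) {
                if (fetcher.isRunning()) {
                    return fetcher;
                }
            }
            return null;
        }

        private Fetcher createFetcher() {
            Fetcher fetcher = new Fetcher() {
                public boolean isRunning() { return true; }
                public void enqueue(Runnable commitTask) { commitTask.run(); }
            };
            fetchers.put(fetcherIds.incrementAndGet(), fetcher);
            return fetcher;
        }
    }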

@lindong28 (Member, Author) replied:
I created https://issues.apache.org/jira/browse/FLINK-24398 to track this issue.

@AHeise merged commit cd50229 into apache:master on Oct 6, 2021
@AHeise (Contributor) commented on Oct 6, 2021:

Thank you very much for the contribution. I merged it into master. Could you please create backport PRs?

@lindong28 (Member, Author) commented on Oct 8, 2021:

@AHeise Thank you for helping review the PR. This PR just fixes a flaky test. Does this need to be backported?

I am happy to create backport PRs. I have not done this before. Could you let me know which branches need to have this backport PR?

@fapaul commented on Oct 12, 2021:

@lindong28 sorry for the late response. Can you cherry-pick your commit and create a pull request against the 1.14 branch?

@lindong28 (Member, Author) replied:

Thanks @fapaul. I have created #17457 as suggested.

@AHeise (Contributor) commented on Oct 14, 2021:

I merged the backport into 1.14. According to the ticket it also affects 1.13. Can you verify that and do another backport? If not, please close the ticket.

@lindong28 (Member, Author) replied:

Thanks @AHeise. Sure, I have created #17488 to backport this fix to the 1.13 branch.
