Conversation

@becketqin
Contributor

What is the purpose of the change

Fix unstable Kafka IT case by checking the topic existence with KafkaConsumer.

Brief change log

Check the topic existence with KafkaConsumer.

Verifying this change

This change is already covered by existing tests.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
  • The serializers: (no)
  • The runtime per-record code paths (performance sensitive): (no)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: (no)
  • The S3 file system connector: (no)

Documentation

  • Does this pull request introduce a new feature? (no)
  • If yes, how is the feature documented? (not applicable)

@becketqin
Contributor Author

@pnowojski Will you have time to take a look? Thanks.

@flinkbot
Collaborator

flinkbot commented May 19, 2020

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit 175623d (Fri Oct 16 10:53:55 UTC 2020)

Warnings:

  • No documentation files were touched! Remember to keep the Flink docs up to date!

Mention the bot in a comment to re-run the automated checks.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.

Details
The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands
The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

Contributor

@aljoscha aljoscha left a comment

The changes look good, modulo one comment.

I'm wondering why you only fix it for Kafka 0.10 and 0.11? I know that the Kafka "Modern" version uses the AdminClient directly but it seems some of the failures on https://issues.apache.org/jira/browse/FLINK-12030 also happened on that "modern" Kafka connector.

do {
    topicCreated = !consumer.partitionsFor(topic).isEmpty();
    if (!topicCreated) {
        Thread.sleep(1);

Maybe a 1 ms sleep is a bit too aggressive. 10 or even 100 should work as well.
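For illustration only, here is a minimal sketch of a polling loop with a longer sleep and an upper bound on the total wait. It is not the code in this PR: `TopicWait`, `waitUntil`, and the interval and timeout values are hypothetical, and a plain `BooleanSupplier` stands in for the `!consumer.partitionsFor(topic).isEmpty()` check so the example runs without a Kafka broker.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper, not part of Flink or Kafka.
final class TopicWait {

    /**
     * Polls `check` until it returns true or `timeoutMs` has elapsed.
     * Returns true if the condition was met, false if the wait timed out.
     */
    static boolean waitUntil(BooleanSupplier check, long pollIntervalMs, long timeoutMs)
            throws InterruptedException {
        final long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (!check.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // give up instead of looping forever
            }
            Thread.sleep(pollIntervalMs);
        }
        return true;
    }

    public static void main(String[] args) throws InterruptedException {
        // Stand-in for the topic-existence check: becomes true on the third poll.
        int[] polls = {0};
        boolean created = waitUntil(() -> ++polls[0] >= 3, 100, 5_000);
        System.out.println("created=" + created + ", polls=" + polls[0]);
    }
}
```

In a real test the supplier would wrap the consumer call. The deadline keeps a missing topic from hanging the test run indefinitely.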

@flinkbot
Collaborator

flinkbot commented May 19, 2020

CI report:

Bot commands
The @flinkbot bot supports the following commands:
  • @flinkbot run travis re-run the last Travis build
  • @flinkbot run azure re-run the last Azure build

@becketqin
Contributor Author

@aljoscha Thanks for the review. I went through all the reported failures in the Jira ticket. It looks like people are reporting two different issues.

  1. The "producer already closed" error caused by the race condition in StreamOperatorWrapper that we are trying to fix in FLINK-16383.
  2. The "does not host this topic-partition" error we are trying to fix here.

All the failures caused by issue 2 are from either the 0.10 or the 0.11 Kafka connector, while issue 1 is reported on all versions. I think people are just adding comments to the same ticket once they see this test fail, regardless of the failure cause.

That being said, theoretically speaking, using KafkaAdminClient to create a topic does not 100% guarantee that a producer will not see the "does not host this topic-partition" error. This is because the AdminClient can only guarantee that the topic metadata exists on the broker to which it sent the CreateTopicRequest. When a producer comes along at a later point, it might send a TopicMetadataRequest to a different broker, and that broker may not have received the updated topic metadata yet. But this is very unlikely to happen, given that the brokers usually receive the metadata update at about the same time. Having retries configured on the producer side should be sufficient to handle such cases. We can also do that for the 0.10 and 0.11 producers. But given that we have the producer properties scattered all over the place (which is something we probably should avoid to begin with), it would be simpler to just make sure the topic has been created successfully before we start the tests.
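As a rough sketch of the producer-side alternative mentioned above (not what this PR does): the Kafka producer settings `retries` and `retry.backoff.ms` could be added to the properties handed to a test producer. The `ProducerRetryProps` class, the `withRetries` helper, the bootstrap address, and the values 5 and 200 are assumptions chosen for illustration.

```java
import java.util.Properties;

// Hypothetical helper, not part of Flink or Kafka.
final class ProducerRetryProps {

    /** Returns a copy of `base` with retry settings added; `base` is left unchanged. */
    static Properties withRetries(Properties base, int retries, long backoffMs) {
        Properties props = new Properties();
        props.putAll(base);
        // Standard Kafka producer configuration keys.
        props.setProperty("retries", Integer.toString(retries));
        props.setProperty("retry.backoff.ms", Long.toString(backoffMs));
        return props;
    }

    public static void main(String[] args) {
        Properties base = new Properties();
        base.setProperty("bootstrap.servers", "localhost:9092"); // placeholder address
        Properties props = withRetries(base, 5, 200L);
        System.out.println("retries=" + props.getProperty("retries")
                + ", retry.backoff.ms=" + props.getProperty("retry.backoff.ms"));
    }
}
```

Because the producer properties are built in several places, a helper like this would have to be applied at each of them, which is the extra work the comment above argues against.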

@becketqin
Contributor Author

Patch merged.
master: 51a0d42
release-1.11: 0f07223

@becketqin becketqin closed this May 20, 2020
@aljoscha
Contributor

Thanks for the explanation!
