
KAFKA-8876: KafkaBasedLog does not throw exception when some partitions of the topic is offline #7300

Open
wants to merge 2 commits into base: trunk
Conversation

@huxihx (Contributor) commented Sep 5, 2019


https://issues.apache.org/jira/browse/KAFKA-8876

When starting up, KafkaBasedLog should throw ConnectException if any of the subscribed partitions has no leader.
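For reference, a minimal sketch of the kind of check described above, assuming the partition metadata is fetched with consumer.partitionsFor(topic) during start(); the helper name and wiring are hypothetical, not the exact code in this patch:

import java.util.ArrayList;
import java.util.List;
import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.connect.errors.ConnectException;

// Hypothetical helper mirroring the intended start()-time check.
static void verifyPartitionsOnline(Consumer<?, ?> consumer, String topic) {
    List<PartitionInfo> partitionInfos = consumer.partitionsFor(topic);
    if (partitionInfos == null || partitionInfos.isEmpty())
        throw new ConnectException("Could not look up partition metadata for topic '" + topic + "'");

    List<TopicPartition> leaderless = new ArrayList<>();
    for (PartitionInfo info : partitionInfos) {
        if (info.leader() == null)  // leader not elected or broker offline
            leaderless.add(new TopicPartition(info.topic(), info.partition()));
    }
    if (!leaderless.isEmpty())
        throw new ConnectException("Partitions " + leaderless + " of topic '" + topic + "' currently have no leader");
}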

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@huxihx (Contributor, Author) commented Sep 6, 2019

@ewencp Could you take some time to review this patch? Thanks.

@huxihx (Contributor, Author) commented Sep 11, 2019

retest this please.

@huxihx (Contributor, Author) commented Sep 11, 2019

@ijuma Could you take some time to review this patch? Thanks.

@mduggan commented Jan 6, 2020

Thanks @huxihx for making a patch. I don't feel like I understand the problem well enough to review it, but I'd like to bump this thread with @ewencp and @ijuma if possible - we've seen it in production and it can cause data to be skipped in the binlog, which is not good.

@ewencp (Contributor) left a comment

Generally this looks OK in that it is a more thorough check; just a couple of minor nits.

We should think about how long we want to wait for something like this. This patch kind of conflates 2 potentially different types of issues that might warrant different behavior. Not getting any partition info (which tbh I'm not sure can even come back null anymore, it might always result in an exception) means we're having some fundamental issue getting metadata from the cluster. In contrast, leaderless partitions mean the cluster may generally be healthy, but just a couple of nodes/one partition is having an issue.

These are different levels of brokenness. The former either means your cluster is completely hosed or you misconfigured something. The latter could just be a temporary outage, which might be covered by CREATE_TOPIC_TIMEOUT_MS, but might not be if it requires human intervention.

If we had a leader outage like this during operation rather than during start(), would we want it to try to delay longer and recover, only logging errors to make the operator aware, or would we want it to shut down? I think we might want the former, in which case we might want a somewhat different solution to this problem -- not having checked carefully, presumably something having to do with seekToBeginning and reading to the end of all partitions.
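A rough sketch of that retry-and-log alternative (illustrative only, not part of this PR; it assumes the class has a consumer, a Time instance named time, an SLF4J log, and a configurable topicMetadataTimeoutMs):

// Retry and log for leaderless partitions until a deadline; fail fast only
// when no metadata can be fetched at all.
long deadline = time.milliseconds() + topicMetadataTimeoutMs;
while (true) {
    List<PartitionInfo> infos = consumer.partitionsFor(topic);
    if (infos == null || infos.isEmpty())
        throw new ConnectException("Could not fetch partition metadata for topic '" + topic + "'");

    List<Integer> leaderless = new ArrayList<>();
    for (PartitionInfo info : infos)
        if (info.leader() == null)
            leaderless.add(info.partition());

    if (leaderless.isEmpty())
        break;  // every partition has a leader; proceed with startup
    if (time.milliseconds() >= deadline)
        throw new ConnectException("Partitions " + leaderless + " of topic '" + topic
                + "' still have no leader after " + topicMetadataTimeoutMs + " ms");

    log.warn("Partitions {} of topic {} have no leader; waiting and retrying", leaderless, topic);
    time.sleep(1000);  // back off before refreshing metadata
}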

public void run() {
}
};
this.initializer = initializer != null ? initializer : () -> { };
@ewencp (Contributor) commented on this change:
Since this doesn't change functionality, we probably don't want to change this just to update to modern syntax. The more changes we make like this, the harder it is to backport other fixes that might overlap with this diff, and ideally we backport fixes aggressively (and in fact, this could be an example where we might want to backport to a version that supports jdk7).

@@ -238,7 +236,11 @@ public void send(K key, V value, org.apache.kafka.clients.producer.Callback call
producer.send(new ProducerRecord<>(topic, key, value), callback);
}


// package level visibility for testing only
void setTopicMetadataTimeoutMs(long timeoutMs) {
@ewencp (Contributor) commented on this change:
You might want to do this via an alternative, package-private constructor instead, so we can at a minimum make topicMetadataTimeoutMs final.

If we don't change this, the convention in Kafka is not to use get/set prefixes in method names.
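A minimal illustration of that alternative (class and field names here are simplified placeholders, not the real KafkaBasedLog signature):

// The public constructor delegates to a package-private one, so tests can
// shorten the timeout while the field stays final.
public class ExampleLog {
    static final long DEFAULT_TOPIC_METADATA_TIMEOUT_MS = 30_000L;

    private final String topic;
    private final long topicMetadataTimeoutMs;

    public ExampleLog(String topic) {
        this(topic, DEFAULT_TOPIC_METADATA_TIMEOUT_MS);
    }

    // package-private: reachable from tests in the same package only
    ExampleLog(String topic, long topicMetadataTimeoutMs) {
        this.topic = topic;
        this.topicMetadataTimeoutMs = topicMetadataTimeoutMs;
    }
}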

@huxihx (Contributor, Author) commented Feb 10, 2020

@ewencp Thanks for the response. The key question for this patch is how to handle partitions that are not online. The original code only covers the first issue you mentioned, namely a total failure to fetch metadata from the cluster. The patch indeed makes a stricter check: any unavailable partition leads to a shutdown. This is deliberate, since I think it is not easy to clearly distinguish between the two kinds of brokenness you mentioned. Besides, it seems there is little we can do with such temporarily unavailable partitions after the timeout other than throwing an exception. What do you think?
