New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KAFKA-8876:KafkaBasedLog does not throw exception when some partition… #7300
base: trunk
Are you sure you want to change the base?
Conversation
…s of the topic is offline https://issues.apache.org/jira/browse/KAFKA-8876 When starting up, KafkaBasedLog should throw ConnectException if any of the subscribed partitions has no leader.
@ewencp Could you take some time to review this patch? Thanks. |
retest this please. |
@ijuma Could you take some time to review this patch? Thanks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally this looks ok in that it is a more thorough check and just a couple minor nits.
We should think about how long we want to wait for something like this. This patch kind of conflates 2 potentially different types of issues that might warrant different behavior. Not getting any partition info (which tbh I'm not sure can even come back null anymore, it might always result in an exception) means we're having some fundamental issue getting metadata from the cluster. In contrast, leaderless partitions mean the cluster may generally be healthy, but just a couple of nodes/one partition is having an issue. These are different levels of brokenness. The former either means your cluster is completely hosed or you misconfigured something. The latter could just be a temporary outage, which might be covered by CREATE_TOPIC_TIMEOUT_MS
, but might not be if it requires human intervention. If we had a leader outage like this during operation rather than during start()
, would we want it to try to delay longer and recover, only logging errors to make the operator aware or would we want it to shut down? I think we might want the former, in which case we might want a somewhat different solution to this problem -- not having checked carefully, presumably something having to do with seekToBeginning and reading to the end of all partitions.
public void run() { | ||
} | ||
}; | ||
this.initializer = initializer != null ? initializer : () -> { }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this doesn't change functionality, we probably don't want to change this just to update to modern syntax. The more changes we make like this, the harder it is to backport other fixes that might overlap with this diff, and ideally we backport fixes aggressively (and in fact, this could be an example where we might want to backport to a version that supports jdk7).
@@ -238,7 +236,11 @@ public void send(K key, V value, org.apache.kafka.clients.producer.Callback call | |||
producer.send(new ProducerRecord<>(topic, key, value), callback); | |||
} | |||
|
|||
|
|||
// package level visibility for testing only | |||
void setTopicMetadataTimeoutMs(long timeoutMs) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might want to do this as an alternative, package-private constructor so we can at a minimum make topicMetadataTimeoutMs
final
.
If we don't change this, convention in Kafka is to not use get
/set
prefixes on the method name.
@ewencp Thanks for the response. The key problem to this patch is how to solve non-online partitions. The original code only covers the first issue you mentioned, namely the total failure of connecting the cluster. The patch indeed makes a stricter check: any unavailable partitions lead to a shutdown. This is deliberate since I think it's not easy to clearly clarify the brokenness between two issues you mentioned. Besides, seems we have little thing to do with such temporarily unavailable partitions after the timeout instead of throwing exceptions. What do you think? |
…s of the topic is offline
https://issues.apache.org/jira/browse/KAFKA-8876
When starting up, KafkaBasedLog should throw ConnectException if any of the subscribed partitions has no leader.
More detailed description of your change,
if necessary. The PR title and PR message become
the squashed commit message, so use a separate
comment to ping reviewers.
Summary of testing strategy (including rationale)
for the feature or bug fix. Unit and/or integration
tests are expected for any behaviour change and
system tests should be considered for larger changes.
Committer Checklist (excluded from commit message)