New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KAFKA-3096; Leader is not set to -1 when it is shutdown if followers are down #765
Conversation
Suggested by Guozhang.
The bug was for the following case: `leader.isDefined && oldLeaderOpt.isEmpty && newLeaderOpt.isDefined && newLeaderOpt.get != leader.get` We would consider it a successful election even though we should not. I also changed the result type is we never return `None` (we throw an exception instead).
…r minor clean-ups
Review by @guozhangwang |
topicAndPartition.partition, currentLeaderAndIsr.copy(leader = -1), controllerContext.epoch, | ||
currentLeaderAndIsr.zkVersion) | ||
zkUpdateSucceeded = updateSucceeded | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is where we update the leader before throwing NoReplicaOnlineException
.
var zkUpdateSucceeded = false | ||
while (!zkUpdateSucceeded) { | ||
val (updateSucceeded, newVersion) = ReplicationUtils.updateLeaderAndIsr(zkUtils, topicAndPartition.topic, | ||
topicAndPartition.partition, currentLeaderAndIsr.copy(leader = -1), controllerContext.epoch, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use "LeaderAndIsr.NoLeader" instead of -1?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, will do.
LGTM overall. Personally I feel this additional logic in "selectLeader" does make sense, since in general we may not be able to trigger the last "remove replica" event when the failed replica is the controller itself, and hence from the view of the new controller the old replica leader was never failed but just no longer in the aliveBroker list any more. So before we abort the leader election process due to either error or configured mechanism (e.g. unclean leader election not allowed), we may need to update the corresponding "source-of-truth" to reflect the abortion, that the new leader was not successfully elected, and hence the partition should be "offline". An alternative way to do this, is when new controller is elected (in |
As suggested by Guozhang.
Addressed your comment @guozhangwang and thanks for the review. It may be worth getting @junrao's opinion as well on which of the two approaches we should go for. |
Started the system test branch builder: https://jenkins.confluent.io/job/kafka_system_tests_branch_builder/337/ |
The failures in the system tests seem to be happening in the trunk build too. Will try again after the following fix to the system tests is merged: |
…ould-be-set-to--1 * apache/trunk: (28 commits) MINOR: Vagrant AWS overrideable EC2 instance name prefix KAFKA-2146: Fix adding partition did not find the correct startIndex. MINOR: Upgrade note on compacted topics behaviour on receiving message without key KAFKA-3080: Fix ConsoleConsumerTest by checking version when service is started KAFKA-2071: Replace Producer Request/Response with their org.apache.kafka.common.requests equivalents KAFKA-3121: Remove aggregatorSupplier and add Reduce functions KAFKA-3122; Fix memory leak in `Sender.completeBatch` on TOPIC_AUTHORIZATION_FAILED MINOR: Avoid unnecessary `ConcurrentHashMap.get` MINOR: Fix javadoc for `PartitionInfo.leader()` MINOR: MemoryRecords.sizeInBytes throws NPE when non-writable. KAFKA-3091: Broker persists generated ID even when the ID can't be used due to duplicates KAFKA-3012: Avoid reserved.broker.max.id collisions on upgrade KAFKA-3119: Adding -daemon option to zookeeper-server-start.sh USAGE, similar to kafka-server-start.sh MINOR: complete built-in stream aggregate functions KAFKA-3104: add windowed aggregation to KStream KAFKA-2999: Errors enum should be a 1 to 1 mapping of error codes and… KAFKA-3095: Add documentation on format of sasl.kerberos.principal.to.local.rules KAFKA-3098: "partition.assignment.strategy" appears twice in documentation KAFKA-2695: limited support for nullable byte arrays KAFKA-3105: Use `Utils.atomicMoveWithFallback` instead of `File.rename` ...
System tests build passed: https://jenkins.confluent.io/job/kafka_system_tests_branch_builder/342/ |
@junrao do you want to chime in to take a look at the proposed solution? |
Thanks for the patch. A few high level comments.
|
"%s is alive. Live brokers are: [%s],".format(topicAndPartition, controllerContext.liveBrokerIds)) + | ||
" ISR brokers are: [%s]".format(currentLeaderAndIsr.isr.mkString(","))) | ||
var zkUpdateSucceeded = false | ||
while (!zkUpdateSucceeded) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, not sure why we added the while loop here since there is already a while loop in PartitionStateMachine.electLeaderForPartition().
Hi, what's the user-visible impact of this and https://issues.apache.org/jira/browse/KAFKA-3143? From @junrao's previous comment, I can't tell whether this is a bug, or simply an optimization (reduce fetch requests to someone who isn't the leader, speed up leader failover). I'm seeing some weirdness with regards to unclean leader election, and am wondering if this is related. |
…ould-be-set-to--1 * apache/trunk: (58 commits) KAFKA-3207: Fix StateChangeLogger to use the right topic name MINOR: log connect reconfiguration error only if there was an error HOTFIX: fix partition ordering in assignment MINOR: Removed unnecessary Vagrantfile hack HOTFIX: fix broken WorkerSourceTask test HOTFIX: temp fix for ktable look up KAFKA-3198: Ticket Renewal Thread exits prematurely due to inverted c… MINOR: Fix restoring for source KTable KAFKA-3194: Validate security.inter.broker.protocol against the adver… MINOR: Some more Kafka Streams Javadocs KAFKA-3092: Replace SinkTask onPartitionsAssigned/onPartitionsRevoked with open/close KAFKA-3195; Transient test failure in OffsetCheckpointTest.testReadWrite MINOR: Increment ducktape dependency KAFKA-3121: Refactor KStream Aggregate to be Lambda-able. MINOR: some javadocs for kstream public api MINOR: fix the logic of RocksDBWindowStore using RocksDBStore Segments KAFKA-3170; Set default fetch_min_bytes in new consumer to 1 KAFKA-3068: Remove retry with nodesEverSeen MINOR: Reorder StreamThread shutdown sequence KAFKA-3060: Refactor MeteredStore and RockDBStore Impl ...
…ould-be-set-to--1 * apache/trunk: (71 commits) KAFKA-3280: KafkaConsumer Javadoc contains misleading description of heartbeat behavior and correct use MINOR: Validate inner message compression attribute KAFKA-3292; ClientQuotaManager.getOrCreateQuotaSensors() may return a null ClientSensors.throttleTimeSensor MINOR: Connect status tracking API followup KAFKA-2698: Add paused() method to o.a.k.c.c.Consumer MINOR: enhance streams system test MINOR: add retry to state dir locking KAFKA-3259 KAFKA-3253; KIP-31/KIP-32 Follow-up MINOR: add useful debug log messages to KConnect source task execution MINOR: Add note about which files need to be edited when updating the version number KAFKA-3277; Update trunk version to be 0.10.0.0-SNAPSHOT KAFKA-3093: Add Connect status tracking API KAFKA-3272: Add debugging options to kafka-run-class.sh so we can easily run remote debugging KAFKA-3007: implement max.poll.records (KIP-41) HOTFIX: Add missing file for KeyValue unit test MINOR: KTable.count() to only take a selector for key KAFKA-3046: Add ByteBuffer Serializer and Deserializer HOTFIX: fix consumer config for streams KAFKA-3245: config for changelog replication factor KAFKA-3242: minor rename / logging change to Controller ...
…ould-be-set-to--1 * apache/trunk: (110 commits) KAFKA-3412: multiple asynchronous commits causes send failures MINOR: update new version in additional places MINOR: fix documentation version MINOR: Add InterfaceStability.Unstable annotations to some Kafka Streams public APIs Changing version to 0.10.1.0-SNAPSHOT MINOR: Fix FetchRequest.getErrorResponse for version 1 KAFKA-3378; Follow-up to ensure we `finishConnect` for immediately connected keys MINOR: Add vagrant up wrapper for simple parallel bringup on aws KAFKA-3328: SimpleAclAuthorizer can lose ACLs with frequent add/remov… KAFKA-3424: Add CORS support to Connect REST API KAFKA-3427: broker can return incorrect version of fetch response when the broker hits an unknown exception KAFKA-3378; Client blocks forever if SocketChannel connects instantly KAFKA-3006: standardize KafkaConsumer API to use Collection KAFKA-3397: use -1(latest) as time default value for tools.GetOffsetShell KAFKA-3394; allow null offset metadata in commit API KAFKA-3422: Add overloading functions without serdes in Streams DSL HOTFIX: Renamed tests to match expected suffix KAFKA-2982; Mark the old Scala producer and related classes as deprecated KAFKA-3336: Unify Serializer and Deserializer into Serialization KAFKA-3202: System test that changes message version on the fly ...
…ould-be-set-to--1 * apache/trunk: (63 commits) KAFKA-3521: validate null keys in Streams DSL implementations KAFKA-3528: handle wakeups while rebalancing more gracefully MINOR: improve logging of consumer system tests KAFKA-725: Return OffsetOutOfRange error from ReplicaManager when non-follower attempts reading an offset that's above high watermark. KAFKA-3512: Added foreach operator KAFKA-3505: Fix punctuate generated record metadata MINOR: ensure original use of prop_file in verifiable producer KAFKA-3488; Avoid failing of unsent requests in consumer where possible KAFKA-3497: Streams ProcessorContext should support forward() based on child name KAFKA-3477: extended KStream/KTable API to specify custom partitioner for sinks KAFKA-3508: Fix transient SimpleACLAuthorizerTest failures KAFKA-3489; Update request metrics if a client closes a connection while the broker response is in flight KAFKA-3510; OffsetIndex thread safety KAFKA-3384: Conform to POSIX kill usage KAFKA-2998: log warnings when client is disconnected from bootstrap brokers KAFKA-3464: Add system tests for Connect with Kafka security enabled HOTFIX: set timestamp in SinkNode KAFKA-3483: Restructure ducktape tests to simplify running subsets of tests MINOR: Clean up of SourceTaskOffsetCommiter MINOR: Fix small typo in design section ...
@ijuma Is this still valid? If yes could you close it? |
Refer to this link for build results (access rights to CI server needed): |
Refer to this link for build results (access rights to CI server needed): |
Refer to this link for build results (access rights to CI server needed): |
Update leader to -1 before throwing
NoReplicaOnlineException
inOfflinePartitionLeaderSelector
as suggested by Guozhang. This fixes the test, which is great, but it seems a bit out of place. Would it be better to change the code somewhere else?Fix bug in
waitUntilLeaderIsElectedOrChanged
and simplify result type. The bug was for the following case:We would consider it a successful election even though we should not. I also changed the result type as we never return
None
(we throw an exception instead). Fixing this bug is what uncovered the leader issue being solved in this PR.Also included:
PartitionLeaderSelector.scala