
KAFKA-15489: resign leadership when no fetch from majority voters #14428

Merged
merged 12 commits into from Nov 30, 2023

Conversation

@showuon showuon (Contributor) commented Sep 23, 2023

In KIP-595, we expect to piggy-back on the quorum.fetch.timeout.ms config: if the leader does not receive Fetch requests from a majority of the quorum for that amount of time, it should begin a new election, to resolve a network partition in the quorum. But we missed this implementation in the current KRaft code. This PR fixes it.

This PR does the following:

  1. Added a fetchTimer with fetchTimeout in LeaderState, and check whether it has expired each time the leader polls. If expired, the leader resigns and starts a new election.
  2. Added fetchedVoters in LeaderState, updated each time a fetch request is received; it is cleared and the fetchTimer reset once fetch requests from a majority have been received.
  3. Added tests.
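The two steps above can be sketched roughly as follows. This is a minimal, illustrative sketch only: the class and member names are hypothetical, not the actual LeaderState implementation, and the deadline field stands in for Kafka's Timer utility. It follows the final commit-message semantics, where the leader counts itself toward the majority.

```java
import java.util.HashSet;
import java.util.Set;

// Minimal sketch of the fetch-timeout tracking described above; names are
// hypothetical and do not come from the actual LeaderState code.
class MajorityFetchTracker {
    private final Set<Integer> voters;     // ids of all voters, leader included
    private final long fetchTimeoutMs;
    private final Set<Integer> fetchedVoters = new HashSet<>();
    private long fetchTimerDeadlineMs;     // stands in for Kafka's Timer utility

    MajorityFetchTracker(Set<Integer> voters, long fetchTimeoutMs, long nowMs) {
        this.voters = voters;
        this.fetchTimeoutMs = fetchTimeoutMs;
        this.fetchTimerDeadlineMs = nowMs + fetchTimeoutMs;
    }

    // Called for each FETCH (or FETCH_SNAPSHOT) request the leader handles.
    void recordFetch(int replicaId, long nowMs) {
        if (voters.contains(replicaId)) {
            fetchedVoters.add(replicaId);
        }
        // The leader counts itself toward the majority, so requests from
        // (majority - 1) remote voters are enough to reset the timer.
        int majority = voters.size() / 2 + 1;
        if (fetchedVoters.size() >= majority - 1) {
            fetchedVoters.clear();
            fetchTimerDeadlineMs = nowMs + fetchTimeoutMs;
        }
    }

    // Checked on each leader poll; true means the leader should resign
    // and start a new election.
    boolean hasFetchTimeoutExpired(long nowMs) {
        return nowMs >= fetchTimerDeadlineMs;
    }
}
```

For a 3-voter quorum, a single remote voter's fetch is enough to reset the timer, since the leader itself is the second member of the majority.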

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@showuon showuon (Contributor, Author) commented Sep 23, 2023

@hachikuji @cmccabe @jsancio , call for review. Thanks.

@satishd satishd (Member) commented Sep 23, 2023

cc @mumrah

@jsancio jsancio added the kraft label Sep 25, 2023
@jsancio jsancio self-assigned this Sep 25, 2023
this.fetchTimer = time.timer(fetchTimeoutMs);
}

public boolean hasMajorityFollowerFetchTimeoutExpired(long currentTimeMs) {
Contributor:
nit: these are pretty lengthy and arguably unintuitive method names, could we add a short comment on what each method does?

Contributor Author:

Changed to hasMajorityFollowerFetchExpired. Let me know if you have any better suggestion. Thanks.

Contributor:

I was more suggesting that this method might benefit from a comment which describes behavior. I guess the info log explains it well enough

Contributor Author:

Fair enough. Added comments on the methods. Thanks.

@jsancio jsancio (Member) left a comment:

Thanks for the changes @showuon

@@ -485,6 +485,49 @@ public void testHandleBeginQuorumEpochAfterUserInitiatedResign() throws Exceptio
context.listener.currentLeaderAndEpoch());
}

@Test
public void testLeaderShouldResignLeadershipIfNotGetFetchRequestFromMajorityVoters() throws Exception {
Member:

Can we also test the opposite: that the leader doesn't resign if a majority of the replicas (including the leader) have fetched in the last fetchTimeoutMs?

Can we also add test(s) under KafkaRaftClientSnapshotTest that show that the leader also considers FETCH_SNAPSHOT requests for determining network connectivity?

Contributor Author:

Will add tests later.

@showuon showuon (Contributor, Author) commented Oct 3, 2023

Added a test in KafkaRaftClientSnapshotTest.

> Can we also test the opposite: that the leader doesn't resign if a majority of the replicas (including the leader) have fetched in the last fetchTimeoutMs?

I didn't quite follow, but here's what I've verified with a 3-controller cluster:

- 1/2 of the fetch timeout passes → leadership is not reassigned
- fetch from one voter
--- timer reset ---
- 1/2 of the fetch timeout passes → leadership is not reassigned
- fetch from another voter
--- timer reset ---

- 1/2 of the fetch timeout passes → leadership is not reassigned
- fetch from the observer
- 1/2 of the fetch timeout passes
--- expired ---
- leadership should be reassigned

I think I've verified what you wanted. Let me know if I need to add anything else. Thanks.

Member:

Got it. This test covers all of the cases I was thinking about.

Comment on lines 106 to 118
public void maybeResetMajorityFollowerFetchTimeout(int id, long currentTimeMs) {
updateFetchedVoters(id);
if (fetchedVoters.size() >= majority) {
fetchedVoters.clear();
fetchTimer.update(currentTimeMs);
fetchTimer.reset(fetchTimeoutMs);
}
}

private void updateFetchedVoters(int id) {
if (isVoter(id)) {
fetchedVoters.add(id);
}
Member:

I see.

I think the invariant we need to hold is that, over any time span of fetchTimeoutMs, a majority of the replicas have performed a successful FETCH or FETCH_SNAPSHOT. Note that ReplicaState already contains the lastFetchTimestamp.

The part that is not clear to me is when or how to wake up the leader for a poll. We need to update KafkaRaftClient::pollLeader so that the replicas' last fetch time is taken into account when blocking on the messageQueue.poll.

What do you think?

Contributor Author:

> Note that ReplicaState already contains the lastFetchTimestamp.

I tried to re-use the lastFetchTimestamp in ReplicaState, but found it won't work as expected since its default value is -1. That means when a node becomes the leader, all the follower nodes' lastFetchTimestamp values are -1. Using the current timer-based approach is more readable, IMO.

> The part that is not clear to me is when or how to wake up the leader for a poll. We need to update KafkaRaftClient::pollLeader so that the replicas' last fetch time is taken into account when blocking on the messageQueue.poll.

Good question. My thought is to add some buffer to tolerate the operation time. For example, when checking shrinkISR, we allow 1.5x of the timeout to make things easier, instead of calculating the exact timestamp. So I'm thinking we use fetchTimeout * 1.5. WDYT?
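The two ideas in this exchange (bounding the leader's blocking poll by the fetch-timer deadline, and adding a 1.5x cushion) could be sketched like this. These are purely hypothetical helper names, not code from the PR or from KafkaRaftClient:

```java
// Hypothetical sketch of the two ideas discussed above; none of these names
// come from the actual KafkaRaftClient code.
final class LeaderPollTimeout {
    private LeaderPollTimeout() {}

    // Bound the leader's blocking poll by the earliest pending deadline so the
    // fetch-timeout check fires promptly (never negative).
    static long pollTimeoutMs(long nowMs, long otherDeadlineMs, long fetchTimerDeadlineMs) {
        return Math.max(0L, Math.min(otherDeadlineMs, fetchTimerDeadlineMs) - nowMs);
    }

    // The 1.5x cushion: arm the timer with fetchTimeout * 1.5 so slow followers
    // get some slack before the leader resigns.
    static long bufferedFetchTimeoutMs(long fetchTimeoutMs) {
        return fetchTimeoutMs + fetchTimeoutMs / 2;
    }
}
```

With the cushion, an exact per-replica deadline calculation is unnecessary: the leader only needs to wake up often enough that the buffered timeout is checked before it drifts far past expiry.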

@showuon showuon (Contributor, Author) commented Oct 3, 2023

@jsancio @ahuang98 , I've addressed the comments. Please take a look again. Thanks.

@ahuang98 ahuang98 (Contributor) left a comment:

Thanks @showuon, I'm not sure I follow the thread here, so I don't feel comfortable approving just yet.
[Edit] I discussed with Jose offline and the comment makes sense to me now. I'm okay with the alternative of an added buffer time.

I also had two other concerns/requests.
Perhaps the 1.5x timeout you suggested would also help address this concern: I'm wondering if we may start causing leaders to resign when followers are slow or backlogged, making the situation worse. E.g., if multiple followers need to catch up via a large snapshot fetch, they may be unable to fetch again before the timeout expires, causing the current leader to resign. I don't believe this would be very disruptive, but I wanted to check that folks had considered this or a similar situation.

I think we can also modify QUORUM_FETCH_TIMEOUT_MS_DOC to be slightly more explicit (i.e. "Maximum time a leader can go without receiving a valid fetch or fetch-snapshot request from a majority of the quorum before resigning", or something slightly different if we choose to use 1.5x).

@showuon showuon (Contributor, Author) commented Oct 12, 2023

> I'm wondering if we may start causing leaders to resign when followers are slow or backlogged, making the situation worse. E.g., if multiple followers need to catch up via a large snapshot fetch, they may be unable to fetch again before the timeout expires, causing the current leader to resign. I don't believe this would be very disruptive, but I wanted to check that folks had considered this or a similar situation.

Yes, with 1.5x of the timeout, this issue should be resolved. Also, if one follower is slow for whatever reason and doesn't fetch again within the fetch timeout, it will also start a new election; that's already the current implementation.

> I think we can also modify QUORUM_FETCH_TIMEOUT_MS_DOC to be slightly more explicit too (i.e. "Maximum time a leader can go without receiving a valid fetch or fetch-snapshot request from a majority of the quorum before resigning").

Doc updated. I don't think we need to mention the 1.5x, because that's an implementation detail.

@showuon showuon (Contributor, Author) commented Nov 7, 2023

@jsancio @ahuang98 , do you have any other comments? I'd like to include this fix into v3.5.2 if possible. Thanks.

@jsancio jsancio (Member) commented Nov 27, 2023

Excuse the delays, @showuon. I'll review this today and through this week!

@jsancio jsancio (Member) left a comment:

Hi @showuon,

Thanks for the changes. They look good to me in general. One potential issue with this implementation is that the leader doesn't check that the fetching voters are making progress.

Just because the leader returned a successful response to FETCH and FETCH_SNAPSHOT doesn't mean that the followers were able to handle the response correctly.

For example, imagine the case where the log end offset (LEO) is at 1000 and all of the followers are continuously fetching at offset 0 without ever increasing their fetch offset. This can happen if the followers encounter an error when processing the FETCH or FETCH_SNAPSHOT response.

In this scenario the leader will never be able to increase the HWM. I think this scenario is specific to KRaft and doesn't exist in Raft, because KRaft is pull-based whereas Raft is push-based.

What do you think? Do you agree? If so, should we address this issue in this PR, or create an issue for it and fix it in a future PR?
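The progress concern described above (later tracked as KAFKA-15911) could in principle be addressed by counting a voter's fetch toward liveness only when its fetch offset advances. The following is an illustrative sketch only; the names are hypothetical and not from the PR:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of a progress-aware liveness check: a voter's fetch
// counts only when its fetch offset moved forward, i.e. the follower
// successfully processed its previous FETCH/FETCH_SNAPSHOT response.
// Names are hypothetical, not from the PR.
class ProgressAwareFetchCheck {
    private final Map<Integer, Long> lastFetchOffset = new HashMap<>();

    // Returns true only when this voter's fetch offset advanced.
    boolean recordFetch(int voterId, long fetchOffset) {
        Long prev = lastFetchOffset.get(voterId);
        if (prev == null || fetchOffset > prev) {
            lastFetchOffset.put(voterId, fetchOffset);
            return true;
        }
        return false;  // stuck replica: would not reset the leader's fetch timer
    }
}
```

A replica stuck fetching at offset 0 would count once (its first fetch) and never again, so it could not keep an isolated leader alive indefinitely.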


@showuon showuon (Contributor, Author) commented Nov 28, 2023

> Thanks for the changes. They look good to me in general. One potential issue with this implementation is that the leader doesn't check that the fetching voters are making progress.
>
> Just because the leader returned a successful response to FETCH and FETCH_SNAPSHOT doesn't mean that the followers were able to handle the response correctly.
>
> For example, imagine the case where the log end offset (LEO) is at 1000 and all of the followers are continuously fetching at offset 0 without ever increasing their fetch offset. This can happen if the followers encounter an error when processing the FETCH or FETCH_SNAPSHOT response.
>
> In this scenario the leader will never be able to increase the HWM. I think this scenario is specific to KRaft and doesn't exist in Raft, because KRaft is pull-based whereas Raft is push-based.
>
> What do you think? Do you agree? If so, should we address this issue in this PR, or create an issue for it and fix it in a future PR?

@jsancio, good catch! Yes, that's indeed a potential problem. This PR has been pending for a long time, so let's focus on the current issue in this PR first. I've filed KAFKA-15911 for the potential issue.
I've addressed all your comments. Please take a look again. Thanks.

@jsancio jsancio (Member) left a comment:

Just one minor suggestion.

I took a look at the build, and there seem to be a lot of failures. Can you confirm that they are not related to this change?

@showuon showuon (Contributor, Author) commented Nov 29, 2023

> I took a look at the build, and there seem to be a lot of failures. Can you confirm that they are not related to this change?

No, they don't look related. Let's check the latest build results later.

@jsancio jsancio (Member) left a comment:

LGTM. Thanks for the improvement. There are a lot of test failures but they seem unrelated. Do you agree @showuon ?

@jsancio jsancio merged commit 37416e1 into apache:trunk Nov 30, 2023
1 check failed
@showuon showuon (Contributor, Author) commented Dec 1, 2023

Yes, I agree. Thanks for helping merge it.

ex172000 pushed a commit to ex172000/kafka that referenced this pull request Dec 15, 2023
…ajority voters (apache#14428)

In KIP-595, we expect to piggy-back on the `quorum.fetch.timeout.ms` config, and if the leader did not receive Fetch requests from a majority of the quorum for that amount of time, it would begin a new election, to resolve the network partition in the quorum. But we missed this implementation in current KRaft. Fixed it in this PR.

The commit includes:
1. Added a timer with a timeout configuration in `LeaderState`, checked on each leader poll to see whether it has expired. If expired, the leader resigns and starts a new election.

2. Added `fetchedVoters` in `LeaderState`, updated each time a FETCH or FETCH_SNAPSHOT request is received; it is cleared and the timer reset once (majority - 1) of the remote voters have sent such requests.

Reviewers: José Armando García Sancio <jsancio@apache.org>
anurag-harness pushed a commit to anurag-harness/kafka that referenced this pull request Feb 9, 2024
…ajority voters (apache#14428)
anurag-harness added a commit to anurag-harness/kafka that referenced this pull request Feb 9, 2024
…ajority voters (apache#14428) (#315)
yyu1993 pushed a commit to yyu1993/kafka that referenced this pull request Feb 15, 2024
…ajority voters (apache#14428)
AnatolyPopov pushed a commit to aiven/kafka that referenced this pull request Feb 16, 2024
…ajority voters (apache#14428)
clolov pushed a commit to clolov/kafka that referenced this pull request Apr 5, 2024
…ajority voters (apache#14428)