KAFKA-12639: exit upon expired timer to prevent tight looping #13190

philipnee · 2023-02-02T20:17:52Z

https://issues.apache.org/jira/browse/KAFKA-12639

In AbstractCoordinator#joinGroupIfNeeded, the joinGroup request is retried without a proper backoff once the timer has expired. This is an uncommon scenario that possibly only appears during testing, but I think it makes sense to force the client to drive the join group via poll.

philipnee · 2023-02-03T00:38:57Z

@guozhangwang - would you have time to review this 🥺 ?

guozhangwang

Thanks @philipnee . I left some comments.

Also, I think we should add unit tests that use the mock client to 1) return a non-retriable exception and check that we throw immediately (if we already have this in the tests, we can skip it), 2) return one of the four exceptions and check that we never sleep on the backoff, 3) return another retriable exception and check that we still sleep on the remaining timer; and in cases 2) and 3), check that if the timer has elapsed, we also return false immediately.

guozhangwang · 2023-02-03T18:16:43Z

clients/src/main/java/org/apache/kafka/clients/consumer/internals/AbstractCoordinator.java

                    continue;
                else if (!future.isRetriable())
                    throw exception;

+                if (timer.isExpired()) {


Could you add a couple comments here explaining why we check the timer again here in addition to in line 452 above? Maybe something like this:

We check the timer again after calling poll with the timer since it's possible that, even after the timer has elapsed, the next client.poll(timer) would immediately return an error response, which would cause us to never exit the while loop.
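
To make the failure mode concrete, here is a minimal, self-contained sketch of the tight loop. It does not use Kafka classes: `FakeTimer` is a hypothetical stand-in whose `sleep` is assumed to be capped at the remaining time (so it becomes a no-op once the timer has expired), and `retryLoop` is an illustrative reduction of the loop in joinGroupIfNeeded in which every attempt fails immediately with a retriable error. The `maxAttempts` parameter is also an assumption, added only so the unguarded variant terminates.

```java
final class TightLoopSketch {
    /** Hypothetical timer: sleep() never waits longer than the time remaining. */
    static final class FakeTimer {
        private long remainingMs;
        long totalSleptMs = 0;

        FakeTimer(long timeoutMs) {
            this.remainingMs = timeoutMs;
        }

        boolean isExpired() {
            return remainingMs <= 0;
        }

        void sleep(long durationMs) {
            long actual = Math.min(durationMs, remainingMs); // 0 once expired
            remainingMs -= actual;
            totalSleptMs += actual;
        }
    }

    /**
     * Every attempt fails right away with a retriable error. Returns the number
     * of attempts made before the loop is left; maxAttempts only exists so the
     * unguarded variant terminates in this demo.
     */
    static int retryLoop(FakeTimer timer, long backoffMs, boolean checkTimer, int maxAttempts) {
        int attempts = 0;
        while (attempts < maxAttempts) {
            attempts++;                  // an error response comes back immediately
            if (checkTimer && timer.isExpired())
                return attempts;         // the added guard: exit and let the caller poll again
            timer.sleep(backoffMs);      // no real backoff once the timer has expired
        }
        return attempts;
    }

    public static void main(String[] args) {
        System.out.println("without guard: " + retryLoop(new FakeTimer(0), 100, false, 1000) + " attempts");
        System.out.println("with guard:    " + retryLoop(new FakeTimer(0), 100, true, 1000) + " attempts");
    }
}
```

With an already expired timer, the unguarded loop keeps retrying with zero backoff until the artificial cap, while the guarded loop exits after the first failed attempt.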

guozhangwang · 2023-02-03T18:24:05Z

clients/src/main/java/org/apache/kafka/clients/consumer/internals/AbstractCoordinator.java

-                    exception instanceof MemberIdRequiredException)
+                        exception instanceof IllegalGenerationException ||
+                        exception instanceof RebalanceInProgressException ||
+                        exception instanceof MemberIdRequiredException)


Should we actually do the timer check before this? Since otherwise if the exception from the immediately returned responses is any of those four, we would still continue and skip the check below.

More concretely I think we can just move the remaining logic inside the if call:

if (!future.isRetriable()) { throw .. } else { if (timer.isExpired()) { return false; } else if (exception instanceof ..) { continue; } else { timer.sleep(..) } }

Agreed. One comment here: the isRetriable check should happen after the instanceof checks, because I think we actually want to retry on these exceptions (according to the existing logic).

philipnee · 2023-02-13T04:05:31Z

Thanks @guozhangwang for the feedback. I added some tests to cover the untested cases. I still have a quick question about this block: is it intentional to continue w/o sleep on the backoff timer? (quoting the original code)

if (exception instanceof UnknownMemberIdException ||
        exception instanceof IllegalGenerationException ||
        exception instanceof RebalanceInProgressException ||
        exception instanceof MemberIdRequiredException)
    continue;
else if (!future.isRetriable())
    throw exception;

timer.sleep(rebalanceConfig.retryBackoffMs);

guozhangwang · 2023-02-17T01:04:46Z

is it intentional to continue w/o sleep on the backoff timer?

Yes, that's intentional. For those four exceptions, we'd like to send the follow-up request right away since the broker is waiting for those join-group requests. But the question is: when the timer has already elapsed, should we honor that, or should we ignore it and always try to complete this mid-stage?

Since in the new protocol we would no longer have such mid-stages during a prepare_rebalance phase (cc @dajac to chime in if you feel differently), I would suggest we still respect the timer for now, to have stronger poll(timer) timing guarantees.

philipnee · 2023-02-17T20:27:33Z

Thanks, @guozhangwang, that's my understanding as well.

philipnee · 2023-02-18T04:01:52Z

Moving the timer check just broke a bunch of unit tests 😅

guozhangwang · 2023-02-21T18:11:01Z

@philipnee is this the final version of this PR? Seems we are still honoring the four exceptions indicating the mid-stage of a rebalance more than the elapsed timer here?

philipnee · 2023-02-21T18:13:57Z

Hey @guozhangwang, it's WIP. I think moving the timer check before the exception handling block (for those 4 exceptions) breaks a bunch of tests, as most tests expect to complete within a single poll. I'm looking into these breakages. Sorry about the confusion.

guozhangwang · 2023-02-21T18:15:13Z

Oh got it, thanks! All good :) Please let me know when it's ready for a final look.

philipnee · 2023-02-21T18:15:35Z

Although, are we ok with handling these 4 exceptions the same way as before? I know you previously mentioned that it might be better off to make the rules more consistent, and I kind of agree with it.

guozhangwang · 2023-02-22T17:39:56Z

Yeah, I think it's okay to make the rule consistent, i.e. to honor the timeout even under those four exceptions: if the timer has elapsed, then we should still return from the loop in

client.poll(future, timer);
if (!future.isDone()) {
    // we ran out of time
    return false;
}

even if the response yet to be returned would contain any of these four exceptions. So I think we should still obey this rule, i.e. even if a response has been returned and we know it's going to be one of these four exceptions, if the timer has elapsed, we still exit the loop.

philipnee · 2023-02-22T17:58:41Z

Hmm, strangely, this branch seems to trigger a bunch of initializing error failures. And I can't seem to reproduce them locally...

philipnee · 2023-02-24T18:59:59Z

Just a note on this PR: it seems we need to be more deliberate about handling the timeout, because non-retriable errors (except for the 4 cases) are always expected to be thrown, which is why the change triggered 60-ish breaking tests. Updating the PR to retrigger the tests.

philipnee · 2023-02-26T03:48:18Z

The failures seem unrelated to the change here, i.e. they don't show up in both rounds.

Build / JDK 11 and Scala 2.13 / testDynamicListenerConnectionCreationRateQuota() – kafka.network.DynamicConnectionQuotaTest (41s)
Build / JDK 17 and Scala 2.13 / testTaskRequestWithOldStartMsGetsUpdated() – org.apache.kafka.trogdor.coordinator.CoordinatorTest (2m 0s)
Build / JDK 8 and Scala 2.12 / [1] Type=ZK, Name=testRegisterZkBrokerInKraft, MetadataVersion=3.4-IV0, Security=PLAINTEXT – kafka.server.KafkaServerKRaftRegistrationTest (7s)
Build / JDK 8 and Scala 2.12 / [1] Type=ZK, Name=testRegisterZkBrokerInKraft, MetadataVersion=3.4-IV0, Security=PLAINTEXT – kafka.server.KafkaServerKRaftRegistrationTest (12s)

guozhangwang

@philipnee thanks for the added unit test. I made another pass and left some more comments.

guozhangwang · 2023-02-27T22:00:44Z

clients/src/main/java/org/apache/kafka/clients/consumer/internals/AbstractCoordinator.java

@@ -500,14 +500,22 @@ boolean joinGroupIfNeeded(final Timer timer) {
                    requestRejoin(shortReason, fullReason);
                }

+                // continue to retry as long as the timer hasn't expired


Could we simplify this multi-if logic as:

if (!future.isRetriable()) { throw } else { if (timer.isExpired()) { return false } else if (exception instanceof ..) { continue } else { timer.sleep(..) } }

Also, could we add a comment on top clarifying that the order of precedence is deliberate, and that future changes should take care not to change it unnecessarily?

I think the instanceof ... exceptions are also non-retriable, and I think they need to be handled first,

so the if/else blocks become a bit fragmented. Or we could do:

if (!future.isRetriable()) { if ( ... instanceof ... ) { continue; } throw ... } {rest of the logic there}

However, this is a bit more nested, which can be harder to read.

guozhangwang · 2023-02-27T22:39:13Z

clients/src/test/java/org/apache/kafka/clients/consumer/internals/ConsumerCoordinatorTest.java

@@ -1484,6 +1484,8 @@ public void testRebalanceWithMetadataChange() {
                Utils.mkMap(Utils.mkEntry(topic1, 1), Utils.mkEntry(topic2, 1))));
        client.respond(joinGroupFollowerResponse(1, consumerId, "leader", Errors.NOT_COORDINATOR));
        client.prepareResponse(groupCoordinatorResponse(node, Errors.NONE));
+        coordinator.poll(time.timer(0)); // failing joinGroup request will require re-poll in order to retry


It's not very clear to me why we need additional polls here and at line 3403 below, since the test scenarios seem unrelated to error cases?

The NOT_COORDINATOR error would originally trigger retries; however, in the new code, the loop exits due to the expired timer. Another way to do it is to use poll(time.timer(1)).

Ack, that makes sense.

philipnee · 2023-02-27T23:40:12Z

clients/src/test/java/org/apache/kafka/clients/consumer/internals/ConsumerCoordinatorTest.java

@@ -3398,7 +3400,8 @@ public void testPrepareJoinAndRejoinAfterFailedRebalance() {
            client.respond(syncGroupResponse(partitions, Errors.NONE));

            // Join future should succeed but generation already cleared so result of join is false.
-            res = coordinator.joinGroupIfNeeded(time.timer(1));
+            coordinator.joinGroupIfNeeded(time.timer(0));


Similarly here: the timer has expired when the IllegalGenerationException arrives. The loop would continue in the original code, but now it exits. I guess we could try to poll for a bit longer, like 3ms instead of 0ms.

philipnee · 2023-02-28T00:40:02Z

Here are a couple of things I updated:

- Added some documentation to clarify the intent, but I didn't restructure the code, since nested ifs "can be" harder to read.
- Added non-zero timeouts to the tests, as our timer is now stricter and will explicitly exit upon expiration.

guozhangwang · 2023-02-28T17:15:35Z

clients/src/main/java/org/apache/kafka/clients/consumer/internals/AbstractCoordinator.java

@@ -500,14 +500,24 @@ boolean joinGroupIfNeeded(final Timer timer) {
                    requestRejoin(shortReason, fullReason);
                }

+                // 4 special non-retriable exceptions that we want to retry, as long as the timer hasn't expired.


I think the comment here should not just state what the code does, since readers can understand that from the code :P Instead, what we want to emphasize is a reminder to future contributors that they should be careful not to change the precedence ordering of this logic unnecessarily.

guozhangwang · 2023-02-28T17:20:43Z

clients/src/main/java/org/apache/kafka/clients/consumer/internals/AbstractCoordinator.java

-                    exception instanceof IllegalGenerationException ||
-                    exception instanceof RebalanceInProgressException ||
-                    exception instanceof MemberIdRequiredException)
+                        exception instanceof IllegalGenerationException ||


Ah thanks for the clarifications!

Thinking about this a bit more (sorry for going back and forth...), I'm now a bit more concerned that for usage patterns where poll is called less frequently, we may not come back to handle these four exceptions while the broker is ticking and waiting for the join-group request to be re-sent. Hence I'm changing my mind and now lean towards honoring the exception types for immediate handling over the timeouts. Again, sorry for going back and forth...

So I think we would define the ordering as the following:

For non-retriable exceptions, always try to handle them immediately and do not honor the timer.

Otherwise, honor the timer.

In that case, we could just go back to your first version of the change, i.e. just add the

if (timer.isExpired()) return false;

after the if/else-if block. Still, it's better to add a comment noting that the above ordering is deliberately designed as such.

Hey, thanks for the comments again, and absolutely no apology is needed! I guess, as we all know, rebalancing is full of subtleties, so it makes sense to be careful about these non-retriable exception cases. I think it's a good idea to keep the original behavior consistent, to avoid unexpected breakage. Updating the PR.
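
For readers following the thread, the two orderings discussed here can be compared in a small model. This is only a sketch under stated assumptions, not the code in AbstractCoordinator: `Outcome`, `Precedence`, `FakeTimer`, `joinLoop` and `responses` are all hypothetical names, `REBALANCE_ERROR` stands in for the four exceptions (UnknownMemberIdException, IllegalGenerationException, RebalanceInProgressException, MemberIdRequiredException), and an empty response queue models a poll that ran out of time.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

final class JoinPrecedenceSketch {
    /** Hypothetical outcomes of a single join attempt (not Kafka types). */
    enum Outcome { SUCCESS, REBALANCE_ERROR, RETRIABLE_ERROR, FATAL_ERROR }

    /** Which rule wins when the timer has expired and a rebalance error comes back. */
    enum Precedence { EXCEPTION_FIRST, TIMER_FIRST }

    /** Hypothetical timer whose sleep() is capped at the remaining time. */
    static final class FakeTimer {
        private long remainingMs;

        FakeTimer(long timeoutMs) {
            this.remainingMs = timeoutMs;
        }

        boolean isExpired() {
            return remainingMs <= 0;
        }

        void sleep(long durationMs) {
            remainingMs -= Math.min(durationMs, remainingMs);
        }
    }

    static boolean joinLoop(Deque<Outcome> responses, FakeTimer timer, long backoffMs, Precedence precedence) {
        while (true) {
            Outcome outcome = responses.poll();
            if (outcome == null)
                return false;                 // nothing came back: we ran out of time
            if (outcome == Outcome.SUCCESS)
                return true;
            if (outcome == Outcome.FATAL_ERROR)
                throw new IllegalStateException("non-retriable join failure");

            if (precedence == Precedence.TIMER_FIRST && timer.isExpired())
                return false;                 // stricter variant: an expired timer always wins

            if (outcome == Outcome.REBALANCE_ERROR)
                continue;                     // re-send right away: no backoff, timer not consulted

            if (timer.isExpired())
                return false;                 // other retriable errors honor the timer
            timer.sleep(backoffMs);
        }
    }

    static Deque<Outcome> responses(Outcome... outcomes) {
        return new ArrayDeque<>(Arrays.asList(outcomes));
    }

    public static void main(String[] args) {
        for (Precedence p : Precedence.values()) {
            boolean joined = joinLoop(responses(Outcome.REBALANCE_ERROR, Outcome.SUCCESS), new FakeTimer(0), 100, p);
            System.out.println(p + " with an expired timer: joined=" + joined);
        }
    }
}
```

In this model, `EXCEPTION_FIRST` corresponds to the ordering settled on above (retry the four exceptions immediately, honor the timer otherwise), while `TIMER_FIRST` corresponds to the stricter alternative from earlier in the thread, which leaves the follow-up request to the caller's next poll.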

philipnee · 2023-02-28T19:24:23Z

clients/src/test/java/org/apache/kafka/clients/consumer/internals/ConsumerCoordinatorTest.java

@@ -1484,7 +1484,8 @@ public void testRebalanceWithMetadataChange() {
                Utils.mkMap(Utils.mkEntry(topic1, 1), Utils.mkEntry(topic2, 1))));
        client.respond(joinGroupFollowerResponse(1, consumerId, "leader", Errors.NOT_COORDINATOR));
        client.prepareResponse(groupCoordinatorResponse(node, Errors.NONE));
-        coordinator.poll(time.timer(0));
+        assertFalse(client.hasInFlightRequests());
+        coordinator.poll(time.timer(1));


note: we need to add a timeout here to give the retry a second chance, because in the new code, the timer is checked and causes the method to exit.

philipnee · 2023-02-28T19:25:26Z

clients/src/main/java/org/apache/kafka/clients/consumer/internals/AbstractCoordinator.java

-                    exception instanceof MemberIdRequiredException)
+                        exception instanceof IllegalGenerationException ||
+                        exception instanceof RebalanceInProgressException ||
+                        exception instanceof MemberIdRequiredException)
                    continue;
                else if (!future.isRetriable())


the previous logic was reverted with some autocorrection to the indentation.

guozhangwang

LGTM, waiting for the jenkins job to complete.

philipnee · 2023-03-01T01:35:50Z

Hmm. I think these tests are flaky actually

Build / JDK 17 and Scala 2.13 / shouldPauseStandbyTaskAndNotTransitToUpdateStandbyAgain() – org.apache.kafka.streams.processor.internals.DefaultStateUpdaterTest (30s)
Build / JDK 17 and Scala 2.13 / shouldPauseActiveTaskAndTransitToUpdateStandby() – org.apache.kafka.streams.processor.internals.DefaultStateUpdaterTest (30s)
Build / JDK 17 and Scala 2.13 / testTaskRequestWithOldStartMsGetsUpdated() – org.apache.kafka.trogdor.coordinator.CoordinatorTest (2m 0s)
Build / JDK 11 and Scala 2.13 / testListenerConnectionRateLimitWhenActualRateAboveLimit() – kafka.network.ConnectionQuotasTest (19s)
Build / JDK 11 and Scala 2.13 / shouldRemovePausedAndUpdatingTasksOnShutdown() – org.apache.kafka.streams.processor.internals.DefaultStateUpdaterTest (30s)
Build / JDK 11 and Scala 2.13 / shouldPauseStandbyTaskAndNotTransitToUpdateStandbyAgain() – org.apache.kafka.streams.processor.internals.DefaultStateUpdaterTest (31s)

guozhangwang · 2023-03-01T01:36:57Z

The test failures are not relevant (but some of them are related to DefaultStateUpdaterTest.. sigh).

guozhangwang · 2023-03-01T01:37:10Z

Merged to trunk.

philipnee · 2023-03-01T01:45:15Z

yeah the DefaultStateUpdaterTest has been failing from time to time... not sure why 😭

exit upon expired timer to prevent tight looping

1638a5c

philipnee changed the title ~~KAFKA-12539: exit upon expired timer to prevent tight looping~~ KAFKA-12639: exit upon expired timer to prevent tight looping Feb 2, 2023

guozhangwang reviewed Feb 3, 2023

Added tests according to the comments

db3d772

philipnee requested a review from guozhangwang February 16, 2023 01:48

philipnee added 9 commits February 17, 2023 20:19

re-trigger test

2bbfa68

abd

Moving the timer below the specific exceptions to prevent breaking ch…

7987251

…anges.

exit upon expired timer to prevent tight looping

bb14d37

Added tests according to the comments

ff216d1

re-trigger test

6e9b75e

abd

Moving the timer below the specific exceptions to prevent breaking ch…

ae24f03

…anges.

Merge branch 'kafka-12639_exit_upon_expired_timer' of github.com:phil…

9302e79

…ipnee/kafka into kafka-12639_exit_upon_expired_timer

Merge branch 'apache:trunk' into kafka-12639_exit_upon_expired_timer

a2c8fd2

retrigger test

8b874ec

retrigger test

philipnee added 3 commits February 21, 2023 11:29

wip

488a872

Make sure joinGroupIfNeeded is called twice

1a301c7

- First call will exit upon timeout - Second call should send a proper request before exiting.

re-trigger test

27f99e1

re trigger-test

clean up

3cb89ff

philipnee added 4 commits February 24, 2023 11:00

Need to be more careful at handling timeout.

e0bfc1c

fix indentation

c97ed18

1

b567092

retrigger test

ad6a087

philipnee added 2 commits February 27, 2023 10:57

nit

78b5b9d

Merge branch 'apache:trunk' into kafka-12639_exit_upon_expired_timer

f1a5cd4

guozhangwang reviewed Feb 27, 2023

philipnee commented Feb 27, 2023

philipnee added 2 commits February 27, 2023 15:42

Use a timer in the test as the loop will exit upon expired timer

08dc79d

Documentation enhancement

b13fa0a

guozhangwang reviewed Feb 28, 2023

Revert back from previous changes

d77121f

philipnee commented Feb 28, 2023

documentation

3d9742b

guozhangwang approved these changes Feb 28, 2023

guozhangwang merged commit f7f376f into apache:trunk Mar 1, 2023
