KAFKA-12983: reset needsJoinPrepare flag before rejoining the group #10986

ableegoldman · 2021-07-07T04:33:45Z

The #onJoinPrepare callback is not always invoked before a member (re)joins the group, but only once when it first enters the rebalance. This means that any updates or events that occur during the join phase can be lost in the internal state: for example, clearing the SubscriptionState (and thus the "ownedPartitions" that are used for cooperative rebalancing) after losing its memberId during a rebalance.

We should reset the needsJoinPrepare flag inside the resetStateAndRejoin() method. This should be cherry-picked back to at least 2.8.
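To make the flag handling concrete, here is a minimal toy model of the behavior described above. It is a sketch under assumptions, not the actual coordinator code: the class `JoinPrepareSketch`, the `resetFlagOnRejoin` switch, the `events` list, and the simplified `joinGroupIfNeeded(boolean)` signature are all hypothetical and exist only for illustration. Only the names `needsJoinPrepare`, `onJoinPrepare`, and `resetStateAndRejoin` come from the description.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: it mimics the needsJoinPrepare flag, nothing else from the real coordinator.
public class JoinPrepareSketch {
    private boolean needsJoinPrepare = true;
    private final boolean resetFlagOnRejoin; // hypothetical switch: true models the proposed fix
    final List<String> events = new ArrayList<>();

    JoinPrepareSketch(boolean resetFlagOnRejoin) {
        this.resetFlagOnRejoin = resetFlagOnRejoin;
    }

    // One join attempt. joinSucceeds == false models a failure that resets member state
    // (for example, a lost memberId).
    boolean joinGroupIfNeeded(boolean joinSucceeds) {
        if (needsJoinPrepare) {
            needsJoinPrepare = false;
            events.add("onJoinPrepare");
        }
        if (!joinSucceeds) {
            resetStateAndRejoin();
            return false;
        }
        needsJoinPrepare = true; // the next rebalance starts with a fresh prepare
        events.add("joined");
        return true;
    }

    private void resetStateAndRejoin() {
        events.add("resetState");
        if (resetFlagOnRejoin) {
            needsJoinPrepare = true; // the one-line idea of the fix
        }
    }

    public static void main(String[] args) {
        for (boolean fixed : new boolean[] {false, true}) {
            JoinPrepareSketch c = new JoinPrepareSketch(fixed);
            c.joinGroupIfNeeded(false); // first attempt fails and resets state
            c.joinGroupIfNeeded(true);  // retry succeeds
            System.out.println((fixed ? "with fix: " : "without fix: ") + c.events);
        }
    }
}
```

In this model, the run without the reset records a single `onJoinPrepare` before the retry, while the run with the reset records a second `onJoinPrepare` after `resetState`, which is the behavior the patch is after.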

dajac · 2021-07-07T06:38:32Z

@ableegoldman Thanks for the patch. The change makes sense to me. I wonder if we could add a unit test which would fail without it though. This would avoid regressing in the future. What do you think?

hachikuji · 2021-07-07T21:46:44Z

@ableegoldman Thanks for the patch. I think the original idea behind the implementation was to ensure that each rebalance triggered only one call to onPartitionsRevoked. It sounds like this needs some refinement for the cooperative rebalance logic. I guess the main difference is that we could now have a call to onPartitionsLost if the memberId is lost after the initial call to onJoinPrepare? It might be nice to ensure that we can keep the same behavior for eager rebalancing.

guozhangwang · 2021-07-12T20:57:18Z

@hachikuji I think the key idea behind this fix is that, if a rebalance failed with e.g. memberId lost, then conceptually we would just start a new rebalance in which we would call onJoinPrepare and in which we may call onPartitionsRevoked again. This behavior would be the same for eager or cooperative.

Personally I think this fix is fine -- @ableegoldman if you could just add a unit test for the case of memberId lost during a first rebalance, and check that we would re-trigger onJoinPrepare?

hachikuji · 2021-07-12T21:50:55Z

To clarify, from the perspective of the eager protocol, how would this case look? Would we get multiple calls to onPartitionsRevoked with the same set of partitions or something else?

ableegoldman · 2021-07-13T01:28:41Z

@hachikuji in the EAGER case, after the first onJoinPrepare / onPartitionsRevoked, the subscription would have been cleared. So any subsequent invocations of onPartitionsRevoked would be with an empty set of partitions.

@everyone, I was having trouble getting a unit test that would actually verify this behavior but I wanted to kick off discussion on the fix ASAP (for obvious reasons) so I opened the PR without one. I do intend to add a test, I just haven't had time to pursue that yet. Suggestions welcome :P
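The EAGER case described above can be sketched with a small toy model. This is an illustration under assumptions, not the real ConsumerCoordinator: `EagerRevokeSketch`, its `assigned` set, and the `revokedCalls` list are hypothetical names, and the sketch only records which partitions each prepare step would hand to onPartitionsRevoked. Whether the real client skips the callback for an empty set is not modelled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Toy model of eager revocation: every prepare step gives up the whole assignment.
public class EagerRevokeSketch {
    private final Set<String> assigned = new TreeSet<>();
    final List<Set<String>> revokedCalls = new ArrayList<>();

    EagerRevokeSketch(Set<String> initialAssignment) {
        assigned.addAll(initialAssignment);
    }

    // Models onJoinPrepare under the eager protocol: record what would be revoked,
    // then clear the assignment.
    void onJoinPrepare() {
        revokedCalls.add(new TreeSet<>(assigned));
        assigned.clear();
    }

    public static void main(String[] args) {
        EagerRevokeSketch member =
            new EagerRevokeSketch(new TreeSet<>(List.of("topic-0", "topic-1")));
        member.onJoinPrepare(); // first prepare: revokes both partitions
        member.onJoinPrepare(); // prepare after a failed join: nothing left to revoke
        System.out.println(member.revokedCalls); // prints [[topic-0, topic-1], []]
    }
}
```

The second call sees an empty set because the first one already cleared the assignment, so repeating the prepare step does not revoke the same partitions twice in this model.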

ableegoldman · 2021-07-13T03:25:03Z

Ok I realize we actually do have a test that reproduces this already: ConsumerCoordinatorTest.testRebalanceWithMetadataChange. This test sets up a case where a change in topic metadata triggers a rebalance after a member had joined the group, after which the change is reverted so that the metadata is ultimately the same. Then a NOT_COORDINATOR response is sent to fail the initial JoinGroup, and the test just verifies that the member attempts to rejoin until successful. It also verifies things like the number of times each rebalance callback is invoked, and the set of partitions that the callbacks receive.
This test actually only failed in the COOPERATIVE case, which confirms that the behavior remains correct for the EAGER case. When following the COOPERATIVE protocol, the test was formerly assuming that the member would retain all partitions despite actually having its generation and memberId cleared when the initial JoinGroup fails. So it was technically asserting the wrong behavior beforehand; just fixing this gives us a unit test for this patch after all.

ableegoldman · 2021-07-13T03:30:29Z

Now ready for review @dajac @hachikuji @guozhangwang

dajac

LGTM, thanks.

ableegoldman · 2021-07-13T19:12:15Z

Two test failures, both ConsumerBounceTest.testCloseDuringRebalance(). This test is already known to be flaky and failed with the same error that has been reported before (KAFKA-8529), so I think we can conclude that this was unrelated.

…10986) The #onJoinPrepare callback is not always invoked before a member (re)joins the group, but only once when it first enters the rebalance. This means that any updates or events that occur during the join phase can be lost in the internal state: for example, clearing the SubscriptionState (and thus the "ownedPartitions" that are used for cooperative rebalancing) after losing its memberId during a rebalance. We should reset the needsJoinPrepare flag inside the resetStateAndRejoin() method. Reviewers: Guozhang Wang <guozhang@apache.org>, Jason Gustafson <jason@confluent.io>, David Jacot <djacot@confluent.io>

ableegoldman · 2021-07-13T19:29:07Z

Merged to trunk and cherrypicked to 2.8 & 3.0 (cc @kkonstantine)


ableegoldman requested review from dajac, guozhangwang and hachikuji July 7, 2021 04:33

reset needsJoinPrepare flag when resetting/rejoining

b3fa141

unit test

0483d07

ableegoldman force-pushed the 12983-always-invoke-onJoinPrepare branch from 5911911 to 0483d07 Compare July 13, 2021 03:30

guozhangwang approved these changes Jul 13, 2021


dajac approved these changes Jul 13, 2021


ableegoldman merged commit 1f64df9 into apache:trunk Jul 13, 2021

guozhangwang mentioned this pull request Aug 24, 2021

KAFKA-13214; Consumer should not reset state after retriable error in rebalance #11231

Merged

3 tasks

ableegoldman mentioned this pull request Sep 17, 2021

MINOR: re-add removed test coverage for 'KAFKA-12983: reset needsJoinPrepare flag' #11332

Merged
