New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KAFKA-16566: Fix consumer static membership system test with new protocol #15738
KAFKA-16566: Fix consumer static membership system test with new protocol #15738
Conversation
Hey @lucasbru, could you take a look at this test fix? Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the fix for this system test, @lianetm! This stuff gets tricky quickly.
I'm +0.99 on this change, but just have the one question about timing with assert
s.
Thanks!
assert num_rebalances == consumer.num_rebalances(), "Static consumers attempt to join with instance id in use should not cause a rebalance" | ||
assert len(consumer.joined_nodes()) == len(consumer.nodes) | ||
assert len(conflict_consumer.joined_nodes()) == 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we anticipate any timing issues here? That is, will num_rebalances()
and joined_nodes()
be "guaranteed" to return the correct values immediately after the call to await_consumed_messages()
is finished? Or do we want to wrap those assertions as wait_until()
s to give them a few seconds to coalesce to the correct value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There should be no timing issues as I see it. For the consumer.joined_nodes
there is a previous self.await_members
, that ensures that we wait for the time needed for all the nodes to join. As for the conflict_consumer.joined_nodes()
, its for nodes that never joined, we're just asserting that after the non-conflicting remained without rebalance, consuming (ensuring activity), the conflicting ones did not join. Makes sense?
FYI, I've been getting successful runs with this change: SESSION REPORT (ALL TESTS) But I still got less frequent failures after trying it several times, so I'm afraid there may be something else (maybe related to sessions remaining open between test executions happening too close to each other?). If that's the case, I'll probably be adding another change here. I'll dig into it and update here. |
@lianetm did you mean to closet the PR? |
Hey @lucasbru, yes, I had closed it just to investigate a bit more about some failures that I noticed, but ended up getting only to flaky behaviour not related to the changes in this PR, so reopening now for review. This PR is fixing a known change in the static membership behaviour between the protocols (it was failing consistently, as expected). I still see flakiness in the test, but not in the path that this PR is fixing, and happening with the legacy and new protocol, so I would say I track that separately, since this fix is already a good improvement to the current test situation. |
else: | ||
consumer.start() | ||
conflict_consumer.start() | ||
|
||
wait_until(lambda: len(consumer.joined_nodes()) + len(conflict_consumer.joined_nodes()) == len(consumer.nodes), | ||
timeout_sec=self.session_timeout_sec, | ||
err_msg="Timed out waiting for consumers to join, expected total %d joined, but only see %d joined from" | ||
timeout_sec=self.session_timeout_sec*2, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added this to help a bit with the flaky behaviour, making it also consistent to how we wait for members in all other tests that rely on the await_members.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR, @lianetm!
LGTM, thanks! Merging this, however:
|
Hey @lucasbru , answering your questions : the new behaviour of the static membership regarding a member that joins with dup group instance Id is documented in this section of the KIP. We've also discussed it with @dajac and other client teams (librd), seeing the improvement the new protocol bring in this area (mainly in cases of conflicting members, that continuously kick each other out with the classic protocol) Your question does makes me notice that, even though the KIP describes how conflicting static members behave in the new protocol, it would probably be helpful extend that explanation to point out the fundamental difference it has with the legacy protocol and how it is an improved approach. (I can't find that explained in the KIP. If I'm not missing it, it would probably be a good update to the KIP static membership section @dajac ?) Thanks @lucasbru ! |
Updating consumer system test that was failing with the new protocol, related to static membership behaviour. The behaviour regarding static consumers that join with conflicting group instance id is slightly different between the classic and new consumer protocol, so the expectations in the tests needed to be updated.
If static members join with same instance id:
Classic protocol: all members join the group with the same group instance id, and then the first one will eventually fail (receives a HB error with FencedInstanceIdException)
Consumer protocol: new member with an instance Id already in use is not able to join, and first member remains active (new member with same instance Id receives an UnreleasedInstanceIdException in the response to the HB to join the group)
This PR is keeping the single parametrized test that existed before, given that what's being tested and part of the test itself apply to all protocols. This is just updating the expectations that are different, based on the protocol parameter.