KAFKA-16566: Fix consumer static membership system test with new protocol #15738

lianetm · 2024-04-16T20:02:37Z

Updating consumer system test that was failing with the new protocol, related to static membership behaviour. The behaviour regarding static consumers that join with conflicting group instance id is slightly different between the classic and new consumer protocol, so the expectations in the tests needed to be updated.

If static members join with same instance id:

Classic protocol: all members join the group with the same group instance id, and then the first one will eventually fail (receives a HB error with FencedInstanceIdException)
Consumer protocol: new member with an instance Id already in use is not able to join, and first member remains active (new member with same instance Id receives an UnreleasedInstanceIdException in the response to the HB to join the group)

This PR is keeping the single parametrized test that existed before, given that what's being tested and part of the test itself apply to all protocols. This is just updating the expectations that are different, based on the protocol parameter.

lianetm · 2024-04-16T20:40:54Z

Hey @lucasbru, could you take a look at this test fix? Thanks!

kirktrue

Thanks for the fix for this system test, @lianetm! This stuff gets tricky quickly.

I'm +0.99 on this change, but just have the one question about timing with asserts.

Thanks!

kirktrue · 2024-04-16T22:06:11Z

tests/kafkatest/tests/client/consumer_test.py

+                assert num_rebalances == consumer.num_rebalances(), "Static consumers attempt to join with instance id in use should not cause a rebalance"
+                assert len(consumer.joined_nodes()) == len(consumer.nodes)
+                assert len(conflict_consumer.joined_nodes()) == 0


Do we anticipate any timing issues here? That is, will num_rebalances() and joined_nodes() be "guaranteed" to return the correct values immediately after the call to await_consumed_messages() is finished? Or do we want to wrap those assertions as wait_until()s to give them a few seconds to coalesce to the correct value?

There should be no timing issues as I see it. For the consumer.joined_nodes there is a previous self.await_members, that ensures that we wait for the time needed for all the nodes to join. As for the conflict_consumer.joined_nodes(), its for nodes that never joined, we're just asserting that after the non-conflicting remained without rebalance, consuming (ensuring activity), the conflicting ones did not join. Makes sense?

lianetm · 2024-04-17T02:50:52Z

FYI, I've been getting successful runs with this change:
running:
TC_PATHS="tests/kafkatest/tests/client/consumer_test.py::OffsetValidationTest.test_fencing_static_consumer" bash tests/docker/run_tests.sh
results:

SESSION REPORT (ALL TESTS)
ducktape version: 0.11.4
session_id: 2024-04-16--016
run time: 27 minutes 15.117 seconds
tests run: 16
passed: 16
flaky: 0
failed: 0
ignored: 0

But I still got less frequent failures after trying it several times, so I'm afraid there may be something else (maybe related to sessions remaining open between test executions happening too close to each other?). If that's the case, I'll probably be adding another change here.

I'll dig into it and update here.

lucasbru · 2024-04-17T10:31:00Z

@lianetm did you mean to closet the PR?

lianetm · 2024-04-18T13:33:48Z

Hey @lucasbru, yes, I had closed it just to investigate a bit more about some failures that I noticed, but ended up getting only to flaky behaviour not related to the changes in this PR, so reopening now for review.

This PR is fixing a known change in the static membership behaviour between the protocols (it was failing consistently, as expected). I still see flakiness in the test, but not in the path that this PR is fixing, and happening with the legacy and new protocol, so I would say I track that separately, since this fix is already a good improvement to the current test situation.

lianetm · 2024-04-18T13:37:21Z

tests/kafkatest/tests/client/consumer_test.py

        else:
            consumer.start()
            conflict_consumer.start()

            wait_until(lambda: len(consumer.joined_nodes()) + len(conflict_consumer.joined_nodes()) == len(consumer.nodes),
-                       timeout_sec=self.session_timeout_sec,
-                       err_msg="Timed out waiting for consumers to join, expected total %d joined, but only see %d joined from"
+                       timeout_sec=self.session_timeout_sec*2,


I added this to help a bit with the flaky behaviour, making it also consistent to how we wait for members in all other tests that rely on the await_members.

kirktrue

Thanks for the PR, @lianetm!

lucasbru · 2024-04-19T09:32:58Z

LGTM, thanks!

Merging this, however:

Have we discussed the behavioral difference with broker team / David?
Have we documented the behavioral difference anywhere?

lianetm · 2024-04-19T13:23:24Z

Hey @lucasbru , answering your questions : the new behaviour of the static membership regarding a member that joins with dup group instance Id is documented in this section of the KIP. We've also discussed it with @dajac and other client teams (librd), seeing the improvement the new protocol bring in this area (mainly in cases of conflicting members, that continuously kick each other out with the classic protocol)

Your question does makes me notice that, even though the KIP describes how conflicting static members behave in the new protocol, it would probably be helpful extend that explanation to point out the fundamental difference it has with the legacy protocol and how it is an improved approach. (I can't find that explained in the KIP. If I'm not missing it, it would probably be a good update to the KIP static membership section @dajac ?)

Thanks @lucasbru !

Support new protocol

863341f

kirktrue reviewed Apr 16, 2024

View reviewed changes

lianetm closed this Apr 17, 2024

lianetm added 2 commits April 18, 2024 09:18

Give consumers more time

29f08cc

Fix typo

d392a19

lianetm reopened this Apr 18, 2024

lianetm commented Apr 18, 2024

View reviewed changes

kirktrue approved these changes Apr 18, 2024

View reviewed changes

lucasbru merged commit 9ea0503 into apache:trunk Apr 19, 2024
1 check failed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

KAFKA-16566: Fix consumer static membership system test with new protocol #15738

KAFKA-16566: Fix consumer static membership system test with new protocol #15738

lianetm commented Apr 16, 2024

lianetm commented Apr 16, 2024

kirktrue left a comment

kirktrue Apr 16, 2024

lianetm Apr 17, 2024

lianetm commented Apr 17, 2024 •

edited

lucasbru commented Apr 17, 2024

lianetm commented Apr 18, 2024

lianetm Apr 18, 2024

kirktrue left a comment

lucasbru commented Apr 19, 2024

lianetm commented Apr 19, 2024

KAFKA-16566: Fix consumer static membership system test with new protocol #15738

KAFKA-16566: Fix consumer static membership system test with new protocol #15738

Conversation

lianetm commented Apr 16, 2024

lianetm commented Apr 16, 2024

kirktrue left a comment

Choose a reason for hiding this comment

kirktrue Apr 16, 2024

Choose a reason for hiding this comment

lianetm Apr 17, 2024

Choose a reason for hiding this comment

lianetm commented Apr 17, 2024 • edited

lucasbru commented Apr 17, 2024

lianetm commented Apr 18, 2024

lianetm Apr 18, 2024

Choose a reason for hiding this comment

kirktrue left a comment

Choose a reason for hiding this comment

lucasbru commented Apr 19, 2024

lianetm commented Apr 19, 2024

lianetm commented Apr 17, 2024 •

edited