Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KAFKA-16566: Fix consumer static membership system test with new protocol #15738

Merged
merged 3 commits into from Apr 19, 2024

Conversation

lianetm
Copy link
Contributor

@lianetm lianetm commented Apr 16, 2024

Updating consumer system test that was failing with the new protocol, related to static membership behaviour. The behaviour regarding static consumers that join with conflicting group instance id is slightly different between the classic and new consumer protocol, so the expectations in the tests needed to be updated.

If static members join with same instance id:

  • Classic protocol: all members join the group with the same group instance id, and then the first one will eventually fail (receives a HB error with FencedInstanceIdException)

  • Consumer protocol: new member with an instance Id already in use is not able to join, and first member remains active (new member with same instance Id receives an UnreleasedInstanceIdException in the response to the HB to join the group)

This PR is keeping the single parametrized test that existed before, given that what's being tested and part of the test itself apply to all protocols. This is just updating the expectations that are different, based on the protocol parameter.

@lianetm
Copy link
Contributor Author

lianetm commented Apr 16, 2024

Hey @lucasbru, could you take a look at this test fix? Thanks!

Copy link
Contributor

@kirktrue kirktrue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix for this system test, @lianetm! This stuff gets tricky quickly.

I'm +0.99 on this change, but just have the one question about timing with asserts.

Thanks!

Comment on lines +364 to +366
assert num_rebalances == consumer.num_rebalances(), "Static consumers attempt to join with instance id in use should not cause a rebalance"
assert len(consumer.joined_nodes()) == len(consumer.nodes)
assert len(conflict_consumer.joined_nodes()) == 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we anticipate any timing issues here? That is, will num_rebalances() and joined_nodes() be "guaranteed" to return the correct values immediately after the call to await_consumed_messages() is finished? Or do we want to wrap those assertions as wait_until()s to give them a few seconds to coalesce to the correct value?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be no timing issues as I see it. For the consumer.joined_nodes there is a previous self.await_members, that ensures that we wait for the time needed for all the nodes to join. As for the conflict_consumer.joined_nodes(), its for nodes that never joined, we're just asserting that after the non-conflicting remained without rebalance, consuming (ensuring activity), the conflicting ones did not join. Makes sense?

@lianetm
Copy link
Contributor Author

lianetm commented Apr 17, 2024

FYI, I've been getting successful runs with this change:
running:
TC_PATHS="tests/kafkatest/tests/client/consumer_test.py::OffsetValidationTest.test_fencing_static_consumer" bash tests/docker/run_tests.sh
results:

SESSION REPORT (ALL TESTS)
ducktape version: 0.11.4
session_id: 2024-04-16--016
run time: 27 minutes 15.117 seconds
tests run: 16
passed: 16
flaky: 0
failed: 0
ignored: 0

But I still got less frequent failures after trying it several times, so I'm afraid there may be something else (maybe related to sessions remaining open between test executions happening too close to each other?). If that's the case, I'll probably be adding another change here.

I'll dig into it and update here.

@lianetm lianetm closed this Apr 17, 2024
@lucasbru
Copy link
Member

@lianetm did you mean to closet the PR?

@lianetm lianetm reopened this Apr 18, 2024
@lianetm
Copy link
Contributor Author

lianetm commented Apr 18, 2024

Hey @lucasbru, yes, I had closed it just to investigate a bit more about some failures that I noticed, but ended up getting only to flaky behaviour not related to the changes in this PR, so reopening now for review.

This PR is fixing a known change in the static membership behaviour between the protocols (it was failing consistently, as expected). I still see flakiness in the test, but not in the path that this PR is fixing, and happening with the legacy and new protocol, so I would say I track that separately, since this fix is already a good improvement to the current test situation.

else:
consumer.start()
conflict_consumer.start()

wait_until(lambda: len(consumer.joined_nodes()) + len(conflict_consumer.joined_nodes()) == len(consumer.nodes),
timeout_sec=self.session_timeout_sec,
err_msg="Timed out waiting for consumers to join, expected total %d joined, but only see %d joined from"
timeout_sec=self.session_timeout_sec*2,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this to help a bit with the flaky behaviour, making it also consistent to how we wait for members in all other tests that rely on the await_members.

Copy link
Contributor

@kirktrue kirktrue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR, @lianetm!

@lucasbru
Copy link
Member

LGTM, thanks!

Merging this, however:

  • Have we discussed the behavioral difference with broker team / David?
  • Have we documented the behavioral difference anywhere?

@lucasbru lucasbru merged commit 9ea0503 into apache:trunk Apr 19, 2024
1 check failed
@lianetm
Copy link
Contributor Author

lianetm commented Apr 19, 2024

Hey @lucasbru , answering your questions : the new behaviour of the static membership regarding a member that joins with dup group instance Id is documented in this section of the KIP. We've also discussed it with @dajac and other client teams (librd), seeing the improvement the new protocol bring in this area (mainly in cases of conflicting members, that continuously kick each other out with the classic protocol)

Your question does makes me notice that, even though the KIP describes how conflicting static members behave in the new protocol, it would probably be helpful extend that explanation to point out the fundamental difference it has with the legacy protocol and how it is an improved approach. (I can't find that explained in the KIP. If I'm not missing it, it would probably be a good update to the KIP static membership section @dajac ?)

Thanks @lucasbru !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants