Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KAFKA-16465: Fix consumer sys test revocation validation #15778

Merged
merged 3 commits into from Apr 29, 2024

Conversation

lianetm
Copy link
Contributor

@lianetm lianetm commented Apr 22, 2024

This fixes a consumer system test that was failing for the new protocol. The failure was because the test was expecting the eager behaviour of partitions being revoked on every rebalance, and it was wrongfully applying it to the runs with the new protocol too.
This same situation was previously identified and fixed in other parts of the sys test with #15661.

@lianetm
Copy link
Contributor Author

lianetm commented Apr 22, 2024

Hey @lucasbru , could you take a look when you have a chance? Similar issue to the one on #15661.

With this I get successful runs of the test (some flaky combinations, mostly around failures already filed/in-flight like failures removing unexpected assignment from list, timeout shutting down)

SESSION REPORT (ALL TESTS)
ducktape version: 0.11.4
session_id: 2024-04-22--006
run time: 66 minutes 30.269 seconds
tests run: 16
passed: 14
flaky: 2
failed: 0
ignored: 0

Thanks!

assert num_revokes_after_bounce == 0, \
"Unexpected revocation triggered when bouncing static member. Expecting 0 but had %d revocations" % num_revokes_after_bounce
elif consumer.is_eager():
assert num_revokes_after_bounce != 0, \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did this test work before for CooperativeStickyAssignor, or was this case not tested?

Also, is there any way now to detect that without static membership, a rebalance happens during the roll? It seems like in the case where static_membership == false, and is_eager == false, we are not asserting anything now, so I wonder how much value is still there in that combination.

Copy link
Contributor Author

@lianetm lianetm Apr 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test was not testing the cooperative case before because it was using the base setup_consumer that has a range assignor by default.

We do have a way to detect rebalances (number of calls to partitions assigned), but that does not allow any special check for dynamic only, since static members will get those same calls to partitions assigned (the static membership is only making that the group assignment is not re-computed, so partition is not re-assigned while the session lasts, expecting that the member might come back, but from the callbacks POV, the static member that leaves gets the same as the dynamic one).

I agree that the value of the dynamic + cooperative is questionable, but there is value in verifying the message delivery semantics further down, when checking the total consumed. The test_consumer_bounce does a very similar check for dynamic members btw, and again, focusing on the delivery semantics, but both tests are not exactly the same (mainly keeping 1 node alive or not), so I would lean towards not removing the combination from this static test, WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, lets not remove a variation.

I'm just thinking about the CONSUMER protocol case. If a dynamic client can pass this test with num_revokes_after_bounce = 0, how meaningful is it really to check that static clients have num_revokes_after_bounce = 0. I could write a consumer that ignores the static membership configuration completely, and still pass this test, right?

The test description writes

In order to make
        sure the behavior of static members are different from dynamic ones, we take both static and dynamic
        membership into this test suite.

But in the CONSUMER protocol, it seems the behavior isn't all that difference, at least if we only look at num_revokes_after_bounce.

So maybe we should instead try to get the set of partitions and check that it didn't change?

Let me know if I'm understanding something wrong.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So maybe we should instead try to get the set of partitions and check that it didn't change?

Doable, but that wouldn't truly ensure the static membership behaviour either I guess, because the test is intentionally leaving 1 member alive that could be the leader or not. So the assignment would remain the same regardless of the static membership under CONSUMER (or Cooperative) if the single partition belongs to the live member (single partition that I guess was intentionally decided to be able to easily check the delivery semantics after the bounces)

So as I see it this test is specifically crafted to validate the stickiness that static membership brings into the RangeAssignor ( nicely explained in this section of the RangeAssignor class doc btw). We're trying to apply it to the CONSUMER protocol but finding not much value given the purpose of the test and the shape it has (bounce n-1 nodes, check at least 1 revocation if dynamic, none if static, regardless of partition ownership). Listening to myself telling you this makes me reconsider if we should just not run this test for the new protocol, as it was truly never intended or run for CooperativeAssignor? (I would probably rename it to something like test_eager_stickiness_on_static_consumer_bounce , make the use of RangeAssignor explicit, and then it looks clearer that we don't want to run such test on CooperativeAssignor or CONSUMER protocol.

We do have other tests that ensure that static membership behaves as expected for the new protocol (ex. test_fencing_static_consumer), but agreed that the "owned partition not re-assigned for a static member that is bounced" is not covered in sys tests (not for CONSUMER, not for Cooperative either). We could think of a new test to cover that. The shape would be different I expect, because it would either have to rely on bouncing a member with assignment while having others with none, or ensuring that all members have at least 1 partition. It would also need a different, more complex delivery semantics validation, if any. I would just suggest a different Jira/PR for a new test, to be able to finalize migrating the current sys tests that apply to the new protocol. Makes sense?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think that's a good plan. Can you update this PR to rename the test and create a separate JIRA for the new test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. I also created the Jira assigned to me to add a test to cover all the non-eager paths that we know we don't have yet.

@lucasbru lucasbru merged commit 636e65a into apache:trunk Apr 29, 2024
1 check failed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants