Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KAFKA-15680: Partition-Count is not getting updated Correctly in the Incremental Co-operative Rebalancing(ICR) Mode of Rebalancing. #14630

Merged
merged 9 commits into from Nov 6, 2023

Conversation

kumarpritam863
Copy link
Contributor

  • In ICR(Incremental Cooperative Rebalancing) mode, whenever a new worker, say Worker 3 joins, a new global assignment is computed by the leader, say Worker1, that results in the revocation of one task from each existing worker i.e Worker1 and Worker2.
  • Once the new member join is completed, ConsumerCoordinator.OnJoinComplete() method is called which primarily computes all the new partitions assigned and the partitions which are revoked and updates the subscription Object.
  • If it was the case of revocation which we check by checking the “partitonsRevoked” list, we call the method “invokePartitionRevoked()” which internally calls “updatePartitionCount()” which fetches partition from the assignment object which is yet not updated by the new assignment.
  • It is only just before calling the “invokePartitionsAssigned()” method that we update the assignment by invoking the following → subscriptions.assignFromSubscribed(assignedPartitions);

…partition count in both partition invocation or addition case.

Partition-Count is not getting updated Correctly in the Incremental Co-operative Rebalancing(ICR) Mode of Rebalancing.
@kumarpritam863
Copy link
Contributor Author

@C0urante Can you please review?

@C0urante
Copy link
Contributor

C0urante commented Oct 25, 2023

@kumarpritam863 is this the same as #12622 and https://issues.apache.org/jira/browse/KAFKA-14220? And if so, can you shed some light on why that PR and ticket were wiped and closed without comment after the initial round of review?

@kumarpritam863
Copy link
Contributor Author

@kumarpritam863 is this the same as #12622 and https://issues.apache.org/jira/browse/KAFKA-14220? And if so, can you shed some light on why that PR and ticket were wiped and closed without comment after the initial round of review?

Yes @C0urante This is same and I closed the PR and JIRA at that time due to some privacy policy about which I was not sure at that time. Really sorry for that as that was miss from my side. Can we please precede this with new PR and JIRA.

@kumarpritam863
Copy link
Contributor Author

kumarpritam863 commented Oct 26, 2023

@C0urante
It's a small but important modification, just a change in the order of calling the UpdatePartitionCount method and practically involve no code change.
I kindly request your review and approval for this PR. Your quick attention to this matter would be greatly appreciated.

If you have any questions or need further clarification about the changes, please feel free to reach out to me. Your insights and expertise are highly valued, and your approval would be really helpful.

Thank you in advance for your time and consideration.

Best regards,
Pritam

@C0urante
Copy link
Contributor

I can take a look next week. In the meantime, do you think you could add some unit tests for this change?

@kumarpritam863
Copy link
Contributor Author

kumarpritam863 commented Oct 27, 2023

@C0urante
As I mentioned this is a very small change just a change in the order of calling the same function which updates the partition count. This involves practically no code change and more than that I almost explored all the options to add UNIT or Integration test but it seems very complex adding test cases for this change. But we have tested this in our lower environments with multiple scenarios covering multiple dimensions and we have not encountered any issue. I have already spent a lot of time figuring out ways to add unit test cases.

These are the few reasons why adding unit test is difficult for this:

  1. Although the change is in workerSinkTask but the root cause of the issue is in ConsumerCoordinator
  2. So the current unit test cases for worker sink task mocks the consumer assignment itself.
  3. ConsumerCoordinator is not even a dependency for workerSinkTask and it is indirectly accessed via the kafka consumer Client.
  4. "onJoinComplete" method is protected and can be called from that module itself.

Can we please take this without the test case or with the existing test case. This is a really small change but of utmost important to us as we heavily rely on this metrics.

Thank you in advance for your time and consideration.

Best regards,
Pritam

@C0urante
Copy link
Contributor

@kumarpritam863 I understand that the code changes are small, and I'm glad to hear that pre-prod testing indicates that this commit has the intended effect. However, I still think that unit testing coverage is warranted--we want to make sure not just that this change works now, but that we don't accidentally regress in the future.

I think it should be possible to use the MockConsumer class (possibly with some small tweaks to is rebalance method to make the ordering of steps more closely match the ordering in the KafkaConsumer class when rebalances take place) to write a new unit test in the WorkerSinkTaskTest suite that verifies that the partition-count metric is updated appropriately.

You may also notice that existing unit tests are failing now (see the CI results, or feel free to test this PR locally with ./gradlew :connect:runtime:unitTest). Can you please fix the failing tests and make sure that the build works (at least locally) before requesting review again?

Finally, can you update the title of the PR to describe what changes are actually made? You can see examples of good PR titles in the list of recently-merged commits to trunk.

Thanks!

@kumarpritam863 kumarpritam863 changed the title Kafka 15680 Kafka-15680 :: Calling UpdatePartitionCount method from "onPartitionsAssigned()" to handle the partition addition and revocation simultaneoulsy. Oct 30, 2023
…n call. Call to closePartition method will not invoke consumer.assignment(). To accomodate That I have reduced the count in times call to 1.
@kumarpritam863
Copy link
Contributor Author

kumarpritam863 commented Oct 31, 2023

@C0urante
https://github.com/apache/kafka/blob/9dbee599f13997effd8f7e278fd7256b850c8813/clients/src/main/java/org/apache/kafka/clients/consumer/MockConsumer.java Implements clients/src/main/java/org/apache/kafka/clients/consumer/Consumer.java but worker connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java expects a clients/src/main/java/org/apache/kafka/clients/consumer/KafkaConsumer.java. The only way to use "MockConsumer" would be to create "MockWorkerSinkTask" which is not a small tweak and would be a big change to test this small code change.
Please let me know your thoughts on this and if I mis-understood something or if I am going down a wrong lane.
Thanks
Pritam

@kumarpritam863
Copy link
Contributor Author

@C0urante All the tests passed locally. But I will check for the failing tests again and will update here.

@kumarpritam863
Copy link
Contributor Author

I ran the following command:
./gradlew :connect:runtime:unitTest
And all the test passed locally and build is successful.

BUILD SUCCESSFUL in 3m 50s
73 actionable tasks: 6 executed, 67 up-to-date

@C0urante
Copy link
Contributor

That's fine, we can change the left-hand type for WorkerSinkTask.consumer from KafkaConsumer<byte[], byte[]> to Consumer<byte[], byte[]>.

@kumarpritam863
Copy link
Contributor Author

@C0urante Did not get the above part. Do you want me to change the type of consumer from KafkaConsumer<byte[], byte[]> to Consumer<byte[], byte[]> in the WorkerSinkTask class.

@C0urante
Copy link
Contributor

Yes, exactly 👍

Pritam Kumar added 3 commits November 1, 2023 12:35
… flow to update the partition count which updates wrong value before the actual update of Susbscription object 2. The correct flow which updates the partition after the subscription object is updated with the correct subscrition. As part of the test case update the signature for consumer from KafkaConsumer to consumer in worker sink task and worker sink task context.
@kumarpritam863
Copy link
Contributor Author

@C0urante I have added the test for both the scenarios:

  1. Test failing if the ConsumerRebalanceListener of WorkerSinkTask adopts to current flow where we update the partition count before updation of Subscription Object.
  2. Test Passing if the ConsumerRebalanceListener of WorkerSinkTask adopts to the flow added by this PR.

Can you please review and let me know if any other change or modification is required.

@kumarpritam863 kumarpritam863 changed the title Kafka-15680 :: Calling UpdatePartitionCount method from "onPartitionsAssigned()" to handle the partition addition and revocation simultaneoulsy. Kafka-15680 :: Relocate the "updatePartitionCall()" method to the "onPartitionsAssigned(Collection<TopicPartition> partitions)" method within the "WorkerSinkTask.class" to effectively manage both the addition and revocation of partitions. Nov 1, 2023
@kumarpritam863 kumarpritam863 changed the title Kafka-15680 :: Relocate the "updatePartitionCall()" method to the "onPartitionsAssigned(Collection<TopicPartition> partitions)" method within the "WorkerSinkTask.class" to effectively manage both the addition and revocation of partitions. KAFKA-15680: Partition-Count is not getting updated Correctly in the Incremental Co-operative Rebalancing(ICR) Mode of Rebalancing. Nov 2, 2023
Copy link
Contributor

@C0urante C0urante left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @kumarpritam863, this is headed in the right direction. I think we can do a little better on the tests though; left my thoughts.

transformationChain, mockConsumer, pluginLoader, time,
RetryWithToleranceOperatorTest.NOOP_OPERATOR, null, statusBackingStore, Collections::emptyList);
// Subscribing to Topic = "test" with "MockHandleRebalance"
mockConsumer.subscribe(asList(TOPIC), new MockHandleRebalancePass());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't be subscribing directly here. Instead, the WorkerSinkTask instance should be issuing this call as part of its initializeAndStart method. This will also allow us to test the real consumer rebalance listener used by the WorkerSinkTask class, instead of a mocked listener.

I think the sequence of calls should be something like this:

  • MockConsumer::updateBeginningOffsets (necessary because the WorkerSinkTask instance's rebalance listener will try to get the current position of its assignment in onPartitionsAssigned)
  • WorkerSinkTask::initialize (you can use the TASK_CONFIG constant in the test class as the argument)
  • WorkerSinkTask::initializeAndStart
  • assertSinkMetricValue, MockConsumer::rebalance (as many times as you want)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @C0urante , Yes this would be nice and I have added this step and removed the Mock Handle Rebalances.

@@ -161,6 +165,7 @@ public class WorkerSinkTaskTest {
private StatusBackingStore statusBackingStore;
@Mock
private KafkaConsumer<byte[], byte[]> consumer;
private Consumer<byte[], byte[]> mockConsumer;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: it's a little strange to call this mockConsumer when there's also a consumer field that itself contains a mocked consumer. To avoid confusion, can we just make this a local variable for any of the tests that require it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I have made this a local variable inside that test method.

@@ -1999,6 +2004,81 @@ public void testErrorReporterConfigurationExceptionPropagation() {
PowerMock.verifyAll();
}

@Test
public void testPartitionCountInCaseOfPartitionRevocation() {
mockConsumer = new MockConsumer<>(OffsetResetStrategy.EARLIEST);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we make this a local variable, we can change the left-hand type to MockConsumer<byte[], byte[]> and then the casts later on at lines 2019 and 2022 won't be necessary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I missed this, Thanks, I have added this change.

}

@Test
public void testPartitionCountInCaseOfPartitionRevocationFail() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is testing the MockConsumer behavior more than anything else. If we want to do that (which I don't think is necessary but wouldn't oppose), then we should move this kind of testing into the MockConsumerTest suite.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I added this to test the failure scenario but as I have made a change to use Handle Rebalance of Worker Sink Task this is not needed anymore.

Comment on lines 2046 to 2080
/*
Correct Order to Call the updatePartitionCount
*/
private class MockHandleRebalancePass implements ConsumerRebalanceListener {
@Override
public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
workerTask.sinkTaskMetricsGroup().recordPartitionCount(mockConsumer.assignment().size());
if (partitions.isEmpty()) {
return;
}
// Not doing anything as the objective is to test only Partition Count
}
@Override
public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
// Not doing anything as we do not updatePartitionCount here.
}
}
/*
Existing flow which calls updatePartitionCount before updation of Subscription.
*/
private class MockHandleRebalanceFail implements ConsumerRebalanceListener {
@Override
public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
if (partitions.isEmpty()) {
return;
}
// Not doing anything as the objective is to test only Partition Count
workerTask.sinkTaskMetricsGroup().recordPartitionCount(mockConsumer.assignment().size());
}
@Override
public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
// Not doing anything as we do not updatePartitionCount here.
workerTask.sinkTaskMetricsGroup().recordPartitionCount(mockConsumer.assignment().size());
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need any custom rebalance listeners for this PR, we should just be testing the one used by the WorkerSinkTask class.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, @C0urante , we do not need. Actually missed this. I have added this change.

@kumarpritam863
Copy link
Contributor Author

@C0urante I have addressed all the comments and made the changes accordingly. Can you please review once and let me know if any changes are required or I missed anything.

Copy link
Contributor

@C0urante C0urante left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @kumarpritam863!

@C0urante C0urante merged commit ca05da9 into apache:trunk Nov 6, 2023
1 check failed
C0urante pushed a commit that referenced this pull request Nov 6, 2023
…nsumer protocol is used (#14630)

Reviewers: Chris Egerton <chrise@aiven.io>
@kumarpritam863
Copy link
Contributor Author

kumarpritam863 commented Nov 6, 2023

Thanks @C0urante for all the help.

mjsax pushed a commit to confluentinc/kafka that referenced this pull request Nov 22, 2023
…nsumer protocol is used (apache#14630)

Reviewers: Chris Egerton <chrise@aiven.io>
yyu1993 pushed a commit to yyu1993/kafka that referenced this pull request Feb 15, 2024
…nsumer protocol is used (apache#14630)

Reviewers: Chris Egerton <chrise@aiven.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants