
KAFKA-16686: Wait for given offset in TopicBasedRemoteLogMetadataManagerTest #15885

Merged · 2 commits · May 15, 2024

Conversation

gaurav-narula (Contributor)

Some tests in TopicBasedRemoteLogMetadataManagerTest flake because waitUntilConsumerCatchesUp may break early, before the consumer manager has caught up with all the events.

This PR adds expected offsets for the leader/follower metadata partitions and ensures we wait until the read offset is at least equal to the given target, to avoid flakiness.

Refer to the Gradle Enterprise Report for more information on the flakiness.
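A minimal sketch of the idea, assuming the parameter names from the description above (the actual PR diff differs in detail): the test polls until the metadata partition's read offset reaches the expected target instead of breaking out as soon as any record has been consumed.

```java
// Sketch only: topicBasedRlmm() and readOffsetForPartition(...) come from the
// existing test class; the target-offset parameter is the new addition.
private void waitUntilConsumerCatchesUp(int leaderMetadataPartition,
                                        long targetLeaderMetadataPartitionOffset,
                                        long timeoutMs) throws InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (System.currentTimeMillis() < deadline) {
        // Wait until the read offset is at least the expected target offset.
        if (topicBasedRlmm().readOffsetForPartition(leaderMetadataPartition).orElse(-1L)
                >= targetLeaderMetadataPartitionOffset) {
            return;
        }
        Thread.sleep(100L);
    }
    throw new AssertionError("Consumer did not reach offset " + targetLeaderMetadataPartitionOffset
            + " within " + timeoutMs + " ms");
}
```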

@gaurav-narula (Contributor Author)

CC: @clolov @satishd

@kamalcph kamalcph (Collaborator) left a comment

Thanks @gaurav-narula for the patch! Left minor comments to address.

We can rewrite this test to be concise. I'll file a separate ticket for this.

  if (leaderMetadataPartition == followerMetadataPartition) {
-     if (topicBasedRlmm().readOffsetForPartition(leaderMetadataPartition).orElse(-1L) >= 1) {
+     Assertions.assertEquals(targetLeaderMetadataPartitionOffset, targetFollowerMetadataPartitionOffset);
+     if (topicBasedRlmm().readOffsetForPartition(leaderMetadataPartition).orElse(-1L) >= targetLeaderMetadataPartitionOffset) {
kamalcph (Collaborator):

Previously we were waiting for >= 1; after this change, it is >= 0. This will make the test more flaky.

When the leader and follower partitions map to the same metadata partition, we have to wait for twice the number of messages.
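A hedged illustration of this point (the arithmetic is an assumption for illustration only, not the PR's code): when both roles share one metadata partition, the offset being waited for has to account for the records produced for both of them.

```java
// Illustration only: with a shared metadata partition, records for the leader
// and the follower both land on that partition, so the wait condition must
// cover both batches rather than either target alone.
long requiredOffset = (leaderMetadataPartition == followerMetadataPartition)
        ? targetLeaderMetadataPartitionOffset + targetFollowerMetadataPartitionOffset
        : targetLeaderMetadataPartitionOffset;
```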

@satishd satishd (Member) left a comment

Thanks @gaurav-narula for the PR, left a meta comment.

  Assertions.assertTrue(topicBasedRlmm().listRemoteLogSegments(newLeaderTopicIdPartition).hasNext());
  Assertions.assertTrue(topicBasedRlmm().listRemoteLogSegments(newFollowerTopicIdPartition).hasNext());
  }

  private void waitUntilConsumerCatchesUp(TopicIdPartition newLeaderTopicIdPartition,
                                          TopicIdPartition newFollowerTopicIdPartition,
-                                         long timeoutMs) throws TimeoutException {
+                                         long timeoutMs,
+                                         long targetLeaderMetadataPartitionOffset,
satishd (Member):

These parameters will not help much here, as this method was written for testNewPartitionUpdates but other tests in this class use the functionality with gaps. It is better to revisit those use cases and refactor this method accordingly.

@gaurav-narula force-pushed the KAFKA-16686 branch 3 times, most recently from d2684d3 to 8586c03 on May 12, 2024, 01:27
@gaurav-narula (Contributor Author)

Thanks for the feedback @kamalcph @satishd!

I've modified the tests so that we propagate a Consumer<RemoteLogMetadata> down to ConsumerTask and use it only for tests.

This allows us to replace waitUntilConsumerCatchesUp with TestUtils.waitForCondition and actually wait for the consumption of all the expected RemoteLogMetadata objects we set up in the tests, instead of relying on offsets, which is ambiguous.

Please have a look and let me know your thoughts!
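A minimal sketch of that shape, assuming a hypothetical onConsume callback wired into ConsumerTask and the usual TestUtils.waitForCondition overload (names are illustrative, not the exact diff):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;
import org.apache.kafka.test.TestUtils;

// Sketch: the test records every RemoteLogMetadata the ConsumerTask hands to
// the callback, then waits until all expected events have been observed.
Set<RemoteLogMetadata> consumed = ConcurrentHashMap.newKeySet();
Consumer<RemoteLogMetadata> onConsume = consumed::add; // passed down to ConsumerTask, test-only

// ... publish the expected RemoteLogSegmentMetadata events via the manager under test ...

TestUtils.waitForCondition(
        () -> consumed.size() >= expectedEventCount,
        30_000L,
        "ConsumerTask did not consume all expected RemoteLogMetadata events");
```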

@@ -153,6 +173,7 @@ public void run() {

  private void processConsumerRecord(ConsumerRecord<byte[], byte[]> record) {
      final RemoteLogMetadata remoteLogMetadata = serde.deserialize(record.value());
+     onConsume.accept(remoteLogMetadata);
kamalcph (Collaborator):

This is not the correct way to capture the events. Assume that the test case doesn't want to process an event (the shouldProcess check returns false); we don't want that event to be captured.

Instead, we can have a setter method for RemotePartitionMetadataStore and pass a custom implementation, similar to DummyEventHandler, where we capture the event and delegate it to the real implementation.
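A rough sketch of that suggestion (the handler method name comes from this thread; the exact class hierarchy and signatures are assumptions, so treat it as illustrative):

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import org.apache.kafka.server.log.remote.storage.RemoteLogSegmentMetadata;

// Illustrative capturing delegate: it records only the events that actually
// reach the real store's handler (i.e. events that passed shouldProcess) and
// then delegates to the real implementation.
class CapturingRemotePartitionMetadataStore extends RemotePartitionMetadataStore {
    final List<RemoteLogSegmentMetadata> handled = new CopyOnWriteArrayList<>();

    @Override
    public void handleRemoteLogSegmentMetadata(RemoteLogSegmentMetadata segmentMetadata) {
        handled.add(segmentMetadata);                           // capture for assertions
        super.handleRemoteLogSegmentMetadata(segmentMetadata);  // delegate to the real store
    }
}
```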

gaurav-narula (Contributor Author):

Thanks for the suggestion! This made me realise there's also a possible race where, even after RemotePartitionMetadataStore::handleRemoteLogSegmentMetadata is invoked, the assertions on topicBasedRlmm().listRemoteLogSegments may fail because remoteLogMetadataCache.isInitialized() may return false.

Inspired by your suggestion to hook into RemotePartitionMetadataStore, I've modified TopicBasedRemoteLogMetadataManagerHarness to accept a spy object for it, which is passed down to ConsumerTask. The tests are modified to ensure handleRemoteLogSegmentMetadata and markInitialized are invoked an appropriate number of times.
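A minimal sketch of the resulting wait, assuming Mockito and illustrative expected counts (the harness wiring and the store's constructor are assumptions):

```java
import static org.mockito.ArgumentMatchers.any;
import static org.mockito.Mockito.spy;
import static org.mockito.Mockito.timeout;
import static org.mockito.Mockito.verify;

// Sketch: the spy is handed to the harness; verify(..., timeout(...)) blocks
// until the store has seen the expected calls, so assertions on
// listRemoteLogSegments only run after the cache is populated and initialized.
RemotePartitionMetadataStore spyStore = spy(new RemotePartitionMetadataStore()); // no-arg constructor assumed
// ... wire spyStore into TopicBasedRemoteLogMetadataManagerHarness and publish segment metadata ...
verify(spyStore, timeout(30_000).times(2)).handleRemoteLogSegmentMetadata(any());
verify(spyStore, timeout(30_000).times(2)).markInitialized(any());
```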

kamalcph (Collaborator):

Thanks for updating the test, the approach LGTM!

Nit: Why are we using the supplier pattern instead of adding a setter to TopicBasedRemoteLogMetadataManager and marking it as visibleForTesting?

gaurav-narula (Contributor Author):

> Nit: Why are we using the supplier pattern instead of adding a setter to TopicBasedRemoteLogMetadataManager and marking it as visibleForTesting?

IIUC, you're alluding to something similar to what we do for remoteLogMetadataTopicPartitioner.

That can work, but I feel it's very easy to introduce a race inadvertently, since TopicBasedRemoteLogMetadataManager::configure spawns a thread:

initializationThread = KafkaThread.nonDaemon("RLMMInitializationThread", this::initializeResources);

In fact, remoteLogMetadataTopicPartitioner is prone to a race: if the test thread yields before line 109 is executed, the ProducerManager and ConsumerManager instances can get instantiated with the wrong remoteLogMetadataTopicPartitioner instance:

producerManager = new ProducerManager(rlmmConfig, rlmTopicPartitioner);

We can avoid it by invoking the setter before calling TopicBasedRemoteLogMetadataManager::configure, but I feel it's easier to enforce it by using a Supplier instead. Either way, I feel this race should be fixed as well now :)
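A short sketch of the Supplier-based wiring being argued for (the configs variable and exact constructor use are assumptions): because the dependency is fixed at construction and resolved inside configure, there is no window in which the initialization thread can observe a stale instance that a setter has not yet replaced.

```java
// Sketch: the test-provided store is already in place when configure() spawns
// the RLMMInitializationThread, so ProducerManager/ConsumerManager cannot be
// built against a different instance.
RemotePartitionMetadataStore spyStore = Mockito.spy(new RemotePartitionMetadataStore()); // no-arg constructor assumed
TopicBasedRemoteLogMetadataManager rlmm =
        new TopicBasedRemoteLogMetadataManager(true, () -> spyStore);
rlmm.configure(configs); // background initialization starts here, already holding spyStore
```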

@kamalcph kamalcph (Collaborator) left a comment

LGTM, thanks for fixing the flaky test!

@chia7712 chia7712 (Contributor) left a comment

@gaurav-narula nice fix!

@@ -89,14 +90,16 @@ public class TopicBasedRemoteLogMetadataManager implements RemoteLogMetadataManager
      private volatile RemoteLogMetadataTopicPartitioner rlmTopicPartitioner;
      private final Set<TopicIdPartition> pendingAssignPartitions = Collections.synchronizedSet(new HashSet<>());
      private volatile boolean initializationFailed;
+     private final Supplier<RemotePartitionMetadataStore> remoteLogMetadataManagerSupplier;

      public TopicBasedRemoteLogMetadataManager() {
chia7712 (Contributor):

Could you add a comment to say that this default constructor is required because we create RemoteLogMetadataManager dynamically?

gaurav-narula (Contributor Author):

Sure, addressed both in a8ba568
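For illustration, the requested comment on the default constructor might look roughly like this (the wording and the delegation shown are assumptions, not the committed change in a8ba568):

```java
/**
 * The public no-arg constructor is required because RemoteLogMetadataManager
 * implementations are instantiated dynamically, by class name, from the
 * broker configuration.
 */
public TopicBasedRemoteLogMetadataManager() {
    this(true, RemotePartitionMetadataStore::new); // delegation shown for illustration
}
```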

  }

  // Visible for testing.
- public TopicBasedRemoteLogMetadataManager(boolean startConsumerThread) {
+ public TopicBasedRemoteLogMetadataManager(boolean startConsumerThread, Supplier<RemotePartitionMetadataStore> remoteLogMetadataManagerSupplier) {
chia7712 (Contributor):

It seems package-private is enough for testing, right?

@chia7712 chia7712 (Contributor) left a comment

LGTM

I have re-triggered QA. Will merge it if there are no objections.

@satishd satishd (Member) left a comment

Thanks @gaurav-narula for addressing the review comments, this approach LGTM.

Some tests in TopicBasedRemoteLogMetadataManagerTest flake because
`waitUntilConsumerCatchesUp` may break early, before the consumer manager has
caught up with all the events.

This change allows passing a spy object for `RemotePartitionMetadataStore`
down to `ConsumerTask`, which allows the test code to ensure the methods
on it were invoked an appropriate number of times before performing
assertions.

Refer to the
[Gradle Enterprise Report](https://ge.apache.org/scans/tests?search.timeZoneId=Europe%2FLondon&tests.container=org.apache.kafka.server.log.remote.metadata.storage.TopicBasedRemoteLogMetadataManagerTest)
for more information on the flakiness.
@gaurav-narula (Contributor Author)

@chia7712 looks like we still suffer from thread leaks in CI :( I've rebased from trunk to trigger CI again

@chia7712 (Contributor)

> looks like we still suffer from thread leaks in CI :( I've rebased from trunk to trigger CI again

I have noticed that too. so sad :(

@chia7712 chia7712 merged commit eb5559a into apache:trunk May 15, 2024
1 check failed
TaiJuWu pushed a commit to TaiJuWu/kafka that referenced this pull request Jun 8, 2024
…erTest (apache#15885)

Some tests in TopicBasedRemoteLogMetadataManagerTest flake because waitUntilConsumerCatchesUp may break early, before the consumer manager has caught up with all the events.

This PR adds expected offsets for the leader/follower metadata partitions and ensures we wait until the offset is at least equal to the argument, to avoid flakiness.

Reviewers: Satish Duggana <satishd@apache.org>, Kamal Chandraprakash <kamal.chandraprakash@gmail.com>, Chia-Ping Tsai <chia7712@gmail.com>
gongxuanzhang pushed a commit to gongxuanzhang/kafka that referenced this pull request Jun 12, 2024
…erTest (apache#15885)