Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KAFKA-13585; Fix flaky test ReplicaManagerTest.testReplicaAlterLogDirsWithAndWithoutIds #11665

Merged
merged 3 commits into from
Jan 26, 2022

Conversation

dajac
Copy link
Contributor

@dajac dajac commented Jan 11, 2022

ReplicaManagerTest.testReplicaAlterLogDirsWithAndWithoutIds fails regularly because assertFetcherHasTopicId can't validate the topic id in the fetch state. The issue is quite subtile. Under the hood, assertFetcherHasTopicId acquires the partitionMapLock in the fetcher thread. partitionMapLock is also acquired by the the processFetchRequest method. If processFetchRequest acquires it before assertFetcherHasTopicId can check the fetch state, assertFetcherHasTopicId has not chance to verify the state anymore because processFetchRequest will remove the fetch state before releasing the lock.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@jolshan
Copy link
Contributor

jolshan commented Jan 11, 2022

Is this test useful if we can't verify the topic ID? I think we have a similar one that doesn't check topic ID?

@dajac
Copy link
Contributor Author

dajac commented Jan 11, 2022

Is this test useful if we can't verify the topic ID? I think we have a similar one that doesn't check topic ID?

Do we? I haven't found it...

@jolshan
Copy link
Contributor

jolshan commented Jan 11, 2022

Strange -- I might be misremembering because I can't seem to find it either.

Copy link
Contributor

@jolshan jolshan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wish there was a way to verify the topic ID, but better to not be flaky for now.

@dajac
Copy link
Contributor Author

dajac commented Jan 17, 2022

@jolshan I have found another way to fix the issue and that one keeps asserting the topic ID. Let me know what you think.

// requests issued by the ReplicaAlterLogDirsThread.
val replicaManager = setupReplicaManagerWithMockedPurgatories(
new MockTimer(time),
propsModifier = props => props.put(KafkaConfig.ReplicaFetchMaxBytesProp, "1")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I understand the issue. Was the problem that the copy was completing before the failing assertion? Is this intended to slow down the fetcher?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue is quite subtile. The intent here is to ensures that the fetcher uses multiple fetch requests to replicate the data to the new log. The issue is that both assertFetcherHasTopicId and processFetchRequest compete for the partitionMapLock lock. If assertFetcherHasTopicId gets it first, the test pass. If processFetchRequest gets it first, the test fail because the fetch state is removed while holding the lock. By using multiple fetch requests, assertFetcherHasTopicId has a chance to verify the state before the copy is completed.

// Make sure the future log is created.
replicaManager.futureLocalLogOrException(topicPartition)
assertEquals(1, replicaManager.replicaAlterLogDirsManager.fetcherThreadMap.size)

// Verify that the fetcher has the correct topic id.
assertFetcherHasTopicId(replicaManager.replicaAlterLogDirsManager, partition.topicPartition, topicIdOpt)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a slightly strange validation at this level. I guess we're verifying the propagation of the topicId to the log dir fetcher? It seems like the main thing we'd want to verify is that the new log gets created with the right topicId.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's right. The goal is to verify that the topicid is correctly propagated to the fetcher state. Now that I think about it a bit more, we might be able to achieve the same by mocking the ReplicaAlterLogDirsManager. Let me give it a try.

Copy link
Contributor

@hachikuji hachikuji left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, LGTM

@dajac dajac merged commit 5b12d8a into apache:trunk Jan 26, 2022
@dajac dajac deleted the KAFKA-13585 branch January 26, 2022 07:14
dajac added a commit that referenced this pull request Jan 26, 2022
…rsWithAndWithoutIds` (#11665)

The issue was quite subtile. It was due to a race for the `partitionMapLock` lock. `assertFetcherHasTopicId` would only succeed if it can acquire it before `processFetchRequest`. This PR refactors the test in order to make it more stable.

Reviewers: Justine Olshan <jolshan@confluent.io>, Jason Gustafson <jason@confluent.io>,
jeffkbkim pushed a commit to confluentinc/kafka that referenced this pull request May 12, 2022
…rsWithAndWithoutIds` (apache#11665)

The issue was quite subtile. It was due to a race for the `partitionMapLock` lock. `assertFetcherHasTopicId` would only succeed if it can acquire it before `processFetchRequest`. This PR refactors the test in order to make it more stable.

Reviewers: Justine Olshan <jolshan@confluent.io>, Jason Gustafson <jason@confluent.io>,
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants