Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KAFKA-15481: Fix concurrency bug in RemoteIndexCache #14483
KAFKA-15481: Fix concurrency bug in RemoteIndexCache #14483
Changes from 1 commit
84fff3e
79d0e8a
0904561
6a3bc42
b338b45
48a82eb
8ea3985
9211e6b
cacf9f1
07ca66b
419f596
851593d
6b4fae6
c35c3b2
882864d
71fcad5
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jeel2420 Why we are invoking this again
invocation.callRealMethod()
It is already called in
712 line when(spyEntry).markForCleanup()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
invocation.callRealMethod() is called to call the markForCleanup() after read is called and before we start asserting to make sure indexes get renamed before we assert the results.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jeel2420
markForCleanUp
should be called only one time. We need to test the behaviour when there are concurrent read/remove happens on the cache for the same entry.In the test we just need to assert the way @showuon suggested
// So, maybe we verify with this:
if (Files.exists(entry.offsetIndex().file().toPath)) {
assertCacheSize(1)
} else {
assertCacheSize(0)
}
Calling '
markForCleanUp
' twice will always result in cacheSize 0 eventually.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@iit2009060 Read
markForCleanUp()
is not getting called twice. Please see the mock, inside that during the first execution ofmarkForCleanUp()
I am calling the actual markForCleanup() function (i.e Index are getting renamed) but for subsequent calls, mock is doing nothing so we actual markForCleanup() function to rename the indexes is getting called once only and it is as expected.I have verified this behaviour as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jeel2420 During read
MarkforCleanUp
should not be called not even once as per the functionality.Then why we need to call it explicitly again here
invocation.callRealMethod()
.I am seeing two invocation of
markForCleanUp
}).when(spyEntry).markForCleanup()
712 line noinvocation.callRealMethod()
708 line noThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, but they are "different"
markForCleanUp
.For (1), the
markForCleanUp
is an injected method for controlling the invoking order. So there are latches wait/countdown.For (2), it's the real
markForCleanUp
method to rename the cache files.The goal is to simulate the race condition happened in KAFKA-15481.
Yes, I think so. But in some cases, there could be 1 if getEntry goes after. The thread management are all decided by OS, we can't assure which one will go first, right?
I think the goal of this test is to make sure the issue in KAFKA-15481 will not happen again. That's why I added this comment.
I'm not following you here. What we're doing in this test is to
read and remove concurrently in the separate thread
. Aboutvalidate the cacheSize based on the order of execution
, since we can't assure which thread will be executed first, we can't do this, right? If we can decide the execution order, then it means they are not running concurrently, is that right?Maybe you can show us if it were you, what test you'll create. Some pseudo code are enough. Thank you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@showuon
For (1), the markForCleanUp is an injected method for controlling the invoking order. So there are latches wait/countdown.
Do you mean this is a mock method and no rename would happen in this case ?
Effectively the functionality/logic of
markCleanUp
is called one time only ?I was thinking something like this
In the test case mentioned in the jira KAFKA-15481
the execution order is remove,read and the overall result is cache size 0 which is wrong because of timegap between removal and renaming the files. Here we are validating the same with rsm call count. If they are atomic rsm execution should happen and files should be restored.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct.
Yes, they are basically similar with what we have now. By injecting mock implementation for
markForCleanUp
is just to make the 2 thread execution more close. In the end, what we have now is to invoke "realMethod", which is what you did above. I'm fine if you think we should validate the rsm call count. But again, they are basically testing the same thing.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@showuon Yes correct , it is testing the same thing. I am also fine. But from readability perspective the one I propose is simpler to understand and does not require any future change if markForCleanUp function changed. I left @jeel2420 to make a decision here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@iit2009060 As current test case is able to reproduce the case mentioned in the jira KAFKA-15481 I think we should be fine.
The only reason to have
markForCleanUp
mock is to have control over the 2 thread execution.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jeel2420 , sorry, I had another look and found we should also verify these 2 threads has no exception thrown. In the issue description, without this fix, there will be IOException thrown. So, we should verify there's no exception using the returned
future
fromexecutor.submit
. Thanks.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@showuon Nice catch. Calling .get() on both task future objects so if there will be any error test will fail with that exception.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please help me understand these lines. My understanding is:
In L598, we want to verify the
cache.getIndexEntry
will be run aftercache.remove
completion. So, that's where the cacheSize 1 comes frome.If so, then, In L599, why should we
getIndexEntry
again before asserting L600?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we want to prove that the entry is present in the cache so when we call getIndexEntry we should get the entry and index files should be present in the cacheDir.
This reproduce the actual issue by replicating the scenario where we can have inconsistency where key is present in the cache but reference files are already renamed by that time.