Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KAFKA-15169: Added TestCase in RemoteIndexCache #14482

Merged
merged 10 commits into from Oct 11, 2023

Conversation

iit2009060
Copy link
Contributor

@iit2009060 iit2009060 commented Oct 4, 2023

Test Cases Covered

  1. Index Files already exist on disk but not in Cache i.e. RemoteIndexCache should not call remoteStorageManager to fetch it instead cache it from the local index file present.
  2. RSM returns CorruptedIndex File i.e. RemoteIndexCache should throw CorruptedIndexException instead of successfull execution.
  3. Deleted Suffix Indexes file already present on disk i.e. If cleaner thread is slow , then there is a chance of deleted index files present on the disk while in parallel same index Entry is invalidated. To understand more refer https://issues.apache.org/jira/browse/KAFKA-15169

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@divijvaidya divijvaidya added the tiered-storage Pull requests associated with KIP-405 (Tiered Storage) label Oct 4, 2023
Comment on lines 545 to 546
def testOffsetIndexFileAlreadyExistOnDiskButNotInCache(): Unit ={
val remoteIndexCacheDir = new File(tpDir,RemoteIndexCache.DIR_NAME)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please follow style guide used in rest of the code. For example, space between = and {.
You can find more information here: https://kafka.apache.org/coding-guide.html

I remember your previous PR also had similar comments. I would recommend to set your IDE to use this style and automatically format newly added code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks , I have enabled on IDE to format it on save everytime.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 545 to 546
def testOffsetIndexFileAlreadyExistOnDiskButNotInCache(): Unit ={
val remoteIndexCacheDir = new File(tpDir,RemoteIndexCache.DIR_NAME)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please create a getter in the RemoteIndexCache to get this dir, add javadoc for it // visible for testing and use that here. It will make this test independent of the remote cache location in case we change that in future,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Files.walk(remoteIndexCacheDir.toPath)
.filter(Files.isRegularFile(_))
.filter(path => path.getFileName.toString.endsWith(suffix))
.forEach(f => Files.move(f,f.resolveSibling(f.getFileName().toString().stripSuffix(tempSuffix))))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of files.move, please use our custom utility function Utils.atomicMoveWithFallback

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

// rsm should not be called again
// instead files exist on disk
// should be used
verifyFetchIndexInvocation(count = 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be 0 if there is going to be no fetch invocation?

Copy link
Contributor Author

@iit2009060 iit2009060 Oct 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@divijvaidya To pre-create the valid index file in remote storage cache dir , I initially fetched it from the remote storage and then run the scenario. The 1 count is for pre-processing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation. So that will confuse readers. Please do either:

  1. reset the mock rsm
    or
  2. run verifyFetchIndexInvocation(count = 1) again before L672, so that we can make sure the count is not increasing even we call cache.getIndexEntry again.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@iit2009060 , have you addressed this comment?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @showuon Thanks for identifying it , I missed it.
Changes Done.

// The size of the string written in the file is 12 bytes,
// but it should be multiple of Offset Index EntrySIZE which is equal to 8.
pw.close()
val offsetIdx = createOffsetIndexForSegmentMetadata(metadata)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is a bug in existing createOffsetIndexForSegmentMetadata and similar function. They create cache at the wrong place in the line:
new OffsetIndex(remoteOffsetIndexFile(tpDir, metadata)
Instead of tpDir, it should point to remote cache director which is a folder inside tpDir.

Please fix that and see if it impacts your test in any way.

Copy link
Contributor Author

@iit2009060 iit2009060 Oct 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@divijvaidya But when we call get index entry , it will create indexes file under remote cache directory only. It will read file content from tpDir and create indexes in "remoteIndexCache" dir.
IMO "createOffsetIndexForSegmentMetadata" will create indexes file and tpDir act as a placeholder for files fetched from remoteStorage. If we change it to remote cache dir , all test cases will always have the file exist behaviour.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@divijvaidya We can keep this function i.e createOffsetIndexForSegmentMetadata to be used for indexes received from remote storage rather than remoteindexcache.

The existing test case are working because they directly put values into internalCache(Caeffine) rather than the getIndexEntry route.

In the above test case I want to corrupt the Index files received from remote storage rather than from remote index cache.
getIndexEntry does following steps

  1. Fetch indexes from remote Storage( i.e indexes stored in tpDir directory as per the mock)
  2. Manually Corrupt the indexes file stored in tpDir Directory after fetch from remote storage.
  3. It tries to copy indexes files in tpDir directory to remote index cache directory.
  4. Run sanity check on indexes stored in tpDir/remoteindexcache directory.
  5. It throws CorruptedIndexException.

// The size of the string written in the file is 12 bytes,
// but it should be multiple of Offset Index EntrySIZE which is equal to 8.
pw.close()
val offsetIdx = createOffsetIndexForSegmentMetadata(metadata)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we have 3 variants of this test where we are corruption all 3 indexes one by one. You can use @Parameterized to run those scenarios in same test.

Copy link
Contributor Author

@iit2009060 iit2009060 Oct 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@divijvaidya Is it possible to parametrize based on function as a parameter ? Otherwise I need to write if else condition in the same test ?
Corrupting Time and Offset Index is easy based on entry_size , any suggestion how to do it for TransactionIndex ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@iit2009060 , if / else is fine with me, just make the test clear.

For txnIndex corruption, I usually check other tests if they did something similar as we want. Here's a good reference:

def testSanityCheck(): Unit = {
val abortedTxns = List(
new AbortedTxn(0L, 0, 10, 11),
new AbortedTxn(1L, 5, 15, 13),
new AbortedTxn(2L, 18, 35, 25),
new AbortedTxn(3L, 32, 50, 40))
abortedTxns.foreach(index.append)
index.close()
// open the index with a different starting offset to fake invalid data
val reopenedIndex = new TransactionIndex(100L, file)
assertThrows(classOf[CorruptIndexException], () => reopenedIndex.sanityCheck())

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// wait until entry is marked for deletion
TestUtils.waitUntilTrue(() => entry.isMarkedForCleanup,
"Failed to mark cache entry for cleanup after invalidation")
TestUtils.waitUntilTrue(() => entry.isCleanStarted,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clean started doesn't mean that index has been deleted. There could be a case where next line executes even though file has not been deleted yet. I would suggest to wrap the next assertions in waitUntilTrue(). It would make the test non-flaky.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Contributor

@showuon showuon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@iit2009060 , thanks for the PR. Left some comments.


// restore index files
renameRemoteCacheIndexFileFromDisk(tempSuffix)
// validate cache entry for the above key should be null
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: additional space before null.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

// The size of the string written in the file is 12 bytes,
// but it should be multiple of Offset Index EntrySIZE which is equal to 8.
pw.close()
val offsetIdx = createOffsetIndexForSegmentMetadata(metadata)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@iit2009060 , if / else is fine with me, just make the test clear.

For txnIndex corruption, I usually check other tests if they did something similar as we want. Here's a good reference:

def testSanityCheck(): Unit = {
val abortedTxns = List(
new AbortedTxn(0L, 0, 10, 11),
new AbortedTxn(1L, 5, 15, 13),
new AbortedTxn(2L, 18, 35, 25),
new AbortedTxn(3L, 32, 50, 40))
abortedTxns.foreach(index.append)
index.close()
// open the index with a different starting offset to fake invalid data
val reopenedIndex = new TransactionIndex(100L, file)
assertThrows(classOf[CorruptIndexException], () => reopenedIndex.sanityCheck())

Comment on lines 673 to 676
// Index Files already exist
// rsm should not be called again
// instead files exist on disk
// should be used
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make these 4 lines into 2.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

})
cache.getIndexEntry(rlsMetadata)
// No exception should occur
// Offset rsm 0,TimeIndex 1 , TxnIndex 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is offset rsm? OffsetIndex?

Copy link
Contributor Author

@iit2009060 iit2009060 Oct 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the description. Yes it was OffsetIndex

Comment on lines 595 to 624
// However cache fetch failed
// but it has already created offset and time index file in remote cache dir.
// Current status
// ( cache is null)
// RemoteCacheDir contain
// 1. Offset Index File which is fine and not corrupted
// 2. Time Index File which is corrupted
// What should be the code flow in next execution
// 1. No rsm call for fetching OffSet Index File.
// 2. Time index file should be fetched from remote storage again as it is corrupted in the first execution.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove addition space between words.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 600 to 601
// 1. Offset Index File which is fine and not corrupted
// 2. Time Index File which is corrupted
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The word which can be removed:
// 1. Offset Index File is fine and not corrupted
// 2. Time Index File is corrupted

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@divijvaidya
Copy link
Contributor

@showuon @iit2009060 please don't wait for my review on this one. I might not get to it until next week.

@iit2009060
Copy link
Contributor Author

iit2009060 commented Oct 6, 2023

@showuon @iit2009060 please don't wait for my review on this one. I might not get to it until next week.

@showuon Can you review @divijvaidya comments. I have taken care of all the reviews.

Copy link
Contributor

@showuon showuon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@iit2009060 , overall LGTM, left some refactor comments.

@ParameterizedTest
@EnumSource(value = classOf[IndexType], names = Array("OFFSET", "TIMESTAMP", "TRANSACTION"))
def testCorruptCacheIndexFileExistsButNotInCache(indexType: IndexType): Unit = {
// create Corrupt Index File in remote index cache
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// create Corrupted Index File in remote index cache

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

format is wrong:

    if (indexType == IndexType.OFFSET) {
      createCorruptOffsetIndexFile(cache.cacheDir())
    } else if (indexType == IndexType.TIMESTAMP) {
      createCorruptTimeIndexOffsetFile(cache.cacheDir())
    } else if (indexType == IndexType.TRANSACTION) {
      createCorruptTxnIndexForSegmentMetadata(cache.cacheDir(), rlsMetadata)
    }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

// assert that parent directory for the index files is correct
assertEquals(RemoteIndexCache.DIR_NAME, offsetIndexFile.getParent.getFileName.toString,
s"offsetIndex=$offsetIndexFile is not overwrite under incorrect parent")
s"offsetIndex=entry.offsetIndex().file().toPath is created under incorrect parent")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be s"offsetIndex=$offsetIndexFile is created under incorrect parent") ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

val timeIdx = createTimeIndexForSegmentMetadata(metadata)
val txnIdx = createTxIndexForSegmentMetadata(metadata)
maybeAppendIndexEntries(offsetIdx, timeIdx)
// Create corrupt index file which would be returned by rsm
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// Create corrupt index file which would be returned by rsm
->
// Create corrupted index file

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 595 to 596
// However cache fetch failed
// but it has already created offset and time index file in remote cache dir.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These 2 lines can be removed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 628 to 629
// No exception should occur
// rsm should not be called for offset Index
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// No exception should occur <- should be removed
// rsm should not be called for offset Index -> // rsm should not be called to fetch offset Index

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 673 to 674
// Index Files already exist ,rsm should not be called again
// instead files exist on disk should be used
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// Index Files already exist, rsm should not fetch them again.

and remove L654

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

// rsm should not be called again
// instead files exist on disk
// should be used
verifyFetchIndexInvocation(count = 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation. So that will confuse readers. Please do either:

  1. reset the mock rsm
    or
  2. run verifyFetchIndexInvocation(count = 1) again before L672, so that we can make sure the count is not increasing even we call cache.getIndexEntry again.

Comment on lines 694 to 702
if (testIndexType == IndexType.OFFSET) {
createCorruptOffsetIndexFile(tpDir)
}
else if (testIndexType == IndexType.TIMESTAMP) {
createCorruptTimeIndexOffsetFile(tpDir)
}
else if (testIndexType == IndexType.TRANSACTION) {
txnIdx = createCorruptTxnIndexForSegmentMetadata(tpDir, rlsMetadata)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

format. Thanks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And since these logic repeat again, we can extract them into another helper method. Ex:

private void createCorruptedIndexFile(IndexType type) {
          if (testIndexType == IndexType.OFFSET) {
          createCorruptOffsetIndexFile(tpDir)
.....
}

Then, in the test, we can just call createCorruptedIndexFile(testIndexType) to do it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@iit2009060
Copy link
Contributor Author

@iit2009060 , overall LGTM, left some refactor comments.

@showuon done as mentioned.

Copy link
Contributor

@showuon showuon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Just a minor formatting comment.

@EnumSource(value = classOf[IndexType], names = Array("OFFSET", "TIMESTAMP", "TRANSACTION"))
def testCorruptCacheIndexFileExistsButNotInCache(indexType: IndexType): Unit = {
// create Corrupted Index File in remote index cache
createCorruptedIndexFile(indexType,cache.cacheDir())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: space after comma (indexType, cache.cacheDir()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the quick fix. Let's wait for the CI build completes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator

@kamalcph kamalcph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we are adding new tests, can we create a new RemoteIndexCacheTest under storage module and start writing the tests in java?

The source code is in Java but the test is in scala. We can move the existing tests in a separate PR.

@@ -166,6 +166,11 @@ public Cache<Uuid, Entry> internalCache() {
return internalCache;
}

// Visible for testing
public File cacheDir() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we file a ticket to move the RemoteIndexCacheTest to storage module under the same package?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kamalcph I have just started understanding of the kafka ecosystem , Can you help me understand the rational behind it , Then i can create a ticket with details.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can find more details about the intention to move the source code to Java in KAFKA-14524

@iit2009060
Copy link
Contributor Author

Since we are adding new tests, can we create a new RemoteIndexCacheTest under storage module and start writing the tests in java?

The source code is in Java but the test is in scala. We can move the existing tests in a separate PR.

@kamalcph Can i create a ticket for same and work on it after PR merge. This has been pending from last week and already approved.

Since we are adding new tests, can we create a new RemoteIndexCacheTest under storage module and start writing the tests in java?

The source code is in Java but the test is in scala. We can move the existing tests in a separate PR.

@iit2009060
Copy link
Contributor Author

@showuon @kamalcph @divijvaidya I have rebased changes and resolved conflict on the changes merged by
the PR #14381.
Can you please review and merge it.

Copy link
Collaborator

@kamalcph kamalcph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes in the patch, LGTM. We can rewrite it to Java in a separate patch.

@EnumSource(value = classOf[IndexType], names = Array("OFFSET", "TIMESTAMP", "TRANSACTION"))
def testCorruptCacheIndexFileExistsButNotInCache(indexType: IndexType): Unit = {
// create Corrupted Index File in remote index cache
createCorruptedIndexFile(indexType, cache.cacheDir())
val entry = cache.getIndexEntry(rlsMetadata)
// Test would fail if it throws corrupt Exception
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we update this comment to?

Test would fail it it throws exception other than CorruptIndexException

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

// rsm should not be called to fetch offset Index
verifyFetchIndexInvocation(0, Seq(IndexType.OFFSET))
verifyFetchIndexInvocation(1, Seq(IndexType.TIMESTAMP))
verifyFetchIndexInvocation(1, Seq(IndexType.TRANSACTION))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Txn index file was not corrupted. Can we add a comment to explain why are we fetching it again?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@showuon
Copy link
Contributor

showuon commented Oct 9, 2023

@iit2009060
Copy link
Contributor Author

@iit2009060 , there are compilation failure, could you fix it? https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-14482/10/pipeline/

on it.

@iit2009060
Copy link
Contributor Author

iit2009060 commented Oct 9, 2023

@hudeqi @showuon There is one test case failing related to the above change #14381, can you help me resolving it , in the local it is not failing ?
Build / JDK 8 and Scala 2.12 / testClearCacheAndIndexFilesWhenResizeCache() – kafka.log.remote.RemoteIndexCacheTest
java.io.UncheckedIOException: java.nio.file.NoSuchFileException: /tmp/kafka-RemoteIndexCacheTest5502978390891743756/mwbAzcjOTtKRIDxlQrgTTw:foo-0/remote-log-index-cache/0_xg9JTaPnS4C47HfCgv2_gA.txnindex.deleted
https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-14482/12/tests/

@hudeqi
Copy link
Collaborator

hudeqi commented Oct 9, 2023

@hudeqi @showuon There is one test case failing related to the above change #14381, can you help me resolving it , in the local it is not failing ? Build / JDK 8 and Scala 2.12 / testClearCacheAndIndexFilesWhenResizeCache() – kafka.log.remote.RemoteIndexCacheTest java.io.UncheckedIOException: java.nio.file.NoSuchFileException: /tmp/kafka-RemoteIndexCacheTest5502978390891743756/mwbAzcjOTtKRIDxlQrgTTw:foo-0/remote-log-index-cache/0_xg9JTaPnS4C47HfCgv2_gA.txnindex.deleted https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-14482/12/tests/

hi, I think this may be the reason: when executing getIndexFileFromRemoteCacheDir, the .deleted file will be listed in the filter(Files.isRegularFile(_)) stage, but when executing path.getFileName.toString, the .deleted file has already been deleted because cleanStarted is already true to start cleaning.
If possible, can I fix this issue in your branch? @iit2009060

@iit2009060
Copy link
Contributor Author

iit2009060 commented Oct 9, 2023

@hudeqi @showuon There is one test case failing related to the above change #14381, can you help me resolving it , in the local it is not failing ? Build / JDK 8 and Scala 2.12 / testClearCacheAndIndexFilesWhenResizeCache() – kafka.log.remote.RemoteIndexCacheTest java.io.UncheckedIOException: java.nio.file.NoSuchFileException: /tmp/kafka-RemoteIndexCacheTest5502978390891743756/mwbAzcjOTtKRIDxlQrgTTw:foo-0/remote-log-index-cache/0_xg9JTaPnS4C47HfCgv2_gA.txnindex.deleted https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-14482/12/tests/

hi, I think this may be the reason: when executing getIndexFileFromRemoteCacheDir, the .deleted file will be listed in the filter(Files.isRegularFile(_)) stage, but when executing path.getFileName.toString, the .deleted file has already been deleted because cleanStarted is already true to start cleaning. If possible, can I fix this issue in your branch? @iit2009060

Yes sure @hudeqi . This seems to be a case of flaky behaviour i.e. race condition ?

@hudeqi
Copy link
Collaborator

hudeqi commented Oct 9, 2023

@hudeqi @showuon There is one test case failing related to the above change #14381, can you help me resolving it , in the local it is not failing ? Build / JDK 8 and Scala 2.12 / testClearCacheAndIndexFilesWhenResizeCache() – kafka.log.remote.RemoteIndexCacheTest java.io.UncheckedIOException: java.nio.file.NoSuchFileException: /tmp/kafka-RemoteIndexCacheTest5502978390891743756/mwbAzcjOTtKRIDxlQrgTTw:foo-0/remote-log-index-cache/0_xg9JTaPnS4C47HfCgv2_gA.txnindex.deleted https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-14482/12/tests/

hi, I think this may be the reason: when executing getIndexFileFromRemoteCacheDir, the .deleted file will be listed in the filter(Files.isRegularFile(_)) stage, but when executing path.getFileName.toString, the .deleted file has already been deleted because cleanStarted is already true to start cleaning. If possible, can I fix this issue in your branch? @iit2009060

Yes sure @hudeqi . This seems to be a case of flaky behaviour i.e. race condition ?

I think so, please cherry-pick this commit: hudeqi@854c773, thanks. @iit2009060

@iit2009060
Copy link
Contributor Author

iit2009060 commented Oct 9, 2023

@hudeqi @showuon There is one test case failing related to the above change #14381, can you help me resolving it , in the local it is not failing ? Build / JDK 8 and Scala 2.12 / testClearCacheAndIndexFilesWhenResizeCache() – kafka.log.remote.RemoteIndexCacheTest java.io.UncheckedIOException: java.nio.file.NoSuchFileException: /tmp/kafka-RemoteIndexCacheTest5502978390891743756/mwbAzcjOTtKRIDxlQrgTTw:foo-0/remote-log-index-cache/0_xg9JTaPnS4C47HfCgv2_gA.txnindex.deleted https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-14482/12/tests/

hi, I think this may be the reason: when executing getIndexFileFromRemoteCacheDir, the .deleted file will be listed in the filter(Files.isRegularFile(_)) stage, but when executing path.getFileName.toString, the .deleted file has already been deleted because cleanStarted is already true to start cleaning. If possible, can I fix this issue in your branch? @iit2009060

Yes sure @hudeqi . This seems to be a case of flaky behaviour i.e. race condition ?

I think so, please cherry-pick this commit: hudeqi@854c773, thanks. @iit2009060

@hudeqi This is given me bad object/revision . Can you have the command handy to do cherry pick from another fork ?
git cherry-pick 854c773
fatal: bad object 854c773
This seems to be existing in your fork repository
I figured out through this link. Will update soon the PR with changes.

@iit2009060
Copy link
Contributor Author

iit2009060 commented Oct 9, 2023

@hudeqi @showuon There is one test case failing related to the above change #14381, can you help me resolving it , in the local it is not failing ? Build / JDK 8 and Scala 2.12 / testClearCacheAndIndexFilesWhenResizeCache() – kafka.log.remote.RemoteIndexCacheTest java.io.UncheckedIOException: java.nio.file.NoSuchFileException: /tmp/kafka-RemoteIndexCacheTest5502978390891743756/mwbAzcjOTtKRIDxlQrgTTw:foo-0/remote-log-index-cache/0_xg9JTaPnS4C47HfCgv2_gA.txnindex.deleted https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-14482/12/tests/

hi, I think this may be the reason: when executing getIndexFileFromRemoteCacheDir, the .deleted file will be listed in the filter(Files.isRegularFile(_)) stage, but when executing path.getFileName.toString, the .deleted file has already been deleted because cleanStarted is already true to start cleaning. If possible, can I fix this issue in your branch? @iit2009060

Yes sure @hudeqi . This seems to be a case of flaky behaviour i.e. race condition ?

I think so, please cherry-pick this commit: hudeqi@854c773, thanks. @iit2009060

@hudeqi This is given me bad object/revision . Can you have the command handy to do cherry pick from another fork ? git cherry-pick 854c773 fatal: bad object 854c773 This seems to be existing in your fork repository I figured out through this link. Will update soon the PR with changes.

@hudeqi @showuon Please review it

@iit2009060
Copy link
Contributor Author

@iit2009060
Copy link
Contributor Author

@showuon Can we merge this changes ?

Comment on lines 548 to 552
TestUtils.waitUntilTrue(() => cacheEntry.isCleanFinished,
"Failed to finish cleanup cache entry after resizing cache.")

// verify no index files on remote cache dir
TestUtils.waitUntilTrue(() => !getIndexFileFromRemoteCacheDir(LogFileUtils.INDEX_FILE_SUFFIX).isPresent,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@iit2009060 @hudeqi , adding an isCleanFinished flag just for test is not a good solution. In this case, could we catch the exception in the getIndexFileFromRemoteCacheDir?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hudeqi Can you resolve the above comment ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@showuon I have removed the isCleanFinished flag changes and catch the Exception in the method.
cc @hudeqi
Please review it .

Copy link
Contributor

@showuon showuon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for the update.

@showuon
Copy link
Contributor

showuon commented Oct 11, 2023

Failed tests are unrelated and also failed in trunk build.

@showuon showuon merged commit 99ce2e0 into apache:trunk Oct 11, 2023
1 check failed
divijvaidya pushed a commit that referenced this pull request Nov 16, 2023
est Cases Covered

    1. Index Files already exist on disk but not in Cache i.e. RemoteIndexCache should not call remoteStorageManager to fetch it instead cache it from the local index file present.
    2. RSM returns CorruptedIndex File i.e. RemoteIndexCache should throw CorruptedIndexException instead of successfull execution.
    3. Deleted Suffix Indexes file already present on disk i.e. If cleaner thread is slow , then there is a chance of deleted index files present on the disk while in parallel same index Entry is invalidated. To understand more refer https://issues.apache.org/jira/browse/KAFKA-15169

Reviewers: Divij Vaidya <diviv@amazon.com>, Luke Chen <showuon@gmail.com>, Kamal Chandraprakash<kamal.chandraprakash@gmail.com>
mjsax pushed a commit to confluentinc/kafka that referenced this pull request Nov 22, 2023
est Cases Covered

    1. Index Files already exist on disk but not in Cache i.e. RemoteIndexCache should not call remoteStorageManager to fetch it instead cache it from the local index file present.
    2. RSM returns CorruptedIndex File i.e. RemoteIndexCache should throw CorruptedIndexException instead of successfull execution.
    3. Deleted Suffix Indexes file already present on disk i.e. If cleaner thread is slow , then there is a chance of deleted index files present on the disk while in parallel same index Entry is invalidated. To understand more refer https://issues.apache.org/jira/browse/KAFKA-15169

Reviewers: Divij Vaidya <diviv@amazon.com>, Luke Chen <showuon@gmail.com>, Kamal Chandraprakash<kamal.chandraprakash@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tiered-storage Pull requests associated with KIP-405 (Tiered Storage)
Projects
None yet
5 participants