GEODE-9881: Oplog not compacted after recovery #7193
Conversation
Force-pushed from ec550cc to e2cf136
 * Verifies that compaction works as expected after region is recovered
 **/
@Test
public void testThatCompactionWorksAfterRegionIsClosedAndThenRecovered()
This test passes even with the code changes in this PR reverted. Would it be possible to add a test that fails with the previous behaviour but passes with the new behaviour?
Thank you for the review! I executed the mentioned test case multiple times with the changes reverted, and it failed every time: the first five Oplogs were never compacted, as expected. It is possible that this test case is flaky in some way, but I cannot reproduce that or see anything suspicious. Could you please send me the logs from a test run where it passes after you revert the changes?
I must have screwed up reverting the changes to Oplog.java when first testing this, because when I check now, the test fails as expected.
/**
 * Verifies that automatic compaction works after cache recovered from oplogs
 */
public class DiskRegionCompactorClearOplogAfterRecoveryJUnitTest {
Per the guidelines in the Geode Wiki, integration test classes should have names ending with "IntegrationTest" rather than "JUnitTest." Could this class be renamed to reflect that please?
kirklund left a comment
I second Donal's request for a test that fails without the change. It would be best to make it a lower-level unit test, but an integration test is OK. Assuming that, I'll approve the PR now. Thanks!
await()
    .untilAsserted(() -> assertTrue(ma.getUsedMemory() > 0));
Please use AssertJ so it prints out a more detailed failure message:

await().untilAsserted(() -> {
  assertThat(ma.getUsedMemory()).isGreaterThan(0);
});
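For reference, the suggested form relies on these static imports (an assumption, since the test's import list isn't shown in the diff):

import static org.assertj.core.api.Assertions.assertThat;
// Geode tests typically route await() through the GeodeAwaitility wrapper,
// which preconfigures the default timeout; plain Awaitility also works:
import static org.apache.geode.test.awaitility.GeodeAwaitility.await;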
Thank you for the review. I have addressed all of your comments.
config.setProperty(MCAST_PORT, "0");
config.setProperty(LOCATORS, "");
These are the default values so specifying them is extraneous. You don't need them.
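In other words, the setup can shrink to an empty Properties object. A minimal sketch, assuming the test builds its cache from these properties:

import java.util.Properties;
import org.apache.geode.cache.Cache;
import org.apache.geode.cache.CacheFactory;

// mcast-port defaults to 0 and locators defaults to "", so an empty
// Properties object already produces a standalone (loner) member:
Properties config = new Properties();
Cache cache = new CacheFactory(config).create();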
  cache.close();
} finally {
  DiskStoreImpl.SET_IGNORE_PREALLOCATE = false;
  disconnectAllFromDS();
You don't need to call disconnectAllFromDS. It's a leftover from long ago when cache.close() did not disconnect from DS (it does now).
createDiskStore(30, 10000);
Region<Object, Object> region = createRegion();
DiskStoreImpl diskStore = ((LocalRegion) region).getDiskStore();
Please use or cast to interfaces instead of concrete impls where possible. You can use InternalRegion instead of LocalRegion here. If you change createRegion to return InternalRegion (and perform the cast in that method) then you can avoid the cast here, making it easier to read.
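A sketch of what that refactor could look like. The region shortcut and constants below are assumptions, since createRegion's body isn't shown in the diff, and the getDiskStore() call on InternalRegion follows from the reviewer's suggestion:

private InternalRegion createRegion() {
  RegionFactory<Object, Object> regionFactory =
      cache.createRegionFactory(RegionShortcut.REPLICATE_PERSISTENT); // shortcut assumed
  regionFactory.setDiskStoreName(DISK_STORE_NAME); // constant assumed
  return (InternalRegion) regionFactory.create(REGION_NAME); // single cast, inside the helper
}

// The call site then reads without a cast:
InternalRegion region = createRegion();
DiskStoreImpl diskStore = region.getDiskStore();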
private void createDiskStoreWithSizeInBytes(String diskStoreName,
    DiskStoreFactory diskStoreFactory,
    long maxOplogSizeInBytes) {
  ((DiskStoreFactoryImpl) diskStoreFactory).setMaxOplogSizeInBytes(maxOplogSizeInBytes);
  ((DiskStoreFactoryImpl) diskStoreFactory).setDiskDirSizesUnit(DiskDirSizesUnit.BYTES);
  diskStoreFactory.create(diskStoreName);
}
I would change this method to require DiskStoreFactoryImpl diskStoreFactory as a parameter just to be a bit cleaner.
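A sketch of the suggested signature change, taken directly from the diff above; with the parameter typed as the impl, the casts disappear:

private void createDiskStoreWithSizeInBytes(String diskStoreName,
    DiskStoreFactoryImpl diskStoreFactory,
    long maxOplogSizeInBytes) {
  diskStoreFactory.setMaxOplogSizeInBytes(maxOplogSizeInBytes);
  diskStoreFactory.setDiskDirSizesUnit(DiskDirSizesUnit.BYTES);
  diskStoreFactory.create(diskStoreName);
}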
Oplogs that are recovered after Region.close() will never be marked as eligible for compaction. This is because the counter unrecoveredRegionCount is not cleared during recovery of the region. When the AbstractDiskRegionInfo object contained in an Oplog is recovered, unrecoveredRegionCount should be decreased. This counter is decreased only if AbstractDiskRegionInfo.unrecovered is set to true for the object that is about to be recovered (see the checkForRecoverableRegion function). The problem is that these objects are deleted in Region.close(DiskRegion) and then recreated during recovery with the unrecovered flag set to false. Because of this, unrecoveredRegionCount for the Oplog is not decreased after the recovery. This issue doesn't happen when the cache/server is restarted, because in that case the counter is cleared.

Solution: During close(), mark all DiskRegion info objects as unrecovered in all Oplogs that are not yet compacted. After the region is recovered, the unrecovered flag is checked and the unrecoveredRegionCount counter will be decreased.
Force-pushed from e2cf136 to 0f1c4d6
Oplogs that are recovered after Region.close() will never be marked as
eligible for compaction. This is because unrecoveredRegionCount
is not cleared from the Oplog during recovery of the region. When a DiskRegion with
the AbstractDiskRegionInfo.unrecovered flag is recovered, unrecoveredRegionCount
should be decreased for the Oplog (see the checkForRecoverableRegion function).
The problem is that the AbstractDiskRegionInfo objects are deleted in
Region.close(DiskRegion) and then recreated during recovery with the
AbstractDiskRegionInfo.unrecovered flag set to false. Because of this, unrecoveredRegionCount
for the Oplog is not decreased during the recovery. This issue doesn't happen when
the cache/server is restarted, because in that case unrecoveredRegionCount is cleared.

Solution:
During close(), create and mark AbstractDiskRegionInfo objects as unrecovered in all Oplogs
that are not yet compacted. This way the AbstractDiskRegionInfo.unrecovered flag will be read
during the recovery and unrecoveredRegionCount will be decreased.
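To make the mechanism concrete, here is an illustrative model of the counter logic described above. The class, field, and method names are simplified stand-ins, not Geode's actual Oplog internals:

import java.util.HashMap;
import java.util.Map;

// Illustrative model only: mirrors the counter behaviour described in the
// commit message, not Geode's real Oplog code.
class OplogCompactionSketch {
  private final Map<Long, RegionInfo> regionInfos = new HashMap<>();
  private int unrecoveredRegionCount;

  private static class RegionInfo {
    boolean unrecovered;
  }

  // Before the fix: close() dropped the info object entirely, so the
  // "unrecovered" flag was lost while the counter still went up.
  void closeRegionBeforeFix(long regionId) {
    regionInfos.remove(regionId);
    unrecoveredRegionCount++;
  }

  // After the fix: close() keeps (or recreates) the info object and marks
  // it unrecovered, so the recovery path can see the flag.
  void closeRegionAfterFix(long regionId) {
    regionInfos.computeIfAbsent(regionId, id -> new RegionInfo()).unrecovered = true;
    unrecoveredRegionCount++;
  }

  // Recovery path (in the spirit of checkForRecoverableRegion): the counter
  // is decremented only when the flag is set. With the pre-fix close(), the
  // recreated info object had unrecovered == false, so the counter stayed
  // above zero and the oplog was never eligible for compaction.
  void recoverRegion(long regionId) {
    RegionInfo info = regionInfos.computeIfAbsent(regionId, id -> new RegionInfo());
    if (info.unrecovered) {
      info.unrecovered = false;
      unrecoveredRegionCount--;
    }
  }

  // Simplified stand-in for the compaction-eligibility check.
  boolean eligibleForCompaction() {
    return unrecoveredRegionCount == 0;
  }
}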
For all changes:
- Is there a JIRA ticket associated with this PR? Is it referenced in the commit message?
- Has your PR been rebased against the latest commit within the target branch (typically develop)?
- Is your initial contribution a single, squashed commit?
- Does gradlew build run cleanly?
- Have you written or updated unit tests to verify your changes?
- If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?