HDDS-8314. [Snapshot] SnapDiff job and compaction DAG/SST file pruning synchronization #4553
Conversation
Force-pushed from 03bb2b4 to 338a274 (Compare)
```java
 * to DAG), compaction DAG pruning job (to remove older snapshots from DAG)
 * or a snap diff job (reads compaction DAG).
 */
private final Object compactionDagLock = new Object();
```
We should make RocksDBCheckpointDiffer a singleton class; otherwise, this could still be a problem with future changes.
I don't think it is a good idea to make it a singleton in this case, because a RocksDBCheckpointDiffer object is tied to a RocksDB db directory.
Discussed offline. Changing it to a singleton will protect against multiple instances of RocksDBCheckpointDiffer being created. We want to have only one RocksDBCheckpointDiffer object throughout the OM process.
After changing RocksDBCheckpointDiffer to a plain singleton, most of the OM unit and integration tests failed because each test creates a new instance of OmMetadataManagerImpl, which initializes RDBStore and RocksDBCheckpointDiffer. I tried to fix the tests by creating only one instance of OmMetadataManagerImpl per test class, but that doesn't work either; assertions fail in that case.
https://github.com/hemantk-12/ozone/actions/runs/4813598710/jobs/8570328203
One way to fix this is to have one RocksDBCheckpointDiffer instance per RocksDB dir and keep it in memory, which solves the unit test failures and is close to what we want to achieve. I made the changes accordingly.
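A minimal sketch of such a per-DB-directory cache, assuming a hypothetical holder class and a single-argument constructor for brevity (the real constructor takes more parameters):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical holder that caches one RocksDBCheckpointDiffer per DB directory,
// so the OM process never creates two differs for the same RocksDB dir.
public final class RocksDBCheckpointDifferHolder {

  private static final Map<String, RocksDBCheckpointDiffer> INSTANCE_MAP =
      new ConcurrentHashMap<>();

  private RocksDBCheckpointDifferHolder() {
  }

  // Returns the cached differ for dbDir, creating it atomically on first use.
  public static RocksDBCheckpointDiffer getInstance(String dbDir) {
    return INSTANCE_MAP.computeIfAbsent(dbDir, RocksDBCheckpointDiffer::new);
  }
}
```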
...ksdb-checkpoint-differ/src/main/java/org/apache/ozone/rocksdiff/RocksDBCheckpointDiffer.java
prashantpogde left a comment:
Other than a couple of minor comments, the overall changes look good.
…urn unique object per DB dir.
Thanks @hemantk-12 for the patch.
What changes were proposed in this pull request?
Currently, while a snapDiff job is running, we may lose part of the compaction DAG if a snapshot's age overlaps with the time at which snapshots become stale in the compaction DAG.
Another issue is that the DAG may return some SST file(s) as the diff, but those files get removed by RocksDBCheckpointDiffer#pruneOlderSnapshotsWithCompactionHistory while they are being read to generate the diff report.
For example, if one or both snapshots in a snapDiff job are 30 days old and the compaction DAG pruning service removes snapshots older than 30 days, we could end up in that situation.
This change synchronizes compaction DAG updates (appending and pruning) with the snapDiff job so that the snapDiff report is complete and correct instead of partial and incorrect.
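For illustration, the intended synchronization can be sketched as the DAG writers (compaction listener, pruning service) and the DAG reader (snapDiff) contending on the same compactionDagLock shown in the diff above. This is a minimal sketch only; appendToCompactionDag, the parameter lists, and the method bodies are illustrative stand-ins, not the actual patch:

```java
import java.util.Collections;
import java.util.List;

public class RocksDBCheckpointDiffer {

  // Guards the compaction DAG against concurrent append, prune, and read.
  private final Object compactionDagLock = new Object();

  // Called when RocksDB finishes a compaction: records new nodes/edges in the DAG.
  // (Illustrative name; the real code populates the DAG from compaction job info.)
  public void appendToCompactionDag(/* compaction job info */) {
    synchronized (compactionDagLock) {
      // ... add nodes and edges for the newly compacted SST files ...
    }
  }

  // Background pruning: drops DAG nodes (and their SST files) past the retention window.
  public void pruneOlderSnapshotsWithCompactionHistory() {
    synchronized (compactionDagLock) {
      // ... remove stale snapshot nodes and delete the SST files they own ...
    }
  }

  // SnapDiff: traverses the DAG and reads the SST files it references under the
  // same lock, so pruning cannot delete files out from under a running diff.
  public List<String> getSSTDiffList(/* fromSnapshot, toSnapshot */) {
    synchronized (compactionDagLock) {
      // ... walk the DAG between the two snapshots and collect SST file names ...
      return Collections.emptyList();
    }
  }
}
```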
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-8314
How was this patch tested?
Existing tests as of now. More tests will be added as part of HDDS-8315 and HDDS-8389.