HDDS-3642. Stop/Pause Background services while replacing OM DB with checkpoint from Leader #1002
@hanishakoneru This PR needs a rebase.
Overall LGTM.
One minor comment, and it needs a rebase.
stopSecretManager();
metadataManager.stop();

// s3SecretManager should also be stopped
Right now, we don't have a stop() method for S3SecretManager or PrefixManager.
Also, these only perform read/write operations: writes go through the double buffer, and reads will not happen anyway because this OM is not the leader. Do we still see an issue?
If writes can happen then we should stop it, right?
But writes would have to come through Ratis server which will be paused. So it should be okay.
Thanks for confirming.
Then we can remove the comments.
Looks like it is failing compilation and also has checkstyle issues.
Codecov Report
@@             Coverage Diff              @@
##             master    #1002      +/-   ##
============================================
- Coverage     69.46%   69.45%    -0.02%
+ Complexity     9121     9118        -3
============================================
  Files           961      961
  Lines         48151    48158        +7
  Branches       4679     4679
============================================
- Hits          33448    33447        -1
- Misses        12489    12495        +6
- Partials       2214     2216        +2
…checkpoint from Leader
+1 LGTM.
Thank You @hanishakoneru for the contribution
Kicked off a CI run; the tests had not run because the artifacts were unavailable.
Thank You @hanishakoneru for the contribution.
What changes were proposed in this pull request?
When a follower OM needs to replace its DB with a checkpoint from the Leader (to catch up on transactions), it should pause or stop the services that read from or write to the DB.
During OM HA testing, we found that OM could crash with a JVM error in RocksDB. This happened because KeyDeletingService was trying to access memory that had already been freed.
Please see Jira link below for error logs.
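The ordering described above can be illustrated with a minimal sketch. This is not the code from this patch: `CheckpointInstallSketch`, `BackgroundService`, and `installCheckpoint` are hypothetical names, and the DB replacement is reduced to a `Runnable`. It only shows the idea of stopping every DB-touching service before the swap and restarting them afterwards, so that no service can touch a DB handle that has been closed.

```java
import java.util.ArrayList;
import java.util.List;

class CheckpointInstallSketch {

    /** Hypothetical stand-in for a background service that reads or writes the DB. */
    static class BackgroundService {
        private final String name;
        private final List<String> events;
        private boolean running = true;

        BackgroundService(String name, List<String> events) {
            this.name = name;
            this.events = events;
        }

        void stop() {
            running = false;
            events.add("stop " + name);
        }

        void start() {
            running = true;
            events.add("start " + name);
        }

        boolean isRunning() {
            return running;
        }
    }

    /**
     * Stops every service, runs the DB replacement, then restarts the
     * services in reverse order. The restart sits in a finally block so
     * that a failed replacement does not leave services stopped; whether
     * that is the right policy for a real OM is an assumption of this sketch.
     */
    static void installCheckpoint(List<BackgroundService> services, Runnable replaceDb) {
        List<BackgroundService> stopped = new ArrayList<>();
        try {
            for (BackgroundService service : services) {
                service.stop();
                stopped.add(service);
            }
            // No background service is running here, so none can access
            // the old DB while it is being closed and replaced.
            replaceDb.run();
        } finally {
            for (int i = stopped.size() - 1; i >= 0; i--) {
                stopped.get(i).start();
            }
        }
    }

    public static void main(String[] args) {
        List<String> events = new ArrayList<>();
        List<BackgroundService> services = new ArrayList<>();
        services.add(new BackgroundService("KeyDeletingService", events));
        services.add(new BackgroundService("SecretManager", events));
        installCheckpoint(services, () -> events.add("replace DB"));
        System.out.println(events);
    }
}
```

Running `main` prints the recorded event order, which makes it easy to check that the replacement step falls strictly between the stops and the restarts.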
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-3642
How was this patch tested?
Tested on a local docker cluster.