Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HDDS-3642. Stop/Pause Background services while replacing OM DB with checkpoint from Leader #1002

Merged
merged 2 commits into from Jun 14, 2020

Conversation

hanishakoneru
Copy link
Contributor

What changes were proposed in this pull request?

When a follower OM needs to replace its DB with a checkpoint from Leader (to catch up on the transactions), it should pause or stop services which read/ write to the DB.

During OM HA testing, found that OM could crash with JVM error on RocksDB. This happened because KeyDeletingService was trying to access a memory which is already freed up.

Please see Jira link below for error logs.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-3642

How was this patch tested?

Tested on a local docker cluster.

@bharatviswa504
Copy link
Contributor

@hanishakoneru This PR needs a rebase.

Copy link
Contributor

@bharatviswa504 bharatviswa504 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM.
One minor comment, and it needs a rebase.

stopSecretManager();
metadataManager.stop();

// s3SecretManager should also be stopped
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now, we don't have any stop() for S3SecretManager and PrefixManager.
And also for these, only read/write operations will happen. Write will go through double buffer and read's will not happen anyway as this is not leader. Do we still see issue?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If writes can happen then we should stop it, right?
But writes would have to come through Ratis server which will be paused. So it should be okay.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for confirming.
Then we can remove the comments.

@bharatviswa504
Copy link
Contributor

Looks like it is failing in compilation and also check style issues.

@codecov-commenter
Copy link

codecov-commenter commented Jun 10, 2020

Codecov Report

Merging #1002 into master will decrease coverage by 0.01%.
The diff coverage is 0.00%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master    #1002      +/-   ##
============================================
- Coverage     69.46%   69.45%   -0.02%     
+ Complexity     9121     9118       -3     
============================================
  Files           961      961              
  Lines         48151    48158       +7     
  Branches       4679     4679              
============================================
- Hits          33448    33447       -1     
- Misses        12489    12495       +6     
- Partials       2214     2216       +2     
Impacted Files Coverage Δ Complexity Δ
.../java/org/apache/hadoop/ozone/om/OzoneManager.java 64.27% <0.00%> (-0.32%) 186.00 <0.00> (ø)
...otocol/commands/RetriableDatanodeEventWatcher.java 55.55% <0.00%> (-44.45%) 3.00% <0.00%> (-1.00%)
...er/common/transport/server/GrpcXceiverService.java 70.00% <0.00%> (-10.00%) 3.00% <0.00%> (ø%)
...apache/hadoop/hdds/scm/block/BlockManagerImpl.java 66.66% <0.00%> (-5.41%) 18.00% <0.00%> (ø%)
...apache/hadoop/hdds/server/events/EventWatcher.java 77.77% <0.00%> (-4.17%) 14.00% <0.00%> (ø%)
...va/org/apache/hadoop/ozone/lease/LeaseManager.java 90.80% <0.00%> (-2.30%) 15.00% <0.00%> (-1.00%)
...hdds/scm/container/common/helpers/ExcludeList.java 95.91% <0.00%> (-2.05%) 23.00% <0.00%> (-1.00%)
...ine/commandhandler/DeleteBlocksCommandHandler.java 62.50% <0.00%> (-1.57%) 9.00% <0.00%> (-2.00%)
.../apache/hadoop/ozone/client/io/KeyInputStream.java 78.19% <0.00%> (-1.51%) 42.00% <0.00%> (-1.00%)
.../statemachine/background/BlockDeletingService.java 75.73% <0.00%> (-1.48%) 12.00% <0.00%> (-1.00%)
... and 16 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cee76a8...37ff3de. Read the comment docs.

Copy link
Contributor

@bharatviswa504 bharatviswa504 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 LGTM.
Thank You @hanishakoneru for the contribution

@bharatviswa504
Copy link
Contributor

Kicked off CI run, test has not run due to unavailability of artificats.

@bharatviswa504 bharatviswa504 merged commit aa04ac0 into apache:master Jun 14, 2020
@bharatviswa504
Copy link
Contributor

Thank You @hanishakoneru for the contribution.

@hanishakoneru hanishakoneru deleted the HDDS-3642 branch December 1, 2020 21:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
4 participants