HDDS-7271. Ozone Integration test shows memory leak (graceful shutdown cleanup)#3826
adoroszlai merged 9 commits into apache:master
adoroszlai
left a comment
Thanks @sumitagrawl for the new PR. Can you please check the following failure?
Error: org.apache.hadoop.ozone.client.rpc.TestCommitWatcher.testReleaseBuffersOnException Time elapsed: 38.019 s <<< FAILURE!
java.lang.AssertionError: Unexpected exception: class java.lang.NullPointerException
at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.assertTrue(Assert.java:42)
at org.apache.hadoop.ozone.client.rpc.TestCommitWatcher.testReleaseBuffersOnException(TestCommitWatcher.java:324)
https://github.com/apache/ozone/actions/runs/3233538293/jobs/5295543311#step:5:3105
It may be intermittent, as it did not happen in your fork:
https://github.com/sumitagrawl/ozone/actions/runs/3233531880/jobs/5295505036#step:5:3105
Thanks @sumitagrawl for checking. I have never seen this problem before (we collect test results from |
I was unable to reproduce this locally, but the CI logs show the cause: the metrics system tries to unregister an entry that is not present in the registry, and the unregister code, which comes from Hadoop, has no check for a missing entry, so it throws a NullPointerException. This can happen if SCM shutdown clears the cache, which is a static registry; other components registered in it can be affected as well. I have changed the unregister logic to remove only the entry that SCM itself registered, instead of performing a global cleanup, because our test cases run multiple instances in the same JVM sharing that static cache.
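For illustration, the scoped cleanup described above could look roughly like the following sketch. `SharedMetricsRegistry` and `TrackedService` are hypothetical stand-ins written for this example; they are not the Hadoop or Ratis metrics API and not the code in this patch. The idea is that each instance remembers the names it registered and removes only those, and that removing a missing entry is a no-op rather than an exception.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for a JVM-wide static metrics registry; not the Hadoop or Ratis API. */
final class SharedMetricsRegistry {
  private static final Map<String, Object> SOURCES = new ConcurrentHashMap<>();

  private SharedMetricsRegistry() { }

  static void register(String name, Object source) {
    SOURCES.put(name, source);
  }

  /** Removes the entry if present; returns false instead of failing when it is absent. */
  static boolean unregister(String name) {
    return SOURCES.remove(name) != null;
  }

  static boolean contains(String name) {
    return SOURCES.containsKey(name);
  }
}

/** Hypothetical service that tracks only the names it registered itself. */
final class TrackedService {
  private final Set<String> ownNames = ConcurrentHashMap.newKeySet();

  void start(String name) {
    SharedMetricsRegistry.register(name, new Object());
    ownNames.add(name);
  }

  /** Unregisters this instance's entries only; safe to call more than once. */
  void stop() {
    for (String name : ownNames) {
      SharedMetricsRegistry.unregister(name);
    }
    ownNames.clear();
  }
}
```

With this shape, stopping one instance in a test JVM leaves the entries of any other running instance in place, and a repeated `stop()` does nothing.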
With the previous commit, repeated test runs in CI show a 60% failure rate: https://github.com/adoroszlai/hadoop-ozone/actions/runs/3235026012/jobs/5298892266#step:6:12 I'll check the latest one, too. Thanks @sumitagrawl for updating the patch.
adoroszlai
left a comment
TestCommitWatcher passed 50/50 with f96b82e.
@ChenSammi Please merge
What changes were proposed in this pull request?
Clean up the RatisDropwizardExports registry on stop/shutdown.
Avoid a continuous loop after an interrupt in DeleteBlocksCommandHandler.
Stop other services that were left running after the cluster was stopped.
A few cases involving "ForkJoinPool" threads, which come from CompletableFuture, cannot be handled in this PR, because they depend on the owning service's own logic to handle them and mark them closed.
These issues were observed in the Ozone integration tests.
What is the link to the Apache JIRA?
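As a rough illustration of the interrupt handling mentioned for DeleteBlocksCommandHandler, here is a minimal sketch of a worker loop that exits when its thread is interrupted. `InterruptAwareWorker` is a hypothetical class written for this example and is not the handler's actual code; it only shows the general pattern of checking the interrupt flag and restoring it when `InterruptedException` is caught.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical worker; illustrates the pattern only, not DeleteBlocksCommandHandler itself. */
final class InterruptAwareWorker implements Runnable {
  private final BlockingQueue<Runnable> tasks = new LinkedBlockingQueue<>();
  private final AtomicInteger processed = new AtomicInteger();

  void submit(Runnable task) {
    tasks.add(task);
  }

  int processedCount() {
    return processed.get();
  }

  @Override
  public void run() {
    // Re-check the interrupt flag on every iteration so shutdown is noticed.
    while (!Thread.currentThread().isInterrupted()) {
      try {
        Runnable task = tasks.poll(100, TimeUnit.MILLISECONDS);
        if (task != null) {
          task.run();
          processed.incrementAndGet();
        }
      } catch (InterruptedException e) {
        // Restore the flag and leave the loop. Swallowing the exception and
        // continuing would clear the flag and keep the thread alive.
        Thread.currentThread().interrupt();
        return;
      }
    }
  }
}
```

A thread running this worker terminates shortly after `Thread.interrupt()` is called on it, so it does not outlive the cluster that started it.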
https://issues.apache.org/jira/browse/HDDS-7271
How was this patch tested?
Verified by running the Ozone integration tests and inspecting a heap dump afterwards; the leak is no longer observed.