HDDS-7271. Ozone Integration test shows memory leak (graceful shutdown cleanup) #3826

Merged
adoroszlai merged 9 commits into apache:master from sumitagrawl:HDDS-7271
Oct 12, 2022

@sumitagrawl
Contributor

What changes were proposed in this pull request?
Clean up the RatisDropwizardExports registry on stop/shutdown.

Avoid a continuous loop after interrupt in DeleteBlocksCommandHandler.

Stop other services that were left running after the cluster was stopped.

A few "ForkJoinPool" cases, which come from CompletableFuture usage, cannot be handled as part of this change, because they depend on each service's own logic to handle them and mark them closed.

These issues were observed in the Ozone integration tests.
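
As a rough illustration of the interrupt fix mentioned above for DeleteBlocksCommandHandler, here is a minimal sketch of a worker loop that exits once its thread is interrupted. It is not the code from this PR: `InterruptAwareWorker`, its queue, and its counter are hypothetical names used only to show the general pattern.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical worker; an assumption for illustration, not Ozone code. */
class InterruptAwareWorker implements Runnable {
  private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
  private final AtomicInteger processed = new AtomicInteger();

  void submit(String task) {
    queue.add(task);
  }

  int processedCount() {
    return processed.get();
  }

  @Override
  public void run() {
    // Re-check the interrupt flag on every iteration.
    while (!Thread.currentThread().isInterrupted()) {
      try {
        // Bounded wait so the loop condition is evaluated regularly.
        String task = queue.poll(100, TimeUnit.MILLISECONDS);
        if (task != null) {
          processed.incrementAndGet();
        }
      } catch (InterruptedException e) {
        // Catching the exception clears the interrupt flag. Restore it and
        // return, otherwise the loop would keep running after shutdown.
        Thread.currentThread().interrupt();
        return;
      }
    }
  }
}
```

A thread running this worker stops shortly after `Thread.interrupt()` is called on it, so it does not stay alive (and keep objects reachable) after the owning service shuts down.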

What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-7271

How was this patch tested?
This was verified by running the Ozone integration tests and checking the heap dump. The issue is no longer observed.

@adoroszlai left a comment
Contributor

Thanks @sumitagrawl for the new PR. Can you please check the following failure?

Error:  org.apache.hadoop.ozone.client.rpc.TestCommitWatcher.testReleaseBuffersOnException  Time elapsed: 38.019 s  <<< FAILURE!
java.lang.AssertionError: Unexpected exception: class java.lang.NullPointerException
	at org.junit.Assert.fail(Assert.java:89)
	at org.junit.Assert.assertTrue(Assert.java:42)
	at org.apache.hadoop.ozone.client.rpc.TestCommitWatcher.testReleaseBuffersOnException(TestCommitWatcher.java:324)

https://github.com/apache/ozone/actions/runs/3233538293/jobs/5295543311#step:5:3105

It may be intermittent, as it did not happen in your fork:
https://github.com/sumitagrawl/ozone/actions/runs/3233531880/jobs/5295505036#step:5:3105

@adoroszlai
Contributor

Thanks @sumitagrawl for checking. I have never seen this problem before (we collect test results from master branch), so until we gather more info I have to assume that it's related to this change. I have started repeated runs both with and without this change.

@sumitagrawl
Contributor Author

> Thanks @sumitagrawl for checking. I have never seen this problem before (we collect test results from master branch), so until we gather more info I have to assume that it's related to this change. I have started repeated runs both with and without this change.

I was unable to reproduce this locally, but the CI logs show that it happens because the metrics system tries to unregister a source that is not present in the registry. There is no check for a missing entry, which causes the NullPointerException. That part of the code is from Hadoop.

This may occur if the SCM shutdown clears the cache, which is a static registry. Other services can be affected if they are also registered there.

I have changed the unregister logic to remove only the single entry that the SCM cache registered. It no longer performs a global cleanup, because our test cases run multiple instances in the same process, sharing the static cache.
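
The difference between a global cleanup and removing only one's own entry can be sketched as follows. This is a simplified illustration under stated assumptions, not the Hadoop metrics API or the actual patch: `SharedRegistry` and `CacheService` are hypothetical stand-ins for a static registry and a service that registers itself in it.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical process-wide static registry; not the Hadoop metrics system. */
final class SharedRegistry {
  private static final Map<String, Object> SOURCES = new ConcurrentHashMap<>();

  private SharedRegistry() {
  }

  static void register(String name, Object source) {
    SOURCES.put(name, source);
  }

  /** Removes one entry; returns false instead of failing if it is absent. */
  static boolean unregister(String name) {
    return SOURCES.remove(name) != null;
  }

  static boolean contains(String name) {
    return SOURCES.containsKey(name);
  }
}

/** Hypothetical service that cleans up only what it registered itself. */
final class CacheService {
  private final String sourceName;

  CacheService(String id) {
    this.sourceName = "cache-" + id;
    SharedRegistry.register(sourceName, this);
  }

  void stop() {
    // Remove only this instance's entry. Clearing the whole registry would
    // also drop entries owned by other instances in the same process.
    SharedRegistry.unregister(sourceName);
  }
}
```

With two `CacheService` instances in one process, stopping the first leaves the second instance's entry in place, and calling `stop()` twice is harmless because the removal tolerates a missing entry.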

@adoroszlai
Contributor

With the previous commit, repeated test runs in CI show a 60% failure rate.

https://github.com/adoroszlai/hadoop-ozone/actions/runs/3235026012/jobs/5298892266#step:6:12

I'll check the latest one, too. Thanks @sumitagrawl for updating the patch.

@adoroszlai left a comment
Contributor

TestCommitWatcher passed 50/50 with f96b82e.

@sumitagrawl
Contributor Author

@ChenSammi Please merge

@adoroszlai adoroszlai merged commit 15217fe into apache:master Oct 12, 2022