HDDS-3933. Fix memory leak because of too many Datanode State Machine Thread #1185
I will check the failed UT.
Kindly add a new unit test to showcase that the new logic works as expected. Also check the failed UT and acceptance test.
...main/java/org/apache/hadoop/ozone/container/common/states/datanode/RunningDatanodeState.java
...ervice/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/StateContext.java
Thanks @runzhiwang for reporting the issue and providing the fix. Patch LGTM, I just have a minor question inline.
Codecov Report
@@             Coverage Diff              @@
##             master    #1185      +/-   ##
============================================
+ Coverage     70.56%   73.49%    +2.93%
- Complexity     9427    10048      +621
============================================
  Files           965      978       +13
  Lines         49063    49918      +855
  Branches       4803     4848       +45
============================================
+ Hits          34620    36689     +2069
+ Misses        12137    10925     -1212
+ Partials       2306     2304        -2
LGTM, +1. Thanks @runzhiwang for the contribution and @xiaoyuyao for the review.
What changes were proposed in this pull request?
What's the problem?
The Datanode created more than 20K Datanode State Machine Threads, which led to an OOM.
What's the reason?
The 20K Datanode State Machine Threads were created by newCachedThreadPool.
Almost all of them were waiting for a lock.
Only one Datanode State Machine Thread held the lock, and it was blocked in submitRequest. Because this thread was blocked and could not release the lock, newCachedThreadPool kept creating new threads without bound.
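The thread growth described above can be reproduced outside Ozone. The following is a minimal sketch, not code from the patch: the class and method names are hypothetical, a `CountDownLatch` stands in for the blocking `submitRequest` call, and a fixed-size pool is shown only for contrast. Every task needs the same lock, the first task blocks while holding it, and the sketch reports the peak number of threads each pool created.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical demo class; not part of the Ozone code base.
final class CachedPoolGrowthSketch {

  /** Submits tasks that all contend for one lock and returns the pool's peak thread count. */
  static int peakThreads(ExecutorService pool, int tasks) {
    ReentrantLock lock = new ReentrantLock();
    CountDownLatch release = new CountDownLatch(1);
    for (int i = 0; i < tasks; i++) {
      pool.execute(() -> {
        lock.lock();
        try {
          // Stand-in for a call that blocks while the lock is held.
          release.await();
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        } finally {
          lock.unlock();
        }
      });
    }
    // Both Executors factories used below return a ThreadPoolExecutor.
    int peak = ((ThreadPoolExecutor) pool).getLargestPoolSize();
    release.countDown();  // unblock the lock holder so every task can finish
    pool.shutdown();
    try {
      pool.awaitTermination(10, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return peak;
  }

  public static void main(String[] args) {
    System.out.println("cached pool peak threads: "
        + peakThreads(Executors.newCachedThreadPool(), 50));
    System.out.println("fixed pool peak threads:  "
        + peakThreads(Executors.newFixedThreadPool(2), 50));
  }
}
```

With the cached pool no worker ever becomes idle, so each submitted task gets its own thread. The bounded pool queues the extra tasks instead.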
How to fix?
Stop using newCachedThreadPool, because it creates new threads without bound when no thread in the pool is available.
The timeout also does not work. For example, with startTime = 0 and timeLeft = 100 initially, if monotonicNow = 10 then timeLeft = 100 - (10 - 0) = 90; if monotonicNow is then 11, timeLeft = 90 - (11 - 0) = 79, when it should actually be 89.
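The timeout arithmetic in the example above can be checked with a few lines of Java. This is only a sketch under assumptions: the class and method names are hypothetical, and the "corrected" variant is one plausible way to compute the remaining time, not necessarily what the patch does. The buggy variant subtracts the total elapsed time from the already reduced remainder on every check, while the corrected variant always subtracts it from the original timeout.

```java
// Hypothetical demo class; not part of the Ozone code base.
final class TimeLeftSketch {

  /** Buggy: the full elapsed time is subtracted from the running remainder each time. */
  static long[] buggyTimeLeft(long timeout, long startTime, long[] nowSamples) {
    long timeLeft = timeout;
    long[] result = new long[nowSamples.length];
    for (int i = 0; i < nowSamples.length; i++) {
      timeLeft = timeLeft - (nowSamples[i] - startTime);
      result[i] = timeLeft;
    }
    return result;
  }

  /** Corrected (assumed form): remaining time is always derived from the original timeout. */
  static long[] correctedTimeLeft(long timeout, long startTime, long[] nowSamples) {
    long[] result = new long[nowSamples.length];
    for (int i = 0; i < nowSamples.length; i++) {
      result[i] = timeout - (nowSamples[i] - startTime);
    }
    return result;
  }

  public static void main(String[] args) {
    long[] samples = {10, 11};  // the monotonicNow values from the description
    System.out.println(java.util.Arrays.toString(buggyTimeLeft(100, 0, samples)));
    System.out.println(java.util.Arrays.toString(correctedTimeLeft(100, 0, samples)));
  }
}
```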
What is the link to the Apache JIRA?
https://issues.apache.org/jira/browse/HDDS-3933
How was this patch tested?
Existing unit tests.