Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HDDS-3933. Fix memory leak because of too many Datanode State Machine Thread #1185

Merged
merged 7 commits into from Jul 22, 2020

Conversation

runzhiwang
Copy link
Contributor

@runzhiwang runzhiwang commented Jul 10, 2020

What changes were proposed in this pull request?

What's problem ?
Datanode creates more than 20K Datanode State Machine Thread, then OOM happened.
image

What's the reason ?
20K Datanode State Machine Thread were created by newCachedThreadPool
image

Almost all of them were wait lock.
image

Only one Datanode State Machine Thread got the lock, and block when submitRequest. Because this thread was blocked and can not free the lock, newCachedThreadPool will create new thread infinitely.
image

How to fix ?

  1. Avoid use newCachedThreadPool, because it will create new thread infinitely, if no thread available in pool.
  2. Cancel future when task time out.
  3. Besides there is bug in such following code which want to loop until timeout or returned >= count.
    But timeout can not work, for example when init startTime = 0 and timeLeft = 100, if monotonicNow = 10, timeLeft = 100 - (10 - 0) = 90, and then if monotonicNow = 11, timeLeft = 90 - (11 - 0) = 79 which actually should be 89.
 while (returned < count && timeLeft > 0) {
    timeLeft = timeLeft - (Time.monotonicNow() - startTime);
 }

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-3933

How was this patch tested?

Existed UT.

@runzhiwang
Copy link
Contributor Author

I will check the failed ut.

Copy link
Contributor

@ChenSammi ChenSammi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kindly add new unit test to show case the new logic works as expected. And check the failed UT and accpetance test.

@runzhiwang runzhiwang force-pushed the thread-pool-size-2 branch 4 times, most recently from 23e1263 to 2409149 Compare July 16, 2020 11:11
Copy link
Contributor

@xiaoyuyao xiaoyuyao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @runzhiwang for reporting issue and the fix. Patch LGTM, just have a minor question inline.

@codecov-commenter
Copy link

Codecov Report

Merging #1185 into master will increase coverage by 2.93%.
The diff coverage is 74.96%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master    #1185      +/-   ##
============================================
+ Coverage     70.56%   73.49%   +2.93%     
- Complexity     9427    10048     +621     
============================================
  Files           965      978      +13     
  Lines         49063    49918     +855     
  Branches       4803     4848      +45     
============================================
+ Hits          34620    36689    +2069     
+ Misses        12137    10925    -1212     
+ Partials       2306     2304       -2     
Impacted Files Coverage Δ Complexity Δ
.../apache/hadoop/hdds/conf/HddsPrometheusConfig.java 50.00% <ø> (ø) 2.00 <0.00> (ø)
...op/hdds/utils/LegacyHadoopConfigurationSource.java 50.00% <ø> (ø) 5.00 <0.00> (ø)
.../java/org/apache/hadoop/ozone/OzoneConfigKeys.java 100.00% <ø> (ø) 1.00 <0.00> (ø)
...main/java/org/apache/hadoop/ozone/OzoneConsts.java 85.71% <ø> (+1.50%) 1.00 <0.00> (ø)
...g/apache/hadoop/hdds/conf/ConfigurationSource.java 22.72% <ø> (ø) 7.00 <0.00> (ø)
...rg/apache/hadoop/ozone/HddsDatanodeHttpServer.java 100.00% <ø> (+28.57%) 13.00 <0.00> (+4.00)
...mon/transport/server/ratis/XceiverServerRatis.java 88.10% <0.00%> (-4.50%) 62.00 <0.00> (+5.00) ⬇️
...iner/ozoneimpl/ContainerScrubberConfiguration.java 81.81% <ø> (-18.19%) 7.00 <0.00> (-1.00)
...org/apache/hadoop/hdds/server/http/HttpConfig.java 76.47% <0.00%> (ø) 1.00 <0.00> (ø)
.../hadoop/hdds/scm/container/ReplicationManager.java 87.35% <0.00%> (-0.70%) 102.00 <0.00> (ø)
... and 453 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 25d3e52...2c381fa. Read the comment docs.

@ChenSammi
Copy link
Contributor

LGTM + 1. Thanks @runzhiwang for the contribution and @xiaoyuyao for the review.

@ChenSammi ChenSammi merged commit ff7b5a3 into apache:master Jul 22, 2020
ChenSammi pushed a commit that referenced this pull request Jul 22, 2020
rakeshadr pushed a commit to rakeshadr/hadoop-ozone that referenced this pull request Sep 3, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants