Skip to content

[bug](Fe) fix potential deadlock in show proc statement#34988

Merged
xy720 merged 1 commit intoapache:masterfrom
xy720:prevent-dead-lock-in-show-proc
May 27, 2024
Merged

[bug](Fe) fix potential deadlock in show proc statement#34988
xy720 merged 1 commit intoapache:masterfrom
xy720:prevent-dead-lock-in-show-proc

Conversation

@xy720
Copy link
Member

@xy720 xy720 commented May 16, 2024

Proposed changes

Issue Number: close #xxx

version: 1.2|2.0|2.1

We have encountered Fe hang and the number connections reach limits.
Unfortunately, I restarted FE without Jstack the Fe process.

But I found a lot of logs like:

2024-05-11 14:09:32,533 WARN (thrift-server-pool-424042|4684871) [ReportHandler.putToQueue():190] the report queue size exceeds the limit: 100.
 current: 101
2024-05-11 14:09:32,545 WARN (thrift-server-pool-392529|4367180) [ReportHandler.putToQueue():190] the report queue size exceeds the limit: 100.
 current: 101
2024-05-11 14:09:32,663 WARN (thrift-server-pool-421895|4663462) [ReportHandler.putToQueue():190] the report queue size exceeds the limit: 100.
 current: 101
2024-05-11 14:09:32,816 WARN (thrift-server-pool-379243|4167917) [ReportHandler.putToQueue():190] the report queue size exceeds the limit: 100.
 current: 101
2024-05-11 14:09:33,531 WARN (thrift-server-pool-420797|4652941) [ReportHandler.putToQueue():190] the report queue size exceeds the limit: 100.
 current: 101

The tablet report task is hang:

2024-05-11 14:07:57,671 INFO (Thread-57|105) [ReportHandler.tabletReport():250] backend[10009] reports 28982 tablet(s). repor
t version: 17094493386232
2024-05-11 14:24:15,323 INFO (Thread-57|105) [TabletInvertedIndex.tabletReport():308] finished to do tablet diff with backend[10009].
sync: 0. metaDel: 7. foundInMeta: 28953. migration: 0. found invalid transactions 0. found republish transactions 0. tabletInMemorySync:
0. need recovery: 0. cost: 66 ms

And the dynamic partition scheduler seems also hang:

image

I found a similar issue, as issue #11319 saying, we should avoid using common global ForkJoinPool to execute parallelStream tasks and prevent Fe deadlocks from occurring.

Further comments

If this is a relatively large or complex change, kick off the discussion at dev@doris.apache.org by explaining why you chose the solution you did and what alternatives you considered, etc...

@doris-robot
Copy link

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website.
See Doris Document.

@xy720
Copy link
Member Author

xy720 commented May 16, 2024

run buildall

@lide-reed
Copy link
Contributor

lide-reed commented May 17, 2024

It's said that java.util.concurrent.ForkJoinPool may degrade performance. Please confirm whether the changes here involve high-frequency operations.
https://stackoverflow.com/questions/20288379/analysis-performance-of-forkjoinpool

@xy720
Copy link
Member Author

xy720 commented May 17, 2024

It's said that java.util.concurrent.ForkJoinPool may degrade performance. Please confirm whether the changes here involve high-frequency operations.
https://stackoverflow.com/questions/20288379/analysis-performance-of-forkjoinpool

As we discussed, the pr will not affect performance. No longer use the global only thread pool here, but instead use its own thread pool separately.

Copy link
Contributor

@lide-reed lide-reed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label May 27, 2024
@github-actions
Copy link
Contributor

PR approved by at least one committer and no changes requested.

@github-actions
Copy link
Contributor

PR approved by anyone and no changes requested.

Copy link
Contributor

@cambyzju cambyzju left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@xy720 xy720 merged commit 3664ca4 into apache:master May 27, 2024
lide-reed pushed a commit that referenced this pull request Dec 3, 2024
### What problem does this PR solve?

come from: #34988

Problem:
Use JDK11 and call `show proc "/cluster_health/tablet_health"`
frequently, make large number of ForkJoinPool thread leak
github-actions bot pushed a commit that referenced this pull request Dec 3, 2024
### What problem does this PR solve?

come from: #34988

Problem:
Use JDK11 and call `show proc "/cluster_health/tablet_health"`
frequently, make large number of ForkJoinPool thread leak
github-actions bot pushed a commit that referenced this pull request Dec 3, 2024
### What problem does this PR solve?

come from: #34988

Problem:
Use JDK11 and call `show proc "/cluster_health/tablet_health"`
frequently, make large number of ForkJoinPool thread leak
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by one committer. dev/2.0.x dev/3.0.0-merged reviewed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants