Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Work around JVM Bug in LongGCDisruptionTests #50731

Merged
merged 1 commit into from
Jan 14, 2020

Conversation

original-brownbear
Copy link
Member

There is a JVM bug causing Thread#suspend calls to randomly take
multiple seconds breaking these tests that call the method numerous times
in a loop. Increasing the timeout would will not work since we may call
suspend tens if not hundreds of times and even a small number of them
experiencing the blocking will lead to multiple minutes of waiting.

This PR detects the specific issue by timing the Thread#suspend calls and
skips the remainder of the test if it timed out because of the JVM bug.

Closes #50047

There is a JVM bug causing `Thread#suspend` calls to randomly take
multiple seconds breaking these tests that call the method numerous times
in a loop. Increasing the timeout would will not work since we may call
`suspend` tens if not hundreds of times and even a small number of them
experiencing the blocking will lead to multiple minutes of waiting.

This PR detects the specific issue by timing the `Thread#suspend` calls and
skips the remainder of the test if it timed out because of the JVM bug.

Closes elastic#50047
@original-brownbear original-brownbear added :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. >test Issues or PRs that are addressing/adding tests v7.5.2 v7.6.0 v8.0.0 labels Jan 8, 2020
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (:Distributed/Distributed)

thread.suspend();
if (System.nanoTime() - startTime > TimeUnit.SECONDS.toNanos(3L)) {
sawSlowSuspendBug.set(true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we make this conditional on the JVM version so that we are sure to keep tracking that JVM bug and to drop this fix when it's no longer needed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically we could, but right now we're seeing this failure on all versions. This is a little unexpected on 8 so I figured we'd apply the fix for 8 as well to identify whether it's the same issue (slow suspend) there or if it's blocked somewhere else.

Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@original-brownbear
Copy link
Member Author

Thanks Yannick!

@original-brownbear original-brownbear merged commit 375c140 into elastic:master Jan 14, 2020
@original-brownbear original-brownbear deleted the fix-50047 branch January 14, 2020 13:38
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Jan 14, 2020
There is a JVM bug causing `Thread#suspend` calls to randomly take
multiple seconds breaking these tests that call the method numerous times
in a loop. Increasing the timeout would will not work since we may call
`suspend` tens if not hundreds of times and even a small number of them
experiencing the blocking will lead to multiple minutes of waiting.

This PR detects the specific issue by timing the `Thread#suspend` calls and
skips the remainder of the test if it timed out because of the JVM bug.

Closes elastic#50047
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Jan 14, 2020
There is a JVM bug causing `Thread#suspend` calls to randomly take
multiple seconds breaking these tests that call the method numerous times
in a loop. Increasing the timeout would will not work since we may call
`suspend` tens if not hundreds of times and even a small number of them
experiencing the blocking will lead to multiple minutes of waiting.

This PR detects the specific issue by timing the `Thread#suspend` calls and
skips the remainder of the test if it timed out because of the JVM bug.

Closes elastic#50047
original-brownbear added a commit that referenced this pull request Jan 14, 2020
There is a JVM bug causing `Thread#suspend` calls to randomly take
multiple seconds breaking these tests that call the method numerous times
in a loop. Increasing the timeout would will not work since we may call
`suspend` tens if not hundreds of times and even a small number of them
experiencing the blocking will lead to multiple minutes of waiting.

This PR detects the specific issue by timing the `Thread#suspend` calls and
skips the remainder of the test if it timed out because of the JVM bug.

Closes #50047
original-brownbear added a commit that referenced this pull request Jan 14, 2020
There is a JVM bug causing `Thread#suspend` calls to randomly take
multiple seconds breaking these tests that call the method numerous times
in a loop. Increasing the timeout would will not work since we may call
`suspend` tens if not hundreds of times and even a small number of them
experiencing the blocking will lead to multiple minutes of waiting.

This PR detects the specific issue by timing the `Thread#suspend` calls and
skips the remainder of the test if it timed out because of the JVM bug.

Closes #50047
SivagurunathanV pushed a commit to SivagurunathanV/elasticsearch that referenced this pull request Jan 23, 2020
There is a JVM bug causing `Thread#suspend` calls to randomly take
multiple seconds breaking these tests that call the method numerous times
in a loop. Increasing the timeout would will not work since we may call
`suspend` tens if not hundreds of times and even a small number of them
experiencing the blocking will lead to multiple minutes of waiting.

This PR detects the specific issue by timing the `Thread#suspend` calls and
skips the remainder of the test if it timed out because of the JVM bug.

Closes elastic#50047
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. >test Issues or PRs that are addressing/adding tests v7.5.2 v7.6.0 v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

LongGCDisruptionTests.testNotBlockingUnsafeStackTraces failed
5 participants