Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dump Stacktrace on Slow IO-Thread Operations #42000

Conversation

Projects
None yet
4 participants
@original-brownbear
Copy link
Member

commented May 9, 2019

  • Follow up to #39729 extending the functionality to actually dump the
    stack when the thread is blocked not afterwards
    • Logging the stack-trace after the thread became unblocked is only of
      limited use because we don't know what happened in the slow callback
      from that (only whether we were blocked on a read,write,connect etc.)
  • Relates #41745
Dump Stacktrace on Slow IO-Thread Operations
* Follow up to #39729 extending the functionality to actually dump the
stack when the thread is blocked not afterwards
   * Logging the stacktrace after the thread became unblocked is only of
limited use because we don't know what happened in the slow callback
from that (only whether we were  blocked on a read,write,connect etc.)
* Relates #41745
@elasticmachine

This comment has been minimized.

Copy link

commented May 9, 2019

@original-brownbear

This comment has been minimized.

Copy link
Member Author

commented May 9, 2019

Jenkins run elasticsearch-ci/docbldesx

@cbuescher
Copy link
Member

left a comment

@original-brownbear thanks, I took a look since this would add great diagnostics in cases like the CI failure I am currently looking at. I don't know if the suggested 2s interval and warning threshold (seems 150ms) is long enough to not disturb normal test execution, my guess is that the mock transport should be fast but this is something somebody else like @tbrooks8 should comment on. I'd definitely like to have this kind of warning logging if it doesn't spam regular test execution error logs too much.

@original-brownbear

This comment has been minimized.

Copy link
Member Author

commented May 10, 2019

@cbuescher

I don't know if the suggested 2s interval and warning threshold (seems 150ms) is long enough to not disturb normal test execution, my guess is that the mock transport should be fast

The warning threshold was 150ms before as well and I think it was a fine choice then by @tbrooks8.
The 2s interval I figured was short enough to give a little more insight in case of a long running action on that transport_worker thread (from multiple changing stack traces being printed) but wouldn't completely destroy the logs for a 30s or 60s request timeout with a dead-locked callback. But admittedly the choice of 2s was pretty arbitrary. We could also put more effort into this (I'm not 100% convinced it's worth it but also not opposed) if others disagree and lower the 2s check interval and deduplicate log messages?

@original-brownbear

This comment has been minimized.

Copy link
Member Author

commented May 20, 2019

@tbrooks8 can you take a look here whenever you have a sec? I think you're the only one here comfortable reviewing this thing :)

@tbrooks8

This comment has been minimized.

Copy link
Contributor

commented May 21, 2019

Yes I’ll take a look tomorrow.

final Thread thread = entry.getKey();
logger.warn("Slow execution on network thread [{}] [{} milliseconds]: \n{}", thread.getName(),
TimeUnit.NANOSECONDS.toMillis(elapsedTime),
Arrays.stream(thread.getStackTrace()).map(Object::toString).collect(Collectors.joining("\n")));

This comment has been minimized.

Copy link
@tbrooks8

tbrooks8 May 22, 2019

Contributor

Why does this not cause issues with the security manager to call getStackTrace?

This comment has been minimized.

Copy link
@original-brownbear

original-brownbear May 22, 2019

Author Member

I think it's because we grant

  // needed for top threads handling
  permission org.elasticsearch.secure_sm.ThreadPermission "modifyArbitraryThreadGroup";

to codeBase "${codebase.randomizedtesting-runner}" { right? (in test-framework.policy)

We actually do use the same code to get stack traces in other tests too so I'm sure it works fine with the SM (+ manually verified it).

Arrays.stream(thread.getStackTrace()).map(Object::toString).collect(Collectors.joining("\n")));
}
}
if (registered.get() > 0) {

This comment has been minimized.

Copy link
@tbrooks8

tbrooks8 May 22, 2019

Contributor

There is kind of a race here. It feels like we should just have the thread always run and always reschedule the task? It's not like a 2 second periodic task is coming to cause issues with performance for the integration tests.

The race is:

  1. Generic thread pool task checks and sees 0 registered.
  2. New thread calls register and then increments registered, does the other stuff, then fails on the running compare and set (because running is still true).
  3. Generic thread pool task finishes setting running too false.

Obviously the next registered call will fix that. So things should eventually work out. But I don't see the significant value of making the concurrency handling this complicated when a periodic task seems fine.

This comment has been minimized.

Copy link
@original-brownbear

original-brownbear May 22, 2019

Author Member

Fair point, I was too paranoid here I guess :) Significantly simplified this now to always rescheduled and just added a flag to make it stop.

original-brownbear added some commits May 22, 2019

@original-brownbear original-brownbear requested a review from tbrooks8 May 22, 2019

@tbrooks8
Copy link
Contributor

left a comment

LGTM

@original-brownbear

This comment has been minimized.

Copy link
Member Author

commented May 22, 2019

thanks @tbrooks8 !

@original-brownbear original-brownbear merged commit 94848d8 into elastic:master May 22, 2019

8 checks passed

CLA All commits in pull request signed
Details
elasticsearch-ci/1 Build finished.
Details
elasticsearch-ci/2 Build finished.
Details
elasticsearch-ci/bwc Build finished.
Details
elasticsearch-ci/default-distro Build finished.
Details
elasticsearch-ci/docbldesx Build finished.
Details
elasticsearch-ci/oss-distro-docs Build finished.
Details
elasticsearch-ci/packaging-sample Build finished.
Details

@original-brownbear original-brownbear deleted the original-brownbear:stronger-blocked-transport-monitoring branch May 22, 2019

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request May 27, 2019

Dump Stacktrace on Slow IO-Thread Operations (elastic#42000)
* Dump Stacktrace on Slow IO-Thread Operations

* Follow up to elastic#39729 extending the functionality to actually dump the
stack when the thread is blocked not afterwards
   * Logging the stacktrace after the thread became unblocked is only of
limited use because we don't know what happened in the slow callback
from that (only whether we were  blocked on a read,write,connect etc.)
* Relates elastic#41745

gurkankaymak pushed a commit to gurkankaymak/elasticsearch that referenced this pull request May 27, 2019

Dump Stacktrace on Slow IO-Thread Operations (elastic#42000)
* Dump Stacktrace on Slow IO-Thread Operations

* Follow up to elastic#39729 extending the functionality to actually dump the
stack when the thread is blocked not afterwards
   * Logging the stacktrace after the thread became unblocked is only of
limited use because we don't know what happened in the slow callback
from that (only whether we were  blocked on a read,write,connect etc.)
* Relates elastic#41745

original-brownbear added a commit that referenced this pull request May 27, 2019

Dump Stacktrace on Slow IO-Thread Operations (#42000) (#42572)
* Dump Stacktrace on Slow IO-Thread Operations

* Follow up to #39729 extending the functionality to actually dump the
stack when the thread is blocked not afterwards
   * Logging the stacktrace after the thread became unblocked is only of
limited use because we don't know what happened in the slow callback
from that (only whether we were  blocked on a read,write,connect etc.)
* Relates #41745

cbuescher added a commit that referenced this pull request Jun 13, 2019

Dump Stacktrace on Slow IO-Thread Operations (#42000) (#42572)
* Dump Stacktrace on Slow IO-Thread Operations

* Follow up to #39729 extending the functionality to actually dump the
stack when the thread is blocked not afterwards
   * Logging the stacktrace after the thread became unblocked is only of
limited use because we don't know what happened in the slow callback
from that (only whether we were  blocked on a read,write,connect etc.)
* Relates #41745
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.