Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GEODE-8521: detect if a p2p reader thread is hung #5763

Merged
merged 19 commits into from
Dec 4, 2020

Conversation

dschneider-pivotal
Copy link
Contributor

Each p2p reader thread now creates a single instance of AbstractExecutor
and registers it with the ThreadsMonitoring instance while it is "dispatching"
the message. It will not be monitored while it is waiting on the socket to read
a message. So this will only report stuck p2p readers that are stuck dispatch
which for inline processing includes processing the message and reading the
direct ack.

Thank you for submitting a contribution to Apache Geode.

In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:

For all changes:

  • Is there a JIRA ticket associated with this PR? Is it referenced in the commit message?

  • Has your PR been rebased against the latest commit within the target branch (typically develop)?

  • Is your initial contribution a single, squashed commit?

  • Does gradlew build run cleanly?

  • Have you written or updated unit tests to verify your changes?

  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?

Note:

Please ensure that once the PR is submitted, check Concourse for build issues and
submit an update to your PR as soon as possible. If you need help, please send an
email to dev@geode.apache.org.

@dschneider-pivotal dschneider-pivotal marked this pull request as ready for review November 20, 2020 01:22
Copy link
Contributor

@bschuchardt bschuchardt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to be able to detect stuck P2P Reader threads that are dispatching messages but I'm concerned about the need to take a millisecond clock reading at the beginning of every message-dispatch.
Have you thought about having the thread be monitored continuously and checking its connectionState to see if it's dispatching? You'd have to add a new DISPATCHING state and have custom code in the monitor for the new P2PReaderExecutorGroup to do this checking.

@dschneider-pivotal dschneider-pivotal marked this pull request as draft November 20, 2020 17:38
Darrel Schneider and others added 4 commits November 20, 2020 12:58
This does nothing on the base class but on P2PReaderExecutorGroup
it turns monitoring on and off.
Also the monitor thread will now detect a zero startTime
and in that case set the startTime allowing threads being
monitored to not keep setting it themselves (which happens more often).
will now always be the values configured on the DistributedSystem.
Previously the thread monitoring classes had their own instance of
DistributionConfigImpl which at least in some cases could be inconsistent
the the config on the DistributedSystem. Now these two config values
are passed in to the constructor.
2. If an instance of ResourceManagerStats can not be found then
it will now only prevent the monitor from updating the "numThreadsStuck"
stat. Before it prevented it from doing any detection and logging.
The lazy initialization of the ResourceManagerStats is now cleaner and does
not require the integration test to explicitly call run().
@dschneider-pivotal dschneider-pivotal marked this pull request as ready for review November 21, 2020 06:13
Copy link
Contributor

@bschuchardt bschuchardt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for removing the clock probes Darrel. I also like the change to keep the reader threads in the monitor map. This will be great to have in diagnosing customer issues!

Copy link
Contributor

@kirklund kirklund left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! I requested a few small changes mostly involving naming.

@kirklund kirklund self-requested a review November 30, 2020 23:46
@dschneider-pivotal
Copy link
Contributor Author

the dunit failure is the known issue: GEODE-8573

@dschneider-pivotal dschneider-pivotal merged commit e26d759 into apache:develop Dec 4, 2020
@dschneider-pivotal dschneider-pivotal deleted the GEODE-8521 branch December 4, 2020 01:46
mkevo pushed a commit to Nordix/geode that referenced this pull request Apr 8, 2021
* AbstractExecutor can now be suspended and resumed.
This does nothing on the base class but on P2PReaderExecutorGroup
it turns monitoring on and off.
Also the monitor thread will now detect a zero startTime
and in that case set the startTime allowing threads being
monitored to not keep setting it themselves (which happens more often).

* the timeInterval and timeLimit that come from gemfire properties
will now always be the values configured on the DistributedSystem.
Previously the thread monitoring classes had their own instance of
DistributionConfigImpl which at least in some cases could be inconsistent
the the config on the DistributedSystem. Now these two config values
are passed in to the constructor.

* If an instance of ResourceManagerStats can not be found then
it will now only prevent the monitor from updating the "numThreadsStuck"
stat. Before it prevented it from doing any detection and logging.
The lazy initialization of the ResourceManagerStats is now cleaner and does
not require the integration test to explicitly call run().

Co-authored-by: Darrel Schneider <darrel@vmware.com>
mkevo pushed a commit to Nordix/geode that referenced this pull request Apr 19, 2021
* AbstractExecutor can now be suspended and resumed.
This does nothing on the base class but on P2PReaderExecutorGroup
it turns monitoring on and off.
Also the monitor thread will now detect a zero startTime
and in that case set the startTime allowing threads being
monitored to not keep setting it themselves (which happens more often).

* the timeInterval and timeLimit that come from gemfire properties
will now always be the values configured on the DistributedSystem.
Previously the thread monitoring classes had their own instance of
DistributionConfigImpl which at least in some cases could be inconsistent
the the config on the DistributedSystem. Now these two config values
are passed in to the constructor.

* If an instance of ResourceManagerStats can not be found then
it will now only prevent the monitor from updating the "numThreadsStuck"
stat. Before it prevented it from doing any detection and logging.
The lazy initialization of the ResourceManagerStats is now cleaner and does
not require the integration test to explicitly call run().

Co-authored-by: Darrel Schneider <darrel@vmware.com>
alb3rtobr pushed a commit to Nordix/geode that referenced this pull request Aug 2, 2021
* AbstractExecutor can now be suspended and resumed.
This does nothing on the base class but on P2PReaderExecutorGroup
it turns monitoring on and off.
Also the monitor thread will now detect a zero startTime
and in that case set the startTime allowing threads being
monitored to not keep setting it themselves (which happens more often).

* the timeInterval and timeLimit that come from gemfire properties
will now always be the values configured on the DistributedSystem.
Previously the thread monitoring classes had their own instance of
DistributionConfigImpl which at least in some cases could be inconsistent
the the config on the DistributedSystem. Now these two config values
are passed in to the constructor.

* If an instance of ResourceManagerStats can not be found then
it will now only prevent the monitor from updating the "numThreadsStuck"
stat. Before it prevented it from doing any detection and logging.
The lazy initialization of the ResourceManagerStats is now cleaner and does
not require the integration test to explicitly call run().

Co-authored-by: Darrel Schneider <darrel@vmware.com>
alb3rtobr pushed a commit to Nordix/geode that referenced this pull request Aug 4, 2021
* AbstractExecutor can now be suspended and resumed.
This does nothing on the base class but on P2PReaderExecutorGroup
it turns monitoring on and off.
Also the monitor thread will now detect a zero startTime
and in that case set the startTime allowing threads being
monitored to not keep setting it themselves (which happens more often).

* the timeInterval and timeLimit that come from gemfire properties
will now always be the values configured on the DistributedSystem.
Previously the thread monitoring classes had their own instance of
DistributionConfigImpl which at least in some cases could be inconsistent
the the config on the DistributedSystem. Now these two config values
are passed in to the constructor.

* If an instance of ResourceManagerStats can not be found then
it will now only prevent the monitor from updating the "numThreadsStuck"
stat. Before it prevented it from doing any detection and logging.
The lazy initialization of the ResourceManagerStats is now cleaner and does
not require the integration test to explicitly call run().

Co-authored-by: Darrel Schneider <darrel@vmware.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Quality Review
Awaiting triage
4 participants