[pulsar-function] Stop calling the deprecated method Thread.stop when stopping the function thread in ThreadRuntime. #11401
@jerrypeng please review.
Thanks for providing doc-related info!
@zliang-min Thanks for working on this!
I think this is a good first step toward improving the ThreadRuntime. I think the follow-up should be a couple of configs that let users decide how to wait for an instance to terminate. We should give the cluster admin an option to set how long to wait for an instance before timing out, and what to do after the timeout has expired. Some options for what to do when the timeout has expired:
- ignore and continue
- worker restarts itself
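As a rough illustration of the follow-up proposed above, the sketch below interrupts an instance thread, waits for a configurable timeout, and then applies a configurable action. Every name in it (OnStopTimeout, InstanceStopper, the restartWorker callback) is hypothetical and does not exist in Pulsar; it only shows the shape such settings could take.

```java
import java.time.Duration;
import java.util.logging.Logger;

/** Hypothetical actions for when an instance thread outlives the stop timeout. */
enum OnStopTimeout { IGNORE_AND_CONTINUE, RESTART_WORKER }

/** Hypothetical helper, not part of Pulsar. */
final class InstanceStopper {
    private static final Logger LOG = Logger.getLogger(InstanceStopper.class.getName());

    private final Duration stopTimeout;
    private final OnStopTimeout onTimeout;
    private final Runnable restartWorker;

    InstanceStopper(Duration stopTimeout, OnStopTimeout onTimeout, Runnable restartWorker) {
        this.stopTimeout = stopTimeout;
        this.onTimeout = onTimeout;
        this.restartWorker = restartWorker;
    }

    /**
     * Interrupts the thread and waits up to the configured timeout.
     * Returns true if the thread exited in time; otherwise applies the policy and returns false.
     */
    boolean stop(Thread instanceThread) {
        instanceThread.interrupt();
        try {
            // join(0) would wait forever, so always wait at least 1 ms.
            instanceThread.join(Math.max(1L, stopTimeout.toMillis()));
        } catch (InterruptedException e) {
            // Preserve the caller's interrupt status and fall through to the liveness check.
            Thread.currentThread().interrupt();
        }
        if (!instanceThread.isAlive()) {
            return true;
        }
        LOG.warning("Thread " + instanceThread.getName() + " is still alive after " + stopTimeout);
        if (onTimeout == OnStopTimeout.RESTART_WORKER) {
            restartWorker.run();
        }
        return false;
    }
}
```

With IGNORE_AND_CONTINUE the helper only logs a warning and lets the thread finish on its own; with RESTART_WORKER it also runs the supplied callback.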
...unctions/runtime/src/main/java/org/apache/pulsar/functions/runtime/thread/ThreadRuntime.java
With this change, will there be any dangling thread in the heap dump that's not cleaned up?
@nlu90 In Java there is no guaranteed way to kill a thread. The thread has to exit cleanly. To make sure threads are not leaked, I suggested follow-up work in my previous comment.
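The point about clean exit can be illustrated with a worker that cooperates with interruption. This is a minimal sketch; CooperativeWorker and its cleanedUp flag are made-up names, not Pulsar code. The worker polls its interrupt status, treats InterruptedException as a request to stop, and releases its resources in a finally block.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Illustrative worker, not Pulsar code. */
final class CooperativeWorker implements Runnable {
    /** Set once the cleanup in the finally block has run. */
    final AtomicBoolean cleanedUp = new AtomicBoolean(false);

    @Override
    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                Thread.sleep(10); // stand-in for one unit of real work
            }
        } catch (InterruptedException e) {
            // Restore the flag so code further up the stack can see the interrupt.
            Thread.currentThread().interrupt();
        } finally {
            // Resource cleanup goes here; it runs on every exit path.
            cleanedUp.set(true);
        }
    }

    /** Starts the worker, interrupts it, and reports whether it exited within 5 seconds. */
    static boolean runAndInterrupt(CooperativeWorker worker) {
        Thread thread = new Thread(worker, "cooperative-worker");
        thread.start();
        thread.interrupt();
        try {
            thread.join(5_000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !thread.isAlive();
    }
}
```

A thread that never checks its interrupt status and swallows InterruptedException cannot be stopped this way, which is why a timeout policy on the caller side is still needed.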
/pulsarbot rerun-checks
…on thread in ThreadRuntime. 8de70dd to 2d0e666
@jerrypeng Your suggested changes are safer; maybe we can add them in this PR?
@nlu90 What I suggested is a much larger change and should perhaps be released across several PRs. The current PR does not make things worse, only better, so I think we should merge it as is. It solves an issue we have seen.
…topping the function thread in ThreadRuntime. (apache#11401)
* Stop calling the deprecated method Thread.stop when stopping the function thread in ThreadRuntime.
* Updated warning message per review comment.
(cherry picked from commit be93e14)
Signed-off-by: Gimi Liang <zliang@splunk.com>
Motivation
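Currently, when the ThreadRuntime tries to stop a function instance, it calls the stop method on the fnThread if the thread is still alive 10 seconds after it was interrupted, see here.
Thread.stop is a deprecated method, and the problem is clearly documented in its Javadoc: stopping a thread this way makes it unlock all of the monitors it has locked, so objects those monitors protected can be left in an inconsistent state that becomes visible to other threads.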
This behavior is exactly what caused an issue with BatchSourceExecutor. BatchSourceExecutor terminates the discoveryThread when it stops, and waits 10 seconds for the termination to complete, see here. So, if the discoveryThread takes long enough to terminate, the awaitTermination method can throw an IllegalMonitorStateException because fnThread.stop is called. Below is the backtrace stack I got for this case:
The same issue has been discussed here as well.
The consequence of this error is that BatchSourceExecutor loses the chance to clean up resources such as the consumer of the intermediate topic, so it leaks a consumer that still consumes from the topic without processing the incoming messages.
This PR solves this issue.
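The shutdown-then-wait pattern described above looks roughly like the following sketch. It assumes the discovery work runs on a java.util.concurrent.ExecutorService; the class and method names are illustrative and not taken from BatchSourceExecutor. It shows only the bounded wait and does not reproduce the Thread.stop failure.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

/** Illustrative helper, not taken from BatchSourceExecutor. */
final class DiscoveryShutdownSketch {
    /**
     * Interrupts the executor's running tasks and waits up to the given time.
     * Returns true if the executor terminated within the wait, false otherwise.
     */
    static boolean shutdownAndWait(ExecutorService executor, long timeout, TimeUnit unit) {
        // shutdownNow() interrupts running tasks and discards queued ones.
        executor.shutdownNow();
        try {
            return executor.awaitTermination(timeout, unit);
        } catch (InterruptedException e) {
            // The waiting thread itself was interrupted; keep the flag and report failure.
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

A false return means the task did not finish within the wait; the caller can then log it or wait again rather than forcing the thread down.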
Modifications
The change is very simple: I removed the fnThread.stop() method call because it's deprecated, and replaced it with a warning log. So we just let the function instance take its time to clean things up.
Verifying this change
I manually verified this change by:
- Modifying BatchDataGeneratorSource to add a sleep inside the discover method to make it run a long time, see here, and building the connector.
- Stopping the source with http POST localhost:8080/admin/v3/sources/public/default/gimi-test/stop.
- Checking for the IllegalMonitorStateException error. After the fix, it no longer happens.
- Checking the topic stats with http localhost:8080/admin/v2/persistent/public/default/gimi-test-intermediate/stats. Before the fix, the consumer from the function instance is still there. After the fix, the consumer has been cleaned up.
Does this pull request potentially affect one of the following parts:
If yes was chosen, please highlight the changes
Documentation
No doc change is needed, since it's an internal change.