fix(nodetool): increase graceful stop timeout #11567
should we have some standard scenario to try to figure out how long this timeout should be?
bin/nodetool (outdated)
@@ -8,6 +8,8 @@
%% -------------------------------------------------------------------
-mode(compile).

-define(SHUTDOWN_TIMEOUT_MS, 300_000).
5 minutes is too long for a user to wait.
Most users will panic and press Ctrl+C rather than wait that long.
Maybe we should spawn a child process that prints a message saying the stop is still in progress.
Thanks @zhongwencool, I added periodic status messages that are printed while EMQX is stopping:
# bin/emqx stop
EMQX is shutting down, please wait...
EMQX is shutting down, please wait...
EMQX is shutting down, please wait...
EMQX is shutting down, please wait...
ok
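The behaviour shown above can be sketched in shell as follows. This is only an illustration, not the code from this PR: the function name `wait_for_stop` and its three arguments (PID, timeout in seconds, status interval in seconds) are hypothetical, and the real logic lives in `bin/nodetool` and `bin/emqx`.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of bin/emqx): wait for a process to exit,
# printing a status line every INTERVAL seconds and giving up after TIMEOUT.
# Usage: wait_for_stop PID TIMEOUT INTERVAL
wait_for_stop() {
    local pid="$1" timeout="$2" interval="$3" waited=0
    # `kill -0` sends no signal; it only checks that the process still exists.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "timed out after ${waited}s waiting for shutdown"
            return 1
        fi
        echo "EMQX is shutting down, please wait..."
        sleep "$interval"
        waited=$((waited + interval))
    done
    echo "ok"
}
```

With a 2 minute limit and a message every 5 seconds, a call would look like `wait_for_stop "$PID" 120 5`. Printing from the waiting loop itself keeps the user informed without needing a separate child process.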
After some more testing, I decreased the timeout from 5 min to 2 min. I wasn't able to reproduce a case where 2 min was not enough to shut down gracefully.
2 min also coincides with the default wait period in the EMQX script:
https://github.com/emqx/emqx/blob/master/bin/emqx#L1020
esockd_connection_sup spends the most time terminating its children. We could make it even more 'brutal' (kill all connections and don't wait for down messages), but I'm reluctant to change it solely to speed up graceful shutdown. It would look less graceful if we went for it...
Converting back to draft, found another issue occurring when |
…drpc, timeout}` error
…less of `-p <pid>`
…int warning messages
Fixes EMQX-10835
Tested on a local machine (i5-7200U, 2 physical cores, 16GB RAM) with a cluster of 1 core node and 1 replicant node.
before this fix:
200K clients, ~1.2M topics/subscriptions:
Replicant config was tuned for high-latency network:
Increased pool sizes may contribute to the longer graceful shutdown time.
I haven't noticed any other issues. It simply looks like a 1 minute RPC timeout is too short to gracefully shut down a highly loaded node.
As expected, `emqx` app shutdown time is the longest:
Summary
🤖 Generated by Copilot at 2a8d6bd
This change improves the documentation of a pull request that enhances the graceful shutdown feature of EMQX. It adds a file `fix-11567.en.md` that explains the increased timeout and the error message for the shutdown process.
PR Checklist
Please convert it to a draft if any of the following conditions are not met. Reviewers may skip over until all the items are checked:
`changes/(ce|ee)/(feat|perf|fix)-<PR-id>.en.md` files
Checklist for CI (.github/workflows) changes
`changes/` dir for user-facing artifacts update