Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(nodetool): increase graceful stop timeout #11567

Conversation

SergeTupchiy
Copy link
Contributor

@SergeTupchiy SergeTupchiy commented Sep 5, 2023

Fixes EMQX-10835

1 core, 1 replicant cluster on a local machine test (i5-7200U 2 physical cores, 16GB RAM).
before this fix:
200K clients, ~1.2M topics/subscriptions:

root@serge-Latitude-5480:/home/serge/work/emqx/tmp/emqx1# date; bin/emqx stop; date
16:32:25 +0300
escript: exception error: no case clause matching {badrpc,timeout}
  in function  nodetool__escript__1693__920746__671869__2:do/1 (/home/serge/work/emqx/tmp/emqx1/bin/nodetool, line 88)
  in call from escript:run/2 (escript.erl, line 750)
  in call from escript:start/1 (escript.erl, line 277)
  in call from init:start_em/1 
  in call from init:do_boot/3 
ERROR: Graceful shutdown failed PID=[80468]
16:33:27 +0300

Replicant config was tuned for high-latency network:

node {
  name = "emqx1@127.0.0.1"
  cookie = "emqxsecretcookie"
  data_dir = "data"
  db_role = replicant
  broker_pool_size = 4000
  generic_pool_size = 2000
  channel_cleanup_batch_size = 10
  rpc_module = rpc
}

Increased pool sizes may contribute to the longer graceful shutdown time.
I haven't noticed any issues, it simply looks like 1 minute rpc timeout is too low to gracefully shutdown a highly loaded node.
As expected, emqx app shutdown time is the longest:

2023-09-05T17:20:48.763183+03:00 [debug] msg: stopping_app, mfa: emqx_machine_boot:stop_one_app/1, line: 89, app: emqx
2023-09-05T17:20:48.770902+03:00 [notice] ssl:default stopped on 0.0.0.0:8884
2023-09-05T17:20:48.779091+03:00 [notice] tcp:default stopped on 0.0.0.0:1884
2023-09-05T17:22:27.532862+03:00 [notice] Application: emqx. Exited: stopped. Type: permanent.

Summary

🤖 Generated by Copilot at 2a8d6bd

This change improves the documentation of a pull request that enhances the graceful shutdown feature of EMQX. It adds a file fix-11567.en.md that explains the increased timeout and the error message for the shutdown process.

PR Checklist

Please convert it to a draft if any of the following conditions are not met. Reviewers may skip over until all the items are checked:

  • Added tests for the changes
  • Added property-based tests for code which performs user input validation
  • Changed lines covered in coverage report
  • Change log has been added to changes/(ce|ee)/(feat|perf|fix)-<PR-id>.en.md files
  • For internal contributor: there is a jira ticket to track this change
  • Created PR to emqx-docs if documentation update is required, or link to a follow-up jira ticket
  • Schema changes are backward compatible

Checklist for CI (.github/workflows) changes

  • If changed package build workflow, pass this action (manual trigger)
  • Change log has been added to changes/ dir for user-facing artifacts update

@SergeTupchiy SergeTupchiy force-pushed the EMQX-10835-increase-graceful-stop-timeout branch from d05a15b to 2a8d6bd Compare September 5, 2023 16:50
@SergeTupchiy SergeTupchiy marked this pull request as ready for review September 5, 2023 16:50
@SergeTupchiy SergeTupchiy requested a review from a team as a code owner September 5, 2023 16:50
paulozulato
paulozulato previously approved these changes Sep 5, 2023
Copy link
Contributor

@paulozulato paulozulato left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we have some standard scenario to try to figure out how long this timeout should be?

bin/nodetool Outdated
@@ -8,6 +8,8 @@
%% -------------------------------------------------------------------
-mode(compile).

-define(SHUTDOWN_TIMEOUT_MS, 300_000).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

5 minutes is too long for user to wait.
Mostly users will ctrl + c to stop waiting for so long time with panic.
Maybe we should spawn a child process to print the stop process is ongoing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @zhongwencool, added a periodic status messages while EMQX is stopping:

# bin/emqx stop
EMQX is shutting down, please wait...
EMQX is shutting down, please wait...
EMQX is shutting down, please wait...
EMQX is shutting down, please wait...
ok

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After some more testing, I decreased the timeout from 5 to 2 min. I wasn't able to re-produce the case when 2min was not enough to shutdown gracefully.
2 min also coincides with the default wait period in EMQX script:
https://github.com/emqx/emqx/blob/master/bin/emqx#L1020

esockd_connection_sup spends the most time terminating its children, we can try to make it even more 'brutal' (kill all connections and don't wait for down messages) but I'm reluctant to change it solely for the purpose of speeding up graceful shutdown. It would look less graceful if we go for it...

@SergeTupchiy
Copy link
Contributor Author

should we have some standard scenario to try to figure out how long this timeout should be?
@paulozulato, I think it would be hard to come up with some exact figures.
Shutdown time seems to depend (at least) on:

  • number of connections/subscriptions
  • hardware/network
  • pool sizes, configured supervisor shutdown timeout

paulozulato
paulozulato previously approved these changes Sep 6, 2023
@SergeTupchiy SergeTupchiy marked this pull request as draft September 6, 2023 14:51
@SergeTupchiy
Copy link
Contributor Author

Converting back to draft, found another issue occurring when emqx start can't succeed in time.
Additionally, it worth having a clearer picture of slow shutdown causes.

paulozulato
paulozulato previously approved these changes Sep 8, 2023
@SergeTupchiy SergeTupchiy force-pushed the EMQX-10835-increase-graceful-stop-timeout branch from 1cfc58a to f790690 Compare September 12, 2023 16:55
@SergeTupchiy SergeTupchiy marked this pull request as ready for review September 12, 2023 17:06
@coveralls
Copy link
Collaborator

Pull Request Test Coverage Report for Build 6162521524

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 36 unchanged lines in 17 files lost coverage.
  • Overall coverage decreased (-0.02%) to 82.03%

Files with Coverage Reduction New Missed Lines %
apps/emqx_bridge_gcp_pubsub/src/emqx_bridge_gcp_pubsub_client.erl 1 94.74%
apps/emqx_bridge_gcp_pubsub/src/emqx_bridge_gcp_pubsub_consumer_worker.erl 1 93.47%
apps/emqx_bridge_http/src/emqx_bridge_http_connector.erl 1 71.67%
apps/emqx_durable_storage/src/emqx_ds_message_storage_bitmask.erl 1 93.38%
apps/emqx_gateway_exproto/src/emqx_exproto_gcli.erl 1 64.15%
apps/emqx_gateway/src/emqx_gateway_utils.erl 1 86.39%
apps/emqx_resource/src/emqx_resource_buffer_worker.erl 1 93.81%
apps/emqx/src/emqx_quic_data_stream.erl 1 75.21%
apps/emqx/src/emqx_router_helper.erl 1 84.91%
apps/emqx/src/emqx_ws_connection.erl 1 76.31%
Totals Coverage Status
Change from base Build 6161819071: -0.02%
Covered Lines: 33023
Relevant Lines: 40257

💛 - Coveralls

@SergeTupchiy SergeTupchiy merged commit cf334d5 into emqx:master Sep 13, 2023
131 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants