fix(nodetool): increase graceful stop timeout #11567

SergeTupchiy · 2023-09-05T16:42:49Z

1 core, 1 replicant cluster on a local machine test (i5-7200U 2 physical cores, 16GB RAM).
before this fix:
200K clients, ~1.2M topics/subscriptions:

root@serge-Latitude-5480:/home/serge/work/emqx/tmp/emqx1# date; bin/emqx stop; date
16:32:25 +0300
escript: exception error: no case clause matching {badrpc,timeout}
  in function  nodetool__escript__1693__920746__671869__2:do/1 (/home/serge/work/emqx/tmp/emqx1/bin/nodetool, line 88)
  in call from escript:run/2 (escript.erl, line 750)
  in call from escript:start/1 (escript.erl, line 277)
  in call from init:start_em/1 
  in call from init:do_boot/3 
ERROR: Graceful shutdown failed PID=[80468]
16:33:27 +0300

Replicant config was tuned for high-latency network:

node {
  name = "emqx1@127.0.0.1"
  cookie = "emqxsecretcookie"
  data_dir = "data"
  db_role = replicant
  broker_pool_size = 4000
  generic_pool_size = 2000
  channel_cleanup_batch_size = 10
  rpc_module = rpc
}

Increased pool sizes may contribute to the longer graceful shutdown time.
I haven't noticed any issues, it simply looks like 1 minute rpc timeout is too low to gracefully shutdown a highly loaded node.
As expected, emqx app shutdown time is the longest:

2023-09-05T17:20:48.763183+03:00 [debug] msg: stopping_app, mfa: emqx_machine_boot:stop_one_app/1, line: 89, app: emqx
2023-09-05T17:20:48.770902+03:00 [notice] ssl:default stopped on 0.0.0.0:8884
2023-09-05T17:20:48.779091+03:00 [notice] tcp:default stopped on 0.0.0.0:1884
2023-09-05T17:22:27.532862+03:00 [notice] Application: emqx. Exited: stopped. Type: permanent.

Summary

`🤖 Generated by Copilot at 2a8d6bd`

This change improves the documentation of a pull request that enhances the graceful shutdown feature of EMQX. It adds a file fix-11567.en.md that explains the increased timeout and the error message for the shutdown process.

PR Checklist

Please convert it to a draft if any of the following conditions are not met. Reviewers may skip over until all the items are checked:

Added tests for the changes
Added property-based tests for code which performs user input validation
Changed lines covered in coverage report
Change log has been added to changes/(ce|ee)/(feat|perf|fix)-<PR-id>.en.md files
For internal contributor: there is a jira ticket to track this change
Created PR to emqx-docs if documentation update is required, or link to a follow-up jira ticket
Schema changes are backward compatible

Checklist for CI (.github/workflows) changes

If changed package build workflow, pass this action (manual trigger)
Change log has been added to changes/ dir for user-facing artifacts update

paulozulato

should we have some standard scenario to try to figure out how long this timeout should be?

zhongwencool · 2023-09-06T01:37:38Z

bin/nodetool

@@ -8,6 +8,8 @@
 %% -------------------------------------------------------------------
 -mode(compile).

+-define(SHUTDOWN_TIMEOUT_MS, 300_000).


5 minutes is too long for user to wait.
Mostly users will ctrl + c to stop waiting for so long time with panic.
Maybe we should spawn a child process to print the stop process is ongoing.

Thanks @zhongwencool, added a periodic status messages while EMQX is stopping:

# bin/emqx stop EMQX is shutting down, please wait... EMQX is shutting down, please wait... EMQX is shutting down, please wait... EMQX is shutting down, please wait... ok

After some more testing, I decreased the timeout from 5 to 2 min. I wasn't able to re-produce the case when 2min was not enough to shutdown gracefully.
2 min also coincides with the default wait period in EMQX script:
https://github.com/emqx/emqx/blob/master/bin/emqx#L1020

esockd_connection_sup spends the most time terminating its children, we can try to make it even more 'brutal' (kill all connections and don't wait for down messages) but I'm reluctant to change it solely for the purpose of speeding up graceful shutdown. It would look less graceful if we go for it...

SergeTupchiy · 2023-09-06T09:47:43Z

should we have some standard scenario to try to figure out how long this timeout should be?
@paulozulato, I think it would be hard to come up with some exact figures.
Shutdown time seems to depend (at least) on:

number of connections/subscriptions
hardware/network
pool sizes, configured supervisor shutdown timeout

SergeTupchiy · 2023-09-06T14:55:32Z

Converting back to draft, found another issue occurring when emqx start can't succeed in time.
Additionally, it worth having a clearer picture of slow shutdown causes.

…drpc, timeout}` error

…less of `-p <pid>`

…int warning messages

coveralls · 2023-09-13T09:51:16Z

Pull Request Test Coverage Report for Build 6162521524

0 of 0 changed or added relevant lines in 0 files are covered.
36 unchanged lines in 17 files lost coverage.
Overall coverage decreased (-0.02%) to 82.03%

Files with Coverage Reduction	New Missed Lines	%
apps/emqx_bridge_gcp_pubsub/src/emqx_bridge_gcp_pubsub_client.erl	1	94.74%
apps/emqx_bridge_gcp_pubsub/src/emqx_bridge_gcp_pubsub_consumer_worker.erl	1	93.47%
apps/emqx_bridge_http/src/emqx_bridge_http_connector.erl	1	71.67%
apps/emqx_durable_storage/src/emqx_ds_message_storage_bitmask.erl	1	93.38%
apps/emqx_gateway_exproto/src/emqx_exproto_gcli.erl	1	64.15%
apps/emqx_gateway/src/emqx_gateway_utils.erl	1	86.39%
apps/emqx_resource/src/emqx_resource_buffer_worker.erl	1	93.81%
apps/emqx/src/emqx_quic_data_stream.erl	1	75.21%
apps/emqx/src/emqx_router_helper.erl	1	84.91%
apps/emqx/src/emqx_ws_connection.erl	1	76.31%

Totals
Change from base Build 6161819071:	-0.02%
Covered Lines:	33023
Relevant Lines:	40257

💛 - Coveralls

SergeTupchiy force-pushed the EMQX-10835-increase-graceful-stop-timeout branch from d05a15b to 2a8d6bd Compare September 5, 2023 16:50

SergeTupchiy marked this pull request as ready for review September 5, 2023 16:50

SergeTupchiy requested a review from a team as a code owner September 5, 2023 16:50

paulozulato previously approved these changes Sep 5, 2023

View reviewed changes

zhongwencool reviewed Sep 6, 2023

View reviewed changes

SergeTupchiy dismissed paulozulato’s stale review via ab3efa2 September 6, 2023 09:41

SergeTupchiy force-pushed the EMQX-10835-increase-graceful-stop-timeout branch from 2a8d6bd to ab3efa2 Compare September 6, 2023 09:41

paulozulato previously approved these changes Sep 6, 2023

View reviewed changes

SergeTupchiy marked this pull request as draft September 6, 2023 14:51

SergeTupchiy dismissed paulozulato’s stale review via 1cfc58a September 8, 2023 13:07

paulozulato previously approved these changes Sep 8, 2023

View reviewed changes

SergeTupchiy added 4 commits September 12, 2023 19:55

fix(nodetool): increase graceful stop timeout, handle and report `{ba…

9de9631

…drpc, timeout}` error

feat(nodetool): print shutdown status messages while EMQX is stopping

9b3f88a

fix(bin/emqx): remove ps -e opt, as it lists all processes regard…

6dd9e54

…less of `-p <pid>`

fix(bin/emqx): don't suppress wait_for command output, as it can pr…

f790690

…int warning messages

SergeTupchiy dismissed paulozulato’s stale review via f790690 September 12, 2023 16:55

SergeTupchiy force-pushed the EMQX-10835-increase-graceful-stop-timeout branch from 1cfc58a to f790690 Compare September 12, 2023 16:55

SergeTupchiy marked this pull request as ready for review September 12, 2023 17:06

zmstone approved these changes Sep 13, 2023

View reviewed changes

SergeTupchiy merged commit cf334d5 into emqx:master Sep 13, 2023
131 checks passed

Moisan mentioned this pull request Sep 30, 2023

emqx 5.3.0 Homebrew/homebrew-core#147463

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(nodetool): increase graceful stop timeout #11567

fix(nodetool): increase graceful stop timeout #11567

SergeTupchiy commented Sep 5, 2023 •

edited by ghost

paulozulato left a comment

zhongwencool Sep 6, 2023

SergeTupchiy Sep 6, 2023

SergeTupchiy Sep 12, 2023

SergeTupchiy commented Sep 6, 2023

SergeTupchiy commented Sep 6, 2023

coveralls commented Sep 13, 2023

fix(nodetool): increase graceful stop timeout #11567

fix(nodetool): increase graceful stop timeout #11567

Conversation

SergeTupchiy commented Sep 5, 2023 • edited by ghost

Summary

🤖 Generated by Copilot at 2a8d6bd

PR Checklist

Checklist for CI (.github/workflows) changes

paulozulato left a comment

Choose a reason for hiding this comment

zhongwencool Sep 6, 2023

Choose a reason for hiding this comment

SergeTupchiy Sep 6, 2023

Choose a reason for hiding this comment

SergeTupchiy Sep 12, 2023

Choose a reason for hiding this comment

SergeTupchiy commented Sep 6, 2023

SergeTupchiy commented Sep 6, 2023

coveralls commented Sep 13, 2023

Pull Request Test Coverage Report for Build 6162521524

💛 - Coveralls

SergeTupchiy commented Sep 5, 2023 •

edited by ghost

`🤖 Generated by Copilot at 2a8d6bd`