Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Graceful shutdown stopped working on master #10289

Closed
merlimat opened this issue Apr 20, 2021 · 6 comments
Closed

Graceful shutdown stopped working on master #10289

merlimat opened this issue Apr 20, 2021 · 6 comments
Assignees
Labels
type/bug The PR fixed a bug or issue reported a bug
Milestone

Comments

@merlimat
Copy link
Contributor

Trying to do a graceful stop of a broker with CTRL-C will fail due to an unexpected exception and then it will shutdown after 30 sec.

The logs show:

2021-04-20 12:23:28,621 [jdk.internal.loader.ClassLoaders$AppClassLoader@73d16e93] error Uncaught exception in thread shutdown-thread-52-1: org/apache/pulsar/broker/service/GracefulExecutorServicesShutdown

@lhotari I think this is probably related to #10199

@merlimat merlimat added the type/bug The PR fixed a bug or issue reported a bug label Apr 20, 2021
@merlimat merlimat added this to the 2.8.0 milestone Apr 20, 2021
@lhotari
Copy link
Member

lhotari commented Apr 20, 2021

@merlimat Thanks for the report. I'll take a look tomorrow.

@lhotari
Copy link
Member

lhotari commented Apr 21, 2021

@merlimat I wasn't able to reproduce this issue. I started pulsar directly in the repository directory with ./bin/pulsar standalone and then hit CTRL-C. This is the output I get:

2021-04-21 05:47:37,252 Thread-1 WARN Unable to register Log4j shutdown hook because JVM is shutting down. Using SimpleLogger
05:47:38.205 [Thread-1] INFO  org.apache.pulsar.broker.service.GracefulExecutorServicesTerminationHandler - Starting termination handler for 4 executors.
05:47:38.205 [Thread-1] INFO  org.apache.pulsar.broker.service.GracefulExecutorServicesTerminationHandler - Shutdown completed.
05:47:42.555 [globalEventExecutor-2-7] INFO  org.apache.pulsar.broker.service.GracefulExecutorServicesTerminationHandler - Starting termination handler for 8 executors.
05:47:42.556 [globalEventExecutor-2-7] INFO  org.apache.pulsar.broker.service.GracefulExecutorServicesTerminationHandler - Shutdown completed.

and the JVM stops as expected.

I wonder if you were able to capture the exception stacktrace?

@merlimat
Copy link
Contributor Author

No, I couldn’t see any stack trace. I think it was just the JVM default exception handler when the thread is terminated.

I also cannot reproduce always, though I’ve seen it few times. I think we should prob just add a catch Throwable in there and it should fix it.

One other minor thing that is reproducible is there seems to be always a 5 sec delay when the executors are being shut down. It seems like we’re always letting some operation to time out.

@lhotari
Copy link
Member

lhotari commented Apr 21, 2021

No, I couldn’t see any stack trace. I think it was just the JVM default exception handler when the thread is terminated.

I added printing of the stacktrace in #10304 .

I also cannot reproduce always, though I’ve seen it few times. I think we should prob just add a catch Throwable in there and it should fix it.

I didn't yet add any new catches. I expect that #10304 changes will help find the underlying issue.

One other minor thing that is reproducible is there seems to be always a 5 sec delay when the executors are being shut down. It seems like we’re always letting some operation to time out.

This is expected behavior. For Netty Event Loop Shutdown, there's a quiet time parameter. It's 2 seconds by default. I have increased it to use a fraction of the configured brokerShutdownTimeoutMs with the maximum value of 5 seconds.

CompletableFuture<Void> shutdownEventLoopGracefully(EventLoopGroup eventLoopGroup) {
long brokerShutdownTimeoutMs = pulsar.getConfiguration().getBrokerShutdownTimeoutMs();
long quietPeriod = Math.min((long) (
GRACEFUL_SHUTDOWN_QUIET_PERIOD_RATIO_OF_TOTAL_TIMEOUT * brokerShutdownTimeoutMs),
GRACEFUL_SHUTDOWN_QUIET_PERIOD_MAX_MS);
long timeout = (long) (GRACEFUL_SHUTDOWN_TIMEOUT_RATIO_OF_TOTAL_TIMEOUT * brokerShutdownTimeoutMs);
return NettyFutureUtil.toCompletableFutureVoid(
eventLoopGroup.shutdownGracefully(quietPeriod,
timeout, TimeUnit.MILLISECONDS));
}

Previously there was no delay since the shutdown didn't wait for the event loops to close at all. The default quiet time is 2 seconds and shutdown timeout 15 seconds in Netty. Those values were used, but the asynchronous return value was ignored.

@lhotari
Copy link
Member

lhotari commented Apr 26, 2021

It might not be directly related to this issue, but #10365 fixed an issue in tests where the shutdown didn't complete gracefully.

@codelipenghui codelipenghui modified the milestones: 2.8.0, 2.9.0 May 22, 2021
@lhotari
Copy link
Member

lhotari commented Jun 2, 2021

This should be fixed by #10304. Please re-open if the problem re-occurs.

@lhotari lhotari closed this as completed Jun 2, 2021
@lhotari lhotari modified the milestones: 2.9.0, 2.8.0 Jun 2, 2021
@lhotari lhotari self-assigned this Jun 2, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/bug The PR fixed a bug or issue reported a bug
Projects
None yet
Development

No branches or pull requests

3 participants