Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Shutdown Broker gracefully, but forcefully after brokerShutdownTimeoutMs #10199

Merged
merged 17 commits into from
Apr 16, 2021

Conversation

lhotari
Copy link
Member

@lhotari lhotari commented Apr 12, 2021

Motivation

Broker shutdown isn't currently graceful although the intention for it is to be graceful.
Another problem in tests is that broker shutdown doesn't shutdown synchronously.
These 2 problems are addressed in this PR.

The changes will also help improve CI stability when the asynchronous tasks of broker shutdown can be controlled in tests. This prevents problems which are caused by too many brokers being active at the same time. This could currently happen when previous brokers are asynchronously shutting down.

Modifications

  • Goal of changes: shutdown Broker gracefully, but forcefully after brokerShutdownTimeoutMs
    • wait for event loops getting closed before continuing to close other services in broker shutdown
    • configure event loop shutdown parameters since the default shutdown timeout is 15 seconds after a 2 second idle time.
    • close executor services forcefully if the broker shutdown times out
    • use shutdown timeout of 0 in tests. This triggers immediate forceful shutdown.

@lhotari
Copy link
Member Author

lhotari commented Apr 12, 2021

This PRs continues the work started in PR #9308 .

Copy link
Member

@michaeljmarshall michaeljmarshall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am new to the broker's clean shutdown, but based on reading through your PR, these changes look good to me other than single variable that I think might benefit from the volatile keyword.

@lhotari lhotari force-pushed the lh-completely-shutdown-broker branch from dfc6747 to 32fa7d1 Compare April 13, 2021 06:49
@codelipenghui codelipenghui added this to the 2.8.0 milestone Apr 13, 2021
@lhotari lhotari force-pushed the lh-completely-shutdown-broker branch 2 times, most recently from 405c093 to e1febb1 Compare April 15, 2021 09:23
@lhotari
Copy link
Member Author

lhotari commented Apr 15, 2021

/pulsarbot run-failure-checks

1 similar comment
@lhotari
Copy link
Member Author

lhotari commented Apr 15, 2021

/pulsarbot run-failure-checks

@lhotari lhotari force-pushed the lh-completely-shutdown-broker branch from 79e800a to 4e7488a Compare April 15, 2021 15:05
@sijie
Copy link
Member

sijie commented Apr 15, 2021

@merlimat to review

@lhotari
Copy link
Member Author

lhotari commented Apr 15, 2021

/pulsarbot run-failure-checks

Copy link
Contributor

@merlimat merlimat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 The approach LGTM. Just a couple of questions on the await termination

@lhotari
Copy link
Member Author

lhotari commented Apr 16, 2021

The approach LGTM. Just a couple of questions on the await termination

@merlimat Thank you for your review and good points about await termination.
Please ignore some of the replies I made since I didn't look at the full context of the code when replying.
I revisited the code after another check at your review comments.
I have rewritten the solution to use .awaitTermination and replaced the previous solution that used a ScheduledExecutorService. Please take another look.

@lhotari lhotari force-pushed the lh-completely-shutdown-broker branch from 532e51c to e4ad678 Compare April 16, 2021 12:15
@lhotari
Copy link
Member Author

lhotari commented Apr 16, 2021

/pulsarbot run-failure-checks

@merlimat merlimat merged commit 152d1e6 into apache:master Apr 16, 2021
jiazhai pushed a commit to streamnative/kop that referenced this pull request Jul 29, 2021
### Motivation

The KoP CI tests take much more time than CI tests of branch-2.7.2. 

The main reason is the `cleanup()` phase takes a long time, each time a test is cleaned up, it will take over 10 seconds to complete. This behavior was introduced from apache/pulsar#10199, which made broker shutdown gracefully by default but it would take longer to shutdown.

The other reason is caused by rebalance time. According to my observes, when a Kafka consumer subscribes a topic in KoP, it will take at least 3 seconds. Finally I found it's caused by the GroupInitialRebalanceDelayMs config, which has the same semantics with Kafka's [group.initial.rebalance.delay.ms](https://kafka.apache.org/documentation/#brokerconfigs_group.initial.rebalance.delay.ms). It makes Kafka server wait longer for `JOIN_GROUP` request for more consumers to join so that the rebalance count can reduce. However, it should be set zero  in tests.

After fixing these problems, sometimes the following error may happen and cause flakiness.

```
TwoPhaseCompactor$MockitoMock$1432102834 cannot be returned by getConfiguration()
getConfiguration() should return ServiceConfiguration
***
If you're unsure why you're getting above error read on.
Due to the nature of the syntax above problem might occur because:
1. This exception *might* occur in wrongly written multi-threaded tests.
   Please refer to Mockito FAQ on limitations of concurrency testing.
2. A spy is stubbed using when(spy.foo()).then() syntax. It is safer to stub spies - 
   - with doReturn|Throw() family of methods. More in javadocs for Mockito.spy() method.
```

It's because `PulsarService#newCompactor` is not mocked well, see apache/pulsar#7102 for detail.

### Modifications

- Configure `GroupInitialRebalanceDelayMs` and `BrokerShutdownTimeoutMs` for each mocked `BrokerService`.
- Fix the flakiness caused by mocking compactor.

After the changes, the tests time has a significant improvement.

For example, `GroupCoordinatorTest` takes only 3 minutes now but it could take 9 minutes before. Because the `cleanup()` method is marked as `@AfterMethod` and would be called each time a single test finished.

Another example is that `BasicEndToEndKafkaTest` takes only 37 seconds now but it could take 56 seconds before. The `cleanup()` is marked as `@AfterClass` and only happens once, but many consumers will be created during the whole tests and each time a subscribe call can take 3 seconds.

* Speed up tests and fix flakiness

* Speed up tests that creat their own configs

* Ignore golang-sarama tests
BewareMyPower added a commit to streamnative/kop that referenced this pull request Aug 5, 2021
### Motivation

The KoP CI tests take much more time than CI tests of branch-2.7.2. 

The main reason is the `cleanup()` phase takes a long time, each time a test is cleaned up, it will take over 10 seconds to complete. This behavior was introduced from apache/pulsar#10199, which made broker shutdown gracefully by default but it would take longer to shutdown.

The other reason is caused by rebalance time. According to my observes, when a Kafka consumer subscribes a topic in KoP, it will take at least 3 seconds. Finally I found it's caused by the GroupInitialRebalanceDelayMs config, which has the same semantics with Kafka's [group.initial.rebalance.delay.ms](https://kafka.apache.org/documentation/#brokerconfigs_group.initial.rebalance.delay.ms). It makes Kafka server wait longer for `JOIN_GROUP` request for more consumers to join so that the rebalance count can reduce. However, it should be set zero  in tests.

After fixing these problems, sometimes the following error may happen and cause flakiness.

```
TwoPhaseCompactor$MockitoMock$1432102834 cannot be returned by getConfiguration()
getConfiguration() should return ServiceConfiguration
***
If you're unsure why you're getting above error read on.
Due to the nature of the syntax above problem might occur because:
1. This exception *might* occur in wrongly written multi-threaded tests.
   Please refer to Mockito FAQ on limitations of concurrency testing.
2. A spy is stubbed using when(spy.foo()).then() syntax. It is safer to stub spies - 
   - with doReturn|Throw() family of methods. More in javadocs for Mockito.spy() method.
```

It's because `PulsarService#newCompactor` is not mocked well, see apache/pulsar#7102 for detail.

### Modifications

- Configure `GroupInitialRebalanceDelayMs` and `BrokerShutdownTimeoutMs` for each mocked `BrokerService`.
- Fix the flakiness caused by mocking compactor.

After the changes, the tests time has a significant improvement.

For example, `GroupCoordinatorTest` takes only 3 minutes now but it could take 9 minutes before. Because the `cleanup()` method is marked as `@AfterMethod` and would be called each time a single test finished.

Another example is that `BasicEndToEndKafkaTest` takes only 37 seconds now but it could take 56 seconds before. The `cleanup()` is marked as `@AfterClass` and only happens once, but many consumers will be created during the whole tests and each time a subscribe call can take 3 seconds.

* Speed up tests and fix flakiness

* Speed up tests that creat their own configs

* Ignore golang-sarama tests
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants