Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix Flaky-test: BookieZKExpireTest.testBookieServerZKSessionExpireBehaviour #3418

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

wenbingshen
Copy link
Member

Motivation

fix Flaky-test: BookieZKExpireTest.testBookieServerZKSessionExpireBehaviour by disable retry session expired.

issue #3206 was fixed by PR #3415, but it still introduced another flaky test: BookieZKExpireTest.testBookieServerZKSessionExpireBehaviour

According to the investigation, the reason is:
ZookeeperClient still creates a new zookeeper instance when the old zookeeper client session time out.

Due to the asynchronous execution of two threads executing bookie temporary node re-registration and zk instance re-creation, the test program sometimes succeeds and sometimes fails.

  1. When the temporary node re-registration is performed before the zk re-instantiation, the temporary node creation will use the old zk instance, which will cause a session timeout error, the bookie service will be shutdown, and the test will be successful;

  2. When the zk re-instantiation precedes the re-registration of the temporary node, the temporary node creation will use the new re-instantiated zk instance, then the temporary node will be successfully created, the bookie service is running normally, and the test fails.

        try {
            connectExecutor.submit(clientCreator);
        } catch (RejectedExecutionException ree) {
            if (!closed.get()) {
                logger.error("ZooKeeper reconnect task is rejected : ", ree);
            }
        } catch (Exception t) {
            logger.error("Failed to submit zookeeper reconnect task due to runtime exception : ", t);
        }

Changes

Add a retryExpired flag to indicate whether to run the zk instance and retry to create a new instance after the session times out.
Set this flag to false for ZKMetadataBookieDriver;
Other ZookeeperClient and normal ZookeeperClient applications will generate the default value true or set to true, which is consistent with the original behavior.

Test the behavior of this PR:

Before this PR:
Executed the test 10 times, all failed.
image

After this PR:
Executed the test 10 times, all successful.
image

Comment on lines +369 to +374

if (!retryExpired) {
logger.warn("ZooKeeper session expired retries have been disabled");
return;
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recreating new zookeeper instances is prohibited here.

Comment on lines +182 to 185
// allow client.waitForConnection() timeout
Thread.sleep(10000);
assertFalse("Bookie should shutdown on losing zk session", server.isBookieRunning());
assertFalse("Bookie Server should shutdown on losing zk session", server.isRunning());
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The increase from 3 seconds to 10 seconds is because creating a bookie temporary node needs to wait for the old zk client to establish a connection with the server. There is a timeout period of conf.getZkTimeout()=6 seconds. Once the timeout period expires, the bookie service will Closed due to session timeout, the extra 4 seconds is to consolidate the stability of the test and give the bookie service a little extra time to perform the shutdown.

@wenbingshen
Copy link
Member Author

ping @dlg99 @hangc0276 @shoothzj @eolivelli PTAL.

@HQebupt
Copy link
Contributor

HQebupt commented Jul 23, 2022

@StevenLuMT PTAL

Copy link
Contributor

@StevenLuMT StevenLuMT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a better way to fix it?

@@ -165,7 +165,8 @@ public String getScheme() {
protected void initialize(AbstractConfiguration<?> conf,
StatsLogger statsLogger,
RetryPolicy zkRetryPolicy,
Optional<Object> optionalCtx) throws MetadataException {
Optional<Object> optionalCtx,
boolean zkRetryExpired) throws MetadataException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this check(zkRetryExpired) is difficult to understand, what scene is set to true and what scene is set to false

@dlg99
Copy link
Contributor

dlg99 commented Jul 26, 2022

can you predictably repro the problem? Maybe my fix was not good enough :( but I only could repro it under docker with limited number of CPUs and after the fix it did not fail after running in the loop 150 times.
BookieStateManager's registration listener will shutdown the bookie if it cannot reconnect immediately so I guess there is still some timing issue between it and the test killing zk session / letting it recover.

@hangc0276 hangc0276 modified the milestones: 4.16.0, 4.17.0 Jul 29, 2022
@StevenLuMT
Copy link
Contributor

fix old workflow,please see #3455 for detail

@poorbarcode
Copy link
Contributor

@dlg99

can you predictably repro the problem? Maybe my fix was not good enough :( but I only could repro it under docker with limited number of CPUs and after the fix it did not fail after running in the loop 150 times.
BookieStateManager's registration listener will shutdown the bookie if it cannot reconnect immediately so I guess there is still some timing issue between it and the test killing zk session / letting it recover.

When I run the test in my local, it is very flaky

@hangc0276
Copy link
Contributor

@dlg99

can you predictably repro the problem? Maybe my fix was not good enough :( but I only could repro it under docker with limited number of CPUs and after the fix it did not fail after running in the loop 150 times.
BookieStateManager's registration listener will shutdown the bookie if it cannot reconnect immediately so I guess there is still some timing issue between it and the test killing zk session / letting it recover.

When I run the test in my local, it is very flaky

@poorbarcode There is a race condition describe in the motivation.

zymap pushed a commit that referenced this pull request Dec 7, 2023
…4144)

### Motivation
The `BookieZKExpireTest.testBookieServerZKSessionExpireBehaviour` is a flaky test and the root cause is described in #3418. 

There is a race condition in the zookeeper reconnection and listener and it is an expected behavior in Bookie.

We created one issue to track this #4142. 

We can skip this flaky test to unblock the pending PRs.
zymap pushed a commit that referenced this pull request Dec 8, 2023
…4144)

### Motivation
The `BookieZKExpireTest.testBookieServerZKSessionExpireBehaviour` is a flaky test and the root cause is described in #3418.

There is a race condition in the zookeeper reconnection and listener and it is an expected behavior in Bookie.

We created one issue to track this #4142.

We can skip this flaky test to unblock the pending PRs.

(cherry picked from commit 11ccebb)
yangl pushed a commit to yangl/bookkeeper that referenced this pull request Dec 11, 2023
…pache#4144)

### Motivation
The `BookieZKExpireTest.testBookieServerZKSessionExpireBehaviour` is a flaky test and the root cause is described in apache#3418. 

There is a race condition in the zookeeper reconnection and listener and it is an expected behavior in Bookie.

We created one issue to track this apache#4142. 

We can skip this flaky test to unblock the pending PRs.
hangc0276 added a commit to hangc0276/bookkeeper that referenced this pull request Jan 18, 2024
…pache#4144)

### Motivation
The `BookieZKExpireTest.testBookieServerZKSessionExpireBehaviour` is a flaky test and the root cause is described in apache#3418.

There is a race condition in the zookeeper reconnection and listener and it is an expected behavior in Bookie.

We created one issue to track this apache#4142.

We can skip this flaky test to unblock the pending PRs.

(cherry picked from commit 11ccebb)
@eolivelli eolivelli removed this from the 4.17.0 milestone May 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants