Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ZOOKEEPER-2577: Fix flaky testDuringLeaderSync test. #315

Closed
wants to merge 1 commit into from

Conversation

hanm
Copy link
Contributor

@hanm hanm commented Jul 24, 2017

testDuringLeaderSync uses the presence of intermediate zoo.cfg.dynamic file to decide if the reconfig operation was succeeded or not. This is not a problem and is logically correct, however in tests that provisions QuorumPeer directly through MainThread, the configFile will be null and the resulted intermediate dynamic reconfig file will has a name of "null.cfg.dynamic". This causes problem because multiple test cases use MainThread to provision QuorumPeer so these tests will share a single "null.cfg.dynamic" file, as opposed to the zoo.cfg.dynamic file in their specific test folder when configFile was not null. Since we support running concurrent ant unit tests in apache build infrastructure, it is highly likely that multiple tests that rely on this shared null.cfg.dynamic file will execute simultaneously which create all sorts of failure cases (this also explains why if one tries to reproduce the test failures in a standalone fashion the test will always pass, with the absence of the file sharing.).

This is problematic in multiple test cases and in particular cause this testDuringLeaderSync fail fairly frequently. There are multiple solutions to this problem:

  • Implement retry with timeout logic when detecting the presence of intermediate config files in testDuringLeaderSync. This will fix this specific test but the fix is non-deterministic and other tests might still fail because of sharing of null.cfg.dynamic file.

  • Always make sure to to feed a real config file when provision QuorumPeer. This however is an overkill as some cases a pure artificial QuorumPeer w/o real config file fit the use case well.

The approach taking here is to for this specific test, making sure we always have a configured confFileName, and it is pretty trivial to do so. For other tests, where the intermediate config file name is not important or irrelevant (e.g. FLE related tests), they will still have null confFileName because they directly provision QuorumPeer during test.

Testing done: running this patch on apache jenkins for the past week with 500 runs. Not only this test is fixed but overall stability of entirely unit tests are improved.

@@ -1451,7 +1452,12 @@ public synchronized void restartLeaderElection(QuorumVerifier qvOLD, QuorumVerif
}

public String getNextDynamicConfigFilename() {
return configFilename + QuorumPeerConfig.nextDynamicConfigFileSuffix;
if (configFilename != null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments:

  • It's be better if configFileName is never null. But that may be a bigger change that we don't want to do here
  • If its null, it may be better to return null here and avoid calling writeDynamicConfig.
  • The calling test should either initiate configFileName properly or not rely on its existance to check reconfig completion. You can do the latter by looking at the quorumVerified directly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for review, Alex. I'll look into making configFileName not null for this test, which is more deterministic than a random id generator.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shralex Turns out pretty trivial to guarantee a real configFileName in this case - update with one line change. Tested on apache jenkins the issue is fixed.

Note we still have configFileName as null for some other tests (such as FLE tests) but in those tests the intermediate reconfig file is not relevant so I'll just leave as they are.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update pull request based on offline chat with Alex. The idea is to detect the null cases for configFileName and for those cases don't write intermediate reconfig file to disk because if we do, the shared "null.dynamic.cfg" file will be overwritten and might likely break other things without notice.

asfgit pushed a commit that referenced this pull request Jul 27, 2017
testDuringLeaderSync uses the presence of intermediate zoo.cfg.dynamic file to decide if the reconfig operation was succeeded or not. This is not a problem and is logically correct, however in tests that provisions QuorumPeer directly through MainThread, the configFile will be null and the resulted intermediate dynamic reconfig file will has a name of "null.cfg.dynamic". This causes problem because multiple test cases use MainThread to provision QuorumPeer so these tests will share a single "null.cfg.dynamic" file, as opposed to the zoo.cfg.dynamic file in their specific test folder when configFile was not null. Since we support running concurrent ant unit tests in apache build infrastructure, it is highly likely that multiple tests that rely on this shared null.cfg.dynamic file will execute simultaneously which create all sorts of failure cases (this also explains why if one tries to reproduce the test failures in a standalone fashion the test will always pass, with the absence of the file sharing.).

This is problematic in multiple test cases and in particular cause this testDuringLeaderSync fail fairly frequently. There are multiple solutions to this problem:

* Implement retry with timeout logic when detecting the presence of intermediate config files in testDuringLeaderSync. This will fix this specific test but the fix is non-deterministic and other tests might still fail because of sharing of null.cfg.dynamic file.

* Always make sure to to feed a real config file when provision QuorumPeer. This however is an overkill as some cases a pure artificial QuorumPeer w/o real config file fit the use case well.

The approach taking here is to for this specific test, making sure we always have a configured confFileName, and it is pretty trivial to do so. For other tests, where the intermediate config file name is not important or irrelevant (e.g. FLE related tests), they will still have null confFileName because they directly provision QuorumPeer during test.

Testing done: running this patch on apache jenkins for the past week with 500 runs. Not only this test is fixed but overall stability of entirely unit tests are improved.

Author: Michael Han <hanm@apache.org>

Reviewers: Alexander Shraer <shralex@gmail.com>

Closes #315 from hanm/working
@asfgit asfgit closed this in 32794ca Jul 27, 2017
lvfangmin pushed a commit to lvfangmin/zookeeper that referenced this pull request Jun 17, 2018
testDuringLeaderSync uses the presence of intermediate zoo.cfg.dynamic file to decide if the reconfig operation was succeeded or not. This is not a problem and is logically correct, however in tests that provisions QuorumPeer directly through MainThread, the configFile will be null and the resulted intermediate dynamic reconfig file will has a name of "null.cfg.dynamic". This causes problem because multiple test cases use MainThread to provision QuorumPeer so these tests will share a single "null.cfg.dynamic" file, as opposed to the zoo.cfg.dynamic file in their specific test folder when configFile was not null. Since we support running concurrent ant unit tests in apache build infrastructure, it is highly likely that multiple tests that rely on this shared null.cfg.dynamic file will execute simultaneously which create all sorts of failure cases (this also explains why if one tries to reproduce the test failures in a standalone fashion the test will always pass, with the absence of the file sharing.).

This is problematic in multiple test cases and in particular cause this testDuringLeaderSync fail fairly frequently. There are multiple solutions to this problem:

* Implement retry with timeout logic when detecting the presence of intermediate config files in testDuringLeaderSync. This will fix this specific test but the fix is non-deterministic and other tests might still fail because of sharing of null.cfg.dynamic file.

* Always make sure to to feed a real config file when provision QuorumPeer. This however is an overkill as some cases a pure artificial QuorumPeer w/o real config file fit the use case well.

The approach taking here is to for this specific test, making sure we always have a configured confFileName, and it is pretty trivial to do so. For other tests, where the intermediate config file name is not important or irrelevant (e.g. FLE related tests), they will still have null confFileName because they directly provision QuorumPeer during test.

Testing done: running this patch on apache jenkins for the past week with 500 runs. Not only this test is fixed but overall stability of entirely unit tests are improved.

Author: Michael Han <hanm@apache.org>

Reviewers: Alexander Shraer <shralex@gmail.com>

Closes apache#315 from hanm/working
RokLenarcic pushed a commit to RokLenarcic/zookeeper that referenced this pull request Sep 3, 2022
testDuringLeaderSync uses the presence of intermediate zoo.cfg.dynamic file to decide if the reconfig operation was succeeded or not. This is not a problem and is logically correct, however in tests that provisions QuorumPeer directly through MainThread, the configFile will be null and the resulted intermediate dynamic reconfig file will has a name of "null.cfg.dynamic". This causes problem because multiple test cases use MainThread to provision QuorumPeer so these tests will share a single "null.cfg.dynamic" file, as opposed to the zoo.cfg.dynamic file in their specific test folder when configFile was not null. Since we support running concurrent ant unit tests in apache build infrastructure, it is highly likely that multiple tests that rely on this shared null.cfg.dynamic file will execute simultaneously which create all sorts of failure cases (this also explains why if one tries to reproduce the test failures in a standalone fashion the test will always pass, with the absence of the file sharing.).

This is problematic in multiple test cases and in particular cause this testDuringLeaderSync fail fairly frequently. There are multiple solutions to this problem:

* Implement retry with timeout logic when detecting the presence of intermediate config files in testDuringLeaderSync. This will fix this specific test but the fix is non-deterministic and other tests might still fail because of sharing of null.cfg.dynamic file.

* Always make sure to to feed a real config file when provision QuorumPeer. This however is an overkill as some cases a pure artificial QuorumPeer w/o real config file fit the use case well.

The approach taking here is to for this specific test, making sure we always have a configured confFileName, and it is pretty trivial to do so. For other tests, where the intermediate config file name is not important or irrelevant (e.g. FLE related tests), they will still have null confFileName because they directly provision QuorumPeer during test.

Testing done: running this patch on apache jenkins for the past week with 500 runs. Not only this test is fixed but overall stability of entirely unit tests are improved.

Author: Michael Han <hanm@apache.org>

Reviewers: Alexander Shraer <shralex@gmail.com>

Closes apache#315 from hanm/working
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants