Skip to content

ZOOKEEPER-2722: fix flaky testSessionEstablishment test.#191

Closed
hanm wants to merge 2 commits intoapache:branch-3.5from
hanm:ZOOKEEPER-2722
Closed

ZOOKEEPER-2722: fix flaky testSessionEstablishment test.#191
hanm wants to merge 2 commits intoapache:branch-3.5from
hanm:ZOOKEEPER-2722

Conversation

@hanm
Copy link
Contributor

@hanm hanm commented Mar 14, 2017

Make sure client is connected to a quorum before issuing a write operation to avoid possible race condition between connected to a RO server and forming a new quorum.

Copy link
Contributor

@skamille skamille left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So are we saying that the watcher.waitForConnected(CONNECTION_TIMEOUT) is not working correctly? Because it seems like in most of the places you've added the check, we check that the watcher has gotten into connected state first.

@hanm
Copy link
Contributor Author

hanm commented Mar 15, 2017

So are we saying that the watcher.waitForConnected(CONNECTION_TIMEOUT) is not working correctly?

I believe this works as expected. I don't see any of the flaky / normal test results complain about this particular check.

Because it seems like in most of the places you've added the check

The check testConnection is not new, it simply improves the robustness of zk.create("/test", "test".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT) by wrapping it with retry, as from Jenkins test log this is the place where ConnectionLossException was thrown. Internal stress test indicates this is effective. I haven't figured out what exactly caused the ConnectionLossException in zk.create call for this particular test...

@hanm
Copy link
Contributor Author

hanm commented Mar 15, 2017

So what happened chronically on Jenkins for this test could be:

  • ZK client connect to server. Watcher received connected (read only) events.
  • For some reasons when invoke zk.create we get a ConnectionLossException.
  • We did not retry zk.create so the test was marked as failed.

@skamille
Copy link
Contributor

Right but my point is, if that creation is failing due to connection loss, shouldn't the places that check the watcher connection fail there instead of in your check? The first place you added the retries there is no watcher connection check, the rest of them there already is, and my understanding is that the watcher should not get past that check if the connection isn't established.

@skamille
Copy link
Contributor

Basically, I think it's worth understanding why we are getting a connection event in the watcher that should be waiting for connection, but still failing by not connecting, instead of fixing this with additional waiting.

@hanm
Copy link
Contributor Author

hanm commented Mar 16, 2017

if that creation is failing due to connection loss, shouldn't the places that check the watcher connection fail there instead of in your check?

ConnectionLossException can happen after a connection between ZooKeeper client and server has been established, right? So having the check only in watcher is not enough. A pass in watcher does not guarantee ConnectionLossException will not occur in a later point in time. Imagine an extreme case where the a network partition happened between client / server after a session establishment - the client will first get a connected event, and watcher happily reports everything is fine, then subsequent operation (e.g. create) will fail with ConnectionLossException until the network healed.

I think it's worth understanding why we are getting a connection event in the watcher that should be waiting for connection, but still failing by not connecting, instead of fixing this with additional waiting.

Yes I'd like to know what causes this though I had a hard time to reproduce this failure locally / in internal Jenkins. It is so far only reproducible in Apache Jenkins. I can add some loggings to capture more contexts when the failure happens on Apache Jenkins, but in that case the retry logic in create is still needed, unless we can prove it is not possible to get a ConnectionLossException after a session establishment.

@skamille
Copy link
Contributor

ConnectionLossException can happen after a connection between ZooKeeper client and server has been established, right? So having the check only in watcher is not enough. A pass in watcher does not guarantee ConnectionLossException will not occur in a later point in time. Imagine an extreme case where the a network partition happened between client / server after a session establishment - the client will first get a connected event, and watcher happily reports everything is fine, then subsequent operation (e.g. create) will fail with ConnectionLossException until the network healed.

Right but we're talking about a test case. If we have the issue that our tests can connect to ZK, then randomly drop connections while in the midst of the testing, that feels like a problem we should figure out. It should not happen, and we rely on this particular "watch till connection then proceed with test" functionality in tests throughout the code base. The fact that it is only failing here seems to me a stranger problem. I'm supportive of adding more logging to see if we can debug it.

@skamille
Copy link
Contributor

FWIW I can repro this on my desktop so let me see if I can figure out what's happening

@hanm
Copy link
Contributor Author

hanm commented Mar 16, 2017

@skamille Interesting, curious to see what you find out.

@skamille
Copy link
Contributor

Is it possible that what happens is the client connects again in read-only mode, and then is bumped when quorum happens to be achieved? It seems like that shouldn't happen but it seems like the one race condition that I can somewhat imagine.

@hanm
Copy link
Contributor Author

hanm commented Mar 16, 2017

That's a good one, it sounds possible. We can force the watcher only wait for SyncConnected event (rather than both SyncConnectedEvent and ConnectedReadyOnly event), so we will guarantee the client connected to a RW server before executing the write operation.

@hanm
Copy link
Contributor Author

hanm commented Mar 16, 2017

Let me also check the Apache Jenkins logs on the failed cases see what happened there.

@skamille
Copy link
Contributor

Are we running multiple tests in parallel on Jenkins? I think the logs support this thesis but they seem to have many different tests logging into the same lines as the testSessionEstablishment test and I'm terrible confused by how that is happening.

@hanm
Copy link
Contributor Author

hanm commented Mar 16, 2017

I think we support running concurrent JUNIT tests and the number of parallel tests is controlled by test.junit.threads - which has default value of 1 in build.xml. The Jenkins log indicates that this number is 8 on Apache Jenkins, though I haven't found where to set it in the job configuration yet.

@skamille
Copy link
Contributor

I think your idea to create a connector that waits for specific types of connection (RO or R/W) is a good idea to try to fix this issue.

@hanm
Copy link
Contributor Author

hanm commented Mar 16, 2017

@skamille
Implementing the idea of waiting for sync connected event to make sure we wait until a quorum is formed before prematurely trying a write operation. Made some comments in the test case that describes a possible case that client will drop connection between connected to r/o server and a quorum reforming.

@hanm
Copy link
Contributor Author

hanm commented Mar 17, 2017

I think this does not quite work still https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/434/testReport/junit/org.apache.zookeeper.test/ReadOnlyModeTest/testSessionEstablishment/ - need more dig see what exactly cause the connection drop.

@hanm hanm force-pushed the ZOOKEEPER-2722 branch 10 times, most recently from 31ace34 to 6b1253e Compare March 19, 2017 01:33
@hanm
Copy link
Contributor Author

hanm commented Mar 19, 2017

@skamille OK, for some reasons I can't even get a single repro with this patch now on Apache Jenkins. I've kicked 10+ builds on apache jenkins and all of them pass this test. Unfortunately the link I posted earlier use a version of patch that does have instrumented logs so I can't tell the state of ZK when weird thing happened.

I think we can commit this if this patch is OK with you, and let it hammered by daily builds / other pre-commit jobs so we can gather more log data on the contexts when it reproduces.

connected = true;
syncConnected = true;
} else if (state == KeeperState.ConnectedReadOnly) {
connected = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would there be value in setting syncConnected to false here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

watcher.reset should be called between two process calls so syncConnected should be false here, no need reset.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but what if some event happens that causes the state to drop from syncconnected -> connectedreadonly during a test (before reset is ever called). If expecting a reset call was sufficient, we would not need the "else" clause in that if statement right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah fair enough , fixing ...

@hanm
Copy link
Contributor Author

hanm commented Mar 23, 2017

@skamille - gentle nudge.. any opinions my proposal regarding commit this to branch-3.5? I am eager to get some logging from daily build (if the patch itself does not fix the issue). I think we had all green builds in past week except one fail build caused by this https://builds.apache.org/job/ZooKeeper_branch35_jdk7/892/

@rakeshadr
Copy link
Contributor

+1 the patch looks good to me. Thanks @hanm

@hanm hanm changed the base branch from branch-3.5 to master April 6, 2017 16:11
@hanm hanm changed the base branch from master to branch-3.5 April 6, 2017 16:11
@hanm
Copy link
Contributor Author

hanm commented Apr 6, 2017

Thanks for the review @rakeshadr - I'll commit this after 3.5.3 released since this PR is targeting branch-3.5 (tried to change base branch from 3.5 to master makes git generates crazy diffs).

asfgit pushed a commit that referenced this pull request Apr 17, 2017
Make sure client is connected to a quorum before issuing a write operation to avoid possible race condition between connected to a RO server and forming a new quorum.

Author: Michael Han <hanm@apache.org>

Reviewers: Rakesh Radhakrishnan <rakeshr@apache.org>, Camille Fournier <camille@apache.org>

Closes #191 from hanm/ZOOKEEPER-2722 and squashes the following commits:

99bf87f [Michael Han] Make sure syncConnected flag is set to false in read only connected state.
9969979 [Michael Han] ZOOKEEPER-2722: fix flaky test testSessionEstablishment.
@asfgit asfgit closed this in 3946f0b Apr 17, 2017
asfgit pushed a commit that referenced this pull request Apr 17, 2017
Make sure client is connected to a quorum before issuing a write operation to avoid possible race condition between connected to a RO server and forming a new quorum.

Author: Michael Han <hanm@apache.org>

Reviewers: Rakesh Radhakrishnan <rakeshr@apache.org>, Camille Fournier <camille@apache.org>

Closes #191 from hanm/ZOOKEEPER-2722 and squashes the following commits:

99bf87f [Michael Han] Make sure syncConnected flag is set to false in read only connected state.
9969979 [Michael Han] ZOOKEEPER-2722: fix flaky test testSessionEstablishment.

(cherry picked from commit 2122011)
Signed-off-by: Michael Han <hanm@apache.org>
hanm added a commit to hanm/zookeeper that referenced this pull request Apr 21, 2017
Make sure client is connected to a quorum before issuing a write operation to avoid possible race condition between connected to a RO server and forming a new quorum.

Author: Michael Han <hanm@apache.org>

Reviewers: Rakesh Radhakrishnan <rakeshr@apache.org>, Camille Fournier <camille@apache.org>

Closes apache#191 from hanm/ZOOKEEPER-2722 and squashes the following commits:

99bf87f [Michael Han] Make sure syncConnected flag is set to false in read only connected state.
9969979 [Michael Han] ZOOKEEPER-2722: fix flaky test testSessionEstablishment.

(cherry picked from commit 2122011)
Signed-off-by: Michael Han <hanm@apache.org>
afine pushed a commit to afine/zookeeper that referenced this pull request May 18, 2017
Make sure client is connected to a quorum before issuing a write operation to avoid possible race condition between connected to a RO server and forming a new quorum.

Author: Michael Han <hanm@apache.org>

Reviewers: Rakesh Radhakrishnan <rakeshr@apache.org>, Camille Fournier <camille@apache.org>

Closes apache#191 from hanm/ZOOKEEPER-2722 and squashes the following commits:

99bf87f [Michael Han] Make sure syncConnected flag is set to false in read only connected state.
9969979 [Michael Han] ZOOKEEPER-2722: fix flaky test testSessionEstablishment.

(cherry picked from commit 2122011)
Signed-off-by: Michael Han <hanm@apache.org>
lvfangmin pushed a commit to lvfangmin/zookeeper that referenced this pull request Jun 17, 2018
Make sure client is connected to a quorum before issuing a write operation to avoid possible race condition between connected to a RO server and forming a new quorum.

Author: Michael Han <hanm@apache.org>

Reviewers: Rakesh Radhakrishnan <rakeshr@apache.org>, Camille Fournier <camille@apache.org>

Closes apache#191 from hanm/ZOOKEEPER-2722 and squashes the following commits:

99bf87f [Michael Han] Make sure syncConnected flag is set to false in read only connected state.
9969979 [Michael Han] ZOOKEEPER-2722: fix flaky test testSessionEstablishment.

(cherry picked from commit 2122011)
Signed-off-by: Michael Han <hanm@apache.org>
RokLenarcic pushed a commit to RokLenarcic/zookeeper that referenced this pull request Sep 3, 2022
Make sure client is connected to a quorum before issuing a write operation to avoid possible race condition between connected to a RO server and forming a new quorum.

Author: Michael Han <hanm@apache.org>

Reviewers: Rakesh Radhakrishnan <rakeshr@apache.org>, Camille Fournier <camille@apache.org>

Closes apache#191 from hanm/ZOOKEEPER-2722 and squashes the following commits:

99bf87f [Michael Han] Make sure syncConnected flag is set to false in read only connected state.
9969979 [Michael Han] ZOOKEEPER-2722: fix flaky test testSessionEstablishment.

(cherry picked from commit 2122011)
Signed-off-by: Michael Han <hanm@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants