ZOOKEEPER-2722: fix flaky testSessionEstablishment test.#191
ZOOKEEPER-2722: fix flaky testSessionEstablishment test.#191hanm wants to merge 2 commits intoapache:branch-3.5from
Conversation
skamille
left a comment
There was a problem hiding this comment.
So are we saying that the watcher.waitForConnected(CONNECTION_TIMEOUT) is not working correctly? Because it seems like in most of the places you've added the check, we check that the watcher has gotten into connected state first.
I believe this works as expected. I don't see any of the flaky / normal test results complain about this particular check.
The check |
|
So what happened chronically on Jenkins for this test could be:
|
|
Right but my point is, if that creation is failing due to connection loss, shouldn't the places that check the watcher connection fail there instead of in your check? The first place you added the retries there is no watcher connection check, the rest of them there already is, and my understanding is that the watcher should not get past that check if the connection isn't established. |
|
Basically, I think it's worth understanding why we are getting a connection event in the watcher that should be waiting for connection, but still failing by not connecting, instead of fixing this with additional waiting. |
ConnectionLossException can happen after a connection between ZooKeeper client and server has been established, right? So having the check only in watcher is not enough. A pass in watcher does not guarantee ConnectionLossException will not occur in a later point in time. Imagine an extreme case where the a network partition happened between client / server after a session establishment - the client will first get a connected event, and watcher happily reports everything is fine, then subsequent operation (e.g. create) will fail with ConnectionLossException until the network healed.
Yes I'd like to know what causes this though I had a hard time to reproduce this failure locally / in internal Jenkins. It is so far only reproducible in Apache Jenkins. I can add some loggings to capture more contexts when the failure happens on Apache Jenkins, but in that case the retry logic in create is still needed, unless we can prove it is not possible to get a ConnectionLossException after a session establishment. |
Right but we're talking about a test case. If we have the issue that our tests can connect to ZK, then randomly drop connections while in the midst of the testing, that feels like a problem we should figure out. It should not happen, and we rely on this particular "watch till connection then proceed with test" functionality in tests throughout the code base. The fact that it is only failing here seems to me a stranger problem. I'm supportive of adding more logging to see if we can debug it. |
|
FWIW I can repro this on my desktop so let me see if I can figure out what's happening |
|
@skamille Interesting, curious to see what you find out. |
|
Is it possible that what happens is the client connects again in read-only mode, and then is bumped when quorum happens to be achieved? It seems like that shouldn't happen but it seems like the one race condition that I can somewhat imagine. |
|
That's a good one, it sounds possible. We can force the watcher only wait for SyncConnected event (rather than both SyncConnectedEvent and ConnectedReadyOnly event), so we will guarantee the client connected to a RW server before executing the write operation. |
|
Let me also check the Apache Jenkins logs on the failed cases see what happened there. |
|
Are we running multiple tests in parallel on Jenkins? I think the logs support this thesis but they seem to have many different tests logging into the same lines as the testSessionEstablishment test and I'm terrible confused by how that is happening. |
|
I think we support running concurrent JUNIT tests and the number of parallel tests is controlled by test.junit.threads - which has default value of 1 in build.xml. The Jenkins log indicates that this number is 8 on Apache Jenkins, though I haven't found where to set it in the job configuration yet. |
|
I think your idea to create a connector that waits for specific types of connection (RO or R/W) is a good idea to try to fix this issue. |
|
@skamille |
|
I think this does not quite work still https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/434/testReport/junit/org.apache.zookeeper.test/ReadOnlyModeTest/testSessionEstablishment/ - need more dig see what exactly cause the connection drop. |
31ace34 to
6b1253e
Compare
|
@skamille OK, for some reasons I can't even get a single repro with this patch now on Apache Jenkins. I've kicked 10+ builds on apache jenkins and all of them pass this test. Unfortunately the link I posted earlier use a version of patch that does have instrumented logs so I can't tell the state of ZK when weird thing happened. I think we can commit this if this patch is OK with you, and let it hammered by daily builds / other pre-commit jobs so we can gather more log data on the contexts when it reproduces. |
| connected = true; | ||
| syncConnected = true; | ||
| } else if (state == KeeperState.ConnectedReadOnly) { | ||
| connected = true; |
There was a problem hiding this comment.
would there be value in setting syncConnected to false here?
There was a problem hiding this comment.
watcher.reset should be called between two process calls so syncConnected should be false here, no need reset.
There was a problem hiding this comment.
but what if some event happens that causes the state to drop from syncconnected -> connectedreadonly during a test (before reset is ever called). If expecting a reset call was sufficient, we would not need the "else" clause in that if statement right?
There was a problem hiding this comment.
yeah fair enough , fixing ...
|
@skamille - gentle nudge.. any opinions my proposal regarding commit this to branch-3.5? I am eager to get some logging from daily build (if the patch itself does not fix the issue). I think we had all green builds in past week except one fail build caused by this https://builds.apache.org/job/ZooKeeper_branch35_jdk7/892/ |
|
+1 the patch looks good to me. Thanks @hanm |
|
Thanks for the review @rakeshadr - I'll commit this after 3.5.3 released since this PR is targeting branch-3.5 (tried to change base branch from 3.5 to master makes git generates crazy diffs). |
Make sure client is connected to a quorum before issuing a write operation to avoid possible race condition between connected to a RO server and forming a new quorum. Author: Michael Han <hanm@apache.org> Reviewers: Rakesh Radhakrishnan <rakeshr@apache.org>, Camille Fournier <camille@apache.org> Closes #191 from hanm/ZOOKEEPER-2722 and squashes the following commits: 99bf87f [Michael Han] Make sure syncConnected flag is set to false in read only connected state. 9969979 [Michael Han] ZOOKEEPER-2722: fix flaky test testSessionEstablishment.
Make sure client is connected to a quorum before issuing a write operation to avoid possible race condition between connected to a RO server and forming a new quorum. Author: Michael Han <hanm@apache.org> Reviewers: Rakesh Radhakrishnan <rakeshr@apache.org>, Camille Fournier <camille@apache.org> Closes #191 from hanm/ZOOKEEPER-2722 and squashes the following commits: 99bf87f [Michael Han] Make sure syncConnected flag is set to false in read only connected state. 9969979 [Michael Han] ZOOKEEPER-2722: fix flaky test testSessionEstablishment. (cherry picked from commit 2122011) Signed-off-by: Michael Han <hanm@apache.org>
Make sure client is connected to a quorum before issuing a write operation to avoid possible race condition between connected to a RO server and forming a new quorum. Author: Michael Han <hanm@apache.org> Reviewers: Rakesh Radhakrishnan <rakeshr@apache.org>, Camille Fournier <camille@apache.org> Closes apache#191 from hanm/ZOOKEEPER-2722 and squashes the following commits: 99bf87f [Michael Han] Make sure syncConnected flag is set to false in read only connected state. 9969979 [Michael Han] ZOOKEEPER-2722: fix flaky test testSessionEstablishment. (cherry picked from commit 2122011) Signed-off-by: Michael Han <hanm@apache.org>
Make sure client is connected to a quorum before issuing a write operation to avoid possible race condition between connected to a RO server and forming a new quorum. Author: Michael Han <hanm@apache.org> Reviewers: Rakesh Radhakrishnan <rakeshr@apache.org>, Camille Fournier <camille@apache.org> Closes apache#191 from hanm/ZOOKEEPER-2722 and squashes the following commits: 99bf87f [Michael Han] Make sure syncConnected flag is set to false in read only connected state. 9969979 [Michael Han] ZOOKEEPER-2722: fix flaky test testSessionEstablishment. (cherry picked from commit 2122011) Signed-off-by: Michael Han <hanm@apache.org>
Make sure client is connected to a quorum before issuing a write operation to avoid possible race condition between connected to a RO server and forming a new quorum. Author: Michael Han <hanm@apache.org> Reviewers: Rakesh Radhakrishnan <rakeshr@apache.org>, Camille Fournier <camille@apache.org> Closes apache#191 from hanm/ZOOKEEPER-2722 and squashes the following commits: 99bf87f [Michael Han] Make sure syncConnected flag is set to false in read only connected state. 9969979 [Michael Han] ZOOKEEPER-2722: fix flaky test testSessionEstablishment. (cherry picked from commit 2122011) Signed-off-by: Michael Han <hanm@apache.org>
Make sure client is connected to a quorum before issuing a write operation to avoid possible race condition between connected to a RO server and forming a new quorum. Author: Michael Han <hanm@apache.org> Reviewers: Rakesh Radhakrishnan <rakeshr@apache.org>, Camille Fournier <camille@apache.org> Closes apache#191 from hanm/ZOOKEEPER-2722 and squashes the following commits: 99bf87f [Michael Han] Make sure syncConnected flag is set to false in read only connected state. 9969979 [Michael Han] ZOOKEEPER-2722: fix flaky test testSessionEstablishment. (cherry picked from commit 2122011) Signed-off-by: Michael Han <hanm@apache.org>
Make sure client is connected to a quorum before issuing a write operation to avoid possible race condition between connected to a RO server and forming a new quorum.