New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ZOOKEEPER-2784: Add same `sid` config problem check #257

Open
wants to merge 6 commits into
base: master
from

Conversation

Projects
None yet
4 participants
@asdf2014
Member

asdf2014 commented May 18, 2017

Add some limitations on code level for SID to avoid configuration problem

@asdf2014 asdf2014 force-pushed the asdf2014:quorum_sid branch 2 times, most recently from f60b2d5 to e467423 May 18, 2017

@asdf2014

This comment has been minimized.

Member

asdf2014 commented May 18, 2017

Finally, it works 😄

     [exec]     [junit] 2017-05-18 09:29:00,770 [myid:] - INFO  [main:ZKTestCase$1@58] - STARTING testSameSID
     [exec]     [junit] 2017-05-18 09:29:00,770 [myid:] - INFO  [main:PortAssignment@85] - Assigned port 24753 from range 24686 - 27378.
     [exec]     [junit] 2017-05-18 09:29:00,770 [myid:] - INFO  [main:PortAssignment@85] - Assigned port 24754 from range 24686 - 27378.
     [exec]     [junit] 2017-05-18 09:29:00,771 [myid:] - INFO  [main:PortAssignment@85] - Assigned port 24755 from range 24686 - 27378.
     [exec]     [junit] 2017-05-18 09:29:00,771 [myid:] - INFO  [main:PortAssignment@85] - Assigned port 24756 from range 24686 - 27378.
     [exec]     [junit] 2017-05-18 09:29:00,771 [myid:] - INFO  [main:PortAssignment@85] - Assigned port 24757 from range 24686 - 27378.
     [exec]     [junit] 2017-05-18 09:29:00,772 [myid:] - INFO  [main:PortAssignment@85] - Assigned port 24758 from range 24686 - 27378.
     [exec]     [junit] 2017-05-18 09:29:00,772 [myid:] - INFO  [main:PortAssignment@85] - Assigned port 24759 from range 24686 - 27378.
     [exec]     [junit] 2017-05-18 09:29:00,772 [myid:] - INFO  [main:PortAssignment@85] - Assigned port 24760 from range 24686 - 27378.
     [exec]     [junit] 2017-05-18 09:29:00,773 [myid:] - INFO  [main:PortAssignment@85] - Assigned port 24761 from range 24686 - 27378.
     [exec]     [junit] 2017-05-18 09:29:00,773 [myid:] - INFO  [main:JUnit4ZKTestRunner$LoggedInvokeMethod@77] - RUNNING TEST METHOD testSameSID
     [exec]     [junit] 2017-05-18 09:29:00,773 [myid:] - INFO  [main:ServerCnxnFactory@134] - Using org.apache.zookeeper.server.NIOServerCnxnFactory as server connection factory
     [exec]     [junit] 2017-05-18 09:29:00,773 [myid:] - INFO  [main:NIOServerCnxnFactory@673] - Configuring NIO connection handler with 10s sessionless connection timeout, 3 selector thread(s), 48 worker threads, and 64 kB direct buffers.
     [exec]     [junit] 2017-05-18 09:29:00,774 [myid:] - INFO  [main:NIOServerCnxnFactory@686] - binding to port /127.0.0.1:24760
     [exec]     [junit] 2017-05-18 09:29:00,775 [myid:] - INFO  [main:CnxManagerTest@392] - Creating socket connection, host: 127.0.0.1, port: 24761
     [exec]     [junit] 2017-05-18 09:29:00,775 [myid:] - INFO  [QuorumPeerListener:QuorumCnxManager$Listener@636] - My election bind port: /127.0.0.1:24761
     [exec]     [junit] 2017-05-18 09:29:00,775 [myid:] - INFO  [/127.0.0.1:24761:QuorumCnxManager$Listener@642] - Received connection request /127.0.0.1:54430
     [exec]     [junit] 2017-05-18 09:29:00,775 [myid:] - WARN  [main:QuorumCnxManager@342] - Exception reading or writing challenge: java.io.EOFException
     [exec]     [junit] 2017-05-18 09:29:00,776 [myid:] - ERROR [/127.0.0.1:24761:QuorumCnxManager$Listener@664] - Appearing duplicate SID: 2
@hanm

This comment has been minimized.

Contributor

hanm commented May 18, 2017

@asdf2014 : There are lots of formatting changes in this patch, which makes review harder. Can you please update the patch to avoid those formatting changes? See How To Contribute:

Please don't do in pull request:

reformat code unrelated to the bug being fixed: formatting changes should be separate patches/commits.

@asdf2014 asdf2014 force-pushed the asdf2014:quorum_sid branch 2 times, most recently from 8b4fb6d to 0762f2a May 19, 2017

@asdf2014 asdf2014 force-pushed the asdf2014:quorum_sid branch from 0762f2a to 0b8ecd1 May 19, 2017

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.io.*;

This comment has been minimized.

@eribeiro

eribeiro May 19, 2017

Contributor

@asdf2014 FYI, ZK codebase gives preference to explicit imports instead of using wildcards.

This comment has been minimized.

@asdf2014

asdf2014 May 19, 2017

Member

I see. I will roll it back.

@@ -660,6 +665,9 @@ public void run() {
"Ignoring exception", ie);
}
closeSocket(client);
} catch (ConfigException ce) {
LOG.error(ce.getMessage());
throw new RuntimeException(ce);

This comment has been minimized.

@eribeiro

eribeiro May 19, 2017

Contributor

Don't we have to call closeSocket(client) here before throwing the RuntimeException? Relying on OS/JVM to close a socket can be too much optimistic and leak resources. If I am right (please double check) then line 667 can be moved to a finally block just below this catch clause.

Also, as the program flow is interrupted if a ConfigException is thrown lines 679-687 wouldn't be called and then ServerSocket also is not properly closed, right?

PS: I am not sure, would have to look carefully later and don't have time now, so excuse me if I am misunderstood it.

This comment has been minimized.

@asdf2014

asdf2014 May 19, 2017

Member

You are right. Throwing 'RuntimeException' does skip the closing ServerSocket part, so I moved it to the end of the run method.

@eribeiro

My comments aside, I am +1 if the other reviewers are okay with the changes.

I was only a bit worried about throwing a RuntimeException here that bubbles up. As far as I can see it terminates the QuorumPeer and doesn't leave the server in a inconsistent state but this could not be the case in the future... I would prefer to save the exception in a package protected field, log the error and just leave the while loop smoothly. Then test to see if this variable was setup in the test case, if this is somewhat possible, of course.

}
closeSocket(client);
} catch (ConfigException e) {
LOG.error(e.getMessage());
ce = e;

This comment has been minimized.

@eribeiro

eribeiro May 19, 2017

Contributor

👍 BUT you need to call closeSocket(client); here too. See line 668. ;) It is the ServerSocket and Socket we need to close in case of error. I know we close the socket previously before throwing the exception, but Socket#close() is idempotent so, I would advise to put it here too. In fact, we could remove line 668 and add a finally block with the closeSocket(client); after line 672.

Most important! The latest change doesn't break the while loop so it only leave after reaching the maximum number of retries or shutdown is called (I bet the first). We should include a break; at line 672. ;)

This comment has been minimized.

@asdf2014

asdf2014 May 20, 2017

Member

Great! I will fix it.

if((end - begin) > ((peer.getSyncLimit() * peer.getTickTime()) + 500)) Assert.fail("Waited more than necessary");
cnxManager.halt();
Assert.assertFalse(cnxManager.listener.isAlive());
try (Socket sock = new Socket()) {

This comment has been minimized.

@eribeiro

eribeiro May 19, 2017

Contributor

It looks like the reason for changing those lines was to include a ARM block, right?

I understand the motivation and have to refrain myself from doing the same a couple of times, but this kind of refactoring would be goal of a separate issue as it is not related to the bug/feature of this issue (aka don't fix if it's not broken).

So, it revert it to the original one, please. :)

This comment has been minimized.

@afine

afine May 19, 2017

Contributor

+1, although please feel free to submit a jira and fix it there!

This comment has been minimized.

@asdf2014

asdf2014 May 20, 2017

Member

Okay! I will separate it into a different jira.

InetSocketAddress electionAddr = peers.get(peer.getId()).electionAddr;
LOG.info("Creating socket connection, host: {}, port: {}",
electionAddr.getHostString(), electionAddr.getPort());
sock.connect(electionAddr, 30000);

This comment has been minimized.

@eribeiro

eribeiro May 19, 2017

Contributor

The timeout here is way to big (30000 ms) with relation to the other tests (5000 ms). It is really necessary to be this big?

This comment has been minimized.

@asdf2014

asdf2014 May 20, 2017

Member

I want to fix the Connection Refused. I thought the environment jenkins is not stable...

public void receiveConnection(Socket sock) {
Long sid = null, protocolVersion = null;
public void receiveConnection(Socket sock) throws ConfigException {
Long sid, protocolVersion;

This comment has been minimized.

@eribeiro

eribeiro May 19, 2017

Contributor

As we removed the extraneous null here we could also use the primitive long type, right? I don't see any gain of using boxed types here. I see a lot of boxing/unboxing down in this method, so it could be simply int sid, protocolVersion; (this slightly falls into "the don't fix if..." adage, but we can ignore this for once ;) ).

This comment has been minimized.

@asdf2014

asdf2014 May 20, 2017

Member

Yeah, actually it could be a problem. Long == Long will call Long.valueof method, and compare the address of Object. And the Long.valueof method will create a new Long Object when the value of Long Object more than 127. So, it could return the false.

/*
127L == 127L: true
128L == 128L: false
128L equals 128L: true
127 == 127: true
128 == 128: true
 */
public static void main(String... args) {
    Long l1 = 127L, l2 = 127L, l3 = 128L, l4 = 128L;
    System.out.println("127L == 127L: " + (l1 == l2));
    System.out.println("128L == 128L: " + (l3 == l4));
    System.out.println("128L equals 128L: " + (l3.equals(l4)));
    long ll1 = 127, ll2 = 127, ll3 = 128, ll4 = 128;
    System.out.println("127 == 127: " + (ll1 == ll2));
    System.out.println("128 == 128: " + (ll3 == ll4));
}

This comment has been minimized.

@eribeiro

eribeiro May 20, 2017

Contributor

Yes, integers whose values are between -127 and 128 (inclusive) are cached inside Integer object.

Plus, boxing/unboxing in a tight loop (not the case here!) can be a huge performance pitfall. For example:

for (Integer it = 0; it < 10000; ){
       it += 1;
}

Used to have abysmal performance numbers due to many auto boxing/unboxing .

This comment has been minimized.

@asdf2014

asdf2014 May 20, 2017

Member

You are right. I will change it into long instead of Long.

@@ -371,6 +372,10 @@ public void receiveConnection(Socket sock) {
connectOne(sid);
}
}else if (sid == self.getId()) {

This comment has been minimized.

@eribeiro

eribeiro May 19, 2017

Contributor

nit: space between { and else

This comment has been minimized.

@asdf2014

asdf2014 May 20, 2017

Member

Thanks! I will fix it.

@@ -545,6 +545,7 @@ public void testInconsistentPeerType() throws Exception {
defaultedToObserver = true;
}
if (warningPresent && defaultedToObserver) {
LOG.debug("Content from console is: {}", line);

This comment has been minimized.

@eribeiro

eribeiro May 19, 2017

Contributor

Ugh?! What is the purpose of this for the issue at hand?

This comment has been minimized.

@asdf2014

asdf2014 May 20, 2017

Member

"QuorumPeerMainTest" has no reason to fail in jinkens, so add a logger to figure out why. I will remove it.

@@ -677,6 +686,7 @@ public void run() {
LOG.debug("Error closing server socket", ie);
}
}
if (ce !=null) throw new RuntimeException(ce);

This comment has been minimized.

@eribeiro

eribeiro May 19, 2017

Contributor

Okay, I am still unsure if throwing a Runtime exception is the best approach here.

This comment has been minimized.

@afine

afine May 19, 2017

Contributor

Would it be possible to refactor this so the exception never leaves its catch block? Perhaps the logic "in between" can be captured with "finally". Then I think we will fix https://github.com/apache/zookeeper/pull/257/files#r117407300 as well.

This comment has been minimized.

@asdf2014

asdf2014 May 20, 2017

Member

Thx a lot! I will using finally deal with closing socket.

@afine

@asdf2014 good work. I'm wondering what our behavior "is" and what is "should be" when two servers present the same sid. What happens when this problem arises, not just on severs with the duplicate sid, but on the other servers as well?

Are there possibilities of split brain here (if some sufficiently "large" subset of servers have duplicate sids)? Is there something we can do to prevent that from happening?

@@ -18,6 +18,12 @@
package org.apache.zookeeper.server.quorum;
import org.apache.zookeeper.server.ZooKeeperThread;

This comment has been minimized.

@afine

afine May 19, 2017

Contributor

a few unnecessary reordering of imports, these should be kept to a minimum to keep the diff clean

This comment has been minimized.

@asdf2014

asdf2014 May 20, 2017

Member

Thank you! I will pay attention. :D

@@ -620,6 +625,7 @@ public void run() {
int numRetries = 0;
InetSocketAddress addr;
Socket client = null;
ConfigException ce = null;

This comment has been minimized.

@afine

afine May 19, 2017

Contributor

nit: can we use better variable naming?

This comment has been minimized.

@asdf2014

asdf2014 May 20, 2017

Member

Using configExceptioin?

This comment has been minimized.

@eribeiro

eribeiro May 20, 2017

Contributor

+1 about configException 👍

This comment has been minimized.

@asdf2014
@@ -657,9 +663,12 @@ public void run() {
LOG.error("Error closing server socket", ie);
} catch (InterruptedException ie) {
LOG.error("Interrupted while sleeping. " +
"Ignoring exception", ie);
"Ignoring exception", ie);

This comment has been minimized.

@afine

afine May 19, 2017

Contributor

nit: whitespace change

This comment has been minimized.

@asdf2014

asdf2014 May 20, 2017

Member

Thanks! I will roll it back.

@@ -677,6 +686,7 @@ public void run() {
LOG.debug("Error closing server socket", ie);
}
}
if (ce !=null) throw new RuntimeException(ce);

This comment has been minimized.

@afine

afine May 19, 2017

Contributor

Would it be possible to refactor this so the exception never leaves its catch block? Perhaps the logic "in between" can be captured with "finally". Then I think we will fix https://github.com/apache/zookeeper/pull/257/files#r117407300 as well.

if((end - begin) > ((peer.getSyncLimit() * peer.getTickTime()) + 500)) Assert.fail("Waited more than necessary");
cnxManager.halt();
Assert.assertFalse(cnxManager.listener.isAlive());
try (Socket sock = new Socket()) {

This comment has been minimized.

@afine

afine May 19, 2017

Contributor

+1, although please feel free to submit a jira and fix it there!

@@ -371,6 +372,10 @@ public void receiveConnection(Socket sock) {
connectOne(sid);
}
}else if (sid == self.getId()) {
closeSocket(sock);
throw new ConfigException(String.format("Appearing duplicate SID: %s", sid));

This comment has been minimized.

@eribeiro

eribeiro May 19, 2017

Contributor

I think this message could be more descriptive or reshaped. English is not my 1st language, but what about "Inconsistent ensemble setup: duplicate SID %s found"?

This comment has been minimized.

@asdf2014

asdf2014 May 20, 2017

Member

Great! I will change it.

@asdf2014 asdf2014 force-pushed the asdf2014:quorum_sid branch from 131c977 to fc5d9a5 May 20, 2017

@asdf2014

This comment has been minimized.

Member

asdf2014 commented May 20, 2017

I don't understand why jenkins reports UnknownHostException. Is it could due to DNS lookups failing?

java.net.UnknownHostException: 10.1.1.-113
	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:178)
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
	at java.net.Socket.connect(Socket.java:579)
	at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:448)
	at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:491)
	at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:426)
	at org.apache.zookeeper.test.CnxManagerTest.testCnxManagerTimeout(CnxManagerTest.java:212)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:79)
	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
	at org.junit.rules.TestWatchman$1.evaluate(TestWatchman.java:53)

@asdf2014 asdf2014 force-pushed the asdf2014:quorum_sid branch 3 times, most recently from f5cab9a to a833653 May 20, 2017

asdf2014 added some commits May 20, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment