cluster not able to start #904
Can you tell which is the last service that was started? That can usually be a good indicator of where the hangup is. Each Atomix service should log a “Started” message, and we can usually deduce what’s hanging from those. That will at least narrow down the set of possible causes.
I don’t see anything wrong with the code. I suspect this is some subtle bug in the configuration, though.
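One way to find the last "Started" message without reading the whole log by eye is sketched below. It assumes Spring Boot style log lines of the form `timestamp LEVEL pid --- [thread] logger : message`; the class and method names are hypothetical helpers, not part of Atomix.

```java
import java.util.List;
import java.util.Optional;

public class LastStartedService {

    /** Returns the logger name of the last line whose message is exactly "Started", if any. */
    public static Optional<String> lastStarted(List<String> lines) {
        String last = null;
        for (String line : lines) {
            // Assumed layout: "<timestamp> LEVEL pid --- [thread] logger : message"
            int sep = line.lastIndexOf(" : ");
            if (sep < 0) {
                continue; // no message separator, skip
            }
            String message = line.substring(sep + 3).trim();
            if (!message.equals("Started")) {
                continue;
            }
            int bracket = line.lastIndexOf(']', sep); // end of the thread name
            if (bracket < 0) {
                continue;
            }
            last = line.substring(bracket + 1, sep).trim();
        }
        return Optional.ofNullable(last);
    }

    public static void main(String[] args) {
        // Sample lines in the format that appears in this thread.
        List<String> lines = List.of(
            "2018-10-31 09:45:21.863 INFO 33092 --- [nt-nio-server-0] i.a.c.m.i.NettyMessagingService : Started",
            "2018-10-31 09:45:26.358 DEBUG 33092 --- [tem-partition-1] i.a.p.r.p.i.RaftPartitionServer : Successfully started server for partition PartitionId{id=1, group=system}");
        System.out.println(lastStarted(lines).orElse("(none)"));
    }
}
```

The service that should have started next after the one printed is the likely place to look.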
… On Oct 30, 2018, at 3:57 AM, Jyotiswarup Raiturkar ***@***.***> wrote:
I am trying to start a 2 node cluster on the same machine. I'm running the same spring boot jar twice giving different values to myPort and otherPort ( code below)
AtomixBuilder builder = Atomix.builder()
    .withMemberId(myName)
    .withAddress("localhost:" + myPort)
    .withMembershipProvider(BootstrapDiscoveryProvider.builder()
        .withNodes(
            Node.builder()
                .withId(otherName)
                .withAddress("localhost:" + otherPort)
                .build()
        ).build());
builder.addProfile(Profile.dataGrid());
this.ref = builder.build();
this.ref.start().join();
The two instances keep communicating with each other, but the call never returns. I enabled debug logs but I don't see any errors. What am I missing?
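Because `join()` blocks indefinitely, a hang like this gives no signal at all. A minimal sketch of bounding the wait is shown below, assuming the value returned by `start()` is a `CompletableFuture` (which the `.join()` call above suggests). The future used here is a stand-in that never completes, and `BoundedStart` and `awaitStart` are hypothetical names.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedStart {

    /** Waits for a startup future; returns false if the timeout elapses first. */
    public static boolean awaitStart(CompletableFuture<Void> startFuture, long timeout, TimeUnit unit) {
        try {
            startFuture.get(timeout, unit);
            return true;
        } catch (TimeoutException e) {
            return false; // still starting (or stuck) after the timeout
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for startup", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Startup failed", e.getCause());
        }
    }

    public static void main(String[] args) {
        // Stand-in for the future returned by start(); it never completes,
        // which mimics the hang described above.
        CompletableFuture<Void> neverCompletes = new CompletableFuture<>();
        if (!awaitStart(neverCompletes, 2, TimeUnit.SECONDS)) {
            System.err.println("Startup did not complete within 2 seconds; check which service logged \"Started\" last");
        }
    }
}
```

This does not fix the hang, but it turns a silent block into a point where the application can log state or take a thread dump.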
It looks like HashBasedPrimaryElectionService is the last service started, and later on there is a started message for the Raft partition group as well. Will try to debug further.
2018-10-31 09:45:21.863 INFO 33092 --- [nt-nio-server-0] i.a.c.m.i.NettyMessagingService : Started
2018-10-31 09:45:21.913 INFO 33092 --- [ atomix-0] .a.p.p.i.HashBasedPrimaryElectionService :
2018-10-31 09:45:26.358 DEBUG 33092 --- [tem-partition-1] i.a.p.r.p.i.RaftPartitionServer : Successfully started server for partition PartitionId{id=1, group=system}
Usually this is a symptom of some sort of configuration issue. IMO we need to do a better job of figuring out what the startup issue is and logging it. Currently that’s only done for missing partition groups.
I’m not familiar with Docker Compose, but I know the configuration and commands in the ONOS documentation work because I’ve tried them:
https://wiki.onosproject.org/plugins/servlet/mobile?contentId=28836788#content/view/28836788
So, I wonder what the difference could be in Docker Compose.
… On Jan 8, 2019, at 2:51 PM, Lukasz Antoniak ***@***.***> wrote:
While running Atomix 3.1.0, I encountered the same issue. I am able to start three Atomix nodes with my IDE, all on different ports of localhost, and system works as expected. After moving to Docker Compose and running three containers, none of the nodes can join the cluster.
Last message in the log:
[2019-01-08 22:33:58,358] INFO [io.atomix.protocols.raft.partition.RaftPartitionGroup] Started
From what I can tell, io.atomix.primitive.partition.impl.DefaultPartitionService does not complete its start-up procedure (that is the next message in the log while running in IDE). All members are able to see each other, as cluster membership events of type MEMBER_ADDED have been triggered.
Configuration:
cluster {
  clusterId: "atomix"
  node {
    id: "member-0"
    host: "node1"
    port: 5001
  }
  discovery {
    type: bootstrap
    nodes.1 {
      id: "member-0"
      host: "node1"
      port: 5001
    }
    nodes.2 {
      id: "member-1"
      host: "node2"
      port: 5001
    }
    nodes.3 {
      id: "member-2"
      host: "node3"
      port: 5001
    }
  }
}
managementGroup {
  type: raft
  name: system
  partitions: 1
  members: ["member-0", "member-1", "member-2"]
  storage {
    directory: "/volume/data/atomix/mgmt"
    level: disk
  }
}
partitionGroups.raft {
  type: raft
  partitions: 3
  members: ["member-0", "member-1", "member-2"]
  storage {
    directory: "/volume/data/atomix/pg"
    level: disk
  }
}
Thoughts?
Thank you for the quick reply. I have been looking at those few configuration lines for hours, so I decided to follow up with some debugging. It looks to me like DefaultPartitionService hangs on the line: https://github.com/atomix/atomix/blob/master/primitive/src/main/java/io/atomix/primitive/partition/impl/DefaultPartitionService.java#L167. I have enabled debug level logs and observe:
Do you have any idea where to look next? What could cause the issue? Docker Compose is nothing more than running several Docker containers in a convenient way. Port 5001 TCP/UDP is open and I assume that my configuration does not need multicast.
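For anyone who wants to confirm the "port is open" assumption from inside a container, a rough TCP reachability check could look like the sketch below. It only tests that a TCP connection can be opened to each host and port from the configuration above; it says nothing about UDP or about the Atomix protocol itself. `PortCheck` and `canConnect` are hypothetical names.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class PortCheck {

    /** Returns true if a TCP connection to host:port succeeds within the timeout. */
    public static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            // An unresolvable host name also ends up in the IOException branch.
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Host names and port taken from the configuration quoted in this thread.
        String[] hosts = {"node1", "node2", "node3"};
        for (String host : hosts) {
            System.out.println(host + ":5001 reachable = " + canConnect(host, 5001, 2000));
        }
    }
}
```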
Finally fixed it. In my case, two Guava JARs were present on the classpath (a transitive dependency). Atomix did not work with Guava 20.0.
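A quick way to check for this kind of duplicate is to ask the class loader for every location that provides a given class. The sketch below uses only JDK calls; `ClasspathDuplicates` and `locationsOf` are hypothetical names, and `com.google.common.collect.ImmutableList` is used simply as a class that every Guava JAR contains.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

public class ClasspathDuplicates {

    /** Lists every classpath location that provides the given class. */
    public static List<URL> locationsOf(String className) {
        String resource = className.replace('.', '/') + ".class";
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClassLoader.getSystemClassLoader();
        }
        try {
            return Collections.list(loader.getResources(resource));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<URL> locations = locationsOf("com.google.common.collect.ImmutableList");
        locations.forEach(System.out::println);
        if (locations.size() > 1) {
            System.err.println("More than one Guava JAR is visible on the classpath");
        }
    }
}
```

If the build uses Maven, `mvn dependency:tree -Dincludes=com.google.guava` shows which dependency pulls in each Guava version.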