cluster not able to start #904

cookingkode · 2018-10-30T10:57:51Z

I am trying to start a 2-node cluster on a single machine. I run the same Spring Boot jar twice, giving each instance different values for myPort and otherPort (code below):

AtomixBuilder builder = Atomix.builder()
    .withMemberId(myName)
    .withAddress("localhost:" + myPort)
    .withMembershipProvider(BootstrapDiscoveryProvider.builder()
        .withNodes(
            Node.builder()
                .withId(otherName)
                .withAddress("localhost:" + otherPort)
                .build())
        .build());

builder.addProfile(Profile.dataGrid());
this.ref = builder.build();
this.ref.start().join();

The two instances keep communicating with each other, but the start().join() call never returns. I enabled debug logs but I don't see any errors. What am I missing?
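While diagnosing a hang like this, it can help to bound the wait so that startup fails visibly instead of blocking forever. The sketch below is only an illustration of that idea using the standard CompletableFuture API; the class StartupTimeout and the method awaitStartup are hypothetical names, and the never-completing future stands in for the future returned by start(). It does not fix the underlying problem, it only makes the hang observable.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of Atomix.
final class StartupTimeout {

    // Waits for a startup future with an upper bound instead of blocking forever.
    // Returns true if the future completed normally in time, false on timeout.
    static boolean awaitStartup(CompletableFuture<?> startFuture, long timeout, TimeUnit unit)
            throws InterruptedException, ExecutionException {
        try {
            startFuture.get(timeout, unit);
            return true;
        } catch (TimeoutException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the future returned by start(): it never completes,
        // which mimics the hang described above.
        CompletableFuture<Void> hung = new CompletableFuture<>();

        if (!awaitStartup(hung, 2, TimeUnit.SECONDS)) {
            System.err.println("Startup did not complete within 2s; "
                    + "take a thread dump and check which service logged \"Started\" last.");
        }
    }
}
```

In the snippet from the question, this would mean passing this.ref.start() to awaitStartup in place of calling join() directly.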


kuujo · 2018-10-30T18:39:37Z

Can you tell which is the last service that was started? That can usually be a good indicator of where the hangup is. Each Atomix service should log a “Started” message, and we can usually deduce what’s hanging from those. That will at least narrow down the set of possible causes. I don’t see anything wrong with the code. I suspect this is some subtle bug in the configuration, though.
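The "last Started message" check described above can be automated with a small log scan. The following is a minimal sketch that assumes the default Spring Boot console layout (timestamp, level, pid, `---`, `[thread]`, logger, `:`, message); the class LastStarted and its method are hypothetical names, and other log layouts would need a different pattern.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Atomix.
final class LastStarted {

    // Matches "--- [thread] logger : Started" at the end of a line and captures the logger name.
    private static final Pattern STARTED =
            Pattern.compile("---\\s+\\[[^\\]]*\\]\\s+(\\S+)\\s+:\\s+Started\\s*$");

    // Returns the logger of the last line whose message is exactly "Started", if any.
    static Optional<String> lastStartedLogger(List<String> lines) {
        String last = null;
        for (String line : lines) {
            Matcher m = STARTED.matcher(line);
            if (m.find()) {
                last = m.group(1);
            }
        }
        return Optional.ofNullable(last);
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "2018-10-31 09:45:21.863 INFO 33092 --- [nt-nio-server-0] i.a.c.m.i.NettyMessagingService : Started",
                "2018-10-31 09:45:21.875 INFO 33092 --- [tomix-cluster-0] i.a.c.d.BootstrapDiscoveryProvider : Joined",
                "2018-10-31 09:45:21.884 INFO 33092 --- [ atomix-0] i.DefaultPartitionGroupMembershipService : Started");

        System.out.println(lastStartedLogger(lines).orElse("(no Started message found)"));
    }
}
```

The service that should have started next, but did not, is then the place to look for the hang.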


cookingkode · 2018-10-31T04:21:52Z

It looks like HashBasedPrimaryElectionService is the last service started. Later on there is also a "Started" message for the Raft partition group. I will try to debug further.

2018-10-31 09:45:21.863 INFO 33092 --- [nt-nio-server-0] i.a.c.m.i.NettyMessagingService : Started
2018-10-31 09:45:21.875 INFO 33092 --- [tomix-cluster-0] i.a.c.d.BootstrapDiscoveryProvider : Joined
2018-10-31 09:45:21.875 INFO 33092 --- [tomix-cluster-0] i.a.c.i.DefaultClusterMembershipService : member2 - Member activated: Member{id=member2, address=localhost:2222, properties={}}
2018-10-31 09:45:21.877 INFO 33092 --- [tomix-cluster-0] i.a.c.i.DefaultClusterMembershipService : Started
2018-10-31 09:45:21.877 INFO 33092 --- [tomix-cluster-0] c.m.i.DefaultClusterCommunicationService : Started
2018-10-31 09:45:21.878 INFO 33092 --- [tomix-cluster-0] i.a.c.m.i.DefaultClusterEventService : Started
2018-10-31 09:45:21.884 INFO 33092 --- [ atomix-0] i.DefaultPartitionGroupMembershipService : Started

2018-10-31 09:45:21.913 INFO 33092 --- [ atomix-0] .a.p.p.i.HashBasedPrimaryElectionService :
i.a.p.r.i.DefaultRaftServer : RaftServer{system-partition-1} - Server started successfully!

2018-10-31 09:45:26.358 DEBUG 33092 --- [tem-partition-1] i.a.p.r.p.i.RaftPartitionServer : Successfully started server for partition PartitionId{id=1, group=system}
2018-10-31 09:45:26.372 DEBUG 33092 --- [tem-partition-1] i.a.p.r.p.i.RaftPartitionClient : Successfully started client for partition PartitionId{id=1, group=system}
2018-10-31 09:45:26.372 INFO 33092 --- [tem-partition-1] i.a.p.r.p.RaftPartitionGroup : Started

lukasz-antoniak · 2019-01-08T22:51:33Z

While running Atomix 3.1.0, I encountered the same issue. I am able to start three Atomix nodes with my IDE, all on different ports of localhost, and system works as expected. After moving to Docker Compose and running three containers, none of the nodes can join the cluster.

Last message in the log:

[2019-01-08 22:33:58,358] INFO [io.atomix.protocols.raft.partition.RaftPartitionGroup] Started

From what I can tell, io.atomix.primitive.partition.impl.DefaultPartitionService does not complete its start-up procedure (that is the next message in the log while running in IDE). All members are able to see each other, as cluster membership events of type MEMBER_ADDED have been triggered.

Configuration:

cluster {
  clusterId: "atomix"
  node {
    id: "member-0"
    host: "node1"
    port: 5001
  }
  discovery {
    type: bootstrap
    nodes.1 {
      id: "member-0"
      host: "node1"
      port: 5001
    }
    nodes.2 {
      id: "member-1"
      host: "node2"
      port: 5001
    }
    nodes.3 {
      id: "member-2"
      host: "node3"
      port: 5001
    }
  }
}

managementGroup {
  type: raft
  name: system
  partitions: 1
  members: ["member-0", "member-1", "member-2"]
  storage {
    directory: "/volume/data/atomix/mgmt"
    level: disk
  }
}

partitionGroups.raft {
  type: raft
  partitions: 3
  members: ["member-0", "member-1", "member-2"]
  storage {
    directory: "/volume/data/atomix/pg"
    level: disk
  }
}

Thoughts?

kuujo · 2019-01-08T23:51:43Z

Usually this is a symptom of some sort of configuration issue. IMO we need to do a better job of figuring out what the startup issue is and logging it. Currently that’s only done for missing partition groups. I’m not familiar with Docker Compose, but I know the configuration and commands in the ONOS documentation work because I’ve tried them: https://wiki.onosproject.org/plugins/servlet/mobile?contentId=28836788#content/view/28836788 So I wonder what the difference could be in Docker Compose.


lukasz-antoniak · 2019-01-09T18:42:32Z

Thank you for the quick reply. I have been looking at those few configuration lines for hours, so I decided to follow up with some debugging. It looks to me like DefaultPartitionService hangs on this line: https://github.com/atomix/atomix/blob/master/primitive/src/main/java/io/atomix/primitive/partition/impl/DefaultPartitionService.java#L167. I have enabled debug-level logs and observe:

DEBUG PrimitiveService{2}{type=PrimaryElectorType{name=PRIMARY_ELECTOR}, name=atomix-primary-elector} - Opening session 23 (io.atomix.protocols.raft.service.RaftServiceContext)
DEBUG Session{23}{type=PrimaryElectorType{name=PRIMARY_ELECTOR}, name=atomix-primary-elector} - State changed: OPEN (io.atomix.protocols.raft.session.RaftSession)
DEBUG PrimitiveService{2}{type=PrimaryElectorType{name=PRIMARY_ELECTOR}, name=atomix-primary-elector} - Session expired in 56860 milliseconds: RaftSession{RaftServiceContext{server=system-partition-1, type=PrimaryElectorType{name=PRIMARY_ELECTOR}, name=atomix-primary-elector, id=2}, session=12, timestamp=2019-01-09 06:15:34,329} (io.atomix.protocols.raft.service.RaftServiceContext)

Do you have any idea where to look next? What could cause the issue? Docker Compose is nothing more than running several Docker containers in a convenient way. Port 5001 TCP/UDP is open and I assume that my configuration does not need multi-casting.

lukasz-antoniak · 2019-01-12T19:31:03Z

Finally fixed it. In my case, two Guava JARs were present on the classpath (pulled in as transitive dependencies). Atomix did not work with Guava 20.0.
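For anyone hitting a similar silent hang, one way to spot a duplicated library is to ask the class loader for every location that provides a given class. This is a minimal sketch using only the standard ClassLoader.getResources call; the class ClasspathDuplicates and the method locationsOf are hypothetical names, and com.google.common.base.Preconditions is used merely as a representative Guava class.

```java
import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;

// Hypothetical helper, not part of Atomix or Guava.
final class ClasspathDuplicates {

    // Lists every classpath location that contains the .class file for className.
    // More than one entry suggests the same library is present in several JARs.
    static List<URL> locationsOf(String className, ClassLoader loader) throws IOException {
        String resource = className.replace('.', '/') + ".class";
        List<URL> found = new ArrayList<>();
        Enumeration<URL> urls = loader.getResources(resource);
        while (urls.hasMoreElements()) {
            found.add(urls.nextElement());
        }
        return found;
    }

    public static void main(String[] args) throws IOException {
        ClassLoader loader = ClasspathDuplicates.class.getClassLoader();
        List<URL> hits = locationsOf("com.google.common.base.Preconditions", loader);

        System.out.println("Found " + hits.size() + " location(s):");
        for (URL url : hits) {
            System.out.println("  " + url);
        }
        if (hits.size() > 1) {
            System.out.println("Multiple copies detected; check the dependency tree for conflicting versions.");
        }
    }
}
```

In a Maven build, running mvn dependency:tree shows which dependency pulls in each Guava version, so the unwanted one can be excluded.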

johnou · 2020-05-01T13:02:36Z

#1009

johnou closed this as completed May 1, 2020

kuujo added the archived (Archived issues from the legacy Java implementation of Atomix) and legacy (Issues from the legacy Java implementation of Atomix) labels Jan 13, 2023
