
evs exception: failed to form singleton view (leave processing) #37

Closed
temeo opened this issue May 21, 2014 · 2 comments
temeo commented May 21, 2014

While starting 9 nodes concurrently, one of the nodes timed out waiting for prim at the moment prim was being re-bootstrapped:

140518  4:28:07 [Note] WSREP: re-bootstrapping prim from partitioned components
140518  4:28:08 [Note] WSREP: evs::proto(bdefe1c5, OPERATIONAL, view_id(REG,ba547f34,7)):  state change: OPERATIONAL -> LEAVING
140518  4:28:08 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
         at gcomm/src/pc.cpp:connect():141
140518  4:28:08 [ERROR] WSREP: gcs/src/gcs_core.c:gcs_core_open():202: Failed to open backend connection: -110 (Connection timed out)
140518  4:28:08 [ERROR] WSREP: gcs/src/gcs.c:gcs_open():1290: Failed to open channel 'my_wsrep_cluster' at 'gcomm://192.168.17.11:10071?gmcast.listen_addr=tcp://0.0.0.0:10081': -110 (Connection timed out)
140518  4:28:08 [ERROR] WSREP: gcs connect failed: Connection timed out
140518  4:28:08 [ERROR] WSREP: wsrep::connect() failed: 7
140518  4:28:08 [ERROR] Aborting

All the other nodes aborted due to failure to reach consensus:

140518  4:28:45 [Note] WSREP: going to give up, state dump for diagnosis:
evs::proto(evs::proto(be2b058f, GATHER, view_id(REG,ba547f34,7)), GATHER) {
current_view=view(view_id(REG,ba547f34,7) memb {
    ba547f34,
    bded5f55,
    bdee555e,
    bdefe1c5,
    be0d179f,
    be12d899,
    be183c60,
    be25f051,
    be2b058f,
} joined {
} left {
} partitioned {
}),
input_map=evs::input_map: {aru_seq=14,safe_seq=13,node_index=
node: {idx=0,range=[15,14],safe_seq=14}
node: {idx=1,range=[15,14],safe_seq=14}
node: {idx=2,range=[15,14],safe_seq=14}
node: {idx=3,range=[15,14],safe_seq=13}
node: {idx=4,range=[15,14],safe_seq=14}
node: {idx=5,range=[15,14],safe_seq=14}
node: {idx=6,range=[15,14],safe_seq=14}
node: {idx=7,range=[15,14],safe_seq=14}
node: {idx=8,range=[15,14],safe_seq=14} ,
msg_index=  (0,14),{v=0,t=1,ut=255,o=0,s=14,sr=0,as=13,f=4,src=ba547f34,srcvid=view_id(REG,ba547f34,7),ru=00000000,r=[-1,-1],fs=201,nl=(
)
}
    (1,14),{v=0,t=1,ut=255,o=0,s=14,sr=0,as=13,f=4,src=bded5f55,srcvid=view_id(REG,ba547f34,7),ru=00000000,r=[-1,-1],fs=179,nl=(
)
}
    (2,14),{v=0,t=1,ut=255,o=0,s=14,sr=0,as=13,f=4,src=bdee555e,srcvid=view_id(REG,ba547f34,7),ru=00000000,r=[-1,-1],fs=183,nl=(
)
}
    (3,14),{v=0,t=1,ut=255,o=0,s=14,sr=0,as=13,f=4,src=bdefe1c5,srcvid=view_id(REG,ba547f34,7),ru=00000000,r=[-1,-1],fs=157,nl=(
)
}
    (4,14),{v=0,t=1,ut=255,o=0,s=14,sr=0,as=13,f=4,src=be0d179f,srcvid=view_id(REG,ba547f34,7),ru=00000000,r=[-1,-1],fs=199,nl=(
)
}
    (5,14),{v=0,t=1,ut=255,o=0,s=14,sr=0,as=13,f=4,src=be12d899,srcvid=view_id(REG,ba547f34,7),ru=00000000,r=[-1,-1],fs=187,nl=(
)
}
    (6,14),{v=0,t=1,ut=5,o=4,s=14,sr=0,as=13,f=4,src=be183c60,srcvid=view_id(REG,ba547f34,7),ru=00000000,r=[-1,-1],fs=162,nl=(
)
}
    (7,14),{v=0,t=1,ut=255,o=0,s=14,sr=0,as=13,f=4,src=be25f051,srcvid=view_id(REG,ba547f34,7),ru=00000000,r=[-1,-1],fs=170,nl=(
)
}
    (8,14),{v=0,t=1,ut=255,o=0,s=14,sr=0,as=13,f=0,src=be2b058f,srcvid=view_id(REG,ba547f34,7),ru=00000000,r=[-1,-1],fs=161,nl=(
)
}
,recovery_index=},
fifo_seq=224,
last_sent=14,
known:
ba547f34 at tcp://192.168.17.11:10011
{o=0,s=1,i=0,fs=231,}
bded5f55 at tcp://192.168.17.11:10071
{o=0,s=1,i=0,fs=209,}
bdee555e at tcp://192.168.17.11:10041
{o=0,s=1,i=0,fs=216,}
bdefe1c5 at tcp://192.168.17.12:10081
{o=0,s=1,i=0,fs=158,lm=
{v=0,t=6,ut=255,o=1,s=14,sr=-1,as=13,f=6,src=bdefe1c5,srcvid=view_id(REG,ba547f34,7),ru=00000000,r=[-1,-1],fs=158,nl=(
)
},
}
be0d179f at tcp://192.168.17.13:10061
{o=0,s=1,i=0,fs=229,}
be12d899 at tcp://192.168.17.13:10031
{o=0,s=1,i=0,fs=216,}
be183c60 at tcp://192.168.17.12:10051
{o=0,s=1,i=0,fs=193,}
be25f051 at tcp://192.168.17.12:10021
{o=0,s=1,i=0,fs=198,}
be2b058f at 
{o=1,s=0,i=0,fs=-1,jm=
{v=0,t=4,ut=255,o=1,s=13,sr=-1,as=14,f=0,src=be2b058f,srcvid=view_id(REG,ba547f34,7),ru=00000000,r=[-1,-1],fs=224,nl=(
    ba547f34, {o=0,s=1,f=0,ls=-1,vid=view_id(REG,ba547f34,7),ss=14,ir=[15,14],}
    bded5f55, {o=0,s=1,f=0,ls=-1,vid=view_id(REG,ba547f34,7),ss=14,ir=[15,14],}
    bdee555e, {o=0,s=1,f=0,ls=-1,vid=view_id(REG,ba547f34,7),ss=14,ir=[15,14],}
    bdefe1c5, {o=0,s=1,f=0,ls=14,vid=view_id(REG,ba547f34,7),ss=13,ir=[15,14],}
    be0d179f, {o=0,s=1,f=0,ls=-1,vid=view_id(REG,ba547f34,7),ss=14,ir=[15,14],}
    be12d899, {o=0,s=1,f=0,ls=-1,vid=view_id(REG,ba547f34,7),ss=14,ir=[15,14],}
    be183c60, {o=0,s=1,f=0,ls=-1,vid=view_id(REG,ba547f34,7),ss=14,ir=[15,14],}
    be25f051, {o=0,s=1,f=0,ls=-1,vid=view_id(REG,ba547f34,7),ss=14,ir=[15,14],}
    be2b058f, {o=1,s=0,f=0,ls=-1,vid=view_id(REG,ba547f34,7),ss=14,ir=[15,14],}
)
},
}
 }
140518  4:28:45 [ERROR] WSREP: exception from gcomm, backend must be restarted:evs::proto(be2b058f, GATHER, view_id(REG,ba547f34,7)) failed to form singleton view after exceeding max_install_timeouts 3, giving up (FATAL)
     at gcomm/src/evs_proto.cpp:handle_install_timer():612

It looks like the leaving node failed to acknowledge all messages it had received due to the exception, and the remaining nodes failed to reach consensus because of that.

To fix this, the remaining nodes must decide at some point that the leaving node won't be sending any more messages and ignore its safe seq in the consensus computation. A raised suspected flag could be one such indication.
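The intent of the fix (skip a suspected leaving node's safe seq when computing the consensus value) can be sketched in isolation. This is illustrative only; NodeState and consensus_safe_seq are hypothetical names for this sketch, not gcomm API:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>
#include <map>

// Hypothetical per-node state: the safe seqno the node has reported and
// whether the node is suspected of having left without acknowledging.
struct NodeState {
    int64_t safe_seq;
    bool    suspected;
};

// Compute the cluster-wide safe seqno as the minimum over nodes,
// skipping suspected leaving nodes. The point of the fix: a leaving
// node's stale safe_seq must not hold back consensus forever.
int64_t consensus_safe_seq(const std::map<int, NodeState>& nodes) {
    int64_t result = std::numeric_limits<int64_t>::max();
    bool any = false;
    for (const auto& [id, st] : nodes) {
        if (st.suspected) continue;  // ignore the leaving node's stale seq
        result = std::min(result, st.safe_seq);
        any = true;
    }
    return any ? result : -1;
}
```

In the log above, node bdefe1c5 (idx=3) is stuck at safe_seq=13 while everyone else reports 14; with the suspected flag honoured, consensus would settle on 14.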

@temeo temeo added the bug label May 21, 2014
@temeo temeo added this to the 3.6 milestone May 21, 2014
@temeo temeo self-assigned this May 21, 2014
dirtysalt commented

@temeo your way could do the trick. Here is my understanding; correct me if I'm wrong.

  1. Each node could decide that the leaving node won't send any more messages the first time handle_install_timer is called, mark the leaving node suspected, and resend the join message. We can safely ignore the leaving node's safe_seq only once all other nodes have marked the leaving node suspected.
  2. In the consensus computation, we can easily get the right safe_seq by ignoring the suspected leaving node's safe_seq in is_consistent_highest_reachable_safe_seq. However, we also need to update the semantics of (or write another function for) input_map_.safe_seq(), which is needed in is_consistent_input_map, so that it also ignores the suspected leaving node's safe_seq.
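The agreement condition in point 1 ("only once all other nodes have marked it suspected") can be sketched as follows; JoinDigest and safe_to_ignore are made-up names for illustration, not gcomm code:

```cpp
#include <cassert>
#include <map>
#include <set>

// Hypothetical digest of received join messages: for each sender,
// the set of node ids it currently marks as suspected.
using JoinDigest = std::map<int, std::set<int>>;

// The leaving node's safe_seq may be skipped only once *every* other
// node's join message marks it suspected; otherwise different nodes
// would compute different consensus values and diverge.
bool safe_to_ignore(int leaving, const JoinDigest& joins) {
    for (const auto& [sender, suspects] : joins) {
        if (sender == leaving) continue;      // don't count the leaver itself
        if (suspects.count(leaving) == 0) return false;
    }
    return true;
}
```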

I devised another way that does not require deciding whether the leaving node will send more messages; it may be easier. Every time handle_install_timer is triggered:

  1. we check whether there is a leaving node and fetch its leave message (lm)
  2. if there is a leaving node, we can safely send a delegated message (O_DROP, seq = [lm.seq, high_seen_seq]) to acknowledge the messages.


temeo commented Jun 24, 2014

@dirtysalt The latter approach should be avoided: generating messages on behalf of another node might produce surprising results if messages from the leaving node are still lingering in the network, and it might not be backwards compatible.

For the first analysis:

  1. Marking the leaving node suspected in the join message happens automatically once the leaving node has been absent longer than evs.suspect_timeout, so it will eventually happen on all remaining nodes and will be communicated to the others via join messages.
  2. It should be enough to modify is_consistent_highest_reachable_safe_seq to filter out of the leaving list those nodes that are suspected in all join messages from members of the leaving node's view. There is no need to modify is_consistent_input_map, since the input map state is updated correctly via retransmitted and recovered messages (input map message content) and join messages (input map safe seqs).
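Under these assumptions, the suggested filtering could look roughly like this. It is a sketch only; Member and highest_reachable_safe_seq are hypothetical names standing in for state used inside is_consistent_highest_reachable_safe_seq, not the actual gcomm implementation:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>
#include <map>
#include <set>

// Hypothetical digest of join messages from members of the leaving
// node's view: the safe seqno each member reports, plus which node
// ids that member marks as suspected.
struct Member {
    int64_t safe_seq;
    std::set<int> suspects;
};

// A node is dropped from the reachable set only when every *other*
// member's join message suspects it. Because the condition is computed
// from the same join messages on every node, all members filter
// identically and can still agree.
int64_t highest_reachable_safe_seq(const std::map<int, Member>& members) {
    int64_t result = std::numeric_limits<int64_t>::max();
    for (const auto& [id, m] : members) {
        bool suspected_by_all = true;
        for (const auto& [other_id, other] : members) {
            if (other_id != id && other.suspects.count(id) == 0) {
                suspected_by_all = false;
                break;
            }
        }
        if (suspected_by_all && members.size() > 1) continue;
        result = std::min(result, m.safe_seq);
    }
    return result;
}
```

With the state dump above, the leaving node (safe_seq=13) would be filtered out once all eight remaining members suspect it, and the computation would converge on 14 instead of timing out after max_install_timeouts.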

dirtysalt added a commit that referenced this issue Jun 30, 2014
dirtysalt added a commit that referenced this issue Jul 1, 2014
@dirtysalt dirtysalt assigned dirtysalt and unassigned temeo Jul 1, 2014
philip-galera added a commit that referenced this issue Jan 26, 2017