
evs exception: failed to form singleton view (leave processing) #37

Closed
temeo opened this issue May 21, 2014 · 2 comments
temeo commented May 21, 2014

While starting 9 nodes concurrently, one of the nodes timed out waiting for prim at the moment prim was being re-bootstrapped:

140518  4:28:07 [Note] WSREP: re-bootstrapping prim from partitioned components
140518  4:28:08 [Note] WSREP: evs::proto(bdefe1c5, OPERATIONAL, view_id(REG,ba547f34,7)):  state change: OPERATIONAL -> LEAVING
140518  4:28:08 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
         at gcomm/src/pc.cpp:connect():141
140518  4:28:08 [ERROR] WSREP: gcs/src/gcs_core.c:gcs_core_open():202: Failed to open backend connection: -110 (Connection timed out)
140518  4:28:08 [ERROR] WSREP: gcs/src/gcs.c:gcs_open():1290: Failed to open channel 'my_wsrep_cluster' at 'gcomm://192.168.17.11:10071?gmcast.listen_addr=tcp://0.0.0.0:10081': -110 (Connection timed out)
140518  4:28:08 [ERROR] WSREP: gcs connect failed: Connection timed out
140518  4:28:08 [ERROR] WSREP: wsrep::connect() failed: 7
140518  4:28:08 [ERROR] Aborting

All the other nodes aborted due to failure to reach consensus:

140518  4:28:45 [Note] WSREP: going to give up, state dump for diagnosis:
evs::proto(evs::proto(be2b058f, GATHER, view_id(REG,ba547f34,7)), GATHER) {
current_view=view(view_id(REG,ba547f34,7) memb {
    ba547f34,
    bded5f55,
    bdee555e,
    bdefe1c5,
    be0d179f,
    be12d899,
    be183c60,
    be25f051,
    be2b058f,
} joined {
} left {
} partitioned {
}),
input_map=evs::input_map: {aru_seq=14,safe_seq=13,node_index=
node: {idx=0,range=[15,14],safe_seq=14}
node: {idx=1,range=[15,14],safe_seq=14}
node: {idx=2,range=[15,14],safe_seq=14}
node: {idx=3,range=[15,14],safe_seq=13}
node: {idx=4,range=[15,14],safe_seq=14}
node: {idx=5,range=[15,14],safe_seq=14}
node: {idx=6,range=[15,14],safe_seq=14}
node: {idx=7,range=[15,14],safe_seq=14}
node: {idx=8,range=[15,14],safe_seq=14} ,
msg_index=  (0,14),{v=0,t=1,ut=255,o=0,s=14,sr=0,as=13,f=4,src=ba547f34,srcvid=view_id(REG,ba547f34,7),ru=00000000,r=[-1,-1],fs=201,nl=(
)
}
    (1,14),{v=0,t=1,ut=255,o=0,s=14,sr=0,as=13,f=4,src=bded5f55,srcvid=view_id(REG,ba547f34,7),ru=00000000,r=[-1,-1],fs=179,nl=(
)
}
    (2,14),{v=0,t=1,ut=255,o=0,s=14,sr=0,as=13,f=4,src=bdee555e,srcvid=view_id(REG,ba547f34,7),ru=00000000,r=[-1,-1],fs=183,nl=(
)
}
    (3,14),{v=0,t=1,ut=255,o=0,s=14,sr=0,as=13,f=4,src=bdefe1c5,srcvid=view_id(REG,ba547f34,7),ru=00000000,r=[-1,-1],fs=157,nl=(
)
}
    (4,14),{v=0,t=1,ut=255,o=0,s=14,sr=0,as=13,f=4,src=be0d179f,srcvid=view_id(REG,ba547f34,7),ru=00000000,r=[-1,-1],fs=199,nl=(
)
}
    (5,14),{v=0,t=1,ut=255,o=0,s=14,sr=0,as=13,f=4,src=be12d899,srcvid=view_id(REG,ba547f34,7),ru=00000000,r=[-1,-1],fs=187,nl=(
)
}
    (6,14),{v=0,t=1,ut=5,o=4,s=14,sr=0,as=13,f=4,src=be183c60,srcvid=view_id(REG,ba547f34,7),ru=00000000,r=[-1,-1],fs=162,nl=(
)
}
    (7,14),{v=0,t=1,ut=255,o=0,s=14,sr=0,as=13,f=4,src=be25f051,srcvid=view_id(REG,ba547f34,7),ru=00000000,r=[-1,-1],fs=170,nl=(
)
}
    (8,14),{v=0,t=1,ut=255,o=0,s=14,sr=0,as=13,f=0,src=be2b058f,srcvid=view_id(REG,ba547f34,7),ru=00000000,r=[-1,-1],fs=161,nl=(
)
}
,recovery_index=},
fifo_seq=224,
last_sent=14,
known:
ba547f34 at tcp://192.168.17.11:10011
{o=0,s=1,i=0,fs=231,}
bded5f55 at tcp://192.168.17.11:10071
{o=0,s=1,i=0,fs=209,}
bdee555e at tcp://192.168.17.11:10041
{o=0,s=1,i=0,fs=216,}
bdefe1c5 at tcp://192.168.17.12:10081
{o=0,s=1,i=0,fs=158,lm=
{v=0,t=6,ut=255,o=1,s=14,sr=-1,as=13,f=6,src=bdefe1c5,srcvid=view_id(REG,ba547f34,7),ru=00000000,r=[-1,-1],fs=158,nl=(
)
},
}
be0d179f at tcp://192.168.17.13:10061
{o=0,s=1,i=0,fs=229,}
be12d899 at tcp://192.168.17.13:10031
{o=0,s=1,i=0,fs=216,}
be183c60 at tcp://192.168.17.12:10051
{o=0,s=1,i=0,fs=193,}
be25f051 at tcp://192.168.17.12:10021
{o=0,s=1,i=0,fs=198,}
be2b058f at 
{o=1,s=0,i=0,fs=-1,jm=
{v=0,t=4,ut=255,o=1,s=13,sr=-1,as=14,f=0,src=be2b058f,srcvid=view_id(REG,ba547f34,7),ru=00000000,r=[-1,-1],fs=224,nl=(
    ba547f34, {o=0,s=1,f=0,ls=-1,vid=view_id(REG,ba547f34,7),ss=14,ir=[15,14],}
    bded5f55, {o=0,s=1,f=0,ls=-1,vid=view_id(REG,ba547f34,7),ss=14,ir=[15,14],}
    bdee555e, {o=0,s=1,f=0,ls=-1,vid=view_id(REG,ba547f34,7),ss=14,ir=[15,14],}
    bdefe1c5, {o=0,s=1,f=0,ls=14,vid=view_id(REG,ba547f34,7),ss=13,ir=[15,14],}
    be0d179f, {o=0,s=1,f=0,ls=-1,vid=view_id(REG,ba547f34,7),ss=14,ir=[15,14],}
    be12d899, {o=0,s=1,f=0,ls=-1,vid=view_id(REG,ba547f34,7),ss=14,ir=[15,14],}
    be183c60, {o=0,s=1,f=0,ls=-1,vid=view_id(REG,ba547f34,7),ss=14,ir=[15,14],}
    be25f051, {o=0,s=1,f=0,ls=-1,vid=view_id(REG,ba547f34,7),ss=14,ir=[15,14],}
    be2b058f, {o=1,s=0,f=0,ls=-1,vid=view_id(REG,ba547f34,7),ss=14,ir=[15,14],}
)
},
}
 }
140518  4:28:45 [ERROR] WSREP: exception from gcomm, backend must be restarted:evs::proto(be2b058f, GATHER, view_id(REG,ba547f34,7)) failed to form singleton view after exceeding max_install_timeouts 3, giving up (FATAL)
     at gcomm/src/evs_proto.cpp:handle_install_timer():612

It looks like the leaving node failed to acknowledge all messages it had received due to the exception, and the remaining nodes failed to reach consensus because of that.

To fix this, the remaining nodes must decide at some point that the leaving node won't be sending any more messages and ignore its safe seq in the consensus computation. A raised suspected flag could be one such indication.
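The intent of the fix (skip a suspected leaving node's safe seq when computing the consensus value) can be sketched in isolation. This is illustrative only; NodeState and consensus_safe_seq are hypothetical names for this sketch, not gcomm API:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>
#include <map>

// Hypothetical per-node state: the safe seqno the node has reported and
// whether the node is suspected of having left without acknowledging.
struct NodeState {
    int64_t safe_seq;
    bool    suspected;
};

// Compute the cluster-wide safe seqno as the minimum over nodes,
// skipping suspected leaving nodes. The point of the fix: a leaving
// node's stale safe_seq must not hold back consensus forever.
int64_t consensus_safe_seq(const std::map<int, NodeState>& nodes) {
    int64_t result = std::numeric_limits<int64_t>::max();
    bool any = false;
    for (const auto& [id, st] : nodes) {
        if (st.suspected) continue;  // ignore the leaving node's stale seq
        result = std::min(result, st.safe_seq);
        any = true;
    }
    return any ? result : -1;
}
```

In the log above, node bdefe1c5 (idx=3) is stuck at safe_seq=13 while everyone else reports 14; with the suspected flag honoured, consensus would settle on 14.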

@temeo temeo added the bug label May 21, 2014
@temeo temeo added this to the 3.6 milestone May 21, 2014
@temeo temeo self-assigned this May 21, 2014
dirtysalt commented

@temeo your way could do the trick. Here is my understanding; correct me if I'm wrong.

  1. Each node could decide that the leaving node won't send any more messages the first time handle_install_timer is called, mark the leaving node suspected, and resend the join message. We can safely ignore the leaving node's safe_seq only once all other nodes have marked the leaving node suspected.
  2. In the consensus computation, we can easily get the right safe_seq by ignoring the suspected leaving node's safe_seq in is_consistent_highest_reachable_safe_seq. However, we also need to update the semantics of (or write another function for) input_map_.safe_seq(), which is needed in is_consistent_input_map, so that it also ignores the suspected leaving node's safe_seq.
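The agreement condition in point 1 ("only once all other nodes have marked it suspected") can be sketched as follows; JoinDigest and safe_to_ignore are made-up names for illustration, not gcomm code:

```cpp
#include <cassert>
#include <map>
#include <set>

// Hypothetical digest of received join messages: for each sender,
// the set of node ids it currently marks as suspected.
using JoinDigest = std::map<int, std::set<int>>;

// The leaving node's safe_seq may be skipped only once *every* other
// node's join message marks it suspected; otherwise different nodes
// would compute different consensus values and diverge.
bool safe_to_ignore(int leaving, const JoinDigest& joins) {
    for (const auto& [sender, suspects] : joins) {
        if (sender == leaving) continue;      // don't count the leaver itself
        if (suspects.count(leaving) == 0) return false;
    }
    return true;
}
```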

I devised another way that does not require deciding whether the leaving node will send more messages; it may be easier. Every time handle_install_timer is triggered:

  1. we check whether there is a leaving node and fetch its leave message (lm)
  2. if there is a leaving node, we can safely send a delegated message (O_DROP, seq = [lm.seq, high_seen_seq]) to acknowledge the messages.


temeo commented Jun 24, 2014

@dirtysalt The latter approach should be avoided: generating messages on behalf of another node might produce surprising results if messages from the leaving node are still lingering in the network, and it might not be backwards compatible.

For the first analysis:

  1. Marking the leaving node suspected in the join message happens automatically once the leaving node has been absent longer than evs.suspect_timeout, so it will eventually happen on all remaining nodes and will be communicated to the others via join messages.
  2. It should be enough to modify is_consistent_highest_reachable_safe_seq to filter out of the leaving list those nodes that are suspected in all join messages from members of the leaving node's view. There is no need to modify is_consistent_input_map, since the input map state is updated correctly via retransmitted and recovered messages (input map message content) and join messages (input map safe seqs).
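Under these assumptions, the suggested filtering could look roughly like this. It is a sketch only; Member and highest_reachable_safe_seq are hypothetical names standing in for state used inside is_consistent_highest_reachable_safe_seq, not the actual gcomm implementation:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>
#include <map>
#include <set>

// Hypothetical digest of join messages from members of the leaving
// node's view: the safe seqno each member reports, plus which node
// ids that member marks as suspected.
struct Member {
    int64_t safe_seq;
    std::set<int> suspects;
};

// A node is dropped from the reachable set only when every *other*
// member's join message suspects it. Because the condition is computed
// from the same join messages on every node, all members filter
// identically and can still agree.
int64_t highest_reachable_safe_seq(const std::map<int, Member>& members) {
    int64_t result = std::numeric_limits<int64_t>::max();
    for (const auto& [id, m] : members) {
        bool suspected_by_all = true;
        for (const auto& [other_id, other] : members) {
            if (other_id != id && other.suspects.count(id) == 0) {
                suspected_by_all = false;
                break;
            }
        }
        if (suspected_by_all && members.size() > 1) continue;
        result = std::min(result, m.safe_seq);
    }
    return result;
}
```

With the state dump above, the leaving node (safe_seq=13) would be filtered out once all eight remaining members suspect it, and the computation would converge on 14 instead of timing out after max_install_timeouts.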

dirtysalt added a commit that referenced this issue Jun 30, 2014
dirtysalt added a commit that referenced this issue Jul 1, 2014
@dirtysalt dirtysalt assigned dirtysalt and unassigned temeo Jul 1, 2014
philip-galera added a commit that referenced this issue Jan 26, 2017