[v1.4.8] crash in downlist_master_choose #167

NicolasMa · 2016-11-15T11:55:58Z

Hi all,
I recently hit the assertion at the bottom of downlist_master_choose: assert (best != NULL);
It happened after corosync had been under heavy load.

Here is the stack trace:

(gdb) bt
#0 0x00000008012f156c in thr_kill () at thr_kill.S:3
#1 0x0000000801395c6b in abort () at /usr/src/lib/libc/stdlib/abort.c:65
#2 0x0000000801379305 in __assert (func=0x187b1 <Address 0x187b1 out of bounds>, file=0x6 <Address 0x6 out of bounds>, line=0, failedexpr=0x0)
at /usr/src/lib/libc/gen/assert.c:54
#3 0x000000080300ad79 in downlist_master_choose_and_send () at cpg.c:798
#4 0x0000000000408377 in deliver_fn (nodeid=805526530, msg=0x8029ad009, msg_len=, endian_conversion_required=0) at main.c:899
#5 0x00000008008475fa in totempg_deliver_fn (nodeid=805526530, msg=0x808303862, msg_len=, endian_conversion_required=0) at totempg.c:529
#6 0x0000000800840bb3 in messages_deliver_to_app (instance=0x8018c1000, skip=0, end_point=) at totemsrp.c:3820
#7 0x0000000800844b85 in message_handler_orf_token (instance=0x8018c1000, msg=, msg_len=,
endian_conversion_needed=) at totemsrp.c:3690
#8 0x000000080083bce2 in passive_token_recv (rrp_instance=0x801821780, iface_no=0, context=0x8018c1000, msg=0x8024a4690, msg_len=70,
token_seq=) at totemrrp.c:1063
#9 0x000000080083cac7 in rrp_deliver_fn (context=0x80181b1a0, msg=0x8024a4690, msg_len=70) at totemrrp.c:1736
#10 0x00000008008385dd in net_deliver_fn (handle=, fd=, revents=, data=)
at totemudp.c:1260
#11 0x000000080083433c in poll_run (handle=1197105576937521152) at coropoll.c:513
#12 0x0000000000406c92 in main (argc=, argv=, envp=) at main.c:1866

After looking at the code, the assertion itself seems correct. However, downlist_messages_head does not seem to be protected by a lock, so I was wondering whether this crash could come from a race condition. Or do you think it is something else?
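
To make the failure mode concrete, here is a minimal sketch of a "pick the best entry from a list" loop. It is not the corosync code: the struct, its fields, the function name choose_best, and the selection rule are all assumptions made for illustration. The point is only that, with a loop of this shape, best can stay NULL solely when the list is empty at the moment of the call, which is what a following assert (best != NULL) would catch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one received downlist message.
 * Field names are illustrative, not taken from cpg.c. */
struct downlist_entry {
    unsigned int sender_nodeid;
    unsigned int old_members;
    struct downlist_entry *next;
};

/* Assumed selection rule (illustration only): prefer the entry with the
 * most old_members, and break ties by the lowest sender_nodeid.
 * Returns NULL only when the list is empty. */
static struct downlist_entry *choose_best(struct downlist_entry *head)
{
    struct downlist_entry *best = NULL;
    struct downlist_entry *cur;

    for (cur = head; cur != NULL; cur = cur->next) {
        if (best == NULL ||
            cur->old_members > best->old_members ||
            (cur->old_members == best->old_members &&
             cur->sender_nodeid < best->sender_nodeid)) {
            best = cur;
        }
    }
    return best;
}
```

Under these assumptions, a caller that asserts on the result aborts exactly when it is invoked with no entries queued, so the question becomes how the list could be empty at that point.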

Thanks for your reply.


jfriesse · 2016-11-15T13:48:16Z

I'm pretty sure an ifdown happened on one of the network interfaces. Corosync then rebinds to 127.0.0.1, and weird things start to happen; this is one of them. Without RRP the behavior is usually more or less correct, but with RRP the localhost "member" starts spreading across all nodes.

All calls into services are properly serialized by a giant lock, so a race condition is highly improbable.
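
As a rough illustration of what a giant lock means here, the sketch below funnels every service callback through one global mutex, so two handlers can never touch shared state at the same time. The names giant_lock, call_service_locked, on_message, on_flush and pending_messages are hypothetical and are not corosync identifiers; how corosync actually implements its lock is not shown in this thread.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical process-wide lock; the name is illustrative. */
static pthread_mutex_t giant_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical shared state standing in for a service's message list. */
static int pending_messages = 0;

typedef void (*service_handler_fn)(void);

/* Every call into a service goes through this wrapper, so handlers are
 * serialized even if they are invoked from different threads. */
static void call_service_locked(service_handler_fn handler)
{
    assert(handler != NULL);
    pthread_mutex_lock(&giant_lock);
    handler();
    pthread_mutex_unlock(&giant_lock);
}

/* Two example handlers that modify the same shared state. */
static void on_message(void) { pending_messages++; }
static void on_flush(void)   { pending_messages = 0; }
```

With this design the individual data structures need no locks of their own, which is why an unlocked list inside a service is not by itself evidence of a race.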

Please make sure not to do an ifdown (directly or indirectly via NetworkManager).

The ifdown problem is planned to be solved in 3.x.

jfriesse closed this as completed Nov 15, 2016
