[v1.4.8] crash in downlist_master_choose #167

Closed
NicolasMa opened this issue Nov 15, 2016 · 1 comment

Comments

@NicolasMa
Contributor

Hi all,
I recently hit the assertion at the bottom of downlist_master_choose:
assert (best != NULL);. It happened after corosync had been under heavy load.

Here is the stacktrace:

(gdb) bt
#0 0x00000008012f156c in thr_kill () at thr_kill.S:3
#1 0x0000000801395c6b in abort () at /usr/src/lib/libc/stdlib/abort.c:65
#2 0x0000000801379305 in __assert (func=0x187b1 <Address 0x187b1 out of bounds>, file=0x6 <Address 0x6 out of bounds>, line=0, failedexpr=0x0)
at /usr/src/lib/libc/gen/assert.c:54
#3 0x000000080300ad79 in downlist_master_choose_and_send () at cpg.c:798
#4 0x0000000000408377 in deliver_fn (nodeid=805526530, msg=0x8029ad009, msg_len=, endian_conversion_required=0) at main.c:899
#5 0x00000008008475fa in totempg_deliver_fn (nodeid=805526530, msg=0x808303862, msg_len=, endian_conversion_required=0) at totempg.c:529
#6 0x0000000800840bb3 in messages_deliver_to_app (instance=0x8018c1000, skip=0, end_point=) at totemsrp.c:3820
#7 0x0000000800844b85 in message_handler_orf_token (instance=0x8018c1000, msg=, msg_len=,
endian_conversion_needed=) at totemsrp.c:3690
#8 0x000000080083bce2 in passive_token_recv (rrp_instance=0x801821780, iface_no=0, context=0x8018c1000, msg=0x8024a4690, msg_len=70,
token_seq=) at totemrrp.c:1063
#9 0x000000080083cac7 in rrp_deliver_fn (context=0x80181b1a0, msg=0x8024a4690, msg_len=70) at totemrrp.c:1736
#10 0x00000008008385dd in net_deliver_fn (handle=, fd=, revents=, data=)
at totemudp.c:1260
#11 0x000000080083433c in poll_run (handle=1197105576937521152) at coropoll.c:513
#12 0x0000000000406c92 in main (argc=, argv=, envp=) at main.c:1866

Having looked at the code, the assertion itself seems correct. However, downlist_messages_head doesn't seem to be protected by a lock. Could this crash come from a race condition, or do you think it is something else?
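To make the failure mode concrete, here is a minimal sketch of how a "pick the best entry, then assert it exists" routine can abort. It is not the corosync source: the struct, its fields, the selection rule (most old members, ties broken by lowest node ID), and the names choose_best and choose_or_abort are all assumptions made for illustration. The point is only that `best` stays NULL whenever the list being scanned is empty or every entry is skipped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for one received downlist message. */
struct downlist_msg {
    uint32_t sender_nodeid;
    uint32_t old_members;
    struct downlist_msg *next;
};

/*
 * Assumed selection rule: prefer the entry with the most old members,
 * breaking ties by the lowest sender node ID.
 * Returns NULL when the list is empty.
 */
const struct downlist_msg *choose_best(const struct downlist_msg *head)
{
    const struct downlist_msg *best = NULL;
    const struct downlist_msg *cur;

    for (cur = head; cur != NULL; cur = cur->next) {
        if (best == NULL ||
            cur->old_members > best->old_members ||
            (cur->old_members == best->old_members &&
             cur->sender_nodeid < best->sender_nodeid)) {
            best = cur;
        }
    }
    return best;
}

/* Mirrors the reported pattern: abort if nothing could be chosen. */
const struct downlist_msg *choose_or_abort(const struct downlist_msg *head)
{
    const struct downlist_msg *best = choose_best(head);

    assert(best != NULL); /* fires when the list was empty */
    return best;
}
```

Under these assumptions, calling choose_or_abort(NULL) aborts in the same way as the trace above, with no concurrent access involved. An unexpectedly empty list is enough.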

Thanks for your reply.

@jfriesse
Member

I'm pretty sure an ifdown happened on one of the network interfaces. Corosync then rebinds to 127.0.0.1, and weird things happen; this crash is one of them. Without RRP the behavior is usually more or less correct, but with RRP the localhost "member" starts spreading across all nodes.

All calls into services are properly serialized by a giant lock, so a race condition is highly improbable.

Please make sure not to do an ifdown (directly or indirectly via NetworkManager).

The ifdown problem is planned to be solved in 3.x.
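For anyone who wants to spot the rebind described above from a monitoring script or test harness, a small helper along these lines could flag an address in 127.0.0.0/8. This is only a sketch: is_ipv4_loopback is a hypothetical name, it is not part of corosync, and how you obtain the address corosync is currently bound to is left to your setup.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>

/*
 * Returns 1 if `text` is an IPv4 address in 127.0.0.0/8,
 * 0 if it is some other valid IPv4 address,
 * and -1 if it cannot be parsed as IPv4.
 */
int is_ipv4_loopback(const char *text)
{
    struct in_addr addr;

    if (inet_pton(AF_INET, text, &addr) != 1) {
        return -1;
    }
    /* s_addr is in network byte order; convert before inspecting the top octet. */
    return ((ntohl(addr.s_addr) >> 24) == 127) ? 1 : 0;
}
```

A result of 1 for a ring address that should be on a real interface would be consistent with the ifdown scenario, though it does not prove it.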
