
corosync crashes #35

Closed
michael-dev opened this issue Aug 21, 2014 · 9 comments
Comments

@michael-dev

corosync 2.3.3 with libqb-0.17.0 crashes periodically at exec/totemsrp.c:3016, that is, at
assert (instance->commit_token->memb_index <= instance->commit_token->addr_entries);
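
For context, here is a minimal paraphrase of the invariant that assert enforces (only the two field names come from the assert itself; the struct layout and surrounding code are simplified assumptions, not the actual corosync source):

#include <assert.h>

/* Simplified sketch: the commit token carries a list of addr_entries
 * member addresses, and memb_index tracks how far through that list the
 * token has advanced while it circulates. The assert at totemsrp.c:3016
 * fires when memb_index runs past addr_entries, i.e. when the token
 * contents are inconsistent (for example corrupted, or mixed with
 * traffic from another cluster). */
struct commit_token_sketch {
        int memb_index;     /* position reached in the member address list */
        int addr_entries;   /* number of addresses carried in the token */
};

static void commit_token_update_sketch(struct commit_token_sketch *t)
{
        assert(t->memb_index <= t->addr_entries); /* the failing check */
        /* ... the real code would append/update this node's address entry
         * and advance memb_index here ... */
}

int main(void)
{
        struct commit_token_sketch t = { .memb_index = 5, .addr_entries = 3 };
        commit_token_update_sketch(&t); /* aborts, like the reported crash */
        return 0;
}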

backtrace:

(gdb) info threads 
  Id   Target Id         Frame 
  2    Thread 0x7fcde859c700 (LWP 6977) 0x00007fcdeb394420 in sem_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
* 1    Thread 0x7fcdebe3c700 (LWP 6976) 0x00007fcdeb02e545 in raise () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0  0x00007fcdeb02e545 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fcdeb0317c0 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007fcdeb0276f1 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x00007fcdeba19a31 in memb_state_commit_token_update (instance=0x7fcde7d5f010) at totemsrp.c:3016
#4  memb_state_commit_enter (instance=instance@entry=0x7fcde7d5f010) at totemsrp.c:2118
#5  0x00007fcdeba1b9aa in message_handler_memb_commit_token (instance=<optimized out>, msg=<optimized out>, msg_len=228, endian_conversion_needed=<optimized out>)
    at totemsrp.c:4548
#6  0x00007fcdeba183cc in rrp_deliver_fn (context=0x7fcdecdd9390, msg=0x7fcdecddc618, msg_len=228) at totemrrp.c:1794
#7  0x00007fcdeba1340e in net_deliver_fn (fd=<optimized out>, revents=<optimized out>, data=0x7fcdecddc5b0) at totemudp.c:521
#8  0x00007fcdeb5afeef in ?? () from /usr/lib/libqb.so.0
#9  0x00007fcdeb5afad7 in qb_loop_run () from /usr/lib/libqb.so.0
#10 0x00007fcdebe61910 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at main.c:1314
@jfriesse
Member

@michael-dev Do you have a reproducer for this bug? Can you please share your config and take a look at the log file for error messages?

@michael-dev
Author

Yes, this bug happens quite frequently on one virtual machine.

The setup is three corosync nodes (all running the same version); two of them run pacemaker. The node that fails is the one without pacemaker. All three nodes are connected to two VLANs.

totem {
        version: 2
        secauth: on
        threads: 0
        rrp_mode: active
        interface {
                ringnumber: 0
                bindnetaddr: 10.42.1.13
                mcastaddr: 226.94.42.7
                mcastport: 5411
        }
        interface {
                ringnumber: 1
                bindnetaddr: 141.24.41.236
                mcastaddr: 226.94.42.11
                mcastport: 5419
        }
        token: 10000
        token_retransmits_before_loss_const: 40
        rrp_problem_count_timeout: 20000
        nodeid: 3
}
quorum {
    provider: corosync_votequorum
    expected_votes: 3
}
logging {
        fileline: off
        to_stderr: yes
        to_logfile: no
        to_syslog: yes
        logfile: /tmp/corosync.log
        debug: off
        timestamp: on
        logger_subsys {
                subsys: AMF
                debug: off
        }
}

amf {
        mode: disabled
}

I don't have any errors or warnings regarding corosync in syslog.

@jfriesse
Member

@michael-dev Can you please give RRP passive mode a try? Active RRP is not very well tested. Passive is also better because it keeps making progress during a failure.
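
For example, only the rrp_mode line in the totem section of your config above needs to change; everything else can stay as it is:

totem {
        # ... rest of the section unchanged ...
        rrp_mode: passive
}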

@michael-dev
Author

I've changed the cluster configuration and am waiting to see whether the bug goes away.

@michael-dev
Author

OK, I changed rrp_mode to passive and otherwise left the config file unchanged.
Now the log reports:

Aug 30 09:23:37 admindb2-db-slave-01 corosync[25568]:   [TOTEM ] Marking ringid 1 interface 141.24.41.236 FAULTY
Aug 30 09:23:38 admindb2-db-slave-01 corosync[25568]:   [TOTEM ] Automatically recovered ring 1
Aug 30 17:23:32 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:30:16 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:30:32 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:30:39 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:30:51 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:31:03 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:31:09 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:31:19 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:31:28 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:31:38 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:31:47 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:31:57 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:32:06 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:32:16 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:32:25 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:32:35 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:32:44 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Marking ringid 0 interface 10.42.1.13 FAULTY
Aug 30 23:32:44 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:32:52 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:33:00 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:33:08 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:33:16 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:33:17 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Retransmit List: 18f 
Aug 30 23:33:17 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:33:24 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:33:31 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:33:38 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:33:45 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:33:52 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:33:59 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:34:06 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:34:12 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:34:19 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:34:26 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:34:33 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:34:40 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:34:47 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:34:54 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:34:55 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Marking ringid 0 interface 10.42.1.13 FAULTY
Aug 30 23:34:56 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:35:02 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:35:08 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:35:15 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:35:21 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:35:26 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Retransmit List: 191 
Aug 30 23:35:26 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Marking ringid 0 interface 10.42.1.13 FAULTY
Aug 30 23:35:27 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:35:30 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] FAILED TO RECEIVE

gdb says:

(gdb) bt
#0  0x00007fd6337e31a5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fd6337e6420 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007fd6337dc351 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x00007fd6341cea31 in memb_state_commit_token_update (instance=0x7fd630514010) at totemsrp.c:3016
#4  memb_state_commit_enter (instance=instance@entry=0x7fd630514010) at totemsrp.c:2118
#5  0x00007fd6341d09aa in message_handler_memb_commit_token (instance=<optimized out>, msg=<optimized out>, msg_len=228, endian_conversion_needed=<optimized out>)
    at totemsrp.c:4548
#6  0x00007fd6341ccb7c in passive_mcast_recv (rrp_instance=0x7fd634ab4520, iface_no=1, context=<optimized out>, msg=<optimized out>, msg_len=<optimized out>)
    at totemrrp.c:1017
#7  0x00007fd6341cd3cc in rrp_deliver_fn (context=0x7fd634ad4290, msg=0x7fd634ad74a8, msg_len=228) at totemrrp.c:1794
#8  0x00007fd6341c840e in net_deliver_fn (fd=<optimized out>, revents=<optimized out>, data=0x7fd634ad7440) at totemudp.c:521
#9  0x00007fd633d64eef in _poll_dispatch_and_take_back_ (item=0x7fd634a6d418, p=<optimized out>) at loop_poll.c:108
#10 0x00007fd633d64ad7 in qb_loop_run_level (level=0x7fd634a6cce8) at loop.c:43
#11 qb_loop_run (lp=<optimized out>) at loop.c:210
#12 0x00007fd634616910 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at main.c:1314

So the same assert is still hit.

@jfriesse jfriesse closed this as completed Dec 8, 2014
@michael-dev
Author

Any reason for closing this? Has this bug been fixed?

@jfriesse
Member

@michael-dev: Whoops. Sorry, I was cleaning up old issues where the reporter simply didn't respond, and I closed this one by mistake.

Basically, it's very weird that you are getting "Automatically recovered ring 0" so often. Also, the assert should really never happen.

So my theory is that either another corosync (probably flatiron) is running on the same subnet, or a packet is getting corrupted. Can you please try changing the mcast port to some different value, or change the mcast address?
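
For example, changing only ring 0 (the port below is an arbitrary value; just make sure no other cluster on the same subnet uses it):

interface {
        ringnumber: 0
        bindnetaddr: 10.42.1.13
        mcastaddr: 226.94.42.7
        mcastport: 5431
}

Alternatively, keep the port and change mcastaddr to an unused multicast group instead.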

@jfriesse jfriesse reopened this Dec 11, 2014
@jfriesse
Member

jfriesse commented Mar 5, 2015

@michael-dev: Were you able to solve this issue, or is it still happening?

@michael-dev
Author

I've not seen this for a while.
