
corosync crashes #35

Closed
michael-dev opened this issue Aug 21, 2014 · 9 comments
Comments

@michael-dev

corosync 2.3.3 with libqb-0.17.0 crashes periodically at exec/totemsrp.c:3016, that is, at
assert (instance->commit_token->memb_index <= instance->commit_token->addr_entries);
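
For context, here is a minimal paraphrase of the invariant that assert enforces (only the two field names come from the assert itself; the struct layout and surrounding code are simplified assumptions, not the actual corosync source):

#include <assert.h>

/* Simplified sketch: the commit token carries a list of addr_entries
 * member addresses, and memb_index tracks how far through that list the
 * token has advanced while it circulates. The assert at totemsrp.c:3016
 * fires when memb_index runs past addr_entries, i.e. when the token
 * contents are inconsistent (for example corrupted, or mixed with
 * traffic from another cluster). */
struct commit_token_sketch {
        int memb_index;     /* position reached in the member address list */
        int addr_entries;   /* number of addresses carried in the token */
};

static void commit_token_update_sketch(struct commit_token_sketch *t)
{
        assert(t->memb_index <= t->addr_entries); /* the failing check */
        /* ... the real code would append/update this node's address entry
         * and advance memb_index here ... */
}

int main(void)
{
        struct commit_token_sketch t = { .memb_index = 5, .addr_entries = 3 };
        commit_token_update_sketch(&t); /* aborts, like the reported crash */
        return 0;
}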

backtrace:

(gdb) info threads 
  Id   Target Id         Frame 
  2    Thread 0x7fcde859c700 (LWP 6977) 0x00007fcdeb394420 in sem_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
* 1    Thread 0x7fcdebe3c700 (LWP 6976) 0x00007fcdeb02e545 in raise () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0  0x00007fcdeb02e545 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fcdeb0317c0 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007fcdeb0276f1 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x00007fcdeba19a31 in memb_state_commit_token_update (instance=0x7fcde7d5f010) at totemsrp.c:3016
#4  memb_state_commit_enter (instance=instance@entry=0x7fcde7d5f010) at totemsrp.c:2118
#5  0x00007fcdeba1b9aa in message_handler_memb_commit_token (instance=<optimized out>, msg=<optimized out>, msg_len=228, endian_conversion_needed=<optimized out>)
    at totemsrp.c:4548
#6  0x00007fcdeba183cc in rrp_deliver_fn (context=0x7fcdecdd9390, msg=0x7fcdecddc618, msg_len=228) at totemrrp.c:1794
#7  0x00007fcdeba1340e in net_deliver_fn (fd=<optimized out>, revents=<optimized out>, data=0x7fcdecddc5b0) at totemudp.c:521
#8  0x00007fcdeb5afeef in ?? () from /usr/lib/libqb.so.0
#9  0x00007fcdeb5afad7 in qb_loop_run () from /usr/lib/libqb.so.0
#10 0x00007fcdebe61910 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at main.c:1314
@jfriesse
Member

@michael-dev Do you have a reproducer for this bug? Can you please share your config and take a look at the log file for error messages?

@michael-dev
Author

Yes, this bug happens quite frequently on one virtual machine.

The setup is three corosync nodes (all running the same version); two of them run pacemaker. The node that fails is the one without pacemaker. All three nodes are connected to two VLANs.

totem {
        version: 2
        secauth: on
        threads: 0
        rrp_mode: active
        interface {
                ringnumber: 0
                bindnetaddr: 10.42.1.13
                mcastaddr: 226.94.42.7
                mcastport: 5411
        }
        interface {
                ringnumber: 1
                bindnetaddr: 141.24.41.236
                mcastaddr: 226.94.42.11
                mcastport: 5419
        }
        token: 10000
        token_retransmits_before_loss_const: 40
        rrp_problem_count_timeout: 20000
        nodeid: 3
}
quorum {
    provider: corosync_votequorum
    expected_votes: 3
}
logging {
        fileline: off
        to_stderr: yes
        to_logfile: no
        to_syslog: yes
        logfile: /tmp/corosync.log
        debug: off
        timestamp: on
        logger_subsys {
                subsys: AMF
                debug: off
        }
}

amf {
        mode: disabled
}

I don't have any errors or warnings regarding corosync in syslog.

@jfriesse
Member

@michael-dev Can you please give RRP passive mode a try? Active RRP is not very well tested. Passive is also better because it keeps making progress during a failure.
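
For example, only the rrp_mode line in the totem section of your config above needs to change; everything else can stay as it is:

totem {
        # ... rest of the section unchanged ...
        rrp_mode: passive
}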

@michael-dev
Author

I've changed the cluster configuration and am waiting to see whether the bug goes away.

@michael-dev
Author

OK, I changed rrp_mode to passive and otherwise left the config file unchanged.
Now the log reports:

Aug 30 09:23:37 admindb2-db-slave-01 corosync[25568]:   [TOTEM ] Marking ringid 1 interface 141.24.41.236 FAULTY
Aug 30 09:23:38 admindb2-db-slave-01 corosync[25568]:   [TOTEM ] Automatically recovered ring 1
Aug 30 17:23:32 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:30:16 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:30:32 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:30:39 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:30:51 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:31:03 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:31:09 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:31:19 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:31:28 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:31:38 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:31:47 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:31:57 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:32:06 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:32:16 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:32:25 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:32:35 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:32:44 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Marking ringid 0 interface 10.42.1.13 FAULTY
Aug 30 23:32:44 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:32:52 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:33:00 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:33:08 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:33:16 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:33:17 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Retransmit List: 18f 
Aug 30 23:33:17 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:33:24 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:33:31 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:33:38 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:33:45 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:33:52 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:33:59 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:34:06 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:34:12 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:34:19 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:34:26 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:34:33 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:34:40 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:34:47 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:34:54 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:34:55 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Marking ringid 0 interface 10.42.1.13 FAULTY
Aug 30 23:34:56 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:35:02 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:35:08 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:35:15 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:35:21 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:35:26 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Retransmit List: 191 
Aug 30 23:35:26 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Marking ringid 0 interface 10.42.1.13 FAULTY
Aug 30 23:35:27 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] Automatically recovered ring 0
Aug 30 23:35:30 admindb2-db-slave-01 corosync[31405]:   [TOTEM ] FAILED TO RECEIVE

gdb says:

(gdb) bt
#0  0x00007fd6337e31a5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fd6337e6420 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007fd6337dc351 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x00007fd6341cea31 in memb_state_commit_token_update (instance=0x7fd630514010) at totemsrp.c:3016
#4  memb_state_commit_enter (instance=instance@entry=0x7fd630514010) at totemsrp.c:2118
#5  0x00007fd6341d09aa in message_handler_memb_commit_token (instance=<optimized out>, msg=<optimized out>, msg_len=228, endian_conversion_needed=<optimized out>)
    at totemsrp.c:4548
#6  0x00007fd6341ccb7c in passive_mcast_recv (rrp_instance=0x7fd634ab4520, iface_no=1, context=<optimized out>, msg=<optimized out>, msg_len=<optimized out>)
    at totemrrp.c:1017
#7  0x00007fd6341cd3cc in rrp_deliver_fn (context=0x7fd634ad4290, msg=0x7fd634ad74a8, msg_len=228) at totemrrp.c:1794
#8  0x00007fd6341c840e in net_deliver_fn (fd=<optimized out>, revents=<optimized out>, data=0x7fd634ad7440) at totemudp.c:521
#9  0x00007fd633d64eef in _poll_dispatch_and_take_back_ (item=0x7fd634a6d418, p=<optimized out>) at loop_poll.c:108
#10 0x00007fd633d64ad7 in qb_loop_run_level (level=0x7fd634a6cce8) at loop.c:43
#11 qb_loop_run (lp=<optimized out>) at loop.c:210
#12 0x00007fd634616910 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at main.c:1314

So the same assert is still hit.

@jfriesse jfriesse closed this as completed Dec 8, 2014
@michael-dev
Author

Any reason for closing this? Has this bug been fixed?

@jfriesse
Member

@michael-dev: Whoops. Sorry, I was cleaning up old issues where the reporter simply didn't respond, and I closed this one by mistake.

Basically, it's very weird that you are getting "Automatically recovered ring 0" so often. Also, the assert should really never happen.

So my theory is that either another corosync (probably flatiron) is running on the same subnet, or a packet is getting corrupted. Can you please try changing the mcast port to some different value, or change the mcast address?
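
For example, changing only ring 0 (the port below is an arbitrary value; just make sure no other cluster on the same subnet uses it):

interface {
        ringnumber: 0
        bindnetaddr: 10.42.1.13
        mcastaddr: 226.94.42.7
        mcastport: 5431
}

Alternatively, keep the port and change mcastaddr to an unused multicast group instead.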

@jfriesse jfriesse reopened this Dec 11, 2014
@jfriesse
Member

jfriesse commented Mar 5, 2015

@michael-dev: Were you able to solve this issue, or is it still happening?

@michael-dev
Author

I've not seen this for a while.
