whole-cluster fence upon membership change #660
Comments
|
we got the first reports after we started rolling out the two versions mentioned above; before that we were on 3.1.2 / 1.21.
|
some preliminary testing with libknet1 reverted to 1.21 seems to indicate that 1.22 either contains the bug or exposes it in corosync: with 1.22 we see regular retry counts > 50, up to the full crash described in the initial post, everything else about the test setup being equal. I'll keep testing both versions to see whether this is stable when running over a longer period of time.
|
the rate-limiting is done with the following commands on the hypervisor side, limiting the tap interface that's passed to the guest. As stated above, this is just what makes the issue triggerable in the test setup; the real-world root cause is likely something else, like network congestion/problems or system load.
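The script itself did not survive scraping; the commit message quoted further down in this thread records it as follows (`tapXXXX` is a placeholder for the guest's tap interface on the hypervisor):

```shell
iface=tapXXXX
# ~0.1MB/s in bit/s
rate=838856
# 1mb/s
burst=1048576

# shape traffic on the tap interface with an HTB class at the given rate
tc qdisc add dev $iface root handle 1: htb default 1
tc class add dev $iface parent 1: classid 1:1 htb rate ${rate}bps \
    burst ${burst}b
# police the other direction via an ingress qdisc and a drop filter
tc qdisc add dev $iface handle ffff: ingress
tc filter add dev $iface parent ffff: prio 50 basic police rate \
    ${rate}bps burst ${burst}b mtu 64kb "drop"
```

Only a single interface on a single node is limited this way; deleting the root and ingress qdiscs (`tc qdisc del dev $iface root`, `tc qdisc del dev $iface ingress`) restores normal traffic.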
|
only a single interface (the first link) on a single node (the third one, where corosync is restarted) is rate-limited like that.
|
reverting kronosnet/kronosnet@0fc50a1 does not change the symptoms (even though it seemed like a likely candidate). I'll kick off a bisect tomorrow; let's hope we get results on which knet commit triggers it.
|
the bisect does point at kronosnet/kronosnet@0fc50a1 as the cause (or potentially, as the change that takes this from "triggering basically never" to "triggering rarely"). but reverting just that single commit doesn't seem to be enough for 1.22 to run stably - or possibly the reproducer just isn't giving a clear enough picture of the issue :-/
|
@jfriesse I was able to reproduce part of the issue described by @Fabian-Gruenbichler. In my setup, the hypervisor is Fedora 34 (this might be relevant for the tc commands used to inject network failures): 3 VMs running latest rhel8.4, 2 links, libqb/knet/corosync master. Of the 2 links, one is for ssh access, one is private. I injected the faults on the private network to retain clean ssh access and a backup link.

Determine the iface where to inject faults by using:

```
# virsh domiflist rhel8-node1
Interface   Type     Source   Model    MAC
vnet8       bridge   br0      virtio   54:52:00:00:07:01
```

(for me that would be vnet9, eth1 inside the node1 VM)

```
iface=vnet9
# ~0.1MB/s in bit/s
rate=838856
# 1mb/s
burst=1048576
```

NOTE that the first command here is different from the one provided above, replace vs add (I get an error using add):

```
tc qdisc replace dev $iface root handle 1: htb default 1
```

corosync.conf matches the one provided above modulo ip addresses. On node1 and node2, simply start corosync -f, and start cpgverify. After some iterations of the test, cpgverify will fail to join the cpg group. This is visible on all nodes. corosync-quorumtool still reports all nodes online. I have observed in the logs:
it appears that corosync gets stuck somehow and cpg is blocked, even though membership seems fine. I can't yet 100% exclude that knet is delivering junk to corosync. I am still working on it.
|
@jfriesse easier reproducer here: VM setup as above, 3 VMs etc. It is possible to reproduce with only one link as well. To keep things aligned: node1 is the one with the slow link; knet stable1, corosync latest camelback (will need to retest with master as well). Use this patch on top of cpgverify: https://paste.centos.org/view/ef7624d5

Start cpgverify on all nodes. On node2 and node3, start corosync plain and simple. On node1 (the slow one):

```
while [ ! -f stop ]; do corosync; sleep 20; echo "killing corosync"; kill $(pidof corosync); sleep 20; echo "sleeping"; done
```

After a few iterations, node2 and node3 will hit an "incorrect hash".
|
I can reproduce the issue with one or two links, with knet stable and knet master. On top of knet master, I applied the following patch for debugging: https://paste.centos.org/view/c545d451 We never hit checksum errors in knet TX/RX of data, yet cpgverify still shows an incorrect hash on node2 and node3. IMPORTANT NOTE: I found another bug in knet while using this debug patch: crypto_nss does not like the extra header size, so please add crypto_model: openssl or drop crypto completely during testing (it makes no difference for triggering the problem).
|
One more observation (using patched cpgverify): it appears that the packet with the incorrect hash always arrives after a conf_chg (or shortly after).
|
More debug info: with udp(u) we cannot trigger the incorrect-hash problem. An interesting observation is that cpgverify will, from time to time, get a -6/EAGAIN from cpg in combination with udp(u), but this was never seen in combination with knet. As if cpg is accepting requests too fast when corosync is configured with knet.
|
More changes applied: https://paste.centos.org/view/raw/71b114db |
|
That totemknet patch in full: totemknet-crc.txt
|
with the following diff on top of cpghum.c from 3.1.5 (thanks chrissie for parts of the diff!) and the following test setup:
after a few restarts I see CRC errors on nodes 1 and 2 (logs attached: cpghum_node1.txt, generated by piping the output of the above commands). removing the rate limit makes the problem go away, so it seems to be caused by backlog/backpressure somewhere. some things worth noting: cpghum has a single array in which the message data gets prepared; since we now fill it with static data, the only thing that changes is the counter and timestamp encoded in the cpghum header, so the actual payload that gets corrupted is always the same. the corruption looks like (de)fragmenting gone wrong to me, but the offsets are pretty random
|
knet checksum debug diff, on top of knet master. |
|
One more datapoint: today I tested with 4 VMs, adding a "good" node4. During the test, node2 received an incorrect hash. From that point on, all cpgverify processes appeared to be stuck for a long period of time. Restarting cpgverify on node2 did not produce any output for over a minute, after which node3 and node4 reported the incorrect hash and everything unblocked.
|
again with cpghum with a slight change from the previous iteration:
the corruption looks like we get parts of later messages, but not always aligned at an int boundary: cpghum_seq_payload_1.log.txt
|
@jfriesse please check out the knet urgent-fixes branch (based on master) and build knet with --enable-onwire-v1-extra-debug. This option does break onwire compatibility with stable1, so all nodes need to run this version! It enables code that abort()s on checksum failures. Knet did not, unfortunately, abort while reproducing this test case.
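A typical build sequence for that branch might look like the following (this invocation is my sketch, assuming the standard kronosnet autotools flow; only the branch name and configure flag come from the comment above):

```shell
# fetch the branch and configure with the extra onwire debug checks
git clone https://github.com/kronosnet/kronosnet.git
cd kronosnet
git checkout urgent-fixes
./autogen.sh
./configure --enable-onwire-v1-extra-debug
make
# remember: this build must go on ALL nodes, since the option
# breaks onwire compatibility with stable1
sudo make install
```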
|
@jfriesse so I added more debug to exclude knet as the source of the problem. Please grab the extra-coro-debug branch from knet, based on master, and build with: (replace the include path to match your corosync include ;)). This branch adds totem crc verification (based on @chrissie-c's patch above) on both ends of the knet data delivery sockets. Once again, none of those abort()s trigger within knet. So at this point I am fairly confident we are triggering a problem in corosync, or in delivery from corosync to cpgverify.
Better description to be made Signed-off-by: Jan Friesse <jfriesse@redhat.com>
|
For whoever is interested in this issue, could you please give #662 a try? I haven't tested it in a 3-node setup (only 2 nodes) and I am still not sure whether it loses some messages (or delivers messages which shouldn't be delivered), but that's the best I can provide for now.
|
All, whoever is going to test this fix, please make sure to use the extra-coro-debug branch from knet. After almost 1.5 hours of testing, I was able to trigger one of the abort()s in the RX threads once, during packet crc32 verification after reassembling the defrag buffers. The packet originated from the slow node.
|
I found the issue in knet; it doesn't affect stable1, and I provided a temporary workaround in the extra-coro-debug branch so that @jfriesse can continue his testing.
|
for the record, the knet issue has been fixed here: kronosnet/kronosnet#366, and it only affects the master branch.
|
so with the other two found issues (totemsrp buffer confusion fixed in corosync #662, knet master defrag buf shrink fixed in kronosnet#366) seemingly solved, circling back to the original issue:

- at 12:46:42/43 you can see the totem membership changing on all three nodes after corosync on the third node has started, with nodes 1 & 2 seeing 3 as joining, and 3 seeing 1 & 2 as joining
- at 12:46:44 we see the first messages on nodes 1 & 2 about failing to use cpg_mcast_joined, which returns -6. each retry is spaced with a small delay; after 100 retries we abort attempting to send the message and bubble up the result, but subsequent error handling might involve sending further messages
- at 12:46:47 we have the first log message of our CPG consumer after hitting the retry limit
- none of the three nodes recovers from this state, with nodes 1 & 2 fencing about a minute after node3 joined because their fence-preventing watchdog expired for lack of write access. node 3 didn't have the watchdog active, else it would have shared that fate ;)
- possibly relevant: the points in time when knet realizes links are up/down (with corosync restarted at 12:46:42 on node 3), interspersed with totem membership changes - possibly there is confusion about which nodes to send traffic to? also note the delay with which link 0 (which is rate-limited for node 3) comes up compared to link 1; the last down events on node 3 are when nodes 1/2 are fenced
|
@Fabian-Gruenbichler can you please also backport this patch #652 ? I have seen something very similar while testing something else and #652 did fix my problem. |
Commit 92e0f9c added switching of totempg buffers in the sync phase. But because the buffers got switched too early, there was a problem when delivering recovered messages (messages got corrupted and/or lost). The solution is to switch the buffers after the recovered messages got delivered.

I think it is worth describing the complete history with reproducers so it doesn't get lost.

It all started with 4026389 (more info about the original problem is described in https://bugzilla.redhat.com/show_bug.cgi?id=820821). That patch solves a problem reproduced as follows:
- 2 nodes
- Both nodes running corosync and testcpg
- Pause node 1 (SIGSTOP of corosync)
- On node 1, send some messages by testcpg (it's not answering but this doesn't matter; simply hitting the ENTER key a few times is enough)
- Wait till node 2 detects that node 1 left
- Unpause node 1 (SIGCONT of corosync)

On node 1, newly mcasted cpg messages got sent before the sync barrier, so node 2 logged "Unknown node -> we will not deliver message". The solution was to add a switch of the totemsrp new-messages buffer.

This patch was not enough, so a new one (92e0f9c) was created. The reproducer was similar, just with cpgverify used instead of testcpg. Occasionally, when node 1 was unpaused it hung in the sync phase because there was a partial message in the totempg buffers. The new sync message had a different frag cont, so it was thrown away and never delivered.

After many years a problem was found which is solved by this patch (original issue described in corosync#660). The reproducer is more complex:
- 2 nodes
- Node 1 is rate-limited (used script on the hypervisor side):
```
iface=tapXXXX
# ~0.1MB/s in bit/s
rate=838856
# 1mb/s
burst=1048576
tc qdisc add dev $iface root handle 1: htb default 1
tc class add dev $iface parent 1: classid 1:1 htb rate ${rate}bps \
    burst ${burst}b
tc qdisc add dev $iface handle ffff: ingress
tc filter add dev $iface parent ffff: prio 50 basic police rate \
    ${rate}bps burst ${burst}b mtu 64kb "drop"
```
- Node 2 is running corosync and cpgverify
- Node 1 keeps restarting corosync and running cpgverify in a cycle:
  - Console 1: `while true; do corosync; sleep 20; kill $(pidof corosync); sleep 20; done`
  - Console 2: `while true; do ./cpgverify; done`

And from time to time (usually reproduced in less than 5 minutes) cpgverify reports a corrupted message.

Signed-off-by: Jan Friesse <jfriesse@redhat.com>
|
so here we go again, this time with
here's the output from the commands running on each node, indicating when stuff became inaccessible:

- node1 - restarting until 9:18:07; 30s after that last restart, pmxcfs on this node was not writable anymore
- node2 - writable until 9:18:10, when corosync on node1 was up again - access to pmxcfs then blocks because of lack of CPG progress, followed by unblocking at 9:26:12
- node3 - similar to node2

here are the trace logs for each node, showing the now familiar pattern of a constant retransmit for a single seq number on one node..
|
the previous logs are with knet defrag buffers capped at 32 (so min=max=32) to mimic knet 1.22 behaviour. I'll re-run with min=32 max=1024 (vanilla knet master)!
|
and now another log, this time with knet master with 1024 max defrag bufs and only two nodes (the second node has two votes and reads from/writes to pmxcfs; the first node is rate-limited, keeps restarting, and does a single write every 60s to check for accessibility)
|
So small end of the day update. I'm able to reproduce bug with following commands:
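The exact commands were stripped from this comment; per the commit message attached to the eventual fix (quoted elsewhere in this thread), the reproducer boils down to something like the following, run on the rate-limited node:

```shell
# console 1 - restart corosync in a 20s cycle
while true; do corosync; sleep 20; kill $(pidof corosync); sleep 20; done

# console 2 - keep cpgverify running in a cycle
while true; do ./cpgverify; done
```

The other node simply keeps corosync and cpgverify running; the failure shows up as a corrupted-message report from cpgverify, or (as described below) as the console-1 loop getting stuck.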
The bug is reproduced when the command in console 1 stops. For now, it looks like node 1 doesn't get a message (in my tests always the one with seq 5) sent by node 2. Node 2 seems to send seq 5 successfully to knet, and knet probably receives it, because it is sent back via loopback (so node 2 receives the seq 5 message) - but for some reason it is not delivered to node 1. Tomorrow is about finding out whether the packet is thrown away on node 1's corosync side, scratched by knet, or not delivered on the wire at all.
|
@Fabian-Gruenbichler @jfriesse: this is a workaround that is holding up fine for me, using @jfriesse's environment, and the pmxcfs test has been running fine for some 3 hours. I was able to debug to the point where, when reproducing the problem, *dst_seq_num is always 0. This is causing havoc with both the defrag buffers and the other issue of "Packet already delivered". I am not sure yet why this value is not updated during the reproducer; in various tests I did try to set it to the seq_num, but I am sure that's just hiding some other bug. @jfriesse I will continue working over the weekend to find the root cause, but I might need access to your test nodes a bit longer, since I am unable to hit the issue locally. Cheers
|
Test has been running overnight without problems. A small correction: the workaround does not fix the "Packet has already been delivered" problem. That one can be triggered in the same test env by reducing the sleep time from 30 to 15. Fabio
|
The "Packet has already been delivered" issue is triggered when corosync restarts too fast. I understand the issue well; I need to think of a fix. In general we don't see this problem because the first ping from node1 would reset the counters, but in this degraded network environment that packet might not make it to node2.
|
@jfriesse in the defrag-debug knet branch that I have deployed on your nodes, there is a partial fix for the "Packet has already been delivered" issue, and it doesn't appear to trigger anymore. Partial only because we need to clean up the patch and handle all potential ICMP error codes. The patch itself marks a link down when the UDP socket returns an error while trying to send data to the other node (ECONNREFUSED in our case). This speeds up link-down detection a lot.
see corosync/corosync#660 as well. these are already queued for 1.23 and taken straight from stable1-proposed. Acked-by: Thomas Lamprecht <t.lamprecht@proxmox.com> Reviewed-by: Thomas Lamprecht <t.lamprecht@proxmox.com> Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
|
just to provide some closure here as well - this issue is fixed by the patches referenced above, which are currently also queued for knet 1.x. we (Proxmox) started rolling out fixed packages for PVE yesterday - will keep this issue updated in case we encounter more trouble ;)
|
This issue was (hopefully) fixed by multiple patches in multiple projects:
so closing (to not forget it) for now. Please reopen the issue if it appears again.
Fabian-Gruenbichler commented Oct 14, 2021
Hi!
we recently upgraded to corosync 3.1.5 with libknet1 1.22, and had some reports with the following symptoms:
e.g., where a regular join of a node would be visible in logs like this:
we now see this:
with no progress on the corosync end until the fence event
besides the symptom of not finalizing synchronization on the corosync level, this has the additional fallout that CPG isn't operational on all nodes (cpg_join/cpg_mcast_joined returns EAGAIN), which in turn means that our downstream synchronization mechanism doesn't work, which in turn means watchdogs expiring, which in turn means fencing of all nodes almost simultaneously.

the trigger probability seems to be vastly increased by network load (we had some reports where other parts of the network not used for corosync were experiencing issues, or where the increased network load upon booting might be the culprit, or where we see knet only detecting the proper MTU after the log messages above). it's pretty hard to trigger (we had 4 separate reports so far, and I only now managed to trigger it with rather artificial environmental constraints).
I did manage to trigger the same symptoms with the following test setup (once, and a few more times where it started to show the symptoms for 10-15s but then successfully recovered):
I attached two log files (log2_filtered.log is one of the "other" nodes, log3_filtered is the one where corosync gets restarted) with corosync debugging enabled and some of our pmxcfs messages interspersed - the part with lots of cpg_join / cpg_send_message retries is where everything goes wrong (after node 3 starts corosync again at 09:17:38). node 3 is also the one with the severe rate limiting.
IMHO a single node misbehaving should only cause that node to drop out of the cluster, and not be able to subtly kill the whole cluster.
I can provide the full log with all debug messages of our file system, but those are probably not very interesting to you ;)
log2_filtered.log
log3_filtered.log
corosync.conf is also attached
corosync.conf.txt
I'd be glad for any pointers on where to start looking next