
whole-cluster fence upon membership change #660

Closed
Fabian-Gruenbichler opened this issue Oct 14, 2021 · 38 comments


Fabian-Gruenbichler commented Oct 14, 2021

Hi!

we recently upgraded to corosync 3.1.5 with libknet1 1.22 and received some reports with the following symptoms:

  • quorate cluster
  • corosync on one node is restarted (either by restarting the corosync service, or by rebooting a node)
  • restarted node gets stuck between noticing the membership change (single node -> all nodes) and finalizing synchronization
  • other nodes get stuck between noticing the membership change and finalizing synchronization

e.g., where a regular join of a node would be visible in logs like this:

Oct  5 12:44:43 node corosync[2135]:   [QUORUM] Sync members[3]: 1 2 3
Oct  5 12:44:43 node corosync[2135]:   [QUORUM] Sync joined[2]: 1 3
Oct  5 12:44:43 node corosync[2135]:   [TOTEM ] A new membership (1.7b1) was formed. Members joined: 1 3
Oct  5 12:44:43 node corosync[2135]:   [QUORUM] Members[3]: 1 2 3
Oct  5 12:44:43 node corosync[2135]:   [MAIN  ] Completed service synchronization, ready to provide service.

we now see this:

Oct  8 08:27:41 node corosync[2134]:   [QUORUM] Sync members[3]: 1 2 3
Oct  8 08:27:41 node corosync[2134]:   [QUORUM] Sync joined[2]: 1 3
Oct  8 08:27:41 node corosync[2134]:   [TOTEM ] A new membership (1.841) was formed. Members joined: 1 3

with no progress on the corosync end until the fence event

besides the symptom of not finalizing synchronization at the corosync level, this has additional fallout: CPG isn't operational on all nodes (cpg_join / cpg_mcast_joined returns EAGAIN), which means our downstream synchronization mechanism doesn't work, which means watchdogs expire, which means all nodes get fenced almost simultaneously.
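To make the cascade concrete, here is a minimal Python sketch (all names and numbers here are hypothetical illustrations, not pmxcfs code) of a bounded retry loop around an operation that keeps returning corosync's CS_ERR_TRY_AGAIN, with an optional watchdog deadline modeling the fence:

```python
import time

CS_ERR_TRY_AGAIN = 6  # corosync's cs_error_t code for "try again"

def send_with_retry(op, retries=100, delay=0.001, watchdog_deadline=None):
    """Retry `op` while it returns CS_ERR_TRY_AGAIN, up to `retries`
    attempts; if a watchdog deadline passes first, the node is fenced."""
    for _ in range(retries):
        if watchdog_deadline is not None and time.monotonic() >= watchdog_deadline:
            return "fenced"
        if op() != CS_ERR_TRY_AGAIN:
            return "sent"
        time.sleep(delay)
    return "gave up"

# a CPG stuck in sync never makes progress, so every attempt fails:
print(send_with_retry(lambda: CS_ERR_TRY_AGAIN, retries=5))  # gave up
print(send_with_retry(lambda: 1, retries=5))                 # sent (1 == CS_OK)
```

When CPG never leaves the stuck sync state, every caller exhausts its retries while the watchdog keeps counting down, which is exactly why all nodes fence at nearly the same time.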

the trigger probability seems to increase vastly with network load (we had some reports where other parts of the network not used for corosync were experiencing issues, or the increased network load upon booting might be the culprit, or we see knet only detecting the proper MTU after the log messages above). it's pretty hard to trigger (we had 4 separate reports so far, and I only now managed to trigger it with rather artificial environmental constraints).

I did manage to trigger the same symptoms with the following test setup (once, and a few more times where it started to show the symptoms for 10-15s but then successfully recovered):

  • virtual 3 node PVE cluster
  • dual-link setup with MTU of 1280 (in the hopes that this causes more fragmentation)
  • rate-limit of 0.1MB/s on NIC of first link on one node
  • write once per second to our cpg-synced fuse file system to cause some artificial load/traffic
  • restart corosync every 10s until the fuse file system indicates we triggered the issue

I attached two log files (log2_filtered.log is from one of the "other" nodes, log3_filtered from the one where corosync gets restarted) with corosync debugging enabled and some of our pmxcfs messages interspersed. the part with lots of cpg_join / cpg_send_message retries is where everything goes wrong (after node 3 starts corosync again at 09:17:38). node 3 is also the one with the severe rate limiting.

IMHO a single node misbehaving should only cause that node to drop out of the cluster, and not be able to subtly kill the whole cluster.

I can provide the full log with all debug messages of our file system, but those are probably not very interesting to you ;)

log2_filtered.log
log3_filtered.log

corosync.conf is also attached

corosync.conf.txt

I'd be glad for any pointers on where to start looking next


Fabian-Gruenbichler commented Oct 14, 2021

we got the first reports after we started rolling out the two versions mentioned above - before that we were on 3.1.2 / 1.21.


Fabian-Gruenbichler commented Oct 14, 2021

some preliminary testing with libknet1 reverted to 1.21 seems to indicate that 1.22 either contains the bug or exposes it in corosync:

1.22: regular retry counts > 50, up to the full crash in the initial post
1.21: rare retry counts < 10

everything else about the test setup being equal. I'll keep testing both versions to see whether this is stable when running over a longer period of time.


Fabian-Gruenbichler commented Oct 14, 2021

the rate-limiting is done with the following command on the hypervisor side, limiting the tap interface that's passed to the guest:

iface=tapXXXX
# ~0.1MB/s in bit/s
rate=838856
# 1MB burst
burst=1048576
tc qdisc add dev $iface root handle 1: htb default 1
tc class add dev $iface parent 1: classid 1:1 htb rate ${rate}bps burst ${burst}b
tc qdisc add dev $iface handle ffff: ingress
tc filter add dev $iface parent ffff: prio 50 basic police rate ${rate}bps burst ${burst}b mtu 64kb "drop"

as stated above, this is just what makes the issue triggerable in the test setup; the real-world root cause is likely something else, like network congestion/problems or system load.


Fabian-Gruenbichler commented Oct 14, 2021

only a single interface (first link) on a single node (the third one, where corosync is restarted) is rate-limited like that.


Fabian-Gruenbichler commented Oct 14, 2021

reverting kronosnet/kronosnet@0fc50a1 does not change the symptoms (even though it seemed like a likely candidate). I'll kick off a bisect tomorrow; let's hope we get results on which knet commit triggers it.


Fabian-Gruenbichler commented Oct 15, 2021

the bisect does point at kronosnet/kronosnet@0fc50a1 as the cause (or potentially, as the change that takes this from "triggering basically never" to "triggering rarely"). but reverting just that single commit doesn't seem to be enough for 1.22 to run stable - or possibly the reproducer is just not giving a clear enough picture of the issue :-/


fabbione commented Oct 19, 2021

@jfriesse I was able to reproduce part of the issue described by @Fabian-Gruenbichler.

In my setup, the hypervisor is Fedora 34 (this might be relevant for the tc commands used to inject network failures).

3 VMs running rhel8.4 latest, 2 links. libqb, knet and corosync master.

Of the 2 links, one is for ssh access, one is private. I injected the faults on the private network to retain clean ssh access and backup link.

Determine the iface to inject faults on using:

virsh domiflist rhel8-node1

Interface  Type    Source  Model   MAC
vnet8      bridge  br0     virtio  54:52:00:00:07:01
vnet9      bridge  br1     virtio  54:52:00:01:07:01

(for me that would be vnet9, eth1 inside node1 VM).

iface=vnet9
# ~0.1MB/s in bit/s
rate=838856
# 1MB burst
burst=1048576

NOTE that the first command here is different from the one provided above, replace vs add (I get an error using add):

tc qdisc replace dev $iface root handle 1: htb default 1
tc class add dev $iface parent 1: classid 1:1 htb rate ${rate}bps burst ${burst}b
tc qdisc add dev $iface handle ffff: ingress
tc filter add dev $iface parent ffff: prio 50 basic police rate ${rate}bps burst ${burst}b mtu 64kb "drop"

corosync.conf matches the one provided above modulo ip addresses.

On node1 and node2, simply start corosync -f, and start cpgverify.
On node3: tail -f /var/logs...... and from another shell:
while [ ! -f stop ]; do corosync; sleep 2; /root/coro/test/cpgverify -i1000; echo "killing corosync"; kill $(pidof corosync); sleep 2; echo "sleeping"; done

After some iterations of the test, cpgverify will fail to join the cpg group. This is visible on all nodes. corosync-quorumtool still reports all nodes online.

I have observed in the logs:

  • pong packets being rejected because of excessive latency (as reported above)
  • links going down/up and eventually recovering
  • dynamic buffers increasing and decreasing as expected
  • no quorum loss between node1 and node2. node3 is bouncing as expected
  • retransmit queues can grow quite a lot, but they will flush eventually after cpgverify is stopped
  • I have noticed, and this might be important, "[rx] Packet has already been delivered", which might point to more issues in managing the packet seq numbers.
  • at no time did cpgverify report any error, which gives me confidence that knet is not corrupting packets (based on the point just above).

It appears that corosync gets stuck somehow and cpg is blocked, even though membership seems fine.

I can't yet 100% exclude that knet is delivering junk to corosync. I am still working on it.


fabbione commented Oct 20, 2021

@jfriesse easier reproducer here:

VM setup as above, 3 VMs, etc. It is also possible to reproduce with one link.

To keep things aligned: node1 is the one with the slow link.

knet stable1, corosync latest camelback (will need to retest with master as well)

Use this patch on top of cpgverify: https://paste.centos.org/view/ef7624d5

start cpgverify on all nodes.

on node2 and 3, start corosync plain and simple.

on node1 (the slow one): while [ ! -f stop ]; do corosync; sleep 20; echo "killing corosync"; kill $(pidof corosync); sleep 20; echo "sleeping"; done

after a few iterations, node2 and node3 will hit an "incorrect hash".


fabbione commented Oct 20, 2021

I can reproduce the issue with one or two links, with knet stable and knet master.

On top of knet master, I applied the following patch for debugging:

https://paste.centos.org/view/c545d451

We never hit checksum errors from knet TX/RX of data, and cpgverify still shows incorrect hash on node2 and node3.

IMPORTANT NOTE: I found another bug in knet while using this debug patch. crypto_nss does not like the extra header size; please just add crypto_model: openssl or drop crypto completely during testing (it makes no difference to triggering the problem).
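For anyone following along, switching the crypto model is a one-line change in the totem section of corosync.conf (a hedged fragment; cipher/hash values are examples, not taken from the configuration attached to this issue):

```
totem {
    # work around the nss issue with the extra debug header size
    crypto_model: openssl
    crypto_cipher: aes256
    crypto_hash: sha256
}
```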


fabbione commented Oct 20, 2021

One more observation (using patched cpgverify):

conf_chg
conf_chg
cpg_mcast_joined 5800 msgs
incorrect hash

it appears that the packet with incorrect hash always happens after a conf_chg (or shortly after).


fabbione commented Oct 20, 2021

More debug info: using udp(u) we cannot trigger the incorrect hash problem.

An interesting observation is that cpgverify will, from time to time, get a -6/EAGAIN from cpg in combination with udp(u), but this is never seen in combination with knet. It is as if cpg is accepting requests too fast when corosync is configured with knet.


fabbione commented Oct 20, 2021

More changes applied: https://paste.centos.org/view/raw/71b114db
This patch from Chrissie adds chksum verification on both ends of knet. Even when we see cpgverify (and cpghum) errors, there is no detected corruption at knet level.
@Fabian-Gruenbichler has developed more testing patches that show the corruption.


chrissie-c commented Oct 20, 2021

That totemknet patch in full: totemknet-crc.txt


Fabian-Gruenbichler commented Oct 20, 2021

with the following diff on top of cpghum.c from 3.1.5 (thanks chrissie for parts of the diff!)

diff --git a/test/cpghum.c b/test/cpghum.c
index c31244c8..4212c07d 100644
--- a/test/cpghum.c
+++ b/test/cpghum.c
@@ -318,6 +318,10 @@ static void cpg_bm_deliver_fn (
 	if (crc != recv_crc) {
 		crc_errors++;
 		cpgh_log_printf(CPGH_LOG_ERR, "%s: CRCs don't match. got %lx, expected %lx from nodeid " CS_PRI_NODE_ID "\n", group_name->value, recv_crc, crc, nodeid);
+		for (int i=0; i<(datalen/4); i++) {
+			cpgh_log_printf(CPGH_LOG_ERR, "%d %x ", i, dataint[i]);
+		}
+			cpgh_log_printf(CPGH_LOG_ERR, "\n");
 
 		if (abort_on_error) {
 			exit(2);
@@ -355,7 +359,7 @@ static void set_packet(int write_size, int counter)
 
 	header->counter = counter;
 	for (i=0; i<(datalen/4); i++) {
-		dataint[i] = rand();
+		dataint[i] = i ;
 	}
 	crc = crc32(0, NULL, 0);
 	header->crc = crc32(crc, (Bytef*)&dataint[0], datalen);
@@ -431,7 +435,7 @@ static void cpg_flood (
 	}
 }
 
-static void cpg_test (
+static int cpg_test (
 	cpg_handle_t handle_in,
 	int write_size,
 	int delay_time,
@@ -458,6 +462,10 @@ static void cpg_test (
 			send_retries++;
 			goto resend;
 		}
+		if (res == CS_ERR_LIBRARY) {
+			send_counter--;
+			return -1;
+		}
 		if (res != CS_OK) {
 			cpgh_log_printf(CPGH_LOG_ERR, "send failed: %d\n", res);
 			send_fails++;
@@ -478,7 +486,7 @@ static void cpg_test (
 			cpgh_log_printf(CPGH_LOG_RTT, "RTT min/avg/max: %ld/%ld/%ld\n", min_rtt, avg_rtt, max_rtt);
 		}
 	}
-
+	return 0;
 }
 
 static void sigalrm_handler (int num)
@@ -719,9 +727,17 @@ int main (int argc, char *argv[]) {
 		break;
 	}
 
+	struct timespec tvreq = { .tv_sec = 0, .tv_nsec = 100000000 };
+
+reinitialize:
 	if (res != CS_OK) {
 		cpgh_log_printf(CPGH_LOG_ERR, "cpg_initialize failed with result %d\n", res);
-		exit (1);
+		if (res != CS_ERR_TRY_AGAIN) {
+			exit (1);
+		} else {
+			nanosleep(&tvreq, NULL);
+			goto reinitialize;
+		}
 	}
 	res = cpg_local_get(handle, &g_our_nodeid);
 	if (res != CS_OK) {
@@ -731,10 +747,16 @@ int main (int argc, char *argv[]) {
 
 	pthread_create (&thread, NULL, dispatch_thread, NULL);
 
+rejoin:
 	res = cpg_join (handle, &group_name);
 	if (res != CS_OK) {
 		cpgh_log_printf(CPGH_LOG_ERR, "cpg_join failed with result %d\n", res);
-		exit (1);
+		if (res != CS_ERR_TRY_AGAIN) {
+			exit (1);
+		} else {
+			nanosleep(&tvreq, NULL);
+			goto rejoin;
+		}
 	}
 
 	if (listen_only) {
@@ -786,7 +808,25 @@ int main (int argc, char *argv[]) {
 		else {
 			send_counter = -1; /* So we start from zero to allow listeners to sync */
 			for (i = 0; i < repetitions && !stopped; i++) {
-				cpg_test (handle, write_size, delay_time, print_time);
+				if (cpg_test (handle, write_size, delay_time, print_time) == -1) {
+					res = -1;
+					// TODO: Use 'proper' connect
+					while (res != CS_OK) {
+						printf("Reconnecting...\n");
+						if (handle) {
+							cpg_finalize(handle);
+						}
+						handle = 0;
+						res = cpg_initialize (&handle, &callbacks);
+						if (res == CS_OK) {
+							res = cpg_join (handle, &group_name);
+							pthread_cancel(thread);
+							pthread_create (&thread, NULL, dispatch_thread, NULL);
+						} else {
+							sleep(1);
+						}
+					}
+				}
 				signal (SIGALRM, sigalrm_handler);
 			}
 		}

and the following test setup:

  • all nodes with knet 1.22 and the corosync patch from @fabbione's last comment
  • node 1 and 2: no rate limit, ./cpghum -d0 -r10 -n node12
  • node 3: rate limit as described in OP, ./cpghum -d0 -r10 -n node3 and, in a second shell, while :; do date; systemctl restart corosync; sleep 20; done

after a few restarts I see CRC errors on nodes 1 and 2 (logs attached, generated by piping the above commands through ts). please note that in the log messages the "expected" CRC is the one calculated by the receiving node based on the (corrupt) data; the actually expected CRC is the one contained in the payload, which is referred to as the 'got' CRC.

cpghum_node1.txt
cpghum_node2.txt
cpghum_node3.txt
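For reference, a minimal Python sketch of the check that produces these log lines, using zlib's crc32 as cpghum does (the 4-byte header layout here is a simplification for illustration, not cpghum's actual header):

```python
import zlib

def make_packet(payload: bytes) -> bytes:
    # sender: prepend the crc32 of the payload (cpghum carries it in its header)
    crc = zlib.crc32(payload) & 0xffffffff
    return crc.to_bytes(4, "little") + payload

def crc_matches(pkt: bytes) -> bool:
    got = int.from_bytes(pkt[:4], "little")      # "got": CRC carried in the payload
    expected = zlib.crc32(pkt[4:]) & 0xffffffff  # "expected": recomputed by the receiver
    return got == expected

good = make_packet(b"\x2a" * 64)
assert crc_matches(good)

corrupt = bytearray(good)
corrupt[10] ^= 0xff       # flip one payload byte "in flight"
assert not crc_matches(bytes(corrupt))
```

Any single corrupted byte between sender and receiver makes the two CRCs disagree, which is why a mismatch is a reliable corruption signal even when the corruption site is unknown.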

removing the rate limit makes the problem go away, so it seems to be caused by backlog/backpressure somewhere.

some things worth noting:

cpghum has a single array in which the message data gets prepared. since we now fill it with static data, the only thing that changes between messages is the counter and timestamp encoded within the cpghum header; the actual payload that gets corrupted is always the same.

the corruption looks like (de)fragmenting gone wrong to me, but the offsets are pretty random


fabbione commented Oct 20, 2021

knet_chksum.diff.txt

knet checksum debug diff, on top of knet master.


fabbione commented Oct 21, 2021

One more datapoint: today I tested with 4 VMs, adding a "good" node4.

During the test, node2 received an incorrect hash. From that point on, all cpgverify processes appeared to be stuck for a long period of time. Restarting cpgverify on node2 did not produce any output for over a minute, after which node3 and node4 reported the incorrect hash and everything unblocked.


Fabian-Gruenbichler commented Oct 21, 2021

again with cpghum with a slight change from the previous iteration:

  • instead of setting dataint[i] = i, set it to the message's sequence number
  • adapt the log messages in case of CRC mismatch to print lines like XXXX: AAAAAAAA ..., where XXXX is the offset in bytes (of the payload, so this is AFTER the cpghum header), followed by 16 ints of data that should contain the sequence number and nothing else

the corruption looks like we get parts of later messages, but not always aligned at the int boundary.

cpghum_seq_payload_1.log.txt
cpghum_seq_payload_2.log.txt
cpghum_seq_payload_3.log.txt
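A small Python illustration of why this payload scheme makes foreign data stand out (helper names are hypothetical; the real layout is cpghum's, and in the misaligned case the foreign words would additionally be garbled rather than clean sequence numbers):

```python
import struct

def fill_payload(seq: int, nints: int) -> bytes:
    # every 32-bit word of the payload carries the message's sequence number
    return struct.pack(f"<{nints}I", *([seq] * nints))

def foreign_offsets(payload: bytes, seq: int) -> list:
    # byte offsets whose 32-bit word does not hold the expected sequence number
    words = struct.unpack(f"<{len(payload) // 4}I", payload)
    return [i * 4 for i, w in enumerate(words) if w != seq]

msg = bytearray(fill_payload(seq=41, nints=16))
# simulate defragmentation gone wrong: a slice of a later message (seq 42)
# lands in the middle of this one
msg[20:28] = fill_payload(seq=42, nints=2)
print(foreign_offsets(bytes(msg), 41))  # [20, 24]
```

The reported offsets immediately show where the foreign fragment landed and, via its sequence number, which later message it came from.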


fabbione commented Oct 24, 2021

@jfriesse please check out the knet urgent-fixes branch (based on master) and build knet with --enable-onwire-v1-extra-debug.

This option breaks onwire compatibility with stable1! So all nodes need to have this version.

It enables:
1. crc32 of incoming data received from corosync, just after the recvmsg/readv from the socket
2. crc32 for each network packet
3. verification of the crc32 of each network packet
4. verification of the crc32 of the data from 1, just before delivering to corosync

The code abort()s on checksum failures.

Knet did not, unfortunately, abort while reproducing this test case.
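The four verification points can be sketched as a toy pipeline (this is an illustrative Python model, not knet code; fragment size and function names are made up):

```python
import zlib

def crc(b: bytes) -> int:
    return zlib.crc32(b) & 0xffffffff

def knet_tx(data: bytes):
    app_crc = crc(data)                                  # (1) checksum data read from corosync
    frags = [data[i:i + 8] for i in range(0, len(data), 8)]
    return app_crc, [(f, crc(f)) for f in frags]         # (2) checksum each packet

def knet_rx(app_crc, wire_frags):
    for frag, c in wire_frags:
        assert crc(frag) == c, "corruption on the wire"  # (3) verify each packet
    data = b"".join(f for f, _ in wire_frags)
    assert crc(data) == app_crc, "corruption during reassembly"  # (4) verify before delivery
    return data

payload = bytes(range(32))
assert knet_rx(*knet_tx(payload)) == payload
```

Because checks (3) and (4) bracket everything knet does between receiving data from corosync and handing it back, none of them firing localizes the corruption to the layers above knet.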


fabbione commented Oct 25, 2021

@jfriesse so I added more debug to exclude knet as the source of the problem.

Please grab the extra-coro-debug branch from knet, based on master and build with:
configure --enable-onwire-v1-extra-debug CFLAGS="-I/home/fabbione/work/coro/corosync/include/"

(replace the include path to match your corosync include ;)).

This branch adds totem crc verification (based on @chrissie-c patch above) on both ends of knet data delivery sockets.

Once again, none of those abort()s trigger within knet. So at this point I am fairly confident we are triggering a problem in corosync, or in the delivery from corosync to cpgverify.

jfriesse added a commit to jfriesse/corosync that referenced this issue Oct 26, 2021
Better description to be made

Signed-off-by: Jan Friesse <jfriesse@redhat.com>

jfriesse commented Oct 26, 2021

For whoever is interested in this issue: could you please give #662 a try? I have only tested it with 2 nodes (not 3), and I'm still not sure whether it loses messages (or delivers messages that shouldn't be delivered), but that's the best I can provide for now.


fabbione commented Oct 27, 2021

@jfriesse I've been running #662 for over an hour now, on 4 nodes, without problems. I don't see any incorrect hashes from cpgverify. Not sure how to verify the message-loss / spurious-delivery concern though...


fabbione commented Oct 27, 2021

All, whoever is going to test this fix, please make sure to use the extra-coro-debug branch from knet. After almost 1.5 hours of testing, I was able to trigger one of the abort()s in the RX threads during packet crc32 verification, after reassembling the defrag buffers. The packet originated from the slow node.
I am currently re-running the test, though it is very possible we have another bug in knet at this point. The crc32 verification is all internal to knet.


fabbione commented Oct 27, 2021

I confirm there is also a bug in knet, and it just takes forever to reproduce. The interesting bit is that it reproduces only and exclusively on node3 (good) with a packet from node1 (slow).
@jfriesse: no "incorrect hash" from corosync though, with the #662 patch.


fabbione commented Oct 27, 2021

I found the issue in knet; it doesn't affect stable1, and I provided a temporary workaround in the extra-coro-debug branch so that @jfriesse can continue his testing.


fabbione commented Oct 27, 2021

for the record, the knet issue has been fixed here: kronosnet/kronosnet#366 and it only affects master branch.


Fabian-Gruenbichler commented Oct 29, 2021

so with the other two issues found along the way (totemsrp buffer confusion fixed in corosync #662, knet master defrag buffer shrink fixed in kronosnet#366) seemingly solved, circling back to the original issue:

  • same setup (3-node virtual cluster with two links, first link of third node rate-limited, third node is restarting corosync while some CPG activity is happening on all three nodes)
  • kronosnet 1.22
  • corosync 3.1.2 with #662 ("Hopefully solve #660") applied (3.1.5 was also affected)
  • debug set to trace this time

at 12:46:42/43 you can see the totem membership changing on all three nodes after corosync on the third node has started, with node 1&2 seeing 3 as joining, and 3 seeing 1&2 as joining

at 12:46:44 we see the first messages on nodes 1&2 about failing to use cpg_mcast_joined, which returns -6. each retry is spaced with a small delay:

        struct timespec tvreq = { .tv_sec = 0, .tv_nsec = 100000000 };
        cs_error_t result;
        int retries = 0;
loop:
        g_mutex_lock (&dfsm->cpg_mutex);
        result = cpg_mcast_joined(dfsm->cpg_handle, CPG_TYPE_AGREED, iov, len);
        g_mutex_unlock (&dfsm->cpg_mutex);
        if (retry && result == CS_ERR_TRY_AGAIN) {
                nanosleep(&tvreq, NULL);
                ++retries;
                if ((retries % 10) == 0)
                        cfs_dom_message(dfsm->log_domain, "cpg_send_message retry %d", retries);
                if (retries < 100)
                        goto loop;
        }

        if (retries)
                cfs_dom_message(dfsm->log_domain, "cpg_send_message retried %d times", retries);

        if (result != CS_OK &&
            (!retry || result != CS_ERR_TRY_AGAIN))
                cfs_dom_critical(dfsm->log_domain, "cpg_send_message failed: %d", result);

after 100 retries we abort attempting to send the message and bubble up the result, but subsequent error handling might involve sending further messages.

at 12:46:47 we have the first log message of our CPG consumer after hitting -6 on cpg_join 10 times (with a short delay between attempts; the retry logic is identical to the send code above). the cpg_join comes a bit later because on node 3 the CPG consumer tries to start the CPG-based state machine services, but if that fails (because corosync is not yet ready, causing cpg_initialize to fail), the repeated attempts are delayed by at least a second.

none of the three nodes recovers from this state, with nodes 1 & 2 fencing about a minute after node 3 joined because their fence-preventing watchdog expired for lack of write access. node 3 didn't have the watchdog active, else it would have shared that fate ;)

possibly relevant: the points in time when KNET notices links going up/down (with corosync restarted at 12:46:42 on node 3), interspersed with TOTEM membership changes; possibly there is confusion about which nodes to send traffic to? also note the delay with which link 0 (which is rate-limited for node 3) comes up compared to link 1

trace_1.txt:Oct 29 12:46:42 pve6ceph01 corosync[2313]:   [KNET  ] link: host: 3 link: 0 is down
trace_1.txt:Oct 29 12:46:43 pve6ceph01 corosync[2313]:   [TOTEM ] A new membership (1.30faf) was formed. Members joined: 3
trace_1.txt:Oct 29 12:46:44 pve6ceph01 corosync[2313]:   [KNET  ] link: host: 3 link: 1 is down
trace_1.txt:Oct 29 12:46:45 pve6ceph01 corosync[2313]:   [KNET  ] rx: host: 3 link: 1 is up
trace_1.txt:Oct 29 12:46:52 pve6ceph01 corosync[2313]:   [KNET  ] rx: host: 3 link: 0 is up


trace_2.txt:Oct 29 12:46:42 pve6ceph02 corosync[1553]:   [KNET  ] link: host: 3 link: 0 is down
trace_2.txt:Oct 29 12:46:43 pve6ceph02 corosync[1553]:   [TOTEM ] A new membership (1.30faf) was formed. Members joined: 3
trace_2.txt:Oct 29 12:46:52 pve6ceph02 corosync[1553]:   [KNET  ] rx: host: 3 link: 0 is up


trace_3.txt:Oct 29 12:46:40 pve6ceph03 corosync[27706]:   [TOTEM ] A new membership (3.30fab) was formed. Members joined: 3
trace_3.txt:Oct 29 12:46:42 pve6ceph03 corosync[27706]:   [KNET  ] rx: host: 2 link: 1 is up
trace_3.txt:Oct 29 12:46:42 pve6ceph03 corosync[27706]:   [KNET  ] rx: host: 1 link: 1 is up
trace_3.txt:Oct 29 12:46:42 pve6ceph03 corosync[27706]:   [TOTEM ] A new membership (1.30faf) was formed. Members joined: 1 2
trace_3.txt:Oct 29 12:46:51 pve6ceph03 corosync[27706]:   [KNET  ] rx: host: 1 link: 0 is up
trace_3.txt:Oct 29 12:46:51 pve6ceph03 corosync[27706]:   [KNET  ] rx: host: 2 link: 0 is up
trace_3.txt:Oct 29 12:47:47 pve6ceph03 corosync[27706]:   [KNET  ] link: host: 2 link: 1 is down
trace_3.txt:Oct 29 12:47:47 pve6ceph03 corosync[27706]:   [KNET  ] link: host: 1 link: 1 is down
trace_3.txt:Oct 29 12:47:49 pve6ceph03 corosync[27706]:   [KNET  ] link: host: 1 link: 0 is down
trace_3.txt:Oct 29 12:47:49 pve6ceph03 corosync[27706]:   [KNET  ] link: host: 2 link: 0 is down
trace_3.txt:Oct 29 12:47:54 pve6ceph03 corosync[27706]:   [TOTEM ] A new membership (3.30fb7) was formed. Members left: 1 2

the last down events on node 3 are when nodes 1/2 are fenced.

trace_node1.txt
trace_node2.txt
trace_node3.txt


fabbione commented Oct 30, 2021

@Fabian-Gruenbichler can you please also backport patch #652? I have seen something very similar while testing something else, and #652 did fix my problem.

jfriesse added a commit to jfriesse/corosync that referenced this issue Nov 2, 2021
Commit 92e0f9c added switching of
totempg buffers in sync phase. But because buffers got switch too early
there was a problem when delivering recovered messages (messages got
corrupted and/or lost). Solution is to switch buffers after recovered
messages got delivered.

I think it is worth describing the complete history with reproducers so it
doesn't get lost.

It all started with 4026389 (more info
about original problem is described in
https://bugzilla.redhat.com/show_bug.cgi?id=820821). This patch
solves a problem which can be reproduced with the following reproducer:
- 2 nodes
- Both nodes running corosync and testcpg
- Pause node 1 (SIGSTOP of corosync)
- On node 1, send some messages by testcpg
  (it's not answering but this doesn't matter; simply hitting the
  ENTER key a few times is enough)
- Wait till node 2 detects that node 1 left
- Unpause node 1 (SIGCONT of corosync)

and on node 1 newly mcasted cpg messages got sent before sync barrier,
so node 2 logs "Unknown node -> we will not deliver message".

Solution was to add switch of totemsrp new messages buffer.

This patch was not enough, so a new one
(92e0f9c) was created. The reproducer was similar, just cpgverify was
used instead of testcpg. Occasionally, when node 1 was unpaused, it
hung in the sync phase because there was a partial message in the
totempg buffers. The new sync message had a different frag cont, so it
was thrown away and never delivered.

After many years, the problem was found; it is solved by this patch
(original issue described in
corosync#660).
Reproducer is more complex:
- 2 nodes
- Node 1 is rate-limited (used script on the hypervisor side):
  ```
  iface=tapXXXX
  # ~0.1MB/s in bit/s
  rate=838856
  # 1mb/s
  burst=1048576
  tc qdisc add dev $iface root handle 1: htb default 1
  tc class add dev $iface parent 1: classid 1:1 htb rate ${rate}bps \
    burst ${burst}b
  tc qdisc add dev $iface handle ffff: ingress
  tc filter add dev $iface parent ffff: prio 50 basic police rate \
    ${rate}bps burst ${burst}b mtu 64kb "drop"
  ```
- Node 2 is running corosync and cpgverify
- Node 1 keeps restarting of corosync and running cpgverify in cycle, so
  - Console 1: while true; do corosync; sleep 20; kill $(pidof corosync); \
      sleep 20; done
  - Console 2: while true; do ./cpgverify;done

And from time to time (reproduced usually in less than 5 minutes), cpgverify
reports a corrupted message.

Signed-off-by: Jan Friesse <jfriesse@redhat.com>

jfriesse commented Nov 2, 2021

Closed #662 in favor of #663 (same patch, just a longer description).

As described, the original problem still persists and we keep working on it (a local reproducer is needed).


Fabian-Gruenbichler commented Nov 3, 2021

so here we go again, this time with

  • knet master with CRC debug verification
  • corosync 3.1.5 with #652 ("totem: Add cancel_hold_on_retransmit config option"), #662 ("Hopefully solve #660") and the CRC debug patch
  • three nodes, just running corosync and pmxcfs
  • node1 is rate limited and restarts corosync every 30s if pmxcfs is still accessible while :; do date; systemctl restart corosync; sleep 30; touch /etc/pve/testfile || break; done
  • node2 writes to pmxcfs roughly every 0.2s (to cause CPG activity): while :; do date; date > /etc/pve/testfile2 || break; sleep 0.2; done
  • node3 reads from and writes to pmxcfs roughly every 0.2s (to cause CPG activity): while :; do date; cat /etc/pve/testfile2 > /etc/pve/testfile3 || break; sleep 0.2; done

here's the output from the commands running on each node, indicating when things became inaccessible

node1 - restarting until 9:18:07; 30s after that last restart, pmxcfs on this node was no longer writable:

Wed 03 Nov 2021 09:12:48 AM CET
Wed 03 Nov 2021 09:13:26 AM CET
Wed 03 Nov 2021 09:13:57 AM CET
Wed 03 Nov 2021 09:14:28 AM CET
Wed 03 Nov 2021 09:15:00 AM CET
Wed 03 Nov 2021 09:15:31 AM CET
Wed 03 Nov 2021 09:16:02 AM CET
Wed 03 Nov 2021 09:16:33 AM CET
Wed 03 Nov 2021 09:17:05 AM CET
Wed 03 Nov 2021 09:17:36 AM CET
Wed 03 Nov 2021 09:18:07 AM CET
touch: setting times of '/etc/pve/testfile': Permission denied

node2 - writable until 9:18:10, when corosync on node1 was up again - access to pmxcfs then blocks because of lack of CPG progress, followed by unblocking at 9:26:12:

Wed 03 Nov 2021 09:12:48 AM CET
Wed 03 Nov 2021 09:12:49 AM CET
Wed 03 Nov 2021 09:12:49 AM CET
Wed 03 Nov 2021 09:12:49 AM CET
Wed 03 Nov 2021 09:12:49 AM CET
Wed 03 Nov 2021 09:12:50 AM CET
Wed 03 Nov 2021 09:12:50 AM CET
Wed 03 Nov 2021 09:12:50 AM CET
Wed 03 Nov 2021 09:12:50 AM CET
Wed 03 Nov 2021 09:12:50 AM CET
Wed 03 Nov 2021 09:12:51 AM CET
Wed 03 Nov 2021 09:12:51 AM CET
Wed 03 Nov 2021 09:12:51 AM CET
Wed 03 Nov 2021 09:12:51 AM CET
Wed 03 Nov 2021 09:12:51 AM CET
Wed 03 Nov 2021 09:12:52 AM CET
Wed 03 Nov 2021 09:12:52 AM CET
Wed 03 Nov 2021 09:12:52 AM CET
Wed 03 Nov 2021 09:12:52 AM CET
Wed 03 Nov 2021 09:12:52 AM CET
[...]
Wed 03 Nov 2021 09:18:01 AM CET
Wed 03 Nov 2021 09:18:01 AM CET
Wed 03 Nov 2021 09:18:01 AM CET
Wed 03 Nov 2021 09:18:01 AM CET
Wed 03 Nov 2021 09:18:02 AM CET
Wed 03 Nov 2021 09:18:02 AM CET
Wed 03 Nov 2021 09:18:02 AM CET
Wed 03 Nov 2021 09:18:02 AM CET
Wed 03 Nov 2021 09:18:03 AM CET
Wed 03 Nov 2021 09:18:03 AM CET
Wed 03 Nov 2021 09:18:03 AM CET
Wed 03 Nov 2021 09:18:03 AM CET
Wed 03 Nov 2021 09:18:04 AM CET
Wed 03 Nov 2021 09:18:04 AM CET
Wed 03 Nov 2021 09:18:04 AM CET
Wed 03 Nov 2021 09:18:04 AM CET
Wed 03 Nov 2021 09:18:04 AM CET
Wed 03 Nov 2021 09:18:05 AM CET
Wed 03 Nov 2021 09:18:05 AM CET
Wed 03 Nov 2021 09:18:05 AM CET
Wed 03 Nov 2021 09:18:05 AM CET
Wed 03 Nov 2021 09:18:05 AM CET
Wed 03 Nov 2021 09:18:06 AM CET
Wed 03 Nov 2021 09:18:06 AM CET
Wed 03 Nov 2021 09:18:06 AM CET
Wed 03 Nov 2021 09:18:06 AM CET
Wed 03 Nov 2021 09:18:07 AM CET
Wed 03 Nov 2021 09:18:07 AM CET
Wed 03 Nov 2021 09:18:07 AM CET
Wed 03 Nov 2021 09:18:07 AM CET
Wed 03 Nov 2021 09:18:07 AM CET
Wed 03 Nov 2021 09:18:08 AM CET
Wed 03 Nov 2021 09:18:08 AM CET
Wed 03 Nov 2021 09:18:08 AM CET
Wed 03 Nov 2021 09:18:08 AM CET
Wed 03 Nov 2021 09:18:08 AM CET
Wed 03 Nov 2021 09:18:09 AM CET
Wed 03 Nov 2021 09:18:09 AM CET
Wed 03 Nov 2021 09:18:09 AM CET
Wed 03 Nov 2021 09:18:09 AM CET
Wed 03 Nov 2021 09:18:09 AM CET
Wed 03 Nov 2021 09:18:10 AM CET
Wed 03 Nov 2021 09:18:10 AM CET
Wed 03 Nov 2021 09:18:10 AM CET
Wed 03 Nov 2021 09:18:10 AM CET
Wed 03 Nov 2021 09:18:10 AM CET
Wed 03 Nov 2021 09:26:12 AM CET
Wed 03 Nov 2021 09:26:15 AM CET
Wed 03 Nov 2021 09:26:16 AM CET
Wed 03 Nov 2021 09:26:16 AM CET
Wed 03 Nov 2021 09:26:16 AM CET
Wed 03 Nov 2021 09:26:17 AM CET
Wed 03 Nov 2021 09:26:17 AM CET
Wed 03 Nov 2021 09:26:17 AM CET
Wed 03 Nov 2021 09:26:17 AM CET
Wed 03 Nov 2021 09:26:18 AM CET
Wed 03 Nov 2021 09:26:18 AM CET
[...]
Wed 03 Nov 2021 09:28:02 AM CET
Wed 03 Nov 2021 09:28:02 AM CET
Wed 03 Nov 2021 09:28:02 AM CET
Wed 03 Nov 2021 09:28:02 AM CET
Wed 03 Nov 2021 09:28:03 AM CET
Wed 03 Nov 2021 09:28:03 AM CET
^C

node3 - similar to node2:

Wed 03 Nov 2021 09:12:48 AM CET
Wed 03 Nov 2021 09:12:48 AM CET
Wed 03 Nov 2021 09:12:48 AM CET
Wed 03 Nov 2021 09:12:48 AM CET
Wed 03 Nov 2021 09:12:49 AM CET
Wed 03 Nov 2021 09:12:49 AM CET
Wed 03 Nov 2021 09:12:49 AM CET
Wed 03 Nov 2021 09:12:49 AM CET
[...]
Wed 03 Nov 2021 09:18:06 AM CET
Wed 03 Nov 2021 09:18:07 AM CET
Wed 03 Nov 2021 09:18:07 AM CET
Wed 03 Nov 2021 09:18:07 AM CET
Wed 03 Nov 2021 09:18:07 AM CET
Wed 03 Nov 2021 09:18:07 AM CET
Wed 03 Nov 2021 09:18:08 AM CET
Wed 03 Nov 2021 09:18:08 AM CET
Wed 03 Nov 2021 09:18:08 AM CET
Wed 03 Nov 2021 09:18:08 AM CET
Wed 03 Nov 2021 09:18:08 AM CET
Wed 03 Nov 2021 09:18:09 AM CET
Wed 03 Nov 2021 09:18:09 AM CET
Wed 03 Nov 2021 09:18:09 AM CET
Wed 03 Nov 2021 09:18:09 AM CET
Wed 03 Nov 2021 09:18:09 AM CET
Wed 03 Nov 2021 09:18:10 AM CET
Wed 03 Nov 2021 09:18:10 AM CET
Wed 03 Nov 2021 09:18:10 AM CET
Wed 03 Nov 2021 09:18:10 AM CET
Wed 03 Nov 2021 09:18:11 AM CET
Wed 03 Nov 2021 09:26:12 AM CET
Wed 03 Nov 2021 09:26:15 AM CET
Wed 03 Nov 2021 09:26:16 AM CET
Wed 03 Nov 2021 09:26:16 AM CET
Wed 03 Nov 2021 09:26:16 AM CET
Wed 03 Nov 2021 09:26:16 AM CET
Wed 03 Nov 2021 09:26:17 AM CET
Wed 03 Nov 2021 09:26:17 AM CET
[...]
Wed 03 Nov 2021 09:28:01 AM CET
Wed 03 Nov 2021 09:28:01 AM CET
Wed 03 Nov 2021 09:28:01 AM CET
Wed 03 Nov 2021 09:28:01 AM CET
Wed 03 Nov 2021 09:28:02 AM CET
Wed 03 Nov 2021 09:28:02 AM CET
Wed 03 Nov 2021 09:28:02 AM CET
Wed 03 Nov 2021 09:28:02 AM CET
Wed 03 Nov 2021 09:28:02 AM CET
Wed 03 Nov 2021 09:28:03 AM CET
Wed 03 Nov 2021 09:28:03 AM CET
^C

here are the trace logs for each node, showing the now familiar pattern of a constant retransmit of a single seq number on one node:

log_node1.txt
log_node2.txt
log_node3.txt

@Fabian-Gruenbichler
Contributor Author

Fabian-Gruenbichler commented Nov 3, 2021

the previous logs are with knet defrag buffers capped at 32 (so min=max=32) to mimic knet 1.22 behaviour. I'll re-run with min=32 max=1024 (vanilla knet master)!

jfriesse added a commit that referenced this issue Nov 3, 2021
Commit 92e0f9c added switching of
totempg buffers in the sync phase. But because the buffers were switched
too early, there was a problem when delivering recovered messages
(messages got corrupted and/or lost). The solution is to switch buffers
after the recovered messages have been delivered.

I think it is worth describing the complete history with reproducers so
it doesn't get lost.

It all started with 4026389 (more info
about the original problem is described in
https://bugzilla.redhat.com/show_bug.cgi?id=820821). That patch
solves a problem which can be reproduced as follows:
- 2 nodes
- Both nodes running corosync and testcpg
- Pause node 1 (SIGSTOP of corosync)
- On node 1, send some messages via testcpg
  (it's not answering, but this doesn't matter; simply hitting the
  ENTER key a few times is enough)
- Wait till node 2 detects that node 1 left
- Unpause node 1 (SIGCONT of corosync)

On node 1, newly mcasted cpg messages got sent before the sync barrier,
so node 2 logged "Unknown node -> we will not deliver message".

The solution was to also switch the totemsrp new-messages buffer.

That patch was not enough, so a new one
(92e0f9c) was created. The reproducer was
similar, just with cpgverify used instead of testcpg.
Occasionally, when node 1 was unpaused, it hung in the sync phase
because there was a partial message in the totempg buffers. The new
sync message had a different frag count, so it was thrown away and
never delivered.

After many years, the problem solved by this patch was finally found
(the original issue is described in
#660).
Reproducer is more complex:
- 2 nodes
- Node 1 is rate-limited (used script on the hypervisor side):
  ```
  iface=tapXXXX
  # ~0.1MB/s in bit/s
  rate=838856
  # 1mb/s
  burst=1048576
  tc qdisc add dev $iface root handle 1: htb default 1
  tc class add dev $iface parent 1: classid 1:1 htb rate ${rate}bps \
    burst ${burst}b
  tc qdisc add dev $iface handle ffff: ingress
  tc filter add dev $iface parent ffff: prio 50 basic police rate \
    ${rate}bps burst ${burst}b mtu 64kb "drop"
  ```
- Node 2 is running corosync and cpgverify
- Node 1 keeps restarting corosync and running cpgverify in a cycle
  - Console 1: while true; do corosync; sleep 20; \
      kill $(pidof corosync); sleep 20; done
  - Console 2: while true; do ./cpgverify;done

And from time to time (reproduced usually in less than 5 minutes)
cpgverify reports corrupted message.

Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
@Fabian-Gruenbichler
Contributor Author

Fabian-Gruenbichler commented Nov 3, 2021

and now another log, this time with knet master with 1024 max defrag bufs and only two nodes (the second node has two votes and reads+writes from pmxcfs, the first node is rate limited and restarting and does a single write every 60s to check for accessibility)

twonode_node1.txt
twonode_node2.txt

@jfriesse
Member

jfriesse commented Nov 3, 2021

So, a small end-of-the-day update. I'm able to reproduce the bug with the following commands:

  • console 1 (node 1) - while :; do date; systemctl restart corosync; sleep 30; touch /etc/pve/testfile || break; done
  • console 2 (node 2) - corosync -f
  • console 3 (node 1) - pmxcfs -f
  • console 4 (node 2) - pmxcfs -f
  • console 5 (node 2) - while :; do date; cat /etc/pve/testfile2 > /dev/null || break; date > /etc/pve/testfile2 || break; sleep 0.2; done

The bug is reproduced when the command in console 1 stops. For now, it looks like node 1 doesn't get a message sent by node 2 (in my tests it was always seq 5). Node 2 seems to hand the seq 5 message to knet successfully, and knet probably receives it because it is sent back via loopback (so node 2 receives the seq 5 message), but for some reason it is not delivered to node 1. Tomorrow's task is to find out whether the packet is thrown away on node 1's corosync side, dropped by knet, or never delivered on the wire at all.

@fabbione
Member

fabbione commented Nov 5, 2021

@Fabian-Gruenbichler @jfriesse:

kronosnet/kronosnet@8935475

this is a workaround that is holding fine for me, using @jfriesse's environment, and the pmxcfs test has been running fine for about 3 hours.

I was able to debug to the point where, when reproducing the problem, *dst_seq_num is always 0. This causes havoc with both the defrag buffers and the other issue of "Packet already delivered".

I am not sure yet why this value is not updated during the reproducer; in various tests I tried setting it to the seq_num, but I am sure that is just hiding some other bug.

@jfriesse I will continue working over the weekend to find the root cause, but I might need access to your test nodes a bit longer, since I am unable to hit the issue locally.

Cheers
Fabio

@fabbione
Member

fabbione commented Nov 6, 2021

Test has been running overnight without problems.

A small correction: the workaround does not fix the "Packet has already been delivered" problem. It can be triggered with the same test env by reducing the sleep time from 30 to 15.

Fabio

@fabbione
Member

fabbione commented Nov 6, 2021

The "Packet has already been delivered" message is triggered when corosync restarts too fast.
What is happening is that node2 has no time to realize that node1 has gone away: it doesn't detect the link-down event, so it never resets the seq_num. node1 rejoins with seq = 1 (a fresh start), and node2 gets confused because it delivered those packets only a few ms before.

I understand the issue well; I need to think of a fix. In general we don't see this problem because the first ping from node1 would reset the counters, but in this degraded network environment that packet might not make it to node2.

@fabbione
Member

fabbione commented Nov 7, 2021

@jfriesse in the defrag-debug knet branch that I have deployed on your nodes, there is a partial fix for the "Packet has already been delivered" issue, and it doesn't appear to trigger anymore. Partial only because we still need to clean up the patch and handle all potential ICMP error codes. The patch itself marks a link down when the UDP socket returns an error while trying to send data to the other node (ECONNREFUSED in our case). This speeds up link-down detection a lot.
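
A hedged sketch of that idea (not the actual knet patch; the function name and exact errno set are illustrative): classify certain sendto() errno values on the UDP socket as evidence that the peer is gone and mark the link down immediately, instead of waiting for the ping/pong timeout.

```c
#include <errno.h>
#include <stdio.h>

/* Illustrative only: decide whether a sendto() errno should trigger
 * an immediate link-down, rather than waiting for the ping timeout.
 * On Linux, a connected UDP socket reports ECONNREFUSED when an ICMP
 * port-unreachable comes back, i.e. the peer process is not listening. */
static int send_errno_means_link_down(int err)
{
	switch (err) {
	case ECONNREFUSED: /* ICMP port unreachable: peer process is gone */
	case EHOSTUNREACH: /* ICMP host unreachable */
	case ENETUNREACH:  /* no route to the peer's network */
		return 1;
	case EAGAIN:       /* transient: socket buffer full, retry later */
	case EINTR:        /* transient: interrupted, retry */
	default:
		return 0;
	}
}

int main(void)
{
	printf("ECONNREFUSED -> link down: %d\n",
	       send_errno_means_link_down(ECONNREFUSED));
	printf("EAGAIN -> link down: %d\n",
	       send_errno_means_link_down(EAGAIN));
	return 0;
}
```

The design point is that transient errors must stay non-fatal, otherwise a momentarily full socket buffer would flap the link; only errors implying the peer is unreachable should force the link down.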

ProxBot pushed a commit to proxmox/kronosnet that referenced this issue Nov 9, 2021
see corosync/corosync#660 as well. these are
already queued for 1.23 and taken straight from stable1-proposed.

Acked-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Reviewed-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
@Fabian-Gruenbichler
Contributor Author

Fabian-Gruenbichler commented Nov 10, 2021

just to provide some closure here as well - this issue is fixed by

kronosnet/kronosnet@28ddb87

and

kronosnet/kronosnet@62271c5

which are currently also queued for knet 1.x

we (Proxmox) started rolling out fixed packages for PVE yesterday - will keep this issue updated in case we encounter more trouble ;)

@jfriesse
Member

jfriesse commented Nov 11, 2021

This issue was (hopefully) fixed by multiple patches across multiple projects, so I'm closing it for now. Please reopen the issue if it appears again.

jfriesse added a commit that referenced this issue Nov 15, 2021
Commit 92e0f9c added switching of
totempg buffers in the sync phase. But because the buffers were switched
too early, there was a problem when delivering recovered messages
(messages got corrupted and/or lost). The solution is to switch buffers
after the recovered messages have been delivered.

[remainder of the commit message is identical to the Nov 3 commit above]

Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
(cherry picked from commit e7a8237)
ProxBot pushed a commit to proxmox/kronosnet that referenced this issue Dec 1, 2021
see corosync/corosync#660 as well. these are
already queued for 1.23 and taken straight from stable1-proposed.

Acked-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Reviewed-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
(cherry picked from commit b3e4724)