Question about behavior of corosync when older node comes online #701

Open
rvojcik opened this issue Jun 20, 2022 · 26 comments
rvojcik commented Jun 20, 2022

Hi guys.

I'm debugging recent behavior of a Proxmox cluster, and my question is closely related to corosync itself.

Can someone explain it to me, or give me a little deeper insight into it?

Situation

We had a cluster with 35 corosync nodes. Node number 27 had hardware problems, so we decided to turn it off for later debugging and hardware inspection.

In the meantime (several weeks) we changed the number of nodes:
we removed 22 nodes from the cluster.

After that we booted the old node 27, which came online and tried to communicate with the rest of the cluster:

Jun 17 12:17:26 virt27 corosync[7965]:   [MAIN  ] Corosync Cluster Engine 3.1.5 starting up
Jun 17 12:17:26 virt27 corosync[7965]:   [MAIN  ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf snmp pie relro bindnow
Jun 17 12:17:26 virt27 corosync[7965]:   [TOTEM ] Initializing transport (Kronosnet).
Jun 17 12:17:26 virt27 corosync[7965]:   [TOTEM ] totemknet initialized
Jun 17 12:17:26 virt27 corosync[7965]:   [KNET  ] common: crypto_nss.so has been loaded from /usr/lib/x86_64-linux-gnu/kronosnet/crypto_nss.so
Jun 17 12:17:26 virt27 corosync[7965]:   [SERV  ] Service engine loaded: corosync configuration map access [0]
Jun 17 12:17:26 virt27 corosync[7965]:   [QB    ] server name: cmap
Jun 17 12:17:26 virt27 corosync[7965]:   [SERV  ] Service engine loaded: corosync configuration service [1]
Jun 17 12:17:26 virt27 corosync[7965]:   [QB    ] server name: cfg
Jun 17 12:17:26 virt27 corosync[7965]:   [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Jun 17 12:17:26 virt27 corosync[7965]:   [QB    ] server name: cpg
Jun 17 12:17:26 virt27 corosync[7965]:   [SERV  ] Service engine loaded: corosync profile loading service [4]
Jun 17 12:17:26 virt27 corosync[7965]:   [SERV  ] Service engine loaded: corosync resource monitoring service [6]
Jun 17 12:17:26 virt27 corosync[7965]:   [WD    ] Watchdog not enabled by configuration
Jun 17 12:17:26 virt27 corosync[7965]:   [WD    ] resource load_15min missing a recovery key.
Jun 17 12:17:26 virt27 corosync[7965]:   [WD    ] resource memory_used missing a recovery key.
Jun 17 12:17:26 virt27 corosync[7965]:   [WD    ] no resources configured.
Jun 17 12:17:26 virt27 corosync[7965]:   [SERV  ] Service engine loaded: corosync watchdog service [7]
Jun 17 12:17:26 virt27 corosync[7965]:   [QUORUM] Using quorum provider corosync_votequorum
Jun 17 12:17:26 virt27 corosync[7965]:   [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Jun 17 12:17:26 virt27 corosync[7965]:   [QB    ] server name: votequorum
Jun 17 12:17:26 virt27 corosync[7965]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Jun 17 12:17:26 virt27 corosync[7965]:   [QB    ] server name: quorum
Jun 17 12:17:26 virt27 corosync[7965]:   [TOTEM ] Configuring link 0
Jun 17 12:17:26 virt27 corosync[7965]:   [TOTEM ] Configured link number 0: local addr: 192.168.251.183, port=5405
Jun 17 12:17:26 virt27 corosync[7965]:   [TOTEM ] Configuring link 1
Jun 17 12:17:26 virt27 corosync[7965]:   [TOTEM ] Configured link number 1: local addr: 192.168.232.77, port=5406
Jun 17 12:17:26 virt27 corosync[7965]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 0)
Jun 17 12:17:26 virt27 corosync[7965]:   [KNET  ] host: host: 1 has no active links
Jun 17 12:17:26 virt27 corosync[7965]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jun 17 12:17:26 virt27 corosync[7965]:   [KNET  ] host: host: 1 has no active links
Jun 17 12:17:26 virt27 corosync[7965]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jun 17 12:17:26 virt27 corosync[7965]:   [KNET  ] host: host: 1 has no active links
Jun 17 12:17:26 virt27 corosync[7965]:   [KNET  ] host: host: 10 (passive) best link: 0 (pri: 0)
Jun 17 12:17:26 virt27 corosync[7965]:   [KNET  ] host: host: 10 has no active links
Jun 17 12:17:26 virt27 corosync[7965]:   [KNET  ] host: host: 10 (passive) best link: 0 (pri: 1)
Jun 17 12:17:26 virt27 corosync[7965]:   [KNET  ] host: host: 10 has no active links
Jun 17 12:17:26 virt27 corosync[7965]:   [KNET  ] host: host: 10 (passive) best link: 0 (pri: 1)
Jun 17 12:17:26 virt27 corosync[7965]:   [KNET  ] host: host: 10 has no active links
Jun 17 12:17:26 virt27 corosync[7965]:   [QUORUM] Sync members[1]: 27
Jun 17 12:17:26 virt27 corosync[7965]:   [QUORUM] Sync joined[1]: 27
Jun 17 12:17:26 virt27 corosync[7965]:   [TOTEM ] A new membership (1b.c1c) was formed. Members joined: 27
Jun 17 12:17:26 virt27 corosync[7965]:   [KNET  ] host: host: 19 (passive) best link: 0 (pri: 0)
Jun 17 12:17:26 virt27 corosync[7965]:   [KNET  ] host: host: 19 has no active links
Jun 17 12:17:26 virt27 corosync[7965]:   [KNET  ] host: host: 19 (passive) best link: 0 (pri: 1)
Jun 17 12:17:26 virt27 corosync[7965]:   [KNET  ] host: host: 19 has no active links
Jun 17 12:17:26 virt27 corosync[7965]:   [KNET  ] host: host: 19 (passive) best link: 0 (pri: 1)
Jun 17 12:17:26 virt27 corosync[7965]:   [KNET  ] host: host: 19 has no active links
Jun 17 12:17:26 virt27 corosync[7965]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 0)
Jun 17 12:17:26 virt27 corosync[7965]:   [KNET  ] host: host: 2 has no active links
Jun 17 12:17:26 virt27 corosync[7965]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Jun 17 12:17:26 virt27 corosync[7965]:   [KNET  ] host: host: 2 has no active links
Jun 17 12:17:26 virt27 corosync[7965]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Jun 17 12:17:26 virt27 corosync[7965]:   [KNET  ] host: host: 2 has no active links
Jun 17 12:17:26 virt27 corosync[7965]:   [QUORUM] Members[1]: 27
Jun 17 12:17:26 virt27 corosync[7965]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jun 17 12:17:26 virt27 corosync[7965]:   [KNET  ] host: host: 20 (passive) best link: 0 (pri: 0)
Jun 17 12:17:26 virt27 corosync[7965]:   [KNET  ] host: host: 20 has no active links
Jun 17 12:17:26 virt27 corosync[7965]:   [KNET  ] host: host: 20 (passive) best link: 0 (pri: 1)
Jun 17 12:17:26 virt27 corosync[7965]:   [KNET  ] host: host: 20 has no active links
Jun 17 12:17:26 virt27 corosync[7965]:   [KNET  ] host: host: 20 (passive) best link: 0 (pri: 1)

--- MANY MESSAGES ABOUT NODES THAT NO LONGER EXIST ---

Then we get the TOTEM and QUORUM info:

Jun 17 12:17:39 virt27 corosync[7965]:   [KNET  ] rx: host: 35 link: 0 is up
Jun 17 12:17:39 virt27 corosync[7965]:   [KNET  ] rx: host: 34 link: 0 is up
Jun 17 12:17:39 virt27 corosync[7965]:   [KNET  ] rx: host: 33 link: 0 is up
Jun 17 12:17:39 virt27 corosync[7965]:   [KNET  ] rx: host: 31 link: 0 is up
Jun 17 12:17:39 virt27 corosync[7965]:   [KNET  ] rx: host: 30 link: 0 is up
Jun 17 12:17:39 virt27 corosync[7965]:   [KNET  ] rx: host: 29 link: 0 is up
Jun 17 12:17:39 virt27 corosync[7965]:   [KNET  ] rx: host: 25 link: 0 is up
Jun 17 12:17:39 virt27 corosync[7965]:   [KNET  ] rx: host: 26 link: 0 is up
Jun 17 12:17:39 virt27 corosync[7965]:   [KNET  ] rx: host: 28 link: 0 is up
Jun 17 12:17:39 virt27 corosync[7965]:   [KNET  ] rx: host: 32 link: 0 is up
Jun 17 12:17:39 virt27 corosync[7965]:   [KNET  ] rx: host: 24 link: 0 is up
Jun 17 12:17:39 virt27 corosync[7965]:   [KNET  ] rx: host: 23 link: 0 is up
Jun 17 12:17:39 virt27 corosync[7965]:   [KNET  ] rx: host: 2 link: 0 is up
Jun 17 12:17:39 virt27 corosync[7965]:   [KNET  ] host: host: 35 (passive) best link: 0 (pri: 1)
Jun 17 12:17:39 virt27 corosync[7965]:   [KNET  ] host: host: 34 (passive) best link: 0 (pri: 1)
Jun 17 12:17:39 virt27 corosync[7965]:   [KNET  ] host: host: 33 (passive) best link: 0 (pri: 1)
Jun 17 12:17:39 virt27 corosync[7965]:   [KNET  ] host: host: 31 (passive) best link: 0 (pri: 1)
Jun 17 12:17:39 virt27 corosync[7965]:   [KNET  ] host: host: 30 (passive) best link: 0 (pri: 1)
Jun 17 12:17:39 virt27 corosync[7965]:   [KNET  ] host: host: 29 (passive) best link: 0 (pri: 1)
Jun 17 12:17:39 virt27 corosync[7965]:   [KNET  ] host: host: 25 (passive) best link: 0 (pri: 1)
Jun 17 12:17:39 virt27 corosync[7965]:   [KNET  ] host: host: 26 (passive) best link: 0 (pri: 1)
Jun 17 12:17:39 virt27 corosync[7965]:   [KNET  ] host: host: 28 (passive) best link: 0 (pri: 1)
Jun 17 12:17:39 virt27 corosync[7965]:   [KNET  ] host: host: 32 (passive) best link: 0 (pri: 1)
Jun 17 12:17:39 virt27 corosync[7965]:   [KNET  ] host: host: 24 (passive) best link: 0 (pri: 1)
Jun 17 12:17:39 virt27 corosync[7965]:   [KNET  ] host: host: 23 (passive) best link: 0 (pri: 1)
Jun 17 12:17:39 virt27 corosync[7965]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Jun 17 12:17:39 virt27 corosync[7965]:   [QUORUM] Sync members[14]: 2 23 24 25 26 27 28 29 30 31 32 33 34 35
Jun 17 12:17:39 virt27 corosync[7965]:   [QUORUM] Sync joined[13]: 2 23 24 25 26 28 29 30 31 32 33 34 35
Jun 17 12:17:39 virt27 corosync[7965]:   [TOTEM ] A new membership (2.c84) was formed. Members joined: 2 23 24 25 26 28 29 30 31 32 33 34 35
Jun 17 12:17:39 virt27 corosync[7965]:   [CMAP  ] Received config version (64) is different than my config version (43)! Exiting

At this point node 27 exits corosync, and the log continues with messages about the corosync process shutting down:

Jun 17 12:17:39 virt27 corosync[7965]:   [SERV  ] Unloading all Corosync service engines.
Jun 17 12:17:39 virt27 corosync[7965]:   [QB    ] withdrawing server sockets
Jun 17 12:17:39 virt27 corosync[7965]:   [SERV  ] Service engine unloaded: corosync vote quorum service v1.0
Jun 17 12:17:39 virt27 corosync[7965]:   [QB    ] withdrawing server sockets
Jun 17 12:17:39 virt27 corosync[7965]:   [SERV  ] Service engine unloaded: corosync configuration map access
Jun 17 12:17:39 virt27 corosync[7965]:   [QB    ] withdrawing server sockets
Jun 17 12:17:39 virt27 corosync[7965]:   [SERV  ] Service engine unloaded: corosync configuration service
Jun 17 12:17:39 virt27 corosync[7965]:   [QB    ] withdrawing server sockets
Jun 17 12:17:39 virt27 corosync[7965]:   [SERV  ] Service engine unloaded: corosync cluster closed process group service v1.01
Jun 17 12:17:39 virt27 corosync[7965]:   [QB    ] withdrawing server sockets
Jun 17 12:17:39 virt27 corosync[7965]:   [SERV  ] Service engine unloaded: corosync cluster quorum service v0.1
Jun 17 12:17:39 virt27 corosync[7965]:   [SERV  ] Service engine unloaded: corosync profile loading service
Jun 17 12:17:39 virt27 corosync[7965]:   [SERV  ] Service engine unloaded: corosync resource monitoring service
Jun 17 12:17:39 virt27 corosync[7965]:   [SERV  ] Service engine unloaded: corosync watchdog service
Jun 17 12:17:40 virt27 corosync[7965]:   [KNET  ] pmtud: PMTUD link change for host: 35 link: 0 from 469 to 1397
Jun 17 12:17:40 virt27 corosync[7965]:   [KNET  ] pmtud: PMTUD link change for host: 34 link: 0 from 469 to 1397
Jun 17 12:17:40 virt27 corosync[7965]:   [KNET  ] pmtud: PMTUD link change for host: 33 link: 0 from 469 to 1397
Jun 17 12:17:40 virt27 corosync[7965]:   [KNET  ] pmtud: PMTUD link change for host: 32 link: 0 from 469 to 1397
Jun 17 12:17:40 virt27 corosync[7965]:   [KNET  ] pmtud: PMTUD link change for host: 31 link: 0 from 469 to 1397
Jun 17 12:17:40 virt27 corosync[7965]:   [KNET  ] pmtud: PMTUD link change for host: 30 link: 0 from 469 to 1397
Jun 17 12:17:40 virt27 corosync[7965]:   [KNET  ] pmtud: PMTUD link change for host: 29 link: 0 from 469 to 1397
Jun 17 12:17:40 virt27 corosync[7965]:   [KNET  ] pmtud: PMTUD link change for host: 28 link: 0 from 469 to 1397
Jun 17 12:17:40 virt27 corosync[7965]:   [KNET  ] pmtud: PMTUD link change for host: 26 link: 0 from 469 to 1397
Jun 17 12:17:40 virt27 corosync[7965]:   [KNET  ] pmtud: PMTUD link change for host: 25 link: 0 from 469 to 1397
Jun 17 12:17:40 virt27 corosync[7965]:   [KNET  ] pmtud: PMTUD link change for host: 24 link: 0 from 469 to 1397
Jun 17 12:17:40 virt27 corosync[7965]:   [KNET  ] pmtud: PMTUD link change for host: 23 link: 0 from 469 to 1397
Jun 17 12:17:40 virt27 corosync[7965]:   [KNET  ] pmtud: PMTUD link change for host: 2 link: 0 from 469 to 1397
Jun 17 12:17:40 virt27 corosync[7965]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Jun 17 12:17:40 virt27 corosync[7965]:   [KNET  ] host: host: 18 (passive) best link: 0 (pri: 0)
Jun 17 12:17:40 virt27 corosync[7965]:   [KNET  ] host: host: 18 has no active links
Jun 17 12:17:40 virt27 corosync[7965]:   [MAIN  ] Corosync Cluster Engine exiting normally
Jun 17 12:17:40 virt27 systemd[1]: corosync.service: Control process exited, code=exited, status=1/FAILURE
Jun 17 12:17:40 virt27 systemd[1]: corosync.service: Failed with result 'exit-code'.

After that, all nodes in the cluster fell into a fencing state and rebooted.

I'm trying to understand the corosync behavior here, i.e. how one outdated node can create this kind of situation. We need to improve some of our processes and scripts to avoid this kind of state.
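For context, the "Exiting" message above is driven by the config_version option in the totem section of corosync.conf. A minimal sketch of how a stale node could be detected before starting corosync on it (standard corosync tooling; exact output will differ):

    # on the returning node: the version baked into its local config
    grep config_version /etc/corosync/corosync.conf
    #   config_version: 43

    # on any healthy member: the version the cluster currently runs
    corosync-cmapctl totem.config_version
    #   prints the cluster-wide value, e.g. 64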

Thank you for any insight

@Fabian-Gruenbichler (Contributor)

could you post the old and new config contents (for versions 64 and 43)? logs covering the relevant timespan from nodes other than virt27 would also be interesting. thanks!

rvojcik (Author) commented Jun 20, 2022

@Fabian-Gruenbichler that was fast :) Thanks.
I will post everything in a few hours (traveling right now).

rvojcik (Author) commented Jun 20, 2022

Adding configs and logs from 3 other nodes from the same time. I also checked the other nodes, but they look the same as these three.
corosync-v43.txt
corosync-v64.txt
virt2-corosync.log.txt
virt23-corosync.log.txt
virt35-corosync.log.txt

@Fabian-Gruenbichler (Contributor)

thanks! I assume you filtered the journal to just show the corosync unit? could you also include pve-cluster.service (or if you want/can, the full unfiltered output starting at 12:17:26 ;))

in the meantime, I'll try to reproduce the issue!
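For reference, a journal query along these lines would produce the requested output (start timestamp taken from the log above; the window is illustrative):

    # both units, starting shortly before the restart on node 27
    journalctl -u corosync -u pve-cluster --since "2022-06-17 12:17:00"
    # or the full unfiltered journal for the same window
    journalctl --since "2022-06-17 12:17:00" --until "2022-06-17 12:30:00"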

@jfriesse (Member)

I would like to ask whether you found a corosync crash dump on the fenced nodes?

rvojcik (Author) commented Jun 21, 2022

@jfriesse could you point me to where I can find crash dumps? I tried some common places but didn't see anything.

rvojcik (Author) commented Jun 21, 2022

@Fabian-Gruenbichler full logs from nodes 27, 2, 23 and 35 in that time range, up until the reboot:
virt2-full.log.txt
virt23-full.log.txt
virt35-full.log.txt

and the origin of our problems, node 27:
virt27-full.log.txt

@Fabian-Gruenbichler (Contributor)

okay, so that looks okay, other than the fact that node 27 starts shutting down at some point without any clear indication of what's triggering it. Is there some sort of monitoring (or similar) in place that could be the cause? Was there a manual shutdown triggered? There's no indication that pmxcfs couldn't communicate over corosync, or that anything else is hanging on the PVE side.

I didn't manage to reproduce the issue using a small test cluster; I'll try with a bigger one next.

rvojcik (Author) commented Jun 21, 2022

@Fabian-Gruenbichler Yes, there were in fact 2 nodes in maintenance: node 28, which caused no problems, and node 27, which after boot caused the state we are talking about here.
My colleague in the datacenter turned off both nodes 27 and 28 when he realized there was a serious condition in monitoring (many services down). Originally he thought the network had somehow been broken by these 2 nodes, so he decided to turn them off.

@jfriesse (Member)

@rvojcik It really depends on the distribution (and config), so you can try the following (see the sketch below):

  • check /var/lib/corosync
  • check the coredumpctl command
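A minimal sketch of those two checks (standard paths and commands; output varies by distribution):

    ls -l /var/lib/corosync/      # corosync writes fdata-* dump files here after a crash
    coredumpctl list corosync     # core dumps captured by systemd-coredump, if enabled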

rvojcik (Author) commented Jun 23, 2022

@jfriesse thanks. I did some research, and sadly our systems have coredumps disabled. That's because there is 512 GB of RAM, so coredumps would be bigger than the actual disks in the servers. :(

@jfriesse (Member)

@rvojcik Ok, that makes sense ;) The reason I was asking is mainly to find out whether the problem was some suboptimal decision by pmxcfs to fence nodes which shouldn't have been fenced, or whether they were fenced because of a corosync crash. Have you found any evidence of corosync crashing? It should be in the journal, and there may be a file named fdata-$DATE created in /var/lib/corosync.

rvojcik (Author) commented Jun 23, 2022

@jfriesse I looked through our logs, and the only corosync instance that went "down" was on node 27; it didn't crash, but exited with the message that the config version is different:

Jun 17 12:17:39 virt27 corosync[7965]:   [CMAP  ] Received config version (64) is different than my config version (43)! Exiting

I didn't find any signs that corosync crashed or went down on the other nodes. We should have seen a systemd message that the unit crashed, like on node 27, but the others were OK.

Another confirmation that corosync was functional is that the last message before fencing comes from corosync, about host 27 going missing and its links going down (this is on all nodes):

Jun 17 12:17:39 virt23 corosync[33549]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jun 17 12:17:40 virt23 corosync[33549]:   [KNET  ] link: host: 27 link: 0 is down
Jun 17 12:17:40 virt23 corosync[33549]:   [KNET  ] host: host: 27 (passive) best link: 1 (pri: 1)
Jun 17 12:17:41 virt23 corosync[33549]:   [KNET  ] link: host: 27 link: 1 is down
Jun 17 12:17:41 virt23 corosync[33549]:   [KNET  ] host: host: 27 (passive) best link: 1 (pri: 1)
Jun 17 12:17:41 virt23 corosync[33549]:   [KNET  ] host: host: 27 has no active links

--- REBOOT ---

@Fabian-Gruenbichler (Contributor)

the strange thing is - even if pmxcfs became non-functional, there should still be log messages pertaining to the fencing - unless those were not persisted to disk and thus lost?

rvojcik (Author) commented Jun 23, 2022

@Fabian-Gruenbichler interesting observation. During the outage we also had a problem where some servers from another vendor detected a soft watchdog reset and required pressing "continue" during boot :( so the outage was longer than it should have been, because it took half an hour to boot the minimum number of nodes needed for quorum.

I'm providing the full log from the start of the problem to node 27 reaching a healthy state. I didn't see anything, but maybe you will:
virt27-long.log.txt

rvojcik (Author) commented Jun 30, 2022

@Fabian-Gruenbichler any success with replicating our situation?

@Fabian-Gruenbichler (Contributor)

unfortunately not so far - I'll give you an update at the end of next week (got a few PTO days coming up before that...)

rvojcik (Author) commented Jul 7, 2022

Hi guys, one more question.
Can you explain this message from the corosync log to me?

corosync[48695]:   [TOTEM ] Token has not been received in 9489 ms

I'm doing a little testing, shutting down 2 nodes, and I saw this message in the logs; I'm not sure what it means from a fencing point of view.
Thanks!

Full log

Jul 07 06:30:28 virt2 corosync[48695]:   [TOTEM ] Retransmit List: 7c9a61
Jul 07 07:32:07 virt2 corosync[48695]:   [TOTEM ] Retransmit List: 7e6d0b
Jul 07 10:01:19 virt2 corosync[48695]:   [CFG   ] Node 24 was shut down by sysadmin
Jul 07 10:01:19 virt2 corosync[48695]:   [QUORUM] Sync members[13]: 2 23 25 26 27 28 29 30 31 32 33 34 35
Jul 07 10:01:19 virt2 corosync[48695]:   [QUORUM] Sync left[1]: 24
Jul 07 10:01:19 virt2 corosync[48695]:   [TOTEM ] A new membership (2.d43) was formed. Members left: 24
Jul 07 10:01:19 virt2 corosync[48695]:   [QUORUM] Members[13]: 2 23 25 26 27 28 29 30 31 32 33 34 35
Jul 07 10:01:19 virt2 corosync[48695]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jul 07 10:01:20 virt2 corosync[48695]:   [KNET  ] link: host: 24 link: 0 is down
Jul 07 10:01:20 virt2 corosync[48695]:   [KNET  ] host: host: 24 (passive) best link: 1 (pri: 1)
Jul 07 10:01:21 virt2 corosync[48695]:   [KNET  ] link: host: 24 link: 1 is down
Jul 07 10:01:21 virt2 corosync[48695]:   [KNET  ] host: host: 24 (passive) best link: 1 (pri: 1)
Jul 07 10:01:21 virt2 corosync[48695]:   [KNET  ] host: host: 24 has no active links
Jul 07 10:01:31 virt2 corosync[48695]:   [CFG   ] Node 34 was shut down by sysadmin
Jul 07 10:01:31 virt2 corosync[48695]:   [QUORUM] Sync members[12]: 2 23 25 26 27 28 29 30 31 32 33 35
Jul 07 10:01:31 virt2 corosync[48695]:   [QUORUM] Sync left[1]: 34
Jul 07 10:01:31 virt2 corosync[48695]:   [TOTEM ] A new membership (2.d47) was formed. Members left: 34
Jul 07 10:01:31 virt2 corosync[48695]:   [QUORUM] Members[12]: 2 23 25 26 27 28 29 30 31 32 33 35
Jul 07 10:01:31 virt2 corosync[48695]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jul 07 10:01:32 virt2 corosync[48695]:   [KNET  ] link: host: 34 link: 0 is down
Jul 07 10:01:32 virt2 corosync[48695]:   [KNET  ] host: host: 34 (passive) best link: 1 (pri: 1)
Jul 07 10:01:33 virt2 corosync[48695]:   [KNET  ] link: host: 34 link: 1 is down
Jul 07 10:01:33 virt2 corosync[48695]:   [KNET  ] host: host: 34 (passive) best link: 1 (pri: 1)
Jul 07 10:01:33 virt2 corosync[48695]:   [KNET  ] host: host: 34 has no active links
Jul 07 10:09:24 virt2 corosync[48695]:   [KNET  ] rx: host: 24 link: 0 is up
Jul 07 10:09:24 virt2 corosync[48695]:   [KNET  ] rx: host: 24 link: 1 is up
Jul 07 10:09:24 virt2 corosync[48695]:   [KNET  ] host: host: 24 (passive) best link: 0 (pri: 1)
Jul 07 10:09:24 virt2 corosync[48695]:   [KNET  ] host: host: 24 (passive) best link: 0 (pri: 1)
Jul 07 10:09:27 virt2 corosync[48695]:   [KNET  ] rx: host: 34 link: 0 is up
Jul 07 10:09:27 virt2 corosync[48695]:   [KNET  ] rx: host: 34 link: 1 is up
Jul 07 10:09:27 virt2 corosync[48695]:   [KNET  ] host: host: 34 (passive) best link: 0 (pri: 1)
Jul 07 10:09:27 virt2 corosync[48695]:   [KNET  ] host: host: 34 (passive) best link: 0 (pri: 1)
Jul 07 10:09:33 virt2 corosync[48695]:   [TOTEM ] Token has not been received in 9489 ms
Jul 07 10:09:35 virt2 corosync[48695]:   [QUORUM] Sync members[14]: 2 23 24 25 26 27 28 29 30 31 32 33 34 35
Jul 07 10:09:35 virt2 corosync[48695]:   [QUORUM] Sync joined[2]: 24 34
Jul 07 10:09:35 virt2 corosync[48695]:   [TOTEM ] A new membership (2.d4f) was formed. Members joined: 24 34
Jul 07 10:09:35 virt2 corosync[48695]:   [QUORUM] Members[14]: 2 23 24 25 26 27 28 29 30 31 32 33 34 35
Jul 07 10:09:35 virt2 corosync[48695]:   [MAIN  ] Completed service synchronization, ready to provide service.

@Fabian-Gruenbichler (Contributor)

I also cannot reproduce this with a setup mimicking yours more closely (8 nodes with the current config, 1 node with an old config that references a lot more nodes, tried with both more and fewer than 16 nodes total). The 8 nodes continue to chug along; the "outdated" node refuses the mismatched config and is left out (with corosync exiting, just like in your case).

A token timeout occurs if the token doesn't pass through the quorate part of the cluster within the expected time (the exact value is determined using a formula - see man corosync.conf). Usually it would indicate some sort of network problem or config inconsistency, but it can also be caused by nodes being overloaded and not scheduling the corosync process (should be logged as such) or similar issues. In your specific case it was probably triggered by the membership change a few seconds earlier, but @jfriesse might be able to shed more light ;)
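For reference, corosync.conf(5) gives the formula behind that timeout: with a nodelist of 3 or more nodes, the effective token timeout is token + (number_of_nodes - 2) * token_coefficient. A sketch with the upstream defaults (not necessarily this cluster's values):

    totem {
        token: 1000              # base token timeout in ms
        token_coefficient: 650   # ms added per node beyond the first two
    }

    # the computed runtime value can be read from cmap:
    corosync-cmapctl runtime.config.totem.token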

jfriesse (Member) commented Jul 7, 2022

@Fabian-Gruenbichler Sadly, I'm unable to shed any more light. I have a pretty bad gut feeling about this "Token has not been ..." message, because I've already got a similar bugzilla - https://bugzilla.redhat.com/show_bug.cgi?id=2084327 - and I simply have zero clue what is happening.

It's probably not scheduling, because a scheduling problem should result in the "Corosync main process was not scheduled ..." message. On the other hand, knet runs in different threads, so with a bit of "luck" the main process is scheduled while the knet threads are not? Then again, the "Token has not been received ..." message was added as a warning and fires at 75% of the token timeout, so maybe the process really is not scheduled for long enough to trigger this message but not the "Corosync main process ..." one.

No matter what, we are pretty clueless about where to start without a reproducer - and I was also unable to reproduce the problem.

@rvojcik Could you please share some info about the configuration? I expect these 35 nodes are real (physical) machines, right? I can also expect they have more than 1 core (let's say >= 4), right? And how heavily committed (loaded) are the machines?

@Fabian-Gruenbichler (Contributor)

@rvojcik what kind of network setup is backing the corosync links? anything special (like bonds, vlans, jumbo frames/non-standard MTU)?

rvojcik (Author) commented Jul 8, 2022

@jfriesse @Fabian-Gruenbichler
Our setup is that every node is a physical rack server. Every node has:

  • 2x 25Gbit for uplink (all VLANs, internet and intranet networks), in an active-backup bonding configuration
  • 2x 25Gbit for storage (Ceph storage network), in an LACP bonding configuration (separate storage network on its own switches)
  • 512 GB RAM per node
  • 1 AMD EPYC 7F72 24-Core Processor, 24C, 48T
  • Proxmox with Open vSwitch

There isn't any non-default network configuration/tuning; the MTU is the default 1500, with no jumbo frames configured.

Corosync uses both networks for checking the other nodes (ring0_addr and ring1_addr); a sketch of the corresponding nodelist entry follows below.
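For illustration, the matching nodelist entry with two links would look roughly like this (addresses taken from the virt27 log at the top of this issue):

    node {
        name: virt27
        nodeid: 27
        ring0_addr: 192.168.251.183   # corosync link 0 (port 5405)
        ring1_addr: 192.168.232.77    # corosync link 1 (port 5406)
    }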

rvojcik (Author) commented Jul 8, 2022

I'm trying to simulate our case in Docker. I'll let you know about any updates and findings.

@Fabian-Gruenbichler (Contributor)

so the two corosync links correspond to the two bonds? or are they two vlans on the "uplink" bond?

rvojcik (Author) commented Jul 8, 2022

@Fabian-Gruenbichler yes, each 2x 25Gbit bond corresponds to one corosync link.
Storage bond: there is "no vlan", meaning the native VLAN is tagged on the switches.
Uplink bond: several VLANs; corosync uses management IPs from the management VLAN, which is the native VLAN for that network.

rvojcik (Author) commented Jul 8, 2022

I'm trying to simulate our case. Do you know how I can check whether corosync is in a healthy state? I mean, in Proxmox there is the pve-ha-lrm service, which somehow checks the corosync state and resets the softdog counter.

Do you guys know how pve-ha-lrm checks the current state of the cluster?
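For what it's worth, two standard commands for checking cluster health by hand (these show what corosync itself sees; how pve-ha-lrm does its check internally is a separate question):

    corosync-quorumtool -s    # membership, votes and quorum status from corosync
    pvecm status              # Proxmox wrapper showing similar cluster/quorum info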
