node left cluster, caused full cluster fence, nodes could not rejoin after reboot #705

mvernimmen · 2022-09-16T10:52:59Z

Hi,

We've been having some issues with one of our Proxmox clusters recently. After investigating, the issue we are experiencing seems to be related to corosync behavior. This specific cluster has 28 nodes and can run stably for months. Yesterday we powered off one of the servers in the cluster. Immediately afterwards, all nodes got fenced and rebooted.
After the reboot, all nodes were in a state where they either did not rejoin or did not rejoin properly.
Restarting corosync on several machines did not change the situation, not even when all but a few machines were powered down. The only way out of this situation was to power down all machines and power them up again one by one.

I'm not a corosync expert by any stretch of the imagination. I provide here (https://forum.proxmox.com/attachments/archive-zip.41145/) some logs that I hope will enable someone to point to the reason for all nodes getting fenced, and hopefully also to figure out why the nodes would not rejoin.
This has happened several times recently and I expect it to happen a few more times. If any commands can be run to gain additional insight I'd be happy to run them during such an event.

I know this is not a very clear bug report and I'm very sorry for that. I'm hoping for some input to further this investigation and drill down to a clear root cause.

best regards,

Max

jfriesse · 2022-09-19T07:30:48Z

Hi,
I'm wondering if this could be similar to #701. Anyway, @Fabian-Gruenbichler (and the other Proxmox guys) will probably be able to identify the problem much better than I can, so I will let them share their view.

jfriesse · 2022-09-20T10:15:56Z

Proxmox thread https://forum.proxmox.com/threads/proxmox-7-2-issue-cluster-not-joining-back-together.115059/
