We've been having some issues with one of our Proxmox clusters recently. After investigating, the issue we are experiencing seems to be related to corosync behavior. This specific cluster has 28 nodes and can run stably for months. Yesterday we powered off one of the servers in the cluster. Immediately afterwards, all nodes got fenced and rebooted.
After the reboot, all nodes were in a state where they either did not rejoin the cluster or did not rejoin properly.
Restarting corosync on several machines did not change the situation, even with all but a few machines powered down. The only way out of this situation was to power down all machines and power them up again one by one.
I'm not a corosync expert by any stretch of the imagination. I have uploaded some logs here (https://forum.proxmox.com/attachments/archive-zip.41145/) that I hope will enable someone to point to the reason all nodes got fenced, and hopefully also to figure out why the nodes would not rejoin.
This has happened several times recently and I expect it to happen a few more times. If any commands can be run to gain additional insight I'd be happy to run them during such an event.
I know this is not a very clear bug report and I'm very sorry for that. I'm hoping for some input to further this investigation and drill down to a clear root cause.
best regards,
Max
Hi,
I'm wondering if this could be similar to #701. In any case, @Fabian-Gruenbichler (and the other Proxmox folks) will probably be able to identify the problem much better than I can, so I will let them share their view.