Question about behavior of corosync when older node comes online #701
Comments
could you post the old and new config contents (for versions 64 and 43)? logs covering the relevant timespan from nodes other than virt27 would also be interesting. thanks! |
@Fabian-Gruenbichler that was fast :) Thanks. |
Adding configs and logs from three other nodes from the same time. I also checked other nodes, but they look the same as these three. |
thanks! I assume you filtered the journal to just show the corosync unit? could you also include ...? in the meantime, I'll try to reproduce the issue! |
I would like to ask if you found a corosync crash dump on the fenced nodes? |
@jfriesse could you help me find where the crash dumps are? I tried some generic places but didn't see anything. |
@Fabian-Gruenbichler full logs from nodes 27, 2, 23 and 35 in that time range until the reboot; node 27 is the origin of our problems
okay, so that looks okay, other than the fact that node 27 starts shutting down at some point without any clear indication of what's triggering it. Is there some sort of monitoring or similar in place that could be the cause? Was there a manual shutdown triggered? There's no indication that pmxcfs couldn't communicate over corosync, or that anything else was hanging on the PVE side. I didn't manage to reproduce the issue using a small test cluster; I'll try with a bigger one next. |
@Fabian-Gruenbichler Yes, there were in fact 2 nodes in maintenance: node 28, which caused no problems, and node 27, which after boot caused the state we are talking about here. |
@rvojcik It really depends on the distribution (and config), so you can try:
|
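For reference, a rough sketch of places corosync crash dumps usually end up on a systemd-based system; the exact list suggested above was not preserved in this thread, so treat these as general hints rather than the original advice:

```sh
cat /proc/sys/kernel/core_pattern      # shows whether cores are piped to systemd-coredump
coredumpctl list corosync              # if systemd-coredump is handling core dumps
ls -l /var/lib/systemd/coredump/       # default storage location for systemd-coredump
ls -l /var/crash/ /var/lib/corosync/   # other common locations, depending on distro/config
journalctl -u corosync | grep -iE 'dumped core|segfault'   # journal evidence of a crash
```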
@jfriesse thx. I did some research and sadly our systems have coredumps disabled. It's because there is 512 GB of RAM, so coredumps would be bigger than the actual disks on the servers. :(
@rvojcik Ok, that makes sense ;) The reason I was asking is mainly to find out whether the problem was some suboptimal decision by pmxcfs to fence nodes which shouldn't have been fenced, or whether they were fenced because of a corosync crash. Have you found any evidence of corosync crashing? It should be in the journal, and there may be a file named |
@jfriesse I looked through our logs, and the only corosync instance that went "down" was on node 27; it didn't crash but exited with a message that the config version is different
I didn't find any signs that corosync crashed or went down on other nodes. We should see a systemd message that the unit crashed, like on node 27, but the others were OK. Another confirmation that corosync was functional is that the last message before fencing is from corosync, about host 27 missing and links going down (this is on all nodes)
|
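For context on the "version is different" message: corosync nodes compare totem.config_version when forming a membership, and a node carrying a lower version exits instead of joining, which matches node 27's behavior above. A minimal sketch of where that value lives; the cluster name is made up and the version numbers are the ones mentioned earlier in the thread:

```
# /etc/corosync/corosync.conf on the up-to-date nodes
totem {
    version: 2
    cluster_name: examplecluster   # illustrative name
    config_version: 64             # bumped on every cluster-wide config change
}

# The long-offline node 27 still carried config_version: 43, so on startup
# it detected the newer configuration in the running cluster and shut
# itself down rather than participate with an outdated member list.
```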
the strange thing is - even if pmxcfs became non-functional, there should still be log messages pertaining to the fencing - unless those were not persisted to disk and thus lost? |
@Fabian-Gruenbichler interesting observation. During the outage we also had a problem where some servers from another vendor detected a soft watchdog reset and we needed to press continue during boot :( so the outage was longer than it should have been, because it takes half an hour to boot the minimum number of nodes for quorum. I'm providing the full log from the start of the problem to the healthy state of node 27. I didn't see anything, but maybe you will. |
@Fabian-Gruenbichler any success with replicating our situation? |
unfortunately not so far - I'll give you an update at the end of next week (got a few PTO days coming up before that). |
Hi guys, one more question.
I'm doing a little testing, trying to shut down 2 nodes, and I saw this message in the logs; I'm not sure what it means from a fencing point of view. Full log
|
I also cannot reproduce this with a setup mimicking yours more closely (8 nodes with the current config, 1 node with the old config that references a lot more nodes, tried both with more than 16 nodes in total and with fewer). The 8 nodes continue to chug along, the "outdated" node refuses the mismatched config and is left out (with corosync exiting, just like in your case). A token timeout occurs if the token doesn't pass through the quorate part of the cluster within the expected time (the exact value is determined using a formula - see |
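The formula referenced above is, as far as I can tell, the one documented in corosync.conf(5): for clusters with more than two nodes the runtime token timeout is derived from token and token_coefficient. A rough illustration with the documented defaults (your config may override them):

```sh
# runtime token timeout = token + (number_of_nodes - 2) * token_coefficient
# with the documented defaults token = 1000 ms and token_coefficient = 650 ms:
#   35 nodes: 1000 + (35 - 2) * 650 = 22450 ms
#   13 nodes: 1000 + (13 - 2) * 650 =  8150 ms
# the effective value can be inspected at runtime, e.g.:
corosync-cmapctl | grep -i token
```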
@Fabian-Gruenbichler Sadly I'm unable to shed any more light - I have a pretty bad gut feeling about this "Token has not been ..." message, because I already have a similar bugzilla - https://bugzilla.redhat.com/show_bug.cgi?id=2084327 - and I have simply zero clue what is happening. It's probably not scheduling, because scheduling should result in a "Corosync main process was not scheduled ..." message. On the other hand, knet runs in different threads, so with a bit of "luck" the main process is scheduled and the knet threads are not? On yet another hand, the "Token has not been received ..." message was added as a warning and it fires at 75% of the token timeout, so maybe the process really is not scheduled for long enough to trigger this message and not the "Corosync main process ..." one. No matter what, we are pretty clueless about where to start without some reproducer - and I was also unable to reproduce the problem. @rvojcik Could you please share some info about the configuration? I expect these 35 nodes are real (physical) machines, right? I can also expect they have more than 1 core (so let's say >= 4), right? How heavily committed (loaded) are the machines? |
@rvojcik what kind of network setup is backing the corosync links? anything special (like bonds, vlans, jumbo frames/non-standard MTU)? |
@jfriesse @Fabian-Gruenbichler
There isn't any non-default network configuration/tuning; the MTU is the default 1500, no jumbo frames configured. Corosync uses both networks for checking other nodes (ring0_addr and ring1_addr) |
I'm trying to simulate our case in docker. I'll let you know about any updates and findings |
so the two corosync links correspond to the two bonds? or are they two vlans on the "uplink" bond? |
@Fabian-Gruenbichler yes, every 2x25Gbit bond corresponds to one corosync link. |
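For readers following along, a minimal sketch of how two links per node are usually expressed in corosync.conf; the node name and addresses below are made up for illustration:

```
nodelist {
    node {
        name: node27               # hypothetical node name
        nodeid: 27
        ring0_addr: 10.0.0.27      # link 0: first 2x25Gbit bond (illustrative address)
        ring1_addr: 10.0.1.27      # link 1: second 2x25Gbit bond (illustrative address)
    }
    # ... one node { } block per cluster member
}
```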
I'm trying to simulate our case. Do you know how I can check whether corosync is in a healthy state? I mean, in Proxmox there is ... Do you guys know how pve-ha-lrm checks the current state of the cluster? |
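Not answering for the maintainers, but a few standard commands report corosync/quorum health, and in a Proxmox cluster `pvecm status` wraps similar information. This is just a suggestion and not necessarily how pve-ha-lrm does its check:

```sh
corosync-quorumtool -s    # quorum state, expected/total votes, membership list
corosync-cfgtool -s       # per-link (knet) status for the local node
corosync-cmapctl -m stats # runtime statistics from the "stats" map
pvecm status              # Proxmox wrapper showing cluster and quorum information
```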
Hi guys.
I'm debugging recent behavior of a Proxmox cluster and my question is very much related to corosync itself.
Can someone explain it to me or give me a little bit deeper insight into it?
Situation
We had a cluster with 35 nodes in corosync. Node number 27 had problems with HW, so we decided to turn it off for later debugging and HW inspection.
In the meantime (several weeks) we changed the number of nodes.
We removed 22 nodes from the cluster.
After that we booted up the old node 27, which came online and tried to communicate with the rest of the cluster:
Then we got TOTEM and QUORUM info
At this point node 27 exits corosync, followed by messages about the corosync process shutting down
After that, all nodes in the cluster fall into the fencing state and reboot.
I'm trying to understand corosync's behavior and how one older node can create this kind of situation. We need to improve some of our processes and scripts to avoid this kind of state.
Thank you for any insight
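As a possible process improvement (a hedged sketch only, not an official recommendation): before powering a long-offline node back on, compare its stored config_version with the live cluster's and reconcile the configuration first. The paths below are the standard Proxmox/corosync ones; adjust for your setup:

```sh
# on a healthy cluster node: the config_version currently in use
grep config_version /etc/pve/corosync.conf

# on the node that was offline, before starting corosync / PVE services:
grep config_version /etc/corosync/corosync.conf

# if the versions differ, the stale node will refuse the mismatched config
# and exit on startup (as node 27 did here), so reconcile the config before
# letting it rejoin instead of booting it straight into the cluster
```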