CLOUDSTACK-10246 Fix Host HA and VM HA issues #2474
The HA logic just does not work. VMs with HA enabled would never restart after a host failure, so we had to redo most of that logic. There are comments inline with the code, and the general updated logic is described below. Sorry for the long notes...
PS. We enabled virtlockd (https://libvirt.org/locking-lockd.html) and highly recommend it, otherwise you can have VM HA start a VM on multiple hosts.
We are running KVM FYI.
If the investigators find the host is UP, but just the agent is unreachable
If the investigators find the host is DOWN
Next, in this example, when the host agent is back online again, it sends a power report to the management servers, and the management servers think the router was powered on OOB. However, the GUI will not show a control IP for the vRouter, and the DB will have the NEW control IP it tried to allocate during the failed VM HA event. This leaves us unable to communicate with the vRouter.
This PR does a simple check that we can still communicate with the vRouter after any OOB power-on occurs. If we can, then we have the correct control IP in the DB and we're good - so we do nothing. If we can't communicate with the vRouter after the OOB power-on, we do a reboot of the vRouter to fix it.
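The check described above can be sketched as follows. This is only an illustration of the decision, not the code in this PR: `handle_oob_power_on`, `can_reach`, and `reboot` are hypothetical names, and the real change lives in CloudStack's Java code.

```python
from typing import Callable


def handle_oob_power_on(
    router_id: str,
    can_reach: Callable[[str], bool],
    reboot: Callable[[str], None],
) -> str:
    """Decide what to do after a vRouter is reported as powered on OOB.

    `can_reach` and `reboot` are hypothetical callbacks standing in for
    whatever the management server uses to contact and restart the router.
    Returns the action taken: "none" or "reboot".
    """
    if can_reach(router_id):
        # The control IP stored in the DB still works, so leave the router alone.
        return "none"
    # The stored control IP is stale (for example, left over from the failed
    # VM HA attempt), so reboot the router to bring it back in sync.
    reboot(router_id)
    return "reboot"
```

Passing the reachability check and the reboot action in as callbacks keeps the decision itself easy to test without a running router.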
Overall looks good, but this needs extensive testing. Some minor remarks added.
Trillian test result (tid-2328)
@Slair1 I have no easy way to put this. You have four (4) PRs out and all fail all Travis runs. It must be that Travis doesn't like your GitHub handle, or so. Can you think of something you have or do that might cause this? In the same period, other PRs have passed Travis runs.