CLOUDSTACK-10246 Fix Host HA and VM HA issues #2474

Open · wants to merge 1 commit into base: 4.9

Conversation

@Slair1
Contributor

Slair1 commented Mar 1, 2018

The HA logic simply does not work: VMs with HA enabled would never restart after a host failure. I had to redo most of that logic. There are comments inline with the code, but the general updated logic is below. Sorry for the long notes...

P.S. We enabled virtlockd (https://libvirt.org/locking-lockd.html) and highly recommend it; otherwise VM HA can start the same VM on multiple hosts.

FYI, we are running KVM.

  • If the host agent is unreachable, handleDisconnectWithInvestigation() is called, as always.
  • The investigators are called to see what happened, which results in one of the two scenarios below. (If it isn't one of those two, then the host just came back UP, or another status was returned, which is also logged. But the two scenarios below are what needed updating the most.)

If the investigators find the host is UP, but just the agent is unreachable
The host is put into DISCONNECTED status. It stays in this status, and the ping timeouts continue to call handleDisconnectWithoutInvestigation() periodically, until the AlertWait config option expires. If the AlertWait time is eventually hit and the investigators are still reporting the host as DISCONNECTED rather than DOWN, we put the host into ALERT state, and it stays there until the investigators say the host is UP or DOWN. If the host goes DOWN, VM HA is initiated.

If the investigators find the host is DOWN
Then VM HA is initiated...
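
A minimal sketch of the transition logic described above; the names, signatures, and seconds-based timing are illustrative stand-ins, not the actual CloudStack code:

// Hypothetical, simplified sketch of the updated disconnect handling.
enum Status { UP, DISCONNECTED, ALERT, DOWN }

final class DisconnectLogicSketch {
    // Returns the next host state given what the investigators reported.
    static Status nextState(Status investigated, long lastPingedSeconds,
                            long nowSeconds, long alertWaitSeconds) {
        if (investigated == Status.UP) {
            return Status.UP; // host came back up; nothing to do
        }
        if (investigated == Status.DISCONNECTED) {
            // Host is up but the agent is unreachable: stay DISCONNECTED until
            // AlertWait expires, then escalate to ALERT and wait there until
            // the investigators report UP or DOWN.
            return (nowSeconds - lastPingedSeconds > alertWaitSeconds)
                    ? Status.ALERT : Status.DISCONNECTED;
        }
        return Status.DOWN; // investigators say DOWN, so VM HA is initiated
    }
}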

VirtualNetworkApplianceManagerImpl.java
The file VirtualNetworkApplianceManagerImpl.java is edited for a related VM HA problem. When a host is determined to be DOWN, CloudStack attempts VM HA on any affected routers. The problem is that when the host is determined to be down by the code referenced above, it may not actually be DOWN. On KVM, for example, the host is considered DOWN if the agent is stopped on the KVM host for too long; in that case the VMs could still be running just fine. However, when we think the host is DOWN, VM HA runs on the router, and as part of that it unallocates/cleans up the router, and its 169.x.x.x control IP is unallocated. After the cleanup, it tries to power on the router on another host, and as part of that it allocates a NEW 169.x.x.x control IP and writes it to the DB. But since the router isn't actually down (we only think the host is down), the VM HA fails because the vRouter is still running on the problem host.

Next, in this example, when the host agent is back online, it sends a power report to the management servers, which then think the router was powered on out of band (OOB). However, the GUI will not show a control IP for the vRouter, and the DB will have the NEW control IP that the failed VM HA event tried to allocate, leaving us unable to communicate with the vRouter.

This PR adds a simple check that we can still communicate with the vRouter after any OOB power-on. If we can, the DB has the correct control IP and we do nothing. If we cannot, we reboot the vRouter to fix it.
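
A rough sketch of that check; the class and helper names below are hypothetical stand-ins for CloudStack's CheckRouterCommand plumbing, not the actual implementation:

// Hypothetical sketch of the out-of-band power-on check described above.
final class OobPowerOnCheckSketch {
    // Stand-in for sending a CheckRouterCommand to the router's control IP.
    boolean canReachControlIp(long routerId) { /* send probe */ return false; }

    // Stand-in for scheduling a reboot of the vRouter.
    void rebootRouter(long routerId) { /* reboot */ }

    void onRouterPoweredOnOutOfBand(long routerId) {
        if (canReachControlIp(routerId)) {
            return; // control IP in the DB is correct; nothing to do
        }
        // The DB likely holds the stale control IP allocated during the failed
        // VM HA event: reboot the router to re-sync its control IP.
        rebootRouter(routerId);
    }
}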

@DaanHoogland

Overall this looks good, but it needs extensive testing; some minor remarks added.

/* Our next agent transition state is Alert.
 * Let's see if the host is down or why we had this event.
 */
s_logger.info("Investigating why host " + hostShortDesc + " has disconnected with event " + event);


@DaanHoogland

DaanHoogland Mar 2, 2018

Contributor

👍 Good improvement, but even though it is only (a comment and) a log statement, it is effectively an interface of the system: the ecosystem may query logs for this text and no longer find the hostId, and thus no longer be able to take mitigating actions. I'd rather see a less destructive change like 'hostId + " (" + hostShortDesc + ") "'.

We may get away with it but it does require extensive testing by the whole community :/.



@Slair1

Slair1 Mar 2, 2018

Contributor

@DaanHoogland good thought, I didn't think about that. However, hostShortDesc does include the hostId as part of it, so maybe it's OK?

final String hostShortDesc = "Host " + host.getName() + " (id:" + host.getId() + ")";



@DaanHoogland

DaanHoogland Mar 2, 2018

Contributor

Yeah, maybe. Intensive testing still required...


Status determinedState = investigate(attache);
// if state cannot be determined, do nothing and bail out
if (determinedState == null) {
    // note: (currentTimeMillis() >> 10) approximates milliseconds-to-seconds (divide by 1024)
    if ((System.currentTimeMillis() >> 10) - host.getLastPinged() > AlertWait.value()) {
-       s_logger.warn("Agent " + hostId + " state cannot be determined for more than " + AlertWait + "(" + AlertWait.value() + ") seconds, will go to Alert state");
+       s_logger.warn("State for " + hostShortDesc + " could not be determined for more than " + AlertWait + "(" + AlertWait.value() + ") seconds, will go to Alert state");


@DaanHoogland

DaanHoogland Mar 2, 2018

Contributor

For the warn message, the above is even more true.


} else if (determinedState == Status.Disconnected) {
    s_logger.warn("Agent is disconnected but the host is still up: " + host.getId() + "-" + host.getName());


@DaanHoogland

DaanHoogland Mar 2, 2018

Contributor

why must this statement be removed?



@Slair1

Slair1 Mar 2, 2018

Contributor

I removed it because it was extraneous. A similar but more detailed log entry is written for every case where the host is Disconnected; it's just a little hard to see that while reading through the changes.


if (currentStatus == Status.Disconnected) {
    // Last status was disconnected; only switch status if AlertWait has passed


@DaanHoogland

DaanHoogland Mar 2, 2018

Contributor

Can you extract these bits of code into methods, please?


host = _hostDao.findById(hostId); // Maybe the host magically reappeared?
handleDisconnectWithoutInvestigation(attache, event, true, removeAgent);
host = _hostDao.findById(hostId); // We may have transitioned the status - refresh


@DaanHoogland

DaanHoogland Mar 2, 2018

Contributor

protected boolean handleDisconnectWithInvestigation(final AgentAttache attache, Status.Event event) just grew from 85 to 118 lines. Can you please make it a bit more modular?



@Slair1

Slair1 Mar 2, 2018

Contributor

Yea, that sounds good

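
A minimal sketch of one possible extraction along these lines; the helper name and signature are hypothetical, not part of the actual change:

// Purely illustrative: one way to carve the Disconnected branch out of
// handleDisconnectWithInvestigation(); the name and signature are hypothetical.
private Status resolveDisconnectedState(final HostVO host, final long nowSeconds) {
    // Last status was Disconnected: only escalate once AlertWait has passed.
    if (nowSeconds - host.getLastPinged() > AlertWait.value()) {
        return Status.Alert;        // waited too long; go to Alert
    }
    return Status.Disconnected;     // stay Disconnected; ping timeouts keep re-checking
}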

@@ -340,6 +340,7 @@
private ScheduledExecutorService _executor;
private ScheduledExecutorService _checkExecutor;
private ScheduledExecutorService _networkStatsUpdateExecutor;
private ExecutorService _routerOobStartExecutor;


@DaanHoogland

DaanHoogland Mar 2, 2018

Contributor

We are trying to get rid of these _ prefixes; no need to adhere to this old convention. In fact, you may want to rename the others as well (in a separate commit, for ease of review). (This is not a request for change, just a suggestion.)


* This is needed for example when a host agent goes down and comes back up,
* we would have done a failed HA event on the router and end up having our controlIP out-of-sync
*/
s_logger.info("Router " + vo.getInstanceName() + " (ID:" + vo.getId() + ") is powered-on out-of-band, checking if can send CheckRouterCommand to router");


@DaanHoogland

DaanHoogland Mar 2, 2018

Contributor

👍 Completely agree with this change; the ecosystem shouldn't rely on debug logging anyway. Friendliness to the community still dictates diligent testing.


@borisstoyanov


borisstoyanov Mar 6, 2018

Contributor

@Slair1 what issues are fixed? Do we have Marvin tests for them? If not, I think it would be good to add them.


@borisstoyanov

Contributor

borisstoyanov commented Mar 6, 2018

@blueorangutan package

@blueorangutan


blueorangutan commented Mar 6, 2018

@borisstoyanov a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan


blueorangutan commented Mar 6, 2018

Packaging result: ✔centos6 ✖centos7 ✖debian. JID-1759

@borisstoyanov

Contributor

borisstoyanov commented Mar 6, 2018

@blueorangutan package

@blueorangutan


blueorangutan commented Mar 6, 2018

@borisstoyanov a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan


blueorangutan commented Mar 6, 2018

Packaging result: ✔centos6 ✔centos7 ✔debian. JID-1762

@borisstoyanov

Contributor

borisstoyanov commented Mar 6, 2018

@blueorangutan


blueorangutan commented Mar 6, 2018

@borisstoyanov a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan


blueorangutan commented Mar 7, 2018

Trillian test result (tid-2328)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 21963 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr2474-t2328-kvm-centos7.zip
Intermittent failure detected: /marvin/tests/smoke/test_iso.py
Intermittent failure detected: /marvin/tests/smoke/test_privategw_acl.py
Intermittent failure detected: /marvin/tests/smoke/test_vpc_redundant.py
Smoke tests completed. 52 look OK, 2 have error(s)
Only failed test results are shown below:

Test                                   Result   Time (s)   Test File
test_04_rvpc_privategw_static_routes   Failure  329.18     test_privategw_acl.py
test_02_edit_iso                       Failure  0.04       test_iso.py
test_05_iso_permissions                Failure  0.05       test_iso.py

@DaanHoogland


Contributor

DaanHoogland commented Mar 7, 2018

@Slair1 are you going to modularise the handleDisconnectWithInvestigation method?

@borisstoyanov

@Slair1 is there already test coverage for these fixes? If yes, can you show us the results? If not, can you add it?

@Slair1


Contributor

Slair1 commented Mar 13, 2018

Yea, I can work to modularize this; I unfortunately don't have the time at the moment, but I can later.

On the tests, do you mean unit tests? I've never written a unit test before, but I agree it would be good to have them.

@borisstoyanov


Contributor

borisstoyanov commented Mar 14, 2018

@Slair1 can we also have Marvin tests to cover these fixes?

@DaanHoogland


Contributor

DaanHoogland commented Mar 14, 2018

@Slair1 I have no easy way to put this: you have four (4) PRs out, and all of them fail all Travis runs. It must be that Travis doesn't like your GitHub handle, or something. Can you think of anything you have or do that might cause this? In the same period, other PRs have passed their Travis runs...
