Conversation

@harikrishna-patnala
Contributor

@harikrishna-patnala harikrishna-patnala commented Mar 20, 2023

Description

This PR attempts to fix #7320 by improving how HA is handled for VMs when a VMware host goes into the Alert state.
VMware hosts usually go into the Alert state when the ping times out, so it is a good idea to start the HA process for the VMs residing on that host.

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

@harikrishna-patnala
Contributor Author

@blueorangutan package

Member

@weizhouapache weizhouapache left a comment

Does the Alert state mean the VMware host is down?

Starting one VM on two hosts is very risky and might cause data corruption.

@harikrishna-patnala
Contributor Author

I get your point @weizhouapache, and it seems to be correct.
An alert could also be caused by network issues (not only by the host being completely down).

I now see why HA is not implemented for VMware: we need to check the actual status of the VM. I think we can do that by implementing CheckOnHostCommand, which currently looks like this:

protected Answer execute(CheckOnHostCommand cmd) {
    return new CheckOnHostAnswer(cmd, null, "Not Implmeneted");
}

This command checks the VM state using the neighbouring hosts in the cluster (that's how it is implemented in KVM).

For now, I'm marking this PR as draft.

@weizhouapache please let me know if CheckOnHostCommand makes sense here.
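
A minimal sketch of what a neighbour-based host check could look like, to make the idea above concrete. All names here (HostVerdict, NeighbourProbe, NeighbourCheckSketch) are hypothetical and are not CloudStack types; how a neighbour actually decides whether the suspect host is alive is deliberately left abstract.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical types for illustration only; none of these exist in CloudStack.
enum HostVerdict { UP, DOWN, UNKNOWN }

// A neighbouring host in the same cluster that can be asked about the suspect host.
interface NeighbourProbe {
    // Optional.empty() means this neighbour could not be reached or could not tell.
    Optional<Boolean> seesHostAlive(String suspectHostId);
}

final class NeighbourCheckSketch {
    // Ask every neighbour; a single positive sighting means the host is UP.
    // DOWN is returned only if at least one neighbour answered and none saw it alive.
    static HostVerdict investigate(String suspectHostId, List<NeighbourProbe> neighbours) {
        boolean anyAnswer = false;
        for (NeighbourProbe neighbour : neighbours) {
            Optional<Boolean> alive = neighbour.seesHostAlive(suspectHostId);
            if (alive.isPresent()) {
                anyAnswer = true;
                if (alive.get()) {
                    return HostVerdict.UP;
                }
            }
        }
        return anyAnswer ? HostVerdict.DOWN : HostVerdict.UNKNOWN;
    }
}
```

The sketch keeps UNKNOWN separate from DOWN on purpose: if no neighbour can answer, there is no evidence either way, and restarting the VMs elsewhere would carry the data-corruption risk raised above.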

@harikrishna-patnala harikrishna-patnala marked this pull request as draft March 20, 2023 09:24
@weizhouapache
Member

@harikrishna-patnala
Yes, but it relies on NFS storage and a heartbeat. I do not know how to implement it in VMware.

Contributor

@shwstppr shwstppr left a comment

Some options in my opinion @harikrishna-patnala @weizhouapache @rohityadavcloud

  • We can add a global config for a timeout after which all VMs on a host in the Alert state are migrated away
  • We can allow such a host to be put into maintenance, so that the operator can do this manually and the VMs on the host get migrated away
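
The first option could be sketched roughly as follows. This is only an illustration under stated assumptions: the class and method names are hypothetical, and treating a zero or negative timeout as "disabled" is a choice made for this sketch, not existing CloudStack behaviour.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper for illustration only; not part of CloudStack.
final class AlertTimeoutSketch {
    // Returns true once the host has been in the Alert state for at least `timeout`.
    // A zero or negative timeout disables the behaviour (an assumption of this sketch).
    static boolean shouldEvacuate(Instant alertSince, Instant now, Duration timeout) {
        if (timeout.isZero() || timeout.isNegative()) {
            return false;
        }
        return !now.isBefore(alertSince.plus(timeout));
    }
}
```

Note that the elapsed time only says how long the host has been unreachable. It does not show that the VMs on it have stopped, so it does not by itself address the concern about starting one VM on two hosts.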

@rohityadavcloud
Member

ping @harikrishna-patnala any update on this? Thanks.

@rohityadavcloud
Member

ping @harikrishna-patnala @shwstppr any update on this? Thanks.

@harikrishna-patnala
Contributor Author

I like @shwstppr's idea of adding a global setting to attempt to migrate the VMs when a host goes into the Alert state. I'll add that and update the PR.

@DaanHoogland
Contributor

So, if I understand @shwstppr correctly, we will implement a setting that puts hosts in the Alert state into Maintenance after a timeout. Migration of the VMs should then happen automatically, and we should not have to care about that. Is that correct, @shwstppr?

I wonder if this takes away @weizhouapache's worry. A host in the Alert state may still be up, and because it cannot be reached, the VMs on it cannot be stopped or migrated. Consequently, the disks may still be accessed and modified by any process running on the VM (started by cron or as a daemon, or the Windows equivalents).

Can we leverage vSphere HA for this (https://www.techtarget.com/searchvmware/definition/VMware-HA)?
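
The worry above can be expressed as a small decision gate. This is a hedged sketch, not a proposal for the actual implementation: VmObservation, HaAction, HaGateSketch and the hostFenced flag are all hypothetical, and "fenced" here simply means that something outside the unreachable host has guaranteed it can no longer write to the disks.

```java
// Hypothetical types for illustration only; none of these exist in CloudStack.
enum VmObservation { CONFIRMED_STOPPED, CONFIRMED_RUNNING, UNREACHABLE }

enum HaAction { RESTART_ELSEWHERE, LEAVE_ALONE, WAIT }

final class HaGateSketch {
    // Restart a VM on another host only when it is known not to be running,
    // either because it was observed stopped or because its host was fenced.
    static HaAction decide(VmObservation observation, boolean hostFenced) {
        switch (observation) {
            case CONFIRMED_STOPPED:
                return HaAction.RESTART_ELSEWHERE;
            case CONFIRMED_RUNNING:
                return HaAction.LEAVE_ALONE;
            default:
                // Unreachable: elapsed time alone is not evidence that the VM stopped.
                return hostFenced ? HaAction.RESTART_ELSEWHERE : HaAction.WAIT;
        }
    }
}
```

With a gate like this, a host that is merely in the Alert state maps to UNREACHABLE with hostFenced set to false, so the result is WAIT rather than a second copy of the VM.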

@DaanHoogland
Contributor

I think we should not merge this. I created a doc PR at apache/cloudstack-documentation#324. If we can improve the functional description there, we may be able to implement some added HA functionality, but relying on a timeout is error-prone. Even if the operator is sure the VMs are no longer running, CloudStack cannot be.

@harikrishna-patnala
Contributor Author

I agree with you @DaanHoogland, so I am closing this PR for that reason.

Development

Successfully merging this pull request may close these issues.

VM, VR HA doesn't work on VMware host disconnect

5 participants