Conversation

vishesh92
Member

Description

When a VM deployment fails, we reset host_id for the failed VM to null but not pod_id. As a result, the retry fails when there is enough capacity in another pod but not in the original pod.
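
To illustrate the problem, here is a minimal, self-contained sketch. It is not the CloudStack code: `VmRecord`, `choosePod`, `resetBeforeFix` and `resetAfterFix` are hypothetical names, and the planner logic is an assumed simplification in which a VM that still has a pod_id is only considered for that pod.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

public class PodRetrySketch {

    /** Hypothetical stand-in for a vm_instance row; only the two fields discussed here. */
    static final class VmRecord {
        Long hostId;
        Long podId;

        VmRecord(Long hostId, Long podId) {
            this.hostId = hostId;
            this.podId = podId;
        }
    }

    /**
     * Assumed, simplified planner: a VM that is still pinned to a pod is only
     * placed in that pod; an unpinned VM may go to any pod with capacity.
     */
    static Optional<Long> choosePod(VmRecord vm, Map<Long, Boolean> podHasCapacity) {
        if (vm.podId != null) {
            boolean usable = Boolean.TRUE.equals(podHasCapacity.get(vm.podId));
            return usable ? Optional.of(vm.podId) : Optional.empty();
        }
        return podHasCapacity.entrySet().stream()
                .filter(Map.Entry::getValue)
                .map(Map.Entry::getKey)
                .findFirst();
    }

    /** Cleanup as described for the old behaviour: only host_id is cleared. */
    static void resetBeforeFix(VmRecord vm) {
        vm.hostId = null;
    }

    /** Cleanup as described for the fix: host_id and pod_id are both cleared. */
    static void resetAfterFix(VmRecord vm) {
        vm.hostId = null;
        vm.podId = null;
    }

    public static void main(String[] args) {
        // Pod 1 has no usable capacity for the retry, pod 2 does.
        Map<Long, Boolean> capacity = new LinkedHashMap<>();
        capacity.put(1L, false);
        capacity.put(2L, true);

        VmRecord before = new VmRecord(10L, 1L);
        resetBeforeFix(before);
        System.out.println("before fix: " + choosePod(before, capacity)); // Optional.empty

        VmRecord after = new VmRecord(10L, 1L);
        resetAfterFix(after);
        System.out.println("after fix: " + choosePod(after, capacity)); // Optional[2]
    }
}
```

With only host_id cleared, the retry stays pinned to pod 1 and finds nothing; with pod_id cleared as well, the retry can land in pod 2.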

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • build/CI

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

How did you try to break this feature and the system with this change?

Reproducing the issue and testing the fix requires an environment with 2 pods.

  1. On the management server, set a breakpoint here:
  2. Deploy a VM. When the debugger reaches the line above, do the following:
    1. Run SELECT id, state, pod_id, host_id, last_host_id FROM vm_instance ORDER BY id DESC LIMIT 1; on the cloud database.
    2. Take the pod_id from the result and disable that pod: UPDATE host_pod_ref SET allocation_state = 'Disabled' WHERE id = <pod id>;
    3. Set hostHasCpuCapability = false in the debugger to throw an error in the first run.
    4. The VM deployment is retried once after this failure. Before the fix, execution does not stop at the breakpoint again because the VM no longer has any available resources to deploy on. After the fix, execution stops at the breakpoint again, and at that point you can check that pod_id is different.

@codecov

codecov bot commented Oct 12, 2023

Codecov Report

Merging #8085 (a425b07) into 4.18 (29c7b31) will increase coverage by 0.07%.
Report is 92 commits behind head on 4.18.
The diff coverage is 19.90%.

@@             Coverage Diff              @@
##               4.18    #8085      +/-   ##
============================================
+ Coverage     13.02%   13.10%   +0.07%     
- Complexity     9032     9123      +91     
============================================
  Files          2720     2720              
  Lines        257080   257598     +518     
  Branches      40088    40158      +70     
============================================
+ Hits          33476    33748     +272     
- Misses       219400   219587     +187     
- Partials       4204     4263      +59     
Files Coverage Δ
...hestration/service/VolumeOrchestrationService.java 100.00% <ø> (ø)
.../main/java/com/cloud/network/IpAddressManager.java 100.00% <100.00%> (ø)
...ava/com/cloud/network/as/AutoScaleVmProfileVO.java 80.20% <100.00%> (+11.66%) ⬆️
...main/java/com/cloud/storage/dao/VolumeDaoImpl.java 23.11% <ø> (-0.13%) ⬇️
...java/com/cloud/upgrade/DatabaseUpgradeChecker.java 40.89% <100.00%> (+0.64%) ⬆️
...va/com/cloud/upgrade/DatabaseVersionHierarchy.java 85.10% <100.00%> (+1.01%) ⬆️
.../storage/vmsnapshot/StorageVMSnapshotStrategy.java 25.39% <ø> (+0.95%) ⬆️
.../api/command/admin/ratelimit/ResetApiLimitCmd.java 0.00% <ø> (ø)
...oud/hypervisor/kvm/resource/LibvirtConnection.java 0.00% <ø> (ø)
.../hypervisor/kvm/storage/ScaleIOStorageAdaptor.java 10.44% <100.00%> (ø)
... and 63 more

... and 15 files with indirect coverage changes

@vishesh92
Member Author

@blueorangutan package

@blueorangutan

@vishesh92 a [SF] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 7350

@rohityadavcloud rohityadavcloud added this to the 4.19.0.0 milestone Oct 13, 2023
@rohityadavcloud
Member

@blueorangutan test matrix

@blueorangutan

@rohityadavcloud a [SF] Trillian-Jenkins matrix job (centos7 mgmt + xenserver71, rocky8 mgmt + vmware67u3, centos7 mgmt + kvmcentos7) has been kicked to run smoke tests

@blueorangutan

[SF] Trillian test result (tid-7955)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 41619 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8085-t7955-kvm-centos7.zip
Smoke tests completed. 112 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_03_deploy_vm_wrong_checksum Error 40.62 test_templates.py
test_09_list_templates_download_details Failure 0.04 test_templates.py

@blueorangutan

[SF] Trillian test result (tid-7953)
Environment: xenserver-71 (x2), Advanced Networking with Mgmt server 7
Total time taken: 43151 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8085-t7953-xenserver-71.zip
Smoke tests completed. 113 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File

@vishesh92
Member Author

@blueorangutan package

@blueorangutan

@vishesh92 a [SF] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 7412

Contributor

@shwstppr shwstppr left a comment

Code LGTM

@DaanHoogland
Contributor

@blueorangutan test

@blueorangutan

@DaanHoogland a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan

[SF] Trillian test result (tid-8070)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 42125 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8085-t8070-kvm-centos7.zip
Smoke tests completed. 112 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_03_deploy_vm_wrong_checksum Error 39.55 test_templates.py
test_09_list_templates_download_details Failure 0.07 test_templates.py

@DaanHoogland DaanHoogland force-pushed the fix-pod-selection-after-failure branch from b0057f7 to cad6412 Compare October 25, 2023 09:41
@DaanHoogland
Contributor

@blueorangutan package

@blueorangutan

@DaanHoogland a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@vishesh92 vishesh92 force-pushed the fix-pod-selection-after-failure branch from cad6412 to db22bdf Compare October 25, 2023 09:56
@vishesh92 vishesh92 changed the base branch from main to 4.18 October 25, 2023 09:57
@vishesh92
Member Author

@blueorangutan package

@blueorangutan

@vishesh92 a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 7506

@blueorangutan

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 7508

@vishesh92
Member Author

@blueorangutan package

@blueorangutan

@vishesh92 a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 7520

Contributor

@DaanHoogland DaanHoogland left a comment

clgtm, I'll be testing it manually to simulate the right conditions.

@vishesh92 vishesh92 force-pushed the fix-pod-selection-after-failure branch from 8af2720 to 07339be Compare November 1, 2023 09:22
@vishesh92 vishesh92 force-pushed the fix-pod-selection-after-failure branch from 07339be to a425b07 Compare November 1, 2023 09:25
@DaanHoogland
Contributor

@blueorangutan package

@blueorangutan

@DaanHoogland a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 7611

@DaanHoogland
Contributor

@blueorangutan test alma9 kvm-alma9

@blueorangutan

@DaanHoogland a [SL] Trillian-Jenkins test job (alma9 mgmt + kvm-alma9) has been kicked to run smoke tests

@blueorangutan

[SF] Trillian test result (tid-8220)
Environment: kvm-alma9 (x2), Advanced Networking with Mgmt server a9
Total time taken: 45623 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8085-t8220-kvm-alma9.zip
Smoke tests completed. 108 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_01_migrate_VM_and_root_volume Error 90.35 test_vm_life_cycle.py
test_02_migrate_VM_with_two_data_disks Error 52.72 test_vm_life_cycle.py
test_08_migrate_vm Error 46.04 test_vm_life_cycle.py

@DaanHoogland
Contributor

DaanHoogland commented Nov 3, 2023

Not sure if the errors are related.

@DaanHoogland
Contributor

@blueorangutan test alma9 kvm-alma9

@blueorangutan

@DaanHoogland a [SL] Trillian-Jenkins test job (alma9 mgmt + kvm-alma9) has been kicked to run smoke tests

@rohityadavcloud
Member

JFYI @DaanHoogland @vishesh92 I had some issues using Alma Linux (due to a repo/mirror issue), but OL8/OL9 seem to work fine with the backend CI/CD.

@DaanHoogland
Contributor

@blueorangutan test alma9 kvm-alma9

@DaanHoogland
Contributor

@blueorangutan test alma9 kvm-alma9

This didn't work 🤯, so I started it manually.

@blueorangutan

@DaanHoogland [SL] unsupported parameters provided. Supported mgmt server os are: centos7, centos6, suse15, alma8, ubuntu18, ubuntu22, ubuntu20, rocky8, alma9. Supported hypervisors are: kvm-centos6, kvm-centos7, kvm-rocky8, kvm-alma8, kvm-alma9, kvm-ubuntu18, kvm-ubuntu20, kvm-ubuntu22, kvm-suse15, vmware-55u3, vmware-60u2, vmware-65u2, vmware-67u3, vmware-70u1, vmware-70u2, vmware-70u3, vmware-80, vmware-80u1, xenserver-65sp1, xenserver-71, xenserver-74, xcpng74, xcpng76, xcpng80, xcpng81, xcpng82

@DaanHoogland
Contributor

@blueorangutan test alma9 kvm-alma9

This didn't work 🤯, so I started it manually.

results:
Smoke tests completed. 108 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_01_migrate_VM_and_root_volume Error 89.26 test_vm_life_cycle.py
test_02_migrate_VM_with_two_data_disks Error 51.66 test_vm_life_cycle.py
test_08_migrate_vm Error 0.07 test_vm_life_cycle.py

These errors are all over the place at the moment and are not specific to this issue.

@DaanHoogland
Contributor

tested according to spec in the description.

@DaanHoogland DaanHoogland merged commit e65c9ff into apache:4.18 Nov 7, 2023
@DaanHoogland DaanHoogland deleted the fix-pod-selection-after-failure branch November 7, 2023 14:11