CleanUp Async Jobs after mgmt server maintenance #8394

Merged
merged 5 commits into from Jan 19, 2024

Conversation

kishankavala
Contributor

Description

This PR moves resources that are stuck in a transitional state out of that state during async job cleanup.

Problem:
During maintenance of the management server, other servers in the cluster or the same server after a restart initiate async job cleanup. However, this process leaves resources in a transitional state. The only recovery option currently available is to make direct database changes.

Solution:
This PR moves Volume, Virtual Machine, and Network resources out of their transitional states during async job cleanup. Failed operations can then be retried without manual database modifications.
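
As a rough illustration of the idea described above (not the PR's actual Java code), the cleanup can be thought of as a lookup from a transitional state to a stable fallback state, applied to every resource attached to an abandoned job. The sketch below is in Python for brevity; the `Resource` class, the `FALLBACK_STATE` table, the state names in it, and `cleanup_pending_job` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Hypothetical mapping: (resource kind, transitional state) -> fallback state.
# These names are illustrative assumptions, not CloudStack's real state machine.
FALLBACK_STATE = {
    ("volume", "Attaching"): "Ready",
    ("volume", "Creating"): "Allocated",
    ("vm", "Starting"): "Stopped",
}


@dataclass
class Resource:
    kind: str   # e.g. "volume", "vm", "network" (hypothetical labels)
    state: str  # current state name


def cleanup_pending_job(resources):
    """Move each resource of an abandoned job out of its transitional state.

    Returns the list of resources whose state was changed.
    """
    changed = []
    for resource in resources:
        target = FALLBACK_STATE.get((resource.kind, resource.state))
        if target is not None:
            resource.state = target
            changed.append(resource)
    return changed
```

Resources that are already in a stable state are left untouched, so running the cleanup more than once is harmless in this sketch.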

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

Tested manually and with unit tests


codecov bot commented Dec 21, 2023

Codecov Report

Attention: 45 lines in your changes are missing coverage. Please review.

Comparison is base (1411da1) 30.88% compared to head (14b5967) 30.75%.
Report is 14 commits behind head on main.

| Files | Patch % | Lines |
|---|---|---|
| ...stack/framework/jobs/impl/AsyncJobManagerImpl.java | 32.30% | 35 Missing and 9 partials ⚠️ |
| ...apache/cloudstack/storage/volume/VolumeObject.java | 0.00% | 1 Missing ⚠️ |

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #8394      +/-   ##
============================================
- Coverage     30.88%   30.75%   -0.13%     
+ Complexity    34079    33941     -138     
============================================
  Files          5341     5341              
  Lines        374861   374922      +61     
  Branches      54518    54529      +11     
============================================
- Hits         115769   115323     -446     
- Misses       243825   244347     +522     
+ Partials      15267    15252      -15     

| Flag | Coverage Δ |
|---|---|
| simulator-marvin-tests | 24.64% <28.08%> (-0.18%) ⬇️ |
| uitests | 4.39% <ø> (+<0.01%) ⬆️ |
| unit-tests | 16.47% <46.06%> (+<0.01%) ⬆️ |

Flags with carried forward coverage won't be shown.

@harikrishna-patnala
Contributor

@blueorangutan package

@blueorangutan

@harikrishna-patnala a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

Contributor

@harikrishna-patnala harikrishna-patnala left a comment

Code LGTM

@blueorangutan

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 8111

@rohityadavcloud
Member

@blueorangutan test

@blueorangutan

@rohityadavcloud a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@shwstppr
Contributor

Moving this to 4.19.1 milestone for now cc @rohityadavcloud
If we are not able to cut RC this week and tests look good we can move it back and merge

@shwstppr shwstppr modified the milestones: 4.19.0.0, 4.19.1.0 Dec 21, 2023
@blueorangutan

[SF] Trillian test result (tid-8653)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 47435 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8394-t8653-kvm-centos7.zip
Smoke tests completed. 126 look OK, 2 have errors, 0 did not run
Only failed and skipped tests results shown below:

| Test | Result | Time (s) | Test File |
|---|---|---|---|
| test_05_list_volumes_isrecursive | Failure | 0.03 | test_list_volumes.py |
| test_07_list_volumes_listall | Failure | 0.02 | test_list_volumes.py |
| test_02_upgrade_kubernetes_cluster | Failure | 436.85 | test_kubernetes_clusters.py |

Contributor

@JoaoJandre JoaoJandre left a comment

CLGTM, didn't test it

@sureshanaparti
Contributor

@blueorangutan package

@blueorangutan

@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 8205

@sureshanaparti sureshanaparti self-assigned this Jan 8, 2024
@sureshanaparti
Contributor

@blueorangutan package

@blueorangutan

@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 8228

@sureshanaparti
Contributor

@blueorangutan test

@blueorangutan

@sureshanaparti a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan

[SF] Trillian test result (tid-8754)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 54871 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8394-t8754-kvm-centos7.zip
Smoke tests completed. 121 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File

Contributor

@kiranchavala kiranchavala left a comment

LGTM, tested the changes manually.

Stopped the management server during the following scenarios:

  1. While an attach volume operation is in progress
  2. While create volume from snapshot is in progress
  3. While VM deployment along with a new network is in progress

After the management server is started:

  1. The volume is marked Ready and the removed column (date) is kept as NULL
  2. The volume is marked Allocated and the removed column (date) is populated; the entries in the volume_details table are removed for the snapshot
  3. The VM state is Stopped, the volume is set to Allocated, and the network state is Implementing
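
The observations above could be written down as a small table-driven check, for example when repeating the manual test. This is only a sketch: the scenario keys, field names, and the `mismatches` helper are hypothetical, and the expected values simply restate what the review reports.

```python
# Expected post-restart outcomes, restating the review's observations.
# Scenario keys and field names are hypothetical labels for this sketch.
EXPECTED = {
    "attach_volume": {
        "volume_state": "Ready",
        "volume_removed_set": False,
    },
    "create_volume_from_snapshot": {
        "volume_state": "Allocated",
        "volume_removed_set": True,
    },
    "deploy_vm_with_new_network": {
        "vm_state": "Stopped",
        "volume_state": "Allocated",
        "network_state": "Implementing",
    },
}


def mismatches(scenario, observed):
    """Return {field: (expected, observed)} for every field that differs."""
    expected = EXPECTED[scenario]
    return {
        field: (want, observed.get(field))
        for field, want in expected.items()
        if observed.get(field) != want
    }
```

An empty result means the observed state matches what the review describes for that scenario.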

@rohityadavcloud
Member

@shwstppr can we get this into 4.19.0.0? Thanks.

@shwstppr shwstppr modified the milestones: 4.19.1.0, 4.19.0.0 Jan 18, 2024
@shwstppr
Contributor

@blueorangutan package

@blueorangutan

@shwstppr a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@shwstppr
Contributor

@blueorangutan package

@blueorangutan

@shwstppr a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 8370

@shwstppr
Contributor

@blueorangutan test matrix

@blueorangutan

@shwstppr a [SL] Trillian-Jenkins matrix job (centos7 mgmt + xenserver71, rocky8 mgmt + vmware67u3, centos7 mgmt + kvmcentos7) has been kicked to run smoke tests

@blueorangutan

[SF] Trillian test result (tid-8875)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 41654 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8394-t8875-kvm-centos7.zip
Smoke tests completed. 121 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File

@blueorangutan

[SF] Trillian test result (tid-8873)
Environment: xenserver-71 (x2), Advanced Networking with Mgmt server 7
Total time taken: 48423 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8394-t8873-xenserver-71.zip
Smoke tests completed. 120 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

| Test | Result | Time (s) | Test File |
|---|---|---|---|
| ContextSuite context=TestSharedNetwork>:setup | Error | 64.62 | test_network.py |

@sureshanaparti
Contributor

> [SF] Trillian test result (tid-8873)
> Environment: xenserver-71 (x2), Advanced Networking with Mgmt server 7
> Total time taken: 48423 seconds
> Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8394-t8873-xenserver-71.zip
> Smoke tests completed. 120 look OK, 1 have errors, 0 did not run
> Only failed and skipped tests results shown below:
>
> Test Result Time (s) Test File
> ContextSuite context=TestSharedNetwork>:setup Error 64.62 test_network.py

The error here, raised while creating a network, is not related to the changes in this PR. The changes here reset/clean up any Volume, VM, and Network (in Implementing state) resources for the pending jobs on MS start, and the PR is good to go. cc @shwstppr

Execute cmd: createnetwork failed, due to: errorCode: 431, errorText:The VLAN tag to use for new guest network, 2625 is already being used for dynamic vlan allocation for the guest network in zone pr8394-t8873-xenserver-71

@shwstppr shwstppr merged commit 80bbb29 into apache:main Jan 19, 2024
25 of 26 checks passed
@shwstppr shwstppr deleted the async_cleanup branch January 19, 2024 07:56
dhslove pushed a commit to ablecloud-team/ablestack-cloud that referenced this pull request Jan 26, 2024