@davidjumani
Contributor

Description

Adds a feature to safely shut down CloudStack.
It does the following:

  • Prevents new async jobs from being added
  • Waits for existing jobs to finish before shutting down

Adds four new APIs:

  • TriggerShutdown - Prevents new jobs and shuts down once all pending jobs have completed
  • ReadyForShutdown - Returns whether a shutdown has been triggered and the number of pending jobs
  • PrepareForShutdown - Prevents new jobs from being added but does not shut down when there are zero pending jobs. Useful when admin changes that can impact operations are required in ACS
  • CancelShutdown - Cancels a shutdown if possible
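
The behaviour described in the list above can be modelled with a small state holder. The following is a minimal Java sketch, not the code from this PR: the class `SafeShutdownSketch`, its method names, and the injected `exitAction` (standing in for the real process exit) are all hypothetical. It checks for zero pending jobs whenever a job finishes, which is an assumption about how the wait could be implemented.

```java
/** Hypothetical model of the four calls described above; not CloudStack code. */
final class SafeShutdownSketch {
    private final Runnable exitAction; // assumption: stands in for the real process exit
    private int pendingJobs;           // async jobs accepted but not yet finished
    private boolean prepared;          // true: new jobs are rejected
    private boolean triggered;         // true: exit once pendingJobs reaches zero
    private boolean exited;            // true: exitAction has already run

    SafeShutdownSketch(Runnable exitAction) {
        this.exitAction = exitAction;
    }

    /** Accepts a new async job unless a shutdown is prepared or triggered. */
    synchronized void submitJob() {
        if (prepared) {
            throw new IllegalStateException("Shutdown in progress, new async jobs are not accepted");
        }
        pendingJobs++;
    }

    /** Marks one job as finished; exits if a triggered shutdown has drained. */
    synchronized void finishJob() {
        if (pendingJobs == 0) {
            throw new IllegalStateException("No pending job to finish");
        }
        pendingJobs--;
        exitIfDrained();
    }

    /** PrepareForShutdown: block new jobs but never exit on its own. */
    synchronized void prepareForShutdown() {
        prepared = true;
    }

    /** TriggerShutdown: block new jobs and exit once nothing is pending. */
    synchronized void triggerShutdown() {
        prepared = true;
        triggered = true;
        exitIfDrained();
    }

    /** CancelShutdown: returns false if the exit has already started. */
    synchronized boolean cancelShutdown() {
        if (exited) {
            return false;
        }
        prepared = false;
        triggered = false;
        return true;
    }

    /** ReadyForShutdown would report these two values. */
    synchronized boolean isShutdownTriggered() {
        return triggered;
    }

    synchronized int pendingJobCount() {
        return pendingJobs;
    }

    private void exitIfDrained() {
        if (triggered && !exited && pendingJobs == 0) {
            exited = true;
            exitAction.run();
        }
    }
}
```

In a real server the `exitAction` could be something like `() -> System.exit(0)`; injecting it keeps the sketch testable without terminating the JVM.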

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

Feature/Enhancement Scale

  • Major
  • Minor

Screenshots (if appropriate):

[Image: Screenshot from 2022-09-21 16-44-55]

How Has This Been Tested?

TODO

@acs-robot

Found UI changes, kicking a new UI QA build
@blueorangutan ui

@blueorangutan

@acs-robot a Jenkins job has been kicked to build UI QA env. I'll keep you posted as I make progress.

@blueorangutan

UI build: ✔️
Live QA URL: http://qa.cloudstack.cloud:8080/client/pr/6755 (SL-JID-2390)

@blueorangutan

UI build: ✔️
Live QA URL: http://qa.cloudstack.cloud:8080/client/pr/6755 (SL-JID-2396)

@blueorangutan

UI build: ✔️
Live QA URL: http://qa.cloudstack.cloud:8080/client/pr/6755 (SL-JID-2397)

@blueorangutan

UI build: ✔️
Live QA URL: http://qa.cloudstack.cloud:8080/client/pr/6755 (SL-JID-2399)

@sonarqubecloud

SonarCloud Quality Gate failed.

  • Bugs: 1 (rating C)
  • Vulnerabilities: 0 (rating A)
  • Security Hotspots: 0 (rating A)
  • Code Smells: 6 (rating A)
  • Coverage: 29.5%
  • Duplication: 9.0%

Contributor

@DaanHoogland left a comment

Looks generally good.
A functional question, though: there is a UI component, but will the API let all clustered management servers (MS) shut down? I didn't see code for that. It seems only the MS that happens to handle the API call will shut down.

private static final int HEARTBEAT_INTERVAL = 2000;
private static final int GC_INTERVAL = 10000; // 10 seconds

private boolean allowAsyncJobs = true ;
Contributor

Seems like this should be called shutdownTriggered. Otherwise the exception messages are too specific: we don't know that the reason async jobs are disallowed is a shutdown rather than some other maintenance task.

Contributor Author

I'm guessing this can be reused later, hence the more generic naming. I'll look into polishing the message accordingly
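
One way to keep the generic naming while still producing precise messages is to store the reason alongside the flag. This is a hedged sketch only: `AsyncJobGate`, its methods, and the message text are hypothetical and are not taken from this PR.

```java
import java.util.Objects;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical gate: generic on/off switch that remembers why jobs are blocked. */
final class AsyncJobGate {
    // null means async jobs are allowed; otherwise holds the reason they are blocked
    private final AtomicReference<String> blockedReason = new AtomicReference<>();

    /** Blocks new async jobs and records the reason, e.g. "shutdown triggered". */
    void disallow(String reason) {
        blockedReason.set(Objects.requireNonNull(reason, "reason"));
    }

    /** Allows async jobs again. */
    void allow() {
        blockedReason.set(null);
    }

    boolean isAllowed() {
        return blockedReason.get() == null;
    }

    /** Throws with a message that names the recorded reason when jobs are blocked. */
    void checkAllowed() {
        String reason = blockedReason.get();
        if (reason != null) {
            throw new IllegalStateException("Async jobs are currently disallowed: " + reason);
        }
    }
}
```

With this shape, a shutdown path would call `disallow("shutdown triggered")`, and a later maintenance feature could reuse the same gate with its own reason.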

}
if (shutdownManager.countPendingJobs() == 0) {
s_logger.info("Shutting down now");
System.exit(0);
Contributor

Are other threads allowed to shut down as well, maybe by setting a flag on ManagedContextRunnable and adding a check for it there? (Genuine question.)

Contributor Author

When System.exit() is called, all registered shutdown hooks are started, and the JVM exits only after they have completed.

https://docs.oracle.com/javase/7/docs/api/java/lang/Runtime.html#exit(int)
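
A small sketch of the behaviour referenced above, using only `Runtime.addShutdownHook` and `System.exit` from the standard library. The class `ShutdownHookDemo` and the helper `registerCleanup` are hypothetical names for illustration. Note that hooks do not stop other threads by themselves: per the `Runtime` documentation, other threads keep running during the shutdown sequence until the JVM halts, so any cooperative stop flag would still have to be set from a hook or before calling `System.exit`.

```java
/** Hypothetical demo of JVM shutdown hooks; not code from this PR. */
final class ShutdownHookDemo {

    /** Registers the given cleanup as a JVM shutdown hook and returns the hook thread. */
    static Thread registerCleanup(Runnable cleanup) {
        Thread hook = new Thread(cleanup, "cleanup-hook");
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }

    public static void main(String[] args) {
        registerCleanup(() -> System.out.println("cleanup hook ran"));
        System.out.println("calling System.exit(0)");
        // Starts all registered hooks, waits for them to finish, then halts the JVM.
        System.exit(0);
    }
}
```

Running `main` prints the "calling" line first and the hook's line afterwards, because the hook only starts once the exit sequence begins.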

@blueorangutan

UI build: ✔️
Live QA URL: http://qa.cloudstack.cloud:8080/client/pr/6755 (SL-JID-2596)

@davidjumani
Contributor Author

@blueorangutan package

@blueorangutan

@davidjumani a Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@DaanHoogland
Contributor

> looks generally good. A functional question though, there is a UI component, but will the API let all clustered MS shutdown? (I didn't see code for that) It seems only the MS that happens to handle the API will shut down.

@davidjumani can you explain this?

@davidjumani
Contributor Author

@blueorangutan package

@blueorangutan

@davidjumani a Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 5789

@davidjumani
Contributor Author

@blueorangutan test

@davidjumani marked this pull request as draft March 28, 2023 10:14

@blueorangutan

@davidjumani a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan

Trillian test result (tid-6328)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 21691 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr6755-t6328-kvm-centos7.zip
Smoke tests completed. 84 look OK, 0 have errors, 25 did not run
Only failed and skipped tests results shown below:

| Test | Result | Time (s) | Test File |
| --- | --- | --- | --- |
| all_test_safe_shutdown | Skipped | --- | test_safe_shutdown.py |
| all_test_login | Skipped | --- | test_login.py |
| all_test_scale_vm | Skipped | --- | test_scale_vm.py |
| all_test_metrics_api | Skipped | --- | test_metrics_api.py |
| all_test_outofbandmanagement | Skipped | --- | test_outofbandmanagement.py |
| all_test_outofbandmanagement_nestedplugin | Skipped | --- | test_outofbandmanagement_nestedplugin.py |
| all_test_routers_iptables_default_policy | Skipped | --- | test_routers_iptables_default_policy.py |
| all_test_secondary_storage | Skipped | --- | test_secondary_storage.py |
| all_test_service_offerings | Skipped | --- | test_service_offerings.py |
| all_test_storage_policy | Skipped | --- | test_storage_policy.py |
| all_test_update_security_group | Skipped | --- | test_update_security_group.py |
| all_test_usage_events | Skipped | --- | test_usage_events.py |
| all_test_vm_autoscaling | Skipped | --- | test_vm_autoscaling.py |
| all_test_vm_deployment_planner | Skipped | --- | test_vm_deployment_planner.py |
| all_test_vm_life_cycle | Skipped | --- | test_vm_life_cycle.py |
| all_test_vm_lifecycle_unmanage_import | Skipped | --- | test_vm_lifecycle_unmanage_import.py |
| all_test_vm_snapshot_kvm | Skipped | --- | test_vm_snapshot_kvm.py |
| all_test_vm_snapshots | Skipped | --- | test_vm_snapshots.py |
| all_test_volumes | Skipped | --- | test_volumes.py |
| all_test_vpc_ipv6 | Skipped | --- | test_vpc_ipv6.py |
| all_test_vpc_redundant | Skipped | --- | test_vpc_redundant.py |
| all_test_vpc_router_nics | Skipped | --- | test_vpc_router_nics.py |
| all_test_vpc_vpn | Skipped | --- | test_vpc_vpn.py |
| all_test_host_maintenance | Skipped | --- | test_host_maintenance.py |
| all_test_hostha_kvm | Skipped | --- | test_hostha_kvm.py |

@davidjumani
Contributor Author

@blueorangutan package

@blueorangutan

@davidjumani a Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 5818

@davidjumani
Contributor Author

@blueorangutan test

@blueorangutan

@davidjumani a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan

Trillian test result (tid-6345)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 43283 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr6755-t6345-kvm-centos7.zip
Smoke tests completed. 109 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

| Test | Result | Time (s) | Test File |
| --- | --- | --- | --- |

Contributor

@kiranchavala left a comment

Tested the safe shutdown feature. LGTM.

Scenarios I tested:

safeshutdown-testcases.xlsx

Contributor

@DaanHoogland left a comment

clgtm

@DaanHoogland marked this pull request as ready for review April 6, 2023 11:21

@github-actions

This pull request has merge conflicts. Dear author, please fix the conflicts and sync your branch with the base branch.

@sonarqubecloud

SonarCloud Quality Gate failed.

  • Bugs: 0 (rating A)
  • Vulnerabilities: 0 (rating A)
  • Security Hotspots: 0 (rating A)
  • Code Smells: 5 (rating A)
  • Coverage: 13.6%
  • Duplication: 0.0%

@DaanHoogland merged commit 941cc83 into apache:main Apr 12, 2023
@davidjumani deleted the safe-shutdown branch April 12, 2023 10:55
@rohityadavcloud added this to the 4.19.0.0 milestone Apr 12, 2023

@rohityadavcloud
Member

rohityadavcloud commented Apr 12, 2023

Thanks, great feature. The banner background colour and text colour aren't in line with our current AntD theme. Should the message use the warning (yellow) or error (red) style? https://antdv.com/components/alert/#Alert (see banner, to be used on top of the page)

@davidjumani
Contributor Author

Thanks @rohityadavcloud, I'll create a PR to address this.

@rohityadavcloud mentioned this pull request Apr 19, 2023

weizhouapache pushed a commit to shapeblue/cloudstack that referenced this pull request Mar 4, 2025
* Safely shutdown feature (ref: apache#6755)

* Updated version and some improvements

* Management Server Maintenance - Prepare and Cancel Maintenance changes

This is supported for CloudStack deployments with multiple management servers.
- While preparing for maintenance, the MS waits for pending jobs to finish, and then transfers/migrates the agents to other available MS
- New APIs: prepareForMaintenance, cancelMaintenance
- New MS States: PreparingToMaintenance, Maintenance

* check for single active management server

* refactoring plugin name

* updated version, and cleanup

* code improvements

* support list hosts by management server id

* update ui with ms maintenance apis

* code improvements

* ui changes

* ui icons update

* ui fixes

* cond checks for maintenance and shutdown

* fix for management server not down issue on service stop

* continue with other components on error

* agent transfer fixes

* maintenance window timeout and fixes

* ui changes - added connected agents tab, and updated hosts & management servers fields

* marvin test update

* keep maintenance after shutdown/restart, do not update last_updated time in cluster heartbeat during maintenance (notifies node inactive/down after heartbeat threshold)

* listener for ms maintenance updates

* cleanup

* keep last msid in host table

* review comments

* allow only one mgmt server to prepare for maintenance

* added ms uuid in logs

* minor code improvements

* ui fields update

* fix systemvm navigation in connected agents

* algorithm check and input from ui

* check for active ms from host setting

* agent migration code improvements

* minor ui label fix

* fixes & code improvements

* agent reconnect fixes, consider avoid list

* ui fixes

* direct agents transfer and pending jobs timer task fixes

* close unclosed socket channels if any

* Updated pending jobs check timer task with ScheduledExecutorService

* fixes

* keep maintenance state on trigger shutdown call when ms is in maintenance

* direct agent transfer fixes

* add pending jobs count to ms response

* during ms heartbeat, update state to up only when it's down

* allow vm work jobs of async job created before prepare for maintenance

* Revert "keep maintenance state on trigger shutdown call when ms is in maintenance"

This reverts commit 4ebbea71ef20a65286bed41a517f03e253a8fe90.

* removed duplicate schema changes from schema-41800to41810.sql (already defined at schema-41811to41812.sql)