Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix failure on agent reconnection #8089

Merged
merged 6 commits into from Oct 26, 2023

Conversation

vishesh92
Copy link
Member

Description

Depending on the agents' configuration, restarting a management server (preferred MS for the agent) will make the agent connect to another management server (non preferred MS). When the preferred MS comes back up, agent will try to disconnect with non-preferred MS and connect with the preferred MS. A race condition can happen during this process in which disconnection from non-preferred MS completes after the connection with preferred MS. This leads to agent to go into an Alert state. During this time, agent is still sending Ping to the preferred MS.

This PR solves this issue by:

  • Taking a lock in database which ensures that for a host, only connection or disconnection can happen across different MS.
  • While processing the Ping command if the Host is not in Up state, we request the agent to send a startup command again to the connection. If the startup is successful, the agent will come back in Up state.

To reproduce the issue,

  1. apply the patch below.
diff --git a/engine/orchestration/src/main/java/com/cloud/agent/manager/AgentManagerImpl.java b/engine/orchestration/src/main/java/com/cloud/agent/manager/AgentManagerImpl.java
index b74c11cf..aec8a8dd 100644
--- a/engine/orchestration/src/main/java/com/cloud/agent/manager/AgentManagerImpl.java
+++ b/engine/orchestration/src/main/java/com/cloud/agent/manager/AgentManagerImpl.java
@@ -836,6 +836,9 @@ public class AgentManagerImpl extends ManagerBase implements AgentManager, Handl
 
         removeAgent(attache, nextStatus);
         // update the DB
+        try {
+            Thread.sleep(5000L);
+        } catch (Exception e){}
         if (host != null && transitState) {
             disconnectAgent(host, event, _nodeId);
         }
  1. Setup an environment with two management servers and below global configuration
agent.lb.enabled = true
indirect.agent.lb.algorithm = roundrobin
indirect.agent.lb.check.interval = 30
  1. Restart the preferred management server.
  2. Check status of host in database or in the UI

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • build/CI

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

  1. Tried by applying the above patch.
  2. Updated the state of a host to Alert state in database. After getting a ping, it gets a startup command after which it turns back to Up state.

How did you try to break this feature and the system with this change?

@codecov
Copy link

codecov bot commented Oct 12, 2023

Codecov Report

Merging #8089 (0169fc9) into 4.18 (3e7f21a) will increase coverage by 0.00%.
Report is 14 commits behind head on 4.18.
The diff coverage is 0.00%.

@@            Coverage Diff            @@
##               4.18    #8089   +/-   ##
=========================================
  Coverage     13.06%   13.07%           
- Complexity     9109     9111    +2     
=========================================
  Files          2720     2720           
  Lines        257526   257566   +40     
  Branches      40150    40154    +4     
=========================================
+ Hits          33655    33666   +11     
- Misses       219644   219671   +27     
- Partials       4227     4229    +2     
Files Coverage Δ
...java/com/cloud/agent/manager/AgentManagerImpl.java 4.82% <0.00%> (-0.12%) ⬇️

... and 8 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@vishesh92
Copy link
Member Author

@blueorangutan package

@blueorangutan
Copy link

@vishesh92 a [SF] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan
Copy link

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 7351

@rohityadavcloud rohityadavcloud added this to the 4.18.2.0 milestone Oct 13, 2023
Copy link
Member

@rohityadavcloud rohityadavcloud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - didn't test it though

@rohityadavcloud
Copy link
Member

@blueorangutan test

@blueorangutan
Copy link

@rohityadavcloud a [SF] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@rohityadavcloud
Copy link
Member

@blueorangutan test alma8 kvm-alma8

@blueorangutan
Copy link

@rohityadavcloud a [SF] Trillian-Jenkins test job (alma8 mgmt + kvm-alma8) has been kicked to run smoke tests

@blueorangutan
Copy link

[SF] Trillian test result (tid-7962)
Environment: kvm-alma8 (x2), Advanced Networking with Mgmt server a8
Total time taken: 42908 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8089-t7962-kvm-alma8.zip
Smoke tests completed. 107 look OK, 2 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_03_deploy_vm_wrong_checksum Error 43.84 test_templates.py
test_09_list_templates_download_details Failure 0.06 test_templates.py
test_01_migrate_VM_and_root_volume Error 83.08 test_vm_life_cycle.py
test_02_migrate_VM_with_two_data_disks Error 50.71 test_vm_life_cycle.py
test_08_migrate_vm Error 47.03 test_vm_life_cycle.py

@blueorangutan
Copy link

[SF] Trillian test result (tid-7961)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 48364 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8089-t7961-kvm-centos7.zip
Smoke tests completed. 105 look OK, 4 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_03_deploy_vm_wrong_checksum Error 40.57 test_templates.py
test_09_list_templates_download_details Failure 0.05 test_templates.py
test_08_upgrade_kubernetes_ha_cluster Failure 3643.84 test_kubernetes_clusters.py
test_09_delete_kubernetes_ha_cluster Failure 0.06 test_kubernetes_clusters.py
ContextSuite context=TestKubernetesCluster>:teardown Error 324.61 test_kubernetes_clusters.py
test_08_migrate_vm Error 43.77 test_vm_life_cycle.py
test_hostha_enable_ha_when_host_in_maintenance Error 303.79 test_hostha_kvm.py
test_hostha_kvm_host_degraded Error 9.15 test_hostha_kvm.py

Copy link
Member

@weizhouapache weizhouapache left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code lgtm

@vishesh92 vishesh92 force-pushed the fix-failure-on-agent-reconnection branch from 04ef4c2 to 17d6385 Compare October 18, 2023 06:58
@vishesh92 vishesh92 force-pushed the fix-failure-on-agent-reconnection branch from 17d6385 to 39b9cfe Compare October 18, 2023 07:16
@rohityadavcloud
Copy link
Member

@blueorangutan package

@blueorangutan
Copy link

@rohityadavcloud a [SF] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan
Copy link

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 7411

// update the DB
if (host != null && transitState) {
disconnectAgent(host, event, _nodeId);
// update the DB
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// update the DB

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was part of the initial code and we are updating the state on disconnecting the agent.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you mean this comment should stay there @vishesh92 ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, yes. The comment should be more detailed instead.

Copy link
Contributor

@shwstppr shwstppr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code LGTM

Copy link
Contributor

@DaanHoogland DaanHoogland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note the use of host.getUuid() instead of host.getId() in my suggestions!?

@vishesh92 vishesh92 force-pushed the fix-failure-on-agent-reconnection branch from 7bab9b3 to c8dd68f Compare October 23, 2023 07:24
@vishesh92
Copy link
Member Author

@blueorangutan package

@blueorangutan
Copy link

@vishesh92 a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan
Copy link

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 7468

@vishesh92
Copy link
Member Author

@blueorangutan test

@blueorangutan
Copy link

@vishesh92 a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@DaanHoogland
Copy link
Contributor

@blueorangutan test

@blueorangutan
Copy link

@DaanHoogland a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan
Copy link

[SF] Trillian test result (tid-8071)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 38840 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8089-t8071-kvm-centos7.zip
Smoke tests completed. 107 look OK, 2 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_08_migrate_vm Error 45.83 test_vm_life_cycle.py
test_hostha_enable_ha_when_host_in_maintenance Error 305.88 test_hostha_kvm.py

@DaanHoogland
Copy link
Contributor

@blueorangutan package

@blueorangutan
Copy link

@DaanHoogland a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan
Copy link

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 7509

@DaanHoogland
Copy link
Contributor

@blueorangutan test

@blueorangutan
Copy link

@DaanHoogland a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@DaanHoogland
Copy link
Contributor

tried to break this by setting a breakpoint on the host check in the disconnect handler on the secondary MS/host and could see the host getting back up and connected to its original MS/host.

@blueorangutan
Copy link

[SF] Trillian test result (tid-8096)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 41889 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8089-t8096-kvm-centos7.zip
Smoke tests completed. 109 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File

@DaanHoogland DaanHoogland merged commit 3b11663 into apache:4.18 Oct 26, 2023
26 of 27 checks passed
@DaanHoogland DaanHoogland deleted the fix-failure-on-agent-reconnection branch October 26, 2023 14:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants