Fix failure on agent reconnection #8089
Conversation
Codecov Report
@@            Coverage Diff            @@
##              4.18    #8089   +/-   ##
=========================================
  Coverage    13.06%   13.07%
- Complexity    9109     9111    +2
=========================================
  Files         2720     2720
  Lines       257526   257566   +40
  Branches     40150    40154    +4
=========================================
+ Hits         33655    33666   +11
- Misses      219644   219671   +27
- Partials      4227     4229    +2
... and 8 files with indirect coverage changes
@blueorangutan package
@vishesh92 a [SF] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.
Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 7351
LGTM - didn't test it though
@blueorangutan test
@rohityadavcloud a [SF] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests
@blueorangutan test alma8 kvm-alma8
@rohityadavcloud a [SF] Trillian-Jenkins test job (alma8 mgmt + kvm-alma8) has been kicked to run smoke tests
engine/orchestration/src/main/java/com/cloud/agent/manager/AgentManagerImpl.java
[SF] Trillian test result (tid-7962)
[SF] Trillian test result (tid-7961)
code lgtm
04ef4c2 to 17d6385
17d6385 to 39b9cfe
@blueorangutan package
@rohityadavcloud a [SF] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.
Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 7411
// update the DB
if (host != null && transitState) {
    disconnectAgent(host, event, _nodeId);
    // update the DB
// update the DB
It was part of the initial code and we are updating the state on disconnecting the agent.
Do you mean this comment should stay there, @vishesh92?
IMO, yes. The comment should be more detailed instead.
Code LGTM
Note the use of host.getUuid() instead of host.getId() in my suggestions!?
7bab9b3 to c8dd68f
@blueorangutan package
@vishesh92 a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.
Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 7468
@blueorangutan test
@vishesh92 a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests
@blueorangutan test
@DaanHoogland a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests
[SF] Trillian test result (tid-8071)
@blueorangutan package
@DaanHoogland a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.
Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 7509
@blueorangutan test
@DaanHoogland a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests
Tried to break this by setting a breakpoint on the host check in the disconnect handler on the secondary MS/host, and could see the host getting back up and connected to its original MS/host.
[SF] Trillian test result (tid-8096)
Description
Depending on the agents' configuration, restarting a management server (the preferred MS for the agent) makes the agent connect to another management server (a non-preferred MS). When the preferred MS comes back up, the agent tries to disconnect from the non-preferred MS and connect to the preferred MS. A race condition can occur during this process, in which the disconnection from the non-preferred MS completes after the connection to the preferred MS. This causes the agent to go into an Alert state. During this time, the agent is still sending Ping to the preferred MS.
This PR solves this issue by:
Ping command: if the Host is not in Up state, we request the agent to send a startup command again to the connection. If the startup is successful, the agent will come back in Up state.
To reproduce the issue,
Types of changes
Feature/Enhancement Scale or Bug Severity
Feature/Enhancement Scale
Bug Severity
Screenshots (if appropriate):
How Has This Been Tested?
Alert state in database. After getting a ping, it gets a startup command after which it turns back to Up state.
How did you try to break this feature and the system with this change?
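The recovery path from the Description and the testing notes can be sketched in Java as follows. This is a minimal, hypothetical model, not the code in AgentManagerImpl: the class `PingReconnectSketch`, the `ManagementServer` type, its methods, and the `"request-startup"` marker are all illustrative assumptions. It only shows the idea that a late disconnect can leave the host in Alert, and that the next Ping then triggers a request for a fresh startup.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of the behaviour described in this PR.
// None of these names are taken from the CloudStack code base.
class PingReconnectSketch {
    enum HostState { UP, ALERT }

    static final class ManagementServer {
        // State of the host record as seen by this management server.
        HostState hostState = HostState.ALERT;
        // Requests this server has sent back to the agent (illustrative only).
        final List<String> requestsToAgent = new ArrayList<>();

        // The agent completed a startup handshake: the host goes to UP.
        void onStartup() {
            hostState = HostState.UP;
        }

        // A disconnect event was processed: the host goes to ALERT.
        void onDisconnect() {
            hostState = HostState.ALERT;
        }

        // Ping handling: if the host is not UP, ask the agent to send a
        // startup command again instead of silently accepting the ping.
        // Returns true when the ping was accepted as-is.
        boolean onPing() {
            if (hostState != HostState.UP) {
                requestsToAgent.add("request-startup");
                return false;
            }
            return true;
        }
    }

    public static void main(String[] args) {
        ManagementServer preferred = new ManagementServer();
        preferred.onStartup();     // agent reconnects to the preferred MS
        preferred.onDisconnect();  // stale disconnect from the other MS lands late
        System.out.println("after late disconnect: " + preferred.hostState);

        boolean accepted = preferred.onPing(); // ping arrives while host is ALERT
        System.out.println("ping accepted: " + accepted);

        preferred.onStartup();     // agent answers the startup request
        System.out.println("after re-startup: " + preferred.hostState);
    }
}
```

In this sketch the ordering of `onStartup()` and `onDisconnect()` stands in for the race: the disconnect is applied after the new connection, so only the Ping check brings the host back to Up.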