SSVM cannot reconnect after connection disruption if there is an active event. #2633

Closed
PaulAngus opened this Issue May 9, 2018 · 4 comments

@PaulAngus
Contributor

PaulAngus commented May 9, 2018

ISSUE TYPE
  • Bug Report
COMPONENT NAME
System VMs
(Maybe also KVM hosts)
CLOUDSTACK VERSION
4.11.0
CONFIGURATION
OS / ENVIRONMENT

4.11.0 environment with VMware

SUMMARY

If mgmt server <-> agent communication is interrupted while an action is in progress (for example, the mgmt server restarts while the SSVM is performing a snapshot), the SSVM cannot reconnect and logs the following error:
2018-05-09 11:37:09,403 INFO [cloud.agent.Agent] (Agent-Handler-9:null) Lost connection to host: 10.220.136.127. Dealing with the remaining commands...
2018-05-09 11:37:09,404 INFO [cloud.agent.Agent] (Agent-Handler-9:null) Cannot connect because we still have 1 commands in progress.

STEPS TO REPRODUCE
While a volume snapshot is exporting the OVF, restart the management server.
EXPECTED RESULTS
SSVM reconnects.
ACTUAL RESULTS
The storage VM does not reconnect to the management server and logs an error such as:
INFO  [cloud.agent.Agent] (Agent-Handler-9:null) Lost connection to host: 10.220.136.127. Dealing with the remaining commands...
INFO  [cloud.agent.Agent] (Agent-Handler-9:null) Cannot connect because we still have 1 commands in progress.

Once the job has finished, the SSVM reconnects, but until then all other jobs fail unless another secondary storage VM is up and running.
The backup job, even though it is forced to complete on secondary storage, is still left in the DB in the backing-up state forever, so it does not make sense that the agent even waits for it to finish.

@PaulAngus PaulAngus added this to the 4.11.1.0 milestone May 9, 2018

@rhtyd

Member

rhtyd commented May 10, 2018

@PaulAngus This is by design of the agent (not specific to any CloudStack version): any pending job in the Agent's internal queue blocks it from reconnecting until the job finishes. /cc @nvazquez @DaanHoogland - any comments on how to deal with this?
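
The gating behaviour described above can be illustrated with a minimal sketch. This is not CloudStack's actual `Agent` class: the class and method names are hypothetical, and only the log wording is taken from the report. It assumes the agent tracks a count of in-progress commands and refuses to reconnect while that count is non-zero.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model of the reported behaviour; not the real CloudStack Agent.
final class BlockingReconnectSketch {
    // Number of commands the agent is currently executing.
    private final AtomicInteger inProgress = new AtomicInteger();

    void commandStarted() {
        inProgress.incrementAndGet();
    }

    void commandFinished() {
        inProgress.decrementAndGet();
    }

    /** Returns true if a reconnect may be attempted, false while commands are still running. */
    boolean mayReconnect() {
        int pending = inProgress.get();
        if (pending > 0) {
            // Same wording as the log line quoted in the issue.
            System.out.println("Cannot connect because we still have " + pending + " commands in progress.");
            return false;
        }
        return true;
    }
}
```

With one long-running command, such as an OVF export, `mayReconnect()` keeps returning false until `commandFinished()` is called, which matches the symptom in the report.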

@DaanHoogland

Contributor

DaanHoogland commented May 10, 2018

@rhtyd, as discussed on other media, we can:

  0. make sure a reconnect is always attempted
  1. have another thread guarding that there is always exactly one connection open
  2. have a setting 'agent.always.reconnect'
  3. have a setting 'agent.job.queue.maximum.size' that makes sure reconnect attempts are only made when fewer than a certain number of jobs are in the queue

or any combination of the above
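
As a rough sketch of how options 2 and 3 could combine: the two property names below come from the comment above and are proposals, not settings known to exist in CloudStack. The class name, the defaults and the use of `java.util.Properties` are assumptions made for illustration.

```java
import java.util.Properties;

// Hypothetical policy combining the two proposed settings; not CloudStack code.
final class ReconnectPolicySketch {
    private final boolean alwaysReconnect; // proposed 'agent.always.reconnect'
    private final int maxQueuedJobs;       // proposed 'agent.job.queue.maximum.size'

    ReconnectPolicySketch(boolean alwaysReconnect, int maxQueuedJobs) {
        this.alwaysReconnect = alwaysReconnect;
        this.maxQueuedJobs = maxQueuedJobs;
    }

    /** Reads the proposed settings; the defaults (false, 1) are assumptions that mimic the blocking behaviour. */
    static ReconnectPolicySketch fromProperties(Properties props) {
        boolean always = Boolean.parseBoolean(props.getProperty("agent.always.reconnect", "false"));
        int max = Integer.parseInt(props.getProperty("agent.job.queue.maximum.size", "1"));
        return new ReconnectPolicySketch(always, max);
    }

    /** Reconnect if forced by the flag, or if fewer than the configured number of jobs are queued. */
    boolean shouldReconnect(int jobsInQueue) {
        return alwaysReconnect || jobsInQueue < maxQueuedJobs;
    }
}
```

With the assumed defaults, a single queued job still blocks reconnection; setting the flag to true, or raising the queue limit, relaxes that.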

@rhtyd

Member

rhtyd commented May 11, 2018

I looked at the git history and the code; there is no discussion about this design decision. When the connection to the management server is lost, the agent quickly shuts down the links and other internal connection-related data structures. There is a high chance that any pending task in its internal queue, even if it succeeds, will fail to send a proper/valid response, in which case it fails. I'll send a PR after some testing and start a discussion on why this historic design was used and whether it is still applicable to today's version.
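
To make the failure mode above concrete, here is a small sketch of a task that finishes after its link has been shut down. All names are hypothetical, and the `Link` class is only a stand-in for whatever connection object the agent really uses.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical illustration of a late answer being lost; not CloudStack code.
final class LateAnswerSketch {
    /** Stand-in for the agent's link to the management server. */
    static final class Link {
        private final AtomicBoolean open = new AtomicBoolean(true);

        void close() {
            open.set(false);
        }

        /** Returns true only if the link is still open, i.e. the answer could be delivered. */
        boolean send(String answer) {
            return open.get();
        }
    }

    /** Runs a task, then tries to report its answer over the link that was current at dispatch time. */
    static boolean runAndReport(Runnable task, Link link) {
        task.run();
        return link.send("Answer");
    }
}
```

If the link is closed while the task runs, `runAndReport` returns false even though the task itself completed. In this simplified model that is why waiting for pending tasks gains nothing.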

@rhtyd rhtyd self-assigned this May 11, 2018

rhtyd added a commit to shapeblue/cloudstack that referenced this issue May 11, 2018

agent: Fixes apache#2633 don't wait for pending tasks on reconnection
When agent loses connection with management server, the reconnection
logic waits for any pending tasks to finish. However, when such tasks
do finish they fail to send an `Answer` back to the management server.
Therefore from a management server's perspective such pending
operations are stuck in a FSM state and need manual removal or fixing.
This is by design where management server's side cmd-answer request
pattern is code/execution dependent, therefore even if the answer
were to be sent when management server came back up (reconnects)
the management server will fail to acknowledge and process the answer
due to missing listeners or not being in the exact state to handle answers.

Historically, the Agent would wait to reconnect until the internal
tasks complete but I found no reason why it should wait for reconnection
at all.

Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
@rhtyd

Member

rhtyd commented May 11, 2018

After several tests, I've submitted a quick fix (#2638) that no longer blocks the agent from reconnecting; however, other failure cases remain the same.

A much bigger task would be to fix the agent-mgmt server execution design.

@asfgit asfgit closed this in d893fb5 May 16, 2018

asfgit pushed a commit that referenced this issue May 16, 2018

Merge branch '4.11': Fixes #2633 don't block agent for pending tasks on reconnection (#2638)

Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>