Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

agent: Fixes #2633 don't wait for pending tasks on reconnection #2638

Merged
merged 1 commit into from May 16, 2018

Conversation

@rhtyd
Copy link
Member

commented May 11, 2018

When agent loses connection with management server, the reconnection
logic waits for any pending tasks to finish. However, when such tasks
do finish they fail to send an Answer back to managements server.
Therefore from a management server's perspective such pending
operations are stuck in a FSM state and need manual removal or fixing.
This is by design where management server's side cmd-answer request
pattern is code/execution dependent, therefore even if the answer
were to be sent when management server came back up (reconnects)
the management server will fail to acknowledge and process the answer
due to missing listeners or being in the exact state to handle answers.

Historically, the Agent would wait to reconnect until the internal
tasks complete but I found no reason why it should wait for reconnection
at all.

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

GitHub Issue/PRs

Screenshots (if appropriate):

How Has This Been Tested?

Before fix: Started a snapshot of a volume, killed/shutdown the management server to see that agent is blocked until the job finished. When the job finishes, it fails to send answer. When mgmt server is started again, it has the snapshot still in backing state. However, the agent is blocked until the job finishes, even if the mgmt server were to come up online. Irrespective of the case, the pending job fails to reply (as the link object changes, the send fails).

After fix: The same as above, but this time agent is not blocked by any long-running pending job and reconnects faster. The failure scenarios remain the same, including manual fixing (if any) needed after the mgmt server is back.

Checklist:

  • I have read the CONTRIBUTING document.
  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
    Testing
  • I have added tests to cover my changes.
  • All relevant new and existing integration tests have passed.
  • A full integration testsuite with all test that can run on my environment has passed.
agent: Fixes #2633 don't wait for pending tasks on reconnection
When agent loses connection with management server, the reconnection
logic waits for any pending tasks to finish. However, when such tasks
do finish they fail to send an `Answer` back to managements server.
Therefore from a management server's perspective such pending
operations are stuck in a FSM state and need manual removal or fixing.
This is by design where management server's side cmd-answer request
pattern is code/execution dependent, therefore even if the answer
were to be sent when management server came back up (reconnects)
the management server will fail to acknowledge and process the answer
due to missing listeners or being in the exact state to handle answers.

Historically, the Agent would wait to reconnect until the internal
tasks complete but I found no reason why it should wait for reconnection
at all.

Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>

@rhtyd rhtyd added this to the 4.11.1.0 milestone May 11, 2018

@rhtyd rhtyd self-assigned this May 11, 2018

@rhtyd

This comment has been minimized.

Copy link
Member Author

commented May 11, 2018

@blueoragutan package

@DaanHoogland
Copy link
Contributor

left a comment

this quickly fixes the issue. follow up work will come as jobs may or may not still be valid upon disconnect. see possible strategies in #2633

@borisstoyanov

This comment has been minimized.

Copy link
Contributor

commented May 14, 2018

@blueorangutan package

@blueorangutan

This comment has been minimized.

Copy link

commented May 14, 2018

@borisstoyanov a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

This comment has been minimized.

Copy link

commented May 14, 2018

Packaging result: ✔centos6 ✔centos7 ✔debian. JID-2053

@borisstoyanov

This comment has been minimized.

Copy link
Contributor

commented May 14, 2018

@blueorangutan

This comment has been minimized.

Copy link

commented May 14, 2018

@borisstoyanov a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@wido

wido approved these changes May 14, 2018

Copy link
Contributor

left a comment

Code LGTM, but the side-effects this might have are unknown to me

@rhtyd

This comment has been minimized.

Copy link
Member Author

commented May 14, 2018

@wido good point. I'm thinking we may need additional time to test the side-effects if any, however theoretically reading the code the same error/fault situations/behaviour will be seen irrespective of this change. I'm okay to let it not merge towards 4.11.1.0 milestone, but consider for 4.11.2.0 milestone.

@DaanHoogland @PaulAngus have taken over as RM, I'll ask them to decide on the fate of this PR

@wido

This comment has been minimized.

Copy link
Contributor

commented May 14, 2018

Is this something which we need in 4.11? Yes, that is LTS, but can't we postpone this to 4.12?

@marcaurele

This comment has been minimized.

Copy link
Member

commented May 14, 2018

I would argue for 4.12. It's something that has been around for years and need more "real" testing before going into a LTS.

@sureshanaparti

This comment has been minimized.

Copy link
Contributor

commented May 14, 2018

@rhtyd Are the pending tasks in the agent cancelled/failed before re-connecting to management server? May be these job states has to be communicated back to the management server after re-connection for sync and clean up operations. How the agent disconnects & communication handled due to intermittent network issues (atleast with minor delay)?

@blueorangutan

This comment has been minimized.

Copy link

commented May 14, 2018

Trillian test result (tid-2672)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 38784 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr2638-t2672-kvm-centos7.zip
Intermitten failure detected: /marvin/tests/smoke/test_migration.py
Intermitten failure detected: /marvin/tests/smoke/test_privategw_acl.py
Intermitten failure detected: /marvin/tests/smoke/test_vm_life_cycle.py
Intermitten failure detected: /marvin/tests/smoke/test_vpc_redundant.py
Intermitten failure detected: /marvin/tests/smoke/test_hostha_kvm.py
Smoke tests completed. 65 look OK, 2 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File
test_02_vpc_privategw_static_routes Failure 198.16 test_privategw_acl.py
test_03_vpc_privategw_restart_vpc_cleanup Error 218.29 test_privategw_acl.py
test_04_rvpc_privategw_static_routes Failure 374.23 test_privategw_acl.py
test_01_secured_vm_migration Error 24.53 test_vm_life_cycle.py
test_01_secured_vm_migration Error 25.61 test_vm_life_cycle.py
test_02_not_secured_vm_migration Error 0.08 test_vm_life_cycle.py
test_03_secured_to_nonsecured_vm_migration Error 0.08 test_vm_life_cycle.py
test_04_nonsecured_to_secured_vm_migration Error 0.07 test_vm_life_cycle.py
@rhtyd

This comment has been minimized.

Copy link
Member Author

commented May 15, 2018

@sureshanaparti The pending tasks fail before reconnecting to the management server. I tried a while-wait approach for answers to be sent only when mgmt server is connected again. With this approach the answers will be sent after mgmt server is reconnected, however, due to lack of thread context such answers are dropped by mgmt server. There is no way to influence state transitions/progress/success/failure in the current design. As long as the tcp connection is not closed, delays and latencies are okay (within the same dc I don't think they would be huge).

@rhtyd

This comment has been minimized.

Copy link
Member Author

commented May 15, 2018

Tests LGTM, but I'll kick another round.
@blueorangutan test matrix

@blueorangutan

This comment has been minimized.

Copy link

commented May 15, 2018

@rhtyd a Trillian-Jenkins matrix job (centos6 mgmt + xs71, centos7 mgmt + vmware65, centos7 mgmt + kvmcentos7) has been kicked to run smoke tests

@rhtyd

This comment has been minimized.

Copy link
Member Author

commented May 15, 2018

All, if there are no objections based on the discussion on dev@ and code investigation and experimentation/testing, I'll merge the short-term fix. The large design fix can be done for the future 4.11.2.0 or 4.12.0.0+ release. If there are any side-effect related hesitations or objections please let me know by end of this week, thanks.

@blueorangutan

This comment has been minimized.

Copy link

commented May 15, 2018

Trillian test result (tid-2674)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 21630 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr2638-t2674-kvm-centos7.zip
Intermitten failure detected: /marvin/tests/smoke/test_privategw_acl.py
Intermitten failure detected: /marvin/tests/smoke/test_vpc_vpn.py
Smoke tests completed. 65 look OK, 2 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File
test_03_vpc_privategw_restart_vpc_cleanup Error 118.61 test_privategw_acl.py
test_01_redundant_vpc_site2site_vpn Failure 357.83 test_vpc_vpn.py
@blueorangutan

This comment has been minimized.

Copy link

commented May 16, 2018

Trillian test result (tid-2675)
Environment: vmware-65 (x2), Advanced Networking with Mgmt server 7
Total time taken: 29090 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr2638-t2675-vmware-65.zip
Intermitten failure detected: /marvin/tests/smoke/test_privategw_acl.py
Intermitten failure detected: /marvin/tests/smoke/test_vpc_redundant.py
Smoke tests completed. 66 look OK, 1 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File
test_03_vpc_privategw_restart_vpc_cleanup Error 249.93 test_privategw_acl.py
@rhtyd

This comment has been minimized.

Copy link
Member Author

commented May 16, 2018

@blueorangutan test centos7 xenserver-71

@blueorangutan

This comment has been minimized.

Copy link

commented May 16, 2018

@rhtyd a Trillian-Jenkins test job (centos7 mgmt + xenserver-71) has been kicked to run smoke tests

@borisstoyanov
Copy link
Contributor

left a comment

LGTM, let's wait for latest results

@apache apache deleted a comment from blueorangutan May 16, 2018

@blueorangutan

This comment has been minimized.

Copy link

commented May 16, 2018

Trillian test result (tid-2677)
Environment: xenserver-71 (x2), Advanced Networking with Mgmt server 7
Total time taken: 20786 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr2638-t2677-xenserver-71.zip
Intermitten failure detected: /marvin/tests/smoke/test_privategw_acl.py
Smoke tests completed. 66 look OK, 1 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File
test_03_vpc_privategw_restart_vpc_cleanup Error 182.82 test_privategw_acl.py
@rhtyd

This comment has been minimized.

Copy link
Member Author

commented May 16, 2018

Tests LGTM, based on reviews/discussions and test results I'll merge this. If there are any objections, we'll revert this. Thanks.

@rhtyd rhtyd merged commit d893fb5 into apache:4.11 May 16, 2018

1 of 2 checks passed

Jenkins Looks like there's a problem with this pull request
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

asfgit pushed a commit that referenced this pull request May 16, 2018

Merge branch '4.11': Fixes #2633 don't block agent for pending tasks …
…on reconnection (#2638)

Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.