
On Azure ARM, make destroy_node() do VHD cleanup even when NIC cleanup fails #1120

Closed
wants to merge 5 commits

Conversation

@ldipenti
Contributor

commented Oct 4, 2017

Description

When calling destroy_node() on Azure ARM, several cloud calls happen. After successfully destroying the VM, the method tries to remove all NICs related to that VM. If any of these attempts fails, the VHD is never deleted.
The updates in this PR make destroy_node() continue with its task after a NIC cleanup failure, trying to remove the remaining NICs and the assigned VHD.
This PR also adds tests for destroy_node() and slightly modifies the test suite so that different HTTP response sequences can be simulated.
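The change in control flow can be sketched with a stub driver. All names here (`StubDriver`, `delete_vm`, `delete_nic`, `delete_vhd`) are hypothetical and only illustrate the order of operations, not the actual driver code:

```python
class StubDriver:
    """Hypothetical stand-in for the Azure ARM driver; records what it
    deletes instead of issuing real cloud calls."""

    def __init__(self, failing_nics=()):
        self.failing_nics = set(failing_nics)
        self.deleted = []

    def delete_vm(self, name):
        self.deleted.append(('vm', name))

    def delete_nic(self, name):
        if name in self.failing_nics:
            raise RuntimeError('NIC delete failed: %s' % name)
        self.deleted.append(('nic', name))

    def delete_vhd(self, name):
        self.deleted.append(('vhd', name))


def destroy_node(driver, vm, nics, vhd):
    # Destroy the VM first, then every NIC, then the VHD. With this PR,
    # a failing NIC no longer aborts the cleanup: the failure is
    # recorded and the loop moves on, so the remaining NICs and the
    # VHD still get removed.
    driver.delete_vm(vm)
    success = True
    for nic in nics:
        try:
            driver.delete_nic(nic)
        except Exception:
            success = False
    try:
        driver.delete_vhd(vhd)
    except Exception:
        success = False
    return success
```

Before this PR, the first NIC failure propagated out of the loop and the VHD survived; here the VHD is still removed and the partial failure is reported through the return value.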

Status

  • done, ready for review

Checklist (tick everything that applies)

  • Code linting (required, can be done after the PR checks)
  • Documentation
  • Tests
  • ICLA (required for bigger changes)
On Azure ARM, fix destroy_node() return value when successfully
destroying a node but the NIC failed to be cleaned up.
Added missing tests.
@codecov-io

commented Oct 4, 2017

Codecov Report

Merging #1120 into trunk will increase coverage by 0.04%.
The diff coverage is 96%.

Impacted file tree graph

@@            Coverage Diff            @@
##           trunk    #1120      +/-   ##
=========================================
+ Coverage   85.4%   85.44%   +0.04%     
=========================================
  Files        346      346              
  Lines      66193    66272      +79     
  Branches    5892     5899       +7     
=========================================
+ Hits       56530    56629      +99     
+ Misses      7265     7241      -24     
- Partials    2398     2402       +4
Impacted Files Coverage Δ
libcloud/test/compute/test_azure_arm.py 100% <100%> (ø) ⬆️
libcloud/compute/drivers/azure_arm.py 50.53% <75%> (+3.74%) ⬆️
libcloud/common/azure.py 83.67% <0%> (-0.33%) ⬇️
libcloud/test/compute/test_ec2.py 97.74% <0%> (-0.17%) ⬇️
libcloud/test/common/test_upcloud.py 100% <0%> (ø) ⬆️
libcloud/compute/drivers/upcloud.py 95.95% <0%> (ø) ⬆️
libcloud/test/test_http.py 97.43% <0%> (+0.46%) ⬆️
libcloud/common/upcloud.py 91.01% <0%> (+0.53%) ⬆️
libcloud/http.py 92.42% <0%> (+3.11%) ⬆️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 23e65be...9c2571b. Read the comment docs.

@pquentin

Contributor

commented Oct 5, 2017

Thanks for this! I'm glad that you added tests too.

Don't you think we should return False when NIC cleanup fails but VHD cleanup succeeds?

@tetron

Contributor

commented Oct 5, 2017

@pquentin The abstract interface for a libcloud Node just says "True on success and False on failure" but it is not obvious what that should mean in this case. Perhaps:

  • True means all resources were either successfully deleted, or returned 404 (in particular, trying to destroy a VM resource that doesn't exist would be considered a success, not a failure)
  • False means one or more resource deletion actions failed with some other 4xx or 5xx status

This would allow a caller to safely invoke destroy_node() in a retry loop until it returns True (assuming the failures are cloud weather that will eventually resolve themselves). This would allow us to remove the retry loops from destroy_node as well (and document that the caller is responsible for performing retries).
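The contract proposed in the two bullets above could be sketched as a small wrapper. `FakeHTTPError` and its `http_code` attribute are illustrative stand-ins, not libcloud's real exception types:

```python
class FakeHTTPError(Exception):
    """Stand-in for a real HTTP error; carries only the status code."""

    def __init__(self, http_code):
        super().__init__(http_code)
        self.http_code = http_code


def delete_ignoring_404(delete_call):
    """Run one delete call under the proposed destroy_node() contract:
    a 404 means the resource is already gone and counts as success;
    any other 4xx/5xx makes the overall result False, so the caller
    can keep retrying destroy_node() until it returns True."""
    try:
        delete_call()
        return True
    except FakeHTTPError as exc:
        return exc.http_code == 404


def already_gone():
    raise FakeHTTPError(404)


def server_error():
    raise FakeHTTPError(500)
```

Under these semantics a caller could loop on destroy_node() until it returns True, pushing retry policy out of the driver entirely.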

@pquentin

Contributor

commented Oct 5, 2017

@tetron As I don't use Azure ARM, I would be happy to trust you on the True/False choice. What do you think is the best choice?

Regarding the retry loops, I would prefer not to change the API, but maybe we could improve the way the loops work if we believe they could be better.

@pquentin

Contributor

commented Oct 6, 2017

Okay, reading your answer more carefully, you seem to agree that we should only return True when everything was removed with success. @ldipenti Do you think you could change this part of your PR? Thanks!

@ldipenti

Contributor Author

commented Oct 6, 2017

@pquentin of course!

ldipenti added 2 commits Oct 6, 2017
destroy_node() will return True only when all of the node's resources
have been cleaned up successfully.
It can be called multiple times to retry when partial errors happen.
@pquentin
Contributor

left a comment

Thanks, that's much better! We're nearly there.

@@ -766,9 +776,10 @@ def destroy_node(self, node, ex_destroy_nic=True, ex_destroy_vhd=True):
# LibcloudError. Wait a bit and try again.
time.sleep(10)
else:
raise
success = False
break

@pquentin
Contributor

commented Oct 8, 2017

I don't think swallowing the exception is a good idea here. It's OK if this raises. Maybe another PR could ensure the number of retries is fixed and returns False when we got "LeaseIdMissing", say, 10 times, but it's out of scope here.

What do you think?

@ldipenti
Author Contributor

commented Oct 8, 2017

@pquentin how about cases where, in a previous call, the VHD was successfully deleted but some NIC wasn't? Should we check for 404s and ignore them like we do with NICs and VMs?

@pquentin
Contributor

commented Oct 8, 2017

I had not noticed that we were swallowing an exception in the NIC and VM cases too. I don't really know. Do you think returning False without exposing the exception is the correct choice?

@ldipenti
Author Contributor

commented Oct 9, 2017

I suppose we could catch and re-raise specific exceptions (for example, the ones related to throttling/rate limits, where blind retrying would be dangerous), and for the rest log the message and return False. Maybe @tetron has another idea.

@tetron
Contributor

commented Oct 9, 2017

I'm fine with re-raising if we get an exception from the NIC or VHD delete. The key change here is to catch 404 and treat it as success.

@tetron
Contributor

commented Oct 9, 2017

And actually, the meaning of "False" when returned by destroy_node() is ambiguous, so it is probably better to raise an exception anyway.

ldipenti added 2 commits Oct 9, 2017
Now destroy_node() raises an exception on unexpected errors instead of
returning False with no additional information.
Updated tests.
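The behaviour described in these commits can be sketched as follows; the helper, the probe, and `FakeHTTPError` with its `http_code` attribute are hypothetical names for illustration, not the driver's real error handling:

```python
class FakeHTTPError(Exception):
    """Stand-in for a real HTTP error; carries only the status code."""

    def __init__(self, http_code):
        super().__init__(http_code)
        self.http_code = http_code


def delete_or_raise(delete_call):
    """Delete one resource during destroy_node(). A 404 means the
    resource was already removed (e.g. by a previous, partially failed
    attempt) and is treated as success; any other error is re-raised so
    the caller gets full diagnostics instead of a bare False."""
    try:
        delete_call()
    except FakeHTTPError as exc:
        if exc.http_code != 404:
            raise


def outcome(delete_call):
    """Probe helper for the example: 'ok' if the error was swallowed,
    'raised' if it propagated."""
    try:
        delete_or_raise(delete_call)
        return 'ok'
    except FakeHTTPError:
        return 'raised'


def already_gone():
    raise FakeHTTPError(404)


def throttled():
    raise FakeHTTPError(429)
```

This keeps destroy_node() safe to call repeatedly after partial failures, while unexpected errors surface with their original context.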

@asfgit asfgit closed this in 7d442a1 Oct 10, 2017

@pquentin

Contributor

commented Oct 10, 2017

Thank you @ldipenti, merged in trunk!

For the next PR, can you try splitting your commit messages in two? One short summary on the first line, followed by more details (http://tbaggery.com/2008/04/19/a-note-about-git-commit-messages.html). Thanks!
