
On Azure ARM, destroy_node() does VHD cleanup even when NIC cleanup fails #1120

Closed
ldipenti wants to merge 5 commits

Conversation

@ldipenti (Contributor) commented Oct 4, 2017

Description

When calling destroy_node() on Azure ARM, several cloud calls happen. After successfully destroying the VM, the method tries to remove all NICs related to that VM. If one of these attempts fails, the method stops there and the VHD is left behind.
The updates in this PR make destroy_node() carry on after a NIC cleanup failure, trying to remove the remaining NICs and the assigned VHD.
This PR also adds tests for destroy_node() and slightly modifies the test suite so that different HTTP response sequences can be simulated.
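
A minimal sketch of the resulting control flow, using stand-in callables rather than the real Azure ARM calls (none of the names below are libcloud API):

    # Sketch only: the cleanup callables stand in for the real Azure ARM
    # delete calls (NICs, VHD).
    def destroy_with_cleanup(delete_vm, cleanup_steps):
        delete_vm()              # the VM itself is destroyed first
        success = True
        for step in cleanup_steps:
            try:
                step()
            except Exception:
                success = False  # remember the failure...
                # ...but keep going: remaining NICs and the VHD still get a try
        return success

    def failing_nic_cleanup():
        raise IOError("simulated NIC cleanup failure")

    # The second cleanup step (think: the VHD) still runs even though the
    # first one (a NIC) failed:
    assert destroy_with_cleanup(lambda: None, [failing_nic_cleanup, lambda: None]) is False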

Status

  • done, ready for review

Checklist (tick everything that applies)

  • Code linting (required, can be done after the PR checks)
  • Documentation
  • Tests
  • ICLA (required for bigger changes)

Commits:
…destroying a node but the NIC failed to be cleaned up.
Added missing tests.
@codecov-io commented Oct 4, 2017

Codecov Report

Merging #1120 into trunk will increase coverage by 0.04%.
The diff coverage is 96%.

Impacted file tree graph

@@            Coverage Diff            @@
##           trunk    #1120      +/-   ##
=========================================
+ Coverage   85.4%   85.44%   +0.04%     
=========================================
  Files        346      346              
  Lines      66193    66272      +79     
  Branches    5892     5899       +7     
=========================================
+ Hits       56530    56629      +99     
+ Misses      7265     7241      -24     
- Partials    2398     2402       +4
Impacted Files Coverage Δ
libcloud/test/compute/test_azure_arm.py 100% <100%> (ø) ⬆️
libcloud/compute/drivers/azure_arm.py 50.53% <75%> (+3.74%) ⬆️
libcloud/common/azure.py 83.67% <0%> (-0.33%) ⬇️
libcloud/test/compute/test_ec2.py 97.74% <0%> (-0.17%) ⬇️
libcloud/test/common/test_upcloud.py 100% <0%> (ø) ⬆️
libcloud/compute/drivers/upcloud.py 95.95% <0%> (ø) ⬆️
libcloud/test/test_http.py 97.43% <0%> (+0.46%) ⬆️
libcloud/common/upcloud.py 91.01% <0%> (+0.53%) ⬆️
libcloud/http.py 92.42% <0%> (+3.11%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 23e65be...9c2571b.

@pquentin (Contributor) commented Oct 5, 2017

Thanks for this! I'm glad that you added tests too. ✨

Don't you think we should return False when NIC cleanup fails but VHD cleanup succeeds?

@tetron (Contributor) commented Oct 5, 2017

@pquentin The abstract interface for a libcloud Node just says "True on success and False on failure" but it is not obvious what that should mean in this case. Perhaps:

  • True means all resources were either successfully deleted, or returned 404 (in particular, trying to destroy a VM resource that doesn't exist would be considered a success, not a failure)
  • False means one or more resource deletion actions failed with some other 4xx or 5xx status

This would allow a caller to safely invoke destroy_node() in a retry loop until it returns True (assuming the failures are cloud weather that will eventually resolve themselves). It would also let us remove the retry loops from destroy_node itself (and document that the caller is responsible for performing retries).
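
To illustrate the contract proposed above, a caller could retry like this (destroy_until_gone is a hypothetical helper, not libcloud API; only driver.destroy_node() is the real method):

    import time

    def destroy_until_gone(driver, node, delay=10, max_attempts=30):
        # Keep calling destroy_node() until it reports that every resource
        # is gone (True), assuming remaining failures are transient.
        for _ in range(max_attempts):
            if driver.destroy_node(node):
                return True
            time.sleep(delay)
        return False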

@pquentin (Contributor) commented Oct 5, 2017

@tetron As I don't use Azure ARM, I would be happy to trust you on the True/False choice. What do you think is the best choice?

Regarding the retry loops, I would prefer not to change the API, but maybe we can improve the way the loops work if we believe they could be improved.

@pquentin (Contributor) commented Oct 6, 2017

Okay, reading your answer more carefully, you seem to agree that we should only return True when everything was removed successfully. @ldipenti Do you think you could change this part of your PR? Thanks!

@ldipenti (Contributor, Author) commented Oct 6, 2017

@pquentin of course!

Commits:
…have been cleaned up successfully.
Can be called multiple times to retry when partial errors happen.
@pquentin (Contributor) left a comment

Thanks, that's much better! We're nearly there.

@@ -766,9 +776,10 @@ def destroy_node(self, node, ex_destroy_nic=True, ex_destroy_vhd=True):
                     # LibcloudError. Wait a bit and try again.
                     time.sleep(10)
                 else:
-                    raise
+                    success = False
+                    break
Contributor:

I don't think swallowing the exception is a good idea here. It's OK if this raises. Maybe another PR could ensure the number of retries is fixed and return False once we have gotten "LeaseIdMissing", say, 10 times, but that is out of scope here.

What do you think?

Contributor (Author):

@pquentin how about cases when in a previous call the VHD was successfully deleted but some NIC wasn't? Should we check for 404s and ignore them like we do with NICs and VMs?

Contributor:

I had not noticed that we were swallowing an exception in the NIC and VM cases too. I don't really know. Do you think returning False without exposing the exception is the correct choice?

Contributor (Author):

I suppose we could catch and re-raise specific exceptions, for example the ones related to throttling/rate limits, where blind retrying would be dangerous, and for the rest log the message and return False. Maybe @tetron has another idea.
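
A sketch of that suggestion, under the assumption that throttling surfaces as HTTP 429 through libcloud's BaseHTTPError (try_cleanup is a hypothetical helper, not part of the PR):

    import logging
    from libcloud.common.exceptions import BaseHTTPError

    log = logging.getLogger(__name__)

    def try_cleanup(delete_call):
        # Re-raise where blind retrying is dangerous; log and report
        # failure for everything else.
        try:
            delete_call()
        except BaseHTTPError as exc:
            if exc.code == 429:  # assumption: rate limiting uses HTTP 429
                raise
            log.warning("cleanup failed, reporting False: %s", exc)
            return False
        return True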

Contributor:

I'm fine with re-raising if we get an exception from NIC or VHD delete. The key change here is to catch 404 and treat it as success.
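
A sketch of that key change, assuming the delete goes through a libcloud connection and HTTP errors surface as BaseHTTPError (delete_ignoring_404 is a hypothetical helper):

    from libcloud.common.exceptions import BaseHTTPError

    def delete_ignoring_404(connection, action):
        # Treat "already gone" (404) as success, which keeps repeated
        # destroy_node() calls idempotent.
        try:
            connection.request(action, method='DELETE')
        except BaseHTTPError as exc:
            if exc.code == 404:
                return True
            raise  # any other 4xx/5xx surfaces to the caller
        return True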

Contributor:

And actually, the meaning of "False" when returned by destroy_node() is ambiguous, so it is probably better to raise an exception anyway.

@asfgit asfgit closed this in 7d442a1 Oct 10, 2017
@pquentin (Contributor) commented

Thank you @ldipenti, merged in trunk! ✨

For the next PR, can you try splitting your commit messages in two? One short summary on the first line, followed by more details (http://tbaggery.com/2008/04/19/a-note-about-git-commit-messages.html). Thanks!
