
On Azure ARM, destroy_node() does VHD cleanup even when NIC cleanup fails #1120

Closed
ldipenti wants to merge 5 commits

Conversation

@ldipenti (Contributor) commented Oct 4, 2017

Description

When calling destroy_node() on Azure ARM, several cloud calls happen. After successfully destroying the VM, the method tries to remove all NICs related to that VM. If one of these attempts fails, the method stops there and the VHD is left behind.
The updates in this PR make destroy_node() carry on after a NIC cleanup failure, trying to remove the remaining NICs and the assigned VHD.
This PR also adds tests for destroy_node() and slightly modifies the test suite so that different HTTP response sequences can be simulated.
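
A minimal sketch of the resulting control flow, using stand-in callables rather than the real Azure ARM calls (none of the names below are libcloud API):

    # Sketch only: the cleanup callables stand in for the real Azure ARM
    # delete calls (NICs, VHD).
    def destroy_with_cleanup(delete_vm, cleanup_steps):
        delete_vm()              # the VM itself is destroyed first
        success = True
        for step in cleanup_steps:
            try:
                step()
            except Exception:
                success = False  # remember the failure...
                # ...but keep going: remaining NICs and the VHD still get a try
        return success

    def failing_nic_cleanup():
        raise IOError("simulated NIC cleanup failure")

    # The second cleanup step (think: the VHD) still runs even though the
    # first one (a NIC) failed:
    assert destroy_with_cleanup(lambda: None, [failing_nic_cleanup, lambda: None]) is False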

Status

  • done, ready for review

Checklist (tick everything that applies)

  • Code linting (required, can be done after the PR checks)
  • Documentation
  • Tests
  • ICLA (required for bigger changes)

Commits:
…destroying a node but the NIC failed to be cleaned up.
Added missing tests.
@codecov-io commented Oct 4, 2017

Codecov Report

Merging #1120 into trunk will increase coverage by 0.04%.
The diff coverage is 96%.

Impacted file tree graph

@@            Coverage Diff            @@
##           trunk    #1120      +/-   ##
=========================================
+ Coverage   85.4%   85.44%   +0.04%     
=========================================
  Files        346      346              
  Lines      66193    66272      +79     
  Branches    5892     5899       +7     
=========================================
+ Hits       56530    56629      +99     
+ Misses      7265     7241      -24     
- Partials    2398     2402       +4
Impacted Files Coverage Δ
libcloud/test/compute/test_azure_arm.py 100% <100%> (ø) ⬆️
libcloud/compute/drivers/azure_arm.py 50.53% <75%> (+3.74%) ⬆️
libcloud/common/azure.py 83.67% <0%> (-0.33%) ⬇️
libcloud/test/compute/test_ec2.py 97.74% <0%> (-0.17%) ⬇️
libcloud/test/common/test_upcloud.py 100% <0%> (ø) ⬆️
libcloud/compute/drivers/upcloud.py 95.95% <0%> (ø) ⬆️
libcloud/test/test_http.py 97.43% <0%> (+0.46%) ⬆️
libcloud/common/upcloud.py 91.01% <0%> (+0.53%) ⬆️
libcloud/http.py 92.42% <0%> (+3.11%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 23e65be...9c2571b.

@pquentin (Contributor) commented Oct 5, 2017

Thanks for this! I'm glad that you added tests too. ✨

Don't you think we should return False when NIC cleanup fails but VHD cleanup succeeds?

@tetron (Contributor) commented Oct 5, 2017

@pquentin The abstract interface for a libcloud Node just says "True on success and False on failure" but it is not obvious what that should mean in this case. Perhaps:

  • True means all resources were either successfully deleted, or returned 404 (in particular, trying to destroy a VM resource that doesn't exist would be considered a success, not a failure)
  • False means one or more resource deletion actions failed with some other 4xx or 5xx status

This would allow a caller to safely invoke destroy_node() in a retry loop until it returns True (assuming the failures are cloud weather that will eventually resolve themselves). It would also let us remove the retry loops from destroy_node itself (and document that the caller is responsible for performing retries).
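
To illustrate the contract proposed above, a caller could retry like this (destroy_until_gone is a hypothetical helper, not libcloud API; only driver.destroy_node() is the real method):

    import time

    def destroy_until_gone(driver, node, delay=10, max_attempts=30):
        # Keep calling destroy_node() until it reports that every resource
        # is gone (True), assuming remaining failures are transient.
        for _ in range(max_attempts):
            if driver.destroy_node(node):
                return True
            time.sleep(delay)
        return False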

@pquentin (Contributor) commented Oct 5, 2017

@tetron As I don't use Azure ARM, I would be happy to trust you on the True/False choice. What do you think is the best choice?

Regarding the retry loops, I would prefer not to change the API, but maybe we can improve the way the loops work if we believe they could be improved.

@pquentin (Contributor) commented Oct 6, 2017

Okay, reading your answer more carefully, you seem to agree that we should only return True when everything was removed successfully. @ldipenti Do you think you could change this part of your PR? Thanks!

@ldipenti (Contributor, Author) commented Oct 6, 2017

@pquentin of course!

Commits:
…have been cleaned up successfully.
Can be called multiple times to retry when partial errors happen.
@pquentin (Contributor) left a comment

Thanks, that's much better! We're nearly there.

@@ -766,9 +776,10 @@ def destroy_node(self, node, ex_destroy_nic=True, ex_destroy_vhd=True):
                     # LibcloudError. Wait a bit and try again.
                     time.sleep(10)
                 else:
-                    raise
+                    success = False
+                    break
Contributor:

I don't think swallowing the exception is a good idea here. It's OK if this raises. Maybe another PR could ensure the number of retries is fixed and return False once we have gotten "LeaseIdMissing", say, 10 times, but that is out of scope here.

What do you think?

Contributor (Author):

@pquentin how about cases when in a previous call the VHD was successfully deleted but some NIC wasn't? Should we check for 404s and ignore them like we do with NICs and VMs?

Contributor:

I had not noticed that we were swallowing an exception in the NIC and VM cases too. I don't really know. Do you think returning False without exposing the exception is the correct choice?

Contributor (Author):

I suppose we could catch and re-raise specific exceptions, for example the ones related to throttling/rate limits, where blind retrying would be dangerous, and for the rest log the message and return False. Maybe @tetron has another idea.
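
A sketch of that suggestion, under the assumption that throttling surfaces as HTTP 429 through libcloud's BaseHTTPError (try_cleanup is a hypothetical helper, not part of the PR):

    import logging
    from libcloud.common.exceptions import BaseHTTPError

    log = logging.getLogger(__name__)

    def try_cleanup(delete_call):
        # Re-raise where blind retrying is dangerous; log and report
        # failure for everything else.
        try:
            delete_call()
        except BaseHTTPError as exc:
            if exc.code == 429:  # assumption: rate limiting uses HTTP 429
                raise
            log.warning("cleanup failed, reporting False: %s", exc)
            return False
        return True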

Contributor:

I'm fine with re-raising if we get an exception from NIC or VHD delete. The key change here is to catch 404 and treat it as success.
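
A sketch of that key change, assuming the delete goes through a libcloud connection and HTTP errors surface as BaseHTTPError (delete_ignoring_404 is a hypothetical helper):

    from libcloud.common.exceptions import BaseHTTPError

    def delete_ignoring_404(connection, action):
        # Treat "already gone" (404) as success, which keeps repeated
        # destroy_node() calls idempotent.
        try:
            connection.request(action, method='DELETE')
        except BaseHTTPError as exc:
            if exc.code == 404:
                return True
            raise  # any other 4xx/5xx surfaces to the caller
        return True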

Contributor:

And actually, the meaning of "False" when returned by destroy_node() is ambiguous, so it is probably better to raise an exception anyway.

@asfgit asfgit closed this in 7d442a1 Oct 10, 2017
@pquentin (Contributor) commented

Thank you @ldipenti, merged in trunk! ✨

For the next PR, can you try splitting your commit messages in two? One short summary on the first line, followed by more details (http://tbaggery.com/2008/04/19/a-note-about-git-commit-messages.html). Thanks!
