On Azure ARM, destroy_node() do VHD cleanup even when NIC cleanup fails #1120
destroying a node but the NIC failed to be cleaned up. Added missing tests.
Codecov Report
| Coverage Diff | trunk | #1120 | +/- |
|---|---|---|---|
| Coverage | 85.4% | 85.44% | +0.04% |
| Files | 346 | 346 | |
| Lines | 66193 | 66272 | +79 |
| Branches | 5892 | 5899 | +7 |
| Hits | 56530 | 56629 | +99 |
| Misses | 7265 | 7241 | -24 |
| Partials | 2398 | 2402 | +4 |
Thanks for this! I'm glad that you added tests too. ✨ Don't you think we should return False when NIC cleanup fails but VHD cleanup succeeds?
@pquentin The abstract interface for a libcloud
This would allow a caller to safely invoke destroy_node() in a retry loop until it returns True (assuming the failures are cloud weather that will eventually resolve themselves). This would allow us to remove the retry loops from destroy_node as well (and document that the caller is responsible for performing retries).
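The caller-side retry loop proposed here could look roughly like the sketch below. It assumes destroy_node() returns True only once everything has been cleaned up. `destroy_with_retries` and `fake_destroy` are hypothetical names for illustration; they are not part of libcloud.

```python
import time


def destroy_with_retries(destroy, attempts=5, delay=0.0):
    """Call destroy() until it returns True or the attempts run out.

    `destroy` is any zero-argument callable, for example a wrapper
    around driver.destroy_node(node) (hypothetical usage).
    """
    for attempt in range(attempts):
        if destroy():
            return True
        # Only wait when another attempt will follow.
        if attempt < attempts - 1:
            time.sleep(delay)
    return False


# Stand-in for a real destroy call: fails twice, then succeeds.
outcomes = iter([False, False, True])
calls = []


def fake_destroy():
    calls.append(1)
    return next(outcomes)


result = destroy_with_retries(fake_destroy, attempts=5)
```

This only works if a repeated call is safe, which is why the question of how to treat already-deleted resources comes up later in the review.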
@tetron As I don't use Azure ARM, I would be happy to trust you on the True/False choice. What do you think is the best choice? Regarding the retry loops, I would prefer not to change the API, but we could improve the way the loops work if we believe they could be improved.
Okay, reading your answer more carefully, you seem to agree that we should only return True when everything was removed successfully. @ldipenti Do you think you could change this part of your PR? Thanks!
@pquentin of course!
have been cleaned up successfully. Can be called multiple times to retry when partial errors happen.
Thanks, that's much better! We're nearly there.
@@ -766,9 +776,10 @@ def destroy_node(self, node, ex_destroy_nic=True, ex_destroy_vhd=True):
# LibcloudError. Wait a bit and try again.
time.sleep(10)
else:
raise
success = False
break
I don't think swallowing the exception is a good idea here. It's OK if this raises. Maybe another PR could ensure the number of retries is fixed and returns False when we got "LeaseIdMissing", say, 10 times, but it's out of scope here.
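The bounded-retry idea suggested here as a possible follow-up PR might be sketched as follows. This is an illustration only: `FakeCloudError` is a hypothetical stand-in for the driver's error type, and matching on the string "LeaseIdMissing" is an assumption about how the error would be recognised.

```python
import time


class FakeCloudError(Exception):
    """Hypothetical stand-in for the error raised by the cloud API."""


def delete_with_bounded_retries(delete, max_retries=10, delay=0.0):
    """Retry delete() while the lease error persists, up to max_retries.

    Returns True on success and False if the lease error was seen on
    every attempt. Any other error propagates to the caller.
    """
    for _ in range(max_retries):
        try:
            delete()
            return True
        except FakeCloudError as exc:
            if "LeaseIdMissing" not in str(exc):
                raise  # unrelated errors are not swallowed
            time.sleep(delay)
    return False
```

With a fixed retry count the loop can no longer spin forever, while unexpected errors still reach the caller.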
What do you think?
@pquentin how about cases when in a previous call the VHD was successfully deleted but some NIC wasn't? Should we check for 404s and ignore them like we do with NICs and VMs?
I had not noticed that we were swallowing an exception in the NIC and VM cases too. I don't really know. Do you think returning False without exposing the exception is the correct choice?
I suppose we could catch and re-raise specific exceptions, for example the ones related to throttling/rate limits, which would make retrying dangerous, and for the rest log the message and return False. Maybe @tetron has another idea.
I'm fine with re-raising if we get an exception from NIC or VHD delete. The key change here is to catch a 404 and treat it as success.
And actually, the meaning of "False" when returned by destroy_node() is ambiguous, so it is probably better to raise an exception anyway.
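The approach discussed in this thread (treat a 404 as "already deleted" and let every other error propagate) could be sketched like this. `FakeHTTPError` and its `code` attribute are hypothetical stand-ins, not the driver's actual exception class.

```python
class FakeHTTPError(Exception):
    """Hypothetical stand-in for an HTTP error carrying a status code."""

    def __init__(self, code, message=""):
        super().__init__(message or "HTTP %d" % code)
        self.code = code


def delete_ignoring_missing(delete):
    """Run delete(); a 404 means the resource is already gone."""
    try:
        delete()
    except FakeHTTPError as exc:
        if exc.code != 404:
            raise  # anything else is a real failure for the caller


def already_gone():
    raise FakeHTTPError(404)


# A resource removed by an earlier call does not fail the retry.
delete_ignoring_missing(already_gone)
```

Treating 404 as success is what makes a second destroy_node() call safe after a partial failure, since resources removed the first time no longer raise.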
returning False with no additional information. Updated tests.
Thank you @ldipenti, merged in trunk! ✨ For the next PR, can you try splitting your commit messages in two? One short summary on the first line, followed by more details (http://tbaggery.com/2008/04/19/a-note-about-git-commit-messages.html). Thanks!
Description
When calling destroy_node() on Azure ARM, several cloud calls happen. After successfully destroying the VM, the method tries to remove all NICs related to that VM. If one of these attempts fails, the VHD's life is spared: it is never deleted.
The updates in this PR make destroy_node() continue with its task after a NIC cleanup failure, trying to remove the remaining NICs and the assigned VHD.
Also added tests for destroy_node() and modified the test suite slightly so that it can simulate different HTTP response sequences.
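The control flow described above (keep going after a NIC failure, then still attempt the VHD) can be illustrated with a small standalone sketch. It is not the PR's code: `cleanup_resources` and the deleter callables are hypothetical, and collecting errors in a list is just one way to report partial failure.

```python
def cleanup_resources(nic_deleters, vhd_deleter):
    """Attempt every cleanup step and return the errors encountered."""
    errors = []
    for delete_nic in nic_deleters:
        try:
            delete_nic()
        except Exception as exc:  # sketch only; real code should catch narrower types
            errors.append(exc)
    # The VHD is attempted even if some NIC deletions failed.
    try:
        vhd_deleter()
    except Exception as exc:
        errors.append(exc)
    return errors


deleted = []


def failing_nic():
    raise RuntimeError("NIC still in use")


errors = cleanup_resources(
    [failing_nic, lambda: deleted.append("nic-2")],
    lambda: deleted.append("vhd"),
)
```

Here the first NIC fails, yet the second NIC and the VHD are still removed, and the caller can inspect `errors` to decide whether to retry.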