Fix a race condition in GCE’s list_nodes() #727

Closed
lhuard1A wants to merge 1 commit

    for i in v.get('instances', []):
        try:
            list_nodes.append(self._to_node(i))
        except ResourceNotFoundError:

I think silently swallowing the exception justifies a comment :-)

@tonybaloney

Thanks @lhuard1A for the contribution. Could you please add a comment about the swallowed exception? Then I can merge this.

Invoking GCE’s `list_nodes()` while some VMs are being shut down can cause
the following exception to be raised from `list_nodes()`:

```
  File "/usr/lib/python2.7/site-packages/libcloud/compute/drivers/gce.py", line 1411, in list_nodes
    v.get('instances', [])]
  File "/usr/lib/python2.7/site-packages/libcloud/compute/drivers/gce.py", line 5065, in _to_node
    extra['boot_disk'] = self.ex_get_volume(bd['name'], bd['zone'])
  File "/usr/lib/python2.7/site-packages/libcloud/compute/drivers/gce.py", line 3982, in ex_get_volume
    response = self.connection.request(request, method='GET').object
  File "/usr/lib/python2.7/site-packages/libcloud/common/google.py", line 684, in request
    *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/libcloud/common/base.py", line 736, in request
    response = responseCls(**kwargs)
  File "/usr/lib/python2.7/site-packages/libcloud/common/base.py", line 119, in __init__
    self.object = self.parse_body()
  File "/usr/lib/python2.7/site-packages/libcloud/common/google.py", line 259, in parse_body
    raise ResourceNotFoundError(message, self.status, code)
libcloud.common.google.ResourceNotFoundError: {'domain': 'global', 'message': "The resource 'projects/lenaic/zones/europe-west1-c/disks/devops-reg' was not found", 'reason': 'notFound'}
```

The above error occurred while the `devops-reg` machine was being deleted.

The issue occurs when the following events happen in this order:

* [`list_nodes()` sends a request to list all the instances.](https://github.com/apache/libcloud/blob/trunk/libcloud/compute/drivers/gce.py#L1622)
  At this point, the `devops-reg` instance still exists.
* The `devops-reg` instance is deleted.
* `list_nodes()` calls `_to_node`, which calls [`ex_get_volume` to retrieve information about the instance's volumes](https://github.com/apache/libcloud/blob/trunk/libcloud/compute/drivers/gce.py#L4235).
  Because the instance was deleted after it was listed, `ex_get_volume` raises a `ResourceNotFoundError` exception.

When this happens, we should simply discard the node that was deleted during the execution of `list_nodes()` and return the information about the other nodes.
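
The sequence above, and the proposed handling, can be sketched with a small simulation. This is not the libcloud code: `FakeGCE`, its methods, the `between_requests` hook, and the `web-1` instance name are hypothetical stand-ins, and `ResourceNotFoundError` is redefined locally so the example runs without libcloud installed.

```python
class ResourceNotFoundError(Exception):
    """Local stand-in for libcloud.common.google.ResourceNotFoundError."""


class FakeGCE:
    """Hypothetical in-memory stand-in for the GCE API (not libcloud code)."""

    def __init__(self, disks):
        # Maps instance name -> boot disk name.
        self.disks = dict(disks)

    def list_instances(self):
        # Snapshot of the instance names, like the initial "list" request.
        return list(self.disks)

    def delete_instance(self, name):
        # Deleting the instance also removes its boot disk.
        del self.disks[name]

    def get_volume(self, name):
        # Second round trip, playing the role of ex_get_volume().
        if name not in self.disks:
            raise ResourceNotFoundError("The resource %r was not found" % name)
        return self.disks[name]


def list_nodes(api, between_requests=lambda: None):
    instances = api.list_instances()
    between_requests()  # hook that simulates a concurrent deletion
    nodes = []
    for name in instances:
        try:
            nodes.append({"name": name, "boot_disk": api.get_volume(name)})
        except ResourceNotFoundError:
            # The instance vanished after it was listed:
            # skip it and keep the other nodes.
            continue
    return nodes


api = FakeGCE({"devops-reg": "devops-reg", "web-1": "web-1"})
nodes = list_nodes(api, between_requests=lambda: api.delete_instance("devops-reg"))
print([n["name"] for n in nodes])  # prints ['web-1']
```

Without the `try`/`except`, the same call would propagate `ResourceNotFoundError` and the caller would get no node list at all, even though `web-1` is unaffected by the deletion.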

@lhuard1A force-pushed the fix_race_in_list_nodes branch from f49cea8 to 6d2b3cf on April 1, 2016 14:17

lhuard1A commented Apr 1, 2016

Thanks for the review @tonybaloney.
I have added a comment explaining the rationale behind ignoring the exception.

@tonybaloney

LGTM 👍

@asfgit closed this in b1d0731 on Apr 11, 2016
asfgit pushed a commit that referenced this pull request on Apr 12, 2016
Signed-off-by: anthony-shaw <anthony.p.shaw@gmail.com>