Fix a race condition in GCE’s list_nodes() #727

Closed
lhuard1A commented Mar 25, 2016

    for i in v.get('instances', []):
        try:
            list_nodes.append(self._to_node(i))
        except ResourceNotFoundError:


tonybaloney Mar 31, 2016

Contributor

I think silently swallowing the exception justifies a comment :-)

tonybaloney commented Mar 31, 2016

Thanks @lhuard1A for the contribution. Could you please add a comment about the swallowed exception? Then I can merge this.

Invoking GCE’s `list_nodes()` while some VMs are being shut down can result
in the following exception being raised from `list_nodes()`:

```
  File "/usr/lib/python2.7/site-packages/libcloud/compute/drivers/gce.py", line 1411, in list_nodes
    v.get('instances', [])]
  File "/usr/lib/python2.7/site-packages/libcloud/compute/drivers/gce.py", line 5065, in _to_node
    extra['boot_disk'] = self.ex_get_volume(bd['name'], bd['zone'])
  File "/usr/lib/python2.7/site-packages/libcloud/compute/drivers/gce.py", line 3982, in ex_get_volume
    response = self.connection.request(request, method='GET').object
  File "/usr/lib/python2.7/site-packages/libcloud/common/google.py", line 684, in request
    *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/libcloud/common/base.py", line 736, in request
    response = responseCls(**kwargs)
  File "/usr/lib/python2.7/site-packages/libcloud/common/base.py", line 119, in __init__
    self.object = self.parse_body()
  File "/usr/lib/python2.7/site-packages/libcloud/common/google.py", line 259, in parse_body
    raise ResourceNotFoundError(message, self.status, code)
libcloud.common.google.ResourceNotFoundError: {'domain': 'global', 'message': "The resource 'projects/lenaic/zones/europe-west1-c/disks/devops-reg' was not found", 'reason': 'notFound'}
```

The above error occurred while the `devops-reg` machine was being deleted.

The issue occurs when the following events happen in this order:

* [`list_nodes()` sends a request to list all the instances.](https://github.com/apache/libcloud/blob/trunk/libcloud/compute/drivers/gce.py#L1622)
  At this point, the `devops-reg` instance still exists.
* The `devops-reg` instance is deleted.
* `list_nodes()` calls `_to_node()`, which calls [`ex_get_volume()` to retrieve information about the instance's volumes.](https://github.com/apache/libcloud/blob/trunk/libcloud/compute/drivers/gce.py#L4235)
  Because the instance has been deleted since it was listed, `ex_get_volume()` raises a `ResourceNotFoundError` exception.

When this happens, we should simply discard the node that was deleted during the execution of `list_nodes()` and return the information about the other nodes.
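
The race and the proposed handling can be reproduced without a GCE account. The following is a minimal sketch, not the driver code: `FakeGCE`, `to_node`, the `between_calls` hook, and the instance and disk names other than `devops-reg` are hypothetical stand-ins, and `ResourceNotFoundError` is redefined locally instead of being imported from `libcloud.common.google`.

```python
class ResourceNotFoundError(Exception):
    """Local stand-in for libcloud.common.google.ResourceNotFoundError."""


class FakeGCE:
    """Hypothetical in-memory stand-in for the GCE API (illustration only)."""

    def __init__(self, disks):
        # Maps instance name -> boot disk name.
        self.disks = dict(disks)

    def list_instances(self):
        # Snapshot of the instances that exist at the time of the call.
        return [{'name': name} for name in sorted(self.disks)]

    def delete_instance(self, name):
        # In this sketch, deleting an instance also removes its boot disk.
        del self.disks[name]

    def get_volume(self, name):
        if name not in self.disks:
            raise ResourceNotFoundError("disk for %r was not found" % name)
        return self.disks[name]


def to_node(api, instance):
    # Plays the role of _to_node(): one extra request per listed instance.
    return {'name': instance['name'],
            'boot_disk': api.get_volume(instance['name'])}


def list_nodes(api, between_calls=None):
    instances = api.list_instances()
    if between_calls is not None:
        between_calls()  # simulate a deletion racing with the listing
    nodes = []
    for instance in instances:
        try:
            nodes.append(to_node(api, instance))
        except ResourceNotFoundError:
            # The instance disappeared after it was listed: skip it and
            # still return the remaining nodes.
            continue
    return nodes


api = FakeGCE({'devops-reg': 'devops-reg-disk', 'web-1': 'web-1-disk'})
nodes = list_nodes(api, between_calls=lambda: api.delete_instance('devops-reg'))
print([n['name'] for n in nodes])  # ['web-1']
```

Catching the exception per instance, rather than around the whole loop, is what lets a single vanished node be dropped without losing the rest of the listing.
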

lhuard1A force-pushed the lhuard1A:fix_race_in_list_nodes branch from f49cea8 to 6d2b3cf Apr 1, 2016


lhuard1A commented Apr 1, 2016

Thanks for the review @tonybaloney.
I have added a comment explaining the rationale for ignoring the exception.


tonybaloney commented Apr 11, 2016

LGTM 👍

@asfgit asfgit closed this in b1d0731 Apr 11, 2016
asfgit pushed a commit that referenced this pull request Apr 12, 2016
Signed-off-by: anthony-shaw <anthony.p.shaw@gmail.com>