Fix a race condition in GCE’s list_nodes() #727

Closed
lhuard1A wants to merge 1 commit into base: trunk

lhuard1A commented Mar 25, 2016

libcloud/compute/drivers/gce.py
    for i in v.get('instances', []):
        try:
            list_nodes.append(self._to_node(i))
        except ResourceNotFoundError:

tonybaloney Mar 31, 2016

Contributor

I think silently swallowing the exception justifies a comment :-)

tonybaloney Mar 31, 2016

Contributor

Thanks @lhuard1A for the contribution. Could you please add a comment about the swallowed exception? Then I can merge this.

Fix a race condition in GCE’s list_nodes()
Invoking GCE’s `list_nodes()` while some VMs are being shut down can cause
the following exception to be raised from `list_nodes()`:

```
  File "/usr/lib/python2.7/site-packages/libcloud/compute/drivers/gce.py", line 1411, in list_nodes
    v.get('instances', [])]
  File "/usr/lib/python2.7/site-packages/libcloud/compute/drivers/gce.py", line 5065, in _to_node
    extra['boot_disk'] = self.ex_get_volume(bd['name'], bd['zone'])
  File "/usr/lib/python2.7/site-packages/libcloud/compute/drivers/gce.py", line 3982, in ex_get_volume
    response = self.connection.request(request, method='GET').object
  File "/usr/lib/python2.7/site-packages/libcloud/common/google.py", line 684, in request
    *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/libcloud/common/base.py", line 736, in request
    response = responseCls(**kwargs)
  File "/usr/lib/python2.7/site-packages/libcloud/common/base.py", line 119, in __init__
    self.object = self.parse_body()
  File "/usr/lib/python2.7/site-packages/libcloud/common/google.py", line 259, in parse_body
    raise ResourceNotFoundError(message, self.status, code)
libcloud.common.google.ResourceNotFoundError: {'domain': 'global', 'message': "The resource 'projects/lenaic/zones/europe-west1-c/disks/devops-reg' was not found", 'reason': 'notFound'}
```

The above error occurred while the `devops-reg` machine was being deleted.

The issue occurs when the following events happen in this order:

* [`list_nodes()` sends a request to list all the instances.](https://github.com/apache/libcloud/blob/trunk/libcloud/compute/drivers/gce.py#L1622)
  At this point, the `devops-reg` instance still exists.
* The `devops-reg` instance is deleted.
* `list_nodes()` calls `_to_node`, which calls [`ex_get_volume`, which attempts to retrieve information about the instance's volumes](https://github.com/apache/libcloud/blob/trunk/libcloud/compute/drivers/gce.py#L4235).
  Because the instance was deleted after it was listed, `ex_get_volume` raises a `ResourceNotFoundError` exception.

When this happens, we should simply discard the node that was deleted during the execution of `list_nodes()` and return the information about the other nodes.
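
As a rough illustration of the behaviour described above (not the actual patch), the following sketch simulates the race with plain dictionaries. The `ResourceNotFoundError` class, the `to_node()` helper, and the `existing_disks` set are stand-ins invented for this example; only the `try`/`except` shape mirrors the change under review.

```python
class ResourceNotFoundError(Exception):
    """Hypothetical stand-in for libcloud.common.google.ResourceNotFoundError."""


def to_node(instance, existing_disks):
    """Hypothetical stand-in for _to_node(): looks up the boot disk."""
    disk = instance['boot_disk']
    if disk not in existing_disks:
        # Mimics ex_get_volume() failing for a disk deleted mid-listing.
        raise ResourceNotFoundError("The resource %r was not found" % disk)
    return {'name': instance['name'], 'boot_disk': disk}


def list_nodes(instances, existing_disks):
    """Convert listed instances to nodes, skipping any that vanished."""
    nodes = []
    for instance in instances:
        try:
            nodes.append(to_node(instance, existing_disks))
        except ResourceNotFoundError:
            # The instance was deleted between the listing request and
            # the per-instance lookup; drop it and keep the others.
            continue
    return nodes


# 'devops-reg' was listed, but its disk is gone by the time it is looked up.
listed = [
    {'name': 'web-1', 'boot_disk': 'web-1'},
    {'name': 'devops-reg', 'boot_disk': 'devops-reg'},
]
nodes = list_nodes(listed, existing_disks={'web-1'})
print([n['name'] for n in nodes])  # prints ['web-1']
```

Catching only `ResourceNotFoundError`, rather than a broad `Exception`, keeps unrelated failures (authentication, quota, network errors) visible to the caller.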

lhuard1A Apr 1, 2016

Contributor

Thanks for the review @tonybaloney.
I have added a comment explaining the rationale for ignoring the exception.

tonybaloney Apr 11, 2016

Contributor

LGTM 👍

asfgit closed this in b1d0731 Apr 11, 2016

asfgit pushed a commit that referenced this pull request Apr 12, 2016

Add changes for #727
Signed-off-by: anthony-shaw <anthony.p.shaw@gmail.com>