InvalidInstanceID.NotFound exception on ec2 resource instances.all() #977

Closed
jantman opened this Issue Feb 6, 2017 · 3 comments


@jantman
jantman commented Feb 6, 2017

I'm intermittently getting this error when calling .all() on the EC2 Service Resource instances collection:

  File "awslimitchecker/lib/python2.7/site-packages/awslimitchecker/services/ec2.py", line 247, in _instance_usage
    for inst in self.resource_conn.instances.all():
  File "awslimitchecker/lib/python2.7/site-packages/boto3/resources/collection.py", line 83, in __iter__
    for page in self.pages():
  File "awslimitchecker/lib/python2.7/site-packages/boto3/resources/collection.py", line 166, in pages
    for page in pages:
  File "awslimitchecker/lib/python2.7/site-packages/botocore/paginate.py", line 102, in __iter__
    response = self._make_request(current_kwargs)
  File "awslimitchecker/lib/python2.7/site-packages/botocore/paginate.py", line 174, in _make_request
    return self._method(**current_kwargs)
  File "awslimitchecker/lib/python2.7/site-packages/botocore/client.py", line 251, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "awslimitchecker/lib/python2.7/site-packages/botocore/client.py", line 537, in _make_api_call
    raise ClientError(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (InvalidInstanceID.NotFound) when calling the DescribeInstances operation: The instance IDs 'i-0a1a5cb3fb472c5ea, i-02cc14dfc6cba644e' do not exist

We have a LOT of instances and also do a lot of autoscaling. We're running this code daily, and have seen the error twice in the past two months. From what I can tell, it happens when an instance disappears (is pruned from the DescribeInstances results) during pagination; i.e., the instance was present when the initial API call was made, but terminated before instances.all() finished iterating.

The code inside the loop is fairly trivial, just incrementing some counters based on attributes of the instance.
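Conceptually it looks something like this (illustrative names, not the actual awslimitchecker code):

    from collections import defaultdict

    import boto3

    resource_conn = boto3.resource('ec2')
    counts = defaultdict(int)
    # the ClientError is raised by the iteration itself, not by the loop body
    for inst in resource_conn.instances.all():
        counts[inst.instance_type] += 1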

From what I can tell, the problem is in ResourceCollection.pages(), where it calls for item in self._handler(self._parent, params, page):. I can only infer that the instances in question are disappearing between the time their IDs show up in the DescribeInstances response being paginated and the time the handler tries to create resource objects for them. I suppose I could just use the botocore paginator directly, or use the client instead of the service resource, but it seems like this case should be handled when a Resource Collection's .all() method attempts to create an object for a resource that no longer exists...
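To make the client/paginator idea concrete, here's roughly what I have in mind (an untested sketch; the retry-on-race wrapper is my own addition, not anything boto3 provides):

    import boto3
    from botocore.exceptions import ClientError

    client = boto3.client('ec2')

    def count_instance_types(max_attempts=3):
        # The paginator yields plain response dicts, so no Instance resource
        # objects are built for IDs that vanish mid-pagination; if the race
        # still hits the DescribeInstances call itself, retry the whole listing.
        for _ in range(max_attempts):
            counts = {}
            try:
                for page in client.get_paginator('describe_instances').paginate():
                    for reservation in page['Reservations']:
                        for inst in reservation['Instances']:
                            itype = inst['InstanceType']
                            counts[itype] = counts.get(itype, 0) + 1
                return counts
            except ClientError as e:
                if e.response['Error']['Code'] != 'InvalidInstanceID.NotFound':
                    raise
        raise RuntimeError('InvalidInstanceID.NotFound on every attempt')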

@jantman jantman referenced this issue in jantman/awslimitchecker Feb 6, 2017
Closed

race condition in services.ec2 #229

@stealthycoin
Contributor

What version of the SDK are you using? Can you start by updating to the most recent boto3/botocore? The stack trace you posted does not match up with our current codebase. Our pagination should not be loading the resources that .all() generates, so it shouldn't be throwing this error.

It would also be very helpful to see the logs produced by boto3.set_stream_logger('') when this error occurs.
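i.e. something like this near the top of your script:

    import boto3

    # Passing '' attaches the DEBUG stream handler to the root logger,
    # so botocore's request/response wire logs are captured as well.
    boto3.set_stream_logger('')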

@jantman
jantman commented Feb 9, 2017

@stealthycoin

The code in question (for me) is in a Jenkins job that runs once daily. The above traceback was captured on January 13th, the last time we hit that exception.

It was using boto3 1.4.3 and botocore 1.4.93, which were current at the time. We're now running with boto3-1.4.4 and botocore-1.5.8.

I've set up a quick script on my local machine that runs our code in question every 10 seconds until it catches this exception, and logs everything (boto3 and botocore) at DEBUG. Hopefully that will be able to catch the issue with the latest code...
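The script is essentially this shape (paraphrased; not the exact code):

    import time

    import boto3
    from botocore.exceptions import ClientError

    boto3.set_stream_logger('')  # DEBUG logging for boto3 and botocore

    while True:
        try:
            for inst in boto3.resource('ec2').instances.all():
                pass  # same trivial per-instance work as the real job
        except ClientError:
            break  # the DEBUG logs above should show the failing request
        time.sleep(10)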

I'll update sometime today or tomorrow if I find anything... if not, I'll modify the code for our Jenkins job to write the boto3 debug logging to a file that gets archived as an artifact, close this issue, and reopen if we can reproduce it...

@jantman
jantman commented Feb 17, 2017

@stealthycoin ok, using current versions (boto3-1.4.4 and botocore-1.5.8) I've run code to reproduce this issue in a loop for almost 3 days now, sleeping only 10 seconds between iterations. I haven't hit the error yet, so I'm going to say "can't reproduce" and assume this was fixed recently. I'll reopen if I can ever reproduce and have full debug logging. Sorry for the noise.

@jantman jantman closed this Feb 17, 2017