Snapshot deletes fail without errors after ec2 session times out #11

waynerobinson · 2013-07-11T06:18:48Z

The program claims to delete a specific snapshot id.
That snapshot id is not deleted when the program completes.
This appears to be happening after deleting many (hundreds) of snapshots.

This may or may not be related to a rate limit that sometimes shows up in other large snapshot delete operations.

ehammond · 2013-07-11T22:45:30Z

Thanks for reporting this. Are you saying that the program is not reporting any errors trying to delete the snapshot? How do you know it is a rate limit error?

waynerobinson · 2013-07-11T22:57:00Z

The program definitely isn't reporting any errors.

I am making the assumption (due to my Perl skills being pretty non-existent) that it is a rate limit error because I'm currently trying to clean up a snapshot archive with more than 20k snapshots, deleting many hundreds per volume at a time and I've been getting the odd rate limit error in the EC2 Console and my automated snapshotting script as I'm using the ec2-expire-snapshot script.

In situations where I have been deleting around 300+ snapshots, not only the first couple of hundred seem to get deleted, yet the script runs to completion. Thankfully, given the implementation of the script I can run it again to remove any remaining snapshots.

I would prefer that these rate limit errors retried after waiting for a period of time for the rate limit counters to recharge (which is basically how they work) as well as potentially having the ability to have the ec2-expire-snapshot script work in a more "off peak" manner… for example, deleting no more than 1 snapshot per second (or similar) so it doesn't interfere with the operation of other production scripts (snapshotting, instance creation/control, etc).

ehammond · 2013-07-11T23:07:33Z

Does this describe what you are actually seeing?

The program claims to delete a specific snapshot id.
That snapshot id is not deleted when the program completes.

waynerobinson · 2013-07-11T23:08:28Z

That is correct.

ehammond · 2013-07-11T23:15:21Z

The code does appear to check for an error return from the EBS snapshot delete call, so there might be an issue with the underlying package.

I'd be comfortable adding an option to reduce the rate of EBS snapshot deletes by adding an N second delay between deletes, say:

ec2-expire-snapshots --delete-delay N [...]

I'm not sure when I'll get around to this, but if somebody submits a patch or merge request, I'd be happy to review.

waynerobinson · 2013-07-12T02:07:41Z

Maybe this isn't a rate limit issue. I am running the same expiration scripts with a 2 second delay and it appears that the script still stalls in a similar way (saying it's deleting snapshots but not actually deleting them).

waynerobinson · 2013-07-12T02:50:06Z

Is there some way I can turn on full logging for all HTTP messages?

ehammond · 2013-07-12T02:52:57Z

Try uncommenting this line:
# ($Debug ? (debug => 1) : ()),

ehammond · 2013-07-12T02:54:30Z

I'm wondering if it is a timing issue or if we need to re-create the Net::Amazon::EC2 object after a while.

ehammond · 2013-07-12T02:58:31Z

Does increasing the --delete-delay significantly (60 seconds?) cause fewer snapshots to be deleted successfully or more?

waynerobinson · 2013-07-12T03:07:29Z

OK, at some point after starting deletions the API starts returning:

$VAR1 = {
          'RequestID' => 'N/A',
          'Errors' => [
                      {
                        'Error' => {
                                   'Code' => 'HTTP POST FAILURE',
                                   'Message' => '503 Service Unavailable'
                                 }
                      }
                    ]
        };


ERROR CODE: HTTP POST FAILURE MESSAGE: 503 Service Unavailable FOR REQUEST: N/A

I've tested this with delete delays of 5 seconds, however will try a 30 second version (although it may take some time for it to start failing).

ehammond · 2013-07-12T03:17:13Z

Hm, sure would be nice if Net::Amazon::EC2 detected and returned that error.

waynerobinson · 2013-07-12T03:42:56Z

OK, it appears that the 503 may be the rate limit message (the above message was the result of no delay).

After waiting some time in the 30 second delayed version it started generating the below messages:

$VAR1 = {
          'RequestID' => 'b856ee33-69b6-4d78-b806-a4e4c8d6c91c',
          'Errors' => [
                      {
                        'Error' => {
                                   'Code' => 'RequestExpired',
                                   'Message' => 'Request has expired. Timestamp date is 2013-07-12T03:07:00.000Z'
                                 }
                      }
                    ]
        };


ERROR CODE: RequestExpired MESSAGE: Request has expired. Timestamp date is 2013-07-12T03:07:00.000Z FOR REQUEST: b856ee33-69b6-4d78-b806-a4e4c8d6c91c

So you might be right, there may be a session timeout or something that you may need to regenerate the `Net::Amazon::EC2" object after, maybe regenerate and retry when the timeout message is received?

ehammond · 2013-07-12T04:01:30Z

Thanks for running these tests. Am I correct in understanding that you see these errors only in the Net::Amazon::EC2 debug output and that none of them show up as being reported by ec2-expire-snapshots?

waynerobinson · 2013-07-12T04:03:06Z

Yes, these are only available in the library's debug output.

ec2-expire-snapshots just ignores both these errors.

ehammond · 2015-09-08T06:27:16Z

Looks like this thread got lost for a while. After re-reading, I think the next step would be to test having the program re-open a new Net::Amazon::EC2 object every N minutes. If this improves the situation, a pull request can be submitted.

waynerobinson mentioned this issue Jul 12, 2013

Add --delete-delay DELAY_IN_SECONDS option. #12

Closed

waynerobinson closed this as completed Jul 12, 2013

waynerobinson reopened this Jul 12, 2013

ehammond closed this as completed in c2186dc Jul 12, 2013

ehammond reopened this Jul 12, 2013

ehammond changed the title ~~Snapshot deletes fail without errors~~ Snapshot deletes fail without errors after ec2 session times out Dec 7, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Snapshot deletes fail without errors after ec2 session times out #11

Snapshot deletes fail without errors after ec2 session times out #11

waynerobinson commented Jul 11, 2013

ehammond commented Jul 11, 2013

waynerobinson commented Jul 11, 2013

ehammond commented Jul 11, 2013

waynerobinson commented Jul 11, 2013

ehammond commented Jul 11, 2013

waynerobinson commented Jul 12, 2013

waynerobinson commented Jul 12, 2013

ehammond commented Jul 12, 2013

ehammond commented Jul 12, 2013

ehammond commented Jul 12, 2013

waynerobinson commented Jul 12, 2013

ehammond commented Jul 12, 2013

waynerobinson commented Jul 12, 2013

ehammond commented Jul 12, 2013

waynerobinson commented Jul 12, 2013

ehammond commented Sep 8, 2015

Snapshot deletes fail without errors after ec2 session times out #11

Snapshot deletes fail without errors after ec2 session times out #11

Comments

waynerobinson commented Jul 11, 2013

ehammond commented Jul 11, 2013

waynerobinson commented Jul 11, 2013

ehammond commented Jul 11, 2013

waynerobinson commented Jul 11, 2013

ehammond commented Jul 11, 2013

waynerobinson commented Jul 12, 2013

waynerobinson commented Jul 12, 2013

ehammond commented Jul 12, 2013

ehammond commented Jul 12, 2013

ehammond commented Jul 12, 2013

waynerobinson commented Jul 12, 2013

ehammond commented Jul 12, 2013

waynerobinson commented Jul 12, 2013

ehammond commented Jul 12, 2013

waynerobinson commented Jul 12, 2013

ehammond commented Sep 8, 2015