Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Snapshot deletes fail without errors after ec2 session times out #11

Open
waynerobinson opened this issue Jul 11, 2013 · 16 comments
Open

Comments

@waynerobinson
Copy link
Contributor

The program claims to delete a specific snapshot id.
That snapshot id is not deleted when the program completes.
This appears to be happening after deleting many (hundreds) of snapshots.

This may or may not be related to a rate limit that sometimes shows up in other large snapshot delete operations.

@ehammond
Copy link
Member

Thanks for reporting this. Are you saying that the program is not reporting any errors trying to delete the snapshot? How do you know it is a rate limit error?

@waynerobinson
Copy link
Contributor Author

The program definitely isn't reporting any errors.

I am making the assumption (due to my Perl skills being pretty non-existent) that it is a rate limit error because I'm currently trying to clean up a snapshot archive with more than 20k snapshots, deleting many hundreds per volume at a time and I've been getting the odd rate limit error in the EC2 Console and my automated snapshotting script as I'm using the ec2-expire-snapshot script.

In situations where I have been deleting around 300+ snapshots, not only the first couple of hundred seem to get deleted, yet the script runs to completion. Thankfully, given the implementation of the script I can run it again to remove any remaining snapshots.

I would prefer that these rate limit errors retried after waiting for a period of time for the rate limit counters to recharge (which is basically how they work) as well as potentially having the ability to have the ec2-expire-snapshot script work in a more "off peak" manner… for example, deleting no more than 1 snapshot per second (or similar) so it doesn't interfere with the operation of other production scripts (snapshotting, instance creation/control, etc).

@ehammond
Copy link
Member

Does this describe what you are actually seeing?

  • The program claims to delete a specific snapshot id.
  • That snapshot id is not deleted when the program completes.

@waynerobinson
Copy link
Contributor Author

That is correct.

@ehammond
Copy link
Member

The code does appear to check for an error return from the EBS snapshot delete call, so there might be an issue with the underlying package.

I'd be comfortable adding an option to reduce the rate of EBS snapshot deletes by adding an N second delay between deletes, say:

ec2-expire-snapshots --delete-delay N [...]

I'm not sure when I'll get around to this, but if somebody submits a patch or merge request, I'd be happy to review.

@waynerobinson
Copy link
Contributor Author

Maybe this isn't a rate limit issue. I am running the same expiration scripts with a 2 second delay and it appears that the script still stalls in a similar way (saying it's deleting snapshots but not actually deleting them).

@ehammond ehammond reopened this Jul 12, 2013
@waynerobinson
Copy link
Contributor Author

Is there some way I can turn on full logging for all HTTP messages?

@ehammond
Copy link
Member

Try uncommenting this line:
# ($Debug ? (debug => 1) : ()),

@ehammond
Copy link
Member

I'm wondering if it is a timing issue or if we need to re-create the Net::Amazon::EC2 object after a while.

@ehammond
Copy link
Member

Does increasing the --delete-delay significantly (60 seconds?) cause fewer snapshots to be deleted successfully or more?

@waynerobinson
Copy link
Contributor Author

OK, at some point after starting deletions the API starts returning:

$VAR1 = {
          'RequestID' => 'N/A',
          'Errors' => [
                      {
                        'Error' => {
                                   'Code' => 'HTTP POST FAILURE',
                                   'Message' => '503 Service Unavailable'
                                 }
                      }
                    ]
        };


ERROR CODE: HTTP POST FAILURE MESSAGE: 503 Service Unavailable FOR REQUEST: N/A

I've tested this with delete delays of 5 seconds, however will try a 30 second version (although it may take some time for it to start failing).

@ehammond
Copy link
Member

Hm, sure would be nice if Net::Amazon::EC2 detected and returned that error.

@waynerobinson
Copy link
Contributor Author

OK, it appears that the 503 may be the rate limit message (the above message was the result of no delay).

After waiting some time in the 30 second delayed version it started generating the below messages:

$VAR1 = {
          'RequestID' => 'b856ee33-69b6-4d78-b806-a4e4c8d6c91c',
          'Errors' => [
                      {
                        'Error' => {
                                   'Code' => 'RequestExpired',
                                   'Message' => 'Request has expired. Timestamp date is 2013-07-12T03:07:00.000Z'
                                 }
                      }
                    ]
        };


ERROR CODE: RequestExpired MESSAGE: Request has expired. Timestamp date is 2013-07-12T03:07:00.000Z FOR REQUEST: b856ee33-69b6-4d78-b806-a4e4c8d6c91c

So you might be right, there may be a session timeout or something that you may need to regenerate the `Net::Amazon::EC2" object after, maybe regenerate and retry when the timeout message is received?

@ehammond
Copy link
Member

Thanks for running these tests. Am I correct in understanding that you see these errors only in the Net::Amazon::EC2 debug output and that none of them show up as being reported by ec2-expire-snapshots?

@waynerobinson
Copy link
Contributor Author

Yes, these are only available in the library's debug output.

ec2-expire-snapshots just ignores both these errors.

@ehammond
Copy link
Member

ehammond commented Sep 8, 2015

Looks like this thread got lost for a while. After re-reading, I think the next step would be to test having the program re-open a new Net::Amazon::EC2 object every N minutes. If this improves the situation, a pull request can be submitted.

@ehammond ehammond changed the title Snapshot deletes fail without errors Snapshot deletes fail without errors after ec2 session times out Dec 7, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants