[DOC] Curator fails with 'missing snapshot' intermittently (solved) #333
Thanks for submitting this. First, you mention that you're trying to delete a snapshot that Elasticsearch reports as missing. Regardless, I have racked my brain to try to guess how this scenario could happen. The explanations I can guess at are as follows.

**Duplicate snapshots in the working list.** This theory falls apart because of https://github.com/elastic/curator/blob/master/curator/cli/snapshot_selection.py#L81

**Individual nodes are not connected to the repository properly.** The error, `SnapshotMissingException`, could indicate this. While this is concerning, it's not something Curator itself has any control over. It makes a call via the Elasticsearch Python module, and if that call cannot be completed due to a remote exception, there's really nothing Curator can do. Curator can neither create this scenario nor correct it. I can try to put some better error handling in to catch this sort of failure, but that will only be cosmetic in this scenario. Ideas to troubleshoot/fix this include updating the S3 plugin and/or your Elasticsearch version.

**Snapshot is in a bad state.** The snapshot named in the error may itself be in a bad state in the repository.

**Cluster issues.** This may indicate something amiss with your cluster. One node may be able to see something another cannot.

**Conclusion.** In any case, with this not caused by having the snapshot name in the list twice, this is outside the scope of Curator. It is possible I've missed something, but these are my best guesses as to how this state could have happened. |
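As a rough illustration of that division of responsibility, here is a minimal, hypothetical sketch (the host, repository, and snapshot names are placeholders, not values from this issue) of how a Curator-style delete hands everything off to the elasticsearch-py client; the S3 traffic itself happens entirely on the Elasticsearch side:

```python
from elasticsearch import Elasticsearch, TransportError

es = Elasticsearch(["http://localhost:9200"])  # placeholder host

def delete_snapshot(client, repository, snapshot):
    """Ask Elasticsearch to delete the snapshot; the S3 work happens remotely."""
    try:
        client.snapshot.delete(repository=repository, snapshot=snapshot)
        return True
    except TransportError as err:
        # A repository/S3 failure arrives here as a remote exception;
        # the client can only report it, not correct it.
        print("Remote failure while deleting %s: %s" % (snapshot, err))
        return False
```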
Thanks for the detailed response. There are no cluster issues, duplicate snapshots, individual node connection problems, etc. All instances have the same instance profile. I can see the metadata and snapshot-curator files fine. I can also retrieve, directly from Elasticsearch, the JSON which shows that the snapshot was taken fine. Here's another form of the problem: Unable to take snapshots: transport error (whatever that means)
However, the snapshot was taken fine. I can see snapshot-curator-20150331103654 and metadata-20150331103654. I also retrieved the state of the snapshot using curl:
But curator is raising a transport error, failing my script. |
Also to answer your query:
I don't use any prefixes - just whatever defaults curator uses. All my snapshots have a format like snapshot-curator-yyyymmddhhmmss, and metadata is like metadata-curator-yyyymmddhhmmss. |
I'm sorry you're having a bad experience with Curator, and it's my goal to fix it. From what I can tell, though, the errors indicate that Curator can only be at fault indirectly.

**Flow**

Let's look at the flow again:
What this means is that the error:
...is, at its root, coming from Elasticsearch. You should actually be able to find a corresponding error in your Elasticsearch log files (in fact, I'd love to see them inline in a response here). Neither Curator nor the elasticsearch-py module has any ability of its own to connect to S3; that is handled by Elasticsearch and the S3 plugin. This is why I was suggesting that a client update on that end might be a potential fix. Can you tell me what version of the S3 plugin you are using in your Elasticsearch? And with which Elasticsearch version?
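For reference, both versions can be pulled with a couple of client calls. This is a hedged sketch: the host is a placeholder, and the plugin listing assumes a 1.x-style nodes-info response, whose exact layout varies by version:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # placeholder host

print(es.info()["version"]["number"])  # Elasticsearch version

# On 1.x clusters the nodes info API reports installed plugins
# (including the AWS/S3 plugin).
for node in es.nodes.info()["nodes"].values():
    for plugin in node.get("plugins", []):
        print(plugin.get("name"), plugin.get("version"))
```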
**Possible causes**

**Proposed resolutions**

On the simple side, add better error handling to catch this sort of failure (though, as noted above, that would be largely cosmetic). The more involved option is to add retry and verification logic: I can add some code to double-check that a snapshot is present, and attempt to perform the action again. This might warrant some "test for snapshot, attempt delete, and retry if fail" kind of logic; see the sketch below. |
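A minimal sketch of what that "test for snapshot, attempt delete, and retry if fail" logic might look like, assuming an elasticsearch-py client; the attempt count and delay are illustrative assumptions, not anything Curator ships:

```python
import time
from elasticsearch import Elasticsearch, TransportError

def snapshot_exists(es, repository, snapshot):
    try:
        result = es.snapshot.get(repository=repository, snapshot=snapshot)
        return len(result.get("snapshots", [])) > 0
    except TransportError:
        return False

def delete_with_retry(es, repository, snapshot, attempts=3, delay=10):
    for _ in range(attempts):
        if not snapshot_exists(es, repository, snapshot):
            return True  # already gone: nothing left to do
        try:
            es.snapshot.delete(repository=repository, snapshot=snapshot)
        except TransportError:
            time.sleep(delay)  # transient failure: wait, then re-check
            continue
        if not snapshot_exists(es, repository, snapshot):
            return True
    return False
```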
Hi, this is the log activity for snapshot creation. It looks like a bug, as it seems curator is retrying for some reason and bombing out. Elasticsearch complains because a snapshot is already running with the same name.
I also found this on Google, which is somewhat similar to the problem. I did a
I don't see any errors related to the AWS API call. Note the log entry |
Regarding the delete, here are the logs for comparison:

**Execution log**
**Elasticsearch log**
You see there are no AWS-specific exceptions. It is important to mention that I was running an older version of curator on the 27th, which was mainly complaining about "delete" intermittently: transport error. With the latest version of curator, the problem seems to have shifted to the creation of snapshots: "transport error". In both cases, the error logged in Elasticsearch is the same: "a snapshot is already running". So it seems curator is retrying for some reason and bombing out. |
It's good to know that the AWS plugin isn't encountering an error at the ES level. That means that Elasticsearch is getting a legitimate 404 from Amazon. Why Amazon is giving a 404 is still not clear. There is no retry within Curator; it only tries once. See https://github.com/elastic/curator/blob/master/curator/api/snapshot.py#L60-L64. That means the retry is probably inside Elasticsearch and/or the AWS plugin.
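Paraphrasing the shape of the code at that link (this is not the actual Curator source, just the single-attempt pattern it describes; names are placeholders):

```python
from elasticsearch import Elasticsearch

client = Elasticsearch(["http://localhost:9200"])  # placeholder host

def create_snapshot_once(repository, name, body=None):
    # One attempt, one except block, no retry loop anywhere.
    try:
        client.snapshot.create(repository=repository, snapshot=name,
                               body=body, wait_for_completion=True)
        return True
    except Exception as err:
        print("Snapshot request failed: %s" % err)
        return False
```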
|
Elasticsearch 1.3.2. I've tried to take a snapshot using curl against Elasticsearch, which works fine. But wait_for_completion works differently. See below:
This command completes after 40-50 seconds, and
This shows the snapshot status as "STARTED" and not SUCCESS. So wait_for_completion does not wait until the snapshot status is successful. I then quickly ran another snapshot
This failed with "a snapshot is already running". I then waited a couple more minutes and ran the command again. It succeeded. I'm afraid I'll have to stop using curator and use curl commands to take snapshots instead, which works. Something in curator seems to be retrying, I believe. |
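One way to sidestep the wait_for_completion behaviour described above is to poll the snapshot state directly until it is no longer running. A hedged sketch with placeholder names; note that the snapshot get API reports IN_PROGRESS while running, whereas the _status API reports STARTED:

```python
import time
from elasticsearch import Elasticsearch

def wait_until_done(es, repository, snapshot, poll=10, timeout=3600):
    """Poll until the snapshot leaves the running state; return its final state."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        info = es.snapshot.get(repository=repository, snapshot=snapshot)
        state = info["snapshots"][0]["state"]
        if state not in ("IN_PROGRESS", "STARTED"):
            return state  # SUCCESS, PARTIAL, or FAILED
        time.sleep(poll)
    raise RuntimeError("Timed out waiting for snapshot %s" % snapshot)
```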
I ran curator snapshot in debug mode.
There are two lines where it shows it is retrying the request.
Any ideas why it is doing a PUT on the same snapshot again? In contrast, the curl command works fine.
|
Which version of Curator are you using right now? |
The portions you are showing only reveal methods within the elasticsearch-py and urllib3 modules. Note the 2nd column, which contains the name of the module, while the 3rd column shows the method being called, and on what line number. Only the very bottom method is a Curator call.

Look at what's happening: the first request came back with a 504.
But that 504 is probably indicating the same disconnect you saw with your curl call, in that wait_for_completion returned before the snapshot had actually finished. And because of the 504, the client thinks the request failed, hence the retries and the failure messages. But these are not from Curator; they come from the urllib3 calls in the elasticsearch-py module. It never goes back to Curator until it has retried 3 times and failed each time. Each failed retry within the elasticsearch module results in a call to the failure logger.

My suspicions at this point are two:
|
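For context on those retries: in elasticsearch-py the retry behaviour lives in the transport layer and is configurable when the client is built. The values below are illustrative assumptions (the default of 3 retries matches what the debug log shows), not settings Curator applies:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(
    ["http://localhost:9200"],  # placeholder host
    timeout=600,                # client-side read timeout, in seconds
    max_retries=3,              # how many times the transport re-issues a request
    retry_on_timeout=True,      # also retry when a request times out
)
```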
I think you are right. Both curl and urllib3 are not waiting for the snapshot. The culprit could be the AWS load balancer, which sits in front of the Elasticsearch non-data nodes in my cluster. It has an idle timeout of 60 seconds by default. I'll try changing it to a higher value and see if that solves the problem. I'll try on Tuesday, as the Easter holidays have started, and let you know. Happy Easter! PS: curator version is the latest one. I did a pip install on the 27th. |
Happy Easter to you as well! I'm glad we may have pinned this down. I've looked into the elasticsearch-py module API docs and cannot see a way to set up a keep-alive to work around this. I don't know if you'd be amenable to a special port, firewalled off and all that, that isn't behind the ELB, just for snapshots, but that's another possibility. You could also set up a dedicated wrapper that uses the regular Curator command line to do snapshots, but with a much longer client timeout. |
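A slightly different take on that wrapper idea: rather than only raising the timeout, verify after the fact whether the snapshot actually completed before failing the calling job. The curator arguments, repository, and host below are placeholders, not a confirmed command line:

```python
import subprocess
from elasticsearch import Elasticsearch

# Placeholder curator invocation -- substitute your real arguments.
rc = subprocess.call(["curator", "snapshot", "--repository", "my_s3_repo"])

if rc != 0:
    # The client connection may have died even though the snapshot succeeded,
    # so check the repository before propagating the failure.
    es = Elasticsearch(["http://localhost:9200"])  # placeholder host
    snaps = es.snapshot.get(repository="my_s3_repo", snapshot="_all")["snapshots"]
    if snaps and snaps[-1]["state"] == "SUCCESS":
        rc = 0  # the snapshot made it despite the client-side error

raise SystemExit(rc)
```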
Any update on this? |
Working fine now. Thanks. |
Thanks for the update. I'm going to add this scenario as an FAQ to the documentation. |
Hi there, I'm encountering the same problem. I use:
Here is the full log I get with --debug, almost identical to pkr1234's logs:
When I list all backups, the snapshot shows as SUCCESS:
The full backup process duration indicates 82 seconds, but the backup size is barely 1 MB. When I delete all the files in my AWS bucket and start the backup, I can see that the files are uploaded within a few seconds, so I don't really get why it waits so long to complete. I disabled my firewall, but that did not help either.
@pkr1234 could you share how you solved your problem? Was it caused by your AWS load balancer? I use HAProxy as a load balancer with keepalived in front of it; I don't know if one of them could be the culprit. Thanks for the help :) |
Hi, the AWS load balancer has a default idle timeout of 60 seconds. It was just a case of editing the load balancer settings and changing that to a higher value. Mind you, this timeout was not affecting the actual snapshot; it is just that the client connection was being terminated, and there was no way of knowing whether the snapshot succeeded or not. PS: I have not looked at your detailed output. |
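For anyone hitting the same thing, the classic ELB idle timeout can also be raised programmatically; this boto3 call is a sketch of the fix described above (the load balancer name and timeout value are placeholders for your own):

```python
import boto3

elb = boto3.client("elb")  # classic load balancer API
elb.modify_load_balancer_attributes(
    LoadBalancerName="my-es-elb",  # placeholder name
    LoadBalancerAttributes={"ConnectionSettings": {"IdleTimeout": 600}},
)
```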
Yes, everything is working now :) |
I run a job daily which uses curator (in that order) to:
From time to time I see (2) fail with a "SnapshotMissingException", but the snapshot does exist (I can see metadata-curator-[timestamp] and snapshot-curator-[timestamp]) and it does get deleted. My script terminates because the return code is non-zero.
Here's the output log: