
[DOC] Curator fails with 'missing snapshot' intermittently (solved) #333

Closed
pkr1234 opened this issue Mar 30, 2015 · 22 comments · Fixed by #363

Comments

@pkr1234

pkr1234 commented Mar 30, 2015

I run a daily job which uses curator to (in this order):

1. Take a snapshot of all indices
curator --timeout $CONNECT_TIMEOUT --host $ELASTICSEARCHSERVER snapshot --repository $REPONAME --request_timeout $SNAPSHOT_TIMEOUT indices --all-indices

2. Delete snapshots older than 45 days
curator --timeout $CONNECT_TIMEOUT --host $ELASTICSEARCHSERVER delete snapshots --older-than $SNAPSHOTRETENTION --time-unit days --repository $REPONAME

3. Close indices older than 15 days
curator --timeout $CONNECT_TIMEOUT --host $ELASTICSEARCHSERVER close indices --older-than $CLOSERETENTION --time-unit days --timestring '%Y.%m.%d' --prefix "logstash-"

4. Delete indices older than 30 days (prefix "logstash-")
curator --timeout $CONNECT_TIMEOUT --host $ELASTICSEARCHSERVER delete indices --older-than $LOGRETENTION --time-unit days --timestring '%Y.%m.%d' --prefix "logstash-"

(Snapshots are stored on S3.)

From time to time I see step (2) fail with a "SnapshotMissingException", but the snapshot does exist (I can see metadata-curator-[timestamp] and snapshot-curator-[timestamp]) and it does get deleted. My script terminates because the return code is non-zero.

Here's the output log:

Now deleting snapshots ..
13:18:34            2015-03-30 13:18:34,228 INFO      Job starting...
13:20:01            Traceback (most recent call last):
13:20:01              File "/usr/bin/curator", line 9, in <module>
13:20:01                load_entry_point('elasticsearch-curator==3.0.2', 'console_scripts', 'curator')()
13:20:01              File "/usr/lib/python2.6/site-packages/curator/curator.py", line 5, in main
13:20:01                cli( obj={ "filters": [] } )
13:20:01              File "/usr/lib/python2.6/site-packages/click/core.py", line 610, in __call__
13:20:01                return self.main(*args, **kwargs)
13:20:01              File "/usr/lib/python2.6/site-packages/click/core.py", line 590, in main
13:20:01                rv = self.invoke(ctx)
13:20:01              File "/usr/lib/python2.6/site-packages/click/core.py", line 936, in invoke
13:20:01                return _process_result(sub_ctx.command.invoke(sub_ctx))
13:20:01              File "/usr/lib/python2.6/site-packages/click/core.py", line 936, in invoke
13:20:01                return _process_result(sub_ctx.command.invoke(sub_ctx))
13:20:01              File "/usr/lib/python2.6/site-packages/click/core.py", line 782, in invoke
13:20:01                return ctx.invoke(self.callback, **ctx.params)
13:20:01              File "/usr/lib/python2.6/site-packages/click/core.py", line 416, in invoke
13:20:01                return callback(*args, **kwargs)
13:20:01              File "/usr/lib/python2.6/site-packages/curator/cli/snapshot_selection.py", line 91, in snapshots
13:20:01                retval = delete_snapshot(client, snapshot=snap, repository=repository)
13:20:01              File "/usr/lib/python2.6/site-packages/curator/api/snapshot.py", line 87, in delete_snapshot
13:20:01                client.snapshot.delete(repository=repository, snapshot=snapshot)
13:20:01              File "/usr/lib/python2.6/site-packages/elasticsearch/client/utils.py", line 68, in _wrapped
13:20:01                return func(*args, params=params, **kwargs)
13:20:01              File "/usr/lib/python2.6/site-packages/elasticsearch/client/snapshot.py", line 40, in delete
13:20:01                _make_path('_snapshot', repository, snapshot), params=params)
13:20:01              File "/usr/lib/python2.6/site-packages/elasticsearch/transport.py", line 301, in perform_request
13:20:01                status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)
13:20:01              File "/usr/lib/python2.6/site-packages/elasticsearch/connection/http_urllib3.py", line 82, in perform_request
13:20:01                self._raise_error(response.status, raw_data)
13:20:01              File "/usr/lib/python2.6/site-packages/elasticsearch/connection/base.py", line 102, in _raise_error
13:20:01                raise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info)
13:20:01            elasticsearch.exceptions.NotFoundError: TransportError(404, u'RemoteTransportException[[Lemuel Dorcas][inet[/10.11.102.53:9300]][cluster/snapshot/delete]]; nested: SnapshotMissingException[[my_s3_repository:curator-20150213001500] is missing]; nested: FileNotFoundException[The specified key does not exist. (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request ID: 5531B70DBC6BD25B)]; ')
13:20:01            Unable to delete snapshots ..
13:20:01            Result: 1
@untergeek
Member

Thanks for submitting this.

First, you mention that you're trying to delete metadata-curator-[timestamp] and snapshot-curator-[timestamp] and it deletes these. However, the name in the error is curator-20150213001500. Was this correct? If so, is that snapshot supposed to be deleted?

Regardless, I have racked my brain to try to guess how this scenario could happen. The explanations I can guess at are as follows.

Duplicate snapshots in the working list.

This theory falls apart because of https://github.com/elastic/curator/blob/master/curator/cli/snapshot_selection.py#L81
Before any work is done, the list is turned into a set (a set contains only one of anything), then back into a list again. This guarantees a unique list.
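To illustrate the de-duplication step (a trivial sketch, not the actual Curator code):

snapshots = ['curator-20150213001500', 'curator-20150213001500', 'curator-20150214001500']
snapshots = list(set(snapshots))   # duplicates collapse to a single entry
print(sorted(snapshots))           # ['curator-20150213001500', 'curator-20150214001500']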

Individual nodes are not connected to the repository properly.

The error, (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request ID: 5531B70DBC6BD25B), indicates that the Elasticsearch S3 plugin can't find the file.

While this is concerning, it's not something Curator itself has any control over. It makes a call via the Elasticsearch python module, and if that call cannot be completed due to a remote exception, there's really nothing Curator can do. Curator can neither create this scenario, nor correct it.

I can try to add better error handling to catch this sort of failure, but that will only be cosmetic for this scenario.

Ideas to troubleshoot/fix this include updating the S3 plugin and/or your Elasticsearch version.

Snapshot is in a bad state

The snapshot in the error, curator-20150213001500, may be incomplete, or corrupted in some way. It shows up in the snapshot metadata, but the file has been deleted. This could also happen if it were incompletely deleted in a previous attempt (i.e. the files were deleted, but the metadata was not, so it appeared that the snapshot was still there), or the snapshot was malformed when created.

Cluster issues

This may indicate something amiss with your cluster. One node may be able to see something another cannot.

Conclusion

In any case, since this is not caused by having the snapshot name in the list twice, it is outside the scope of Curator. It is possible I've missed something, but these are my best guesses as to how this state could have happened.

@pkr1234
Author

pkr1234 commented Mar 31, 2015

Thanks for the detailed response. There are no cluster issues, duplicate snapshots, or per-node connection problems to the repository. All instances have the same instance profile. I can see the metadata-curator and snapshot-curator files fine.

I can also retrieve the JSON directly from Elasticsearch, which shows that the snapshot was taken fine. Here's another variant of the problem:

Unable to take snapshots : transport error (whatever that means)

% Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
11:36:49                                             Dload  Upload   Total   Spent    Left  Speed
11:36:49            
11:36:49              0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0179   179  179   179    0     0   3881      0 --:--:-- --:--:-- --:--:--  7782  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
11:36:49                                             Dload  Upload   Total   Spent    Left  Speed
11:36:49            
11:36:49            Now taking snapshots ..
11:36:49              0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0179   179  179   179    0     0   3969      0 --:--:-- --:--:-- --:--:--  77822015-03-31 11:36:49,675 INFO      Job starting...
11:36:49            2015-03-31 11:36:49,771 INFO      Matching all indices. Ignoring flags other than --exclude.
11:36:54            2015-03-31 11:36:54,279 INFO      Snapshot name: curator-20150331103654
11:40:56            2015-03-31 11:40:56,125 ERROR     Client raised a TransportError.
11:40:56            Unable to take snapshots.
11:40:56            Result: 1
11:40:56            Failed: NonZeroResultCode: Result code was 1
11:40:56            Execution failed: 43236: [Workflow step failures: {1=Dispatch failed on 1 nodes: [rndvmhsvr01: NonZeroResultCode: Result code was 1]}, Node failures: {xxxxxx01=[NonZeroResultCode: Result code was 1]}]

However, the snapshot was taken fine. I can see snapshot-curator-20150331103654 and metadata-20150331103654.

I also retrieved the state of the snapshot using curl:

 curl -s -XGET "http://myserver.lala.com:9200/_snapshot/my_repository/curator-20150331103654?pretty"
{
  "snapshots" : [ {
    "snapshot" : "curator-20150331103654",
    "indices" : [ "kibana-int", "logstash-2015.01.28", "logstash-2015.01.29", "logstash-2015.01.30", "logstash-2015.02.02", "logstash-2015.02.03", "logstash-2015.02.04", "logstash-2015.02.05", "logstash-2015.02.06", "logstash-2015.02.07", "logstash-2015.02.08", "logstash-2015.02.09", "logstash-2015.02.10", "logstash-2015.02.11", "logstash-2015.02.12", "logstash-2015.02.13", "logstash-2015.02.14", "logstash-2015.02.15", "logstash-2015.02.16", "logstash-2015.02.17", "logstash-2015.02.18", "logstash-2015.02.19", "logstash-2015.02.20", "logstash-2015.02.21", "logstash-2015.02.22", "logstash-2015.02.23", "logstash-2015.02.24", "logstash-2015.02.25", "logstash-2015.02.26", "logstash-2015.02.27", "logstash-2015.02.28", "logstash-2015.03.01", "logstash-2015.03.02", "logstash-2015.03.03", "logstash-2015.03.04", "logstash-2015.03.05", "logstash-2015.03.06", "logstash-2015.03.07", "logstash-2015.03.08", "logstash-2015.03.09", "logstash-2015.03.10", "logstash-2015.03.11", "logstash-2015.03.12", "logstash-2015.03.13", "logstash-2015.03.14", "logstash-2015.03.15", "logstash-2015.03.16", "logstash-2015.03.17", "logstash-2015.03.18", "logstash-2015.03.19", "logstash-2015.03.20", "logstash-2015.03.21", "logstash-2015.03.22", "logstash-2015.03.23", "logstash-2015.03.24", "logstash-2015.03.25", "logstash-2015.03.26", "logstash-2015.03.27", "logstash-2015.03.28", "logstash-2015.03.29", "logstash-2015.03.30", "logstash-2015.03.31" ],
    "state" : "SUCCESS",
    "start_time" : "2015-03-31T10:36:57.683Z",
    "start_time_in_millis" : 1427798217683,
    "end_time" : "2015-03-31T10:44:34.495Z",
    "end_time_in_millis" : 1427798674495,
    "duration_in_millis" : 456812,
    "failures" : [ ],
    "shards" : {
      "total" : 310,
      "failed" : 0,
      "successful" : 310
    }
  } ]
}

But curator is raising a transport error, which fails my script.

@pkr1234
Author

pkr1234 commented Mar 31, 2015

Also to answer your query:

First, you mention that you're trying to delete metadata-curator-[timestamp] and snapshot-curator-[timestamp] and it deletes these. However, the name in the error is curator-20150213001500. Was this correct? If so, is that snapshot supposed to be deleted?

I don't use any prefixes, just whatever defaults curator uses. All my snapshot files are named like snapshot-curator-yyyymmddhhmmss.

The metadata files are named like metadata-curator-yyyymmddhhmmss.

@untergeek
Copy link
Member

I'm sorry you're having a bad experience with Curator; it's my goal to fix that. From what I can tell, though, the errors indicate that this can only be indirectly Curator's fault.

Flow

Let's look at the flow again:

  • Curator creates a client connection to Elasticsearch using the elasticsearch-py module
  • The command to delete a snapshot is issued to Elasticsearch using elasticsearch-py calls
  • Elasticsearch, via the Amazon S3 plugin, issues the call to Amazon
  • The response you're getting is coming from Amazon, back to Elasticsearch, back to elasticsearch-py, up the chain to Curator again.

What this means is that the error:

elasticsearch.exceptions.NotFoundError: TransportError(404, u'RemoteTransportException[[Lemuel Dorcas][inet[/10.11.102.53:9300]][cluster/snapshot/delete]]; nested: SnapshotMissingException[[my_s3_repository:curator-20150213001500] is missing]; nested: FileNotFoundException[The specified key does not exist. (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request ID: 5531B70DBC6BD25B)]; ')

...is, at its root, coming from Elasticsearch. You should actually be able to find a corresponding error in your Elasticsearch log files (in fact, I'd love to see them inline in a response here). Neither Curator nor the elasticsearch-py module has any ability to connect to S3 itself; that is handled by Elasticsearch and the S3 plugin. This is why I was suggesting that an update on that end might be a potential fix. Can you tell me which version of the S3 plugin you are using with your Elasticsearch, and which Elasticsearch version?

Possible causes

  • The calls to delete are happening too quickly in succession
    This is, at best, a guess. Elasticsearch is, indeed, unable to find the file; there could be no 404 otherwise. Amazon is sending the 404 (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request ID: 5531B70DBC6BD25B), complete with a request ID. Why Amazon would do this, I can only speculate. My guess is that it's an incorrect 404 because too many calls are hitting at once and it's not able to find the file quickly enough, so the request actually results in a timeout of some kind on their end. That's why you can still find the snapshot later with a curl call (when nothing is hammering the server).
  • There is some kind of background retry logic happening in the S3 plugin
    This ties in to the "too many requests in rapid succession" case above. The call to delete may go through, but Amazon (or Elasticsearch) issues its own retry. As you noted, sometimes the file gets deleted anyway. That could be the case here: we're getting a 404 because the file did get deleted, and a retry is bouncing back.

Proposed resolutions

On the simple side, add a --delay option to make Curator wait between each delete. To my thinking, though, this is too passive.

Add retry and verification logic. This is more involved: I can add some code to double-check that a snapshot is present, and attempt to perform the action again. This might warrant "test for snapshot, attempt delete, and retry if it fails" logic, as sketched below.
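A minimal sketch of that kind of logic, using the elasticsearch-py client directly (the function name and the retry/delay values here are illustrative, not actual Curator code):

import time
from elasticsearch.exceptions import NotFoundError

def delete_snapshot_with_retry(client, repository, snapshot, retries=3, delay=10):
    """Test for the snapshot, attempt the delete, and retry on failure."""
    for _ in range(retries):
        try:
            # Verify the snapshot is still listed in the repository.
            client.snapshot.get(repository=repository, snapshot=snapshot)
        except NotFoundError:
            return True  # already gone; nothing left to delete
        try:
            client.snapshot.delete(repository=repository, snapshot=snapshot)
            return True
        except NotFoundError:
            # A 404 bounced back even though the snapshot was listed; wait and retry.
            time.sleep(delay)
    return False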

@pkr1234
Author

pkr1234 commented Apr 1, 2015

Hi,

This is the log activity for snapshot creation. It looks like a bug, as curator seems to be retrying for some reason and bombing out. Elasticsearch complains that a snapshot with the same name is already running.

[2015-04-01 00:16:07,494][WARN ][snapshots                ] [Lemuel Dorcas] [my_backup_s3_repository][curator-20150331231504] failed to create snapshot
org.elasticsearch.snapshots.ConcurrentSnapshotExecutionException: [my_backup_s3_repository:curator-20150331231504] a snapshot is already running
        at org.elasticsearch.snapshots.SnapshotsService$1.execute(SnapshotsService.java:183)
        at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:328)
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:153)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
[2015-04-01 00:16:07,542][WARN ][snapshots                ] [Lemuel Dorcas] [my_backup_s3_repository][curator-20150331231504] failed to create snapshot
org.elasticsearch.snapshots.ConcurrentSnapshotExecutionException: [my_backup_s3_repository:curator-20150331231504] a snapshot is already running
        at org.elasticsearch.snapshots.SnapshotsService$1.execute(SnapshotsService.java:183)
        at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:328)
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:153)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
[2015-04-01 00:16:07,568][WARN ][snapshots                ] [Lemuel Dorcas] [my_backup_s3_repository][curator-20150331231504] failed to create snapshot
org.elasticsearch.snapshots.ConcurrentSnapshotExecutionException: [my_backup_s3_repository:curator-20150331231504] a snapshot is already running
        at org.elasticsearch.snapshots.SnapshotsService$1.execute(SnapshotsService.java:183)
        at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:328)
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:153)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
[2015-04-01 00:16:16,966][INFO ][snapshots                ] [Lemuel Dorcas] snapshot [my_backup_s3_repository:curator-20150331231504] is done
[2015-04-01 01:00:01,711][INFO ][cluster.metadata         ] [Lemuel Dorcas] [logstash-2015.04.01] creating index, cause [auto(bulk api)], shards [5]/[1], mappings [_default_]

I also found this on Google, which describes a somewhat similar problem:
http://grokbase.com/t/gg/elasticsearch/149p3bhs0x/1-2-2-snapshot-throws-error-but-still-completes

I ran "grep xception *" on my logs and found many such occurrences.


_LogCentral.ElasticSearch_.log.2015-03-12:org.elasticsearch.snapshots.ConcurrentSnapshotExecutionException: [my_backup_s3_repository:curator-20150312084624] a snapshot is already running
_LogCentral.ElasticSearch_.log.2015-03-12:org.elasticsearch.snapshots.InvalidSnapshotNameException: [my_backup_s3_repository:curator-20150312084624] Invalid snapshot name [curator-20150312084624], snapshot with such name already exists
_LogCentral.ElasticSearch_.log.2015-03-13:org.elasticsearch.snapshots.ConcurrentSnapshotExecutionException: [my_backup_s3_repository:curator-20150313001500] a snapshot is already running
_LogCentral.ElasticSearch_.log.2015-03-13:org.elasticsearch.snapshots.InvalidSnapshotNameException: [my_backup_s3_repository:curator-20150313001500] Invalid snapshot name [curator-20150313001500], snapshot with such name already exists
_LogCentral.ElasticSearch_.log.2015-03-14:org.elasticsearch.snapshots.ConcurrentSnapshotExecutionException: [my_backup_s3_repository:curator-20150314001500] a snapshot is already running
_LogCentral.ElasticSearch_.log.2015-03-14:org.elasticsearch.snapshots.InvalidSnapshotNameException: [my_backup_s3_repository:curator-20150314001500] Invalid snapshot name [curator-20150314001500], snapshot with such name already exists
_LogCentral.ElasticSearch_.log.2015-03-16:org.elasticsearch.snapshots.ConcurrentSnapshotExecutionException: [my_backup_s3_repository:curator-20150316001500] a snapshot is already running
_LogCentral.ElasticSearch_.log.2015-03-16:org.elasticsearch.snapshots.InvalidSnapshotNameException: [my_backup_s3_repository:curator-20150316001500] Invalid snapshot name [curator-20150316001500], snapshot with such name already exists
_LogCentral.ElasticSearch_.log.2015-03-16:org.elasticsearch.snapshots.ConcurrentSnapshotExecutionException: [my_backup_s3_repository:curator-20150316084051] a snapshot is already running
_LogCentral.ElasticSearch_.log.2015-03-16:org.elasticsearch.snapshots.InvalidSnapshotNameException: [my_backup_s3_repository:curator-20150316084051] Invalid snapshot name [curator-20150316084051], snapshot with such name already exists
_LogCentral.ElasticSearch_.log.2015-03-17:org.elasticsearch.snapshots.ConcurrentSnapshotExecutionException: [my_backup_s3_repository:curator-20150317001500] a snapshot is already running
_LogCentral.ElasticSearch_.log.2015-03-17:org.elasticsearch.snapshots.InvalidSnapshotNameException: [my_backup_s3_repository:curator-20150317001500] Invalid snapshot name [curator-20150317001500], snapshot with such name already exists
_LogCentral.ElasticSearch_.log.2015-03-18:org.elasticsearch.snapshots.ConcurrentSnapshotExecutionException: [my_backup_s3_repository:curator-20150318001500] a snapshot is already running
_LogCentral.ElasticSearch_.log.2015-03-18:org.elasticsearch.snapshots.InvalidSnapshotNameException: [my_backup_s3_repository:curator-20150318001500] Invalid snapshot name [curator-20150318001500], snapshot with such name already exists
_LogCentral.ElasticSearch_.log.2015-03-19:org.elasticsearch.snapshots.ConcurrentSnapshotExecutionException: [my_backup_s3_repository:curator-20150319001500] a snapshot is already running
_LogCentral.ElasticSearch_.log.2015-03-19:org.elasticsearch.snapshots.InvalidSnapshotNameException: [my_backup_s3_repository:curator-20150319001500] Invalid snapshot name [curator-20150319001500], snapshot with such name already exists
_LogCentral.ElasticSearch_.log.2015-03-20:org.elasticsearch.snapshots.ConcurrentSnapshotExecutionException: [my_backup_s3_repository:curator-20150320001500] a snapshot is already running
_LogCentral.ElasticSearch_.log.2015-03-20:org.elasticsearch.snapshots.InvalidSnapshotNameException: [my_backup_s3_repository:curator-20150320001500] Invalid snapshot name [curator-20150320001500], snapshot with such name already exists
_LogCentral.ElasticSearch_.log.2015-03-21:org.elasticsearch.snapshots.ConcurrentSnapshotExecutionException: [my_backup_s3_repository:curator-20150321001500] a snapshot is already running
_LogCentral.ElasticSearch_.log.2015-03-21:org.elasticsearch.snapshots.InvalidSnapshotNameException: [my_backup_s3_repository:curator-20150321001500] Invalid snapshot name [curator-20150321001500], snapshot with such name already exists
_LogCentral.ElasticSearch_.log.2015-03-27:org.elasticsearch.snapshots.ConcurrentSnapshotExecutionException: [my_backup_s3_repository:curator-20150327001500] a snapshot is already running
_LogCentral.ElasticSearch_.log.2015-03-27:org.elasticsearch.snapshots.InvalidSnapshotNameException: [my_backup_s3_repository:curator-20150327001500] Invalid snapshot name [curator-20150327001500], snapshot with such name already exists
_LogCentral.ElasticSearch_.log.2015-03-27:org.elasticsearch.snapshots.ConcurrentSnapshotExecutionException: [my_backup_s3_repository:curator-20150327102947] a snapshot is already running
_LogCentral.ElasticSearch_.log.2015-03-27:org.elasticsearch.snapshots.InvalidSnapshotNameException: [my_backup_s3_repository:curator-20150327102947] Invalid snapshot name [curator-20150327102947], snapshot with such name already exists

I don't see any errors related to the AWS API calls. Note the log entry "snapshot [my_backup_s3_repository:curator-20150331231504] is done": the snapshot completes fine, but it seems curator is resubmitting the request while it is running and receiving an exception. That's my guess.

@pkr1234
Author

pkr1234 commented Apr 1, 2015

Regarding the delete, here are the logs for comparison:

Execution log


0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0179   179  179   179    0     0   1282      0 --:--:-- --:--:-- --:--:--  7458  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
00:15:00                                             Dload  Upload   Total   Spent    Left  Speed
00:15:00            
00:15:00              0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0179   179  179   179    0     0   3920      0 --:--:-- --:--:-- --:--:--  74582015-03-27 00:15:00,864 INFO      Job starting...
00:15:00            2015-03-27 00:15:00,913 INFO      Capturing snapshots of specified indices...
00:15:00            2015-03-27 00:15:00,914 INFO      Snapshot will capture all indices
00:15:00            2015-03-27 00:15:00,914 INFO      Snapshot name: curator-20150327001500
00:16:35            2015-03-27 00:16:35,727 ERROR     Client raised a TransportError.  Error: TransportError(400, u'RemoteTransportException[[Lemuel Dorcas][inet[/10.11.102.53:9300]][cluster/snapshot/create]]; nested: InvalidSnapshotNameException[[my_backup_s3_repository:curator-20150327001500] Invalid snapshot name [curator-20150327001500], snapshot with such name already exists]; ')
00:16:35            2015-03-27 00:16:35,728 INFO      Snapshots captured for specified indices.
00:16:35            2015-03-27 00:16:35,729 INFO      Done in 0:01:34.890266.
00:16:36            2015-03-27 00:16:36,080 INFO      Job starting...
00:16:36            2015-03-27 00:16:36,127 INFO      Deleting specified snapshots...
00:16:37            2015-03-27 00:16:37,993 INFO      curator-20150211001500 is within the threshold period (45 days).
00:16:37            2015-03-27 00:16:37,994 INFO      curator-20150212001500 is within the threshold period (45 days).
00:16:37            2015-03-27 00:16:37,995 INFO      curator-20150213001500 is within the threshold period (45 days).
00:16:37            2015-03-27 00:16:37,996 INFO      curator-20150214001500 is within the threshold period (45 days).
00:16:37            2015-03-27 00:16:37,997 INFO      curator-20150215001505 is within the threshold period (45 days).
00:16:37            2015-03-27 00:16:37,997 INFO      curator-20150216001500 is within the threshold period (45 days).
00:16:37            2015-03-27 00:16:37,998 INFO      curator-20150217001500 is within the threshold period (45 days).
00:16:37            2015-03-27 00:16:37,999 INFO      curator-20150218001500 is within the threshold period (45 days).
00:16:38            2015-03-27 00:16:38,000 INFO      curator-20150219001500 is within the threshold period (45 days).
00:16:38            2015-03-27 00:16:38,000 INFO      curator-20150220001500 is within the threshold period (45 days).
00:16:38            2015-03-27 00:16:38,001 INFO      curator-20150221001500 is within the threshold period (45 days).
00:16:38            2015-03-27 00:16:38,002 INFO      curator-20150222001500 is within the threshold period (45 days).
00:16:38            2015-03-27 00:16:38,003 INFO      curator-20150223001500 is within the threshold period (45 days).
00:16:38            2015-03-27 00:16:38,003 INFO      curator-20150224001500 is within the threshold period (45 days).
00:16:38            2015-03-27 00:16:38,004 INFO      curator-20150225001500 is within the threshold period (45 days).
00:16:38            2015-03-27 00:16:38,005 INFO      curator-20150226001500 is within the threshold period (45 days).
00:16:38            2015-03-27 00:16:38,006 INFO      curator-20150227001502 is within the threshold period (45 days).
00:16:38            2015-03-27 00:16:38,006 INFO      curator-20150228001500 is within the threshold period (45 days).
00:16:38            2015-03-27 00:16:38,007 INFO      curator-20150301001500 is within the threshold period (45 days).
00:16:38            2015-03-27 00:16:38,007 INFO      curator-20150302001500 is within the threshold period (45 days).
00:16:38            2015-03-27 00:16:38,007 INFO      curator-20150303001500 is within the threshold period (45 days).
00:16:38            2015-03-27 00:16:38,008 INFO      curator-20150304001505 is within the threshold period (45 days).
00:16:38            2015-03-27 00:16:38,008 INFO      curator-20150305001500 is within the threshold period (45 days).
00:16:38            2015-03-27 00:16:38,008 INFO      curator-20150306001500 is within the threshold period (45 days).
00:16:38            2015-03-27 00:16:38,008 INFO      curator-20150307001504 is within the threshold period (45 days).
00:16:38            2015-03-27 00:16:38,008 INFO      curator-20150309001500 is within the threshold period (45 days).
00:16:38            2015-03-27 00:16:38,009 INFO      curator-20150310093704 is within the threshold period (45 days).
00:16:38            2015-03-27 00:16:38,009 INFO      curator-20150310123006 is within the threshold period (45 days).
00:16:38            2015-03-27 00:16:38,009 INFO      curator-20150311001501 is within the threshold period (45 days).
00:16:38            2015-03-27 00:16:38,009 INFO      curator-20150311001601 is within the threshold period (45 days).
00:16:38            2015-03-27 00:16:38,010 INFO      curator-20150311001900 is within the threshold period (45 days).
00:16:38            2015-03-27 00:16:38,010 INFO      curator-20150312001501 is within the threshold period (45 days).
00:16:38            2015-03-27 00:16:38,016 INFO      curator-20150312084624 is within the threshold period (45 days).
00:16:38            2015-03-27 00:16:38,016 INFO      curator-20150313001500 is within the threshold period (45 days).
00:16:38            2015-03-27 00:16:38,016 INFO      curator-20150314001500 is within the threshold period (45 days).
00:16:38            2015-03-27 00:16:38,016 INFO      curator-20150316001500 is within the threshold period (45 days).
00:16:38            2015-03-27 00:16:38,017 INFO      curator-20150316084051 is within the threshold period (45 days).
00:16:38            2015-03-27 00:16:38,017 INFO      curator-20150317001500 is within the threshold period (45 days).
00:16:38            2015-03-27 00:16:38,017 INFO      curator-20150318001500 is within the threshold period (45 days).
00:16:38            2015-03-27 00:16:38,017 INFO      curator-20150319001500 is within the threshold period (45 days).
00:16:38            2015-03-27 00:16:38,017 INFO      curator-20150320001500 is within the threshold period (45 days).
00:16:38            2015-03-27 00:16:38,018 INFO      curator-20150321001500 is within the threshold period (45 days).
00:16:38            2015-03-27 00:16:38,018 INFO      curator-20150322001501 is within the threshold period (45 days).
00:16:38            2015-03-27 00:16:38,018 INFO      curator-20150323001500 is within the threshold period (45 days).
00:16:38            2015-03-27 00:16:38,018 INFO      curator-20150324001509 is within the threshold period (45 days).
00:16:38            2015-03-27 00:16:38,019 INFO      curator-20150325001500 is within the threshold period (45 days).
00:16:38            2015-03-27 00:16:38,019 INFO      curator-20150325084815 is within the threshold period (45 days).
00:16:38            2015-03-27 00:16:38,019 INFO      curator-20150326001500 is within the threshold period (45 days).
00:16:38            2015-03-27 00:16:38,019 INFO      curator-20150326085926 is within the threshold period (45 days).
00:16:38            2015-03-27 00:16:38,019 INFO      curator-20150327001500 is within the threshold period (45 days).
00:17:45            Traceback (most recent call last):
00:17:45              File "/usr/bin/curator", line 9, in <module>
00:17:45                load_entry_point('elasticsearch-curator==2.1.2', 'console_scripts', 'curator')()
00:17:45              File "/usr/lib/python2.6/site-packages/curator/curator_script.py", line 383, in main
00:17:45                arguments.func(client, **argdict)
00:17:45              File "/usr/lib/python2.6/site-packages/curator/curator.py", line 1216, in snapshot
00:17:45                _op_loop(client, matching_snapshots, op=delete_snapshot, dry_run=dry_run, **kwargs)
00:17:45              File "/usr/lib/python2.6/site-packages/curator/curator.py", line 779, in _op_loop
00:17:45                skipped = op(client, item, **kwargs)
00:17:45              File "/usr/lib/python2.6/site-packages/curator/curator.py", line 749, in delete_snapshot
00:17:45                client.snapshot.delete(repository=repository, snapshot=snap)
00:17:45              File "/usr/lib/python2.6/site-packages/elasticsearch/client/utils.py", line 68, in _wrapped
00:17:45                return func(*args, params=params, **kwargs)
00:17:45              File "/usr/lib/python2.6/site-packages/elasticsearch/client/snapshot.py", line 40, in delete
00:17:45                _make_path('_snapshot', repository, snapshot), params=params)
00:17:45              File "/usr/lib/python2.6/site-packages/elasticsearch/transport.py", line 301, in perform_request
00:17:45                status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)
00:17:45              File "/usr/lib/python2.6/site-packages/elasticsearch/connection/http_urllib3.py", line 82, in perform_request
00:17:45                self._raise_error(response.status, raw_data)
00:17:45              File "/usr/lib/python2.6/site-packages/elasticsearch/connection/base.py", line 102, in _raise_error
00:17:45                raise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info)
00:17:45            elasticsearch.exceptions.NotFoundError: TransportError(404, u'RemoteTransportException[[Lemuel Dorcas][inet[/10.11.102.53:9300]][cluster/snapshot/delete]]; nested: SnapshotMissingException[[my_backup_s3_repository:curator-20150210001500] is missing]; nested: FileNotFoundException[The specified key does not exist. (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request ID: 9904EBCF27CFB8FF)]; ')
00:17:45            Unable to purge snapshots.
00:17:45            Result: 1
00:17:45            Failed: NonZeroResultCode: Result code was 1

Elasticsearch log

[2015-03-27 00:16:37,460][WARN ][snapshots                ] [Lemuel Dorcas] [my_backup_s3_repository][curator-20150327001500] failed to create snapshot
org.elasticsearch.snapshots.ConcurrentSnapshotExecutionException: [my_backup_s3_repository:curator-20150327001500] a snapshot is already running
        at org.elasticsearch.snapshots.SnapshotsService$1.execute(SnapshotsService.java:183)
        at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:328)
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:153)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
[2015-03-27 00:16:37,605][WARN ][snapshots                ] [Lemuel Dorcas] failed to create snapshot [my_backup_s3_repository:curator-20150327001500]
org.elasticsearch.snapshots.InvalidSnapshotNameException: [my_backup_s3_repository:curator-20150327001500] Invalid snapshot name [curator-20150327001500], snapshot with such name already exists
        at org.elasticsearch.repositories.blobstore.BlobStoreRepository.initializeSnapshot(BlobStoreRepository.java:225)
        at org.elasticsearch.snapshots.SnapshotsService.beginSnapshot(SnapshotsService.java:276)
        at org.elasticsearch.snapshots.SnapshotsService.access$600(SnapshotsService.java:88)
        at org.elasticsearch.snapshots.SnapshotsService$1$1.run(SnapshotsService.java:202)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
[2015-03-27 03:06:51,186][INFO ][cluster.metadata         ] [Lemuel Dorcas] [logstash-2015.03.27] update_mapping [cloudtraillog] (dynamic)
[2015-03-27 03:06:51,241][INFO ][cluster.metadata         ] [Lemuel Dorcas] [logstash-2015.03.27] update_mapping [cloudtraillog] (dynamic)

You can see there are no AWS-specific exceptions. It's important to mention that I was running an older version of curator on the 27th, which was mainly complaining intermittently about "delete" (transport error).

With the latest version of curator, the problem seems to have shifted to snapshot creation ("transport error").

In both cases, the error logged in Elasticsearch is the same: "a snapshot is already running". So it seems curator is retrying for some reason and bombing out.

@untergeek
Member

It's good to know that the AWS plugin isn't encountering an error at the ES level. That means that Elasticsearch is getting a legitimate 404 from Amazon. Why Amazon is giving a 404 is still not clear.

There is no retry within Curator. It only tries once. See https://github.com/elastic/curator/blob/master/curator/api/snapshot.py#L60-L64

That means the retry is probably inside Elasticsearch and/or the AWS plugin.

  • What version of Elasticsearch are you running?
  • What version of the AWS plugin are you running?

@untergeek
Member

If you're using Curator 3.0.2 or 3.0.3, you shouldn't be experiencing #307 (fixed in #308). That was one of the issues where snapshots appeared to "retry" because the index list was getting segmented.

@pkr1234
Copy link
Author

pkr1234 commented Apr 2, 2015

Elasticsearch 1.3.2
AWS plugin 2.3.0

I've tried taking a snapshot with curl against Elasticsearch, which works fine, but wait_for_completion behaves differently. See below:

 curl -m 3600 -XPUT "myserver.lala.com:9200/_snapshot/my_backup_s3_repository/adhocsnapshot_2?wait_for_completion=true"

This command completes after 40-50 sec and $? is 0. I then check the snapshot status.

 curl -m 3600 -XGET "myserver.lala.com:9200/_snapshot/my_backup_s3_repository/adhocsnapshot_2?pretty"

This shows the snapshot status as "STARTED" and not SUCCESS. So wait_for_completion does not wait until the snapshot status is successful.

I then quickly ran another snapshot

 curl -m 3600 -XPUT "myserver.lala.com:9200/_snapshot/my_backup_s3_repository/adhocsnapshot_3?wait_for_completion=true"

This failed with "a snapshot is already running"

I then waited a couple more minutes and ran the command again. It succeeded.

I'm afraid I'll have to stop using curator and take snapshots with curl commands instead, which works. I believe something in curator is retrying.

@pkr1234
Author

pkr1234 commented Apr 2, 2015

I ran curator snapshot in debug mode.

2015-04-02 13:17:37,925 DEBUG         urllib3.util.retry               from_int:155  Converted retries value: False -> Retry(total=False, connect=None, read=None, redirect=0)
2015-04-02 13:18:37,702 DEBUG     urllib3.connectionpool          _make_request:368  "PUT /_snapshot/my_backup_s3_repository/curator-20150402121735?wait_for_completion=true HTTP/1.1" 504 0
2015-04-02 13:18:37,703 WARNING            elasticsearch       log_request_fail:81   PUT /_snapshot/my_backup_s3_repository/curator-20150402121735?wait_for_completion=true [status:504 request:59.777s]
2015-04-02 13:18:37,703 DEBUG              elasticsearch       log_request_fail:89   > {"indices": "kibana-int,logstash-2015.03.11,logstash-2015.03.12,logstash-2015.03.13,logstash-2015.03.14,logstash-2015.03.15,logstash-2015.03.16,logstash-2015.03.17,logstash-2015.03.18,logstash-2015.03.19,logstash-2015.03.20,logstash-2015.03.21,logstash-2015.03.22,logstash-2015.03.23,logstash-2015.03.24,logstash-2015.03.25,logstash-2015.03.26,logstash-2015.03.27,logstash-2015.03.28,logstash-2015.03.29,logstash-2015.03.30,logstash-2015.03.31,logstash-2015.04.01,logstash-2015.04.02", "partial": false, "ignore_unavailable": false, "include_global_state": true}
2015-04-02 13:18:37,704 DEBUG         urllib3.util.retry               from_int:155  Converted retries value: False -> Retry(total=False, connect=None, read=None, redirect=0)
2015-04-02 13:18:37,704 INFO      urllib3.connectionpool              _get_conn:223  Resetting dropped connection: logcentral.test.costcutter.com
2015-04-02 13:18:37,789 DEBUG     urllib3.connectionpool          _make_request:368  "PUT /_snapshot/my_backup_s3_repository/curator-20150402121735?wait_for_completion=true HTTP/1.1" 503 252
2015-04-02 13:18:37,790 WARNING            elasticsearch       log_request_fail:81   PUT /_snapshot/my_backup_s3_repository/curator-20150402121735?wait_for_completion=true [status:503 request:0.086s]
2015-04-02 13:18:37,790 DEBUG              elasticsearch       log_request_fail:89   > {"indices": "kibana-int,logstash-2015.03.11,logstash-2015.03.12,logstash-2015.03.13,logstash-2015.03.14,logstash-2015.03.15,logstash-2015.03.16,logstash-2015.03.17,logstash-2015.03.18,logstash-2015.03.19,logstash-2015.03.20,logstash-2015.03.21,logstash-2015.03.22,logstash-2015.03.23,logstash-2015.03.24,logstash-2015.03.25,logstash-2015.03.26,logstash-2015.03.27,logstash-2015.03.28,logstash-2015.03.29,logstash-2015.03.30,logstash-2015.03.31,logstash-2015.04.01,logstash-2015.04.02", "partial": false, "ignore_unavailable": false, "include_global_state": true}
2015-04-02 13:18:37,791 DEBUG         urllib3.util.retry               from_int:155  Converted retries value: False -> Retry(total=False, connect=None, read=None, redirect=0)
2015-04-02 13:18:37,832 DEBUG     urllib3.connectionpool          _make_request:368  "PUT /_snapshot/my_backup_s3_repository/curator-20150402121735?wait_for_completion=true HTTP/1.1" 503 252
2015-04-02 13:18:37,833 WARNING            elasticsearch       log_request_fail:81   PUT /_snapshot/my_backup_s3_repository/curator-20150402121735?wait_for_completion=true [status:503 request:0.042s]
2015-04-02 13:18:37,833 DEBUG              elasticsearch       log_request_fail:89   > {"indices": "kibana-int,logstash-2015.03.11,logstash-2015.03.12,logstash-2015.03.13,logstash-2015.03.14,logstash-2015.03.15,logstash-2015.03.16,logstash-2015.03.17,logstash-2015.03.18,logstash-2015.03.19,logstash-2015.03.20,logstash-2015.03.21,logstash-2015.03.22,logstash-2015.03.23,logstash-2015.03.24,logstash-2015.03.25,logstash-2015.03.26,logstash-2015.03.27,logstash-2015.03.28,logstash-2015.03.29,logstash-2015.03.30,logstash-2015.03.31,logstash-2015.04.01,logstash-2015.04.02", "partial": false, "ignore_unavailable": false, "include_global_state": true}
2015-04-02 13:18:37,834 DEBUG         urllib3.util.retry               from_int:155  Converted retries value: False -> Retry(total=False, connect=None, read=None, redirect=0)
2015-04-02 13:18:37,876 DEBUG     urllib3.connectionpool          _make_request:368  "PUT /_snapshot/my_backup_s3_repository/curator-20150402121735?wait_for_completion=true HTTP/1.1" 503 252
2015-04-02 13:18:37,877 WARNING            elasticsearch       log_request_fail:81   PUT /_snapshot/my_backup_s3_repository/curator-20150402121735?wait_for_completion=true [status:503 request:0.043s]
2015-04-02 13:18:37,877 DEBUG              elasticsearch       log_request_fail:89   > {"indices": "kibana-int,logstash-2015.03.11,logstash-2015.03.12,logstash-2015.03.13,logstash-2015.03.14,logstash-2015.03.15,logstash-2015.03.16,logstash-2015.03.17,logstash-2015.03.18,logstash-2015.03.19,logstash-2015.03.20,logstash-2015.03.21,logstash-2015.03.22,logstash-2015.03.23,logstash-2015.03.24,logstash-2015.03.25,logstash-2015.03.26,logstash-2015.03.27,logstash-2015.03.28,logstash-2015.03.29,logstash-2015.03.30,logstash-2015.03.31,logstash-2015.04.01,logstash-2015.04.02", "partial": false, "ignore_unavailable": false, "include_global_state": true}
2015-04-02 13:18:37,878 ERROR       curator.api.snapshot        create_snapshot:66   Client raised a TransportError.

There are two lines where it shows it is retrying the request.

2015-04-02 13:18:37,833 WARNING            elasticsearch       log_request_fail:81   PUT /_snapshot/my_backup_s3_repository/curator-20150402121735?wait_for_completion=true [status:503 request:0.042s]
...
...
2015-04-02 13:18:37,877 WARNING            elasticsearch       log_request_fail:81   PUT /_snapshot/my_backup_s3_repository/curator-20150402121735?wait_for_completion=true [status:503 request:0.043s]

Any ideas why it is doing a PUT on the same snapshot again?

In contrast the curl command works fine.

 curl -m 3600 -XPUT "myserver.lala.com:9200/_snapshot/my_backup_s3_repository/adhocsnapshot_3?wait_for_completion=true"
 echo $?
0

@untergeek
Member

Which version of Curator are you using right now?

@untergeek
Member

The portions you are showing only reveal methods within the elasticsearch-py and urllib3 modules. Note the second column, which contains the name of the module, while the third column shows the method being called and on what line number. Only the very bottom entry is a curator.api.snapshot call.

Look at what's happening. The first _make_request line shows a 504 error, which is a gateway timeout:

A server (not necessarily a Web server) is acting as a gateway or proxy to fulfil the request by the client to access the requested URL. This server did not receive a timely response from an upstream server it accessed to deal with your HTTP request.

This usually means that the upstream server is down (no response to the gateway/proxy)

But that 504 is probably indicating the same disconnect you saw with your curl call, in that wait_for_completion did not seem to wait the full time. It seems that the urllib3 code the elasticsearch library uses is just as confused by the odd timeout. In both cases, the snapshot has been started and is still going, but wait_for_completion is not waiting.

And because of the 504, it thinks the request failed, hence the retries and the failure messages; but these are not from Curator, they come from the urllib3 calls in the elasticsearch-py module. Control never returns to Curator until it has retried 3 times and failed each time. Each failed retry within the elasticsearch module results in a call to the log_request_fail method. This confirms what I have repeatedly stated: Curator itself is not retrying; something farther up the chain is.
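For reference, those retries come from transport-level settings on the elasticsearch-py client (assumed here to be the defaults, roughly max_retries=3 with 503/504 treated as retryable). Curator builds its own client, so the following is only an illustration of where that behavior lives, not a Curator option:

from elasticsearch import Elasticsearch

# Illustrative only: these kwargs are passed through to the transport layer,
# which is where the retries seen in the debug log originate.
client = Elasticsearch(
    hosts=['myserver.lala.com:9200'],  # placeholder host
    timeout=3600,                      # long request timeout for big snapshots
    max_retries=0,                     # disable transport-level retries
    retry_on_timeout=False,
)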

My suspicions at this point are two:

  1. There is some transport layer (gateway or proxy) between you and S3 that is perhaps causing some of these issues. For example, an aggressive firewall rule that disallows HTTP keepalive connections, or those longer than 60 seconds, might do something like this.
  2. Since you have an older version of Elasticsearch (1.3.2), and correspondingly of the S3 plugin, there's a bug in your version of the Elasticsearch code that is signaling an improper failure, so the elasticsearch Python library is instigating a retry, but that retry fails. I did not see your exact issue in the changelogs from Elasticsearch 1.3.3 up to 1.3.9, but there were many snapshot-related fixes merged in those versions. Updating may solve the problem, and certainly wouldn't hurt if it didn't.

@pkr1234
Author

pkr1234 commented Apr 2, 2015

I think you are right. Neither curl nor urllib3 is waiting for the snapshot. The culprit could be the AWS load balancer that sits in front of the Elasticsearch non-data nodes in my cluster; it has an idle timeout of 60 seconds by default. I'll try changing it to a higher value and see if that solves the problem.

I'll try on Tuesday, as the Easter holidays have started, and let you know. Happy Easter!

PS: curator version is the latest one. I did a pip install on the 27th.
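For reference, raising the idle timeout on a Classic ELB can be done through the AWS API; a rough sketch using boto3 (the load balancer name, region, and timeout value are placeholders):

import boto3

# Placeholder names and values; adjust to your environment.
elb = boto3.client('elb', region_name='eu-west-1')
elb.modify_load_balancer_attributes(
    LoadBalancerName='my-elasticsearch-elb',
    LoadBalancerAttributes={
        'ConnectionSettings': {'IdleTimeout': 600},  # seconds; the default is 60
    },
)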

@untergeek
Member

Happy Easter to you as well! I'm glad we may have pinned this down. I've looked into the elasticsearch-py module API docs and cannot see a way to set up a keep-alive to work around this. I don't know if you'd be amenable to a special port, firewalled off and all that, that isn't behind the ELB, just for snapshots, but that's another possibility.

You could also set up a dedicated wrapper that uses the regular Curator command line to do snapshots, but with --wait_for_completion set to False. The other part would be periodic calls to the Elasticsearch API, polling until the snapshot is complete, which would let the wrapper exit successfully once the snapshot has finished. A rough sketch of the polling piece follows.
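Something along these lines, using the elasticsearch-py client (the host, repository name, and polling interval are placeholders):

import time
from elasticsearch import Elasticsearch

def wait_for_snapshot(host, repository, snapshot, poll_interval=30):
    """Poll the snapshot API until the named snapshot reaches a terminal state."""
    client = Elasticsearch(hosts=[host], timeout=60)
    while True:
        resp = client.snapshot.get(repository=repository, snapshot=snapshot)
        state = resp['snapshots'][0]['state']
        if state == 'SUCCESS':
            return True
        if state in ('FAILED', 'PARTIAL'):
            return False  # terminal failure states
        time.sleep(poll_interval)  # still STARTED/IN_PROGRESS; keep waiting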

@untergeek
Member

Any update on this?

@pkr1234
Author

pkr1234 commented Apr 15, 2015

Working fine now. Thanks.

@untergeek changed the title from "Curator fails with 'missing snapshot' intermittently" to "[DOC] Curator fails with 'missing snapshot' intermittently (solved)" on Apr 15, 2015
@untergeek
Member

Thanks for the update. I'm going to add this scenario as an FAQ to the documentation.

@trompx

trompx commented Nov 6, 2015

Hi there,

I'm encountering the same problem. I'm using:

  • elasticsearch : 1.7.1
  • curator : 3.4.0 (same result with 3.2.3)
  • aws plugin : 2.7.1 (latest, and the version advised for ES 1.7.1)

Here is the full log I get with --debug, quite similar to pkr1234's logs:

Nov  6 11:36:31 atest docker/286fde13c77e[911]: Check if repository exists
Nov  6 11:36:32 atest docker/286fde13c77e[911]: Logs backup indices newer than 10 days at 2015-11-06_11-36-31
Nov  6 11:36:32 atest docker/286fde13c77e[911]: 2015-11-06 11:36:31,905 DEBUG          curator.cli.utils        filter_callback:189  REGEX = ^logs-.*$
Nov  6 11:36:32 atest docker/286fde13c77e[911]: 2015-11-06 11:36:31,905 DEBUG          curator.cli.utils        filter_callback:192  Added filter: {'pattern': '^logs-.*$'}
Nov  6 11:36:32 atest docker/286fde13c77e[911]: 2015-11-06 11:36:31,905 DEBUG          curator.cli.utils        filter_callback:193  New list of filters: [{'pattern': '^logs-.*$'}]
Nov  6 11:36:32 atest docker/286fde13c77e[911]: 2015-11-06 11:36:31,905 DEBUG         curator.api.filter         get_date_regex:158  regex = \d{4}\.\d{2}\.\d{2}
Nov  6 11:36:32 atest docker/286fde13c77e[911]: 2015-11-06 11:36:31,905 DEBUG          curator.cli.utils        filter_callback:189  REGEX = (?P<date>\d{4}\.\d{2}\.\d{2})
Nov  6 11:36:32 atest docker/286fde13c77e[911]: 2015-11-06 11:36:31,906 DEBUG          curator.cli.utils        filter_callback:192  Added filter: {'pattern': '(?P<date>\\d{4}\\.\\d{2}\\.\\d{2})', 'value': 10, 'groupname': 'date', 'time_unit': 'days', 'timestring': '%Y.%m.%d', 'method': 'newer_than'}
Nov  6 11:36:32 atest docker/286fde13c77e[911]: 2015-11-06 11:36:31,906 DEBUG          curator.cli.utils        filter_callback:193  New list of filters: [{'pattern': '^logs-.*$'}, {'pattern': '(?P<date>\\d{4}\\.\\d{2}\\.\\d{2})', 'value': 10, 'groupname': 'date', 'time_unit': 'days', 'timestring': '%Y.%m.%d', 'method': 'newer_than'}]
Nov  6 11:36:32 atest docker/286fde13c77e[911]: 2015-11-06 11:36:31,906 INFO      curator.cli.index_selection                indices:57   Job starting: snapshot indices
Nov  6 11:36:32 atest docker/286fde13c77e[911]: 2015-11-06 11:36:31,906 DEBUG     curator.cli.index_selection                indices:60   Params: {'url_prefix': '', 'http_auth': None, 'dry_run': False, 'certificate': None, 'loglevel': 'INFO', 'logformat': 'default', 'quiet': False, 'host': '192.168.33.105', 'timeout': 21600, 'debug': True, 'use_ssl': False, 'logfile': '/elasticsearch_curator.log', 'master_only': False, 'port': 9200, 'ssl_no_validate': False}
Nov  6 11:36:32 atest docker/286fde13c77e[911]: 2015-11-06 11:36:31,906 DEBUG          curator.cli.utils             get_client:112  kwargs = {'url_prefix': '', 'http_auth': None, 'dry_run': False, 'certificate': None, 'loglevel': 'INFO', 'quiet': False, 'debug': True, 'logformat': 'default', 'timeout': 21600, 'host': '192.168.33.105', 'use_ssl': False, 'logfile': '/elasticsearch_curator.log', 'master_only': False, 'port': 9200, 'ssl_no_validate': False}
Nov  6 11:36:32 atest docker/286fde13c77e[911]: 2015-11-06 11:36:31,906 DEBUG         urllib3.util.retry               from_int:155  Converted retries value: False -> Retry(total=False, connect=None, read=None, redirect=0)
Nov  6 11:36:32 atest docker/286fde13c77e[911]: 2015-11-06 11:36:31,906 INFO      urllib3.connectionpool              _new_conn:207  Starting new HTTP connection (1): 192.168.33.105
Nov  6 11:36:32 atest docker/286fde13c77e[911]: 2015-11-06 11:36:31,914 DEBUG     urllib3.connectionpool          _make_request:387  "GET / HTTP/1.1" 200 343
Nov  6 11:36:32 atest docker/286fde13c77e[911]: 2015-11-06 11:36:31,915 INFO               elasticsearch    log_request_success:63   GET http://192.168.33.105:9200/ [status:200 request:0.009s]
Nov  6 11:36:32 atest docker/286fde13c77e[911]: 2015-11-06 11:36:31,915 DEBUG              elasticsearch    log_request_success:65   > None
Nov  6 11:36:32 atest docker/286fde13c77e[911]: 2015-11-06 11:36:31,915 DEBUG              elasticsearch    log_request_success:66   < {
Nov  6 11:36:32 atest docker/286fde13c77e[911]:   "status" : 200,
Nov  6 11:36:32 atest docker/286fde13c77e[911]:   "name" : "atest.com",
Nov  6 11:36:32 atest docker/286fde13c77e[911]:   "cluster_name" : "elasticsearch",
Nov  6 11:36:32 atest docker/286fde13c77e[911]:   "version" : {
Nov  6 11:36:32 atest docker/286fde13c77e[911]:     "number" : "1.7.1",
Nov  6 11:36:32 atest docker/286fde13c77e[911]:     "build_hash" : "b88f43fc40b0bcd7f173a1f9ee2e97816de80b19",
Nov  6 11:36:32 atest docker/286fde13c77e[911]:     "build_timestamp" : "2015-07-29T09:54:16Z",
Nov  6 11:36:32 atest docker/286fde13c77e[911]:     "build_snapshot" : false,
Nov  6 11:36:32 atest docker/286fde13c77e[911]:     "lucene_version" : "4.10.4"
Nov  6 11:36:32 atest docker/286fde13c77e[911]:   },
Nov  6 11:36:32 atest docker/286fde13c77e[911]:   "tagline" : "You Know, for Search"
Nov  6 11:36:32 atest docker/286fde13c77e[911]: }
Nov  6 11:36:32 atest docker/286fde13c77e[911]:
Nov  6 11:36:32 atest docker/286fde13c77e[911]: 2015-11-06 11:36:31,918 DEBUG          curator.cli.utils          check_version:90   Detected Elasticsearch version 1.7.1
Nov  6 11:36:32 atest docker/286fde13c77e[911]: 2015-11-06 11:36:31,918 DEBUG         urllib3.util.retry               from_int:155  Converted retries value: False -> Retry(total=False, connect=None, read=None, redirect=0)
Nov  6 11:36:32 atest docker/286fde13c77e[911]: 2015-11-06 11:36:31,920 DEBUG     urllib3.connectionpool          _make_request:387  "GET /_all/_settings?expand_wildcards=open%2Cclosed HTTP/1.1" 200 4651
Nov  6 11:36:32 atest docker/286fde13c77e[911]: 2015-11-06 11:36:31,920 INFO               elasticsearch    log_request_success:63   GET http://192.168.33.105:9200/_all/_settings?expand_wildcards=open%2Cclosed [status:200 request:0.002s]
Nov  6 11:36:32 atest docker/286fde13c77e[911]: 2015-11-06 11:36:31,920 DEBUG              elasticsearch    log_request_success:65   > None
Nov  6 11:36:32 atest docker/286fde13c77e[911]: 2015-11-06 11:36:31,920 DEBUG              elasticsearch    log_request_success:66   < {"logs-nginx-access-2015.10.30":{"settings":{"index":{"creation_date":"1446209966103","number_of_shards":"5","uuid":"94KEjOorQT6v3qOxCneDXA","version":{"created":"1070199"},"number_of_replicas":"1"}}}}
Nov  6 11:36:32 atest docker/286fde13c77e[911]: 2015-11-06 11:36:31,922 DEBUG          curator.api.utils            get_indices:28   All indices: [u'logs-nginx-access-2015.10.30']
Nov  6 11:36:32 atest docker/286fde13c77e[911]: 2015-11-06 11:36:31,926 DEBUG     curator.cli.index_selection                indices:76   Full list of indices: [u'logs-nginx-access-2015.10.30']
Nov  6 11:36:32 atest docker/286fde13c77e[911]: 2015-11-06 11:36:31,926 DEBUG     curator.cli.index_selection                indices:98   All filters: [{'pattern': '^logs-.*$'}, {'pattern': '(?P<date>\\d{4}\\.\\d{2}\\.\\d{2})', 'value': 10, 'groupname': 'date', 'time_unit': 'days', 'timestring': '%Y.%m.%d', 'method': 'newer_than'}]
Nov  6 11:36:32 atest docker/286fde13c77e[911]: 2015-11-06 11:36:31,926 DEBUG     curator.cli.index_selection                indices:103  Filter: {'pattern': '^logs-.*$'}
Nov  6 11:36:32 atest docker/286fde13c77e[911]: 2015-11-06 11:36:31,926 DEBUG     curator.cli.index_selection                indices:103  Filter: {'pattern': '(?P<date>\\d{4}\\.\\d{2}\\.\\d{2})', 'value': 10, 'groupname': 'date', 'time_unit': 'days', 'timestring': '%Y.%m.%d', 'method': 'newer_than'}
Nov  6 11:36:32 atest docker/286fde13c77e[911]: 2015-11-06 11:36:31,929 DEBUG         curator.api.filter        timestamp_check:301  Timestamp "2015.10.28" is outside the cutoff period (newer than 10 days).
Nov  6 11:36:32 atest docker/286fde13c77e[911]: 2015-11-06 11:36:31,929 DEBUG         curator.api.filter        timestamp_check:301  Timestamp "2015.10.27" is outside the cutoff period (newer than 10 days).
Nov  6 11:36:32 atest docker/286fde13c77e[911]: 2015-11-06 11:36:31,934 DEBUG         curator.api.filter        timestamp_check:301  Timestamp "2015.10.28" is outside the cutoff period (newer than 10 days).
Nov  6 11:36:32 atest docker/286fde13c77e[911]: 2015-11-06 11:36:31,934 DEBUG         curator.api.filter        timestamp_check:301  Timestamp "2015.10.28" is outside the cutoff period (newer than 10 days).
Nov  6 11:36:32 atest docker/286fde13c77e[911]: 2015-11-06 11:36:31,934 INFO      curator.cli.index_selection                indices:144  Action snapshot will be performed on the following indices: [u'logs-nginx-access-2015.10.30']
Nov  6 11:36:32 atest docker/286fde13c77e[911]: 2015-11-06 11:36:31,934 DEBUG         urllib3.util.retry               from_int:155  Converted retries value: False -> Retry(total=False, connect=None, read=None, redirect=0)
Nov  6 11:36:32 atest docker/286fde13c77e[911]: 2015-11-06 11:36:31,938 DEBUG     urllib3.connectionpool          _make_request:387  "GET / HTTP/1.1" 200 343
Nov  6 11:36:32 atest docker/286fde13c77e[911]: 2015-11-06 11:36:31,938 INFO               elasticsearch    log_request_success:63   GET http://192.168.33.105:9200/ [status:200 request:0.004s]
Nov  6 11:36:32 atest docker/286fde13c77e[911]: 2015-11-06 11:36:31,938 DEBUG              elasticsearch    log_request_success:65   > None
Nov  6 11:36:32 atest docker/286fde13c77e[911]: 2015-11-06 11:36:31,938 DEBUG              elasticsearch    log_request_success:66   < {
Nov  6 11:36:32 atest docker/286fde13c77e[911]:   "status" : 200,
Nov  6 11:36:32 atest docker/286fde13c77e[911]:   "name" : "atest.com",
Nov  6 11:36:32 atest docker/286fde13c77e[911]:   "cluster_name" : "elasticsearch",
Nov  6 11:36:32 atest docker/286fde13c77e[911]:   "version" : {
Nov  6 11:36:32 atest docker/286fde13c77e[911]:     "number" : "1.7.1",
Nov  6 11:36:32 atest docker/286fde13c77e[911]:     "build_hash" : "b88f43fc40b0bcd7f173a1f9ee2e97816de80b19",
Nov  6 11:36:32 atest docker/286fde13c77e[911]:     "build_timestamp" : "2015-07-29T09:54:16Z",
Nov  6 11:36:32 atest docker/286fde13c77e[911]:     "build_snapshot" : false,
Nov  6 11:36:32 atest docker/286fde13c77e[911]:     "lucene_version" : "4.10.4"
Nov  6 11:36:32 atest docker/286fde13c77e[911]:   },
Nov  6 11:36:32 atest docker/286fde13c77e[911]:   "tagline" : "You Know, for Search"
Nov  6 11:36:32 atest docker/286fde13c77e[911]: }
Nov  6 11:36:32 atest docker/286fde13c77e[911]:
Nov  6 11:36:32 atest docker/286fde13c77e[911]: 2015-11-06 11:36:31,939 DEBUG         urllib3.util.retry               from_int:155  Converted retries value: False -> Retry(total=False, connect=None, read=None, redirect=0)
Nov  6 11:36:32 atest docker/286fde13c77e[911]: 2015-11-06 11:36:31,947 DEBUG     urllib3.connectionpool          _make_request:387  "GET /_snapshot/_status HTTP/1.1" 200 16
Nov  6 11:36:32 atest docker/286fde13c77e[911]: 2015-11-06 11:36:31,947 INFO               elasticsearch    log_request_success:63   GET http://192.168.33.105:9200/_snapshot/_status [status:200 request:0.008s]
Nov  6 11:36:32 atest docker/286fde13c77e[911]: 2015-11-06 11:36:31,947 DEBUG              elasticsearch    log_request_success:65   > None
Nov  6 11:36:32 atest docker/286fde13c77e[911]: 2015-11-06 11:36:31,947 DEBUG              elasticsearch    log_request_success:66   < {"snapshots":[]}
Nov  6 11:36:32 atest docker/286fde13c77e[911]: 2015-11-06 11:36:31,948 DEBUG         urllib3.util.retry               from_int:155  Converted retries value: False -> Retry(total=False, connect=None, read=None, redirect=0)
Nov  6 11:36:32 atest docker/286fde13c77e[911]: 2015-11-06 11:36:31,951 DEBUG     urllib3.connectionpool          _make_request:387  "GET / HTTP/1.1" 200 343
Nov  6 11:36:32 atest docker/286fde13c77e[911]: 2015-11-06 11:36:31,951 INFO               elasticsearch    log_request_success:63   GET http://192.168.33.105:9200/ [status:200 request:0.003s]
Nov  6 11:36:32 atest docker/286fde13c77e[911]: 2015-11-06 11:36:31,951 DEBUG              elasticsearch    log_request_success:65   > None
Nov  6 11:36:32 atest docker/286fde13c77e[911]: 2015-11-06 11:36:31,951 DEBUG              elasticsearch    log_request_success:66   < {
Nov  6 11:36:32 atest docker/286fde13c77e[911]:   "status" : 200,
Nov  6 11:36:32 atest docker/286fde13c77e[911]:   "name" : "atest.com",
Nov  6 11:36:32 atest docker/286fde13c77e[911]:   "cluster_name" : "elasticsearch",
Nov  6 11:36:32 atest docker/286fde13c77e[911]:   "version" : {
Nov  6 11:36:32 atest docker/286fde13c77e[911]:     "number" : "1.7.1",
Nov  6 11:36:32 atest docker/286fde13c77e[911]:     "build_hash" : "b88f43fc40b0bcd7f173a1f9ee2e97816de80b19",
Nov  6 11:36:32 atest docker/286fde13c77e[911]:     "build_timestamp" : "2015-07-29T09:54:16Z",
Nov  6 11:36:32 atest docker/286fde13c77e[911]:     "build_snapshot" : false,
Nov  6 11:36:32 atest docker/286fde13c77e[911]:     "lucene_version" : "4.10.4"
Nov  6 11:36:32 atest docker/286fde13c77e[911]:   },
Nov  6 11:36:32 atest docker/286fde13c77e[911]:   "tagline" : "You Know, for Search"
Nov  6 11:36:32 atest docker/286fde13c77e[911]: }
Nov  6 11:36:32 atest docker/286fde13c77e[911]:
Nov  6 11:36:32 atest docker/286fde13c77e[911]: 2015-11-06 11:36:31,952 DEBUG         urllib3.util.retry               from_int:155  Converted retries value: False -> Retry(total=False, connect=None, read=None, redirect=0)
Nov  6 11:36:32 atest docker/286fde13c77e[911]: 2015-11-06 11:36:31,973 DEBUG     urllib3.connectionpool          _make_request:387  "GET /_cat/indices/logs-nginx-access-2015.10.30?h=status&format=json HTTP/1.1" 200 19
Nov  6 11:36:32 atest docker/286fde13c77e[911]: 2015-11-06 11:36:31,973 INFO               elasticsearch    log_request_success:63   GET http://192.168.33.105:9200/_cat/indices/logs-nginx-access-2015.10.30?h=status&format=json [status:200 request:0.021s]
Nov  6 11:36:32 atest docker/286fde13c77e[911]: 2015-11-06 11:36:31,973 DEBUG              elasticsearch    log_request_success:65   > None
Nov  6 11:36:32 atest docker/286fde13c77e[911]: 2015-11-06 11:36:31,973 DEBUG              elasticsearch    log_request_success:66   < [{"status":"open"}]
Nov  6 11:36:32 atest docker/286fde13c77e[911]: 2015-11-06 11:36:31,973 DEBUG          curator.api.utils   prune_open_or_closed:328  Including indexlogs-nginx-access-2015.10.30: Opened.
Nov  6 11:36:32 atest docker/286fde13c77e[911]: 2015-11-06 11:36:31,973 DEBUG         urllib3.util.retry               from_int:155  Converted retries value: False -> Retry(total=False, connect=None, read=None, redirect=0)
Nov  6 11:36:32 atest docker/286fde13c77e[911]: 2015-11-06 11:36:31,975 DEBUG     urllib3.connectionpool          _make_request:387  "GET / HTTP/1.1" 200 343
Nov  6 11:36:32 atest docker/286fde13c77e[911]: 2015-11-06 11:36:31,975 INFO               elasticsearch    log_request_success:63   GET http://192.168.33.105:9200/ [status:200 request:0.002s]
Nov  6 11:36:32 atest docker/286fde13c77e[911]: 2015-11-06 11:36:31,975 DEBUG              elasticsearch    log_request_success:65   > None
Nov  6 11:36:32 atest docker/286fde13c77e[911]: 2015-11-06 11:36:31,975 DEBUG              elasticsearch    log_request_success:66   < {
Nov  6 11:36:32 atest docker/286fde13c77e[911]:   "status" : 200,
Nov  6 11:36:32 atest docker/286fde13c77e[911]:   "name" : "atest.com",
Nov  6 11:36:32 atest docker/286fde13c77e[911]:   "cluster_name" : "elasticsearch",
Nov  6 11:36:32 atest docker/286fde13c77e[911]:   "version" : {
Nov  6 11:36:32 atest docker/286fde13c77e[911]:     "number" : "1.7.1",
Nov  6 11:36:32 atest docker/286fde13c77e[911]:     "build_hash" : "b88f43fc40b0bcd7f173a1f9ee2e97816de80b19",
Nov  6 11:36:32 atest docker/286fde13c77e[911]:     "build_timestamp" : "2015-07-29T09:54:16Z",
Nov  6 11:36:32 atest docker/286fde13c77e[911]:     "build_snapshot" : false,
Nov  6 11:36:32 atest docker/286fde13c77e[911]:     "lucene_version" : "4.10.4"
Nov  6 11:36:32 atest docker/286fde13c77e[911]:   },
Nov  6 11:36:32 atest docker/286fde13c77e[911]:   "tagline" : "You Know, for Search"
Nov  6 11:36:32 atest docker/286fde13c77e[911]: }
Nov  6 11:36:32 atest docker/286fde13c77e[911]:
Nov  6 11:36:34 atest docker/286fde13c77e[911]: 2015-11-06 11:36:33,983 DEBUG     urllib3.connectionpool          _make_request:387  "POST /_snapshot/atest_backups/_verify HTTP/1.1" 200 58
Nov  6 11:36:34 atest docker/286fde13c77e[911]: 2015-11-06 11:36:33,984 INFO               elasticsearch    log_request_success:63   POST http://192.168.33.105:9200/_snapshot/atest_backups/_verify [status:200 request:0.962s]
Nov  6 11:36:34 atest docker/286fde13c77e[911]: 2015-11-06 11:36:33,984 DEBUG              elasticsearch    log_request_success:65   > None
Nov  6 11:36:34 atest docker/286fde13c77e[911]: 2015-11-06 11:36:33,984 DEBUG              elasticsearch    log_request_success:66   < {"nodes":{"U87Xhq2TTvyhaR9fVvgLVQ":{"name":"atest.com"}}}
Nov  6 11:36:34 atest docker/286fde13c77e[911]: 2015-11-06 11:36:33,984 DEBUG       curator.api.snapshot        create_snapshot:62   Nodes with verified repository access: {u'U87Xhq2TTvyhaR9fVvgLVQ': {u'name': u'atest.com'}}
Nov  6 11:36:34 atest docker/286fde13c77e[911]: 2015-11-06 11:36:33,984 INFO        curator.api.snapshot        create_snapshot:70   Snapshot name: logs-20151106113633
Nov  6 11:36:34 atest docker/286fde13c77e[911]: 2015-11-06 11:36:33,984 DEBUG         urllib3.util.retry               from_int:155  Converted retries value: False -> Retry(total=False, connect=None, read=None, redirect=0)
Nov  6 11:36:35 atest docker/286fde13c77e[911]: 2015-11-06 11:36:35,038 DEBUG     urllib3.connectionpool          _make_request:387  "GET /_snapshot/atest_backups/_all HTTP/1.1" 200 5031
Nov  6 11:36:35 atest docker/286fde13c77e[911]: 2015-11-06 11:36:35,038 INFO               elasticsearch    log_request_success:63   GET http://192.168.33.105:9200/_snapshot/atest_backups/_all [status:200 request:1.054s]
Nov  6 11:36:35 atest docker/286fde13c77e[911]: 2015-11-06 11:36:35,038 DEBUG              elasticsearch    log_request_success:65   > None
Nov  6 11:36:35 atest docker/286fde13c77e[911]: 2015-11-06 11:36:35,039 DEBUG              elasticsearch    log_request_success:66   < {"snapshots":[
{{"snapshot":"logs-20151105175858","version_id":1070199,"version":"1.7.1","indices":["logs-nginx-access-2015.10.30"],"state":"SUCCESS","start_time":"2015-11-05T17:58:59.609Z","start_time_in_millis":1446746339609,"end_time":"2015-11-05T18:03:19.815Z","end_time_in_millis":1446746599815,"duration_in_millis":260206,"failures":[],"shards":{"total":5,"failed":0,"successful":5}}}]}
Nov  6 11:36:35 atest docker/286fde13c77e[911]: 2015-11-06 11:36:35,040 DEBUG         urllib3.util.retry               from_int:155  Converted retries value: False -> Retry(total=False, connect=None, read=None, redirect=0)
Nov  6 11:36:43 atest kernel: [  990.428805] docker0: port 12(vethf2357a6) entered forwarding state
Nov  6 11:37:37 atest docker/286fde13c77e[911]: 2015-11-06 11:37:35,043 WARNING            elasticsearch       log_request_fail:82   PUT http://192.168.33.105:9200/_snapshot/atest_backups/logs-20151106113633?wait_for_completion=true [status:N/A request:60.003s]
Nov  6 11:37:37 atest docker/286fde13c77e[911]: Traceback (most recent call last):
Nov  6 11:37:37 atest docker/286fde13c77e[911]:   File "/usr/lib/python2.7/site-packages/elasticsearch/connection/http_urllib3.py", line 78, in perform_request
Nov  6 11:37:37 atest docker/286fde13c77e[911]:     response = self.pool.urlopen(method, url, body, retries=False, headers=self.headers, **kw)
Nov  6 11:37:37 atest docker/286fde13c77e[911]:   File "/usr/lib/python2.7/site-packages/urllib3/connectionpool.py", line 609, in urlopen
Nov  6 11:37:37 atest docker/286fde13c77e[911]:     _stacktrace=sys.exc_info()[2])
Nov  6 11:37:37 atest docker/286fde13c77e[911]:   File "/usr/lib/python2.7/site-packages/urllib3/util/retry.py", line 222, in increment
Nov  6 11:37:37 atest docker/286fde13c77e[911]:     raise six.reraise(type(error), error, _stacktrace)
Nov  6 11:37:37 atest docker/286fde13c77e[911]:   File "/usr/lib/python2.7/site-packages/urllib3/connectionpool.py", line 559, in urlopen
Nov  6 11:37:37 atest docker/286fde13c77e[911]:     body=body, headers=headers)
Nov  6 11:37:37 atest docker/286fde13c77e[911]:   File "/usr/lib/python2.7/site-packages/urllib3/connectionpool.py", line 376, in _make_request
Nov  6 11:37:37 atest docker/286fde13c77e[911]:     httplib_response = conn.getresponse(buffering=True)
Nov  6 11:37:37 atest docker/286fde13c77e[911]:   File "/usr/lib/python2.7/httplib.py", line 1132, in getresponse
Nov  6 11:37:37 atest docker/286fde13c77e[911]:     response.begin()
Nov  6 11:37:37 atest docker/286fde13c77e[911]:   File "/usr/lib/python2.7/httplib.py", line 453, in begin
Nov  6 11:37:37 atest docker/286fde13c77e[911]:     version, status, reason = self._read_status()
Nov  6 11:37:37 atest docker/286fde13c77e[911]:   File "/usr/lib/python2.7/httplib.py", line 417, in _read_status
Nov  6 11:37:37 atest docker/286fde13c77e[911]:     raise BadStatusLine(line)
Nov  6 11:37:37 atest docker/286fde13c77e[911]: ProtocolError: ('Connection aborted.', BadStatusLine("''",))
Nov  6 11:37:37 atest docker/286fde13c77e[911]: 2015-11-06 11:37:37,524 DEBUG              elasticsearch       log_request_fail:90   > {"indices": "logs-nginx-access-2015.10.30", "partial": false, "ignore_unavailable": false, "include_global_state": true}
Nov  6 11:37:37 atest docker/286fde13c77e[911]: 2015-11-06 11:37:37,524 DEBUG         urllib3.util.retry               from_int:155  Converted retries value: False -> Retry(total=False, connect=None, read=None, redirect=0)
Nov  6 11:37:37 atest docker/286fde13c77e[911]: 2015-11-06 11:37:37,524 INFO      urllib3.connectionpool              _new_conn:207  Starting new HTTP connection (2): 192.168.33.105
Nov  6 11:37:57 atest docker/61f057472da8[911]: [2015-11-06 11:37:57,168][INFO ][snapshots                ] [atest.com] snapshot [atest_backups:logs-20151106113633] is done
Nov  6 11:37:57 atest docker/61f057472da8[911]: [2015-11-06 11:37:57,171][WARN ][snapshots                ] [atest.com] [atest_backups][logs-20151106113633] failed to create snapshot
Nov  6 11:37:57 atest docker/61f057472da8[911]: org.elasticsearch.snapshots.ConcurrentSnapshotExecutionException: [atest_backups:logs-20151106113633] a snapshot is already running
Nov  6 11:37:57 atest docker/61f057472da8[911]: #011at org.elasticsearch.snapshots.SnapshotsService$1.execute(SnapshotsService.java:192)
Nov  6 11:37:57 atest docker/61f057472da8[911]: #011at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:374)
Nov  6 11:37:57 atest docker/61f057472da8[911]: #011at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:196)
Nov  6 11:37:57 atest docker/61f057472da8[911]: #011at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:162)
Nov  6 11:37:57 atest docker/61f057472da8[911]: #011at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
Nov  6 11:37:57 atest docker/61f057472da8[911]: #011at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
Nov  6 11:37:57 atest docker/61f057472da8[911]: #011at java.lang.Thread.run(Thread.java:745)
Nov  6 11:37:57 atest docker/61f057472da8[911]: [2015-11-06 11:37:57,693][WARN ][snapshots                ] [atest.com] failed to create snapshot [atest_backups:logs-20151106113633]
Nov  6 11:37:57 atest docker/61f057472da8[911]: org.elasticsearch.snapshots.InvalidSnapshotNameException: [atest_backups:logs-20151106113633] Invalid snapshot name [logs-20151106113633], snapshot with such name already exists
Nov  6 11:37:57 atest docker/61f057472da8[911]: #011at org.elasticsearch.repositories.blobstore.BlobStoreRepository.initializeSnapshot(BlobStoreRepository.java:233)
Nov  6 11:37:57 atest docker/61f057472da8[911]: #011at org.elasticsearch.snapshots.SnapshotsService.beginSnapshot(SnapshotsService.java:290)
Nov  6 11:37:57 atest docker/61f057472da8[911]: #011at org.elasticsearch.snapshots.SnapshotsService.access$600(SnapshotsService.java:92)
Nov  6 11:37:57 atest docker/61f057472da8[911]: #011at org.elasticsearch.snapshots.SnapshotsService$1$1.run(SnapshotsService.java:211)
Nov  6 11:37:57 atest docker/61f057472da8[911]: #011at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
Nov  6 11:37:57 atest docker/61f057472da8[911]: #011at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
Nov  6 11:37:57 atest docker/61f057472da8[911]: #011at java.lang.Thread.run(Thread.java:745)
Nov  6 11:37:57 atest docker/286fde13c77e[911]: 2015-11-06 11:37:57,175 DEBUG     urllib3.connectionpool          _make_request:387  "PUT /_snapshot/atest_backups/logs-20151106113633?wait_for_completion=true HTTP/1.1" 503 129
Nov  6 11:37:57 atest docker/286fde13c77e[911]: 2015-11-06 11:37:57,176 WARNING            elasticsearch       log_request_fail:82   PUT /_snapshot/atest_backups/logs-20151106113633?wait_for_completion=true [status:503 request:19.652s]
Nov  6 11:37:57 atest docker/286fde13c77e[911]: 2015-11-06 11:37:57,176 DEBUG              elasticsearch       log_request_fail:90   > {"indices": "logs-nginx-access-2015.10.30", "partial": false, "ignore_unavailable": false, "include_global_state": true}
Nov  6 11:37:57 atest docker/286fde13c77e[911]: 2015-11-06 11:37:57,176 DEBUG         urllib3.util.retry               from_int:155  Converted retries value: False -> Retry(total=False, connect=None, read=None, redirect=0)
Nov  6 11:37:57 atest docker/286fde13c77e[911]: 2015-11-06 11:37:57,701 DEBUG     urllib3.connectionpool          _make_request:387  "PUT /_snapshot/atest_backups/logs-20151106113633?wait_for_completion=true HTTP/1.1" 400 175
Nov  6 11:37:57 atest docker/286fde13c77e[911]: 2015-11-06 11:37:57,701 WARNING            elasticsearch       log_request_fail:82   PUT /_snapshot/atest_backups/logs-20151106113633?wait_for_completion=true [status:400 request:0.525s]
Nov  6 11:37:57 atest docker/286fde13c77e[911]: 2015-11-06 11:37:57,701 DEBUG              elasticsearch       log_request_fail:90   > {"indices": "logs-nginx-access-2015.10.30", "partial": false, "ignore_unavailable": false, "include_global_state": true}
Nov  6 11:37:57 atest docker/286fde13c77e[911]: 2015-11-06 11:37:57,701 ERROR       curator.api.snapshot        create_snapshot:91   Client raised aTransportError.
Nov  6 11:37:57 atest docker/286fde13c77e[911]: 2015-11-06 11:37:57,701 WARNING        curator.cli.utils               exit_msg:69   Job did not complete successfully.
Nov  6 11:37:57 atest docker/286fde13c77e[911]: Logs backup indices complete in : 86 seconds.

When I list all backups, the snapshot shows as SUCCESS:

{"snapshot":"logs-20151106113633","version_id":1070199,"version":"1.7.1","indices":["logs-nginx-access-2015.10.30"],"state":"SUCCESS","start_time":"2015-11-06T11:36:35.044Z","start_time_in_millis":1446809795044,"end_time":"2015-11-06T11:37:57.237Z","end_time_in_millis":1446809877237,"duration_in_millis":82193,"failures":[],"shards":{"total":5,"failed":0,"successful":5}}

The reported duration is 82 seconds, but the backup is barely 1 MB. If I delete all the files from my AWS bucket and start the backup again, I can see the files being uploaded within a few seconds, so I don't really get why it waits so long to complete.

I disabled my firewall, but that did not help either.
Maybe it has something to do with the timeout you mentioned, since the first failed request happens after 60 seconds:

Nov  6 11:37:37 atest docker/286fde13c77e[911]: 2015-11-06 11:37:35,043 WARNING            elasticsearch       log_request_fail:82   PUT http://192.168.33.105:9200/_snapshot/atest_backups/logs-20151106113633?wait_for_completion=true [status:N/A request:60.003s]
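One way to keep any single request under a 60 second proxy idle timeout is to start the snapshot with wait_for_completion=false and poll until it finishes. A rough sketch, reusing the host, repository and index from the log above (the snapshot name logs-manual-test is a placeholder, and a real script should also handle FAILED/PARTIAL states):

# start the snapshot without holding the HTTP connection open
curl -s -XPUT 'http://192.168.33.105:9200/_snapshot/atest_backups/logs-manual-test?wait_for_completion=false' -d '{"indices": "logs-nginx-access-2015.10.30"}'
# poll until Elasticsearch reports the snapshot as finished
until curl -s 'http://192.168.33.105:9200/_snapshot/atest_backups/logs-manual-test' | grep -q '"state":"SUCCESS"'; do sleep 10; done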

@pkr1234 could you share how you solved your problem? Was it caused by your AWS load balancer?

I use haproxy as a load balancer with keepalived in front of it, and I don't know whether one of them could be the culprit.

Thanks for the help :)

@pkr1234
Author

pkr1234 commented Nov 6, 2015

Hi,

The AWS load balancer has a default idle timeout of 60 seconds. It was just a case of editing the load balancer settings and raising that value.

Mind you, this timeout was not affecting the actual snapshot. It is just that the client connection was being terminated, so there was no way of knowing whether the snapshot succeeded.
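For anyone hitting the same thing: the idle timeout can be raised from the EC2 console, or, as a rough sketch with the AWS CLI (the load balancer name my-elb is a placeholder; 600 seconds is just an example value):

aws elb modify-load-balancer-attributes --load-balancer-name my-elb \
  --load-balancer-attributes '{"ConnectionSettings":{"IdleTimeout":600}}'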

PS: I have not looked at your detailed output.

@trompx

trompx commented Nov 6, 2015

Thanks for the fast answer @pkr1234.

I just stumbled upon an issue and was investigating the haproxy timeout: #457

I think I will try a direct connection between curator and elasticsearch, without going through haproxy, as I don't feel like increasing the timeout for all connections.
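A sketch of that direct approach, assuming the Elasticsearch node from the log (192.168.33.105, default port 9200) is reachable from the machine running curator:

# point curator straight at the node instead of at haproxy
curator --host 192.168.33.105 snapshot --repository atest_backups indices --all-indices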

@untergeek
Member

Thanks for responding, @pkr1234. @trompx, the answer is right above your comment: this issue was closed by PR #363, where this FAQ was added to the official documentation.

Glad you have it sorted out.

@trompx

trompx commented Nov 6, 2015

Yes, everything is working now :)
Thank you again for the great tool, @untergeek!
