
Snapshot aborted but still in progress #5958

Closed
dipthegeezer opened this issue Apr 28, 2014 · 26 comments

@dipthegeezer commented Apr 28, 2014

Hi,

I'm running Elasticsearch 1.1.1 on CentOS 6 AWS instances with Java 7. We have a snapshot that the API call "_snapshot/<repo_name>/_all" lists as IN_PROGRESS; however, attempting to delete that snapshot doesn't work. The delete call just hangs and never returns. Checking the current status of all snapshots using "_snapshot/<repo_name>/_status" also hangs and never returns.

I then looked at the cluster state, and it shows the snapshot in an "ABORTED" state. I can't create a new snapshot right now. Any ideas how I can resolve this, and is it a bug?
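
For reference, the calls described above look roughly like this (repository and snapshot names are placeholders):

# list all snapshots in the repository (the stuck one shows up as IN_PROGRESS)
curl -XGET "http://localhost:9200/_snapshot/<repo_name>/_all"
# status of currently running snapshots (hangs in our case)
curl -XGET "http://localhost:9200/_snapshot/<repo_name>/_status"
# attempt to delete the stuck snapshot (also hangs)
curl -XDELETE "http://localhost:9200/_snapshot/<repo_name>/<snapshot_name>"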

Thanks

@imotov imotov self-assigned this Apr 28, 2014

@imotov (Member) commented Apr 28, 2014

Could you post the portion of the cluster state that contains the snapshot information?
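
One way to pull out just that portion (a sketch, assuming jq is installed; the same path is used further down in this thread):

curl -s 'http://localhost:9200/_cluster/state' | jq '.metadata.snapshots'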

@dipthegeezer (Author) commented Apr 28, 2014

Hi @imotov,

It's quite a big chunk of JSON, and I had to strip the index and shard information because it was too long for the comment section on GitHub. If you need it, I'm happy to talk offline via email etc. We have quite a large number of indices.

{ 
        "snapshots": {
            "snapshots": [
                {
                    "include_global_state": true, 
                    "indices": [
                    ], 
                    "repository": "s3_repository", 
                    "shards": [
                    ], 
                    "snapshot": "custom_snapshot_24_04_2014_17:27", 
                    "state": "ABORTED"
                }
            ]
        }
}

@imotov imotov added bug and removed v1.3.0 labels Apr 29, 2014

imotov added a commit to imotov/elasticsearch that referenced this issue May 12, 2014

Fix for hanging aborted snapshot during node shutdown
If a node is shutdown while a snapshot that runs on this node is aborted, it might cause the snapshot process to hang.

Closes elastic#5958

@imotov imotov closed this in #5966 May 12, 2014

imotov added a commit that referenced this issue May 12, 2014

Fix for hanging aborted snapshot during node shutdown
If a node is shutdown while a snapshot that runs on this node is aborted, it might cause the snapshot process to hang.

Closes #5958

@imotov imotov removed the v1.1.2 label Jun 16, 2014

@williamsandrew (Contributor) commented Jul 28, 2014

Was the solution to this problem updating to a new version? We were on 1.1.1, upgraded to 1.3.1, and are still having this problem.

@imotov (Member) commented Jul 28, 2014

@TheDude05 yes, it should be fixed in 1.1.2 and above. Could you provide more details about your problem? Was the snapshot already in the ABORTED state and still in the ABORTED state after a rolling restart, or did a snapshot get stuck in the ABORTED state while running 1.3.1? Could you describe what happened before the snapshot went into this state?

@williamsandrew (Contributor) commented Jul 28, 2014

I started a snapshot on my cluster (at the time all nodes were version 1.1.1) and found that it was taking a very long time because every node in my cluster was garbage collecting frequently. I performed a rolling restart of all of the nodes in my cluster, found that the snapshot I had started was still in a running state, and aborted it via the HTTP DELETE /_snapshot//<snapshot_name> endpoint. That snapshot then showed up as aborted in the cluster state, so I attempted to start a new one. However, I couldn't start a new snapshot and would instead get an error message indicating a snapshot was already running. Whenever I tried to delete that first snapshot again, the HTTP request would hang indefinitely. The issue did not go away after performing another rolling restart. There is also nothing interesting in the logs.

After reading the release notes and this Github issue it seemed that 1.2+ fixed some snapshot issues so I updated my cluster (one by one) to version 1.3.1. I again tried deleting the snapshot in question but am still having the same issue as before.

$ curl -s 'http://localhost:9200/_cluster/state' | jq -M '.metadata.snapshots.snapshots[] | {state: .state, snapshot: .snapshot}'
{
  "snapshot": "snapshot-1406558337",
  "state": "ABORTED"
}

The repository type is AWS/S3 and I do see that there are "snapshot" and "metadata" files in our S3 bucket for this particular snapshot.

Let me know if there is more debugging output that would be useful.

EDIT: Reworded some sentences for clarity

@imotov (Member) commented Jul 29, 2014

Unfortunately, the fix in 1.2 doesn't clean up already-stuck snapshots; it only prevents new snapshots from getting stuck in this state. So you need to perform a full cluster restart to clear the stuck snapshot, or use the cleanup utility to remove stuck snapshots.
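
Roughly, building and running the cleanup utility looks like this (a sketch assembled from the steps quoted later in this thread; archive names and paths vary by version):

git clone https://github.com/imotov/elasticsearch-snapshot-cleanup.git
cd elasticsearch-snapshot-cleanup
# adjust pom.xml so the elasticsearch/lucene versions match your cluster, then build:
mvn clean package
# untar the archive produced under target/releases, copy your cluster settings
# into its config/elasticsearch.yml, and run bin/cleanup from the untarred directory:
bin/cleanup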

@williamsandrew (Contributor) commented Jul 29, 2014

@imotov Thank you for your help. The cleanup utility worked and removed the old snapshot that was causing problems.

@peillis commented Aug 27, 2014

This issue is closed, but I'm using Elasticsearch 1.3.2 and we are still having problems with this. We have to use the cleanup utility to remove stuck snapshots. Am I missing something?

@JoeZ99 commented Aug 29, 2014

More precisely: snapshot restores hang from time to time. There is no need to try to delete anything or to shut down the node; Igor's script just has to be run periodically. The version is 1.3.2.

@imotov (Member) commented Aug 29, 2014

@JoeZ99 now I am completely confused. Which script are we talking about? If you mean https://github.com/imotov/elasticsearch-snapshot-cleanup, it shouldn't do anything for the snapshot restore process. The only way to stop a restore is by deleting the files being restored. Could you describe in more detail which repository you are using and what's going on?

@JoeZ99 commented Aug 30, 2014

We're using an S3 repository, and we're on ES version 1.3.2.

We perform about 400 snapshot restores a day. Each snapshot restore restores two indices (a call of the shape sketched below), and all 400 restores are for different indices. Restoring a single snapshot usually takes less than a minute.

Sometimes the snapshot restore process takes forever, as if it were hung.

We've found that after applying the cleanup script you mention, the restore process is responsive again, so we apply it regularly.

From what I've understood from this ticket, the bug consists of a snapshot restore process getting "hung" and not being abortable afterwards via the standard DELETE endpoint (at least until version 1.2). Your script is meant to wipe out any "hung" process, because the fix that went into 1.2 doesn't take care of already-hung processes; it just makes sure the process doesn't get "hung" again.

In that light, it looks like from time to time one of our many restores gets "hung", and then no further restore can run, since only one restore at a time is allowed on the cluster; that is when we see our restore process hang. After applying your script, the supposedly "hung" restore is wiped out and the cluster is back in business.

Could it be something like that?
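
For context, each of those restores is a call of roughly this shape (repository, snapshot, and index names here are hypothetical placeholders):

curl -XPOST "http://localhost:9200/_snapshot/<s3_repo>/<snapshot_name>/_restore" -d '{
  "indices": "index_a,index_b"
}'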


@imotov (Member) commented Sep 2, 2014

@JoeZ99 can you email me the cluster state from the cluster while it is in such a stuck state?

@l0bster commented Sep 29, 2014

Hey there,

we are using ES 1.3.2 and there are several snapshots in the state IN_PROGRESS. Deleting them manually doesn't work, and using this script https://github.com/imotov/elasticsearch-snapshot-cleanup told me:

[2014-09-29 11:27:17,376][INFO ][org.elasticsearch.org.motovs.elasticsearch.snapshots.AbortedSnapshotCleaner] No snapshots found

A rolling restart of all nodes didn't help to remove these stale snapshots. Any advice on removing them?

Thanks in advance!

#######

ptlxtme02:/tmp/elasticsearch-snapshot-cleanup-1.0-SNAPSHOT/bin # curl -XGET "http://ptlxtme02:9200/_snapshot/es_backup_fast/_all?pretty=true"
{
  "snapshots" : [ {
    "snapshot" : "2014-08-11_19:30:03",
    "indices" : [ ".ps_config", ".ps_status", "_river", "ps_article_index_01" ],
    "state" : "IN_PROGRESS",
    "start_time" : "2014-08-11T17:30:03.764Z",
    "start_time_in_millis" : 1407778203764,
    "failures" : [ ],
    "shards" : {
      "total" : 0,
      "failed" : 0,
      "successful" : 0
    }
  }, {
    "snapshot" : "2014-08-12_20:00:03",
    "indices" : [ ".ps_config", ".ps_status", "_river", "ps_article_index_01" ],
    "state" : "IN_PROGRESS",
    "start_time" : "2014-08-12T18:00:03.255Z",
    "start_time_in_millis" : 1407866403255,
    "failures" : [ ],
    "shards" : {
      "total" : 0,
      "failed" : 0,
      "successful" : 0
    }
  }, {
    "snapshot" : "2014-08-12_21:00:03",
    "indices" : [ ".ps_config", ".ps_status", "_river", "ps_article_index_01" ],
    "state" : "IN_PROGRESS",
    "start_time" : "2014-08-12T19:00:03.723Z",
    "start_time_in_millis" : 1407870003723,
    "failures" : [ ],
    "shards" : {
      "total" : 0,
      "failed" : 0,
      "successful" : 0
    }
  }, {
    "snapshot" : "2014-08-13_03:00:03",
    "indices" : [ ".ps_config", ".ps_status", "_river", "ps_article_index_01" ],
    "state" : "IN_PROGRESS",
    "start_time" : "2014-08-13T01:00:03.350Z",
    "start_time_in_millis" : 1407891603350,
    "failures" : [ ],
    "shards" : {
      "total" : 0,
      "failed" : 0,
      "successful" : 0
    }
  }, {
    "snapshot" : "2014-08-13_08:30:03",
    "indices" : [ ".ps_config", ".ps_status", "_river", "ps_article_index_01" ],
    "state" : "IN_PROGRESS",
    "start_time" : "2014-08-13T06:30:03.183Z",
    "start_time_in_millis" : 1407911403183,
    "failures" : [ ],
    "shards" : {
      "total" : 0,
      "failed" : 0,
      "successful" : 0
    }
  }, {
    "snapshot" : "2014-08-13_13:30:02",
    "indices" : [ ".ps_config", ".ps_status", "_river", "ps_article_index_01" ],
    "state" : "IN_PROGRESS",
    "start_time" : "2014-08-13T11:30:03.009Z",
    "start_time_in_millis" : 1407929403009,
    "failures" : [ ],
    "shards" : {
      "total" : 0,
      "failed" : 0,
      "successful" : 0
    }
  }, {
    "snapshot" : "2014-08-14_19:30:03",
    "indices" : [ ".ps_config", ".ps_status", "_river", "ps_article_index_01" ],
    "state" : "IN_PROGRESS",
    "start_time" : "2014-08-14T17:30:03.620Z",
    "start_time_in_millis" : 1408037403620,
    "failures" : [ ],
    "shards" : {
      "total" : 0,
      "failed" : 0,
      "successful" : 0
    }
  }, {
    "snapshot" : "2014-09-26_16:09:43",
    "indices" : [ ".ps_config", ".ps_status", "_river", "ps_article_index_01" ],
    "state" : "SUCCESS",
    "start_time" : "2014-09-26T14:09:43.829Z",
    "start_time_in_millis" : 1411740583829,
    "end_time" : "2014-09-26T15:26:43.933Z",
    "end_time_in_millis" : 1411745203933,
    "duration_in_millis" : 4620104,
    "failures" : [ ],
    "shards" : {
      "total" : 6,
      "failed" : 0,
      "successful" : 6
    }
  } ]
}

@imotov (Member) commented Sep 29, 2014

@l0bster what do you get when you try to delete these snapshots?

@l0bster commented Sep 29, 2014

@imotov I get:
ptlxtme02:/tmp # curl -XDELETE "http://ptlxtme02:9200/_snapshot/es_backup_fast/2014-08-14_19:30:03"
{"error":"SnapshotMissingException[[es_backup_fast:2014-08-14_19:30:03] is missing]; nested: FileNotFoundException[/mnt/es_backup/fast_snapshot/metadata-2014-08-14_19:30:03 (No such file or directory)]; ","status":404}

Here is an ls output of the snapshot directory:
ptlxtme02:/tmp # ll /mnt/es_backup/fast_snapshot/
total 368
-rw-r--r-- 1 elasticsearch elasticsearch 118 29. Sep 11:10 index
drwxr-xr-x 8 elasticsearch elasticsearch 4096 7. Aug 17:00 indices
-rw-r--r-- 1 elasticsearch elasticsearch 253 11. Aug 04:00 metadata-2014-08-11_04:00:03
-rw-r--r-- 1 elasticsearch elasticsearch 253 11. Aug 04:30 metadata-2014-08-11_04:30:03
-rw-r--r-- 1 elasticsearch elasticsearch 253 11. Aug 05:30 metadata-2014-08-11_05:30:03
-rw-r--r-- 1 elasticsearch elasticsearch 253 11. Aug 07:00 metadata-2014-08-11_07:00:02
-rw-r--r-- 1 elasticsearch elasticsearch 253 11. Aug 07:30 metadata-2014-08-11_07:30:03
-rw-r--r-- 1 elasticsearch elasticsearch 253 11. Aug 08:30 metadata-2014-08-11_08:30:02
-rw-r--r-- 1 elasticsearch elasticsearch 253 11. Aug 10:30 metadata-2014-08-11_10:30:03
-rw-r--r-- 1 elasticsearch elasticsearch 253 11. Aug 11:30 metadata-2014-08-11_11:30:03
-rw-r--r-- 1 elasticsearch elasticsearch 253 11. Aug 12:00 metadata-2014-08-11_12:00:04
-rw-r--r-- 1 elasticsearch elasticsearch 253 11. Aug 13:00 metadata-2014-08-11_13:00:03
-rw-r--r-- 1 elasticsearch elasticsearch 253 11. Aug 15:00 metadata-2014-08-11_15:00:03
-rw-r--r-- 1 elasticsearch elasticsearch 253 11. Aug 16:00 metadata-2014-08-11_16:00:02
-rw-r--r-- 1 elasticsearch elasticsearch 253 11. Aug 16:30 metadata-2014-08-11_16:30:03
-rw-r--r-- 1 elasticsearch elasticsearch 253 11. Aug 17:00 metadata-2014-08-11_17:00:03
-rw-r--r-- 1 elasticsearch elasticsearch 253 11. Aug 19:00 metadata-2014-08-11_19:00:03
-rw-r--r-- 1 elasticsearch elasticsearch 253 11. Aug 20:00 metadata-2014-08-11_20:00:03
-rw-r--r-- 1 elasticsearch elasticsearch 253 11. Aug 20:30 metadata-2014-08-11_20:30:03
-rw-r--r-- 1 elasticsearch elasticsearch 253 11. Aug 21:00 metadata-2014-08-11_21:00:03
-rw-r--r-- 1 elasticsearch elasticsearch 253 11. Aug 21:30 metadata-2014-08-11_21:30:03
-rw-r--r-- 1 elasticsearch elasticsearch 253 11. Aug 22:00 metadata-2014-08-11_22:00:03
-rw-r--r-- 1 elasticsearch elasticsearch 253 11. Aug 23:30 metadata-2014-08-11_23:30:02
-rw-r--r-- 1 elasticsearch elasticsearch 253 12. Aug 01:30 metadata-2014-08-12_01:30:03
-rw-r--r-- 1 elasticsearch elasticsearch 253 12. Aug 02:00 metadata-2014-08-12_02:00:03
-rw-r--r-- 1 elasticsearch elasticsearch 253 12. Aug 03:00 metadata-2014-08-12_03:00:03
-rw-r--r-- 1 elasticsearch elasticsearch 253 12. Aug 03:30 metadata-2014-08-12_03:30:02
-rw-r--r-- 1 elasticsearch elasticsearch 253 12. Aug 04:00 metadata-2014-08-12_04:00:03
-rw-r--r-- 1 elasticsearch elasticsearch 253 12. Aug 04:30 metadata-2014-08-12_04:30:03
-rw-r--r-- 1 elasticsearch elasticsearch 253 12. Aug 06:00 metadata-2014-08-12_06:00:03
-rw-r--r-- 1 elasticsearch elasticsearch 253 12. Aug 10:00 metadata-2014-08-12_10:00:03
-rw-r--r-- 1 elasticsearch elasticsearch 253 12. Aug 10:30 metadata-2014-08-12_10:30:02
-rw-r--r-- 1 elasticsearch elasticsearch 253 12. Aug 11:00 metadata-2014-08-12_11:00:03
-rw-r--r-- 1 elasticsearch elasticsearch 253 12. Aug 11:30 metadata-2014-08-12_11:30:03
-rw-r--r-- 1 elasticsearch elasticsearch 253 12. Aug 12:00 metadata-2014-08-12_12:00:03
-rw-r--r-- 1 elasticsearch elasticsearch 253 12. Aug 12:30 metadata-2014-08-12_12:30:03
-rw-r--r-- 1 elasticsearch elasticsearch 253 12. Aug 13:00 metadata-2014-08-12_13:00:03
-rw-r--r-- 1 elasticsearch elasticsearch 253 12. Aug 13:30 metadata-2014-08-12_13:30:03
-rw-r--r-- 1 elasticsearch elasticsearch 253 12. Aug 16:00 metadata-2014-08-12_16:00:03
-rw-r--r-- 1 elasticsearch elasticsearch 253 12. Aug 17:00 metadata-2014-08-12_17:00:03
-rw-r--r-- 1 elasticsearch elasticsearch 253 12. Aug 17:30 metadata-2014-08-12_17:30:04
-rw-r--r-- 1 elasticsearch elasticsearch 253 12. Aug 21:30 metadata-2014-08-12_21:30:04
-rw-r--r-- 1 elasticsearch elasticsearch 253 12. Aug 22:00 metadata-2014-08-12_22:00:03
-rw-r--r-- 1 elasticsearch elasticsearch 253 12. Aug 23:00 metadata-2014-08-12_23:00:03
-rw-r--r-- 1 elasticsearch elasticsearch 253 12. Aug 23:30 metadata-2014-08-12_23:30:03
-rw-r--r-- 1 elasticsearch elasticsearch 253 13. Aug 00:00 metadata-2014-08-13_00:00:02
-rw-r--r-- 1 elasticsearch elasticsearch 253 13. Aug 01:30 metadata-2014-08-13_01:30:02
-rw-r--r-- 1 elasticsearch elasticsearch 253 13. Aug 02:30 metadata-2014-08-13_02:30:03
-rw-r--r-- 1 elasticsearch elasticsearch 253 13. Aug 04:30 metadata-2014-08-13_04:30:02
-rw-r--r-- 1 elasticsearch elasticsearch 253 13. Aug 05:00 metadata-2014-08-13_05:00:03
-rw-r--r-- 1 elasticsearch elasticsearch 253 13. Aug 06:00 metadata-2014-08-13_06:00:03
-rw-r--r-- 1 elasticsearch elasticsearch 253 13. Aug 07:00 metadata-2014-08-13_07:00:03
-rw-r--r-- 1 elasticsearch elasticsearch 253 13. Aug 08:00 metadata-2014-08-13_08:00:02
-rw-r--r-- 1 elasticsearch elasticsearch 253 13. Aug 09:00 metadata-2014-08-13_09:00:03
-rw-r--r-- 1 elasticsearch elasticsearch 253 13. Aug 14:00 metadata-2014-08-13_14:00:02
-rw-r--r-- 1 elasticsearch elasticsearch 253 13. Aug 14:30 metadata-2014-08-13_14:30:03
-rw-r--r-- 1 elasticsearch elasticsearch 253 13. Aug 16:00 metadata-2014-08-13_16:00:03
-rw-r--r-- 1 elasticsearch elasticsearch 253 13. Aug 18:00 metadata-2014-08-13_18:00:03
-rw-r--r-- 1 elasticsearch elasticsearch 253 13. Aug 19:00 metadata-2014-08-13_19:00:02
-rw-r--r-- 1 elasticsearch elasticsearch 253 13. Aug 20:30 metadata-2014-08-13_20:30:03
-rw-r--r-- 1 elasticsearch elasticsearch 253 13. Aug 21:00 metadata-2014-08-13_21:00:03
-rw-r--r-- 1 elasticsearch elasticsearch 253 13. Aug 21:30 metadata-2014-08-13_21:30:04
-rw-r--r-- 1 elasticsearch elasticsearch 253 13. Aug 23:30 metadata-2014-08-13_23:30:02
-rw-r--r-- 1 elasticsearch elasticsearch 253 14. Aug 00:00 metadata-2014-08-14_00:00:03
-rw-r--r-- 1 elasticsearch elasticsearch 253 14. Aug 02:00 metadata-2014-08-14_02:00:04
-rw-r--r-- 1 elasticsearch elasticsearch 253 14. Aug 03:00 metadata-2014-08-14_03:00:02
-rw-r--r-- 1 elasticsearch elasticsearch 253 14. Aug 04:00 metadata-2014-08-14_04:00:02
-rw-r--r-- 1 elasticsearch elasticsearch 253 14. Aug 05:30 metadata-2014-08-14_05:30:02
-rw-r--r-- 1 elasticsearch elasticsearch 253 14. Aug 06:00 metadata-2014-08-14_06:00:03
-rw-r--r-- 1 elasticsearch elasticsearch 253 14. Aug 06:30 metadata-2014-08-14_06:30:03
-rw-r--r-- 1 elasticsearch elasticsearch 253 14. Aug 07:30 metadata-2014-08-14_07:30:02
-rw-r--r-- 1 elasticsearch elasticsearch 253 14. Aug 08:00 metadata-2014-08-14_08:00:02
-rw-r--r-- 1 elasticsearch elasticsearch 253 14. Aug 09:00 metadata-2014-08-14_09:00:03
-rw-r--r-- 1 elasticsearch elasticsearch 253 14. Aug 09:30 metadata-2014-08-14_09:30:02
-rw-r--r-- 1 elasticsearch elasticsearch 253 14. Aug 12:30 metadata-2014-08-14_12:30:02
-rw-r--r-- 1 elasticsearch elasticsearch 253 14. Aug 14:30 metadata-2014-08-14_14:30:02
-rw-r--r-- 1 elasticsearch elasticsearch 253 14. Aug 16:00 metadata-2014-08-14_16:00:03
-rw-r--r-- 1 elasticsearch elasticsearch 253 14. Aug 18:00 metadata-2014-08-14_18:00:02
-rw-r--r-- 1 elasticsearch elasticsearch 253 14. Aug 18:30 metadata-2014-08-14_18:30:03
-rw-r--r-- 1 elasticsearch elasticsearch 253 15. Aug 05:30 metadata-2014-08-15_05:30:04
-rw-r--r-- 1 elasticsearch elasticsearch 253 15. Aug 07:30 metadata-2014-08-15_07:30:03
-rw-r--r-- 1 elasticsearch elasticsearch 253 15. Aug 08:00 metadata-2014-08-15_08:00:03
-rw-r--r-- 1 elasticsearch elasticsearch 253 15. Aug 10:00 metadata-2014-08-15_10:00:35
-rw-r--r-- 1 elasticsearch elasticsearch 312 26. Sep 16:09 metadata-2014-09-26_16:09:43
-rw-r--r-- 1 elasticsearch elasticsearch 232 11. Aug 19:30 snapshot-2014-08-11_19:30:03
-rw-r--r-- 1 elasticsearch elasticsearch 232 12. Aug 20:00 snapshot-2014-08-12_20:00:03
-rw-r--r-- 1 elasticsearch elasticsearch 232 12. Aug 21:00 snapshot-2014-08-12_21:00:03
-rw-r--r-- 1 elasticsearch elasticsearch 231 13. Aug 03:00 snapshot-2014-08-13_03:00:03
-rw-r--r-- 1 elasticsearch elasticsearch 232 13. Aug 08:30 snapshot-2014-08-13_08:30:03
-rw-r--r-- 1 elasticsearch elasticsearch 232 13. Aug 13:30 snapshot-2014-08-13_13:30:02
-rw-r--r-- 1 elasticsearch elasticsearch 231 14. Aug 19:30 snapshot-2014-08-14_19:30:03
-rw-r--r-- 1 elasticsearch elasticsearch 237 26. Sep 17:26 snapshot-2014-09-26_16:09:43

@peillis commented Sep 29, 2014

We also see the "No snapshots found" message from the cleanup script, but the fact is that after running it the stuck snapshot seems to disappear.

@l0bster commented Sep 30, 2014

@peillis I gave it a try... we have 7 stuck snapshots in the state "IN_PROGRESS". I ran the utility, it told me no snapshots were found, and after that there were still 7 stuck snapshots :(

@imotov (Member) commented Sep 30, 2014

@l0bster the utility was created to clean up snapshots that are currently running. In your case these snapshots are no longer running. They got stuck in the IN_PROGRESS state because the cluster was shut down, or the connection to the mounted shared file system was lost, while they were running, so Elasticsearch didn't have a chance to update them and they are still stored in this intermediate state. Theoretically, it should be possible to delete them using the snapshot delete command. But since that doesn't work for you, you might be hitting a bug similar to #6383. I will try to reproduce the issue, but it would really help if you could check the log file on the current master node to see whether any errors are logged while you are running the snapshot delete command. If you see an error, please post it here with the complete stacktrace.
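
A quick way to run that check (a sketch; the log path assumes a default package install, and the host/snapshot names are taken from the output above):

# on the current master node, watch for WARN-level messages:
tail -f /var/log/elasticsearch/*.log
# in another shell, re-run the delete that fails:
curl -XDELETE "http://ptlxtme02:9200/_snapshot/es_backup_fast/2014-08-14_19:30:03"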

@l0bster commented Sep 30, 2014

@imotov Hey, when I try to delete one snapshot via curl -XDELETE "http://ptlxtme02:9200/_snapshot/es_backup_fast/2014-08-11_19:30:03", no error is logged. I can increase the log level to debug, but that requires a node restart. I will edit my post ASAP to include any errors that do get logged.

@imotov (Member) commented Sep 30, 2014

@l0bster thanks, but that is unlikely to result in more logging. The log message I was looking for should have been logged at the WARN level.

@l0bster commented Sep 30, 2014

@imotov Hey, I managed to enable debugging... sorry for the delay. The only thing ES logs is:

[2014-09-30 16:29:27,443][DEBUG][cluster.service ] [ptlxtme02] processing [delete snapshot]: execute
[2014-09-30 16:29:27,444][DEBUG][cluster.service ] [ptlxtme02] processing [delete snapshot]: no change in cluster_state

:(

@l0bster commented Oct 2, 2014

@imotov have you had time to reproduce the issue yet?

@imotov (Member) commented Oct 3, 2014

@l0bster yes, I was able to reproduce it. It's a different bug, so I created a new issue for it: #7980. Thank you for the report and for providing helpful information. As a workaround, you can simply delete the file snapshot-2014-08-14_19:30:03 from the repository.
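
For example, with the shared-filesystem repository path from the ls output earlier in this thread (adjust the path for your own repository):

rm /mnt/es_backup/fast_snapshot/snapshot-2014-08-14_19:30:03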

@bruce-lyft commented Dec 12, 2014

I am seeing this error in v1.3.6. Cluster state shows the snapshot is in ABORTED state on all shards.
New snapshots cannot be started (ConcurrentSnapshotExecutionException), and the current snapshot cannot be deleted - DELETE just hangs.

@imotov 's cleanup tool did not help.

Update: a rolling restart corrected the issue.

The cluster state is:
"snapshots": {
"snapshots": [
{
"repository": "my_backup",
"snapshot": "snapshot_1",
"include_global_state": true,
"state": "ABORTED",
"indices": [
"grafana-dash"
],
"shards": [
{
"index": "grafana-dash",
"shard": 0,
"state": "ABORTED",
"node": "_mTOiyD_TN2vV2C2A8sNbw"
},
{
"index": "grafana-dash",
"shard": 1,
"state": "ABORTED",
"node": "TYy7OHbXR2q_U-xTG4Xtqg"
},
{
"index": "grafana-dash",
"shard": 2,
"state": "ABORTED",
"node": "TYy7OHbXR2q_U-xTG4Xtqg"
},
{
"index": "grafana-dash",
"shard": 3,
"state": "ABORTED",
"node": "TYy7OHbXR2q_U-xTG4Xtqg"
},
{
"index": "grafana-dash",
"shard": 4,
"state": "ABORTED",
"node": "TYy7OHbXR2q_U-xTG4Xtqg"
}
]
}
]
}
},

@vanga commented Apr 6, 2015

Hey @imotov, I have the same problem: my ES node was restarted during a snapshot, and I am trying to run this script https://github.com/imotov/elasticsearch-snapshot-cleanup
My ES version is 1.4.0. I did this:

For all other versions, update the pom.xml file with the appropriate elasticsearch and lucene versions, run mvn clean package, and untar the file found in the target/releases directory. I then copied my cluster config to config/elasticsearch.yml.

When I run the bin/cleanup script I get this error:

Setting ES_HOME as /root/elasticsearch-snapshot-cleanup/target/releases/elasticsearch-snapshot-cleanup-1.4.4.1
Error: Could not find or load main class org.motovs.elasticsearch.snapshots.AbortedSnapshotCleaner

Any idea?

@MosesMansaray commented Oct 25, 2016

Occurred on ES v1.4.2. elasticsearch-snapshot-cleanup did the trick for me with no fuss.

Thanks a bunch @imotov
