Snapshot/Restore: snapshot with missing metadata file cannot be deleted #7980

Closed
imotov opened this Issue Oct 3, 2014 · 21 comments

imotov commented Oct 3, 2014

If a snapshot's metadata file disappears from a repository, or was never created because of network issues or a master node crash during the snapshot process, the snapshot cannot be deleted. This was originally reported in #5958 (comment).

imotov added the >bug label Oct 3, 2014

imotov self-assigned this Oct 3, 2014

imotov added a commit to imotov/elasticsearch that referenced this issue Oct 7, 2014

imotov closed this in #7981 Oct 7, 2014

imotov added a commit that referenced this issue Oct 7, 2014


saahn commented Nov 4, 2014

Hi, I just ran into this issue today and was wondering if you had any idea when this patch would be released. Is the target version 1.3.5 or 1.4? Thanks!

imotov commented Nov 4, 2014

@saahn yes, 1.3.5 and 1.4.0. You can check the labels on #7981 to see all the versions it was merged into.

sarkis commented Feb 26, 2015

We've run into what I believe is this issue. We're running 1.4.4 and there is a snapshot that is:

IN_PROGRESS when I check localhost:9200/_snapshot/backups/snapshot_name endpoint
ABORTED when I check localhost:9200/_cluster/state

-XDELETE hangs when attempting to delete the snapshot

I believe it is this issue because we upgraded and restarted Elasticsearch around the time this snapshot was running.

@imotov I also tried to use your cleanup script - however it returns with "No snapshots found" and the snapshot is still stuck in the same states above.

Any other ideas on a way to force delete this snapshot? It is currently blocking us from getting any other snapshots created.
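For readers who want to repeat the checks described in this comment, here is a minimal Python sketch that builds the three HTTP requests involved without sending them. The host `localhost:9200` and the repository name `backups` are taken from the comment; `snapshot_name` is a placeholder, and the helper name `snapshot_requests` is hypothetical.

```python
from urllib.request import Request


def snapshot_requests(repo, snapshot, host="http://localhost:9200"):
    """Build (but do not send) the requests used to inspect and delete a snapshot."""
    snapshot_url = f"{host}/_snapshot/{repo}/{snapshot}"
    return {
        # Snapshot status as reported by the repository (IN_PROGRESS in this report).
        "status": Request(snapshot_url, method="GET"),
        # Cluster state, where the same snapshot showed up as ABORTED.
        "cluster_state": Request(f"{host}/_cluster/state", method="GET"),
        # The delete call that hung.
        "delete": Request(snapshot_url, method="DELETE"),
    }


reqs = snapshot_requests("backups", "snapshot_name")
for name, req in reqs.items():
    print(name, req.get_method(), req.full_url)
```

To actually send one of these, it could be passed to `urllib.request.urlopen` with a `timeout`, which seems prudent given that the delete call hung here.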

imotov commented Feb 26, 2015

@sarkis which type of repository and which version are you using? Could you post the snapshot part of the cluster state here?

sarkis commented Feb 26, 2015

@imotov snapshot part of cluster state: https://gist.github.com/sarkis/f46de23dc81b1dba0d1a

We're now on ES 1.4.4. The snapshot was started on ES 1.4.2, and we ran into trouble because the snapshot was still running during the upgrade from 1.4.2 to 1.4.4. We are using an fs repository; more info:

{"backups_sea":{"type":"fs","settings":{"compress":"true","location":"/path/to/snapshot_dir"}}}

imotov commented Feb 26, 2015

@sarkis how many nodes are in the cluster right now? Is the node bhwQqwZ2QuCUPjcZGrGpuQ still running?

sarkis commented Feb 26, 2015

@imotov 1 gateway and 2 data nodes (1 set to master)

I cannot find that node name - I assume it was renamed upon rolling restart or possibly from the upgrade?

imotov commented Feb 26, 2015

@sarkis Are you sure it doesn't appear in curl "localhost:9200/_nodes?pretty" output?
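As a rough illustration of this check, the sketch below looks for a node ID among the keys of the `nodes` object in a `_nodes` response. The function name `node_is_present` and the sample response are made up for the example; only the node ID `bhwQqwZ2QuCUPjcZGrGpuQ` comes from this thread.

```python
import json


def node_is_present(nodes_response, node_id):
    """Return True if node_id is a key of the "nodes" object in a _nodes response."""
    return node_id in nodes_response.get("nodes", {})


# Hypothetical, heavily trimmed _nodes response, for illustration only.
sample = json.loads(
    '{"cluster_name": "example", "nodes": {"aaaaaaaaaaaaaaaaaaaaaa": {"name": "node-1"}}}'
)
print(node_is_present(sample, "bhwQqwZ2QuCUPjcZGrGpuQ"))
```

In practice the input would be the parsed JSON returned by `curl "localhost:9200/_nodes?pretty"`.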

sarkis commented Feb 26, 2015

@imotov just double checked - nothing.

imotov commented Feb 26, 2015

@sarkis did you run the cleanup script before or after the upgrade? Did you restart the master node during the upgrade, or is it still running 1.4.2? Does the master node have proper access to the shared file system, or do read/write operations on the shared file system still hang?

sarkis commented Feb 26, 2015

@imotov I tried running the cleanup script after the upgrade.

The master node was restarted and is running 1.4.4 - if I had known about this issue I would have stopped the snapshot before rolling restarts/upgrades :(

The snapshot directory is an NFS mount and the "elasticsearch" user does have proper read/write permissions. I just double checked this on all nodes in the cluster.

Thanks a lot for the help and quick responses.

imotov commented Feb 26, 2015

@sarkis I am completely puzzled about what went wrong and I am still trying to figure out what happened and how to reproduce the issue. With a single master node, the snapshot should have disappeared during restart. There is simply no place for it to survive since snapshot information is not getting persisted on disk. Even if the snapshot somehow survived the restart, the cleanup script should have removed it. So, I feel that I am missing something important about the issue.

When you said rolling restart, what did you mean? Could you describe the process in as much detail as possible? Was the snapshot stuck before the upgrade, or was it simply taking a long time? What was the upgrade process? Which nodes did you restart first?

sarkis commented Feb 26, 2015

@imotov Sure - so we have 2 data nodes and 1 gateway node (total of 3 nodes). The rolling restart was done following the recommended way to do so via elasticsearch documentation:

  1. turn off allocation
  2. upgrade / restart gateway
  3. turn on allocation (wait for green)
  4. turn off allocation
  5. upgrade / restart non-master data node
  6. turn on allocation (wait for green)
  7. turn off allocation
  8. upgrade / restart master data node
  9. turn on allocation
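The steps above can be sketched as a request plan. This is only an illustration: the thread does not say which setting was used to turn allocation off and on, so the `cluster.routing.allocation.enable` transient setting is an assumption, and the helper names `allocation_settings` and `rolling_restart_plan` are hypothetical. The sketch also waits for green after every node, including the last one.

```python
import json


def allocation_settings(enabled):
    """Body for PUT /_cluster/settings that toggles shard allocation (assumed setting)."""
    value = "all" if enabled else "none"
    return {"transient": {"cluster.routing.allocation.enable": value}}


def rolling_restart_plan(nodes):
    """Return the ordered (action, body) steps for restarting nodes one at a time."""
    plan = []
    for node in nodes:
        plan.append(("PUT /_cluster/settings", allocation_settings(False)))
        plan.append((f"upgrade / restart {node}", None))
        plan.append(("PUT /_cluster/settings", allocation_settings(True)))
        plan.append(("GET /_cluster/health?wait_for_status=green", None))
    return plan


plan = rolling_restart_plan(["gateway", "non-master data node", "master data node"])
for action, body in plan:
    print(action, json.dumps(body) if body else "")
```

Nothing here contacts a cluster; the plan is just data, so it can be reviewed before any request is issued.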

I think we have tried everything we could at this point as well. Would you recommend removing/adding back the repo? What's the best way to just get around this? I understand you wanted to reproduce it on your end but I'd like to get snapshots working ASAP.

Update: I know there isn't truly a "master" - I called the above nodes master and non-master based off of info from paramedic at the time of upgrades

imotov commented Feb 26, 2015

So they are all master-eligible nodes! That explains the first part: how the snapshot survived the restart. It doesn't explain how it happened in the first place, though. Removing and adding back the repo is not going to help. There are really only two ways to go: I can try to figure out what went wrong and fix the cleanup script to handle this case, or you can do a full cluster restart (shut down all master-eligible nodes and then start them back up). By the way, what do you mean by "gateway"?

sarkis commented Feb 26, 2015

We have a dedicated node for this: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-gateway.html

On the full cluster restart - do you mean to shut them all down at the same time and bring them back up? Does that mean there will be data loss in the time it takes to bring them both back up?

imotov commented Feb 26, 2015

@sarkis not sure I am following. Are you using a non-local gateway on one of your nodes? Which gateway are you using there? How did you configure this node compared to all the other nodes? Could you send me your complete cluster state? (You can send it to igor.motov@elasticsearch.com if you don't want to post it here.)

sarkis commented Feb 26, 2015

@imotov sorry for the confusion - the 3rd node, a non-data, non-master node that we refer to as a gateway, is the entry point to the cluster. Its one and only purpose is to pass traffic through to the data nodes. Sending you the full cluster state via e-mail.

imotov commented Feb 26, 2015

OK, so the "gateway" node doesn't have anything to do with gateways and it's simply a client node. Got it! I will wait for the cluster state to continue the investigation. Thanks!

A full cluster restart will make the cluster unavailable for indexing and searching while nodes are restarting and shards are recovering. Depending on how you are indexing data, it might or might not cause loss of data (if your client has retry logic to reindex failed records, it shouldn't lead to any data loss).

sarkis commented Feb 26, 2015

@imotov sent the cluster state - let me know if I can do anything else. I am looking for a window in which we can do a full restart to see if this will fix our problem.

sarkis commented Feb 27, 2015

In case others come here with the same issue: @imotov's updated cleanup script (https://github.com/imotov/elasticsearch-snapshot-cleanup) for 1.4.4 worked in clearing up the ABORTED snapshots.

imotov commented Feb 27, 2015

Since it seems to be a different problem, I have created a separate issue for it.

mute pushed a commit to mute/elasticsearch that referenced this issue Jul 29, 2015
