
Cannot restore snapshot on new cluster #78320

Closed
cdalexndr opened this issue Sep 27, 2021 · 10 comments · Fixed by #79670
Labels
>bug · :Data Management/Ingest · :Distributed/Snapshot/Restore · Team:Data Management · Team:Distributed

Comments

@cdalexndr

cdalexndr commented Sep 27, 2021

Elasticsearch version (bin/elasticsearch --version): Version: 7.14.1, Build: default/docker/66b55ebfa59c92c15db3f69a335d500018b3331e/2021-08-26T09:01:05.390870785Z, JVM: 16.0.2

Plugins installed: []

JVM version (java -version): OpenJDK 64-Bit Server VM Temurin-16.0.2+7 (build 16.0.2+7, mixed mode, sharing)

OS version (uname -a if on a Unix-like system): Linux d3463a9ac7de 4.9.0-14-amd64 #1 SMP Debian 4.9.246-2 (2020-12-17) x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:

Trying to restore snapshot on a new single node cluster throws error:

"type" : "snapshot_restore_exception",
"reason" : "[backup:snapshot-2021.09.23/esJtA1MeRcenJbz3tkIL2A] cannot restore index [.geoip_databases] because an open index with same name already exists in the cluster. Either close or delete the existing index or restore the index under a different name by providing a rename pattern and replacement name"

Steps to reproduce:

  1. PUT /_snapshot/backup/%3Csnapshot-%7Bnow%2Fd%7D%3E (the URL-encoded form of <snapshot-{now/d}>, which resolves to a date-stamped name such as snapshot-2021.09.23)
  2. Create new single node cluster
  3. POST /_snapshot/backup/snapshot-2021.09.23/_restore
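(Reproducing this assumes a snapshot repository named backup is registered on both clusters before step 1; a minimal sketch, assuming a shared filesystem repository with a hypothetical path:)

PUT /_snapshot/backup
{
  "type": "fs",
  "settings": {
    "location": "/mnt/backups"
  }
}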

Provide logs (if relevant):

Discuss url: https://discuss.elastic.co/t/cannot-restore-snapshot-to-new-single-node-cluster/285025

@cdalexndr added the >bug and needs:triage labels on Sep 27, 2021
@dadoonet
Member

I think that we should add an option to ignore system indices (maybe by default). In that case, the .geoip_databases index would not be saved within a regular snapshot made by a user.

@danhermann added the :Data Management/Ingest and :Distributed/Snapshot/Restore labels and removed the needs:triage label on Sep 29, 2021
@elasticmachine added the Team:Data Management and Team:Distributed labels on Sep 29, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@elasticmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@gwbrown
Contributor

gwbrown commented Oct 1, 2021

There are a couple of concerns here:

  1. When you restore a snapshot, you must restore at least one index. Since the only index that exists in the snapshot is .geoip_databases, you must restore that index. If there were other indices, you could specify them in the indices field when restoring (see the sketch below) and would not encounter this problem.
  2. You can tell Elasticsearch that system indices in the cluster should be overwritten with the system indices from the snapshot, which will allow you to restore this snapshot. However, it's opt-in as this could potentially overwrite a lot of cluster configuration.
    The first way will tell Elasticsearch that all system indices should be replaced with the ones from the snapshot:
POST /_snapshot/backup/snapshot-2021.09.23/_restore
{
  "include_global_state": true
}

This restores the cluster state (cluster settings, etc.) as well as system indices.

The second way specifies the features¹ which should have their state overwritten. The features present in the cluster depend on the installed plugins and can be viewed with the Get Features API, or by GETing the snapshot and checking the feature_states field, which lists only the features with indices present in the snapshot. In this case, the feature for the index is named geoip.
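For example, using the repository and snapshot names from this report:

GET /_features

GET /_snapshot/backup/snapshot-2021.09.23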

We can tell Elasticsearch that only geoip's system indices (not including cluster state) can be restored using the feature_states field on the restore request:

POST /_snapshot/backup/snapshot-2021.09.23/_restore
{
  "feature_states": ["geoip"]
}

I've tested these both locally on a 7.15.0 cluster.
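(For the first concern, a sketch of restoring only specific indices; the index names here are hypothetical stand-ins for user data the snapshot might contain:)

POST /_snapshot/backup/snapshot-2021.09.23/_restore
{
  "indices": "my-index-1,my-index-2"
}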

While we have a workaround here, it's clear that the initial user experience isn't the best. I'm not sure how best to improve it without making the default behavior dangerous, but at a bare minimum we can improve the error message here.

Footnotes

  1. This is done by feature rather than by index because some features have dependencies between their indices, so restoring only some of a feature's indices would leave the system broken.

@cdalexndr
Author

cdalexndr commented Oct 1, 2021

> When you restore a snapshot, you must restore at least one index. Since the only index that exists in the snapshot is .geoip_databases, you must restore that index. If there were other indices, you could specify them in the indices field when restoring, and would not encounter this problem.

I was restoring a snapshot that contained multiple indices. I did manage to restore the snapshot by manually specifying the indices, but that is a workaround.
If the command that made the snapshot required no additional arguments, then the restore command with no arguments should work out of the box!

@gwbrown
Contributor

gwbrown commented Oct 1, 2021

> If the command that made the snapshot required no additional arguments, then the restore command with no arguments should work out of the box!

While I agree that this would be ideal, there are things that make it difficult, if not impossible, to achieve without sacrificing other critical qualities, especially when the clusters in question already have some configuration in place.

At snapshot creation time we want to default to capturing as much data as possible: Not just user-created indices, but system-owned indices and global cluster state as well. That way, we can be sure that if the snapshot wasn't configured more precisely, we have whatever the user wanted to save. But when we go to restore that snapshot, we need to make sure that restoring the snapshot won't have any destructive effects on any data already in the cluster unless Elasticsearch has explicitly been told that's okay by an administrator. In order to make that happen without any arguments at all, we'd have to choose between:

  • Restore all indices, overwriting any that already exist, which silently deletes data.
  • Only restore indices which don't already exist in the cluster, and simply skip any indices which conflict. This means a snapshot restoration might only partially succeed, with little indication that this has happened.

Both options lead to obscure problems where the data in the system isn't what one would expect. If we encounter a situation where the only way forward is to drop data, we've found that it's best to just raise an error and ask a human rather than guessing what the best thing to do is. Unfortunately, while this frequently averts disaster, it does mean that some of our APIs are picky.

All of this to say: I don't necessarily disagree with you that the current situation isn't too user-friendly and should be improved, but this is a hard problem and figuring out how to improve it is likely to be challenging. We could simply omit this system index from snapshots, but this error will happen any time you try to restore a snapshot that contains an index that's already present in the restoring cluster, so that would be a band-aid fix for one very particular instance of the problem. A similar situation can occur with the .security index, which we definitely want to include in snapshots by default, as it contains critical configuration.
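(For concreteness, a bare create-snapshot request like the one in the original report is, as I understand the 7.x defaults, equivalent to spelling the capture-everything behavior out explicitly:)

PUT /_snapshot/backup/%3Csnapshot-%7Bnow%2Fd%7D%3E
{
  "indices": "*",
  "include_global_state": true
}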

@cdalexndr
Author

From a user perspective, I think the most sensible default would be to merge the existing data with the snapshot data.
I don't know whether merging two indices is supported, but I found this: https://discuss.elastic.co/t/merging-two-indexes/8745/2

@gwbrown
Contributor

gwbrown commented Oct 2, 2021

While that might seem intuitive, what's the intuitive behavior if those indices have document IDs that conflict? Do you take the one from the cluster or from the snapshot? Or do you merge them? If so, how does that logic work - do fields from the live index or the snapshot take priority? What about all the applications that are out there today that can't handle a foreign process merging documents into an index that they expect to have complete control over?

Regardless of whether it would be intuitive, merging two indices is not something Elasticsearch is capable of at this time, nor will it be in the near future. The post you link to effectively reindexes both indices, which takes vastly more time and resources than restoring a snapshot. Being able to merge indices more efficiently would be both challenging (see above) and would consume a lot of our development resources, which could be spent building something else.
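(For reference, the approach in that post boils down to reindexing one index into the other, roughly like this, with hypothetical index names:)

POST /_reindex
{
  "source": { "index": "index-from-snapshot" },
  "dest": { "index": "existing-index" }
}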

To bring this back around to the original issue: I think we can correct this behavior in the next major version to at least be a little more intuitive. For the 7.x series of releases, there's nothing we can do without breaking our backwards compatibility policy. But in 8.0 and later, I believe we can change the behavior on snapshot restoration to not include system indices unless they're requested via the feature_states or include_global_state parameters on the restore call, which also signal that the existing indices should be overwritten. This aligns with some other changes we're looking to make in 8.0.

@cdalexndr
Author

> what's the intuitive behavior if those indices have document IDs that conflict? Do you take the one from the cluster or from the snapshot? Or do you merge them?

As the user requested a snapshot restore, the snapshot takes priority, so overwrite. I'm guessing the case where only some fields differ within the same document ID is very rare, because docs should be immutable (add a new doc instead of updating the old one), so overwrite should cover most use cases (overwriting a doc with the same fields and values is a no-op).

> What about all the applications that are out there today that can't handle a foreign process merging documents into an index that they expect to have complete control over?

You could provide additional options on the restore command so that such applications keep control over it, but the default behavior should work for most use cases.

@markwalkom
Contributor

markwalkom commented Oct 2, 2021 via email
