Allow snapshot restore after write alias has been moved by ILM #73934

Open
matschaffer opened this issue Jun 9, 2021 · 8 comments
Labels: :Distributed/Snapshot/Restore, >enhancement, feedback_needed, Team:Distributed

Comments

@matschaffer
Contributor

I've seen some cases where a snapshot restore has failed with an error like this:

[illegal_state_exception] alias [matschaffer-filebeat-7.7.1] has more than one write index [matschaffer-filebeat-7.7.1-2021.03.22-000096,matschaffer-filebeat-7.7.1-2021.03.21-000095]

The sequence of events is roughly:

  1. Data is being written to matschaffer-filebeat-7.7.1-2021.03.21-000095 via matschaffer-filebeat-7.7.1 write alias
  2. A snapshot is taken which backs up matschaffer-filebeat-7.7.1-2021.03.21-000095 with the alias information
  3. ILM rolls over matschaffer-filebeat-7.7.1-2021.03.21-000095 to matschaffer-filebeat-7.7.1-2021.03.22-000096 and moves the write alias to the new index
  4. A failure occurs and matschaffer-filebeat-7.7.1-2021.03.21-000095 is lost
  5. Restore of matschaffer-filebeat-7.7.1-2021.03.21-000095 fails because it also tries to claim the matschaffer-filebeat-7.7.1 write alias, which currently points to matschaffer-filebeat-7.7.1-2021.03.22-000096
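
The conflict in the last step can be reproduced with a toy model of alias bookkeeping. This is only an illustrative sketch, not Elasticsearch's actual implementation: the `AliasTable` type and the `restore_with_aliases` and `write_indices` helpers are hypothetical names, and the only rule modelled is "an alias may have at most one write index".

```python
from typing import Dict, List

# Hypothetical toy model: alias name -> {index name: is_write_index}
AliasTable = Dict[str, Dict[str, bool]]


def write_indices(aliases: AliasTable, alias: str) -> List[str]:
    """Return the sorted names of the indices flagged as write index for `alias`."""
    return sorted(idx for idx, is_write in aliases.get(alias, {}).items() if is_write)


def restore_with_aliases(aliases: AliasTable, alias: str, index: str, is_write: bool) -> None:
    """Re-attach a snapshotted alias entry, rejecting a second write index."""
    candidate = dict(aliases.get(alias, {}))
    candidate[index] = is_write
    writers = sorted(idx for idx, flag in candidate.items() if flag)
    if len(writers) > 1:
        raise ValueError(
            f"alias [{alias}] has more than one write index [{','.join(writers)}]"
        )
    aliases[alias] = candidate


ALIAS = "matschaffer-filebeat-7.7.1"
OLD = "matschaffer-filebeat-7.7.1-2021.03.21-000095"
# Name of the rolled-over index as it appears in the error message above.
NEW = "matschaffer-filebeat-7.7.1-2021.03.22-000096"

# State after the rollover and the loss of OLD: only NEW holds the write alias.
state: AliasTable = {ALIAS: {NEW: True}}

try:
    # The snapshot recorded OLD as the write index, so restoring it with its
    # aliases would leave the alias with two write indices.
    restore_with_aliases(state, ALIAS, OLD, is_write=True)
    outcome = "restored"
except ValueError as err:
    outcome = str(err)

print(outcome)
```

In this model, restoring the old index with `is_write=False` (or without any alias) succeeds, which is what the manual workaround achieves.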

To work around this I had to perform the restore manually without aliases:

POST _snapshot/found-snapshots/cloud-snapshot-2021.03.22-UUID/_restore
{
    "indices": [
        "matschaffer-filebeat-7.7.1-2021.03.21-000095"
    ],
    "include_aliases": false
}

Then replace the read alias so the restored data would be available via normal query load:

POST _aliases
{
    "actions" : [
        { "add" : { "index" : "matschaffer-filebeat-7.7.1-2021.03.21-000095", "alias" : "matschaffer-filebeat-7.7.1", "is_write_index": false } }
    ]
}
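
For tooling that has to repeat this workaround, the two request bodies could be generated rather than typed by hand. The following is a minimal sketch that only builds the JSON payloads shown above; the helper names `restore_body` and `read_alias_action` are hypothetical, and sending the requests is left to whatever HTTP client is in use.

```python
import json
from typing import Dict, List


def restore_body(indices: List[str]) -> Dict:
    """Body for POST _snapshot/<repository>/<snapshot>/_restore, skipping aliases."""
    return {"indices": indices, "include_aliases": False}


def read_alias_action(index: str, alias: str) -> Dict:
    """Body for POST _aliases that re-adds `alias` to `index` as a read-only member."""
    return {
        "actions": [
            {"add": {"index": index, "alias": alias, "is_write_index": False}}
        ]
    }


index = "matschaffer-filebeat-7.7.1-2021.03.21-000095"
alias = "matschaffer-filebeat-7.7.1"

# Print both payloads in the order they would be sent.
print(json.dumps(restore_body([index]), indent=4))
print(json.dumps(read_alias_action(index, alias), indent=4))
```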

It'd be great if restore could be more ILM-aware, such that it won't try to reclaim a write alias that already points to a more current index.

@matschaffer added the >enhancement and needs:triage labels on Jun 9, 2021
@nik9000 added the :Distributed/Snapshot/Restore and team-discuss labels and removed the needs:triage label on Jun 15, 2021
@elasticmachine added the Team:Distributed label on Jun 15, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@DaveCTurner
Contributor

We (the @elastic/es-distributed team) discussed possible solutions in our team meeting today. Our favourite idea was to introduce a new option that would let you preserve the aliases of an existing index rather than overwriting them or clearing them as we do today. The reasoning was that when restoring an index like this you're really trying to put its data back without changing its place in the cluster, so the aliases of the existing index are likely more useful than the aliases in the snapshot.

We discussed changing the default behaviour but decided it'd be surprising for the API to behave differently from today by default. Instead we would expect tooling that restores indices like this to use this new option explicitly.

We also discussed whether to preserve any other metadata (mappings, settings, ...) rather than overwriting them from those in the snapshot but decided that there are too many ways that such a mechanism might lead to operational surprises.

How does that sound @matschaffer?

@matschaffer
Contributor Author

matschaffer commented Jun 17, 2021

Hard to say without a little more detail.

My expectation would be that you have some ability to restore matschaffer-filebeat-7.7.1-2021.03.21-000095 with only the read alias, leaving the write alias pointed at matschaffer-filebeat-7.7.1-2021.03.22-000096. That contrasts with today, where you get either read+write or nothing (via include_aliases: false).

If the new option would do this, then that's probably fine. It'd be good if we made this the default in Kibana's restore UI, or maybe even in Elasticsearch itself.

We see this with some frequency when orchestrating snapshot restore after VM failure on non-HA indices.

@DaveCTurner
Contributor

On closer inspection it seems that include_aliases: false already does what we propose, preserving the aliases of the existing closed index over the top of which we're doing the restore, but the orchestration tooling isn't setting this option so its restores will often fail as described. I believe we should always use include_aliases: false when restoring an index to recover it from some misadventure that left it in red health.
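
As a rough sketch of what an orchestration tool might send when restoring over an existing red index: the plan below closes the index first (an existing index can only be restored over while it is closed) and then restores it with include_aliases set to false so the cluster's current aliases stay in place. The function name `plan_restore_over_existing` and the tuple layout are hypothetical; the repository and snapshot names are taken from the example earlier in this thread.

```python
from typing import Dict, List, Optional, Tuple

# (HTTP method, path, optional JSON body); a hypothetical shape for this sketch.
Request = Tuple[str, str, Optional[Dict]]


def plan_restore_over_existing(repository: str, snapshot: str, index: str) -> List[Request]:
    """List the requests for restoring `index` in place while keeping its current aliases."""
    return [
        # The existing index must be closed before it can be restored over.
        ("POST", f"/{index}/_close", None),
        # Skip the snapshot's aliases so the cluster's current aliases are kept.
        (
            "POST",
            f"/_snapshot/{repository}/{snapshot}/_restore",
            {"indices": [index], "include_aliases": False},
        ),
    ]


plan = plan_restore_over_existing(
    "found-snapshots",
    "cloud-snapshot-2021.03.22-UUID",
    "matschaffer-filebeat-7.7.1-2021.03.21-000095",
)
for method, path, body in plan:
    print(method, path, body)
```

If the index was lost entirely rather than left red, the close step does not apply, and any read alias has to be re-added separately after the restore.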

@matschaffer
Contributor Author

cc @elastic/cloud-orchestration for comment/prioritization

@ean5533

ean5533 commented Jun 21, 2021

I don't have a strong understanding of all the implications here, but if the recommendation from ES is to just set include_aliases: false on all snapshot restores (no conditional logic) then we can do that very easily. cc @anyasabo

@anyasabo

anyasabo commented Jun 21, 2021

Yep +1 here, though Dave, your wording here has me a little concerned.

> I believe we should always use include_aliases: false when restoring an index to recover it from some misadventure that left it in red health.

Should we just always be setting include_aliases: false?

@deckkh

deckkh commented Jul 17, 2021

One additional thing that happens to us after a snapshot restore: by default, the restore also brings back the index's ILM policy, which means that ILM usually kicks in and removes the restored index shortly after the restore has completed, which is very annoying.

We opened a support case on this and pretty much arrived at the conclusion that the snapshot web interface can't be used. Since then we have used Dev Tools for this, which is kinda sad.
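
One possible way to express such a restore from a script is sketched below. It assumes that the restore API's `ignore_index_settings` option can be used to leave out the `index.lifecycle.name` setting, so the restored index comes back without an ILM policy attached; that should be checked against the documentation for the Elasticsearch version in use. The helper name `restore_without_ilm` and the index name `my-restored-index` are hypothetical.

```python
import json
from typing import Dict, List

# Assumption: dropping this setting detaches the restored index from its ILM policy.
ILM_SETTINGS = ["index.lifecycle.name"]


def restore_without_ilm(indices: List[str]) -> Dict:
    """Restore body that skips both the snapshot's aliases and its ILM policy setting."""
    return {
        "indices": indices,
        "include_aliases": False,
        "ignore_index_settings": list(ILM_SETTINGS),
    }


# Hypothetical index name, used only for illustration.
print(json.dumps(restore_without_ilm(["my-restored-index"]), indent=4))
```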


7 participants