
Recovering from 8.0 upgrade failure due to unmigrated 6.x indices #81326

Closed
cjcenizal opened this issue Dec 3, 2021 · 8 comments · Fixed by #82689

@cjcenizal
Contributor

The Stack Management team removed the system index migration feature from Upgrade Assistant in 7.16 (elastic/kibana#119798). During upgrade testing, @LeeDr discovered that users who upgrade from 6.8 to 7.16 and use Upgrade Assistant to prepare their deployment for the next upgrade will skip the system indices migration step, since it is no longer offered. It's reasonable to assume that some portion of these users might believe that they can upgrade to 8.0 at this point, unaware that we expect them to upgrade to 7.17 first.

After upgrading to 8.0, Elasticsearch will fail to start for these users because of the unmigrated 6.8 system indices. This startup failure occurs after ES has updated the keystore to be 8.x compatible. A user who attempts to fix this problem by downgrading to 7.16 will be blocked from doing so due to keystore incompatibility with 7.x (this is my assumption and needs verification). The user is now stuck, unable to complete their upgrade.

We have many users still on 6.x and prior, increasing the risk of users encountering this scenario. @DaveCTurner suggested we file this issue as an 8.0 blocker.

Some solutions suggested by David:

  • Run some checks before doing the reorg
  • Revert the reorg later on
  • Provide some tool support for reverting it

CC @elastic/kibana-stack-management

@cjcenizal changed the title from "Recovering from 8.0 upgrade failure" to "Recovering from 8.0 upgrade failure due to unmigrated system indices" on Dec 3, 2021
@cjcenizal added the Team:Core/Infra label on Dec 3, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

@williamrandolph
Contributor

Yes, I agree that this should be a blocker. I believe our best option is for 8.x installations of Elasticsearch to check for 6.x indices as early as possible, before making any irreversible changes.

@gwbrown
Contributor

gwbrown commented Dec 3, 2021

Agreed that this should be a blocker for 8.0.

The initial ticket only discusses system indices, but we don't handle system indices in any special way at this stage of startup - I would expect to see the same thing if we tried to upgrade a cluster with un-upgraded regular (non-system) indices as well. Have we tried that? Just trying to pin down the scope of this issue.

@williamrandolph
Contributor

williamrandolph commented Dec 9, 2021

I looked at the code.

The keystore can be upgraded from the command line with elasticsearch-keystore upgrade, but we don't emphasize this feature because Elasticsearch itself does the upgrading. The code for upgrading at startup is in KeyStoreWrapper, which is called from the Bootstrap class in order to load secure settings. We load secure settings very early in the startup process, well before we have loaded any index information from disk.

So it seems we need to delay upgrading the keystore but... how long? Do we have to wait until the node has loaded whatever index information it has on disk? Or do we need to wait until the node is part of a healthy cluster? The difficulty from the security perspective is that we are trying to keep all the code that needs the keystore password in one place so that we can avoid holding it in memory for a long time.
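
For reference, here is a simplified sketch of the startup ordering described above. It is not the actual Elasticsearch bootstrap code (the class and methods are hypothetical stand-ins); it only illustrates why the keystore rewrite currently happens before any index metadata can be checked.

```java
// Simplified sketch of the startup ordering described above. Not the actual
// Elasticsearch bootstrap code; the helper methods are hypothetical stand-ins
// for the KeyStoreWrapper/Bootstrap behavior.
public final class StartupOrderSketch {

    public static void main(String[] args) {
        // 1. Secure settings are loaded very early in bootstrap, and the
        //    keystore is rewritten to the current on-disk format at that point.
        loadAndUpgradeKeystore();

        // 2. Only much later does the node read index metadata from the data
        //    path, which is where un-upgraded 6.x indices would be detected.
        checkOnDiskIndexCompatibility();
    }

    private static void loadAndUpgradeKeystore() {
        System.out.println("keystore loaded, decrypted, and rewritten in the 8.x format");
    }

    private static void checkOnDiskIndexCompatibility() {
        System.out.println("index metadata read; 6.x indices would cause a failure here");
    }
}
```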

@DaveCTurner
Contributor

I'm not sure we can delay the rewrite of the keystore - as Ryan notes here we want to do this before installing the security manager because -- apart from at startup -- we don't want anything to be able to write to it. However I don't think that's a big deal, users can re-create the keystore from scratch if needed with the tools that they have today. The big problem is that we effectively run mv ${path.data}/nodes/0/* ${path.data}/ before checking for any unmigrated indices (or other obsolete metadata) and today there is no way to undo that (apart from data-path surgery which we very strongly discourage).
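
To make the one-way nature of that move concrete, here is an illustrative sketch (not the Elasticsearch implementation) of what the data-path flattening effectively does. Once the entries have been moved up a level, nothing on disk records that they used to live under nodes/0, so a 7.x node can no longer find them.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustration only: the effect of the 8.x data-path reorganization, roughly
// "mv ${path.data}/nodes/0/* ${path.data}/". This is not the actual
// Elasticsearch code; it just shows that the change leaves no record behind
// that could be used to undo it.
public final class DataPathReorgSketch {

    static void flattenLegacyNodeDirectory(Path dataPath) throws IOException {
        Path legacyNodeDir = dataPath.resolve("nodes").resolve("0");
        if (Files.isDirectory(legacyNodeDir) == false) {
            return; // already in the 8.x layout
        }
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(legacyNodeDir)) {
            for (Path entry : entries) {
                // move each entry (indices/, _state/, node.lock, ...) up one level
                Files.move(entry, dataPath.resolve(entry.getFileName()));
            }
        }
        Files.delete(legacyNodeDir); // the old nodes/0 directory is gone afterwards
    }
}
```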

@williamrandolph
Contributor

williamrandolph commented Dec 15, 2021

EDIT: This approach has been superseded.

We talked about this in today's Core/Infra team meeting.

At a high level, there are three potential approaches in 8.x:

  1. Reorganize the startup code so that we can detect 6.8 indices on disk and refuse to start before updating the keystore and moving the data directories
  2. Catch the issue where we currently catch it, but provide a downgrade tool that downgrades the keystores and restores the data directory to the 7.x format
  3. Reorganize the startup code to delay destructive changes until we are sure that there are no 6.x shards in the cluster.

(1) and (3) would involve massive changes to the codebase and an extremely large risk of serious bugs, so by process of elimination we are left with (2).

We expect this issue to be rare. In order to encounter it, a user would have to have a 7.x cluster that contains indices created in 6.x, have ignored every warning in our documentation and deprecation endpoints, and have upgraded under specific circumstances. For example, in a large cluster with good replication and a rolling upgrade strategy, one node might get into a bad state, but it can be discarded and recreated without much trouble. However, in a smaller cluster, or a cluster that is upgraded with a full restart, discarding and recreating may not be an option.

The most difficult part of solving this problem will probably be creating "n-2" upgrade tests, which create a 6.x cluster and add test data, then upgrade to 7.x, and finally test different upgrade scenarios to 8.x. Right now we only test upgrades going back one version.

A proposed task list:

  1. Write the "n-2" test framework.
  2. Verify that 8.x nodes will fail when trying to join a cluster that contains 6.x indices
  3. Add code to back up the keystore and cluster state files before making destructive changes, and code to remove those backups once a node has successfully joined a cluster (see the sketch after this list)
  4. Write a downgrade utility that will restore backups of keystore and cluster state, and will put the data directory back in its 7.x format
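
A rough sketch of what step 3 could look like; the file names, backup location, and "_state" layout are assumptions for illustration, not the actual implementation. Before making any destructive change, the node copies the keystore and cluster state files aside, and the downgrade utility from step 4 would restore them and re-nest the data directory under nodes/0.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch of the pre-upgrade backup from step 3 above. File names,
// the backup location, and the directory layout are illustrative assumptions.
public final class PreUpgradeBackupSketch {

    static void backUpBeforeDestructiveChanges(Path configDir, Path dataDir) throws IOException {
        Path backupDir = dataDir.resolve("pre-8x-upgrade-backup"); // hypothetical location
        Files.createDirectories(backupDir);

        // the keystore lives in the config directory
        Path keystore = configDir.resolve("elasticsearch.keystore");
        if (Files.exists(keystore)) {
            Files.copy(keystore, backupDir.resolve("elasticsearch.keystore"),
                    StandardCopyOption.REPLACE_EXISTING);
        }

        // cluster state and node metadata live under the data directory; "_state"
        // is used here as an illustrative example of what would need saving
        Path stateDir = dataDir.resolve("nodes").resolve("0").resolve("_state");
        if (Files.isDirectory(stateDir)) {
            copyRecursively(stateDir, backupDir.resolve("_state"));
        }
    }

    private static void copyRecursively(Path source, Path target) throws IOException {
        Files.createDirectories(target);
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(source)) {
            for (Path entry : entries) {
                Path dest = target.resolve(entry.getFileName().toString());
                if (Files.isDirectory(entry)) {
                    copyRecursively(entry, dest);
                } else {
                    Files.copy(entry, dest, StandardCopyOption.REPLACE_EXISTING);
                }
            }
        }
    }
}
```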

@williamrandolph
Contributor

#81865 will begin to enforce that every upgrade to 8.x goes through 7.17.x. This should make the "unmigrated indices" failure even rarer. It could happen in two cases:

  1. User upgrades from 7.16 or lower to an 8.0.x release specifically.
  2. User upgrades from 7.17 to 8.x but completely ignores the upgrade assistant and deprecation APIs.

It is possible that we could now address these failures by building the rollback logic into 7.17.x, so that a failed upgrade to 8.x could always be addressed by installing 7.17.x over whatever version failed to upgrade.

@williamrandolph changed the title from "Recovering from 8.0 upgrade failure due to unmigrated system indices" to "Recovering from 8.0 upgrade failure due to unmigrated 6.x indices" on Dec 17, 2021
@williamrandolph
Contributor

We have decided on a different approach to this problem, which will also solve #81865. There is a PR for part of it here: #82321

  1. 7.17 (or 7.last) will include the earliest index version in cluster metadata.
  2. 8.x will check to make sure that this value exists (ensuring that we are in fact upgrading from 7.last), and will then check that the earliest index is from 7.x. If either check fails, the node will refuse to start before making any destructive changes to the data directory (see the sketch below). We are also going to de-prioritize protecting the keystore, since we are comfortable with Support manually rebuilding the keystore when needed.
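
A minimal sketch of the check in step 2, using illustrative names rather than the actual Elasticsearch classes: 7.17 records the oldest index version in the cluster metadata it writes to disk, and on startup 8.x reads that marker back and refuses to proceed, before touching the keystore or the data layout, if the marker is missing or predates 7.0.

```java
// Minimal sketch of the startup check in step 2 above. The record type and
// method are illustrative assumptions, not the actual Elasticsearch classes.
public final class OldestIndexVersionCheckSketch {

    /** Hypothetical stand-in for the value that 7.17 writes into cluster metadata. */
    record OldestIndexVersion(int major, int minor) { }

    static void ensureCompatibleWith8x(OldestIndexVersion oldest) {
        if (oldest == null) {
            // no marker means we did not pass through 7.last on the way here
            throw new IllegalStateException(
                "no oldest-index-version marker found; upgrade to 7.17 before moving to 8.x");
        }
        if (oldest.major() < 7) {
            // a 6.x (or older) index is still present and must be reindexed first
            throw new IllegalStateException(
                "cluster contains indices created before 7.0; reindex them on 7.x before upgrading");
        }
        // only now is it safe to continue with the destructive 8.x startup steps
    }
}
```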
