
Recovering from 8.0 upgrade failure due to unmigrated 6.x indices #81326

Closed
cjcenizal opened this issue Dec 3, 2021 · 8 comments · Fixed by #82689

@cjcenizal
Contributor

The Stack Management team removed the system index migration feature from Upgrade Assistant in 7.16 (elastic/kibana#119798). During upgrade testing, @LeeDr discovered that users who upgrade from 6.8 to 7.16 and use Upgrade Assistant to prepare their deployment for the next upgrade will skip the system indices migration step, since it is no longer offered. It's reasonable to assume that some portion of these users might believe that they can upgrade to 8.0 at this point, unaware that we expect them to upgrade to 7.17 first.

After upgrading to 8.0, Elasticsearch will fail to start for these users because of the unmigrated 6.8 system indices. This startup failure occurs after ES has updated the keystore to be 8.x compatible. A user who attempts to fix this problem by downgrading to 7.16 will be blocked from doing so due to keystore incompatibility with 7.x (this is my assumption and needs verification). The user is now stuck, unable to complete their upgrade.

We have many users still on 6.x and prior, increasing the risk of users encountering this scenario. @DaveCTurner suggested we file this issue as an 8.0 blocker.

Some solutions suggested by David:

  • Run some checks before doing the reorg
  • Revert the reorg later on
  • Provide some tool support for reverting it

CC @elastic/kibana-stack-management

@cjcenizal changed the title from "Recovering from 8.0 upgrade failure" to "Recovering from 8.0 upgrade failure due to unmigrated system indices" on Dec 3, 2021
@cjcenizal added the Team:Core/Infra label on Dec 3, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

@williamrandolph
Contributor

Yes, I agree that this should be a blocker. I believe our best option is for 8.x installations of Elasticsearch to check for 6.x indices as early as possible, before making any irreversible changes.

@gwbrown
Contributor

gwbrown commented Dec 3, 2021

Agreed that this should be a blocker for 8.0.

The initial ticket only discusses system indices, but we don't handle system indices in any special way at this stage of startup - I would expect to see the same thing if we tried to upgrade a cluster with un-upgraded regular (non-system) indices as well. Have we tried that? Just trying to pin down the scope of this issue.

@williamrandolph
Contributor

williamrandolph commented Dec 9, 2021

I looked at the code.

The keystore can be upgraded from the command line with elasticsearch-keystore upgrade, but we don't emphasize this feature because Elasticsearch itself does the upgrading. The code for upgrading at startup is in KeyStoreWrapper, which is called from the Bootstrap class in order to load secure settings. We load secure settings very early in the startup process, well before we have loaded any index information from disk.

So it seems we need to delay upgrading the keystore but... how long? Do we have to wait until the node has loaded whatever index information it has on disk? Or do we need to wait until the node is part of a healthy cluster? The difficulty from the security perspective is that we are trying to keep all the code that needs the keystore password in one place so that we can avoid holding it in memory for a long time.
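
For reference, here is a simplified sketch of the startup ordering described above. It is not the actual Elasticsearch bootstrap code (the class and methods are hypothetical stand-ins); it only illustrates why the keystore rewrite currently happens before any index metadata can be checked.

```java
// Simplified sketch of the startup ordering described above. Not the actual
// Elasticsearch bootstrap code; the helper methods are hypothetical stand-ins
// for the KeyStoreWrapper/Bootstrap behavior.
public final class StartupOrderSketch {

    public static void main(String[] args) {
        // 1. Secure settings are loaded very early in bootstrap, and the
        //    keystore is rewritten to the current on-disk format at that point.
        loadAndUpgradeKeystore();

        // 2. Only much later does the node read index metadata from the data
        //    path, which is where un-upgraded 6.x indices would be detected.
        checkOnDiskIndexCompatibility();
    }

    private static void loadAndUpgradeKeystore() {
        System.out.println("keystore loaded, decrypted, and rewritten in the 8.x format");
    }

    private static void checkOnDiskIndexCompatibility() {
        System.out.println("index metadata read; 6.x indices would cause a failure here");
    }
}
```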

@DaveCTurner
Contributor

I'm not sure we can delay the rewrite of the keystore - as Ryan notes here we want to do this before installing the security manager because -- apart from at startup -- we don't want anything to be able to write to it. However I don't think that's a big deal, users can re-create the keystore from scratch if needed with the tools that they have today. The big problem is that we effectively run mv ${path.data}/nodes/0/* ${path.data}/ before checking for any unmigrated indices (or other obsolete metadata) and today there is no way to undo that (apart from data-path surgery which we very strongly discourage).
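
To make the one-way nature of that move concrete, here is an illustrative sketch (not the Elasticsearch implementation) of what the data-path flattening effectively does. Once the entries have been moved up a level, nothing on disk records that they used to live under nodes/0, so a 7.x node can no longer find them.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustration only: the effect of the 8.x data-path reorganization, roughly
// "mv ${path.data}/nodes/0/* ${path.data}/". This is not the actual
// Elasticsearch code; it just shows that the change leaves no record behind
// that could be used to undo it.
public final class DataPathReorgSketch {

    static void flattenLegacyNodeDirectory(Path dataPath) throws IOException {
        Path legacyNodeDir = dataPath.resolve("nodes").resolve("0");
        if (Files.isDirectory(legacyNodeDir) == false) {
            return; // already in the 8.x layout
        }
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(legacyNodeDir)) {
            for (Path entry : entries) {
                // move each entry (indices/, _state/, node.lock, ...) up one level
                Files.move(entry, dataPath.resolve(entry.getFileName()));
            }
        }
        Files.delete(legacyNodeDir); // the old nodes/0 directory is gone afterwards
    }
}
```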

@williamrandolph
Contributor

williamrandolph commented Dec 15, 2021

EDIT: This approach has been superseded.

We talked about this in today's Core/Infra team meeting.

At a high level, there are three potential approaches in 8.x:

  1. Reorganize the startup code so that we can detect 6.8 indices on disk and refuse to start before updating the keystore and moving the data directories
  2. Catch the issue where we currently catch it, but provide a downgrade tool that downgrades the keystores and restores the data directory to the 7.x format
  3. Reorganize the startup code to delay destructive changes until we are sure that there are no 6.x shards in the cluster.

(1) and (3) would involve massive changes to the codebase and an extremely large risk of serious bugs, so by process of elimination we are left with (2).

We expect this issue to be rare. In order to encounter it, a user would have to have a 7.x cluster that contains indices created in 6.x, have ignored every warning in our documentation and deprecation endpoints, and have upgraded under specific circumstances. For example, in a large cluster with good replication and a rolling upgrade strategy, one node might get into a bad state, but it can be discarded and recreated without much trouble. However, in a smaller cluster, or a cluster that is upgraded with a full restart, discarding and recreating may not be an option.

The most difficult part of solving this problem will probably be creating "n-2" upgrade tests, which create a 6.x cluster and add test data, then upgrade to 7.x, and finally test different upgrade scenarios to 8.x. Right now we only test upgrades going back one version.

A proposed task list:

  1. Write the "n-2" test framework.
  2. Verify that 8.x nodes will fail when trying to join a cluster that contains 6.x indices
  3. Add code to back up the keystore and cluster state files before making destructive changes, and code to remove those backups once a node has successfully joined a cluster (see the sketch after this list)
  4. Write a downgrade utility that will restore backups of keystore and cluster state, and will put the data directory back in its 7.x format
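
A rough sketch of what step 3 could look like; the file names, backup location, and "_state" layout are assumptions for illustration, not the actual implementation. Before making any destructive change, the node copies the keystore and cluster state files aside, and the downgrade utility from step 4 would restore them and re-nest the data directory under nodes/0.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch of the pre-upgrade backup from step 3 above. File names,
// the backup location, and the directory layout are illustrative assumptions.
public final class PreUpgradeBackupSketch {

    static void backUpBeforeDestructiveChanges(Path configDir, Path dataDir) throws IOException {
        Path backupDir = dataDir.resolve("pre-8x-upgrade-backup"); // hypothetical location
        Files.createDirectories(backupDir);

        // the keystore lives in the config directory
        Path keystore = configDir.resolve("elasticsearch.keystore");
        if (Files.exists(keystore)) {
            Files.copy(keystore, backupDir.resolve("elasticsearch.keystore"),
                    StandardCopyOption.REPLACE_EXISTING);
        }

        // cluster state and node metadata live under the data directory; "_state"
        // is used here as an illustrative example of what would need saving
        Path stateDir = dataDir.resolve("nodes").resolve("0").resolve("_state");
        if (Files.isDirectory(stateDir)) {
            copyRecursively(stateDir, backupDir.resolve("_state"));
        }
    }

    private static void copyRecursively(Path source, Path target) throws IOException {
        Files.createDirectories(target);
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(source)) {
            for (Path entry : entries) {
                Path dest = target.resolve(entry.getFileName().toString());
                if (Files.isDirectory(entry)) {
                    copyRecursively(entry, dest);
                } else {
                    Files.copy(entry, dest, StandardCopyOption.REPLACE_EXISTING);
                }
            }
        }
    }
}
```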

@williamrandolph
Contributor

#81865 will begin to enforce that every upgrade to 8.x goes through 7.17.x. This should make the "unmigrated indices" failure even rarer. It could happen in two cases:

  1. User upgrades from 7.16 or lower to an 8.0.x release specifically.
  2. User upgrades from 7.17 to 8.x but completely ignores the upgrade assistant and deprecation APIs.

It is possible that we could now address these failures by building the rollback logic into 7.17.x, so that a failed upgrade to 8.x could always be addressed by installing 7.17.x over whatever version failed to upgrade.

@williamrandolph changed the title from "Recovering from 8.0 upgrade failure due to unmigrated system indices" to "Recovering from 8.0 upgrade failure due to unmigrated 6.x indices" on Dec 17, 2021
@williamrandolph
Contributor

We have decided on a different approach to this problem, which will also solve #81865. There is a PR for part of it here: #82321

  1. 7.17 (or 7.last) will include the earliest index version in cluster metadata.
  2. 8.x will check to make sure that this value exists (ensuring that we are in fact upgrading from 7.last), and will then check that the earliest index is from 7.x. If either check fails, the node will refuse to start before making any destructive changes to the data directory (see the sketch below). We are also going to de-prioritize protecting the keystore, since we are comfortable with Support manually rebuilding the keystore when needed.
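
A minimal sketch of the check in step 2, using illustrative names rather than the actual Elasticsearch classes: 7.17 records the oldest index version in the cluster metadata it writes to disk, and on startup 8.x reads that marker back and refuses to proceed, before touching the keystore or the data layout, if the marker is missing or predates 7.0.

```java
// Minimal sketch of the startup check in step 2 above. The record type and
// method are illustrative assumptions, not the actual Elasticsearch classes.
public final class OldestIndexVersionCheckSketch {

    /** Hypothetical stand-in for the value that 7.17 writes into cluster metadata. */
    record OldestIndexVersion(int major, int minor) { }

    static void ensureCompatibleWith8x(OldestIndexVersion oldest) {
        if (oldest == null) {
            // no marker means we did not pass through 7.last on the way here
            throw new IllegalStateException(
                "no oldest-index-version marker found; upgrade to 7.17 before moving to 8.x");
        }
        if (oldest.major() < 7) {
            // a 6.x (or older) index is still present and must be reindexed first
            throw new IllegalStateException(
                "cluster contains indices created before 7.0; reindex them on 7.x before upgrading");
        }
        // only now is it safe to continue with the destructive 8.x startup steps
    }
}
```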
