Perform more startup consistency checks before writing anything to disk #44624

Open

DaveCTurner opened this issue Jul 19, 2019 · 8 comments

Labels: >bug · :Distributed/Cluster Coordination · Team:Distributed

DaveCTurner commented Jul 19, 2019

Today, when a node starts up after an upgrade, it might write upgraded versions of at least these separate structures to disk:

  • the keystore
  • the node metadata file
  • (some portion of) the cluster metadata

None of these have forwards-compatible representations, and all of them are loaded, checked, and then rewritten independently. This can potentially get a node completely stuck in an upgrade:

  • if the node metadata file is invalid (e.g. comes from a version that is too old to support an in-place upgrade) then we do not discover this until after upgrading the keystore to the latest version. This version of the node cannot start up due to the invalid node metadata file, but an attempt to downgrade to the previous working version will also fail because of the upgraded keystore.

  • if the cluster metadata is invalid (e.g. contains an index from an unsupported version) then we do not discover this until after upgrading the keystore and the node metadata files to the latest versions. Again, this version of the node cannot start up due to the invalid cluster metadata, but an attempt to downgrade to the previous working version will also fail because of the upgraded keystore and node metadata files.

One common path into this kind of situation is upgrading without first getting a clean bill of health from the upgrade assistant.

We can make this experience better by performing more consistency checks before writing anything to disk at startup, to avoid blocking a subsequent downgrade in cases where the upgrade is obviously infeasible.
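
To illustrate the idea, here is a minimal sketch of a validate-everything-first startup sequence. The names (`OnDiskStructure`, `StartupUpgrader`) are hypothetical and are not Elasticsearch's actual startup code; the point is only that nothing is rewritten until every structure has passed its checks, so a failed check never blocks a downgrade.

```java
// Hypothetical sketch, not the real Elasticsearch bootstrap: check every on-disk
// structure before mutating any of them.
import java.io.IOException;
import java.nio.file.Path;

interface OnDiskStructure {
    /** Read the structure and fail if it cannot be used by this version. */
    void validate(Path dataPath) throws IOException;

    /** Rewrite the structure in the current on-disk format. */
    void upgrade(Path dataPath) throws IOException;
}

final class StartupUpgrader {
    private final OnDiskStructure[] structures; // e.g. keystore, node metadata, cluster metadata

    StartupUpgrader(OnDiskStructure... structures) {
        this.structures = structures;
    }

    void upgradeAll(Path dataPath) throws IOException {
        // Phase 1: validate everything. No writes happen here, so any failure leaves
        // the data path untouched and a downgrade remains possible.
        for (OnDiskStructure structure : structures) {
            structure.validate(dataPath);
        }
        // Phase 2: only once every structure has passed validation do we rewrite them.
        for (OnDiskStructure structure : structures) {
            structure.upgrade(dataPath);
        }
    }
}
```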

DaveCTurner added the :Core/Infra/Resiliency and :Distributed/Cluster Coordination labels Jul 19, 2019

elasticmachine (Collaborator)

Pinging @elastic/es-distributed

elasticmachine (Collaborator)

Pinging @elastic/es-core-infra

rjernst commented Jul 26, 2019

I agree this is something important we should aim for, but there are some caveats we need to consider regarding the keystore. The keystore is loaded very early in startup, before we have even installed the security manager. This is important, as our security policy only allows reading from the config directory, not writing.

We could add this write permission, but that doesn't seem worth it, and would mean any plugin could overwrite the keystore.

The important thing about security manager installation is that it happens before we have loaded any plugin code. Can the checks on node metadata and cluster metadata be done independently of any services being loaded?
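
For illustration, the read-only nature of the grant can be demonstrated with plain `java.io.FilePermission` objects; the path below is a hypothetical config directory, not taken from the actual Elasticsearch policy files.

```java
// Illustrative only: a recursive "read" grant on the config directory does not
// imply permission to write the keystore file under it.
import java.io.FilePermission;

public class ConfigDirPermissionDemo {
    public static void main(String[] args) {
        String configDir = "/etc/elasticsearch"; // hypothetical config path
        FilePermission granted = new FilePermission(configDir + "/-", "read");

        FilePermission readKeystore = new FilePermission(configDir + "/elasticsearch.keystore", "read");
        FilePermission writeKeystore = new FilePermission(configDir + "/elasticsearch.keystore", "write");

        System.out.println(granted.implies(readKeystore));  // true: reads are covered by the grant
        System.out.println(granted.implies(writeKeystore)); // false: writes would need a broader grant
    }
}
```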

DaveCTurner (Contributor, Author)

It's going to be tricky to do much with the cluster metadata before loading plugins.

Do we need to write to the keystore at all during startup? Could we instead move the responsibility for upgrading an old-format keystore to the elasticsearch-keystore command?

That said, on reflection the keystore might not matter so much here anyway because it's something that a user can create again themselves. It could be enough to emit a more actionable error message telling them to delete and re-create it if it can't be loaded.
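
As a sketch of what such an actionable error could look like (the `KeystoreLoader` and `SecureSettingsStub` names are made up for illustration and are not the real keystore classes):

```java
// Hypothetical sketch: wrap a keystore load failure in a message that tells the user
// how to recover, rather than failing with a bare format-version error.
import java.io.IOException;
import java.nio.file.Path;

final class KeystoreLoader {
    static SecureSettingsStub loadOrExplain(Path configDir) throws IOException {
        Path keystorePath = configDir.resolve("elasticsearch.keystore");
        try {
            return SecureSettingsStub.load(keystorePath);
        } catch (IllegalStateException e) {
            throw new IllegalStateException(
                "cannot load keystore at [" + keystorePath + "]; it may have been written by an "
                    + "incompatible version. It can be deleted and re-created with "
                    + "[elasticsearch-keystore create], re-adding any secure settings it contained.",
                e);
        }
    }
}

// Minimal stand-in so the sketch compiles on its own; not the real secure-settings implementation.
final class SecureSettingsStub {
    static SecureSettingsStub load(Path path) throws IOException {
        throw new IllegalStateException("unsupported keystore format"); // placeholder failure
    }
}
```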

rjernst added a commit to rjernst/elasticsearch that referenced this issue Sep 3, 2019
This commit changes the version bounds of keystore reading to give
better error messages when a user has a too new or too old format.

relates elastic#44624

rjernst commented Sep 3, 2019

> Do we need to write to the keystore at all during startup?

If the keystore does not exist, we auto-create it. This is because we always need the keystore.seed value, and we do not want archive users to need additional setup before running Elasticsearch. We also do a format upgrade, as you mention, which is again important so as not to require additional setup on upgrade, and which would not change the difficulty of downgrading.
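
A rough sketch of that auto-creation behaviour, for illustration only (the real keystore is a binary, optionally password-protected format written by Elasticsearch's keystore code, not a plain text file, and the names below are hypothetical):

```java
// Hypothetical sketch: if no keystore exists, create one containing only a random
// keystore.seed value so no manual setup is required before first start.
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.SecureRandom;
import java.util.Base64;

final class KeystoreBootstrapSketch {
    static void ensureKeystore(Path configDir) throws Exception {
        Path keystorePath = configDir.resolve("elasticsearch.keystore");
        if (Files.exists(keystorePath)) {
            return; // keystore already present; a format upgrade may still happen elsewhere
        }
        byte[] seed = new byte[16];
        new SecureRandom().nextBytes(seed);
        // Illustrative only: the real implementation writes the keystore's binary format.
        Files.writeString(keystorePath, "keystore.seed=" + Base64.getEncoder().encodeToString(seed));
    }
}
```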

> It could be enough to emit a more actionable error message telling them to delete and re-create it if it can't be loaded.

I agree. I opened #46291

rjernst added a commit that referenced this issue Sep 11, 2019
This commit changes the version bounds of keystore reading to give
better error messages when a user has a too new or too old format.

relates #44624

DaveCTurner (Contributor, Author)

We made progress towards fixing this in #50907, which delays writing the node metadata file until after validating and upgrading the cluster metadata, but unfortunately that's not enough: #42489 also moves the contents of the data path around so as to make it incompatible with 7.x, and does so before looking at the cluster metadata.
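
For context, the ordering that #50907 moves towards looks roughly like the sketch below; the class and method names are illustrative stand-ins, not the actual GatewayMetaState or NodeMetadata APIs.

```java
// Hypothetical sketch: validate the on-disk cluster metadata before rewriting the node
// metadata file, so a failed compatibility check leaves the data path downgradable.
import java.io.IOException;
import java.nio.file.Path;

final class NodeStartupSketch {
    void prepareDataPath(Path dataPath) throws IOException {
        ClusterMetadataStub clusterMetadata = ClusterMetadataStub.load(dataPath);
        clusterMetadata.assertCompatibleWithCurrentVersion(); // fails here: nothing has been written yet
        NodeMetadataStub.writeCurrentVersion(dataPath);       // only now is the data path mutated
    }
}

// Minimal stand-ins so the sketch compiles on its own; they do not model the real formats.
final class ClusterMetadataStub {
    static ClusterMetadataStub load(Path dataPath) throws IOException {
        return new ClusterMetadataStub();
    }

    void assertCompatibleWithCurrentVersion() {
        // e.g. reject metadata containing indices from unsupported versions
    }
}

final class NodeMetadataStub {
    static void writeCurrentVersion(Path dataPath) throws IOException {
        // rewrite the node metadata file in the current format
    }
}
```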

rjernst added the Team:Core/Infra and Team:Distributed labels May 4, 2020
rjernst added the needs:triage label Dec 3, 2020
DaveCTurner removed the :Core/Infra/Resiliency label Dec 8, 2020
elasticmachine removed the Team:Core/Infra label Dec 8, 2020