Provide option to allow writes when master is down #60605

Merged
merged 9 commits into from
Aug 12, 2020

Contributor

@ywelsch ywelsch commented Aug 3, 2020

Elasticsearch currently blocks writes by default when a master is unavailable. The cluster.no_master_block setting allows a user to change this behavior to also block reads when a master is unavailable. This PR adds a mode that still allows writes while the master is offline. Writes continue to work as long as they need neither routing table changes nor dynamic mapping updates, since both require the master for consistency.

Eventually we should switch the default of cluster.no_master_block to this new mode.
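
To make the three modes concrete, here is a toy model of which operations can complete without a master. It is only a sketch under stated assumptions, not Elasticsearch code: `Operation` and `can_complete_without_master` are hypothetical names, and the mode strings mirror the `cluster.no_master_block` values `all`, `write`, and the new `metadata_write`.

```python
from dataclasses import dataclass

# Values of cluster.no_master_block; "metadata_write" is the mode added here.
MODES = ("all", "write", "metadata_write")


@dataclass(frozen=True)
class Operation:
    """Hypothetical description of a request, for illustration only."""
    is_write: bool
    needs_routing_change: bool = False  # e.g. a shard copy has to be failed
    needs_mapping_update: bool = False  # e.g. a new field needs a dynamic mapping


def can_complete_without_master(mode: str, op: Operation) -> bool:
    """Return True if the operation can finish while no master is available."""
    if mode not in MODES:
        raise ValueError(f"unknown no_master_block mode: {mode!r}")
    if mode == "all":
        # Reads and writes are both blocked.
        return False
    if not op.is_write:
        # "write" and "metadata_write" leave reads alone.
        return True
    if mode == "write":
        # Current default: every write is blocked.
        return False
    # "metadata_write": only writes that need the master are held back.
    return not (op.needs_routing_change or op.needs_mapping_update)
```

For example, `can_complete_without_master("metadata_write", Operation(is_write=True))` is true, while the same write with `needs_mapping_update=True` is not.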

@ywelsch ywelsch added >enhancement :Distributed/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. v8.0.0 v7.10.0 labels Aug 3, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Cluster Coordination)

@elasticmachine elasticmachine added the Team:Distributed Meta label for distributed team label Aug 3, 2020
Contributor

@henningandersen henningandersen left a comment

When indexing with the block in place, we would previously time out the entire shard or bulk request after the provided timeout (defaulting to 1 minute).

With the new metadata_write block, the write will go through (which is fine), but in case of a shard failure, the request will instead block indefinitely.

I think this has two potential bad effects:

  1. We could build up lots of shard failed requests waiting for this.
  2. When a master comes back, we could have a burst of those sent to master.

I guess the byte-based limiting also bounds 1 and the shard-failed deduplication solves 2, so this is likely not an issue, but I thought I would mention it anyway in case it worries others.

Otherwise looking good to me.
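
The deduplication argument for point 2 can be sketched roughly as follows. This is a simplified illustration, not the actual Elasticsearch implementation: `ShardFailedDeduplicator` and its key shape are hypothetical. The idea is that identical shard-failed requests share one in-flight request, so a returning master sees one request per failed shard copy rather than a burst.

```python
class ShardFailedDeduplicator:
    """Toy deduplicator: one in-flight request per key, many waiters."""

    def __init__(self):
        # key -> callbacks waiting for the master's answer
        self._waiters = {}

    def submit(self, key, callback) -> bool:
        """Register a waiter; return True only if a new request must be sent."""
        if key in self._waiters:
            # An identical request is already in flight: just wait for it.
            self._waiters[key].append(callback)
            return False
        self._waiters[key] = [callback]
        return True

    def complete(self, key, result) -> int:
        """Deliver the result to every waiter for key; return how many were notified."""
        callbacks = self._waiters.pop(key, [])
        for callback in callbacks:
            callback(result)
        return len(callbacks)

    def pending(self) -> int:
        """Number of distinct requests currently in flight."""
        return len(self._waiters)
```

With this shape, any number of failures reported for the same shard copy while the master is away still results in a single request once it returns.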

@ywelsch
Contributor Author

ywelsch commented Aug 12, 2020

As you pointed out, the previous behavior was to unconditionally time out these write requests in the reroute phase after a minute. The new behavior will proceed in the reroute phase, but keep the requests in a "stuck" state until a master is back. Since a lot of requests can already pile up on a node within a minute (more than the node has memory for), I think this should not introduce behavior we have not seen before. The byte-based memory limit for indexing helps not only with this new block, but also with the old blocks. With the write block active (i.e. the current default), many requests can start piling up with no bound at all (each one is turned into a ClusterStateObserver, waiting up to a minute for cluster state updates).
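
As a rough illustration of why a byte-based limit bounds the pile-up regardless of which block is active, consider the sketch below. It is a hypothetical model, not the Elasticsearch indexing limit itself: `IndexingByteLimit` and its methods are invented for this example. Requests reserve their size up front and are rejected once the budget is exhausted, so stuck requests cannot grow past the limit.

```python
class IndexingByteLimit:
    """Toy admission control: cap the total bytes of in-flight write requests."""

    def __init__(self, limit_bytes: int):
        if limit_bytes <= 0:
            raise ValueError("limit_bytes must be positive")
        self.limit_bytes = limit_bytes
        self.in_flight = 0

    def try_acquire(self, request_bytes: int) -> bool:
        """Reserve space for a request; return False to reject it instead of queueing."""
        if self.in_flight + request_bytes > self.limit_bytes:
            return False
        self.in_flight += request_bytes
        return True

    def release(self, request_bytes: int) -> None:
        """Give the bytes back once the request completes, fails, or times out."""
        self.in_flight = max(0, self.in_flight - request_bytes)
```

A request that stays "stuck" while waiting for a master simply holds its reservation, which is what keeps new requests from accumulating without bound.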

Contributor

@henningandersen henningandersen left a comment

LGTM.

@ywelsch ywelsch merged commit 0b517dd into elastic:master Aug 12, 2020
ywelsch added a commit that referenced this pull request Aug 12, 2020
ywelsch added a commit that referenced this pull request Aug 13, 2020
We can't assert on the specific exception, unfortunately.
ywelsch added a commit that referenced this pull request Aug 13, 2020
We can't assert on the specific exception, unfortunately.
Labels
:Distributed/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. >enhancement Team:Distributed Meta label for distributed team v7.10.0 v8.0.0-alpha1