Write shard state metadata as soon as shard is created / initializing #16625

ywelsch · 2016-02-12T08:30:13Z

As we now rely on active allocation ids persisted in the cluster state to select
the primary shard copy, we can write shard state metadata on the allocated node
as soon as the node learns that it has been assigned the shard. This also ensures that
in case of primary relocation, when the relocation target is marked as started
by the master node, the shard state metadata with the correct allocation id has
already been written on the relocation target. Before this change, shard state
metadata was only written once the node knew that the shard was marked as started. If a
failure occurred between the master marking the shard as started and the node
receiving and processing this event, the relation between the shard copy on disk
and the cluster state could get lost. In that case, the shard had to be allocated
manually using the reroute command allocate_stale_primary.

Relates to #14739
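
The timing difference described above can be illustrated with a small model. This is only a sketch under simplifying assumptions, not the Elasticsearch implementation: `ShardStateSketch`, `ShardStateFile`, `applyRoutingOld`, `applyRoutingNew` and `copyMatchesClusterState` are hypothetical names, and the state file is reduced to a single allocation id held in memory.

```java
import java.util.Optional;

// Hypothetical model; none of these names exist in Elasticsearch.
class ShardStateSketch {
    enum RoutingState { INITIALIZING, STARTED }

    // Stand-in for the shard state file on disk; holds only the allocation id.
    static final class ShardStateFile {
        private String allocationId; // null means nothing has been written yet

        void write(String id) { allocationId = id; }

        Optional<String> read() { return Optional.ofNullable(allocationId); }
    }

    // Behaviour before the change: metadata is written only once the node
    // has processed the cluster state that marks the shard as STARTED.
    static void applyRoutingOld(ShardStateFile file, String allocationId, RoutingState state) {
        if (state == RoutingState.STARTED) {
            file.write(allocationId);
        }
    }

    // Behaviour after the change: metadata is written as soon as the node
    // knows about the shard, including while it is still INITIALIZING.
    static void applyRoutingNew(ShardStateFile file, String allocationId, RoutingState state) {
        file.write(allocationId);
    }

    // True if the copy on disk can be matched to an active allocation id
    // recorded in the cluster state.
    static boolean copyMatchesClusterState(ShardStateFile file, String activeAllocationId) {
        return activeAllocationId.equals(file.read().orElse(null));
    }

    public static void main(String[] args) {
        // The master has marked "alloc-1" as started, but the node fails
        // before it processes that event, so it only ever saw INITIALIZING.
        ShardStateFile before = new ShardStateFile();
        applyRoutingOld(before, "alloc-1", RoutingState.INITIALIZING);

        ShardStateFile after = new ShardStateFile();
        applyRoutingNew(after, "alloc-1", RoutingState.INITIALIZING);

        System.out.println("old behaviour matches: " + copyMatchesClusterState(before, "alloc-1"));
        System.out.println("new behaviour matches: " + copyMatchesClusterState(after, "alloc-1"));
    }
}
```

In this model, the copy written under the old behaviour cannot be matched to the active allocation id after the failure, which corresponds to the case where allocate_stale_primary was needed.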

ywelsch · 2016-02-12T08:37:18Z

core/src/main/java/org/elasticsearch/indices/recovery/RecoveryTarget.java

@@ -342,8 +342,6 @@ public void cleanFiles(int totalTranslogOps, Store.MetadataSnapshot sourceMetaDa
        // first, we go and move files that were created with the recovery id suffix to
        // the actual names, its ok if we have a corrupted index here, since we have replicas
        // to recover from in case of a full cluster shutdown just when this code executes...
-        indexShard().deleteShardState(); // we have to delete it first since even if we fail to rename the shard


@bleskes I'm not sure what effect removing this has. I removed it because the shard state metadata was written when the shard was created, then removed again if the shard was a recovery target, and never rewritten because the shard state metadata had not changed from the point of view of IndexShard.persistMetadata(). Since we now write shard state metadata directly, we know that it is up to date before we do recovery (hence no need to delete the shard state?).

This was added as protection against a failure during the rename leaving the shard in a corrupted state (#10053). We later added better checks in another place (#11269). I think it's OK to remove.
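
The interaction discussed in this exchange can be modelled in a few lines. This is a simplified sketch, not the real IndexShard code: `PersistMetadataSketch` and its methods are hypothetical, and the change check is reduced to comparing the allocation id against the last value the object believes it has persisted.

```java
// Hypothetical model of "persist only when changed" combined with a delete
// that bypasses the in-memory view. Not the actual IndexShard implementation.
class PersistMetadataSketch {
    private String onDisk;        // stand-in for the shard state file contents
    private String lastPersisted; // what the shard believes is on disk

    // Writes the allocation id only if it differs from the last persisted one.
    // Returns true if a write happened.
    boolean persistIfChanged(String allocationId) {
        if (allocationId.equals(lastPersisted)) {
            return false; // nothing changed from this object's point of view
        }
        onDisk = allocationId;
        lastPersisted = allocationId;
        return true;
    }

    // Models the removed deleteShardState() call: the file is gone, but the
    // in-memory view is not reset.
    void deleteShardState() {
        onDisk = null;
    }

    String onDisk() {
        return onDisk;
    }

    public static void main(String[] args) {
        PersistMetadataSketch shard = new PersistMetadataSketch();
        shard.persistIfChanged("alloc-1"); // written when the shard is created
        shard.deleteShardState();          // deleted because the shard is a recovery target
        boolean rewritten = shard.persistIfChanged("alloc-1"); // same id, so skipped
        System.out.println("rewritten: " + rewritten + ", on disk: " + shard.onDisk());
    }
}
```

Under these assumptions, keeping the delete while writing metadata at shard creation would leave the recovery target without a state file, which is why the call was dropped.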

ywelsch · 2016-02-22T18:19:35Z

@bleskes ping

bleskes · 2016-02-26T15:35:40Z

core/src/main/java/org/elasticsearch/gateway/PrimaryShardAllocator.java

+                    assert nodeShardState.allocationId() == null : "Allocation id and legacy version cannot be both present";
+                    logger.trace("[{}] on node [{}] has version [{}] of shard", shard, nodeShardState.getNode(), version);
+                } else {
+                    // shard was already selected in a 3.x cluster as best candidate for recovery but did not make it to STARTED state


If I understand this correctly, this part is relevant when we assigned a primary after a cluster upgrade and the shard initialized (and wrote a new state file), but we never got around to activating it before crashing again. If that's correct, can you add this to the comment?

bleskes · 2016-02-29T09:59:13Z

Change looks good to me. Left some suggestions and questions re testing.

ywelsch · 2016-02-29T11:40:03Z

Pushed another commit addressing review comments. Also found a copy-paste bug in a test.

bleskes · 2016-02-29T12:19:10Z

LGTM. Thanks @ywelsch


ywelsch added review v5.0.0-alpha1 labels Feb 12, 2016

ywelsch assigned bleskes Feb 12, 2016

ywelsch reviewed Feb 12, 2016

ywelsch force-pushed the fix/fail-on-persist-shard-state-metadata branch from 9b0988e to 8b252b7 Compare February 12, 2016 08:40

clintongormley added the >enhancement label Feb 13, 2016

ywelsch mentioned this pull request Feb 14, 2016

Allocate primary shard based on allocation IDs #14739

Closed

7 tasks

bleskes reviewed Feb 26, 2016

ywelsch force-pushed the fix/fail-on-persist-shard-state-metadata branch from fe713f0 to ef3f69e Compare February 29, 2016 11:33

ywelsch force-pushed the fix/fail-on-persist-shard-state-metadata branch from ef3f69e to d76161d Compare February 29, 2016 12:49

ywelsch pushed a commit that referenced this pull request Feb 29, 2016

Merge pull request #16625 from ywelsch/fix/fail-on-persist-shard-stat…

7fc9f03

…e-metadata Write shard state metadata as soon as shard is created / initializing

ywelsch merged commit 7fc9f03 into elastic:master Feb 29, 2016

lcawl added :Distributed/Distributed and removed :Allocation labels Feb 13, 2018

clintongormley added :Distributed/Allocation and removed :Distributed/Distributed labels Feb 14, 2018
