Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update resliency page #17586

Merged
merged 1 commit into from
Apr 7, 2016
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
44 changes: 23 additions & 21 deletions docs/resiliency/index.asciidoc
Original file line number Diff line number Diff line change
Expand Up @@ -94,16 +94,35 @@ space. The following issues have been identified:

Other safeguards are tracked in the meta-issue {GIT}11511[#11511].


[float]
=== Relocating shards omitted by reporting infrastructure (STATUS: ONGOING)

Indices stats and indices segments requests reach out to all nodes that have shards of that index. Shards that have relocated from a node
while the stats request arrives will make that part of the request fail and are just ignored in the overall stats result. {GIT}13719[#13719]

[float]
=== Jepsen Test Failures (STATUS: ONGOING)

We have increased our test coverage to include scenarios tested by Jepsen. We make heavy use of randomization to expand on the scenarios that can be tested and to introduce new error conditions. You can follow the work on the master branch of the https://github.com/elastic/elasticsearch/blob/master/core/src/test/java/org/elasticsearch/discovery/DiscoveryWithServiceDisruptionsIT.java[`DiscoveryWithServiceDisruptionsIT` class], where we will add more tests as time progresses.

[float]
=== Document guarantees and handling of failure (STATUS: ONGOING)

This status page is a start, but we can do a better job of explicitly documenting the processes at work in Elasticsearch, and what happens in the case of each type of failure. The plan is to have a test case that validates each behavior under simulated conditions. Every test will document the expected results, the associated test code and an explicit PASS or FAIL status for each simulated case.

== Unreleased

[float]
=== Loss of documents during network partition (STATUS: ONGOING)
=== Loss of documents during network partition (STATUS: UNRELEASED, v5.0.0)

If a network partition separates a node from the master, there is some window of time before the node detects it. The length of the window is dependent on the type of the partition. This window is extremely small if a socket is broken. More adversarial partitions, for example, silently dropping requests without breaking the socket can take longer (up to 3x30s using current defaults).

If the node hosts a primary shard at the moment of partition, and ends up being isolated from the cluster (which could have resulted in {GIT}2488[split-brain] before), some documents that are being indexed into the primary may be lost if they fail to reach one of the allocated replicas (due to the partition) and that replica is later promoted to primary by the master ({GIT}7572[#7572]).
To prevent this situation, the primary needs to wait for the master to acknowledge replica shard failures before acknowledging the write to the client. {GIT}14252[#14252]

[float]
=== Safe primary relocations (STATUS: ONGOING)
=== Safe primary relocations (STATUS: UNRELEASED, v5.0.0)

When primary relocation completes, a cluster state is propagated that deactivates the old primary and marks the new primary as active. As
cluster state changes are not applied synchronously on all nodes, there can be a time interval where the relocation target has processed the
Expand All @@ -117,23 +136,7 @@ on the relocation target, each of the nodes believes the other to be the active
chasing the primary being quickly sent back and forth between the nodes, potentially making them both go OOM. {GIT}12573[#12573]

[float]
=== Relocating shards omitted by reporting infrastructure (STATUS: ONGOING)

Indices stats and indices segments requests reach out to all nodes that have shards of that index. Shards that have relocated from a node
while the stats request arrives will make that part of the request fail and are just ignored in the overall stats result. {GIT}13719[#13719]

[float]
=== Jepsen Test Failures (STATUS: ONGOING)

We have increased our test coverage to include scenarios tested by Jepsen. We make heavy use of randomization to expand on the scenarios that can be tested and to introduce new error conditions. You can follow the work on the master branch of the https://github.com/elastic/elasticsearch/blob/master/core/src/test/java/org/elasticsearch/discovery/DiscoveryWithServiceDisruptionsIT.java[`DiscoveryWithServiceDisruptionsIT` class], where we will add more tests as time progresses.

[float]
=== Document guarantees and handling of failure (STATUS: ONGOING)

This status page is a start, but we can do a better job of explicitly documenting the processes at work in Elasticsearch, and what happens in the case of each type of failure. The plan is to have a test case that validates each behavior under simulated conditions. Every test will document the expected results, the associated test code and an explicit PASS or FAIL status for each simulated case.

[float]
=== Do not allow stale shards to automatically be promoted to primary (STATUS: ONGOING, v5.0.0)
=== Do not allow stale shards to automatically be promoted to primary (STATUS: UNRELEASED, v5.0.0)

In some scenarios, after the loss of all valid copies, a stale replica shard can be automatically assigned as a primary, preferring old data
to no data at all ({GIT}14671[#14671]). This can lead to a loss of acknowledged writes if the valid copies are not lost but are rather
Expand All @@ -143,7 +146,7 @@ for one of the good shard copies to reappear. In case where all good copies are
stale shard copy.

[float]
=== Make index creation resilient to index closing and full cluster crashes (STATUS: ONGOING, v5.0.0)
=== Make index creation resilient to index closing and full cluster crashes (STATUS: UNRELEASED, v5.0.0)

Recovering an index requires a quorum (with an exception for 2) of shard copies to be available to allocate a primary. This means that
a primary cannot be assigned if the cluster dies before enough shards have been allocated ({GIT}9126[#9126]). The same happens if an index
Expand All @@ -153,7 +156,6 @@ recover an index in the presence of a single shard copy. Allocation IDs can also
but none of the shards have been started. If such an index was inadvertently closed before at least one shard could be started, a fresh
shard will be allocated upon reopening the index.

== Unreleased

[float]
=== Use two phase commit for Cluster State publishing (STATUS: UNRELEASED, v5.0.0)
Expand Down