Simplify write failure handling #19105

Merged
merged 29 commits into from Nov 1, 2016

6 participants
@areek
Contributor

areek commented Jun 27, 2016

Currently, any write (e.g. index, delete) operation failure can be categorized as:

  • request failure (e.g. analysis, parsing error, version conflict)
  • transient operation failure (e.g. due to shard initializing, relocation)
  • environment failure (e.g. out of disk, corruption, lucene tragic event)

The main motivation of the PR is to handle these failure types appropriately for a
write request. Each failure type needs to be handled differently:

  • request failure (being request specific) should be replicated and then failed
  • transient failure should be retried (eventually succeeding)
  • environment failure (persistent primary shard failure) should fail the request
    immediately.

Currently, transient operation failures are retried in the replication action, but no distinction
is made between request and environment failures: both fail the write request immediately.

In this PR, we distinguish between request and environment failures for a write operation.
For environment failures, the exception is bubbled up and fails the request. For request
failures, the exception is captured and the replication operation continues (the write is not
performed on replicas when such a failure occurs on the primary). Transient operation failures
are bubbled up to be retried by the replication operation, as before.
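
The three failure categories and their handling can be sketched roughly as follows. This is only an illustration of the policy described above, not the code in this PR: `FailureKind`, `Handling`, `ShardNotReadyException` and `WriteFailurePolicy` are hypothetical names, and the mapping from exception types to categories is an assumption made for the example.

```java
import java.io.UncheckedIOException;

/** Hypothetical categories, mirroring the three kinds of write failure listed above. */
enum FailureKind { REQUEST, TRANSIENT, ENVIRONMENT }

/** Hypothetical handling strategies, one per category. */
enum Handling { CAPTURE_IN_RESULT, RETRY, FAIL_REQUEST }

/** Hypothetical stand-in for a failure caused by shard state (initializing, relocating). */
class ShardNotReadyException extends RuntimeException {
    ShardNotReadyException(String message) {
        super(message);
    }
}

final class WriteFailurePolicy {
    private WriteFailurePolicy() {}

    /** Assumed classification: the exception types used here are illustrative only. */
    static FailureKind classify(RuntimeException e) {
        if (e instanceof ShardNotReadyException) {
            return FailureKind.TRANSIENT;   // shard not ready yet, worth retrying
        }
        if (e instanceof UncheckedIOException) {
            return FailureKind.ENVIRONMENT; // e.g. out of disk, corruption
        }
        return FailureKind.REQUEST;         // e.g. parsing error, version conflict
    }

    /** Maps each category to the handling described in the PR text. */
    static Handling handlingFor(FailureKind kind) {
        switch (kind) {
            case REQUEST:
                return Handling.CAPTURE_IN_RESULT; // captured, replication flow continues
            case TRANSIENT:
                return Handling.RETRY;             // bubbled up and retried
            case ENVIRONMENT:
                return Handling.FAIL_REQUEST;      // bubbled up, request fails
            default:
                throw new AssertionError(kind);
        }
    }
}
```

For example, `WriteFailurePolicy.handlingFor(WriteFailurePolicy.classify(new IllegalArgumentException("bad field")))` yields `CAPTURE_IN_RESULT` under these assumptions.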

Note: #20109 simplifies bulk execution code, which should clean up error handling for shard bulk requests.

@areek areek changed the title from Replicate primary write operation failures to Make primary write operation failure a valid result Jun 29, 2016

areek Jun 29, 2016

Contributor

After discussions with @bleskes, I changed the scope of this PR. The PR now focuses on making primary write operation failures a valid write result, so that the replication operation can handle them explicitly. We can add operation failure replication as needed in the feature/seq_no branch.
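
To make "a failure as a valid write result" concrete, here is a minimal sketch of a result type that carries either a success or a captured failure, plus a replication step that inspects it. `WriteResult` and `ReplicationSketch` are hypothetical names for this example; only `hasFailure()` mirrors a method name visible in the review snippets on this page, and skipping replicas on a primary failure follows the PR description.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical result of a primary write: a success or a captured operation failure. */
final class WriteResult {
    private final long version;      // only meaningful when the write succeeded
    private final Exception failure; // null when the write succeeded

    private WriteResult(long version, Exception failure) {
        this.version = version;
        this.failure = failure;
    }

    static WriteResult success(long version) {
        return new WriteResult(version, null);
    }

    static WriteResult failure(Exception failure) {
        return new WriteResult(-1L, failure);
    }

    boolean hasFailure() {
        return failure != null;
    }

    Exception getFailure() {
        return failure;
    }

    long getVersion() {
        return version;
    }
}

final class ReplicationSketch {
    private ReplicationSketch() {}

    /** Returns the replicas the operation would be performed on. */
    static List<String> replicate(WriteResult primaryResult, List<String> replicas) {
        List<String> performedOn = new ArrayList<>();
        if (primaryResult.hasFailure()) {
            // The failure is an explicit result, not an exception: nothing is thrown,
            // and the operation is simply not performed on the replicas.
            return performedOn;
        }
        performedOn.addAll(replicas);
        return performedOn;
    }
}
```

The point of the design is that the caller branches on `hasFailure()` instead of catching an exception, so a request-level failure no longer looks like a shard failure.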

areek Aug 3, 2016

Contributor

@bleskes I updated the PR to only capture operation-level write failures in write results for primary and replica operations that do not fail the engine and are not dealt with (retried) by the replication action. Could you take a look?

areek Oct 27, 2016

Contributor

Thanks @bleskes for the feedback. I updated the PR, addressing all your comments, and added tests for failure handling in TransportWriteAction and InternalEngine.

@s1monw

I just looked at the exception handling for now and left a single comment

@bleskes

bleskes approved these changes Oct 31, 2016 edited

I left a few minor comments and some requests for new issues. I like how this looks.

It would be great to get an LGTM from @s1monw.

Also, @jpountz can you sanity check https://github.com/elastic/elasticsearch/pull/19105/files#diff-7cdd93f7b049567dc8e2ffc37300852eR169 ? It would be great to have one exception type.

areek Nov 1, 2016

Contributor

@bleskes thanks again for the review, I addressed all the comments and had
one question regarding bwc for the removed exceptions in #19105 (comment)

bleskes Nov 1, 2016

Member

@areek because the github ui sucks, I'm responding here so it will be easy to see:

> if we care about rolling restarts from 5.x

We care! and yeah, I think we can just keep them in there and doc that they can be removed in 7.0 (assertion?)

> The index/delete operation failures were communicated as exceptions, so do we even need a bwc for these failures for serialization/deserialization or can we just rely on generic exception serialization/deserialization like we currently do for persistent engine failures during index/delete operations?

I don't think we can rely on generic exceptions for incoming responses from old nodes. That will ask for a specific exception ID and we won't have it -> boom.
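
The backward-compatibility concern can be illustrated with a toy ID-based exception registry. This is a sketch under assumptions, not the actual Elasticsearch serialization code: `ExceptionRegistry`, `LegacyIndexFailedException` and the ID `80` are hypothetical and chosen only for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical stand-in for an exception type that older nodes can still send. */
class LegacyIndexFailedException extends RuntimeException {
    LegacyIndexFailedException(String message) {
        super(message);
    }
}

/** Toy registry: the wire format carries a numeric ID, and the reader looks it up. */
final class ExceptionRegistry {
    private final Map<Integer, Function<String, RuntimeException>> readers = new HashMap<>();

    void register(int id, Function<String, RuntimeException> reader) {
        readers.put(id, reader);
    }

    /** Rebuilds an exception from its wire ID; an unknown ID cannot be deserialized. */
    RuntimeException read(int id, String message) {
        Function<String, RuntimeException> reader = readers.get(id);
        if (reader == null) {
            throw new IllegalStateException("unknown exception id: " + id);
        }
        return reader.apply(message);
    }
}
```

In this model, a node that keeps `register(80, LegacyIndexFailedException::new)` can read a response from an old node, while a node that dropped the registration fails in `read` with `IllegalStateException`. That is the "boom" described above, and the reason for keeping the old exception types registered during rolling restarts.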

@s1monw

I left some minor comments. I did review the engine changes and glanced at the replication action stuff. I think it LGTM except for the one or two comments I gave. The one about maybeFail is important.

}
private void postIndexing(ParsedDocument doc, long tookInNanos) {

@s1monw

s1monw Nov 1, 2016

Contributor

👍

if (result.hasFailure() == false) {
    if (!index.origin().isRecovery()) {
        long took = result.getTook();
        totalStats.indexMetric.inc(took);

@s1monw

s1monw Nov 1, 2016

Contributor

should we have a write failures statistic too? @bleskes
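
As a rough idea of what such a statistic could look like, here is a hypothetical sketch that counts failed index operations alongside successful ones. `IndexingStatsSketch` and its methods are invented for this example, and ignoring recovery operations for both counters is an assumption; the PR itself does not add this counter.

```java
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical stats holder that tracks failed index operations next to successful ones. */
final class IndexingStatsSketch {
    private final LongAdder indexCount = new LongAdder();
    private final LongAdder indexTimeNanos = new LongAdder();
    private final LongAdder indexFailures = new LongAdder();

    /** Records one index result; recovery operations are ignored (an assumption). */
    void onIndexResult(boolean hasFailure, boolean isRecovery, long tookNanos) {
        if (isRecovery) {
            return;
        }
        if (hasFailure) {
            indexFailures.increment();
        } else {
            indexCount.increment();
            indexTimeNanos.add(tookNanos);
        }
    }

    long getIndexCount() {
        return indexCount.sum();
    }

    long getIndexTimeNanos() {
        return indexTimeNanos.sum();
    }

    long getIndexFailures() {
        return indexFailures.sum();
    }
}
```

`LongAdder` is used here because these counters would be updated from many indexing threads and read only occasionally.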

@areek areek merged commit 03abf4a into elastic:master Nov 1, 2016

2 checks passed

CLA Commit author is a member of Elasticsearch
Details
elasticsearch-ci Build finished.
Details

dakrone added a commit to dakrone/elasticsearch that referenced this pull request Jan 19, 2017

Simplify bulk request execution
This is a bespoke backport of #20109 for 5.x:

Currently, bulk item requests can be any ActionRequest. This PR restricts bulk
item requests to DocumentRequest, which simplifies failure handling during bulk
requests. Additionally, a new enum is added to DocumentRequest to represent the
intended operation of a document request (create, index, update or delete),
which was previously represented with a mix of strings and the index request
operation type.

The index request operation type now reuses the new enum to specify whether the
request should create or index a document. Restricting bulk requests to
DocumentRequest further simplifies execution of shard-level bulk operations, so
that index, delete and update operations use the same failure handling. This PR
also fixes a bug that executed delete operations twice on replica copies while
executing bulk requests.
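
A single operation-type enum of the kind described above might look like the following. This is a hypothetical sketch: the enum in the actual commit may use different constants, ordering and helper methods, and `fromString` and `lowercase` are names invented for this example.

```java
import java.util.Locale;

/** Hypothetical operation-type enum replacing a mix of strings and per-request flags. */
enum OpType {
    INDEX,
    CREATE,
    UPDATE,
    DELETE;

    /** Lowercase name, as it could appear in a bulk action line. */
    String lowercase() {
        return name().toLowerCase(Locale.ROOT);
    }

    /** Parses a case-insensitive operation name; unknown names are rejected. */
    static OpType fromString(String value) {
        // valueOf throws IllegalArgumentException for names that are not constants.
        return OpType.valueOf(value.toUpperCase(Locale.ROOT));
    }
}
```

With one enum, shard-level bulk code can switch on `OpType` in a single place instead of comparing strings for some operations and checking an index-specific flag for others.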

Relates to #19105 and #20109

dakrone added a commit to dakrone/elasticsearch that referenced this pull request Jan 23, 2017

Simplify bulk request execution

areek added a commit to areek/elasticsearch that referenced this pull request Jan 25, 2017

Simplify write failure handling (backport of #19105)

areek added a commit that referenced this pull request Jan 25, 2017

Simplify write failure handling (backport of #19105) (#22778)
* Simplify write failure handling (backport of #19105)

* incorporate feedback