
Fail engine if hit document failure on replicas #43523

Merged: 14 commits merged into elastic:master on Jul 14, 2019
Conversation

dnhatn (Member) commented Jun 24, 2019

An indexing operation on a replica should never fail after it was successfully indexed on the primary. Hence, we should fail the engine if we hit any failure (document-level or tragic) when processing an indexing operation on a replica.

Relates #43228
Closes #40435 (see #40435 (comment)).

We should not generate Noops for failed indexing operations on replicas or followers.
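The diffs quoted later in this thread reference a new helper, treatDocumentFailureAsTragicError. Its body is not shown in the conversation, so the following is only a minimal, self-contained sketch of the idea; the types below (Origin, IndexOp, EngineSketch) are placeholders, not the actual Elasticsearch classes (the real check lives in InternalEngine and uses Engine.Operation.Origin).

```java
// Minimal sketch (placeholder types, not the merged Elasticsearch code) of why a
// document failure is treated as tragic only for non-primary indexing.
enum Origin { PRIMARY, REPLICA }

final class IndexOp {
    final Origin origin;

    IndexOp(Origin origin) {
        this.origin = origin;
    }
}

final class EngineSketch {
    // On the primary, a document failure is reported back to the client as a per-document
    // error (and replicated as a no-op). On a replica, the same document already succeeded
    // on the primary, so a failure would make the replica silently diverge; the PR treats
    // that as tragic and fails the whole engine instead.
    boolean treatDocumentFailureAsTragicError(IndexOp op) {
        return op.origin == Origin.REPLICA;
    }
}
```

Failing the engine rather than recording a per-document failure is what preserves the invariant stated in the description: an operation that succeeded on the primary must not be recorded as failed on a replica.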
@dnhatn dnhatn added >bug :Distributed/Engine Anything around managing Lucene and the Translog in an open shard. v8.0.0 v7.3.0 v7.2.1 labels Jun 24, 2019
@dnhatn dnhatn requested a review from ywelsch June 24, 2019 04:27
@elasticmachine (Collaborator): Pinging @elastic/es-distributed

@ywelsch (Contributor) left a comment


This is a tricky PR. We want to make sure we're not recording an operation as failed in the translog when we fail to add it to Lucene on a replica. Instead, we let the failure bubble up to the primary so that it can fail the replica. We could also consider this as a fatal failure, and directly fail the shard once indexing into Lucene fails.
Another case we need to consider is replaying operations from the translog into Lucene during recovery from store. Should we also fail the primary if we fail to replay an operation? That could make the primary unrecoverable, e.g. because of some incompatibility introduced during an upgrade. If we're lenient there, however, we risk primary and replica going out of sync (if we let the replica locally recover up to the global checkpoint). Perhaps we could allow the shard to be recovered with a force command, which changes the history UUID. I think we need a more comprehensive plan here.

@dnhatn dnhatn changed the title Only generate noop for failed indexing on primary Fail engine if hit document failure on non-primary indexing Jul 10, 2019
@dnhatn dnhatn changed the title Fail engine if hit document failure on non-primary indexing Fail engine if hit failure on non-primary indexing Jul 10, 2019
@dnhatn dnhatn removed the v7.2.1 label Jul 10, 2019
@dnhatn dnhatn changed the title Fail engine if hit failure on non-primary indexing Fail engine if hit document failure on replicas Jul 10, 2019
dnhatn (Member, Author) commented Jul 10, 2019

@ywelsch I've updated this PR so that it now applies only to operations on replicas. Can you please take a look? Thank you!

@dnhatn dnhatn requested a review from ywelsch July 10, 2019 16:55
```diff
@@ -1055,7 +1059,7 @@ private IndexResult indexIntoLucene(Index index, IndexingStrategy plan)
             }
             return new IndexResult(plan.versionForIndexing, index.primaryTerm(), index.seqNo(), plan.currentNotFoundOrDeleted);
         } catch (Exception ex) {
-            if (indexWriter.getTragicException() == null) {
+            if (treatDocumentFailureAsTragicError(index) == false && indexWriter.getTragicException() == null) {
```
ywelsch (Contributor):
Should we treat AlreadyClosedException specially here as well (same as when we index a deletion or noop tombstone)?

dnhatn (Member, Author) commented Jul 12, 2019:


We should not have special treatment for AlreadyClosedException here. If the engine was failed and closed by another thread, it's perfectly fine to bubble up the AlreadyClosedException. In fact, we should bubble it up so we can detect situations where the engine is in a buggy state.

However, we should probably call maybeFailEngine instead of failEngine if the exception is an AlreadyClosedException, to avoid an unnecessary warning log when the engine has already been failed.
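As a hedged illustration of the dispatch described in the previous comment (the actual change is in c632526 and is not reproduced in this thread), the failure handling could distinguish AlreadyClosedException from a tragic replica-side document failure roughly as follows; the class and the method stubs are placeholders, not the merged InternalEngine code.

```java
import org.apache.lucene.store.AlreadyClosedException;

// Sketch only: mirrors the failEngine / maybeFailEngine split discussed above,
// but is not the merged InternalEngine code.
final class FailureHandlingSketch {

    void onIndexFailure(Exception e, boolean documentFailureIsTragic) {
        if (e instanceof AlreadyClosedException) {
            // Another thread already failed and closed the engine: avoid a second
            // "failed engine" warning and let the exception bubble up to the caller.
            maybeFailEngine("index", e);
        } else if (documentFailureIsTragic) {
            // Document failure on a replica: fail the engine so the shard is failed
            // and re-allocated instead of silently diverging from the primary.
            failEngine("index", e);
        } else {
            maybeFailEngine("index", e);
        }
    }

    private void failEngine(String reason, Exception e) {
        // Placeholder: the real method marks the engine as failed and closes it.
    }

    private void maybeFailEngine(String reason, Exception e) {
        // Placeholder: the real method fails the engine only for tragic/corruption errors.
    }
}
```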

ywelsch (Contributor):
I think we should not try to wrap AlreadyClosedException into an IndexResult as we might possibly write it to the translog during closing.

dnhatn (Member, Author):
++. Fixed in c632526.

```diff
@@ -929,7 +929,11 @@ public IndexResult index(Index index) throws IOException {
             }
         } catch (RuntimeException | IOException e) {
             try {
-                maybeFailEngine("index", e);
+                if (treatDocumentFailureAsTragicError(index)) {
+                    failEngine("index", e);
```
ywelsch (Contributor):
Can we add more info about the document to the "reason" string?

dnhatn (Member, Author):
I pushed 8725216

ywelsch (Contributor):
I meant some info about the document itself, i.e. the id of the document (this could help in figuring out why the given failure happened).

dnhatn (Member, Author):
I added more info in c632526.
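For illustration only, a failure reason enriched with document details might be built like the snippet below; the exact fields and format added in c632526 are not shown in this thread, so the helper and its fields (id, origin, sequence number) are an assumption based on ywelsch's suggestion to include the document id.

```java
// Hypothetical helper, not the merged code: builds an engine-failure reason that
// carries enough document context (id, origin, sequence number) to debug the failure.
final class FailureReasonSketch {
    static String indexFailureReason(String docId, String origin, long seqNo) {
        return "index id[" + docId + "] origin[" + origin + "] seq_no[" + seqNo + "]";
    }
}
```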

@dnhatn dnhatn requested a review from ywelsch July 12, 2019 03:49
dnhatn (Member, Author) commented Jul 14, 2019

Thanks @ywelsch.

@dnhatn dnhatn merged commit cb3e0cb into elastic:master Jul 14, 2019
@dnhatn dnhatn deleted the noops branch July 14, 2019 23:25
dnhatn added a commit that referenced this pull request Jul 15, 2019
An indexing operation on a replica should never fail after it was successfully
indexed on the primary. Hence, we should fail the engine if we hit any
failure (document-level or tragic) when processing an indexing operation
on a replica.

Relates #43228
Closes #40435
dnhatn added a commit that referenced this pull request Jul 15, 2019
michalperlak pushed a commit to michalperlak/elasticsearch that referenced this pull request Jul 16, 2019
michalperlak pushed a commit to michalperlak/elasticsearch that referenced this pull request Jul 16, 2019
polyfractal pushed a commit to polyfractal/elasticsearch that referenced this pull request Jul 29, 2019
polyfractal pushed a commit to polyfractal/elasticsearch that referenced this pull request Jul 29, 2019
kovrus added a commit to crate/crate that referenced this pull request Sep 3, 2019
kovrus added a commit to crate/crate that referenced this pull request Sep 3, 2019
kovrus added a commit to crate/crate that referenced this pull request Sep 3, 2019
kovrus added a commit to crate/crate that referenced this pull request Sep 3, 2019
kovrus added a commit to crate/crate that referenced this pull request Sep 3, 2019
kovrus added a commit to crate/crate that referenced this pull request Sep 4, 2019
kovrus added a commit to crate/crate that referenced this pull request Sep 4, 2019
kovrus added a commit to crate/crate that referenced this pull request Sep 4, 2019
kovrus added a commit to crate/crate that referenced this pull request Sep 4, 2019
mergify bot pushed a commit to crate/crate that referenced this pull request Sep 4, 2019
mergify bot pushed a commit to crate/crate that referenced this pull request Sep 24, 2019
Backport of elastic/elasticsearch#43523

(cherry picked from commit 9929cb2)

# Conflicts:
#	blackbox/docs/appendices/release-notes/unreleased.rst
#	es/es-server/src/test/java/org/elasticsearch/index/engine/InternalEngineTests.java
kovrus added a commit to crate/crate that referenced this pull request Sep 24, 2019
kovrus added a commit to crate/crate that referenced this pull request Sep 25, 2019
kovrus added a commit to crate/crate that referenced this pull request Sep 25, 2019
mergify bot pushed a commit to crate/crate that referenced this pull request Sep 25, 2019
dnhatn added a commit that referenced this pull request Apr 15, 2020
Labels
>bug :Distributed/Engine Anything around managing Lucene and the Translog in an open shard. v7.4.0 v8.0.0-alpha1
Development

Successfully merging this pull request may close these issues.

[CI] SearchWithRandomExceptionsIT timeout
5 participants