Disconnect between coordinating node and shards can cause duplicate updates or wrong status code #9967

Open
brwe opened this issue Mar 3, 2015 · 4 comments
Labels
>bug :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. resiliency Team:Distributed Meta label for distributed team

Comments

@brwe
Contributor

brwe commented Mar 3, 2015

A document update can be sent to any node in the cluster (the coordinating node), which forwards it to the node that holds the shard (the executing node). If the update fails, then under certain conditions the coordinating node sends the update again (for example https://github.com/elasticsearch/elasticsearch/blob/master/src/main/java/org/elasticsearch/action/support/replication/TransportShardReplicationOperationAction.java#L447). However, the executing node might already have applied the update, in which case it simply applies it a second time. This is a problem if the update is not idempotent, for example one that increments a counter. The same effect can cause the wrong status code to be returned for versioned indexing requests. A real-world scenario where this can happen is when nodes holding shards without replicas are restarted and updates are sent to the restarted node.
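To make the failure mode concrete, here is a minimal sketch of the retry path, assuming a toy in-memory shard. None of these names (`Shard`, `send_update`, `coordinate_increment`, `DisconnectedError`) come from Elasticsearch; they are hypothetical stand-ins for the executing node and the coordinating node.

```python
class Shard:
    """Toy stand-in for the executing node's shard (not Elasticsearch code)."""

    def __init__(self):
        self.counter = 0

    def apply_increment(self):
        self.counter += 1
        return self.counter


class DisconnectedError(Exception):
    """Raised when the response never reaches the coordinating node."""


def send_update(shard, drop_response):
    # The shard applies the update first; the failure happens on the way back.
    result = shard.apply_increment()
    if drop_response:
        raise DisconnectedError("response lost after the update was applied")
    return result


def coordinate_increment(shard, drop_first_response):
    """Hypothetical coordinating node: retry once on a disconnect."""
    try:
        return send_update(shard, drop_response=drop_first_response)
    except DisconnectedError:
        # The coordinator cannot tell "never applied" from "applied, ack lost".
        return send_update(shard, drop_response=False)


shard = Shard()
coordinate_increment(shard, drop_first_response=True)
print(shard.counter)  # 2, although the client asked for a single increment
```

The retry is indistinguishable from a fresh request on the shard side, which is why a non-idempotent update such as a counter increment ends up applied twice.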

@erik777

erik777 commented Jul 11, 2017

This can be an issue for an incremental counter.

@jasontedor jasontedor added :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. and removed :Core/Infra/Core Core issues without another label labels Mar 14, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

bleskes added a commit that referenced this issue May 6, 2018
The test indexes new documents and is thus correct in testing that the response result
is `CREATED`. Sadly we can't guarantee exactly once delivery just yet.

Relates #9967

Closes #21658
@rjernst rjernst added the Team:Distributed Meta label for distributed team label May 4, 2020
@Leaf-Lin
Contributor

Leaf-Lin commented Dec 3, 2021

We discussed this in the distributed team meeting. It surfaced while we were reviewing the resiliency page https://www.elastic.co/guide/en/elasticsearch/resiliency/current/index.html:

Better request retry mechanism when nodes are disconnected (STATUS: ONGOING)

If the node holding a primary shard is disconnected for whatever reason, the coordinating node retries the request on the same or a new primary shard. In certain rare conditions, where the node disconnects and immediately reconnects, it is possible that the original request has already been successfully applied but has not been reported, resulting in duplicate requests. This is particularly true when retrying bulk requests, where some actions may have completed and some may not have.

An optimization which disabled the existence check for documents indexed with auto-generated IDs could result in the creation of duplicate documents. This optimization has been removed. #9468 (STATUS: DONE, v1.5.0)
Further issues remain with the retry mechanism:

  • Unversioned index requests could increment the _version twice, obscuring a created status.
  • Versioned index requests could return a conflict exception, even though they were applied correctly.
  • Update requests could be applied twice.

See #9967. (STATUS: ONGOING)
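The first two remaining issues can be reproduced with a toy versioning model. This is only a sketch under assumptions: `ToyShard`, `VersionConflict`, and `index_with_lost_ack` are hypothetical names, and the version check is a simplified stand-in for what the primary shard does, not Elasticsearch's actual logic.

```python
class VersionConflict(Exception):
    """Raised when the expected version does not match the stored one."""


class ToyShard:
    """Toy model of a primary shard's versioning (an assumption, not ES code)."""

    def __init__(self):
        self.docs = {}  # doc_id -> (version, source)

    def index(self, doc_id, source, expected_version=None):
        current = self.docs.get(doc_id)
        current_version = current[0] if current else 0
        if expected_version is not None and expected_version != current_version:
            raise VersionConflict(
                f"expected {expected_version}, found {current_version}"
            )
        new_version = current_version + 1
        self.docs[doc_id] = (new_version, source)
        return {"version": new_version, "created": current is None}


def index_with_lost_ack(shard, doc_id, source, expected_version=None):
    """Apply the request, pretend the response was lost, then retry it."""
    shard.index(doc_id, source, expected_version)  # applied, ack lost
    return shard.index(doc_id, source, expected_version)  # the retry


# Unversioned request: the retry bumps the version and hides the creation.
print(index_with_lost_ack(ToyShard(), "1", {"n": 1}))
# {'version': 2, 'created': False}

# Versioned request: the first attempt succeeds, the retry reports a conflict.
shard = ToyShard()
try:
    index_with_lost_ack(shard, "1", {"n": 1}, expected_version=0)
except VersionConflict as exc:
    print("conflict:", exc, "| stored:", shard.docs["1"])
```

In both cases the document is stored correctly after the first attempt; only the response seen by the client is misleading.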

There is a possible solution that adds an extra round trip, but it would hurt performance. Because the issue is rare and its impact is small, the cost of the solution would fall on the common case for little benefit.

@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)


9 participants