Disconnect between coordinating node and shards can cause duplicate updates or wrong status code #9967

brwe · 2015-03-03T17:17:12Z

A document update can be sent to any node in the cluster (the coordinating node), which forwards it to the node that holds the shard (the executing node). If the update fails, then under certain conditions the coordinating node tries to send the update again (for example https://github.com/elasticsearch/elasticsearch/blob/master/src/main/java/org/elasticsearch/action/support/replication/TransportShardReplicationOperationAction.java#L447). However, the executing node might already have applied the update and will then simply apply it again. This is problematic if the update was, for example, incrementing a counter. The same effect might cause the wrong status code to be returned for versioned indexing requests. A real-world scenario where this can happen is when nodes that hold shards without replicas are restarted and updates are sent to the restarted node.
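To make the failure mode concrete, here is a minimal Python sketch. It is a toy model and not Elasticsearch code: `ExecutingNode`, `apply_increment`, and `coordinate_increment` are hypothetical names, and the single retry on a connection error is an assumption standing in for the real retry conditions.

```python
class ExecutingNode:
    """Toy stand-in for the node that holds the shard (hypothetical)."""

    def __init__(self):
        self.counter = 0

    def apply_increment(self, drop_response=False):
        # The update is applied to the shard first...
        self.counter += 1
        if drop_response:
            # ...and the connection drops before the response is sent back.
            raise ConnectionError("disconnected before the response arrived")
        return self.counter


def coordinate_increment(node, fail_first_response):
    """Toy coordinating node: retries once when the connection fails."""
    try:
        return node.apply_increment(drop_response=fail_first_response)
    except ConnectionError:
        # The coordinating node cannot tell whether the update was applied,
        # so it sends the same update again.
        return node.apply_increment()


node = ExecutingNode()
coordinate_increment(node, fail_first_response=True)
print(node.counter)  # 2, although the client requested a single increment
```

The update is not idempotent, so a retry after a lost response is indistinguishable from a second, separate request.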

erik777 · 2017-07-11T04:36:21Z

This can be an issue for an incremental counter.

elasticmachine · 2018-03-14T02:50:32Z

Pinging @elastic/es-distributed

The test indexes new documents and is thus correct in testing that the response result is `CREATED`. Sadly we can't guarantee exactly-once delivery just yet. Relates #9967 Closes #21658

Leaf-Lin · 2021-12-03T04:42:03Z

We discussed this in the distributed team meeting. It surfaced while we were reviewing the page https://www.elastic.co/guide/en/elasticsearch/resiliency/current/index.html:

Better request retry mechanism when nodes are disconnected (STATUS: ONGOING)

If the node holding a primary shard is disconnected for whatever reason, the coordinating node retries the request on the same or a new primary shard. In certain rare conditions, where the node disconnects and immediately reconnects, it is possible that the original request has already been successfully applied but has not been reported, resulting in duplicate requests. This is particularly true when retrying bulk requests, where some actions may have completed and some may not have.

An optimization which disabled the existence check for documents indexed with auto-generated IDs could result in the creation of duplicate documents. This optimization has been removed. #9468 (STATUS: DONE, v1.5.0)
Further issues remain with the retry mechanism:

Unversioned index requests could increment the _version twice, obscuring a created status.

Versioned index requests could return a conflict exception, even though they were applied correctly.

Update requests could be applied twice.

See #9967. (STATUS: ONGOING)
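The first two remaining issues can be illustrated with a small Python sketch. This is a simplified model under stated assumptions, not the Elasticsearch implementation: `Shard`, `VersionConflict`, and `index_with_lost_response` are hypothetical names, version 0 stands for "document does not exist yet", and the retry is assumed to resend the identical request.

```python
class VersionConflict(Exception):
    pass


class Shard:
    """Toy primary shard: maps document id -> (version, source)."""

    def __init__(self):
        self.docs = {}

    def index(self, doc_id, source, expected_version=None):
        current = self.docs.get(doc_id, (0, None))[0]
        if expected_version is not None and expected_version != current:
            raise VersionConflict(f"expected {expected_version}, found {current}")
        self.docs[doc_id] = (current + 1, source)
        return "created" if current == 0 else "updated"


def index_with_lost_response(shard, doc_id, source, expected_version=None):
    """The first attempt is applied but its response is lost; the client
    only ever sees the outcome of the retry."""
    shard.index(doc_id, source, expected_version)          # applied, response lost
    return shard.index(doc_id, source, expected_version)   # retry of the same request


# Unversioned request: the version is incremented twice and "created" is hidden.
unversioned = Shard()
unversioned_result = index_with_lost_response(unversioned, "1", {"n": 1})
print(unversioned_result, unversioned.docs["1"][0])  # updated 2

# Versioned request: the retry conflicts with the write it already made.
versioned = Shard()
try:
    index_with_lost_response(versioned, "1", {"n": 1}, expected_version=0)
    versioned_result = "ok"
except VersionConflict:
    versioned_result = "conflict"
print(versioned_result, versioned.docs["1"][0])  # conflict 1
```

In both cases the document ends up indexed, but the response the client receives describes the retry rather than the original request.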

There is a possible solution that uses an extra round trip, but it would hurt performance. As the issue is rare and its impact is small, applying the solution would impose that cost on the common case for little benefit.

elasticsearchmachine · 2022-07-29T11:41:24Z

Pinging @elastic/es-distributed (Team:Distributed)

clintongormley added the >bug, :Core/Infra/Core, and resiliency labels Mar 3, 2015

clintongormley assigned bleskes Dec 5, 2015

jasontedor added the :Distributed/Distributed label and removed the :Core/Infra/Core label Mar 14, 2018

bleskes mentioned this issue May 6, 2018

[CI] Indexing operation returns UPDATED instead of CREATED (testAckedIndexing) #21658

Closed

rjernst added the Team:Distributed label May 4, 2020

Leaf-Lin added the team-discuss label Nov 11, 2021

Leaf-Lin removed the team-discuss label Dec 3, 2021

DaveCTurner unassigned bleskes Jul 29, 2022
