Disconnect between coordinating node and shards can cause duplicate updates or wrong status code #9967
This can be an issue for an incremental counter.
Pinging @elastic/es-distributed
We discussed this within the distributed team meeting. It was surfaced while reviewing the page https://www.elastic.co/guide/en/elasticsearch/resiliency/current/index.html:
There is a possible solution involving an extra round trip, but it would hurt performance. Since the issue is rare and its impact is small, applying the solution would impose a cost on the common case.
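To make the trade-off concrete, here is a hypothetical sketch (not the actual Elasticsearch protocol, and all names are invented) of what the extra round trip could look like: the executing node remembers request IDs it has already applied, and before retrying, the coordinating node asks whether the request went through rather than blindly re-executing it.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration of the "extra round trip" idea: dedup by
// request ID on the executing node. Names like alreadyApplied/apply are
// invented for this sketch, not real Elasticsearch APIs.
public class IdempotentRetry {
    final Set<String> applied = new HashSet<>();       // per-shard dedup state
    final AtomicInteger counter = new AtomicInteger(0); // the document's counter

    // The extra round trip: coordinating node asks before retrying.
    boolean alreadyApplied(String requestId) {
        return applied.contains(requestId);
    }

    // Apply the increment at most once per request ID.
    void apply(String requestId) {
        if (applied.add(requestId)) {
            counter.incrementAndGet();
        }
    }

    static int run() {
        IdempotentRetry shard = new IdempotentRetry();
        shard.apply("req-1");                     // first attempt applied, but ACK lost
        if (!shard.alreadyApplied("req-1")) {     // retry path checks first
            shard.apply("req-1");                 // not reached: duplicate suppressed
        }
        return shard.counter.get();               // 1, as the client intended
    }

    public static void main(String[] args) {
        System.out.println(IdempotentRetry.run());
    }
}
```

The extra `alreadyApplied` check is the additional network hop the comment refers to: it is paid on every retry (and requires the shard to track applied IDs), which is why the team judged it too costly for such a rare failure.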
Pinging @elastic/es-distributed (Team:Distributed)
A document update can be sent to any node in the cluster (the coordinating node), which forwards it to the node that holds the shard (the executing node). If the update fails, then under certain conditions the coordinating node retries the update (for example https://github.com/elasticsearch/elasticsearch/blob/master/src/main/java/org/elasticsearch/action/support/replication/TransportShardReplicationOperationAction.java#L447). However, the executing node might already have applied the update, and it will then simply apply it again. This is problematic if the update was, for example, incrementing a counter. The same effect can cause the wrong status code to be returned for versioned indexing requests. A real-world scenario where this can happen is when nodes holding shards without replicas are restarted and updates are sent to the restarted node.
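The failure mode above can be simulated in a few lines. This is a toy model (not Elasticsearch code): the shard-side increment succeeds, but the acknowledgment back to the coordinating node is lost, so the coordinating node retries and the non-idempotent update is applied twice.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Toy simulation of the bug: a retry after a lost ACK duplicates a
// non-idempotent "increment counter" update.
public class RetryDuplication {

    // Executing node applies the update; the return value models whether
    // the ACK reached the coordinating node (false = response lost).
    static boolean applyUpdate(AtomicInteger counter, boolean ackLost) {
        counter.incrementAndGet(); // the shard-side effect happens either way
        return !ackLost;
    }

    // One client request for "+1", with a lost ACK followed by a retry.
    static int run() {
        AtomicInteger counter = new AtomicInteger(0);
        boolean acked = applyUpdate(counter, true);  // applied, but ACK lost
        if (!acked) {
            applyUpdate(counter, false);             // retry re-applies it
        }
        return counter.get();                        // 2, though the client asked once
    }

    public static void main(String[] args) {
        System.out.println(RetryDuplication.run());
    }
}
```

The client issued a single `+1`, yet the counter advanced by 2: from the coordinating node's perspective the first attempt "failed", so the retry looks safe even though the shard had already applied it.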