-
Notifications
You must be signed in to change notification settings - Fork 24.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bulk operation fail to replicate operations when a mapping update times out #30379
Conversation
we may fail to properly replicate operation when a mapping update on master fails. If a bulk operations needs a mapping update half way, it will send a request to the master before continuing to index the operations. If that request times out or isn't acked (i.e., even one node in the cluster didn't process it within 30s), we end up throwing the exception and aborting the entire bulk. This is a problem because all operations that were processed so far are not replicated any more to the replicas. Although these operations were never "acked" to the user (we threw an error) it cause the local checkpoint on the replicas to lag (on 6.x) and the primary and replica to diverge. This PR changes the logic to treat *any* mapping update failure as a document level failure, meaning only the relevant indexing operation will fail. Backport of elastic#30244
Pinging @elastic/es-distributed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
import java.util.concurrent.CountDownLatch; | ||
import java.util.concurrent.atomic.AtomicReference; | ||
|
||
public class BlockMasterServiceOnMaster extends SingleNodeDisruption { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
5.x does not have a separate MasterService. Maybe just use BlockClusterStateProcessing
instead of introducing this class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good suggestion. Will do.
Thanks @ywelsch |
Starting with the refactoring in #22778 (released in 5.3)
we may fail to properly replicate operation when a mapping update on master fails.
If a bulk operations needs a mapping update half way, it will send a request to the
master before continuing to index the operations. If that request times out or isn't
acked (i.e., even one node in the cluster didn't process it within 30s), we end up
throwing the exception and aborting the entire bulk. This is a problem because all
operations that were processed so far are not replicated any more to the replicas.
Although these operations were never "acked" to the user (we threw an error) it
cause the local checkpoint on the replicas to lag (on 6.x) and the primary and replica
to diverge.
This PR changes the logic to treat any mapping update failure as a document level
failure, meaning only the relevant indexing operation will fail.
Back port of #30244