Only run retention lease actions on active primary #40386
In some cases, a request to perform a retention lease action can arrive on a primary shard before it is active. In this case, the primary shard would not yet be in primary mode, tripping an assertion in the replication tracker. Instead, we should not attempt to perform such actions on an initializing shard. This commit addresses this by not returning the primary shard in the single shard iterator if the primary shard is not yet active.
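The routing-side check described above can be sketched roughly as follows. This is a minimal illustration only: `SketchShardRouting` and `SketchRoutingResolver` are hypothetical stand-ins invented for this example, not the real Elasticsearch routing classes, and the actual change works against the shard routing table.

```java
import java.util.Collections;
import java.util.Iterator;

// Hypothetical stand-in for a shard routing entry; not an Elasticsearch class.
final class SketchShardRouting {
    final String nodeId;
    final boolean primary;
    final boolean active;

    SketchShardRouting(String nodeId, boolean primary, boolean active) {
        this.nodeId = nodeId;
        this.primary = primary;
        this.active = active;
    }
}

// Hypothetical resolver that mirrors the idea in the description.
final class SketchRoutingResolver {
    // Returns an iterator over the primary only if it is an active primary.
    // Otherwise the iterator is empty, so the caller has no shard copy to target.
    static Iterator<SketchShardRouting> activePrimaryOnly(SketchShardRouting shard) {
        if (shard.primary && shard.active) {
            return Collections.singletonList(shard).iterator();
        }
        return Collections.emptyIterator();
    }
}
```

With this shape, a request resolved against an initializing primary finds nothing to iterate over and is never sent to that shard.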
Pinging @elastic/es-distributed
Another option that I considered is rejecting this when acquiring the permit on a shard that is not yet active, but this approach seems preferable to me (note that we manually handle inactive shards elsewhere, such as in the reroute phase of a replication action).
LGTM.
@elasticmachine test this please
@elasticmachine run elasticsearch-ci/1
.shardRoutingTable(request.concreteIndex(), request.request().getShardId().id())
.primaryShardIt();
.shardRoutingTable(request.concreteIndex(), request.request().getShardId().id());
if (shardRoutingTable.primaryShard().active()) {
TransportSingleShardAction doesn't do our usual "chase the shard" pattern where we re-resolve shards on each node until we find a place where the shard is locally available. This means that if the coordinating node thinks the shard is active but the node with the shard didn't yet process the shard activation cluster state, I think this still goes wrong (i.e., the primary would not be in primary mode, which is activated when the cluster state is processed). I hope I'm wrong and please let me know what I'm missing.
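To make the concern concrete, here is a toy model of the race. Everything in it is an assumption made for illustration: `SketchClusterStateView`, `SketchPrimaryShard`, and `SketchRace` are hypothetical names, and the real cluster state application path is far more involved.

```java
// Hypothetical snapshot of what one node believes about the primary; not an Elasticsearch class.
final class SketchClusterStateView {
    final long version;
    final boolean primaryActive;

    SketchClusterStateView(long version, boolean primaryActive) {
        this.version = version;
        this.primaryActive = primaryActive;
    }
}

// Hypothetical shard that enters primary mode only when its own node applies the activating state.
final class SketchPrimaryShard {
    private long appliedVersion = 0;
    private boolean primaryMode = false;

    void applyClusterState(SketchClusterStateView state) {
        if (state.version <= appliedVersion) {
            return; // ignore stale or already applied states
        }
        appliedVersion = state.version;
        if (state.primaryActive) {
            primaryMode = true;
        }
    }

    boolean inPrimaryMode() {
        return primaryMode;
    }
}

final class SketchRace {
    // True when the coordinator's check passes although the shard's node is still behind.
    static boolean requestArrivesTooEarly(SketchClusterStateView coordinatorView, SketchPrimaryShard shard) {
        return coordinatorView.primaryActive && !shard.inPrimaryMode();
    }
}
```

In this model the coordinator-side check succeeds on the newer state while the shard has applied only the older one, which is the window the comment above is worried about.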
Good catch. In this case, I think that we should use the other approach that I considered. I cannot think of any situation where we would want to acquire a permit on a non-active primary.
That was actually why I went looking - my instinct was that this should be done under a permit and that the permit shouldn't be granted on a primary that is not yet initialized. We currently don't do that, so it requires a much bigger change/vision. We could also add a targeted check in asyncShardOperation that the shard is active before performing the operation. @ywelsch any thoughts?
RetentionLeaseActions and TransportForgetFollowerAction can also possibly violate the assertion in acquirePrimaryOperationPermit that the shard is actually a primary (by the time the request arrives, the primary could have failed and a replica allocated instead). Ensuring this was previously left to the caller of this method.
We can explore changing acquirePrimaryOperationPermit and acquireAllPrimaryOperationsPermits to throw appropriate exceptions if the shard is not in primary mode (e.g. replica or initializing/relocated primary). TRA can then react to an IndexShardRelocatedException to delegate to the relocation target.
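A rough sketch of what "throw instead of assert" could look like is below. It is not the IndexShard implementation: `SketchPermits`, `SketchShardMode`, `SketchNotPrimaryException`, and `SketchRelocatedException` are hypothetical names made up for this example, and a plain semaphore stands in for the real permit mechanism.

```java
import java.util.concurrent.Semaphore;

// Hypothetical exception types; not Elasticsearch classes.
final class SketchNotPrimaryException extends RuntimeException {
    SketchNotPrimaryException(String message) {
        super(message);
    }
}

final class SketchRelocatedException extends RuntimeException {
    SketchRelocatedException(String message) {
        super(message);
    }
}

enum SketchShardMode { REPLICA, INITIALIZING_PRIMARY, ACTIVE_PRIMARY, RELOCATED_PRIMARY }

final class SketchPermits {
    private static final int TOTAL_PERMITS = 16; // arbitrary size for the sketch
    private final Semaphore permits = new Semaphore(TOTAL_PERMITS);
    private volatile SketchShardMode mode;

    SketchPermits(SketchShardMode mode) {
        this.mode = mode;
    }

    int availablePermits() {
        return permits.availablePermits();
    }

    // Acquires a permit and returns a handle that releases it.
    // The mode is checked while holding the permit, and the permit is given back
    // before an exception is thrown, so a rejected caller never leaks a permit.
    Runnable acquirePrimaryPermit() {
        permits.acquireUninterruptibly();
        SketchShardMode current = mode;
        if (current == SketchShardMode.ACTIVE_PRIMARY) {
            return permits::release;
        }
        permits.release();
        if (current == SketchShardMode.RELOCATED_PRIMARY) {
            // A caller could react to this by delegating to the relocation target.
            throw new SketchRelocatedException("primary has relocated");
        }
        throw new SketchNotPrimaryException("shard is not in primary mode: " + current);
    }
}
```

The point of the distinct exception for the relocated case is that the caller can tell "retry elsewhere" apart from "this copy was never a usable primary".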
That takes it one level higher - ensuring that the primary is actually an active primary (rather than just an active shard). SGTM.
I thought about the targeted check there, but I realized earlier that there is another operation prone to this problem: forget follower, which is not a single shard action. That’s why I now lean towards a global approach.
A failure relating to this: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.8+bwc-tests/60/console.
@jasontedor @ywelsch I think we need to act on this.
Closes #40089
Closes #40373