Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Verify AllocationIDs in replication actions #20320

Merged
merged 18 commits into from Sep 6, 2016

Conversation

Projects
None yet
5 participants
@bleskes
Copy link
Member

commented Sep 4, 2016

Replicated operation consist of a routing action (the original), which is in charge of sending the operation to the primary shard, a primary action which executes the operation on the resolved primary and replica actions which performs the operation on a specific replica. This commit adds the targeted shard's allocation id to the primary and replica actions and makes sure that those match the shard the actions end up executing on.

This helps preventing extremely rare failure mode where a shard moves off a node and back to it, all between an action is sent and the time it's processed.

For example:

  1. Primary action is sent to a relocating primary on node A.
  2. The primary finishes relocation to node B and start relocating back.
  3. The relocation back gets to the phase and opens up the target engine, on the original node, node A.
  4. The primary action is executed on the target engine before the relocation finishes, at which the shard copy on node B is still the official primary - i.e., it is executed on the wrong primary.

RequestWithAllocationID(Supplier<R> requestSupplier) {
request = requestSupplier.get();
allocationId = null;

This comment has been minimized.

Copy link
@s1monw

s1monw Sep 5, 2016

Contributor

what is the invariant that this can be null?

This comment has been minimized.

Copy link
@s1monw

s1monw Sep 5, 2016

Contributor

oh that is for deserialization :( bummer can you document it?

This comment has been minimized.

Copy link
@bleskes

bleskes Sep 5, 2016

Author Member

yeah :(. maybe we should add a variant of registerRequestHandler which takes a function of a stream and returns a request.

}

RequestWithAllocationID(R request, String allocationId) {
this.request = request;

This comment has been minimized.

Copy link
@s1monw

s1monw Sep 5, 2016

Contributor

can we use Objects.requireNonNull() here for both request and allocation ID

@@ -930,6 +962,69 @@ public void onFailure(Exception shardFailedError) {
}
}

/** a wrapper class to encapsulate a request when being sent to a specific allocation id **/
final class RequestWithAllocationID<R extends TransportRequest> extends TransportRequest {

This comment has been minimized.

Copy link
@s1monw

s1monw Sep 5, 2016

Contributor

can we call this maybe ConcreteShardRequest or something like this?

This comment has been minimized.

Copy link
@bleskes

bleskes Sep 5, 2016

Author Member

yeah - I couldn't find a good name so I prefer a really concrete name.

This comment has been minimized.

Copy link
@s1monw

s1monw Sep 5, 2016

Contributor

ConcreteShardRequest is much better than RequestWithAllocationID ? lets name classes about what they are not what properties they hold

This comment has been minimized.

Copy link
@bleskes

bleskes Sep 5, 2016

Author Member

to me ConcreteShardRequest is meaningless unless you know about this PR and its context, which all of us will soon forget. If you feel ConcreteShardRequest is a much better name, I will happily make you happy :)

@s1monw

This comment has been minimized.

Copy link
Contributor

commented Sep 5, 2016

left some suggestions LGTM in general

ActionListener<Releasable> callback = (ActionListener<Releasable>) invocation.getArguments()[1];
final long primaryTerm = indexShard.getPrimaryTerm();
if (term < primaryTerm) {
throw new IllegalArgumentException(LoggerMessageFormat.format("{} operation term [{}] is too old (current [{}])",

This comment has been minimized.

Copy link
@jasontedor

jasontedor Sep 5, 2016

Member

Can we not add another use of LoggerMessageFormat; I'd like to remove it?

This comment has been minimized.

Copy link
@bleskes

bleskes Sep 5, 2016

Author Member

I copied from another place - do you have a decent suggestion for an alternative (the obvious one is chaining strings)

This comment has been minimized.

Copy link
@jasontedor

jasontedor Sep 5, 2016

Member

String.format(Locale.ROOT, "%s operation term [%d] is too old (current [%d])", shardId, term, primaryTerm)

final class RequestWithAllocationID<R extends TransportRequest> extends TransportRequest {

/** {@link AllocationId#getId()} of the shard this request is sent to **/
private String allocationId;

This comment has been minimized.

Copy link
@jasontedor

jasontedor Sep 5, 2016

Member

Can we call this targetAllocationID?

return request;
}

public String getAllocationId() {

This comment has been minimized.

Copy link
@jasontedor

jasontedor Sep 5, 2016

Member

Can we call this getTargetAllocationID?

This comment has been minimized.

Copy link
@bleskes

bleskes Sep 5, 2016

Author Member

changed

@@ -725,6 +765,110 @@ public void testDefaultWaitForActiveShardsUsesIndexSetting() throws Exception {
assertEquals(ActiveShardCount.from(requestWaitForActiveShards), request.waitForActiveShards());
}

/** test that a primary request is reject if it arrives at a shard with a wrong allocation id */

This comment has been minimized.

Copy link
@jasontedor

jasontedor Sep 5, 2016

Member

Nit: reject -> rejected

This comment has been minimized.

Copy link
@bleskes

bleskes Sep 5, 2016

Author Member

changed

fail("using a wrong aid didn't fail the operation");
} catch (ExecutionException execException) {
Throwable throwable = execException.getCause();
logger.debug((Supplier<?>) () -> new ParameterizedMessage("got exception e"), throwable);

This comment has been minimized.

Copy link
@jasontedor

jasontedor Sep 5, 2016

Member

No need to be fancy here, just: logger.debug("got exception", throwable);.

This comment has been minimized.

Copy link
@bleskes

bleskes Sep 5, 2016

Author Member

yep. Had some formatting before..

}
}

class AsyncPrimaryAction extends AbstractRunnable implements ActionListener<PrimaryShardReference> {

private final Request request;
/** allocationId of the shard this request is meant for */
private final String allocationId;

This comment has been minimized.

Copy link
@jasontedor

jasontedor Sep 5, 2016

Member

Can we call this targetAllocationID?

This comment has been minimized.

Copy link
@bleskes

bleskes Sep 5, 2016

Author Member

changed

private final TransportChannel channel;
private final ReplicationTask replicationTask;

AsyncPrimaryAction(Request request, TransportChannel channel, ReplicationTask replicationTask) {
AsyncPrimaryAction(Request request, String allocationId, TransportChannel channel, ReplicationTask replicationTask) {

This comment has been minimized.

Copy link
@jasontedor

jasontedor Sep 5, 2016

Member

Can we call this constructor parameter targetAllocationID?

This comment has been minimized.

Copy link
@bleskes

bleskes Sep 5, 2016

Author Member

yep

This comment has been minimized.

Copy link
@jasontedor

jasontedor Sep 5, 2016

Member

I think you missed this one, and the corresponding field (it looks like you edit the comment on the field though).

This comment has been minimized.

Copy link
@bleskes

bleskes Sep 5, 2016

Author Member

indeed. Fixed.

On 05 Sep 2016, at 20:55, Jason Tedor notifications@github.com wrote:

In core/src/main/java/org/elasticsearch/action/support/replication/TransportReplicationAction.java:

     private final TransportChannel channel;
     private final ReplicationTask replicationTask;
  •    AsyncPrimaryAction(Request request, TransportChannel channel, ReplicationTask replicationTask) {
    
  •    AsyncPrimaryAction(Request request, String allocationId, TransportChannel channel, ReplicationTask replicationTask) {
    

I think you missed this one, and the corresponding field (it looks like you edit the comment on the field though).


You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub, or mute the thread.

new AsyncReplicaAction(request, channel, (ReplicationTask) task).run();
public void messageReceived(RequestWithAllocationID<ReplicaRequest> request, TransportChannel channel, Task task)
throws Exception {
new AsyncReplicaAction(request.request, request.allocationId, channel, (ReplicationTask) task).run();

This comment has been minimized.

Copy link
@jasontedor

jasontedor Sep 5, 2016

Member

Maybe the parameter request needs a better name or the field request on the RequestWithAllocationID needs a better name? It's request.request that is triggering this comment.

This comment has been minimized.

Copy link
@bleskes

bleskes Sep 5, 2016

Author Member

I renamed the request to requestWithAID

@@ -417,6 +432,8 @@ public RetryOnReplicaException(StreamInput in) throws IOException {

private final class AsyncReplicaAction extends AbstractRunnable implements ActionListener<Releasable> {
private final ReplicaRequest request;
// allocation id of the replica this request is meant for
private final String allocationId;

This comment has been minimized.

Copy link
@jasontedor

jasontedor Sep 5, 2016

Member

Can we call this targetAllocationID?

This comment has been minimized.

Copy link
@bleskes

bleskes Sep 5, 2016

Author Member

yep

@@ -426,10 +443,11 @@ public RetryOnReplicaException(StreamInput in) throws IOException {
// something we want to avoid at all costs
private final ClusterStateObserver observer = new ClusterStateObserver(clusterService, null, logger, threadPool.getThreadContext());

AsyncReplicaAction(ReplicaRequest request, TransportChannel channel, ReplicationTask task) {
AsyncReplicaAction(ReplicaRequest request, String allocationId, TransportChannel channel, ReplicationTask task) {

This comment has been minimized.

Copy link
@jasontedor

jasontedor Sep 5, 2016

Member

Can we rename this constructor parameter to targetAllocationID?

This comment has been minimized.

Copy link
@bleskes

bleskes Sep 5, 2016

Author Member

done

}
}

/** test that a replica request is reject if it arrives at a shard with a wrong allocation id */

This comment has been minimized.

Copy link
@jasontedor

jasontedor Sep 5, 2016

Member

Nit: reject -> rejected

@jasontedor

This comment has been minimized.

Copy link
Member

commented Sep 5, 2016

I left a few comments, but it looks good.

allocationId = null;
}

RequestWithAllocationID(R request, String allocationId) {

This comment has been minimized.

Copy link
@jasontedor

jasontedor Sep 5, 2016

Member

I guess this constructor parameter should be targetAllocationID too to be consistent with my other suggestions.

@bleskes

This comment has been minimized.

Copy link
Member Author

commented Sep 5, 2016

thx @s1monw , @jasontedor . I pushed a commit addressing all comments

@s1monw

This comment has been minimized.

Copy link
Contributor

commented Sep 5, 2016

I don't get why we call these properties targetAllocationID? what's wrong with allocationId?

@bleskes

This comment has been minimized.

Copy link
Member Author

commented Sep 5, 2016

I don't get why we call these properties targetAllocationID? what's wrong with allocationId?

Because @jasontedor asked.

@s1monw

This comment has been minimized.

Copy link
Contributor

commented Sep 5, 2016

Because @jasontedor asked.

can we keep it simple?

@s1monw

This comment has been minimized.

Copy link
Contributor

commented Sep 6, 2016

@bleskes I don't wanna be in the way for such an improvement just because of some internal debatable naming. lets get it in!

bleskes added some commits Sep 6, 2016

@bleskes bleskes merged commit c56cd46 into elastic:master Sep 6, 2016

1 of 2 checks passed

elasticsearch-ci Build started sha1 is merged.
Details
CLA Commit author is a member of Elasticsearch
Details

@bleskes bleskes deleted the bleskes:replication_with_aid branch Sep 6, 2016

@bleskes

This comment has been minimized.

Copy link
Member Author

commented Sep 6, 2016

Thx @s1monw and @jasontedor for the review.

MaineC pushed a commit to MaineC/elasticsearch that referenced this pull request Sep 7, 2016

Verify AllocationIDs in replication actions (elastic#20320)
Replicated operation consist of a routing action (the original), which is in charge of sending the operation to the primary shard, a primary action which executes the operation on the resolved primary and replica actions which performs the operation on a specific replica. This commit adds the targeted shard's allocation id to the primary and replica actions and makes sure that those match the shard the actions end up executing on.

This helps preventing extremely rare failure mode where a shard moves off a node and back to it, all between an action is sent and the time it's processed. 

For example:
1) Primary action is sent to a relocating primary on node A.
2) The primary finishes relocation to node B and start relocating back.
3) The relocation back gets to the phase and opens up the target engine, on the original node, node A.
4) The primary action is executed on the target engine before the relocation finishes, at which the shard copy on node B is still the official primary - i.e., it is executed on the wrong primary.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.