Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change certain replica failures not to fail the replica shard #22874

Merged

Conversation

Projects
None yet
3 participants
@dakrone
Copy link
Member

commented Jan 30, 2017

This changes the way that replica failures are handled such that not all
failures will cause the replica shard to be failed or marked as stale.

In some cases such as refresh operations, or global checkpoint syncs, it is
"okay" for the operation to fail without the shard being failed (because no data
is out of sync). In these cases, instead of failing the shard we should simply
fail the operation, and, in the event it is a user-facing operation, return a
5xx response code including the shard-specific failures.

This was accomplished by having two forms of the Replicas proxy, one that is
for non-write operations that does not fail the shard, and one that is for write
operations that will fail the shard when an operation fails.

Relates to #10708

@dakrone dakrone requested a review from bleskes Jan 30, 2017

@dakrone dakrone force-pushed the dakrone:replica-failures-dont-always-fail-shard branch Jan 31, 2017

@bleskes
Copy link
Member

left a comment

Thx @dakrone. I left a bunch of initial comments. Maybe I miss something obvious but where is the logic to respond with an error if one of the shard copies fail (except in the TransportWriteAction)?

core/src/main/java/org/elasticsearch/action/support/replication/ReplicationOperation.java Outdated
ReplicationOperation.this::onPrimaryDemoted,
throwable -> decPendingAndFinishIfNeeded()
);
final boolean shardWasFailed = replicasProxy.failShardIfNeeded(shard, replicaRequest.primaryTerm(), message,

This comment has been minimized.

Copy link
@bleskes

bleskes Feb 1, 2017

Member

I'd rather not introduce more code paths. I would vote to always have the callback pattern. If people have nothing to do they can call onSuccess on the same thread.

This comment has been minimized.

Copy link
@dakrone

dakrone Feb 1, 2017

Author Member

Okay, I pushed 3589d9c39d7030b9d7817764626520cb976ba9b2 that switches this to use the callback

core/src/main/java/org/elasticsearch/action/support/replication/ReplicationOperation.java Outdated
@@ -209,14 +209,20 @@ public void onFailure(Exception replicaException) {
shardReplicaFailures.add(new ReplicationResponse.ShardInfo.Failure(
shard.shardId(), shard.currentNodeId(), replicaException, restStatus, false));
String message = String.format(Locale.ROOT, "failed to perform %s on replica %s", opType, shard);
// TODO: Boaz wanted to get rid of this warning, should we?
logger.warn(

This comment has been minimized.

Copy link
@bleskes

bleskes Feb 1, 2017

Member

we plan to use this for background operations like the global check point sync, where it's not a big deal if this happen. I don't want the logs to have warning in them. Instead, implementations (i.e., TransportWriteAction) of failShardIfNeeded can log a warning if they are going to fail the shard

This comment has been minimized.

Copy link
@dakrone

dakrone Feb 1, 2017

Author Member

Okay, I pushed ccda991ee569490ce2ced3f5bff7763e867e20ac

core/src/main/java/org/elasticsearch/action/support/replication/TransportReplicationAction.java Outdated
* interface that performs the actual {@code ReplicaRequest} on the replica
* shards. It also encapsulates the logic required for failing the replica
* if deemed necessary as well as marking it as stale when needed.
*/
final class ReplicasProxy implements ReplicationOperation.Replicas<ReplicaRequest> {

This comment has been minimized.

Copy link
@bleskes

bleskes Feb 1, 2017

Member

why make this final? WriteReplicasProxy can share a lot of it's logic with this one?

This comment has been minimized.

Copy link
@dakrone

dakrone Feb 1, 2017

Author Member

I originally had them sharing knowledge (subclassing), however, ReplicasProxy used parts of TransportReplicationAction and made WriteActionReplicasProxy not able to subclass it well.

This comment has been minimized.

Copy link
@bleskes

bleskes Feb 2, 2017

Member

This seems to work fine for me?

diff --git a/core/src/main/java/org/elasticsearch/action/support/replication/TransportReplicationAction.java b/core/src/main/java/org/elasticsearch/action/support/replication/TransportReplicationAction.java
index db037ad..2cd1d8e 100644
--- a/core/src/main/java/org/elasticsearch/action/support/replication/TransportReplicationAction.java
+++ b/core/src/main/java/org/elasticsearch/action/support/replication/TransportReplicationAction.java
@@ -1040,7 +1040,7 @@ public abstract class TransportReplicationAction<
      * shards. It also encapsulates the logic required for failing the replica
      * if deemed necessary as well as marking it as stale when needed.
      */
-    final class ReplicasProxy implements ReplicationOperation.Replicas<ReplicaRequest> {
+    class ReplicasProxy implements ReplicationOperation.Replicas<ReplicaRequest> {
 
         @Override
         public void performOn(ShardRouting replica, ReplicaRequest request, ActionListener<ReplicationOperation.ReplicaResponse> listener) {
diff --git a/core/src/main/java/org/elasticsearch/action/support/replication/TransportWriteAction.java b/core/src/main/java/org/elasticsearch/action/support/replication/TransportWriteAction.java
index ac79a8c..8a168ca 100644
--- a/core/src/main/java/org/elasticsearch/action/support/replication/TransportWriteAction.java
+++ b/core/src/main/java/org/elasticsearch/action/support/replication/TransportWriteAction.java
@@ -74,7 +74,7 @@ public abstract class TransportWriteAction<
 
     @Override
     protected ReplicationOperation.Replicas newReplicasProxy() {
-        return new WriteActionReplicasProxy(shardStateAction);
+        return new WriteActionReplicasProxy();
     }
 
     /**
@@ -335,26 +335,7 @@ public abstract class TransportWriteAction<
      * replicas, where a failure to execute the operation should fail
      * the replica shard and/or mark the replica as stale.
      */
-    final class WriteActionReplicasProxy implements ReplicationOperation.Replicas<ReplicaRequest> {
-
-        private final ShardStateAction shardStateAction;
-
-        WriteActionReplicasProxy(ShardStateAction shardStateAction) {
-            this.shardStateAction = shardStateAction;
-        }
-
-        @Override
-        public void performOn(ShardRouting replica, ReplicaRequest request, ActionListener<ReplicationOperation.ReplicaResponse> listener) {
-            String nodeId = replica.currentNodeId();
-            final DiscoveryNode node = clusterService.state().nodes().get(nodeId);
-            if (node == null) {
-                listener.onFailure(new NoNodeAvailableException("unknown node [" + nodeId + "]"));
-                return;
-            }
-            final ConcreteShardRequest<ReplicaRequest> concreteShardRequest =
-                    new ConcreteShardRequest<>(request, replica.allocationId().getId());
-            sendReplicaRequest(concreteShardRequest, node, listener);
-        }
+    class WriteActionReplicasProxy extends ReplicasProxy {
 
         @Override
         public void failShardIfNeeded(ShardRouting replica, long primaryTerm, String message, Exception exception,

This comment has been minimized.

Copy link
@dakrone

dakrone Feb 2, 2017

Author Member

@bleskes that's actually a scary thing, that "works" for Intellij's compiler (it reports no compilation errors), but if you actually run gradle compileJava:

/home/hinmanm/es/elasticsearch/core/src/main/java/org/elasticsearch/action/support/replication/TransportWriteAction.java:347: error: name clash: failShardIfNeeded(ShardRouting,long,String,Exception,Runnable,Consumer<Exception>,Consumer<Exception>) in TransportWriteAction.WriteActionReplicasProxy and failShardIfNeeded(ShardRouting,long,String,Exception,Runnable,Consumer<Exception>,Consumer<Exception>) in TransportReplicationAction.ReplicasProxy have the same erasure, yet neither overrides the other
        public void failShardIfNeeded(ShardRouting replica, long primaryTerm, String message, Exception exception,
                    ^
/home/hinmanm/es/elasticsearch/core/src/main/java/org/elasticsearch/action/support/replication/TransportWriteAction.java:346: error: method does not override or implement a method from a supertype
        @Override
        ^

It fails because of type erasure (I think due to the crazy generics that are in use), I tried a lot of different ways to get around it but I wasn't able to find a way.

core/src/main/java/org/elasticsearch/action/support/replication/TransportReplicationAction.java Outdated
private final ClusterService clusterService;
private final TransportReplicationAction replicationAction;

ReplicasProxy(ClusterService clusterService, TransportReplicationAction replicationAction) {

This comment has been minimized.

Copy link
@bleskes

bleskes Feb 1, 2017

Member

this is an inner class anyway, why pass these along? If you want to make it a static inner class, I'm good with it, although I'm not sure if it's worth the extra clutter.

This comment has been minimized.

Copy link
@dakrone

dakrone Feb 1, 2017

Author Member

I originally tried to factor this out into a non-inner class, but turned out that the generics prevented it, so I will removed these.

This comment has been minimized.

Copy link
@dakrone

dakrone Feb 1, 2017

Author Member

I pushed b5e91badc401a5aab1f20209ce0096881b07992f

core/src/main/java/org/elasticsearch/action/support/replication/ReplicationOperation.java Outdated
*/
void failShard(ShardRouting replica, long primaryTerm, String message, Exception exception, Runnable onSuccess,
Consumer<Exception> onPrimaryDemoted, Consumer<Exception> onIgnoredFailure);
void markShardCopyAsStaleIfNeeded(ShardId shardId, String allocationId, long primaryTerm, Runnable onSuccess,

This comment has been minimized.

Copy link
@bleskes

bleskes Feb 1, 2017

Member

why do we need two of these? can't we have just markShardCopyAsStaleIfNeeded? we can rename markUnavailableShardsAsStale to markUnavailableShardsAsStaleIfNeeded

This comment has been minimized.

Copy link
@dakrone

dakrone Feb 1, 2017

Author Member

We need two because there are situations when a replica shard needs to be marked as stale without a write action, I originally had this and then the tests that restart ES exposed that they still need to be able to mark it as stale without any actions.

core/src/main/java/org/elasticsearch/action/support/replication/TransportReplicationAction.java Outdated
Consumer<Exception> onPrimaryDemoted, Consumer<Exception> onIgnoredFailure) {
// Just like with failing the shard, it should only be marked as
// stale when write operations fail, so for regular operations
// don't mark it as stale when not necessary.
}

@Override
public void markShardCopyAsStale(ShardId shardId, String allocationId, long primaryTerm, Runnable onSuccess,
Consumer<Exception> onPrimaryDemoted, Consumer<Exception> onIgnoredFailure) {
shardStateAction.remoteShardFailed(shardId, allocationId, primaryTerm, "mark copy as stale", null,

This comment has been minimized.

Copy link
@bleskes

bleskes Feb 1, 2017

Member

this logic belongs in transportWriteAction

This comment has been minimized.

Copy link
@dakrone

dakrone Feb 1, 2017

Author Member

As I mentioned in my previous comment, we still need a way to mark the replica as stale when the shard is unavailable.

What we don't need is the "if needed" version, because we only do this when the shard is not available, not when an operation fails.

I've removed the "if needed" version of this in 1f58744362b756bc9ef6f6bfb412c2836890d8f7

This comment has been minimized.

Copy link
@bleskes

bleskes Feb 2, 2017

Member

We need two because there are situations when a replica shard needs to be marked as stale without a write action

Which situations do you mean?

core/src/main/java/org/elasticsearch/index/seqno/GlobalCheckpointSyncAction.java Outdated
public class GlobalCheckpointSyncAction extends TransportReplicationAction<GlobalCheckpointSyncAction.PrimaryRequest,
GlobalCheckpointSyncAction.ReplicaRequest, ReplicationResponse> {
public class GlobalCheckpointSyncAction extends TransportReplicationAction<GlobalCheckpointSyncAction.GlobalCheckpointPrimaryRequest,
GlobalCheckpointSyncAction.GlobalCheckpointReplicaRequest, ReplicationResponse> {

This comment has been minimized.

Copy link
@bleskes

bleskes Feb 1, 2017

Member

why rename this , it's already scoped under GlobalCheckpointSyncAction. it gets so long...

This comment has been minimized.

Copy link
@dakrone

dakrone Feb 1, 2017

Author Member

I pushed 68d0ddd456d5e42db40c06261af8e8a19c201533 for this.

@dakrone

This comment has been minimized.

Copy link
Member Author

commented Feb 1, 2017

Maybe I miss something obvious but where is the logic to respond with an error if one of the shard copies fail (except in the TransportWriteAction)?

That logic is in the BroadcastShardResponse now, (getStatus()), which is used on the REST side. The error reporting was already there, this just changes the response to be a 500 instead of 200 in the event there is an error.

@dakrone

This comment has been minimized.

Copy link
Member Author

commented Feb 2, 2017

Thanks @bleskes, I think I pushed commits and added comments for all of your feedback so far.

@bleskes
Copy link
Member

left a comment

Thx @dakrone . I responded.

@@ -71,6 +74,17 @@ public int getFailedShards() {
}

/**
* The REST status that should be used for the response
*/
public RestStatus getStatus() {

This comment has been minimized.

Copy link
@bleskes

bleskes Feb 2, 2017

Member

Since this is effects all response that inherit from BroadcastResponse - can you check that all of them do the right thing and filter shard not available failures? might be good to add an assertion in the constructors, looking at all the exceptions this get.

This comment has been minimized.

Copy link
@dakrone

dakrone Feb 2, 2017

Author Member

Certainly, I added an assert in f79c0f724819ff428c7c2c1d9bc2701521d96a81

core/src/main/java/org/elasticsearch/action/support/broadcast/BroadcastResponse.java Outdated
*/
public RestStatus getStatus() {
if (failedShards > 0) {
return RestStatus.INTERNAL_SERVER_ERROR;

This comment has been minimized.

Copy link
@bleskes

bleskes Feb 2, 2017

Member

shall we make an heroic effort to take the worst of all underlying rest status? maybe just the first ?

This comment has been minimized.

Copy link
@dakrone

dakrone Feb 2, 2017

Author Member

shall we make an heroic effort to take the worst of all underlying rest status? maybe just the first ?

I don't know, how do you define "worst"? Is a 403 worse than a 500 or is a 500 worse? We can totally go with first though, if that's desired.

This comment has been minimized.

Copy link
@bleskes

bleskes Feb 3, 2017

Member

I think 5xx (server error) is worst than 4xx, but let's just go with the first.

core/src/main/java/org/elasticsearch/action/support/replication/ReplicationOperation.java Outdated
Request extends ReplicationRequest<Request>,
ReplicaRequest extends ReplicationRequest<ReplicaRequest>,
PrimaryResultT extends PrimaryResult<ReplicaRequest>
R extends ReplicationRequest<R>,

This comment has been minimized.

Copy link
@bleskes

bleskes Feb 2, 2017

Member

nit: Why do we call this R and use an T suffix on the rest?

This comment has been minimized.

Copy link
@dakrone

dakrone Feb 2, 2017

Author Member

I have seen two standard "formats", one is the single char format, which the JDK uses a lot like Collection<E>, so that's why the R. For the others, PrimaryResult as PR I could do, but I figured since it wasn't a single char I'd suffix it with T for type (another format I've seen in codebases).

I'm happy to change it to R, and PR if you think that would be clearer

This comment has been minimized.

Copy link
@bleskes

bleskes Feb 3, 2017

Member

I would go with a T suffix - i.e., RequestT

core/src/main/java/org/elasticsearch/action/support/replication/TransportReplicationAction.java Outdated
return new ReplicationOperation<>(request, primaryShardReference, listener,
executeOnReplicas, replicasProxy, clusterService::state, logger, actionName
);
return new ReplicationOperation<Request, ReplicaRequest, PrimaryResult<ReplicaRequest, Response>>(request,

This comment has been minimized.

Copy link
@bleskes

bleskes Feb 2, 2017

Member

out of curiosity - why does <> not work?

This comment has been minimized.

Copy link
@dakrone

dakrone Feb 2, 2017

Author Member

This was added unnecessarily, I've removed it in 5eafb1866e85b994fafd178d72c873bba9838e71

core/src/main/java/org/elasticsearch/action/support/replication/TransportReplicationAction.java Outdated
* interface that performs the actual {@code ReplicaRequest} on the replica
* shards. It also encapsulates the logic required for failing the replica
* if deemed necessary as well as marking it as stale when needed.
*/
final class ReplicasProxy implements ReplicationOperation.Replicas<ReplicaRequest> {

This comment has been minimized.

Copy link
@bleskes

bleskes Feb 2, 2017

Member

This seems to work fine for me?

diff --git a/core/src/main/java/org/elasticsearch/action/support/replication/TransportReplicationAction.java b/core/src/main/java/org/elasticsearch/action/support/replication/TransportReplicationAction.java
index db037ad..2cd1d8e 100644
--- a/core/src/main/java/org/elasticsearch/action/support/replication/TransportReplicationAction.java
+++ b/core/src/main/java/org/elasticsearch/action/support/replication/TransportReplicationAction.java
@@ -1040,7 +1040,7 @@ public abstract class TransportReplicationAction<
      * shards. It also encapsulates the logic required for failing the replica
      * if deemed necessary as well as marking it as stale when needed.
      */
-    final class ReplicasProxy implements ReplicationOperation.Replicas<ReplicaRequest> {
+    class ReplicasProxy implements ReplicationOperation.Replicas<ReplicaRequest> {
 
         @Override
         public void performOn(ShardRouting replica, ReplicaRequest request, ActionListener<ReplicationOperation.ReplicaResponse> listener) {
diff --git a/core/src/main/java/org/elasticsearch/action/support/replication/TransportWriteAction.java b/core/src/main/java/org/elasticsearch/action/support/replication/TransportWriteAction.java
index ac79a8c..8a168ca 100644
--- a/core/src/main/java/org/elasticsearch/action/support/replication/TransportWriteAction.java
+++ b/core/src/main/java/org/elasticsearch/action/support/replication/TransportWriteAction.java
@@ -74,7 +74,7 @@ public abstract class TransportWriteAction<
 
     @Override
     protected ReplicationOperation.Replicas newReplicasProxy() {
-        return new WriteActionReplicasProxy(shardStateAction);
+        return new WriteActionReplicasProxy();
     }
 
     /**
@@ -335,26 +335,7 @@ public abstract class TransportWriteAction<
      * replicas, where a failure to execute the operation should fail
      * the replica shard and/or mark the replica as stale.
      */
-    final class WriteActionReplicasProxy implements ReplicationOperation.Replicas<ReplicaRequest> {
-
-        private final ShardStateAction shardStateAction;
-
-        WriteActionReplicasProxy(ShardStateAction shardStateAction) {
-            this.shardStateAction = shardStateAction;
-        }
-
-        @Override
-        public void performOn(ShardRouting replica, ReplicaRequest request, ActionListener<ReplicationOperation.ReplicaResponse> listener) {
-            String nodeId = replica.currentNodeId();
-            final DiscoveryNode node = clusterService.state().nodes().get(nodeId);
-            if (node == null) {
-                listener.onFailure(new NoNodeAvailableException("unknown node [" + nodeId + "]"));
-                return;
-            }
-            final ConcreteShardRequest<ReplicaRequest> concreteShardRequest =
-                    new ConcreteShardRequest<>(request, replica.allocationId().getId());
-            sendReplicaRequest(concreteShardRequest, node, listener);
-        }
+    class WriteActionReplicasProxy extends ReplicasProxy {
 
         @Override
         public void failShardIfNeeded(ShardRouting replica, long primaryTerm, String message, Exception exception,
core/src/main/java/org/elasticsearch/action/support/replication/TransportReplicationAction.java Outdated
Consumer<Exception> onPrimaryDemoted, Consumer<Exception> onIgnoredFailure) {
// Just like with failing the shard, it should only be marked as
// stale when write operations fail, so for regular operations
// don't mark it as stale when not necessary.
}

@Override
public void markShardCopyAsStale(ShardId shardId, String allocationId, long primaryTerm, Runnable onSuccess,
Consumer<Exception> onPrimaryDemoted, Consumer<Exception> onIgnoredFailure) {
shardStateAction.remoteShardFailed(shardId, allocationId, primaryTerm, "mark copy as stale", null,

This comment has been minimized.

Copy link
@bleskes

bleskes Feb 2, 2017

Member

We need two because there are situations when a replica shard needs to be marked as stale without a write action

Which situations do you mean?

core/src/main/java/org/elasticsearch/action/support/replication/TransportWriteAction.java Outdated
@@ -124,7 +138,11 @@ public synchronized void respond(ActionListener<Response> listener) {
protected void respondIfPossible(Exception ex) {
if (finishedAsyncActions && listener != null) {
if (ex == null) {
super.respond(listener);
if (finalResponseIfSuccessful != null) {

This comment has been minimized.

Copy link
@bleskes

bleskes Feb 2, 2017

Member

why this change?

This comment has been minimized.

Copy link
@dakrone

dakrone Feb 2, 2017

Author Member

This was a leftover from me trying something out, I've removed it in 56cc63f612e5281555d5da1202dc9a44748eceea

core/src/main/java/org/elasticsearch/index/seqno/GlobalCheckpointSyncAction.java Outdated
@@ -129,7 +129,7 @@ public String toString() {
private ReplicaRequest() {
}

public ReplicaRequest(PrimaryRequest primaryRequest, long checkpoint) {
public ReplicaRequest(ReplicationRequest primaryRequest, long checkpoint) {

This comment has been minimized.

Copy link
@bleskes

bleskes Feb 2, 2017

Member

Why this change?

This comment has been minimized.

Copy link
@dakrone

dakrone Feb 2, 2017

Author Member

I manually reverted the changes to GlobalCheckpointSyncAction and I missed one :)

I undid it in f7b29d78395c2e13c715f6c167d6e8cbffe28e6e

core/src/main/java/org/elasticsearch/rest/action/RestFieldStatsAction.java Outdated
@@ -98,7 +98,7 @@ public RestResponse buildResponse(FieldStatsResponse response, XContentBuilder b
builder.endObject();
}
builder.endObject();
return new BytesRestResponse(RestStatus.OK, builder);
return new BytesRestResponse(response.getStatus(), builder);

This comment has been minimized.

Copy link
@bleskes

bleskes Feb 2, 2017

Member

these are not related changes - can we maybe move them to a separate PR?

This comment has been minimized.

Copy link
@dakrone

dakrone Feb 2, 2017

Author Member

Sure, though I think it makes it more consistent to change these to be the same, I'll open a separate PR for it after this one to change them to be 500 if a shard fails and remove all of these except for the refresh action

This comment has been minimized.

Copy link
@dakrone

dakrone Feb 2, 2017

Author Member

I pushed d16441b19d606d87166d82345c1197f7fa6d8386 for this

@dakrone

This comment has been minimized.

Copy link
Member Author

commented Feb 3, 2017

@bleskes I think I addressed all the feedback for this round, thanks again for taking a look!

@dakrone dakrone force-pushed the dakrone:replica-failures-dont-always-fail-shard branch Feb 3, 2017

@bleskes

bleskes approved these changes Feb 3, 2017

Copy link
Member

left a comment

LGTM. Left some comments that you can adopt or not. No need for another review. Thx @dakrone

core/src/main/java/org/elasticsearch/action/support/broadcast/BroadcastResponse.java Outdated
*/
public RestStatus getStatus() {
if (failedShards > 0) {
return RestStatus.INTERNAL_SERVER_ERROR;

This comment has been minimized.

Copy link
@bleskes

bleskes Feb 3, 2017

Member

I think 5xx (server error) is worst than 4xx, but let's just go with the first.

core/src/main/java/org/elasticsearch/action/support/replication/ReplicationOperation.java Outdated
Request extends ReplicationRequest<Request>,
ReplicaRequest extends ReplicationRequest<ReplicaRequest>,
PrimaryResultT extends PrimaryResult<ReplicaRequest>
R extends ReplicationRequest<R>,

This comment has been minimized.

Copy link
@bleskes

bleskes Feb 3, 2017

Member

I would go with a T suffix - i.e., RequestT

private final TransportRequestOptions transportOptions;
private final String executor;

// package private for testing
private final String transportReplicaAction;
private final String transportPrimaryAction;
private final ReplicasProxy replicasProxy;
private final ReplicationOperation.Replicas replicasProxy;

protected TransportReplicationAction(Settings settings, String actionName, TransportService transportService,
ClusterService clusterService, IndicesService indicesService,

This comment has been minimized.

Copy link
@bleskes

bleskes Feb 3, 2017

Member

we can remove the shard state action here, no?

return indexService;
}

final IndicesService mockIndicesService(ClusterService clusterService) {

This comment has been minimized.

Copy link
@bleskes

bleskes Feb 3, 2017

Member

can we share this and mockIndexShard with TransportReplicationAction? is there any point in that?

@dakrone dakrone force-pushed the dakrone:replica-failures-dont-always-fail-shard branch 2 times, most recently Feb 3, 2017

@dakrone

This comment has been minimized.

Copy link
Member Author

commented Feb 3, 2017

Thank you very much for reviewing this @bleskes!

@dakrone dakrone force-pushed the dakrone:replica-failures-dont-always-fail-shard branch Feb 3, 2017

Change certain replica failures not to fail the replica shard
This changes the way that replica failures are handled such that not all
failures will cause the replica shard to be failed or marked as stale.

In some cases such as refresh operations, or global checkpoint syncs, it is
"okay" for the operation to fail without the shard being failed (because no data
is out of sync). In these cases, instead of failing the shard we should simply
fail the operation, and, in the event it is a user-facing operation, return a
5xx response code including the shard-specific failures.

This was accomplished by having two forms of the `Replicas` proxy, one that is
for non-write operations that does not fail the shard, and one that is for write
operations that will fail the shard when an operation fails.

Relates to #10708

@dakrone dakrone force-pushed the dakrone:replica-failures-dont-always-fail-shard branch to 39e7c30 Feb 3, 2017

@dakrone dakrone merged commit 39e7c30 into elastic:master Feb 3, 2017

1 of 2 checks passed

elasticsearch-ci Build started sha1 is merged.
Details
CLA Commit author has signed the CLA
Details

@bleskes bleskes referenced this pull request Feb 3, 2017

Closed

Add Sequence Numbers to write operations #10708

56 of 64 tasks complete

@dakrone dakrone deleted the dakrone:replica-failures-dont-always-fail-shard branch Dec 13, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.