Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added `_shards` header to all write responses. #7994

Merged

Conversation

@martijnvg
Copy link
Member

commented Oct 6, 2014

The header indicates to how many shard copies (primary and replicas shards) a write was supposed to go to, to how many shard copies to write succeeded and potentially captures shard failures if writing into a replica shard fails.

For async writes it also includes the number of shards a write is still pending.

@bleskes
bleskes reviewed Oct 8, 2014
View changes
docs/reference/docs/index_.asciidoc Outdated
* `successful`- Indicates the number of shard copies to index operation succeeded on.
* `pending` - Indicates to how many shard copies this index operation still needs to go to at the time index operation
succeeded on the primary shard. This field is only returned if `async` replication is used.
* `failures` - An array that contains replication related errors in the case an index operation failed on a replica shard.

This comment has been minimized.

Copy link
@bleskes

bleskes Oct 8, 2014

Member

I think we need failed: INT , just like in search.

This comment has been minimized.

Copy link
@martijnvg

martijnvg Oct 8, 2014

Author Member

+1 I'll add that.

@bleskes
bleskes reviewed Oct 8, 2014
View changes
docs/reference/migration/migrate_1_x.asciidoc Outdated
`successful` and `failed` fields in the header are based on the number of primary shards. The failures on replica
shards aren't being kept track of. From version `1.5.0` the stats in the `_shards` header are based on all shards
in an index. The http result code is remains to be based on the failures that have occurred in the primary shards.

This comment has been minimized.

Copy link
@bleskes

bleskes Oct 8, 2014

Member

i think this should be all shards of an index. Also "remains to be based on" should be "The http status code is left unchanged and is only based on failures that occurred while executing on primary shards".

@bleskes
bleskes reviewed Oct 8, 2014
View changes
src/main/java/org/elasticsearch/action/ActionWriteResponse.java Outdated
@Override
public void readFrom(StreamInput in) throws IOException {
super.readFrom(in);
if (in.getVersion().onOrAfter(Version.V_1_5_0)) {

This comment has been minimized.

Copy link
@bleskes

bleskes Oct 8, 2014

Member

how do we deal with earlier versions? Is it OK for shardInfo to be null? do we want a marker Unknown instance?

This comment has been minimized.

Copy link
@bleskes

bleskes Oct 8, 2014

Member

I see some places where null is not protected against...

@bleskes
bleskes reviewed Oct 8, 2014
View changes
src/main/java/org/elasticsearch/action/ActionWriteResponse.java Outdated
}

/**
* @return the total number of shards the write actually succeeded on.

This comment has been minimized.

Copy link
@bleskes

bleskes Oct 8, 2014

Member

drop the actually? sounds so "uncertain" :)

@javanna
javanna reviewed Oct 8, 2014
View changes
docs/reference/docs/delete-by-query.asciidoc Outdated
"successful" : 5,
"failed" : 0
"total" : 10,
"successful" : 10,

This comment has been minimized.

Copy link
@javanna

javanna Oct 8, 2014

Member

is it correct that the failed is gone?

This comment has been minimized.

Copy link
@martijnvg

martijnvg Oct 8, 2014

Author Member

I'll add it back :-)

(my idea was that failures.length would be enough, but that wouldn't be consistent with the _shards header in search and percolate apis)

This comment has been minimized.

Copy link
@javanna

javanna Oct 10, 2014

Member

I think you have to add back the failed to the docs too then? That means there are no changes to the delete by query json response?

This comment has been minimized.

Copy link
@bleskes

bleskes Dec 1, 2014

Member

failed was supposed to come back no?

This comment has been minimized.

Copy link
@imotov

imotov Jan 6, 2015

Member

+1 on adding failed back to docs.

@javanna
javanna reviewed Oct 8, 2014
View changes
...java/org/elasticsearch/action/admin/indices/mapping/delete/TransportDeleteMappingAction.java Outdated
logger.trace("Delete by query[{}] completed with total[{}], successful[{}] and failed[{}]", indexResponse.getIndex(), indexResponse.getShardInfo().getTotal(), indexResponse.getShardInfo().getSuccessful(), indexResponse.getShardInfo().getFailures().length);
if (indexResponse.getShardInfo().getFailures().length > 0) {
for (ActionWriteResponse.ShardInfo.Failure failure : indexResponse.getShardInfo().getFailures()) {
logger.trace("[{}/{}/] Delete by query shard failure reason: {}", failure.getIndex(), failure.getShardId(), failure.getNodeId(), failure.getReason());

This comment has been minimized.

Copy link
@javanna

javanna Oct 8, 2014

Member

missing an {} somewhere?

This comment has been minimized.

Copy link
@martijnvg

martijnvg Oct 8, 2014

Author Member

true! I'll fix that.

@javanna
javanna reviewed Oct 8, 2014
View changes
src/main/java/org/elasticsearch/action/ActionWriteResponse.java Outdated
*/
public abstract class ActionWriteResponse extends ActionResponse {

private ShardInfo shardInfo;

This comment has been minimized.

Copy link
@javanna

javanna Oct 8, 2014

Member

I'd consider adding a constructor with shardInfo and changing the subclasses constructors to accept it there, just to enforce that this info is needed so we don't forget it anywhere. Thoughts?

This comment has been minimized.

Copy link
@martijnvg

martijnvg Oct 8, 2014

Author Member

+1 make sense.

@bleskes
bleskes reviewed Oct 8, 2014
View changes
...a/org/elasticsearch/action/support/replication/TransportShardReplicationOperationAction.java Outdated
finishIfPossible();
}

@Override
public void handleException(TransportException exp) {
response.shardReplicaFailures.put(nodeId, exp);

This comment has been minimized.

Copy link
@bleskes

bleskes Oct 8, 2014

Member

we don't count shard not allocated / not started/ closed etc. as shard failures - see Search logic. This will end up as a difference between total shards and shard failed. The reason is that there is no way to distinguish this case with the one that our cluster state said they were unassigned.

@bleskes
bleskes reviewed Oct 8, 2014
View changes
...a/org/elasticsearch/action/support/replication/TransportShardReplicationOperationAction.java Outdated
@@ -737,10 +731,12 @@ protected void doRun() {
try {
shardOperationOnReplica(shardRequest);
} catch (Throwable e) {
response.shardReplicaFailures.put(nodeId, e);

This comment has been minimized.

Copy link
@bleskes

bleskes Oct 8, 2014

Member

same here. Not all failures are counted.

@bleskes
bleskes reviewed Oct 8, 2014
View changes
...a/org/elasticsearch/action/support/replication/TransportShardReplicationOperationAction.java Outdated
@@ -754,21 +750,24 @@ public boolean isForceExecution() {
public void onFailure(Throwable t) {}
});
} catch (Throwable e) {
response.shardReplicaFailures.put(nodeId, e);

This comment has been minimized.

Copy link
@bleskes

bleskes Oct 8, 2014

Member

same here. Not all failures are counted.

@bleskes
bleskes reviewed Oct 8, 2014
View changes
...a/org/elasticsearch/action/support/replication/TransportShardReplicationOperationAction.java Outdated
failReplicaIfNeeded(shard.index(), shard.id(), e);
}
response.successFullShardCopyCount.incrementAndGet();

This comment has been minimized.

Copy link
@bleskes

bleskes Oct 8, 2014

Member

This is not necessarily a success, right?

@martijnvg martijnvg force-pushed the martijnvg:feature/shard_info_write_responses branch Oct 9, 2014
@martijnvg

This comment has been minimized.

Copy link
Member Author

commented Oct 9, 2014

@javanna @bleskes I updated the PR and the provided feedback has been applied.

The other most noticeable change is that the replication logic in TransportShardReplicationOperationAction has been changed to keep better track of the state of replication while it is being performed. This helped to simplify the replication logic.

@javanna
javanna reviewed Oct 10, 2014
View changes
docs/reference/docs/delete.asciidoc Outdated
@@ -16,6 +16,10 @@ The result of the above delete operation is:
[source,js]
--------------------------------------------------
{
"_shards" : {
"total" : 10,
"successful" : 10

This comment has been minimized.

Copy link
@javanna

javanna Oct 10, 2014

Member

missing failed?

This comment has been minimized.

Copy link
@martijnvg

martijnvg Oct 10, 2014

Author Member

I added it to the code, I'll add it to the examples as well!

@javanna
javanna reviewed Oct 10, 2014
View changes
docs/reference/docs/index_.asciidoc Outdated
@@ -19,6 +19,10 @@ The result of the above index operation is:
[source,js]
--------------------------------------------------
{
"_shards" : {
"total" : 10,
"successful" : 10

This comment has been minimized.

Copy link
@javanna

javanna Oct 10, 2014

Member

missing failed?

@javanna
javanna reviewed Oct 10, 2014
View changes
src/main/java/org/elasticsearch/action/deletebyquery/IndexDeleteByQueryResponse.java Outdated
this.shardInfo.append(shardResponses);
// just append the primary failures:
if (!failures.isEmpty()) {
List<ActionWriteResponse.ShardInfo.Failure> k = new ArrayList<>();

This comment has been minimized.

Copy link
@javanna

javanna Oct 10, 2014

Member

can you rename k to something a bit more meaningful? :)

This comment has been minimized.

Copy link
@martijnvg

martijnvg Oct 10, 2014

Author Member

heh :) will do

@javanna
javanna reviewed Oct 10, 2014
View changes
src/main/java/org/elasticsearch/action/ActionWriteResponse.java Outdated
*/
public abstract class ActionWriteResponse extends ActionResponse {

private ShardInfo shardInfo = new ShardInfo();

This comment has been minimized.

Copy link
@javanna

javanna Oct 10, 2014

Member

I think we discussed this before, but it didn't change, thus I'm bringing it up again ;) can we add a constructor that accepts shardInfo as argument and change the subclasses constructors to accept it there, just to enforce that this info is needed so we don't forget it anywhere. Maybe then we could also remove the setter...

This comment has been minimized.

Copy link
@martijnvg

martijnvg Oct 10, 2014

Author Member

The responses are created on the primary shard while the _shards header is constructed and modified while replication is in progress, that makes a bit weird if we supply it when the action of the primary has completed.

Also on certain responses (delete by query) we need to aggregate the shard responses's _shards header and supply it a level higher up. (same from index response to indices response). So that is why the setter is there.

This comment has been minimized.

Copy link
@javanna

javanna Oct 10, 2014

Member

I see, thanks for the explanation!

@bleskes
bleskes reviewed Oct 13, 2014
View changes
...a/org/elasticsearch/action/support/replication/TransportShardReplicationOperationAction.java Outdated
// initialize the counter
int replicaCounter = shardIt.assignedReplicasIncludingRelocating();

ReplicationState status = new ReplicationState(por, shardIt, primaryResponse);

This comment has been minimized.

Copy link
@bleskes

bleskes Oct 13, 2014

Member

Can we either rename ReplicationState to ReplicationStatus or change status to replicationState?

@bleskes
bleskes reviewed Oct 13, 2014
View changes
...a/org/elasticsearch/action/support/replication/TransportShardReplicationOperationAction.java Outdated
if (newPrimaryShard != null) {
replicaCounter++;
numberOfShardInstance++;

This comment has been minimized.

Copy link
@bleskes

bleskes Oct 13, 2014

Member

Nit picking - can we call this numberOfShardInstances?

@bleskes
bleskes reviewed Oct 13, 2014
View changes
...a/org/elasticsearch/action/support/replication/TransportShardReplicationOperationAction.java Outdated
if (replicaCounter == 0) {
postPrimaryOperation(internalRequest, response);
listener.onResponse(response.response());
status.onStartReplication(numberOfShardInstance, listener);

This comment has been minimized.

Copy link
@bleskes

bleskes Oct 13, 2014

Member

I'm a bit worried about the differences in the total count (based on shardIt.size()) and the amount of calls we actually make (based on numberOfShardInstace). We may end up in a situation where successful > total and maybe worse successful == total but they mean different shards. I think we should not base the counts in ReplicationState on the ShardIt but rather on the total of actual work + unassigned.

@bleskes
bleskes reviewed Oct 13, 2014
View changes
...a/org/elasticsearch/action/support/replication/TransportShardReplicationOperationAction.java Outdated
this.replicaRequest = replicaRequest;
this.response = response;
this.payload = payload;
public void onReplicaFailure(String nodeId, Throwable e) {

This comment has been minimized.

Copy link
@bleskes

bleskes Oct 13, 2014

Member

Can we add @nullable to the e param?

@bleskes
bleskes reviewed Oct 13, 2014
View changes
...a/org/elasticsearch/action/support/replication/TransportShardReplicationOperationAction.java Outdated
this.response = response;
this.payload = payload;
public void onReplicaFailure(String nodeId, Throwable e) {
// Only version conflict should be ignored from being put into the _shards header?

This comment has been minimized.

Copy link
@bleskes

bleskes Oct 13, 2014

Member

We should call ignoreReplicaException() imho.

@bleskes
bleskes reviewed Oct 13, 2014
View changes
...a/org/elasticsearch/action/support/replication/TransportShardReplicationOperationAction.java Outdated

private void doFinish() {
if (finished.compareAndSet(false, true)) {
ActionWriteResponse.ShardInfo shardInfo = finalResponse.getShardInfo();

This comment has been minimized.

Copy link
@bleskes

bleskes Oct 13, 2014

Member

I think it will be cleaner to have ShardInfo have a constructor that takes all parameters and set that on the finalResponse. This will make sure we will not forget anything in the future.

@bleskes
bleskes reviewed Oct 13, 2014
View changes
src/main/java/org/elasticsearch/action/ActionWriteResponse.java Outdated
this.shardId = shardId;
this.nodeId = nodeId;
this.reason = reason;
this.status = RestStatus.OK; // <-- Replica failures are ok and can happen.

This comment has been minimized.

Copy link
@bleskes

bleskes Oct 13, 2014

Member

We suppress and not report all errors which are OK. I don't think we need a special protection here about it.

@bleskes
bleskes reviewed Oct 13, 2014
View changes
...a/org/elasticsearch/action/support/replication/TransportShardReplicationOperationAction.java Outdated
for (Map.Entry<String, Throwable> entry : shardReplicaFailures.entrySet()) {
String reason = ExceptionsHelper.detailedMessage(entry.getValue());
ShardId shardId = shardIterator.shardId();
failures.add(new ActionWriteResponse.ShardInfo.Failure(shardId.getIndex(), shardId.getId(), entry.getKey(), reason));

This comment has been minimized.

Copy link
@bleskes

bleskes Oct 13, 2014

Member

We should try to take the rest status from the exception. See ShardSearchFailure

// The failure doesn't include the node id, maybe add it to ShardOperationFailedException...
ShardOperationFailedException sf = shardActionResult.shardFailure;

ShardIterator thisShardIterator = null;

This comment has been minimized.

Copy link
@bleskes

bleskes Dec 8, 2014

Member

why not capture the shard info during the failure handling? Then we don't need to search for the iterator..

This comment has been minimized.

Copy link
@martijnvg

martijnvg Dec 9, 2014

Author Member

good call, let me change that.

@bleskes

This comment has been minimized.

Copy link
Member

commented Dec 8, 2014

Thx @martijnvg .

Not sure if we should repeat the primary exception or just tell we never executed on replica shards.

yeah, I see what you mean. I was thinking wrapping the primary failure with a new exception which has the right info in the message? (ala "Failed to execute on shard [](primary + [] replicas): [MESSAGE]")

left some other minor comments.

@martijnvg

This comment has been minimized.

Copy link
Member Author

commented Dec 9, 2014

Thanks for the feedback @bleskes. I updated the PR. Since we don't have exceptions (the Failure class has no failure wrapping) for the replica shards, I instead just wrap the original primary shard failure on the message of the replica shard failure. I think this is sufficient?

@martijnvg martijnvg force-pushed the martijnvg:feature/shard_info_write_responses branch from c99bb0f Jan 6, 2015
@martijnvg

This comment has been minimized.

Copy link
Member Author

commented Jan 6, 2015

@bleskes @javanna I applied the feedback and Boaz's commits (bleskes@a9b659d and bleskes@6180417), rebased to master and squashed everything to a single commit for clarity purposes.

I think this is getting close to get merged now, would be great if someone else can take a look at the PR before it gets merged in.

@imotov
imotov reviewed Jan 6, 2015
View changes
src/test/java/org/elasticsearch/deleteByQuery/DeleteByQueryTests.java Outdated

private void assertSyncShardInfo(ActionWriteResponse.ShardInfo shardInfo, NumShards numShards) {
assertThat(shardInfo.getTotal(), equalTo(numShards.totalNumShards));
assertThat(shardInfo.getSuccessful(), greaterThanOrEqualTo(1));

This comment has been minimized.

Copy link
@imotov

imotov Jan 6, 2015

Member

Why not greaterThanOrEqualTo(numShards.numPrimaries) or even `equalTo(numShards.totalNumShards)?

This comment has been minimized.

Copy link
@martijnvg

martijnvg Jan 7, 2015

Author Member

I don't know why this wasn't using numShards.numPrimaries, let me change that!

@imotov
imotov reviewed Jan 6, 2015
View changes
src/main/java/org/elasticsearch/action/bulk/TransportShardBulkAction.java Outdated
<T> T response() {
<T extends ActionWriteResponse> T response() {
// set all to 0, we will embed this into the replica request and not use it
response.setShardInfo(new ActionWriteResponse.ShardInfo(0, 0, 0));

This comment has been minimized.

Copy link
@imotov

imotov Jan 6, 2015

Member

That makes me think that ShardInfo needs a public default constructor.

@imotov

This comment has been minimized.

Copy link
Member

commented Jan 6, 2015

I left a couple of small comments. In general looks good to me.

@martijnvg

This comment has been minimized.

Copy link
Member Author

commented Jan 7, 2015

@imotov I updated the PR and applied your comments.

@imotov

This comment has been minimized.

Copy link
Member

commented Jan 7, 2015

LGTM

@martijnvg martijnvg changed the title Added `_shards` header to all write responses. Core: added `_shards` header to all write responses. Jan 8, 2015
@martijnvg martijnvg force-pushed the martijnvg:feature/shard_info_write_responses branch Jan 8, 2015
@martijnvg martijnvg force-pushed the martijnvg:feature/shard_info_write_responses branch Jan 8, 2015
@martijnvg martijnvg force-pushed the martijnvg:feature/shard_info_write_responses branch Jan 8, 2015
@martijnvg martijnvg force-pushed the martijnvg:feature/shard_info_write_responses branch Jan 8, 2015
The header indicates to how many shard copies (primary and replicas shards) a write was supposed to go to, to how many
shard copies to write succeeded and potentially captures shard failures if writing into a replica shard fails.

For async writes it also includes the number of shards a write is still pending.

Closes #7994
@martijnvg martijnvg force-pushed the martijnvg:feature/shard_info_write_responses branch to ca4f27f Jan 8, 2015
@martijnvg martijnvg merged commit ca4f27f into elastic:master Jan 8, 2015
1 check passed
1 check passed
CLA Commit author is a member of Elasticsearch
Details
@martijnvg martijnvg added the resiliency label Mar 2, 2015
@martijnvg martijnvg deleted the martijnvg:feature/shard_info_write_responses branch May 18, 2015
@clintongormley clintongormley changed the title Core: added `_shards` header to all write responses. Added `_shards` header to all write responses. Jun 6, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants
You can’t perform that action at this time.