Make refresh a replicated action #13068
Conversation
logger.trace("{} flush request executed on replica", indexShard.shardId());
}

@Override
protected boolean resolveIndex() {
I know it's unrelated to this change, but why don't we have defaults for these booleans? They look the same in 90% of the cases.
I don't know. Change that in this PR, or better in another one?
we can do another one
opened #13218
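The default being proposed could look roughly like this; a minimal sketch with illustrative class names, not the actual Elasticsearch transport action hierarchy:

```java
// Sketch with hypothetical names: a default in the base transport action
// would remove the identical resolveIndex() overrides from most subclasses.
abstract class BaseTransportAction {
    // Shared default; subclasses override only when they actually differ.
    protected boolean resolveIndex() {
        return true;
    }
}

class RefreshTransportAction extends BaseTransportAction {
    // Inherits the default; no boilerplate override needed.
}

class ClusterLevelTransportAction extends BaseTransportAction {
    @Override
    protected boolean resolveIndex() {
        return false; // the minority case that really differs
    }
}
```

This is the usual template-method cleanup: the 90% case lives in the base class, and only the exceptions carry an override.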
I love this change, looks very good. I left some comments around unit testing.
Force-pushed from 3e47662 to 23bdc5a
@Override
protected ShardIterator shards(ClusterState clusterState, InternalRequest request) {
    return clusterService.operationRouting().shards(clusterService.state(), request.concreteIndex(), request.request().getShardId().id()).shardsIt();
}
can we use the incoming state here?
also, why not use indexShards() now that we see this as a write op - then we don't need to change the visibility of the shards method on OperationRouting
can we use the incoming state here?

yes

also, why not use indexShards() now that we see this as a write op

you mean clusterService.operationRouting().indexShards()? that one needs a type and id and we don't have that here. or is there another one that does not?
indeed, the indexShards suggestion is bad - it is not the right construct here as it is tied to a single doc. Since we are after shard ids here (not grouping), I think we should simplify the API to return a list of shardIds (which will solve this too). See comments here: https://github.com/elastic/elasticsearch/pull/13068/files#r38203633
This is now a method that returns List: https://github.com/elastic/elasticsearch/pull/13068/files#diff-8ec8c1c769c4acb6f880e4e15d2b96f6R120 Is that what you meant?
scratch that, different shards() method...
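The simplification under discussion, sketched with illustrative stand-in types rather than the real OperationRouting API: when callers only need shard ids, a plain list is simpler than a grouping iterator that is tied to routing a single doc.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-in for the real ShardId type.
class ShardId {
    final String index;
    final int id;

    ShardId(String index, int id) {
        this.index = index;
        this.id = id;
    }

    @Override
    public String toString() {
        return "[" + index + "][" + id + "]";
    }
}

class SimplifiedRouting {
    // Hypothetical simplified API: enumerate the shard ids of an index
    // directly, instead of building a grouped shard iterator.
    static List<ShardId> shardIds(String index, int numberOfShards) {
        List<ShardId> ids = new ArrayList<>(numberOfShards);
        for (int i = 0; i < numberOfShards; i++) {
            ids.add(new ShardId(index, i));
        }
        return ids;
    }
}
```

A broadcast-style caller can then iterate the list and fan out one replicated request per shard id, which is all the refresh/flush path needs.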
I agree with @s1monw that this looks great. I left minor comments here and there.
Force-pushed from 4400824 to 5c19e15
LGTM
public BroadcastResponse executeAndAssertImmediateResponse(TransportBroadcastReplicationAction refreshAction, BroadcastRequest request) throws InterruptedException, ExecutionException {
    Date beginDate = new Date();
    BroadcastResponse response = (BroadcastResponse) refreshAction.execute(request).get();
you can use
BroadcastResponse response = (BroadcastResponse) refreshAction.execute(request).actionGet("30s");
which will throw an exception if things take longer than 30s.
Btw - if you need to measure duration, please use System.nanoTime(), which doesn't suffer from NTP corrections.
much nicer!
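The System.nanoTime() suggestion in a minimal, self-contained form (the helper name is illustrative): nanoTime is monotonic, so elapsed-time measurements are not skewed by NTP clock adjustments the way differences between two new Date() values can be.

```java
import java.util.concurrent.TimeUnit;

class Elapsed {
    // Measure the duration of a task using the monotonic clock.
    // Unlike Date/currentTimeMillis, nanoTime never jumps backwards
    // when the wall clock is corrected.
    static long elapsedMillis(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }
}
```

In the test above, the Date-based measurement could then be replaced by wrapping the execute-and-get call in such a helper.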
Looks awesome. I only miss a test for the aggregation of results from multiple shard-level responses.
@bleskes addressed all comments and added test here: https://github.com/elastic/elasticsearch/pull/13068/files#diff-6030559b5ed4d55d9a754523f5c6ce6dR137
LGTM! (minor comments, no need for another review)
prerequisite to elastic#9421, see also elastic#12600
Force-pushed from c29cabb to d81f426
Make refresh a replicated action

Conflicts:
	core/src/main/java/org/elasticsearch/action/admin/indices/flush/TransportFlushAction.java
	core/src/main/java/org/elasticsearch/action/admin/indices/refresh/TransportRefreshAction.java
	core/src/test/java/org/elasticsearch/action/IndicesRequestIT.java
Since #13068, refresh and flush requests go to the primary first and are then replicated. One difference from before, though, is that if a shard is not available (INITIALIZING, for example) we wait a little for an indexing request, but for refresh we don't and just give up immediately. Before, refresh requests were just sent to the shards regardless of their state. In tests we sometimes create an index, issue an indexing request, refresh, and then get the document. But we do not wait until all nodes know that all primaries have been assigned. Now potentially one node can be one cluster state behind and not yet know that the shards have been started. If the refresh is executed through this node, then the refresh request will silently fail on shards that are already started, because from that node's perspective they are still initializing. As a consequence, documents that were expected to be available in the test now are not. Example test failures are here: http://build-us-00.elastic.co/job/elasticsearch-20-oracle-jdk7/395/ This commit changes the timeout to 1m (default) to make sure we don't miss shards when we refresh. This will trigger the same retry mechanism as for indexing requests. We still have to make a decision if this change of behavior is acceptable. see #13238
Currently, we do not allow reads on shards which are in POST_RECOVERY, which unfortunately can cause search failures on shards which just recovered if there are no replicas (#9421). The reason why we did not allow reads on shards that are in POST_RECOVERY is that after relocating, a shard might miss a refresh if the node that executed the refresh is behind with cluster state processing. If that happens, a user might execute index/refresh/search but still not find the document that was indexed. We changed how refresh works in #13068 to make sure that shards cannot miss a refresh this way, by sending refresh requests the same way that we send write requests. This commit changes IndexShard to allow reads in POST_RECOVERY now. In addition it adds two tests:
- test for issue #9421 (after relocation, shards might temporarily not be searchable if still in POST_RECOVERY)
- test for the visibility issue with relocation and refresh if reads are allowed when the shard is in POST_RECOVERY

closes #9421
…n action Before #13068, refresh and flush ignored all exceptions that matched TransportActions.isShardNotAvailableException(e), and this should not change. In addition, refresh and flush, which are based on broadcast replication, might now get UnavailableShardsException from TransportReplicationAction if a shard is unavailable, and this is not caught by TransportActions.isShardNotAvailableException(e). This must be ignored as well.
prerequisite to #9421
see also #12600