Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make refresh a replicated action #13068

Merged
merged 1 commit into from
Sep 1, 2015
Merged

Conversation

brwe
Copy link
Contributor

@brwe brwe commented Aug 24, 2015

prerequisite to #9421
see also #12600

logger.trace("{} flush request executed on replica", indexShard.shardId());
}
@Override
protected boolean resolveIndex() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know it's unrelated to this change but why don't we have defaults for these boolean they look the same in 90% fo the cases?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know. Change that in this pr or better in another one?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can do another one

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

opened #13218

@s1monw
Copy link
Contributor

s1monw commented Aug 25, 2015

I love this change looks very good. I left some comments around unittesting


@Override
protected ShardIterator shards(ClusterState clusterState, InternalRequest request) {
return clusterService.operationRouting().shards(clusterService.state(), request.concreteIndex(), request.request().getShardId().id()).shardsIt();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use the incoming state here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, why not use indexShards() now that we see this as a write op - then we don't need to change the visibility of the shards methond on Operation Routing

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use the incoming state here?

yes

also, why not use indexShards() now that we see this as a write op

you mean clusterService.operationRouting().indexShards()? that one needs a type and id and we don't have that here. or is there another one that does not?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indeed the indexShards suggestion is bad - it is not the right construct here as it is tied to a single doc. Since we are after shard ids here (not grouping) , I think we should simplify the API to return a list of shardIds (which will solve this too). See comments here : https://github.com/elastic/elasticsearch/pull/13068/files#r38203633

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is now a method that returns List: https://github.com/elastic/elasticsearch/pull/13068/files#diff-8ec8c1c769c4acb6f880e4e15d2b96f6R120 Is that what your meant?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

scatch that, different shards() method...

@bleskes
Copy link
Contributor

bleskes commented Aug 28, 2015

I agree with @s1monw that this looks great. I left minor comments here and there.

@brwe brwe force-pushed the broadcast_replication branch 2 times, most recently from 4400824 to 5c19e15 Compare August 31, 2015 12:26
@s1monw
Copy link
Contributor

s1monw commented Aug 31, 2015

LGTM

@brwe
Copy link
Contributor Author

brwe commented Aug 31, 2015

@s1monw thanks for the review!

addressed all comments. @bleskes want to have one more look?


public BroadcastResponse executeAndAssertImmediateResponse(TransportBroadcastReplicationAction refreshAction, BroadcastRequest request) throws InterruptedException, ExecutionException {
Date beginDate = new Date();
BroadcastResponse response = (BroadcastResponse) refreshAction.execute(request).get();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can use

        BroadcastResponse response = (BroadcastResponse) refreshAction.execute(request).actionGet("30s");

which will throw an exception if things take longer than 30s.

Btw - if you need to measure duration , please use System.nanoTime(), which doesn't suffer from ntp corrections.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

much nicer!

@bleskes
Copy link
Contributor

bleskes commented Aug 31, 2015

Looks awesome. I only miss a test for the aggregation of results from multiple shard level responses.

@brwe
Copy link
Contributor Author

brwe commented Aug 31, 2015

@bleskes
Copy link
Contributor

bleskes commented Aug 31, 2015

LGTM! (minor comments, no need for another review)

brwe added a commit that referenced this pull request Sep 1, 2015
@brwe brwe merged commit 333831c into elastic:master Sep 1, 2015
brwe added a commit that referenced this pull request Sep 1, 2015
Make refresh a replicated action
Conflicts:
	core/src/main/java/org/elasticsearch/action/admin/indices/flush/TransportFlushAction.java
	core/src/main/java/org/elasticsearch/action/admin/indices/refresh/TransportRefreshAction.java
	core/src/test/java/org/elasticsearch/action/IndicesRequestIT.java
brwe added a commit that referenced this pull request Sep 1, 2015
Since #13068 refresh and flush requests go to the primary first and are then replicated.
One difference to before is though that if a shard is  not available (INITIALIZING for example)
we wait a little for an indexing request but for refresh we don't and just give up immediately.
Before, refresh requests were just send to the shards regardless of what their state is.

In tests we sometimes create an index, issue an indexing request, refresh and
then get the document. But we do not wait until all nodes know that all primaries have ben assigned.
Now potentially one node can be one cluster state behind and not know yet that
the shards have ben started. If the refresh is executed through this node then the
refresh request will silently fail on shards that are started already because from
the nodes perspective they are still initializing. As a consequence, documents
that expected to be available in the test are now not.
Example test failures are here: http://build-us-00.elastic.co/job/elasticsearch-20-oracle-jdk7/395/

This commit changes the timeout to 1m (default) to make sure we don't miss shards
when we refresh. This will trigger the same retry mechanism as for indexing requests.
We still have to make a decision if this change of behavior is acceptable.

see #13238
brwe added a commit that referenced this pull request Sep 1, 2015
Since #13068 refresh and flush requests go to the primary first and are then replicated.
One difference to before is though that if a shard is  not available (INITIALIZING for example)
we wait a little for an indexing request but for refresh we don't and just give up immediately.
Before, refresh requests were just send to the shards regardless of what their state is.

In tests we sometimes create an index, issue an indexing request, refresh and
then get the document. But we do not wait until all nodes know that all primaries have ben assigned.
Now potentially one node can be one cluster state behind and not know yet that
the shards have ben started. If the refresh is executed through this node then the
refresh request will silently fail on shards that are started already because from
the nodes perspective they are still initializing. As a consequence, documents
that expected to be available in the test are now not.
Example test failures are here: http://build-us-00.elastic.co/job/elasticsearch-20-oracle-jdk7/395/

This commit changes the timeout to 1m (default) to make sure we don't miss shards
when we refresh. This will trigger the same retry mechanism as for indexing requests.
We still have to make a decision if this change of behavior is acceptable.

see #13238
brwe added a commit to brwe/elasticsearch that referenced this pull request Sep 2, 2015
Currently, we do not allow reads on shards which are in POST_RECOVERY which
unfortunately can cause search failures on shards which just recovered if there no replicas (elastic#9421).
The reason why we did not allow reads on shards that are in POST_RECOVERY is
that after relocating a shard might miss a refresh if the node that executed the
refresh is behind with cluster state processing. If that happens, a user might execute
index/refresh/search but still not find the document that was indexed.

We changed how refresh works now in elastic#13068 to make sure that shards cannot miss a refresh this
way by sending refresh requests the same way that we send write requests.

This commit changes IndexShard to allow reads on POST_RECOVERY now.
In addition it adds two test:

- test for issue elastic#9421 (After relocation shards might temporarily not be searchable if still in POST_RECOVERY)
- test for visibility issue with relocation and refresh if reads allowed when shard is in POST_RECOVERY

closes elastic#9421
brwe added a commit that referenced this pull request Sep 2, 2015
Currently, we do not allow reads on shards which are in POST_RECOVERY which
unfortunately can cause search failures on shards which just recovered if there no replicas (#9421).
The reason why we did not allow reads on shards that are in POST_RECOVERY is
that after relocating a shard might miss a refresh if the node that executed the
refresh is behind with cluster state processing. If that happens, a user might execute
index/refresh/search but still not find the document that was indexed.

We changed how refresh works now in #13068 to make sure that shards cannot miss a refresh this
way by sending refresh requests the same way that we send write requests.

This commit changes IndexShard to allow reads on POST_RECOVERY now.
In addition it adds two test:

- test for issue #9421 (After relocation shards might temporarily not be searchable if still in POST_RECOVERY)
- test for visibility issue with relocation and refresh if reads allowed when shard is in POST_RECOVERY

closes #9421
brwe added a commit to brwe/elasticsearch that referenced this pull request Sep 7, 2015
…n action

Before elastic#13068 refresh and flush ignored all exceptions that matched
TransportActions.isShardNotAvailableException(e) and this should not change.
In addition, refresh and flush which are based on broadcast replication
might now get UnavailableShardsException from TransportReplicationAction if a shard
is unavailable and this is not caught by TransportActions.isShardNotAvailableException(e).
This must be ignored as well.
brwe added a commit that referenced this pull request Sep 7, 2015
…n action

Before #13068 refresh and flush ignored all exceptions that matched
TransportActions.isShardNotAvailableException(e) and this should not change.
In addition, refresh and flush which are based on broadcast replication
might now get UnavailableShardsException from TransportReplicationAction if a shard
is unavailable and this is not caught by TransportActions.isShardNotAvailableException(e).
This must be ignored as well.
brwe added a commit that referenced this pull request Sep 7, 2015
…n action

Before #13068 refresh and flush ignored all exceptions that matched
TransportActions.isShardNotAvailableException(e) and this should not change.
In addition, refresh and flush which are based on broadcast replication
might now get UnavailableShardsException from TransportReplicationAction if a shard
is unavailable and this is not caught by TransportActions.isShardNotAvailableException(e).
This must be ignored as well.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants