Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix for search failures if shard is in POST_RECOVERY #12600

Closed
wants to merge 8 commits into from

Conversation

brwe
Copy link
Contributor

@brwe brwe commented Aug 3, 2015

Currently, we do not allow reads on shards which are in POST_RECOVERY which unfortunately can cause search failures on shards which just recovered if there no replicas (#9421).
The reason why we did not allow reads on shards that are in POST_RECOVERY is that after relocating a shard might miss a refresh if the node that executed the refresh is behind with cluster state processing. If that happens, a user might execute index/refresh/search but still not find the document that was indexed.
More details here.

@bleskes and I discussed this briefly and he mentioned we could make refresh a replicated operation that goes the same route that index operations go and thereby make sure that the refresh reaches every shard. In this case we could also allow reads on POST_RECOVERY.

I make this PR as a proof of concept so that we can discuss if this is actually a good idea.
This PR contains:

Let me know what you think. I would make the same changes for flush also.

@brwe brwe changed the title Fix for searach failures if shard is in POST_RECOVERY Fix for search failures if shard is in POST_RECOVERY Aug 3, 2015
@s1monw
Copy link
Contributor

s1monw commented Aug 3, 2015

I like this a lot! I wonder if we can streamline the implementation here a little more and forward the original request with the shard ID together to have some code we can share between flush and refresh and whatever comes after that? Anyway I think we can just start with what we have here and see how flush turns out afterwards.

@brwe
Copy link
Contributor Author

brwe commented Aug 3, 2015

I was actually wondering if we should also have a dedicated Action, similar to TransportBroadcastAction something like TransportReplicatedBroadcastAction or something that flush and refresh can derive from.


private final IndicesService indicesService;
private final TransportReplicatedRefreshAction replicatedRefreshAction;
ClusterService clusterService;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this be private final?

@clintongormley clintongormley added review :Core/Infra/Core Core issues without another label labels Aug 5, 2015
…y not be searchable if still in POST_RECOVERY)

see elastic#9421
When a client indexes a documents and then calls refresh on the index
then the document must be visible after that with search requests.
This might not be the case if refresh is a BroadcastOperationAction,
see DiscoveryWithServiceDisruptionsTests.testReadOnPostRecoveryShards

related to elastic#9421
@brwe
Copy link
Contributor Author

brwe commented Aug 12, 2015

I continued on this. I tried to generalize what I did for refresh so that it can be used for flush too. Now I wonder: should this work for synced flush too?

@s1monw
Copy link
Contributor

s1monw commented Aug 17, 2015

@brwe what's the status here, do you wait for reviews?

@brwe
Copy link
Contributor Author

brwe commented Aug 18, 2015

@s1monw yes. @bleskes might have an opinion on that too.

@bleskes bleskes self-assigned this Aug 18, 2015
@bleskes
Copy link
Contributor

bleskes commented Aug 19, 2015

@brwe and I talked this through and we decided to try and simplify things and remove some intermediate abstractions. Concretely try to:

  1. Use ReplicationRequest/ReplicationResponse/TransportReplicationAction directly rather than having ReplicatedBroadcastShardRequest/ReplicatedBroadcastShardResponse/TransportReplicatedBroadcastShardAction.
  2. Use BroadcastRequest/BroadcastResponse instead of ReplicatedBroadcastRequest/ReplicatedBroadcastResponse
  3. Rename TransportReplicatedBroadcastAction to TransportBroadcastReplicationAction

Also, we should break this into two PRs:

  1. One changing the refresh/flush behavior.
  2. A follow up PR to change the read on POST_RECOVERY semantics.

@brwe
Copy link
Contributor Author

brwe commented Aug 24, 2015

Also, we should break this into two PRs:

  1. One changing the refresh/flush behavior.
  2. A follow up PR to change the read on POST_RECOVERY semantics.

I made a pr for the first part here: #13068

brwe added a commit to brwe/elasticsearch that referenced this pull request Sep 1, 2015
@brwe
Copy link
Contributor Author

brwe commented Sep 1, 2015

Opened the second pr for the actual fix now here: #13246
I'll close this one here now.

@brwe brwe closed this Sep 1, 2015
@brwe brwe removed :Core/Infra/Core Core issues without another label discuss review v2.0.0 WIP labels Sep 1, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants