Fix for search failures if shard is in POST_RECOVERY #12600

brwe · 2015-08-03T12:45:31Z

Currently, we do not allow reads on shards that are in POST_RECOVERY, which unfortunately can cause search failures on shards that have just recovered if there are no replicas (#9421).
The reason we did not allow reads on shards in POST_RECOVERY is that, after relocation, a shard might miss a refresh if the node that executed the refresh is behind with cluster state processing. If that happens, a user might execute index/refresh/search but still not find the document that was indexed.
More details here.

@bleskes and I discussed this briefly and he mentioned we could make refresh a replicated operation that goes the same route that index operations go and thereby make sure that the refresh reaches every shard. In this case we could also allow reads on POST_RECOVERY.

I made this PR as a proof of concept so that we can discuss whether this is actually a good idea.
This PR contains:

a reliable test for After relocation shards might temporarily not be searchable if still in POST_RECOVERY #9421
a fix for After relocation shards might temporarily not be searchable if still in POST_RECOVERY #9421
a test for the visibility issue that we have when we allow reads in POST_RECOVERY
the change to make refresh a replicated action just like index, delete, etc.

Let me know what you think. I would make the same changes for flush also.
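
To illustrate the visibility problem described above, here is a minimal, self-contained sketch. It is a toy model and not Elasticsearch code: `ShardCopy`, `broadcastRefresh`, `replicatedRefresh` and `docVisibleOnTarget` are hypothetical names, and the model assumes that index operations already reach the relocation target while the coordinating node's cluster state does not yet include it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class RefreshVisibilitySketch {

    /** Toy shard copy (hypothetical): indexed documents become searchable only after refresh(). */
    static final class ShardCopy {
        private final List<String> buffered = new ArrayList<>();
        private final Set<String> searchable = new HashSet<>();

        void index(String docId) {
            buffered.add(docId);
        }

        void refresh() {
            searchable.addAll(buffered);
            buffered.clear();
        }

        boolean search(String docId) {
            return searchable.contains(docId);
        }
    }

    /** Broadcast style: the coordinating node refreshes only the copies in its own, possibly stale, view. */
    public static void broadcastRefresh(List<ShardCopy> coordinatorView) {
        for (ShardCopy copy : coordinatorView) {
            copy.refresh();
        }
    }

    /** Replicated style: the source copy is refreshed and forwards the refresh to every copy it replicates to. */
    public static void replicatedRefresh(ShardCopy source, List<ShardCopy> replicationTargets) {
        source.refresh();
        for (ShardCopy copy : replicationTargets) {
            copy.refresh();
        }
    }

    /** Runs index/refresh/search and reports whether the document is visible on the relocation target. */
    public static boolean docVisibleOnTarget(boolean replicated) {
        ShardCopy source = new ShardCopy(); // relocation source
        ShardCopy target = new ShardCopy(); // relocation target, assumed to be in POST_RECOVERY

        // Assumption: index operations already reach both copies during relocation.
        source.index("doc-1");
        target.index("doc-1");

        if (replicated) {
            replicatedRefresh(source, Arrays.asList(target));
        } else {
            // Assumption: the coordinating node lags behind and does not know the target yet.
            broadcastRefresh(Arrays.asList(source));
        }
        // Assumption: reads are allowed on POST_RECOVERY and the search lands on the target.
        return target.search("doc-1");
    }

    public static void main(String[] args) {
        System.out.println("broadcast refresh:  " + docVisibleOnTarget(false)); // false
        System.out.println("replicated refresh: " + docVisibleOnTarget(true));  // true
    }
}
```

In this model the broadcast refresh leaves the document invisible on the target, while the replicated refresh makes it visible, which is the behavior the proposal relies on.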

s1monw · 2015-08-03T13:01:38Z

I like this a lot! I wonder if we can streamline the implementation a little more and forward the original request together with the shard ID, so that flush, refresh, and whatever comes after that can share some code? Anyway, I think we can just start with what we have here and see how flush turns out afterwards.

brwe · 2015-08-03T13:09:01Z

I was actually wondering if we should also have a dedicated action, similar to TransportBroadcastAction, something like TransportReplicatedBroadcastAction, that flush and refresh can derive from.

s1monw · 2015-08-04T09:00:56Z

core/src/main/java/org/elasticsearch/action/admin/indices/refresh/TransportRefreshAction.java


-    private final IndicesService indicesService;
+    private final TransportReplicatedRefreshAction replicatedRefreshAction;
+    ClusterService clusterService;


can this be private final?

When a client indexes a document and then calls refresh on the index, the document must be visible to subsequent search requests. This might not be the case if refresh is a BroadcastOperationAction; see DiscoveryWithServiceDisruptionsTests.testReadOnPostRecoveryShards, related to elastic#9421

brwe · 2015-08-12T10:51:01Z

I continued on this. I tried to generalize what I did for refresh so that it can be used for flush too. Now I wonder: should this work for synced flush too?

s1monw · 2015-08-17T20:00:05Z

@brwe what's the status here? Are you waiting for reviews?

brwe · 2015-08-18T10:17:57Z

@s1monw yes. @bleskes might have an opinion on that too.

bleskes · 2015-08-19T09:45:38Z

@brwe and I talked this through and we decided to try to simplify things and remove some intermediate abstractions. Concretely, we will try to:

Use ReplicationRequest/ReplicationResponse/TransportReplicationAction directly rather than having ReplicatedBroadcastShardRequest/ReplicatedBroadcastShardResponse/TransportReplicatedBroadcastShardAction.
Use BroadcastRequest/BroadcastResponse instead of ReplicatedBroadcastRequest/ReplicatedBroadcastResponse
Rename TransportReplicatedBroadcastAction to TransportBroadcastReplicationAction

Also, we should break this into two PRs:

One changing the refresh/flush behavior.
A follow up PR to change the read on POST_RECOVERY semantics.
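
The simplified layering proposed above (one broadcast-level action that fans out into per-shard replication requests) can be sketched roughly as follows. This is a synchronous toy model under stated assumptions, not the actual Elasticsearch classes: `ShardRequest`, `ShardResult`, `BroadcastResult`, `execute` and `demo` are hypothetical stand-ins, and the real transport actions are asynchronous.

```java
import java.util.function.Function;

public class BroadcastReplicationSketch {

    /** Hypothetical per-shard request, standing in for a replication request. */
    static final class ShardRequest {
        final String index;
        final int shardId;

        ShardRequest(String index, int shardId) {
            this.index = index;
            this.shardId = shardId;
        }
    }

    /** Hypothetical per-shard result: copies targeted and copies that succeeded. */
    static final class ShardResult {
        final int total;
        final int successful;

        ShardResult(int total, int successful) {
            this.total = total;
            this.successful = successful;
        }
    }

    /** Hypothetical aggregate, standing in for a broadcast response. */
    public static final class BroadcastResult {
        public final int totalShards;
        public final int successfulShards;
        public final int failedShards;

        BroadcastResult(int totalShards, int successfulShards, int failedShards) {
            this.totalShards = totalShards;
            this.successfulShards = successfulShards;
            this.failedShards = failedShards;
        }
    }

    /** Sends one replicated request per shard and folds the shard results into a single response. */
    public static BroadcastResult execute(String index, int numberOfShards,
                                          Function<ShardRequest, ShardResult> shardAction) {
        int total = 0;
        int successful = 0;
        for (int shardId = 0; shardId < numberOfShards; shardId++) {
            ShardResult result = shardAction.apply(new ShardRequest(index, shardId));
            total += result.total;
            successful += result.successful;
        }
        return new BroadcastResult(total, successful, total - successful);
    }

    /** Example: 3 shards with a primary and one replica each; the replica of shard 2 fails. */
    public static BroadcastResult demo() {
        return execute("my-index", 3,
                request -> new ShardResult(2, request.shardId == 2 ? 1 : 2));
    }

    public static void main(String[] args) {
        BroadcastResult result = demo();
        System.out.println(result.totalShards + " total, "
                + result.successfulShards + " successful, "
                + result.failedShards + " failed"); // 6 total, 5 successful, 1 failed
    }
}
```

The point of the sketch is only the shape: the per-shard step reuses the generic replication path, and the broadcast layer is limited to fan-out and aggregation, so refresh and flush could share it.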

brwe · 2015-08-24T09:42:08Z

Also, we should break this into two PRs:

One changing the refresh/flush behavior.

A follow up PR to change the read on POST_RECOVERY semantics.

I made a PR for the first part here: #13068

brwe · 2015-09-01T13:30:19Z

Opened the second PR, with the actual fix, here: #13246
I'll close this one here now.

brwe added v2.0.0-beta1 discuss WIP labels Aug 3, 2015

brwe changed the title ~~Fix for searach failures if shard is in POST_RECOVERY~~ Fix for search failures if shard is in POST_RECOVERY Aug 3, 2015

s1monw reviewed Aug 4, 2015
brwe force-pushed the replicated_refresh branch from a294f81 to 0ab0903 Compare August 4, 2015 13:33

clintongormley added review :Core/Infra/Core labels Aug 5, 2015

brwe added 6 commits August 11, 2015 14:33

Test for issue elastic#9421 (After relocation shards might temporaril…

4a66df7

…y not be searchable if still in POST_RECOVERY) see elastic#9421

test for visibility issue with relocation and refresh if reads allowe…

65637fe

…d when shard is in POST_RECOVERY

Allow reads on shards that are in POST_RECOVERY

a57089c

see elastic#9421

remove @Slow after rebase

c6cb483

review comments

41b5740

brwe force-pushed the replicated_refresh branch from 0ab0903 to 1100f56 Compare August 11, 2015 12:41

brwe added 2 commits August 12, 2015 12:49

more abstractions and make flush replicated action as well

2ce7af8

test for synced flush

253e60f

brwe force-pushed the replicated_refresh branch from 99992c9 to 253e60f Compare August 12, 2015 10:50

clintongormley added v2.0.0 and removed v2.0.0-beta1 labels Aug 13, 2015

bleskes self-assigned this Aug 18, 2015

brwe mentioned this pull request Aug 24, 2015

Make refresh a replicated action #13068

Merged

brwe added a commit to brwe/elasticsearch that referenced this pull request Sep 1, 2015

Make refresh a replicated action

d81f426

prerequisite to elastic#9421 see also elastic#12600

brwe closed this Sep 1, 2015

brwe removed :Core/Infra/Core discuss review v2.0.0 WIP labels Sep 1, 2015
