
Be more resilient to partial network partitions #8720

Closed
wants to merge 1 commit into master from recovery_network_disconnect

Conversation

bleskes
Contributor

@bleskes bleskes commented Dec 1, 2014

When a node experiences network issues, the master detects it and removes the node from the cluster. That causes all ongoing recoveries from and to that node to be stopped, and a new location is found for the relevant shards. However, in the case of a partial network partition, where there are connectivity issues between the source and target node of a recovery but not between those nodes and the current master, things may go wrong. While the nodes successfully restore the connection, the ongoing recoveries may have encountered issues.

This PR adds a test that simulates disconnecting nodes and dropping requests during the various stages of recovery, and solves all the issues that were raised by it. In short:

  1. Ongoing recoveries will be scheduled for retry upon network disconnect. The default retry period is 5s (cross-node connections are checked every 10s by default).
  2. Sometimes the disconnect happens after the target engine has started (but the shard is still in recovery). For simplicity, I opted to restart the recovery from scratch (where little to no files will be copied again, because they were just synced).
  3. To protect against dropped requests, a Recovery Monitor was added that fails a recovery if no progress has been made in the last 30m (by default), which is equivalent to the long timeouts we use in recovery requests (a rough sketch of this idea follows the list).
  4. When a shard fails on a node, we try to assign it to another node. If no such node is available, the shard will remain unassigned, causing the target node to clean up any in-memory state for it (files on disk remain). At the moment the shard remains unassigned until another cluster state change happens, which will re-assign it to the node in question, but if no such change happens the shard stays stuck at unassigned. The commit adds an extra delayed reroute in such cases to make sure the shard will be reassigned.
  5. Moved all recovery-related settings to RecoverySettings.
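A minimal sketch of the monitor idea from point 3, for readers skimming the PR. The names (OngoingRecovery, lastAccessTimeMillis, fail) are illustrative assumptions, not the PR's actual API: a periodic task compares the recovery's last access time against the activity timeout and fails the recovery when nothing has touched it for too long.

// Illustrative sketch only - not the PR's code. A periodic check that fails a recovery
// when no request has touched it within the configured activity timeout (30m by default).
class RecoveryMonitorSketch implements Runnable {
    interface OngoingRecovery {
        long lastAccessTimeMillis(); // updated whenever a recovery request is processed
        void fail(String reason);    // marks the recovery as failed so it can be retried elsewhere
    }

    private final OngoingRecovery recovery;
    private final long activityTimeoutMillis;

    RecoveryMonitorSketch(OngoingRecovery recovery, long activityTimeoutMillis) {
        this.recovery = recovery;
        this.activityTimeoutMillis = activityTimeoutMillis;
    }

    @Override
    public void run() {
        long idleMillis = System.currentTimeMillis() - recovery.lastAccessTimeMillis();
        if (idleMillis > activityTimeoutMillis) {
            // no progress within the timeout: give up instead of hanging forever on dropped requests
            recovery.fail("no activity for " + idleMillis + "ms");
        }
        // otherwise: a scheduler (omitted here) re-runs this check periodically
    }
}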

I'd love it if extra focus was given to the engine changes while reviewing - I'm not 100% familiar with the implications of the code for the underlying Lucene state.

There is also one nocommit regarding the Java serializability of a Version object (used by DiscoveryNode). We rely on Java serialization for exceptions, and this makes the ConnectTransportException unserializable because of its DiscoveryNode field. This can be fixed in another change, but we need to discuss how.

One more todo left: add a reference to the resiliency page.

@bleskes bleskes added v1.5.0 v2.0.0-beta1 review :Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. resiliency labels Dec 1, 2014

// we need custom serialization logic because org.apache.lucene.util.Version is not serializable
// nocommit - do we want to push this down to lucene?
private void writeObject(java.io.ObjectOutputStream out)
Contributor

I don't think this is going to happen - Lucene opted out of Serializable a long time ago and I don't think we should add it back. I'd rather drop our dependency on it, to be honest!

Contributor Author

I'm all for controlling the serialization better (i.e., having versioning support), but that's a hard thing. At the moment we can't serialize transport-related exceptions, so imho we should fix this first. Since the version's variables are final, I had to move this to the DiscoveryNode level.
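For context on the workaround being discussed, here is a minimal, illustrative sketch (not the PR's actual code) of the usual pattern for a class holding a non-serializable field: write a stable numeric id instead of the object, and rebuild the transient field when the object is read back. The VersionLookup helper is a hypothetical stand-in for whatever resolves an id back to a full version.

import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.Serializable;

// Illustrative only: keep the non-serializable lucene version out of the serialized form
// and reconstruct it from a stable id during deserialization.
class SerializableVersionRef implements Serializable {
    private final int versionId;                                    // stable identifier, serializable
    private transient org.apache.lucene.util.Version luceneVersion; // not Serializable

    SerializableVersionRef(int versionId, org.apache.lucene.util.Version luceneVersion) {
        this.versionId = versionId;
        this.luceneVersion = luceneVersion;
    }

    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();                               // restores versionId
        this.luceneVersion = VersionLookup.fromId(versionId); // rebuild the transient part
    }

    // Hypothetical lookup mapping ids back to lucene versions.
    static final class VersionLookup {
        static org.apache.lucene.util.Version fromId(int id) {
            return org.apache.lucene.util.Version.LATEST;     // placeholder resolution for the sketch
        }
    }
}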

@s1monw
Contributor

s1monw commented Dec 1, 2014

left a bunch of comments - I love the test :)

@bleskes
Contributor Author

bleskes commented Dec 2, 2014

@s1monw I pushed an update based on feedback. Note that I also modified the CancelableThreads a bit when extracting it. It's not identical.

@s1monw
Contributor

s1monw commented Dec 2, 2014

@bleskes I looked at the two commits. I like the second one but not the first one. I think your changes to the creation model make things even more complex and error-prone than they already are. The engine should not be started / not started / stopped with some state in between that might or might not be cleaned up. I think we should have a dedicated class, which we might even be able to use without all the guice stuff, that initializes everything it needs in the constructor. It might even create a new instance if we update settings etc. and tear down everything during that time. That is much cleaner, and then you can do the start/stop logic cleanup on top.
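To make the suggestion concrete, a rough illustration (assumed names, not actual Elasticsearch code) of the "fully initialized in the constructor, replaced rather than mutated" model: a settings change builds a fresh instance and atomically swaps it in, and the old instance is torn down, so there is never a half-started object with in-between lifecycle state.

import java.util.concurrent.atomic.AtomicReference;

// Rough illustration of the suggested model: no start/stop states on a long-lived object,
// just instances that are created ready to use and replaced wholesale.
class EngineHolderSketch {
    static final class EngineSketch implements AutoCloseable {
        private final long refreshIntervalMillis;

        EngineSketch(long refreshIntervalMillis) {
            // everything the instance needs is set up right here, in the constructor
            this.refreshIntervalMillis = refreshIntervalMillis;
        }

        @Override
        public void close() {
            // release whatever resources this instance owns
        }
    }

    private final AtomicReference<EngineSketch> current =
            new AtomicReference<>(new EngineSketch(1000));

    void onSettingsUpdate(long newRefreshIntervalMillis) {
        EngineSketch fresh = new EngineSketch(newRefreshIntervalMillis); // build a new one
        EngineSketch previous = current.getAndSet(fresh);                // swap atomically
        previous.close();                                                // tear down the old one
    }
}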

@bleskes bleskes force-pushed the recovery_network_disconnect branch 2 times, most recently from b2e63d4 to 4237210 Compare December 4, 2014 16:53
@bleskes
Contributor Author

bleskes commented Dec 10, 2014

@s1monw I rebased this against master and adapted it to the changes in #8784. I also moved all recovery settings to one place, i.e., RecoverySettings.


/**
*/
@SuppressWarnings("deprecation")
public class Version implements Serializable {
Contributor

is this still needed?

Contributor Author

Do you mean removing the Serializable? Yeah, because it's not actually serializable - it depends on org.apache.lucene.util.Version.

@s1monw s1monw self-assigned this Dec 15, 2014
@@ -754,6 +733,16 @@ public void performRecoveryPrepareForTranslog() throws ElasticsearchException {
engine.start();
}

/** called if recovery has to be restarted after network error / delay */
public void performRecoveryRestart() {
Contributor

I still wonder if it is needed to stop the engine - can't we just replay the translog more than once?

Contributor Author

Just replaying the translog is possible, but that would mean we'd have to make sure the primary doesn't flush in the meantime (the 10s before we retry). Right now, a disconnect releases all resources on the primary and restarts from scratch. I still feel this is the right way to go, but I'm happy to hear alternatives.
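As an aside, a rough sketch of the restart-and-retry behaviour described here (illustrative only; the real change plugs into the recovery machinery rather than a plain executor): a network-level failure resets the attempt's state and schedules the whole recovery to start again after the retry delay.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative sketch: on a network-level failure, release everything held by the current
// attempt and retry the full recovery after a delay, instead of trying to resume mid-stream
// while the primary may have flushed in the meantime.
class RecoveryRetrySketch {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final long retryDelaySeconds = 5; // matches the default retry period mentioned above

    void recover(Runnable startRecoveryFromScratch, Runnable resetRecoveryState) {
        try {
            startRecoveryFromScratch.run();
        } catch (RuntimeException networkFailure) {
            resetRecoveryState.run(); // put the target back into its pre-start state
            scheduler.schedule(() -> recover(startRecoveryFromScratch, resetRecoveryState),
                    retryDelaySeconds, TimeUnit.SECONDS);
        }
    }
}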

@@ -412,4 +428,117 @@ private void validateIndexRecoveryState(RecoveryState.Index indexState) {
assertThat(indexState.percentBytesRecovered(), greaterThanOrEqualTo(0.0f));
assertThat(indexState.percentBytesRecovered(), lessThanOrEqualTo(100.0f));
}

@Test
@TestLogging("indices.recovery:TRACE")
Contributor

is this needed?

@s1monw
Contributor

s1monw commented Dec 15, 2014

I left some more comments

@bleskes
Contributor Author

bleskes commented Dec 15, 2014

@s1monw I pushed another update. Responded to some comments as well.

@bleskes bleskes force-pushed the recovery_network_disconnect branch 5 times, most recently from e3551aa to b3da055 Compare January 9, 2015 12:39
This commit adds a test that simulates disconnecting nodes and dropping requests during the various stages of recovery, and solves all the issues that were raised by it. In short:

1) Ongoing recoveries will be scheduled for retry upon network disconnect. The default retry period is 5s (cross-node connections are checked every 10s by default).
2) Sometimes the disconnect happens after the target engine has started (but the shard is still in recovery). For simplicity, I opted to restart the recovery from scratch (where little to no files will be copied again, because they were just synced). To do so, I had to add a stop method to the internal engine, which doesn't free the underlying store (putting the engine back to its pre-start status).
3) To protect against dropped requests, a Recovery Monitor was added that fails a recovery if no progress has been made in the last 30m (by default), which is equivalent to the long timeouts we use in recovery requests.
4) When a shard fails on a node, we try to assign it to another node. If no such node is available, the shard will remain unassigned, causing the target node to clean up any in-memory state for it (files on disk remain). At the moment the shard remains unassigned until another cluster state change happens, which will re-assign it to the node in question, but if no such change happens the shard stays stuck at unassigned. The commit adds an extra delayed reroute in such cases to make sure the shard will be reassigned.
@bleskes bleskes force-pushed the recovery_network_disconnect branch from b3da055 to ba64f52 Compare January 9, 2015 12:41
@bleskes
Contributor Author

bleskes commented Jan 9, 2015

@s1monw I rebased and squashed against the latest master. Would be great if you can give it another round.

@s1monw
Contributor

s1monw commented Jan 9, 2015

cool stuff LGTM

@bleskes bleskes closed this in 102e7ad Jan 10, 2015
bleskes added a commit to bleskes/elasticsearch that referenced this pull request Jan 12, 2015
This commit adds a test that simulates disconnecting nodes and dropping requests during the various stages of recovery, and solves all the issues that were raised by it. In short:

1) Ongoing recoveries will be scheduled for retry upon network disconnect. The default retry period is 5s (cross-node connections are checked every 10s by default).
2) Sometimes the disconnect happens after the target engine has started (but the shard is still in recovery). For simplicity, I opted to restart the recovery from scratch (where little to no files will be copied again, because they were just synced).
3) To protect against dropped requests, a Recovery Monitor was added that fails a recovery if no progress has been made in the last 30m (by default), which is equivalent to the long timeouts we use in recovery requests.
4) When a shard fails on a node, we try to assign it to another node. If no such node is available, the shard will remain unassigned, causing the target node to clean up any in-memory state for it (files on disk remain). At the moment the shard remains unassigned until another cluster state change happens, which will re-assign it to the node in question, but if no such change happens the shard stays stuck at unassigned. The commit adds an extra delayed reroute in such cases to make sure the shard will be reassigned.
5) Moved all recovery-related settings to RecoverySettings.

Closes elastic#8720
bleskes added a commit to bleskes/elasticsearch that referenced this pull request Jan 13, 2015
@bleskes bleskes deleted the recovery_network_disconnect branch January 14, 2015 17:52
bleskes added a commit that referenced this pull request Jan 16, 2015
bleskes added a commit to bleskes/elasticsearch that referenced this pull request Jan 30, 2015
elastic#8720 introduced a timeout mechanism for ongoing recoveries, based on a last access time variable. In the many iterations on that PR the update of the access time was lost. This adds it back, including a test that should have been there in the first place.
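A small sketch of the fix being described (field and method names are illustrative, not the commit's actual code): every incoming recovery request "touches" the recovery's last access time, so the activity-based timeout only fires when requests really stop arriving.

// Illustrative sketch of keeping the last access time fresh on every recovery request.
class RecoveryStatusSketch {
    private volatile long lastAccessTimeMillis = System.currentTimeMillis();

    void onRecoveryRequest() {
        lastAccessTimeMillis = System.currentTimeMillis(); // the update that had gone missing
    }

    boolean idleLongerThan(long timeoutMillis) {
        return System.currentTimeMillis() - lastAccessTimeMillis > timeoutMillis;
    }
}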
bleskes added a commit that referenced this pull request Jan 30, 2015
#8720 introduced a timeout mechanism for ongoing recoveries, based on a last access time variable. In the many iterations on that PR the update of the access time was lost. This adds it back, including a test that should have been there in the first place.

Closes #9506
bleskes added a commit that referenced this pull request Jan 30, 2015
#8720 introduced a timeout mechanism for ongoing recoveries, based on a last access time variable. In the many iterations on that PR the update of the access time was lost. This adds it back, including a test that should have been there in the first place.

Closes #9506
bleskes added a commit that referenced this pull request Feb 5, 2015
disabling this until further discussion. Recent failures probably relate to #9211 & #8720 (+ friends)
bleskes added a commit that referenced this pull request Feb 5, 2015
disabling this until further discussion. Recent failures probably relate to #9211 & #8720 (+ friends)
@clintongormley clintongormley changed the title Recovery: be more resilient to partial network partitions Be more resilient to partial network partitions Jun 7, 2015
Labels
:Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. >enhancement resiliency v1.5.0 v2.0.0-beta1