
Be more resilient to partial network partitions #8720

Closed
wants to merge 1 commit into master from recovery_network_disconnect

Conversation

bleskes
Contributor

@bleskes bleskes commented Dec 1, 2014

When a node experiences network issues, the master detects it and removes the node from the cluster. That causes all ongoing recoveries from and to that node to be stopped, and a new location is found for the relevant shards. However, in the case of a partial network partition, where there are connectivity issues between the source and target node of a recovery but not between those nodes and the current master, things may go wrong. While the nodes successfully restore the connection, the ongoing recoveries may have encountered issues.

This PR adds a test that simulates disconnecting nodes and dropping requests during the various stages of recovery, and solves all the issues that were raised by it. In short:

  1. Ongoing recoveries will be scheduled for retry upon network disconnect. The default retry period is 5s (cross-node connections are checked every 10s by default).
  2. Sometimes the disconnect happens after the target engine has started (but the shard is still in recovery). For simplicity, I opted to restart the recovery from scratch (where little to no files will be copied again, because they were just synced).
  3. To protect against dropped requests, a Recovery Monitor was added that fails a recovery if no progress has been made in the last 30m (by default), which is equivalent to the long timeouts we use in recovery requests (a rough sketch of this idea follows the list).
  4. When a shard fails on a node, we try to assign it to another node. If no such node is available, the shard will remain unassigned, causing the target node to clean up any in-memory state for it (files on disk remain). At the moment the shard remains unassigned until another cluster state change happens, which will re-assign it to the node in question, but if no such change happens the shard stays stuck at unassigned. The commit adds an extra delayed reroute in such cases to make sure the shard will be reassigned.
  5. Moved all recovery-related settings to RecoverySettings.
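A minimal sketch of the monitor idea from point 3, for readers skimming the PR. The names (OngoingRecovery, lastAccessTimeMillis, fail) are illustrative assumptions, not the PR's actual API: a periodic task compares the recovery's last access time against the activity timeout and fails the recovery when nothing has touched it for too long.

// Illustrative sketch only - not the PR's code. A periodic check that fails a recovery
// when no request has touched it within the configured activity timeout (30m by default).
class RecoveryMonitorSketch implements Runnable {
    interface OngoingRecovery {
        long lastAccessTimeMillis(); // updated whenever a recovery request is processed
        void fail(String reason);    // marks the recovery as failed so it can be retried elsewhere
    }

    private final OngoingRecovery recovery;
    private final long activityTimeoutMillis;

    RecoveryMonitorSketch(OngoingRecovery recovery, long activityTimeoutMillis) {
        this.recovery = recovery;
        this.activityTimeoutMillis = activityTimeoutMillis;
    }

    @Override
    public void run() {
        long idleMillis = System.currentTimeMillis() - recovery.lastAccessTimeMillis();
        if (idleMillis > activityTimeoutMillis) {
            // no progress within the timeout: give up instead of hanging forever on dropped requests
            recovery.fail("no activity for " + idleMillis + "ms");
        }
        // otherwise: a scheduler (omitted here) re-runs this check periodically
    }
}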

I'd love it if extra focus was given to the engine changes while reviewing - I'm not 100% familiar with the implications of the code for the underlying Lucene state.

There is also one nocommit regarding the Java serializability of a Version object (used by DiscoveryNode). We rely on Java serialization for exceptions, and this makes the ConnectTransportException unserializable because of its DiscoveryNode field. This can be fixed in another change, but we need to discuss how.

One more todo left: add a reference to the resiliency page.

@bleskes bleskes added v1.5.0 v2.0.0-beta1 review :Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. resiliency labels Dec 1, 2014

// we need custom serialization logic because org.apache.lucene.util.Version is not serializable
// nocommit - do we want to push this down to lucene?
private void writeObject(java.io.ObjectOutputStream out)
Contributor

I don't think this is going to happen - Lucene opted out of Serializable a long time ago and I don't think we should add it back. I'd rather drop our dependency on it, to be honest!

Contributor Author

I'm all for controlling the serialization better (i.e., having versioning support), but that's a hard thing. At the moment we can't serialize transport-related exceptions, so imho we should fix this first. Since the version's variables are final, I had to move this to the DiscoveryNode level.
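For context on the workaround being discussed, here is a minimal, illustrative sketch (not the PR's actual code) of the usual pattern for a class holding a non-serializable field: write a stable numeric id instead of the object, and rebuild the transient field when the object is read back. The VersionLookup helper is a hypothetical stand-in for whatever resolves an id back to a full version.

import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.Serializable;

// Illustrative only: keep the non-serializable lucene version out of the serialized form
// and reconstruct it from a stable id during deserialization.
class SerializableVersionRef implements Serializable {
    private final int versionId;                                    // stable identifier, serializable
    private transient org.apache.lucene.util.Version luceneVersion; // not Serializable

    SerializableVersionRef(int versionId, org.apache.lucene.util.Version luceneVersion) {
        this.versionId = versionId;
        this.luceneVersion = luceneVersion;
    }

    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();                               // restores versionId
        this.luceneVersion = VersionLookup.fromId(versionId); // rebuild the transient part
    }

    // Hypothetical lookup mapping ids back to lucene versions.
    static final class VersionLookup {
        static org.apache.lucene.util.Version fromId(int id) {
            return org.apache.lucene.util.Version.LATEST;     // placeholder resolution for the sketch
        }
    }
}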

@s1monw
Contributor

s1monw commented Dec 1, 2014

left a bunch of comments - I love the test :)

@bleskes
Contributor Author

bleskes commented Dec 2, 2014

@s1monw I pushed an update based on feedback. Note that I also modified the CancelableThreads a bit when extracting it. It's not identical.

@s1monw
Contributor

s1monw commented Dec 2, 2014

@bleskes I looked at the two commits. I like the second one but not the first one. I think your changes to the creation model make things even more complex and error-prone than they already are. The engine should not be started / not started / stopped with some state in between that might or might not be cleaned up. I think we should have a dedicated class, which we might even be able to use without all the guice stuff, that initializes everything it needs in the constructor. It might even create a new instance if we update settings etc. and tear down everything during that time. That is much cleaner, and then you can do the start/stop logic cleanup on top.
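To make the suggestion concrete, a rough illustration (assumed names, not actual Elasticsearch code) of the "fully initialized in the constructor, replaced rather than mutated" model: a settings change builds a fresh instance and atomically swaps it in, and the old instance is torn down, so there is never a half-started object with in-between lifecycle state.

import java.util.concurrent.atomic.AtomicReference;

// Rough illustration of the suggested model: no start/stop states on a long-lived object,
// just instances that are created ready to use and replaced wholesale.
class EngineHolderSketch {
    static final class EngineSketch implements AutoCloseable {
        private final long refreshIntervalMillis;

        EngineSketch(long refreshIntervalMillis) {
            // everything the instance needs is set up right here, in the constructor
            this.refreshIntervalMillis = refreshIntervalMillis;
        }

        @Override
        public void close() {
            // release whatever resources this instance owns
        }
    }

    private final AtomicReference<EngineSketch> current =
            new AtomicReference<>(new EngineSketch(1000));

    void onSettingsUpdate(long newRefreshIntervalMillis) {
        EngineSketch fresh = new EngineSketch(newRefreshIntervalMillis); // build a new one
        EngineSketch previous = current.getAndSet(fresh);                // swap atomically
        previous.close();                                                // tear down the old one
    }
}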

@bleskes bleskes force-pushed the recovery_network_disconnect branch 2 times, most recently from b2e63d4 to 4237210 Compare December 4, 2014 16:53
@bleskes
Contributor Author

bleskes commented Dec 10, 2014

@s1monw I rebased this against master and adapted it to the changes in #8784. I also moved all recovery settings to one place, i.e., RecoverySettings.


/**
*/
@SuppressWarnings("deprecation")
public class Version implements Serializable {
Contributor

is this still needed?

Contributor Author

Do you mean removing the Serializable? Yeah, because it's not actually serializable - it depends on org.apache.lucene.util.Version.

@s1monw s1monw self-assigned this Dec 15, 2014
@@ -754,6 +733,16 @@ public void performRecoveryPrepareForTranslog() throws ElasticsearchException {
engine.start();
}

/** called if recovery has to be restarted after network error / delay */
public void performRecoveryRestart() {
Contributor

I still wonder if it is needed to stop the engine - can't we just replay the translog more than once?

Contributor Author

Just replaying the translog is possible, but that would mean we'd have to make sure the primary doesn't flush in the meantime (the 10s before we retry). Right now, a disconnect releases all resources on the primary and restarts from scratch. I still feel this is the right way to go, but I'm happy to hear alternatives.
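As an aside, a rough sketch of the restart-and-retry behaviour described here (illustrative only; the real change plugs into the recovery machinery rather than a plain executor): a network-level failure resets the attempt's state and schedules the whole recovery to start again after the retry delay.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative sketch: on a network-level failure, release everything held by the current
// attempt and retry the full recovery after a delay, instead of trying to resume mid-stream
// while the primary may have flushed in the meantime.
class RecoveryRetrySketch {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final long retryDelaySeconds = 5; // matches the default retry period mentioned above

    void recover(Runnable startRecoveryFromScratch, Runnable resetRecoveryState) {
        try {
            startRecoveryFromScratch.run();
        } catch (RuntimeException networkFailure) {
            resetRecoveryState.run(); // put the target back into its pre-start state
            scheduler.schedule(() -> recover(startRecoveryFromScratch, resetRecoveryState),
                    retryDelaySeconds, TimeUnit.SECONDS);
        }
    }
}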

@@ -412,4 +428,117 @@ private void validateIndexRecoveryState(RecoveryState.Index indexState) {
assertThat(indexState.percentBytesRecovered(), greaterThanOrEqualTo(0.0f));
assertThat(indexState.percentBytesRecovered(), lessThanOrEqualTo(100.0f));
}

@Test
@TestLogging("indices.recovery:TRACE")
Contributor

is this needed?

@s1monw
Contributor

s1monw commented Dec 15, 2014

I left some more comments

@bleskes
Contributor Author

bleskes commented Dec 15, 2014

@s1monw I pushed another update. Responded to some comments as well.

@bleskes bleskes force-pushed the recovery_network_disconnect branch 5 times, most recently from e3551aa to b3da055 Compare January 9, 2015 12:39
This commit adds a test that simulates disconnecting nodes and dropping requests during the various stages of recovery, and solves all the issues that were raised by it. In short:

1) Ongoing recoveries will be scheduled for retry upon network disconnect. The default retry period is 5s (cross-node connections are checked every 10s by default).
2) Sometimes the disconnect happens after the target engine has started (but the shard is still in recovery). For simplicity, I opted to restart the recovery from scratch (where little to no files will be copied again, because they were just synced). To do so, I had to add a stop method to the internal engine, which doesn't free the underlying store (putting the engine back to its pre-start status).
3) To protect against dropped requests, a Recovery Monitor was added that fails a recovery if no progress has been made in the last 30m (by default), which is equivalent to the long timeouts we use in recovery requests.
4) When a shard fails on a node, we try to assign it to another node. If no such node is available, the shard will remain unassigned, causing the target node to clean up any in-memory state for it (files on disk remain). At the moment the shard remains unassigned until another cluster state change happens, which will re-assign it to the node in question, but if no such change happens the shard stays stuck at unassigned. The commit adds an extra delayed reroute in such cases to make sure the shard will be reassigned.
@bleskes bleskes force-pushed the recovery_network_disconnect branch from b3da055 to ba64f52 Compare January 9, 2015 12:41
@bleskes
Contributor Author

bleskes commented Jan 9, 2015

@s1monw I rebased and squashed against the latest master. Would be great if you can give it another round.

@s1monw
Contributor

s1monw commented Jan 9, 2015

cool stuff LGTM

@bleskes bleskes closed this in 102e7ad Jan 10, 2015
bleskes added a commit to bleskes/elasticsearch that referenced this pull request Jan 12, 2015
This commit adds a test that simulates disconnecting nodes and dropping requests during the various stages of recovery, and solves all the issues that were raised by it. In short:

1) Ongoing recoveries will be scheduled for retry upon network disconnect. The default retry period is 5s (cross-node connections are checked every 10s by default).
2) Sometimes the disconnect happens after the target engine has started (but the shard is still in recovery). For simplicity, I opted to restart the recovery from scratch (where little to no files will be copied again, because they were just synced).
3) To protect against dropped requests, a Recovery Monitor was added that fails a recovery if no progress has been made in the last 30m (by default), which is equivalent to the long timeouts we use in recovery requests.
4) When a shard fails on a node, we try to assign it to another node. If no such node is available, the shard will remain unassigned, causing the target node to clean up any in-memory state for it (files on disk remain). At the moment the shard remains unassigned until another cluster state change happens, which will re-assign it to the node in question, but if no such change happens the shard stays stuck at unassigned. The commit adds an extra delayed reroute in such cases to make sure the shard will be reassigned.
5) Moved all recovery-related settings to RecoverySettings.

Closes elastic#8720
bleskes added a commit to bleskes/elasticsearch that referenced this pull request Jan 13, 2015
@bleskes bleskes deleted the recovery_network_disconnect branch January 14, 2015 17:52
bleskes added a commit that referenced this pull request Jan 16, 2015
bleskes added a commit to bleskes/elasticsearch that referenced this pull request Jan 30, 2015
elastic#8720 introduced a timeout mechanism for ongoing recoveries, based on a last access time variable. In the many iterations on that PR the update of the access time was lost. This adds it back, including a test that should have been there in the first place.
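A small sketch of the fix being described (field and method names are illustrative, not the commit's actual code): every incoming recovery request "touches" the recovery's last access time, so the activity-based timeout only fires when requests really stop arriving.

// Illustrative sketch of keeping the last access time fresh on every recovery request.
class RecoveryStatusSketch {
    private volatile long lastAccessTimeMillis = System.currentTimeMillis();

    void onRecoveryRequest() {
        lastAccessTimeMillis = System.currentTimeMillis(); // the update that had gone missing
    }

    boolean idleLongerThan(long timeoutMillis) {
        return System.currentTimeMillis() - lastAccessTimeMillis > timeoutMillis;
    }
}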
bleskes added a commit that referenced this pull request Jan 30, 2015
#8720 introduced a timeout mechanism for ongoing recoveries, based on a last access time variable. In the many iterations on that PR the update of the access time was lost. This adds it back, including a test that should have been there in the first place.

Closes #9506
bleskes added a commit that referenced this pull request Jan 30, 2015
#8720 introduced a timeout mechanism for ongoing recoveries, based on a last access time variable. In the many iterations on that PR the update of the access time was lost. This adds it back, including a test that should have been there in the first place.

Closes #9506
bleskes added a commit that referenced this pull request Feb 5, 2015
disabling this until further discussion. Recent failures probably relate to #9211 & #8720 (+ friends)
bleskes added a commit that referenced this pull request Feb 5, 2015
disabling this until further discussion. Recent failures probably relate to #9211 & #8720 (+ friends)
@clintongormley clintongormley changed the title Recovery: be more resilient to partial network partitions Be more resilient to partial network partitions Jun 7, 2015
Labels
:Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. >enhancement resiliency v1.5.0 v2.0.0-beta1