
Fix Overly Aggressive Request DeDuplication #51270

Merged
merged 3 commits into elastic:master on Jan 22, 2020

Conversation

original-brownbear
Member

On master failover we have to resend all the shard-failed messages, but the transport requests remain the same in the eyes of `equals`. If the master failover is registered and the requests to the new master are sent before all the callbacks have executed and the requests to the old master have been removed from the deduplicator, then the requests to the new master will incorrectly fail and the snapshot gets stuck.

Closes #51253
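
To illustrate the race described above, here is a minimal sketch of a deduplicator keyed on request equality. It is not the actual Elasticsearch `TransportRequestDeduplicator` (which also fans the result out to every waiting listener); the class and method names are illustrative only. It shows why a re-sent request that `equals` one already in flight never reaches the wire until the tracked entry is removed or cleared.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BiConsumer;

/**
 * Minimal sketch, not the real TransportRequestDeduplicator: tracks in-flight
 * requests by their equals()/hashCode() and drops sends for equal requests
 * until the tracked entry goes away.
 */
final class RequestDeduplicator<T> {

    /** In-flight requests; the key's equals()/hashCode() define "same request". */
    private final ConcurrentHashMap<T, Boolean> inFlight = new ConcurrentHashMap<>();

    /** Runs {@code send} only if no equal request is currently tracked. */
    void executeOnce(T request, Runnable onDone, BiConsumer<T, Runnable> send) {
        if (inFlight.putIfAbsent(request, Boolean.TRUE) == null) {
            send.accept(request, () -> {      // completion callback
                inFlight.remove(request);
                onDone.run();
            });
        }
        // Otherwise the call is swallowed: an equal request is still tracked,
        // which is what happened to the shard-failed messages re-sent after a
        // master failover before this fix.
    }

    /** Forget everything tracked so the next executeOnce() sends again. */
    void clear() {
        inFlight.clear();
    }
}
```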

@original-brownbear added the >bug, :Distributed/Snapshot/Restore, v8.0.0, v7.6.0, and v7.7.0 labels on Jan 21, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

@original-brownbear
Member Author

This was caused by actually fixing the deduplicator in #50788

@original-brownbear changed the title from "Fix Overly Optimistic Request DeDuplication" to "Fix Overly Aggressive Request DeDuplication" on Jan 21, 2020
* Remove all tracked requests from this instance so that the first time {@link #executeOnce} is invoked with any request it triggers
* an actual request execution. Use this e.g. for requests to master that need to be sent again on master failover.
*/
public void clear() {
Member Author


I know there's just one use case for this now, but just dropping all deduplication seems to be the right approach to master failovers in general if we ever want to reuse this for e.g. put mapping calls or so.
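
Hypothetical usage of the deduplicator sketch shown after the PR description (request strings and class names are illustrative, not `SnapshotShardsService` code): an equal re-send is swallowed while the first request is still tracked, and only goes out again after `clear()` is called on master failover.

```java
import java.util.function.BiConsumer;

public final class DeduplicatorDemo {
    public static void main(String[] args) {
        RequestDeduplicator<String> dedup = new RequestDeduplicator<>();
        // The completion callback `done` is intentionally never run here,
        // modelling a request the old master never acknowledged.
        BiConsumer<String, Runnable> send = (req, done) -> System.out.println("sending " + req);

        dedup.executeOnce("shard-0-failed", () -> {}, send); // sent to the old master
        dedup.executeOnce("shard-0-failed", () -> {}, send); // swallowed: equal request in flight

        dedup.clear();                                       // new master elected: forget old sends
        dedup.executeOnce("shard-0-failed", () -> {}, send); // sent again, to the new master
    }
}
```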

Member

@dnhatn dnhatn left a comment


LGTM.

@original-brownbear
Member Author

Thanks Nhat!

@original-brownbear original-brownbear merged commit fac1247 into elastic:master Jan 22, 2020
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Jan 22, 2020
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Jan 22, 2020
@@ -358,6 +358,9 @@ private void syncShardStatsOnNewMaster(ClusterChangedEvent event) {
return;
}

// Clear request deduplicator since we need to send all requests that were potentially not handled by the previous
// master again
remoteFailedRequestDeduplicator.clear();
Contributor


I don't understand why this is needed. UpdateSnapshotStatusAction is a TransportMasterNodeAction and should resend the request in case of master failover.

Member Author

@original-brownbear original-brownbear Jan 22, 2020


I think currently we're only retrying on FailedToCommitClusterStateException and NotMasterException, but not on things like the node-closed exception. That's why we have the manual retry/re-send of these messages in SnapshotShardsService in the first place, don't we?

Contributor


I'm confused by this stack trace. It shows that the node that's sending the request is closed, no? Why would we want to retry sending a request from a node that's shut down?

Member Author


Sorry, I pasted the wrong trace:

  1> [2020-01-21T11:49:53,151][WARN ][o.e.s.SnapshotShardsService] [node_td3] [test-repo:test-snap/v_2yOVZsQHu6gk_aFsmkKg] [ShardSnapshotStatus[state=SUCCESS, nodeId=8MquPimwRTCtgnobCaxJlw, reason=null, generation=T-25oH4kSg62xXEtINmviQ]] failed to update snapshot state
  1> org.elasticsearch.transport.RemoteTransportException: [node_tm0][127.0.0.1:41909][internal:cluster/snapshot/update_snapshot_status]
  1> Caused by: org.elasticsearch.node.NodeClosedException: node closed {node_tm0}{VGisb-JmTiOkIc_vJP8FhA}{Ucpv57LoTEy1J_cuDT1L8A}{127.0.0.1}{127.0.0.1:41909}{im}
  1> 	at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$2.onClusterServiceClose(TransportMasterNodeAction.java:213) ~[main/:?]
  1> 	at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onClusterServiceClose(ClusterStateObserver.java:318) ~[main/:?]
  1> 	at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onClose(ClusterStateObserver.java:237) ~[main/:?]
  1> 	at org.elasticsearch.cluster.service.ClusterApplierService.doStop(ClusterApplierService.java:186) ~[main/:?]
  1> 	at org.elasticsearch.common.component.AbstractLifecycleComponent.stop(AbstractLifecycleComponent.java:79) ~[main/:?]
  1> 	at org.elasticsearch.cluster.service.ClusterService.doStop(ClusterService.java:96) ~[main/:?]
  1> 	at org.elasticsearch.common.component.AbstractLifecycleComponent.stop(AbstractLifecycleComponent.java:79) ~[main/:?]
  1> 	at org.elasticsearch.node.Node.stop(Node.java:805) ~[main/:?]
  1> 	at org.elasticsearch.node.Node.close(Node.java:829) ~[main/:?]
  1> 	at org.elasticsearch.test.InternalTestCluster$NodeAndClient.close(InternalTestCluster.java:954) ~[framework-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.test.InternalTestCluster.stopNodesAndClients(InternalTestCluster.java:1573) ~[framework-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.test.InternalTestCluster.stopNodesAndClient(InternalTestCluster.java:1563) ~[framework-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.test.InternalTestCluster.stopCurrentMasterNode(InternalTestCluster.java:1493) ~[framework-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.snapshots.DedicatedClusterSnapshotRestoreIT.testMasterShutdownDuringSnapshot(DedicatedClusterSnapshotRestoreIT.java:849) ~[test/:?]
  1> 	at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
  1> 	at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:?]
  1> 	at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
  1> 	at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
  1> 	at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750) ~[randomizedtesting-runner-2.7.4.jar:?]
  1> 	at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:938) ~[randomizedtesting-runner-2.7.4.jar:?]
  1> 	at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:974) ~[randomizedtesting-runner-2.7.4.jar:?]
  1> 	at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988) ~[randomizedtesting-runner-2.7.4.jar:?]

We have the same for the remote end as well. I'm not 100% sure we can get that exception outside of a test (because of the shutdown order), but as you can see, at least in the test the requests are clearly not retried when the master node tm_0 restarts and just go to the listener's onFailure. We could improve the master action to retry this case as well, but as long as we already have the fail-safe retry on master failover in the snapshot shards service, we should make that actually work?

Contributor

@ywelsch ywelsch Jan 22, 2020


This looks like a bug in TransportMasterNodeAction, then. In the case where a node is shut down, we first shut down the cluster service and only then shut down the transport service. This means that a node that has sent a MasterNodeRequest to a master node that is in the process of shutting down can receive a TransportException whose cause is a NodeClosedException, for which we don't trigger a retry (we only do that for a ConnectTransportException). Can we fix that instead of this hack here? That would also fix this situation for any other master-level request. Long term, the logic in syncShardStatsOnNewMaster can go away; I think it's mainly still here because the ShardSnapshotStatus action was not a TransportMasterNodeAction in some older versions.

Member Author

@original-brownbear original-brownbear Jan 22, 2020


@ywelsch sounds good, I'll look into fixing the TransportMasterNodeAction and removing the retry in the snapshot shards service shortly, then :)
Still, so long as we have syncShardStatsOnNewMaster, I feel like we should have this hack in place, as without it that method doesn't really make sense?
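
A hedged sketch of the retry decision discussed in this thread. This is not the actual TransportMasterNodeAction logic, just an illustration that assumes the Elasticsearch exception classes named above are on the classpath: besides connection-level failures, a remote NodeClosedException (the master shut down while handling the request) would also be treated as retriable once a new master is known.

```java
import org.elasticsearch.node.NodeClosedException;
import org.elasticsearch.transport.ConnectTransportException;

// Illustrative only: a helper deciding whether a failed master-level request
// should be retried against the next elected master.
final class MasterRequestRetryPolicy {

    static boolean shouldRetryOnNextMaster(Throwable failure) {
        // Remote failures typically arrive wrapped (e.g. in a
        // RemoteTransportException), so inspect the cause when there is one.
        Throwable cause = failure.getCause() != null ? failure.getCause() : failure;
        return cause instanceof ConnectTransportException // request never reached the master
            || cause instanceof NodeClosedException;      // master closed while handling it
    }
}
```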

original-brownbear added a commit that referenced this pull request Jan 22, 2020
original-brownbear added a commit that referenced this pull request Jan 22, 2020
Labels
>bug, :Distributed/Snapshot/Restore, v7.6.0, v7.7.0, v8.0.0-alpha1
Development

Successfully merging this pull request may close these issues.

[CI] DedicatedClusterSnapshotRestoreIT.testMasterShutdownDuringSnapshot failure
5 participants