Add two phased commit to Cluster State publishing #13062

bleskes · 2015-08-23T20:58:07Z

When publishing a new cluster state, the master sends it to all the nodes of the cluster, noting how many master nodes responded successfully. The nodes do not yet process the new cluster state, but rather park it in memory. As soon as at least minimum master nodes have acked the cluster state change, it is committed and a commit request is sent to all the nodes that have responded so far (and to those that respond later). Upon receiving the commit request, the nodes continue to process the cluster state change as they did before this change.

A few notable comments:

1. For this change to have effect, min master nodes must be configured.
2. All basic cluster state validation is done in the first phase of publish and is thus now part of ShardOperationResult.
3. A new COMMIT_TIMEOUT setting is introduced, dictating how long a master should wait for nodes to ack the first phase. Unlike PUBLISH_TIMEOUT, if waiting for a commit times out, the cluster state change will be rejected.
4. Failing to gather acks from a minimum number of master nodes will cause the master to step down, as it clearly doesn't have enough active followers.
5. Previously there was a short window between the moment a master lost its followers and the moment it stepped down because of node fault detection failures. In this short window, the master could process any change (but fail to publish it). This PR reduces this gap to 0.
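
The flow described above can be sketched roughly as follows. This is a simplified, single-threaded illustration and not the code from this PR: `TwoPhasePublishSketch`, `Node`, and `publish` are hypothetical names, the master is assumed to count itself towards min master nodes, and the commit timeout is modeled by simply running out of responses.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified model of two-phase cluster state publishing. */
class TwoPhasePublishSketch {

    /** A follower node that either acks the publish (phase one) or never responds. */
    static final class Node {
        final String name;
        final boolean masterEligible;
        final boolean responds;
        long pendingVersion = -1; // state parked in memory, not yet applied
        long appliedVersion = -1; // state applied once the commit arrives

        Node(String name, boolean masterEligible, boolean responds) {
            this.name = name;
            this.masterEligible = masterEligible;
            this.responds = responds;
        }

        /** Phase one: park the state and ack, without applying it. */
        boolean receivePublish(long version) {
            if (!responds) {
                return false;
            }
            pendingVersion = version;
            return true;
        }

        /** Phase two: apply the parked state if it matches the commit. */
        void receiveCommit(long version) {
            if (pendingVersion == version) {
                appliedVersion = version;
                pendingVersion = -1;
            }
        }
    }

    /**
     * Returns true if the state was committed. A false result stands in for the
     * commit timing out, in which case the change would be rejected.
     */
    static boolean publish(long version, List<Node> followers, int minMasterNodes) {
        int masterAcks = 1; // assumption: the publishing master counts itself
        boolean committed = masterAcks >= minMasterNodes;
        List<Node> parked = new ArrayList<>();
        for (Node node : followers) {
            if (!node.receivePublish(version)) {
                continue; // no ack from this node
            }
            if (node.masterEligible) {
                masterAcks++;
            }
            if (committed) {
                node.receiveCommit(version); // late responder, commit right away
                continue;
            }
            parked.add(node);
            if (masterAcks >= minMasterNodes) {
                committed = true;
                for (Node p : parked) {
                    p.receiveCommit(version); // commit to everyone that acked so far
                }
            }
        }
        return committed;
    }
}
```

In this sketch only acks from master-eligible nodes move the state towards being committed, while every node that acked receives the commit once the threshold is reached.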

I still have one nocommit and some docs to add, but I think we can start the review cycles.

@brwe @imotov and @jasontedor - can you have a careful look when you have time?

brwe · 2015-08-24T09:56:43Z

core/src/main/java/org/elasticsearch/discovery/DiscoverySettings.java

@@ -57,6 +69,7 @@ public DiscoverySettings(Settings settings, NodeSettingsService nodeSettingsServ
        nodeSettingsService.addListener(new ApplySettings());
        this.noMasterBlock = parseNoMasterBlock(settings.get(NO_MASTER_BLOCK, DEFAULT_NO_MASTER_BLOCK));
        this.publishTimeout = settings.getAsTime(PUBLISH_TIMEOUT, publishTimeout);
+        this.commitTimeout = settings.getAsTime(COMMIT_TIMEOUT, publishTimeout);


should this be settings.getAsTime(COMMIT_TIMEOUT, commitTimeout); ?

Yikes. Yes… :(
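
The issue raised here is that each timeout should fall back to its own default rather than to the other setting's value. A minimal sketch of that idea follows, also folding in the later commit titled "commit timeout default should never be larger than publishing timeout". Everything in it is illustrative: the class, the setting keys, and the 30s commit default are assumptions, and the real DiscoverySettings code reads from Settings rather than a Map.

```java
import java.util.Map;

/** Hypothetical sketch of resolving the two timeouts with independent defaults. */
class TimeoutDefaultsSketch {
    static final String PUBLISH_TIMEOUT = "publish_timeout"; // hypothetical key
    static final String COMMIT_TIMEOUT = "commit_timeout";   // hypothetical key
    static final long DEFAULT_PUBLISH_TIMEOUT_MS = 30_000L;  // 30s, as mentioned in this thread
    static final long DEFAULT_COMMIT_TIMEOUT_MS = 30_000L;   // assumed value

    static long publishTimeout(Map<String, Long> settings) {
        return settings.getOrDefault(PUBLISH_TIMEOUT, DEFAULT_PUBLISH_TIMEOUT_MS);
    }

    static long commitTimeout(Map<String, Long> settings) {
        // The commit default is capped so it never exceeds the publish timeout,
        // but an explicitly configured commit timeout is used as given.
        long cappedDefault = Math.min(DEFAULT_COMMIT_TIMEOUT_MS, publishTimeout(settings));
        return settings.getOrDefault(COMMIT_TIMEOUT, cappedDefault);
    }
}
```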


bleskes · 2015-08-24T15:25:09Z

@brwe @imotov @jasontedor FYI I removed the nocommit.

imotov · 2015-08-24T21:10:17Z

core/src/main/java/org/elasticsearch/discovery/zen/publish/PublishClusterStateAction.java

+        String stateUUID;
+
+        public CommitClusterStateRequest() {
+        }


This constructor doesn't seem to be necessary.

It's needed so that the request can be created on the receiving node. It's invoked via reflection.
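
To illustrate why an apparently unused no-arg constructor can still be required, here is a small sketch of reflective instantiation on the receiving side. The names `ReflectiveRequestSketch`, `CommitRequest`, `newInstance`, and `readFrom(String)` are hypothetical stand-ins, not the transport layer's actual API.

```java
import java.lang.reflect.Constructor;

/** Hypothetical sketch: a receiver creates an empty request reflectively, then fills it in. */
class ReflectiveRequestSketch {

    static class CommitRequest {
        String stateUUID;

        CommitRequest() {
            // Looks unused in the source, but is looked up by reflection below.
        }

        CommitRequest(String stateUUID) {
            this.stateUUID = stateUUID;
        }

        /** Stand-in for deserialization: populate fields from the wire payload. */
        void readFrom(String wirePayload) {
            this.stateUUID = wirePayload;
        }
    }

    /** Creates an instance via the no-arg constructor; fails if the class has none. */
    static <T> T newInstance(Class<T> type) {
        try {
            Constructor<T> ctor = type.getDeclaredConstructor();
            ctor.setAccessible(true);
            return ctor.newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("no usable no-arg constructor on " + type.getName(), e);
        }
    }
}
```

Removing the empty constructor from `CommitRequest` would still compile, but `newInstance` would then fail at runtime, which is the kind of breakage the reply above is guarding against.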

bleskes · 2015-08-25T13:05:10Z

I updated the docs. @clintongormley @jasontedor - I would love a native English speaker to review them - can you take a look?

jasontedor · 2015-08-25T13:35:48Z

@bleskes I left a review of the updated docs but I'm still in progress on a review of the code.

clintongormley · 2015-08-25T13:52:16Z

docs/reference/modules/discovery/zen.asciidoc

-to 30 seconds and can be changed dynamically through the
-<<cluster-update-settings,cluster update settings api>>
+the other nodes in the cluster. Each node receives the publish message, acknowledges
+it but do *not* yet apply it. If the master does not receive acknowledgement from


clintongormley · 2015-08-25T13:56:24Z

left some minor docs suggestions

brwe · 2015-08-26T10:53:09Z

core/src/main/java/org/elasticsearch/cluster/service/InternalClusterService.java

+                    logger.debug("publishing cluster state version [{}]", newClusterState.version());
+                    try {
+                        discoveryService.publish(clusterChangedEvent, ackListener);
+                    } catch (Throwable t) {


I guess we need that to catch the FailedToCommitException? If so, why not only catch that?

That's a good one. I've tightened things up to make sure we can rely on FailedToCommitException to indicate that something went wrong before committing, so that we can safely reject the CS (it will not be committed on any node).
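
A rough sketch of the narrower catch being discussed, under the assumption stated in the reply that this exception is only thrown before anything has been committed. The `Publisher` interface, `tryPublish`, and the local `FailedToCommitException` class are hypothetical stand-ins for the real types.

```java
/** Hypothetical sketch: reject a cluster state only on a pre-commit failure. */
class RejectOnFailedCommitSketch {

    /** Local stand-in for the exception discussed in this thread. */
    static class FailedToCommitException extends RuntimeException {
        FailedToCommitException(String message) {
            super(message);
        }
    }

    interface Publisher {
        void publish(long version);
    }

    /** Returns the version that is current on the master after the attempt. */
    static long tryPublish(long currentVersion, long newVersion, Publisher publisher) {
        try {
            publisher.publish(newVersion);
            return newVersion;
        } catch (FailedToCommitException e) {
            // Nothing was committed on any node, so keeping the old state is safe.
            return currentVersion;
        }
        // Any other exception propagates: it may have occurred after the commit,
        // so silently rolling back here would not be safe.
    }
}
```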

bleskes · 2015-08-26T14:10:19Z

@imotov @brwe pushed another commit addressing comments so far

imotov · 2015-08-26T20:34:40Z

core/src/main/java/org/elasticsearch/discovery/zen/publish/PublishClusterStateAction.java

-                // ignore & restore interrupt
-                Thread.currentThread().interrupt();
+            } catch (IOException e) {
+                throw new ElasticsearchException("failed to serialize cluster_state for publishing to node {}", e, node);
            }
        }
    }

    private void sendFullClusterState(ClusterState clusterState, @Nullable Map<Version, BytesReference> serializedStates,


It looks like with the latest changes serializedStates can no longer be null, so we should probably remove Nullable here.

correct. Removed.

brwe · 2015-08-28T10:08:13Z

Just #13062 (comment) left and I am not too passionate about it. LGTM too otherwise.

bleskes · 2015-08-28T11:12:42Z

@brwe thx. I updated the PR based on your last comment (and rebased to latest master)

The initial implementation of two phase commit based cluster state publishing (elastic#13062) relied on a single in-memory "pending" cluster state that is only processed by ZenDiscovery once committed by the master. While this is fine on its own, it resulted in an issue with acknowledged APIs, such as the open index API, in the extreme case where a node falls behind and receives a commit message after a new cluster state has been published. Specifically:

1) Master receives an acked-API call and publishes cluster state CS1.
2) Master waits for min-master nodes to receive CS1 and commits it.
3) All nodes that have responded to CS1 are sent a commit message; however, node N hasn't responded yet.
4) Master waits for the publish timeout (defaults to 30s) for all nodes to process the commit. Node N fails to do so.
5) Master publishes cluster state CS2. Node N responds to CS1's publishing but receives CS2 before the commit for CS1 arrives.
6) The commit message for CS1 is processed on node N, but fails because CS2 is pending. This caused the acked API in step 1 to return (but CS2 is not yet processed).

In this case, the action indicated by CS1 is not yet executed on node N and therefore the acked API call returns prematurely. Note that once CS2 is processed, the change in CS1 takes effect (cluster state operations are safe to batch and we do so all the time).

An example failure can be found on: http://build-us-00.elastic.co/job/es_feature_two_phase_pub/314/

This commit extracts the already existing pending cluster state queue (processNewClusterStates) from ZenDiscovery into its own class, which serves as a temporary container for in-flight cluster states. Once committed, the cluster states are transferred to ZenDiscovery as they were before. This allows "lagging" cluster states to still be successfully committed and processed (and likely to be ignored as a newer cluster state has already been processed).

As a side effect, all batching logic is now extracted from ZenDiscovery and is unit tested.
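
The idea of a queue of in-flight cluster states can be sketched as below. This is a simplified illustration under stated assumptions, not the class added by the commit: `PendingStatesQueueSketch` and its methods are hypothetical, states are identified by a UUID and a version only, and taking a committed state drops every pending state with an equal or lower version.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a container for in-flight (pending) cluster states. */
class PendingStatesQueueSketch {

    private static final class Pending {
        final String uuid;
        final long version;
        boolean committed;

        Pending(String uuid, long version) {
            this.uuid = uuid;
            this.version = version;
        }
    }

    private final List<Pending> pending = new ArrayList<>();

    /** Phase one: park an incoming state without processing it. */
    void addPending(String uuid, long version) {
        pending.add(new Pending(uuid, version));
    }

    /** Phase two: mark a parked state as committed. Returns false if it is unknown. */
    boolean markAsCommitted(String uuid) {
        for (Pending p : pending) {
            if (p.uuid.equals(uuid)) {
                p.committed = true;
                return true;
            }
        }
        return false;
    }

    /**
     * Returns the highest committed version and drops it together with all
     * older states, or returns -1 if nothing is committed yet.
     */
    long takeNextCommitted() {
        long best = -1;
        for (Pending p : pending) {
            if (p.committed && p.version > best) {
                best = p.version;
            }
        }
        if (best >= 0) {
            final long processed = best;
            pending.removeIf(p -> p.version <= processed);
        }
        return best;
    }

    int size() {
        return pending.size();
    }
}
```

With a single pending slot, the late commit for CS1 in step 6 has nothing to match once CS2 has arrived. With a queue, CS1 can still be marked as committed and processed while CS2 stays parked until its own commit arrives.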

When publishing a new cluster state, the master sends it to all the nodes of the cluster, noting how many *master* nodes responded successfully. The nodes do not yet process the new cluster state, but rather park it in memory. As soon as at least minimum master nodes have acked the cluster state change, it is committed and a commit request is sent to all the nodes that have responded so far (and to those that respond later). Upon receiving the commit request, the nodes continue to process the cluster state change as they did before this change.

A few notable comments:

1. For this change to have effect, min master nodes must be configured.
2. All basic cluster state validation is done in the first phase of publish and is thus now part of `ShardOperationResult`.
3. A new `COMMIT_TIMEOUT` setting is introduced, dictating how long a master should wait for nodes to ack the first phase. Unlike `PUBLISH_TIMEOUT`, if waiting for a commit times out, the cluster state change will be rejected.
4. Failing to gather acks from a minimum number of master nodes will cause the master to step down, as it clearly doesn't have enough active followers.
5. Previously there was a short window between the moment a master lost its followers and the moment it stepped down because of node fault detection failures. In this short window, the master could process any change (but fail to publish it). This PR reduces this gap to 0.
6. A dedicated pending cluster states queue was added to keep pending non-committed cluster states and manage the logic around processing committed cluster states. See #13303 for details.

Closes #13062, Closes #13303

bleskes added the review, resiliency, and :Distributed/Discovery-Plugins labels Aug 23, 2015

bleskes changed the title ~~Add two phased to Cluster State publishing~~ Add two phased commit to Cluster State publishing Aug 23, 2015

clintongormley added the >feature label Aug 24, 2015

brwe reviewed Aug 24, 2015

imotov reviewed Aug 24, 2015

clintongormley reviewed Aug 25, 2015

brwe reviewed Aug 26, 2015

imotov reviewed Aug 26, 2015

bleskes added 19 commits August 28, 2015 12:31

initial copy over from POC

3815a41

simplified PublishClusterStateActionTests infra

81e07e8

beefed up testing...

b702843

add FailedToCommitException to registration

7390bcf

fix ZenDiscoveryUnitTest.testShouldIgnoreNewClusterState

7d3a36b

added constructor to FailedToCommitException

4d31681

improved timeout handling

234a379

Improved concurrency controls in SendingController to make sure that a CS is never committed after publishing is marked out as timed out

e3e0aa5

force mock transport in testCanNotPublishWithoutMinMastNodes

a56d67d

reject older cluster state from the same master

91dee8b

fix defaults in DiscoverySettings

6208248

commit timeout default should never be larger than publishing timeout

c7c65b6

added docs

f70ed87

doc feedback

d9f6e30

reduce log chatter

98ed133

tighten up FailedToCommitClusterStateException semantics and other feedback

c9ee8db

more feedback

0668e0d

more feedback

10e8c41

remove committedOrFailed and use committedOrFailedLatch for state

218979d

bleskes force-pushed the discovery_two_phase_pub branch from 251b27b to 218979d Compare August 28, 2015 10:42

bleskes mentioned this pull request Sep 3, 2015

Add a dedicate queue for incoming ClusterStates #13303

Closed

bleskes merged commit 218979d into elastic:master Sep 14, 2015

bleskes added the v5.0.0-alpha1 label Sep 15, 2015

bleskes mentioned this pull request Sep 16, 2015

Expose pending cluster state queue size in node stats #13610

Closed

s1monw mentioned this pull request Aug 26, 2016

Rolling upgrades for major releases with no downtime #20173

Closed

makeyang mentioned this pull request Jan 19, 2017

Add current cluster state version to zen pings and use them in master election #20384

Merged
