[ZOOKEEPER-3657] Implementing snapshot schedule to avoid high latency issue due to disk contention by lvfangmin · Pull Request #1191 · apache/zookeeper

lvfangmin · 2019-12-20T07:13:01Z

No description provided.

asf-ci · 2019-12-20T07:56:19Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build-maven/1761/

Failed Tests: 1

PreCommit-ZOOKEEPER-github-pr-build-maven/org.apache.zookeeper:zookeeper: 1

org.apache.zookeeper.test.RestoreCommittedLogTest.testRestoreCommittedLogWithSnapSize

lvfangmin · 2019-12-20T20:14:41Z

retest maven build

hanm · 2019-12-21T06:12:46Z

Interesting. Two quick questions:

ZooKeeper's sync processor already has some randomness when snapshotting today. I am curious what shortcomings of the existing approach motivate this change.
I don't see observers processing the snap ping packet in this pull request, so will only quorum servers be scheduled by the leader for snapshot generation?

anmolnar · 2020-01-21T17:01:04Z

@lvfangmin I second @hanm's questions. This patch is quite large; are you still working on it?

lvfangmin · 2020-01-26T00:15:02Z

@hanm The randomized snapshot can still introduce high latency if a majority of servers take a snapshot at the same time. As the total DataTree size grows, each snapshot takes longer, so it becomes more likely that a majority will be snapshotting at once, and for a longer period. This is a problem when running ZK on a single disk drive.

From what we saw in benchmarks, the write throughput within SLA for a 6GB DataTree is more than 10X smaller than for a 100MB DataTree.

That's why we introduced this feature. Observers don't need to handle SNAPPING, since the quorum ack latency is only affected by participants.
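
To illustrate the idea of letting the leader pick who snapshots, here is a minimal sketch in which at most one participant is allowed to snapshot per round, so a majority is never snapshotting at once. The class name, the round-robin policy, and the server ids are assumptions made for illustration; the pull request's SnapPingManager may choose its candidate differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not the SnapPingManager from this pull request.
class SnapshotScheduleSketch {
    private final List<Long> participantIds;
    private int next = 0;

    SnapshotScheduleSketch(List<Long> participantIds) {
        if (participantIds.isEmpty()) {
            throw new IllegalArgumentException("need at least one participant");
        }
        this.participantIds = new ArrayList<>(participantIds);
    }

    /** Returns the single server id allowed to snapshot in this round (round-robin is an assumption). */
    long nextCandidate() {
        long sid = participantIds.get(next);
        next = (next + 1) % participantIds.size();
        return sid;
    }

    public static void main(String[] args) {
        List<Long> ids = new ArrayList<>();
        ids.add(1L);
        ids.add(2L);
        ids.add(3L);
        SnapshotScheduleSketch schedule = new SnapshotScheduleSketch(ids);
        for (int round = 0; round < 4; round++) {
            System.out.println("round " + round + " -> server " + schedule.nextCandidate());
        }
    }
}
```

With three participants, only one of them competes with the transaction log for disk bandwidth in any round, so the other two can still form a quorum and ack quickly.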

@anmolnar This feature is complete; it has been running in our prod for more than 6 months.

anmolnar

+1 lgtm.
Just a few nitpicks and also rebase please.

anmolnar · 2020-01-28T13:03:04Z

zookeeper-docs/src/main/resources/markdown/zookeeperAdmin.md

    handshakes. Set it to something like 250 is good enough to avoid herd effect.

+* *leader.snapPingIntervalInSeconds*
+    (Jave system property only: **zookeeper.leader.snapPingIntervalInSeconds**)


Fix typo please: 'Jave'

anmolnar · 2020-01-28T13:03:39Z

zookeeper-docs/src/main/resources/markdown/zookeeperAdmin.md

+    scheduler if it's enabled, and send SNAPPING to the quorum. If the follower is 
+    running old code, it will ignore that packet. When follower with new code received 
+    SNAPPING packet, it will turn off the periodically snapshot locally, and only 
+    taking safety snapshot if the if the txns since last snapshot is much larger than 


Fix typo: 'if the if the'

anmolnar · 2020-01-28T13:04:05Z

zookeeper-docs/src/main/resources/markdown/zookeeperAdmin.md

+    Also there is a JMX setting on leader to turn it on and off in flight.
+
+* *leader.snapTxnsThreshold*
+    (Jave system property only: **zookeeper.leader.snapTxnsThreshold**)


Typo: 'Jave'

anmolnar · 2020-01-28T13:04:22Z

zookeeper-docs/src/main/resources/markdown/zookeeperAdmin.md

+    default value is 100,000 which is the suggested value.
+
+* *leader.snapTxnsSizeThresholdKB*
+    (Jave system property only: **zookeeper.leader.snapTxnsSizeThresholdKB**)


Typo: 'Jave' (looks like a copy-paste problem)

anmolnar · 2020-01-28T13:11:12Z

zookeeper-server/src/main/java/org/apache/zookeeper/server/SnapshotGenerator.java

+            });
+            return true;
+        } else {
+            LOG.warn("Too busy to snap, skipping");


Is this log message accurate? Getting here means the previous snapshot is still running.
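
For context, a guard of this kind is often written with an atomic flag, and the log message can then name the actual cause. The following is only a sketch under that assumption: the class name, the AtomicBoolean, and the single-thread executor are hypothetical and are not taken from SnapshotGenerator.java.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch; not the SnapshotGenerator class from this pull request.
class SingleSnapshotGuardSketch {
    private final AtomicBoolean inProgress = new AtomicBoolean(false);
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    /** Starts snapshotWork in the background unless a snapshot is already running. */
    boolean takeSnapshot(Runnable snapshotWork) {
        if (!inProgress.compareAndSet(false, true)) {
            // The message states the real reason for skipping.
            System.out.println("Previous snapshot still in progress, skipping");
            return false;
        }
        worker.execute(() -> {
            try {
                snapshotWork.run();
            } finally {
                inProgress.set(false); // allow the next snapshot
            }
        });
        return true;
    }

    void shutdown() {
        worker.shutdown();
    }

    public static void main(String[] args) {
        SingleSnapshotGuardSketch guard = new SingleSnapshotGuardSketch();
        CountDownLatch release = new CountDownLatch(1);
        // The first snapshot blocks until released, so the second call must be skipped.
        boolean first = guard.takeSnapshot(() -> {
            try {
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        boolean second = guard.takeSnapshot(() -> { });
        System.out.println("first=" + first + ", second=" + second);
        release.countDown();
        guard.shutdown();
    }
}
```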

anmolnar · 2020-01-28T13:16:31Z

zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/Follower.java

+                    + "compatible with ours {}, will skip", peerSnapPingVersion,
+                    SnapPingManager.SNAP_PING_VERSION);
+            if (fzk.syncProcessor.isOnlySnapWhenSafetyIsThreatened()) {
+                LOG.info("SnapPing version imcompatible, start self snapshot");


Typo: 'imcompatible'

lvfangmin · 2020-05-14T19:22:40Z

Rebased and addressed the nit suggestion from @anmolnar.

eolivelli

Great work

eolivelli · 2020-05-14T21:19:41Z

zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/Follower.java

+            return;
+        }
+
+        if (snapCode == SnapPingCode.CANCEL.ordinal()) {


Relying on ordinal() can lead to problems if someone refactors the enum, adds items, or reorders them.
What about having well-defined constants?

That's a good point. It seems more natural to use an enum for these codes, so I'll add a comment to the enum to WARN against ordering changes.
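
A middle ground between the two positions is an enum that carries an explicit wire value, so reordering or inserting constants cannot change what goes over the network. This is a sketch only: the constant names SNAP, SKIP, and CANCEL appear in the diff, but the numeric values, the wireValue field, and the fromWire helper are hypothetical.

```java
// Hypothetical sketch; the numeric values are invented for illustration.
class SnapPingCodeSketch {
    enum Code {
        SNAP(1),
        SKIP(2),
        CANCEL(3);

        private final int wireValue;

        Code(int wireValue) {
            this.wireValue = wireValue;
        }

        /** Value written into the packet; independent of declaration order. */
        int wireValue() {
            return wireValue;
        }

        /** Decodes a value read from a packet, rejecting unknown codes. */
        static Code fromWire(int value) {
            for (Code c : values()) {
                if (c.wireValue == value) {
                    return c;
                }
            }
            throw new IllegalArgumentException("Unknown snap ping code: " + value);
        }
    }

    public static void main(String[] args) {
        int onTheWire = Code.CANCEL.wireValue();
        System.out.println(Code.fromWire(onTheWire));
    }
}
```

The receiver would then compare against Code.fromWire(snapCode) rather than against ordinal().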

hanm

Back to review mode; left some comments. I haven't finished reviewing all parts, so please bear with my snail's pace.

hanm · 2020-05-21T03:58:38Z

zookeeper-docs/src/main/resources/markdown/zookeeperAdmin.md

    The default value is false.

+* *leader.snapPingIntervalInSeconds*
+    (Java system property only: **zookeeper.leader.snapPingIntervalInSeconds**)


Are these really Java system properties only? Put a config key foo in zoo.cfg and ZK will parse it and generate a zookeeper.foo system property. The same applies to the other "Java system property only" entries listed here. The doc might need an update.
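
The behavior described in this comment can be sketched as follows. This is not ZooKeeper's config parser; it is a stand-alone illustration that assumes, as the comment says, that an otherwise unrecognized key is re-exported as a system property with a "zookeeper." prefix. The class name, the helper, and the value 60 are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical sketch of "unknown zoo.cfg keys become zookeeper.* system properties".
class ConfigToSystemPropertySketch {
    /** Re-exports every key in cfg as a system property named "zookeeper." + key. */
    static void exportAsSystemProperties(Properties cfg) {
        for (String key : cfg.stringPropertyNames()) {
            System.setProperty("zookeeper." + key, cfg.getProperty(key));
        }
    }

    public static void main(String[] args) throws IOException {
        Properties cfg = new Properties();
        // Stand-in for a line in zoo.cfg; the value is made up.
        cfg.load(new StringReader("leader.snapPingIntervalInSeconds=60\n"));
        exportAsSystemProperties(cfg);
        System.out.println(System.getProperty("zookeeper.leader.snapPingIntervalInSeconds"));
    }
}
```

If that is how the setting reaches the server, the doc entry could list both the zoo.cfg key and the system property.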

hanm · 2020-05-21T04:01:10Z

zookeeper-server/src/main/java/org/apache/zookeeper/ZooDefs.java

    public static final String ZOOKEEPER_NODE_SUBTREE = "/zookeeper/";

+    /**
+     * WARN: please don't retain the order, which is used to check 


did you mean "don't change" or "retain", as opposed to "don't retain"? my understanding is the order must be preserved here for the code to work.

hanm · 2020-05-21T04:02:37Z

zookeeper-server/src/main/java/org/apache/zookeeper/server/SnapshotGenerator.java

+        purgeAfterSnapshot = Boolean.getBoolean(PURGE_AFTER_SNAPSHOT);
+        LOG.info("{} = {}", PURGE_AFTER_SNAPSHOT, purgeAfterSnapshot);
+
+        fsyncSnapshotFromScheduler = Boolean.parseBoolean(


why not use Boolean.getBoolean so it's consistent with previous property parsing code (also less verbose)?
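
For reference, the two forms agree when the default is false and differ only when a non-false default is wanted. The hunk above is cut off after the opening parenthesis, so whether the original call supplies a default is not visible here. The property name in this sketch is hypothetical.

```java
// Compares the two ways of reading a boolean system property.
class BooleanPropertySketch {
    /** True only if the property is set and equals "true" (case-insensitive). */
    static boolean viaGetBoolean(String key) {
        return Boolean.getBoolean(key);
    }

    /** Same result as viaGetBoolean: parseBoolean(null) is false. */
    static boolean viaParseBoolean(String key) {
        return Boolean.parseBoolean(System.getProperty(key));
    }

    /** Only this form can express a default other than false. */
    static boolean viaParseBooleanWithDefault(String key, boolean defaultValue) {
        return Boolean.parseBoolean(System.getProperty(key, Boolean.toString(defaultValue)));
    }

    public static void main(String[] args) {
        String key = "zookeeper.sketch.flag"; // hypothetical property name
        System.clearProperty(key);
        System.out.println(viaGetBoolean(key) + " " + viaParseBoolean(key)
                + " " + viaParseBooleanWithDefault(key, true));
        System.setProperty(key, "TRUE");
        System.out.println(viaGetBoolean(key) + " " + viaParseBoolean(key)
                + " " + viaParseBooleanWithDefault(key, false));
    }
}
```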

hanm · 2020-05-21T04:04:59Z

zookeeper-server/src/main/java/org/apache/zookeeper/server/SyncRequestProcessor.java

+
+                        if (onlySnapWhenSafetyIsThreatened) {
+                            if (safetySnapThreshold.meet(zkDB.getTxnsSinceLastSnap(), zkDB.getTxnsSizeSinceLastSnap())) {
+                                snapGenerator.takeSnapshot(false);


Should we pass snapGenerator.fsyncSnapshotFromScheduler as the parameter here instead of hardcoding false?

Also, if we want to hardcode a value, true seems safer than false here for durability guarantees, but I've lost track of what our default options were when taking a snapshot (fsync it or not, especially since, IIRC, we lost the fsync parameter when introducing SnapStream).

hanm · 2020-05-21T04:05:16Z

zookeeper-server/src/main/java/org/apache/zookeeper/server/SyncRequestProcessor.java

-                                    }
-                                }
-                            }.start();
+                            snapGenerator.takeSnapshot(false);


Similar here: should the parameter be hardcoded or taken from fsyncSnapshotFromScheduler?

hanm · 2020-05-21T04:22:34Z

zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/Follower.java

+        if (snapCode == SnapPingCode.CANCEL.ordinal()) {
+            if (fzk.syncProcessor.isOnlySnapWhenSafetyIsThreatened()) {
+                LOG.info("Snapshot schedule cancelled by leader, start self snapshot");
+                fzk.syncProcessor.setOnlySnapWhenSafetyIsThreatened(false);


Did we consider that the CANCEL packet could be lost, so that a server in scheduled snapshot mode would never snapshot again? Is there any built-in defense mechanism for that case (I haven't read all of the code yet) to make sure a server does not end up in a non-snapshotting state?
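
One candidate for such a defense is the safety snapshot visible in the SyncRequestProcessor hunk above, where safetySnapThreshold.meet(...) is checked while onlySnapWhenSafetyIsThreatened is set. Whether it is intended to cover a lost CANCEL is for the author to confirm. The sketch below only shows the shape of such a check: the class body, the OR semantics, and the limits are assumptions, and only the method name meet and its two arguments come from the diff.

```java
// Hypothetical sketch of a safety threshold; limits and OR semantics are assumed.
class SafetySnapThresholdSketch {
    private final long maxTxns;
    private final long maxTxnBytes;

    SafetySnapThresholdSketch(long maxTxns, long maxTxnBytes) {
        this.maxTxns = maxTxns;
        this.maxTxnBytes = maxTxnBytes;
    }

    /** True once either the txn count or the txn size since the last snapshot reaches its limit. */
    boolean meet(long txnsSinceLastSnap, long txnBytesSinceLastSnap) {
        return txnsSinceLastSnap >= maxTxns || txnBytesSinceLastSnap >= maxTxnBytes;
    }

    public static void main(String[] args) {
        // Made-up limits for illustration only.
        SafetySnapThresholdSketch threshold = new SafetySnapThresholdSketch(1000, 64L * 1024);
        System.out.println(threshold.meet(10, 512));
        System.out.println(threshold.meet(1000, 512));
    }
}
```

With a check like this on the follower, a server that never hears from the leader again would still snapshot once enough transactions accumulate.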

hanm

Finishing the review of the rest of the pull request.

hanm · 2020-05-27T23:29:31Z

zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/SnapPingManager.java

+                    try {
+                        listener.snapPing(SnapPingListener.SNAP_PING_ID_DONT_CARE,
+                                sid == learnerSnapCandidate
+                                ? SnapPingCode.SNAP : SnapPingCode.SKIP);


We can simply skip this learner if it's not the candidate. This saves sending a SnapPingCode.SKIP packet, since on the learner side that code doesn't do anything. This also raises the question of why SnapPingCode.SKIP exists in the first place: if we don't want a learner to snapshot, we can just skip sending it a command instead of sending it a no-op command.

hanm · 2020-05-28T01:09:43Z

zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/SnapPingManager.java

+            }
+
+            if (++snapPingId < 0) {
+                snapPingId = 1;


Is this to deal with overflow? It might be worth adding a comment here.
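
If the counter is a signed Java integer, the hunk does behave as an overflow guard: incrementing the maximum value wraps to a negative number, which the check then resets to 1. The sketch below demonstrates this with an int field, which is an assumption, since the field's declared type is not visible in the hunk; the class name is hypothetical.

```java
// Demonstrates the wrap-around handled by "if (++snapPingId < 0) snapPingId = 1;".
class SnapPingIdWrapSketch {
    private int snapPingId; // assumption: the real field may be int or long

    SnapPingIdWrapSketch(int start) {
        this.snapPingId = start;
    }

    int nextId() {
        // Integer.MAX_VALUE + 1 wraps to Integer.MIN_VALUE, which is negative,
        // so the id restarts at 1 and never goes negative.
        if (++snapPingId < 0) {
            snapPingId = 1;
        }
        return snapPingId;
    }

    public static void main(String[] args) {
        SnapPingIdWrapSketch ids = new SnapPingIdWrapSketch(Integer.MAX_VALUE - 1);
        System.out.println(ids.nextId());
        System.out.println(ids.nextId());
        System.out.println(ids.nextId());
    }
}
```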

hanm · 2020-05-28T03:04:09Z

zookeeper-server/src/test/java/org/apache/zookeeper/server/quorum/SnapPingTest.java

+
+    @Before
+    public void setup() throws Exception {
+        System.setProperty(


I don't see where zookeeper.leader.snapPingIntervalInSeconds is set. Isn't it required to set this explicitly to enable the snapshot schedule feature?

anmolnar approved these changes Jan 28, 2020

lvfangmin force-pushed the ZOOKEEPER-3657 branch from 31ea561 to 5d04c08 Compare May 14, 2020 19:21

eolivelli reviewed May 14, 2020

ZOOKEEPER-3657: Implementing snapshot schedule to avoid high latency …

77c4ea4

…issue due to disk contention

lvfangmin force-pushed the ZOOKEEPER-3657 branch from 5d04c08 to 77c4ea4 Compare May 20, 2020 19:46

hanm requested changes May 21, 2020

hanm reviewed May 28, 2020

ztzg force-pushed the master branch from 1c60545 to e2070be Compare October 3, 2023 12:57

5 participants