HDDS-4540. Add a new OM admin operation to submit the OMPrepareRequest. by avijayanhwx · Pull Request #1664 · apache/ozone

avijayanhwx · 2020-12-06T06:18:35Z

What changes were proposed in this pull request?

Introduce a new OM client operation to "prepare" the OM quorum.
As a first pass, the client will just submit the request (HDDS-4480) and print out the response (the transaction ID).
In a follow-up JIRA, the subsequent steps to probe every individual OM for preparation completeness will be added.

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-4540

How was this patch tested?

Manually tested.

linyiqun

Minor comments from me:

hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/protocol/OzoneManagerProtocol.java

hadoop-ozone/tools/src/main/java/org/apache/hadoop/ozone/admin/om/PrepareSubCommand.java

errose28

Thanks for working on this @avijayanhwx. This will be very useful in further testing of the prepare feature. I left some comments inline.

...ozone-manager/src/main/java/org/apache/hadoop/ozone/om/request/upgrade/OMPrepareRequest.java

errose28 · 2020-12-07T20:11:37Z

hadoop-ozone/tools/src/main/java/org/apache/hadoop/ozone/admin/om/PrepareSubCommand.java

+        "all pending transactions, taking a Ratis snapshot at the last txn " +
+        "and purging all logs on each OM instance. The returned txn id " +
+        "corresponds to the last txn in the quorum in which the snapshot is " +
+        "taken.",


nit: Is txn a standard abbreviation for transaction that is used in the docs and that the user would be expected to recognize? It seems best not to use non-standard abbreviations in user-facing documentation.

I have used the full word in the description, and marked the parameter as "hidden".

errose28 · 2020-12-07T20:21:23Z

hadoop-ozone/tools/src/main/java/org/apache/hadoop/ozone/admin/om/PrepareSubCommand.java

+
+  @CommandLine.Option(
+      names = {"-ft", "--flush-wait-timeout"},
+      description = "Max time to wait for OM Double Buffer flush in seconds.",


I feel like this flag name and description might be too low-level for the UI, since the double buffer is more of an implementation detail. For example, the user would not be expected to know whether the double buffer flush is a mandatory part of prepare (it is), or whether it is just a convenience, so that they could set a low value and still expect the OM to prepare if the flush doesn't complete in time. Maybe just --prepare-timeout for the flag? We can use the flush timeout terminology within the client, request, and protos.

If the user sets this too low, they will get an error message saying the flush did not complete in time. Then they can retry with a higher value. So this is just a thought and might not be necessary.

Refactored this to transaction-apply-wait-timeout, and marked this field as hidden. Since I am expecting a global "prepare-timeout" parameter for the client which includes the wait for each OM to be prepared, I have used a more applicable parameter name here. But, I agree this is an internal detail and hence I have moved it to a hidden field.

errose28 · 2020-12-07T20:23:32Z

hadoop-ozone/tools/src/main/java/org/apache/hadoop/ozone/admin/om/PrepareSubCommand.java

+  @Override
+  public Void call() throws Exception {
+    OzoneManagerProtocol client = parent.createOmClient(omServiceId);
+    long prepareTxnId = client.prepareOzoneManager(flushWaitTime, 5);


nit: Can we make the 5 a static constant in this class so it is easier to find and update if needed?

Added this as a hidden parameter as well.

errose28 · 2020-12-07T20:34:39Z

...n/java/org/apache/hadoop/ozone/om/protocolPB/OzoneManagerProtocolClientSideTranslatorPB.java


+  @Override
+  public long prepareOzoneManager(long flushWaitTimeout,
+                                  long flushCheckInterval) throws IOException {


Can we add some identifying information for these time units in this method? We could use java.time.Duration, or since they are being passed right into a proto, just adding Seconds onto their variable names might be enough.
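
To illustrate the java.time.Duration option mentioned above, here is a minimal sketch. The class and method names (PrepareTimeoutArgs, toPositiveSeconds) are hypothetical and not part of this patch, and it assumes the proto field carries whole seconds, as the option description suggests.

```java
import java.time.Duration;

/** Hypothetical helper (not in the patch): keeps time units explicit at the API boundary. */
final class PrepareTimeoutArgs {
  private PrepareTimeoutArgs() { }

  /**
   * Converts a Duration to whole seconds for a seconds-valued wire field.
   * Rejects null, zero and negative values; fractional seconds are truncated,
   * except that sub-second values are rounded up to 1 so they never become 0.
   */
  static long toPositiveSeconds(Duration value, String name) {
    if (value == null || value.isZero() || value.isNegative()) {
      throw new IllegalArgumentException(name + " must be greater than zero");
    }
    long seconds = value.getSeconds();
    return seconds == 0 ? 1 : seconds;
  }
}
```

With something like this, a caller would pass Duration.ofMinutes(5) rather than a bare 300, and the unit conversion would happen in one place right before the value is written into the request.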

errose28 · 2020-12-07T21:02:37Z

...ozone/integration-test/src/test/java/org/apache/hadoop/ozone/om/TestOzoneManagerPrepare.java

  public void testPrepareDownedOM() throws Exception {
    // Index of the OM that will be shut down during this test.
-    final int shutdownOMIndex = 2;
+    final int shutdownOMIndex = new Random().nextInt(3);


nit: Is there any benefit to using a random index over a fixed one? When observing the test run from logs, we now need to search the messages to determine which OM was taken down on this run (and to distinguish the deliberate takedown from a leader change induced by a crash or JVM pause), instead of having that info beforehand.

Reverted this change.

errose28 · 2020-12-07T21:12:44Z

...ozone/integration-test/src/test/java/org/apache/hadoop/ozone/om/TestOzoneManagerPrepare.java

-        () -> downedOM.getRatisSnapshotIndex() == prepareIndex);
-    checkPrepared(downedOM, prepareIndex);
+    LambdaTestUtils.await(timeoutMillis, 2000,
+        () -> checkPrepared(downedOM, prepareIndex));


Can we replace this with a waitAndCheckPrepared call? Then we know that the logs have been removed as well.

Given that we have already made sure that there are no logs present in the functional OMs, I believe that it is sufficient to check just the prepare request apply marker in the downed OM.
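
For readers following the wait-versus-check discussion, the sketch below shows the general shape of a polling wait that takes a boolean check. It is a hypothetical stand-in written for illustration (PollingWait is not a class in this patch, and this is not the Hadoop LambdaTestUtils implementation).

```java
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

/** Hypothetical stand-in for a polling wait; not the Hadoop LambdaTestUtils code. */
final class PollingWait {
  private PollingWait() { }

  /**
   * Re-evaluates the check every intervalMillis until it returns true.
   * Throws TimeoutException if the check is still false once timeoutMillis has elapsed.
   */
  static void await(long timeoutMillis, long intervalMillis, BooleanSupplier check)
      throws InterruptedException, TimeoutException {
    long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
    while (true) {
      if (check.getAsBoolean()) {
        return;
      }
      if (System.nanoTime() >= deadline) {
        throw new TimeoutException(
            "Condition not met within " + timeoutMillis + " ms");
      }
      Thread.sleep(intervalMillis);
    }
  }
}
```

Because the check is a single boolean expression, a fuller preparedness test could in principle combine both conditions under discussion (prepare marker applied, and logs removed) with `&&` inside one lambda; whether that is needed for the downed OM is the judgment call made in the reply above.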

errose28 · 2020-12-07T21:16:37Z

...ozone/integration-test/src/test/java/org/apache/hadoop/ozone/om/TestOzoneManagerPrepare.java

  }

-  private void checkPrepared(OzoneManager om, long prepareRequestLogIndex)
+  private boolean checkPrepared(OzoneManager om, long prepareRequestLogIndex)


Do we still need this method? Since we are no longer immediately checking the leader and waiting on all 3 OMs, can we just put these lines in waitAndCheckPrepared? This method on its own is kind of misleading, since it no longer checks that the logs have been removed, and therefore doesn't do a full check for preparedness.

Due to my last comment, this still has one usage.

errose28 · 2020-12-07T21:22:09Z

...n/java/org/apache/hadoop/ozone/om/protocolPB/OzoneManagerProtocolClientSideTranslatorPB.java

+  public long prepareOzoneManager(long flushWaitTimeout,
+                                  long flushCheckInterval) throws IOException {
+    Preconditions.checkArgument(flushWaitTimeout > 0,
+        "flushWaitTimeout has to be > zero");


Should we add this check in PrepareSubCommand as well, to give the user a specific error message if they pass a bad value? I'm not sure how the client handles these precondition-generated exceptions and presents them to the user.

Changed this to match the command's user parameter name.

errose28 · 2020-12-07T21:25:27Z

...n/java/org/apache/hadoop/ozone/om/protocolPB/OzoneManagerProtocolClientSideTranslatorPB.java

+    Preconditions.checkArgument(flushCheckInterval > 0 &&
+            flushCheckInterval < flushWaitTimeout / 2,
+        "flushCheckInterval has to be > zero and < half of " +
+            "flushWaitTimeout to make sense.");


If the user passes in a value for flush wait timeout that is less than the hardcoded flush check interval, they will see this message. This is confusing to them since it indicates the problem is flush check interval, which they have no knowledge of. They must also guess as to what the flush check interval value is to make sure their flush wait timeout is two times that when they try again.

Similar to above, a min value check on the user passed flush wait timeout to make sure it is large enough (or automatically set flush check interval based on the passed value) might be good to add in PrepareSubCommand.

Since the timeout command is now hidden this is probably okay.
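
As a sketch of the alternative raised above (deriving the check interval from the user-supplied timeout instead of failing on it), the following assumes both values are in seconds and reuses the constraint from the precondition, 0 < interval < timeout / 2 with integer division. FlushIntervalChooser and its constants are hypothetical names, not code from this patch.

```java
/** Hypothetical helper (not in the patch): picks a check interval that fits the timeout. */
final class FlushIntervalChooser {
  /** Mirrors the hard-coded value 5 passed by the subcommand in the diff above. */
  static final long DEFAULT_CHECK_INTERVAL_SECONDS = 5;

  /**
   * Smallest timeout for which any interval can satisfy
   * 0 < interval < timeout / 2 under long (integer) division.
   */
  static final long MIN_TIMEOUT_SECONDS = 4;

  private FlushIntervalChooser() { }

  /** Returns the default interval, shrunk if necessary so the precondition holds. */
  static long chooseCheckIntervalSeconds(long timeoutSeconds) {
    if (timeoutSeconds < MIN_TIMEOUT_SECONDS) {
      // The message names only the value the user actually controls.
      throw new IllegalArgumentException("timeout must be at least "
          + MIN_TIMEOUT_SECONDS + " seconds, got " + timeoutSeconds);
    }
    long largestValidInterval = timeoutSeconds / 2 - 1;
    return Math.min(DEFAULT_CHECK_INTERVAL_SECONDS, largestValidInterval);
  }
}
```

The design choice here is that the error message mentions only the timeout, so a user never has to guess at an interval they cannot see.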

avijayanhwx

Thanks for the review @errose28. I have taken up most of your suggestions.

avijayanhwx · 2020-12-07T22:52:06Z

...ozone/integration-test/src/test/java/org/apache/hadoop/ozone/om/TestOzoneManagerPrepare.java

-        () -> downedOM.getRatisSnapshotIndex() == prepareIndex);
-    checkPrepared(downedOM, prepareIndex);
+    LambdaTestUtils.await(timeoutMillis, 2000,
+        () -> checkPrepared(downedOM, prepareIndex));


Given that we have already made sure that there are no logs present in the functional OMs, I believe that it is sufficient to check just the prepare request apply marker in the downed OM.

avijayanhwx · 2020-12-07T22:52:36Z

...ozone/integration-test/src/test/java/org/apache/hadoop/ozone/om/TestOzoneManagerPrepare.java

  }

-  private void checkPrepared(OzoneManager om, long prepareRequestLogIndex)
+  private boolean checkPrepared(OzoneManager om, long prepareRequestLogIndex)


Due to my last comment, this still has one usage.

errose28 · 2020-12-08T22:42:30Z

Thanks @avijayanhwx LGTM +1

swagle

+1 LGTM

avijayanhwx · 2020-12-09T19:53:37Z

Thanks for the reviews @linyiqun, @errose28 & @swagle. I am merging this to unblock the efforts in HDDS-4569. If there are further review comments, they can be addressed in the following patches.

Aravindan Vijayan added 2 commits December 5, 2020 22:14

HDDS-4540. Add a new OM admin operation to submit the OMPrepareRequest.

c8bc481

Fix checkstyle.

ceb4ef8

linyiqun reviewed Dec 6, 2020

hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/protocol/OzoneManagerProtocol.java

hadoop-ozone/tools/src/main/java/org/apache/hadoop/ozone/admin/om/PrepareSubCommand.java

avijayanhwx requested review from bharatviswa504, fapifta, prashantpogde and swagle December 6, 2020 17:31

Address review comments.

7054801

errose28 reviewed Dec 7, 2020

Aravindan Vijayan added 2 commits December 7, 2020 14:41

Address review comments.

558357d

Add wait for log file absence check.

6f8e03d

avijayanhwx force-pushed the HDDS-4540 branch from 311836f to 6f8e03d on December 7, 2020 22:47

avijayanhwx commented Dec 7, 2020

swagle approved these changes Dec 9, 2020

avijayanhwx merged commit 2489968 into apache:HDDS-3698-upgrade Dec 9, 2020

4 participants