HDDS-4540. Add a new OM admin operation to submit the OMPrepareRequest.#1664
HDDS-4540. Add a new OM admin operation to submit the OMPrepareRequest.#1664avijayanhwx merged 5 commits intoapache:HDDS-3698-upgradefrom
Conversation
linyiqun
left a comment
There was a problem hiding this comment.
Minor comments from me:
hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/protocol/OzoneManagerProtocol.java
Outdated
Show resolved
Hide resolved
hadoop-ozone/tools/src/main/java/org/apache/hadoop/ozone/admin/om/PrepareSubCommand.java
Outdated
Show resolved
Hide resolved
errose28
left a comment
There was a problem hiding this comment.
Thanks for working on this @avijayanhwx. This will be very useful in further testing of the prepare feature. I left some comments inline.
...ozone-manager/src/main/java/org/apache/hadoop/ozone/om/request/upgrade/OMPrepareRequest.java
Outdated
Show resolved
Hide resolved
| "all pending transactions, taking a Ratis snapshot at the last txn " + | ||
| "and purging all logs on each OM instance. The returned txn id " + | ||
| "corresponds to the last txn in the quorum in which the snapshot is " + | ||
| "taken.", |
There was a problem hiding this comment.
nit: Is txn a standard abbreviation for transaction that is used in the docs and the user would be expected to recognize? Seems best not to use non-standard abbreviations in user facing documentation.
There was a problem hiding this comment.
I have used the full word in the description, and marked the parameter as "hidden".
|
|
||
| @CommandLine.Option( | ||
| names = {"-ft", "--flush-wait-timeout"}, | ||
| description = "Max time to wait for OM Double Buffer flush in seconds.", |
There was a problem hiding this comment.
I feel like this flag name and description might be too low level for the UI, since the double buffer is more of an implementation detail. For example, the user would not be expected to know whether the double buffer flush is a mandatory part of prepare (it is), or if this is just a convenience and they can just set this to a low number and if the flush doesn't complete in time expect the OM to still prepare. Maybe just --prepare-timeout for the flag? We can use the flush timeout terminology within the client, request, and protos.
If the user sets this too low, they will get an error message saying the flush did not complete in time. Then they can retry with a higher value. So this is just a thought and might not be necessary.
There was a problem hiding this comment.
Refactored this to transaction-apply-wait-timeout, and marked this field as hidden. Since I am expecting a global "prepare-timeout" parameter for the client which includes the wait for each OM to be prepared, I have used a more applicable parameter name here. But, I agree this is an internal detail and hence I have moved it to a hidden field.
| @Override | ||
| public Void call() throws Exception { | ||
| OzoneManagerProtocol client = parent.createOmClient(omServiceId); | ||
| long prepareTxnId = client.prepareOzoneManager(flushWaitTime, 5); |
There was a problem hiding this comment.
nit: Can we make the 5 a static constant in this class so it is easier to find and update if needed?
There was a problem hiding this comment.
Added this as a hidden parameter as well.
|
|
||
| @Override | ||
| public long prepareOzoneManager(long flushWaitTimeout, | ||
| long flushCheckInterval) throws IOException { |
There was a problem hiding this comment.
Can we add some identifying information for these time units in this method? We could use java.time.Duration, or since they are being passed right into a proto, just adding Seconds onto their variable names might be enough.
| public void testPrepareDownedOM() throws Exception { | ||
| // Index of the OM that will be shut down during this test. | ||
| final int shutdownOMIndex = 2; | ||
| final int shutdownOMIndex = new Random().nextInt(3); |
There was a problem hiding this comment.
nit: Is there any benefit to using a random index over a fixed one? When observing the test run from logs, we now need to search the messages to determine which OM was taken down on this run (also distinguish the deliberate takedown from a crash or JVM pause induced leader change), instead of having that info beforehand.
There was a problem hiding this comment.
Reverted this change.
| () -> downedOM.getRatisSnapshotIndex() == prepareIndex); | ||
| checkPrepared(downedOM, prepareIndex); | ||
| LambdaTestUtils.await(timeoutMillis, 2000, | ||
| () -> checkPrepared(downedOM, prepareIndex)); |
There was a problem hiding this comment.
Can we replace this with a waitAndCheckPrepared call? Then we know that the logs have also been removed as well.
There was a problem hiding this comment.
Given that we have already made sure that there are no logs present in the functional OMs, I believe that it is sufficient to check just the prepare request apply marker in the downed OM.
| } | ||
|
|
||
| private void checkPrepared(OzoneManager om, long prepareRequestLogIndex) | ||
| private boolean checkPrepared(OzoneManager om, long prepareRequestLogIndex) |
There was a problem hiding this comment.
Do we still need this method? Since we are no longer immediately checking the leader and waiting on all 3 OMs, can we just put these lines in waitAndCheckPrepared? This method on its own is kind of misleading, since it no longer checks that the logs have been removed, and therefore doesn't do a full check for preparedness.
There was a problem hiding this comment.
Due to my last comment, this still has one usage.
| public long prepareOzoneManager(long flushWaitTimeout, | ||
| long flushCheckInterval) throws IOException { | ||
| Preconditions.checkArgument(flushWaitTimeout > 0, | ||
| "flushWaitTimeout has to be > zero"); |
There was a problem hiding this comment.
Should we add this check in PrepareSubCommand as well to give the user a specific error message if they pass a bad value? I'm not sure how the client handles these precondition generated exceptions and presents them to the user.
There was a problem hiding this comment.
Changed this to match the command's user parameter name.
| Preconditions.checkArgument(flushCheckInterval > 0 && | ||
| flushCheckInterval < flushWaitTimeout / 2, | ||
| "flushCheckInterval has to be > zero and < half of " + | ||
| "flushWaitTimeout to make sense."); |
There was a problem hiding this comment.
If the user passes in a value for flush wait timeout that is less than the hardcoded flush check interval, they will see this message. This is confusing to them since it indicates the problem is flush check interval, which they have no knowledge of. They must also guess as to what the flush check interval value is to make sure their flush wait timeout is two times that when they try again.
Similar to above, a min value check on the user passed flush wait timeout to make sure it is large enough (or automatically set flush check interval based on the passed value) might be good to add in PrepareSubCommand.
There was a problem hiding this comment.
Since the timeout command is now hidden this is probably okay.
311836f to
6f8e03d
Compare
avijayanhwx
left a comment
There was a problem hiding this comment.
Thanks for the review @errose28. I have taken up most of your suggestions.
| () -> downedOM.getRatisSnapshotIndex() == prepareIndex); | ||
| checkPrepared(downedOM, prepareIndex); | ||
| LambdaTestUtils.await(timeoutMillis, 2000, | ||
| () -> checkPrepared(downedOM, prepareIndex)); |
There was a problem hiding this comment.
Given that we have already made sure that there are no logs present in the functional OMs, I believe that it is sufficient to check just the prepare request apply marker in the downed OM.
| } | ||
|
|
||
| private void checkPrepared(OzoneManager om, long prepareRequestLogIndex) | ||
| private boolean checkPrepared(OzoneManager om, long prepareRequestLogIndex) |
There was a problem hiding this comment.
Due to my last comment, this still has one usage.
|
Thanks @avijayanhwx LGTM +1 |
What changes were proposed in this pull request?
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-4540
How was this patch tested?
Manually tested.