Skip to content

HDDS-4540. Add a new OM admin operation to submit the OMPrepareRequest.#1664

Merged
avijayanhwx merged 5 commits intoapache:HDDS-3698-upgradefrom
avijayanhwx:HDDS-4540
Dec 9, 2020
Merged

HDDS-4540. Add a new OM admin operation to submit the OMPrepareRequest.#1664
avijayanhwx merged 5 commits intoapache:HDDS-3698-upgradefrom
avijayanhwx:HDDS-4540

Conversation

@avijayanhwx
Copy link
Contributor

What changes were proposed in this pull request?

  • Introduce a new OM client operation to "prepare" the OM quorum.
  • As a first pass, the client will just submit the request (HDDS-4480) and print out the response (Txn ID)
  • In a follow up JIRA, the subsequent steps to probe every individual OM for preparation completeness will be added.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-4540

How was this patch tested?

Manually tested.

Copy link
Contributor

@linyiqun linyiqun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comments from me:

Copy link
Contributor

@errose28 errose28 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this @avijayanhwx. This will be very useful in further testing of the prepare feature. I left some comments inline.

"all pending transactions, taking a Ratis snapshot at the last txn " +
"and purging all logs on each OM instance. The returned txn id " +
"corresponds to the last txn in the quorum in which the snapshot is " +
"taken.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Is txn a standard abbreviation for transaction that is used in the docs and the user would be expected to recognize? Seems best not to use non-standard abbreviations in user facing documentation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have used the full word in the description, and marked the parameter as "hidden".


@CommandLine.Option(
names = {"-ft", "--flush-wait-timeout"},
description = "Max time to wait for OM Double Buffer flush in seconds.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this flag name and description might be too low level for the UI, since the double buffer is more of an implementation detail. For example, the user would not be expected to know whether the double buffer flush is a mandatory part of prepare (it is), or if this is just a convenience and they can just set this to a low number and if the flush doesn't complete in time expect the OM to still prepare. Maybe just --prepare-timeout for the flag? We can use the flush timeout terminology within the client, request, and protos.

If the user sets this too low, they will get an error message saying the flush did not complete in time. Then they can retry with a higher value. So this is just a thought and might not be necessary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactored this to transaction-apply-wait-timeout, and marked this field as hidden. Since I am expecting a global "prepare-timeout" parameter for the client which includes the wait for each OM to be prepared, I have used a more applicable parameter name here. But, I agree this is an internal detail and hence I have moved it to a hidden field.

@Override
public Void call() throws Exception {
OzoneManagerProtocol client = parent.createOmClient(omServiceId);
long prepareTxnId = client.prepareOzoneManager(flushWaitTime, 5);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Can we make the 5 a static constant in this class so it is easier to find and update if needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added this as a hidden parameter as well.


@Override
public long prepareOzoneManager(long flushWaitTimeout,
long flushCheckInterval) throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add some identifying information for these time units in this method? We could use java.time.Duration, or since they are being passed right into a proto, just adding Seconds onto their variable names might be enough.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

public void testPrepareDownedOM() throws Exception {
// Index of the OM that will be shut down during this test.
final int shutdownOMIndex = 2;
final int shutdownOMIndex = new Random().nextInt(3);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Is there any benefit to using a random index over a fixed one? When observing the test run from logs, we now need to search the messages to determine which OM was taken down on this run (also distinguish the deliberate takedown from a crash or JVM pause induced leader change), instead of having that info beforehand.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted this change.

() -> downedOM.getRatisSnapshotIndex() == prepareIndex);
checkPrepared(downedOM, prepareIndex);
LambdaTestUtils.await(timeoutMillis, 2000,
() -> checkPrepared(downedOM, prepareIndex));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we replace this with a waitAndCheckPrepared call? Then we know that the logs have also been removed as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that we have already made sure that there are no logs present in the functional OMs, I believe that it is sufficient to check just the prepare request apply marker in the downed OM.

}

private void checkPrepared(OzoneManager om, long prepareRequestLogIndex)
private boolean checkPrepared(OzoneManager om, long prepareRequestLogIndex)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still need this method? Since we are no longer immediately checking the leader and waiting on all 3 OMs, can we just put these lines in waitAndCheckPrepared? This method on its own is kind of misleading, since it no longer checks that the logs have been removed, and therefore doesn't do a full check for preparedness.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Due to my last comment, this still has one usage.

public long prepareOzoneManager(long flushWaitTimeout,
long flushCheckInterval) throws IOException {
Preconditions.checkArgument(flushWaitTimeout > 0,
"flushWaitTimeout has to be > zero");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add this check in PrepareSubCommand as well to give the user a specific error message if they pass a bad value? I'm not sure how the client handles these precondition generated exceptions and presents them to the user.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed this to match the command's user parameter name.

Preconditions.checkArgument(flushCheckInterval > 0 &&
flushCheckInterval < flushWaitTimeout / 2,
"flushCheckInterval has to be > zero and < half of " +
"flushWaitTimeout to make sense.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the user passes in a value for flush wait timeout that is less than the hardcoded flush check interval, they will see this message. This is confusing to them since it indicates the problem is flush check interval, which they have no knowledge of. They must also guess as to what the flush check interval value is to make sure their flush wait timeout is two times that when they try again.

Similar to above, a min value check on the user passed flush wait timeout to make sure it is large enough (or automatically set flush check interval based on the passed value) might be good to add in PrepareSubCommand.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the timeout command is now hidden this is probably okay.

Copy link
Contributor Author

@avijayanhwx avijayanhwx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review @errose28. I have taken up most of your suggestions.

() -> downedOM.getRatisSnapshotIndex() == prepareIndex);
checkPrepared(downedOM, prepareIndex);
LambdaTestUtils.await(timeoutMillis, 2000,
() -> checkPrepared(downedOM, prepareIndex));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that we have already made sure that there are no logs present in the functional OMs, I believe that it is sufficient to check just the prepare request apply marker in the downed OM.

}

private void checkPrepared(OzoneManager om, long prepareRequestLogIndex)
private boolean checkPrepared(OzoneManager om, long prepareRequestLogIndex)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Due to my last comment, this still has one usage.

@errose28
Copy link
Contributor

errose28 commented Dec 8, 2020

Thanks @avijayanhwx LGTM +1

Copy link
Contributor

@swagle swagle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 LGTM

@avijayanhwx
Copy link
Contributor Author

Thanks for the reviews @linyiqun, @errose28 & @swagle. I am merging this to unblock the efforts in HDDS-4569. If there are further review comments, they can be addressed in the following patches.

@avijayanhwx avijayanhwx merged commit 2489968 into apache:HDDS-3698-upgrade Dec 9, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants