Skip to content

HDDS-10366. Add new testPrepare() in TestOzoneManagerPrepare#6245

Closed
aryangupta1998 wants to merge 4 commits intoapache:masterfrom
aryangupta1998:HDDS-10366
Closed

HDDS-10366. Add new testPrepare() in TestOzoneManagerPrepare#6245
aryangupta1998 wants to merge 4 commits intoapache:masterfrom
aryangupta1998:HDDS-10366

Conversation

@aryangupta1998
Copy link
Contributor

What changes were proposed in this pull request?

Adding new testPrepare() in TestOzoneManagerPrepare which has snapshot interval set to 1, when upgrade prepare request comes, triggers force snapshot from ratis and waits for complete, once force snapshot is completed, then submit the upgrade prepare for the remaining task to mark the upgrade prepare complete. We then verify that prepare index should always be less than the current transaction index.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-10366

How was this patch tested?

Tested Manually.

@adoroszlai
Copy link
Contributor

@aryangupta1998 TestOzoneManagerHA subclasses historically had several intermittent failures. Please run them in flaky-test-check workflow to ensure reduced snapshot threshold does not cause regressions.

Also, please compare test run time before/after the change.

Copy link
Contributor

@errose28 errose28 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix @aryangupta1998. This snapshot race has probably been making prepare slightly flaky for a while.

private static final int OZONE_CLIENT_FAILOVER_MAX_ATTEMPTS = 5;
private static final int IPC_CLIENT_CONNECT_MAX_RETRIES = 4;
private static final long SNAPSHOT_THRESHOLD = 50;
private static final long SNAPSHOT_THRESHOLD = 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Setting this globally may mask other problems in prepare or other OM HA tests when snapshots are not taken. We should probably only set in the one new test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Due to the presence of some functions like submitCancelPrepareRequest() in setup(), the transaction index increases due to which we may not be able to produce the current scenario i.e, SNAPSHOT_THRESHOLD = 1.
Do you feel we should write a new test class extending TestOzoneManagerHA?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think desired result is that only the new snapshot test in TestOzoneManagerPrepare has a snapshot threshold of 1, and the rest of the tests for TestOzoneManagerPrepare and TestOzoneManagerHA subclasses remain unchanged. Doing that with the current TestOzoneManagerHA setup is difficult because it is using a static method to set the configuration, so some refactoring may be required there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we do so, the transaction index will still increase due to the setup() function in TestOzoneManagerPrepare!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the transaction index increases due to which we may not be able to produce the current scenario i.e, SNAPSHOT_THRESHOLD = 1.

If we do so, the transaction index will still increase due to the setup() function in TestOzoneManagerPrepare!

I think there may be some misunderstanding about the snapshot threshold? SNAPSHOT_THRESHOLD=1 means the OM will take a snapshot on every new request, not that it will only take a snapshot on index 1. The transaction index can be any number at any time during the test, and by setting this config to 1 we know that transaction will have a snapshot taken.

Comment on lines -215 to -220
} else if (!ratisStateMachineApplied) {
throw new IOException(String.format("After waiting for %d seconds, " +
"Ratis state machine applied index %d which is less than" +
" the minimum required index %d.",
flushTimeout.getSeconds(), lastRatisCommitIndex,
minRatisStateMachineIndex));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should remove minRatisStateMachineIndex and ratisStateMachineApplied from the other parts of this function as well if we are no longer using them. We also need to make sure that om.getRatisSnapshotIndex() returning an index means that was actually written to RocksDB. Checking the transaction info table may be a better option.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants