HDDS-8492. Intermittent timeout in TestStorageContainerManager#testBlockDeletionTransactions#5397
devmadhuu wants to merge 30 commits into apache:master
…ockDeletionTransactions.
@adoroszlai Please review.
@@ -344,7 +346,7 @@ public void testBlockDeletionTransactions() throws Exception {
} catch (IOException e) {
@devmadhuu Thanks for working on this.
We are explicitly flushing at L340 with cluster.getStorageContainerManager().getScmHAManager().asSCMHADBTransactionBuffer().flush(). I think that should flush the transactions, so do we still need to define OZONE_SCM_HA_DBTRANSACTIONBUFFER_FLUSH_INTERVAL?
Thanks @ashishkumar50 for reviewing the patch. In MiniOzoneCluster, SCM HA is not enabled by default, so the transaction buffer flush statement above will not execute. That is why I had to set the transaction buffer flush interval explicitly and reduce it.
I think this config may be used for the HA case only, as even the config name contains "HA flush". Is it used for the non-HA case as well?
Yes, ideally in the non-HA case SCM should write deleted block transactions directly to the DB. Based on further analysis, the method org.apache.hadoop.hdds.scm.block.DeletedBlockLogStateManagerImpl#addTransactionsToDB, which adds deleted block transactions to the DB, adds them to the transaction buffer here rather than directly to the DB.
@sumitagrawl do you have an understanding of the expected behavior?
An explicit flush call is already made in this case, so setting the property is redundant.
I removed that HA flush flag and still can't reproduce the test case failure even after running 300 iterations.
https://github.com/devmadhuu/ozone/actions/runs/6457355734 -> using the flaky test case workflow
https://github.com/devmadhuu/ozone/actions/runs/6455820324 -> 2 attempts
@devmadhuu Please do not mix repeated runs with the branch used for creating the PR. You can create a separate "repeat" branch where fix commits can be cherry-picked, or use the new …
Also, please use more descriptive messages for follow-up commits. Using the same title for all commits makes it more difficult to see what's going on. Compare: https://github.com/apache/ozone/pull/5397/commits
…nd revert of ci.yml and junit.sh.
… change of inconsistent assertion.
@adoroszlai - According to historical CI build results, this issue last occurred on 06/23 (June 2023). However, it was sometimes reproducible using the flaky test workflow. In the current state of the PR, the changes have been tested in 400 iterations, of which only 1 split failed. https://github.com/devmadhuu/ozone/actions/runs/6494918458 Can we not treat this test as non-flaky now, given that the pass rate is high?
We can stop active work on it but keep the issue open. Reduced priority to minor.
Closing the PR till we decide to revisit the code change.
What changes were proposed in this pull request?
This PR fixes the intermittent timeout in TestStorageContainerManager#testBlockDeletionTransactions, which was caused by a failure of the assertion verifying that "NumOfValidTransactions" is zero in the deleted block log. The failure was intermittent because the deleted transaction was being held in the SCM transaction buffer, so reducing the transaction buffer flush interval in the test mini cluster solved the problem.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-8492
How was this patch tested?
This patch was tested using multiple iterations of CI runs. Here is the green CI link from the forked branch.
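The failure mode discussed in this PR (operations sitting in a transaction buffer while the test asserts on persisted state) can be illustrated with a small toy model. This is only a sketch under stated assumptions: ToyDeletedBlockLog and all of its methods are hypothetical names invented for illustration and are not Ozone classes, and the real SCM buffer may behave differently. It only shows why a count read from the DB can lag behind until a flush happens.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model; none of these names are Ozone classes.
class ToyDeletedBlockLog {
    // Stands in for the persisted deleted-block table.
    private final List<Long> db = new ArrayList<>();
    // Pending operations that have not been applied to the "DB" yet.
    private final List<Runnable> buffer = new ArrayList<>();

    // Queue an insert; the "DB" does not change until flush() runs.
    void addTransaction(long txId) {
        buffer.add(() -> db.add(txId));
    }

    // Queue a removal; the "DB" does not change until flush() runs.
    void removeTransaction(long txId) {
        buffer.add(() -> db.remove(Long.valueOf(txId)));
    }

    // Apply all pending operations in order, then clear the buffer.
    void flush() {
        buffer.forEach(Runnable::run);
        buffer.clear();
    }

    int numOfValidTransactionsInDb() {
        return db.size();
    }

    int pendingOperations() {
        return buffer.size();
    }
}

class ToyDeletedBlockLogDemo {
    public static void main(String[] args) {
        ToyDeletedBlockLog log = new ToyDeletedBlockLog();
        log.addTransaction(1L);
        log.addTransaction(2L);
        log.flush();

        // The removals are only buffered at this point.
        log.removeTransaction(1L);
        log.removeTransaction(2L);
        System.out.println("before flush: " + log.numOfValidTransactionsInDb()); // before flush: 2

        log.flush();
        System.out.println("after flush: " + log.numOfValidTransactionsInDb()); // after flush: 0
    }
}
```

In this model, an assertion that the count is zero passes only if a flush has happened before the check, either through an explicit call or because a periodic flush interval is short enough to fire first. That is the trade-off the reviewers discuss between the explicit flush() call and the flush interval setting.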