Skip to content

Comments

HDDS-8492. Intermittent timeout in TestStorageContainerManager#testBlockDeletionTransactions#5397

Closed
devmadhuu wants to merge 30 commits intoapache:masterfrom
devmadhuu:HDDS-8492
Closed

HDDS-8492. Intermittent timeout in TestStorageContainerManager#testBlockDeletionTransactions#5397
devmadhuu wants to merge 30 commits intoapache:masterfrom
devmadhuu:HDDS-8492

Conversation

@devmadhuu
Copy link
Contributor

What changes were proposed in this pull request?

This PR fixes the intermittent timeout in TestStorageContainerManager#testBlockDeletionTransactions due to assertion failure of verifying "NumOfValidTransactions" as zero in deleted block log. Intermittent failure reason was due to holding the deleted transaction in SCM transaction buffer, so reducing of transaction buffer interval in test mini cluster solved the problem.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-8492

How was this patch tested?

This patch was tested using multiple iterations of CI run. Here is the green CI link from forked branch.

@devmadhuu devmadhuu marked this pull request as ready for review October 6, 2023 08:34
@devmadhuu
Copy link
Contributor Author

@adoroszlai Pls review

@adoroszlai adoroszlai requested a review from sumitagrawl October 6, 2023 10:24
@@ -344,7 +346,7 @@ public void testBlockDeletionTransactions() throws Exception {
} catch (IOException e) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@devmadhuu Thanks for working on this.
We are explicitly flushing at L340 cluster.getStorageContainerManager().getScmHAManager().asSCMHADBTransactionBuffer().flush(), I think that should flush the transactions, do we still need to define OZONE_SCM_HA_DBTRANSACTIONBUFFER_FLUSH_INTERVAL ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@devmadhuu Thanks for working on this. We are explicitly flushing at L340 cluster.getStorageContainerManager().getScmHAManager().asSCMHADBTransactionBuffer().flush(), I think that should flush the transactions, do we still need to define OZONE_SCM_HA_DBTRANSACTIONBUFFER_FLUSH_INTERVAL ?

Thanks @ashishkumar50 for reviewing the patch. For MiniOzoneCluster, by default SCM HA is not enabled, so above statement for transaction buffer flush will not execute. So I had to set and reduce the time explicitly for transaction buffer flush.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this config may be used for HA case only even config name contains "HA flush", It is used for non-HA case also?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this config may be used for HA case only even config name contains "HA flush", It is used for non-HA case also?

Yes ideally non HA case, SCM should write deleted block transactions directly to DB, so based on further analysis, we have this method org.apache.hadoop.hdds.scm.block.DeletedBlockLogStateManagerImpl#addTransactionsToDB
which adds deleted blocks transactions to DB , and here we are adding to transaction buffer , not directly to DB.
@sumitagrawl do you have an understanding of expected behavior ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

explicit call of flush is done in this case, so setting property is redundant

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed that HA flush flag and still can't reproduce the test case failure even after running 300 iterations.
https://github.com/devmadhuu/ozone/actions/runs/6457355734 - > Using Flaky test case workflow
https://github.com/devmadhuu/ozone/actions/runs/6455820324 -> 2 attempts

@adoroszlai
Copy link
Contributor

@devmadhuu Please do not mix repeated runs with the branch used for creating PR. You can create a separate "repeat" branch where fix commits can be cherry-picked, or use the new flaky-test-check workflow to trigger repeated run on the PR branch.

Also, please use more descriptive message for follow-up commits. Using the same title for all commits makes it more difficult to see what's going on. Compare:

https://github.com/apache/ozone/pull/5397/commits
vs.
https://github.com/apache/ozone/pull/5301/commits

@devmadhuu devmadhuu marked this pull request as draft October 10, 2023 04:42
@devmadhuu devmadhuu marked this pull request as ready for review October 12, 2023 18:25
@devmadhuu
Copy link
Contributor Author

devmadhuu commented Oct 12, 2023

@adoroszlai - As per historical CI build results, this issue came last on 06/23 (June 2023):

https://github.com/adoroszlai/ozone-build-results/blob/998c5c5c03941f4f446bdc04db92e66a325a0e46/2023/06/06/23016/it-ozone/hadoop-ozone/integration-test/org.apache.hadoop.ozone.TestStorageContainerManager.txt#L5

However It was reproducible using flaky test workflow sometimes. In current state of PR raised , changes been tested in 400 iterations, out of which only 1 split got failed. https://github.com/devmadhuu/ozone/actions/runs/6494918458

Can we not treat this test as non-flaky now because passing percentage is high now.

@adoroszlai
Copy link
Contributor

Can we not treat this test as non-flaky now because passing percentage is high now.

We can stop active work on it but keep the issue open. Reduced priority to minor.

@adoroszlai adoroszlai marked this pull request as draft October 15, 2023 17:11
@kerneltime
Copy link
Contributor

Closing the PR till we decide to revisit the code change.

@kerneltime kerneltime closed this Oct 16, 2023
@devmadhuu devmadhuu deleted the HDDS-8492 branch January 5, 2024 04:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants