HDDS-15004. Stabilize TestReconContainerEndpoint#testContainerEndpointForOBSBucket #10116

Merged
adoroszlai merged 4 commits into apache:master from arunsarin85:HDDS-15004
May 7, 2026

@arunsarin85
Contributor

@arunsarin85 arunsarin85 commented Apr 23, 2026

What changes were proposed in this pull request?

Fix the intermittent failure in TestReconContainerEndpoint#testContainerEndpointForOBSBucket (HDDS-15004).

The test sometimes failed with expected: <1> but was: <0> on KeysResponse#getTotalCount(). That usually meant either that Recon had not finished updating its container-key index, or that the test queried the wrong container id.

Please describe your PR in detail:

  1. Use the real container id from OM

OBS: the test no longer hard-codes 1L. It uses OzoneManager#lookupKey (via getContainerIdForKey) to get the container id from the key’s block locations.
FSO: the same helper is used for each key path instead of assuming fixed container ids.

  2. Fail fast when the async “buffer empty” wait breaks

After waitForEventBufferEmpty, the test only waited until the CompletableFuture completed. If the async work failed, the future could still be “done” and the test would continue anyway.

The test now calls completableFuture.join() after GenericTestUtils.waitFor(completableFuture::isDone, …) so a failed buffer wait is not ignored.
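
To illustrate why isDone() alone is not enough, here is a minimal standalone sketch using only the JDK. The class and method names (JoinAfterDoneSketch, joinSurfacesFailure) are made up for this illustration and are not part of the patch; the point is only that a future completed with an exception still reports isDone() == true, while join() rethrows the failure as a CompletionException.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

// Illustrative only: not code from the patch.
final class JoinAfterDoneSketch {

  /** Returns true if join() surfaced a failure of the given future. */
  static boolean joinSurfacesFailure(CompletableFuture<Void> future) {
    try {
      future.join();
      return false;
    } catch (CompletionException e) {
      // join() wraps the original exception as the cause.
      return true;
    }
  }

  public static void main(String[] args) {
    CompletableFuture<Void> failed = new CompletableFuture<>();
    failed.completeExceptionally(new IllegalStateException("buffer wait failed"));

    // A failed future is still "done", so waiting on isDone() hides the error.
    System.out.println("isDone = " + failed.isDone());
    System.out.println("isCompletedExceptionally = " + failed.isCompletedExceptionally());
    System.out.println("join() surfaced failure = " + joinSurfacesFailure(failed));
  }
}
```

Calling join() only after the wait has finished keeps the timeout behavior of the existing wait, while making sure an exceptional completion fails the test instead of being skipped over.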

  3. Wait until Recon actually has the keys (no sleep in TestReconContainerEndpoint)

The OM event buffer can be empty while Recon is still applying that batch (events are removed from the queue before processing finishes). So “buffer empty” is not enough to assert on the container endpoint.

Instead of sleeping, the test waits until ReconContainerMetadataManager#getKeyCountForContainer shows the expected number of keys per container. That logic lives in TestReconOmMetaManagerUtils.waitUntilReconKeyCounts, which polls until counts match or a timeout is hit.

The FSO test builds the expected counts from both written keys (so if two keys share a container, it waits for the right total).
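
The polling idea can be sketched roughly as follows. This is a simplified, JDK-only illustration and not the actual TestReconOmMetaManagerUtils.waitUntilReconKeyCounts code: the class name, method names, and the LongUnaryOperator lookup are assumptions standing in for a call such as ReconContainerMetadataManager#getKeyCountForContainer.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.LongUnaryOperator;

// Illustrative only: names and signatures are hypothetical, not the patch's helper.
final class KeyCountWaitSketch {

  /** Builds expected key counts per container; keys that share a container add up. */
  static Map<Long, Long> expectedCounts(long... containerIdOfEachKey) {
    Map<Long, Long> expected = new HashMap<>();
    for (long containerId : containerIdOfEachKey) {
      expected.merge(containerId, 1L, Long::sum);
    }
    return expected;
  }

  /** Polls until every container reports its expected count, or fails on timeout. */
  static void waitUntilCountsMatch(Map<Long, Long> expected, LongUnaryOperator keyCountLookup,
      long pollMillis, long timeoutMillis) {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
    while (!allMatch(expected, keyCountLookup)) {
      if (System.nanoTime() - deadline >= 0) {
        throw new IllegalStateException("Timed out waiting for key counts " + expected);
      }
      try {
        Thread.sleep(pollMillis);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException("Interrupted while waiting for key counts", e);
      }
    }
  }

  private static boolean allMatch(Map<Long, Long> expected, LongUnaryOperator keyCountLookup) {
    for (Map.Entry<Long, Long> entry : expected.entrySet()) {
      if (keyCountLookup.applyAsLong(entry.getKey()) != entry.getValue()) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    // Two keys land in container 7 and one in container 9.
    Map<Long, Long> expected = expectedCounts(7L, 7L, 9L);
    AtomicInteger lookups = new AtomicInteger();
    // Simulated index that reports zero for the first two lookups, then the real counts.
    waitUntilCountsMatch(expected,
        id -> lookups.incrementAndGet() < 3 ? 0L : expected.get(id), 10L, 5_000L);
    System.out.println("Counts matched after " + lookups.get() + " lookups");
  }
}
```

Compared with a fixed sleep, a poll like this returns as soon as the index catches up and fails with a clear timeout message when it never does.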

  4. Clean up static state between tests

ContainerKeyMapperHelper keeps JVM-wide static flags and maps. When several test methods run in the same process, a later test can see stale state left behind by an earlier one.

  5. Safer shutdown in @AfterEach

The client is closed with IOUtils.closeQuietly(client) so a client-close error does not block cluster.shutdown().
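
The teardown ordering can be pictured with a small standalone sketch. The closeQuietly method below is a hypothetical stand-in written for this example, not the real IOUtils implementation, and the Runnable stands in for cluster.shutdown().

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a stand-in for the teardown pattern, not code from the patch.
final class QuietCloseSketch {

  /** Hypothetical close-quietly helper: swallows close errors so teardown continues. */
  static void closeQuietly(Closeable closeable) {
    if (closeable == null) {
      return;
    }
    try {
      closeable.close();
    } catch (IOException e) {
      // Deliberately ignored: a failed close must not block the remaining cleanup.
    }
  }

  /** Runs the teardown steps and returns the ones that executed, in order. */
  static List<String> tearDown(Closeable client, Runnable clusterShutdown) {
    List<String> steps = new ArrayList<>();
    closeQuietly(client);
    steps.add("client-close-attempted");
    clusterShutdown.run();
    steps.add("cluster-shutdown");
    return steps;
  }

  public static void main(String[] args) {
    Closeable failingClient = () -> {
      throw new IOException("close failed");
    };
    // The shutdown step still runs even though the client close throws.
    System.out.println(tearDown(failingClient, () -> System.out.println("cluster stopped")));
  }
}
```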

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-15004

How was this patch tested?

https://github.com/arunsarin85/ozone/actions/runs/24855010484
https://github.com/arunsarin85/ozone/actions/runs/24855051641

Contributor

@devmadhuu devmadhuu left a comment

Thanks @arunsarin85 for the patch. Kindly find comments.

@arunsarin85
Contributor Author

@devmadhuu Thanks for the review. I have added a patch for the above changes and triggered the flaky-test-check:
https://github.com/arunsarin85/ozone/actions/runs/25127529917
https://github.com/arunsarin85/ozone/actions/runs/25127529917/attempts/1

@arunsarin85 arunsarin85 requested a review from devmadhuu April 29, 2026 19:26
Contributor

@devmadhuu devmadhuu left a comment

Thanks @arunsarin85 for improving the patch. Largely looks good to me. Can you revisit your PR description? For example, I noticed that the point below seems obsolete. Please check the other points as well and rephrase them in a cleaner, more understandable way. Currently the language is too complex to understand.

Short settle time after the buffer wait
The OM event queue can be empty while a batch is still being processed (events are dequeued before task processing finishes). A two-second sleep after join() gives in-flight container-key updates time to land before assertions.

@arunsarin85
Contributor Author

Hi @devmadhuu, I have updated the PR description.

@devmadhuu devmadhuu self-requested a review May 5, 2026 08:42
Contributor

@devmadhuu devmadhuu left a comment

Thanks @arunsarin85 for improving the patch. LGTM +1

@adoroszlai
Contributor

Thanks @arunsarin85 for updating the patch. Please note checkstyle failure:

hadoop-ozone/integration-test-recon/src/test/java/org/apache/hadoop/ozone/recon/TestReconContainerEndpoint.java
 32: Wrong lexicographical order for 'org.apache.hadoop.hdds.scm.server.OzoneStorageContainerManager' import. Should be before 'org.apache.hadoop.hdds.utils.IOUtils'.

https://github.com/arunsarin85/ozone/actions/runs/25127388560/job/73644085279

@adoroszlai adoroszlai merged commit bc89991 into apache:master May 7, 2026
32 checks passed
@adoroszlai
Contributor

Thanks @arunsarin85 for the patch, @ArafatKhan2198, @devmadhuu for the review.

@arunsarin85 arunsarin85 deleted the HDDS-15004 branch May 7, 2026 11:48