HDDS-15004. Stabilize TestReconContainerEndpoint#testContainerEndpointForOBSBucket #10116
devmadhuu
left a comment
Thanks @arunsarin85 for the patch. Kindly find comments.
@devmadhuu Thanks for the review. I have added a patch for the above changes and triggered the flaky-test-check.
devmadhuu
left a comment
Thanks @arunsarin85 for improving the patch. Largely looks good to me. Can you revisit your PR description? For example, I noticed the point below seems obsolete. Please check the other points as well and rephrase them so they are clearer and easier to understand. Currently the language is too complex.
Short settle time after the buffer wait
The OM event queue can be empty while a batch is still being processed (events are dequeued before task processing finishes). A two-second sleep after join() gives in-flight container-key updates time to land before assertions.
Hi @devmadhuu, I have updated the PR description.
devmadhuu
left a comment
Thanks @arunsarin85 for improving the patch. LGTM +1
Thanks @arunsarin85 for updating the patch. Please note checkstyle failure: https://github.com/arunsarin85/ozone/actions/runs/25127388560/job/73644085279
Thanks @arunsarin85 for the patch, @ArafatKhan2198, @devmadhuu for the review.
What changes were proposed in this pull request?
Fix the intermittent failure in TestReconContainerEndpoint#testContainerEndpointForOBSBucket (HDDS-15004).
The test sometimes failed with expected: <1> but was: <0> on KeysResponse#getTotalCount(). That usually meant either Recon had not finished updating its container-key index, or the test queried the wrong container id.
Please describe your PR in detail:
OBS: the test no longer hard-codes the container id as 1L. It uses OzoneManager#lookupKey (via getContainerIdForKey) to get the container id from the key’s block locations.
FSO: the same helper is used for each key path instead of assuming fixed container ids.
Previously, after waitForEventBufferEmpty, the test only waited until the CompletableFuture was done. If the async work failed, the future would still report “done” and the test would continue anyway.
The test now calls completableFuture.join() after GenericTestUtils.waitFor(completableFuture::isDone, …) so a failed buffer wait is not ignored.
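To illustrate why checking isDone() alone is not enough, here is a minimal JDK-only sketch. The class and method names (JoinAfterIsDone, joinSurfacesFailure) are hypothetical and are not part of the patch; only CompletableFuture and CompletionException are real JDK types.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

// Hypothetical demo class, not code from the patch.
class JoinAfterIsDone {

  /**
   * Returns true if join() reports a failure on a future that isDone()
   * already considers complete.
   */
  static boolean joinSurfacesFailure(CompletableFuture<Void> future) {
    if (!future.isDone()) {
      throw new IllegalStateException("future is not done yet");
    }
    try {
      // join() rethrows an exceptional completion wrapped in CompletionException.
      future.join();
      return false;
    } catch (CompletionException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    CompletableFuture<Void> failed = new CompletableFuture<>();
    failed.completeExceptionally(new IllegalStateException("buffer wait failed"));

    // isDone() is true for normal and exceptional completion alike.
    System.out.println("isDone = " + failed.isDone());
    System.out.println("join surfaced failure = " + joinSurfacesFailure(failed));
  }
}
```

In a test, letting the CompletionException propagate from join() fails the test at the point where the buffer wait went wrong, rather than later at an unrelated assertion.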
The OM event buffer can be empty while Recon is still applying that batch (events are removed from the queue before processing finishes). So “buffer empty” is not enough to assert on the container endpoint.
Instead of sleeping, the test waits until ReconContainerMetadataManager#getKeyCountForContainer shows the expected number of keys per container. That logic lives in TestReconOmMetaManagerUtils.waitUntilReconKeyCounts, which polls until counts match or a timeout is hit.
The FSO test builds the expected counts from both written keys (so if two keys share a container, it waits for the right total).
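The polling approach can be sketched roughly as follows. This is not the actual TestReconOmMetaManagerUtils.waitUntilReconKeyCounts implementation: the class name, method names, and the LongUnaryOperator lookup are assumptions standing in for ReconContainerMetadataManager#getKeyCountForContainer, and this version returns false on timeout where the real helper may behave differently.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.LongUnaryOperator;

// Hypothetical sketch, not code from the patch.
class KeyCountPolling {

  /** Builds containerId -> expected key count; keys sharing a container are summed. */
  static Map<Long, Long> expectedCounts(List<Long> containerIdPerKey) {
    Map<Long, Long> expected = new HashMap<>();
    for (Long containerId : containerIdPerKey) {
      expected.merge(containerId, 1L, Long::sum);
    }
    return expected;
  }

  /**
   * Polls until every container reports its expected key count.
   * Returns false if the timeout elapses or the thread is interrupted.
   */
  static boolean waitUntilCountsMatch(Map<Long, Long> expected,
      LongUnaryOperator actualCount, long intervalMs, long timeoutMs) {
    long start = System.nanoTime();
    while (true) {
      boolean allMatch = true;
      for (Map.Entry<Long, Long> e : expected.entrySet()) {
        if (actualCount.applyAsLong(e.getKey()) != e.getValue()) {
          allMatch = false;
          break;
        }
      }
      if (allMatch) {
        return true;
      }
      if (System.nanoTime() - start >= timeoutMs * 1_000_000L) {
        return false;
      }
      try {
        Thread.sleep(intervalMs);
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
        return false;
      }
    }
  }

  public static void main(String[] args) {
    // Two keys land in container 5, one key in container 7.
    Map<Long, Long> expected = expectedCounts(List.of(5L, 5L, 7L));
    // Stand-in index that already holds the expected counts.
    boolean matched = waitUntilCountsMatch(expected, expected::get, 10L, 1000L);
    System.out.println("expected = " + expected + ", matched = " + matched);
  }
}
```

Polling on the observable state (the per-container key count) ties the wait to the condition the assertions depend on, so the test proceeds as soon as the index is ready and does not rely on a fixed sleep being long enough.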
ContainerKeyMapperHelper uses JVM-wide static flags and maps. If one test method runs before another in the same process, the second test can see stale state.
The client is closed with IOUtils.closeQuietly(client) so a client-close error does not block cluster.shutdown().
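The effect of a quiet close in teardown can be shown with a small JDK-only stand-in. The closeQuietly method below is a simplified illustration, not the IOUtils implementation used by the patch, and tearDown with its Runnable shutdown hook is a hypothetical placeholder for cluster.shutdown().

```java
import java.io.IOException;

// Hypothetical sketch, not code from the patch.
class QuietClose {

  /** Closes the resource and swallows any exception it throws (simplified stand-in). */
  static void closeQuietly(AutoCloseable resource) {
    if (resource == null) {
      return;
    }
    try {
      resource.close();
    } catch (Exception e) {
      // Deliberately ignored so that teardown can continue.
    }
  }

  /** Closes the client quietly, then always runs the shutdown step. */
  static void tearDown(AutoCloseable client, Runnable shutdown) {
    closeQuietly(client);
    shutdown.run();
  }

  public static void main(String[] args) {
    AutoCloseable failingClient = () -> {
      throw new IOException("close failed");
    };
    boolean[] shutdownRan = {false};
    tearDown(failingClient, () -> shutdownRan[0] = true);
    System.out.println("shutdown ran = " + shutdownRan[0]);
  }
}
```

Without the quiet close, an exception from the client would skip the shutdown step and could leave the cluster running for later tests in the same JVM.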
What is the link to the Apache JIRA?
https://issues.apache.org/jira/browse/HDDS-15004
How was this patch tested?
https://github.com/arunsarin85/ozone/actions/runs/24855010484
https://github.com/arunsarin85/ozone/actions/runs/24855051641