
HDDS-7989. UnhealthyReplicationProcessor retries failure without delay #4285

Merged
sodonnel merged 2 commits into apache:master from adoroszlai:HDDS-7989
Feb 21, 2023


@adoroszlai
Contributor

What changes were proposed in this pull request?

`UnhealthyReplicationProcessor#processAll` requeues any failed task. Because the task is requeued while the loop is still running, it is attempted again in the same `processAll` call, before the loop exits. A task that keeps failing is therefore retried without any delay, which can flood SCM logs until the cause of the error is resolved.

In GitHub's environment this exhausts disk space within a few minutes when testing EC reconstruction read (the test is being added in HDDS-7982).

This PR proposes to collect failed container health results and requeue them only after exiting the loop.
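The idea can be illustrated with a minimal, self-contained sketch. It is not the actual Ozone code: `DeferredRequeueSketch`, `enqueue`, `queued`, and the `Predicate`-based handler are hypothetical names chosen for illustration, and the real processor works with container health results rather than generic tasks. The point it shows is only that failures are collected in a separate list and put back on the queue after the loop, so each task is attempted at most once per `processAll` call.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.function.Predicate;

/** Hypothetical stand-in for the processor; not the actual Ozone class. */
final class DeferredRequeueSketch<T> {
  private final Queue<T> queue = new ArrayDeque<>();

  void enqueue(T task) {
    queue.add(task);
  }

  int queued() {
    return queue.size();
  }

  /**
   * Drains the queue once. The handler returns true on success; a false
   * result or a RuntimeException counts as a failure.
   *
   * @return the number of attempts made during this call
   */
  int processAll(Predicate<T> handler) {
    // Failed tasks are held here instead of going straight back on the queue.
    List<T> failed = new ArrayList<>();
    int attempts = 0;
    T task;
    while ((task = queue.poll()) != null) {
      attempts++;
      boolean ok;
      try {
        ok = handler.test(task);
      } catch (RuntimeException e) {
        ok = false;
      }
      if (!ok) {
        failed.add(task);
      }
    }
    // Requeue only after the loop has exited, so the next attempt
    // happens in a later processAll call rather than immediately.
    queue.addAll(failed);
    return attempts;
  }
}
```

With this shape, a handler that always fails still lets `processAll` return after one pass over the queue; the failed tasks remain queued for the next invocation.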

https://issues.apache.org/jira/browse/HDDS-7989

How was this patch tested?

Added unit test.

Also verified together with HDDS-7982 (which uncovered the problem without this fix):
https://github.com/adoroszlai/hadoop-ozone/actions/runs/4207471575/jobs/7302558782

Regular CI:
https://github.com/adoroszlai/hadoop-ozone/actions/runs/4207414175

@sodonnel sodonnel merged commit 47a68f8 into apache:master Feb 21, 2023
@adoroszlai adoroszlai deleted the HDDS-7989 branch February 21, 2023 22:15
@adoroszlai
Contributor Author

Thanks @sodonnel for reviewing and committing it.
