[DPE-6762] Catch archive command timeout #1328
Open
Issue
If pgBackRest cannot archive a WAL file in less than 60 seconds, its `check` command fails and the charm gets stuck in a blocked status.
Solution
In the stanza check step, handle error code `82` (Archive operation timeout) in the same way we already handle error code `49` (Cannot connect to host) in the stanza initialisation step:
postgresql-operator/src/backups.py
Line 643 in 6ffe0b4
`juju resolve`.
Fixes #784
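As a rough illustration of the approach described under Solution, the check step could branch on the pgBackRest exit code along these lines. This is only a sketch: `RetryableCheckError` and `handle_check_result` are hypothetical names, not the charm's actual code, and it assumes that raising an exception is what puts the hook into an error state so it can be retried.

```python
from __future__ import annotations

# Exit codes quoted in this description (from pgBackRest's error-codes.csv).
HOST_CONNECT_ERROR = 49      # "Cannot connect to host"
ARCHIVE_TIMEOUT_ERROR = 82   # "Archive operation timeout"


class RetryableCheckError(Exception):
    """Hypothetical exception; raising it is assumed to fail the hook so it can be retried."""


def handle_check_result(return_code: int, output: str) -> tuple[bool, str]:
    """Return (ok, message) for a finished stanza check, or raise on an archive timeout."""
    if return_code == 0:
        return True, ""
    if return_code == ARCHIVE_TIMEOUT_ERROR:
        # Only code 82 is treated as retryable in the check step for now.
        raise RetryableCheckError(f"archive timeout during stanza check: {output}")
    # Any other failure is reported back to the caller (e.g. to set a blocked status).
    return False, output
```

Keeping the retryable codes explicit makes it straightforward to extend the list later without turning every pgBackRest failure into an error state.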
Tested manually by adding `sleep 90 &&` before the actual archive command in `templates/patroni.yml.j2`, then deploying the charm and connecting it to Microceph's RadosGW.
pgBackRest error codes extracted from source code: error-codes.csv. I think we shouldn't put the charm into an error state for every error code, so I'm limiting it to error code `82` for now. We may expand the list of error codes that trigger this behaviour in the future.
Also, pgBackRest sometimes sends its error message to stdout instead of stderr (and we don't log it, as seen in #1280), so it's important to check both. I ported the changes from #1320 to this PR to handle that situation.
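To illustrate the stdout/stderr point, here is a minimal sketch of collecting both streams when running a command. `run_and_collect` is a hypothetical helper, not the charm's implementation, and the command in the usage example is a stand-in that only imitates a tool printing its error to stdout and exiting with code 82.

```python
from __future__ import annotations

import subprocess
import sys


def run_and_collect(command: list[str]) -> tuple[int, str]:
    """Run a command and return its exit code plus any text from stdout and stderr."""
    result = subprocess.run(command, capture_output=True, text=True, check=False)
    # Keep whichever streams contain text, so a message printed to stdout is not lost.
    parts = [s.strip() for s in (result.stdout, result.stderr) if s and s.strip()]
    return result.returncode, "\n".join(parts)


# Stand-in command: prints a message to stdout and exits with code 82.
fake_command = [
    sys.executable,
    "-c",
    "import sys; print('simulated error on stdout'); sys.exit(82)",
]
return_code, message = run_and_collect(fake_command)
```

With this shape, the caller can log `message` and inspect `return_code` regardless of which stream the error was written to.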
Checklist