Improve reliability of source calendar retrieval by enabling retries #403
I was having problems where my entire calendar would get deleted temporarily, only to be recreated an hour later, due to intermittent HTTP errors being reported by the calendar source. It is the same issue discussed in #343.
Related work
PR #392 tries to solve this same problem by silently skipping sync when such errors occur. I'm uncomfortable with that solution, because I would prefer "obviously wrong" (entire calendar is missing) over "deceptively wrong" (calendar looks okay but is actually showing me wrong information because it has stopped syncing). Skipping sync would be a good solution if we ALSO provided a better mechanism for communicating problems, for example email notifications. Ideally we should pursue that.
A different approach
But then I found a superior fix: This PR completely solves the root cause in my case. That is why I have not invested the time to investigate better error notifications. For my own purposes at least, this PR entirely eliminated the need for PR #392.
Fixes #343
What I changed
Track the failures. If an HTTP request ultimately fails (that is, it still has not succeeded after all retries), throw a top-level exception so that the Google Apps Script "Executions" dashboard reports the execution as failing. Example shown below:
Reattempt HTTP requests. Improve callWithBackoff() so that it retries HTTP requests for all status codes that are known to be intermittent problems.
With these changes, I can see that the failures are no longer occurring. The longest reattempt took 8 tries over 14 seconds, as seen in this log:
But since yesterday, every execution ultimately succeeded. 👍
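For reviewers who want the shape of the two changes without opening the diff, here is a minimal sketch of the retry-then-throw behavior described under "What I changed". It is not the code in this PR: the parameter list, the list of retryable status codes, and the doubling delay are all assumptions made for illustration. The request and sleep functions are passed in so the sketch can run outside Apps Script.

```javascript
// Assumed set of "intermittent" status codes; the PR's actual list may differ.
const RETRYABLE_STATUS = [408, 429, 500, 502, 503, 504];

// Hypothetical signature, for illustration only.
// request: function returning an object with getResponseCode()
// maxAttempts: total number of tries, including the first one
// sleep: function taking a delay in milliseconds
function callWithBackoff(request, maxAttempts, sleep) {
  let lastStatus = null;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const response = request();
    lastStatus = response.getResponseCode();
    if (lastStatus >= 200 && lastStatus < 300) {
      return response; // success, possibly after earlier failed attempts
    }
    if (!RETRYABLE_STATUS.includes(lastStatus)) {
      break; // not a known intermittent error, so retrying is pointless
    }
    if (attempt < maxAttempts) {
      sleep(Math.pow(2, attempt - 1) * 1000); // wait 1 s, 2 s, 4 s, ...
    }
  }
  // Let the error propagate to the top level so the execution is
  // recorded as failed instead of silently skipping the sync.
  throw new Error('Request failed with HTTP status ' + lastStatus);
}

// In Apps Script this might be called as:
//   callWithBackoff(
//     () => UrlFetchApp.fetch(url, { muteHttpExceptions: true }),
//     10,
//     Utilities.sleep);
```

Setting `muteHttpExceptions` makes `UrlFetchApp.fetch` return the response for error statuses rather than throwing, which is what lets the loop inspect the status code and decide whether to retry.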