Add CSI nodeUnstage retries #4069

Merged
merged 5 commits into from Dec 22, 2023

@fierlion fierlion commented Dec 21, 2023

Summary

This adds retries to nodeUnstage (up to 5 attempts) with a generous 5-10 second interval between attempts to accommodate any issues with the CSI driver. In the best case nodeUnstage succeeds on the first try (i.e. no retries); in the worst case it fails 5 consecutive times. The total time is capped at 30 seconds, enforced by the context timeout.

Implementation details

This wraps the RetryNWithBackoff call in a context.WithTimeout, so retrying stops once either the attempt limit or the timeout is reached. It also adds a set of retry constants to limit the retries.

Testing

Tested manually

  • built the agent and loaded the image into docker using make release-agent && docker load < agent.tar
  • started an EBS-backed task which ran for 2 minutes and then stopped, triggering the nodeUnstage at task cleanup
  • first verified that the task would run, stop, and clean up as expected
  • started the same task again
  • while this task was running, manually killed the csi-driver container using docker kill. The restart policy is on-failure, so docker will not restart the driver when it's killed manually
  • saw the cleanup fail and the logging run through 4 of the 5 attempts before hitting the 30 second context timeout, followed by the nodeUnstage failure message

New tests cover the changes: yes (updated unit tests to have min/max 1-5 calls for retries)

Description for the changelog

Add CSI nodeUnstage retries.

Does this PR include breaking model changes? If so, have you added transformation functions?
No

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@fierlion fierlion requested a review from a team as a code owner December 21, 2023 19:17
@fierlion fierlion merged commit 7e9473f into dev Dec 22, 2023
45 checks passed
@fierlion fierlion deleted the fierlion/nodeUnstageRetries branch December 22, 2023 21:57
@mye956 mye956 mentioned this pull request Jan 9, 2024