Add timeout to SDK k8s client #3070
Merged: zmerlynn merged 2 commits into agones-dev:main, Apr 5, 2023

Conversation
Branch updated from f0fc75d to 33aff15
markmandel (Collaborator) approved these changes, Apr 5, 2023, commenting:
Seems like a good change no matter what 👍🏻
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: markmandel, zmerlynn.
New changes are detected. LGTM label has been removed.
Build Succeeded 👏 Build Id: 8a612f44-eae4-49b7-ad09-35feb7880ab3
The following development artifacts have been built, and will exist for the next 30 days:
A preview of the website (the last 30 builds are retained). To install this version:
Kalaiselvi84 pushed a commit to Kalaiselvi84/agones that referenced this pull request, Apr 11, 2023:
The SDK client only ever accesses small amounts of data (a single object list / event updates), so latency of more than a couple of seconds is excessive. We need to keep a relatively tight timeout during initialization as well, to give the informer a chance to retry: the SDK won't reply to /healthz checks until the informer has synced once, and our liveness configuration only allows 9s before a positive /healthz.
zmerlynn added a commit to zmerlynn/agones that referenced this pull request, Apr 17, 2023:
The problem addressed by agones-dev#3070 is that, on an indeterminate basis, we are seeing containers start without networking fully available. Once networking seems to work, it works fine. However, the fix in agones-dev#3070 introduced a downside: heavy watch traffic, because I didn't quite understand that it would also cut off the hanging GET of the watch. See agones-dev#3106.

Instead of timing out the whole client, let's use an initial-probe approach and block on a successful GET (with a reasonable timeout) before we try to start informers.

Fixes agones-dev#3106
zmerlynn added a commit that referenced this pull request, Apr 18, 2023:
Revert #3070, wait on networking a different way

The problem addressed by #3070 is that, on an indeterminate basis, we are seeing containers start without networking fully available. Once networking seems to work, it works fine. However, the fix in #3070 introduced a downside: heavy watch traffic, because I didn't quite understand that it would also block the hanging GET of the watch. See #3106.

Instead of timing out the whole client, let's use an initial-probe approach and block on a successful GET (with a reasonable timeout) before we try to start informers.

Along the way: fix a nil pointer deref when TestPingHTTP fails.

Fixes #3106
This seems to help with (many of?) the flakes we're seeing in CI by forcing the informer to retry lists, rather than the SDK dying after 30s of hanging.