Update pod readiness condition #607
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Force-pushed from 207716c to cea693a
Can you add some detailed design notes on how often the controller reconciles all Lattice TG targets? Does it pull Lattice health state for all targets/pods?
No - the controller only listens on endpoint events, and pulls Lattice health state only if any of the endpoints is not ready. This means the controller does not go in the opposite direction (making ready pods non-ready). The purpose of readiness here is "disabling traffic while keeping pods alive", which is already handled by VPC Lattice health checks, so I believe this should not block any use cases, though we can support it in the future.
What happens if a target's HC is not ready? Will the controller retry later, or just busy-wait?
To make the PR smaller, is it possible to have a separate CR just for the endpoint slice change?
The controller just goes into another reconcile loop, not busy waiting; a rough sketch is below.
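A minimal sketch of that event-driven, requeue-based flow (assuming controller-runtime; the reconciler and helper names here are illustrative, not the PR's actual code):

```go
package controllers

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
)

// PodReadinessReconciler is a hypothetical stand-in for the controller
// described above: it reacts to Endpoints events instead of polling all
// target groups on a timer.
type PodReadinessReconciler struct{}

func (r *PodReadinessReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&corev1.Endpoints{}). // reconcile only when endpoints change
		Complete(r)
}

func (r *PodReadinessReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// updatePodConditions is a hypothetical helper that pulls Lattice health
	// state and patches pod conditions, reporting whether any target is
	// still not ready.
	requeue, err := r.updatePodConditions(ctx, req)
	if err != nil {
		return ctrl.Result{}, err
	}
	if requeue {
		// Hand the item back to the workqueue with a delay; no busy waiting.
		return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
	}
	return ctrl.Result{}, nil
}

func (r *PodReadinessReconciler) updatePodConditions(ctx context.Context, req ctrl.Request) (bool, error) {
	return false, nil // placeholder for the real health-state sync
}
```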
erikfuller left a comment
I'm still working through some of the tests but wanted to publish the comments I had in the meantime. Mostly I just have lots of questions.
```yaml
runs-on: ubuntu-latest
steps:
  - uses: actions/checkout@v3
  - run: sed -En 's/^go[[:space:]]+([[:digit:].]+)$/GO_VERSION=\1/p' go.mod >> $GITHUB_ENV
```
How does this fit in?
This pins the Go version to 1.20, which is the current version of the codebase. It looks like running golangci-lint with 1.21 no longer works. I'm not sure if that's a bug or a regression, but I think this is the right change to make anyway.
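For context, a later step in the job would typically consume the extracted value like this (illustrative; the actual workflow may wire it differently):

```yaml
- uses: actions/setup-go@v4
  with:
    go-version: ${{ env.GO_VERSION }}
```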
```go
func SetPodStatusCondition(conditions *[]corev1.PodCondition, newCondition corev1.PodCondition) {
	if conditions == nil {
		return
```
Could we silently fail updates here somehow? When would we expect nil here?
```go
if existingCondition.Status != newCondition.Status {
	existingCondition.Status = newCondition.Status
	if !newCondition.LastTransitionTime.IsZero() {
		existingCondition.LastTransitionTime = newCondition.LastTransitionTime
```
Is it really important to keep this value rather than just overwriting it with time.Now()?
These are copied from the meta package; it's a general library for handling conditions, but it doesn't work for pod conditions due to a type difference. I didn't want to touch that package, so I'll mention this in a comment.
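For reference, the adapted pattern looks roughly like this (a sketch modeled on meta.SetStatusCondition from apimachinery, retyped for corev1.PodCondition; details may differ from the PR):

```go
package utils

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// SetPodStatusCondition mirrors apimachinery's meta.SetStatusCondition but
// operates on corev1.PodCondition instead of metav1.Condition.
func SetPodStatusCondition(conditions *[]corev1.PodCondition, newCondition corev1.PodCondition) {
	if conditions == nil {
		return
	}
	// Find an existing condition of the same type, if any.
	var existingCondition *corev1.PodCondition
	for i := range *conditions {
		if (*conditions)[i].Type == newCondition.Type {
			existingCondition = &(*conditions)[i]
			break
		}
	}
	if existingCondition == nil {
		if newCondition.LastTransitionTime.IsZero() {
			newCondition.LastTransitionTime = metav1.Now()
		}
		*conditions = append(*conditions, newCondition)
		return
	}
	// Only bump LastTransitionTime when the status actually changes.
	if existingCondition.Status != newCondition.Status {
		existingCondition.Status = newCondition.Status
		if !newCondition.LastTransitionTime.IsZero() {
			existingCondition.LastTransitionTime = newCondition.LastTransitionTime
		} else {
			existingCondition.LastTransitionTime = metav1.Now()
		}
	}
	existingCondition.Reason = newCondition.Reason
	existingCondition.Message = newCondition.Message
}
```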
```go
	identifier = tg.Status.Id
}

latticeTargets, err := t.targetsManager.List(ctx, tg)
```
Can't a pod be a target for multiple target groups? With the current logic, only one target group would need to show the pod as healthy before it is marked ready. This may be fine, but I just wanted to confirm this was intended behaviour.
This actually makes sense. Not sure if I can fit it in this PR, but will give it a try.
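One possible shape for that check (a sketch, not the PR's implementation; the helper and its input are hypothetical): mark the pod ready only when every target group containing it reports the target healthy, or unused:

```go
package lattice

import "github.com/aws/aws-sdk-go/service/vpclattice"

// podReadyAcrossTargetGroups is a hypothetical helper: given this pod's
// Lattice target status in each target group that includes it, the pod is
// ready only if no target group still reports it as anything other than
// healthy or unused.
func podReadyAcrossTargetGroups(statusByTargetGroup map[string]string) bool {
	for _, status := range statusByTargetGroup {
		if status != vpclattice.TargetStatusHealthy && status != vpclattice.TargetStatusUnused {
			return false
		}
	}
	return true
}
```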
```go
var requeue bool
for _, target := range modelTargets {
	// Step 0: Check if the endpoint is not ready yet.
	if !target.Ready && target.TargetRef.Name != "" {
```
Rather than nesting everything in this if block, it may be better to check for the inverse condition and continue; see the sketch below.
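Concretely, the suggested guard-clause shape (a sketch of the refactor, not the PR's code):

```go
for _, target := range modelTargets {
	// Skip targets that are already ready or are not pod-backed.
	if target.Ready || target.TargetRef.Name == "" {
		continue
	}
	// ... handle the not-ready pod target without extra nesting ...
}
```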
Are there occasions where the TargetRef.Name will be empty? Is that expected?
If the target is not a pod, etc. I don't think this happens in a normal scenario, but I think it's better to have a sanity check and ignore those.
Force-pushed from 08653f9 to 50f33d4
erikfuller left a comment
Looked through the tests. No major concerns. If we can figure out the multiple target group case we're probably good to go.
```go
{
	name:    "Unused pods keep condition",
	model:   target,
	lattice: newLatticeTarget("10.10.1.1", 8675, vpclattice.TargetStatusUnused),
```
Should this include a test for Draining just for completeness?
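For example, mirroring the Unused entry above (illustrative; vpclattice.TargetStatusDraining is the SDK's draining status):

```go
{
	name:    "Draining pods keep condition",
	model:   target,
	lattice: newLatticeTarget("10.10.1.1", 8675, vpclattice.TargetStatusDraining),
},
```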
erikfuller left a comment
Thanks for putting this together and for all your efforts and leadership on this project. I think this PR gives us a good starting point, though I agree we may have to introduce multiple readiness gates, as is done in the load balancer controller.
While the logic all looks correct, we should still verify this with a zero-downtime rolling update, as that is the ultimate use case here. I'll see if someone can follow up with this as part of pre-release testing.
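For that verification, the rolling update would exercise pods wired up roughly like this (illustrative; the actual conditionType string is defined by the controller and may differ):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: app
  template:
    metadata:
      labels:
        app: app
    spec:
      readinessGates:
        # Hypothetical condition type; use the one the controller publishes.
        - conditionType: application-networking.k8s.aws/pod-readiness-gate
      containers:
        - name: app
          image: public.ecr.aws/nginx/nginx:latest
```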
What type of PR is this?
feature
Which issue does this PR fix:
#560
What does this PR do / Why do we need it:
Please refer to the issue description of #560 for the background.
If an issue # is not available please add repro steps and logs from aws-gateway-controller showing the issue:
Testing done on this change:
Tested in a simple scenario with 2 pods in a deployment, then adding pods (with the readiness gate enabled).
Will update after further test.
Automation added to e2e:
e2e test cases coming soon
Will this PR introduce any new dependencies?:
Will this break upgrades or downgrades? Has updating a running cluster been tested?:
Does this PR introduce any user-facing change?:
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.