
Conversation

@solmonk (Contributor) commented Feb 28, 2024:

What type of PR is this?

feature

Which issue does this PR fix:

#560

What does this PR do / Why do we need it:

Please refer to the issue description of #560 for the background.

  • Model build: ready and non-ready targets are now both registered (unless they are terminating)
  • Target synthesizer: updates the pod readiness condition. It only operates when the pod has the readiness gate. The gate is mostly injected by the webhook, but users may add it themselves (that's how I tested; see the sketch after this list).
    • This might be slightly misplaced, but I couldn't find a better place for it.
    • The status logic is mostly in the synthesizer as well; I tried to comment it as much as possible.
    • I put the status update logic in the PostSynthesize step, since otherwise the health check status can be misleading (e.g. with the unused status).
  • Target manager: stale-target identification logic slightly improved
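
For reference, a minimal Go sketch (using client-go types) of adding the gate manually; the condition type below is an assumption, as the real value is whatever this controller's webhook injects:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// Hypothetical condition type; the actual value is defined by this controller.
	const gateType = corev1.PodConditionType("application-networking.k8s.aws/pod-readiness-gate")

	pod := corev1.Pod{
		Spec: corev1.PodSpec{
			// With this gate present, the pod stays NotReady until the
			// controller posts a matching condition with Status=True.
			ReadinessGates: []corev1.PodReadinessGate{{ConditionType: gateType}},
		},
	}
	fmt.Println(pod.Spec.ReadinessGates[0].ConditionType)
}
```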

If an issue # is not available please add repro steps and logs from aws-gateway-controller showing the issue:

Testing done on this change:

Tested in a simple scenario with a 2-pod deployment (with the readiness gate), then added pods:

  • Confirmed pods start in a non-ready state
  • Confirmed targets are added while pods are non-ready
  • Confirmed pods become ready on the next reconciliation loop

Will update after further testing.

Automation added to e2e:
e2e test cases coming soon

Will this PR introduce any new dependencies?:

Will this break upgrades or downgrades. Has updating a running cluster been tested?:

Does this PR introduce any user-facing change?:


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@solmonk requested a review from zijun726911 on February 28, 2024 18:30
@solmonk force-pushed the pod-readiness branch 2 times, most recently from 207716c to cea693a on February 28, 2024 19:05
@liwenwu-amazon (Contributor) commented:

Can you add a detailed design note on how often the controller reconciles all lattice TG targets?
e.g.
Is it on a timer interval?

Does it pull lattice health state for all targets/pods?

@solmonk (Contributor, Author) commented Feb 28, 2024:

No - the controller only listens for endpoint events, and pulls lattice health state only if any of the endpoints is not ready. This means the controller does not go in the opposite direction (making ready pods non-ready).

The purpose of readiness here is "disabling traffic while keeping pods alive," which is already handled by the VPC Lattice health check, so I believe this should not block any use cases, although we can support that direction in the future.

@solmonk requested a review from erikfuller on February 28, 2024 20:18
@liwenwu-amazon (Contributor) commented:

> No - the controller only listens for endpoint events, and pulls lattice health state only if any of the endpoints is not ready. This means the controller does not go in the opposite direction (making ready pods non-ready).

What happens if targets' HC is not ready? Will the controller retry later, or just busy-wait?

@liwenwu-amazon (Contributor) commented:

To make the PR smaller, is it possible to have a separate CR just for the endpoint slice change?

@solmonk (Contributor, Author) commented Feb 28, 2024:

The controller just goes through another reconcile loop; it is not busy-waiting.
A separate PR is actually already up, if you're interested: #604.
If that gets merged first, I can rebase this PR to reduce the diff.
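
For illustration, a minimal controller-runtime-style sketch of that retry behavior; `TargetGroupReconciler` and `checkLatticeTargets` are hypothetical names, not this repo's actual plumbing:

```go
package controllers

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// TargetGroupReconciler is a hypothetical reconciler used only for this sketch.
type TargetGroupReconciler struct{}

// checkLatticeTargets is a hypothetical helper; the real controller lists
// targets through the VPC Lattice API.
func (r *TargetGroupReconciler) checkLatticeTargets(ctx context.Context, req ctrl.Request) (bool, error) {
	return false, nil
}

// Reconcile sketches the "retry via another reconcile loop" behavior: when
// some target is not yet healthy, re-enqueue and return instead of blocking.
func (r *TargetGroupReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	allReady, err := r.checkLatticeTargets(ctx, req)
	if err != nil {
		return ctrl.Result{}, err // controller-runtime retries with backoff
	}
	if !allReady {
		// Not busy-waiting: the item is requeued and this invocation returns.
		return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
	}
	return ctrl.Result{}, nil
}
```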

@erikfuller (Contributor) left a comment:

I'm still working through some of the tests but wanted to publish the comments I had in the meantime. Mostly I just have lots of questions.

runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- run: sed -En 's/^go[[:space:]]+([[:digit:].]+)$/GO_VERSION=\1/p' go.mod >> $GITHUB_ENV
Contributor:

How does this fit in?

Contributor Author:

This pins the Go version to 1.20, which is the current version of the codebase. Running golangci-lint with Go 1.21 no longer seems to work. I am not sure whether that is a bug or a regression, but I think this change is the right one to make anyway.


func SetPodStatusCondition(conditions *[]corev1.PodCondition, newCondition corev1.PodCondition) {
	if conditions == nil {
		return
Contributor:

Could we silently fail updates here somehow? When would we expect nil here?

if existingCondition.Status != newCondition.Status {
	existingCondition.Status = newCondition.Status
	if !newCondition.LastTransitionTime.IsZero() {
		existingCondition.LastTransitionTime = newCondition.LastTransitionTime
Contributor:

Is it really important to keep this value rather than just overwriting it with time.Now()?

Contributor Author:

These are copied from the meta package; it is a general-purpose library for handling conditions, but it doesn't work for pod conditions due to the type difference. I didn't want to touch that package; I will mention this in a comment.
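
For context, the copied helper presumably looks roughly like this, re-typed from meta.SetStatusCondition (k8s.io/apimachinery/pkg/api/meta) for corev1.PodCondition; this is a sketch reconstructed from the fragments quoted above, not the exact PR code:

```go
package podutils

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// SetPodStatusCondition mirrors meta.SetStatusCondition, re-typed because
// pod conditions are []corev1.PodCondition rather than []metav1.Condition.
func SetPodStatusCondition(conditions *[]corev1.PodCondition, newCondition corev1.PodCondition) {
	if conditions == nil {
		return
	}
	for i := range *conditions {
		existing := &(*conditions)[i]
		if existing.Type != newCondition.Type {
			continue
		}
		// Only bump LastTransitionTime when the status actually changes.
		if existing.Status != newCondition.Status {
			existing.Status = newCondition.Status
			if !newCondition.LastTransitionTime.IsZero() {
				existing.LastTransitionTime = newCondition.LastTransitionTime
			} else {
				existing.LastTransitionTime = metav1.Now()
			}
		}
		existing.Reason = newCondition.Reason
		existing.Message = newCondition.Message
		return
	}
	// No condition of this type yet: append it.
	if newCondition.LastTransitionTime.IsZero() {
		newCondition.LastTransitionTime = metav1.Now()
	}
	*conditions = append(*conditions, newCondition)
}
```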

	identifier = tg.Status.Id
}

latticeTargets, err := t.targetsManager.List(ctx, tg)
Contributor:

Can't a pod be a target for multiple target groups? With the current logic, only one target group would need to show the pod as healthy before it is marked ready. This may be fine, but I just wanted to confirm this was intended behaviour.

Contributor Author:

This actually makes sense. I'm not sure I can fit it into this PR, but I will give it a try.

var requeue bool
for _, target := range modelTargets {
	// Step 0: Check if the endpoint is not ready yet.
	if !target.Ready && target.TargetRef.Name != "" {
Contributor:

Rather than nesting everything in this if block, it may be better to check for the condition and continue.
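
For example (a sketch; the Target shape is inferred from the quoted snippet):

```go
package sketch

// Target mirrors the fields used in the loop above; hypothetical shape.
type Target struct {
	Ready     bool
	TargetRef struct{ Name string }
}

// syncNotReadyTargets shows the inverted control flow: bail out early with
// continue instead of nesting the whole body under the condition.
func syncNotReadyTargets(modelTargets []Target) {
	for _, target := range modelTargets {
		if target.Ready || target.TargetRef.Name == "" {
			continue // already ready, or not a pod-backed target
		}
		// ... update the pod readiness condition here ...
	}
}
```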

Contributor:

Are there occasions where the TargetRef.Name will be empty? Is that expected?

Contributor Author:

If the target is not a pod, etc. I don't think it happens in a normal scenario, but I think it is better to have a sanity check and ignore those targets.

@erikfuller (Contributor) left a comment:

Looked through the tests. No major concerns. If we can figure out the multiple-target-group case, we're probably good to go.

{
	name:    "Unused pods keep condition",
	model:   target,
	lattice: newLatticeTarget("10.10.1.1", 8675, vpclattice.TargetStatusUnused),
Contributor:

Should this include a test for Draining just for completeness?
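
Presumably that would be a near-copy of the Unused entry above, e.g. (assuming the same table shape and the newLatticeTarget helper, and that draining targets keep their condition like unused ones):

```go
{
	name:    "Draining pods keep condition",
	model:   target,
	lattice: newLatticeTarget("10.10.1.1", 8675, vpclattice.TargetStatusDraining),
},
```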

@solmonk (Contributor, Author) commented Mar 1, 2024:

  • findStaleTargets now skips draining targets, as there is no need to deregister them.
  • Added an e2e test.
  • I wasn't able to make the multi-TG use case work. It turned out to be more complex than I expected: since only one route is processed at a time, we can't calculate readiness when a pod is shared between multiple TGs across different HTTPRoutes. If we want to handle this, I think we need a different approach, such as injecting a separate readiness condition for each route (see the sketch below).
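
For reference, a sketch of that per-route idea, similar in spirit to the per-target-group readiness gates in the AWS Load Balancer Controller; the condition-type prefix and function here are hypothetical:

```go
package readiness

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// RouteConditionType derives one pod readiness condition per route, so that
// readiness can be computed independently for each HTTPRoute a pod serves.
// A real implementation would need a stable, length-safe encoding of the
// route's namespace/name.
func RouteConditionType(routeNamespace, routeName string) corev1.PodConditionType {
	return corev1.PodConditionType(
		fmt.Sprintf("application-networking.k8s.aws/readiness-gate-%s_%s", routeNamespace, routeName))
}
```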

@erikfuller (Contributor) left a comment:

Thanks for putting this together and for all your efforts and leadership in this project. I think this PR gives us a good starting point, though I agree we may have to introduce multiple readiness gates, as is done in the load balancer controller.

While the logic all looks correct, we should still verify it with a zero-downtime rolling update, as that is the ultimate use case here. I'll see if someone can follow up on this as part of pre-release testing.

