Skip to content

Conversation

@erikfuller
Copy link
Contributor

@erikfuller erikfuller commented Feb 27, 2024

What type of PR is this?
feature

Which issue does this PR fix:
#596

What does this PR do / Why do we need it:
This PR adds a pod mutating webhook to inject the controller's pod readiness gate.

One of the biggest changes is that webhooks must be network-reachable by the API server. Previously, the controller only required communication from itself to the API server. This means local testing against a remote cluster is very challenging. Most of my testing was done by pushing a custom ECR image for the controller and deploying it to my test cluster.

Webhook invocation is only over TLS, so the controlller must now present a TLS certificate to the API server. This certificate is validated by its DNS name(s) and must also pass trust validation checks. More details are in the pod-readiness-gates.md file.

For the TLS cert we require a standard kubernetes tls secret called webhook-cert. It must be present in the same namespace as the controller itself. The certificate is mounted as read-only by the controller container at /etc/webhook-cert.

By default, the webhook includes a namespace label selector, meaning the webhook will only be called for namespaces which include the label aws-application-networking-k8s/pod-readiness-gate-inject with value enabled.

Otherwise, this work is based off the webhook work in aws-load-balancer-controller and aws-app-mesh-controller-for-k8s. Note that, unlike the load balancer controller, the webhook configuration here is created by default. Note however, the webhook logic is only enabled by the namespace label described above, otherwise the webhook will not execute or inject the controller readiness gate.

What is left to be done?
These will follow in separate PRs:

  1. The actual pod readiness gate implementation
  2. Helm chart update to support the webhook and configure the TLS certificate
  3. Update to build scripts to automatically run webhook e2e tests

Testing done on this change:
The general testing process:

# after build and deploy image to ECR 
# deploy controller (and webhook) to cluster
kubectl create -f deploy.yaml 

# recreate the webhook secret
kubectl delete secret tls $WEBHOOK_SECRET_NAME --namespace $WEBHOOK_NAMESPACE
kubectl create secret tls $WEBHOOK_SECRET_NAME --namespace $WEBHOOK_NAMESPACE --cert=${CERT_FILE} --key=${KEY_FILE}

# patch the webhook with the cert details to enable trust from API server
kubectl patch mutatingwebhookconfigurations.admissionregistration.k8s.io aws-appnet-gwc-mutating-webhook --namespace $WEBHOOK_NAMESPACE --type='json' -p="[{'op': 'replace', 'path': '/webhooks/0/clientConfig/caBundle', 'value': '${CERT_B64}'}]"
mutatingwebhookconfiguration.admissionregistration.k8s.io/aws-appnet-gwc-mutating-webhook patched

# run manual tests OR webhook-e2e-tests
make webhook-e2e-test

Automation added to e2e:
Since webhooks must be invoked by the API server, they do not fit into the standard make e2e-test, at least for local development and testing. Instead, I have created a new make target make webhook-e2e-test. These tests can be run in a local dev environment against a remote cluster once the controller is deployed and the webhook secret updated.

make webhook-e2e-test
=== RUN   TestIntegration
Running Suite: WebhookIntegration - /.../aws-application-networking-k8s/test/suites/webhook
===============================================================================================================
Random Seed: 1709056486

Will run 2 of 2 specs
------------------------------
[SynchronizedBeforeSuite] 
/.../aws-application-networking-k8s/test/suites/webhook/suite_test.go:23
[SynchronizedBeforeSuite] PASSED [0.000 seconds]
------------------------------
Readiness Gate Inject create deployment in untagged namespace, no readiness gate
/.../aws-application-networking-k8s/test/suites/webhook/readiness_gate_inject_test.go:35
{"level":"info","ts":"2024-02-27T09:54:46.310-0800","caller":"test/framework.go:264","msg":"Deleting objects: *v1.Namespace/webhook-e2e-test-no-tag, *v1.Namespace/webhook-e2e-test-tagged"}
{"level":"info","ts":"2024-02-27T09:54:46.409-0800","caller":"test/framework.go:282","msg":"Waiting for NotFound, objects: *v1.Namespace/webhook-e2e-test-no-tag, *v1.Namespace/webhook-e2e-test-tagged"}
{"level":"info","ts":"2024-02-27T09:54:46.425-0800","caller":"test/framework.go:186","msg":"Creating objects: *v1.Namespace/webhook-e2e-test-no-tag, *v1.Namespace/webhook-e2e-test-tagged"}
{"level":"info","ts":"2024-02-27T09:54:46.447-0800","caller":"test/framework.go:186","msg":"Creating objects: *v1.Deployment/sparr-1-inmjcpummz"}
{"level":"info","ts":"2024-02-27T09:54:46.506-0800","caller":"test/pod_manager.go:68","msg":"deployment.Spec.Selector.MatchLabels: map[app:untagged-test-pod]"}
• [0.215 seconds]
------------------------------
Readiness Gate Inject create deployment in tagged namespace, has readiness gate
/.../aws-application-networking-k8s/test/suites/webhook/readiness_gate_inject_test.go:56
{"level":"info","ts":"2024-02-27T09:54:46.526-0800","caller":"test/framework.go:186","msg":"Creating objects: *v1.Deployment/finge-2-hicyysf5br"}
{"level":"info","ts":"2024-02-27T09:54:46.569-0800","caller":"test/pod_manager.go:68","msg":"deployment.Spec.Selector.MatchLabels: map[app:tagged-test-pod]"}
{"level":"info","ts":"2024-02-27T09:54:46.588-0800","caller":"test/framework.go:264","msg":"Deleting objects: *v1.Namespace/webhook-e2e-test-no-tag, *v1.Namespace/webhook-e2e-test-tagged"}
{"level":"info","ts":"2024-02-27T09:54:46.608-0800","caller":"test/framework.go:282","msg":"Waiting for NotFound, objects: *v1.Namespace/webhook-e2e-test-no-tag, *v1.Namespace/webhook-e2e-test-tagged"}
• [11.283 seconds]
------------------------------
[SynchronizedAfterSuite] 
/.../aws-application-networking-k8s/test/suites/webhook/suite_test.go:39
[SynchronizedAfterSuite] PASSED [0.000 seconds]
------------------------------

Ran 2 of 2 Specs in 11.500 seconds
SUCCESS! -- 2 Passed | 0 Failed | 0 Pending | 0 Skipped
--- PASS: TestIntegration (11.50s)
PASS
ok  	github.com/aws/aws-application-networking-k8s/test/suites/webhook	12.869s

Will this PR introduce any new dependencies?:
We now have a dependency on the presence of the webhook secret. Note that, if the webhook is not invoked

Will this break upgrades or downgrades. Has updating a running cluster been tested?:
I tested deploying with the previous build's deploy.yaml, then applying the new deploy.yaml which includes the webhook configuration, without issue.

Does this PR introduce any user-facing change?:
Yes

Support for automatically injecting pod readiness gate. For more details see docs/guides/pod-readiness-gates.md

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

KeyName: "tls.key",
})
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When using make run, DevMode will be non-empty, so locally running the controller will skip this webhook configuration. Otherwise, the controller will only start if the webhook cert files are present (at /etc/webhook-cert/tls.crt and tls.key).

@@ -1,26 +1,10 @@
# Adds namespace to all resources.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Much of this commented code came from the aws-load-balancer-controller. I've tried to remove what we don't need. One notable difference is that the gateway controller enables the webhook configuration by default, even though you still have to opt into it for any given namespace.

secret:
defaultMode: 420
secretName: webhook-cert
---
Copy link
Contributor Author

@erikfuller erikfuller Feb 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This placeholder certificate is generated during make build-deploy:

When calling a webhook, the API server performs the following validations:

  1. Cert validity window has not passed
  2. SAN matches DNS (e.g. webhook-service.aws-application-networking-system.svc)
  3. Cert is signed by a known authority OR matches the ca field on the client configuration for the webhook (either signer or direct match)

The placeholder cert is valid for 1 day (will pass check 1 for 1 day) but will fail checks 2 and 3 with this message (from kubectl get events):

 Error creating: Internal error occurred: failed calling webhook "mpod.gwc.k8s.aws": failed to call webhook: Post "https://webhook-service.aws-application-networking-system.svc:443/mutate-pod?timeout=10s": tls: failed to verify certificate: x509: certificate is not valid for any names, but wanted to match webhook-service.aws-application-networking-system.svc

Having a placeholder here is important because the controller's webhook server needs this value to start up successfully. The webhook server does not perform strict validation.

Once the secret is patched with a valid cert, Kubernetes will distribute the value and the webhook server will be notified when the files are updated on the volume mount. See controller-runtime/pkg/webhook/server.go for more details.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the webhook validation don't pass, does the controller still work?(i.e., for only the placeholder cert, can it still enter the (r *routeReconciler) Reconcile() function for a incomming HTTPRoute resource? )

@@ -0,0 +1,144 @@
# Pod readiness gate
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good amount of structure/content sourced from aws-load-balancer-controller

@erikfuller erikfuller marked this pull request as ready for review February 27, 2024 23:02
)

const (
PodReadinessGateConditionType = "aws-application-networking-k8s/pod-readiness-gate"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use application-networking.k8s.aws/pod-readiness-gate, using the group name of controller resources?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really have a preference, other than it being a self-descriptive value so people can know what it is when they see it on their pods. Happy to change this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool, I think it is better to change

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

.gitignore Outdated

# generated during make-deploy
tls.crt
tls.key
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is better to remove them at the end of Makefile

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. WIll change this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added rm to makefile, removed from .gitignore

When you specify multiple selectors, pods matching all the conditions will get mutated.

## Disabling the readiness gate inject
You can specify the controller flag `--enable-pod-readiness-gate-inject=false` during controller startup to disable the controller from modifying the pod spec.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this feature implemented? having hard time finding this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. It is not. This is something in the load balancer controller. Will scrub through this file again now that we have the other PR open too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

)

const (
PodReadinessGateConditionType = "aws-application-networking-k8s/pod-readiness-gate"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool, I think it is better to change

secret:
defaultMode: 420
secretName: webhook-cert
---
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the webhook validation don't pass, does the controller still work?(i.e., for only the placeholder cert, can it still enter the (r *routeReconciler) Reconcile() function for a incomming HTTPRoute resource? )

CertName: "tls.crt",
KeyName: "tls.key",
})
}
Copy link
Contributor

@zijun726911 zijun726911 Mar 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

~~Where this dns name come from webhook-service.{namespace}.svc ? Does k8s defined that or we define that? do we need to pass it when start the webhookServer? ~~

oh, nvm, I saw this part:

kind: Service
metadata:
  name: webhook-service
  namespace: aws-application-networking-system
spec:
  ports:
    - port: 443
      targetPort: 9443
  selector:
    control-plane: gateway-api-controller

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants