-
Notifications
You must be signed in to change notification settings - Fork 71
Inject pod readiness gate #606
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| KeyName: "tls.key", | ||
| }) | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When using make run, DevMode will be non-empty, so locally running the controller will skip this webhook configuration. Otherwise, the controller will only start if the webhook cert files are present (at /etc/webhook-cert/tls.crt and tls.key).
| @@ -1,26 +1,10 @@ | |||
| # Adds namespace to all resources. | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Much of this commented code came from the aws-load-balancer-controller. I've tried to remove what we don't need. One notable difference is that the gateway controller enables the webhook configuration by default, even though you still have to opt into it for any given namespace.
| secret: | ||
| defaultMode: 420 | ||
| secretName: webhook-cert | ||
| --- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This placeholder certificate is generated during make build-deploy:
When calling a webhook, the API server performs the following validations:
- Cert validity window has not passed
- SAN matches DNS (e.g.
webhook-service.aws-application-networking-system.svc) - Cert is signed by a known authority OR matches the
cafield on the client configuration for the webhook (either signer or direct match)
The placeholder cert is valid for 1 day (will pass check 1 for 1 day) but will fail checks 2 and 3 with this message (from kubectl get events):
Error creating: Internal error occurred: failed calling webhook "mpod.gwc.k8s.aws": failed to call webhook: Post "https://webhook-service.aws-application-networking-system.svc:443/mutate-pod?timeout=10s": tls: failed to verify certificate: x509: certificate is not valid for any names, but wanted to match webhook-service.aws-application-networking-system.svc
Having a placeholder here is important because the controller's webhook server needs this value to start up successfully. The webhook server does not perform strict validation.
Once the secret is patched with a valid cert, Kubernetes will distribute the value and the webhook server will be notified when the files are updated on the volume mount. See controller-runtime/pkg/webhook/server.go for more details.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the webhook validation don't pass, does the controller still work?(i.e., for only the placeholder cert, can it still enter the (r *routeReconciler) Reconcile() function for a incomming HTTPRoute resource? )
| @@ -0,0 +1,144 @@ | |||
| # Pod readiness gate | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good amount of structure/content sourced from aws-load-balancer-controller
| ) | ||
|
|
||
| const ( | ||
| PodReadinessGateConditionType = "aws-application-networking-k8s/pod-readiness-gate" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use application-networking.k8s.aws/pod-readiness-gate, using the group name of controller resources?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really have a preference, other than it being a self-descriptive value so people can know what it is when they see it on their pods. Happy to change this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cool, I think it is better to change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
.gitignore
Outdated
|
|
||
| # generated during make-deploy | ||
| tls.crt | ||
| tls.key |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is better to remove them at the end of Makefile
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea. WIll change this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added rm to makefile, removed from .gitignore
docs/guides/pod-readiness-gates.md
Outdated
| When you specify multiple selectors, pods matching all the conditions will get mutated. | ||
|
|
||
| ## Disabling the readiness gate inject | ||
| You can specify the controller flag `--enable-pod-readiness-gate-inject=false` during controller startup to disable the controller from modifying the pod spec. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this feature implemented? having hard time finding this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. It is not. This is something in the load balancer controller. Will scrub through this file again now that we have the other PR open too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
| ) | ||
|
|
||
| const ( | ||
| PodReadinessGateConditionType = "aws-application-networking-k8s/pod-readiness-gate" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cool, I think it is better to change
| secret: | ||
| defaultMode: 420 | ||
| secretName: webhook-cert | ||
| --- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the webhook validation don't pass, does the controller still work?(i.e., for only the placeholder cert, can it still enter the (r *routeReconciler) Reconcile() function for a incomming HTTPRoute resource? )
| CertName: "tls.crt", | ||
| KeyName: "tls.key", | ||
| }) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
~~Where this dns name come from webhook-service.{namespace}.svc ? Does k8s defined that or we define that? do we need to pass it when start the webhookServer? ~~
oh, nvm, I saw this part:
kind: Service
metadata:
name: webhook-service
namespace: aws-application-networking-system
spec:
ports:
- port: 443
targetPort: 9443
selector:
control-plane: gateway-api-controller
What type of PR is this?
feature
Which issue does this PR fix:
#596
What does this PR do / Why do we need it:
This PR adds a pod mutating webhook to inject the controller's pod readiness gate.
One of the biggest changes is that webhooks must be network-reachable by the API server. Previously, the controller only required communication from itself to the API server. This means local testing against a remote cluster is very challenging. Most of my testing was done by pushing a custom ECR image for the controller and deploying it to my test cluster.
Webhook invocation is only over TLS, so the controlller must now present a TLS certificate to the API server. This certificate is validated by its DNS name(s) and must also pass trust validation checks. More details are in the
pod-readiness-gates.mdfile.For the TLS cert we require a standard kubernetes tls secret called
webhook-cert. It must be present in the same namespace as the controller itself. The certificate is mounted as read-only by the controller container at/etc/webhook-cert.By default, the webhook includes a namespace label selector, meaning the webhook will only be called for namespaces which include the label
aws-application-networking-k8s/pod-readiness-gate-injectwith valueenabled.Otherwise, this work is based off the webhook work in aws-load-balancer-controller and aws-app-mesh-controller-for-k8s. Note that, unlike the load balancer controller, the webhook configuration here is created by default. Note however, the webhook logic is only enabled by the namespace label described above, otherwise the webhook will not execute or inject the controller readiness gate.
What is left to be done?
These will follow in separate PRs:
Testing done on this change:
The general testing process:
Automation added to e2e:
Since webhooks must be invoked by the API server, they do not fit into the standard
make e2e-test, at least for local development and testing. Instead, I have created a new make targetmake webhook-e2e-test. These tests can be run in a local dev environment against a remote cluster once the controller is deployed and the webhook secret updated.Will this PR introduce any new dependencies?:
We now have a dependency on the presence of the webhook secret. Note that, if the webhook is not invoked
Will this break upgrades or downgrades. Has updating a running cluster been tested?:
I tested deploying with the previous build's
deploy.yaml, then applying the newdeploy.yamlwhich includes the webhook configuration, without issue.Does this PR introduce any user-facing change?:
Yes
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.