New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make Sidecar injection resilient in case of Kubernetes API interruptions / admissions webhook failures etc #4171
Comments
/area operator |
This issue may also come in handy in case an existing deployment is modified with dapr annotations when it previously didn't have dapr annotations. I'm unclear on the behavior today (haven't further looked into this). |
An existing deployment with added annotations will get the sidecars injected. |
We have identified other scenarios where the sidecar is not injected. We need to create some sort of background process to identify those instances and proactively fix it. |
Will this issue be fixed in 1.8? |
@artursouza to provide info here as triager. |
@yaron2 I just updated all my incorrect understanding in the issue. @ItalyPaleAle is currently investigating. One possible option that may alleviate this problem may be to set the AdmissionWebhook Failure Policy to |
We are seeing this issue in the E2E tests too, randomly, and I've been investigating this for a few days. With the logs we have so far, it looks like the injector service(s) do not receive the webhook call at all. I am waiting for #4719 to be merged so I can access the audit logs and see exactly what happens on the K8s side (our current logs only report what's happening on the injector side). My suspicion is that the webhook invocations fail due to a network timeout and never hit the injector services. There are two things that I think we need to do to address this issue:
As @berndverst just mentioned, this morning we learnt about the In Dapr, that can be set using the Line 165 in 6e3387f
I need to test this, but it sounds like |
I proposed changing the default value to |
I don't think it's a good idea to make I believe it is natural that the sidecar injector periodically checks for active pods that have the dapr-enabled annotation but no dapr sidecar and in those cases, simply deletes the pod. The underlying replicaset would then restart the pod and the sidecar get injected. |
@gunniwho that makes sense. Looks like there's no way to restrict to pods with a certain annotation, so what you're saying is a real risk. |
@ItalyPaleAle we learned it the hard way the other day (albeit with another sidecar injector) 😅 |
A 'watch-dog' like behavior like this would have solved our issue. I still don't know what the root cause was, all I know is around midnight several pods died, 3 of our custom pods, along with the dapr-sidecar-injector. It sounds like the intermittent issue with the injector may not be the only problem. There could simply be a race condition where pods created just before the sidecar injector will never appear to the sidecar injector. (I could be wrong since I don't know the inner workings of this web hook call the injector service relies on |
@MattCosturos that sounds like a possible issue as well, when all injectors die. In the case of E2E tests, the injectors are running, but the webhook dosn't get delivered. Perhaps a watch-dog is what we need. |
We see this happen frequently on our dev env. We update the AMI of our nodes every week (if there is an updated AMI at that point), resulting in the entire node pool getting recycled a few nodes at a time (this does not honor PDBs). This usually results in this exact issue, ie the injector comes up after some of the application pods making us have to manually delete the affected pods. |
This is problematic for running dapr in a production environment. Are there any docs/guides on using dapr in a production environment where downtime would best be avoided? |
I'm going to look into this issue and we'll try to get a fix into 1.8 |
I think the underlying issue for this problem (which surely isn't specific to dapr at all) is the mutating admission webooks design in kubernetes. The idea of calling an API at the time a pod is to be scheduled doesn't really rhyme well with the desired-state-reconciliation-philosophy of kubernetes if you think about it. There should be a reconciliation loop calling the registered webooks periodically to check if the state of the pod is the desired state. This could probably just be the replica set controller. Anyway, I digress and I might be wrong. It's just a thought that occurred to me 🙂 |
This should be fixed in master now and there's a watchdog service (disabled by default and can be enabled via some Helm options: https://github.com/dapr/dapr/tree/master/charts/dapr#dapr-operator-options ). If you want to test it, it can be found in the nightly builds! |
In what area(s)?
/area injector
/area sidecar-injector
Describe the feature
Dapr Sidecar Injector is responsible for handling the creation of Dapr sidecars and injecting the Dapr Trust Bundle secret from the Dapr-System namespace. Internally this approach uses the Kubernetes
MutatingAdmissionWebhook
feature.On occasion the MutatingAdmissionRequest to the webhook which is registered to the Dapr Sidecar injector (see
kubectl get MutatingWebhookConfiguration/dapr-sidecar-injector -oyaml
) may not occur or may fail. At this time Dapr Helm Charts are configured to resume with the deployment in that case. This means new application deployments with Dapr annotations that should have sidecars injected will have missing sidecars.The Dapr control plane should periodically (or upon Kubernetes API failure) determine whether any existing deployments need sidecars injected and whether new components must be loaded. The Dapr control plane also needs to gracefully resume watching of resource deployments when the Kubernetes API is available again.
Release Note
RELEASE NOTE:
The text was updated successfully, but these errors were encountered: