
Make Sidecar injection resilient in case of Kubernetes API interruptions / admissions webhook failures etc #4171

Closed · Fixed by #4764
berndverst opened this issue Jan 27, 2022 · 19 comments
Labels: area/k8s-operator, kind/enhancement, P1, size/S (1 week of work), triaged/resolved (indicates that this issue has been triaged)
Milestone: v1.7

Comments

@berndverst (Member) commented Jan 27, 2022

In what area(s)?

/area injector
/area sidecar-injector

Describe the feature

The Dapr Sidecar Injector is responsible for creating Dapr sidecars and injecting the Dapr trust bundle secret from the dapr-system namespace. Internally it uses the Kubernetes MutatingAdmissionWebhook feature.

On occasion the admission request to the webhook registered for the Dapr Sidecar Injector (see `kubectl get MutatingWebhookConfiguration/dapr-sidecar-injector -oyaml`) may never arrive or may fail. The Dapr Helm charts are currently configured so that the deployment proceeds anyway in that case. As a result, new application deployments carrying Dapr annotations, which should have sidecars injected, end up missing their sidecars.
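
For reference, the webhook registration looks roughly like the sketch below. This is an abridged, illustrative reconstruction rather than actual cluster output (field values are representative); the `failurePolicy: Ignore` default discussed later in this thread is what lets a pod through without a sidecar when the webhook call fails:

```yaml
# Abridged, illustrative sketch of what
#   kubectl get MutatingWebhookConfiguration/dapr-sidecar-injector -oyaml
# returns (values representative, not copied from a live cluster).
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: dapr-sidecar-injector
webhooks:
  - name: sidecar-injector.dapr.io
    clientConfig:
      service:
        namespace: dapr-system
        name: dapr-sidecar-injector
        path: /mutate
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["pods"]
    # Ignore tells the API server to admit the pod even when the webhook
    # call fails or times out -- i.e. the pod starts without a sidecar.
    failurePolicy: Ignore
```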

The Dapr control plane should periodically (or upon Kubernetes API failure) determine whether any existing deployments need sidecars injected and whether new components must be loaded. The Dapr control plane also needs to gracefully resume watching resource deployments once the Kubernetes API becomes available again.

Release Note

RELEASE NOTE:

@berndverst (Member, Author)

/area operator

@artursouza added this to the v1.7 milestone Jan 27, 2022
@artursouza added the P1 and size/S (1 week of work) labels Jan 28, 2022
@artursouza added this to Backlog in Dapr Roadmap via automation Jan 28, 2022
@artursouza added the triaged/resolved (indicates that this issue has been triaged) label Jan 28, 2022
@berndverst (Member, Author)

This issue may also come in handy in case an existing deployment is modified to add dapr annotations when it previously didn't have any. I'm unclear on the behavior today (I haven't looked into this further).

@yaron2 (Member) commented Jan 29, 2022

> This issue may also come in handy in case an existing deployment is modified to add dapr annotations when it previously didn't have any. I'm unclear on the behavior today (I haven't looked into this further).

An existing deployment with added annotations will get the sidecars injected.

@artursouza (Member)

> > This issue may also come in handy in case an existing deployment is modified to add dapr annotations when it previously didn't have any. I'm unclear on the behavior today (I haven't looked into this further).
>
> An existing deployment with added annotations will get the sidecars injected.

We have identified other scenarios where the sidecar is not injected. We need to create some sort of background process to identify those instances and proactively fix them.

@MattCosturos

Will this issue be fixed in 1.8?
We had an outage, and our system did not recover due to this issue.

@yaron2 (Member) commented Jun 7, 2022

> Will this issue be fixed in 1.8? We had an outage, and our system did not recover due to this issue.

@artursouza to provide info here as triager.

@berndverst (Member, Author) commented Jun 7, 2022

@yaron2 I've just corrected the inaccuracies in my understanding in the issue description. @ItalyPaleAle is currently investigating.

One option that may alleviate this problem is to set the AdmissionWebhook failure policy to Fail. We'll be evaluating that approach.

@berndverst changed the title from "Make Sidecar injection resilient in case of Kubernetes API interruptions" to "Make Sidecar injection resilient in case of Kubernetes API interruptions / admissions webhook failures etc" Jun 7, 2022
@ItalyPaleAle (Contributor)

We are seeing this issue in the E2E tests too, randomly, and I've been investigating this for a few days.

With the logs we have so far, it looks like the injector service(s) do not receive the webhook call at all. I am waiting for #4719 to be merged so I can access the audit logs and see exactly what happens on the K8s side (our current logs only report what's happening on the injector side). My suspicion is that the webhook invocations fail due to a network timeout and never hit the injector services.

There are two things that I think we need to do to address this issue:

  1. Confirm why the webhook isn't receiving requests (or if that's even true), and whether this is something we can fix in Dapr.
  2. Consider additional mitigations such as what this issue is proposing, although I'm not a big fan of polling as a principle.

As @berndverst just mentioned, this morning we learned about the failurePolicy configuration for the AdmissionWebhook.

In Dapr, that can be set using the dapr_sidecar_injector.webhookFailurePolicy property for our Helm charts. The default value is Ignore:

| Parameter | Description | Default |
|-----------|-------------|---------|
| `dapr_sidecar_injector.webhookFailurePolicy` | Failure policy for the sidecar injector | `Ignore` |

I need to test this, but it sounds like Ignore causes K8s to deploy the app without a sidecar if the webhook fails. Perhaps changing the value to Fail (which I presume makes the deployment fail) could be a better approach.
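
For anyone who wants to experiment before any default changes, the policy can be overridden at install/upgrade time. A sketch using the chart property from the table above (the release name `dapr` and the `dapr-system` namespace are illustrative and depend on how Dapr was installed):

```sh
# Override the injector webhook failure policy for an existing Dapr install.
# The chart property name comes from the Helm chart table above; release and
# namespace names are assumptions about a typical installation.
helm upgrade dapr dapr/dapr \
  --namespace dapr-system \
  --set dapr_sidecar_injector.webhookFailurePolicy=Fail
```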

@ItalyPaleAle (Contributor)

I proposed changing the default value to Fail in #4721. This can be done in parallel to the investigation into figuring out why the injector isn't getting messages.

@gunniwho (Contributor) commented Jun 7, 2022

> I proposed changing the default value to Fail in #4721. This can be done in parallel to the investigation into figuring out why the injector isn't getting messages.

I don't think it's a good idea to make Fail the default option as it can lead to a very bad situation: if you lose all instances of the sidecar injector then you won't be able to schedule any pod, including the sidecar injector itself.

I believe it is natural that the sidecar injector periodically checks for active pods that have the dapr-enabled annotation but no dapr sidecar and, in those cases, simply deletes the pod. The underlying ReplicaSet would then recreate the pod and the sidecar would get injected.
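
A minimal manual version of that sweep, assuming the `dapr.io/enabled` annotation and the `daprd` container name that the injector adds (jq is used here purely for illustration):

```sh
# Sketch of the proposed check, done by hand: find pods that carry the
# dapr.io/enabled annotation but have no "daprd" container, then delete
# them so the owning ReplicaSet recreates them and the webhook gets
# another chance to inject the sidecar.
kubectl get pods --all-namespaces -o json \
  | jq -r '.items[]
      | select(.metadata.annotations["dapr.io/enabled"] == "true")
      | select(([.spec.containers[].name] | index("daprd")) | not)
      | "\(.metadata.namespace) \(.metadata.name)"' \
  | while read -r ns pod; do
      kubectl delete pod --namespace "$ns" "$pod"
    done
```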

@ItalyPaleAle (Contributor)

@gunniwho that makes sense. Looks like there's no way to restrict the webhook to pods with a certain annotation, so what you're saying is a real risk.

@gunniwho (Contributor) commented Jun 7, 2022

> @gunniwho that makes sense. Looks like there's no way to restrict the webhook to pods with a certain annotation, so what you're saying is a real risk.

@ItalyPaleAle we learned it the hard way the other day (albeit with another sidecar injector) 😅

@MattCosturos

> I believe it is natural that the sidecar injector periodically checks for active pods that have the dapr-enabled annotation but no dapr sidecar and, in those cases, simply deletes the pod. The underlying ReplicaSet would then recreate the pod and the sidecar would get injected.

A 'watch-dog'-like behavior like this would have solved our issue. I still don't know what the root cause was; all I know is that around midnight several pods died: 3 of our custom pods, along with the dapr-sidecar-injector.
The pods were restarted, but my 3 pods restarted BEFORE the dapr-sidecar-injector. The injector then started up and began listening for new pods, but it didn't see any new ones since my pods had already been created.
I had to manually delete my pods, and the recreated ones then got the sidecar injected correctly.

It sounds like the intermittent issue with the injector may not be the only problem. There could simply be a race condition where pods created just before the sidecar injector will never appear to the sidecar injector. (I could be wrong, since I don't know the inner workings of the webhook call the injector service relies on.)

@ItalyPaleAle (Contributor)

@MattCosturos that sounds like a possible issue as well, when all injectors die.

In the case of the E2E tests, the injectors are running, but the webhook doesn't get delivered.

Perhaps a watch-dog is what we need.

@gunniwho (Contributor) commented Jun 7, 2022

We see this happen frequently in our dev environment. We update the AMI of our nodes every week (if there is an updated AMI at that point), resulting in the entire node pool getting recycled a few nodes at a time (this does not honor PDBs). This usually results in this exact issue, i.e. the injector comes up after some of the application pods, forcing us to manually delete the affected pods.

@MattCosturos

> We see this happen frequently in our dev environment. We update the AMI of our nodes every week (if there is an updated AMI at that point), resulting in the entire node pool getting recycled a few nodes at a time (this does not honor PDBs). This usually results in this exact issue, i.e. the injector comes up after some of the application pods, forcing us to manually delete the affected pods.

This is problematic for running Dapr in a production environment. Are there any docs/guides on using Dapr in a production environment where downtime would best be avoided?
Obviously I can get alerts when my application dies, but in the middle of the night no one is around to manually delete the pods to work around this sidecar injector issue.

@ItalyPaleAle (Contributor)

I'm going to look into this issue, and we'll try to get a fix into 1.8.

@gunniwho (Contributor) commented Jun 8, 2022

I think the underlying issue behind this problem (which surely isn't specific to Dapr at all) is the design of mutating admission webhooks in Kubernetes. The idea of calling an API at the time a pod is to be scheduled doesn't really rhyme well with the desired-state-reconciliation philosophy of Kubernetes, if you think about it. There should be a reconciliation loop calling the registered webhooks periodically to check whether the state of the pod is the desired state. This could probably just be the ReplicaSet controller.

Anyway, I digress and I might be wrong. It's just a thought that occurred to me 🙂

@ItalyPaleAle (Contributor)

This should be fixed in master now: there's a watchdog service, disabled by default, that can be enabled via some Helm options (https://github.com/dapr/dapr/tree/master/charts/dapr#dapr-operator-options). If you want to test it, it can be found in the nightly builds!
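
As an illustration, enabling it might look like the values fragment below. The option names here are assumptions recalled from the chart README linked above, not confirmed in this thread, so treat that README as authoritative:

```yaml
# Hypothetical values.yaml fragment enabling the operator watchdog.
# Both option names below are assumptions; verify them against the
# chart README linked above before relying on this.
dapr_operator:
  watchInterval: "5m"           # assumed: scan interval for pods missing a sidecar ("0" = disabled)
  maxPodRestartsPerMinute: 20   # assumed: rate limit on watchdog-triggered pod deletions
```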
