
Make Sidecar injection resilient in case of Kubernetes API interruptions / admissions webhook failures etc #4171

Closed · Fixed by #4764
berndverst opened this issue Jan 27, 2022 · 19 comments
Labels: area/k8s-operator, kind/enhancement, P1, size/S (1 week of work), triaged/resolved (indicates that this issue has been triaged)
Milestone: v1.7

Comments

@berndverst (Member) commented Jan 27, 2022

In what area(s)?

/area injector
/area sidecar-injector

Describe the feature

The Dapr Sidecar Injector is responsible for creating Dapr sidecars and injecting the Dapr trust bundle secret from the dapr-system namespace. Internally it uses the Kubernetes MutatingAdmissionWebhook feature.

On occasion the admission request to the webhook registered for the Dapr Sidecar Injector (see `kubectl get MutatingWebhookConfiguration/dapr-sidecar-injector -oyaml`) may never arrive or may fail. The Dapr Helm charts are currently configured so that the deployment proceeds anyway in that case. As a result, new application deployments carrying Dapr annotations, which should have sidecars injected, end up missing their sidecars.
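
For reference, the webhook registration looks roughly like the sketch below. This is an abridged, illustrative reconstruction rather than actual cluster output (field values are representative); the `failurePolicy: Ignore` default discussed later in this thread is what lets a pod through without a sidecar when the webhook call fails:

```yaml
# Abridged, illustrative sketch of what
#   kubectl get MutatingWebhookConfiguration/dapr-sidecar-injector -oyaml
# returns (values representative, not copied from a live cluster).
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: dapr-sidecar-injector
webhooks:
  - name: sidecar-injector.dapr.io
    clientConfig:
      service:
        namespace: dapr-system
        name: dapr-sidecar-injector
        path: /mutate
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["pods"]
    # Ignore tells the API server to admit the pod even when the webhook
    # call fails or times out -- i.e. the pod starts without a sidecar.
    failurePolicy: Ignore
```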

The Dapr control plane should periodically (or upon Kubernetes API failure) determine whether any existing deployments need sidecars injected and whether new components must be loaded. The Dapr control plane also needs to gracefully resume watching resource deployments once the Kubernetes API becomes available again.

Release Note

RELEASE NOTE:

@berndverst (Member, Author)

/area operator

@artursouza added this to the v1.7 milestone Jan 27, 2022
@artursouza added the P1 and size/S (1 week of work) labels Jan 28, 2022
@artursouza added this to Backlog in Dapr Roadmap via automation Jan 28, 2022
@artursouza added the triaged/resolved (indicates that this issue has been triaged) label Jan 28, 2022
@berndverst (Member, Author)

This issue may also come in handy in case an existing deployment is modified to add dapr annotations when it previously didn't have any. I'm unclear on the behavior today (I haven't looked into this further).

@yaron2 (Member) commented Jan 29, 2022

> This issue may also come in handy in case an existing deployment is modified to add dapr annotations when it previously didn't have any. I'm unclear on the behavior today (I haven't looked into this further).

An existing deployment with added annotations will get the sidecars injected.

@artursouza (Member)

> > This issue may also come in handy in case an existing deployment is modified to add dapr annotations when it previously didn't have any. I'm unclear on the behavior today (I haven't looked into this further).
>
> An existing deployment with added annotations will get the sidecars injected.

We have identified other scenarios where the sidecar is not injected. We need to create some sort of background process to identify those instances and proactively fix them.

@MattCosturos

Will this issue be fixed in 1.8?
We had an outage, and our system did not recover due to this issue.

@yaron2 (Member) commented Jun 7, 2022

> Will this issue be fixed in 1.8? We had an outage, and our system did not recover due to this issue.

@artursouza to provide info here as triager.

@berndverst (Member, Author) commented Jun 7, 2022

@yaron2 I've just corrected the inaccuracies in my understanding in the issue description. @ItalyPaleAle is currently investigating.

One option that may alleviate this problem is to set the AdmissionWebhook failure policy to Fail. We'll be evaluating that approach.

@berndverst changed the title from "Make Sidecar injection resilient in case of Kubernetes API interruptions" to "Make Sidecar injection resilient in case of Kubernetes API interruptions / admissions webhook failures etc" Jun 7, 2022
@ItalyPaleAle (Contributor)

We are seeing this issue in the E2E tests too, randomly, and I've been investigating this for a few days.

With the logs we have so far, it looks like the injector service(s) do not receive the webhook call at all. I am waiting for #4719 to be merged so I can access the audit logs and see exactly what happens on the K8s side (our current logs only report what's happening on the injector side). My suspicion is that the webhook invocations fail due to a network timeout and never hit the injector services.

There are two things that I think we need to do to address this issue:

  1. Confirm why the webhook isn't receiving requests (or if that's even true), and whether this is something we can fix in Dapr.
  2. Consider additional mitigations such as what this issue is proposing, although I'm not a big fan of polling as a principle.

As @berndverst just mentioned, this morning we learned about the failurePolicy configuration for the AdmissionWebhook.

In Dapr, that can be set using the dapr_sidecar_injector.webhookFailurePolicy property for our Helm charts. The default value is Ignore:

| Parameter | Description | Default |
|-----------|-------------|---------|
| `dapr_sidecar_injector.webhookFailurePolicy` | Failure policy for the sidecar injector | `Ignore` |

I need to test this, but it sounds like Ignore causes K8s to deploy the app without a sidecar if the webhook fails. Perhaps changing the value to Fail (which I presume makes the deployment fail) could be a better approach.
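
For anyone who wants to experiment before any default changes, the policy can be overridden at install/upgrade time. A sketch using the chart property from the table above (the release name `dapr` and the `dapr-system` namespace are illustrative and depend on how Dapr was installed):

```sh
# Override the injector webhook failure policy for an existing Dapr install.
# The chart property name comes from the Helm chart table above; release and
# namespace names are assumptions about a typical installation.
helm upgrade dapr dapr/dapr \
  --namespace dapr-system \
  --set dapr_sidecar_injector.webhookFailurePolicy=Fail
```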

@ItalyPaleAle (Contributor)

I proposed changing the default value to Fail in #4721. This can be done in parallel to the investigation into figuring out why the injector isn't getting messages.

@gunniwho (Contributor) commented Jun 7, 2022

> I proposed changing the default value to Fail in #4721. This can be done in parallel to the investigation into figuring out why the injector isn't getting messages.

I don't think it's a good idea to make Fail the default option as it can lead to a very bad situation: if you lose all instances of the sidecar injector then you won't be able to schedule any pod, including the sidecar injector itself.

I believe it is natural that the sidecar injector periodically checks for active pods that have the dapr-enabled annotation but no dapr sidecar and, in those cases, simply deletes the pod. The underlying ReplicaSet would then recreate the pod and the sidecar would get injected.
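
A minimal manual version of that sweep, assuming the `dapr.io/enabled` annotation and the `daprd` container name that the injector adds (jq is used here purely for illustration):

```sh
# Sketch of the proposed check, done by hand: find pods that carry the
# dapr.io/enabled annotation but have no "daprd" container, then delete
# them so the owning ReplicaSet recreates them and the webhook gets
# another chance to inject the sidecar.
kubectl get pods --all-namespaces -o json \
  | jq -r '.items[]
      | select(.metadata.annotations["dapr.io/enabled"] == "true")
      | select(([.spec.containers[].name] | index("daprd")) | not)
      | "\(.metadata.namespace) \(.metadata.name)"' \
  | while read -r ns pod; do
      kubectl delete pod --namespace "$ns" "$pod"
    done
```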

@ItalyPaleAle (Contributor)

@gunniwho that makes sense. Looks like there's no way to restrict the webhook to pods with a certain annotation, so what you're saying is a real risk.

@gunniwho (Contributor) commented Jun 7, 2022

> @gunniwho that makes sense. Looks like there's no way to restrict the webhook to pods with a certain annotation, so what you're saying is a real risk.

@ItalyPaleAle we learned it the hard way the other day (albeit with another sidecar injector) 😅

@MattCosturos

> I believe it is natural that the sidecar injector periodically checks for active pods that have the dapr-enabled annotation but no dapr sidecar and, in those cases, simply deletes the pod. The underlying ReplicaSet would then recreate the pod and the sidecar would get injected.

A 'watch-dog'-like behavior like this would have solved our issue. I still don't know what the root cause was; all I know is that around midnight several pods died: 3 of our custom pods, along with the dapr-sidecar-injector.
The pods were restarted, but my 3 pods restarted BEFORE the dapr-sidecar-injector. The injector then started up and began listening for new pods, but it didn't see any new ones since my pods had already been created.
I had to manually delete my pods, and the recreated ones then got the sidecar injected correctly.

It sounds like the intermittent issue with the injector may not be the only problem. There could simply be a race condition where pods created just before the sidecar injector will never appear to the sidecar injector. (I could be wrong, since I don't know the inner workings of the webhook call the injector service relies on.)

@ItalyPaleAle (Contributor)

@MattCosturos that sounds like a possible issue as well, when all injectors die.

In the case of the E2E tests, the injectors are running, but the webhook doesn't get delivered.

Perhaps a watch-dog is what we need.

@gunniwho (Contributor) commented Jun 7, 2022

We see this happen frequently in our dev environment. We update the AMI of our nodes every week (if there is an updated AMI at that point), resulting in the entire node pool getting recycled a few nodes at a time (this does not honor PDBs). This usually results in this exact issue, i.e. the injector comes up after some of the application pods, forcing us to manually delete the affected pods.

@MattCosturos

> We see this happen frequently in our dev environment. We update the AMI of our nodes every week (if there is an updated AMI at that point), resulting in the entire node pool getting recycled a few nodes at a time (this does not honor PDBs). This usually results in this exact issue, i.e. the injector comes up after some of the application pods, forcing us to manually delete the affected pods.

This is problematic for running Dapr in a production environment. Are there any docs/guides on using Dapr in a production environment where downtime would best be avoided?
Obviously I can get alerts when my application dies, but in the middle of the night no one is around to manually delete the pods to work around this sidecar injector issue.

@ItalyPaleAle (Contributor)

I'm going to look into this issue, and we'll try to get a fix into 1.8.

@gunniwho (Contributor) commented Jun 8, 2022

I think the underlying issue behind this problem (which surely isn't specific to Dapr at all) is the design of mutating admission webhooks in Kubernetes. The idea of calling an API at the time a pod is to be scheduled doesn't really rhyme well with the desired-state-reconciliation philosophy of Kubernetes, if you think about it. There should be a reconciliation loop calling the registered webhooks periodically to check whether the state of the pod is the desired state. This could probably just be the ReplicaSet controller.

Anyway, I digress and I might be wrong. It's just a thought that occurred to me 🙂

@ItalyPaleAle (Contributor)

This should be fixed in master now: there's a watchdog service, disabled by default, that can be enabled via some Helm options (https://github.com/dapr/dapr/tree/master/charts/dapr#dapr-operator-options). If you want to test it, it can be found in the nightly builds!
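
As an illustration, enabling it might look like the values fragment below. The option names here are assumptions recalled from the chart README linked above, not confirmed in this thread, so treat that README as authoritative:

```yaml
# Hypothetical values.yaml fragment enabling the operator watchdog.
# Both option names below are assumptions; verify them against the
# chart README linked above before relying on this.
dapr_operator:
  watchInterval: "5m"           # assumed: scan interval for pods missing a sidecar ("0" = disabled)
  maxPodRestartsPerMinute: 20   # assumed: rate limit on watchdog-triggered pod deletions
```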
