
[BUG] flytescheduler not starting with istio authorization policy #2562

Closed
bstadlbauer opened this issue May 30, 2022 · 8 comments · Fixed by #2576
Labels
bug Something isn't working untriaged This issues has not yet been looked at by the Maintainers

Comments

@bstadlbauer
Member

Describe the bug

When using an Istio AuthorizationPolicy in conjunction with the flyte-core Helm chart, the flytescheduler pod won't start: the flytescheduler-check init container tries to reach flyteadmin before the Istio sidecar proxy is injected, so it fails with an RBAC: access denied error.

Looking at options, I've found this conversation, as well as istio/istio#11130, from which I gathered that there may be no workaround at the moment.

A possible solution suggested there is to move all init logic into a regular (non-init) container, since the Istio proxy should be ready by the time that container starts. Do you have any thoughts on this, or another possible workaround? We are not currently blocked, as we aren't using the scheduler right now.
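For context, a minimal sketch of the kind of Istio AuthorizationPolicy setup that triggers this (the policy name, namespace, and service account are hypothetical; the actual policy will differ per cluster):

```yaml
# Hypothetical example: an ALLOW policy in the flyte namespace that only
# admits traffic from known workload identities. Traffic from an init
# container is not yet routed through the sidecar proxy, so it carries no
# identity and is rejected with "RBAC: access denied".
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: allow-flyte-internal   # illustrative name
  namespace: flyte             # illustrative namespace
spec:
  action: ALLOW
  rules:
    - from:
        - source:
            principals:
              - cluster.local/ns/flyte/sa/flytescheduler  # illustrative SA
```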

Expected behavior

The flytescheduler pod coming up.

Additional context to reproduce

No response

Screenshots

No response

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
@bstadlbauer bstadlbauer added bug Something isn't working untriaged This issues has not yet been looked at by the Maintainers labels May 30, 2022
@welcome

welcome bot commented May 30, 2022

Thank you for opening your first issue here! 🛠

@pmahindrakar-oss
Contributor

Hi @bstadlbauer, doesn't the Kubernetes backoff policy kick in even in this case and eventually bring up the scheduler container?

@kumare3
Contributor

kumare3 commented Jun 1, 2022

@pmahindrakar-oss / @bstadlbauer what if we just remove the init container? I think the scheduler will keep trying, right?

@bstadlbauer
Member Author

@pmahindrakar-oss Not a K8s expert, but from what I can see at the moment, the flytescheduler still hasn't come up (and the pod has been around for ~50 h now).

@kumare3 That would be nice! What is the purpose of the flytescheduler precheck? Can it be safely removed? If so, I'm happy to open a PR here :-)

@pmahindrakar-oss
Contributor

pmahindrakar-oss commented Jun 2, 2022

@bstadlbauer @kumare3 We use the precheck as a kind of readiness check: the main container is not started until the init-container precheck completes. It validates that flyteadmin is in SERVING status.

This indirectly also validates that the DB is up, which the scheduler needs in order to read the schedules and send activated schedules to admin.

If we remove this precheck:

  • The container will keep crashing until the DB is up and it can read the schedules. Kubernetes' restart policy would kick in and restart it with backoff on each failure.
  • Schedules would fail to fire until admin is up. We do have retries in the application logic to handle this, but the precheck acts as a cushion.

We could probably add a flag to skip the precheck, which you could set from your values file. Would that work? I want to avoid removing the check entirely, as it might start sending alerts on schedule failures via the FailedExecutionCounter metric.

For quick verification, you can try removing the init container and check whether the scheduler pod eventually comes up.
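The proposed flag might look something like the following in a values-file override for the flyte-core chart (the flag name runPrecheck is purely illustrative here; check the chart for the actual option name):

```yaml
# Hypothetical values.yaml override for the flyte-core Helm chart.
# "runPrecheck" is an assumed name, not necessarily the real flag.
flytescheduler:
  runPrecheck: false   # skip the flytescheduler-check init container
```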

@agates4

agates4 commented Jun 2, 2022

Init containers are run in the order they appear in the pod spec:
https://kubernetes.io/docs/concepts/workloads/pods/init-containers/#detailed-behavior

Hashicorp Vault works around this by modifying the YAML so that the Vault init container runs before all other init containers:
https://github.com/hashicorp/vault-k8s/blob/35b63e4720e29ed701f70a901ecd421f5c5bc26f/agent-inject/agent/agent.go#L571-L589

There could be an annotation within Flyte to control ordering in the same way:
https://www.vaultproject.io/docs/platform/k8s/injector/annotations#vault-hashicorp-com-agent-init-first
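For illustration, Vault exposes this as a pod annotation (documented at the link above); a Flyte equivalent would presumably look analogous:

```yaml
# Vault's documented annotation, which forces its init container to run
# before all other init containers in the pod:
metadata:
  annotations:
    vault.hashicorp.com/agent-init-first: "true"
```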

@agates4

agates4 commented Jun 2, 2022

Within Istio's mesh config there is a holdApplicationUntilProxyStarts value that forces the Istio sidecar to be injected and initialized before the application containers start:
https://istio.io/latest/docs/reference/config/istio.mesh.v1alpha1/ (search for holdApplicationUntilProxyStarts)

@bstadlbauer We can test this by modifying the labs service mesh with this value
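For reference, a sketch of the mesh-wide form of that setting as it would appear in the Istio mesh config (it can also be set per pod via the proxy.istio.io/config annotation):

```yaml
# Mesh-wide setting: inject the sidecar's proxy container first and block
# application container startup until the proxy is ready.
meshConfig:
  defaultConfig:
    holdApplicationUntilProxyStarts: true
```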

@bstadlbauer
Member Author

@agates4 I've tested setting the Istio config, but it seems this doesn't help (I think that's because it doesn't affect init containers, similar to what's discussed here).

I've opened #2576, which adds an option to disable the precheck init container
