Skip to content

Helm chart: update to traefik v2.5.x#431

Merged
jcrist merged 3 commits intodask:mainfrom
consideRatio:pr/update-traefik
Sep 14, 2021
Merged

Helm chart: update to traefik v2.5.x#431
jcrist merged 3 commits intodask:mainfrom
consideRatio:pr/update-traefik

Conversation

@consideRatio
Copy link
Copy Markdown
Collaborator

@consideRatio consideRatio commented Aug 26, 2021

I bumped Traefik to the latest version and unpinned the patch version of the image to v2.5 which I find reasonable at least until we have more regular releases.

To make traefik start and function properly, I had to also install a few CRDs and grant Traefik permission to inspect those CRDs. Specifically, I've added the following CRDs and associciated read permissions (get, list, watch).

  • IngressRouteUDP
  • MiddlewaresTCP
  • ServersTransports
  • TLSStores

@consideRatio consideRatio force-pushed the pr/update-traefik branch 2 times, most recently from 1ea7987 to e1746d4 Compare August 26, 2021 21:22
@consideRatio
Copy link
Copy Markdown
Collaborator Author

Notes from debugging version bump failure

I figured I'd do some general maintenance and bump traefik. It turns out that between v2.1.9 and v2.2.0 there is a breaking change related to k8s RBAC permissions. We still mount the same kind of service account with its associated permissions, but traefik reacts differently and concludes it now lack permissions.

I think what's happening is that Traefik starts to access previously not known k8s resources and/or tries to access them with across all namespaces instead of just the local or similar. Perhaps because of a change in a flag or config etc.

Here is the changelog: https://github.com/traefik/traefik/blob/master/CHANGELOG.md#v220-2020-03-25

This can be relevant for example:

[k8s,k8s/crd] Add TLSStores to Kubernetes CRD (#6270 by dtomcej)
[k8s,k8s/crd] Add namespace attribute on IngressRouteTCP service (#6085 by jbdoumenjou)

These are the errors as reported by the traefik pod.

Failed to list *v1alpha1.IngressRouteUDP: ingressrouteudps.traefik.containo.us is forbidden: User "system:serviceaccount:default:traefik-test-dask-gateway" cannot list resource "ingressrouteudps" in API group "traefik.containo.us" at the cluster scope
Failed to list *v1alpha1.TLSStore: tlsstores.traefik.containo.us is forbidden: User "system:serviceaccount:default:traefik-test-dask-gateway" cannot list resource "tlsstores" in API group "traefik.containo.us" at the cluster scope

Fixing this, we got the following similar errors to be fixed in the same way - by adding permissions to read such resources in our ClusterRole that is coupled to a ServiceAccount via a ClusterRoleBinding.

Failed to watch *v1alpha1.ServersTransport: failed to list *v1alpha1.ServersTransport: serverstransports.traefik.containo.us is forbidden: User "system:serviceaccount:default:traefik-test-dask-gateway" cannot list resource "serverstransports" in API group "traefik.containo.us" at the cluster scope
Failed to watch *v1alpha1.MiddlewareTCP: failed to list *v1alpha1.MiddlewareTCP: middlewaretcps.traefik.containo.us is forbidden: User "system:serviceaccount:default:traefik-test-dask-gateway" cannot list resource "middlewaretcps" in API group "traefik.containo.us" at the cluster scope

Fixing that, we got the errors about those resources not being known, so we had to install the associated CRDs as well.

@consideRatio consideRatio mentioned this pull request Sep 1, 2021
@TomAugspurger
Copy link
Copy Markdown
Member

I can't really comment on the changes here, but I'm happy to test things out (either on PRs if necessary, or prior to a release)

@consideRatio
Copy link
Copy Markdown
Collaborator Author

@jcrist I just force pushed this with a rebase on master, fixing a trivial merge conflict in values.yaml

@jcrist
Copy link
Copy Markdown
Member

jcrist commented Sep 14, 2021

Have you tested this locally at all? It generally looks good, but our k8s tests may not be sufficient to test everything.

@consideRatio
Copy link
Copy Markdown
Collaborator Author

Not more than the traefik instance and so actually starts running properly with these changes and version 2.5. I've not created dask-gateway clusters via k8s etc, so if that is missing from tests then it could be relevant to check manually as well which I haven't.

@jcrist
Copy link
Copy Markdown
Member

jcrist commented Sep 14, 2021

No, we do create clusters as part of the test. I'm mainly wondering if the traefik logs are filling up with errors or warnings or anything - sometimes traefik will complain about issues but still keep working. I'll need to do a full QA before release anyway to clean up some things, so I'll check this then. Fine to merge as is. Thanks!

@jcrist jcrist merged commit b86e306 into dask:main Sep 14, 2021
@consideRatio
Copy link
Copy Markdown
Collaborator Author

I'm mainly wondering if the traefik logs are filling up with errors or warnings or anything - sometimes traefik will complain about issues but still keep working.

Ah, the logs didn't fill up with errors at least - I had also monitored the logs as I had to debug lots of errors before getting things functional to a running state.


Thanks for review and merge!!

@chrisroat chrisroat mentioned this pull request Dec 13, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants