
Kubernetes backend rewrite #206

Merged: jcrist merged 42 commits into master from kube-support on Mar 12, 2020

Conversation

@jcrist (Member) commented Feb 18, 2020

This implements (or will implement) the design proposed in #198. The kubernetes backend is broken into two components:

  • A Backend subclass, responsible for creating and querying DaskCluster objects
  • A controller (written in go) that processes and manages these DaskCluster objects

The controller is responsible for the following:

  • When a new DaskCluster object is created, it creates a backing secret containing temporary credentials
  • After the secret is created, it creates a scheduler pod, following the pod template in the DaskCluster object. The secret is mounted as a volume named dask-credentials in this pod.
  • When the scheduler pod is created, additional ingress routes should be added (not specified in the CRD yet).
  • Upon scale-up events, instances of the worker pod are created based on the worker pod template. Like the scheduler pod, the worker pods must mount the secret as a volume named dask-credentials.
  • The controller should restart any pod that exits with a non-zero exit code. Additionally, it should ensure that the number of worker pods is never less than cluster.spec.worker.replicas.
  • On a stop-cluster event (triggered by a patch of the CRD, this field isn't implemented yet), all pods should be stopped, and the ingress routes and secret should be deleted.
  • On a deletion event, all child objects (secret, ingress routes, pods) should also be deleted. This could be done with ownerReferences, so nothing for the controller to do.
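
For orientation, here is a rough sketch of what a DaskCluster object following the flow above might look like, written as a Python dict. The field names and image are illustrative placeholders only, not the actual CRD schema defined in this PR:

```python
import json

# Illustrative only -- an approximate shape of a DaskCluster object, not the real CRD schema.
dask_cluster = {
    "apiVersion": "gateway.dask.org/v1alpha1",
    "kind": "DaskCluster",
    "metadata": {"name": "example-cluster", "namespace": "dask-gateway"},
    "spec": {
        "scheduler": {
            # Pod template the controller instantiates for the scheduler pod.
            # The controller mounts the generated secret into this pod as a
            # volume named "dask-credentials".
            "template": {
                "spec": {"containers": [{"name": "dask-scheduler", "image": "example.com/dask-gateway:latest"}]}
            },
        },
        "worker": {
            # The controller keeps at least this many worker pods running.
            "replicas": 3,
            "template": {
                "spec": {"containers": [{"name": "dask-worker", "image": "example.com/dask-gateway:latest"}]}
            },
        },
    },
}

print(json.dumps(dask_cluster, indent=2))
```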

Fixes #198.

(originally opened as #205, closed and reopened from a branch hosted on the dask upstream repo to make it easier for other contributors; GitHub doesn't let you change the source branch on an open PR)

@droctothorpe (Contributor):

@jcrist What flags did you pass when you ran operator-sdk add api?

@jcrist (Member, Author) commented Feb 18, 2020

@jcrist What flags did you pass when you ran operator-sdk add api?

operator-sdk add api --api-version=gateway.dask.org/v1alpha1 --kind=DaskCluster

@droctothorpe (Contributor):

There are random things missing that should be there when you run operator-sdk new, like the build directory and the Dockerfile in it. The lack of that Dockerfile, in particular, throws off downstream commands like controller generation and operator-sdk build. There's also a weird bug with the generated CRD where it winds up being thousands of lines long and therefore cannot be deployed.

Did you update any files other than daskcluster_types.go? If not, it might make sense to regenerate the scaffolding and pull that data in.

@jcrist (Member, Author) commented Feb 18, 2020

Ah! Good catch. The .gitignore file in the repository root was ignoring the build directory for the kubernetes operator, fixed now.

There's also a weird bug with the generated crd where it winds up being thousands of lines long and therefore cannot be deployed.

Yeah, that's what the build_tools/reformat_crds.py script is for. It reformats the CRD files to make them shorter (better for git diffs) and smaller in bytes (by removing the description fields) so that kubectl apply works. Files deployed with kubectl apply are limited to 256 KiB, since their contents are also written to an annotation (to provide create-or-update semantics), and annotations have a 256 KiB limit. This limit doesn't apply to kubectl create. Either way, the reformat script fixes things.
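
For illustration, here is a minimal sketch of what such a reformat step could do: recursively strip the verbose description fields from the generated CRD YAML so kubectl apply stays under the annotation size limit. This is not the actual build_tools/reformat_crds.py, just the idea, and it assumes PyYAML is available:

```python
import sys
import yaml  # PyYAML


def strip_descriptions(node):
    """Recursively drop `description` keys from an OpenAPI schema tree."""
    if isinstance(node, dict):
        node.pop("description", None)
        for value in node.values():
            strip_descriptions(value)
    elif isinstance(node, list):
        for item in node:
            strip_descriptions(item)


if __name__ == "__main__":
    path = sys.argv[1]
    with open(path) as f:
        crd = yaml.safe_load(f)
    strip_descriptions(crd)
    with open(path, "w") as f:
        yaml.safe_dump(crd, f, default_flow_style=False)
```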

@droctothorpe (Contributor):

@jcrist Looks like the verbose CRDs were a result of passing corev1.PodTemplate as the type for Scheduler and Worker in the CRD struct.

When I switched it to a type of string, the resulting CRD was only 71 lines long.

The controller will deserialize the strings using json.Unmarshal from the standard library.

Do you want one large, fully-functional PR/POC, or would you prefer smaller PRs along the way?

@jcrist (Member, Author) commented Feb 18, 2020

Looks like the verbose CRDs were a result of passing corev1.PodTemplate as the type for Scheduler and Worker in the CRD struct.
When I switched it to a type of string, the resulting CRD was only 71 lines long.
The controller will deserialize the strings using json.Unmarshal from the standard library.

Yes, this is correct. But the cleaned-up CRD files work fine and shouldn't cause any problems. Why would you want to serialize the pod specs as strings instead of the full objects? Keeping the typed objects means things still deserialize properly, and we get schema validation on object creation rather than having to handle that ourselves in the controller.

Further, things other than the controller have to work with these objects - the gateway api server watches them too. Having the templates directly as objects rather than strings that need further work to interpret makes this nicer.

Why would you define them as strings over the typed PodTemplate objects?
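
To illustrate the point (field names are approximate, not the actual schema): with the templates stored as structured objects, the Python api server can use them directly from the watched object, whereas string-encoded templates force an extra decode-and-validate step on every consumer:

```python
import json

# Template stored as a structured object -- usable as-is by anything watching the CR:
cluster = {"spec": {"scheduler": {"template": {"spec": {"containers": []}}}}}
template = cluster["spec"]["scheduler"]["template"]

# Template stored as a string -- every consumer has to deserialize (and validate) it itself:
cluster_as_str = {"spec": {"scheduler": {"template": '{"spec": {"containers": []}}'}}}
template = json.loads(cluster_as_str["spec"]["scheduler"]["template"])
```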

Do you want one large, fully-functional PR/POC, or would you prefer smaller PRs along the way?

I would prefer if you formed a WIP branch and made a PR against this branch. This allows us to discuss code as you work on it, and makes it easier to incrementally merge things upstream when we find good merge points.

@jcrist mentioned this pull request on Feb 18, 2020
@droctothorpe (Contributor):

Copy that. It's really helpful to hear the rationale behind the decision. Thanks for the clarification!

@droctothorpe (Contributor):

Any particular reason that we're using a pod template as opposed to a corev1.pod?

@jcrist (Member, Author) commented Feb 19, 2020

Pod templates are commonly used for other objects that create many instances of a pod from a template (jobs, deployments, statefulsets, etc...). A pod is an instantiation of a pod template.

jcrist and others added 8 commits February 20, 2020 15:18
We also add a script to reformat the generated CRD yaml. Without this
reformatting, the yaml file is ~12,000 lines!!
Doesn't do much for now, but basic class outline is here.
Add kubernetes requirements
Just testing the functional python code for now.
The gitignore in the root directory was masking this.
- Reflect k8s state for DaskCluster and Secret objects, updating
internal cluster cache to follow the state of these objects
- Implement `start_cluster`, `get_cluster` and `list_clusters`
@jcrist (Member, Author) commented Feb 21, 2020

I have run into a problem with two solutions, and I'm not sure which is better.

To connect to a cluster, a user needs the temporary credentials we generate for it (TLS key and cert). These need to be stored in kubernetes, and also accessed by the gateway api server. Because we want the gateway to be able to manage clusters in multiple namespaces, this presents a bit of a problem.

For a pod to mount a secret, the secret must be in the pod's own namespace. Thus, if we only store the TLS credentials in a secret, the api server needs a cluster role to read (list, get, watch) secrets in all namespaces. That's a huge level of access, and not something a security-minded admin would like.

One alternative is to store the credentials inside the DaskCluster custom resource object. This means that any user with permission to read DaskCluster objects in the namespace the gateway is deployed in can access anyone's cluster. However, if you have that access you're probably already an admin anyway. Since the credentials aren't stored in standard objects (pods, configmaps, ...), it seems unlikely that other users would be given access to read these objects. This lets us give the api server only read/modify permissions for DaskCluster objects in its own namespace, which simplifies our security footprint across the cluster as well as our code (we only need to watch one resource in one namespace).

Is this a bad idea? Keeping a secret in something common like a configmap is a bad idea since lots of users need access to configmaps. But storing a secret in a custom resource object that is only needed for the application (and other users are unlikely to have access) seems less flawed. AFAICT secrets are just like any other object in kubernetes, except that it's a convention to store secret info in them.

cc @yuvipanda, @jacobtomlinson for thoughts.

@jcrist (Member, Author) commented Feb 21, 2020

A third option would be to create the secrets in the same namespace as the DaskCluster objects, and have the controller mirror them into the namespace the cluster eventually runs in. This is more complicated than storing them in the DaskCluster objects, but keeps secret things stored in secrets. It also allows for admins who can query/modify DaskCluster objects but can't connect to them, since the permissions are separated (they could be given daskclusters permissions but not secrets).
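
A rough sketch of what the mirroring in this third option could look like, using the official kubernetes Python client purely for illustration (the controller here uses its own k8s client, so treat these calls as an assumption about one possible implementation, not the actual code):

```python
from kubernetes import client, config


def mirror_secret(name: str, source_ns: str, target_ns: str) -> None:
    """Copy a TLS credentials secret into the namespace the cluster runs in."""
    config.load_kube_config()
    core = client.CoreV1Api()

    original = core.read_namespaced_secret(name, source_ns)
    mirrored = client.V1Secret(
        metadata=client.V1ObjectMeta(
            name=original.metadata.name,
            namespace=target_ns,
            labels=original.metadata.labels,
        ),
        type=original.type,
        data=original.data,  # already base64-encoded key/cert material
    )
    core.create_namespaced_secret(target_ns, mirrored)
```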

I guess I'm looking for feedback from someone who knows more about what kubernetes admins might expect. Storing everything in the DaskCluster object makes our code the simplest, so if that's fine then I'd prefer that.

@droctothorpe (Contributor):

I lean towards storing some credentials in the CR. I don't think that pattern is unheard of and I'm not sure there's a need to create separate service accounts for access to the CR vs the TLS key and cert. Take my input with a grain of salt though. I need to do some more digging. I'll also sync up with our SRE team to see if they have any objections.

@jcrist (Member, Author) commented Feb 21, 2020

I'll also sync up with our SRE team to see if they have any objections.

That would be great, those are the kind of people I want to hear from.

@droctothorpe (Contributor):

Fourth option: create another CRD (when you're a hammer, everything looks like a nail) that shares the same namespace as the DaskCluster CRD and simply stores the TLS key and cert. The DaskCluster CRD would contain a reference to the secondary CRD. That way, in theory, we would have more fine-grained access control. That being said, DG would need a service account that grants access to both CRDs anyway, and this adds substantial complexity for a small return that we may not even need. I'm a big fan of first make it work, then make it better. The path of least resistance, for now, is to just store it in the CRD.

@droctothorpe (Contributor):

For a pod to mount a secret, the secret must be in the pod's own namespace. Thus, if we only store the TLS credentials in a secret, the api server needs a cluster role to read (list, get, watch) secrets in all namespaces. That's a huge level of access, and not something a security-minded admin would like.

@jcrist which pods need access to the TLS secrets? If it's only DG, DG will be deployed to a predetermined namespace, not a dynamically provisioned one, in which case, we can store all of the secrets in that one namespace.

@jcrist (Member, Author) commented Feb 21, 2020

If it's only DG, DG will be deployed to a predetermined namespace, not a dynamically provisioned one, in which case, we can store all of the secrets in that one namespace.

No, the scheduler and worker pods will also need to mount their corresponding secret as a volume, and they'll be potentially running in other namespaces.

@droctothorpe (Contributor):

Thanks for clarifying!

A third option would be to create the secrets in the same namespace as the DaskCluster objects, and have the controller mirror them to namespace the cluster eventually runs in. This is more complicated than storing them in the DaskCluster objects, but keeps secret things stored in secrets.

In that case, I think that this approach makes a lot of sense. There would be no need to watch the secrets in the dynamic namespaces, so no overhead, other than creation. We can make the scheduler the parent of the duplicated secret. That way, as soon as it's deleted, the duplicated secret will be deleted as well.

@droctothorpe (Contributor) commented Feb 21, 2020

Scratch that, cross-namespace owner references are prohibited.

Note: Cross-namespace owner references are disallowed by design. This means: 1) Namespace-scoped dependents can only specify owners in the same namespace, and owners that are cluster-scoped. 2) Cluster-scoped dependents can only specify cluster-scoped owners, but not namespace-scoped owners.
https://kubernetes.io/docs/concepts/workloads/controllers/garbage-collection/#owners-and-dependents

This complicates things. I have to dig deeper.

@droctothorpe (Contributor):

If we make the CRD cluster-scoped, namespace-scoped operands can be owned by the CRs.

https://github.com/operator-framework/operator-sdk/blob/master/doc/operator-scope.md#crd-scope

@droctothorpe (Contributor):

I'm OOO tomorrow but really look forward to deploying and testing on Wednesday.

- Batch creation of pods to reduce load on api server, and catch
expected failures (e.g. due to resource-quotas) earlier.
- Requeue with backoff in the presence of pod operation failure
- Add a client-side rate limiter for k8s pod operations.

This makes things work nicely in the presence of resource-quotas. If a
cluster fails to start a requested pod, it will requeue itself after a
backoff period. These errors are caught early to minimize failed api
calls.
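
A minimal sketch of the requeue-with-backoff pattern described in these commits; the class and parameter names are hypothetical rather than the controller's actual API, and the real code also rate-limits the underlying k8s calls:

```python
import asyncio
import random


class Backoff:
    """Exponential backoff with jitter, tracked per cluster name."""

    def __init__(self, base=0.1, max_delay=30.0):
        self.base = base
        self.max_delay = max_delay
        self.failures = {}

    def delay(self, key):
        n = self.failures.get(key, 0)
        self.failures[key] = n + 1
        return min(self.max_delay, self.base * 2 ** n) * random.uniform(0.5, 1.0)

    def reset(self, key):
        self.failures.pop(key, None)


async def reconcile_with_requeue(name, reconcile, backoff, queue):
    """Run one reconcile pass; on failure, requeue the cluster after a backoff delay."""
    try:
        await reconcile(name)
    except Exception:
        await asyncio.sleep(backoff.delay(name))
        await queue.put(name)  # the consumer loop will pick this cluster up again
    else:
        backoff.reset(name)
```
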
@jcrist (Member, Author) commented Mar 10, 2020

Todo list:

  • Tests
  • Clean up logging of operations; we're not logging everything that's important, and the logging levels and verbosity aren't consistent everywhere.
  • Add stop_time timestamp to model (we're already setting start_time).
  • Delete stopped/failed clusters after a timeout. Currently these accumulate and must be deleted manually or with a cronjob.
  • Expose backoff and rate limiting knobs in the configuration options. These are currently hardcoded.

Used to fill in `stop_time` in the cluster models returned by the api
server.
Cluster records (the DaskCluster objects in k8s) are persisted for a
while even after the cluster is stopped. We now add configuration for
deleting these objects periodically after a set period of time (defaults
to 24 hours).
@gforsyth (Contributor):

Hey @jcrist -- this is looking great! We just deployed this to try it out. We needed to tweak a few small things in the helm chart for enterprise-y stuff that we can push up later, but I don't think that's a blocker for public consumption.

We can hit the gateway, log in, and list_clusters just fine. create_cluster shows a pod being created (and it is running), but the call to create_cluster on the client side never returns and the controller is showing this error:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/dask_gateway_server/backends/kubernetes/controller.py", line 437, in reconciler_loop
    requeue = await self.reconcile_cluster(name)
  File "/opt/conda/lib/python3.7/site-packages/dask_gateway_server/backends/kubernetes/controller.py", line 457, in reconcile_cluster
    status_update, requeue = await self.handle_cluster(cluster)
  File "/opt/conda/lib/python3.7/site-packages/dask_gateway_server/backends/kubernetes/controller.py", line 487, in handle_cluster
    return await self.handle_pending_cluster(cluster)
  File "/opt/conda/lib/python3.7/site-packages/dask_gateway_server/backends/kubernetes/controller.py", line 506, in handle_pending_cluster
    cluster
  File "/opt/conda/lib/python3.7/site-packages/dask_gateway_server/backends/kubernetes/controller.py", line 790, in create_scheduler_pod_if_not_exists
    return pod["metadata"]["name"], pod
TypeError: 'NoneType' object is not subscriptable

@jcrist (Member, Author) commented Mar 11, 2020

shows a pod being created (and it is running) but the call to create_cluster on the client side never returns and the controller is showing this error:

Ah, I can see the logic bug here, but I'm curious why it's occurring on your side and not on mine. Can you post the full controller logs?

@jcrist (Member, Author) commented Mar 11, 2020

Should be fixed now.

Simplifies shutdown of queue consumers.
- Add rate limiting to all k8s calls.
- Expose rate limiting parameters in the configuration
- Expose backoff parameters in the configuration
Move more things to `INFO` level, more uniform/informative messages.
@gforsyth (Contributor) commented Mar 11, 2020

Ah, I can see the logic bug here, but I'm curious why it's occurring on your side and not on mine. Can you post the full controller logs?

Sorry, changes were pushed up to some CRDs and I think our issue was that some CRDs were missing permissions (a lot of this, I believe, is due to security restrictions on our end about how verbs are listed, etc.).

The bug is fixed -- I can deploy the previous version a little later and try to re-trigger the traceback.

On shutdown there are many events from child objects as cascading
deletes progress, which leads to many unnecessary reconciliation calls.
We now check if the cluster is shutting down before enqueueing it for
reconciliation, which reduces these calls on shutdown.
@jcrist (Member, Author) commented Mar 11, 2020

Sorry, changes were pushed up to some CRDs and I think our issue was that some CRDs were missing permissions (a lot of this, I believe, is due to security restrictions on our end about how verbs are listed, etc.)

Did you need to change the service account permissions for the controller/api server? The permissions created by the helm chart should state the required permissions explicitly - if they aren't sufficient then this is a bug in our helm chart. I'd be interested in seeing what you needed to do to get things working.

@droctothorpe (Contributor) commented Mar 11, 2020

One notable change was this in the controller RBAC:

rules:
  - apiGroups: ["gateway.dask.org"]
    resources: ["daskclusters", "daskclusters/status"]
    verbs: ["*"]

daskclusters and daskclusters/status have discrete permissions.

@jcrist (Member, Author) commented Mar 11, 2020

Hmmm, I didn't know that */status subresources were separate resources. That should have caused things to fail on any system with RBAC enabled, but it looks like docker-desktop automatically promotes all service accounts to admin by default (see docker/for-mac#3694 (comment)) - deleting their clusterrolebinding causes things to fail properly locally as well.

Most users will probably want to serve over a single port (both HTTP and
TCP traffic) rather than splitting the scheduler traffic out to a
separate port. We now enable that by default.

The client is also updated to infer the proper port from the scheme if
the port is not provided (e.g. use port 443 if given
``https://foo.com``).

Also fixes rbac permissioning for the controller.
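
The port inference mentioned above amounts to something like the following (assumed logic for illustration, not the client's actual code):

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def infer_port(address: str) -> int:
    """Return the explicit port, or the default port for the URL's scheme."""
    parts = urlsplit(address)
    if parts.port is not None:
        return parts.port
    return DEFAULT_PORTS[parts.scheme]


assert infer_port("https://foo.com") == 443
assert infer_port("http://foo.com:8000") == 8000
```
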
@jcrist (Member, Author) commented Mar 11, 2020

I've fixed the RBAC issue. I also updated the helm chart to serve over a single port by default, as this will likely be the more common configuration.

@jcrist (Member, Author) commented Mar 11, 2020

This is getting pretty close to ready to merge. I'm going to try to expand the unit tests tomorrow, and then hit the merge button. I think getting larger integration tests set up will be a subsequent PR - I've at least verified things work locally, and this PR is pretty big already.

@jcrist changed the title from "WIP: Kubernetes backend rewrite" to "Kubernetes backend rewrite" on Mar 11, 2020
@droctothorpe (Contributor):

You may find the Operator SDK's documentation on E2E tests to be a helpful reference: https://github.com/operator-framework/operator-sdk/blob/master/doc/test-framework/writing-e2e-tests.md

@droctothorpe (Contributor):

Do other backends rely on the Golang proxy that you wrote?

@jcrist (Member, Author) commented Mar 12, 2020

Yes. All other backends use our bundled proxy; the kubernetes backend is the only one using traefik.

@jcrist (Member, Author) commented Mar 12, 2020

Ok, I'm going to merge this.

Since the design has changed a few times during this PR (and some of the above comments are out-of-date), here is an up-to-date summary of what was implemented here:

  • A DaskCluster CRD, storing a spec and status for each dask cluster
  • A Backend implementation for kubernetes, which stores all state in DaskCluster objects
  • A controller, implemented in Python, which watches for dask cluster objects and processes them accordingly.
  • A helm chart for deploying the gateway api server, controller, and traefik proxy together, configured appropriately
  • Updated docker images that build from source rather than from PyPI, for easy deployment when developing from source
  • A skaffold setup that can optionally run the controller out-of-cluster (by specifying -p local-controller) for easier development

There are currently some unit tests, mostly checking that the created k8s objects have the appropriate fields and that configuration is forwarded appropriately. I've poked at things locally and everything seems ok. Integration tests will be added in a subsequent PR.
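
For context, a typical client-side session against a deployed gateway looks roughly like this (the address is a placeholder):

```python
from dask_gateway import Gateway

gateway = Gateway("https://gateway.example.com")  # placeholder gateway address
gateway.list_clusters()

cluster = gateway.new_cluster()   # creates a DaskCluster object behind the scenes
cluster.scale(2)                  # the controller creates worker pods to match
client = cluster.get_client()     # connect a dask client using the TLS credentials
```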

@jcrist merged commit b656eaf into master on Mar 12, 2020
@jcrist deleted the kube-support branch on March 12, 2020