
Kubernetes backend rewrite #206

Merged: jcrist merged 42 commits into master from kube-support on Mar 12, 2020

Conversation

@jcrist (Member) commented Feb 18, 2020

This implements (or will implement) the design proposed in #198. The kubernetes backend is broken into two components:

  • A Backend subclass, responsible for creating and querying DaskCluster objects
  • A controller (written in go) that processes and manages these DaskCluster objects

The controller is responsible for the following:

  • When a new DaskCluster object is created, it creates a backing secret containing temporary credentials
  • After the secret is created, it creates a scheduler pod, following the pod template in the DaskCluster object. The secret is mounted as a volume named dask-credentials in this pod.
  • When the scheduler pod is created, additional ingress routes should be added (not specified in the CRD yet).
  • Upon scale-up events, instances of the worker pod are created based on the worker pod template. Like the scheduler pod, the worker pods must mount the secret as a volume named dask-credentials.
  • The controller should restart any pod that exits with a non-zero exit code. Additionally, it should ensure that the number of worker pods is never less than cluster.spec.worker.replicas.
  • On a stop-cluster event (triggered by a patch of the CRD, this field isn't implemented yet), all pods should be stopped, and the ingress routes and secret should be deleted.
  • On a deletion event, all child objects (secret, ingress routes, pods) should also be deleted. This could be done with ownerReferences, so nothing for the controller to do.
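
For orientation, here is a rough sketch of what a DaskCluster object following the flow above might look like, written as a Python dict. The field names and image are illustrative placeholders only, not the actual CRD schema defined in this PR:

```python
import json

# Illustrative only -- an approximate shape of a DaskCluster object, not the real CRD schema.
dask_cluster = {
    "apiVersion": "gateway.dask.org/v1alpha1",
    "kind": "DaskCluster",
    "metadata": {"name": "example-cluster", "namespace": "dask-gateway"},
    "spec": {
        "scheduler": {
            # Pod template the controller instantiates for the scheduler pod.
            # The controller mounts the generated secret into this pod as a
            # volume named "dask-credentials".
            "template": {
                "spec": {"containers": [{"name": "dask-scheduler", "image": "example.com/dask-gateway:latest"}]}
            },
        },
        "worker": {
            # The controller keeps at least this many worker pods running.
            "replicas": 3,
            "template": {
                "spec": {"containers": [{"name": "dask-worker", "image": "example.com/dask-gateway:latest"}]}
            },
        },
    },
}

print(json.dumps(dask_cluster, indent=2))
```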

Fixes #198.

(originally opened as #205, closed and reopened from a branch hosted on the dask upstream repo to make it easier for other contributors; GitHub doesn't let you change the source branch on an open PR)

@droctothorpe (Contributor):

@jcrist What flags did you pass when you ran operator-sdk add api?

@jcrist (Member, Author) commented Feb 18, 2020

@jcrist What flags did you pass when you ran operator-sdk add api?

operator-sdk add api --api-version=gateway.dask.org/v1alpha1 --kind=DaskCluster

@droctothorpe (Contributor):

There are random things missing that should be there when you run operator-sdk new, like the build directory and the Dockerfile in it. The lack of that Dockerfile, in particular, throws off downstream commands like controller generation and operator-sdk build. There's also a weird bug with the generated CRD where it winds up being thousands of lines long and therefore cannot be deployed.

Did you update any files other than daskcluster_types.go? If not, it might make sense to regenerate the scaffolding and pull that data in.

@jcrist (Member, Author) commented Feb 18, 2020

Ah! Good catch. The .gitignore file in the repository root was ignoring the build directory for the kubernetes operator, fixed now.

There's also a weird bug with the generated crd where it winds up being thousands of lines long and therefore cannot be deployed.

Yeah, that's what the build_tools/reformat_crds.py script is for. It reformats the CRD files to make them shorter (better for git diffs) and smaller in bytes (by removing the description fields) so that kubectl apply works. Files deployed with kubectl apply are limited to 256 KiB, since their contents are also written to an annotation (to provide create-or-update semantics), and annotations have a 256 KiB limit. This limit doesn't apply to kubectl create. Either way, the reformat script fixes things.
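
For illustration, here is a minimal sketch of what such a reformat step could do: recursively strip the verbose description fields from the generated CRD YAML so kubectl apply stays under the annotation size limit. This is not the actual build_tools/reformat_crds.py, just the idea, and it assumes PyYAML is available:

```python
import sys
import yaml  # PyYAML


def strip_descriptions(node):
    """Recursively drop `description` keys from an OpenAPI schema tree."""
    if isinstance(node, dict):
        node.pop("description", None)
        for value in node.values():
            strip_descriptions(value)
    elif isinstance(node, list):
        for item in node:
            strip_descriptions(item)


if __name__ == "__main__":
    path = sys.argv[1]
    with open(path) as f:
        crd = yaml.safe_load(f)
    strip_descriptions(crd)
    with open(path, "w") as f:
        yaml.safe_dump(crd, f, default_flow_style=False)
```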

@droctothorpe (Contributor):

@jcrist Looks like the verbose CRDs were a result of passing corev1.PodTemplate as the type for Scheduler and Worker in the CRD struct.

When I switched it to a type of string, the resulting CRD was only 71 lines long.

The controller will deserialize the strings using json.Unmarshal from the standard library.

Do you want one large, fully-functional PR/POC, or would you prefer smaller PRs along the way?

@jcrist (Member, Author) commented Feb 18, 2020

Looks like the verbose CRDs were a result of passing corev1.PodTemplate as the type for Scheduler and Worker in the CRD struct.
When I switched it to a type of string, the resulting CRD was only 71 lines long.
The controller will deserialize the strings using json.Unmarshal from the standard library.

Yes, this is correct. But the cleaned-up CRD files work fine and shouldn't cause any problems. Why would you want to serialize the pod specs as strings instead of the full objects? Keeping the typed objects means things still deserialize properly, and we get schema validation on object creation rather than having to handle that ourselves in the controller.

Further, things other than the controller have to work with these objects - the gateway api server watches them too. Having the templates directly as objects rather than strings that need further work to interpret makes this nicer.

Why would you define them as strings over the typed PodTemplate objects?
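
To illustrate the point (field names are approximate, not the actual schema): with the templates stored as structured objects, the Python api server can use them directly from the watched object, whereas string-encoded templates force an extra decode-and-validate step on every consumer:

```python
import json

# Template stored as a structured object -- usable as-is by anything watching the CR:
cluster = {"spec": {"scheduler": {"template": {"spec": {"containers": []}}}}}
template = cluster["spec"]["scheduler"]["template"]

# Template stored as a string -- every consumer has to deserialize (and validate) it itself:
cluster_as_str = {"spec": {"scheduler": {"template": '{"spec": {"containers": []}}'}}}
template = json.loads(cluster_as_str["spec"]["scheduler"]["template"])
```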

Do you want one large, fully-functional PR/POC, or would you prefer smaller PRs along the way?

I would prefer if you formed a WIP branch and made a PR against this branch. This allows us to discuss code as you work on it, and makes it easier to incrementally merge things upstream when we find good merge points.

@jcrist mentioned this pull request on Feb 18, 2020
@droctothorpe (Contributor):

Copy that. It's really helpful to hear the rationale behind the decision. Thanks for the clarification!

@droctothorpe (Contributor):

Any particular reason that we're using a pod template as opposed to a corev1.pod?

@jcrist (Member, Author) commented Feb 19, 2020

Pod templates are commonly used for other objects that create many instances of a pod from a template (jobs, deployments, statefulsets, etc...). A pod is an instantiation of a pod template.

jcrist and others added 8 commits February 20, 2020 15:18
We also add a script to reformat the generated CRD yaml. Without this
reformatting, the yaml file is ~12,000 lines!!
Doesn't do much for now, but basic class outline is here.
Add kubernetes requirements
Just testing the functional python code for now.
The gitignore in the root directory was masking this.
- Reflect k8s state for DaskCluster and Secret objects, updating
internal cluster cache to follow the state of these objects
- Implement `start_cluster`, `get_cluster` and `list_clusters`
@jcrist (Member, Author) commented Feb 21, 2020

I have run into a problem with two solutions, and I'm not sure which is better.

To connect to a cluster, a user needs the temporary credentials we generate for it (TLS key and cert). These need to be stored in kubernetes, and also accessed by the gateway api server. Because we want the gateway to be able to manage clusters in multiple namespaces, this presents a bit of a problem.

For a pod to mount a secret, the secret must be in the pod's own namespace. Thus, if we only store the TLS credentials in a secret, the api server needs a cluster role to read (list, get, watch) secrets in all namespaces. That's a huge level of access, and not something a security-minded admin would like.

One alternative is to store the credentials inside the DaskCluster custom resource object. This means that any user with permission to read DaskCluster objects in the namespace the gateway is deployed in can access anyone's cluster. However, if you have that access you're probably already an admin anyway. Since the credentials aren't stored in standard objects (pods, configmaps, ...), it seems unlikely that other users would be given access to read these objects. This lets us give the api server only read/modify permissions for DaskCluster objects in its own namespace, which simplifies our security footprint across the cluster as well as our code (we only need to watch one resource in one namespace).

Is this a bad idea? Keeping a secret in something common like a configmap is a bad idea since lots of users need access to configmaps. But storing a secret in a custom resource object that is only needed for the application (and other users are unlikely to have access) seems less flawed. AFAICT secrets are just like any other object in kubernetes, except that it's a convention to store secret info in them.

cc @yuvipanda, @jacobtomlinson for thoughts.

@jcrist (Member, Author) commented Feb 21, 2020

A third option would be to create the secrets in the same namespace as the DaskCluster objects, and have the controller mirror them into the namespace the cluster eventually runs in. This is more complicated than storing them in the DaskCluster objects, but keeps secret things stored in secrets. It also allows for admins who can query/modify DaskCluster objects but can't connect to them, since the permissions are separated (they could be given daskclusters permissions but not secrets).
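
A rough sketch of what the mirroring in this third option could look like, using the official kubernetes Python client purely for illustration (the controller here uses its own k8s client, so treat these calls as an assumption about one possible implementation, not the actual code):

```python
from kubernetes import client, config


def mirror_secret(name: str, source_ns: str, target_ns: str) -> None:
    """Copy a TLS credentials secret into the namespace the cluster runs in."""
    config.load_kube_config()
    core = client.CoreV1Api()

    original = core.read_namespaced_secret(name, source_ns)
    mirrored = client.V1Secret(
        metadata=client.V1ObjectMeta(
            name=original.metadata.name,
            namespace=target_ns,
            labels=original.metadata.labels,
        ),
        type=original.type,
        data=original.data,  # already base64-encoded key/cert material
    )
    core.create_namespaced_secret(target_ns, mirrored)
```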

I guess I'm looking for feedback from someone who knows more about what kubernetes admins might expect. Storing everything in the DaskCluster object makes our code the simplest, so if that's fine then I'd prefer that.

@droctothorpe (Contributor):

I lean towards storing some credentials in the CR. I don't think that pattern is unheard of and I'm not sure there's a need to create separate service accounts for access to the CR vs the TLS key and cert. Take my input with a grain of salt though. I need to do some more digging. I'll also sync up with our SRE team to see if they have any objections.

@jcrist (Member, Author) commented Feb 21, 2020

I'll also sync up with our SRE team to see if they have any objections.

That would be great, those are the kind of people I want to hear from.

@droctothorpe (Contributor):

Fourth option: create another CRD (when you're a hammer, everything looks like a nail) that shares the same namespace as the DaskCluster CRD and simply stores the TLS key and cert. The DaskCluster CRD would contain a reference to the secondary CRD. That way, in theory, we would have more fine-grained access control. That being said, DG would need a service account that grants access to both CRDs anyway, and this adds substantial complexity for a small return that we may not even need. I'm a big fan of first make it work, then make it better. The path of least resistance, for now, is to just store it in the CRD.

@droctothorpe (Contributor):

For a pod to mount a secret, the secret must be in the pod's own namespace. Thus, if we only store the TLS credentials in a secret, the api server needs a cluster role to read (list, get, watch) secrets in all namespaces. That's a huge level of access, and not something a security-minded admin would like.

@jcrist which pods need access to the TLS secrets? If it's only DG, DG will be deployed to a predetermined namespace, not a dynamically provisioned one, in which case, we can store all of the secrets in that one namespace.

@jcrist (Member, Author) commented Feb 21, 2020

If it's only DG, DG will be deployed to a predetermined namespace, not a dynamically provisioned one, in which case, we can store all of the secrets in that one namespace.

No, the scheduler and worker pods will also need to mount their corresponding secret as a volume, and they'll be potentially running in other namespaces.

@droctothorpe (Contributor):

Thanks for clarifying!

A third option would be to create the secrets in the same namespace as the DaskCluster objects, and have the controller mirror them to namespace the cluster eventually runs in. This is more complicated than storing them in the DaskCluster objects, but keeps secret things stored in secrets.

In that case, I think that this approach makes a lot of sense. There would be no need to watch the secrets in the dynamic namespaces, so no overhead, other than creation. We can make the scheduler the parent of the duplicated secret. That way, as soon as it's deleted, the duplicated secret will be deleted as well.

@droctothorpe (Contributor) commented Feb 21, 2020

Scratch that, cross-namespace owner references are prohibited.

Note: Cross-namespace owner references are disallowed by design. This means: 1) Namespace-scoped dependents can only specify owners in the same namespace, and owners that are cluster-scoped. 2) Cluster-scoped dependents can only specify cluster-scoped owners, but not namespace-scoped owners.
https://kubernetes.io/docs/concepts/workloads/controllers/garbage-collection/#owners-and-dependents

This complicates things. I have to dig deeper.

@droctothorpe (Contributor):

If we make the CRD cluster-scoped, namespace-scoped operands can be owned by the CRs.

https://github.com/operator-framework/operator-sdk/blob/master/doc/operator-scope.md#crd-scope

@droctothorpe (Contributor):

I'm OOO tomorrow but really look forward to deploying and testing on Wednesday.

- Batch creation of pods to reduce load on api server, and catch
expected failures (e.g. due to resource-quotas) earlier.
- Requeue with backoff in the presence of pod operation failure
- Add a client-side rate limiter for k8s pod operations.

This makes things work nicely in the presence of resource-quotas. If a
cluster fails to start a requested pod, it will requeue itself after a
backoff period. These errors are caught early to minimize failed api
calls.
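
A minimal sketch of the requeue-with-backoff pattern described in these commits; the class and parameter names are hypothetical rather than the controller's actual API, and the real code also rate-limits the underlying k8s calls:

```python
import asyncio
import random


class Backoff:
    """Exponential backoff with jitter, tracked per cluster name."""

    def __init__(self, base=0.1, max_delay=30.0):
        self.base = base
        self.max_delay = max_delay
        self.failures = {}

    def delay(self, key):
        n = self.failures.get(key, 0)
        self.failures[key] = n + 1
        return min(self.max_delay, self.base * 2 ** n) * random.uniform(0.5, 1.0)

    def reset(self, key):
        self.failures.pop(key, None)


async def reconcile_with_requeue(name, reconcile, backoff, queue):
    """Run one reconcile pass; on failure, requeue the cluster after a backoff delay."""
    try:
        await reconcile(name)
    except Exception:
        await asyncio.sleep(backoff.delay(name))
        await queue.put(name)  # the consumer loop will pick this cluster up again
    else:
        backoff.reset(name)
```
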
@jcrist (Member, Author) commented Mar 10, 2020

Todo list:

  • Tests
  • Clean up logging of operations; we're not logging everything that's important, and the logging levels and verbosity aren't consistent everywhere.
  • Add stop_time timestamp to model (we're already setting start_time).
  • Delete stopped/failed clusters after a timeout. Currently these accumulate and must be deleted manually or with a cronjob.
  • Expose backoff and rate limiting knobs in the configuration options. These are currently hardcoded.

Used to fill in `stop_time` in the cluster models returned by the api
server.
Cluster records (the DaskCluster objects in k8s) are persisted for a
while even after the cluster is stopped. We now add configuration for
deleting these objects periodically after a set period of time (defaults
to 24 hours).
@gforsyth (Contributor):

Hey @jcrist -- this is looking great! We just deployed this to try it out. We needed to tweak a few small things in the helm chart for enterprise-y stuff that we can push up later, but I don't think that's a blocker for public consumption.

We can hit the gateway, log in, and list_clusters just fine. create_cluster shows a pod being created (and it is running), but the call to create_cluster on the client side never returns and the controller is showing this error:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/dask_gateway_server/backends/kubernetes/controller.py", line 437, in reconciler_loop
    requeue = await self.reconcile_cluster(name)
  File "/opt/conda/lib/python3.7/site-packages/dask_gateway_server/backends/kubernetes/controller.py", line 457, in reconcile_cluster
    status_update, requeue = await self.handle_cluster(cluster)
  File "/opt/conda/lib/python3.7/site-packages/dask_gateway_server/backends/kubernetes/controller.py", line 487, in handle_cluster
    return await self.handle_pending_cluster(cluster)
  File "/opt/conda/lib/python3.7/site-packages/dask_gateway_server/backends/kubernetes/controller.py", line 506, in handle_pending_cluster
    cluster
  File "/opt/conda/lib/python3.7/site-packages/dask_gateway_server/backends/kubernetes/controller.py", line 790, in create_scheduler_pod_if_not_exists
    return pod["metadata"]["name"], pod
TypeError: 'NoneType' object is not subscriptable

@jcrist (Member, Author) commented Mar 11, 2020

shows a pod being created (and it is running) but the call to create_cluster on the client side never returns and the controller is showing this error:

Ah, I can see the logic bug here, but I'm curious why it's occurring on your side and not on mine. Can you post the full controller logs?

@jcrist (Member, Author) commented Mar 11, 2020

Should be fixed now.

Simplifies shutdown of queue consumers.
- Add rate limiting to all k8s calls.
- Expose rate limiting parameters in the configuration
- Expose backoff parameters in the configuration
Move more things to `INFO` level, more uniform/informative messages.
@gforsyth (Contributor) commented Mar 11, 2020

Ah, I can see the logic bug here, but I'm curious why it's occurring on your side and not on mine. Can you post the full controller logs?

Sorry, changes were pushed up to some CRDs and I think our issue was that some CRDs were missing permissions (a lot of this, I believe, is due to security restrictions on our end about how verbs are listed, etc.).

The bug is fixed -- I can deploy the previous version a little later and try to re-trigger the traceback.

On shutdown there are many events from child objects as cascading
deletes progress, which leads to many unnecessary reconciliation calls.
We now check if the cluster is shutting down before enqueueing it for
reconciliation, which reduces these calls on shutdown.
@jcrist (Member, Author) commented Mar 11, 2020

Sorry, changes were pushed up to some CRDs and I think our issue was that some CRDs were missing permissions (a lot of this, I believe, is due to security restrictions on our end about how verbs are listed, etc.)

Did you need to change the service account permissions for the controller/api server? The permissions created by the helm chart should state the required permissions explicitly - if they aren't sufficient then this is a bug in our helm chart. I'd be interested in seeing what you needed to do to get things working.

@droctothorpe (Contributor) commented Mar 11, 2020

One notable change was this in the controller RBAC:

rules:
  - apiGroups: ["gateway.dask.org"]
    resources: ["daskclusters", "daskclusters/status"]
    verbs: ["*"]

daskclusters and daskclusters/status have discrete permissions.

@jcrist (Member, Author) commented Mar 11, 2020

Hmmm, I didn't know that */status subresources were separate resources. That should have caused things to fail on any system with RBAC enabled, but it looks like docker-desktop automatically promotes all service accounts to admin by default (see docker/for-mac#3694 (comment)) - deleting their clusterrolebinding causes things to fail properly locally as well.

Most users will probably want to serve over a single port (both HTTP and
TCP traffic) rather than splitting the scheduler traffic out to a
separate port. We now enable that by default.

The client is also updated to infer the proper port from the scheme if
the port is not provided (e.g. use port 443 if given
``https://foo.com``).

Also fixes rbac permissioning for the controller.
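
The port inference mentioned above amounts to something like the following (assumed logic for illustration, not the client's actual code):

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def infer_port(address: str) -> int:
    """Return the explicit port, or the default port for the URL's scheme."""
    parts = urlsplit(address)
    if parts.port is not None:
        return parts.port
    return DEFAULT_PORTS[parts.scheme]


assert infer_port("https://foo.com") == 443
assert infer_port("http://foo.com:8000") == 8000
```
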
@jcrist (Member, Author) commented Mar 11, 2020

I've fixed the RBAC issue. I also updated the helm chart to serve over a single port by default, as this will likely be the more common configuration.

@jcrist (Member, Author) commented Mar 11, 2020

This is getting pretty close to ready to merge. I'm going to try to expand the unit tests tomorrow, and then hit the merge button. I think getting larger integration tests set up will be a subsequent PR - I've at least verified things work locally, and this PR is pretty big already.

@jcrist changed the title from "WIP: Kubernetes backend rewrite" to "Kubernetes backend rewrite" on Mar 11, 2020
@droctothorpe (Contributor):

You may find the Operator SDK's documentation on E2E tests to be a helpful reference: https://github.com/operator-framework/operator-sdk/blob/master/doc/test-framework/writing-e2e-tests.md

@droctothorpe (Contributor):

Do other backends rely on the Golang proxy that you wrote?

@jcrist (Member, Author) commented Mar 12, 2020

Yes. All other backends use our bundled proxy; the kubernetes backend is the only one using traefik.

@jcrist (Member, Author) commented Mar 12, 2020

Ok, I'm going to merge this.

Since the design has changed a few times during this PR (and some of the above comments are out-of-date), here is an up-to-date summary of what was implemented here:

  • A DaskCluster CRD, storing a spec and status for each dask cluster
  • A Backend implementation for kubernetes, which stores all state in DaskCluster objects
  • A controller, implemented in Python, which watches for dask cluster objects and processes them accordingly.
  • A helm chart for deploying the gateway api server, controller, and traefik proxy together, configured appropriately
  • Updated docker images that build from source rather than from PyPI, for easy deployment when developing from source
  • A skaffold setup that can optionally run the controller out-of-cluster (by specifying -p local-controller) for easier development

There are currently some unit tests, mostly checking that the created k8s objects have the appropriate fields and that configuration is forwarded appropriately. I've poked at things locally and everything seems ok. Integration tests will be added in a subsequent PR.
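
For context, a typical client-side session against a deployed gateway looks roughly like this (the address is a placeholder):

```python
from dask_gateway import Gateway

gateway = Gateway("https://gateway.example.com")  # placeholder gateway address
gateway.list_clusters()

cluster = gateway.new_cluster()   # creates a DaskCluster object behind the scenes
cluster.scale(2)                  # the controller creates worker pods to match
client = cluster.get_client()     # connect a dask client using the TLS credentials
```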

@jcrist merged commit b656eaf into master on Mar 12, 2020
@jcrist deleted the kube-support branch on March 12, 2020