RFC: Kubernetes Runtime #2

Open · wants to merge 5 commits into base: master

Conversation

@topherbullock
Member

topherbullock commented Apr 17, 2018

Proposal

Status: draft

There are still more details and thoughts to collect, specifically around how the new GC changes impact how a K8s runtime should clean up resources. I'm sure more questions will come up around using K8s effectively.

Please submit comments and feedback on individual lines of the proposal markdown, so that we can track open conversations.

topherbullock added some commits Apr 11, 2018

provides an interface the ATC can leverage to manage volumes similar to how
Baggageclaim Volumes are managed. Could a replica set place a worker on each K8s
Node and use `HostPath` Volumes or create Persistent Volumes to store Resource
Caches, Task Caches, and Image Resource Caches, etc.?
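
A minimal sketch of the per-Node worker idea, assuming a DaemonSet (rather than a ReplicaSet) and illustrative image, args, and cache paths; none of these values are part of the proposal:

```yaml
# Hypothetical sketch only: one Concourse worker per K8s Node, caching volumes on the host.
# The image, args, and paths are illustrative assumptions, not part of the RFC.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: concourse-worker
spec:
  selector:
    matchLabels:
      app: concourse-worker
  template:
    metadata:
      labels:
        app: concourse-worker
    spec:
      containers:
      - name: worker
        image: concourse/concourse   # assumed image
        args: ["worker"]             # assumed entrypoint/args
        volumeMounts:
        - name: work-dir
          mountPath: /worker-state   # where resource/task/image caches would live
      volumes:
      - name: work-dir
        hostPath:
          path: /var/lib/concourse-worker
          type: DirectoryOrCreate
```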

@edude03

edude03 Jun 28, 2018

Persistent Volume Claims are likely the right abstraction for this; however, one thing I don't understand about Concourse is how it manages volumes. Right now, if you run Concourse on Kubernetes and give the worker a Persistent Volume to put the btrfs image on, you often run into an issue where the state the ATC expects the worker to be in is not the same as the actual state.

It seems to me that this approach should work fine - when the worker restarts, the PV has the same volume image, and thus inputs, resources, etc. should still be there.

Does baggageclaim do something unusual with how it references its images that causes it to fail if the volume is unmounted and remounted?

@vito

vito Jul 4, 2018

Member

@edude03 I don't think I quite understand persistent volume claims enough yet to know what you mean by the state desync, but I can theorize. The gist of things is that the set of volumes a worker has is associated by name to the worker's registration in the ATC's database. If a worker comes back with a new name but is somehow pointing to a disk with a bunch of volumes already present, the ATC won't know they exist. Similarly, if a worker comes back under the same name but all of its volumes are gone, the ATC will get a bit confused and still try to use them.

As far as BaggageClaim is concerned, its API is just a direct reflection of the filesystem (a GET request to list volumes corresponds to a directory listing for example). It has no idea when its volumes are mounted/unmounted to/from containers, it's just preparing the paths on the host that we then bind-mount. Its sole purpose is to manage the volumes through direct API calls from the orchestrator. There's no state in its head and no automatic garbage collection (beyond volume TTLs, which aren't used by Concourse anymore but are still implemented in the API) or outbound interaction with the ATC to "re-sync".

As for how BaggageClaim fits into the K8s model, I wonder if it would be a good fit as a CSI? Maybe this way we could abstract over whether the volume has to be transferred over the network or a local copy-on-write can be constructed. (Disclaimer: I've only briefly looked at this...)
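
If baggageclaim were exposed through CSI, consumers might reference it via a StorageClass. A hedged sketch, with a made-up provisioner name and parameters:

```yaml
# Hypothetical: a StorageClass backed by a baggageclaim CSI driver.
# The provisioner name and parameters are made up for illustration.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: concourse-baggageclaim
provisioner: baggageclaim.concourse-ci.org   # assumed driver name
parameters:
  strategy: copy-on-write   # e.g. local COW vs. streaming the volume over the network
volumeBindingMode: WaitForFirstConsumer
```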

@edude03

edude03 Jul 5, 2018

@vito - I'm going from memory, so the situation might have changed; however, using the Kubernetes helm chart, the first problem is that the workers crash fairly frequently.

This leads to / exacerbates the second problem, which is that I get an error about not being able to find the volume, causing the worker to get stuck - it's running, but none of the jobs assigned to it actually run.

I've never understood this problem because the helm chart is implemented as a StatefulSet which means that the worker has a stable name (it will always be called concourse-worker-$n) and it uses a PersistentVolumeClaim so it will restart with the same "Hard drive" (I'm avoiding calling it a volume as that term is overloaded in this context) as the previous run.

With this setup, it should have the same guarantees as if run on a physical machine (host scheduling notwithstanding, of course), and yet workers break fairly often.

As for the statefulness of ATC/BaggageClaim - it seems to me that one solution would be to make volume requests an upsert operation, i.e. create the volume if it doesn't exist, OR to remove the cleverness of trying to schedule jobs on machines where the volume already exists. Of course, figuring out why volumes can't be found on Kubernetes is probably an even better solution.

As for a CSI - it's either a great idea or not the right thing at all, though I'm not sure I understand baggageclaim enough to say.

A CSI can run as a sidecar in a pod, listening to the Kubernetes API and responding to requests for storage (Add/Update/Delete operations).

This sounds like what baggageclaim is for, although I imagine a Pod would need a PersistentVolumeClaim mounted so that baggageclaim has somewhere to put its own volumes?

In the K8S model, would baggageclaim map one "Volume" to one "hard drive" or would it get one "hard drive" and create a btrfs image and put all the volumes in there?
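
For reference, a trimmed-down sketch of the StatefulSet shape described above - stable Pod names plus a per-replica PersistentVolumeClaim that survives restarts - in the single-claim variant of the question just asked. Image, args, paths, and sizes are illustrative assumptions, not the actual helm chart:

```yaml
# Hypothetical: stable worker identity (concourse-worker-0, -1, ...) with a per-replica
# PVC that survives Pod restarts. In this single-claim variant, baggageclaim would keep
# all of its volumes on the one mounted "hard drive" at /concourse-work-dir.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: concourse-worker
spec:
  serviceName: concourse-worker
  replicas: 2
  selector:
    matchLabels:
      app: concourse-worker
  template:
    metadata:
      labels:
        app: concourse-worker
    spec:
      containers:
      - name: worker
        image: concourse/concourse   # assumed image
        args: ["worker"]             # assumed entrypoint/args
        volumeMounts:
        - name: concourse-work-dir
          mountPath: /concourse-work-dir
  volumeClaimTemplates:
  - metadata:
      name: concourse-work-dir
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 50Gi
```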

create a container, and K8s will cache these images. In order to support this
as the preferred way to define container images, we will need to find a viable
solution which saves the exported contents from `image_resource` to the K8s
registry.

@edude03

edude03 Jun 28, 2018

Two options come to mind - hosting an object storage server on the cluster such as Minio and using the S3 API, or using a shared persistent volume. The latter is likely difficult in non-cloud environments, however.
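
A sketch of the shared-volume option, assuming a placeholder storage class; `ReadWriteMany` is only available with provisioners that support it:

```yaml
# Hypothetical shared claim for exported image_resource contents.
# ReadWriteMany needs a backing store that supports it (e.g. an NFS-style provisioner).
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: concourse-image-cache
spec:
  accessModes: ["ReadWriteMany"]
  storageClassName: shared-nfs   # placeholder
  resources:
    requests:
      storage: 100Gi
```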

Could a CustomResourceDefinition (CRD) be used to represent the Containers
created for Concourse Tasks and Resources? This would allow a user or operator
to easily recognize and differentiate Concourse Containers and their
corresponding workloads from other workloads on the K8s Cluster.
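
A hedged sketch of what such a CRD might look like; the group, kind, and fields below are illustrative, not part of the proposal:

```yaml
# Hypothetical CRD so Concourse-created workloads are first-class, queryable objects.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: concoursecontainers.concourse-ci.org
spec:
  group: concourse-ci.org
  scope: Namespaced
  names:
    kind: ConcourseContainer
    plural: concoursecontainers
    singular: concoursecontainer
  versions:
  - name: v1alpha1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            properties:
              pipeline: {type: string}
              job:      {type: string}
              step:     {type: string}   # task / get / put / check
```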

@edude03

edude03 Jun 28, 2018

A CRD is a great way to manage workers on the cluster IMO 👍

Tagging of Concourse Jobs for specific workers might need to change to
accommodate K8s Pod `nodeSelector`s which allow users to select specific K8s
Nodes to schedule the necessary Workloads on. Affinity and anti-affinity also

@edude03

edude03 Jun 28, 2018

IMO, neither affinity nor anti-affinity should be exposed to the user.

@ornous

ornous Jun 29, 2018

Hey edude03. Why do you believe that?

@edude03

edude03 Jul 5, 2018

I think that it makes the mental model of workload scheduling more confusing. At the very least, it shouldn't be part of an MVP.

@topherbullock

topherbullock Jul 11, 2018

Member

@edude03 @ornous
Tags already allow Concourse users to define some aspects of how steps are scheduled across the set of workers, but those are all exposed by workers themselves.

It would be confusing for users to need to reason about what nodes their resources are scheduled on, but maybe the CRD for workers could tie the thread from worker tags -> node.
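
One hypothetical way to tie that thread, with made-up label names: the worker CRD (or an operator) translates a Concourse step tag into a node label, and the scheduler does the rest via `nodeSelector`:

```yaml
# Hypothetical: a step tagged [gpu] ends up as a Pod constrained to matching Nodes.
# The label key and value are assumptions for illustration.
apiVersion: v1
kind: Pod
metadata:
  name: concourse-task-example
spec:
  nodeSelector:
    concourse-ci.org/tag: gpu   # derived from the step's `tags: [gpu]`
  containers:
  - name: task
    image: busybox
    command: ["sh", "-c", "echo running tagged task"]
```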

@pivotal-nader-ziada

pivotal-nader-ziada commented Jul 9, 2018

One approach I investigated for Volume management is the Local Persistent Volumes feature. Each worker node would have a local persistent volume and a persistent volume claim, which can then be mounted in the Pod executing a given task. PersistentVolume nodeAffinity enables the Kubernetes scheduler to schedule Pods using local volumes to the correct node.
On this local persistent volume, folders represent each required volume and the subPath is mounted in the Pod that needs access to the volume. Here is a note about the subpath details and fixes to a recent vulnerability.
https://kubernetes.io/blog/2018/04/04/fixing-subpath-volume-vulnerability/

An example of creating a local persistent volume is in the following yaml files:
https://gist.github.com/pivotal-nader-ziada/15d430bb16397c672e337f2e9275164c

This solution requires significant changes to how Concourse currently manages volumes, and at the same time does not use a Kubernetes feature as-is.
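
For readers skimming the gist, the core of the approach is roughly the following: a local PV pinned to a Node via `nodeAffinity`, and task Pods mounting individual volume folders through `subPath`. All names and values here are illustrative; see the linked yaml for the full version:

```yaml
# Hypothetical local PV pinned to one node; folders on it stand in for Concourse volumes.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: concourse-worker-node1
spec:
  capacity:
    storage: 100Gi
  accessModes: ["ReadWriteOnce"]
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    path: /mnt/concourse-volumes
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values: ["node1"]
---
# A task Pod mounting one folder of that PV via subPath (claim name assumed).
apiVersion: v1
kind: Pod
metadata:
  name: concourse-task
spec:
  containers:
  - name: task
    image: busybox
    command: ["sh", "-c", "ls /tmp/build/input"]
    volumeMounts:
    - name: worker-volumes
      mountPath: /tmp/build/input
      subPath: resource-cache-abc123   # one folder per Concourse volume
  volumes:
  - name: worker-volumes
    persistentVolumeClaim:
      claimName: concourse-worker-node1-claim   # assumed claim bound to the PV above
```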

topherbullock referenced this pull request in pivotal-jwinters/rfcs Nov 20, 2018

introduce user-artfact API handler
Signed-off-by: Jamie Klassen <cklassen@pivotal.io>
@hairyhenderson

hairyhenderson commented Dec 8, 2018

Just wanted to drop this here: https://blog.drone.io/drone-goes-kubernetes-native/ - discusses how Drone (a similar CI/CD tool) has leveraged Kubernetes. Some of the implementation details are potentially relevant (or not!)
