Native Kubernetes Runtime #5682

Open

matthewpereira opened this issue May 28, 2020 · 3 comments
Labels
containerd, product epic, sred, stack/kubernetes

Comments

matthewpereira commented May 28, 2020

NOTE:

This issue is currently a stand-in for the larger track of work around implementing a k8s runtime option on Concourse. In the coming months, this issue will be replaced by the smaller epics that make up the track as a whole, starting with #6591 - Utilizing Baggage Claim as a CSI Driver. Although this issue is currently placed in Q3, expect the work on the k8s runtime to continue throughout 2021.

TLDR

  • Add a k8s runtime option to Concourse alongside the existing Garden and containerd runtimes

Problem

  • Concourse’s current Guardian runtime makes it difficult for operators to know when to scale their Concourse deployments
  • This causes Concourse to reach states where containers are maxed out and the deployment stalls
  • That stalling has a trickle-down effect on application teams using Concourse for CI
  • Customers often ask what Concourse’s limits are in terms of teams, pipelines, and jobs, hoping to derive a “magic metric” that indicates when it is necessary to scale

Proposed Solution

  • Add a native Kubernetes runtime to leverage k8s’ built-in container management and orchestration abilities

Outcomes

  • The ideal outcome is that operators no longer have to actively think about scaling their Concourse deployments. Our hypothesis is that using k8s as a runtime lets Concourse lean on k8s’ built-in container orchestration and scheduling abilities, taking the guesswork out of scaling for customers and leading to greater stability and overall product satisfaction.

Current Status

References

RFC: k8s runtime
RFC: k8s storage
#5986 - SPIKE Review the k8s worker POC
#6036 - SPIKE Storage on K8s

matthewpereira created this issue from a note in 2020 Quarterly Roadmap (Q4 - October to December) May 28, 2020
matthewpereira added product epic and removed epic labels Jun 5, 2020
scottietremendous moved this from Q4 - October to December to Q3 - July to September in 2020 Quarterly Roadmap Jul 14, 2020
scottietremendous moved this from Q3 - July to September to Q4 - October to December in 2020 Quarterly Roadmap Sep 2, 2020
scottietremendous moved this from Q4 - October to December to Q1 - January to March in 2020 Quarterly Roadmap Oct 13, 2020
scottietremendous added this to Q2 - April to June in 2021 Quarterly Roadmap Feb 23, 2021
scottietremendous moved this from Q2 - April to June to Q3 - July to September in 2021 Quarterly Roadmap Feb 24, 2021

ringods commented Oct 12, 2021

We are struggling with scaling our workers dynamically. Having containers scheduled on an autoscaling k8s cluster would be a great simplification for us.

Can someone provide an update on the status of this feature?

taylorsilva (Member) commented

Here's an update!

  • We currently lack the bandwidth to take on this work; the core team is focused on the v10 roadmap features. If this is going to get done, it won't be by the core team any time soon.
  • There are some RFCs open that someone could follow to start implementing the k8s runtime:
    • RFC: k8s storage rfcs#77 - The current proposal is to create a CSI driver based on baggageclaim. We actually did a POC of this (link in the RFC) and it worked! There are still a lot of open questions around how to stream volumes between "workers" on a k8s cluster and, even more importantly, to workers outside the cluster (should we even support that?? probably!). A rough sketch of the publish path follows at the end of this comment.
    • RFC: k8s runtime rfcs#81 - This one is still very draft-y, but there are many options for someone who wants to implement the k8s runtime. The easiest might be to add it to the worker package, so someone would configure a single worker on a k8s cluster and it would act as a CRD-ish thing (I think?) but still call out to a web node deployed somewhere. The other idea was to have the web node deployed on a k8s cluster do all the runtime work instead of calling out to a worker.

A lot of that is in the OP, so I guess my first point is the biggest (and not the greatest) update 😓
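
To make the CSI direction a little more concrete, here is a rough sketch of what the NodePublishVolume path of a baggageclaim-backed node plugin could look like, using the standard CSI Go bindings. The lookupVolumePath hook and the bind-mount approach are illustrative assumptions; this is not the POC code linked in the RFC.

```go
// Sketch of a CSI node plugin that exposes baggageclaim-managed volumes to
// step Pods. Only the publish/unpublish handlers are shown; the rest of the
// csi.NodeServer interface is omitted.
package baggageclaimcsi

import (
	"context"
	"os"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"golang.org/x/sys/unix"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

type nodeServer struct {
	// lookupVolumePath maps a CSI volume ID to the on-disk path of the
	// corresponding baggageclaim volume (e.g. by asking baggageclaim's API).
	// Hypothetical hook; the real POC may resolve volumes differently.
	lookupVolumePath func(ctx context.Context, volumeID string) (string, error)
}

// NodePublishVolume bind-mounts the baggageclaim volume into the target path
// that kubelet hands us, so the step Pod sees it as a regular volume.
func (n *nodeServer) NodePublishVolume(ctx context.Context, req *csi.NodePublishVolumeRequest) (*csi.NodePublishVolumeResponse, error) {
	src, err := n.lookupVolumePath(ctx, req.GetVolumeId())
	if err != nil {
		return nil, status.Errorf(codes.NotFound, "unknown volume %q: %v", req.GetVolumeId(), err)
	}

	target := req.GetTargetPath()
	if err := os.MkdirAll(target, 0755); err != nil {
		return nil, status.Errorf(codes.Internal, "create target path: %v", err)
	}

	if err := unix.Mount(src, target, "", unix.MS_BIND, ""); err != nil {
		return nil, status.Errorf(codes.Internal, "bind mount %s -> %s: %v", src, target, err)
	}
	return &csi.NodePublishVolumeResponse{}, nil
}

// NodeUnpublishVolume undoes the bind mount when the step Pod goes away.
func (n *nodeServer) NodeUnpublishVolume(ctx context.Context, req *csi.NodeUnpublishVolumeRequest) (*csi.NodeUnpublishVolumeResponse, error) {
	if err := unix.Unmount(req.GetTargetPath(), 0); err != nil {
		return nil, status.Errorf(codes.Internal, "unmount %s: %v", req.GetTargetPath(), err)
	}
	return &csi.NodeUnpublishVolumeResponse{}, nil
}
```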

multimac (Contributor) commented

Hey, I've actually been working on a series of patches to allow Concourse to schedule work in a k8s cluster. Recently I've been able to have successful runs of basic pipelines, including...

  • Using images fetched via the registry-image resource
  • Streaming volumes between k8s and non-k8s workers
  • Running both resource and task containers, and collecting the output from resource containers as required for them to work with Concourse

At a high level, a worker Pod is deployed to each node in the cluster, with containers for the CSI driver, the baggageclaim and garden APIs (garden being required by the beacon component), and the beacon component itself. This worker is only responsible for handling the volumes mounted into any step Pods scheduled on its k8s node, and for notifying the web nodes of the step Pods currently running on that node (via the beacon component, to allow cleanup of old Pods).

I've also implemented a new Kubernetes "runtime" on the web nodes (under atc/worker/kubernetesruntime). This runtime handles the creation of Pods to run the different step containers.
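
To make that concrete, here is a rough, simplified sketch of the kind of call the runtime makes when it schedules a step. The namespace, labels, volume layout, and CSI driver name are illustrative assumptions, not the actual code from the branch linked below.

```go
// Sketch of a web-node-side Kubernetes runtime creating a Pod for a single
// Concourse step, with its volume served by the per-node worker's CSI driver.
package kubernetesruntime

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// CreateStepPod builds and creates the Pod that runs one step container.
func CreateStepPod(ctx context.Context, clientset kubernetes.Interface, namespace, handle, image string, command []string) (*corev1.Pod, error) {
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:      fmt.Sprintf("concourse-step-%s", handle),
			Namespace: namespace,
			Labels:    map[string]string{"concourse/step-handle": handle},
		},
		Spec: corev1.PodSpec{
			RestartPolicy: corev1.RestartPolicyNever,
			Containers: []corev1.Container{{
				Name:    "step",
				Image:   image,
				Command: command,
				VolumeMounts: []corev1.VolumeMount{{
					Name:      "scratch",
					MountPath: "/tmp/build",
				}},
			}},
			Volumes: []corev1.Volume{{
				Name: "scratch",
				VolumeSource: corev1.VolumeSource{
					// Inline CSI volume handled by the worker Pod on the same
					// node; the driver name and attributes are hypothetical.
					CSI: &corev1.CSIVolumeSource{
						Driver:           "baggageclaim.csi.concourse-ci.org",
						VolumeAttributes: map[string]string{"handle": handle},
					},
				},
			}},
		},
	}

	return clientset.CoreV1().Pods(namespace).Create(ctx, pod, metav1.CreateOptions{})
}
```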

I'm currently testing this in our Concourse deployment at work and hoping to get it stable enough to become our primary runtime over the next few weeks. Hopefully I'll have the code polished up enough in the next week or two to begin PR-ing some of the smaller changes.

Happy for feedback if you want to poke around the current state of the work.

Current work can be found in this branch...
https://github.com/multimac/concourse/tree/concourse-k8s

With the custom k8s worker in this repository...
https://github.com/multimac/concourse-k8s-worker
