
Kubernetes operator for Celery #24

Open
jmdacruz opened this issue Sep 13, 2019 · 25 comments

@jmdacruz

This proposal is about having a Kubernetes operator (see here and here). The scope of the operator would be the following:

  • Defining a CRD for a CeleryApplication. This resource would contain the configuration for the cluster (e.g., container resource requests/limits, number of replicas), the Celery configuration (e.g., broker and result backend configuration), and the Docker image with the code and launch parameters (e.g., location of the code inside the container, virtualenv).
    • As an alternative, the CRD could include a URL for downloading a Python package with the application code instead of the Docker image.
  • The operator itself would manage the control loop of the CeleryApplication CRD. It would spawn a Kubernetes Deployment for the cluster and another Deployment for running Flower, and would also create a Kubernetes Service so that we can access the Flower UI/API.
  • Broker and result backend configuration would be out of scope for the operator; these have to be created beforehand. A set of CeleryApplication resources should be able to share a broker and result backend, but each can pick its own broker configuration too.
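To make the first bullet concrete, here is a minimal sketch of what a CeleryApplication custom resource could look like, written as a Python dict. The API group, field names, and child-resource naming are all illustrative assumptions, not a finalized schema:

```python
# Hypothetical CeleryApplication custom resource (field names are
# illustrative assumptions, not a finalized schema).
celery_application = {
    "apiVersion": "celeryproject.org/v1alpha1",  # assumed API group
    "kind": "CeleryApplication",
    "metadata": {"name": "demo-app"},
    "spec": {
        "celeryVersion": "4.4",
        "image": "example.com/demo-app:latest",  # image containing the task code
        "appLocation": "proj.celery:app",        # code location inside the container
        "workers": {
            "replicas": 3,
            "resources": {"requests": {"cpu": "500m", "memory": "256Mi"}},
        },
        # Broker/result backend are referenced, not created, by the operator:
        "broker": "redis://redis.default.svc:6379/0",
        "resultBackend": "redis://redis.default.svc:6379/1",
        "flower": {"enabled": True},
    },
}

def owned_resources(app):
    """Names of the child Kubernetes objects the operator would create."""
    name = app["metadata"]["name"]
    return [f"{name}-workers", f"{name}-flower", f"{name}-flower-svc"]
```

The `owned_resources` helper just illustrates the deployment/service fan-out described in the second bullet.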

This idea is inspired by the Flink Kubernetes Operator developed by Lyft: https://github.com/lyft/flinkk8soperator

@thedrow
Member

thedrow commented Sep 22, 2019

Yes, this should happen.
It means that we need to create a Controller implementation (See #19) for k8s.

@jmdacruz
Author

jmdacruz commented Dec 6, 2019

Initial attempt at implementing this (still very much a proof-of-concept): https://github.com/jmdacruz/celery-k8s-operator

@thedrow
Member

thedrow commented Dec 13, 2019

What do you imagine the integration with Celery 5 will look like?

@jmdacruz
Author

jmdacruz commented Dec 13, 2019

I honestly haven't been following the development of 5.X that closely; where can I get a glimpse of the biggest changes? Worst-case scenario, if there are breaking changes, the operator supports versioning: the CeleryApplication resource includes an attribute for the celeryVersion (I'm currently ignoring it, but it should be used for this), which can be used to change the shape of the Kubernetes deployment according to the Celery version.
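A sketch of how the celeryVersion attribute could drive the deployment shape. The component split below is a hypothetical example for illustration, not the operator's actual behavior:

```python
# Sketch: branch on spec.celeryVersion to decide which child deployments
# to create. Component names are illustrative assumptions.
def components_for(celery_version):
    major = int(celery_version.split(".")[0])
    components = ["workers", "flower"]
    if major >= 5:
        # Hypothetically, a 5.X deployment might add the new Controller role.
        components.append("controller")
    return components
```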

@jmdacruz
Author

where can I get a glimpse on the biggest changes?

Actually, scratch that... you had already mentioned this above :-). I'll take a look at that.

@jmdacruz
Author

Trying to stick to the KISS principle, I think a good option would be to keep a single CRD (the CeleryApplication CRD) and a single Docker image for all the different tools and new "roles". Thanks to the celery CLI, the celery-k8s-operator uses a single image to run the workers, Flower, and the liveness probes via celery inspect; something similar could be done to launch the Controller, Router, and Publisher. The operator would create the deployment for the core components (e.g., the Celery Controller), ensuring that the proper configuration is injected (using a Kubernetes ConfigMap) so that the Celery Controller can then do its job. Since the Celery Controller would take over some of the responsibility the operator has for a 4.X deployment, the operator needs to ensure the controller deployment/pods have the proper Kubernetes permissions to create the required resources.
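The single-image idea can be sketched as a role-to-command mapping; each pod runs the same image and only the container command differs. The exact flags and the "proj.celery:app" module path are illustrative:

```python
# Sketch: one image, with the pod's role chosen entirely by the container
# command, mirroring the celery CLI usage described above.
def command_for(role, app_module="proj.celery:app"):
    commands = {
        "worker": ["celery", "-A", app_module, "worker", "--loglevel=INFO"],
        "flower": ["celery", "-A", app_module, "flower"],
        # Used as a liveness probe, as celery-k8s-operator does:
        "liveness": ["celery", "-A", app_module, "inspect", "ping"],
    }
    return commands[role]
```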

Another approach would be to treat the Celery Controller as the operator itself, in which case its deployment should be straightforward (the expectation is that an operator is a standalone deployment), and then use the CeleryApplication CRD to create objects that the Celery Controller operator applies to the Kubernetes cluster.

In general, a Kubernetes operator needs to be very simple to deploy and maintain, since it's by definition a critical piece of infrastructure (it becomes part of the Kubernetes cluster by extending its functionality, making sure other pieces work as expected).

Take a look at what the folks at Lyft do for the Apache Flink operator: https://github.com/lyft/flinkk8soperator

@gautamp8
Collaborator

gautamp8 commented Jul 26, 2020

Is this thread still active? Inspired by the need for this at my own work, I also tried building a POC/MVP of a Celery operator to learn the whole thing: https://github.com/brainbreaker/Celery-Kubernetes-Operator

I also presented it at EuroPython 2020 last Friday as part of my talk on automating the management of Kubernetes infra while staying in the Python ecosystem (slides: https://bit.ly/europython20-ppt). I'm willing to commit a certain number of hours every week to build a production-ready version of the Celery operator.

@mik-laj

mik-laj commented Jul 26, 2020

In the Apache Airflow project, we use KEDA to provide autoscaling from 0 to n.
https://www.astronomer.io/blog/the-keda-autoscaler/
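For reference, a KEDA ScaledObject targeting a Celery worker Deployment looks roughly like the following, sketched here as a Python dict. The trigger metadata fields are from memory and should be verified against the KEDA documentation for your broker:

```python
# Rough shape of a KEDA ScaledObject scaling Celery workers on RabbitMQ
# queue depth (field names per KEDA's v1alpha1 API; verify against the
# KEDA docs before use).
scaled_object = {
    "apiVersion": "keda.sh/v1alpha1",
    "kind": "ScaledObject",
    "metadata": {"name": "celery-workers"},
    "spec": {
        "scaleTargetRef": {"name": "celery-worker-deployment"},
        "minReplicaCount": 0,   # scale to zero when the queue is empty
        "maxReplicaCount": 10,
        "triggers": [
            {
                "type": "rabbitmq",
                "metadata": {
                    "queueName": "celery",
                    "queueLength": "50",          # target tasks per replica
                    "hostFromEnv": "BROKER_URL",  # broker connection string
                },
            }
        ],
    },
}
```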

@auvipy
Member

auvipy commented Jul 27, 2020

In the Apache Airflow project, we use KEDA to provide autoscaling from 0 to n.
https://www.astronomer.io/blog/the-keda-autoscaler/

I quite liked the approach KEDA takes when I first saw it.

@auvipy auvipy added this to the Celery 5.0 milestone Jul 27, 2020
@thedrow
Member

thedrow commented Jul 27, 2020

I'm aware of KEDA and I plan to incorporate it in our solution.

@gautamp8
Collaborator

Yes, KEDA is probably the best way to go for the scaling use-case. It keeps us close to native solutions like the HPA and only introduces a metrics server and controller.

For my application, I was personally more focused on the learning experience, so I chose to implement a really basic scaling algorithm without using anything external.
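A basic scaling rule of that kind can be sketched in a few lines: size the worker Deployment from broker queue depth, clamped to bounds. The heuristic and numbers are illustrative; in production, KEDA/HPA would normally own this decision:

```python
import math

# Deliberately simple scaling rule: ceil(queue depth / target tasks per
# worker), clamped between min and max replicas. Purely illustrative.
def desired_replicas(queue_length, tasks_per_worker,
                     min_replicas=1, max_replicas=10):
    wanted = math.ceil(queue_length / tasks_per_worker)
    return max(min_replicas, min(max_replicas, wanted))
```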

@thedrow
Member

thedrow commented Jul 27, 2020

What other tasks should the operator perform other than scaling?

@gautamp8
Collaborator

Let me give my inputs from what I know about running Celery in production. I'm yet to read and understand the proposal and architecture for 5.x you've shared. I'll come back with more inputs.

I'm focusing on the problem of all the manual work/configuration that needs to be done while setting up Celery on K8s -

  1. Set up worker deployments (and a separate deployment for periodic tasks using celery beat)
  2. Set up a flower deployment for observability, and expose a service to make it accessible outside the cluster
  3. Worker scaling setup (using the KEDA operator, maybe, as we discussed)
  4. Make sure things are recoverable: if any of the children (like a worker deployment) goes rogue, the operator recovers it automatically. We might have to discuss different causes here.
  5. Should the operator also care about alerting when something abnormal happens (failed tasks beyond a threshold, unhealthy workers, etc.)? I'm not sure there.
  6. We could include a basic broker setup by default for use-cases where people just want to start quickly? But that might go beyond the Celery operator, because brokers have their own notion of a cluster.

There might be more things that come up while managing the lifecycle of a Celery application (I'm not a Celery expert right now, but I'm willing to explore/learn). I guess solving these manual steps would be a good starting point. What do you suggest?
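Points 1 through 4 boil down to a reconcile step: diff the child objects the operator wants against what actually exists in the cluster and recreate anything missing. A minimal sketch, with hypothetical resource names:

```python
# Minimal reconcile sketch: given the children the operator wants and the
# children observed in the cluster, return what must be (re)created.
def reconcile(desired, observed):
    return {name: spec for name, spec in desired.items() if name not in observed}

desired = {
    "worker-deployment": {"replicas": 3},
    "beat-deployment": {"replicas": 1},   # separate deployment for celery beat
    "flower-deployment": {"replicas": 1},
    "flower-service": {"port": 5555},
}
observed = {"worker-deployment": {"replicas": 3}, "flower-service": {"port": 5555}}
missing = reconcile(desired, observed)  # children the operator must recreate
```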

@auvipy
Member

auvipy commented Jul 28, 2020

Let me give my inputs from what I know about running Celery in production. […] I guess solving these manual steps would be a good starting point. What do you suggest?

You can start with Celery 4.4.x as well.

@thedrow
Member

thedrow commented Jul 28, 2020

Let me give my inputs from what I know about running Celery in production. […] What do you suggest?

you can start with celery 4.4.x as well.

Yes, but if he does, he'll have to redesign it later on.

@thedrow
Member

thedrow commented Jul 28, 2020

Let me give my inputs from what I know about running Celery in production. […] What do you suggest?

I think alerting is a good idea.
We're going to use OpenTelemetry, so I'm not sure how useful Flower will be until they migrate as well.

Do read the draft, please 😄. I'd love to hear your comments.

@gautamp8
Collaborator

gautamp8 commented Jul 30, 2020

Let's continue the operator conversation from celery issue #4213 here itself (in one place).

Do read the draft, please 😄. I'd love to hear your comments.

I'm still going through the arch doc and thinking about everything the operator/controller will need to do. I definitely see some major changes in this CEP from the way 4.X needed to be deployed on a K8s cluster.

I'll come back with some questions/comments by this weekend. Sorry for the delay due to my limited availability.

@gautamp8
Collaborator

gautamp8 commented Aug 1, 2020

Okay, so I reviewed the architecture for 5, and it looks really promising. I have some comments, which I'll add to PR #27.

For the operator, we could support both 4.X and 5. I feel we should start with 4.4.X as per the suggestion of @auvipy. We can introduce versioning in the operator as we go along.

Correct me if I'm wrong: for 5 to reach a stable version and be adopted by the community as a breaking release will take time, I'm guessing more than a year. Until then, and even beyond that, people will still be using 4.4.X if it's too much effort to migrate and they don't wish to use the new use-cases 5 is going to support.

A 4.4.X operator will be somewhat simpler to implement and a good way to start, because it has fewer moving parts/components.

For the controller implementation for 5, we need to have a detailed discussion around the lifecycle of the components (data sources, sinks, the router, the execution platform, and so on). We also need to discuss what would lie in the scope of the controller and what wouldn't; for example, managing different message brokers, data sources, and sinks might go beyond the work of the Celery operator.

Ideally, I'd want to try running Celery 5 in production to see the pain points and manual things to be done before writing an operator to fix those. I think there's still some way to go for that. What do you suggest @thedrow?

If you guys agree to go ahead with 4.4.X as a start, then I'll go ahead and chalk out a design document for the operator and share it with you guys as soon as I can.

@auvipy
Member

auvipy commented Aug 1, 2020

Your observation seems practically logical to me. I would suggest starting with 4.x first. One goose step at a time O:)

@RyanSiu1995

I have recently started creating a Celery operator with the Operator Framework:
https://github.com/RyanSiu1995/celery-operator
With the Operator Framework, we may do something more native, like metrics exporting.
I am going to continue the development.
Hopefully, it will have a test version by the end of this month.

@auvipy auvipy self-assigned this Aug 17, 2020
@gautamp8
Collaborator

gautamp8 commented Aug 18, 2020

@auvipy @thedrow @jmdacruz
Wrote a high-level architecture document for the operator - https://brainbreaker.github.io/Celery-Kubernetes-Operator/architecture

Would like to have inputs/suggestions from you guys.

@auvipy
Member

auvipy commented Aug 19, 2020

Will look into this next week.

@thedrow
Member

thedrow commented Aug 24, 2020

@Brainbreaker I'm going to read this today.
If you want this to be the official way to deploy Celery to k8s, you'll need to submit a CEP.
Unfortunately, the Operator CEP depends on the Controller CEP, which I haven't started yet.
We can work on that together as well to ensure we have gathered all the requirements.

You can use our template to do so.

I'm willing to shepherd this effort.

@gautamp8
Collaborator

gautamp8 commented Aug 25, 2020

@Brainbreaker I'm going to read this today.

Awesome, thanks.

If you want this to be the official way to deploy Celery to k8s, you'll need to submit a CEP.

Yes, for sure. I'd be happy to submit a CEP.

Unfortunately, the Operator CEP depends on the Controller CEP, which I haven't started yet.
We can work on that together as well to ensure we have gathered all the requirements.

Sounds good, although I have written the document with Celery 4.4.X in mind right now, not 5. But yeah, we should think about making it scalable to handle 5 as well.

I'm willing to shepherd this effort.

Great. I'm looking forward to your inputs.

@gautamp8
Collaborator

Opened #29. It'll probably be better for you guys to review it as a CEP.

I couldn't preview the rendered output for the RST; however, I've tried my best to avoid any random formatting issues using online tools.
