Added Volcano as CNCF Sandbox. #318

Merged 1 commit on Apr 10, 2020

Conversation

k82cn
Contributor

@k82cn k82cn commented Nov 6, 2019

This is the proposal to add Volcano to the CNCF.

Name of Project: Volcano
Description:
Volcano is a batch system built on Kubernetes. It provides a suite of mechanisms commonly required by many classes of batch and elastic workloads, including machine learning/deep learning, bioinformatics/genomics and other "big data" applications. These types of applications typically run on generalized domain frameworks like TensorFlow, Spark, PyTorch, MPI, etc., which Volcano integrates with.

@wpeng102

wpeng102 commented Nov 6, 2019

+1

@tizhou86

tizhou86 commented Nov 6, 2019

+1

@lucperkins
Contributor

Everyone, please no more “+1”s here. They do not provide useful information and do not make a difference in the TOC’s decisions.

@quinton-hoole
Contributor

I am familiar with this project and believe it would be a good addition to the CNCF sandbox to promote industry collaboration towards shared implementations of common extensions to Kubernetes to better support batch AI/ML, big data and other similar batch workloads.

@caniszczyk
Contributor

@quinton-hoole have they considered LF AI? There's a scope question here potentially and there may be other organizations that are a better fit: https://landscape.lfai.foundation

@k82cn
Contributor Author

k82cn commented Nov 13, 2019

@quinton-hoole have they considered LF AI? There's a scope question here potentially and there may be other organizations that are a better fit: https://landscape.lfai.foundation

Volcano is a common extension to Kubernetes for batch workloads (a Kubernetes-native batch system), and its target is to "help batch workloads/applications become cloud native". Batch workloads include not only AI and big data, but also HPC (e.g. genomics) and others. So CNCF seems to match our target better :)

@quinton-hoole
Contributor

quinton-hoole commented Nov 13, 2019 via email

@k82cn
Contributor Author

k82cn commented Dec 2, 2019

Hi team, is there anything I can do to help with the next step? :)

@caniszczyk
Contributor

@k82cn my advice would be to schedule a presentation to one of the CNCF SIGs, I think in this case, App Delivery may be best (e.g., https://github.com/cncf/sig-app-delivery) as I can't think of another SIG that would be better, @quinton-hoole?

@k82cn
Contributor Author

k82cn commented Dec 3, 2019

schedule a presentation to one of the CNCF SIGs

Is that the new process? :)

I think in this case, App Delivery may be best (e.g., https://github.com/cncf/sig-app-delivery) as I can't think of another SIG that would be better

Hmm, IMO SIG Runtime seems a better fit for Volcano according to the chart of SIGs; is SIG Runtime ready to do that?

@caniszczyk
Contributor

yes we are requiring all projects to present to a SIG first

SIG Runtime could be great but it's not formed yet, I believe we are almost there cc: @amye

@quinton-hoole
Contributor

@caniszczyk Yes, I discussed with @amye today, and she's busy tying up some loose ends to bring SIG Runtime to a TOC vote.

Signed-off-by: Klaus Ma <klaus1982.cn@gmail.com>
@k82cn
Contributor Author

k82cn commented Feb 20, 2020

We reviewed it and our recommendation is to accept to Sandbox. The next step is getting sponsors in the TOC. Here are the recommendation and presentation record :)

/cc @raravena80 @amye

@lizrice
Contributor

lizrice commented Mar 6, 2020

I've just been reviewing the presentation and recommendation. It looks good but I have a couple of questions:

  • Are there any similar projects that you're aware of?
  • In the presentation there was a good list of companies evaluating their use of Volcano, have any more of them reached a conclusion and decided whether to adopt Volcano since this presentation happened?

@raravena80
Contributor

I think maybe @k82cn can better answer these questions but I'd like to pitch in for

Are there any similar projects that you're aware of?

There is Cyclone, K8s native but it seems to focus more on AI.
There is also Flyte, K8s native but it focuses on Machine Learning.

There are also a few similar projects but they are not K8s native. For example, Airflow, Luigi, and Oozie.

Then Peloton is a mixed batch, stateless and stateful jobs scheduler but it mainly works under Mesos.

An interesting feature of Volcano is that it allows for DRF (Dominant Resource Fairness) allocation in K8s, which I think no other project implements. Big Data folks running something like Spark or Flink would remember it as the default from the Mesos world.
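
For readers unfamiliar with DRF, the idea can be sketched roughly as follows. This is a minimal illustration of progressive filling under simplifying assumptions (whole tasks, fixed per-task demands, positive capacities), not Volcano's implementation; the function name `drf_allocate` and its argument shapes are hypothetical.

```python
def drf_allocate(capacity, demands):
    """Progressive-filling sketch of Dominant Resource Fairness.

    capacity: resource name -> total amount in the cluster (assumed > 0)
    demands:  user name -> per-task demand (resource name -> amount)
    Returns user name -> number of tasks granted.
    """
    for user, demand in demands.items():
        # Reject demands that could never fit or would never exhaust capacity.
        if any(r not in capacity for r in demand):
            raise ValueError(f"{user} requests an unknown resource")
        if not any(amount > 0 for amount in demand.values()):
            raise ValueError(f"{user} has an empty demand")

    used = {r: 0 for r in capacity}
    allocated = {u: {r: 0 for r in capacity} for u in demands}
    tasks = {u: 0 for u in demands}

    def dominant_share(user):
        # A user's dominant share is their largest fractional use of any resource.
        return max(allocated[user][r] / capacity[r] for r in capacity)

    active = set(demands)
    while active:
        # Serve the user with the smallest dominant share (ties broken by name).
        user = min(active, key=lambda u: (dominant_share(u), u))
        demand = demands[user]
        if all(used[r] + demand.get(r, 0) <= capacity[r] for r in capacity):
            for r in capacity:
                used[r] += demand.get(r, 0)
                allocated[user][r] += demand.get(r, 0)
            tasks[user] += 1
        else:
            # This user's next task no longer fits; stop considering them.
            active.remove(user)
    return tasks


# Two users with different dominant resources share 9 CPUs and 18 GB of memory.
print(drf_allocate(
    {"cpu": 9, "mem": 18},
    {"A": {"cpu": 1, "mem": 4}, "B": {"cpu": 3, "mem": 1}},
))
```

In this example user A is memory-dominant and user B is CPU-dominant, so the loop alternates between them until both end up with equal dominant shares.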

@k82cn
Contributor Author

k82cn commented Mar 7, 2020

Are there any similar projects that you're aware of?

I think @raravena80 covered most of the similar projects. Besides DRF allocation, Volcano also provides other features, e.g. gang scheduling and queues, to run Kubeflow, Spark and so on better on k8s.
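
As a rough illustration of what gang scheduling means here: a job's pods are placed all-or-nothing, so a distributed job never starts with only part of its workers. The sketch below is a simplified, single-resource, first-fit version; `gang_schedule`, `free_cpu`, `pod_cpu` and `min_available` are hypothetical names, and this is not how Volcano's scheduler is actually written.

```python
def gang_schedule(free_cpu, pod_cpu, min_available):
    """All-or-nothing placement sketch.

    free_cpu:      node name -> free CPU on that node
    pod_cpu:       list of CPU requests, one per pod of the job
    min_available: minimum number of pods that must be placed together
    Returns pod index -> node name, or {} if the gang cannot be satisfied.
    """
    remaining = dict(free_cpu)  # work on a copy; commit only on success
    placements = {}
    for index, need in enumerate(pod_cpu):
        # First-fit over nodes in name order, for determinism.
        for node in sorted(remaining):
            if remaining[node] >= need:
                remaining[node] -= need
                placements[index] = node
                break
    # Commit only if enough pods of the gang found a node.
    if len(placements) >= min_available:
        return placements
    return {}


print(gang_schedule({"n1": 4, "n2": 2}, [2, 2, 2], min_available=3))
print(gang_schedule({"n1": 4, "n2": 2}, [2, 2, 2, 2], min_available=4))
```

The second call returns an empty placement because only three of the four required pods fit, so none of them are started.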

In the presentation there was a good list of companies evaluating their use of Volcano, have any more of them reached a conclusion and decided whether to adopt Volcano since this presentation happened?

Some of them decided to use Volcano + Kubeflow or Spark for their ML/BigData platform according to offline discussions, but it's up to the adopters to decide when to update their status, to avoid any media communication concerns.

@haosdent

+1024

@Jeffwan

Jeffwan commented Mar 10, 2020

I would love to vote for Volcano to join CNCF Sandbox projects.

In ug-machine-learning, Volcano is widely adopted by most of the distributed training operators, such as tf-operator, pytorch-operator, mpi-operator, etc. As the Kubernetes default scheduler is not well designed for batch jobs, a lot of features users need, like gang scheduling and binpack, are missing. Users planning to move from traditional platforms to Kubernetes have to rely on secondary scheduler solutions. Volcano is the most popular one; it solves most of the common problems users have and has become an essential component for running big data and machine learning workloads on Kubernetes.
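
To make "binpack" concrete: the scheduler prefers the node that would end up most utilized after placing the pod, which packs work tightly and leaves other nodes free for large jobs. The following is a minimal single-resource sketch; `binpack_pick` and the `(used, capacity)` tuple layout are assumptions for illustration, and real binpack scoring weighs several resources.

```python
def binpack_pick(nodes, request):
    """Pick the feasible node with the highest utilization after placement.

    nodes:   node name -> (used, capacity) for a single resource
    request: amount of that resource the pod asks for
    Returns the chosen node name, or None if no node can fit the pod.
    """
    best_node = None
    best_score = -1.0
    for name in sorted(nodes):  # name order makes tie-breaking deterministic
        used, capacity = nodes[name]
        if used + request > capacity:
            continue  # pod does not fit on this node
        score = (used + request) / capacity
        if score > best_score:
            best_node, best_score = name, score
    return best_node


print(binpack_pick({"a": (1, 8), "b": (5, 8), "c": (7, 8)}, request=2))
```

Here node `c` is skipped because the pod does not fit, and `b` is preferred over `a` because it would be fuller after placement.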

@lizrice
Contributor

lizrice commented Mar 10, 2020

it's up to the adopters to decide when to update their status, to avoid any media communication concerns

@k82cn is there an Adopters file that they are updating status in? (We don't have a particular gate on usage for Sandbox acceptance, but evidence of adoption could be a persuasive argument to help attract TOC Sponsors.)

@k82cn
Contributor Author

k82cn commented Mar 12, 2020

@lizrice, I discussed the latest status with some adopters; for now, 3 adopters have updated their status, namely iQiyi (staging, pre-production), Xiaohongshu.com (production) and HuaweiCloud (production). It'll take time to get more adopters to update their status; do you think that's enough for a sandbox application?

@k82cn
Contributor Author

k82cn commented Mar 17, 2020

@lizrice
Contributor

lizrice commented Mar 19, 2020

I am happy to sponsor Volcano for Sandbox. (Could have sworn I already added a comment to that effect yesterday - maybe I failed to hit the green button!)

Thank you SIG Runtime for your review, and @k82cn for the additional answers

@sheng-liang

I am happy to sponsor as well.

@k82cn
Contributor Author

k82cn commented Mar 21, 2020

@lizrice , @sheng-liang thanks very much !

@happy2048

I find sig-scheduling may be doing something similar with the new scheduler framework. Is there any relationship or difference?
framework: https://github.com/kubernetes/enhancements/blob/master/keps/sig-scheduling/20180409-scheduling-framework.md
binpack: https://github.com/kubernetes/enhancements/blob/master/keps/sig-scheduling/20190311-resource_bin_packing_priority_function.md
coscheduling: kubernetes/enhancements#1463

@k82cn
Contributor Author

k82cn commented Mar 25, 2020

I find sig-scheduling may be doing something similar with the new scheduler framework. Is there any relationship or difference ?

  1. Volcano is a batch system rather than just a scheduler; it includes a job controller, admission controllers, a scheduler and device plugins for batch workloads on k8s. That's why the k8s steering committee also suggested donating it to CNCF.
  2. When Bobby and I built the scheduling framework as sig-scheduling co-chairs, we found there are still some gaps in supporting job-level scheduling, e.g. fair scheduling (across time), queues, reservation/backfill and so on. So we prefer to use Volcano to unblock the onboarding of batch workloads.
  3. In addition, the purpose of the scheduling framework is to help users build customized schedulers; so we may leverage it with batch algorithms as the scheduler, working with the job controller, admission controllers and device plugins for batch workloads on k8s.

@alena1108
Contributor

I am happy to sponsor Volcano for Sandbox. Support for running batch workloads on Kubernetes is in high demand by Big Data/Machine Learning projects. And having Volcano in the Sandbox can foster community collaboration and knowledge sharing in the Cloud Native space.

@k82cn thank you for providing all the extra information! Also great to see GPU topology/share support on the 2020 roadmap.

@k82cn
Contributor Author

k82cn commented Apr 2, 2020

@alena1108, thanks very much for your sponsorship :)

@raravena80
Contributor

@amye @caniszczyk Volcano now has 3 TOC sponsors for Sandbox. Any next steps for the project team? Thanks.

@amye
Contributor

amye commented Apr 2, 2020

We'll work with them on transferring assets over; after that's complete I will merge this in.

@caniszczyk
Contributor

caniszczyk commented Apr 2, 2020

Thanks @amye can you start on the onboarding checklist here:

  • Create maintainer list + added to aggregated list
  • Website: logo updates
  • Domain: transfer domain to CNCF and under LF ITx
  • Trademarks: transfer any assets over to the LF if they exist
  • Devstats: add to devstats
  • Marketing: cncf/artwork repo, slide decks
  • Documentation: maintainer docs, project tracker
  • Update CNCF Landscape
  • Events updates: CFP + Registration + CFP Area
  • ServiceDesk: confirm maintainers have read https://www.cncf.io/services-for-projects/
  • Welcome Email Sent

@caniszczyk
Contributor

hey @k82cn can you confirm you and the maintainers have read https://www.cncf.io/services-for-projects/ :)

@k82cn
Contributor Author

k82cn commented Apr 9, 2020

hey @k82cn can you confirm you and the maintainers have read https://www.cncf.io/services-for-projects/ :)

Confirmed. The maintainers and I have already read https://www.cncf.io/services-for-projects/ .

caniszczyk added a commit that referenced this pull request Apr 10, 2020
#318

Signed-off-by: Chris Aniszczyk <caniszczyk@gmail.com>
@caniszczyk
Contributor

Thanks @k82cn, @amye will send out the project welcome email and schedule the monthly project sync with CNCF staff, thanks!

@caniszczyk caniszczyk merged commit 76b1c92 into cncf:master Apr 10, 2020
@k82cn k82cn deleted the volcano_sandbox branch April 11, 2020 04:36