[Backend][Plugin]Support for Dask clustered tasks in Flyte #427
Comments
Dask can be deployed to Kubernetes; the template is shown here. Allowing this would help users a lot and enable writing really short tasks. This, coupled with cluster re-use (coming later) or cluster gateways (daskhub) and support for a Coiled task in the future, would enable users to use Dask more effectively and make Flyte + Dask work together.

@task(config=Dask(
        workers=4,
        worker_resources=....,
        [worker-pod-template=...]  # Also the command should probably be hard coded client side?
    ), resources=Resources(....)  # Driver resource
)
def my_dask_program():
    pass
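To make the shape of the proposal above concrete, here is a minimal, self-contained sketch. Everything in it is an assumption made for illustration: the `Dask` and `Resources` dataclasses and the `task` decorator are hypothetical stand-ins that only mirror the field names in the proposal, not the flytekit API or the plugin that was eventually merged.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass(frozen=True)
class Resources:
    """Hypothetical resource request, e.g. cpu="2", mem="4Gi"."""
    cpu: str = "1"
    mem: str = "1Gi"


@dataclass(frozen=True)
class Dask:
    """Hypothetical task config mirroring the fields named in the proposal."""
    workers: int = 1
    worker_resources: Resources = field(default_factory=Resources)
    worker_pod_template: Optional[dict] = None

    def __post_init__(self) -> None:
        # Reject configurations that could never run any work.
        if self.workers < 1:
            raise ValueError("a Dask cluster needs at least one worker")


def task(config: Dask, resources: Resources) -> Callable:
    """Stand-in decorator: attaches the cluster config and driver resources."""
    def decorator(fn: Callable) -> Callable:
        fn.dask_config = config
        fn.driver_resources = resources
        return fn
    return decorator


@task(
    config=Dask(workers=4, worker_resources=Resources(cpu="2", mem="4Gi")),
    resources=Resources(cpu="1", mem="2Gi"),  # driver resources
)
def my_dask_program() -> None:
    pass
```

The point of the split is that worker sizing lives in the cluster config while the driver keeps the ordinary task-level resources, which is what the two comments in the proposal suggest.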
This was discussed a little on Slack: https://app.slack.com/client/TN89P6GGK/CNMKCU6FR/thread/CNMKCU6FR-1648660418.322249

Currently we use neither Dask nor Coiled's hosted platform for Dask. However, both are really interesting to us as a way to migrate away from our current, home-rolled workflow orchestration solution and have someone else run our workloads for us. Our primary interest in Dask is its drop-in nature with respect to dataframes and NumPy arrays. We currently employ a streaming solution (using mmap) that lets us do this work on one node; being able to scale up to multiple nodes without needing code changes in dependent areas of our project would be an instant win for us.

From an integration point of view I found the Dask+Prefect video informative. In this video a couple of things stand out to me as desirable from any integration:
Unfortunately I don't have much more to add other than opinions currently; however, as we rework our stack to work with Flyte there's a high chance we'll find ourselves going down this path, in which case we'll keep you apprised.
This is a great summary. Let me chalk out the effort and see when we can accommodate this.
Also, we can always start with a flytekit plugin,
@kumare3 Quick update on this, we are working on a I've also looked into creating a backend plugin and have a working prototype capable of managing the cluster lifecycle. Currently, this is waiting on dask/dask-kubernetes#483 (Basically
@bstadlbauer this is awesome. Please let us know how we can help. There is some momentum now in adding Flyte+Ray support. We will also be working on reusing a Ray cluster across multiple tasks in a Flyte workflow. Once you have your Dask plugin, we will start modifying things towards this common way of reusing clusters.
@kumare3 Great, thank you! Reusing clusters would be super helpful! I've looked at the
@bstadlbauer the backend plugin is flexible. Spark is peculiar because it starts the cluster and runs the app. We actually prefer that you run a separate driver, as that can speed up Flyte even more and give fantastic control, something we learnt through many issues in Spark. Flyte can run the user code as a separate pod and then monitor it. This also helps with reuse.
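As a rough illustration of the "run the user code as a separate pod and then monitor it" idea, the sketch below maps observed Kubernetes pod phases of a driver pod onto task-level phases. The `TaskPhase` enum and the `monitor` function are hypothetical names invented for this example and are not Flyte's actual plugin interface; only the pod phase names (Pending, Running, Succeeded, Failed, Unknown) come from Kubernetes.

```python
from enum import Enum
from typing import Iterable


class TaskPhase(Enum):
    # Hypothetical task-level phases, not Flyte's actual enum.
    QUEUED = "queued"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


# Kubernetes pod phases mapped onto the hypothetical task phases above.
# "Unknown" is deliberately absent so it leaves the last phase unchanged.
POD_TO_TASK = {
    "Pending": TaskPhase.QUEUED,
    "Running": TaskPhase.RUNNING,
    "Succeeded": TaskPhase.SUCCEEDED,
    "Failed": TaskPhase.FAILED,
}

TERMINAL = {TaskPhase.SUCCEEDED, TaskPhase.FAILED}


def monitor(observed_pod_phases: Iterable[str]) -> TaskPhase:
    """Consume successive driver pod phases until one is terminal."""
    phase = TaskPhase.QUEUED
    for pod_phase in observed_pod_phases:
        # Unrecognised phases keep the last known task phase.
        phase = POD_TO_TASK.get(pod_phase, phase)
        if phase in TERMINAL:
            break
    return phase
```

Because the driver is just a pod whose phase is observed from outside, the cluster it talks to does not have to be created and destroyed together with it, which is one way to read the remark about reuse.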
@kumare3 Oh that's nice! Is there a plugin that does this already?
Not today, but we are working on a Ray plugin. Let me add you to a Slack thread.
Quick update:
Quick update from my end. I had some time this weekend to finish things. Sorry for this taking so long, the last weeks have been quite busy. Overall, this would be the order in which the PRs need to go in:
All PRs are in, closing this task. Thanks again for this awesome contribution @bstadlbauer ! 🚀
Why would this plugin be helpful to the Flyte community
Users could write very short-running distributed array jobs using Dask. This makes it possible to multiplex very short jobs onto the same set of nodes.
Type of Plugin
Can you help us with the implementation?
Additional context
This would really help express ideas that do not fit Spark or heavyweight options like Flyte batch jobs.