Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for a Dask runner #18962

Closed
kennknowles opened this issue Jun 3, 2022 · 11 comments · Fixed by #22421
Closed

Support for a Dask runner #18962

kennknowles opened this issue Jun 3, 2022 · 11 comments · Fixed by #22421

Comments

@kennknowles
Copy link
Member

Adding support for a Dask runner is currently under consideration?

Imported from Jira BEAM-5336. Original Jira may contain additional context.
Reported by: georvic.

@TheNeuralBit
Copy link
Member

Related issue in pangeo-forge-recipes: pangeo-forge/pangeo-forge-recipes#256

@rabernat
Copy link

rabernat commented Jun 7, 2022

In order to kickstart the discussion of implementing a Dask Beam runner, I propose we meet during the week of June 13-17. I have created a When2Meet Poll here - https://www.when2meet.com/?15861604-jLnA4 . If you are interested in attending, please give your availability. Hope to see many people there! 🚀

@rabernat
Copy link

rabernat commented Jun 8, 2022

It would be great if we could a Beam expert to attend this meeting in order to provide guidance on the Beam side. @kennknowles, @TheNeuralBit, @richardliaw - any interest? If the times don't work for you, we can revise the options.

@rabernat
Copy link

Thanks to all who replied! We have scheduled the call for Wed June 15, 1:30 pm ET. @TheNeuralBit - I did not have your email so couldn't add you to the invite, but the zoom link is https://columbiauniversity.zoom.us/j/94977204481?pwd=TVFVTGVXY25sVnMxTHROeTloWlZ5dz09

Looking forward to the discussion!

@TheNeuralBit
Copy link
Member

@kennknowles could you make that time?

My apache email is bhulette@, if apache.org emails don't work, it's the same username at google.com

@kennknowles
Copy link
Member Author

Yes I can make it. (our emails are in the dev@ list thread)

@TomAugspurger
Copy link

@rabernat do I have the wrong time, or did this call already wrap up?

@rabernat
Copy link

rabernat commented Jun 15, 2022 via email

@francesco086
Copy link

Any update? I find the idea very interesting

@rabernat
Copy link

rabernat commented Jul 8, 2022

@alxmrs seems to be leading the charge on this for now. We have some notes here. https://docs.google.com/document/d/1Awj_eNmH-WRSte3bKcCcUlQDiZ5mMKmCO_xV-mHWAak/edit#heading=h.y0pwg4polebc

We are hoping that the folks at Google may spearhead the implementation of the Beam runner. Another option would be to make and RFP and seek an independent contractor to work on it.

@alxmrs
Copy link
Contributor

alxmrs commented Jul 8, 2022

Great timing! I have an initial working prototype here: alxmrs#1

Edit: I just made an announcement to the apache mailing list, and the update is more or less in the notes Ryan's linked to. To reiterate –

  • The overall approach involves translation Beam graphs to a graph of Dask Bag operations. Bags and PCollections refer to the same concept.
  • I've implemented enough Beam/Dask operations to write tests with TestPipelines. These generally span a decent amount of Beam use cases (Create, ParDo, Flatten, and GroupByKey).
  • There are basic options for pointing to a Dask cluster. By default, the pipeline will run on a local cluster.
  • Windowing and streaming operations in general are not implemented.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants