
Question: Could this be used to get a dask cluster running on AWS EMR? #28

Closed
ian-whitestone opened this issue Oct 24, 2018 · 13 comments

@ian-whitestone

ian-whitestone commented Oct 24, 2018

Potentially naive question, as I just learned what YARN was at a meetup last night. I think Amazon's EMR service is built around it. With that in mind, could you use this package, or parts of it, to get a dask cluster up and running on EMR?

I know the recommended deployment is using kubernetes, but my company blocked AWS' kubernetes service (EKS) 🤦‍♂️.

Any tips/advice would be greatly appreciated.

@jcrist
Member

jcrist commented Oct 24, 2018

Definitely. I haven't tried it, but it's likely that things will just work. Steps I'd try:

@martindurant
Member

That workflow would make a very nice (v/b)log.

@ian-whitestone
Author

Thanks @jcrist.

One other question/thought that came to mind based on this code:

from dask_yarn import YarnCluster
from dask.distributed import Client

# Create a cluster where each worker has two cores and eight GB of memory
cluster = YarnCluster(environment='environment.tar.gz',
                      worker_vcores=2,
                      worker_memory="8GB")
# Scale out to ten such workers
cluster.scale(10)

# Connect to the cluster
client = Client(cluster)

How do subsequent connections to the cluster work? I imagine you wouldn't re-execute this code each time, as (I think) it would create a new cluster each time.

Would a common workflow involve printing out the master scheduler endpoint, then each time you SSH into the master node, connecting with Client('<local-ip-address>:<port>')? Or does dask-yarn create a public scheduler endpoint that can be accessed from a local machine?

@jcrist
Member

jcrist commented Oct 24, 2018

I imagine you wouldn't re-execute this code each time as (i think) it would create the new cluster each time.

Yes, that is correct. The intent is that you create a cluster, do your work, then shut the cluster down. There's nothing baked into dask-yarn currently for spinning up a persistent cluster (although this would be fairly trivial to write up if it'd be useful to you).

Since clusters are fairly quick to spin up, and keeping a persistent one idle would hog resources from others, this hasn't been a huge problem so far. If you want to exit the terminal while something is running in the background, you can use more general solutions (e.g. tmux, screen, or nohup).
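One lightweight pattern for the "persistent cluster" case is a sketch like the following; the helper names and the scheduler.json path are hypothetical (not part of dask-yarn), and it assumes YarnCluster exposes its scheduler address. The idea is to persist the address when the cluster is created so a later session can reconnect with dask.distributed.Client instead of creating a new cluster:

```python
import json

# Hypothetical helpers -- not part of dask-yarn. Persist the scheduler
# address of a running cluster so another session can reconnect to it.
def save_address(address, path="scheduler.json"):
    with open(path, "w") as f:
        json.dump({"address": address}, f)

def load_address(path="scheduler.json"):
    with open(path) as f:
        return json.load(f)["address"]

# First session (assumes `cluster` from the snippet above):
#   save_address(cluster.scheduler_address)
#
# Later session, instead of building a new YarnCluster:
#   from dask.distributed import Client
#   client = Client(load_address())
```

The reconnecting session only needs network access to the scheduler endpoint (e.g. from the master node over SSH, as described above).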

@ian-whitestone
Author

ian-whitestone commented Oct 24, 2018

I see, makes sense. Thanks for the quick response.

@jcrist
Member

jcrist commented Nov 5, 2018

I'm currently working on writing up docs on getting started on AWS EMR, but the immediate tl;dr is:

  • Just following the quickstart works, no problems encountered: http://yarn.dask.org/en/latest/quickstart.html
  • If you want to access HDFS through pyarrow, you'll need to pass worker_env={'ARROW_LIBHDFS_DIR': '/usr/lib/hadoop/lib/native/'} to the YarnCluster constructor, as well as set ARROW_LIBHDFS_DIR='/usr/lib/hadoop/lib/native/' locally. This is a common issue on other clusters, and I haven't found a general solution to it besides documentation yet (other non-python tools also have issues finding the hadoop native libraries).
  • A simple EMR bootstrap script can get you up and running quickly (not sure if I want to provide a general one, but people should be able to work off of the example one I've prepared)
  • Job submission (http://yarn.dask.org/en/latest/submit.html) works fine, and also works through the GUI interface provided by AWS. This means you can have automated EMR jobs that run Dask.
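The ARROW_LIBHDFS_DIR point above can be sketched as follows; the path is the EMR default location mentioned in the list, and the variable has to be set both locally and on the workers. The YarnCluster call is left commented since it needs a live YARN cluster:

```python
import os

# Default location of the Hadoop native libraries on EMR (per the note above)
HADOOP_NATIVE = "/usr/lib/hadoop/lib/native/"

# Set locally so client-side pyarrow can find libhdfs
os.environ["ARROW_LIBHDFS_DIR"] = HADOOP_NATIVE

# Forward the same variable to every worker via worker_env
worker_env = {"ARROW_LIBHDFS_DIR": HADOOP_NATIVE}

# from dask_yarn import YarnCluster
# cluster = YarnCluster(environment='environment.tar.gz',
#                       worker_env=worker_env)
```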

All in all this was a fairly painless process. I hope to get the docs up in the next couple of days, will ping here again once there's a PR.

@jcrist
Member

jcrist commented Nov 7, 2018

Work is happening in #41. I have a nice bootstrap action written up, and just need to write the docs now. If you're feeling antsy, you can try the bootstrap action out already (just upload to s3 and configure in your EMR setup, see https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-bootstrap.html). This sets up a working jupyter notebook server and dask environment, complete with dashboard access. I hope to finish the docs up sometime tomorrow.

@ian-whitestone
Author

Thanks for the updates & all the work. None of this is too urgent from my end; I just want to be able to leverage dask + EMR down the road. I'm planning to do some more in-depth experimentation with all of this two weekends from now, so hopefully I'll be able to provide some useful feedback then.

@jcrist jcrist closed this as completed in #41 Nov 8, 2018
@jcrist
Member

jcrist commented Nov 8, 2018

Glad to hear it. The initial documentation on EMR is now in; when you get a chance, it would be useful for you to try things out and report back what was confusing/undocumented/could-be-better :).

@mnapolitano89

Wanted to comment that I used the EMR documentation and it was amazingly clear and helpful - definitely excellent.

@manugarri

Thanks for this, super helpful. The only thing I found missing is how to connect to the Dask dashboard if you are using the SSH tunnel to connect to Jupyter, since the URL provided by the notebook (/proxy/42727/status) gives a 500 error.

@AlJohri
Contributor

AlJohri commented Mar 9, 2020

@manugarri I ran into a similar issue; it turns out installing bokeh fixed it for me: #112
