
Question: Could this be used to get a dask cluster running on AWS EMR? #28

Closed
ian-whitestone opened this issue Oct 24, 2018 · 13 comments

@ian-whitestone

ian-whitestone commented Oct 24, 2018

Potentially naive question, as I just learned what YARN was at a meetup last night. I think Amazon's EMR service is built around it. With that in mind, could you use this package, or parts of it, to get a dask cluster up and running on EMR?

I know the recommended deployment is using kubernetes, but my company blocked AWS' kubernetes service (EKS) 🤦‍♂️.

Any tips/advice would be greatly appreciated.

@jcrist
Member

jcrist commented Oct 24, 2018

Definitely. I haven't tried it, but it's likely that things will just work. Steps I'd try:

@martindurant
Member

That workflow would make a very nice (v/b)log.

@ian-whitestone
Author

Thanks @jcrist.

One other question/thought that came to mind based on this code:

from dask_yarn import YarnCluster
from dask.distributed import Client

# Create a cluster where each worker has two cores and eight GB of memory
cluster = YarnCluster(environment='environment.tar.gz',
                      worker_vcores=2,
                      worker_memory="8GB")
# Scale out to ten such workers
cluster.scale(10)

# Connect to the cluster
client = Client(cluster)

How do subsequent connections to the cluster work? I imagine you wouldn't re-execute this code each time, as (I think) it would create a new cluster each time.

Would a common workflow involve printing out the master scheduler endpoint, then each time you SSH into the master node, connecting with Client('<local-ip-address>:<port>')? Or does dask-yarn create a public scheduler endpoint that can be accessed from a local machine?

@jcrist
Member

jcrist commented Oct 24, 2018

I imagine you wouldn't re-execute this code each time as (i think) it would create the new cluster each time.

Yes, that is correct. The intent is that you create a cluster, do your work, then shut the cluster down. There's nothing baked into dask-yarn currently for spinning up a persistent cluster (although this would be fairly trivial to write up if it'd be useful to you).

Since clusters are fairly quick to spin up, and keeping a persistent one idle would hog resources from others, this hasn't been a huge problem so far. If you want to exit the terminal while something is running in the background, you can use more general solutions (e.g. tmux, screen, or nohup).
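One lightweight pattern for the "persistent cluster" case is a sketch like the following; the helper names and the scheduler.json path are hypothetical (not part of dask-yarn), and it assumes YarnCluster exposes its scheduler address. The idea is to persist the address when the cluster is created so a later session can reconnect with dask.distributed.Client instead of creating a new cluster:

```python
import json

# Hypothetical helpers -- not part of dask-yarn. Persist the scheduler
# address of a running cluster so another session can reconnect to it.
def save_address(address, path="scheduler.json"):
    with open(path, "w") as f:
        json.dump({"address": address}, f)

def load_address(path="scheduler.json"):
    with open(path) as f:
        return json.load(f)["address"]

# First session (assumes `cluster` from the snippet above):
#   save_address(cluster.scheduler_address)
#
# Later session, instead of building a new YarnCluster:
#   from dask.distributed import Client
#   client = Client(load_address())
```

The reconnecting session only needs network access to the scheduler endpoint (e.g. from the master node over SSH, as described above).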

@ian-whitestone
Author

ian-whitestone commented Oct 24, 2018

I see, makes sense. Thanks for the quick response.

@jcrist
Member

jcrist commented Nov 5, 2018

I'm currently working on writing up docs on getting started on AWS EMR, but the immediate tl;dr is:

  • Just following the quickstart works, no problems encountered: http://yarn.dask.org/en/latest/quickstart.html
  • If you want to access HDFS through pyarrow, you'll need to pass worker_env={'ARROW_LIBHDFS_DIR': '/usr/lib/hadoop/lib/native/'} to the YarnCluster constructor, as well as set ARROW_LIBHDFS_DIR='/usr/lib/hadoop/lib/native/' locally. This is a common issue on other clusters, and I haven't found a general solution to it besides documentation yet (other non-python tools also have issues finding the hadoop native libraries).
  • A simple EMR bootstrap script can get you up and running quickly (not sure if I want to provide a general one, but people should be able to work off of the example one I've prepared)
  • Job submission (http://yarn.dask.org/en/latest/submit.html) works fine, and also works through the GUI interface provided by AWS. This means you can have automated EMR jobs that run Dask.
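The ARROW_LIBHDFS_DIR point above can be sketched as follows; the path is the EMR default location mentioned in the list, and the variable has to be set both locally and on the workers. The YarnCluster call is left commented since it needs a live YARN cluster:

```python
import os

# Default location of the Hadoop native libraries on EMR (per the note above)
HADOOP_NATIVE = "/usr/lib/hadoop/lib/native/"

# Set locally so client-side pyarrow can find libhdfs
os.environ["ARROW_LIBHDFS_DIR"] = HADOOP_NATIVE

# Forward the same variable to every worker via worker_env
worker_env = {"ARROW_LIBHDFS_DIR": HADOOP_NATIVE}

# from dask_yarn import YarnCluster
# cluster = YarnCluster(environment='environment.tar.gz',
#                       worker_env=worker_env)
```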

All in all this was a fairly painless process. I hope to get the docs up in the next couple of days, will ping here again once there's a PR.

@jcrist
Member

jcrist commented Nov 7, 2018

Work is happening in #41. I have a nice bootstrap action written up, and just need to write the docs now. If you're feeling antsy, you can try the bootstrap action out already (just upload to s3 and configure in your EMR setup, see https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-bootstrap.html). This sets up a working jupyter notebook server and dask environment, complete with dashboard access. I hope to finish the docs up sometime tomorrow.

@ian-whitestone
Author

Thanks for the updates & all the work. None of this is too urgent from my end; I just want to be able to leverage dask + EMR down the road. I'm planning to do some more in-depth experimentation with all of this two weekends from now, so hopefully I'll be able to provide some useful feedback then.

@jcrist jcrist closed this as completed in #41 Nov 8, 2018
@jcrist
Member

jcrist commented Nov 8, 2018

Glad to hear it. The initial documentation on EMR is now in; when you get a chance, it would be useful for you to try things out and report back what was confusing/undocumented/could-be-better :).

@mnapolitano89

Wanted to comment that I used the EMR documentation and it was amazingly clear and helpful - definitely excellent.

@manugarri

Thanks for this, super helpful. The only thing I found missing is how to connect to the Dask dashboard if you are using the SSH tunnel to connect to Jupyter, since the URL provided by the notebook (/proxy/42727/status) gives a 500 error.

@AlJohri
Contributor

AlJohri commented Mar 9, 2020

@manugarri I ran into a similar issue; it turns out installing bokeh fixed it for me: #112
