Add [GP]GPU support #1

Closed
windreamer opened this issue Apr 21, 2016 · 15 comments

windreamer (Contributor) commented Apr 21, 2016

cf: tensorflow/tensorflow#1996 (comment)

@bhack :

I think that the docker image could be based on
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/docker/README.md,
especially if we want to have GPU support:
http://www.nvidia.com/object/apache-mesos.html and https://mesosphere.com/blog/2015/11/10/mesos-nvidia-gpus/
See also NVIDIA/nvidia-docker#60.

Also, tfmesos needs to allocate and isolate [GP]GPU resources.
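
As a rough illustration of the allocation half of that, a GPU request in a Mesos task could look like the sketch below (dict/JSON form of the Mesos protobufs). It assumes the slaves advertise a "gpus" scalar resource, either via native GPU support or a custom `--resources="gpus:2"` entry; whether tfmesos should build its TaskInfo exactly this way is an open question, not current code.

```python
# Sketch only: a "gpus" scalar resource alongside the usual cpus/mem entries.
# Assumes the slave advertises "gpus" (native support or --resources="gpus:2").
task_resources = [
    {"name": "cpus", "type": "SCALAR", "scalar": {"value": 1.0}},
    {"name": "mem",  "type": "SCALAR", "scalar": {"value": 4096.0}},
    {"name": "gpus", "type": "SCALAR", "scalar": {"value": 1.0}},  # reserve one GPU for this task
]
```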

bhack commented Apr 21, 2016

See also the "nvidia" options at https://github.com/apache/mesos/blob/master/docs/configuration.md

bhack commented Apr 27, 2016

We need to think about how we want to handle TF's automatic op device placement. It can be overridden, but we need to find good defaults for the data-parallel and model-parallel cases, because users will frequently have only a few GPU resources in the cluster and many CPUs.
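
For reference, overriding the automatic placement by hand looks roughly like this minimal TF sketch; the job/task/device names are placeholders, not tfmesos defaults:

```python
import tensorflow as tf

# Keep the model parameters on a CPU-only ps task and put the compute-heavy
# ops on a GPU worker, instead of relying on TF's automatic placement.
with tf.device("/job:ps/task:0/cpu:0"):
    w = tf.Variable(tf.zeros([784, 10]), name="weights")
    b = tf.Variable(tf.zeros([10]), name="bias")

with tf.device("/job:worker/task:0/gpu:0"):
    x = tf.placeholder(tf.float32, [None, 784])
    logits = tf.matmul(x, w) + b
```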

windreamer (Contributor, Author) commented Apr 27, 2016

Yeah, automatic device placement is a bit annoying in TF. TF offers a device function, tf.train.replica_device_setter, which places variables on the ps devices in a round-robin manner, but with GPU resources in the picture that is still not ideal.

No idea how the TF team is going to solve this problem.
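
For context, typical replica_device_setter usage looks something like the sketch below; the cluster spec values are made up:

```python
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222", "ps1.example.com:2222"],
    "worker": ["worker0.example.com:2222"],
})

# replica_device_setter returns a device function: variables are assigned to
# the ps tasks in round-robin order, everything else stays on the worker.
with tf.device(tf.train.replica_device_setter(
        cluster=cluster, worker_device="/job:worker/task:0")):
    w = tf.Variable(tf.zeros([784, 10]))  # placed on /job:ps/task:0
    b = tf.Variable(tf.zeros([10]))       # placed on /job:ps/task:1
    x = tf.placeholder(tf.float32, [None, 784])
    logits = tf.matmul(x, w) + b          # placed on /job:worker/task:0
```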

bhack commented Apr 27, 2016

Can you post a comment about this on the original TF ticket, so we can get a reply from mrry of the TF team?

windreamer (Contributor, Author) commented Apr 27, 2016

OK,

this is definitely a huge pain considering my poor English :(
Anyway, the comment is on its way :)

bhack commented Apr 27, 2016

Don't worry, it seems good to me. It is just a technical discussion 😄

bhack commented Apr 27, 2016

For GPU docker images we would preferably need to run the nvidia-docker command instead of docker on the mesos slaves. How can this be handled?

windreamer (Contributor, Author) commented Apr 27, 2016

I am still thinking about how to implement GPU support, and we do not have a GPU cluster to test on right now. Maybe I can submit a PR based on my guesses, and you can do a PR-to-PR or submit a new working PR for this?

bhack commented Apr 27, 2016

Yes, that could be useful. We have a slave node with GPU resources, so we can test and continue the discussion. /cc @lenlen @mtamburrano

windreamer mentioned this issue Apr 27, 2016

vitan commented Jun 13, 2016

@bhack @windreamer, guys, I am using a quick-win solution that bypasses the nvidia-docker command. What nvidia-docker actually does is create a docker volume and then map it into the CUDA container, so I tell mesos/docker to map the wanted volume directly.

BTW, I have 5 GPU servers for testing. I'd like to share something with you guys.
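
A hedged sketch of what that direct mapping can look like in a Mesos TaskInfo's ContainerInfo (dict form of the protobufs): the volume name is whatever nvidia-docker-plugin created on the host (check `docker volume ls`), the image is just an example, and the device list depends on which GPUs the task was granted.

```python
# Sketch only: bypass nvidia-docker by mounting the driver volume and exposing
# the NVIDIA character devices directly through the Mesos docker containerizer.
container = {
    "type": "DOCKER",
    "docker": {
        "image": "tensorflow/tensorflow:latest-gpu",     # example image
        "parameters": [                                   # extra docker run flags
            {"key": "device", "value": "/dev/nvidiactl"},
            {"key": "device", "value": "/dev/nvidia-uvm"},
            {"key": "device", "value": "/dev/nvidia0"},   # one entry per granted GPU
        ],
    },
    "volumes": [
        {
            "host_path": "nvidia_driver_361.42",          # docker volume created by nvidia-docker-plugin
            "container_path": "/usr/local/nvidia",
            "mode": "RO",
        }
    ],
}
```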

bhack commented Jun 13, 2016

@vitan Thank you for the feedback. Can you try the latest version, with nvidia-docker handling the assigned GPU resources, with multiple tasks? I hope that @windreamer can contribute this upstream to TF soon to attract more users, but we need the help of people with multiple GPUs, like you, to test some use cases.

vitan commented Jun 14, 2016

@bhack I need more input from you, since I don't have any more info about your setup. By "the latest version" do you mean the latest TF? And I would appreciate it if anyone could give me a sample with multiple tasks.

bhack commented Jun 14, 2016

@lenlen Do you have a protocol for an experiment to run on 5 GPUs?

bhack commented Jun 14, 2016

@vitan By "latest version" I mean the PR at #3.
