
WIP: Forward device requests #7124

Closed · wants to merge 4 commits into from

Conversation

yoanisgil

Add support for Nvidia GPUs (solves #6691)

Signed-off-by: Yoanis Gil <gil.yoanis@gmail.com>
@yoanisgil (Author)

This PR is pretty much WIP, as I have tested only one use case; it also depends on docker/docker-py#2471.
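For context, here is a minimal sketch of what that docker-py dependency provides, assuming the DeviceRequest type proposed in docker/docker-py#2471 (it later shipped in docker-py 4.3); the image and command below are just examples:

import docker
from docker.types import DeviceRequest

client = docker.from_env()

# Roughly `docker run --gpus=all nvidia/cuda:9.0-base nvidia-smi`;
# count=-1 requests all available GPUs.
output = client.containers.run(
    "nvidia/cuda:9.0-base",
    "nvidia-smi",
    device_requests=[DeviceRequest(count=-1, capabilities=[["gpu"]])],
)
print(output.decode())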

@yoanisgil (Author)

@rumpl @ndeloof @ulyssessouza I'm happy to make this PR more compliant (tests, documentation, etc.) as long as you think it is heading in the right direction.

@@ -597,6 +599,21 @@
}
}
}
},
"device_requests": {
@ndeloof (Contributor) commented Jan 7, 2020

The leading schema definition is https://github.com/docker/cli/tree/master/cli/compose/schema/data (see #7047). I'm currently working on getting this schema duplication issue resolved.

@yoanisgil (Author)

That actually raises a very good point. Maybe the GPU allocation request should be based on generic_resources? Just saying because of this: https://github.com/docker/cli/blob/master/cli/compose/loader/full-example.yml

But then that would require a translation to a DeviceRequest.
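To make that translation concrete, here is a hypothetical sketch (the helper name and the input shape are mine, not from the PR) of mapping a parsed generic_resources entry onto docker-py's DeviceRequest:

from docker.types import DeviceRequest

def generic_resource_to_device_request(resource):
    # Hypothetical input: a parsed discrete_resource_spec mapping,
    # e.g. {"kind": "gpu", "value": 2} as in full-example.yml.
    if resource.get("kind") != "gpu":
        raise ValueError("only gpu resources map to a DeviceRequest")
    return DeviceRequest(
        driver="nvidia",
        count=int(resource.get("value", -1)),
        capabilities=[["gpu"]],
    )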

@@ -151,6 +151,8 @@

"external_links": {"type": "array", "items": {"type": "string"}, "uniqueItems": true},
"extra_hosts": {"$ref": "#/definitions/list_or_dict"},
"device_requests": {"$ref": "#/definitions/device_requests"},
@ndeloof (Contributor)

Within schema v3, runtime constraints have been moved into deploy.resources. That sounds like a better place for such a device allocation request.

@yoanisgil (Author)

I can move it, but it seems to me it's better to wait until you figure out how to reconcile the schema definitions between docker and compose before making any other changes ...

@yoanisgil left a comment

Feedback

sileht added commits to sileht/compose that referenced this pull request Jan 30, 2020
@lshamis commented Mar 5, 2020

I thought a docker-compose file tries to match the docker command line.

If I want to set ipc, pid, privileged, ..., the docker CLI would look like:

docker run --ipc=host --pid=host --privileged=true ...

and the docker-compose looks like:

services:
    myservice:
        ipc: "host"
        pid: "host"
        privileged: true
        ...

With newer versions of Docker, the command line to use the NVIDIA runtime is:

docker run --gpus=all --rm nvidia/cuda nvidia-smi

so I would assume the docker-compose equivalent would look like:

services:
    myservice:
        gpus: "all"
        ...

but with this PR (according to the comments in #6691), GPU usage would look like:

services:
    gpu:
        image: 'nvidia/cuda:9.0-base'
        command: 'nvidia-smi'
        device_requests:
            - capabilities:
               - "gpu"

which seems much more complicated to me.

@hadim commented Jul 1, 2020

Is anyone still working on this PR?

@xkortex commented Jul 6, 2020

@lshamis I believe the motivation behind the device_requests syntax is the ongoing long-term goal of making swarm mode as generic as possible. docker run --gpus=all makes sense in the scope of docker run, but not so much in docker swarm. However, I agree that the requests notation is more confusing. As awesome as the aspirations for swarm are, there is still a user base relying on compose for single-machine orchestration.

network_mode: host is a compose v3 field that is incompatible with swarm. Is it plausible to make this syntax local-machine-only and throw an error if it is used in swarm mode, in a similar manner to network_mode?

services:
    myservice:
        gpus: "all"
        ...

I think it violates the law of least surprise more to lack this syntax when docker run supports it.

Nearly my entire department is still using the 2.x syntax for runtime: nvidia, because we use GPU-enabled containers day in and day out and cannot always set a default runtime in daemon.json.

To give you an idea, here's the side quest I currently have to embark on to test my ML API, versus what used to be simply docker-compose up:

  • docker-compose up my application stack. The ML container dies because there is no GPU, so the worker times out.
  • docker run --gpus=all --name=ml_server --network=myapp_default -v all_the_args_normally_in_compose my_ml_image:latest to start the CUDA container.
  • Restart the worker container, which fortunately still accepts the container name for DNS lookup, but I have to remember to set the --name flag when I start the CUDA container, or else the worker can't talk to it.

@yoanisgil (Author)

Is anyone still working on this PR?

I'm not (mostly because of the lack of feedback and/or interest).

@hadim commented Aug 7, 2020

FYI: docker/docker-py#2471 has been merged.

@visheratin

Can someone tell us what is stopping this PR from being merged?

@edurenye commented Sep 2, 2020

There are conflicting files; it needs to be rebased, I guess.

@PhotoTeeborChoka commented Sep 7, 2020

Guys (@rumpl @ndeloof @ulyssessouza), would it be possible to give the necessary feedback to @yoanisgil so that he can finalize this PR, or so that somebody can take over and finalize it, so that it can be merged? ML applications, as well as other GPU-based ones, cannot be used very well with docker-compose until compatibility with Docker 19.03's --gpus flag is complete.

The largest hindrance, the lack of docker-py support, is now gone, so let's make this work!

@ysyyork commented Sep 21, 2020

Any progress on merging this into master?

@alexandrecharpy

Agree with @PhotoTeeborChoka. If this is just a rebase/conflict issue, would it speed things up if I rebased a fork from @yoanisgil (with his consent)?

@yoanisgil (Author)

@alexandrecharpy please go ahead. At the time I submitted the PR I had the time and the will to make this happen, but unfortunately that's no longer the case.

@glours (Contributor) commented Jul 27, 2022

Thanks for taking the time to create this issue/pull request!

Unfortunately, Docker Compose V1 has reached end-of-life and we are not accepting any more changes (except for security issues). Please try to reproduce your issue with Compose V2, or rewrite your pull request against the v2 branch, and create a new issue or PR with the relevant Compose V2 information.

@glours closed this Jul 27, 2022