Unified Containerization support for Apache Mesos 1.0 #12

Closed · 3 tasks done
windreamer opened this issue Sep 1, 2016 · 4 comments
Comments

@windreamer (Contributor) commented Sep 1, 2016

cf: tensorflow/tensorflow#1996 (comment)
Quoting @klueska:

Regarding problems figuring out how to enable GPU support -- I can help with that. We basically mimic the functionality of nvidia-docker so that anything that runs in nvidia-docker should now be able to run in mesos as well. Consider the following example:

```
$ mesos-master \
      --ip=127.0.0.1 \
      --work_dir=/var/lib/mesos
$ mesos-agent \
      --master=127.0.0.1:5050 \
      --work_dir=/var/lib/mesos \
      --image_providers=docker \
      --executor_environment_variables="{}" \
      --isolation="docker/runtime,filesystem/linux,cgroups/devices,gpu/nvidia"
$ mesos-execute \
      --master=127.0.0.1:5050 \
      --name=gpu-test \
      --docker_image=nvidia/cuda \
      --command="nvidia-smi" \
      --framework_capabilities="GPU_RESOURCES" \
      --resources="gpus:1"
```

The flags of note here are:

```
mesos-agent:
      --isolation="docker/runtime,filesystem/linux,cgroups/devices,gpu/nvidia"
mesos-execute:
      --resources="gpus:1"
      --framework_capabilities="GPU_RESOURCES"
```

When launching an agent, both the cgroups/devices and the gpu/nvidia isolation flags are required for Nvidia GPU support in Mesos. Likewise, the docker/runtime and filesystem/linux flags are needed to enable running docker images with the unified containerizer.

The cgroups/devices flag tells the agent to restrict access to a specific set of devices when launching a task (i.e. a subset of the devices listed in /dev). The gpu/nvidia isolation flag allows the agent to grant / revoke access to GPUs on a per-task basis. It also handles automatic injection of the Nvidia libraries / volumes into the container if the label com.nvidia.volumes.needed = nvidia_driver is present in the docker image. The docker/runtime flag allows the agent to parse docker image files and containerize them. The filesystem/linux flag says to use linux specific functionality when creating / entering the new mount namespace for the container filesystem.

In addition to these agent isolation flags, Mesos requires frameworks that want to consume GPU resources to have the GPU_RESOURCES framework capability set. Without this, the master will not send an offer to a framework if it contains GPUs. The choice to make frameworks explicitly opt-in to this GPU_RESOURCES capability was to keep legacy frameworks from accidentally consuming a bunch of non-GPU resources on any GPU-capable machines in a cluster (and thus blocking your GPU jobs from running). It's not that big a deal if all of your nodes have GPUs, but in a mixed-node environment, it can be a big problem.
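The opt-in behavior described above can be illustrated with a small sketch. This is plain Python modeling the master's decision, not actual Mesos code: a framework that does not advertise `GPU_RESOURCES` never sees offers from GPU-bearing agents.

```python
# Illustrative sketch only -- models the GPU_RESOURCES opt-in
# described above; this is not Mesos master source code.

def should_offer(agent_resources, framework_capabilities):
    """Return True if the master may send this agent's offer
    to a framework with the given capability set."""
    has_gpus = agent_resources.get("gpus", 0) > 0
    if has_gpus and "GPU_RESOURCES" not in framework_capabilities:
        # Shield legacy frameworks from GPU-capable agents, so they
        # cannot accidentally hoard a GPU machine's non-GPU resources.
        return False
    return True

# A legacy framework is never offered a GPU agent:
print(should_offer({"cpus": 4, "gpus": 2}, set()))               # False
# A GPU-capable framework receives the offer:
print(should_offer({"cpus": 4, "gpus": 2}, {"GPU_RESOURCES"}))   # True
# Agents without GPUs are offered to everyone:
print(should_offer({"cpus": 4}, set()))                          # True
```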

Finally, the --resources="gpus:1" flag tells the framework to only accept offers that contain at least 1 GPU. This is just an example of consuming a single GPU, you can (and probably should) build your framework to do something more interesting.
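The framework-side counterpart of `--resources="gpus:1"` can be sketched the same way (again, illustrative Python rather than real framework code): only offers carrying at least the requested number of GPUs are acceptable.

```python
# Illustrative sketch only: accept an offer only if it carries
# at least the requested number of GPUs.

def offer_satisfies(offer_resources, wanted_gpus=1):
    """True if this offer has enough GPUs for the task."""
    return offer_resources.get("gpus", 0) >= wanted_gpus

print(offer_satisfies({"cpus": 8, "gpus": 1}))   # True
print(offer_satisfies({"cpus": 8}))              # False
```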

Hopefully you can extrapolate things from there. Let me know if you have any questions.

We can try to enable unified containerization for Mesos >= 1.0:

- [ ] Add `com.nvidia.volumes.needed = nvidia_driver` label to tfmesos/tfmesos image
- [ ] Get the current Mesos master version from `MasterInfo.version`
- [ ] If the Mesos version is < 1.0, fall back to the Mesos + Docker + nvidia-docker combination
- [ ] Otherwise, instead of using nvidia-docker volume parameters, set the `GPU_RESOURCES` capability
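The version check proposed above could look roughly like this. The helper names here are hypothetical (not part of tfmesos); the only assumption from the source is that `MasterInfo.version` is a version string such as `"0.28.2"` or `"1.0.1"`.

```python
# Sketch of the proposed version-based fallback.  Helper names
# are hypothetical, not actual tfmesos functions.

def parse_version(version_string):
    """Turn 'major.minor.patch' into a comparable tuple of ints."""
    return tuple(int(part) for part in version_string.split("."))

def choose_gpu_strategy(master_version):
    """Pick the launch path based on the Mesos master version."""
    if parse_version(master_version) < (1, 0):
        # Fall back to Mesos + Docker + nvidia-docker volume parameters.
        return "nvidia-docker"
    # Mesos >= 1.0: unified containerizer + GPU_RESOURCES capability.
    return "unified"

print(choose_gpu_strategy("0.28.2"))  # nvidia-docker
print(choose_gpu_strategy("1.0.1"))   # unified
```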
@klueska commented Sep 1, 2016

Note that:

* Add com.nvidia.volumes.needed = nvidia_driver label to tfmesos/tfmesos image

is only necessary if you don't build your docker image off of one of the standard nvidia images. nvidia-docker uses this same label to decide if it should inject the volume as well.
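For an image that is not derived from one of the standard nvidia base images, the label could be set in the Dockerfile itself. A minimal sketch (the base image here is just a placeholder):

```dockerfile
# Placeholder base image -- any image needing the driver volume applies.
FROM ubuntu:16.04
# The label that nvidia-docker and Mesos' gpu/nvidia isolator check
# when deciding whether to inject the Nvidia driver volume.
LABEL com.nvidia.volumes.needed="nvidia_driver"
```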

@windreamer (Contributor, Author)

Okay, tfmesos/tfmesos is built from the nvidia image; I will remove this check item.

@windreamer (Contributor, Author)

cf: douban/pymesos#18
pymesos will use the Mesos HTTP API starting from version 0.2.0.

@windreamer (Contributor, Author)

Done in #19.
