Proposal: Device Support #2682

Open
dperny opened this Issue Jul 2, 2018 · 13 comments

@dperny
Member

dperny commented Jul 2, 2018

This is a rough overview of a proposed design for device support in Swarm. This is a possible implementation of #1244. The objective is to implement, in a way sensible to the cluster, support for devices. Please note that this is not yet on the road map; this is an early-stage proposal.

For community members, even if you don't or haven't contributed directly to swarmkit:

Does this meet or exceed the community's needs for device support? Is the UI flexible, ergonomic, and easy to use? Feel free to leave a comment explaining what is good and bad about this proposal.

Overview

Devices will be added as a first-class feature of swarm. The user will be able to define device classes, to which devices belong. The user will be able to register devices on specific nodes, indicating which class the device belongs to and the path at which it is located. The user can then specify the device classes that a task needs in order to execute, and the swarmkit scheduler will assign a device to the task and place the task on the node with that device.

Goals

The goal of this proposal is to implement the most basic device-aware scheduling system, so that swarmkit can fully support devices in a clustered environment.

Non-Goals

Non-goals of this proposal are to support things like security profiles or permissions. Additionally, though the device management workflow presented in this proposal is a bit onerous and requires manual registration of devices, implementing automatic device detection and registration is out of scope.

Detailed Design

Data Model

The basic data model of devices is as follows:

  1. Device Classes represent a set of interchangeable and equivalent devices equally suited for scheduling. All devices belong to exactly one device class.
  2. Individual devices will be registered on a per-node basis as belonging to a class. Once registered, a device may be assigned to a task.
  3. Task Specs will be updated to include the desired device classes and attachment options.

Devices are host-specific resources, but different devices on the same or different hosts may be treated as interchangeable or equivalent. For example, many nodes in the cluster could each have a GPU attached. Though the actual GPU on different nodes may be different, and there may even be more than one GPU per node, their functionality is equivalent, and any of these nodes is an equally suitable candidate for scheduling. Further, some devices should only be used by one task in the cluster, whereas others can be shared between as many tasks as needed.

Device classes are the object that represents the top-level concept of a device. Tasks can only specify devices in terms of device classes they desire. The specific device chosen is the prerogative of the swarmkit scheduler.

The individual devices available are a property of the node. A node may have as many devices specified as necessary. In keeping with the security pattern of not trusting workers, devices are always registered through the swarmkit manager, never self-reported or self-discovered.

Task Specs will include a list of device classes and options desired, including where in the task’s file system to place the device. Tasks must be prepared to accept any device in the class as equivalent. When a task is created, it will have the full run-time device parameter included in the object.
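To make the data model concrete, here is a minimal Go sketch of how a class name maps to candidate nodes. The type and function names (Device, Node, candidateNodes) are made up for illustration and are not part of swarmkit; the sketch only assumes that each node carries a list of manager-registered devices, each tagged with a class.

```go
package main

import (
	"fmt"
	"sort"
)

// Device is one device registered on a node (hypothetical type for this sketch).
type Device struct {
	Class      string // name of the device class this device belongs to
	PathOnHost string // where the device lives on the host
}

// Node holds the devices that were registered for it through the manager.
type Node struct {
	ID      string
	Devices []Device
}

// candidateNodes returns the sorted IDs of all nodes that have at least one
// device of the requested class. Which physical device it is does not matter,
// because devices in a class are treated as interchangeable.
func candidateNodes(nodes []Node, class string) []string {
	var ids []string
	for _, n := range nodes {
		for _, d := range n.Devices {
			if d.Class == class {
				ids = append(ids, n.ID)
				break // one matching device is enough to make the node a candidate
			}
		}
	}
	sort.Strings(ids)
	return ids
}

func main() {
	nodes := []Node{
		{ID: "node-a", Devices: []Device{{"gpu", "/dev/nvidia0"}, {"gpu", "/dev/nvidia1"}}},
		{ID: "node-b", Devices: []Device{{"sensor", "/dev/sensor"}}},
		{ID: "node-c", Devices: []Device{{"gpu", "/dev/nvidia0"}}},
	}
	fmt.Println(candidateNodes(nodes, "gpu"))
}
```

A task asking for the gpu class could land on node-a or node-c here; node-b is never considered because it has no device of that class.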

User Interface

Device support will introduce a new command, with subcommands, for the management of devices. This is the biggest change: a new set of subcommands to manage device classes:

Usage: docker swarm device COMMAND

Manage Swarm devices

Commands:
  add      Add a new device class to the swarm
  ls       List device classes on this swarm
  inspect  Show information about a device class and its devices
  rm       Remove a device class from the swarm

The add command adds a new device class to the swarm:

Usage: docker swarm device add [OPTIONS] CLASS

Add a new device class to the swarm

Options:
     --shared      Allow this class to be shared between tasks
     --label list  Set metadata on this device class

The ls command will allow listing all available device classes:

Usage: docker swarm device ls [OPTIONS]

List device classes on this swarm

Options:
  -q, --quiet   Only display IDs
  -f, --filter  Filter output based on conditions provided

The inspect command will allow showing full information about device classes, as well as allowing the user to include all devices currently registered belonging to a device class.

Usage: docker swarm device inspect [OPTIONS] CLASS [CLASS]

Display detailed information on one or more device classes

Options:
  -f, --format string   Format the output
      --pretty          Print the information in a human friendly format
      --devices         Include devices belonging to this class

The rm command is similar to all other rm commands, and its usage is obvious, with the caveat that removal of a device class will be disallowed if any device in the class is in use by a task. There is no update command, as device classes will not be treated as updateable.

To manage particular devices on nodes, the existing node update command will receive new flags:

--device-add device  Register a device on a node with the swarm
--device-rm device   Deregister a device with the swarm

Similar to other options like ports and volumes, devices will accept both short- and long-form versions.

The short form will take the format target:class, where target is the path of the device on the host, and class is the device class to register with, as such:

--device-add /dev/nvidia0:gpu

The long form of the command allows specifying these options independently, and allows future expansion of options for devices (such as host-specific cgroup options):

--device-add target="/dev/nvidia0",class="gpu"

The --device-rm option for node update acts as expected, but will disallow removing a device that is in use.
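As a rough illustration of the two forms, here is a Go sketch of a parser that accepts either target:class or target=...,class=.... The NodeDevice type and parseNodeDevice function are hypothetical names, and details such as tolerating surrounding quotes and splitting the short form on the last colon are my assumptions, not something the proposal pins down.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// NodeDevice is a hypothetical parsed form of a --device-add value.
type NodeDevice struct {
	Target string // path of the device on the host
	Class  string // device class to register the device with
}

// parseNodeDevice accepts the short form "target:class" or the long form
// "target=<path>,class=<name>".
func parseNodeDevice(s string) (NodeDevice, error) {
	var d NodeDevice
	if strings.Contains(s, "=") {
		// Long form: comma-separated key=value pairs.
		for _, field := range strings.Split(s, ",") {
			kv := strings.SplitN(field, "=", 2)
			if len(kv) != 2 {
				return d, fmt.Errorf("invalid field %q", field)
			}
			value := strings.Trim(kv[1], `"`) // tolerate quoted values
			switch kv[0] {
			case "target":
				d.Target = value
			case "class":
				d.Class = value
			default:
				return d, fmt.Errorf("unknown key %q", kv[0])
			}
		}
	} else {
		// Short form: split on the last colon (assumption: class names
		// contain no colon, host paths might).
		i := strings.LastIndex(s, ":")
		if i < 0 {
			return d, errors.New("short form must be target:class")
		}
		d.Target, d.Class = s[:i], s[i+1:]
	}
	if d.Target == "" || d.Class == "" {
		return d, errors.New("both target and class are required")
	}
	return d, nil
}

func main() {
	for _, arg := range []string{"/dev/nvidia0:gpu", `target="/dev/nvidia0",class="gpu"`} {
		d, err := parseNodeDevice(arg)
		fmt.Println(d.Target, d.Class, err)
	}
}
```

Both inputs in main resolve to the same target and class, which is the point of offering the long form purely as a more extensible spelling.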

Services would also support new flags. Service create will have a new option, --device, with both a long form and a short form. The short form will be the reciprocal of the --device flag on the node, taking the form class:path. It will also optionally support a third rwm field, mirroring the --device flag on docker run. The long form will take discrete arguments, and allow the user to specify cgroup options as supported in th

The short form, for mounting a GPU:

--device gpu:/dev/nvidia0

Services would also support a long form of the command:

--device class="gpu",path="/dev/nvidia0"

Note: the long form of the command could possibly support further cgroup options, as allowed in the docker REST API for container creation.

Service update would include --device-add and --device-rm flags. --device-add syntax will be equivalent to the --device flag of create. Because a task may have more than one device of a class mounted into its running container, --device-rm would require both the class and path of the device to disambiguate the specific device that is to be removed.

--device-rm class="gpu",path="/dev/nvidia0"
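The reason --device-rm needs both fields can be shown with a small Go sketch. DeviceSpec and removeDevice are hypothetical names; the sketch assumes a service's device list is simply a slice of class/path pairs.

```go
package main

import "fmt"

// DeviceSpec is a hypothetical entry in a service's device list.
type DeviceSpec struct {
	Class string // requested device class
	Path  string // path inside the task where the device is mounted
}

// removeDevice drops the first entry that matches both class and path and
// reports whether anything was removed. Matching on class alone would be
// ambiguous when a service mounts several devices of the same class.
func removeDevice(specs []DeviceSpec, class, path string) ([]DeviceSpec, bool) {
	out := make([]DeviceSpec, 0, len(specs))
	removed := false
	for _, s := range specs {
		if !removed && s.Class == class && s.Path == path {
			removed = true
			continue
		}
		out = append(out, s)
	}
	return out, removed
}

func main() {
	specs := []DeviceSpec{
		{Class: "gpu", Path: "/dev/nvidia0"},
		{Class: "gpu", Path: "/dev/nvidia1"},
	}
	specs, ok := removeDevice(specs, "gpu", "/dev/nvidia0")
	fmt.Println(specs, ok)
}
```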

REST API

The Docker engine REST API would require a new set of endpoints to accommodate the concept of device classes. These endpoints would return the JSON representation of the objects described in the example Protocol Buffers. These endpoints would be as follows:

GET    /devices             List device classes
POST   /devices/create      Create a new device class
GET    /devices/{id}        Inspect a device class
POST   /devices/{id}/update Update a device class
DELETE /devices/{id}        Delete a device class
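For a feel of what such a JSON representation might look like, here is a hedged Go sketch. The JSON field names (ID, Name, Shared, Labels) are assumptions chosen to resemble other engine API objects; the proposal does not define the wire format, and encodeDeviceClass is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// DeviceClass is a hypothetical API-side view of a device class. The field
// names are illustrative only.
type DeviceClass struct {
	ID     string            `json:"ID"`
	Name   string            `json:"Name"`
	Shared bool              `json:"Shared"`
	Labels map[string]string `json:"Labels,omitempty"`
}

// encodeDeviceClass renders a device class as a JSON string.
func encodeDeviceClass(dc DeviceClass) (string, error) {
	b, err := json.Marshal(dc)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	out, err := encodeDeviceClass(DeviceClass{ID: "abc123", Name: "gpu", Shared: false})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out)
}
```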

Protocol Buffers

In swarm, protocol buffers define the internal API and object structure.

The DeviceClass proto will form a new top-level type, like a Network or a Service. It will have an ID and a name.

// DeviceClass is a specification for a particular device, zero or more of
// which may be available on the cluster. It refers to the general class of
// devices that the user wishes to be assumed as interchangeably usable. For
// example, a cluster may have many possible block devices on many nodes, but
// any of them are valid. The specific implementation of a specific device on a
// node is provided by the node. A particular device may only belong to one
// device class.
message DeviceClass {
  string id = 1;

  Meta meta = 2 [(gogoproto.nullable) = false];

  // Shared represents whether this device can be shared between many tasks, or
  // whether it should be uniquely mapped to a particular task. Shared devices
  // may have any number of tasks assigned to them.
  //
  // Note that Shared has strong security risks; shared devices may be used by
  // tasks to communicate with one another.
  bool shared = 3;
}

The Device proto is included as a repeated field on Node specs. It defines a particular available device belonging to a class.

// Device represents a particular available device on a node. It is one
// particular instance of a DeviceClass, and is interchangeable with other
// devices in the DeviceClass
message Device {
  // DeviceClassID is the ID of the device class that this device belongs to.
  string device_class_id = 1;

  // PathOnHost is the path in the host's filesystem that this device should be
  // mounted from.  For example, a block device may have this value as
  // "/dev/sda". A particular device may belong to only 1 device class;
  // assigning a device to more than one class may cause it to be conflictingly
  // scheduled.
  string path_on_host = 2;
}

The DeviceAttachmentSpec is a repeated field found in the TaskSpec proto, and defines the devices that a task should be attached to.

// DeviceAttachmentSpec represents the spec for a device attachment
message DeviceAttachmentSpec {
  // DeviceClass is the ID or name of the device class that is to be used for
  // this spec. The actual device may be any device of this class on any node.
  string device_class = 1;

  // Path represents the path in the task's filesystem that this device should
  // be mounted at.
  string path = 2;
 
  // DeviceCgroupRules represents the cgroup rules that should be applied to
  // this device.
  repeated string device_cgroup_rules = 16;
}

The DeviceAttachment is a repeated field on Tasks which defines specifically the run-time parameters of a device attachment for a particular task.

// DeviceAttachment represents the run-time configuration of a device in use.
// It includes both the path on the host and the path in the Task of the
// device, because a Task may have many devices of the same class reserved, and
// those reservations would be otherwise indistinguishable.
message DeviceAttachment {
  // DeviceClassID is the ID of the device class used for this device
  string device_class_id = 1;

  // PathOnHost is the path on the host's filesystem of the device to be used
  // by the task.
  string path_on_host = 2;

  // PathInTask is the path in the task's filesystem that the device will be
  // mounted at.
  string path_in_task = 3;

  // DeviceCgroupRules represents the Cgroup rules that should be applied to
  // this device.
  repeated string device_cgroup_rules = 16;
}
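To show how the three messages relate, here is a Go sketch that combines a requested DeviceAttachmentSpec with a chosen Device to produce the run-time DeviceAttachment. The Go structs are hand-written mirrors of the protos above, not generated code, and newAttachment is a hypothetical helper. The sketch assumes the spec's device class has already been resolved to an ID.

```go
package main

import (
	"errors"
	"fmt"
)

// Hand-written mirrors of the proposed messages, for illustration only.
type Device struct {
	DeviceClassID string
	PathOnHost    string
}

type DeviceAttachmentSpec struct {
	DeviceClass       string // assumed to be an already-resolved class ID
	Path              string
	DeviceCgroupRules []string
}

type DeviceAttachment struct {
	DeviceClassID     string
	PathOnHost        string
	PathInTask        string
	DeviceCgroupRules []string
}

// newAttachment pairs what the task asked for (class, path in task, cgroup
// rules) with the concrete device that was picked (path on host).
func newAttachment(spec DeviceAttachmentSpec, dev Device) (DeviceAttachment, error) {
	if spec.DeviceClass != dev.DeviceClassID {
		return DeviceAttachment{}, errors.New("device does not belong to the requested class")
	}
	return DeviceAttachment{
		DeviceClassID:     dev.DeviceClassID,
		PathOnHost:        dev.PathOnHost,
		PathInTask:        spec.Path,
		DeviceCgroupRules: append([]string(nil), spec.DeviceCgroupRules...), // copy, do not alias
	}, nil
}

func main() {
	spec := DeviceAttachmentSpec{DeviceClass: "class-gpu", Path: "/dev/gpu"}
	dev := Device{DeviceClassID: "class-gpu", PathOnHost: "/dev/nvidia1"}
	att, err := newAttachment(spec, dev)
	fmt.Println(att.PathOnHost, att.PathInTask, err)
}
```

The spec never mentions a host path and the node's Device never mentions a path in the task; only the attachment carries both, which is what lets two reservations of the same class be told apart.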

Swarmkit Implementation

The device allocator will be implemented as a sub-component of the Scheduler. It will be created when a scheduler is created, and keep track of the available devices in the cluster. Scheduling for available devices forms part of the constraint-solving portion of the scheduler.
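A very rough Go sketch of the bookkeeping such an allocator might do follows. Everything here (Allocator, Reserve, Release, ErrNoDevice) is hypothetical and in-memory only; the real component would live inside the scheduler and work from the cluster store.

```go
package main

import (
	"errors"
	"fmt"
)

// device tracks one registered device and how many tasks currently use it.
type device struct {
	class, path string
	users       int
}

// Allocator is a hypothetical in-memory device tracker.
type Allocator struct {
	shared  map[string]bool      // class name -> may be shared between tasks
	devices map[string][]*device // node ID -> devices registered on that node
}

var ErrNoDevice = errors.New("no free device of the requested class on this node")

func NewAllocator() *Allocator {
	return &Allocator{shared: map[string]bool{}, devices: map[string][]*device{}}
}

func (a *Allocator) AddClass(name string, shared bool) { a.shared[name] = shared }

func (a *Allocator) Register(node, class, path string) {
	a.devices[node] = append(a.devices[node], &device{class: class, path: path})
}

// Reserve picks a device of the class on the node. Devices in a shared class
// can always be handed out; exclusive devices only while unused.
func (a *Allocator) Reserve(node, class string) (string, error) {
	for _, d := range a.devices[node] {
		if d.class != class {
			continue
		}
		if a.shared[class] || d.users == 0 {
			d.users++
			return d.path, nil
		}
	}
	return "", ErrNoDevice
}

// Release gives back one reservation of the device at path on the node.
func (a *Allocator) Release(node, path string) {
	for _, d := range a.devices[node] {
		if d.path == path && d.users > 0 {
			d.users--
			return
		}
	}
}

func main() {
	a := NewAllocator()
	a.AddClass("gpu", false)
	a.Register("node-a", "gpu", "/dev/nvidia0")
	p, err := a.Reserve("node-a", "gpu")
	fmt.Println(p, err)
	_, err = a.Reserve("node-a", "gpu") // the only GPU is already taken
	fmt.Println(err)
}
```

In constraint-solving terms, a node passes the device filter for a task only if Reserve would succeed for every class the task requests.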

Task updates present a difficulty for devices. If devices in the class can be shared between tasks (marked --shared), then there is no problem. However, with exclusive devices, the start-first update strategy would fail unless at least one device in the class is free, so that the new task can start with a fresh device, allowing the old task to shut down and free its in-use device. There is no easy solution for this, I think. We should instead document thoroughly that using start-first with devices may cause trouble.
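The start-first problem can be reproduced with a toy model in Go. The pool type and the two update functions are hypothetical; the pool just counts free exclusive devices of one class on one node.

```go
package main

import (
	"errors"
	"fmt"
)

// pool counts the free exclusive devices of one class (hypothetical model).
type pool struct{ free int }

func (p *pool) reserve() error {
	if p.free == 0 {
		return errors.New("no free device")
	}
	p.free--
	return nil
}

func (p *pool) release() { p.free++ }

// updateStartFirst models start-first: the replacement task must get a
// device before the old task gives its device back.
func updateStartFirst(p *pool) error {
	if err := p.reserve(); err != nil {
		return fmt.Errorf("start-first update blocked: %w", err)
	}
	p.release() // old task shuts down afterwards
	return nil
}

// updateStopFirst models stop-first: the old task frees its device first.
func updateStopFirst(p *pool) error {
	p.release()
	return p.reserve()
}

func main() {
	// One device in total, currently held by the running task.
	p := &pool{free: 0}
	fmt.Println("start-first:", updateStartFirst(p))
	fmt.Println("stop-first: ", updateStopFirst(p))
}
```

With a single device held by the running task, start-first has nothing to hand to the replacement, while stop-first goes through at the cost of a gap in service.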

Error Handling

Because of the nature of distrusting the workers, it is difficult or impossible for swarm to “prove” that a given device exists on a node, or performs as the user expects. Swarm will therefore make no attempts to verify the correctness of provided user data. If a device is mistakenly assigned to the wrong class, or if it does not exist at all, the task is expected to fail to start. It should enter a terminal state of FAILED and should include an error message explaining that the errant device is at fault.
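A sketch of the expected failure behaviour, in Go, might look like the following. TaskState, startTask, and the injected exists function are all hypothetical; a real executor would consult the host and would not be handed a lookup function. Note that this only catches a missing device; a device registered under the wrong class would still surface as a failure inside the task itself.

```go
package main

import "fmt"

// TaskState is a reduced, hypothetical set of task states for this sketch.
type TaskState string

const (
	StateRunning TaskState = "RUNNING"
	StateFailed  TaskState = "FAILED"
)

// startTask checks every assigned host path with the supplied existence
// check and fails the task, naming the errant device, if one is missing.
func startTask(hostPaths []string, exists func(path string) bool) (TaskState, string) {
	for _, p := range hostPaths {
		if !exists(p) {
			return StateFailed, fmt.Sprintf("device %q not found on host", p)
		}
	}
	return StateRunning, ""
}

func main() {
	// Pretend only /dev/nvidia0 exists on this node.
	present := map[string]bool{"/dev/nvidia0": true}
	exists := func(p string) bool { return present[p] }

	state, msg := startTask([]string{"/dev/nvidia0", "/dev/nvidia1"}, exists)
	fmt.Println(state, msg)
}
```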

Notably, in this proposal, there will be no attempt to “downweight” or otherwise attempt to avoid a node with a failing device. This functionality may come later, but not as part of this proposal.

Security

It must be understood that once on the host, swarmkit has no control over how a task uses devices. If improperly used, devices can be an extreme security hole for swarm tasks. For example, mounting block devices may allow read or write access to all of their contents. If the host’s primary block device were mounted into a task, that task could have full access to the host filesystem.

About Generic Resources

Swarmkit currently includes a feature called “Generic Resources”, which serves to allow scheduling based on kinds of resources. The design doc for Generic Resources [2] outlines their use, which overlaps with the use case of this proposal. Specifically, Generic Resources already keep track of resources which are available and in use on a cluster.

However, GenericResource has a notable deficiency: it lacks context about the runtime usage of a particular reserved resource. Essentially, a task is only informed of a resource at runtime, and the swarmkit worker has no way to know how to make use of a particular resource, which makes the feature quite useless.

The obvious solution would be to include in the TaskSpec instructions for how to make use of a resource. However, this puts the information about how to use a resource separate from the information about what resource is required. A TaskSpec might, for example, request in its ResourceReservations 3 GPUs, but in its ContainerSpec in a hypothetical Devices field, only use 2 of them, leaving 1 wasted. Or, alternatively, a TaskSpec might include instructions for mounting an audio device, but not include a reservation for one. This means that run-time checks would be needed to make sure that the requested resources match the runtime instructions for using resources. Instead, this proposal uses the type system to make this kind of mismatch impossible to express.
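The mismatch described above is easy to demonstrate. The Go sketch below models the decoupled shape (reservations in one place, usage in another) and the check it would force on us; decoupledSpec and mismatches are hypothetical names, not swarmkit types.

```go
package main

import "fmt"

// decoupledSpec mimics keeping reservations and usage in separate fields.
type decoupledSpec struct {
	Reserved map[string]int // class -> number of devices reserved
	Mounts   []string       // class of each device the container mounts
}

// mismatches returns, per class, reserved minus used, omitting classes where
// the two agree. Positive means wasted reservations; negative means a mount
// with no reservation behind it.
func mismatches(s decoupledSpec) map[string]int {
	diff := make(map[string]int)
	for class, n := range s.Reserved {
		diff[class] += n
	}
	for _, class := range s.Mounts {
		diff[class]--
	}
	for class, n := range diff {
		if n == 0 {
			delete(diff, class)
		}
	}
	return diff
}

func main() {
	spec := decoupledSpec{
		Reserved: map[string]int{"gpu": 3},
		Mounts:   []string{"gpu", "gpu", "audio"},
	}
	// One GPU is reserved but unused; the audio mount has no reservation.
	fmt.Println(mismatches(spec))
}
```

With a single list of DeviceAttachmentSpec entries, each entry is both the reservation and the usage instruction, so there is nothing left for a function like mismatches to check.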

Additionally, we cannot simply annotate or augment the GenericResource type in the task resource reservations, because the same type is shared between the TaskSpec (requested resources), the Task itself (assigned resources), and the Node (available resources). However, these uses all serve different purposes. Available resources don’t need to be aware of how they should be used by a task and requested resources can’t be aware of what resource will be assigned. This means that fields on the GenericResource would either mean different things in different places, or there would only be a subset of fields in use on any given object.

[2] https://github.com/docker/swarmkit/blob/de950a7/design/generic_resources.md

@cnrmck

cnrmck commented Jul 2, 2018

While I cannot claim to know anything about the implementation or protocols, I can say that this is a desperately needed feature for any sort of IoT development for which current solutions (however clever) are insufficient. +1 due to that.

The user interface that's proposed also seems fairly intuitive. My question is, would this then support docker-compose files?

@dperny

Member

dperny commented Jul 2, 2018

I don't have a design for compose support, but I imagine it would be straightforward. You would just include devices in a service definition, like you do networks or ports. Something like this (very rough, not part of the proposal):

version: '3'
services:
  iot:
    ports:
     - "5000:5000"
    volumes:
     - .:/datastore
    devices:
    - target: sensor
      path: /dev/sensor

The only open question is whether a compose file should also be able to define device classes and devices per node. That's a better question for the compose team, after we've passed this phase of design.

@cnrmck

cnrmck commented Jul 5, 2018

@dperny I like the plan, would be great to see this!

@apollo13

apollo13 commented Jul 7, 2018

@dperny This would cover our needs for using hardware security modules in containers. I cannot find anything wrong in the proposal.

@dperny

Member

dperny commented Jul 9, 2018

I'm... kind of a doofus? And totally forgot that swarm supports Generic Resource constraints, design doc here: https://github.com/docker/swarmkit/blob/master/design/generic_resources.md

This work, which everyone seems to have forgotten even happened, handles the difficulty of managing which resources are in use on which nodes and by which tasks, which is the more complicated part of this proposal.

However, there is a big problem with the generic resources: the resource availability is decoupled at the data model from the way the resource is used. Essentially, you can keep track of which and how many resources a node has, but not how to actually make use of those resources. This is an explicit non-goal of the Generic Resource design. Quote,

As swarmkit is not responsible for exposing the resources to the container (or acquiring them),
it needs a way to communicate how many generic resources were assigned (in the case of
discrete resources) or / and what resources were selected (in the case of sets).

The reference implementation of the executor exposes the resource value to
software running in containers through environment variables.
The exposed environment variable is prefixed with DOCKER_RESOURCE_ and it's key
uppercased.

This implies that tasks should be responsible for requisitioning their own resources at run time. However, this is impossible for devices. A task, from within a container, cannot attach devices after it has started. So the task has an awareness of what resources are available to it, but no actual way to make use of them. This basically explains why nobody uses this feature; the only way to do so would be to create tasks mounting the docker socket that spawn new containers.
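For reference, the naming rule in the quoted passage amounts to something like this Go sketch. resourceEnvName is a hypothetical helper; only the DOCKER_RESOURCE_ prefix and the uppercasing come from the design doc, and the sketch makes no claim about the format of the variable's value.

```go
package main

import (
	"fmt"
	"strings"
)

// resourceEnvName builds the environment variable name that the quoted design
// doc describes for a generic resource kind: a fixed prefix plus the
// uppercased key.
func resourceEnvName(kind string) string {
	return "DOCKER_RESOURCE_" + strings.ToUpper(kind)
}

func main() {
	// A process inside the container can read this variable, but knowing
	// which resource it was assigned does not let it attach a device.
	fmt.Println(resourceEnvName("gpu"))
}
```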

The executor will have to be aware of how devices are accessed for devices to work. The responsibility for putting those devices into the task will have to live entirely within the agent.

I'll need to rewrite this proposal to accommodate this existing GenericResource feature, so we don't have two overlapping features with different but similar purposes.

@dperny

Member

dperny commented Jul 9, 2018

I'm poking at how to leverage the existing GenericResource code, and it's honestly not that sensible. The use case is too different. The amount of mogrification to the GenericResource concept that one would have to do is untenable.

Honestly... GenericResource isn't a super sensible implementation anyway. It totally decouples a task's resource demands from the actual use of resources, which is a serious problem. If a Task reserves a resource, but does not have any way to use it, the resource is wasted. However, if a Task specifies how to use a resource, but no such reservation was made, then the Task will fail in strange ways.

I think, despite the slight duplication of efforts, the use case for actually using devices is sufficiently different to warrant a separate design.

@dperny

Member

dperny commented Jul 9, 2018

Updated the design document to include a section on GenericResource.

@mbonato

mbonato commented Jul 12, 2018

@dperny I would love to see this implemented! This would allow us to properly use hardware security modules (HSM), which are required by our application, in swarm mode.

@cnrmck

cnrmck commented Aug 22, 2018

@dperny Any update on progress for those of us who are eagerly waiting?

@dperny

Member

dperny commented Aug 22, 2018

Yes, I'm gonna do it, I just keep getting pulled away on other things internally. But it's gonna happen. Soon™.

@cnrmck

cnrmck commented Aug 22, 2018

dperny added a commit to dperny/swarmkit-1 that referenced this issue Aug 24, 2018

Add Devices protos
Adds the protocol buffers needed to support allocation of devices in
swarmkit. Part of the proposal in docker#2682.

Signed-off-by: Drew Erny <drew.erny@docker.com>

dperny added a commit to dperny/swarmkit-1 that referenced this issue Aug 27, 2018

Add Devices protos
Adds the protocol buffers needed to support allocation of devices in
swarmkit. Part of the proposal in docker#2682.

Signed-off-by: Drew Erny <drew.erny@docker.com>

@cnrmck

cnrmck commented Sep 20, 2018

@dperny Any updates or timeline? Thanks!

@swift1911

swift1911 commented Nov 17, 2018

Is there any progress on this issue?