Proposal: Network Drivers #8952

Closed
dave-tucker opened this Issue Nov 4, 2014 · 30 comments

@dave-tucker
Member

Authors: @dave-tucker, @mavenugo and @nerdalert.

Problem Statement

We believe that networking in Docker should be driver-based, with multiple backends to cater for the various styles of networking. This would provide a means of supporting alternative vSwitches like Open vSwitch or Snabb Switch, or even a twist on the existing Linux Bridge solution.

This is a companion proposal to #8951 as it will be based on the Open vSwitch backend provided here.

Solution

The current bridged networking in docker relies on Linux Bridge with iptables programming.
Linux Bridge is only one of many vSwitch implementations available for Linux. Our proposal is to introduce a driver framework alongside backends for the most popular vSwitch solutions today: Open vSwitch and Linux Bridge.

Driver API

Today, a lot of the networking configuration is handled within libcontainer.
In order to be compatible with a driver-based model, we propose moving all of the code that handles networking into Docker itself.
This allows us to create a configuration pipeline, with hooks, giving the drivers complete control over the network setup.

This could look as follows:

  • ParseNetworkConfiguration
  • InitNamespace
  • InitBridge
  • PrePortCreation
  • CreatePort
  • PostPortCreation
  • PreAddressing
  • Addressing
  • PostAddressing

This means that when a container is created:

  • Docker parses the networking configuration, initializes the bridge, and creates a net namespace
  • We then allow the driver to set up the namespace, the port on the bridge, iptables rules, etc.

Having the driver as a part of Docker also allows us access to contextual information about a given container.
This would enable us to write metadata to a port (in cases where the vSwitch allows it) which is very useful for troubleshooting and debugging network issues.

The driver API allows the concrete driver to be implemented purely in the vSwitch, or as a combination of a vSwitch and other processes (e.g. Linux Bridge + iptables).

From a user perspective, the network driver would be chosen by specifying a flag when running the docker daemon. Failing that, a sensible default will be picked: Open vSwitch if OVS is installed, otherwise falling back to Linux Bridge + iptables.

For the reasoning behind using OVS by default in place of Linux Bridge, please see https://github.com/openvswitch/ovs/blob/master/WHY-OVS.md

Open vSwitch Backend

Today libcontainer has a number of network strategies for connecting containers to bridges.
veth is used for bridged networks, and while this solution is widely compatible, it is not the most performant. See this blog post for a performance comparison of Linux Bridge and OVS.

As such, the OVS driver will support:

  • Creating a veth pair, placing one end in the container namespace and attaching the other to Open vSwitch
  • Creating an OVS internal port and placing it in the container namespace

This can be added as a new network strategy to allow code-sharing between drivers, or can be hard-coded into the Open vSwitch driver itself.

OVS configuration is done using the OVSDB protocol specified in RFC7047.
As such, we have written an open source (Apache 2.0) OVSDB library for Go that can be consumed by Docker for this purpose.

Use of OVSDB is preferred over the Open vSwitch CLI (ovs-vsctl) commands because it provides:

  • A well-defined management interface with a proper specification (RFC 7047)
  • An enhanced command set
  • The ability to monitor the database for updates
  • Support for multi-threaded operations (lock, unlock, steal)
  • Rich error handling

Linux Bridge Backend

To maintain compatibility with the existing networking stack, we will also write a Linux Bridge driver.
This will operate in the same fashion as it does today, creating a veth pair unless an alternative network strategy is selected.

Summary

By implementing a Bridged Network Driver framework in Docker we allow for many different implementations of vSwitches to easily integrate with Docker. This gives Docker users choices for performance, reliability and scale in production environments.

The work here should address the following issues:

  • #7857 - OVS will bring better performance figures (see here)
  • #7455, #8216 - Will allow the network driver to handle configuration of bridge, namespace etc...
@billsmith

Broken link at end of Driver API section should be https://github.com/openvswitch/ovs/blob/master/WHY-OVS.md

@dave-tucker
Member

Updated. Thanks @billsmith

@Lukasa
Lukasa commented Nov 5, 2014

This is a great idea!

One warning: I think specifying the API in terms of the method of creating the topology is a bad idea. The initial strawman for the API was:

This could look as follows:

ParseNetworkConfiguration
InitNamespace
InitBridge
PrePortCreation
CreatePort
PostPortCreation
PreAddressing
Addressing
PostAddressing

A better approach would be for the interface to be expressed in terms of the desired connectivity to be achieved, and to allow the plugin to sort it out. For example, rather than saying "I want a bridge", say "I want this container to talk to these containers". For OVS, this turns into "I want a bridge", while for L3 approaches (say Project Calico), this turns into "I want appropriate ACLs".

Basically, I think this approach for network drivers should be broader than just switches. Switches are great and I want to support them, but let's not rule out alternative approaches to building the network topologies.

@dpw
dpw commented Nov 5, 2014

Does this proposal mean that all drivers will continue to live in the docker codebase? That seems to be suggested by the sentence "From a user perspective, the network driver would be chosen by specifying a flag when running the docker daemon". In practice, only a small set of network drivers will be blessed by inclusion in docker, and it will be hard for users to try out alternative network driver implementations.

If instead the network driver API is exposed through a true plug-in system, then an ecosystem of driver implementations can thrive. For the same reason, the details of the API should not assume a particular style of driver.

Furthermore, the driver should be selected on a per-container basis. For example, a user might want some containers on a single host to use an Open vSwitch-based driver (for high-performance virtualized networks), others to use a weave-based driver (in order to have an encrypted overlay network crossing firewalls), and others using the traditional Linux bridge-based docker networking (because they don't need anything more sophisticated).

@dave-tucker
Member

@lukasa great point. this api is just a starter for 10 - i'd really like the api to be defined by interested parties :)

@dpw i think we'll be basing this atop #8968 so drivers don't have to live in the docker code base. To follow docker's batteries included ethos it would make sense to have a sensible default in tree. I only found out about the plugin proposal today though ;)

As for driver on a per-container basis, I'll defer to others to see if this is something we should consider. I'm wary of pushing something like that in a docker run, especially if the backend is abstracted through swarmd, as a container won't run if that backend isn't installed or its dependencies aren't met.

@Lukasa
Lukasa commented Nov 5, 2014

@dave-tucker Awesome, I'd love to be a voice in this discussion. I'm tentatively open to being a guinea pig for proposed APIs as well if we decide we need to workshop this.

Per container drivers might be a little trickier, but I'm interested in seeing if a straw man can be proposed.

I'll have a think about a straw man ideal API from where I'm standing, and propose it back.

@gaberger
gaberger commented Nov 5, 2014

I think there is room to be opinionated here on the design in the early days, so as to avoid the pathological effects of injecting all kinds of state distribution issues into the system. Others have mentioned that these efforts should be in line with both the plugin architecture design (#8968) and the work going on with clustering (#8859).

There should be nothing preventing arbitrary composition of services but we can imagine that there are patterns of composition (spatial and temporal) which are necessary to meet the emerging data-flow patterns and micro-service design.

As mentioned we should not have to worry about the underlying implementation and configuration mechanisms to assemble the graph of services whether applying a model such as one-app per container, full stack (not a good idea) or the assembly of containers into a higher collection (i.e. POD).

From a high level I would love to see a generic "topology" abstraction, purposely not using "network" as that term implies some scope of spatial isomorphism. The topology class can be sub-classed into hints related to the composition of the graph of services, i.e. one-to-one, one-to-many, many-to-one, many-to-many. This would allow the underlying configuration mechanisms and higher-level orchestration functions to be more intelligent about the patterns of communication, which can be used to add constraints to the scheduling decisions.

Quite possibly this could be a very simple interface addition to a Docker/libswarm verb, such as:

docker create topology -type one2one -name spark
docker run --topology spark ...

All of the identifier assignments (MAC, IP, VID, A record, etc..) are available through a service discovery API (think consul, etcd, etc..)

The decisions about the intra-host and inter-host IPC capabilities need to be formalized under some constraints imposed by use-case analysis. For instance, if mobility is a first-class function in Docker, then managing IP bindings becomes critical, and the options of L2 bridging (scaling challenges) or IP-in-IP encapsulation such as LISP (state distribution challenges) need to be considered.

If we are going down the path of a broker/proxy-based technology implementation like Weave, we might as well look at more robust cluster-based IPC mechanisms such as libchan over sockets, which might be more valuable to application designers than stitching together tunnel endpoints.

-g


@lexlapax
lexlapax commented Nov 5, 2014

some thoughts..
As has been mentioned here and in #8951, there are two higher-level paths multi-host networking in docker can take:
1st is to provide a docker native / docker knowledgable solution as being discussed here and in #8968
2nd is to provide an escape hatch for outside docker managed networking ala #8216
It makes sense on the one hand to take a stance on where docker should be going in terms of networking and provide some sort of a path, maybe via a plugin architecture, with some implementations like ovs and linuxbridge, defaulting to one of them. Newer ways of networking or newer plugins can always be developed using the plugin architecture -- this will take time.
On the other hand, the escape hatch that's proposed in #8216 is a generic enough solution that will work for any kind of networking until docker catches up..

@mapuri
Contributor
mapuri commented Nov 5, 2014

@ All

This thread and #8951 do a good job of laying out the approaches to integrate advanced networking models with docker. And as @lexlapax summarizes, these seem to be boiling down to two broad approaches, viz.

  • a network-driver (plugin) approach (referred to as [1] below) #8952, which involves docker (through the driver) provisioning networking for a container; and
  • a pre-provisioned network namespace approach (referred to as [2] below) #8216, which allows docker to remain a gluing infra for a container's networking.

Since we still seem to be weighing the proposals, I want to add that while these two approaches can be thought to co-exist, there are potentially different implications wrt the design and implementation of docker as a scalable and lightweight container management/launching infrastructure, so it might not be desirable to support both but just do one right!

IMO what can be achieved by one of the proposed approaches can potentially be achieved by the other approach as well. So instead of listing the differences I would like to list the similarity in capabilities of the two approaches in an attempt to make it easier to compare:

  • Both approaches assume some form of external network orchestrator's (*) presence: The network driver approach requires the network configuration to be passed to the driver (through docker) by the orchestrator, while the network-namespace approach requires the network orchestrator to provision the namespace, which is then passed to docker later.

  • Both approaches are equally pluggable: While [1] plugs (driver/orchestrator specific) network plumbing from southbound through driver APIs, [2] plugs (driver/orchestrator specific) network plumbing from northbound with one generic network strategy.

  • Both approaches provide a consistent interface to the external orchestrator for network management: While [1] offers the consistent interface in terms of driver APIs, [2] offers a consistent interface in terms of the existing generic network-namespace strategy.

  • Both approaches are capable of integrating complex networking scenarios with docker, like multi-host networks, multi-tenancy, IPAM, etc.*: Since servicing these networking needs assumes the presence of some kind of external orchestrator and both approaches work well with such orchestrators, both approaches address these requirements.

    (*)By 'external network orchestrator' above I imply some sort of orchestration/controller platforms (openstack neutron, open daylight or a simple script etc to name a few) that manages the network (including host networking) in a dev/prod fabric.

Conclusion
The two approaches seem equally capable for addressing the existing gaps in docker wrt L2/L3 network integration.

I personally favor [2] mainly because it keeps the docker implementation pretty simple and completely offloads the complexity of network management to the external network orchestrator, which anyway needs to be involved in either approach. However, [1] might be preferable if a driver/plugin based approach as proposed by #8968 is adopted by docker in general.

I would love to hear back if there are other considerations where one approach might be favored over the other. Or if I am missing something obvious in my understanding.

This is a relevant discussion wrt shaping up network integration for docker and I would love to be involved.

@dave-tucker
Member

@mapuri neither this proposal nor #8951 has a requirement for an external orchestrator or controller of any description; both are totally happy without one

@mrunalp
Contributor
mrunalp commented Nov 5, 2014

@dave-tucker Do you have a link to the Go based OVSDB library that you mention?

@dave-tucker
Member

@mrunalp https://github.com/socketplane/libovsdb it's still alpha quality, but we're working on that :)

@mrunalp
Contributor
mrunalp commented Nov 6, 2014

@dave-tucker No worries :)

@mapuri
Contributor
mapuri commented Nov 6, 2014

@dave-tucker, ah I might be missing something, but I thought this proposal requires a network configuration to be passed down to docker, which lets it provision the container network (ovs; or linux bridge + iptables etc.) by calling into the driver's API (like ParseNetworkConfiguration etc.).

My understanding of the proposal is that in order to address complex network scenarios like making use of underlays (vlans) or overlays, or just L3 and ACLs based on the user's network case, we want to abstract it under the driver layer. But won't this require the network configuration to be prepared by an orchestrator and then transparently passed to the driver (that understands that config) through docker run or similar? If so, it does seem to assume the presence of an external orchestrator which prepares driver-specific configuration, right?

BTW I might not have been clear, but depending on the complexity of the network, orchestrator might just be a fancy term for a bash script or a human, say managing a single-host network.

@lexlapax
lexlapax commented Nov 6, 2014

@mapuri concur on the notion of the orchestrator.. it could be anything, including scripts, chef/puppet/ansible, large scale container management frameworks, policy management frameworks like congress or opflex etc..

@gaberger
gaberger commented Nov 6, 2014

Sorry, missing something here.. How do you do "multi-host" without coordination? If the call site for Docker functions only operates on a single host (either a single container or a collection of containers), then there needs to be some other call site that crosses different hosts. I was under the impression that libswarm would take on this role: the Docker backend services would have concrete interfaces which would take advantage of the driver/plugin model proposals, but there should be a higher-level abstraction for multi-host.

Like I said in my earlier post, if the composition of services requires knowledge of multiple hosts, i.e. Container B:Port 4001:Host 2 -> depends on -> Container A:Port 4000:Host 1, then Host 2 should have knowledge of this, possibly at initialization but definitely at runtime. It would be great to be able to do late bindings and just discover services from a central registry when needed, but now we are back to the fact that you need some shared state. As far as I understand, OVSDB only has a local view of the host it operates on, and NSX pulls that along with other DBs to create a global view. What is the expectation here, are we relying on the network community to provide tools for this?


@dave-tucker
Member

@mapuri if we are including humans as orchestrators your point is valid.
The ParseNetworkConfiguration will see what's exposed by docker run, which imo should not be more than what exists today; anything else comes in either via a hook into the docker engine, a read from a K/V store, or an RPC to the orchestrator (the latter two could be implemented as part of a driver). This way operational state doesn't get passed in to docker run; rather, it's either injected into docker, or docker is smart enough to know where to look for it.

@gaberger multi-host is not this proposal. That discussion is happening in #8951.
You'll see we are proposing a distributed control plane for the exchange of network-level reachability between hosts in a cluster. Host-level discovery should happen via Docker Clustering, or another solution if this too is going to be pluggable. This shouldn't be confused with service discovery (which could be via DNS, K/V or other; there are numerous proposals out there), which is where service composition becomes relevant.

@lexlapax
lexlapax commented Nov 6, 2014

I have to say, discovery and connectivity, although different disciplines, go hand in hand. Which sort of leads me to believe that whatever is connecting the "plumbing" together would probably benefit from the knowledge of seeding discovery data.. and as such, it's either something like libswarm et al or something else altogether (possibly multiple somethings), but it's not the docker binary that should be doing this.. and hence, again, my preference for #8216. As the community decides how to get into the "orchestration and discovery" side of things and how best to do it from a docker perspective, the rest of the community still benefits from external tools that can do the connectivity and the discovery data seeding.. including using things like libovsdb outside of docker proper.

@gaberger
gaberger commented Nov 6, 2014

@dave-tucker my bad Dave, crossed streams.. Still, you should be careful here when trying to carve up these namespaces.. Services are overloaded onto interface addresses, which is one of the core issues here, but I will save that discussion for #8951

@mapuri I haven't looked into the plugin layer, but it seems to me this would require some dispatcher to live within the docker daemon in order to broker calls either to the existing engine or a third-party library like libovsdb


@mapuri
Contributor
mapuri commented Nov 6, 2014

which imo should not be more than exists today, and anything which comes in either via a hook in to the docker engine, a read from a K/V store or an RPC to the orchestrator (the latter two could be implemented as part of a driver).

@dave-tucker, thanks for clarifying. That was my understanding as well while I was comparing #8952 with #8216 in my previous post. The state made available to the driver/plugin (either via K/V or RPC calls) needs to be in a form that it understands, i.e. it can't be something generic, and it will mostly be dictated by the orchestrator/controller that publishes it in the first place. For instance, a policy-based framework might push policies to allow/prevent communication between two containers through ACL-like rules, while a simple vlan-based ovs driver implementation might push ovsdb configuration to associate container interfaces with the same/different vlans.

This brings me back to the conclusion in my original post: if #8216 and #8952 compare equally in capabilities, do we see any other specific benefits that let us choose one approach over the other? I definitely see the simplicity of #8216 as a potential plus in its favor.

@gaberger, yes agreed. With a plugin-layer-based approach, I can see that the docker daemon will need to broker network namespace provisioning calls to a third-party/orchestrator-specific plugin, if one is registered.

@thockin
Contributor
thockin commented Nov 6, 2014

+1 to not making plugins be built-in to docker. Running a separate plugin as a distinct daemon or set of exec() calls or set of http hooks means we don't need to hack on docker to experiment with ideas.

+1 to a less concrete API - we should support a bridgeless mode (think SR-IOV).

@joeswaminathan

I personally favor #8216 over #8952 for the following reasons:

  1. Docker is still confined to deploying containers within a single host, and we still need a PAAS/IAAS API framework to orchestrate containers across large clusters. We need to have a network API at the PAAS/IAAS level anyway, so why create two levels of API? There are going to be N number of PAAS/IAAS frameworks, each having a different network model, and trying to satisfy all of them with a single API interface at the Docker level will not be practical.
  2. Based on OpenStack experience, it is not easy to create a network API that makes everyone happy, as networking is feature-rich. If we are creating an API, it can't stop at just providing an IP address to a container; we need to be able to configure network services, service chains, etc. Having to push any extension through two different communities is going to be a pain. Particularly as Docker will be used across multiple IAAS/PAAS frameworks, it will not be easy to standardize or extend Docker-level APIs (it is already hard within a single framework - OpenStack Neutron is a good example).
  3. I see Docker as an OpenStack Nova agent equivalent. It would be better to confine Docker to this role only rather than trying to make it represent network, storage, etc. Docker's strength today is only in the compute aspects. Confining Docker to this aspect will enable wider adoption.

Having said that, there might be a use case for a simple network interface that allows containers to directly connect to a Linux bridge or an OVS for quick experimentation and proof-of-concept type work. Hence a lightweight #8951 based on a Linux bridge / OVS bridge might be useful. But #8216 should be the prime model in my opinion.

@nhorman
nhorman commented Nov 11, 2014

I agree with @joeswaminathan. There's already lots of work being done to manage the other aspects of this problem (kubernetes handles container scheduling/deployment/monitoring; rudder/flannel handles network address allocation and peer container connections via various methods, like direct routing/ovs/vxlan tunneling). There's no need to attempt to pull all that into a monolithic docker setup. All docker really needs is a way to specify network interfaces and what they are expected to attach to locally (i.e. internal network/external network), in some common nomenclature that external tools can use to properly manage the virtual cabling for that container. Docker itself doesn't need to become aware of the off-host infrastructure that it's living within.

@jainvipin

It seems like many would like to see #8216 (from #8951, #8952, and #8216).
While we debate and code up the eventual thing, assuming it is going to take some time (a few months, I suppose), would it make sense to pull in #8216 (or equivalent) - something that's available now?

@lexlapax

@jainvipin should add #8997 to that list

@fleitner

@joeswaminathan described my concerns very well. Although the idea of having multiple backends to support different networking needs in Docker is for sure compelling, there are too many possibilities to fulfill. It seems better if Docker could provide the simplest networking plumbing by default and be more friendly to external tools that can do more complex networking plumbing.

@FlorianOtel

+1

@yeasy
yeasy commented Jan 9, 2015

Agree, and this should be given higher priority.
In datacenter environments, Open vSwitch is actually the de-facto driver.

@unclejack
Contributor

There's an official proposal for networking drivers which can be found at #9983.
The architecture presented in the proposal would also enable multi-host networking for multiple Docker daemons.

This new proposal implements an architecture which has been discussed quite a bit. Implementing a proof of concept of the network drivers was also part of this effort.
We're not suggesting that the previous proposals were of lower quality or that they required less effort. However, the design also had to be accepted by everyone and validated with a proof of concept, in addition to being good.

Should you discover something is confusing or missing from the new proposal, please feel free to comment.
If you'd like to continue the discussion, please comment on #9983. Please make sure to stay on topic and try to avoid writing long comments (or too many). This would help make it easier for everyone who's following the discussion.

Questions and lengthy discussions are better suited to the #docker-network channel on freenode. If you just want to talk about this, that is a better place to have the conversation.

We'd like to thank everyone who's provided input, especially those who've sent proposals. I will close this proposal now.

@unclejack unclejack closed this Jan 9, 2015
@kamaljeetrathi

How to configure a DHCP server in Docker so that we can access the application running inside it with that IP?
