Proposal: Native Docker Multi-Host Networking #8951

Closed
nerdalert opened this Issue Nov 4, 2014 · 145 comments

@nerdalert
Contributor

Native Docker Multi-Host Networking

TL;DR Practical SDN for Docker

Authors: @dave-tucker, @mavenugo and @nerdalert.

Background

Application virtualization will have a significant impact on the future of data center networks. Compute virtualization has driven the edge of the network into the server, and more specifically into the virtual switch. The compute workload efficiencies derived from Docker containers will dramatically increase the density of network requirements in the server. Scaling this density will require reliable network fundamentals, while also ensuring the developer has as much or as little interaction with the network as desired.

A tightly coupled, native integration with Docker will ensure there is base functionality capable of integrating into the vast majority of data center network architectures today, and will help reduce the barriers to Docker adoption for the user. Just as important for the diverse user base is making Docker networking dead simple for the user to integrate, provision and troubleshoot.

The first step is a native Docker networking solution that can handle multi-host environments, scales to production requirements, and works well with existing network deployments and operations.

Problem Statement

Though there are a few existing multi-host networking solutions, they are currently designed as over-the-top solutions layered on Docker that either:

  1. Address a specific use case
  2. Address a specific orchestration system deployment
  3. Do not scale to the production requirements
  4. Do not work well with existing production network and operations.

The core of this proposal is to bring multi-host networking in as a native part of Docker that handles most of the use cases, scales, and works well with existing production networks and operations. With this provided as a native Docker solution, every orchestration system can enjoy the benefits alike.

There are three ways to approach multi-host networking in docker:

  1. NAT-based: just hide the containers behind the Docker host IP address. Job done.
  2. IP-based: each container gets its own unique IP address.
  3. Hybrid: a mix of the above.

NAT-based

The first option (NAT-based) works by hiding the containers behind a Docker host IP address. The TCP port exposed by a given Docker container is mapped to a unique port on the host machine.

Since the mapped host port has to be unique, containers using well-known port numbers are forced onto ephemeral ports. This adds complexity to network operations, network visibility, troubleshooting and deployment.

Consider, for example, the configuration of a front-end load-balancer for a DNS service hosted in a Docker cluster.

Service Address:

  • 1.2.3.4:53

Servers:

  • 10.1.10.1:65321
  • 10.36.45.2:64123
  • 10.44.3.1:54219

If you have firewalls or IDS/IPS devices behind the load-balancer, these also need to know that the DNS service is hosted on these addresses and port numbers.
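
To make the ephemeral-port effect concrete, here is a minimal Go sketch of the kind of bookkeeping the host must do. It is illustrative only: the allocator type, the 49153 starting port and the container addresses are assumptions, not Docker's actual port allocator.

```go
package main

import "fmt"

// portAllocator hands out unique host ports for container port mappings.
// This is an illustrative sketch only, not Docker's real allocator.
type portAllocator struct {
	next int          // next ephemeral host port to try
	used map[int]bool // host ports already mapped
}

func newPortAllocator() *portAllocator {
	return &portAllocator{next: 49153, used: make(map[int]bool)}
}

// mapPort returns a free host port for containerIP:containerPort.
// Even if every container listens on port 53, each one ends up behind a
// different ephemeral host port, because the host port space is shared.
func (p *portAllocator) mapPort(containerIP string, containerPort int) int {
	for p.used[p.next] {
		p.next++
	}
	hostPort := p.next
	p.used[hostPort] = true
	fmt.Printf("NAT: host:%d -> %s:%d\n", hostPort, containerIP, containerPort)
	return hostPort
}

func main() {
	alloc := newPortAllocator()
	alloc.mapPort("172.17.0.2", 53) // e.g. host:49153 -> 172.17.0.2:53
	alloc.mapPort("172.17.0.3", 53) // e.g. host:49154 -> 172.17.0.3:53
}
```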

IP-based

The second option (IP-based) works by assigning a unique IP address to each container, avoiding the need for port mapping and solving the issues with downstream load-balancers and firewalls by using well-known ports in pre-determined subnets.
However, this exposes a different set of issues.

  • Reachability: which containers are on which host?
    • GCE uses a /24 per host for this reason, but solutions outside of GCE will require an overlay network like Flannel
    • Even a GCE-style architecture will make firewall management difficult
  • Flexible Addressing / IP Address Management (IPAM)
    • Who assigns IP addresses to containers? (a hypothetical interface sketch follows this list)
      • Static? A flag in docker run?
      • DHCP/IPAM? A proper DHCP server or IPAM solution?
      • Docker? A local DHCP solution using Docker?
      • Orchestration System? Via docker run or another API?
  • Deployability and migration concerns
    • Some clouds do not play well with routers (like EC2)
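
To illustrate the "who assigns IP addresses" question above, here is one possible shape for a pluggable IPAM backend, sketched in Go. The interface name and method signatures are hypothetical; nothing in this proposal fixes such an API.

```go
package ipam

import "net"

// Allocator is a hypothetical pluggable IPAM backend. A static backend could
// honour an address passed on `docker run`, a DHCP backend could lease one
// from the network, and an orchestrator-backed allocator could call out to an
// external system.
type Allocator interface {
	// RequestAddress returns an address for a container on the given subnet.
	// A non-nil preferred address models the "static" case.
	RequestAddress(subnet *net.IPNet, preferred net.IP) (net.IP, error)

	// ReleaseAddress returns the address to the pool when the container is
	// removed.
	ReleaseAddress(subnet *net.IPNet, addr net.IP) error
}
```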

Proposal

We are proposing a Native Multi-Host networking solution to Docker that handles various production-grade deployment scenarios and use cases.

The power of Docker is its simplicity, yet it scales to the demands of hyper-scale deployments. The same cannot be said today for the native networking solution in Docker. This proposal aims to bridge that gap. The intent is to implement a production-ready, reliable multi-host networking solution that is native to Docker while remaining laser-focused on the user-friendly developer experience that is at the heart of the Docker transformation.

The new edge of the network is the vSwitch. The virtual port density that application virtualization will drive is an even larger multiplier than the explosion of virtual ports created by OS virtualization. This will create port density far beyond anything to date. In order to scale, the network cannot be seen as merely the existing two-tier spine/leaf physical architecture; it must also incorporate the virtual edge. Having Docker natively incorporate clear, scalable architectures will avoid the all too common problem of the network blocking innovation.

Solution Components

1. Programmable vSwitch

To implement this solution we require a programmable vSwitch.
This will allow us to configure the necessary bridges, ports and tunnels to support a wide range of networking use cases.

Our initial focus will be to develop an API covering the primitives required of the vSwitch for multi-host networking, delivering an implementation for Open vSwitch first.

This link, WHY-OVS, covers the rationale for choosing OVS and why it is important to the Docker ecosystem and virtual networking as a whole. Open vSwitch has a mature kernel data plane (upstream since 3.7) with a rich set of features that addresses the requirements of multi-host networking. In addition to the data-plane performance and functionality, Open vSwitch also has an integrated management plane called OVSDB that abstracts the switch as a database for applications to make use of.
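
As a rough illustration of "the switch as a database", the sketch below builds the OVSDB "transact" JSON-RPC message (RFC 7047) that a driver could send to create a bridge. It is deliberately simplified: the bridge name is made up, the accompanying mutation of the Open_vSwitch root table is omitted, and a real driver would more likely use an OVSDB client library than hand-rolled JSON.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ovsdbRequest is a minimal JSON-RPC envelope for the OVSDB "transact"
// method. A real driver would send this over the OVSDB socket
// (e.g. unix:/var/run/openvswitch/db.sock) and would also mutate the
// Open_vSwitch root table so the new bridge is referenced; both steps are
// omitted to keep the sketch short.
type ovsdbRequest struct {
	Method string        `json:"method"`
	Params []interface{} `json:"params"`
	ID     int           `json:"id"`
}

func main() {
	// Insert a row into the Bridge table of the Open_vSwitch database.
	insertBridge := map[string]interface{}{
		"op":        "insert",
		"table":     "Bridge",
		"row":       map[string]interface{}{"name": "docker0-ovs"}, // illustrative name
		"uuid-name": "newbridge",
	}
	req := ovsdbRequest{
		Method: "transact",
		Params: []interface{}{"Open_vSwitch", insertBridge},
		ID:     1,
	}
	payload, _ := json.Marshal(req)
	fmt.Println(string(payload)) // what the driver would write to the OVSDB socket
}
```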

With this proposal the native implementation in Docker will:

  • Provide an API for implementing Multi-Host Networking
  • Provide an implementation for an Open vSwitch datapath
  • Implement native control plane to address the scenarios mentioned in this proposal.

2. Network Integration

The scenarios this proposal deals with range from the existing port-mapping solution, to VXLAN-based overlays, to native underlay network integration. There are real deployments for each of these use cases.

The solution should facilitate the common application HA scenario of a service needing a 1:1 NAT mapping between the container's back-end IP address and a front-end IP address from a routable address pool. Alternatively, the containers can also be globally reachable, depending on the user's IP addressing strategy.

3. Flexible Addressing / IP Address Management (IPAM)

In a multi-host environment, the IP addressing strategy becomes crucial. Some of the use cases, as we will see, will also require a reasonable IPAM solution to be in place. This discussion also leads to the production-grade scale requirements of Layer 2 vs. Layer 3 networks.

4. Host Discovery
Though it is obvious, it is important to mention the host discovery requirement that is inherent in any multi-host solution. We believe that such a host/service discovery mechanism is a generic requirement, not specific to multi-host networking needs, and as such we are backing the Docker Clustering proposal for this purpose.

5. Multi-Tenancy
Another important consideration is to provide the architectural white-space for Multi-Tenancy solutions that may either be introduced in Docker Natively or by external orchestration systems.

Single Host Network Deployment Scenarios

  • Parity with existing Docker Single-Host solution

This is the native single-host Docker networking model as of today. It is the most basic scenario that the proposed solution must address seamlessly. This scenario brings basic Open vSwitch integration into Docker, which we can build on top of for the multi-host scenarios that follow.

Figure - 1

  • Addition of Flexible Addressing

This scenario adds a flexible addressing scheme to the basic single-host use case, where IP addressing can be provided from one of many different sources.

Figure - 2

Multi Host Network Deployment Scenarios

The following scenarios enable backend Docker containers to communicate with one another across multiple hosts. This fulfills the need for high-availability applications to survive beyond a single node failure.

  • Overlay Tunnels (VXLAN, GRE, Geneve, etc.)

For environments which need to abstract the physical network, overlay networks create a virtual datapath using supported tunneling encapsulations (VXLAN, GRE, etc.). It is just as important for these networks to be as reliable and consistent as the underlying network. Our experience leads us towards using a similar consistency protocol, such as tenant-aware BGP, in order to achieve the worry-free environment developers and operators desire. This also presents an evolvable architecture if a tighter coupling into the native network is of value in the future.

The overlay datapath is provisioned between tunnel endpoints residing in the Docker hosts, which gives the appearance of all hosts within a given provider segment being directly connected to one another, as depicted in Figure 3.

Figure - 3

As a new container comes online, its prefix is announced in the routing protocol along with the tunnel endpoint behind which it resides. As the other Docker hosts receive the update, a forwarding entry pointing at that tunnel endpoint is installed into OVS. When the container is deprovisioned, a similar process occurs and the other Docker hosts remove the forwarding entry for the deprovisioned container.
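
A hedged sketch of that bookkeeping in Go follows. The types are illustrative assumptions; in the proposed design the actual forwarding state would live in OVS rather than in a map.

```go
package overlay

import "net"

// vtepTable maps a container prefix (e.g. "10.1.10.5/32") to the IP of the
// tunnel endpoint (VTEP) on the Docker host where that container lives.
// It mirrors the routing-update handling described above and is a sketch
// only; the real forwarding entries would be programmed into OVS.
type vtepTable struct {
	entries map[string]net.IP
}

func newVTEPTable() *vtepTable {
	return &vtepTable{entries: make(map[string]net.IP)}
}

// onRouteAdd handles the control plane (e.g. tenant-aware BGP) announcing a
// new container prefix reachable via a remote tunnel endpoint.
func (t *vtepTable) onRouteAdd(prefix *net.IPNet, vtep net.IP) {
	t.entries[prefix.String()] = vtep
	// Here a real driver would install an OVS flow/tunnel entry so traffic
	// to prefix is encapsulated (VXLAN/GRE/Geneve) towards vtep.
}

// onRouteWithdraw handles the prefix being withdrawn when the container is
// deprovisioned.
func (t *vtepTable) onRouteWithdraw(prefix *net.IPNet) {
	delete(t.entries, prefix.String())
	// ...and the corresponding OVS forwarding entry would be removed.
}
```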

  • Underlay Network integration

The backend can also simply be bridged into a network's broadcast domain and rely on the upstream network to provide reachability. Traditional L2 bridging has significant scaling issues, but it is still very common in data centers with flat VLAN architectures that facilitate live workload migrations of their VMs.

This model is fairly critical for DC architectures that require a tight coupling of network and compute, as opposed to a ships-in-the-night design of overlays abstracting the physical network.

Underlay network integration can be designed with a specific network architecture in mind; hence we see models like Google Compute Engine, where every host is assigned a dedicated subnet and each pod gets an IP address from that subnet.

Figure - 4 - Dedicated Static Subnet per Host

The entire backend container space can be advertised into the underlying network for IP reachability. IPv6 is becoming attractive to many in this scenario due to IPv4 address constraints.

Extending L3 to the true edge of the network, in the vSwitch, enables proven network scale while still retaining the ability to perform disaggregated network services at the edge. Extending gateway protocols to the host will play a significant role in scaling a tight coupling to the network architecture.

Alternatively, underlay integration can combine flexible addressing with /32 host updates to the network in order to provide subnet flexibility.

Figure - 5

Summary

Implementing the above solution provides flexible, scalable multi-host networking as a native part of Docker. It adds a strong networking foundation intended to provide an evolvable network architecture for the future.

@thockin
Contributor
thockin commented Nov 4, 2014

This sounds good. What I am not seeing is the API and performance. How does one go about setting this up? How much does it hurt performance?

One of the things we are trying to do in GCE is drive container network perf -> native. veth is awful from a perf perspective. We're working on networking (what you call underlay) without veth and a vbridge at all.

@shykes
Contributor
shykes commented Nov 4, 2014

I like the idea of underlay networking in Docker. The first question is: how much can be bundled by default? Does an ovs+vxlan solution make sense as a default, in replacement of veth + regular bridge? Or should they be reserved for opt-in plugins?

@thockin do you have opinions on the best system mechanism to use?

@thockin
Contributor
thockin commented Nov 4, 2014

What exactly do you mean by "system mechanism" ?

@shykes
Contributor
shykes commented Nov 4, 2014

vxlan vs pcap/userland encapsulation vs nat with netfilter vs veth/bridge vs macvlan... use ovs by default vs. keep it out of the core.. Things like that.

@thockin
Contributor
thockin commented Nov 4, 2014

Ah. My experience is somewhat limited.

Google has made good use of OVS internally.

veth pair performance is awful and unlikely to get better.

I have not played with macvlan, but I understand it is ~wire speed, though a bit awkward to use.

We have a patch cooking that fills the need for macvlan-like perf without actually being VLAN (more like old-skool eth0:0 aliases).

If we're going to pick a default, I don't think OVS is the worst choice - it can't be worse perf than veth. But it's maybe more dependency heavy? Not sure.

@mavenugo
Contributor
mavenugo commented Nov 5, 2014

@thockin @shykes Thanks for the comments.
Agreed on the veth performance issues. Our proposal is to use OVS ports.
The companion proposal : #8952 covers details on how we are planning to use OVS.
(Please refer to the Open vSwitch Backend section of #8952 which covers performance details of veth vs OVS port).

OVS provides the flexibility of using VXLAN for overlay deployments or native network integration for underlay deployments without sacrificing performance or scale.

I haven't done much work with macvlan to give an answer on how it stacks up to an overall solution that includes functionality, manageability, performance, scale and network operations.

We believe that Native Docker networking solution should be flexible enough to accommodate L2, L3 and Overlay network architectures.

@jainvipin

Hi Madhu, Dave and Team:

Definitely a wholesome view of the problem. Thanks for putting it out there. A few questions and comments (on both proposals [0] and [1], as they tie into each other quite a bit):

Comments and Questions on proposal on Native-Docker Multi-Host Networking:

[a] OVS Integration: The proposal to natively instantiate OVS from Docker is good.

  • Versioning and dependency between the networking component and the compute part of Docker: assume that the driver APIs (proposed in [1]) will change and be refined as we go. An obvious implication of such an implementation inside Docker is that the Docker version that implements those APIs would be tied to the user of the APIs (aka the orchestrator), and all must be compatible and upgraded together.
  • Providing native data-path integration: If OVSDB API calls are made natively via Docker, wouldn’t it be inefficient (an extra hop) to make these API calls via Docker?
  • Datapath OF integration: OVS also provides a complete OF datapath using a controller (ODL, for example). Are you proposing that for a use case that requires OF API calls, the API calls are also made through docker (native integration)? Assuming not, if the datapath programming to the switch is done from outside docker, then why keep part of the OVS bridge manipulation inside docker (via the driver) and a part outside? It would seem that doing the network operations completely outside in an orchestration entity would be a good choice, provided a simple basic mechanism like [2] exists to allow the outside systems to attach network namespaces during container creation.
  • Provide API for implementation for Multi-Host Networking:
    Question: Can you please clarify if the APIs proposed here are eventually consumed by the driver calls defined in [1]? Assuming yes, to keep docker-interface transparent to plugin-specific content of these APIs, what is the proposed method? Say, a plugin-specific parsable-network-configuration for each of the proposed API calls in [1].
  • Provide native control plane:
    Question: Can you please elaborate the intention of this integration. Is this to allow inserting a control plane entity (aka router or routing layer, as illustrated in Figure 4 forming routing adjacency)? If so, does the entity sit inside or outside docker? The confusion comes from the bullet in section 1 “o Implement native control plane to address the scenarios mentioned in this proposal.”

[b]
+1, the flexibility being talked about is good (single host vs. overlays vs. native underlay integration). I am wondering if there is anything specific being proposed here or if it is something that naturally comes from the OVS integration?

[c]
+1 on the flexibility of IPAM (perhaps using DHCP for certain containers vs. auto-configuration for the rest, mostly useful in multi-tenant scenarios). I am wondering if there is anything specific being proposed here or if it is something that naturally comes from the OVS integration?

[e]
Multi-tenancy is an important consideration indeed; associating a profile as in [1], which specifies arbitrary parsed network configuration, seems to suffice for providing a tenant context.

[f]
Regarding DNS/DDNS updates (exposing services) for the host: assuming this is done outside (by the orchestrator), then part of the networking is done outside Docker and part inside (the rest of the native Docker integration proposed here).

Comments and Questions on proposal on ‘Network Drivers’:

[g] Multiple vNICs inside a container: Do the APIs proposed here (CreatePort) handle creation of multiple vNICs inside a container?

[h] Update to network configuration: Say a bridge is added with a VXLAN VNID or a VLAN; would your suggestion be to call ‘InitBridge’, or should this be done during PortCreate() if the VLAN/tunnel/other parameters needed for port creation do not exist?

[j] Driver API performance/scale requirements: It would be good to state an upfront design target for scale/performance.

As always, will be happy to collaborate on this with you and other developers.

Cheers,
--Vipin

[0] #8951
[1] #8952
[2] #8216

@dave-tucker
Member

@thockin on the macvlan performance, are there any published figures?
@shykes @mavenugo i've done a very rough & ready comparison and so far OVS seems to be leading the pack in my scenario, which is iperf between two netns on the same host.
See code and environment here
[screenshot: iperf comparison results]

from an underlay integration standpoint, I'd imagine that having a bridge would be much easier to manage as you could trunk all vlans to the vswitch and place the container port in the appropriate vlan.... otherwise with a load of mac addresses loose on your underlay you'd need to configure your underlay edge switches to apply a vlan based on a mac address (which won't be known in advance).

I feel like i'm missing something though so please feel free to correct me if i haven't quite grokked the macvlan use case

@dave-tucker
Member

@jainvipin thanks for the mega feedback. I think the answer to a lot of your questions lies in these simple statements. I firmly believe that all network configuration should be done natively, as a part of Docker. I also believe that docker run shouldn't be polluted with operational semantics, especially if this impacts the ability of docker run to be used with libswarm (e.g making assumptions on the environment) or adds complexity for devs using docker.

Orchestration systems populating netns and/or bridge details on the host, then asking Docker to plumb this in to the container doesn't seem right to me. I'd much rather see orchestration systems converge on, or create a driver in this framework (or one like it) that does the necessary configuration in Docker itself.

For multi-host, the Network Driver API will be extended to support the required primitives for programming the dataplane. This could take the form of OF datapath programming in the case of OVS, but it could also be adding plain old ip routes in the kernel. This is really up to the driver.
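
For illustration only, a multi-host-capable driver interface might look roughly like the Go sketch below. InitBridge and CreatePort echo names mentioned elsewhere in this thread in the context of #8952; AddRoute/RemoveRoute and all of the signatures are hypothetical stand-ins for the dataplane-programming primitives described above.

```go
package driver

// Driver is a hypothetical sketch of what a multi-host-capable network
// driver interface could look like. InitBridge and CreatePort echo names
// discussed around #8952; AddRoute/RemoveRoute stand in for the "primitives
// required for programming the dataplane", whether that means OpenFlow rules
// for OVS or plain kernel IP routes. Signatures are illustrative only.
type Driver interface {
	// InitBridge creates (or adopts) the host bridge / vSwitch.
	InitBridge(name string) error

	// CreatePort attaches a container's network namespace to the bridge
	// and returns the interface name created inside the container.
	CreatePort(bridge, containerID string) (string, error)

	// AddRoute programs reachability for a remote container prefix via the
	// given next hop (a tunnel endpoint or an underlay gateway).
	AddRoute(prefix, nextHop string) error

	// RemoveRoute withdraws that reachability when the container goes away.
	RemoveRoute(prefix string) error
}
```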

To that end, all of the improvements we're suggesting here for multi-host are designed to be agnostic to the backend used to deliver them.

@thockin
Contributor
thockin commented Nov 5, 2014

The caveat here is that Docker can not be everything to everyone, and the more we try to make it do everything, the more likely it is to blow up in our faces.

Having networking be externalized with a clean plugin interface (i.e. exec) is powerful. Network setup isn't exactly fast-path, so popping out to an external tool would probably be fine.


@jainvipin

@dave-tucker There are trade-offs of pulling everything (management, data-plane, and control-plane) in docker. While you highlighted the advantages (and I agree with some as indicated in my comment), I was noting a few disadvantages (versioning/compatibility, inefficiency, docker performance, etc.) so we can weigh it better. This is based on my understanding of things reading the proposal (no experimentation yet).

In contrast, if we can incorporate a small change (#8216) in docker, it can perhaps give scheduler/orchestrator/controller a good way to spawn the containers while allowing them to do networking related things themselves, and not have to move all networking natively inside docker – IMHO a good balance for what the pain point is and yet not make docker very heavy.

'docker run' has about 20-25 options now, some of which provide further sub-options (e.g. ‘-a’ or ‘--security-opt’). I don’t think it will stay at 25 in the near term; it will likely grow rapidly into a flat, unstructured set. The growth would come from valid use cases (networking or non-networking), but must we consider solving that problem here in this proposal?

I think libswarm can work with either of the two models, where an orchestrator has to play a role of spawning ‘swarmd’ with appropriate network glue points.

@nkratzke
nkratzke commented Nov 5, 2014

What about weave (https://github.com/zettio/weave)? Weave provides a very convenient SDN solution for Docker from my point of view. It provides encryption out of the box, which is a true plus, and it is the only solution with out-of-the-box encryption we have found on the open-source market so far.

Nevertheless, weave's impact on network performance in HTTP-based and REST-like protocols is substantial: about 30% performance loss for small message sizes (< 1,000 bytes) and up to 70% performance loss for big message sizes (> 200,000 bytes). Performance losses were measured for the indicators time per request, transfer rate and requests per second, using apachebench against a simple ping-pong system exchanging data over an HTTP-based REST-like protocol.

We are writing a paper for the next CLOSER conference to present our performance results. There are some options to optimize weave performance (e.g. not containerizing the weave router should bring a 10% to 15% performance gain according to our data).

@shykes
Contributor
shykes commented Nov 5, 2014

@thockin absolutely we will need to couple this with a plugin architecture. See #8968 for first steps in that direction :)

At the same time, Docker will always have a default. Ideally that default should be enough for 80% of use cases, with plugins as a solution for the rest. When I ask about ovs as a viable default, it's in the context of this "batteries included but removable" model.

@shykes
Contributor
shykes commented Nov 5, 2014

Ping @erikh

@Lukasa
Lukasa commented Nov 5, 2014

@dave-tucker, @mavenugo and @nerdalert (and indeed @ everyone else):

It's really exciting to see this proposal for Docker! The lack of multi-host networking has been a glaring gap in Docker's solution for a while now.

I just want to quickly propose an alternative, lighter-weight model that my colleagues and I have been working on. The OVS approach proposed here is great if it's necessary to put containers in layer 2 broadcast domains, but it's not immediately clear to me that this will be necessary for the majority of containerized workloads.

An alternative approach is to pursue network virtualization at Layer 3. A good reference example is Project Calico. This approach uses BGP and ACLs to route traffic between endpoints (in this case containers). This is a much lighter-weight approach, so long as you can accept certain limitations: IP only, and no IP address overlap. Both of these feel like extremely reasonable limitations for a default Docker case.

We've prototyped Calico's approach with Docker, and it works perfectly, so the approach is simple to implement for Docker.

Docker is in a unique position to take advantage of lighter-weight approaches to virtual networking because it doesn't have the legacy weight of hypervisor approaches. It would be a shame to simply follow the path laid by hypervisors without evaluating alternative approaches.

(NB: I spotted #8952 and will comment there as well, I'd like the Calico approach to be viable for integration with Docker regardless of whether it's the default.)

@erikh
Contributor
erikh commented Nov 5, 2014

I have some simple opinions here but they may be misguided, so please feel free to correct my assumptions. Sorry if this seems overly simplistic but plenty of this is very new to me, so I’ll focus on how I think this should fit into docker instead. I’m not entirely sure what you wanted me to weigh in on @shykes, so I’m trying to cover everything from a design angle.

I’ll weigh in on the nitty-gritty of the architecture after some more experimentation with openvswitch (you know, when I have a clue :).

After some consideration, I think weave, or something like it, should be the default networking system in docker. While this may ruffle some feathers, we absolutely have to support the simple use case. I think it’s safe to say developers don’t care about openvswitch, they care that they can start postgres and rails and they just work together. Weave brings this capability without a lot of dependencies at the cost of performance, and it’s very possible to embed directly into docker, with some collaborative work between us and the zettio team.

That said, openvswitch should definitely be available and first-class for production use (weave does not appear, at a glance, to be made for especially demanding workloads), and ops professionals will appreciate the necessary complexity with the bonus flexibility. The socketplane guys seem extremely skilled and knowledgeable with openvswitch and we should fully leverage that, standing on the shoulders of giants.

In general, I am all for anything that gets rid of this iptables/veth mess we have now. The code is very brittle and racy, with tons of problems, and basically makes life for ops a lot harder than it needs to be even in trivial deployments. At the end of the day, if ops teams can’t scale docker because of a poor network implementation it simply won’t get adopted in a lot of institutions.

The downside to all of this is if we execute on the above, that we have two first-class network solutions, both of which have to be meticulously maintained regularly, and devs and ops may have an impedance mismatch between dev and prod. I think that’s an acceptable trade for “it just works” on the dev side, as painful as it might end up being for docker maintainers. Ops can always create a staging environment (As they should) if they need to test network capabilities between alternatives, or help devs configure openvswitch if that’s absolutely necessary.

I would like to take the plugin discussion to the relevant pull requests instead of here; I think it’s distracting from this discussion. Additionally, the people behind the work on the plugin system are not specifically focused on networking but on a wider goal, so the best place to have that discussion is there.

I hope this was useful. :)

-Erik

@mavenugo
Contributor
mavenugo commented Nov 5, 2014

@thockin @jainvipin @shykes I just want to bring your attention to the point that this proposal tries to bring in a solid foundation for network plumbing and in no way precludes higher-order orchestrators from adding more value on top. I think adding more details on the API and integration will help clarify some of these concerns.

From the past, we have some deep scars from approaches that let non-native solutions dictate the basic plumbing model, leading to crippled default behavior and fracturing the community.
This proposal is to make sure we have considered all the defaults that must be native to Docker and not dependent on external orchestrators to define the basic network plumbing. Docker being the common platform, everyone should be able to contribute to the default feature set and benefit from it.

@mavenugo
Contributor
mavenugo commented Nov 5, 2014

@lukasa Please refer to a couple of important points in this proposal that exactly address yours:

"Our experience leads us towards using similar consistency protocol such as a tenant aware BGP in order to achieve the worry free environment developers and operators desire. This also presents an evolvable architecture if a tighter coupling into the native network is of value in the future."

"By extending L3 to the true edge of the network in the vSwitch it enables a proven network scale while still retaining the ability to perform disaggregated network services on the edge. Extending gateway protocols to the host will play a significant role in scaling a tight coupling to the network architecture."

Please refer to #8952 which provides the details on how a driver / plugin can help in choosing appropriate networking backend. I believe that is the right place to bring the discussion on including an alternative choice of another backend that will fit best in a certain scenarios.

This proposal is to explore all the multi-host networking options and the native Docker integration of those features.

@mavenugo
Contributor
mavenugo commented Nov 5, 2014

@erikh Thanks for weighing in. Is there anything specific in the proposal that leads you to believe that it will make the life of the application developer more complex? We wanted to provide a wholesome view of the network operations and choices in a multi-host production deployment, and hence the proposal description became network-operations heavy. I just want to assure you that it will in no way expose any complexity to the application developers.

One of the primary goals of Docker is to provide a seamless and consistent mechanism from dev to production. Any impedance mismatch between dev and production should be discouraged.

+1 to "I think it’s safe to say developers don’t care about openvswitch, they care that they can start postgres and rails and they just work together."
The discussion on OVS vs. Linux bridge + iptables is purely an infra-level discussion and shouldn't impact application developers in any way. Also, that discussion should be kept under #8952.

This proposal is to bring multi-host networking Native to Docker, Transparent to Developers and Friendly to Operations.

@rade
rade commented Nov 5, 2014

@shykes

absolutely we will need to couple this with a plugin architecture

+1

I reckon that architecturally there are three layers here...

  1. generic docker plug-in system
  2. networking plug-in API, sitting on top of 1)
  3. specific implementation of 2), e.g. based on OVS, user-space, docker's existing bridge approach, our own (weave), etc.

Crucially, 2) must make as few assumptions as possible about what docker networking looks like, such as to not artificially constrain/exclude different approaches.

As a strawman for 2), how about wiring a ConfigureContainerNetworking(<container>) plug-in invocation into docker's container startup workflow just after the docker container process (and hence network namespace) has been created?
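
To make the strawman concrete, here is a hypothetical rendering in Go; every name and field is illustrative, not an actual Docker API.

```go
package plugin

// NetworkPlugin is a hypothetical rendering of the strawman above: Docker
// would invoke ConfigureContainerNetworking exactly once, right after the
// container process (and therefore its network namespace) has been created,
// and the plugin decides how to wire it up (veth, OVS, weave, plain routes...).
type NetworkPlugin interface {
	ConfigureContainerNetworking(c Container) error
}

// Container carries the minimum a plugin would need to find the namespace.
// The fields are illustrative only.
type Container struct {
	ID        string // container ID
	Pid       int    // PID whose /proc/<pid>/ns/net is the container's netns
	NetnsPath string // e.g. "/proc/1234/ns/net"
}
```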

@dave-tucker Is this broadly compatible with your thinking on #8952?

@MalteJ
Contributor
MalteJ commented Nov 5, 2014

I would like to see a simple but secure standard network solution (e.g. preventing ARP spoofing; the current default config is vulnerable to this). It should be easy to replace with something more comprehensive. And there should be an API that you can connect to your network management solution.
I don't want to put everything into docker - sounds like a big monolithic monstrosity.
I am OK with a simple default OpenVSwitch setup.
With OVS the user will find lots of documentation and has lots of configuration possibilities - if he likes to dig in.

@titanous
Contributor
titanous commented Nov 5, 2014

I'd like to see this as a composable external tool that works well when wrapped up as a Docker plugin, but doesn't assume anything about the containers it is working with. There's no reason why this needs to be specific to Docker. This also will require service discovery and cluster communication to work effectively, which should be a pluggable layer.

@dave-tucker
Member

@erikh "developers don't care about openvswitch" - I agree.

Our solution is designed to be totally transparent to developers such that they can deploy their rails or postgres containers safe in the knowledge that the plumbing will be taken care of.

The other point of note here is that the backend doesn't have to be Open vSwitch - it could be whatever so long as it honours the API. You could theoretically have multi-host networking using this control plane, but linux bridge, iptables and whatever in the backend.

We prefer OVS, the only downside being that we require "openvswitch" to be installed on the host, but we've wrapped up all the userland elements in a docker container - the kernel module is available in 3.7+

@dave-tucker
Member

@rade yep - philosophy is exactly the same. lets head on over to #8952 to discuss

@nerdalert
Contributor

Hi @MalteJ, Thanks for the feedback.
"And there should be an API that you can connect to your network management solution."

  • A loosely coupled management plane is definitely something that probably shouldn't affect the potential race conditions, performance or scale of deployments, other than some policy float.
  • The basic building blocks proposed are to ensure a container can have networking provisioned with as little latency as possible which is ultimately local to the node. Once provisioned, the instance is eventually consistent with updates to its peers.
  • The potential network density in a host is a virtual port density multiplier beyond anything to date in a server, and it is typically solved in networking today with purpose-built network ASICs for packet forwarding. This is why we are very passionate about Docker having the fundamental capabilities of an L3 switch, complete with a fast path in the kernel or OVS actuated in hardware (e.g. Intel), along with L4 flow services in OVS; for performance and manageability this attempts to reduce as much risk as possible. The reasonable simplicity of a well-known network consistency model feels very right to those of us who have ever been measured on service uptime. Implementing natively in Docker captures a handful of the dominant network architectures out of the box, which reflects a Docker community core value of being easy to deploy, develop against and operate.
@maceip
maceip commented Nov 5, 2014

Wanted to drop in and mention an alternative to VxLAN: GUE -> an in-kernel, L3 encap solution recently (soon to be?) merged into Linux: torvalds/linux@6106253

@c4milo
c4milo commented Nov 5, 2014

@maceip agreed with you. It seems to me that an efficient and minimal approach to networking in Docker would be using VXLAN + DOVE extensions or, even better, GUE. I'm inclined to think that OVS is too much for containers but I might be just biased.

@maceip
maceip commented Nov 5, 2014

Given my limited experience, I don't see a compelling reason to do anything in L2 (ovs/vxlan). Is there an argument explaining why people want this? Generic UDP Encapsulation (GUE) seems to provide a simple, performant solution to this network overlay problem, and scales across various environments/providers.

@shykes
Contributor
shykes commented Nov 5, 2014

@maceip @c4milo isn't GUE super new and poorly supported in the wild? Regarding vxlan+dove, I believe OVS can be used to manage it. Do you think we would be better off hitting the kernel directly? I can see the benefits of not carrying the entire footprint of OVS if we only use a small part of it - but that should be weighed against the difficulty of writing and maintaining new code. We faced a similar tradeoff between continuing to wrap lxc, or carrying our own implementation with libcontainer. Definitely not a no-brainer either way.

@maceip
maceip commented Nov 5, 2014

@shykes correct, GUE would only be available in kernels >= 3.18. I realize this limits its applicability but wanted to make sure it was on your radar nonetheless. OVS is a nightmare; it's like they reimplemented libc...

@c4milo
c4milo commented Nov 5, 2014

@shykes what do you mean by poorly supported? it just landed in the mainline kernel about 1 month ago and it is being worked on by Google.

Regarding VXLAN+DOVE, it certainly can be managed by OVS and I believe work to integrate it into OVS already started as well as into OpenDaylight.

I guess the decision comes down to the sort of networking Docker wants to provide. You can get as crazy as you want with things like Opendaylight, OpenContrail, OVS and the like, or use something simpler/lighter like VXLAN+DOVE or GUE which wouldn't have a fancy control plane or monitoring but that gets the job done too.

@shykes
Contributor
shykes commented Nov 5, 2014

By "poorly supported" I simply mean very few machines with Docker installed
currently support it.


@rmustacc
rmustacc commented Nov 5, 2014

As someone else who's at the coalface of building overlay networks based
on vxlan and thinking through some of the abstractions that might make
sense, I think that this is a useful first step. As a result, it's
raised a bunch of questions for me that I'd like to discuss; if this
should instead be directed elsewhere, let me know, but it feels pretty
central to the issue of Multi-Host Networking.

I'd like to approach this issue from a slightly different perspective
and focus less on how this is implemented in terms of the data plane of
the networking stack, but rather start from the perspective of a user
and what they'd actually like to build based on what our users are
building today at Joyent.

As folks are trying to migrate their existing applications into the
world of docker containers, there are a bunch of things that they do
from a networking perspective that aren't quite captured in multi-host
network deployment scenarios. The first is representing the following
classic networking topology that involves every instance existing on two
networks, with a distinct lb vlan, web vlan, and db vlan:

                   +----------+
                   | Internet |
                   +----------+
                  /            \
         +------------+    +------------+
         |            |    |            |
         |    Load    |    |    Load    |
         |  Balancer  |    |  Balancer  |
         |            |    |            |
         +------------+    +------------+
               |                  |
               |                  |
 +--------------------------------------------------------+
( )  VLAN                                                 |
 +--------------------------------------------------------+
      |           |            |            |           |
      |           |            |            |           |
  +------+    +------+     +------+     +------+    +------+
  | Web  |    | Web  |     | Web  |     | Web  |    | Web  |
  | Head |    | Head |     | Head |     | Head |    | Head |
  +------+    +------+     +------+     +------+    +------+
      |           |            |            |           |
      |           |            |            |           |
 +--------------------------------------------------------+
( )  VLAN                                                 |
 +--------------------------------------------------------+
            |               |              |
      +----------+    +----------+    +----------+
      | Database |    | Database |    | Database |
      +----------+    +----------+    +----------+

I believe that this use case is actually highly prevalent for a lot of
applications and represents a very common deployment model. As we move
to the world of Multi-Host Networking, these actually become important
and I think it's worth us taking a critical look at that before we bake
the backend implementation, as it may foreclose us on actually being
able to enable these cases.

From our observations, there are a bunch of open questions in the world
of Multi-Host Networking:

  • How do we specify multiple interfaces to a container to allow it to be
    on multiple networks?
  • How do we assign IP addresses or leave them to the IP Address
    Management System talked about in the proposal?

One of the abstractions that other orchestration and cloud providers
have is the notion of a logical network, which consists of some IPv4
or IPv6 subnet, a set of IPs that are usable inside of that subnet for
Virtual Machines and containers, and optional information that applies
to that network, such as gateways, additional routes, resolvers, etc.
Whether docker wants to have an abstraction like this that can be
integrated or just work in terms of the raw pieces like it does today,
seems like an open question.

From what we've done and what we've had customers ask us for, they often
want to be able to logically create those networks, but not always
manage it. Most are pretty happy with the IP address management system
assigning IPs, but some also want to select the IP address directly.
So before we go too much further into discussion about which technology
we should use in the backend, let's spend some time thinking about how
we want to actually use this from a CLI perspective when we're in the
world of Multi-Host Networking and our overlay networks allow us to have
multiple independent virtual L2 and L3 domains on the same host.

So in conclusion, while the discussion about how all this can be
implemented and the different overlay technologies we have available is
rather useful, we need to really step back and ask ourselves, what is it
we want our users to be able to do with this functionality first.

@jainvipin

@mavenugo I am convinced that the proposal doesn't preclude higher-order orchestration from adding more value; in contrast, maybe this proposal requires an orchestrator to do that (which I am okay with).
The point I was bringing up is that if we need to bring the entire data/control/management plane inside Docker natively, then the technical trade-offs should be discussed. Maybe what I am concerned about are not technical/architectural issues at all, if you or someone can address the concerns. So far I am hearing Dave, Brent and you say that you 'believe' in native integration, and I trust that your assessment is based on good technical merit; it is just that I want to know and be convinced about the reasons too. The first three specifics to discuss can be:

  • Increasing the code footprint of Docker: Almost all of the benefits you talked about in this proposal can be had with OVS without natively integrating the data/control/management plane. This assumes that the work is done outside in an orchestrator. Can you point out some benefits that are otherwise not possible?
  • Compatibility/versioning: do we require the Docker version to be compatible with the orchestrator's version of the APIs?
  • Inefficiency due to the extra hop through Docker: if I have to manage an OVS via the OVSDB/OF-CTL interface, then why take an extra hop via Docker, especially if Docker doesn't need to parse/understand the network configuration?

@thockin @jainvipin @shykes I just want to bring your attention to the point that this proposal tries to bring in a solid foundation for network plumbing and in no way precludes higher-order orchestrators from adding more value on top. I think adding more details on the API and integration will help clarify some of these concerns.
From the past, we have some deep scars from approaches that let non-native solutions dictate the basic plumbing model, leading to crippled default behavior and fracturing the community.
This proposal is to make sure we have considered all the defaults that must be native to Docker and not dependent on external orchestrators to define the basic network plumbing. Docker being the common platform, everyone should be able to contribute to the default feature set and benefit from it.

@jainvipin

@nerdalert:
+1 on the problem (potential virtual port density, etc.) and the use of OVS for performance/manageability in a feature-rich, production-grade system, and possible HW leverage.

Would all those benefits not come if the OVS control/data/mgmt plane is not natively integrated into Docker but is completely orchestrated from outside to provide the network intent? Given that the solution requires some network orchestrator/controller to talk to it, the simplicity comes from that entity/integration and perhaps not from native Docker integration. Having OVS as the default Docker bridge is good, but that may still not require all of the native integration.

The potential network density in a host is a virtual port density multiplier beyond anything to date in a server, and it is typically solved in networking today with purpose-built network ASICs for packet forwarding. This is why we are very passionate about Docker having the fundamental capabilities of an L3 switch, complete with a fast path in the kernel or OVS actuated in hardware (e.g. Intel), along with L4 flow services in OVS; for performance and manageability this attempts to reduce as much risk as possible. The reasonable simplicity of a well-known network consistency model feels very right to those of us who have ever been measured on service uptime. Implementing natively in Docker captures a handful of the dominant network architectures out of the box, which reflects a Docker community core value of being easy to deploy, develop against and operate.

@MalteJ
Contributor
MalteJ commented Nov 5, 2014

@jainvipin agree
Also I think part of docker's success is that you can use it as a tool: use it for different things and in different ways - just as you like.
Docker shouldn't get too big. If you want to add functionality add APIs and build an ecosystem (and maybe earn some money with that).

@mavenugo
Contributor
mavenugo commented Nov 5, 2014

@jainvipin @MalteJ I can see the disconnect with your understanding of the proposed solution. I will update the proposal with these details.

When we say native, we mean native control of the network backend (linux bridge / IP-Tables or OVS or other backend) from the plugin layer (#8952).
The control and mgmt mechanisms such as netlink, ovsdb, OF are all APIs exposed by the backends and will be used by the plugin/driver in order to manage that backend. We don’t need an external orchestrator to manage them.

This will keep the footprint small and free of external dependencies in order to get the network plumbing taken care of.

@mavenugo
Contributor
mavenugo commented Nov 5, 2014

@jainvipin

"Given that the solution requires some network orchestrator/controller to talk to it,"

Is there anything specific in the proposal that made you believe that the solution is based on a controller ?

@mavenugo
Contributor
mavenugo commented Nov 5, 2014

@c4milo

"I guess the decision comes down to the sort of networking Docker wants to provide. You can get as crazy as you want with things like Opendaylight, OpenContrail and the like, or use something simpler like VXLAN+DOVE or GUE which wouldn't have a fancy control plane or monitoring but that gets the job done too."

The proposal is trying to find that simplicity for multi-host Docker networking without the need for external controllers to manage the network plumbing, while at the same time not sacrificing functionality and performance. Please refer to @dave-tucker's comment on the performance comparisons. (We have more data to share on these comparisons shortly.)

Also I would recommend jumping to #8952 to discuss on the actual back-end choices via plugin model and we can hash out the API details together.

@jainvipin

@mavenugo:

OK - orchestrator/controller are generic terms. Let me be specific. There is an entity that launches the containers (i.e. calls docker run). It could be a small home-grown python script or a fancy multi-layered piece of software. Let's call it the container orchestrator. It is this entity that knows the application's deployment intent, its inter-connectivity to other applications, and its network profile (parameters including IP address, policies, etc.). Assuming the user input to launch applications goes to an orchestrator first, network intent can't be specified independently to Docker (even with the native integration mentioned here). Therefore, an external entity outside Docker is needed to establish the communication and prime the network with policy rules consistently across multiple hosts. I am calling that entity the orchestrator/controller.

So while the need for a controller is not specific to this proposal, the need for such an entity is not eliminated by this proposal either. To achieve the overall ease of deployment that is discussed as one of the desired goals in this thread, I am assuming that you'll need that entity (even with all that is proposed here). If we need such an entity, the rest of my comments tie in...

   "Given that the solution requires some network orchestrator/controller to talk to it,"

Is there anything specific in the proposal that made you believe that the solution is based on a controller ?

@jainvipin

@rmustacc:
+1 on the need for supporting multiple interfaces inside a container, also mentioned in my first comment [g] earlier.

[g] Multiple vNICs inside a container: Do the APIs proposed here (CreatePort) handle creation of multiple vNICs inside a container?

@mavenugo
Contributor
mavenugo commented Nov 5, 2014

@jainvipin
Simplicity is our #1 design objective here. As you can understand, we are NOT trying to design the orchestration entity that handles the higher-order bits. But the proposal is intended to work with any such entity on top, with well-designed APIs developed with the help of the community.
An example of a requirement that we are dependent on (as mentioned in the proposal) is Host Discovery. We are not going to come up with yet another implementation to discover hosts in the cluster, or try to fit in external entities to address that either. We believe that the Docker Cluster (#8859) proposal will fit in nicely for this particular use case and we will make use of such a native Docker solution. Same goes with modularizing the interface with the backend systems (via the Docker Plugin proposal #8968).

OK - orchestrator/controller are generic terms. Let me be specific. There is an entity that launches the containers (i.e. calls docker-run). It could be a small home-grown python script or a fancy multi-layered software. Let's call it container orchestrator. It is this entity that knows the application's deployment intent, its inter-connectivity to other applications, its network profile (parameter including ip address, policies, etc.). Assuming the user input to launch applications goes to an orchestrator first, network intent can't specified independently to docker (even with native integration mentioned here), Therefore, a need for an external entity that is outside docker to establishes the communication and prime the network with policy rules consistently across multiple hosts is needed. I am calling that entity orchestrator/controller.

@niclashoyer

Please make sure that there will be a reasonable API to implement alternative multi-host solutions. One alternative to overlay networks would be to just limit inter-container networking to IPv6, as partly discussed in #2974. While this may not be the ideal solution for most people, it is still a very simple one. Just assign every container a publicly routable IPv6 address and use netfilter (ip6tables) to limit/allow access between containers. I don't know about the performance, though.

@nerdalert
Contributor

@niclashoyer thanks for the feedback. Backend connectivity via the physical network is definitely a core option for performant packet forwarding.

  • The planned implementation should support RFC1918 addresses, MAC addresses and IPv6, all supported in current datapaths both in software and hardware. It's hard not to see Docker as the "killer app" that will drive significant IPv6 adoption for those not doing so already, in order to avoid more granular IPAM policy, along with the metadata fields inherent to IPv6 that many are looking to exploit.
  • As you pointed out, direct backend connectivity is a simple, well-understood and likely the easiest scenario for troubleshooting, since there is no NAT or tunnel encapsulation for inter-node traffic. A pluggable IPAM, or however the community defines IP provisioning, would presumably lend itself to this non-NAT/PAT scenario.
  • Backends can be bridged via plumbing VIDs into the physical fabric or advertised via L3, depending on the provider architecture and application requirements. Either way, path isolation between nodes can be accomplished by mapping the application isolation into L2 or L3 constructs in the physical fabric, or, for those more interested in overlays, into VNIs. It's as reliable as the connectivity between nodes/pods/racks, which is true for any scenario, and there is one less stateful process (NAT) that can potentially go south. That said, if v4 is used on the frontend or in some n-tier in the middle, stateful mappings are still there, just not at the same volume as managing a frontend pool only.
@ibuildthecloud
Contributor

While this is a nice write-up, I feel we are jumping ahead here and having a discussion that really doesn't need to happen. If you have network drivers as proposed in #8952 (I just proposed an alternate solution, but still a similar idea: #8997) and plugins (#8968), then this networking mode can be implemented completely outside of Docker as a plugin. There would be no need to get the community's approval on it.

@rade
rade commented Nov 6, 2014

@ibuildthecloud

I feel we are jumping ahead here and having a discussion that really doesn't need to happen.

+1

this networking mode can be implemented completely outside of Docker as a plugin. [...] no need to get the communities approval

AFAICT the proposal here is to replace docker's existing default networking mode with one that is based on OVS. That certainly does merit discussion, but ultimately no such decision can be made w/o having working, battle-tested code. The plug-in mechanism and network drivers enable such development and testing to be carried out.

So... make it work, make it good, and then decide whether it should be docker's default. Not the other way round.

@dave-tucker
Member

@ibuildthecloud @rade

I feel we are jumping ahead here and having a discussion that really doesn't need to happen.

+1

I disagree. This discussion does need to happen, as I think it's important for Docker to have a "batteries included" approach to solving networking in a Docker cluster.

AFAICT the proposal here is to replace docker's existing default networking mode with one that is based on OVS.

That is not the proposal. The proposal is to provide a means of exchange of network reachability information between hosts in a Docker Cluster. The API to the Network Driver would enable any backend implementation whether OVS or other (I won't deny our preference lies with OVS).

So... make it work, make it good, and then decide whether it should be docker's default. Not the other way round.

And produce yet another wrapper for docker run? I'd rather have a discussion with the community about how best to solve the problem without bypassing Docker.

@titanous
Contributor
titanous commented Nov 6, 2014

And produce yet another wrapper for docker run? I'd rather have a discussion with the community about how best to solve the problem without bypassing Docker.

That's not bypassing Docker, that is using Docker. Except instead of being a monolith, it becomes a composable tool.

@dave-tucker
Member

That's not bypassing Docker, that is using Docker. Except instead of being a monolith, it becomes a composable tool.

I guess it's up to the community to decide whether this functionality should be core to docker or not.
@shykes, @erikh thoughts?

@rade
rade commented Nov 6, 2014

And produce yet another wrapper for docker run?

I was under the impression that the combination of the plug-ins model and network driver would avoid that. I'd certainly consider that a highly desirable design objective for both.

@mavenugo
Contributor
mavenugo commented Nov 6, 2014

@ibuildthecloud @rade

While this is an nice write up I feel we are jumping ahead here and having a discussion that really doesn't need to happen.
There would be no need to get the communities approval on it.

I disagree on both of these points.
IMHO, having a very simple, solid, functional, manageable, performing networking infrastructure is extremely important for Docker to scale natively. This proposal provides the basics of native network plumbing that is needed to have multi-host Docker deployment with all of these properties.

Externalizing such a basic ingredient of a cloud infra will lead to crippled defaults and the fractured solutions we see happening in other communities (especially around networking).

@MalteJ
Contributor
MalteJ commented Nov 6, 2014

well, I think this proposal was communicated a bit poorly. But if you understand it as a plugin for the system described in #8952 I begin to really like it.
+1 from my side.

@bboreham
Contributor
bboreham commented Nov 6, 2014

The proposal is to provide a means of exchange of network reachability information between hosts in a Docker Cluster

Can you expand on that a bit? I see reachability listed as an issue, and as something that upstream networking can provide. I did not get from reading the words on this page that the proposal is about exchange of network reachability information.

@dave-tucker
Member

@bboreham

With this proposal the native implementation in Docker will:

  • Provide an API for implementing Multi-Host Networking
  • Provide an implementation for an Open vSwitch datapath
  • Implement a native control plane to address the scenarios mentioned in this proposal.

The control plane mentioned above is where reachability exchange takes place. The actual forwarding piece is handled by the Network Driver - we're offering up an OVS implementation, but we'd love to see others.
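Purely to illustrate what such a reachability exchange might carry between hosts, here is a hypothetical sketch; the record and its field names are invented, not taken from the proposal:

```go
// Hypothetical control-plane record: one host announces where a container
// endpoint is reachable so that peers can program their datapaths.
package controlplane

import "net"

// Reachability announces that a container endpoint lives behind a host.
type Reachability struct {
	ContainerID string
	MAC         net.HardwareAddr
	IP          net.IP
	HostAddr    net.IP // underlay address of the hosting node (e.g. its tunnel endpoint)
}
```

However the announcements are distributed (gossip, a shared store, the clustering proposal in #8859), the forwarding backend would only consume records of roughly this shape.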

@bboreham
Contributor
bboreham commented Nov 6, 2014

OK, so you're saying that the API and the control plane would go into the core of Docker, and the datapath would be a pluggable component, with an OVS implementation shipped as a default?

If so, would it be better to split the proposal into the implementation-neutral parts and the implementation-specific parts?

@monadic
monadic commented Nov 6, 2014

I think these proposals would benefit from declarations of interest.

So:

  • @dave-tucker @nerdalert and @mavenugo all work for socketplane (correct me if I am wrong)
  • @bboreham @rade and myself all work for weave (zettio)
  • some other folks work for docker, google, flynn.. I think we all know who is who there ;-)
@ibuildthecloud
Contributor

@dave-tucker @mavenugo @rade Let it be clear that I am not talking to the merit of this proposal. Even if this is to replace the default networking model of Docker we are still putting the cart before the horse. Docker has zero pluggability in terms of networking as it stands today. You can in fact do a lot with Docker and networking today, but it is all in a "wrapped" approach. If we want to expand the native functionality of Docker we need to do this in more of a staged approach. This proposal has far too many components and details to facilitate a useful discussion.

In #8997 I propose what I believe is a very simple high-level API. Once such a thing is in place we can start innovating in libnetwork and add all of the functionality described in this proposal. I'd venture to guess that most everyone in the discussion wants basically the same functionality. The problem is there is no one way that will make everyone happy. Let's build a simple framework in which we can build various implementations and let the community decide what is upstreamed, what is default, etc.

@Lukasa
Lukasa commented Nov 6, 2014

I'm with @ibuildthecloud here. I think building the plugin model first provides the best possible approach for Docker, allowing for the community to investigate different networking approaches before settling on one to 'standardise'.

@andreaturli
Contributor

If I get it correctly, the proposal is to replace docker's existing default networking model with one that is based on OpenVSwitch. I think it is a big shift which certainly deserves a lot of attention and eventually broad agreement in the community. Also, these kinds of proposals should be supported by working, shared, battle-tested code that can be easily evaluated, I think.

+1 to the integration with the plug-in mechanism: IMHO it is better to have multiple implementations of the same network abstraction rather than imposing a default solution based on OVS that may be overkill in most cases.

@mavenugo
Contributor
mavenugo commented Nov 6, 2014

@andreaturli Yes. This proposal provides a broad perspective of all the components required to have a solid base for a native multi-host networking solution.

And the companion proposal #8952 explains the plugin / driver mechanism for the actual back-end: OVS vs Linux Bridge vs user-land encap etc...

IMHO, having a very simple, solid, functional, manageable, performing networking infrastructure is extremely important for Docker to scale natively.

From our experience and battle-ground testing, OVS provides that solid base and we recommend it.

But it is implemented via the plugin / driver mechanism proposed in #8952.

@squaremo
Contributor
squaremo commented Nov 6, 2014

And the companion proposal #8952, explains the plugin / driver mechanism for the actual back-end

So is this proposal proposing any docker code changes over and above #8952?

@adamierymenko

Greetings all,

I'm the author of ZeroTier One and have been steered over to this discussion by a user who has done some independent work on integrating ZeroTier with Docker. ZeroTier One is open source.

I think the applicability of ZT1 to Docker is somewhat in line with this discussion but also somewhat orthogonal to it. Right now it's possible to use ZT1 inside a Docker container, giving that container its own portable virtual network port. Integration with the network stack is accomplished via tun/tap, which, while possibly more performant than pcap (though I'm not 100% sure there's a big difference), requires the container to be launched with "--device=/dev/net/tun --cap-add=NET_ADMIN". That might not be desirable for security or other logistical reasons in some deployments, especially if we eventually get to the point that it's otherwise safe to mix tenancy.

Nevertheless, it works. The principal focus of ZeroTier is to provide a network virtualization layer that captures a wide variety of use cases across many platforms (Docker, conventional VMs, VPN use cases, embedded devices, etc.) while placing significant emphasis on user experience. One of the core goals of the project from inception has been something that is "zero configuration" and where the software handles as elegantly as possible the details of plumbing, crypto, etc.

In the longer term I'm exploring options to improve performance. Possibilities include: integrating with kernel OpenVSwitch, allowing certain network paths to be designated as "trusted" to dispense with crypto, using mmap'd I/O, etc.

So this addresses some of the issues brought up in this thread, like communication across WANs and other "plumbing," but does not address the core issues of networking within the Docker core architecture itself.

What I'd like to do in addition to throwing my hat into the ring is to agree with those who have expressed a need for Docker's containers to support a variety of network augmentation approaches. I seriously doubt that something that provides good ease of use and performance over WANs and for typical work loads is going to perform well enough for someone who wants to run Docker on a data-intensive supercomputing cluster, nor do I think the solutions that would work there would be as easy to use and transparent as what most ordinary users would want.

Part of the performance problem that approaches like ZeroTier have is the problem of passing packets through the kernel's networking stack twice. This is a problem whether the approach is tun/tap or pcap. I'd like to advocate for some way -- perhaps OpenVSwitch kernel exposure or maybe even something more low-level and simple -- for networking augmentations within Docker containers to safely skip the second pass. Perhaps containers could be allowed some API to write a packet (with some restrictions like MAC or UDP enforcement) to the underlying network hardware, allowing any networking virtualization daemon within a container to send UDP without traversing the kernel's networking stacks a second time...?

Another thing that would help is if there were some way of doing IPC between containers on the same hardware node while skipping networking entirely. Is that possible? If so then a network virtualization layer could detect if another node were on the same HN and simply memcpy() the packet across the boundary. It would have to be done securely, and I'm not sure if Linux already has an API for anything like that. But it would take local-to-local communication completely off the kernel's native networking stack's back.

@jainvipin

agree with @ibuildthecloud and others on building flexible APIs as in #8997. So, +1 on plugin architecture (like libnetwork or equivalent)

In contrast, #8952 seems to suggest that the APIs (pre-create/post-create type of construct) are more to handle the data-path hooks and not control plane hooks.

@erikh
Contributor
erikh commented Nov 6, 2014

I do think it’s up to the community to a degree. We still have to maintain this regardless of who “owns” it. We’re also responsible for its quality.

This libnetwork stuff is kind of sidetracking this ticket, which is focused on a specific plugin for it (or something like libnetwork). Let’s strive to discuss the notion of integrating openvswitch here, and the problems that are specific to it, and perhaps focus on abstractions in the new libnetwork repository (once it’s created, at least).

JFYI I am working with @shykes to get the scaffolding in place for libnetwork. We have a #libnetwork on freenode if you want to continue the discussion on this topic there.

On Nov 6, 2014, at 6:51 AM, Dave Tucker notifications@github.com wrote:

That's not bypassing Docker, that is using Docker. Except instead of being a monolith, it becomes a composable tool.

I guess it's up to the community to decide whether this functionality should be core to docker or not.
@shykes https://github.com/shykes, @erikh https://github.com/erikh thoughts?


Reply to this email directly or view it on GitHub #8951 (comment).

@grkvlt
grkvlt commented Nov 6, 2014

I think I prefer #8997 where @ibuildthecloud is proposing what seems like a well thought out plugin model for networking.

I'm the founder of the Clocker project, which uses Apache Brooklyn to build a multi-host Docker cloud, and obviously networking plays a huge part for it to be useful. So far, I have been using Weave to link the hosts, because I need a simple solution that works on any public cloud environment and delivers at the very least a flat, private, shared LAN across all containers. So Weave is what I would describe as the Minimum Viable Network component for multi-host Docker. Now, large and complex applications have different needs, and for those sorts of use cases, i.e. the typical enterprise usage, I would love to be able to flip a switch and install a different network router on each host, and not have to change anything else. The way I use Docker would be unchanged, the way I attach containers to the network might be slightly different, but as long as there's an API for it, I can use things like the jclouds driver from @andreaturli to talk to it and set it up.

What I don't want is a one-size-fits-all solution, which seems to be what we are in danger of ending up with here. Docker is in a very powerful position when it comes to dictating what gets used, and an OVS solution as the default seems like overkill in most cases. Generally, Docker should be in the business of providing containers, which they do incredibly well, and other people like Weave or OVS can do what they do best, and add networking capabilities, and so on. We want to promote a healthy ecosystem with different providers filling appropriate niches.

@adamierymenko

@grkvlt Looks like I agree about #8997 -- I'm now reading through it and it looks very interesting. It looks like a much richer version of the network extensions API that Apple introduced for iOS 8.

@shykes
Contributor
shykes commented Nov 6, 2014

@grkvlt

Absolutely we will need to couple this with a plugin architecture. See #8968 for first steps in that direction.

At the same time, Docker will always have a default. Ideally that default should be enough for 80% of use cases, with plugins as a solution for the rest. When I ask about ovs as a viable default, it's in the context of this "batteries included but removable" model.

@shykes
Contributor
shykes commented Nov 6, 2014

@titanous

Absolutely we will need to couple this with a plugin architecture. See #8968 for first steps in that direction.

At the same time, Docker will always have a default. Ideally that default should be enough for 80% of use cases, with plugins as a solution for the rest. When I ask about ovs as a viable default, it's in the context of this "batteries included but removable" model.

@shykes
Contributor
shykes commented Nov 6, 2014

I would like to request that we continue this conversation with the understanding that Docker does not plan to build a monolithic "one size fits-all" tool, and composition is and always will be a dominant design principle for Docker. As a result of that, we prefer hooks and backend plugins over wrappers, because hooks and plugins are more composable than wrappers. How do you compose 4 different wrappers together? Answer: you can't. Eventually you need swappable backends somewhere. The question is: in which tool? If you're developing a wrapper and want to expand its scope, you want your wrapper to have pluggable backends, as opposed to just having them in Docker. But now we're no longer talking about Docker's modularity. We're talking about scope creep of your wrapper.

I'm happy to continue this discussion elsewhere (for example in #8968), but as far as this PR is concerned, I'm operating under the assumption that everyone understands this.

@RmsMit
RmsMit commented Nov 6, 2014

I agree that this needs fixing but I think a lot of the issues go away or at least get more simple once using IPv6.

The container could automatically get a link-local IPv6 address of the form FE80::D0C1:<containerID>. Notice the container ID is 16 bits short of the full local half of the IPv6 address, so I have padded it with D0C1, which could be used as a prefix for docker container IPs. Think of it as short for "docker netgroup 1". For those of you who don't know IPv6, a link-local address is only accessible on the same link segment (so ethernet port or hub); it will not pass through an IP router. In essence it could be treated like the current docker host-only 172.16.1.x subnet.

If the physical host has an IPv6 address for its external network (let's say 2001:0:2010:0:1::0010) then the container could also get a routable IPv6 address on the same network/subnet, which would be 2001:0:2010:0:1:D0C1:<containerID>.

This way, knowing the containerID would be all you need to access any container in the same network. All the host would need to do is have IP forwarding enabled; no special routing software needed. Of course a virtual router or switch could be used if you wanted something more complex, but I don't think it would be necessary in most cases.

Even if the application in the container only knows IPv4, it is likely to be listening on its port on both IPv4 and IPv6, as that is the behaviour of modern Linux kernels. If it is an issue, a port mapping would fix it, i.e. map exposed ports for the container from <IPv6 address>:port to <IPv4 address>:port. This way a container that only works with IPv4 will still be accessible via its network-wide IPv6 address.
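A rough sketch of the address derivation described above, assuming the D0C1 marker and the short 48-bit container ID from the comment (the prefixes and the example ID are placeholders, and the marker/prefix lengths are this comment's suggestion, not settled design):

```go
// Build container IPv6 addresses as <prefix> + 0xD0C1 + leading bytes of
// the container ID, truncating the ID to whatever space remains.
package main

import (
	"encoding/hex"
	"fmt"
	"net"
)

// containerV6 copies prefixBytes bytes of the prefix, writes the 0xD0C1
// "docker netgroup" marker, then fills the rest with the container ID.
func containerV6(prefix net.IP, prefixBytes int, containerID string) net.IP {
	addr := make(net.IP, net.IPv6len)
	copy(addr, prefix.To16()[:prefixBytes])
	addr[prefixBytes] = 0xD0
	addr[prefixBytes+1] = 0xC1
	id, _ := hex.DecodeString(containerID) // sketch: error handling omitted
	copy(addr[prefixBytes+2:], id)         // truncate the ID to what fits
	return addr
}

func main() {
	id := "9f2c41d83ab5" // made-up short container ID (48 bits)
	// Link-local: fe80::/64 prefix, so D0C1 + 48 bits of ID fill the rest.
	fmt.Println(containerV6(net.ParseIP("fe80::"), 8, id))
	// Routable: an assumed 80-bit host prefix, so only 32 bits of ID fit.
	fmt.Println(containerV6(net.ParseIP("2001:0:2010:0:1::"), 10, id))
}
```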

@Lukasa
Lukasa commented Nov 7, 2014

@RmsMit Agreed that this is easier with IPv6. However, note that it limits you to having only IP connectivity between containers. I think this isn't a problem (full disclosure, I work on an open source project that does just this for hypervisors), but it doesn't remove the need for a pluggable network layer because users may want to extend a layer 2 broadcast domain between containers.

@monadic
monadic commented Nov 7, 2014

@shykes thanks for that. It definitely helps to steer folks towards #8968

I'd urge everyone to be very cautious about "putting the cart before the horse" with networking. Even with the very best intentions, it is quite easy to put a lot of work into making a total mess. See this cautionary tale of Neutron for example: http://www.theregister.co.uk/2014/05/13/openstack_neutron_explainer/

@mavenugo
Contributor
mavenugo commented Nov 7, 2014

@shykes Thanks for the clarification. This proposal and #8952 are opened with the plugin/driver architecture in mind as defined in #8968.

@monadic One of the biggest reasons for this proposal is to avoid the issues seen in

See this cautionary tale of Neutron for example: http://www.theregister.co.uk/2014/05/13/openstack_neutron_explainer/

The reason we propose a solid native networking infra is to avoid crippling defaults and to avoid such dependencies on external entities / wrappers / vendor specific solutions.

I fully support @shykes approach here. We all must work together and strive towards this:

At the same time, Docker will always have a default. Ideally that default should be enough for 80% of use cases, with plugins as a solution for the rest. When I ask about ovs as a viable default, it's in the context of this "batteries included but removable" model.

@mavenugo
Contributor
mavenugo commented Nov 7, 2014

@Lukasa Can you please expand on the limitations that you see in @RmsMit's suggestion on using IPv6? In the long run I certainly see value in it, and a well-designed plugin architecture should help us experiment with different approaches, including IPv6-based solutions.

@Lukasa
Lukasa commented Nov 7, 2014

@mavenugo Certainly. If connectivity is provided at the IP layer, any container that expects a layer 2 segment between it and another container cannot function. For example, you cannot ARP between two containers. You cannot run a non-IP protocol (e.g. VRRP).

I think these limitations are totally reasonable for 99% of use cases (again, I work on Project Calico, which is built on this approach). I'd be happy to go that direction for Docker's default behaviour, I think it would give Docker a great simple networking story. My only warning is that the well-designed plugin architecture is mandatory, regardless of what approach we take, because all of them are limited.

TL;DR: Getting the plug-in architecture right matters more than getting the default right.

@monadic
monadic commented Nov 7, 2014

@Lukasa +1

@mavenugo I'm a bit confused about what you guys are proposing now. But, I am sure it will become clearer after some progress on #8968

@nerdalert
Contributor

@Lukasa thanks for the thoughts,

  • Base components should allow for nearly any predominant configuration. ARP is definitely important, and having local responders is essential for east/west traffic; being distributed will be more important than ever when we start dropping thousands of IPs into a single Docker host.
  • Whether a backend is extended via L2 v4/v6, L3 v4/v6, an overlay/underlay datapath etc., enabling that is pretty straightforward with some basic network constructs. That's a reason we found it important to start the conversation around having some basic components natively: to encourage people to expect more than just a port from the Docker infra. As @shykes termed it perfectly, having optional batteries included natively will reduce some painful race conditions and give us more time to focus on higher-order value to devs. It's not cutting out vendor opportunities from our perspective in any way, because there are so many facets of networking, and of how the dev interacts with it, that can be improved; it will ultimately avoid some adoption barriers due to a missing/fragile foundation. Great convos, appreciate it!
@eyakubovich
Contributor

Having that 80% batteries-included solution for multi-host networking is a great goal. In practice, though, any technique trying to achieve this goal is going to make serious assumptions about the underlying infrastructure (e.g. availability of bcast/mcast, tolerance/intolerance of a SPF, integration with 3rd-party orchestrators). For that reason, I would rather concentrate on defining the least opinionated plugin interface but stop short of shipping a default solution with Docker.

@adamierymenko

Side note to add to the discussion re: bcast/mcast:

Multicast support -- or at least a plugin architecture rich enough to allow people to use network virtualization solutions with multicast support -- is more important than most people think it is.

When I first wrote ZeroTier One I included a multicast algorithm optimized for slow, seldom-used multicast use cases like mDNS/Bonjour and Netbios, on the assumption that service announcement is the only thing people use multicast for. Wrong. Recent development has focused on overhauling multicast for reduced latency, since I found that there were a ton of users who wanted low-latency reliable multicast for things like databases, cluster compute applications, etc. I was surprised by how much interest there was in strong multicast support.

@Lukasa
Lukasa commented Nov 7, 2014

@nerdalert Agreed that there's no inherent problem with having a 'batteries included' approach built around OVS. My caution is built out of some experience with OpenStack's Neutron, the poster child for the natural follow-on mistake, which is to build a plugin API around network topologies rather than the connectivity graph.

I'd be perfectly happy for Docker to use OVS for multi-host networking by default, my primary interest is in ensuring that Docker's plugin API is scoped appropriately for other approaches to work. The rest of the community involved in these discussions appears to have their eyes open to this risk, so I'm hopeful we can come up with something great.

@adamierymenko Massive +1. Multicast support is definitely important, and it's a key pillar of Project Calico's networking approach, so I for one will be keeping an eye on enabling it.

@mavenugo
Contributor
mavenugo commented Nov 7, 2014

@adamierymenko @eyakubovich @Lukasa Excellent points. +1 from me as well. These are some of the infrastructure components that we need to hash out as part of the Network plugin APIs and, as suggested, keep least opinionated. Even Docker discovery mechanisms for Clustering can benefit massively from Multicast support in the infrastructure. It's refreshing to see that interest in Multicast is growing outside of the financial & media applications.

The rest of the community involved in these discussions appears to have their eyes open to this risk, so I'm hopeful we can come up with something great.

Exactly. That is the main intent for having these proposals discussed and hashed out in a very constructive fashion.

From the Docker Advisory board meeting (https://blog.docker.com/2014/11/guest-post-notes-on-the-first-docker-advisory-board-meeting/) :

Contributors are encouraged to start a major piece of work by submitting a “documentation” pull request so that a conversation can happen around design before the implementation gets too far without review.

@erikh
Contributor
erikh commented Nov 8, 2014

Hi folks, we’re trying something new here and hopefully it is useful.

https://docs.google.com/a/docker.com/forms/d/1EK6j5pEE14dHrxB2DAkwjiMg0KzDpMN__o-QkIX9OcQ/viewform?c=0&w=1

It is a survey we put together to help us address your concerns, and if you have strong opinions I would encourage you to fill it out. However, the discussion is vital here. If you’d prefer to say something only once, I’d rather have it in here (where we can discuss and aggregate the results of the discussion) than in the survey.

We will be using these comments — and the survey results — as we figure out what docker’s networking model should look like, probably in the upcoming week when we attempt to drill down on some of this stuff. Additional surveys will be issued as we have more consensus and smaller topics need to be tackled instead. For anyone wondering, no decisions will be made without a proposal being filed and reviewed, per our usual process.

If you don’t like this approach, feel free to tell us that too; we’re trying something new here and are totally open to alternative approaches. Discussion here is extremely useful, and we’re just trying to get a focused response from each stakeholder independently.

@titanous
Contributor
titanous commented Nov 8, 2014

@erikh Would you also post a read-only link to the responses?

@erikh erikh self-assigned this Nov 8, 2014
@jdef jdef referenced this issue in mesosphere/kubernetes-mesos Nov 8, 2014
Open

Networking TBD. #5

@shykes
Contributor
shykes commented Nov 9, 2014

Hi everyone. After reading all the comments in this thread, and looking up
the occupation of the participants, it occurred to me that we should start
requiring disclaimers.

If you express the strong opinion that Docker should NOT ship a default
multi-host networking solution, and it turns out your employer is selling a
custom multi-host networking solution for Docker... I believe readers ought
to be informed of that.

Any objections to starting a policy of strongly encouraging disclaimers
across all discussion threads?

On Fri, Nov 7, 2014 at 4:19 PM, Erik Hollensbe notifications@github.com
wrote:

Here you go:
https://docs.google.com/spreadsheets/d/1fNrTR25N6t9TEdEs5fHWGc3XQeydNCIvtwWBGMNwvvw/edit?usp=sharing

The responses are kind of hard to consume in this form, I’m aware :)

-Erik=


Reply to this email directly or view it on GitHub
#8951 (comment).

@monadic
monadic commented Nov 9, 2014

@shykes of course people should provide disclaimers - we already did (and to my first list, I should add @squaremo from our weave team). It is unlikely that anyone on this thread 'has no agenda'. At the same time, being in favour of defaults, or not, plugins, or not, OVS, or not... is not automatically suspect just because of an agenda.

In general, GH issues are not ideal as a discussion forum for things like this. Also, nobody knows what the status and process of these proposals is. Or whether they are important at all, or just a chance for all of us to make some noise.

@shykes
Contributor
shykes commented Nov 9, 2014

I agree github threads are not ideal.

I think it's totally fine to "have an agenda" - it should just be made
clear to the reader so it can inform their analysis. If you're already
offering disclaimers, that's great - more people should follow your lead.

When you say "nobody knows" I assume you mean "I don't know". The status of
this particular proposal is "under review" since it has not been closed or
labeled to indicate otherwise. The importance of these proposals is high -
it is part of an effort by the project maintainers to make it easier for
more people to participate in the design of important features earlier.
None of this is easy, so we are constantly looking for ways to improve -
for example we have been experimenting with "proposals as docs patches"
instead of "proposals as github issues".

All of this could be communicated more efficiently. As I'm sure you know
communication is hard, especially for a project at this scale. If you have
time to invest in improving the communication tools and processes of the
project, the door is open!

On Sun, Nov 9, 2014 at 1:16 PM, alexis richardson notifications@github.com
wrote:

@shykes https://github.com/shykes of course people should provide
disclaimers - we already did (and to my first list, I should add @squaremo
https://github.com/squaremo from our weave team). It is unlikely that
anyone on this thread 'has no agenda'.

In general, GH issues are not ideal as a discussion forum for things like
this. Also, nobody knows what the status and process of these proposals is.
Or whether they are important at all, or just a chance for all of us to
make some noise.


Reply to this email directly or view it on GitHub
#8951 (comment).

@monadic
monadic commented Nov 9, 2014

@shykes thanks.

The process for Proposals is a really important issue. E.g. if it is not handled well, then 'ownership' of proposals will be seen as a proxy for ownership of features. This will lead to 'land grab' behaviour (if it hasn't already..).

Please can you confirm that Docker sees itself as responsible for clear communication in this area. I don't see how the community can substitute at this time - though perhaps that is something to aim for?

In terms of suggestions - I'm sure we all have plenty. Last week in London some of us met up informally and captured a few ideas on this. I'll email you offline to follow up. At the meeting it was very clear that nobody outside Docker yet knows how the system is meant to work (so, it's not only me that doesn't know ;)

Some suggestions included (1) three proposal stages: RFC, Incubated, Final; (2) appointing two maintainers to every incubated proposal, and maybe more to 'final'; (3) requiring running code wherever possible, and two implementations if interop is proposed; (4) appointing a team to clean up and delete dead proposals.

HTH

alexis

@Lukasa
Lukasa commented Nov 9, 2014

Agreed that it's worth knowing commenters' backgrounds. I believe I have already communicated most of this, but: I'm employed by Metaswitch Networks to build an open-source layer 3 virtual networking stack, which we hope to plug into (amongst other things) Docker. Note that I am not pushing for the Calico approach to be the Docker default (though I think it would be awesome if it was), simply for making it possible.

@jainvipin

As apparent from my official email (in my profile and in docker-dev), I am employed by Cisco/Insieme-Networks to build hardware/software/middle-ware for networking and server products lines specifically targeting data-center.

Coming from different employers can also bring perspectives on different use-cases that otherwise would not easily come together if this were not done in public. I am definitely looking forward to progressively more technical discussions, arguments presented based on data/experiments, and contributing to working code that would ultimately weigh in big for the docker ecosystem.

@nerdalert
Contributor

@dave-tucker @nerdalert @mavenugo are here as part of SocketPlane.io. We are here to work with the community in developing and maintaining a multi-host networking solution that is native to Docker in a manner that:

  • Stays true to the Docker philosophy of liberating the developer from network infrastructure constraints and doing so in a dead simple fashion, that is empathetic to the Docker first-timer, developer and ops.
  • Work with the Docker community to deliver scalable, production-ready deployments out of the box.
  • Do our part helping make Docker kick so much ass that adoption isn't even a question.
@shykes
Contributor
shykes commented Nov 12, 2014

Thanks everyone for playing ball, I do think this kind of disclosure will
help the community trust what they read. Transparency never hurts in my
experience.

On Mon, Nov 10, 2014 at 10:07 AM, Brent Salisbury notifications@github.com
wrote:

@dave_tucker @nerdalert https://github.com/nerdalert @mavenugo
https://github.com/mavenugo are here as part of SocketPlane.io
http://socketplane.io/press/sdn-experts-unite-to-bring-devops-defined-networking-to-docker-users/.
We are here to work with the community in developing and maintaining a
multi-host networking solution that is native to Docker in a manner that:

  • Stays true to the Docker philosophy of liberating the developer from
    network infrastructure constraints and doing so in a dead simple fashion,
    that is empathetic to the Docker first-timer, developer and ops.
  • Work with the Docker community to deliver, scalable, production
    ready deployments out of the box.
  • Do our part helping make Docker kick so much ass that adoption isn't
    even a question.


Reply to this email directly or view it on GitHub
#8951 (comment).

@shykes
Contributor
shykes commented Nov 12, 2014

@monadic yes, the maintainers of the Docker project (which include but are not limited to Docker inc. employees) are obviously responsible for communicating the rules of the project, and helping contributors be more successful.

We definitely appreciate all suggestions for improving the project.

That said @monadic: so you had lunch with people who don't know how the project runs. How is this evidence that "nobody" knows how the project runs? I will be the first to admit that project communication could be vastly improved in many ways - some scale issues are to be expected for a project that's merged 10,000+ pull requests from 600+ people. However, I know for a fact that you had no experience getting involved with the project before joining this thread. I'm going to guess that the people you had lunch with didn't, either. Otherwise they could have explained to you the basics. So why not spend some time on irc and the mailing list, send a few patches, make a few proposals, and then tell me what is and is not clear? Your feedback will carry a lot more weight that way.

@monadic
monadic commented Nov 12, 2014

@shykes Thanks for confirming that the maintainers are responsible for communicating the rules of the project.

I am not commenting on how the project overall 'runs'. The specific area on which I believe there is confusion, is the "Proposals" process, which I believe to be somewhat new.

What does it take for a Proposal to become part of Docker? What does it take for a Proposal to be deemed 'dead'? Etc. This is the area where I think some clarity would help. I'm trying to put that question to everyone here. Maybe someone can point me to the answer. We are all here to learn and help as best we can :-)

You say that "you had no experience getting involved with the project before joining this thread. I'm going to guess that the people you had lunch with didn't, either. Otherwise they could have explained to you the basics". Well - thanks for that. The people I spoke to did include staff of Docker Inc who were super helpful in lots of ways. But, it was stated that the Proposals process is new, not yet fully formed, and feedback was solicited.

I am simply moving that Q&A on Proposals into the public domain - which I believe is good for everyone. With the greatest respect, I feel this topic merits attention, regardless of how long anyone has been working on Docker or anything relating to it.

@erikh
Contributor
erikh commented Nov 12, 2014

Can we put this into another issue? I’d rather not have this clog up this great proposal.

Additionally: https://github.com/docker/docker/blob/master/CONTRIBUTING.md#design-and-cleanup-proposals can be cited in this new issue.

On Nov 11, 2014, at 11:07 PM, alexis richardson notifications@github.com wrote:

@shykes https://github.com/shykes Thanks for confirming that the maintainers are responsible for communicating the rules of the project.

The specific area on which I believe there is confusion, is the "Proposals" process, which I believe to be somewhat new. What does it take for a Proposal to become part of Docker? What does it take for a Proposal to be deemed 'dead'. etc etc. This is the area where I think some clarity would help. We are all here to learn and help.

You say that "you had no experience getting involved with the project before joining this thread. I'm going to guess that the people you had lunch with didn't, either. Otherwise they could have explained to you the basics". Well - thanks for that. The people I spoke to did include staff of Docker Inc who were super helpful in lots of ways. But, it was stated that the Proposals process is new, not yet fully formed, and feedback was solicited.

I am simply moving that Q&A on Proposals into the public domain - which I believe is good for everyone. With the greatest respect, I feel this topic merits attention, regardless of how long anyone has been working on Docker or anything relating to it.


Reply to this email directly or view it on GitHub #8951 (comment).

@monadic
monadic commented Nov 12, 2014

filed #9114

thanks @erikh :-)

@mindscratch

+1

@fleitner

Open vSwitch has momentum and certainly allows us to build interesting networking plumbing. The same happened with Linux Bridge + iptables, etc. I wouldn't be surprised if in the next few years some other killer technology shows up. My point is that having complex networking plumbing tightly integrated with Docker doesn't seem like a good idea at all. There are too many different requirements to fulfill, and that doesn't seem to be the goal of Docker. Perhaps it would be better if Docker could provide stable APIs that allow external tools to do complex networking plumbing.

@mavenugo
Contributor

@fleitner Thanks. The choice of back-end is being addressed with #8952 and similar proposals that are dependent on #8968.
When it comes to the question of multi-host network plumbing, we believe that having a native solution as described in this proposal will certainly help the Docker community in general. As we stated in previous comments, this by no means restricts external tools from adding value. In fact, it will be the contrary: it will encourage external tools to strive to provide more value on top of Docker, and not to worry about providing a basic infrastructure service like network plumbing. Also, the plugin approach provides greater flexibility to extend beyond what is proposed here if we have a richer set of stable plugin APIs, as you mentioned.

And for the sake of disclaimers and for everyone's benefit, I would like to point to @shykes request

Hi everyone. After reading all the comments in this thread, and looking up the occupation of the participants, it occurred to me that we should start requiring disclaimers.
If you express the strong opinion that Docker should NOT ship a default multi-host networking solution, and it turns out your employer is selling a custom multi-host networking solution for Docker... I believe readers ought to be informed of that. Any objections to starting a policy of strongly encouraging disclaimers across all discussion threads?

@mavenugo
Contributor

I realized we just reached 100 comments on this topic. I would like to thank everyone for the level of interest and awesome participation.

Though each of us approach these problems from different points of views, the level of participation shows that all of us care deeply about the success of this community. Special Kudos to the Maintainers for fostering such an open discussion.

@fleitner

Red Hat employee here.

@mavenugo If it by no means restricts any external tool from adding value, why does this specific one need to be tightly integrated? As I said, making Docker more friendly to external tools brings simplicity and flexibility to the project, so that you can implement the plumbing you are proposing, or any other solution, outside. It doesn't necessarily require adding the actual tool to Docker.

@monadic
monadic commented Nov 13, 2014

I have published a blog post that sets out what we at the Weave team think is the best way forward for networking in Docker. It is here - http://weaveblog.com/2014/11/13/life-and-docker-networking/

@lloydde
Contributor
lloydde commented Nov 13, 2014

@monadic your position somewhat surprises me as you acknowledge here the OpenStack Neutron (previously quantum) mess. I'm not a networking person, I'm in QA and I was full time involved in OpenStack starting in 2011 for 1.5 yrs during the vendor fest. Speaking with friends it is still a mess.

It's essential that Docker be production ready "out of the box" and not have a compromised architecture. Being pluggable is important, but secondary. I'm anxiously optimistic to see everyone stay focused on this.

@Lukasa
Lukasa commented Nov 13, 2014

@lloydde I don't think that's the lesson to learn from Neutron. Neutron's error is that it requires users to understand how networks work to configure it, and thus forces networking plugins to bend to its specific view of the world.

Neutron failed to understand the point being made elsewhere in this thread and the related ones, which is that different users have very different networking use cases. It is inadvisable to have a default that attempts to handle them all, lest it die under its own weight. IMO, Docker should be focused on making connectivity possible, and should leave advanced network topology configuration to plugins. That is not an excuse for not making those plugins possible. I'm anxious to see Docker treat plugins as a first-class citizen, not a second-class one behind its own 'blessed' approach.

@monadic
monadic commented Nov 13, 2014

@lloydde thank-you for reviewing my post and commenting. I think @Lukasa basically nailed it in his response but let me add a few remarks here.

Purely and completely hypothetically: Let's say that Company X writes their own clustering, container group management, storage and networking implementations, tests them against Docker extensively, writes a statement on their web site that these are in their view the best plugins to use with Docker, and offers a downloadable package to install Docker with those plugins. Then, all that stuff should just work OOTB, batteries included, lifetime warranty, t-shirt and free lifetime passes to all relevant magazines and conferences included.

I believe that the above example would be a win for any customer that was comfortable doing business with Company X. But it would not be the 'default' and it wouldn't claim to be the 80% case, 60% case, or any other % case. Indeed, a competing Company Z could offer a different set of plugins, do testing, and offer up a similar package and warranty.

In other words... we can all do this. We can all do what we think is best for customers, without getting stuck in a set of use cases invented in the last months of 2014.

I hope this makes sense.

@lloydde
Contributor
lloydde commented Nov 13, 2014

OpenStack lost its default networking as development chased SDN. Neutron has no view of the world. That is the reason it is complex and difficult. Led by proprietary vendors, the focus on being pluggable was so complete that, multiple major releases later, essential functionality available in nova-network was still missing. It stung more that the new goal of SDN wasn't possible without one of those expensive proprietary solutions.

I'm confident that by focusing on a solution in Docker, a solid native networking infrastructure, the proponents of other approaches will ensure they are not blocked out. Pluggable-first is the road to nowhere. And I don't see that as being at risk here. The expertise is here, and y'all makers' continual focus on "consuming infrastructure - not defining it" will see us through. Each of the related github issues seems to focus on practical designs, and in many areas they look like different parts of compatible solutions as people approach from system, network and API composition. As these proposals continue to be fleshed out, it might be useful to weigh conflicts and incompatibilities.

@Lukasa OpenStack is always in the context of the incredibly fragmented networking hardware space. This is why different operators/users are familiar with different networking particulars and dialects of describing them, but I don't think that is the same as them having very different use cases. Still, I imagine you didn't mean to suggest that anyone is proposing a default that attempts to handle all use cases. Thankfully, Docker gets to stay focused on the virtualized.

I do agree with my colleague @rmustacc that discussions could more explicitly be in the perspective of users, but that isn't why I jumped in. It was the invocation of OpenStack. I do also disagree with the argument of network topologies vs connectivity graph, as the network topologies here align with secure connectivity, not hardware configurations. Are there examples in context where this leads to legacy solutions?

Anyway, I'll get back to learning Go so I can help further test coverage in the existing functionality as I continue to get up to speed on Docker.

@mavenugo
Contributor

@lloydde 👍 You articulated the fragmentation issue very well. That's one of the important lessons to be learnt and, as a community, one we should never repeat.

@jainvipin

@mavenugo, @lloydde, @Lukasa, @monadic: good to see people trying to iron out perspectives.

Are we mixing/confusing 'default' and 'native'? To me 'default' is what gets shipped/packaged with Docker, whereas 'native' suggests it is architecturally tied together. If the community wants to build a default plugin that gets packaged with docker, that's well and good. However, if we build something 'native' that doesn't leave room for others to innovate because of implicit architectural assumptions, I am sure there will be objections to that (or, at least, I have one).

Further, let's also not mix 'ease of use' with either 'native' or 'default'. i.e.

  • 'native' solution can be easy to use, or difficult to use
  • 'default' plugin can be easy to use, or difficult to use

The right choices of APIs, future-proofing of the interface, interoperability with older versions, etc. will make it easy to write better plugins and thus together it can be 'easy to use' in any of the above two cases.

Perhaps the lessons from neutron are telling us to create a 'default' plugin that consumes the same interface as any 'third party' plugin would, and not to create a 'native', tightly integrated solution.

@NetCubist

TL;DR

Disclaimer: I am working on a networking solution for Docker as well :-)

I think there are 2 different issues that we are talking about here:

  1. Does the network setup belong in Docker (native)?
  2. Should Docker ship a default network solution?

On the first: as a lot of folks have mentioned, conflating Docker with networking functionality is not a good idea. We have to consider the scale implications here. Containers are not the same as VMs. There is at least an order of magnitude difference in scale, and add to that the fact that they can be instantiated and taken down rapidly. If Docker were to assume the responsibility of creating a port, adding it to OVS, provisioning it etc. at this scale, it would be swamped. A better option is for Docker to fork this off to a network agent. We have to look at it from a higher-level architectural perspective and design it for scale from the get-go.

On the second: does Docker need to ship with a default networking solution? I would say yes. Does it have to be OVS? I am not so sure. OVS is heavy. True, it supports all of the neat use cases, but when we talk about 80% of the users, do they require all of this functionality? It would be good to get a perspective on what we think are the use cases that satisfy 80% of the users. If, as a user, I need to understand OVSDB and vswitchd, let alone BGP, to debug any problems that I might face, that is a red flag. Besides, I think it is a bit premature to tip the hat to OVS. Sure, OVS performs better than most in throughput tests, but what we should consider is end-to-end provisioning at scale and in flux, coupled with throughput.

Linux natively supports vxlan. A Linux bridge coupled with vxlan, plus service discovery to find tunnel endpoints, would be a simple and good-enough starting point.
For scale and more advanced features, users can fall back to more sophisticated solutions.
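Purely as an illustrative sketch of that starting point: the interface names, VNI and peer address below are placeholders (a real agent would learn peer VTEPs from service discovery rather than hard-coding them), and it assumes iproute2 and root privileges on the host:

```go
// Create a VXLAN tunnel endpoint, attach it to a Linux bridge, and point
// flooded traffic at one remote host, driving the iproute2 CLI.
package main

import (
	"log"
	"os/exec"
	"strings"
)

func ip(args string) {
	if out, err := exec.Command("ip", strings.Fields(args)...).CombinedOutput(); err != nil {
		log.Fatalf("ip %s: %v (%s)", args, err, out)
	}
}

func main() {
	// VXLAN endpoint on the host uplink, VNI 42, standard UDP port.
	ip("link add vxlan42 type vxlan id 42 dev eth0 dstport 4789")
	// Bridge that the containers' veth ends and the VTEP both plug into.
	ip("link add name docker-br0 type bridge")
	ip("link set vxlan42 master docker-br0")
	ip("link set vxlan42 up")
	ip("link set docker-br0 up")
	// Flood unknown/broadcast traffic to a peer host learned out of band.
	fdb := exec.Command("bridge", "fdb", "append", "00:00:00:00:00:00",
		"dev", "vxlan42", "dst", "192.0.2.11")
	if out, err := fdb.CombinedOutput(); err != nil {
		log.Fatalf("bridge fdb: %v (%s)", err, out)
	}
}
```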

My 2 cents.

@Lukasa
Lukasa commented Nov 14, 2014

Perhaps the lessons from neutron are telling us to create a 'default' plugin that consumes the same interface as any 'third party' plugin would, and not to create a 'native', tightly integrated solution.

@jainvipin has hit my concern on the head. I'll elaborate more on what I want below.

Neutron has no view of the world.

I disagree strongly. Neutron's view of the world is that virtual networking is built up of subnets, networks, and routers. Essentially, that virtual networks are necessarily layer 2 constructs. The problem is that that view of the world should not be the top-level API.

What I'm worried about is ending up with a plugin architecture that is informed by the default plugin, rather than the other way around, because the default plugin will inevitably be layer 2 based. These unquestioned, quiet assumptions pervade Neutron, and they run the risk of pervading Docker's plugin approach unless we squash them now.

So what do I want? I want us to design a plugin API that has the correct level of abstraction. I don't then care if Docker ships with a 'default' plugin that consumes that API, that's totally fine by me. What I don't want is a plugin that wants to be told about subnets and networks and bridges and VLANs and all that other stuff, because then those of us who think that building massive layer 2 broadcast domains is a bad idea have to lie to the plugin infrastructure.

My preferred approach would be to design the plugin API first. We should also design plugins alongside it: @dave-tucker, @mavenugo and @nerdalert can continue designing this proposal, I'll work on a Calico-based plugin, the Weave team can do a Weave plugin. What we'd be trying to do is determine what a good API looks like. I suspect the answer is that it expresses connectivity, not topology ('this container may speak to this other one', not 'this container is in network A, and so is this other one'), but I want to consider that API the primary goal.
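To make the connectivity-vs-topology distinction concrete, here is a hypothetical sketch of what such an API could look like; every name is invented for illustration and this is not a proposal:

```go
// Hypothetical driver surface expressing connectivity ("these two may
// talk over this flow") rather than topology ("these live on subnet X").
package driver

// Endpoint identifies a container's network attachment.
type Endpoint struct {
	ContainerID string
	HostID      string
}

// Flow describes the traffic being permitted.
type Flow struct {
	Protocol string // "tcp", "udp", ...
	Port     uint16
}

// ConnectivityDriver is what a backend plugin would implement. Nothing in
// the interface mentions bridges, subnets or VLANs; how connectivity is
// realised is entirely the plugin's business.
type ConnectivityDriver interface {
	Attach(ep Endpoint) error
	Detach(ep Endpoint) error
	Allow(from, to Endpoint, f Flow) error
	Revoke(from, to Endpoint, f Flow) error
}
```

An OVS, Weave or Calico backend could each satisfy an interface of this shape in very different ways, which is the point of keeping the abstraction at the connectivity level.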

With a good API, Docker should then feel free to bless a plugin as 'default', but users who want something else can pick up another plugin that will be guaranteed to be as capable as any other.

NB: The trap that is easy to fall in to is to assume that there is one networking solution that works equally well in all cases. There is not. If you need layer 2 then a Calico network will not work. However, if you need network transparency and simplicity then VXLAN is a nightmare.

Network engineers have a tendency to fall in love with their preferred approach (as all engineers do) and then fail to see the world in any other light. This is what I want Docker to avoid. I want Docker to think like a user, and let plugin architects worry about how that translates into networks. This is why I'm 'plugin first'.

@dpw
dpw commented Nov 14, 2014

@lukasa, great points about the network plugin model.

I want us to design a plugin API that has the correct level of abstraction.

We at weave are keenly interested in this as well. I've just put a comment at #8997 (comment) describing an approach to plugins that would minimize constraints on network plugins, and we'd love to hear what others think.

What we'd be trying to do is determine what a good API looks like. I suspect the answer is that it expresses connectivity, not topology ('this container may speak to this other one', not 'this container is in network A, and so is this other one'), but I want to consider that API the primary goal.

Even these two options represent particular opinions of what container networking should look like, with each one favoring some network technologies over others. For example, with weave network isolation is currently achieved using IP subnets. That is necessarily a model based on network membership, not the kind of container-to-container connectivity that I think you are suggesting. So if a network plugin model puts that kind of connectivity front and center, it could be awkward for weave.

One interesting question is, what is the minimum required for portable containers, i.e. to allow multiple-container applications to be developed without being tied to a particular network technology?

@NetCubist

A more systematic approach is needed to define the API because it has to be generic enough to satisfy the use cases and not necessarily the implementations.
At a very high level, we need to be able to specify a notion of:

  1. Tenancy & membership
  2. Broadcast domain & membership
  3. Opaque Policy provisioning

Once again, while I agree with the plugin model, I think it would be a mistake to have docker be the orchestrator, simply because it could easily get bogged down orchestrating the network. This should be done by a dedicated network agent that is part of "native" Docker.
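Again purely as a hypothetical illustration, the three notions listed above could be rendered as opaque records handed to such an agent; the type and field names are invented, not part of any proposal:

```go
// Hypothetical data model: tenancy, broadcast domains and opaque policy,
// leaving their realisation (OVS, bridge+vxlan, pure L3, ...) to the backend.
package model

// Tenant groups containers that belong to one owner.
type Tenant struct {
	ID      string
	Members []string // container IDs
}

// BroadcastDomain groups containers that should share L2 reachability.
type BroadcastDomain struct {
	ID      string
	Tenant  string
	Members []string // container IDs
}

// Policy is deliberately opaque to Docker; only the backend interprets it.
type Policy struct {
	Tenant string
	Blob   []byte
}
```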

@liljenstolpe

In the interests of full disclosure, I am also with Project Calico with @Lukasa, but have also worked on the operational side of both large-scale networks and scale-out deployments.

The API should represent an intent and policy model, rather than model any specific network-ism. One of the issues in OpenStack, as mentioned earlier, is that it asks the user to architect a network, rather than describe what connectivity they want.

If, instead of asking for networking constructs, we asked "which endpoints need to communicate, using which protocols", we can then express that in real networking constructs, depending on which network model is used. That rendering may be as VXLAN segments and L3 router services, or it may be a set of Calico route announcements and ACLs.

Let's get the data model/API right first, and the rest can follow.

@RmsMit - I also agree that IPv6 is a compelling solution to the networking substrate. However, I'm not sure I agree with using link local addressing. If we use that, then we have a problem the minute you cross the local link boundary. IPv6 NAT is a BAD idea, and there really isn't support for NATing link local anyway. The correct solution would be to use global v6 addresses, if possible. If not, then a ULA address block.

@RmsMit
RmsMit commented Nov 19, 2014

I agree that link local is not a solution when connecting containers together, but it could be the default for when containers are not connected.

If you start a container without the port argument, its ports will not be mapped to be externally available. This is the use case for link local: for isolated containers you only give them a link-local address. If you want the container's ports accessible on the external network, then you also give it a global address, with IP forwarding on the host and maybe some firewall rules to limit access.

I think there is also a network-scope address which could be used to limit connections to addresses within your subnet.

It is normal in IPv6 to have a link-local address even when you also have a global address on that same network interface.
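For illustration, this tiered scheme can be approximated today with plain iproute2 inside the container's network namespace; the addresses and interface name below are examples, not proposed defaults:

    # Isolated container: only a link-local address (reachable on the local link only)
    ip -6 addr add fe80::242:ac11:2/64 dev eth0
    # Externally reachable container: additionally assign a ULA (or global) address
    ip -6 addr add fd00:d0c:1::2/64 dev eth0
    ip -6 route add default via fe80::1 dev eth0
    # ...plus IPv6 forwarding on the host and firewall rules to limit access
    sysctl -w net.ipv6.conf.all.forwarding=1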

@liljenstolpe

Agreed @RmsMit. In fact, the model you suggest for "connected" containers is the Calico model.

@MalteJ
Contributor
MalteJ commented Nov 19, 2014

If you are interested in an IPv6 implementation have a look at my PR #8947
I'd love to get some feedback :-)

@danehans

Disclaimer: I am a software engineer with Cisco.

I agree with the comments from @NetCubist:

"Linux natively supports vxlan. Linuxbridge coupled with vxlan and service discovery to discover tunnel end points would be a simple and good enough starting point."

I believe this hits the 80% mark @shykes and others are trying to achieve without the additional overhead of OVS. Plugins can be utilized to achieve more complex multi-host scenarios.
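For reference, the "Linux bridge + native VXLAN" starting point boils down to a handful of commands per host; the VNI, multicast group, and interface names below are illustrative, and service discovery would still be needed to find the tunnel endpoints:

    ip link add vxlan42 type vxlan id 42 group 239.1.1.1 dev eth0 dstport 4789
    ip link add br-docker type bridge
    ip link set vxlan42 master br-docker
    ip link set vxlan42 up && ip link set br-docker up
    # Containers whose veth peers are attached to br-docker now share one L2 segment across hosts.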

@danehans

Adding to the following comments from @dave-tucker:

"We prefer OVS, the only downside being that we require "openvswitch" to be installed on the host, but we've wrapped up all the userland elements in a docker container - the kernel module is available in 3.7+"

According to OVS docs, running OVS in userspace comes at a performance cost, is considered experimental, and has not been thoroughly tested.

With these caveats in place, I don't believe userspace OVS is a viable option.

@dave-tucker
Member

You misunderstand. OVS is not running in userspace mode when the kernel module is present on the host system. If you deploy the container on a host running a kernel that is >= 3.7, OVS will use the kernel datapath.
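A quick way to confirm which datapath is in use on a given host (assuming the ovs-vsctl/ovs-dpctl userland tools can reach the running OVS daemons):

    # Kernel datapath available? (the module ships with kernels >= 3.7)
    modprobe openvswitch && lsmod | grep openvswitch
    # If the module is loaded, the containerized userland (ovsdb-server/ovs-vswitchd)
    # programs the in-kernel datapath rather than falling back to userspace:
    ovs-vsctl add-br br-int
    ovs-dpctl show    # lists the kernel datapaths currently in use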


@danehans

@dave-tucker thanks for the clarification. Are the performance numbers the same as what you shared earlier in the thread?

@erikh
Contributor
erikh commented Nov 21, 2014

I’d also like to chime in and say that, with the exception of older RHEL releases (where we’ve coordinated with them to ensure their 2.6 kernels have the support we need), we support a minimum of 3.8.

That said, we’re currently sorting out some panics with some kernels on the vxlan side of things which might adjust this requirement. Either way, OVS should be able to work in first-class form on any docker host.

-Erik


@mavenugo
Contributor

@danehans Yes. You can use https://github.com/socketplane/docker-ovs to try the containerized version.
@erikh Thanks for the confirmation. Can you please share more details on the kernel panic that you observed (not necessarily in this issue/proposal; maybe in an email thread)?

@liljenstolpe

@danehans and @dave-tucker, I don't think that either vxlan or OVS are the only (or even only default) model that we should be considering. Both (especially vxlan) assume an L2 model, which is not necessarily the only (or best) model. I have no problem with having options for using them, but having them as first-class or first-among-equals is a bit of a concern.

@erikh
Contributor
erikh commented Dec 3, 2014

It’s related to SmartOS; I need to talk to the Joyent folks.

-Erik


@thewmf
thewmf commented Dec 3, 2014

@liljenstolpe You can do L3-only over VXLAN if you want; choosing a different encapsulation format will disable hardware offload. Likewise OVS with learning disabled can be used as an L3-only vRouter.

Semantics and implementation are orthogonal in many ways, so maybe we should have a more focused discussion on desired semantics for the "batteries included" plugin first and then worry about the implementation. Obvious semantic questions are:

  • L2 vs. L3
  • multicast enabled or not
  • overlapping vs. global IP addressing
  • subnet-based or group-based connectivity
  • 1 vNIC vs. multiple vNICs per container

(Disclosure: IBM. We make SDN-VE and OpenDOVE.)

@danehans
danehans commented Dec 4, 2014

@liljenstolpe I agree and was not trying to imply OVS or VXLAN are the only considerations. I agree with @NetCubist that kernel VXLAN + SD can be a good enough solution. My preferred direction is to leave default networking as-is and use the plugins model to implement any additional networking functionality, but it sounds like Docker has already made their decision.

@liljenstolpe

@thewmf The question is, in an L3-only network, do you NEED an overlay? In L2 networks you certainly do, and in some cases (such as L3 address overlap) an overlay network can address the issue; however, overlays are not the only solution, and in the general case (say 90% of the traffic in a scale-out environment) they are probably not necessary. Therefore, do we want to assume that they will be present? It's an additional cost that may not always be justified.

@danehans The question is whether we treat overlays as the base. If so, we're burdening the environment when that isn't always necessary.

@danehans

@liljenstolpe I think it's hard to define what is needed without having detailed requirements to build against. One cloud provider may say that supporting overlapping IPs is a requirement, but another may say it's not needed. This is a good example of why we, as a community, need to clearly define the requirements. Thus far, high-level analogies are the only thing to build against.

@thewmf
thewmf commented Dec 16, 2014

@liljenstolpe @danehans Agreed. Different requirements will lead to different implementations, which is why I suggested that we discuss requirements. I don't think it makes sense to lock in any technology unless it is needed.

I am working in an environment where we want to allow customers to bring their own possibly-overlapping IP addresses so we are definitely looking at overlays, but we can use a plug-in for that. But I'd like to hear people's opinions on the future of default networking. I would like to see Docker move away from NAT and port mapping, but I'm not sure how to do that on random developers' laptops. Maybe IPv6 ULAs... can people stomach that?
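For what it's worth, something like the following (assuming the IPv6 daemon flags proposed in #8947 land roughly as written; the prefix is a made-up RFC 4193 ULA) would already get developer laptops off NAT:

    # Start the daemon with a ULA prefix for containers
    docker -d --ipv6 --fixed-cidr-v6 fd00:dead:beef::/64
    # Containers then get addresses that are routable within the site instead of NAT'd ports.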

@unclejack
Contributor

There's an official proposal for networking drivers which can be found at #9983.
The architecture presented in the proposal would also enable multi-host networking for multiple Docker daemons.

This new proposal implements an architecture that has been discussed quite a bit; implementing a proof of concept of the network drivers was also part of this effort.
We're not suggesting that the previous proposals were of lower quality or required less effort. However, in addition to being good, the design also had to be accepted by everyone and validated with a proof of concept.

Should you discover something is confusing or missing from the new proposal, please feel free to comment.
If you'd like to continue the discussion, please comment on #9983. Please make sure to stay on topic and try to avoid writing long comments (or too many); this will make it easier for everyone who's following the discussion.

Questions and lengthy discussions are better suited to the #docker-network channel on freenode. If you just want to talk about this, that is a better place to have the conversation.

We'd like to thank everyone who's provided input, especially those who've sent proposals. I will close this proposal now.

@unclejack unclejack closed this Jan 9, 2015
@phemmer
Contributor
phemmer commented Jan 9, 2015

Does this mean docker has no intention of developing/supporting multi-host networking natively? #9983 is just for the creation of a driver scheme, and not the specific goal of multi-host networking. If multi-host networking is still a goal, I would have expected this proposal to remain open, and for it to utilize #9983.

@erikh
Contributor
erikh commented Jan 9, 2015

@phemmer we have a vxlan implementation in our PoC already. It's not very good, but yes, this is intended to be supported first-class.

@erikh
Contributor
erikh commented Jan 9, 2015

We're reopening this after some discussion with @mavenugo, who pointed out that our proposal is not a solution for everything in here -- and it should be much closer.

We want this in docker and we don't want to communicate otherwise. So, until we can at least mostly incorporate this proposal into our new extension architecture, we will leave it open and solicit comments.

@erikh erikh reopened this Jan 9, 2015
@c4milo
c4milo commented Jan 9, 2015

@erikh would you mind giving us the main takeaways after your discussion with @mavenugo?

@mavenugo
Contributor
mavenugo commented Jan 9, 2015

@c4milo following is the docker-network IRC log between us regarding reopening the proposal.

madhu: erikh: backjlack thanks for all the great work
[06:12am] madhu: on closing the proposals
[06:13am] madhu: 9983 replaces 8952 and hence closing is accurate
[06:13am] madhu: but imho 8951 should be still open because it is beyond just drivers
[06:13am] madhu: but a generic architecture for all the considerations for a multi-host scenario
[06:14am] madhu: we can close it once all the scenarios are addressed. through other proposals or through 8951
[06:14am] backjlack: madhu: Personally, I'd rather see 9983 implemented and then revisit 8951 to request an update.
[06:15am] madhu: backjlack: okay. if that is the preferred approach sure
[06:15am] erikh: gh#8951
[06:15am] erikh: hmm.
[06:15am] erikh: need to fix that.
[06:15am] confounds joined the chat room.
[06:15am] madhu: keeping it open is actually better imho
[06:15am] erikh: hmm
[06:16am] erikh: backjlack: do you have any objections to keeping it open? madhu does have a pretty good point here.
[06:16am] erikh: we can incorporate it and close it if we feel necessary later
[06:16am] madhu: exactly. that way we can easily answer the questions that are raised
[06:17am] backjlack: erikh: My main concern is that it's more of a discussion around adding OVS support.
[06:17am] erikh: hmm
[06:17am] erikh: ok. let me review and get back to you guys.
[06:17am] madhu: thanks erikh backjlack
[06:17am] madhu: backjlack: just curious. is there any trouble in keeping it open vs closed ?
[06:18am] erikh: hmm
[06:19am] erikh: the only concern I have is that with several networking proposals that we're accidentally misleading our users
[06:19am] backjlack: madhu: If it's open, people leave comments like this one: #8952 (comment)
[06:19am] backjlack: They're under the impression nobody cares about implementing that and it's very confusing.
[06:20am] erikh: hmm
[06:20am] erikh: backjlack: let's leave it open for now
[06:20am] madhu: backjlack: okay good point
[06:20am] madhu: but we were waiting on the extensions to be available
[06:20am] erikh: if we incorporate everything into the new proposal, we will close it.
[06:20am] erikh: (And we can work together to fit that goal)
[06:20am] madhu: now that we are having the momentum, there will be code backing this all up
[06:20am] jodok joined the chat room.
[06:20am] madhu: thanks erikh that would be my suggestion too
[06:21am] erikh: backjlack: WDYT? I think it's reasonable to let people know (by example) we're trying to solve the problem, even if our answers don't necessarily line up with that proposal
[06:22am] backjlack: erikh: Sure, we can reopen the issue and update the top level text to let people know this is going to be addressed after #9983 gets implemented.
[06:22am] erikh: yeah, that's a good idea.
[06:22am] erikh: madhu: can you drive updating the proposal and referencing our new one as well?
[06:23am] erikh: I'll reopen it.
[06:23am] madhu: yes sir.
[06:23am] madhu: thanks guys. appreciate it

@c4milo
c4milo commented Jan 9, 2015

@mavenugo nice, thank you, it makes more sense now :)

@bmullan
bmullan commented Feb 5, 2015

Related to VXLAN and network "overlays": the stumbling block to implementation/deployment was always the requirement for multicast to be enabled in the network... which is rare.

Last year Cumulus Networks and MetaCloud open-sourced VXFLD to implement VXLAN with unicast UDP.

They also submitted it for consideration as a standard.

MetaCloud has since been acquired by Cisco Systems.

VXFLD consists of 2 components that work together to solve the BUM (Broadcast, Unknown unicast & Multicast) problem with VXLAN by using unicast instead of the traditional multicast.

The 2 components are called VXSND and VXRD.

VXSND provides:

  • unicast BUM packet flooding via the Service Node Daemon (the SND in VXSND)
  • VTEP (Virtual Tunnel End-Point) "learning"

VXRD provides:

  • a simple Registration Daemon (the RD in VXRD) designed to register local VTEPs with a remote vxsnd daemon.

the source for VXFLD is on Github: https://github.com/CumulusNetworks/vxfld

Be sure to read the two GitHub VXFLD directory .RST files, as they describe the two daemons of VXFLD, VXRD and VXSND, in more detail.

I thought I'd mention VXFLD as it could potentially solve part of your proposal and... the code already exists.

If you use Debian or Ubuntu, Cumulus has also pre-packaged 3 .deb files for VXFLD:

http://repo.cumulusnetworks.com/pool/CumulusLinux-2.2/main/vxfld-common_1.0-cl2.2~1_all.deb

http://repo.cumulusnetworks.com/pool/CumulusLinux-2.2/main/vxfld-vxrd_1.0-cl2.2~1_all.deb

and
http://repo.cumulusnetworks.com/pool/CumulusLinux-2.2/main/vxfld-vxsnd_1.0-cl2.2~1_all.deb
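For a small, static cluster, the unicast behaviour that VXFLD automates can also be approximated with the kernel alone by using head-end replication via static FDB entries; the VNI, device, and peer VTEP addresses below are examples:

    ip link add vxlan42 type vxlan id 42 dev eth0 dstport 4789 nolearning
    # Flood BUM traffic to each remote VTEP over unicast instead of multicast:
    bridge fdb append 00:00:00:00:00:00 dev vxlan42 dst 192.0.2.11
    bridge fdb append 00:00:00:00:00:00 dev vxlan42 dst 192.0.2.12
    ip link set vxlan42 up
    # VXFLD's vxsnd/vxrd pair removes the need to maintain these entries by hand at scale.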

@erikh erikh was unassigned by nerdalert Feb 5, 2015
@jessfraz jessfraz added Proposal feature and removed feature labels Feb 25, 2015
@rcarmo
rcarmo commented Feb 28, 2015

I'd like to chime in on this. I've been trying to put together a few arguments for and against doing this transparently to the user, and coming from a telco/"purist SDN" background it's hard to strike a middle ground between ease of use for small deployments and the kind of infrastructure we need to have it scale up into (and integrate with) datacenter solutions.

(I'm rather partial to the OpenVSwitch approach, really, but I understand how weave and pipework can be appealing to a lot of people)

So here are my notes:


This is just a high-level overview of how software-defined networking might work in a Docker/Swarm/Compose environment, written largely from a devops/IaaS perspective but with a fair degree of background on datacenter/telco networking infrastructure, which is fast converging towards full SDN.

There are two sides to the SDN story:

  • Sysadmins running Docker in a typical IaaS environment, where a lot of the networking is already provided for (and largely abstracted away) but where there's a clear need for communicating between Docker containers in different hosts.
  • On-premises telco/datacenter solutions where architects need deeper insight/control into application traffic or where hardware-based routing/load balancing/traffic shaping/QoS is already being enforced.

This document will focus largely on the first scenario and a set of user stories, with hints towards the second one at the bottom.

Offhand, there are two possible approaches from an end-user perspective:

  • Extending the CLI linking syntax and have the system build the extra bridge interfaces and tunnels "magically" (preserves the existing environment variable semantics inside containers)
  • Exposing networks as separate entities and making users aware of the underlying complexity (requires extra work for simple linking, may need extra environment variables to facilitate discovery, etc.).

This is largely described in http://www.slideshare.net/adrienblind/docker-networking-basics-using-software-defined-networks already, and is what pipework was designed to do.

Arguments for Keeping Things Simple (Sticking to Port Mapping)

Docker's primary networking abstraction is essentially port mapping/linking, with links exposed as environment variables to the containers involved - that makes application configuration very easy, as well as lessening CLI complexity.

Steering substantially away from that will shift the balance towards "full" networking, which is not necessarily the best way to go when you're focused on applications/processes rather than VMs.

Some IaaS providers (like Azure) provide a single network interface by default (which is then NATed to a public IP or tied to a load balancer, etc.), so the underlying transport shouldn't require extra network interfaces to work.

Arguments for Increasing Complexity (Creating Networks)

Docker does not exist in a vacuum. Docker containers invariably have to talk to services hosted in more conventional infrastructure, and Docker is increasingly being used (or at least proposed) by network/datacenter vendors as a way to package and deploy fairly low-level functionality (like traffic inspection, shaping, even routing) using solutions like OpenVSwitch and custom bridges.

Furthermore, containers can already see each other within a host - each is provided with an IP address from 172.17.0.0/16, which is reachable from other containers. Allowing users to define networks and bind containers to networks, rather than solely to ports, may greatly simplify establishing connectivity between sets of containers.

Middle Ground

However, using Linux kernel plumbing (or OpenVSwitch) to provide Docker containers with what amount to fully-functional network interfaces implies a number of additional considerations (like messing with brctl) that may have unforeseen (and dangerous) consequences in terms of security, not to mention the need to eventually deal with routing and ACLs (which are currently largely the host's concern).

On the other hand, there is an obvious need to restrict container (outbound) traffic to some extent, and a number of additional benefits that stem from providing limited visibility onto a network segment, internal or otherwise.

Minimal Requirements:

There are a few requirements that seem fairly obvious:

  • Docker containers should be able to talk to each other inside a swarm (i.e., a pre-defined set of hosts managed by Swarm) regardless of which host they run on.
  • That communication should have the least possible overhead (but, ideally, use a common enough form of encapsulation - GRE, IPoIP - that allows network teams to inspect and debug on the LAN using common, low-complexity tools)
  • One should be able to completely restrict outbound communications (there is a strong case to do that by default, in fact, since a compromised container may be used to generate potentially damaging traffic and affect the remainder of the infrastructure).

Improvements (Step 1):

  • Encrypted links when linking between Swarm hosts on open networks (which require extra setup effort)
  • Limiting outbound traffic from containers to specific networks or hosts (rather than outright on/off) is also desirable (but, again, requires extra setup)

Further Improvements (Step 2):

  • Custom addressing and bridging for allowing interop with existing DC solutions
  • APIs for orchestrating and managing bridges, vendor interop.

Likely Approaches (none favored at this point):

  • Wrap OpenVSwitch (or abstract it away) into a Docker tool
  • Have two tiers of network support, i.e., beef up pipework (or weave) until it's easier to use and allow for custom OpenVSwitch-like solutions
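As a concrete reference for the "wrap OpenVSwitch" option (and for the GRE encapsulation mentioned in the requirements above), the underlying plumbing is only a couple of commands per host; the bridge name and peer address are examples:

    # On host A (host B mirrors this with remote_ip pointing back at A):
    ovs-vsctl add-br ovsbr0
    ovs-vsctl add-port ovsbr0 gre0 -- set interface gre0 type=gre options:remote_ip=192.0.2.12
    # Containers attached to ovsbr0 on both hosts now share one L2 segment, and the GRE
    # traffic can be captured and debugged with standard, low-complexity tools.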
@mk-qi
mk-qi commented Mar 20, 2015

hello everyone;

[image: docker-muilt (https://cloud.githubusercontent.com/assets/642228/6745878/74b88210-cef9-11e4-8595-2928832ed70a.png)]

I set up docker0 on hosta and hostb on the same network via VXLAN, and they can ping each other, but Docker always allocates the same IPs on hosta and hostb. Is there any way, plugin, or hack to help me check whether an IP already exists?

@thockin
Contributor
thockin commented Mar 20, 2015

You need to pre-provision each docker0 with a different subnet range. Even then you probably will not be able to ping across them unless you also add your eth0 as a slave on docker0.

read this: http://blog.oddbit.com/2014/08/11/four-ways-to-connect-a-docker/
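Concretely, that pre-provisioning can be done with the daemon's --bip flag so the two docker0 bridges carve out non-overlapping subnets (addresses are examples):

    # Host A
    docker -d --bip=10.0.1.1/24
    # Host B
    docker -d --bip=10.0.2.1/24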


@fzansari

@mk-qi: You can use "arping", which is essentially a utility to discover whether an IP is already in use within a network. That's how you can make sure Docker does not use the same set of IPs when it spans multiple hosts.
Another option is to statically assign IPs to each Docker container yourself.
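For example, with the iputils arping in duplicate-address-detection mode (interface and address are examples):

    # Exit status 0 means no reply was seen, i.e. the address appears to be free
    arping -D -c 2 -I docker0 172.17.0.5 && echo "172.17.0.5 is free" || echo "172.17.0.5 is in use"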

@mk-qi
mk-qi commented Mar 20, 2015

@thockin sorry, I did not draw the picture clearly. In fact, eth0 is a slave of docker0, and as I said before, the hosts can ping each other...

@shykes I saw your fork https://github.com/shykes/docker/tree/extensions/extensions/simplebridge and it looks like it pings an IP before actually assigning it, but I am not sure; could you give more information?

@mk-qi
mk-qi commented Mar 20, 2015

@fzansari thanks for the reply. Static IP allocation is OK; in fact we have been using pipework + macvlan (+ dhcp) for some small clusters, but when running many containers it is very painful to manage IPs. Of course we could write tools, but I think hacking Docker to solve the IP conflict problem directly would make things much simpler, if that is possible.

@SamSaffron

Having just implemented keepalived internally, I think there would be an enormous benefit from simply implementing an interoperable VRRP protocol. It would allow Docker to "play nice" without forcing it on every machine in the network.

For example:

Host 1 (ip address 10.0.0.1):

docker run --vrrp eth0 -p 10.0.0.100:80:80 --priority 100 --network-id 10 web

Host 2 (ip address 10.0.0.2: backup service)

docker run --vrrp eth0 -p 10.0.0.100:80:80 --priority  50 --network-id 10 web

Supporting VRRP gives a very clean failover story and allows you to simply assign an IP to a service. It would take a lot to flesh out the details, but I do think it would be an amazing change.
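For comparison, the keepalived equivalent of the example above would look roughly like this on host 1 (host 2 would use state BACKUP and priority 50; the interface name is an assumption, and the VRID and service address mirror the --network-id 10 example):

    cat > /etc/keepalived/keepalived.conf <<'EOF'
    vrrp_instance web {
        state MASTER
        interface eth0
        virtual_router_id 10
        priority 100
        virtual_ipaddress {
            10.0.0.100/24
        }
    }
    EOF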

@cpuguy83
Contributor

Closing, since multi-host networking, plugins, etc. have all been in since Docker 1.9.

@cpuguy83 cpuguy83 closed this Apr 18, 2016