Support hairpin NAT #6810
|
@jpetazzo can you take this one please ? |
|
I'll look as soon as I can (but I have a few big presentations looming right around the corner); meanwhile, I'd like to summon @mpetazzoni because I know he was severely impacted by hairpin NAT! |
|
Haven't tested it, but that's basically the change that I need indeed. We also never use It would be great of course in the long run if the proxy could be removed completely, but the changes for that look more complex (and should be looked at by people with more network knowledge than I have). I'll build this branch tomorrow, give it a spin and report back. |
|
We would be willing to remove the proxy if you can get this working.... |
|
I might, but should probably be a separate PR. |
|
+1, Confirmed working, the Docker daemon is no longer hogging CPU being busy moving data around via the userland proxy. Helps a lot! |
|
Ok, but do things like mongodb going out via the external interfaces and back in to itself still work? |
|
i wonder if we can test removing the userland proxy with this..... |
|
@crosbymichael See second half of the description. It comments on that. |
|
I don't think we want to remove the userland proxy (do we?) but an option to disable it (daemon-wide) would be nice. (Rationale: when people want to port Docker to FreeBSD/Solaris, it will be nice to use the userland proxy, right?) |
|
The big problem with the proxy is that you lose client information. The server no longer knows the IP/port of the client. Addendum: I'm all for making it an option, but the default behavior should be the same everywhere. |
|
Yes, but then porting Docker will be more difficult, since you have to port the iptables logic. Also, AFAIR handling NAT connections to localhost requires tweaking some sysctls. I don't know if we want to do that. |
|
if/when docker gets ported to other OSs, iptables will have to be ported anyway as there are numerous other things using iptables besides potential localhost traffic. Plus I think implementing the iptables equivalent in the target OS will be trivial compared to everything else. And yes, it does require setting a single sysctl param on the bridge interface, |
|
@phemmer Hi, I tried to test this yesterday. Would you mind give me exact |
|
@LK4D4 sorry, missed your message. I successfully routed localhost by performing the following: |
|
@phemmer Would you mind rebasing? I think we really want to remove userland-proxy now.
This re-applies commit b39d02b with additional iptables rules to solve the issue with containers routing back into themselves.

The previous issue with this attempt was that the DNAT rule would send traffic back into the container it came from. When this happens you have 2 issues.

1) reverse path filtering. The container is going to see the traffic coming in from the outside and it's going to have a source address of itself. So reverse path filtering will kick in and drop the packet.

2) direct return mismatch. Assuming you turned reverse path filtering off, when the packet comes back in, it's going to have a source address of itself, thus when the reply traffic is sent, it's going to have a source address of itself. But the original packet was sent to the host IP address, so the traffic will be dropped because it's coming from an address which the original traffic was not sent to (and likely with an incorrect port as well).

The solution to this is to masquerade the traffic when it gets routed back into the origin container. However for this to work you need to enable hairpin mode on the bridge port, otherwise the kernel will just drop the traffic.

The hairpin mode set is part of libcontainer, while the MASQ change is part of docker.

This reverts commit 63c303e.

Docker-DCO-1.1-Signed-off-by: Patrick Hemmer <patrick.hemmer@gmail.com> (github: phemmer)
|
rebased. no conflicts |
|
LGTM for me |
|
RIP userland-proxy |
|
Does docker0 bridge have to be set to have STP enabled for this? |
|
@mrunalp No. This requires no special features/modes other than that the container's external virtual interface have hairpin mode enabled. |
|
considering the original commit for this was by @ibuildthecloud, do you have any thoughts/warnings on what happened the first time around.. I would be curious to know since I was not involved then. |
|
Well we have an integration test for the issue so it can't happen again. |
|
Ah well, then that's perfect. This LGTM; I tried routing localhost too and it worked. |
|
ping @crosbymichael |
|
I can confirm that without the userland proxy and with the rules which were written above by @phemmer and |
|
ping @icecrime |
|
LGTM |
This re-applies commit b39d02b with additional iptables rules to solve the issue with containers routing back into themselves.
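The previous issue with this attempt was that the DNAT rule would send traffic back into the container it came from. When this happens you have 2 issues:
1) Reverse path filtering. The container sees the traffic coming in from the outside with a source address of itself, so reverse path filtering kicks in and drops the packet.
2) Direct return mismatch. Even with reverse path filtering turned off, the incoming packet has a source address of the container itself, so the reply traffic is also sent with a source address of the container. But the original packet was sent to the host IP address, so the reply is dropped because it comes from an address the original traffic was not sent to (and likely with an incorrect port as well).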
The solution to this is to masquerade the traffic when it gets routed back into the origin container. However for this to work you need to enable hairpin mode on the bridge port, otherwise the kernel will just drop the traffic.
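To make the address rewriting easier to follow, here is a toy model of the packet flow with and without the masquerade step. It is only a sketch: the addresses, ports, and function names are hypothetical, and the single `if` that stands in for connection tracking is a deliberate simplification of what the kernel does.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Packet:
    src: str
    dst: str


# Hypothetical addresses, for illustration only.
HOST = "10.0.0.5:8080"        # published host address:port
CONTAINER = "172.17.0.2:80"   # container address:port behind the mapping
BRIDGE = "172.17.42.1:61000"  # bridge address used as the masqueraded source
CLIENT = "172.17.0.2:40000"   # the same container, acting as the client


def dnat(pkt):
    """Rewrite the destination from the host mapping to the container."""
    return replace(pkt, dst=CONTAINER) if pkt.dst == HOST else pkt


def masquerade(pkt):
    """Rewrite the source only when traffic loops back to its origin container."""
    same_container = pkt.src.split(":")[0] == pkt.dst.split(":")[0]
    return replace(pkt, src=BRIDGE) if same_container else pkt


def reply_seen_by_client(masq):
    """Return the source address the client sees on the reply."""
    pkt = dnat(Packet(src=CLIENT, dst=HOST))
    if masq:
        pkt = masquerade(pkt)
    reply = Packet(src=pkt.dst, dst=pkt.src)
    # A reply addressed to the bridge passes back through the host, where
    # connection tracking undoes both rewrites. A reply addressed to the
    # container itself never leaves the container, so nothing is undone.
    if reply.dst == BRIDGE:
        reply = Packet(src=HOST, dst=CLIENT)
    return reply.src
```

In this model the client sent its request to `HOST`, so only a reply whose source is `HOST` matches the connection it opened; without the masquerade step the reply appears to come from `CONTAINER` instead.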
The hairpin mode set is part of libcontainer, while the MASQ change is part of docker.
The libcontainer change is docker-archive/libcontainer#62
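As a rough illustration of the two pieces described above, the sketch below builds (but does not run) an iptables command that masquerades traffic a container sends to its own published port, and returns the sysfs file that controls hairpin mode on a bridge port. The helper names, the example IP, and the `veth1234` interface name are hypothetical; the exact rule this PR adds may differ.

```python
def hairpin_masquerade_rule(container_ip, proto, port):
    """Build an iptables argv that masquerades traffic looping back
    into the container it came from (source == destination)."""
    return [
        "iptables", "-t", "nat", "-A", "POSTROUTING",
        "-p", proto,
        "-s", container_ip, "-d", container_ip,
        "--dport", str(port),
        "-j", "MASQUERADE",
    ]


def hairpin_mode_path(veth):
    """Return the sysfs file for hairpin mode on a bridge port.
    Writing "1" to it (as root) enables hairpin mode."""
    return f"/sys/class/net/{veth}/brport/hairpin_mode"


if __name__ == "__main__":
    # Hypothetical container address and host-side veth name.
    print(" ".join(hairpin_masquerade_rule("172.17.0.2", "tcp", 80)))
    print(hairpin_mode_path("veth1234"))
```

Keeping the builders free of side effects means the commands can be inspected before anything is applied with root privileges.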
Also, since part of this change is in libcontainer, I wasn't sure how to handle that. Such as whether the files in /vendor should be patched.

Note, with this change, it is almost possible to remove the docker proxy. The only thing left that the proxy handles is connecting to a port mapping via localhost from the docker server (the TestAllocateTCPPortLocalhost test).
This can easily be supported by iptables, but requires 3 changes:
1. ! -d 127.0.0.0/8 on the -t nat OUTPUT rule needs to be removed. However from the comment on line 131, it looks like this was deliberately added for some reason.
2. MASQUERADE rule needs to be added to masquerade traffic so it's not appearing to come from 127.0.0.1.
3. net.ipv4.conf.$iface.route_localnet=1 needs to be set on the bridge interface. While route_localnet is there for security reasons (since local network traffic should never be routed), the route_localnet inside a container's namespace will prevent any container from being able to send localnet traffic to another through the bridge. And then the accept_local setting in the container will prevent the traffic from being received as well.

I ran the integration test suite with the proxy disabled, and the only tests that failed were those localhost tests.
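The three changes above could look roughly like the sketch below, which only builds the commands and paths and does not apply them. Several details are assumptions not stated in this description: the `DOCKER` chain name, the `addrtype` match on the OUTPUT rule, the `-s 127.0.0.0/8 -o <bridge>` match on the MASQUERADE rule, and the `docker0` bridge name in the usage lines.

```python
LOOPBACK_NET = "127.0.0.0/8"


def output_rule(exclude_loopback):
    """Build the nat OUTPUT rule, with or without the loopback exclusion.
    The DOCKER chain and addrtype match are assumed, not taken from the PR."""
    rule = ["iptables", "-t", "nat", "-A", "OUTPUT",
            "-m", "addrtype", "--dst-type", "LOCAL"]
    if exclude_loopback:
        rule += ["!", "-d", LOOPBACK_NET]
    return rule + ["-j", "DOCKER"]


def localhost_masquerade_rule(bridge):
    """Build a rule so loopback-sourced traffic leaving via the bridge
    does not appear to come from 127.0.0.1 (the match is an assumption)."""
    return ["iptables", "-t", "nat", "-A", "POSTROUTING",
            "-s", LOOPBACK_NET, "-o", bridge, "-j", "MASQUERADE"]


def route_localnet_path(iface):
    """Return the procfs file behind net.ipv4.conf.<iface>.route_localnet.
    Writing "1" to it (as root) allows routing of 127.0.0.0/8 traffic."""
    return f"/proc/sys/net/ipv4/conf/{iface}/route_localnet"


if __name__ == "__main__":
    print(" ".join(output_rule(exclude_loopback=False)))
    print(" ".join(localhost_masquerade_rule("docker0")))
    print(route_localnet_path("docker0"))
```

Passing `exclude_loopback=True` reproduces the form with the `! -d 127.0.0.0/8` restriction, which makes the difference between the two variants easy to compare.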
ref: #4442 #5133