Error creating vxlan interface: file exists #1765

Open
discotroy opened this Issue May 18, 2017 · 32 comments

discotroy commented May 18, 2017

Previous related threads:

A comment at the current tail end of #945 recommends opening a new ticket. I couldn't find one opened by the original poster, so here we go.

I've been using swarm for the past couple of months and frequently hit this problem. I have a modest swarm (~8-9 nodes), all running Ubuntu 16.04, now with Docker 17.05-ce. There is not a great amount of container churn, but I do use a stack yaml file to deploy ~20 services across ~20 encrypted overlay networks.

I tend to find that after a couple of stack deploy / stack rm cycles, my containers get killed at startup with the "Error creating vxlan: file exists" error. This prevents the containers from coming up on a host and forces them to attempt to relocate, which may or may not work.

I have noted in the above issues that the problem has, several times over, been thought to be rectified, yet it always creeps back in for various users.

To rectify the issue, I have tried rebooting the node, restarting iptables, and removing and re-creating the stack, all of which work to varying degrees but are most definitely workarounds and not solutions.

I cannot think of a way to reliably reproduce this error, but if anyone wants to suggest ways to debug it, I am at your service.

mpepping commented May 22, 2017

We're suffering from the same issue on RHEL7 with Docker 17.03-ee and are able to reproduce it by adding a service on a swarm node where the overlay network isn't active yet.
We've tried about the same level of troubleshooting as @discotroy and can confirm that rebooting or restarting the docker engine fixes the issue to some extent, with fluctuating results. Also open to suggestions on how to debug this issue.

fcrisciani (Member) commented May 25, 2017

Do you guys have some logs to share? It would be super helpful to have a way to reproduce this and grab logs with the engine in debug mode.

Engine in debug:
echo '{"debug": true}' > /etc/docker/daemon.json
then: sudo kill -HUP <pid of dockerd>
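For reference, a minimal sketch of the full sequence on a systemd host (an assumption; note the echo overwrites any existing daemon.json, so merge the key instead if you already have one, and the docker.service unit name can differ per distro):

echo '{"debug": true}' | sudo tee /etc/docker/daemon.json
sudo kill -HUP $(pidof dockerd)
journalctl -u docker.service -f | grep -i vxlan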

mpepping commented May 26, 2017

Will collect more logging. Here's some debug output with the error message: https://gist.github.com/mpepping/50cb9b71b5535b318c6a548d4e8ba97b

fcrisciani (Member) commented May 26, 2017

@mpepping thanks, the error message is clear. My current suspect is a race condition during deletion of the sandbox that leaks the vxlan interface behind it. When a new container comes up, it tries to create the vxlan interface again, finds that one already exists, and errors out. The more interesting part of the logs now would be the block where the interface deletion is supposed to happen, to figure out why that is not happening properly.
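For anyone wanting to check whether an interface has been leaked in the meantime, a quick sketch (the vx- naming convention is described later in this thread; any vx-* device lingering in the host namespace after its network is gone is suspect):

# list vxlan devices still visible in the host namespace
ip -d link show type vxlan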

fcrisciani (Member) commented May 26, 2017

I'm also trying to reproduce it locally, but if you guys narrow down a specific set of steps that reproduces it with high probability, let me know.

mpepping commented May 27, 2017

@fcrisciani indeed, it seems to be a race condition running into a locking issue. A breakdown of the steps, with debug output, is available at https://gist.github.com/mpepping/739e9a486b6c3266093a8af712869e90 .

Basically, the command set for us to reproduce is the following (the gist provides more detail); a sketch for looping these steps appears right after the commands:

docker swarm init
docker network create -d overlay  ucp-hrm
docker stack deploy -c stack.yml test
docker service ls #OK
docker stack rm test
docker service ls
docker stack deploy -c stack.yml test
docker service ls #NOK
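A minimal sketch for looping these steps until the race is hit (test_web is an illustrative service name from stack.yml; the sleep timings are guesses):

#!/bin/bash
# Re-deploy and remove the stack repeatedly; stop when a task reports the vxlan error.
for i in $(seq 1 200); do
    docker stack deploy -c stack.yml test
    sleep 10
    if docker service ps test_web --no-trunc 2>/dev/null | grep -q "error creating vxlan"; then
        echo "hit the vxlan error on iteration $i"
        break
    fi
    docker stack rm test
    sleep 5
done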

Also, we're running into this issue on RHEL7 with Docker 17.03-ee on VMware vSphere VMs. We have thus far been unable to reproduce the issue on VirtualBox or VMware Fusion using the same stack. Our next steps are to run another OS on VMware vSphere to try to reproduce the issue, and to debug the vxlan config.

pjutard commented May 29, 2017

Same problem here, same scenario: multiple stacks deployed, each with its own network. After some docker stack rm and docker stack deploy cycles, we get the "Error creating vxlan: file exists" error message.
We have a swarm in this state right now...

Server:
 Version:      17.05.0-ce
 API version:  1.29 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   89658be
 Built:        Thu May  4 21:43:09 2017
 OS/Arch:      linux/amd64
 Experimental: false

Using Docker4Azure

mpepping commented May 30, 2017

Alright, some extensive tests led to interesting results. We're running RHEL7 with docker-17.03-ee.
The issue was directly reproducible when running the 3.10.0-327.10.1.el7.x86_64 kernel with iptables (firewalld removed): a docker stack deploy/rm/deploy combo fails every test run on this setup.
After bumping the kernel (to 3.10.0-514.6.1.el7.x86_64) and installing and enabling the firewalld service, the results are much more reliable, but things can still break after 200+ or 800+ deploy/rm/deploy runs, after which rebooting the host is the only reliable way to recover. Note that just bumping the kernel or just enabling firewalld isn't sufficient; the combination of both made the difference in our use case.

jgeyser commented May 31, 2017

As per #562

You can correct this by running:

sudo umount /var/run/docker/netns/*
sudo rm /var/run/docker/netns/*
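Before unmounting, a quick sketch to see what is actually mounted there (assuming the default netns path; the nsfs entries are the per-network namespaces docker created):

ls -la /var/run/docker/netns/
grep /docker/netns /proc/mounts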

Not sure if this is a long-term solution.

mavenugo (Contributor) commented May 31, 2017

@jgeyser that's a workaround to get out of the issue, but it is not a solution. We have to narrow down the root cause and fix it in the code.

dcrystalj commented Jun 1, 2017

@jgeyser this is not working for me. Sometimes I also get the issue "Unable to complete atomic operation, key modified".

I have tried removing docker-engine and leaving the docker swarm, but it didn't work.

Update:
A full machine restart was needed as well.

dang3r commented Jun 4, 2017

We are encountering the same problem with the following configuration:

Linux hostname 4.9.0-0.bpo.2-amd64 #1 SMP Debian 4.9.13-1~bpo8+1 (2017-02-27) x86_64 GNU/Linux

Client:
 Version:      17.03.0-ce
 API version:  1.26
 Go version:   go1.7.5
 Git commit:   60ccb22
 Built:        Thu Feb 23 10:53:29 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.03.0-ce
 API version:  1.26 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   60ccb22
 Built:        Thu Feb 23 10:53:29 2017
 OS/Arch:      linux/amd64
 Experimental: false
lnshi commented Jun 9, 2017

Exactly the same problem experienced here, quite randomly.

mpepping commented Jun 9, 2017

@lnshi Care to share details about your environment: OS, Docker version, and whether you're using virtualisation?

sanimej (Contributor) commented Jun 12, 2017

@dang3r @lnshi Can you add details on what triggers this issue for you? Have you been able to find any pattern or a way to recreate it? If your host is a VM, what hypervisor are you using?

lnshi commented Jun 12, 2017

@sanimej Maybe I just misreported this. I just figured out that my actual problem is the one I reported in issue #33626; it is also a subnet-overlap problem, but the reason seems different. Can you help with that one also? Thanks.

sanimej (Contributor) commented Jun 14, 2017

@dang3r @dcrystalj @discotroy If you are still having this issue, can you check whether your host has any udev rules that might rename interfaces whose names start with vx-?

For overlay networks, the docker daemon creates a vxlan device with a name like vx-001001-a12eme, where 001001 is the VNI in hex, followed by the shortened network id. This device then gets moved into an overlay-network-specific namespace. When the overlay network is deleted, the device is moved back to the host namespace before it is deleted. If there is a udev rule that could rename these interfaces, and the rename happens before the docker daemon can delete the device, the host will end up with an orphaned interface holding that VNI, so subsequent attempts to create an interface with that VNI will fail.
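A quick sketch of how one might check for such rules and for orphaned interfaces (the udev paths are the usual locations; adjust for your distro):

# look for rules that rename network interfaces (a catch-all rename rule can hit vx-* devices)
grep -r 'NAME=' /etc/udev/rules.d/ /lib/udev/rules.d/ 2>/dev/null
# list any interfaces in the host namespace that still carry the vx- prefix
ip link show | grep vx-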

@adpjay

This comment has been minimized.

Copy link

adpjay commented Jul 11, 2017

@mpepping Were you able to get the error message ("ERRO[0143] fatal task error error="starting container failed:...") to show up in the docker daemon logs? My swarm is in a state right now where several containers are in this condition. When I try to start one of the containers, the client returns an error message:
Error response from daemon: Error response from daemon: subnet sandbox join failed for "10.0.8.0/24": error creating vxlan interface: operation not supported
But I don't see any corresponding docker daemon API log message in my swarm.

I would like to forward all the daemon messages to Splunk so that I can create an event that recognizes when this condition occurs. That way we can execute a workaround to keep people moving forward, and validate that we aren't seeing it anymore once we get a fix.

mpepping commented Jul 11, 2017

@adpjay Messages with log level ERROR are logged by the daemon by default. Syslog should pick them up and send them to something like /var/log/messages or journald.
In our case, the exact error was: ERRO[0143] fatal task error error="starting container failed: subnet sandbox join failed for \"10.0.2.0/24\": error creating vxlan interface: file exists". The file exists message differs from the operation not supported message in your error. In our case, a custom udev rule for renaming network interfaces was part of the issue; maybe that's something worth checking out.
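For the forwarding use case, a sketch of watching for the condition on a systemd host (assumes the default docker.service unit; a syslog/journald forwarder such as the Splunk forwarder can match on the same string):

journalctl -u docker.service -f | grep --line-buffered "error creating vxlan"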

adpjay commented Jul 11, 2017

@mpepping Thanks for responding. I see lots of messages for docker API GET calls in /var/log/messages and when I run docker logs -f <ucp-manager> (for any of the three managers in our swarm), but I don't see the error reported by the client. I wonder if there is a specific node in the swarm it shows up on.
I did notice that the exact error was different (file exists vs. operation not supported), but I was thinking that could be because our hosts run a different OS than yours (we're running SUSE).
Thanks for the hint about the udev rule; I'll check it out.

mpepping commented Jul 11, 2017

@adpjay You should be able to see the error at the daemon level (not in the UCP container logging), by running something like journalctl -u docker.service. Good luck troubleshooting!

RAKedz commented Jul 18, 2017

I am getting this error on one of my nodes (I have 5 in total). Any service trying to run on it gets this error:

error creating vxlan interface: file exists

I tried a 'docker system prune' and even rebooted the server, but that didn't fix it. Then someone mentioned it could be the network, and I thought that could be it because I had been heavily messing with the network while having issues with the encrypted network I created. I ended up creating a new non-encrypted network and using it for my services, abandoning the previous one.

I began to examine the networks on my working nodes and noticed that the encrypted network I had been using was either removed or still listed. But on the broken node the encrypted network was there with a scope of local, unlike the others (not sure how or why it was changed to local).

Bad node:
iw3w9kdywnay jupiter overlay local

Good nodes:
iw3w9kdywnay jupiter overlay swarm

When I tried to remove the network on the bad node I received this message:

Error response from daemon: network jupiter has active endpoints

Which is why the 'docker system prune' couldn't remove it.

I removed it by doing the following:

  1. Looked up its endpoint
    docker network inspect jupiter
  2. Remove it
    docker network disconnect -f jupiter ingress-endpoint
    docker network rm jupiter

Then I created a service to run on that node, and it started working for me.
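For reference, a sketch for listing which endpoints a network still considers active before forcing the disconnect (jupiter is the network name from above; the template output can vary by Docker version):

docker network inspect jupiter --format '{{range .Containers}}{{.Name}} {{end}}'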

This is my docker version:

Client:
Version: 17.03.0-ce
API version: 1.26
Go version: go1.7.5
Git commit: 3a232c8
Built: Tue Feb 28 08:01:32 2017
OS/Arch: linux/amd64

Server:
Version: 17.03.0-ce
API version: 1.26 (minimum version 1.12)
Go version: go1.7.5
Git commit: 3a232c8
Built: Tue Feb 28 08:01:32 2017
OS/Arch: linux/amd64
Experimental: false

Running on DigitalOcean - Ubuntu 16.04

gitbensons commented Sep 15, 2017

Found a workaround for this issue, without the need to reboot or restart the docker daemon.
As @sanimej mentioned:

For overlay networks, docker daemon creates a vxlan device with the name like vx-001001-a12eme where 001001 is the VNI id in hex, followed by shortened network id. This device then gets moved to a overlay network specific namespace. When the overlay network is deleted, the device is moved back to the host namespace before its deleted

So, once you know which vxlan id fails to be created (I did an strace of the docker daemon process, which is overkill for sure, but I was in a hurry):
4993 15:01:04.640588 recvfrom(30, "\254\0\0\0\2\0\0\0\267\273\0\0\212\265\372\377\357\377\377\377\230\0\0\0\20\0\5\6\267\273\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\24\0\3\0vx-000105-1158f\0\10\0\r\0\0\0\0\0\\\0\22\0\t\0\1\0vxlan\0\0\0L\0\2\0\10\0\1\0\5\1\0\0\5\0\5\0\0\0\0\0\5\0\6\0\0\0\0\0\5\0\7\0\1\0\0\0\5\0\v\0\1\0\0\0\5\0\f\0\0\0\0\0\5\0\r\0\1\0\0\0\5\0\16\0\1\0\0\0\6\0\17\0\22\265\0\0", 4096, 0, {sa_family=AF_NETLINK, pid=0, groups=00000000}, [12]) = 172
So 000105-1158f aka 0x105 aka vxlan id 261 in my case.

Build a list of the active network namespaces and their vxlan devices on the failing host.
For example:
# for ns in /var/run/docker/netns/*; do echo ":::: $ns"; nsenter -m -t <PID of docker daemon> nsenter --net=$ns ip -d link show; done >> ip.link.show

Now that you know the affected network namespace, double-nsenter into it:
# nsenter -m -t <PID of docker daemon> bash
# nsenter --net=/var/run/docker/netns/<affected namespace> bash
# ip link delete vxlan1

After that, the error is gone. I'm pretty sure Docker Inc. knows about this workaround; why they don't share it is left to the imagination of the reader.
Hope this helps.
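For reference, a compact sketch of the same namespace search, on hosts where the netns files are visible from the host mount namespace (otherwise keep the extra nsenter -m step from the comment above):

# search every docker network namespace for leftover vxlan devices
for ns in /var/run/docker/netns/*; do
    echo "== $ns =="
    nsenter --net=$ns ip -d link show type vxlan
done
# then delete the device carrying the leaked VNI from inside the affected namespace only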

lukewendling commented Apr 25, 2018

I was getting this error on a docker swarm stack (docker v18.03); I finally removed the entire stack (docker stack rm), re-created it with docker stack deploy, and the problem was resolved.

@ctelfer

This comment has been minimized.

Copy link
Contributor

ctelfer commented Apr 27, 2018

So far I have not been able to reproduce this locally. I've tried the steps described above and have also scripted them to run repeatedly. No dice so far. Will try with larger numbers of networks next.

Having said that, while inspecting the code I definitely found several race conditions. I think that one in particular could cause this issue, but without a reproduction it's hard to prove. Will issue a PR shortly.

ctelfer (Contributor) commented May 4, 2018

I have found something that fits this issue, at least in the variant most recently described by @lukewendling. It is another version of re-oncing, and it looks like it will very readily lead to the race condition described here. First, observe that when an endpoint gets deleted, libnetwork eventually invokes driver.Leave(), which invokes network.leaveSandbox(). The leaveSandbox() code begins by locking the network (good) to prevent concurrent access, but one should then carefully observe that joining the network offers no similar lock protection. This is the big warning sign.

Now, say that the last endpoint is getting removed from a network via leaveSandbox(). Then the following code fires which is protected by a ... ahem ... "lock":

        n.once = &sync.Once{}
        for _, s := range n.subnets {
                s.once = &sync.Once{}
        }

Now both the network's and all of its subnets' onces are reset. The goroutine hasn't finished leaving the sandbox yet, and the network certainly hasn't been removed yet (that requires a call to driver.DeleteNetwork()). Furthermore, the vxlan interface has not yet been removed. That happens in the network.DestroySandbox() call that immediately follows the re-oncing, and it occurs only after some other blocking operations, like removing filters and interfaces.

So, what happens if another endpoint attempts to join the network while this is occurring? It depends on the timing. libnetwork eventually calls driver.Join(), which calls network.joinSandbox() as well as network.joinSubnetSandbox(). It's possible that the Join will call joinSandbox() before the once gets re-installed, causing it to skip initSandbox(). But then, while it is moving on to joinSubnetSandbox(), both the network and subnet onces could get reset, causing the "joining" goroutine to invoke joinSubnetSandbox() with a fresh sync.Once. Then network.joinSubnetSandbox() will invoke network.initSubnetSandbox(), which will use the same VXLAN ID to attempt to create a new interface with the same name as the one we are about to remove in the "leaving" goroutine. If the join thread wins the race, it will get an error due to the collision, abort, and leave the error in place in network.initErr. Future attempts to join the network will therefore fail. In this scenario the leaveSandbox() would eventually remove the vxlan interface, but not before the network and its sandbox are basically tainted.

Entirely removing the stack should include removing the network, which would in turn remove the now-faulty error condition along with the sandbox. That means a subsequent deploy should work. This matches @lukewendling's observation above. However, it does not match sightings where the vxlan interface is left in the sandbox.

The above scenario could also occur if the Join happened before the Leave (or approximately so) and got up to, but not into, joinSubnetSandbox(), and then leaveSandbox() occurred, re-initializing the Onces. However, if leaveSandbox() re-initializes both Onces before joinSandbox() fires, then joinSandbox() will block on the network mutex when it tries to increment its epoch count in initSandbox(). This would block the goroutine through the end of leaveSandbox(), preventing the rest of the race.

Again, I'm not sure this is the issue. It fits one (newer) set of symptoms, but not some of the older ones. But at the end of the day, this re-oncing definitely seems to be a bad idea and should be studiously avoided.

selansen commented May 5, 2018

Agreed on the re-oncing. I kind of learnt recently that it's not a good idea. :)

theabeing commented Oct 26, 2018

Same error here:
subnet sandbox join failed for "10.255.0.0/16": error creating vx....

fendo64 commented Feb 2, 2019

Hello,

We still have the issue with 18.09 (error creating vxlan interface: file exists).

In some cases (following a crash or otherwise) the vxlan interfaces are not cleaned up, and when restarting our stack we get this error:

"network sandbox join failed: subnet sandbox join failed for "10.0.5.0/24": error creating vxlan interface: file exists"

To fix the problem we have to manually delete the stale (down) vxlan interfaces present on the host (ip link delete vx-.....).

I propose this change, which I think could correct the problem:

diff --git a/vendor/github.com/docker/libnetwork/drivers/overlay/ov_network.go b/vendor/github.com/docker/libnetwork/drivers/overlay/ov_network.go
index cf32e45951..83462db3f3 100644
--- a/vendor/github.com/docker/libnetwork/drivers/overlay/ov_network.go
+++ b/vendor/github.com/docker/libnetwork/drivers/overlay/ov_network.go
@@ -340,7 +340,24 @@ func (n *network) joinSandbox(s *subnet, restore bool, incJoinCount bool) error
                        s.initErr = subnetErr
                        s.sboxInit = true
                }
+
+               if subnetErr != nil {
+                       // Delete vxlan interface if exist
+                       // TODO: Check interface is down ?
+                       vxlanName := n.generateVxlanName(s)
+
+                       if deleteErr := deleteInterface(vxlanName); deleteErr != nil {
+                               logrus.Warnf("could not delete vxlan interface, %s, error %v", vxlanName, deleteErr)
+                       } else {
+                               subnetErr = n.initSubnetSandbox(s, restore)
+                               if restore || subnetErr == nil {
+                                       s.initErr = subnetErr
+                                       s.sboxInit = true
+                               }
+                       }
+               }
        }
+
        if subnetErr != nil {
                return fmt.Errorf("subnet sandbox join failed for %q: %v", s.subnetIP.String(), subnetErr)
        }
tafelpootje commented Feb 15, 2019

Same issue here. The

sudo umount /var/run/docker/netns/*
sudo rm /var/run/docker/netns/*

fix did not work.
Removing the stack and re-adding it seems to have worked (for one stack it worked directly; for the other stack I had to redo the steps).

Docker version:
Docker version 18.09.2
Ubuntu 16.04

fendo64 commented Feb 15, 2019

Same issue here. The

sudo umount /var/run/docker/netns/*
sudo rm /var/run/docker/netns/*

fix did not work.
Removing the stack and re-adding it seems to have worked (for one stack it worked directly; for the other stack I had to redo the steps).

Next time, can you check whether you have any "vx-" interfaces on the host:
ip link show | grep vx

If so, delete them; that worked for me:
ip link delete vx-xxxx

The correction I propose above is based on reading the code; I do not have an environment to test it. If some good soul with a test environment could try my proposed correction, that would help.

ryandaniels commented Mar 18, 2019

Ran into the same issue.
Docker version: 18.06.1-ce

Fixed after applying @sanimej / @fendo64's workaround:
ip link delete vx-xxxx
