Error "Network sandbox join failed: could not get network sandbox (oper true): failed get network namespace "": no such file or directory" #25215
Any updates here? Seeing something along the same lines, i.e.:
I see there's a work-in-progress pull request linked: moby/libnetwork#1369
Seems like if you get an error like this on a node, the workaround that I found for it is to recreate the overlay network... which you can't really do in a prod env... it's an upgrade showstopper for us.
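For anyone landing here, a rough sketch of that recreate-the-network workaround (the network name is a placeholder, and this detaches every container using the network, which is exactly why it is unusable in production):

```bash
# DISRUPTIVE: removing the network detaches every container attached to it.
# "my-overlay" is a placeholder name, not taken from this thread.
docker network rm my-overlay
docker network create -d overlay my-overlay
# then restart/reconnect the affected containers
```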
Have to agree with @sebi-hgdata that a robust, non-service-affecting workaround is needed here. We would like to migrate to 1.12 for the live-restore feature, but cannot do so as a result of this bug.
We are getting the same issue in production. None of the containers on one of the hosts can start.
This doesn't happen often, but it never happened to us in 1.11.x. Any production-worthy workarounds?
@randunel Can you give some details about your setup and when this problem started? Did it happen after upgrading to 1.12? Was there any change on one of the nodes on which you see this error?
We have a swarm setup with 1 manager host. We "upgraded" from 1.11 to 1.12 by creating a new infrastructure (new machines), so this can't be an "upgrade" problem. We keep having other network problems (at random times, with one random container on a host). This particular issue is new to us.
All 7 containers on the affected host fail to start.
Details about the 2-day-old infrastructure:
Other details (might be relevant or not): we provision the machines with
@randunel Off topic... when you have the networking issues, can you try setting up serf and doing a reachability test? I've seen that it resolves the issue (probably by forcing a serf state sync), and you can also check your cluster membership:
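A minimal sketch of that check, assuming a standalone Serf agent is already running and joined to the cluster on each node:

```bash
# List cluster membership as seen by the local serf agent
serf members

# Cross-node reachability test; failures hint at gossip/membership problems
serf reachability
```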
I have this error after upgrading to 1.12 too. I use Compose with several overlay networks and Docker Swarm. This error appears on only one network; several other overlay networks work well on the same setup.
I have this problem with version 1.12.1 too.
I have the same issue.
@groyee, do you use Swarm with Consul, or just the new Docker 1.12 with internal discovery?
Swarm with Consul
Same issue. PR moby/libnetwork#1369 works for me. Any chance to get this merged soon?
@gfyrag Did you update the unit file with docker 1.12 and then reboot the machine? Were there containers still running when you rebooted? What was the restart policy of the containers?
@sanimej No need to update the unit file (the installed version was v1.12.1, so I kept the same unit file). I just recompiled the dockerd binary, replaced the original binary, and restarted Docker. After that, the overlay network started working again. Then I also hard-rebooted the machine (the case where it previously didn't work), and it still works. The general scheme of my unit files is:
There is no restart policy; it is a clean container started each time.
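For context, the swap itself is just replacing the daemon binary. Roughly, assuming a typical systemd install (paths are assumptions, not from the original post):

```bash
# Stop the daemon, swap in the rebuilt binary, start it again.
# /usr/bin/dockerd and ./dockerd are assumed paths, not from this thread.
sudo systemctl stop docker
sudo cp ./dockerd /usr/bin/dockerd
sudo systemctl start docker
```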
This is a critical issue, because many containers cannot be started. Please fix this ASAP.
Is this fixed in 1.12.1 then?
No, this bug is still present in 1.12.1.
Compile your own version (the repository comes with the necessary tools) using the PR moby/libnetwork#1369.
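A rough sketch of that build, assuming the containerized build tooling in the docker/docker repository of that era (`make binary`); vendoring the PR's changes is left as the manual step:

```bash
# Build a patched dockerd; requires Docker installed (the build runs in a container)
git clone https://github.com/docker/docker.git
cd docker
git checkout v1.12.1
# manually vendor the moby/libnetwork#1369 changes under vendor/, then:
make binary
# the resulting binaries land under bundles/
```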
I am able to recreate it consistently with the following steps...
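The exact steps weren't preserved here. Judging from the restart-policy and reboot questions elsewhere in this thread, a purely hypothetical reproduction might look like:

```bash
# Hypothetical sketch only -- NOT the poster's actual steps.
docker network create -d overlay testnet            # network name is made up
docker run -d --restart=always --net=testnet nginx  # container with a restart policy
# hard-reboot the host; on startup the auto-restarting container can fail
# with "failed get network namespace" because its old sandbox is gone
```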
Given this, docker/libnetwork#1369 is the right fix for this issue. It will be available in the next patch release.
How long do we have to wait for the next patch release? Containers cannot be started. This is a critical issue.
The libnetwork fix was brought into docker master via vendoring in #25962
@sebi-hgdata If a 1.12.2 is going to be released, we'll make sure docker/libnetwork#1369 changes are part of it.
Closing...
@mavenugo has a fix been released? Is it production ready, even by docker standards, at least present in the binaries? |
@randunel the fix (moby/libnetwork#1369) is merged into docker master. It will be cherry-picked into the 1.12.2 branch when the bump branch is pulled. So yes, it will be available in the next docker release.
The question was "has a fix been released", and the answer "yes, it will be available in the next docker release" means "no, it hasn't been released". You shouldn't close this issue until the fix is released (and tested). But anyway, let's continue hiding issues; it's no different from having docs for an unreleased version on the website for 3 months.
The common workflow on GitHub is that an issue gets closed when the PR containing the fix for it has been merged. This step is even automated when the PR description contains the word "fixes" along with the issue number. The user then looks at the fix PR and derives which release contains the fix. In this case, the PR which brought the fix into docker/docker did not have a reference to this issue, which is why I suggested closing it manually. Regarding the testing, it is not always possible to recreate the exact scenario the user was in.
I understand you may feel this way because certain error messages recur in different issues opened at different times and across different docker versions, sometimes months apart from each other. Most of the time, at least from what I have witnessed so far, they come from very different scenarios and different exercised code paths which happen to lead to similar (not always the same) error messages. This is so common that when developers see two issues with the same error message, they reject a priori the idea (I'd call it a temptation) that the two issues have the same root cause, focusing instead on what was happening when the issue was hit. Cheers
Releasing a fix for this issue is also getting really urgent for us. We hit this on a newly set up staging swarm with overlay networking. When the above error occurs, the whole swarm gets into an uncontrollable state; the only workaround so far is to restart all engines in the swarm. Absolutely unusable for production.
Can anyone provide a workaround until this fix is released? This happens for us all the time in production. Absolutely frustrating. Currently the only option I know of is to physically remove the VM from the cloud and create a new one.
The change was cherry-picked into the 1.12.x branch (https://github.com/docker/docker/blob/1.12.x/vendor/src/github.com/docker/libnetwork/drivers/overlay/overlay.go#L97-L106) through #26879. We will be releasing docker 1.12.2-rc1 today or tomorrow for testing.
Thank you! Can't wait for this release :-)
1.12.2-rc1 was released for testing; https://github.com/docker/docker/releases/tag/v1.12.2-rc1
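For those wanting to try it: Docker release candidates at the time were distributed through the test channel. A hedged sketch of installing it (the channel URL is an assumption from that era; review any script before piping it to sh):

```bash
# Test-channel install script used for Docker RCs at the time (assumption).
curl -fsSL https://test.docker.com/ | sh
docker version   # should report 1.12.2-rc1
```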
I gave upgrading a try... I now ran into issues where overlay networks were created multiple times, with the same name but different IDs.
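Since the duplicates share a name but not an ID, one way to spot and clean them up (the network name is a placeholder):

```bash
# List networks matching the name; duplicates show up as separate IDs
docker network ls --filter name=my-overlay
# Because the name is now ambiguous, inspect/remove by ID instead
docker network inspect <network-id>
docker network rm <network-id>
```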
@schmunk are you using compose with swarm mode (and to create the overlay networks)?
@thaJeztah No, we're using
@schmunk42 I have to recreate the duplicate overlay network creation issue to identify the root cause. You had a 1.12.1 daemon running with some overlay networks, and restarting the daemon with 1.12.2-RC1 resulted in all overlay networks being created again with different IDs?
Also, after the upgrade were you able to confirm if the
@sanimej Yes, but not all overlay networks were created multiple times, only one that a co-worker was redeploying. I haven't seen the
@schmunk42 Is there any difference between the overlay network that got recreated vs the ones that didn't? Maybe there were containers with a restart policy on one but not the other?
@groyee Can you try 1.12.2-RC1 and confirm if this issue has been fixed in your setup?
I am gonna give it a try now
OK, so I am editing my previous post. Here are the results: 1.12.2-RC1 definitely fixed the issue; I don't see this error anymore. There is only one big issue: it takes almost 2 minutes for every container to start. I looked at the logs and most of the time is spent trying to connect to the overlay network. This command takes the same amount of time to return: `sudo docker network ls`. What is going on there? Why is it taking so long? It used to be a matter of seconds, even less. If it makes any difference, I have about 150 containers connected to this overlay network.
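A quick, hedged way to quantify that delay and look for the stall in the daemon logs (assumes a systemd host with journald; not from the original post):

```bash
# Measure how long the network listing actually takes
time sudo docker network ls

# Look for where the daemon is spending its time during container starts
sudo journalctl -u docker --since "10 minutes ago" | tail -n 100
```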
Output of `docker version`:

Output of `docker info`:

Additional environment details (AWS, VirtualBox, physical, etc.):
8-machine Consul cluster in AWS running Docker 1.11.2.

Steps to reproduce the issue:

Describe the results you received:
Got the following error for all containers from just a single host:
`network sandbox join failed: could not get network sandbox (oper true): failed get network namespace "": no such file or directory`

Describe the results you expected:
No errors during container restarts.
Additional information you deem important (e.g. issue happens only occasionally):
It happened only on a single node.