
Dead plugin+systemd = dead docker #813

Closed

bboreham opened this issue Dec 10, 2015 · 32 comments
@bboreham
Contributor

I am developing a Docker network plugin, so from time to time it crashes and/or I stop running it. If I restart the host while the plugin is down, Docker will often get completely hosed.

It seems this is a result of Docker waiting 15 seconds per endpoint while trying to talk to my plugin, combined with systemd only waiting a limited time for Docker to start. It also appears to time out three times for the same endpoint (at least in the example log below).

Docker should provide some way for installations to recover from this situation - maybe a --force-network-cleanup flag? Or do the retrying in the background after getting the main daemon up and running.

Dec 10 10:16:35 systemd[1]: Starting Docker Application Container Engine...
Dec 10 10:16:35 docker[1274]: time="2015-12-10T10:16:35.392673377Z" level=info msg="API listen on /var/run/docker.sock"
Dec 10 10:16:35 docker[1274]: time="2015-12-10T10:16:35.423531946Z" level=info msg="Firewalld running: false"
Dec 10 10:16:35 docker[1274]: time="2015-12-10T10:16:35.463752856Z" level=error msg="getNetworkFromStore for nid d5a93b9cbfa323a53eae7808358d67ebd6470b801cf5de33b5991a7bcd5c79bf failed while trying to build sandbox for cleanup: network d5a93b9cbfa323a53eae7808358d67ebd6470b801cf5de33b5991a7bcd5c79bf not found"
Dec 10 10:16:35 docker[1274]: time="2015-12-10T10:16:35.464265535Z" level=warning msg="Failed getting network for ep 72270e49642470d5676acd7f19ead85dd7e460030f71c7dc66436f1836107fa5 during sandbox cbbec073b80c886dc1f6dddbfd32f91ad5999638afdf4cf9a61bbdf29a1ba19a delete: network d5a93b9cbfa323a53eae7808358d67ebd6470b801cf5de33b5991a7bcd5c79bf not found"
Dec 10 10:16:35 docker[1274]: time="2015-12-10T10:16:35.464430396Z" level=error msg="failed to delete sandbox cbbec073b80c886dc1f6dddbfd32f91ad5999638afdf4cf9a61bbdf29a1ba19a while trying to cleanup: could not cleanup all the endpoints in container 0317189ada28cb6a1f5c16c670c8ff744e7f6510db73035bcf991f02cf9ca531 / sandbox cbbec073b80c886dc1f6dddbfd32f91ad5999638afdf4cf9a61bbdf29a1ba19a"
Dec 10 10:16:35 docker[1274]: time="2015-12-10T10:16:35.484161832Z" level=warning msg="Unable to locate plugin: weave, retrying in 1s"
Dec 10 10:16:36 docker[1274]: time="2015-12-10T10:16:36.485561219Z" level=warning msg="Unable to locate plugin: weave, retrying in 2s"
Dec 10 10:16:38 docker[1274]: time="2015-12-10T10:16:38.486948113Z" level=warning msg="Unable to locate plugin: weave, retrying in 4s"
Dec 10 10:16:42 docker[1274]: time="2015-12-10T10:16:42.487467103Z" level=warning msg="Unable to locate plugin: weave, retrying in 8s"
Dec 10 10:16:50 docker[1274]: time="2015-12-10T10:16:50.487936016Z" level=warning msg="Unable to locate plugin: weave, retrying in 1s"
Dec 10 10:16:51 docker[1274]: time="2015-12-10T10:16:51.488902480Z" level=warning msg="Unable to locate plugin: weave, retrying in 2s"
Dec 10 10:16:53 docker[1274]: time="2015-12-10T10:16:53.490038312Z" level=warning msg="Unable to locate plugin: weave, retrying in 4s"
Dec 10 10:16:57 docker[1274]: time="2015-12-10T10:16:57.491067983Z" level=warning msg="Unable to locate plugin: weave, retrying in 8s"
Dec 10 10:17:05 docker[1274]: time="2015-12-10T10:17:05.491327232Z" level=error msg="getEndpointFromStore for eid 35b6ce7a721a46901c6f15828899801ce144c97da50758de83da3b7e756a7e5c failed while trying to build sandbox for cleanup: could not find endpoint 35b6ce7a721a46901c6f15828899801ce144c97da50758de83da3b7e756a7e5c: datastore not found for scope "
Dec 10 10:17:05 docker[1274]: time="2015-12-10T10:17:05.491416646Z" level=warning msg="Unable to locate plugin: weave, retrying in 1s"
Dec 10 10:17:06 docker[1274]: time="2015-12-10T10:17:06.491579074Z" level=warning msg="Unable to locate plugin: weave, retrying in 2s"
Dec 10 10:17:08 docker[1274]: time="2015-12-10T10:17:08.491737246Z" level=warning msg="Unable to locate plugin: weave, retrying in 4s"
Dec 10 10:17:12 docker[1274]: time="2015-12-10T10:17:12.491906061Z" level=warning msg="Unable to locate plugin: weave, retrying in 8s"
Dec 10 10:17:20 docker[1274]: time="2015-12-10T10:17:20.492103311Z" level=warning msg="Unable to locate plugin: weave, retrying in 1s"
Dec 10 10:17:21 docker[1274]: time="2015-12-10T10:17:21.493062542Z" level=warning msg="Unable to locate plugin: weave, retrying in 2s"
Dec 10 10:17:23 docker[1274]: time="2015-12-10T10:17:23.493250821Z" level=warning msg="Unable to locate plugin: weave, retrying in 4s"
Dec 10 10:17:27 docker[1274]: time="2015-12-10T10:17:27.493425280Z" level=warning msg="Unable to locate plugin: weave, retrying in 8s"
Dec 10 10:17:35 docker[1274]: time="2015-12-10T10:17:35.493618436Z" level=warning msg="Failed detaching sandbox eacfbde4430e0e618097962072c92a1fba5b46f0613134b74bedf29114fbfcb7 from endpoint 35b6ce7a721a46901c6f15828899801ce144c97da50758de83da3b7e756a7e5c: failed to get endpoint from store during leave: could not find endpoint 35b6ce7a721a46901c6f15828899801ce144c97da50758de83da3b7e756a7e5c: datastore not found for scope \n"
Dec 10 10:17:35 docker[1274]: time="2015-12-10T10:17:35.493672618Z" level=warning msg="Unable to locate plugin: weave, retrying in 1s"
Dec 10 10:17:36 docker[1274]: time="2015-12-10T10:17:36.493849667Z" level=warning msg="Unable to locate plugin: weave, retrying in 2s"
Dec 10 10:17:38 docker[1274]: time="2015-12-10T10:17:38.494796847Z" level=warning msg="Unable to locate plugin: weave, retrying in 4s"
Dec 10 10:17:42 docker[1274]: time="2015-12-10T10:17:42.494940516Z" level=warning msg="Unable to locate plugin: weave, retrying in 8s"
Dec 10 10:17:50 docker[1274]: time="2015-12-10T10:17:50.495108328Z" level=warning msg="Unable to locate plugin: weave, retrying in 1s"
Dec 10 10:17:51 docker[1274]: time="2015-12-10T10:17:51.495295240Z" level=warning msg="Unable to locate plugin: weave, retrying in 2s"
Dec 10 10:17:53 docker[1274]: time="2015-12-10T10:17:53.496253373Z" level=warning msg="Unable to locate plugin: weave, retrying in 4s"
Dec 10 10:17:57 docker[1274]: time="2015-12-10T10:17:57.496416010Z" level=warning msg="Unable to locate plugin: weave, retrying in 8s"
Dec 10 10:18:05 docker[1274]: time="2015-12-10T10:18:05.497531014Z" level=warning msg="Failed deleting endpoint 35b6ce7a721a46901c6f15828899801ce144c97da50758de83da3b7e756a7e5c: failed to get endpoint from store during Delete: could not find endpoint 35b6ce7a721a46901c6f15828899801ce144c97da50758de83da3b7e756a7e5c: datastore not found for scope \n"
Dec 10 10:18:05 systemd[1]: docker.service start operation timed out. Terminating.
Dec 10 10:18:05 systemd[1]: docker.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Dec 10 10:18:05 systemd[1]: Failed to start Docker Application Container Engine.
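
A minimal, illustrative Go sketch of the backoff visible in the log above (1s, 2s, 4s, 8s, roughly 15 seconds per lookup, repeated three times for the same endpoint). The function and variable names are placeholders, not libnetwork's actual code:

```go
package main

import (
	"fmt"
	"time"
)

// locatePlugin stands in for the real plugin lookup; it always fails here,
// as a lookup does when the plugin's socket or spec file is absent.
func locatePlugin(name string) error {
	return fmt.Errorf("unable to locate plugin: %s", name)
}

// getWithRetry mimics the backoff in the log: 1s, 2s, 4s, 8s,
// i.e. roughly 15 seconds per lookup before giving up.
func getWithRetry(name string) error {
	for delay := 1 * time.Second; delay <= 8*time.Second; delay *= 2 {
		if err := locatePlugin(name); err == nil {
			return nil
		}
		fmt.Printf("Unable to locate plugin: %s, retrying in %v\n", name, delay)
		time.Sleep(delay)
	}
	return fmt.Errorf("plugin %s not found after retries", name)
}

func main() {
	// Three lookups for the same endpoint, as in the log above, cost about
	// 45 seconds in total, which is enough to exceed systemd's start timeout.
	for i := 0; i < 3; i++ {
		_ = getWithRetry("weave")
	}
}
```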
@bboreham
Contributor Author

Similar report at moby/moby#17960 (comment), but the user there is not using a plugin.

@dave-tucker
Contributor

I've seen this problem too while developing a new network plugin. I was on Debian 8 with systemd.
It happened when I stopped running my plugin and restarted the daemon.
As I recall, I had some networks created using my driver and one or two containers on each network - none of which were started.

The only way to recover from this was to restart my plugin, then restart docker 😭

> retrying in the background after getting the main daemon up and running.

But what should happen to those containers if the plugin is still unreachable?

@bboreham
Contributor Author

> The only way to recover from this was to restart my plugin

This implies one must never attempt to implement a plugin in a container. If that is a recommendation, it should be in the docs.

> But what should happen to those containers if the plugin is still unreachable?

Leave them unstarted.

@dave-tucker
Contributor

> This implies one must never attempt to implement a plugin in a container. If that is a recommendation, it should be in the docs.

Not sure it's a limitation - I haven't tried it with containers yet.
I do wonder, once the plugin container was started, whether the daemon might recover.

> Leave them unstarted

Works for me

/cc @mrjana @mavenugo @aboch
Could one of you take a look at this?
If you can spell out what you think an appropriate fix might look like, I can see about putting a PR together.

@bboreham
Contributor Author

> once the plugin container was started

If the user has rm'd it, perhaps misguidedly, they now have no opportunity to restart it - the docker daemon is not listening to commands and won't do so until the plugin is available. Perhaps one should be able to tell Docker that a particular container implements a driver, so docker rm would fail if there are endpoints.

Also consider the case that the driver is crashing before handling any commands.

@mavenugo
Contributor

@bboreham @dave-tucker there are many such cases, and I have requested a discussion with @calavera to get his input on the Docker plugin infra. To me, this issue is a manifestation of the way we treat plugin containers. Instead of trying to band-aid this scenario, let's try to address it at the infra level.

@dave-tucker
Contributor

Thanks @mavenugo

@bboreham
Contributor Author

@mavenugo there was no intention of band-aiding on my side.

> this issue is a manifestation of the way we treat plugin containers

Please note the problem I reported at the top will happen just the same if your plugin is not in a container.

@mavenugo
Contributor

> Please note the problem I reported at the top will happen just the same if your plugin is not in a container.

@bboreham fair enough. But one must note that a plugin is actually an extension of the Docker engine.
IMO, if the plugin is down, that essentially means the engine is down. If the plugin is a container, the engine must do its best to bring it up before processing the request. I think we should explore the limits of the engine's responsibility to play nice with the user's intent. If the plugin is not a container but is still down, the engine cannot do much, and staying down seems an appropriate behavior. But again, this is my opinion and @calavera may have more to say on this topic.

@bboreham
Contributor Author

OK, I see where you're coming from. To me, having no way to reconfigure this aspect of the engine when the engine is down is the fundamental problem.

I.e. you can get into this state via some docker [...] create commands, but once you're in this state you cannot run a docker rm command to get out of it; you can only blow away your entire Docker installation.

@mavenugo
Contributor

Yes. Any such deadlock scenario must be avoided and is certainly a topic of discussion.

@bboreham
Contributor Author

Inspection of the code suggests the failure is happening on this code path:

NewDaemon()
  daemon.initNetworkController()
    libnetwork.New()
      controller.sandboxCleanup()
        "getEndpointFromStore for eid %s failed while trying to build sandbox for cleanup"

which all happens strictly before the call to daemon.restore(), which would start the container that implements the plugin driver.

So it is impossible for Docker to survive a restart when a plugin driver is implemented inside a container.
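
For illustration only, a rough Go sketch of the reordering suggested earlier in this thread (bring the daemon up first, retry the stale-sandbox cleanup in the background). The types and function names are placeholders, not Docker's actual API:

```go
package main

import "log"

// daemon is a placeholder for the real Docker daemon type.
type daemon struct{}

func newDaemonCore() (*daemon, error) { return &daemon{}, nil }

// cleanupStaleSandboxes stands in for libnetwork's sandboxCleanup.
func (d *daemon) cleanupStaleSandboxes() error { return nil }

// startDaemon sketches the suggested ordering: restore containers and start
// serving the API first, then retry the stale-sandbox cleanup in the
// background, so that an unreachable plugin cannot block startup past
// systemd's start timeout.
func startDaemon() error {
	d, err := newDaemonCore()
	if err != nil {
		return err
	}
	go func() {
		// In a real daemon the process keeps running, so this background
		// cleanup has time to finish; failures are logged, not fatal.
		if err := d.cleanupStaleSandboxes(); err != nil {
			log.Printf("deferred network cleanup failed: %v", err)
		}
	}()
	return nil
}

func main() {
	if err := startDaemon(); err != nil {
		log.Fatal(err)
	}
}
```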

@mavenugo
Contributor

@bboreham thanks a lot for the investigation and it seems valid. I will look into this more closely and work with @calavera on the plugin side to see what we can do to mitigate this.

@bboreham
Contributor Author

To refine the problem: sandboxCleanup seems to apply only to local-scope drivers. This probably explains why I didn't always hit the issue - it is very repeatable when you use a remote local-scope driver.

@bboreham
Contributor Author

And getEndpointFromStore is hanging while trying to find the driver to ask it what its scope is. Can we refactor that part, given that we already know the scope?

@mavenugo
Contributor

@bboreham the only reason we do sandboxCleanup is to clean up any stale endpoints left behind by an ungraceful daemon restart/shutdown. If the shutdown is graceful, this will never happen.

Having said that, an ungraceful restart can happen, and we should do our best to clean up the state.
The problem here is a chicken-and-egg one: the endpoint holds container resources, while the plugins are loaded as containers.

I think the right solution is not to load the plugin when cleaning up the endpoint states during sandboxCleanup.
The plugin API Get performs the loading internally, and hence it results in the timeout issue that you raised.

Fortunately, I think I have a workaround for this. Let me try a simpler solution to see if we can get this resolved.
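
A hedged sketch of what "not loading the plugin during sandboxCleanup" could look like: stale endpoint state is deleted from the local store without dialing the remote driver. The types and the deleteFromStore callback are placeholders, not libnetwork's actual interfaces:

```go
package main

import "fmt"

// endpoint and sandbox are simplified placeholders for libnetwork's types.
type endpoint struct{ id, driverName string }
type sandbox struct{ endpoints []*endpoint }

// forceCleanup removes stale endpoint state from the local store without
// calling driver.Leave or driver.DeleteEndpoint, since during a post-crash
// cleanup the remote driver (which may itself run in a container) may not
// be reachable yet.
func forceCleanup(sb *sandbox, deleteFromStore func(id string) error) error {
	for _, ep := range sb.endpoints {
		if err := deleteFromStore(ep.id); err != nil {
			return fmt.Errorf("failed to remove endpoint %s from store: %v", ep.id, err)
		}
	}
	return nil
}

func main() {
	sb := &sandbox{endpoints: []*endpoint{{id: "ep1", driverName: "weave"}}}
	_ = forceCleanup(sb, func(id string) error {
		fmt.Println("removed stale endpoint", id)
		return nil
	})
}
```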

@bboreham
Contributor Author

I do shutdown -r now to recreate this; I recall it looks as if Docker shuts down normally, but I am not in front of that computer right now.

@mavenugo
Contributor

@bboreham an ungraceful restart of the Docker daemon can happen in multiple ways. Essentially, if the network sandbox is not cleaned up (which is possible only in an ungraceful daemon shutdown), this can happen. PTAL at sandboxCleanup; the logic is clear there.

@cpuguy83
Member

But why is Docker trying to initialize a plugin if it hasn't been specifically requested yet? Put another way, why is Docker requesting a plugin before any container that needs it has even been started?

@mavenugo
Contributor

@cpuguy83 this happens because libnetwork tries to clean up any stale endpoint resources left behind by an ungraceful restart of the daemon. If we don't perform the cleanup, these resources (such as IP addresses) will stay held and cannot be reused after the restart. For this cleanup we reuse the same code path, and hence it tries to access the driver. But this is a force-cleanup case, and we don't have to invoke the plugin for it. To retain the same code path, it would be good to get #19390; if we don't have that, then it has to be managed entirely in libnetwork.

@bboreham
Contributor Author

It's just an incidental use of the function, while iterating through all local-scope endpoints.

I'll try to make a PR to take it out.

@mavenugo
Contributor

@bboreham it is not an incidental use, and it should not be removed. We could definitely bypass the call to the driver in this force-cleanup scenario, but the real problem is neither of these.

The problem is that plugins.Get is lazy and greedy, and it automatically performs retries.

Maybe we need a simple API in the plugins pkg to return just the available plugins (without trying to load them).
That would solve this problem as well.
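
A sketch of the kind of non-loading lookup being proposed here (along the lines of the GetActivePlugins idea mentioned below): it returns only plugins that have already been activated and never triggers the lazy load-and-retry path. This is illustrative only, not the actual pkg/plugins code:

```go
package plugins

import "sync"

// Plugin is a simplified stand-in for the real plugins.Plugin type.
type Plugin struct {
	Name      string
	activated bool // whether the activation handshake has already succeeded
}

var (
	mu      sync.Mutex
	storage = map[string]*Plugin{}
)

// GetActive returns only plugins that are already loaded and activated,
// without the lazy lookup-and-retry that plugins.Get performs. A caller
// doing post-crash cleanup can use this to avoid blocking on an absent plugin.
func GetActive() []*Plugin {
	mu.Lock()
	defer mu.Unlock()
	var out []*Plugin
	for _, p := range storage {
		if p.activated {
			out = append(out, p)
		}
	}
	return out
}
```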

@cpuguy83
Member

@mavenugo
Contributor

@cpuguy83 yes. GetAll makes use of pl, err := loadWithRetry(name, false) for all the plugins.
I am trying to avoid that and tried to expose the API directly in moby/moby#19390.

I think I should also add a GetActivePlugins API to return just what is active, instead of trying to load the plugin.

@mavenugo
Contributor

@cpuguy83 actually, I could use the Scan() API. Though it sounds like a private API, it is actually public. Let me try using that instead. \o/
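
If Scan() behaves as described here (discovering plugin names from the spec/socket directories without activating anything), its use could look roughly like this; the signature Scan() ([]string, error) is an assumption, not verified against the pkg/plugins version in question:

```go
package main

import (
	"fmt"
	"log"

	"github.com/docker/docker/pkg/plugins"
)

func main() {
	// Scan only discovers plugin names from the spec/socket directories;
	// it does not dial or activate anything, so it cannot block on a plugin
	// that is currently down.
	names, err := plugins.Scan()
	if err != nil {
		log.Fatal(err)
	}
	for _, name := range names {
		fmt.Println("discovered plugin:", name)
	}
}
```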

@bboreham
Contributor Author

OK, I tried my idea of just fixing the .scope() call, but it failed straight afterwards when controller.sandboxCleanup() called Delete(), which called endpoint.Leave(). All of this seems misplaced in a routine where no remote drivers are available and we are just cleaning up stale info.

I tried your #880; it crashed. But I can no longer see your comment asking me to try it, so maybe you don't expect it to work yet.

While doing all this, I noticed there is a similar problem during shutdown: if Docker shuts down the container running the plugin, then shuts down another container attached to that network, the call to Leave() will fail. Maybe we do need that "system container" ordering after all, but at shutdown.

@mavenugo
Contributor

@bboreham I removed the comment after I figured out that the fix didn't work in certain scenarios.
But I have resolved it now. Can you please try the latest #880 and let me know how it goes?

It works fine in the scenarios that I tested.

@mavenugo
Contributor

@bboreham I tried a valid remote plugin manually and hit the panic that you encountered. I have resolved that as well. I will do more validation and push the final fix.

@bboreham
Contributor Author

Thanks; I will test that later today. I raised the shutdown problem as a separate issue - #882

mavenugo added a commit to mavenugo/docker that referenced this issue Jan 19, 2016
- Fixes moby#19404
- Fixes ungraceful daemon restart issue in systemd with remote
  network plugin (moby/libnetwork#813)

Signed-off-by: Madhu Venugopal <madhu@docker.com>
tiborvass pushed a commit to tiborvass/docker that referenced this issue Jan 26, 2016
- Fixes moby#19404
- Fixes ungraceful daemon restart issue in systemd with remote
  network plugin (moby/libnetwork#813)

Signed-off-by: Madhu Venugopal <madhu@docker.com>
@thaJeztah
Member

@mavenugo @bboreham can this be closed now that moby/moby#19465 was merged?

@mavenugo
Contributor

Yes.

@bboreham
Contributor Author

Yes, thanks.
