Saving endpoint to store fails in windows under concurrent load #1950
Comments
cc @mavenugo
A coworker hit another instance of this when doing 10 parallel docker run invocations over ~1000 iterations: C:\Windows\system32\docker.exe: Error response from daemon: failed to update the store state of sandbox: failed to …
We have hit the same issue on an Azure VM running "Windows Server 2016 with Containers". We were only trying to create 5 containers at a time. It looks like the 5th container creation worked but starting the container failed with: failed to create endpoint gracious_pasteur on network <our_network>: failed to save endpoint aef7b83 to store: failed to update store for object type *windows.hnsEndpoint: open C:\ProgramData\docker/network/files/local-kv.db.lock: The process cannot access the file because it is being used by another process.
This seems clearly related: the repro and the endpoint object are identical. My thought is that another endpoint type wasn't handled properly elsewhere and is causing continued issues. Original Moby bug / merged to libnetwork: https://github.com/docker/libnetwork/pull/987/files

I have a couple of speculations, but I'm new to Docker and have more limited networking experience than many, so please correct me if I'm making a blatant error or if the guess is wrong. I noticed a few other places where an endpoint (they inherit from the same class on Windows) is warned about but not deleted or otherwise cleaned up; unfortunately I can't tell whether that is intentional in some cases. The changes in that PR made a missing endpoint in the KV store produce a fatal log message, but elsewhere, on line 201, the same endpoint deletion failure is only a warning. I'm guessing that is to allow fall-through to the cleanup code, but using the KV store when there is no matching entry to remove (addition is fine, I'm guessing), or terminating the program at that point, may be dangerous. Letting endpoint failures through as warnings, and then later looking those endpoints up in the KV store, may be what leaves the file locked. Like I said, I'm new, so I'd hope the storage DB is thread safe. Windows has an effectively unlimited file handle limit, so handle exhaustion doesn't seem likely, although the behavior would look vaguely similar.
On second thought, the Errorf and warned-about endpoints are probably unsafe too: the underlying Windows endpoint object may not be deleted before the process is killed, or the KV store may be handed nil in the Warnf case... and that's usually what's happening when the warnings occur.
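For illustration, here is a hypothetical Go sketch (not libnetwork's actual code) contrasting the two patterns discussed above: treating a failed endpoint deletion from the KV store as a warning and falling through to cleanup, versus surfacing it as an error to the caller. The function and variable names are made up for the example.

```go
// Hypothetical illustration only -- not the actual libnetwork code paths.
// It contrasts treating a failed KV-store deletion as a warning (and
// continuing cleanup) versus returning it as an error.
package main

import (
	"errors"
	"fmt"

	"github.com/sirupsen/logrus"
)

// deleteFromStore stands in for the KV-store deletion; assumed to fail here.
func deleteFromStore(endpointID string) error {
	return errors.New("key not found")
}

// cleanupWithWarning logs and keeps going, so later code may still expect
// the endpoint to be present (or absent) in the store.
func cleanupWithWarning(endpointID string) {
	if err := deleteFromStore(endpointID); err != nil {
		logrus.Warnf("could not remove endpoint %s from store: %v", endpointID, err)
	}
	// ... fall through to the rest of the cleanup ...
}

// cleanupWithError stops and reports the failure to the caller instead.
func cleanupWithError(endpointID string) error {
	if err := deleteFromStore(endpointID); err != nil {
		return fmt.Errorf("failed to remove endpoint %s from store: %v", endpointID, err)
	}
	return nil
}

func main() {
	cleanupWithWarning("aef7b83")
	if err := cleanupWithError("aef7b83"); err != nil {
		logrus.Error(err)
	}
}
```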
cc @daschott ^^
I did a little investigation on this, as I was able to repro it reasonably consistently. Here are a couple of interesting goroutines from when close() fails as in-use. I will try to figure out what's going on, but it looks like a legitimate race: one thread is closing the database while another is clearly in the middle of a write.
[goroutine dump omitted]
The next reference to goroutine 200 is the one which fails:
[goroutine dump omitted]
I understand the access denied error now, if not the "process cannot access the file because...." error. It's in github.com/boltdb/bolt/bolt_windows.go. Looks like there's a race between closing the lockfile and deleting it. If another goroutine comes in between these two steps, the OpenFile in …
Note the above is for the …
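To visualize the window described above, here is a simplified Go sketch (illustrative only, not bolt's code): one goroutine closes and then deletes a lock file, while another opens the same file in the gap between those two steps. On Windows either the delete or the open can fail, depending on which side wins.

```go
// Simplified illustration of the suspected race -- not bolt's actual code.
// Goroutine A closes and then deletes a lock file; goroutine B tries to
// open (or re-create) the same file in the gap between those two steps.
package main

import (
	"fmt"
	"os"
	"sync"
)

func main() {
	const lockPath = "local-kv.db.lock" // hypothetical path for the demo

	f, err := os.OpenFile(lockPath, os.O_CREATE|os.O_RDWR, 0o600)
	if err != nil {
		panic(err)
	}

	var wg sync.WaitGroup
	wg.Add(2)

	// Goroutine A: close the lock file, then delete it.
	go func() {
		defer wg.Done()
		f.Close()
		// <-- another goroutine can slip in right here
		if err := os.Remove(lockPath); err != nil {
			fmt.Println("remove failed:", err)
		}
	}()

	// Goroutine B: open the lock file at roughly the same time.
	go func() {
		defer wg.Done()
		g, err := os.OpenFile(lockPath, os.O_CREATE|os.O_RDWR, 0o600)
		if err != nil {
			// On Windows this can surface as "Access is denied." or
			// "The process cannot access the file because it is being
			// used by another process."
			fmt.Println("open failed:", err)
			return
		}
		g.Close()
	}()

	wg.Wait()
	os.Remove(lockPath) // best-effort cleanup
}
```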
I'm really confused about the "cannot access the file" error. Questioning whether Defender is the culprit; temporarily adding an exclusion for the folder containing network/files/local-kv.db.lock to see if it still repros with that. Nope - not that. Still repros.
Think I have it: remove the lock file entirely and instead lock the database itself, using the -1..0 byte range as the lock. So far so good in validation.
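For reference, here is a minimal sketch of that idea using golang.org/x/sys/windows: take an exclusive byte-range lock on the database file with LockFileEx instead of creating a separate .lock file. This is an illustration, not the actual bbolt patch; the path and offsets are placeholders.

```go
//go:build windows

// Sketch of the "lock the database itself" idea -- not the actual bbolt
// patch. It takes an exclusive byte-range lock on the DB file with
// LockFileEx instead of creating a separate .lock file.
package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/windows"
)

// Flag values from the Win32 LockFileEx documentation.
const (
	lockfileFailImmediately = 0x00000001 // LOCKFILE_FAIL_IMMEDIATELY
	lockfileExclusiveLock   = 0x00000002 // LOCKFILE_EXCLUSIVE_LOCK
)

// lockDatabase opens the DB file and takes an exclusive lock on a byte
// range at the far end of the addressable file space, so the lock never
// overlaps real data. The offsets here are illustrative.
func lockDatabase(path string) (*os.File, error) {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_RDWR, 0o600)
	if err != nil {
		return nil, err
	}
	ol := &windows.Overlapped{Offset: 0xFFFFFFFF, OffsetHigh: 0xFFFFFFFF}
	err = windows.LockFileEx(windows.Handle(f.Fd()),
		lockfileExclusiveLock|lockfileFailImmediately,
		0,    // reserved
		1, 0, // lock a single byte
		ol)
	if err != nil {
		f.Close()
		return nil, fmt.Errorf("database %s is in use by another process: %v", path, err)
	}
	return f, nil
}

func main() {
	f, err := lockDatabase("local-kv.db") // hypothetical path for the demo
	if err != nil {
		fmt.Println(err)
		return
	}
	defer f.Close() // closing the handle releases the lock
	fmt.Println("database locked")
}
```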
Opened an issue in etcd-io/bbolt#121 for upstream. |
Fix submitted at etcd-io/bbolt#122 |
Signed-off-by: John Howard <jhoward@microsoft.com> This also adds go.etcd.io/bbolt as boltdb/bolt is no longer maintained, and we need etcd-io/bbolt#122 which was merged in https://github.com/etcd-io/bbolt/releases/tag/v1.3.1-etcd.8 in order to fix moby/libnetwork#1950. Note that I can't entirely remove boltdb/bolt as it is still used by other components. Still need to work my way through them.... These include containerd/containerd (containerd/containerd#2634), docker/swarmkit; moby/buildkit. And probably more....
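For anyone following along, the switch is essentially an import-path change. Below is a minimal sketch of opening a store via go.etcd.io/bbolt; the path, bucket name, and options are illustrative, not libnetwork's actual configuration.

```go
// Minimal example of using the maintained bbolt fork; the path, bucket
// name, and options are illustrative, not libnetwork's configuration.
package main

import (
	"log"
	"time"

	bolt "go.etcd.io/bbolt" // previously: "github.com/boltdb/bolt"
)

func main() {
	// Timeout bounds how long Open waits for the file lock held by
	// another process before giving up.
	db, err := bolt.Open("local-kv.db", 0o600, &bolt.Options{Timeout: time.Second})
	if err != nil {
		log.Fatalf("failed to open store: %v", err)
	}
	defer db.Close()

	// Ensure a bucket exists, as a smoke test that the store is usable.
	err = db.Update(func(tx *bolt.Tx) error {
		_, err := tx.CreateBucketIfNotExists([]byte("endpoints"))
		return err
	})
	if err != nil {
		log.Fatalf("failed to update store: %v", err)
	}
}
```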
We are trying to create a lot of containers in parallel on Windows. As part of the workflow, multiple endpoints are created and saved simultaneously. When this happens, some container creations fail because Docker is not able to save the endpoint to the store:
time="2017-09-12T19:53:30.792343000-07:00" level=debug msg="FIXME: Got an API for which error does not match any expected type!!!: failed to create endpoint eager_newton on network nat: failed to save endpoint 53944a9 to store: failed to update store for object type *windows.hnsEndpoint: open C:\docker/network/files/local-kv.db.lock: The process cannot access the file cause it is being used by another process." error_type=types.internal module=api
I am waiting for a repro machine to track this down. It probably only happens on slow machines, but creating an issue for tracking. This is also causing some noise in Docker CI as well.
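A rough Go sketch of the kind of driver that triggers this (the image, command, and concurrency level are placeholders, not the exact repro used here): it launches several docker run invocations in parallel and reports any failures.

```go
// Rough repro driver: run several containers in parallel and report any
// failures. The image name, command, and concurrency are placeholders.
package main

import (
	"fmt"
	"os/exec"
	"sync"
)

func main() {
	const parallel = 10 // e.g. 10 parallel docker run invocations

	var wg sync.WaitGroup
	for i := 0; i < parallel; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			// --rm so repeated iterations do not accumulate containers.
			cmd := exec.Command("docker", "run", "--rm",
				"mcr.microsoft.com/windows/nanoserver:1809", // placeholder image
				"cmd", "/c", "echo", "hello")
			if out, err := cmd.CombinedOutput(); err != nil {
				fmt.Printf("run %d failed: %v\n%s\n", n, err, out)
			}
		}(i)
	}
	wg.Wait()
}
```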